I'm trying to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. My script is below, followed by the output it produces. My issues primarily stem from the first part: category creation based on directory names. Some other questions on here have used filenames (e.g. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.
import string
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
['pos', 'neg']
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
all_words = nltk.FreqDist(
    w.lower()
    for w in movie_reviews.words()
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)
word_features = all_words.keys()[:100]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
This returns:
  File "test.py", line 38, in <module>
    for w in movie_reviews.words()
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
    self, self._resolve(fileids, categories))
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
    in self.abspaths(fileids, True, True)])
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
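One thing that may be worth checking is whether the two patterns match file ids of the form directory/file.txt at all. The sketch below uses only the re module and assumes (this is an assumption about the reader's internals, not something confirmed here) that both the file pattern and cat_pattern are applied to each file id with re.match. The patterns are modeled on the ones passed to CategorizedPlaintextCorpusReader above, written with \w+; the alternative patterns, the neg/7.txt file id, and the category_of helper are hypothetical and only there for illustration.

```python
import re

# File ids in the layout described above: <category directory>/<file name>.
# 'pos/11.txt' appears in the script; 'neg/7.txt' is a made-up companion.
file_ids = ['pos/11.txt', 'neg/7.txt']

# Patterns shaped like the ones passed to the reader in the script.
asked_file_pattern = r'(\w+)/*.txt'
asked_cat_pattern = r'/(\w+)/.txt'

# Hypothetical alternatives: match "<dir>/<anything>.txt" and capture
# the directory name as the category.
alt_file_pattern = r'\w+/.*\.txt'
alt_cat_pattern = r'(\w+)/.*'


def category_of(pattern, file_id):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.match(pattern, file_id)
    return match.group(1) if match else None


# Which file ids each file pattern would select.
asked_files = [f for f in file_ids if re.match(asked_file_pattern, f)]
alt_files = [f for f in file_ids if re.match(alt_file_pattern, f)]

# Which category each cat_pattern would extract.
asked_categories = [category_of(asked_cat_pattern, f) for f in file_ids]
alt_categories = [category_of(alt_cat_pattern, f) for f in file_ids]

print(asked_files)
print(alt_files)
print(asked_categories)
print(alt_categories)
```

If the first pattern selects no files under these assumptions, the reader would have nothing to concatenate, which would be consistent with the ValueError above, though that is a guess rather than a confirmed diagnosis.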
---------UPDATE-------------
Thanks alvas for your detailed answer! I have two questions, however.
- Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the filename-based method, only grabbing the pos from the folder name rather than the file name.
- I ran your code and am getting a syntax error on
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
with the caret under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to troubleshoot it.
----UPDATE 2----
Error is
  File "review.py", line 17
    for i in word_features}, tag)
SyntaxError: invalid syntax
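The python2.6 path in the first traceback suggests one possible explanation: dict comprehensions ({k: v for ...}) were only added in Python 2.7, so a 2.6 interpreter reports a SyntaxError at the first for. Below is a minimal sketch of an equivalent that avoids that syntax by passing a generator of (key, value) pairs to dict(). It assumes the same names word_features, documents, and numtrain as in the quoted code; the toy values and the to_features helper are made up for illustration.

```python
# Toy stand-ins for the real data (hypothetical values).
word_features = ['good', 'bad', 'plot']
documents = [
    (['a', 'good', 'plot'], 'pos'),
    (['bad', 'acting'], 'neg'),
]
numtrain = 1


def to_features(tokens):
    """Map each feature word to True/False depending on whether it occurs in tokens."""
    token_set = set(tokens)
    # dict() over (key, value) pairs works on Python 2.6 as well as later versions.
    return dict((word, (word in token_set)) for word in word_features)


train_set = [(to_features(tokens), tag) for tokens, tag in documents[:numtrain]]
test_set = [(to_features(tokens), tag) for tokens, tag in documents[numtrain:]]

print(train_set)
print(test_set)
```

Upgrading the interpreter to 2.7 or later would be the other way around the error, since the original comprehension is valid there.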