CountVectorizer Deleting Features That Only Appear Once
Solution 1:
It's impossible to say for certain without seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit/transform pattern, meaning there are two stages: fit and transform.
In the case of CountVectorizer, the fit stage builds the vocabulary, and the transform stage maps your input text into that vocabulary space.
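To make the two stages concrete, here is a minimal sketch. The toy corpus is made up for illustration and is not from the question; the point is only that fit learns the vocabulary and transform counts against it, ignoring tokens it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
train_corpus = ["the cat sat", "the dog barked"]

model = CountVectorizer()
model.fit(train_corpus)  # fit: learn the vocabulary from the training text

# vocabulary_ maps each learned token to its column index.
vocab = sorted(model.vocabulary_)
print(vocab)

# transform: count occurrences of the learned tokens in new text.
# "bird" was never seen during fit, so it gets no column and is ignored.
row = model.transform(["the bird sat"]).toarray()[0]
print(row)
```

The width of the transformed row always matches the vocabulary learned during fit, no matter what text is passed to transform.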
My guess is that you're calling fit on both datasets instead of just one. You need to use the same fitted CountVectorizer on both if you want the results to line up, e.g.:
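from sklearn.feature_extraction.text import CountVectorizer

model = CountVectorizer()
# Learn the vocabulary from the training text only, then reuse it for the test text.
transformed_train = model.fit_transform(train_corpus)
transformed_test = model.transform(test_corpus)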
Again, this can only be a guess until you post the setup_data function, but having seen this before, I would guess you're doing something more like this:
model = CountVectorizer()
transformed_train = model.fit_transform(train_corpus)
# Problem: this second fit_transform discards the training vocabulary and learns a new one.
transformed_test = model.fit_transform(test_corpus)
which builds a brand-new vocabulary from test_corpus, so unsurprisingly the two results won't have the same vocabulary length.
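The difference is easy to check by comparing matrix shapes. The sketch below uses two small made-up corpora (they are assumptions, not the data from the question) and contrasts reusing one fitted vectorizer with refitting on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpora, for illustration only.
train_corpus = ["red apple", "green apple", "red grape"]
test_corpus = ["green grape", "yellow banana split"]

# Shared vocabulary: fit on train, then only transform the test text.
shared = CountVectorizer()
train_X = shared.fit_transform(train_corpus)
test_ok = shared.transform(test_corpus)

# Refitting on the test text replaces the vocabulary learned from train.
refit = CountVectorizer()
refit.fit_transform(train_corpus)
test_bad = refit.fit_transform(test_corpus)

# Column counts match only in the shared-vocabulary case.
print(train_X.shape, test_ok.shape, test_bad.shape)
```

If the column counts of your train and test matrices differ, a second fit or fit_transform call somewhere in the pipeline is the first thing to look for.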