
TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()

There is a dataframe like this:

index  terms
1345   ['jays', 'place', 'great', 'subway']
1543   ['described', 'communicative', 'friendly']
9874   ['great', 'sarah

Solution 1:

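Each row's terms need to be in their own sublist, and all of those sublists need to be nested within one outer list. corpora.Dictionary expects a list of documents where each document is a list of tokens; if a document is a plain string, doc2bow raises this TypeError.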

from gensim import corpora

# One list of tokens per document, all wrapped in an outer list
theterms = [['jays', 'place', 'great', 'subway'], ['described', 'communicative', 'friendly'], ['great', 'sarahs', 'apartament', 'back'], ['great', 'sarahs', 'apartament', 'back']]

dictionary = corpora.Dictionary(theterms)
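To see which input shape triggers the error, it can help to check the data before building the dictionary. The following is a minimal sketch using only the standard library; `check_documents` is a hypothetical helper written for illustration and is not part of gensim.

```python
def check_documents(documents):
    """Raise TypeError if any document is a bare string instead of a list of tokens."""
    for position, document in enumerate(documents):
        if isinstance(document, str):
            raise TypeError(
                f"document at position {position} is a single string: {document!r}"
            )
    return documents


theterms = [
    ['jays', 'place', 'great', 'subway'],
    ['described', 'communicative', 'friendly'],
]

# Correct shape: a list of token lists passes the check unchanged.
checked = check_documents(theterms)

# Wrong shape: in a flat list of strings, every "document" is a bare string.
try:
    check_documents(['jays', 'place', 'great', 'subway'])
except TypeError as error:
    message = str(error)
```

A list that passes the check has the nested shape described above and can be handed to corpora.Dictionary(checked).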

Solution 2:

First convert the column to a list with comments['terms'].tolist(), then pass that list to corpora.Dictionary(); it should work. You can also do other preprocessing, such as stemming or stopword removal, before creating your dictionary.
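One caveat may be worth checking, although the question does not confirm it applies here. If the dataframe was read back from a CSV file, each cell can hold the text of a list rather than an actual list, so tolist() returns strings and the same TypeError appears. The sketch below assumes that situation; `to_tokens` is a hypothetical helper, and `raw_terms` stands in for the result of comments['terms'].tolist().

```python
import ast


def to_tokens(cell):
    """Return a list of tokens, parsing the cell first if it is text like "['a', 'b']"."""
    if isinstance(cell, str):
        cell = ast.literal_eval(cell)
    return list(cell)


# Stand-in for comments['terms'].tolist() when the cells were read back as text.
raw_terms = [
    "['jays', 'place', 'great', 'subway']",
    "['described', 'communicative', 'friendly']",
]

documents = [to_tokens(cell) for cell in raw_terms]
```

With a real dataframe this would be documents = [to_tokens(cell) for cell in comments['terms'].tolist()], followed by corpora.Dictionary(documents). Cells that are already lists pass through unchanged.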
