The Similar Method From The Nltk Module Produces Different Results On Different Machines. Why?
Solution 1:
In your example there are 40 other words which have exactly one context in common with the word 'monstrous'.
In the similar function a Counter object is used to count the words with similar contexts, and then the most common ones (default 20) are printed. Since all 40 have the same frequency, the order can differ.
From the doc of Counter.most_common:
Elements with equal counts are ordered arbitrarily
I checked the frequency of the similar words with this code (which is essentially a copy of the relevant part of the function code):
from collections import Counter  # older NLTK releases exposed this as nltk.compat.Counter

from nltk.book import *
from nltk.util import tokenwrap

word = 'monstrous'
num = 20

text1.similar(word)  # also builds text1._word_context_index on first use

wci = text1._word_context_index._word_to_contexts
if word in wci.conditions():
    contexts = set(wci[word])
    # count, for every other word, how many contexts it shares with `word`
    fd = Counter(w for w in wci.conditions() for c in wci[w]
                 if c in contexts and not w == word)
    words = [w for w, _ in fd.most_common(num)]
    # print(tokenwrap(words))
    print(fd)
    print(len(fd))
    print(fd.most_common(num))
Output: (different runs give different output for me)
Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})
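If a reproducible listing is wanted, one option is to break ties explicitly instead of relying on Counter.most_common. The following is a minimal sketch, not part of NLTK: most_common_stable is a hypothetical helper name, and the toy Counter stands in for the fd object above. It sorts by descending count and then alphabetically, so equal counts always come out in the same order.

```python
from collections import Counter


def most_common_stable(counter, n=None):
    """Like Counter.most_common, but ties are broken alphabetically.

    Hypothetical helper for illustration; not an NLTK or stdlib function.
    """
    # Sort by descending count first, then by the key itself.
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    return ranked if n is None else ranked[:n]


# Toy data standing in for the `fd` Counter built from the word-context index.
fd = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1, 'whale': 3})
print(most_common_stable(fd, 3))
```

The same ranking could be applied to the fd built in the snippet above in place of fd.most_common(num).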
Solution 2:
In short:
It has something to do with how python3 hashes keys when the similar() function uses the Counter dictionary. See http://pastebin.com/ysAF6p6h
See also "How and why is the dictionary hashes different in python2 and python3?"
In long:
Let's start with:
from nltk.book import *
The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text object and reads several corpora into Text objects.
E.g. this is how the text1 variable was read in nltk.book:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Now, if we go down to the code for the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see that self._word_context_index is initialized the first time it is accessed:
def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.

    :param word: The word used to seed the similarity search
    :type word: str
    :param num: The number of words to generate (default=20)
    :type num: int
    :seealso: ContextIndex.similar_words()
    """
    if '_word_context_index' not in self.__dict__:
        #print('Building word-context index...')
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                     if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        print(tokenwrap(words))
    else:
        print("No matches")
So that points us to the nltk.text.ContextIndex object, which is supposed to collect all the words that share a similar context window and store them. The docstring says:
A bidirectional index between words and their 'contexts' in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.
By default, if you're calling the similar() function, it will initialize the _word_context_index with the default context settings, i.e. the left and right token window, see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40:
@staticmethod
def _default_context(tokens, i):
    """One left token and one right token, normalized to lowercase"""
    left = (tokens[i-1].lower() if i != 0 else '*START*')
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
    return (left, right)
From the similar() function, we see that it iterates through the words and contexts stored in the _word_context_index, i.e. wci = self._word_context_index._word_to_contexts.
Essentially, _word_to_contexts is a dictionary where the keys are the words in the corpus and the values are the (left, right) context pairs, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:
self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                             for i, w in enumerate(tokens))
And here we see that it's a CFD, which is a nltk.probability.ConditionalFreqDist object, which does not include smoothing of token probability; see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646.
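To make the data flow concrete, here is a plain-Python sketch of the same idea that does not need NLTK. It is a simplification under stated assumptions: a dict of Counters stands in for the ConditionalFreqDist, the isalpha() filter is left out, and the function names and the toy sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def default_context(tokens, i):
    """One left token and one right token, lowercased (mirrors the NLTK default)."""
    left = tokens[i - 1].lower() if i != 0 else '*START*'
    right = tokens[i + 1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)


def build_word_to_contexts(tokens):
    """Map each lowercased word to a Counter of the contexts it occurs in.

    Hypothetical stand-in for ContextIndex._word_to_contexts.
    """
    index = defaultdict(Counter)
    for i, w in enumerate(tokens):
        index[w.lower()][default_context(tokens, i)] += 1
    return index


def shared_context_counts(index, word):
    """Count, per other word, how many distinct contexts it shares with `word`."""
    contexts = set(index.get(word, ()))
    return Counter(w for w in index for c in index[w]
                   if c in contexts and w != word)


# Toy corpus (an assumption): three adjectives in the same ('a', 'whale') slot.
tokens = "a monstrous whale and a curious whale and a lazy whale".split()
index = build_word_to_contexts(tokens)
counts = shared_context_counts(index, 'monstrous')
print(counts)
```

In this toy corpus 'curious' and 'lazy' each share exactly one context with 'monstrous', so they tie, which is the same situation as the 40 tied words in the Moby Dick example.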
The only possibility of getting a different result is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402
Given that two keys in a Counter object have the same count, the word with the lower sorted hash will print out first, and the hash of the key depends on the CPU's bit-size, see http://www.laurentluce.com/posts/python-dictionary-implementation/
The whole process of finding the similar words is itself deterministic, since:
- the corpus/input is fixed: Text(gutenberg.words('melville-moby_dick.txt'))
- the default context for every word is also fixed, i.e. self._word_context_index
- the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is discrete
The exception is when the function outputs the most_common list: when there's a tie in the Counter values, it outputs the tied keys in an order determined by their hashes.
In python2, there's no reason to get a different output from different instances on the same machine with the following code:
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
But in Python3, it gives a different output every time you run text1.similar('monstrous'), see http://pastebin.com/ysAF6p6h
Here's a simple experiment that shows the hashing difference between python2 and python3:
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
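The run-to-run variation under python3 is consistent with string hash randomization, which has been enabled by default since Python 3.3 and is controlled by the PYTHONHASHSEED environment variable. The sketch below is one way to observe that; the helper name string_hash_in_fresh_process is made up for illustration, and it assumes the current interpreter can spawn a child process. With a fixed seed, the hash of a string is the same in every new process; with the default random seed it normally differs between processes.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text, seed):
    """Return hash(text) as computed by a new interpreter with PYTHONHASHSEED=seed.

    `seed` may be an integer (reproducible hashes) or the string 'random'.
    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Same seed, two separate processes: the hashes match.
fixed_a = string_hash_in_fresh_process('monstrous', 0)
fixed_b = string_hash_in_fresh_process('monstrous', 0)
print(fixed_a == fixed_b)

# Random seed: the two values will usually (not always) differ.
print(string_hash_in_fresh_process('monstrous', 'random'),
      string_hash_in_fresh_process('monstrous', 'random'))
```

Note that on Python 3.7 and later, dictionaries keep insertion order, so a Counter built from a literal as in the experiment above may print its ties in the same order on every run; the hash-dependent ordering shown here applies to the older interpreters used in that transcript.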