Improved accuracy #35
Conversation
    return ((label, feature), value)
by_label = x.map(doc_to_label)  # ((label, feature), value)
by_label = by_label.reduceByKey(lambda x, y: x + y)
by_label_map = by_label.collectAsMap()
This is gonna be 4*vocab... yikes
I know! But this is the second change that improved my accuracy.
If you remember the likelihood equation, in the numerator we have to count the occurrences of each word within the class. I was simply counting the total number of words in the class.
I am using by_label_map to fetch the count of a word given the class; if the combination (label, feature) does not exist, I return 0 as the count, and of course each time I fetch a word count I add 1 to it. Our likelihood numerator becomes (for each class, each word):
by_label_map.get((label, feature), 0) + 1
The important thing here is that, before this change, I was not counting those (0 + 1)s!
I know the collectAsMap() looks ugly. It was a desperate attempt to execute the idea as soon as possible!
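The smoothed numerator above can be sketched in plain Python (no Spark), assuming by_label_map is the dict that collectAsMap() returned; the labels, words, and counts here are illustrative, not from the actual dataset:

```python
import math

# Hypothetical counts collected from the RDD: (label, feature) -> count.
by_label_map = {
    ("sports", "ball"): 10,
    ("sports", "goal"): 5,
    ("politics", "vote"): 8,
}
vocab = {"ball", "goal", "vote"}

def smoothed_log_likelihood(label, feature):
    # Unseen (label, feature) pairs fall back to 0, then +1 (Laplace smoothing).
    count = by_label_map.get((label, feature), 0) + 1
    # Denominator: total word count in the class plus vocabulary size.
    total = sum(v for (l, _), v in by_label_map.items() if l == label)
    return math.log(count / (total + len(vocab)))

# A pair never seen in training still gets a non-zero probability:
print(smoothed_log_likelihood("politics", "ball"))
```

Without the +1, that unseen pair would contribute log(0) and wipe out the whole class score, which matches the "(0 + 1)s" point above.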
The performance improvement is exciting! It looks like a change to the log-likelihood formulation. What's the high-level idea?
The high-level idea is this: in the fit method, while training, we have to create the likelihood RDDs considering all possible labels. Algorithmically, you have to iterate over our four labels and calculate likelihood probabilities for every word under each label. That is why you see that ugly cartesian product there! I was not doing that before.
This improved accuracy from 27.xx to 58.xx
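A minimal sketch of that idea, assuming observed pairs come from training and using itertools.product in place of the RDD cartesian; all names and counts here are made up for illustration:

```python
from itertools import product

# Hypothetical (label, word) -> count pairs actually observed in training.
observed = {("sports", "ball"): 10, ("politics", "vote"): 8}
labels = ["sports", "politics"]
vocab = ["ball", "vote"]

# The cartesian product labels x vocab guarantees EVERY pair gets a
# smoothed count, including combinations never seen together in training.
likelihood_counts = {
    (label, word): observed.get((label, word), 0) + 1
    for label, word in product(labels, vocab)
}
print(likelihood_counts)
```

Iterating only over observed pairs would silently skip the unseen combinations, which is the bug the fix addresses.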