Improved accuracy #35
Conversation
    return ((label, feature), value)
by_label = x.map(doc_to_label)  # ((label, feature), value)
by_label = by_label.reduceByKey(lambda x, y: x + y)
by_label_map = by_label.collectAsMap()
This is gonna be 4*vocab... yikes
I know! But this is the second change that improved my accuracy.
If you remember the likelihood equation, in the numerator we have to count the occurrences of each word within the class. I was simply counting the total number of words in the class.
I am using by_label_map to fetch the count of a word given the class; if the combination (label, feature) does not exist, I return 0 as the count, and of course each time I fetch a word count I add 1 to it. Our likelihood numerator becomes (for each class, each word):
by_label_map.get((label, feature), 0) + 1
The important thing here is that, before this change, I was not counting those (0 + 1)s!
I know the collectAsMap() looks ugly. It was a desperate attempt to execute the idea as soon as possible!
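The smoothed numerator above can be sketched in plain Python (no Spark), assuming by_label_map is the dict that collectAsMap() returned; the labels, words, and counts here are illustrative, not from the actual dataset:

```python
import math

# Hypothetical counts collected from the RDD: (label, feature) -> count.
by_label_map = {
    ("sports", "ball"): 10,
    ("sports", "goal"): 5,
    ("politics", "vote"): 8,
}
vocab = {"ball", "goal", "vote"}

def smoothed_log_likelihood(label, feature):
    # Unseen (label, feature) pairs fall back to 0, then +1 (Laplace smoothing).
    count = by_label_map.get((label, feature), 0) + 1
    # Denominator: total word count in the class plus vocabulary size.
    total = sum(v for (l, _), v in by_label_map.items() if l == label)
    return math.log(count / (total + len(vocab)))

# A pair never seen in training still gets a non-zero probability:
print(smoothed_log_likelihood("politics", "ball"))
```

Without the +1, that unseen pair would contribute log(0) and wipe out the whole class score, which matches the "(0 + 1)s" point above.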
The performance improvement is exciting! It looks like a change to the log-likelihood formulation. What's the high-level idea?
The high-level idea is this: in the fit method, while training, we have to create the likelihood RDDs considering all possible labels. Algorithmically, you have to iterate over our four labels and calculate likelihood probabilities for every word under each label. That is why you see that ugly cartesian product there! I was not doing that before.
This improved accuracy from 27.xx to 58.xx
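A minimal sketch of that idea, assuming observed pairs come from training and using itertools.product in place of the RDD cartesian; all names and counts here are made up for illustration:

```python
from itertools import product

# Hypothetical (label, word) -> count pairs actually observed in training.
observed = {("sports", "ball"): 10, ("politics", "vote"): 8}
labels = ["sports", "politics"]
vocab = ["ball", "vote"]

# The cartesian product labels x vocab guarantees EVERY pair gets a
# smoothed count, including combinations never seen together in training.
likelihood_counts = {
    (label, word): observed.get((label, word), 0) + 1
    for label, word in product(labels, vocab)
}
print(likelihood_counts)
```

Iterating only over observed pairs would silently skip the unseen combinations, which is the bug the fix addresses.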