Extract OCR functions and add unit tests for them. by Double-A-92 · Pull Request #23 · pwnfoo/fakemenot

Double-A-92 · 2017-10-15T00:01:40Z

The __init__.py file was emptied and its contents moved into main.py. That was needed for the new unit test structure.
A step towards solving #5.

pwnfoo

This is the best PR I have seen in a while. Thanks a lot for this <3

pwnfoo · 2017-10-15T00:27:56Z

fakemenot/tests/test_ocr.py

+        expected_user = "@NASA"
+        expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
+                        "history again in a quiet way: go.nasa.gov/2kOO1cc"
+        self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)


It looks like the PNGs were ignored by .gitignore. Can you add them manually? :)

pwnfoo · 2017-10-15T00:30:31Z

fakemenot/ocr.py

+from pytesseract import image_to_string
+
+
+def find_user_and_text_in_tweet_image(image_path: str) -> (str, str):


This won't work in Python2 :(

pwnfoo · 2017-10-15T00:30:41Z

fakemenot/ocr.py

+    return extract_values_from_desktop_tweet(words)
+
+
+def extract_values_from_desktop_tweet(words: List[str]) -> (str, str):


pwnfoo · 2017-10-15T00:30:59Z

fakemenot/ocr.py

+    return user, body
+
+
+def prepare_image_for_ocr(image_path: str) -> Optional[Image.Image]:


Python2 compatibility is broken
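
The three signatures flagged above use inline annotations, which are a syntax error on Python 2. A minimal sketch of one possible workaround is PEP 484 type comments, which keep the hints but parse on both versions. The function body below is a placeholder (the real extraction logic is not shown in this thread), and `Tuple[str, str]` is used because `(str, str)` is not a valid type hint:

```python
from typing import List, Tuple  # on Python 2 this needs the `typing` backport from PyPI


def extract_values_from_desktop_tweet(words):
    # type: (List[str]) -> Tuple[str, str]
    """Placeholder body for illustration only, not the PR's actual logic."""
    user = words[0] if words else ""
    body = " ".join(words[1:])
    return user, body


print(extract_values_from_desktop_tweet(["@NASA", "hello", "world"]))
```

The same comment form would apply to `find_user_and_text_in_tweet_image` and `prepare_image_for_ocr`; whether the PR ended up using type comments or dropping the hints entirely is not stated here.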

Double-A-92 · 2017-10-15T00:55:28Z

Ok fixed the issues. :)

Btw, if you are really worried about compatibility: my IDE is showing me a warning about the "import configparser" in main.py, but I don't really know how to avoid that right now.
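
Assuming the warning is about the module name (it is `configparser` on Python 3 and `ConfigParser` on Python 2; the thread does not say what the IDE actually reported), a common idiom is a guarded import. The section and key names below are hypothetical:

```python
try:
    import configparser  # Python 3 module name
except ImportError:
    import ConfigParser as configparser  # Python 2 module name

parser = configparser.ConfigParser()
parser.add_section("example")            # hypothetical section name
parser.set("example", "key", "value")    # hypothetical option
print(parser.get("example", "key"))
```

Only calls that exist under both module names are used here, so the rest of main.py could keep referring to `configparser` unchanged.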

pwnfoo · 2017-10-15T16:03:44Z

fakemenot/__init__.py

-            removed_elements = 0
-            ltweet, orig_len = tweet[0].split(' '), len(tweet[0].split(' '))
-            # Compare each element of body to element in body. TODO: Optimize
-            for ele in body:


Body is being passed as a string here. So, every iteration is going to produce a single character. It's best to body.split(' ') before this :)
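
To illustrate the point: iterating over a Python string yields one character at a time, while splitting first yields words. The sample text is taken from the test fixture quoted in this PR:

```python
body = "For 70 years, planes loudly flew supersonic"

# Iterating the string directly walks it character by character.
chars = [ele for ele in body]

# Splitting on spaces first gives the word-level elements the loop expects.
words = body.split(' ')

print(chars[:3])  # ['F', 'o', 'r']
print(words[:3])  # ['For', '70', 'years,']
```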

I didn't touch that part. I just literally ripped out the OCR bits, i.e. the parts that set the variables potential_user and body.

That whole analysis part can probably be replaced by the SequenceMatcher.ratio() function that I also used in the unit tests. Seems to do the same thing?
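
A rough sketch of what that suggestion could look like, using `difflib.SequenceMatcher` from the standard library. The helper name `similarity` and the 0.9 threshold are assumptions for illustration; the thread does not show the threshold used in the unit tests. The two strings are the expected text from the test and the tweet body from an OCR result reported in this PR:

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


expected = ("For 70 years, planes loudly flew supersonic & barriers were broken. "
            "Now we're making history again in a quiet way: go.nasa.gov/2kOO1cc")
ocr_text = ("For 70 years, planes loudly flew supersonic 8L barriers were broken. "
            "Now we're making history again in a quiet way: go.nasa.gov/2k001cc")

THRESHOLD = 0.9  # hypothetical cut-off, not taken from the PR
print(similarity(expected, ocr_text) > THRESHOLD)
```

A ratio tolerates small OCR slips such as "&" read as "8L", which an exact comparison would reject.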

pwnfoo · 2017-10-15T16:10:27Z

fakemenot/tests/test_ocr.py

+        expected_user = "@NASA"
+        expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
+                        "history again in a quiet way: go.nasa.gov/2kOO1cc"
+        self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)


This one seems to be broken for me. After OCR, I get this:

NASAO @ @NASA m V For 70 years, planes loudly flew supersonic 8L barriers were broken. Now we're making history again in a quiet way: go.nasa.gov/2k001cc

Hmm... There are probably differences between the Linux and Windows OCR. But this is a real problem, which needs to be fixed...

It seems to sometimes recognize the "official account" badge as an @, which then messes up the user handle extraction.

I'll try to fix that.

Ok changed it. Try it now :)

A potential handle has to be at least 2 chars long. Also, the image was made a bit bigger for more stable OCR results.
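
A minimal sketch of the length check described above, not the PR's actual code. The function name `find_potential_handle` is hypothetical, and it assumes the 2-character minimum counts the leading "@". The input is the start of the OCR output quoted earlier in this thread:

```python
def find_potential_handle(words, min_length=2):
    """Return the first token that starts with '@' and is long enough, else None."""
    for word in words:
        # A lone "@" (e.g. a misread verification badge) is too short to be a handle.
        if word.startswith("@") and len(word) >= min_length:
            return word
    return None


ocr_words = "NASAO @ @NASA m V For 70 years,".split(" ")
print(find_potential_handle(ocr_words))  # @NASA
```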

pwnfoo · 2017-10-18T03:19:35Z

Hmm, OCR seems to be behaving differently across platforms. Image 3 is broken on Linux now because it's blown up a bit too much and the handle is detected incorrectly. I'm starting to wonder why. There should be a cleaner solution :/

Double-A-92 · 2017-10-18T17:33:46Z

I don't know how to fix that quickly. The unit tests themselves are correct; they check the right thing.

I could revert the OCR extraction code to the original (where it checks for the "v" to select the tweet body), so we can be sure that I didn't break anything.

And then just let the test fail (justifiably), and you could open a separate issue for that?

pwnfoo requested changes Oct 15, 2017

Extract OCR functions and add unit tests for them.

67fa790

The __init__.py file was emptied and its contents moved into main.py. That was needed for the new unit test structure.

Double-A-92 force-pushed the master branch from da241e4 to 67fa790 · October 15, 2017 00:53

pwnfoo requested changes Oct 15, 2017

Improve user handle recognition

4bb64e0

A potential handle has to be at least 2 chars long. Also, the image was made a bit bigger for more stable OCR results.
