Extract OCR functions and add unit tests for them. #23
Double-A-92 wants to merge 2 commits into pwnfoo:master from
pwnfoo left a comment
This is the best PR I have seen in a while. Thanks a lot for this <3
    expected_user = "@NASA"
    expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
                    "history again in a quiet way: go.nasa.gov/2kOO1cc"
    self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)
It looks like pngs were ignored by .gitignore. Can you manually add them? :)
fakemenot/ocr.py
Outdated
    from pytesseract import image_to_string


    def find_user_and_text_in_tweet_image(image_path: str) -> (str, str):
fakemenot/ocr.py
Outdated
        return extract_values_from_desktop_tweet(words)


    def extract_values_from_desktop_tweet(words: List[str]) -> (str, str):
fakemenot/ocr.py
Outdated
        return user, body


    def prepare_image_for_ocr(image_path: str) -> Optional[Image.Image]:
The __init__ file was emptied and its contents moved into main.py. That was needed for the new unit test structure.
Ok, fixed the issues. :) Btw., if you are really worried about compatibility: my IDE is showing me a warning about the "import configparser" in main.py, but I don't really know how to avoid that for now.
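One possible way to address that warning, assuming it comes from Python 2 compatibility checks (where the module is named ConfigParser rather than configparser), is a guarded import. This is only a sketch; the thread does not say what the IDE warning actually was.

```python
# Assumption: the warning is about Python 2, where the module is "ConfigParser".
try:
    import configparser  # Python 3 module name
except ImportError:
    import ConfigParser as configparser  # Python 2 module name

# The rest of the code can use the same name on both versions.
config = configparser.ConfigParser()
```

If the project only targets Python 3 (the type hints in ocr.py suggest so), the plain import is enough and the warning can probably be ignored.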
    removed_elements = 0
    ltweet, orig_len = tweet[0].split(' '), len(tweet[0].split(' '))
    # Compare each element of body to element in body. TODO: Optimize
    for ele in body:
Body is being passed as a string here. So, every iteration is going to produce a single character. It's best to body.split(' ') before this :)
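A small illustration of the difference described here, using a made-up body string; the variable names mirror the snippet under review but the values are only for demonstration.

```python
# Hypothetical tweet body, passed around as a single string.
body = "For 70 years, planes loudly flew"

# Iterating over a string yields one character per iteration.
chars = [ele for ele in body]

# Splitting first yields one word per iteration, which is what the comparison expects.
words = [ele for ele in body.split(' ')]

print(chars[:3])
print(words[:3])
```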
I didn't touch that part. I just literally ripped out the OCR bits, i.e. the parts that set the variables potential_user and body.
That whole analysis part can probably be replaced by the SequenceMatcher.ratio() function that I also used in the unit tests. Seems to do the same thing?
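As a rough sketch of that idea: difflib's SequenceMatcher.ratio() returns a similarity score between 0 and 1 for two strings. The helper name similarity and the 0.9 threshold below are assumptions for illustration, not taken from the PR; the two strings are the expected text and the OCR output quoted in this thread.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, expected, actual).ratio()


expected_text = ("For 70 years, planes loudly flew supersonic & barriers were broken. "
                 "Now we're making history again in a quiet way: go.nasa.gov/2kOO1cc")
# OCR output as reported in the review, joined into one line.
ocr_text = ("For 70 years, planes loudly flew supersonic 8L barriers were broken. "
            "Now we're making history again in a quiet way: go.nasa.gov/2k001cc")

score = similarity(expected_text, ocr_text)
# Hypothetical threshold: tolerate a few misread characters.
is_match = score >= 0.9
print(score, is_match)
```

Whether this really does the same thing as the existing element-by-element comparison would need checking, since that code works on words while SequenceMatcher here works on characters.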
    expected_user = "@NASA"
    expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
                    "history again in a quiet way: go.nasa.gov/2kOO1cc"
    self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)
This one seems to be broken for me. After OCR, I get this:
NASAO
@ @NASA m V
For 70 years, planes loudly flew supersonic 8L
barriers were broken. Now we're making
history again in a quiet way:
go.nasa.gov/2k001cc
Hmm.. There are probably differences between the Linux and Windows OCR. But this is a real problem, which needs to be fixed...
It seems to sometimes recognize the "official account" badge as an @, which then messes up the user handle extraction.
I'll try to fix that.
Ok changed it. Try it now :)
A potential handle now has to be at least 2 chars long. Also, the image was made a bit bigger for more stable OCR results.
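A minimal sketch of what such a length check could look like, run against the OCR output quoted earlier in this thread. The function name find_handle and its min_length parameter are hypothetical, and the sketch assumes the "@" itself counts towards the 2 characters; the actual code in ocr.py may differ.

```python
from typing import List, Optional


def find_handle(words: List[str], min_length: int = 2) -> Optional[str]:
    """Return the first word that looks like a handle, or None.

    A lone "@" (e.g. a misread "official account" badge) is skipped
    because it is shorter than min_length.
    """
    for word in words:
        if word.startswith("@") and len(word) >= min_length:
            return word
    return None


# First two lines of the OCR output reported above, split into words.
ocr_words = "NASAO @ @NASA m V".split()
handle = find_handle(ocr_words)
print(handle)
```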
Hmm, OCR seems to be behaving differently across platforms. Image 3 is broken on Linux now because it's blown up a bit too much and the handle is detected incorrectly. I'm starting to wonder why. There should be a cleaner solution :/
I don't know how to fix that quickly. The unit tests themselves are correct; they check the right thing. I could revert the OCR extraction code to the original (where it checks for the "v" to select the tweet body), so that it's certain I didn't break anything. Then we could just let the test fail (justifiably) and you could open a separate issue for that?
A step towards solving #5.