Fix 128bit SimHash, and add 256bit and 512bit SimHash#5
Fix 128bit SimHash, and add 256bit and 512bit SimHash#5nicolaichuk wants to merge 7 commits intotgalopin:masterfrom
Conversation
add 256 and 512 bit SimHash
tgalopin
left a comment
There was a problem hiding this comment.
Hello @nicolaichuk,
Thanks for your PR!
I'm not sure what you want to fix in 128 bits version and why you need all the modifications you did. Unfortunately, I can't accept such a PR without some explanations about your work and fixes on the coding style and BC breaks.
| * @param Fingerprint $fp1 | ||
| * @param Fingerprint $fp2 | ||
| * @return int | ||
| */ |
There was a problem hiding this comment.
I'm pretty sure this method is slower than the current one, why do you need this?
There was a problem hiding this comment.
You can't use scalar type hint, we support php 5.3+
| $this->deviation = $deviation; | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
Count differences between fingerprints
lib/Tga/SimHash/Fingerprint.php
Outdated
| * @var float | ||
| */ | ||
| protected $decimalValue; | ||
| protected $value; |
There was a problem hiding this comment.
This is a BC break without any added value, could you revert this?
There was a problem hiding this comment.
"decimalValue" save value as number.
"value" save value as string.
These are two different variables, so they are called differently.
There was a problem hiding this comment.
Could you make this property private as well? I don't think it should be exposed anyway.
| public function tokenize($element) | ||
| { | ||
| return str_pad(base_convert(md5($element), 16, 2), 128, '0', STR_PAD_LEFT); | ||
| $hash = md5($element); |
There was a problem hiding this comment.
Why do you need to split this in parts?
| class String128Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
||
| protected static $search = array('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'); |
There was a problem hiding this comment.
You shouldn't add protected variables like this without a specific need as they will need to be handled by BC promise.
| { | ||
| return $size === 256; | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Please add an ending line to this file.
| { | ||
| return $size === 512; | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Please add an ending line to this file.
* add an ending line to this file
first: second: |
tgalopin
left a comment
There was a problem hiding this comment.
Thanks for your work, I understand it now :) . A few comments and we should be okay :) !
| * @param Fingerprint $fp1 | ||
| * @param Fingerprint $fp2 | ||
| * @return int | ||
| */ |
lib/Tga/SimHash/Fingerprint.php
Outdated
| * @var float | ||
| */ | ||
| protected $decimalValue; | ||
| protected $value; |
There was a problem hiding this comment.
Could you make this property private as well? I don't think it should be exposed anyway.
| */ | ||
| class String128Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
| */ | ||
| class String256Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
| */ | ||
| class String512Tokenizer implements TokenizerInterface | ||
| { | ||
|
|
|
@tgalopin |
tgalopin
left a comment
There was a problem hiding this comment.
Travis is failing due to scalar type hints.
| return str_pad(base_convert(md5($element), 16, 2), 128, '0', STR_PAD_LEFT); | ||
| $hash = md5($element); | ||
| $hash = str_replace(self::$search, self::$replace, $hash); | ||
| $hash = str_pad($hash, 512, '0', STR_PAD_LEFT); |
There was a problem hiding this comment.
I think there is a typo here. Shouldn't you write:
$hash = str_pad($hash, 128, '0', STR_PAD_LEFT);
|
Hi guys, I know what is to maintain an OSS project, but this PR looks good to me except a typo and the scalar type hint (except if you want to stop to maintain compatibility with older projects). It is a good lib that helped me in the past, but now I am stuck with the error in 128 hashes. How can I help? |
|
Hello @bbalet ! The issue is that as @nicolaichuk didn't gave me access to code modification in this PR, I can't update his code, and using it without crediting him does not feel right in open-source :) . If you want to open a new PR based on his code, fixing the issues and adding a note in your commit message about @nicolaichuk, please don't hesitate to do so and I will review it happily :) ! |
Option "Allow edits from maintainers" is selected in this PR. |
by code review @bbalet tgalopin#5 (comment)
@tgalopin, hi.
Please review and merge to upstream.