Update Duplicate Detection#77
Conversation
Improve edge case logic in comparison functions. Remove unregister driver function. Clean up exp_functions.c.
Simplify dataqa_duplicates component in preparation for making it the boundary into our new duplicate system. Add exp functions: sparse_eql(), ln(), and logn(). Fix bugs in comparison functions. Make minor tweaks to objdrv_cluster.c.
Modify cluster files to use string keys. Build vectors fully sparsely. Add ca_fprint_vector(). Add snprint_llu(). Add exp_fn_trim(). Update exp_fn_cmp(). Organize exp function definitions by group. Add statistics tracking to cluster driver. Reduce minimum hint threshold. Add array handling to ci_xaToTrimmedArray(). Update timer to handle multiple starts and stops properly.
Re-add Levenshtein to exp_functions. Publish edit_dist() in the cluster library. Fix mistakes in cluster driver function signatures. Fix spelling mistakes. Add detail to an error message in the lexer. Remove unused .cluster files. Clean up cluster-schema.cluster. Clean up other unused junk.
Add known issues to string similarity documentation. Clean up and organize todos. Clean up testing code in several files.
|
I'd probably recommend rebasing these into one commit, but that's up to you. |
…ast commit). Update tests to pass with this modification.
… caches). Fix a formatting issue with the stat method. Fix a missing include in the util.c library.
…le hundred bytes. Add check_double() to handle functions that return NAN on failure. Clean up.
…rary. Round similarity results to avoid floating point errors. Enable caching for memory allocated in get_cluster_size(). Rename edit_dist() to ca_edit_dist() to match format for public functions. Rename print_diagnostics() to print_err().
|
@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work. |
Add some error checking. Add expect.h to mtask.c. Add recursion checking.
|
@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work. |
Add HAVE_BUILTIN_EXPECT to config.h.in and cxlibconfig.h.in. Add code to include cxlibconfig.h.in from expect.h. Add comments to explain what each .h.in file is used for. Remove checks for __builtin_expect() from centrallix because we only need them in centrallix-lib. Remove cflags -DHAVE_BUILTIN_EXPECT because using configs is better. Rebuild configure files.
|
@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work. |
|
@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work. |
|
Oops, forgot to post my message before... This PR is cleared for human review. |
| goto end_free; | ||
| } | ||
|
|
||
| exp_fn_levenshtein(tree, objlist, i0, i1, i2); | ||
| //!!! I am not checking for errors here, because IN THEORY we have two strings... if we don't, big uh-oh. | ||
| int lev_dist = tree->Integer; | ||
| /** Allocate space to store metaphone pointers. **/ | ||
| char* primary = NULL; |
There was a problem hiding this comment.
Uninitialized
primary/secondary after schema-check goto end_free
goto end_free at line 4550 jumps past char* primary = NULL; and char* secondary = NULL; (lines 4554–4555). In C, the initializer is never executed when the declaration is jumped over, so both pointers have indeterminate values at end_free:. Because free_strs is true at that point, if (LIKELY(free_strs && primary != NULL)) nmSysFree(primary) may call nmSysFree with a garbage stack pointer, causing undefined behaviour (likely a crash).
Fix: move both declarations to the top of the function, before any goto:
int ret = -1;
bool free_strs = true;
char* primary = NULL;
char* secondary = NULL;| /** Store the results. **/ | ||
| store_data:; | ||
| const size_t length = strlen(primary) + 1lu + strlen(secondary) + 1lu; | ||
| if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) goto end_free; |
There was a problem hiding this comment.
Silent success return when
exp_fn_i_alloc_result_string fails after metaphone succeeds
When meta_double_metaphone succeeds it sets ret = 0. If exp_fn_i_alloc_result_string then fails, control jumps to end_free: still holding ret = 0. The guard if (UNLIKELY(ret != 0)) mssError(...) is never triggered, no error is printed, and the function returns 0 (success) without having stored anything in tree->String. The caller then treats the expression as a valid string result.
| if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) goto end_free; | |
| if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) { ret = -1; goto end_free; } |
|
TODO to self, for later: Add code location support to |
Rename mssError() to mssError_internal() which takes a file name and line number. Add mssError() macro that collects and passes the file and line number to mssError_internal(). Remove newline when the error stack is cleared because libraries might have already printed relevant errors.
… the file wasn't compiled.)
The duplicate detection project is ready to review, although (best case), there are still a couple of things blocking it from being ready to merge.
I would appreciate a full review of all changes, as there's quite a lot here. That said, some areas may require additional special attention, so I've compiled a list of all 28
TODO: Gregcomments below. (Note: Some of my todos assume the reader understands various pieces of nearby context / has generally read the indicated source code.)TODO: Gregs inobjdrv_cluster.cTODO: Greginmtsession.md.TODO: Greginxarray.md.TODO: Greginxstring.md.Please let me know if you have any questions, comments, or concerns about my changes and design choices.