Update Duplicate Detection #77
Open
Lightning11wins wants to merge 126 commits into master from dups
5f2e901
Checkpoint: Switching to DM UI project.
994e99f
Checkpoint: Switching to DM project.
ea6430f
Checkpoint: Switching to DM project.
cf0dbb5
Finish implementing major features for the cluster driver.
a861fb4
Upgrade memory handling in the cluster driver.
b4634f3
Begin adding query files to search for duplicates.
63a4dc2
Add warning for providing an invalid parameter.
22e55a3
Merge branch 'master' into dups
4b656a4
Improve exp_functions() to use central schema verification.
fa28afa
Add ClusterDriverRequirements (forgot to commit them before).
81a1d2f
Clean up unintended usage of glyph.h
e624d40
Attempt to reduce issues from ambiguously signed chars.
b0e000b
All tests now pass.
0874365
Re-apply reduced weight for duplicate pairs (temporarily turned off l…
01d918a
Clean up.
42a65f1
Update licences.
b281037
Clean up.
ee0bca7
Add "show_less" option to the cache method (skips printing uncomputed…
0c9eb2c
Update cluster library to use dynamic memory for any data over a coup…
394764e
Remove necessary requests for the driver name in objQueryFetch().
9b8cc19
Fix bugs that caused regressions after the updates to the cluster lib…
17156b7
Fix an invalid free (nmFree used instead of nmSysFree()).
648e30a
Merge branch 'master' into dups
29640a1
Minor improvements and clean up.
0fa62d3
Correct minor mistakes.
d3b571c
Merge branch 'master' into dups
06bae81
Implement a more extendable schema verification system.
13fd4b7
Replace old schema verification with the new system.
e83c15f
Expand the new schema verification system with extra data validation …
070cfe3
Clean up, bug fixes, and naming convention updates.
8795aaf
Add tests for log and power functions.
2e948d8
Add exp_fn_i_get_number().
4c347be
Add exp_fn_i_do_math() to bring the power of schema verification to l…
d177522
Minor clean up.
Lightning11wins 7b49a5b
Address Greg's comments
Lightning11wins e9c10a5
Merge branch 'exp-schema' into dups
Lightning11wins b6abca7
Finish exp_functions.c work.
Lightning11wins 8c86b5f
Organize docs.
Lightning11wins 63fa5ba
Fix wrong stAddValue() info caused by reading old code.
Lightning11wins d0d4f54
Clean up stale TODOs.
Lightning11wins 3b86627
Fix more styling mistakes.
Lightning11wins 6b83c67
Fix indentation mistakes (thanks Centrallix Indent extension).
Lightning11wins 66029f5
Rename functions to use the proper prefix everywhere.
Lightning11wins fce7a2c
Update magic.h to prepare for implementing magic on all cluster drive…
Lightning11wins b9defb8
Refactor some cluster driver code to make it cleaner.
Lightning11wins 636814e
Add magic.h to all major cluster driver structs.
Lightning11wins 495597e
Fix a broken test by increasing the tolerance for reasonable deviations.
Lightning11wins ab71333
Compile tests with -lm to prevent sporadic linker errors.
Lightning11wins 65e4458
Fix a critical driver bug causing levenshtein to be executed on the c…
Lightning11wins 68d1c68
Merge branch 'refs/heads/master' into dups
Lightning11wins 8ba449e
Fix a major bug where punctuation and whitespace was not properly ign…
Lightning11wins 2be1d22
Remove a broken link in clusters.h.
Lightning11wins 814fcfa
Fix code that assumed DateTime->Value was seconds since the epoch (it…
Lightning11wins 8917ae2
Clean up code and comments.
Lightning11wins 6eeeb2d
Add seed attribute to cluster driver.
Lightning11wins c561728
Improve support for IntVec and StringVec attribute types.
Lightning11wins ee9c410
Rename NameAttr to DataAttr in SourceData struct to match KeyAttr.
Lightning11wins f8aa100
Fix clusterGetAttrValue() asserting MGK_CL_CLUSTER_DATA on SearchData…
Lightning11wins 7dfc69b
Switch stats struct to use compile-time initialization (for consisten…
Lightning11wins d1c2185
Improve error messages in ci_ComputeSourceData().
Lightning11wins fd289ff
Refactor clusters to store an array of indexes into the SourceData st…
Lightning11wins dbc01f2
Fix spelling errors and improve comments.
Lightning11wins 4e96440
Add a test case for the cluster driver.
Lightning11wins c7c6fcb
Fix the keys field in the SourceData struct not being properly freed …
Lightning11wins b411968
Update sizeof functions to return size_t instead of unsigned int.
Lightning11wins 0fcf5e5
Update generated HTML by running make in centrallix-doc/Widgets.
Lightning11wins f47c678
Update copyright notices to use correct dates.
Lightning11wins e799efa
Fix typo that stated the signed int max was 2147483629.
Lightning11wins b9a9d36
Update macros.
Lightning11wins c9241cf
Fix error handling.
Lightning11wins 5fd3d96
Reimplement some code in ca_build_vector() so we don't need to use a …
Lightning11wins 3e57fa6
Fix spelling mistakes and clean up comments.
Lightning11wins 32ef9e0
Use ca_parse_vector_token() to abstract sparse vector parsing logic.
Lightning11wins 00bdc0e
Rename params for ca_parse_vector_token() to improve readability.
Lightning11wins 118e22b
Fix an incorrect doc comment.
Lightning11wins 19cd50b
Fix vector function bugs.
Lightning11wins ebb9585
Implement improved workaround for mssError() not supporting %c.
Lightning11wins 6544e4e
Update doc comments.
Lightning11wins 9de3d40
Update attribute lists.
Lightning11wins 9cac9b8
Add `README.md` to explain the new datasets directory.
Lightning11wins 98d8a11
Update `README.md` with knowledge learned from testing updates to fil…
Lightning11wins a0936d6
Improve OSDriver_Authoring.md with Noah's comments.
Lightning11wins f8572ad
Rewrite mssError() with lessons learned from mssErrorf().
Lightning11wins ddd99f6
Replace ci_xaToTrimmedArray() with xaToArray().
Lightning11wins d17c043
Fix a bug in exp_fn_compare() any correct use of the function to fals…
Lightning11wins 8d5364e
Add detail to the description in the copyright notice for the cluster…
Lightning11wins 7e2979b
Improve how enums are implemented in objdrv_cluster.c.
Lightning11wins 59eda46
Re-add typecasts that I removed because I thought they were optional.
Lightning11wins cf26c09
Fix a memory error.
Lightning11wins f447c19
Move file name and file path macros to obj.h to promote future reuse.
Lightning11wins 9449178
Refactor attribute name lists to improve reusability and extendability.
Lightning11wins fd84d79
Revisit attributes offered by the cluster driver.
Lightning11wins 00c07f1
Improve the cluster driver testcase to test the default attribute lists.
Lightning11wins b540a38
Improve driver with testing.
Lightning11wins e27308e
Update an example on OSDriver_Authoring.md because the sybase driver …
Lightning11wins bda9eac
Improve structs.
Lightning11wins 62eebee
Modify min_improvement field in the driver schema to only accept doub…
Lightning11wins d6e9d98
Fix a bug in the new mssError() implementation.
Lightning11wins 71b3c60
Clean up.
Lightning11wins fa048eb
Rename Key fields to CacheKey to reduce confusion with the Keys fetch…
Lightning11wins 99f0a29
Clean up and updates.
Lightning11wins 33d6150
Fix bugs and clean up.
Lightning11wins b2e0f35
Address Greptile comments.
Lightning11wins 326a475
Update error handling in exp_fn_metaphone().
Lightning11wins 06f07b9
Modify check function syntax.
Lightning11wins 0c8f5cc
Overhaul error checking in double_metaphone.c.
Lightning11wins 011d952
Fix cluster driver not setting freed computed data to NULL.
Lightning11wins 8506a39
Clean up some typos and bugs.
Lightning11wins f1e6bfe
Fix bug from Greptile.
Lightning11wins e705a68
Improve string buffer handling in exp_functions.c.
Lightning11wins 9a88787
Expand cluster driver test case.
Lightning11wins dfbd6c4
Merge branch 'refs/heads/master' into dups
Lightning11wins 07129f8
Fix expect.
Lightning11wins cc85e7b
Merge branch 'refs/heads/fix-expect' into dups
Lightning11wins b369ee4
Add code to use expect.h features.
Lightning11wins 5ea6848
Fix Greptile comments.
Lightning11wins 0a96f95
Fix Greptile comments.
Lightning11wins 44df6fc
Fix the fix because the previous fix was badly designed.
Lightning11wins a041dca
Merge branch 'refs/heads/fix-expect' into dups
Lightning11wins ef1107b
Update and reduce returns in the cluster driver.
Lightning11wins de74a99
Add recursion checks and update recursion error messages.
Lightning11wins f0844a0
Include "cxlib/mtsession.h" everywhere mssError() is used.
Lightning11wins c5b0607
Add line numbers to mssError().
Lightning11wins e9a166b
Merge branch 'include-msserror' into dups
Lightning11wins 4dafa05
Include "cxlib/mtsession.h" in objdrv_shell.c. (Missed before because…
Lightning11wins 11d6b77
Merge branch 'include-msserror' into dups
Lightning11wins
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -65,3 +65,4 @@ perf.data.old | |
| .idea/ | ||
| .vscode/ | ||
| centrallix-os/tmp/* | ||
| centrallix-os/datasets/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,161 @@ | ||
| #ifndef CLUSTERS_H | ||
| #define CLUSTERS_H | ||
|
|
||
| /************************************************************************/ | ||
| /* Centrallix Application Server System */ | ||
| /* Centrallix Core */ | ||
| /* */ | ||
| /* Copyright (C) 1998-2026 LightSys Technology Services, Inc. */ | ||
| /* */ | ||
| /* This program is free software; you can redistribute it and/or modify */ | ||
| /* it under the terms of the GNU General Public License as published by */ | ||
| /* the Free Software Foundation; either version 2 of the License, or */ | ||
| /* (at your option) any later version. */ | ||
| /* */ | ||
| /* This program is distributed in the hope that it will be useful, */ | ||
| /* but WITHOUT ANY WARRANTY; without even the implied warranty of */ | ||
| /* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the */ | ||
| /* GNU General Public License for more details. */ | ||
| /* */ | ||
| /* You should have received a copy of the GNU General Public License */ | ||
| /* along with this program; if not, write to the Free Software */ | ||
| /* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA */ | ||
| /* 02111-1307 USA */ | ||
| /* */ | ||
| /* A copy of the GNU General Public License has been included in this */ | ||
| /* distribution in the file "COPYING". */ | ||
| /* */ | ||
| /* Module: lib_cluster.c, lib_cluster.h */ | ||
| /* Author: Israel Fuller */ | ||
| /* Creation: September 29, 2025 */ | ||
| /* Description: Clustering library used to cluster and search data with */ | ||
| /* cosine or Levenshtein (aka. edit distance) similarity */ | ||
| /* measures. Used by the "clustering driver". */ | ||
| /* For more information on how to use this library, see */ | ||
| /* string-similarity.md in the centrallix-sysdoc folder. */ | ||
| /************************************************************************/ | ||
|
|
||
| #include <stdlib.h> | ||
| #include <stdbool.h> | ||
|
|
||
| #ifdef CXLIB_INTERNAL | ||
| #include "xarray.h" | ||
| #else | ||
| #include "cxlib/xarray.h" | ||
| #endif | ||
|
|
||
| /** This file has additional documentation in string_similarity.md. **/ | ||
|
|
||
|
|
||
| /*** This value defines the number of dimensions used for a sparse | ||
| *** vector. The higher the number, the fewer collisions will be | ||
| *** encountered when using these vectors for cosine comparisons. | ||
| *** This is also called the vector table size, if viewing the | ||
| *** vector as a hash table of character pairs. | ||
| *** | ||
| *** 2147483647 is the signed int max, and is also a prime number. | ||
| *** Using this value ensures that the longest run of 0s will not | ||
| *** cause an int underflow with the current encoding scheme. | ||
| *** | ||
| *** Unfortunately, we can't use a number this large yet because the | ||
| *** kmeans algorithm creates densely allocated centroids with | ||
| *** `CA_NUM_DIMS` dimensions, so a large number causes it to fail. | ||
| *** Thus, we use 251, the largest prime number less than 256, | ||
| *** giving us a decent balance between collision reduction and | ||
| *** kmeans centroid performance/memory overhead. | ||
| ***/ | ||
| #define CA_NUM_DIMS 251 | ||
|
|
||
| /*** The character used to create a pair with the first and last characters | ||
| *** of a string. Currently set to 96, the character just before 'a' (97) | ||
| *** in the ASCII table. | ||
| ***/ | ||
| #define CA_BOUNDARY_CHAR ((unsigned char)('a' - 1)) | ||
|
|
||
| /** Types. **/ | ||
| typedef int* pVector; /* Sparse vector. */ | ||
| typedef double* pCentroid; /* Dense centroid. */ | ||
| #define CENTROID_SIZE (CA_NUM_DIMS * sizeof(double)) | ||
|
|
||
| /*** Information about detected matching pairs. | ||
| *** | ||
| *** @param i The index into the provided data for the first element of the pair. | ||
| *** @param j The index into the provided data for the second element of the pair. | ||
| *** @param similarity A number from 0 to 1, from a similarity function, showing | ||
| *** how similar the pairs are. | ||
| ***/ | ||
| typedef struct | ||
| { | ||
| unsigned int i, j; | ||
| double similarity; | ||
| } | ||
| Pair, *pPair; | ||
|
|
||
|
|
||
| /** Edit distance function. **/ | ||
| int ca_edit_dist(const char* str1, const char* str2, const size_t str1_length, const size_t str2_length); | ||
|
|
||
| /** Vector functions. **/ | ||
| pVector ca_build_vector(const char* str); | ||
| unsigned int ca_sparse_len(const pVector vector); | ||
| void ca_print_vector(const pVector vector); | ||
| void ca_free_vector(pVector sparse_vector); | ||
|
|
||
| /** k-means function. **/ | ||
| int ca_kmeans( | ||
| pVector* vectors, | ||
| const unsigned int num_vectors, | ||
| const unsigned int num_clusters, | ||
| const unsigned int max_iter, | ||
| const double min_improvement, | ||
| unsigned int* labels, | ||
| double* vector_sims, | ||
| bool auto_seed); | ||
|
|
||
| /** Vector helper macros. **/ | ||
| #define ca_is_empty(vector) (vector[0] == -CA_NUM_DIMS) | ||
| /*** Note: Given that CA_NUM_DIMS == 251, ca_build_vector("") will give the | ||
| *** vector we check for in the ca_has_no_pairs() macro, [-172, 11, -78], | ||
| *** which has a single pair of boundary characters. | ||
| *** If CA_NUM_DIMS is modified, this macro will need to be updated, hence the | ||
| *** compiler directive causing it to be undefined in this case, likely leading | ||
| *** to a lot of compiler or linker issues to remind the developer about this. | ||
| ***/ | ||
| #if CA_NUM_DIMS == 251 | ||
| #define ca_has_no_pairs(vector) \ | ||
| ({ \ | ||
| __typeof__ (vector) _v = (vector); \ | ||
| _v[0] == -172 && _v[1] == 11 && _v[2] == -78; \ | ||
| }) | ||
| #endif | ||
|
|
||
| /** Comparison functions (see ca_search()). **/ | ||
| double ca_cos_compare(void* v1, void* v2); | ||
| double ca_lev_compare(void* str1, void* str2); | ||
| bool ca_eql(pVector v1, pVector v2); | ||
|
|
||
| /** Similarity search functions. **/ | ||
| void* ca_most_similar( | ||
| void* target, | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const double (*similarity)(void*, void*), | ||
| const double threshold); | ||
| pXArray ca_sliding_search( | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const unsigned int window_size, | ||
| const double (*similarity)(void*, void*), | ||
| const double threshold, | ||
| pXArray maybe_pairs); | ||
| pXArray ca_complete_search( | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const double (*similarity)(void*, void*), | ||
| const double threshold, | ||
| pXArray maybe_pairs); | ||
|
|
||
| /** Module management functions. **/ | ||
| void ca_init(void); | ||
|
|
||
| #endif /* End of .h file. */ | ||
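
The comments in the header above suggest a run-length encoding for sparse vectors: `ca_is_empty()` checks for a single token equal to `-CA_NUM_DIMS`, and the no-pairs vector `[-172, 11, -78]` covers 172 + 1 + 78 = 251 dimensions. The following is a minimal sketch of expanding such a vector into a dense array, assuming that a negative token is a run of zeros and a positive token is a single value. The names `sketch_expand` and `SKETCH_NUM_DIMS` are hypothetical and not part of the library, and the library's own parsing (for example the `ca_parse_vector_token()` function named in the commit list) may work differently.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors CA_NUM_DIMS from the header above (hypothetical local name). */
#define SKETCH_NUM_DIMS 251

/*** Expand a run-length sparse vector into a dense array.
 *** Assumed encoding (inferred from the header comments, not confirmed):
 ***   - a negative token -n stands for n consecutive zeros;
 ***   - a positive token is a value occupying exactly one dimension.
 *** The caller must pass a well-formed vector; this sketch reads tokens
 *** until SKETCH_NUM_DIMS dimensions have been filled.
 ***
 *** @returns The number of tokens consumed, or 0 if a token is malformed.
 ***/
size_t sketch_expand(const int* sparse, int dense[SKETCH_NUM_DIMS])
    {
    size_t tok = 0;
    int dim = 0;

    assert(sparse != NULL && dense != NULL);

    while (dim < SKETCH_NUM_DIMS)
	{
	const int t = sparse[tok++];
	if (t < 0)
	    {
	    /* Reject a run of zeros that would overshoot the table. */
	    if (t < dim - SKETCH_NUM_DIMS) return 0;
	    for (int k = 0; k < -t; k++) dense[dim++] = 0;
	    }
	else if (t > 0)
	    {
	    dense[dim++] = t;
	    }
	else
	    {
	    /* A literal 0 token is treated as malformed in this sketch. */
	    return 0;
	    }
	}

    return tok;
    }
```

Under these assumptions, expanding `[-172, 11, -78]` consumes three tokens and leaves the value 11 at index 172 with zeros everywhere else, while the empty vector `[-251]` consumes one token and yields all zeros.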
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| #ifndef GLYPH_H | ||
| #define GLYPH_H | ||
|
|
||
| /************************************************************************/ | ||
| /* Centrallix Application Server System */ | ||
| /* Centrallix Core */ | ||
| /* */ | ||
| /* Copyright (C) 1998-2026 LightSys Technology Services, Inc. */ | ||
| /* */ | ||
| /* This program is free software; you can redistribute it and/or modify */ | ||
| /* it under the terms of the GNU General Public License as published by */ | ||
| /* the Free Software Foundation; either version 2 of the License, or */ | ||
| /* (at your option) any later version. */ | ||
| /* */ | ||
| /* This program is distributed in the hope that it will be useful, */ | ||
| /* but WITHOUT ANY WARRANTY; without even the implied warranty of */ | ||
| /* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the */ | ||
| /* GNU General Public License for more details. */ | ||
| /* */ | ||
| /* You should have received a copy of the GNU General Public License */ | ||
| /* along with this program; if not, write to the Free Software */ | ||
| /* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA */ | ||
| /* 02111-1307 USA */ | ||
| /* */ | ||
| /* A copy of the GNU General Public License has been included in this */ | ||
| /* distribution in the file "COPYING". */ | ||
| /* */ | ||
| /* Module: glyph.h */ | ||
| /* Author: Israel Fuller */ | ||
| /* Creation: October 27, 2025 */ | ||
| /* Description: A simple debug visualizer to make pretty patterns in */ | ||
| /* the developer's terminal, which can be surprisingly */ | ||
| /* useful for debugging algorithms. */ | ||
| /************************************************************************/ | ||
|
|
||
| #include <stdlib.h> | ||
|
|
||
| /** Uncomment to activate glyphs. **/ | ||
| /** Should not be enabled in production code on the master branch. */ | ||
| // #define ENABLE_GLYPHS | ||
|
|
||
| #ifdef ENABLE_GLYPHS | ||
| #define glyph_print(s) printf("%s", s); | ||
|
|
||
| /*** Initialize a simple debug visualizer to make pretty patterns in the | ||
| *** developer's terminal. Great for when you need to run a long task and | ||
| *** want a super simple way to make sure it's still working. | ||
| *** | ||
| *** @attention - Relies on storing data in variables in scope, so calling | ||
| *** glyph() requires a call to glyph_init() previously in the same scope. | ||
| *** | ||
| *** @param name The symbol name of the visualizer. | ||
| *** @param str The string printed for the visualization. | ||
| *** @param interval The number of invocations of glyph() required to print. | ||
| *** @param flush Whether to flush on output. | ||
| ***/ | ||
| #define glyph_init(name, str, interval, flush) \ | ||
| const char* vis_##name##_str = str; \ | ||
| const unsigned int vis_##name##_interval = interval; \ | ||
| const bool vis_##name##_flush = flush; \ | ||
| unsigned int vis_##name##_i = 0u; | ||
|
|
||
| /*** Invoke a visualizer. | ||
| *** | ||
| *** @param name The name of the visualizer to invoke. | ||
| ***/ | ||
| #define glyph(name) \ | ||
| if (++vis_##name##_i % vis_##name##_interval == 0) \ | ||
| { \ | ||
| glyph_print(vis_##name##_str); \ | ||
| if (vis_##name##_flush) fflush(stdout); \ | ||
| } | ||
| #else | ||
| #define glyph_print(str) | ||
| #define glyph_init(name, str, interval, flush) | ||
| #define glyph(name) | ||
| #endif | ||
|
|
||
| #endif /* End of .h file. */ |
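
A short usage sketch for the glyph macros above. To stay self-contained it repeats the `ENABLE_GLYPHS` variants of the macros rather than including `glyph.h`, and it includes `<stdio.h>` and `<stdbool.h>` because the macros expand to `printf()`, `fflush()`, and `bool`. The function name `sketch_long_task` is hypothetical, and its return value reads the counter variables that `glyph_init()` declares, purely so the behavior can be checked.

```c
#include <stdbool.h>
#include <stdio.h>

/* Copies of the enabled macro variants from the header above. */
#define glyph_print(s) printf("%s", s);
#define glyph_init(name, str, interval, flush) \
    const char* vis_##name##_str = str; \
    const unsigned int vis_##name##_interval = interval; \
    const bool vis_##name##_flush = flush; \
    unsigned int vis_##name##_i = 0u;
#define glyph(name) \
    if (++vis_##name##_i % vis_##name##_interval == 0) \
	{ \
	glyph_print(vis_##name##_str); \
	if (vis_##name##_flush) fflush(stdout); \
	}

/*** Run a pretend long task, printing one "." for every 10 steps.
 ***
 *** @param steps The number of loop iterations to simulate.
 *** @returns The number of glyphs printed during the run.
 ***/
unsigned int sketch_long_task(const unsigned int steps)
    {
    /* Must appear in the same scope as the glyph() calls below. */
    glyph_init(progress, ".", 10, false);

    for (unsigned int step = 0u; step < steps; step++)
	{
	/* One unit of real work would go here. */
	glyph(progress);
	}

    /* The counter is visible here because glyph_init() declared it. */
    return vis_progress_i / vis_progress_interval;
    }
```

Calling `sketch_long_task(35)` prints three dots, one each at steps 10, 20, and 30. Because `glyph()` expands to a bare `if` statement, wrapping each call site in braces (as the loop body does here) avoids surprises next to an `else`.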