Reimplement ETS to support bag and duplicate_bag#2157
Reimplement ETS to support bag and duplicate_bag#2157mfurga wants to merge 8 commits intoatomvm:mainfrom
Conversation
There was a problem hiding this comment.
Before reviewing the code, I will focus on issues that CI has reported.
-
DCO found sign off is missing, each commit should have at the end "Signed-off-by: Author Name authoremail@example.com". You can fix this with interactive rebase +
git commit --amend --signoff -
There are some minior formatting issues:
clang-format -i src/libAtomVM/ets.cis enough for fixing them on the C part.
erlfmt(https://github.com/WhatsApp/erlfmt) should be used on erlang sources: [warn] ./tests/erlang_tests/test_ets.erl -
You might fix 1 + 2 + 3 together, suggested workflow:
a) git rebase -i main
b) change all commits from pick to edit:
edit c8f87f91 # Add tests for insert_new, bag, and duplicate_bag
edit aa471c07 # Replace ets_hashtable with ets_multimap supporting set, bag, and duplicate_bag
[...]
c) run clang-format and erlfmt (with rebar3) on each updated file, so you clean up history from white space errors and we keep diffs clean.
d) git add -p && git commit --amend --signoff. This allows you also to update commit descriptions etc...
If I may add another advice, I really appreciate how you kept a clean history, but I think we can streamline it a little bit. So before starting with step a) you might consider an initial git rebase -i main:
Each tests + implementation commit pair can be squashed together, into a single commit that contain both tests and implementation. This kind of approach has great benefits if anytime we do a git bisect or we get back in history for tracking any kind of issue, since by doing this ./tests/test-erlang should be always able to pass at each step in git history.
You can do this by using git rebase -i main and by doing a similar rebase plan:
pick c8f87f91 # Add tests for insert_new, bag, and duplicate_bag
pick aa471c07 # Replace ets_hashtable with ets_multimap supporting set, bag, and duplicate_bag
pick 456c0b39 # Add tests for member/2
squash f8d9ac97 # Implement member/2
pick b482dd2d # Add tests for take/2
squash 0291720c # Implement take/2
pick 4d371a40 # Add tests for update_element
squash 995ba867 # Implement update_element
This keeps everything divided in a number of steps, but fewer steps.
So briefly:
First squash together commits and streamline history, after run again git rebase and edit each commit for sign off and white space errors.
If you want to know more about our git usage & style guidelines, you can take a look here: https://github.com/bettio/AtomVM-fork/blob/2a94fa0bafb12826ed20a5e0ebeaed21db9dd37f/GIT_WORKFLOW_AND_HISTORY_STYLE_GUIDE.md
- Tests are failing against all OTP version <= 26:
{undef,[{ets,update_element,
[#Ref<0.2858588063.3122528257.226650>,key_not_exist,
{2,new_value1},
{key,value1,value2}],
[]},
{test_ets,test_update_element,0,
[{file,"/__w/AtomVM/AtomVM/tests/erlang_tests/test_ets.erl"},
{line,394}]},
{test_ets,'-isolated/1-fun-0-',3,
[{file,"/__w/AtomVM/AtomVM/tests/erlang_tests/test_ets.erl"},
{line,753}]}]}
You can use something similar:
get_otp_version() ->
case erlang:system_info(machine) of
"BEAM" -> list_to_integer(erlang:system_info(otp_release));
_ -> atomvm
end.and then something like Version when Version =:= atomvm orelse (is_integer(Version) andalso Version >= 24).
You should apply this when testing a feature that appeared only in a recent OTP version.
- GCC 12 is reporting the following error:
/home/runner/work/AtomVM/AtomVM/src/libAtomVM/ets_multimap.c: In function ‘ets_multimap_remove_tuple’:
/home/runner/work/AtomVM/AtomVM/src/libAtomVM/ets_multimap.c:299:21: error: pointer ‘to_remove’ may be used after ‘realloc’ [-Werror=use-after-free]
299 | free(to_remove);
| ^~~~~~~~~~~~~~~
/home/runner/work/AtomVM/AtomVM/src/libAtomVM/ets_multimap.c:297:52: note: call to ‘realloc’ here
297 | EtsMultimapEntry **new_to_remove = realloc(to_remove, sizeof(EtsMultimapEntry *) * capacity);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
This is a know issue, you can fix with:
// GCC 12 is raising here a false positive warning, according to man realloc:
// "If realloc() fails, the original block is left untouched; it is not freed or moved."
#pragma GCC diagnostic push
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ == 12
#pragma GCC diagnostic ignored "-Wuse-after-free"
#endif
free(to_remove);
#pragma GCC diagnostic pop/__w/AtomVM/AtomVM/src/libAtomVM/ets.c: In function 'ets_delete_table':
/__w/AtomVM/AtomVM/src/libAtomVM/ets.c:517:21: error: statement with no effect [-Werror=unused-value]
517 | synclist_wrlock(&ctx->global->ets.ets_tables);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/__w/AtomVM/AtomVM/src/libAtomVM/synclist.h:128:31: note: in definition of macro 'synclist_wrlock'
128 | #define synclist_wrlock(list) list
| ^~~~
This happens because the general pattern is like:
struct ListHead *remote_monitors = synclist_wrlock(&conn_obj->remote_monitors);Those functions are defined like:
ListHead *synclist_wrlock(struct SyncList *synclist);So you should at least do:
ListHead *locked_list = synclist_wrlock(...);
UNUSED(locked_list)or maybe (void) synclist_wrlock(...); should be enough.
Still this pattern is a bit suspicious, so I suggest to check if it is 100% legit or if there is any code smell around.
bettio
left a comment
There was a problem hiding this comment.
I have a first round of comments.
| RAISE_ERROR(OUT_OF_MEMORY_ATOM); | ||
| default: | ||
| // unreachable | ||
| AVM_ABORT(); |
There was a problem hiding this comment.
The current nifs for ets use AVM_ABORT() instead of UNREACHABLE(); isn't abort a better fit here?
| case EtsOverflow: | ||
| RAISE_ERROR(OVERFLOW_ATOM); | ||
| default: | ||
| UNREACHABLE(); |
src/libAtomVM/nifs.c
Outdated
| avm_int_t index = term_to_int(pos) - 1; | ||
| if (UNLIKELY(index < 0)) { | ||
| RAISE_ERROR(BADARG_ATOM); | ||
| } |
There was a problem hiding this comment.
term_is_pos_int applies also here.
src/libAtomVM/ets_multimap.h
Outdated
| extern "C" { | ||
| #endif | ||
|
|
||
| #define NUM_BUCKETS 16 |
There was a problem hiding this comment.
Let's prefix this, to avoid any name clash: ETS_MULTIMAP_NUM_BUCKETS
src/libAtomVM/ets_multimap_hash.h
Outdated
| // some large (close to 2^24) primes taken from | ||
| // http://compoasso.free.fr/primelistweb/page/prime/liste_online_en.php | ||
|
|
||
| #ifndef _ETS_MULTIMAP_HASH_H_ |
There was a problem hiding this comment.
Implementing static functions inside a header may cause code duplication: every time this header is included same code is compiled again as a private copy.
There was a problem hiding this comment.
I deleted ets_multimap_hash.h and moved the hashing impl into ets_multimap.c since it's only used there.
|
How to address this issues: you might use |
|
Codex claims this single issue:
So if nothing else add test, so we are confident behaviour is correct. % [] with Default object
S11 = new_table({key, 10, 20, not_number}),
[] = ets:update_counter(S11, key_not_exist2, [], {key, 10, 20, 30}),
[{key, 10, 20, not_number}] = ets:lookup(S11, key),
[{key_not_exist2, 10, 20, 30}] = ets:lookup(S11, key_not_exist2), |
| return NULL; | ||
| } | ||
|
|
||
| Heap *heap = malloc(sizeof(Heap)); |
There was a problem hiding this comment.
I don't think heap should be malloc-ated, I think it should be rather embedded in EtsMultimapEntry, this should both simplify code and increase efficiency.
I don't know how extensive is this change, feel free to decide whether making a new commit or fixing up the commit that introduced ets_multimap.c.
There is an additional improvement we can do here, that is embedding the HeapFragment that holds the storage for that heap. But this specific improvement is beyond the scope of a simple PR and we can just ignore it now and leave a TODO.
Another, more radical, but interesting option would be completely removing EtsMultimapEntry.
typedef struct EtsMultimapNode
{
Heap heap; // common heap for all entries, memory_heap_append_heap can be used when adding entries.
struct EtsMultimapNode *next;
term *entries; //this is a growable array, that is doubled or grown with other strategies. realloc can be used for this purpose.
} EtsMultimapNode;This would remove the need for malloc() in lookup when returning entries, since they are all available in an array that can be safely returned since table is locked.
Removing an entry means just setting that entry in entries to term_nil(), and once in a while garbage collection might be executed.
Also this approach can be super efficient with set
Still iterating here, the first entry might be stored inside the node itself, avoiding allocation overhead for set and any single-item-nodes.
Anyway let's discuss this.
There was a problem hiding this comment.
One more advanced topic: actually we don't even need to keep the Heap struct (and the related bits): the Heap struct is useful when allocating terms, but once this has been done, in our ets case I think the heap can be sealed and we can just throw away the heap.
There was a problem hiding this comment.
I think we can drop the Heap entirely and use a FAM allocation, similar to how it's done in the mailbox:
typedef struct EtsMultimapEntry {
struct EtsMultimapEntry *next;
term tuple;
term storage[]; // storage[0] = mso_list, storage[1...] = term data
} EtsMultimapEntry;Allocation would look like:
entry = malloc(sizeof(EtsMultimapEntry) + (memory_estimate_usage(tuple) + 1) * sizeof(term));
entry->tuple = memory_copy_term_tree_to_storage(entry->storage, &heap_end, tuple);This reduces per-entry allocations from 3 to 1, without changing much of the current implementation.
The array approach removes the malloc in lookup but deletion is harder. With a shared heap we can't free memory for a single removed entry. GC would require allocating a new heap, copying all live terms into it and destroying the old one. The same problem applies to insert_revert().
Also lookup callers would need to skip deleted entries (entries with term_nil) which is a bit awkward.
Not sure if the array approach is better here.
…icate_bag Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
Signed-off-by: Mateusz Furga <mateusz.furga@swmansion.com>
|
Amp delivered this review: All caveats apply - and I'm fully pragmatic about llm being both wrong and pedantic - so pick and choose, and let's land this sooner rather than later, even if there is an todo issue - with further hardening.. PR #2157 Code Review — ETS bag/duplicate_bag & New FunctionsBranch: Commits Reviewed (8)
Files Changed
🔴 Must-Fix (Blockers)1. SMP Deadlock in
|
This PR introduces the following changes:
set,bag, andduplicate_bagtable typesinsert_new/2,member/2,take/2,update_element/3,update_element/4,lookup_element/4, anddelete_object/2nifsMultimap details:
A multimap can be one of the following types:
EtsMultimapTypeSingle- each node contains only a single entry, which is basically a regular hashmap. This type is used forsettables.EtsMultimapTypeSet- a node can contain multiple entries (no duplicates). This is used forbagtables.EtsMultimapTypeList- a node can contain multiple entries (duplicates allowed). This is used forduplicate_bagtables.Collisions are handled using chaining (each node has a
nextpointer to the next one). Rehashing is not supported at the moment (thebucketsarray has a fixed size), so lookup time isn't guaranteed to be constant when there are lots of elements.Each node must contain at least one entry and always corresponds to exactly one key. If all entries are removed, the node itself is removed as well. All entries in a node have the same key. The key for a node is taken from the first entry and calculated by
node_key()asnode->entries->tuple[multimap->key_index]. All tuples stored in the multimap use the same index to indicate the key position in the tuple.New entries are always inserted at the head of the node's entry list. Because allocation can fail during insertion, any entries added so far have to be rolled back. This is handled by
insert_revert(), which iterates over all entries in the multimap, compares them by pointer, and removes the matching ones.insert_revert()cannot fail. Example for theEtsMultimapTypeSettype:A multimap of type
EtsMultimapTypeListbehaves exactly likeEtsMultimapTypeSet, except that duplicate tuples are allowed.For
EtsMultimapTypeSingle, similarly toEtsMultimapTypeSet, all new entries are first inserted at the head of the list. If insertion finishes without errors, the multimap is converted to typeEtsMultimapTypeSingleusingmultimap_to_single(), which keeps only the first entry per node and removes the rest. If an error occurs,insert_revert()is called.multimap_to_single()cannot fail. Example for theEtsMultimapTypeSingletype:These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).
SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later