Add LRU to improve the memory usage of the indexer#2050
Conversation
|
(I did not review the tricky pointer arithmetic of the dllist yet) |
5bf10b0 to
4f75584
Compare
| | In_memory v -> write_child lnk schema v size ~placeholders ~restore | ||
| | On_disk _ -> | ||
| write_child lnk schema (fetch lnk) size ~placeholders ~restore) | ||
| | In_memory v | In_memory_c (v, _) -> |
There was a problem hiding this comment.
I'm curious if your tests ran into the In_memory_c case here? The current code is correct, but if this does happen in practice then In_memory_c would benefit from being translated into a pointed-index On_disk_ptr directly (since an In_memory_c is an in-memory value that was read from an existing file)
There was a problem hiding this comment.
My test doesn't hit the In_memory_c case here.
| in | ||
| List.iter | ||
| (fun (Cached (link, loc, store, schema)) -> | ||
| Cache.remove store.cache loc; |
There was a problem hiding this comment.
This remove doesn't look entirely correct, since we are loosing sharing as we never reintroduce lnk into the cache later if it turns out that offset was still useful (so we'll allocate new links for that same value). Does your benchmarks show that this remove is necessary? :)
There was a problem hiding this comment.
Yes I agree.
On the benchmarks, removing the Cache.remove multiplies the max memory usage by around 1.1.
7f47658 to
2e97c18
Compare
| module Cache = Hashtbl.Make (Int) | ||
|
|
||
| type store = { filename : string; cache : any_link Cache.t } | ||
| type store = { filename : string; id : int; } |
There was a problem hiding this comment.
It looks wrong to remove this cache, it's pretty essential :)
2e97c18 to
516f769
Compare
| | Serialized { loc } -> | ||
| (match Cache.find_opt store.cache loc with | ||
| | Some (Link (type b) ((existing_lnk, type_id') : b link * _)) -> | ||
| (match Type.Id.provably_equal type_id type_id' with | ||
| | Some (Equal : (a link, b link) Type.eq) -> | ||
| lnk := !existing_lnk | ||
| | None -> lnk := On_disk { store; loc; schema }) | ||
| | None -> | ||
| Cache.add store.cache loc (Link (lnk, type_id)); | ||
| lnk := On_disk { store; loc; schema }) |
There was a problem hiding this comment.
@art-w What do you think? The cache keeps a trace (loc to link) of the serialized links.
There was a problem hiding this comment.
yes it looks similar to the previous code :) (which we shouldn't need to update, so to keep the diff short it's best not to introduce minor differences)
art-w
left a comment
There was a problem hiding this comment.
Almost there! I have some minor requests to clean up the changes :)
| | Some (Link (type b) ((existing_lnk, type_id') : b link * _)) -> | ||
| (match Type.Id.provably_equal type_id type_id' with | ||
| | Some (Equal : (a link, b link) Type.eq) -> | ||
| lnk := !existing_lnk |
There was a problem hiding this comment.
I misspoke the other day when I assumed you could remove all the Duplicate, this is a case where we are potentially leaking if we do not use them (the existing_lnk is tracked and released by the LRU, but the lnk copy isn't)
There was a problem hiding this comment.
So I kept the Duplicate, but during my tests, they were never used.
| match !lnk with | ||
| | Small v -> | ||
| schema iter v; | ||
| child_smalls:= (Value (Obj.magic v)) :: !child_smalls; |
There was a problem hiding this comment.
Can you run dune fmt, and also get rid of the magic now? :)
| | Serialized { loc } -> | ||
| (match Cache.find_opt store.cache loc with | ||
| | Some (Link (type b) ((existing_lnk, type_id') : b link * _)) -> | ||
| (match Type.Id.provably_equal type_id type_id' with | ||
| | Some (Equal : (a link, b link) Type.eq) -> | ||
| lnk := !existing_lnk | ||
| | None -> lnk := On_disk { store; loc; schema }) | ||
| | None -> | ||
| Cache.add store.cache loc (Link (lnk, type_id)); | ||
| lnk := On_disk { store; loc; schema }) |
There was a problem hiding this comment.
yes it looks similar to the previous code :) (which we shouldn't need to update, so to keep the diff short it's best not to introduce minor differences)
| val add_front : 'a t -> 'a * int -> 'a cell | ||
| val discard_size : 'a t -> int -> 'a list | ||
| val promote : 'a t -> 'a cell -> unit | ||
| val get : 'a cell -> 'a |
There was a problem hiding this comment.
Yes, it is, to return the value contained in the cell when we are fetching an In_cache value.
|
@art-w I don't know how |
|
I pushed the review's resolution in a separate commit to make it easier to see the difference between before and after. They are going to be squashed and reorganized at the end. |
| | In_cache (_v, t, _children) -> | ||
| let Cached (_, loc, {filename; id; _}, _) = t.content in | ||
| lnk := On_disk_ptr { filename; id; loc } | ||
| | _ -> failwith "Granular_marshal.write: duplicate unexpected state") |
There was a problem hiding this comment.
For Duplicate, does the original code with write_child_reused work? The only new case should be In_cache, which is equivalent to an On_disk. (I believe it shouldn't be possible to have Duplicate (Small ... | In_memory ...), but let me know if that's not the case?)
There was a problem hiding this comment.
I suggest going backward and reintroducing the Duplicate and *_reused, since they are linked, and we cannot remove *_reused while keeping the Duplicate.
| (match !parent with | ||
| | In_cache (_, _, small_poses) -> | ||
| let Value v = small_poses.(pos) in | ||
| Obj.magic v |
There was a problem hiding this comment.
@art-w How can we return properly v without the Obj.magic since it is existentially bound, and the typer rejects the case where we annotated v with type a like:
let (Value (v : a)) = small_poses.(pos) in
with the error on the part v : a:
File "src/index-format/granular_marshal.ml", line 192, characters 19-26:
192 | let (Value (v : a)) = small_poses.(pos) in
^^^^^^^
Error: This pattern matches values of type a
but a pattern was expected which matches values of type $a1
Hint: $a1 is an existential type bound by the constructor Value.
There was a problem hiding this comment.
This is the part where we use a Type.Id.t to convince the type-checker that the value has indeed the right type :)
c2bf2c4 to
33a4290
Compare
The index links are stored in a Hashtable, which grows without bound as we add new entries. This is using a lot of memory.
This PR replaces the use of a Hashtable with an LRU cache to bound memory usage, keeping only the most recently used indices and evicting the least recently used entry when the capacity is reached.