Skip to content

Add LRU to improve the memory usage of the indexer#2050

Open
Lucccyo wants to merge 20 commits into
ocaml:mainfrom
Lucccyo:lru_cache_index
Open

Add LRU to improve the memory usage of the indexer#2050
Lucccyo wants to merge 20 commits into
ocaml:mainfrom
Lucccyo:lru_cache_index

Conversation

@Lucccyo
Copy link
Copy Markdown
Contributor

@Lucccyo Lucccyo commented Mar 24, 2026

The index links are stored in a Hashtable, which grows without bound as we add new entries. This is using a lot of memory.
This PR replaces the use of a Hashtable with an LRU cache to bound memory usage, keeping only the most recently used indices and evicting the least recently used entry when the capacity is reached.

Comment thread src/index-format/dbllist.ml Outdated
Comment thread src/index-format/lru.ml Outdated
Comment thread src/index-format/lru.ml Outdated
Comment thread src/index-format/lru.ml Outdated
Comment thread src/index-format/dbllist.ml Outdated
Comment thread src/index-format/dbllist.ml Outdated
@voodoos
Copy link
Copy Markdown
Collaborator

voodoos commented Mar 24, 2026

(I did not review the tricky pointer arithmetic of the dllist yet)

Comment thread src/index-format/granular_marshal.ml Outdated
| In_memory v -> write_child lnk schema v size ~placeholders ~restore
| On_disk _ ->
write_child lnk schema (fetch lnk) size ~placeholders ~restore)
| In_memory v | In_memory_c (v, _) ->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if your tests ran into the In_memory_c case here? The current code is correct, but if this does happen in practice then In_memory_c would benefit from being translated into a pointed-index On_disk_ptr directly (since an In_memory_c is an in-memory value that was read from an existing file)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My test doesn't hit the In_memory_c case here.

Comment thread src/index-format/granular_marshal.ml Outdated
in
List.iter
(fun (Cached (link, loc, store, schema)) ->
Cache.remove store.cache loc;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remove doesn't look entirely correct, since we are loosing sharing as we never reintroduce lnk into the cache later if it turns out that offset was still useful (so we'll allocate new links for that same value). Does your benchmarks show that this remove is necessary? :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree.
On the benchmarks, removing the Cache.remove multiplies the max memory usage by around 1.1.

@Lucccyo Lucccyo force-pushed the lru_cache_index branch 2 times, most recently from 7f47658 to 2e97c18 Compare April 20, 2026 14:01
Comment thread src/index-format/granular_marshal.ml Outdated
module Cache = Hashtbl.Make (Int)

type store = { filename : string; cache : any_link Cache.t }
type store = { filename : string; id : int; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks wrong to remove this cache, it's pretty essential :)

Comment thread src/index-format/granular_marshal.ml Outdated
Comment on lines +130 to +139
| Serialized { loc } ->
(match Cache.find_opt store.cache loc with
| Some (Link (type b) ((existing_lnk, type_id') : b link * _)) ->
(match Type.Id.provably_equal type_id type_id' with
| Some (Equal : (a link, b link) Type.eq) ->
lnk := !existing_lnk
| None -> lnk := On_disk { store; loc; schema })
| None ->
Cache.add store.cache loc (Link (lnk, type_id));
lnk := On_disk { store; loc; schema })
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@art-w What do you think? The cache keeps a trace (loc to link) of the serialized links.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it looks similar to the previous code :) (which we shouldn't need to update, so to keep the diff short it's best not to introduce minor differences)

Copy link
Copy Markdown
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! I have some minor requests to clean up the changes :)

Comment thread src/index-format/granular_marshal.ml Outdated
| Some (Link (type b) ((existing_lnk, type_id') : b link * _)) ->
(match Type.Id.provably_equal type_id type_id' with
| Some (Equal : (a link, b link) Type.eq) ->
lnk := !existing_lnk
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misspoke the other day when I assumed you could remove all the Duplicate, this is a case where we are potentially leaking if we do not use them (the existing_lnk is tracked and released by the LRU, but the lnk copy isn't)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I kept the Duplicate, but during my tests, they were never used.

Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
match !lnk with
| Small v ->
schema iter v;
child_smalls:= (Value (Obj.magic v)) :: !child_smalls;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run dune fmt, and also get rid of the magic now? :)

Comment thread src/index-format/granular_map.ml Outdated
Comment thread src/index-format/dbllist.ml
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment on lines +130 to +139
| Serialized { loc } ->
(match Cache.find_opt store.cache loc with
| Some (Link (type b) ((existing_lnk, type_id') : b link * _)) ->
(match Type.Id.provably_equal type_id type_id' with
| Some (Equal : (a link, b link) Type.eq) ->
lnk := !existing_lnk
| None -> lnk := On_disk { store; loc; schema })
| None ->
Cache.add store.cache loc (Link (lnk, type_id));
lnk := On_disk { store; loc; schema })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it looks similar to the previous code :) (which we shouldn't need to update, so to keep the diff short it's best not to introduce minor differences)

Comment thread src/index-format/granular_marshal.ml Outdated
val add_front : 'a t -> 'a * int -> 'a cell
val discard_size : 'a t -> int -> 'a list
val promote : 'a t -> 'a cell -> unit
val get : 'a cell -> 'a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is get really used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is, to return the value contained in the cell when we are fetching an In_cache value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really?

@Lucccyo
Copy link
Copy Markdown
Contributor Author

Lucccyo commented Apr 22, 2026

@art-w I don't know how Duplicate should behave in the write`.
What we could want is to call ' write ' recursively on the original link, right?

@Lucccyo
Copy link
Copy Markdown
Contributor Author

Lucccyo commented Apr 22, 2026

I pushed the review's resolution in a separate commit to make it easier to see the difference between before and after. They are going to be squashed and reorganized at the end.

Comment thread src/index-format/granular_marshal.ml Outdated
| In_cache (_v, t, _children) ->
let Cached (_, loc, {filename; id; _}, _) = t.content in
lnk := On_disk_ptr { filename; id; loc }
| _ -> failwith "Granular_marshal.write: duplicate unexpected state")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Duplicate, does the original code with write_child_reused work? The only new case should be In_cache, which is equivalent to an On_disk. (I believe it shouldn't be possible to have Duplicate (Small ... | In_memory ...), but let me know if that's not the case?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest going backward and reintroducing the Duplicate and *_reused, since they are linked, and we cannot remove *_reused while keeping the Duplicate.

(match !parent with
| In_cache (_, _, small_poses) ->
let Value v = small_poses.(pos) in
Obj.magic v
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@art-w How can we return properly v without the Obj.magic since it is existentially bound, and the typer rejects the case where we annotated v with type a like:

let (Value (v : a)) = small_poses.(pos) in

with the error on the part v : a:

File "src/index-format/granular_marshal.ml", line 192, characters 19-26:
192 |         let (Value (v : a)) = small_poses.(pos) in
                         ^^^^^^^
Error: This pattern matches values of type a
       but a pattern was expected which matches values of type $a1
       Hint: $a1 is an existential type bound by the constructor Value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part where we use a Type.Id.t to convince the type-checker that the value has indeed the right type :)

@Lucccyo Lucccyo force-pushed the lru_cache_index branch from 33a4290 to 9777008 Compare May 18, 2026 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants