From 8a106a5a55f411315764f146f65124b3bbc825bc Mon Sep 17 00:00:00 2001 From: Aapo Alasuutari Date: Wed, 11 Mar 2026 23:25:10 +0200 Subject: [PATCH] feat(doc): Update engine architecture documentation --- ARCHITECTURE.md | 80 ++----- CONTRIBUTING.md | 379 +++++++++++------------------- GARBAGE_COLLECTOR.md | 191 ++++++++------- README.md | 45 ++-- nova_lint/README.md | 15 +- nova_vm/src/README.md | 17 +- nova_vm/src/ecmascript/README.md | 83 +++++-- nova_vm/src/engine/README.md | 141 ++++++++++- nova_vm/src/engine/bytecode/vm.rs | 21 +- nova_vm/src/heap/README.md | 78 +++--- tests/expectations.json | 2 +- tests/metrics.json | 2 +- tracing/README.md | 4 +- 13 files changed, 563 insertions(+), 495 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index f2f8ddfc2..3f49cb90e 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -4,59 +4,27 @@ The architecture of Nova engine is built around data-oriented design. This means that most "data" like things are found in the heap in a vector with like-minded individuals. -## Concurrency - -Currently the heap is not thread-safe at all but the plan is to make it -thread-safe enough for concurrent marking to become possible in the future. To -this end, here's how I (@aapoalas) envision the engine to look like in the -future: - -```rs -struct Agent<'agent, 'generation>(Box>, AgentGuard<'generation>); -``` - -The `Agent` struct here is just a RAII wrapper around the inner Heap-allocated -Agent data. The first important thing is the `'agent` lifetime: This is a brand. -It is valid for as long as the Nova engine instance lives, and its only purpose -is to make sure that (type-wise) uses cannot mistakenly or otherwise mix and -match Values from different engine instances. - -The second lifetime, `'generation`, is the garbage collection generation. Here -what I want to achieve is a separation between "gc" and "nogc" scopes. 
But -before we dive into that, here's what I imagine a `Value` looking like: - -```rs -enum Value<'generation> { - Undefined, - String(StringIndex<'generation>>), - SmallString(SmallString), - // ... -} -``` - -The `Value` enum carries the `'generation` lifetime: As long as we can guarantee -that no garbage collection happens, we can safely keep `Value<'gen>` on the -stack or even temporarily on the heap. - -If we call a method that may trigger GC, then all `Value<'gen>` items are -invalidated. If we want to keep values alive through eg. JavaScript function -calls, we must use: - -```rs -struct ShadowStackValue<'agent>(u32, PhantomPinned); -``` - -This just moves the `Value` onto an Agent-controlled "shadow stack" that the -`u32` points into. Due to the `PhantomPinned` the shadow stack is mostly just -push-pop as any stack should be, and thus relatively quick. But it is also on -the heap and thus garbage collection can update any references on the shadow -stack. - -Note that this is essentially equivalent to: - -```rs -struct GlobalValue<'agent>(u32); -``` - -but "global values" are not push-pop, likely will have generational indexes, -possibly will have reference counting and so on and so forth. +## ECMAScript implementation + +Nova code aims to conform fairly strictly to the +[ECMAScript specification](https://tc39.es/ecma262/) in terms of both code +layout and structure. + +For details on the ECMAScript implementation, see the +[ecmascript/README.md](./nova_vm/src/ecmascript/README.md). + +## Engine implementation + +Nova's VM is a stack-based bytecode interpreter. + +For details on the engine, see the +[engine/README.md](./nova_vm/src/engine/README.md). + +## Heap implementation + +Nova's heap is made up of a mix of normal Rust `Vec`s and custom `SoAVec` +structs that implement a "Struct of Arrays" data structure with an API +equivalent to normal `Vec`s, all referenced by-index. 
+ +For details on the heap architecture, see the +[heap/README.md](./nova_vm/src/heap/README.md). diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 122670cf2..2af1a1af4 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -13,6 +13,14 @@ said, before you being, you'll want to know a few things. More information is found below. +## Developer documentation + +See [ARCHITECTURE.md](./ARCHITECTURE.md), +[GARBAGE_COLLECTOR.md](./GARBAGE_COLLECTOR.md), +[ecmascript/README.md](./nova_vm/src/ecmascript/README.md), +[engine/README.md](./nova_vm/src/engine/README.md), and +[heap/README.md](./nova_vm/src/heap/README.md) for various details. + ## Pull Request Code of Conduct The following ground rules should be followed: @@ -30,15 +38,30 @@ Commits. Consider also using the same logic in branch names. The commonly used prefixes here are `fix`, `feat`, and `chore`. Other recommended ones like `build`, `perf`, and `refactor` are of course good as well. -Scoping is also recommended, but is not currently clearly defined. Some examples -are: +Scoping is also recommended and corresponds to the main areas of the engine. +These are: + +1. `cli`: the testing CLI, located in `nova_cli`. +1. `lint`: the custom lints, located in `nova_lint`. +1. `ecmascript`: the ECMAScript specification implementation, located in + `nova_vm/src/ecmascript`. +1. `engine`: the bytecode compiler and interpreter, located in + `nova_vm/src/engine`. +1. `heap`: the heap structure, located in `nova_vm/src/heap`. +1. `gc`: the garbage collector logic, located in `nova_vm/src/heap` as well. +1. `test262`: the Test262 runner and git submodule, located in `tests`. +1. `docs`: documentation. +1. `deps`: dependencies, listed in `Cargo.toml`. + +Some commit/PR title examples are below: 1. `feat(ecmascript)`: This is an added feature to the spec-completeness of the - engine, eg. a new abstract operation, heap data object or such. -1. `fix(heap)`: This fixes something in the heap implementation, eg. 
maybe the - heap garbage collection. -1. `feat(vm)`: This adds to the interpreter. -1. `chore(cli)`: This might bump a dependency in the `nova_cli` crate. + engine, eg. a new abstract operation, new object type, ... +1. `fix(heap)`: This fixes something in the heap implementation. +1. `feat(engine)`: This adds to the interpreter. +1. `chore(deps)`: Bump a dependency. +1. `feat(test262)`: This adds a new feature to the Test262 runner. +1. `chore(test262)`: This might update the git submodule or the Test262 results. ### Use [Conventional Comments](https://conventionalcomments.org/) @@ -61,6 +84,57 @@ By all this it goes to say: When someone gives you `praise:`, they mean it. When someone marks down an `issue:` they do not mean that your code is bad, they just mean that there's something there to improve. +### Tests in PRs + +Nova mainly tests itself using the official +[ECMAScript test262 conformance suite](https://github.com/tc39/test262). The +expected results are saved into the `tests/expectations.json` file: All PR +branches are tested to check that the test results match the expectations. + +When making changes, you'll thus often need to update or at least check the +expectations. First make sure you have the git submodule checked out: + +```sh +git submodule update --init +``` + +Then running the tests can be done using the following command: + +```sh +cargo build --profile dev-fast && cargo run --bin test262 --profile dev-fast -- -u +``` + +This will build you a "dev-fast" version of Nova, and run the Test262 +conformance suite using that executable. At the end of the run, it will record +the results in `expectations.json` and `metrics.json`. + +You can run an individual test262 test case or a set of tests using + +```sh +cargo build && cargo run --bin test262 eval-test internal/path/to/tests +``` + +Here the "internal/path/to/tests" matches a path or a subpath in the +`expectations.json`. 
As an example: + +```sh +cargo build && cargo run --bin test262 eval-test built-ins/Array/from/from-string.js +``` + +We also have some unit and integration tests using cargo's test harnesses. +Adding to these is absolutely welcome, as they enable more Miri testing etc. +These are also run on all PRs. + +#### Custom quick-shot tests + +Keep your own `test.js` in the `nova` folder and run it with + +```sh +cargo run eval test.js +``` + +This is great for quick, simple things you want to test out in isolation. + ### Align with the [ECMAScript specification](https://tc39.es/ecma262/) Nova's code and folder structure follows the ECMAScript specification as much as @@ -172,54 +246,9 @@ reorder the value getter and/or the data creation to happen after the checks. If the specification tells you to first check one parameter value (and throws an error if incorrect), then conditionally performs some JavaScript function call, and then checks another parameter value and throws an error if incorrect then -you are not allowed to reorder the second check to happen together with the +you are **not** allowed to reorder the second check to happen together with the first, as that would be an observable change. -### Tests in PRs - -Nova mainly tests itself using the official -[ECMAScript test262 conformance suite](https://github.com/tc39/test262). The -expected results are saved into the `tests/expectations.json` file: All PR -branches are tested to check that the test results match the expectations. - -When making changes, you'll thus often need to update or at least check the -expectations. You can do this with the following command: - -```sh -cargo build --profile dev-fast && cargo run --bin test262 --profile dev-fast -- -u -``` - -This will build you a "dev-fast" version of Nova, and run the test262 -conformance using that executable. At the end of the run, it will record the -results `expectations.json`. 
- -You can run an individual test262 test case or a set of tests using - -```sh -cargo build && cargo run --bin test262 eval-test internal/path/to/tests -``` - -Here the "internal/path/to/tests" matches a path or a subpath in the -`expectations.json`. As an example: - -```sh -cargo build && cargo run --bin test262 eval-test built-ins/Array/from/from-string.js -``` - -We also have some unit and integration test around using cargo's test harnesses. -Adding to these is absolutely welcome, as they enable more Miri testing etc. -These are also run on all PRs. - -#### Custom quick-shot tests - -Keep your own `test.js` in the `nova` folder and run it with - -```sh -cargo run eval test.js -``` - -This is great for quick, simple things you want to test out in isolation. - ### Performance considerations The engine is not at the point where performance is a big consideration. That @@ -234,48 +263,60 @@ supremely optimal since we cannot prove it one way or the other. ## What are all the `bind(gc.nogc())`, `unbind()`, `scope(agent, gc.nogc())` and other those calls? Those are part of the garbage collector. See the -[contributor's guide to the garbage collector](https://github.com/trynova/nova/blob/main/GARBAGE_COLLECTOR.md) for -details. +[contributor's guide to the garbage collector](https://github.com/trynova/nova/blob/main/GARBAGE_COLLECTOR.md) +for details. ## List of active development ideas Here are some good ideas on what you can contribute to. -### Internal methods of exotic objects - -ECMAScript spec has a ton of exotic objects. Most of these just have some extra -internal slots while others change how they interact with user actions like -get-by-identifier or get-by-value etc. - -You can easily find exotic objects' internal methods by searching for -`"fn internal_get_prototype_of("` in the code base. Many of these matches will -be in files that contain a lot of `todo!()` points. 
As an example, -[proxy.rs](./nova_vm/src/ecmascript/builtins/proxy.rs) is currently entirely -unimplemented. The internal methods of Proxies can be found -[here](https://tc39.es/ecma262/#sec-proxy-object-internal-methods-and-internal-slots): -These abstract internal methods would need to be translated into Nova Rust code -in the `proxy.rs` file. - -[This PR](https://github.com/trynova/nova/pull/174) can perhaps also serve as a -good guide into how internal methods are implemented: Especially check the first -and third commits. One important thing for internal method implementations is -that whenever a special implementation exists in the spec, our internal method -should link to it. Another thing is that if you cannot figure out what you -should be calling in the method or the method you should be calling doesn't -exist yet and you think implementing it would be too much work, it is perfectly -fine to simply add a `todo!()` call to punt on the issue. - -### Builtin functions - -Even more than internal methods, the ECMAScript spec defines builtin functions. -The Nova engine already includes bindings for nearly all of them (only some -Annex B functions should be missing) but the bindings are mostly just `todo!()` -calls. - -Implementing missing builtin functions, or at least the easy and commonly used -parts of them, is a massive and important effort. You can find a mostly -exhaustive list of these (by constructor or prototype, or combined) -[in the GitHub issue tracker](https://github.com/trynova/nova/issues?q=is%3Aopen+is%3Aissue+label%3A%22builtin+function%22). +### Temporal API + +The Temporal API is being worked on by students of Bergen University, but it is +a big effort and more hands are absolutely welcome. + +### Single VM architecture + +Currently, a new `struct Vm` is created on every function call. That is an +unnecessary amount of wastage, and we would do better by avoiding that. 
We +should have a single persistent `Vm` held by the `Agent`, with +`struct +ExecutionContext`s holding the data needed to "pop" an execution +context's data from the `Vm`. + +### Realm-specific heaps + +Currently non-ordinary objects are "Realm-agnostic" in that they do not know +which Realm they belong to and will freely drift to whichever Realm we're +currently executing code in. This is not correct, and the fix for this would be +to either make exotic objects Realm-aware or make heaps Realm-specific. I prefer +the latter option. + +This would entail making `Agent` contain multiple `Heap`s, and introducing a new +`CrossRealmProxy` object type whose purpose is to "transfer" objects between +Realm heaps. An object `{}` created in Realm A would appear in Realm B's heap as +a `CrossRealmProxy(_)` whose data indicates that the real object is found in +Realm A, and gives some stable `OutRealmReference` index in the Realm A heap +that contains the actual object reference within it. + +So: in Realm A we have + +1. Object `{}` which does not have a stable identity due to our garbage + collector performing compaction. +2. `struct OutRealmReferenceRecord { rc: usize; object: Object }` which has a + stable identity (no compacting of these), reference count, and references the + actual object. + +and in Realm B we have + +1. A `struct CrossRealmProxyRecord { realm: Realm; ref: OutRealmReference }` + which references the `OutRealmReferenceRecord` in Realm A. +2. `struct CrossRealmProxy` handles that transparently trace to the original + object using all of the indirection in between. + +With this we make cross-realm objects very slow, but we keep in-realm exotic +objects from having to become Realm-aware. The cost of Realm-awareness is +on-average 4 bytes per object. 
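
The indirection described above can be sketched as a toy model. This is only an illustration under stated assumptions, not Nova's actual types: the record and field names follow the prose (with `ref` renamed to `reference`, since `ref` is a Rust keyword), while `RealmHeap`, `export`, `compact`, `resolve`, and `demo` are hypothetical helpers, and objects are modelled as optional string slices.

```rust
/// Hypothetical object handle: an index that changes when the heap compacts.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Object(usize);

/// Hypothetical Realm identifier: an index into a list of per-Realm heaps.
#[derive(Clone, Copy)]
struct Realm(usize);

/// Stable index into the owning Realm's `out_refs` (never compacted).
#[derive(Clone, Copy)]
struct OutRealmReference(usize);

/// Lives in the Realm that owns the object.
struct OutRealmReferenceRecord {
    rc: usize,
    object: Object,
}

/// Lives in the Realm that merely observes the object.
struct CrossRealmProxyRecord {
    realm: Realm,
    reference: OutRealmReference,
}

struct RealmHeap {
    objects: Vec<Option<&'static str>>, // `None` marks garbage
    out_refs: Vec<OutRealmReferenceRecord>,
    proxies: Vec<CrossRealmProxyRecord>,
}

impl RealmHeap {
    /// Hands out a stable reference to `object` for use by another Realm.
    fn export(&mut self, object: Object) -> OutRealmReference {
        self.out_refs.push(OutRealmReferenceRecord { rc: 1, object });
        OutRealmReference(self.out_refs.len() - 1)
    }

    /// Drops garbage, moves live objects, and fixes up the stable records.
    fn compact(&mut self) {
        let mut new_index = vec![None; self.objects.len()];
        let mut next = 0;
        for (old, slot) in self.objects.iter().enumerate() {
            if slot.is_some() {
                new_index[old] = Some(next);
                next += 1;
            }
        }
        self.objects.retain(|slot| slot.is_some());
        for record in self.out_refs.iter_mut().filter(|r| r.rc > 0) {
            if let Some(new) = new_index[record.object.0] {
                record.object = Object(new);
            }
        }
    }
}

/// Follows proxy -> stable reference -> current object location.
fn resolve(heaps: &[RealmHeap], home: Realm, proxy: usize) -> &'static str {
    let record = &heaps[home.0].proxies[proxy];
    let owner = &heaps[record.realm.0];
    let out = &owner.out_refs[record.reference.0];
    owner.objects[out.object.0].unwrap()
}

fn demo() -> (&'static str, usize) {
    let mut realm_a = RealmHeap { objects: vec![None, Some("{}")], out_refs: vec![], proxies: vec![] };
    let mut realm_b = RealmHeap { objects: vec![], out_refs: vec![], proxies: vec![] };
    let reference = realm_a.export(Object(1));
    realm_b.proxies.push(CrossRealmProxyRecord { realm: Realm(0), reference });
    realm_a.compact(); // the object moves from index 1 to index 0
    let heaps = [realm_a, realm_b];
    let moved_to = heaps[0].out_refs[0].object.0;
    (resolve(&heaps, Realm(1), 0), moved_to)
}

fn main() {
    let (value, index) = demo();
    println!("proxy resolves to {value}, now stored at index {index}");
}
```

The point of the sketch is that Realm B only ever stores the stable `OutRealmReference`, so compaction in Realm A needs to fix up Realm A's own records and nothing else.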
### Heap evolution @@ -304,171 +345,13 @@ dropped immediately (if concurrent heap marking is not currently ongoing) or pushed into a "graveyard" `UnsafeCell>` that gets dropped at the end of a mark-and-sweep iteration. -#### Interleaved garbage collection - -Currently Nova's garbage collection can only happen when no JavaScript is -running. This means that for instance a dumb live loop like this will eventually -exhaust all memory: - -```ts -while (true) { - ({}); -} -``` - -We want to interleave garbage collection together with running JavaScript, but -this is not a trivial piece of work. Right now a function in the engine might -look like this: - -```rs -fn call<'gc>(agent: &mut Agent, obj: Value, mut gc: GcScope<'gc, '_>) -> JsResult<'gc, Value<'gc>> { - if !obj.is_object() { - return Err(agent.throw_error(agent, "Not object", gc)); - } - let is_extensible = is_extensible(agent, obj, gc)?; - if is_extensible { - set(agent, obj, "foo".into(), "bar".into(), gc)?; - } - get(agent, obj, "length", gc) -} -``` - -If `obj` is a Proxy, then it the all three of the internal calls -(`is_extensible`, `set`, and `get`) can call user code. Even for non-Proxies, -the `set` and `get` methods may call user code through getters and setters. Now, -if that user code performs a lot of allocation then we'll eventually need to -perform garbage collection. The question is then "are we sure that `obj` is -still valid to use"? - -We obviously must make sure that somehow we can keep using `obj`, otherwise -interleaved garbage collection cannot be done. There are two ways to do this: - -1. We make all `Value`s safe to keep on stack. In our case, this means that - `Value` must point to an in-heap reference that points to the `Value`'s - actual heap data (the things that contains properties etc.). The in-heap - reference is dropped when the scope is dropped, so the `Value` is safe to - keep within call scope. -2. 
Alternatively, `Value` cannot be used after a potential garbage collection - point. A separate type is added that can be used to move the `Value` onto the - heap and point to it. That type is safe to keep within call scope. - -Some additional thoughts on the two approaches is found below. - -##### Make `Value` safe to keep on stack - -Any problem can always be fixed by adding an extra level of indirection. In this -case the problem of "where did you put that Value, is it still needed, and can I -mutate it during garbage collection?" can be solved by adding a level of -indirection. In V8 this would be the `HandleScope`. The garbage collector would -be given access to the `HandleScope`'s memory so that it can trace items "on the -stack" and fix them to point to the proper items after garbage collection. - -This would be the easiest solution, as this could optionally even be made to -work in terms of actual pointers to `Value`s. The big downside is that this is -an extra indirection which is often honestly unnecessary. - -If the `Value` is not pointer based, then another downside is that we cannot -drop them automatically once they're no longer needed using `impl Drop` because -we'd need access to the `HandleScope` inside the `Drop`. Something called linear -types could fix this issue. - -##### `Value` lifetime bound to garbage collection safepoints - -Any problem can always be fixed by adding an extra lifetime. In this case the -problem of "you're not allowed to keep that Value on stack, I would need to -mutate it during garbage collection" can be solved by using a lifetime to make -sure that Values are never on the stack when garbage collection might happen. -This isn't too hard, really, it just means calls change to be: - -```rs -fn call(agent: &'a mut Agent, value: Value<'a>) -> Value<'a> { - // ... -} -``` - -This works perfectly well, except for the fact that it cannot be called. Why? 
-Because the `Value<'a>` borrows the exclusively owned `&'a mut Agent` lifetime; -this is called a reborrow and it's fine within a function but it cannot be done -intra-procedurally. What we could do is this: - -```rs -fn call(agent: &mut Agent, value: Value, mut gc: Gc) -> Value { - // SAFETY: We've not called any methods that take `&mut Agent` before this. - // `Value` is thus still a valid reference. - let value = unsafe { value.bind(agent) }; - // ... - result.into_register() -} -``` - -Now we can at least call the function, and lifetimes would protect us from -keeping `Value<'a>` on the stack unsafely. They would _not_ help us with making -sure that `Register>` is used properly and even if it did, the whole -`Register>` system is fairly painful to use as each function call -would need to start with this `unsafe {}` song and dance. - -But what about when we call some mutable function and need to keep a reference -to a stack value past that call? This is how that would look: - -```rs -fn call<'gc>(agent: &mut Agent, value: Value, mut gc: GcScope<'gc, '_>) -> JsResult<'gc, Value<'gc>> { - let value = unsafe { value.bind(agent) }; - let kept_value: Global = value.make_global(value); - other_call(agent, gc.reborrow(), value.into_register())?; - let value = kept_value.take(agent); - // ... -} -``` - -We'd need to make the Value temporarily a Global (which introduces an extra -level of indirection), and then "unwrap" that Global after the call. Globals do -currently exist in Nova, but they are "leaky" in that dropping them on the stack -does not clear their memory on the heap, and is effectively a heap memory leak. -In this case we can see that if `other_call` returns early with an error, then -we accidentally leak `kept_value`'s data. This is again not good. - -So we'd need a `Local<'a, Value<'_>>` type of indirection in this case as well. -Whether or not the whole `Value` system makes any sense with that added in is -then very much up for debate. 
- ### Other things This list serves as a "this is where you were" for returning developers as well as a potential easy jumping-into point for newcomers. -- Write implementations of more abstract operations - - See `nova_vm/src/ecmascript/abstract_operations` - - Specifically eg. `operations_on_objects.rs` is missing multiple operations, - even stubs. - Some more long-term prospects and/or wild ideas: -- Reintroduce lifetimes to Heap if possible - - `Value<'gen>` lifetime that is "controlled" by a Heap generation number: - Heap Values are valid while we can guarantee that the Heap generation number - isn't mutably borrowed. This is basically completely equal to a scope based - `Local<'a, Value>` lifetime but the intended benefit is that the - `Value<'gen>` lifetimes can also be used during Heap compaction: When Heap - GC and compaction occurs it can be written as a transformation from - `Heap<'old>` to `Heap<'new>` and the borrow checker would then help to make - sure that any and all `T<'new>` structs within the heap are properly - transformed to `T<'new>`. -- Add a `Local<'a, Value>` enum that acts as our GC-safe, indirected Value - storage. See above for more discussion on this under "Heap evolution". - - A plain `Value` won't be safe to keep on the stack when calling into the - engine if the engine is free to perform garbage collection at (effectively) - any (safe)point within the engine code. The `Value`'s internal number might - need to be readjusted due to GC, which then breaks the `Value`'s identity in - a sense. - - A `Local<'a, Value>`s would not point directly to the heap but would instead - point to an intermediate storage (this is also exactly how V8 does it) where - identities never change. 
 A nice benefit here is that if we make `Local` - itself an equivalent enum to `Value`, just with a different index type - inside, then we can have the intermediate storage store only heap value - references with 5 bytes each: - `struct Storage { types: [u8; N]; values: [u32; N]; }`. We - cannot drop the types array as it is needed for marking and sweeping the - storage. - Add `DISCRIMINANT + 0x80` variants that work as thrown values of type `DISCRIMINANT` - As a result, eg. a thrown String would be just a String with the top bit set diff --git a/GARBAGE_COLLECTOR.md b/GARBAGE_COLLECTOR.md index 6facae8d3..36dd0a812 100644 --- a/GARBAGE_COLLECTOR.md +++ b/GARBAGE_COLLECTOR.md @@ -2,23 +2,28 @@ > Alternative title: Why is the borrow checker angry at me? -At the time of writing this, Nova's garbage collector is still not interleaved -but the support for interleaving GC is getting closer to being done, so the -engine mostly runs as if garbage collection could already be interleaved. This -means that when you write code in the engine, you'll need to understand a bit of -the garbage collector. +Nova's garbage collector can be run in any function that takes the marker +`GcScope` as a parameter. Nova's handles (JavaScript values) are "unrooted" +handles meaning that when the garbage collector runs, they are invalidated. +Taken together, this means that when writing code in Nova you have to be very, +very careful! ... or would have to be if we didn't make use of the borrow +checker to help us. -## What the garbage collector does +Instead of being very, very careful you just have to remember and abide by a few +rules. Follow along! -The garbage collector is, at its core, a fairly simple thing. It starts from a -set of "root" heap allocated values and +## What Nova's garbage collector does + +Nova's garbage collector is, at its core, a fairly simple thing. 
It starts from a +set of "roots" (heap allocated values) and [traces](https://en.wikipedia.org/wiki/Tracing_garbage_collection) them to find -all heap allocated values that are reachable from the roots. It then removes all -those heap allocated values that were not reachable, and finally compacts the -heap to only contain live values. Compacting means that all live `Value`s -(`Value`s that will be used after the garbage collection) must also be fixed to -point to the correct post-compact locations. This requires the garbage collector -to be able to reach and mutate all live `Value`s. +all heap allocated values that are reachable from them. It then removes all +other heap allocated values, and finally compacts the heap to only contain the +values deemed reachable. Compacting means that most `Value`s that remain in the +heap move during garbage collection, and any references that pointed to their +old location must be fixed to point to the correct post-compaction location. +This means that the garbage collector must be able to reach and mutate all +`Value`s that are going to be used after garbage collection. The "set of roots" in Nova's case is a list of global JavaScript `Value`s, a list of "scoped" JavaScript `Value`s, a list of `Vm` structs that are currently @@ -32,40 +37,39 @@ either out of bounds data or to data that is different than what it was before the garbage collection. To avoid this, we want to add the `Value` to the list of "scoped" `Value`s, and -refer to the entry in that list: The garbage collector does not move `Value`s -within the list, so our entry will still point to the same conceptual `Value` +refer to the index in that list: The garbage collector does not move `Value`s +within the list, so our index will still point to the same conceptual `Value` after garbage collection and its pointed-to location will be fixed -post-compaction. Adding `Value`s to the list is done using the `Value::scope` -API. +post-compaction. 
Adding `Value`s to the list is done using the `Scoped::scope` +trait method. Fortunately, we can explain the interaction between `Value`s and the garbage collector to Rust's borrow checker and have it ensure that we call -`Value::scope` before the garbage collector is (possibly) run. **Unfortunately** -explaining the interaction isn't entirely trivial and means we have to jump -through quite a few hoops. +`Scoped::scope` before the garbage collector is (possibly) run. +**Unfortunately** explaining the interaction isn't entirely trivial and means we +have to jump through quite a few hoops. ## Simple example Let's take a silly and mostly trivial example: A method that takes a single JavaScript object, deletes a property from it and returns the object. Here's the -way we would have written the function before interleaved garbage collection -(though we didn't actually have GcToken back then): +naive way to write the function: ```rs -fn method(agent: &mut Agent, obj: Object, gc: GcToken) -> JsResult { +fn method<'gc>(agent: &mut Agent, obj: Object, gc: GcScope<'gc, '_>) -> JsResult<'gc, Object<'gc>> { + // WARNING: the next line is erroneous! delete(agent, obj, "key".into(), gc)?; Ok(obj) } ``` -Now, with interleaved garbage collection we must realise that `obj` could be a -Proxy with a `delete` trap that performs wild allocations, trying to crash the -engine through OOM. If within the `delete` call garbage collection happens, the -`obj` would no longer point to the same object that we took as a parameter. -Here, we can use scoping to solve the problem: +Because the `delete` method takes a `GcScope`, it might trigger garbage +collection. If that happens, then the `obj` would no longer point to the same +object that we took as a parameter: this is use-after-free. 
Here, we can use +scoping to solve the problem: ```rs -fn method(agent: &mut Agent, obj: Object, gc: GcToken) -> JsResult { +fn method<'gc>(agent: &mut Agent, obj: Object, mut gc: GcScope<'gc, '_>) -> JsResult<'gc, Object<'gc>> { let scoped_obj = obj.scope(agent, gc.nogc()); delete(agent, obj, "key".into(), gc.reborrow())?; Ok(scoped_obj.get(agent)) @@ -80,16 +84,16 @@ object heap data. The issue here is that we have to know to call the `scope` method ourselves, and without help this will be impossible to keep track of. Above you already see the -`GcToken::nogc` and `GcToken::reborrow` methods: We use these to make Rust's +`GcScope::nogc` and `GcScope::reborrow` methods: We use these to make Rust's borrow checker track the GC safety for us. -The `GcToken::nogc` method performs a shared borrow on the current `GcToken` and -returns a `NoGcToken` that is bound to the lifetime of that shared borrow. -Effectively you can think of it as saying "for as long as this `NoGcToken` or a +The `GcScope::nogc` method performs a shared borrow on the current `GcScope` and +returns a `NoGcScope` that is bound to the lifetime of that shared borrow. +Effectively you can think of it as saying "for as long as this `NoGcScope` or a `Value` derived from it exists, garbage collection cannot be performed". This -API is used for scoping, for explicitly binding `Value`s to the `GcToken`'s -lifetime, and for calling methods that are guaranteed to not call JavaScript -and/or perform garbage collection. +method is used for scoping, explicitly binding `Value`s to the `GcScope`'s +lifetime, and calling methods that are guaranteed to not call JavaScript or +otherwise perform garbage collection. > Note 1: "Scoped" `Value`s do not restrict garbage collection from being > performed. They have a different type, `Scoped`, and are thus not @@ -98,25 +102,26 @@ and/or perform garbage collection. 
> Note 2: Currently, Nova makes no difference between methods that can call into > JavaScript and methods that can perform garbage collection. All JavaScript > execution is required to be capable of performing garbage collection so -> calling into JavaScript always requires the `GcToken`. A method that cannot +> calling into JavaScript always requires the `GcScope`. A method that cannot > call into JavaScript but may trigger garbage collection is theoretically -> possible and would likewise require a `GcToken` but there would be no way to +> possible and would likewise require a `GcScope` but there would be no way to > say that it never calls JavaScript. -The `GcToken::reborrow` method performs an exclusive borrow on the current -`GcToken` and returns a new `GcToken` that is bound to the lifetime of that -exclusive borrow. Effectively, it says "for as long as this `GcToken` or a -`Value` derived from it exists, no other `GcToken` or `Value` can be used". This -API is used when calling into methods that may perform garbage collection. +The `GcScope::reborrow` method performs an exclusive borrow on the current +`GcScope` and returns a new `GcScope` that is bound to the lifetime of that +exclusive borrow. Effectively, it says "for as long as this `GcScope` or a +`Value` derived from it exists, no other `GcScope` or `Value` derived from them +can be used". This method is used when calling into methods that may perform +garbage collection. 
-With the `GcToken::nogc`, we can explicitly "bind" a `Value` to the `GcToken` +With the `GcScope::nogc`, we can explicitly "bind" a `Value` to the `GcScope` like this: ```rs -fn method(agent: &mut Agent, obj: Object, gc: GcToken) -> JsResult { +fn method<'gc>(agent: &mut Agent, obj: Object, mut gc: GcScope<'gc, '_>) -> JsResult<'gc, Object<'gc>> { let obj = obj.bind(gc.nogc()); let scoped_obj = obj.scope(agent, gc.nogc()); - delete(agent, obj.unbind(), "key".into(), gc.reborrow())?; + delete(agent, obj.unbind(), "key".into(), gc.reborrow()).unbind()?; Ok(scoped_obj.get(agent)) } ``` @@ -125,16 +130,18 @@ If we were to write out all the lifetime changes here a bit more explicitly, it would look something like this: ```rs -fn method(agent: &'agent mut Agent, obj: Object<'obj>, gc: GcToken<'gc, 'scope>) -> JsResult<'gc, Object<'gc>> { - let nogc: NoGcToken<'nogc, 'scope> = gc.nogc(); // [1] +fn method(agent: &'agent mut Agent, obj: Object<'obj>, mut gc: GcScope<'gc, 'scope>) -> JsResult<'gc, Object<'gc>> { + let nogc: NoGcScope<'nogc, 'scope> = gc.nogc(); // [1] let obj: Object<'nogc> = obj.bind(gc.nogc()); let scoped_obj: Scoped<'scope, Object<'static>> = obj.scope(agent, gc.nogc()); // [2] { let obj_unbind: Object<'static> = obj.unbind(); // [3] - let gc_reborrow: GcToken<'gcrb, 'scope> = gc.reborrow(); // [4] - delete(agent, obj_unbind, "key".into(), gc_reborrow)?; + let gc_reborrow: GcScope<'gcrb, 'scope> = gc.reborrow(); // [4] + let result: JsResult<'gcrb, bool> = delete(agent, obj_unbind, "key".into(), gc_reborrow); // [5] + let unbound_result: JsResult<'static, bool> = result.unbind(); + unbound_result?; } - let scoped_obj_get: Object<'static> = scoped_obj.get(agent); // [5] + let scoped_obj_get: Object<'static> = scoped_obj.get(agent); // [6] Ok(scoped_obj_get) } ``` @@ -144,40 +151,46 @@ Taking the steps in order: - 1: `'gc: 'nogc`, ie. `'nogc` is shorter than and derives/reborrows from `'gc`. 
- 2: `Scoped` does not bind to the `'nogc` lifetime of - `NoGcToken<'nogc, 'scope>` but instead to the `'scope` lifetime. This is + `NoGcScope<'nogc, 'scope>` but instead to the `'scope` lifetime. This is purposeful and is what enables `Scoped` to be used after the `delete` call - without angering the garbage collector. + without angering the borrow checker. -- 3: The `Object` needs to be "unbound" from the `GcToken` when used as a +- 3: The `Object` needs to be "unbound" from the `GcScope` when used as a parameter for a method that may perform garbage collection. The reason for this is that, effectively, performing `gc.reborrow()` or passing `gc` as a parameter to a call invalidates all existing `Value`s "bound" to the - `GcToken`. + `GcScope`, including the `obj` that we wanted to pass as a parameter. - 4: `'gc: 'gcrb`, ie. `'gcrb` is shorter than and derives/reborrows from `'gc`. -- 5: The returned `Value` (in this case `Object`) from a `Scoped::get` is - currently bound to the `'static` lifetime, ie. unbound from the `GcToken`. - This isn't great and we'd prefer to have `get` take in a `NoGcToken` and - return `Value<'nogc>` instead but I've not yet figured out if that's possible - (it requires expert level trait magics). Because here we return the `Value` we - got immediately, this lifetime is not a problem but in general we should - perform call `value.bind(gc.nogc())` immediately after the `get`. +- 5: The `delete` method returns a `JsResult` that carries the `'gcrb` + lifetime. As this is a shorter lifetime than `'gc`, we cannot rethrow the + `JsResult::Err` variant as-is (it does not live long enough to be returned + from the `fn method` function) and therefore have to `.unbind()` the result + before rethrowing. + +- 6: The `Value` (or in this case `Object`) returned from a `Scoped::get` is + unbound from the `GcScope`. 
 This isn't great and we'd prefer to have `get` + take in a `NoGcScope` and return `Value<'nogc>` instead but I've not yet + figured out if that's possible (it requires expert level trait magics). + Because here we return the `Value` we got immediately, this lifetime is not a + problem but in general we should call `value.bind(gc.nogc())` + immediately after the `get`. With these steps, the borrow checker will now ensure that `obj` is not used after the `delete` call, giving us the help we want and desperately need. -## Rules of thumb for methods that take `GcToken` +## Rules of thumb for methods that take `GcScope` Here's a helpful set of things to remember about scoping of `Value`s in calls -and the different APIs related to the `GcToken`. +and the different APIs related to the `GcScope`. ### At the beginning of a function, bind all parameters Example: ```rs -fn method(agent: &mut Agent, a: Object, b: Value, c: PropertyKey, d: ArgumentsList, gc: GcToken) { +fn method(agent: &mut Agent, a: Object, b: Value, c: PropertyKey, d: ArgumentsList, gc: GcScope) { let nogc = gc.nogc(); // Perfectly okay to avoid repeating `gc.nogc()` in each call. let a = a.bind(nogc); let b = b.bind(nogc); @@ -190,11 +203,12 @@ fn method(agent: &mut Agent, a: Object, b: Value, c: PropertyKey, d: ArgumentsLi ``` Yes, this is annoying, I understand. You **must** still do it, or bugs will seep -through! TODO: We should allow binding `ArgumentsList` directly as well. +through! You can also bind `d: ArgumentsList` directly to reduce the +work a little. > Note: The `nogc` local value cannot be used after the first `gc.reborrow()` -> call. You'll need to re-do `let nogc = gc.nogc();` if you want the convenience -> again. +> call. You'll need to re-do `let nogc = gc.nogc();` or `nogc = gc.nogc()` if +> you want the convenience again. 
### Unbind all parameters only at the call-site, never before @@ -222,32 +236,34 @@ Example: method(agent, scoped_a.get(agent), gc.reborrow()); ``` -### Immediately "rebind" return values from methods that take `GcToken` +### Immediately "rebind" return values from methods that take `GcScope` Example: ```rs let result = method(agent, a.unbind(), gc.reborrow()) - .unbind().bind(gc.nogc()); + .unbind() + .bind(gc.nogc()); ``` The reason to do this is that the `result` as returned from `method` extends the -lifetime of the `gc.reborrow()` exclusive borrow on the `GcToken`. In effect, -the `result` says that as long as it lives, the `gc: GcToken` cannot be used nor +lifetime of the `gc.reborrow()` exclusive borrow on the `GcScope`. In effect, +the `result` says that as long as it lives, the `gc: GcScope` cannot be used nor can any other `Value`s exist. The exact reason for why it works like this is some fairly obscure Rust lifetime trivia having to do with internal mutability. In our case, a quick `.unbind().bind(gc.nogc())` allows us to drop the exclusive -borrow on the `GcToken` and replace it with, effectively, a shared borrow on the -`GcToken`. This gives us the `GcToken` binding we wanted. +borrow on the `GcScope` and replace it with, effectively, a shared borrow on the +`GcScope`. This gives us the `GcScope` binding we wanted. 
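The rules of thumb so far (bind parameters first, unbind only at the call-site, rebind results immediately) can be tried out on a toy model of the lifetime mechanics. This is only a sketch: `ToyGcScope`, `ToyNoGcScope`, `ToyValue`, `may_gc`, and `method` are hypothetical stand-ins invented for illustration, not Nova's real `GcScope`, `NoGcScope`, or `Value` types, and no actual garbage collection happens here. The point is merely that the borrow checker accepts this ordering of calls.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for `GcScope`: whoever holds it may trigger GC.
pub struct ToyGcScope<'a>(PhantomData<&'a mut ()>);

/// Hypothetical stand-in for `NoGcScope`: a shared borrow of the GC scope.
#[derive(Clone, Copy)]
pub struct ToyNoGcScope<'a>(PhantomData<&'a ()>);

/// Hypothetical stand-in for a `Value` handle bound to a lifetime.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ToyValue<'a>(pub u32, PhantomData<&'a ()>);

impl ToyGcScope<'_> {
    /// Creates a root scope; a real engine would hand this out itself.
    pub fn root() -> ToyGcScope<'static> {
        ToyGcScope(PhantomData)
    }

    /// Exclusive reborrow: while the result lives, `self` cannot be used.
    pub fn reborrow(&mut self) -> ToyGcScope<'_> {
        ToyGcScope(PhantomData)
    }

    /// Shared borrow: values bound to it keep `self` immutably borrowed.
    pub fn nogc(&self) -> ToyNoGcScope<'_> {
        ToyNoGcScope(PhantomData)
    }
}

impl ToyValue<'_> {
    pub fn new(index: u32) -> ToyValue<'static> {
        ToyValue(index, PhantomData)
    }

    /// Detaches the value from any scope lifetime.
    pub fn unbind(self) -> ToyValue<'static> {
        ToyValue(self.0, PhantomData)
    }

    /// Attaches the value to the lifetime of a no-GC token.
    pub fn bind<'b>(self, _nogc: ToyNoGcScope<'b>) -> ToyValue<'b> {
        ToyValue(self.0, PhantomData)
    }
}

/// Hypothetical callee that "may trigger GC"; it just sums the two indexes.
pub fn may_gc<'gc>(a: ToyValue, b: ToyValue, _gc: ToyGcScope<'gc>) -> ToyValue<'gc> {
    ToyValue(a.0 + b.0, PhantomData)
}

pub fn method(a: ToyValue, b: ToyValue, mut gc: ToyGcScope) -> u32 {
    // Bind all parameters at the beginning of the function.
    let nogc = gc.nogc();
    let a = a.bind(nogc);
    let b = b.bind(nogc);
    // Unbind only at the call-site, then rebind the result immediately.
    let result = may_gc(a.unbind(), b.unbind(), gc.reborrow())
        .unbind()
        .bind(gc.nogc());
    // `result` now holds only a shared borrow, so other values can be bound
    // alongside it. Using `a`, `b`, or `nogc` here would fail to compile.
    let other = ToyValue::new(1).bind(gc.nogc());
    result.0 + other.0
}
```

Dropping the `.unbind().bind(gc.nogc())` step in this toy makes the later `gc.nogc()` call fail to compile, which mirrors the reasoning given above about the exclusive borrow being extended by the returned value.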
Exception: This does not need to be done if you're simply returning the result -immediately: +immediately (this does require passing the entire `gc` by-value instead of by +calling `gc.reborrow()`): ```rs -fn call<'a>(agent: &mut Agent, a: Value, gc: GcToken<'a, '_>) -> JsResult<'a, Value<'a>> { +fn call<'a>(agent: &mut Agent, a: Value, gc: GcScope<'a, '_>) -> JsResult<'a, Value<'a>> { let a = a.bind(gc.nogc()); - method(agent, a.unbind(), gc.reborrow()) // No need to rebind the result + method(agent, a.unbind(), gc) // No need to rebind the result } ``` @@ -276,8 +292,10 @@ as the borrow checker would force you to again unbind both `Value`s immediately. Example: ```rs +let a = a.bind(gc.nogc()); let result = method(agent, a.unbind(), gc.reborrow()) - .unbind().bind(gc.nogc()); + .unbind() + .bind(gc.nogc()); a.internal_set_prototype(agent, result.unbind(), gc.reborrow()); // Error! `gc` is immutably borrowed here but mutably borrowed above ``` @@ -285,6 +303,7 @@ If you cannot figure out a way around the borrow checker error (it is absolutely correct in erroring here), then scope the offending `Value`: ```rs +let a = a.bind(gc.nogc()); let scoped_a = a.scope(agent, gc.nogc()); let result = method(agent, a.unbind(), gc.reborrow()) .unbind().bind(gc.nogc()); @@ -308,21 +327,25 @@ the `into_nogc` method, this can be done: ```rs let a = a.unbind(); -// No garbage collection or JS call can happen after this point. We no longer need the GcToken. +// No garbage collection or JS call can happen after this point. We no longer +// need the GcScope. let gc = gc.into_nogc(); -let a = a.bind(gc); // With this we're back to being bound; temporary unbinding like this is okay. +// With this we're back to being bound; temporary unbinding like this is okay. 
+let a = a.bind(gc); ``` **Bad example:** -"Temporary unbinding" like above must not contain any `GcToken` usage in +"Temporary unbinding" like above must not contain any `GcScope` usage in between: ```rs let a = a.unbind(); -method(agent, b.unbind(), gc.reborrow()); // GC can trigger here! +// GC can trigger here! +method(agent, b.unbind(), gc.reborrow()); let gc = gc.into_nogc(); -let a = a.bind(gc); // We're back to being bound but GC might have triggered while we were unbound! +// We're back to being bound but GC might have triggered while we were unbound! +let a = a.bind(gc); ``` This is absolutely incorrect and will one day lead to a weird bug. diff --git a/README.md b/README.md index e3286c1c4..21b5e80fd 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,17 @@ The core of our team is on our [Discord server](https://discord.gg/bwY4TRB8J7). ## Talks +### [Out the cave, off the cliff — data-oriented design in Nova JavaScript engine](https://www.youtube.com/watch?v=QuJRKhySp-0) + +Slides: +[Google Drive](https://docs.google.com/presentation/d/1_N5uLxkR0G4HSYtGuI68eXaj51c7FVCngDg7lxiRytM/edit?usp=sharing) + +Presented originally at Turku University JavaScript Day, then at Sydney Rust +Meetup, and finally at [JSConf.jp](https://jsconf.jp/2025/en) in slightly +differing and evolving forms, the talk presents the "today" of major JavaScript +engines and the "future" of what Nova is doing, and why it is both a good and a +bad idea. + ### [Abusing reborrowing for fun, profit, and a safepoint garbage collector @ FOSDEM 2025](https://fosdem.org/2025/schedule/event/fosdem-2025-4394-abusing-reborrowing-for-fun-profit-and-a-safepoint-garbage-collector/) Slides: @@ -59,29 +70,31 @@ TC39 slides: The architecture and structure of the engine follows the ECMAScript specification in spirit, but uses data-oriented design for the actual -implementation. 
Records that are present in the specification are generally -found as a `struct` in Nova in an "equivalent" file / folder path as the -specification defines them in. But instead of referring to these records by -pointer or reference, the engine usually calls these structs the "RecordData" or -"RecordHeapData", and defines a separate "index" type which takes the "Record" -name and only contains a 32-bit unsigned integer. The heap data struct is stored -inside the engine heap in a vector of these heap data structs, and the index -type stores the correct vector index for the value. Polymorphic index types, -such as the main JavaScript Value, are represented as tagged enums over the -index types. +implementation. Types that are present in the specification, and are often +called "Something Records", are generally found as a `struct` in Nova in an +"equivalent" file / folder path as the specification defines them in. But +instead of referring to these records by pointer or reference, the engine +usually calls these structs the "SomethingRecord" or "SomethingHeapData", and +defines a separate "handle" type which takes the plain "Something" type name and +only contains a 32-bit unsigned integer. The record struct is stored inside the +engine heap in a vector, and the handle type stores the correct vector index for +the value. Polymorphic index types, such as the main JavaScript Value, are +represented as tagged enums over the index types. In general, all specification abstract operations are then written to operate on -the index types instead of operating on the heap structs themselves. This avoids -issues with re-entrancy, pointer aliasing, and others. +the index types instead of operating on references to the heap structs +themselves. This avoids issues with re-entrancy, pointer aliasing, and others. ### Heap structure - Data-oriented design -Reading the above, you might be wondering why the split into index and heap data -structs is done. 
The ultimate reason is two-fold: +Reading the above, you might be wondering why the split into handle and heap +data structs is done. The ultimate reason is two-fold: 1. It is an interesting design. -2. It helps the computer make frequently used things fast while allowing the - infrequently used things to become slow. + +1. It helps the computer make frequently used things fast while allowing the + infrequently used things to take less (or no) memory at the cost of access + performance. Data-oriented design is all the rage on the Internet because of its cache-friendliness. This engine is one more attempt at seeing what sort of diff --git a/nova_lint/README.md b/nova_lint/README.md index 4cc6dd5ea..901ed2c15 100644 --- a/nova_lint/README.md +++ b/nova_lint/README.md @@ -12,10 +12,13 @@ along side [Clippy](https://doc.rust-lang.org/stable/clippy/index.html). ## Usage 1. Install `cargo-dylint` and `dylint-link`: - ```bash - cargo install cargo-dylint dylint-link - ``` + +```bash +cargo install cargo-dylint dylint-link +``` + 2. Run the linter in the root of the project: - ```bash - cargo dylint --all - ``` + +```bash +cargo dylint --all +``` diff --git a/nova_vm/src/README.md b/nova_vm/src/README.md index 79ad55538..075709156 100644 --- a/nova_vm/src/README.md +++ b/nova_vm/src/README.md @@ -4,18 +4,9 @@ The Nova VM source structure is as follows: 1. The `ecmascript` folder contains all code relating to the ECMAScript specification. -1. The `engine` folder contains Nova engine specific code such as bytecode. -1. The `heap` folder contains the setup for the heap of the VM and the direct - APIs to work with it. - -## ECMAScript folder structure -The ECMAScript folder will have its own READMEs to better describe various -details of the structure but the basic idea is to stay fairly close to the -ECMAScript specification text and its structure. +1. The `engine` folder contains Nova engine specific code such as bytecode + interpreter. 
-As an example, the `ecmascript/types` folder corresponds to the specification's -section 6, `ECMAScript Data Types and Values`. That section has two subsections, -6.1 `ECMAScript Language Types` and 6.2 `ECMAScript Specification Types`, which -then correspond to the `ecmascript/types/language` and `ecmascript/types/spec` -folders respectively. +1. The `heap` folder contains the setup for the heap of the VM and the direct + APIs to work with it. diff --git a/nova_vm/src/ecmascript/README.md b/nova_vm/src/ecmascript/README.md index c90e72e26..d7134a572 100644 --- a/nova_vm/src/ecmascript/README.md +++ b/nova_vm/src/ecmascript/README.md @@ -1,50 +1,91 @@ -## ECMAScript +# ECMAScript This folder contains the code for things mentioned directly in the [ECMAScript language specification](https://tc39.es/ecma262/). As much as is reasonable, the structure within this folder should be similar to the specification text and code should reuse terms from the specification directly. -### Crossreferencing +This is also conceptually the main entry point into the `nova_vm` library for +embedders. -#### 6. ECMAScript Data Types and Values +## Values -Found in the [`types`](./types/) folder. +The main ECMAScript `Value` type of Nova is found in +[`types/language/value.rs`](./types/language/value.rs) and is a public Rust +`enum`. The Nova `Value` API is thus fully open about its internal +representation, and the usage of pattern matching on `Value` in embedders is +explicitly encouraged. + +Nova's `Value` variants always hold either an in-line stored payload directly, +or a ["handle"](https://en.wikipedia.org/wiki/Handle_(computing)) to data stored +on the engine heap. + +### Primitives + +Nova's primitive types are implemented as follows. + +1. `undefined` and `null` as payload-less variants on `Value`. + +1. `true` and `false` as a `bool`-carrying variant on `Value`. + +1.
 Strings are encoded as [WTF-8] and have two variants on `Value`: short + strings stored in-line, and longer strings heap-allocated and referenced + through a handle. + +1. Symbols have a single variant on `Value`, carrying a handle to heap-allocated + data. -#### 7. Abstract operations +1. Numbers have three variants on `Value`: safe integers stored in-line, + double-precision floating point values with 8 trailing zeroes stored in-line, + and other numbers heap-allocated and referenced through a handle. -Currently mostly found as methods on `Value`. +1. BigInts have two variants on `Value`: 56-bit signed values stored in-line, + and larger values heap-allocated and referenced through a handle. -Maybe move to [`abstract_operations`](./abstract_operation)? +### Objects -#### 8. Syntax-Directed Operations +All non-primitive ECMAScript values are objects. Object data is always +heap-allocated and referenced through a handle. Unlike most JavaScript engines, +Nova does not use pointers to refer to heap-allocated data and instead chooses +to use a combination of the handle's type and an index contained in the handle. +This means that all object data is allocated on the heap in dedicated typed +arenas, and accessing the data is done by offsetting based on the handle's +contained index. + +## Crossreferencing + +### 6. ECMAScript Data Types and Values + +Found in the [`types`](./types/) folder. + +### 7. Abstract Operations + +Found in the [`abstract_operations`](./abstract_operation) folder. + +### 8. Syntax-Directed Operations This is more about the parsing so I am not sure if this needs to be in the engine at all. If this ends up being needed then it will be in a [`syntax`](./syntax/) folder. -#### 9. Executable Code and Execution Contexts +### 9. Executable Code and Execution Contexts Found in the [`execution`](./execution/) folder. -#### 10. Ordinary and Exotic Objects Behaviours - -Currently mostly found in `builtins` but maybe move to -[`behaviours`](./behaviours)?
+### 10. Ordinary and Exotic Objects Behaviours -On the other hand, this part of the spec also contains the subsection 10.3 -Built-in Function Objects and various other built-in related things so it might -be okay to keep this in `builtins` in an inline sort of way. +Found in the [`builtins`](./builtins) folder. -#### 11-15. ECMAScript Language, and 18. Error Handling and Language Extensions +### 11-15. ECMAScript Language, and 17. Error Handling and Language Extensions -This is all syntax (and then some) and will not be found in the engine. +For the parts concerning evaluation of the language syntax, these are mainly +found in the adjacent [`engine`](../engine) folder. -#### 16. ECMAScript Language: Scripts and Modules +### 16. ECMAScript Language: Scripts and Modules Found in the [`scripts_and_modules`](./scripts_and_modules/) folder. -#### 18. ECMAScript Standard Built-in Objects, and 19-28. +### 18.-28. ECMAScript Standard Built-in Objects -Should be found in the [`builtins`](./builtins/) folder. +Found in the [`builtins`](./builtins) folder. diff --git a/nova_vm/src/engine/README.md b/nova_vm/src/engine/README.md index c2e78ca32..422a24dce 100644 --- a/nova_vm/src/engine/README.md +++ b/nova_vm/src/engine/README.md @@ -1,4 +1,141 @@ # Engine -This folder should contain engine-specific details such as the bytecode -representation and interpretation. +This folder contains engine-specific details such as the bytecode representation +and interpreter. The Nova VM is a fairly simple stack-based bytecode +interpreter. The interpreter loop revolves around the +[`struct +Vm`](./bytecode/vm.rs:86:12). + +## The `Vm` struct + +The `Vm` struct is made up of the following parts: + +1. Instruction pointer: this tracks the progress of the interpreter in the + referenced bytecode buffer. + +1. Result register: a single `Option` that holds the last expression or + statement result, or `None`. 
When the previous result needs to be recalled + later, it is pushed onto the Value stack. + +1. Reference register: a single `Option` of a [Reference Record] that holds the + last "place expression" (an expression `a` that can appear on the left side + of an assignment operation `a = b`) to be evaluated, or `None`. When a + reference needs to be recalled later, it is pushed onto the Reference stack. + +1. Value stack: a `Vec` that is used to store unnamed variables as + needed. The stack is also used for storing non-escaping named variables as an + optimisation. + +1. Reference stack: a `Vec` of [Reference Records][Reference Record] that is + used by the interpreter to store references for later use. + +1. Exception handler stack: a `Vec` of installed exception handlers. These are + produced by try-catch blocks and async iterators, and are removed upon + exiting the associated block. + +1. Iterator stack: a `Vec` of iterators currently being processed. These are + produced by `for-of` and `for-in` iterators and are removed upon exiting the + loop. + +Contrary to the common (and the most obvious and efficient) way of building a +VM, Nova does not have a single `Vm` struct per `Agent` (engine instance) that +gets reused between calls. Instead, the engine creates a new `Vm` on the native +stack on every function call. This is not an architectural decision per se, but +rather a historical one that ought to be fixed at some point. + +Because the `Vm` structs are stored on the stack, they must be explicitly rooted +when calling into methods that might trigger garbage collection (methods that +take `GcScope`). A helper function `with_vm_gc` is provided for this purpose. + +In addition to the `Vm` structs, ECMAScript execution per specification is +defined by the [Execution Contexts][Execution Context]. These are stored in a +stack in the `Agent`, and are currently considered entirely separate from the +`Vm` structs despite being very closely related in actual fact. 
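As a rough picture of how the result register and the Value stack described above cooperate, here is a minimal sketch. It assumes a heavily simplified `ToyVm` (a hypothetical name) whose values are plain `i64`s; the reference register, reference stack, exception handlers, and iterators are left out entirely, and none of the method names come from Nova.

```rust
/// Toy model of per-call VM state; the field names are illustrative only.
#[derive(Debug, Default)]
pub struct ToyVm {
    /// Index of the next instruction in the bytecode buffer.
    pub ip: usize,
    /// Result register: the last expression result, if any.
    pub result: Option<i64>,
    /// Value stack: results parked for later use.
    pub stack: Vec<i64>,
}

impl ToyVm {
    /// Stores a freshly evaluated value in the result register.
    pub fn set_result(&mut self, value: i64) {
        self.result = Some(value);
    }

    /// Moves the result register onto the value stack so that it can be
    /// recalled after the register has been overwritten.
    pub fn push_result(&mut self) {
        if let Some(value) = self.result.take() {
            self.stack.push(value);
        }
    }

    /// Pops the most recently parked value and adds the result register to it.
    pub fn add_popped(&mut self) {
        let lhs = self.stack.pop().expect("value stack underflow");
        let rhs = self.result.take().expect("empty result register");
        self.result = Some(lhs + rhs);
    }
}

/// Evaluates `1 + 2` the way a stack-based interpreter might.
pub fn eval_one_plus_two() -> Option<i64> {
    // A fresh VM is created for the call, as the text above describes.
    let mut vm = ToyVm::default();
    vm.set_result(1); // evaluate the left operand
    vm.push_result(); // park it while the right operand is evaluated
    vm.set_result(2); // evaluate the right operand
    vm.add_popped(); // combine the parked value with the register
    vm.result
}
```

In this sketch the left operand has to be parked on the stack because evaluating the right operand overwrites the single result register.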
+ +### Running ECMAScript code in a `Vm` + +ECMAScript code must be compiled into bytecode to be executable; such a compiled +bytecode is stored on the `Agent` heap separately and is referenced using the +handle type `Executable`. An `Executable` can then be executed by calling the +`Vm::execute` static method, and will return one of four result variants: + +1. `Return`: the code executed successfully and returned a result. This is + produced by both the `return` statement and an executable reaching the end of + the bytecode buffer (implicit `return`). + +1. `Throw`: the code executed unsuccessfully and threw an error. This is + produced by the `throw` statement. + +1. `Await`: the code could not finish execution and has to await a `Promise` + before continuing. The variant includes both the `Promise` to await and a + suspended variant of the `Vm` struct. This is produced by the `await` + expression and async iterators. + +1. `Yield`: the code requested to yield to the caller. The variant includes both + the `Value` to yield and a suspended variant of the `Vm` struct. This is + produced by the `yield` expression. + +The `struct SuspendedVm` returned by `Await` and `Yield` variants must be stored +on the `Agent` heap and resumed at a later point. + +### Compiling ECMAScript code for executing + +ECMAScript code can be compiled using one of the four `compile` APIs on the +`Executable` handle type: + +1. `compile_script`: for compiling [`Script`s][Script]. +1. `compile_module` for compiling ECMAScript [Modules][Module]. +1. `compile_function_body` for compiling ECMAScript [Functions][Function]. +1. `compile_eval_body` for compiling [`eval()` scripts][eval]. + +Once compiled, the `Executable` handle is used to refer to the bytecode stored +on the `Agent` heap. If the `Executable` is itself not stored on the heap (such +as stored in an `ECMAScriptFunction`'s heap data) then the bytecode will +eventually be garbage collected. 
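To make the four result variants of `Vm::execute` described above a little more concrete, the following sketch models them with a toy interpreter. Everything here is an assumption made for illustration: `ToyInstr`, `ToyExecutionResult`, `ToySuspendedVm`, `execute`, and `resume` are hypothetical names, values are plain `i64`s, and the suspended state keeps only the instruction pointer, whereas the real `SuspendedVm` also carries the stacks.

```rust
/// Toy instruction set; far simpler than real bytecode.
#[derive(Clone, Copy, Debug)]
pub enum ToyInstr {
    Load(i64),
    Yield,
    Await,
    Throw,
    Return,
}

/// Suspended toy VM: only remembers where to continue from.
#[derive(Debug, PartialEq)]
pub struct ToySuspendedVm {
    ip: usize,
}

/// Toy counterpart of the four result variants.
#[derive(Debug, PartialEq)]
pub enum ToyExecutionResult {
    Return(Option<i64>),
    Throw(Option<i64>),
    Await { awaited: Option<i64>, vm: ToySuspendedVm },
    Yield { yielded: Option<i64>, vm: ToySuspendedVm },
}

fn run(code: &[ToyInstr], mut ip: usize) -> ToyExecutionResult {
    let mut result = None;
    while let Some(instr) = code.get(ip).copied() {
        ip += 1;
        match instr {
            ToyInstr::Load(value) => result = Some(value),
            ToyInstr::Yield => {
                let vm = ToySuspendedVm { ip };
                return ToyExecutionResult::Yield { yielded: result, vm };
            }
            ToyInstr::Await => {
                let vm = ToySuspendedVm { ip };
                return ToyExecutionResult::Await { awaited: result, vm };
            }
            ToyInstr::Throw => return ToyExecutionResult::Throw(result),
            ToyInstr::Return => return ToyExecutionResult::Return(result),
        }
    }
    // Reaching the end of the buffer acts as an implicit `return`.
    ToyExecutionResult::Return(result)
}

/// Starts executing from the beginning of the code.
pub fn execute(code: &[ToyInstr]) -> ToyExecutionResult {
    run(code, 0)
}

/// Continues a previously suspended execution.
pub fn resume(code: &[ToyInstr], vm: ToySuspendedVm) -> ToyExecutionResult {
    run(code, vm.ip)
}
```

A caller would match on the result, store the suspended state somewhere when it receives `Yield` or `Await`, and later pass it back to `resume` together with the same code.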
+ +It is worth noting that at present closures are compiled anew on every use as +the bytecode gets stored in the `ECMAScriptFunction` instance and is not shared +between multiple function instances even though they originate from the same +code: + +```typescript +array.filter((v) => v > 10); // one function instance, compiled once + +for (let i = 0; i < N; i++) { + array.filter((v) => v > 10); // one function instance per loop, compiled N times +} +``` + +This is again not an architectural choice, but simply a historical happenstance +that ought to be fixed. + +## Bytecode format + +Nova VM's bytecode is a variable-width, high-level bytecode. It does not offer +any instructions for direct heap memory manipulation, but offers a set of +base-level instructions for manipulating the state of the `Vm` struct, and +beyond that provides many ECMAScript specific instructions that perform more +complicated actions that would not be possible to implement on just the `Vm` +struct. + +Each bytecode instruction is always made up of a single instruction byte which +is then followed by 2 or 4 bytes of data, depending on the instruction. The data +bytes are interpreted as either `bool`s, `u16` values, or a single `u32` value +depending on the instruction. Note that the data is not aligned and cannot be +directly reinterpreted as a `u16` or `u32`. + +## `Vm` dispatch loop + +The `Vm` dispatch loop is found in the `Vm::inner_execute` method and consists +of a `while let Some()` loop "consuming" bytecode instructions (by moving the +`Vm`'s instruction pointer forward) from the `Executable`'s bytecode buffer. +This loop also contains the _only_ point in the engine where garbage collection +may be triggered. There is no particular reason why this should be the only +point where GC is checked and triggered, but currently it happens to be so. 
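The bytecode format and dispatch loop described above can be sketched as follows. This is a toy under stated assumptions, not Nova's instruction set: the opcodes `OP_LOAD_U16` and `OP_ADD_U32`, the `ToyInstr` enum, and the little-endian operand encoding are all made up for the example. What it does show is the general shape: one instruction byte, then 2 or 4 unaligned data bytes that are copied out rather than reinterpreted in place, consumed by a `while let Some()` loop.

```rust
/// Hypothetical opcodes, each followed by 2 or 4 bytes of data.
pub const OP_LOAD_U16: u8 = 0x01;
pub const OP_ADD_U32: u8 = 0x02;

#[derive(Debug, PartialEq)]
pub enum ToyInstr {
    LoadU16(u16),
    AddU32(u32),
}

/// Decodes the instruction at `*ip` and moves `*ip` past it.
/// Returns `None` at the end of the buffer or on malformed input.
pub fn decode(code: &[u8], ip: &mut usize) -> Option<ToyInstr> {
    let opcode = *code.get(*ip)?;
    let data = &code[*ip + 1..];
    match opcode {
        OP_LOAD_U16 => {
            // The operand is not aligned, so copy the bytes out instead of
            // casting the buffer pointer to `*const u16`.
            let b = data.get(..2)?;
            *ip += 3;
            Some(ToyInstr::LoadU16(u16::from_le_bytes([b[0], b[1]])))
        }
        OP_ADD_U32 => {
            let b = data.get(..4)?;
            *ip += 5;
            Some(ToyInstr::AddU32(u32::from_le_bytes([b[0], b[1], b[2], b[3]])))
        }
        _ => None,
    }
}

/// Dispatch loop: consumes instructions until the buffer is exhausted.
pub fn execute(code: &[u8]) -> u64 {
    let mut ip = 0;
    let mut result: u64 = 0;
    while let Some(instr) = decode(code, &mut ip) {
        // In the design described above, this loop is also where a check
        // for triggering garbage collection would sit.
        match instr {
            ToyInstr::LoadU16(value) => result = u64::from(value),
            ToyInstr::AddU32(value) => result += u64::from(value),
        }
    }
    result
}
```

For instance, the buffer `[0x01, 0x02, 0x01, 0x02, 0x01, 0x00, 0x00, 0x00]` decodes as "load `0x0102`" followed by "add `1`".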
+ +[eval]: https://tc39.es/ecma262/#sec-eval-x +[Execution Context]: https://tc39.es/ecma262/#sec-execution-contexts +[Function]: https://tc39.es/ecma262/#sec-ecmascript-function-objects +[Module]: https://tc39.es/ecma262/#sec-source-text-module-records +[Reference Record]: https://tc39.es/ecma262/#sec-reference-record-specification-type +[Script]: https://tc39.es/ecma262/#sec-scripts diff --git a/nova_vm/src/engine/bytecode/vm.rs b/nova_vm/src/engine/bytecode/vm.rs index 1f5f54658..510bc83b9 100644 --- a/nova_vm/src/engine/bytecode/vm.rs +++ b/nova_vm/src/engine/bytecode/vm.rs @@ -97,21 +97,20 @@ pub(crate) struct Vm { #[derive(Debug)] pub(crate) struct SuspendedVm { ip: usize, - /// Note: Stack is non-empty only if the code awaits inside a call - /// expression. This is reasonably rare that we can expect the stack to - /// usually be empty. In this case this Box is an empty dangling pointer - /// and no heap data clone is required. + /// Note: Stack is empty only if the code contains no local variables + /// optimised into stack slots or temporarily stored Values. A heap clone is + /// probably often performed by the `stack.into_boxed_slice()` call. stack: Box<[Value<'static>]>, - /// Note: Reference stack is non-empty only if the code awaits inside a - /// call expression. This means that usually no heap data clone is - /// required. + /// Note: Reference stack is non-empty only if the code awaits inside a call + /// expression. This means that usually no heap data clone is required. reference_stack: Box<[Reference<'static>]>, /// Note: Iterator stack is non-empty only if the code awaits inside a /// for-in or for-of loop. This means that often no heap data clone is /// required. iterator_stack: Box<[VmIteratorRecord<'static>]>, - /// Note: Exception jump stack is non-empty only if the code awaits inside - /// a try block. This means that often no heap data clone is required. 
+ /// Note: Exception jump stack is non-empty only if the code awaits inside a + /// try block or an await for-of loop. This means that often no heap data + /// clone is required. exception_jump_target_stack: Box<[ExceptionHandler<'static>]>, } @@ -264,7 +263,7 @@ impl Vm { eprintln!(); } - pub(crate) fn resume<'gc>( + fn resume<'gc>( mut self, agent: &mut Agent, executable: Scoped, @@ -275,7 +274,7 @@ impl Vm { self.inner_execute(agent, executable, gc) } - pub(crate) fn resume_throw<'gc>( + fn resume_throw<'gc>( mut self, agent: &mut Agent, executable: Scoped, diff --git a/nova_vm/src/heap/README.md b/nova_vm/src/heap/README.md index 365bf67a2..6d955916d 100644 --- a/nova_vm/src/heap/README.md +++ b/nova_vm/src/heap/README.md @@ -12,7 +12,7 @@ good idea of the data and common actions on it. So what are common actions that a JavaScript runtime does? - Calling functions -- Accessing properties of objects +- Accessing named properties of objects - Adding and manipulating object properties - Accessing indexed properties of arrays - Iterating arrays @@ -21,10 +21,12 @@ What are some uncommon actions? - Deleting object properties: Hashmap-like object are rare in modern JavaScript. - Accessing or defining property descriptors. -- Accessing or assigning named (non-indexed) properties on arrays. -- Calling getter or setter functions. +- Accessing or assigning named properties on arrays (except `length`). +- Calling getter or setter functions (compared to reading or writing data + properties). - Accessing the length or name of a function. -- Adding, manipulating, or deleting properties on functions. +- Adding, manipulating, or deleting properties on functions (except for + classes). - Adding, manipulating, or deleting properties on ArrayBuffers, Uint8Arrays, DataViews, Dates, RegExps, ... Most builtin objects are used for their named purpose only and beyond that are left alone. @@ -39,11 +41,11 @@ So what we can gather from this is that 1. Function calls are very important. 
Properties of functions are not very important. 1. Objects mostly need good access to their keys and values. Their prototype is - mostly secondary, and with hidden classes / shapes the keys actually become - secondary as well. This should be no big surprise, as this is exactly what - structs are in system programming languages. + mostly secondary, and with "hidden class" / "shape" optimisations the keys + actually become secondary as well. This should be no big surprise, as this is + exactly what structs are in systems programming languages. 1. Property descriptors are not very important. -1. Quick iteration over array elements is very important. +1. Quick access into and iteration over arrays is very important. From an engine standpoint the only common action that is done outside of JavaScript is the garbage collection. The garbage collection mostly cares about @@ -52,11 +54,11 @@ quick access to JavaScript heap values and efficient iteration over them. Thus our heap design starts to form. To help the garbage collection along we want to place like elements one after the other. As much as possible, we want to give the CPU an easy time in guessing where we'll be accessing data next. This -then means both that vectors are our friend but also we want to avoid vectors of -`dyn Trait` and instead want vectors of concrete structs even if we end up with +then means both that vectors are our friend and that we want to avoid vectors of +`dyn Trait`; vectors of concrete types is what we want, even if we end up with more vectors. -To avoid unnecessary memory usage for eg. arrays, ArrayBuffers, RegExps etc. +To avoid unnecessary memory usage for eg. Arrays, ArrayBuffers, RegExps etc. we'd want to avoid creating their "object" side if its not going to be used. This then means that the "object" side of these must be separate from their "business" side. 
An ArrayBuffer will need to have a pointer to a raw memory @@ -64,11 +66,11 @@ buffer somewhere and that is its most common usage so it makes sense to put that in its "business" side but we do not need to keep its property key-value pairs close at hand. -So we put object base heap data one after the other somewhere. We could refer to -these by pointer but that takes 8 bytes of memory and Rust doesn't particularly -like raw pointers, nor does it like `&'static` references. So direct references -are not a good idea. If we use heap vectors we can instead use indexes into -these vectors! A vector is `usize` indexed but a JS engine will never run into 4 +So we put ordinary object data in a vector of their own. We could refer to these +by pointer but that takes 8 bytes of memory and Rust doesn't particularly like +raw pointers, nor does it like `&'static` references. So direct references are +not a good idea. If we use heap vectors we can instead use indexes into these +vectors! A vector is `usize` indexed but a JS engine will never run into 4 billion objects or numbers or strings. A `u32` index is thus sufficient. It then makes sense to put other heap data in vectors as well. So we have @@ -85,7 +87,7 @@ locality! In a simple, object-oriented design an Object looks like follows: -```rs +```rust struct Object { prototype: Option>, properties: HashMap, Gc>, @@ -94,7 +96,7 @@ struct Object { and an Array would be: -```rs +```rust struct Array { base: Object, elements: Vec>, @@ -107,7 +109,7 @@ As we noted above, named properties on arrays are rare as are changing prototypes. Hence, we do not actually need the "base" very often. 
We can thus reduce memory usage by changing Array to: -```rs +```rust struct Array { base: Option>, elements: Vec>, @@ -116,7 +118,7 @@ struct Array { Now most Arrays will avoid creating the object base entirely and will instead rely on the knowledge that an Array without a base contains no named properties -(except length, which can be synthesised from elements) and has +(except `length`, which can be synthesised from elements) and has `Array.prototype` as its prototype. (Note: An Array without a base does actually need to know which Realm it was created in, so the above struct is not quite adequate.) @@ -126,11 +128,11 @@ reduce memory usage, but it does not yet give us the benefit for iterating over objects that we want. For that, we need to split the Array into parts and put them into parallel vectors: -```rs +```rust struct BaseObject(Option); struct Elements(Vec>); -let arrays: ParallelVec<(BaseObject, Elements)>; +let arrays: SoAVec<(BaseObject, Elements)>; struct Array(usize); // Array is now just an index into the arrays vector. ``` @@ -173,7 +175,7 @@ though: algorithm. Then Boa: Boa uses a modified version of the [gc](https://crates.io/crates/gc) -crate, which is a tracing garbage collector based on a `Trace`, trait and +crate, which is a tracing garbage collector based on a `Trace` trait, and created using basic Rust structs. Thus it follows that: 1. The Boa heap is not located in a defined area. Each heap object is allocated @@ -229,11 +231,11 @@ Still, V8 does have some disadvantages as well. So, what can we expect from Nova's heap? First the advantages: -1. All objects are of the same size and can thus be placed in a vector, - benefiting from cache locality and amortisation of allocations. -1. All references to and within the heap data can be done with a 32 bit integer - index and a type: The type tells which array to access from and the index - gives the offset. +1. 
All objects of a given type are of the same size and can thus be placed in a + vector, benefiting from cache locality and amortisation of allocations. +1. All references to and within the heap data can be done using a 32-bit handle + and a type: The type tells which vector to access from and the handle gives + the offset. 1. Placing heap data "by type and use" in a DOD fashion allows eg. `Uint8Array`s to not include any of the data that an object has. 1. Garbage collection must be done on the heap level which quite logically lends @@ -262,7 +264,7 @@ There are disadvantages as well. ## Ownership, Rust and the borrow checker We've established that our heap will consist of vectors that contain heap data -that refer to one another using a type + index pair (an enum, essentially). The +that refer to one another using a type + handle pair (an enum, essentially). The question one might then ask is, what does the Rust borrow checker think of this? There are two immediate answers: @@ -283,9 +285,10 @@ We do not want to do reference counting, so that way is barred to us. But we cannot do references either, the borrow checker will not allow such a thing. So it seems like the borrow checker does not like what we're doing at all. -But remember, we're not doing references (ie. pointers), we're doing indexes. So +But remember, we're not doing references (ie. pointers), we're doing handles. So instead of the above `struct A` with an internal reference we have -`struct A { key: (u8, u32) }` where the `u8` is the type information and `u32` +`struct A { +key: (u8, u32) }` where the `u8` is the type information and `u32` the index. These imply no referential ownership and thus the borrow check does not care at all. @@ -306,7 +309,14 @@ memory corruption with this, even if we do cause heap corruption. 
That is exactly how we want it to be as well: As said, JavaScript's ownership model cannot be represented in a way that would satisfy the Rust borrow checker -and thus it would be a waste of time to even try. +and it would be a waste of time to even try. Consider this as well: when an +object is ready to be garbage collected from JavaScript's point of view, it +must be the engine that frees or reuses the object's backing memory. It should +therefore be immediately obvious that attempting to track "actual memory +ownership" of garbage collected data using normal Rust references is a +fundamental misunderstanding, as the garbage collected references must never +free the memory they point at and therefore do not _own_ that memory in any +sense of the word. So summing up: Our heap model does not try to present the JavaScript ownership model to Rust's borrow checker in any way or form. Instead we represent to the @@ -323,9 +333,9 @@ Here're the broad strokes of it: 1. Starting with roots (global values as well as local values currently held by the mutator thread, ie. the JavaScript thread), mark the heap data entries corresponding to those roots. Marks are entered in a separate vector of mark - bytes (or possibly bits). + bits. 2. For each marked entry, trace all its referred heap data entries. Recurse. -3. Once no more work is left to be done, walk through the vector of mark bytes +3. Once no more work is left to be done, walk through the vector of mark bits and note each unmarked slot in the heap vectors. Gather up a list of compactions (eg. Starting at index 30 in object heap vector, shift down 2, starting at index 45 shift down 3, ...)
and the number of marked elements diff --git a/tests/expectations.json b/tests/expectations.json index a3d0d1bb1..298bfcb8c 100644 --- a/tests/expectations.json +++ b/tests/expectations.json @@ -7026,4 +7026,4 @@ "staging/sm/syntax/yield-as-identifier.js": "FAIL", "staging/source-phase-imports/import-source-source-text-module.js": "FAIL", "staging/top-level-await/tla-hang-entry.js": "FAIL" -} \ No newline at end of file +} diff --git a/tests/metrics.json b/tests/metrics.json index 03b9dac85..7588cceb5 100644 --- a/tests/metrics.json +++ b/tests/metrics.json @@ -8,4 +8,4 @@ "unresolved": 37 }, "total": 50733 -} \ No newline at end of file +} diff --git a/tracing/README.md b/tracing/README.md index be5e637a8..c7ffdf6a4 100644 --- a/tracing/README.md +++ b/tracing/README.md @@ -1,4 +1,4 @@ # Scripts for bpftrace and DTrace -This folder contains useful scripts for debugging or performance measurements -of Nova VM written in the bpftrace language or the DTrace D language. +This folder contains useful scripts for debugging or performance measurements of +Nova VM written in the bpftrace language or the DTrace D language.