ProteusLab · uslsteen · Dec 18, 2024 · Dec 15, 2024 · Dec 15, 2024 · Dec 15, 2024
diff --git a/README.md b/README.md
@@ -0,0 +1,55 @@
+# CPU & OS Simulation Elective Course
+
+## What is the course about?
+
+This repository accompanies the CPU and OS simulation elective course taught in the third bachelor semester at MIPT.
+
+All teaching materials used during the semester are [here](slides/).
+
+## Demo Code
+
+You can find the [simulator library](lib/) and the [test generation script](test/) here.
+
+During the course we improve our toy simulator model step by step:
+
+1. [Naive Interpreter](naive_interpreter/sim.cc)
+2. [Inline Assembly model](inline_assembly/sim.cc)
+3. [AsmJit Assembly model](asmjit_assembly/sim.cc)
+4. [JIT translator](jit_translator/sim.cc)
+
+Here are the results of a comparison of four different models on a [test](test/code.hpp) in which we varied the number of loop iterations (aka LC):
+
+- MIPS:
+![img](pics/bench-mips.png)
+
+- Time, seconds:
+![img](pics/bench-time.png)
+
+## Slides
+
+1.  [Introduction](slides/00_Introduction.pdf)
+2.  [Software Modeling](slides/01_Software_Modeling.pdf)
+3.  [Interpreters](slides/02_Interpreters.pdf)
+4.  [Decoder](slides/03_Decoder.pdf)
+5.  [ELF](slides/04_ELF.pdf)
+6.  [Advanced Interpreters](slides/05_Interpreter+.pdf)
+7.  [Full-System Simulation](slides/06_FSS.pdf)
+8.  [Trace-Driven Simulation](slides/07_TDS.pdf)
+9.  [Cycle-Accurate Models](slides/08_CA_models.pdf)
+10. [Caches](slides/09_Caches.pdf)
+11. [Program Execution Analysis](slides/10_Program_Execution_Analysis.pdf)
+
+## Usage
+
+From the root of the source directory, configure the build:
+
+```bash
+mkdir -p build/
+cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
+```
+
+Then run the build:
+
+```bash
+cmake --build build/ -j "$(nproc)"
+```
diff --git a/asmjit_assembly/README.md b/asmjit_assembly/README.md
@@ -0,0 +1,62 @@
+## AsmJit Interpreter
+
+Implementing semantics using inline assembler is definitely a step backwards in terms of the evolution of programming languages. We certainly wouldn't do it that way. However, the optimization worked.
+
+Maybe there is some way to make this easier? Maybe there are already libraries for generating host code from C++ code?
+
+- [LLVM](https://llvm.org/doxygen/classllvm_1_1IRBuilder.html)
+- [MCJIT](https://llvm.org/docs/MCJITDesignAndImplementation.html)
+- [AsmJit](https://asmjit.com/)
+
+Note that LLVM is a huge compiler project, while AsmJit is a lightweight library for JIT compilation. We suggest taking a look at the short summary from ChatGPT below and moving on.
+
+![img](../pics/llvm-vs-asmjit.png)
+
+The assembly capsule for the addition operation from our toy ISA now looks like this:
+
+```cpp
+case isa::Opcode::kAdd: {
+    assembler.mov(asmjit::x86::eax, cpu.regs[insn.src1]);
+    assembler.add(asmjit::x86::eax, cpu.regs[insn.src2]);
+    //
+    assembler.mov(
+        asmjit::x86::dword_ptr((size_t)(&(cpu.regs[insn.dst]))),
+        asmjit::x86::eax);
+
+    assembler.mov(asmjit::x86::eax, cpu.pc);
+    assembler.add(asmjit::x86::eax, 1);
+
+    assembler.mov(asmjit::x86::dword_ptr((size_t)(&cpu.pc)),
+                    asmjit::x86::eax);
+
+    assembler.ret();
+    break;
+}
+```
+
+We would like to draw your attention to two details:
+- The capsule changes the state of the entire model, not just the memory.
+- The capsule must end with a return (`ret`).
+
+If you keep these two points in mind, you will most likely avoid a segfault.
+
+Let's benchmark this implementation and ...
+
+### Benchmark
+
+This is a failure. Even a naive interpreter compiled without optimizations (-O0) reaches 40 MIPS.
+
+- MIPS:
+![img](../pics/asmjit-interp-mips.png)
+
+- Why does this happen? There are two main reasons:
+    - Redundant calls of the host code emitter
+    - Completing the execution of every guest instruction separately, which is redundant but vital for this design
+
+- Time, seconds:
+![img](../pics/asmjit-interp-time.png)
+
+Is there any way to reduce the excessive number of emitter calls?
+
+Of course there is. First, you need to understand the [technology of dynamic binary translation](../slides/05_Interpreter+.pdf) in detail.
+Then we will try to [improve our model](../jit_translator/).
diff --git a/inline_assembly/README.md b/inline_assembly/README.md
@@ -0,0 +1,49 @@
+## Inline Assembly Interpreter
+
+While coding the execution stage, you probably thought about at least one of these points:
+
+1. It's too boring to code
+
+2. It lends itself to templates
+
+3. It's definitely not the most efficient approach to simulation
+
+Therefore, we introduce a new concept.
+A capsule is a block of host code mapped to one guest instruction.
+
+![img](../pics/template-capsule.png)
+
+As mentioned, the interpreter is:
+- Easy to implement and modify
+- Simple but slow
+
+Let's try to apply capsule technology to our toy ISA.
+Of course, our code will not become simpler than the previous version; quite the opposite. However, this stage of the simulator's evolution is an important step for understanding what follows.
+
+As a first approximation, we will do this in the crudest way: [inline assembly and the preprocessor](capsule_asm.hh).
+
+```cpp
+#define ARITHM_CAPSULE(op, dst, src1, src2)         \
+    asm volatile("mov %[rs1], %%eax\n\t" op         \
+                 " %[rs2], %%eax\n\t"               \
+                 "mov %%eax, %[rd]\n\t"             \
+                 : [rd] "=r"(dst)                   \
+                 : [rs1] "r"(src1), [rs2] "r"(src2) \
+                 : "%eax");
+```
+
+So, let's benchmark our optimization.
+
+### Benchmark
+
+We see that the optimization and the struggle with inline assembly were not in vain. The model exceeded 40 MIPS, but not by much.
+
+- MIPS:
+![img](../pics/inline-asm-mips.png)
+
+Moreover, on average, it works almost at the same speed as the naive interpretation.
+
+- Time, seconds:
+![img](../pics/inline-asm-time.png)
+
+These results indicate that we need to look for further optimizations of execution.
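Note on `inline_assembly/README.md`: a minimal sketch of how the `ARITHM_CAPSULE` macro shown above might be wrapped and called. The macro text is copied from the README; the wrappers `capsule_add`, `capsule_sub`, `check_capsules` and the `HAVE_X86_CAPSULE` switch are hypothetical names introduced only for this illustration. The sketch assumes GCC or Clang on x86 and falls back to plain C++ elsewhere.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_CAPSULE 1
// Macro as shown in the README (GCC/Clang extended asm, AT&T syntax).
#define ARITHM_CAPSULE(op, dst, src1, src2)         \
    asm volatile("mov %[rs1], %%eax\n\t" op         \
                 " %[rs2], %%eax\n\t"               \
                 "mov %%eax, %[rd]\n\t"             \
                 : [rd] "=r"(dst)                   \
                 : [rs1] "r"(src1), [rs2] "r"(src2) \
                 : "%eax");
#else
#define HAVE_X86_CAPSULE 0  // not x86: use the plain C++ fallback below
#endif

// Hypothetical wrapper: dst = src1 + src2 through the capsule.
std::uint32_t capsule_add(std::uint32_t src1, std::uint32_t src2) {
    std::uint32_t dst = 0;
#if HAVE_X86_CAPSULE
    ARITHM_CAPSULE("add", dst, src1, src2)
#else
    dst = src1 + src2;
#endif
    return dst;
}

// Hypothetical wrapper: dst = src1 - src2 (AT&T "sub src, dst" computes dst -= src).
std::uint32_t capsule_sub(std::uint32_t src1, std::uint32_t src2) {
    std::uint32_t dst = 0;
#if HAVE_X86_CAPSULE
    ARITHM_CAPSULE("sub", dst, src1, src2)
#else
    dst = src1 - src2;
#endif
    return dst;
}

// Usage example: both capsules behave like ordinary 32-bit arithmetic.
void check_capsules() {
    assert(capsule_add(2, 3) == 5);
    assert(capsule_sub(7, 4) == 3);
}
```

The operands are 32-bit on purpose: the macro moves values through `%eax`, so wider types would not match the register size.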
diff --git a/jit_translator/README.md b/jit_translator/README.md
@@ -0,0 +1,40 @@
+## Dynamic Binary Translation (JIT)
+**Binary translation** optimization is based on the idea of "compiling" blocks of target code once and then reusing the results of this work many times. This eliminates the need to interpret instructions at each step of execution.
+
+![img](../pics/dynamic-bin-translation.png)
+
+The guest code is transformed into translation blocks. Each translation block has at least one entry point - the address from which execution starts.
+
+**Recap**: The block of host code obtained for one guest instruction is called a capsule.
+
+**Dynamic binary translation** is an approach in which guest system simulation is interspersed with binary translation engine runs for new code blocks that will soon be executed, as well as translation updates for blocks that have changed their contents.
+
+There are different approaches to organizing translation units in dynamic binary translation:
+
+* **Traces**: The first execution of each guest instruction is performed by the naive interpreter; at the same time the instruction is translated and the result is stored in the trace.
+
+![img](../pics/dbt-traces.png)
+
+* **Pages**: Instructions located at adjacent memory addresses are likely to belong to related parts of the program, are likely to be executed together, and can therefore be translated into one block. Translated sections of host code are reused for previously executed blocks.
+
+![img](../pics/dbt-pages.png)
+
+In [our implementation](sim.cc) we will use a special case of the first approach, code translation by basic blocks.
+
+**Recap**: A basic block is a straight-line code sequence with a single entry point at the beginning and a single exit instruction at the end.
+
+### Benchmark
+
+Let's evaluate the performance of the new approach.
+
+- MIPS:
+![img](../pics/jit-translator-mips.png)
+
+So, on our synthetic test it showed an impressive gain in MIPS.
+Two things are apparent from the bar chart:
+
+- The performance gain becomes significant when the translation overhead is offset by the execution speed of the translated basic blocks.
+- On average, the model with AsmJit translation reaches about 350 MIPS. To get more, we need to apply other optimizations.
+
+- Time, seconds:
+![img](../pics/jit-translator-time.png)
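Note on `jit_translator/README.md`: a minimal sketch of the "translate once, reuse many times" idea with basic blocks as translation units. It is not the code from `sim.cc`: no host code is emitted, and a "translated block" is simply the block's decoded instructions stored in a cache keyed by entry PC. The opcode list comes from the naive interpreter README; `Insn`, `Cpu`, `Translator` and the semantics of `kBeq` (branch to `src2` when `regs[dst] == regs[src1]`) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Opcode list as in naive_interpreter/README.md; all other names are hypothetical.
enum class Opcode : std::uint8_t { kUnknown = 0, kAdd, kHalt, kJump, kLoad, kStore, kBeq };

struct Insn {
    Opcode op;
    std::uint8_t dst, src1, src2;  // src2 doubles as the branch/jump target
};

struct Cpu {
    std::uint32_t regs[16] = {};
    std::uint32_t pc = 0;
    bool halted = false;
};

// A control-flow instruction is the single exit of a basic block.
bool ends_block(Opcode op) {
    return op == Opcode::kJump || op == Opcode::kBeq || op == Opcode::kHalt;
}

// Stand-in for emitted host code: the block's decoded instructions, stored once.
using TranslatedBlock = std::vector<Insn>;

// Runs one translated block; only kAdd, kBeq, kJump and kHalt are modelled.
void execute(Cpu& cpu, const TranslatedBlock& block) {
    for (const Insn& insn : block) {
        assert(insn.dst < 16 && insn.src1 < 16);
        std::uint32_t next_pc = cpu.pc + 1;
        switch (insn.op) {
            case Opcode::kAdd:
                assert(insn.src2 < 16);
                cpu.regs[insn.dst] = cpu.regs[insn.src1] + cpu.regs[insn.src2];
                break;
            case Opcode::kBeq:
                if (cpu.regs[insn.dst] == cpu.regs[insn.src1]) next_pc = insn.src2;
                break;
            case Opcode::kJump:
                next_pc = insn.src2;
                break;
            default:  // kHalt and the opcodes this sketch does not model
                cpu.halted = true;
                return;
        }
        cpu.pc = next_pc;
    }
}

struct Translator {
    std::unordered_map<std::uint32_t, TranslatedBlock> cache;  // keyed by entry PC
    std::uint64_t translations = 0;  // how often the "emitter" had to run
    std::uint64_t block_runs = 0;    // how often a block was executed

    // Returns the cached block for `entry`, translating it on first use only.
    const TranslatedBlock& lookup(const std::vector<Insn>& program, std::uint32_t entry) {
        auto it = cache.find(entry);
        if (it != cache.end()) return it->second;
        TranslatedBlock block;
        for (std::uint32_t pc = entry; pc < program.size(); ++pc) {
            block.push_back(program[pc]);
            if (ends_block(program[pc].op)) break;
        }
        ++translations;
        return cache.emplace(entry, std::move(block)).first->second;
    }

    void run(Cpu& cpu, const std::vector<Insn>& program) {
        while (!cpu.halted && cpu.pc < program.size()) {
            const TranslatedBlock& block = lookup(program, cpu.pc);
            ++block_runs;
            execute(cpu, block);
        }
    }
};
```

For a loop program, `translations` stays equal to the number of distinct basic blocks however many iterations run, while `block_runs` grows with the loop count. This is the effect the first bullet above describes: the translation cost is paid once and amortized over repeated executions.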
diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt
@@ -1,4 +1,4 @@
-set(DIRS cpu_state isa memory decoder logger hart)
+set(DIRS cpu_state isa memory decoder logger hart timer)
 
 foreach(DIR ${DIRS})
   add_subdirectory(${DIR})

diff --git a/lib/hart/CMakeLists.txt b/lib/hart/CMakeLists.txt
@@ -1,4 +1,4 @@
 add_library(hart INTERFACE)
 target_include_directories(hart INTERFACE include)
 target_include_directories(hart INTERFACE ${CMAKE_SOURCE_DIR}/test)
-target_link_libraries(hart INTERFACE memory cpu_state decoder logger)
+target_link_libraries(hart INTERFACE memory cpu_state decoder logger timer)
diff --git a/lib/hart/include/sim/hart.hh b/lib/hart/include/sim/hart.hh
@@ -7,6 +7,8 @@
 #include "sim/cpu_state.hh"
 #include "sim/logger.hh"
 #include "sim/memory.hh"
+//
+#include "timer.hh"
 
 namespace sim {
 
@@ -115,8 +117,12 @@ struct Hart {
 };
 
 void do_sim(Hart* hart, const std::vector<uint32_t>& program) {
+    Time::Timer timer{};
+    //
     hart->load(program);
     hart->run();
+    std::cout << "Time elapsed in microseconds "
+              << timer.elapsed<std::chrono::microseconds>() << std::endl;
 }
 
 }  // namespace sim

diff --git a/lib/timer/CMakeLists.txt b/lib/timer/CMakeLists.txt
@@ -0,0 +1,2 @@
+add_library(timer INTERFACE)
+target_include_directories(timer INTERFACE include)
diff --git a/lib/timer/include/timer.hh b/lib/timer/include/timer.hh
@@ -0,0 +1,26 @@
+#ifndef __LIB_INCLUDE_TIMER_HH__
+#define __LIB_INCLUDE_TIMER_HH__
+
+#include <chrono>
+
+namespace Time {
+using std::chrono::microseconds;
+
+class Timer final {
+    using clock_t = std::chrono::high_resolution_clock;
+
+    std::chrono::time_point<clock_t> beg;
+
+public:
+    Timer() : beg(clock_t::now()) {}
+
+    void reset_time() { beg = clock_t::now(); }
+
+    template <typename T>
+    double elapsed() const {
+        return std::chrono::duration_cast<T>(clock_t::now() - beg).count();
+    }
+};
+}  // namespace Time
+
+#endif  // __LIB_INCLUDE_TIMER_HH__
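Note on `lib/timer/include/timer.hh`: a minimal usage sketch for `Time::Timer`. The class body is copied from the header above so that the block compiles on its own; `busy_sum`, `time_workload_us` and `to_mips` are hypothetical helpers. Deriving MIPS as instruction count divided by elapsed microseconds is one plausible way to combine the `Icount` and elapsed-time lines printed by this change, not necessarily how the benchmark charts were produced.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

namespace Time {
// Copied from timer.hh in this change.
class Timer final {
    using clock_t = std::chrono::high_resolution_clock;

    std::chrono::time_point<clock_t> beg;

public:
    Timer() : beg(clock_t::now()) {}

    void reset_time() { beg = clock_t::now(); }

    template <typename T>
    double elapsed() const {
        return std::chrono::duration_cast<T>(clock_t::now() - beg).count();
    }
};
}  // namespace Time

// Hypothetical workload: sums 0 .. n-1; volatile keeps the loop from being folded away.
std::uint64_t busy_sum(std::uint64_t n) {
    volatile std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) sum = sum + i;
    return sum;
}

// Times the workload the same way do_sim() times hart->run().
double time_workload_us(std::uint64_t n, std::uint64_t& result) {
    Time::Timer timer{};
    result = busy_sum(n);
    return timer.elapsed<std::chrono::microseconds>();
}

// Million instructions per second: icount / (seconds * 1e6) == icount / microseconds.
double to_mips(std::uint64_t icount, double elapsed_us) {
    assert(elapsed_us > 0.0);
    return static_cast<double>(icount) / elapsed_us;
}
```

`std::chrono::high_resolution_clock` is not guaranteed to be steady, so `std::chrono::steady_clock` may be a safer choice for interval measurements. Very short runs can also report zero microseconds, which is why `to_mips` rejects a non-positive duration.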
diff --git a/naive_interpreter/README.md b/naive_interpreter/README.md
@@ -0,0 +1,47 @@
+## Naive Interpreter
+
+Interpretation in the general meaning of the word is the translation of text from one language to another.
+
+![img](../pics/interpreter.png)
+
+In this section, we present a functional model of the processor for the toy RISC-like Instruction Set Architecture.
+
+Our toy ISA is described in [`isa.hh`](../lib/isa/include/sim/isa.hh).
+
+```cpp
+enum class Opcode : std::uint8_t {
+  kUnknown = 0,
+  kAdd,
+  kHalt,
+  kJump,
+  kLoad,
+  kStore,
+  kBeq,
+};
+```
+
+The interpreter's algorithm is broadly similar to the stages of the instruction execution pipeline in a real processor:
+
+1. [Fetch](../lib/hart/include/sim/hart.hh#L41)
+2. [Decode](../lib/decoder/decoder.cc#L6)
+3. [Execute](../lib/hart/include/sim/hart.hh#L48)
+4. [Write back](../lib/hart/include/sim/hart.hh#L52)
+5. [Advance Program Counter](../lib/hart/include/sim/hart.hh#L78)
+
+![img](../pics/five-stages.png)
+
+The [lecture](../slides/02_Interpreters.pdf) describes the implementation in detail.
+
+Let's look at the model's performance results so that we can later compare them with the optimized models.
+
+### Benchmark
+
+We see that on average, when implementing a naive interpreter, we get no more than **40 MIPS**.
+
+- MIPS:
+![img](../pics/naive-interp-mips.png)
+
+Not bad for a start but, honestly, the result is quite modest for such a simple instruction set.
+
+- Time, seconds:
+![img](../pics/naive-interp-time.png)
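Note on `naive_interpreter/README.md`: a minimal sketch of the five steps listed above (fetch, decode, execute, write back, advance PC) as a single loop. It is not the code from `hart.hh`. The opcode list is taken from the README; the one-byte-per-field encoding, the `Insn` and `Cpu` layouts, the helper names and the semantics of `kBeq` (branch to `src2` when `regs[dst] == regs[src1]`) are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Opcode list as in the README; every other name here is hypothetical.
enum class Opcode : std::uint8_t { kUnknown = 0, kAdd, kHalt, kJump, kLoad, kStore, kBeq };

struct Insn {
    Opcode op;
    std::uint8_t dst, src1, src2;  // src2 doubles as the branch/jump target
};

// Hypothetical encoding: one byte per field, opcode in the top byte.
constexpr std::uint32_t encode(Opcode op, std::uint8_t dst, std::uint8_t src1,
                               std::uint8_t src2) {
    return (std::uint32_t(op) << 24) | (std::uint32_t(dst) << 16) |
           (std::uint32_t(src1) << 8) | std::uint32_t(src2);
}

Insn decode(std::uint32_t raw) {
    return {static_cast<Opcode>(raw >> 24), static_cast<std::uint8_t>(raw >> 16),
            static_cast<std::uint8_t>(raw >> 8), static_cast<std::uint8_t>(raw)};
}

struct Cpu {
    std::array<std::uint32_t, 16> regs{};
    std::uint32_t pc = 0;
    std::uint64_t icount = 0;  // executed guest instructions
    bool halted = false;
};

// Bounds-checked register access.
std::uint32_t& reg(Cpu& cpu, std::uint8_t idx) {
    assert(idx < cpu.regs.size());
    return cpu.regs[idx];
}

void run(Cpu& cpu, const std::vector<std::uint32_t>& program) {
    while (!cpu.halted && cpu.pc < program.size()) {
        const std::uint32_t raw = program[cpu.pc];  // 1. fetch
        const Insn insn = decode(raw);              // 2. decode
        std::uint32_t next_pc = cpu.pc + 1;
        switch (insn.op) {                          // 3. execute, 4. write back
            case Opcode::kAdd:
                reg(cpu, insn.dst) = reg(cpu, insn.src1) + reg(cpu, insn.src2);
                break;
            case Opcode::kBeq:
                if (reg(cpu, insn.dst) == reg(cpu, insn.src1)) next_pc = insn.src2;
                break;
            case Opcode::kJump:
                next_pc = insn.src2;
                break;
            default:  // kHalt and the opcodes this sketch does not model
                cpu.halted = true;
                break;
        }
        cpu.pc = next_pc;  // 5. advance the program counter
        ++cpu.icount;
    }
}

// Counting loop: r1 += r2 until r1 == r3, then halt.
std::vector<std::uint32_t> counting_loop() {
    return {encode(Opcode::kAdd, 1, 1, 2), encode(Opcode::kBeq, 1, 3, 3),
            encode(Opcode::kJump, 0, 0, 0), encode(Opcode::kHalt, 0, 0, 0)};
}
```

With `regs[2] = 1` and `regs[3] = 5` the loop body runs five times and the model retires 15 instructions. Every one of them goes through the full fetch and decode path again, which is the per-instruction overhead the later models try to remove.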
diff --git a/naive_interpreter/sim.cc b/naive_interpreter/sim.cc
@@ -21,8 +21,7 @@ int main() {
     };
 
     sim::NaiveInertpreter model{};
-    model.set_logger("sim.log");
     sim::do_sim(&model, program);
-
     model.dump(std::cout);
+    std::cout << "Icount = " << model.icount << std::endl;
 }
diff --git a/pics/asmjit-interp-mips.png b/pics/asmjit-interp-mips.png
diff --git a/pics/asmjit-interp-time.png b/pics/asmjit-interp-time.png
diff --git a/pics/bench-mips.png b/pics/bench-mips.png
diff --git a/pics/bench-time.png b/pics/bench-time.png
diff --git a/pics/dbt-pages.png b/pics/dbt-pages.png
diff --git a/pics/dbt-traces.png b/pics/dbt-traces.png
diff --git a/pics/dynamic-bin-translation.png b/pics/dynamic-bin-translation.png
diff --git a/pics/five-stages.png b/pics/five-stages.png
diff --git a/pics/inline-asm-mips.png b/pics/inline-asm-mips.png
diff --git a/pics/inline-asm-time.png b/pics/inline-asm-time.png
diff --git a/pics/interpreter.png b/pics/interpreter.png
diff --git a/pics/jit-translator-mips.png b/pics/jit-translator-mips.png
diff --git a/pics/jit-translator-time.png b/pics/jit-translator-time.png
diff --git a/pics/llvm-vs-asmjit.png b/pics/llvm-vs-asmjit.png
diff --git a/pics/models-evaluation-mips.png b/pics/models-evaluation-mips.png
diff --git a/pics/models-evaluation-time.png b/pics/models-evaluation-time.png
diff --git a/pics/naive-interp-mips.png b/pics/naive-interp-mips.png
diff --git a/pics/naive-interp-time.png b/pics/naive-interp-time.png
diff --git a/pics/template-capsule.png b/pics/template-capsule.png
diff --git a/slides/00_Introduction.pdf b/slides/00_Introduction.pdf
diff --git a/slides/01_Software_Modeling.pdf b/slides/01_Software_Modeling.pdf
diff --git a/slides/02_Interpreters.pdf b/slides/02_Interpreters.pdf
diff --git a/slides/03_Decoder.pdf b/slides/03_Decoder.pdf
diff --git a/slides/04_ELF.pdf b/slides/04_ELF.pdf
diff --git a/slides/05_Interpreter+.pdf b/slides/05_Interpreter+.pdf
diff --git a/slides/06_FSS.pdf b/slides/06_FSS.pdf
diff --git a/slides/07_TDS.pdf b/slides/07_TDS.pdf
diff --git a/slides/08_CA_models.pdf b/slides/08_CA_models.pdf
diff --git a/slides/09_Caches.pdf b/slides/09_Caches.pdf
diff --git a/slides/10_Program_Execution_Analysis.pdf b/slides/10_Program_Execution_Analysis.pdf
diff --git a/slides/README.md b/slides/README.md
@@ -0,0 +1,39 @@
+## Internals
+
+1.  [Introduction](00_Introduction.pdf)
+    - Overview of the course and its purposes
+    - Overview of abstraction levels
+2.  [Software Modeling](01_Software_Modeling.pdf)
+    - History of software modeling
+    - Introduction of useful definitions and metrics
+    - Different types of software modeling: functional, cycle-accurate, RTL models, etc.
+3.  [Interpreters](02_Interpreters.pdf)
+    - Interpretation review
+    - Five-stage interpreter and its optimizations
+4.  [Decoder](03_Decoder.pdf)
+    - Introduction to the RISC-V architecture
+    - Decoder algorithm implementation and its optimizations
+5.  [ELF](04_ELF.pdf)
+    - Review of the Executable and Linkable Format
+    - Review of the Linux address space
+6.  [Advanced Interpreters](05_Interpreter+.pdf)
+    - Introduction to binary translation
+    - Static binary translation and its optimizations
+    - Dynamic binary translation, its application and optimizations
+7.  [Full-System Simulation](06_FSS.pdf)
+    - Review of application-mode execution
+    - Full-System Simulation and the event-driven model
+8.  [Trace-Driven Simulation](07_TDS.pdf)
+    - Introduction to trace technology and its application
+    - Definition of trace-driven simulation
+    - ChampSim overview
+9.  [Cycle-Accurate Models](08_CA_models.pdf)
+    - Introduction to cycle-accurate models
+    - Software implementation details of CA models
+10. [Caches](09_Caches.pdf)
+    - Introduction to the concept and structure of caches
+    - Cache memory modeling and its corner cases
+11. [Program Execution Analysis](10_Program_Execution_Analysis.pdf)
+    - Overview of Dynamic Binary Analysis
+    - Valgrind implementation details
+    - How to write your own Valgrind tool