diff --git a/README.md b/README.md
index e69de29..3ca070a 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,55 @@
+# CPU & OS simulation Elective Course
+
+## What is the course about?
+
+This course repository is dedicated to CPU and OS simulation, taught in the third semester of the bachelor's program at MIPT.
+
+All teaching materials used during the semester are [here](slides/).
+
+## Demo Code
+
+You can find the [simulator library](lib/) and the [test generation script](test/) here.
+
+During the course we incrementally improve our toy simulator model:
+
+1. [Naive Interpreter](naive_interpreter/sim.cc)
+2. [Inline Assembly model](inline_assembly/sim.cc)
+3. [AsmJit Assembly model](asmjit_assembly/sim.cc)
+4. [JIT translator](jit_translator/sim.cc)
+
+Here are the results of comparing the four models on a [test](test/code.hpp) in which we varied the number of loop iterations (aka LC):
+
+- MIPS:
+![img](pics/bench-mips.png)
+
+- Time, seconds:
+![img](pics/bench-time.png)
+
+## Slides
+
+1. [Introduction](slides/00_Introduction.pdf)
+2. [Software Modeling](slides/01_Software_Modeling.pdf)
+3. [Interpreters](slides/02_Interpreters.pdf)
+4. [Decoder](slides/03_Decoder.pdf)
+5. [ELF](slides/04_ELF.pdf)
+6. [Advanced Interpreters](slides/05_Interpreter+.pdf)
+7. [Full-System Simulation](slides/06_FSS.pdf)
+8. [Trace-Driven Simulation](slides/07_TDS.pdf)
+9. [Cycle-Accurate Models](slides/08_CA_models.pdf)
+10. [Caches](slides/09_Caches.pdf)
+11. [Program Execution Analysis](slides/10_Program_Execution_Analysis.pdf)
+
+## Usage
+
+From the root of the source directory, configure the build:
+
+```bash
+mkdir -p build/
+cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
+```
+
+Then run the build:
+
+```bash
+cmake --build build/ -j
+```
diff --git a/asmjit_assembly/README.md b/asmjit_assembly/README.md
index e69de29..f9f3d01 100644
--- a/asmjit_assembly/README.md
+++ b/asmjit_assembly/README.md
@@ -0,0 +1,62 @@
+## AsmJit Interpreter
+
+Implementing semantics with inline assembly is definitely a step backwards in terms of the evolution of programming languages, and we certainly wouldn't do it that way in practice. However, the optimization worked.
+
+Maybe there is some way to make this easier? Maybe there are already libraries for generating host code from C++ code?
+
+- [LLVM](https://llvm.org/doxygen/classllvm_1_1IRBuilder.html)
+- [MJIT](https://llvm.org/docs/MCJITDesignAndImplementation.html)
+- [AsmJit](https://asmjit.com/)
+
+Note that LLVM is a huge compiler project, whereas AsmJit is a lightweight library for JIT compilation. We suggest taking a look at the short summary from ChatGPT below and moving on.
+
+![img](../pics/llvm-vs-asmjit.png)
+
+The assembly capsule for the addition operation from our toy ISA now looks like this:
+
+```cpp
+case isa::Opcode::kAdd: {
+  assembler.mov(asmjit::x86::eax, cpu.regs[insn.src1]);
+  assembler.add(asmjit::x86::eax, cpu.regs[insn.src2]);
+  // Write the sum back to the destination guest register
+  assembler.mov(
+      asmjit::x86::dword_ptr((size_t)(&(cpu.regs[insn.dst]))),
+      asmjit::x86::eax);
+
+  assembler.mov(asmjit::x86::eax, cpu.pc);
+  assembler.add(asmjit::x86::eax, 1);
+
+  assembler.mov(asmjit::x86::dword_ptr((size_t)(&cpu.pc)),
+                asmjit::x86::eax);
+
+  assembler.ret();
+  break;
+}
+```
+
+We would like to draw your attention to two details:
+- The capsule changes the state of the entire model, not just the memory.
+- When leaving the capsule, emit a return.
+
+If you keep these two points in mind, you will most likely avoid a segfault.
+
+Let's benchmark this implementation and ...
+
+### Benchmark
+
+This is a failure. Even a naive interpreter compiled without optimizations (-O0) reaches 40 MIPS.
+
+- MIPS:
+![img](../pics/asmjit-interp-mips.png)
+
+- Why does this happen? There are two main reasons:
+  - Redundant calls of the host code emitter
+  - A redundant but vital return to complete the execution of every guest instruction
+
+- Time, seconds:
+![img](../pics/asmjit-interp-time.png)
+
+Is there any way we can reduce the excessive number of emitter calls?
+
+Of course there is. First, you'll have to understand the [technology of dynamic binary translation](../slides/05_Interpreter+.pdf) in detail.
+Then we will try to [improve our model](../jit_translator/).
diff --git a/inline_assembly/README.md b/inline_assembly/README.md
index e69de29..a976715 100644
--- a/inline_assembly/README.md
+++ b/inline_assembly/README.md
@@ -0,0 +1,49 @@
+## Inline Assembly Interpreter
+
+While coding the execution stage, you probably thought about at least one of these points:
+
+1. It's too boring to code
+
+2. It lends itself to templates
+
+3. It's definitely not the most efficient way to simulate
+
+Therefore, we introduce a new concept.
+A capsule is a block of host code mapped to one guest instruction.
+
+![img](../pics/template-capsule.png)
+
+As mentioned, the interpreter is:
+- Easy to implement and modify
+- Simple but slow
+
+Let's try to apply capsule technology to our toy ISA.
+Of course, our code will not become simpler than the previous version; quite the opposite. However, this stage in the evolution of the simulator is an important step towards understanding what follows.
+
+As a first approximation, we will do this in the crudest way: [inline assembly and the preprocessor](capsule_asm.hh).
+
+```cpp
+#define ARITHM_CAPSULE(op, dst, src1, src2)                 \
+  asm volatile("mov %[rs1], %%eax\n\t" op                   \
+               " %[rs2], %%eax\n\t"                         \
+               "mov %%eax, %[rd]\n\t"                       \
+               : [rd] "=r"(dst)                             \
+               : [rs1] "r"(src1), [rs2] "r"(src2)           \
+               : "%eax");
+```
+
+So, let's benchmark our optimization.
+
+### Benchmark
+
+We see that the optimization and the suffering with inline assembly were not in vain. The model was able to exceed 40 MIPS, but not by much.
+
+- MIPS:
+![img](../pics/inline-asm-mips.png)
+
+Moreover, on average, it runs at almost the same speed as the naive interpreter.
+
+- Time, seconds:
+![img](../pics/inline-asm-time.png)
+
+These results indicate that we need to look for other ways to optimize execution.
diff --git a/jit_translator/README.md b/jit_translator/README.md
index e69de29..8734b5e 100644
--- a/jit_translator/README.md
+++ b/jit_translator/README.md
@@ -0,0 +1,40 @@
+## Dynamic Binary Translation (JIT)
+**Binary translation** is an optimization based on the idea of "compiling" blocks of target code once and then reusing the results many times. This eliminates the need to interpret instructions at each step of execution.
+
+![img](../pics/dynamic-bin-translation.png)
+
+The guest code is transformed into translation blocks. Each translation block has at least one entry point, the address from which execution starts.
+
+**Recap**: The block of host code obtained for one guest instruction is called a capsule.
+
+**Dynamic binary translation** is an approach in which guest system simulation is interspersed with runs of the binary translation engine, both for new code blocks that will soon be executed and for blocks whose contents have changed and need retranslation.
+
+There are different approaches to organizing translation units in dynamic binary translation:
+
+* **Traces**: The first execution of each guest instruction is performed by the naive interpreter; at the same time the instruction is translated and the result is stored in the trace.
+
+![img](../pics/dbt-traces.png)
+
+* **Pages**: Instructions located at adjacent memory addresses are likely to belong to related parts of the program and to be executed together, so they can be translated into one block. Translated sections of host code are reused for previously executed blocks.
+
+![img](../pics/dbt-pages.png)
+
+In [our implementation](sim.cc) we use a special case of the first approach: code translation by basic blocks.
+
+**Recap**: A basic block is a straight-line code sequence with one entry point at the beginning and only one exit instruction at the end.
+
+### Benchmark
+
+Let's evaluate the performance of the new approach.
+
+- MIPS:
+![img](../pics/jit-translator-mips.png)
+
+So, on our synthetic test it showed an amazing gain in MIPS.
+Two things are evident from the bars:
+
+- The performance gain becomes significant when the translation overhead is offset by the execution speed of the translated basic blocks.
+- On average, the model with AsmJit translation reaches about 350 MIPS. To get more, we need to apply other optimizations.
+
+- Time, seconds:
+![img](../pics/jit-translator-time.png)
diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt
index 72ea338..4f21d82 100644
--- a/lib/CMakeLists.txt
+++ b/lib/CMakeLists.txt
@@ -1,4 +1,4 @@
-set(DIRS cpu_state isa memory decoder logger hart)
+set(DIRS cpu_state isa memory decoder logger hart timer)
 
 foreach(DIR ${DIRS})
   add_subdirectory(${DIR})
diff --git a/lib/hart/CMakeLists.txt b/lib/hart/CMakeLists.txt
index 3632d75..03179ab 100644
--- a/lib/hart/CMakeLists.txt
+++ b/lib/hart/CMakeLists.txt
@@ -1,4 +1,4 @@
 add_library(hart INTERFACE)
 target_include_directories(hart INTERFACE include)
 target_include_directories(hart INTERFACE ${CMAKE_SOURCE_DIR}/test)
-target_link_libraries(hart INTERFACE memory cpu_state decoder logger)
+target_link_libraries(hart INTERFACE memory cpu_state decoder logger timer)
diff --git a/lib/hart/include/sim/hart.hh b/lib/hart/include/sim/hart.hh
index 9a9c1e5..771ff5b 100644
--- a/lib/hart/include/sim/hart.hh
+++ b/lib/hart/include/sim/hart.hh
@@ -7,6 +7,8 @@
 #include "sim/cpu_state.hh"
 #include "sim/logger.hh"
 #include "sim/memory.hh"
+//
+#include "timer.hh"
 
 namespace sim {
 
@@ -115,8 +117,12 @@ struct Hart {
 };
 
 void do_sim(Hart* hart, const std::vector& program) {
+  Time::Timer timer{};
+  //
   hart->load(program);
   hart->run();
+  std::cout << "Time elapsed in microseconds "
+            << timer.elapsed() << std::endl;
 }
 
 } // namespace sim
diff --git a/lib/timer/CMakeLists.txt b/lib/timer/CMakeLists.txt
new file mode 100644
index 0000000..68e70e7
--- /dev/null
+++ b/lib/timer/CMakeLists.txt
@@ -0,0 +1,2 @@
+add_library(timer INTERFACE)
+target_include_directories(timer INTERFACE include)
diff --git a/lib/timer/include/timer.hh b/lib/timer/include/timer.hh
new file mode 100644
index 0000000..e6232fe
--- /dev/null
+++ b/lib/timer/include/timer.hh
@@ -0,0 +1,26 @@
+#ifndef __LIB_INCLUDE_TIMER_HH__
+#define __LIB_INCLUDE_TIMER_HH__
+
+#include <chrono>
+
+namespace Time {
+using std::chrono::microseconds;
+
+class Timer final {
+  using clock_t = std::chrono::high_resolution_clock;
+
+  std::chrono::time_point<clock_t> beg;
+
+public:
+  Timer() : beg(clock_t::now()) {}
+
+  void reset_time() { beg = clock_t::now(); }
+
+  template <typename T = microseconds>
+  double elapsed() const {
+    return std::chrono::duration_cast<T>(clock_t::now() - beg).count();
+  }
+};
+} // namespace Time
+
+#endif // __LIB_INCLUDE_TIMER_HH__
diff --git a/naive_interpreter/README.md b/naive_interpreter/README.md
index e69de29..b922a3e 100644
--- a/naive_interpreter/README.md
+++ b/naive_interpreter/README.md
@@ -0,0 +1,47 @@
+## Naive Interpreter
+
+Interpretation, in the general sense of the word, is the translation of text from one language to another.
+
+![img](../pics/interpreter.png)
+
+In this section, we present a functional model of a processor for a toy RISC-like Instruction Set Architecture.
+
+Our toy ISA is described in [`isa.hh`](../lib/isa/include/sim/isa.hh).
+
+```cpp
+enum class Opcode : std::uint8_t {
+  kUnknown = 0,
+  kAdd,
+  kHalt,
+  kJump,
+  kLoad,
+  kStore,
+  kBeq,
+};
+```
+
+The algorithm generally mirrors the stages of the instruction execution pipeline in a real processor:
+
+1. [Fetch](../lib/hart/include/sim/hart.hh#L41)
+2. [Decode](../lib/decoder/decoder.cc#L6)
+3. [Execute](../lib/hart/include/sim/hart.hh#L48)
+4. [Write back](../lib/hart/include/sim/hart.hh#L52)
+5. [Advance Program Counter](../lib/hart/include/sim/hart.hh#L78)
+
+![img](../pics/five-stages.png)
+
+The [lecture](../slides/02_Interpreters.pdf) describes the implementation in detail.
+
+Let's take a look at the model's performance results for later comparison with the optimized models.
+
+### Benchmark
+
+We see that, on average, the naive interpreter reaches no more than **40 MIPS**.
+
+- MIPS:
+![img](../pics/naive-interp-mips.png)
+
+It's not bad for a start but, honestly speaking, the result is quite modest for such a simple instruction set.
+
+- Time, seconds:
+![img](../pics/naive-interp-time.png)
diff --git a/naive_interpreter/sim.cc b/naive_interpreter/sim.cc
index bf4254b..2834693 100644
--- a/naive_interpreter/sim.cc
+++ b/naive_interpreter/sim.cc
@@ -21,8 +21,7 @@ int main() {
   };
 
   sim::NaiveInertpreter model{};
-  model.set_logger("sim.log");
 
   sim::do_sim(&model, program);
-  model.dump(std::cout);
+  std::cout << "Icount = " << model.icount << std::endl;
 }
diff --git a/pics/asmjit-interp-mips.png b/pics/asmjit-interp-mips.png
new file mode 100644
index 0000000..c896b0a
Binary files /dev/null and b/pics/asmjit-interp-mips.png differ
diff --git a/pics/asmjit-interp-time.png b/pics/asmjit-interp-time.png
new file mode 100644
index 0000000..c1f1b65
Binary files /dev/null and b/pics/asmjit-interp-time.png differ
diff --git a/pics/bench-mips.png b/pics/bench-mips.png
new file mode 100644
index 0000000..15aa736
Binary files /dev/null and b/pics/bench-mips.png differ
diff --git a/pics/bench-time.png b/pics/bench-time.png
new file mode 100644
index 0000000..059ed90
Binary files /dev/null and b/pics/bench-time.png differ
diff --git a/pics/dbt-pages.png b/pics/dbt-pages.png
new file mode 100644
index 0000000..9558639
Binary files /dev/null and b/pics/dbt-pages.png differ
diff --git a/pics/dbt-traces.png b/pics/dbt-traces.png
new file mode 100644
index 0000000..0af0e5c
Binary files /dev/null and b/pics/dbt-traces.png differ
diff --git a/pics/dynamic-bin-translation.png b/pics/dynamic-bin-translation.png
new file mode 100644
index 0000000..5fe185c
Binary files /dev/null and b/pics/dynamic-bin-translation.png differ
diff --git a/pics/five-stages.png b/pics/five-stages.png
new file mode 100644
index 0000000..014d410
Binary files /dev/null and b/pics/five-stages.png differ
diff --git a/pics/inline-asm-mips.png b/pics/inline-asm-mips.png
new file mode 100644
index 0000000..d4b5283
Binary files /dev/null and b/pics/inline-asm-mips.png differ
diff --git a/pics/inline-asm-time.png b/pics/inline-asm-time.png
new file mode 100644
index 0000000..2f00e4c
Binary files /dev/null and b/pics/inline-asm-time.png differ
diff --git a/pics/interpreter.png b/pics/interpreter.png
new file mode 100644
index 0000000..c748795
Binary files /dev/null and b/pics/interpreter.png differ
diff --git a/pics/jit-translator-mips.png b/pics/jit-translator-mips.png
new file mode 100644
index 0000000..70f4ec3
Binary files /dev/null and b/pics/jit-translator-mips.png differ
diff --git a/pics/jit-translator-time.png b/pics/jit-translator-time.png
new file mode 100644
index 0000000..4167cb9
Binary files /dev/null and b/pics/jit-translator-time.png differ
diff --git a/pics/llvm-vs-asmjit.png b/pics/llvm-vs-asmjit.png
new file mode 100644
index 0000000..a913f5d
Binary files /dev/null and b/pics/llvm-vs-asmjit.png differ
diff --git a/pics/models-evaluation-mips.png b/pics/models-evaluation-mips.png
new file mode 100644
index 0000000..15aa736
Binary files /dev/null and b/pics/models-evaluation-mips.png differ
diff --git a/pics/models-evaluation-time.png b/pics/models-evaluation-time.png
new file mode 100644
index 0000000..059ed90
Binary files /dev/null and b/pics/models-evaluation-time.png differ
diff --git a/pics/naive-interp-mips.png b/pics/naive-interp-mips.png
new file mode 100644
index 0000000..3f5959c
Binary files /dev/null and b/pics/naive-interp-mips.png differ
diff --git a/pics/naive-interp-time.png b/pics/naive-interp-time.png
new file mode 100644
index 0000000..42beccd
Binary files /dev/null and b/pics/naive-interp-time.png differ
diff --git a/pics/template-capsule.png b/pics/template-capsule.png
new file mode 100644
index 0000000..db0d867
Binary files /dev/null and b/pics/template-capsule.png differ
diff --git a/slides/00_Introduction.pdf b/slides/00_Introduction.pdf
new file mode 100755
index 0000000..2081518
Binary files /dev/null and b/slides/00_Introduction.pdf differ
diff --git a/slides/01_Software_Modeling.pdf b/slides/01_Software_Modeling.pdf
new file mode 100755
index 0000000..3102d67
Binary files /dev/null and b/slides/01_Software_Modeling.pdf differ
diff --git a/slides/02_Interpreters.pdf b/slides/02_Interpreters.pdf
new file mode 100755
index 0000000..2c81625
Binary files /dev/null and b/slides/02_Interpreters.pdf differ
diff --git a/slides/03_Decoder.pdf b/slides/03_Decoder.pdf
new file mode 100644
index 0000000..49f03f2
Binary files /dev/null and b/slides/03_Decoder.pdf differ
diff --git a/slides/04_ELF.pdf b/slides/04_ELF.pdf
new file mode 100755
index 0000000..5d6a7c4
Binary files /dev/null and b/slides/04_ELF.pdf differ
diff --git a/slides/05_Interpreter+.pdf b/slides/05_Interpreter+.pdf
new file mode 100755
index 0000000..df13dd7
Binary files /dev/null and b/slides/05_Interpreter+.pdf differ
diff --git a/slides/06_FSS.pdf b/slides/06_FSS.pdf
new file mode 100755
index 0000000..7dd6964
Binary files /dev/null and b/slides/06_FSS.pdf differ
diff --git a/slides/07_TDS.pdf b/slides/07_TDS.pdf
new file mode 100755
index 0000000..493a538
Binary files /dev/null and b/slides/07_TDS.pdf differ
diff --git a/slides/08_CA_models.pdf b/slides/08_CA_models.pdf
new file mode 100755
index 0000000..8b05ba2
Binary files /dev/null and b/slides/08_CA_models.pdf differ
diff --git a/slides/09_Caches.pdf b/slides/09_Caches.pdf
new file mode 100755
index 0000000..4989055
Binary files /dev/null and b/slides/09_Caches.pdf differ
diff --git a/slides/10_Program_Execution_Analysis.pdf b/slides/10_Program_Execution_Analysis.pdf
new file mode 100755
index 0000000..bbfb2ac
Binary files /dev/null and b/slides/10_Program_Execution_Analysis.pdf differ
diff --git a/slides/README.md b/slides/README.md
new file mode 100644
index 0000000..3eb9789
--- /dev/null
+++ b/slides/README.md
@@ -0,0 +1,39 @@
+## Internals
+
+1. [Introduction](00_Introduction.pdf)
+    - Overview of the course and its purposes
+    - Overview of abstraction levels
+2. [Software Modeling](01_Software_Modeling.pdf)
+    - History of software modeling
+    - Introduction of useful definitions and metrics
+    - Different types of software modeling: functional, cycle-accurate, RTL models, etc.
+3. [Interpreters](02_Interpreters.pdf)
+    - Interpretation review
+    - The five-stage interpreter and its optimizations
+4. [Decoder](03_Decoder.pdf)
+    - Introduction to the RISC-V architecture
+    - Decoder algorithm implementation and its optimizations
+5. [ELF](04_ELF.pdf)
+    - Review of the Executable and Linkable Format
+    - Review of the Linux address space
+6. [Advanced Interpreters](05_Interpreter+.pdf)
+    - Introduction to binary translation
+    - Static binary translation and its optimizations
+    - Dynamic binary translation, its application and optimizations
+7. [Full-System Simulation](06_FSS.pdf)
+    - Review of application mode execution
+    - Full-System Simulation and the event-driven model
+8. [Trace-Driven Simulation](07_TDS.pdf)
+    - Introduction to trace technology and its application
+    - Definition of trace-driven simulation
+    - ChampSim overview
+9. [Cycle-Accurate Models](08_CA_models.pdf)
+    - Introduction to cycle-accurate models
+    - CA models software implementation details
+10. [Caches](09_Caches.pdf)
+    - Introduction to the concept and structure of caches
+    - Cache memory modeling and its corner cases
+11. [Program Execution Analysis](10_Program_Execution_Analysis.pdf)
+    - Overview of Dynamic Binary Analysis
+    - Valgrind implementation details
+    - How to write your own Valgrind tool
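
Note on the benchmark numbers: this patch makes `do_sim` print the elapsed time in microseconds and makes `naive_interpreter/sim.cc` print `Icount`, so the MIPS figures quoted in the READMEs can presumably be derived as instructions per microsecond. The block below is only a minimal standalone sketch of that idea, not the code from `lib/timer/include/timer.hh`. The `sketch` namespace, the `mips` helper, and the template parameter name `Duration` are hypothetical, and `std::chrono::steady_clock` is swapped in for `high_resolution_clock` so that the measured interval cannot go backwards.

```cpp
#include <chrono>
#include <cstdint>

namespace sketch {  // hypothetical namespace, not part of the patch

// Stand-in for Time::Timer. Assumption: steady_clock replaces the
// high_resolution_clock used in the patch, to get a monotonic clock.
class Timer final {
  using clock_type = std::chrono::steady_clock;

  std::chrono::time_point<clock_type> beg;

public:
  Timer() : beg(clock_type::now()) {}

  void reset_time() { beg = clock_type::now(); }

  // Elapsed time since construction or the last reset, expressed in
  // Duration units (microseconds unless another duration is requested).
  template <typename Duration = std::chrono::microseconds>
  double elapsed() const {
    return static_cast<double>(
        std::chrono::duration_cast<Duration>(clock_type::now() - beg).count());
  }
};

// Hypothetical helper: instructions per microsecond is numerically equal
// to millions of instructions per second (MIPS).
inline double mips(std::uint64_t icount, double elapsed_us) {
  return elapsed_us > 0.0 ? static_cast<double>(icount) / elapsed_us : 0.0;
}

}  // namespace sketch
```

With such a helper, a run that retires 40,000,000 guest instructions in 1,000,000 microseconds comes out at 40 MIPS, which matches the order of magnitude reported for the naive interpreter.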