This project contains an optimized implementation of the MD5 hashing algorithm written in x86-64 assembly.
A naive C implementation is included as a reference for benchmarking and performance comparison.
- Do not allocate the buffer on the heap
- Compute padding / preprocessing at the end of the pipeline
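
The two points above can be read together as: hash all full 64-byte blocks straight from the input, then build the padded tail in a small fixed-size buffer at the very end. The following is a minimal C sketch of that idea, not code taken from this project. The helper name `md5_pad_tail` and its signature are hypothetical, and the caller is assumed to have already processed every full block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from this project): builds the final MD5
 * block(s) for a message whose full 64-byte blocks were already hashed.
 *   tail      - the remaining bytes of the message
 *   tail_len  - number of remaining bytes, must be < 64
 *   total_len - length of the whole message in bytes
 *   out       - caller-provided 128-byte buffer, e.g. a local array,
 *               so nothing is allocated on the heap
 * Returns the number of 64-byte blocks written to out (1 or 2). */
size_t md5_pad_tail(const uint8_t *tail, size_t tail_len,
                    uint64_t total_len, uint8_t out[128])
{
    assert(tail_len < 64);

    /* One block suffices only if the 0x80 byte and the 8-byte length
     * field both fit after the tail. */
    size_t blocks = (tail_len < 56) ? 1 : 2;
    size_t end = blocks * 64;

    memset(out, 0, 128);
    memcpy(out, tail, tail_len);
    out[tail_len] = 0x80; /* a single 1 bit followed by zero bits */

    /* Message length in bits, little-endian, in the last 8 bytes. */
    uint64_t bit_len = total_len * 8;
    for (size_t i = 0; i < 8; i++)
        out[end - 8 + i] = (uint8_t)(bit_len >> (8 * i));

    return blocks;
}
```

The returned block count tells the caller how many more times to run the compression function on `out`.
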
The main optimizations come from low-level improvements to instruction selection and register usage:
- Loop unrolling is used where possible
- Constants that would otherwise be recomputed for every message are precomputed and hardcoded
- The implementation is fully register-based:
  - No stack saving/restoring during core rounds
  - No use of global variables (.data section is avoided)
- In the first 16 rounds, F = (B & C) | (~B & D) is computed with the equivalent expression F = D ^ (B & (C ^ D)), which takes fewer asm instructions. The same applies to rounds 16-31, where F = (B & D) | (C & ~D) is computed as F = C ^ (D & (B ^ C)).
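
The equivalence of the two forms of F can be checked mechanically. Below is a small C sketch, written for illustration rather than taken from this project; the function names `f_ref`, `f_opt`, `g_ref`, `g_opt` and `check_md5_identities` are hypothetical. Because every operation is bitwise, testing the eight combinations of all-zero and all-one words covers the full truth table for each bit position.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook forms as given in the MD5 specification (RFC 1321). */
uint32_t f_ref(uint32_t b, uint32_t c, uint32_t d) { return (b & c) | (~b & d); }
uint32_t g_ref(uint32_t b, uint32_t c, uint32_t d) { return (b & d) | (c & ~d); }

/* Rewritten forms: XOR, AND, XOR, with no NOT and no OR. */
uint32_t f_opt(uint32_t b, uint32_t c, uint32_t d) { return d ^ (b & (c ^ d)); }
uint32_t g_opt(uint32_t b, uint32_t c, uint32_t d) { return c ^ (d & (b ^ c)); }

/* Asserts that both rewrites agree with the textbook forms on every
 * row of the 3-input truth table. */
void check_md5_identities(void)
{
    for (unsigned m = 0; m < 8; m++) {
        uint32_t b = (m & 1) ? 0xFFFFFFFFu : 0u;
        uint32_t c = (m & 2) ? 0xFFFFFFFFu : 0u;
        uint32_t d = (m & 4) ? 0xFFFFFFFFu : 0u;
        assert(f_ref(b, c, d) == f_opt(b, c, d));
        assert(g_ref(b, c, d) == g_opt(b, c, d));
    }
}
```

Read as a multiplexer, F selects C where B is 1 and D where B is 0; the rewritten form expresses the same selection with three logic operations instead of four.
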
$ time bin/md5_c "$(python3 -c "print('A'*100000)")"
5793f7e3037448b250ae716b43ece2c2
real 0m0.026s
user 0m0.017s
sys 0m0.009s

$ time bin/md5_asm "$(python3 -c "print('A'*100000)")"
5793f7e3037448b250ae716b43ece2c2
real 0m0.016s
user 0m0.009s
sys 0m0.009s