print_timings: prompt eval time = 1110.28 ms / 1134 tokens ( 0.98 ms per token, 1021.37 tokens per second)
print_timings: eval time = 561.83 ms / 20 runs ( 28.09 ms per token, 35.60 tokens per second)
print_timings: total time = 1672.10 ms
slot 0 released (1155 tokens in cache)
slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time = 1483.84 ms / 1134 tokens ( 1.31 ms per token, 764.23 tokens per second)
print_timings: eval time = 591.59 ms / 20 runs ( 29.58 ms per token, 33.81 tokens per second)
print_timings: total time = 2075.43 ms
slot 1 released (1155 tokens in cache)
slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time = 1764.20 ms / 1134 tokens ( 1.56 ms per token, 642.78 tokens per second)
print_timings: eval time = 618.07 ms / 20 runs ( 30.90 ms per token, 32.36 tokens per second)
print_timings: total time = 2382.28 ms
slot 2 released (1155 tokens in cache)
slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time = 2229.91 ms / 1134 tokens ( 1.97 ms per token, 508.54 tokens per second)
print_timings: eval time = 642.50 ms / 20 runs ( 32.12 ms per token, 31.13 tokens per second)
print_timings: total time = 2872.41 ms
slot 3 released (1155 tokens in cache)
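The slowdown is easy to quantify from the raw numbers above; this is plain arithmetic on the logged prompt eval times, nothing llama.cpp-specific:

```python
# (prompt eval ms, prompt tokens) per slot, copied from the log above
timings = [(1110.28, 1134), (1483.84, 1134), (1764.20, 1134), (2229.91, 1134)]

# tokens/second = tokens / milliseconds * 1000
rates = [tokens / ms * 1000 for ms, tokens in timings]
for slot, rate in enumerate(rates):
    print(f"slot {slot}: {rate:.2f} tokens/second")

# Throughput falls with every request -- down to roughly half by slot 3.
assert all(earlier > later for earlier, later in zip(rates, rates[1:]))
```

Each request is identical (same 1134-token prompt, 20 generated tokens), so under constant performance all four rates should come out nearly equal; instead each slot is noticeably slower than the one before it.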
(./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4)
Pre-Prerequisite
Thanks to all the contributors for all the great work on llama.cpp!
Prerequisites
Expected Behaviour

Prompt eval speed should stay roughly constant across requests, regardless of which slot handles them.
Current Behaviour
prompt eval time gets much slower with each subsequent request: the timings above drop from ~1021 tokens per second on slot 0 to ~509 tokens per second on slot 3, roughly a 2x slowdown over four identical requests.

Environment and Context
Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU
Operating System: Linux
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce

1. Start the server with the command above (-ngl 9999 --ctx-size 8000 --cont-batching --parallel 4).
2. Send the same ~1134-token prompt four times (20 tokens generated each time), so each request lands on a new slot.
3. Compare the prompt eval time lines in the server log: throughput drops with every request.
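A minimal sketch of a reproduction driver, assuming the server's /completion endpoint and its "prompt"/"n_predict" JSON fields as described in the llama.cpp server README; the prompt text here is a placeholder, any fixed long prompt should do:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:7777"  # matches --port 7777 in the command above
PROMPT = "(any fixed ~1134-token prompt)"  # placeholder; content is irrelevant

def completion_request(prompt: str, n_predict: int = 20) -> urllib.request.Request:
    # Build a POST against the server's /completion endpoint.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    return urllib.request.Request(
        SERVER + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the server running, replaying the same request four times should
# reproduce the growing prompt eval times, one slot per request:
#   for _ in range(4):
#       urllib.request.urlopen(completion_request(PROMPT)).read()
```

Whether the requests are sent sequentially or concurrently, the log above shows the slots rotating 0 through 3, with each one evaluating the prompt more slowly than the last.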
Thanks!