Speed up RemoteFX encoding#571
Conversation
CBenoit
left a comment
There was a problem hiding this comment.
Nice work!! Do you have some numbers from criterion? How much the performance was improved?
suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.
|
fwiw, I opened awxkee/yuvutils-rs#3 |
Thank you for following up on this. I'll watch the issue. |
Well, criterion is huge time saver depending on your HW, but it still uses same amount of compute:
what's the rationale to put tests and benchmark in various crates? We try to ensure a common framework for the various crates that way? |
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
In theory, this could help the compiler to unroll loops.. doesn't seem to be the case though, but it allows to drop the assert_eq!() at least. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
That seems to speed up a bit the code:
rfxenc time: [46.040 µs 46.288 µs 46.698 µs]
change: [-9.2580% -8.6663% -7.8304%] (p = 0.00 < 0.05)
Performance has improved.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
That doesn't change the speed though, code isn't inlined afaict. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
This can help a lot wall-clock time, but depends on CPU.
rfx_enc time: [9.7885 ms 10.123 ms 10.439 ms]
change: [-80.484% -79.847% -79.208%] (p = 0.00 < 0.05)
Performance has improved.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Apparently it already did, I do not observe perf improvements. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Unfortunately, that doesn't seem to help unrolling & vectorizing: no perf improvements. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
rgb2yuv time: [11.706 µs 11.716 µs 11.727 µs]
change: [-24.083% -23.682% -23.394%] (p = 0.00 < 0.05)
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Good numbers! By "criterion" here, you mean "rayon" for parallelizing the encoding?
For the tests, the rationale is explained at several places:
EDIT: Also a recommend read with similar points made: https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html I'll also add less objective opinions and personal tastes:
For the benchmarks, in addition to the above arguments:
We also have a single That's about it for the rationale behind this suggestion. |
CBenoit
left a comment
There was a problem hiding this comment.
LGTM! Really nice work on the performance.
Unfortunately, it's not simple to use hand-written assembly from a project like yuvutils, because RDP uses odd bias and precision. Something left for the another day.