Skip to content

Version 3 ldm opt#4

Open
senhuang42 wants to merge 36 commits into
devfrom
version_3_ldm_opt
Open

Version 3 ldm opt#4
senhuang42 wants to merge 36 commits into
devfrom
version_3_ldm_opt

Conversation

@senhuang42

@senhuang42 senhuang42 commented Sep 29, 2020

Copy link
Copy Markdown
Owner

Draft PR
Preliminary results regarding compressed size:

  • 5259889920 (5GB) bytes: 5 Linux kernels concatenated: versions 4.19.148, 5.4.68, 5.8.12, 5.9-rc7, 5.8.12:
    • Old: Level 22 - long mode - window log = 27
      • 622643668 → 11.84%
    • Old: Level 22 - no long mode - window log = 27
      • 624737369 → 11.88%
    • New: Level 22 - long mode - window log = 27
      • 622639116 → 11.84%
  • 1995028480 (2GB) bytes: 2 relatively close Linux kernels concatenated: 5.8.12, 5.9-rc7
    • Old: Level 22 - long mode - window log = 30
      • 123668734 → 6.20%
    • Old: Level 22 - no long mode - window log = 30
      • 131944438 → 6.61%
    • New: Level 22 - long mode - window log = 30
      • 122466509 → 6.14%
  • 211950592 (200MB) bytes: silesia.tar
    • Old: Level 22 - long mode - window log = 27
      • 52777588 → 24.90%
    • Old: Level 22 - no long mode - window log = 27
      • 52701591 → 24.87%
    • New: Level 22 - long mode - window log = 27
      • 52699843 → 24.86%

New algorithm approach:

  • Essentially, we will use the LDM rawSeqStore given to us in a similar way to the current LDM insertion algorithm works. What this means is that essentially for each new block, the LDM sequence at position pos within the LDM seqStore is guaranteed to start at the same relative position as the block. That is, for each n bytes processed in the block, n bytes are processed and consumed in the LDM seqstore, and the member pos will point to relevant sequence.
  • As such, all of our calculations are purely relative within each block, and the approach is robust against issues such as window updates, etc.

@Cyan4973

Copy link
Copy Markdown

Little trick :
if you want to be sure that I'll notice your PR into your own fork,
just cc @Cyan4973 in a comment,
this will be enough to generate a notification.

@Cyan4973

Cyan4973 commented Sep 29, 2020

Copy link
Copy Markdown

Compression ratio :
These are good results, and within expected range.

Could you also add a comparison for a use case where --long improves ratio,
for both "old" and "new" implementations ?
We want to show that, on top of avoiding compression ratio regressions, which is key,
the new implementation also improves compression ratio (a bit) for favorable cases.

I believe it can be achieved by using the same tarball of linux sources,
but extending the window to 1 GB,
using command --long=30.

If you are afraid of long compression delays when using --ultra in combination with 5 GB input,
you may want to reduce the tarball to 2 kernel versions, consecutive or distant,
to reduce budget to compress around ~2 GB.

@senhuang42

Copy link
Copy Markdown
Owner Author

@Cyan4973 Good suggestion - I've updated the original post with a benchmark comparing the case you mentioned, and indeed the new method seems to still maintain the compression ratio gains.

I will be including much more comprehensive benchmarks in the actual version of the PR against the real Zstd repo, including some rough estimates on speed.

@Cyan4973

Cyan4973 commented Sep 30, 2020

Copy link
Copy Markdown

Quick comparison on my laptop, compressing silesia.tar

branch conf size time cpu%
v1.3.4 -19 -T0 54118223 36.4 s 259%
v1.3.5 -19 -T0 53451696 39.7 s 290%
v1.3.6 -19 -T0 53451696 38.6 s 287%
v1.4.0 -19 -T0 53285591 42.7 s 295%
v1.4.4 -19 -T0 53278346 41.8 s 293%
dev -19 -T0 53279750 42.6 s 293%
ldm3 -19 -T0 53279750 43.3 s 296%
v1.3.4 -19 -T0 --long=23 54312876 47.1 s 162%
v1.3.5 -19 -T0 --long=23 53554099 99.5 s 99%
v1.3.6 -19 -T0 --long=23 53554099 99.8 s 99%
v1.4.0 -19 -T0 --long=23 53545769 101.1 s 99%
v1.4.4 -19 -T0 --long=23 53542105 99.2 s 99%
dev -19 -T0 --long=23 53543099 100.3 s 99%
ldm3 -19 -T0 --long=23 53276362 108.0 s 99%
dev -19 -T0 --long=23 53537474 46.3 s 288%

Level 19 offers a full search over the 8 MB window.
In this situation, --long=23 is expected to provide basically nothing, since it targets the same window size.
This is a "hard" scenario for the ldm.
As expected, combining ldm with the optimal parser translates into a loss with current and previous versions of zstd.
In contrast, this branch manages to generate a tiny gain, proving once again that it's no longer detrimental.

What worries me though is the discovery that enabling the ldm mode kills multithreading performance.
This effect was unexpected. The combination of ldm with multithreading is supposed to work since v1.3.4.

However, this effect is unrelated to this PR. It was present before!
But it's concerning, because it indirectly impacts the usefulness of --long command in combination with high compression mode and multithreading (most users of high compression mode appreciate the multi-threading feature).

This probably warrants an investigation, though in a different follow up PR.

cc @terrelln : the feature (ldm + mt) seems to have disappeared almost immediately after its introduction into v1.3.4. By any chance, do you remember anything about it ?

edit : this is confirmed : the reduction in nb of active threads come from the large expansion of job size when triggering the --long mode. When forcing it back to expected 32 MB, one can observe multithreading active again.
There's probably a topic around revisiting the algorithm determining job size, though this should be a follow-up PR.

Comment thread lib/compress/zstd_opt.c Outdated
U32 ldmStartPosInBlock = 0;
U32 ldmEndPosInBlock = 0;
U32 ldmOffset = 0;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@terrelln So this comment below goes into more explicit detail about the surprising behavior I mentioned during our 1:1 and the code below is the my supposed "mitigation", basically pushing back adding the match by adding more bytes to the litLength of the ldm.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@senhuang42 I see the problem now. This is not the correct solution. This will likely break on large files.

I believe this is caused by btultra2 which does 2 passes over the first block. You can double check by seeing if the problem goes away at level 18, which uses btultra, not btultra2. ` You'll need to reset the LDM seqStore after the first pass, to re-consume the sequences.

https://github.com/facebook/zstd/blob/cc88eb7594d9c70ec2440ce6122f7861dbd64af2/lib/compress/zstd_opt.c#L1123-L1129

@senhuang42

Copy link
Copy Markdown
Owner Author

The multi-threading issue definitely seems to be just a configuration sort of issue - multithreading was enabled when I tried compressing a 5GB file with --long -22 --ultra -T0, but it does seem worth it to adjust it in a follow-up PR since it seems like there were pretty good speed gains on silesia.tar when it was enabled for ldm mode.

Comment thread lib/compress/zstd_opt.c Outdated
bytesToSkip -= seq->litLength;
seq->litLength = 0;
if (bytesToSkip < seq->matchLength) {
seq->matchLength -= (U32)bytesToSkip;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when matchLength < 3 after this line? That matchLength would be unrepresentable.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually allow matchLength < 3 since: we will always reject any matchLength < MINMATCH later on when we try to add LDMs, and I wanted to keep the functions as simple as possible to avoid any issues with miscounting the number of bytes we actually skipped. It would definitely make sense to note this specifically though.

Comment thread lib/compress/zstd_opt.c Outdated
if (remainingBytes <= currSeq.litLength) {
currSeq.offset = 0;
} else if (remainingBytes < currSeq.litLength + currSeq.matchLength) {
currSeq.matchLength = remainingBytes - currSeq.litLength;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here about matchLength < 3.

Comment thread lib/compress/zstd_opt.c Outdated
U32 ldmStartPosInBlock = 0;
U32 ldmEndPosInBlock = 0;
U32 ldmOffset = 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@senhuang42 I see the problem now. This is not the correct solution. This will likely break on large files.

I believe this is caused by btultra2 which does 2 passes over the first block. You can double check by seeing if the problem goes away at level 18, which uses btultra, not btultra2. ` You'll need to reset the LDM seqStore after the first pass, to re-consume the sequences.

https://github.com/facebook/zstd/blob/cc88eb7594d9c70ec2440ce6122f7861dbd64af2/lib/compress/zstd_opt.c#L1123-L1129

@terrelln

terrelln commented Sep 30, 2020

Copy link
Copy Markdown

@senhuang42 LDM + MT at level 19 will create 256 MB jobs by default. The code controlling it is here

https://github.com/facebook/zstd/blob/cc88eb7594d9c70ec2440ce6122f7861dbd64af2/lib/compress/zstdmt_compress.c#L1188

You can control it on the CLI with -B. E.g. a 32MB job size is -B32MB.

We should use the cycleLog instead of the chainLog. That will reduce it to a 128 MB job size. Then we could consider using cycleLog+3 instead of +4.

We could also take the strategy into account, and have a more fine grained approach.

@senhuang42

Copy link
Copy Markdown
Owner Author

@terrelln the note about btultra2 was super helpful and cleared up this mystery that had been bugging me the entire time. Thanks!

@senhuang42 senhuang42 force-pushed the version_3_ldm_opt branch 2 times, most recently from ce09360 to 167276d Compare October 1, 2020 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants