Rough draft: Opt parser improvements with long distance matches by senhuang42 · Pull Request #1 · senhuang42/zstd

senhuang42 · 2020-09-18T00:49:04Z

Draft PR for suggestions/help. This PR introduces a method to find the absolute position of an LDM, and adds code to allow LDMs to be just candidates for the opt parser.
Context:

I realized after the fact that my previous version with allocations (commit a59d8e6) only seemed more "stable" because of a couple of "lucky" bugs that allowed it to pass the unit tests and spot-checking. So let's just focus on the latest version with no allocations.
The two crucial blocks of code to review are:
ZSTD_insertBtAndGetAllMatches() (starting line 758) and ZSTD_ldm_hasMatchAtAbsolutePosition().

Approach to handling the LDMs

We add a new field bytesRead to rawSeqStore_t that corresponds to another member, pos. While pos is the last read index, bytesRead is the last read absolute byte position. (bytePos would probably be a better name). We use this to derive the absolute position of each match.
Within ZSTD_insertBtAndGetAllMatches(), we have a value U32 curr = (U32)ip - ms->window.base. Thus, curr represents the absolute position within the bytestream at which ZSTD_insertBtAndGetAllMatches() was called.
Next, we calculate if curr equals the calculated absolute position of any LDM. For example, consider the following three LDMs all occurring in the same seq store:
- (of: 22, ml: 100, ll: 50) ==> absolute position is 50 bytes
- (of: 23, ml: 40, ll: 30) ==> absolute position is 180 bytes (50 (prev abs position) + 100 (prev ml) + 30 (prev ll))
- (of: 48, ml: 20, ll: 10) ==> absolute position is 230 bytes (180 + 40 + 10)
So, using the same example, if we called ZSTD_insertBtAndGetAllMatches() with ip == 50, then ZSTD_ldm_hasMatchAtAbsolutePosition() should return the index of the appropriate match, and if it's long enough, we add it to the end of our matches array

A few implementation details

zstd_opt.c now takes a dependency on zstd_ldm.h which exposes two functions we'd like to use. I think this should be fine since I don't know of any zstd build which needs zstd_opt but not zstd_ldm
ZSTD_matchState_t now has a rawSeqStore_t* pointer to ldm seq store. However, I just realized that isn't good in certain cases, the ldm seq store can apparently be instantiated on the stack upstream, so it seems we lose track of it. Maybe the way to this is to add some new functions that take rawSeqStore_t* as an argument.
There are a couple of random debug and print functions in here just for debugging - I plan on removing them in the actual PR.

What generally works:
Single-threaded --long mode, where matches cannot go beyond blockSize. This is the "standard" case and sticks pretty close to the approach I mentioned above.

Where I'm currently stuck
Roundtrip issues that occur when multithreading is enabled. Apparently, multithreading enables the LDM-generator to find LDMs with matchLength and litLength longer than block sizes. Sequence splitting becomes an issue.

This is the case I'm currently having the most trouble with. As it is right now, we actually can handle matchLength going over a block boundary, but I'm not exactly sure how to handle the other case where litLength is actually greater than remaining (the remaining number of bytes in a block).

For example, on a file which I manually verified that has the first LDM as (of: 951277, ml: 1362, ll: 952876), we inserted it when curr == 952877 according to our above algorithm. But in reality, according to the "correct" opt parser (without --long) which found the same match, it inserted the match when curr == 1083471. It's unclear to me why this happens - I don't see why the LDM should be inserted at position 1083471 instead of 952877. This is indeed the first LDM, and so the position of the match should just be litLength + 1.

And so this causes the following decompression error:

../lib/decompress/zstd_decompress_block.c: seq: litL=0, matchL=1362, offset=951277 
../lib/decompress/zstd_decompress_block.c:856: ERROR!: check (__builtin_expect((sequence.offset > (size_t)(oLitEnd - virtualStart)), 0)) failed, returning ERROR(corruption_detected): 
../lib/decompress/zstd_decompress_block.c: regenerated sequence size : 4294967276 
../lib/decompress/zstd_decompress_block.c: ZSTD_decompressSequences_body: after decode loop, remaining nbSeq : 8772
yeet.zst : Decoding error (36) : Corrupted block detected

…ons()

…ns()

… matches

…tatic one

senhuang42 · 2020-09-18T15:21:39Z

                                       zc->blockState.nextCBlock->rep,
                                       src, srcSize);
-            assert(ldmSeqStore.pos == ldmSeqStore.size);
+            //assert(ldmSeqStore.pos == ldmSeqStore.size);


This may no longer be necessarily true, since ZSTD_BtGetAllMatches() might never get called at the absolute byte position of an LDM.

Cyan4973 · 2020-09-18T17:20:59Z

+  size_t pos;               /* The position where reading stopped. <= size. */
+  size_t size;              /* The number of sequences. <= capacity. */
+  size_t capacity;          /* The capacity starting from `seq` pointer */
+  size_t bytesRead;         /* The bytes-based analogue to "pos" - read position in bytes rather than idx */


Adding bytesRead increases the size of rawSeqStore_t,
which in turns increases the budget needed for any array of this type.

Are you certain that all allocations and size evaluations involving this type are properly resized ?
(i.e. typically use a multiple of sizeof(rawSeqStore_t))

Cyan4973 · 2020-09-18T17:22:58Z

    optState_t opt;         /* optimal parser state */
    const ZSTD_matchState_t* dictMatchState;
    ZSTD_compressionParameters cParams;
+    rawSeqStore_t* ldmSeqStore; /* reference to ldm seq store if there is one */


Is it guaranteed to be properly initialized ?
(I presume that ldmSeqStore == NULL is a valid entry and means "ldm seqStore not present")

Cyan4973 · 2020-09-18T17:51:10Z

+            /* TODO: as followup, also insert matches that are shorter than bestLength, but with smaller offset */
+            if (matchLength >= bestLength) {
+                /* TODO: This calculation needs more scrutiny, but seems to work to split matchLengths
+                 * that are longer than the remaining bytes in a block. But this doesn't work for litLengths


I think it would be better to bound matchLength to a maximum that doesn't cross the block boundary.
This is likely possible, because ZSTD_insertBtAndGetAllMatches() must already do it for any regular match,
so the information about the bound must be present.

Cyan4973 · 2020-09-18T17:55:05Z

+                }
+
+                /* Append the (possibly split) LDM to the end of our match candidates
+                 * signifying that it's the "best" candidate */


"best" candidate

Call it "longest".
We don't know yet if it's the "best" one, this decision is left to the parser,
and the answer is likely different depending on the end point.

Cyan4973 · 2020-09-18T18:02:42Z

+                } else {
+                    if (finalSeq.offset == 0) {
+                        /* This is the case that I'm not really sure how to handle yet
+                         * Offset == 0 implies that the litLength > remaining bytes


Offset == 0 implies that the litLength > remaining bytes

I don't understand this sentence.
Both parts of the statement don't align naturally together.
Is there something in the code that makes it happen ?

senhuang42 added 30 commits September 14, 2020 17:55

Expand rawSeqStore_t to include absolute positions of sequences

796d08d

Provde function declaration int ZSTD_createRawSequencesAbsolutePositi…

e0f8b75

…ons()

Provde function definition for ZSTD_createRawSequencesAbsolutePositio…

7619aee

…ns()

Add ldmSeqStore to matchState

72a541d

Add *ldmSeqStore to match state

2ba2556

Generate absolute positions for ldm

06d7ef4

Add ldmSeqStoreHasAbsolutePositionMatch() static fxn to find absolute…

64dfa68

… matches

Add sequence splitting function for LDMs in opt parser

763f2bf

Add relevant comments for match splitter

b0c2a63

First pass at integrating into matches

c3e6735

ZSTD_ldm_maybeSplitSequence() is now a public function in zstd_ldm.h

4e3ef44

ms->ldmSeqStore ==> ldmSeqStore for fewer ms accesses

f6e2a55

Remove static maybeSplitLdm()

2c6a307

Add function ZSTD_ldm_hasMatchAtAbsolutePosition() instead of using s…

0f8a4e8

…tatic one

Fix incorrect function documentation

16d3fa2

Fix ordering bug

5a3fa3b

Fix remainingBytes calculation to address segfaults

4c557cc

Address compilation warnings to do with rawSeqStore

c7a1c75

Remove no-longer-true assert

1d2d920

Fix incorrect index in condition

020dbf2

Update LDM seq store ->pos with matched index

192169c

Add extra condition for mnum > 0 to avoid edge cases

7cdc92d

Add a binary search option for larger ldm seq stores

63e82e6

Reduce binary search threshold

58a5058

Improve documentation of some confusing struct fields

75ec5ca

Add some extra comments in zstd_opt.c

9134755

Add extra --ultra mode roundTripTest

a59d8e6

Remove allocated array for abs positions, rederive each time instead

4153c91

ZSTD_createRawSequencesAbsolutePositions() => removed

2a26a92

Remove offset restraint to improve compression ratio

3b71676

senhuang42 added 2 commits September 17, 2020 10:04

Update bytesRead on the fly, and optimize absolute position matcher

f1264d5

First attempt at fixing errors

f8628be

senhuang42 force-pushed the improve_long_mode_with_opt_parser branch 3 times, most recently from 869d5c3 to 5b00f5d Compare September 18, 2020 02:11

Cleanups, comment clarification

b51fa4f

senhuang42 force-pushed the improve_long_mode_with_opt_parser branch from 5b00f5d to b51fa4f Compare September 18, 2020 02:15

senhuang42 changed the title ~~Improve long mode with opt parser~~ Rough draft: Opt parser improvements with long distance matches Sep 18, 2020

senhuang42 commented Sep 18, 2020

View reviewed changes

Cyan4973 reviewed Sep 18, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rough draft: Opt parser improvements with long distance matches#1

Rough draft: Opt parser improvements with long distance matches#1
senhuang42 wants to merge 33 commits into
devfrom
improve_long_mode_with_opt_parser

senhuang42 commented Sep 18, 2020 •

edited

Loading

Uh oh!

senhuang42 Sep 18, 2020

Uh oh!

Cyan4973 Sep 18, 2020 •

edited

Loading

Uh oh!

Cyan4973 Sep 18, 2020

Uh oh!

Cyan4973 Sep 18, 2020

Uh oh!

Cyan4973 Sep 18, 2020

Uh oh!

Cyan4973 Sep 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

senhuang42 commented Sep 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

senhuang42 Sep 18, 2020

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Sep 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Sep 18, 2020

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Sep 18, 2020

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Sep 18, 2020

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Sep 18, 2020

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

senhuang42 commented Sep 18, 2020 •

edited

Loading

Cyan4973 Sep 18, 2020 •

edited

Loading