Skip to content

Rough draft: Opt parser improvements with long distance matches#1

Open
senhuang42 wants to merge 33 commits into
devfrom
improve_long_mode_with_opt_parser
Open

Rough draft: Opt parser improvements with long distance matches#1
senhuang42 wants to merge 33 commits into
devfrom
improve_long_mode_with_opt_parser

Conversation

@senhuang42

@senhuang42 senhuang42 commented Sep 18, 2020

Copy link
Copy Markdown
Owner

Draft PR for suggestions/help. This PR introduces a method to find the absolute position of an LDM, and adds code to allow LDMs to be just candidates for the opt parser.
Context:

  • I realized after the fact that my previous version with allocations (commit a59d8e6) only seemed more "stable" because of a couple of "lucky" bugs that allowed it to pass the unit tests and spot-checking. So let's just focus on the latest version with no allocations.
  • The two crucial blocks of code to review are:
    ZSTD_insertBtAndGetAllMatches() (starting line 758) and ZSTD_ldm_hasMatchAtAbsolutePosition().

Approach to handling the LDMs

  1. We add a new field bytesRead to rawSeqStore_t that corresponds to another member, pos. While pos is the last read index, bytesRead is the last read absolute byte position. (bytePos would probably be a better name). We use this to derive the absolute position of each match.
  2. Within ZSTD_insertBtAndGetAllMatches(), we have a value U32 curr = (U32)ip - ms->window.base. Thus, curr represents the absolute position within the bytestream at which ZSTD_insertBtAndGetAllMatches() was called.
  3. Next, we calculate if curr equals the calculated absolute position of any LDM. For example, consider the following three LDMs all occurring in the same seq store:
    • (of: 22, ml: 100, ll: 50) ==> absolute position is 50 bytes
    • (of: 23, ml: 40, ll: 30) ==> absolute position is 180 bytes (50 (prev abs position) + 100 (prev ml) + 30 (prev ll))
    • (of: 48, ml: 20, ll: 10) ==> absolute position is 230 bytes (180 + 40 + 10)
  4. So, using the same example, if we called ZSTD_insertBtAndGetAllMatches() with ip == 50, then ZSTD_ldm_hasMatchAtAbsolutePosition() should return the index of the appropriate match, and if it's long enough, we add it to the end of our matches array

A few implementation details

  • zstd_opt.c now takes a dependency on zstd_ldm.h which exposes two functions we'd like to use. I think this should be fine since I don't know of any zstd build which needs zstd_opt but not zstd_ldm
  • ZSTD_matchState_t now has a rawSeqStore_t* pointer to ldm seq store. However, I just realized that isn't good in certain cases, the ldm seq store can apparently be instantiated on the stack upstream, so it seems we lose track of it. Maybe the way to this is to add some new functions that take rawSeqStore_t* as an argument.
  • There are a couple of random debug and print functions in here just for debugging - I plan on removing them in the actual PR.

What generally works:
Single-threaded --long mode, where matches cannot go beyond blockSize. This is the "standard" case and sticks pretty close to the approach I mentioned above.

Where I'm currently stuck
Roundtrip issues that occur when multithreading is enabled. Apparently, multithreading enables the LDM-generator to find LDMs with matchLength and litLength longer than block sizes. Sequence splitting becomes an issue.

This is the case I'm currently having the most trouble with. As it is right now, we actually can handle matchLength going over a block boundary, but I'm not exactly sure how to handle the other case where litLength is actually greater than remaining (the remaining number of bytes in a block).

For example, on a file which I manually verified that has the first LDM as (of: 951277, ml: 1362, ll: 952876), we inserted it when curr == 952877 according to our above algorithm. But in reality, according to the "correct" opt parser (without --long) which found the same match, it inserted the match when curr == 1083471. It's unclear to me why this happens - I don't see why the LDM should be inserted at position 1083471 instead of 952877. This is indeed the first LDM, and so the position of the match should just be litLength + 1.

And so this causes the following decompression error:

../lib/decompress/zstd_decompress_block.c: seq: litL=0, matchL=1362, offset=951277 
../lib/decompress/zstd_decompress_block.c:856: ERROR!: check (__builtin_expect((sequence.offset > (size_t)(oLitEnd - virtualStart)), 0)) failed, returning ERROR(corruption_detected): 
../lib/decompress/zstd_decompress_block.c: regenerated sequence size : 4294967276 
../lib/decompress/zstd_decompress_block.c: ZSTD_decompressSequences_body: after decode loop, remaining nbSeq : 8772
yeet.zst : Decoding error (36) : Corrupted block detected 

senhuang42 added 30 commits September 14, 2020 17:55
@senhuang42 senhuang42 force-pushed the improve_long_mode_with_opt_parser branch 3 times, most recently from 869d5c3 to 5b00f5d Compare September 18, 2020 02:11
@senhuang42 senhuang42 force-pushed the improve_long_mode_with_opt_parser branch from 5b00f5d to b51fa4f Compare September 18, 2020 02:15
@senhuang42 senhuang42 changed the title Improve long mode with opt parser Rough draft: Opt parser improvements with long distance matches Sep 18, 2020
zc->blockState.nextCBlock->rep,
src, srcSize);
assert(ldmSeqStore.pos == ldmSeqStore.size);
//assert(ldmSeqStore.pos == ldmSeqStore.size);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may no longer be necessarily true, since ZSTD_BtGetAllMatches() might never get called at the absolute byte position of an LDM.

size_t pos; /* The position where reading stopped. <= size. */
size_t size; /* The number of sequences. <= capacity. */
size_t capacity; /* The capacity starting from `seq` pointer */
size_t bytesRead; /* The bytes-based analogue to "pos" - read position in bytes rather than idx */

@Cyan4973 Cyan4973 Sep 18, 2020

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding bytesRead increases the size of rawSeqStore_t,
which in turns increases the budget needed for any array of this type.

Are you certain that all allocations and size evaluations involving this type are properly resized ?
(i.e. typically use a multiple of sizeof(rawSeqStore_t))

optState_t opt; /* optimal parser state */
const ZSTD_matchState_t* dictMatchState;
ZSTD_compressionParameters cParams;
rawSeqStore_t* ldmSeqStore; /* reference to ldm seq store if there is one */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed to be properly initialized ?
(I presume that ldmSeqStore == NULL is a valid entry and means "ldm seqStore not present")

Comment thread lib/compress/zstd_opt.c
/* TODO: as followup, also insert matches that are shorter than bestLength, but with smaller offset */
if (matchLength >= bestLength) {
/* TODO: This calculation needs more scrutiny, but seems to work to split matchLengths
* that are longer than the remaining bytes in a block. But this doesn't work for litLengths

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to bound matchLength to a maximum that doesn't cross the block boundary.
This is likely possible, because ZSTD_insertBtAndGetAllMatches() must already do it for any regular match,
so the information about the bound must be present.

Comment thread lib/compress/zstd_opt.c
}

/* Append the (possibly split) LDM to the end of our match candidates
* signifying that it's the "best" candidate */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"best" candidate

Call it "longest".
We don't know yet if it's the "best" one, this decision is left to the parser,
and the answer is likely different depending on the end point.

Comment thread lib/compress/zstd_opt.c
} else {
if (finalSeq.offset == 0) {
/* This is the case that I'm not really sure how to handle yet
* Offset == 0 implies that the litLength > remaining bytes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset == 0 implies that the litLength > remaining bytes

I don't understand this sentence.
Both parts of the statement don't align naturally together.
Is there something in the code that makes it happen ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants