Rough draft: Opt parser improvements with long distance matches#1
Rough draft: Opt parser improvements with long distance matches#1senhuang42 wants to merge 33 commits into
Conversation
869d5c3 to
5b00f5d
Compare
5b00f5d to
b51fa4f
Compare
| zc->blockState.nextCBlock->rep, | ||
| src, srcSize); | ||
| assert(ldmSeqStore.pos == ldmSeqStore.size); | ||
| //assert(ldmSeqStore.pos == ldmSeqStore.size); |
There was a problem hiding this comment.
This may no longer be necessarily true, since ZSTD_BtGetAllMatches() might never get called at the absolute byte position of an LDM.
| size_t pos; /* The position where reading stopped. <= size. */ | ||
| size_t size; /* The number of sequences. <= capacity. */ | ||
| size_t capacity; /* The capacity starting from `seq` pointer */ | ||
| size_t bytesRead; /* The bytes-based analogue to "pos" - read position in bytes rather than idx */ |
There was a problem hiding this comment.
Adding bytesRead increases the size of rawSeqStore_t,
which in turns increases the budget needed for any array of this type.
Are you certain that all allocations and size evaluations involving this type are properly resized ?
(i.e. typically use a multiple of sizeof(rawSeqStore_t))
| optState_t opt; /* optimal parser state */ | ||
| const ZSTD_matchState_t* dictMatchState; | ||
| ZSTD_compressionParameters cParams; | ||
| rawSeqStore_t* ldmSeqStore; /* reference to ldm seq store if there is one */ |
There was a problem hiding this comment.
Is it guaranteed to be properly initialized ?
(I presume that ldmSeqStore == NULL is a valid entry and means "ldm seqStore not present")
| /* TODO: as followup, also insert matches that are shorter than bestLength, but with smaller offset */ | ||
| if (matchLength >= bestLength) { | ||
| /* TODO: This calculation needs more scrutiny, but seems to work to split matchLengths | ||
| * that are longer than the remaining bytes in a block. But this doesn't work for litLengths |
There was a problem hiding this comment.
I think it would be better to bound matchLength to a maximum that doesn't cross the block boundary.
This is likely possible, because ZSTD_insertBtAndGetAllMatches() must already do it for any regular match,
so the information about the bound must be present.
| } | ||
|
|
||
| /* Append the (possibly split) LDM to the end of our match candidates | ||
| * signifying that it's the "best" candidate */ |
There was a problem hiding this comment.
"best" candidate
Call it "longest".
We don't know yet if it's the "best" one, this decision is left to the parser,
and the answer is likely different depending on the end point.
| } else { | ||
| if (finalSeq.offset == 0) { | ||
| /* This is the case that I'm not really sure how to handle yet | ||
| * Offset == 0 implies that the litLength > remaining bytes |
There was a problem hiding this comment.
Offset == 0 implies that the litLength > remaining bytes
I don't understand this sentence.
Both parts of the statement don't align naturally together.
Is there something in the code that makes it happen ?
Draft PR for suggestions/help. This PR introduces a method to find the absolute position of an LDM, and adds code to allow LDMs to be just candidates for the opt parser.
Context:
ZSTD_insertBtAndGetAllMatches()(starting line 758) andZSTD_ldm_hasMatchAtAbsolutePosition().Approach to handling the LDMs
bytesReadtorawSeqStore_tthat corresponds to another member,pos. Whileposis the last read index,bytesReadis the last read absolute byte position. (bytePoswould probably be a better name). We use this to derive the absolute position of each match.ZSTD_insertBtAndGetAllMatches(), we have a valueU32 curr = (U32)ip - ms->window.base. Thus,currrepresents the absolute position within the bytestream at whichZSTD_insertBtAndGetAllMatches()was called.currequals the calculated absolute position of any LDM. For example, consider the following three LDMs all occurring in the same seq store:(of: 22, ml: 100, ll: 50) ==> absolute position is 50 bytes(of: 23, ml: 40, ll: 30) ==> absolute position is 180 bytes (50 (prev abs position) + 100 (prev ml) + 30 (prev ll))(of: 48, ml: 20, ll: 10) ==> absolute position is 230 bytes (180 + 40 + 10)ZSTD_insertBtAndGetAllMatches()withip == 50, thenZSTD_ldm_hasMatchAtAbsolutePosition()should return the index of the appropriate match, and if it's long enough, we add it to the end of ourmatchesarrayA few implementation details
zstd_opt.cnow takes a dependency onzstd_ldm.hwhich exposes two functions we'd like to use. I think this should be fine since I don't know of any zstd build which needszstd_optbut notzstd_ldmZSTD_matchState_tnow has arawSeqStore_t*pointer to ldm seq store. However, I just realized that isn't good in certain cases, the ldm seq store can apparently be instantiated on the stack upstream, so it seems we lose track of it. Maybe the way to this is to add some new functions that takerawSeqStore_t*as an argument.What generally works:
Single-threaded
--longmode, where matches cannot go beyond blockSize. This is the "standard" case and sticks pretty close to the approach I mentioned above.Where I'm currently stuck
Roundtrip issues that occur when multithreading is enabled. Apparently, multithreading enables the LDM-generator to find LDMs with matchLength and litLength longer than block sizes. Sequence splitting becomes an issue.
This is the case I'm currently having the most trouble with. As it is right now, we actually can handle
matchLengthgoing over a block boundary, but I'm not exactly sure how to handle the other case wherelitLengthis actually greater thanremaining(the remaining number of bytes in a block).For example, on a file which I manually verified that has the first LDM as
(of: 951277, ml: 1362, ll: 952876), we inserted it whencurr == 952877according to our above algorithm. But in reality, according to the "correct" opt parser (without --long) which found the same match, it inserted the match whencurr == 1083471. It's unclear to me why this happens - I don't see why the LDM should be inserted at position 1083471 instead of 952877. This is indeed the first LDM, and so the position of the match should just belitLength + 1.And so this causes the following decompression error: