You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Build as optimization, not decision.Value = Min<usize>. Objective: Minimize compression cost |D| + |C| + (h−1) × (pointer count).
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.
Motivation
MINIMUM EXTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR22, "EXTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string using a separate dictionary string and a compressed string with pointers. This problem formalizes the macro model of data compression introduced by Storer and Szymanski, which generalizes many practical compression schemes including LZ-family algorithms.
INSTANCE: Alphabet Σ, string s ∈ Σ*, pointer cost h ∈ Z⁺.
OBJECTIVE: Find strings D (dictionary string) and C (compressed string) in (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})*, where the symbols pᵢ are "pointers," such that s can be obtained from C by repeatedly replacing pointers with their corresponding substrings of D, minimizing the total cost:
cost(D, C) = |D| + |C| + (h − 1) × (number of pointer occurrences in D and C)
The corresponding decision problem (GJ SR22) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.
Variables
Count: 2 × |s| variables — |s| slots for the dictionary string D and |s| slots for the compressed string C. Unused trailing slots are filled with a special "empty" marker.
D-slot domain: {empty} ∪ Σ — each dictionary position is either an alphabet symbol or unused. Domain size = alphabet_size + 1.
C-slot domain: {empty} ∪ Σ ∪ {ptr(start, len) : 0 ≤ start < |s|, 1 ≤ len ≤ |s| − start} — each compressed-string position is an alphabet symbol, a pointer into D (specified by start position and length), or unused. Domain size = alphabet_size + 1 + |s| × (|s| + 1) / 2.
Meaning: A configuration encodes a candidate (D, C) pair. D (the non-empty prefix of D-slots) is a dictionary of reusable substrings. C (the non-empty prefix of C-slots) contains literal symbols and pointers into D. The pair is valid if replacing all pointers in C with their referenced D-substrings yields s. The cost is |D| + |C| + (h − 1) × (number of pointer symbols in C).
Note: This encoding restricts D to be pointer-free (pure alphabet string). The GJ definition allows pointers in D, but the pointer-free variant is NP-complete (Storer & Szymanski, 1982) and sufficient for all known reductions.
Schema (data type)
Type name:MinimumExternalMacroDataCompression Variants: none (no graph or weight type parameter) Value type:Min<usize>
Field
Type
Description
alphabet_size
usize
Size of the alphabet Σ (symbols indexed 0..alphabet_size)
string
Vec<usize>
The source string s to be compressed, as a sequence of symbol indices
pointer_cost
usize
The pointer cost h (each pointer occurrence contributes h − 1 extra beyond the position it occupies)
Getters (for overhead expressions):
string_length() → string.len()
alphabet_size() → alphabet_size
pointer_cost() → pointer_cost
Complexity
Optimization complexity: NP-hard (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). The decision version (is minimum cost ≤ B?) is NP-complete. Remains NP-hard even for fixed h ≥ 2, for alphabet size ≥ 3, and for variants where D contains no pointers.
Best known exact algorithm: Brute-force over all possible (D, C) pairs: for each candidate dictionary string D of length up to |s|, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Upper bound: O(alphabet_size^string_length × (alphabet_size + string_length²)^string_length).
Approximation: Practical heuristic algorithms (LZ77, LZSS, LZ78) achieve good compression ratios in linear or near-linear time but do not guarantee optimality. LZSS (Lempel-Ziv-Storer-Szymanski) is a direct practical algorithm derived from this theoretical framework. The related Smallest Grammar Problem is APX-hard (Charikar et al., 2005).
References:
[Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
[Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
[Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.
Specialization
This is related to: INTERNAL MACRO DATA COMPRESSION (P171) -- the variant where D and C are merged into a single self-referencing string
Generalization of: Many practical compression schemes (LZ77, LZSS) are restricted forms of external macro compression
Extra Remark
Full book text (GJ SR22, decision version):
INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Are there strings D (dictionary string) and C (compressed string) in (Sigma union {p_i: 1 <= i <= |s|})*, where the symbols p_i are "pointers," such that
|D| + |C| + (h-1) * (number of occurrences of pointers in D and C) <= B
and such that there is a way of identifying pointers with substrings of D so that S can be obtained from C by repeatedly replacing pointers in C by their corresponding substrings in D?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. Many variants, including those in which D can contain no pointers and/or no pointers can refer to overlapping strings, are also NP-complete. If the alphabet size is fixed at 3 or greater, and the pointer cost is ceiling(h * log|s|), the problem is also NP-complete. For further variants, including the case of "original pointers," see references.
How to solve
It can be solved by (existing) bruteforce. Enumerate all possible dictionary strings D up to length |s|, and for each D, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Track the minimum total cost across all valid (D, C) pairs.
It can be solved by reducing to integer programming.
Other: Practical heuristic compression algorithms (LZSS, LZ77) provide approximate solutions in linear time, though they do not guarantee optimality.
Example Instance
Alphabet Σ = {a, b, c, d, e, f} (alphabet_size = 6)
String s = [a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f] (length 18)
Pointer cost h = 2
Optimal compression (cost = 12):
Dictionary D = "abcdef" (length 6)
Compressed string C = "p₁ p₁ p₁" where p₁ → D[0..6] = "abcdef"
Important
Build as optimization, not decision.
Value = Min<usize>.Objective: Minimize compression cost |D| + |C| + (h−1) × (pointer count).
Do not add a
bound/thresholdfield — let the solver find the optimum directly. See #765.Motivation
MINIMUM EXTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR22, "EXTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string using a separate dictionary string and a compressed string with pointers. This problem formalizes the macro model of data compression introduced by Storer and Szymanski, which generalizes many practical compression schemes including LZ-family algorithms.
Associated reduction rules:
Definition
Name:
MinimumExternalMacroDataCompressionCanonical name: External Macro Data Compression (also: External Pointer Macro Compression, EPM Compression)
Reference: Garey & Johnson, Computers and Intractability, A4 SR22
Mathematical definition:
INSTANCE: Alphabet Σ, string s ∈ Σ*, pointer cost h ∈ Z⁺.
OBJECTIVE: Find strings D (dictionary string) and C (compressed string) in (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})*, where the symbols pᵢ are "pointers," such that s can be obtained from C by repeatedly replacing pointers with their corresponding substrings of D, minimizing the total cost:
cost(D, C) = |D| + |C| + (h − 1) × (number of pointer occurrences in D and C)
The corresponding decision problem (GJ SR22) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.
Variables
alphabet_size + 1.alphabet_size + 1 + |s| × (|s| + 1) / 2.Note: This encoding restricts D to be pointer-free (pure alphabet string). The GJ definition allows pointers in D, but the pointer-free variant is NP-complete (Storer & Szymanski, 1982) and sufficient for all known reductions.
Schema (data type)
Type name:
MinimumExternalMacroDataCompressionVariants: none (no graph or weight type parameter)
Value type:
Min<usize>alphabet_sizeusizestringVec<usize>pointer_costusizeGetters (for overhead expressions):
string_length()→string.len()alphabet_size()→alphabet_sizepointer_cost()→pointer_costComplexity
Specialization
Extra Remark
Full book text (GJ SR22, decision version):
INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Are there strings D (dictionary string) and C (compressed string) in (Sigma union {p_i: 1 <= i <= |s|})*, where the symbols p_i are "pointers," such that
|D| + |C| + (h-1) * (number of occurrences of pointers in D and C) <= B
and such that there is a way of identifying pointers with substrings of D so that S can be obtained from C by repeatedly replacing pointers in C by their corresponding substrings in D?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. Many variants, including those in which D can contain no pointers and/or no pointers can refer to overlapping strings, are also NP-complete. If the alphabet size is fixed at 3 or greater, and the pointer cost is ceiling(h * log|s|), the problem is also NP-complete. For further variants, including the case of "original pointers," see references.
How to solve
Example Instance
Alphabet Σ = {a, b, c, d, e, f} (alphabet_size = 6)
String s = [a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f] (length 18)
Pointer cost h = 2
Optimal compression (cost = 12):
Suboptimal scheme (cost = 16):
No compression: cost = |s| = 18
Optimal value: Min(12)