
[QUARK-480] Phase 2: replace dict-style quant_config access with LayerQuantConfig#274

Open
thpereir wants to merge 1 commit into main from
thpereir/quark_quant_layer_phase2

Conversation


@thpereir thpereir commented Mar 6, 2026


Motivation

Following up the DeepSeek R1 enablement with per-layer quantization support, this converts all other models to use the same structured quantization config instead of a dict.

This PR depends on both #236 and #268

Technical Details

  1. Clean type boundary — LayerQuantConfig is a frozen dataclass; impossible to accidentally mutate or misspell fields.
  2. Single resolution API — resolve(prefix) handles the exclude list, per-layer overrides, pattern matching, and global fallback in one call with documented priority, which is less error-prone than the previous approach of reaching into the dict. What was quant_config.get_layer_quant_config(f"{prefix}.fused_qkv_a_proj") becomes quant_config.resolve(prefix).
  3. Parser registry — @register_quant_parser("quark") means new quantization formats are self-contained modules. In the current approach, adding a new quantization method means editing QuantizationConfig.__init__ directly.

Test Plan

Test Result

Qwen3-30B-A3B-FP8-dynamic

ATOM Server:

python -m atom.entrypoints.openai_server --model /root/models/RedHatAI/Qwen3-30B-A3B-FP8-dynamic/ --trust-remote-code -tp 4 --kv_cache_dtype fp8

ATOM lm-eval GSM8K:

python tests/evals/gsm8k/gsm8k_eval.py --host http://127.0.0.1 --port 8000 --num-questions 1000 --save-results logs
Running GSM8K evaluation: 1000 questions, 5-shot

Results:
Accuracy: 0.858
Invalid responses: 0.000
Total latency: 24.870 s
Questions per second: 40.209
Total output tokens: 119342
Output tokens per second: 4798.671
Results saved to logs

HuggingFace GSM8K:

Accuracy: 0.86

DeepSeek-R1-0528-MoE-MXFP4-Attn-MTP-PTPC-FP8

ATOM Server:

python -m atom.entrypoints.openai_server --model /root/models/amd/DeepSeek-R1-0528-MoE-MXFP4-Attn-MTP-PTPC-FP8/ --trust-remote-code -tp 8 --kv_cache_dtype fp8

ATOM lm-eval GSM8K:

lm_eval --model local-completions   --model_args "model=/root/models/amd/DeepSeek-R1-0528-MoE-MXFP4-Attn-MTP-PTPC-FP8/,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32"   --tasks gsm8k   --num_fewshot 5   --batch_size 1

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9386|±  |0.0066|
|     |       |strict-match    |     5|exact_match|↑  |0.9356|±  |0.0068|

Submission Checklist

@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 66e940b to 873a969 Compare March 10, 2026 16:29
@thpereir thpereir changed the base branch from thpereir/quark_quant_layer to main March 10, 2026 16:30
@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch 2 times, most recently from 8801fb5 to 0c9fd54 Compare March 10, 2026 17:06
@thpereir thpereir changed the title Phase 2: replace dict-style quant_config access with LayerQuantSpec a… Phase 2: replace dict-style quant_config access with LayerQuantConfig Mar 10, 2026
@thpereir thpereir marked this pull request as ready for review March 11, 2026 16:35
@thiagocrepaldi

Is it possible to request a regression test for this PR, or for the current main branch?


Copilot AI left a comment


Pull request overview

This PR completes the Phase 2 migration away from dict-style quantization config access by introducing a typed, immutable LayerQuantConfig and routing quant config parsing through a parser registry, then updating models/ops/docs/tests to use the new API.

Changes:

  • Introduce atom/quant_spec.py with LayerQuantConfig (frozen dataclass), ParsedQuantConfig, and a quant-config parser registry (QuarkParser, GenericParser).
  • Refactor QuantizationConfig to parse via the registry and expose typed per-layer resolution via get_layer_quant_config().
  • Update model/ops code and docs/tests to use attribute access (e.g., .quant_dtype) instead of dict keys.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| atom/quant_spec.py | Adds typed quant spec + parser registry and built-in parsers. |
| atom/config.py | Switches quant parsing to registry + typed resolution; removes dict-based parsing paths. |
| atom/models/llama.py | Uses per-layer resolved quant type instead of global dict access. |
| atom/models/deepseek_v2.py | Replaces dict indexing with typed attribute access for resolved specs. |
| atom/models/deepseek_mtp.py | Replaces dict indexing with typed attribute access. |
| atom/model_ops/moe.py | Updates MoE methods to use typed LayerQuantConfig attributes. |
| atom/model_ops/linear.py | Updates LinearBase plumbing to use typed spec attributes. |
| atom/model_ops/layernorm.py | Updates RMSNorm to resolve per-layer spec via prefix. |
| atom/model_ops/activation.py | Updates SiluAndMul to resolve per-layer spec via prefix. |
| tests/test_quant_config.py | Updates/refactors tests for typed API and parser registry. |
| docs/configuration_guide.md | Updates docs to reflect LayerQuantConfig as a frozen dataclass and new parsing path. |
| docs/architecture_guide.md | Removes LayerQuantConfig from atom/config.py ownership list. |
Comments suppressed due to low confidence (1)

docs/configuration_guide.md:139

  • The LayerQuantConfig table documents quant_method as type str with default "", but the implementation is quant_method: str | None = None. Update the docs to reflect that the field may be None and clarify what None vs an empty string means for downstream logic.
| Field | Type | Default | Description |
|---|---|---|---|
| `quant_type` | `QuantType` | `QuantType.No` | Quantization granularity (see below) |
| `quant_dtype` | `torch.dtype` | `torch.bfloat16` | Data type for quantized weights |
| `is_dynamic` | `bool` | `True` | Use dynamic quantization (scales computed at runtime) |
| `quant_method` | `str` | `""` | Quantization method (e.g., `"quark"`, `"compressed-tensors"`) |



haoyangli0109 commented Mar 13, 2026

@thpereir
Thank you very much.
It's great to see that several of the directions we discussed were implemented quickly. If we modify the code for other quantization formats, we should also select the corresponding models for accuracy testing.

@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch 5 times, most recently from 0ce71af to 59dcb22 Compare March 16, 2026 23:20
@thpereir
Contributor Author

Rebased again. Fixed a small issue loading gpt-oss-120b. But I found that even the main branch has a big issue running inference for gpt-oss-120b: the output is garbage.

I am investigating and working on a separate fix so as not to pollute this PR.

@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 59dcb22 to 74ca941 Compare March 17, 2026 00:50
@thpereir
Contributor Author

Rebased again. Posted test results on the top PR description

@thpereir thpereir changed the title Phase 2: replace dict-style quant_config access with LayerQuantConfig [QUARK-480] Phase 2: replace dict-style quant_config access with LayerQuantConfig Mar 17, 2026
@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 74ca941 to c481f2b Compare March 17, 2026 14:19
atom/config.py Outdated
)
# Use the parser registry to build a structured ParsedQuantConfig
parser = get_quant_parser(self.quant_method)
self._parsed_quant_config = parser.parse(self.hf_quant_config)
Contributor


If we can pass exclude_layer directly into QuantizationConfig and let it own that field, why can’t we do the same for global_spec and layer_pattern_specs?

Compared to accessing self._parsed_quant_config.layer_pattern_specs, wouldn’t it be more reasonable and straightforward to use self.layer_pattern_specs directly?

Returning a tuple feels somewhat inelegant, but introducing an additional _parsed_quant_config object that doesn’t seem strictly necessary also feels a bit odd.

Contributor Author


I will update it to create those fields so access is direct, while keeping _parsed_quant_config as a private implementation detail.

Contributor

@haoyangli0109 haoyangli0109 Mar 19, 2026


Hi, @thpereir
The typed return contract for QuantConfigParser.parse() is more self-documenting than returning a tuple or dict. When adding new parsers (Quark, GenericParser, etc.), it provides a clear "output contract." And if the parsing output ever needs more fields, adding them to ParsedQuantConfig is cleaner than dealing with an ever-growing tuple.

However, as currently implemented there are several issues:

  1. QuantizationConfig is the only consumer. ParsedQuantConfig is never used directly anywhere else. It gets created and immediately proxied through properties — essentially a middleman that adds a layer of indirection for no benefit.
  2. Inconsistent design. exclude_layers is copied out and stored directly on QuantizationConfig, while global_spec and layer_pattern_specs are accessed via _parsed_quant_config. If one field can be stored directly, why not the other two? This inconsistency is confusing for readers.
  3. Encapsulation is broken. remap_layer_name directly mutates self._parsed_quant_config.layer_pattern_specs, which means ParsedQuantConfig provides no real encapsulation or immutability guarantee — it is neither frozen nor read-only.
  4. Extra cognitive load. When reading self._parsed_quant_config.global_spec, you need to understand what _parsed_quant_config is and where it comes from. If it were simply self.global_spec as a direct field, the comprehension cost would be zero.
  5. A dedicated class is not necessary for the parser return value. A Python NamedTuple, or simply having the parser return a (global_spec, layer_pattern_specs, exclude_layers) tuple, would be equally clear for this simple scenario without introducing an extra class.
Suggestion: keep ParsedQuantConfig as the parser return type (it's a fine interface contract), but treat it as a transient object — destructure it in __init__ and discard it:

parser = get_quant_parser(self.quant_method)
parsed = parser.parse(self.hf_quant_config)
self.global_spec = parsed.global_spec
self.layer_pattern_specs = parsed.layer_pattern_specs
self.exclude_layers = list(parsed.exclude_layers)

This way parsers still return a well-typed ParsedQuantConfig, but QuantizationConfig owns all three fields directly — no delegation, no inconsistency, and remap_layer_name just mutates self.layer_pattern_specs without reaching into a private sub-object.
This is just my personal suggestion; I’d like to hear your thoughts.

Contributor Author


Agree with your point: _parsed_quant_config as a stored member was inconsistent and added indirection for no real benefit. Updated to destructure the result of parser.parse() directly into self.global_spec, self.layer_pattern_specs, and self.exclude_layers.
ParsedQuantConfig is now purely a typed return contract for parsers; QuantizationConfig owns all three fields directly and remap_layer_name mutates self.layer_pattern_specs without reaching into a sub-object.

@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch 3 times, most recently from 3e92890 to 2aa9bbf Compare March 19, 2026 17:10
… for per-layer quant config

Introduce atom/quant_spec.py with:
- LayerQuantConfig: frozen dataclass with typed attribute access
  (quant_type, quant_dtype, is_dynamic, quant_method) replacing
  the old dict-based LayerQuantConfig(dict) subclass
- ParsedQuantConfig: structured output of HF config parsing
- Parser registry (@register_quant_parser) with QuarkParser and
  GenericParser (fallback for compressed-tensors, GPTQ, AWQ, etc.)

Refactor QuantizationConfig (atom/config.py):
- Internal storage now uses ParsedQuantConfig via parser registry
- get_layer_quant_config(prefix) -> LayerQuantConfig (frozen dataclass)
- global_quant_config property -> LayerQuantConfig
- Convenience properties: quant_type, quant_dtype, is_dynamic
- compute_hash() uses typed internal structures

Migrate all consumers to typed attribute access:
- linear.py: layer_quant_config.quant_type instead of ["quant_type"]
- moe.py: all MoE method classes use LayerQuantConfig type hints
- activation.py, layernorm.py: accept prefix param, use
  get_layer_quant_config() instead of bypassing with global_quant_config
- deepseek_mtp.py, deepseek_v2.py, llama.py: use get_layer_quant_config()

Fix GenericParser exclude-layer key handling (atom/quant_spec.py):
- Different quantizers use different keys for excluded layers:
  compressed-tensors uses "ignore", gpt-oss/HF uses "modules_to_not_convert",
  Quark uses "exclude"
- GenericParser now tries all three keys in priority order so excluded
  layers are never silently treated as quantized

Fix hard-coded quant_config=None across models:
- gpt_oss.py OAIAttention: qkv_proj and o_proj were passing quant_config=None,
  preventing fp8/mxfp4 quantization on attention projections in Quark
  gpt-oss models (e.g. fp8 qkv + mxfp4 MoE); both now receive quant_config
- deepseek_v2.py Indexer: weights_proj passed quant_config=None while sibling
  linears wq_b and wk correctly used quant_config; fixed for consistency
- qwen3_next.py GatedDeltaNet: conv1d ColumnParallelLinear omitted
  quant_config while other linears in the same class passed it; fixed

Fix GenericParser MXFP4 quant_type: default fp4x2 to per_1x32 when no explicit
quant_type is found, fixing gpt-oss-120b AITER kernel dispatch.
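The exclude-key fallback described in the commit message above can be sketched as follows. The key names come from the commit message; the function name and the dict shape are illustrative assumptions.

```python
def extract_exclude_layers(hf_quant_config: dict) -> list[str]:
    """Try the exclude-layer keys used by different quantizers, in priority
    order, so excluded layers are never silently treated as quantized."""
    # compressed-tensors -> "ignore", gpt-oss/HF -> "modules_to_not_convert",
    # Quark -> "exclude"
    for key in ("ignore", "modules_to_not_convert", "exclude"):
        value = hf_quant_config.get(key)
        if value:
            return list(value)
    return []
```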
@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 2aa9bbf to 4b24376 Compare March 19, 2026 17:27