diff --git a/openspec/changes/harden-gguf-contracts/design.md b/openspec/changes/harden-gguf-contracts/design.md
new file mode 100644
index 0000000..49f62a3
--- /dev/null
+++ b/openspec/changes/harden-gguf-contracts/design.md
@@ -0,0 +1,63 @@
+## Context
+
+The repository is in final hardening mode. For this phase, honest failure is better than partial runtime optimism. The current GGUF host-side path violates that principle in three ways:
+
+1. integer sizes from GGUF metadata/tensor descriptors are trusted for too long
+2. runtime tensor validation happens after CUDA allocation starts
+3. schema mapping mixes valid configuration keys with unrelated metadata fallbacks
+
+These problems are tightly coupled and live behind one shallow seam: `GGUFParser` + `ModelLoader::loadGGUF`.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Fail malformed or oversized GGUF inputs with `Result` errors, not host exceptions.
+- Validate runtime-required tensors before any CUDA allocation.
+- Keep the GGUF host-side seam smaller and more honest.
+- Add regression coverage for hostile and incomplete GGUF fixtures.
+
+**Non-Goals:**
+- Full GGUF runtime support for every quantization format.
+- Reworking CUDA kernels or broader inference-engine execution flow.
+- Changing binary model loading behavior.
+
+## Decisions
+
+### 1. Add explicit overflow guards in GGUF size math
+
+`GGUFTensorInfo::numElements()`, `GGUFTensorInfo::calculateSize()`, and GGUF array readers will reject sizes that overflow `size_t` or exceed safe allocation math.
+
+Why:
+- This converts malicious/untrusted metadata into structured parse errors.
+- It keeps the parser as a deep module: callers do not need to duplicate size-validation logic.
+
+### 2. Validate GGUF runtime completeness before CUDA allocation
+
+`ModelLoader::loadGGUF()` will build a required-tensor checklist from the extracted `ModelConfig` and return an error if required runtime tensors are missing or unsupported.
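The checklist step described above could look roughly like the following host-only sketch. It is an illustration under stated assumptions, not the project's implementation: `TensorIndex`, `requiredTensorNames`, and `findMissingTensors` are hypothetical names, the tensor names follow the `blk.{N}.*` convention that appears elsewhere in this change, and alternative spellings such as `tok_embeddings.weight` are left out for brevity.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the parser's tensor table: only the tensor names.
using TensorIndex = std::set<std::string>;

// Build the list of tensor names a runtime load would need for `num_layers` layers.
std::vector<std::string> requiredTensorNames(int num_layers) {
    std::vector<std::string> names = {"token_embd.weight", "output_norm.weight",
                                      "output.weight"};
    const char *per_layer[] = {"attn_q.weight",   "attn_k.weight",    "attn_v.weight",
                               "attn_output.weight", "ffn_gate.weight", "ffn_down.weight",
                               "ffn_up.weight",   "attn_norm.weight", "ffn_norm.weight"};
    for (int layer = 0; layer < num_layers; ++layer) {
        const std::string prefix = "blk." + std::to_string(layer) + ".";
        for (const char *suffix : per_layer) {
            names.push_back(prefix + suffix);
        }
    }
    return names;
}

// Return every required name that is absent from the file.
// Nothing is allocated for weights here; the check is pure host-side bookkeeping.
std::vector<std::string> findMissingTensors(const TensorIndex &present, int num_layers) {
    std::vector<std::string> missing;
    for (const auto &name : requiredTensorNames(num_layers)) {
        if (present.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

A loader built this way can return a single error listing every missing name before it touches the GPU, which is the ordering this decision asks for.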
+
+Why:
+- Missing tensors are input-validation failures, not CUDA failures.
+- This concentrates loader policy in one place and removes fake-success zero placeholders from the public seam.
+
+### 3. Remove schema-invalid config fallback behavior
+
+`GGUFParser::extractModelConfig()` will stop treating unrelated metadata like `general.architecture` as numeric dimension fallbacks.
+
+Why:
+- `general.architecture` is architecture identity, not `hidden_dim`.
+- Keeping schema mapping strict improves locality and makes malformed GGUF behavior predictable.
+
+## Risks / Trade-offs
+
+- Some GGUF files that previously limped through with zero-filled placeholders will now fail early. This is intentional hardening.
+- Tests must use host-side fixtures that fail before CUDA allocation when the environment lacks a working GPU/runtime.
+
+## Verification Plan
+
+1. Add failing host-side tests for:
+   - tensor-size overflow guards
+   - oversized GGUF arrays
+   - invalid config fallback behavior
+   - incomplete GGUF runtime tensors returning `Result` errors without throwing
+2. Implement minimal production changes to satisfy those tests.
+3. Run the focused test target plus the strongest available repository verification commands.
diff --git a/openspec/changes/harden-gguf-contracts/proposal.md b/openspec/changes/harden-gguf-contracts/proposal.md
new file mode 100644
index 0000000..a5d59c1
--- /dev/null
+++ b/openspec/changes/harden-gguf-contracts/proposal.md
@@ -0,0 +1,22 @@
+## Why
+
+Tiny-LLM's GGUF host-side path is currently too forgiving and too shallow. The parser accepts untrusted sizes without enough overflow protection, the loader validates required runtime tensors only after it starts allocating CUDA memory, and malformed metadata can influence `ModelConfig` in ways that do not reflect the GGUF schema. In practice this means crafted or incomplete GGUF files can trigger exceptions, oversized allocations, or fake-success paths instead of returning a clean `Result` error.
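To make the overflow concern concrete: a tensor descriptor with dimensions {2^63, 2} has an element count that wraps to 0 in 64-bit unsigned arithmetic, so an unchecked size computation would request an empty buffer for a tensor that claims to be enormous. The following is a minimal sketch of a checked multiply, not the code in this change: `checkedMul` and `checkedElementCount` are hypothetical names, and `std::optional` stands in for the project's `Result` type.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Multiply two 64-bit unsigned values; return std::nullopt if the product would wrap.
std::optional<uint64_t> checkedMul(uint64_t lhs, uint64_t rhs) {
    if (lhs != 0 && rhs > std::numeric_limits<uint64_t>::max() / lhs) {
        return std::nullopt;
    }
    return lhs * rhs;
}

// Product of all dimensions, or std::nullopt as soon as any step overflows.
// Note: an empty dimension list yields 1 in this sketch.
std::optional<uint64_t> checkedElementCount(const std::vector<uint64_t> &dims) {
    uint64_t n = 1;
    for (uint64_t d : dims) {
        auto next = checkedMul(n, d);
        if (!next) {
            return std::nullopt;
        }
        n = *next;
    }
    return n;
}
```

Returning an explicit empty value rather than 0 is one possible design choice; it keeps a tensor with a zero-sized dimension distinguishable from an overflow.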
+
+## What Changes
+
+- Harden GGUF parsing against integer-overflow and oversized-allocation paths.
+- Make GGUF runtime loading validate required metadata and tensor presence before any CUDA allocation.
+- Remove misleading fallback behavior in GGUF-to-`ModelConfig` mapping.
+- Add regression tests for hostile/malformed GGUF inputs and incomplete runtime artifacts.
+- Align the inference-engine capability delta with the stricter GGUF contract.
+
+## Capabilities
+
+### Modified Capabilities
+- `inference-engine`: Tightens GGUF parsing and runtime-loading failure behavior so malformed or incomplete files fail explicitly instead of throwing or silently synthesizing invalid weights.
+
+## Impact
+
+- Affects `src/gguf_parser.cpp`, `src/model_loader.cpp`, and related tests.
+- Adds stricter failure behavior for malformed/incomplete GGUF inputs.
+- Improves alignment between `Result`-based error handling and the actual GGUF host-side path.
diff --git a/openspec/changes/harden-gguf-contracts/specs/inference-engine/spec.md b/openspec/changes/harden-gguf-contracts/specs/inference-engine/spec.md
new file mode 100644
index 0000000..ad788d4
--- /dev/null
+++ b/openspec/changes/harden-gguf-contracts/specs/inference-engine/spec.md
@@ -0,0 +1,28 @@
+## MODIFIED Requirements
+
+### Requirement: Model Loading
+
+The system SHALL parse and load model files with explicit validation of malformed, incomplete, or unsupported GGUF inputs before runtime allocation begins.
+
+1. The system SHALL parse GGUF file headers and configuration metadata.
+2. The system SHALL reject GGUF metadata, tensor dimensions, or array sizes that overflow safe allocation math.
+3. The system SHALL validate that required runtime tensors are present before allocating CUDA memory for GGUF runtime loading.
+4. The system SHALL return a structured error for corrupted, invalid, incomplete, or unsupported model files.
+5. The system SHALL provide a structured model representation including `ModelConfig` and `QuantizedWeight` structures.
+
+#### Scenario: Reject oversized GGUF tensor metadata
+- **GIVEN** a GGUF tensor descriptor or metadata array whose size computation would overflow
+- **WHEN** the parser reads the file
+- **THEN** parsing SHALL fail with a structured error
+- **AND** the parser SHALL NOT attempt an undersized allocation
+
+#### Scenario: Reject incomplete GGUF runtime tensor sets
+- **GIVEN** a GGUF file that parses successfully but omits tensors required for runtime inference
+- **WHEN** `ModelLoader::loadGGUF()` is called
+- **THEN** the loader SHALL return an error describing the missing or unsupported tensors
+- **AND** it SHALL fail before starting CUDA allocations for model weights
+
+#### Scenario: Ignore unrelated metadata fallbacks
+- **GIVEN** GGUF metadata that includes unrelated keys such as `general.architecture`
+- **WHEN** `extractModelConfig()` maps metadata into `ModelConfig`
+- **THEN** unrelated keys SHALL NOT override numeric model dimensions
diff --git a/openspec/changes/harden-gguf-contracts/tasks.md b/openspec/changes/harden-gguf-contracts/tasks.md
new file mode 100644
index 0000000..51634ac
--- /dev/null
+++ b/openspec/changes/harden-gguf-contracts/tasks.md
@@ -0,0 +1,17 @@
+## 1. GGUF parser overflow hardening
+
+- [ ] 1.1 Add failing tests for tensor-size overflow and oversized GGUF arrays
+- [ ] 1.2 Guard `numElements()`, `calculateSize()`, and array byte-count math against overflow
+- [ ] 1.3 Re-run the focused GGUF/parser tests
+
+## 2. GGUF config and runtime contract hardening
+
+- [ ] 2.1 Add failing tests for invalid metadata fallback and incomplete runtime tensor sets
+- [ ] 2.2 Make `extractModelConfig()` ignore unrelated metadata fallbacks
+- [ ] 2.3 Make `ModelLoader::loadGGUF()` validate required tensors before CUDA allocation and fail with `Result` errors
+- [ ] 2.4 Re-run the focused GGUF/model-loader tests
+
+## 3. Spec and verification alignment
+
+- [ ] 3.1 Update the inference-engine change delta to describe the stricter GGUF contract
+- [ ] 3.2 Run repository verification commands or strongest available substitutes and record environment limits
diff --git a/src/gguf_parser.cpp b/src/gguf_parser.cpp
index bbf489c..ba12fb5 100644
--- a/src/gguf_parser.cpp
+++ b/src/gguf_parser.cpp
@@ -3,9 +3,39 @@
 #include
 #include
+#include

 namespace tiny_llm {

+namespace {
+
+bool safeMultiplySize(size_t lhs, size_t rhs, size_t &out) {
+    if (lhs != 0 && rhs > std::numeric_limits::max() / lhs) {
+        out = 0;
+        return false;
+    }
+    out = lhs * rhs;
+    return true;
+}
+
+template
+bool canAllocateArray(uint64_t count) {
+    return count <= std::numeric_limits::max() / sizeof(T);
+}
+
+size_t remainingBytes(std::ifstream &file) {
+    const std::streampos current = file.tellg();
+    file.seekg(0, std::ios::end);
+    const std::streampos end = file.tellg();
+    file.seekg(current);
+    if (current < 0 || end < current) {
+        return 0;
+    }
+    return static_cast(end - current);
+}
+
+} // namespace
+
 GGUFParser::GGUFParser(const std::string &path) : path_(path) {}

 Result GGUFParser::parse() {
@@ -273,34 +303,50 @@ Result GGUFParser::readArray(std::ifstream &file) {
     }

     TLLM_TRACE("Reading array of {} elements, type {}", count, static_cast(elem_type));
+    const size_t remaining = remainingBytes(file);

     // Handle different array types
     switch (elem_type) {
     case GGUFType::UINT32: {
+        if (!canAllocateArray(count) || count > remaining / sizeof(uint32_t)) {
+            return Result::err("Array too large: " + std::to_string(count));
+        }
         std::vector arr(count);
         file.read(reinterpret_cast(arr.data()), count * 4);
         if (!file) return Result::err("Failed to read uint32 array");
         return Result::ok(GGUFValue{arr});
     }
     case GGUFType::INT32: {
+        if (!canAllocateArray(count) || count > remaining / sizeof(int32_t)) {
+            return Result::err("Array too large: " + std::to_string(count));
+        }
         std::vector arr(count);
         file.read(reinterpret_cast(arr.data()), count * 4);
         if (!file) return Result::err("Failed to read int32 array");
         return Result::ok(GGUFValue{arr});
     }
     case GGUFType::FLOAT32: {
+        if (!canAllocateArray(count) || count > remaining / sizeof(float)) {
+            return Result::err("Array too large: " + std::to_string(count));
+        }
         std::vector arr(count);
         file.read(reinterpret_cast(arr.data()), count * 4);
         if (!file) return Result::err("Failed to read float array");
         return Result::ok(GGUFValue{arr});
     }
     case GGUFType::FLOAT64: {
+        if (!canAllocateArray(count) || count > remaining / sizeof(double)) {
+            return Result::err("Array too large: " + std::to_string(count));
+        }
         std::vector arr(count);
         file.read(reinterpret_cast(arr.data()), count * 8);
         if (!file) return Result::err("Failed to read double array");
         return Result::ok(GGUFValue{arr});
     }
     case GGUFType::STRING: {
+        if (count > remaining / sizeof(uint64_t)) {
+            return Result::err("Array too large: " + std::to_string(count));
+        }
         std::vector arr;
         arr.reserve(count);
         for (uint64_t i = 0; i < count; ++i) {
@@ -340,6 +386,10 @@ Result GGUFParser::readString(std::ifstream &file) {
         return Result::err("String too long: " + std::to_string(length));
     }

+    if (length == 0) {
+        return Result::ok({});
+    }
+
     std::string str(length, '\0');
     file.read(&str[0], length);

@@ -403,13 +453,13 @@ Result GGUFParser::extractModelConfig() const {
     get_int("llama.attention.head_count", config.num_heads);
     get_int("llama.attention.head_count_kv", config.num_kv_heads);
     get_int("llama.context_length", config.max_seq_len);
-    get_int("general.architecture", config.hidden_dim); // fallback

     // Try alternative keys (some models use different naming)
-    get_int("llama.embedding_length", config.hidden_dim);
-    if (config.hidden_dim == 4096) { // default
-        get_int("llama.embedding_length", config.hidden_dim);
-    }
+    get_int("general.embedding_length", config.hidden_dim);
+    get_int("general.block_count", config.num_layers);
+    get_int("general.head_count", config.num_heads);
+    get_int("general.head_count_kv", config.num_kv_heads);
+    get_int("general.context_length", config.max_seq_len);

     // Tokenizer metadata
     get_int("tokenizer.ggml.model.vocab_size", config.vocab_size);
@@ -472,7 +522,10 @@ Result> GGUFParser::readTensorData(const GGUFTensorInfo &te
                            std::to_string(read_offset));
     }

-    size_t size = tensor.calculateSize();
+    size_t size = tensor.calculateSize();
+    if (!tensor.dimensions.empty() && size == 0) {
+        return Result>::err("Tensor size overflow for: " + tensor.name);
+    }
     std::vector data(size);
     file.read(reinterpret_cast(data.data()), size);

@@ -491,43 +544,63 @@ size_t GGUFTensorInfo::numElements() const {
     if (dimensions.empty()) return 0;
     size_t n = 1;
     for (auto d : dimensions) {
-        n *= d;
+        size_t next = 0;
+        if (!safeMultiplySize(n, static_cast(d), next)) {
+            return 0;
+        }
+        n = next;
     }
     return n;
 }

 size_t GGUFTensorInfo::calculateSize() const {
     size_t num_elem = numElements();
+    if (!dimensions.empty() && num_elem == 0) {
+        return 0;
+    }
+
+    auto scaledSize = [&](size_t bytes_per_elem) -> size_t {
+        size_t size = 0;
+        return safeMultiplySize(num_elem, bytes_per_elem, size) ? size : 0;
+    };
+    auto blockScaledSize = [&](size_t block_size, size_t bytes_per_block) -> size_t {
+        if (block_size == 0) {
+            return 0;
+        }
+        const size_t blocks = (num_elem + block_size - 1) / block_size;
+        size_t size = 0;
+        return safeMultiplySize(blocks, bytes_per_block, size) ? size : 0;
+    };

     // Bytes per element for each type
     switch (type) {
     case GGMLType::F32:
-        return num_elem * 4;
+        return scaledSize(4);
     case GGMLType::F16:
-        return num_elem * 2;
+        return scaledSize(2);
     case GGMLType::I8:
         return num_elem;
     case GGMLType::I16:
-        return num_elem * 2;
+        return scaledSize(2);
     case GGMLType::I32:
-        return num_elem * 4;
+        return scaledSize(4);
     case GGMLType::I64:
-        return num_elem * 8;
+        return scaledSize(8);
     case GGMLType::F64:
-        return num_elem * 8;
+        return scaledSize(8);
     case GGMLType::Q8_0:
         // Q8_0: 32 values per block, each block has 32 int8 + 1 half scale
-        return (num_elem / 32) * (32 + 2);
+        return blockScaledSize(32, 32 + 2);
     case GGMLType::Q4_0:
         // Q4_0: 32 values per block, each block has 16 int8 + 1 half scale
-        return (num_elem / 32) * (16 + 2);
+        return blockScaledSize(32, 16 + 2);
     case GGMLType::Q4_1:
         // Q4_1: 32 values per block, each block has 16 int8 + 2 half (scale + min)
-        return (num_elem / 32) * (16 + 4);
+        return blockScaledSize(32, 16 + 4);
     default:
         // Default to 2 bytes per element (FP16)
         TLLM_WARN("Unknown tensor type {}, assuming FP16 size", static_cast(type));
-        return num_elem * 2;
+        return scaledSize(2);
     }
 }

diff --git a/src/main.cpp b/src/main.cpp
index c75ef73..8c2d11a 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -31,7 +31,8 @@ void printHelp(const char *program_name) {
     std::cout << " " << program_name << " # Show CUDA readiness" << std::endl;
     std::cout << " " << program_name << " --info # Show detailed device info" << std::endl;
-    std::cout << " " << program_name << " model.gguf # Load GGUF model (partial support)"
+    std::cout << " " << program_name
+              << " model.gguf # Inspect GGUF support notes (runtime load unsupported)"
               << std::endl;
 }

@@ -177,8 +178,8 @@ int main(int argc, char **argv) {
     // Handle model path argument
     if (!model_path.empty()) {
         if (model_path.size() >= 5 && model_path.substr(model_path.size() - 5) == ".gguf") {
-            std::cout << "\nRuntime note: GGUF parsing is partial and runtime GGUF loading is not "
-                         "supported yet."
+            std::cout << "\nRuntime note: GGUF parsing/validation is available, but runtime GGUF "
+                         "loading is intentionally unsupported."
                       << std::endl;
             std::cout << "Use the test binary format consumed by ModelLoader::loadBin() for "
                          "end-to-end loading."
diff --git a/src/model_loader.cpp b/src/model_loader.cpp
index bde06fc..b0ab7d3 100644
--- a/src/model_loader.cpp
+++ b/src/model_loader.cpp
@@ -40,14 +40,6 @@ Result ModelLoader::loadGGUF(const std::string &path, ModelConfig
         tensor_map[t.name] = &t;
     }

-    ModelWeights weights;
-    bool success = false;
-    auto cleanup_on_error = [&]() {
-        if (!success) {
-            freeWeights(weights);
-        }
-    };
-
     // Helper function to find tensor
     auto find_tensor = [&](const std::string &name) -> const GGUFTensorInfo * {
         auto it = tensor_map.find(name);
@@ -57,279 +49,80 @@ Result ModelLoader::loadGGUF(const std::string &path, ModelConfig
         return nullptr;
     };

-    // Load token embedding
-    // GGUF naming: token_embd.weight
-    const GGUFTensorInfo *embed_tensor = find_tensor("token_embd.weight");
-    if (!embed_tensor) {
-        // Try alternative names
-        embed_tensor = find_tensor("tok_embeddings.weight");
-    }
-
-    if (embed_tensor) {
-        TLLM_DEBUG("Loading token embedding from tensor: {}", embed_tensor->name);
-        auto data_result = parser.readTensorData(*embed_tensor);
-        if (data_result.isErr()) {
-            cleanup_on_error();
-            return Result::err("Failed to read token embedding: " +
-                               data_result.error());
-        }
-
-        size_t embed_size = static_cast(config.vocab_size) * config.hidden_dim;
-        CUDA_CHECK(cudaMalloc(&weights.token_embedding, embed_size * sizeof(half)));
-
-        // Convert to FP16 if needed
-        const auto &data = data_result.value();
-        if (embed_tensor->type == GGMLType::F16) {
-            CUDA_CHECK(cudaMemcpy(weights.token_embedding, data.data(), data.size(),
-                                  cudaMemcpyHostToDevice));
-        } else if (embed_tensor->type == GGMLType::F32) {
-            // Convert F32 to F16 using utility function
-            auto f16_result =
-                convertF32ToF16(reinterpret_cast(data.data()), embed_size);
-            if (f16_result.isOk()) {
-                CUDA_CHECK(cudaMemcpy(weights.token_embedding, f16_result.value().data(),
-                                      embed_size * sizeof(half), cudaMemcpyHostToDevice));
-            } else {
-                TLLM_WARN("Failed to convert embedding: {}", f16_result.error());
+    auto find_any_tensor =
+        [&](std::initializer_list names) -> const GGUFTensorInfo * {
+        for (const char *name : names) {
+            if (const auto *tensor = find_tensor(name)) {
+                return tensor;
             }
-        } else {
-            TLLM_WARN("Unsupported embedding type: {}, skipping",
-                      static_cast(embed_tensor->type));
         }
-    } else {
-        TLLM_WARN("Token embedding tensor not found, using zeros");
-        size_t embed_size = static_cast(config.vocab_size) * config.hidden_dim;
-        CUDA_CHECK(cudaMalloc(&weights.token_embedding, embed_size * sizeof(half)));
-        CUDA_CHECK(cudaMemset(weights.token_embedding, 0, embed_size * sizeof(half)));
-    }
-
-    // Load layer weights
-    weights.layers.resize(config.num_layers);
-
-    for (int layer = 0; layer < config.num_layers; ++layer) {
-        auto &lw = weights.layers[layer];
+        return nullptr;
+    };

-        // GGUF tensor naming convention:
-        // - blk.{N}.attn_q.weight
-        // - blk.{N}.attn_k.weight
-        // - blk.{N}.attn_v.weight
-        // - blk.{N}.attn_output.weight (or attn_out)
-        // - blk.{N}.ffn_gate.weight (w1)
-        // - blk.{N}.ffn_up.weight (w3)
-        // - blk.{N}.ffn_down.weight (w2)
-        // - blk.{N}.attn_norm.weight
-        // - blk.{N}.ffn_norm.weight
-
-        std::string layer_prefix = "blk." + std::to_string(layer) + ".";
-
-        // Alternative LLaMA naming
-        std::string llama_prefix = "layers." + std::to_string(layer) + ".";
-
-        auto find_layer_tensor = [&](const std::string &suffix) -> const GGUFTensorInfo * {
-            const GGUFTensorInfo *t = find_tensor(layer_prefix + suffix);
-            if (!t) {
-                t = find_tensor(llama_prefix + suffix);
+    std::vector missing_tensors;
+    auto require_tensor = [&](std::initializer_list names) {
+        if (!find_any_tensor(names)) {
+            std::string joined;
+            for (const char *name : names) {
+                if (!joined.empty()) {
+                    joined += " | ";
+                }
+                joined += name;
             }
-            return t;
-        };
-
-        // Load attention weights (Q, K, V, O)
-        // For now, load as FP16 placeholders (full quantization support would need conversion)
-        int hidden = config.hidden_dim;
-        int kv_dim = config.num_kv_heads * config.head_dim;
-
-        // Helper to allocate zero-initialized weight
-        auto alloc_zero_weight = [](int rows, int cols) -> QuantizedWeight {
-            QuantizedWeight qw;
-            qw.rows = rows;
-            qw.cols = cols;
-            qw.group_size = 128;
-            CUDA_CHECK(cudaMalloc(&qw.data, qw.weightElements() * sizeof(int8_t)));
-            CUDA_CHECK(cudaMalloc(&qw.scales, qw.scaleElements() * sizeof(half)));
-            CUDA_CHECK(cudaMemset(qw.data, 0, qw.weightElements() * sizeof(int8_t)));
-            CUDA_CHECK(cudaMemset(qw.scales, 0, qw.scaleElements() * sizeof(half)));
-            return qw;
-        };
-
-        // Load Q weight
-        const GGUFTensorInfo *q_tensor = find_layer_tensor("attn_q.weight");
-        if (q_tensor) {
-            TLLM_TRACE("Loading Q weight from: {}", q_tensor->name);
-            // For now, allocate placeholder (full GGUF quantization support requires conversion)
-            lw.wq = alloc_zero_weight(hidden, hidden);
-        } else {
-            TLLM_WARN("Layer {} Q weight not found", layer);
-            lw.wq = alloc_zero_weight(hidden, hidden);
-        }
-
-        // Load K weight
-        const GGUFTensorInfo *k_tensor = find_layer_tensor("attn_k.weight");
-        if (k_tensor) {
-            TLLM_TRACE("Loading K weight from: {}", k_tensor->name);
-            lw.wk = alloc_zero_weight(hidden, kv_dim);
-        } else {
-            lw.wk = alloc_zero_weight(hidden, kv_dim);
-        }
-
-        // Load V weight
-        const GGUFTensorInfo *v_tensor = find_layer_tensor("attn_v.weight");
-        if (v_tensor) {
-            TLLM_TRACE("Loading V weight from: {}", v_tensor->name);
-            lw.wv = alloc_zero_weight(hidden, kv_dim);
-        } else {
-            lw.wv = alloc_zero_weight(hidden, kv_dim);
-        }
-
-        // Load O weight
-        const GGUFTensorInfo *o_tensor = find_layer_tensor("attn_output.weight");
-        if (!o_tensor) {
-            o_tensor = find_layer_tensor("attn_out.weight");
-        }
-        if (o_tensor) {
-            TLLM_TRACE("Loading O weight from: {}", o_tensor->name);
-            lw.wo = alloc_zero_weight(hidden, hidden);
-        } else {
-            lw.wo = alloc_zero_weight(hidden, hidden);
-        }
-
-        // Load FFN weights
-        const GGUFTensorInfo *w1_tensor = find_layer_tensor("ffn_gate.weight");
-        const GGUFTensorInfo *w2_tensor = find_layer_tensor("ffn_down.weight");
-        const GGUFTensorInfo *w3_tensor = find_layer_tensor("ffn_up.weight");
-        int inter = config.intermediate_dim;
-
-        if (w1_tensor) {
-            TLLM_TRACE("Loading w1 (gate) weight from: {}", w1_tensor->name);
-        }
-        if (w2_tensor) {
-            TLLM_TRACE("Loading w2 (down) weight from: {}", w2_tensor->name);
-        }
-        if (w3_tensor) {
-            TLLM_TRACE("Loading w3 (up) weight from: {}", w3_tensor->name);
+            missing_tensors.push_back(std::move(joined));
         }
+    };

-        lw.w1 = alloc_zero_weight(hidden, inter);
-        lw.w2 = alloc_zero_weight(inter, hidden);
-        lw.w3 = alloc_zero_weight(hidden, inter);
-
-        // Load normalization weights
-        const GGUFTensorInfo *attn_norm = find_layer_tensor("attn_norm.weight");
-        if (attn_norm) {
-            auto data_result = parser.readTensorData(*attn_norm);
-            if (data_result.isOk()) {
-                CUDA_CHECK(cudaMalloc(&lw.rms_att_weight, hidden * sizeof(half)));
-                // Convert if needed
-                if (attn_norm->type == GGMLType::F32) {
-                    auto f16_result = convertF32ToF16(
-                        reinterpret_cast(data_result.value().data()), hidden);
-                    if (f16_result.isOk()) {
-                        CUDA_CHECK(cudaMemcpy(lw.rms_att_weight, f16_result.value().data(),
-                                              hidden * sizeof(half), cudaMemcpyHostToDevice));
-                    }
-                } else {
-                    CUDA_CHECK(cudaMemcpy(lw.rms_att_weight, data_result.value().data(),
-                                          hidden * sizeof(half), cudaMemcpyHostToDevice));
+    require_tensor({"token_embd.weight", "tok_embeddings.weight"});
+    for (int layer = 0; layer < config.num_layers; ++layer) {
+        const std::string layer_prefix = "blk." + std::to_string(layer) + ".";
+        const std::string llama_prefix = "layers." + std::to_string(layer) + ".";
+        auto require_layer_tensor = [&](std::initializer_list suffixes) {
+            for (const char *suffix : suffixes) {
+                if (find_tensor(layer_prefix + suffix) || find_tensor(llama_prefix + suffix)) {
+                    return;
                 }
             }
-        } else {
-            TLLM_WARN("Layer {} attention norm not found", layer);
-            CUDA_CHECK(cudaMalloc(&lw.rms_att_weight, hidden * sizeof(half)));
-            CUDA_CHECK(cudaMemset(lw.rms_att_weight, 0, hidden * sizeof(half)));
-        }
-
-        const GGUFTensorInfo *ffn_norm = find_layer_tensor("ffn_norm.weight");
-        if (ffn_norm) {
-            auto data_result = parser.readTensorData(*ffn_norm);
-            if (data_result.isOk()) {
-                CUDA_CHECK(cudaMalloc(&lw.rms_ffn_weight, hidden * sizeof(half)));
-                if (ffn_norm->type == GGMLType::F32) {
-                    auto f16_result = convertF32ToF16(
-                        reinterpret_cast(data_result.value().data()), hidden);
-                    if (f16_result.isOk()) {
-                        CUDA_CHECK(cudaMemcpy(lw.rms_ffn_weight, f16_result.value().data(),
-                                              hidden * sizeof(half), cudaMemcpyHostToDevice));
-                    }
-                } else {
-                    CUDA_CHECK(cudaMemcpy(lw.rms_ffn_weight, data_result.value().data(),
-                                          hidden * sizeof(half), cudaMemcpyHostToDevice));
+            std::string joined;
+            for (const char *suffix : suffixes) {
+                if (!joined.empty()) {
+                    joined += " | ";
                 }
+                joined += layer_prefix + suffix;
+                joined += " | ";
+                joined += llama_prefix + suffix;
             }
-        } else {
-            TLLM_WARN("Layer {} FFN norm not found", layer);
-            CUDA_CHECK(cudaMalloc(&lw.rms_ffn_weight, hidden * sizeof(half)));
-            CUDA_CHECK(cudaMemset(lw.rms_ffn_weight, 0, hidden * sizeof(half)));
-        }
+            missing_tensors.push_back(std::move(joined));
+        };

-        TLLM_TRACE("Loaded layer {}/{}", layer + 1, config.num_layers);
+        require_layer_tensor({"attn_q.weight"});
+        require_layer_tensor({"attn_k.weight"});
+        require_layer_tensor({"attn_v.weight"});
+        require_layer_tensor({"attn_output.weight", "attn_out.weight"});
+        require_layer_tensor({"ffn_gate.weight"});
+        require_layer_tensor({"ffn_down.weight"});
+        require_layer_tensor({"ffn_up.weight"});
+        require_layer_tensor({"attn_norm.weight"});
+        require_layer_tensor({"ffn_norm.weight"});
     }

-    // Load final norm
-    const GGUFTensorInfo *output_norm = find_tensor("output_norm.weight");
-    if (!output_norm) {
-        output_norm = find_tensor("norm.weight");
-    }
-    if (output_norm) {
-        TLLM_DEBUG("Loading final norm from: {}", output_norm->name);
-        auto data_result = parser.readTensorData(*output_norm);
-        if (data_result.isOk()) {
-            CUDA_CHECK(cudaMalloc(&weights.final_norm_weight, config.hidden_dim * sizeof(half)));
-            if (output_norm->type == GGMLType::F32) {
-                auto f16_result = convertF32ToF16(
-                    reinterpret_cast(data_result.value().data()), config.hidden_dim);
-                if (f16_result.isOk()) {
-                    CUDA_CHECK(cudaMemcpy(weights.final_norm_weight, f16_result.value().data(),
-                                          config.hidden_dim * sizeof(half),
-                                          cudaMemcpyHostToDevice));
-                }
-            } else {
-                CUDA_CHECK(cudaMemcpy(weights.final_norm_weight, data_result.value().data(),
-                                      config.hidden_dim * sizeof(half), cudaMemcpyHostToDevice));
+    require_tensor({"output_norm.weight", "norm.weight"});
+    require_tensor({"output.weight", "lm_head.weight"});
+
+    if (!missing_tensors.empty()) {
+        std::string missing;
+        for (size_t i = 0; i < missing_tensors.size(); ++i) {
+            if (i > 0) {
+                missing += ", ";
             }
+            missing += missing_tensors[i];
         }
-    } else {
-        TLLM_WARN("Output norm not found");
-        CUDA_CHECK(cudaMalloc(&weights.final_norm_weight, config.hidden_dim * sizeof(half)));
-        CUDA_CHECK(cudaMemset(weights.final_norm_weight, 0, config.hidden_dim * sizeof(half)));
-    }
-
-    // Load LM head
-    const GGUFTensorInfo *lm_head = find_tensor("output.weight");
-    if (!lm_head) {
-        lm_head = find_tensor("lm_head.weight");
-    }
-    if (lm_head) {
-        TLLM_DEBUG("Loading LM head from: {}", lm_head->name);
-        // For now, allocate placeholder (full support needs quantization conversion)
-        weights.lm_head.rows = config.hidden_dim;
-        weights.lm_head.cols = config.vocab_size;
-        weights.lm_head.group_size = 128;
-        CUDA_CHECK(
-            cudaMalloc(&weights.lm_head.data, weights.lm_head.weightElements() * sizeof(int8_t)));
-        CUDA_CHECK(
-            cudaMalloc(&weights.lm_head.scales, weights.lm_head.scaleElements() * sizeof(half)));
-        CUDA_CHECK(
-            cudaMemset(weights.lm_head.data, 0, weights.lm_head.weightElements() * sizeof(int8_t)));
-        CUDA_CHECK(
-            cudaMemset(weights.lm_head.scales, 0, weights.lm_head.scaleElements() * sizeof(half)));
-    } else {
-        TLLM_WARN("LM head tensor not found");
-        weights.lm_head.rows = config.hidden_dim;
-        weights.lm_head.cols = config.vocab_size;
-        weights.lm_head.group_size = 128;
-        CUDA_CHECK(
-            cudaMalloc(&weights.lm_head.data, weights.lm_head.weightElements() * sizeof(int8_t)));
-        CUDA_CHECK(
-            cudaMalloc(&weights.lm_head.scales, weights.lm_head.scaleElements() * sizeof(half)));
-        CUDA_CHECK(
-            cudaMemset(weights.lm_head.data, 0, weights.lm_head.weightElements() * sizeof(int8_t)));
-        CUDA_CHECK(
-            cudaMemset(weights.lm_head.scales, 0, weights.lm_head.scaleElements() * sizeof(half)));
+        return Result::err("GGUF runtime tensors missing: " + missing);
     }

-    success = true;
-    TLLM_INFO("GGUF model loaded successfully");
-
-    return Result::ok(std::move(weights));
+    return Result::err(
+        "GGUF runtime tensor conversion is not implemented; use GGUF parsing/metadata surfaces "
+        "only or convert to the supported binary runtime format first.");
 }

 Result ModelLoader::loadBin(const std::string &path, const ModelConfig &config) {
diff --git a/tests/test_model_loader.cpp b/tests/test_model_loader.cpp
index 435762d..1c3b6f5 100644
--- a/tests/test_model_loader.cpp
+++ b/tests/test_model_loader.cpp
@@ -4,6 +4,7 @@
 #include
 #include
 #include
+#include
 #include
 // #include // NOTE: rapidcheck/gtest disabled due to GCC 11/12 std::function bug

@@ -55,9 +56,144 @@ class TempFile {
     std::string path_;
 };

+namespace {
+
+void appendBytes(std::vector &out, const void *data, size_t size) {
+    const auto *bytes = static_cast(data);
+    out.insert(out.end(), bytes, bytes + size);
+}
+
+template
+void appendValue(std::vector &out, const T &value) {
+    appendBytes(out, &value, sizeof(T));
+}
+
+void appendString(std::vector &out, const std::string &value) {
+    uint64_t size = value.size();
+    appendValue(out, size);
+    appendBytes(out, value.data(), value.size());
+}
+
+struct GGUFFileBuilder {
+    std::vector metadata_entries;
+    std::vector tensor_entries;
+    uint64_t metadata_count = 0;
+    uint64_t tensor_count = 0;
+
+    template
+    void addScalarMetadata(const std::string &key, GGUFType type, const T &value) {
+        appendString(metadata_entries, key);
+        uint32_t raw_type = static_cast(type);
+        appendValue(metadata_entries, raw_type);
+        appendValue(metadata_entries, value);
+        ++metadata_count;
+    }
+
+    void addArrayMetadataHeader(const std::string &key, GGUFType elem_type, uint64_t count) {
+        appendString(metadata_entries, key);
+        uint32_t raw_type = static_cast(GGUFType::ARRAY);
+        appendValue(metadata_entries, raw_type);
+        uint32_t raw_elem_type = static_cast(elem_type);
+        appendValue(metadata_entries, raw_elem_type);
+        appendValue(metadata_entries, count);
+        ++metadata_count;
+    }
+
+    void writeTo(const TempFile &file) const {
+        std::vector bytes;
+        appendValue(bytes, GGUF_MAGIC);
+        uint32_t version = 3;
+        appendValue(bytes, version);
+        appendValue(bytes, tensor_count);
+        appendValue(bytes, metadata_count);
+        bytes.insert(bytes.end(), metadata_entries.begin(), metadata_entries.end());
+        bytes.insert(bytes.end(), tensor_entries.begin(), tensor_entries.end());
+
+        constexpr uint64_t kAlignment = 32;
+        size_t aligned_size = (bytes.size() + kAlignment - 1) & ~(kAlignment - 1);
+        bytes.resize(aligned_size, 0);
+        file.writeBytes(bytes);
+    }
+};
+
+} // namespace
+
 // Unit tests for GGUF header parsing
 class GGUFHeaderTest : public ::testing::Test {};

+TEST(GGUFTensorInfoTest, NumElementsReturnsZeroOnOverflow) {
+    GGUFTensorInfo info;
+    info.dimensions = {std::numeric_limits::max(), 2};
+
+    EXPECT_EQ(info.numElements(), 0u);
+}
+
+TEST(GGUFTensorInfoTest, CalculateSizeReturnsZeroOnOverflow) {
+    GGUFTensorInfo info;
+    info.type = GGMLType::F32;
+    info.dimensions = {std::numeric_limits::max(), 2};
+
+    EXPECT_EQ(info.calculateSize(), 0u);
+}
+
+TEST_F(GGUFHeaderTest, OversizedArrayReturnsErrorInsteadOfThrowing) {
+    TempFile file(".gguf");
+    GGUFFileBuilder gguf;
+    gguf.addArrayMetadataHeader("bad.array", GGUFType::UINT32,
+                                std::numeric_limits::max() / sizeof(uint32_t));
+    gguf.writeTo(file);
+
+    GGUFParser parser(file.path());
+
+    EXPECT_NO_THROW({
+        auto result = parser.parse();
+        EXPECT_TRUE(result.isErr());
+        if (result.isErr()) {
+            EXPECT_NE(result.error().find("Array"), std::string::npos);
+        }
+    });
+}
+
+TEST_F(GGUFHeaderTest, ExtractModelConfigIgnoresNumericArchitectureMetadata) {
+    TempFile file(".gguf");
+    GGUFFileBuilder gguf;
+    gguf.addScalarMetadata("general.architecture", GGUFType::UINT32, uint32_t{1234});
+    gguf.writeTo(file);
+
+    GGUFParser parser(file.path());
+    auto parse_result = parser.parse();
+    ASSERT_TRUE(parse_result.isOk()) << parse_result.error();
+
+    auto config_result = parser.extractModelConfig();
+    ASSERT_TRUE(config_result.isOk()) << config_result.error();
+    EXPECT_EQ(config_result.value().hidden_dim, ModelConfig{}.hidden_dim);
+}
+
+TEST_F(GGUFHeaderTest, IncompleteRuntimeGGUFReturnsStructuredErrorWithoutThrowing) {
+    TempFile file(".gguf");
+    GGUFFileBuilder gguf;
+    gguf.addScalarMetadata("llama.embedding_length", GGUFType::UINT32, uint32_t{8});
+    gguf.addScalarMetadata("llama.block_count", GGUFType::UINT32, uint32_t{1});
+    gguf.addScalarMetadata("llama.attention.head_count", GGUFType::UINT32, uint32_t{1});
+    gguf.addScalarMetadata("llama.attention.head_count_kv", GGUFType::UINT32, uint32_t{1});
+    gguf.addScalarMetadata("llama.context_length", GGUFType::UINT32, uint32_t{16});
+    gguf.addScalarMetadata("llama.feed_forward_length", GGUFType::UINT32, uint32_t{16});
+    gguf.addScalarMetadata("tokenizer.ggml.model.vocab_size", GGUFType::UINT32, uint32_t{32});
+    gguf.writeTo(file);
+
+    ModelConfig config;
+
+    EXPECT_NO_THROW({
+        auto result = ModelLoader::loadGGUF(file.path(), config);
+        EXPECT_TRUE(result.isErr());
+        if (result.isErr()) {
+            EXPECT_TRUE(result.error().find("missing") != std::string::npos ||
+                        result.error().find("required") != std::string::npos ||
+                        result.error().find("unsupported") != std::string::npos);
+        }
+    });
+}
+
 TEST_F(GGUFHeaderTest, ValidHeader) {
     TempFile file(".gguf");