apoorvaeerly/UCT_POC1_ShortText_Categorization
Material ID Assignment Pipeline — End-to-End System Analysis

This document outlines the complete architectural approach for the Material ID Assignment Pipeline, encompassing both Phase 1 (Deterministic & ML Filtering) and Phase 2 (LLM Validation, Assignment & Inference).

The objective of this dual-phase system is to achieve maximum coverage of Material ID assignments while guaranteeing high accuracy and providing human-readable explanations for every decision made.

1. End-To-End Architectural Flow Chart

The following flowchart illustrates the step-by-step data journey through the pipeline.

```mermaid
graph TD
    %% PHASE 1: Data Preparation & ML Matching
    subgraph "Phase 1: Deterministic & Embedding Pipeline"
        A[Raw PO Data: 16,567 rows] --> B(Data Normalization)
        B --> C{Goods vs Service Classifier}
        C -- Non-Goods --> D[Exclude from Assignment]
        C -- Goods --> E(Duplicate Cluster Detection)

        E --> F{Deterministic Matching}
        F -- Fuzzy/Supplier/Price Match --> G[Assign from Historical or Item Master]
        F -- No Exact Match --> H{Embedding Matching Chroma ANN}

        H -- High Confidence Match --> I[Assign from Extracted Vector Index]
        H -- Low Confidence Band --> J[Flag for Review]
        H -- No Match --> K[Flag as Unassigned]

        G -.-> L{Catalog Integrity Check}
        I -.-> L
        L -- Check Assigned IDs<br>against Item Master --> M{Valid?}
        M -- Missing from Master --> N[Flag: Data Provider<br>Discrepancy]
        M -- Valid --> CC[Confirmed Assigned]
    end

    %% PHASE 2: Validation & LLM Processing
    subgraph "Phase 2: LLM Validation & Intelligent Grouping"
        CC --> Q[LLM Processing Engine]
        K -.-> Q
        J -.-> Q

        Q --> R1[RAG Similarity Score Check]
        R1 -- Score Adequate --> R2[Task A: Matching the<br>9.3k Unassigned Cases]
        R1 -- Score Too Low --> V[Task B: Suggest SAP<br>Group & Flag 'Needs Detail']

        R2 --> T{RAG-Assisted<br>LLM Match?}
        T -- Match Found --> U[Assign Existing ID<br>+ Explanation]
        T -- No Catalog Match --> V

        Q --> S[Task C: QA Cross-validation<br>of Phase 1 Matches]
        S --> W[Confirm or Reject<br>Phase 1 Matches]

        U --> Z[Task D: Finalize Goods<br>vs Service Flag]
        V --> Z
        W --> Z
    end

    %% FINAL OUTPUT
    subgraph "Phase 3: Final Output Generation"
        U --> X[Final Consolidated Dataset]
        V --> X
        W --> X
        D --> X

        X --> Y[(Excel Output with Granular Reasons)]
    end
```

2. Step-by-Step Execution Plan

Phase 1: High-Speed Deterministic & ML Matching

Phase 1 acts as the high-throughput, low-cost filter. It processes 16,000+ rows in under 6 minutes.

  1. Normalization & Cleanup: Strings are normalized, noise is removed, and common abbreviations (e.g., assy → assembly, pcs → pieces) are expanded to match SAP catalogs.
  2. Goods vs Service Evaluation: Utilizing sentence embeddings, the pipeline identifies non-tangible services (e.g., "maintenance labor") and explicitly removes them from the Material ID requirement pool.
  3. Deduplication & Clustering: Exact duplicates (matching text, supplier, price, and UOM) are clustered. If a known material ID exists within a cluster, it is propagated to the unknown rows.
  4. Deterministic & Embedding Matching: The system first attempts exact and fuzzy matching within strict bounds. If that fails, it queries a ChromaDB HNSW vector index to find semantic matches against historical data and the active Item Master.
  5. Catalog Integrity Check: A strict deterministic check is then run against all successfully assigned IDs. It verifies if the generated ID fundamentally exists inside the actual Item Master catalog; if it doesn't, it flags status: data_provider_discrepancy.
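
The normalization step above can be sketched as follows. The abbreviation table and regex rules here are illustrative placeholders, not the pipeline's actual dictionary:

```python
import re

# Illustrative abbreviation map -- the real pipeline's dictionary is larger.
ABBREVIATIONS = {
    "assy": "assembly",
    "pcs": "pieces",
    "qty": "quantity",
}

def normalize_po_text(text: str) -> str:
    """Lowercase, strip punctuation noise, and expand common abbreviations."""
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9./\s-]", " ", text)   # drop punctuation noise
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_po_text("BRG ASSY, 5MM (QTY: 10 PCS)"))
# → brg assembly 5mm quantity 10 pieces
```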

Phase 2: LLM Validation & Intelligent Inference (Azure OpenAI)

Phase 2 acts as the cognitive engine to resolve the ceiling hit by traditional ML (the remaining ~9,300 unassigned rows and the cross-validation of assigned rows).

Step 1: Pre-LLM Similarity Score Gating

Before sending an unassigned row to the LLM, we check its Chroma DB vector distance. If the Top 1 candidate's similarity score is below our minimum viable threshold (meaning the catalog returns completely irrelevant "garbage"), we bypass Task A entirely. We route this row directly to Task B and flag it as needs_more_detail. This saves API costs and prevents the LLM from hallucinating matches against irrelevant data.
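
A minimal sketch of the gate, assuming Chroma returns distances where lower means more similar; the 0.55 threshold is a placeholder, not the pipeline's tuned value:

```python
def gate_for_llm(top1_distance: float, threshold: float = 0.55) -> str:
    """Route a row based on its best Chroma candidate distance.

    Returns 'task_a' when the top candidate is close enough to justify
    an LLM matching call, else 'task_b' (needs_more_detail path).
    """
    if top1_distance <= threshold:
        return "task_a"   # send row + Top 5 candidates to the LLM
    return "task_b"       # skip matching; go straight to group suggestion

# A close nearest-neighbour goes to Task A; a "garbage" retrieval
# (large distance) bypasses the LLM match entirely.
print(gate_for_llm(0.31))  # task_a
print(gate_for_llm(0.88))  # task_b
```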

Step 2: LLM Task A — Matching the Unassigned Cases

For all unassigned rows that pass the similarity gate, we transition from broad ML matching to contextual LLM matching.

  • LLM Processing: The LLM reads the raw PO line and evaluates against the Top 5 curated catalog items. If human logic dictates a match, it assigns the ID and generates a short llm_explanation.
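
One way to frame the Task A call is to assemble a chat payload pairing the raw PO line with the Top 5 curated candidates. The field names (`material_id`, `description`) and the response schema are hypothetical; the actual prompt lives in the pipeline code:

```python
import json

def build_task_a_messages(po_line: str, candidates: list[dict]) -> list[dict]:
    """Assemble a chat payload: raw PO line plus the Top 5 curated candidates."""
    catalog = "\n".join(
        f"{i + 1}. {c['material_id']}: {c['description']}"
        for i, c in enumerate(candidates)
    )
    system = (
        "You are a master data analyst. Match the PO line to at most one "
        "catalog candidate. Reply with JSON: "
        '{"material_id": str|null, "llm_explanation": str}'
    )
    user = f"PO line: {po_line}\n\nCandidates:\n{catalog}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

msgs = build_task_a_messages(
    "bearing assembly 5mm",
    [{"material_id": "MAT-001", "description": "Bearing assy 5 mm"}],
)
print(json.dumps(msgs, indent=2))
```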

Step 3: LLM Task B — Material Group Suggestion (Net-New/Garbage Items)

If an item fails the similarity gate OR the LLM explicitly determines the unassigned row truly does not exist in the catalog:

  • The LLM generates a Recommended Material Group (e.g., Bearings & Seals, IT Hardware).
  • Instead of returning generic blanks, you receive actionable clustered groups predicting what new IDs need creation.

Step 4: LLM Task C — Cross-Validation (All Existing Assignments)

For all assigned rows, the LLM verifies if the existing mapping is logically coherent. If it detects a hallucination, it drops the confidence score and outputs status: llm_rejected_assignment.
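
Applying the Task C verdict to a row might look like the sketch below; the `status` value `llm_rejected_assignment` comes from this document, while the other field names are illustrative:

```python
def apply_task_c_verdict(row: dict, coherent: bool, reason: str) -> dict:
    """Confirm or reject a Phase 1 assignment based on the LLM verdict."""
    row = dict(row)  # keep the input row immutable
    if coherent:
        row["status"] = "llm_confirmed"
    else:
        row["status"] = "llm_rejected_assignment"
        row["confidence"] = 0.0          # drop confidence on a detected hallucination
        row["final_material_id"] = None  # the rejected ID is withdrawn
    row["llm_explanation"] = reason
    return row
```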

Step 5: LLM Task D — Finalize Goods vs Service Flag

For every single row processed by the LLM, the model will output a final_goods_flag. It evaluates the item description against the Phase-1 is_goods prediction. If the ML model mistakenly flagged an item, the LLM catches the context and issues the final verdict.

3. Data Output Schema Integration

The final Excel output merges the technical tracking of Phase 1 with the contextual reasoning of Phase 2. The output ensures that every single row contains a dedicated explanation.

| Column Name | Description |
| --- | --- |
| final_material_id | The ultimate ID chosen (Phase 1, LLM assigned, or None). |
| assignment_stage | The specific step that claimed the row (e.g., phase1_duplicate_cluster, llm_rag_match, llm_rejected). |
| final_goods_flag | The LLM-verified Goods vs. Non-Goods (Service) status. |
| goods_flag_explanation | Why the LLM classified the item as a good or a service. |
| discrepancy_flag | Data integrity result from the Catalog Integrity Check (e.g., missing_from_item_master, valid). |
| llm_explanation | A plain-English reasoning string for why the LLM agreed, disagreed, or could not match. |
| suggested_material_group | Populated only for unassigned items, to facilitate new ID creation. |
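
The schema above maps to one record per row. A small sketch that guarantees every column is present (writing the consolidated records to Excel could then use pandas, noted as an assumption in the comment):

```python
FINAL_COLUMNS = [
    "final_material_id", "assignment_stage", "final_goods_flag",
    "goods_flag_explanation", "discrepancy_flag", "llm_explanation",
    "suggested_material_group",
]

def to_output_row(**fields) -> dict:
    """Build one output record, guaranteeing every schema column is present."""
    unknown = set(fields) - set(FINAL_COLUMNS)
    if unknown:
        raise ValueError(f"unexpected columns: {unknown}")
    return {col: fields.get(col) for col in FINAL_COLUMNS}

row = to_output_row(
    final_material_id="MAT-001",
    assignment_stage="llm_rag_match",
    final_goods_flag=True,
    llm_explanation="Description, vendor, and price all align.",
)
# With pandas installed, the consolidated rows become the Excel output:
#   pd.DataFrame(rows).to_excel("final_output.xlsx", index=False)
print(row)
```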

4. Why Chroma DB is Used Before LLM Matching (RAG Strategy)

You might wonder: "Why not just send everything to the LLM and let it figure it out?"

The answer comes down to API Cost and Context Window Limits. Your Item Master contains 1,550 rows. Your pending PO Data contains 9,353 unassigned rows. If we asked the LLM to scan all 1,550 items for every single one of the 9,353 rows, that would amount to millions of tokens of redundant information per request. This would be outrageously expensive, incredibly slow, and would likely cause the LLM to hallucinate or exceed its maximum context window.
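
The arithmetic behind that claim, using the row counts above and an assumed average of 50 tokens per catalog entry (the per-item token figure is an estimate, not measured):

```python
ITEM_MASTER_ROWS = 1_550
UNASSIGNED_ROWS = 9_353
TOKENS_PER_ITEM = 50   # assumed average per catalog entry, not measured
TOP_K = 5

# Brute force: every unassigned row sees the full Item Master.
brute_force = ITEM_MASTER_ROWS * UNASSIGNED_ROWS * TOKENS_PER_ITEM
# RAG: every unassigned row sees only its Top 5 candidates.
rag = TOP_K * UNASSIGNED_ROWS * TOKENS_PER_ITEM

print(f"brute force: {brute_force:,} catalog tokens")  # ~725 million
print(f"RAG top-5:   {rag:,} catalog tokens")          # ~2.3 million
print(f"reduction:   {brute_force // rag}x")           # 310x
```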

Instead, we use a RAG (Retrieval-Augmented Generation) pattern:

  1. Chroma DB (Fast & Cheap): Instantly queries the 1,550 Item Master list and retrieves the Top 5 most semantically relevant candidates.
  2. Pre-LLM Score Gating (Quality Control): We evaluate the raw distance score from Chroma. If the score is inadequate (signaling the top candidates are "garbage" relative to the PO text), we do not send it to the LLM for matching. We route it straight to the Material Group suggestion phase.
  3. Azure OpenAI (Deep & Logical): Receives only the specific PO row and those highly curated Top 5 candidates, allowing it to spend its processing power strictly on deep logical evaluation rather than brute-force searching against irrelevant noise.

5. LLM Prompt Context & Rules Strategy

To ensure the LLM makes decisions with the same rigor as an expert Master Data Analyst, we will dynamically construct a strict context window.

Context Injected into the Prompt

For every candidate Chroma fetches, we will extract and provide the LLM with the following 8 critical data fields directly from the Item Master:

  1. Plnt (Site Location)
  2. Material (The ID to assign)
  3. Material Description
  4. Material PO Text
  5. Product Category
  6. Vendor
  7. Vendor Name
  8. Net Price
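
Rendering one retrieved candidate as a labelled context block for the prompt could look like this; the "N/A" fallback for missing fields is an assumption:

```python
CONTEXT_FIELDS = [
    "Plnt", "Material", "Material Description", "Material PO Text",
    "Product Category", "Vendor", "Vendor Name", "Net Price",
]

def candidate_context(item: dict) -> str:
    """Render one Item Master candidate as a labelled block for the prompt."""
    return "\n".join(f"{field}: {item.get(field, 'N/A')}" for field in CONTEXT_FIELDS)

print(candidate_context({
    "Plnt": "1000",
    "Material": "MAT-001",
    "Material Description": "Bearing assy 5 mm",
    "Net Price": 4.95,
}))
```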

The "Analyst Ruleset" Provided to the LLM:

The LLM will be instructed to cross-reference the PO row against those 8 fields using the following strict rules before declaring a match:

  1. Description & Functional Alignment:
    • Rule: Do not just match keywords. Check functional equivalency. E.g., The LLM must understand that "1/2 inch drill bit" and "12.7mm steel bit" are functionally identical, but "Bearing 5mm" and "Bearing 8mm" are critically different items and cannot share a Material ID.
  2. Pricing Corroboration (Using Net Price):
    • Rule: If the Material Description looks like a perfect match, but the PO Net Price is $5.00 and the Item Master Net Price is $5,000.00, the LLM must reject the match. It must infer a mismatch in scope (e.g., purchasing a single replacement part vs. a bulk container or entire machine assembly).
  3. Supplier & Brand Context (Using Vendor & Vendor Name):
    • Rule: If the PO specifies a highly specialized OEM vendor (e.g., "Cisco" or "Caterpillar"), the LLM must severely penalize or reject Item Master candidates where the Vendor Name is flagged as an off-brand generic or belongs to a conflicting proprietary vendor.
  4. Site & Category Awareness (Using Plnt & Product Category):
    • Rule: If available in the data, verify location constraints. If an Item Master Material is strictly designated for a specific Plnt, and the PO explicitly belongs to a different site (where cross-plant transfer is not applicable), the LLM must reject the assignment. It should also require explicit context within the Product Category before validating matching intent.
  5. Unit of Measure (UOM) Logic:
    • Rule: Align packaging constraints. If the PO lists a UOM of BOX (50) and the Item Master Material PO Text implies EA (Each), the LLM must mathematically validate if they represent the exact same physical material before approving the ID.
  6. Goods vs. Service Status Override:
    • Rule: The LLM must output a final truth value regarding whether the item is a tangible "Good" or a non-tangible "Service". It must compare this logic against the Phase-1 predicted Goods flag. If Phase 1 falsely flagged a service as a good (or vice versa), the LLM catches the context and outputs the correct final_goods_flag.

Few-Shot Prompting Examples

To maximize the number of assigned rows while maintaining accuracy, the system prompt will include Few-Shot Examples. These examples will train the LLM on how to weigh the available raw data evidence against the Item Master:

  1. Sufficient Evidence for Match: An example showing a messy PO row where the combination of matching Supplier, comparable Price, and core Item Text provides enough inferred evidence to confidently map to a Material ID.
  2. Insufficient Evidence / Conflict: An example where the text matches but the Supplier or Price grossly contradicts the Item Master, teaching the LLM to output a detailed rejection reason.
  3. Service/Non-Tangible Identification: An example showing how to evaluate a raw description (e.g., "installation of pump") and explicitly flag it as final_goods_flag: false (Service), generating an explanation anchored to these instructions.
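
The three patterns above could be injected as paired user/assistant turns ahead of the real query, the standard few-shot shape for chat models; the example contents below are abbreviated and illustrative:

```python
FEW_SHOTS = [
    # 1. Sufficient evidence: supplier, price, and core text all align.
    ("PO: 'brg assy 5mm, Acme Bearings, $4.95' | Candidate MAT-001: "
     "'Bearing assy 5 mm', Vendor 'Acme Bearings', Net Price 5.00",
     '{"material_id": "MAT-001", "llm_explanation": '
     '"Vendor, price, and description all corroborate."}'),
    # 2. Conflict: text matches but price grossly contradicts the catalog.
    ("PO: 'drill bit 1/2in, $5.00' | Candidate MAT-002: "
     "'Drill bit 12.7mm bulk case', Net Price 5000.00",
     '{"material_id": null, "llm_explanation": '
     '"1000x price gap implies a scope mismatch (single part vs bulk)."}'),
    # 3. Service identification: no Material ID is required.
    ("PO: 'installation of pump'",
     '{"material_id": null, "final_goods_flag": false, "llm_explanation": '
     '"Installation labor is a service, not a tangible good."}'),
]

def few_shot_messages() -> list[dict]:
    """Expand the example pairs into alternating user/assistant turns."""
    msgs = []
    for user, assistant in FEW_SHOTS:
        msgs.append({"role": "user", "content": user})
        msgs.append({"role": "assistant", "content": assistant})
    return msgs

print(few_shot_messages())
```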

Outcome: By injecting these 8 fields, 6 strict rules, and few-shot examples into the system prompt, Azure OpenAI acts as a protective yet highly adaptive data-governance layer, maximizing assignments when evidence is sufficient and refusing IDs when logistics, text, or finance present contradictions.

About

UCT POC1, where we clean short-text PO information to derive Material IDs, material groups, and standard nomenclature.
