added RFC on how to create a living knowledge base of owasp things #734
northdpole wants to merge 1 commit into main
Conversation
@northdpole I've gone through the RFC; it lays out a clear architectural and experimental framework to build the proposal around. I'll spend some time digesting it in detail and start aligning my work proposal with this design and the pre-code experiments outlined here.
Thanks for putting this together, Sir; the experimental framework is really clear. I'm particularly interested in Module C (The Librarian) and want to start with the suggested pre-code experiments before proposing any concrete design or implementation. The negation problem stands out: I've worked on gap-analysis features before (#716) and have seen how basic similarity metrics can struggle with logical inversions in requirements (e.g., "Use X" vs. "Do NOT use X"). Plan:
If the experiment is successful, I'm also interested in exploring hybrid search (vector + BM25), especially for cases like CVE identifiers where pure vector search often underperforms. I'll take this up step by step and share experiment results and observations before proposing any implementation. I'm using AI tools (similar to Cursor/Windsurf) and have read Section 3. Thank you.
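As a quick illustration of why token-overlap similarity struggles with logical inversions, here is a toy bag-of-words cosine similarity (a stand-in for any surface-level metric; the requirement sentences are invented for illustration) that scores a requirement and its negation as near-duplicates:

```python
import math
from collections import Counter

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity over simple bag-of-words term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

req = "use deprecated hashing algorithms for password storage"
neg = "do not use deprecated hashing algorithms for password storage"
# Opposite meanings, yet the score is high because token overlap is large.
print(round(cosine_bow(req, neg), 2))
```

The two sentences differ only by "do not", so the score lands well above 0.8 despite the inverted meaning; this is the failure mode a cross-encoder reranker is meant to catch.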
Hi @northdpole, thanks for putting together this RFC; the structure, pre-code experiments, and CI-first mindset are exactly the kind of system I enjoy working on. I'd like to formally express my interest in owning Module B (Noise / Relevance Filter) as my primary contribution, and I'm also happy to assist with adjacent modules where needed.
Why Module B: the framing of Module B as a cheap, high-signal gate before expensive downstream processing resonates strongly with me. Getting this layer right feels critical to the quality, cost, and trustworthiness of the entire pipeline, especially given the planned regression dataset and CI enforcement.
Proposed Plan of Action (Aligned with the RFC)
Cross-module contributions: while Module B would be my ownership area, I'm also glad to help elsewhere. I've read and understood Section 3 (Agent-Ready CI & AI-generated PR constraints) and I'm comfortable working within those boundaries. Looking forward to collaborating; this project feels like a rare opportunity to build something both technically rigorous and genuinely useful. Best,
@northdpole Module C update (pre‑code experiment complete) I ran the RFC‑required 50‑item ASVS experiment and also a 100‑item stability check to reduce variance (the negative subset is small, so a larger sample gives a more stable signal). Results (negative top‑1):
This passes the RFC success criterion (>20% improvement on negative requirements).
Design doc (pipeline + CI plan):
Hybrid search (BM25 + vector) is listed as a bonus; I have not implemented it yet and plan to explore it after the pre-code experiment and design are approved.
Next steps per RFC (please confirm):
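For reference, the negative top-1 metric used above can be computed with a small harness like this. All requirement IDs, CRE IDs, and rankings below are invented placeholders, not the actual experiment data:

```python
def top1_accuracy(ranked: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Fraction of queries whose top-ranked candidate equals the gold mapping."""
    hits = sum(1 for q, cands in ranked.items() if cands and cands[0] == gold[q])
    return hits / len(ranked)

# Placeholder rankings for four hypothetical negated requirements
gold = {"neg-1": "CRE-101", "neg-2": "CRE-202", "neg-3": "CRE-303", "neg-4": "CRE-404"}
baseline = {"neg-1": ["CRE-101"], "neg-2": ["CRE-999"],
            "neg-3": ["CRE-999"], "neg-4": ["CRE-999"]}
reranked = {"neg-1": ["CRE-101"], "neg-2": ["CRE-202"],
            "neg-3": ["CRE-303"], "neg-4": ["CRE-999"]}

b, r = top1_accuracy(baseline, gold), top1_accuracy(reranked, gold)
relative_gain = (r - b) / b  # RFC success criterion: > 0.20 on the negative subset
print(b, r, relative_gain)
```

Running the same harness over the 50- and 100-item samples is what makes the ">20% improvement" check reproducible in CI.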
Awesome, but it requires some redesigning, I think. Let's find out together.
Let's book time next week and work on this together for an hour. Slack me options please, if you're open.
Hey @robvanderveer Thanks for the detailed feedback. I updated the Module C design to align with your points. Key changes:
Updated design: Also happy to sync live for an hour next week; I will share timing on Slack.
Hi @northdpole, Experiment Results & Quality Metrics:
Shall I go ahead and write a detailed proposal for this?
Hi @northdpole, I've been thinking about a lightweight Noise / Relevance Filter (Module B). As your idea suggests, the first step is a cheap regex-based filter to discard obvious non-knowledge changes (formatting, lockfiles, minor docs), followed by a small LLM classifier that decides whether a commit actually adds meaningful security knowledge. The plan is to validate this with a benchmark on ~100 historical commits to measure precision before proposing full integration. Additionally, I'd like your thoughts on optionally adding a CodeRabbit AI layer to generate a structured diff summary before sending context to the LLM. Since CodeRabbit is free for open-source projects, it could provide higher-quality summaries and improve classification accuracy by giving the LLM better semantic context. Would you be open to this direction, or would you prefer a simpler initial baseline first?
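A minimal sketch of the regex stage might look like the following. The patterns are illustrative guesses at what counts as noise, not a vetted list, and the function name is invented:

```python
import re

# Illustrative patterns for changes that rarely carry new security knowledge
NOISE_PATTERNS = [
    re.compile(r"(^|/)(package-lock\.json|poetry\.lock|yarn\.lock|Cargo\.lock)$"),
    re.compile(r"\.(png|jpg|svg|ico)$"),       # binary assets
    re.compile(r"(^|/)\.github/"),             # CI / workflow config
]

def is_probable_noise(changed_files: list[str]) -> bool:
    """Cheap gate: True only when every changed file matches a noise pattern."""
    return bool(changed_files) and all(
        any(p.search(f) for p in NOISE_PATTERNS) for f in changed_files
    )

# Only commits that survive this gate would be sent to the LLM classifier.
```

Because the gate only has to be cheap and conservative, any commit touching even one non-noise file falls through to the (more expensive) LLM stage.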
Hey team @northdpole, @robvanderveer, and @Pa04rth 👋 Following up on our recent architectural discussions, I've spent the last 10 days deeply analyzing the end-to-end pipeline for Project OIE (#734). As conveyed to Spyros, since I have 6-7 months of extended bandwidth thanks to my internship term and lighter academic pressure, my goal for this GSoC period is to take ownership of building a complete, production-ready flow across the ecosystem, under the guidance of all my mentors. As Rob accurately stated: "We can unlock all of OWASP content as one resource in a structured way using the new technologies that have come available with AI." To ensure complete clarity and alignment before the proposal deadline, I have mapped out the architectural blueprints and tool stacks for the entire project. How the modules connect in one line:
I have broken down my blueprints into 4 detailed documents (with flow diagrams and tool selections):
🎯 1. System Goals & Architecture Flow: mapping the Functionality Promise and visualizing exactly how data flows from GitHub, through the three modules, to the Master Database.
📦 2. The Upstream Data Prep (Ingestion & Chunking): addressing Rob's feedback on ingestion and chunking.
🧠 3. Module C: The Librarian (Semantic Intelligence): focusing strictly on mapping, implementing Link-First authoritative overrides, and using my successful pre-code experiment (cross-encoders) to solve the "Negation Problem" with 100% accuracy (2 components explained).
📊 4. Module D: The Dashboard (Human-in-the-Loop): building a "Tinder-speed" review UI with keyboard bindings so maintainers can clear the <0.8 confidence-threshold queues in minutes, while logging rejections for future ML training (3 components explained).
I would love your feedback on these blueprints to ensure my final proposal hits the exact mark you envision for this living knowledge base!
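The confidence-threshold queue for Module D can be sketched like this. The `route` function, the field names, and the example mappings are assumptions for illustration; only the 0.8 threshold comes from the discussion above:

```python
def route(candidates: list[dict], threshold: float = 0.8):
    """Auto-accept high-confidence mappings; queue the rest for human review."""
    accepted = [c for c in candidates if c["score"] >= threshold]
    review = sorted((c for c in candidates if c["score"] < threshold),
                    key=lambda c: c["score"], reverse=True)  # best guesses first
    return accepted, review

accepted, review = route([{"id": "m1", "score": 0.95},
                          {"id": "m2", "score": 0.55},
                          {"id": "m3", "score": 0.72}])
print([c["id"] for c in accepted], [c["id"] for c in review])
```

Sorting the review queue by descending score keeps the near-misses at the top, which is what makes a fast keyboard-driven triage UI practical.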
@@ -0,0 +1,262 @@
# RFC: The OpenCRE Scraper & Indexer (Project OIE)
Change name to OWASP Agent. Position it as promise first: the why, not the how. So not: 'scraper and indexer'
Don't rely just on vectors. Use Hybrid Search (Vector + Keyword/BM25).
Why: Vectors are bad at exact keyword matches (e.g., specific CVE IDs).
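One common way to combine the two rankings is Reciprocal Rank Fusion. A minimal sketch with invented document IDs (this is not OpenCRE's actual retrieval API):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists from vector and keyword search."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-cve"]   # semantic neighbours
keyword_hits = ["doc-cve", "doc-a"]           # BM25 exact match on a CVE ID
fused = rrf([vector_hits, keyword_hits])
print(fused)
```

Here the keyword list pulls `doc-cve` (an exact CVE-ID match that vectors ranked last) above `doc-b`, which is exactly the behaviour the review comment asks for.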
### Module D: HITL & Logging
Please make the workflow clearer. Thanks.
Thank you @robvanderveer, that makes a lot of sense.
I’ll rename this to OWASP Agent and adjust the introduction to focus first on the problem and the promise it delivers, before going into the implementation details.
I’ll also rework the workflow section to make the end-to-end flow clearer and more explicit, especially around module responsibilities and how data moves between ingestion, hybrid retrieval, semantic reasoning, human validation, and the master database.
I’ll iterate on the document accordingly.
Hi @northdpole, I wanted to share a quick update on the Noise/Relevance Filter prototype. I’ve extracted 100 randomly sampled historical commits and manually labeled them (80 noise / 20 security knowledge) to create a gold benchmark dataset. I then implemented a batch-based LLM classifier (Gemini) with rate limiting and evaluated it against this dataset. Current results after prompt calibration:
I have significantly reduced false positives through stricter "new security concept" criteria, but there's still room to improve precision before proposing integration. I've temporarily paused experimentation due to API quota limits, but I'll continue refining the prompt and evaluation loop to push precision higher while keeping recall stable. Would you prefer prioritizing higher precision (fewer false positives) even at the cost of some recall? I'd also like your feedback on adding a CodeRabbit AI layer so the LLM gets a better understanding of the changes and the code base. This is the repo I have created, if you're interested.
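For reference, the precision and recall numbers against the gold benchmark can be computed with a few lines like these. The labels below are toy values, not the actual 80/20 benchmark data:

```python
def precision_recall(gold: list[int], pred: list[int]) -> tuple[float, float]:
    """Precision and recall for the positive class (1 = security knowledge, 0 = noise)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels only, to show the calculation
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall(gold, pred))
```

Tracking both numbers per prompt revision makes the precision-vs-recall trade-off in the question above concrete rather than anecdotal.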
No description provided.