feat: add local research-gap analysis pipeline for arXiv papers #1
Draft
Copilot wants to merge 3 commits into
Agent-Logs-Url: https://github.com/automation-workflows/Research-gap/sessions/1e290af4-e833-4d6c-942c-6f5e5711b8e0 Co-authored-by: kpj2006 <187440630+kpj2006@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add local scriptable research-gap analysis pipeline for arXiv papers" to "feat: add local research-gap analysis pipeline for arXiv papers" on Apr 28, 2026
Adds a fully local, scriptable pipeline that ingests arXiv papers, extracts research gaps with evidence, verifies novelty against prior work, and produces a structured Markdown report — no UI required.
Pipeline stages
- `parsing.py`: downloads the PDF via `--arxiv-id`, accepts `--pdf` / `--input-dir`; submits to GROBID (Docker) → TEI XML; extracts title, abstract, all headings, and 8 canonical key sections into `context_pack.json`
- `gaps.py`: two modes:
  - `--no-llm`: regex heuristics (~20 patterns: limitation, future work, remains, unexplored, we leave for, etc.) with section + verbatim quote in evidence
- `prior_work/`: queries both OpenAlex and Semantic Scholar per gap, deduplicates, ranks candidates by cosine similarity using `all-MiniLM-L6-v2` locally; assigns risk labels (low / medium / high) → `novelty_report.json`
- `reporting.py`: produces `report.md` with heading tree, gaps + evidence quotes, proposed non-incremental directions, and novelty table with links

Module structure
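As an illustration of the `--no-llm` mode listed under Pipeline stages, here is a minimal sketch of regex-based gap detection. It is not the code from this PR: the pattern list is a small hypothetical subset built from the phrases named above, and `find_gap_sentences`, `GAP_PATTERNS`, and the naive sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical subset of patterns; the PR mentions ~20 but names only a few.
GAP_PATTERNS = [
    r"\blimitations?\b",
    r"\bfuture work\b",
    r"\bremains?\b",
    r"\bunexplored\b",
    r"\bwe leave\b",
    r"\bopen problems?\b",
]
GAP_RE = re.compile("|".join(GAP_PATTERNS), re.IGNORECASE)

# Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def find_gap_sentences(sections: dict[str, str]) -> list[dict[str, str]]:
    """Return one {section, quote} record per sentence matching any gap pattern."""
    hits = []
    for section, text in sections.items():
        for sentence in SENTENCE_SPLIT.split(text.strip()):
            if sentence and GAP_RE.search(sentence):
                # Keep the sentence verbatim so it can serve as an evidence quote.
                hits.append({"section": section, "quote": sentence})
    return hits
```

Each record carries the section name and the verbatim sentence, which matches the evidence shape (`section`, `quote`) described for the gap output. A real splitter would need to handle abbreviations and citations, which this one does not.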
Usage
Outputs per paper
- `paper.tei.xml`
- `context_pack.json`
- `gaps.json`
- `novelty_report.json`
- `report.md`

Other
- `pyproject.toml` with `research-gap` console script entrypoint; `pip install -e ".[dev]"` installs everything
- `.env.example` for `OPENAI_API_KEY`, `SEMANTIC_SCHOLAR_API_KEY`, `OPENALEX_EMAIL`

Original prompt
Repository: automation-workflows/Research-gap (default branch: main)
Create a pull request that adds a local, scriptable research-gap analysis pipeline for arXiv papers.
Primary capabilities
- `--arxiv-id` (downloads PDF from arXiv)
- `--pdf` (single PDF path)
- `--input-dir` (batch)
- `context_pack.json` with:
- `--no-llm`: extract candidate gap sentences containing patterns like: limitation(s), future work, remains, unexplored, invite, we leave for, open problem, etc. Include section name and exact sentence.
- `gaps.json` schema: `gaps[]: { gap, evidence[{section, quote}] (>=2), why_it_matters, non_incremental_directions[{direction, axis_of_difference}], prior_work_search_queries[] }`
- `all-MiniLM-L6-v2`
- `novelty_report.json` with ranked nearest prior work and a risk label (e.g., low/medium/high).
- `report.md` summarizing:

Engineering requirements
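As an illustration of the prior-work ranking requested in the capabilities above (ranked nearest prior work plus a risk label), here is a minimal, dependency-free sketch. It uses plain Python lists in place of real `all-MiniLM-L6-v2` embeddings. The function names, the `0.8` / `0.6` thresholds, and the reading that higher similarity means higher risk that the gap is already covered are all assumptions, not taken from the PR.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def risk_label(top_similarity: float, high: float = 0.8, medium: float = 0.6) -> str:
    """Map the best similarity score to a label. Thresholds are hypothetical."""
    if top_similarity >= high:
        return "high"
    if top_similarity >= medium:
        return "medium"
    return "low"


def rank_prior_work(gap_vec: list[float], candidates: list[tuple[str, list[float]]]) -> dict:
    """Rank (title, vector) candidates by similarity to the gap vector."""
    scored = [
        {"title": title, "similarity": cosine_similarity(gap_vec, vec)}
        for title, vec in candidates
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    top = scored[0]["similarity"] if scored else 0.0
    return {"ranked": scored, "risk": risk_label(top)}
```

In a real run the vectors would come from an embedding model applied to the gap text and to each candidate's title or abstract; only the ranking and labelling logic is sketched here.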
- `pyproject.toml` (packaging) and a module runnable via `python -m research_gap ...`.
- `research-gap` (or similar).
- `research_gap/parsing.py`, `research_gap/gaps.py`, `research_gap/prior_work/openalex.py`, `research_gap/prior_work/semantic_scholar.py`, `research_gap/reporting.py`, etc.
- `.env.example` for:
  - `OPENAI_API_KEY` (LLM mode)
  - `SEMANTIC_SCHOLAR_API_KEY` (optional)
  - `OPENALEX_EMAIL` (recommended polite pool)
- `research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1`
- `--no-llm`

Constraints / notes
Create the PR with a clear title and include usage instructions in README.
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: is there a tool that scans arxiv papers and extracts future work sections
Assistant: Yes—there are a few practical options, depending on whether you want an off-the-shelf service, a library you can script, or an LLM-based extractor.
1) Best “ready-made” option: Elicit
2) Best DIY “scanner” (programmatic): GROBID + rules
If you specifically want to extract the “Future Work” section, the robust approach is:
`Future Work`, `Future Directions`, `Limitations and Future Work`, `Discussion and Future Work`, etc.
This works well at scale and doesn’t require an LLM, but headings vary and some papers don’t label it explicitly.
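The heading-based approach can be sketched as follows, assuming the PDF has already been converted to TEI XML in which each section is a `<div>` with a `<head>` child and `<p>` paragraphs (the layout GROBID emits in the TEI namespace). The function name `extract_future_work` and the heading regex are illustrative assumptions, not part of any existing tool.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Matches "Future Work", "Future Directions", and compound headings containing them.
HEADING_RE = re.compile(r"future\s+(work|directions)", re.IGNORECASE)


def extract_future_work(tei_xml: str) -> list[dict[str, str]]:
    """Return the heading and paragraph text of every section whose heading matches."""
    root = ET.fromstring(tei_xml)
    results = []
    for div in root.iter(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        if head is None:
            continue
        heading = "".join(head.itertext()).strip()
        if HEADING_RE.search(heading):
            # Collect only direct <p> children of the matching section.
            paragraphs = ["".join(p.itertext()).strip() for p in div.findall(f"{TEI}p")]
            results.append({"heading": heading, "text": "\n".join(paragraphs)})
    return results
```

Using `search` rather than an exact match lets numbered or compound headings such as "Limitations and Future Work" pass, but papers that discuss future work under an unrelated heading are still missed, which is the limitation noted above.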
3) Best if you want “future work” even when it’s not a labeled section: LLM extraction
Pipeline:
This is usually higher recall than heading-based extraction, but costs more and needs quality control.
4) Other useful building blocks
`pdftotext`, `science-parse`, `unstructured`, etc. (quality varies; GROBID is the common choice for scholarly PDFs).
Qui...
This pull request was created from Copilot chat.