
feat: add local research-gap analysis pipeline for arXiv papers#1

Draft
Copilot wants to merge 3 commits into main from copilot/add-research-gap-analysis-pipeline

Conversation


Copilot AI commented Apr 28, 2026

Adds a fully local, scriptable pipeline that ingests arXiv papers, extracts research gaps with evidence, verifies novelty against prior work, and produces a structured Markdown report — no UI required.

Pipeline stages

  • Ingestion & parsing (parsing.py) — downloads PDF via --arxiv-id, accepts --pdf / --input-dir; submits to GROBID (Docker) → TEI XML; extracts title, abstract, all headings, and 8 canonical key sections into context_pack.json
  • Gap extraction (gaps.py) — two modes:
    • --no-llm: regex heuristics (~20 patterns: limitation, future work, remains, unexplored, we leave for, etc.) with section + verbatim quote in evidence
    • default (LLM): OpenAI chat completion with strict JSON schema; validates all evidence quotes are verbatim substrings, drops/repairs hallucinations
  • Prior-work verification (prior_work/) — queries both OpenAlex and Semantic Scholar per gap, deduplicates, ranks candidates by cosine similarity using all-MiniLM-L6-v2 locally; assigns risk labels (low / medium / high) → novelty_report.json
  • Reporting (reporting.py) — produces report.md with heading tree, gaps + evidence quotes, proposed non-incremental directions, and novelty table with links
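The heuristic (--no-llm) stage above can be sketched roughly as follows. This is an illustrative subset of the cue phrases and a naive sentence splitter, not the actual implementation in gaps.py:

```python
import re

# A few of the cue phrases the heuristic mode scans for (the PR mentions
# ~20 patterns; only a handful are reproduced here for illustration).
GAP_PATTERNS = re.compile(
    r"\b(limitations?|future work|remains|unexplored|we leave for|open problem)\b",
    re.IGNORECASE,
)

def extract_gap_sentences(section_name: str, text: str) -> list[dict]:
    """Return candidate gap sentences as {section, quote} evidence records."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        {"section": section_name, "quote": s.strip()}
        for s in sentences
        if GAP_PATTERNS.search(s)
    ]
```

Each hit keeps the section name plus the verbatim sentence, matching the evidence shape described above.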

Module structure

research_gap/
├── __main__.py          # CLI: research-gap / python -m research_gap
├── parsing.py
├── gaps.py
├── reporting.py
└── prior_work/
    ├── openalex.py
    ├── semantic_scholar.py
    └── embeddings.py
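The LLM-mode evidence check (quotes must be verbatim substrings of the parsed text) can be approximated with a filter like this hypothetical helper; the PR also mentions repairing quotes, which is not sketched here:

```python
def validate_evidence(gaps: list[dict], source_text: str) -> list[dict]:
    """Drop evidence quotes that are not verbatim substrings of the source,
    and drop any gap whose evidence list becomes empty (a plain 'drop'
    policy; repair of near-miss quotes is out of scope for this sketch)."""
    validated = []
    for gap in gaps:
        kept = [ev for ev in gap.get("evidence", []) if ev["quote"] in source_text]
        if kept:
            validated.append({**gap, "evidence": kept})
    return validated
```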

Usage

# LLM mode (requires OPENAI_API_KEY)
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1

# Heuristic mode, no API key needed
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1 --no-llm

# Local PDF, batch
research-gap --input-dir papers/ --out-dir out/batch

Outputs per paper

File                  Contents
paper.tei.xml         GROBID output
context_pack.json     Parsed sections
gaps.json             Gaps with evidence, directions, search queries
novelty_report.json   Ranked prior work + similarity + risk
report.md             Human-readable summary
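The risk label in novelty_report.json can be derived from the best embedding similarity against retrieved prior work. A minimal dependency-free sketch; the thresholds here are assumptions for illustration, not values taken from the PR, and in the pipeline the vectors would come from all-MiniLM-L6-v2:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def risk_label(idea_vec, prior_vecs, hi=0.75, lo=0.45):
    """Map the best cosine match against prior-work abstracts to a label:
    a close match means the idea is likely not novel (high risk)."""
    if not prior_vecs:
        return "low"
    best = max(cosine(idea_vec, v) for v in prior_vecs)
    return "high" if best >= hi else "medium" if best >= lo else "low"
```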

Other

  • pyproject.toml with research-gap console script entrypoint; pip install -e ".[dev]" installs everything
  • .env.example for OPENAI_API_KEY, SEMANTIC_SCHOLAR_API_KEY, OPENALEX_EMAIL
  • 53 pytest tests (TEI parsing, heuristic extraction, API clients with mocked HTTP, report generation) — no network required
Original prompt

Repository: automation-workflows/Research-gap (default branch: main)

Create a pull request that adds a local, scriptable research-gap analysis pipeline for arXiv papers.

Primary capabilities

  1. Ingest arXiv papers (prefer PDF). Support inputs:
    • --arxiv-id (downloads PDF from arXiv)
    • --pdf (single pdf path)
    • --input-dir (batch)
  2. Parse paper structure with headings/sections for context.
    • Prefer GROBID (Docker) to convert PDF -> TEI XML.
    • Provide helper script or documented command to run GROBID locally.
    • Parse TEI to produce context_pack.json with:
      • all section headings
      • selected key sections when present: Abstract, Introduction, Related Work, Experiments, Discussion, Limitations, Conclusion, Future Work
  3. Gap extraction with evidence.
    • Heuristic mode (--no-llm): extract candidate gap sentences containing patterns like: limitation(s), future work, remains, unexplored, invite, we leave for, open problem, etc. Include section name and exact sentence.
    • LLM mode (default): take context pack and output strict JSON gaps.json schema:
      • gaps[]: { gap, evidence[{section, quote}] (>=2), why_it_matters, non_incremental_directions[{direction, axis_of_difference}], prior_work_search_queries[] }
    • Enforce that quotes are verbatim substrings from extracted text; if not, drop/repair.
  4. Prior-work verification using both OpenAlex and Semantic Scholar.
    • For each gap/direction, run searches on OpenAlex and Semantic Scholar (S2 API key optional).
    • Collect top K candidate papers with title/authors/year/venue/abstract/url/citation counts when available.
    • Compute similarity between idea text and abstracts using a local embedding model (e.g., sentence-transformers all-MiniLM-L6-v2).
    • Produce novelty_report.json with ranked nearest prior work and a risk label (e.g., low/medium/high).
  5. Reporting
    • Generate report.md summarizing:
      • extracted headings
      • gaps with evidence quotes
      • proposed non-incremental directions
      • novelty check results with links
    • Save all intermediate JSON artifacts.
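An illustrative gaps.json entry matching the schema in capability 3; every field value below is invented for demonstration:

```python
import json

# Hypothetical example record; field names follow the prompt's schema,
# field values are made up.
example_gap = {
    "gap": "No evaluation on low-resource languages",
    "evidence": [
        {"section": "Limitations", "quote": "We evaluate only on English benchmarks."},
        {"section": "Conclusion", "quote": "Extending to other languages is left to future work."},
    ],
    "why_it_matters": "Claims of generality are untested outside English.",
    "non_incremental_directions": [
        {"direction": "Zero-shot cross-lingual transfer study",
         "axis_of_difference": "language coverage rather than model scale"},
    ],
    "prior_work_search_queries": ["low-resource language benchmark evaluation"],
}

print(json.dumps({"gaps": [example_gap]}, indent=2))
```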

Engineering requirements

  • Implement in Python with pyproject.toml (packaging) and a module runnable via python -m research_gap ....
  • Provide a console script entrypoint research-gap (or similar).
  • Modular structure: research_gap/parsing.py, research_gap/gaps.py, research_gap/prior_work/openalex.py, research_gap/prior_work/semantic_scholar.py, research_gap/reporting.py, etc.
  • Provide .env.example for:
    • OPENAI_API_KEY (LLM mode)
    • SEMANTIC_SCHOLAR_API_KEY (optional)
    • OPENALEX_EMAIL (recommended polite pool)
  • Add minimal tests (pytest) for TEI parsing and API client query construction (mock HTTP).
  • Update README with:
    • Setup (venv, install)
    • Start GROBID Docker
    • Example run: research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1
    • Example run with --no-llm
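The console-script requirement typically reduces to a pyproject.toml stanza like the following; the main function name and location are assumptions, not taken from the PR:

```toml
[project.scripts]
research-gap = "research_gap.__main__:main"
```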

Constraints / notes

  • Must run locally.
  • HTML parsing from arXiv is optional; focus on PDF+GROBID first.
  • Avoid fabricating citations; quotes must come from extracted text.

Create the PR with a clear title and include usage instructions in README.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: is there a tool that scans arxiv papers and extracts future work sections

Assistant: Yes—there are a few practical options, depending on whether you want an off-the-shelf service, a library you can script, or an LLM-based extractor.

1) Best “ready-made” option: Elicit

  • Elicit can search papers (including arXiv) and produce structured summaries; it often captures “limitations” / “future work” style takeaways as part of its extraction workflows.
  • Good if you want this without building a pipeline.

2) Best DIY “scanner” (programmatic): GROBID + rules

If you specifically want to extract the “Future Work” section, the robust approach is:

  1. Download the PDF from arXiv
  2. Convert to structured XML/TEI with GROBID
  3. Identify the section whose heading matches patterns like:
    • Future Work, Future Directions, Limitations and Future Work, Discussion and Future Work, etc.
  4. Extract the section body

This works well at scale and doesn’t require an LLM, but headings vary and some papers don’t label it explicitly.
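Step 3 above (matching section headings) can be sketched as a regex over the TEI heading strings; the pattern list is illustrative and would need tuning for real corpora:

```python
import re

# Heading patterns for "future work"-style sections (illustrative).
FUTURE_WORK_HEADING = re.compile(
    r"future\s+(work|directions?)"
    r"|limitations(\s+and\s+future\s+work)?"
    r"|discussion\s+and\s+future\s+work",
    re.IGNORECASE,
)

def is_future_work_heading(heading: str) -> bool:
    """True if a section heading looks like a future-work section."""
    return bool(FUTURE_WORK_HEADING.search(heading))
```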

3) Best if you want “future work” even when it’s not a labeled section: LLM extraction

Pipeline:

  • PDF → text (GROBID or another parser)
  • Prompt an LLM to extract bullet-point “future work” statements (including those embedded in Discussion/Conclusion)

This is usually higher recall than heading-based extraction, but costs more and needs quality control.

4) Other useful building blocks

  • arXiv API: fetch metadata + PDF links in bulk (you still need parsing).
  • PDF-to-text tools: pdftotext, science-parse, unstructured, etc. (quality varies; GROBID is the common choice for scholarly PDFs).
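For the arXiv building block, PDF retrieval is just URL construction plus an HTTP GET; a minimal sketch of the URL part (no network call shown):

```python
def arxiv_pdf_url(arxiv_id: str) -> str:
    """Canonical PDF URL for an arXiv identifier,
    e.g. '2301.00001v1' -> 'https://arxiv.org/pdf/2301.00001v1'."""
    return f"https://arxiv.org/pdf/{arxiv_id}"

def arxiv_abs_url(arxiv_id: str) -> str:
    """Abstract page URL for the same identifier."""
    return f"https://arxiv.org/abs/{arxiv_id}"
```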

Qui...

This pull request was created from Copilot chat.

Copilot AI changed the title from “[WIP] Add local scriptable research-gap analysis pipeline for arXiv papers” to “feat: add local research-gap analysis pipeline for arXiv papers” on Apr 28, 2026
Copilot AI requested a review from kpj2006 April 28, 2026 06:01