
feat: add local research-gap analysis pipeline for arXiv papers#1

Draft
Copilot wants to merge 3 commits into main from copilot/add-research-gap-analysis-pipeline

Conversation


Copilot AI commented Apr 28, 2026

Adds a fully local, scriptable pipeline that ingests arXiv papers, extracts research gaps with evidence, verifies novelty against prior work, and produces a structured Markdown report — no UI required.

Pipeline stages

  • Ingestion & parsing (parsing.py) — downloads PDF via --arxiv-id, accepts --pdf / --input-dir; submits to GROBID (Docker) → TEI XML; extracts title, abstract, all headings, and 8 canonical key sections into context_pack.json
  • Gap extraction (gaps.py) — two modes:
    • --no-llm: regex heuristics (~20 patterns: limitation, future work, remains, unexplored, we leave for, etc.) with section + verbatim quote in evidence
    • default (LLM): OpenAI chat completion with strict JSON schema; validates all evidence quotes are verbatim substrings, drops/repairs hallucinations
  • Prior-work verification (prior_work/) — queries both OpenAlex and Semantic Scholar per gap, deduplicates, ranks candidates by cosine similarity using all-MiniLM-L6-v2 locally; assigns risk labels (low / medium / high) → novelty_report.json
  • Reporting (reporting.py) — produces report.md with heading tree, gaps + evidence quotes, proposed non-incremental directions, and novelty table with links
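The heuristic (--no-llm) stage above can be sketched roughly as follows. This is an illustrative subset of the cue phrases and a naive sentence splitter, not the actual implementation in gaps.py:

```python
import re

# A few of the cue phrases the heuristic mode scans for (the PR mentions
# ~20 patterns; only a handful are reproduced here for illustration).
GAP_PATTERNS = re.compile(
    r"\b(limitations?|future work|remains|unexplored|we leave for|open problem)\b",
    re.IGNORECASE,
)

def extract_gap_sentences(section_name: str, text: str) -> list[dict]:
    """Return candidate gap sentences as {section, quote} evidence records."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        {"section": section_name, "quote": s.strip()}
        for s in sentences
        if GAP_PATTERNS.search(s)
    ]
```

Each hit keeps the section name plus the verbatim sentence, matching the evidence shape described above.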

Module structure

research_gap/
├── __main__.py          # CLI: research-gap / python -m research_gap
├── parsing.py
├── gaps.py
├── reporting.py
└── prior_work/
    ├── openalex.py
    ├── semantic_scholar.py
    └── embeddings.py
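The LLM-mode evidence check (quotes must be verbatim substrings of the parsed text) can be approximated with a filter like this hypothetical helper; the PR also mentions repairing quotes, which is not sketched here:

```python
def validate_evidence(gaps: list[dict], source_text: str) -> list[dict]:
    """Drop evidence quotes that are not verbatim substrings of the source,
    and drop any gap whose evidence list becomes empty (a plain 'drop'
    policy; repair of near-miss quotes is out of scope for this sketch)."""
    validated = []
    for gap in gaps:
        kept = [ev for ev in gap.get("evidence", []) if ev["quote"] in source_text]
        if kept:
            validated.append({**gap, "evidence": kept})
    return validated
```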

Usage

# LLM mode (requires OPENAI_API_KEY)
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1

# Heuristic mode, no API key needed
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1 --no-llm

# Local PDF, batch
research-gap --input-dir papers/ --out-dir out/batch

Outputs per paper

File                  Contents
paper.tei.xml         GROBID output
context_pack.json     Parsed sections
gaps.json             Gaps with evidence, directions, search queries
novelty_report.json   Ranked prior work + similarity + risk
report.md             Human-readable summary
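The risk label in novelty_report.json can be derived from the best embedding similarity against retrieved prior work. A minimal dependency-free sketch; the thresholds here are assumptions for illustration, not values taken from the PR, and in the pipeline the vectors would come from all-MiniLM-L6-v2:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def risk_label(idea_vec, prior_vecs, hi=0.75, lo=0.45):
    """Map the best cosine match against prior-work abstracts to a label:
    a close match means the idea is likely not novel (high risk)."""
    if not prior_vecs:
        return "low"
    best = max(cosine(idea_vec, v) for v in prior_vecs)
    return "high" if best >= hi else "medium" if best >= lo else "low"
```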

Other

  • pyproject.toml with research-gap console script entrypoint; pip install -e ".[dev]" installs everything
  • .env.example for OPENAI_API_KEY, SEMANTIC_SCHOLAR_API_KEY, OPENALEX_EMAIL
  • 53 pytest tests (TEI parsing, heuristic extraction, API clients with mocked HTTP, report generation) — no network required
Original prompt

Repository: automation-workflows/Research-gap (default branch: main)

Create a pull request that adds a local, scriptable research-gap analysis pipeline for arXiv papers.

Primary capabilities

  1. Ingest arXiv papers (prefer PDF). Support inputs:
    • --arxiv-id (downloads PDF from arXiv)
    • --pdf (single pdf path)
    • --input-dir (batch)
  2. Parse paper structure with headings/sections for context.
    • Prefer GROBID (Docker) to convert PDF -> TEI XML.
    • Provide helper script or documented command to run GROBID locally.
    • Parse TEI to produce context_pack.json with:
      • all section headings
      • selected key sections when present: Abstract, Introduction, Related Work, Experiments, Discussion, Limitations, Conclusion, Future Work
  3. Gap extraction with evidence.
    • Heuristic mode (--no-llm): extract candidate gap sentences containing patterns like: limitation(s), future work, remains, unexplored, invite, we leave for, open problem, etc. Include section name and exact sentence.
    • LLM mode (default): take context pack and output strict JSON gaps.json schema:
      • gaps[]: { gap, evidence[{section, quote}] (>=2), why_it_matters, non_incremental_directions[{direction, axis_of_difference}], prior_work_search_queries[] }
    • Enforce that quotes are verbatim substrings from extracted text; if not, drop/repair.
  4. Prior-work verification using both OpenAlex and Semantic Scholar.
    • For each gap/direction, run searches on OpenAlex and Semantic Scholar (S2 API key optional).
    • Collect top K candidate papers with title/authors/year/venue/abstract/url/citation counts when available.
    • Compute similarity between idea text and abstracts using a local embedding model (e.g., sentence-transformers all-MiniLM-L6-v2).
    • Produce novelty_report.json with ranked nearest prior work and a risk label (e.g., low/medium/high).
  5. Reporting
    • Generate report.md summarizing:
      • extracted headings
      • gaps with evidence quotes
      • proposed non-incremental directions
      • novelty check results with links
    • Save all intermediate JSON artifacts.
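An illustrative gaps.json entry matching the schema in capability 3; every field value below is invented for demonstration:

```python
import json

# Hypothetical example record; field names follow the prompt's schema,
# field values are made up.
example_gap = {
    "gap": "No evaluation on low-resource languages",
    "evidence": [
        {"section": "Limitations", "quote": "We evaluate only on English benchmarks."},
        {"section": "Conclusion", "quote": "Extending to other languages is left to future work."},
    ],
    "why_it_matters": "Claims of generality are untested outside English.",
    "non_incremental_directions": [
        {"direction": "Zero-shot cross-lingual transfer study",
         "axis_of_difference": "language coverage rather than model scale"},
    ],
    "prior_work_search_queries": ["low-resource language benchmark evaluation"],
}

print(json.dumps({"gaps": [example_gap]}, indent=2))
```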

Engineering requirements

  • Implement in Python with pyproject.toml (packaging) and a module runnable via python -m research_gap ....
  • Provide a console script entrypoint research-gap (or similar).
  • Modular structure: research_gap/parsing.py, research_gap/gaps.py, research_gap/prior_work/openalex.py, research_gap/prior_work/semantic_scholar.py, research_gap/reporting.py, etc.
  • Provide .env.example for:
    • OPENAI_API_KEY (LLM mode)
    • SEMANTIC_SCHOLAR_API_KEY (optional)
    • OPENALEX_EMAIL (recommended polite pool)
  • Add minimal tests (pytest) for TEI parsing and API client query construction (mock HTTP).
  • Update README with:
    • Setup (venv, install)
    • Start GROBID Docker
    • Example run: research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1
    • Example run with --no-llm
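The console-script requirement typically reduces to a pyproject.toml stanza like the following; the main function name and location are assumptions, not taken from the PR:

```toml
[project.scripts]
research-gap = "research_gap.__main__:main"
```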

Constraints / notes

  • Must run locally.
  • HTML parsing from arXiv is optional; focus on PDF+GROBID first.
  • Avoid fabricating citations; quotes must come from extracted text.

Create the PR with a clear title and include usage instructions in README.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: is there a tool that scans arxiv papers and extracts future work sections

Assistant: Yes—there are a few practical options, depending on whether you want an off-the-shelf service, a library you can script, or an LLM-based extractor.

1) Best “ready-made” option: Elicit

  • Elicit can search papers (including arXiv) and produce structured summaries; it often captures “limitations” / “future work” style takeaways as part of its extraction workflows.
  • Good if you want this without building a pipeline.

2) Best DIY “scanner” (programmatic): GROBID + rules

If you specifically want to extract the “Future Work” section, the robust approach is:

  1. Download the PDF from arXiv
  2. Convert to structured XML/TEI with GROBID
  3. Identify the section whose heading matches patterns like:
    • Future Work, Future Directions, Limitations and Future Work, Discussion and Future Work, etc.
  4. Extract the section body

This works well at scale and doesn’t require an LLM, but headings vary and some papers don’t label it explicitly.
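Step 3 above (matching section headings) can be sketched as a regex over the TEI heading strings; the pattern list is illustrative and would need tuning for real corpora:

```python
import re

# Heading patterns for "future work"-style sections (illustrative).
FUTURE_WORK_HEADING = re.compile(
    r"future\s+(work|directions?)"
    r"|limitations(\s+and\s+future\s+work)?"
    r"|discussion\s+and\s+future\s+work",
    re.IGNORECASE,
)

def is_future_work_heading(heading: str) -> bool:
    """True if a section heading looks like a future-work section."""
    return bool(FUTURE_WORK_HEADING.search(heading))
```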

3) Best if you want “future work” even when it’s not a labeled section: LLM extraction

Pipeline:

  • PDF → text (GROBID or another parser)
  • Prompt an LLM to extract bullet-point “future work” statements (including those embedded in Discussion/Conclusion)

This is usually higher recall than heading-based extraction, but costs more and needs quality control.

4) Other useful building blocks

  • arXiv API: fetch metadata + PDF links in bulk (you still need parsing).
  • PDF-to-text tools: pdftotext, science-parse, unstructured, etc. (quality varies; GROBID is the common choice for scholarly PDFs).
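For the arXiv building block, PDF retrieval is just URL construction plus an HTTP GET; a minimal sketch of the URL part (no network call shown):

```python
def arxiv_pdf_url(arxiv_id: str) -> str:
    """Canonical PDF URL for an arXiv identifier,
    e.g. '2301.00001v1' -> 'https://arxiv.org/pdf/2301.00001v1'."""
    return f"https://arxiv.org/pdf/{arxiv_id}"

def arxiv_abs_url(arxiv_id: str) -> str:
    """Abstract page URL for the same identifier."""
    return f"https://arxiv.org/abs/{arxiv_id}"
```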

Qui...

This pull request was created from Copilot chat.

Copilot AI changed the title from “[WIP] Add local scriptable research-gap analysis pipeline for arXiv papers” to “feat: add local research-gap analysis pipeline for arXiv papers” on Apr 28, 2026
Copilot AI requested a review from kpj2006 April 28, 2026 06:01