Merged
36 changes: 26 additions & 10 deletions CHANGELOG.md
@@ -1,7 +1,4 @@
## [Unreleased]

### Added

# Unreleased
- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:

- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
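The range-based tolerance rule described in this entry can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `lookup_tolerance` and `values_match` are hypothetical, and treating range bounds as inclusive (with the first matching range winning) is an assumption the changelog does not confirm.

```python
from typing import Dict, Optional, Tuple

# Maps (min, max) ranges of the ground-truth value to an absolute tolerance.
Thresholds = Dict[Tuple[float, float], float]


def lookup_tolerance(ground_truth: float, thresholds: Thresholds) -> Optional[float]:
    """Return the tolerance of the first range containing ground_truth, else None.

    Assumption: both range bounds are inclusive.
    """
    for (low, high), tolerance in thresholds.items():
        if low <= ground_truth <= high:
            return tolerance
    return None


def values_match(extracted: float, ground_truth: float, thresholds: Thresholds) -> bool:
    """Accept extracted if |extracted - ground_truth| <= tolerance for the matching range.

    Values outside every configured range fall back to exact comparison.
    """
    tolerance = lookup_tolerance(ground_truth, thresholds)
    if tolerance is None:
        return extracted == ground_truth
    return abs(extracted - ground_truth) <= tolerance


# Hypothetical configuration: small values tolerate +/-1, larger ones +/-10.
example_thresholds: Thresholds = {(0, 100): 1.0, (100, 1000): 10.0}
```

With `example_thresholds`, a ground truth of 50.0 accepts an extracted 50.5 but rejects 52.0, while a ground truth of 2000.0 lies outside every range and therefore requires an exact match.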
@@ -26,7 +23,21 @@

- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
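The routing described in this fix can be pictured as a grouping step. The sketch below is illustrative only: `group_dois_by_publisher` is a hypothetical helper, the metadata is assumed to be a plain mapping from DOI to its `general_publisher` value, and collecting unmatched DOIs under an `"unknown"` key is an assumption rather than documented behaviour.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_dois_by_publisher(
    doi_list: Iterable[str], metadata: Mapping[str, str]
) -> Dict[str, List[str]]:
    """Group DOIs by publisher so each one reaches only its matching processor.

    metadata is assumed to map DOI -> general_publisher (hypothetical shape).
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doi in doi_list:
        # DOIs missing from the metadata are kept apart instead of being
        # sent to every processor (assumed handling).
        publisher = metadata.get(doi, "unknown")
        grouped[publisher].append(doi)
    return dict(grouped)


# Made-up DOIs and publisher labels, for illustration only.
example_metadata = {
    "10.1016/example.a": "elsevier",
    "10.1002/example.b": "wiley",
    "10.1016/example.c": "elsevier",
}
example_groups = group_dois_by_publisher(
    ["10.1016/example.a", "10.1002/example.b", "10.1016/example.c", "10.9999/example.x"],
    example_metadata,
)
```

Each key of `example_groups` would then be handed to the processor for that publisher, so no DOI is processed by more than one source.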

## [0.1.5] - 08-02-2026
---
## [0.1.6] - 2026-04-02
### Changed
- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)

⚠️ Potential issue | 🟡 Minor

Hyphenate compound adjective in paper title text

Line 30 should use “multi-agent-based” for correct grammar/readability.

Proposed text fix
-- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C) 
+- [ComProScanner: a multi-agent-based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C) 
🧰 Tools
🪛 LanguageTool

[grammar] ~30-~30: Use a hyphen to join words.
Context: ...ccess: - [ComProScanner: a multi-agent based framework for composition-property...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` at line 30, Replace the unhyphenated phrase in the paper title
entry "[ComProScanner: a multi-agent based framework for composition-property
structured data extraction from scientific
literature](https://doi.org/10.1039/D5DD00521C)" by changing "multi-agent based"
to "multi-agent-based" so the title reads "ComProScanner: a multi-agent-based
framework for composition-property structured data extraction from scientific
literature".


### Added
- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.

### Fixed
- Model prefix handling in `rag_tool.py` standardized to reflect the docs.
- `HF_TOKEN` documentation clarified as optional — only required for gated or private Hugging Face models.

---
## [0.1.5] - 2026-02-08

### Added
- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
@@ -97,7 +108,8 @@

- README badges section converted from HTML to markdown format for better compatibility across platforms.

## [0.1.4] - 02-12-2025
---
## [0.1.4] - 2025-12-02

### Added

@@ -132,30 +144,34 @@
- [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
- [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)

## [0.1.3] - 04-11-2025
---
## [0.1.3] - 2025-11-04

### Fixed

- **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
- Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
- To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
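Import paths that move between library releases, as in this fix, are sometimes handled with a fallback import. The helper below is a generic sketch using only the standard library; `import_first_available` is a hypothetical name and is not part of ComProScanner or langchain. Whether either langchain path resolves depends on the installed version, so that call is shown only as a comment.

```python
import importlib
from typing import Any, Iterable


def import_first_available(module_paths: Iterable[str], attribute: str) -> Any:
    """Return attribute from the first module path that imports and defines it."""
    errors = []
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attribute)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"Could not import {attribute!r}; tried: " + "; ".join(errors))


# Demonstration with a standard-library module so the sketch runs anywhere.
ordered_dict_cls = import_first_available(
    ["not_a_real_module_xyz", "collections"], "OrderedDict"
)

# Applied to the changelog entry, the call might look like this
# (module paths copied from the entry above, not verified here):
# RecursiveCharacterTextSplitter = import_first_available(
#     ["langchain.text_splitter.recursive_character", "langchain.text_splitter"],
#     "RecursiveCharacterTextSplitter",
# )
```

Trying the newer path first and the older one second keeps the code working across both layouts, at the cost of hiding the import from static analysis tools.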

## [0.1.2] - 24-10-2025
---
## [0.1.2] - 2025-10-24

### Added

- Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
- [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)

## [0.1.1] - 22-10-2025
---
## [0.1.1] - 2025-10-22

### Fixed

- README images updated with external image link to fix PyPI rendering issue.
- [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
- [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)

## [0.1.0] - 22-10-2025
---
## [0.1.0] - 2025-10-22

### Added

19 changes: 13 additions & 6 deletions CITATION.cff
@@ -16,7 +16,7 @@ contact:
- family-names: Roy
given-names: Aritra
orcid: "https://orcid.org/0000-0002-4928-2935"
message: If you use this software, please cite our article on arXiv.
message: If you use this software, please cite our article in Digital Discovery.
preferred-citation:
authors:
- family-names: Roy
@@ -31,21 +31,28 @@ preferred-citation:
- family-names: Gattinoni
given-names: Chiara
orcid: "https://orcid.org/0000-0002-3376-6374"
date-published: 2025-10-23
doi: "10.1039/D5DD00521C"
identifiers:
- type: doi
value: "10.1039/D5DD00521C"
description: "Peer-reviewed article"
- type: other
value: "arXiv:2510.20362"
description: "arXiv preprint"
title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
journal: "Digital Discovery"
publisher:
name: "RSC"
status: advance-online
title: "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature"
type: article
url: "https://arxiv.org/abs/2510.20362"
url: "https://doi.org/10.1039/D5DD00521C"
repository-code: "https://github.com/slimeslab/ComProScanner"
license: MIT
title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
type: software
url: "https://slimeslab.github.io/ComProScanner/"
version: "0.1.4"
date-released: 2025-12-03
version: "0.1.6"
date-released: 2026-04-02
keywords:
- materials science
- data extraction
17 changes: 9 additions & 8 deletions README.md
@@ -169,14 +169,15 @@ eval_visualizer.plot_multiple_radar_charts(
If you use ComProScanner in your research, please cite:

```bibtex
@misc{roy2025comproscannermultiagentbasedframework,
title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
year={2025},
eprint={2510.20362},
archivePrefix={arXiv},
primaryClass={physics.comp-ph},
url={https://arxiv.org/abs/2510.20362},
@Article{roy2026comproscannermultiagentbasedframework,
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
journal ="Digital Discovery",
year ="2026",
pages ="Accepted",
publisher ="RSC",
doi ="10.1039/D5DD00521C",
url ="https://doi.org/10.1039/D5DD00521C"
}
```

75 changes: 60 additions & 15 deletions docs/about/changelog.md
@@ -1,6 +1,42 @@
## Unreleased
# Unreleased
- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:

- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.

- **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.

- **Agentic evaluation**: a new `GetValueErrorThresholdTool` (CrewAI `BaseTool`) is added to the composition evaluator agent when thresholds are configured. The agent calls this tool with the reference value to retrieve the tolerance before deciding on each numeric match. No tool is added and no prompt changes are made when no thresholds are provided.

- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.

- VLM-based graph data extraction added across all publishers and PDF processors:

- New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.

- New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.

- New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.

- New unit tests added for all three agent tools in `tests/test_agent_tools/`.

### Fixed

- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.

---
## [0.1.6] - 2026-04-02
### Changed
- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)

⚠️ Potential issue | 🟡 Minor

Use hyphenated compound modifier

Line 30 should read “multi-agent-based framework”.

Proposed text fix
-- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C) 
+- [ComProScanner: a multi-agent-based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C) 
🧰 Tools
🪛 LanguageTool

[grammar] ~30-~30: Use a hyphen to join words.
Context: ...ccess: - [ComProScanner: a multi-agent based framework for composition-property...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/about/changelog.md` at line 30, The changelog entry string
"ComProScanner: a multi-agent based framework for composition-property
structured data extraction from scientific literature" uses an unhyphenated
compound modifier; update that string to read "ComProScanner: a
multi-agent-based framework for composition-property structured data extraction
from scientific literature" by replacing "multi-agent based" with
"multi-agent-based".


### Added
- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.

Comment on lines +27 to +34

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== CHANGELOG.md 0.1.6 section ==="
sed -n '/## \[0.1.6\]/,/## \[0.1.5\]/p' CHANGELOG.md | head -20

echo ""
echo "=== docs/about/changelog.md 0.1.6 section ==="
sed -n '/## \[0.1.6\]/,/## \[0.1.5\]/p' docs/about/changelog.md | head -20

Repository: aritraroy24/ComProScanner

Length of output: 1587


Add the missing `### Fixed` section to 0.1.6 in `docs/about/changelog.md`.

The `docs/about/changelog.md` file is missing the `### Fixed` section that exists in `CHANGELOG.md` for version 0.1.6. The following items should be added:

### Fixed
- Model prefix handling in `rag_tool.py` standardized to reflect the docs.
- `HF_TOKEN` documentation clarified as optional — only required for gated or private Hugging Face models.
🧰 Tools
🪛 LanguageTool

[grammar] ~30-~30: Use a hyphen to join words.
Context: ...ccess: - [ComProScanner: a multi-agent based framework for composition-property...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/about/changelog.md` around lines 27 - 34, Add a new "### Fixed"
subsection under version 0.1.6 in docs/about/changelog.md containing the two
missing bullets: one noting that "Model prefix handling in rag_tool.py
standardized to reflect the docs" and the other clarifying "HF_TOKEN
documentation clarified as optional — only required for gated or private Hugging
Face models"; ensure the section header is exactly "### Fixed" and the two items
reference rag_tool.py and HF_TOKEN as written so they match existing
CHANGELOG.md entries.
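The drift flagged in this comment could also be caught with a small script. The sketch below is a hypothetical helper, not part of the repository: it assumes both changelogs use `## [version]` release headings and `### Name` subsection headings, as in the diff above, and unlike the `sed` range in the analysis script it stops before the next release heading.

```python
import re
from typing import List


def version_section(changelog_text: str, version: str) -> str:
    """Return the text from '## [version]' up to the next '## [' heading."""
    pattern = re.compile(
        r"^## \[" + re.escape(version) + r"\].*?(?=^## \[|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(changelog_text)
    return match.group(0) if match else ""


def subsection_headings(section: str) -> List[str]:
    """List the '### ...' heading names in a section, in order of appearance."""
    return re.findall(r"^### (.+?)\s*$", section, re.MULTILINE)


def missing_subsections(reference: str, candidate: str, version: str) -> List[str]:
    """Headings present in reference but absent from candidate for one version."""
    expected = subsection_headings(version_section(reference, version))
    present = set(subsection_headings(version_section(candidate, version)))
    return [heading for heading in expected if heading not in present]


# Shortened stand-ins for CHANGELOG.md and docs/about/changelog.md.
root_changelog = (
    "## [0.1.6] - 2026-04-02\n### Changed\n- a\n\n### Added\n- b\n\n"
    "### Fixed\n- c\n\n---\n## [0.1.5] - 2026-02-08\n\n### Added\n- d\n"
)
docs_changelog = (
    "## [0.1.6] - 2026-04-02\n### Changed\n- a\n\n### Added\n- b\n\n"
    "---\n## [0.1.5] - 2026-02-08\n\n### Fixed\n- e\n"
)
missing = missing_subsections(root_changelog, docs_changelog, "0.1.6")
```

In practice the two strings would be read from the files on disk, and a non-empty result could fail a CI check so the copies stay in sync.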

---
## [0.1.5] - 2026-02-08

### Added
- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.

- New parameter `apply_advanced_cleaning` added to data cleaning methods in `data_cleaner.py`. When set to `True`, it triggers the advanced cleaning pipeline.

@@ -37,9 +73,12 @@

- [CITATION.cff](https://github.com/slimeslab/ComProScanner/blob/main/CITATION.cff) added for standardized citation information based on the latest release and arXiv preprint.

- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.

### Fixed
- OAWorks API is replaced with OpenAlex API as OAWorks is no longer available.

- Empty/corrupted PDF handled in `pdf_processor.py` and `wiley_processor.py` to avoid having GLYPH errors during text extraction.

- Data extraction failures fixed if composition-property text data is empty.

- CSV progress tracking in `elsevier_processor.py`:

@@ -61,13 +100,12 @@
- GitHub Actions CI disk space issue:
- Added `--no-cache-dir` flag to pip install to reduce disk usage

- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.

### Changed

- README badges section converted from HTML to markdown format for better compatibility across platforms.

## [0.1.4] - 02-12-2025
---
## [0.1.4] - 2025-12-02

### Added

@@ -98,32 +136,39 @@

### Changed

- README images updated with raw GitHub links for better reliability: [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png), [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
- README images updated with raw GitHub links for better reliability:
- [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
- [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)

## [0.1.3] - 04-11-2025
---
## [0.1.3] - 2025-11-04

### Fixed

- **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
- Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
- To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`

## [0.1.2] - 24-10-2025
---
## [0.1.2] - 2025-10-24

### Added

- Link to ComProScanner preprint on arXiv in the documentation index page and README.md: [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
- Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
- [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)

## [0.1.1] - 22-10-2025
---
## [0.1.1] - 2025-10-22

### Fixed

- README images updated with external image link to fix PyPI rendering issue. [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png), [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
- README images updated with external image link to fix PyPI rendering issue.
- [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
- [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)

## [0.1.0] - 22-10-2025
---
## [0.1.0] - 2025-10-22

### Added

- Initial release of ComProScanner.


17 changes: 9 additions & 8 deletions docs/about/citation.md
@@ -3,13 +3,14 @@
If you use ComProScanner in your research, please cite our related paper:

```bibtex
@misc{roy2025comproscannermultiagentbasedframework,
title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
year={2025},
eprint={2510.20362},
archivePrefix={arXiv},
primaryClass={physics.comp-ph},
url={https://arxiv.org/abs/2510.20362},
@Article{roy2026comproscannermultiagentbasedframework,
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
journal ="Digital Discovery",
year ="2026",
pages ="Accepted",
publisher ="RSC",
doi ="10.1039/D5DD00521C",
url ="https://doi.org/10.1039/D5DD00521C"
}
```