aritraroy24 · aritraroy24 · May 19, 2026 · Apr 27, 2026 · Apr 29, 2026 · May 1, 2026
diff --git a/.env.example b/.env.example
@@ -2,6 +2,7 @@
 
 # Publisher API keys
 SCOPUS_API_KEY=YOUR_SCOPUS_API_KEY
+SCIENCEDIRECT_INSTTOKEN=YOUR_SCIENCEDIRECT_INSTTOKEN # Optional: institutional token for ScienceDirect full-text access (contact your institution's library)
 SPRINGER_OPENACCESS_API_KEY=YOUR_SPRINGER_OPENACCESS_API_KEY
 SPRINGER_TDM_API_KEY=YOUR_SPRINGER_TDM_API_KEY/API_METRIC
 WILEY_API_KEY=YOUR_WILEY_API_KEY
@@ -14,13 +15,20 @@ DATABASE_PASSWORD=DB_PASSWORD
 DATABASE_NAME=DB_NAME
 
 # LLM Providers
-GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
+GEMINI_API_KEY=YOUR_GEMINI_API_KEY
 OPENAI_API_KEY=YOUR_OPENAI_API_KEY
 DEEPSEEK_API_KEY=YOUR_DEEPSEEK_API_KEY
+ANTHROPIC_API_KEY=YOUR_ANTHROPIC_API_KEY
 OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
+TOGETHER_API_KEY=YOUR_TOGETHER_API_KEY
+COHERE_API_KEY=YOUR_COHERE_API_KEY
+FIREWORKS_API_KEY=YOUR_FIREWORKS_API_KEY
 
 # neo4j
 NEO4J_URI=YOUR_NEO4J_URI # default URI for Neo4j is bolt://localhost:7687
 NEO4J_USER=YOUR_NEO4J_USERNAME
 NEO4J_PASSWORD=YOUR_NEO4J_PASSWORD
-NEO4J_DATABASE=YOUR_NEO4J_DATABASE_NAME
+NEO4J_DATABASE=YOUR_NEO4J_DATABASE_NAME
+
+# Optional model access
+HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
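
A minimal sketch of how the optional `SCIENCEDIRECT_INSTTOKEN` value could be turned into request headers. The changelog confirms only the `X-ELS-Insttoken` header name. The function name `build_sciencedirect_headers`, the use of `SCOPUS_API_KEY` for the `X-ELS-APIKey` header, and the `Accept` value are assumptions for illustration, not the code in `ElsevierArticleProcessor`.

```python
import os


def build_sciencedirect_headers(env=os.environ):
    """Hypothetical helper: build headers, adding the token only when it is set."""
    headers = {"Accept": "text/xml"}  # assumed default, not taken from the source

    api_key = env.get("SCOPUS_API_KEY")  # assumption: same key is used for ScienceDirect
    if api_key:
        headers["X-ELS-APIKey"] = api_key

    # Optional institutional token; omitting it leaves the request unchanged.
    token = env.get("SCIENCEDIRECT_INSTTOKEN")
    if token:
        headers["X-ELS-Insttoken"] = token
    return headers
```

Because the token is added only when present, a missing or empty variable produces the same headers as before the change.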
diff --git a/.gitignore b/.gitignore
@@ -178,17 +178,9 @@ cython_debug/
 .claude
 CLAUDE.md
 
-# Remove example directory primarily
+# Ignore db directory files to avoid accidentally committing large files
 examples/db/10.*
-tests example/
+examples/vlm_piezo_test/db/10.*
+examples/vlm_piezo_test/db/chroma.sqlite3
 
 applications
-vlm_test
-examples/vlm_piezo_test
-
-# Test results
-db
-results
-elsevier_test.xml
-springer_test.xml
-wiley_test.pdf
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,7 +1,12 @@
 # Unreleased
+
+### Added
+
+- Added `SCIENCEDIRECT_INSTTOKEN` environment variable support in `ElsevierArticleProcessor` for off-campus remote access to subscription-based Elsevier articles and figures. When set, the token is sent as the `X-ELS-Insttoken` header in all ScienceDirect API requests and figure downloads. The variable is optional; omitting it does not affect on-campus access.
+
 - New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
 
-  - Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
+  - Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. Ranges are interpreted as **layers**: the narrowest range containing the ground-truth value determines the tolerance. For example, when `(-50, 50): 0.5` is also present, `(-150, 150): 1` applies only to ground-truth values between -150 and -50 or between 50 and 150, so separate positive and negative sub-ranges are not needed. Tuple element order is irrelevant: `(-150, 150)` and `(150, -150)` are equivalent. Values outside all configured ranges fall back to exact comparison.
 
   - **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.
 
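
The layered-range rule for `value_error_thresholds` can be sketched as follows. This is an illustration only: `pick_tolerance` and `values_match` are hypothetical names, not the library's `_get_error_threshold()`, and treating range endpoints as inclusive is an assumption the changelog does not settle.

```python
def pick_tolerance(value, thresholds):
    """Return the tolerance of the narrowest range containing value, or None."""
    best = None  # (range width, tolerance) of the narrowest match so far
    for (a, b), tolerance in thresholds.items():
        low, high = min(a, b), max(a, b)  # tuple element order is irrelevant
        if low <= value <= high:  # assumption: endpoints are inclusive
            width = high - low
            if best is None or width < best[0]:
                best = (width, tolerance)
    return None if best is None else best[1]


def values_match(extracted, ground_truth, thresholds):
    """Accept if within the layered tolerance; otherwise require exact equality."""
    tolerance = pick_tolerance(ground_truth, thresholds)
    if tolerance is None:
        return extracted == ground_truth
    return abs(extracted - ground_truth) <= tolerance


thresholds = {(-50, 50): 0.5, (150, -150): 1}
print(pick_tolerance(20, thresholds))    # inner layer
print(pick_tolerance(-100, thresholds))  # outer layer
print(pick_tolerance(300, thresholds))   # outside every range
```

Choosing by range width means the outer range never has to be split into a positive and a negative piece around the inner one.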
@@ -15,14 +20,49 @@
 
   - New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
 
-  - New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
+  - New `main_figure_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
 
 - New unit tests added for all three agent tools in `tests/test_agent_tools/`.
 
+- Added `save_failed_pdf_report` and `failed_pdf_report_path` to `process_articles()`, with filename-derived DOI validation and failed-PDF reporting for local PDF workflows.
+
+- Added `save_failed_automated_report` and `failed_automated_report_path` to `process_articles()` for automated publisher sources (Elsevier, Springer Nature, IOP, Wiley), mirroring the existing PDF failure report. Failed articles are written as tab-separated `doi`, `publisher`, `reason` entries to `results/failed_automated_articles.txt` by default.
+
+- Added image-aware fallback in `DataExtractionFlow.identify_materials_data_presence()`:
+
+  - The Materials Data Identifier still runs text RAG first.
+  - If RAG returns `no`, the flow now checks the figures saved for that DOI with the VLM and upgrades the decision to `yes` when relevant graph or figure evidence is found (including plots of doping concentration versus property where full formulas are absent).
+
+- Added `is_store_unresolved_compositions` and `unresolved_compositions_file` parameters to `clean_data()` to optionally log split composition-property resolution statistics (`source`, `filtered`, `unresolved`, `resolved` counts) and persist filtered and unresolved composition keys in a JSON file keyed by DOI under `"filtered"` and `"unresolved"` top-level keys.
+
+- Added explicit Equation Tool model control:
+
+  - New `equation_model` parameter in `extract_composition_property_data()` (threaded through `DataExtractionFlow` and `CompositionExtractionCrew` into `EquationTool`).
+  - EquationTool model precedence is now: `equation_model` argument -> API-key-based auto-selection.
+
+- Clarified Equation Tool instruction customization in extraction docs and API:
+
+  - `formula_instruction` remains available in `extract_composition_property_data()` for domain-specific formula-derivation guidance, while preserving the built-in default instruction when unset.
+
+### Changed
+
+- Versioning scheme migrated from [Semantic Versioning](https://semver.org/) (SemVer) to [Calendar Versioning](https://calver.org/) (CalVer) using the `YYYY.MM.DD` format. Starting from this release, version numbers reflect the release date rather than an incrementing major/minor/patch scheme.
+
 ### Fixed
 
+- `_parse_json_output()` now recovers JSON from mixed-text crew outputs (e.g. `Thought: … { "json": "here" }`) by scanning for the first `{` / `[` and last `}` / `]` and retrying `json.loads()` on the extracted substring, before falling back to `ast.literal_eval()`.
+
+- Composition formatter agent now verifies `MaterialParserTool` output for incomplete variable substitution (e.g. `(1-x-y)` partially resolved as `(0.9-0.010)`) and overrides with the correct fully-substituted BODMAS expression when the tool is wrong.
+
 - `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
 
+- PNG, GIF, and WEBP figures now convert correctly to JPEG: transparent images are composited onto a white background, animated GIFs are pinned to frame 0, and two additional Springer Nature CDN URL patterns are tried to improve download success for these formats.
+
+- Added and updated tests for new extraction-flow behavior:
+
+  - EquationTool model selection tests now cover explicit arg override, env override, and updated model defaults.
+  - DataExtractionFlow tests now cover figure-based materials-data fallback and `equation_model` forwarding into `CompositionExtractionCrew`.
+
 ---
 ## [0.1.6] - 2026-04-02
 ### Changed

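
The JSON recovery described for `_parse_json_output()` can be sketched roughly as below. The function name `recover_json`, the `None` return on failure, and applying `ast.literal_eval()` to the full text rather than the extracted substring are assumptions. The changelog states only the order of the attempts.

```python
import ast
import json


def recover_json(text):
    """Hypothetical sketch: parse JSON that may be wrapped in free text."""
    # 1. Try the whole string first.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. Retry on the span from the first '{' or '[' to the last '}' or ']'.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    ends = [i for i in (text.rfind("}"), text.rfind("]")) if i != -1]
    if starts and ends and min(starts) < max(ends):
        candidate = text[min(starts): max(ends) + 1]
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass

    # 3. Fall back to Python literal syntax (e.g. single-quoted dicts).
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None  # assumption: the real method may raise or log instead


print(recover_json('Thought: done { "d33": 120 } end'))
```

Scanning for the outermost brackets is a heuristic: it fails when the surrounding text itself contains stray braces, which is why the literal-eval fallback remains as a last resort.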
diff --git a/CITATION.cff b/CITATION.cff
@@ -41,8 +41,12 @@ preferred-citation:
       description: "arXiv preprint"
   journal: "Digital Discovery"
   publisher:
-    name: "RSC"
-  status: advance-online
+    name: "Royal Society of Chemistry"
+  volume: 5
+  issue: 4
+  start: 1794
+  end: 1808
+  year: 2026
   title: "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature"
   type: article
   url: "https://doi.org/10.1039/D5DD00521C"

diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2025 SLIMES Lab
+Copyright © 2025-2026 SLIMES Lab
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
   <img src="https://raw.githubusercontent.com/aritraroy24/ComProScanner/refs/heads/main/assets/comproscanner_logo.png" alt="ComProScanner Logo" width="500"/>
 </p>
 
-[![Python Version](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg?logo=python&logoColor=white)](https://www.python.org/downloads/) [![License: MIT](https://custom-icon-badges.demolab.com/badge/license-MIT-yellow.svg?logo=law&logoColor=white)](https://opensource.org/licenses/MIT) [![PyPI](https://img.shields.io/pypi/v/comproscanner?logo=pypi&logoColor=white)](https://pypi.org/project/comproscanner/) [![Documentation](https://custom-icon-badges.demolab.com/badge/docs-latest-brightgreen.svg?logo=materialformkdocs&logoColor=white)](https://slimeslab.github.io/ComProScanner/) [![Coverage](https://img.shields.io/codecov/c/github/aritraroy24/ComProScanner?logo=codecov&logoColor=white&label=coverage&color=e62277)](https://codecov.io/gh/aritraroy24/ComProScanner) [![PyPI - Downloads](https://custom-icon-badges.demolab.com/pypi/dm/comproscanner?logo=download&logoColor=white&color=purple)](https://pypistats.org/packages/comproscanner) [![Ask DeepWiki](https://custom-icon-badges.demolab.com/badge/Ask%20DeepWiki-brightgreen.svg?logo=deepwikidevin&logoColor=white&labelColor=grey&color=5ab998)](https://deepwiki.com/slimeslab/ComProScanner) [![Digital Discovery](https://custom-icon-badges.demolab.com/badge/Digital_Discovery-10.1039/D5DD00521C-brightgreen.svg?logo=rsc&logoColor=white&color=c8c300)](https://doi.org/10.1039/D5DD00521C)
+[![Python Version](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg?logo=python&logoColor=white)](https://www.python.org/downloads/) [![License: MIT](https://custom-icon-badges.demolab.com/badge/license-MIT-brown.svg?logo=law&logoColor=white)](https://opensource.org/licenses/MIT) [![PyPI](https://img.shields.io/pypi/v/comproscanner?logo=pypi&logoColor=white)](https://pypi.org/project/comproscanner/) [![Documentation](https://custom-icon-badges.demolab.com/badge/docs-latest-brightgreen.svg?logo=materialformkdocs&logoColor=white)](https://slimeslab.github.io/ComProScanner/) [![Coverage](https://img.shields.io/codecov/c/github/aritraroy24/ComProScanner?logo=codecov&logoColor=white&label=coverage&color=e62277)](https://codecov.io/gh/aritraroy24/ComProScanner) [![PyPI - Downloads](https://custom-icon-badges.demolab.com/pypi/dm/comproscanner?logo=download&logoColor=white&color=purple)](https://pypistats.org/packages/comproscanner) [![Ask DeepWiki](https://custom-icon-badges.demolab.com/badge/Ask%20DeepWiki-brightgreen.svg?logo=deepwikidevin&logoColor=white&labelColor=grey&color=5ab998)](https://deepwiki.com/slimeslab/ComProScanner) [![Digital Discovery](https://custom-icon-badges.demolab.com/badge/Digital_Discovery-10.1039/D5DD00521C-brightgreen.svg?logo=rsc&logoColor=white&color=c8c300)](https://doi.org/10.1039/D5DD00521C)
 
 # ComProScanner
 
@@ -120,43 +120,6 @@ The ComProScanner workflow consists of four main stages:
 - Data Visualization
 - Evaluation Visualization
 
-## Example Use Cases
-
-### Extract Data from Multiple Sources
-
-```python
-scanner.process_articles(
-    property_keywords=property_keywords,
-    source_list=["elsevier", "springer", "wiley"]
-)
-```
-
-### Customize RAG Configuration
-
-```python
-scanner.extract_composition_property_data(
-    main_extraction_keyword="d33",
-    rag_chat_model="gemini-2.5-pro",
-    rag_max_tokens=2048,
-    rag_top_k=5
-)
-```
-
-### Visualize Results
-
-```python
-from comproscanner import data_visualizer, eval_visualizer
-
-# Create knowledge graph
-data_visualizer.create_knowledge_graph(result_file="results.json")
-
-# Plot evaluation metrics
-eval_visualizer.plot_multiple_radar_charts(
-    result_sources=["model1.json", "model2.json"],
-    model_names=["GPT-4o", "Claude-3.5"]
-)
-```
-
 ## Requirements
 
 - Python 3.12 or 3.13
@@ -169,15 +132,17 @@ eval_visualizer.plot_multiple_radar_charts(
 If you use ComProScanner in your research, please cite:
 
 ```bibtex
-@Article{roy2026comproscannermultiagentbasedframework,
-author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
-title  ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
-journal  ="Digital Discovery",
-year  ="2026",
-pages  ="Accepted",
-publisher  ="RSC",
-doi  ="10.1039/D5DD00521C",
-url  ="https://doi.org/10.1039/D5DD00521C"
+@Article{roy2026comproscanner,
+  title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
+  author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
+  journal={Digital Discovery},
+  volume={5},
+  number={4},
+  pages={1794--1808},
+  year={2026},
+  publisher={Royal Society of Chemistry},
+  doi={10.1039/D5DD00521C},
+  url={https://doi.org/10.1039/D5DD00521C}
 }
 ```
 

diff --git a/assets/overall_workflow.png b/assets/overall_workflow.png
diff --git a/docs/about/changelog.md b/docs/about/changelog.md
@@ -1,7 +1,12 @@
 # Unreleased
+
+### Added
+
+- Added `SCIENCEDIRECT_INSTTOKEN` environment variable support in `ElsevierArticleProcessor` for off-campus remote access to subscription-based Elsevier articles and figures. When set, the token is sent as the `X-ELS-Insttoken` header in all ScienceDirect API requests and figure downloads. The variable is optional; omitting it does not affect on-campus access.
+
 - New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
 
-  - Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
+  - Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. Ranges are interpreted as **layers**: the narrowest range containing the ground-truth value determines the tolerance. For example, when `(-50, 50): 0.5` is also present, `(-150, 150): 1` applies only to ground-truth values between -150 and -50 or between 50 and 150, so separate positive and negative sub-ranges are not needed. Tuple element order is irrelevant: `(-150, 150)` and `(150, -150)` are equivalent. Values outside all configured ranges fall back to exact comparison.
 
   - **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.
 
@@ -15,18 +20,53 @@
 
   - New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
 
-  - New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
+  - New `main_figure_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
 
 - New unit tests added for all three agent tools in `tests/test_agent_tools/`.
 
+- Added `save_failed_pdf_report` and `failed_pdf_report_path` to `process_articles()`, with filename-derived DOI validation and failed-PDF reporting for local PDF workflows.
+
+- Added `save_failed_automated_report` and `failed_automated_report_path` to `process_articles()` for automated publisher sources (Elsevier, Springer Nature, IOP, Wiley), mirroring the existing PDF failure report. Failed articles are written as tab-separated `doi`, `publisher`, `reason` entries to `results/failed_automated_articles.txt` by default.
+
+- Added image-aware fallback in `DataExtractionFlow.identify_materials_data_presence()`:
+
+  - The Materials Data Identifier still runs text RAG first.
+  - If RAG returns `no`, the flow now checks the figures saved for that DOI with the VLM and upgrades the decision to `yes` when relevant graph or figure evidence is found (including plots of doping concentration versus property where full formulas are absent).
+
+- Added `is_store_unresolved_compositions` and `unresolved_compositions_file` parameters to `clean_data()` to optionally log split composition-property resolution statistics (`source`, `filtered`, `unresolved`, `resolved` counts) and persist filtered and unresolved composition keys in a JSON file keyed by DOI under `"filtered"` and `"unresolved"` top-level keys.
+
+- Added explicit Equation Tool model control:
+
+  - New `equation_model` parameter in `extract_composition_property_data()` (threaded through `DataExtractionFlow` and `CompositionExtractionCrew` into `EquationTool`).
+  - EquationTool model precedence is now: `equation_model` argument -> API-key-based auto-selection.
+
+- Clarified Equation Tool instruction customization in extraction docs and API:
+
+  - `formula_instruction` remains available in `extract_composition_property_data()` for domain-specific formula-derivation guidance, while preserving the built-in default instruction when unset.
+
+### Changed
+
+- Versioning scheme migrated from [Semantic Versioning](https://semver.org/) (SemVer) to [Calendar Versioning](https://calver.org/) (CalVer) using the `YYYY.MM.DD` format. Starting from this release, version numbers reflect the release date rather than an incrementing major/minor/patch scheme.
+
 ### Fixed
 
+- `_parse_json_output()` now recovers JSON from mixed-text crew outputs (e.g. `Thought: … { "json": "here" }`) by scanning for the first `{` / `[` and last `}` / `]` and retrying `json.loads()` on the extracted substring, before falling back to `ast.literal_eval()`.
+
+- Composition formatter agent now verifies `MaterialParserTool` output for incomplete variable substitution (e.g. `(1-x-y)` partially resolved as `(0.9-0.010)`) and overrides with the correct fully-substituted BODMAS expression when the tool is wrong.
+
 - `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
 
+- PNG, GIF, and WEBP figures now convert correctly to JPEG: transparent images are composited onto a white background, animated GIFs are pinned to frame 0, and two additional Springer Nature CDN URL patterns are tried to improve download success for these formats.
+
+- Added and updated tests for new extraction-flow behavior:
+
+  - EquationTool model selection tests now cover explicit arg override, env override, and updated model defaults.
+  - DataExtractionFlow tests now cover figure-based materials-data fallback and `equation_model` forwarding into `CompositionExtractionCrew`.
+
 ---
 ## [0.1.6] - 2026-04-02
 ### Changed
-- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
+- Updated [README.md](https://github.com/slimeslab/ComProScanner/blob/main/README.md), [CITATION.cff](https://github.com/slimeslab/ComProScanner/blob/main/CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
   - [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C) 
 
 ### Added

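
The failed-article report for automated publisher sources is described as tab-separated `doi`, `publisher`, `reason` entries. A minimal sketch of writing such a file is shown here. The function name `write_failed_report`, the absence of a header row, and overwriting rather than appending are assumptions. Only the column order and the default path come from the changelog.

```python
import csv
import os


def write_failed_report(failures, path="results/failed_automated_articles.txt"):
    """Hypothetical helper: write (doi, publisher, reason) rows as TSV."""
    # Create the parent directory if the path has one.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for doi, publisher, reason in failures:
            writer.writerow([doi, publisher, reason])
    return path
```

Using `csv.writer` rather than manual string joins keeps the file parseable even if a reason string happens to contain a tab or a quote character.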
diff --git a/docs/about/citation.md b/docs/about/citation.md
@@ -3,14 +3,16 @@
 If you use ComProScanner in your research, please cite our related paper:
 
 ```bibtex
-@Article{roy2026comproscannermultiagentbasedframework,
-author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
-title  ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
-journal  ="Digital Discovery",
-year  ="2026",
-pages  ="Accepted",
-publisher  ="RSC",
-doi  ="10.1039/D5DD00521C",
-url  ="https://doi.org/10.1039/D5DD00521C"
+@Article{roy2026comproscanner,
+  title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
+  author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
+  journal={Digital Discovery},
+  volume={5},
+  number={4},
+  pages={1794--1808},
+  year={2026},
+  publisher={Royal Society of Chemistry},
+  doi={10.1039/D5DD00521C},
+  url={https://doi.org/10.1039/D5DD00521C}
 }
 ```