
NER Benchmarking Dataset#65

Open
tekrajchhetri wants to merge 14 commits into improvement from evaluation_data

Conversation


@tekrajchhetri tekrajchhetri commented Feb 19, 2026

This pull request adds the validation dataset for NER and reorganizes the evaluation directory, moving the old evaluation content to an old directory.

What’s Included

  • NCBI Disease and gene/protein data for NER (see Dataset). Each dataset contains a mapping of sentences to NER terms in JSONL format, plus the raw text.
  • The dataset only includes the test set. The original dataset is available in the Sensein Google Drive.
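For illustration, one line of such a JSONL mapping might pair a sentence with its NER terms like the sketch below. The field names ("sentence", "entities", "label") and the sample record are assumptions for illustration, not taken from the actual files:

```python
import json

# Hypothetical JSONL line pairing a sentence with its NER annotations.
line = '{"sentence": "BRCA1 mutations cause breast cancer.", "entities": [{"text": "breast cancer", "label": "Disease", "start": 22, "end": 35}]}'

record = json.loads(line)
for ent in record["entities"]:
    # Character offsets should round-trip against the sentence text.
    span = record["sentence"][ent["start"]:ent["end"]]
    assert span == ent["text"]
    print(ent["label"], "->", span)  # Disease -> breast cancer
```

A mapping in this shape lets an evaluator score predicted spans against gold spans by exact offset match.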

Issues this PR addresses

Note: As stated in the pull request, the data is not annotated based on our papers; it comes from established benchmark datasets.

@gemini-code-assist
Contributor

Summary of Changes

Hello @tekrajchhetri, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the NER benchmarking capabilities by introducing a new validation dataset and a suite of analytical tools. The changes involve a major reorganization of the evaluation directory, migrating existing content to an 'old' folder to make way for new, standardized benchmarking data and scripts. The new Python scripts facilitate data conversion from BIO-tagged formats and provide comprehensive analysis of model performance across various metrics, including entity detection, ontology mapping, and judge scores. This update aims to streamline future NER evaluations and provide deeper insights into model strengths and weaknesses.

Highlights

  • NER Validation Dataset Added: Introduced a new validation dataset for Named Entity Recognition (NER) benchmarking, including NCBI Disease and gene/protein data.
  • Evaluation Directory Reorganization: Restructured the evaluation directory by moving existing content into an 'old' subdirectory to accommodate new benchmarking data and scripts.
  • BIO Tag Conversion Script: Added a Python script (bio_txt.py) to convert TSV/CSV files with BIO tags into continuous text and JSONL entity mappings, supporting flexible delimiters and tag formats.
  • Comprehensive NER Evaluation Tools: Included new Python scripts for comprehensive NER analysis, covering entity pool, judge score, label distribution, and location analysis, along with a data loader.
  • Detailed Model Performance Data: Integrated extensive evaluation results for various language models (Claude 3.7 Sonnet, GPT-4o-mini, DeepSeek V3 0324) on NER tasks, both with and without human-in-the-loop (HIL) feedback.
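As a rough illustration of the BIO-to-span conversion a script like bio_txt.py performs, the sketch below joins tokens into continuous text while tracking global character offsets and closing entity spans on O tags. The function name, tag handling, and sample tokens are assumptions, not the script's actual code:

```python
# Minimal sketch of BIO-tag-to-entity-span conversion (illustrative only).
def bio_to_spans(tokens, tags):
    """Join tokens into text and collect entity spans with global offsets."""
    text, spans, pos = "", [], 0
    current = None  # [start, end, type] of the entity being built
    for tok, tag in zip(tokens, tags):
        if text:               # tokens are space-joined
            text += " "
            pos += 1
        start, end = pos, pos + len(tok)
        text += tok
        pos = end
        if tag.startswith("B"):
            if current:
                spans.append(current)
            # Support both bare "B" and typed "B-XXX" tags.
            etype = tag.split("-", 1)[1] if "-" in tag else "ENT"
            current = [start, end, etype]
        elif tag.startswith("I") and current:
            current[1] = end   # extend the open entity
        else:                  # an "O" tag closes any open entity
            if current:
                spans.append(current)
                current = None
    if current:
        spans.append(current)
    return text, spans

tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
tags = ["B-DNA", "I-DNA", "O", "O", "B-cell_type", "I-cell_type"]
text, spans = bio_to_spans(tokens, tags)
print(text)   # IL-2 gene expression in T cells
print(spans)  # [[0, 9, 'DNA'], [24, 31, 'cell_type']]
```

Because the offsets are global into the joined text, each span can be sliced back out of the output text for verification.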


Changelog
  • evaluation/benchmark/script/bio_txt.py
    • Added a new Python script to convert BIO-tagged TSV/CSV files into continuous text and JSONL entity mappings.
    • Implemented auto-detection for tab or comma delimiters in input files.
    • Included logic to parse various BIO tag formats (O, B, I, B-XXX, I-XXX) and assign entity types.
    • Provided functions to read BIO files, compute token start positions, and build entity spans with global offsets.
    • Ensured output directories are created if they do not exist.
  • evaluation/combined_all_token_cost_data/old/combined_all_csv.csv
    • Added a new CSV file containing combined token cost data for various models and tasks.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/Integrating-brainstem-token-usage-with-hl.csv
    • Added a new CSV file detailing token usage for NER evaluation on the 'Integrating brainstem' paper with human-in-the-loop (HIL) feedback.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_claudesonet_s41593-024-01787-0_with_hil.json
    • Added a new JSON configuration file for Claude 3.7 Sonnet's NER evaluation on the 'Integrating brainstem' paper with HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_deepseek_s41593-024-01787-0_with_hil.json
    • Added a new JSON configuration file for DeepSeek V3 0324's NER evaluation on the 'Integrating brainstem' paper with HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_gpt_s41593-024-01787-0_with_hil.json
    • Added a new JSON configuration file for GPT-4o-mini's NER evaluation on the 'Integrating brainstem' paper with HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/Integrating-brainstem-token-usage-without-hl.csv
    • Added a new CSV file detailing token usage for NER evaluation on the 'Integrating brainstem' paper without human-in-the-loop (HIL) feedback.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added a new JSON configuration file for Claude 3.7 Sonnet's NER evaluation on the 'Integrating brainstem' paper without HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_deepseek_s41593-024-01787-0_without_hil.json
    • Added a new JSON configuration file for DeepSeek V3 0324's NER evaluation on the 'Integrating brainstem' paper without HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_gpt_s41593-024-01787-0_without_hil.json
    • Added a new JSON configuration file for GPT-4o-mini's NER evaluation on the 'Integrating brainstem' paper without HIL, including judged structured information.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_analysis_report.txt
    • Added a new text file containing a comprehensive summary report of NER evaluation results for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_summary_table.csv
    • Added a new CSV file providing a comprehensive summary table of NER evaluation metrics for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entities_missing_ontology.csv
    • Added a new CSV file listing entities that were detected but lacked complete ontology mappings for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_with_hil.csv
    • Added a new CSV file summarizing the entity pool for the 'Latent circuit' task with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_without_hil.csv
    • Added a new CSV file summarizing the entity pool for the 'Latent circuit' task without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_detailed_statistics.csv
    • Added a new CSV file containing detailed statistics of judge scores across models for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_with_hil.csv
    • Added a new CSV file providing judge score statistics for the 'Latent circuit' task with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_without_hil.csv
    • Added a new CSV file providing judge score statistics for the 'Latent circuit' task without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/label_distribution_statistics.csv
    • Added a new CSV file detailing the distribution of entity labels across models for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/location_statistics.csv
    • Added a new CSV file providing statistics on the paper locations where entities were detected for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_with_hil.csv
    • Added a new CSV file listing detailed missed entities for the 'Latent circuit' task with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_without_hil.csv
    • Added a new CSV file listing detailed missed entities for the 'Latent circuit' task without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/model_rankings.csv
    • Added a new CSV file containing model rankings based on composite performance scores for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/ontology_coverage_summary.csv
    • Added a new CSV file summarizing ontology coverage for the 'Latent circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_with_hil.csv
    • Added a new CSV file listing entities shared by all models for the 'Latent circuit' task with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_without_hil.csv
    • Added a new CSV file listing entities shared by all models for the 'Latent circuit' task without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_claudesonet_s41593-025-01869-7_with_hil.json
    • Added a new JSON configuration file for Claude 3.7 Sonnet's NER evaluation on the 'Latent circuit' paper with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_deepseek_s41593-025-01869-7_with_hil.json
    • Added a new JSON configuration file for DeepSeek V3 0324's NER evaluation on the 'Latent circuit' paper with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_gpt_s41593-025-01869-7_with_hil.json
    • Added a new JSON configuration file for GPT-4o-mini's NER evaluation on the 'Latent circuit' paper with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_token_usage_with_hil_latent.csv
    • Added a new CSV file detailing token usage for NER evaluation on the 'Latent circuit' paper with HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_claudesonet_s41593-025-01869-7_without_hil.json
    • Added a new JSON configuration file for Claude 3.7 Sonnet's NER evaluation on the 'Latent circuit' paper without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_deepseek_s41593-025-01869-7_without_hil.json
    • Added a new JSON configuration file for DeepSeek V3 0324's NER evaluation on the 'Latent circuit' paper without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_gpt_s41593-025-01869-7_without_hil.json
    • Added a new JSON configuration file for GPT-4o-mini's NER evaluation on the 'Latent circuit' paper without HIL.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_token_usage_without_hil_latent.csv
    • Added a new CSV file detailing token usage for NER evaluation on the 'Latent circuit' paper without HIL.
  • evaluation/ner/old/ner_config_claudesonet.yaml
    • Added a new YAML configuration file for Claude 3.7 Sonnet's NER agent, including roles, goals, backstories, and task definitions for extraction, alignment, judging, and human feedback.
  • evaluation/ner/old/ner_config_deepseek.yaml
    • Added a new YAML configuration file for DeepSeek V3 0324's NER agent, including roles, goals, backstories, and task definitions for extraction, alignment, judging, and human feedback.
  • evaluation/ner/old/ner_config_gpt.yaml
    • Added a new YAML configuration file for GPT-4o-mini's NER agent, including roles, goals, backstories, and task definitions for extraction, alignment, judging, and human feedback.
  • evaluation/notebook/README.md
    • Added a new Markdown README file documenting the usage and output of the token cost speed analysis script.
  • evaluation/notebook/ner_comprehensive_summary.py
    • Added a new Python script to generate a comprehensive summary and comparison of all NER evaluation results, including radar charts and text reports.
  • evaluation/notebook/ner_data_loader.py
    • Added a new Python script to load and preprocess NER evaluation data from JSON files, handling different group structures and providing utilities for entity dataframes and overlap calculations.
  • evaluation/notebook/ner_entity_pool_analysis.py
    • Added a new Python script to analyze the entity pool across models, calculate false negatives, and visualize entity overlaps using bar charts and heatmaps.
  • evaluation/notebook/ner_judge_score_analysis.py
    • Added a new Python script to analyze judge scores, including score distributions, statistics tables, and comparisons by entity characteristics and groups.
  • evaluation/notebook/ner_label_distribution.py
    • Added a new Python script to analyze the distribution of entity labels, including diversity metrics and consistency for shared entities.
  • evaluation/notebook/ner_location_analysis.py
    • Added a new Python script to analyze entity detection patterns by paper location, including normalization of locations and visualization of distributions and heatmaps.
  • evaluation/notebook/old/combined_analysis/data/combined_all_csv.csv
    • Added a new CSV file containing combined token cost data for various models and tasks, moved to an 'old' directory.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_cost_violin.svg
    • Added a new SVG file visualizing the cost distribution for the 'Integrating brainstem' task with HIL.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_violin.svg
    • Added a new SVG file visualizing the speed distribution for the 'Integrating brainstem' task with HIL.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_vs_cost.svg
    • Added a new SVG file visualizing the speed vs. cost scatter plot for the 'Integrating brainstem' task with HIL.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a significant amount of data and analysis scripts for NER benchmarking. The new Python script bio_txt.py is a useful utility for data conversion, and the suite of analysis scripts in evaluation/notebook/ provides a comprehensive framework for evaluating model performance.

My review focuses on the new Python scripts and documentation. I've identified a few areas for improvement regarding code reusability, maintainability, and robustness. Specifically, I've pointed out an issue with hardcoded file paths in bio_txt.py, a DRY principle violation due to a duplicated function across two analysis scripts, and a bug in ner_data_loader.py that could cause silent data loading failures due to inconsistent JSON structures. I've also noted that the new README file in the evaluation/notebook directory appears to be out of sync with the scripts added in this PR.

Overall, the added scripts are well-structured for analysis purposes. Addressing the suggested changes will improve the quality and reliability of the codebase.

Comment on lines +110 to +115
# Structure: {judge_ner_terms: {judge_ner_terms: {chunk_id: [entities]}}}
if 'judge_ner_terms' in data:
    ner_data = data['judge_ner_terms'].get('judge_ner_terms', {})
else:
    ner_data = {}

Severity: high

The data loading logic for the without_hil group assumes a consistent nested structure of {'judge_ner_terms': {'judge_ner_terms': ...}}. However, some JSON files in the dataset (e.g., ner_config_claudesonet_s41593-024-01787-0_without_hil.json) have a flatter structure {'judge_ner_terms': ...}. The current implementation will silently fail to load data from these inconsistently formatted files, as .get('judge_ner_terms', {}) will return an empty dictionary.

To prevent silent data loading failures and make the loader more robust, the logic should be updated to handle both possible structures.

        else:
            # Handle inconsistent structures for without_hil group
            if 'judge_ner_terms' in data:
                ner_data_root = data['judge_ner_terms']
                # Check if it has the extra nesting
                if 'judge_ner_terms' in ner_data_root:
                    ner_data = ner_data_root.get('judge_ner_terms', {})
                else:
                    ner_data = ner_data_root
            else:
                ner_data = {}

Comment on lines +27 to +29
IN_FILE = Path("JNLPBA_gene_protein_test.tsv")
OUT_TEXT = Path("JNLPBA_gene_protein_test_text.txt")
OUT_ENTS_JSONL = Path("JNLPBA_gene_protein_test_entities_mapping.jsonl")
Severity: medium

The input and output file paths are hardcoded, which limits the script's reusability for different datasets. It would be more maintainable and flexible to provide these paths as command-line arguments.

You can use Python's built-in argparse module to handle this. For example:

import argparse

def main():
    parser = argparse.ArgumentParser(description='Convert BIO-tagged file to text and JSONL.')
    parser.add_argument('in_file', type=Path, help='Input BIO-tagged file.')
    parser.add_argument('--out-text', type=Path, help='Output text file.')
    parser.add_argument('--out-jsonl', type=Path, help='Output JSONL file.')
    args = parser.parse_args()

    IN_FILE = args.in_file
    OUT_TEXT = args.out_text or IN_FILE.with_suffix('.txt')
    OUT_ENTS_JSONL = args.out_jsonl or IN_FILE.with_suffix('.jsonl')

    # ... rest of your main function

Comment on lines +34 to +80
def normalize_location(location: str) -> str:
    """
    Normalize paper location strings for consistency.

    Args:
        location: Raw location string

    Returns:
        Normalized location string; None if empty/whitespace;
        'other' if no known keyword matches
    """
    if not location or not location.strip():
        return None

    location = location.lower().strip()

    # Map variations to standard names
    location_map = {
        'introduction': 'introduction',
        'intro': 'introduction',
        'abstract': 'abstract',
        'results': 'results',
        'result': 'results',
        'methods': 'methods',
        'method': 'methods',
        'discussion': 'discussion',
        'conclusions': 'conclusion',
        'conclusion': 'conclusion',
        'references': 'references',
        'reference': 'references',
        'supplementary': 'supplementary',
        'supplement': 'supplementary',
        'figure': 'figure',
        'table': 'table',
        'introductory information': 'introduction',  # Map to introduction
        'author information': 'introduction',  # Map to introduction
        'acknowledgments': 'acknowledgments',
        'acknowledgements': 'acknowledgments'
    }

    # Check if location contains any of the keywords
    for key, value in location_map.items():
        if key in location:
            return value

    # If no match, return 'other'
    return 'other'

Severity: medium

The normalize_location function is duplicated in ner_judge_score_analysis.py. To adhere to the Don't Repeat Yourself (DRY) principle, this function should be defined once in a shared location and imported where needed.

A good place for this utility function would be in ner_data_loader.py, perhaps as a static method of the NERDataLoader class, since it's directly related to processing the data loaded by that class.

Example of moving it to ner_data_loader.py:

# In ner_data_loader.py
class NERDataLoader:
    # ... existing methods ...

    @staticmethod
    def normalize_location(location: str) -> str:
        # ... function implementation ...

# In ner_location_analysis.py and ner_judge_score_analysis.py
# ...
df['normalized_location'] = df['paper_location'].apply(NERDataLoader.normalize_location)
# ...

Comment on lines +1 to +30
# Token Cost Speed Analysis

A script to visualize token usage, cost, and speed metrics across different language models.

## Usage

```bash
python token_cost_speed_analysis.py --task <task_name> --file <csv_file>
```

### Arguments

- `--task`: Task name (e.g., `reproschema`). Creates output folder with this name.
- `--file`: Path to CSV file containing token usage data.

### Example

```bash
python token_cost_speed_analysis.py --task reproschema --file reproschema/reproschema_token_usage.csv
```

### Output

Generates 4 plots in both PNG and SVG formats:
- Cost distribution by model
- Token usage (input/output) by model
- Speed distribution by model
- Speed vs cost scatter plot

All files are saved in a folder named after the task.
Severity: medium

This README file documents a script named token_cost_speed_analysis.py, but this script does not appear to be included in this pull request. The README should be updated to document the new analysis scripts that are being added in this directory, such as ner_comprehensive_summary.py, ner_entity_pool_analysis.py, etc., to ensure the documentation is accurate and relevant to the directory's contents.

@djarecka
Contributor

In the PR description, there is a pointer to the new dataset description, but much more is added. Is this on purpose? Perhaps some README can be added to evaluation and to its subdirectories. Some subdirectories have a README (e.g., notebooks), but it seems not to be updated.

From the dataset description it's not clear where the data comes from and how it will be used.

I also don't understand why there are two directories, evaluation/ner/old and old/evaluation. What is this about? Do we need it? Can we have some README?

Gemini also added comments.

Added link to original dataset in the readme.
Added a link to the original dataset for reference.
@tekrajchhetri
Collaborator Author

Regarding the other things: no, not on purpose. Regarding where the dataset comes from: we've had multiple meetings and I had shared the drive link; I didn't want to include it in GitHub, but I have now included the link in the README. The old/evaluation directory has been deleted; I somehow missed it.

@djarecka
Contributor

djarecka commented Feb 26, 2026

Regarding the dataset where it comes from, we've had multiple meetings and I had shared drive link, didn't want to include in it github, but now have included the link in readme.

But in your drive, there is also no information about where you took it from, or am I missing something?

The old/evaluation has been deleted, I somehow missed it.

After you removed old/evaluation, most files are marked as renamed (not new), but it's still not clear to me whether we need everything, or how the names with old should be understood. Overall, a README for benchmark, and a review of what you want to keep (and how it could be used), is needed.
