Query Sources

title	link	note
LitSearch	https://aclanthology.org/2024.emnlp-main.840/	Closest explicit literature-search query benchmark; for our task we reuse the query text and replace `paper` / `studies` with `models`.
LitSearch query subset	https://huggingface.co/datasets/yale-nlp/LitSearch-NLP-Class/viewer/query?row=6	Query-only subset we can reuse directly as recommendation-style query data.
SPRD / Scholarly Paper Recommendation Dataset	https://link.springer.com/article/10.1007/s00799-022-00339-w	Manual relevance judgments for scholarly paper recommendation; good for evaluation, not a natural-language query benchmark.
CiteULike	https://link.springer.com/article/10.1007/s10115-023-01901-x	User-item interaction data for scholarly papers; good for personalization, not a query benchmark.
RARD / Mr. DLib	https://mr-dlib.org/blog/2017/06/12/rard-the-related-article-recommendation-dataset/	Real recommendation logs and click feedback from a related-article service; closer to recommendation behavior than query synthesis.
unarXive	https://link.springer.com/article/10.1007/s11192-020-03382-z	Full-text papers with citation contexts; useful raw material for citation-recommendation or query synthesis.
unarXive open subset	https://zenodo.org/records/7752615	Open subset of unarXive with structured full text and citation network.

Download

Download the LitSearch query subset query and save it locally as JSONL:

python -c "from datasets import load_dataset; ds = load_dataset('yale-nlp/LitSearch-NLP-Class', split='query'); ds.to_json('others/query/litsearch_query.jsonl')"

Analysis on groundtruth for extracting model from paper corpusid

Extract unique corpusids into a plain text file:

python -m src.query.extract_corpusids_to_txt \
  --input others/query/litsearch_query.jsonl \
  --output data_251117/query/new_corpusids.txt

Test whether we could extract hf links from full text, if so we could infer model recommendation from paper recommendation

# go to z6dong@watgpu:~/shared_data/se_s2orc_250218, where we store se_s2orc_corpus data
PYTHONNOUSERSITE=1 python extract_corpus_hf_links.py \
  --ids_file new_corpusids.txt \
  --db_path paper_index_mini.db \
  --data_directory ./ \
  --output_parquet corpus_hf_links.parquet \
  --keep_full_text \
  --full_text_dir fulltexts \
  --num_workers 8

Analysis on corpus_hf_links.parquet

python3 -m src.query.stats_litsearch_corpus_links \
  --parquet corpus_hf_links.parquet \
  --query_jsonl others/query/litsearch_query.jsonl \
  --plot_path tmp/corpus_hf_links_funnel.png

Query substitution

python -m src.query.batch_query_rewrite build \
  --input others/query/litsearch_query.jsonl \
  --output data_251117/query/query_rewrite_batch_input.jsonl \
  --model gpt-4o-mini

# Submit batch input and download the output:
python -m src.llm.batch \
  data_251117/query/query_rewrite_batch_input.jsonl \
  data_251117/query/query_rewrite_batch_output.jsonl

# Parse the batch output into a clean rewrite file:
python -m src.query.batch_query_rewrite parse \
  --input data_251117/query/query_rewrite_batch_output.jsonl \
  --output data_251117/query/query_rewrite_polished.jsonl

python3 -m src.query.test_query_rewrite_compare --limit 5

Query labeling

Label queries in 100-query chunks with one prompt per chunk:

python -m src.query.query_label_once \
  --input data_251117/query/query_rewrite_batch_output.jsonl \
  --output data_251117/query/query_label_once.jsonl \
  --scheme six \
  --start 0 \
  --chunk-size 100

The default six labels are evidence-based, comparison, experience, reason, instruction, and debate. The output file includes label, reason, and an estimated input token count per chunk. Plot the label distribution with the same mini bar-chart style:

python -m src.query.stats_query_label_once \
  --input data_251117/query/query_label_once.jsonl \
  --plot_path data_251117/query/query_label_once_distribution.png \
  --scheme six

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query Sources

Download

Analysis on groundtruth for extracting model from paper corpusid

Query substitution

Query labeling

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Query Sources

Download

Analysis on groundtruth for extracting model from paper corpusid

Query substitution

Query labeling