ContextAwareJoin

📄 Paper | 🔥 Getting Started | 🗃️ Datasets | ✋ License | 📜 Citation

This repo contains the code for "Evaluating Joinable Column Discovery Approaches for Context-Aware Search".

🔥 Getting Started

  1. Clone the repo and change into it
git clone https://github.com/IBM/ContextAwareJoin.git
cd ContextAwareJoin
  2. Create a new Python virtual environment and activate it
python3.11 -m venv pyenv
source pyenv/bin/activate
  3. Install the repo in editable mode
pip install -e .
  4. To run indexing, search, and evaluation, use main.py.
$ python main.py --help
usage: main.py [-h] [--method {exact_match,warpgate,topjoin,deepjoin,LSH_ensemble}] [--benchmark BENCHMARK] --groundtruth-filepath GROUNDTRUTH_FILEPATH [--datalake-dir DATALAKE_DIR] [--file-format {.csv,.df,.parquet}]
               [--metadata-dir METADATA_DIR] [--metadata-suffix METADATA_SUFFIX] [--model MODEL] [--embedding-indexer {NN,NN_HAMMING,LSH_FOREST}] [--minhash-indexer {NN,NN_HAMMING,LSH_FOREST}] [--top-k TOP_K]
               [--candidate-k CANDIDATE_K] [--start-k START_K] [--max-k MAX_K] [--topjoin-config TOPJOIN_CONFIG] [--lower-better] [--warpgate-encoder {webtable,fasttext}] [--fasttext-path FASTTEXT_PATH]

options:
  -h, --help            show this help message and exit
  --method {exact_match,warpgate,topjoin,deepjoin,LSH_ensemble}
                        Method
  --benchmark BENCHMARK
                        Name of the benchmark
  --groundtruth-filepath GROUNDTRUTH_FILEPATH
                        Jsonl groundtruth file
  --datalake-dir DATALAKE_DIR
                        Path to directory with the data lake
  --file-format {.csv,.df,.parquet}
                        File Format
  --metadata-dir METADATA_DIR
                        Directory containing metadata json for each table
  --metadata-suffix METADATA_SUFFIX
                        Suffix used for each metadata file eg ".CSV.json" or ".json" or ".meta".
  --model MODEL         Path (or ID) To Sentence Transformer Model to be used for Embedding
  --embedding-indexer {NN,NN_HAMMING,LSH_FOREST}
  --minhash-indexer {NN,NN_HAMMING,LSH_FOREST}
  --top-k TOP_K         K
  --candidate-k CANDIDATE_K
                        Candidate K
  --start-k START_K     Maximum K
  --max-k MAX_K         Maximum K
  --topjoin-config TOPJOIN_CONFIG
                        Path To TopJoin Config (Required when method is `topjoin`)
  --lower-better        Set to True if ground truth scores are ranking or distance (lower is better) and False for similarity scores (higher better). Default: False
  --warpgate-encoder {webtable,fasttext}
                        Encoder for Warpgate only
  --fasttext-path FASTTEXT_PATH
                        Path to fasttext embedding file; Required when method is `warpgate`)
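
The help text above marks some flags as required only for certain methods. The following is a minimal sketch of a pre-flight check based purely on that help text; the helper name `missing_required_flags` is hypothetical and not part of the repo, and `main.py` may apply different or additional validation.

```python
from typing import List, Optional


def missing_required_flags(
    method: str,
    groundtruth_filepath: Optional[str] = None,
    topjoin_config: Optional[str] = None,
    fasttext_path: Optional[str] = None,
) -> List[str]:
    """Return the flags that the help text says are required but were not given.

    Hypothetical helper: it mirrors the wording of `python main.py --help`,
    not the actual validation logic inside main.py.
    """
    missing = []
    # --groundtruth-filepath appears without brackets in the usage line,
    # so it is required for every method.
    if not groundtruth_filepath:
        missing.append("--groundtruth-filepath")
    # "Required when method is `topjoin`"
    if method == "topjoin" and not topjoin_config:
        missing.append("--topjoin-config")
    # "Required when method is `warpgate`"
    if method == "warpgate" and not fasttext_path:
        missing.append("--fasttext-path")
    return missing


if __name__ == "__main__":
    print(missing_required_flags("warpgate", groundtruth_filepath="gt.jsonl"))
```

Running such a check before a long indexing job can surface a missing config path early, instead of after the data lake has been loaded.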

Example Usage:

BENCHMARK=go_sales
DATALAKE_DIR=./datasets/gosales/datalake
FILE_FORMAT=.df
GT_FILE=./datasets/gosales/gt.jsonl
python main.py --benchmark ${BENCHMARK} --datalake-dir ${DATALAKE_DIR} --file-format ${FILE_FORMAT}  --groundtruth-filepath ${GT_FILE}  --method LSH_ensemble
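
To compare several methods on the same benchmark, the command above can be assembled programmatically. This is a sketch only: `build_command` is a hypothetical helper, the flags are taken from the help output, and it is an assumption that methods other than `LSH_ensemble` run with just these arguments (the help text indicates `topjoin` and `warpgate` need extra flags).

```python
import shlex
from typing import List

# Method names copied from the --method choices in the help output.
METHODS = ["exact_match", "warpgate", "topjoin", "deepjoin", "LSH_ensemble"]


def build_command(
    method: str,
    benchmark: str,
    datalake_dir: str,
    file_format: str,
    gt_file: str,
) -> List[str]:
    """Build the argument list for one main.py run (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return [
        "python", "main.py",
        "--benchmark", benchmark,
        "--datalake-dir", datalake_dir,
        "--file-format", file_format,
        "--groundtruth-filepath", gt_file,
        "--method", method,
    ]


if __name__ == "__main__":
    # Print the commands rather than executing them; pass each list to
    # subprocess.run(cmd, check=True) from the repo root to actually run it.
    for method in ("exact_match", "LSH_ensemble"):
        cmd = build_command(
            method,
            benchmark="go_sales",
            datalake_dir="./datasets/gosales/datalake",
            file_format=".df",
            gt_file="./datasets/gosales/gt.jsonl",
        )
        print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems when paths contain spaces.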

🗃️ Datasets

All the datasets used in the paper are publicly available, except the CIO dataset. Read more about the datasets here

✋ License

Important

This code is released under the CC BY-NC-ND 4.0 license. In addition, please pay attention to the public disclosure below.

You are free to copy, modify and distribute this code only for the purpose of comparing this code to other code for scientific experimental purposes, where that distribution is not for a fee, nor does it accompany anything for which a fee is charged.

All content in these repositories including code has been provided by IBM under the associated restrictive-use software license and IBM is under no obligation to provide enhancements, updates, or support. IBM developers produced this code as a computer science project (not as an IBM product), and IBM makes no assertions as to the level of quality nor security, and will not be maintaining this code going forward.

📜 Citation

@inproceedings{kokel2025topjoin,
  title={TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search},
  author={Harsha Kokel and Aamod Khatiwada and Tejaswini Pedapati and Haritha Ananthakrishnan and Oktie Hassanzadeh and Horst Samulowitz and Kavitha Srinivas},
  booktitle={{VLDB} Workshops},
  year={2025}
}
