ContextAwareJoin

This repo contains the code for Evaluating Joinable Column Discovery Approaches for Context-Aware Search!

🔥 Getting Started

Clone the repo

git clone https://github.com/IBM/ContextAwareJoin.git

Create a new Python virtual environment and activate it

python3.11 -m venv pyenv
source pyenv/bin/activate

Install the repo

pip install -e .

To run indexing, search, and evaluation, use main.py.

$ python main.py --help
usage: main.py [-h] [--method {exact_match,warpgate,topjoin,deepjoin,LSH_ensemble}] [--benchmark BENCHMARK] --groundtruth-filepath GROUNDTRUTH_FILEPATH [--datalake-dir DATALAKE_DIR] [--file-format {.csv,.df,.parquet}]
               [--metadata-dir METADATA_DIR] [--metadata-suffix METADATA_SUFFIX] [--model MODEL] [--embedding-indexer {NN,NN_HAMMING,LSH_FOREST}] [--minhash-indexer {NN,NN_HAMMING,LSH_FOREST}] [--top-k TOP_K]
               [--candidate-k CANDIDATE_K] [--start-k START_K] [--max-k MAX_K] [--topjoin-config TOPJOIN_CONFIG] [--lower-better] [--warpgate-encoder {webtable,fasttext}] [--fasttext-path FASTTEXT_PATH]

options:
  -h, --help            show this help message and exit
  --method {exact_match,warpgate,topjoin,deepjoin,LSH_ensemble}
                        Method
  --benchmark BENCHMARK
                        Name of the benchmark
  --groundtruth-filepath GROUNDTRUTH_FILEPATH
                        Jsonl groundtruth file
  --datalake-dir DATALAKE_DIR
                        Path to directory with the data lake
  --file-format {.csv,.df,.parquet}
                        File Format
  --metadata-dir METADATA_DIR
                        Directory containing metadata json for each table
  --metadata-suffix METADATA_SUFFIX
                        Suffix used for each metadata file eg ".CSV.json" or ".json" or ".meta".
  --model MODEL         Path (or ID) To Sentence Transformer Model to be used for Embedding
  --embedding-indexer {NN,NN_HAMMING,LSH_FOREST}
  --minhash-indexer {NN,NN_HAMMING,LSH_FOREST}
  --top-k TOP_K         K
  --candidate-k CANDIDATE_K
                        Candidate K
  --start-k START_K     Maximum K
  --max-k MAX_K         Maximum K
  --topjoin-config TOPJOIN_CONFIG
                        Path To TopJoin Config (Required when method is `topjoin`)
  --lower-better        Set to True if ground truth scores are ranking or distance (lower is better) and False for similarity scores (higher better). Default: False
  --warpgate-encoder {webtable,fasttext}
                        Encoder for Warpgate only
  --fasttext-path FASTTEXT_PATH
                        Path to fasttext embedding file; Required when method is `warpgate`)
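
The `--lower-better` flag changes how ground-truth scores are ordered. The sketch below illustrates the idea only. It assumes the ground truth can be viewed as a mapping from candidate column to score, and the `rank_candidates` helper and the column names are hypothetical, not part of this repo.

```python
def rank_candidates(scores, lower_better=False):
    """Order candidate columns from best to worst by ground-truth score.

    Hypothetical helper: `scores` maps a candidate column name to its score.
    """
    # Distances and ranks: smaller is better, so sort ascending.
    # Similarity scores: larger is better, so sort descending.
    return sorted(scores, key=scores.get, reverse=not lower_better)


# Made-up scores for three made-up candidate columns.
scores = {"orders.customer_id": 0.9, "orders.product_id": 0.2, "staff.id": 0.5}

# Treated as similarities (default): highest score first.
print(rank_candidates(scores))

# Treated as distances (as with --lower-better): lowest score first.
print(rank_candidates(scores, lower_better=True))
```

The same scores produce opposite orderings, so passing the wrong setting for a benchmark would invert the relevance ranking used in evaluation.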

Example Usage:

BENCHMARK=go_sales
DATALAKE_DIR=./datasets/gosales/datalake
FILE_FORMAT=.df
GT_FILE=./datasets/gosales/gt.jsonl
python main.py --benchmark ${BENCHMARK} --datalake-dir ${DATALAKE_DIR} --file-format ${FILE_FORMAT}  --groundtruth-filepath ${GT_FILE}  --method LSH_ensemble
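
To compare several methods on the same benchmark, the command above can be generated in a loop. The sketch below only builds and prints the commands, using flags from the help output. The `build_command` helper is hypothetical, and it is an assumption that `exact_match` runs with the same flags as `LSH_ensemble`. Other methods may need extra flags such as `--topjoin-config` or `--fasttext-path`.

```python
import shlex


def build_command(method, benchmark, datalake_dir, file_format, gt_file):
    """Build the main.py argument list for one method (hypothetical helper)."""
    return [
        "python", "main.py",
        "--benchmark", benchmark,
        "--datalake-dir", datalake_dir,
        "--file-format", file_format,
        "--groundtruth-filepath", gt_file,
        "--method", method,
    ]


# Values taken from the example usage above.
settings = {
    "benchmark": "go_sales",
    "datalake_dir": "./datasets/gosales/datalake",
    "file_format": ".df",
    "gt_file": "./datasets/gosales/gt.jsonl",
}

# Assumption: these two methods need no method-specific flags.
for method in ["exact_match", "LSH_ensemble"]:
    cmd = build_command(method, **settings)
    # Print a shell-ready command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```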

🗃️ Datasets

All the datasets used in the paper are publicly available, except the CIO dataset. Read more about the datasets here

✋ License

Important

This code is released under the CC BY-NC-ND 4.0 license. In addition, please note the public disclosure below.

You are free to copy, modify and distribute this code only for the purpose of comparing this code to other code for scientific experimental purposes, where that distribution is not for a fee, nor does it accompany anything for which a fee is charged.

All content in these repositories including code has been provided by IBM under the associated restrictive-use software license and IBM is under no obligation to provide enhancements, updates, or support. IBM developers produced this code as a computer science project (not as an IBM product), and IBM makes no assertions as to the level of quality nor security, and will not be maintaining this code going forward.

📜 Citation

@inproceedings{kokel2025topjoin,
  title={TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search},
  author={Harsha Kokel and Aamod Khatiwada and Tejaswini Pedapati and Haritha Ananthakrishnan and Oktie Hassanzadeh and Horst Samulowitz and Kavitha Srinivas},
  booktitle={{VLDB} Workshops},
  year={2025}
}
