This repository contains code to build hierarchical semantic trees from text datasets. The code works in conjunction with MNHN-Tree-Tools.
If you make use of the code and/or datasets herein, please cite: Semantic Tree Inference on Text Corpora using a Nested Density Approach together with Large Language Model Embeddings.
Install the tools locally as follows.
```
sudo apt-get install git build-essential libpng-dev libsdl2-dev \
  liblapack-dev libopenmpi-dev libpocl-dev ocl-icd-opencl-dev pocl-opencl-icd
```
```
git clone https://github.com/haschka/mnhn-tree-tools
cd mnhn-tree-tools
mkdir bin
make all
cd ..
```

```
git clone https://github.com/haschka/semantic-trees/
cd semantic-trees
mkdir bin
make all
cd ..
```

To simplify access to the compiled binaries, you may export the paths:
```
cd mnhn-tree-tools/bin
export PATH=$PATH:$PWD
cd ../../semantic-trees/bin
export PATH=$PATH:$PWD
```

You should have a dataset of texts formatted as follows:
- The data must be in JSONL format.
- Each document must be stored on a single line.
- The text content must be provided under the key `resumes`.
A valid dataset therefore looks like:
{ "resumes": "This is a text about something..." }
{ "resumes": "This is a text about something else..." }To build a vector database, you must host an embedding model. This can be
To build a vector database, you must host an embedding model. This can be achieved either by running llama.cpp or by using one of the provided Python scripts to launch an embedding server:
- `SFR.py` (SFR-Embedding-Mistral)
- `qwen3-embed-8b.py` (Qwen3-Embedding-8B)
Start the server with, for example:
```
python SFR.py
```

or

```
python qwen3-embed-8b.py
```
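Before building the database, you can optionally check that the server answers embedding requests. The sketch below assumes an llama.cpp-style `/embedding` endpoint listening on port 8081; the provided Python scripts may expose a different route or payload, so adjust accordingly:

```python
import json
import urllib.request

# Assumed llama.cpp-style endpoint; adjust if the provided scripts differ.
url = "http://127.0.0.1:8081/embedding"
payload = json.dumps({"content": "A short test sentence."}).encode("utf-8")

req = urllib.request.Request(
    url, data=payload, headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# A working server should return an embedding vector for the test sentence.
print(str(result)[:200])
```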
Once the embedding server is running, build the vector database:

```
build-theses-vector-db-from-server textdatabase.jsonl 127.0.0.1 8081 vectordatabase.vdb
```

For compatibility with MNHN-Tree-Tools, it is convenient to create a pseudo-FASTA file.
First, determine the number of records in the vector database:
```
show-vdb-details vectordatabase.vdb
```

Then generate the pseudo-FASTA file (replace N with the number of records):
```
for ((i=0;i<N;i++)); do
  echo ">seq_$i" >> /tmp/pseudofasta
  echo "ACGT" >> /tmp/pseudofasta
done
mv /tmp/pseudofasta .
```
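Alternatively, if each record in the vector database corresponds to one line of the input JSONL (an assumption worth verifying against the output of `show-vdb-details`), a short sketch can generate the pseudo-FASTA without copying N by hand:

```python
# Count the documents in the JSONL file and emit one dummy FASTA record per
# document. Assumption: the vector database holds exactly one record per
# line of textdatabase.jsonl.
with open("textdatabase.jsonl", "r", encoding="utf-8") as f:
    n_records = sum(1 for line in f if line.strip())

with open("pseudofasta", "w", encoding="utf-8") as out:
    for i in range(n_records):
        out.write(f">seq_{i}\nACGT\n")
```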
To perform PCA on the vectors stored in the database, use the `pca-from-vdb` tool:

```
pca-from-vdb vectordatabase.vdb pca-10-dim pca-ev 10 20
```

Where:
- `pca-10-dim` stores the projections onto the first 10 principal components,
- `pca-ev` stores the eigenvalues of the covariance matrix,
- `10` is the number of principal components,
- `20` is the number of available threads.
The PCA output can later be downscaled. For example, to reduce to two dimensions:
```
awk '{print $1"\t"$2}' pca-10-dim > pca-2-dim
```

As an initial configuration, we use:
- ε = 0.01
- Δε = 0.00001
- `minpts` = 5
Finding suitable parameters can be challenging. In practice, identifying a configuration that produces many clusters with approximately 30% coverage is a good starting point.
Run adaptive clustering as follows:
```
mkdir layers
adaptive_cluster_PCA pseudofasta 0.01 0.000001 5 layers/L 20 2 pca-2-dim > logfile.log
```

Where:
- `layers/L` stores the clustering (split-sets) for each tree layer,
- `20` is the number of threads,
- `2` is the dimensionality of the PCA file.
The file `logfile.log` reports:
- the number of clusters per layer,
- the dataset coverage,
- the parent–child relationships between clusters across layers.
To visualize the tree, first generate a Newick file:
```
split_sets_to_newick 0 0 layers/L* > tree.dnd
```

Then visualize the tree using Newick Utilities:
```
nw_display -sr -w 800 -i 'opacity:0' -l 'opacity:0' -b 'opacity:0' tree.dnd > tree.svg
```

Where:
- `-i` controls internal node annotations,
- `-l` controls leaf annotations,
- `-b` controls branch-length annotations (branch lengths are uniform in our trees).
Removing `opacity:0` makes the corresponding annotations visible. Increase `-w` to avoid label overlap.
In this tutorial, internal nodes are labeled using the format
`LXXCYYNZZ`, where:

- `XX` is the layer number,
- `YY` is the cluster number within that layer,
- `ZZ` is the number of texts in the node.
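If you post-process the tree programmatically, such labels can be split back into their components; a minimal sketch (the label value is illustrative):

```python
import re

# Parse labels of the form L<layer>C<cluster>N<count>, e.g. "L4C31N120".
label = "L4C31N120"  # illustrative value
match = re.fullmatch(r"L(\d+)C(\d+)N(\d+)", label)
if match:
    layer, cluster, n_texts = (int(g) for g in match.groups())
    print(f"layer {layer}, cluster {cluster}, {n_texts} texts")
```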
Each tree node corresponds to a set of documents. To retrieve these, first generate cluster index files:
```
mkdir clusters
for i in layers/L*; do
  split-set-to-indices $i clusters/$(basename $i)-C
done
```

The `clusters` directory will contain files such as `L0004-C-000031`,
where:
- `0004` is the layer number,
- `000031` is the cluster number.
Each file lists the line numbers of the corresponding documents in the original JSONL file.
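If you prefer to read the texts directly from the original JSONL file, a sketch along these lines works; we assume the index files contain one 0-based line number per line, which should be checked against your data. The repository's `print-vdb-texts` tool, shown next, retrieves the same texts from the vector database instead.

```python
import json

# Illustrative paths; adjust to your layout.
index_file = "clusters/L0004-C-000031"
jsonl_file = "textdatabase.jsonl"

# Assumption: one 0-based line number per line in the index file.
with open(index_file, "r", encoding="utf-8") as f:
    wanted = {int(line) for line in f if line.strip()}

with open(jsonl_file, "r", encoding="utf-8") as f:
    for lineno, line in enumerate(f):
        if lineno in wanted:
            print(json.loads(line)["resumes"])
```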
To extract and aggregate the texts for each node:
```
mkdir texts
cd clusters
for i in *; do
  cat $i | print-vdb-texts ../vectordatabase.vdb > ../texts/$i
done
cd ..
```

The resulting files can be used for downstream processing, such as LLM-based annotation of tree nodes.
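As an illustration of such downstream processing, the sketch below sends one node's aggregated text to an OpenAI-compatible chat endpoint and asks for a short topical label. The endpoint URL, model name, and environment variables are assumptions made for this example, not part of the repository:

```python
import json
import os
import urllib.request

# Illustrative settings: any OpenAI-compatible chat endpoint should work.
API_URL = os.environ.get("LLM_API_URL", "http://127.0.0.1:8080/v1/chat/completions")
API_KEY = os.environ.get("LLM_API_KEY", "")

# Read the aggregated texts of one tree node (path is illustrative).
with open("texts/L0004-C-000031", "r", encoding="utf-8") as f:
    node_text = f.read()[:8000]  # truncate to keep the prompt small

payload = {
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You label document clusters."},
        {"role": "user",
         "content": "Give a short topical label for these documents:\n\n" + node_text},
    ],
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": f"Bearer {API_KEY}"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["choices"][0]["message"]["content"])
```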