THETA (θ) is an open-source, research-oriented platform for LLM-enhanced topic analysis in social science. It combines:
- Domain-adaptive document embeddings from Qwen-3 models (0.6B/4B/8B)
  - Zero-shot embedding (no training), or
  - Supervised/unsupervised fine-tuning modes
- Generative topic models with 12 baseline models for comparison:
  - THETA: Main model using Qwen embeddings (0.6B/4B/8B)
  - Traditional: LDA, HDP (auto topics), STM (requires covariates), BTM (short texts)
  - Neural: ETM, CTM, DTM (time-aware), NVDM, GSM, ProdLDA, BERTopic
- Scientific validation via 7 intrinsic metrics (PPL, TD, iRBO, NPMI, C_V, UMass, Exclusivity)
- Comprehensive visualization with bilingual support (English/Chinese)
THETA aims to move topic modeling from "clustering with pretty plots" to a reproducible, validated scientific workflow.
- Hybrid embedding topic analysis: Zero-shot / Supervised / Unsupervised modes
- Multiple Qwen model sizes: 0.6B (1024-dim), 4B (2560-dim), 8B (4096-dim)
- 12 Baseline models: LDA, HDP, STM (requires covariates), BTM, ETM, CTM, DTM, NVDM, GSM, ProdLDA, BERTopic for comparison
- Data governance: Domain-aware cleaning for multiple languages (English, Chinese, German, Spanish)
- Unified evaluation: 7 metrics with JSON/CSV export
- Rich visualization: 20+ chart types with bilingual labels
| Model | Type | Description | Auto Topics | Best For |
|---|---|---|---|---|
| `theta` | Neural | THETA with Qwen embeddings (0.6B/4B/8B) | No | General purpose, high quality |
| `lda` | Traditional | Latent Dirichlet Allocation (sklearn) | No | Fast baseline, interpretable |
| `hdp` | Traditional | Hierarchical Dirichlet Process | Yes | Unknown topic count |
| `stm` | Traditional | Structural Topic Model | No | Requires covariates (metadata) |
| `btm` | Traditional | Biterm Topic Model | No | Short texts (tweets, titles) |
| `etm` | Neural | Embedded Topic Model (Word2Vec + VAE) | No | Word embedding integration |
| `ctm` | Neural | Contextualized Topic Model (SBERT + VAE) | No | Semantic understanding |
| `dtm` | Neural | Dynamic Topic Model | No | Time-series analysis |
| `nvdm` | Neural | Neural Variational Document Model | No | VAE-based baseline |
| `gsm` | Neural | Gaussian Softmax Model | No | Better topic separation |
| `prodlda` | Neural | Product of Experts LDA | No | State-of-the-art neural LDA |
| `bertopic` | Neural | BERT-based topic modeling | Yes | Clustering-based topics |
Choose your model based on:
┌─────────────────────────────────────────────────────────────────┐
│ Do you know the number of topics? │
│ ├─ NO → Use HDP or BERTopic (auto-detect topics) │
│ └─ YES → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ What is your text length? │
│ ├─ SHORT (tweets, titles) → Use BTM │
│ └─ NORMAL/LONG → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ Do you have document-level metadata (covariates)? │
│ ├─ YES → Use STM (models how metadata affects topics) │
│ └─ NO → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ Do you have time-series data? │
│ ├─ YES → Use DTM │
│ └─ NO → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ What's your priority? │
│ ├─ SPEED → Use LDA (fastest) │
│ ├─ QUALITY → Use THETA (best with Qwen embeddings) │
│ └─ COMPARISON → Use multiple: lda,nvdm,prodlda,theta │
└─────────────────────────────────────────────────────────────────┘
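The decision tree above can be condensed into a small helper. `choose_model` is a hypothetical illustration for readers, not a function in the THETA codebase:

```python
def choose_model(knows_k: bool, short_text: bool = False,
                 has_covariates: bool = False, has_time: bool = False,
                 priority: str = "quality") -> str:
    """Mirror the decision tree: return a model name for --models."""
    if not knows_k:
        return "hdp"          # or "bertopic": both auto-detect the topic count
    if short_text:
        return "btm"          # biterm model for tweets, titles, etc.
    if has_covariates:
        return "stm"          # models how metadata affects topics
    if has_time:
        return "dtm"          # time-aware topic evolution
    return "lda" if priority == "speed" else "theta"

print(choose_model(knows_k=False))                   # hdp
print(choose_model(knows_k=True, has_time=True))     # dtm
print(choose_model(knows_k=True, priority="speed"))  # lda
```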
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--model_size` | str | 0.6B | 0.6B, 4B, 8B | Qwen model size |
| `--mode` | str | zero_shot | zero_shot, supervised, unsupervised | Embedding mode |
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 512 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Encoder dropout rate |
| `--kl_start` | float | 0.0 | 0–1 | KL annealing start weight |
| `--kl_end` | float | 1.0 | 0–1 | KL annealing end weight |
| `--kl_warmup` | int | 50 | 0–epochs | KL warmup epochs |
| `--patience` | int | 10 | 1–50 | Early stopping patience |
| `--language` | str | en | en, zh | Visualization language |
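A plausible reading of `--kl_start`, `--kl_end`, and `--kl_warmup` is a linear annealing schedule for the KL term of the VAE loss. The sketch below assumes linear interpolation, which may not match the actual implementation:

```python
def kl_weight(epoch: int, kl_start: float = 0.0, kl_end: float = 1.0,
              kl_warmup: int = 50) -> float:
    """Linearly anneal the KL weight from kl_start to kl_end over kl_warmup epochs."""
    if kl_warmup <= 0 or epoch >= kl_warmup:
        return kl_end
    return kl_start + (kl_end - kl_start) * epoch / kl_warmup

print(kl_weight(0))    # 0.0 -- training starts with no KL pressure
print(kl_weight(25))   # 0.5 -- halfway through warmup
print(kl_weight(100))  # 1.0 -- full KL weight after warmup
```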
LDA
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--max_iter` | int | 100 | 10–500 | Maximum EM iterations |
| `--alpha` | float | auto (1/K) | >0 | Document-topic Dirichlet prior |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
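Since the `lda` baseline wraps scikit-learn, a minimal stand-alone sketch with the defaults above may help orient readers. The toy `docs` and exact wiring are illustrative, not THETA's actual code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: at least 5 independent documents, as THETA requires.
docs = ["topic models find latent themes",
        "neural networks learn representations",
        "latent themes structure large corpora",
        "representations power neural topic models",
        "corpora of text hide latent structure"]

vec = CountVectorizer()
bow = vec.fit_transform(docs)

K = 3
lda = LatentDirichletAllocation(n_components=K, max_iter=100,
                                doc_topic_prior=1 / K)  # alpha = 1/K
theta = lda.fit_transform(bow)   # document-topic distribution (N x K)
beta = lda.components_           # topic-word weights (K x V)
print(theta.shape, beta.shape)
```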
HDP
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--max_topics` | int | 150 | 50–300 | Upper bound on number of topics |
| `--alpha` | float | 1.0 | >0 | Document-level concentration parameter |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
STM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--max_iter` | int | 100 | 10–500 | Maximum EM iterations |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
BTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--n_iter` | int | 100 | 10–500 | Gibbs sampling iterations |
| `--alpha` | float | 1.0 | >0 | Dirichlet prior for topic distribution |
| `--beta` | float | 0.01 | >0 | Dirichlet prior for word distribution |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
ETM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 800 | 128–2048 | Neurons per encoder hidden layer |
| `--embedding_dim` | int | 300 | 50–1024 | Word embedding dimension (Word2Vec) |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.5 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
CTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--inference_type` | str | zeroshot | zeroshot, combined | zeroshot (SBERT only) or combined (SBERT + BOW) |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 100 | 32–1024 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
DTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 512 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
NVDM / GSM / ProdLDA
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 256 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
BERTopic
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | auto | ≥2 or None | Target number of topics; None = automatic |
| `--min_cluster_size` | int | 10 | 2–100 | Minimum cluster size; controls topic granularity |
| `--top_n_words` | int | 10 | 1–30 | Number of words per topic |
| `--n_neighbors` | int | 15 | 2–100 | UMAP n_neighbors; controls local vs. global structure |
| `--n_components` | int | 5 | 2–50 | UMAP output dimensionality before clustering |
/root/
├── ETM/ # Main codebase
│ ├── run_pipeline.py # Unified entry point
│ ├── prepare_data.py # Data preprocessing
│ ├── config.py # Configuration management
│ ├── dataclean/ # Data cleaning module
│ ├── model/ # Model implementations
│ │ ├── theta/ # THETA main model
│ │ ├── baselines/ # 12 baseline models
│ │ └── _reference/ # Reference implementations
│ ├── evaluation/ # Evaluation metrics
│ ├── visualization/ # Visualization tools
│ └── utils/ # Utilities
├── agent/ # Agent system
│ ├── api.py # FastAPI endpoints
│ ├── core/ # Agent implementations
│ ├── config/ # Configuration management
│ ├── prompts/ # Prompt templates
│ ├── utils/ # LLM and vision utilities
│ └── docs/ # API documentation
├── scripts/ # Shell scripts for automation
├── embedding/ # Qwen embedding generation
│ ├── main.py # Embedding generation main codebase
│ ├── embedder.py # Embedding
│ ├── trainer.py # Training (supervised/unsupervised)
│ ├── data_loader.py # Dataloader
- Python 3.10+
- CUDA recommended for GPU acceleration
- Key dependencies:
numpy>=1.20.0
scipy>=1.7.0
torch>=1.10.0
transformers>=4.30.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
gensim>=4.1.0
wordcloud>=1.8.0
pyLDAvis>=3.3.0
jieba>=0.42.0
git clone https://github.com/<YOUR_ORG>/THETA.git
cd THETA
# Install dependencies
pip install -r ETM/requirements.txt
# Or use the setup script
bash scripts/01_setup.sh

# Standard SSH connection template
ssh -p <PORT> root@<SERVER_IP>
# Example for AutoDL
ssh -p 12345 root@connect.westb.seetacloud.com

# Clone the repository
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA
# Copy environment template
cp .env.example .env
# Edit .env file (CRITICAL: Set model paths)
nano .env

🔑 Core Configuration (Must Set):
# .env file - Essential paths
QWEN_MODEL_0_6B=/root/autodl-tmp/models/Qwen3-Embedding-0.6B
QWEN_MODEL_4B=/root/autodl-tmp/models/Qwen3-Embedding-4B # Optional
QWEN_MODEL_8B=/root/autodl-tmp/models/Qwen3-Embedding-8B # Optional
SBERT_MODEL=/root/autodl-tmp/models/paraphrase-multilingual-MiniLM-L12-v2
# Optional: Customize embedding length (default: 512)
# MAX_EMBED_LENGTH=1024

# For Chinese datasets
bash scripts/11_quick_start_chinese.sh
# For English datasets
bash scripts/10_quick_start.sh

THETA implements a three-level parameter priority system, the cornerstone of its configuration design:
┌─────────────────────────────────────────────────────────────────────────┐
│ PARAMETER PRIORITY HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ P1 (HIGHEST) │ Command-line arguments │
│ │ Example: --max_length 1024 --num_topics 30 │
│ │ │
│ P2 (MEDIUM) │ .env configuration file │
│ │ Example: MAX_EMBED_LENGTH=1024 in .env │
│ │ │
│ P3 (LOWEST) │ System hardcoded defaults │
│ │ Example: max_length=512 in code │
│ │ │
└─────────────────────────────────────────────────────────────────────────┘
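The hierarchy above can be sketched as a small resolution helper. `resolve` is hypothetical; the real logic in config.py may differ:

```python
import os

def resolve(cli_value, env_var: str, default):
    """Return the effective value under the P1 > P2 > P3 priority rule."""
    if cli_value is not None:                  # P1: command-line argument
        return cli_value
    if os.environ.get(env_var) is not None:    # P2: .env / environment
        return type(default)(os.environ[env_var])
    return default                             # P3: hardcoded default

os.environ["MAX_EMBED_LENGTH"] = "1024"        # simulate the .env setting
print(resolve(2048, "MAX_EMBED_LENGTH", 512))  # 2048 -- CLI wins
print(resolve(None, "MAX_EMBED_LENGTH", 512))  # 1024 -- .env wins
```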
Example:
# .env has: MAX_EMBED_LENGTH=1024
# Command line: --max_length 2048
# Result: max_length = 2048 (CLI wins!)

ALL models require at least 5 independent documents/samples. This ensures statistical significance.
❌ ERROR: [CRITICAL ERROR] Insufficient data sources (currently 3 files).
THETA requires at least 5 independent documents for statistical significance.
| Model | Input Format | Required Columns | Error if Missing |
|---|---|---|---|
| DTM | .csv only | time / year / date | SchemaError |
| STM | .csv only | At least 1 covariate column (besides text) | SchemaError |
DTM Example CSV:
text,year
"Document content here...",2020
"Another document...",2021

STM Example CSV:
text,party,region
"Policy document...",Democrat,Northeast
"Another policy...",Republican,South

Error Message:
❌ SchemaError: [CRITICAL] DTM/STM requires structured metadata.
Please provide a CSV file with time or covariate columns.
Current columns: ['text']
DTM requires: time column (time/year/date/timestamp)
THETA uses Sliding Window + Mean Pooling to handle documents of ANY length without information loss.
┌─────────────────────────────────────────────────────────────────────────┐
│ Input: 50,000 character document (~100,000 tokens) │
│ max_length: 1024 tokens │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Sliding Window Tokenization (20% overlap) │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Window 1: tokens[0:1022] → Embedding_1 │ │
│ │ Window 2: tokens[818:1840] → Embedding_2 (20% overlap) │ │
│ │ Window 3: tokens[1636:2658] → Embedding_3 │ │
│ │ ... │ │
│ │ Window N: tokens[...end] → Embedding_N │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Step 2: Mean Pooling Aggregation │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Final Embedding = mean(Embedding_1, Embedding_2, ..., Embedding_N)│ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Result: Single 1024-dim vector preserving ALL document information │
└─────────────────────────────────────────────────────────────────────────┘
| Old Approach | THETA Approach |
|---|---|
| `truncation=True` → Loses 90%+ of long documents | Sliding Window → Preserves 100% |
| Fixed 512 tokens max | Dynamic up to 8192 tokens |
| Information loss | Zero information loss |
Console Output:
[Sliding Window] 45/180 texts processed with sliding window
→ max_length=1024, overlap=20%
HARD_LIMIT_TOKEN = 8192  # Cannot exceed even if set higher in .env

This prevents GPU out-of-memory errors regardless of user configuration.
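The sliding-window scheme described above can be sketched as follows. This is a simplification: fake per-token vectors stand in for the Qwen embeddings that the real pipeline computes per window before mean pooling:

```python
import numpy as np

def window_spans(n_tokens: int, max_length: int = 1024, overlap: float = 0.2):
    """Return (start, end) token spans with the stated 20% overlap."""
    step = max(1, int(max_length * (1 - overlap)))
    spans, start = [], 0
    while start < n_tokens:
        spans.append((start, min(start + max_length, n_tokens)))
        if start + max_length >= n_tokens:
            break
        start += step
    return spans

def embed_long(token_vectors: np.ndarray, max_length: int = 1024) -> np.ndarray:
    """Mean-pool one vector per window, then mean-pool across windows."""
    spans = window_spans(len(token_vectors), max_length)
    window_vecs = [token_vectors[s:e].mean(axis=0) for s, e in spans]
    return np.mean(window_vecs, axis=0)

tokens = np.random.rand(2658, 1024)            # stand-in per-token vectors
doc_vec = embed_long(tokens)
print(len(window_spans(2658)), doc_vec.shape)  # 3 (1024,)
```

Regardless of document length, the result is a single fixed-size vector, which is what makes downstream topic modeling uniform across short and long texts.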
If pre-trained embeddings and BOW data are not available locally, download from HuggingFace:
Repository: https://huggingface.co/CodeSoulco/THETA
# Download pre-trained data and LoRA weights
bash scripts/09_download_from_hf.sh
# Or manually using Python
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='CodeSoulco/THETA',
local_dir='/root/autodl-tmp/hf_cache/THETA'
)
"

The HuggingFace repository contains:
- Pre-computed embeddings for benchmark datasets
- BOW matrices and vocabularies
- LoRA fine-tuned weights (optional)
All scripts are non-interactive (pure command-line parameters), suitable for DLC/batch environments. No stdin input required.
| Script | Description |
|---|---|
| `01_setup.sh` | Install dependencies and download data from HuggingFace |
| `02_clean_data.sh` | Clean raw text data (tokenization, stopword removal, lemmatization) |
| `02_generate_embeddings.sh` | Generate Qwen embeddings (sub-script of 03, for failure recovery) |
| `03_prepare_data.sh` | One-stop data preparation: BOW + embeddings for all 12 models |
| `04_train_theta.sh` | Train THETA model (train + evaluate + visualize) |
| `05_train_baseline.sh` | Train 11 baseline models for comparison with THETA |
| `06_visualize.sh` | Generate visualizations for trained models |
| `07_evaluate.sh` | Standalone evaluation with 7 unified metrics |
| `08_compare_models.sh` | Cross-model metric comparison table |
| `09_download_from_hf.sh` | Download pre-trained data from HuggingFace |
| `10_quick_start_english.sh` | Quick start for English datasets |
| `11_quick_start_chinese.sh` | Quick start for Chinese datasets |
| `12_train_multi_gpu.sh` | Multi-GPU training with DistributedDataParallel |
| `13_test_agent.sh` | Test LLM Agent connection and functionality |
| `14_start_agent_api.sh` | Start the Agent API server (FastAPI) |
For full parameter references, usage examples, and end-to-end workflow walkthroughs for each script, see the Shell Scripts Reference.
THETA uses Qwen-3 embedding models with three size options:
| Model Size | Embedding Dim | Use Case |
|---|---|---|
| 0.6B | 1024 | Fast, default |
| 4B | 2560 | Balanced |
| 8B | 4096 | Best quality |
Embedding modes:
- `zero_shot` - Direct embedding without fine-tuning
- `supervised` - Fine-tuned with labeled data
- `unsupervised` - Fine-tuned without labels
# Generate embeddings for a dataset
python prepare_data.py --dataset my_dataset --model theta --model_size 0.6B --mode zero_shot
# Check if embeddings exist
python prepare_data.py --dataset my_dataset --model theta --model_size 4B --check-only

Output artifacts:
- `{dataset}_{mode}_embeddings.npy` - Embedding matrix (N × D)
- `bow_matrix.npz` - Bag-of-words matrix
- `vocab.json` - Vocabulary list
THETA supports multiple topic modeling approaches:
| Model | Description | Time-aware |
|---|---|---|
| THETA | Qwen embedding + ETM | No |
| LDA | Latent Dirichlet Allocation | No |
| ETM | Embedded Topic Model | No |
| CTM | Contextualized Topic Model | No |
| DTM | Dynamic Topic Model | Yes |
Training outputs (organized by ResultManager):
- `model/theta_k{K}.npy` - Document-topic distribution
- `model/beta_k{K}.npy` - Topic-word distribution
- `model/training_history_k{K}.json` - Training history
- `topicwords/topic_words_k{K}.json` - Top words per topic
- `topicwords/topic_evolution_k{K}.json` - Topic evolution (DTM only)
THETA provides unified evaluation with 7 metrics:
| Metric | Description |
|---|---|
| PPL | Perplexity - model fit |
| TD | Topic Diversity |
| iRBO | Inverse Rank-Biased Overlap |
| NPMI | Normalized PMI coherence |
| C_V | C_V coherence |
| UMass | UMass coherence |
| Exclusivity | Topic exclusivity |
from evaluation.unified_evaluator import UnifiedEvaluator
evaluator = UnifiedEvaluator(
beta=beta,
theta=theta,
bow_matrix=bow_matrix,
vocab=vocab,
model_name="dtm",
dataset="edu_data",
num_topics=20
)
metrics = evaluator.evaluate_all()
evaluator.save_results()  # Saves to evaluation/metrics_k20.json and .csv

Evaluation outputs:
- `evaluation/metrics_k{K}.json` - All metrics in JSON format
- `evaluation/metrics_k{K}.csv` - All metrics in CSV format
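As an illustration of one of the seven metrics, Topic Diversity (TD) is commonly computed as the fraction of unique words among each topic's top-N words. This sketch uses the standard definition; THETA's exact variant may differ:

```python
import numpy as np

def topic_diversity(beta: np.ndarray, top_n: int = 25) -> float:
    """TD = unique words in the top-N of all topics / (K * top_n).

    beta: topic-word matrix of shape (K, V). TD = 1.0 means no topic
    shares any top word with another; values near 0 mean heavy overlap.
    """
    top = np.argsort(beta, axis=1)[:, -top_n:]        # top-N word ids per topic
    return len(np.unique(top)) / (beta.shape[0] * top_n)

# 4 topics over 100 words with fully disjoint top words -> TD = 1.0
beta = np.eye(4).repeat(25, axis=1)
print(topic_diversity(beta))  # 1.0
```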
THETA provides comprehensive visualization with bilingual support (English/Chinese):
# Generate visualizations after training
python run_pipeline.py --dataset edu_data --models dtm --skip-train --language en
# Or use visualization module directly
python -c "
from visualization.run_visualization import run_baseline_visualization
run_baseline_visualization(
result_dir='/root/autodl-tmp/result/baseline',
dataset='edu_data',
model='dtm',
num_topics=20,
language='zh'
)
"

Generated charts (20+ types):
- Topic word bars, word clouds, topic similarity heatmap
- Document clustering (UMAP), topic network graph
- Topic evolution (DTM), sankey diagrams
- Training convergence, coherence metrics
- pyLDAvis interactive HTML
Output structure:
visualization_k{K}_{lang}_{timestamp}/
├── global/ # Global charts
│ ├── topic_table.png
│ ├── topic_network.png
│ ├── clustering_heatmap.png
│ ├── topic_wordclouds.png
│ └── ...
├── topics/ # Per-topic charts
│ ├── topic_0/
│ ├── topic_1/
│ └── ...
└── README.md # Summary report
All results are organized using ResultManager:
/root/autodl-tmp/result/baseline/{dataset}/{model}/
├── bow/ # BOW data and vocabulary
│ ├── bow_matrix.npz
│ ├── vocab.json
│ └── vocab.txt
├── model/ # Model parameters
│ ├── theta_k{K}.npy
│ ├── beta_k{K}.npy
│ └── training_history_k{K}.json
├── evaluation/ # Evaluation results
│ ├── metrics_k{K}.json
│ └── metrics_k{K}.csv
├── topicwords/ # Topic words
│ ├── topic_words_k{K}.json
│ └── topic_evolution_k{K}.json
└── visualization_k{K}_{lang}_{timestamp}/
Using ResultManager:
from utils.result_manager import ResultManager
# Initialize
manager = ResultManager(
result_dir='/root/autodl-tmp/result/baseline',
dataset='edu_data',
model='dtm',
num_topics=20
)
# Save all results
manager.save_all(theta, beta, vocab, topic_words, metrics=metrics)
# Load all results
data = manager.load_all(num_topics=20)
# Migrate old flat structure to new structure
from utils.result_manager import migrate_baseline_results
migrate_baseline_results(dataset='edu_data', model='dtm')

Dataset configurations are defined in config.py:
DATASET_CONFIGS = {
"socialTwitter": {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 5,
"language": "multi",
},
"hatespeech": {
"vocab_size": 8000,
"num_topics": 20,
"min_doc_freq": 10,
"language": "english",
},
"edu_data": {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 3,
"language": "chinese",
"has_timestamp": True,
},
}

Command-line parameters:
| Parameter | Description | Default |
|---|---|---|
| `--dataset` | Dataset name | Required |
| `--models` | Model list (comma-separated) | Required |
| `--model_size` | Qwen model size (THETA) | 0.6B |
| `--mode` | THETA mode | zero_shot |
| `--num_topics` | Number of topics | 20 |
| `--epochs` | Training epochs | 100 |
| `--batch_size` | Batch size | 64 |
| `--language` | Visualization language | en |
| `--skip-train` | Skip training | False |
| `--skip-eval` | Skip evaluation | False |
| `--skip-viz` | Skip visualization | False |
| Dataset | Documents | Language | Time-aware |
|---|---|---|---|
| socialTwitter | ~40K | Spanish/English | No |
| hatespeech | ~437K | English | No |
| mental_health | ~1M | English | No |
| FCPB | ~854K | English | No |
| germanCoal | ~9K | German | No |
| edu_data | ~857 | Chinese | Yes |
Apache-2.0
Contributions are welcome:
- New dataset adapters
- Topic visualization modules
- Evaluation and reproducibility scripts
- Documentation improvements
Suggested workflow:
- Fork the repo and create a feature branch
- Add a minimal reproducible example or tests
- Open a pull request
This project analyzes social text and may involve sensitive content.
- Do not include personally identifiable information (PII)
- Ensure dataset usage complies with platform terms and research ethics
- Interpret outputs cautiously; topic discovery is not a substitute for scientific conclusions
- Be responsible with sensitive domains such as self-harm, hate speech, and political polarization
Q: Is this only for Qwen-3?
A: No. Qwen-3 is the reference backbone, but THETA is designed to be model-agnostic. You can adapt it for other embedding models.
Q: What is the difference between ETM and DTM?
A: ETM learns static topics across the corpus; DTM (Dynamic Topic Model) models topic evolution over time and requires timestamps.
Q: Why is STM skipped when I try to train it? How do I use STM?
A: STM (Structural Topic Model) requires document-level covariates (metadata such as year, source, category). Unlike LDA, STM models how metadata influences topic prevalence, so covariates are mandatory. If your dataset doesn't have covariates configured, STM will be automatically skipped.
To use STM:
# 1. Make sure your cleaned CSV has metadata columns (e.g., year, source, category)
# 2. Register covariates in ETM/config.py:
# DATASET_CONFIGS["my_dataset"] = {
# "vocab_size": 5000,
# "num_topics": 20,
# "language": "english",
# "covariate_columns": ["year", "source", "category"], # <-- required for STM
# }
# 3. Prepare data
bash scripts/03_prepare_data.sh --dataset my_dataset --model stm --vocab_size 5000
# 4. Train STM
bash scripts/05_train_baseline.sh --dataset my_dataset --models stm --num_topics 20

If your dataset has no meaningful metadata, use CTM (same logistic-normal prior, no covariates needed) or LDA instead.
Q: CUDA out of memory — what should I do?
A: Insufficient GPU VRAM. Solutions:
- Embedding generation (unsupervised/supervised): reduce `--batch_size` (recommend 4–8)
- THETA training: reduce `--batch_size` (recommend 32–64)
- Check for other processes using the GPU: `nvidia-smi`
- Kill zombie processes: `kill -9 <PID>`
Q: EMB shows ✗ (embeddings not generated)
A: Embedding generation failed (usually OOM) but the script did not exit with an error. Regenerate with a smaller batch_size:
bash scripts/02_generate_embeddings.sh \
--dataset edu_data --mode unsupervised --model_size 0.6B \
--batch_size 4 --gpu 0 \
  --exp_dir /root/autodl-tmp/result/0.6B/edu_data/data/exp_xxx

Q: How to choose an embedding mode?
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Quick testing | zero_shot | No training needed, completes in seconds |
| Unlabeled data | unsupervised | LoRA fine-tuning adapts to the domain |
| Labeled data | supervised | Leverages label information to enhance embeddings |
| Large datasets | zero_shot | Avoids lengthy fine-tuning |
Q: How to choose the number of topics K?
- Small datasets (<1000 docs): K = 5–15
- Medium datasets (1000–10000): K = 10–30
- Large datasets (>10000): K = 20–50
- Use `hdp` or `bertopic` to auto-determine the topic count as a reference
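The heuristics above can be wrapped in a small helper (hypothetical, not part of THETA):

```python
def suggest_k_range(n_docs: int) -> tuple:
    """Suggested range for the number of topics K, by corpus size."""
    if n_docs < 1000:
        return (5, 15)      # small datasets
    if n_docs <= 10000:
        return (10, 30)     # medium datasets
    return (20, 50)         # large datasets

print(suggest_k_range(857))    # (5, 15)  -- e.g. a corpus the size of edu_data
print(suggest_k_range(40000))  # (20, 50)
```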
Q: What does the visualization --language parameter do?
- `en`: Chart titles, axes, and legends in English
- `zh`: Chart titles, axes, and legends in Chinese (e.g., "主题表", "训练损失图")
- Only affects visualization; does not affect model training or evaluation
Q: What is the difference between BOW --language and visualization --language?
| Parameter | Script | Values | Purpose |
|---|---|---|---|
| `--language` in 03_prepare_data.sh | BOW generation | english, chinese | Controls tokenization and stopword filtering |
| `--language` in 04_train_theta.sh | Visualization | en, zh | Controls chart label language |
| `--language` in 05_train_baseline.sh | Visualization | en, zh | Controls chart label language |
Q: Can I add my own dataset?
A: Yes. Prepare a cleaned CSV with a `text` column (and optionally a `year` column for DTM, or metadata columns for STM), then add a configuration to config.py:
DATASET_CONFIGS["my_dataset"] = {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 5,
"language": "english",
# Optional: for STM (document-level metadata)
# "covariate_columns": ["year", "source", "category"],
# Optional: for DTM (time-aware)
# "has_timestamp": True,
}

If you find THETA useful in your research, please consider citing our paper:
@article{duan2026theta,
title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
author={Codesoul.co},
journal={arXiv preprint arXiv:2603.05972},
year={2026},
doi={10.48550/arXiv.2603.05972}
}

Please contact us if you have any questions:
