THETA (θ) is an open-source, research-oriented platform for LLM-enhanced topic analysis in social science. It combines:
- Domain-adaptive document embeddings from Qwen-3 models (0.6B/4B/8B)
  - Zero-shot embedding (no training), or
  - Supervised/unsupervised fine-tuning modes
- Generative topic models with 12 baseline models for comparison:
  - THETA: Main model using Qwen embeddings (0.6B/4B/8B)
  - Traditional: LDA, HDP (auto topics), STM (requires covariates), BTM (short texts)
  - Neural: ETM, CTM, DTM (time-aware), NVDM, GSM, ProdLDA, BERTopic
- Scientific validation via 7 intrinsic metrics (PPL, TD, iRBO, NPMI, C_V, UMass, Exclusivity)
- Comprehensive visualization with bilingual support (English/Chinese)
THETA aims to move topic modeling from "clustering with pretty plots" to a reproducible, validated scientific workflow.
- Hybrid embedding topic analysis: Zero-shot / Supervised / Unsupervised modes
- Multiple Qwen model sizes: 0.6B (1024-dim), 4B (2560-dim), 8B (4096-dim)
- 12 Baseline models: LDA, HDP, STM (requires covariates), BTM, ETM, CTM, DTM, NVDM, GSM, ProdLDA, BERTopic for comparison
- Data governance: Domain-aware cleaning for multiple languages (English, Chinese, German, Spanish)
- Unified evaluation: 7 metrics with JSON/CSV export
- Rich visualization: 20+ chart types with bilingual labels
| Model | Type | Description | Auto Topics | Best For |
|---|---|---|---|---|
| `theta` | Neural | THETA with Qwen embeddings (0.6B/4B/8B) | No | General purpose, high quality |
| `lda` | Traditional | Latent Dirichlet Allocation (sklearn) | No | Fast baseline, interpretable |
| `hdp` | Traditional | Hierarchical Dirichlet Process | Yes | Unknown topic count |
| `stm` | Traditional | Structural Topic Model | No | Requires covariates (metadata) |
| `btm` | Traditional | Biterm Topic Model | No | Short texts (tweets, titles) |
| `etm` | Neural | Embedded Topic Model (Word2Vec + VAE) | No | Word embedding integration |
| `ctm` | Neural | Contextualized Topic Model (SBERT + VAE) | No | Semantic understanding |
| `dtm` | Neural | Dynamic Topic Model | No | Time-series analysis |
| `nvdm` | Neural | Neural Variational Document Model | No | VAE-based baseline |
| `gsm` | Neural | Gaussian Softmax Model | No | Better topic separation |
| `prodlda` | Neural | Product of Experts LDA | No | State-of-the-art neural LDA |
| `bertopic` | Neural | BERT-based topic modeling | Yes | Clustering-based topics |
Choose your model based on:
┌─────────────────────────────────────────────────────────────────┐
│ Do you know the number of topics? │
│ ├─ NO → Use HDP or BERTopic (auto-detect topics) │
│ └─ YES → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ What is your text length? │
│ ├─ SHORT (tweets, titles) → Use BTM │
│ └─ NORMAL/LONG → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ Do you have document-level metadata (covariates)? │
│ ├─ YES → Use STM (models how metadata affects topics) │
│ └─ NO → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ Do you have time-series data? │
│ ├─ YES → Use DTM │
│ └─ NO → Continue below │
├─────────────────────────────────────────────────────────────────┤
│ What's your priority? │
│ ├─ SPEED → Use LDA (fastest) │
│ ├─ QUALITY → Use THETA (best with Qwen embeddings) │
│ └─ COMPARISON → Use multiple: lda,nvdm,prodlda,theta │
└─────────────────────────────────────────────────────────────────┘
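The decision tree above can be condensed into a small helper. `choose_model` is a hypothetical illustration for readers, not a function in the THETA codebase:

```python
def choose_model(knows_k: bool, short_text: bool = False,
                 has_covariates: bool = False, has_time: bool = False,
                 priority: str = "quality") -> str:
    """Mirror the decision tree: return a model name for --models."""
    if not knows_k:
        return "hdp"          # or "bertopic": both auto-detect the topic count
    if short_text:
        return "btm"          # biterm model for tweets, titles, etc.
    if has_covariates:
        return "stm"          # models how metadata affects topics
    if has_time:
        return "dtm"          # time-aware topic evolution
    return "lda" if priority == "speed" else "theta"

print(choose_model(knows_k=False))                   # hdp
print(choose_model(knows_k=True, has_time=True))     # dtm
print(choose_model(knows_k=True, priority="speed"))  # lda
```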
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--model_size` | str | 0.6B | 0.6B, 4B, 8B | Qwen model size |
| `--mode` | str | zero_shot | zero_shot, supervised, unsupervised | Embedding mode |
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 512 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Encoder dropout rate |
| `--kl_start` | float | 0.0 | 0–1 | KL annealing start weight |
| `--kl_end` | float | 1.0 | 0–1 | KL annealing end weight |
| `--kl_warmup` | int | 50 | 0–epochs | KL warmup epochs |
| `--patience` | int | 10 | 1–50 | Early stopping patience |
| `--language` | str | en | en, zh | Visualization language |
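A plausible reading of `--kl_start`, `--kl_end`, and `--kl_warmup` is a linear annealing schedule for the KL term of the VAE loss. The sketch below assumes linear interpolation, which may not match the actual implementation:

```python
def kl_weight(epoch: int, kl_start: float = 0.0, kl_end: float = 1.0,
              kl_warmup: int = 50) -> float:
    """Linearly anneal the KL weight from kl_start to kl_end over kl_warmup epochs."""
    if kl_warmup <= 0 or epoch >= kl_warmup:
        return kl_end
    return kl_start + (kl_end - kl_start) * epoch / kl_warmup

print(kl_weight(0))    # 0.0 -- training starts with no KL pressure
print(kl_weight(25))   # 0.5 -- halfway through warmup
print(kl_weight(100))  # 1.0 -- full KL weight after warmup
```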
LDA
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--max_iter` | int | 100 | 10–500 | Maximum EM iterations |
| `--alpha` | float | auto (1/K) | >0 | Document-topic Dirichlet prior |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
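Since the `lda` baseline wraps scikit-learn, a minimal stand-alone sketch with the defaults above may help orient readers. The toy `docs` and exact wiring are illustrative, not THETA's actual code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: at least 5 independent documents, as THETA requires.
docs = ["topic models find latent themes",
        "neural networks learn representations",
        "latent themes structure large corpora",
        "representations power neural topic models",
        "corpora of text hide latent structure"]

vec = CountVectorizer()
bow = vec.fit_transform(docs)

K = 3
lda = LatentDirichletAllocation(n_components=K, max_iter=100,
                                doc_topic_prior=1 / K)  # alpha = 1/K
theta = lda.fit_transform(bow)   # document-topic distribution (N x K)
beta = lda.components_           # topic-word weights (K x V)
print(theta.shape, beta.shape)
```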
HDP
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--max_topics` | int | 150 | 50–300 | Upper bound on number of topics |
| `--alpha` | float | 1.0 | >0 | Document-level concentration parameter |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
STM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--max_iter` | int | 100 | 10–500 | Maximum EM iterations |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
BTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--n_iter` | int | 100 | 10–500 | Gibbs sampling iterations |
| `--alpha` | float | 1.0 | >0 | Dirichlet prior for topic distribution |
| `--beta` | float | 0.01 | >0 | Dirichlet prior for word distribution |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
ETM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 800 | 128–2048 | Neurons per encoder hidden layer |
| `--embedding_dim` | int | 300 | 50–1024 | Word embedding dimension (Word2Vec) |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.5 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
CTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--inference_type` | str | zeroshot | zeroshot, combined | zeroshot (SBERT only) or combined (SBERT + BOW) |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 100 | 32–1024 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
DTM
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 512 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
NVDM / GSM / ProdLDA
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | 20 | 5–100 | Number of topics K |
| `--num_layers` | int | 2 | 1–5 | Number of encoder hidden layers |
| `--hidden_dim` | int | 256 | 128–2048 | Neurons per encoder hidden layer |
| `--epochs` | int | 100 | 10–500 | Training epochs |
| `--batch_size` | int | 64 | 8–512 | Batch size |
| `--learning_rate` | float | 0.002 | 1e-5–0.1 | Learning rate |
| `--dropout` | float | 0.2 | 0–0.9 | Dropout rate |
| `--vocab_size` | int | 5000 | 1000–20000 | Vocabulary size |
BERTopic
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `--num_topics` | int | auto | ≥2 or None | Target number of topics; None = automatic |
| `--min_cluster_size` | int | 10 | 2–100 | Minimum cluster size; controls topic granularity |
| `--top_n_words` | int | 10 | 1–30 | Number of words per topic |
| `--n_neighbors` | int | 15 | 2–100 | UMAP n_neighbors; controls local vs. global structure |
| `--n_components` | int | 5 | 2–50 | UMAP output dimensionality before clustering |
/root/
├── ETM/ # Main codebase
│ ├── run_pipeline.py # Unified entry point
│ ├── prepare_data.py # Data preprocessing
│ ├── config.py # Configuration management
│ ├── dataclean/ # Data cleaning module
│ ├── model/ # Model implementations
│ │ ├── theta/ # THETA main model
│ │ ├── baselines/ # 12 baseline models
│ │ └── _reference/ # Reference implementations
│ ├── evaluation/ # Evaluation metrics
│ ├── visualization/ # Visualization tools
│ └── utils/ # Utilities
├── agent/ # Agent system
│ ├── api.py # FastAPI endpoints
│ ├── core/ # Agent implementations
│ ├── config/ # Configuration management
│ ├── prompts/ # Prompt templates
│ ├── utils/ # LLM and vision utilities
│ └── docs/ # API documentation
├── scripts/ # Shell scripts for automation
├── embedding/ # Qwen embedding generation
│ ├── main.py # Embedding generation main codebase
│ ├── embedder.py # Embedding
│ ├── trainer.py # Training (supervised/unsupervised)
│ ├── data_loader.py # Dataloader
- Python 3.10+
- CUDA recommended for GPU acceleration
- Key dependencies:
numpy>=1.20.0
scipy>=1.7.0
torch>=1.10.0
transformers>=4.30.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
gensim>=4.1.0
wordcloud>=1.8.0
pyLDAvis>=3.3.0
jieba>=0.42.0
git clone https://github.com/<YOUR_ORG>/THETA.git
cd THETA
# Install dependencies
pip install -r ETM/requirements.txt
# Or use the setup script
bash scripts/01_setup.sh

# Standard SSH connection template
ssh -p <PORT> root@<SERVER_IP>
# Example for AutoDL
ssh -p 12345 root@connect.westb.seetacloud.com

# Clone the repository
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA
# Copy environment template
cp .env.example .env
# Edit .env file (CRITICAL: Set model paths)
nano .env

🔑 Core Configuration (Must Set):
# .env file - Essential paths
QWEN_MODEL_0_6B=/root/autodl-tmp/models/Qwen3-Embedding-0.6B
QWEN_MODEL_4B=/root/autodl-tmp/models/Qwen3-Embedding-4B # Optional
QWEN_MODEL_8B=/root/autodl-tmp/models/Qwen3-Embedding-8B # Optional
SBERT_MODEL=/root/autodl-tmp/models/paraphrase-multilingual-MiniLM-L12-v2
# Optional: Customize embedding length (default: 512)
# MAX_EMBED_LENGTH=1024

# For Chinese datasets
bash scripts/11_quick_start_chinese.sh
# For English datasets
bash scripts/10_quick_start.sh

THETA implements a three-level parameter priority system, the cornerstone of its configuration design:
┌─────────────────────────────────────────────────────────────────────────┐
│ PARAMETER PRIORITY HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ P1 (HIGHEST) │ Command-line arguments │
│ │ Example: --max_length 1024 --num_topics 30 │
│ │ │
│ P2 (MEDIUM) │ .env configuration file │
│ │ Example: MAX_EMBED_LENGTH=1024 in .env │
│ │ │
│ P3 (LOWEST) │ System hardcoded defaults │
│ │ Example: max_length=512 in code │
│ │ │
└─────────────────────────────────────────────────────────────────────────┘
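The hierarchy above can be sketched as a small resolution helper. `resolve` is hypothetical; the real logic in config.py may differ:

```python
import os

def resolve(cli_value, env_var: str, default):
    """Return the effective value under the P1 > P2 > P3 priority rule."""
    if cli_value is not None:                  # P1: command-line argument
        return cli_value
    if os.environ.get(env_var) is not None:    # P2: .env / environment
        return type(default)(os.environ[env_var])
    return default                             # P3: hardcoded default

os.environ["MAX_EMBED_LENGTH"] = "1024"        # simulate the .env setting
print(resolve(2048, "MAX_EMBED_LENGTH", 512))  # 2048 -- CLI wins
print(resolve(None, "MAX_EMBED_LENGTH", 512))  # 1024 -- .env wins
```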
Example:
# .env has: MAX_EMBED_LENGTH=1024
# Command line: --max_length 2048
# Result: max_length = 2048 (CLI wins!)

ALL models require at least 5 independent documents/samples. This ensures statistical significance.
❌ ERROR: [CRITICAL ERROR] Insufficient data sources (currently 3 files).
THETA requires at least 5 independent documents for statistical significance.
| Model | Input Format | Required Columns | Error if Missing |
|---|---|---|---|
| DTM | .csv only | time / year / date | SchemaError |
| STM | .csv only | At least 1 covariate column (besides text) | SchemaError |
DTM Example CSV:
text,year
"Document content here...",2020
"Another document...",2021

STM Example CSV:
text,party,region
"Policy document...",Democrat,Northeast
"Another policy...",Republican,South

Error Message:
❌ SchemaError: [CRITICAL] DTM/STM requires structured metadata.
Please provide a CSV file with time or covariate columns.
Current columns: ['text']
DTM requires: time column (time/year/date/timestamp)
THETA uses Sliding Window + Mean Pooling to handle documents of ANY length without information loss.
┌─────────────────────────────────────────────────────────────────────────┐
│ Input: 50,000 character document (~100,000 tokens) │
│ max_length: 1024 tokens │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Sliding Window Tokenization (20% overlap) │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Window 1: tokens[0:1022] → Embedding_1 │ │
│ │ Window 2: tokens[818:1840] → Embedding_2 (20% overlap) │ │
│ │ Window 3: tokens[1636:2658] → Embedding_3 │ │
│ │ ... │ │
│ │ Window N: tokens[...end] → Embedding_N │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Step 2: Mean Pooling Aggregation │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Final Embedding = mean(Embedding_1, Embedding_2, ..., Embedding_N)│ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Result: Single 1024-dim vector preserving ALL document information │
└─────────────────────────────────────────────────────────────────────────┘
| Old Approach | THETA Approach |
|---|---|
| `truncation=True` → Loses 90%+ of long documents | Sliding Window → Preserves 100% |
| Fixed 512 tokens max | Dynamic up to 8192 tokens |
| Information loss | Zero information loss |
Console Output:
[Sliding Window] 45/180 texts processed with sliding window
→ max_length=1024, overlap=20%
HARD_LIMIT_TOKEN = 8192  # Cannot exceed even if set higher in .env

This prevents GPU out-of-memory errors regardless of user configuration.
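The sliding-window scheme described above can be sketched as follows. This is a simplification: fake per-token vectors stand in for the Qwen embeddings that the real pipeline computes per window before mean pooling:

```python
import numpy as np

def window_spans(n_tokens: int, max_length: int = 1024, overlap: float = 0.2):
    """Return (start, end) token spans with the stated 20% overlap."""
    step = max(1, int(max_length * (1 - overlap)))
    spans, start = [], 0
    while start < n_tokens:
        spans.append((start, min(start + max_length, n_tokens)))
        if start + max_length >= n_tokens:
            break
        start += step
    return spans

def embed_long(token_vectors: np.ndarray, max_length: int = 1024) -> np.ndarray:
    """Mean-pool one vector per window, then mean-pool across windows."""
    spans = window_spans(len(token_vectors), max_length)
    window_vecs = [token_vectors[s:e].mean(axis=0) for s, e in spans]
    return np.mean(window_vecs, axis=0)

tokens = np.random.rand(2658, 1024)            # stand-in per-token vectors
doc_vec = embed_long(tokens)
print(len(window_spans(2658)), doc_vec.shape)  # 3 (1024,)
```

Regardless of document length, the result is a single fixed-size vector, which is what makes downstream topic modeling uniform across short and long texts.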
If pre-trained embeddings and BOW data are not available locally, download from HuggingFace:
Repository: https://huggingface.co/CodeSoulco/THETA
# Download pre-trained data and LoRA weights
bash scripts/09_download_from_hf.sh
# Or manually using Python
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='CodeSoulco/THETA',
local_dir='/root/autodl-tmp/hf_cache/THETA'
)
"

The HuggingFace repository contains:
- Pre-computed embeddings for benchmark datasets
- BOW matrices and vocabularies
- LoRA fine-tuned weights (optional)
All scripts are non-interactive (pure command-line parameters), suitable for DLC/batch environments. No stdin input required.
| Script | Description |
|---|---|
| `01_setup.sh` | Install dependencies and download data from HuggingFace |
| `02_clean_data.sh` | Clean raw text data (tokenization, stopword removal, lemmatization) |
| `02_generate_embeddings.sh` | Generate Qwen embeddings (sub-script of 03, for failure recovery) |
| `03_prepare_data.sh` | One-stop data preparation: BOW + embeddings for all 12 models |
| `04_train_theta.sh` | Train THETA model (train + evaluate + visualize) |
| `05_train_baseline.sh` | Train 11 baseline models for comparison with THETA |
| `06_visualize.sh` | Generate visualizations for trained models |
| `07_evaluate.sh` | Standalone evaluation with 7 unified metrics |
| `08_compare_models.sh` | Cross-model metric comparison table |
| `09_download_from_hf.sh` | Download pre-trained data from HuggingFace |
| `10_quick_start_english.sh` | Quick start for English datasets |
| `11_quick_start_chinese.sh` | Quick start for Chinese datasets |
| `12_train_multi_gpu.sh` | Multi-GPU training with DistributedDataParallel |
| `13_test_agent.sh` | Test LLM Agent connection and functionality |
| `14_start_agent_api.sh` | Start the Agent API server (FastAPI) |
For full parameter references, usage examples, and end-to-end workflow walkthroughs for each script, see the Shell Scripts Reference.
THETA uses Qwen-3 embedding models with three size options:
| Model Size | Embedding Dim | Use Case |
|---|---|---|
| 0.6B | 1024 | Fast, default |
| 4B | 2560 | Balanced |
| 8B | 4096 | Best quality |
Embedding modes:
- `zero_shot` - Direct embedding without fine-tuning
- `supervised` - Fine-tuned with labeled data
- `unsupervised` - Fine-tuned without labels
# Generate embeddings for a dataset
python prepare_data.py --dataset my_dataset --model theta --model_size 0.6B --mode zero_shot
# Check if embeddings exist
python prepare_data.py --dataset my_dataset --model theta --model_size 4B --check-only

Output artifacts:
- `{dataset}_{mode}_embeddings.npy` - Embedding matrix (N × D)
- `bow_matrix.npz` - Bag-of-words matrix
- `vocab.json` - Vocabulary list
THETA supports multiple topic modeling approaches:
| Model | Description | Time-aware |
|---|---|---|
| THETA | Qwen embedding + ETM | No |
| LDA | Latent Dirichlet Allocation | No |
| ETM | Embedded Topic Model | No |
| CTM | Contextualized Topic Model | No |
| DTM | Dynamic Topic Model | Yes |
Training outputs (organized by ResultManager):
- `model/theta_k{K}.npy` - Document-topic distribution
- `model/beta_k{K}.npy` - Topic-word distribution
- `model/training_history_k{K}.json` - Training history
- `topicwords/topic_words_k{K}.json` - Top words per topic
- `topicwords/topic_evolution_k{K}.json` - Topic evolution (DTM only)
THETA provides unified evaluation with 7 metrics:
| Metric | Description |
|---|---|
| PPL | Perplexity - model fit |
| TD | Topic Diversity |
| iRBO | Inverse Rank-Biased Overlap |
| NPMI | Normalized PMI coherence |
| C_V | C_V coherence |
| UMass | UMass coherence |
| Exclusivity | Topic exclusivity |
from evaluation.unified_evaluator import UnifiedEvaluator
evaluator = UnifiedEvaluator(
beta=beta,
theta=theta,
bow_matrix=bow_matrix,
vocab=vocab,
model_name="dtm",
dataset="edu_data",
num_topics=20
)
metrics = evaluator.evaluate_all()
evaluator.save_results()  # Saves to evaluation/metrics_k20.json and .csv

Evaluation outputs:
- `evaluation/metrics_k{K}.json` - All metrics in JSON format
- `evaluation/metrics_k{K}.csv` - All metrics in CSV format
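As an illustration of one of the seven metrics, Topic Diversity (TD) is commonly computed as the fraction of unique words among each topic's top-N words. This sketch uses the standard definition; THETA's exact variant may differ:

```python
import numpy as np

def topic_diversity(beta: np.ndarray, top_n: int = 25) -> float:
    """TD = unique words in the top-N of all topics / (K * top_n).

    beta: topic-word matrix of shape (K, V). TD = 1.0 means no topic
    shares any top word with another; values near 0 mean heavy overlap.
    """
    top = np.argsort(beta, axis=1)[:, -top_n:]        # top-N word ids per topic
    return len(np.unique(top)) / (beta.shape[0] * top_n)

# 4 topics over 100 words with fully disjoint top words -> TD = 1.0
beta = np.eye(4).repeat(25, axis=1)
print(topic_diversity(beta))  # 1.0
```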
THETA provides comprehensive visualization with bilingual support (English/Chinese):
# Generate visualizations after training
python run_pipeline.py --dataset edu_data --models dtm --skip-train --language en
# Or use visualization module directly
python -c "
from visualization.run_visualization import run_baseline_visualization
run_baseline_visualization(
result_dir='/root/autodl-tmp/result/baseline',
dataset='edu_data',
model='dtm',
num_topics=20,
language='zh'
)
"

Generated charts (20+ types):
- Topic word bars, word clouds, topic similarity heatmap
- Document clustering (UMAP), topic network graph
- Topic evolution (DTM), sankey diagrams
- Training convergence, coherence metrics
- pyLDAvis interactive HTML
Output structure:
visualization_k{K}_{lang}_{timestamp}/
├── global/ # Global charts
│ ├── topic_table.png
│ ├── topic_network.png
│ ├── clustering_heatmap.png
│ ├── topic_wordclouds.png
│ └── ...
├── topics/ # Per-topic charts
│ ├── topic_0/
│ ├── topic_1/
│ └── ...
└── README.md # Summary report
All results are organized using ResultManager:
/root/autodl-tmp/result/baseline/{dataset}/{model}/
├── bow/ # BOW data and vocabulary
│ ├── bow_matrix.npz
│ ├── vocab.json
│ └── vocab.txt
├── model/ # Model parameters
│ ├── theta_k{K}.npy
│ ├── beta_k{K}.npy
│ └── training_history_k{K}.json
├── evaluation/ # Evaluation results
│ ├── metrics_k{K}.json
│ └── metrics_k{K}.csv
├── topicwords/ # Topic words
│ ├── topic_words_k{K}.json
│ └── topic_evolution_k{K}.json
└── visualization_k{K}_{lang}_{timestamp}/
Using ResultManager:
from utils.result_manager import ResultManager
# Initialize
manager = ResultManager(
result_dir='/root/autodl-tmp/result/baseline',
dataset='edu_data',
model='dtm',
num_topics=20
)
# Save all results
manager.save_all(theta, beta, vocab, topic_words, metrics=metrics)
# Load all results
data = manager.load_all(num_topics=20)
# Migrate old flat structure to new structure
from utils.result_manager import migrate_baseline_results
migrate_baseline_results(dataset='edu_data', model='dtm')

Dataset configurations are defined in config.py:
DATASET_CONFIGS = {
"socialTwitter": {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 5,
"language": "multi",
},
"hatespeech": {
"vocab_size": 8000,
"num_topics": 20,
"min_doc_freq": 10,
"language": "english",
},
"edu_data": {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 3,
"language": "chinese",
"has_timestamp": True,
},
}

Command-line parameters:
| Parameter | Description | Default |
|---|---|---|
| `--dataset` | Dataset name | Required |
| `--models` | Model list (comma-separated) | Required |
| `--model_size` | Qwen model size (THETA) | 0.6B |
| `--mode` | THETA mode | zero_shot |
| `--num_topics` | Number of topics | 20 |
| `--epochs` | Training epochs | 100 |
| `--batch_size` | Batch size | 64 |
| `--language` | Visualization language | en |
| `--skip-train` | Skip training | False |
| `--skip-eval` | Skip evaluation | False |
| `--skip-viz` | Skip visualization | False |
| Dataset | Documents | Language | Time-aware |
|---|---|---|---|
| socialTwitter | ~40K | Spanish/English | No |
| hatespeech | ~437K | English | No |
| mental_health | ~1M | English | No |
| FCPB | ~854K | English | No |
| germanCoal | ~9K | German | No |
| edu_data | ~857 | Chinese | Yes |
Apache-2.0
Contributions are welcome:
- New dataset adapters
- Topic visualization modules
- Evaluation and reproducibility scripts
- Documentation improvements
Suggested workflow:
- Fork the repo and create a feature branch
- Add a minimal reproducible example or tests
- Open a pull request
This project analyzes social text and may involve sensitive content.
- Do not include personally identifiable information (PII)
- Ensure dataset usage complies with platform terms and research ethics
- Interpret outputs cautiously; topic discovery is not a substitute for scientific conclusions
- Be responsible with sensitive domains such as self-harm, hate speech, and political polarization
Q: Is this only for Qwen-3?
A: No. Qwen-3 is the reference backbone, but THETA is designed to be model-agnostic. You can adapt it for other embedding models.
Q: What is the difference between ETM and DTM?
A: ETM learns static topics across the corpus; DTM (Dynamic Topic Model) models topic evolution over time and requires timestamps.
Q: Why is STM skipped when I try to train it? How do I use STM?
A: STM (Structural Topic Model) requires document-level covariates (metadata such as year, source, category). Unlike LDA, STM models how metadata influences topic prevalence, so covariates are mandatory. If your dataset doesn't have covariates configured, STM will be automatically skipped.
To use STM:
# 1. Make sure your cleaned CSV has metadata columns (e.g., year, source, category)
# 2. Register covariates in ETM/config.py:
# DATASET_CONFIGS["my_dataset"] = {
# "vocab_size": 5000,
# "num_topics": 20,
# "language": "english",
# "covariate_columns": ["year", "source", "category"], # <-- required for STM
# }
# 3. Prepare data
bash scripts/03_prepare_data.sh --dataset my_dataset --model stm --vocab_size 5000
# 4. Train STM
bash scripts/05_train_baseline.sh --dataset my_dataset --models stm --num_topics 20

If your dataset has no meaningful metadata, use CTM (same logistic-normal prior, no covariates needed) or LDA instead.
Q: CUDA out of memory — what should I do?
A: Insufficient GPU VRAM. Solutions:
- Embedding generation (unsupervised/supervised): reduce `--batch_size` (recommend 4–8)
- THETA training: reduce `--batch_size` (recommend 32–64)
- Check for other processes using the GPU: `nvidia-smi`
- Kill zombie processes: `kill -9 <PID>`
Q: EMB shows ✗ (embeddings not generated)
A: Embedding generation failed (usually OOM) but the script did not exit with an error. Regenerate with a smaller batch_size:
bash scripts/02_generate_embeddings.sh \
--dataset edu_data --mode unsupervised --model_size 0.6B \
--batch_size 4 --gpu 0 \
  --exp_dir /root/autodl-tmp/result/0.6B/edu_data/data/exp_xxx

Q: How to choose an embedding mode?
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Quick testing | zero_shot | No training needed, completes in seconds |
| Unlabeled data | unsupervised | LoRA fine-tuning adapts to the domain |
| Labeled data | supervised | Leverages label information to enhance embeddings |
| Large datasets | zero_shot | Avoids lengthy fine-tuning |
Q: How to choose the number of topics K?
- Small datasets (<1000 docs): K = 5–15
- Medium datasets (1000–10000): K = 10–30
- Large datasets (>10000): K = 20–50
- Use `hdp` or `bertopic` to auto-determine the topic count as a reference
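The heuristics above can be wrapped in a small helper (hypothetical, not part of THETA):

```python
def suggest_k_range(n_docs: int) -> tuple:
    """Suggested range for the number of topics K, by corpus size."""
    if n_docs < 1000:
        return (5, 15)      # small datasets
    if n_docs <= 10000:
        return (10, 30)     # medium datasets
    return (20, 50)         # large datasets

print(suggest_k_range(857))    # (5, 15)  -- e.g. a corpus the size of edu_data
print(suggest_k_range(40000))  # (20, 50)
```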
Q: What does the visualization --language parameter do?
- `en`: Chart titles, axes, and legends in English
- `zh`: Chart titles, axes, and legends in Chinese (e.g., "主题表", "训练损失图")
- Only affects visualization; does not affect model training or evaluation
Q: What is the difference between BOW --language and visualization --language?
| Parameter | Script | Values | Purpose |
|---|---|---|---|
| `--language` in 03_prepare_data.sh | BOW generation | english, chinese | Controls tokenization and stopword filtering |
| `--language` in 04_train_theta.sh | Visualization | en, zh | Controls chart label language |
| `--language` in 05_train_baseline.sh | Visualization | en, zh | Controls chart label language |
Q: Can I add my own dataset?
A: Yes. Prepare a cleaned CSV with a `text` column (and optionally a `year` column for DTM, or metadata columns for STM), then add a configuration to config.py:
DATASET_CONFIGS["my_dataset"] = {
"vocab_size": 5000,
"num_topics": 20,
"min_doc_freq": 5,
"language": "english",
# Optional: for STM (document-level metadata)
# "covariate_columns": ["year", "source", "category"],
# Optional: for DTM (time-aware)
# "has_timestamp": True,
}

If you find THETA useful in your research, please consider citing our paper:
@article{duan2026theta,
title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
author={Codesoul.co},
journal={arXiv preprint arXiv:2603.05972},
year={2026},
doi={10.48550/arXiv.2603.05972}
}

Please contact us if you have any questions:
