
THETA (θ)


English | 中文

Textual Hybrid Embedding–based Topic Analysis

Overview

THETA (θ) is an open-source, research-oriented platform for LLM-enhanced topic analysis in social science. It combines:

  • Domain-adaptive document embeddings from Qwen-3 models (0.6B/4B/8B)
    • Zero-shot embedding (no training), or
    • Supervised/Unsupervised fine-tuning modes
  • Generative topic modeling with 12 models (THETA plus 11 baselines) for comparison:
    • THETA: Main model using Qwen embeddings (0.6B/4B/8B)
    • Traditional: LDA, HDP (auto topics), STM (requires covariates), BTM (short texts)
    • Neural: ETM, CTM, DTM (time-aware), NVDM, GSM, ProdLDA, BERTopic
  • Scientific validation via 7 intrinsic metrics (PPL, TD, iRBO, NPMI, C_V, UMass, Exclusivity)
  • Comprehensive visualization with bilingual support (English/Chinese)

THETA aims to move topic modeling from "clustering with pretty plots" to a reproducible, validated scientific workflow.


Key Features

  • Hybrid embedding topic analysis: Zero-shot / Supervised / Unsupervised modes
  • Multiple Qwen model sizes: 0.6B (1024-dim), 4B (2560-dim), 8B (4096-dim)
  • 11 baseline models: LDA, HDP, STM (requires covariates), BTM, ETM, CTM, DTM, NVDM, GSM, ProdLDA, BERTopic for comparison
  • Data governance: Domain-aware cleaning for multiple languages (English, Chinese, German, Spanish)
  • Unified evaluation: 7 metrics with JSON/CSV export
  • Rich visualization: 20+ chart types with bilingual labels

Supported Models

Model Overview

Model Type Description Auto Topics Best For
theta Neural THETA with Qwen embeddings (0.6B/4B/8B) No General purpose, high quality
lda Traditional Latent Dirichlet Allocation (sklearn) No Fast baseline, interpretable
hdp Traditional Hierarchical Dirichlet Process Yes Unknown topic count
stm Traditional Structural Topic Model No Requires covariates (metadata)
btm Traditional Biterm Topic Model No Short texts (tweets, titles)
etm Neural Embedded Topic Model (Word2Vec + VAE) No Word embedding integration
ctm Neural Contextualized Topic Model (SBERT + VAE) No Semantic understanding
dtm Neural Dynamic Topic Model No Time-series analysis
nvdm Neural Neural Variational Document Model No VAE-based baseline
gsm Neural Gaussian Softmax Model No Better topic separation
prodlda Neural Product of Experts LDA No State-of-the-art neural LDA
bertopic Neural BERT-based topic modeling Yes Clustering-based topics

Model Selection Guide

Choose your model based on:

┌─────────────────────────────────────────────────────────────────┐
│ Do you know the number of topics?                               │
│   ├─ NO  → Use HDP or BERTopic (auto-detect topics)            │
│   └─ YES → Continue below                                       │
├─────────────────────────────────────────────────────────────────┤
│ What is your text length?                                       │
│   ├─ SHORT (tweets, titles) → Use BTM                          │
│   └─ NORMAL/LONG → Continue below                               │
├─────────────────────────────────────────────────────────────────┤
│ Do you have document-level metadata (covariates)?               │
│   ├─ YES → Use STM (models how metadata affects topics)         │
│   └─ NO  → Continue below                                       │
├─────────────────────────────────────────────────────────────────┤
│ Do you have time-series data?                                   │
│   ├─ YES → Use DTM                                              │
│   └─ NO  → Continue below                                       │
├─────────────────────────────────────────────────────────────────┤
│ What's your priority?                                           │
│   ├─ SPEED      → Use LDA (fastest)                            │
│   ├─ QUALITY    → Use THETA (best with Qwen embeddings)        │
│   └─ COMPARISON → Use multiple: lda,nvdm,prodlda,theta         │
└─────────────────────────────────────────────────────────────────┘
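The chart above can be mirrored in a small helper. This function is purely illustrative (recommend_model is not part of THETA's API); it just encodes the decision order shown in the box:

```python
def recommend_model(knows_k, short_text=False, has_covariates=False,
                    has_time=False, priority="quality"):
    """Mirror the selection chart: return a model name for run_pipeline.py --models."""
    if not knows_k:
        return "hdp"          # or "bertopic": both auto-detect topic count
    if short_text:
        return "btm"          # designed for tweets, titles, etc.
    if has_covariates:
        return "stm"          # models how metadata affects topics
    if has_time:
        return "dtm"          # time-series topic evolution
    return {"speed": "lda",
            "quality": "theta",
            "comparison": "lda,nvdm,prodlda,theta"}[priority]

print(recommend_model(True))                      # theta
print(recommend_model(True, short_text=True))     # btm
```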

Training Parameters Reference

THETA Parameters

Parameter Type Default Range Description
--model_size str 0.6B 0.6B, 4B, 8B Qwen model size
--mode str zero_shot zero_shot, supervised, unsupervised Embedding mode
--num_topics int 20 5–100 Number of topics K
--num_layers int 2 1–5 Number of encoder hidden layers
--hidden_dim int 512 128–2048 Neurons per encoder hidden layer
--epochs int 100 10–500 Training epochs
--batch_size int 64 8–512 Batch size
--learning_rate float 0.002 1e-5–0.1 Learning rate
--dropout float 0.2 0–0.9 Encoder dropout rate
--kl_start float 0.0 0–1 KL annealing start weight
--kl_end float 1.0 0–1 KL annealing end weight
--kl_warmup int 50 0–epochs KL warmup epochs
--patience int 10 1–50 Early stopping patience
--language str en en, zh Visualization language
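The three --kl_* flags describe a KL annealing schedule. Assuming a linear ramp (the exact schedule in THETA's code may differ), the per-epoch KL weight would be:

```python
def kl_weight(epoch, kl_start=0.0, kl_end=1.0, kl_warmup=50):
    """Linear KL annealing: ramp from kl_start to kl_end over kl_warmup epochs."""
    if kl_warmup <= 0:
        return kl_end
    progress = min(1.0, epoch / kl_warmup)
    return kl_start + (kl_end - kl_start) * progress

print(kl_weight(0))    # 0.0  -- start of warmup
print(kl_weight(25))   # 0.5  -- halfway through warmup
print(kl_weight(100))  # 1.0  -- warmup finished, full KL term
```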

Baseline Parameters

LDA

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--max_iter int 100 10–500 Maximum EM iterations
--alpha float auto (1/K) >0 Document-topic Dirichlet prior
--vocab_size int 5000 1000–20000 Vocabulary size
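Since the LDA baseline is backed by sklearn, the defaults above map onto LatentDirichletAllocation roughly as follows. This is a sketch on toy counts, not THETA's actual wrapper:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

K = 5
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 50))      # toy document-term count matrix

lda = LatentDirichletAllocation(
    n_components=K,            # --num_topics
    max_iter=10,               # --max_iter (default 100; 10 keeps the demo fast)
    doc_topic_prior=1.0 / K,   # --alpha default: 1/K
    random_state=0,
)
doc_topic = lda.fit_transform(X)           # rows are document-topic distributions
print(doc_topic.shape)                     # (20, 5); each row sums to 1
```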

HDP

Parameter Type Default Range Description
--max_topics int 150 50–300 Upper bound on number of topics
--alpha float 1.0 >0 Document-level concentration parameter
--vocab_size int 5000 1000–20000 Vocabulary size

STM

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--max_iter int 100 10–500 Maximum EM iterations
--vocab_size int 5000 1000–20000 Vocabulary size

BTM

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--n_iter int 100 10–500 Gibbs sampling iterations
--alpha float 1.0 >0 Dirichlet prior for topic distribution
--beta float 0.01 >0 Dirichlet prior for word distribution
--vocab_size int 5000 1000–20000 Vocabulary size

ETM

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--num_layers int 2 1–5 Number of encoder hidden layers
--hidden_dim int 800 128–2048 Neurons per encoder hidden layer
--embedding_dim int 300 50–1024 Word embedding dimension (Word2Vec)
--epochs int 100 10–500 Training epochs
--batch_size int 64 8–512 Batch size
--learning_rate float 0.002 1e-5–0.1 Learning rate
--dropout float 0.5 0–0.9 Dropout rate
--vocab_size int 5000 1000–20000 Vocabulary size

CTM

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--inference_type str zeroshot zeroshot, combined zeroshot (SBERT only) or combined (SBERT + BOW)
--num_layers int 2 1–5 Number of encoder hidden layers
--hidden_dim int 100 32–1024 Neurons per encoder hidden layer
--epochs int 100 10–500 Training epochs
--batch_size int 64 8–512 Batch size
--learning_rate float 0.002 1e-5–0.1 Learning rate
--dropout float 0.2 0–0.9 Dropout rate
--vocab_size int 5000 1000–20000 Vocabulary size

DTM

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--num_layers int 2 1–5 Number of encoder hidden layers
--hidden_dim int 512 128–2048 Neurons per encoder hidden layer
--epochs int 100 10–500 Training epochs
--batch_size int 64 8–512 Batch size
--learning_rate float 0.002 1e-5–0.1 Learning rate
--dropout float 0.2 0–0.9 Dropout rate
--vocab_size int 5000 1000–20000 Vocabulary size

NVDM / GSM / ProdLDA

Parameter Type Default Range Description
--num_topics int 20 5–100 Number of topics K
--num_layers int 2 1–5 Number of encoder hidden layers
--hidden_dim int 256 128–2048 Neurons per encoder hidden layer
--epochs int 100 10–500 Training epochs
--batch_size int 64 8–512 Batch size
--learning_rate float 0.002 1e-5–0.1 Learning rate
--dropout float 0.2 0–0.9 Dropout rate
--vocab_size int 5000 1000–20000 Vocabulary size

BERTopic

Parameter Type Default Range Description
--num_topics int auto ≥2 or None Target number of topics; None = automatic
--min_cluster_size int 10 2–100 Minimum cluster size; controls topic granularity
--top_n_words int 10 1–30 Number of words per topic
--n_neighbors int 15 2–100 UMAP n_neighbors; controls local vs. global structure
--n_components int 5 2–50 UMAP output dimensionality before clustering

Project Structure

/root/
├── ETM/                          # Main codebase
│   ├── run_pipeline.py           # Unified entry point
│   ├── prepare_data.py           # Data preprocessing
│   ├── config.py                 # Configuration management
│   ├── dataclean/                # Data cleaning module
│   ├── model/                    # Model implementations
│   │   ├── theta/                # THETA main model
│   │   ├── baselines/            # 11 baseline models
│   │   └── _reference/           # Reference implementations
│   ├── evaluation/               # Evaluation metrics
│   ├── visualization/            # Visualization tools
│   └── utils/                    # Utilities 
├── agent/                        # Agent system
│   ├── api.py                    # FastAPI endpoints
│   ├── core/                     # Agent implementations
│   ├── config/                   # Configuration management
│   ├── prompts/                  # Prompt templates
│   ├── utils/                    # LLM and vision utilities
│   └── docs/                     # API documentation
├── scripts/                      # Shell scripts for automation
├── embedding/                    # Qwen embedding generation
│   ├── main.py                   # Embedding generation entry point
│   ├── embedder.py               # Embedding computation
│   ├── trainer.py                # Fine-tuning (supervised/unsupervised)
│   └── data_loader.py            # Data loading

Requirements

  • Python 3.10+
  • CUDA recommended for GPU acceleration
  • Key dependencies:
numpy>=1.20.0
scipy>=1.7.0
torch>=1.10.0
transformers>=4.30.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
gensim>=4.1.0
wordcloud>=1.8.0
pyLDAvis>=3.3.0
jieba>=0.42.0

Installation

git clone https://github.com/<YOUR_ORG>/THETA.git
cd THETA

# Install dependencies
pip install -r ETM/requirements.txt

# Or use the setup script
bash scripts/01_setup.sh

🚀 Quick Start Guide

Step 1: SSH Connect to Server

# Standard SSH connection template
ssh -p <PORT> root@<SERVER_IP>

# Example for AutoDL
ssh -p 12345 root@connect.westb.seetacloud.com

Step 2: Clone Project & Configure .env

# Clone the repository
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA

# Copy environment template
cp .env.example .env

# Edit .env file (CRITICAL: Set model paths)
nano .env

🔑 Core Configuration (Must Set):

# .env file - Essential paths
QWEN_MODEL_0_6B=/root/autodl-tmp/models/Qwen3-Embedding-0.6B
QWEN_MODEL_4B=/root/autodl-tmp/models/Qwen3-Embedding-4B      # Optional
QWEN_MODEL_8B=/root/autodl-tmp/models/Qwen3-Embedding-8B      # Optional
SBERT_MODEL=/root/autodl-tmp/models/paraphrase-multilingual-MiniLM-L12-v2

# Optional: Customize embedding length (default: 512)
# MAX_EMBED_LENGTH=1024

Step 3: One-Click Run

# For Chinese datasets
bash scripts/11_quick_start_chinese.sh

# For English datasets
bash scripts/10_quick_start_english.sh

⚡ The Parameter Journey

THETA resolves every setting through a three-level parameter priority system:

┌─────────────────────────────────────────────────────────────────────────┐
│                    PARAMETER PRIORITY HIERARCHY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   P1 (HIGHEST) │ Command-line arguments                                │
│                │ Example: --max_length 1024 --num_topics 30            │
│                │                                                        │
│   P2 (MEDIUM)  │ .env configuration file                               │
│                │ Example: MAX_EMBED_LENGTH=1024 in .env                │
│                │                                                        │
│   P3 (LOWEST)  │ System hardcoded defaults                             │
│                │ Example: max_length=512 in code                       │
│                │                                                        │
└─────────────────────────────────────────────────────────────────────────┘

Example:

# .env has: MAX_EMBED_LENGTH=1024
# Command line: --max_length 2048

# Result: max_length = 2048 (CLI wins!)
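The resolution rule is ordinary fallback logic. A minimal sketch, assuming the `MAX_EMBED_LENGTH` variable from .env (the function name is illustrative, not THETA's actual code):

```python
import os

def resolve_max_length(cli_value=None, default=512):
    """P1 CLI flag > P2 .env variable > P3 hardcoded default."""
    if cli_value is not None:                       # P1: --max_length
        return int(cli_value)
    env_value = os.environ.get("MAX_EMBED_LENGTH")  # P2: .env
    if env_value is not None:
        return int(env_value)
    return default                                  # P3: code default

os.environ["MAX_EMBED_LENGTH"] = "1024"
print(resolve_max_length(2048))  # 2048 -- CLI wins
print(resolve_max_length())      # 1024 -- falls back to .env
```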

🚧 Model Admission Requirements

Universal Rule: Minimum 5 Documents

ALL models require at least 5 independent documents/samples; with fewer, topic estimates are not statistically meaningful.

❌ ERROR: [CRITICAL ERROR] Insufficient data sources (currently 3 files).
         THETA requires at least 5 independent documents for statistical significance.

DTM/STM Special Requirements

Model Input Format Required Columns Error if Missing
DTM .csv only time / year / date SchemaError
STM .csv only At least 1 covariate column (besides text) SchemaError

DTM Example CSV:

text,year
"Document content here...",2020
"Another document...",2021

STM Example CSV:

text,party,region
"Policy document...",Democrat,Northeast
"Another policy...",Republican,South

Error Message:

❌ SchemaError: [CRITICAL] DTM/STM requires structured metadata.
               Please provide a CSV file with time or covariate columns.
               Current columns: ['text']
               DTM requires: time column (time/year/date/timestamp)
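The admission checks above can be sketched as a simple column validator. This is illustrative only (the real SchemaError and column handling live in THETA's codebase):

```python
TIME_COLUMNS = {"time", "year", "date", "timestamp"}

def validate_schema(columns, model):
    """Raise if a DTM/STM dataset lacks the required metadata columns."""
    cols = set(columns)
    if model == "dtm" and not cols & TIME_COLUMNS:
        raise ValueError(
            f"DTM requires a time column ({sorted(TIME_COLUMNS)}); got {sorted(cols)}")
    if model == "stm" and not cols - {"text"}:
        raise ValueError(
            f"STM requires at least one covariate column besides 'text'; got {sorted(cols)}")

validate_schema(["text", "year"], "dtm")             # OK: has a time column
validate_schema(["text", "party", "region"], "stm")  # OK: has covariates
```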

🔬 Adaptive Long-Text Processing

THETA uses Sliding Window + Mean Pooling to embed documents of any length without truncation.

How It Works

┌─────────────────────────────────────────────────────────────────────────┐
│ Input: 50,000 character document (tens of thousands of tokens)          │
│ max_length: 1024 tokens                                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Step 1: Sliding Window Tokenization (20% overlap)                       │
│ ┌──────────────────────────────────────────────────────────────────┐   │
│ │ Window 1: tokens[0:1022]       → Embedding_1                     │   │
│ │ Window 2: tokens[818:1840]     → Embedding_2  (20% overlap)      │   │
│ │ Window 3: tokens[1636:2658]    → Embedding_3                     │   │
│ │ ...                                                              │   │
│ │ Window N: tokens[...end]       → Embedding_N                     │   │
│ └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│ Step 2: Mean Pooling Aggregation                                        │
│ ┌──────────────────────────────────────────────────────────────────┐   │
│ │ Final Embedding = mean(Embedding_1, Embedding_2, ..., Embedding_N)│   │
│ └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│ Result: Single 1024-dim vector preserving ALL document information      │
└─────────────────────────────────────────────────────────────────────────┘

Why This Matters

Old Approach THETA Approach
truncation=True → discards everything past max_length Sliding Window → covers 100% of tokens
Fixed 512-token max Dynamic, up to 8192 tokens
Truncated content lost All content included (mean-pooled)

Console Output:

[Sliding Window] 45/180 texts processed with sliding window
  → max_length=1024, overlap=20%

Physical Safety Limit

HARD_LIMIT_TOKEN = 8192  # Cannot exceed even if set higher in .env

This prevents GPU out-of-memory errors regardless of user configuration.
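Assuming two positions per window are reserved for special tokens (which reproduces the 818-token stride in the diagram above), the window arithmetic and pooling can be sketched as:

```python
import numpy as np

HARD_LIMIT_TOKEN = 8192

def window_starts(n_tokens, max_length=1024, overlap=0.2):
    """Start offsets of sliding windows with 20% overlap (2 slots kept for special tokens)."""
    win = min(max_length, HARD_LIMIT_TOKEN) - 2   # effective tokens per window
    stride = win - int(win * overlap)             # 1022 - 204 = 818 for max_length=1024
    starts, s = [], 0
    while True:
        starts.append(s)
        if s + win >= n_tokens:
            break
        s += stride
    return starts, win

starts, win = window_starts(3000)
print(starts)                                    # [0, 818, 1636, 2454] -- matches the diagram
window_embs = np.random.rand(len(starts), 1024)  # one embedding per window (placeholder)
doc_emb = window_embs.mean(axis=0)               # mean pooling -> single 1024-dim vector
```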

Pre-trained Data from HuggingFace

If pre-trained embeddings and BOW data are not available locally, download from HuggingFace:

Repository: https://huggingface.co/CodeSoulco/THETA

# Download pre-trained data and LoRA weights
bash scripts/09_download_from_hf.sh

# Or manually using Python
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='CodeSoulco/THETA',
    local_dir='/root/autodl-tmp/hf_cache/THETA'
)
"

The HuggingFace repository contains:

  • Pre-computed embeddings for benchmark datasets
  • BOW matrices and vocabularies
  • LoRA fine-tuned weights (optional)

Shell Scripts

All scripts are non-interactive (pure command-line parameters), suitable for DLC/batch environments. No stdin input required.

Script Description
01_setup.sh Install dependencies and download data from HuggingFace
02_clean_data.sh Clean raw text data (tokenization, stopword removal, lemmatization)
02_generate_embeddings.sh Generate Qwen embeddings (sub-script of 03, for failure recovery)
03_prepare_data.sh One-stop data preparation: BOW + embeddings for all 12 models
04_train_theta.sh Train THETA model (train + evaluate + visualize)
05_train_baseline.sh Train 11 baseline models for comparison with THETA
06_visualize.sh Generate visualizations for trained models
07_evaluate.sh Standalone evaluation with 7 unified metrics
08_compare_models.sh Cross-model metric comparison table
09_download_from_hf.sh Download pre-trained data from HuggingFace
10_quick_start_english.sh Quick start for English datasets
11_quick_start_chinese.sh Quick start for Chinese datasets
12_train_multi_gpu.sh Multi-GPU training with DistributedDataParallel
13_test_agent.sh Test LLM Agent connection and functionality
14_start_agent_api.sh Start the Agent API server (FastAPI)

For full parameter references, usage examples, and end-to-end workflow walkthroughs for each script, see the Shell Scripts Reference.


Semantic Enhancement (Embeddings)

THETA uses Qwen-3 embedding models with three size options:

Model Size Embedding Dim Use Case
0.6B 1024 Fast, default
4B 2560 Balanced
8B 4096 Best quality

Embedding modes:

  • zero_shot - Direct embedding without fine-tuning
  • supervised - Fine-tuned with labeled data
  • unsupervised - Fine-tuned without labels

# Generate embeddings for a dataset
python prepare_data.py --dataset my_dataset --model theta --model_size 0.6B --mode zero_shot

# Check if embeddings exist
python prepare_data.py --dataset my_dataset --model theta --model_size 4B --check-only

Output artifacts:

  • {dataset}_{mode}_embeddings.npy - Embedding matrix (N x D)
  • bow_matrix.npz - Bag-of-words matrix
  • vocab.json - Vocabulary list

Topic Modeling

THETA supports multiple topic modeling approaches:

Model Description Time-aware
THETA Qwen embedding + ETM No
LDA Latent Dirichlet Allocation No
ETM Embedded Topic Model No
CTM Contextualized Topic Model No
DTM Dynamic Topic Model Yes

Training outputs (organized by ResultManager):

  • model/theta_k{K}.npy - Document-topic distribution
  • model/beta_k{K}.npy - Topic-word distribution
  • model/training_history_k{K}.json - Training history
  • topicwords/topic_words_k{K}.json - Top words per topic
  • topicwords/topic_evolution_k{K}.json - Topic evolution (DTM only)

Validation & Evaluation

THETA provides unified evaluation with 7 metrics:

Metric Description
PPL Perplexity - model fit
TD Topic Diversity
iRBO Inverse Rank-Biased Overlap
NPMI Normalized PMI coherence
C_V C_V coherence
UMass UMass coherence
Exclusivity Topic exclusivity

from evaluation.unified_evaluator import UnifiedEvaluator

evaluator = UnifiedEvaluator(
    beta=beta,
    theta=theta,
    bow_matrix=bow_matrix,
    vocab=vocab,
    model_name="dtm",
    dataset="edu_data",
    num_topics=20
)

metrics = evaluator.evaluate_all()
evaluator.save_results()  # Saves to evaluation/metrics_k20.json and .csv

Evaluation outputs:

  • evaluation/metrics_k{K}.json - All metrics in JSON format
  • evaluation/metrics_k{K}.csv - All metrics in CSV format
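As a concrete example of one metric: Topic Diversity (TD) is commonly defined as the fraction of unique words among the top-N words of all topics. A minimal sketch (THETA's UnifiedEvaluator may differ in details such as N):

```python
def topic_diversity(topic_words, top_n=25):
    """TD = unique words across all topics' top-N lists / (K * top_n); 1.0 = no overlap."""
    tops = [words[:top_n] for words in topic_words]
    unique = {w for words in tops for w in words}
    return len(unique) / (len(tops) * top_n)

# Two 3-word topics sharing one word -> 5 unique words out of 6 slots
print(topic_diversity([["tax", "budget", "deficit"],
                       ["school", "teacher", "budget"]], top_n=3))  # 0.8333...
```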

Visualization

THETA provides comprehensive visualization with bilingual support (English/Chinese):

# Generate visualizations after training
python run_pipeline.py --dataset edu_data --models dtm --skip-train --language en

# Or use visualization module directly
python -c "
from visualization.run_visualization import run_baseline_visualization
run_baseline_visualization(
    result_dir='/root/autodl-tmp/result/baseline',
    dataset='edu_data',
    model='dtm',
    num_topics=20,
    language='zh'
)
"

Generated charts (20+ types):

  • Topic word bars, word clouds, topic similarity heatmap
  • Document clustering (UMAP), topic network graph
  • Topic evolution (DTM), sankey diagrams
  • Training convergence, coherence metrics
  • pyLDAvis interactive HTML

Output structure:

visualization_k{K}_{lang}_{timestamp}/
├── global/                    # Global charts
│   ├── topic_table.png
│   ├── topic_network.png
│   ├── clustering_heatmap.png
│   ├── topic_wordclouds.png
│   └── ...
├── topics/                    # Per-topic charts
│   ├── topic_0/
│   ├── topic_1/
│   └── ...
└── README.md                  # Summary report

Result Directory Structure

All results are organized using ResultManager:

/root/autodl-tmp/result/baseline/{dataset}/{model}/
├── bow/                    # BOW data and vocabulary
│   ├── bow_matrix.npz
│   ├── vocab.json
│   └── vocab.txt
├── model/                  # Model parameters
│   ├── theta_k{K}.npy
│   ├── beta_k{K}.npy
│   └── training_history_k{K}.json
├── evaluation/             # Evaluation results
│   ├── metrics_k{K}.json
│   └── metrics_k{K}.csv
├── topicwords/             # Topic words
│   ├── topic_words_k{K}.json
│   └── topic_evolution_k{K}.json
└── visualization_k{K}_{lang}_{timestamp}/

Using ResultManager:

from utils.result_manager import ResultManager

# Initialize
manager = ResultManager(
    result_dir='/root/autodl-tmp/result/baseline',
    dataset='edu_data',
    model='dtm',
    num_topics=20
)

# Save all results
manager.save_all(theta, beta, vocab, topic_words, metrics=metrics)

# Load all results
data = manager.load_all(num_topics=20)

# Migrate old flat structure to new structure
from utils.result_manager import migrate_baseline_results
migrate_baseline_results(dataset='edu_data', model='dtm')

Configuration

Dataset configurations are defined in config.py:

DATASET_CONFIGS = {
    "socialTwitter": {
        "vocab_size": 5000,
        "num_topics": 20,
        "min_doc_freq": 5,
        "language": "multi",
    },
    "hatespeech": {
        "vocab_size": 8000,
        "num_topics": 20,
        "min_doc_freq": 10,
        "language": "english",
    },
    "edu_data": {
        "vocab_size": 5000,
        "num_topics": 20,
        "min_doc_freq": 3,
        "language": "chinese",
        "has_timestamp": True,
    },
}

Command-line parameters:

Parameter Description Default
--dataset Dataset name Required
--models Model list (comma-separated) Required
--model_size Qwen model size (THETA) 0.6B
--mode THETA mode zero_shot
--num_topics Number of topics 20
--epochs Training epochs 100
--batch_size Batch size 64
--language Visualization language en
--skip-train Skip training False
--skip-eval Skip evaluation False
--skip-viz Skip visualization False

Supported Datasets

Dataset Documents Language Time-aware
socialTwitter ~40K Spanish/English No
hatespeech ~437K English No
mental_health ~1M English No
FCPB ~854K English No
germanCoal ~9K German No
edu_data ~857 Chinese Yes

License

Apache-2.0


Contributing

Contributions are welcome:

  • New dataset adapters
  • Topic visualization modules
  • Evaluation and reproducibility scripts
  • Documentation improvements

Suggested workflow:

  1. Fork the repo and create a feature branch
  2. Add a minimal reproducible example or tests
  3. Open a pull request

Ethics & Safety

This project analyzes social text and may involve sensitive content.

  • Do not include personally identifiable information (PII)
  • Ensure dataset usage complies with platform terms and research ethics
  • Interpret outputs cautiously; topic discovery does not replace scientific conclusions
  • Be responsible with sensitive domains such as self-harm, hate speech, and political polarization

FAQ

Q: Is this only for Qwen-3?

A: No. Qwen-3 is the reference backbone, but THETA is designed to be model-agnostic. You can adapt it for other embedding models.

Q: What is the difference between ETM and DTM?

A: ETM learns static topics across the corpus; DTM (Dynamic Topic Model) models topic evolution over time and requires timestamps.

Q: Why is STM skipped when I try to train it? How do I use STM?

A: STM (Structural Topic Model) requires document-level covariates (metadata such as year, source, category). Unlike LDA, STM models how metadata influences topic prevalence, so covariates are mandatory. If your dataset doesn't have covariates configured, STM will be automatically skipped.

To use STM:

# 1. Make sure your cleaned CSV has metadata columns (e.g., year, source, category)

# 2. Register covariates in ETM/config.py:
#    DATASET_CONFIGS["my_dataset"] = {
#        "vocab_size": 5000,
#        "num_topics": 20,
#        "language": "english",
#        "covariate_columns": ["year", "source", "category"],  # <-- required for STM
#    }

# 3. Prepare data
bash scripts/03_prepare_data.sh --dataset my_dataset --model stm --vocab_size 5000

# 4. Train STM
bash scripts/05_train_baseline.sh --dataset my_dataset --models stm --num_topics 20

If your dataset has no meaningful metadata, use CTM (same logistic-normal prior, no covariates needed) or LDA instead.

Q: CUDA out of memory — what should I do?

A: Insufficient GPU VRAM. Solutions:

  • Embedding generation (unsupervised/supervised): reduce --batch_size (recommend 4–8)
  • THETA training: reduce --batch_size (recommend 32–64)
  • Check for other processes using the GPU: nvidia-smi
  • Kill zombie processes: kill -9 <PID>

Q: EMB shows ✗ (embeddings not generated)

A: Embedding generation failed (usually OOM) but the script did not exit with an error. Regenerate with a smaller batch_size:

bash scripts/02_generate_embeddings.sh \
    --dataset edu_data --mode unsupervised --model_size 0.6B \
    --batch_size 4 --gpu 0 \
    --exp_dir /root/autodl-tmp/result/0.6B/edu_data/data/exp_xxx

Q: How to choose an embedding mode?

Scenario Recommended Mode Reason
Quick testing zero_shot No training needed, completes in seconds
Unlabeled data unsupervised LoRA fine-tuning adapts to the domain
Labeled data supervised Leverages label information to enhance embeddings
Large datasets zero_shot Avoids lengthy fine-tuning

Q: How to choose the number of topics K?

  • Small datasets (<1000 docs): K = 5–15
  • Medium datasets (1000–10000): K = 10–30
  • Large datasets (>10000): K = 20–50
  • Use hdp or bertopic to auto-determine topic count as a reference
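The guideline above as a tiny helper (illustrative only, not part of THETA):

```python
def suggest_k_range(n_docs):
    """Rule-of-thumb K range from corpus size, per the guidelines above."""
    if n_docs < 1000:
        return (5, 15)
    if n_docs <= 10000:
        return (10, 30)
    return (20, 50)

print(suggest_k_range(857))     # (5, 15)  -- e.g. the edu_data corpus
print(suggest_k_range(40000))   # (20, 50) -- e.g. socialTwitter
```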

Q: What does the visualization --language parameter do?

  • en: Chart titles, axes, and legends in English
  • zh: Chart titles, axes, and legends in Chinese (e.g., "主题表", "训练损失图")
  • Only affects visualization; does not affect model training or evaluation

Q: What is the difference between BOW --language and visualization --language?

Parameter Script Values Purpose
--language in 03_prepare_data.sh BOW generation english, chinese Controls tokenization and stopword filtering
--language in 04_train_theta.sh Visualization en, zh Controls chart label language
--language in 05_train_baseline.sh Visualization en, zh Controls chart label language

Q: Can I add my own dataset?

A: Yes. Prepare a cleaned CSV with text column (and optionally year for DTM, or metadata columns for STM), then add configuration to config.py:

DATASET_CONFIGS["my_dataset"] = {
    "vocab_size": 5000,
    "num_topics": 20,
    "min_doc_freq": 5,
    "language": "english",
    # Optional: for STM (document-level metadata)
    # "covariate_columns": ["year", "source", "category"],
    # Optional: for DTM (time-aware)
    # "has_timestamp": True,
}

Citation

If you find THETA useful in your research, please consider citing our paper:

@article{duan2026theta,
  title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
  author={Codesoul.co},
  journal={arXiv preprint arXiv:2603.05972},
  year={2026},
  doi={10.48550/arXiv.2603.05972}
}

Contact

Please contact us if you have any questions.
