Spatio-temporal action recognition for furniture assembly using synchronized multi-camera RGB video, ResNet-50 spatial features, view attention, and a temporal Transformer classifier.
| Capability | Detail |
|---|---|
| Camera setup | 8 synchronized Intel RealSense D435i cameras |
| Action scope | 5 furniture-assembly actions plus a background/no-action label |
| Spatial encoder | ResNet-50 ImageNet features with a 2048-dim pooled output |
| View fusion | Multi-head attention across camera views |
| Temporal reasoning | Transformer encoder over frame sequences |
| Reported results | 88.4% frame-level accuracy and about 0.82 macro-F1 |
Demo video: assets/HAU_demo.mp4
This project recognizes furniture-assembly actions from synchronized multi-view video. Each sample contains aligned frame sequences from eight RealSense D435i cameras, allowing the model to reduce occlusion sensitivity and capture human-object interactions from multiple angles.
The model predicts a frame-sequence action label by:
- Extracting per-frame spatial features from every view with ResNet-50
- Applying multi-head attention to weight and fuse evidence across views
- Modeling how the action evolves over time with a Transformer encoder
- Pooling the sequence representation and classifying the action
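For orientation, the sketch below shows how the per-frame spatial features could be computed with a torchvision ResNet-50 whose classification head is removed. The tensor layout (batch, views, time, channels, height, width) and the function name are illustrative assumptions, not the exact interface of src/models/spatial_feature_extractor.py.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative sketch only: shapes and names are assumptions, not the repo's API.
# Input: a batch of synchronized clips, shape (B, V, T, 3, H, W)
#   B = batch size, V = 8 camera views, T = frames per sequence.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()   # keep the 2048-dim pooled features
backbone.eval()

def extract_frame_features(clips: torch.Tensor) -> torch.Tensor:
    B, V, T, C, H, W = clips.shape
    flat = clips.reshape(B * V * T, C, H, W)      # fold views and time into the batch
    with torch.no_grad():
        feats = backbone(flat)                    # (B*V*T, 2048)
    return feats.reshape(B, V, T, 2048)           # per-view, per-frame embeddings
```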
┌─────────────────────────────────────┐
│ 8 Synchronized Camera Views │
│ Intel RealSense D435i Streams │
└──────────────────┬──────────────────┘
│ RGB frame sequences
▼
┌─────────────────────────────────────┐
│ ResNet-50 Spatial Encoder │
│ 2048-dim pooled frame features │
└──────────────────┬──────────────────┘
│ Per-view embeddings
▼
┌─────────────────────────────────────┐
│ Multi-Head View Attention │
│ Cross-view feature fusion │
└──────────────────┬──────────────────┘
│ View-aware sequence features
▼
┌─────────────────────────────────────┐
│ Temporal Transformer Encoder │
│ Action dynamics over time │
└──────────────────┬──────────────────┘
│ Temporal representations
▼
┌─────────────────────────────────────┐
│ Mean Pooling Across Time & Views │
│ Sequence-level representation │
└──────────────────┬──────────────────┘
│ Classification head input
▼
┌─────────────────────────────────────┐
│ Action Classifier │
│ Furniture assembly prediction │
└─────────────────────────────────────┘
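The remaining stages compose roughly as in the hedged PyTorch sketch below, which approximates view fusion with nn.MultiheadAttention and temporal modeling with nn.TransformerEncoder. Hidden sizes, layer counts, and the exact pooling order are assumptions, not a drop-in copy of src/models/multiview_action_recognition_model.py.

```python
import torch
import torch.nn as nn

class MultiViewActionSketch(nn.Module):
    """Minimal sketch: view attention -> temporal Transformer -> pooling -> classifier.
    Dimensions and layer counts are assumptions, not the repository's configuration."""
    def __init__(self, feat_dim=2048, d_model=512, num_classes=6):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.view_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, feats):                               # feats: (B, V, T, 2048)
        B, V, T, _ = feats.shape
        x = self.proj(feats)                                # (B, V, T, d_model)
        x = x.permute(0, 2, 1, 3).reshape(B * T, V, -1)     # attend across views per frame
        x, _ = self.view_attn(x, x, x)                      # cross-view fusion
        x = x.mean(dim=1).reshape(B, T, -1)                 # pool views -> (B, T, d_model)
        x = self.temporal(x)                                # action dynamics over time
        x = x.mean(dim=1)                                   # mean pool across time
        return self.classifier(x)                           # logits over 6 action classes
```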
| Metric | Reported Value |
|---|---|
| Frame-level accuracy | 88.4% |
| Macro-F1 | ~0.82 |
The metrics are computed frame by frame against the action annotations. The evaluation script reports accuracy, macro-F1, loss, and a confusion matrix for the selected split.
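Given per-frame predictions and ground-truth labels, the reported quantities can be computed with scikit-learn roughly as follows; the flat 1-D array layout is an assumption about how evaluate.py organizes its outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# y_true / y_pred: per-frame class indices (0 = no action, 1..5 = assembly actions).
y_true = np.array([0, 1, 2, 3, 5, 2])
y_pred = np.array([0, 1, 1, 3, 5, 2])

frame_acc = accuracy_score(y_true, y_pred)                    # frame-level accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")          # macro-F1 across classes
cm = confusion_matrix(y_true, y_pred, labels=np.arange(6))    # 6x6 confusion matrix
print(f"accuracy={frame_acc:.3f}  macro-F1={macro_f1:.3f}")
```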
The dataset was collected for the Projektpraktikum Human Activity Understanding project using 8 synchronized Intel RealSense D435i cameras. Each sequence contains eight aligned views, one per camera.
Download the dataset from Google Drive:
https://drive.google.com/file/d/1ArpoIoj6K5msEck54Qc2nmbfTys39UEO/view?usp=drive_link
Annotation details:
- Labels were created on View 5 with the VGG Image Annotator (VIA) tool.
- Because all cameras are synchronized, View 5 annotations align with corresponding frames from the other seven views.
- data/annotations/action_labels.csv stores temporal segments, frame ranges, action labels, sequence IDs, view IDs, and FPS-derived frame differences.
- data/annotations/classes.json includes "no action" plus five assembly actions: inserting vertical short rods, inserting elbows, inserting horizontal short rods, inserting horizontal long rods, and flip.
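As a rough illustration of how those files can be consumed, the snippet below expands the temporal segments into per-frame labels with pandas; the column names (sequence_id, start_frame, end_frame, action) are assumed for the example and should be checked against the actual CSV header.

```python
import json
import pandas as pd

# Column names here are assumptions for illustration; verify them against
# data/annotations/action_labels.csv before relying on this.
classes = json.load(open("data/annotations/classes.json"))
segments = pd.read_csv("data/annotations/action_labels.csv")

def frame_labels(seq_id: str, num_frames: int) -> list:
    """Expand temporal segments into a per-frame label list, defaulting to 'no action'."""
    labels = ["no action"] * num_frames
    for _, row in segments[segments["sequence_id"] == seq_id].iterrows():
        for f in range(int(row["start_frame"]), int(row["end_frame"]) + 1):
            if f < num_frames:
                labels[f] = row["action"]
    return labels
```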
git clone https://github.com/mumerabbasi/HumanActionRecognition.git
cd HumanActionRecognition
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Place the downloaded dataset under data/raw/, then preprocess it into train, validation, and test splits:
python -m data.preprocess_data \
--source_dir data/raw \
--dest_dir data/processed \
--train_pct 0.70 \
--val_pct 0.15 \
--test_pct 0.15
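Conceptually, the percentages above split the data at the sequence level, so that all eight views of a recording land in the same split; a minimal sketch of such a split is shown below, though the real logic in the preprocessing utilities may differ in detail.

```python
import random

def split_sequences(sequence_ids, train_pct=0.70, val_pct=0.15, seed=0):
    """Illustrative 70/15/15 split over whole sequences (so all 8 views of a
    sequence stay in the same split). Not the repository's exact logic."""
    ids = list(sequence_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_pct)
    n_val = int(len(ids) * val_pct)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_sequences([f"seq_{i:03d}" for i in range(40)])
```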
Configuration lives in configs/default.yaml. The default setup trains the ResNet-50 plus view-attention plus temporal-Transformer model on data/processed.

python -m src.training.train --config configs/default.yaml

Training logs and checkpoints are written under the configured output_dir.
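As a rough picture of one training epoch, the skeleton below pairs cross-entropy classification with per-epoch checkpointing to output_dir; the optimizer, learning rate, and checkpoint naming are assumptions rather than what train.py actually configures.

```python
import torch
import torch.nn as nn

# Skeleton only: the real train.py drives all of this from configs/default.yaml.
def train_loop(model, train_loader, num_epochs, output_dir, device="cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer/lr
    for epoch in range(num_epochs):
        model.train()
        for inputs, labels in train_loader:      # input layout depends on the dataset class
            logits = model(inputs.to(device))
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f"{output_dir}/model_epoch_{epoch + 1}.pth")
```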
Evaluate a saved checkpoint on the configured split:
python -m src.training.evaluate \
--config configs/default.yaml \
--checkpoint output/custom_models/run_YYYYMMDD_HHMMSS/best_model_epoch_10.pth

Evaluation logs are written under output/evaluation_results by default.
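A quick way to inspect such a checkpoint before evaluation is shown below; whether train.py stores a plain state_dict or a wrapper dictionary with extra metadata is an assumption to verify.

```python
import torch

ckpt_path = "output/custom_models/run_YYYYMMDD_HHMMSS/best_model_epoch_10.pth"
state = torch.load(ckpt_path, map_location="cpu")

# A plain state_dict prints parameter names; a wrapper dict prints metadata keys
# (e.g. epoch or optimizer state) instead.
print(type(state), list(state)[:5])
```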
HumanActionRecognition/
|-- assets/
| `-- HAU_demo.mp4
|-- configs/
| `-- default.yaml
|-- data/
| |-- annotations/
| | |-- action_labels.csv
| | `-- classes.json
| `-- preprocess_data.py
|-- src/
| |-- data/
| | |-- dataset.py
| | `-- transforms.py
| |-- models/
| | |-- attention_views.py
| | |-- multiview_action_recognition_model.py
| | |-- spatial_feature_extractor.py
| | `-- transformer_encoder_temporal.py
| |-- training/
| | |-- evaluate.py
| | `-- train.py
| `-- utils/
| |-- helper.py
| |-- logger.py
| `-- preprocess_data_utils.py
`-- requirements.txt
| Area | Use |
|---|---|
| Assembly assistance | Recognize progress and mistakes during furniture assembly |
| Human activity understanding | Study fine-grained human-object interaction from synchronized views |
| Robotics | Provide action context for collaborative or assistive robots |
| Computer vision research | Benchmark multi-view attention and temporal sequence modeling |
Built at Technical University of Munich | EI78075 Projektpraktikum Human Activity Understanding