Data Engineering Technical Assessment

Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.


📋 What's Included

Exercise 1: Batch Level 1 - Full ETL Pipeline Implementation (30-45 min)

Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.

What you'll do:

  • Ingest raw CSV data into BigQuery
  • Design and implement staging transformations (prepare data for analytics)
  • Create production aggregations for business analytics
  • Choose between SQL or Python implementation approach

Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/CANDIDATE_INSTRUCTIONS.md

Exercise 2: Batch Level 2 - Incremental Daily Build (20 min)

Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.

What you'll do:

  • Write pseudocode (not production code)
  • Handle incremental file processing
  • Implement deduplication strategy
  • Ensure idempotency

Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md


🚀 Getting Started

Prerequisites

  • Python 3.8+
  • GCP credentials (provided by interviewer via encrypted email)
  • Git

Repository Structure

β”œβ”€β”€ exercises/
β”‚   β”œβ”€β”€ batch_level_1/
β”‚   β”‚   β”œβ”€β”€ candidate_solution/     # Your working directory
β”‚   β”‚   β”œβ”€β”€ seeds/                  # Sample data
β”‚   β”‚   β”œβ”€β”€ CANDIDATE_INSTRUCTIONS.md
β”‚   β”‚   └── README.md
β”‚   └── batch_level_2/
β”‚       β”œβ”€β”€ *.csv                   # Sample event files
β”‚       └── README.md
β”œβ”€β”€ requirements.txt                # Python dependencies
└── README.md                       # This file

Setup Instructions

  1. Clone this repository

    git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
    cd data-engineer-candidate-test
  2. Set up Python environment

    Windows PowerShell:

    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install -r requirements.txt

    Linux/Mac:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Configure GCP credentials

    To receive credentials:

    1. Contact the repository owner at daniel@0x.se
    2. You'll receive the service account JSON file via encrypted email or secure file transfer
    3. Place the file in config/data-engineer-recruitment.json

    ⚠️ Security: DO NOT commit credentials to Git. The .gitignore file is configured to prevent this.

  4. Set your candidate ID (for Batch Level 1)

    Windows PowerShell:

    $env:CANDIDATE_ID = "candidate_yourname"

    Linux/Mac:

    export CANDIDATE_ID="candidate_yourname"

    This creates isolated BigQuery datasets: candidate_yourname_raw, candidate_yourname_staging, candidate_yourname_prod
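
    The datasets are created automatically when the pipeline runs. If you ever script this step yourself, a minimal sketch (not part of the provided scripts; it assumes GOOGLE_APPLICATION_CREDENTIALS points at the service-account key) could derive the names from this variable and create them idempotently:

    import os
    from google.cloud import bigquery

    candidate = os.environ["CANDIDATE_ID"]             # e.g. "candidate_yourname"
    client = bigquery.Client(project="data-engineer-recruitment")

    for suffix in ("raw", "staging", "prod"):
        dataset = bigquery.Dataset(f"{client.project}.{candidate}_{suffix}")
        dataset.location = "EU"                         # all datasets live in the EU location
        client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists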

  5. Verify setup

    cd exercises/batch_level_1/candidate_solution
    python scripts/test_bigquery_connection.py
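
    The provided script performs this check for you; a minimal equivalent sketch (the key path and query are illustrative only, and the actual script may differ) looks roughly like:

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Load the service-account key shared by the interviewer
    creds = service_account.Credentials.from_service_account_file(
        "config/data-engineer-recruitment.json"
    )
    client = bigquery.Client(project="data-engineer-recruitment",
                             credentials=creds, location="EU")

    # A trivial query proves both authentication and BigQuery access
    rows = list(client.query("SELECT 1 AS ok").result())
    print("BigQuery connection OK:", rows[0].ok)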

📖 Working on Exercises

Batch Level 1 - Full Implementation

This exercise requires working code that runs in BigQuery:

cd exercises/batch_level_1/candidate_solution

# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files

# Run the full pipeline
python -m src.pipeline full

# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue

What to deliver:

  • Staging transformations (prepare raw data for analytics use)
  • Production aggregation (country-level revenue metrics)
  • Working pipeline that produces correct results
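
For the production aggregation, a rough sketch of the final step is shown below. It assumes the staging tables expose user_id, country, and amount columns (verify against the seed data) and that you take the Python route; the pure SQL approach would be the statement inside the string.

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Country-level revenue metrics built from the staging layer (column names assumed)
client.query(f"""
CREATE OR REPLACE TABLE `{candidate}_prod.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `{candidate}_staging.payments` AS p
JOIN `{candidate}_staging.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
""").result()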

Batch Level 2 - Implementation or Pseudocode

This exercise evaluates both your implementation and design thinking:

cd exercises/batch_level_2

# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs

Example approach:

FUNCTION process_incremental_data(run_date):
  // Step 1: Identify new files since last run
  new_files = list_gcs_files(path, after=last_processed_timestamp)
  
  // Step 2: Load and deduplicate
  events = load_csv_files(new_files)
            .deduplicate(on=event_id, keep=first)
  
  // Step 3: Merge into daily table (idempotent)
  MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...
  
  // Step 4: Record processed files
  update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
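
If you choose to implement Batch Level 2 rather than write pseudocode, the merge step above translates into a single BigQuery statement along these lines. This is a sketch only: the raw events table and its event_id, event_ts, loaded_at, and amount columns, and the daily_metrics target, are assumptions, and it assumes each run recomputes the affected (user_id, date) totals from the loaded files.

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Deduplicate on event_id first, then MERGE so rerunning the same day changes nothing
client.query(f"""
MERGE INTO `{candidate}_prod.daily_metrics` AS t
USING (
  SELECT
    user_id,
    DATE(event_ts) AS date,
    SUM(amount)    AS amount
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at) AS rn
    FROM `{candidate}_raw.events`
  )
  WHERE rn = 1  -- keep one copy of each resent event
  GROUP BY user_id, date
) AS s
ON t.user_id = s.user_id AND t.date = s.date
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT (user_id, date, amount) VALUES (s.user_id, s.date, s.amount)
""").result()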

📊 GCP Infrastructure Details

Project: data-engineer-recruitment
Location: EU

Your Datasets (isolated per candidate):

  • candidate_yourname_raw - Raw ingested data
  • candidate_yourname_staging - Prepared data (staging)
  • candidate_yourname_prod - Production analytics tables

Seed Data (for Batch Level 1):

  • Location: gs://de-recruitment-test-seeds-2026/
  • Files: users.csv, payments.csv
  • Auto-loaded by ingestion script
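
The ingestion script takes care of this step; purely as illustration, loading one seed file into the raw dataset with the BigQuery client could look like the following (the target table name and schema autodetection are assumptions):

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Load users.csv from the seed bucket, replacing any earlier load of the table
job = client.load_table_from_uri(
    "gs://de-recruitment-test-seeds-2026/users.csv",
    f"{client.project}.{candidate}_raw.users",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,              # header row
        autodetect=True,                  # infer the schema from the file
        write_disposition="WRITE_TRUNCATE",
    ),
)
job.result()  # wait for the load to finish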

💡 Assessment Guidelines

What We're Looking For

Technical Skills ✅

  • Correct data combination logic (joins, lookups)
  • Proper null and missing data handling
  • Understanding of staging layer purpose (preparing data for analytics)
  • Understanding of idempotency and deduplication
  • Appropriate aggregation and window functions
  • Knowledge of BigQuery/SQL or pandas operations
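
To make the first few bullets concrete, a staging transformation over the raw users table might look roughly like this (country, signup_date, and the ordering used for deduplication are assumptions about the seed data; adapt to what you actually find):

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Normalise nulls, cast types defensively, and keep one row per user_id
client.query(f"""
CREATE OR REPLACE TABLE `{candidate}_staging.users` AS
SELECT
  user_id,
  COALESCE(country, 'unknown')   AS country,      -- explicit null handling
  SAFE_CAST(signup_date AS DATE) AS signup_date   -- tolerate malformed dates
FROM `{candidate}_raw.users`
WHERE TRUE  -- QUALIFY needs a WHERE/GROUP BY/HAVING clause in BigQuery
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY signup_date DESC) = 1
""").result()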

Problem-Solving Approach ✅

  • Asking clarifying questions
  • Identifying edge cases
  • Explaining trade-offs
  • Considering production concerns (scale, cost, monitoring)

Communication ✅

  • Clear pseudocode with comments
  • Explaining reasoning behind design choices
  • Documenting assumptions

Tips for Success

  1. Read the problem carefully - Understand requirements before coding
  2. Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
  3. Think production - Consider scale, failures, monitoring
  4. Explain your reasoning - Add comments about trade-offs
  5. Test your code - For Batch Level 1, verify your results

Pseudocode Guidelines

  • Syntax doesn't matter - SQL-like, Python-like, or plain English all work
  • Logic matters - Show clear data flow and transformations
  • Comment your reasoning - Explain why you chose this approach
  • Consider edge cases - What happens with nulls, duplicates, late data?

🔧 Troubleshooting

Can't connect to BigQuery

Check:

  1. Credentials file exists: config/data-engineer-recruitment.json
  2. CANDIDATE_ID environment variable is set
  3. Run test script: python scripts/test_bigquery_connection.py

Datasets not created

Solution: Datasets are created automatically when you run the pipeline. If you need to create them manually:

bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod

Import errors

Solution: Make sure you're in the correct directory and that the virtual environment is activated:

# Activate the virtual environment from the repository root first
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\Activate.ps1  # Windows
# then change into the exercise directory
cd exercises/batch_level_1/candidate_solution

📞 Support

Questions or technical issues?

  • Email: daniel@0x.se
  • Include: Your candidate ID, error messages, and what you've tried

πŸ” Security & Credentials

GCP Credentials:

  • Shared via encrypted email only
  • Never commit to Git (protected by .gitignore)
  • Minimal permissions (BigQuery only)
  • Will be revoked after assessment

Your Data:

  • All datasets isolated per candidate
  • No access to other candidates' data
  • Datasets cleaned up after interview


Batch Level 2: Deduplication Pipeline

  • Input: Event data with duplicates and complex business rules
  • Output: Cleaned dataset with business logic applied
  • Focus: Deduplication strategies, data quality, incremental processing

Evaluation Criteria

Your solutions will be evaluated on:

  • Correctness: Does it produce the expected results?
  • Code Quality: Clean, readable, well-documented code
  • Problem Solving: Efficient algorithms and data structures
  • Production Readiness: Error handling, scalability considerations, idempotency
  • Documentation: Clear explanations of your approach

GCP Resources Available

  • BigQuery for data warehousing
  • Cloud Storage for file storage
  • Service account with necessary permissions
  • Sample datasets pre-loaded

Getting Help

  • Review the problem statements carefully
  • Ask clarifying questions if requirements are unclear
  • Test with the provided sample data
  • Document your assumptions and decisions

Submission

When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.


Good luck! We're excited to see your data engineering solutions. 🚀
