Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.
Batch Level 1 Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.
What you'll do:
- Ingest raw CSV data into BigQuery
- Design and implement staging transformations (prepare data for analytics)
- Create production aggregations for business analytics
- Choose between SQL or Python implementation approach
Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/CANDIDATE_INSTRUCTIONS.md
Batch Level 2 Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.
What you'll do:
- Write pseudocode (not production code)
- Handle incremental file processing
- Implement deduplication strategy
- Ensure idempotency
Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md
Prerequisites:
- Python 3.8+
- GCP credentials (provided by interviewer via encrypted email)
- Git
Repository structure:

```
├── exercises/
│   ├── batch_level_1/
│   │   ├── candidate_solution/        # Your working directory
│   │   ├── seeds/                     # Sample data
│   │   ├── CANDIDATE_INSTRUCTIONS.md
│   │   └── README.md
│   └── batch_level_2/
│       ├── *.csv                      # Sample event files
│       └── README.md
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
1. Clone this repository

```bash
git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
cd data-engineer-candidate-test
```
2. Set up Python environment

Windows PowerShell:

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Linux/Mac:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
3. Configure GCP credentials

To receive credentials:

- Contact the repository owner at daniel@0x.se
- You'll receive the service account JSON file via encrypted email or secure file transfer
- Place the file in `config/data-engineer-recruitment.json`

⚠️ Security: DO NOT commit credentials to Git. The `.gitignore` file is configured to prevent this.
4. Set your candidate ID (for Batch Level 1)

Windows PowerShell:

```powershell
$env:CANDIDATE_ID = "candidate_yourname"
```

Linux/Mac:

```bash
export CANDIDATE_ID="candidate_yourname"
```

This creates isolated BigQuery datasets: `candidate_yourname_raw`, `candidate_yourname_staging`, and `candidate_yourname_prod`.
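For reference, here is a minimal sketch of how a pipeline might derive and create these per-candidate datasets from the environment variable. The helper names below are illustrative, not the starter repo's actual code:

```python
import os
from google.cloud import bigquery

def candidate_datasets() -> dict:
    """Derive per-candidate dataset names from CANDIDATE_ID (illustrative helper)."""
    candidate_id = os.environ["CANDIDATE_ID"]  # e.g. "candidate_yourname"
    return {layer: f"{candidate_id}_{layer}" for layer in ("raw", "staging", "prod")}

def ensure_datasets(client: bigquery.Client, location: str = "EU") -> None:
    """Create each dataset in the project if it does not already exist."""
    for dataset_id in candidate_datasets().values():
        dataset = bigquery.Dataset(f"{client.project}.{dataset_id}")
        dataset.location = location
        client.create_dataset(dataset, exists_ok=True)  # no error if it already exists
```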
5. Verify setup

```bash
cd exercises/batch_level_1/candidate_solution
python scripts/test_bigquery_connection.py
```
Batch Level 1 requires working code that runs in BigQuery:
```bash
cd exercises/batch_level_1/candidate_solution

# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files

# Run the full pipeline
python -m src.pipeline full

# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue
```

What to deliver:
- Staging transformations (prepare raw data for analytics use)
- Production aggregation (country-level revenue metrics)
- Working pipeline that produces correct results
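To make the production-aggregation deliverable concrete, here is a rough sketch of a country-level revenue rollup using the BigQuery Python client. The column names (`user_id`, `country`, `amount`) and the join are assumptions about the seed schema, not a prescribed solution:

```python
from google.cloud import bigquery

# Hypothetical column names -- check the actual seed schema before using.
COUNTRY_REVENUE_SQL = """
CREATE OR REPLACE TABLE `{project}.{prod}.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `{project}.{staging}.payments` AS p
JOIN `{project}.{staging}.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
"""

def build_country_revenue(client: bigquery.Client, staging: str, prod: str) -> None:
    """Materialize the country-level revenue table as a single BigQuery job."""
    sql = COUNTRY_REVENUE_SQL.format(project=client.project, staging=staging, prod=prod)
    client.query(sql).result()  # .result() blocks until the job completes
```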
Batch Level 2 evaluates both your implementation and design thinking:
```bash
cd exercises/batch_level_2

# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs
```

Example approach:
```text
FUNCTION process_incremental_data(run_date):
    // Step 1: Identify new files since last run
    new_files = list_gcs_files(path, after=last_processed_timestamp)

    // Step 2: Load and deduplicate
    events = load_csv_files(new_files)
        .deduplicate(on=event_id, keep=first)

    // Step 3: Merge into daily table (idempotent)
    MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...

    // Step 4: Record processed files
    update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
```
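If you go beyond pseudocode, the idempotent MERGE step could be sketched in Python against BigQuery roughly as follows. The table and column names mirror the pseudocode above and are placeholders, not a fixed schema:

```python
from google.cloud import bigquery

# Placeholder table/column names mirroring the pseudocode; adapt to your schema.
MERGE_SQL = """
MERGE `{dataset}.daily_metrics` AS t
USING (
  SELECT user_id, DATE(event_ts) AS date, COUNT(*) AS events
  FROM `{dataset}.staged_events`
  GROUP BY user_id, date
) AS s
ON t.user_id = s.user_id AND t.date = s.date
WHEN MATCHED THEN
  UPDATE SET events = s.events
WHEN NOT MATCHED THEN
  INSERT (user_id, date, events) VALUES (s.user_id, s.date, s.events)
"""

def merge_daily_metrics(client: bigquery.Client, dataset: str) -> None:
    """Upsert aggregated events; reruns overwrite existing rows instead of duplicating them."""
    client.query(MERGE_SQL.format(dataset=dataset)).result()
```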
GCP environment:

- Project: `data-engineer-recruitment`
- Location: EU

Your Datasets (isolated per candidate):
- `candidate_yourname_raw` - Raw ingested data
- `candidate_yourname_staging` - Prepared data (staging)
- `candidate_yourname_prod` - Production analytics tables
Seed Data (for Batch Level 1):
- Location: `gs://de-recruitment-test-seeds-2026/`
- Files: `users.csv`, `payments.csv`
- Auto-loaded by the ingestion script
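The starter repo's ingestion script handles the seed load for you; purely to illustrate the loading pattern (the function below and its table naming are assumptions, not the script's actual interface), a GCS-to-BigQuery CSV load looks roughly like this:

```python
from google.cloud import bigquery

def load_seed_csv(client: bigquery.Client, raw_dataset: str, name: str) -> None:
    """Load one seed CSV from GCS into a raw table, letting BigQuery infer the schema."""
    uri = f"gs://de-recruitment-test-seeds-2026/{name}.csv"
    table_id = f"{client.project}.{raw_dataset}.{name}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                  # header row
        autodetect=True,                      # infer column types
        write_disposition="WRITE_TRUNCATE",   # reloads replace the table, keeping the step idempotent
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()

# e.g. load_seed_csv(client, "candidate_yourname_raw", "users")
```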
Technical Skills
- Correct data combination logic (joins, lookups)
- Proper null and missing data handling
- Understanding of staging layer purpose (preparing data for analytics)
- Understanding of idempotency and deduplication
- Appropriate aggregation and window functions (see the sketch after this list)
- Knowledge of BigQuery/SQL or pandas operations
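For example, the deduplication point above is often expressed with a ROW_NUMBER window function; a sketch, assuming `event_id` and `loaded_at` columns that may not match the actual data:

```python
# Deduplication via a window function: keep the newest row per event_id.
# Column names (event_id, loaded_at) are assumptions, not the exercise schema.
DEDUP_SQL = """
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at DESC) AS rn
  FROM `{dataset}.raw_events`
)
WHERE rn = 1
"""

# pandas equivalent for a small local sample:
# df.sort_values("loaded_at").drop_duplicates(subset="event_id", keep="last")
```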
Problem-Solving Approach
- Asking clarifying questions
- Identifying edge cases
- Explaining trade-offs
- Considering production concerns (scale, cost, monitoring)
Communication
- Clear pseudocode with comments
- Explaining reasoning behind design choices
- Documenting assumptions
Tips:

- Read the problem carefully - Understand requirements before coding
- Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
- Think production - Consider scale, failures, monitoring
- Explain your reasoning - Add comments about trade-offs
- Test your code - For Batch Level 1, verify your results
- Syntax doesn't matter - SQL-like, Python-like, or plain English all work
- Logic matters - Show clear data flow and transformations
- Comment your reasoning - Explain why you chose this approach
- Consider edge cases - What happens with nulls, duplicates, late data?
Check:
- Credentials file exists: `config/data-engineer-recruitment.json`
- `CANDIDATE_ID` environment variable is set
- Run the test script: `python scripts/test_bigquery_connection.py`
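If the provided test script fails, a quick manual check along these lines can help isolate whether credentials or permissions are the problem (a minimal sketch assuming the google-cloud-bigquery package and the credentials path shown above):

```python
import os
from google.cloud import bigquery
from google.oauth2 import service_account

# Authenticate explicitly with the provided key file.
creds = service_account.Credentials.from_service_account_file(
    "config/data-engineer-recruitment.json"
)
client = bigquery.Client(credentials=creds, project="data-engineer-recruitment")

# A trivial query proves that authentication and job permissions work.
rows = list(client.query("SELECT 1 AS ok").result())
print("BigQuery reachable:", rows[0].ok == 1)
print("CANDIDATE_ID set:", bool(os.environ.get("CANDIDATE_ID")))
```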
If your candidate datasets are missing: datasets are created automatically when you run the pipeline. If you need to create them manually:
```bash
bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod
```

If scripts fail to run, make sure you're in the correct directory and the virtual environment is activated:

```bash
cd exercises/batch_level_1/candidate_solution
source .venv/bin/activate      # Linux/Mac
# or
.venv\Scripts\Activate.ps1     # Windows
```

Questions or technical issues?
- Email: daniel@0x.se
- Include: Your candidate ID, error messages, and what you've tried
GCP Credentials:
- Shared via encrypted email only
- Never commit to Git (protected by `.gitignore`)
- Minimal permissions (BigQuery only)
- Will be revoked after assessment
Your Data:
- All datasets isolated per candidate
- No access to other candidates' data
- Datasets cleaned up after interview
Good luck! We're excited to see your approach to these problems.
Batch Level 2 at a glance:

- Input: Event data with duplicates and complex rules
- Output: Cleaned dataset with business logic applied
- Focus: Deduplication strategies, data quality, incremental processing
Your solutions will be evaluated on:
- Correctness: Does it produce the expected results?
- Code Quality: Clean, readable, well-documented code
- Problem Solving: Efficient algorithms and data structures
- Production Readiness: Error handling, scalability considerations, idempotency
- Documentation: Clear explanations of your approach
The provided environment includes:

- BigQuery for data warehousing
- Cloud Storage for file storage
- Service account with necessary permissions
- Sample datasets pre-loaded
Recommended approach:

- Review the problem statements carefully
- Ask clarifying questions if requirements are unclear
- Test with the provided sample data
- Document your assumptions and decisions
When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.
Good luck! We're excited to see your data engineering solutions.