Data Engineering Technical Assessment

Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.


📋 What's Included

Exercise 1: Batch Level 1 - Full ETL Pipeline Implementation (30-45 min)

Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.

What you'll do:

  • Ingest raw CSV data into BigQuery
  • Design and implement staging transformations (prepare data for analytics)
  • Create production aggregations for business analytics
  • Choose between SQL or Python implementation approach

Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/CANDIDATE_INSTRUCTIONS.md

Exercise 2: Batch Level 2 - Incremental Daily Build (20 min)

Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.

What you'll do:

  • Write pseudocode (not production code)
  • Handle incremental file processing
  • Implement deduplication strategy
  • Ensure idempotency

Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md


🚀 Getting Started

Prerequisites

  • Python 3.8+
  • GCP credentials (provided by interviewer via encrypted email)
  • Git

Repository Structure

β”œβ”€β”€ exercises/
β”‚   β”œβ”€β”€ batch_level_1/
β”‚   β”‚   β”œβ”€β”€ candidate_solution/     # Your working directory
β”‚   β”‚   β”œβ”€β”€ seeds/                  # Sample data
β”‚   β”‚   β”œβ”€β”€ CANDIDATE_INSTRUCTIONS.md
β”‚   β”‚   └── README.md
β”‚   └── batch_level_2/
β”‚       β”œβ”€β”€ *.csv                   # Sample event files
β”‚       └── README.md
β”œβ”€β”€ requirements.txt                # Python dependencies
└── README.md                       # This file

Setup Instructions

  1. Clone this repository

    git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
    cd data-engineer-candidate-test
  2. Set up Python environment

    Windows PowerShell:

    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install -r requirements.txt

    Linux/Mac:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Configure GCP credentials

    To receive credentials:

    1. Contact the repository owner at daniel@0x.se
    2. You'll receive the service account JSON file via encrypted email or secure file transfer
    3. Place the file in config/data-engineer-recruitment.json

    ⚠️ Security: DO NOT commit credentials to Git. The .gitignore file is configured to prevent this.

  4. Set your candidate ID (for Batch Level 1)

    Windows PowerShell:

    $env:CANDIDATE_ID = "candidate_yourname"

    Linux/Mac:

    export CANDIDATE_ID="candidate_yourname"

    This creates isolated BigQuery datasets: candidate_yourname_raw, candidate_yourname_staging, candidate_yourname_prod
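
    The datasets are created automatically when the pipeline runs. If you ever script this step yourself, a minimal sketch (not part of the provided scripts; it assumes GOOGLE_APPLICATION_CREDENTIALS points at the service-account key) could derive the names from this variable and create them idempotently:

    import os
    from google.cloud import bigquery

    candidate = os.environ["CANDIDATE_ID"]             # e.g. "candidate_yourname"
    client = bigquery.Client(project="data-engineer-recruitment")

    for suffix in ("raw", "staging", "prod"):
        dataset = bigquery.Dataset(f"{client.project}.{candidate}_{suffix}")
        dataset.location = "EU"                         # all datasets live in the EU location
        client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists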

  5. Verify setup

    cd exercises/batch_level_1/candidate_solution
    python scripts/test_bigquery_connection.py
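
    The provided script performs this check for you; a minimal equivalent sketch (the key path and query are illustrative only, and the actual script may differ) looks roughly like:

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Load the service-account key shared by the interviewer
    creds = service_account.Credentials.from_service_account_file(
        "config/data-engineer-recruitment.json"
    )
    client = bigquery.Client(project="data-engineer-recruitment",
                             credentials=creds, location="EU")

    # A trivial query proves both authentication and BigQuery access
    rows = list(client.query("SELECT 1 AS ok").result())
    print("BigQuery connection OK:", rows[0].ok)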

📖 Working on Exercises

Batch Level 1 - Full Implementation

This exercise requires working code that runs in BigQuery:

cd exercises/batch_level_1/candidate_solution

# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files

# Run the full pipeline
python -m src.pipeline full

# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue

What to deliver:

  • Staging transformations (prepare raw data for analytics use)
  • Production aggregation (country-level revenue metrics)
  • Working pipeline that produces correct results
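
For the production aggregation, a rough sketch of the final step is shown below. It assumes the staging tables expose user_id, country, and amount columns (verify against the seed data) and that you take the Python route; the pure SQL approach would be the statement inside the string.

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Country-level revenue metrics built from the staging layer (column names assumed)
client.query(f"""
CREATE OR REPLACE TABLE `{candidate}_prod.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `{candidate}_staging.payments` AS p
JOIN `{candidate}_staging.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
""").result()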

Batch Level 2 - Implementation or Pseudocode

This exercise evaluates both your implementation and design thinking:

cd exercises/batch_level_2

# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs

Example approach:

FUNCTION process_incremental_data(run_date):
  // Step 1: Identify new files since last run
  new_files = list_gcs_files(path, after=last_processed_timestamp)
  
  // Step 2: Load and deduplicate
  events = load_csv_files(new_files)
            .deduplicate(on=event_id, keep=first)
  
  // Step 3: Merge into daily table (idempotent)
  MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...
  
  // Step 4: Record processed files
  update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
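
If you choose to implement Batch Level 2 rather than write pseudocode, the merge step above translates into a single BigQuery statement along these lines. This is a sketch only: the raw events table and its event_id, event_ts, loaded_at, and amount columns, and the daily_metrics target, are assumptions, and it assumes each run recomputes the affected (user_id, date) totals from the loaded files.

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Deduplicate on event_id first, then MERGE so rerunning the same day changes nothing
client.query(f"""
MERGE INTO `{candidate}_prod.daily_metrics` AS t
USING (
  SELECT
    user_id,
    DATE(event_ts) AS date,
    SUM(amount)    AS amount
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at) AS rn
    FROM `{candidate}_raw.events`
  )
  WHERE rn = 1  -- keep one copy of each resent event
  GROUP BY user_id, date
) AS s
ON t.user_id = s.user_id AND t.date = s.date
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT (user_id, date, amount) VALUES (s.user_id, s.date, s.amount)
""").result()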

📊 GCP Infrastructure Details

Project: data-engineer-recruitment
Location: EU

Your Datasets (isolated per candidate):

  • candidate_yourname_raw - Raw ingested data
  • candidate_yourname_staging - Prepared data (staging)
  • candidate_yourname_prod - Production analytics tables

Seed Data (for Batch Level 1):

  • Location: gs://de-recruitment-test-seeds-2026/
  • Files: users.csv, payments.csv
  • Auto-loaded by ingestion script
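
The ingestion script takes care of this step; purely as illustration, loading one seed file into the raw dataset with the BigQuery client could look like the following (the target table name and schema autodetection are assumptions):

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Load users.csv from the seed bucket, replacing any earlier load of the table
job = client.load_table_from_uri(
    "gs://de-recruitment-test-seeds-2026/users.csv",
    f"{client.project}.{candidate}_raw.users",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,              # header row
        autodetect=True,                  # infer the schema from the file
        write_disposition="WRITE_TRUNCATE",
    ),
)
job.result()  # wait for the load to finish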

💡 Assessment Guidelines

What We're Looking For

Technical Skills ✅

  • Correct data combination logic (joins, lookups)
  • Proper null and missing data handling
  • Understanding of staging layer purpose (preparing data for analytics)
  • Understanding of idempotency and deduplication
  • Appropriate aggregation and window functions
  • Knowledge of BigQuery/SQL or pandas operations
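
To make the first few bullets concrete, a staging transformation over the raw users table might look roughly like this (country, signup_date, and the ordering used for deduplication are assumptions about the seed data; adapt to what you actually find):

import os
from google.cloud import bigquery

candidate = os.environ["CANDIDATE_ID"]
client = bigquery.Client(project="data-engineer-recruitment", location="EU")

# Normalise nulls, cast types defensively, and keep one row per user_id
client.query(f"""
CREATE OR REPLACE TABLE `{candidate}_staging.users` AS
SELECT
  user_id,
  COALESCE(country, 'unknown')   AS country,      -- explicit null handling
  SAFE_CAST(signup_date AS DATE) AS signup_date   -- tolerate malformed dates
FROM `{candidate}_raw.users`
WHERE TRUE  -- QUALIFY needs a WHERE/GROUP BY/HAVING clause in BigQuery
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY signup_date DESC) = 1
""").result()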

Problem-Solving Approach ✅

  • Asking clarifying questions
  • Identifying edge cases
  • Explaining trade-offs
  • Considering production concerns (scale, cost, monitoring)

Communication ✅

  • Clear pseudocode with comments
  • Explaining reasoning behind design choices
  • Documenting assumptions

Tips for Success

  1. Read the problem carefully - Understand requirements before coding
  2. Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
  3. Think production - Consider scale, failures, monitoring
  4. Explain your reasoning - Add comments about trade-offs
  5. Test your code - For Batch Level 1, verify your results

Pseudocode Guidelines

  • Syntax doesn't matter - SQL-like, Python-like, or plain English all work
  • Logic matters - Show clear data flow and transformations
  • Comment your reasoning - Explain why you chose this approach
  • Consider edge cases - What happens with nulls, duplicates, late data?

🔧 Troubleshooting

Can't connect to BigQuery

Check:

  1. Credentials file exists: config/data-engineer-recruitment.json
  2. CANDIDATE_ID environment variable is set
  3. Run test script: python scripts/test_bigquery_connection.py

Datasets not created

Solution: Datasets are created automatically when you run the pipeline. If you need to create them manually:

bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod

Import errors

Solution: Make sure you're in the correct directory and that the virtual environment is activated:

# Activate the virtual environment from the repository root first
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\Activate.ps1  # Windows
# then change into the exercise directory
cd exercises/batch_level_1/candidate_solution

📞 Support

Questions or technical issues?

  • Email: daniel@0x.se
  • Include: Your candidate ID, error messages, and what you've tried

πŸ” Security & Credentials

GCP Credentials:

  • Shared via encrypted email only
  • Never commit to Git (protected by .gitignore)
  • Minimal permissions (BigQuery only)
  • Will be revoked after assessment

Your Data:

  • All datasets isolated per candidate
  • No access to other candidates' data
  • Datasets cleaned up after interview


Batch Level 2: Deduplication Pipeline

  • Input: Event data with duplicates and complex business rules
  • Output: Cleaned dataset with business logic applied
  • Focus: Deduplication strategies, data quality, incremental processing

Evaluation Criteria

Your solutions will be evaluated on:

  • Correctness: Does it produce the expected results?
  • Code Quality: Clean, readable, well-documented code
  • Problem Solving: Efficient algorithms and data structures
  • Production Readiness: Error handling, scalability considerations, idempotency
  • Documentation: Clear explanations of your approach

GCP Resources Available

  • BigQuery for data warehousing
  • Cloud Storage for file storage
  • Service account with necessary permissions
  • Sample datasets pre-loaded

Getting Help

  • Review the problem statements carefully
  • Ask clarifying questions if requirements are unclear
  • Test with the provided sample data
  • Document your assumptions and decisions

Submission

When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.


Good luck! We're excited to see your data engineering solutions. 🚀
