How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide
Hire the right data engineer. Covers role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.
Hiring a bad data engineer costs $50K-$150K when you factor in recruiting fees, onboarding time, salary during ramp-up, the 6-12 months of underperformance before you realize the mistake, and the cost of re-hiring. The data engineering talent market is competitive, and the skills gap between “writes SQL” and “designs production data systems” is enormous.
This guide helps you hire right the first time by defining the role clearly, structuring interviews that test real skills, running take-home projects that reveal engineering quality, and recognizing the green and red flags that predict long-term success or failure.
Step 1: Define the Role Clearly
Seniority Levels
| Seniority | Title | Experience | US Salary Range | Key Expectations |
|---|---|---|---|---|
| Junior | Data Engineer I | 0-2 years | $80K-$110K | Execute on defined tasks, learn patterns |
| Mid | Data Engineer II | 2-5 years | $110K-$150K | Design pipelines independently, own systems |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K | Architecture decisions, mentoring, cross-team |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K | Technology strategy, org-wide standards |
Core vs Nice-to-Have Skills
The biggest hiring mistake is requiring every skill in the ecosystem. Focus on transferable fundamentals — specific tools can be learned.
| Core (Must Have) | Nice-to-Have (Learnable) |
|---|---|
| SQL (advanced — window functions, CTEs, query optimization) | Spark / PySpark |
| Python (data processing, not just scripting) | Kafka / streaming architectures |
| ETL/ELT pipeline design and maintenance | dbt (data build tool) |
| Cloud data services (at least ONE: AWS/Azure/GCP) | Data modeling (star schema, dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts (validation, profiling, monitoring) | Airflow / Dagster / Prefect |
| Understanding of data warehousing concepts | Terraform / IaC |
| Basic cloud infrastructure (networking, storage, compute) | Real-time/streaming processing |
Writing the Job Description
Do:
- Specify seniority level and salary range (saves everyone time)
- List 5-6 core requirements, not 20 “must-haves”
- Describe the actual work (what pipelines, what data, what scale)
- Mention the tech stack honestly (including legacy systems they’ll maintain)
Don’t:
- Require “10 years of Snowflake experience” (the product only reached general availability in 2015 — almost nobody has that)
- List every tool in the modern data stack as required
- Hide seniority expectations behind vague titles
- Require a CS degree if experience demonstrates equivalent skill
Step 2: Technical Interview Questions
SQL Assessment (30 minutes)
Start with fundamentals and escalate to advanced patterns; fluent SQL is non-negotiable for any data engineer.
```sql
-- Question 1: Window Functions (Mid-level)
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Good answer includes: discussion of RANK vs DENSE_RANK vs ROW_NUMBER
-- Great answer includes: performance implications of window functions on large datasets
```
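To make the RANK vs DENSE_RANK vs ROW_NUMBER discussion concrete in an interview, a tiny runnable demo helps. The sketch below uses Python's stdlib `sqlite3` (SQLite 3.25+ supports window functions); the data is invented to force a tie:

```python
import sqlite3

# Hypothetical data: Ann and Bob tie on total_purchases to expose the
# difference between the three ranking functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, region TEXT, total_purchases INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "West", 500), ("Bob", "West", 500), ("Cat", "West", 300)],
)
rows = conn.execute("""
    SELECT customer_name,
           RANK()       OVER (PARTITION BY region ORDER BY total_purchases DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS row_num
    FROM customers
    ORDER BY rnk, customer_name
""").fetchall()
# On the tie: RANK gives 1, 1 then skips to 3; DENSE_RANK gives 1, 1, 2;
# ROW_NUMBER breaks the tie arbitrarily.
for r in rows:
    print(r)
```

A strong candidate can explain when the gap-skipping of RANK matters (e.g., "top 3 per region" semantics) versus when ROW_NUMBER's arbitrary tie-breaking is a bug.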
```sql
-- Question 2: Self-Join / Pattern Detection (Senior-level)
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b
  ON a.customer_id = b.customer_id
 AND b.order_date = a.order_date + INTERVAL '1 day';

-- Great answer: uses the LAG window function instead of a self-join for performance
SELECT customer_id FROM (
    SELECT customer_id, order_date,
           LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_date
    FROM orders
) sub
WHERE order_date = prev_date + INTERVAL '1 day';
```
```sql
-- Question 3: Data Quality (must pass at any level)
-- "Write a query to find duplicate customer records by email"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Follow-up: "How would you decide which duplicate to keep?"
-- Look for: recency, completeness, source-system priority
```
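A good follow-up answer combines those criteria into an explicit ordering. A minimal pandas sketch, with invented column names and a hypothetical source-priority map, might look like:

```python
import pandas as pd

# Hypothetical duplicate resolution: prefer the trusted source ("crm" over
# "import"), then the most recent record, when collapsing duplicates by email.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "source": ["import", "crm", "crm"],
    "updated_at": pd.to_datetime(["2025-01-01", "2024-06-01", "2025-02-01"]),
})
priority = {"crm": 0, "import": 1}  # lower rank = more trusted source
df["source_rank"] = df["source"].map(priority)
deduped = (
    df.sort_values(["source_rank", "updated_at"], ascending=[True, False])
      .drop_duplicates(subset="email", keep="first")
      .drop(columns="source_rank")
)
# For a@x.com the older CRM record wins over the newer import, because
# source priority is ranked above recency in this policy.
```

The point to probe is not the code but the policy: can the candidate articulate *why* source priority outranks recency (or vice versa) for this dataset?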
```sql
-- Question 4: Performance (Senior+)
-- "This query takes 45 seconds. How would you diagnose and fix it?"
-- Look for: EXPLAIN ANALYZE, index analysis, query rewrite, partitioning
```
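You can run a miniature version of this diagnosis live in the interview. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` via stdlib `sqlite3` (the same workflow applies to `EXPLAIN ANALYZE` in Postgres); the table and index names are illustrative:

```python
import sqlite3

# Compare the query plan before and after adding an index on the filter column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()

print(plan_before[0][3])  # a full-table SCAN
print(plan_after[0][3])   # a SEARCH using idx_orders_customer
```

A senior candidate should narrate the before/after plans unprompted and mention when an index does *not* help (low-selectivity predicates, functions applied to the indexed column).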
Python Assessment (30 minutes)
Test real-world data processing, not LeetCode algorithms.
```python
# Question: "Parse this messy CSV, handle edge cases, and load to database"
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats (2025-01-01, 01/01/2025, Jan 1 2025)
    - Null values (strategic handling, not just dropna())
    - Type coercion (strings that should be numbers)
    - Duplicate detection (email-based dedup)
    - Error logging (not silent failures)
    - Idempotency (re-running doesn't create duplicates)
    """
    df = pd.read_csv(filepath)
    stats = {"raw_rows": len(df)}

    # Clean dates (multiple formats; format='mixed' requires pandas >= 2.0)
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')
    stats["invalid_dates"] = df['created_at'].isna().sum()

    # Normalize email (lowercase, strip whitespace)
    df['email'] = df['email'].str.lower().str.strip()

    # Deduplicate (keep most recent)
    before = len(df)
    df = df.sort_values('created_at', ascending=False).drop_duplicates(subset=['email'], keep='first')
    stats["duplicates_removed"] = before - len(df)

    # Validate email format
    valid_mask = df['email'].str.match(r'^[^@]+@[^@]+\.[^@]+$', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]
    stats["invalid_emails"] = len(invalid)

    # Log invalid records for investigation
    if len(invalid) > 0:
        invalid.to_csv('rejected_records.csv', index=False)
        logger.warning(f"Wrote {len(invalid)} rejected records to rejected_records.csv")

    # Load. Note: a plain append is NOT idempotent by itself -- a great answer
    # stages the batch and upserts on the email key so re-runs don't duplicate rows.
    df.to_sql('customers', engine, if_exists='append', index=False, method='multi')
    stats["loaded"] = len(df)
    return stats
```
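The idempotency point in the docstring deserves its own probe, because `if_exists='append'` alone fails it. A minimal sketch of the upsert a great answer would describe, using stdlib `sqlite3` with an assumed `email` primary key (table and column names match the example above but are otherwise assumptions):

```python
import sqlite3

import pandas as pd


def upsert_customers(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    """Idempotent load: re-running the same batch leaves the table unchanged."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS customers (
            email TEXT PRIMARY KEY,
            created_at TEXT
        )
    """)
    conn.executemany(
        """
        INSERT INTO customers (email, created_at) VALUES (?, ?)
        ON CONFLICT(email) DO UPDATE SET created_at = excluded.created_at
        """,
        df[["email", "created_at"]].astype(str).itertuples(index=False, name=None),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = pd.DataFrame({"email": ["a@x.com"], "created_at": ["2025-01-01"]})
upsert_customers(batch, conn)
upsert_customers(batch, conn)  # re-run: row count stays at 1
```

Warehouse engines have their own spellings of the same idea (`MERGE` in Snowflake/BigQuery, `INSERT ... ON CONFLICT` in Postgres); what matters is that the candidate reaches for one of them unprompted.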
Evaluation criteria for Python assessment:
| Aspect | Junior Pass | Senior Pass |
|---|---|---|
| Correctness | Works on happy path | Handles all edge cases |
| Error handling | Try/except exists | Specific exceptions, logging, rejected records saved |
| Idempotency | Not considered | Handles re-runs (upsert or dedup logic) |
| Code quality | Functional | Clean naming, docstrings, type hints |
| Testing | Not included | Mentions or writes test cases |
Step 3: Take-Home Project (Optional, 2-4 Hours)
## Data Pipeline Challenge
**Time limit: 2-4 hours** (respect the candidate's time)
You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).
Build a pipeline that:
1. Reads and profiles the source data (record count, null rates, distribution)
2. Applies data quality rules (document your rules and thresholds)
3. Transforms into a star schema (1 fact table + 2-3 dimension tables)
4. Loads into SQLite (or DuckDB)
5. Writes a brief report of data quality findings and decisions
Evaluation criteria (shared with candidate upfront):
- Code quality and structure (35%)
- Data quality handling and documentation (30%)
- Schema design decisions (20%)
- Documentation and assumptions (15%)
Bonus (not required):
- Unit tests for transformation logic
- Data validation checks post-load
- Dockerized environment
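When reviewing submissions, it helps to have a mental baseline for step 1 of the brief. A minimal profiling sketch (column names are placeholders for whatever the CSV actually contains):

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Basic source profile: row count, per-column null rates, dupes, dtypes."""
    return {
        "rows": len(df),
        "null_rate": df.isna().mean().round(3).to_dict(),  # fraction null per column
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Tiny illustrative input with one null amount
df = pd.DataFrame({"order_id": [1, 2, 2], "amount": [9.99, None, 5.00]})
report = profile(df)
```

A submission that does roughly this before transforming anything, and writes the numbers into its report, is already ahead of most.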
Take-Home Anti-Patterns (Reject These)
| Red Flag in Submission | What It Means |
|---|---|
| No error handling anywhere | Will build fragile pipelines |
| `df.dropna()` with no analysis | Doesn’t understand data quality |
| No README or documentation | Cannot communicate technical decisions |
| Hardcoded file paths and credentials | No awareness of deployment practices |
| Single 500-line script | Cannot organize code or think modularly |
Step 4: Evaluation Scorecard
Standardize evaluation across all interviewers to reduce bias.
| Criteria | Weight | Score (1-5) | Notes |
|---|---|---|---|
| SQL proficiency (window functions, optimization, CTEs) | 25% | ___ | |
| Python data processing (pandas, edge cases, quality) | 20% | ___ | |
| Pipeline design thinking (idempotency, failure recovery, monitoring) | 20% | ___ | |
| Communication and collaboration | 15% | ___ | |
| Cloud services knowledge (at least one major cloud) | 10% | ___ | |
| Culture fit and growth mindset | 10% | ___ | |
Green Flags 🟢
- Asks about data quality before building pipelines (“What does bad data look like?”)
- Thinks about idempotency and failure recovery without being prompted
- Writes tests for data transformations (even simple assertions)
- Considers downstream consumers (“Who uses this data? What do they need?”)
- Documents assumptions and trade-offs in their code
- Has opinions about data modeling (star schema vs normalized, with reasoning)
- Asks about monitoring and alerting for data freshness and quality
Red Flags 🔴
- “I just use pandas for everything” (an approach that won’t scale beyond single-machine memory)
- Cannot explain trade-offs between batch and streaming processing
- No concept of data validation, profiling, or quality frameworks
- Cannot debug or explain a slow SQL query (no EXPLAIN ANALYZE awareness)
- No interest in understanding the business context of the data
- No mention of version control or testing in their workflow
- Only talks about tools, never about patterns (tool followers, not problem solvers)
Step 5: Onboarding Plan (30/60/90 Days)
The hire doesn’t end at the offer letter. A structured onboarding plan meaningfully shortens ramp-up time and catches mis-hires early.
| Milestone | Goals | Support Needed |
|---|---|---|
| Day 1-30 | Understand data architecture, ship first small PR, complete access setup | Assigned buddy, architecture walkthrough, pair programming sessions |
| Day 31-60 | Own one pipeline end-to-end, fix one data quality issue, contribute to documentation | Regular 1:1s, code reviews, stakeholder introductions |
| Day 61-90 | Design and build a new pipeline independently, present to team | Reduced supervision, ownership of a domain area |
Hiring Checklist
- Role level and responsibilities clearly defined (not a wishlist)
- Core vs nice-to-have skills documented and agreed upon
- Salary range benchmarked and included in job posting
- SQL assessment prepared (3 questions, escalating difficulty)
- Python assessment prepared (real-world data processing problem)
- Take-home project ready with clear evaluation rubric shared upfront
- Interview panel includes at least one practicing data engineer
- Evaluation scorecard standardized across all interviewers
- Onboarding plan prepared (30/60/90 day milestones)
- Rejection criteria documented (what’s a hard no vs coachable gap)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com.
:::