How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide
Hire the right data engineer. Covers role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.
Hiring a bad data engineer costs $50K-$150K when you factor in recruiting fees, onboarding time, salary during ramp-up, the 6-12 months of underperformance before you realize the mistake, and the cost of re-hiring. The data engineering talent market is competitive, and the skills gap between “writes SQL” and “designs production data systems” is enormous.
This guide helps you hire right the first time by defining the role clearly, structuring interviews that test real skills, running take-home projects that reveal engineering quality, and recognizing the green and red flags that predict long-term success or failure.
Step 1: Define the Role Clearly
Seniority Levels
| Seniority | Title | Experience | US Salary Range | Key Expectations |
|---|---|---|---|---|
| Junior | Data Engineer I | 0-2 years | $80K-$110K | Execute on defined tasks, learn patterns |
| Mid | Data Engineer II | 2-5 years | $110K-$150K | Design pipelines independently, own systems |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K | Architecture decisions, mentoring, cross-team |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K | Technology strategy, org-wide standards |
Core vs Nice-to-Have Skills
The biggest hiring mistake is requiring every skill in the ecosystem. Focus on transferable fundamentals — specific tools can be learned.
| Core (Must Have) | Nice-to-Have (Learnable) |
|---|---|
| SQL (advanced — window functions, CTEs, query optimization) | Spark / PySpark |
| Python (data processing, not just scripting) | Kafka / streaming architectures |
| ETL/ELT pipeline design and maintenance | dbt (data build tool) |
| Cloud data services (at least ONE: AWS/Azure/GCP) | Data modeling (star schema, dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts (validation, profiling, monitoring) | Airflow / Dagster / Prefect |
| Understanding of data warehousing concepts | Terraform / IaC |
| Basic cloud infrastructure (networking, storage, compute) | Real-time/streaming processing |
Writing the Job Description
Do:
- Specify seniority level and salary range (saves everyone time)
- List 5-6 core requirements, not 20 “must-haves”
- Describe the actual work (what pipelines, what data, what scale)
- Mention the tech stack honestly (including legacy systems they’ll maintain)
Don’t:
- Require “10 years of Snowflake experience” (the product only reached general availability in 2015 — almost nobody has that)
- List every tool in the modern data stack as required
- Hide seniority expectations behind vague titles
- Require a CS degree if experience demonstrates equivalent skill
Step 2: Technical Interview Questions
SQL Assessment (30 minutes)
Start with fundamentals and escalate to advanced patterns; fluent SQL is non-negotiable for any data engineer.
```sql
-- Question 1: Window Functions (Mid-level)
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Good answer includes: discussion of RANK vs DENSE_RANK vs ROW_NUMBER
-- Great answer includes: performance implications of window functions on large datasets
```
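To make the RANK vs DENSE_RANK vs ROW_NUMBER discussion concrete in an interview, a tiny runnable demo helps. The sketch below uses Python's stdlib `sqlite3` (SQLite 3.25+ supports window functions); the data is invented to force a tie:

```python
import sqlite3

# Hypothetical data: Ann and Bob tie on total_purchases to expose the
# difference between the three ranking functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, region TEXT, total_purchases INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "West", 500), ("Bob", "West", 500), ("Cat", "West", 300)],
)
rows = conn.execute("""
    SELECT customer_name,
           RANK()       OVER (PARTITION BY region ORDER BY total_purchases DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS row_num
    FROM customers
    ORDER BY rnk, customer_name
""").fetchall()
# On the tie: RANK gives 1, 1 then skips to 3; DENSE_RANK gives 1, 1, 2;
# ROW_NUMBER breaks the tie arbitrarily.
for r in rows:
    print(r)
```

A strong candidate can explain when the gap-skipping of RANK matters (e.g., "top 3 per region" semantics) versus when ROW_NUMBER's arbitrary tie-breaking is a bug.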
```sql
-- Question 2: Self-Join / Pattern Detection (Senior-level)
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b
  ON a.customer_id = b.customer_id
 AND b.order_date = a.order_date + INTERVAL '1 day';

-- Great answer: uses the LAG window function instead of a self-join for performance
SELECT customer_id FROM (
    SELECT customer_id, order_date,
           LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_date
    FROM orders
) sub
WHERE order_date = prev_date + INTERVAL '1 day';
```
```sql
-- Question 3: Data Quality (must pass at any level)
-- "Write a query to find duplicate customer records by email"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Follow-up: "How would you decide which duplicate to keep?"
-- Look for: recency, completeness, source-system priority
```
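A good follow-up answer combines those criteria into an explicit ordering. A minimal pandas sketch, with invented column names and a hypothetical source-priority map, might look like:

```python
import pandas as pd

# Hypothetical duplicate resolution: prefer the trusted source ("crm" over
# "import"), then the most recent record, when collapsing duplicates by email.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "source": ["import", "crm", "crm"],
    "updated_at": pd.to_datetime(["2025-01-01", "2024-06-01", "2025-02-01"]),
})
priority = {"crm": 0, "import": 1}  # lower rank = more trusted source
df["source_rank"] = df["source"].map(priority)
deduped = (
    df.sort_values(["source_rank", "updated_at"], ascending=[True, False])
      .drop_duplicates(subset="email", keep="first")
      .drop(columns="source_rank")
)
# For a@x.com the older CRM record wins over the newer import, because
# source priority is ranked above recency in this policy.
```

The point to probe is not the code but the policy: can the candidate articulate *why* source priority outranks recency (or vice versa) for this dataset?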
```sql
-- Question 4: Performance (Senior+)
-- "This query takes 45 seconds. How would you diagnose and fix it?"
-- Look for: EXPLAIN ANALYZE, index analysis, query rewrite, partitioning
```
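You can run a miniature version of this diagnosis live in the interview. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` via stdlib `sqlite3` (the same workflow applies to `EXPLAIN ANALYZE` in Postgres); the table and index names are illustrative:

```python
import sqlite3

# Compare the query plan before and after adding an index on the filter column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()

print(plan_before[0][3])  # a full-table SCAN
print(plan_after[0][3])   # a SEARCH using idx_orders_customer
```

A senior candidate should narrate the before/after plans unprompted and mention when an index does *not* help (low-selectivity predicates, functions applied to the indexed column).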
Python Assessment (30 minutes)
Test real-world data processing, not LeetCode algorithms.
```python
# Question: "Parse this messy CSV, handle edge cases, and load to database"
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats (2025-01-01, 01/01/2025, Jan 1 2025)
    - Null values (strategic handling, not just dropna())
    - Type coercion (strings that should be numbers)
    - Duplicate detection (email-based dedup)
    - Error logging (not silent failures)
    - Idempotency (re-running doesn't create duplicates)
    """
    df = pd.read_csv(filepath)
    stats = {"raw_rows": len(df)}

    # Clean dates (multiple formats; format='mixed' requires pandas >= 2.0)
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')
    stats["invalid_dates"] = df['created_at'].isna().sum()

    # Normalize email (lowercase, strip whitespace)
    df['email'] = df['email'].str.lower().str.strip()

    # Deduplicate (keep most recent)
    before = len(df)
    df = df.sort_values('created_at', ascending=False).drop_duplicates(subset=['email'], keep='first')
    stats["duplicates_removed"] = before - len(df)

    # Validate email format
    valid_mask = df['email'].str.match(r'^[^@]+@[^@]+\.[^@]+$', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]
    stats["invalid_emails"] = len(invalid)

    # Log invalid records for investigation
    if len(invalid) > 0:
        invalid.to_csv('rejected_records.csv', index=False)
        logger.warning(f"Wrote {len(invalid)} rejected records to rejected_records.csv")

    # Load. Note: a plain append is NOT idempotent by itself -- a great answer
    # stages the batch and upserts on the email key so re-runs don't duplicate rows.
    df.to_sql('customers', engine, if_exists='append', index=False, method='multi')
    stats["loaded"] = len(df)
    return stats
```
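The idempotency point in the docstring deserves its own probe, because `if_exists='append'` alone fails it. A minimal sketch of the upsert a great answer would describe, using stdlib `sqlite3` with an assumed `email` primary key (table and column names match the example above but are otherwise assumptions):

```python
import sqlite3

import pandas as pd


def upsert_customers(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    """Idempotent load: re-running the same batch leaves the table unchanged."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS customers (
            email TEXT PRIMARY KEY,
            created_at TEXT
        )
    """)
    conn.executemany(
        """
        INSERT INTO customers (email, created_at) VALUES (?, ?)
        ON CONFLICT(email) DO UPDATE SET created_at = excluded.created_at
        """,
        df[["email", "created_at"]].astype(str).itertuples(index=False, name=None),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = pd.DataFrame({"email": ["a@x.com"], "created_at": ["2025-01-01"]})
upsert_customers(batch, conn)
upsert_customers(batch, conn)  # re-run: row count stays at 1
```

Warehouse engines have their own spellings of the same idea (`MERGE` in Snowflake/BigQuery, `INSERT ... ON CONFLICT` in Postgres); what matters is that the candidate reaches for one of them unprompted.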
Evaluation criteria for Python assessment:
| Aspect | Junior Pass | Senior Pass |
|---|---|---|
| Correctness | Works on happy path | Handles all edge cases |
| Error handling | Try/except exists | Specific exceptions, logging, rejected records saved |
| Idempotency | Not considered | Handles re-runs (upsert or dedup logic) |
| Code quality | Functional | Clean naming, docstrings, type hints |
| Testing | Not included | Mentions or writes test cases |
Step 3: Take-Home Project (Optional, 2-4 Hours)
## Data Pipeline Challenge
**Time limit: 2-4 hours** (respect the candidate's time)
You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).
Build a pipeline that:
1. Reads and profiles the source data (record count, null rates, distribution)
2. Applies data quality rules (document your rules and thresholds)
3. Transforms into a star schema (1 fact table + 2-3 dimension tables)
4. Loads into SQLite (or DuckDB)
5. Writes a brief report of data quality findings and decisions
Evaluation criteria (shared with candidate upfront):
- Code quality and structure (35%)
- Data quality handling and documentation (30%)
- Schema design decisions (20%)
- Documentation and assumptions (15%)
Bonus (not required):
- Unit tests for transformation logic
- Data validation checks post-load
- Dockerized environment
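When reviewing submissions, it helps to have a mental baseline for step 1 of the brief. A minimal profiling sketch (column names are placeholders for whatever the CSV actually contains):

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Basic source profile: row count, per-column null rates, dupes, dtypes."""
    return {
        "rows": len(df),
        "null_rate": df.isna().mean().round(3).to_dict(),  # fraction null per column
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Tiny illustrative input with one null amount
df = pd.DataFrame({"order_id": [1, 2, 2], "amount": [9.99, None, 5.00]})
report = profile(df)
```

A submission that does roughly this before transforming anything, and writes the numbers into its report, is already ahead of most.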
Take-Home Anti-Patterns (Reject These)
| Red Flag in Submission | What It Means |
|---|---|
| No error handling anywhere | Will build fragile pipelines |
| `df.dropna()` with no analysis | Doesn’t understand data quality |
| No README or documentation | Cannot communicate technical decisions |
| Hardcoded file paths and credentials | No awareness of deployment practices |
| Single 500-line script | Cannot organize code or think modularly |
Step 4: Evaluation Scorecard
Standardize evaluation across all interviewers to reduce bias.
| Criteria | Weight | Score (1-5) | Notes |
|---|---|---|---|
| SQL proficiency (window functions, optimization, CTEs) | 25% | ___ | |
| Python data processing (pandas, edge cases, quality) | 20% | ___ | |
| Pipeline design thinking (idempotency, failure recovery, monitoring) | 20% | ___ | |
| Communication and collaboration | 15% | ___ | |
| Cloud services knowledge (at least one major cloud) | 10% | ___ | |
| Culture fit and growth mindset | 10% | ___ | |
Green Flags 🟢
- Asks about data quality before building pipelines (“What does bad data look like?”)
- Thinks about idempotency and failure recovery without being prompted
- Writes tests for data transformations (even simple assertions)
- Considers downstream consumers (“Who uses this data? What do they need?”)
- Documents assumptions and trade-offs in their code
- Has opinions about data modeling (star schema vs normalized, with reasoning)
- Asks about monitoring and alerting for data freshness and quality
Red Flags 🔴
- “I just use pandas for everything” (an approach that won’t scale beyond single-machine memory)
- Cannot explain trade-offs between batch and streaming processing
- No concept of data validation, profiling, or quality frameworks
- Cannot debug or explain a slow SQL query (no EXPLAIN ANALYZE awareness)
- No interest in understanding the business context of the data
- No mention of version control or testing in their workflow
- Only talks about tools, never about patterns (tool followers, not problem solvers)
Step 5: Onboarding Plan (30/60/90 Days)
The hire doesn’t end at the offer letter. A structured onboarding plan meaningfully shortens ramp-up time and catches mis-hires early.
| Milestone | Goals | Support Needed |
|---|---|---|
| Day 1-30 | Understand data architecture, ship first small PR, complete access setup | Assigned buddy, architecture walkthrough, pair programming sessions |
| Day 31-60 | Own one pipeline end-to-end, fix one data quality issue, contribute to documentation | Regular 1:1s, code reviews, stakeholder introductions |
| Day 61-90 | Design and build a new pipeline independently, present to team | Reduced supervision, ownership of a domain area |
Hiring Checklist
- Role level and responsibilities clearly defined (not a wishlist)
- Core vs nice-to-have skills documented and agreed upon
- Salary range benchmarked and included in job posting
- SQL assessment prepared (3 questions, escalating difficulty)
- Python assessment prepared (real-world data processing problem)
- Take-home project ready with clear evaluation rubric shared upfront
- Interview panel includes at least one practicing data engineer
- Evaluation scorecard standardized across all interviewers
- Onboarding plan prepared (30/60/90 day milestones)
- Rejection criteria documented (what’s a hard no vs coachable gap)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com.
:::