
How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide

Hire the right data engineer. Covers role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.

Hiring a bad data engineer costs $50K-$150K when you factor in recruiting fees, onboarding time, salary during ramp-up, the 6-12 months of underperformance before you realize the mistake, and the cost of re-hiring. The data engineering talent market is competitive, and the skills gap between “writes SQL” and “designs production data systems” is enormous.

This guide helps you hire right the first time by defining the role clearly, structuring interviews that test real skills, running take-home projects that reveal engineering quality, and recognizing the green and red flags that predict long-term success or failure.


Step 1: Define the Role Clearly

Seniority Levels

| Seniority | Title | Experience | US Salary Range | Key Expectations |
| --- | --- | --- | --- | --- |
| Junior | Data Engineer I | 0-2 years | $80K-$110K | Execute on defined tasks, learn patterns |
| Mid | Data Engineer II | 2-5 years | $110K-$150K | Design pipelines independently, own systems |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K | Architecture decisions, mentoring, cross-team |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K | Technology strategy, org-wide standards |

Core vs Nice-to-Have Skills

The biggest hiring mistake is requiring every skill in the ecosystem. Focus on transferable fundamentals — specific tools can be learned.

| Core (Must Have) | Nice-to-Have (Learnable) |
| --- | --- |
| SQL (advanced: window functions, CTEs, query optimization) | Spark / PySpark |
| Python (data processing, not just scripting) | Kafka / streaming architectures |
| ETL/ELT pipeline design and maintenance | dbt (data build tool) |
| Cloud data services (at least ONE: AWS/Azure/GCP) | Data modeling (star schema, dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts (validation, profiling, monitoring) | Airflow / Dagster / Prefect |
| Understanding of data warehousing concepts | Terraform / IaC |
| Basic cloud infrastructure (networking, storage, compute) | Real-time/streaming processing |

Writing the Job Description

Do:

  • Specify seniority level and salary range (saves everyone time)
  • List 5-6 core requirements, not 20 “must-haves”
  • Describe the actual work (what pipelines, what data, what scale)
  • Mention the tech stack honestly (including legacy systems they’ll maintain)

Don’t:

  • Require more years of experience with a tool than the tool has existed (e.g. “15 years of Snowflake experience”)
  • List every tool in the modern data stack as required
  • Hide seniority expectations behind vague titles
  • Require a CS degree if experience demonstrates equivalent skill

Step 2: Technical Interview Questions

SQL Assessment (30 minutes)

Start with fundamentals and escalate to advanced patterns. Any data engineer must be fluent in SQL.

```sql
-- Question 1: Window Functions (Mid-level)
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Good answer includes: discussion of RANK vs DENSE_RANK vs ROW_NUMBER
-- Great answer includes: performance implications of window functions on large datasets

-- Question 2: Self-Join / Pattern Detection (Senior-level)
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
    AND b.order_date = a.order_date + INTERVAL '1 day';

-- Great answer: uses LAG window function instead of self-join for performance
-- SELECT customer_id FROM (
--   SELECT customer_id, order_date,
--     LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_date
--   FROM orders
-- ) sub WHERE order_date = prev_date + INTERVAL '1 day';

-- Question 3: Data Quality (Must pass for any level)
-- "Write a query to find duplicate customer records by email"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Follow-up: "How would you decide which duplicate to keep?"
-- Look for: recency, completeness, source system priority

-- Question 4: Performance (Senior+)
-- "This query takes 45 seconds. How would you diagnose and fix it?"
-- Look for: EXPLAIN ANALYZE, index analysis, query rewrite, partitioning
```
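If you want to sanity-check the LAG-based rewrite from Question 2 before using it in an interview, it can be run end-to-end against an in-memory SQLite database (SQLite 3.25+ supports window functions). This is a minimal sketch with made-up sample rows; the table and column names match the example above, but the date arithmetic uses SQLite's `date(..., '+1 day')` rather than Postgres `INTERVAL` syntax:

```python
# Verify the LAG-based consecutive-days query on a toy dataset (stdlib only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2025-01-01"), (1, "2025-01-02"),  # consecutive days -> should match
        (2, "2025-01-01"), (2, "2025-01-05"),  # gap of 4 days -> no match
        (3, "2025-03-10"),                     # single order -> no match
    ],
)

query = """
SELECT DISTINCT customer_id FROM (
    SELECT customer_id, order_date,
           LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_date
    FROM orders
) sub
WHERE order_date = date(prev_date, '+1 day')
"""
print([row[0] for row in conn.execute(query)])  # -> [1]
```

Only customer 1 is returned: the LAG window pairs each order with the previous one per customer, so a match requires two orders exactly one day apart.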

Python Assessment (30 minutes)

Test real-world data processing, not LeetCode algorithms.

```python
# Question: "Parse this messy CSV, handle edge cases, and load to database"
import pandas as pd
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats (2025-01-01, 01/01/2025, Jan 1 2025)
    - Null values (strategic handling, not just dropna())
    - Type coercion (strings that should be numbers)
    - Duplicate detection (email-based dedup)
    - Error logging (not silent failures)
    - Idempotency (re-running doesn't create duplicates)
    """
    df = pd.read_csv(filepath)
    stats = {"raw_rows": len(df)}

    # Clean dates (multiple formats; format='mixed' requires pandas >= 2.0)
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')
    stats["invalid_dates"] = df['created_at'].isna().sum()

    # Normalize email (lowercase, strip whitespace)
    df['email'] = df['email'].str.lower().str.strip()

    # Deduplicate (keep most recent)
    before = len(df)
    df = df.sort_values('created_at', ascending=False).drop_duplicates(subset=['email'], keep='first')
    stats["duplicates_removed"] = before - len(df)

    # Validate email format
    valid_mask = df['email'].str.contains(r'^[^@]+@[^@]+\.[^@]+$', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]
    stats["invalid_emails"] = len(invalid)

    # Log invalid records for investigation
    if len(invalid) > 0:
        invalid.to_csv('rejected_records.csv', index=False)
        logger.warning(f"Wrote {len(invalid)} rejected records to rejected_records.csv")

    # Load (note: to_sql with if_exists='append' is NOT idempotent by itself;
    # a strong answer mentions an upsert or staging-table merge for safe re-runs)
    df.to_sql('customers', engine, if_exists='append', index=False, method='multi')
    stats["loaded"] = len(df)

    return stats
```
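Since the idempotency row in the rubric below is where most candidates fall down, it helps to have a concrete picture of what "handles re-runs" means. This is a minimal stdlib sketch, not part of the assessment itself: a UNIQUE key on email plus an upsert makes loading the same batch twice a no-op. The table and column names are illustrative.

```python
# Idempotent load sketch: re-running the same batch must not create duplicates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")

batch = [
    ("ada@example.com", "Ada", "2025-01-01"),
    ("bob@example.com", "Bob", "2025-01-02"),
]

def load(rows):
    # ON CONFLICT (SQLite 3.24+) turns a duplicate insert into an update
    conn.executemany(
        """INSERT INTO customers (email, name, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(email) DO UPDATE SET
               name = excluded.name,
               updated_at = excluded.updated_at""",
        rows,
    )

load(batch)
load(batch)  # re-run: same row count, no duplicates
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # -> 2
```

A candidate who reaches for this pattern (or a staging table plus MERGE on a warehouse) without prompting is signalling senior-level pipeline thinking.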

Evaluation criteria for Python assessment:

| Aspect | Junior Pass | Senior Pass |
| --- | --- | --- |
| Correctness | Works on happy path | Handles all edge cases |
| Error handling | Try/except exists | Specific exceptions, logging, rejected records saved |
| Idempotency | Not considered | Handles re-runs (upsert or dedup logic) |
| Code quality | Functional | Clean naming, docstrings, type hints |
| Testing | Not included | Mentions or writes test cases |

Step 3: Take-Home Project (Optional, 2-4 Hours)

```markdown
## Data Pipeline Challenge

**Time limit: 2-4 hours** (respect the candidate's time)

You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).

Build a pipeline that:
1. Reads and profiles the source data (record count, null rates, distribution)
2. Applies data quality rules (document your rules and thresholds)
3. Transforms into a star schema (1 fact table + 2-3 dimension tables)
4. Loads into SQLite (or DuckDB)
5. Writes a brief report of data quality findings and decisions

Evaluation criteria (shared with candidate upfront):
- Code quality and structure (35%)
- Data quality handling and documentation (30%)
- Schema design decisions (20%)
- Documentation and assumptions (15%)

Bonus (not required):
- Unit tests for transformation logic
- Data validation checks post-load
- Dockerized environment
```
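For reviewers who have not graded a star-schema submission before, this is roughly the shape to look for in step 3. A hedged stdlib sketch with hypothetical column names and only one dimension (the brief asks for 2-3): flat order records are split into a dimension table with a surrogate key and a fact table that references it.

```python
# Minimal star-schema transform: flat rows -> one dimension + one fact table.
import sqlite3

raw = [
    # (order_id, customer_email, customer_name, amount) -- illustrative columns
    (1, "ada@example.com", "Ada", 30.0),
    (2, "bob@example.com", "Bob", 12.5),
    (3, "ada@example.com", "Ada", 7.5),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT);
    CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
""")

for order_id, email, name, amount in raw:
    # Upsert the dimension row, then resolve its surrogate key for the fact row
    conn.execute("INSERT OR IGNORE INTO dim_customer (email, name) VALUES (?, ?)", (email, name))
    key = conn.execute("SELECT customer_key FROM dim_customer WHERE email = ?", (email,)).fetchone()[0]
    conn.execute("INSERT INTO fact_orders VALUES (?, ?, ?)", (order_id, key, amount))

# Facts join back to the dimension on the surrogate key, never on raw email
total = conn.execute("""
    SELECT d.name, SUM(f.amount) FROM fact_orders f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(total)  # -> [('Ada', 37.5), ('Bob', 12.5)]
```

Submissions that carry the natural key (email) into the fact table, or skip surrogate keys entirely, are worth probing on in the follow-up conversation rather than failing outright.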

Take-Home Anti-Patterns (Reject These)

| Red Flag in Submission | What It Means |
| --- | --- |
| No error handling anywhere | Will build fragile pipelines |
| `df.dropna()` with no analysis | Doesn't understand data quality |
| No README or documentation | Cannot communicate technical decisions |
| Hardcoded file paths and credentials | No awareness of deployment practices |
| Single 500-line script | Cannot organize code or think modularly |

Step 4: Evaluation Scorecard

Standardize evaluation across all interviewers to reduce bias.

| Criteria | Weight | Score (1-5) | Notes |
| --- | --- | --- | --- |
| SQL proficiency (window functions, optimization, CTEs) | 25% | ___ | |
| Python data processing (pandas, edge cases, quality) | 20% | ___ | |
| Pipeline design thinking (idempotency, failure recovery, monitoring) | 20% | ___ | |
| Communication and collaboration | 15% | ___ | |
| Cloud services knowledge (at least one major cloud) | 10% | ___ | |
| Culture fit and growth mindset | 10% | ___ | |
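Turning the scorecard into a single number is a one-liner; the sketch below uses the weights from the table above with a hypothetical candidate's scores, so interviewers compare candidates on the same scale.

```python
# Weighted scorecard total: weights mirror the table above (sum to 1.0),
# scores are an illustrative candidate on the 1-5 scale.
weights = {
    "sql": 0.25,
    "python": 0.20,
    "pipeline_design": 0.20,
    "communication": 0.15,
    "cloud": 0.10,
    "culture_growth": 0.10,
}
scores = {
    "sql": 4,
    "python": 4,
    "pipeline_design": 3,
    "communication": 5,
    "cloud": 3,
    "culture_growth": 4,
}

weighted = sum(weights[k] * scores[k] for k in weights)
print(round(weighted, 2))  # -> 3.85
```

Where you set the hire bar (e.g. 3.5 overall with no dimension below 3) is a team decision; the point is that every interviewer computes it the same way.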

Green Flags 🟢

  • Asks about data quality before building pipelines (“What does bad data look like?”)
  • Thinks about idempotency and failure recovery without being prompted
  • Writes tests for data transformations (even simple assertions)
  • Considers downstream consumers (“Who uses this data? What do they need?”)
  • Documents assumptions and trade-offs in their code
  • Has opinions about data modeling (star schema vs normalized, with reasoning)
  • Asks about monitoring and alerting for data freshness and quality

Red Flags 🔴

  • “I just use pandas for everything” (no plan for data that exceeds a single machine's memory)
  • Cannot explain trade-offs between batch and streaming processing
  • No concept of data validation, profiling, or quality frameworks
  • Cannot debug or explain a slow SQL query (no EXPLAIN ANALYZE awareness)
  • No interest in understanding the business context of the data
  • No mention of version control or testing in their workflow
  • Only talks about tools, never about patterns (tool followers, not problem solvers)

Step 5: Onboarding Plan (30/60/90 Days)

The hire doesn’t end at the offer letter. A structured onboarding plan substantially shortens ramp-up time and catches mismatches early.

| Milestone | Goals | Support Needed |
| --- | --- | --- |
| Days 1-30 | Understand data architecture, ship first small PR, complete access setup | Assigned buddy, architecture walkthrough, pair programming sessions |
| Days 31-60 | Own one pipeline end-to-end, fix one data quality issue, contribute to documentation | Regular 1:1s, code reviews, stakeholder introductions |
| Days 61-90 | Design and build a new pipeline independently, present to team | Reduced supervision, ownership of a domain area |

Hiring Checklist

  • Role level and responsibilities clearly defined (not a wishlist)
  • Core vs nice-to-have skills documented and agreed upon
  • Salary range benchmarked and included in job posting
  • SQL assessment prepared (3 questions, escalating difficulty)
  • Python assessment prepared (real-world data processing problem)
  • Take-home project ready with clear evaluation rubric shared upfront
  • Interview panel includes at least one practicing data engineer
  • Evaluation scorecard standardized across all interviewers
  • Onboarding plan prepared (30/60/90 day milestones)
  • Rejection criteria documented (what’s a hard no vs coachable gap)

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
