How would you design a CI/CD pipeline for a 50-engineer organization with 10 services?

Start with non-functional requirements: time from commit to production (< X min), parallelization across services, rollback capability, security scanning. Pick a primary tool (GitHub Actions / GitLab / Argo CD) and explain why. Design: per-service pipeline with templated stages (build, test, security scan, package, deploy to staging, integration tests, canary to prod). Standardize the template across services to reduce maintenance. Add concurrency controls so 10 services do not contend for runners. Most important: explicit deploy-to-prod approval flow with automated rollback on health-check failure.

A Kubernetes pod is in CrashLoopBackOff. Walk me through how you debug it.

Start with `kubectl describe pod <pod-name>`: look at events, the last state and exit code, and resource limits. Then run `kubectl logs <pod-name> --previous` to see logs from the last crashed instance. Check for image pull errors. OOMKilled (exit code 137) means the container exceeded its memory limit, so either the limit is too low or the process is leaking memory. Check readiness/liveness probe configs: too-aggressive liveness probes cause restart loops. Check ConfigMap/Secret mounts: missing or malformed config is a common cause. Check init containers if they exist. Most important: check the namespace and node, because sometimes the pod is fine and a node-level issue (disk pressure, network) is the cause.
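If the interviewer pushes on exit codes, it helps to show you can decode them. By convention, an exit code above 128 means the process was terminated by a signal (code minus 128), which is why 137 maps to SIGKILL. The sketch below illustrates that rule only; the helper name `explain_exit_code` and the hint strings are hypothetical, and it assumes a Linux host where `signal.Signals` knows SIGKILL.

```python
import signal


def explain_exit_code(code: int) -> str:
    """Translate a container exit code into a rough debugging hint."""
    if code == 0:
        return "exited cleanly (check why the process is not long-running)"
    if code > 128:
        signum = code - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        if name == "SIGKILL":
            return f"{name}: possible OOM kill, confirm the OOMKilled reason in describe output"
        return f"{name}: terminated by signal"
    return "application error (read the --previous logs)"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", explain_exit_code(code))
```

A SIGKILL on its own does not prove an OOM kill, so the hint points back to the reason field rather than asserting a cause.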

Design a Terraform module structure for managing infrastructure across dev / staging / prod.

Three layers. Layer 1: reusable modules (vpc, eks-cluster, rds-instance) that encapsulate primitives. Layer 2: environment compositions (envs/dev/main.tf, envs/staging/main.tf, envs/prod/main.tf) that call modules with environment-specific inputs. Layer 3: backend and state management (remote state in S3 with DynamoDB locking, separate state files per environment). Use workspaces only for ephemeral environments, not for prod/staging. Pin module versions. Use Terragrunt or a wrapper script to share variables across environments. Most important: changes to prod must go through PR review with `terraform plan` output included.

How do you decide what to monitor and what to alert on?

Monitor everything — logs, metrics, traces. The fact that you have data does not mean you alert on it. Alert on: SLO violations (latency, error rate, availability), saturation that predicts SLO violations, and anomalies that have historically preceded incidents. Do NOT alert on: any single failed request, any short-duration spike, any non-SLO metric that lacks an action. Every alert should have a runbook. Most important: track alert noise — if an alert fires repeatedly without action, delete or tune it. The four golden signals (latency, traffic, errors, saturation) are still the foundation in 2026.
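One way to make "do not alert on a single failure or a short spike" concrete is to require a sustained breach before paging. The sketch below is an illustration under stated assumptions, not a recommended policy: the function name `should_alert`, the 1% threshold, and the three-window rule are all hypothetical values chosen for the example.

```python
from typing import Sequence


def should_alert(
    error_rates: Sequence[float],
    slo_max_error_rate: float = 0.01,  # assumed SLO: at most 1% errors per window
    sustained_windows: int = 3,        # assumed: breach must persist for 3 windows
) -> bool:
    """Return True only if every one of the most recent windows breaches the SLO."""
    if len(error_rates) < sustained_windows:
        return False  # not enough data to call it sustained
    recent = error_rates[-sustained_windows:]
    return all(rate > slo_max_error_rate for rate in recent)


if __name__ == "__main__":
    print(should_alert([0.0, 0.2, 0.0]))           # one-window spike
    print(should_alert([0.0, 0.02, 0.03, 0.05]))   # sustained breach
```

The spike case stays quiet and the sustained case pages, which is the behavior to describe when you explain how you keep alert noise down.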

Tell me about a time you disagreed with a developer or another engineer about a production issue.

Pick a real example. Describe the disagreement (often: deploy in spite of failing health checks, ship without observability, skip incident review). Describe your reasoning and how you raised it (one-on-one first, with data, with options). Describe their reasoning — and acknowledge it had merit. Describe the resolution — sometimes you were right, sometimes they were, sometimes a compromise. Close with what changed about the team's process.

Write a script that parses the last 1000 lines of a log file and reports the top 5 errors.

In Python: read the last 1000 lines (use `collections.deque` with `maxlen=1000` so memory stays bounded on large files), parse each line (regex or split on the log format), filter for error lines, count error message frequency with `Counter`, and return the top 5. Discuss edge cases: empty file, file with fewer than 1000 lines, malformed lines, and performance on very large files (the deque approach bounds memory but still reads the whole file; seeking backward from the end avoids that). Most important: handle parsing failures gracefully, because log files are messy.
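A minimal sketch of that approach follows. The log format is an assumption (`<date> <time> <LEVEL> <message>`), as are the names `LINE_RE` and `top_errors`; in an interview, state the format you are assuming before you write the regex.

```python
import re
from collections import Counter, deque

# Assumed format: "<date> <time> <LEVEL> <message>". Adjust to the real log format.
LINE_RE = re.compile(r"^\S+\s+\S+\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def top_errors(path, last_n=1000, top_k=5):
    """Return the top_k most frequent ERROR messages in the last last_n lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final last_n lines in memory.
        tail = deque(fh, maxlen=last_n)

    counts = Counter()
    for line in tail:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # malformed line: skip it rather than crash
        if match.group("level") == "ERROR":
            counts[match.group("msg")] += 1
    return counts.most_common(top_k)
```

An empty file or a file shorter than `last_n` lines needs no special handling here: the deque simply holds fewer lines, and an empty result is an empty list.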

What is your on-call experience and philosophy?

Describe rotations you have done — frequency (weekly / bi-weekly), service scope, escalation path. Describe what good on-call looks like to you: clear runbooks, well-tuned alerts, fair rotation, post-incident learning. Be honest about what burns you out: alert fatigue, lone on-call without backup, no runbooks. Ask about their on-call: rotation frequency, alert volume, runbook coverage. This is one of the few interview questions where being honest about constraints helps you, not hurts you.

How do you approach security in your day-to-day infrastructure work?

Threat-model new infrastructure changes: what is the blast radius if this is compromised? Pre-commit / pre-merge scans: secrets scanning (gitleaks, trufflehog), Terraform scanning (tfsec, checkov), container scanning (trivy, snyk). Runtime: least-privilege IAM, network segmentation, secrets rotation (Vault, AWS Secrets Manager). Keep a patch management cadence. Incident response: know how to revoke credentials, isolate compromised systems, and audit access logs. Most important: treat security as a partner in architecture decisions, not only as an after-the-fact reviewer.

A service is showing increased p99 latency but normal p50. What is your debugging approach?

The fact that p50 is unchanged tells you the median request path is fine — something is affecting a subset. Common causes: garbage collection pauses (heap dump, GC logs), lock contention (profile thread states), slow downstream dependency (check downstream p99), DNS resolution flake, network hops, cold caches for less-common queries, noisy neighbor on shared infrastructure. Start with traces — sample 100 slow requests and compare to 100 fast requests. Look at the histograms — is it bimodal (two distinct populations) or a long tail. Most important: do not assume it is your code — distributed systems often have p99 issues that are infrastructure-level.
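A small numeric illustration makes the "subset of requests" point easy to explain on a whiteboard. The sketch below uses synthetic data, and the 3% slow-path share and the latency values are made up for the example; `p50_p99` is a hypothetical helper built on the standard library's `statistics.quantiles`.

```python
import statistics


def p50_p99(latencies_ms):
    """Return (p50, p99) from the 99 cut points that split the data into 100 groups."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


# Synthetic baseline: every request takes 20 ms.
baseline = [20.0] * 1000
# Hypothetical regression: 3% of requests hit an 800 ms slow path.
degraded = [20.0] * 970 + [800.0] * 30

if __name__ == "__main__":
    print("baseline p50/p99:", p50_p99(baseline))
    print("degraded p50/p99:", p50_p99(degraded))
```

In the degraded set the median does not move while p99 lands on the slow population, which is the bimodal shape you would then look for in real histograms and traces.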

How do you handle a team that wants to ship a service without proper observability?

Pick a real example. Describe the pressure (deadline, business priority, "we will add it later"). Describe your position — you cannot operate a service in production without metrics, logs, and traces; the cost of adding them after a production incident is much higher. Describe what you did — likely a compromise: ship with minimum observability (golden signals + structured logs) and a commitment to add deeper observability in next sprint. Close with what actually happened — did the team follow through on the commitment.

What is your experience with cloud cost optimization?

Be specific. Describe a real cost-reduction project: how you identified the spend (cost-allocation tags, CUDOS dashboard, Vantage), the targets you picked (often: idle compute, oversized instances, unattached storage, NAT gateway, cross-AZ traffic, RI/SP underutilization), the savings you achieved (in dollars). Discuss the trade-offs — cost optimization can hurt reliability if done wrong. Most important: cost work needs ongoing measurement, not one-time projects.

What direction do you want your career to go?

Be honest about direction. Staff/Principal SRE: technical depth, large-scale infrastructure, incident leadership. Platform engineering: developer experience, internal tooling, abstraction layers. Engineering manager: hiring, leveling, team-building. Acknowledge uncertainty if you have it. If you might pivot to security or backend, mention it.

Do you have any questions for me?

Tailor by interviewer. For hiring manager: "What is the operational maturity gap that frustrates you most?" For peer SRE: "What was the last sev-1, and what changed because of it?" For dev partner: "How does infrastructure support or block your team?" Logistics last.

How long is the DevOps / SRE interview process in 2026?

4-6 weeks at most US companies. FAANG can stretch to 6-8 weeks. Series A/B startups can close in 2-3 weeks with a more compressed loop.

Do I need Kubernetes experience for every DevOps role in 2026?

Strongly preferred for almost all mid-level and above roles. Some companies (small startups still on Heroku/Render, traditional enterprises on VMs) do not require it, but they are increasingly rare. Having only tutorial-level K8s knowledge, without production experience, is a significant disadvantage at senior+ levels.

How important is on-call experience for SRE interviews?

Essential at senior+ levels. Interviewers expect at least one substantial on-call rotation with real production incidents. Junior candidates can substitute lab/personal-project incident response, but senior candidates without it usually fail loops.

What changed about DevOps interviews after the 2025 layoffs?

Bar rose meaningfully. Junior DevOps roles largely disappeared (companies expect entry via SWE → DevOps pivot, not direct entry). Senior+ roles became more competitive but also more available, as cost-focused orgs invest in efficiency engineering.

Are AI tools changing the DevOps interview format?

Modestly. Some companies ask how you would use LLMs for postmortem writing, runbook generation, or capacity planning. The core skills (incident response, K8s operations, IaC) are unchanged. AI is a question, not a replacement.

Technology • $90,000 - $180,000

DevOps Engineer Interview Questions

Q: Walk me through the worst production incident you have led the response on.

Pick a real sev-1 or sev-2. Open with the impact in one sentence (users affected, revenue impact, duration). Walk through the timeline — how you got paged, what you saw first, what you tried, what worked. Describe the recovery — rollback or hotfix, how you communicated to stakeholders, who you escalated to. End with the postmortem — root cause, what process or system change came from it, what you would do differently. Most important: do not blame upstream or downstream teams.

What 2026 DevOps interviews actually test

DevOps and SRE interviews in 2026 weight incident-response storytelling and Kubernetes-at-scale experience above traditional coding. The bar for production operations fluency has risen sharply. This guide covers the SRE-track loop (most common at FAANG / scale-ups) with notes on where platform-engineering loops differ.

End-to-end time: 4-6 weeks

What the DevOps Engineer interview loop actually looks like

Recruiter Screen

• Phone call • 30 min

Comp, on-call expectations, tooling inventory (K8s, Terraform, cloud platform). Recruiters filter on production K8s experience here.

Hiring Manager Screen

• Video call • 60 min

Recent incident walkthrough. Pick a sev-1 you led, walk through detection / mitigation / resolution / postmortem. This round filters heavily on production operations fluency.

Technical Phone Screen

• Live coding or systems design • 60 min

For SRE: coding problem with operations flavor (parsing logs, processing time series, retry logic). For platform: systems design discussion (build an internal developer platform).

Onsite — Infrastructure Design

• Systems design • 60 min

Design a globally distributed system, a CI/CD platform, or a multi-tenant K8s cluster. Bring real opinions about trade-offs.

Onsite — Behavioral / Bar Raiser

• Cross-team interview • 45 min

On-call culture, incident response under pressure, working with developer teams. Often determines marginal-call hires.

14 DevOps Engineer interview questions


What they're really asking

Single most important question in any senior DevOps/SRE loop. Tests production fluency, communication under pressure, and learning-orientation.

Answer framework

Pick a real sev-1 or sev-2. Open with the impact in one sentence (users affected, revenue impact, duration). Walk through the timeline — how you got paged, what you saw first, what you tried, what worked. Describe the recovery — rollback or hotfix, how you communicated to stakeholders, who you escalated to. End with the postmortem — root cause, what process or system change came from it, what you would do differently. Most important: do not blame upstream or downstream teams.

What a strong answer signals

You have the timeline with specific times (or minutes elapsed). You separate proximate cause from root cause. You can articulate a process change that came from the postmortem.

Red flags to avoid

• "I have not been in a major incident" (implausible for senior+ DevOps)
• Blaming developers, the dev team, or upstream services
• Cannot articulate the difference between proximate and root cause

How DevOps Engineer hires actually get decided

Approximate weight hiring committees place on each dimension. Use this to focus your prep on what actually moves the decision.

Incident response and production operations

30%

Can you actually operate systems in production? This is the single most important dimension for senior+ DevOps/SRE roles.

Kubernetes / IaC / cloud platform depth

25%

Tool fluency at the production level, not the tutorial level. The Kubernetes premium is real here.

Systems design at infrastructure scope

20%

Can you design CI/CD, observability stacks, and multi-region infrastructure? This is the differentiator at staff+ levels.

Cross-team collaboration

15%

How you work with developer teams. DevOps that becomes adversarial fails regardless of technical skill.

Coding ability

10%

Lower weight than for SWE but still real. Senior SREs need to read and write code fluently.

How to prepare for a DevOps Engineer interview

Have 3 incident stories at increasing severity

Senior DevOps loops always probe incident response. Prepare a sev-3 (small bug, contained quickly), a sev-2 (significant impact, multi-team response), and a sev-1 (major outage, you led). Each rehearsed with timeline, what you did, root cause, and process change.

Refresh Kubernetes debugging muscle memory

Most K8s questions are debugging scenarios. Spin up a kind/minikube cluster, deploy a service, break it intentionally (wrong image, bad probe, OOM, missing config), and practice debugging. The interviewer will ask exactly these scenarios.

Build one IaC project end-to-end

Terraform a small AWS or GCP setup: VPC, EKS, IAM, a deployed service, observability. Push to GitHub. Discuss it in interviews — concrete examples beat conceptual knowledge.

Read 2-3 recent public postmortems

Cloudflare, GitHub, Stripe, AWS — these companies publish detailed postmortems. Reading 3 fresh ones builds your incident vocabulary and gives you "I read about a similar issue" references. Use them in answers.

For SRE roles, refresh distributed systems basics

CAP theorem, consistency levels, leader election, retries with backoff, idempotency, circuit breakers. Chapters 5-9 of Designing Data-Intensive Applications are the standard reference.
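Retries with backoff are the item on that list most likely to turn into a live-coding prompt, so it is worth having a version you can write from memory. The sketch below shows exponential backoff with full jitter; the function name `retry_with_backoff`, the default delays, and the injectable `sleep` parameter are assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter
            # so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Mention idempotency when you present it: retrying is only safe when repeating the operation cannot apply the same change twice.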
