🗄️ SQL Interview Questions
40 questions with theory, real code, real-world scenarios, common mistakes and follow-up questions — from basic joins to query optimization.
SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It was developed at IBM in the 1970s and became an ANSI/ISO standard in 1986.
SQL is divided into four sub-languages based on the type of operation:
- DDL (Data Definition Language) — defines or modifies database structure:
CREATE,ALTER,DROP,TRUNCATE - DML (Data Manipulation Language) — reads and modifies data:
SELECT,INSERT,UPDATE,DELETE - DCL (Data Control Language) — manages permissions:
GRANT,REVOKE - TCL (Transaction Control Language) — manages transactions:
BEGIN,COMMIT,ROLLBACK,SAVEPOINT
Understanding these categories helps you reason about what each statement does — DDL changes the schema (and usually auto-commits), DML changes the data (and can be rolled back within a transaction), DCL controls who can do what, and TCL controls when changes become permanent.
-- DDL: Define structure
CREATE TABLE employees (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
department VARCHAR(50),
salary DECIMAL(10,2),
hire_date DATE
);
-- DML: Manipulate data
INSERT INTO employees (name, department, salary, hire_date)
VALUES ('Priya Sharma', 'Engineering', 85000.00, '2024-03-15');
SELECT name, salary FROM employees WHERE department = 'Engineering';
UPDATE employees SET salary = 92000.00 WHERE id = 1;
DELETE FROM employees WHERE id = 1;
-- DCL: Control access
GRANT SELECT, INSERT ON employees TO 'analyst_role';
REVOKE INSERT ON employees FROM 'analyst_role';
-- TCL: Control transactions
BEGIN;
UPDATE employees SET salary = salary * 1.10 WHERE department = 'Engineering';
-- Check results before committing
COMMIT;
At a fintech startup, a junior developer ran ALTER TABLE DROP COLUMN on a production table without realising DDL auto-commits in MySQL — the column was gone instantly with no ROLLBACK possible. The team had to restore from a backup, causing 2 hours of downtime. After the incident, they enforced a policy: all DDL changes go through a migration tool (Flyway) with a review step, and DML changes are always wrapped in explicit transactions with a manual COMMIT after verification.
Running DDL statements like DROP TABLE or TRUNCATE thinking they can be rolled back like DML statements.
BEGIN;
DROP TABLE temp_reports; -- Auto-commits immediately!
ROLLBACK; -- Too late, table is goneBEGIN;
DELETE FROM temp_reports WHERE created_at < '2024-01-01';
-- Verify: SELECT COUNT(*) FROM temp_reports;
COMMIT; -- Only when you are sureWhat is the difference between TRUNCATE and DELETE? Can TRUNCATE be rolled back?
WHERE filters individual rows before any grouping happens. HAVING filters groups after the GROUP BY aggregation is applied.
The key difference is timing in the execution pipeline:
WHEREruns first — it removes rows that don't match before the database even starts grouping. This meansWHEREcannot use aggregate functions likeCOUNT(),SUM(), orAVG().HAVINGruns afterGROUP BY— it can reference aggregate functions because the groups have already been computed.
Performance tip: always prefer WHERE over HAVING when possible. Filtering rows early (with WHERE) reduces the number of rows the database needs to group and aggregate, making the query faster.
-- Table: orders (id, customer_id, amount, status, order_date)
-- WHERE: filter rows BEFORE grouping
-- "Total spent by each customer, but only count completed orders"
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed' -- filters rows first
GROUP BY customer_id;
-- HAVING: filter groups AFTER aggregation
-- "Customers who spent more than $5,000 total"
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
HAVING SUM(amount) > 5000; -- filters groups after
-- Common mistake: using HAVING when WHERE would work
-- ❌ Slow: HAVING filters after grouping all rows
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING department != 'Intern';
-- ✅ Fast: WHERE filters before grouping
SELECT department, COUNT(*) AS emp_count
FROM employees
WHERE department != 'Intern'
GROUP BY department;
A reporting dashboard at an e-commerce company was taking 45 seconds to load because the query used HAVING to filter out cancelled orders instead of WHERE. The query was grouping 12 million rows before filtering, when it could have excluded 3 million cancelled rows upfront. Moving the status filter from HAVING to WHERE reduced the query time from 45 seconds to 8 seconds — a 5.6x improvement with a one-line change.
Using HAVING for conditions that don't involve aggregate functions — this forces the database to group everything before filtering.
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING department IN ('Engineering', 'Sales');SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE department IN ('Engineering', 'Sales')
GROUP BY department;Can you use both WHERE and HAVING in the same query? What is the SQL execution order?
SQL does not execute in the order you write it. The logical execution order is:
- FROM / JOIN — identify the source tables and perform joins
- WHERE — filter individual rows
- GROUP BY — group remaining rows
- HAVING — filter groups
- SELECT — compute output columns and expressions
- DISTINCT — remove duplicates
- ORDER BY — sort the result set
- LIMIT / OFFSET — restrict the number of rows returned
This order explains many common confusions: you cannot use a column alias defined in SELECT inside WHERE (because SELECT runs after WHERE). You can use it in ORDER BY because that runs last. And HAVING can reference aggregates because GROUP BY has already run.
-- Consider this query:
SELECT department,
COUNT(*) AS emp_count,
AVG(salary) AS avg_salary
FROM employees
JOIN departments ON employees.dept_id = departments.id
WHERE hire_date >= '2023-01-01'
GROUP BY department
HAVING COUNT(*) >= 5
ORDER BY avg_salary DESC
LIMIT 10;
-- Execution order:
-- 1. FROM employees JOIN departments → combine tables
-- 2. WHERE hire_date >= '2023-01-01' → filter to recent hires
-- 3. GROUP BY department → group remaining rows
-- 4. HAVING COUNT(*) >= 5 → keep groups with 5+ people
-- 5. SELECT department, COUNT(*), AVG(salary) → compute columns
-- 6. ORDER BY avg_salary DESC → sort by average salary
-- 7. LIMIT 10 → return top 10 only
-- This is why this FAILS:
-- SELECT name, salary * 12 AS annual FROM employees WHERE annual > 100000;
-- ❌ "annual" alias does not exist yet at WHERE stage
-- Fix: repeat the expression or use a subquery
SELECT name, salary * 12 AS annual
FROM employees
WHERE salary * 12 > 100000;
A data analyst at a logistics company wrote a complex dashboard query with 4 JOINs, WHERE filters, GROUP BY, and HAVING. It returned wrong results because they referenced a SELECT alias in the HAVING clause, which some databases (MySQL) allow but PostgreSQL rejects. Understanding execution order helped them debug it in minutes — they replaced the alias with the full expression in HAVING and got correct results across both databases.
Using a column alias from SELECT in the WHERE or HAVING clause, assuming SQL executes top to bottom.
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE annual_salary > 100000;
-- ERROR: column "annual_salary" does not exist-- Option 1: Repeat the expression
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE salary * 12 > 100000;
-- Option 2: Use a subquery / CTE
WITH emp AS (
SELECT name, salary * 12 AS annual_salary FROM employees
)
SELECT * FROM emp WHERE annual_salary > 100000;Why can you use a column alias in ORDER BY but not in WHERE?
JOINs combine rows from two or more tables based on a related column. SQL supports five main types:
- INNER JOIN — returns only rows that have matching values in both tables. This is the most common join and the default when you write just
JOIN. - LEFT (OUTER) JOIN — returns all rows from the left table, plus matched rows from the right table. If there is no match, right-side columns are
NULL. - RIGHT (OUTER) JOIN — returns all rows from the right table, plus matched rows from the left table. Rarely used in practice — you can always rewrite it as a LEFT JOIN by swapping the table order.
- FULL (OUTER) JOIN — returns all rows from both tables. Unmatched rows on either side get
NULLin the other table's columns. - CROSS JOIN — returns the Cartesian product — every row in table A paired with every row in table B. No
ONclause needed. Use with caution on large tables.
There is also SELF JOIN, which is not a separate type but a technique where you join a table to itself using an alias (e.g., finding employees and their managers from the same employees table).
-- Sample tables:
-- employees: id, name, dept_id
-- departments: id, dept_name
-- INNER JOIN: only employees who have a valid department
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;
-- LEFT JOIN: all employees, even those without a department
SELECT e.name, COALESCE(d.dept_name, 'Unassigned') AS dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Find employees WITHOUT a department (anti-join pattern)
SELECT e.name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
WHERE d.id IS NULL;
-- FULL OUTER JOIN: all employees + all departments (even unmatched)
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;
-- CROSS JOIN: every employee paired with every project (careful!)
SELECT e.name, p.project_name
FROM employees e
CROSS JOIN projects p;
-- If employees has 100 rows and projects has 10, result = 1,000 rows
-- SELF JOIN: find each employee's manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
An HR analytics team needed to find employees who had never submitted a timesheet. They initially used INNER JOIN between employees and timesheets, which only returned employees WITH timesheets — the opposite of what they needed. Switching to LEFT JOIN with a WHERE timesheets.id IS NULL correctly identified 47 employees with missing timesheets. This anti-join pattern is one of the most practical uses of LEFT JOIN in business reporting.
Using INNER JOIN when you need LEFT JOIN, which silently drops unmatched rows from the result — leading to missing data in reports.
-- "Show all customers and their order totals"
SELECT c.name, SUM(o.amount) AS total
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- Customers with zero orders are completely missing!SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- All customers shown, including those with $0 ordersWhat is the difference between LEFT JOIN with WHERE and LEFT JOIN with AND in the ON clause?
Both UNION and UNION ALL combine the results of two or more SELECT statements vertically (stacking rows). The key difference:
- UNION — removes duplicate rows from the combined result. It performs a
DISTINCToperation behind the scenes, which requires sorting or hashing to find and eliminate duplicates. - UNION ALL — keeps all rows, including duplicates. No deduplication step, so it is significantly faster.
Rules for both: each SELECT must have the same number of columns, and the corresponding columns must have compatible data types. Column names come from the first SELECT.
Performance rule of thumb: always use UNION ALL unless you specifically need deduplication. If the source queries are guaranteed to produce distinct rows (e.g., they pull from different tables with no overlap), UNION wastes time sorting.
-- Combine active and archived orders into one report
-- UNION ALL: faster, keeps all rows (no duplicates possible here)
SELECT id, customer_name, amount, 'active' AS source
FROM orders
WHERE status = 'active'
UNION ALL
SELECT id, customer_name, amount, 'archive' AS source
FROM orders_archive;
-- UNION: removes duplicates (slower due to sort/hash)
SELECT city FROM customers
UNION
SELECT city FROM suppliers;
-- Returns each city only once, even if it appears in both tables
-- UNION ALL: keeps duplicates
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;
-- "London" appears twice if it is in both tables
-- Combining 3+ queries with UNION ALL
SELECT name, 'employee' AS type FROM employees
UNION ALL
SELECT name, 'contractor' AS type FROM contractors
UNION ALL
SELECT name, 'intern' AS type FROM interns
ORDER BY name; -- ORDER BY applies to the entire result
A data warehouse ETL pipeline at a retail company combined sales from 12 regional databases using UNION (with dedup). Processing took 25 minutes because the dedup step sorted 80 million rows. Since each regional database had unique transaction IDs, duplicates were impossible. Switching to UNION ALL dropped processing time to 4 minutes — an 84% improvement with zero risk of data loss.
Using UNION by default everywhere, paying the sorting cost when duplicates are either impossible or acceptable.
-- Each table has unique IDs, duplicates are impossible
SELECT id, name, amount FROM current_orders
UNION
SELECT id, name, amount FROM archived_orders;
-- Sorts all rows to remove duplicates that don't existSELECT id, name, amount FROM current_orders
UNION ALL
SELECT id, name, amount FROM archived_orders;
-- Skips the sort, directly concatenates resultsCan you use ORDER BY with UNION? Where does the ORDER BY clause go?
NULL in SQL represents an unknown or missing value — it is not zero, not an empty string, and not false. It is the absence of any value.
NULL has special behaviour that trips up even experienced developers:
- Comparisons: Any comparison with NULL returns
NULL(not true or false).NULL = NULLisNULL, notTRUE. UseIS NULLorIS NOT NULLto test for NULLs. - Aggregates: Most aggregate functions (
SUM,AVG,COUNT(column)) ignore NULL values.COUNT(*)counts all rows including NULLs, butCOUNT(column)only counts non-NULL values. - COALESCE: Returns the first non-NULL argument.
COALESCE(NULL, NULL, 'default')returns'default'. - Boolean logic:
NULL AND TRUE = NULL,NULL OR TRUE = TRUE,NOT NULL = NULL. This three-valued logic (TRUE, FALSE, NULL) can cause unexpected WHERE clause results.
-- NULL comparison pitfalls
SELECT * FROM employees WHERE manager_id = NULL; -- ❌ Returns nothing!
SELECT * FROM employees WHERE manager_id IS NULL; -- ✅ Correct
-- NULL in arithmetic
SELECT 100 + NULL; -- Result: NULL (not 100)
SELECT NULL = NULL; -- Result: NULL (not TRUE)
SELECT NULL != NULL; -- Result: NULL (not TRUE)
-- COUNT behavior with NULLs
-- employees table: 100 rows, 15 have NULL in bonus column
SELECT COUNT(*) FROM employees; -- 100 (all rows)
SELECT COUNT(bonus) FROM employees; -- 85 (non-NULL only)
SELECT AVG(bonus) FROM employees; -- average of 85 values, not 100
-- COALESCE: provide fallback values
SELECT name,
COALESCE(phone, email, 'No contact info') AS contact
FROM customers;
-- NULLIF: returns NULL if both values are equal (useful to avoid division by zero)
SELECT revenue / NULLIF(expenses, 0) AS ratio
FROM financials;
-- If expenses = 0, NULLIF returns NULL, avoiding division-by-zero error
-- Sorting with NULLs
SELECT name, bonus FROM employees
ORDER BY bonus NULLS LAST; -- PostgreSQL syntax
A payroll system calculated average bonuses using AVG(bonus) which silently ignored 200 employees with NULL bonuses. The reported average was $4,500, but when the CFO manually calculated with zeros for missing bonuses, the true average was $3,200. The fix was COALESCE: AVG(COALESCE(bonus, 0)) — which treats missing bonuses as zero instead of ignoring them.
Using = NULL or != NULL instead of IS NULL / IS NOT NULL — the comparison silently returns no results.
SELECT * FROM employees WHERE bonus = NULL;
-- Returns 0 rows, even if 50 employees have NULL bonus
SELECT * FROM employees WHERE bonus != NULL;
-- Also returns 0 rows! Neither condition matches NULL.SELECT * FROM employees WHERE bonus IS NULL;
-- Returns all 50 employees with missing bonus
SELECT * FROM employees WHERE bonus IS NOT NULL;
-- Returns employees who have a bonus valueWhat is the difference between COALESCE and IFNULL/ISNULL? Which is ANSI standard?
All three remove data, but they operate at different levels and have different implications:
- DELETE (DML) — removes specific rows based on a
WHEREclause. Logs each deleted row individually, can be rolled back within a transaction. Fires triggers. Does not reset auto-increment counters. - TRUNCATE (DDL) — removes all rows from a table instantly. Deallocates data pages instead of logging individual row deletions, making it much faster than DELETE for large tables. Cannot use WHERE. Resets auto-increment counters. Cannot be rolled back in most databases (auto-commits in MySQL, but can be rolled back in PostgreSQL and SQL Server).
- DROP (DDL) — removes the entire table structure, data, indexes, constraints, and permissions. The table no longer exists. Cannot be rolled back.
Speed comparison for a table with 10 million rows: DELETE takes minutes (row-by-row logging), TRUNCATE takes milliseconds (deallocates pages), DROP takes milliseconds (removes metadata).
-- DELETE: remove specific rows (can rollback)
BEGIN;
DELETE FROM orders WHERE status = 'cancelled' AND created_at < '2023-01-01';
-- Check: SELECT COUNT(*) FROM orders WHERE status = 'cancelled';
ROLLBACK; -- Undo if needed, or COMMIT to confirm
-- DELETE all rows (slow for large tables, row-by-row)
DELETE FROM temp_logs; -- Triggers fire, auto-increment NOT reset
-- TRUNCATE: remove all rows instantly (DDL)
TRUNCATE TABLE temp_logs; -- No WHERE clause, resets auto-increment
-- Much faster than DELETE for millions of rows
-- DROP: remove the entire table
DROP TABLE IF EXISTS temp_logs; -- Table structure gone forever
-- Practical comparison:
-- 10M row table:
-- DELETE FROM big_table; → ~3 minutes (logged row-by-row)
-- TRUNCATE TABLE big_table; → ~50ms (deallocates pages)
-- DROP TABLE big_table; → ~10ms (removes metadata)
A developer at a SaaS company ran TRUNCATE TABLE users in production instead of TRUNCATE TABLE test_users (a typo in the table name). Because TRUNCATE auto-commits in MySQL, the 280,000 user records were gone instantly with no way to ROLLBACK. Recovery took 4 hours from a backup, and the company implemented a rule: TRUNCATE is banned in production — use DELETE with explicit WHERE clause and transaction wrapping instead.
Using TRUNCATE in production thinking it can be rolled back like DELETE — in MySQL and Oracle, TRUNCATE auto-commits and cannot be undone.
BEGIN;
TRUNCATE TABLE user_sessions; -- Auto-commits immediately!
ROLLBACK; -- Too late, all data is goneBEGIN;
DELETE FROM user_sessions WHERE expires_at < NOW();
-- Verify: SELECT COUNT(*) FROM user_sessions;
COMMIT; -- Only when confirmedCan TRUNCATE be rolled back in PostgreSQL? What about SQL Server?
Constraints enforce data integrity rules at the database level — they prevent invalid data from being inserted or updated.
- PRIMARY KEY — uniquely identifies each row in a table. Combines
NOT NULL+UNIQUE. A table can have only one primary key (which can be a single column or a composite of multiple columns). Automatically creates a clustered index in most databases. - FOREIGN KEY — creates a link between two tables by referencing the primary key of another table. Enforces referential integrity — you cannot insert a value that does not exist in the referenced table, and you cannot delete a referenced row without handling the dependent rows first (using CASCADE, SET NULL, or RESTRICT).
- UNIQUE — ensures all values in a column (or combination of columns) are distinct. Unlike PRIMARY KEY, UNIQUE allows one NULL value (in most databases). A table can have multiple UNIQUE constraints.
Other common constraints: NOT NULL (column cannot be empty), CHECK (column value must satisfy a condition), DEFAULT (provides a fallback value).
-- PRIMARY KEY: single column
CREATE TABLE employees (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
email VARCHAR(150) UNIQUE -- UNIQUE constraint
);
-- PRIMARY KEY: composite (multiple columns)
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT NOT NULL CHECK (quantity > 0),
PRIMARY KEY (order_id, product_id) -- composite PK
);
-- FOREIGN KEY with referential actions
CREATE TABLE orders (
id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date DATE DEFAULT CURRENT_DATE,
FOREIGN KEY (customer_id) REFERENCES customers(id)
ON DELETE RESTRICT -- prevent deleting customer with orders
ON UPDATE CASCADE -- update customer_id if customer.id changes
);
-- FOREIGN KEY with CASCADE delete
CREATE TABLE order_items (
id INT PRIMARY KEY AUTO_INCREMENT,
order_id INT NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(id)
ON DELETE CASCADE -- delete items when order is deleted
);
-- Adding constraints to existing table
ALTER TABLE employees ADD CONSTRAINT uq_email UNIQUE (email);
ALTER TABLE employees ADD CONSTRAINT fk_dept
FOREIGN KEY (dept_id) REFERENCES departments(id);
An e-commerce platform had no FOREIGN KEY constraints between orders and customers tables. Over time, developers deleted customer records without cleaning up their orders, creating 15,000 "orphan" orders with customer_id values pointing to non-existent customers. Monthly revenue reports were inflated by $2.3M because these orphan orders were included. Adding FOREIGN KEY constraints with ON DELETE RESTRICT prevented future orphans, and a cleanup script archived the existing ones.
Not defining FOREIGN KEY constraints and relying on application code to maintain referential integrity — this leads to orphaned records when data is modified outside the application.
CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT -- No FK! Any value accepted
);
-- This succeeds even though customer 9999 does not exist:
INSERT INTO orders (id, customer_id) VALUES (1, 9999);CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(id)
);
-- This fails with an error:
INSERT INTO orders (id, customer_id) VALUES (1, 9999);
-- ERROR: foreign key constraint violationWhat is the difference between ON DELETE CASCADE, SET NULL, and RESTRICT?
An alias is a temporary name given to a table or column in a query. It exists only for the duration of that query and does not rename anything in the database.
- Column aliases — rename output columns for readability using
AS. Useful for calculated expressions, aggregate functions, or when column names are unclear. TheASkeyword is optional but recommended for clarity. - Table aliases — give short names to tables, especially useful in JOINs with long table names. Required for self-joins (joining a table to itself) to distinguish the two instances.
Important rules:
- Column aliases defined in
SELECTcannot be used inWHEREorHAVING(in standard SQL) becauseSELECTexecutes after them. They can be used inORDER BY. - If an alias contains spaces or special characters, wrap it in double quotes (
"Total Sales") or backticks (`Total Sales`in MySQL).
-- Column aliases: make output readable
SELECT
first_name || ' ' || last_name AS full_name,
salary * 12 AS annual_salary,
ROUND(salary * 12 * 0.30, 2) AS estimated_tax
FROM employees;
-- Table aliases: simplify JOINs
SELECT e.name, d.dept_name, m.name AS manager_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
LEFT JOIN employees m ON e.manager_id = m.id;
-- Self-join REQUIRES table aliases
SELECT
e.name AS employee,
m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
-- Subquery alias (required in most databases)
SELECT avg_dept.department, avg_dept.avg_salary
FROM (
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
) AS avg_dept -- subquery MUST have an alias
WHERE avg_dept.avg_salary > 75000;
-- Alias in ORDER BY (works because ORDER BY runs after SELECT)
SELECT name, salary * 12 AS annual
FROM employees
ORDER BY annual DESC; -- ✅ Valid
A reporting team inherited a legacy database with columns named c1, c2, c3, c4 across 200+ tables. Every query they wrote used column aliases to make output readable: SELECT c1 AS customer_name, c2 AS email, c3 AS signup_date. Table aliases shortened their 6-table JOIN queries from unreadable 300-character lines to clean, maintainable SQL. They also used aliases in CTEs to create a "virtual data dictionary" layer over the cryptic schema.
Trying to use a column alias in WHERE, which fails because WHERE executes before SELECT where the alias is defined.
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE annual_salary > 100000;
-- ERROR: column "annual_salary" does not exist-- Option 1: Repeat the expression
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE salary * 12 > 100000;
-- Option 2: CTE
WITH emp AS (
SELECT name, salary * 12 AS annual_salary FROM employees
)
SELECT * FROM emp WHERE annual_salary > 100000;Can you use a column alias in GROUP BY? Does it differ across databases?
These are the three main string data types in SQL, differing in storage, performance, and max length:
- CHAR(n) — fixed-length string. Always stores exactly
ncharacters, padding with spaces if the value is shorter. Best for data that is always the same length (country codes, state abbreviations, fixed-format IDs). Wastes space on variable-length data but can be slightly faster for comparisons because the database knows the exact length. - VARCHAR(n) — variable-length string up to
ncharacters. Only stores the actual characters plus 1-2 bytes for length. Best for most string data where length varies (names, emails, addresses). Most commonly used string type. - TEXT — variable-length string with a very large max size (65KB in MySQL, 1GB in PostgreSQL). No length limit needs to be specified. Best for long-form content (blog posts, descriptions, JSON blobs). Cannot be used as an index key in most databases (or only a prefix can be indexed).
Storage comparison for storing "Hello" (5 chars): CHAR(20) uses 20 bytes. VARCHAR(20) uses 5 + 1 = 6 bytes. TEXT uses 5 + 2 = 7 bytes (MySQL).
-- CHAR: fixed-length, padded with spaces
CREATE TABLE countries (
code CHAR(2) NOT NULL, -- Always 2 chars: "US", "IN", "UK"
name VARCHAR(100) NOT NULL -- Variable length
);
-- VARCHAR: variable-length, most common choice
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
username VARCHAR(50) NOT NULL, -- Max 50, stores actual length
email VARCHAR(255) NOT NULL, -- Standard email max length
first_name VARCHAR(100),
last_name VARCHAR(100)
);
-- TEXT: large content, no length specified
CREATE TABLE articles (
id INT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(200) NOT NULL,
body TEXT NOT NULL, -- Blog post content (can be huge)
summary VARCHAR(500) -- Short description
);
-- CHAR padding behavior
INSERT INTO countries (code, name) VALUES ('U', 'Test');
-- 'U' stored as 'U ' (padded to 2 chars)
-- LENGTH(code) = 2 in MySQL, but CHAR_LENGTH may vary
-- When to use which:
-- CHAR(2) → country codes, state codes
-- CHAR(36) → UUIDs (always 36 chars)
-- VARCHAR(n) → names, emails, URLs (variable length)
-- TEXT → blog posts, comments, JSON documents
A SaaS platform stored user bios in a VARCHAR(255) column. When users started writing longer bios, the application silently truncated text at 255 characters without warning. Changing the column to TEXT removed the length limit, but the team discovered that their full-text search index on the bio column broke (MySQL cannot fully index TEXT columns). The fix was adding a prefix index: CREATE INDEX idx_bio ON users(bio(100)) — indexing just the first 100 characters for search.
Using CHAR for variable-length data, which wastes storage and causes padding-related comparison bugs.
CREATE TABLE users (
name CHAR(100), -- "Alice" stored as "Alice" + 95 spaces
email CHAR(255) -- "a@b.com" stored as "a@b.com" + 248 spaces
);
-- Wastes ~345 bytes per row on padding
-- WHERE name = 'Alice' may not match 'Alice ' in some databasesCREATE TABLE users (
name VARCHAR(100), -- "Alice" stores 5 + 1 = 6 bytes
email VARCHAR(255) -- "a@b.com" stores 7 + 1 = 8 bytes
);
-- Stores only actual data + 1 byte for lengthWhat is the difference between VARCHAR(255) and VARCHAR(MAX)? Does the specified length affect performance?
A subquery (or inner query) is a SELECT statement nested inside another SQL statement. Subqueries can appear in SELECT, FROM, WHERE, or HAVING clauses.
There are two types:
- Non-correlated (simple) subquery — executes once, independently of the outer query. The inner query produces a result set that the outer query then uses. Think of it as "run this first, then use the result."
- Correlated subquery — references columns from the outer query, so it must re-execute for every row the outer query processes. Think of it as "for each row in the outer query, run this inner query." Correlated subqueries are often slower because they execute N times (once per outer row).
Common use cases: filtering with IN or EXISTS, computing derived values in SELECT, creating inline views in FROM.
-- NON-CORRELATED subquery: runs once, result reused
-- "Employees earning more than the company average"
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Inner query runs once → returns one number → outer query filters
-- NON-CORRELATED with IN
-- "Customers who placed an order in 2024"
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024);
-- CORRELATED subquery: runs once PER outer row
-- "Employees earning more than their department average"
SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e.department -- references outer query!
);
-- Inner query re-executes for EACH employee row
-- Correlated subquery in SELECT (scalar subquery)
SELECT
d.dept_name,
(SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.id) AS emp_count
FROM departments d;
-- Rewrite correlated subquery as JOIN (usually faster)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN (
SELECT department, AVG(salary) AS avg_sal
FROM employees GROUP BY department
) dept_avg ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_sal;
A reporting query at a retail company used a correlated subquery to find each product's rank within its category — it processed 50,000 products × 200 categories, executing the inner query 50,000 times. The query took 8 minutes. Rewriting it as a JOIN with a derived table reduced execution to 3 seconds. Later, the team replaced it entirely with a window function (ROW_NUMBER() OVER), which was even cleaner and ran in 1.5 seconds.
Using a correlated subquery when a simple JOIN or window function would be faster and more readable.
SELECT name, salary,
(SELECT MAX(salary) FROM employees e2
WHERE e2.department = e1.department) AS dept_max
FROM employees e1;
-- Executes inner query once per employeeSELECT name, salary,
MAX(salary) OVER (PARTITION BY department) AS dept_max
FROM employees;
-- Single pass over the dataWhat is the difference between IN and EXISTS for subqueries? When is each faster?
A CTE (Common Table Expression) is a temporary named result set defined with the WITH keyword. It exists only for the duration of a single SQL statement and makes complex queries more readable by breaking them into logical steps.
- Readability — CTEs replace deeply nested subqueries with named, sequential steps. Each CTE is like a variable you can reference later.
- Reusability — a CTE can be referenced multiple times in the same query (unlike a subquery which must be repeated).
- Recursion — CTEs support recursive queries (hierarchical data like org charts, category trees, bill of materials).
CTE vs subquery: CTEs are generally not faster than subqueries — most query optimizers inline them. The advantage is purely readability and maintainability. However, in PostgreSQL, CTEs before v12 were optimization fences (materialized by default), which could hurt performance.
CTE vs temp table: CTEs are not stored — they are query-scoped. Temp tables are stored on disk and persist across statements in a session.
-- Basic CTE: break a complex query into readable steps
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
),
revenue_with_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month))
/ LAG(revenue) OVER (ORDER BY month) * 100, 1) AS growth_pct
FROM monthly_revenue
)
SELECT * FROM revenue_with_growth
WHERE growth_pct IS NOT NULL
ORDER BY month;
-- CTE referenced multiple times (vs repeating a subquery)
WITH active_users AS (
SELECT user_id, COUNT(*) AS login_count
FROM logins
WHERE login_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
)
SELECT
(SELECT COUNT(*) FROM active_users) AS total_active,
(SELECT AVG(login_count) FROM active_users) AS avg_logins,
(SELECT MAX(login_count) FROM active_users) AS max_logins;
-- Multiple CTEs chained together
WITH step1 AS (...),
step2 AS (SELECT ... FROM step1 ...),
step3 AS (SELECT ... FROM step2 ...)
SELECT * FROM step3;
A data team at a SaaS company had a 120-line nested subquery for a churn analysis report. No one could understand or modify it without breaking something. Rewriting it as 5 chained CTEs (active_users → churned_users → churn_by_month → churn_by_plan → final_report) made each step independently testable and readable. Debugging time for report issues dropped from hours to minutes.
Overusing CTEs for simple queries that don't need them, or assuming CTEs are always materialized (cached) — in most databases, the optimizer inlines them.
WITH all_employees AS (
SELECT * FROM employees
)
SELECT name FROM all_employees WHERE department = 'Sales';
-- CTE just wraps a simple table scan — adds noiseWITH dept_stats AS (
SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS cnt
FROM employees GROUP BY department
)
SELECT e.name, e.salary, d.avg_sal, d.cnt
FROM employees e
JOIN dept_stats d ON e.department = d.department
WHERE e.salary > d.avg_sal;What are recursive CTEs? Give an example of querying hierarchical data.
Window functions perform calculations across a set of rows that are somehow related to the current row — without collapsing them into a single output row like GROUP BY does. They use the OVER() clause.
The four ranking window functions:
- ROW_NUMBER() — assigns a unique sequential number to each row within the partition. No ties — if two rows have the same value, they get different numbers (order is non-deterministic for ties).
- RANK() — like ROW_NUMBER but handles ties by giving the same rank to equal values, then skipping the next number(s). Example: 1, 2, 2, 4 (skips 3).
- DENSE_RANK() — like RANK but does NOT skip numbers after ties. Example: 1, 2, 2, 3 (no gap).
- NTILE(n) — divides rows into
nroughly equal groups (buckets) and assigns the group number. Useful for percentiles and quartiles.
Key syntax: FUNCTION() OVER (PARTITION BY col ORDER BY col). PARTITION BY is optional — it creates separate "windows" for each group. ORDER BY determines the ranking order.
-- Sample data: employees with name, department, salary
-- ROW_NUMBER: unique number per row within each department
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→3
-- RANK: ties get same rank, gaps after
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→2, Dave(90k)→4
-- DENSE_RANK: ties get same rank, NO gaps
SELECT name, department, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→2, Dave(90k)→3
-- NTILE: divide into groups (quartiles)
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile
FROM employees;
-- Bottom 25% → 1, next 25% → 2, next 25% → 3, top 25% → 4
-- Practical: Get top 3 earners per department
WITH ranked AS (
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
)
SELECT name, department, salary
FROM ranked WHERE rn <= 3;
An e-commerce analytics team needed to find each customer's most recent order. The initial approach used a correlated subquery (SELECT MAX(order_date) for each customer), which took 12 seconds on 5 million orders. Replacing it with ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) and filtering WHERE rn = 1 reduced the query time to 1.8 seconds — and the code was shorter and clearer.
Using ROW_NUMBER to handle ties when you actually need RANK or DENSE_RANK — ROW_NUMBER arbitrarily picks one of the tied rows.
-- Find the highest-paid employee per department
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
)
SELECT * FROM ranked WHERE rn = 1;
-- If two employees tie at $100k, only ONE is returned!WITH ranked AS (
SELECT *, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
FROM employees
)
SELECT * FROM ranked WHERE rnk = 1;
-- Both tied employees at $100k are returnedWhat are other window functions like LAG, LEAD, SUM OVER, and how do frame clauses (ROWS BETWEEN) work?
An index is a data structure (usually a B-tree) that the database maintains alongside the table to speed up data retrieval. Think of it like a book's index — instead of reading every page (full table scan), you look up the topic in the index and jump directly to the right page.
How indexes work:
- Without an index, a query like
WHERE email = 'alice@example.com'must scan every row in the table (full table scan) — O(n). - With a B-tree index on
email, the database traverses the tree in O(log n) to find the matching row(s). - Indexes store a sorted copy of the indexed column(s) along with pointers to the actual table rows.
Trade-offs:
- Reads faster: SELECT queries with WHERE, JOIN, ORDER BY on indexed columns are dramatically faster.
- Writes slower: Every INSERT, UPDATE, DELETE must also update all relevant indexes, adding overhead.
- Storage cost: Each index takes additional disk space (often 10-30% of the table size).
Rule of thumb: index columns that appear in WHERE, JOIN ON, and ORDER BY. Don't index columns with low cardinality (e.g., boolean columns) or tables with heavy write loads.
-- Create an index on a frequently queried column
CREATE INDEX idx_email ON users(email);
-- Now this query uses the index instead of full table scan:
SELECT * FROM users WHERE email = 'alice@example.com';
-- Without index: scans 1,000,000 rows → ~500ms
-- With index: B-tree lookup → ~2ms
-- Composite (multi-column) index
CREATE INDEX idx_dept_salary ON employees(department, salary);
-- Helps queries that filter on department, or department + salary
-- Does NOT help queries that filter on salary alone (leftmost prefix rule)
-- Unique index (also enforces uniqueness)
CREATE UNIQUE INDEX idx_unique_email ON users(email);
-- Check existing indexes
SHOW INDEX FROM users; -- MySQL
SELECT * FROM pg_indexes WHERE tablename = 'users'; -- PostgreSQL
-- Drop an index
DROP INDEX idx_email ON users; -- MySQL
DROP INDEX idx_email; -- PostgreSQL
-- See if a query uses an index
EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- Look for "index scan" or "index seek" (good)
-- vs "seq scan" or "full table scan" (bad)
A social media platform's user search was taking 3-5 seconds on a table with 8 million users. The WHERE clause filtered on username, but there was no index on that column. Adding CREATE INDEX idx_username ON users(username) reduced search time to 15ms — a 200x improvement. However, the team later discovered that adding 12 indexes to the users table slowed down user registration (INSERT) from 5ms to 45ms. They removed 4 rarely-used indexes as a compromise.
Adding indexes on every column "just in case" — this slows down writes significantly and wastes disk space.
CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT,
product_id INT,
amount DECIMAL(10,2),
status VARCHAR(20),
created_at DATETIME
);
CREATE INDEX idx_1 ON orders(customer_id);
CREATE INDEX idx_2 ON orders(product_id);
CREATE INDEX idx_3 ON orders(amount);
CREATE INDEX idx_4 ON orders(status); -- low cardinality!
CREATE INDEX idx_5 ON orders(created_at);
-- 5 indexes = every INSERT updates 5 B-trees-- Only index columns used in WHERE/JOIN/ORDER BY
CREATE INDEX idx_customer ON orders(customer_id); -- JOIN queries
CREATE INDEX idx_created ON orders(created_at); -- date range queries
-- 2 indexes, covering 90% of query patternsWhat is the difference between clustered and non-clustered indexes?
This is one of the most asked SQL interview questions. The difference goes much deeper than "one sorts the data, the other doesn't".
Clustered Index (InnoDB / SQL Server):
- The leaf nodes of the B+ tree ARE the table data pages. The table is literally stored as a B+ tree ordered by the clustered key. There is no separate "heap" or "table" — the clustered index IS the table.
- In InnoDB (MySQL), the primary key is always the clustered index. If you don't define a PK, InnoDB creates a hidden 6-byte row ID as the clustered key.
- Only one per table — data can be physically sorted only one way.
- Range scans are fast — since rows are contiguous on disk, reading
WHERE id BETWEEN 100 AND 200is a sequential disk read.
Non-Clustered Index:
- A separate B+ tree whose leaf nodes store: (indexed column values) + (a pointer back to the actual row).
- In SQL Server, the pointer is the clustered key value (not a physical address). In InnoDB, non-clustered (secondary) index leaves store the primary key value.
- This means a non-clustered lookup does two B+ tree traversals: (1) search the secondary index B+ tree → get the PK value → (2) search the clustered index B+ tree using that PK → get the row. This second step is called a bookmark lookup (SQL Server) or key lookup.
- Many per table (typically 5-15 in practice).
Heap table (PostgreSQL model): PostgreSQL does not have clustered indexes in the InnoDB sense. All tables are heaps (unordered). All indexes are secondary, and leaf nodes store a ctid (physical tuple ID = page number + offset). The CLUSTER command reorders the heap once but isn't maintained on writes.
-- InnoDB CLUSTERED INDEX internal structure:
-- PRIMARY KEY on id → B+ tree IS the table
-- B+ tree root (internal node):
-- [ 500 | 1000 | 1500 ]
-- | | | \
-- Leaf pages (actual data rows stored here!):
-- Page 1: [id=1, name=Alice, salary=50k], [id=2, name=Bob, salary=60k], ...
-- Page 2: [id=501, name=Carol, salary=55k], [id=502, ...], ...
-- Page 3: [id=1001, ...], ...
-- Leaf pages are doubly-linked for range scans
-- NON-CLUSTERED (SECONDARY) INDEX in InnoDB:
-- CREATE INDEX idx_name ON employees(name);
-- Separate B+ tree:
-- Leaf: [Alice → PK=1], [Bob → PK=2], [Carol → PK=501]
-- ↑ sorted by name ↑ stores PRIMARY KEY, not row address
-- Lookup: SELECT * FROM employees WHERE name = 'Alice';
-- Step 1: Search idx_name B+ tree → find Alice → PK=1
-- Step 2: Search clustered index B+ tree using PK=1 → get full row
-- = TWO B+ tree traversals (this is the "bookmark/key lookup")
-- SQL Server: explicit clustered index on non-PK column
CREATE TABLE events (
id INT PRIMARY KEY NONCLUSTERED,
event_date DATE,
description VARCHAR(500)
);
CREATE CLUSTERED INDEX idx_event_date ON events(event_date);
-- Table stored sorted by event_date, not by id
-- id lookups now require a key lookup through the non-clustered PK index
-- Covering index: eliminates the second B+ tree traversal
CREATE INDEX idx_name_salary ON employees(name) INCLUDE (salary);
-- SELECT name, salary WHERE name = 'Alice';
-- Only Step 1 needed — salary is in the index leaf → no key lookup!
-- EXPLAIN shows "Index Only Scan" (PG) or "Index Seek" without "Key Lookup" (SQL Server)
-- PostgreSQL: heap + ctid-based indexes
-- All indexes store ctid (page_number, tuple_offset)
-- No "key lookup" — index leaf points directly to heap page
-- But: HOT updates (Heap-Only Tuples) can update rows without updating indexes
-- if the indexed columns didn't change
-- Check index type in PostgreSQL:
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'employees';
-- Fill factor: leave room in pages to reduce page splits
CREATE INDEX idx_name ON employees(name) WITH (fillfactor = 80);
-- Each leaf page only fills 80% → 20% free for future inserts
-- Reduces page splits at the cost of 20% more disk space
-- InnoDB: check fragmentation from random inserts
SELECT table_name, data_free, data_length,
ROUND(data_free / data_length * 100, 1) AS fragmentation_pct
FROM information_schema.tables
WHERE table_name = 'employees';
A SaaS platform used UUIDs as primary keys in InnoDB. Because InnoDB's clustered index is the PK, every INSERT went to a random position in the B+ tree — causing 15,000 page splits per minute. Secondary indexes were also affected: every secondary index leaf stores the PK (UUID), making secondary indexes 36 bytes wider per entry than necessary. Switching to BIGINT AUTO_INCREMENT as the clustered PK and keeping UUID as a UNIQUE secondary index reduced page splits to near zero, shrunk secondary indexes by 40%, and improved INSERT throughput by 3x. The team also set innodb_fill_factor (via page merge threshold) to reduce future splits.
Not understanding the double B+ tree traversal in non-clustered lookups, leading to surprise slow queries when selecting many columns via a non-clustered index.
CREATE INDEX idx_email ON users(email);
SELECT * FROM users WHERE email LIKE 'a%';
-- Index finds 50,000 matching emails
-- For EACH of the 50,000: key lookup into clustered index → fetch full row
-- 50,000 random I/Os into the clustered B+ tree → SLOW!
-- Optimizer may choose full table scan instead (cheaper)CREATE INDEX idx_email_cover ON users(email) INCLUDE (name, created_at);
SELECT name, email, created_at FROM users WHERE email LIKE 'a%';
-- All requested columns are in the index leaf → index-only scan
-- 50,000 sequential reads from the index → fast!
-- No key lookup into clustered index at allWhat is a covering index and when does it eliminate the need for a bookmark lookup? How do HOT updates work in PostgreSQL?
A view is a virtual table defined by a stored SQL query. It does not store data itself — every time you query a view, the database executes the underlying SELECT statement.
Advantages:
- Simplification — wrap complex JOINs into a simple table-like interface that analysts can query without knowing the underlying schema.
- Security — expose only specific columns or filtered rows. Grant users access to the view but not the underlying tables.
- Consistency — centralize business logic (e.g., "active customer" definition) in one place instead of repeating it in dozens of queries.
- Abstraction — change the underlying tables without breaking dependent queries, as long as the view's output stays the same.
Limitations:
- Views do not improve performance — the underlying query runs every time (unlike materialized views).
- Complex views with multiple JOINs and subqueries can actually be slower because the optimizer has less flexibility.
- Not all views are updatable — views with JOINs, GROUP BY, DISTINCT, or UNION generally cannot be used with INSERT/UPDATE/DELETE.
-- Create a view: simplify a complex JOIN
CREATE VIEW v_employee_details AS
SELECT
e.id, e.name, e.salary,
d.dept_name,
m.name AS manager_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
LEFT JOIN employees m ON e.manager_id = m.id;
-- Query the view like a regular table
SELECT * FROM v_employee_details WHERE dept_name = 'Engineering';
-- Security: create a restricted view for HR
CREATE VIEW v_employee_public AS
SELECT id, name, department, hire_date
FROM employees;
-- Excludes salary, SSN, personal info
GRANT SELECT ON v_employee_public TO hr_readonly_role;
-- Updatable view (simple single-table view)
CREATE VIEW v_active_employees AS
SELECT * FROM employees WHERE status = 'active';
-- This INSERT works through the view:
INSERT INTO v_active_employees (name, department, status)
VALUES ('New Hire', 'Sales', 'active');
-- Replace a view definition
CREATE OR REPLACE VIEW v_employee_details AS
SELECT ... -- updated query;
-- Drop a view
DROP VIEW IF EXISTS v_employee_details;
A healthcare company had 15 different teams writing queries against a patient database with 40+ tables. Each team had slightly different definitions of "active patient" — some checked last_visit_date, others checked insurance_status, others checked both. Creating a view v_active_patients with the official definition and requiring all teams to use it eliminated data discrepancies across 60+ reports. When the definition changed (adding a new status check), they updated the single view instead of hunting through hundreds of queries.
Nesting views inside views inside views, creating a "view stack" that is impossible to debug and can have terrible performance.
CREATE VIEW v1 AS SELECT ... FROM big_table JOIN ...;
CREATE VIEW v2 AS SELECT ... FROM v1 JOIN ...;
CREATE VIEW v3 AS SELECT ... FROM v2 JOIN ...;
CREATE VIEW v4 AS SELECT ... FROM v3 WHERE ...;
-- Query on v4: optimizer must unpack 4 layers
-- Debugging v4 requires understanding all 4 viewsCREATE VIEW v_report AS
SELECT ... FROM big_table
JOIN table2 ON ...
JOIN table3 ON ...
WHERE ...;
-- One view, one level, easy to debug and optimizeWhat are materialized views and how do they differ from regular views?
Both are named blocks of reusable SQL code stored in the database, but they have different purposes and capabilities:
- Stored Procedures — execute a series of SQL statements. Can modify data (INSERT, UPDATE, DELETE), manage transactions, return multiple result sets, use output parameters, and have side effects. Called with
CALLorEXEC. Cannot be used inside a SELECT statement. - Functions (UDFs) — compute and return a single value (scalar function) or a table (table-valued function). Must return something. Can be used inside SELECT, WHERE, and JOIN clauses like built-in functions. Generally cannot modify data or manage transactions (though this varies by database).
When to use each:
- Use procedures for operations that modify data, involve complex business logic with multiple steps, or need transaction control.
- Use functions for calculations, transformations, or lookups that you want to use inline within queries.
-- STORED PROCEDURE: transfer money between accounts
DELIMITER //
CREATE PROCEDURE sp_transfer_funds(
IN from_acct INT,
IN to_acct INT,
IN amount DECIMAL(10,2),
OUT success BOOLEAN
)
BEGIN
DECLARE from_balance DECIMAL(10,2);
START TRANSACTION;
SELECT balance INTO from_balance FROM accounts
WHERE id = from_acct FOR UPDATE;
IF from_balance >= amount THEN
UPDATE accounts SET balance = balance - amount WHERE id = from_acct;
UPDATE accounts SET balance = balance + amount WHERE id = to_acct;
INSERT INTO transactions (from_id, to_id, amount) VALUES (from_acct, to_acct, amount);
SET success = TRUE;
COMMIT;
ELSE
SET success = FALSE;
ROLLBACK;
END IF;
END //
DELIMITER ;
-- Call the procedure
CALL sp_transfer_funds(101, 202, 500.00, @result);
SELECT @result;
-- FUNCTION: calculate tax for a salary
CREATE FUNCTION fn_calc_tax(salary DECIMAL(10,2))
RETURNS DECIMAL(10,2)
DETERMINISTIC
BEGIN
IF salary <= 50000 THEN RETURN salary * 0.10;
ELSEIF salary <= 100000 THEN RETURN salary * 0.20;
ELSE RETURN salary * 0.30;
END IF;
END;
-- Use function inline in a query
SELECT name, salary, fn_calc_tax(salary) AS tax
FROM employees;
A banking application had fund transfer logic duplicated across 3 microservices (web, mobile, API). Each had slightly different validation rules, causing edge cases where transfers could overdraw accounts. Moving the logic into a single stored procedure sp_transfer_funds with proper locking (SELECT ... FOR UPDATE) and transaction handling eliminated the inconsistency and reduced overdraft incidents from ~15/month to zero.
Putting complex business logic in functions that get called per-row in a query — functions execute once per row, so a function called in a SELECT over 1 million rows runs 1 million times.
CREATE FUNCTION fn_get_order_count(cust_id INT)
RETURNS INT
BEGIN
RETURN (SELECT COUNT(*) FROM orders WHERE customer_id = cust_id);
END;
SELECT name, fn_get_order_count(id) FROM customers;
-- Executes the COUNT query 100,000 times for 100K customers!SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- Single query, single passWhat are the security implications of stored procedures? How do they help prevent SQL injection?
A trigger is a stored procedure that automatically executes in response to specific events on a table — INSERT, UPDATE, or DELETE. Triggers fire without being explicitly called.
Trigger timing:
- BEFORE — fires before the operation. Can modify the incoming data or cancel the operation. Use for validation or data transformation.
- AFTER — fires after the operation. The data has already been written. Use for audit logging, cascading updates, or notifications.
- INSTEAD OF — replaces the operation entirely (SQL Server/PostgreSQL for views). Use to make complex views updatable.
Inside a trigger, you have access to NEW (the new row for INSERT/UPDATE) and OLD (the original row for UPDATE/DELETE).
Caution: Triggers are invisible to application code, making them hard to debug. Overuse leads to "hidden logic" that developers aren't aware of. They also add overhead to every write operation on the table.
-- BEFORE INSERT: auto-fill created_at timestamp
CREATE TRIGGER trg_before_insert_orders
BEFORE INSERT ON orders
FOR EACH ROW
SET NEW.created_at = NOW(),
NEW.updated_at = NOW();
-- AFTER UPDATE: audit log
DELIMITER //
CREATE TRIGGER trg_audit_salary_change
AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
IF OLD.salary != NEW.salary THEN
INSERT INTO salary_audit (
employee_id, old_salary, new_salary,
changed_at, changed_by
) VALUES (
NEW.id, OLD.salary, NEW.salary,
NOW(), CURRENT_USER()
);
END IF;
END //
DELIMITER ;
-- BEFORE UPDATE: prevent invalid data
DELIMITER //
CREATE TRIGGER trg_prevent_salary_decrease
BEFORE UPDATE ON employees
FOR EACH ROW
BEGIN
IF NEW.salary < OLD.salary THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Salary cannot be decreased';
END IF;
END //
DELIMITER ;
-- View existing triggers
SHOW TRIGGERS; -- MySQL
SELECT * FROM information_schema.triggers; -- Standard SQL
A compliance team at a financial institution needed a complete audit trail of all changes to the accounts table — who changed what, when, and from what value. Rather than modifying 30+ application endpoints to add logging, they created AFTER UPDATE and AFTER DELETE triggers that automatically logged every change to an accounts_audit table. The audit table grew to 200 million rows, so they added partitioning by month and auto-archived partitions older than 7 years.
Creating triggers that modify other tables which also have triggers, causing an infinite trigger chain or unpredictable cascading effects.
-- Trigger on table A updates table B
-- Trigger on table B updates table A → infinite loop!
CREATE TRIGGER trg_a AFTER UPDATE ON table_a
FOR EACH ROW UPDATE table_b SET val = NEW.val WHERE id = NEW.ref_id;
CREATE TRIGGER trg_b AFTER UPDATE ON table_b
FOR EACH ROW UPDATE table_a SET val = NEW.val WHERE id = NEW.ref_id;CREATE TRIGGER trg_audit AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
-- Only log, don't modify other tables with triggers
INSERT INTO audit_log (table_name, record_id, action, changed_at)
VALUES ('employees', NEW.id, 'UPDATE', NOW());
END;Can triggers cause performance problems? How do you disable them during bulk operations?
A transaction is a sequence of SQL operations that are treated as a single atomic unit — either all operations succeed (COMMIT), or all are undone (ROLLBACK). There is no partial state.
Transaction workflow:
BEGIN(orSTART TRANSACTION) — marks the start of a transaction. All subsequent operations are part of this transaction.COMMIT— makes all changes permanent. Once committed, changes are durable even if the server crashes.ROLLBACK— undoes all changes made since BEGIN. The database returns to its state before the transaction started.SAVEPOINT— creates a checkpoint within a transaction. You can ROLLBACK TO a savepoint without undoing the entire transaction.
Transactions ensure the ACID properties: Atomicity (all-or-nothing), Consistency (data stays valid), Isolation (concurrent transactions don't interfere), Durability (committed data survives crashes).
Many databases use autocommit by default — each statement is its own transaction. Explicit BEGIN...COMMIT groups multiple statements into one atomic operation.
-- Basic transaction: transfer funds
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 101;
UPDATE accounts SET balance = balance + 500 WHERE id = 202;
INSERT INTO transfers (from_id, to_id, amount, created_at)
VALUES (101, 202, 500, NOW());
-- Verify before committing
SELECT id, balance FROM accounts WHERE id IN (101, 202);
COMMIT; -- All 3 operations become permanent
-- ROLLBACK example: something went wrong
BEGIN;
DELETE FROM orders WHERE customer_id = 42;
-- Oops, wrong customer!
ROLLBACK; -- Undo the delete, no data lost
-- SAVEPOINT: partial rollback
BEGIN;
INSERT INTO orders (customer_id, amount) VALUES (1, 100);
SAVEPOINT sp1;
INSERT INTO orders (customer_id, amount) VALUES (2, 200);
-- Oops, second insert had wrong data
ROLLBACK TO sp1; -- Only undoes the second INSERT
-- First INSERT is still pending
COMMIT; -- Commits only the first INSERT
-- Check autocommit setting
SELECT @@autocommit; -- MySQL (1 = on)
SHOW autocommit; -- PostgreSQL
-- Disable autocommit for a session
SET autocommit = 0; -- MySQL
-- Now every statement requires explicit COMMIT
An airline booking system processes seat reservations in a transaction: (1) check seat availability, (2) reserve the seat, (3) charge the credit card, (4) send confirmation email. If step 3 fails (card declined), the transaction rolls back step 2 (unreserve the seat) so it becomes available for others. Without transactions, a card failure would leave a "ghost reservation" — a seat marked as taken but never paid for. The system processes 50,000 bookings/day, and transactions prevent approximately 200 ghost reservations daily.
Forgetting to COMMIT or ROLLBACK, leaving a transaction open — this can hold locks for hours and block other users.
BEGIN;
UPDATE products SET price = 29.99 WHERE id = 42;
-- Developer gets distracted, forgets to COMMIT or ROLLBACK
-- The row is locked for hours, blocking all other queries on it
-- Eventually the DBA kills the sessionBEGIN;
UPDATE products SET price = 29.99 WHERE id = 42;
-- Application code: if success → COMMIT, else → ROLLBACK
COMMIT;
-- In application code, use try/finally:
-- try { begin(); update(); commit(); }
-- catch { rollback(); }
-- finally { close connection; }What are the ACID properties? Explain each one in detail.
Both EXISTS and IN filter rows based on a subquery, but they work differently:
- IN — compares a value against a list of values returned by the subquery. The subquery executes first and returns a complete result set. Then the outer query checks if each row's value is in that set. Best when the subquery returns a small result set.
- EXISTS — checks whether the subquery returns any rows at all (true/false). The subquery is correlated — it references the outer query and re-executes for each outer row. However, it stops as soon as it finds one match (short-circuits). Best when the subquery has a large result set or when an index exists on the correlated column.
Performance rule of thumb:
- Use
INwhen the subquery result set is small. - Use
EXISTSwhen the outer table is small and the subquery table is large (EXISTS stops early). - In modern databases, the optimizer often rewrites one to the other, making them equivalent in many cases.
NULL handling: IN can give unexpected results with NULLs (because NULL IN (...) returns NULL, not FALSE). EXISTS is not affected by NULLs.
-- IN: get all customers who have placed an order
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders);
-- Subquery runs once → returns a list of IDs
-- Outer query checks each customer against the list
-- EXISTS: same result, different approach
SELECT name FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- For each customer, check if ANY order exists → stop early
-- NOT IN vs NOT EXISTS (critical difference with NULLs!)
-- If orders.customer_id has NULL values:
SELECT name FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);
-- ⚠️ Returns EMPTY result if any customer_id is NULL!
-- Because: id NOT IN (1, 2, NULL) → id != 1 AND id != 2 AND id != NULL
-- The NULL comparison makes the whole expression NULL (falsy)
-- NOT EXISTS handles NULLs correctly
SELECT name FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- ✅ Works correctly regardless of NULLs
-- Performance comparison (depends on data distribution):
-- 10 customers, 1M orders → IN may be faster (small subquery result)
-- 1M customers, 100 orders → EXISTS may be faster (stops early)
-- Modern optimizers often make them equivalent
A data team at a subscription service used NOT IN to find users without active subscriptions. The query worked perfectly in development but returned zero results in production. After debugging, they found that 5 rows in the subscriptions table had NULL user_id values (from a migration bug). NOT IN with NULLs silently returns empty results. Switching to NOT EXISTS fixed the issue immediately and became the team's standard pattern for anti-joins.
Using NOT IN with a subquery that might contain NULLs — it silently returns zero results instead of the expected rows.
-- Find customers with no orders
SELECT name FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);
-- If ANY customer_id is NULL → returns EMPTY result set!
-- No error, no warning — just wrong dataSELECT name FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- Works correctly regardless of NULL valuesHow does the query optimizer decide whether to use IN or EXISTS internally? Can you convert between them?
ACID is a set of four properties that guarantee reliable processing of database transactions:
- Atomicity — a transaction is all-or-nothing. If any part fails, the entire transaction is rolled back. Example: a bank transfer deducts from Account A and credits Account B. If the credit fails, the deduction is also reversed.
- Consistency — a transaction moves the database from one valid state to another. All constraints (primary keys, foreign keys, checks) are enforced. Example: you cannot transfer more money than exists in an account if a CHECK constraint enforces non-negative balances.
- Isolation — concurrent transactions do not interfere with each other. Each transaction sees the database as if it were the only one running (the degree depends on the isolation level). Example: two users buying the last concert ticket simultaneously — isolation ensures only one succeeds.
- Durability — once a transaction is committed, the changes survive any subsequent system failure (crash, power outage). The database uses write-ahead logging (WAL) to ensure committed data can always be recovered.
-- ATOMICITY: all-or-nothing
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE id = 1;
UPDATE accounts SET balance = balance + 1000 WHERE id = 2;
-- If server crashes between these two statements:
-- → Both are rolled back, no money is lost
COMMIT;
-- CONSISTENCY: constraints prevent invalid states
CREATE TABLE accounts (
id INT PRIMARY KEY,
balance DECIMAL(10,2) CHECK (balance >= 0) -- no negative balance
);
BEGIN;
UPDATE accounts SET balance = balance - 5000 WHERE id = 1;
-- If balance would go negative → CHECK violation → transaction fails
COMMIT;
-- ISOLATION: concurrent transactions don't see each other's changes
-- Session 1: | Session 2:
-- BEGIN; |
-- UPDATE products SET qty=9 |
-- WHERE id=1; |
-- | SELECT qty FROM products WHERE id=1;
-- | → Still sees qty=10 (not 9)
-- COMMIT; |
-- | SELECT qty FROM products WHERE id=1;
-- | → Now sees qty=9
-- DURABILITY: committed data survives crashes
BEGIN;
INSERT INTO orders (customer_id, total) VALUES (42, 299.99);
COMMIT; -- Written to WAL (Write-Ahead Log) on disk
-- Even if server loses power NOW, this order is safe
In 2012, Knight Capital lost $440 million in 45 minutes due to a software deployment error that sent duplicate stock orders. The system lacked proper transaction atomicity — partial orders were committed without the corresponding hedge trades. If the trading system had used proper ACID transactions grouping the order and its hedge as one atomic unit, the runaway orders could have been rolled back. This incident is one of the most cited examples of why ACID matters in financial systems.
Assuming that writing data to the database means it is durable — without a proper COMMIT, data can be lost on crash.
SET autocommit = 0;
INSERT INTO payments (user_id, amount) VALUES (1, 500);
UPDATE users SET balance = balance - 500 WHERE id = 1;
-- No COMMIT! If server crashes → both operations are lost
-- User sees "Payment successful" but data is goneBEGIN;
INSERT INTO payments (user_id, amount) VALUES (1, 500);
UPDATE users SET balance = balance - 500 WHERE id = 1;
COMMIT; -- Data written to WAL → survives crashWhat are the four transaction isolation levels and what problems does each solve?
Isolation levels control how much one transaction can see changes made by other concurrent transactions. Higher isolation = fewer concurrency bugs but more locking/overhead.
The four levels (weakest to strongest):
- Read Uncommitted — can see uncommitted changes from other transactions (dirty reads). Rarely used in practice.
- Read Committed — can only see committed data. Prevents dirty reads but allows non-repeatable reads (a row read twice in the same transaction can return different values if another transaction commits between reads). Default in PostgreSQL, Oracle, SQL Server.
- Repeatable Read — guarantees that if you read a row twice, you get the same value. Prevents dirty and non-repeatable reads but allows phantom reads (new rows matching a WHERE clause can appear if another transaction inserts and commits). Default in MySQL/InnoDB.
- Serializable — complete isolation. Transactions behave as if they ran one after another. Prevents all concurrency anomalies but has the highest performance cost (more locks, potential deadlocks).
-- Set isolation level (per transaction)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees only committed data
COMMIT;
-- DIRTY READ (Read Uncommitted)
-- Session 1: BEGIN; UPDATE accounts SET balance=0 WHERE id=1; (no commit)
-- Session 2: SELECT balance FROM accounts WHERE id=1; → sees 0 (dirty!)
-- Session 1: ROLLBACK; → balance is back to original
-- Session 2 used incorrect data!
-- NON-REPEATABLE READ (Read Committed)
-- Session 1: BEGIN; SELECT price FROM products WHERE id=1; → $10
-- Session 2: UPDATE products SET price=15 WHERE id=1; COMMIT;
-- Session 1: SELECT price FROM products WHERE id=1; → $15 (changed!)
-- Same row, different value within the same transaction
-- PHANTOM READ (Repeatable Read)
-- Session 1: BEGIN; SELECT COUNT(*) FROM orders WHERE date='2024-01-01'; → 5
-- Session 2: INSERT INTO orders (date,...) VALUES ('2024-01-01',...); COMMIT;
-- Session 1: SELECT COUNT(*) FROM orders WHERE date='2024-01-01'; → 6 (phantom!)
-- Check current isolation level
SELECT @@transaction_isolation; -- MySQL
SHOW transaction_isolation; -- PostgreSQL
-- Set default for session
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
An e-commerce platform experienced a double-spend bug: two concurrent sessions could both read a gift card balance of $100, each process a $80 purchase, and both commit successfully — resulting in $160 spent from a $100 card. The root cause was Read Committed isolation, which allowed both sessions to read the same balance before either committed. Upgrading to Serializable for gift card transactions (and using SELECT ... FOR UPDATE) eliminated the race condition. The team kept Read Committed as the default for other transactions to avoid performance impact.
Using the default isolation level without understanding its implications — leading to subtle concurrency bugs that only appear under load.
-- Session 1 & 2 both do this simultaneously:
BEGIN;
SELECT balance FROM gift_cards WHERE id = 1; -- Both see $100
UPDATE gift_cards SET balance = balance - 80 WHERE id = 1;
COMMIT;
-- Both succeed! Card balance is now $20, but $160 was spentBEGIN;
SELECT balance FROM gift_cards WHERE id = 1 FOR UPDATE;
-- Row is locked — other sessions must wait
IF balance >= 80 THEN
UPDATE gift_cards SET balance = balance - 80 WHERE id = 1;
END IF;
COMMIT; -- Lock released, next session can proceedWhat is SELECT ... FOR UPDATE and how does it prevent race conditions?
Normalization is the process of organizing database tables to reduce data redundancy and prevent update anomalies. Each normal form (NF) builds on the previous one:
- 1NF (First Normal Form) — every column contains atomic (indivisible) values. No repeating groups or arrays. Each row is unique (has a primary key).
- 2NF (Second Normal Form) — is in 1NF AND every non-key column depends on the entire primary key, not just part of it. Only relevant for composite primary keys. Eliminates partial dependencies.
- 3NF (Third Normal Form) — is in 2NF AND no non-key column depends on another non-key column. Eliminates transitive dependencies. A non-key column must depend only on the primary key.
- BCNF (Boyce-Codd Normal Form) — a stricter version of 3NF. Every determinant (column that determines another column) must be a candidate key. Handles edge cases where 3NF still allows redundancy.
Trade-off: higher normalization reduces redundancy but increases the number of JOINs needed for queries, which can hurt read performance. Many production systems intentionally denormalize for performance.
-- ❌ NOT 1NF: repeating groups / non-atomic values
-- | student_id | name | courses |
-- | 1 | Alice | Math, Physics | ← not atomic!
-- | 2 | Bob | Chemistry |
-- ✅ 1NF: atomic values, separate rows
-- | student_id | name | course |
-- | 1 | Alice | Math |
-- | 1 | Alice | Physics |
-- | 2 | Bob | Chemistry |
-- ❌ NOT 2NF: partial dependency (composite key)
-- PK: (student_id, course_id)
-- | student_id | course_id | student_name | grade |
-- student_name depends on student_id alone (partial dependency)
-- ✅ 2NF: split into two tables
-- students: student_id (PK), student_name
-- enrollments: student_id, course_id (composite PK), grade
-- ❌ NOT 3NF: transitive dependency
-- | employee_id | dept_id | dept_name | salary |
-- dept_name depends on dept_id (not on employee_id directly)
-- ✅ 3NF: remove transitive dependency
-- employees: employee_id (PK), dept_id (FK), salary
-- departments: dept_id (PK), dept_name
-- Real schema showing 3NF:
CREATE TABLE departments (
id INT PRIMARY KEY,
dept_name VARCHAR(100) NOT NULL
);
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
dept_id INT REFERENCES departments(id),
salary DECIMAL(10,2)
);
A retail chain stored product data in a single denormalized table with supplier_name, supplier_phone, and supplier_address repeated on every product row. When a supplier changed their phone number, they had to update 15,000 rows (one per product). Some rows were missed, causing order forms to show different phone numbers for the same supplier. Normalizing into separate products and suppliers tables meant the phone number was stored once — one UPDATE fixed everything.
Storing derived or repeated data in the same table, creating update anomalies when the source data changes.
CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT,
customer_name VARCHAR(100), -- duplicated from customers table!
customer_email VARCHAR(255), -- also duplicated!
amount DECIMAL(10,2)
);
-- Customer changes name → must update ALL their order rowsCREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(255)
);
CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT REFERENCES customers(id),
amount DECIMAL(10,2)
);
-- Customer name stored once, JOIN when neededWhen is denormalization justified? Give examples of intentional denormalization for performance.
An execution plan (or query plan) shows how the database engine will execute a query — which indexes it uses, how it joins tables, and where it spends time. It is the most important tool for query optimization.
EXPLAIN— shows the planned execution strategy without running the query. Shows estimated costs and row counts.EXPLAIN ANALYZE(PostgreSQL) /EXPLAIN FORMAT=ANALYZE— actually runs the query and shows real execution times alongside estimates. More accurate but actually executes the query (be careful with UPDATE/DELETE).
Key things to look for in a plan:
- Seq Scan / Full Table Scan — reading every row in the table. Bad for large tables, usually means a missing index.
- Index Scan / Index Seek — using an index to find rows efficiently. Good.
- Nested Loop / Hash Join / Merge Join — the method used to join tables. Each is optimal for different data sizes.
- Sort — an explicit sort step (for ORDER BY or DISTINCT). Can be expensive for large result sets.
- Estimated vs actual rows — large discrepancies indicate stale statistics (run ANALYZE).
-- PostgreSQL: EXPLAIN shows the plan
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Seq Scan on employees (cost=0.00..35.50 rows=10 width=64)
-- Filter: (department = 'Engineering')
-- → Full table scan! Need an index.
-- Add index and re-check
CREATE INDEX idx_dept ON employees(department);
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Index Scan using idx_dept on employees (cost=0.28..8.29 rows=10 width=64)
-- Index Cond: (department = 'Engineering')
-- → Index scan! Much better.
-- EXPLAIN ANALYZE: shows actual times
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.dept_name = 'Engineering';
-- Hash Join (cost=... rows=10) (actual time=0.5..1.2 rows=10 loops=1)
-- → Seq Scan on departments (actual time=0.01..0.05 rows=1)
-- → Seq Scan on employees (actual time=0.1..0.8 rows=100)
-- Planning time: 0.2ms Execution time: 1.5ms
-- MySQL: EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- +----+------+---------------+---------+------+------+----------+
-- | id | type | possible_keys | key | rows | Extra |
-- +----+------+---------------+---------+------+------+----------+
-- | 1 | ref | idx_customer | idx_cust| 15 | Using index |
-- type=ref (index lookup), rows=15 (estimated), Using index=good
-- MySQL: EXPLAIN FORMAT=JSON for detailed analysis
EXPLAIN FORMAT=JSON SELECT ...;
A fintech company's daily reconciliation report took 45 minutes. EXPLAIN ANALYZE revealed that a JOIN between transactions (50M rows) and accounts (200K rows) was using a Nested Loop with a Seq Scan on transactions — 200K × 50M row scans. The estimated rows said 100, but actual rows were 50M (stale statistics). Running ANALYZE transactions updated the stats, and the optimizer switched to a Hash Join — query time dropped to 3 minutes. Adding a composite index further reduced it to 45 seconds.
Optimizing queries without looking at the execution plan — guessing which part is slow instead of measuring.
-- "This query is slow, let me add indexes on everything"
CREATE INDEX idx_1 ON orders(customer_id);
CREATE INDEX idx_2 ON orders(status);
CREATE INDEX idx_3 ON orders(amount);
CREATE INDEX idx_4 ON orders(created_at);
-- Still slow! Never checked what the query actually doesEXPLAIN ANALYZE SELECT ... FROM orders WHERE ...;
-- Shows: Seq Scan on orders → 12 seconds
-- The bottleneck is a missing index on (customer_id, created_at)
CREATE INDEX idx_cust_date ON orders(customer_id, created_at);
-- Re-run: Index Scan → 50msWhat are the different join algorithms (Nested Loop, Hash Join, Merge Join) and when does the optimizer choose each?
Locking strategies control how concurrent transactions handle conflicts when they try to modify the same data:
- Pessimistic locking — assumes conflicts will happen. Locks the row/resource before reading it, preventing anyone else from modifying it until the lock is released. Implemented with
SELECT ... FOR UPDATE. Guarantees no conflicts but can cause blocking and deadlocks. - Optimistic locking — assumes conflicts are rare. Does NOT lock the row when reading. Instead, checks at update time whether the data has changed since it was read (using a version number or timestamp). If it has changed, the update fails and the application retries. Implemented at the application level (not a database feature). Better throughput but requires retry logic.
When to use each:
- Pessimistic: high contention (many users editing the same rows), short transactions, critical financial operations.
- Optimistic: low contention (users rarely edit the same row), long read phases (like a user editing a form), high-throughput systems.
-- PESSIMISTIC LOCKING: SELECT ... FOR UPDATE
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Row is LOCKED — other transactions trying to update this row will WAIT
-- ... do calculations ...
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT; -- Lock released
-- FOR UPDATE with NOWAIT (fail immediately instead of waiting)
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE NOWAIT;
-- If row is already locked → ERROR instead of waiting
-- FOR UPDATE with SKIP LOCKED (skip locked rows)
SELECT * FROM job_queue WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Worker picks up the next available (unlocked) job
-- OPTIMISTIC LOCKING: version column
-- Step 1: Read with version
SELECT id, name, salary, version FROM employees WHERE id = 1;
-- Returns: id=1, name=Alice, salary=90000, version=3
-- Step 2: Update only if version hasn't changed
UPDATE employees
SET salary = 95000, version = version + 1
WHERE id = 1 AND version = 3;
-- If another transaction changed the row, version != 3
-- → UPDATE affects 0 rows → application detects conflict and retries
-- Check if update succeeded
-- IF affected_rows = 0 THEN
-- "Conflict detected, please retry"
-- END IF
An airline booking system uses pessimistic locking for seat selection: when a user selects seat 14A, SELECT ... FOR UPDATE locks that seat row. If two users select the same seat simultaneously, the second one waits. A wiki platform uses optimistic locking: when you edit an article, the page sends a version number with your changes. If someone else saved a change while you were editing, the version mismatch triggers a "conflict" page showing both edits. Wikipedia uses this exact pattern for its edit conflict resolution.
Using pessimistic locking with long transactions — holding locks while waiting for user input causes other users to be blocked for minutes.
BEGIN;
SELECT * FROM products WHERE id = 42 FOR UPDATE;
-- Lock acquired... now show edit form to user
-- User takes 5 minutes to fill out the form
-- ALL other users trying to edit product 42 are BLOCKED
UPDATE products SET price = 29.99 WHERE id = 42;
COMMIT; -- Lock held for 5 minutes!-- Read without locking
SELECT id, price, version FROM products WHERE id = 42;
-- Show form to user (no locks held)
-- User takes 5 minutes, submits form
UPDATE products SET price = 29.99, version = version + 1
WHERE id = 42 AND version = 5; -- check version
-- If affected_rows = 0 → someone else changed it → show conflictWhat is a deadlock and how do databases detect and resolve them?
A recursive CTE is a CTE that references itself, allowing you to traverse hierarchical or graph-like data structures. It has two parts:
- Anchor member — the base case that starts the recursion (non-recursive SELECT).
- Recursive member — references the CTE itself and JOINs with the source table to walk the hierarchy level by level.
The database executes the anchor first, then repeatedly executes the recursive member using the previous iteration's results until no new rows are produced.
Common use cases:
- Org charts (employee → manager → CEO)
- Category trees (sub-category → category → root)
- Bill of materials (part → sub-assembly → assembly → product)
- File system paths (folder → parent folder → root)
- Graph traversal (social networks, routing)
Important: always include a termination condition or MAXRECURSION limit to prevent infinite loops on cyclic data.
-- Org chart: find all employees under a manager
-- employees: id, name, manager_id (self-referencing FK)
WITH RECURSIVE org_tree AS (
-- Anchor: start from the CEO (no manager)
SELECT id, name, manager_id, 0 AS depth,
CAST(name AS VARCHAR(500)) AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: find direct reports of each person
SELECT e.id, e.name, e.manager_id, t.depth + 1,
CAST(t.path || ' → ' || e.name AS VARCHAR(500))
FROM employees e
JOIN org_tree t ON e.manager_id = t.id
)
SELECT depth, path FROM org_tree ORDER BY path;
-- Output:
-- 0 CEO
-- 1 CEO → VP Engineering
-- 2 CEO → VP Engineering → Tech Lead
-- 3 CEO → VP Engineering → Tech Lead → Developer
-- Category tree: get full breadcrumb path
WITH RECURSIVE cat_tree AS (
SELECT id, name, parent_id, name AS full_path
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id,
ct.full_path || ' > ' || c.name
FROM categories c
JOIN cat_tree ct ON c.parent_id = ct.id
)
SELECT * FROM cat_tree;
-- Electronics > Phones > Smartphones
-- Number sequence (simple recursion demo)
WITH RECURSIVE nums AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM nums WHERE n < 10
)
SELECT n FROM nums; -- 1, 2, 3, ..., 10
An e-commerce platform needed to display full category breadcrumbs (Home > Electronics > Phones > iPhones) for 50,000 products. Initially, the application made multiple database queries per product (one per level), hitting the database 200,000 times for a catalog page. A single recursive CTE replaced all those queries, generating all breadcrumbs in one query under 200ms. The CTE output was cached in a materialized view that refreshed every hour.
Not adding a recursion limit, causing infinite loops when data has cycles (e.g., employee A reports to B, B reports to A).
WITH RECURSIVE tree AS (
SELECT id, parent_id FROM nodes WHERE id = 1
UNION ALL
SELECT n.id, n.parent_id FROM nodes n JOIN tree t ON n.parent_id = t.id
)
SELECT * FROM tree;
-- If node 5 → node 3 → node 5 (cycle) → infinite loop!WITH RECURSIVE tree AS (
SELECT id, parent_id, 0 AS depth FROM nodes WHERE id = 1
UNION ALL
SELECT n.id, n.parent_id, t.depth + 1
FROM nodes n JOIN tree t ON n.parent_id = t.id
WHERE t.depth < 20 -- Safety limit
)
SELECT * FROM tree;How do you detect cycles in recursive CTEs? What is the CYCLE clause in SQL:2003?
Aggregate functions with GROUP BY collapse multiple rows into one summary row per group. Window (analytic) functions compute values across a "window" of rows related to the current row — but keep all individual rows in the output.
Key window functions beyond ranking:
- SUM/AVG/COUNT OVER() — running totals, moving averages, cumulative counts without losing row detail.
- LAG(column, n) — access the value from
nrows before the current row. Useful for comparing to the previous period. - LEAD(column, n) — access the value from
nrows after the current row. - FIRST_VALUE / LAST_VALUE — get the first or last value in the window frame.
The PARTITION BY clause divides rows into independent groups (like GROUP BY but without collapsing). ORDER BY within OVER determines the row ordering. The frame clause (ROWS BETWEEN) controls exactly which rows are included in the calculation.
-- Running total (cumulative sum)
SELECT
order_date, amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;
-- Jan 1: $100, running=$100
-- Jan 2: $200, running=$300
-- Jan 3: $150, running=$450
-- Running total per customer (PARTITION BY)
SELECT
customer_id, order_date, amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS cust_running
FROM orders;
-- Moving average (last 3 months)
SELECT
month, revenue,
AVG(revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3m
FROM monthly_revenue;
-- LAG: compare to previous month
SELECT
month, revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS change,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month) * 100, 1) AS pct_change
FROM monthly_revenue;
-- LEAD: look ahead to next value
SELECT
name, salary,
LEAD(salary) OVER (ORDER BY salary DESC) AS next_lower_salary,
salary - LEAD(salary) OVER (ORDER BY salary DESC) AS gap
FROM employees;
-- FIRST_VALUE: highest salary in each department
SELECT name, department, salary,
FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) AS top_earner
FROM employees;
A financial analytics team needed month-over-month growth rates for 500 product categories. The initial approach used self-joins (JOIN revenue ON month = prev_month), creating a 500-line query that was slow and error-prone. Replacing it with LAG(revenue) OVER (PARTITION BY category ORDER BY month) reduced the query to 10 lines and ran 8x faster. The same pattern was used for running totals (SUM OVER), moving averages (AVG OVER with ROWS BETWEEN), and YTD calculations — all without self-joins.
Using self-joins to compare rows with previous/next rows when LAG/LEAD would be cleaner and faster.
SELECT a.month, a.revenue, b.revenue AS prev_revenue,
a.revenue - b.revenue AS change
FROM monthly_revenue a
LEFT JOIN monthly_revenue b
ON b.month = a.month - INTERVAL 1 MONTH;
-- Breaks if months are missing, complex for multiple periodsSELECT month, revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS change
FROM monthly_revenue;
-- Handles gaps naturally, easy to extend to 2+ periodsWhat are frame clauses (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and how do they affect window functions?
Pivot transforms rows into columns — turning unique values from one column into multiple columns in the output. It converts "long" (normalized) data into "wide" (denormalized) format, commonly used for cross-tabulation reports.
Unpivot is the reverse — transforms columns into rows, converting "wide" data back to "long" format.
Implementation varies by database:
- SQL Server — has native
PIVOTandUNPIVOToperators. - PostgreSQL — use
crosstab()from thetablefuncextension, or manual CASE/FILTER approach. - MySQL — no native PIVOT. Use conditional aggregation with
CASE WHEN+SUM/MAX.
The most portable approach (works everywhere) is conditional aggregation: SUM(CASE WHEN category = 'X' THEN value END).
-- Source data (long format):
-- | year | quarter | revenue |
-- | 2024 | Q1 | 100000 |
-- | 2024 | Q2 | 120000 |
-- | 2024 | Q3 | 110000 |
-- | 2024 | Q4 | 150000 |
-- PIVOT using conditional aggregation (works everywhere)
SELECT
year,
SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS Q1,
SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS Q2,
SUM(CASE WHEN quarter = 'Q3' THEN revenue END) AS Q3,
SUM(CASE WHEN quarter = 'Q4' THEN revenue END) AS Q4,
SUM(revenue) AS total
FROM quarterly_revenue
GROUP BY year;
-- Output (wide format):
-- | year | Q1 | Q2 | Q3 | Q4 | total |
-- | 2024 | 100000 | 120000 | 110000 | 150000 | 480000 |
-- SQL Server native PIVOT
SELECT year, [Q1], [Q2], [Q3], [Q4]
FROM quarterly_revenue
PIVOT (SUM(revenue) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS pvt;
-- UNPIVOT: columns back to rows
-- Source: | product | jan_sales | feb_sales | mar_sales |
-- Target: | product | month | sales |
-- Using UNION ALL (portable)
SELECT product, 'January' AS month, jan_sales AS sales FROM products
UNION ALL
SELECT product, 'February', feb_sales FROM products
UNION ALL
SELECT product, 'March', mar_sales FROM products;
-- PostgreSQL: LATERAL + VALUES for unpivot
SELECT p.product, m.month, m.sales
FROM products p,
LATERAL (VALUES ('Jan', jan_sales), ('Feb', feb_sales), ('Mar', mar_sales))
AS m(month, sales);
A business intelligence team needed to generate a monthly sales matrix showing products as rows and months as columns. The raw data had 2 million rows in long format (product_id, month, amount). Pivoting with conditional aggregation (CASE WHEN month = ... THEN amount END) produced a clean spreadsheet-ready output. The team created this as a materialized view that refreshed nightly, and their Excel-based reporting tool consumed it via ODBC — eliminating 4 hours of manual spreadsheet formatting per week.
Hardcoding column values in PIVOT queries that need to handle dynamic data — adding a new quarter or category requires changing the SQL.
SELECT year,
SUM(CASE WHEN quarter='Q1' THEN revenue END) AS Q1,
SUM(CASE WHEN quarter='Q2' THEN revenue END) AS Q2,
SUM(CASE WHEN quarter='Q3' THEN revenue END) AS Q3,
SUM(CASE WHEN quarter='Q4' THEN revenue END) AS Q4
FROM revenue GROUP BY year;
-- What if a new quarter-type "Q5-Bonus" is added?-- Build the CASE columns dynamically in application code
-- or use dynamic SQL in a stored procedure:
SET @sql = NULL;
SELECT GROUP_CONCAT(DISTINCT
CONCAT('SUM(CASE WHEN quarter=\'', quarter,
'\' THEN revenue END) AS `', quarter, '`'))
INTO @sql FROM revenue;
SET @sql = CONCAT('SELECT year, ', @sql, ' FROM revenue GROUP BY year');
PREPARE stmt FROM @sql; EXECUTE stmt;How do you handle dynamic pivot when the column values are not known at query-writing time?
Partitioning splits a large table into smaller, physically separate pieces (partitions) while maintaining a single logical table. Queries that target specific partitions only scan those partitions (partition pruning), dramatically improving performance.
Three main strategies:
- Range partitioning — partition by value ranges. Most common for time-series data: one partition per month or year. Queries with date range filters benefit from partition pruning.
- List partitioning — partition by discrete values. Useful for geographic regions, status codes, or product categories.
- Hash partitioning — partition by hash of a column value. Distributes data evenly across N partitions. Good when there's no natural range or list but you want to spread I/O across disks.
Benefits: faster queries via partition pruning, easier maintenance (drop old partitions instead of DELETE), parallel query execution across partitions, independent backup/restore per partition.
When to partition: tables with hundreds of millions of rows, clear partition key (date, region), queries that almost always include the partition key in WHERE clause.
-- RANGE partitioning by date (PostgreSQL)
CREATE TABLE orders (
id SERIAL,
customer_id INT NOT NULL,
amount DECIMAL(10,2),
order_date DATE NOT NULL
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_q1 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
CREATE TABLE orders_2024_q3 PARTITION OF orders
FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
CREATE TABLE orders_2024_q4 PARTITION OF orders
FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
-- Query with partition pruning (only scans Q1 partition)
SELECT * FROM orders WHERE order_date BETWEEN '2024-02-01' AND '2024-03-31';
-- LIST partitioning by region
CREATE TABLE customers (
id SERIAL,
name VARCHAR(100),
region VARCHAR(20)
) PARTITION BY LIST (region);
CREATE TABLE customers_americas PARTITION OF customers FOR VALUES IN ('US', 'CA', 'MX', 'BR');
CREATE TABLE customers_europe PARTITION OF customers FOR VALUES IN ('UK', 'DE', 'FR', 'ES');
CREATE TABLE customers_asia PARTITION OF customers FOR VALUES IN ('IN', 'JP', 'CN', 'SG');
-- HASH partitioning (even distribution)
CREATE TABLE logs (
id SERIAL,
user_id INT,
action VARCHAR(50)
) PARTITION BY HASH (user_id);
CREATE TABLE logs_p0 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE logs_p1 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE logs_p2 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE logs_p3 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 3);
-- Maintenance: drop old data by detaching/dropping partition
ALTER TABLE orders DETACH PARTITION orders_2024_q1;
DROP TABLE orders_2024_q1; -- Instant, vs DELETE millions of rows
A ride-sharing company's trips table had 2 billion rows. Queries for monthly reports scanned the entire table, taking 30+ minutes. After range-partitioning by month (24 partitions for 2 years), monthly report queries only scanned one partition (~80M rows) — execution time dropped to 90 seconds. The biggest win was data retention: dropping a 2-year-old partition (DROP TABLE trips_2022_01) took milliseconds vs. DELETE which would have taken hours and generated massive WAL.
Partitioning on a column that is rarely used in WHERE clauses — the optimizer cannot prune partitions, so it scans all of them (worse than no partitioning due to overhead).
-- Partitioned by region, but all queries filter by date
CREATE TABLE orders (...) PARTITION BY LIST (region);
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Must scan ALL region partitions → slower than unpartitioned!-- Partitioned by date, queries always filter by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Only scans the June partition → fastWhat is the difference between partitioning and sharding? When would you choose each?
Sharding distributes data across multiple database servers (nodes) to handle more data and traffic than a single server can manage. Unlike partitioning (which splits data within one server), sharding splits data across different servers.
- Horizontal sharding — same table schema on each shard, but different rows. A shard key determines which shard holds each row. Example: users 1-1M on shard 1, users 1M-2M on shard 2. Most common approach.
- Vertical sharding — different tables or columns on different servers. Example: user profiles on server A, user activity logs on server B. Simpler but limited — eventually each table outgrows its server.
Horizontal sharding challenges:
- Shard key selection — choose poorly and you get hotspots (one shard handling most traffic).
- Cross-shard queries — JOINs across shards are expensive or impossible.
- Distributed transactions — ACID across shards requires 2-phase commit (slow) or eventual consistency.
- Rebalancing — adding a new shard requires redistributing data.
- Application complexity — routing logic must know which shard to query.
-- HORIZONTAL SHARDING example concept:
-- Shard key: user_id % num_shards
-- Shard 0 (users where user_id % 3 = 0): user 3, 6, 9, ...
-- Shard 1 (users where user_id % 3 = 1): user 1, 4, 7, ...
-- Shard 2 (users where user_id % 3 = 2): user 2, 5, 8, ...
-- Application routing (pseudocode):
-- shard_id = user_id % NUM_SHARDS
-- connection = shard_connections[shard_id]
-- connection.query("SELECT * FROM users WHERE id = ?", user_id)
-- Each shard has the SAME schema:
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(255)
);
-- VERTICAL SHARDING example:
-- Server A: users, profiles (frequently read)
-- Server B: orders, payments (write-heavy)
-- Server C: analytics, logs (append-only)
-- Cross-shard query problem:
-- "Find all orders for users in California"
-- User data on Shard A, order data distributed across all shards
-- → Application must query users shard, get IDs, then fan out to order shards
-- → No simple SQL JOIN possible
-- Consistent hashing: handles adding/removing shards
-- Hash ring: hash(user_id) maps to a point on a ring
-- Each shard owns a range of the ring
-- Adding a shard only moves ~1/N of the data
Instagram sharded their PostgreSQL database by user_id when they hit 25 million users on a single server. They used a shard-aware Django ORM layer that routed queries to the correct shard based on user_id % num_shards. Cross-shard queries (like "find all comments on a photo") required fan-out to multiple shards and merging results in the application. They managed 12 shards initially and grew to 64. The biggest lesson: choosing user_id as the shard key meant user-centric queries (profile, feed) were fast, but global queries (trending posts, search) required a separate denormalized search index.
Choosing a shard key that creates hotspots — where one shard handles disproportionately more traffic than others.
-- Shard by created_at date range:
-- Shard 1: Jan-Jun 2024, Shard 2: Jul-Dec 2024
-- ALL new writes go to Shard 2 (current month)
-- Shard 2 is overloaded, Shard 1 is idle
-- Problem gets worse over time-- Shard by hash of user_id:
-- hash(user_id) % num_shards → even distribution
-- Both reads and writes spread across all shards
-- New users naturally distribute to all shardsWhat is consistent hashing and why is it used in sharding? How does it handle adding or removing shards?
Replication copies data from one database server to one or more replicas. It provides high availability (if the primary fails, a replica can take over), read scaling (distribute read queries across replicas), and geographic distribution (replicas closer to users).
- Master-slave (primary-replica) — one server handles all writes. Changes are replicated to one or more read-only replicas. Reads can go to any replica. Simplest model. If the master fails, one replica is promoted.
- Multi-master — multiple servers accept writes. Each propagates changes to the others. More complex — requires conflict resolution when two masters modify the same row simultaneously.
Replication types:
- Synchronous — write is not confirmed until all replicas have it. Strong consistency but higher latency.
- Asynchronous — write is confirmed immediately. Replicas catch up later. Faster but can have replication lag — a read from a replica may return stale data.
Eventual consistency means that if no new writes are made, all replicas will eventually converge to the same state. The "window of inconsistency" is the replication lag (typically milliseconds to seconds).
-- MySQL: check replication status on replica
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 0 ← replica is caught up
-- Seconds_Behind_Source: 15 ← 15 seconds of lag
-- Application-level read/write splitting (pseudocode):
-- writes → primary connection
-- reads → replica connection (may have slight lag)
-- Read-after-write consistency pattern:
-- 1. User creates a post → write to PRIMARY
-- 2. Redirect to view the post → read from PRIMARY (not replica!)
-- 3. All other reads → read from REPLICA
-- PostgreSQL streaming replication setup (primary):
-- postgresql.conf:
-- wal_level = replica
-- max_wal_senders = 3
-- MySQL Group Replication (multi-master):
-- All members can accept writes
-- Built-in conflict detection (certification-based)
-- Conflicts resolved automatically (first writer wins)
-- Conflict example in multi-master:
-- Node A: UPDATE users SET name='Alice' WHERE id=1; (timestamp T1)
-- Node B: UPDATE users SET name='Alicia' WHERE id=1; (timestamp T2)
-- Conflict resolution: T1 < T2 → Node B's change wins
-- Both nodes converge to name='Alicia'
A social media platform with 50 million users used master-slave replication with 5 read replicas. Reads (90% of traffic) were distributed across replicas, while the primary handled writes. They encountered a common issue: a user creates a post, the page redirects to their profile, and the post doesn't appear (because the profile read hit a replica that hadn't received the write yet). The fix was "read-your-writes consistency": after any write, the user's next read is routed to the primary for 5 seconds, then back to replicas.
Reading from a replica immediately after writing to the primary, expecting the latest data — replication lag causes stale reads.
-- Write to primary
INSERT INTO posts (user_id, content) VALUES (42, 'Hello World');
-- Immediately read from replica
SELECT * FROM posts WHERE user_id = 42 ORDER BY created_at DESC LIMIT 1;
-- May NOT return the post just created (replication lag)
-- User sees "post failed" even though it succeeded-- Write to primary
INSERT INTO posts (user_id, content) VALUES (42, 'Hello World');
-- Read from PRIMARY for this user's next request
SELECT * FROM posts WHERE user_id = 42 ORDER BY created_at DESC LIMIT 1;
-- Always sees their own writes
-- After 5 seconds, switch back to replica for this userWhat is the CAP theorem and how does it relate to database replication choices?
An ORM (Object-Relational Mapper) maps database tables to programming language objects, letting you interact with the database using code instead of SQL strings. Popular ORMs: SQLAlchemy (Python), Hibernate (Java), Entity Framework (.NET), Sequelize (Node.js), ActiveRecord (Ruby).
ORM advantages: type safety, less boilerplate, prevents SQL injection (parameterized queries), database-agnostic code, migration management, easier for simple CRUD operations.
Raw SQL advantages: full control over query optimization, complex queries (CTEs, window functions, lateral joins), database-specific features, better performance for batch operations, necessary for query tuning.
The N+1 query problem is the most common ORM performance pitfall: the ORM executes 1 query to fetch N parent records, then N separate queries to fetch each parent's related records — instead of a single JOIN query.
-- N+1 PROBLEM example (Python/SQLAlchemy pseudocode):
# ❌ N+1: 1 query for authors + N queries for books
authors = session.query(Author).all() # Query 1: SELECT * FROM authors
for author in authors:
print(author.books) # Query 2..N+1: SELECT * FROM books WHERE author_id = ?
# If 100 authors → 101 queries!
# ✅ Fix: Eager loading (1 query with JOIN)
authors = session.query(Author).options(joinedload(Author.books)).all()
# Single query: SELECT * FROM authors JOIN books ON ...
# ✅ Fix: Subquery loading (2 queries total)
authors = session.query(Author).options(subqueryload(Author.books)).all()
# Query 1: SELECT * FROM authors
# Query 2: SELECT * FROM books WHERE author_id IN (1,2,3,...100)
-- Raw SQL equivalent (always efficient):
SELECT a.name, b.title
FROM authors a
LEFT JOIN books b ON a.id = b.author_id;
-- 1 query regardless of the number of authors
-- When to use raw SQL:
-- Complex analytics with window functions
SELECT name, salary,
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as dept_rank,
salary - AVG(salary) OVER (PARTITION BY dept) as vs_avg
FROM employees;
-- ORMs cannot express this elegantly
-- When ORM is fine:
-- Simple CRUD operations
-- user = User(name="Alice", email="alice@example.com")
-- session.add(user)
-- session.commit()
A Django REST API serving a product catalog had a response time of 4 seconds for the products list endpoint. Django Debug Toolbar revealed 847 queries per request — a classic N+1 problem. Each product triggered separate queries for its category, brand, images, and reviews. Adding select_related (JOIN) for category and brand, and prefetch_related (2 queries with IN) for images and reviews reduced the query count from 847 to 5, and response time from 4 seconds to 120ms — a 33x improvement with 3 lines of code.
Accessing related objects in a loop without eager loading, causing hundreds of invisible queries (N+1 problem).
# Python/Django
orders = Order.objects.all() # 1 query
for order in orders:
print(order.customer.name) # N queries (1 per order)
print(order.items.count()) # N more queries
# 500 orders → 1001 queries → slow!# Python/Django
orders = Order.objects.select_related('customer').prefetch_related('items').all()
# Query 1: SELECT orders.*, customers.* FROM orders JOIN customers
# Query 2: SELECT * FROM items WHERE order_id IN (1,2,...500)
for order in orders:
print(order.customer.name) # No query — already loaded
print(order.items.count()) # No query — already loadedHow do you detect N+1 queries in production? What monitoring tools help?
Schema migrations are changes to the database structure (adding/removing columns, changing types, creating tables). In production systems with zero-downtime requirements, these changes must be applied without locking tables or breaking running application code.
Key principles for zero-downtime migrations:
- Backward compatibility — the old application code must work with the new schema, and the new code must work with the old schema (during the rolling deployment window).
- Multi-phase deployment — separate the migration into deploy new code + apply schema change + clean up old schema.
- Avoid exclusive locks — operations like ALTER TABLE ADD COLUMN in MySQL < 8.0 lock the entire table. Use online DDL tools (pt-online-schema-change, gh-ost) for large tables.
Common migration patterns:
- Add column — safe. Add with a DEFAULT or nullable. Deploy code that writes to it. Backfill old rows. Then add NOT NULL constraint.
- Remove column — first remove all code references, deploy, then drop the column.
- Rename column — never rename directly. Add new column → copy data → deploy code using new column → drop old column.
-- SAFE: Add nullable column (no lock, no downtime)
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT NULL;
-- Existing rows get NULL, existing code keeps working
-- SAFE: Rename column in 4 phases
-- Phase 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(200);
-- Phase 2: Dual-write (application writes to both columns)
UPDATE users SET full_name = name WHERE full_name IS NULL; -- backfill
-- Phase 3: Deploy code using full_name, stop writing to name
-- Phase 4: Drop old column
ALTER TABLE users DROP COLUMN name;
-- DANGEROUS: Direct rename (breaks all running code instantly)
-- ALTER TABLE users RENAME COLUMN name TO full_name; -- ❌ Don't!
-- SAFE: Add NOT NULL constraint in stages
-- Step 1: Add column as nullable
ALTER TABLE orders ADD COLUMN status VARCHAR(20);
-- Step 2: Backfill existing rows
UPDATE orders SET status = 'completed' WHERE status IS NULL;
-- Step 3: Add NOT NULL constraint (after all rows have values)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Large table migration with gh-ost (MySQL):
-- gh-ost creates a shadow table, copies data in background,
-- applies changes to shadow, then swaps atomically
-- No table locks during the copy process
GitHub built gh-ost (GitHub Online Schema Transmogrifier) because MySQL's ALTER TABLE would lock their tables with billions of rows for hours. gh-ost creates a shadow copy of the table, applies the schema change to the shadow, streams changes from the binary log to keep the shadow in sync, and atomically renames the tables when done. This allows schema changes on tables with 1+ billion rows with zero downtime and no read/write impact. They open-sourced it and it is now used by companies like Shopify and Booking.com.
Dropping a column before all application code stops referencing it — causes instant errors for any server still running old code during rolling deployment.
-- Step 1: Drop column from database
ALTER TABLE users DROP COLUMN legacy_status;
-- Step 2: Deploy new code that doesn't reference it
-- But during rolling deploy, old servers still running:
-- SELECT legacy_status FROM users; ← ERROR! Column doesn't exist!-- Step 1: Deploy code that no longer reads/writes legacy_status
-- (All servers now ignore the column)
-- Step 2: After full deployment, safely drop column
ALTER TABLE users DROP COLUMN legacy_status;
-- No code references it → no errorsWhat is the expand-and-contract migration pattern? How do you handle data backfilling for billions of rows?
Multi-tenant database design determines how you isolate data between customers (tenants) in a SaaS application. Three main approaches:
- Shared database, shared schema (row-level isolation) — all tenants share the same tables. A
tenant_idcolumn on every table distinguishes data. Simplest to manage, lowest cost, but requires careful access control and has the risk of data leaks. - Shared database, separate schemas — each tenant gets their own schema (namespace) within one database. Better isolation than row-level. Moderate management complexity. Schema changes must be applied to every tenant schema.
- Separate databases — each tenant gets their own database. Strongest isolation and easiest to customize per tenant. Highest operational cost — managing 1,000 databases is complex. Best for enterprise customers with strict compliance requirements.
Choosing the right approach:
- 10–10,000 tenants with moderate data → shared schema + tenant_id
- 100–500 enterprise tenants with compliance needs → separate schemas
- 10–100 large enterprise tenants → separate databases
-- APPROACH 1: Shared schema with tenant_id (most common)
CREATE TABLE tenants (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
plan VARCHAR(20) DEFAULT 'free'
);
CREATE TABLE projects (
id INT PRIMARY KEY AUTO_INCREMENT,
tenant_id INT NOT NULL,
name VARCHAR(200),
FOREIGN KEY (tenant_id) REFERENCES tenants(id),
INDEX idx_tenant (tenant_id) -- Every query filters by tenant
);
CREATE TABLE tasks (
id INT PRIMARY KEY AUTO_INCREMENT,
tenant_id INT NOT NULL,
project_id INT NOT NULL,
title VARCHAR(300),
FOREIGN KEY (tenant_id) REFERENCES tenants(id),
INDEX idx_tenant (tenant_id)
);
-- CRITICAL: Every query MUST include tenant_id
SELECT * FROM projects WHERE tenant_id = 42 AND status = 'active';
-- Row-Level Security (PostgreSQL) — enforce at DB level
CREATE POLICY tenant_isolation ON projects
USING (tenant_id = current_setting('app.tenant_id')::INT);
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
-- Set tenant context per request
SET app.tenant_id = '42';
SELECT * FROM projects; -- Only sees tenant 42's data automatically
-- APPROACH 2: Separate schemas
CREATE SCHEMA tenant_42;
CREATE TABLE tenant_42.projects (...);
CREATE TABLE tenant_42.tasks (...);
-- SET search_path = 'tenant_42';
Salesforce uses a shared schema with row-level tenant isolation (Organization ID on every table) to serve 150,000+ customers from a shared infrastructure. They patented their multi-tenant architecture. Slack started with shared schema + tenant_id but migrated their largest enterprise customers to separate database clusters for compliance and performance isolation. The hybrid approach — shared schema for standard plans, dedicated databases for enterprise — is becoming the industry standard.
Forgetting to filter by tenant_id in even one query, which causes a data leak between tenants — the most critical SaaS security bug.
-- Forgot tenant_id in the WHERE clause
SELECT * FROM invoices WHERE status = 'overdue';
-- Returns overdue invoices from ALL tenants!
-- Tenant A sees Tenant B's financial data-- Application always includes tenant_id
SELECT * FROM invoices WHERE tenant_id = 42 AND status = 'overdue';
-- Plus: PostgreSQL RLS as a safety net
CREATE POLICY tenant_policy ON invoices
USING (tenant_id = current_setting('app.tenant_id')::INT);
-- Even if application forgets, DB enforces isolationHow do you handle schema migrations across 10,000 tenant schemas? What about tenant-specific customizations?
A materialized view is like a regular view except it stores the query results on disk. Instead of re-executing the underlying query every time it is accessed, it returns the pre-computed results — making reads instantaneous.
Key differences from regular views:
- Regular view — virtual table, no data stored, query runs every time. Always shows current data.
- Materialized view — physical table, stores query results. Must be refreshed to get updated data. Reads are fast, but data can be stale between refreshes.
Refresh strategies:
REFRESH MATERIALIZED VIEW name— full refresh, re-executes the entire query. Blocks reads during refresh (in PostgreSQL, useCONCURRENTLYto avoid blocking).- Incremental refresh — only updates rows that changed since the last refresh (Oracle supports this natively with
ON COMMITrefresh). - Scheduled refresh — refresh via cron job or database scheduler (every hour, every night, etc.).
Best for: expensive aggregate queries (dashboards, reports), queries that don't need real-time data, denormalized read models, pre-computed rankings or statistics.
-- Create a materialized view (PostgreSQL)
CREATE MATERIALIZED VIEW mv_monthly_revenue AS
SELECT
DATE_TRUNC('month', order_date) AS month,
product_category,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
JOIN products ON orders.product_id = products.id
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date), product_category;
-- Query the materialized view (instant — reads from stored result)
SELECT * FROM mv_monthly_revenue
WHERE month >= '2024-01-01'
ORDER BY total_revenue DESC;
-- No JOINs, no aggregation at query time — just a table scan
-- Refresh (full): re-run the entire underlying query
REFRESH MATERIALIZED VIEW mv_monthly_revenue;
-- Refresh concurrently (no read blocking, requires unique index)
CREATE UNIQUE INDEX idx_mv_month_cat ON mv_monthly_revenue(month, product_category);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_revenue;
-- Check when it was last refreshed
SELECT relname, last_refresh
FROM pg_stat_user_tables
WHERE relname = 'mv_monthly_revenue';
-- Schedule refresh every hour (using pg_cron extension)
SELECT cron.schedule('refresh-revenue', '0 * * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_revenue');
-- Drop a materialized view
DROP MATERIALIZED VIEW IF EXISTS mv_monthly_revenue;
An analytics dashboard at a logistics company queried 500 million shipment records with 6 JOINs and multiple aggregations. Each page load took 45 seconds. Creating a materialized view that pre-computed the dashboard metrics reduced page load to 200ms. They refreshed it every 15 minutes using pg_cron. The trade-off was that dashboard data could be up to 15 minutes old — which was perfectly acceptable for trend analysis. For the real-time tracking page, they kept the live query.
Creating a materialized view and never refreshing it — the data becomes increasingly stale, and users don't realize they're looking at old data.
CREATE MATERIALIZED VIEW mv_stats AS SELECT ... ;
-- Created on Jan 1, never refreshed
-- On March 15, users still seeing January data
-- No automated refresh, no staleness indicator-- Create with concurrent refresh support
CREATE MATERIALIZED VIEW mv_stats AS SELECT ... ;
CREATE UNIQUE INDEX idx_mv ON mv_stats(...);
-- Schedule hourly refresh
SELECT cron.schedule('0 * * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_stats');
-- Show last refresh time in the dashboard UI
-- "Data as of: 2024-06-15 14:00 UTC (refreshes hourly)"When would you choose a materialized view over a cache (Redis)? How do you handle cache invalidation for materialized views?
Beyond basic indexing, advanced index optimization techniques can eliminate disk I/O entirely:
- Composite (multi-column) index — an index on two or more columns. Column order matters: the index can efficiently serve queries that filter on the leftmost prefix of columns. An index on
(A, B, C)can serve queries filtering on A, or A+B, or A+B+C — but NOT B alone or C alone. - Covering index — an index that contains all the columns needed by a query. The database can answer the query entirely from the index without touching the table data — called an index-only scan. This is the fastest possible read.
- Index selectivity — the ratio of distinct values to total rows. High selectivity (e.g., email — almost unique) = good index candidate. Low selectivity (e.g., gender — 2-3 values) = poor index candidate. A boolean column with 50/50 distribution has the worst selectivity.
- Partial (filtered) index — an index on a subset of rows matching a WHERE condition. Smaller index size, faster lookups for specific queries.
-- COMPOSITE INDEX: column order matters!
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
-- ✅ Uses the index (filters on leftmost column first)
SELECT * FROM orders WHERE status = 'pending' AND order_date > '2024-01-01';
-- ✅ Uses the index (leftmost prefix)
SELECT * FROM orders WHERE status = 'pending';
-- ❌ Cannot efficiently use this index (skips leftmost column)
SELECT * FROM orders WHERE order_date > '2024-01-01';
-- COVERING INDEX: all needed columns IN the index
CREATE INDEX idx_orders_covering ON orders(status, order_date, customer_id, amount);
SELECT customer_id, amount FROM orders
WHERE status = 'shipped' AND order_date > '2024-01-01';
-- Index-only scan: no table access needed — all columns are in the index!
-- EXPLAIN shows "Using index" or "Index Only Scan"
-- PostgreSQL: INCLUDE clause for non-searchable covering columns
CREATE INDEX idx_orders_cover ON orders(status, order_date) INCLUDE (customer_id, amount);
-- status, order_date = searchable; customer_id, amount = just stored for covering
-- INDEX SELECTIVITY: check before creating
SELECT
COUNT(DISTINCT status) AS distinct_vals,
COUNT(*) AS total_rows,
COUNT(DISTINCT status)::float / COUNT(*) AS selectivity
FROM orders;
-- status: 5 distinct / 1,000,000 rows = 0.000005 → very low selectivity
-- email: 999,500 distinct / 1,000,000 rows = 0.9995 → high selectivity ✅
-- PARTIAL INDEX: index only the rows that matter
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';
-- Only indexes pending orders (maybe 5% of the table)
-- Much smaller than indexing all orders
-- Perfect for: SELECT * FROM orders WHERE status='pending' AND order_date>...
-- Unused index detection (PostgreSQL)
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
A SaaS application had a slow dashboard query: SELECT customer_name, total, order_date FROM orders WHERE tenant_id = ? AND status = 'active' ORDER BY order_date DESC LIMIT 20. EXPLAIN showed a table scan despite having separate indexes on tenant_id and status. Creating a composite index (tenant_id, status, order_date DESC) with INCLUDE (customer_name, total) turned it into an index-only backward scan — query time dropped from 2.3 seconds to 3ms. The covering index meant zero table I/O.
Creating composite indexes with columns in the wrong order, making the index useless for the actual query patterns.
-- Index: (order_date, status)
CREATE INDEX idx ON orders(order_date, status);
-- Query filters on status first (not leftmost prefix!)
SELECT * FROM orders WHERE status = 'pending';
-- Cannot use this index efficiently → full table scan
-- Even worse: random index on low-selectivity column
CREATE INDEX idx_gender ON users(gender);
-- 2 values for 10M rows → index returns 5M rows → slower than table scan-- Index matches query: status first, then date for ordering
CREATE INDEX idx ON orders(status, order_date DESC);
SELECT * FROM orders WHERE status = 'pending' ORDER BY order_date DESC LIMIT 20;
-- Uses index efficiently: seek to status='pending', scan in date order, stop at 20What are expression indexes and how are they useful? How do you index JSON columns?
Query profiling is the process of analyzing why a query is slow and finding the optimal execution strategy. Key tools:
- EXPLAIN / EXPLAIN ANALYZE — shows the query execution plan: which indexes are used, join strategies, estimated vs actual row counts, and cost.
EXPLAINshows the plan without executing.EXPLAIN ANALYZEactually runs the query and shows real timing. - Slow query log — database log that captures queries exceeding a time threshold. MySQL:
slow_query_log+long_query_time. PostgreSQL:log_min_duration_statement. - Index hints — force or suggest the optimizer to use a specific index. MySQL:
USE INDEX,FORCE INDEX,IGNORE INDEX. PostgreSQL: no direct hint syntax — useSET enable_seqscan = offfor testing.
What to look for in EXPLAIN output:
- Seq Scan / Full Table Scan — no index used, scanning every row.
- Nested Loop vs Hash Join vs Merge Join — join strategies with different performance characteristics.
- Sort / filesort — data sorted on disk (expensive).
- Estimated vs actual rows — large discrepancy means stale statistics, run ANALYZE.
-- PostgreSQL: EXPLAIN ANALYZE with buffer info
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending' AND o.order_date > '2024-01-01'
ORDER BY o.order_date DESC
LIMIT 20;
-- Sample output:
-- Limit (cost=0.87..52.34 rows=20 actual time=0.04..0.12 rows=20)
-- → Nested Loop (cost=0.87..15234.56 rows=5890 actual time=0.04..0.11)
-- → Index Scan Backward on idx_orders_status_date (actual rows=20)
-- Index Cond: (status = 'pending' AND order_date > '2024-01-01')
-- → Index Scan on customers_pkey (actual rows=1 per loop)
-- Index Cond: (id = o.customer_id)
-- Planning Time: 0.2ms Execution Time: 0.15ms
-- MySQL: EXPLAIN showing problem
EXPLAIN SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- type: ALL (full table scan!)
-- Extra: Using where
-- Problem: YEAR() wraps the column → cannot use index
-- Fix: rewrite to be index-friendly
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- type: range key: idx_order_date rows: 50000 (vs 5M before)
-- MySQL: Slow query log configuration
-- SET GLOBAL slow_query_log = ON;
-- SET GLOBAL long_query_time = 1; -- log queries > 1 second
-- SET GLOBAL log_queries_not_using_indexes = ON;
-- PostgreSQL: log slow queries
-- postgresql.conf:
-- log_min_duration_statement = 500 -- log queries > 500ms
-- Update table statistics (fix bad estimates)
ANALYZE orders; -- PostgreSQL
ANALYZE TABLE orders; -- MySQL
-- MySQL: Force index usage
SELECT * FROM orders FORCE INDEX (idx_status_date)
WHERE status = 'pending' AND order_date > '2024-01-01';
A query on a 200-million-row table took 45 seconds. EXPLAIN ANALYZE revealed a sequential scan — the index existed but the optimizer chose not to use it because table statistics were stale (estimated 100 rows, actual 2 million). Running ANALYZE updated the statistics, and the optimizer switched to an index scan, dropping the query to 200ms. The team added ANALYZE to their weekly maintenance cron. They also enabled pg_stat_statements to continuously track the top 20 slowest queries, setting up alerts when any query exceeded 5 seconds.
Wrapping an indexed column in a function, preventing the database from using the index — called a non-sargable predicate.
-- Index on order_date, but YEAR() wraps it
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Full table scan! Optimizer cannot use index on order_date
-- Same problem: WHERE LOWER(email) = 'alice@mail.com'
-- Same problem: WHERE amount + tax > 1000-- Range comparison directly on the indexed column
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Index range scan → fast!
-- For LOWER(): create an expression index
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- Or: CREATE INDEX idx_email ON users((email::text) text_pattern_ops);What is pg_stat_statements and how do you use it to find the slowest queries in production?
Denormalization is the deliberate introduction of redundancy into a database schema to improve read performance. It trades write complexity (keeping redundant data in sync) for read speed (fewer JOINs, pre-computed values).
When to denormalize:
- Read-heavy workloads — when reads outnumber writes 100:1 or more (dashboards, reporting, product catalogs).
- Expensive JOINs — when a query requires 5+ JOINs across large tables, and denormalizing eliminates most of them.
- Pre-computed aggregates — storing counts, totals, or averages to avoid COUNT/SUM on millions of rows per request.
- Caching data from external services — storing a user's display name alongside their foreign key to avoid JOIN or API call.
Common denormalization techniques:
- Storing computed columns (full_name = first + last)
- Duplicating frequently-accessed columns in related tables
- Maintaining counter caches (comment_count on posts table)
- Creating summary/rollup tables
- Using materialized views (automated denormalization)
The cost: every write must update all denormalized copies. If they go out of sync, you have data inconsistency bugs.
-- NORMALIZED: correct but slow for dashboards
-- To get "product name, category name, and total orders":
SELECT p.name, c.name, COUNT(o.id)
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN order_items oi ON oi.product_id = p.id
JOIN orders o ON oi.order_id = o.id
WHERE o.status = 'completed'
GROUP BY p.name, c.name;
-- 4 JOINs + GROUP BY on 100M order_items → slow
-- DENORMALIZED: fast reads, redundant data
ALTER TABLE products ADD COLUMN category_name VARCHAR(100);
ALTER TABLE products ADD COLUMN total_orders INT DEFAULT 0;
ALTER TABLE products ADD COLUMN total_revenue DECIMAL(12,2) DEFAULT 0;
-- Dashboard query is now trivial:
SELECT name, category_name, total_orders, total_revenue
FROM products WHERE total_orders > 0 ORDER BY total_revenue DESC;
-- No JOINs, no aggregation → instant
-- Keep denormalized data in sync with triggers:
CREATE OR REPLACE FUNCTION update_product_stats()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
UPDATE products SET
total_orders = total_orders + 1,
total_revenue = total_revenue + NEW.amount
WHERE id = NEW.product_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_order_item_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_product_stats();
-- COUNTER CACHE example
-- Instead of: SELECT COUNT(*) FROM comments WHERE post_id = 42;
-- Store it: posts.comment_count (updated via trigger or application)
ALTER TABLE posts ADD COLUMN comment_count INT DEFAULT 0;
-- Update on insert
UPDATE posts SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
Stack Overflow denormalizes vote counts on their posts table — each post row stores its score (upvotes - downvotes) directly, rather than computing it from the votes table on every page load. With 50 million posts and 200 million votes, computing scores with COUNT/SUM per page load would be impractical. The denormalized score column is updated atomically on each vote. They accept the minimal risk of count drift (which they fix with periodic reconciliation jobs) in exchange for the massive read performance gain.
Denormalizing prematurely without measuring — adding complexity for writes when the normalized query was fast enough.
-- "JOINs are slow, let's denormalize everything!"
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
ALTER TABLE orders ADD COLUMN customer_phone VARCHAR(20);
-- Now every customer name change must update ALL their orders
-- Customer changes name → UPDATE orders SET customer_name=? WHERE customer_id=?
-- 50,000 orders for one customer → expensive write, possible inconsistency-- Step 1: Measure — is the JOIN actually slow?
EXPLAIN ANALYZE SELECT o.*, c.name FROM orders o JOIN customers c ON ...;
-- If it returns in 5ms → don't denormalize!
-- Step 2: If genuinely slow, denormalize only what's needed
ALTER TABLE orders ADD COLUMN customer_display_name VARCHAR(100);
-- Only the display name (not email, phone) — minimal redundancy
-- Accept that name changes won't backfill old orders (usually OK)What is CQRS (Command Query Responsibility Segregation) and how does it relate to denormalization?
OFFSET pagination uses LIMIT N OFFSET M to skip rows. Simple but has a fatal flaw: the database must read and discard all skipped rows. Page 1000 with 20 rows per page = read 20,000 rows, discard 19,980, return 20. Gets slower with every page.
Cursor-based (keyset) pagination uses a WHERE clause on the last seen value instead of OFFSET. It says "give me 20 rows after the last one I saw" instead of "skip N rows". Performance is constant regardless of page depth — always reads only the rows it returns.
Comparison:
- OFFSET: O(offset + limit) — linear degradation. Simple. Supports jumping to arbitrary page. Inconsistent if data changes between pages (duplicates/gaps).
- Cursor: O(limit) — constant performance. Requires a unique, ordered column. No arbitrary page jumps (only next/previous). Consistent even if data changes.
For APIs with infinite scroll or "Load More" buttons, cursor-based is the standard. For traditional page-numbered UIs, OFFSET is simpler but add a reasonable maximum page limit.
-- OFFSET PAGINATION (simple but slow at scale)
-- Page 1:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 0;
-- Page 2:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 20;
-- Page 500:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 9980;
-- Must scan 10,000 rows to return 20 → SLOW!
-- CURSOR-BASED PAGINATION (fast at any depth)
-- Page 1 (no cursor yet):
SELECT * FROM products ORDER BY id LIMIT 20;
-- Returns ids 1-20, last_id = 20
-- Page 2 (cursor = last seen id):
SELECT * FROM products WHERE id > 20 ORDER BY id LIMIT 20;
-- Seeks directly to id=21 via index, returns 20 rows → FAST!
-- Page 500 (cursor = last seen id from page 499):
SELECT * FROM products WHERE id > 9980 ORDER BY id LIMIT 20;
-- Same speed as page 1! Index seek to 9981 → 20 rows
-- Multi-column cursor (e.g., order by created_at, id for ties)
-- Last seen: created_at='2024-06-15', id=42
SELECT * FROM products
WHERE (created_at, id) > ('2024-06-15', 42)
ORDER BY created_at, id
LIMIT 20;
-- API response format for cursor pagination:
-- {
-- "data": [...20 products...],
-- "next_cursor": "eyJpZCI6NjAsImNyZWF0ZWRfYXQiOiIyMDI0LTA2LTE1In0=",
-- "has_more": true
-- }
-- The cursor is a Base64-encoded token containing the last row's sort values
-- OFFSET with total count (for page numbers UI):
SELECT COUNT(*) FROM products WHERE category = 'electronics';
-- Total: 5,000 → show pages 1-250
-- Cap at page 100 to prevent deep pagination abuse
Twitter's API switched from OFFSET to cursor-based pagination early on. With 500 million tweets per day, OFFSET pagination on a timeline query at page 10,000 would need to scan 200,000+ rows. Their cursor (since_id / max_id parameters) uses the tweet ID as a cursor — the query WHERE id < max_id ORDER BY id DESC LIMIT 20 always hits an index and returns in constant time. Slack, Stripe, GitHub, and most modern APIs use the same pattern, returning opaque cursor tokens instead of page numbers.
Using OFFSET for API endpoints that can be paginated deeply, causing database performance to degrade linearly.
-- API: GET /products?page=5000&per_page=20
SELECT * FROM products ORDER BY created_at DESC
LIMIT 20 OFFSET 99980;
-- Database reads 100,000 rows, discards 99,980, returns 20
-- Takes 15 seconds on a table with 10M rows
-- Attackers can DoS your DB by requesting page=999999-- API: GET /products?cursor=eyJpZCI6MTIzNDV9&per_page=20
-- Decode cursor → last_id = 12345
SELECT * FROM products WHERE id < 12345
ORDER BY id DESC LIMIT 20;
-- Index seek → 20 rows → constant time regardless of depth
-- No way to "skip ahead" → no deep pagination DoSHow do you implement bidirectional cursor pagination (next + previous page)?
Three infrastructure-level techniques that dramatically improve database performance without changing your queries:
- Connection pooling — maintaining a pool of pre-established database connections that are reused across requests. Creating a new TCP connection + SSL handshake + authentication per request takes 50-200ms. A pool keeps connections warm and hands them out instantly. Tools: PgBouncer (PostgreSQL), ProxySQL (MySQL), HikariCP (Java), built-in pools in ORMs.
- Prepared statements — pre-compile a query once, then execute it many times with different parameters. Benefits: (1) SQL injection prevention (parameters are never interpolated into SQL), (2) performance — the database parses and plans the query once, reuses the plan for subsequent executions.
- Query caching — storing query results in memory to serve repeated identical queries without hitting the database. Levels: (1) application cache (Redis/Memcached), (2) ORM query cache, (3) database-level cache (MySQL Query Cache — deprecated in 8.0 because of contention).
These three techniques combined can reduce database load by 80-90% in typical web applications.
-- CONNECTION POOLING with PgBouncer (PostgreSQL)
-- pgbouncer.ini:
-- [databases]
-- mydb = host=localhost port=5432 dbname=mydb
-- [pgbouncer]
-- pool_mode = transaction -- release connection after each transaction
-- max_client_conn = 1000 -- accept up to 1000 app connections
-- default_pool_size = 20 -- only 20 actual PostgreSQL connections
-- Result: 1000 app servers share 20 DB connections
-- Without pooling: 1000 app servers = 1000 DB connections = OOM
-- With pooling: 1000 app servers → PgBouncer → 20 DB connections = efficient
-- PREPARED STATEMENTS
-- PostgreSQL:
PREPARE get_user(INT) AS
SELECT id, name, email FROM users WHERE id = $1;
EXECUTE get_user(42);
EXECUTE get_user(99); -- Reuses the parsed/planned query
-- Application-level (Python/psycopg2):
-- cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
-- psycopg2 automatically uses server-side prepared statements
-- ❌ String interpolation (SQL injection + no plan reuse!)
-- cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
-- QUERY CACHING with Redis (application level)
-- Pseudocode:
-- key = "products:category:electronics:page:1"
-- result = redis.get(key)
-- if result is None:
-- result = db.query("SELECT * FROM products WHERE category='electronics' LIMIT 20")
-- redis.setex(key, 300, serialize(result)) -- cache for 5 minutes
-- return result
-- Cache invalidation: delete cache when data changes
-- When a product is updated:
-- redis.delete("products:category:electronics:*") -- invalidate related caches
-- Connection pool sizing formula:
-- pool_size = num_cores * 2 + effective_spindle_count
-- For SSD: pool_size = num_cores * 2 + 1
-- Example: 4-core server → pool_size = 9
-- More connections ≠ better! Too many → context switching overhead
A Node.js API serving 10,000 requests per second was opening a new PostgreSQL connection per request. Each connection took ~150ms to establish, and PostgreSQL maxed out at 100 concurrent connections (each consuming ~10MB RAM). Adding PgBouncer with a pool of 25 connections (transaction mode) eliminated connection overhead — API latency dropped from 200ms to 15ms. They also added Redis caching for the 20 most-hit API endpoints (product listings, category pages), which reduced database queries by 85%. The remaining 15% of queries used prepared statements, saving ~5ms per query in parse/plan time.
Opening a new database connection for every request instead of using a connection pool — burns resources and limits throughput.
# Python — connection per request
def handle_request(user_id):
conn = psycopg2.connect(host='db', dbname='mydb') # 150ms overhead!
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
result = cursor.fetchone()
conn.close() # Connection discarded after one query
return result
# 10,000 req/s = 10,000 connections opened + closed per second
# PostgreSQL max_connections = 100 → crash!# Python — connection pool
from psycopg2 import pool
db_pool = pool.ThreadedConnectionPool(5, 20, host='db', dbname='mydb')
def handle_request(user_id):
conn = db_pool.getconn() # Instant — reuse existing connection
try:
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
return cursor.fetchone()
finally:
db_pool.putconn(conn) # Return to pool, not closed
# 10,000 req/s share 20 connections → fast and efficientWhat is the difference between PgBouncer session mode, transaction mode, and statement mode? When would you use each?
Almost every relational database (PostgreSQL, MySQL InnoDB, SQL Server, Oracle) uses B+ trees as the default index data structure. Understanding the internals explains why certain operations are fast and others are slow.
B-tree vs B+ tree:
- B-tree — stores keys and data pointers in both internal nodes and leaf nodes. Any node can satisfy a lookup.
- B+ tree (what databases actually use) — stores keys in internal nodes (for routing) and all data pointers only in leaf nodes. Leaf nodes are linked as a doubly-linked list for efficient range scans.
B+ tree structure (typical InnoDB):
- Root node — always kept in memory. Contains keys that route searches left or right.
- Internal (branch) nodes — contain keys and pointers to child nodes. Each node fits in one disk page (typically 8KB or 16KB). A 16KB page can hold ~500 key-pointer pairs, so a 3-level tree indexes ~125 million rows.
- Leaf nodes — contain the actual indexed values and row pointers (or the full row data for clustered indexes). Linked in order for range scans.
Lookup: O(log n) disk reads — traverse root → internal → leaf. For 100 million rows with a branching factor of 500, that is only 3-4 levels = 3-4 disk pages read.
Page split — when inserting into a full leaf node, the database splits it into two nodes (each ~50% full), and pushes a new key up to the parent. If the parent is also full, it splits too (cascading up). This is why random key inserts (UUIDs) are expensive — they cause splits everywhere. Sequential keys (auto-increment) always append to the rightmost leaf — no splits.
-- Visualize B+ tree index structure (conceptual):
-- INDEX on employee.salary:
-- ROOT NODE (level 2, always in memory):
-- [ 30000 | 60000 | 90000 ]
-- | | | \
-- ↓ ↓ ↓ ↓
-- INTERNAL NODES (level 1):
-- [10k|20k] [40k|50k] [70k|80k] [100k|110k]
-- ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
-- LEAF NODES (level 0, linked list for range scan):
-- [10k,emp3][20k,emp7] ↔ [30k,emp1][40k,emp9] ↔ [50k,emp5]...
-- ^^^^^^^^ ^^^^^^^^
-- key + row pointer key + row pointer
-- Lookup: WHERE salary = 50000
-- 1. Root: 50000 >= 30000 and < 60000 → go to 2nd child
-- 2. Internal: 50000 >= 40000 and < 60000 → go to 2nd child
-- 3. Leaf: scan for 50000 → found! Follow row pointer to table
-- Total: 3 page reads (root cached → effectively 2 disk I/Os)
-- Range scan: WHERE salary BETWEEN 40000 AND 80000
-- 1. B+ tree lookup to find leaf with 40000
-- 2. Follow the leaf linked list: 40k → 50k → 60k → 70k → 80k
-- Sequential read, very fast!
-- PAGE SPLIT visualization:
-- Before INSERT salary=45000 into full leaf [40k, 42k, 44k, 46k, 48k]:
-- Leaf is full (max 5 keys per page)!
-- Step 1: Split into two leaves:
-- [40k, 42k, 44k] and [45k, 46k, 48k]
-- Step 2: Push middle key (45k) up to parent internal node
-- Step 3: Parent gets a new pointer to the new leaf
-- InnoDB page size (check/set):
SHOW VARIABLES LIKE 'innodb_page_size';
-- Default: 16384 (16KB)
-- PostgreSQL: inspect B-tree index structure
CREATE EXTENSION pageinspect;
SELECT * FROM bt_metap('idx_salary'); -- tree height, root page, etc.
SELECT * FROM bt_page_items('idx_salary', 1); -- items in page 1
-- Fragmentation check after many splits (SQL Server):
SELECT name, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED')
JOIN sys.indexes ON object_id = object_id AND index_id = index_id
WHERE avg_fragmentation_in_percent > 30;
-- Fix fragmentation:
ALTER INDEX idx_salary ON employees REBUILD; -- offline, full rebuild
ALTER INDEX idx_salary ON employees REORGANIZE; -- online, defragment
A fintech company inserted 20 million rows/day with UUID primary keys. InnoDB used a clustered B+ tree on the PK, and every insert went to a random leaf page — causing 10,000+ page splits per second. This fragmented the B+ tree so badly that range queries slowed from 50ms to 12 seconds over a month. The fix was switching to ULID (Universally Unique Lexicographically Sortable Identifier) — which is time-ordered, so inserts always append to the rightmost leaf. Page splits dropped to near zero, and they added a weekly OPTIMIZE TABLE to reclaim wasted space from the UUID era.
Not understanding that B+ tree leaf nodes are linked — leading to incorrect assumptions about range scan performance, or ignoring page splits caused by random keys.
CREATE TABLE events (
id CHAR(36) PRIMARY KEY DEFAULT (UUID()), -- random!
data TEXT,
ts DATETIME
);
-- Every INSERT goes to a random leaf in the B+ tree
-- Full leaves split → 50% wasted space per page
-- Tree becomes fragmented → range scans read scattered pages
-- InnoDB fill factor cannot be configured (always 15/16)CREATE TABLE events (
id BIGINT PRIMARY KEY AUTO_INCREMENT, -- sequential
uuid CHAR(36) UNIQUE, -- non-clustered index for lookups
data TEXT,
ts DATETIME
);
-- INSERTs always go to the rightmost leaf → no page splits
-- B+ tree stays compact and unfragmented
-- Alternative: ULID or UUIDv7 (time-ordered UUIDs)What are other index structures databases use — hash indexes, GiST, GIN, BRIN? When would you choose each over B+ tree?
Partitioning creates physically separate storage units (child tables or segments) behind a single logical table name. Understanding the internals explains why certain queries benefit and others don't.
Physical storage:
- Each partition is an independent heap or B+ tree on disk — with its own data pages, its own indexes, and its own statistics.
- In PostgreSQL, each partition is literally a child table with an implicit CHECK constraint. The parent table has no data of its own — it's a routing layer.
- In MySQL InnoDB, each partition gets its own
.ibdtablespace file (or subpartition within a shared tablespace). - In SQL Server, partitions are mapped to filegroups via a partition function (defines boundary values) and a partition scheme (maps partitions to filegroups).
Partition pruning (elimination): The query optimizer examines WHERE clauses at planning time and determines which partitions can possibly contain matching rows. Non-matching partitions are excluded from the execution plan entirely — they are never opened or scanned.
How pruning works internally:
- Parser extracts the partition key column and comparison value from WHERE.
- Optimizer compares the value against each partition's boundary metadata (CHECK constraints, range bounds, hash modulus).
- Non-matching partitions are removed from the plan — their indexes and data pages are never touched.
EXPLAINshows only the accessed partitions.
Pruning limitations: Only works when the partition key appears directly in the WHERE clause with a constant or parameter. Functions on the partition key (WHERE YEAR(date_col) = 2024) break pruning. JOINs where the partition key comes from another table usually prevent pruning (except with partition-wise joins in PostgreSQL 11+).
-- Create range-partitioned table (PostgreSQL)
CREATE TABLE sensor_readings (
id BIGSERIAL,
sensor_id INT NOT NULL,
value DOUBLE PRECISION,
recorded_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (recorded_at);
-- Each partition = a separate physical table with its own storage
CREATE TABLE readings_2024_q1 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE readings_2024_q2 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
CREATE TABLE readings_2024_q3 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
CREATE TABLE readings_2024_q4 PARTITION OF sensor_readings
FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
-- Each partition has its own index (LOCAL index)
CREATE INDEX ON readings_2024_q1 (sensor_id, recorded_at);
CREATE INDEX ON readings_2024_q2 (sensor_id, recorded_at);
-- Or: CREATE INDEX ON sensor_readings (sensor_id, recorded_at);
-- PostgreSQL auto-creates per-partition indexes
-- PARTITION PRUNING in action:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM sensor_readings
WHERE recorded_at BETWEEN '2024-04-15' AND '2024-06-30';
-- Output shows:
-- Append
-- → Seq Scan on readings_2024_q2 ← only Q2 scanned!
-- Partitions Q1, Q3, Q4 are completely pruned (not even opened)
-- MySQL: check which partitions are accessed
EXPLAIN PARTITIONS
SELECT * FROM sensor_readings WHERE recorded_at = '2024-05-15';
-- partitions: p_2024_q2 ← only 1 of 4 partitions
-- SQL Server: partition function + scheme internals
CREATE PARTITION FUNCTION pf_quarter (DATE)
AS RANGE RIGHT FOR VALUES ('2024-04-01', '2024-07-01', '2024-10-01');
-- Internally creates boundary array: [2024-04-01, 2024-07-01, 2024-10-01]
-- Binary search on boundaries to find target partition → O(log p)
CREATE PARTITION SCHEME ps_quarter
AS PARTITION pf_quarter TO (fg_q1, fg_q2, fg_q3, fg_q4);
-- Maps partition numbers to physical filegroups
-- ❌ PRUNING BREAKS with function on partition key:
SELECT * FROM sensor_readings WHERE EXTRACT(YEAR FROM recorded_at) = 2024;
-- Cannot prune! Must scan ALL partitions
-- Fix: rewrite as range
SELECT * FROM sensor_readings
WHERE recorded_at >= '2024-01-01' AND recorded_at < '2025-01-01';
-- Pruning works → only 2024 partitions scanned
-- Check partition sizes individually
SELECT
child.relname AS partition_name,
pg_size_pretty(pg_relation_size(child.oid)) AS size,
pg_stat_get_live_tuples(child.oid) AS row_count
FROM pg_inherits
JOIN pg_class parent ON inhparent = parent.oid
JOIN pg_class child ON inhrelid = child.oid
WHERE parent.relname = 'sensor_readings'
ORDER BY child.relname;
A telecom company stored 5 years of call detail records (CDR) — 12 billion rows in one table. Queries for monthly billing always scanned the full table (45 minutes). After range-partitioning by month (60 partitions), a billing query for one month only touched 1 partition (~200M rows, 4 minutes). EXPLAIN confirmed pruning: 59 of 60 partitions eliminated. The biggest surprise was maintenance: dropping a 2-year-old partition took 20ms (vs. DELETE WHERE date < ... which would have taken 8 hours and generated 400GB of WAL). They also discovered that per-partition VACUUM and ANALYZE ran 60x faster than on the monolithic table.
Applying functions to the partition key in WHERE clauses, which prevents the optimizer from pruning partitions — defeating the entire purpose of partitioning.
-- Table partitioned by recorded_at (monthly)
-- This query wraps the key in a function:
SELECT * FROM sensor_readings
WHERE DATE_TRUNC('month', recorded_at) = '2024-06-01';
-- Optimizer cannot compare '2024-06-01' against partition boundaries!
-- Must scan ALL 60 partitions → slower than unpartitioned!
EXPLAIN shows: Append → scan on readings_2024_01, readings_2024_02, ...
-- All 60 partitions listed → no pruning-- Same intent, but sargable range comparison:
SELECT * FROM sensor_readings
WHERE recorded_at >= '2024-06-01' AND recorded_at < '2024-07-01';
-- Optimizer binary-searches partition boundaries
-- Finds: only readings_2024_06 matches → prune all others
EXPLAIN shows: Append → scan on readings_2024_06 only
-- 1 of 60 partitions → 60x fasterWhat is partition-wise JOIN and partition-wise aggregation in PostgreSQL? How do they parallelize queries across partitions?
ORDER BY is one of the most expensive SQL operations because the database must arrange all result rows in a specific order before returning them. The internal sort strategy depends on the data size relative to available memory.
Three sorting strategies databases use:
- Index scan (no sort needed) — if an index already stores data in the requested order, the database reads the index in order and avoids sorting entirely. This is the fastest. For example,
ORDER BY idon a table with a clustered index onidrequires zero sorting. Also works for non-clustered indexes that match the ORDER BY columns. - In-memory quicksort — when the data fits in the
work_mem(PostgreSQL) orsort_buffer_size(MySQL) memory budget, the database sorts it entirely in RAM using a variant of quicksort. Fast, but limited by available memory. PostgreSQL default work_mem is only 4MB. - External merge sort (disk-based) — when data exceeds memory, the database uses a multi-phase external sort:
- Run generation: read chunks that fit in memory, sort each chunk (a "run"), write each sorted run to a temp file on disk.
- Merge phase: merge all sorted runs using a k-way merge (priority queue / min-heap). May require multiple merge passes if there are too many runs.
- This is O(n log n) with disk I/O — much slower than in-memory sort.
EXPLAIN tells you which strategy was used:
- PostgreSQL:
Sort Method: quicksort Memory: 25kB(in-memory) vsSort Method: external merge Disk: 45000kB(spilled to disk). - MySQL:
Using filesortin EXPLAIN means the optimizer couldn't use an index for ordering (but may still be in-memory — the name is misleading).
Top-N optimization: For ORDER BY ... LIMIT N, databases use a heap sort / top-N sort — they maintain a min-heap of size N, scanning all rows but only keeping the top N. Memory usage is O(N) regardless of table size. This is why ORDER BY created_at DESC LIMIT 10 is fast even on huge tables (especially with an index).
-- STRATEGY 1: Index eliminates sort entirely
CREATE INDEX idx_created ON orders(created_at DESC);
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;
-- Index Scan Backward on idx_created
-- Sort Method: none — index provides the order!
-- Execution Time: 0.5ms
-- STRATEGY 2: In-memory quicksort (small result set)
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42 ORDER BY amount DESC;
-- Sort (cost=...) Sort Key: amount DESC
-- Sort Method: quicksort Memory: 25kB ← fits in work_mem
-- Execution Time: 3ms
-- STRATEGY 3: External merge sort (large result set)
SET work_mem = '4MB'; -- default PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC;
-- Sort (cost=...)
-- Sort Method: external merge Disk: 45000kB ← spilled to disk!
-- Execution Time: 12000ms (12 seconds!)
-- Fix: increase work_mem for this query
SET work_mem = '256MB';
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC;
-- Sort Method: quicksort Memory: 52000kB ← now fits in RAM
-- Execution Time: 800ms (15x faster!)
-- ⚠️ work_mem is per-sort-operation, not per-query!
-- A query with 5 sorts uses 5 × work_mem
-- TOP-N HEAP SORT optimization
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC LIMIT 10;
-- Sort Sort Key: amount DESC
-- Sort Method: top-N heapsort Memory: 26kB ← only keeps 10 rows!
-- Even on a 100M row table, only 10 rows in memory
-- MySQL: "Using filesort" (misleading name)
EXPLAIN SELECT * FROM orders ORDER BY amount;
-- Extra: Using filesort
-- Does NOT necessarily mean disk — it means "not using an index for sort"
-- Could be in-memory sort_buffer or on-disk merge sort
-- MySQL: check sort behavior
SHOW STATUS LIKE 'Sort_merge_passes';
-- Sort_merge_passes > 0 → external merge sort happened → increase sort_buffer_size
-- CREATE INDEX to eliminate sorts for common queries:
-- Before: ORDER BY customer_id, order_date DESC
CREATE INDEX idx_cust_date ON orders(customer_id, order_date DESC);
-- Now the index provides the order → no sort at all
-- Multi-column sort: index must match EXACT column order AND direction
-- ORDER BY a ASC, b DESC → needs INDEX(a ASC, b DESC)
-- INDEX(a, b) only optimizes ORDER BY a ASC, b ASC
An analytics platform ran nightly reports sorting 200 million rows by revenue. With PostgreSQL's default work_mem of 4MB, the sort spilled to disk generating 48GB of temp files — the report took 45 minutes and filled the disk. Increasing work_mem to 512MB for the reporting session (SET LOCAL work_mem = '512MB' inside a transaction) kept the sort in RAM, cutting the report to 3 minutes. For their top-10 dashboard queries, adding a covering index that matched the ORDER BY eliminated sorting entirely — response time dropped from 2 seconds to 5ms. The team also discovered that MySQL's "Using filesort" warning was misleading: most of their small sorts were actually in-memory, and only the nightly batch was genuinely hitting disk.
Not checking EXPLAIN output for sort spills to disk — a common hidden performance killer. Also, creating an index with wrong column direction that doesn't match the ORDER BY.
-- Default work_mem = 4MB, sorting 50 million rows
SELECT * FROM transactions ORDER BY amount DESC;
-- EXPLAIN ANALYZE:
-- Sort Method: external merge Disk: 120000kB
-- Execution Time: 95000ms (95 seconds!)
-- Wrote 120MB of temp files to disk
-- Also bad: index direction mismatch
CREATE INDEX idx_amount ON transactions(amount ASC);
SELECT * FROM transactions ORDER BY amount DESC;
-- Cannot use index backward efficiently (database-dependent)
-- Falls back to full sort-- Option 1: Index matches ORDER BY direction exactly
CREATE INDEX idx_amount_desc ON transactions(amount DESC);
SELECT * FROM transactions ORDER BY amount DESC LIMIT 100;
-- Index Scan → no sort at all → 2ms
-- Option 2: Increase memory for batch queries
BEGIN;
SET LOCAL work_mem = '256MB'; -- only for this transaction
SELECT * FROM transactions ORDER BY amount DESC;
-- Sort Method: quicksort Memory: 180000kB (in RAM)
-- Execution Time: 3500ms (27x faster than disk sort)
COMMIT; -- work_mem resets to defaultHow does GROUP BY work internally? Does it also use sorting or does it use hash aggregation? When does the optimizer choose one over the other?
When you write a JOIN, the database must choose a physical join algorithm to combine the rows. The optimizer picks the best one based on table sizes, available indexes, data distribution, and memory. Three main algorithms:
1. Nested Loop Join:
- For each row in the outer (driving) table, scan the inner table to find matching rows.
- Complexity: O(N × M) without indexes, O(N × log M) with an index on the inner table's join column.
- Best when: the outer table is small AND there's an index on the inner table's join column. The optimizer will pick the smaller table as the outer.
- Variation — Index Nested Loop: uses an index seek on the inner table for each outer row, making it very efficient for small outer sets.
2. Hash Join:
- Build phase: scan the smaller table, build an in-memory hash table keyed on the join column.
- Probe phase: scan the larger table, hash each row's join column, probe the hash table for matches.
- Complexity: O(N + M) — scans each table exactly once. The hash table must fit in memory (work_mem).
- Best when: both tables are large, no useful indexes, and the smaller table's hash table fits in memory. Cannot handle inequality joins (only equi-joins:
=). - If the hash table exceeds memory, the database uses Grace Hash Join — partitions both tables into batches, spills to disk, then joins each batch.
3. (Sort-)Merge Join:
- Both tables must be sorted by the join column. If not already sorted (e.g., from an index), the database sorts them first.
- Then it "merges" them in one pass — two pointers walk through both sorted sets simultaneously.
- Complexity: O(N log N + M log M) for sorting + O(N + M) for the merge.
- Best when: both tables are already sorted by the join key (from an index or a prior ORDER BY), or when the result needs to be ordered anyway.
-- NESTED LOOP: small outer table + indexed inner table
-- Optimizer uses this when one side is small
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM employees e -- 10 rows (filtered)
JOIN departments d ON e.dept_id = d.id -- 50 rows, index on d.id
WHERE e.salary > 100000;
-- Nested Loop (cost=... rows=10)
-- → Seq Scan on employees (Filter: salary > 100000) → 10 rows
-- → Index Scan on departments using dept_pkey → 1 row per loop
-- Total: 10 index lookups into departments → fast!
-- HASH JOIN: two large tables, no index
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o -- 5,000,000 rows
JOIN customers c ON o.customer_id = c.id; -- 500,000 rows
-- Hash Join (cost=...)
-- Hash Cond: (o.customer_id = c.id)
-- → Seq Scan on orders -- probe side (larger)
-- → Hash -- build side (smaller)
-- → Seq Scan on customers -- builds hash table
-- Build: hash 500K customers into memory (~40MB)
-- Probe: scan 5M orders, look up each customer_id in hash table
-- Total: one pass through each table → O(N + M)
-- Check hash memory usage in EXPLAIN ANALYZE:
-- Hash (Buckets: 65536 Batches: 1 Memory Usage: 38000kB)
-- Batches > 1 means hash spilled to disk (needs more work_mem)
-- MERGE JOIN: both tables sorted or have matching indexes
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM table_a a
JOIN table_b b ON a.sort_key = b.sort_key
ORDER BY a.sort_key;
-- Merge Join (cost=...)
-- Merge Cond: (a.sort_key = b.sort_key)
-- → Index Scan on table_a using idx_a_sort_key
-- → Index Scan on table_b using idx_b_sort_key
-- Both already sorted by index → merge in one pass
-- Force a specific join strategy (PostgreSQL):
SET enable_hashjoin = off; -- disable hash join
SET enable_nestloop = off; -- disable nested loop
SET enable_mergejoin = off; -- disable merge join
-- (for testing only — let the optimizer decide in production)
-- MySQL: JOIN hints
SELECT /*+ HASH_JOIN(o, c) */ o.id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id;
-- MySQL 8.0.18+ supports hash joins for equi-joins
-- SQL Server: JOIN hints
SELECT o.id, c.name
FROM orders o
INNER MERGE JOIN customers c ON o.customer_id = c.id;
A reporting query joining orders (50M rows) with products (100K rows) was taking 30 minutes. EXPLAIN showed a Nested Loop Join — the optimizer misjudged the outer table size because table statistics were stale (estimated 100 rows, actual 50 million). Running ANALYZE updated statistics, and the optimizer switched to a Hash Join — build a hash table from 100K products (8MB in memory), probe with 50M orders in one pass. Query time dropped from 30 minutes to 45 seconds. The team learned to run ANALYZE after bulk data loads.
Stale table statistics causing the optimizer to choose the wrong join algorithm — usually picking Nested Loop when Hash Join would be orders of magnitude faster.
-- Statistics say orders has 100 rows (stale!)
-- Actual: 50 million rows
EXPLAIN ANALYZE SELECT * FROM orders o JOIN products p ON o.product_id = p.id;
-- Nested Loop (estimated rows=100, actual rows=50000000)
-- For each of 50M orders → index lookup into products
-- 50 million index lookups → 30+ minutes!ANALYZE orders; -- update statistics
ANALYZE products;
EXPLAIN ANALYZE SELECT * FROM orders o JOIN products p ON o.product_id = p.id;
-- Hash Join (estimated rows=50000000, actual rows=50000000) ✓
-- Build hash on products (100K rows, 8MB)
-- Probe with orders (50M rows, one pass)
-- 45 seconds instead of 30 minutes!What is a hash spill and how does Grace Hash Join handle hash tables that exceed memory?
GROUP BY must collect all rows with the same group key and compute aggregate functions (SUM, COUNT, AVG, etc.) for each group. Internally, databases use two strategies:
1. Hash Aggregation:
- Builds an in-memory hash table keyed on the GROUP BY columns.
- For each input row, hash the group key → find or create the hash entry → update the running aggregate (increment count, add to sum, etc.).
- After processing all rows, each hash entry contains the final aggregate result.
- Complexity: O(N) — single pass through the data. Memory: O(G) where G = number of distinct groups.
- Best when: the number of groups is moderate and fits in work_mem. Cannot produce sorted output.
- If the hash table exceeds memory → spills to disk using partitioned (Grace) hashing.
2. Sort-Based Aggregation (GroupAggregate):
- First, sort all rows by the GROUP BY columns (using quicksort or external merge sort).
- Then, scan the sorted data sequentially — all rows in the same group are adjacent, so the database reads each group in one pass and computes aggregates.
- Complexity: O(N log N) for sorting + O(N) for the aggregation pass.
- Best when: data is already sorted by an index on the GROUP BY column (no sort needed → O(N)), or when the output needs to be in GROUP BY order (ORDER BY same columns), or when there are too many groups for a hash table to fit in memory.
Optimizer decision: PostgreSQL's optimizer estimates the number of distinct groups. Few groups → Hash Aggregate (small hash table). Many groups with no index → Hash Aggregate if it fits in memory, otherwise Sort + GroupAggregate. Data already sorted by index → GroupAggregate (free sort).
-- HASH AGGREGATE: small number of groups
EXPLAIN ANALYZE
SELECT status, COUNT(*), SUM(amount)
FROM orders
GROUP BY status;
-- HashAggregate (cost=...) (actual rows=5)
-- Group Key: status
-- Batches: 1 Memory Usage: 24kB ← 5 groups fit easily
-- → Seq Scan on orders
-- Hash table: {pending: {count:1200, sum:45000}, shipped: {...}, ...}
-- Single pass through orders → O(N)
-- SORT-BASED (GroupAggregate): data already sorted by index
CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*), AVG(amount)
FROM orders
GROUP BY customer_id
ORDER BY customer_id;
-- GroupAggregate (cost=...)
-- Group Key: customer_id
-- → Index Scan on idx_orders_customer
-- Data comes from index already sorted by customer_id
-- No sort needed! Read sequentially, aggregate each group → O(N)
-- HASH SPILL to disk: too many groups for memory
SET work_mem = '4MB';
EXPLAIN ANALYZE
SELECT user_id, COUNT(*)
FROM page_views -- 500M rows, 10M distinct user_ids
GROUP BY user_id;
-- HashAggregate
-- Batches: 16 Memory Usage: 4MB Disk Usage: 230MB ← spilled!
-- Hash table partitioned into 16 batches, each processed separately
-- Much slower than in-memory hash
-- Fix: increase work_mem
SET work_mem = '512MB';
-- HashAggregate Batches: 1 Memory Usage: 380MB ← all in memory now
-- When optimizer switches strategy automatically:
-- Few groups (< 10K) → HashAggregate almost always
-- Many groups + matching index → GroupAggregate (sorted by index)
-- Many groups + no index + enough memory → HashAggregate
-- Many groups + no index + insufficient memory → Sort + GroupAggregate
-- Mixed strategy with partial aggregation (parallel query):
EXPLAIN ANALYZE
SELECT department, AVG(salary) FROM employees GROUP BY department;
-- Partial HashAggregate (worker 1: partial sums per dept)
-- Partial HashAggregate (worker 2: partial sums per dept)
-- Finalize HashAggregate (combine partial results from all workers)
-- Parallelism splits the work, each worker builds its own hash table
-- MySQL: only uses temporary table + filesort for GROUP BY
-- (no hash aggregation in MySQL < 8.0.27)
EXPLAIN SELECT status, COUNT(*) FROM orders GROUP BY status;
-- Extra: Using temporary; Using filesort
A dashboard query grouped 200 million log entries by 15 million distinct user IDs. With default work_mem (4MB), PostgreSQL used HashAggregate that spilled into 64 disk batches — taking 12 minutes. Increasing work_mem to 1GB allowed the entire hash table to fit in memory (580MB for 15M entries), reducing the query to 90 seconds. However, for a different query grouping by just 5 status values, the optimizer correctly chose HashAggregate with a tiny 24KB hash table — proving that the strategy depends on the number of distinct groups, not the table size.
Not realizing that GROUP BY can be eliminated entirely if a matching index exists — or not understanding why adding an ORDER BY that matches the GROUP BY can actually switch the optimizer to a faster plan.
-- 10 million distinct customer_ids, work_mem = 64MB
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
-- HashAggregate Batches: 8 Disk: 450MB ← spilled to disk!
-- Processes 8 batches sequentially from disk → 8 minutes
-- No index on customer_id, so GroupAggregate would need a full sort tooCREATE INDEX idx_orders_cust ON orders(customer_id);
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
-- GroupAggregate → Index Scan on idx_orders_cust
-- Data arrives already sorted → no hash table, no sort, no disk spill
-- Sequential scan through index, aggregate each group → 45 seconds
-- Bonus: ORDER BY customer_id is free (already sorted)What is partial aggregation in parallel queries? How does the Finalize step combine results from multiple workers?
B+ tree is the default index, but it's not optimal for every data type or query pattern. Specialized index types exist for specific use cases:
1. Hash Index:
- Stores a hash of each key value. Supports only equality lookups (
=). Cannot do range scans (<,>,BETWEEN,ORDER BY). - O(1) lookup vs B+ tree O(log N). Faster for exact-match lookups on high-cardinality columns.
- PostgreSQL: crash-safe since v10. MySQL InnoDB: uses adaptive hash index internally (automatic, not user-created).
- Use when: equality-only lookups (e.g., session tokens, UUIDs). Avoid when range queries are needed.
2. GIN (Generalized Inverted Index):
- Designed for values that contain multiple elements — arrays, JSONB, full-text search (tsvector).
- Internally: maps each element/word → list of row IDs that contain it (inverted index, like a search engine).
- Supports operators:
@>(contains),<@(contained by),&&(overlap),@@(full-text match). - Use when: searching inside arrays, JSONB fields, or full-text search. Not for scalar equality/range.
- Downside: slow to update (each INSERT may update many inverted entries). Use
fastupdatefor batch-inserts.
3. GiST (Generalized Search Tree):
- A balanced tree that supports overlapping keys — used for geometric data, ranges, and nearest-neighbor searches.
- Supports operators:
&&(overlap),@>(contains),<@(within),<->(nearest neighbor/KNN). - Use when: PostGIS spatial queries (find all points within a polygon), range types (overlapping date ranges), or trigram text search (
pg_trgm).
4. BRIN (Block Range Index):
- Extremely small index that stores min/max values for each block range (group of consecutive disk pages, default 128 pages).
- Works when data is physically correlated with the indexed column — e.g., timestamps in an append-only log table (rows are inserted in time order).
- Lookup: check which block ranges could contain the value, scan only those blocks.
- Use when: very large tables with naturally ordered data (time-series, log tables). Index is 1000x smaller than B+ tree.
- Terrible if data is randomly distributed (every block range matches → full scan).
-- HASH INDEX: equality-only lookups
CREATE INDEX idx_session_hash ON sessions USING hash (session_token);
-- O(1) lookup:
SELECT * FROM sessions WHERE session_token = 'abc123xyz'; -- instant!
-- ❌ Cannot do: WHERE session_token > 'abc' (no range support)
-- ❌ Cannot do: ORDER BY session_token (no ordering)
-- GIN INDEX: search inside JSONB
CREATE INDEX idx_tags_gin ON products USING gin (tags);
-- products.tags = '{"color": "red", "size": "large", "material": "cotton"}'
SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- GIN: look up "color:red" → list of matching row IDs → fast!
-- GIN for full-text search
ALTER TABLE articles ADD COLUMN search_vector tsvector;
UPDATE articles SET search_vector = to_tsvector('english', title || ' ' || body);
CREATE INDEX idx_fts ON articles USING gin (search_vector);
SELECT title FROM articles WHERE search_vector @@ to_tsquery('database & optimization');
-- GIN inverted index: "database" → [row 5, 12, 45], "optimization" → [row 12, 45, 78]
-- Intersection: [12, 45] → 2 results, instant!
-- GIN for array containment
CREATE INDEX idx_skills ON employees USING gin (skills);
-- skills = ARRAY['python', 'sql', 'docker']
SELECT * FROM employees WHERE skills @> ARRAY['sql', 'docker'];
-- GiST INDEX: spatial and range queries
CREATE INDEX idx_location ON stores USING gist (coordinates);
-- Find stores within 5km of a point (PostGIS):
SELECT name FROM stores
WHERE ST_DWithin(coordinates, ST_MakePoint(-73.98, 40.75)::geography, 5000);
-- GiST: nearest-neighbor search (KNN):
SELECT name, coordinates <-> ST_MakePoint(-73.98, 40.75) AS distance
FROM stores ORDER BY coordinates <-> ST_MakePoint(-73.98, 40.75) LIMIT 10;
-- GiST for overlapping date ranges
CREATE INDEX idx_booking_range ON bookings USING gist (daterange(check_in, check_out));
SELECT * FROM bookings
WHERE daterange(check_in, check_out) && daterange('2024-06-01', '2024-06-15');
-- Find all bookings that overlap with June 1-15
-- BRIN INDEX: huge time-series tables
CREATE INDEX idx_ts_brin ON sensor_data USING brin (recorded_at)
WITH (pages_per_range = 64);
-- Table: 2 billion rows, 500GB
-- B+ tree index on recorded_at: 45GB
-- BRIN index on recorded_at: 50MB (900x smaller!)
SELECT * FROM sensor_data
WHERE recorded_at BETWEEN '2024-06-01' AND '2024-06-02';
-- BRIN checks: which 64-page ranges have min/max overlapping June 1-2?
-- Answer: block ranges 45000-45100 → scan only those blocks
-- Compare index sizes:
SELECT pg_size_pretty(pg_relation_size('idx_ts_btree')) AS btree_size,
pg_size_pretty(pg_relation_size('idx_ts_brin')) AS brin_size;
-- btree: 45 GB | brin: 50 MB
A food delivery app used PostGIS with GiST indexes on restaurant locations to answer "find restaurants within 3km" in 2ms — serving 50,000 searches per minute. Their menu search used GIN indexes on tsvector columns for full-text search ("spicy chicken near me"), matching across 2 million menu items in 5ms. For their analytics warehouse (180 billion event rows), they replaced a 2TB B+ tree index on event_timestamp with a 200MB BRIN index — queries were only 20% slower but the storage savings of 1.8TB allowed them to keep 2 more years of data on the same hardware.
Using a B+ tree index for full-text search or JSONB containment queries — it simply cannot support those operators.
CREATE INDEX idx_tags ON products (tags); -- default B-tree
SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- ERROR or full table scan!
-- B-tree can only do: tags = 'exact_json_value' (useless)
-- Also wrong: B-tree for text search
CREATE INDEX idx_body ON articles (body);
SELECT * FROM articles WHERE body LIKE '%database%';
-- B-tree cannot use leading wildcard → full scanCREATE INDEX idx_tags ON products USING gin (tags);
SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- GIN inverted index → instant lookup!
-- Full-text search with GIN
CREATE INDEX idx_fts ON articles USING gin (to_tsvector('english', body));
SELECT * FROM articles WHERE to_tsvector('english', body) @@ to_tsquery('database');
-- GIN inverted index on words → fast!What are bloom indexes and when would you use them? How does the adaptive hash index work in InnoDB?
A deadlock occurs when two or more transactions are waiting for each other to release locks, creating a circular dependency where none can proceed. Example: Transaction A holds lock on row 1 and waits for row 2, while Transaction B holds lock on row 2 and waits for row 1 — neither can finish.
Four necessary conditions for deadlock (Coffman conditions):
- Mutual exclusion — rows/resources can be locked exclusively.
- Hold and wait — a transaction holds locks while waiting for more.
- No preemption — locks cannot be forcibly taken from a transaction.
- Circular wait — a cycle exists in the wait-for graph.
How databases detect deadlocks:
- Wait-for graph (WFG) — the database maintains a directed graph where nodes are transactions and edges represent "is waiting for". Periodically (or on each wait), the database checks for cycles in this graph. A cycle = deadlock.
- InnoDB — checks the WFG immediately when a transaction starts waiting. Detection is instant. Chooses the youngest (least work done) transaction as the victim and rolls it back.
- PostgreSQL — checks the WFG after a configurable wait timeout (
deadlock_timeout, default 1 second). If a cycle is found, one transaction is aborted withERROR: deadlock detected. - SQL Server — background thread checks every 5 seconds. Victim is the transaction with the lowest deadlock priority or least log written.
Prevention strategies:
- Always acquire locks in a consistent order (e.g., always lock lower ID first).
- Keep transactions short — less time holding locks = smaller deadlock window.
- Use appropriate isolation levels — READ COMMITTED produces fewer locks than SERIALIZABLE.
- Retry on deadlock — the application should catch the deadlock error and retry the transaction.
-- CLASSIC DEADLOCK scenario:
-- Transaction A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- locks row 1
-- ... some processing ...
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- WAITS for row 2!
-- Transaction B (concurrent):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- locks row 2
-- ... some processing ...
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- WAITS for row 1!
-- DEADLOCK! A waits for B (row 2), B waits for A (row 1)
-- Database detects cycle → kills one transaction:
-- ERROR: deadlock detected
-- DETAIL: Process 1234 waits for ShareLock on row 1;
-- blocked by process 5678.
-- Process 5678 waits for ShareLock on row 2;
-- blocked by process 1234.
-- FIX: Always lock in consistent order (lower ID first)
-- Transaction A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- lock 1 first
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- then lock 2
COMMIT;
-- Transaction B (also locks in same order):
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- lock 1 first
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- then lock 2
COMMIT;
-- No circular wait possible → no deadlock!
-- APPLICATION: retry on deadlock (Python/psycopg2)
-- import psycopg2
-- MAX_RETRIES = 3
-- for attempt in range(MAX_RETRIES):
-- try:
-- with conn.cursor() as cur:
-- cur.execute("BEGIN")
-- cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (from_id,))
-- cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (to_id,))
-- cur.execute("COMMIT")
-- break # success
-- except psycopg2.errors.DeadlockDetected:
-- conn.rollback()
-- if attempt == MAX_RETRIES - 1:
-- raise # give up after 3 retries
-- PostgreSQL: tune deadlock detection
SHOW deadlock_timeout; -- default: 1s
-- Lower values detect deadlocks faster but increase CPU overhead
-- SET deadlock_timeout = '500ms';
-- InnoDB: check recent deadlocks
SHOW ENGINE INNODB STATUS;
-- Look for "LATEST DETECTED DEADLOCK" section
-- Shows which transactions, which rows, which was rolled back
-- PostgreSQL: log all deadlocks
-- postgresql.conf: log_lock_waits = on
-- Logs: "process 1234 still waiting for ShareLock after 1000ms"
-- Monitor deadlock frequency:
-- PostgreSQL:
SELECT deadlocks FROM pg_stat_database WHERE datname = current_database();
-- MySQL:
SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';
An e-commerce checkout system experienced 50-100 deadlocks per hour during peak sales. Two concurrent transactions both updated inventory (lock product row) and inserted order items (lock order row) but in different orders depending on the code path. The fix was standardizing the lock order: always lock inventory rows sorted by product_id ASC, then insert order items. Deadlocks dropped to near zero. They also added a retry wrapper with exponential backoff — on the rare deadlock, the transaction retried after 100ms, transparent to the user. Monitoring via pg_stat_database.deadlocks and alerting at >10/hour caught regressions early.
Locking resources in inconsistent order across different code paths — the #1 cause of deadlocks in production applications.
-- transfer_money(from=1, to=2):
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- lock 1
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- lock 2
-- transfer_money(from=2, to=1): (concurrent)
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- lock 2
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- lock 1
-- Opposite order! → deadlock when both run concurrently-- Always lock LOWER id first, regardless of transfer direction
-- transfer_money(from_id, to_id):
-- first_id = min(from_id, to_id)
-- second_id = max(from_id, to_id)
-- transfer_money(from=1, to=2):
UPDATE accounts SET balance = ... WHERE id = 1; -- lock min(1,2)=1
UPDATE accounts SET balance = ... WHERE id = 2; -- lock max(1,2)=2
-- transfer_money(from=2, to=1):
UPDATE accounts SET balance = ... WHERE id = 1; -- lock min(2,1)=1
UPDATE accounts SET balance = ... WHERE id = 2; -- lock max(2,1)=2
-- Same order → no cycle → no deadlock!What is a livelock? How does it differ from a deadlock? What is lock escalation in SQL Server?
Frequently Asked Questions
The most common SQL interview questions cover SELECT queries, JOINs (INNER, LEFT, RIGHT, FULL), GROUP BY with HAVING, subqueries, indexes, normalization, and ACID properties. Our guide covers all of these with real code examples and follow-up questions.
We cover 47 SQL interview questions across 5 difficulty levels: Basic (10), Intermediate (10), Advanced (15), Experienced (7), and Performance & Optimization (5). Each question includes 6 answer sections.
Questions are organized into 5 levels: Basic (0-1 year experience), Intermediate (1-3 years), Advanced (3-5 years), Experienced/Architect (5+ years), and Performance & Optimization (all levels). You can filter by level using the pills above the question list.
All code examples are real, working SQL statements — not pseudocode or foo/bar placeholders. Each example uses realistic table structures, actual data scenarios, and patterns from production databases. You can copy and run them directly on any SQL database.
JOINs, window functions, indexes, and query optimization are the most critical SQL interview topics. Understanding ACID properties, normalization (1NF-3NF), transaction isolation levels, and execution plans are essential for senior roles.
Focus on reading execution plans (EXPLAIN/EXPLAIN ANALYZE), understanding B-tree/B+ tree index internals and page splits, partition pruning mechanics, ORDER BY sort strategies (in-memory quicksort vs external merge sort), and knowing when to denormalize. Our deep-dive and performance sections cover all of these with real-world examples.
Senior SQL interviews focus on database partitioning, sharding strategies, replication (master-slave, multi-master), schema migration for zero-downtime deployments, ORM vs raw SQL trade-offs, materialized views, and multi-tenant database design.
The questions cover standard SQL (ANSI SQL) concepts that apply to all major databases — MySQL, PostgreSQL, SQL Server, Oracle, and SQLite. Where syntax differs between databases, we note the variations in the examples.