🗄️ SQL Interview Questions

40 questions with theory, real code, real-world scenarios, common mistakes and follow-up questions — from basic joins to query optimization.

47Questions
5Levels
6Answer Sections
240Total Answers
Showing 47 of 47 questions
0 of 47 viewed
01 What is SQL and what are its main sub-languages (DDL, DML, DCL, TCL)? basic

SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It was developed at IBM in the 1970s and became an ANSI/ISO standard in 1986.

SQL is divided into four sub-languages based on the type of operation:

  • DDL (Data Definition Language) — defines or modifies database structure: CREATE, ALTER, DROP, TRUNCATE
  • DML (Data Manipulation Language) — reads and modifies data: SELECT, INSERT, UPDATE, DELETE
  • DCL (Data Control Language) — manages permissions: GRANT, REVOKE
  • TCL (Transaction Control Language) — manages transactions: BEGIN, COMMIT, ROLLBACK, SAVEPOINT

Understanding these categories helps you reason about what each statement does — DDL changes the schema (and usually auto-commits), DML changes the data (and can be rolled back within a transaction), DCL controls who can do what, and TCL controls when changes become permanent.

-- DDL: Define structure
CREATE TABLE employees (
    id         INT PRIMARY KEY AUTO_INCREMENT,
    name       VARCHAR(100) NOT NULL,
    department VARCHAR(50),
    salary     DECIMAL(10,2),
    hire_date  DATE
);

-- DML: Manipulate data
INSERT INTO employees (name, department, salary, hire_date)
VALUES ('Priya Sharma', 'Engineering', 85000.00, '2024-03-15');

SELECT name, salary FROM employees WHERE department = 'Engineering';

UPDATE employees SET salary = 92000.00 WHERE id = 1;

DELETE FROM employees WHERE id = 1;

-- DCL: Control access
GRANT SELECT, INSERT ON employees TO 'analyst_role';
REVOKE INSERT ON employees FROM 'analyst_role';

-- TCL: Control transactions
BEGIN;
UPDATE employees SET salary = salary * 1.10 WHERE department = 'Engineering';
-- Check results before committing
COMMIT;

At a fintech startup, a junior developer ran ALTER TABLE DROP COLUMN on a production table without realising DDL auto-commits in MySQL — the column was gone instantly with no ROLLBACK possible. The team had to restore from a backup, causing 2 hours of downtime. After the incident, they enforced a policy: all DDL changes go through a migration tool (Flyway) with a review step, and DML changes are always wrapped in explicit transactions with a manual COMMIT after verification.

DDL changes structure (auto-commits), DML changes data (can rollback), DCL controls access, TCL controls transactions — knowing this prevents accidental irreversible operations.
⚠️ Common Mistake

Running DDL statements like DROP TABLE or TRUNCATE thinking they can be rolled back like DML statements.

❌ Dangerous — DDL auto-commits in most databases
BEGIN;
DROP TABLE temp_reports;   -- Auto-commits immediately!
ROLLBACK;                  -- Too late, table is gone
✅ Safe approach — use DML within transactions
BEGIN;
DELETE FROM temp_reports WHERE created_at < '2024-01-01';
-- Verify: SELECT COUNT(*) FROM temp_reports;
COMMIT;  -- Only when you are sure
🔁 Follow-Up Question

What is the difference between TRUNCATE and DELETE? Can TRUNCATE be rolled back?

02 What is the difference between WHERE and HAVING? basic

WHERE filters individual rows before any grouping happens. HAVING filters groups after the GROUP BY aggregation is applied.

The key difference is timing in the execution pipeline:

  • WHERE runs first — it removes rows that don't match before the database even starts grouping. This means WHERE cannot use aggregate functions like COUNT(), SUM(), or AVG().
  • HAVING runs after GROUP BY — it can reference aggregate functions because the groups have already been computed.

Performance tip: always prefer WHERE over HAVING when possible. Filtering rows early (with WHERE) reduces the number of rows the database needs to group and aggregate, making the query faster.

-- Table: orders (id, customer_id, amount, status, order_date)

-- WHERE: filter rows BEFORE grouping
-- "Total spent by each customer, but only count completed orders"
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'          -- filters rows first
GROUP BY customer_id;

-- HAVING: filter groups AFTER aggregation
-- "Customers who spent more than $5,000 total"
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
HAVING SUM(amount) > 5000;           -- filters groups after

-- Common mistake: using HAVING when WHERE would work
-- ❌ Slow: HAVING filters after grouping all rows
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING department != 'Intern';

-- ✅ Fast: WHERE filters before grouping
SELECT department, COUNT(*) AS emp_count
FROM employees
WHERE department != 'Intern'
GROUP BY department;

A reporting dashboard at an e-commerce company was taking 45 seconds to load because the query used HAVING to filter out cancelled orders instead of WHERE. The query was grouping 12 million rows before filtering, when it could have excluded 3 million cancelled rows upfront. Moving the status filter from HAVING to WHERE reduced the query time from 45 seconds to 8 seconds — a 5.6x improvement with a one-line change.

WHERE filters rows before grouping (cannot use aggregates), HAVING filters groups after aggregation (can use aggregates) — use WHERE whenever possible for better performance.
⚠️ Common Mistake

Using HAVING for conditions that don't involve aggregate functions — this forces the database to group everything before filtering.

❌ Wrong — HAVING without aggregate
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING department IN ('Engineering', 'Sales');
✅ Correct — WHERE for non-aggregate filter
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE department IN ('Engineering', 'Sales')
GROUP BY department;
🔁 Follow-Up Question

Can you use both WHERE and HAVING in the same query? What is the SQL execution order?

03 Explain the SQL query execution order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY) basic

SQL does not execute in the order you write it. The logical execution order is:

  1. FROM / JOIN — identify the source tables and perform joins
  2. WHERE — filter individual rows
  3. GROUP BY — group remaining rows
  4. HAVING — filter groups
  5. SELECT — compute output columns and expressions
  6. DISTINCT — remove duplicates
  7. ORDER BY — sort the result set
  8. LIMIT / OFFSET — restrict the number of rows returned

This order explains many common confusions: you cannot use a column alias defined in SELECT inside WHERE (because SELECT runs after WHERE). You can use it in ORDER BY because that runs last. And HAVING can reference aggregates because GROUP BY has already run.

-- Consider this query:
SELECT department,
       COUNT(*) AS emp_count,
       AVG(salary) AS avg_salary
FROM employees
JOIN departments ON employees.dept_id = departments.id
WHERE hire_date >= '2023-01-01'
GROUP BY department
HAVING COUNT(*) >= 5
ORDER BY avg_salary DESC
LIMIT 10;

-- Execution order:
-- 1. FROM employees JOIN departments  → combine tables
-- 2. WHERE hire_date >= '2023-01-01'  → filter to recent hires
-- 3. GROUP BY department              → group remaining rows
-- 4. HAVING COUNT(*) >= 5            → keep groups with 5+ people
-- 5. SELECT department, COUNT(*), AVG(salary) → compute columns
-- 6. ORDER BY avg_salary DESC         → sort by average salary
-- 7. LIMIT 10                         → return top 10 only

-- This is why this FAILS:
-- SELECT name, salary * 12 AS annual FROM employees WHERE annual > 100000;
-- ❌ "annual" alias does not exist yet at WHERE stage

-- Fix: repeat the expression or use a subquery
SELECT name, salary * 12 AS annual
FROM employees
WHERE salary * 12 > 100000;

A data analyst at a logistics company wrote a complex dashboard query with 4 JOINs, WHERE filters, GROUP BY, and HAVING. It returned wrong results because they referenced a SELECT alias in the HAVING clause, which some databases (MySQL) allow but PostgreSQL rejects. Understanding execution order helped them debug it in minutes — they replaced the alias with the full expression in HAVING and got correct results across both databases.

SQL executes FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. You cannot use SELECT aliases in WHERE or HAVING (in standard SQL) because SELECT runs after them.
⚠️ Common Mistake

Using a column alias from SELECT in the WHERE or HAVING clause, assuming SQL executes top to bottom.

❌ Error — alias not available in WHERE
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE annual_salary > 100000;
-- ERROR: column "annual_salary" does not exist
✅ Correct — repeat expression or use subquery
-- Option 1: Repeat the expression
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE salary * 12 > 100000;

-- Option 2: Use a subquery / CTE
WITH emp AS (
    SELECT name, salary * 12 AS annual_salary FROM employees
)
SELECT * FROM emp WHERE annual_salary > 100000;
🔁 Follow-Up Question

Why can you use a column alias in ORDER BY but not in WHERE?

04 What are the different types of JOINs in SQL? basic

JOINs combine rows from two or more tables based on a related column. SQL supports five main types:

  • INNER JOIN — returns only rows that have matching values in both tables. This is the most common join and the default when you write just JOIN.
  • LEFT (OUTER) JOIN — returns all rows from the left table, plus matched rows from the right table. If there is no match, right-side columns are NULL.
  • RIGHT (OUTER) JOIN — returns all rows from the right table, plus matched rows from the left table. Rarely used in practice — you can always rewrite it as a LEFT JOIN by swapping the table order.
  • FULL (OUTER) JOIN — returns all rows from both tables. Unmatched rows on either side get NULL in the other table's columns.
  • CROSS JOIN — returns the Cartesian product — every row in table A paired with every row in table B. No ON clause needed. Use with caution on large tables.

There is also SELF JOIN, which is not a separate type but a technique where you join a table to itself using an alias (e.g., finding employees and their managers from the same employees table).

-- Sample tables:
-- employees: id, name, dept_id
-- departments: id, dept_name

-- INNER JOIN: only employees who have a valid department
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;

-- LEFT JOIN: all employees, even those without a department
SELECT e.name, COALESCE(d.dept_name, 'Unassigned') AS dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;

-- Find employees WITHOUT a department (anti-join pattern)
SELECT e.name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
WHERE d.id IS NULL;

-- FULL OUTER JOIN: all employees + all departments (even unmatched)
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;

-- CROSS JOIN: every employee paired with every project (careful!)
SELECT e.name, p.project_name
FROM employees e
CROSS JOIN projects p;
-- If employees has 100 rows and projects has 10, result = 1,000 rows

-- SELF JOIN: find each employee's manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

An HR analytics team needed to find employees who had never submitted a timesheet. They initially used INNER JOIN between employees and timesheets, which only returned employees WITH timesheets — the opposite of what they needed. Switching to LEFT JOIN with a WHERE timesheets.id IS NULL correctly identified 47 employees with missing timesheets. This anti-join pattern is one of the most practical uses of LEFT JOIN in business reporting.

INNER JOIN = only matches. LEFT JOIN = all from left + matches. FULL JOIN = all from both. CROSS JOIN = every combination. LEFT JOIN + WHERE right.id IS NULL = anti-join (find missing records).
⚠️ Common Mistake

Using INNER JOIN when you need LEFT JOIN, which silently drops unmatched rows from the result — leading to missing data in reports.

❌ Wrong — INNER JOIN hides customers with no orders
-- "Show all customers and their order totals"
SELECT c.name, SUM(o.amount) AS total
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- Customers with zero orders are completely missing!
✅ Correct — LEFT JOIN preserves all customers
SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- All customers shown, including those with $0 orders
🔁 Follow-Up Question

What is the difference between LEFT JOIN with WHERE and LEFT JOIN with AND in the ON clause?

05 What is the difference between UNION and UNION ALL? basic

Both UNION and UNION ALL combine the results of two or more SELECT statements vertically (stacking rows). The key difference:

  • UNION — removes duplicate rows from the combined result. It performs a DISTINCT operation behind the scenes, which requires sorting or hashing to find and eliminate duplicates.
  • UNION ALL — keeps all rows, including duplicates. No deduplication step, so it is significantly faster.

Rules for both: each SELECT must have the same number of columns, and the corresponding columns must have compatible data types. Column names come from the first SELECT.

Performance rule of thumb: always use UNION ALL unless you specifically need deduplication. If the source queries are guaranteed to produce distinct rows (e.g., they pull from different tables with no overlap), UNION wastes time sorting.

-- Combine active and archived orders into one report
-- UNION ALL: faster, keeps all rows (no duplicates possible here)
SELECT id, customer_name, amount, 'active' AS source
FROM orders
WHERE status = 'active'

UNION ALL

SELECT id, customer_name, amount, 'archive' AS source
FROM orders_archive;

-- UNION: removes duplicates (slower due to sort/hash)
SELECT city FROM customers
UNION
SELECT city FROM suppliers;
-- Returns each city only once, even if it appears in both tables

-- UNION ALL: keeps duplicates
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;
-- "London" appears twice if it is in both tables

-- Combining 3+ queries with UNION ALL
SELECT name, 'employee' AS type FROM employees
UNION ALL
SELECT name, 'contractor' AS type FROM contractors
UNION ALL
SELECT name, 'intern' AS type FROM interns
ORDER BY name;  -- ORDER BY applies to the entire result

A data warehouse ETL pipeline at a retail company combined sales from 12 regional databases using UNION (with dedup). Processing took 25 minutes because the dedup step sorted 80 million rows. Since each regional database had unique transaction IDs, duplicates were impossible. Switching to UNION ALL dropped processing time to 4 minutes — an 84% improvement with zero risk of data loss.

UNION removes duplicates (slower, sorts). UNION ALL keeps everything (faster). Default to UNION ALL unless you specifically need deduplication.
⚠️ Common Mistake

Using UNION by default everywhere, paying the sorting cost when duplicates are either impossible or acceptable.

❌ Slow — unnecessary deduplication
-- Each table has unique IDs, duplicates are impossible
SELECT id, name, amount FROM current_orders
UNION
SELECT id, name, amount FROM archived_orders;
-- Sorts all rows to remove duplicates that don't exist
✅ Fast — UNION ALL when dedup is not needed
SELECT id, name, amount FROM current_orders
UNION ALL
SELECT id, name, amount FROM archived_orders;
-- Skips the sort, directly concatenates results
🔁 Follow-Up Question

Can you use ORDER BY with UNION? Where does the ORDER BY clause go?

06 How do NULLs behave in SQL (comparisons, aggregates, COALESCE, IS NULL)? basic

NULL in SQL represents an unknown or missing value — it is not zero, not an empty string, and not false. It is the absence of any value.

NULL has special behaviour that trips up even experienced developers:

  • Comparisons: Any comparison with NULL returns NULL (not true or false). NULL = NULL is NULL, not TRUE. Use IS NULL or IS NOT NULL to test for NULLs.
  • Aggregates: Most aggregate functions (SUM, AVG, COUNT(column)) ignore NULL values. COUNT(*) counts all rows including NULLs, but COUNT(column) only counts non-NULL values.
  • COALESCE: Returns the first non-NULL argument. COALESCE(NULL, NULL, 'default') returns 'default'.
  • Boolean logic: NULL AND TRUE = NULL, NULL OR TRUE = TRUE, NOT NULL = NULL. This three-valued logic (TRUE, FALSE, NULL) can cause unexpected WHERE clause results.
-- NULL comparison pitfalls
SELECT * FROM employees WHERE manager_id = NULL;     -- ❌ Returns nothing!
SELECT * FROM employees WHERE manager_id IS NULL;     -- ✅ Correct

-- NULL in arithmetic
SELECT 100 + NULL;   -- Result: NULL (not 100)
SELECT NULL = NULL;  -- Result: NULL (not TRUE)
SELECT NULL != NULL; -- Result: NULL (not TRUE)

-- COUNT behavior with NULLs
-- employees table: 100 rows, 15 have NULL in bonus column
SELECT COUNT(*)      FROM employees;  -- 100 (all rows)
SELECT COUNT(bonus)  FROM employees;  -- 85  (non-NULL only)
SELECT AVG(bonus)    FROM employees;  -- average of 85 values, not 100

-- COALESCE: provide fallback values
SELECT name,
       COALESCE(phone, email, 'No contact info') AS contact
FROM customers;

-- NULLIF: returns NULL if both values are equal (useful to avoid division by zero)
SELECT revenue / NULLIF(expenses, 0) AS ratio
FROM financials;
-- If expenses = 0, NULLIF returns NULL, avoiding division-by-zero error

-- Sorting with NULLs
SELECT name, bonus FROM employees
ORDER BY bonus NULLS LAST;  -- PostgreSQL syntax

A payroll system calculated average bonuses using AVG(bonus) which silently ignored 200 employees with NULL bonuses. The reported average was $4,500, but when the CFO manually calculated with zeros for missing bonuses, the true average was $3,200. The fix was COALESCE: AVG(COALESCE(bonus, 0)) — which treats missing bonuses as zero instead of ignoring them.

NULL means unknown, not zero. Use IS NULL (not = NULL) for comparisons. COUNT(*) includes NULLs, COUNT(column) excludes them. COALESCE provides fallback values.
⚠️ Common Mistake

Using = NULL or != NULL instead of IS NULL / IS NOT NULL — the comparison silently returns no results.

❌ Wrong — = NULL always returns NULL (no matches)
SELECT * FROM employees WHERE bonus = NULL;
-- Returns 0 rows, even if 50 employees have NULL bonus

SELECT * FROM employees WHERE bonus != NULL;
-- Also returns 0 rows! Neither condition matches NULL.
✅ Correct — use IS NULL / IS NOT NULL
SELECT * FROM employees WHERE bonus IS NULL;
-- Returns all 50 employees with missing bonus

SELECT * FROM employees WHERE bonus IS NOT NULL;
-- Returns employees who have a bonus value
🔁 Follow-Up Question

What is the difference between COALESCE and IFNULL/ISNULL? Which is ANSI standard?

07 What is the difference between DELETE, TRUNCATE, and DROP? basic

All three remove data, but they operate at different levels and have different implications:

  • DELETE (DML) — removes specific rows based on a WHERE clause. Logs each deleted row individually, can be rolled back within a transaction. Fires triggers. Does not reset auto-increment counters.
  • TRUNCATE (DDL) — removes all rows from a table instantly. Deallocates data pages instead of logging individual row deletions, making it much faster than DELETE for large tables. Cannot use WHERE. Resets auto-increment counters. Cannot be rolled back in most databases (auto-commits in MySQL, but can be rolled back in PostgreSQL and SQL Server).
  • DROP (DDL) — removes the entire table structure, data, indexes, constraints, and permissions. The table no longer exists. Cannot be rolled back.

Speed comparison for a table with 10 million rows: DELETE takes minutes (row-by-row logging), TRUNCATE takes milliseconds (deallocates pages), DROP takes milliseconds (removes metadata).

-- DELETE: remove specific rows (can rollback)
BEGIN;
DELETE FROM orders WHERE status = 'cancelled' AND created_at < '2023-01-01';
-- Check: SELECT COUNT(*) FROM orders WHERE status = 'cancelled';
ROLLBACK;  -- Undo if needed, or COMMIT to confirm

-- DELETE all rows (slow for large tables, row-by-row)
DELETE FROM temp_logs;  -- Triggers fire, auto-increment NOT reset

-- TRUNCATE: remove all rows instantly (DDL)
TRUNCATE TABLE temp_logs;  -- No WHERE clause, resets auto-increment
-- Much faster than DELETE for millions of rows

-- DROP: remove the entire table
DROP TABLE IF EXISTS temp_logs;  -- Table structure gone forever

-- Practical comparison:
-- 10M row table:
-- DELETE FROM big_table;       → ~3 minutes (logged row-by-row)
-- TRUNCATE TABLE big_table;    → ~50ms (deallocates pages)
-- DROP TABLE big_table;        → ~10ms (removes metadata)

A developer at a SaaS company ran TRUNCATE TABLE users in production instead of TRUNCATE TABLE test_users (a typo in the table name). Because TRUNCATE auto-commits in MySQL, the 280,000 user records were gone instantly with no way to ROLLBACK. Recovery took 4 hours from a backup, and the company implemented a rule: TRUNCATE is banned in production — use DELETE with explicit WHERE clause and transaction wrapping instead.

DELETE = specific rows, logged, can rollback, fires triggers. TRUNCATE = all rows, instant, usually auto-commits. DROP = removes table entirely. Default to DELETE in production for safety.
⚠️ Common Mistake

Using TRUNCATE in production thinking it can be rolled back like DELETE — in MySQL and Oracle, TRUNCATE auto-commits and cannot be undone.

❌ Dangerous — TRUNCATE auto-commits in MySQL
BEGIN;
TRUNCATE TABLE user_sessions;  -- Auto-commits immediately!
ROLLBACK;  -- Too late, all data is gone
✅ Safe — DELETE with transaction in production
BEGIN;
DELETE FROM user_sessions WHERE expires_at < NOW();
-- Verify: SELECT COUNT(*) FROM user_sessions;
COMMIT;  -- Only when confirmed
🔁 Follow-Up Question

Can TRUNCATE be rolled back in PostgreSQL? What about SQL Server?

08 What are PRIMARY KEY, FOREIGN KEY, and UNIQUE constraints? basic

Constraints enforce data integrity rules at the database level — they prevent invalid data from being inserted or updated.

  • PRIMARY KEY — uniquely identifies each row in a table. Combines NOT NULL + UNIQUE. A table can have only one primary key (which can be a single column or a composite of multiple columns). Automatically creates a clustered index in most databases.
  • FOREIGN KEY — creates a link between two tables by referencing the primary key of another table. Enforces referential integrity — you cannot insert a value that does not exist in the referenced table, and you cannot delete a referenced row without handling the dependent rows first (using CASCADE, SET NULL, or RESTRICT).
  • UNIQUE — ensures all values in a column (or combination of columns) are distinct. Unlike PRIMARY KEY, UNIQUE allows one NULL value (in most databases). A table can have multiple UNIQUE constraints.

Other common constraints: NOT NULL (column cannot be empty), CHECK (column value must satisfy a condition), DEFAULT (provides a fallback value).

-- PRIMARY KEY: single column
CREATE TABLE employees (
    id    INT PRIMARY KEY AUTO_INCREMENT,
    name  VARCHAR(100) NOT NULL,
    email VARCHAR(150) UNIQUE  -- UNIQUE constraint
);

-- PRIMARY KEY: composite (multiple columns)
CREATE TABLE order_items (
    order_id   INT,
    product_id INT,
    quantity   INT NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)  -- composite PK
);

-- FOREIGN KEY with referential actions
CREATE TABLE orders (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date  DATE DEFAULT CURRENT_DATE,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
        ON DELETE RESTRICT      -- prevent deleting customer with orders
        ON UPDATE CASCADE       -- update customer_id if customer.id changes
);

-- FOREIGN KEY with CASCADE delete
CREATE TABLE order_items (
    id       INT PRIMARY KEY AUTO_INCREMENT,
    order_id INT NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(id)
        ON DELETE CASCADE  -- delete items when order is deleted
);

-- Adding constraints to existing table
ALTER TABLE employees ADD CONSTRAINT uq_email UNIQUE (email);
ALTER TABLE employees ADD CONSTRAINT fk_dept
    FOREIGN KEY (dept_id) REFERENCES departments(id);

An e-commerce platform had no FOREIGN KEY constraints between orders and customers tables. Over time, developers deleted customer records without cleaning up their orders, creating 15,000 "orphan" orders with customer_id values pointing to non-existent customers. Monthly revenue reports were inflated by $2.3M because these orphan orders were included. Adding FOREIGN KEY constraints with ON DELETE RESTRICT prevented future orphans, and a cleanup script archived the existing ones.

PRIMARY KEY = unique + not null + one per table. FOREIGN KEY = referential integrity between tables. UNIQUE = distinct values, allows one NULL. Always define constraints — they catch data bugs before they reach production.
⚠️ Common Mistake

Not defining FOREIGN KEY constraints and relying on application code to maintain referential integrity — this leads to orphaned records when data is modified outside the application.

❌ No FK constraint — orphan records possible
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT  -- No FK! Any value accepted
);

-- This succeeds even though customer 9999 does not exist:
INSERT INTO orders (id, customer_id) VALUES (1, 9999);
✅ FK constraint — database enforces integrity
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
);

-- This fails with an error:
INSERT INTO orders (id, customer_id) VALUES (1, 9999);
-- ERROR: foreign key constraint violation
🔁 Follow-Up Question

What is the difference between ON DELETE CASCADE, SET NULL, and RESTRICT?

09 What are SQL aliases and when do you use them? basic

An alias is a temporary name given to a table or column in a query. It exists only for the duration of that query and does not rename anything in the database.

  • Column aliases — rename output columns for readability using AS. Useful for calculated expressions, aggregate functions, or when column names are unclear. The AS keyword is optional but recommended for clarity.
  • Table aliases — give short names to tables, especially useful in JOINs with long table names. Required for self-joins (joining a table to itself) to distinguish the two instances.

Important rules:

  • Column aliases defined in SELECT cannot be used in WHERE or HAVING (in standard SQL) because SELECT executes after them. They can be used in ORDER BY.
  • If an alias contains spaces or special characters, wrap it in double quotes ("Total Sales") or backticks (`Total Sales` in MySQL).
-- Column aliases: make output readable
SELECT
    first_name || ' ' || last_name AS full_name,
    salary * 12                     AS annual_salary,
    ROUND(salary * 12 * 0.30, 2)    AS estimated_tax
FROM employees;

-- Table aliases: simplify JOINs
SELECT e.name, d.dept_name, m.name AS manager_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
LEFT JOIN employees m ON e.manager_id = m.id;

-- Self-join REQUIRES table aliases
SELECT
    e.name  AS employee,
    m.name  AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Subquery alias (required in most databases)
SELECT avg_dept.department, avg_dept.avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) AS avg_dept   -- subquery MUST have an alias
WHERE avg_dept.avg_salary > 75000;

-- Alias in ORDER BY (works because ORDER BY runs after SELECT)
SELECT name, salary * 12 AS annual
FROM employees
ORDER BY annual DESC;  -- ✅ Valid

A reporting team inherited a legacy database with columns named c1, c2, c3, c4 across 200+ tables. Every query they wrote used column aliases to make output readable: SELECT c1 AS customer_name, c2 AS email, c3 AS signup_date. Table aliases shortened their 6-table JOIN queries from unreadable 300-character lines to clean, maintainable SQL. They also used aliases in CTEs to create a "virtual data dictionary" layer over the cryptic schema.

Column aliases rename output for clarity (use AS). Table aliases shorten JOINs (required for self-joins). Aliases are temporary — they do not change the database schema.
⚠️ Common Mistake

Trying to use a column alias in WHERE, which fails because WHERE executes before SELECT where the alias is defined.

❌ Error — alias not available in WHERE
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE annual_salary > 100000;
-- ERROR: column "annual_salary" does not exist
✅ Correct — repeat expression or use CTE
-- Option 1: Repeat the expression
SELECT name, salary * 12 AS annual_salary
FROM employees
WHERE salary * 12 > 100000;

-- Option 2: CTE
WITH emp AS (
    SELECT name, salary * 12 AS annual_salary FROM employees
)
SELECT * FROM emp WHERE annual_salary > 100000;
🔁 Follow-Up Question

Can you use a column alias in GROUP BY? Does it differ across databases?

10 What is the difference between CHAR, VARCHAR, and TEXT data types? basic

These are the three main string data types in SQL, differing in storage, performance, and max length:

  • CHAR(n) — fixed-length string. Always stores exactly n characters, padding with spaces if the value is shorter. Best for data that is always the same length (country codes, state abbreviations, fixed-format IDs). Wastes space on variable-length data but can be slightly faster for comparisons because the database knows the exact length.
  • VARCHAR(n) — variable-length string up to n characters. Only stores the actual characters plus 1-2 bytes for length. Best for most string data where length varies (names, emails, addresses). Most commonly used string type.
  • TEXT — variable-length string with a very large max size (65KB in MySQL, 1GB in PostgreSQL). No length limit needs to be specified. Best for long-form content (blog posts, descriptions, JSON blobs). Cannot be used as an index key in most databases (or only a prefix can be indexed).

Storage comparison for storing "Hello" (5 chars): CHAR(20) uses 20 bytes. VARCHAR(20) uses 5 + 1 = 6 bytes. TEXT uses 5 + 2 = 7 bytes (MySQL).

-- CHAR: fixed-length, padded with spaces
CREATE TABLE countries (
    code CHAR(2) NOT NULL,       -- Always 2 chars: "US", "IN", "UK"
    name VARCHAR(100) NOT NULL   -- Variable length
);

-- VARCHAR: variable-length, most common choice
CREATE TABLE users (
    id         INT PRIMARY KEY AUTO_INCREMENT,
    username   VARCHAR(50)  NOT NULL,   -- Max 50, stores actual length
    email      VARCHAR(255) NOT NULL,   -- Standard email max length
    first_name VARCHAR(100),
    last_name  VARCHAR(100)
);

-- TEXT: large content, no length specified
CREATE TABLE articles (
    id      INT PRIMARY KEY AUTO_INCREMENT,
    title   VARCHAR(200) NOT NULL,
    body    TEXT NOT NULL,           -- Blog post content (can be huge)
    summary VARCHAR(500)            -- Short description
);

-- CHAR padding behavior
INSERT INTO countries (code, name) VALUES ('U', 'Test');
-- 'U' stored as 'U ' (padded to 2 chars)
-- LENGTH(code) = 2 in MySQL, but CHAR_LENGTH may vary

-- When to use which:
-- CHAR(2)    → country codes, state codes
-- CHAR(36)   → UUIDs (always 36 chars)
-- VARCHAR(n) → names, emails, URLs (variable length)
-- TEXT       → blog posts, comments, JSON documents

A SaaS platform stored user bios in a VARCHAR(255) column. When users started writing longer bios, the application silently truncated text at 255 characters without warning. Changing the column to TEXT removed the length limit, but the team discovered that their full-text search index on the bio column broke (MySQL cannot fully index TEXT columns). The fix was adding a prefix index: CREATE INDEX idx_bio ON users(bio(100)) — indexing just the first 100 characters for search.

CHAR = fixed-length (always pads). VARCHAR = variable-length with max (most common). TEXT = unlimited length (cannot be fully indexed). Use VARCHAR for 95% of string columns.
⚠️ Common Mistake

Using CHAR for variable-length data, which wastes storage and causes padding-related comparison bugs.

❌ Wrong — CHAR wastes space and causes trailing-space issues
CREATE TABLE users (
    name  CHAR(100),   -- "Alice" stored as "Alice" + 95 spaces
    email CHAR(255)    -- "a@b.com" stored as "a@b.com" + 248 spaces
);
-- Wastes ~345 bytes per row on padding
-- WHERE name = 'Alice' may not match 'Alice   ' in some databases
✅ Correct — VARCHAR for variable-length data
CREATE TABLE users (
    name  VARCHAR(100),   -- "Alice" stores 5 + 1 = 6 bytes
    email VARCHAR(255)    -- "a@b.com" stores 7 + 1 = 8 bytes
);
-- Stores only actual data + 1 byte for length
🔁 Follow-Up Question

What is the difference between VARCHAR(255) and VARCHAR(MAX)? Does the specified length affect performance?

11 What are subqueries and how do correlated subqueries differ from non-correlated? intermediate

A subquery (or inner query) is a SELECT statement nested inside another SQL statement. Subqueries can appear in SELECT, FROM, WHERE, or HAVING clauses.

There are two types:

  • Non-correlated (simple) subquery — executes once, independently of the outer query. The inner query produces a result set that the outer query then uses. Think of it as "run this first, then use the result."
  • Correlated subquery — references columns from the outer query, so it must re-execute for every row the outer query processes. Think of it as "for each row in the outer query, run this inner query." Correlated subqueries are often slower because they execute N times (once per outer row).

Common use cases: filtering with IN or EXISTS, computing derived values in SELECT, creating inline views in FROM.

-- NON-CORRELATED subquery: runs once, result reused
-- "Employees earning more than the company average"
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Inner query runs once → returns one number → outer query filters

-- NON-CORRELATED with IN
-- "Customers who placed an order in 2024"
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024);

-- CORRELATED subquery: runs once PER outer row
-- "Employees earning more than their department average"
SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department = e.department  -- references outer query!
);
-- Inner query re-executes for EACH employee row

-- Correlated subquery in SELECT (scalar subquery)
SELECT
    d.dept_name,
    (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.id) AS emp_count
FROM departments d;

-- Rewrite correlated subquery as JOIN (usually faster)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN (
    SELECT department, AVG(salary) AS avg_sal
    FROM employees GROUP BY department
) dept_avg ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_sal;

A reporting query at a retail company used a correlated subquery to find each product's rank within its category — it processed 50,000 products × 200 categories, executing the inner query 50,000 times. The query took 8 minutes. Rewriting it as a JOIN with a derived table reduced execution to 3 seconds. Later, the team replaced it entirely with a window function (ROW_NUMBER() OVER), which was even cleaner and ran in 1.5 seconds.

Non-correlated subqueries run once (fast). Correlated subqueries run per outer row (can be slow). Prefer JOINs or window functions over correlated subqueries when possible.
⚠️ Common Mistake

Using a correlated subquery when a simple JOIN or window function would be faster and more readable.

❌ Slow — correlated subquery runs per row
SELECT name, salary,
    (SELECT MAX(salary) FROM employees e2
     WHERE e2.department = e1.department) AS dept_max
FROM employees e1;
-- Executes inner query once per employee
✅ Fast — window function runs once
SELECT name, salary,
    MAX(salary) OVER (PARTITION BY department) AS dept_max
FROM employees;
-- Single pass over the data
🔁 Follow-Up Question

What is the difference between IN and EXISTS for subqueries? When is each faster?

12 What are Common Table Expressions (CTEs) and when should you use them? intermediate

A CTE (Common Table Expression) is a temporary named result set defined with the WITH keyword. It exists only for the duration of a single SQL statement and makes complex queries more readable by breaking them into logical steps.

  • Readability — CTEs replace deeply nested subqueries with named, sequential steps. Each CTE is like a variable you can reference later.
  • Reusability — a CTE can be referenced multiple times in the same query (unlike a subquery which must be repeated).
  • Recursion — CTEs support recursive queries (hierarchical data like org charts, category trees, bill of materials).

CTE vs subquery: CTEs are generally not faster than subqueries — most query optimizers inline them. The advantage is purely readability and maintainability. However, in PostgreSQL, CTEs before v12 were optimization fences (materialized by default), which could hurt performance.

CTE vs temp table: CTEs are not stored — they are query-scoped. Temp tables are stored on disk and persist across statements in a session.

-- Basic CTE: break a complex query into readable steps
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', order_date)
),
revenue_with_growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month,
        ROUND((revenue - LAG(revenue) OVER (ORDER BY month))
            / LAG(revenue) OVER (ORDER BY month) * 100, 1) AS growth_pct
    FROM monthly_revenue
)
SELECT * FROM revenue_with_growth
WHERE growth_pct IS NOT NULL
ORDER BY month;

-- CTE referenced multiple times (vs repeating a subquery)
WITH active_users AS (
    SELECT user_id, COUNT(*) AS login_count
    FROM logins
    WHERE login_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    (SELECT COUNT(*) FROM active_users) AS total_active,
    (SELECT AVG(login_count) FROM active_users) AS avg_logins,
    (SELECT MAX(login_count) FROM active_users) AS max_logins;

-- Multiple CTEs chained together
WITH step1 AS (...),
     step2 AS (SELECT ... FROM step1 ...),
     step3 AS (SELECT ... FROM step2 ...)
SELECT * FROM step3;

A data team at a SaaS company had a 120-line nested subquery for a churn analysis report. No one could understand or modify it without breaking something. Rewriting it as 5 chained CTEs (active_users → churned_users → churn_by_month → churn_by_plan → final_report) made each step independently testable and readable. Debugging time for report issues dropped from hours to minutes.

CTEs (WITH clause) make complex queries readable by breaking them into named steps. They can be referenced multiple times and support recursion. They are query-scoped, not stored.
⚠️ Common Mistake

Overusing CTEs for simple queries that don't need them, or assuming CTEs are always materialized (cached) — in most databases, the optimizer inlines them.

❌ Unnecessary — CTE adds complexity to a simple query
WITH all_employees AS (
    SELECT * FROM employees
)
SELECT name FROM all_employees WHERE department = 'Sales';
-- CTE just wraps a simple table scan — adds noise
✅ Justified — CTE simplifies multi-step logic
WITH dept_stats AS (
    SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS cnt
    FROM employees GROUP BY department
)
SELECT e.name, e.salary, d.avg_sal, d.cnt
FROM employees e
JOIN dept_stats d ON e.department = d.department
WHERE e.salary > d.avg_sal;
🔁 Follow-Up Question

What are recursive CTEs? Give an example of querying hierarchical data.

13 Explain window functions — ROW_NUMBER, RANK, DENSE_RANK, and NTILE intermediate

Window functions perform calculations across a set of rows that are somehow related to the current row — without collapsing them into a single output row like GROUP BY does. They use the OVER() clause.

The four ranking window functions:

  • ROW_NUMBER() — assigns a unique sequential number to each row within the partition. No ties — if two rows have the same value, they get different numbers (order is non-deterministic for ties).
  • RANK() — like ROW_NUMBER but handles ties by giving the same rank to equal values, then skipping the next number(s). Example: 1, 2, 2, 4 (skips 3).
  • DENSE_RANK() — like RANK but does NOT skip numbers after ties. Example: 1, 2, 2, 3 (no gap).
  • NTILE(n) — divides rows into n roughly equal groups (buckets) and assigns the group number. Useful for percentiles and quartiles.

Key syntax: FUNCTION() OVER (PARTITION BY col ORDER BY col). PARTITION BY is optional — it creates separate "windows" for each group. ORDER BY determines the ranking order.

-- Sample data: employees with name, department, salary
-- ROW_NUMBER: unique number per row within each department
SELECT name, department, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→3

-- RANK: ties get same rank, gaps after
SELECT name, department, salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→2, Dave(90k)→4

-- DENSE_RANK: ties get same rank, NO gaps
SELECT name, department, salary,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;
-- Engineering: Alice(100k)→1, Bob(95k)→2, Carol(95k)→2, Dave(90k)→3

-- NTILE: divide into groups (quartiles)
SELECT name, salary,
    NTILE(4) OVER (ORDER BY salary) AS salary_quartile
FROM employees;
-- Bottom 25% → 1, next 25% → 2, next 25% → 3, top 25% → 4

-- Practical: Get top 3 earners per department
WITH ranked AS (
    SELECT name, department, salary,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT name, department, salary
FROM ranked WHERE rn <= 3;

An e-commerce analytics team needed to find each customer's most recent order. The initial approach used a correlated subquery (SELECT MAX(order_date) for each customer), which took 12 seconds on 5 million orders. Replacing it with ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) and filtering WHERE rn = 1 reduced the query time to 1.8 seconds — and the code was shorter and clearer.

ROW_NUMBER = unique sequential (no ties). RANK = same rank for ties, skips gaps. DENSE_RANK = same rank for ties, no gaps. NTILE = divide into N equal groups. All use OVER(PARTITION BY ... ORDER BY ...).
⚠️ Common Mistake

Using ROW_NUMBER to handle ties when you actually need RANK or DENSE_RANK — ROW_NUMBER arbitrarily picks one of the tied rows.

❌ Wrong — ROW_NUMBER misses tied top earners
-- Find the highest-paid employee per department
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT * FROM ranked WHERE rn = 1;
-- If two employees tie at $100k, only ONE is returned!
✅ Correct — RANK keeps all ties
WITH ranked AS (
    SELECT *, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT * FROM ranked WHERE rnk = 1;
-- Both tied employees at $100k are returned
🔁 Follow-Up Question

What are other window functions like LAG, LEAD, SUM OVER, and how do frame clauses (ROWS BETWEEN) work?

14 What are indexes and how do they improve query performance? intermediate

An index is a data structure (usually a B-tree) that the database maintains alongside the table to speed up data retrieval. Think of it like a book's index — instead of reading every page (full table scan), you look up the topic in the index and jump directly to the right page.

How indexes work:

  • Without an index, a query like WHERE email = 'alice@example.com' must scan every row in the table (full table scan) — O(n).
  • With a B-tree index on email, the database traverses the tree in O(log n) to find the matching row(s).
  • Indexes store a sorted copy of the indexed column(s) along with pointers to the actual table rows.

Trade-offs:

  • Reads faster: SELECT queries with WHERE, JOIN, ORDER BY on indexed columns are dramatically faster.
  • Writes slower: Every INSERT, UPDATE, DELETE must also update all relevant indexes, adding overhead.
  • Storage cost: Each index takes additional disk space (often 10-30% of the table size).

Rule of thumb: index columns that appear in WHERE, JOIN ON, and ORDER BY. Don't index columns with low cardinality (e.g., boolean columns) or tables with heavy write loads.

-- Create an index on a frequently queried column
CREATE INDEX idx_email ON users(email);

-- Now this query uses the index instead of full table scan:
SELECT * FROM users WHERE email = 'alice@example.com';
-- Without index: scans 1,000,000 rows → ~500ms
-- With index: B-tree lookup → ~2ms

-- Composite (multi-column) index
CREATE INDEX idx_dept_salary ON employees(department, salary);
-- Helps queries that filter on department, or department + salary
-- Does NOT help queries that filter on salary alone (leftmost prefix rule)

-- Unique index (also enforces uniqueness)
CREATE UNIQUE INDEX idx_unique_email ON users(email);

-- Check existing indexes
SHOW INDEX FROM users;            -- MySQL
SELECT * FROM pg_indexes WHERE tablename = 'users';  -- PostgreSQL

-- Drop an index
DROP INDEX idx_email ON users;    -- MySQL
DROP INDEX idx_email;             -- PostgreSQL

-- See if a query uses an index
EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- Look for "index scan" or "index seek" (good)
-- vs "seq scan" or "full table scan" (bad)

A social media platform's user search was taking 3-5 seconds on a table with 8 million users. The WHERE clause filtered on username, but there was no index on that column. Adding CREATE INDEX idx_username ON users(username) reduced search time to 15ms — a 200x improvement. However, the team later discovered that adding 12 indexes to the users table slowed down user registration (INSERT) from 5ms to 45ms. They removed 4 rarely-used indexes as a compromise.

Indexes speed up reads (WHERE, JOIN, ORDER BY) but slow down writes (INSERT, UPDATE, DELETE). Index columns you query often, but don't over-index. Use EXPLAIN to verify index usage.
⚠️ Common Mistake

Adding indexes on every column "just in case" — this slows down writes significantly and wastes disk space.

❌ Over-indexing — every column indexed
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT,
    product_id  INT,
    amount      DECIMAL(10,2),
    status      VARCHAR(20),
    created_at  DATETIME
);
CREATE INDEX idx_1 ON orders(customer_id);
CREATE INDEX idx_2 ON orders(product_id);
CREATE INDEX idx_3 ON orders(amount);
CREATE INDEX idx_4 ON orders(status);       -- low cardinality!
CREATE INDEX idx_5 ON orders(created_at);
-- 5 indexes = every INSERT updates 5 B-trees
✅ Strategic indexing — based on actual query patterns
-- Only index columns used in WHERE/JOIN/ORDER BY
CREATE INDEX idx_customer ON orders(customer_id);    -- JOIN queries
CREATE INDEX idx_created ON orders(created_at);      -- date range queries
-- 2 indexes, covering 90% of query patterns
🔁 Follow-Up Question

What is the difference between clustered and non-clustered indexes?

15 Clustered vs non-clustered indexes — internal structure, bookmark lookups, and storage intermediate

This is one of the most asked SQL interview questions. The difference goes much deeper than "one sorts the data, the other doesn't".

Clustered Index (InnoDB / SQL Server):

  • The leaf nodes of the B+ tree ARE the table data pages. The table is literally stored as a B+ tree ordered by the clustered key. There is no separate "heap" or "table" — the clustered index IS the table.
  • In InnoDB (MySQL), the primary key is always the clustered index. If you don't define a PK, InnoDB creates a hidden 6-byte row ID as the clustered key.
  • Only one per table — data can be physically sorted only one way.
  • Range scans are fast — since rows are contiguous on disk, reading WHERE id BETWEEN 100 AND 200 is a sequential disk read.

Non-Clustered Index:

  • A separate B+ tree whose leaf nodes store: (indexed column values) + (a pointer back to the actual row).
  • In SQL Server, the pointer is the clustered key value (not a physical address). In InnoDB, non-clustered (secondary) index leaves store the primary key value.
  • This means a non-clustered lookup does two B+ tree traversals: (1) search the secondary index B+ tree → get the PK value → (2) search the clustered index B+ tree using that PK → get the row. This second step is called a bookmark lookup (SQL Server) or key lookup.
  • Many per table (typically 5-15 in practice).

Heap table (PostgreSQL model): PostgreSQL does not have clustered indexes in the InnoDB sense. All tables are heaps (unordered). All indexes are secondary, and leaf nodes store a ctid (physical tuple ID = page number + offset). The CLUSTER command reorders the heap once but isn't maintained on writes.

-- InnoDB CLUSTERED INDEX internal structure:
-- PRIMARY KEY on id → B+ tree IS the table

-- B+ tree root (internal node):
--   [ 500 | 1000 | 1500 ]
--      |     |      |     \
-- Leaf pages (actual data rows stored here!):
-- Page 1: [id=1, name=Alice, salary=50k], [id=2, name=Bob, salary=60k], ...
-- Page 2: [id=501, name=Carol, salary=55k], [id=502, ...], ...
-- Page 3: [id=1001, ...], ...
-- Leaf pages are doubly-linked for range scans

-- NON-CLUSTERED (SECONDARY) INDEX in InnoDB:
-- CREATE INDEX idx_name ON employees(name);
-- Separate B+ tree:
-- Leaf: [Alice → PK=1], [Bob → PK=2], [Carol → PK=501]
--        ↑ sorted by name     ↑ stores PRIMARY KEY, not row address

-- Lookup: SELECT * FROM employees WHERE name = 'Alice';
-- Step 1: Search idx_name B+ tree → find Alice → PK=1
-- Step 2: Search clustered index B+ tree using PK=1 → get full row
-- = TWO B+ tree traversals (this is the "bookmark/key lookup")

-- SQL Server: explicit clustered index on non-PK column
CREATE TABLE events (
    id          INT PRIMARY KEY NONCLUSTERED,
    event_date  DATE,
    description VARCHAR(500)
);
CREATE CLUSTERED INDEX idx_event_date ON events(event_date);
-- Table stored sorted by event_date, not by id
-- id lookups now require a key lookup through the non-clustered PK index

-- Covering index: eliminates the second B+ tree traversal
CREATE INDEX idx_name_salary ON employees(name) INCLUDE (salary);
-- SELECT name, salary WHERE name = 'Alice';
-- Only Step 1 needed — salary is in the index leaf → no key lookup!
-- EXPLAIN shows "Index Only Scan" (PG) or "Index Seek" without "Key Lookup" (SQL Server)

-- PostgreSQL: heap + ctid-based indexes
-- All indexes store ctid (page_number, tuple_offset)
-- No "key lookup" — index leaf points directly to heap page
-- But: HOT updates (Heap-Only Tuples) can update rows without updating indexes
-- if the indexed columns didn't change

-- Check index type in PostgreSQL:
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'employees';

-- Fill factor: leave room in pages to reduce page splits
CREATE INDEX idx_name ON employees(name) WITH (fillfactor = 80);
-- Each leaf page only fills 80% → 20% free for future inserts
-- Reduces page splits at the cost of 20% more disk space

-- InnoDB: check fragmentation from random inserts
SELECT table_name, data_free, data_length,
       ROUND(data_free / data_length * 100, 1) AS fragmentation_pct
FROM information_schema.tables
WHERE table_name = 'employees';

A SaaS platform used UUIDs as primary keys in InnoDB. Because InnoDB's clustered index is the PK, every INSERT went to a random position in the B+ tree — causing 15,000 page splits per minute. Secondary indexes were also affected: every secondary index leaf stores the PK (UUID), making secondary indexes 36 bytes wider per entry than necessary. Switching to BIGINT AUTO_INCREMENT as the clustered PK and keeping UUID as a UNIQUE secondary index reduced page splits to near zero, shrunk secondary indexes by 40%, and improved INSERT throughput by 3x. The team also set innodb_fill_factor (via page merge threshold) to reduce future splits.

Clustered index = B+ tree leaf nodes ARE the table data (InnoDB, SQL Server). Non-clustered = separate B+ tree whose leaves store the PK/clustered key → requires a second B+ tree traversal (key lookup). PostgreSQL uses heaps + ctid pointers instead. Use covering indexes to eliminate key lookups. Choose sequential clustered keys to avoid page splits.
⚠️ Common Mistake

Not understanding the double B+ tree traversal in non-clustered lookups, leading to surprise slow queries when selecting many columns via a non-clustered index.

❌ Each row requires a key lookup — expensive at scale
CREATE INDEX idx_email ON users(email);
SELECT * FROM users WHERE email LIKE 'a%';
-- Index finds 50,000 matching emails
-- For EACH of the 50,000: key lookup into clustered index → fetch full row
-- 50,000 random I/Os into the clustered B+ tree → SLOW!
-- Optimizer may choose full table scan instead (cheaper)
✅ Covering index — no key lookup needed
CREATE INDEX idx_email_cover ON users(email) INCLUDE (name, created_at);
SELECT name, email, created_at FROM users WHERE email LIKE 'a%';
-- All requested columns are in the index leaf → index-only scan
-- 50,000 sequential reads from the index → fast!
-- No key lookup into clustered index at all
🔁 Follow-Up Question

What is a covering index and when does it eliminate the need for a bookmark lookup? How do HOT updates work in PostgreSQL?

16 What are views and what are their advantages and limitations? intermediate

A view is a virtual table defined by a stored SQL query. It does not store data itself — every time you query a view, the database executes the underlying SELECT statement.

Advantages:

  • Simplification — wrap complex JOINs into a simple table-like interface that analysts can query without knowing the underlying schema.
  • Security — expose only specific columns or filtered rows. Grant users access to the view but not the underlying tables.
  • Consistency — centralize business logic (e.g., "active customer" definition) in one place instead of repeating it in dozens of queries.
  • Abstraction — change the underlying tables without breaking dependent queries, as long as the view's output stays the same.

Limitations:

  • Views do not improve performance — the underlying query runs every time (unlike materialized views).
  • Complex views with multiple JOINs and subqueries can actually be slower because the optimizer has less flexibility.
  • Not all views are updatable — views with JOINs, GROUP BY, DISTINCT, or UNION generally cannot be used with INSERT/UPDATE/DELETE.
-- Create a view: simplify a complex JOIN
CREATE VIEW v_employee_details AS
SELECT
    e.id, e.name, e.salary,
    d.dept_name,
    m.name AS manager_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
LEFT JOIN employees m ON e.manager_id = m.id;

-- Query the view like a regular table
SELECT * FROM v_employee_details WHERE dept_name = 'Engineering';

-- Security: create a restricted view for HR
CREATE VIEW v_employee_public AS
SELECT id, name, department, hire_date
FROM employees;
-- Excludes salary, SSN, personal info
GRANT SELECT ON v_employee_public TO hr_readonly_role;

-- Updatable view (simple single-table view)
CREATE VIEW v_active_employees AS
SELECT * FROM employees WHERE status = 'active';

-- This INSERT works through the view:
INSERT INTO v_active_employees (name, department, status)
VALUES ('New Hire', 'Sales', 'active');

-- Replace a view definition
CREATE OR REPLACE VIEW v_employee_details AS
SELECT ... -- updated query;

-- Drop a view
DROP VIEW IF EXISTS v_employee_details;

A healthcare company had 15 different teams writing queries against a patient database with 40+ tables. Each team had slightly different definitions of "active patient" — some checked last_visit_date, others checked insurance_status, others checked both. Creating a view v_active_patients with the official definition and requiring all teams to use it eliminated data discrepancies across 60+ reports. When the definition changed (adding a new status check), they updated the single view instead of hunting through hundreds of queries.

Views are virtual tables (stored queries, not stored data). Use them for simplification, security, and consistency. They don't improve performance — consider materialized views for that.
⚠️ Common Mistake

Nesting views inside views inside views, creating a "view stack" that is impossible to debug and can have terrible performance.

❌ Bad — deeply nested views
CREATE VIEW v1 AS SELECT ... FROM big_table JOIN ...;
CREATE VIEW v2 AS SELECT ... FROM v1 JOIN ...;
CREATE VIEW v3 AS SELECT ... FROM v2 JOIN ...;
CREATE VIEW v4 AS SELECT ... FROM v3 WHERE ...;
-- Query on v4: optimizer must unpack 4 layers
-- Debugging v4 requires understanding all 4 views
✅ Better — flat views or CTEs
CREATE VIEW v_report AS
SELECT ... FROM big_table
JOIN table2 ON ...
JOIN table3 ON ...
WHERE ...;
-- One view, one level, easy to debug and optimize
🔁 Follow-Up Question

What are materialized views and how do they differ from regular views?

17 What are stored procedures vs functions — when to use each? intermediate

Both are named blocks of reusable SQL code stored in the database, but they have different purposes and capabilities:

  • Stored Procedures — execute a series of SQL statements. Can modify data (INSERT, UPDATE, DELETE), manage transactions, return multiple result sets, use output parameters, and have side effects. Called with CALL or EXEC. Cannot be used inside a SELECT statement.
  • Functions (UDFs) — compute and return a single value (scalar function) or a table (table-valued function). Must return something. Can be used inside SELECT, WHERE, and JOIN clauses like built-in functions. Generally cannot modify data or manage transactions (though this varies by database).

When to use each:

  • Use procedures for operations that modify data, involve complex business logic with multiple steps, or need transaction control.
  • Use functions for calculations, transformations, or lookups that you want to use inline within queries.
-- STORED PROCEDURE: transfer money between accounts
DELIMITER //
CREATE PROCEDURE sp_transfer_funds(
    IN from_acct INT,
    IN to_acct INT,
    IN amount DECIMAL(10,2),
    OUT success BOOLEAN
)
BEGIN
    DECLARE from_balance DECIMAL(10,2);

    START TRANSACTION;

    SELECT balance INTO from_balance FROM accounts
    WHERE id = from_acct FOR UPDATE;

    IF from_balance >= amount THEN
        UPDATE accounts SET balance = balance - amount WHERE id = from_acct;
        UPDATE accounts SET balance = balance + amount WHERE id = to_acct;
        INSERT INTO transactions (from_id, to_id, amount) VALUES (from_acct, to_acct, amount);
        SET success = TRUE;
        COMMIT;
    ELSE
        SET success = FALSE;
        ROLLBACK;
    END IF;
END //
DELIMITER ;

-- Call the procedure
CALL sp_transfer_funds(101, 202, 500.00, @result);
SELECT @result;

-- FUNCTION: calculate tax for a salary
CREATE FUNCTION fn_calc_tax(salary DECIMAL(10,2))
RETURNS DECIMAL(10,2)
DETERMINISTIC
BEGIN
    IF salary <= 50000 THEN RETURN salary * 0.10;
    ELSEIF salary <= 100000 THEN RETURN salary * 0.20;
    ELSE RETURN salary * 0.30;
    END IF;
END;

-- Use function inline in a query
SELECT name, salary, fn_calc_tax(salary) AS tax
FROM employees;

A banking application had fund transfer logic duplicated across 3 microservices (web, mobile, API). Each had slightly different validation rules, causing edge cases where transfers could overdraw accounts. Moving the logic into a single stored procedure sp_transfer_funds with proper locking (SELECT ... FOR UPDATE) and transaction handling eliminated the inconsistency and reduced overdraft incidents from ~15/month to zero.

Procedures = multi-step operations with side effects (CALL). Functions = return values usable inline in queries (SELECT). Use procedures for data modification, functions for calculations.
⚠️ Common Mistake

Putting complex business logic in functions that get called per-row in a query — functions execute once per row, so a function called in a SELECT over 1 million rows runs 1 million times.

❌ Slow — function called per row with internal queries
CREATE FUNCTION fn_get_order_count(cust_id INT)
RETURNS INT
BEGIN
    RETURN (SELECT COUNT(*) FROM orders WHERE customer_id = cust_id);
END;

SELECT name, fn_get_order_count(id) FROM customers;
-- Executes the COUNT query 100,000 times for 100K customers!
✅ Fast — use a JOIN instead
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
-- Single query, single pass
🔁 Follow-Up Question

What are the security implications of stored procedures? How do they help prevent SQL injection?

18 What are triggers and what are their common use cases? intermediate

A trigger is a stored procedure that automatically executes in response to specific events on a table — INSERT, UPDATE, or DELETE. Triggers fire without being explicitly called.

Trigger timing:

  • BEFORE — fires before the operation. Can modify the incoming data or cancel the operation. Use for validation or data transformation.
  • AFTER — fires after the operation. The data has already been written. Use for audit logging, cascading updates, or notifications.
  • INSTEAD OF — replaces the operation entirely (SQL Server/PostgreSQL for views). Use to make complex views updatable.

Inside a trigger, you have access to NEW (the new row for INSERT/UPDATE) and OLD (the original row for UPDATE/DELETE).

Caution: Triggers are invisible to application code, making them hard to debug. Overuse leads to "hidden logic" that developers aren't aware of. They also add overhead to every write operation on the table.

-- BEFORE INSERT: auto-fill created_at timestamp
CREATE TRIGGER trg_before_insert_orders
BEFORE INSERT ON orders
FOR EACH ROW
SET NEW.created_at = NOW(),
    NEW.updated_at = NOW();

-- AFTER UPDATE: audit log
DELIMITER //
CREATE TRIGGER trg_audit_salary_change
AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
    IF OLD.salary != NEW.salary THEN
        INSERT INTO salary_audit (
            employee_id, old_salary, new_salary,
            changed_at, changed_by
        ) VALUES (
            NEW.id, OLD.salary, NEW.salary,
            NOW(), CURRENT_USER()
        );
    END IF;
END //
DELIMITER ;

-- BEFORE UPDATE: prevent invalid data
DELIMITER //
CREATE TRIGGER trg_prevent_salary_decrease
BEFORE UPDATE ON employees
FOR EACH ROW
BEGIN
    IF NEW.salary < OLD.salary THEN
        SIGNAL SQLSTATE '45000'
        SET MESSAGE_TEXT = 'Salary cannot be decreased';
    END IF;
END //
DELIMITER ;

-- View existing triggers
SHOW TRIGGERS;                    -- MySQL
SELECT * FROM information_schema.triggers;  -- Standard SQL

A compliance team at a financial institution needed a complete audit trail of all changes to the accounts table — who changed what, when, and from what value. Rather than modifying 30+ application endpoints to add logging, they created AFTER UPDATE and AFTER DELETE triggers that automatically logged every change to an accounts_audit table. The audit table grew to 200 million rows, so they added partitioning by month and auto-archived partitions older than 7 years.

Triggers auto-execute on INSERT/UPDATE/DELETE — BEFORE (validate/transform), AFTER (audit/cascade), INSTEAD OF (for views). Use sparingly — they are invisible to app code and add write overhead.
⚠️ Common Mistake

Creating triggers that modify other tables which also have triggers, causing an infinite trigger chain or unpredictable cascading effects.

❌ Dangerous — trigger chain / infinite loop
-- Trigger on table A updates table B
-- Trigger on table B updates table A → infinite loop!
CREATE TRIGGER trg_a AFTER UPDATE ON table_a
FOR EACH ROW UPDATE table_b SET val = NEW.val WHERE id = NEW.ref_id;

CREATE TRIGGER trg_b AFTER UPDATE ON table_b
FOR EACH ROW UPDATE table_a SET val = NEW.val WHERE id = NEW.ref_id;
✅ Safe — one-directional trigger with guard
CREATE TRIGGER trg_audit AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
    -- Only log, don't modify other tables with triggers
    INSERT INTO audit_log (table_name, record_id, action, changed_at)
    VALUES ('employees', NEW.id, 'UPDATE', NOW());
END;
🔁 Follow-Up Question

Can triggers cause performance problems? How do you disable them during bulk operations?

19 Explain transactions and the BEGIN, COMMIT, ROLLBACK workflow intermediate

A transaction is a sequence of SQL operations that are treated as a single atomic unit — either all operations succeed (COMMIT), or all are undone (ROLLBACK). There is no partial state.

Transaction workflow:

  • BEGIN (or START TRANSACTION) — marks the start of a transaction. All subsequent operations are part of this transaction.
  • COMMIT — makes all changes permanent. Once committed, changes are durable even if the server crashes.
  • ROLLBACK — undoes all changes made since BEGIN. The database returns to its state before the transaction started.
  • SAVEPOINT — creates a checkpoint within a transaction. You can ROLLBACK TO a savepoint without undoing the entire transaction.

Transactions ensure the ACID properties: Atomicity (all-or-nothing), Consistency (data stays valid), Isolation (concurrent transactions don't interfere), Durability (committed data survives crashes).

Many databases use autocommit by default — each statement is its own transaction. Explicit BEGIN...COMMIT groups multiple statements into one atomic operation.

-- Basic transaction: transfer funds
BEGIN;

UPDATE accounts SET balance = balance - 500 WHERE id = 101;
UPDATE accounts SET balance = balance + 500 WHERE id = 202;
INSERT INTO transfers (from_id, to_id, amount, created_at)
VALUES (101, 202, 500, NOW());

-- Verify before committing
SELECT id, balance FROM accounts WHERE id IN (101, 202);

COMMIT;  -- All 3 operations become permanent

-- ROLLBACK example: something went wrong
BEGIN;
DELETE FROM orders WHERE customer_id = 42;
-- Oops, wrong customer!
ROLLBACK;  -- Undo the delete, no data lost

-- SAVEPOINT: partial rollback
BEGIN;
INSERT INTO orders (customer_id, amount) VALUES (1, 100);
SAVEPOINT sp1;
INSERT INTO orders (customer_id, amount) VALUES (2, 200);
-- Oops, second insert had wrong data
ROLLBACK TO sp1;  -- Only undoes the second INSERT
-- First INSERT is still pending
COMMIT;  -- Commits only the first INSERT

-- Check autocommit setting
SELECT @@autocommit;           -- MySQL (1 = on)
SHOW autocommit;               -- PostgreSQL

-- Disable autocommit for a session
SET autocommit = 0;            -- MySQL
-- Now every statement requires explicit COMMIT

An airline booking system processes seat reservations in a transaction: (1) check seat availability, (2) reserve the seat, (3) charge the credit card, (4) send confirmation email. If step 3 fails (card declined), the transaction rolls back step 2 (unreserve the seat) so it becomes available for others. Without transactions, a card failure would leave a "ghost reservation" — a seat marked as taken but never paid for. The system processes 50,000 bookings/day, and transactions prevent approximately 200 ghost reservations daily.

BEGIN groups operations atomically. COMMIT makes them permanent. ROLLBACK undoes them all. SAVEPOINT allows partial rollback. Always use transactions for multi-step operations that must succeed or fail together.
⚠️ Common Mistake

Forgetting to COMMIT or ROLLBACK, leaving a transaction open — this can hold locks for hours and block other users.

❌ Dangerous — forgotten open transaction holds locks
BEGIN;
UPDATE products SET price = 29.99 WHERE id = 42;
-- Developer gets distracted, forgets to COMMIT or ROLLBACK
-- The row is locked for hours, blocking all other queries on it
-- Eventually the DBA kills the session
✅ Safe — always use try/catch with guaranteed COMMIT or ROLLBACK
BEGIN;
    UPDATE products SET price = 29.99 WHERE id = 42;
    -- Application code: if success → COMMIT, else → ROLLBACK
COMMIT;

-- In application code, use try/finally:
-- try { begin(); update(); commit(); }
-- catch { rollback(); }
-- finally { close connection; }
🔁 Follow-Up Question

What are the ACID properties? Explain each one in detail.

20 What is the difference between EXISTS and IN? intermediate

Both EXISTS and IN filter rows based on a subquery, but they work differently:

  • IN — compares a value against a list of values returned by the subquery. The subquery executes first and returns a complete result set. Then the outer query checks if each row's value is in that set. Best when the subquery returns a small result set.
  • EXISTS — checks whether the subquery returns any rows at all (true/false). The subquery is correlated — it references the outer query and re-executes for each outer row. However, it stops as soon as it finds one match (short-circuits). Best when the subquery has a large result set or when an index exists on the correlated column.

Performance rule of thumb:

  • Use IN when the subquery result set is small.
  • Use EXISTS when the outer table is small and the subquery table is large (EXISTS stops early).
  • In modern databases, the optimizer often rewrites one to the other, making them equivalent in many cases.

NULL handling: IN can give unexpected results with NULLs (because NULL IN (...) returns NULL, not FALSE). EXISTS is not affected by NULLs.

-- IN: get all customers who have placed an order
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders);
-- Subquery runs once → returns a list of IDs
-- Outer query checks each customer against the list

-- EXISTS: same result, different approach
SELECT name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- For each customer, check if ANY order exists → stop early

-- NOT IN vs NOT EXISTS (critical difference with NULLs!)
-- If orders.customer_id has NULL values:
SELECT name FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);
-- ⚠️ Returns EMPTY result if any customer_id is NULL!
-- Because: id NOT IN (1, 2, NULL) → id != 1 AND id != 2 AND id != NULL
-- The NULL comparison makes the whole expression NULL (falsy)

-- NOT EXISTS handles NULLs correctly
SELECT name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- ✅ Works correctly regardless of NULLs

-- Performance comparison (depends on data distribution):
-- 10 customers, 1M orders → IN may be faster (small subquery result)
-- 1M customers, 100 orders → EXISTS may be faster (stops early)
-- Modern optimizers often make them equivalent

A data team at a subscription service used NOT IN to find users without active subscriptions. The query worked perfectly in development but returned zero results in production. After debugging, they found that 5 rows in the subscriptions table had NULL user_id values (from a migration bug). NOT IN with NULLs silently returns empty results. Switching to NOT EXISTS fixed the issue immediately and became the team's standard pattern for anti-joins.

IN compares against a list (good for small subqueries). EXISTS checks for existence (short-circuits, good for large tables). Always use NOT EXISTS over NOT IN to avoid NULL pitfalls.
⚠️ Common Mistake

Using NOT IN with a subquery that might contain NULLs — it silently returns zero results instead of the expected rows.

❌ Bug — NOT IN fails silently with NULLs
-- Find customers with no orders
SELECT name FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);
-- If ANY customer_id is NULL → returns EMPTY result set!
-- No error, no warning — just wrong data
✅ Safe — NOT EXISTS ignores NULLs
SELECT name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- Works correctly regardless of NULL values
🔁 Follow-Up Question

How does the query optimizer decide whether to use IN or EXISTS internally? Can you convert between them?

21 Explain ACID properties with real-world examples advanced

ACID is a set of four properties that guarantee reliable processing of database transactions:

  • Atomicity — a transaction is all-or-nothing. If any part fails, the entire transaction is rolled back. Example: a bank transfer deducts from Account A and credits Account B. If the credit fails, the deduction is also reversed.
  • Consistency — a transaction moves the database from one valid state to another. All constraints (primary keys, foreign keys, checks) are enforced. Example: you cannot transfer more money than exists in an account if a CHECK constraint enforces non-negative balances.
  • Isolation — concurrent transactions do not interfere with each other. Each transaction sees the database as if it were the only one running (the degree depends on the isolation level). Example: two users buying the last concert ticket simultaneously — isolation ensures only one succeeds.
  • Durability — once a transaction is committed, the changes survive any subsequent system failure (crash, power outage). The database uses write-ahead logging (WAL) to ensure committed data can always be recovered.
-- ATOMICITY: all-or-nothing
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE id = 1;
UPDATE accounts SET balance = balance + 1000 WHERE id = 2;
-- If server crashes between these two statements:
-- → Both are rolled back, no money is lost
COMMIT;

-- CONSISTENCY: constraints prevent invalid states
CREATE TABLE accounts (
    id      INT PRIMARY KEY,
    balance DECIMAL(10,2) CHECK (balance >= 0)  -- no negative balance
);
BEGIN;
UPDATE accounts SET balance = balance - 5000 WHERE id = 1;
-- If balance would go negative → CHECK violation → transaction fails
COMMIT;

-- ISOLATION: concurrent transactions don't see each other's changes
-- Session 1:                    | Session 2:
-- BEGIN;                        |
-- UPDATE products SET qty=9     |
--   WHERE id=1;                 |
--                               | SELECT qty FROM products WHERE id=1;
--                               | → Still sees qty=10 (not 9)
-- COMMIT;                       |
--                               | SELECT qty FROM products WHERE id=1;
--                               | → Now sees qty=9

-- DURABILITY: committed data survives crashes
BEGIN;
INSERT INTO orders (customer_id, total) VALUES (42, 299.99);
COMMIT;  -- Written to WAL (Write-Ahead Log) on disk
-- Even if server loses power NOW, this order is safe

In 2012, Knight Capital lost $440 million in 45 minutes due to a software deployment error that sent duplicate stock orders. The system lacked proper transaction atomicity — partial orders were committed without the corresponding hedge trades. If the trading system had used proper ACID transactions grouping the order and its hedge as one atomic unit, the runaway orders could have been rolled back. This incident is one of the most cited examples of why ACID matters in financial systems.

Atomicity = all or nothing. Consistency = data stays valid. Isolation = concurrent transactions don't interfere. Durability = committed data survives crashes. ACID is the foundation of reliable databases.
⚠️ Common Mistake

Assuming that writing data to the database means it is durable — without a proper COMMIT, data can be lost on crash.

❌ Not durable — autocommit off, no explicit COMMIT
SET autocommit = 0;
INSERT INTO payments (user_id, amount) VALUES (1, 500);
UPDATE users SET balance = balance - 500 WHERE id = 1;
-- No COMMIT! If server crashes → both operations are lost
-- User sees "Payment successful" but data is gone
✅ Durable — explicit COMMIT after all operations
BEGIN;
INSERT INTO payments (user_id, amount) VALUES (1, 500);
UPDATE users SET balance = balance - 500 WHERE id = 1;
COMMIT;  -- Data written to WAL → survives crash
🔁 Follow-Up Question

What are the four transaction isolation levels and what problems does each solve?

22 What are transaction isolation levels and what concurrency problems do they solve? advanced

Isolation levels control how much one transaction can see changes made by other concurrent transactions. Higher isolation = fewer concurrency bugs but more locking/overhead.

The four levels (weakest to strongest):

  • Read Uncommitted — can see uncommitted changes from other transactions (dirty reads). Rarely used in practice.
  • Read Committed — can only see committed data. Prevents dirty reads but allows non-repeatable reads (a row read twice in the same transaction can return different values if another transaction commits between reads). Default in PostgreSQL, Oracle, SQL Server.
  • Repeatable Read — guarantees that if you read a row twice, you get the same value. Prevents dirty and non-repeatable reads but allows phantom reads (new rows matching a WHERE clause can appear if another transaction inserts and commits). Default in MySQL/InnoDB.
  • Serializable — complete isolation. Transactions behave as if they ran one after another. Prevents all concurrency anomalies but has the highest performance cost (more locks, potential deadlocks).
-- Set isolation level (per transaction)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees only committed data
COMMIT;

-- DIRTY READ (Read Uncommitted)
-- Session 1: BEGIN; UPDATE accounts SET balance=0 WHERE id=1; (no commit)
-- Session 2: SELECT balance FROM accounts WHERE id=1; → sees 0 (dirty!)
-- Session 1: ROLLBACK; → balance is back to original
-- Session 2 used incorrect data!

-- NON-REPEATABLE READ (Read Committed)
-- Session 1: BEGIN; SELECT price FROM products WHERE id=1; → $10
-- Session 2: UPDATE products SET price=15 WHERE id=1; COMMIT;
-- Session 1: SELECT price FROM products WHERE id=1; → $15 (changed!)
-- Same row, different value within the same transaction

-- PHANTOM READ (Repeatable Read)
-- Session 1: BEGIN; SELECT COUNT(*) FROM orders WHERE date='2024-01-01'; → 5
-- Session 2: INSERT INTO orders (date,...) VALUES ('2024-01-01',...); COMMIT;
-- Session 1: SELECT COUNT(*) FROM orders WHERE date='2024-01-01'; → 6 (phantom!)

-- Check current isolation level
SELECT @@transaction_isolation;   -- MySQL
SHOW transaction_isolation;       -- PostgreSQL

-- Set default for session
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

An e-commerce platform experienced a double-spend bug: two concurrent sessions could both read a gift card balance of $100, each process a $80 purchase, and both commit successfully — resulting in $160 spent from a $100 card. The root cause was Read Committed isolation, which allowed both sessions to read the same balance before either committed. Upgrading to Serializable for gift card transactions (and using SELECT ... FOR UPDATE) eliminated the race condition. The team kept Read Committed as the default for other transactions to avoid performance impact.

Read Uncommitted → dirty reads. Read Committed → no dirty reads. Repeatable Read → no non-repeatable reads. Serializable → no anomalies (slowest). Choose the lowest level that your application can tolerate.
⚠️ Common Mistake

Using the default isolation level without understanding its implications — leading to subtle concurrency bugs that only appear under load.

❌ Race condition — Read Committed allows double-spend
-- Session 1 & 2 both do this simultaneously:
BEGIN;
SELECT balance FROM gift_cards WHERE id = 1;  -- Both see $100
UPDATE gift_cards SET balance = balance - 80 WHERE id = 1;
COMMIT;
-- Both succeed! Card balance is now $20, but $160 was spent
✅ Safe — SELECT FOR UPDATE locks the row
BEGIN;
SELECT balance FROM gift_cards WHERE id = 1 FOR UPDATE;
-- Row is locked — other sessions must wait
IF balance >= 80 THEN
    UPDATE gift_cards SET balance = balance - 80 WHERE id = 1;
END IF;
COMMIT;  -- Lock released, next session can proceed
🔁 Follow-Up Question

What is SELECT ... FOR UPDATE and how does it prevent race conditions?

23 Explain database normalization (1NF, 2NF, 3NF, BCNF) with examples advanced

Normalization is the process of organizing database tables to reduce data redundancy and prevent update anomalies. Each normal form (NF) builds on the previous one:

  • 1NF (First Normal Form) — every column contains atomic (indivisible) values. No repeating groups or arrays. Each row is unique (has a primary key).
  • 2NF (Second Normal Form) — is in 1NF AND every non-key column depends on the entire primary key, not just part of it. Only relevant for composite primary keys. Eliminates partial dependencies.
  • 3NF (Third Normal Form) — is in 2NF AND no non-key column depends on another non-key column. Eliminates transitive dependencies. A non-key column must depend only on the primary key.
  • BCNF (Boyce-Codd Normal Form) — a stricter version of 3NF. Every determinant (column that determines another column) must be a candidate key. Handles edge cases where 3NF still allows redundancy.

Trade-off: higher normalization reduces redundancy but increases the number of JOINs needed for queries, which can hurt read performance. Many production systems intentionally denormalize for performance.

-- ❌ NOT 1NF: repeating groups / non-atomic values
-- | student_id | name  | courses           |
-- | 1          | Alice | Math, Physics     |  ← not atomic!
-- | 2          | Bob   | Chemistry         |

-- ✅ 1NF: atomic values, separate rows
-- | student_id | name  | course    |
-- | 1          | Alice | Math      |
-- | 1          | Alice | Physics   |
-- | 2          | Bob   | Chemistry |

-- ❌ NOT 2NF: partial dependency (composite key)
-- PK: (student_id, course_id)
-- | student_id | course_id | student_name | grade |
-- student_name depends on student_id alone (partial dependency)

-- ✅ 2NF: split into two tables
-- students: student_id (PK), student_name
-- enrollments: student_id, course_id (composite PK), grade

-- ❌ NOT 3NF: transitive dependency
-- | employee_id | dept_id | dept_name     | salary |
-- dept_name depends on dept_id (not on employee_id directly)

-- ✅ 3NF: remove transitive dependency
-- employees: employee_id (PK), dept_id (FK), salary
-- departments: dept_id (PK), dept_name

-- Real schema showing 3NF:
CREATE TABLE departments (
    id        INT PRIMARY KEY,
    dept_name VARCHAR(100) NOT NULL
);
CREATE TABLE employees (
    id      INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    dept_id INT REFERENCES departments(id),
    salary  DECIMAL(10,2)
);

A retail chain stored product data in a single denormalized table with supplier_name, supplier_phone, and supplier_address repeated on every product row. When a supplier changed their phone number, they had to update 15,000 rows (one per product). Some rows were missed, causing order forms to show different phone numbers for the same supplier. Normalizing into separate products and suppliers tables meant the phone number was stored once — one UPDATE fixed everything.

1NF = atomic values. 2NF = no partial dependencies. 3NF = no transitive dependencies. BCNF = every determinant is a candidate key. Normalize to reduce redundancy, denormalize selectively for read performance.
⚠️ Common Mistake

Storing derived or repeated data in the same table, creating update anomalies when the source data changes.

❌ Not normalized — redundant data
CREATE TABLE orders (
    id            INT PRIMARY KEY,
    customer_id   INT,
    customer_name VARCHAR(100),  -- duplicated from customers table!
    customer_email VARCHAR(255), -- also duplicated!
    amount        DECIMAL(10,2)
);
-- Customer changes name → must update ALL their order rows
✅ Normalized — single source of truth
CREATE TABLE customers (
    id    INT PRIMARY KEY,
    name  VARCHAR(100),
    email VARCHAR(255)
);
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    amount      DECIMAL(10,2)
);
-- Customer name stored once, JOIN when needed
🔁 Follow-Up Question

When is denormalization justified? Give examples of intentional denormalization for performance.

24 How do you read and interpret a query execution plan (EXPLAIN / EXPLAIN ANALYZE)? advanced

An execution plan (or query plan) shows how the database engine will execute a query — which indexes it uses, how it joins tables, and where it spends time. It is the most important tool for query optimization.

  • EXPLAIN — shows the planned execution strategy without running the query. Shows estimated costs and row counts.
  • EXPLAIN ANALYZE (PostgreSQL) / EXPLAIN FORMAT=ANALYZE — actually runs the query and shows real execution times alongside estimates. More accurate but actually executes the query (be careful with UPDATE/DELETE).

Key things to look for in a plan:

  • Seq Scan / Full Table Scan — reading every row in the table. Bad for large tables, usually means a missing index.
  • Index Scan / Index Seek — using an index to find rows efficiently. Good.
  • Nested Loop / Hash Join / Merge Join — the method used to join tables. Each is optimal for different data sizes.
  • Sort — an explicit sort step (for ORDER BY or DISTINCT). Can be expensive for large result sets.
  • Estimated vs actual rows — large discrepancies indicate stale statistics (run ANALYZE).
-- PostgreSQL: EXPLAIN shows the plan
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Seq Scan on employees  (cost=0.00..35.50 rows=10 width=64)
--   Filter: (department = 'Engineering')
-- → Full table scan! Need an index.

-- Add index and re-check
CREATE INDEX idx_dept ON employees(department);
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Index Scan using idx_dept on employees  (cost=0.28..8.29 rows=10 width=64)
--   Index Cond: (department = 'Engineering')
-- → Index scan! Much better.

-- EXPLAIN ANALYZE: shows actual times
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.dept_name = 'Engineering';
-- Hash Join  (cost=... rows=10) (actual time=0.5..1.2 rows=10 loops=1)
--   → Seq Scan on departments (actual time=0.01..0.05 rows=1)
--   → Seq Scan on employees   (actual time=0.1..0.8 rows=100)
-- Planning time: 0.2ms  Execution time: 1.5ms

-- MySQL: EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- +----+------+---------------+---------+------+------+----------+
-- | id | type | possible_keys | key     | rows | Extra          |
-- +----+------+---------------+---------+------+------+----------+
-- | 1  | ref  | idx_customer  | idx_cust| 15   | Using index    |
-- type=ref (index lookup), rows=15 (estimated), Using index=good

-- MySQL: EXPLAIN FORMAT=JSON for detailed analysis
EXPLAIN FORMAT=JSON SELECT ...;

A fintech company's daily reconciliation report took 45 minutes. EXPLAIN ANALYZE revealed that a JOIN between transactions (50M rows) and accounts (200K rows) was using a Nested Loop with a Seq Scan on transactions — 200K × 50M row scans. The estimated rows said 100, but actual rows were 50M (stale statistics). Running ANALYZE transactions updated the stats, and the optimizer switched to a Hash Join — query time dropped to 3 minutes. Adding a composite index further reduced it to 45 seconds.

EXPLAIN shows HOW the database will execute a query. Look for Seq Scans (need index), row estimate mismatches (run ANALYZE), and expensive sorts. EXPLAIN ANALYZE shows actual times but executes the query.
⚠️ Common Mistake

Optimizing queries without looking at the execution plan — guessing which part is slow instead of measuring.

❌ Guessing — adding random indexes
-- "This query is slow, let me add indexes on everything"
CREATE INDEX idx_1 ON orders(customer_id);
CREATE INDEX idx_2 ON orders(status);
CREATE INDEX idx_3 ON orders(amount);
CREATE INDEX idx_4 ON orders(created_at);
-- Still slow! Never checked what the query actually does
✅ Measure first — use EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT ... FROM orders WHERE ...;
-- Shows: Seq Scan on orders → 12 seconds
-- The bottleneck is a missing index on (customer_id, created_at)
CREATE INDEX idx_cust_date ON orders(customer_id, created_at);
-- Re-run: Index Scan → 50ms
🔁 Follow-Up Question

What are the different join algorithms (Nested Loop, Hash Join, Merge Join) and when does the optimizer choose each?

25 What are pessimistic vs optimistic locking strategies? advanced

Locking strategies control how concurrent transactions handle conflicts when they try to modify the same data:

  • Pessimistic locking — assumes conflicts will happen. Locks the row/resource before reading it, preventing anyone else from modifying it until the lock is released. Implemented with SELECT ... FOR UPDATE. Guarantees no conflicts but can cause blocking and deadlocks.
  • Optimistic locking — assumes conflicts are rare. Does NOT lock the row when reading. Instead, checks at update time whether the data has changed since it was read (using a version number or timestamp). If it has changed, the update fails and the application retries. Implemented at the application level (not a database feature). Better throughput but requires retry logic.

When to use each:

  • Pessimistic: high contention (many users editing the same rows), short transactions, critical financial operations.
  • Optimistic: low contention (users rarely edit the same row), long read phases (like a user editing a form), high-throughput systems.
-- PESSIMISTIC LOCKING: SELECT ... FOR UPDATE
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Row is LOCKED — other transactions trying to update this row will WAIT
-- ... do calculations ...
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;  -- Lock released

-- FOR UPDATE with NOWAIT (fail immediately instead of waiting)
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE NOWAIT;
-- If row is already locked → ERROR instead of waiting

-- FOR UPDATE with SKIP LOCKED (skip locked rows)
SELECT * FROM job_queue WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Worker picks up the next available (unlocked) job

-- OPTIMISTIC LOCKING: version column
-- Step 1: Read with version
SELECT id, name, salary, version FROM employees WHERE id = 1;
-- Returns: id=1, name=Alice, salary=90000, version=3

-- Step 2: Update only if version hasn't changed
UPDATE employees
SET salary = 95000, version = version + 1
WHERE id = 1 AND version = 3;
-- If another transaction changed the row, version != 3
-- → UPDATE affects 0 rows → application detects conflict and retries

-- Check if update succeeded
-- IF affected_rows = 0 THEN
--     "Conflict detected, please retry"
-- END IF

An airline booking system uses pessimistic locking for seat selection: when a user selects seat 14A, SELECT ... FOR UPDATE locks that seat row. If two users select the same seat simultaneously, the second one waits. A wiki platform uses optimistic locking: when you edit an article, the page sends a version number with your changes. If someone else saved a change while you were editing, the version mismatch triggers a "conflict" page showing both edits. Wikipedia uses this exact pattern for its edit conflict resolution.

Pessimistic = lock first, modify later (safe but can block). Optimistic = read freely, check version at update time (more throughput but needs retry logic). Choose based on conflict frequency.
⚠️ Common Mistake

Using pessimistic locking with long transactions — holding locks while waiting for user input causes other users to be blocked for minutes.

❌ Bad — lock held during user interaction
BEGIN;
SELECT * FROM products WHERE id = 42 FOR UPDATE;
-- Lock acquired... now show edit form to user
-- User takes 5 minutes to fill out the form
-- ALL other users trying to edit product 42 are BLOCKED
UPDATE products SET price = 29.99 WHERE id = 42;
COMMIT;  -- Lock held for 5 minutes!
✅ Better — optimistic locking for user interactions
-- Read without locking
SELECT id, price, version FROM products WHERE id = 42;
-- Show form to user (no locks held)
-- User takes 5 minutes, submits form
UPDATE products SET price = 29.99, version = version + 1
WHERE id = 42 AND version = 5;  -- check version
-- If affected_rows = 0 → someone else changed it → show conflict
🔁 Follow-Up Question

What is a deadlock and how do databases detect and resolve them?

26 What are recursive CTEs and how do you query hierarchical data? advanced

A recursive CTE is a CTE that references itself, allowing you to traverse hierarchical or graph-like data structures. It has two parts:

  • Anchor member — the base case that starts the recursion (non-recursive SELECT).
  • Recursive member — references the CTE itself and JOINs with the source table to walk the hierarchy level by level.

The database executes the anchor first, then repeatedly executes the recursive member using the previous iteration's results until no new rows are produced.

Common use cases:

  • Org charts (employee → manager → CEO)
  • Category trees (sub-category → category → root)
  • Bill of materials (part → sub-assembly → assembly → product)
  • File system paths (folder → parent folder → root)
  • Graph traversal (social networks, routing)

Important: always include a termination condition or MAXRECURSION limit to prevent infinite loops on cyclic data.

-- Org chart: find all employees under a manager
-- employees: id, name, manager_id (self-referencing FK)
WITH RECURSIVE org_tree AS (
    -- Anchor: start from the CEO (no manager)
    SELECT id, name, manager_id, 0 AS depth,
           CAST(name AS VARCHAR(500)) AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: find direct reports of each person
    SELECT e.id, e.name, e.manager_id, t.depth + 1,
           CAST(t.path || ' → ' || e.name AS VARCHAR(500))
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.id
)
SELECT depth, path FROM org_tree ORDER BY path;
-- Output:
-- 0  CEO
-- 1  CEO → VP Engineering
-- 2  CEO → VP Engineering → Tech Lead
-- 3  CEO → VP Engineering → Tech Lead → Developer

-- Category tree: get full breadcrumb path
WITH RECURSIVE cat_tree AS (
    SELECT id, name, parent_id, name AS full_path
    FROM categories WHERE parent_id IS NULL

    UNION ALL

    SELECT c.id, c.name, c.parent_id,
           ct.full_path || ' > ' || c.name
    FROM categories c
    JOIN cat_tree ct ON c.parent_id = ct.id
)
SELECT * FROM cat_tree;
-- Electronics > Phones > Smartphones

-- Number sequence (simple recursion demo)
WITH RECURSIVE nums AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 10
)
SELECT n FROM nums;  -- 1, 2, 3, ..., 10

An e-commerce platform needed to display full category breadcrumbs (Home > Electronics > Phones > iPhones) for 50,000 products. Initially, the application made multiple database queries per product (one per level), hitting the database 200,000 times for a catalog page. A single recursive CTE replaced all those queries, generating all breadcrumbs in one query under 200ms. The CTE output was cached in a materialized view that refreshed every hour.

Recursive CTEs have an anchor (base case) + recursive member (self-join). Use for hierarchies: org charts, categories, bill of materials. Always add a depth limit to prevent infinite loops.
⚠️ Common Mistake

Not adding a recursion limit, causing infinite loops when data has cycles (e.g., employee A reports to B, B reports to A).

❌ Dangerous — no recursion limit on potentially cyclic data
WITH RECURSIVE tree AS (
    SELECT id, parent_id FROM nodes WHERE id = 1
    UNION ALL
    SELECT n.id, n.parent_id FROM nodes n JOIN tree t ON n.parent_id = t.id
)
SELECT * FROM tree;
-- If node 5 → node 3 → node 5 (cycle) → infinite loop!
✅ Safe — add depth limit
WITH RECURSIVE tree AS (
    SELECT id, parent_id, 0 AS depth FROM nodes WHERE id = 1
    UNION ALL
    SELECT n.id, n.parent_id, t.depth + 1
    FROM nodes n JOIN tree t ON n.parent_id = t.id
    WHERE t.depth < 20  -- Safety limit
)
SELECT * FROM tree;
🔁 Follow-Up Question

How do you detect cycles in recursive CTEs? What is the CYCLE clause in SQL:2003?

27 Explain aggregate vs analytic window functions — SUM, AVG, LAG, LEAD, PARTITION BY advanced

Aggregate functions with GROUP BY collapse multiple rows into one summary row per group. Window (analytic) functions compute values across a "window" of rows related to the current row — but keep all individual rows in the output.

Key window functions beyond ranking:

  • SUM/AVG/COUNT OVER() — running totals, moving averages, cumulative counts without losing row detail.
  • LAG(column, n) — access the value from n rows before the current row. Useful for comparing to the previous period.
  • LEAD(column, n) — access the value from n rows after the current row.
  • FIRST_VALUE / LAST_VALUE — get the first or last value in the window frame.

The PARTITION BY clause divides rows into independent groups (like GROUP BY but without collapsing). ORDER BY within OVER determines the row ordering. The frame clause (ROWS BETWEEN) controls exactly which rows are included in the calculation.

-- Running total (cumulative sum)
SELECT
    order_date, amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;
-- Jan 1: $100, running=$100
-- Jan 2: $200, running=$300
-- Jan 3: $150, running=$450

-- Running total per customer (PARTITION BY)
SELECT
    customer_id, order_date, amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS cust_running
FROM orders;

-- Moving average (last 3 months)
SELECT
    month, revenue,
    AVG(revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3m
FROM monthly_revenue;

-- LAG: compare to previous month
SELECT
    month, revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS change,
    ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month))
        / LAG(revenue, 1) OVER (ORDER BY month) * 100, 1) AS pct_change
FROM monthly_revenue;

-- LEAD: look ahead to next value
SELECT
    name, salary,
    LEAD(salary) OVER (ORDER BY salary DESC) AS next_lower_salary,
    salary - LEAD(salary) OVER (ORDER BY salary DESC) AS gap
FROM employees;

-- FIRST_VALUE: highest salary in each department
SELECT name, department, salary,
    FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) AS top_earner
FROM employees;

A financial analytics team needed month-over-month growth rates for 500 product categories. The initial approach used self-joins (JOIN revenue ON month = prev_month), creating a 500-line query that was slow and error-prone. Replacing it with LAG(revenue) OVER (PARTITION BY category ORDER BY month) reduced the query to 10 lines and ran 8x faster. The same pattern was used for running totals (SUM OVER), moving averages (AVG OVER with ROWS BETWEEN), and YTD calculations — all without self-joins.

Window functions (OVER) keep all rows while computing across related rows. SUM OVER = running totals. LAG/LEAD = previous/next row values. PARTITION BY = group without collapsing. Frame clause controls the window size.
⚠️ Common Mistake

Using self-joins to compare rows with previous/next rows when LAG/LEAD would be cleaner and faster.

❌ Complex — self-join for previous month comparison
SELECT a.month, a.revenue, b.revenue AS prev_revenue,
       a.revenue - b.revenue AS change
FROM monthly_revenue a
LEFT JOIN monthly_revenue b
ON b.month = a.month - INTERVAL 1 MONTH;
-- Breaks if months are missing, complex for multiple periods
✅ Clean — LAG window function
SELECT month, revenue,
       LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
       revenue - LAG(revenue) OVER (ORDER BY month) AS change
FROM monthly_revenue;
-- Handles gaps naturally, easy to extend to 2+ periods
🔁 Follow-Up Question

What are frame clauses (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and how do they affect window functions?

28 What are pivot and unpivot operations and how do you implement them? advanced

Pivot transforms rows into columns — turning unique values from one column into multiple columns in the output. It converts "long" (normalized) data into "wide" (denormalized) format, commonly used for cross-tabulation reports.

Unpivot is the reverse — transforms columns into rows, converting "wide" data back to "long" format.

Implementation varies by database:

  • SQL Server — has native PIVOT and UNPIVOT operators.
  • PostgreSQL — use crosstab() from the tablefunc extension, or manual CASE/FILTER approach.
  • MySQL — no native PIVOT. Use conditional aggregation with CASE WHEN + SUM/MAX.

The most portable approach (works everywhere) is conditional aggregation: SUM(CASE WHEN category = 'X' THEN value END).

-- Source data (long format):
-- | year | quarter | revenue |
-- | 2024 | Q1      | 100000  |
-- | 2024 | Q2      | 120000  |
-- | 2024 | Q3      | 110000  |
-- | 2024 | Q4      | 150000  |

-- PIVOT using conditional aggregation (works everywhere)
SELECT
    year,
    SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS Q1,
    SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS Q2,
    SUM(CASE WHEN quarter = 'Q3' THEN revenue END) AS Q3,
    SUM(CASE WHEN quarter = 'Q4' THEN revenue END) AS Q4,
    SUM(revenue) AS total
FROM quarterly_revenue
GROUP BY year;
-- Output (wide format):
-- | year | Q1     | Q2     | Q3     | Q4     | total  |
-- | 2024 | 100000 | 120000 | 110000 | 150000 | 480000 |

-- SQL Server native PIVOT
SELECT year, [Q1], [Q2], [Q3], [Q4]
FROM quarterly_revenue
PIVOT (SUM(revenue) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS pvt;

-- UNPIVOT: columns back to rows
-- Source: | product | jan_sales | feb_sales | mar_sales |
-- Target: | product | month     | sales     |

-- Using UNION ALL (portable)
SELECT product, 'January' AS month, jan_sales AS sales FROM products
UNION ALL
SELECT product, 'February', feb_sales FROM products
UNION ALL
SELECT product, 'March', mar_sales FROM products;

-- PostgreSQL: LATERAL + VALUES for unpivot
SELECT p.product, m.month, m.sales
FROM products p,
LATERAL (VALUES ('Jan', jan_sales), ('Feb', feb_sales), ('Mar', mar_sales))
    AS m(month, sales);

A business intelligence team needed to generate a monthly sales matrix showing products as rows and months as columns. The raw data had 2 million rows in long format (product_id, month, amount). Pivoting with conditional aggregation (CASE WHEN month = ... THEN amount END) produced a clean spreadsheet-ready output. The team created this as a materialized view that refreshed nightly, and their Excel-based reporting tool consumed it via ODBC — eliminating 4 hours of manual spreadsheet formatting per week.

Pivot = rows to columns (for reports). Unpivot = columns to rows (for analysis). Use CASE WHEN + aggregate for portable pivot. SQL Server has native PIVOT/UNPIVOT syntax.
⚠️ Common Mistake

Hardcoding column values in PIVOT queries that need to handle dynamic data — adding a new quarter or category requires changing the SQL.

❌ Brittle — hardcoded pivot columns
SELECT year,
    SUM(CASE WHEN quarter='Q1' THEN revenue END) AS Q1,
    SUM(CASE WHEN quarter='Q2' THEN revenue END) AS Q2,
    SUM(CASE WHEN quarter='Q3' THEN revenue END) AS Q3,
    SUM(CASE WHEN quarter='Q4' THEN revenue END) AS Q4
FROM revenue GROUP BY year;
-- What if a new quarter-type "Q5-Bonus" is added?
✅ Flexible — dynamic pivot with prepared statement
-- Build the CASE columns dynamically in application code
-- or use dynamic SQL in a stored procedure:
SET @sql = NULL;
SELECT GROUP_CONCAT(DISTINCT
    CONCAT('SUM(CASE WHEN quarter=\'', quarter,
           '\' THEN revenue END) AS `', quarter, '`'))
INTO @sql FROM revenue;
SET @sql = CONCAT('SELECT year, ', @sql, ' FROM revenue GROUP BY year');
PREPARE stmt FROM @sql; EXECUTE stmt;
🔁 Follow-Up Question

How do you handle dynamic pivot when the column values are not known at query-writing time?

29 Database partitioning strategies — range, list, hash — when and why? experienced

Partitioning splits a large table into smaller, physically separate pieces (partitions) while maintaining a single logical table. Queries that target specific partitions only scan those partitions (partition pruning), dramatically improving performance.

Three main strategies:

  • Range partitioning — partition by value ranges. Most common for time-series data: one partition per month or year. Queries with date range filters benefit from partition pruning.
  • List partitioning — partition by discrete values. Useful for geographic regions, status codes, or product categories.
  • Hash partitioning — partition by hash of a column value. Distributes data evenly across N partitions. Good when there's no natural range or list but you want to spread I/O across disks.

Benefits: faster queries via partition pruning, easier maintenance (drop old partitions instead of DELETE), parallel query execution across partitions, independent backup/restore per partition.

When to partition: tables with hundreds of millions of rows, clear partition key (date, region), queries that almost always include the partition key in WHERE clause.

-- RANGE partitioning by date (PostgreSQL)
CREATE TABLE orders (
    id          SERIAL,
    customer_id INT NOT NULL,
    amount      DECIMAL(10,2),
    order_date  DATE NOT NULL
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
CREATE TABLE orders_2024_q3 PARTITION OF orders
    FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');

-- Query with partition pruning (only scans Q1 partition)
SELECT * FROM orders WHERE order_date BETWEEN '2024-02-01' AND '2024-03-31';

-- LIST partitioning by region
CREATE TABLE customers (
    id     SERIAL,
    name   VARCHAR(100),
    region VARCHAR(20)
) PARTITION BY LIST (region);

CREATE TABLE customers_americas PARTITION OF customers FOR VALUES IN ('US', 'CA', 'MX', 'BR');
CREATE TABLE customers_europe PARTITION OF customers FOR VALUES IN ('UK', 'DE', 'FR', 'ES');
CREATE TABLE customers_asia PARTITION OF customers FOR VALUES IN ('IN', 'JP', 'CN', 'SG');

-- HASH partitioning (even distribution)
CREATE TABLE logs (
    id      SERIAL,
    user_id INT,
    action  VARCHAR(50)
) PARTITION BY HASH (user_id);

CREATE TABLE logs_p0 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE logs_p1 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE logs_p2 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE logs_p3 PARTITION OF logs FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Maintenance: drop old data by detaching/dropping partition
ALTER TABLE orders DETACH PARTITION orders_2024_q1;
DROP TABLE orders_2024_q1;  -- Instant, vs DELETE millions of rows

A ride-sharing company's trips table had 2 billion rows. Queries for monthly reports scanned the entire table, taking 30+ minutes. After range-partitioning by month (24 partitions for 2 years), monthly report queries only scanned one partition (~80M rows) — execution time dropped to 90 seconds. The biggest win was data retention: dropping a 2-year-old partition (DROP TABLE trips_2022_01) took milliseconds vs. DELETE which would have taken hours and generated massive WAL.

Range = time-series data. List = discrete categories. Hash = even distribution. Partition when tables are 100M+ rows and queries consistently filter on the partition key. Biggest win: instant DROP vs slow DELETE for old data.
⚠️ Common Mistake

Partitioning on a column that is rarely used in WHERE clauses — the optimizer cannot prune partitions, so it scans all of them (worse than no partitioning due to overhead).

❌ Bad — partition key not in queries
-- Partitioned by region, but all queries filter by date
CREATE TABLE orders (...) PARTITION BY LIST (region);
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Must scan ALL region partitions → slower than unpartitioned!
✅ Good — partition key matches query patterns
-- Partitioned by date, queries always filter by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Only scans the June partition → fast
🔁 Follow-Up Question

What is the difference between partitioning and sharding? When would you choose each?

30 Horizontal vs vertical sharding — trade-offs and implementation experienced

Sharding distributes data across multiple database servers (nodes) to handle more data and traffic than a single server can manage. Unlike partitioning (which splits data within one server), sharding splits data across different servers.

  • Horizontal sharding — same table schema on each shard, but different rows. A shard key determines which shard holds each row. Example: users 1-1M on shard 1, users 1M-2M on shard 2. Most common approach.
  • Vertical sharding — different tables or columns on different servers. Example: user profiles on server A, user activity logs on server B. Simpler but limited — eventually each table outgrows its server.

Horizontal sharding challenges:

  • Shard key selection — choose poorly and you get hotspots (one shard handling most traffic).
  • Cross-shard queries — JOINs across shards are expensive or impossible.
  • Distributed transactions — ACID across shards requires 2-phase commit (slow) or eventual consistency.
  • Rebalancing — adding a new shard requires redistributing data.
  • Application complexity — routing logic must know which shard to query.
-- HORIZONTAL SHARDING example concept:
-- Shard key: user_id % num_shards

-- Shard 0 (users where user_id % 3 = 0): user 3, 6, 9, ...
-- Shard 1 (users where user_id % 3 = 1): user 1, 4, 7, ...
-- Shard 2 (users where user_id % 3 = 2): user 2, 5, 8, ...

-- Application routing (pseudocode):
-- shard_id = user_id % NUM_SHARDS
-- connection = shard_connections[shard_id]
-- connection.query("SELECT * FROM users WHERE id = ?", user_id)

-- Each shard has the SAME schema:
CREATE TABLE users (
    id    INT PRIMARY KEY,
    name  VARCHAR(100),
    email VARCHAR(255)
);

-- VERTICAL SHARDING example:
-- Server A: users, profiles (frequently read)
-- Server B: orders, payments (write-heavy)
-- Server C: analytics, logs (append-only)

-- Cross-shard query problem:
-- "Find all orders for users in California"
-- User data on Shard A, order data distributed across all shards
-- → Application must query users shard, get IDs, then fan out to order shards
-- → No simple SQL JOIN possible

-- Consistent hashing: handles adding/removing shards
-- Hash ring: hash(user_id) maps to a point on a ring
-- Each shard owns a range of the ring
-- Adding a shard only moves ~1/N of the data

Instagram sharded their PostgreSQL database by user_id when they hit 25 million users on a single server. They used a shard-aware Django ORM layer that routed queries to the correct shard based on user_id % num_shards. Cross-shard queries (like "find all comments on a photo") required fan-out to multiple shards and merging results in the application. They managed 12 shards initially and grew to 64. The biggest lesson: choosing user_id as the shard key meant user-centric queries (profile, feed) were fast, but global queries (trending posts, search) required a separate denormalized search index.

Horizontal sharding = same schema, different rows across servers (scale reads+writes). Vertical sharding = different tables on different servers (simpler but limited). Cross-shard JOINs are the hardest problem — choose your shard key to minimize them.
⚠️ Common Mistake

Choosing a shard key that creates hotspots — where one shard handles disproportionately more traffic than others.

❌ Bad shard key — temporal hotspot
-- Shard by created_at date range:
-- Shard 1: Jan-Jun 2024, Shard 2: Jul-Dec 2024
-- ALL new writes go to Shard 2 (current month)
-- Shard 2 is overloaded, Shard 1 is idle
-- Problem gets worse over time
✅ Good shard key — even distribution
-- Shard by hash of user_id:
-- hash(user_id) % num_shards → even distribution
-- Both reads and writes spread across all shards
-- New users naturally distribute to all shards
🔁 Follow-Up Question

What is consistent hashing and why is it used in sharding? How does it handle adding or removing shards?

31 Database replication — master-slave, multi-master, and eventual consistency experienced

Replication copies data from one database server to one or more replicas. It provides high availability (if the primary fails, a replica can take over), read scaling (distribute read queries across replicas), and geographic distribution (replicas closer to users).

  • Master-slave (primary-replica) — one server handles all writes. Changes are replicated to one or more read-only replicas. Reads can go to any replica. Simplest model. If the master fails, one replica is promoted.
  • Multi-master — multiple servers accept writes. Each propagates changes to the others. More complex — requires conflict resolution when two masters modify the same row simultaneously.

Replication types:

  • Synchronous — write is not confirmed until all replicas have it. Strong consistency but higher latency.
  • Asynchronous — write is confirmed immediately. Replicas catch up later. Faster but can have replication lag — a read from a replica may return stale data.

Eventual consistency means that if no new writes are made, all replicas will eventually converge to the same state. The "window of inconsistency" is the replication lag (typically milliseconds to seconds).

-- MySQL: check replication status on replica
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 0     ← replica is caught up
-- Seconds_Behind_Source: 15    ← 15 seconds of lag

-- Application-level read/write splitting (pseudocode):
-- writes → primary connection
-- reads  → replica connection (may have slight lag)

-- Read-after-write consistency pattern:
-- 1. User creates a post → write to PRIMARY
-- 2. Redirect to view the post → read from PRIMARY (not replica!)
-- 3. All other reads → read from REPLICA

-- PostgreSQL streaming replication setup (primary):
-- postgresql.conf:
-- wal_level = replica
-- max_wal_senders = 3

-- MySQL Group Replication (multi-master):
-- All members can accept writes
-- Built-in conflict detection (certification-based)
-- Conflicts resolved automatically (first writer wins)

-- Conflict example in multi-master:
-- Node A: UPDATE users SET name='Alice' WHERE id=1; (timestamp T1)
-- Node B: UPDATE users SET name='Alicia' WHERE id=1; (timestamp T2)
-- Conflict resolution: T1 < T2 → Node B's change wins
-- Both nodes converge to name='Alicia'

A social media platform with 50 million users used master-slave replication with 5 read replicas. Reads (90% of traffic) were distributed across replicas, while the primary handled writes. They encountered a common issue: a user creates a post, the page redirects to their profile, and the post doesn't appear (because the profile read hit a replica that hadn't received the write yet). The fix was "read-your-writes consistency": after any write, the user's next read is routed to the primary for 5 seconds, then back to replicas.

Master-slave = one writer, many readers (simple). Multi-master = multiple writers (complex, needs conflict resolution). Async replication = fast but has lag. Plan for read-after-write consistency.
⚠️ Common Mistake

Reading from a replica immediately after writing to the primary, expecting the latest data — replication lag causes stale reads.

❌ Stale read — replica hasn't received the write yet
-- Write to primary
INSERT INTO posts (user_id, content) VALUES (42, 'Hello World');

-- Immediately read from replica
SELECT * FROM posts WHERE user_id = 42 ORDER BY created_at DESC LIMIT 1;
-- May NOT return the post just created (replication lag)
-- User sees "post failed" even though it succeeded
✅ Read-your-writes — route to primary after write
-- Write to primary
INSERT INTO posts (user_id, content) VALUES (42, 'Hello World');

-- Read from PRIMARY for this user's next request
SELECT * FROM posts WHERE user_id = 42 ORDER BY created_at DESC LIMIT 1;
-- Always sees their own writes
-- After 5 seconds, switch back to replica for this user
🔁 Follow-Up Question

What is the CAP theorem and how does it relate to database replication choices?

32 ORM vs raw SQL — when to use each and the N+1 query problem experienced

An ORM (Object-Relational Mapper) maps database tables to programming language objects, letting you interact with the database using code instead of SQL strings. Popular ORMs: SQLAlchemy (Python), Hibernate (Java), Entity Framework (.NET), Sequelize (Node.js), ActiveRecord (Ruby).

ORM advantages: type safety, less boilerplate, prevents SQL injection (parameterized queries), database-agnostic code, migration management, easier for simple CRUD operations.

Raw SQL advantages: full control over query optimization, complex queries (CTEs, window functions, lateral joins), database-specific features, better performance for batch operations, necessary for query tuning.

The N+1 query problem is the most common ORM performance pitfall: the ORM executes 1 query to fetch N parent records, then N separate queries to fetch each parent's related records — instead of a single JOIN query.

-- N+1 PROBLEM example (Python/SQLAlchemy pseudocode):

# ❌ N+1: 1 query for authors + N queries for books
authors = session.query(Author).all()        # Query 1: SELECT * FROM authors
for author in authors:
    print(author.books)                       # Query 2..N+1: SELECT * FROM books WHERE author_id = ?
# If 100 authors → 101 queries!

# ✅ Fix: Eager loading (1 query with JOIN)
authors = session.query(Author).options(joinedload(Author.books)).all()
# Single query: SELECT * FROM authors JOIN books ON ...

# ✅ Fix: Subquery loading (2 queries total)
authors = session.query(Author).options(subqueryload(Author.books)).all()
# Query 1: SELECT * FROM authors
# Query 2: SELECT * FROM books WHERE author_id IN (1,2,3,...100)

-- Raw SQL equivalent (always efficient):
SELECT a.name, b.title
FROM authors a
LEFT JOIN books b ON a.id = b.author_id;
-- 1 query regardless of the number of authors

-- When to use raw SQL:
-- Complex analytics with window functions
SELECT name, salary,
    RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as dept_rank,
    salary - AVG(salary) OVER (PARTITION BY dept) as vs_avg
FROM employees;
-- ORMs cannot express this elegantly

-- When ORM is fine:
-- Simple CRUD operations
-- user = User(name="Alice", email="alice@example.com")
-- session.add(user)
-- session.commit()

A Django REST API serving a product catalog had a response time of 4 seconds for the products list endpoint. Django Debug Toolbar revealed 847 queries per request — a classic N+1 problem. Each product triggered separate queries for its category, brand, images, and reviews. Adding select_related (JOIN) for category and brand, and prefetch_related (2 queries with IN) for images and reviews reduced the query count from 847 to 5, and response time from 4 seconds to 120ms — a 33x improvement with 3 lines of code.

ORMs = great for simple CRUD, type safety, and SQL injection prevention. Raw SQL = necessary for complex analytics and performance tuning. N+1 is the #1 ORM pitfall — always use eager loading (joinedload/prefetch_related).
⚠️ Common Mistake

Accessing related objects in a loop without eager loading, causing hundreds of invisible queries (N+1 problem).

❌ N+1 — each iteration triggers a query
# Python/Django
orders = Order.objects.all()  # 1 query
for order in orders:
    print(order.customer.name)  # N queries (1 per order)
    print(order.items.count())  # N more queries
# 500 orders → 1001 queries → slow!
✅ Eager loading — 2-3 queries total
# Python/Django
orders = Order.objects.select_related('customer').prefetch_related('items').all()
# Query 1: SELECT orders.*, customers.* FROM orders JOIN customers
# Query 2: SELECT * FROM items WHERE order_id IN (1,2,...500)
for order in orders:
    print(order.customer.name)  # No query — already loaded
    print(order.items.count())  # No query — already loaded
🔁 Follow-Up Question

How do you detect N+1 queries in production? What monitoring tools help?

33 Schema migration strategies for zero-downtime deployments experienced

Schema migrations are changes to the database structure (adding/removing columns, changing types, creating tables). In production systems with zero-downtime requirements, these changes must be applied without locking tables or breaking running application code.

Key principles for zero-downtime migrations:

  • Backward compatibility — the old application code must work with the new schema, and the new code must work with the old schema (during the rolling deployment window).
  • Multi-phase deployment — separate the migration into deploy new code + apply schema change + clean up old schema.
  • Avoid exclusive locks — operations like ALTER TABLE ADD COLUMN in MySQL < 8.0 lock the entire table. Use online DDL tools (pt-online-schema-change, gh-ost) for large tables.

Common migration patterns:

  • Add column — safe. Add with a DEFAULT or nullable. Deploy code that writes to it. Backfill old rows. Then add NOT NULL constraint.
  • Remove column — first remove all code references, deploy, then drop the column.
  • Rename column — never rename directly. Add new column → copy data → deploy code using new column → drop old column.
-- SAFE: Add nullable column (no lock, no downtime)
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT NULL;
-- Existing rows get NULL, existing code keeps working

-- SAFE: Rename column in 4 phases
-- Phase 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(200);

-- Phase 2: Dual-write (application writes to both columns)
UPDATE users SET full_name = name WHERE full_name IS NULL;  -- backfill

-- Phase 3: Deploy code using full_name, stop writing to name

-- Phase 4: Drop old column
ALTER TABLE users DROP COLUMN name;

-- DANGEROUS: Direct rename (breaks all running code instantly)
-- ALTER TABLE users RENAME COLUMN name TO full_name;  -- ❌ Don't!

-- SAFE: Add NOT NULL constraint in stages
-- Step 1: Add column as nullable
ALTER TABLE orders ADD COLUMN status VARCHAR(20);

-- Step 2: Backfill existing rows
UPDATE orders SET status = 'completed' WHERE status IS NULL;

-- Step 3: Add NOT NULL constraint (after all rows have values)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- Large table migration with gh-ost (MySQL):
-- gh-ost creates a shadow table, copies data in background,
-- applies changes to shadow, then swaps atomically
-- No table locks during the copy process

GitHub built gh-ost (GitHub Online Schema Transmogrifier) because MySQL's ALTER TABLE would lock their tables with billions of rows for hours. gh-ost creates a shadow copy of the table, applies the schema change to the shadow, streams changes from the binary log to keep the shadow in sync, and atomically renames the tables when done. This allows schema changes on tables with 1+ billion rows with zero downtime and no read/write impact. They open-sourced it and it is now used by companies like Shopify and Booking.com.

Never break backward compatibility in migrations. Add columns first (nullable), deploy new code, backfill, then add constraints. Never rename directly — use add/copy/drop pattern. Use online DDL tools (gh-ost, pt-osc) for large tables.
⚠️ Common Mistake

Dropping a column before all application code stops referencing it — causes instant errors for any server still running old code during rolling deployment.

❌ Wrong order — drop column before removing code references
-- Step 1: Drop column from database
ALTER TABLE users DROP COLUMN legacy_status;

-- Step 2: Deploy new code that doesn't reference it
-- But during rolling deploy, old servers still running:
-- SELECT legacy_status FROM users;  ← ERROR! Column doesn't exist!
✅ Right order — remove code references first
-- Step 1: Deploy code that no longer reads/writes legacy_status
-- (All servers now ignore the column)

-- Step 2: After full deployment, safely drop column
ALTER TABLE users DROP COLUMN legacy_status;
-- No code references it → no errors
🔁 Follow-Up Question

What is the expand-and-contract migration pattern? How do you handle data backfilling for billions of rows?

34 How do you design a database schema for a multi-tenant SaaS application? experienced

Multi-tenant database design determines how you isolate data between customers (tenants) in a SaaS application. Three main approaches:

  • Shared database, shared schema (row-level isolation) — all tenants share the same tables. A tenant_id column on every table distinguishes data. Simplest to manage, lowest cost, but requires careful access control and has the risk of data leaks.
  • Shared database, separate schemas — each tenant gets their own schema (namespace) within one database. Better isolation than row-level. Moderate management complexity. Schema changes must be applied to every tenant schema.
  • Separate databases — each tenant gets their own database. Strongest isolation and easiest to customize per tenant. Highest operational cost — managing 1,000 databases is complex. Best for enterprise customers with strict compliance requirements.

Choosing the right approach:

  • 10–10,000 tenants with moderate data → shared schema + tenant_id
  • 100–500 enterprise tenants with compliance needs → separate schemas
  • 10–100 large enterprise tenants → separate databases
-- APPROACH 1: Shared schema with tenant_id (most common)
CREATE TABLE tenants (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    plan VARCHAR(20) DEFAULT 'free'
);

CREATE TABLE projects (
    id        INT PRIMARY KEY AUTO_INCREMENT,
    tenant_id INT NOT NULL,
    name      VARCHAR(200),
    FOREIGN KEY (tenant_id) REFERENCES tenants(id),
    INDEX idx_tenant (tenant_id)  -- Every query filters by tenant
);

CREATE TABLE tasks (
    id         INT PRIMARY KEY AUTO_INCREMENT,
    tenant_id  INT NOT NULL,
    project_id INT NOT NULL,
    title      VARCHAR(300),
    FOREIGN KEY (tenant_id) REFERENCES tenants(id),
    INDEX idx_tenant (tenant_id)
);

-- CRITICAL: Every query MUST include tenant_id
SELECT * FROM projects WHERE tenant_id = 42 AND status = 'active';

-- Row-Level Security (PostgreSQL) — enforce at DB level
CREATE POLICY tenant_isolation ON projects
    USING (tenant_id = current_setting('app.tenant_id')::INT);
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;

-- Set tenant context per request
SET app.tenant_id = '42';
SELECT * FROM projects;  -- Only sees tenant 42's data automatically

-- APPROACH 2: Separate schemas
CREATE SCHEMA tenant_42;
CREATE TABLE tenant_42.projects (...);
CREATE TABLE tenant_42.tasks (...);
-- SET search_path = 'tenant_42';

Salesforce uses a shared schema with row-level tenant isolation (Organization ID on every table) to serve 150,000+ customers from a shared infrastructure. They patented their multi-tenant architecture. Slack started with shared schema + tenant_id but migrated their largest enterprise customers to separate database clusters for compliance and performance isolation. The hybrid approach — shared schema for standard plans, dedicated databases for enterprise — is becoming the industry standard.

Shared schema + tenant_id = simplest, cheapest, most common. Use Row-Level Security (RLS) to enforce isolation at the DB level. Add tenant_id to every table and every query. Separate databases only for enterprise/compliance needs.
⚠️ Common Mistake

Forgetting to filter by tenant_id in even one query, which causes a data leak between tenants — the most critical SaaS security bug.

❌ Data leak — missing tenant_id filter
-- Forgot tenant_id in the WHERE clause
SELECT * FROM invoices WHERE status = 'overdue';
-- Returns overdue invoices from ALL tenants!
-- Tenant A sees Tenant B's financial data
✅ Safe — always filter + use RLS as safety net
-- Application always includes tenant_id
SELECT * FROM invoices WHERE tenant_id = 42 AND status = 'overdue';

-- Plus: PostgreSQL RLS as a safety net
CREATE POLICY tenant_policy ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::INT);
-- Even if application forgets, DB enforces isolation
🔁 Follow-Up Question

How do you handle schema migrations across 10,000 tenant schemas? What about tenant-specific customizations?

35 What are materialized views and how do they differ from regular views? experienced

A materialized view is like a regular view except it stores the query results on disk. Instead of re-executing the underlying query every time it is accessed, it returns the pre-computed results — making reads instantaneous.

Key differences from regular views:

  • Regular view — virtual table, no data stored, query runs every time. Always shows current data.
  • Materialized view — physical table, stores query results. Must be refreshed to get updated data. Reads are fast, but data can be stale between refreshes.

Refresh strategies:

  • REFRESH MATERIALIZED VIEW name — full refresh, re-executes the entire query. Blocks reads during refresh (in PostgreSQL, use CONCURRENTLY to avoid blocking).
  • Incremental refresh — only updates rows that changed since the last refresh (Oracle supports this natively with ON COMMIT refresh).
  • Scheduled refresh — refresh via cron job or database scheduler (every hour, every night, etc.).

Best for: expensive aggregate queries (dashboards, reports), queries that don't need real-time data, denormalized read models, pre-computed rankings or statistics.

-- Create a materialized view (PostgreSQL)
CREATE MATERIALIZED VIEW mv_monthly_revenue AS
SELECT
    DATE_TRUNC('month', order_date) AS month,
    product_category,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
JOIN products ON orders.product_id = products.id
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date), product_category;

-- Query the materialized view (instant — reads from stored result)
SELECT * FROM mv_monthly_revenue
WHERE month >= '2024-01-01'
ORDER BY total_revenue DESC;
-- No JOINs, no aggregation at query time — just a table scan

-- Refresh (full): re-run the entire underlying query
REFRESH MATERIALIZED VIEW mv_monthly_revenue;

-- Refresh concurrently (no read blocking, requires unique index)
CREATE UNIQUE INDEX idx_mv_month_cat ON mv_monthly_revenue(month, product_category);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_revenue;

-- Check when it was last refreshed
SELECT relname, last_refresh
FROM pg_stat_user_tables
WHERE relname = 'mv_monthly_revenue';

-- Schedule refresh every hour (using pg_cron extension)
SELECT cron.schedule('refresh-revenue', '0 * * * *',
    'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_revenue');

-- Drop a materialized view
DROP MATERIALIZED VIEW IF EXISTS mv_monthly_revenue;

An analytics dashboard at a logistics company queried 500 million shipment records with 6 JOINs and multiple aggregations. Each page load took 45 seconds. Creating a materialized view that pre-computed the dashboard metrics reduced page load to 200ms. They refreshed it every 15 minutes using pg_cron. The trade-off was that dashboard data could be up to 15 minutes old — which was perfectly acceptable for trend analysis. For the real-time tracking page, they kept the live query.

Materialized views = cached query results stored on disk. Reads are instant but data can be stale. Refresh manually, on schedule, or on commit. Use for dashboards, reports, and expensive aggregate queries that tolerate slight staleness.
⚠️ Common Mistake

Creating a materialized view and never refreshing it — the data becomes increasingly stale, and users don't realize they're looking at old data.

❌ Stale data — never refreshed
CREATE MATERIALIZED VIEW mv_stats AS SELECT ... ;
-- Created on Jan 1, never refreshed
-- On March 15, users still seeing January data
-- No automated refresh, no staleness indicator
✅ Automated refresh with monitoring
-- Create with concurrent refresh support
CREATE MATERIALIZED VIEW mv_stats AS SELECT ... ;
CREATE UNIQUE INDEX idx_mv ON mv_stats(...);

-- Schedule hourly refresh
SELECT cron.schedule('0 * * * *',
    'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_stats');

-- Show last refresh time in the dashboard UI
-- "Data as of: 2024-06-15 14:00 UTC (refreshes hourly)"
🔁 Follow-Up Question

When would you choose a materialized view over a cache (Redis)? How do you handle cache invalidation for materialized views?

36 Index optimization — covering indexes, composite indexes, and index selectivity performance

Beyond basic indexing, advanced index optimization techniques can eliminate disk I/O entirely:

  • Composite (multi-column) index — an index on two or more columns. Column order matters: the index can efficiently serve queries that filter on the leftmost prefix of columns. An index on (A, B, C) can serve queries filtering on A, or A+B, or A+B+C — but NOT B alone or C alone.
  • Covering index — an index that contains all the columns needed by a query. The database can answer the query entirely from the index without touching the table data — called an index-only scan. This is the fastest possible read.
  • Index selectivity — the ratio of distinct values to total rows. High selectivity (e.g., email — almost unique) = good index candidate. Low selectivity (e.g., gender — 2-3 values) = poor index candidate. A boolean column with 50/50 distribution has the worst selectivity.
  • Partial (filtered) index — an index on a subset of rows matching a WHERE condition. Smaller index size, faster lookups for specific queries.
-- COMPOSITE INDEX: column order matters!
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

-- ✅ Uses the index (filters on leftmost column first)
SELECT * FROM orders WHERE status = 'pending' AND order_date > '2024-01-01';

-- ✅ Uses the index (leftmost prefix)
SELECT * FROM orders WHERE status = 'pending';

-- ❌ Cannot efficiently use this index (skips leftmost column)
SELECT * FROM orders WHERE order_date > '2024-01-01';

-- COVERING INDEX: all needed columns IN the index
CREATE INDEX idx_orders_covering ON orders(status, order_date, customer_id, amount);

SELECT customer_id, amount FROM orders
WHERE status = 'shipped' AND order_date > '2024-01-01';
-- Index-only scan: no table access needed — all columns are in the index!
-- EXPLAIN shows "Using index" or "Index Only Scan"

-- PostgreSQL: INCLUDE clause for non-searchable covering columns
CREATE INDEX idx_orders_cover ON orders(status, order_date) INCLUDE (customer_id, amount);
-- status, order_date = searchable; customer_id, amount = just stored for covering

-- INDEX SELECTIVITY: check before creating
SELECT
    COUNT(DISTINCT status) AS distinct_vals,
    COUNT(*) AS total_rows,
    COUNT(DISTINCT status)::float / COUNT(*) AS selectivity
FROM orders;
-- status: 5 distinct / 1,000,000 rows = 0.000005 → very low selectivity
-- email:  999,500 distinct / 1,000,000 rows = 0.9995  → high selectivity ✅

-- PARTIAL INDEX: index only the rows that matter
CREATE INDEX idx_orders_pending ON orders(order_date)
    WHERE status = 'pending';
-- Only indexes pending orders (maybe 5% of the table)
-- Much smaller than indexing all orders
-- Perfect for: SELECT * FROM orders WHERE status='pending' AND order_date>...

-- Unused index detection (PostgreSQL)
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

A SaaS application had a slow dashboard query: SELECT customer_name, total, order_date FROM orders WHERE tenant_id = ? AND status = 'active' ORDER BY order_date DESC LIMIT 20. EXPLAIN showed a table scan despite having separate indexes on tenant_id and status. Creating a composite index (tenant_id, status, order_date DESC) with INCLUDE (customer_name, total) turned it into an index-only backward scan — query time dropped from 2.3 seconds to 3ms. The covering index meant zero table I/O.

Composite index: leftmost prefix rule — column order must match your queries. Covering index: include all SELECT/WHERE columns to achieve index-only scans. High selectivity columns make good index candidates. Partial indexes save space when you only query a subset of rows.
⚠️ Common Mistake

Creating composite indexes with columns in the wrong order, making the index useless for the actual query patterns.

❌ Wrong column order — index not used
-- Index: (order_date, status)
CREATE INDEX idx ON orders(order_date, status);

-- Query filters on status first (not leftmost prefix!)
SELECT * FROM orders WHERE status = 'pending';
-- Cannot use this index efficiently → full table scan

-- Even worse: random index on low-selectivity column
CREATE INDEX idx_gender ON users(gender);
-- 2 values for 10M rows → index returns 5M rows → slower than table scan
✅ Match index order to query patterns
-- Index matches query: status first, then date for ordering
CREATE INDEX idx ON orders(status, order_date DESC);

SELECT * FROM orders WHERE status = 'pending' ORDER BY order_date DESC LIMIT 20;
-- Uses index efficiently: seek to status='pending', scan in date order, stop at 20
🔁 Follow-Up Question

What are expression indexes and how are they useful? How do you index JSON columns?

37 Query profiling and optimization — EXPLAIN, slow query log, and index hints performance

Query profiling is the process of analyzing why a query is slow and finding the optimal execution strategy. Key tools:

  • EXPLAIN / EXPLAIN ANALYZE — shows the query execution plan: which indexes are used, join strategies, estimated vs actual row counts, and cost. EXPLAIN shows the plan without executing. EXPLAIN ANALYZE actually runs the query and shows real timing.
  • Slow query log — database log that captures queries exceeding a time threshold. MySQL: slow_query_log + long_query_time. PostgreSQL: log_min_duration_statement.
  • Index hints — force or suggest the optimizer to use a specific index. MySQL: USE INDEX, FORCE INDEX, IGNORE INDEX. PostgreSQL: no direct hint syntax — use SET enable_seqscan = off for testing.

What to look for in EXPLAIN output:

  • Seq Scan / Full Table Scan — no index used, scanning every row.
  • Nested Loop vs Hash Join vs Merge Join — join strategies with different performance characteristics.
  • Sort / filesort — data sorted on disk (expensive).
  • Estimated vs actual rows — large discrepancy means stale statistics, run ANALYZE.
-- PostgreSQL: EXPLAIN ANALYZE with buffer info
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending' AND o.order_date > '2024-01-01'
ORDER BY o.order_date DESC
LIMIT 20;

-- Sample output:
-- Limit (cost=0.87..52.34 rows=20 actual time=0.04..0.12 rows=20)
--   → Nested Loop (cost=0.87..15234.56 rows=5890 actual time=0.04..0.11)
--        → Index Scan Backward on idx_orders_status_date (actual rows=20)
--             Index Cond: (status = 'pending' AND order_date > '2024-01-01')
--        → Index Scan on customers_pkey (actual rows=1 per loop)
--             Index Cond: (id = o.customer_id)
-- Planning Time: 0.2ms   Execution Time: 0.15ms

-- MySQL: EXPLAIN showing problem
EXPLAIN SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- type: ALL (full table scan!)
-- Extra: Using where
-- Problem: YEAR() wraps the column → cannot use index

-- Fix: rewrite to be index-friendly
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- type: range   key: idx_order_date   rows: 50000 (vs 5M before)

-- MySQL: Slow query log configuration
-- SET GLOBAL slow_query_log = ON;
-- SET GLOBAL long_query_time = 1;  -- log queries > 1 second
-- SET GLOBAL log_queries_not_using_indexes = ON;

-- PostgreSQL: log slow queries
-- postgresql.conf:
-- log_min_duration_statement = 500   -- log queries > 500ms

-- Update table statistics (fix bad estimates)
ANALYZE orders;           -- PostgreSQL
ANALYZE TABLE orders;     -- MySQL

-- MySQL: Force index usage
SELECT * FROM orders FORCE INDEX (idx_status_date)
WHERE status = 'pending' AND order_date > '2024-01-01';

A query on a 200-million-row table took 45 seconds. EXPLAIN ANALYZE revealed a sequential scan — the index existed but the optimizer chose not to use it because table statistics were stale (estimated 100 rows, actual 2 million). Running ANALYZE updated the statistics, and the optimizer switched to an index scan, dropping the query to 200ms. The team added ANALYZE to their weekly maintenance cron. They also enabled pg_stat_statements to continuously track the top 20 slowest queries, setting up alerts when any query exceeded 5 seconds.

EXPLAIN ANALYZE is your best friend — it shows exactly what the database does. Look for: sequential scans (missing indexes), functions on indexed columns (kills index usage), large estimated-vs-actual row differences (stale stats). Enable slow query log in production.
⚠️ Common Mistake

Wrapping an indexed column in a function, preventing the database from using the index — called a non-sargable predicate.

❌ Non-sargable — function on indexed column breaks index usage
-- Index on order_date, but YEAR() wraps it
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Full table scan! Optimizer cannot use index on order_date
-- Same problem: WHERE LOWER(email) = 'alice@mail.com'
-- Same problem: WHERE amount + tax > 1000
✅ Sargable — direct comparison allows index use
-- Range comparison directly on the indexed column
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Index range scan → fast!

-- For LOWER(): create an expression index
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- Or: CREATE INDEX idx_email ON users((email::text) text_pattern_ops);
🔁 Follow-Up Question

What is pg_stat_statements and how do you use it to find the slowest queries in production?

38 When and why should you denormalize a database? performance

Denormalization is the deliberate introduction of redundancy into a database schema to improve read performance. It trades write complexity (keeping redundant data in sync) for read speed (fewer JOINs, pre-computed values).

When to denormalize:

  • Read-heavy workloads — when reads outnumber writes 100:1 or more (dashboards, reporting, product catalogs).
  • Expensive JOINs — when a query requires 5+ JOINs across large tables, and denormalizing eliminates most of them.
  • Pre-computed aggregates — storing counts, totals, or averages to avoid COUNT/SUM on millions of rows per request.
  • Caching data from external services — storing a user's display name alongside their foreign key to avoid JOIN or API call.

Common denormalization techniques:

  • Storing computed columns (full_name = first + last)
  • Duplicating frequently-accessed columns in related tables
  • Maintaining counter caches (comment_count on posts table)
  • Creating summary/rollup tables
  • Using materialized views (automated denormalization)

The cost: every write must update all denormalized copies. If they go out of sync, you have data inconsistency bugs.

-- NORMALIZED: correct but slow for dashboards
-- To get "product name, category name, and total orders":
SELECT p.name, c.name, COUNT(o.id)
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN order_items oi ON oi.product_id = p.id
JOIN orders o ON oi.order_id = o.id
WHERE o.status = 'completed'
GROUP BY p.name, c.name;
-- 4 JOINs + GROUP BY on 100M order_items → slow

-- DENORMALIZED: fast reads, redundant data
ALTER TABLE products ADD COLUMN category_name VARCHAR(100);
ALTER TABLE products ADD COLUMN total_orders INT DEFAULT 0;
ALTER TABLE products ADD COLUMN total_revenue DECIMAL(12,2) DEFAULT 0;

-- Dashboard query is now trivial:
SELECT name, category_name, total_orders, total_revenue
FROM products WHERE total_orders > 0 ORDER BY total_revenue DESC;
-- No JOINs, no aggregation → instant

-- Keep denormalized data in sync with triggers:
CREATE OR REPLACE FUNCTION update_product_stats()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE products SET
            total_orders = total_orders + 1,
            total_revenue = total_revenue + NEW.amount
        WHERE id = NEW.product_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_order_item_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_product_stats();

-- COUNTER CACHE example
-- Instead of: SELECT COUNT(*) FROM comments WHERE post_id = 42;
-- Store it:   posts.comment_count (updated via trigger or application)
ALTER TABLE posts ADD COLUMN comment_count INT DEFAULT 0;

-- Update on insert
UPDATE posts SET comment_count = comment_count + 1 WHERE id = NEW.post_id;

Stack Overflow denormalizes vote counts on their posts table — each post row stores its score (upvotes - downvotes) directly, rather than computing it from the votes table on every page load. With 50 million posts and 200 million votes, computing scores with COUNT/SUM per page load would be impractical. The denormalized score column is updated atomically on each vote. They accept the minimal risk of count drift (which they fix with periodic reconciliation jobs) in exchange for the massive read performance gain.

Denormalize when reads vastly outnumber writes. Store pre-computed aggregates (counts, sums). Use triggers or application logic to keep redundant data in sync. Run periodic reconciliation to detect drift. Start normalized, denormalize only when profiling proves it's needed.
⚠️ Common Mistake

Denormalizing prematurely without measuring — adding complexity for writes when the normalized query was fast enough.

❌ Premature denormalization — unnecessary complexity
-- "JOINs are slow, let's denormalize everything!"
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
ALTER TABLE orders ADD COLUMN customer_phone VARCHAR(20);
-- Now every customer name change must update ALL their orders
-- Customer changes name → UPDATE orders SET customer_name=? WHERE customer_id=?
-- 50,000 orders for one customer → expensive write, possible inconsistency
✅ Measure first, denormalize surgically
-- Step 1: Measure — is the JOIN actually slow?
EXPLAIN ANALYZE SELECT o.*, c.name FROM orders o JOIN customers c ON ...;
-- If it returns in 5ms → don't denormalize!

-- Step 2: If genuinely slow, denormalize only what's needed
ALTER TABLE orders ADD COLUMN customer_display_name VARCHAR(100);
-- Only the display name (not email, phone) — minimal redundancy
-- Accept that name changes won't backfill old orders (usually OK)
🔁 Follow-Up Question

What is CQRS (Command Query Responsibility Segregation) and how does it relate to denormalization?

39 Large-scale data pagination — OFFSET vs cursor-based (keyset) pagination performance

OFFSET pagination uses LIMIT N OFFSET M to skip rows. Simple but has a fatal flaw: the database must read and discard all skipped rows. Page 1000 with 20 rows per page = read 20,000 rows, discard 19,980, return 20. Gets slower with every page.

Cursor-based (keyset) pagination uses a WHERE clause on the last seen value instead of OFFSET. It says "give me 20 rows after the last one I saw" instead of "skip N rows". Performance is constant regardless of page depth — always reads only the rows it returns.

Comparison:

  • OFFSET: O(offset + limit) — linear degradation. Simple. Supports jumping to arbitrary page. Inconsistent if data changes between pages (duplicates/gaps).
  • Cursor: O(limit) — constant performance. Requires a unique, ordered column. No arbitrary page jumps (only next/previous). Consistent even if data changes.

For APIs with infinite scroll or "Load More" buttons, cursor-based is the standard. For traditional page-numbered UIs, OFFSET is simpler but add a reasonable maximum page limit.

-- OFFSET PAGINATION (simple but slow at scale)
-- Page 1:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 0;
-- Page 2:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 20;
-- Page 500:
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 9980;
-- Must scan 10,000 rows to return 20 → SLOW!

-- CURSOR-BASED PAGINATION (fast at any depth)
-- Page 1 (no cursor yet):
SELECT * FROM products ORDER BY id LIMIT 20;
-- Returns ids 1-20, last_id = 20

-- Page 2 (cursor = last seen id):
SELECT * FROM products WHERE id > 20 ORDER BY id LIMIT 20;
-- Seeks directly to id=21 via index, returns 20 rows → FAST!

-- Page 500 (cursor = last seen id from page 499):
SELECT * FROM products WHERE id > 9980 ORDER BY id LIMIT 20;
-- Same speed as page 1! Index seek to 9981 → 20 rows

-- Multi-column cursor (e.g., order by created_at, id for ties)
-- Last seen: created_at='2024-06-15', id=42
SELECT * FROM products
WHERE (created_at, id) > ('2024-06-15', 42)
ORDER BY created_at, id
LIMIT 20;

-- API response format for cursor pagination:
-- {
--   "data": [...20 products...],
--   "next_cursor": "eyJpZCI6NjAsImNyZWF0ZWRfYXQiOiIyMDI0LTA2LTE1In0=",
--   "has_more": true
-- }
-- The cursor is a Base64-encoded token containing the last row's sort values

-- OFFSET with total count (for page numbers UI):
SELECT COUNT(*) FROM products WHERE category = 'electronics';
-- Total: 5,000 → show pages 1-250
-- Cap at page 100 to prevent deep pagination abuse

Twitter's API switched from OFFSET to cursor-based pagination early on. With 500 million tweets per day, OFFSET pagination on a timeline query at page 10,000 would need to scan 200,000+ rows. Their cursor (since_id / max_id parameters) uses the tweet ID as a cursor — the query WHERE id < max_id ORDER BY id DESC LIMIT 20 always hits an index and returns in constant time. Slack, Stripe, GitHub, and most modern APIs use the same pattern, returning opaque cursor tokens instead of page numbers.

OFFSET = simple but O(offset), gets slower on deeper pages. Cursor/keyset = constant performance at any depth, uses WHERE on last seen value. Use cursor for APIs, infinite scroll, and any large dataset. Use OFFSET only for small datasets or when arbitrary page jump is required.
⚠️ Common Mistake

Using OFFSET for API endpoints that can be paginated deeply, causing database performance to degrade linearly.

❌ OFFSET at scale — gets slower every page
-- API: GET /products?page=5000&per_page=20
SELECT * FROM products ORDER BY created_at DESC
LIMIT 20 OFFSET 99980;
-- Database reads 100,000 rows, discards 99,980, returns 20
-- Takes 15 seconds on a table with 10M rows
-- Attackers can DoS your DB by requesting page=999999
✅ Cursor pagination — constant performance
-- API: GET /products?cursor=eyJpZCI6MTIzNDV9&per_page=20
-- Decode cursor → last_id = 12345
SELECT * FROM products WHERE id < 12345
ORDER BY id DESC LIMIT 20;
-- Index seek → 20 rows → constant time regardless of depth
-- No way to "skip ahead" → no deep pagination DoS
🔁 Follow-Up Question

How do you implement bidirectional cursor pagination (next + previous page)?

40 Connection pooling, query caching, and prepared statements for performance performance

Three infrastructure-level techniques that dramatically improve database performance without changing your queries:

  • Connection pooling — maintaining a pool of pre-established database connections that are reused across requests. Creating a new TCP connection + SSL handshake + authentication per request takes 50-200ms. A pool keeps connections warm and hands them out instantly. Tools: PgBouncer (PostgreSQL), ProxySQL (MySQL), HikariCP (Java), built-in pools in ORMs.
  • Prepared statements — pre-compile a query once, then execute it many times with different parameters. Benefits: (1) SQL injection prevention (parameters are never interpolated into SQL), (2) performance — the database parses and plans the query once, reuses the plan for subsequent executions.
  • Query caching — storing query results in memory to serve repeated identical queries without hitting the database. Levels: (1) application cache (Redis/Memcached), (2) ORM query cache, (3) database-level cache (MySQL Query Cache — deprecated in 8.0 because of contention).

These three techniques combined can reduce database load by 80-90% in typical web applications.

-- CONNECTION POOLING with PgBouncer (PostgreSQL)
-- pgbouncer.ini:
-- [databases]
-- mydb = host=localhost port=5432 dbname=mydb
-- [pgbouncer]
-- pool_mode = transaction       -- release connection after each transaction
-- max_client_conn = 1000        -- accept up to 1000 app connections
-- default_pool_size = 20        -- only 20 actual PostgreSQL connections
-- Result: 1000 app servers share 20 DB connections

-- Without pooling: 1000 app servers = 1000 DB connections = OOM
-- With pooling: 1000 app servers → PgBouncer → 20 DB connections = efficient

-- PREPARED STATEMENTS
-- PostgreSQL:
PREPARE get_user(INT) AS
    SELECT id, name, email FROM users WHERE id = $1;
EXECUTE get_user(42);
EXECUTE get_user(99);  -- Reuses the parsed/planned query

-- Application-level (Python/psycopg2):
-- cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
-- psycopg2 automatically uses server-side prepared statements

-- ❌ String interpolation (SQL injection + no plan reuse!)
-- cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")

-- QUERY CACHING with Redis (application level)
-- Pseudocode:
-- key = "products:category:electronics:page:1"
-- result = redis.get(key)
-- if result is None:
--     result = db.query("SELECT * FROM products WHERE category='electronics' LIMIT 20")
--     redis.setex(key, 300, serialize(result))  -- cache for 5 minutes
-- return result

-- Cache invalidation: delete cache when data changes
-- When a product is updated:
-- redis.delete("products:category:electronics:*")  -- invalidate related caches

-- Connection pool sizing formula:
-- pool_size = num_cores * 2 + effective_spindle_count
-- For SSD: pool_size = num_cores * 2 + 1
-- Example: 4-core server → pool_size = 9
-- More connections ≠ better! Too many → context switching overhead

A Node.js API serving 10,000 requests per second was opening a new PostgreSQL connection per request. Each connection took ~150ms to establish, and PostgreSQL maxed out at 100 concurrent connections (each consuming ~10MB RAM). Adding PgBouncer with a pool of 25 connections (transaction mode) eliminated connection overhead — API latency dropped from 200ms to 15ms. They also added Redis caching for the 20 most-hit API endpoints (product listings, category pages), which reduced database queries by 85%. The remaining 15% of queries used prepared statements, saving ~5ms per query in parse/plan time.

Connection pooling = reuse connections instead of creating new ones per request. Prepared statements = parse once, execute many + prevent SQL injection. Application-level caching (Redis) = serve repeated queries from memory. Together, these reduce database load by 80-90%.
⚠️ Common Mistake

Opening a new database connection for every request instead of using a connection pool — burns resources and limits throughput.

❌ New connection per request — resource waste
# Python — connection per request
def handle_request(user_id):
    conn = psycopg2.connect(host='db', dbname='mydb')  # 150ms overhead!
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    result = cursor.fetchone()
    conn.close()  # Connection discarded after one query
    return result
# 10,000 req/s = 10,000 connections opened + closed per second
# PostgreSQL max_connections = 100 → crash!
✅ Connection pool — reuse warm connections
# Python — connection pool
from psycopg2 import pool
db_pool = pool.ThreadedConnectionPool(5, 20, host='db', dbname='mydb')

def handle_request(user_id):
    conn = db_pool.getconn()   # Instant — reuse existing connection
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cursor.fetchone()
    finally:
        db_pool.putconn(conn)  # Return to pool, not closed
# 10,000 req/s share 20 connections → fast and efficient
🔁 Follow-Up Question

What is the difference between PgBouncer session mode, transaction mode, and statement mode? When would you use each?

41 How do B-tree and B+ tree indexes work internally? What happens during page splits? advanced

Almost every relational database (PostgreSQL, MySQL InnoDB, SQL Server, Oracle) uses B+ trees as the default index data structure. Understanding the internals explains why certain operations are fast and others are slow.

B-tree vs B+ tree:

  • B-tree — stores keys and data pointers in both internal nodes and leaf nodes. Any node can satisfy a lookup.
  • B+ tree (what databases actually use) — stores keys in internal nodes (for routing) and all data pointers only in leaf nodes. Leaf nodes are linked as a doubly-linked list for efficient range scans.

B+ tree structure (typical InnoDB):

  • Root node — always kept in memory. Contains keys that route searches left or right.
  • Internal (branch) nodes — contain keys and pointers to child nodes. Each node fits in one disk page (typically 8KB or 16KB). A 16KB page can hold ~500 key-pointer pairs, so a 3-level tree indexes ~125 million rows.
  • Leaf nodes — contain the actual indexed values and row pointers (or the full row data for clustered indexes). Linked in order for range scans.

Lookup: O(log n) disk reads — traverse root → internal → leaf. For 100 million rows with a branching factor of 500, that is only 3-4 levels = 3-4 disk pages read.

Page split — when inserting into a full leaf node, the database splits it into two nodes (each ~50% full), and pushes a new key up to the parent. If the parent is also full, it splits too (cascading up). This is why random key inserts (UUIDs) are expensive — they cause splits everywhere. Sequential keys (auto-increment) always append to the rightmost leaf — no splits.

-- Visualize B+ tree index structure (conceptual):
-- INDEX on employee.salary:

-- ROOT NODE (level 2, always in memory):
--  [ 30000 | 60000 | 90000 ]
--     |        |        |       \
--    ↓        ↓        ↓        ↓

-- INTERNAL NODES (level 1):
-- [10k|20k]  [40k|50k]  [70k|80k]  [100k|110k]
--   ↓   ↓      ↓   ↓      ↓   ↓       ↓    ↓

-- LEAF NODES (level 0, linked list for range scan):
-- [10k,emp3][20k,emp7] ↔ [30k,emp1][40k,emp9] ↔ [50k,emp5]...
--  ^^^^^^^^              ^^^^^^^^
--  key + row pointer     key + row pointer

-- Lookup: WHERE salary = 50000
-- 1. Root: 50000 >= 30000 and < 60000 → go to 2nd child
-- 2. Internal: 50000 >= 40000 and < 60000 → go to 2nd child
-- 3. Leaf: scan for 50000 → found! Follow row pointer to table
-- Total: 3 page reads (root cached → effectively 2 disk I/Os)

-- Range scan: WHERE salary BETWEEN 40000 AND 80000
-- 1. B+ tree lookup to find leaf with 40000
-- 2. Follow the leaf linked list: 40k → 50k → 60k → 70k → 80k
-- Sequential read, very fast!

-- PAGE SPLIT visualization:
-- Before INSERT salary=45000 into full leaf [40k, 42k, 44k, 46k, 48k]:
-- Leaf is full (max 5 keys per page)!
-- Step 1: Split into two leaves:
--   [40k, 42k, 44k] and [45k, 46k, 48k]
-- Step 2: Push middle key (45k) up to parent internal node
-- Step 3: Parent gets a new pointer to the new leaf

-- InnoDB page size (check/set):
SHOW VARIABLES LIKE 'innodb_page_size';
-- Default: 16384 (16KB)

-- PostgreSQL: inspect B-tree index structure
CREATE EXTENSION pageinspect;
SELECT * FROM bt_metap('idx_salary');  -- tree height, root page, etc.
SELECT * FROM bt_page_items('idx_salary', 1);  -- items in page 1

-- Fragmentation check after many splits (SQL Server):
SELECT name, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED')
JOIN sys.indexes ON object_id = object_id AND index_id = index_id
WHERE avg_fragmentation_in_percent > 30;

-- Fix fragmentation:
ALTER INDEX idx_salary ON employees REBUILD;       -- offline, full rebuild
ALTER INDEX idx_salary ON employees REORGANIZE;    -- online, defragment

A fintech company inserted 20 million rows/day with UUID primary keys. InnoDB used a clustered B+ tree on the PK, and every insert went to a random leaf page — causing 10,000+ page splits per second. This fragmented the B+ tree so badly that range queries slowed from 50ms to 12 seconds over a month. The fix was switching to ULID (Universally Unique Lexicographically Sortable Identifier) — which is time-ordered, so inserts always append to the rightmost leaf. Page splits dropped to near zero, and they added a weekly OPTIMIZE TABLE to reclaim wasted space from the UUID era.

Databases use B+ trees (not plain B-trees): internal nodes route, leaf nodes store data in a linked list for range scans. Lookup = O(log n) ≈ 3-4 disk I/Os for 100M rows. Page splits happen when inserting into a full leaf node — sequential keys avoid splits, random keys (UUID) cause heavy splits and fragmentation.
⚠️ Common Mistake

Not understanding that B+ tree leaf nodes are linked — leading to incorrect assumptions about range scan performance, or ignoring page splits caused by random keys.

❌ Random PKs = constant page splits
CREATE TABLE events (
    id    CHAR(36) PRIMARY KEY DEFAULT (UUID()),  -- random!
    data  TEXT,
    ts    DATETIME
);
-- Every INSERT goes to a random leaf in the B+ tree
-- Full leaves split → 50% wasted space per page
-- Tree becomes fragmented → range scans read scattered pages
-- InnoDB fill factor cannot be configured (always 15/16)
✅ Sequential / time-ordered PKs = append-only inserts
CREATE TABLE events (
    id    BIGINT PRIMARY KEY AUTO_INCREMENT,  -- sequential
    uuid  CHAR(36) UNIQUE,  -- non-clustered index for lookups
    data  TEXT,
    ts    DATETIME
);
-- INSERTs always go to the rightmost leaf → no page splits
-- B+ tree stays compact and unfragmented
-- Alternative: ULID or UUIDv7 (time-ordered UUIDs)
🔁 Follow-Up Question

What are other index structures databases use — hash indexes, GiST, GIN, BRIN? When would you choose each over B+ tree?

42 How does table partitioning work internally? How does the optimizer achieve partition pruning? advanced

Partitioning creates physically separate storage units (child tables or segments) behind a single logical table name. Understanding the internals explains why certain queries benefit and others don't.

Physical storage:

  • Each partition is an independent heap or B+ tree on disk — with its own data pages, its own indexes, and its own statistics.
  • In PostgreSQL, each partition is literally a child table with an implicit CHECK constraint. The parent table has no data of its own — it's a routing layer.
  • In MySQL InnoDB, each partition gets its own .ibd tablespace file (or subpartition within a shared tablespace).
  • In SQL Server, partitions are mapped to filegroups via a partition function (defines boundary values) and a partition scheme (maps partitions to filegroups).

Partition pruning (elimination): The query optimizer examines WHERE clauses at planning time and determines which partitions can possibly contain matching rows. Non-matching partitions are excluded from the execution plan entirely — they are never opened or scanned.

How pruning works internally:

  1. Parser extracts the partition key column and comparison value from WHERE.
  2. Optimizer compares the value against each partition's boundary metadata (CHECK constraints, range bounds, hash modulus).
  3. Non-matching partitions are removed from the plan — their indexes and data pages are never touched.
  4. EXPLAIN shows only the accessed partitions.

Pruning limitations: Only works when the partition key appears directly in the WHERE clause with a constant or parameter. Functions on the partition key (WHERE YEAR(date_col) = 2024) break pruning. JOINs where the partition key comes from another table usually prevent pruning (except with partition-wise joins in PostgreSQL 11+).

-- Create range-partitioned table (PostgreSQL)
CREATE TABLE sensor_readings (
    id          BIGSERIAL,
    sensor_id   INT NOT NULL,
    value       DOUBLE PRECISION,
    recorded_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (recorded_at);

-- Each partition = a separate physical table with its own storage
CREATE TABLE readings_2024_q1 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE readings_2024_q2 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
CREATE TABLE readings_2024_q3 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
CREATE TABLE readings_2024_q4 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');

-- Each partition has its own index (LOCAL index)
CREATE INDEX ON readings_2024_q1 (sensor_id, recorded_at);
CREATE INDEX ON readings_2024_q2 (sensor_id, recorded_at);
-- Or: CREATE INDEX ON sensor_readings (sensor_id, recorded_at);
-- PostgreSQL auto-creates per-partition indexes

-- PARTITION PRUNING in action:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM sensor_readings
WHERE recorded_at BETWEEN '2024-04-15' AND '2024-06-30';
-- Output shows:
-- Append
--   → Seq Scan on readings_2024_q2    ← only Q2 scanned!
-- Partitions Q1, Q3, Q4 are completely pruned (not even opened)

-- MySQL: check which partitions are accessed
EXPLAIN PARTITIONS
SELECT * FROM sensor_readings WHERE recorded_at = '2024-05-15';
-- partitions: p_2024_q2   ← only 1 of 4 partitions

-- SQL Server: partition function + scheme internals
CREATE PARTITION FUNCTION pf_quarter (DATE)
AS RANGE RIGHT FOR VALUES ('2024-04-01', '2024-07-01', '2024-10-01');
-- Internally creates boundary array: [2024-04-01, 2024-07-01, 2024-10-01]
-- Binary search on boundaries to find target partition → O(log p)

CREATE PARTITION SCHEME ps_quarter
AS PARTITION pf_quarter TO (fg_q1, fg_q2, fg_q3, fg_q4);
-- Maps partition numbers to physical filegroups

-- ❌ PRUNING BREAKS with function on partition key:
SELECT * FROM sensor_readings WHERE EXTRACT(YEAR FROM recorded_at) = 2024;
-- Cannot prune! Must scan ALL partitions
-- Fix: rewrite as range
SELECT * FROM sensor_readings
WHERE recorded_at >= '2024-01-01' AND recorded_at < '2025-01-01';
-- Pruning works → only 2024 partitions scanned

-- Check partition sizes individually
SELECT
    child.relname AS partition_name,
    pg_size_pretty(pg_relation_size(child.oid)) AS size,
    pg_stat_get_live_tuples(child.oid) AS row_count
FROM pg_inherits
JOIN pg_class parent ON inhparent = parent.oid
JOIN pg_class child ON inhrelid = child.oid
WHERE parent.relname = 'sensor_readings'
ORDER BY child.relname;

A telecom company stored 5 years of call detail records (CDR) — 12 billion rows in one table. Queries for monthly billing always scanned the full table (45 minutes). After range-partitioning by month (60 partitions), a billing query for one month only touched 1 partition (~200M rows, 4 minutes). EXPLAIN confirmed pruning: 59 of 60 partitions eliminated. The biggest surprise was maintenance: dropping a 2-year-old partition took 20ms (vs. DELETE WHERE date < ... which would have taken 8 hours and generated 400GB of WAL). They also discovered that per-partition VACUUM and ANALYZE ran 60x faster than on the monolithic table.

Each partition is physically separate storage (own files, own indexes, own statistics). Pruning uses boundary metadata at plan time to exclude irrelevant partitions. The partition key MUST appear directly in WHERE (no functions!) for pruning to work. Biggest practical win: instant DROP PARTITION vs. hours-long DELETE.
⚠️ Common Mistake

Applying functions to the partition key in WHERE clauses, which prevents the optimizer from pruning partitions — defeating the entire purpose of partitioning.

❌ Function on partition key — ALL partitions scanned
-- Table partitioned by recorded_at (monthly)
-- This query wraps the key in a function:
SELECT * FROM sensor_readings
WHERE DATE_TRUNC('month', recorded_at) = '2024-06-01';
-- Optimizer cannot compare '2024-06-01' against partition boundaries!
-- Must scan ALL 60 partitions → slower than unpartitioned!

EXPLAIN shows: Append → scan on readings_2024_01, readings_2024_02, ...
-- All 60 partitions listed → no pruning
✅ Direct comparison — single partition scanned
-- Same intent, but sargable range comparison:
SELECT * FROM sensor_readings
WHERE recorded_at >= '2024-06-01' AND recorded_at < '2024-07-01';
-- Optimizer binary-searches partition boundaries
-- Finds: only readings_2024_06 matches → prune all others

EXPLAIN shows: Append → scan on readings_2024_06 only
-- 1 of 60 partitions → 60x faster
🔁 Follow-Up Question

What is partition-wise JOIN and partition-wise aggregation in PostgreSQL? How do they parallelize queries across partitions?

43 How does ORDER BY work internally? When does the database use external merge sort vs in-memory sort? advanced

ORDER BY is one of the most expensive SQL operations because the database must arrange all result rows in a specific order before returning them. The internal sort strategy depends on the data size relative to available memory.

Three sorting strategies databases use:

  1. Index scan (no sort needed) — if an index already stores data in the requested order, the database reads the index in order and avoids sorting entirely. This is the fastest. For example, ORDER BY id on a table with a clustered index on id requires zero sorting. Also works for non-clustered indexes that match the ORDER BY columns.
  2. In-memory quicksort — when the data fits in the work_mem (PostgreSQL) or sort_buffer_size (MySQL) memory budget, the database sorts it entirely in RAM using a variant of quicksort. Fast, but limited by available memory. PostgreSQL default work_mem is only 4MB.
  3. External merge sort (disk-based) — when data exceeds memory, the database uses a multi-phase external sort:
    • Run generation: read chunks that fit in memory, sort each chunk (a "run"), write each sorted run to a temp file on disk.
    • Merge phase: merge all sorted runs using a k-way merge (priority queue / min-heap). May require multiple merge passes if there are too many runs.
    • This is O(n log n) with disk I/O — much slower than in-memory sort.

EXPLAIN tells you which strategy was used:

  • PostgreSQL: Sort Method: quicksort Memory: 25kB (in-memory) vs Sort Method: external merge Disk: 45000kB (spilled to disk).
  • MySQL: Using filesort in EXPLAIN means the optimizer couldn't use an index for ordering (but may still be in-memory — the name is misleading).

Top-N optimization: For ORDER BY ... LIMIT N, databases use a heap sort / top-N sort — they maintain a min-heap of size N, scanning all rows but only keeping the top N. Memory usage is O(N) regardless of table size. This is why ORDER BY created_at DESC LIMIT 10 is fast even on huge tables (especially with an index).

-- STRATEGY 1: Index eliminates sort entirely
CREATE INDEX idx_created ON orders(created_at DESC);

EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;
-- Index Scan Backward on idx_created
-- Sort Method: none — index provides the order!
-- Execution Time: 0.5ms

-- STRATEGY 2: In-memory quicksort (small result set)
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42 ORDER BY amount DESC;
-- Sort  (cost=...)  Sort Key: amount DESC
--   Sort Method: quicksort  Memory: 25kB    ← fits in work_mem
-- Execution Time: 3ms

-- STRATEGY 3: External merge sort (large result set)
SET work_mem = '4MB';  -- default PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC;
-- Sort  (cost=...)
--   Sort Method: external merge  Disk: 45000kB   ← spilled to disk!
-- Execution Time: 12000ms  (12 seconds!)

-- Fix: increase work_mem for this query
SET work_mem = '256MB';
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC;
-- Sort Method: quicksort  Memory: 52000kB   ← now fits in RAM
-- Execution Time: 800ms  (15x faster!)
-- ⚠️ work_mem is per-sort-operation, not per-query!
-- A query with 5 sorts uses 5 × work_mem

-- TOP-N HEAP SORT optimization
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY amount DESC LIMIT 10;
-- Sort  Sort Key: amount DESC
--   Sort Method: top-N heapsort  Memory: 26kB   ← only keeps 10 rows!
-- Even on a 100M row table, only 10 rows in memory

-- MySQL: "Using filesort" (misleading name)
EXPLAIN SELECT * FROM orders ORDER BY amount;
-- Extra: Using filesort
-- Does NOT necessarily mean disk — it means "not using an index for sort"
-- Could be in-memory sort_buffer or on-disk merge sort

-- MySQL: check sort behavior
SHOW STATUS LIKE 'Sort_merge_passes';
-- Sort_merge_passes > 0 → external merge sort happened → increase sort_buffer_size

-- CREATE INDEX to eliminate sorts for common queries:
-- Before: ORDER BY customer_id, order_date DESC
CREATE INDEX idx_cust_date ON orders(customer_id, order_date DESC);
-- Now the index provides the order → no sort at all

-- Multi-column sort: index must match EXACT column order AND direction
-- ORDER BY a ASC, b DESC → needs INDEX(a ASC, b DESC)
-- INDEX(a, b) only optimizes ORDER BY a ASC, b ASC

An analytics platform ran nightly reports sorting 200 million rows by revenue. With PostgreSQL's default work_mem of 4MB, the sort spilled to disk generating 48GB of temp files — the report took 45 minutes and filled the disk. Increasing work_mem to 512MB for the reporting session (SET LOCAL work_mem = '512MB' inside a transaction) kept the sort in RAM, cutting the report to 3 minutes. For their top-10 dashboard queries, adding a covering index that matched the ORDER BY eliminated sorting entirely — response time dropped from 2 seconds to 5ms. The team also discovered that MySQL's "Using filesort" warning was misleading: most of their small sorts were actually in-memory, and only the nightly batch was genuinely hitting disk.

ORDER BY uses: (1) index scan (zero sort, fastest), (2) in-memory quicksort (fits in work_mem), or (3) external merge sort (spills to disk, slowest). EXPLAIN ANALYZE shows which method was used. Add indexes matching ORDER BY columns to eliminate sorts. For large sorts, increase work_mem. LIMIT N uses a top-N heapsort optimization.
⚠️ Common Mistake

Not checking EXPLAIN output for sort spills to disk — a common hidden performance killer. Also, creating an index with wrong column direction that doesn't match the ORDER BY.

❌ Sort spills to disk — 100x slower than in-memory
-- Default work_mem = 4MB, sorting 50 million rows
SELECT * FROM transactions ORDER BY amount DESC;
-- EXPLAIN ANALYZE:
-- Sort Method: external merge  Disk: 120000kB
-- Execution Time: 95000ms  (95 seconds!)
-- Wrote 120MB of temp files to disk

-- Also bad: index direction mismatch
CREATE INDEX idx_amount ON transactions(amount ASC);
SELECT * FROM transactions ORDER BY amount DESC;
-- Cannot use index backward efficiently (database-dependent)
-- Falls back to full sort
✅ Index eliminates sort, or tune work_mem for unavoidable sorts
-- Option 1: Index matches ORDER BY direction exactly
CREATE INDEX idx_amount_desc ON transactions(amount DESC);
SELECT * FROM transactions ORDER BY amount DESC LIMIT 100;
-- Index Scan → no sort at all → 2ms

-- Option 2: Increase memory for batch queries
BEGIN;
SET LOCAL work_mem = '256MB';  -- only for this transaction
SELECT * FROM transactions ORDER BY amount DESC;
-- Sort Method: quicksort  Memory: 180000kB (in RAM)
-- Execution Time: 3500ms (27x faster than disk sort)
COMMIT;  -- work_mem resets to default
🔁 Follow-Up Question

How does GROUP BY work internally? Does it also use sorting or does it use hash aggregation? When does the optimizer choose one over the other?

44 How do JOINs work internally? Nested Loop vs Hash Join vs Merge Join advanced

When you write a JOIN, the database must choose a physical join algorithm to combine the rows. The optimizer picks the best one based on table sizes, available indexes, data distribution, and memory. Three main algorithms:

1. Nested Loop Join:

  • For each row in the outer (driving) table, scan the inner table to find matching rows.
  • Complexity: O(N × M) without indexes, O(N × log M) with an index on the inner table's join column.
  • Best when: the outer table is small AND there's an index on the inner table's join column. The optimizer will pick the smaller table as the outer.
  • Variation — Index Nested Loop: uses an index seek on the inner table for each outer row, making it very efficient for small outer sets.

2. Hash Join:

  • Build phase: scan the smaller table, build an in-memory hash table keyed on the join column.
  • Probe phase: scan the larger table, hash each row's join column, probe the hash table for matches.
  • Complexity: O(N + M) — scans each table exactly once. The hash table must fit in memory (work_mem).
  • Best when: both tables are large, no useful indexes, and the smaller table's hash table fits in memory. Cannot handle inequality joins (only equi-joins: =).
  • If the hash table exceeds memory, the database uses Grace Hash Join — partitions both tables into batches, spills to disk, then joins each batch.

3. (Sort-)Merge Join:

  • Both tables must be sorted by the join column. If not already sorted (e.g., from an index), the database sorts them first.
  • Then it "merges" them in one pass — two pointers walk through both sorted sets simultaneously.
  • Complexity: O(N log N + M log M) for sorting + O(N + M) for the merge.
  • Best when: both tables are already sorted by the join key (from an index or a prior ORDER BY), or when the result needs to be ordered anyway.
-- NESTED LOOP: small outer table + indexed inner table
-- Optimizer uses this when one side is small
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM employees e             -- 10 rows (filtered)
JOIN departments d ON e.dept_id = d.id   -- 50 rows, index on d.id
WHERE e.salary > 100000;
-- Nested Loop  (cost=... rows=10)
--   → Seq Scan on employees (Filter: salary > 100000)  → 10 rows
--   → Index Scan on departments using dept_pkey  → 1 row per loop
-- Total: 10 index lookups into departments → fast!

-- HASH JOIN: two large tables, no index
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o                -- 5,000,000 rows
JOIN customers c ON o.customer_id = c.id;  -- 500,000 rows
-- Hash Join  (cost=...)
--   Hash Cond: (o.customer_id = c.id)
--   → Seq Scan on orders          -- probe side (larger)
--   → Hash                        -- build side (smaller)
--        → Seq Scan on customers  -- builds hash table
-- Build: hash 500K customers into memory (~40MB)
-- Probe: scan 5M orders, look up each customer_id in hash table
-- Total: one pass through each table → O(N + M)

-- Check hash memory usage in EXPLAIN ANALYZE:
-- Hash  (Buckets: 65536  Batches: 1  Memory Usage: 38000kB)
-- Batches > 1 means hash spilled to disk (needs more work_mem)

-- MERGE JOIN: both tables sorted or have matching indexes
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM table_a a
JOIN table_b b ON a.sort_key = b.sort_key
ORDER BY a.sort_key;
-- Merge Join  (cost=...)
--   Merge Cond: (a.sort_key = b.sort_key)
--   → Index Scan on table_a using idx_a_sort_key
--   → Index Scan on table_b using idx_b_sort_key
-- Both already sorted by index → merge in one pass

-- Force a specific join strategy (PostgreSQL):
SET enable_hashjoin = off;     -- disable hash join
SET enable_nestloop = off;     -- disable nested loop
SET enable_mergejoin = off;    -- disable merge join
-- (for testing only — let the optimizer decide in production)

-- MySQL: JOIN hints
SELECT /*+ HASH_JOIN(o, c) */ o.id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id;
-- MySQL 8.0.18+ supports hash joins for equi-joins

-- SQL Server: JOIN hints
SELECT o.id, c.name
FROM orders o
INNER MERGE JOIN customers c ON o.customer_id = c.id;

A reporting query joining orders (50M rows) with products (100K rows) was taking 30 minutes. EXPLAIN showed a Nested Loop Join — the optimizer misjudged the outer table size because table statistics were stale (estimated 100 rows, actual 50 million). Running ANALYZE updated statistics, and the optimizer switched to a Hash Join — build a hash table from 100K products (8MB in memory), probe with 50M orders in one pass. Query time dropped from 30 minutes to 45 seconds. The team learned to run ANALYZE after bulk data loads.

Nested Loop = best for small outer + indexed inner (O(N × log M)). Hash Join = best for large tables with equi-joins (O(N+M), needs memory). Merge Join = best when both sides are already sorted (O(N+M)). The optimizer chooses based on table sizes and available indexes — keep statistics updated with ANALYZE.
⚠️ Common Mistake

Stale table statistics causing the optimizer to choose the wrong join algorithm — usually picking Nested Loop when Hash Join would be orders of magnitude faster.

❌ Stale stats — optimizer picks wrong join
-- Statistics say orders has 100 rows (stale!)
-- Actual: 50 million rows
EXPLAIN ANALYZE SELECT * FROM orders o JOIN products p ON o.product_id = p.id;
-- Nested Loop  (estimated rows=100, actual rows=50000000)
-- For each of 50M orders → index lookup into products
-- 50 million index lookups → 30+ minutes!
✅ Fresh stats — optimizer picks optimal join
ANALYZE orders;  -- update statistics
ANALYZE products;
EXPLAIN ANALYZE SELECT * FROM orders o JOIN products p ON o.product_id = p.id;
-- Hash Join  (estimated rows=50000000, actual rows=50000000) ✓
-- Build hash on products (100K rows, 8MB)
-- Probe with orders (50M rows, one pass)
-- 45 seconds instead of 30 minutes!
🔁 Follow-Up Question

What is a hash spill and how does Grace Hash Join handle hash tables that exceed memory?

45 How does GROUP BY work internally? Hash aggregation vs sort-based aggregation advanced

GROUP BY must collect all rows with the same group key and compute aggregate functions (SUM, COUNT, AVG, etc.) for each group. Internally, databases use two strategies:

1. Hash Aggregation:

  • Builds an in-memory hash table keyed on the GROUP BY columns.
  • For each input row, hash the group key → find or create the hash entry → update the running aggregate (increment count, add to sum, etc.).
  • After processing all rows, each hash entry contains the final aggregate result.
  • Complexity: O(N) — single pass through the data. Memory: O(G) where G = number of distinct groups.
  • Best when: the number of groups is moderate and fits in work_mem. Cannot produce sorted output.
  • If the hash table exceeds memory → spills to disk using partitioned (Grace) hashing.

2. Sort-Based Aggregation (GroupAggregate):

  • First, sort all rows by the GROUP BY columns (using quicksort or external merge sort).
  • Then, scan the sorted data sequentially — all rows in the same group are adjacent, so the database reads each group in one pass and computes aggregates.
  • Complexity: O(N log N) for sorting + O(N) for the aggregation pass.
  • Best when: data is already sorted by an index on the GROUP BY column (no sort needed → O(N)), or when the output needs to be in GROUP BY order (ORDER BY same columns), or when there are too many groups for a hash table to fit in memory.

Optimizer decision: PostgreSQL's optimizer estimates the number of distinct groups. Few groups → Hash Aggregate (small hash table). Many groups with no index → Hash Aggregate if it fits in memory, otherwise Sort + GroupAggregate. Data already sorted by index → GroupAggregate (free sort).

-- HASH AGGREGATE: small number of groups
EXPLAIN ANALYZE
SELECT status, COUNT(*), SUM(amount)
FROM orders
GROUP BY status;
-- HashAggregate  (cost=...) (actual rows=5)
--   Group Key: status
--   Batches: 1  Memory Usage: 24kB    ← 5 groups fit easily
--   → Seq Scan on orders
-- Hash table: {pending: {count:1200, sum:45000}, shipped: {...}, ...}
-- Single pass through orders → O(N)

-- SORT-BASED (GroupAggregate): data already sorted by index
CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*), AVG(amount)
FROM orders
GROUP BY customer_id
ORDER BY customer_id;
-- GroupAggregate  (cost=...)
--   Group Key: customer_id
--   → Index Scan on idx_orders_customer
-- Data comes from index already sorted by customer_id
-- No sort needed! Read sequentially, aggregate each group → O(N)

-- HASH SPILL to disk: too many groups for memory
SET work_mem = '4MB';
EXPLAIN ANALYZE
SELECT user_id, COUNT(*)
FROM page_views   -- 500M rows, 10M distinct user_ids
GROUP BY user_id;
-- HashAggregate
--   Batches: 16  Memory Usage: 4MB  Disk Usage: 230MB  ← spilled!
-- Hash table partitioned into 16 batches, each processed separately
-- Much slower than in-memory hash

-- Fix: increase work_mem
SET work_mem = '512MB';
-- HashAggregate  Batches: 1  Memory Usage: 380MB  ← all in memory now

-- When optimizer switches strategy automatically:
-- Few groups (< 10K) → HashAggregate almost always
-- Many groups + matching index → GroupAggregate (sorted by index)
-- Many groups + no index + enough memory → HashAggregate
-- Many groups + no index + insufficient memory → Sort + GroupAggregate

-- Mixed strategy with partial aggregation (parallel query):
EXPLAIN ANALYZE
SELECT department, AVG(salary) FROM employees GROUP BY department;
-- Partial HashAggregate  (worker 1: partial sums per dept)
-- Partial HashAggregate  (worker 2: partial sums per dept)
-- Finalize HashAggregate (combine partial results from all workers)
-- Parallelism splits the work, each worker builds its own hash table

-- MySQL: only uses temporary table + filesort for GROUP BY
-- (no hash aggregation in MySQL < 8.0.27)
EXPLAIN SELECT status, COUNT(*) FROM orders GROUP BY status;
-- Extra: Using temporary; Using filesort

A dashboard query grouped 200 million log entries by 15 million distinct user IDs. With default work_mem (4MB), PostgreSQL used HashAggregate that spilled into 64 disk batches — taking 12 minutes. Increasing work_mem to 1GB allowed the entire hash table to fit in memory (580MB for 15M entries), reducing the query to 90 seconds. However, for a different query grouping by just 5 status values, the optimizer correctly chose HashAggregate with a tiny 24KB hash table — proving that the strategy depends on the number of distinct groups, not the table size.

Hash Aggregation = O(N) single pass, builds hash table of groups. GroupAggregate = needs sorted input (from index or explicit sort), then O(N) sequential scan. Optimizer picks based on estimated distinct groups and available memory. Too many groups for memory → hash spill to disk (slow) — increase work_mem or add an index for sorted aggregation.
⚠️ Common Mistake

Not realizing that GROUP BY can be eliminated entirely if a matching index exists — or not understanding why adding an ORDER BY that matches the GROUP BY can actually switch the optimizer to a faster plan.

❌ Hash aggregate spills — slow on large group cardinality
-- 10 million distinct customer_ids, work_mem = 64MB
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
-- HashAggregate  Batches: 8  Disk: 450MB  ← spilled to disk!
-- Processes 8 batches sequentially from disk → 8 minutes

-- No index on customer_id, so GroupAggregate would need a full sort too
✅ Index provides sort order — GroupAggregate in one pass
CREATE INDEX idx_orders_cust ON orders(customer_id);
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
-- GroupAggregate → Index Scan on idx_orders_cust
-- Data arrives already sorted → no hash table, no sort, no disk spill
-- Sequential scan through index, aggregate each group → 45 seconds
-- Bonus: ORDER BY customer_id is free (already sorted)
🔁 Follow-Up Question

What is partial aggregation in parallel queries? How does the Finalize step combine results from multiple workers?

46 Hash, GIN, GiST, and BRIN indexes — when to use each instead of B+ tree? advanced

B+ tree is the default index, but it's not optimal for every data type or query pattern. Specialized index types exist for specific use cases:

1. Hash Index:

  • Stores a hash of each key value. Supports only equality lookups (=). Cannot do range scans (<, >, BETWEEN, ORDER BY).
  • O(1) lookup vs B+ tree O(log N). Faster for exact-match lookups on high-cardinality columns.
  • PostgreSQL: crash-safe since v10. MySQL InnoDB: uses adaptive hash index internally (automatic, not user-created).
  • Use when: equality-only lookups (e.g., session tokens, UUIDs). Avoid when range queries are needed.

2. GIN (Generalized Inverted Index):

  • Designed for values that contain multiple elements — arrays, JSONB, full-text search (tsvector).
  • Internally: maps each element/word → list of row IDs that contain it (inverted index, like a search engine).
  • Supports operators: @> (contains), <@ (contained by), && (overlap), @@ (full-text match).
  • Use when: searching inside arrays, JSONB fields, or full-text search. Not for scalar equality/range.
  • Downside: slow to update (each INSERT may update many inverted entries). Use fastupdate for batch-inserts.

3. GiST (Generalized Search Tree):

  • A balanced tree that supports overlapping keys — used for geometric data, ranges, and nearest-neighbor searches.
  • Supports operators: && (overlap), @> (contains), <@ (within), <-> (nearest neighbor/KNN).
  • Use when: PostGIS spatial queries (find all points within a polygon), range types (overlapping date ranges), or trigram text search (pg_trgm).

4. BRIN (Block Range Index):

  • Extremely small index that stores min/max values for each block range (group of consecutive disk pages, default 128 pages).
  • Works when data is physically correlated with the indexed column — e.g., timestamps in an append-only log table (rows are inserted in time order).
  • Lookup: check which block ranges could contain the value, scan only those blocks.
  • Use when: very large tables with naturally ordered data (time-series, log tables). Index is 1000x smaller than B+ tree.
  • Terrible if data is randomly distributed (every block range matches → full scan).
-- HASH INDEX: equality-only lookups
CREATE INDEX idx_session_hash ON sessions USING hash (session_token);
-- O(1) lookup:
SELECT * FROM sessions WHERE session_token = 'abc123xyz';  -- instant!
-- ❌ Cannot do: WHERE session_token > 'abc' (no range support)
-- ❌ Cannot do: ORDER BY session_token (no ordering)

-- GIN INDEX: search inside JSONB
CREATE INDEX idx_tags_gin ON products USING gin (tags);
-- products.tags = '{"color": "red", "size": "large", "material": "cotton"}'

SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- GIN: look up "color:red" → list of matching row IDs → fast!

-- GIN for full-text search
ALTER TABLE articles ADD COLUMN search_vector tsvector;
UPDATE articles SET search_vector = to_tsvector('english', title || ' ' || body);
CREATE INDEX idx_fts ON articles USING gin (search_vector);

SELECT title FROM articles WHERE search_vector @@ to_tsquery('database & optimization');
-- GIN inverted index: "database" → [row 5, 12, 45], "optimization" → [row 12, 45, 78]
-- Intersection: [12, 45] → 2 results, instant!

-- GIN for array containment
CREATE INDEX idx_skills ON employees USING gin (skills);
-- skills = ARRAY['python', 'sql', 'docker']
SELECT * FROM employees WHERE skills @> ARRAY['sql', 'docker'];

-- GiST INDEX: spatial and range queries
CREATE INDEX idx_location ON stores USING gist (coordinates);
-- Find stores within 5km of a point (PostGIS):
SELECT name FROM stores
WHERE ST_DWithin(coordinates, ST_MakePoint(-73.98, 40.75)::geography, 5000);
-- GiST: nearest-neighbor search (KNN):
SELECT name, coordinates <-> ST_MakePoint(-73.98, 40.75) AS distance
FROM stores ORDER BY coordinates <-> ST_MakePoint(-73.98, 40.75) LIMIT 10;

-- GiST for overlapping date ranges
CREATE INDEX idx_booking_range ON bookings USING gist (daterange(check_in, check_out));
SELECT * FROM bookings
WHERE daterange(check_in, check_out) && daterange('2024-06-01', '2024-06-15');
-- Find all bookings that overlap with June 1-15

-- BRIN INDEX: huge time-series tables
CREATE INDEX idx_ts_brin ON sensor_data USING brin (recorded_at)
    WITH (pages_per_range = 64);
-- Table: 2 billion rows, 500GB
-- B+ tree index on recorded_at: 45GB
-- BRIN index on recorded_at: 50MB (900x smaller!)

SELECT * FROM sensor_data
WHERE recorded_at BETWEEN '2024-06-01' AND '2024-06-02';
-- BRIN checks: which 64-page ranges have min/max overlapping June 1-2?
-- Answer: block ranges 45000-45100 → scan only those blocks

-- Compare index sizes:
SELECT pg_size_pretty(pg_relation_size('idx_ts_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('idx_ts_brin')) AS brin_size;
-- btree: 45 GB  |  brin: 50 MB

A food delivery app used PostGIS with GiST indexes on restaurant locations to answer "find restaurants within 3km" in 2ms — serving 50,000 searches per minute. Their menu search used GIN indexes on tsvector columns for full-text search ("spicy chicken near me"), matching across 2 million menu items in 5ms. For their analytics warehouse (180 billion event rows), they replaced a 2TB B+ tree index on event_timestamp with a 200MB BRIN index — queries were only 20% slower but the storage savings of 1.8TB allowed them to keep 2 more years of data on the same hardware.

B+ tree = default for equality + range. Hash = equality only, O(1). GIN = multi-valued data (JSONB, arrays, full-text). GiST = spatial, range overlap, nearest-neighbor. BRIN = huge tables with physically ordered data (1000x smaller than B+ tree). Choose based on your query operators, not just the column type.
⚠️ Common Mistake

Using a B+ tree index for full-text search or JSONB containment queries — it simply cannot support those operators.

❌ B-tree on JSONB — cannot use containment operators
CREATE INDEX idx_tags ON products (tags);  -- default B-tree
SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- ERROR or full table scan!
-- B-tree can only do: tags = 'exact_json_value' (useless)

-- Also wrong: B-tree for text search
CREATE INDEX idx_body ON articles (body);
SELECT * FROM articles WHERE body LIKE '%database%';
-- B-tree cannot use leading wildcard → full scan
✅ GIN for containment/search operators
CREATE INDEX idx_tags ON products USING gin (tags);
SELECT * FROM products WHERE tags @> '{"color": "red"}';
-- GIN inverted index → instant lookup!

-- Full-text search with GIN
CREATE INDEX idx_fts ON articles USING gin (to_tsvector('english', body));
SELECT * FROM articles WHERE to_tsvector('english', body) @@ to_tsquery('database');
-- GIN inverted index on words → fast!
🔁 Follow-Up Question

What are bloom indexes and when would you use them? How does the adaptive hash index work in InnoDB?

47 What are deadlocks? How does the database detect and resolve them? advanced

A deadlock occurs when two or more transactions are waiting for each other to release locks, creating a circular dependency where none can proceed. Example: Transaction A holds lock on row 1 and waits for row 2, while Transaction B holds lock on row 2 and waits for row 1 — neither can finish.

Four necessary conditions for deadlock (Coffman conditions):

  1. Mutual exclusion — rows/resources can be locked exclusively.
  2. Hold and wait — a transaction holds locks while waiting for more.
  3. No preemption — locks cannot be forcibly taken from a transaction.
  4. Circular wait — a cycle exists in the wait-for graph.

How databases detect deadlocks:

  • Wait-for graph (WFG) — the database maintains a directed graph where nodes are transactions and edges represent "is waiting for". Periodically (or on each wait), the database checks for cycles in this graph. A cycle = deadlock.
  • InnoDB — checks the WFG immediately when a transaction starts waiting. Detection is instant. Chooses the youngest (least work done) transaction as the victim and rolls it back.
  • PostgreSQL — checks the WFG after a configurable wait timeout (deadlock_timeout, default 1 second). If a cycle is found, one transaction is aborted with ERROR: deadlock detected.
  • SQL Server — background thread checks every 5 seconds. Victim is the transaction with the lowest deadlock priority or least log written.

Prevention strategies:

  • Always acquire locks in a consistent order (e.g., always lock lower ID first).
  • Keep transactions short — less time holding locks = smaller deadlock window.
  • Use appropriate isolation levels — READ COMMITTED produces fewer locks than SERIALIZABLE.
  • Retry on deadlock — the application should catch the deadlock error and retry the transaction.
-- CLASSIC DEADLOCK scenario:
-- Transaction A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- locks row 1
-- ... some processing ...
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- WAITS for row 2!

-- Transaction B (concurrent):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;   -- locks row 2
-- ... some processing ...
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- WAITS for row 1!

-- DEADLOCK! A waits for B (row 2), B waits for A (row 1)
-- Database detects cycle → kills one transaction:
-- ERROR: deadlock detected
-- DETAIL: Process 1234 waits for ShareLock on row 1;
--         blocked by process 5678.
--         Process 5678 waits for ShareLock on row 2;
--         blocked by process 1234.

-- FIX: Always lock in consistent order (lower ID first)
-- Transaction A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- lock 1 first
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- then lock 2
COMMIT;

-- Transaction B (also locks in same order):
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- lock 1 first
UPDATE accounts SET balance = balance - 50 WHERE id = 2;   -- then lock 2
COMMIT;
-- No circular wait possible → no deadlock!

-- APPLICATION: retry on deadlock (Python/psycopg2)
-- import psycopg2
-- MAX_RETRIES = 3
-- for attempt in range(MAX_RETRIES):
--     try:
--         with conn.cursor() as cur:
--             cur.execute("BEGIN")
--             cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (from_id,))
--             cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (to_id,))
--             cur.execute("COMMIT")
--         break  # success
--     except psycopg2.errors.DeadlockDetected:
--         conn.rollback()
--         if attempt == MAX_RETRIES - 1:
--             raise  # give up after 3 retries

-- PostgreSQL: tune deadlock detection
SHOW deadlock_timeout;  -- default: 1s
-- Lower values detect deadlocks faster but increase CPU overhead
-- SET deadlock_timeout = '500ms';

-- InnoDB: check recent deadlocks
SHOW ENGINE INNODB STATUS;
-- Look for "LATEST DETECTED DEADLOCK" section
-- Shows which transactions, which rows, which was rolled back

-- PostgreSQL: log all deadlocks
-- postgresql.conf: log_lock_waits = on
-- Logs: "process 1234 still waiting for ShareLock after 1000ms"

-- Monitor deadlock frequency:
-- PostgreSQL:
SELECT deadlocks FROM pg_stat_database WHERE datname = current_database();
-- MySQL:
SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';

An e-commerce checkout system experienced 50-100 deadlocks per hour during peak sales. Two concurrent transactions both updated inventory (lock product row) and inserted order items (lock order row) but in different orders depending on the code path. The fix was standardizing the lock order: always lock inventory rows sorted by product_id ASC, then insert order items. Deadlocks dropped to near zero. They also added a retry wrapper with exponential backoff — on the rare deadlock, the transaction retried after 100ms, transparent to the user. Monitoring via pg_stat_database.deadlocks and alerting at >10/hour caught regressions early.

Deadlocks = circular lock dependency. Databases detect them via wait-for graph cycle detection and roll back one victim transaction. Prevent by: (1) always lock resources in the same order, (2) keep transactions short, (3) use lower isolation levels. Always implement retry logic for deadlock errors in application code.
⚠️ Common Mistake

Locking resources in inconsistent order across different code paths — the #1 cause of deadlocks in production applications.

❌ Inconsistent lock order — guaranteed deadlocks under load
-- transfer_money(from=1, to=2):
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- lock 1
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- lock 2

-- transfer_money(from=2, to=1):  (concurrent)
UPDATE accounts SET balance = balance - 50 WHERE id = 2;   -- lock 2
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- lock 1
-- Opposite order! → deadlock when both run concurrently
✅ Consistent lock order — deadlock impossible
-- Always lock LOWER id first, regardless of transfer direction
-- transfer_money(from_id, to_id):
--   first_id = min(from_id, to_id)
--   second_id = max(from_id, to_id)

-- transfer_money(from=1, to=2):
UPDATE accounts SET balance = ... WHERE id = 1;  -- lock min(1,2)=1
UPDATE accounts SET balance = ... WHERE id = 2;  -- lock max(1,2)=2

-- transfer_money(from=2, to=1):
UPDATE accounts SET balance = ... WHERE id = 1;  -- lock min(2,1)=1
UPDATE accounts SET balance = ... WHERE id = 2;  -- lock max(2,1)=2
-- Same order → no cycle → no deadlock!
🔁 Follow-Up Question

What is a livelock? How does it differ from a deadlock? What is lock escalation in SQL Server?

Frequently Asked Questions

Written and reviewed by the FreeBytes Editorial Team · Last updated: June 2026