🐍 Python Interview Questions
40 questions with theory, real code, real-world scenarios, common mistakes and follow-up questions — from basic to performance optimization.
Python is a high-level, interpreted programming language created by Guido van Rossum and first released in 1991. It emphasises code readability through significant whitespace and a clean, English-like syntax.
Python is popular because it has a gentle learning curve, a massive standard library ("batteries included"), and thriving ecosystems for web development (Django, Flask), data science (pandas, NumPy), machine learning (TensorFlow, PyTorch), and automation. Its community is one of the largest in the world, which means you can find a library for almost anything.
# Python reads almost like English
students = ["Alice", "Bob", "Charlie"]
for student in students:
    if student.startswith("A"):
        print(f"{student} gets a welcome bonus!")
# Output: Alice gets a welcome bonus!
Instagram's entire backend started as a Django (Python) monolith and scaled to 1 billion+ users before any major language migration. Dropbox ran on Python for over a decade. Netflix uses Python for its recommendation engine data pipeline that processes 200+ billion events per day.
Many candidates say "Python is slow, so it's only for scripting." This is wrong — Python is the backbone of Instagram, Netflix, and most AI/ML production systems. The correct framing is: Python is slower at CPU-bound loops but excels at I/O-bound work and integrates with C/C++ for hot paths.
What are the differences between Python 2 and Python 3? Why did the migration take so long?
Python has several built-in data types grouped into categories:
Numeric: int (unlimited precision integers), float (64-bit doubles), complex.
Sequence: list (mutable, ordered), tuple (immutable, ordered), range.
Text: str (immutable Unicode).
Set: set (mutable, unordered, unique), frozenset (immutable).
Mapping: dict (mutable key-value pairs).
Boolean: bool (True/False, subclass of int).
None: NoneType — Python's null.
# Numeric
price = 49.99 # float
quantity = 3 # int
# Sequence
cart_items = ["Laptop", "Mouse"] # list (mutable)
dimensions = (1920, 1080) # tuple (immutable)
# Mapping
user = {
    "name": "Priya",
    "age": 28,
    "is_premium": True,  # bool
}
# Set
unique_tags = {"python", "coding", "python"}  # duplicates removed
print(unique_tags)  # {'python', 'coding'} (set order may vary)
# None
result = None # placeholder before computation
At a fintech startup processing 500K transactions/day, switching order IDs from a list lookup (O(n)) to a set lookup (O(1)) reduced duplicate-detection time from 14 seconds to 0.003 seconds per batch.
Candidates often confuse list and tuple. The key difference is mutability, not syntax. Tuples are hashable and can be dict keys; lists cannot.
What is the difference between a list and a tuple? When would you use one over the other?
A function is defined with the def keyword, takes parameters, and optionally returns a value with return. If no return statement is used, the function returns None.
Python supports default arguments (parameters with preset values), keyword arguments (passing by name), and positional arguments. Default arguments are evaluated once at function definition time — not each call — which is a critical gotcha with mutable defaults.
def calculate_discount(price, discount_pct=10, currency="₹"):
    """Calculate discounted price with optional percentage and currency."""
    savings = price * (discount_pct / 100)
    final_price = price - savings
    return final_price, savings  # returns a tuple
# Usage
final, saved = calculate_discount(1000)
print(f"Pay ₹{final}, you saved ₹{saved}")
# Output: Pay ₹900.0, you saved ₹100.0
# With keyword argument
final, saved = calculate_discount(1000, discount_pct=20, currency="$")
print(f"Pay ${final}, you saved ${saved}")
# Output: Pay $800.0, you saved $200.0
In an e-commerce pricing engine processing 50K products, functions with default arguments let the team reuse one calculate_discount() function across 12 different sale campaigns instead of writing 12 separate functions — reducing code from 600 lines to 45.
The #1 Python function bug — mutable default arguments:
def add_item(item, cart=[]):
    cart.append(item)
    return cart
print(add_item("A"))  # ['A']
print(add_item("B"))  # ['A', 'B'] (BUG: the default list is shared across calls)
# Fix: default to None and create the list inside the function
def add_item(item, cart=None):
    if cart is None:
        cart = []
    cart.append(item)
    return cart
print(add_item("A"))  # ['A']
print(add_item("B"))  # ['B'] (correct)
What is the difference between *args and **kwargs?
for loops iterate over any iterable (list, string, range, dict, file). while loops run as long as a condition is True.
break exits the loop entirely. continue skips the current iteration. Python has a unique else clause on loops — the else block runs only if the loop completes without hitting a break. This is useful for search patterns.
# for with else — search pattern
users_db = ["alice", "bob", "charlie", "diana"]
search = "charlie"
for user in users_db:
    if user == search:
        print(f"Found {user}! Granting access...")
        break
else:
    # Only runs if break was NOT hit
    print(f"{search} not found. Access denied.")
# Output: Found charlie! Granting access...
# while with continue — skip invalid data
raw_scores = [85, -1, 92, 0, 78, -5, 95]
valid_scores = []
i = 0
while i < len(raw_scores):
    score = raw_scores[i]
    i += 1
    if score <= 0:
        continue  # skip invalid
    valid_scores.append(score)
print(f"Valid scores: {valid_scores}")
# Output: Valid scores: [85, 92, 78, 95]
In a data cleaning pipeline for a hospital system, using for-else to detect missing patient IDs removed the need for a separate boolean flag variable across 200+ validation functions, making the code cleaner and eliminating bugs caused by forgotten flag resets.
Candidates forget the else clause on loops or confuse it with if-else. The loop's else means "no break happened" — it does not mean "loop didn't execute." An empty loop still triggers else.
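The "no break happened" rule can be checked with a small sketch. The function name find_first_negative is hypothetical and used only for illustration; the third call shows that else also runs when the iterable is empty.

```python
def find_first_negative(values):
    """Return the first negative number, or None if the loop ends without break."""
    for value in values:
        if value < 0:
            found = value
            break
    else:
        # Reached whenever the loop finishes without break,
        # including when values is empty and the body never runs
        found = None
    return found


print(find_first_negative([3, -7, 5]))  # -7
print(find_first_negative([3, 5]))      # None
print(find_first_negative([]))          # None
```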
How do you iterate over a dictionary's keys and values simultaneously?
List — mutable, ordered, allows duplicates. Use for collections that change. Syntax: [1, 2, 3].
Tuple — immutable, ordered, allows duplicates. Use for fixed records (coordinates, DB rows). Syntax: (1, 2, 3). Tuples are hashable, so they can be dict keys.
Set — mutable, unordered, no duplicates. Use for membership testing and deduplication. Syntax: {1, 2, 3}.
Performance: set lookups are O(1) average (hash table), list lookups are O(n), tuple lookups are O(n) but tuples use less memory than lists.
# List — shopping cart (changes often)
cart = ["Laptop", "Mouse", "Keyboard"]
cart.append("Monitor")
cart.remove("Mouse")
print(cart) # ['Laptop', 'Keyboard', 'Monitor']
# Tuple — database record (never changes)
employee = ("E1042", "Priya Sharma", "Engineering", 95000)
emp_id, name, dept, salary = employee # unpacking
print(f"{name} in {dept}") # Priya Sharma in Engineering
# Set — unique users who viewed a page
viewers = {"user_101", "user_202", "user_101", "user_303"}
print(len(viewers)) # 3 (duplicate removed)
# Set operations
premium_users = {"user_101", "user_404"}
premium_viewers = viewers & premium_users # intersection
print(premium_viewers) # {'user_101'}
At a social media analytics platform, switching the "unique daily active users" counter from a list (checking `if user not in list`) to a set reduced the daily aggregation job from 23 minutes to 47 seconds on 5M user events.
Candidates say "tuples are just immutable lists." This misses the key implication — because tuples are hashable, they can be used as dictionary keys and set elements, which lists cannot.
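A minimal sketch of the hashability point, assuming a hypothetical grid keyed by (row, col) coordinates. The exact wording of the TypeError message differs between Python versions, so it is printed rather than shown as expected output.

```python
# A tuple of hashable items can be a dict key or a set element
grid = {(0, 0): "start", (2, 3): "treasure"}
visited = {(0, 0), (0, 1)}
print(grid[(2, 3)])        # treasure
print((0, 1) in visited)   # True

# A list is unhashable, so using one as a key raises TypeError
try:
    bad = {[0, 0]: "start"}
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```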
What is a frozenset and when would you use it?
Strings in Python are immutable — every string method returns a new string. Key methods: strip() removes whitespace, split() breaks into a list, join() combines a list into a string, replace() substitutes substrings, find()/index() search for substrings, startswith()/endswith() check prefixes/suffixes, upper()/lower()/title() change case.
f-strings (Python 3.6+) are the modern way to format strings — faster and more readable than % or .format().
# Cleaning user input
raw_email = " Priya.Sharma@Gmail.COM "
clean_email = raw_email.strip().lower()
print(clean_email) # "priya.sharma@gmail.com"
# Parsing CSV-like data
log_line = "2025-01-15|ERROR|Database connection timeout"
date, level, message = log_line.split("|")
print(f"[{level}] {message}") # [ERROR] Database connection timeout
# Building output from list
tags = ["python", "interview", "coding"]
hashtags = " ".join(f"#{tag}" for tag in tags)
print(hashtags) # #python #interview #coding
# f-string with expression
price = 1299.5
print(f"Total: ₹{price:,.2f}") # Total: ₹1,299.50
In an email validation microservice handling 2M signups/month, chaining strip().lower() on every email input before database storage prevented 12,000+ duplicate accounts per month caused by leading spaces and mixed-case entries.
Candidates forget strings are immutable and write name.upper() expecting name to change. You must reassign: name = name.upper().
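The reassignment rule is easy to verify in a few lines. This is only an illustrative sketch; the variable name is arbitrary.

```python
name = "priya"
name.upper()          # returns a new string, which is discarded here
print(name)           # priya

name = name.upper()   # reassign to keep the result
print(name)           # PRIYA

# In-place modification is rejected because str is immutable
try:
    name[0] = "p"
except TypeError:
    print("str does not support item assignment")
```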
What is the difference between str.find() and str.index()? What happens when the substring is not found?
Python opens files with open(filename, mode). Modes: "r" read (default), "w" write (overwrites), "a" append, "x" create (fails if the file exists); add "b" for binary, as in "rb" or "wb". You should always use the with statement (context manager), which automatically closes the file even if an exception occurs.
Reading methods: read() loads entire file, readline() reads one line, readlines() returns list of lines. For large files, iterate line by line with for line in file: to avoid loading everything into memory.
# Writing a report file
sales_data = [
    {"product": "Laptop", "revenue": 125000},
    {"product": "Mouse", "revenue": 8500},
    {"product": "Monitor", "revenue": 45000},
]
# encoding is set explicitly so the ₹ symbol is written correctly on every platform
with open("sales_report.txt", "w", encoding="utf-8") as f:
    f.write("=== Monthly Sales Report ===\n\n")
    for item in sales_data:
        f.write(f"{item['product']:.<20} ₹{item['revenue']:>10,}\n")
    f.write(f"\n{'Total':.<20} ₹{sum(i['revenue'] for i in sales_data):>10,}\n")
# Reading it back, line by line (memory-efficient)
with open("sales_report.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip())
# Output:
# === Monthly Sales Report ===
#
# Laptop.............. ₹   125,000
# Mouse............... ₹     8,500
# Monitor............. ₹    45,000
#
# Total............... ₹   178,500
A log analysis script at an e-commerce company needed to scan 15 GB access logs daily. Switching from file.read() (loaded entire file into RAM, crashing on 8 GB servers) to line-by-line iteration reduced memory usage from 15 GB to 12 MB while processing 80M lines in 4 minutes.
f = open("data.txt", "r")
data = f.read()
f.close()  # never reached if read() raises
# Fix: use a with block so the file is closed automatically
with open("data.txt", "r") as f:
    data = f.read()
# file is auto-closed here, even on exception
How would you read a very large CSV file (10 GB) without running out of memory?
Python uses try/except/else/finally for error handling. Code that might fail goes in try. The except block catches specific exceptions. else runs only if no exception occurred. finally always runs — for cleanup.
You can raise exceptions manually and create custom exception classes by inheriting from Exception. Always catch specific exceptions, not bare except: which catches everything including KeyboardInterrupt and SystemExit.
def withdraw(balance, amount):
    """Bank withdrawal with proper error handling."""
    if not isinstance(amount, (int, float)):
        raise TypeError(f"Amount must be a number, got {type(amount).__name__}")
    if amount <= 0:
        raise ValueError("Withdrawal amount must be positive")
    if amount > balance:
        raise ValueError(f"Insufficient funds: balance ₹{balance}, requested ₹{amount}")
    return balance - amount
# Usage with try/except/else/finally
try:
    new_balance = withdraw(10000, 3000)
except TypeError as e:
    print(f"Input error: {e}")
except ValueError as e:
    print(f"Business rule error: {e}")
else:
    # Only runs if NO exception
    print(f"Withdrawal successful! New balance: ₹{new_balance}")
finally:
    # Always runs, good for logging and cleanup
    print("Transaction logged.")
# Output:
# Withdrawal successful! New balance: ₹7000
# Transaction logged.
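Custom exception classes, mentioned above, can be sketched as follows. InsufficientFundsError and withdraw_strict are hypothetical names introduced for illustration; the idea is that a dedicated class lets callers catch one specific failure and read structured details from it.

```python
class InsufficientFundsError(Exception):
    """Hypothetical custom exception carrying the balance and requested amount."""

    def __init__(self, balance, amount):
        super().__init__(f"Insufficient funds: balance {balance}, requested {amount}")
        self.balance = balance
        self.amount = amount


def withdraw_strict(balance, amount):
    """Return the new balance, or raise the custom exception when funds are short."""
    if amount > balance:
        raise InsufficientFundsError(balance, amount)
    return balance - amount


try:
    withdraw_strict(500, 800)
except InsufficientFundsError as exc:
    shortfall = exc.amount - exc.balance
    print(f"Declined, short by {shortfall}")  # Declined, short by 300
```

Because the class inherits from Exception, existing handlers written as except Exception still catch it.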
In a payment gateway processing 10K transactions/hour, adding specific except clauses for ConnectionTimeout, InvalidCard, and InsufficientFunds (instead of a bare except) reduced silent failures from 200/day to zero — every error was now categorized and routed to the correct retry or alert system.
try:
    result = process_payment(order)
except:
    print("Something went wrong")  # hides the real error
# Fix: catch specific exceptions and handle each one deliberately
try:
    result = process_payment(order)
except ConnectionError:
    retry_payment(order)  # network issue, retry
except ValueError as e:
    log.error(f"Invalid order data: {e}")  # data issue, log
    raise  # re-raise after logging
How do you create a custom exception class in Python? When is it appropriate?
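A module is any .py file. A package is a directory of modules, traditionally marked by an __init__.py file (which can be empty); since Python 3.3, a directory without one can also be imported as a namespace package. The import statement loads code from modules.
Import styles: import math (full module), from math import sqrt (specific function), from math import * (everything; avoid in production). Python checks built-in modules first and then searches the directories in sys.path in order: the script's directory (or the current directory in interactive mode), any PYTHONPATH entries, the standard library, and installed packages (site-packages).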
__name__ == "__main__" is True only when a file is run directly, not when imported. This is the standard entry-point guard.
# utils/validators.py — a custom module
import re
def validate_email(email):
    """Validate email format and return cleaned version."""
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    clean = email.strip().lower()
    if not re.match(pattern, clean):
        raise ValueError(f"Invalid email: {email}")
    return clean
def validate_phone(phone, country_code="+91"):
    """Validate Indian phone number."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        raise ValueError(f"Phone must be 10 digits, got {len(digits)}")
    return f"{country_code}{digits}"
# main.py — importing the module
from utils.validators import validate_email, validate_phone
if __name__ == "__main__":
    email = validate_email(" Priya@Gmail.COM ")
    phone = validate_phone("98765-43210")
    print(f"Email: {email}, Phone: {phone}")
    # Email: priya@gmail.com, Phone: +919876543210
A 50-developer team at a SaaS company reduced import-related bugs from 15/sprint to zero by enforcing explicit imports (from module import X) instead of wildcard imports (from module import *), which had been causing name collisions between 300+ modules.
Candidates use from module import * in production. This pollutes the namespace and causes subtle bugs when two modules export the same name. Always use explicit imports: from module import specific_function.
What is the difference between absolute and relative imports? When would you use each?
A list comprehension is a concise way to create lists: [expression for item in iterable if condition]. It combines a loop, an optional filter, and a transformation into a single readable line.
Comprehensions exist for lists [], sets {}, and dicts {k: v}; the parenthesized form () creates a generator expression. They are generally faster than an equivalent for loop with append() because the interpreter builds the result with specialized bytecode and skips the repeated method lookup and call. However, for complex logic (multiple side effects, nested conditions), a regular loop is more readable.
# Regular loop vs comprehension
prices = [1200, 450, 3200, 89, 5600, 230, 780]
# Loop approach — 4 lines
expensive = []
for p in prices:
    if p > 500:
        expensive.append(p * 0.9)  # 10% discount
# Comprehension — 1 line, same result
expensive = [p * 0.9 for p in prices if p > 500]
print(expensive) # [1080.0, 2880.0, 5040.0, 702.0]
# Dict comprehension — word frequency
sentence = "python is great and python is fun"
word_freq = {word: sentence.split().count(word)
             for word in set(sentence.split())}
print(word_freq)
# {'python': 2, 'is': 2, 'great': 1, 'and': 1, 'fun': 1} (key order may vary)
# Nested comprehension — flatten matrix
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [num for row in matrix for num in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
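The set, dict, and generator forms follow the same pattern as the list form. A minimal sketch, using a hypothetical list of (product, price) orders:

```python
orders = [("laptop", 1200), ("mouse", 25), ("laptop", 1150), ("cable", 9)]

# Set comprehension: unique product names
products = {name for name, _ in orders}

# Dict comprehension: highest price seen per product
max_price = {name: max(p for n, p in orders if n == name) for name in products}

# Generator expression: summed lazily, no intermediate list is built
total = sum(price for _, price in orders)

print(sorted(products))     # ['cable', 'laptop', 'mouse']
print(max_price["laptop"])  # 1200
print(total)                # 2384
```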
In a data pipeline processing 2M product listings, replacing 15 for-loop-with-append patterns with list comprehensions reduced the transformation step from 8.2 seconds to 3.1 seconds — a 62% speedup with zero logic changes, purely from Python's internal C-level optimization.
Candidates write deeply nested comprehensions that are unreadable:
result = [x*y for x in range(10) for y in range(10) if x != y if x+y > 5 if x*y < 30]
# More readable as a regular loop:
result = []
for x in range(10):
    for y in range(10):
        if x != y and x + y > 5 and x * y < 30:
            result.append(x * y)
What is a generator expression and how does it differ from a list comprehension?
Python supports full OOP with classes, inheritance, polymorphism, and encapsulation. A class is a blueprint; an object is an instance. __init__ is the constructor. self refers to the current instance.
Inheritance: a child class inherits methods/attributes from a parent, and can override them. Python supports multiple inheritance (MRO — Method Resolution Order uses C3 linearization). Encapsulation: use a single underscore _var for "protected" (convention) and double underscore __var for name-mangling (not true private, but harder to access accidentally).
class BankAccount:
    """Base bank account with deposit/withdraw."""
    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance    # protected by convention
        self.__transactions = []   # name-mangled
    @property
    def balance(self):
        return self._balance
    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self._balance += amount
        self.__transactions.append(f"+₹{amount}")
        return self._balance
    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("Insufficient funds")
        self._balance -= amount
        self.__transactions.append(f"-₹{amount}")
        return self._balance
class SavingsAccount(BankAccount):
    """Savings account with interest, inherits from BankAccount."""
    def __init__(self, owner, balance=0, interest_rate=0.04):
        super().__init__(owner, balance)
        self.interest_rate = interest_rate
    def apply_interest(self):
        interest = self._balance * self.interest_rate
        self.deposit(interest)
        return interest
# Usage
acc = SavingsAccount("Priya", 50000, 0.06)
acc.deposit(10000)
interest = acc.apply_interest()
# The balance is a float after interest is added, so format it with 2 decimals
print(f"Balance: ₹{acc.balance:,.2f}, Interest earned: ₹{interest:,.2f}")
# Balance: ₹63,600.00, Interest earned: ₹3,600.00
A fintech team modelled 8 account types (Savings, Current, FD, Recurring, NRI, Joint, Minor, Salary) using a base BankAccount class. Adding a new account type went from 2 weeks (copy-paste 800 lines) to 2 hours (inherit and override 3 methods).
Candidates forget to call super().__init__() in child classes, causing missing attributes. Also, __var is name-mangled (accessible as _ClassName__var), not truly private — don't rely on it for security.
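Name mangling can be observed directly. In this sketch the Vault class and its attribute names are hypothetical; it shows that a double-underscore attribute is stored under a rewritten name and remains reachable.

```python
class Vault:
    def __init__(self):
        self._hint = "internal by convention"
        self.__secret = "name-mangled"   # stored as _Vault__secret


vault = Vault()
print(vault._hint)                  # internal by convention
print(hasattr(vault, "__secret"))   # False: no attribute with that exact name
print(vault._Vault__secret)         # name-mangled
```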
What is the MRO (Method Resolution Order) in Python? How does it handle the diamond problem?
A decorator is a function that takes another function as input and returns an enhanced version of it — without modifying the original function's code. Decorators use the @decorator_name syntax above a function definition.
Under the hood, @my_decorator above def func() is just syntactic sugar for func = my_decorator(func). Decorators are powerful for cross-cutting concerns: logging, timing, authentication, caching, rate-limiting, and input validation.
Use @functools.wraps(func) inside your decorator to preserve the original function's name and docstring.
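The equivalence between the @ syntax and a plain reassignment can be sketched as follows. The shout decorator and both greet functions are hypothetical examples, not part of any library.

```python
import functools


def shout(func):
    """Hypothetical decorator that upper-cases a function's string result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return f"hello, {name}"


def greet_plain(name):
    return f"hello, {name}"


# What the @ syntax does behind the scenes
greet_manual = shout(greet_plain)

print(greet("Priya"))         # HELLO, PRIYA
print(greet_manual("Priya"))  # HELLO, PRIYA
print(greet.__name__)         # greet (preserved by functools.wraps)
```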
import functools
import time
def timer(func):
    """Decorator that logs how long a function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"⏱ {func.__name__}() took {elapsed:.4f}s")
        return result
    return wrapper
def retry(max_attempts=3):
    """Decorator with arguments: retries on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt}/{max_attempts} failed: {e}")
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator
@timer
@retry(max_attempts=3)
def fetch_user_data(user_id):
    """Simulate API call."""
    import random
    if random.random() < 0.5:
        raise ConnectionError("API timeout")
    return {"id": user_id, "name": "Priya"}
data = fetch_user_data(42)
At a microservices company, a single @retry(max_attempts=3, backoff=2) decorator applied to 45 API calls reduced cascading failures by 73% during peak traffic — without changing any business logic in those 45 functions.
Forgetting @functools.wraps(func) causes the decorated function to lose its name and docstring:
def my_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper
@my_decorator
def greet(): pass
print(greet.__name__)  # "wrapper" (wrong)
# Fix: wrap the inner function with functools.wraps
import functools
def my_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper
@my_decorator
def greet(): pass
print(greet.__name__)  # "greet" (correct)
Can you stack multiple decorators on one function? In what order do they execute?
A generator is a function that uses yield instead of return. It produces values one at a time, pausing execution between yields and resuming when the next value is requested. This is called lazy evaluation.
Generators are memory-efficient because they don't store the entire sequence in memory — they compute each value on-the-fly. A generator function returns a generator object that implements the iterator protocol (__iter__ and __next__).
Generator expressions use parentheses: (x for x in range(1000000)) — same syntax as list comprehension but uses almost zero memory.
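The pause-and-resume behaviour and the iterator protocol can be seen by driving a generator by hand with next(). This is a minimal sketch; countdown is a hypothetical example function.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1, pausing after each value."""
    while start > 0:
        yield start
        start -= 1


gen = countdown(3)
print(iter(gen) is gen)  # True: a generator is its own iterator
print(next(gen))         # 3
print(next(gen))         # 2
print(next(gen))         # 1
try:
    next(gen)            # nothing left to yield
except StopIteration:
    print("exhausted")
```

A for loop performs the same next() calls and handles StopIteration automatically.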
def read_large_csv(filepath, chunk_size=1000):
    """Generator that reads a large CSV in chunks."""
    import csv
    # newline="" is what the csv module recommends when opening files
    with open(filepath, "r", newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:  # remaining rows
            yield chunk
# Usage: processes 10M rows without loading all into memory
total_revenue = 0
for batch in read_large_csv("sales_10M.csv"):
    for row in batch:
        total_revenue += float(row["revenue"])
print(f"Total revenue: ₹{total_revenue:,.2f}")
# Generator expression vs list comprehension
# List: holds all 10M results at once (roughly 80 MB for the list's references
# alone on 64-bit CPython, plus the int objects themselves)
squares_list = [x**2 for x in range(10_000_000)]
# Generator: produces one number at a time (the generator object itself is tiny)
squares_gen = (x**2 for x in range(10_000_000))
print(sum(squares_gen))  # processes 10M numbers with near-constant memory
At a data analytics company, replacing a list-based CSV reader with a generator-based chunked reader let them process a 45 GB transaction log on a server with only 4 GB RAM — previously the job crashed with MemoryError after 3 minutes.
Candidates forget that generators are single-use — once exhausted, they produce nothing:
gen = (x**2 for x in range(5))
print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] (empty: the generator is exhausted)
# Fix: wrap the expression in a function so each call returns a fresh generator
def squares(n):
    return (x**2 for x in range(n))
print(list(squares(5)))  # [0, 1, 4, 9, 16]
print(list(squares(5)))  # [0, 1, 4, 9, 16] (fresh generator)
What is the difference between yield and yield from? When would you use yield from?
A lambda is an anonymous, single-expression function: lambda args: expression. It's syntactic sugar for small, throwaway functions.
map(func, iterable) applies a function to every item. filter(func, iterable) keeps items where the function returns True. reduce(func, iterable) (from functools) cumulatively combines items left to right.
In modern Python, list comprehensions are usually preferred over map/filter for readability. But lambda is still useful for: sort keys, callback functions, and functional programming patterns.
# Lambda for sorting — sort employees by salary descending
employees = [
    {"name": "Priya", "salary": 85000},
    {"name": "Rahul", "salary": 72000},
    {"name": "Sneha", "salary": 95000},
    {"name": "Arjun", "salary": 68000},
]
top_earners = sorted(employees, key=lambda e: e["salary"], reverse=True)
for e in top_earners:
    print(f"{e['name']:.<15} ₹{e['salary']:>8,}")
# Sneha.......... ₹  95,000
# Priya.......... ₹  85,000
# Rahul.......... ₹  72,000
# Arjun.......... ₹  68,000
# map + filter — process transaction amounts
transactions = [1200, -500, 3400, -150, 8900, 200]
credits = list(filter(lambda t: t > 0, transactions))
with_tax = list(map(lambda t: round(t * 1.18, 2), credits))
print(f"Credits with 18% GST: {with_tax}")
# Credits with 18% GST: [1416.0, 4012.0, 10502.0, 236.0]
# Equivalent list comprehension (preferred)
with_tax = [round(t * 1.18, 2) for t in transactions if t > 0]
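reduce() is described above but not shown. A minimal sketch, assuming a hypothetical list of configuration layers where later layers override earlier ones:

```python
from functools import reduce

# Each layer overrides keys from the layers before it
layers = [
    {"debug": False, "timeout": 30},
    {"timeout": 10},
    {"debug": True},
]

# acc starts as {} and absorbs one layer per step, left to right
merged = reduce(lambda acc, layer: {**acc, **layer}, layers, {})
print(merged)  # {'debug': True, 'timeout': 10}

# Classic numeric use: running product
product = reduce(lambda a, b: a * b, [1, 2, 3, 4], 1)
print(product)  # 24
```

The third argument is the initial value, which also makes the call safe for an empty list.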
A data team used sorted() with a lambda key to rank 50K customer records by a composite score (recency × frequency × monetary). This one-liner replaced a 30-line custom comparator class, and the sort ran in 0.08 seconds.
Candidates try to cram complex logic into lambda. A lambda holds a single expression only: no statements and no assignments. Conditional expressions are allowed, but chaining them quickly becomes unreadable:
process = lambda x: x*2 if x > 0 else (x*-1 if x < -100 else 0)
# Clearer as a named function:
def process(x):
    if x > 0:
        return x * 2
    elif x < -100:
        return x * -1
    return 0
What is functools.reduce() and can you give a practical example where it's better than a loop?
*args collects extra positional arguments into a tuple. **kwargs collects extra keyword arguments into a dict. The names "args" and "kwargs" are conventions — it's the * and ** that matter.
Parameter order matters: def func(regular, *args, keyword_only, **kwargs). Anything after *args must be passed as a keyword argument. This is how Python enforces keyword-only parameters.
The * and ** operators also work for unpacking — *list unpacks a list into positional args, **dict unpacks a dict into keyword args.
def create_html_tag(tag, *children, class_name=None, **attrs):
    """Build an HTML tag with flexible content and attributes."""
    attr_str = ""
    if class_name:
        attr_str += f' class="{class_name}"'
    for key, val in attrs.items():
        attr_str += f' {key.rstrip("_")}="{val}"'
    content = "".join(str(c) for c in children)
    return f"<{tag}{attr_str}>{content}</{tag}>"
# Usage — flexible API
print(create_html_tag("h1", "Welcome"))
# <h1>Welcome</h1>
print(create_html_tag("div", "Hello ", "World",
                      class_name="greeting", id_="main"))
# <div class="greeting" id="main">Hello World</div>
# Unpacking with * and **
config = {"class_name": "btn", "data_action": "submit"}
print(create_html_tag("button", "Click Me", **config))
# <button class="btn" data_action="submit">Click Me</button>
# Forwarding args — common in wrappers
def log_and_call(func, *args, **kwargs):
    print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
    return func(*args, **kwargs)
The entire Django framework uses **kwargs extensively in its ORM — Model.objects.filter(**conditions) accepts any combination of field lookups. This pattern lets 500K+ Django projects query databases with flexible filters without Django knowing every possible field name in advance.
Candidates confuse the order. Python requires: def f(positional, *args, keyword_only, **kwargs). Putting **kwargs before *args is a syntax error. Also, *args and **kwargs capture extra arguments — named parameters still take priority.
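The ordering rule can be illustrated with a short sketch. The connect function and its parameters are hypothetical; the point is that timeout follows *args, so it can only be supplied by name.

```python
def connect(host, *args, timeout, **kwargs):
    """Hypothetical function: timeout is keyword-only because it follows *args."""
    return {"host": host, "extra": args, "timeout": timeout, "options": kwargs}


print(connect("db.local", 5432, timeout=5, ssl=True))
# {'host': 'db.local', 'extra': (5432,), 'timeout': 5, 'options': {'ssl': True}}

try:
    connect("db.local", 5432, 5)   # the 5 lands in *args, so timeout is missing
except TypeError as exc:
    print(f"TypeError: {exc}")
```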
How do keyword-only arguments work in Python 3? How do you enforce them?
A context manager is an object that defines __enter__ (setup) and __exit__ (cleanup) methods. The with statement guarantees cleanup even if an exception occurs — like try/finally but cleaner.
Common built-in context managers: file objects, threading locks, database connections, decimal.localcontext(). You can create custom ones with a class (define __enter__/__exit__) or with @contextlib.contextmanager decorator (yield-based — simpler).
The __exit__ method receives exception info (type, value, traceback). Returning True suppresses the exception.
import contextlib
import time
import sqlite3
# Custom context manager using decorator
@contextlib.contextmanager
def timer(label):
    """Time a block of code."""
    start = time.perf_counter()
    try:
        yield  # code inside 'with' runs here
    finally:
        elapsed = time.perf_counter() - start
        print(f"⏱ {label}: {elapsed:.4f}s")
# Usage
with timer("Data processing"):
    total = sum(x**2 for x in range(1_000_000))
# Output: ⏱ Data processing: 0.0823s
# Database transaction context manager
@contextlib.contextmanager
def db_transaction(db_path):
    """Auto-commit on success, rollback on failure."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    try:
        yield cursor
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
# Usage: auto-handles commit/rollback/close
with db_transaction("app.db") as cursor:
    cursor.execute("INSERT INTO users (name) VALUES (?)", ("Priya",))
At a trading platform, wrapping database operations in a custom transaction context manager eliminated 15 "connection leak" incidents per month. Previously, developers forgot conn.close() in 3 out of 40 database functions, causing connection pool exhaustion under load.
Candidates write context managers that swallow exceptions silently by returning True from __exit__. Only suppress exceptions if you truly handle them — otherwise bugs become invisible.
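A class-based context manager makes the __exit__ contract visible. In this sketch SuppressKeyError is a hypothetical class that suppresses only KeyError and lets everything else propagate; the standard library's contextlib.suppress offers similar behaviour.

```python
class SuppressKeyError:
    """Hypothetical context manager that suppresses only KeyError."""

    def __enter__(self):
        self.suppressed = False
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None when the block finished without an exception
        if exc_type is not None and issubclass(exc_type, KeyError):
            self.suppressed = True
            return True    # swallow KeyError only
        return False       # let every other exception propagate


with SuppressKeyError() as guard:
    {}["missing"]          # raises KeyError, which __exit__ suppresses
print(guard.suppressed)    # True

try:
    with SuppressKeyError():
        int("not a number")
except ValueError:
    print("ValueError propagated")
```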
Can you nest multiple context managers? What does contextlib.ExitStack do?
Python's re module provides regex support. Key functions: re.match() checks the start of a string, re.search() finds the first match anywhere, re.findall() returns all matches, re.sub() replaces matches, re.compile() pre-compiles a pattern for reuse.
Use raw strings r"pattern" to avoid escaping backslashes. Named groups (?P<name>...) make matches self-documenting. For performance, compile patterns used in loops with re.compile().
import re
# Extract structured data from log lines
log_pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<level>INFO|WARN|ERROR) '
    r'(?P<message>.+)'
)
log_lines = [
    "2025-01-15 14:23:01 ERROR Database connection timeout after 30s",
    "2025-01-15 14:23:05 INFO Retry successful, connected to replica",
    "2025-01-15 14:24:00 WARN Memory usage at 85% threshold",
]
errors = []
for line in log_lines:
    match = log_pattern.match(line)
    if match and match.group("level") == "ERROR":
        errors.append({
            "date": match.group("date"),
            "message": match.group("message"),
        })
print(f"Found {len(errors)} errors")
# Found 1 errors
# Validate and extract email parts
email_pattern = r"^(?P<user>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$"
email = "priya.sharma@company.co.in"
m = re.match(email_pattern, email)
if m:
    print(f"User: {m.group('user')}, Domain: {m.group('domain')}")
# User: priya.sharma, Domain: company.co.in
A security team used compiled regex to scan 2M email templates for potential XSS patterns. The scan ran in 8 seconds with re.compile() vs 95 seconds without — compilation overhead is amortized when the pattern is reused in loops.
Candidates use re.match() when they mean re.search(). match() only checks the start of the string, while search() finds a match anywhere in the string. This trips up candidates consistently.
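The difference shows up on any string where the target is not at the start. A minimal sketch with a made-up log line:

```python
import re

line = "2025-01-15 ERROR disk full"

# match() anchors at position 0, so a pattern found later in the string fails
print(re.match(r"ERROR", line))            # None

# search() scans the whole string
print(re.search(r"ERROR", line).group())   # ERROR

# match() succeeds when the pattern really is at the start
print(re.match(r"\d{4}", line).group())    # 2025
```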
What is the difference between greedy and non-greedy matching? Give an example where it matters.
Python's built-in json module handles JSON serialization (Python → JSON string) and deserialization (JSON string → Python). Key functions: json.dumps() converts dict/list to JSON string, json.loads() parses JSON string to dict/list, json.dump()/json.load() work with files.
JSON maps to Python types: object → dict, array → list, string → str, number → int/float, true/false → True/False, null → None. For custom objects, you need a custom encoder/decoder.
import json
from datetime import datetime
# API response handling
api_response = '''
{
    "user_id": 1042,
    "name": "Priya Sharma",
    "orders": [
        {"id": "ORD-5001", "amount": 2499.00, "status": "delivered"},
        {"id": "ORD-5023", "amount": 899.50, "status": "shipped"}
    ],
    "is_premium": true,
    "last_login": null
}
'''
# Parse JSON
user = json.loads(api_response)
total_spent = sum(order["amount"] for order in user["orders"])
print(f"{user['name']} spent ₹{total_spent:,.2f}")
# Priya Sharma spent ₹3,398.50
# Custom encoder for non-serializable types
class AppEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)
# Serialize with custom encoder
event = {
"event": "purchase",
"timestamp": datetime.now(),
"amount": 1299.00,
}
json_str = json.dumps(event, cls=AppEncoder, indent=2)
print(json_str)
# Save to file
with open("user_data.json", "w", encoding="utf-8") as f:
    json.dump(user, f, indent=2, ensure_ascii=False)
A REST API backend processes 50K JSON payloads per minute. Using json.loads() with strict validation (checking required keys, types, and ranges) before database insertion prevented 3,000+ malformed records per day from entering the system.
Candidates confuse json.dumps() (to string) with json.dump() (to file). Also, single-quoted strings are invalid JSON: Python's repr of a dict uses single quotes, but JSON requires double quotes. json.dumps() always emits valid double-quoted JSON, so never build JSON with str(my_dict).
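Both points can be checked in a few lines. This is a small sketch with a made-up record; io.StringIO stands in for a real file so the example has no side effects.

```python
import io
import json

record = {"name": "Priya", "is_premium": True, "last_login": None}

# dumps() returns a str; note the double quotes, true, and null
as_text = json.dumps(record)
print(as_text)  # {"name": "Priya", "is_premium": true, "last_login": null}

# dump() writes to a file-like object instead of returning a string
buffer = io.StringIO()  # stands in for an open file
json.dump(record, buffer)

# str(dict) is a Python repr, not JSON: the single quotes make loads() fail
try:
    json.loads(str(record))
    parsed_repr = True
except json.JSONDecodeError:
    parsed_repr = False

print(parsed_repr)  # False
```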
How would you handle very large JSON files (5+ GB) that don't fit in memory?
A virtual environment is an isolated Python environment with its own installed packages, separate from the system Python and other projects. This prevents "dependency hell", where Project A needs requests==2.28 and Project B needs requests==2.31.
Create with python -m venv myenv, activate with source myenv/bin/activate (Linux/Mac) or myenv\Scripts\activate (Windows). pip freeze > requirements.txt captures exact versions. pip install -r requirements.txt recreates the environment.
Modern alternatives: pipenv (Pipfile.lock), poetry (pyproject.toml), conda (data science), uv (Rust-based, fastest).
# Create and activate virtual environment
# Terminal commands:
# python -m venv project_env
# source project_env/bin/activate (Linux/Mac)
# project_env\Scripts\activate (Windows)
# Install project dependencies
# pip install flask==3.0.0 sqlalchemy==2.0.23 redis==5.0.1
# Freeze exact versions for reproducibility
# pip freeze > requirements.txt
# requirements.txt looks like:
# flask==3.0.0
# sqlalchemy==2.0.23
# redis==5.0.1
# jinja2==3.1.2 (auto-installed dependency)
# werkzeug==3.0.1 (auto-installed dependency)
# Teammate recreates identical environment:
# python -m venv their_env
# source their_env/bin/activate
# pip install -r requirements.txt
# Verify isolation
import sys
print(sys.prefix) # /path/to/project_env (not system Python)
print(sys.executable) # /path/to/project_env/bin/python
# Deactivate when done
# deactivate
A team of 12 developers had "works on my machine" bugs every sprint — different Flask versions, different OS-level packages. After enforcing virtual environments with pinned requirements.txt in CI/CD, deployment failures dropped from 8/month to zero.
Candidates install packages globally with pip install without activating a virtual environment. This causes version conflicts between projects. Another mistake: using pip freeze in the global environment captures every package ever installed.
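One way to guard against the global-install mistake is to check at runtime whether a virtual environment is active. The helper name in_virtualenv is hypothetical; the check itself relies on sys.prefix and sys.base_prefix, which differ inside an environment created by venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if not in_virtualenv():
    print("Warning: no virtual environment active; pip would install globally")
```

A setup script could call this before installing anything and refuse to continue outside an environment.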
What is the difference between venv, virtualenv, pipenv, and poetry? When would you choose each?
map(func, iterable) applies a function to every element and returns a map object (lazy iterator). filter(func, iterable) returns the elements for which func returns a truthy value. Both are lazy: they don't compute until iterated.
List comprehensions [expr for x in iterable if cond] can replace most map/filter uses and are generally more Pythonic and readable. However, map() with an existing function (especially a C-implemented built-in such as str or int) can be slightly faster, because the iteration happens in C rather than in interpreted bytecode. With a lambda that advantage usually disappears, since every element still costs a Python-level function call.
The choice mostly comes down to readability: if you already have a named function, use map(). If you'd need a lambda, prefer a comprehension.
# map() with named function — cleaner than lambda
prices_usd = [29.99, 49.50, 124.00, 9.99, 299.00]
def usd_to_inr(usd, rate=83.5):
    return round(usd * rate, 2)
prices_inr = list(map(usd_to_inr, prices_usd))
print(prices_inr)
# [2504.17, 4133.25, 10354.0, 834.17, 24966.5]
# filter() with named function
def is_high_value(amount):
    return amount > 5000
high_value = list(filter(is_high_value, prices_inr))
print(f"High-value items: {high_value}")
# High-value items: [10354.0, 24966.5]
# Equivalent list comprehension — often preferred
prices_inr = [round(p * 83.5, 2) for p in prices_usd]
high_value = [p for p in prices_inr if p > 5000]
# Performance comparison — map with named func is fastest
import timeit
data = list(range(100_000))
t1 = timeit.timeit(lambda: list(map(str, data)), number=50)
t2 = timeit.timeit(lambda: [str(x) for x in data], number=50)
print(f"map: {t1:.3f}s, comprehension: {t2:.3f}s")
# map: 0.682s, comprehension: 0.751s (map is ~10% faster)
In a batch ETL pipeline transforming 8M records, using map() with a pre-defined transform function was 12% faster than the equivalent list comprehension — the difference between a 4-minute and 4.5-minute nightly job.
Candidates forget that map() and filter() return lazy iterators, not lists. Wrapping in list() is needed to see the results or get the length. Also, chaining multiple map/filter calls is less readable than a single comprehension.
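The laziness is easy to observe directly. In this short sketch (the numbers are arbitrary), the chained iterator yields its results once and is then exhausted.

```python
numbers = [1, 2, 3, 4, 5, 6]

evens = filter(lambda n: n % 2 == 0, numbers)  # nothing computed yet
squares = map(lambda n: n * n, evens)          # still lazy

# len(squares) would raise TypeError: map objects have no length
first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # already exhausted

print(first_pass)   # [4, 16, 36]
print(second_pass)  # []

# Same result in one comprehension, producing a reusable list
squares_list = [n * n for n in numbers if n % 2 == 0]
print(squares_list)  # [4, 16, 36]
```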
What is itertools and how is it different from map/filter? Name three useful itertools functions.
The GIL is a mutex in CPython that allows only one thread to execute Python bytecode at a time, even on multi-core machines. It exists because CPython's memory management (reference counting) is not thread-safe.
Impact: CPU-bound tasks (math, image processing) get no speedup from threading — threads take turns on one core. I/O-bound tasks (network requests, file reads, database queries) do benefit because the GIL is released during I/O waits.
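Workarounds: multiprocessing (separate processes, each with its own GIL), concurrent.futures.ProcessPoolExecutor (a higher-level interface to the same idea), C extensions that release the GIL (NumPy does this for many operations), or the experimental free-threaded build of CPython 3.13+ (PEP 703). Note that PyPy also has a GIL, so switching to it does not by itself give multi-threaded parallelism.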
import time
import threading
import multiprocessing
def cpu_bound_task(n):
    """Simulate heavy CPU work: sum the squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total
N = 10_000_000
# The __main__ guard is needed for multiprocessing on platforms that start
# workers with "spawn" (Windows, macOS): each child re-imports this module.
if __name__ == "__main__":
    # Single-threaded
    start = time.perf_counter()
    cpu_bound_task(N)
    cpu_bound_task(N)
    print(f"Sequential: {time.perf_counter() - start:.2f}s")
    # Multi-threaded: NO speedup due to GIL
    start = time.perf_counter()
    t1 = threading.Thread(target=cpu_bound_task, args=(N,))
    t2 = threading.Thread(target=cpu_bound_task, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(f"Threaded (GIL): {time.perf_counter() - start:.2f}s")
    # Multi-process: real parallelism, bypasses GIL
    start = time.perf_counter()
    p1 = multiprocessing.Process(target=cpu_bound_task, args=(N,))
    p2 = multiprocessing.Process(target=cpu_bound_task, args=(N,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print(f"Multiprocess: {time.perf_counter() - start:.2f}s")
# Typical output:
# Sequential: 3.42s
# Threaded (GIL): 3.51s ← no speedup!
# Multiprocess: 1.78s ← real 2x speedup
At a computer vision startup, an image processing pipeline using threading to process 10K images took 45 minutes (GIL bottleneck). Switching to multiprocessing.Pool(processes=8) on an 8-core server reduced it to 6 minutes, a 7.5x speedup.
Candidates say "Python can't do parallel processing." This is wrong: threading is limited by the GIL for CPU work, but multiprocessing gives full parallelism. Also, NumPy, pandas, and many other C extensions release the GIL during much of their heavy computation.
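The other half of the GIL story, that threads do help with I/O-bound work, can be sketched with concurrent.futures. Here fake_io is a hypothetical stand-in for a network call; time.sleep() releases the GIL the same way a real wait would.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id):
    """Hypothetical I/O-bound task; sleep() releases the GIL like a real wait."""
    time.sleep(0.2)
    return task_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order in its results
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results)  # [0, 1, 2, 3]
print(f"4 tasks of 0.2s each finished in {elapsed:.2f}s")
```

Run sequentially, the four tasks would need about 0.8s; with four threads the waits overlap. Swapping in ProcessPoolExecutor gives the same interface for CPU-bound work.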
What is the difference between multiprocessing and concurrent.futures? Which should you choose?
A metaclass is the "class of a class." Just as an object is an instance of a class, a class is an instance of a metaclass. The default metaclass is type. When you write class Foo:, Python calls type('Foo', (object,), {...}) to create the class.
By defining a custom metaclass (inheriting from type and overriding __new__ or __init__), you can control class creation — validate attributes, auto-register classes, enforce coding standards, or add methods dynamically.
Use metaclasses sparingly — 99% of the time, decorators or __init_subclass__ (Python 3.6+) are simpler alternatives.
# Metaclass that auto-registers all subclasses
class PluginMeta(type):
    """Metaclass that maintains a registry of all plugin classes."""
    registry = {}
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Don't register the base class itself
        if bases:
            PluginMeta.registry[name.lower()] = cls
        return cls
class Plugin(metaclass=PluginMeta):
    """Base class: all subclasses auto-register."""
    def execute(self):
        raise NotImplementedError
class CSVExporter(Plugin):
    def execute(self):
        return "Exporting to CSV..."
class PDFExporter(Plugin):
    def execute(self):
        return "Generating PDF..."
class ExcelExporter(Plugin):
    def execute(self):
        return "Writing Excel file..."
# All plugins auto-discovered — no manual registration needed
print(PluginMeta.registry)
# {'csvexporter': <class CSVExporter>, 'pdfexporter': <class PDFExporter>, ...}
# Dynamic dispatch
exporter = PluginMeta.registry["pdfexporter"]()
print(exporter.execute()) # Generating PDF...
Django's ORM uses a metaclass (ModelBase) to convert class attributes into database column definitions. When you write class User(models.Model): name = CharField(), the metaclass intercepts this at class creation time and builds the SQL schema — this powers 500K+ Django apps worldwide.
Candidates overuse metaclasses for problems that decorators or __init_subclass__ can solve. Tim Peters (author of Zen of Python) said: "Metaclasses are deeper magic than 99% of users should ever worry about." Use them only when you need to control class creation itself.
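For comparison, the same auto-registration can be sketched with __init_subclass__ and no metaclass at all. The class names mirror the plugin example for illustration only.

```python
class Plugin:
    """Base class that registers subclasses without a custom metaclass."""
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once for every subclass, at class creation time
        Plugin.registry[cls.__name__.lower()] = cls

    def execute(self):
        raise NotImplementedError


class CSVExporter(Plugin):
    def execute(self):
        return "Exporting to CSV..."


class PDFExporter(Plugin):
    def execute(self):
        return "Generating PDF..."


print(sorted(Plugin.registry))                     # ['csvexporter', 'pdfexporter']
print(Plugin.registry["pdfexporter"]().execute())  # Generating PDF...
```

Because the hook only fires for subclasses, the base class stays out of the registry without an explicit check.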
What is __init_subclass__ and how does it provide a simpler alternative to metaclasses?
A descriptor is any object that defines __get__, __set__, or __delete__. When a descriptor is a class attribute, Python intercepts attribute access and calls the descriptor's methods instead.
Data descriptors define __set__ or __delete__ — they take priority over instance __dict__. Non-data descriptors only define __get__ — instance __dict__ takes priority. @property is a data descriptor under the hood. Functions are non-data descriptors (that's how methods bind self).
Descriptors are the mechanism behind @property, @staticmethod, @classmethod, __slots__, and super().
# Custom descriptor for validated attributes
class Percentage:
    """Descriptor that ensures a value is between 0 and 100."""
    def __set_name__(self, owner, name):
        self.name = name
        self.storage_name = f"__{name}"
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.storage_name, 0)
    def __set__(self, obj, value):
        if not isinstance(value, (int, float)):
            raise TypeError(f"{self.name} must be a number")
        if not 0 <= value <= 100:
            raise ValueError(f"{self.name} must be 0-100, got {value}")
        setattr(obj, self.storage_name, value)
class StudentReport:
    math_score = Percentage()
    science_score = Percentage()
    english_score = Percentage()
    def __init__(self, name, math, science, english):
        self.name = name
        self.math_score = math
        self.science_score = science
        self.english_score = english
    @property
    def average(self):
        return (self.math_score + self.science_score + self.english_score) / 3
# Usage
report = StudentReport("Priya", 92, 88, 95)
print(f"{report.name}: Average = {report.average:.1f}%")
# Priya: Average = 91.7%
# Validation works automatically
# report.math_score = 150 # ValueError: math_score must be 0-100
SQLAlchemy uses descriptors for every Column() definition — Column(Integer) creates a descriptor that validates types, handles lazy loading, and tracks changes for the unit-of-work pattern, processing millions of attribute accesses efficiently.
Candidates confuse descriptors with @property. @property handles one attribute per class. Descriptors are reusable — define once, use on many attributes across many classes. If you're copy-pasting @property + validation 10 times, use a descriptor instead.
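The priority rule for data versus non-data descriptors can be verified with a small experiment. The class names below are invented for the demonstration; values are written straight into the instance __dict__ to see which side wins on lookup.

```python
class NonData:
    """Defines only __get__, so the instance __dict__ wins."""
    def __get__(self, obj, objtype=None):
        return "from descriptor"


class Data:
    """Defines __get__ and __set__, so it wins over the instance __dict__."""
    def __get__(self, obj, objtype=None):
        return "from descriptor"

    def __set__(self, obj, value):
        raise AttributeError("read-only")


class Demo:
    non_data = NonData()
    data = Data()


d = Demo()
# Write straight into the instance dict, bypassing normal attribute setting
d.__dict__["non_data"] = "from instance"
d.__dict__["data"] = "from instance"

print(d.non_data)  # from instance
print(d.data)      # from descriptor
```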
What is the descriptor lookup chain? What happens when you access obj.attr — what does Python check and in what order?
CPython uses two mechanisms for memory management:
1. Reference counting: Every object has a count of references pointing to it. When the count drops to zero, the object is immediately freed. This handles most memory cleanup.
2. Cycle collector (gc module): Reference counting can't handle circular references (A → B → A). Python's garbage collector runs periodically to detect and collect these cycles using a generational approach (3 generations: gen0 for new objects, gen1, gen2 for long-lived).
You can inspect and control GC with the gc module: gc.collect(), gc.get_referrers(), gc.disable().
import sys
import gc
# Reference counting in action
a = [1, 2, 3]
print(sys.getrefcount(a)) # 2 (a + function argument)
b = a # another reference
print(sys.getrefcount(a)) # 3
del b # remove one reference
print(sys.getrefcount(a)) # 2
# Circular reference — ref counting can't free this
class Node:
    def __init__(self, name):
        self.name = name
        self.next = None
    def __del__(self):
        print(f"Node {self.name} freed")
# Create cycle: A → B → A
node_a = Node("A")
node_b = Node("B")
node_a.next = node_b
node_b.next = node_a # circular!
# Delete references: ref count never hits 0 due to cycle
del node_a
del node_b
# Nothing printed yet: the cycle keeps both objects alive
# Force garbage collection to find and free the cycle
collected = gc.collect()
print(f"GC collected {collected} objects")
# Node A freed
# Node B freed
# GC collected 2 objects (the exact count can vary by Python version)
# Check GC stats
print(gc.get_stats())
# [{'collections': 95, 'collected': 312, 'uncollectable': 0}, ...]
A long-running data pipeline had a memory leak — RSS grew by 200 MB/hour. Using gc.get_referrers() and objgraph, the team found a circular reference in a caching layer (Cache → Entry → Cache). Adding weakref.WeakValueDictionary eliminated the leak entirely.
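Candidates say "Python has garbage collection so I don't need to think about memory." In reality, objects stay alive as long as anything still references them (caches, globals, registries), reference cycles are reclaimed only when the cycle collector runs, and before Python 3.4 a __del__ method on an object in a cycle made that cycle uncollectable. Long-running processes still need monitoring. Also, del x doesn't free memory directly: it removes the name and decrements the reference count, and the object is freed only when the count reaches zero.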
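A weak reference lets a cache point at an object without keeping it alive. The following is a minimal sketch with a hypothetical Entry class and cache key; it shows the entry vanishing from a WeakValueDictionary once the last strong reference is gone.

```python
import gc
import weakref


class Entry:
    """Hypothetical cache entry."""
    def __init__(self, key):
        self.key = key


cache = weakref.WeakValueDictionary()

entry = Entry("user:1042")
cache["user:1042"] = entry  # the cache holds only a weak reference
present_before = "user:1042" in cache

del entry     # drop the last strong reference
gc.collect()  # not required on CPython (refcounting frees it), harmless elsewhere
present_after = "user:1042" in cache

print(present_before, present_after)  # True False
```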
What are weak references and when should you use weakref.WeakValueDictionary?
async/await enables cooperative multitasking for I/O-bound operations. An async def function is a coroutine. await suspends the coroutine and lets the event loop run other tasks while waiting for I/O.
The event loop (started with asyncio.run()) manages coroutines, scheduling them when their I/O completes. asyncio.gather() runs multiple coroutines concurrently. This is not parallelism: it's concurrency through cooperative yielding on a single thread.
asyncio works best for network I/O: HTTP requests, database queries, websockets. Ordinary file I/O is blocking and has no native asyncio support, so it is usually pushed to a thread (for example with asyncio.to_thread()). asyncio does not help CPU-bound tasks (use multiprocessing for those).
import asyncio
import time
async def fetch_user(user_id):
    """Simulate API call: each takes 1 second."""
    print(f" Fetching user {user_id}...")
    await asyncio.sleep(1) # simulates network I/O
    return {"id": user_id, "name": f"User_{user_id}"}
async def fetch_orders(user_id):
    """Simulate database query: each takes 0.5 seconds."""
    await asyncio.sleep(0.5)
    return [{"order_id": f"ORD-{user_id}-1", "amount": 1299}]
# Sequential: slow (3 users × 1s = 3s)
async def sequential():
    start = time.perf_counter()
    for uid in [1, 2, 3]:
        user = await fetch_user(uid)
    print(f"Sequential: {time.perf_counter() - start:.2f}s")
# Concurrent: fast (3 users at once = ~1s)
async def concurrent():
    start = time.perf_counter()
    users = await asyncio.gather(
        fetch_user(1),
        fetch_user(2),
        fetch_user(3),
    )
    # Fetch orders for all users concurrently too
    all_orders = await asyncio.gather(
        *[fetch_orders(u["id"]) for u in users]
    )
    print(f"Concurrent: {time.perf_counter() - start:.2f}s")
    print(f"Fetched {len(users)} users with orders")
asyncio.run(sequential()) # ~3.00s
asyncio.run(concurrent()) # ~1.50s, 2x faster!
A price comparison API that fetched prices from 12 vendor APIs switched from sequential requests (12 × 0.8s = 9.6s) to asyncio.gather (all 12 concurrent = 1.1s). Response time dropped from 10s to 1.2s, and the server handled 5x more concurrent users.
Candidates confuse concurrency with parallelism. asyncio runs on one thread — it doesn't bypass the GIL. It speeds up I/O waits by doing other work while waiting. For CPU-bound parallelism, use multiprocessing.
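The single-thread claim can be checked by recording which thread each task runs on. This is a small sketch; the worker names and delays are arbitrary.

```python
import asyncio
import threading


async def worker(name, delay):
    await asyncio.sleep(delay)          # yields control to the event loop
    return name, threading.get_ident()  # record which thread ran this task


async def main():
    # gather() returns results in the order the awaitables were passed
    return await asyncio.gather(
        worker("a", 0.05),
        worker("b", 0.05),
        worker("c", 0.05),
    )


results = asyncio.run(main())
thread_ids = {tid for _, tid in results}
print(len(thread_ids))  # 1
```

All three tasks overlap their waits, yet every one reports the same thread id: concurrency without parallelism.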
What is the difference between asyncio.gather() and asyncio.TaskGroup? How do you handle errors in concurrent tasks?
Abstract Base Classes (from the abc module) let you define interfaces — classes that cannot be instantiated and that force subclasses to implement specific methods. Use ABC as a base class and @abstractmethod to mark methods that must be overridden.
ABCs are Python's way of establishing contracts: "if you inherit from this base class, you must implement these methods." An incomplete subclass fails as soon as it is instantiated, rather than later when the missing method is finally called.
Python also provides built-in ABCs in collections.abc (Iterable, Mapping, Sequence) for duck-typing validation.
from abc import ABC, abstractmethod
class PaymentGateway(ABC):
    """Abstract interface: all payment providers must implement these."""
    @abstractmethod
    def charge(self, amount, currency="INR"):
        """Process a payment. Must return transaction ID."""
        pass
    @abstractmethod
    def refund(self, transaction_id, amount=None):
        """Refund a payment. amount=None means full refund."""
        pass
    def validate_amount(self, amount):
        """Concrete method shared by all subclasses."""
        if amount <= 0:
            raise ValueError(f"Amount must be positive, got {amount}")
class RazorpayGateway(PaymentGateway):
    def charge(self, amount, currency="INR"):
        self.validate_amount(amount)
        # Razorpay-specific API call
        return f"rzp_txn_{amount}"
    def refund(self, transaction_id, amount=None):
        return f"rzp_refund_{transaction_id}"
class StripeGateway(PaymentGateway):
    def charge(self, amount, currency="USD"):
        self.validate_amount(amount)
        return f"stripe_pi_{amount}"
    def refund(self, transaction_id, amount=None):
        return f"stripe_refund_{transaction_id}"
# Cannot instantiate abstract class
# gateway = PaymentGateway() # TypeError!
# Polymorphism: same interface, different implementations
def process_order(gateway: PaymentGateway, amount: float):
    txn_id = gateway.charge(amount)
    print(f"Charged ₹{amount}, Transaction: {txn_id}")
process_order(RazorpayGateway(), 2499.00)
process_order(StripeGateway(), 49.99)
A payment platform integrated 6 gateways (Razorpay, Stripe, PayU, Paytm, PhonePe, CCAvenue). The PaymentGateway ABC ensured every new integration implemented charge(), refund(), and verify() — catching 3 missing-method bugs during development instead of in production.
Candidates forget that a missing @abstractmethod implementation raises TypeError at instantiation, not at class definition. If you define the class but never instantiate it, you won't see the error, so tests must create instances.
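The timing of the error is worth seeing once. In this sketch the Gateway, HalfDone, and Basket classes are invented for the demonstration; the last part also shows a collections.abc check that needs no inheritance.

```python
from abc import ABC, abstractmethod
from collections.abc import Sized


class Gateway(ABC):
    @abstractmethod
    def charge(self, amount): ...

    @abstractmethod
    def refund(self, transaction_id): ...


# Defining an incomplete subclass succeeds silently
class HalfDone(Gateway):
    def charge(self, amount):
        return f"txn_{amount}"


try:
    HalfDone()  # the error appears only here
    error = None
except TypeError as exc:
    error = str(exc)

print(error is not None)  # True


# collections.abc checks structure: any class with __len__ counts as Sized
class Basket:
    def __len__(self):
        return 3


print(isinstance(Basket(), Sized))  # True
```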
What is duck typing and how does it relate to ABCs? Can you use ABCs with duck typing?
By default, Python stores instance attributes in a __dict__ (a dictionary per instance). __slots__ replaces this dict with a fixed-size struct, which uses significantly less memory and is slightly faster for attribute access.
Define __slots__ = ('name', 'age') to restrict instances to only those attributes. Benefits: typically 30-50% less memory per instance (more for small objects, and the exact figure depends on the Python version) and roughly 10% faster attribute access. Trade-offs: no dynamic attribute addition, complications with multiple inheritance, no __dict__ by default.
Use __slots__ when you have millions of instances of the same class (data processing, game entities, ORM rows).
import sys
# Without __slots__ — each instance has a dict
class PointDict:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# With __slots__: fixed layout, no dict
class PointSlots:
    __slots__ = ('x', 'y', 'z')
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# Memory comparison (sizes below are illustrative; they vary by Python version)
p_dict = PointDict(1.0, 2.0, 3.0)
p_slots = PointSlots(1.0, 2.0, 3.0)
size_dict = sys.getsizeof(p_dict) + sys.getsizeof(p_dict.__dict__)
size_slots = sys.getsizeof(p_slots)
print(f"With __dict__: {size_dict} bytes") # ~152 bytes
print(f"With __slots__: {size_slots} bytes") # ~56 bytes
print(f"Savings: {((size_dict - size_slots) / size_dict * 100):.0f}%")
# Savings: 63%
# Scale: 10 million points
# __dict__: ~1.5 GB
# __slots__: ~560 MB — saves ~1 GB of RAM!
# Trade-off: can't add dynamic attributes
# p_slots.color = "red" # AttributeError!
A real-time IoT platform tracking 5 million sensor readings per minute used __slots__ on SensorReading objects, reducing RAM usage from 4.2 GB to 1.4 GB, the difference between needing an 8 GB server and a 2 GB server ($200/month savings).
Candidates add __slots__ to every class "for performance." This is premature optimization: __slots__ only matters when you have thousands or more instances. For regular classes, the flexibility of __dict__ is worth the small overhead. Also, __slots__ interacts awkwardly with inheritance: if any parent lacks __slots__, instances still get a __dict__, and inheriting from two parents that both define non-empty __slots__ raises TypeError.
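The inheritance behaviour can be sketched in a few lines. All class names here are made up for the demonstration.

```python
class Base:  # no __slots__: instances get a __dict__
    pass


class Child(Base):
    __slots__ = ("x",)  # does not remove the inherited __dict__


class SlottedBase:
    __slots__ = ("x",)


class SlottedChild(SlottedBase):
    __slots__ = ("y",)  # declare only the new attributes


c = Child()
c.color = "red"  # allowed: __dict__ came from Base

s = SlottedChild()
s.x, s.y = 1, 2
try:
    s.color = "red"
    rejected = False
except AttributeError:
    rejected = True

print(hasattr(c, "__dict__"), hasattr(s, "__dict__"), rejected)  # True False True


# Two parents with non-empty __slots__ cannot be combined
class A:
    __slots__ = ("a",)


class B:
    __slots__ = ("b",)


try:
    class C(A, B):
        __slots__ = ()
    layout_conflict = False
except TypeError:
    layout_conflict = True

print(layout_conflict)  # True
```

The memory saving only materialises when every class in the hierarchy declares __slots__.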
Can you combine __slots__ with inheritance? What happens if the parent class doesn't declare __slots__?
A closure is a nested function that remembers the variables from its enclosing scope, even after the outer function has returned. The inner function "closes over" the free variables.
Closures work because Python stores these captured variables in the function's __closure__ attribute as cell objects. The nonlocal keyword (Python 3) lets the inner function modify the captured variable, not just read it.
Closures are the mechanism behind decorators, callback patterns, and factory functions. They provide encapsulation without needing a full class.
# Closure as a factory — configurable validators
def make_range_validator(min_val, max_val, field_name="value"):
    """Factory that returns a validator function."""
    def validate(value):
        if not min_val <= value <= max_val:
            raise ValueError(
                f"{field_name} must be {min_val}-{max_val}, got {value}"
            )
        return value
    return validate # returns inner function with captured min/max
# Create specific validators — each remembers its own range
validate_age = make_range_validator(0, 150, "Age")
validate_score = make_range_validator(0, 100, "Score")
validate_salary = make_range_validator(10000, 10000000, "Salary")
print(validate_age(25)) # 25
print(validate_score(92)) # 92
# validate_age(200) # ValueError: Age must be 0-150, got 200
# Counter closure with nonlocal
def make_counter(initial=0):
    count = initial
    def increment(step=1):
        nonlocal count # modify enclosing variable
        count += step
        return count
    def get():
        return count
    return increment, get
inc, get = make_counter(10)
print(inc()) # 11
print(inc(5)) # 16
print(get()) # 16
# Inspect closure cells (cell order follows __code__.co_freevars, not argument order)
cells = dict(zip(validate_age.__code__.co_freevars,
                 (c.cell_contents for c in validate_age.__closure__)))
print(cells["min_val"]) # 0
print(cells["max_val"]) # 150
A notification system used closures to create per-channel senders: make_sender("slack", webhook_url), make_sender("email", smtp_config). Each returned function captured its own config, eliminating the need for 5 separate sender classes — reduced 300 lines to 50.
The classic closure trap is capturing a loop variable by reference:
funcs = []
for i in range(3):
    funcs.append(lambda: i)
print([f() for f in funcs]) # [2, 2, 2]: all see i=2!
The fix is to bind the current value with a default argument:
funcs = []
for i in range(3):
    funcs.append(lambda i=i: i) # default arg captures current i
print([f() for f in funcs]) # [0, 1, 2]: correct!
How do closures compare to classes for maintaining state? When would you choose one over the other?
Design patterns in Python are simpler than in Java/C++ because of first-class functions, duck typing, and dynamic features. The most commonly used patterns:
Factory Pattern: A function/method that creates and returns objects based on input, hiding instantiation logic.
Strategy Pattern: Pass different algorithms (functions) as arguments — trivial in Python since functions are first-class.
Observer Pattern: Objects subscribe to events and get notified when state changes.
Singleton: Ensure only one instance exists — use a module-level variable (simplest) or metaclass.
Decorator Pattern: Already built into Python's @decorator syntax.
# Strategy Pattern — payment processing with different strategies
from typing import Callable
def process_payment(amount: float, strategy: Callable[[float], str]) -> str:
    """Process payment using the given strategy function."""
    if amount <= 0:
        raise ValueError("Amount must be positive")
    return strategy(amount)
# Strategies are just functions
def upi_payment(amount):
    return f"UPI: ₹{amount:,.2f} debited via UPI ID"
def card_payment(amount):
    fee = amount * 0.02 # 2% processing fee
    return f"Card: ₹{amount + fee:,.2f} charged (includes ₹{fee:,.2f} fee)"
def wallet_payment(amount):
    cashback = min(amount * 0.05, 100) # 5% cashback, max ₹100
    return f"Wallet: ₹{amount:,.2f} paid, ₹{cashback:,.2f} cashback earned"
# Usage — strategy selected at runtime
print(process_payment(2500, upi_payment))
print(process_payment(2500, card_payment))
print(process_payment(2500, wallet_payment))
# Factory Pattern — create exporters based on format
class CSVExporter:
    def export(self, data): return "CSV output..."
class JSONExporter:
    def export(self, data): return "JSON output..."
class ExcelExporter:
    def export(self, data): return "Excel output..."
def create_exporter(format_type):
    """Factory function: hides instantiation logic."""
    exporters = {
        "csv": CSVExporter,
        "json": JSONExporter,
        "excel": ExcelExporter,
    }
    cls = exporters.get(format_type)
    if not cls:
        raise ValueError(f"Unknown format: {format_type}")
    return cls()
exporter = create_exporter("json")
print(exporter.export(data=[1, 2, 3]))
A SaaS billing system used the Strategy pattern for 8 payment methods (UPI, card, wallet, net banking, EMI, BNPL, crypto, bank transfer). Adding a new payment method required writing one function and adding it to a dict — zero changes to the core billing logic. New integrations dropped from 2 days to 2 hours.
Candidates implement Singleton with complex metaclasses when a simple module-level variable does the same thing. In Python, modules are singletons by default — import config always returns the same object. Also, candidates over-pattern: not everything needs a Factory or Abstract Factory.
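The Observer pattern is listed above without an example, so here is one possible sketch. The EventBus class, the event name, and the order payload are all hypothetical; in Python the observers can simply be callables.

```python
class EventBus:
    """Minimal observer: callbacks subscribe to named events."""

    def __init__(self):
        self._subscribers = {}  # event name -> list of callbacks

    def subscribe(self, event, callback):
        self._subscribers.setdefault(event, []).append(callback)

    def publish(self, event, payload):
        # Notify every subscriber in the order they registered
        for callback in self._subscribers.get(event, []):
            callback(payload)


bus = EventBus()
audit_log = []
emails = []

bus.subscribe("order_paid", audit_log.append)
bus.subscribe("order_paid", lambda order: emails.append(f"Receipt for {order['id']}"))

bus.publish("order_paid", {"id": "ORD-5001", "amount": 2499.00})

print(len(audit_log))  # 1
print(emails)          # ['Receipt for ORD-5001']
```

The publisher never needs to know who is listening, which is the same decoupling the Strategy example gets from passing functions around.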
When would you use a class-based approach vs a functional approach for the Strategy pattern in Python?
CPython (the standard Python implementation) executes code in stages:
1. Parsing: Source code → Abstract Syntax Tree (AST). ast module lets you inspect/modify the AST.
2. Compilation: AST → bytecode (.pyc files). Bytecode is a set of instructions for CPython's virtual machine. View with dis module.
3. Execution: The CPython VM (ceval.c) executes bytecode instructions one at a time in a giant switch-case loop. Each stack frame has its own evaluation stack.
Important internals: small integer caching (-5 to 256 are pre-allocated), string interning (common strings are reused), __pycache__ (compiled bytecode storage), and compile-time optimizations (constant folding, which recent CPython versions perform on the AST, plus dead code elimination at the bytecode level).
import dis
import ast
import sys
# 1. View bytecode with dis
def add_numbers(a, b):
    result = a + b
    return result
print("=== Bytecode ===")
dis.dis(add_numbers)
# LOAD_FAST 0 (a)
# LOAD_FAST 1 (b)
# BINARY_ADD (shown as BINARY_OP 0 (+) on Python 3.11+)
# STORE_FAST 2 (result)
# LOAD_FAST 2 (result)
# RETURN_VALUE
# Exact opcode names differ between CPython versions.
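# 2. Small integer caching
a = 256
b = 256
print(a is b) # True: same object (cached)
# Build 257 at runtime; equal literals in one file may be merged by the compiler
c = int("257")
d = int("257")
print(c is d) # False on CPython: different objects (not cached)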
# 3. Inspect the AST
source = "prices = [p * 1.18 for p in products if p > 100]"
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
# 4. Check bytecode file location
import json
print(json.__cached__)
# /usr/lib/python3.11/json/__pycache__/__init__.cpython-311.pyc
# 5. Peephole optimization — constant folding
def constants():
    x = 24 * 60 * 60 # compiler pre-calculates to 86400
    return x
dis.dis(constants)
# LOAD_CONST 1 (86400) ← pre-computed at compile time!
A performance engineer used the dis module to discover that a hot loop was creating unnecessary temporary objects (LOAD_CONST + BUILD_LIST on every iteration). Rewriting to use a pre-allocated list and direct STORE_FAST reduced the loop time from 4.2s to 1.8s on 50M iterations.
Candidates use is to compare values instead of ==. is checks identity (same object in memory), == checks equality (same value). Due to integer caching, 256 is 256 is True but 257 is 257 may be False. Always use == for value comparison.
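The parse, compile, and execute stages described earlier can also be driven by hand with built-ins. This is a minimal sketch; the source string and variable names are arbitrary.

```python
import ast
import dis

source = "total = price * qty"

# Stage 1: parse source text into an AST
tree = ast.parse(source)

# Stage 2: compile the AST into a code object holding bytecode
code = compile(tree, filename="<example>", mode="exec")

# Stage 3: let the VM execute the bytecode in a namespace we control
namespace = {"price": 250, "qty": 4}
exec(code, namespace)

print(type(tree).__name__)  # Module
print(namespace["total"])   # 1000
dis.dis(code)               # instruction names vary by Python version
```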
What is the difference between CPython and PyPy? When would you choose PyPy?
When Python is too slow for a critical section, you can write it in C and call it from Python. Three main approaches:
1. ctypes: Call existing C shared libraries (.so/.dll) from Python — no compilation needed. Good for wrapping existing C code.
2. Cython: Write Python-like code with type annotations, compiled to C. Easiest way to get C performance. .pyx files compile to .so.
3. C API (Python.h): Write raw C extensions using CPython's C API. Most control, most complex. Used by NumPy, pandas internally.
cffi is a modern alternative to ctypes with better ergonomics. pybind11 wraps C++ code for Python.
# Approach 1: ctypes — call existing C library
import ctypes
# Load the system C math library by name instead of hard-coding the file (Linux/macOS)
import ctypes.util
libm = ctypes.CDLL(ctypes.util.find_library("m")) # e.g. libm.so.6 on Linux
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(f"sqrt(144) = {libm.sqrt(144)}") # 12.0
# Approach 2: Cython (save as fast_math.pyx)
# -------------------------------------------
# cpdef (not cdef) so the function can be imported and called from Python
# cpdef double dot_product(double[:] a, double[:] b):
#     cdef int i, n = a.shape[0]
#     cdef double total = 0.0
#     for i in range(n):
#         total += a[i] * b[i]
#     return total
# -------------------------------------------
# Compile: cythonize -i fast_math.pyx
# Use: from fast_math import dot_product
# Alternative: stay in Python and let NumPy's compiled code do the work (releases GIL internally)
import numpy as np
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
# NumPy's dot is written in C/Fortran — blazing fast
result = np.dot(a, b) # 10M element dot product in ~5ms
# Performance comparison
import time
# Pure Python: 10M element dot product = ~4.2 seconds
# NumPy (C): 10M element dot product = ~0.005 seconds
# Speedup: ~840x
A quantitative finance team had a risk calculation in pure Python taking 45 minutes. Rewriting the inner loop (matrix multiplication of 50K × 50K) in Cython with typed memoryviews reduced it to 28 seconds — a 96x speedup. The rest of the codebase stayed in Python.
Candidates jump straight to "rewrite in C" when a simple NumPy vectorization would give the same 100x speedup. Always profile first, optimize the hot path, and try NumPy before writing C code. Also, C extensions can have memory leaks and segfaults — much harder to debug than Python.
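"Profile first" can be done with the standard library alone. Below is a small sketch using cProfile and pstats; the two functions are hypothetical stand-ins for a real workload.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot path: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def pipeline():
    return slow_sum_of_squares(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Send the report to a string buffer, sorted by cumulative time
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)  # show at most 5 entries
print(report.getvalue())
```

Only the functions that dominate this report are candidates for NumPy vectorization or a C extension.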
What is pybind11 and how does it compare to ctypes/Cython for wrapping C++ code?
A production testing strategy has multiple layers:
Unit tests: Test individual functions/classes in isolation. Use pytest with fixtures, parametrize, and mocking. Target: 80%+ coverage on business logic.
Integration tests: Test multiple components together (API + DB, service + message queue). Use test databases and fixtures.
Mocking: unittest.mock.patch replaces external dependencies (APIs, databases, time) so tests are fast and deterministic.
Property-based testing: hypothesis generates random inputs to find edge cases you didn't think of.
CI/CD: Run tests automatically on every commit with pytest + coverage reporting.
import pytest
from unittest.mock import patch, MagicMock
from datetime import datetime
# The code being tested
class PricingEngine:
    def __init__(self, tax_rate=0.18):
        self.tax_rate = tax_rate
    def calculate_total(self, items):
        subtotal = sum(item["price"] * item["qty"] for item in items)
        tax = subtotal * self.tax_rate
        return round(subtotal + tax, 2)
    def apply_coupon(self, total, coupon_code, api_client):
        """Calls external API to validate coupon."""
        discount = api_client.validate_coupon(coupon_code)
        return round(total * (1 - discount / 100), 2)
# --- Tests ---
class TestPricingEngine:
    @pytest.fixture
    def engine(self):
        return PricingEngine(tax_rate=0.18)
    @pytest.fixture
    def sample_items(self):
        return [
            {"name": "Laptop", "price": 50000, "qty": 1},
            {"name": "Mouse", "price": 500, "qty": 2},
        ]
    def test_calculate_total_with_tax(self, engine, sample_items):
        total = engine.calculate_total(sample_items)
        assert total == 60180.0 # (50000 + 1000) * 1.18
    def test_empty_cart(self, engine):
        assert engine.calculate_total([]) == 0.0
    @pytest.mark.parametrize("tax_rate,expected", [
        (0.0, 51000.0),
        (0.05, 53550.0),
        (0.18, 60180.0),
        (0.28, 65280.0),
    ])
    def test_different_tax_rates(self, sample_items, tax_rate, expected):
        engine = PricingEngine(tax_rate=tax_rate)
        assert engine.calculate_total(sample_items) == expected
    def test_apply_coupon_mocked_api(self, engine):
        """Mock external API: don't call the real service in tests."""
        mock_api = MagicMock()
        mock_api.validate_coupon.return_value = 20 # 20% discount
        result = engine.apply_coupon(1000.0, "SAVE20", mock_api)
        assert result == 800.0
        mock_api.validate_coupon.assert_called_once_with("SAVE20")
A fintech team with 200K lines of Python adopted pytest + mocking + CI. Before: 3-4 production bugs per sprint, 2-hour manual testing cycles. After: 0-1 bugs per sprint, 8-minute automated test suite with 87% coverage. The test suite caught a critical rounding bug in tax calculation that would have affected 50K invoices.
Candidates mock too much or too little. Mock external dependencies (APIs, databases, time), but don't mock the code under test. If you mock everything, you're testing your mocks, not your code. Another mistake: testing private methods instead of public behavior.
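To make the "mock time" advice concrete, here is a minimal sketch that uses unittest.mock.patch to freeze time.time() so an expiry check becomes deterministic. The function is_token_expired and its parameters are hypothetical, invented only for this illustration.

```python
import time
from unittest.mock import patch


def is_token_expired(issued_at: float, ttl_seconds: float) -> bool:
    """Hypothetical code under test: its result depends on the wall clock."""
    return time.time() - issued_at > ttl_seconds


def test_token_expiry_with_frozen_clock():
    # time.time is replaced only inside the with-block; the real clock is restored on exit.
    with patch("time.time", return_value=1_000.0):
        assert is_token_expired(issued_at=900.0, ttl_seconds=60)      # 100s old
        assert not is_token_expired(issued_at=950.0, ttl_seconds=60)  # 50s old


test_token_expiry_with_frozen_clock()
```

Only the clock is replaced here; the function under test runs unmodified, which is the balance the paragraph above describes.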
What is the difference between mocking and patching? When should you use each?
Modern Python packaging uses pyproject.toml as the single configuration file (PEP 517/518 define the build system, PEP 621 defines the project metadata), replacing the older setup.py + setup.cfg approach.
Key files: pyproject.toml (metadata, dependencies, build config), src/ layout (prevents import confusion), README.md, LICENSE, tests/.
Build tools: setuptools (traditional), hatchling (modern, fast), poetry (dependency management + build), flit (simplest).
Publish: Build with python -m build, upload with twine upload dist/* to PyPI.
For internal packages, use a private PyPI server (devpi, Artifactory) or direct Git dependencies.
# pyproject.toml — modern Python packaging
# [build-system]
# requires = ["hatchling"]
# build-backend = "hatchling.build"
#
# [project]
# name = "invoice-generator"
# version = "2.1.0"
# description = "Generate GST-compliant invoices"
# readme = "README.md"
# license = {text = "MIT"}
# requires-python = ">=3.9"
# authors = [{name = "Priya", email = "priya@company.com"}]
#
# dependencies = [
# "jinja2>=3.1",
# "weasyprint>=60.0",
# "pydantic>=2.0",
# ]
#
# [project.optional-dependencies]
# dev = ["pytest>=7.0", "ruff>=0.1", "mypy>=1.0"]
#
# [project.scripts]
# invoice = "invoice_generator.cli:main"
# Project structure (src layout):
# invoice-generator/
# ├── pyproject.toml
# ├── README.md
# ├── LICENSE
# ├── src/
# │   └── invoice_generator/
# │       ├── __init__.py
# │       ├── cli.py
# │       ├── generator.py
# │       └── templates/
# └── tests/
#     ├── test_generator.py
#     └── conftest.py
# Build and publish commands:
# pip install build twine
# python -m build # creates dist/*.whl and dist/*.tar.gz
# twine check dist/* # validate package
# twine upload dist/* # upload to PyPI
# pip install invoice-generator # anyone can install it!
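The [project.scripts] entry above maps the invoice command to invoice_generator.cli:main. The source does not show that function, so the following is only a sketch of what such an entry point might look like; the arguments (customer, --amount, --gst-rate) and the output format are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI surface; the real package may expose different options.
    parser = argparse.ArgumentParser(prog="invoice", description="Print an invoice total")
    parser.add_argument("customer", help="customer name")
    parser.add_argument("--amount", type=float, required=True, help="pre-tax amount")
    parser.add_argument("--gst-rate", type=float, default=0.18, help="tax rate as a fraction")
    return parser


def main(argv=None) -> int:
    """Entry point: returns a process exit code, as console scripts expect."""
    args = build_parser().parse_args(argv)  # argv=None falls back to sys.argv[1:]
    total = round(args.amount * (1 + args.gst_rate), 2)
    print(f"Invoice for {args.customer}: {total:.2f}")
    return 0
```

Accepting an optional argv list keeps main() callable from tests without touching sys.argv.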
A data science team shared 12 internal Python packages across 8 projects using a private devpi server. Before packaging: copy-paste code, version drift, 30 bugs/quarter from stale copies. After: pip install from internal index, automatic versioning, zero copy-paste bugs.
Candidates still use setup.py for new projects. pyproject.toml has been the standard since PEP 517/518; support depends on the version of pip and the build backend, not on the Python version. Also, candidates forget to cap dependency ranges: requests>=2.28,<3 is safer than requests>=2.28, which could break with a major version bump.
What is the difference between a wheel (.whl) and a source distribution (.tar.gz)? When does it matter?
CI/CD automates testing, building, and deploying Python projects on every commit.
CI (Continuous Integration): Run tests, linting, type checking, and security scans on every PR. Tools: GitHub Actions, GitLab CI, Jenkins.
CD (Continuous Deployment): Auto-deploy to staging on merge, production on release tag.
Typical Python CI pipeline: ruff check (linting) → mypy (type checking) → pytest --cov (tests + coverage) → bandit (security scan) → python -m build (package) → deploy.
Best practices: test against multiple Python versions (3.9, 3.10, 3.11, 3.12), use dependency caching, fail fast, and keep the pipeline under 5 minutes.
# .github/workflows/ci.yml (GitHub Actions)
# name: Python CI
#
# on:
#   push:
#     branches: [main]
#   pull_request:
#     branches: [main]
#
# jobs:
#   test:
#     runs-on: ubuntu-latest
#     strategy:
#       matrix:
#         python-version: ["3.9", "3.10", "3.11", "3.12"]
#
#     steps:
#       - uses: actions/checkout@v4
#
#       - name: Set up Python ${{ matrix.python-version }}
#         uses: actions/setup-python@v5
#         with:
#           python-version: ${{ matrix.python-version }}
#
#       - name: Cache pip packages
#         uses: actions/cache@v4
#         with:
#           path: ~/.cache/pip
#           key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml', 'requirements*.txt') }}
#
#       - name: Install dependencies
#         run: |
#           pip install -e ".[dev]"
#
#       - name: Lint with ruff
#         run: ruff check src/ tests/
#
#       - name: Type check with mypy
#         run: mypy src/
#
#       - name: Test with pytest
#         run: pytest --cov=src --cov-report=xml -v
#
#       - name: Security scan
#         run: bandit -r src/ -ll
A 15-person team deployed to production manually — every release took 4 hours and broke 30% of the time. After GitHub Actions CI/CD: deploy time dropped to 12 minutes (automated), failure rate dropped to 3%, and the team shipped 3x more features per quarter.
Candidates skip security scanning (bandit) and dependency auditing (pip-audit) in CI. Also, testing only on one Python version is risky — a feature that works on 3.11 may fail on 3.9. Matrix testing catches version-specific bugs before production.
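One concrete way version-specific breakage appears: zip(..., strict=True) only exists on Python 3.10 and later, so code that relies on it raises a TypeError on 3.9. The helper below is a hedged sketch (the name zip_equal is hypothetical) of guarding such a feature so the same module passes on every version in the test matrix.

```python
import sys


def zip_equal(left, right):
    """Pair items from two sequences, raising ValueError if their lengths differ."""
    if sys.version_info >= (3, 10):
        # Built-in length check, available from Python 3.10.
        return list(zip(left, right, strict=True))
    # Fallback for older interpreters: check lengths manually.
    if len(left) != len(right):
        raise ValueError("zip_equal() arguments have different lengths")
    return list(zip(left, right))
```

Matrix testing is what exercises both branches; on a single interpreter only one of them ever runs.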
How do you handle database migrations in CI/CD? How do you do zero-downtime deployments?
Large Python codebases need clear structure to stay maintainable as the team grows. Key principles:
Layered architecture: Separate presentation, business logic, and data access into distinct modules. Never let database queries leak into API handlers.
Dependency injection: Pass dependencies as constructor params, not as global imports. Makes testing trivial.
Domain-driven design: Organize code by business domain (users/, orders/, payments/), not by technical role (models/, views/, controllers/).
Type hints: Use type annotations + mypy/pyright for static analysis — catches bugs before runtime.
Configuration: 12-factor app — config from environment variables, not hardcoded.
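The configuration principle can be sketched as a small settings object built from environment variables. The variable names (DATABASE_URL, DEBUG, REQUEST_TIMEOUT) and the Settings fields are assumptions for illustration, not part of any particular framework.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    debug: bool = False
    request_timeout: float = 5.0


def load_settings(env=None) -> Settings:
    """Build settings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        database_url=env["DATABASE_URL"],  # required: fail fast with KeyError if missing
        debug=env.get("DEBUG", "false").lower() in {"1", "true", "yes"},
        request_timeout=float(env.get("REQUEST_TIMEOUT", "5.0")),
    )


# Passing a plain dict keeps tests independent of the real environment.
settings = load_settings({"DATABASE_URL": "postgresql://localhost/app", "DEBUG": "1"})
```

Because load_settings accepts any mapping, it doubles as a small example of dependency injection: tests pass a dict, production passes nothing.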
# Domain-driven project structure
# myapp/
# ├── users/
# │ ├── __init__.py
# │ ├── models.py # User, UserProfile
# │ ├── services.py # Business logic
# │ ├── repository.py # Database access
# │ ├── api.py # HTTP handlers
# │ └── tests/
# ├── orders/
# │ ├── __init__.py
# │ ├── models.py
# │ ├── services.py
# │ ├── repository.py
# │ ├── api.py
# │ └── tests/
# ├── shared/
# │ ├── database.py
# │ ├── config.py
# │ └── exceptions.py
# └── main.py
# Dependency injection — testable service layer
from dataclasses import dataclass
from typing import Protocol

class UserRepository(Protocol):
    """Interface: any class with these methods works."""
    def get_by_id(self, user_id: int) -> dict: ...
    def save(self, user: dict) -> None: ...

@dataclass
class UserService:
    repo: UserRepository  # injected, not imported

    def upgrade_to_premium(self, user_id: int) -> dict:
        user = self.repo.get_by_id(user_id)
        if user["total_spent"] < 10000:
            raise ValueError("Minimum ₹10,000 spent required for premium")
        user["is_premium"] = True
        user["discount_rate"] = 0.15
        self.repo.save(user)
        return user
# In production: UserService(repo=PostgresUserRepo(db))
# In tests: UserService(repo=FakeUserRepo(test_data))
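The FakeUserRepo mentioned in the comment above is not defined in the source. A minimal sketch of what such an in-memory fake could look like follows; its fields (users, saved) are assumptions, and UserService is repeated so the block runs on its own.

```python
from dataclasses import dataclass, field
from typing import Protocol


class UserRepository(Protocol):
    def get_by_id(self, user_id: int) -> dict: ...
    def save(self, user: dict) -> None: ...


@dataclass
class UserService:
    repo: UserRepository

    def upgrade_to_premium(self, user_id: int) -> dict:
        user = self.repo.get_by_id(user_id)
        if user["total_spent"] < 10000:
            raise ValueError("Minimum ₹10,000 spent required for premium")
        user["is_premium"] = True
        user["discount_rate"] = 0.15
        self.repo.save(user)
        return user


@dataclass
class FakeUserRepo:
    """Hypothetical in-memory stand-in; satisfies UserRepository structurally."""
    users: dict
    saved: list = field(default_factory=list)

    def get_by_id(self, user_id: int) -> dict:
        return dict(self.users[user_id])  # copy so tests can compare before/after

    def save(self, user: dict) -> None:
        self.saved.append(user)           # record writes instead of hitting a database


repo = FakeUserRepo(users={1: {"total_spent": 12000}, 2: {"total_spent": 500}})
service = UserService(repo=repo)
upgraded = service.upgrade_to_premium(1)
```

No mocking library is needed: because the dependency is passed in, the test controls it by construction and can assert on repo.saved directly.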
A startup grew from 3 to 25 engineers in 18 months. Their single 15K-line app.py became unmaintainable. Refactoring to domain-driven modules (users/, orders/, payments/, notifications/) with Protocol-based interfaces reduced onboarding time from 3 weeks to 3 days and made it possible for teams to work on different domains without merge conflicts.
Candidates create "god modules" — services.py with 5000 lines, models.py with 50 classes. Each module should have a single responsibility. Also, circular imports are a sign of poor architecture — if A imports B and B imports A, they need restructuring.
How do you handle circular imports in a large Python project? What are the strategies to prevent them?
Profiling measures where your code spends time and memory. Python offers several profiling tools:
cProfile: Built-in CPU profiler. Shows function call counts and time per function. Run with python -m cProfile script.py or use programmatically.
line_profiler: Shows time per line within a function — much more granular than cProfile.
memory_profiler: Shows memory usage per line.
py-spy: Sampling profiler that attaches to running processes without modifying code — great for production profiling.
timeit: Micro-benchmark for small code snippets.
Rule: Profile before optimizing. 90% of runtime is usually in 10% of code. Find the hot path first.
import cProfile
import pstats
import io
from functools import lru_cache
# The code to profile
def process_transactions(transactions):
    results = []
    for txn in transactions:
        # Simulate expensive operations
        category = categorize(txn["amount"])
        risk = calculate_risk(txn["amount"], txn["merchant"])
        results.append({**txn, "category": category, "risk": risk})
    return results

def categorize(amount):
    """Intentionally slow: O(n) lookup each time."""
    categories = [(0, 500, "micro"), (500, 5000, "small"),
                  (5000, 50000, "medium"), (50000, float("inf"), "large")]
    for low, high, cat in categories:
        if low <= amount < high:
            return cat

def calculate_risk(amount, merchant):
    """Simulate computation."""
    return round(amount * 0.001 * len(merchant), 4)

# Profile it
transactions = [{"amount": i * 100, "merchant": f"shop_{i}"}
                for i in range(10000)]
profiler = cProfile.Profile()
profiler.enable()
results = process_transactions(transactions)
profiler.disable()
# Print sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10) # top 10 functions
print(stream.getvalue())
# Quick timing with timeit
import timeit
time_taken = timeit.timeit(
    lambda: process_transactions(transactions[:100]),
    number=100
)
print(f"100 txns × 100 runs: {time_taken:.3f}s")
print(f"Per transaction: {time_taken / 10000 * 1000:.4f}ms")
An API endpoint took 4.5 seconds. cProfile revealed that 89% of time was in a single function — a JSON schema validation running on every nested object. Caching the compiled schema reduced the endpoint from 4.5s to 0.3s, a 15x improvement found in 20 minutes of profiling.
Candidates optimize without profiling — they guess at bottlenecks and "optimize" code that runs once while the real bottleneck (called 100K times) is untouched. Another mistake: using time.time() instead of time.perf_counter() for benchmarks — time.time() has lower resolution and can jump due to system clock adjustments.
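For quick ad-hoc timings based on time.perf_counter(), a small context manager is often enough. This is a sketch only; the helper name timed and the dict used to collect results are illustrative choices, not a standard API.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the elapsed wall time of the with-block in timings[label]."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with timed("build_squares", timings):
    squares = [i * i for i in range(100_000)]

print(f"build_squares: {timings['build_squares']:.4f}s")
```

A single run like this is noisy; for micro-benchmarks, timeit repeats the measurement and is the better tool.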
How do you profile memory usage in Python? What tools detect memory leaks?
The choice depends on the type of work:
Threading (threading, concurrent.futures.ThreadPoolExecutor): Best for I/O-bound tasks — waiting on network, files, databases. Threads share memory so communication is easy, but the GIL prevents CPU parallelism.
Multiprocessing (multiprocessing, ProcessPoolExecutor): Best for CPU-bound tasks — number crunching, image processing, ML training. Each process has its own Python interpreter and GIL, giving true parallelism. Trade-off: higher memory usage and IPC overhead.
Asyncio: Best for high-concurrency I/O — handling 10K+ simultaneous connections (web servers, chat, websockets). Single-threaded, event-loop-based. Requires async libraries (aiohttp, asyncpg).
import time
import asyncio
import urllib.request
import aiohttp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# I/O-bound: Fetching 100 URLs
# 1. Threading: good for I/O
def fetch_url_sync(url):
    with urllib.request.urlopen(url) as response:  # closes the connection on exit
        return response.read()[:100]

def threaded_fetch(urls):
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(fetch_url_sync, urls))
    return results

# 2. Asyncio: best for high-concurrency I/O
async def fetch_one(session, url):
    async with session.get(url) as response:  # releases the connection on exit
        return await response.read()

async def async_fetch(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, url) for url in urls))

# CPU-bound: Image processing
def process_image(image_path):
    """CPU-heavy: resize, filter, compress."""
    # Simulate CPU work
    total = sum(i * i for i in range(500_000))
    return total

# 3. Multiprocessing: true parallelism for CPU work
def parallel_process(image_paths):
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_image, image_paths))
    return results
# Decision matrix:
# ┌──────────────────┬─────────────┬────────────────┬──────────┐
# │ Task Type │ Threading │ Multiprocessing│ Asyncio │
# ├──────────────────┼─────────────┼────────────────┼──────────┤
# │ API calls (100) │ ✅ Good │ ❌ Overkill │ ✅ Best │
# │ File processing │ ✅ OK │ ✅ Best │ ❌ No │
# │ CPU computation │ ❌ GIL │ ✅ Best │ ❌ No │
# │ 10K connections │ ❌ Too many │ ❌ Too many │ ✅ Best │
# │ Mixed I/O + CPU │ ✅ OK │ ✅ Best │ ⚠️ Tricky│
# └──────────────────┴─────────────┴────────────────┴──────────┘
A document processing pipeline had 3 stages: download PDFs (I/O), extract text (CPU), upload results (I/O). Using asyncio for downloads (100 concurrent), multiprocessing for extraction (8 cores), and threading for uploads gave a nearly 13x speedup, from 45 minutes to 3.5 minutes for 5K documents.
Candidates use threading for CPU-bound work and wonder why it's not faster (GIL!). They also use multiprocessing for simple I/O tasks, wasting memory on separate processes when threads would suffice. The most common mistake: creating a new thread/process per task instead of using a pool.
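The "use a pool, not one worker per task" advice has an asyncio counterpart: cap the number of in-flight operations with a semaphore. The sketch below uses only the standard library and simulates network latency with asyncio.sleep; fake_download and the limit of 10 are illustrative assumptions.

```python
import asyncio
import time


async def fake_download(doc_id: int, limiter: asyncio.Semaphore) -> str:
    """Stand-in for a network call; asyncio.sleep simulates the I/O wait."""
    async with limiter:              # at most max_concurrent downloads in flight
        await asyncio.sleep(0.05)
        return f"doc-{doc_id}"


async def download_all(doc_ids, max_concurrent: int = 10):
    limiter = asyncio.Semaphore(max_concurrent)
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_download(d, limiter) for d in doc_ids))


start = time.perf_counter()
docs = asyncio.run(download_all(range(20)))
elapsed = time.perf_counter() - start
print(f"{len(docs)} downloads in {elapsed:.2f}s")  # 20 sequential sleeps would take about 1s
```

With 20 tasks and a limit of 10, the waits overlap in two batches on a single thread, which is the high-concurrency I/O case described above.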
How does concurrent.futures simplify threading and multiprocessing? What is the Executor pattern?
Vectorization means performing operations on entire arrays at once using optimized C/Fortran code, instead of looping element-by-element in Python. NumPy arrays store data contiguously in memory (unlike Python lists), enabling CPU cache efficiency and SIMD instructions.
A Python for-loop over 10M elements executes interpreter bytecode for every element, 10M times over. A NumPy operation makes one call into compiled C code that processes all 10M elements, which is typically 100-1000x faster for numerical work.
Key idea: replace Python loops with NumPy operations — np.sum(), np.where(), broadcasting, fancy indexing, and ufuncs (universal functions). If you find yourself writing for i in range(len(array)):, there's likely a NumPy way.
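As a small illustration of broadcasting and np.where from the list above, the sketch below centers each column of a tiny price table without any Python loop. The data is made up for the example.

```python
import numpy as np

# Rows are days, columns are stocks (illustrative values).
prices = np.array([[100.0, 200.0, 400.0],
                   [110.0, 190.0, 420.0],
                   [120.0, 210.0, 380.0]])

col_means = prices.mean(axis=0)             # shape (3,): one mean per stock
centered = prices - col_means               # (3, 3) minus (3,): the means broadcast across rows
above_mean = np.where(centered > 0, 1, 0)   # elementwise conditional, no loop

print(col_means)
print(above_mean)
```

The subtraction works because NumPy aligns the trailing dimensions: the (3,) vector is treated as a (1, 3) row and virtually repeated for each day.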
import numpy as np
import time
# Task: Calculate daily returns for 10,000 stocks over 252 trading days
# ❌ Pure Python: slow nested loop
def python_returns(prices):
    returns = []
    for i in range(len(prices)):
        stock_returns = []
        for j in range(1, len(prices[i])):
            r = (prices[i][j] - prices[i][j-1]) / prices[i][j-1]
            stock_returns.append(r)
        returns.append(stock_returns)
    return returns

# ✅ NumPy vectorized: no Python loops
def numpy_returns(prices):
    return (prices[:, 1:] - prices[:, :-1]) / prices[:, :-1]
# Generate test data: 10,000 stocks × 252 days
np.random.seed(42)
stock_prices = np.random.uniform(100, 5000, size=(10_000, 252))
# Benchmark
start = time.perf_counter()
result_np = numpy_returns(stock_prices)
t_numpy = time.perf_counter() - start
stock_list = stock_prices.tolist()
start = time.perf_counter()
result_py = python_returns(stock_list[:100]) # only 100 stocks!
t_python = time.perf_counter() - start
print(f"NumPy (10K stocks): {t_numpy:.4f}s")
print(f"Python (100 stocks): {t_python:.4f}s")
print(f"Estimated Python (10K stocks): {t_python * 100:.1f}s")
print(f"Speedup: ~{(t_python * 100) / t_numpy:.0f}x")
# Illustrative output (timings vary by machine):
# NumPy (10K stocks): 0.0089s
# Python (100 stocks): 0.1823s
# Estimated Python (10K stocks): 18.2s
# Speedup: ~2045x
# More vectorization examples
data = np.random.randn(1_000_000)
# Conditional: values > 0 get doubled, others set to 0
result = np.where(data > 0, data * 2, 0)
# Aggregation across axis
matrix = np.random.rand(1000, 500)
col_means = matrix.mean(axis=0) # mean of each column
row_maxes = matrix.max(axis=1) # max of each row
A quant trading firm's daily risk calculation on 50K instruments × 10 years of data took 6 hours in pure Python. Vectorizing with NumPy reduced it to 8 seconds — a 2,700x speedup. The entire overnight batch job moved to a real-time dashboard.
Candidates mix NumPy and Python loops. Iterating over a NumPy array with a for loop defeats the purpose:
# ❌ Element-by-element loop
result = np.zeros(len(data))
for i in range(len(data)):
    result[i] = data[i] * 2 + 1  # Python interpreter called 1M times
# ✅ Vectorized
result = data * 2 + 1  # single C call for all 1M elements
What is NumPy broadcasting and how does it work? Give an example with differently-shaped arrays.
functools.lru_cache is a decorator that caches function return values based on arguments. When the same arguments are used again, the cached result is returned instantly instead of recomputing. LRU stands for Least Recently Used: when the cache is full, the entry that has gone unused the longest is evicted.
@lru_cache(maxsize=128) caches up to 128 unique argument combinations. maxsize=None means unlimited cache (use carefully). Python 3.9+ also has @cache (shortcut for @lru_cache(maxsize=None)).
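A tiny cache makes the eviction order visible. In this sketch, maxsize=2 and the calls list are chosen purely for demonstration; calls records every time the function body actually runs.

```python
from functools import lru_cache

calls = []  # records each real computation (cache misses)


@lru_cache(maxsize=2)
def square(n):
    calls.append(n)
    return n * n


square(1)   # miss, cache holds: 1
square(2)   # miss, cache holds: 1, 2
square(1)   # hit, 1 becomes the most recently used entry
square(3)   # miss, cache is full so 2 (least recently used) is evicted
square(1)   # hit, 1 survived the eviction
square(2)   # miss, 2 was evicted and must be recomputed

print(calls)                # [1, 2, 3, 2]
print(square.cache_info())  # hits=2, misses=4, maxsize=2, currsize=2
```

Note that the hit on square(1) is what saved it: recency of use, not insertion order, decides which entry is dropped.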
Requirements: arguments must be hashable (no lists or dicts). For unhashable args, convert them to tuples or use a custom cache.
from functools import lru_cache
import time
# Classic example: recursive Fibonacci
# Without cache: O(2^n) — exponentially slow
# With cache: O(n) — each value computed once
@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
start = time.perf_counter()
result = fibonacci(500) # instant with cache, impossible without
elapsed = time.perf_counter() - start
print(f"fib(500) = {result} ({elapsed:.6f}s)")
print(f"Cache stats: {fibonacci.cache_info()}")
# CacheInfo(hits=498, misses=501, maxsize=None, currsize=501)
# Practical: caching expensive database/API lookups
@lru_cache(maxsize=1000)
def get_exchange_rate(from_currency, to_currency, date):
    """Simulate an expensive API call, cached for same inputs."""
    print(f"  API call: {from_currency}→{to_currency} on {date}")
    # In real code: requests.get(f"https://api.exchangerate.com/...")
    rates = {"USD_INR": 83.5, "EUR_INR": 91.2, "GBP_INR": 106.3}
    return rates.get(f"{from_currency}_{to_currency}", 1.0)
# First calls hit the "API"
print(get_exchange_rate("USD", "INR", "2025-01-15")) # API call
print(get_exchange_rate("EUR", "INR", "2025-01-15")) # API call
# Repeated calls served from cache — instant
print(get_exchange_rate("USD", "INR", "2025-01-15")) # cached!
print(get_exchange_rate("USD", "INR", "2025-01-15")) # cached!
# Clear cache when needed
get_exchange_rate.cache_clear()
A pricing engine called an exchange rate API 50K times per batch. Adding @lru_cache reduced API calls from 50K to 180 (unique currency pairs × dates), cutting batch time from 25 minutes to 40 seconds and saving $500/month in API costs.
Candidates use lru_cache on functions with side effects (database writes, API POST requests) or on methods without accounting for self:
# ❌ Cached method
class UserService:
    @lru_cache(maxsize=100)
    def get_user(self, user_id):
        return db.query(user_id)
# self is part of the cache key, so results are not shared across
# UserService instances, and the cache keeps every instance alive
# ✅ Cached standalone function
@lru_cache(maxsize=100)
def get_user(user_id):
    return db.query(user_id)
# Standalone function: cache works correctly
What is the difference between lru_cache and a Redis/Memcached cache? When would you use each?
Cython is a superset of Python that compiles to C. You write Python-like code with optional C type declarations, and Cython generates C code that's compiled into a shared library (.so/.pyd) importable from Python.
Adding type annotations (cdef int, cdef double) removes Python object overhead for numeric operations, giving C-like speed. Cython can also release the GIL with nogil, enabling true multi-threaded parallelism for numerical code.
Use Cython when: NumPy can't vectorize your logic (complex conditionals, graph algorithms, custom loops), you need 10-100x speedup over pure Python, or you want to wrap an existing C library.
# Pure Python version — slow
def python_primes(limit):
    """Find all primes up to limit using Sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit**0.5) + 1):
        if sieve[i]:
            for j in range(i*i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]
# Cython version (save as fast_primes.pyx):
# ─────────────────────────────────────────
# def cython_primes(int limit):
#     cdef int i, j
#     cdef list sieve = [True] * (limit + 1)
#     sieve[0] = sieve[1] = False
#
#     for i in range(2, <int>(limit**0.5) + 1):
#         if sieve[i]:
#             for j in range(i*i, limit + 1, i):
#                 sieve[j] = False
#
#     return [i for i in range(limit + 1) if sieve[i]]
# ─────────────────────────────────────────
# Compile: cythonize -i fast_primes.pyx
# Import: from fast_primes import cython_primes
# Benchmark comparison
import time
limit = 10_000_000
start = time.perf_counter()
primes_py = python_primes(limit)
t_python = time.perf_counter() - start
print(f"Python: {t_python:.3f}s — found {len(primes_py)} primes")
# start = time.perf_counter()
# primes_cy = cython_primes(limit)
# t_cython = time.perf_counter() - start
# print(f"Cython: {t_cython:.3f}s — found {len(primes_cy)} primes")
# print(f"Speedup: {t_python / t_cython:.1f}x")
# Typical results:
# Python: 4.200s — found 664,579 primes
# Cython: 0.180s — found 664,579 primes
# Speedup: 23.3x
A bioinformatics lab had a DNA sequence alignment algorithm in pure Python taking 12 hours per genome. Cython with typed memoryviews reduced it to 18 minutes (40x speedup). The team kept 95% of the code in Python and only Cython-ized the inner loop of the Smith-Waterman algorithm.
Candidates Cython-ize everything instead of just the hot loop. Cython code is harder to debug and maintain, so only use it where profiling shows a clear bottleneck. Also, forgetting to add type declarations (cdef int) gives little speedup: untyped Cython still operates on Python objects and typically runs only modestly faster than plain Python.
How does Cython compare to PyPy for performance? When would you choose one over the other?
Frequently Asked Questions
The most common Python interview questions cover data types (list vs tuple vs set), list comprehensions, decorators, generators, OOP concepts, the GIL, and memory management. Our guide covers all of these with real code examples and follow-up questions.
We cover 40 Python interview questions across 5 difficulty levels: Basic (10), Intermediate (10), Advanced (8), Experienced (7), and Performance & Optimization (5). Each question includes 6 answer sections.
Questions are organized into 5 levels: Basic (0-1 year experience), Intermediate (1-3 years), Advanced (3-5 years), Experienced/Architect (5+ years), and Performance & Optimization (all levels).
All code examples are real, working Python code — not pseudocode or foo/bar placeholders. Each example uses realistic variable names, actual library usage, and scenarios from production environments. You can copy and run them directly.
The GIL (Global Interpreter Lock) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. It is one of the most frequently asked advanced Python interview questions because it affects multithreading performance and is critical for understanding concurrency in Python.
Focus on profiling with cProfile, understanding multiprocessing vs threading, numpy vectorization for numerical work, caching with functools.lru_cache, and when to consider Cython. Our performance section covers all of these with real-world benchmarks.
Senior Python interviews focus on design patterns (Factory, Strategy, Observer), CPython internals, C extensions, testing strategies (pytest, mocking), packaging (setuptools, pyproject.toml), CI/CD pipeline design, and code architecture for large codebases.
All questions and code examples are for Python 3.8+ (modern Python). Python 2 reached end-of-life in January 2020, so interviews now focus exclusively on Python 3 features like f-strings, walrus operator, type hints, and dataclasses.