leetcode number
linked list, 328
binary search, 34
string, 151
backtracking, 17
siding window, 3
hash table, 146
1. Design a Rate Limiter Every company covered this.
Know: Sliding window algorithms, Redis, race condition handling, token bucket vs. leaky bucket models.
2. Design a Chat Application A WhatsApp-style typing indicator alone can drive a 45-minute discussion.
Know: WebSockets, Redis Pub/Sub, message queues, offline message delivery.
3. Design a URL Shortener Appears straightforward. Becomes complex quickly.
Know: Base62 encoding, collision resolution, analytics tracking, Redis-based caching.
4. Design a Notification System
Know: Push vs. pull architecture, Kafka for asynchronous delivery, retry mechanisms, user preference management.
5. Design a Payment System JPMorgan asked this. So did multiple others.
Know: Idempotency keys, Saga pattern, ACID transactions vs. eventual consistency.
6. Design an API Rate Limiter Different from #1. This focuses on distributed system design.
Know: Token bucket algorithms, Redis INCR, Lua scripting, multi-node coordination.
7. Design a Video Streaming Platform
Know: CDN architecture, chunked uploads, adaptive bitrate streaming, large-scale storage systems.
8. Design a Ride-Hailing Application
Know: Real-time location tracking, matching algorithms, surge pricing strategies, live event processing.
9. Design an E-commerce Checkout System
Know: Inventory reservation, flash-sale scalability, payment retry workflows, order state management.
10. Design a Search Autocomplete System
Know: Trie data structures, frequency-based ranking, result caching, sub-100ms latency optimization.
- How do you train LLMs
- Why LLM is decoder only architecture
- Sampling in LLM
- Providing Context to LLM (needle in haystack problem)
- LLM Evaluation
- MCPs/Skills/Workflows/Agents/Plugin in LLM - Design / Implement
- Prompt Engineering / Guided Genaration
- LLM Inference
- Preference alignment in LLM
- (Optional) Jail break in LLMs/Claude Mythos
AI related design
- Design a document intelligence platform for alternative investment documents.Design a RAG system for financial advisors with source-grounded answers.
Design an agent that automates operations workflows but requires human approval for risky actions.
Design an eval platform for LLM prompts, models, and tools.
Design a model-monitoring system for production AI.
Design cost/latency routing across GPT/Claude/open-source models.
Design a secure AI platform for PII-heavy financial documents.
Design a system to compare OCR + LLM extraction accuracy across vendors.
Design a rollout plan for an AI assistant used by internal operations teams.
Design prompt/model versioning and rollback for production AI workflows.
1. How you train LLMs
Beginner: An LLM is just a next-token predictor. You show it enormous amounts of text and nudge its weights so it gets better at guessing the next word. That’s it at the core.
The three stages:
1. Pretraining — self-supervised next-token prediction over a huge corpus (web, books, code). The model learns grammar, facts, and reasoning patterns. This is the expensive part (thousands of GPUs, millions of dollars). The output is a base model: great at continuing text, bad at following instructions.
2. Supervised fine-tuning (SFT / instruction tuning) — train on curated (instruction, ideal-response) pairs so it behaves like a helpful assistant instead of an autocomplete.
3. Preference alignment — shape outputs toward human preferences (helpful, honest, harmless). Covered in topic 9.
Senior depth: The loss is cross-entropy on the next token. Data quality and mixture matter as much as quantity (dedup, filtering, balancing code vs prose). Scaling laws (Chinchilla) tell you the compute-optimal balance of parameters vs training tokens — more isn’t always better, it’s about the ratio. Tokenization (usually byte-pair encoding) determines how text is chopped into units. Context length is a training-time choice with cost implications. You can also do continued pretraining to adapt a base model to a domain.
Interview tie-in: The classic follow-up is “fine-tune or RAG?” Rule of thumb: RAG for knowledge that’s fresh, private, or changes often; fine-tuning for behavior — format, tone, domain style, structured output. For most RAG systems you don’t fine-tune the base model at all. Saying that confidently signals maturity (and avoids the “over-engineering” pitfall on your list).
2. Why LLMs are decoder-only
Beginner: The original Transformer had two halves — an encoder (reads/understands) and a decoder (generates). That spawned three families: encoder-only (BERT, for understanding/classification), decoder-only (GPT-style, for generation), and encoder-decoder (T5, for translation-style tasks).
Why decoder-only won for general LLMs:
• One simple objective that scales: next-token prediction. Any task — translation, Q&A, summarization, coding — can be cast as “continue this text,” so you don’t need task-specific architectures.
• Causal (autoregressive) attention: each token attends only to tokens before it, which is exactly what generation needs.
• In-context learning emerges: a big enough decoder-only model can do a new task just from examples in the prompt, no retraining. This is the entire foundation of prompting and RAG.
Senior depth: Causal masking means past tokens never need recomputing, which enables clean KV caching (topic 8) and efficient generation. Encoder-decoder models still win for some pure seq2seq tasks, but decoder-only generalizes via prompting, so it scaled better and became the default. The deep point for your interview: because everything is text-to-text, RAG is just “paste the retrieved text into the prompt.” That’s why it works at all.
9. Preference alignment (covering it here, near training)
Beginner: SFT makes a model follow instructions, but you also want it to prefer good answers over bad ones. Alignment teaches that preference.
RLHF (Reinforcement Learning from Human Feedback):
1. Start with the SFT model.
2. Collect human comparisons (given two answers, which is better?) and train a reward model to predict that preference.
3. Optimize the model (policy) with RL — typically PPO — to maximize reward, with a KL penalty that keeps it from drifting too far from the SFT model.
DPO (Direct Preference Optimization): Skips the separate reward model and RL loop — it optimizes directly on preference pairs with a classification-style loss. Simpler, more stable, very popular now.
Other variants: RLAIF / Constitutional AI (Anthropic’s approach) uses an AI to generate preference labels against a written set of principles, reducing the human-labeling bottleneck.
Senior depth: The big failure modes are reward hacking (the model games the reward model), over-optimization (the KL constraint exists to prevent this), and sycophancy (telling users what they want to hear). There’s an inherent helpful-vs-harmless tension.
Interview tie-in: Usually you consume an already-aligned model, so the relevant question is “how do we get the model to follow our policies / refuse our disallowed requests?” — which is partly alignment and partly system design (guardrails, prompts). You can also do lightweight preference tuning on your own task data.
3. Sampling
Beginner: At each step the model outputs a probability distribution over the next token. Sampling is how you pick one.
The controls:
• Greedy — always take the most likely token. Deterministic but repetitive and dull.
• Temperature — scales the distribution. Low temp → sharper, more deterministic; high temp → flatter, more random/creative. Temp 0 ≈ greedy.
• Top-k — sample only from the k most likely tokens.
• Top-p (nucleus) — sample from the smallest set whose cumulative probability ≥ p.
• Repetition / frequency / presence penalties — discourage repeating tokens.
• Beam search — keeps several candidate sequences; more for translation/seq2seq than open-ended chat.
Senior depth: Even at temperature 0 you’re not guaranteed bit-for-bit reproducibility (floating-point and batching effects). Self-consistency is a useful trick: sample several reasoning chains and take the majority answer. Constrained/structured decoding (topic 7) overlaps here.
Interview tie-in: For a RAG or extraction system you want low temperature — you’re after faithful, grounded, reproducible answers, not creativity. Stating “I’d run generation at low/zero temperature to reduce hallucination” is a clean, correct design choice.
8. LLM inference
Beginner: Inference is running the trained model to generate text (as opposed to training, which updates weights). This is where most of your production cost and latency live.
Two phases:
• Prefill — process the entire prompt in parallel, producing the first token and the KV cache. Compute-bound.
• Decode — generate tokens one at a time, each reusing the cache. Memory-bandwidth-bound.
Key concepts and levers:
• KV cache — stored key/value vectors for past tokens so they aren’t recomputed. It grows with sequence length × layers × batch size, so it’s a major memory consumer and the reason long contexts get expensive.
• Latency metrics — TTFT (time to first token), inter-token latency, and throughput (tokens/sec). These trade off against each other.
• Batching — continuous/in-flight batching (e.g., vLLM) keeps the GPU busy across many requests.
• Quantization — running weights at INT8/INT4 instead of FP16 to shrink memory and cost.
• Speculative decoding — a small “draft” model proposes tokens, the big model verifies them in parallel; speeds up decode.
• Prefix caching — reuse the KV cache for a shared prompt prefix across requests. This is huge for RAG, where you have a big fixed system prompt every call.
Interview tie-in: Cost is roughly proportional to tokens (input + output), and context length drives KV-cache memory. Expect questions like “how do you cut latency/cost?” Good answers: prefix-cache the system prompt, cap output length, retrieve fewer/better chunks instead of stuffing, use a smaller model for easy queries (routing), batch requests.
4. Providing context / RAG / needle-in-a-haystack
This is the heart of your interview, so it gets the most space.
Beginner: A model only “knows” two things: what’s frozen in its weights (training data, with a cutoff) and what’s in the current prompt. To give it fresh, private, or company-specific knowledge, you put that knowledge in the prompt. RAG automates finding the right knowledge to insert.
The RAG pipeline (maps directly to their loop):
• Ingest/index — split documents into chunks, convert each chunk into an embedding (a vector capturing meaning), store in a vector database.
• Retrieve — embed the user’s query, find the nearest chunks (semantic search), often combined with keyword search, then rerank to keep the best few.
• Reason — put the top chunks + the question into a prompt; the model answers grounded in them, ideally with citations.
The needle-in-a-haystack problem: A test where you hide one specific fact (the needle) inside a very long context (the haystack) and ask the model to find it. The well-known result is “lost in the middle” — models reliably use information at the start and end of the context but degrade in the middle. The implication is critical: even with million-token context windows, dumping everything in is unreliable. Good retrieval plus smart ordering (put the most important chunk first or last) beats brute-force stuffing — and it’s cheaper and faster.
Senior depth — this is where you win the interview:
• Chunking strategy — size and overlap matter. Too large adds noise and cost; too small loses context. Prefer semantic/structure-aware chunking over fixed character counts.
• Hybrid search — dense (embeddings, for meaning) + sparse (BM25/keyword, for exact terms like error codes, names, IDs). Each covers the other’s blind spot.
• Reranking — a cross-encoder reorders the initial candidates for much better precision before they hit the prompt.
• Query transformation — rewrite vague queries, decompose multi-part questions, or use techniques like multi-query / HyDE to improve recall.
• Metadata filtering — filter by date, source, and especially permissions/access control (so users only retrieve what they’re allowed to see — a privacy point interviewers love).
• Context-budget management — under a token limit you must allocate space across system prompt, retrieved chunks, conversation history, and the output. Techniques: rank, truncate, dedupe, compress/summarize.
• Grounding & citations — instruct the model to answer only from retrieved context, cite sources, and say “I don’t know” when the answer isn’t there. This is your main hallucination defense.
• Failure modes — retrieval miss (right doc never retrieved), distractors, conflicting sources, stale index.
Interview tie-in (their Copilot hint): For an AI coding tool, “context gathering” = the open file, imported/related files, symbols, and repo structure; the codebase is the haystack. RAG over code (plus the cursor’s local context) is exactly this pattern.
7. Prompt engineering / guided generation
Beginner: The prompt is your main steering wheel for a frozen model. How you ask changes what you get.
Core techniques:
• Clear instructions, a role/system prompt, explicit output format, and delimiters around inputs.
• Zero-shot vs few-shot — adding examples in the prompt (in-context learning).
• Chain-of-thought — “think step by step” for reasoning tasks (newer reasoning models do this internally).
• ReAct — interleave reasoning with tool calls (reason → act → observe).
Guided / constrained generation: When a downstream system needs parseable output (JSON, a specific schema), you don’t want to hope the model complies — you constrain it. Grammar/schema-constrained decoding only allows tokens that keep the output valid against a grammar. In practice this shows up as function-calling / tool-use APIs, JSON mode, or libraries that enforce a schema. This guarantees structure instead of relying on luck.
Senior depth: Treat prompts as code — version them, test them, and run evals, because a model upgrade or prompt tweak can silently cause regressions. Watch token cost (few-shot examples aren’t free). And critically, retrieved/untrusted content in the prompt can contain malicious instructions (prompt injection — topic 10).
Interview tie-in: In RAG, the prompt template — “answer only from the context below, cite the source, respond ‘not found’ if absent” — is your primary grounding mechanism. Guided generation matters when the system must emit structured output (e.g., a config patch or a JSON action).
6. MCP / Skills / Workflows / Agents / Plugins
This layer is about extending an LLM from “writes text” to “does things.” Define each clearly — interviewers test whether you can distinguish them.
• Tools / function calling — you give the model a set of functions with schemas; it decides when to call one; you execute it and feed the result back. This is the foundation under everything else.
• Plugins — an older term (e.g., ChatGPT plugins): packaged tools the model can call, usually via an API spec.
• MCP (Model Context Protocol) — an open standard for connecting models to external tools and data through a uniform interface. Instead of building a bespoke integration per app, an MCP server exposes its tools/resources in a standard way that any MCP-compatible host can use. Think “USB-C for tool integrations” — it decouples tool providers from model providers.
• Skills — packaged, reusable units of capability (instructions + sometimes code/scripts) that an agent loads only when relevant — progressive disclosure to save context (e.g., a “build a PowerPoint” skill loaded only for slide tasks).
• Workflows — you design a fixed sequence of LLM + tool steps. The control flow is predetermined; the model fills in the steps.
• Agents — the model decides the steps: it plans, picks tools, observes results, and loops until the goal is met. More autonomous, less predictable.
The senior distinction (workflows vs agents): Workflows are predictable, testable, debuggable, and cheaper — use them when the task is well-understood and decomposable. Agents are flexible and handle open-ended tasks but are harder to control and evaluate, more expensive, and can loop or fail surprisingly. The standard guidance (and a direct hit on your “over-engineering” pitfall): start simple — single prompt → add retrieval → add tools → workflow → reach for a full agent only when flexibility genuinely requires it. Common patterns to name-drop: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer.
Senior depth: Tool design needs clear schemas, error handling, and human-in-the-loop approval for risky/side-effecting actions. Add tracing/observability, retries, timeouts, and a cap on iterations for cost. Agents that act need permission scoping and sandboxing.
Interview tie-in: For a coding tool, the agent decides which files to read, runs the linter/tests (tools), reads the errors, and iterates — while predictable steps (“always run tests after an edit”) stay as a fixed workflow. This is the “LLM as Actor” and “Validation” themes on your list.
5. LLM evaluation
Beginner: You can’t ship on vibes — you need to measure whether the system is actually good, and catch regressions.
For RAG specifically, evaluate two stages:
• Retrieval — recall@k, precision@k, MRR, NDCG (did the right chunks get retrieved, and ranked well?).
• Generation — faithfulness/groundedness (is the answer supported by the retrieved context?), answer relevance, completeness.
Methods:
• Reference-based — exact match, F1, or older overlap metrics (BLEU/ROUGE) against gold answers.
• LLM-as-judge — a strong model grades outputs against a rubric. Scalable but biased (position bias, verbosity bias, self-preference); mitigate with pairwise comparison, clear rubrics, and calibration against human labels.
• Human eval — gold standard, expensive, used for a sample.
• Cheap objective validators — does the code compile? do tests pass? is the JSON valid? These are perfect for the “Validate” loop.
Senior depth: Build a golden eval set early from real queries and grow it from observed failures. Separate offline eval (pre-deploy) from online eval (A/B tests, thumbs up/down, and implicit signals like suggestion-acceptance rate for a coding tool). Run regression tests so a model/prompt change doesn’t silently break things, and include guardrail evals for safety, PII leakage, and injection resistance.
Interview tie-in: This is the “Validate” + “Learn” stages, and explicitly a “key habit of strong candidates.” For a coding tool: % of suggestions accepted, % that compile, % that pass tests.
10. Jailbreaks / adversarial robustness (optional)
I’ll treat this as the safety and security design topic — I won’t cover techniques for actually bypassing safeguards, but the defensive framing is exactly what an interviewer wants under “Safety & Operations.”
Beginner: Aligned models refuse harmful requests. Jailbreaking is adversarial prompting that tries to get a model to violate its own safety policy. The closely related and arguably bigger system risk is prompt injection — untrusted input (a web page, a retrieved document, a tool’s output) contains instructions that hijack the model’s behavior. In RAG and agents this is especially dangerous because retrieved content and tool results get fed back into the prompt (this is “indirect prompt injection”).
Design-level defenses to cite:
• Input/output moderation classifiers.
• Keep trusted instructions (system prompt) separate from untrusted data, and never let retrieved content be treated as instructions (delimit/“spotlight” it).
• Least privilege for tools, human-in-the-loop for high-risk actions, and sandboxing.
• Validate outputs before acting on them (critical for agents).
• Red-teaming and adversarial evals, with defense in depth — no single layer is enough.
Interview tie-in: Mentioning indirect prompt injection via retrieved docs in a RAG/agent design is a strong maturity signal.