tezvyn:

LLMs & Generative AI

Large language models, chatbots, agents, prompt engineering

105 bites

LLMs & Generative AI43 sec read

Explain prompt injection and how to defend against it

This question tests your understanding of LLM security vulnerabilities and how untrusted user input can manipulate model behavior. A strong answer defines prompt injection as hijacking the model's instructions, then outlines a layered defense including input sanitization, instruction-tuned models, and separating user input from system prompts. A common red flag is confusing it with traditional SQL injection or suggesting simple input filtering is a sufficient solution.

LLMs & Generative AI51 sec read

What is the KV cache and why does it matter for serving LLMs?

This question tests your understanding of performance bottlenecks in autoregressive LLM inference. A great answer first explains that the attention mechanism computes Key (K) and Value (V) tensors for all input tokens. Then, it highlights the redundancy of recomputing these for past tokens at each new generation step. The KV cache solves this by storing these tensors, drastically reducing latency. A red flag is vaguely calling it a 'cache' without connecting it to K/V tensors.

LLMs & Generative AI43 sec read

How does positional encoding work in transformers?

This tests your understanding of why Transformers need explicit position data. A great answer explains that self-attention is permutation-invariant, meaning it sees inputs as an unordered set. Positional encodings—vectors derived from sine and cosine functions—are then added to the input embeddings to inject sequence order. A red flag is simply saying 'it adds position' without explaining why this is necessary or how it's done.

LLMs & Generative AI40 sec read

Encoder-Only vs. Decoder-Only vs. Encoder-Decoder Transformers?

This tests your ability to connect transformer architecture to specific NLP tasks. A great answer explains how each model's attention mechanism dictates its use: encoder-only (bidirectional attention) for understanding content, decoder-only (causal attention) for text generation, and encoder-decoder for sequence-to-sequence tasks like translation. The key red flag is failing to explain the *why* behind the task suitability—the attention mechanism.

LLMs & Generative AI49 sec read

Why are MoE models larger but cheaper to run?

This tests your understanding of sparse activation versus dense models. A great answer defines Mixture-of-Experts (MoE) as a system with a router and multiple expert sub-networks, explaining that only a fraction of the total parameters are activated for any given token, which drastically reduces computational cost (FLOPs) during inference. A red flag is describing MoE as a simple ensemble without mentioning the sparse routing mechanism that enables its efficiency.

LLMs & Generative AI47 sec read

When would you use LoRA vs full fine-tuning?

This tests your grasp of practical trade-offs in ML systems, specifically training cost versus model customization. A great answer explains that LoRA is a parameter-efficient method ideal for resource-constrained scenarios, reducing trainable parameters by 10,000x and GPU memory by 3x. Full fine-tuning is for high-budget projects requiring deep model changes. A red flag is vaguely saying LoRA is 'cheaper' without quantifying the resource savings or explaining the mechanism.

LLMs & Generative AI45 sec read

What is the role of temperature in token sampling?

This tests your understanding of how to control the creativity and randomness of a language model's output. A great answer explains that temperature is a divisor applied to the model's logits before the softmax function. Low temperature makes the output more deterministic by sharpening the probability distribution, while high temperature increases randomness by flattening it. A common red flag is vaguely saying it 'controls randomness' without explaining the underlying softmax mechanism.

LLMs & Generative AI40 sec read

How to reduce hallucination in a production LLM application?

This tests your ability to design a robust, multi-layered system for AI safety, not just your model knowledge. A great answer starts with data-level grounding (RAG), moves to model-level tuning (temperature, fine-tuning), and finishes with application-level safeguards (validation, feedback loops). A red flag is focusing only on prompt engineering or stating it's an unsolvable problem without offering concrete mitigation strategies.

LLMs & Generative AI49 sec read

Explain Supervised Fine-Tuning, RLHF, and DPO

This tests your understanding of modern LLM alignment techniques. A strong answer explains that Supervised Fine-Tuning (SFT) teaches the model a task via imitation, while RLHF and DPO align it with human preferences. RLHF uses a reward model and reinforcement learning, whereas DPO is a simpler, direct optimization method. The key red flag is conflating these distinct stages or failing to explain the 'reward model' step in RLHF.

LLMs & Generative AI40 sec read

What is the vanishing gradient problem and how do transformers avoid it?

This tests your understanding of core deep learning training issues and the transformer's specific architectural solutions. A great answer defines vanishing gradients in sequential models, then explains how the transformer's parallel attention mechanism creates direct, short paths for gradients between any two tokens, regardless of distance. A red flag is vaguely mentioning 'attention' without explaining why its parallel nature is the key to solving the problem for long sequences.

LLMs & Generative AI50 sec read

RAG vs. Fine-Tuning: Key Differences

This tests your understanding of how LLMs incorporate knowledge, specifically the trade-offs between embedding it in model weights versus retrieving it at runtime. A great answer defines RAG as runtime retrieval from an external source and fine-tuning as baking knowledge into model parameters, then contrasts their approaches to knowledge updates, cost, and providing citations. A red flag is stating one is always better, or failing to explain that they solve different problems and can be used tog

LLMs & Generative AI45 sec read

What is the trade-off between top-k and top-p sampling?

This tests your practical knowledge of tuning LLM output for the creativity vs. coherence trade-off. A strong answer defines top-k (static token count) and top-p (dynamic probability mass), then explains that top-p's adaptive window is generally more robust than top-k's fixed window. A red flag is failing to contrast the static nature of top-k with the dynamic nature of top-p, which is the core of the trade-off.

LLMs & Generative AI46 sec read

Explain the concept of self-attention

This tests your ability to explain the core mechanism of Transformers. A strong answer defines self-attention as a process for relating positions of a single sequence, explains the Query-Key-Value (QKV) model where a token's Query is compared to all Keys to generate weights, and describes how these weights create a weighted sum of Values. A red flag is vaguely describing 'importance' without mentioning the QKV mechanism.

LLMs & Generative AI44 sec read

Gemini API Webhooks Eliminate Polling for Long Jobs

The Gemini API now includes event-driven Webhooks, eliminating the need for continuous polling on long-running jobs like batch processing or video generation. Instead of repeatedly calling GET operations, your server will receive a real-time HTTP POST payload the instant a task finishes. This simplifies building efficient, agentic workflows that might take minutes or hours, reducing latency and infrastructure overhead for your applications.

LLMs & Generative AI49 sec read

Google's April AI Push: Gemma 4 and Agent Platform

Google's April AI update introduces the Gemma 4 open model, an eighth-generation chip, and the Gemini Enterprise Agent Platform. This signals a major push into the "agentic era," providing engineers with the foundational models, hardware, and platforms to build more autonomous AI systems. The release also includes a personalized coding tutor in Colab and the Deep Research Max data analysis tool. Evaluate Gemma 4 for your open-source needs and explore the new agent platform for building complex w

LLMs & Generative AI49 sec read

RoPE: Encoding Position with Rotation

Rotary Position Embedding (RoPE) encodes position by rotating token embeddings, where the angle depends on the token's absolute spot in the sequence. This is used in Transformers like Llama to handle long contexts, as the attention score naturally becomes a function of relative distance. The main footgun is assuming standard position embeddings extrapolate; RoPE is designed for sequence length flexibility, unlike many absolute position encodings which fail on longer inputs.

LLMs & Generative AI43 sec read

Instruction Tuning: Teaching Models to Follow Orders

Instruction tuning teaches a language model to generalize by finetuning it on a massive collection of tasks described in plain English. This transforms a raw pretrained model, which just predicts the next word, into one that can follow commands on unseen tasks without any examples (zero-shot). The footgun is mistaking this for simple finetuning on one task; its power comes from the sheer diversity of instructional tasks used during training.

LLMs & Generative AI46 sec read

Speculative Decoding: Faster LLM Inference, Same Results

Speculative decoding accelerates LLM inference by using a small, fast "draft" model to predict a sequence of tokens. The large, accurate model then validates this entire sequence in a single parallel pass, instead of generating one token at a time. This is used to get 2-3x speedups on production models without retraining. The common misconception is that it's a lossy approximation; in reality, it produces bit-for-bit identical output to the original model.

LLMs & Generative AI46 sec read

Constitutional AI: Teaching an AI Right from Wrong

Constitutional AI teaches a model to be harmless by making it follow a set of principles—a constitution—instead of relying on human-labeled examples of bad behavior. This self-correction process, called Reinforcement Learning from AI Feedback (RLAIF), is used to align powerful models, enabling them to refuse harmful requests while explaining their reasoning. The entire system's safety, however, hinges on the quality and completeness of the initial human-written constitution.

LLMs & Generative AI46 sec read

ReAct: Teaching LLMs to Think, Then Act

ReAct teaches LLMs to 'think then do,' interleaving reasoning steps with actions like querying a database. Instead of just generating a final answer, the model forms a thought, acts on it, observes the result, and then thinks again. This is crucial for complex question-answering where the model must gather external information to ground its reasoning. The main footgun it avoids is hallucination, where models invent facts instead of looking them up.