Direct Preference Optimization (DPO): Your LLM is a Reward Model
Direct Preference Optimization (DPO) treats your language model as a secret reward model, simplifying alignment with human preferences. Instead of RLHF's complex multi-stage process, DPO directly fine-tunes the model on preference data (e.g., "response A is better than B") using a simple classification loss. This avoids training a separate reward model and the instability of reinforcement learning. The footgun is assuming DPO works without a strong base model and quality preference data.
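To make the "simple classification loss" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have summed token log-probabilities for each response under the policy being trained and under a frozen reference model; the function name, argument names, and the default `beta` are illustrative, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss only falls when the policy raises the chosen response relative to the rejected one, measured against the reference; no separate reward model appears anywhere in the computation.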
LoRA: Fine-Tuning LLMs with a Fraction of the Cost
LoRA fine-tunes a massive model by training tiny low-rank "adjustment" matrices instead of retraining all its billions of parameters. This allows you to create many specialized versions of a base model like GPT-3 without the prohibitive cost of storing and training full copies. The key advantage is that these adjustments can be merged into the original weights, so a specialized model adds no inference latency. The footgun is assuming every parameter-efficient technique shares this property; adapter layers inserted into the network, for example, add latency at inference time.
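The merge property can be checked on a toy example. The sketch below uses plain Python lists and hand-picked numbers (all hypothetical) for a frozen weight matrix `W` and rank-1 factors `B` and `A`, with the common `alpha / r` scaling assumed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

# Frozen base weight (3x3) and trainable low-rank factors with rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]   # 3 x r
A = [[1.0, 2.0, 0.0]]        # r x 3
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]    # input as a column vector

# During training: base path plus a separate low-rank adapter path.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# For deployment: fold the update into W once, then use a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)
```

Both paths give the same output, and the merged model has exactly the shape of the original, which is why inference costs nothing extra. Here the adapter trains 6 numbers instead of 9; the saving grows quickly as the matrices get larger.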
Mixture of Experts: Scaling LLMs with a Team of Specialists
A Mixture of Experts (MoE) model isn't one giant brain but a team of specialists, routing each task to the most qualified sub-network. This allows large language models to have a massive number of parameters for knowledge, but only activate a small, computationally cheap fraction for any given input. The footgun is mistaking the total parameter count for the active parameters used during inference; MoE models are sparsely activated.
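A minimal sketch of sparse routing, assuming top-k gating with renormalized softmax weights. The experts here are trivial scalar functions and the router scores are hand-picked; in a real model both are learned layers and the router scores are computed from the input.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    active = ranked[:k]
    # Gate weights are renormalized over the selected experts only.
    gates = softmax([router_logits[i] for i in active])
    output = sum(g * experts[i](x) for g, i in zip(gates, active))
    return output, active

# Four hypothetical experts; only two run for any given input.
EXPERTS = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
ROUTER_LOGITS = [0.1, 2.0, 2.0, -1.0]

output, active = moe_forward(3.0, ROUTER_LOGITS, EXPERTS, k=2)
```

All four experts count toward the total parameter budget, but only the two in `active` are evaluated, which is the distinction between total and active parameters.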
Knowledge Distillation: Shrinking Models, Keeping Smarts
Knowledge distillation trains a small 'student' model to mimic a large 'teacher' model, capturing its expertise in a much smaller package. This is used to deploy powerful but slow models onto resource-constrained hardware like smartphones for real-time inference. The footgun is assuming the student perfectly matches the teacher; you're trading a small amount of accuracy for a massive gain in efficiency and lower computational cost.
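One common way to make the student "mimic" the teacher is to match softened output distributions. The sketch below assumes that formulation: a KL divergence between temperature-scaled softmax outputs, multiplied by the squared temperature by convention. The logits are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 4.0, 1.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A higher temperature exposes more of the teacher's relative preferences among the wrong classes, which is part of what the student learns from. In practice this term is usually combined with an ordinary loss on the true labels.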
RLHF: Teaching an AI 'Good' Without Code
Reinforcement Learning from Human Feedback (RLHF) teaches a model what humans prefer by having it chase the approval of a proxy 'reward model' trained on human rankings. It's the key technique for making large language models more helpful and harmless by aligning them with nuanced instructions that are hard to define in code. The main footgun is 'reward hacking,' where the model finds loopholes to please the reward model in ways that don't actually satisfy users.

VAEs: Generating New Data by Learning Its Essence
A Variational Autoencoder (VAE) learns the *essence* of data, not just how to copy it. Instead of compressing an input to a single point, it maps it to a fuzzy region in a "concept space," allowing you to generate new, similar data by sampling from that region. This is key for creating novel images or music. The footgun is expecting sharp outputs; VAEs often produce blurrier results than models like GANs.
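The "fuzzy region" is usually a Gaussian described by a mean and a log-variance per latent dimension. The sketch below assumes that standard setup and shows the two pieces specific to a VAE: sampling via the reparameterization trick, and the KL term that pulls each region toward a standard normal. The encoder and decoder networks are omitted.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, 0.0]   # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)     # a point inside the fuzzy region
penalty = kl_to_standard_normal(mu, log_var)
```

To generate new data after training, you would sample `z` from a standard normal and pass it through the decoder.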
Tool Use: Giving LLMs Access to External Systems
Tool use lets an LLM call external functions, like a brain accessing a calculator or the internet. This is the core mechanism behind AI agents that can search the web, run code, or query a database to answer questions. The biggest footgun is assuming the model will always generate a valid function call; without enforcing a strict schema to match your function's expected input, your agent can fail unpredictably.
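A minimal sketch of the schema check described above, assuming the model emits tool calls as JSON with `name` and `arguments` fields. The `get_weather` tool and its argument spec are hypothetical; production systems typically use JSON Schema or a provider's structured-output feature instead of hand-written checks.

```python
import json

# Hypothetical tool definition: required argument names and their types.
WEATHER_TOOL = {"name": "get_weather", "required": {"city": str, "unit": str}}

def parse_tool_call(raw, tool):
    """Return (arguments, None) if the call is valid, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        return None, "unknown or missing tool name"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None, "arguments must be an object"
    for key, expected_type in tool["required"].items():
        if key not in args:
            return None, f"missing argument: {key}"
        if not isinstance(args[key], expected_type):
            return None, f"wrong type for argument: {key}"
    return args, None

good = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

args, error = parse_tool_call(good, WEATHER_TOOL)
_, bad_error = parse_tool_call(bad, WEATHER_TOOL)
```

Returning the error as data rather than raising lets the agent loop feed the message back to the model and ask it to retry.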
RAG: Giving Language Models an Open-Book Exam
Retrieval-Augmented Generation (RAG) gives a language model an open-book exam instead of forcing it to memorize everything. It combines a model's reasoning ability with a searchable external knowledge base. This grounds LLM responses in specific, up-to-date information, like a support bot using a product manual. The footgun is forgetting that the quality of the retrieved information directly limits the quality of the final answer.
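The retrieve-then-generate flow can be sketched without any model at all. The example below assumes a toy retriever that ranks passages by word overlap (a real system would use embeddings or BM25) and a hypothetical three-line product manual; the final call to the language model is left out.

```python
def tokenize(text):
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())

def retrieve(query, documents, k=1):
    """Rank documents by how many query words they share."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Place retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

MANUAL = [
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "Firmware updates are installed from the admin page.",
]

query = "How do I reset the router?"
passages = retrieve(query, MANUAL, k=1)
prompt = build_prompt(query, passages)
```

If `retrieve` returns the wrong passage, the model is answering from the wrong page of the book, which is why retrieval quality caps answer quality.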
Chain-of-Thought: Making LLMs 'Show Their Work'
Chain-of-thought prompting makes an LLM 'show its work' by generating intermediate reasoning steps before the final answer. This simple few-shot technique dramatically improves performance on complex tasks like math word problems or commonsense questions, especially for very large models. The common footgun is applying it to smaller models, where it can actually degrade performance instead of helping, as the reasoning ability hasn't yet emerged.
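In the few-shot form, the technique amounts to prepending worked examples whose answers spell out the reasoning. A minimal sketch follows; the exemplar text, the `The answer is` marker, and the helper names are illustrative assumptions, and the model call itself is not shown.

```python
# Hypothetical exemplar: the reasoning is written out before the final answer.
EXEMPLAR = (
    "Q: A shop has 3 boxes with 4 pens each. It sells 5 pens. How many are left?\n"
    "A: 3 boxes of 4 pens is 3 * 4 = 12 pens. After selling 5, 12 - 5 = 7 remain. "
    "The answer is 7.\n"
)

def chain_of_thought_prompt(question, exemplars=(EXEMPLAR,)):
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    return "\n".join(exemplars) + f"\nQ: {question}\nA:"

def extract_answer(completion, marker="The answer is"):
    """Pull the final answer out of a completion that follows the exemplar format."""
    if marker not in completion:
        return None
    return completion.rsplit(marker, 1)[1].strip().rstrip(".")

prompt = chain_of_thought_prompt("A train has 6 cars with 20 seats each. How many seats?")
answer = extract_answer("6 cars of 20 seats is 6 * 20 = 120. The answer is 120.")
```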
Generative Adversarial Networks (GANs): An AI Arms Race
Think of a GAN as an AI arms race between two networks: a forger and a detective. The forger network (Generator) creates fake data, like images or audio, while the detective network (Discriminator) tries to spot the fakes. This competition forces the forger to create increasingly realistic outputs. The main footgun is training instability: if one network overpowers the other too early, the whole system fails to learn and produces garbage.
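The two opposing objectives can be written down in a few lines. The sketch below assumes the standard binary cross-entropy discriminator loss and the commonly used non-saturating generator loss; `d_real` and `d_fake` stand for the discriminator's probability that a real or a generated sample is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Detective: score real samples near 1 and fakes near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Forger (non-saturating variant): push the detective's score on fakes toward 1."""
    return -math.log(d_fake)

# At equilibrium the detective is reduced to guessing (0.5 for everything).
balanced_d = discriminator_loss(0.5, 0.5)
balanced_g = generator_loss(0.5)

# A detective that easily spots fakes leaves the forger with a large loss.
losing_g = generator_loss(0.01)
```

The same quantity `d_fake` is pushed down by one network and up by the other, which is the tug-of-war behind the instability.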
Diffusion Models: Generating Data by Reversing Noise
Think of diffusion models as learning to reverse a "random walk." They take a clean data point, gradually add noise until it's unrecognizable, and then train a model to reverse that process step-by-step. This allows them to start with pure noise and guide it back into a coherent sample that resembles the original dataset. The footgun is that this multi-step reversal makes generation computationally intensive compared to single-pass models.
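The noising half of the process has a closed form, so a training example at any step can be produced in one shot. The sketch below assumes a DDPM-style linear noise schedule; the schedule endpoints and step count are conventional values chosen for illustration, and the denoising network is omitted.

```python
import math
import random

def linear_betas(steps, start=1e-4, end=0.02):
    """Per-step noise variances, rising linearly from start to end."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta): the fraction of signal left at each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_sample(x0, t, abar, rng):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = abar[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -2.0]
xt, eps = noise_sample(x0, 10, abar, rng)
```

During training the model sees `xt` and `t` and is typically asked to predict `eps`. Generation then runs the learned reversal for many steps in sequence, which is where the compute cost comes from.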
Perplexity: Measuring a Model's Uncertainty
Perplexity frames a model's uncertainty as the effective number of choices it's considering. For a fair die with six outcomes, the perplexity is 6, reflecting uniform uncertainty across six equally likely options. When evaluating language models, a lower perplexity on held-out text indicates a better ability to predict that text. The footgun is judging the score in a vacuum; a 'good' perplexity is always relative to the task's inherent randomness, and scores are only comparable between models that share a tokenization and a test set.
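Perplexity is the exponential of the average negative log-probability the model assigned to what actually happened. A minimal sketch, with the per-token probabilities supplied by hand rather than by a model:

```python
import math

def perplexity(probabilities):
    """exp of the mean negative log-probability of the observed outcomes."""
    nll = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(nll)

# A fair die: every observed roll had probability 1/6.
die = perplexity([1 / 6] * 10)

# Hypothetical per-token probabilities from two language models on the same text.
confident_model = perplexity([0.9, 0.8, 0.95])
uncertain_model = perplexity([0.2, 0.1, 0.3])
```

The die comes out at 6 regardless of how many rolls are scored, matching the "effective number of choices" reading.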
BLEU Score: Judging Translation by Human Overlap
The BLEU score judges a machine translation by how much its word sequences (n-grams) overlap with one or more professional human translations. It's a popular, automated, and inexpensive way to benchmark translation systems, like comparing different versions of a model. The main footgun is that a high score indicates high textual overlap, not necessarily better fluency or meaning, as it's just a proxy for human judgment.
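The core ingredient of BLEU is clipped n-gram precision. The sketch below implements only that piece against a single reference; full BLEU combines precisions for n = 1 to 4 with a geometric mean and a brevity penalty, which are omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    # Clipping: an n-gram earns credit at most as often as the reference contains it.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()
repetitive = "the the the the the the the".split()
scrambled = "mat the on is cat the".split()

p_repetitive = modified_precision(repetitive, reference, 1)  # 2 of 7 unigrams count
p_scrambled_1 = modified_precision(scrambled, reference, 1)  # every word matches
p_scrambled_2 = modified_precision(scrambled, reference, 2)  # no word pair matches
```

The scrambled sentence gets perfect unigram precision while being meaningless, a small-scale version of the overlap-is-not-meaning footgun; higher-order n-grams catch this case, but not every such failure.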
RNNs: Neural Networks with Short-Term Memory
A Recurrent Neural Network (RNN) processes sequences by keeping a running memory of what it's seen. It feeds its hidden state from one step back into the next, like someone reading a sentence one word at a time. This is well suited to sequential data like text or time series where context is key. The main footgun is its notoriously short memory; information from early in a long sequence often gets lost, because its influence shrinks at every step it is carried forward.
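The fading memory is easy to see in a one-unit RNN. The sketch below assumes a scalar hidden state with a tanh activation and hand-picked weights (hypothetical, not trained), and compares two sequences that differ only in their first input.

```python
import math

def rnn_final_state(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Run a one-unit RNN over the inputs and return the last hidden state."""
    h = 0.0
    for x in inputs:
        # The previous hidden state is fed back in alongside the new input.
        h = math.tanh(w_x * x + w_h * h + b)
    return h

def first_input_effect(length):
    """How much the final state changes when only the first input changes."""
    with_signal = [1.0] + [0.0] * (length - 1)
    without_signal = [0.0] * length
    return abs(rnn_final_state(with_signal) - rnn_final_state(without_signal))

short_effect = first_input_effect(2)
long_effect = first_input_effect(20)
```

After 20 steps the first input has almost no effect on the final state. Gated variants such as LSTMs and GRUs were designed to reduce this decay.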
Vector Databases: Searching by Meaning, Not Matches
A vector database organizes data by meaning, not just exact values. Instead of finding a record by its ID, you find it by the similarity of its embedding vector to the query's. This powers AI features like Retrieval-Augmented Generation (RAG), where relevant documents are retrieved and handed to an LLM, and recommendation engines. The main footgun is that most vector indexes return *approximate* nearest neighbors, trading perfect accuracy for speed when searching large collections of unstructured data.
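At its simplest, similarity search is a ranking by cosine similarity. The sketch below does an exact brute-force scan over three hand-made 3-dimensional vectors (hypothetical stand-ins for real embeddings); a vector database replaces the scan with an approximate index so it can handle millions of items.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def search(query_vec, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(index.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical embeddings; a real system would get these from an embedding model.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "return-an-item": [0.8, 0.2, 0.1],
}

results = search([1.0, 0.0, 0.0], INDEX, k=2)
```

Neither result shares an exact value with the query; they are returned because their vectors point in a similar direction.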
Attention: Weighing Input by Relative Importance
The attention mechanism lets a model decide which parts of a sequence are most important relative to others. In natural language processing, it assigns 'soft' weights to words, allowing the model to focus on what's most relevant for a given task. It's used to encode sequences of token embeddings, from short phrases to massive documents. The main pitfall is forgetting that these weights are contextual and relative, not absolute measures of a word's importance.
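A minimal sketch of how the soft weights arise, assuming scaled dot-product attention for a single query. The 2-dimensional vectors are made up, and the learned projection matrices a real model applies are left out.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return the weighted sum of values and the attention weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

query = [1.0, 0.0]
first, second, third = [1.0, 0.0], [0.0, 1.0], [2.0, 0.0]

# The same first token, attended to in two different contexts.
_, weights_two = attend(query, [first, second], [first, second])
_, weights_three = attend(query, [first, second, third], [first, second, third])
```

The first token's weight drops once a more relevant token joins the sequence, even though the token itself is unchanged, which is what "contextual and relative" means in practice.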

Transformers: Processing Language in Parallel with Attention
Transformers process all input text at once, weighing which words are most important to each other in parallel. This 'attention' mechanism is the core of models like GPT, allowing them to understand context over long sequences. Text is broken into tokens, turned into vectors, and then contextualized by multiple attention 'heads'. The key footgun is that attention alone is order-agnostic; without explicit positional encodings, the model can't distinguish 'dog bites man' from 'man bites dog'.
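The order-agnostic footgun can be demonstrated directly. The sketch below assumes single-head self-attention with identity projections, one-hot toy embeddings, and the sinusoidal positional encoding from the original Transformer design; it compares what the word "bites" sees in each sentence.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_output(vectors, i):
    """Attention output for token i over the whole sequence (identity Q/K/V)."""
    d = len(vectors[i])
    scores = [sum(a * b for a, b in zip(vectors[i], v)) / math.sqrt(d) for v in vectors]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]

def positional_encoding(pos, d):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    enc = []
    for j in range(d):
        angle = pos / 10000 ** (2 * (j // 2) / d)
        enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return enc

def add_positions(vectors):
    return [[x + p for x, p in zip(v, positional_encoding(pos, len(v)))]
            for pos, v in enumerate(vectors)]

# Hypothetical one-hot embeddings.
EMB = {"dog": [1, 0, 0, 0], "bites": [0, 1, 0, 0], "man": [0, 0, 1, 0]}
a = [EMB[w] for w in "dog bites man".split()]
b = [EMB[w] for w in "man bites dog".split()]

plain_a, plain_b = self_attention_output(a, 1), self_attention_output(b, 1)
pos_a = self_attention_output(add_positions(a), 1)
pos_b = self_attention_output(add_positions(b), 1)
```

Without positions, "bites" receives the same representation in both sentences up to a swap of the dog and man components, so nothing records who is doing the biting; with positional encodings added, the two outputs differ in every component.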
Softmax Function: Turning Scores into Probabilities
The softmax function turns a list of raw scores from a model into a clean probability distribution where all values sum to 1. It's most often the final step in a neural network for multi-class classification, like deciding if an image is a 'cat', 'dog', or 'bird'. The main footgun is mistaking a high softmax probability for calibrated confidence; it only reflects the score's strength relative to the other scores, not an absolute measure of certainty.
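A minimal sketch of softmax with the usual max-subtraction for numerical stability. The scores are made up; the second call shows that only the gaps between scores matter.

```python
import math

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for 'cat', 'dog', 'bird'.
probs = softmax([1.0, 2.0, 3.0])

# Adding 100 to every score changes nothing: softmax sees only differences.
shifted = softmax([101.0, 102.0, 103.0])
```

Because uniformly weak scores and uniformly strong scores yield the same distribution, the top probability by itself says little about how sure the model should be.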
Prompt Engineering: How to Talk to AIs
Think of prompt engineering as giving a smart but literal intern a precise set of instructions. It's the skill of structuring your text input to guide a generative AI toward a specific, desired output, moving beyond simple keywords. This is essential for getting reliable results, from formatted JSON to correctly styled text. The biggest mistake is treating the AI like a search engine instead of a collaborator that needs clear direction.
Prompt Engineering: Steering AI with Words
Prompt engineering is steering an AI with carefully chosen words instead of code. You use it to get reliable results from chatbots like ChatGPT or to build applications that use large language models (LLMs). The biggest mistake is treating the AI like a search engine; effective prompts provide context, examples, and constraints to guide the model, rather than just asking a simple question.