
LLM Inference Caching: Pay for Computation Once
LLM inference caching reuses past computations to cut costs and latency. It avoids reprocessing shared system prompts or serves full answers for common queries without hitting the model. The footgun: semantic caches can return a "similar" but incorrect answer.
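
The "serve full answers for common queries" case can be sketched as an exact-match cache, assuming a hypothetical `call_model` function that stands in for the real model call. The class name, the key scheme, and the whitespace/case normalization are illustrative assumptions, not any particular library's API. Because it only matches identical normalized queries, it sidesteps the semantic-cache footgun at the price of a lower hit rate.

```python
import hashlib


class ResponseCache:
    """Exact-match cache keyed on a hash of (system prompt, normalized query)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(system_prompt, query):
        # Normalize case and whitespace so trivial variations share one key.
        normalized = " ".join(query.lower().split())
        payload = system_prompt + "\x00" + normalized
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, system_prompt, query, call_model):
        key = self._key(system_prompt, query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for the model call once, then remember the answer.
        self.misses += 1
        answer = call_model(system_prompt, query)
        self._store[key] = answer
        return answer


def fake_model(system_prompt, query):
    # Hypothetical stand-in for a real (slow, paid) model call.
    return f"answer to: {query.strip().lower()}"


cache = ResponseCache()
first = cache.get_or_compute("You are helpful.", "What is caching?", fake_model)
second = cache.get_or_compute("You are helpful.", "  what is   CACHING? ", fake_model)
```

The second call never reaches `fake_model`; it is served from the stored entry.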

Git-Based CI Triggers: Automating on Events
Think of Git events like `push` or `pull_request` as the "play" button for your automation. This is how CI systems automatically run tests on new code. The footgun is using broad triggers, like `push` on all branches, which causes costly and redundant runs.

Dev Containers: Your Dev Environment as Code
A dev container packages your entire development environment—tools, libraries, and settings—into a single, portable container. Use it to standardize team environments, simplify onboarding, and ensure consistency between local dev and CI.

Python Virtual Environments: Isolate Project Dependencies
A Python virtual environment is a self-contained directory with its own Python interpreter and packages, preventing dependency conflicts between projects. The biggest mistake is checking the environment folder into source control; it's disposable and meant to…

Great Expectations: Unit Tests for Your Data
Great Expectations brings unit testing to your data, letting you assert what a dataset should look like. It validates data within a pipeline, preventing bad data from corrupting models or reports.
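
The idea of asserting what a dataset should look like can be sketched in plain Python. The function names below echo the library's `expect_...` naming style, but these are hand-written stand-ins for illustration, not the Great Expectations API, and the sample rows are made up.

```python
def expect_column_values_to_not_be_null(rows, column):
    # Collect the indexes of rows where the column is missing or None.
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} not null", "success": not failing, "failing_rows": failing}


def expect_column_values_to_be_between(rows, column, min_value, max_value):
    # Nulls are skipped here; the not-null check above is responsible for them.
    failing = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (min_value <= row[column] <= max_value)
    ]
    return {"check": f"{column} in [{min_value}, {max_value}]",
            "success": not failing, "failing_rows": failing}


# Made-up sample batch with one negative amount and one missing amount.
rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 3, "amount": None},
]

results = [
    expect_column_values_to_not_be_null(rows, "amount"),
    expect_column_values_to_be_between(rows, "amount", 0, 10_000),
]

# A pipeline step would stop here instead of passing bad rows downstream.
batch_is_valid = all(result["success"] for result in results)
```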

CD4ML: Automating ML from Data to Deployment
CD4ML extends CI/CD to manage ML's three axes of change: code, data, and models. It automates the entire lifecycle, enabling reliable updates for systems like sales forecasting.

Dynamic Batching: Balancing LLM Throughput and Latency
Dynamic batching groups LLM requests like a bus that leaves on a schedule or when full, whichever comes first. This raises throughput on inference servers while capping how long any request waits for a batch to fill. The footgun: all requests in a batch are still held hostage by the slowest one.
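
The bus rule can be sketched as a small simulation over request arrival times. The function name, the millisecond units, and the parameters `max_batch_size` and `max_wait_ms` are assumptions for illustration, not any inference server's configuration keys.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_time_ms, [arrivals]) batches."""
    batches = []
    current = []
    deadline = None
    for t in arrivals_ms:
        # The timer ran out before this request arrived: the bus has already left.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        if not current:
            # The first request of a new batch starts the timer.
            deadline = t + max_wait_ms
        current.append(t)
        if len(current) == max_batch_size:
            # The bus is full: leave immediately instead of waiting for the timer.
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


batches = form_batches([0, 10, 20, 30, 500, 520], max_batch_size=4, max_wait_ms=50)
# -> [(30, [0, 10, 20, 30]), (550, [500, 520])]
```

The first batch leaves because it is full; the second leaves because its timer expires with only two requests on board.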

Model Pruning: Making LLMs Smaller, Not Dumber
Model pruning is surgical weight loss for an LLM, removing neurons or layers to reduce its size. It's used to create smaller, faster versions of models like LLaMA for efficient deployment. The footgun: naive pruning can cripple the model's core capabilities.
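
Removing neurons can be sketched on a toy two-layer network stored as nested lists. The selection rule used here (keep the neurons whose incoming weights have the largest L2 norm) is an assumption chosen for simplicity, and it is exactly the kind of naive criterion the footgun warns about; the weights are made up.

```python
import math


def prune_neurons(w_in, w_out, keep):
    """Keep the `keep` hidden neurons with the largest incoming-weight L2 norm.

    w_in:  one row per hidden neuron (its incoming weights).
    w_out: one row per output unit, one column per hidden neuron.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    # Dropping a neuron removes its row in w_in and its column in w_out.
    pruned_in = [w_in[i] for i in kept]
    pruned_out = [[row[i] for i in kept] for row in w_out]
    return pruned_in, pruned_out, kept


# Made-up weights: 2 inputs -> 4 hidden neurons -> 1 output.
w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
w_out = [[1.0, 0.2, -0.7, 0.1]]

pruned_in, pruned_out, kept = prune_neurons(w_in, w_out, keep=2)
```

Both weight matrices shrink together, which is what makes the pruned network smaller and faster rather than merely sparse.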

Modality Gap: When Multimodal LLMs Don't Trust Their Senses
A multimodal LLM has a modality gap when it trusts one input type (like text) over another (like images), even when both carry the same information. This bias causes performance drops, such as ignoring visual data when conflicting text is present.

Full Fine-Tuning: Updating Every Model Parameter
Full fine-tuning updates all weights of a pre-trained model on your new data, unlike methods that only change a small fraction. Use it to deeply embed new knowledge, but beware: it's costly and risks making the model forget its original general skills.

Cross-Encoder Re-ranking: Accuracy Over Speed
A cross-encoder re-ranks search results by reading the query and each document together, allowing it to spot subtle connections. It's the second, high-precision step in a search pipeline, re-ordering a small list of candidates.
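
The second-stage step can be sketched as a function that scores each (query, document) pair jointly and re-orders a short candidate list. The `toy_score` function is a hypothetical word-overlap stand-in for a real cross-encoder forward pass, used only so the example runs without a model; the candidates are made up.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Score every (query, document) pair together, then keep the best top_k."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Sort by score only; ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_score(query, doc):
    # Hypothetical stand-in for a cross-encoder: fraction of query words in the doc.
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


# Candidates as a fast first-stage retriever might return them.
candidates = [
    "how to reset a router",
    "reset your password in settings",
    "password strength tips",
]

ranked = rerank("reset password", candidates, toy_score, top_k=2)
```

Because the scorer runs once per candidate, the cost grows with the list length, which is why this step is applied only to a small shortlist.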

Mixture of Experts: Scaling Models by Activating Specialists
A Mixture of Experts (MoE) model acts like a team of specialists instead of one generalist. A router sends each token to a few expert sub-networks, so only a fraction of the model's parameters run per token. That lets the total parameter count grow without a matching rise in compute per token.
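
Top-k routing can be sketched with scalar "experts" in plain Python. In a real model the router logits come from a learned layer and the experts are feed-forward blocks; here the logits are supplied by hand and the experts are made-up functions, both assumptions for illustration.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_logits, experts, top_k=2):
    """Run only the top_k experts and mix their outputs by renormalized router weight."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Experts outside `chosen` are never called, which is where the savings come from.
    output = sum((probs[i] / norm) * experts[i](token) for i in chosen)
    return output, chosen


# Four toy experts; only two of them run for any given token.
experts = [
    lambda x: x + 1,
    lambda x: 2 * x,
    lambda x: -x,
    lambda x: x * x,
]

output, chosen = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts)
```

With equal logits for experts 1 and 3, each contributes half of the output: `0.5 * 6.0 + 0.5 * 9.0`.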

The Llama Model Family: Open-Source AI for Production
Think of Llama not as one model but as a family of open-weight models, released under Meta's own community license, that you can run on your own infrastructure. Use it for cost-effective, fine-tuned applications like internal search or when you need full control. The biggest mistake is mis-sizing the model for your task.

Hugging Face Hub: The GitHub for Machine Learning
Think of the Hugging Face Hub as the GitHub for machine learning. It's a central platform to find, share, and collaborate on millions of models, datasets, and demo apps. Use it to download a pre-trained model or share your own.

The EU AI Act: Risk-Based AI Regulation
The EU AI Act isn't a blanket ban but a risk-based framework. It sorts AI into tiers—from unacceptable to minimal risk—and applies rules proportionally, affecting any company with AI users in the EU. The footgun is assuming it only applies to EU companies.

Fairness Metrics: Quantifying AI's Impact on People
Fairness metrics translate "fairness" into a measurable score, checking if a model treats groups equitably. They are crucial for models in hiring or lending.
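
One such score, demographic parity difference, can be sketched as the gap in positive-prediction rates between groups. The predictions and group labels below are made up, and this is only one of several fairness metrics; which one fits depends on the application.

```python
def selection_rate(predictions, groups, group):
    """Share of members of `group` that received a positive prediction (1)."""
    picked = [p for p, g in zip(predictions, groups) if g == group]
    return sum(picked) / len(picked)


def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups (0.0 means parity)."""
    rates = {g: selection_rate(predictions, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values()), rates


# Made-up hiring-style predictions: 1 = shortlisted, 0 = rejected.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(predictions, groups)
```

Here group `a` is shortlisted 75% of the time and group `b` 25%, so the gap is 0.5.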

vLLM: Faster LLM Inference with PagedAttention
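vLLM is a serving engine that speeds up LLM inference with PagedAttention, which manages the attention key-value (KV) cache in fixed-size blocks, much as an operating system pages virtual memory. It's used to serve models with higher throughput: less GPU memory is lost to fragmentation and over-reservation, so more requests fit in a batch.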
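
The virtual-memory analogy can be sketched as a toy block allocator for a key-value cache. This is an illustration of the paging idea only: the class and method names are hypothetical and do not correspond to vLLM's code or API.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> physical block ids, in order
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is claimed only when the previous one is completely full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # A finished sequence returns its blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need 1 block

tables_before_free = {seq: list(blocks) for seq, blocks in cache.block_tables.items()}
cache.free("a")
```

Memory is reserved one block at a time as a sequence grows, so at most one partly filled block per sequence is ever wasted.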

FlashAttention: Faster, Memory-Efficient Exact Attention
FlashAttention is an IO-aware algorithm that computes exact attention faster and with less memory. It processes attention in tiles that fit in fast on-chip memory, cutting reads and writes to the GPU's slower high-bandwidth memory, which makes it a key optimization for training and serving large models on modern GPUs.
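
Why tiling can still give the exact result is easiest to see with a running ("online") softmax. The sketch below handles a single query in plain Python and compares it with the straightforward computation. It shows only the numerical trick; the real FlashAttention is a fused GPU kernel, and the function names, block size, and data here are assumptions.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

The tiled version never holds more than one block of scores at a time, yet `tiled` matches `reference` up to floating-point rounding.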

ONNX Runtime: Run Any AI Model, Anywhere
ONNX Runtime is a cross-platform inference engine for models exported to the ONNX format, letting you run them efficiently on a wide range of hardware, from cloud GPUs to a user's browser. It's used to deploy models for fast inference on servers or mobile devices.

Post-Training Quantization: Shrink Models Without Retraining
Post-Training Quantization (PTQ) shrinks a pre-trained model by converting its weights to lower precision, like turning a WAV file into an MP3. Use it to run large models on consumer GPUs without costly retraining.
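
The weight conversion can be sketched as symmetric 8-bit quantization with a single scale for the whole tensor. That scheme is an assumption chosen for simplicity (practical PTQ often uses per-channel scales and calibration data), and the weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; the rounding error is what PTQ trades for size."""
    return [q * scale for q in quantized]


# Made-up weights from an already trained layer; no retraining happens here.
weights = [0.42, -1.27, 0.003, 0.9, -0.31]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four (for 32-bit floats), and the reconstruction error per weight stays within half a quantization step. Note that the small weight `0.003` rounds to zero, the MP3-style loss of fine detail.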