AI Hallucination: Confabulation, Not Perception
AI hallucination is when a model confidently invents plausible-sounding facts to fill gaps in its knowledge. Despite the name, this is not a perceptual error but a confabulation: a fluent response constructed without grounding in fact. It tends to occur when a model must produce an answer but lacks reliable information, such as when asked about niche topics. The biggest footgun is trusting fluent, confident-sounding output without independent verification, as it may be entirely fabricated.
Cosine Similarity: Measuring Direction, Not Distance
Cosine similarity gauges how alike two vectors are by the cosine of the angle between them, not by the distance between them. It asks, "Do these point in the same direction?" Scores range from 1 (same direction) through 0 (orthogonal) to -1 (opposite direction). This is fundamental in AI for comparing text embeddings, where a vector's direction represents its meaning. The main footgun is confusing it with Euclidean distance; cosine similarity ignores vector magnitude, so two vectors can be far apart in space but still score as nearly identical if their orientation is the same.
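To make the direction-versus-distance point concrete, here is a minimal sketch in plain Python. The function names and the two example vectors are illustrative assumptions, not taken from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (equal-length sequences)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the endpoints of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vectors: same direction, very different magnitude.
short_vec = [1.0, 2.0]
long_vec = [10.0, 20.0]

similarity = cosine_similarity(short_vec, long_vec)  # approximately 1.0
distance = euclidean_distance(short_vec, long_vec)   # roughly 20.1
```

The two vectors are about 20 units apart, yet their cosine similarity is essentially 1, because scaling a vector does not change the direction it points in.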
Word Embeddings: Turning Words into Vectors
Word embeddings turn words into numerical vectors, like coordinates on a map of meaning. Words with similar meanings, like "king" and "queen," are placed close together in this vector space. This is fundamental for text analysis in machine learning, allowing models to grasp semantic relationships instead of just matching strings. The footgun is assuming a vector's individual numbers are human-interpretable; they are abstract features learned from data, and the meaning lies in how vectors relate to one another.
Byte Pair Encoding: Compressing Text for LLMs
Think of Byte Pair Encoding (BPE) as creating custom abbreviations for frequent character pairs to compress text. It repeatedly finds the most frequent adjacent pair of symbols, like 't' followed by 'h', and merges it into a new token, 'th'. LLMs use this to build vocabularies of common sub-word units, so a rare word can be represented as a sequence of familiar pieces instead of a single unknown token. The main footgun is that the vocabulary size is a hyperparameter fixed before training: too small and text splits into long token sequences, too large and rare tokens are poorly learned while the model grows.
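The merge loop can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (character-level start, a tiny made-up word list, ties broken by first occurrence); the function name `learn_bpe_merges` is hypothetical, and production tokenizers differ in many details.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# Hypothetical mini-corpus in which 't' followed by 'h' is the most common pair.
merges = learn_bpe_merges(["the", "then", "this", "that"], num_merges=2)
```

Here `num_merges` plays the role of the vocabulary-size budget: each merge adds one token to the vocabulary, so the number of merges allowed determines how large the final vocabulary becomes.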
Large Language Models (LLMs)
A large language model is a sophisticated pattern-matching engine trained on a massive library of text. LLMs power modern chatbots and can generate, summarize, or translate text by repeatedly predicting a probable next token (roughly, the next word or word fragment) based on the patterns they have learned. The key footgun is that their output reflects the biases and inaccuracies of their training data, so it can sound confident while being unreliable.
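The idea of predicting the next word from learned patterns can be illustrated with a deliberately tiny stand-in: a bigram counter. This is an assumption-laden toy, not how an LLM works internally (real models use neural networks over sub-word tokens and far more context); the function names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of word, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical training text: "cat" follows "the" more often than "mat" does.
model = train_bigram_model("the cat sat on the mat and the cat slept")
prediction = predict_next(model, "the")   # "cat"
unseen = predict_next(model, "dog")       # None: no pattern to draw on
```

Even this toy shows the footgun in miniature: the prediction is only as good as the text it was trained on, and the model has no notion of whether "cat" is true, only that it was frequent.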