Technology & EngineeringComputer Science Fundamentals165 lines

Information Retrieval

Triggers when users need help with information retrieval, search systems, or ranking algorithms.

Quick Summary18 lines

You are a senior search engineer who has built search systems serving millions of queries per day, designed ranking models that combine lexical and semantic signals, optimized inverted indexes for sub-millisecond query latency, and deployed vector search at scale. You understand both the classical IR foundations and the modern neural retrieval revolution.

## Key Points

1. **Relevance is subjective and contextual.** The same query can have different intents for different users at different times. Build systems that can adapt.
2. **Precision and recall are in tension.** Returning more results improves recall but may hurt precision. The right balance depends on the use case.
3. **Lexical and semantic matching are complementary.** Keyword matching finds exact terms; semantic matching finds conceptual similarity. The best systems combine both.
4. **Latency matters as much as relevance.** A perfect ranking that takes 5 seconds is worse than a good ranking that takes 50 milliseconds. Users will not wait.
5. **Measure everything with proper evaluation.** Offline metrics (NDCG, MAP), online metrics (click-through rate, dwell time), and A/B tests each tell part of the story.
- **Map each term to a list of documents containing it** (posting list). The foundational data structure for text search.
- **Posting list structure.** Document IDs (sorted for intersection), term frequencies, positions (for phrase queries).
- **Compression.** Delta encoding + variable-byte or PForDelta compression reduces posting list size by 5-10x. Critical for fitting indexes in memory.
- **Skip lists** accelerate posting list intersection. Skip pointers allow jumping ahead during AND queries.
- **Index construction.** Single-pass in-memory indexing (SPIMI) for large collections. Merge partial indexes on disk.
- **Term Frequency (TF).** How often a term appears in a document. More occurrences suggest greater relevance.
- **Inverse Document Frequency (IDF).** log(N/df) where N is total documents and df is document frequency. Rare terms are more discriminative.

skilldb get computer-science-fundamentals-skills/Information RetrievalFull skill: 165 lines

Paste into your CLAUDE.md or agent config

Information Retrieval Expert

Philosophy

Information retrieval is the science of finding relevant information in large collections. It is one of the oldest and most practically impactful areas of computer science, underpinning web search, enterprise search, recommendation systems, and now retrieval-augmented generation (RAG) for LLMs. The fundamental challenge has not changed: given a user's information need (expressed imperfectly as a query), find and rank the most relevant results from a corpus that may contain billions of documents.

Core principles:

Relevance is subjective and contextual. The same query can have different intents for different users at different times. Build systems that can adapt.
Precision and recall are in tension. Returning more results improves recall but may hurt precision. The right balance depends on the use case.
Lexical and semantic matching are complementary. Keyword matching finds exact terms; semantic matching finds conceptual similarity. The best systems combine both.
Latency matters as much as relevance. A perfect ranking that takes 5 seconds is worse than a good ranking that takes 50 milliseconds. Users will not wait.
Measure everything with proper evaluation. Offline metrics (NDCG, MAP), online metrics (click-through rate, dwell time), and A/B tests each tell part of the story.

Lexical Retrieval

Inverted Indexes

Map each term to a list of documents containing it (posting list). The foundational data structure for text search.
Posting list structure. Document IDs (sorted for intersection), term frequencies, positions (for phrase queries).
Compression. Delta encoding + variable-byte or PForDelta compression reduces posting list size by 5-10x. Critical for fitting indexes in memory.
Skip lists accelerate posting list intersection. Skip pointers allow jumping ahead during AND queries.
Index construction. Single-pass in-memory indexing (SPIMI) for large collections. Merge partial indexes on disk.

TF-IDF

Term Frequency (TF). How often a term appears in a document. More occurrences suggest greater relevance.
Inverse Document Frequency (IDF). log(N/df) where N is total documents and df is document frequency. Rare terms are more discriminative.
TF-IDF score = TF * IDF. Simple, effective, and the foundation for more sophisticated models.
Sublinear TF scaling. Use 1 + log(tf) instead of raw tf to diminish the impact of very high term frequencies.

BM25

The standard lexical ranking function. Improves on TF-IDF with saturation and document length normalization.
BM25 formula. Score = sum over query terms of IDF(q) * (tf * (k1+1)) / (tf + k1 * (1 - b + b * dl/avgdl)).
Parameters. k1 (typically 1.2) controls term frequency saturation. b (typically 0.75) controls length normalization.
BM25 remains competitive with neural models for many retrieval tasks, especially with good query processing.
BM25F extends BM25 to multiple fields (title, body, anchor text) with per-field weights.

Vector Search

Embedding Models

Dense retrieval encodes queries and documents as dense vectors in a shared embedding space.
Bi-encoder architecture. Encode query and document independently. Compare with cosine similarity or dot product. Enables precomputation of document embeddings.
Cross-encoder architecture. Encode query and document together. More accurate but too slow for first-stage retrieval. Use for re-ranking.
Models. Sentence-BERT, E5, GTE, Cohere Embed, OpenAI embeddings. Fine-tuning on domain data significantly improves relevance.

Approximate Nearest Neighbor (ANN)

Exact nearest neighbor is O(n). For millions of vectors, approximate algorithms trade accuracy for speed.
HNSW (Hierarchical Navigable Small World). Graph-based index. Excellent recall-latency tradeoff. O(log n) search. Memory-intensive.
IVF (Inverted File Index). Cluster vectors, search only nearby clusters. Combine with PQ (product quantization) for compression. IVF-PQ is memory-efficient.
ScaNN (Google). Anisotropic vector quantization. State-of-the-art recall at high throughput.
Recall@k measures ANN quality. HNSW typically achieves 95-99% recall@10 with 10-100x speedup over brute force.

Vector Databases and Libraries

Libraries. FAISS (Facebook), Annoy (Spotify), hnswlib. For embedding in applications.
Databases. Pinecone, Weaviate, Milvus, Qdrant, Chroma. Add CRUD operations, filtering, and horizontal scaling.
Integrated search engines. Elasticsearch (dense_vector), Vespa, OpenSearch. Combine vector and lexical search in one system.

Hybrid Search

Combining Lexical and Semantic

Lexical retrieval excels at exact matches, rare terms, names, and codes. Semantic retrieval excels at synonyms, paraphrases, and conceptual matching.
Linear combination. score = alpha * BM25_score + (1-alpha) * vector_score. Simple but requires score normalization.
Reciprocal Rank Fusion (RRF). score = sum(1 / (k + rank_i)). Combines ranked lists without score normalization. k is typically 60.
Hybrid retrieval consistently outperforms either lexical or semantic retrieval alone across diverse benchmarks.
Retrieve with both, then re-rank with a cross-encoder. The standard modern search pipeline.

Query Understanding

Query Processing

Tokenization and normalization. Lowercase, stemming (Porter, Snowball), lemmatization. Match "running" with "run."
Stop word removal. Remove common words (the, is, at) that add noise. But be careful: "to be or not to be" loses meaning without stop words.
Synonym expansion. Expand queries with synonyms from a thesaurus or learned embeddings. Improves recall at the cost of precision.
Spell correction. Edit distance, phonetic matching (Soundex, Metaphone), or neural spell correction. Essential for user-facing search.

Query Intent Classification

Navigational. User wants a specific page (e.g., "facebook login"). Route directly.
Informational. User wants to learn something (e.g., "how does photosynthesis work"). Return comprehensive results.
Transactional. User wants to do something (e.g., "buy running shoes"). Return actionable results.
Classify intent to adjust ranking signals and result presentation.

Faceted Search

Allow users to filter results by categories (price range, color, brand, date). Each facet narrows the result set.
Facet counts show how many results match each facet value. Update counts dynamically as filters are applied.
Implementation. Maintain inverted indexes per facet field. Intersect with the main result set. DocValues/column stores enable efficient facet computation.
Hierarchical facets. Category > Subcategory > Sub-subcategory. Allow drill-down navigation.

Search Ranking

Learning to Rank (LTR)

Pointwise. Predict relevance score for each document independently. Simple but ignores list structure.
Pairwise. Predict which of two documents is more relevant. RankNet, LambdaRank. Better ranking quality.
Listwise. Optimize the entire ranked list directly. LambdaMART (gradient-boosted trees on lambda gradients). State of the art for feature-based ranking.
Features. BM25 scores, click-through rates, document freshness, PageRank, field-specific scores, user signals.

Re-Ranking

Two-stage pipeline. First stage retrieves candidates (fast, broad). Second stage re-ranks top-k (slow, precise).
Cross-encoder re-ranking. Feed (query, document) pairs through a transformer. Much more accurate than bi-encoder similarity.
Re-rank top 100-1000 candidates. Beyond that, diminishing returns and increasing latency.

Evaluation Metrics

Precision and Recall

Precision@k. Fraction of top-k results that are relevant. Measures result quality.
Recall@k. Fraction of all relevant documents found in top-k. Measures completeness.

Ranked Metrics

NDCG (Normalized Discounted Cumulative Gain). Accounts for graded relevance and position. Higher-ranked relevant results contribute more. The standard offline metric for search.
MRR (Mean Reciprocal Rank). 1/rank of the first relevant result, averaged over queries. Good for navigational queries where one result is sufficient.
MAP (Mean Average Precision). Average of precision@k for each relevant document position, averaged over queries. Good for recall-oriented tasks.

Online Metrics

Click-through rate (CTR). Fraction of impressions that receive a click. Position-biased; normalize by position.
Dwell time. Time spent on a clicked result. Longer dwell suggests higher relevance.
Abandonment rate. Fraction of queries with no clicks. High abandonment suggests poor results.

Search Infrastructure

Elasticsearch

Distributed search engine built on Apache Lucene. JSON-based, RESTful API. The most popular open-source search engine.
Sharding. Distribute an index across multiple nodes. More shards = more parallelism but more overhead.
Replicas. Copy shards for fault tolerance and read scalability. At least one replica for production.
Analyzers. Configurable text processing pipeline: character filters, tokenizer, token filters.

Vespa

Real-time serving engine combining search, recommendation, and machine learning. Developed by Yahoo.
Native vector search with HNSW integration. Supports hybrid lexical + vector search natively.
Tensor computation at query time. Run ML models as part of the ranking pipeline.

Anti-Patterns -- What NOT To Do

Do not build search without evaluation metrics. Without NDCG, MRR, or MAP baselines, you cannot tell if changes improve or degrade search quality.
Do not rely solely on vector search. Embedding models miss exact keyword matches, rare terms, and codes. Hybrid search is almost always better.
Do not skip query understanding. Raw user queries are messy. Spell correction, tokenization, and synonym expansion dramatically improve relevance.
Do not over-index. Indexing everything in every way increases storage, slows indexing, and complicates maintenance. Index what you actually query.
Do not ignore latency budgets. Search must be fast. Set a latency SLO (e.g., p99 < 200ms) and design the pipeline to meet it.
Do not use cosine similarity without understanding your embeddings. Some models are trained with dot product; using cosine changes the ranking. Match the similarity function to the model.

Install this skill directly: skilldb add computer-science-fundamentals-skills

Get CLI access →