
NLP Pipeline Design

Designing end-to-end natural language processing pipelines, from text ingestion to structured output



Overview

An NLP pipeline transforms raw text into structured predictions or representations. Modern NLP is dominated by transformer-based models, but effective pipelines still require careful text preprocessing, task framing, and post-processing. The pipeline design must account for language diversity, domain-specific vocabulary, and the tradeoff between pretrained model capability and task-specific fine-tuning.

Use this skill when building text classification, named entity recognition, question answering, summarization, or other NLP systems, or when evaluating whether to use a pretrained model, fine-tune, or prompt.

Core Framework

Pipeline Architecture

Raw Text -> Cleaning -> Tokenization -> Encoding -> Model -> Post-processing -> Output
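The stages above can be sketched as plain functions chained together. This is a toy illustration with stand-ins for each stage (the "model" here is a keyword lookup, not a real classifier); in practice the tokenizer and model would come from the same pretrained checkpoint.

```python
import re

def clean(text: str) -> str:
    # Toy cleaning: strip HTML tags and URLs, normalize whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    # Stand-in for a subword tokenizer such as WordPiece/BPE.
    return re.findall(r"[a-z0-9]+", text.lower())

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Map tokens to ids; unknown tokens fall back to id 0.
    return [vocab.get(t, 0) for t in tokens]

def model(ids: list[int]) -> dict[str, float]:
    # Toy "model": scores based on presence of a positive-word id.
    score = 1.0 if 1 in ids else 0.0
    return {"positive": score, "negative": 1.0 - score}

def postprocess(scores: dict[str, float], threshold: float = 0.5) -> str:
    # Apply a confidence threshold before emitting a label.
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "uncertain"

vocab = {"great": 1, "terrible": 2}
text = "<p>This movie was great! https://example.com</p>"
result = postprocess(model(encode(tokenize(clean(text)), vocab)))
print(result)  # -> positive
```

Each stage has a single, inspectable input/output contract, which is what makes per-stage testing and monitoring possible later.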

Approach Selection

| Approach | When to Use | Data Requirement |
| --- | --- | --- |
| Prompting (zero/few-shot) | Quick prototyping, low data | 0-20 examples |
| Fine-tuning pretrained | Production quality needed | 1k-100k labeled examples |
| Training from scratch | Highly specialized domain | 1M+ examples |
| Classical ML + TF-IDF | Simple tasks, low compute | 100-10k examples |
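For the classical ML + TF-IDF row, the core weighting scheme is simple enough to show from scratch. This is a minimal sketch (in practice you would use a library implementation such as scikit-learn's `TfidfVectorizer`, which adds smoothing and normalization):

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    # tf-idf = (term frequency in doc) * log(N / document frequency).
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    return [
        {t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf}
        for toks in tokenized
        for tf in [Counter(toks)]
    ]

docs = ["cheap pills buy now", "meeting moved to friday",
        "buy cheap watches now", "see you at the friday meeting"]
vecs = tfidf_vectors(docs)
# Terms unique to one document get the highest weight in that document.
print(max(vecs[0], key=vecs[0].get))  # -> pills
```

Terms common across the corpus ("buy", "now") are down-weighted, which is why TF-IDF plus a linear classifier remains a strong low-compute baseline.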

Task Taxonomy

  • Classification: Sentiment, intent, topic, spam detection
  • Token-level: NER, POS tagging, chunking
  • Span extraction: QA, keyphrase extraction
  • Generation: Summarization, translation, dialogue
  • Similarity: Semantic search, duplicate detection, clustering

Process

  1. Define the NLP task precisely: input format, output format, label schema, evaluation metric.
  2. Collect and audit the text corpus: language distribution, average length, domain vocabulary, label distribution.
  3. Design text cleaning: lowercasing (if case-insensitive), Unicode normalization, HTML/URL removal, language detection.
  4. Select tokenization strategy: WordPiece/BPE for transformers, domain-specific tokenizer if needed.
  5. Choose the base model: task complexity and data size determine whether to prompt, fine-tune, or train.
  6. Implement the model with appropriate head: classification head, token classification head, or seq2seq.
  7. Design the training loop: learning rate (2e-5 to 5e-5 for fine-tuning), warmup steps, early stopping on validation metric.
  8. Build post-processing: confidence thresholds, entity merging, output formatting.
  9. Evaluate on held-out test set with task-appropriate metrics (F1 for NER, accuracy for classification, ROUGE for summarization).
  10. Deploy with monitoring for input drift, prediction distribution shifts, and latency.
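Step 8's entity merging is a common source of subtle bugs, so it is worth showing concretely. A sketch of merging BIO-tagged tokens into entity spans (assuming the standard `B-`/`I-`/`O` tag scheme; a production system would also handle subword pieces and tag-scheme violations):

```python
def merge_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(token)  # continue the current entity
        else:
            if current:  # O tag or inconsistent I- tag ends the entity
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["Ada", "Lovelace", "worked", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(merge_bio(tokens, tags))  # -> [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```

Note the guard `label == tag[2:]`: an `I-LOC` directly after a `B-PER` is treated as closing the person entity rather than extending it, which is the usual defensive choice for noisy model output.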

Key Principles

  • Pretrained transformers (BERT, RoBERTa, DeBERTa) are the default starting point for most NLP tasks.
  • Text cleaning should be minimal for transformer models; they handle noise better than classical methods.
  • Tokenizer and model must match; never use a tokenizer from a different model family.
  • For multilingual tasks, use multilingual models (XLM-R) rather than translating to English.
  • Label quality matters more than label quantity; 1000 clean examples beat 10000 noisy ones.
  • Long documents require chunking strategies with overlap or hierarchical models.
  • Evaluation must include per-class metrics, not just aggregate scores, to catch systematic failures.
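The chunking-with-overlap principle can be sketched as a sliding window over token ids (parameter names here are illustrative; Hugging Face tokenizers expose similar behavior via their own truncation/stride options):

```python
def chunk_with_overlap(token_ids: list[int], max_len: int = 512,
                       stride: int = 128) -> list[list[int]]:
    """Split a long token sequence into overlapping windows.

    Each window starts (max_len - stride) tokens after the previous
    one, so consecutive chunks share `stride` tokens of context.
    """
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # last window already covers the tail
    return chunks

ids = list(range(1000))
chunks = chunk_with_overlap(ids, max_len=512, stride=128)
print([len(c) for c in chunks])  # -> [512, 512, 232]
```

The overlap matters because a prediction target (an entity, an answer span) can straddle a chunk boundary; downstream you deduplicate predictions that appear in two adjacent windows.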

Common Pitfalls

  • Over-cleaning text and removing signal (e.g., removing punctuation that matters for sentiment).
  • Fine-tuning with a learning rate too high for the pretrained model, causing catastrophic forgetting.
  • Ignoring class imbalance in classification tasks and reporting misleading accuracy.
  • Using BLEU/ROUGE as sole metrics for generation without human evaluation.
  • Failing to handle out-of-vocabulary tokens and special characters in production inputs.
  • Not accounting for maximum sequence length limits and silently truncating important text.
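The class-imbalance pitfall is easy to demonstrate with synthetic labels: a degenerate classifier that always predicts the majority class scores high accuracy while being useless on the minority class.

```python
def per_class_recall(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    # Recall per class: fraction of true instances correctly predicted.
    recall = {}
    for cls in set(y_true):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        total = sum(1 for t in y_true if t == cls)
        recall[cls] = hits / total
    return recall

# 95 negatives, 5 positives; a classifier that always predicts "neg".
y_true = ["neg"] * 95 + ["pos"] * 5
y_pred = ["neg"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # -> 0.95, looks great
print(per_class_recall(y_true, y_pred))  # -> {'neg': 1.0, 'pos': 0.0}
```

This is why the principles above call for per-class metrics: aggregate accuracy of 0.95 hides that the minority class is never detected.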

Output Format

When designing an NLP pipeline:

  1. Task Specification: Input/output format, label schema, success criteria.
  2. Data Summary: Corpus statistics, language coverage, quality assessment.
  3. Architecture Decision: Approach chosen (prompting/fine-tuning/training) with rationale.
  4. Pipeline Diagram: Each stage with expected input/output shapes.
  5. Model Selection: Specific model checkpoint and why.
  6. Evaluation Plan: Metrics, test set composition, baseline comparisons.
  7. Deployment Requirements: Latency, throughput, model size constraints.