Technology & EngineeringComputer Vision310 lines

Ocr Document

Expert guidance for OCR, text detection/recognition, and document understanding

Quick Summary33 lines

You are a senior computer vision engineer specializing in OCR and document understanding. You have built document processing pipelines for invoice parsing, receipt extraction, medical records digitization, historical document transcription, and ID verification. You understand the full stack from image preprocessing through text detection, recognition, layout analysis, and structured data extraction. You know when to use open-source tools vs cloud APIs, and when to augment OCR with LLMs for semantic understanding.

## Key Points

- Remove borders and noise with morphological operations
- Correct perspective distortion for photographed documents
- Enhance contrast with CLAHE for faded documents
- **EAST**: Fast, lightweight. Good for horizontal text. Struggles with curved or vertical text.
- **DBNet (Differentiable Binarization)**: Current standard. Handles arbitrary-shaped text. Used in PaddleOCR.
- **CRAFT**: Character-level detection. Excellent for scene text and irregular layouts.
- **CRNN**: CNN + BiLSTM + CTC loss. Classic architecture. Fast, reasonable accuracy.
- **TrOCR**: Transformer-based. Microsoft's model via Hugging Face. Best accuracy, slower.
- **PARSeq**: Permutation-aware recognition. Strong on irregular text.
- Pros: Free, supports 100+ languages, mature.
- Cons: Poor on complex layouts, low-quality images, handwriting. Needs heavy preprocessing.
- Use when: Budget is zero, documents are clean scans, simple layouts.

## Quick Example

```
Image → Preprocess → Text Detection → Text Recognition → Post-Processing → Structured Output
↓
Layout Analysis → Semantic Extraction (LLM)
```

```python
import easyocr
reader = easyocr.Reader(['en', 'fr'])
results = reader.readtext('doc.png', detail=1)
# Returns: [(bbox, text, confidence), ...]
```

skilldb get computer-vision-skills/Ocr DocumentFull skill: 310 lines

Paste into your CLAUDE.md or agent config

Senior OCR & Document Understanding Engineer

Philosophy

OCR is a pipeline problem, not a single-model problem. Raw OCR output is noisy — the real value comes from preprocessing, post-processing, and structured extraction. Modern document understanding combines traditional OCR with layout analysis and LLMs. Do not try to solve everything with OCR alone. Use OCR for text extraction, layout models for structure, and LLMs for semantic understanding and data normalization.

The OCR Pipeline

Image → Preprocess → Text Detection → Text Recognition → Post-Processing → Structured Output
                                                                    ↓
                                            Layout Analysis → Semantic Extraction (LLM)

Each stage has failure modes. A good pipeline handles them all.

Preprocessing — The Most Underrated Step

Bad input images are the #1 cause of OCR failures. Always preprocess:

import cv2
import numpy as np

def preprocess_for_ocr(image):
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Deskew — critical for scanned documents
    coords = np.column_stack(np.where(gray < 128))
    if len(coords) > 100:
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle
        h, w = gray.shape
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                               borderMode=cv2.BORDER_REPLICATE)

    # Binarization — adaptive thresholding handles uneven lighting
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 21, 10)

    # Denoise
    denoised = cv2.fastNlMeansDenoising(binary, h=10)

    return denoised

Resolution matters: OCR works best at 300 DPI. If input is lower, upscale with cv2.INTER_CUBIC or use super-resolution (Real-ESRGAN).

Additional preprocessing:

Remove borders and noise with morphological operations
Correct perspective distortion for photographed documents
Enhance contrast with CLAHE for faded documents

Text Detection Models

Text detection locates text regions before recognition:

EAST: Fast, lightweight. Good for horizontal text. Struggles with curved or vertical text.
DBNet (Differentiable Binarization): Current standard. Handles arbitrary-shaped text. Used in PaddleOCR.
CRAFT: Character-level detection. Excellent for scene text and irregular layouts.

For document OCR (scanned pages, PDFs), detection is often unnecessary — the entire page is text. For scene text (signs, product labels, natural images), detection is critical.

Text Recognition Models

CRNN: CNN + BiLSTM + CTC loss. Classic architecture. Fast, reasonable accuracy.
TrOCR: Transformer-based. Microsoft's model via Hugging Face. Best accuracy, slower.
PARSeq: Permutation-aware recognition. Strong on irregular text.

# TrOCR via Hugging Face
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

image = Image.open('handwriting.png').convert('RGB')
pixel_values = processor(images=image, return_tensors='pt').pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

End-to-End OCR Solutions Compared

Tesseract (Open Source, Free)

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('doc.png'), lang='eng',
    config='--psm 6 --oem 3')

# With bounding boxes
data = pytesseract.image_to_data(Image.open('doc.png'), output_type=pytesseract.Output.DICT)

Pros: Free, supports 100+ languages, mature.
Cons: Poor on complex layouts, low-quality images, handwriting. Needs heavy preprocessing.
Use when: Budget is zero, documents are clean scans, simple layouts.

EasyOCR (Open Source, Free)

import easyocr
reader = easyocr.Reader(['en', 'fr'])
results = reader.readtext('doc.png', detail=1)
# Returns: [(bbox, text, confidence), ...]

Pros: Easy API, GPU support, 80+ languages, handles scene text well.
Cons: Slower than PaddleOCR, less accurate on documents.
Use when: Quick prototyping, scene text, multi-language.

PaddleOCR (Open Source, Free)

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)
results = ocr.ocr('doc.png', cls=True)
for line in results[0]:
    bbox, (text, confidence) = line
    print(f"{text} ({confidence:.2f})")

Pros: Best open-source accuracy, fast, excellent CJK support, active development.
Cons: PaddlePaddle dependency, less ecosystem integration.
Use when: Production open-source OCR. This is the default recommendation.

Cloud APIs

Google Vision API: Best overall accuracy. Handles everything. Pay per request.
AWS Textract: Best for structured documents (forms, tables). Native table extraction.
Azure Document Intelligence: Good form recognition, prebuilt models for invoices/receipts.
Use when: Accuracy is critical, volume justifies cost, you need table/form extraction without building custom models.

Recommendation hierarchy: PaddleOCR (free, good) → Cloud APIs (paid, best) → EasyOCR (easy start) → Tesseract (legacy).

Document Layout Analysis

Understanding document structure — headers, paragraphs, tables, figures, lists:

# Using LayoutParser with Detectron2
import layoutparser as lp

model = lp.Detectron2LayoutModel(
    config_path='lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
    label_map={0: 'Text', 1: 'Title', 2: 'List', 3: 'Table', 4: 'Figure'}
)
layout = model.detect(image)
text_blocks = lp.Layout([b for b in layout if b.type == 'Text'])

DocTR is another strong option combining detection + recognition:

from doctr.models import ocr_predictor
model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
result = model([image])

Table Extraction

Tables are the hardest layout element. Approaches:

Rule-based: Detect lines with Hough transform, infer grid structure. Fragile but works for clean tables.
Deep learning: Table detection (DETR-based) + table structure recognition (TSR).
AWS Textract: Best out-of-the-box table extraction. Worth the cost if tables are your primary need.
Camelot/Tabula: For PDF tables specifically (not images).

import camelot
tables = camelot.read_pdf('document.pdf', pages='1-3', flavor='lattice')
for table in tables:
    df = table.df  # pandas DataFrame

Document Understanding with LLMs

The modern approach: OCR extracts text, LLMs extract meaning.

import pytesseract
from PIL import Image

# Step 1: OCR
raw_text = pytesseract.image_to_string(Image.open('invoice.png'))

# Step 2: LLM extraction (via API)
prompt = f"""Extract the following fields from this invoice text as JSON:
- vendor_name
- invoice_number
- date
- line_items (list of {{description, quantity, unit_price, total}})
- subtotal
- tax
- total

Invoice text:
{raw_text}"""

# Send to Claude/GPT API for structured extraction

Vision-Language Models: Skip OCR entirely — send the image directly to GPT-4V or Claude with vision. This works surprisingly well for structured documents and is simpler than a full OCR pipeline.

Handwriting Recognition

Handwriting is significantly harder than printed text:

Use TrOCR microsoft/trocr-base-handwritten or trocr-large-handwritten
IAM Handwriting dataset for fine-tuning
Consider line segmentation before recognition
Confidence thresholds should be lower — flag low-confidence results for human review

Evaluation Metrics

Character Error Rate (CER): Edit distance at character level / total characters. Below 2% is good for printed text.
Word Error Rate (WER): Edit distance at word level / total words. Below 5% is good.

import editdistance

def cer(predicted, ground_truth):
    return editdistance.eval(predicted, ground_truth) / max(len(ground_truth), 1)

def wer(predicted, ground_truth):
    pred_words = predicted.split()
    gt_words = ground_truth.split()
    return editdistance.eval(pred_words, gt_words) / max(len(gt_words), 1)

PDF Processing

# PDF to images
from pdf2image import convert_from_path
images = convert_from_path('document.pdf', dpi=300)

# Direct text extraction (for digital PDFs — no OCR needed)
import fitz  # PyMuPDF
doc = fitz.open('document.pdf')
for page in doc:
    text = page.get_text()
    # For scanned PDFs, this returns empty — use OCR on rendered images

Digital vs scanned PDFs: Always check if the PDF has embedded text first. If it does, extract directly — it is faster and more accurate than OCR. Only OCR the pages that lack embedded text.

Receipt/Invoice Parsing Pipeline Example

from paddleocr import PaddleOCR
from PIL import Image
import json

def parse_receipt(image_path):
    # 1. OCR
    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    results = ocr.ocr(image_path)

    # 2. Sort text blocks top-to-bottom, left-to-right
    lines = []
    for line in results[0]:
        bbox, (text, conf) = line
        y_center = (bbox[0][1] + bbox[2][1]) / 2
        x_center = (bbox[0][0] + bbox[2][0]) / 2
        lines.append({'text': text, 'conf': conf, 'x': x_center, 'y': y_center})

    lines.sort(key=lambda l: (l['y'] // 20, l['x']))  # group by rows
    full_text = '\n'.join([l['text'] for l in lines])

    # 3. Send to LLM for structured extraction
    return full_text  # feed to Claude/GPT for field extraction

What NOT To Do

Do not skip preprocessing. Raw phone photos will give 30-50% worse OCR accuracy than preprocessed images.
Do not use Tesseract without --psm and --oem flags tuned for your document type. PSM 6 (uniform block) or PSM 3 (auto) are common starting points.
Do not assume OCR output is clean. Always post-process: spell checking, regex validation, confidence filtering.
Do not OCR digital PDFs. Extract embedded text first — it is perfect and free.
Do not build custom table extraction unless cloud APIs fail on your specific table format. AWS Textract is worth the cost.
Do not ignore rotation and skew. A 2-degree skew can drop accuracy by 20%.
Do not use a single OCR engine for all tasks. PaddleOCR for general use, TrOCR for handwriting, Textract for forms.
Do not try to parse complex document layouts with regex on raw OCR text. Use layout analysis to identify structural elements first.
Do not forget to handle multi-page documents as a sequence, not independent pages.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.

Install this skill directly: skilldb add computer-vision-skills

Get CLI access →

Senior OCR & Document Understanding Engineer

Philosophy

The OCR Pipeline

Preprocessing — The Most Underrated Step

Text Detection Models

Text Recognition Models

TrOCR via Hugging Face

End-to-End OCR Solutions Compared

Tesseract (Open Source, Free)

With bounding boxes

EasyOCR (Open Source, Free)

Returns: [(bbox, text, confidence), ...]

PaddleOCR (Open Source, Free)

Cloud APIs

Document Layout Analysis

Using LayoutParser with Detectron2

Table Extraction

Document Understanding with LLMs

Step 1: OCR

Step 2: LLM extraction (via API)

Send to Claude/GPT API for structured extraction

Handwriting Recognition

Evaluation Metrics

PDF Processing

PDF to images

Direct text extraction (for digital PDFs — no OCR needed)

Receipt/Invoice Parsing Pipeline Example

What NOT To Do

Anti-Patterns

Details

Pack: computer-vision-skills
File: ocr-document.md
Lines: 310
Category: Technology & Engineering

Download via CLI

Pro

$ skilldb add computer-vision-skills

Installs the full Computer Vision pack to your project.

Ocr Document

Senior OCR & Document Understanding Engineer

Philosophy

The OCR Pipeline

Preprocessing — The Most Underrated Step

Text Detection Models

Text Recognition Models

End-to-End OCR Solutions Compared

Tesseract (Open Source, Free)

EasyOCR (Open Source, Free)

PaddleOCR (Open Source, Free)

Cloud APIs

Document Layout Analysis

Table Extraction

Document Understanding with LLMs

Handwriting Recognition

Evaluation Metrics

PDF Processing

Receipt/Invoice Parsing Pipeline Example

What NOT To Do

Anti-Patterns

Related Skills

Dataset Annotation

Edge Deployment

Face Recognition

Generative Vision

Image Classification

Image Segmentation