# Natural Language Processing Pipelines
Build production NLP systems that extract meaning from text. Covers text preprocessing, tokenization strategies, named entity recognition, sentiment analysis, text classification, and the patterns that turn unstructured text into actionable structured data.
Text is among the most abundant data types an organization holds (support tickets, reviews, emails, contracts, medical records), yet it is also among the hardest to use programmatically. NLP pipelines transform unstructured text into structured, queryable data: extracting entities, classifying intent, measuring sentiment, and summarizing content.
## Pipeline Architecture
```
Raw Text
   │
   ▼
Preprocessing
   │  Clean, normalize, handle encoding
   ▼
Tokenization
   │  Split into tokens (words, subwords, characters)
   ▼
Feature Extraction
   │  Embeddings, TF-IDF, or language model encoding
   ▼
Task-Specific Model
   │  Classification, NER, sentiment, summarization
   ▼
Post-Processing
   │  Entity linking, confidence filtering, formatting
   ▼
Structured Output
```
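The tokenization stage deserves a concrete look, since the choice of strategy (word, subword, character) affects everything downstream. A minimal sketch comparing word-level splitting with subword tokenization, assuming the Hugging Face `transformers` library; `bert-base-uncased` is an illustrative checkpoint, any model with a tokenizer works the same way:

```python
from transformers import AutoTokenizer

# Subword tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization strategies affect downstream accuracy."

# Word-level: simple, but unseen words become out-of-vocabulary problems
word_tokens = text.split()

# Subword-level: rare words decompose into known pieces instead of [UNK],
# e.g. "tokenization" -> ["token", "##ization"]
subword_tokens = tokenizer.tokenize(text)

print(word_tokens)     # ['Tokenization', 'strategies', ...]
print(subword_tokens)  # subword pieces, lowercased by this checkpoint
```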
## Text Preprocessing
```python
import re
import unicodedata


class TextPreprocessor:
    """Clean and normalize text for NLP tasks."""

    def preprocess(self, text: str, config: dict | None = None) -> str:
        config = config or {}

        # 1. Normalize unicode (NFKD decomposes accented characters)
        text = unicodedata.normalize("NFKD", text)

        # 2. Strip encoding artifacts (note: this drops ALL non-ASCII,
        #    so skip this step for multilingual text)
        text = text.encode("ascii", "ignore").decode("ascii")

        # 3. Normalize whitespace
        text = re.sub(r"\s+", " ", text).strip()

        # 4. Replace URLs with a placeholder (optional)
        if config.get("remove_urls", True):
            text = re.sub(r"https?://\S+", "[URL]", text)

        # 5. Replace email addresses with a placeholder (optional)
        if config.get("remove_emails", True):
            text = re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)

        # 6. Lowercase (task-dependent: hurts NER, often helps classification)
        if config.get("lowercase", False):
            text = text.lower()

        return text

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
        """Split long text into overlapping chunks for processing."""
        words = text.split()
        chunks = []
        step = max(1, chunk_size - overlap)  # guard against a non-positive stride
        for i in range(0, len(words), step):
            chunks.append(" ".join(words[i : i + chunk_size]))
        return chunks
```
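A quick usage sketch; the input string and the chunk sizes are illustrative:

```python
pre = TextPreprocessor()

raw = "Contact   support@example.com   or see https://example.com/help now"
print(pre.preprocess(raw))
# -> "Contact [EMAIL] or see [URL] now"

# Long inputs are chunked before hitting a fixed-context model
long_doc = " ".join(["word"] * 1200)
chunks = pre.chunk_text(long_doc, chunk_size=512, overlap=50)
print(len(chunks), len(chunks[0].split()))  # 3 512
```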
## Named Entity Recognition
```python
import spacy

# Production NER pipeline. The transformer-based model requires the
# spacy-transformers package and a one-time model download.
nlp = spacy.load("en_core_web_trf")


def extract_entities(text: str) -> list[dict]:
    """Extract named entities from text."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            # Stock spaCy models don't expose per-entity confidence;
            # this picks it up only if a custom extension registered one.
            "confidence": ent._.confidence if hasattr(ent._, "confidence") else None,
        })
    return entities


# Example:
text = "Apple Inc. announced that CEO Tim Cook will present at WWDC 2024 in Cupertino, California on June 10."
entities = extract_entities(text)
# (abridged; offsets omitted)
# [
#     {"text": "Apple Inc.", "label": "ORG"},
#     {"text": "Tim Cook", "label": "PERSON"},
#     {"text": "WWDC 2024", "label": "EVENT"},
#     {"text": "Cupertino, California", "label": "GPE"},
#     {"text": "June 10", "label": "DATE"},
# ]
```
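For documents longer than the model's context, one pattern is to run NER chunk by chunk and merge the results. A minimal sketch reusing `chunk_text` and `extract_entities` from above; deduplicating on `(text, label)` is a simplifying assumption that drops repeat mentions and their offsets:

```python
def extract_entities_from_document(text: str, preprocessor: TextPreprocessor) -> list[dict]:
    """Run NER chunk-by-chunk and deduplicate entities across chunks."""
    seen = set()
    merged = []
    for chunk in preprocessor.chunk_text(text, chunk_size=400, overlap=50):
        for ent in extract_entities(chunk):
            key = (ent["text"], ent["label"])
            if key not in seen:  # overlap regions would otherwise duplicate
                seen.add(key)
                merged.append(ent)
    return merged
```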
## Text Classification
```python
from transformers import pipeline


class TextClassifier:
    """Zero-shot and fine-tuned text classification."""

    def __init__(self):
        # Zero-shot: no training data needed
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
        )

    def classify_zero_shot(self, text: str, labels: list) -> dict:
        """Classify text without training data."""
        result = self.zero_shot(text, candidate_labels=labels)
        return {
            label: score
            for label, score in zip(result["labels"], result["scores"])
        }

    def classify_support_ticket(self, ticket_text: str) -> dict:
        """Classify a support ticket by category and urgency."""
        categories = self.classify_zero_shot(
            ticket_text,
            labels=["billing", "technical", "account", "feature request"],
        )
        urgency = self.classify_zero_shot(
            ticket_text,
            labels=["urgent", "normal", "low priority"],
        )
        return {
            "category": max(categories, key=categories.get),
            "category_scores": categories,
            "urgency": max(urgency, key=urgency.get),
            "urgency_scores": urgency,
        }


# Example:
# Input:  "My payment was charged twice and I need a refund immediately"
# Output: {"category": "billing", "urgency": "urgent", ...}
```
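Running the classifier, plus the sentiment analysis the pipeline diagram mentions. A minimal sketch: `pipeline("sentiment-analysis")` pulls a default English model when no model is named, and the exact scores will vary:

```python
classifier = TextClassifier()
result = classifier.classify_support_ticket(
    "My payment was charged twice and I need a refund immediately"
)
print(result["category"], result["urgency"])  # e.g. "billing" "urgent"

# Sentiment: the default transformers pipeline returns a label and a score
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release fixed everything I complained about!"))
# e.g. [{"label": "POSITIVE", "score": 0.99...}]
```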
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No preprocessing | Dirty text = noisy predictions | Clean, normalize, handle edge cases |
| One model for all languages | Poor non-English performance | Multilingual models or per-language models |
| Ignore confidence scores | Low-confidence predictions treated as fact | Filter by confidence, route uncertain cases to human review (see the sketch below) |
| No domain adaptation | Generic model misses domain terms | Fine-tune on domain-specific data |
| Process entire documents | Context window exceeded, slow | Chunking with overlap for long texts |
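The confidence-filtering fix from the table, as a sketch. The 0.7 threshold and the review-queue routing are assumptions to tune per task:

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff; tune on a validation set


def route_prediction(label: str, score: float) -> dict:
    """Auto-accept confident predictions; flag the rest for human review."""
    if score >= CONFIDENCE_THRESHOLD:
        return {"label": label, "status": "auto"}
    return {"label": label, "status": "needs_review"}


# Applied to zero-shot output: take the top label and its score
scores = {"billing": 0.55, "technical": 0.30, "account": 0.15}  # illustrative
top = max(scores, key=scores.get)
print(route_prediction(top, scores[top]))
# {'label': 'billing', 'status': 'needs_review'}
```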
NLP pipelines are only as good as their preprocessing and their training data. A clean pipeline with a simple model beats a complex model on dirty data every time.