# Natural Language Processing Pipelines
Build production NLP systems that extract meaning from text. Covers text preprocessing, tokenization strategies, named entity recognition, sentiment analysis, text classification, and the patterns that turn unstructured text into actionable structured data.
Text is among the most abundant data types an organization holds (support tickets, reviews, emails, contracts, medical records), yet it is also among the hardest to use programmatically. NLP pipelines transform unstructured text into structured, queryable data: extracting entities, classifying intent, measuring sentiment, and summarizing content.
## Pipeline Architecture
```
Raw Text
   │
   ▼
Preprocessing
   │  Clean, normalize, handle encoding
   ▼
Tokenization
   │  Split into tokens (words, subwords, characters)
   ▼
Feature Extraction
   │  Embeddings, TF-IDF, or language model encoding
   ▼
Task-Specific Model
   │  Classification, NER, sentiment, summarization
   ▼
Post-Processing
   │  Entity linking, confidence filtering, formatting
   ▼
Structured Output
```
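The tokenization stage deserves a concrete look, since the choice of strategy (word, subword, character) affects everything downstream. A minimal sketch comparing word-level splitting with subword tokenization, assuming the Hugging Face `transformers` library; `bert-base-uncased` is an illustrative checkpoint, any model with a tokenizer works the same way:

```python
from transformers import AutoTokenizer

# Subword tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization strategies affect downstream accuracy."

# Word-level: simple, but unseen words become out-of-vocabulary problems
word_tokens = text.split()

# Subword-level: rare words decompose into known pieces instead of [UNK],
# e.g. "tokenization" -> ["token", "##ization"]
subword_tokens = tokenizer.tokenize(text)

print(word_tokens)     # ['Tokenization', 'strategies', ...]
print(subword_tokens)  # subword pieces, lowercased by this checkpoint
```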
## Text Preprocessing
```python
import re
import unicodedata


class TextPreprocessor:
    """Clean and normalize text for NLP tasks."""

    def preprocess(self, text: str, config: dict | None = None) -> str:
        config = config or {}

        # 1. Normalize unicode (NFKD decomposes accented characters)
        text = unicodedata.normalize("NFKD", text)

        # 2. Strip encoding artifacts (note: this drops ALL non-ASCII,
        #    so skip this step for multilingual text)
        text = text.encode("ascii", "ignore").decode("ascii")

        # 3. Normalize whitespace
        text = re.sub(r"\s+", " ", text).strip()

        # 4. Replace URLs with a placeholder (optional)
        if config.get("remove_urls", True):
            text = re.sub(r"https?://\S+", "[URL]", text)

        # 5. Replace email addresses with a placeholder (optional)
        if config.get("remove_emails", True):
            text = re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)

        # 6. Lowercase (task-dependent: hurts NER, often helps classification)
        if config.get("lowercase", False):
            text = text.lower()

        return text

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
        """Split long text into overlapping chunks for processing."""
        words = text.split()
        chunks = []
        step = max(1, chunk_size - overlap)  # guard against a non-positive stride
        for i in range(0, len(words), step):
            chunks.append(" ".join(words[i : i + chunk_size]))
        return chunks
```
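A quick usage sketch; the input string and the chunk sizes are illustrative:

```python
pre = TextPreprocessor()

raw = "Contact   support@example.com   or see https://example.com/help now"
print(pre.preprocess(raw))
# -> "Contact [EMAIL] or see [URL] now"

# Long inputs are chunked before hitting a fixed-context model
long_doc = " ".join(["word"] * 1200)
chunks = pre.chunk_text(long_doc, chunk_size=512, overlap=50)
print(len(chunks), len(chunks[0].split()))  # 3 512
```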
## Named Entity Recognition
```python
import spacy

# Production NER pipeline. The transformer-based model requires the
# spacy-transformers package and a one-time model download.
nlp = spacy.load("en_core_web_trf")


def extract_entities(text: str) -> list[dict]:
    """Extract named entities from text."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            # Stock spaCy models don't expose per-entity confidence;
            # this picks it up only if a custom extension registered one.
            "confidence": ent._.confidence if hasattr(ent._, "confidence") else None,
        })
    return entities


# Example:
text = "Apple Inc. announced that CEO Tim Cook will present at WWDC 2024 in Cupertino, California on June 10."
entities = extract_entities(text)
# (abridged; offsets omitted)
# [
#     {"text": "Apple Inc.", "label": "ORG"},
#     {"text": "Tim Cook", "label": "PERSON"},
#     {"text": "WWDC 2024", "label": "EVENT"},
#     {"text": "Cupertino, California", "label": "GPE"},
#     {"text": "June 10", "label": "DATE"},
# ]
```
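For documents longer than the model's context, one pattern is to run NER chunk by chunk and merge the results. A minimal sketch reusing `chunk_text` and `extract_entities` from above; deduplicating on `(text, label)` is a simplifying assumption that drops repeat mentions and their offsets:

```python
def extract_entities_from_document(text: str, preprocessor: TextPreprocessor) -> list[dict]:
    """Run NER chunk-by-chunk and deduplicate entities across chunks."""
    seen = set()
    merged = []
    for chunk in preprocessor.chunk_text(text, chunk_size=400, overlap=50):
        for ent in extract_entities(chunk):
            key = (ent["text"], ent["label"])
            if key not in seen:  # overlap regions would otherwise duplicate
                seen.add(key)
                merged.append(ent)
    return merged
```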
## Text Classification
```python
from transformers import pipeline


class TextClassifier:
    """Zero-shot and fine-tuned text classification."""

    def __init__(self):
        # Zero-shot: no training data needed
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
        )

    def classify_zero_shot(self, text: str, labels: list) -> dict:
        """Classify text without training data."""
        result = self.zero_shot(text, candidate_labels=labels)
        return {
            label: score
            for label, score in zip(result["labels"], result["scores"])
        }

    def classify_support_ticket(self, ticket_text: str) -> dict:
        """Classify a support ticket by category and urgency."""
        categories = self.classify_zero_shot(
            ticket_text,
            labels=["billing", "technical", "account", "feature request"],
        )
        urgency = self.classify_zero_shot(
            ticket_text,
            labels=["urgent", "normal", "low priority"],
        )
        return {
            "category": max(categories, key=categories.get),
            "category_scores": categories,
            "urgency": max(urgency, key=urgency.get),
            "urgency_scores": urgency,
        }


# Example:
# Input:  "My payment was charged twice and I need a refund immediately"
# Output: {"category": "billing", "urgency": "urgent", ...}
```
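Running the classifier, plus the sentiment analysis the pipeline diagram mentions. A minimal sketch: `pipeline("sentiment-analysis")` pulls a default English model when no model is named, and the exact scores will vary:

```python
classifier = TextClassifier()
result = classifier.classify_support_ticket(
    "My payment was charged twice and I need a refund immediately"
)
print(result["category"], result["urgency"])  # e.g. "billing" "urgent"

# Sentiment: the default transformers pipeline returns a label and a score
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release fixed everything I complained about!"))
# e.g. [{"label": "POSITIVE", "score": 0.99...}]
```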
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No preprocessing | Dirty text = noisy predictions | Clean, normalize, handle edge cases |
| One model for all languages | Poor non-English performance | Multilingual models or per-language models |
| Ignore confidence scores | Low-confidence predictions treated as fact | Filter by confidence, route uncertain cases to human review (see the sketch below) |
| No domain adaptation | Generic model misses domain terms | Fine-tune on domain-specific data |
| Process entire documents | Context window exceeded, slow | Chunking with overlap for long texts |
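The confidence-filtering fix from the table, as a sketch. The 0.7 threshold and the review-queue routing are assumptions to tune per task:

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff; tune on a validation set


def route_prediction(label: str, score: float) -> dict:
    """Auto-accept confident predictions; flag the rest for human review."""
    if score >= CONFIDENCE_THRESHOLD:
        return {"label": label, "status": "auto"}
    return {"label": label, "status": "needs_review"}


# Applied to zero-shot output: take the top label and its score
scores = {"billing": 0.55, "technical": 0.30, "account": 0.15}  # illustrative
top = max(scores, key=scores.get)
print(route_prediction(top, scores[top]))
# {'label': 'billing', 'status': 'needs_review'}
```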
NLP pipelines are only as good as their preprocessing and their training data. A clean pipeline with a simple model beats a complex model on dirty data every time.