
Natural Language Processing Pipelines

Build production NLP systems that extract meaning from text. Covers text preprocessing, tokenization strategies, named entity recognition, sentiment analysis, text classification, and the patterns that turn unstructured text into actionable structured data.

Text is among the most abundant data types in most organizations — support tickets, reviews, emails, contracts, medical records — yet also one of the hardest to use programmatically. NLP pipelines transform unstructured text into structured, queryable data: extracting entities, classifying intent, measuring sentiment, and summarizing content.


Pipeline Architecture

Raw Text
   │
   ▼
Preprocessing
   │  Clean, normalize, handle encoding
   ▼
Tokenization
   │  Split into tokens (words, subwords, characters)
   ▼
Feature Extraction
   │  Embeddings, TF-IDF, or language model encoding
   ▼
Task-Specific Model
   │  Classification, NER, sentiment, summarization
   ▼
Post-Processing
   │  Entity linking, confidence filtering, formatting
   ▼
Structured Output
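
Wired together, the stages can stay loosely coupled. Below is a minimal sketch of one way to compose them; the NLPPipeline class and its 0.5 confidence cutoff are illustrative choices, not a prescribed API, and the components it calls are defined in the sections that follow.

class NLPPipeline:
    """Illustrative composition of the pipeline stages above."""

    def __init__(self, preprocessor, extract_entities, classifier):
        self.preprocessor = preprocessor          # TextPreprocessor (below)
        self.extract_entities = extract_entities  # NER function (below)
        self.classifier = classifier              # TextClassifier (below)

    def run(self, raw_text: str, labels: list) -> dict:
        # Preprocessing: clean and normalize the raw input
        clean = self.preprocessor.preprocess(raw_text)

        # Task-specific models: NER plus zero-shot classification
        entities = self.extract_entities(clean)
        scores = self.classifier.classify_zero_shot(clean, labels)

        # Post-processing: drop low-confidence entities
        # (0.5 is an arbitrary example threshold)
        entities = [
            e for e in entities
            if e["confidence"] is None or e["confidence"] >= 0.5
        ]

        # Structured output
        return {"text": clean, "entities": entities, "labels": scores}

# Usage, once the components below are defined:
# pipeline = NLPPipeline(TextPreprocessor(), extract_entities, TextClassifier())
# pipeline.run(raw_ticket, labels=["billing", "technical", "account"])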

Text Preprocessing

import re
import unicodedata

class TextPreprocessor:
    """Clean and normalize text for NLP tasks."""
    
    def preprocess(self, text: str, config: dict | None = None) -> str:
        config = config or {}
        
        # 1. Normalize unicode
        text = unicodedata.normalize("NFKD", text)
        
        # 2. Optionally strip non-ASCII characters. This destroys
        #    legitimate non-English text, so it is off by default.
        if config.get("ascii_only", False):
            text = text.encode("ascii", "ignore").decode("ascii")
        
        # 3. Normalize whitespace
        text = re.sub(r"\s+", " ", text).strip()
        
        # 4. Replace URLs with a placeholder (optional)
        if config.get("remove_urls", True):
            text = re.sub(r"https?://\S+", "[URL]", text)
        
        # 5. Replace email addresses with a placeholder (optional)
        if config.get("remove_emails", True):
            text = re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)
        
        # 6. Lowercase (task-dependent)
        if config.get("lowercase", False):
            text = text.lower()
        
        return text
    
    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50):
        """Split long text into overlapping chunks for processing.

        Note: chunk_size counts whitespace-separated words, not model
        tokens, so leave headroom for the tokenizer's expansion.
        """
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")

        words = text.split()
        chunks = []

        # Step by (chunk_size - overlap) so consecutive chunks share
        # `overlap` words of context
        for i in range(0, len(words), chunk_size - overlap):
            chunks.append(" ".join(words[i:i + chunk_size]))

        return chunks
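
Tokenization and Feature Extraction

The architecture above lists tokenization and feature extraction as their own stages, so they deserve a quick look. Modern models use subword tokenization: rare words are split into known pieces so the vocabulary stays small without losing information. A minimal sketch using Hugging Face's AutoTokenizer ("bert-base-uncased" is just an example checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Tokenization handles unseen words gracefully")
# e.g. ["token", "##ization", ...] (exact pieces depend on the vocabulary)

ids = tokenizer.encode("Tokenization handles unseen words gracefully")
# Integer token IDs, with [CLS]/[SEP] special tokens added

For classical (non-neural) baselines, TF-IDF remains a strong feature extractor. A sketch with scikit-learn; the parameter values are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first preprocessed document", "second preprocessed document"]

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
# X is a sparse (n_docs, n_features) matrix, ready for any sklearn classifier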

Named Entity Recognition

import spacy

# Production NER pipeline (transformer-based; the model must be
# downloaded first: python -m spacy download en_core_web_trf)
nlp = spacy.load("en_core_web_trf")

def extract_entities(text: str):
    """Extract named entities from text."""
    doc = nlp(text)
    
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            "confidence": ent._.confidence if hasattr(ent._, "confidence") else None,
        })
    
    return entities

# Example:
text = "Apple Inc. announced that CEO Tim Cook will present at WWDC 2024 in Cupertino, California on June 10."

entities = extract_entities(text)
# [
#   {"text": "Apple Inc.", "label": "ORG"},
#   {"text": "Tim Cook", "label": "PERSON"},
#   {"text": "WWDC 2024", "label": "EVENT"},
#   {"text": "Cupertino, California", "label": "GPE"},
#   {"text": "June 10", "label": "DATE"},
# ]
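
The diagram's post-processing stage turns raw entity spans into the final structured record. A minimal sketch of confidence filtering and formatting; filter_and_format and its 0.7 threshold are illustrative, not part of spaCy:

def filter_and_format(entities: list, min_confidence: float = 0.7) -> dict:
    """Group entity texts by label, dropping low-confidence predictions."""
    grouped = {}
    for ent in entities:
        conf = ent.get("confidence")
        # Keep entities with no score (the model exposed none) or a score
        # above threshold; anything below belongs in a human-review queue
        if conf is not None and conf < min_confidence:
            continue
        grouped.setdefault(ent["label"], []).append(ent["text"])
    return grouped

# filter_and_format(entities)
# {"ORG": ["Apple Inc."], "PERSON": ["Tim Cook"], "EVENT": ["WWDC 2024"], ...}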

Text Classification

from transformers import pipeline

class TextClassifier:
    """Zero-shot and fine-tuned text classification."""
    
    def __init__(self):
        # Zero-shot: No training data needed
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
        )
        
    def classify_zero_shot(self, text: str, labels: list):
        """Classify text without training data."""
        result = self.zero_shot(text, candidate_labels=labels)
        
        return {
            label: score 
            for label, score in zip(result["labels"], result["scores"])
        }
    
    def classify_support_ticket(self, ticket_text: str):
        """Classify a support ticket by category and urgency."""
        categories = self.classify_zero_shot(
            ticket_text,
            labels=["billing", "technical", "account", "feature request"],
        )
        
        urgency = self.classify_zero_shot(
            ticket_text,
            labels=["urgent", "normal", "low priority"],
        )
        
        return {
            "category": max(categories, key=categories.get),
            "category_scores": categories,
            "urgency": max(urgency, key=urgency.get),
            "urgency_scores": urgency,
        }

# Example:
# Input: "My payment was charged twice and I need a refund immediately"
# Output: {"category": "billing", "urgency": "urgent"}
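
Sentiment Analysis

Sentiment analysis, mentioned in the overview, follows the same pattern as classification. A minimal sketch using the transformers sentiment-analysis pipeline; the default checkpoint is an English-only example, so pin a specific model in production:

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # pass model=... to pin a checkpoint

def score_reviews(reviews: list) -> list:
    """Attach a sentiment label and score to each review."""
    results = sentiment(reviews)
    return [
        {"text": text, "label": r["label"], "score": round(r["score"], 3)}
        for text, r in zip(reviews, results)
    ]

# score_reviews(["Great product, fast shipping!", "Broke after two days."])
# [{"label": "POSITIVE", ...}, {"label": "NEGATIVE", ...}]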

Anti-Patterns

Anti-Pattern                 Consequence                                  Fix
No preprocessing             Dirty text = noisy predictions               Clean, normalize, handle edge cases
One model for all languages  Poor non-English performance                 Multilingual or per-language models (see sketch below)
Ignoring confidence scores   Low-confidence predictions treated as fact   Filter by confidence; human review for uncertain cases
No domain adaptation         Generic model misses domain terms            Fine-tune on domain-specific data
Processing entire documents  Context window exceeded, slow                Chunking with overlap for long texts
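
For the multilingual row above, routing by detected language is a common fix. A minimal sketch assuming the langdetect package; the PIPELINES registry and its entries are hypothetical placeholders for real per-language pipelines:

from langdetect import detect  # pip install langdetect

# Hypothetical registry: map language codes to per-language pipelines
PIPELINES = {
    "en": lambda text: {"lang": "en", "result": "...english pipeline..."},
    "de": lambda text: {"lang": "de", "result": "...german pipeline..."},
}

def route_by_language(text: str, default: str = "en"):
    """Send text to the pipeline trained for its detected language."""
    try:
        lang = detect(text)
    except Exception:
        lang = default  # detection fails on very short or empty input
    handler = PIPELINES.get(lang, PIPELINES[default])
    return handler(text)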

NLP pipelines are only as good as their preprocessing and their training data. A clean pipeline with a simple model beats a complex model on dirty data every time.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
