
Natural Language Processing

Production NLP systems that turn unstructured text — contracts, tickets, emails, reviews, clinical notes — into structured, searchable, decision-ready data.

An example of entity extraction: in the sentence "Apple Inc. announced that Tim Cook will present the keynote in San Francisco on March 15," the system tags Apple Inc. as an Organization, Tim Cook as a Person, San Francisco as a Location, and March 15 as a Date.

What is natural language processing?

Natural language processing (NLP) is the set of techniques that lets software read, classify, extract from, summarize, and generate human language. In a business context, NLP is what turns the 80% of your data that lives in free-form text — contracts, tickets, emails, transcripts, reviews, notes — into structured fields a system can act on.

Modern NLP is no longer a single technique. It's a stack: classical methods (regex, spaCy, scikit-learn) for the high-volume deterministic layer, transformer models (BERT, RoBERTa, DistilBERT) for fine-tuned classification and extraction, and large language models (GPT-4, Claude, Llama) for flexible reasoning and zero-shot tasks. A good NLP system uses all three, picking the right tool for each step in the pipeline.

Key terms used on this page:

  • NER (named entity recognition): Identifying spans of text as people, organizations, dates, monetary amounts, drug names, contract clauses, etc.
  • Classification: Assigning a label to a piece of text — sentiment, topic, intent, priority, risk tier.
  • Information extraction: Pulling structured fields (party name, effective date, payment terms) out of unstructured documents.
  • Embedding: A dense numerical vector that captures the meaning of a word, sentence, or document for semantic comparison.
  • Fine-tuning: Adapting a pre-trained model to your domain by continuing training on your labeled examples.
  • Zero-shot / few-shot: Using an LLM to perform a task with only instructions (zero-shot) or a handful of examples in the prompt (few-shot), no training required.
  • Active learning: A loop where the model flags its lowest-confidence predictions for human labeling, feeding the corrections back into training.

How does text classification work in production?

Text classification — sentiment, topic, intent, priority — is the most common NLP problem and the one most often built badly. The naive path is to throw GPT-4 at every ticket and live with a $40,000 monthly bill. The production path looks different:

1. Start with a labeled evaluation set of 200 to 500 examples that represent your real data, including the messy edge cases. Without this, you can't measure whether anything you build actually works.

2. Establish a baseline with a simple model — TF-IDF plus logistic regression, or sentence embeddings plus a linear classifier (see the sketch after this list). This takes a day and tells you the floor.

3. Try an LLM with a good prompt and structured output (OpenAI structured outputs, Anthropic tool use, or Instructor on top). This is usually your ceiling.

4. Decide based on cost, latency, and accuracy. If the LLM is 3 points more accurate but 100x more expensive and 50x slower, fine-tune a smaller model on the LLM's outputs (knowledge distillation). If the LLM is needed only for the 5% of hard cases, use the small model first and route low-confidence predictions to the LLM.
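A minimal sketch of the step-2 baseline, assuming scikit-learn and a hypothetical labeled CSV with "text" and "label" columns; treat it as a starting point, not the production model:

```python
# Baseline: TF-IDF + logistic regression, scored on a held-out split.
# The file name and column names are assumptions for illustration.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("tickets_labeled.csv")  # hypothetical: columns "text", "label"
train, test = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(train["text"], train["label"])
print(classification_report(test["label"], baseline.predict(test["text"])))
```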

The architectures we ship most often: a Hugging Face DistilBERT fine-tuned on 5,000 examples for the bulk of traffic, with GPT-4 or Claude as a fallback for low-confidence or novel inputs, and a feedback loop that captures human corrections.
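A sketch of that routing pattern, assuming the Hugging Face transformers pipeline; the model name, confidence threshold, and LLM fallback are placeholders for your own fine-tuned model and escalation path:

```python
# Confidence-based routing: a small fine-tuned model handles the bulk of
# traffic, and anything below the threshold escalates to an LLM (stubbed here).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in for your fine-tuned model
)

CONFIDENCE_THRESHOLD = 0.85  # assumption: tune on the evaluation set


def llm_fallback(text: str) -> str:
    # Placeholder for a structured-output call to GPT-4 or Claude.
    raise NotImplementedError


def classify(text: str) -> dict:
    pred = classifier(text, truncation=True)[0]  # {"label": ..., "score": ...}
    if pred["score"] >= CONFIDENCE_THRESHOLD:
        return {"label": pred["label"], "source": "distilbert", "score": pred["score"]}
    # Low confidence: route to the LLM and capture the case for review/retraining.
    return {"label": llm_fallback(text), "source": "llm", "score": pred["score"]}
```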

How does named entity recognition work for domain-specific text?

Out-of-the-box NER (spaCy's default model, AWS Comprehend) handles generic entities — people, places, dates, money — well enough. The problem is that almost no business actually wants generic entities. They want contract parties and effective dates, drug names and dosages, ICD-10 codes, ticker symbols, vehicle VINs, or part numbers in a maintenance log.

We approach domain NER in three layers:

  • Rules and gazetteers first. A list of known drug names, product SKUs, or legal clause headings catches the deterministic 60–80% with zero machine learning (see the sketch after this list). Don't skip this — rules are cheap, debuggable, and don't drift.
  • A fine-tuned transformer for the fuzzy middle. SpaCy's transformer pipeline or a Hugging Face token-classification model trained on 1,000 to 5,000 annotated documents picks up the variations rules can't.
  • An LLM for the long tail and edge cases. GPT-4 or Claude with a structured-output schema handles novel entity types or low-confidence spans, with the results captured for the next round of fine-tuning.
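A minimal sketch of the rules-and-gazetteers layer, assuming spaCy's EntityRuler; the drug names and dosage pattern are illustrative, not a clinical vocabulary:

```python
# Layer 1 of the NER stack: deterministic rules and gazetteers with spaCy.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")                 # generic pipeline for base entities
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "DRUG", "pattern": "metformin"},      # gazetteer entries (illustrative)
    {"label": "DRUG", "pattern": "atorvastatin"},
    {"label": "DOSAGE", "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml"]}}]},
])

doc = nlp("Patient was prescribed metformin 500 mg twice daily.")
print([(ent.text, ent.label_) for ent in doc.ents])
```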

For relation extraction — connecting entities (e.g., "Patient X was prescribed Drug Y at Dose Z") — we lean heavily on LLM-based extraction now. The accuracy beats traditional dependency-parsing pipelines and the maintenance burden is lower.
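A sketch of LLM-based relation extraction with a structured-output schema, assuming the Instructor library wrapping the OpenAI client; the schema fields and model name are illustrative:

```python
# Relation extraction via an LLM constrained to a Pydantic schema.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Prescription(BaseModel):
    patient: str
    drug: str
    dose: str


client = instructor.from_openai(OpenAI())

note = "Patient Maria Lopez was prescribed metformin at 500 mg twice daily."
rx = client.chat.completions.create(
    model="gpt-4o",                 # placeholder model name
    response_model=Prescription,    # the call returns a validated Prescription object
    messages=[{"role": "user", "content": f"Extract the prescription relation:\n{note}"}],
)
print(rx)
```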

How do semantic search and embedding-based retrieval work?

Keyword search (Elasticsearch, OpenSearch, Postgres full-text) breaks the moment your users phrase their query differently from how the document was written. Semantic search uses embeddings — dense vector representations of meaning — to retrieve documents based on what they mean, not what words they contain.

The standard architecture (a minimal code sketch follows these steps):

1. Chunk documents into passages of 200 to 800 tokens, with overlap.

2. Embed each chunk with a model like OpenAI text-embedding-3-large, Voyage AI voyage-3, or Cohere Embed v3 (or open-source BGE / E5 if you need self-hosted).

3. Store the vectors in Pinecone, Weaviate, Qdrant, Chroma, Milvus, or Postgres + pgvector.

4. Query by embedding the user's question and retrieving the top-K nearest neighbors.

5. Rerank the top 50 candidates with a cross-encoder (Cohere Rerank, BGE Reranker) to pull the truly relevant 5 to the top.
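A minimal end-to-end sketch of steps 1 through 4, assuming the OpenAI embeddings API and an in-memory store; a production system swaps the array for one of the vector databases above and adds the reranking step:

```python
# Embed a handful of pre-chunked passages, then retrieve by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])


chunks = [
    "The agreement terminates on 31 December 2026.",
    "Either party may terminate with 90 days written notice.",
    "Payment is due net 30 from invoice date.",
]
chunk_vectors = embed(chunks)                      # steps 2-3: embed and "store"

query = "How can the contract be ended early?"
q = embed([query])[0]                              # step 4: embed the question
scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
for i in np.argsort(-scores)[:2]:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```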

For most enterprise search problems, hybrid search — combining BM25 keyword scoring with vector similarity — outperforms either alone. Numerical IDs, model numbers, and exact phrases need keyword matching; conceptual questions need vectors. Weaviate and Qdrant ship hybrid out of the box; Elasticsearch with the dense_vector type and a reranker is a strong self-hosted option.
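A hybrid-scoring sketch, assuming the rank_bm25 and sentence-transformers packages; the 0.5 blend weight is a placeholder to tune on your own queries:

```python
# Blend BM25 keyword scores with dense cosine similarity (min-max normalized).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "Part number 88-1042 is backordered until Q3.",
    "The compressor unit requires quarterly maintenance.",
    "Replacement filters ship within two business days.",
]

bm25 = BM25Okapi([c.lower().split() for c in chunks])
encoder = SentenceTransformer("all-MiniLM-L6-v2")
dense = encoder.encode(chunks, normalize_embeddings=True)


def hybrid(query: str, alpha: float = 0.5) -> np.ndarray:
    kw = np.asarray(bm25.get_scores(query.lower().split()))
    vec = dense @ encoder.encode([query], normalize_embeddings=True)[0]
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-9)
    return alpha * norm(kw) + (1 - alpha) * norm(vec)


print(hybrid("when will part 88-1042 arrive?"))  # the exact ID boosts the BM25 side
```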

How do you approach document summarization?

Summarization is where LLMs decisively beat older techniques. Extractive methods (TextRank, BERT-based sentence selection) are still useful for very long documents where you need verifiable, in-document quotes. For everything else, a well-prompted Claude, GPT-4, or open-source Llama 3 70B is the right tool.

The patterns we use most:

  • Map-reduce for documents longer than the context window: summarize each chunk, then summarize the summaries. Slower but bounded by token limits (sketched in code after this list).
  • Hierarchical summarization for contracts and clinical notes: first pass extracts structured fields (party, term, obligations / patient, complaint, plan), second pass writes the prose.
  • Constrained outputs for compliance: force the model to cite specific document sections by ID for every claim, so reviewers can verify. We use this on legal contract review and clinical-note generation.
  • Evaluation with a rubric, not vibes. Build a 20- to 50-document gold set and score every candidate prompt or model on faithfulness (no hallucinated facts), coverage (no missed obligations), and brevity.
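A map-reduce sketch for the first pattern, assuming the OpenAI chat API; the model name, chunk size, and prompts are placeholders to tune against your gold set:

```python
# Map-reduce summarization: summarize each chunk, then summarize the summaries.
# Chunking by character count is deliberately naive; use a token-aware splitter in production.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder


def summarize(text: str, instruction: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content


def map_reduce_summary(document: str, chunk_chars: int = 8000) -> str:
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [summarize(c, "Summarize this section in 3 bullet points.") for c in chunks]  # map
    return summarize("\n\n".join(partials), "Combine these section summaries into one summary.")  # reduce
```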

Should you build, buy, or partner for NLP?

The NLP market is mature enough that most generic tasks have a buyable answer. The build-vs-partner decision shows up on domain-specific extraction, regulated data, or anything where your text is your moat.

The main options, what each is best for, and what they typically cost:

  • Cloud NLP APIs (AWS Comprehend, Google Cloud Natural Language API, Azure Cognitive Services). Best for: generic sentiment, language detection, generic NER, batch translation. Strengths: zero ops, pay-per-call, decent accuracy on common entities. Weaknesses: weak on domain data, limited customization, data-residency questions. Typical cost: $0.0001–$0.001 per call; the hidden cost is rebuilding when accuracy is too low.
  • General-purpose LLMs (OpenAI, Anthropic, Cohere) with structured outputs. Best for: flexible extraction and classification, prototypes, low-volume production. Strengths: excellent zero-shot accuracy, fast iteration, no training data required. Weaknesses: expensive at scale, latency variability, prompt drift on model upgrades. Typical cost: $0.01–$0.10 per call depending on document length.
  • Open-source transformers (spaCy and Hugging Face models such as BERT, RoBERTa, DistilBERT, XLM-R). Best for: high-volume classification and NER, latency-critical or cost-sensitive workloads. Strengths: cheap inference, full control, deployable in your VPC. Weaknesses: needs labeled data and ML engineering to fine-tune and serve. Typical cost: $5K–$50K to fine-tune and deploy a single model end-to-end.
  • Specialty NLP vendors (Rosette for multilingual entity extraction, Lexalytics for sentiment in regulated industries, MonkeyLearn for no-code teams). Best for: pre-built domain coverage you don't have to label yourself. Strengths: faster time-to-value than fine-tuning from scratch. Weaknesses: lock-in on data formats and pricing; quality varies by domain. Typical cost: $30K–$150K annual licenses.
  • Custom NLP build with us. Best for: domain-specific extraction, regulated data, anywhere your text is differentiated. Strengths: tuned to your data, owned IP, mixed architecture (rules + transformers + LLMs). Weaknesses: requires labeled data and a 6-to-12-week build. Typical cost: $40K–$200K upfront, sub-cent inference at scale.

Our default recommendation: prototype with an LLM and structured outputs (OpenAI or Anthropic), measure accuracy and cost honestly on your data, then graduate to a fine-tuned smaller model only when volume or latency justifies it. Skipping the LLM prototype to "do it right with fine-tuning from day one" is the most common way NLP projects miss their deadline.

How do you handle multilingual text and code-switching?

Most LATAM and global enterprise data is multilingual. The patterns we use:

  • Multilingual transformer backbones (XLM-RoBERTa, mBERT, BGE-M3) for classification and NER, fine-tuned on a mix of languages. One model, multiple languages, less infrastructure.
  • Cohere Embed v3 multilingual or OpenAI text-embedding-3-large for cross-lingual semantic search — a Spanish query retrieves relevant English documents.
  • Language detection as a first step (fasttext, lingua-py) to route to the right downstream model when language-specific tuning is required (see the routing sketch after this list).
  • Targeted evaluation per language, especially for Spanish and Portuguese variants. A model that scores 0.94 on Iberian Spanish may drop to 0.86 on Mexican Spanish; the only way to know is to evaluate on real data.
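A routing sketch for the language-detection step, assuming the lingua-language-detector package; the per-language model registry is a hypothetical mapping to your own fine-tuned pipelines:

```python
# Detect the language, then route to a language-specific pipeline,
# falling back to a multilingual model when detection is uncertain.
from lingua import Language, LanguageDetectorBuilder

detector = (
    LanguageDetectorBuilder
    .from_languages(Language.ENGLISH, Language.SPANISH, Language.PORTUGUESE)
    .build()
)

PIPELINES = {  # hypothetical routing table
    Language.ENGLISH: "ticket-classifier-en",
    Language.SPANISH: "ticket-classifier-es",
    Language.PORTUGUESE: "ticket-classifier-pt",
}


def route(text: str) -> str:
    lang = detector.detect_language_of(text)  # may return None on short or mixed text
    return PIPELINES.get(lang, "ticket-classifier-multilingual")  # code-switching falls back here


print(route("¿Dónde está mi pedido? It was supposed to arrive Monday."))
```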

Code-switching (Spanglish in customer support, mixed-language legal contracts) breaks language detection and degrades extraction. We handle it with multilingual models and explicit examples in the evaluation set rather than pretending it doesn't exist.

What does an NLP engagement look like with us?

A typical engagement runs 6 to 16 weeks across four phases:

1. Data and goal scoping (1–2 weeks). We inspect a representative sample of the text, define the target outputs precisely (every field, every label, every edge case), and build a 200-to-500-example evaluation set with you. This is where most NLP projects succeed or fail.

2. Prototype (2–4 weeks). LLM-first, structured outputs, run end-to-end on real data. We score against the evaluation set and present accuracy, cost, and latency tradeoffs.

3. Production build (3–8 weeks). Pipeline engineering: ingestion, preprocessing, model serving (or LLM calls with caching), confidence-based routing to human review, audit logging, monitoring. If the LLM economics don't work at scale, we fine-tune a smaller model on the LLM's outputs.

4. Hand-off and continuous improvement (2 weeks). Documentation, runbooks, retraining playbooks, and an active-learning loop so the system improves as your team uses it.

Outcomes we deliver: a measurable accuracy number on your data, a per-document cost we can defend, a monitoring dashboard, and code your team owns.

What does NLP cost?

Realistic ranges for the work we do:

  • Single-task LLM-based pipeline (one classification or extraction problem, low-to-mid volume): USD 25,000–60,000 to build, USD 500–5,000/month to run depending on volume.
  • Fine-tuned transformer in production (high-volume classification or NER): USD 50,000–120,000 to build, USD 200–2,000/month to host on a single GPU or serverless inference.
  • Multi-pipeline NLP system (ingestion + extraction + search + summarization, with human review): USD 100,000–250,000 to build, USD 2,000–15,000/month to run.

LLM API costs scale linearly with volume; the number we plan around is per-document cost at projected volume, not headline per-token pricing. We benchmark every project against this on real data before we quote.
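A back-of-envelope version of that per-document calculation; every price and token count below is a placeholder to replace with current provider pricing and your measured document sizes:

```python
# Per-document and monthly cost at projected volume (all numbers hypothetical).
INPUT_PRICE_PER_MTOK = 2.50    # USD per million input tokens (placeholder)
OUTPUT_PRICE_PER_MTOK = 10.00  # USD per million output tokens (placeholder)

tokens_in_per_doc = 3_000      # prompt + document
tokens_out_per_doc = 400       # structured extraction result
docs_per_month = 200_000

per_doc = (tokens_in_per_doc * INPUT_PRICE_PER_MTOK
           + tokens_out_per_doc * OUTPUT_PRICE_PER_MTOK) / 1_000_000
print(f"${per_doc:.4f} per document, ${per_doc * docs_per_month:,.0f} per month")
```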

For pricing detail, see our Pricing page.

Frequently asked questions about natural language processing

Should we use a fine-tuned classifier or just call GPT-4?

Both have their place. For high-volume, latency-sensitive, or budget-constrained classification (millions of items, sub-100ms responses), a fine-tuned smaller model — DistilBERT, a Hugging Face transformer, or even logistic regression on embeddings — usually wins on cost and speed. For low-volume, complex extraction with shifting requirements, GPT-4 or Claude with a well-written prompt and structured output schema is faster to build and maintain. We routinely deploy both in the same pipeline.

How accurate can we expect NLP to be on our domain data?

On a clean classification problem with good training data, F1 scores of 0.90 to 0.97 are realistic. On noisy clinical or legal text with rare entities, expect 0.75 to 0.90 with active learning to close the gap. Anyone who promises 99% on your first model without seeing your data is selling you something.

Do we need labeled training data, and how much?

For traditional fine-tuning, 500 to 5,000 labeled examples per class gets you most of the way. For LLM-based extraction, you can start with zero-shot prompting and 20 to 50 hand-labeled examples for evaluation — that's enough to know whether the system works. We almost always start LLM-first now and only fine-tune if cost or latency demands it.

Can NLP work on languages other than English?

Yes. Multilingual transformers (XLM-RoBERTa, mBERT) and Cohere's multilingual embeddings handle 100+ languages competently. Spanish and Portuguese — relevant for our Mexico and LATAM clients — perform nearly on par with English. Low-resource languages and code-switching (Spanglish, Portuñol) need extra evaluation and sometimes targeted fine-tuning.

What happens when the model is wrong?

Every production NLP system needs three things: a confidence threshold that routes uncertain predictions to human review, an audit log so you can replay any decision, and a feedback loop that captures human corrections as new training data. Systems without these eventually drift and lose trust.

How do you handle PII and sensitive text?

We deploy NLP inside your VPC or on-premise when data sensitivity requires it, use Microsoft Presidio or AWS Comprehend's PII detection to redact before any model call, and disable training-data retention on every API. For regulated industries we map the data flow to GDPR, LFPDPPP, or sector regulations before code ships.
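A redaction sketch, assuming the presidio-analyzer and presidio-anonymizer packages; only the redacted text would leave your environment:

```python
# Detect and redact PII before any model call.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or +1 202 555 0147."
findings = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # e.g. "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>."

# Only redacted.text is passed to the downstream model call (hypothetical):
# llm_response = call_model(redacted.text)
```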

How long until we see results?

A focused NLP build — one extraction or classification problem on existing data — ships in 4 to 8 weeks. A multi-pipeline system with ingestion, evaluation, monitoring, and human review usually runs 10 to 16 weeks.

Ready to Transform Your Business with AI?

Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.

Schedule a Consultation