Not Every NLP Problem Needs a Frontier Model

Frontier LLMs are capable. They're also expensive, slow, and prone to hallucination. For many NLP tasks, using a fine-tuned BERT model will be more accurate, easier to audit, better suited to your domain, and cheaper by orders of magnitude — and it'll keep your data off someone else's servers.

There's a pattern in enterprise AI adoption. An organization decides it needs to do something with its documents — classify them, extract information from them, summarize them. Someone suggests using ChatGPT or Gemini or Claude. Nobody pushes back, because those are the tools everyone has heard of. The project proceeds, the costs come in higher than expected, the outputs require more human review than anticipated, and the team spends significant time writing and refining prompts that will need to be rewritten again the next time the model provider changes the model's behavior.

This isn't always the wrong choice. Sometimes it's the right one. But organizations often either (a) don't make these decisions carefully enough or (b) create incentive structures (like rewarding token usage to encourage AI adoption) that lead employees to adopt cost-inefficient solutions. And that's a problem — because the choice of model architecture is one of the most consequential decisions in an NLP project, with implications for cost, accuracy, latency, and maintainability.

Two fundamentally different approaches

GPT-style LLMs — like ChatGPT, Gemini, and Claude — are generative models. You give them a prompt, they produce text. Frontier GPT-style models are capable, and for tasks that require synthesis, flexible reasoning, or open-ended generation — drafting documents, question answering over heterogeneous sources, analyzing documents where the structure and formatting varies unpredictably — generative models are often the right tool.

BERT-style models (BERT, RoBERTa, DeBERTa, and their domain-adapted variants) are discriminative models. They don't generate text. They can classify tokens, sentences, and paragraphs; they can extract spans of text; they can produce text embeddings that you can use for search and retrieval tasks. For tasks that can be precisely specified (classify this paragraph, extract these entities, determine whether these two passages are semantically similar, find similar paragraphs), discriminative models are often more accurate, dramatically cheaper, and easier to audit than generative alternatives.

Most practical NLP tasks fall into the second category. Classification, named entity recognition, information extraction, span detection, semantic similarity — these are well-defined tasks with well-defined outputs. Solving them with a generative model is like using a search engine to answer a question that a lookup table would handle in microseconds. It works. But it's not the right tool for the job.

The accuracy argument

Generative models hallucinate. This isn't a bug that'll be patched in the next release; it's a structural property of how these models work. They produce fluent, plausible text. That text isn't always accurate, and models are bad at assessing their own confidence in it. AI companies fine-tune LLMs with reinforcement learning from human feedback (RLHF), and that process can reward confident-sounding output.

For extraction tasks, this is a serious problem. Asking a frontier LLM to extract all cited cases in a court decision will return a plausible-looking list. The majority will probably be correct. But some of the "extracted" citations may not actually appear in the document. If you cite one of these cases without checking it — which takes a non-trivial amount of time and effort — you could face serious professional consequences.

(You don't want a judge to reprimand you for filing a brief with made-up citations.)

When the information you need to extract is contained in specific spans of text, you can train a BERT model to extract those spans. If the information you need is dispersed or requires synthesis, then an LLM will be the better choice.

You might think that frontier reasoning models would be better at extracting information without hallucinating. Reasoning models are, after all, capable of handling complex multi-step tasks. But in practice, (cheaper) non-reasoning models are often better at structured extraction and tagging tasks. Reasoning models are trained to think through problems step by step — a useful property when the task is open-ended, but a liability when the task requires strictly adhering to a specific output format. Reasoning models have a tendency to ignore your output schema — they might provide the information you wanted, but not always in the format you wanted.
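Whichever model you use, one way to catch output that drifts from the requested format is to validate every response before it enters your pipeline. The following is a minimal sketch using only the Python standard library. The field names in `SCHEMA` and the sample responses are hypothetical, chosen for illustration.

```python
import json
from typing import Dict, List

# Hypothetical schema: each required field mapped to the Python type it must have.
SCHEMA: Dict[str, type] = {"citation": str, "court": str, "year": int}


def validate_output(raw: str, schema: Dict[str, type] = SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for key, expected in schema.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif type(data[key]) is not expected:
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in data:
        if key not in schema:
            problems.append(f"unexpected field: {key}")
    return problems


# Made-up model responses: one conforming, one chatty, one incomplete.
conforming = validate_output('{"citation": "Case 26/62", "court": "CJEU", "year": 1963}')
chatty = validate_output("Sure! Here is the citation: Case 26/62")
partial = validate_output('{"citation": "Case 26/62", "year": "1963"}')
```

A check like this only tells you that the format is right. It says nothing about whether the extracted values are correct.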

Another key difference is how you assess performance and measure accuracy. With BERT models, you benchmark accuracy against a validation set before deployment, giving you a statistical baseline. You then monitor model drift during deployment. With a prompted LLM, there's usually no equivalent baseline: you often don't know how it will perform until it has seen enough production data, and you need post-hoc evaluation frameworks to catch hallucinations.
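To make the validation-set baseline concrete, here is a minimal sketch of span-level scoring. It assumes gold and predicted spans are stored as (start, end, label) tuples per document and counts only exact matches; the offsets below are a made-up toy example, not real results.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def span_prf(gold: List[Set[Span]], pred: List[Set[Span]]) -> Dict[str, float]:
    """Micro-averaged exact-match precision, recall, and F1 over documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # predicted spans that exactly match a gold span
        fp += len(p - g)  # predicted spans with no gold match
        fn += len(g - p)  # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy validation set: two documents with hand-made offsets.
gold = [{(0, 12, "CITATION"), (40, 55, "CITATION")}, {(5, 20, "CITATION")}]
pred = [{(0, 12, "CITATION"), (41, 55, "CITATION")}, {(5, 20, "CITATION")}]
scores = span_prf(gold, pred)
```

Exact matching is a strict choice: the second prediction in the first document is off by one character and counts as both a false positive and a miss. Depending on the task, a more lenient overlap criterion may be more appropriate.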

More capability doesn't necessarily translate to better performance on specific tasks — particularly extractive tasks.

The observability argument

Frontier LLMs are black boxes — you can't see exactly what's going on inside them. But when you use LLMs to extract data from documents, there's another source of uncertainty — you don't know whether the answer is even based on the document. If you ask an LLM to extract citations from a court decision, you don't know, for any extracted citation, if the text it's generated is from the document or something it made up using its general knowledge.

A fine-tuned BERT model can extract spans from text. These models are still black boxes (you don't know why the model did what it did), but at least you know that the text you're getting is real text from the document. The outputs are traceable: every extracted entity can be traced back to a specific location in the document. For any application where provenance matters — legal research, regulatory compliance — this kind of observability (without any hallucination risk) is essential.
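If you do use an LLM for extraction, you can approximate this traceability after the fact by checking that each extracted string occurs verbatim in the source. A minimal sketch, assuming the extractions arrive as a plain list of strings; the sample sentence is invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Grounded(NamedTuple):
    text: str
    start: Optional[int]  # character offset in the document, or None if absent
    end: Optional[int]


def ground_extractions(document: str, extracted: List[str]) -> List[Grounded]:
    """Locate each extracted string verbatim in the document."""
    results = []
    for item in extracted:
        start = document.find(item)  # -1 when the string does not occur
        if start == -1:
            results.append(Grounded(item, None, None))
        else:
            results.append(Grounded(item, start, start + len(item)))
    return results


# Invented sentence and a hypothetical model output for illustration.
document = "As held in Case 6/64 Costa v ENEL and confirmed in Case 26/62 Van Gend en Loos."
extracted = [
    "Case 6/64 Costa v ENEL",
    "Case 26/62 Van Gend en Loos",
    "Case 11/70 Internationale Handelsgesellschaft",  # not in the document
]
results = ground_extractions(document, extracted)
ungrounded = [r.text for r in results if r.start is None]
```

Exact string matching will also flag extractions that differ from the source only in whitespace or line breaks, so in practice you would probably normalize both sides first. A span extractor gives you these offsets by construction.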

Another issue is that many frontier generative models are proprietary, and their behavior can change at any time, sometimes unexpectedly enough to surprise even their owners. A proprietary LLM could perform well at a task one day and poorly the next. You may not observe these changes until it's too late, after they've already degraded your product's performance. That's a business risk. And because you don't control the model, you can't fix it yourself.

Your BERT model won't change unless you retrain it.

(Which you can easily do — on a single GPU. In hours or even minutes.)

The specialization argument

General-purpose LLMs are trained on general-purpose text. Their knowledge of specialized domains (EU competition law, ECHR jurisprudence, WTO dispute settlement procedures) is whatever happened to appear in their training corpus: unverified, unsystematically sampled, and frozen at the training cutoff. The problem isn't that the general training corpus lacks material relevant to your domain; it probably includes lots of it. The problem is that it also includes a lot of other things, and those other things can bias the model on your domain-specific tasks. In other words, the problem is less a lack of information than an excess of it. The model could incorrectly apply knowledge from one context to another, for example getting a CJEU citation format wrong because it has confused it with the format of another court that it has seen more often.

If you want to extract information from CJEU judgments, a domain-adapted BERT model trained on a curated corpus of CJEU judgments is a better choice. It's been trained on the specific terminology, citation conventions, and document structure of your corpus. Its representations of domain-specific concepts reflect their actual usage on your corpus, rather than their general use. It won't invent a citation to a directive that doesn't exist, because it's not generating anything. It identifies spans in the provided text or it doesn't. It might miss some citations, but you'll have a benchmark for how accurate it is (the validation set), and you'll be able to identify the most common failure modes and add training data to help the model improve.

This is the main argument for domain adaptation over prompting for high-stakes, domain-specific NLP tasks — like legal document parsing. The question isn't which model has read more text. It's which model's architecture is appropriate for the task and which model's training is appropriate for the domain.

The cost argument

The economics aren't subtle. A BERT-style classifier running on a single GPU can process tens of thousands of documents in minutes at negligible marginal cost. API calls to frontier generative models for the same volume can run to hundreds or thousands of dollars, with latency that rules them out for many real-time applications.
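A back-of-the-envelope comparison shows how the gap arises. Every number below (document count, tokens per document, API price, throughput, GPU rental rate) is a made-up placeholder, not a quoted price or a measured benchmark; substitute your own figures.

```python
def api_cost_usd(num_docs: int, tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Cost of sending every document through a per-token-priced API (input tokens only)."""
    return num_docs * tokens_per_doc * usd_per_million_tokens / 1_000_000


def gpu_cost_usd(num_docs: int, docs_per_second: float, usd_per_gpu_hour: float) -> float:
    """Cost of processing the same documents on a GPU billed by the hour."""
    hours = num_docs / docs_per_second / 3600
    return hours * usd_per_gpu_hour


# Illustrative placeholder inputs, not real prices or measured throughput.
NUM_DOCS = 50_000
api = api_cost_usd(NUM_DOCS, tokens_per_doc=4_000, usd_per_million_tokens=3.0)
gpu = gpu_cost_usd(NUM_DOCS, docs_per_second=100.0, usd_per_gpu_hour=2.0)
ratio = api / gpu
```

The sketch ignores output tokens, retries, and the one-off cost of annotating data and training the classifier, so it is a rough way to compare recurring inference costs and not a full cost model.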

For example, for a legal tech product that analyzes court filings at scale — tagging argumentation techniques in paragraphs, extracting citations from decisions, identifying procedural actions across thousands of cases — the cost difference between a fine-tuned BERT model and a generative LLM pipeline isn't marginal. It's the difference between a system that's economically viable to run continuously and one that requires careful rationing of compute resources.

The economics of LLMs are changing. The per-token cost of compute for commoditized models has declined, but these prices are artificially low because they're subsidized by venture capital. As model providers face increasing pressure to generate returns, these subsidies may start to run out. Newer reasoning models process far more tokens than non-reasoning models. Agents are now responsible for more LLM usage, and relatively simple tasks can unexpectedly make many model calls, making billing less predictable. Organizations' overall spend on AI compute has gone up, raising the opportunity cost of using LLMs for tasks that they're not necessary for.

The financial cost of frontier LLMs isn't the whole story. LLMs run on infrastructure that has substantial energy and water requirements. Routing classification tasks that a 110-million-parameter BERT model could handle in milliseconds through a frontier model with hundreds of billions of parameters carries a real environmental cost — one that scales directly with volume. Using the most powerful LLM available is rarely necessary. For organizations with sustainability commitments, those commitments should extend to infrastructure choices in ML pipelines.

This tradeoff matters more as organizations' products mature. Proof-of-concepts can absorb high per-query costs — financial and otherwise. Production systems, running continuously on growing document corpora, often can't.

The privacy argument

Processing documents using a frontier model API means sending them to someone else's servers. For legal documents — client communications, draft contracts, confidential filings, internal legal strategy — this isn't a hypothetical concern. The same goes for medical documents and financial documents. LLMs raise data governance questions with implications for legal liability and professional ethics. LLM providers offer contractual data protections, but that doesn't change the fact that your data is on their servers.

A fine-tuned BERT model runs locally. A domain-adapted classifier processing client documents can run on your company-controlled computer or cloud server — hosted by an organization that meets your specific data protection requirements — and never exposes a single document to the outside world. The data stays where it belongs — on your machine. There's no terms-of-service question about whether the text can be used in future training; no risk of third-party data leaks. For law firms, legal tech companies, or any organization handling privileged or proprietary material, this is critical to get right.

(Some major law firms have learned this the hard way.)

LLMs and BERT models aren't always competitors

There are cases where frontier models and BERT-style models work well together. Here's an example: synthetic data generation.

Annotating training data for a domain-adapted classification model is expensive. Domain experts are scarce and their time is valuable. One approach that has become increasingly practical is using a frontier LLM to generate synthetic labeled examples via zero-shot or few-shot prompting, producing a first-pass training corpus that you can fine-tune a BERT model on. Experts can then audit this synthetic data, which is cheaper than labeling it from scratch. The LLM handles the generation; the smaller discriminative model handles production inference, where cost, latency, and auditability matter.
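The generation step might be sketched as follows. This is an assumption-laden outline, not a reference pipeline: `call_llm` stands for whatever client function you use to send a prompt and get text back, `fake_llm` is a stub so the example runs offline, and the label name and seed paragraph are invented.

```python
import json
from typing import Callable, Dict, List


def build_prompt(label: str, seed_examples: List[str], n: int) -> str:
    """Few-shot prompt asking for n new paragraphs of one label, as a JSON list."""
    shots = "\n".join(f"- {ex}" for ex in seed_examples)
    return (
        f"Here are paragraphs labeled '{label}':\n{shots}\n"
        f"Write {n} new paragraphs with the same label. "
        "Respond with a JSON list of strings only."
    )


def generate_examples(call_llm: Callable[[str], str], label: str,
                      seed_examples: List[str], n: int) -> List[Dict[str, str]]:
    """Ask the model for synthetic paragraphs and keep only well-formed ones."""
    raw = call_llm(build_prompt(label, seed_examples, n))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unusable response; the caller can retry
    if not isinstance(items, list):
        return []
    return [
        {"text": t.strip(), "label": label, "source": "synthetic"}
        for t in items
        if isinstance(t, str) and t.strip()
    ]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    return json.dumps(["The facts here mirror those in the earlier case.", "", 42])


rows = generate_examples(fake_llm, "analogy", ["This case resembles the prior one."], n=3)
```

Tagging each row with `"source": "synthetic"` keeps generated examples distinguishable from expert-labeled ones, which makes the later audit step easier to scope.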

This isn't a universal solution — synthetic data has its own validity challenges. But it's a legitimate way to reduce annotation costs on tasks where ground-truth examples are sparse, and it illustrates that the choice between LLMs and BERT models isn't always binary. The tools serve different purposes but can be combined effectively.

When to use which

LLMs are the right choice when the task requires generation: drafting, synthesis, question-answering where the relevant information isn't localized in a specific span, or fast-changing tasks where a discriminative model would require constant retraining. They can also be appropriate for low-volume applications where the flexibility of a prompted model reduces development time enough to justify the cost.

BERT-style models are the right choice when the task is well-defined, the output is structured, the volume is high, the domain is specialized, and accuracy is important enough that hallucination isn't acceptable. Classification, named-entity recognition, information extraction, semantic search, and document similarity all qualify.

The online AI discourse is dominated by announcements of newer, more powerful, more expensive models. That discourse isn't a good guide to engineering decisions. There's a systematic bias in the discourse towards impressive-sounding tools over cheaper, more appropriate ones. More often than you might think, the right answer is a fine-tuned BERT model — architecture from 2018. The AI influencers moved on. The use cases didn't.
