Why General-Purpose Language Models Struggle with Legal Text

Legal language has structural and semantic properties that general-purpose models weren't trained to handle. For high-stakes legal NLP applications, the choice between fine-tuning, domain adaptation, and prompting is a consequential engineering decision.

Legal tech companies building NLP systems face a complex architectural choice that's easy to defer and expensive to get wrong. Frontier generative models — like ChatGPT, Gemini, and Claude — are very capable, immediately available, and require zero upfront training data. Smaller, fine-tuned encoder models are cheaper, faster, and fully auditable. Domain adaptation via continued pretraining yields the best results on specialized tasks but requires a curated corpus and a dedicated compute budget.

Organizations may be tempted to start with whatever is most convenient and assume they can easily iterate later.

There's a predictable failure mode: strong demo performance, but inconsistent production results. The culprit is the domain gap — a mismatch between the general text the model was trained on and the specific properties of legal text. You have to understand the gap to choose the right solution.

What Legal Text Actually Is

Legal language isn't just legal jargon layered on top of ordinary English; it's a parallel system of syntax and semantics.

Legal language is a distinct register with its own vocabulary, rhetorical strategies, citation conventions, and cross-lingual semantics. Every one of these dimensions creates unique friction for models trained on general text or even domain-adapted models trained only on English text. Here are just a few of the challenges.

Semantic Divergence: Legal terms carry precise meanings that diverge from their everyday meanings — a problem that's amplified by the multilingual character of EU law. In this register, "proportionality" doesn't refer to fairness; it's a multi-part judicial test that governs legislative competence. "Decision" doesn't refer to a choice; it's a legal act under Article 288 TFEU that targets specific addressees and that can have horizontal and vertical direct effect. "Undertaking" doesn't refer to a commitment to do something; it's a legal entity that's engaged in economic activity, regardless of its legal status. Foundational EU legal concepts like "subsidiarity" have specific doctrinal meanings that ordinary-language definitions completely fail to convey. A language model that has a representation of the term "subsidiarity" that's based on general internet language is not going to adequately capture its meaning in a paragraph of a CJEU judgment.
Semantically Rich Citations: Citation formats can vary widely, even within the decisions of a single court. For example, Court of Justice of the European Union (CJEU) judgments contain hundreds of unique citation formats. They also contain short-form citations. A CJEU judgment that cites Francovich is invoking a specific legal doctrine about state liability for failing to implement EU directives. The name Francovich has a specific meaning here: it encodes important semantic information about the argument. A general language model, or even a general legal language model, won't have an internal representation of the name Francovich that captures the semantic signal it's sending. A model that doesn't have an EU-specific representation of Francovich is missing substantive content.
Hierarchical Documents: In legal documents, formatting often encodes information about the relationships between blocks of text. For example, EU directives and regulations have a hierarchical structure — parts, titles, chapters, articles, paragraphs, subparagraphs, points, sub-points. A model that ignores formatting discards contextual signals about how paragraphs relate to each other that a human lawyer would rely on automatically. The text of Article 5(2)(b) of a regulation should be interpreted in the context of Article 5(2) and Article 5(2)(a) — otherwise you might misinterpret its scope or implications. In retrieval-augmented generation (RAG) pipelines, documents are typically split using fixed token windows (e.g., sliding windows of 512 tokens). This doesn't work with legal documents. If you split a treaty article mid-sentence, you destroy the context that you need for accurate retrieval and reasoning. You need to split documents based on their actual internal structure.
Document Cross-Referencing: Legal text rarely contains self-contained thoughts. A single paragraph of a CJEU judgment might refer to two treaty articles, three directives, and five court cases. The meaning of that paragraph depends on the text of all those cross-referenced documents. The paragraph is a graph as much as it is a paragraph. A general-purpose language model or a naive RAG system can't fully encode the meaning of that paragraph without knowing what the cross-referenced documents say.
Co-Reference Resolution: Legal documents generally prefer formal, structured pointers over pronouns: "that Member State," "the said Directive," or "the contested decision." General language models frequently struggle with co-reference resolution in documents with high syntactic density. If the model's attention layers misalign a pointer like "the contested decision" to a peripheral regulation mentioned a few sentences earlier, rather than the regulation under review in the case, the downstream extraction or classification task could become less accurate.
Temporal Dynamics: Language models don't understand that legal text has a time dimension. They don't understand that an article of a base regulation may have been amended by a later amending regulation. They don't understand that a regulation may no longer be in force. They don't understand that a court case could have annulled a regulation. They don't understand that a court case might have changed the interpretation of an article. In short, they don't understand that meaning changes over time, and they don't know what the implications are for the task they're performing, whether it's extracting an answer to a user's question from a document using a BERT-style model or generating an answer to a user's question using a GPT-style model.
Multilingual Documents: Many EU legal documents have 24 official, equally authentic language versions. But the challenge is more complex than just having to deal with the same document in multiple languages, which is hard enough. Documents are often themselves multilingual in that they contain names and legal terms from other languages. A language model that's processing an English legal document will frequently encounter German names or French legal terms that carry important semantic signals. And it won't know what they mean. Even an English language model that's been fine-tuned on EU legal text won't have good internal representations of non-English names or non-English legal terms.
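
The temporal point in the list above can be made concrete with a small sketch. It assumes a hypothetical in-memory version history for a single article (the dates and wording are invented for illustration and are not taken from any real regulation) and shows one way a pipeline might pick the version in force on a given date before any text reaches a model.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ArticleVersion:
    valid_from: date        # date this wording started to apply
    text: Optional[str]     # None marks repeal or annulment from that date


# Hypothetical version history, sorted by valid_from. All values are invented.
HISTORY = [
    ArticleVersion(date(2010, 1, 1), "Original wording."),
    ArticleVersion(date(2015, 6, 1), "Wording as amended."),
    ArticleVersion(date(2021, 3, 1), None),  # no longer in force
]


def version_in_force(history: list[ArticleVersion], on: date) -> Optional[str]:
    """Return the text in force on `on`, or None if nothing was in force."""
    starts = [v.valid_from for v in history]
    idx = bisect_right(starts, on) - 1   # last version starting on or before `on`
    if idx < 0:
        return None                      # date precedes the first version
    return history[idx].text


print(version_in_force(HISTORY, date(2012, 5, 5)))
print(version_in_force(HISTORY, date(2018, 1, 1)))
print(version_in_force(HISTORY, date(2022, 1, 1)))
```

A real system would also need to track which amending act introduced each version and whether a judgment changed the interpretation without changing the wording; this sketch covers only the simplest case.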

In this domain, where accuracy is paramount, using a model that has inadequate internal representations of a paragraph can be a problem.

The Limits of General-Purpose Pretraining

A model trained on the entire internet learns how to work with everyday language; it doesn't know how to work with legal text.

General-purpose language models produce different failure modes depending on whether you're using an encoder model or a decoder model.

For small encoder architectures (like BERT, RoBERTa, or DeBERTa), the problem is contextual blind spots. Because their pretraining data (general-web crawls and standard literature corpora) contains a statistically small fraction of specialized legal text, the model's internal weights reflect average internet usage. When an encoder processes an EU legal document, its embedding layer maps terms like "undertaking" or "decision" to vectors dominated by general-domain contexts. The legal meaning is present, but the model's representations of these terms are dominated by their everyday meanings, crowding out their precise doctrinal meanings.

For frontier decoder architectures (like ChatGPT, Gemini, or Claude), the scale of pretraining data means they have seen a substantial amount of legal text. But because they're autoregressive next-token predictors, their representations are optimized to capture the statistical average of the open web. Both kinds of model also share a second problem: tokenization fragmentation.

If a base tokenizer — whether it belongs to a 110-million-parameter BERT-style model or a trillion-parameter GPT-style model — doesn't contain native tokens for specialized legal abbreviations (like GBER or SGEI) or case names (like Francovich), it'll be forced to fracture them into sub-tokens, which won't capture the term's semantic signal as well as a dedicated token would. Frontier decoder models use much larger BPE vocabularies and are more likely to have dedicated tokens for common legal abbreviations, but case names may still have weak contextual representations.
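
To see what fragmentation looks like, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece. The two vocabularies are tiny hypothetical ones invented for this example; real tokenizers have tens of thousands of entries, BPE-based tokenizers work differently in detail, and either may split these terms in other ways. The point is only that a term missing from the vocabulary gets broken into pieces that individually carry little of its meaning.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first subword split, WordPiece style.

    Continuation pieces are prefixed with '##'. If any part of the word
    cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, invented for illustration only.
GENERAL_VOCAB = {"franc", "##ovich", "decision", "g", "##be", "##r"}
LEGAL_VOCAB = GENERAL_VOCAB | {"francovich", "gber"}

for term in ["francovich", "gber", "decision"]:
    print(term, wordpiece(term, GENERAL_VOCAB), wordpiece(term, LEGAL_VOCAB))
```

With the general vocabulary, the case name and the abbreviation are each split into several pieces, while the extended vocabulary keeps them whole.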

How to Deal with Domain-Specific Text

Choosing an NLP architecture isn't about picking the "best" model; it's about matching the engineering solution to the specific task type.

Legal NLP tasks generally fall into two categories — structured discrimination (classification, extraction) and text generation (summarization, drafting, question-answering). Resolving the domain gap requires choosing the right architecture for the right objective.

Task-Specific Fine-Tuning

You can use fine-tuning to address the domain gap for structured classification and extraction tasks — such as span classification for extracting citations or identifying specific clauses. With this approach, you take a general pretrained encoder model (like a standard BERT or DeBERTa variant), attach a task-specific classification head, and continue training across the entire network on labeled domain examples. This updates the model's underlying internal weights so that it learns to map its representations to your specific classification categories or entity types. This is effective when the domain gap is moderate and the task is a well-defined discriminative problem. It's cheap, fast, and predictable.

But it still has a performance ceiling. Because standard task fine-tuning updates the weights using only a pool of task-specific examples, it forces the network to learn your domain through the lens of your specific classification objective. The model is then doing two jobs at once: learning domain-specific representations of tokens and learning how to classify labels.

Domain Adaptation via Continued Pretraining

You can use domain adaptation to address the domain gap for high-volume, continuous extraction and entity recognition pipelines. With this approach, you tackle the representation problem directly by continuing the self-supervised pretraining process on a large, unlabeled legal corpus before you add any task-specific classification heads.

For encoder models, this means additional Masked Language Modeling (MLM) to better learn the linguistic patterns of legal documents.
For decoder models, this means additional causal language modeling using a corpus of domain text.
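
As a rough illustration of the MLM objective, the sketch below applies the masking scheme from the original BERT recipe: about 15% of positions are selected as prediction targets, and of those, 80% are replaced with a mask token, 10% with a random token, and 10% left unchanged. The sentence and vocabulary are toy values invented for the example, and the selection rate is raised to 30% so that a short sentence is likely to show some masking. A real pipeline would operate on tokenizer IDs and would normally rely on a library's data collator rather than hand-written code.

```python
import random

MASK = "[MASK]"


def mask_for_mlm(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else (positions that do not contribute to the loss).
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            inputs.append(tok)       # not selected: leave untouched
            labels.append(None)
            continue
        labels.append(tok)           # selected: model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)                # 80%: replace with mask token
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with random token
        else:
            inputs.append(tok)                 # 10%: keep original token
    return inputs, labels


# Toy sentence and vocabulary, invented for illustration.
sentence = "the contested decision was adopted by the commission".split()
toy_vocab = ["undertaking", "directive", "regulation", "member", "state"]

inputs, labels = mask_for_mlm(sentence, toy_vocab, random.Random(0), select_prob=0.3)
print(inputs)
print(labels)
```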

With parameter-efficient techniques like LoRA and QLoRA, you can freeze the base model's weights and run this adaptation loop on targeted adapter layers. This uses a fraction of the compute resources. For high-volume extraction or entity recognition tasks that need to run continuously, domain-adapting a local model yields substantial long-term dividends in accuracy and reduces out-of-domain drift. For decoder models, parameter-efficient adaptation improves generation quality on domain text, but the model remains generative, so hallucination risk and structured output limitations still apply.
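
The compute savings come largely from how few parameters are trained. As a back-of-the-envelope sketch: LoRA replaces the update to a d-by-k weight matrix with the product of two low-rank matrices of shapes d-by-r and r-by-k, so the trainable count for that matrix drops from d*k to r*(d + k). The dimensions below (a 768-by-768 projection with rank 8) are illustrative assumptions, not values from any particular deployment.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when updating a d x k weight matrix directly."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update: B is d x r, A is r x k."""
    return r * (d + k)


# Illustrative dimensions (assumed): one 768 x 768 projection, rank 8.
d, k, r = 768, 768, 8
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For these assumed dimensions, the adapter trains roughly one forty-eighth of the parameters that a full update of the same matrix would.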

Retrieval-Augmented Generation (RAG)

You can use Retrieval-Augmented Generation (RAG) to address the domain gap for open-ended question-answering and complex legal research tasks. Instead of altering a model's weights, RAG injects the relevant legal sources directly into the prompt context window. The challenge is that you have to design a retrieval system that can find the relevant sources. This approach can fail if your data ingestion is naive. Standard RAG pipelines chunk text using arbitrary, fixed token windows (e.g., 512 tokens), which blindly splits documents apart. This may leave the downstream model without the context it needs to determine whether a retrieved source actually addresses the user's question.

Building a production-ready RAG pipeline requires a structural, hierarchy-aware approach rather than raw compute. You need to parse documents into coherent semantic chunks based on the structure of the documents, rather than token counts. Because documents in legal corpora are interconnected through citations and references, you need to enrich each chunk with metadata and context. This shifts the engineering challenge away from model training and onto upstream document parsing and data processing.
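
A minimal sketch of structure-based chunking follows. It assumes a simplified plain-text layout in which articles appear as "Article N" on their own line, paragraphs start with "N." and points start with "(a)"; real EU documents are better parsed from their structured source formats, and the sample text is invented for illustration. Each chunk carries its hierarchical path and, for points, the text of the enclosing paragraph as context.

```python
import re
from dataclasses import dataclass

ARTICLE_RE = re.compile(r"^Article (\d+)\s*$")
PARAGRAPH_RE = re.compile(r"^(\d+)\.\s+(.*)$")
POINT_RE = re.compile(r"^\(([a-z])\)\s+(.*)$")


@dataclass
class Chunk:
    path: str      # e.g. "Article 5(2)(b)"
    text: str      # the unit's own text
    context: str   # text of the enclosing paragraph, "" for paragraphs


def chunk_by_structure(document: str) -> list[Chunk]:
    """Split on articles, numbered paragraphs, and lettered points."""
    chunks: list[Chunk] = []
    article = None      # current article number
    paragraph = None    # Chunk for the current numbered paragraph
    current = None      # most recent chunk, target for continuation lines
    for raw in document.splitlines():
        line = raw.strip()
        if not line:
            continue
        if m := ARTICLE_RE.match(line):
            article, paragraph, current = m.group(1), None, None
        elif article and (m := PARAGRAPH_RE.match(line)):
            paragraph = Chunk(f"Article {article}({m.group(1)})", m.group(2), "")
            current = paragraph
            chunks.append(current)
        elif paragraph and (m := POINT_RE.match(line)):
            # Simplification: context is the paragraph text seen so far.
            current = Chunk(f"{paragraph.path}({m.group(1)})", m.group(2), paragraph.text)
            chunks.append(current)
        elif current:
            current.text += " " + line   # continuation of the previous unit
    return chunks


# Invented sample text in the assumed layout.
SAMPLE = """
Article 5
1. Member States shall designate a competent authority.
2. The competent authority shall:
(a) keep a register of undertakings;
(b) notify the Commission of any decision
taken under this Article.

Article 6
1. This Regulation shall apply from the date of its entry into force.
"""

chunks = chunk_by_structure(SAMPLE)
for c in chunks:
    print(c.path, "|", c.text, "|", c.context)
```

Storing the path and parent text alongside each chunk lets the retriever return Article 5(2)(b) together with the introductory wording of Article 5(2), which a fixed token window would not guarantee.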

Prompting

You can use prompting to address the domain gap for flexible, low-volume tasks that require open-ended language synthesis, high-level summarization, or initial drafting. With prompting, you skip local training entirely. Instead, you use zero-shot or few-shot prompting to instruct an external, frontier decoder model to complete specific textual synthesis tasks based on the instructions provided in the context.

For structured classification and extraction tasks, using a frontier decoder model comes with major liabilities. Generative models naturally hallucinate: if you ask an LLM to extract citations from a legal document, it can easily invent a plausible-sounding but nonexistent citation. You have to continuously check that model updates haven't degraded the model's performance on your task. And generative models don't always return structured output, like tagged spans or JSON, reliably. You have to constrain decoding and/or post-process the output to validate that it satisfies your data schema, which adds complexity. Fine-tuned encoders with a classification head produce structured outputs by construction, so there's no generation step that can go off-script.
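
The post-processing step can be sketched as follows. This is a minimal example under stated assumptions: the expected response shape (a JSON object with a "citations" list) is a hypothetical schema chosen for illustration, the model output is a hard-coded string standing in for a real API response, and "Case C-000/00" is a deliberately made-up placeholder for a hallucinated citation. The check that a citation occurs verbatim in the source is a crude guard for extraction tasks, not a complete hallucination detector.

```python
import json


def validate_citations(raw_output: str, source_text: str) -> list[str]:
    """Parse a model response and keep only citations found in the source.

    Expects a JSON object of the form {"citations": ["...", ...]}.
    Raises ValueError if the response does not match that shape.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    citations = data.get("citations")
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        raise ValueError("expected a 'citations' list of strings")
    # A citation that does not occur verbatim in the source cannot have been
    # extracted from it, so treat it as a likely hallucination and drop it.
    return [c for c in citations if c in source_text]


# Invented source passage and stand-in model output (no real API is called).
SOURCE = (
    "As the Court held in Francovich, a Member State may be liable for "
    "failing to implement a directive. See also Article 288 TFEU."
)
MODEL_OUTPUT = '{"citations": ["Francovich", "Article 288 TFEU", "Case C-000/00"]}'

print(validate_citations(MODEL_OUTPUT, SOURCE))
```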

The Danger of the Quick Demo

Don't ignore the properties of domain-specific text.

A common engineering trap is sequencing your project backward. It's easy to build a proof-of-concept using a frontier LLM in three weeks, demo it to stakeholders, and then spend the next nine months trying to prompt-engineer away the model's fundamental structural limitations.

If you're validating a product concept over a short timeline, start with fine-tuning or a small, controlled LLM pipeline. But don't confuse a prototype with a production system.

When your application starts running at scale across a specialized corpus, you can't just solve the "domain-specific text" problem with raw compute. Build your architecture to solve the specific challenges presented by the linguistic properties of your domain. This requires substantial domain expertise.

If your language model doesn't understand the properties of your domain-specific text, prompt engineering won't save it in production.
