Making a Monolingual Model Bilingual with Domain Adaptation

You have an English BERT model that works well on legal text. But your corpus is bilingual. Here's how domain adaptation on a bilingual corpus can produce a model with strong masked language modeling performance in both languages — and why legal text makes this work better than you might expect.

Suppose you have an English TinyBERT model that you've fine-tuned on European Union (EU) legal documents. It's learned the vocabulary, citation conventions, and argumentative structure of English-language EU legal documents reasonably well. But your corpus is half French. You need the model to work in both languages.

What do you do?

The obvious solution is to start over with a multilingual model — mBERT or XLM-RoBERTa — and accept the performance tradeoff that comes with spreading capacity across 100 languages when you only need two. A less obvious solution, and often a better one for this specific problem, is to adapt the English model to French through continued pretraining on a bilingual domain corpus. You get a compact, domain-specialized model that handles both languages — and the reason it works has everything to do with the specific nature of legal language.

The standard approach and its costs

Multilingual models are the default answer to multilingual NLP problems. They're pretrained on text from dozens or hundreds of languages simultaneously, which gives them cross-lingual representations that transfer reasonably well across tasks. For general-purpose applications, this is a sensible choice.

For specialized domains, the tradeoffs are less favorable. A model pretrained on 100 languages has allocated its representational capacity across all of them, which means any single language — and certainly any specialized register within a single language — gets a smaller share of that capacity than a monolingual model of the same size would provide. mBERT knows a great deal about French. It knows less about French competition law, and the gap between its representations of "abus de position dominante" and "abuse of dominant position" reflects general multilingual pretraining rather than the doctrinal equivalence between them.

TinyBERT adds a second constraint. It's a distilled model — smaller and faster than the full BERT model — that's designed for deployment contexts where latency really matters. Multilingual TinyBERT models exist but are generally less capable than their English-only counterparts. If you want a small, fast, domain-specialized model that works in two languages, the multilingual model gets you less than you might hope.

Continued pretraining on a bilingual corpus

The alternative is to take the English TinyBERT model and continue pretraining it — using masked language modeling (randomly removing words and teaching the model to fill them back in) — on a bilingual corpus of English and French legal documents. The model was trained to predict masked tokens in English; it will now be trained to predict masked tokens in both English and French, on text that reflects the specific vocabulary and structure of the legal domain.
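To make the objective concrete, here is a minimal sketch of how masked language modeling examples are built. It assumes the selection scheme from the original BERT recipe: pick roughly 15% of positions, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged. The training setup described in this article may differ. Whitespace-split words stand in for WordPiece subwords, and `mask_tokens` and the toy `vocab` are hypothetical names used only for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked language modeling example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed at selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected, nothing to predict here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels

# Toy vocabulary and sentences, invented for illustration.
vocab = ["court", "cour", "directive", "principle", "principe"]
for sentence in (
    "the court shall apply the principle of proportionality",
    "la cour applique le principe de proportionnalité",
):
    inputs, labels = mask_tokens(sentence.split(), vocab, mask_prob=0.3,
                                 rng=random.Random(1))
    print(inputs, labels)
```

The same function handles both languages. Nothing about the objective changes when French is added; only the text being masked does.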

This is a form of domain adaptation. The model's existing English legal representations provide a starting point; the continued pretraining on French text teaches the model to extend those representations into the new language. Because the continued-pretraining data is domain-specific, the French representations it learns are legal French: not the general French of the internet, but the French of EU directives, Commission decisions, and ECJ judgments.

The practical requirements are modest by the standards of pretraining from scratch. A bilingual corpus of EU legal documents — the kind that is publicly available from EUR-Lex — provides sufficient coverage of both languages. The compute cost is a fraction of full pretraining. You can do it on a single cloud GPU instance in a few hours. The result is a model that retains its English legal representations while acquiring French legal representations trained on the same domain.
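One practical detail is how the two languages are presented during training. The sketch below interleaves English and French documents at a fixed ratio, so the model keeps seeing English while it learns French. The 50/50 default, the function name `mix_corpus`, and the choice to stop when either language runs out are all assumptions made for illustration, not details taken from a specific training run.

```python
import random

def mix_corpus(english_docs, french_docs, french_share=0.5, seed=0):
    """Yield (lang, text) pairs, picking French with probability french_share.

    Stops as soon as the chosen language has no documents left, so the
    tail of the stream is never monolingual.
    """
    rng = random.Random(seed)
    en, fr = iter(english_docs), iter(french_docs)
    while True:
        lang, source = ("fr", fr) if rng.random() < french_share else ("en", en)
        try:
            yield lang, next(source)
        except StopIteration:
            return

# Placeholder documents; a real run would stream text exported from EUR-Lex.
english = ["Article 1 of the Directive ...", "The Commission shall ..."]
french = ["Article premier de la directive ...", "La Commission ..."]
for lang, text in mix_corpus(english, french):
    print(lang, text)
```

Within each language the original document order is preserved; only the interleaving is randomized.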

Why legal text makes this work

The interesting question is why this approach works as well as it does: why a model adapted from English to French on legal text achieves masked language modeling performance in French that is comparable to its English performance. The answer lies in the specific linguistic relationship between English and French in the legal domain, and it illustrates why understanding the linguistic patterns of your domain matters.

Legal English is substantially French-derived. The Norman Conquest deposited an enormous French vocabulary into English legal usage — "contract," "plaintiff," "defendant," "jury," "evidence," "property" — and that vocabulary has remained largely intact ever since. The overlap is not merely etymological. Many of these terms retain similar or identical meanings in both languages in legal contexts, even where their general-language meanings have diverged.

EU legal text reinforces this overlap structurally. The EU produces all its legislation simultaneously in 24 languages, which means that English and French EU legal documents are translations of the same source texts, drafted by highly trained lawyer-linguists to be legally equivalent. The linguistic mapping between the languages is not approximate; it's designed by experts to be as precise as possible. "Proportionnalité" and "proportionality" mean exactly the same thing in this corpus. The same is true for the procedural vocabulary, institutional terminology, and doctrinal concepts that appear throughout EU law.

This creates favorable conditions for bilingual domain adaptation. The model's English legal representations are not starting from scratch when it encounters French legal text. A significant share of the vocabulary is cognate or identical. The structural patterns — citation formats, section organization, argumentative sequence — are parallel by design. The model is not learning a new language so much as learning a parallel encoding of concepts it already represents in English.
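A rough way to check how much vocabulary a language pair shares is to compare the spelling of equivalent terms. The sketch below uses accent-stripped character similarity from Python's `difflib` as a crude proxy for cognate status. The term pairs, the 0.7 threshold, and the function names are illustrative assumptions. A real check would use a term list extracted from your own corpus and would look at subword tokens, since that is what the model actually sees.

```python
from difflib import SequenceMatcher
import unicodedata

def normalize(term):
    """Lowercase and strip accents, e.g. 'Proportionnalité' -> 'proportionnalite'."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def cognate_share(pairs, threshold=0.7):
    """Fraction of (english, other) pairs whose similarity meets the threshold."""
    hits = sum(1 for en, other in pairs if similarity(en, other) >= threshold)
    return hits / len(pairs)

# Hand-picked legal terms for illustration, not drawn from a real corpus.
english_french = [
    ("proportionality", "proportionnalité"), ("contract", "contrat"),
    ("jury", "jury"), ("directive", "directive"),
    ("judgment", "jugement"), ("evidence", "preuve"),
]
english_finnish = [
    ("proportionality", "suhteellisuus"), ("contract", "sopimus"),
    ("jury", "valamiehistö"), ("directive", "direktiivi"),
    ("judgment", "tuomio"), ("evidence", "todiste"),
]
print("English-French share:", cognate_share(english_french))
print("English-Finnish share:", cognate_share(english_finnish))
```

Spelling similarity misses false friends and over-counts coincidental matches, so treat the number as a quick diagnostic for whether this kind of adaptation is worth trying, not as a measurement of semantic equivalence.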

The result, in practice, is masked language modeling performance in French that approaches English performance on domain-specific evaluation sets. That outcome would not generalize to, say, adapting an English model to Finnish legal text, where the linguistic overlap is minimal and the structural parallels, while real, are not reinforced by a shared vocabulary. Nor would it generalize to adapting a general English model to general French documents, even though the linguistic overlap between French and English is relatively high in non-technical contexts too.
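Comparing the two languages requires scoring them separately on held-out text. The sketch below aggregates masked-token accuracy and perplexity per language. This article doesn't say which metric sits behind "performance," so both are shown as common choices, and the record format and the name `mlm_metrics` are assumptions for illustration. The records would come from running the adapted model on masked evaluation sentences.

```python
import math
from collections import defaultdict

def mlm_metrics(records):
    """Aggregate masked-token results by language.

    Each record is (lang, gold_token, predicted_token, prob_of_gold), where
    prob_of_gold is the probability the model assigned to the correct token.
    Perplexity is exp(mean negative log-likelihood) over masked positions.
    """
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "nll": 0.0})
    for lang, gold, predicted, prob_of_gold in records:
        s = stats[lang]
        s["n"] += 1
        s["correct"] += int(gold == predicted)
        s["nll"] += -math.log(prob_of_gold)
    return {
        lang: {
            "accuracy": s["correct"] / s["n"],
            "perplexity": math.exp(s["nll"] / s["n"]),
        }
        for lang, s in stats.items()
    }

# Made-up records for illustration; real ones come from model predictions.
records = [
    ("en", "court", "court", 0.62), ("en", "directive", "regulation", 0.21),
    ("fr", "cour", "cour", 0.58), ("fr", "directive", "directive", 0.44),
]
for lang, metrics in mlm_metrics(records).items():
    print(lang, metrics)
```

Running the same evaluation on the English model before adaptation gives a baseline for checking that English performance was retained.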

When to use this

This approach to language adaptation is well-suited to organizations working with linguistically related languages in domains that have specific linguistic conventions, which describes a substantial share of legal tech applications covering EU law and international law. It also describes some financial and medical applications (other domains with specific linguistic conventions) that target international markets.

This approach produces a compact, fast, domain-specialized model that can be deployed without the infrastructure overhead of a large multilingual system.

It's not a general solution to multilingual NLP. The linguistic properties that make English-French legal adaptation work (the historical vocabulary sharing, the parallel translation corpus, the structural similarities) are not universal. Some other language pairs in other domains exhibit them, but many do not. Adapting the same model to Arabic or Chinese legal text would require a different approach entirely, and probably a multilingual foundation model rather than a domain-adapted monolingual one.

It also requires a reasonably large bilingual domain corpus to work well. The EU's public document database is sufficient for this purpose; an organization without access to parallel legal text at scale would face a data constraint.

Within those limits, this is a practical and underused technique for a general problem facing companies that work in fields with domain-specific linguistic conventions and serve international markets: you have a good English model, your corpus is bilingual, and you need to extend the model's capabilities without starting over.
