Want a Good Model? Start with a Good Measurement Strategy

Measurement strategy is the most consequential modeling decision in a supervised learning project. Treat labeling as a low-skill data-cleaning task and you'll build a model that learns the wrong thing. You can't gloss over measurement and validity and expect to build a good model.

Data Annotation is a Research Design Problem

You can't build a successful model without precise domain-driven definitions of your target concepts.

Annotating training data for a supervised learning model is a research design problem, not just a mechanical labeling task. Before you even start annotating data, you need to precisely define the concepts you're trying to measure. What are the concepts you care about? What are the observable linguistic indicators in your text corpus that capture variation in those concepts? This isn't an ML question. It's a theoretical, domain-specific question. Some people treat labeling as a low-skill data-cleaning task. It's not. It's a high-skill task that requires domain expertise.

Before you start annotating, you need to define these constructs with enough precision that your coders can resolve the edge cases consistently. For instance, if you're labeling the rhetorical roles of paragraphs in court decisions, and your labels include "legal interpretation" and "legal reasoning", you need to know what the precise differences are between those two concepts and what examples of each type of paragraph look like in your specific institutional context. This is domain knowledge — it's theory. It doesn't have anything to do with ML. But you can't train a valid ML model without doing this first. You can't skip it, and you can't just outsource it to an LLM.

Assessing Validity Isn't Optional

Before you code a single observation, you need to know what you want to measure.

Developing a valid measurement strategy for your labels is more important than what modeling architecture you use or how much data you have. If your measurement strategy is poor, you can't trust anything that's downstream of it.

To develop good labels, you need a measurement strategy, and you need to systematically evaluate validity. A good measure needs construct validity — observed variation in your measure (your choice of labels) needs to capture variation in the latent concept that you're interested in (the rhetorical role of a paragraph in a court decision). There are four pillars you should consider:

Face validity: Does the label make intuitive sense to a domain expert? If you want to identify which paragraphs in a court decision contain "legal interpretation", your labels need to align with what a legal scholar or experienced lawyer would say. If your rules feel arbitrary to an expert, your measurement strategy doesn't have face validity.
Content validity: Does your measurement strategy capture the entire concept, or just one aspect or dimension of the concept? If your coding rules say to classify paragraphs that abstractly discuss the content of statutes as examples of "legal interpretation", but they don't say what to do about paragraphs that abstractly discuss the content of case law, then you've missed a crucial dimension of the concept, and your measurement strategy doesn't have content validity.
Convergent validity: Do your labels correlate with proxies of the same concept? If your "legal interpretation" labels don't correlate with other valid proxies of your concept — like the presence of legal citations in a paragraph — then your labels aren't capturing the underlying signal, and your measurement strategy doesn't have convergent validity.
Discriminant validity: Can your measurement strategy successfully differentiate between two distinct concepts? If your coding rules don't guide a coder to consistently tell the difference between a paragraph that does "legal interpretation" and one that does "legal reasoning", as you've defined them, then your measurement strategy doesn't have discriminant validity.
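
As a rough illustration of the convergent validity check, here is a minimal Python sketch that correlates a binary "legal interpretation" label with a binary proxy (whether the paragraph contains a legal citation). The data, the variable names, and the choice of the phi coefficient (Pearson correlation for two binary variables) are assumptions made for this example, not a prescribed method. A weak correlation is a prompt to revisit the label definition or the proxy, not a verdict.

```python
from math import sqrt

def phi_coefficient(labels, proxy):
    """Pearson correlation between two binary (0/1) sequences of equal length."""
    if len(labels) != len(proxy):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(labels, proxy))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = len(pairs) - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when either variable is constant")
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical pilot sample: 1 = coded "legal interpretation", 0 = anything else.
is_interpretation = [1, 1, 1, 0, 0, 0, 1, 0]
# Hypothetical proxy: 1 = paragraph contains at least one legal citation.
has_citation = [1, 1, 0, 0, 0, 1, 1, 0]

phi = phi_coefficient(is_interpretation, has_citation)
print(f"phi = {phi:.2f}")
```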

Doing this correctly takes a lot of thought. And it requires domain knowledge: a strong theoretical understanding of the domain you're working in. Labels that seem conceptually clear at first glance sometimes turn out to be a lot more complicated once you dig into the text with a domain expert.

Conceptual Clarity is a Prerequisite for Coding

If your labels have conceptual overlap, your annotators (and models) will fail in practice.

To build a conceptually valid training set, your labels need to be precisely defined and conceptually distinct. This is where many data science projects go wrong: ML teams ask annotators to choose between labels that are not sufficiently specified.

If you can't draw a conceptually clear line between "legal reasoning" (applying a law to case facts) and "legal interpretation" (interpreting a law in the abstract without applying it to case facts), human annotators — or LLMs if you're doing few-shot coding — won't be able to draw it either. You need to figure this out before writing your coding rules. The goal of a measurement strategy is to eliminate ambiguity at the conceptual level so your coding rules can provide clear guidelines for handling complex, subjective human behaviors.

This is an iterative process. You don't get to just think about it once and move on. You need to (a) do multiple rounds of coding, (b) conduct inter-coder reliability tests after each round to see whether different coders consistently reach the same conclusions, and (c) revise your coding rules based on what each test reveals. You want to make any adjustments to your coding rules before you invest the time and money to code your full training set.
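
To make the reliability test in step (b) concrete, here is a minimal sketch of Cohen's Kappa for two coders, written with the Python standard library only. The label names and the pilot codings are invented for illustration. The resulting number is something to track from one coding round to the next as you revise your rules, not a target to hit.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical label for every item
    return (observed - expected) / (1 - expected)

# Hypothetical pilot round: ten paragraphs coded independently by two coders.
I, R, O = "interpretation", "reasoning", "other"
coder_a = [I, I, R, R, O, O, I, R, O, I]
coder_b = [I, R, R, R, O, O, I, I, O, I]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```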

Inter-Coder Disagreements are High-Signal

Persistent coding disagreements aren't noise that you can ignore; they're structural flaws in your measurement strategy.

When your coders are consistently reaching different conclusions about how a subset of observations should be coded, don't just resolve the individual discrepancies and move on. If they're just one-off disagreements, that's one thing. But if there's a pattern to the disagreements, then you need to probe what's going on — you need to check if there's an underlying problem with how you're measuring and operationalizing the concepts that your labels are supposed to be capturing.

Don't just paper over inter-coder reliability problems. Inter-coder reliability tests aren't just about reconciling individual codings; they're about identifying structural flaws in your measurement strategy. Don't focus on surpassing a specific threshold on a specific inter-coder reliability metric (e.g., Cohen's Kappa or Krippendorff's Alpha). That's arbitrary, and it misses the point. What matters is that you use inter-coder reliability tests to surface theoretical problems with your measurement strategy and correct them.

Coding disagreements usually point to one of two things: either your labels lack conceptual clarity (there's too much conceptual overlap), or your coding rules are underspecified (coders are implicitly using different, competing assumptions to resolve edge cases). You need to know which of these things is happening so you can make the right adjustments to your measurement strategy.
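
One simple way to look for a pattern in the disagreements is to tally which pairs of labels your coders split on. The sketch below does that in Python; the labels and codings are hypothetical. If most disagreements land on a single pair of labels, that pair is the first place to look for conceptual overlap or an underspecified rule. The tally won't tell you which of the two problems you have, and reading the disputed paragraphs with your coders is still the real diagnostic.

```python
from collections import Counter

def disagreement_pairs(coder_a, coder_b):
    """Count the unordered label pairs on which two coders disagreed."""
    if len(coder_a) != len(coder_b):
        raise ValueError("label lists must have the same length")
    pairs = Counter()
    for a, b in zip(coder_a, coder_b):
        if a != b:
            # Sort so (x, y) and (y, x) count as the same confusion.
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

# Hypothetical pilot round with four labels.
coder_a = ["interpretation", "interpretation", "reasoning", "facts",
           "reasoning", "interpretation", "facts", "reasoning"]
coder_b = ["reasoning", "interpretation", "interpretation", "facts",
           "reasoning", "reasoning", "procedure", "interpretation"]

pairs = disagreement_pairs(coder_a, coder_b)
for (label_x, label_y), count in pairs.most_common():
    print(f"{label_x} vs {label_y}: {count}")
```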

Labeling Errors Preview Production Failures

Catching measurement errors during the annotation phase is cheaper than catching them after deployment; spend the time to do it right.

Inter-coder disagreements are high-signal: they give you a direct, early preview of the situations in which your model will fail in production. If two human experts look at a paragraph of a legal document and can't agree on whether it's "legal interpretation" or "legal reasoning" because the decision boundary is fuzzy, you can't expect your model to cleanly apply those labels — if you're using an LLM, it could hallucinate or flip-flop between labels arbitrarily; if you're using a classifier, it could deliver low-confidence or poorly calibrated predictions.

It's better (cheaper) to discover the blind spots of your measurement strategy before you start training than after you deploy a model to production.

LLMs Won't Save a Bad Measurement Strategy

Automated labeling scales your measurement errors; it doesn't fix them.

You need to run diagnostics on your data-coding pipeline early on in the process of building your training set. That's especially true if you're using an LLM for zero-shot or few-shot labeling. ML teams often turn to LLMs to avoid the friction and cost of human annotation, but LLMs aren't a workaround for poor research design.

If human coders are struggling to agree on training labels because your operationalization of your key concepts is muddy, an LLM is probably going to face the exact same problem. If two labels aren't conceptually distinct, the LLM will confuse them just as easily as your human coders, introducing a systematic source of measurement error into your training data. Using an LLM to do few-shot coding will just replicate human confusion — but silently and at scale.
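
One way to make that confusion visible is to audit the LLM's labels against a small human-coded pilot sample, label by label. The sketch below assumes you already have both sets of labels as plain Python lists; the data and names are hypothetical, and no particular LLM API is involved. For each human-assigned label, it reports the share of items where the LLM chose the same label.

```python
from collections import defaultdict

def per_label_agreement(human_labels, llm_labels):
    """For each human-assigned label, the share of items the LLM labeled the same way."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    matches = defaultdict(int)
    for human, llm in zip(human_labels, llm_labels):
        totals[human] += 1
        if human == llm:
            matches[human] += 1
    return {label: matches[label] / totals[label] for label in totals}

# Hypothetical pilot sample: human-coded labels vs. LLM-assigned labels.
I, R, F = "interpretation", "reasoning", "facts"
human = [I, I, I, I, R, R, R, R, F, F]
llm = [I, R, I, R, R, I, R, R, F, F]

agreement = per_label_agreement(human, llm)
for label, share in agreement.items():
    print(f"{label}: {share:.2f}")
```

If agreement is low on the same labels your human coders struggled with, the problem is more likely in the label definitions than in the prompt.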

Clear coding rules are a non-negotiable prerequisite for high-quality training labels, regardless of whether a human or a machine is doing the coding.

Better Labels, Better Models

A sophisticated model trained on labels coded using a bad measurement strategy is a bad model.

If you want to build robust, trustworthy models, don't treat your training labels as objective "ground truth". They're the product of a measurement strategy that's making implicit assumptions that you need to validate. They're a root source of model error.

The solution isn't more data or a better model. It's better-specified labels. It's a better measurement strategy. When you get this right, you don't need as many training examples because you're reducing the noise-to-signal ratio in the data.

No matter what your substantive application is, if you don't have good theory, you won't be able to design a good measurement strategy; if you don't have a good measurement strategy, you won't be able to train your coders (human or otherwise) to accurately code labels; and if you don't have good training labels, your model won't perform well in production.

You can't skip measurement. And that means you can't skip theory.
