Can You Tell If Something Was Written by an LLM?

The "em-dash debate" misses the point entirely. Detecting LLM-generated text is a challenging classification problem, and the "folk methods" people use to do it are somewhere between unreliable and useless.

Influencers on LinkedIn are always rediscovering that ChatGPT frequently uses em-dashes and proclaiming that this is how you can tell if text was written by an LLM. It always goes the same way. The post gets shared widely. People start scrutinizing each other's writing for em-dashes. Someone points out that good writers have always used em-dashes. (They're correct.) Someone else points out that the original post was probably LLM-generated. (They're also probably correct.) The discourse moves on, and then repeats, having produced no insights.

It is — in a word — insufferable.

The "em-dash debate" (it's not always em-dashes — but it often is) is a symptom of people trying to address a difficult classification problem with "folk methods" that don't actually work. Whether or not we can tell that text was generated by an LLM is an important methodological question — with real stakes when it comes to academic and professional integrity. The answer is, of course, considerably more complicated than monitoring punctuation use.

(And the influencers often know that. But they don't care — they got your impression.)

Why surface features don't work

The intuition behind feature-based detection is straightforward: LLMs have stylistic tendencies, those tendencies produce detectable patterns, and detecting the patterns tells you whether an LLM generated the text. Em-dashes, hedging phrases, certain sentence structures, a high density of transitional language: all of these have been proposed as signals.

(It turns out that good writing tends to exhibit certain features, and LLMs are quite good at learning them.)

The problem is that this is a complex classification task and the signal is a moving target.

LLM outputs aren't drawn from a fixed distribution. They vary substantially by model, by prompt, by temperature setting, and by any post-generation editing done by the user. (Always edit your LLM-generated content. That way, at minimum, you know what you said.) A stylistic pattern that is a reasonable diagnostic for ChatGPT responses to general prompts is not necessarily a reasonable diagnostic for ChatGPT responses to domain-specific prompts, or for Claude, or for Gemini, or for any of the many other models that will be released after the cut-off for whatever training data the detector was built on. Detectors trained on one model's outputs generalize poorly to others and degrade quickly as models are updated.

Human writing doesn't have a fixed distribution, either. (That's generally true of anything produced by humans.) Stylistic tastes drift over time and across contexts. Em-dashes are commonly used by some writers and rarely used by others. Hedging language is characteristic of academic writing, regardless of who wrote it. Certain sentence structures appear frequently in legal writing, in financial analysis, or in technical documentation, not because those texts were LLM-generated but because those registers have their own conventions. A detector trained on general text will systematically produce false positives on domain-specific professional writing, which may happen to look more like LLM output on surface features than casual prose does.

LLM-detectors are unreliable in exactly the situations where reliability matters most.

What the research shows

Automated LLM detection is the subject of a lot of research, and the findings are not encouraging for anyone who wants a reliable detector. The best-performing detectors (watermarking approaches aside) achieve accuracy that can look reasonable on held-out samples from the training distribution but degrades substantially under distribution shift, paraphrasing, and cross-model evaluation.

Watermarking is a promising technical solution: some model providers can embed statistical signals in generated text that are detectable by someone who knows the key. The limitation is that watermarking requires the cooperation of the model provider, is absent from responses that have been substantially edited, and provides no signal for text generated by models that don't implement it — which includes every open-weight model and every API that doesn't participate. (So, nearly all of them.)
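To make "statistical signals detectable by someone who knows the key" concrete, here is a toy sketch of one family of watermarking schemes: a keyed hash splits tokens into "green" and "red" depending on the previous token, the generator favors green tokens, and the detector counts them. Everything here is an assumption made for illustration (the function names, the 50/50 split, the made-up vocabulary, and the stand-in "generator"); it is not any particular provider's implementation.

```python
import hashlib
import math
import random

def is_green(key: str, prev_token: str, token: str) -> bool:
    """Keyed pseudo-random split of tokens into 'green' and 'red', given the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return digest[0] % 2 == 0  # about half of all tokens are green in any context

def watermark_z_score(tokens: list[str], key: str) -> float:
    """z-score of the green-token count against the 50% expected without a watermark."""
    n = len(tokens) - 1  # number of (previous, current) token pairs
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(key, prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (green - 0.5 * n) / math.sqrt(0.25 * n)

def toy_watermarked_text(vocab: list[str], key: str, length: int, seed: int = 0) -> list[str]:
    """Hypothetical stand-in for a watermarking generator: pick a green token when one exists."""
    rng = random.Random(seed)
    tokens = [vocab[0]]
    for _ in range(length):
        green = [t for t in vocab if is_green(key, tokens[-1], t)]
        tokens.append(rng.choice(green) if green else rng.choice(vocab))
    return tokens

vocab = [f"w{i}" for i in range(50)]  # made-up vocabulary
marked = toy_watermarked_text(vocab, key="secret", length=40)
print(watermark_z_score(marked, key="secret"))  # large: nearly every token is green
print(watermark_z_score(marked, key="wrong"))   # no systematic excess without the right key
```

The sketch also shows why editing hurts: every token a human swaps out has only about a 50% chance of being green, so the z-score decays toward zero as the edits pile up.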

Zero-shot detection methods — using a language model to evaluate the likelihood of a text under its own distribution, on the theory that LLM-generated text will have higher likelihood than human-written text — can work under controlled conditions but usually fail under realistic ones. Perplexity-based methods are sensitive to domain, register, and writing quality; they produce unacceptable false positive rates on professional writing.
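A rough sketch of the perplexity idea, with a toy unigram model (add-one smoothing) standing in for a real language model. The reference corpus, the example sentences, and the function names are all invented for illustration; a real detector would score text with an actual LLM, but the failure mode is the same.

```python
import math
from collections import Counter

def train_unigram(corpus: str) -> tuple[Counter, int]:
    """Count whitespace-separated tokens in a reference corpus (the 'scoring model')."""
    tokens = corpus.lower().split()
    return Counter(tokens), len(tokens)

def perplexity(text: str, counts: Counter, total: int) -> float:
    """Perplexity of `text` under the unigram model with add-one smoothing."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    log_prob = 0.0
    for tok in tokens:
        log_prob += math.log((counts[tok] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(tokens))

# Hypothetical reference text the scoring model was "trained" on.
counts, total = train_unigram(
    "the model generates the text and the detector scores the text"
)

familiar = perplexity("the detector scores the text", counts, total)
unfamiliar = perplexity("quantum gravel sings sideways", counts, total)
print(familiar < unfamiliar)  # familiar-register text gets the lower perplexity
```

Note what the score actually measures: how closely the text resembles what the scoring model has seen, not who wrote it. A human writing in a conventional register scores as "likely", which is exactly how the false positives on professional writing arise.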

The upshot is that reliable, general-purpose LLM detection — accurate across models, robust to editing, applicable without ground truth — doesn't exist. Researchers may develop new methods that make this problem more tractable. But don't count on it.

The base rate problem

There's a further issue that the em-dash discourse doesn't usually address: base rates.

Suppose a detector achieves 90% accuracy — 90% true positive rate on LLM-generated text and 90% true negative rate on human-written text. That sounds useful. Now suppose the actual prevalence of LLM-generated text in the population you are examining is 10%. Running the detector on 1,000 texts, you have 100 that are LLM-generated and 900 that are not. The detector correctly identifies 90 of the 100 LLM texts and incorrectly flags 90 of the 900 human texts. You get 180 positives, half of which are false. The precision of the detector — the probability that a flagged text is actually LLM-generated — is 50%. (That's not good.)

A detector with 90% accuracy, applied in a context where LLM use is not the majority behavior, is a coin flip on positive predictions. For a use case like detecting plagiarism — where the consequence of a false positive is accusing someone of cheating who didn't — this isn't a useful tool. The base rate problem is not a fixable calibration issue. It's a structural property of using classifiers in low-prevalence contexts.
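The arithmetic above is just Bayes' rule, and it is worth seeing how quickly precision moves with prevalence. A minimal sketch (the function name and the extra prevalence values are chosen here for illustration; the 90%/90%/10% case is the one worked through above):

```python
def precision(tpr: float, tnr: float, prevalence: float) -> float:
    """Probability that a flagged text is actually LLM-generated (Bayes' rule)."""
    true_pos = tpr * prevalence                    # LLM texts correctly flagged
    false_pos = (1.0 - tnr) * (1.0 - prevalence)   # human texts wrongly flagged
    return true_pos / (true_pos + false_pos)

for prevalence in (0.01, 0.10, 0.50):
    p = precision(tpr=0.90, tnr=0.90, prevalence=prevalence)
    print(f"prevalence={prevalence:.0%}  precision={p:.1%}")
```

At 10% prevalence this reproduces the 50% figure. At 1% prevalence, roughly 11 of every 12 flags are false accusations; the detector only earns its "90%" when about half the texts are LLM-generated to begin with.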

(Side note: If you're an educator, never rely on tools that claim to detect LLM use. You can't trust them, and using them can do real damage. Also, your students are using LLMs. You're going to have to come to terms with that — and change your approach to assignments and exams.)

What you can actually do

None of this means the question is unanswerable in every context. It just means that reliable detection requires more than pattern-matching on surface features.

Behavioral evidence — comparing a person's submitted work against earlier samples from the same person, examining consistency of voice and knowledge across a person's body of work, looking for discontinuities in a person's writing quality — is far more informative than any single-document classifier. It's also more labor-intensive and requires baseline data that often isn't available. And it requires you to have good judgment. (Which you may have. Or, which you may think you have.)

For high-stakes applications, like assessments, a better approach is to change the incentive structure rather than try to improve detection. Assessments that require real-time demonstration of understanding, that are personalized enough that LLMs are not useful, or that involve process documentation alongside final output are more robust to LLM use than detection-after-the-fact methods. You'll have to adapt your approach to assessments to the new LLM reality — and that can be a lot of work. You also need to think hard about fairness and bias. (For example, many people are bad at oral exams. That doesn't mean they're not competent.)

The broader point

The em-dash debate is frustrating not because it's wrong about em-dashes — although it is wrong about em-dashes — but because it reflects a broader tendency to treat a hard classification problem as though it were a pattern-recognition exercise that anyone can do by eye. It's the same kind of error that leads people to make over-confident claims about forged documents, fabricated images, and manipulated audio based on the visual identification of artifacts that turn out to have innocent explanations.

LLM detection is a serious technical problem without a clear answer: it's partially solvable in controlled conditions, but not reliably solvable in general. The "folk methods" that circulate on social media don't solve it. Ignore them. In situations where real consequences follow, turning to these methods is far worse than simply acknowledging that this is a hard problem.

Writing is hard, and LLMs make it easier. People are going to use them (even good writers): the efficiency gains are too great to ignore. Standards of originality will have to adapt to a reality in which LLMs, which are trained on other people's work (often without permission or compensation, which is another problem), are doing much of the actual writing. What "originality" means is going to change. Good ideas and good taste will be the differentiators. Still, the aphorism that "writing is thinking" is often true, and the danger with cutting too many corners is that the quality of your ideas will suffer. But that's your problem, not the LLM's.

How something was written — and how much credit you get to claim for "writing" it — will matter less than what it says. That, of course, was always the true standard of good writing, anyway: Does it say something worth reading?
