Practical Methods for Uncertainty-Aware Automatic Coding with ML and xAI
Institut national de la statistique et des études économiques (INSEE)
2026-05-12
The setting
A deep learning model automatically codes job descriptions into ISCO-08 occupations when its confidence is high enough; otherwise the record is sent for manual annotation.
The reality — at INSEE’s scale
| Scale | Consequence |
|---|---|
| 100 000 records coded per year | 5 200 wrong codes |
| All sent automatically, unreviewed | all 5 200 published |
| A statistician detects the pattern | months later |
The model does not know when it is wrong. A confident wrong prediction is the worst failure mode in official statistics.
Model A — confused but honest
Predicts Nurse with probability 0.52 when the true code is Midwife.
→ A threshold of 0.7 catches this. → The record goes to a human reviewer. → No damage done.
Model B — confident and wrong
Predicts Software developer with probability 0.97 when the true code is Data entry clerk.
→ No threshold catches this. → Published automatically. → Systematic bias in employment statistics.
Accuracy measures how often we are right. It says nothing about when we know we are right. That gap is what we need to close.
A classifier outputs P(class | text). We want this number to mean something:
Among all predictions made with confidence 80 %, exactly 80 % should be correct.
This property is called calibration. A well-calibrated model turns confidence into a reliable operational signal.
Calibration ≠ Accuracy. A model can be highly accurate yet badly miscalibrated — and vice-versa. Modern deep learning tends to be overconfident.
The Expected Calibration Error (ECE) condenses this into a single diagnostic number: a well-calibrated model has an ECE close to 0. Post-hoc techniques (temperature scaling, isotonic regression) can correct miscalibration without retraining.
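As a concrete reference, here is a minimal NumPy sketch of binned ECE; the array names (`confidences`, `correct`) and the 15-bin default are illustrative choices, not prescribed ones.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # bin weight = share of predictions in bin
    return ece

# On a held-out set: expected_calibration_error(val_conf, val_pred == val_true)
```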
For PyTorch models, the torch-uncertainty package offers several ready-to-use modules for this.
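To make the mechanics visible, temperature scaling itself fits in a few lines of plain PyTorch. This is a from-scratch sketch rather than the torch-uncertainty API; `val_logits` and `val_labels` are assumed to be held-out validation tensors.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit one scalar T > 0 on frozen validation logits by minimising the NLL."""
    log_t = torch.zeros(1, requires_grad=True)          # T = exp(log_t), so T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# T = fit_temperature(val_logits, val_labels)
# calibrated = torch.softmax(test_logits / T, dim=-1)
```

Dividing logits by T leaves the argmax (and hence accuracy) untouched; only the confidence values change.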
Set a confidence threshold τ: above τ → automate · below τ → human review
Coverage = share of records automated
Risk = error rate among automated records
These two quantities move in opposite directions as τ varies.
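In symbols, with \(\hat p_i\) the confidence and \(\hat y_i\) the predicted code for record \(i\) out of \(N\):

\[
\mathrm{Coverage}(\tau) = \frac{\lvert\{\, i : \hat p_i \ge \tau \,\}\rvert}{N},
\qquad
\mathrm{Risk}(\tau) = \frac{\lvert\{\, i : \hat p_i \ge \tau \text{ and } \hat y_i \ne y_i \,\}\rvert}{\lvert\{\, i : \hat p_i \ge \tau \,\}\rvert}.
\]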
The risk–coverage curve answers: “If we accept a 1 % error rate, what share of our volume can we automate?” — a question accuracy cannot answer.
A model that dominates another sits below and to the right on the curve: lower risk at the same coverage, or equivalently higher coverage at the same risk.
Key operational question: where is your institutional risk tolerance?
Practical workflow: train → calibrate → plot risk-coverage → choose τ consistent with your quality constraints → monitor stability after deployment.
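A sketch of the "plot risk-coverage, choose τ" step, assuming arrays of validation confidences and correctness flags (all names illustrative):

```python
import numpy as np

def risk_coverage_curve(confidences, correct):
    """One (coverage, risk) point per cut-off, sweeping τ from high to low."""
    order = np.argsort(-np.asarray(confidences))        # most confident first
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n                  # share automated at each cut
    risk = np.cumsum(errors) / np.arange(1, n + 1)      # error rate among automated
    return coverage, risk, np.asarray(confidences)[order]

def pick_threshold(confidences, correct, max_risk=0.01):
    """Largest coverage whose empirical risk stays within the tolerance."""
    coverage, risk, thresholds = risk_coverage_curve(confidences, correct)
    feasible = np.where(risk <= max_risk)[0]
    return None if len(feasible) == 0 else thresholds[feasible[-1]]
```

Because the curve is estimated on finite data, τ should be chosen on a held-out set with calibrated confidences, then re-checked in production, in line with the monitoring step above.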
ISCO, NACE, CPA — all share the same structure: a tree.
Standard approach: one softmax head at the finest level.
The problem: when the model is uncertain at level 4, it often already knows the right level-2 code. Forcing a fine-grained wrong answer wastes that signal.
Insight: uncertainty should trigger a graceful degradation to a coarser but correct code — not a confident wrong fine code.
Architecture
One shared encoder → One prediction head per level of the hierarchy
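A minimal sketch of that shape, assuming any encoder module that maps a batch of texts to fixed-size embeddings; the default level sizes are the standard ISCO-08 group counts (10 / 43 / 130 / 436).

```python
import torch.nn as nn

class HierarchicalCoder(nn.Module):
    """One shared encoder, one linear classification head per ISCO level (sketch)."""
    def __init__(self, encoder, d_embed, level_sizes=(10, 43, 130, 436)):
        super().__init__()
        self.encoder = encoder                     # text -> (batch, d_embed)
        self.heads = nn.ModuleList(nn.Linear(d_embed, n) for n in level_sizes)

    def forward(self, inputs):
        h = self.encoder(inputs)
        return [head(h) for head in self.heads]    # one logit vector per level
```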
At inference time:
→ Each head predicts a code at its level, with its own confidence. → The record is coded at the finest level whose confidence clears the threshold. → If no level is confident enough, the record goes to manual review.
Benefits
The fallback mechanism transforms uncertainty from an obstacle into a feature: we always give the most precise answer we can trust.
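A minimal sketch of the fallback rule for a single record, reusing the per-level heads above; the per-level thresholds are assumed to come from the risk-coverage analysis.

```python
import torch

@torch.no_grad()
def predict_with_fallback(heads, embedding, thresholds):
    """Walk the hierarchy coarse -> fine; keep the finest level that is trusted.

    Returns (level, code), or None to route the record to manual review.
    """
    decision = None
    for level, (head, tau) in enumerate(zip(heads, thresholds)):
        probs = torch.softmax(head(embedding), dim=-1)
        confidence, code = probs.max(dim=-1)
        if confidence.item() < tau:
            break                                  # uncertain here: stop refining
        decision = (level, code.item())            # trusted: try one level finer
    return decision
```

This variant only refines when every coarser level is also confident, which keeps the emitted code consistent with its parents; scoring fine-to-coarse and backing off is an equally valid reading of the fallback.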
Label attention: each ISCO code gets its own embedding vector (learned matrix of size \(n_{labels} \times d_{embed}\)). A cross-attention layer uses label embeddings as queries and token embeddings as keys/values, producing one label-conditioned sequence embedding per ISCO code, each scored by a linear probe.
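A PyTorch sketch of that head (class and parameter names are illustrative; a single shared probe is used here for brevity, with per-label probes as a common variant):

```python
import torch
import torch.nn as nn

class LabelAttentionHead(nn.Module):
    """Cross-attention with one learned query per ISCO code."""
    def __init__(self, n_labels, d_embed, n_heads=8):
        super().__init__()
        # d_embed must be divisible by n_heads
        self.label_embeddings = nn.Parameter(torch.randn(n_labels, d_embed))
        self.cross_attention = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.probe = nn.Linear(d_embed, 1)

    def forward(self, token_embeddings, key_padding_mask=None):
        # token_embeddings: (batch, seq_len, d_embed) from the shared encoder
        batch = token_embeddings.size(0)
        queries = self.label_embeddings.unsqueeze(0).expand(batch, -1, -1)
        # one label-conditioned summary of the sequence per ISCO code
        context, _ = self.cross_attention(
            queries, token_embeddings, token_embeddings,
            key_padding_mask=key_padding_mask,
        )
        return self.probe(context).squeeze(-1)     # (batch, n_labels) logits
```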
Why it helps
The attention weights tie each candidate code to the tokens that support it, giving the human reviewer a ready-made explanation alongside the prediction.
Pillar I — Right metrics
Pillar II — Right architecture
The combination: an architecture that knows what it knows, metrics that reveal when it doesn’t, and curves that let managers set the acceptable risk — before going live, not after the damage is done.
Do not deploy on accuracy alone.
| Question | Tool |
|---|---|
| Are my probabilities meaningful? | Reliability diagram + ECE |
| Can I fix miscalibration cheaply? | Temperature scaling / isotonic regression |
| What automation rate can I afford? | Risk-coverage curve at your risk tolerance |
| How to handle hierarchical codes? | Multi-level heads + confidence fallback |
| How to explain a prediction to a reviewer? | Label attention + Layer Integrated Gradients |
| How to monitor after deployment? | Track ECE drift + risk-coverage stability |
The goal is to automate confidently what the model knows, and flag honestly what it does not.
