NLP-inspired Machine Learning for Longitudinal Categorical Data
Insee
2026-05-12
Many administrative registers track millions of individual trajectories, in the form of categorical panel data:
**High cardinality**: thousands of event types → transition matrices become intractable, feature spaces explode.
**Long-range dependence**: an event at step 1 shapes outcomes at step 10; Markov models cannot capture this.
**Irregular timing**: events happen at individual-driven intervals; fixed-period panel models are not designed for this.
Each challenge has been addressed in isolation (D’Haultfoeuille and Iaria 2016; Rust 1988). No existing framework handles all three jointly. Language model architectures can.
A sentence is a sequence of tokens drawn from a large vocabulary.
Deep-learning language models learn directly from data:
| NLP concept | Panel data equivalent |
|---|---|
| Word / Token | Event type (course, medical code…) |
| Sentence | Individual trajectory |
| Vocabulary | Set of all possible event types |
| Next-word prediction | Next-event prediction |
| Sentence embedding | Trajectory summary vector |
Every event type is mapped to a dense, learnable vector.
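A minimal sketch in PyTorch of this mapping: event codes (hypothetical here) become integer ids, and an `nn.Embedding` layer turns each id into a dense, learnable vector. Names and dimensions are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of event types (in practice: thousands of codes)
vocab = {"<pad>": 0, "ENROL_MATH_L1": 1, "ENROL_PHYS_L2": 2, "ATC_N02BE01": 3}

# Each event type gets a dense, learnable vector (here 8-dimensional)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8, padding_idx=0)

# One individual trajectory, encoded as a sequence of event ids
trajectory = torch.tensor([[1, 2, 3]])   # shape (batch=1, seq_len=3)
vectors = embedding(trajectory)          # shape (1, 3, 8), updated during training
print(vectors.shape)
```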
We want to get \(h_c\), a vector summarizing the past up to position \(c\).
RNNs (GRU / LSTM) — rolling memory (Schmidt 2019)
✓ Lightweight · Easy to train
✗ Sequential processing is slow, and information vanishes over long sequences
Transformers — attention over full history (Vaswani et al. 2023)
✓ Perfect long-range memory · Full parallelisation
✗ Needs more data · Slower on very short sequences
Key concept
\(h_c\) is the key component of those models: a dense vector with a controlled dimensionality that summarizes all the information from the past.
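A minimal sketch, assuming a GRU encoder with illustrative dimensions, of how \(h_c\) can be read off as the hidden state after the first \(c\) events:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)

events = torch.randint(0, vocab_size, (1, 10))   # one trajectory of 10 events
outputs, _ = gru(embedding(events))              # outputs: (1, 10, hidden_dim)

c = 5
h_c = outputs[:, c - 1, :]   # dense summary of the first c events, shape (1, hidden_dim)
```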
Language models are pre-trained on the Next-Token Prediction task — predicting the next event given all previous events.
Why this works so well:
To have time-aware models, we add a Time-to-Next-Event task (see (Shmatko et al. 2025)).
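A hedged sketch of this pre-training setup, assuming a GRU backbone with two heads: a softmax over the vocabulary for next-event prediction and a regression head on the log time gap to the next event. The class name `TrajectoryLM`, the dimensions, and the squared-error time loss are illustrative choices, not the exact specification of (Shmatko et al. 2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryLM(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.next_event_head = nn.Linear(hidden_dim, vocab_size)  # next-token prediction
        self.time_head = nn.Linear(hidden_dim, 1)                 # time-to-next-event

    def forward(self, events):
        h, _ = self.encoder(self.embedding(events))               # (B, T, hidden_dim)
        return self.next_event_head(h), self.time_head(h).squeeze(-1)

model = TrajectoryLM()
events = torch.randint(0, 1000, (4, 20))    # batch of 4 trajectories
log_gaps = torch.rand(4, 20)                # log time gaps to the next event (placeholder)

logits, pred_gaps = model(events)
# Predict position t+1 (and its time gap) from the state at position t
ce = F.cross_entropy(logits[:, :-1].reshape(-1, 1000), events[:, 1:].reshape(-1))
mse = F.mse_loss(pred_gaps[:, :-1], log_gaps[:, :-1])
loss = ce + mse          # the two losses can also be weighted
loss.backward()
```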
Data
At this scale and vocabulary size (940 classes), standard regression, ensemble, and boosting methods were not trainable. Sequence models are the only viable option.
| Model | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | Top-1 Acc. excl. 1st enrol. pred. | Top-3 Acc. excl. 1st enrol. pred. | Top-5 Acc. excl. 1st enrol. pred. |
|---|---|---|---|---|---|---|
| Bigram model | N.A. | N.A. | N.A. | 62.34 % | 79.65 % | 83.86 % |
| RNN (GRU) | 54.01 % | 71.24 % | 77.10 % | 66.65 % | 82.78 % | 86.56 % |
Inertia dominates. Most predictive signal comes from the previous enrollment alone. The GRU’s marginal accuracy gain is modest.
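That inertia is what the bigram baseline in the table exploits. A minimal sketch, assuming the baseline simply ranks candidate next events by their empirical frequency given the previous event (hence the N.A. columns: it has nothing to condition on before the first enrolment); the actual baseline may differ in detail.

```python
from collections import Counter, defaultdict

def fit_bigram(trajectories):
    """Count successors of each event across all trajectories."""
    successors = defaultdict(Counter)
    for traj in trajectories:
        for prev, nxt in zip(traj, traj[1:]):
            successors[prev][nxt] += 1
    return successors

def predict_top_k(successors, prev_event, k=3):
    """Rank candidate next events by empirical frequency after prev_event."""
    return [event for event, _ in successors[prev_event].most_common(k)]

# Hypothetical enrolment trajectories
trajectories = [["L1_MATH", "L2_MATH", "L3_MATH"], ["L1_MATH", "L2_PHYS"]]
model = fit_bigram(trajectories)
print(predict_top_k(model, "L1_MATH"))   # e.g. ['L2_MATH', 'L2_PHYS']
```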
Why the model still matters:
Data — SNDS (French National Health Data System): 70 million patients covered, 8 billion medical events, 7,204-token vocabulary:
| Event family | Tokens |
|---|---|
| Specialty consultations | 83 |
| Clinical procedures (CCAM) | 936 |
| Medications (ATC) | 704 |
| Hospital diagnoses (ICD-10) | 3,275 |
| Admissions, emergencies, deaths | — |
Model
The structured geometry of this embedding space emerges without any supervision: it is entirely a product of learning to predict the next event in 8 billion medical records.
Task (unseen during training): given a random point in a patient’s trajectory, predict whether a specific event occurs within 30 / 180 / 365 days.
Three approaches compared
Evaluated on 1 million held-out patients.
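A hedged sketch of the "LogReg emb." approach under this evaluation setting: frozen trajectory vectors from the pre-trained model serve as features for a per-event logistic regression, scored with Average Precision. The feature matrices below are random placeholders; in practice they would be the model's summary vectors \(h_c\) at the cut point, and the label marks whether the target event occurs within the horizon.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

# Placeholder data: h_* stand in for frozen trajectory embeddings,
# y_* for "target event (e.g. dialysis) occurs within 365 days".
rng = np.random.default_rng(0)
h_train, y_train = rng.normal(size=(10_000, 128)), rng.integers(0, 2, 10_000)
h_test, y_test = rng.normal(size=(2_000, 128)), rng.integers(0, 2, 2_000)

clf = LogisticRegression(max_iter=1000).fit(h_train, y_train)
scores = clf.predict_proba(h_test)[:, 1]
print(average_precision_score(y_test, scores))   # Average Precision, as in the table below
```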
Average Precision Score — 365-day horizon
| Event | Baseline | Zero-shot | LogReg emb. |
|---|---|---|---|
| Dialysis | 0.004 | 0.82 | 0.85 |
| Chemotherapy | 0.01 | 0.56 | 0.64 |
| Death | 0.20 | 0.46 | 0.63 |
| GP visit | 0.70 | 0.83 | 0.90 |
| Nurse visit | 0.46 | 0.53 | 0.68 |
| Hospitalization | 0.24 | 0.35 | 0.47 |
| Emergency | 0.21 | 0.25 | 0.35 |
| Heart failure | 0.04 | 0.09 | 0.12 |
Zero-shot rivals the supervised approach (100k labels) for most events. History-driven events — dialysis (0.82), chemotherapy (0.56) — show the largest gains from a near-zero baseline. Unpredictable events (emergency, heart failure) remain hard but still improve.
What we have shown
✓ Sequence models handle high cardinality, long-range dependence, and irregular timing jointly — no classical method does
✓ Trained on raw administrative data, they learn structured, semantically meaningful representations of individual histories
They are foundational:
✓ These representations transfer to downstream tasks unseen during training, without retraining
✓ The models are generative by construction! (see the sampling sketch below)
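To make the last point concrete, a minimal sketch of autoregressive sampling: a synthetic continuation of a trajectory is drawn event by event from the model's predicted next-event distribution. The untrained GRU here stands in for a pre-trained backbone; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

def sample_trajectory(prefix, n_steps=10, temperature=1.0):
    """Autoregressively extend a trajectory by sampling the next event."""
    events = list(prefix)
    for _ in range(n_steps):
        x = torch.tensor([events])
        h, _ = gru(embedding(x))                               # (1, len(events), hidden_dim)
        probs = F.softmax(head(h[:, -1]) / temperature, dim=-1)  # next-event distribution
        events.append(torch.multinomial(probs, 1).item())
    return events

print(sample_trajectory([1, 2, 3]))   # synthetic continuation of a trajectory
```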
The institutional opportunity
One model, pre-trained on a national register, serves as a shared backbone:
One model. Many research questions. Lower barrier to entry for research teams.
Governance is not optional.
