Teaching an Old Trick to a Newer, Smarter Dog

A context-sensitive neural spell checker for clinical text, built on BioClinical-ModernBERT

Source code: github.com/eukairos/spellcheck  •  MIT License

The problem with spell-checking clinical notes

Clinical documentation is full of spelling errors. That is not a criticism of clinicians — it is a structural reality. Notes are written at speed, on shift, using a vocabulary that sits well outside the training data of any general-purpose spell checker. Acetazolamide, pyelonephritis, ischiorectal — these are not words your phone keyboard knows, and they are not words that tools trained on Wikipedia prose handle gracefully.

The failure modes cut in both directions. A spell checker that misses worsning and dyspenea adds noise to the record. One that corrects NAD to and, or HPI to hip, actively introduces errors into a document that was previously correct. Both are bad. The second is worse.

Existing neural spell checkers were not designed with this in mind. NeuSpell — a widely cited open-source neural toolkit — was trained on the BEA-60K dataset of general English misspellings. It works reasonably well on prose. It has no awareness of what CK, INR, or HAART mean, and it tends to treat them as errors.

What this project does

BioClinicalModernBertChecker is a standalone Python implementation of the NeuSpell BertChecker architecture, re-built on top of thomas-sounack/BioClinical-ModernBERT-base. It keeps the same task formulation — token-level sequence labelling, where each input token is either left unchanged or replaced with its correction — but swaps the backbone for a model that actually knows what clinical text looks like.

The implementation also fixes several things about the original NeuSpell codebase that had aged poorly:

  • The original subword merging logic was hardcoded to WordPiece tokenisation. ModernBERT uses BPE, so that logic breaks silently. The new implementation uses the word_ids() API, which is tokeniser-agnostic.
  • NeuSpell relied on pytorch_pretrained_bert.BertAdam, deprecated since 2020. This implementation uses torch.optim.AdamW with a linear warmup schedule.
  • The original had a global mutable tokeniser state, making it unsafe to run multiple checker instances simultaneously. That is now an instance attribute.
  • Mixed precision training, gradient clipping, and checkpoint metadata are all added.
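The word_ids() fix in the first bullet can be sketched as follows. This is an illustrative helper, not the repository's exact code: it collapses per-subword predictions back to per-word predictions by keeping only the prediction at each word's first subword, and it works identically for WordPiece and BPE because word_ids() is provided by any fast Hugging Face tokeniser.

```python
def merge_subword_predictions(word_ids, predictions):
    """Collapse per-subword predictions to per-word predictions.

    word_ids comes from the tokeniser's encoding.word_ids(); None marks
    special tokens ([CLS], [SEP], padding). For each word we keep only
    the prediction aligned with its first subword.
    """
    word_level = []
    seen = set()
    for wid, pred in zip(word_ids, predictions):
        if wid is None or wid in seen:
            continue  # skip special tokens and subword continuations
        seen.add(wid)
        word_level.append(pred)
    return word_level
```

Because the merge is driven entirely by word_ids(), nothing in it assumes a particular subword marker such as WordPiece's `##` prefix.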

The net result is a checker that can be fine-tuned on your own clinical corpus in a few lines of code, and that inherits ModernBERT’s 8,192-token context window — useful for discharge summaries that routinely run longer than the 512-token limit of standard BERT.

Using it

The API is deliberately close to NeuSpell’s, so anyone familiar with that toolkit will recognise the pattern:

from bioclinical_modernbert_checker import BioClinicalModernBertChecker, build_vocab_from_files

vocab = build_vocab_from_files(
    clean_file="train_clean.txt", corrupt_file="train_noisy.txt",
    data_dir="data/", keep_simple=False
)

checker = BioClinicalModernBertChecker(device="cuda")
checker.from_huggingface(vocab=vocab)
checker.finetune(
    clean_file="train_clean.txt", corrupt_file="train_noisy.txt",
    data_dir="data/", n_epochs=3, learning_rate=5e-5
)

checker.correct("Pt c/o worsning dyspenea and fiver")
# → "Pt c/o worsening dyspnea and fever"

Training data is prepared as two line-aligned plain text files — clean and synthetically noised. The noiser applies character-level perturbations (substitution, deletion, insertion, transposition) drawn from a keyboard adjacency model, and only touches purely alphabetic tokens of three or more characters. Clinical abbreviations, numeric strings, lab values, and time expressions are left alone.
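The noising procedure described above can be sketched roughly as follows. The adjacency map is abridged, the function names are hypothetical, and the exact eligibility test is an assumption (here, uppercase tokens are treated as abbreviations and skipped); the repository's noiser may differ in detail.

```python
import random

# Abridged keyboard adjacency map (illustrative; a real one covers all keys).
KEYBOARD_ADJACENT = {"a": "qwsz", "e": "wrd", "i": "uok", "o": "ipl", "s": "adw"}

def perturb_token(token, rng):
    """Apply one random character-level perturbation:
    substitution, deletion, insertion, or transposition."""
    i = rng.randrange(len(token))
    op = rng.choice(["sub", "del", "ins", "swap"])
    if op == "sub":
        neighbours = KEYBOARD_ADJACENT.get(token[i], token[i])
        return token[:i] + rng.choice(neighbours) + token[i + 1:]
    if op == "del":
        return token[:i] + token[i + 1:]
    if op == "ins":
        neighbours = KEYBOARD_ADJACENT.get(token[i], token[i])
        return token[:i] + rng.choice(neighbours) + token[i:]
    if i < len(token) - 1:  # swap with the next character
        return token[:i] + token[i + 1] + token[i] + token[i + 2:]
    return token

def noise_line(line, rate=0.3, seed=0):
    """Perturb only purely alphabetic, lowercase tokens of three or more
    characters; abbreviations, numbers, and lab values pass through."""
    rng = random.Random(seed)
    out = []
    for tok in line.split():
        if tok.isalpha() and tok.islower() and len(tok) >= 3 and rng.random() < rate:
            out.append(perturb_token(tok, rng))
        else:
            out.append(tok)
    return " ".join(out)
```

Note that none of the perturbations can introduce whitespace, so line alignment between the clean and noised files is preserved token for token.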

The keep_simple=False flag in vocabulary construction is the single most important setting for clinical use. It preserves domain-specific tokens — abbreviations, drug names, measurement units — rather than stripping them as ‘noise’.
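To make the effect of the flag concrete, here is a hypothetical illustration of the kind of filter keep_simple toggles during vocabulary construction (the real build_vocab_from_files logic may differ):

```python
def keep_token(tok: str, keep_simple: bool) -> bool:
    """Hypothetical vocabulary filter illustrating the keep_simple flag.

    With keep_simple=True, only plain lowercase English-looking words
    survive; with keep_simple=False, abbreviations, drug names, and
    unit-bearing tokens are retained as-is.
    """
    if keep_simple:
        return tok.isalpha() and tok.islower()
    return True
```

Under keep_simple=True, tokens like "INR" or "5mg" would be dropped from the vocabulary, which is exactly the behaviour clinical use needs to avoid.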

A concrete example

Here is what the token-level evaluation output looks like on a MIMIC-IV-EXT-BHC discharge note segment containing admission labs and an imaging section. Seven errors were injected; six were correctly restored.

Corrupted fragment (imaging section only, for readability):

... us abd: mild splenomegaly. otherwise unremarkable stduy of the left kidney
and spleen. mre: stable postsurgical changes as deascribed above. no definite
mr evidence of active infkammatory bowel disease. stble appearanec of pelvic
oeritoneal inclusion cyst. ... 05: 55am blooe wbc-4.3 ...

Token-level outcomes:

✓  stduy       →  study         (gold: study)
✗  deascribed  →  deascribed    (gold: described)   ← missed
✓  infkammatory → inflammatory  (gold: inflammatory)
✓  stble       →  stable        (gold: stable)
✓  appearanec  →  appearance    (gold: appearance)
✓  oeritoneal  →  peritoneal    (gold: peritoneal)
✓  blooe       →  blood         (gold: blood)

The missed correction (deascribed) is instructive. The de- prefix insertion is hard to distinguish from legitimate morphological variants — the model’s prior assigns enough probability mass to de- as a valid prefix that it declines to change the token. This is a known limitation of the character-perturbation-only training approach.

Notably, none of the lab values, timestamps, or numeric strings throughout the note (wbc-3.9*, 11: 28am, hco3-16*, and so on) were touched. That is the expected behaviour, and it matters.

Evaluation

Evaluation was run on 100 unseen MIMIC-IV-EXT-BHC discharge note segments using synthetically injected noise with a random seed distinct from training. Token-level metrics were computed against gold-standard clean text, with punctuation normalisation applied to avoid penalising cases where spelling was correctly restored but trailing punctuation was dropped by the model.

Metric                           Score
Word Correction Rate (Recall)    0.7946
Precision                        0.9018
F1                               0.8448
False Positive Rate              0.0085
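Given aligned (corrupted, predicted, gold) token triples, these metrics can be computed along the following lines. This is a sketch of the metric definitions, not the repository's evaluation script, and the alignment and normalisation steps are assumed to have already happened:

```python
def token_metrics(triples):
    """Compute token-level spell-checking metrics from aligned triples
    of (corrupted, predicted, gold) tokens."""
    tp = fp = fn = clean_total = clean_flagged = 0
    for corrupted, predicted, gold in triples:
        if corrupted != gold:            # an injected error
            if predicted == gold:
                tp += 1                  # correctly restored
            else:
                fn += 1                  # missed or mis-corrected
                if predicted != corrupted:
                    fp += 1              # changed, but to the wrong word
        else:                            # token was already correct
            clean_total += 1
            if predicted != corrupted:
                fp += 1                  # over-correction
                clean_flagged += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = clean_flagged / clean_total if clean_total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

The false positive rate here is the fraction of already-correct tokens that the checker altered, which is the quantity that matters most for the "do no harm" failure mode discussed earlier.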

The false positive rate of 0.85% reflects residual over-correction of short clinical abbreviations (NAD, HPI, CAD) that are orthographically similar to common English words. These are mitigated at inference time by an explicit clinical abbreviation protection list and a post-inference restoration guard.
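A restoration guard of the kind described can be sketched as below; the list contents and function name are illustrative assumptions rather than the repository's exact implementation:

```python
# Hypothetical protection list; the real one would be considerably longer.
PROTECTED_ABBREVIATIONS = {"NAD", "HPI", "CAD", "CK", "INR", "HAART"}

def restore_protected(source_tokens, corrected_tokens):
    """Post-inference guard: if the model rewrote a protected clinical
    abbreviation, put the original token back."""
    return [
        src if src.upper() in PROTECTED_ABBREVIATIONS else cor
        for src, cor in zip(source_tokens, corrected_tokens)
    ]
```

Running the guard after inference means the model itself never needs to be perfectly calibrated on short abbreviations; known-safe tokens are simply exempted from correction.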

One caveat worth stating clearly: these results are on synthetic noise. Performance on naturally occurring clinical misspellings will differ. The ClinSpell benchmark (Fivez et al., 2017) provides a standard reference for comparison against other clinical spell checkers, and is the appropriate next step for anyone wanting to benchmark this tool rigorously.

Limitations and caveats

  • Training on synthetic noise means the model’s error distribution may not match real-world clinical typing patterns. Real errors tend to cluster around domain-specific long words and homophones in ways that keyboard-adjacency noise does not fully capture.
  • Very short clinical abbreviations (2–3 characters) that overlap with common English words remain a persistent challenge. The abbreviation protection list is a post-hoc patch rather than a fundamental solution.
  • No clinical validation has been performed. This is a research tool, not a production-ready clinical system.

Getting started

The code is available at github.com/eukairos/spellcheck under an MIT licence. Dependencies are minimal: torch, transformers, and tqdm. The repository includes a full README with usage examples, evaluation methodology, and citation information for the underlying models and benchmarks.

If you work with MIMIC data or other clinical free text and have been looking for a starting point for domain-specific spelling correction, this might be a useful base to build on.

References

Jayanthi, Pruthi & Neubig (2020). NeuSpell: A Neural Spelling Correction Toolkit. EMNLP 2020 System Demonstrations, pp. 158–164.

Sounack et al. (2025). BioClinical-ModernBERT: A Modern Clinical Encoder. arXiv:2506.10896.

Fivez, Šuster & Daelemans (2017). Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text. CLIN Journal, 7.

