Category: Healthcare
-
An STS Annotation Tool for EHR Text
Semantic Textual Similarity (STS) evaluation is a standard way to measure how well an embedding model captures meaning. You present the model with pairs of sentences, compute the cosine similarity of their embeddings, and correlate those scores against human judgements. The correlation — Spearman’s ρ — tells you whether the model’s sense of “similar” matches…
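The evaluation loop described in this excerpt can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the tool from the post: the two-dimensional "embeddings" and the human scores are made-up toy values, and the helper names (`cosine_similarity`, `average_ranks`, `spearman_rho`) are hypothetical. Spearman's ρ is computed here as the Pearson correlation of the ranks.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumed for illustration): one embedding pair per sentence pair,
# plus a human similarity judgement for each pair on a 0-5 scale.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]
human_scores = [4.8, 3.0, 1.5, 0.2]

model_scores = [cosine_similarity(u, v) for u, v in pairs]
rho = spearman_rho(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

In a real evaluation the vectors would come from the embedding model under test, and a library routine such as `scipy.stats.spearmanr` would normally replace the hand-rolled rank correlation.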
-
Teaching an Old Trick to a Newer, Smarter Dog
A context-sensitive neural spell checker for clinical text, built on BioClinical-ModernBERT. Source code: github.com/eukairos/spellcheck • MIT License. The problem with spell-checking clinical notes: clinical documentation is full of spelling errors. That is not a criticism of clinicians; it is a structural reality. Notes are written at speed, on shift, using a vocabulary that sits…
-

Building a Spell Screener for Clinical Text — And How You Can Adapt It for Any Domain
Clinical notes are peculiarly messy. Written under time pressure by busy clinicians, they’re full of abbreviations, shorthand, and — inevitably — typos. When you’re building natural language processing (NLP) pipelines that depend on these notes, these ‘features’ become a real problem. This post describes a tool Anthropic’s Claude helped me build to tackle that problem,…
-
Adding Allergy Nodes to our MIMIC-IV Patient Graph
In our previous graph database exercise, we built a graph of MIMIC-IV patients, their admissions, and the diagnoses associated with each admission. In this exercise, we’ll load some of their allergies. The allergies documented in MIMIC-IV are not stored in structured fields but exist as free text inside clinical notes, which makes it a challenge to…
-
Topic Modelling 2: Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a probabilistic topic modelling approach that treats documents as bags of words. Conceptually it is similar to Latent Semantic Analysis (LSA, discussed in the previous post) in that it tries to discover a latent space from observed variables, but instead of a deterministic matrix factorization, it uses probability distributions over random variables.…
-
Building a MIMIC-IV Patient Graph
A Simple Patient Graph Database. In a previous post, I shared how to build a SNOMED concept graph database. In this post, we will use the MIMIC-IV dataset (details below) to construct a graph of patients, their admissions and the diagnoses associated with each admission. We then link this graph to our SNOMED concept graph…
-
Building a SNOMED Concept Graph
SNOMED Is A Knowledge Graph. The Systematized Nomenclature of Medicine, Clinical Terms (SNOMED CT) is a de facto standard for standardizing clinical vocabulary and ontology in many parts of the world, including in Singapore. You can access SNOMED CT in a number of ways. If you work in a large healthcare organization, it probably has…
-
Adjacent Possibles
Welcome to Eukairos, a collection of musings at the confluence of artificial intelligence (AI), data management and healthcare. The term eukairos is derived from the Greek ευκαιρός, loosely meaning ‘timeliness’ or ‘opportunity’. The short explanation for the site’s name is that English-language domain names are pretty much saturated in the .sg domain. The more involved…