The fine art of algorithms

Software architect Christian Ohr gives us a look at the inner workings of the Master Patient Index.

The core task of a master patient index is unique identification of patients across systems and facilities. By linking patient data, it is possible to consolidate medical information stored in different facilities into a virtual electronic patient record.

Although Germany has numerous national personal identifiers, like taxpayer identification numbers, pension insurance numbers, or the new 10-digit health insurance number, there is still no lifetime identifier that can be used for healthcare purposes in accordance with data protection laws, while at the same time enabling unique identification of a particular patient. This makes it necessary to assess patient datasets based on the similarity of their demographic features.

There are various methods for calculating the degree of similarity between two patient datasets. ICW’s Master Patient Index module relies on the Fellegi-Sunter probabilistic method [1], which takes flexibly defined (and, if at all possible, mutually independent) identifiers in two datasets and determines in advance how likely it is that those values will match if the datasets belong to the same person (m probability) or to different people (u probability). Ultimately,

  • m reflects faulty entries or potential changes (e.g. due to marriage or a move),
  • u reflects the number of different values for a particular identifier (e.g. sex: 2-3, surname: approx. 50,000 in Germany, far fewer in China) and their relative distribution (such as location: large city vs. village).
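
To make the role of u concrete, here is a minimal sketch of how a chance-agreement probability could be estimated from the values observed for one identifier. It assumes that u is approximated by the probability that two randomly drawn records carry the same value, i.e. the sum of the squared relative frequencies. The function name and the sample data are hypothetical and not taken from ICW's module.

```python
from collections import Counter


def chance_agreement(values):
    """Estimate u: the probability that two randomly drawn records agree."""
    counts = Counter(values)
    total = sum(counts.values())
    # Two independent draws agree on a value with probability p * p.
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical sample data, for illustration only.
sexes = ["f", "m"] * 50                       # two equally common values
surnames = [f"name{i}" for i in range(100)]   # 100 equally common values
skewed = ["city"] * 90 + ["village"] * 10     # two values, unevenly distributed

u_sex = chance_agreement(sexes)
u_surname = chance_agreement(surnames)
u_skewed = chance_agreement(skewed)
```

In this sketch an identifier with many evenly spread values gets a small u, while a skewed distribution raises u even when the number of distinct values stays the same.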

Using logarithmic functions, a value can be calculated from these probabilities that indicates how much weight to assign to this type of identifier when calculating similarities. If an identifier matches, its weight is proportional to log(m/u); if not, its weight is proportional to log((1-m)/(1-u)).
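
The two weights can be sketched in a few lines. The base-2 logarithm and the m and u values below are assumptions chosen for illustration; the text only states that the weights are proportional to the logarithms.

```python
import math


def agreement_weight(m, u):
    """Weight contributed when the identifier matches: log(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight contributed when the identifier differs: log((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical probabilities: sex has few possible values, while an
# insurance number has very many.
sex_agree = agreement_weight(0.99, 0.5)
sex_disagree = disagreement_weight(0.99, 0.5)
number_agree = agreement_weight(0.95, 0.000001)
number_disagree = disagreement_weight(0.95, 0.000001)
```

As long as m is larger than u, the agreement weight is positive and the disagreement weight is negative, and a rare chance match (small u) produces a much larger agreement weight.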

The sum of the actual weighting values, standardized to a numerical range of 0-1, indicates the probability that two patient datasets refer to the same person [2]. Obviously, individual characteristics are “weightier” if they rarely change, never change, are scanned in from an insurance card with a bar code reader (as opposed to being entered with a keyboard), and can assume many different values. Thus Germany’s new lifetime health insurance number has a much greater effect on similarity level than gender, for example.
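
The summation and scaling step might look like the following sketch. It assumes min-max scaling between the lowest possible total (every identifier disagrees) and the highest possible total (every identifier agrees). This is only one way to map the sum onto the range 0-1 and not necessarily the one the module uses. The field names and probabilities are hypothetical.

```python
import math

# Hypothetical (m, u) pairs per identifier.
FIELDS = {
    "insurance_number": (0.95, 0.000001),
    "surname": (0.95, 0.01),
    "sex": (0.99, 0.5),
}


def weight(m, u, agrees):
    """Agreement or disagreement weight of a single identifier."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def similarity(agreement):
    """Sum the weights and scale the total to the range 0-1 (assumed min-max)."""
    total = sum(weight(*FIELDS[name], agreement[name]) for name in FIELDS)
    lowest = sum(weight(m, u, False) for m, u in FIELDS.values())
    highest = sum(weight(m, u, True) for m, u in FIELDS.values())
    return (total - lowest) / (highest - lowest)


only_sex_differs = similarity(
    {"insurance_number": True, "surname": True, "sex": False})
only_number_differs = similarity(
    {"insurance_number": False, "surname": True, "sex": True})
```

With these assumed values, a differing insurance number lowers the score far more than a differing sex entry, which mirrors the point about "weightier" characteristics.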

The term “identifier agreement” does not necessarily mean exact equality; it can also be fuzzily defined using phonetic and string distance algorithms. The degree to which missing values or pseudovalues (like “unknown”) play a role in calculating similarity—if at all—must also be determined. Using configurable threshold values, the Master Patient Index then decides whether the calculated probability level is high enough to link the patient datasets automatically, or, in ambiguous cases, whether to generate a mapping task, in which case the decision is left to the data clearing center staff (typically patient administration personnel). The Master Patient Index provides an intuitive and straightforward user interface for this purpose.
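
The fuzzy comparison and the threshold decision can be sketched as follows. The example uses the string similarity ratio from Python's standard library as a stand-in for the phonetic and string distance algorithms mentioned above, treats pseudovalues as "not comparable", and uses made-up threshold values. All names and numbers are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical list of placeholder entries that carry no information.
PSEUDOVALUES = {"", "unknown", "n/a"}


def compare(a, b, min_ratio=0.8):
    """Return True or False for (dis)agreement, None if not comparable."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if a in PSEUDOVALUES or b in PSEUDOVALUES:
        return None  # missing values neither support nor contradict a match
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


def classify(score, link_threshold=0.9, task_threshold=0.6):
    """Map a similarity score to an action using two configurable thresholds."""
    if score >= link_threshold:
        return "link"
    if score >= task_threshold:
        return "mapping_task"
    return "no_link"


spelling_variant = compare("Schmidt", "Schmitt")   # similar spelling
different_name = compare("Schmidt", "Müller")      # clearly different
placeholder = compare("unknown", "Schmidt")        # not comparable
```

Returning None for pseudovalues leaves it to the caller to decide whether such an identifier is skipped or counted with a reduced weight.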

The procedure is similar when updating information for patients already registered in the system, with the ability to break links and reassign patient datasets, if desired. The data clearing staff can also have existing links checked at any time.

This probabilistic approach is accepted and widely used. However, client specifications for matching configurations are often deterministic in design, that is, they take the form of “if–then” rules. Typical examples of this approach are:

  • “If the insurance number is the same, always link records, otherwise compare first name, last name, date of birth and address.”
  • “If the last name contains ‘emergency’ or ‘baby,’ never link records.”
  • “If two patients could be twins, generate a manual mapping task.”

Specifications like these are usually fully justified in specific cases. However, alternatives (“else”) are not defined in many instances. Also, it is often unclear what should happen if several of these rules apply but their outcomes are contradictory. There have been cases, for example, where health insurance numbers were the same, but all the other demographic data was different. It has also proven impractical to use a purely probabilistic strategy to execute deterministic rules, and to “invent” the necessary m and u probabilities without regard to their actual significance.

[Figure: The weighting function is derived from the sum of all probability ratios.]

What to do, then? ICW’s Master Patient Index runs through a series of decision rules internally, which indicate whether or not automatic links and mapping tasks should be generated. Assessment of probabilistically computed similarity based on the configurable threshold values is a fundamental rule, but there are numerous additional rules that can be activated or deactivated. They also meet the “if–then” requirements mentioned above. Examples of these rules and the decisions they make include:

  • Automatic link if one or more identifiers match
  • No link and mapping task if one or more identifiers do not match
  • No link and mapping task if the person could be a twin
  • Mapping task if there are too many potential links with comparable similarity ratings

When there are contradictory outcomes, the Master Patient Index prefers to generate a mapping task or indicate a potential duplicate record. This prevents faulty linkages and, with them, the possibility that clinical information from different patients will be combined in the same record.

To achieve the highest possible match quality, ICW’s Master Patient Index also offers a combination of these two models. The probabilistically determined similarity score comes into play particularly when no deterministic rule prevents or forces an automatic link. At the same time, however, it prevents a deterministically based linkage if the calculated similarity score indicates that different patient identities are involved, for example if duplicate health insurance numbers are issued in error. In this way, the client specifications mentioned in the examples above can be implemented without losing the advantages of the probability-based similarity calculations.
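
One way to combine deterministic rule outcomes with the similarity score is sketched below. The threshold values, the function name, and the choice to turn a vetoed link into a mapping task are assumptions. The text only states that contradictory outcomes favor a mapping task and that a low score prevents a rule-based link.

```python
LINK, TASK, NO_LINK = "link", "mapping_task", "no_link"


def decide(rule_outcomes, score, link_at=0.9, task_at=0.6, veto_below=0.3):
    """Combine deterministic rule outcomes with the probabilistic score."""
    votes = set(rule_outcomes)
    if not votes:
        # No deterministic rule fired: the score alone decides.
        if score >= link_at:
            return LINK
        return TASK if score >= task_at else NO_LINK
    if len(votes) > 1:
        # Contradictory rules: let a person decide rather than risk a bad link.
        return TASK
    (vote,) = votes
    if vote == LINK and score < veto_below:
        # Assumed handling: a very low score vetoes the rule-based link.
        return TASK
    return vote


score_only = decide([], 0.95)
duplicate_number_issued = decide([LINK], 0.1)
contradiction = decide([LINK, NO_LINK], 0.95)
```

In this sketch the case of an insurance number issued twice ends up with the data clearing staff instead of producing an automatic link.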

[1] I. Fellegi, A. Sunter. A theory for record linkage. Journal of the American Statistical Association 1969;64(328):1183–1210.
[2] M. A. Jaro. Probabilistic linkage of large public health data files. Stat Med 1995;14:491–498.
[3] D. E. Clark, D. R. Hahn. Comparison of probabilistic and deterministic record linkage in the development of a statewide trauma registry. Proc Annu Symp Comput Appl Med Care 1995:397–401.