Clinical patient record systems architecture: an overview
PM Nadkarni
Centre for Medical Informatics, Yale University School of Medicine, New Haven, Connecticut 06520-8009, USA
Source of Support: None, Conflict of Interest: None. PMID: 11298472
Keywords: Connecticut, Decision Making, Computer-Assisted, Human, Medical Records Systems, Computerized, organization & administration, Sensitivity and Specificity, Statistics, Support, U.S. Gov′t, P.H.S.
Creation of a general-purpose medical record is one of the more difficult problems in database design. In the USA, most medical institutions have much more electronic information on a patient’s financial and insurance history than on the patient’s medical record. Financial information, like orthodox accounting information, is far easier to computerize and maintain, because the information is fairly standardized. Clinical information, by contrast, is extremely diverse. Signal and image data, such as X-rays and ECGs, require much storage space and are more challenging to manage. Mainstream relational database engines developed the ability to handle image data less than a decade ago, and the mainframe-style engines that run many medical database systems have lagged technologically. One well-known system has been written in assembly language for an obsolescent class of mainframes that IBM sells only to hospitals that have elected to purchase this system.
Clinical patient record systems (CPRSs) are designed to review clinical information that has been gathered through a variety of mechanisms, and to capture new information. From the perspective of review, which implies retrieval of captured data, CPRSs can retrieve data in two ways. They can show data on a single patient (specified through a patient ID) or they can be used to identify a set of patients (not known in advance) who happen to match particular demographic, diagnostic or clinical parameters. That is, retrieval can either be patient-centric or parameter-centric. Patient-centric retrieval is important for real-time clinical decision support. “Real time” means that the response should be obtained within seconds (or a few minutes at the most), because the availability of current information may mean the difference between life and death. Parameter-centric retrieval, by contrast, involves processing large volumes of data. Its response time is not particularly critical, however, because the results are used for purposes like long-term planning or for research, as in retrospective studies.
In general, on a single machine, it is possible to create a database design that is efficient for either patient-centric retrieval or parameter-centric retrieval, but not both. The challenges are partly logistic and partly architectural. From the logistic viewpoint, in a system meant for real-time patient query, a giant parameter-centric query that processed half the records in the database would not be desirable because it would steal machine cycles from critical patient-centric queries. Many database operations, both business and medical, therefore periodically copy data from a “transaction” (patient-centric) database, which captures primary data, into a parameter-centric “query” database on a separate machine in order to get the best of both worlds. Some commercial patient record systems, such as the 3M Clinical Data Repository (CDR), are composed of two subsystems, one that is transaction-oriented and one that is query-oriented. Patient-centric query is considered more critical for day-to-day operation, especially in smaller or non-research-oriented institutions. Many vendors therefore offer parameter-centric query facilities as an additional package separate from their base CPRS offering. We now discuss the architectural challenges, and consider why creating an institution-wide patient database poses significantly greater hurdles than creating one for a single department.
During a routine check-up, a clinician goes through a standard checklist in terms of history, physical examination and laboratory investigations. When a patient has one or more symptoms suggesting illness, however, a whole series of questions is asked, and investigations are performed (by a specialist if necessary), that would not be asked or performed if the patient did not have these symptoms. These are based on the suspected (or apparent) diagnosis or diagnoses. Proformas (protocols) have been devised that simplify the patient’s workup for a general examination as well as for many disease categories. The clinical parameters recorded in a given protocol have been worked out by experience over years or decades, though the types of questions asked, and the order in which they are asked, vary with the institution (or vendor package, if data capture is electronically assisted). The level of detail is often left to individual discretion: clinicians with a research interest in a particular condition will record more detail for that condition than clinicians who do not. A certain minimum set of facts must be gathered for a given condition, however, irrespective of personal or institutional preferences.
The objective of a protocol is to maximize the likelihood of detection and recording of all significant findings in the limited time available. One records both positive findings and significant negatives (e.g., no history of alcoholism in a patient with cirrhosis). New protocols are continually evolving for emergent disease complexes such as AIDS. While protocols are typically printed out (both for the benefit of possibly inexperienced residents, and to form part of the permanent paper record), experienced clinicians often have them committed to memory. However, the difference between an average clinician and a superb one is that the latter knows when to depart from the protocol: if departure never occurred, new syndromes or disease complexes would never be discovered. In any case, the protocol is the starting point when we consider how to store information in a CPRS.
CPRSs continue to be an area of active research. This paper, however, focuses on the mechanism by which data is stored and retrieved, rather than the ancillary functions provided by the system, such as implementation of clinical guidelines, problem-oriented data capture, or therapy support. The obvious approach for storing clinical data is to record each type of finding in a separate column in a table. In the simplest example of this, the so-called “flat-file” design, there is only a single value per parameter for a given patient encounter. Systems that capture standardised data related to a particular specialty (e.g., an obstetric examination, or a colonoscopy) often do this. This approach is simple for non-computer-experts to understand, and also easiest to analyse by statistics programs (which typically require flat files as input). A system that incorporates problem-specific clinical guidelines is easiest to implement with flat files, as the software engineering for data management is relatively minimal.
In certain cases, an entire class of related parameters is placed in a group of columns in a separate table, with multiple sets of values. For example, laboratory information systems, which support labs that perform hundreds of kinds of tests, do not use one column for every test that is offered. Instead, for a given patient at a given instant in time, they store pairs of values consisting of a lab test ID and the value of the result for that test. Similarly for pharmacy orders, the values consist of a drug/medication ID, the preparation strength, the route, the frequency of administration, and so on. When one is likely to encounter repeated sets of values, one must generally use a more sophisticated approach to managing data, such as a relational database management system (RDBMS). Simple spreadsheet programs, by contrast, can manage flat files, though RDBMSs are also more than adequate for that purpose.
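The groups-of-columns idea for laboratory data can be sketched as follows. This is a minimal illustration only: the record layout, the test IDs (SERUM_K, SERUM_NA, HB) and the sample values are hypothetical and are not taken from any particular laboratory information system.

```python
from collections import namedtuple

# Hypothetical layout: one row per (patient, time, test) result,
# rather than one column per test offered by the lab.
LabResult = namedtuple("LabResult", "patient_id drawn_at test_id value")

lab_results = [
    LabResult(101, "2000-03-01T08:00", "SERUM_K", 4.1),
    LabResult(101, "2000-03-01T08:00", "SERUM_NA", 139.0),
    LabResult(101, "2000-03-02T08:00", "SERUM_K", 5.6),
    LabResult(102, "2000-03-01T09:30", "HB", 11.2),
]


def results_for(rows, patient_id, test_id):
    """Return (time, value) pairs for one patient and one test, oldest first."""
    hits = [(r.drawn_at, r.value) for r in rows
            if r.patient_id == patient_id and r.test_id == test_id]
    return sorted(hits)


potassium = results_for(lab_results, 101, "SERUM_K")
```

Adding a new kind of test requires only new rows with a new test ID; the structure of the table does not change.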
The one-column-per-parameter approach, unfortunately, does not scale up when considering an institutional database that must manage data across dozens of departments, each with numerous protocols. (By contrast, the groups-of-columns approach scales well, as we shall discuss later.) The reasons for this are discussed below.
One obvious problem is the sheer number of tables that must be managed. A given patient may, over time, have any combination of ailments that span specialities: cross-departmental referrals are common even for inpatient admission episodes. In most Western European countries where national-level medical records on patients go back over several decades, using such a database to answer the question, “tell me everything that has happened to this patient in forward/reverse chronological order” involves searching hundreds of protocol-specific tables, even though most patients may not have had more than a few ailments.
Some clinical parameters (e.g., serum enzymes and electrolytes) are relevant to multiple specialities, and, with the one-protocol-per-table approach, they tend to be recorded redundantly in multiple tables. This violates a cardinal rule of database design: a single type of fact should be stored in a single place. If the same fact is stored in multiple places, cross-protocol analysis becomes needlessly difficult because all tables where that fact is recorded must be first tracked down.
The number of tables keeps growing as new protocols are devised for emergent conditions, and the table structures must be altered if a protocol is modified in the light of medical advances. In a practical application, it is not enough merely to modify or add a table: one must also alter the user interface to the tables, that is, the data-entry/browsing screens that present the protocol data. While some system maintenance is always necessary, endless redesign to keep pace with medical advances is tedious and undesirable.
A simple alternative to creating hundreds of tables suggests itself. One might attempt to combine all facts applicable to a patient into a single row. Unfortunately, across all medical specialities, the number of possible types of facts runs into the hundreds of thousands. Today’s database engines permit a maximum of 256 to 1024 columns per table, so one would still require hundreds of tables to allow for every possible type of fact. Further, medical data is time-stamped, i.e., the start time (and, in some cases, the end time) of patient events is important to record for the purposes of both diagnosis and management. Several facts about a patient may have a common time-stamp, e.g., serum chemistry or haematology panels, where several tests are done at a time by automated equipment, all results being stamped with the time when the patient’s blood was drawn. Even if databases did allow a potentially infinite number of columns, there would be considerable wastage of disk space, because the vast majority of columns would be inapplicable (null) for a single patient event. (Even null values use up a modest amount of space per null fact.) Some columns would be inapplicable to particular types of patients: e.g., gyn/obs facts would not apply to males.
The challenges to representing institutional patient data arise from the fact that clinical data is both highly heterogeneous and sparse. The design solution that deals with these problems is called the entity-attribute-value (EAV) model. In this design, the parameters (attribute is a synonym of parameter) are treated as data recorded in an attribute definitions table, so that addition of new types of facts does not require database restructuring by addition of columns. Instead, more rows are added to this table. The patient data table (the EAV table) records an entity (a combination of the patient ID, clinical event, and one or more date/time stamps recording when the events recorded actually occurred), the attribute/parameter, and the associated value of that attribute. Each row of such a table stores a single fact about a patient at a particular instant in time. For example, a patient’s laboratory value may be stored as: (
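A minimal sketch of the two tables described above, using Python's built-in sqlite3 module. The table names, column names and sample values are hypothetical, chosen only for illustration; they are not the schema of any system named in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Attribute definitions: one row per kind of fact the system knows about.
CREATE TABLE attribute_def (
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    datatype     TEXT NOT NULL
);
-- EAV table: one row per fact about a patient at a point in time.
CREATE TABLE patient_eav (
    patient_id   INTEGER NOT NULL,
    event_time   TEXT    NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attribute_def(attribute_id),
    value        TEXT
);
""")

conn.execute("INSERT INTO attribute_def VALUES (1, 'serum potassium', 'number')")
conn.execute("INSERT INTO patient_eav VALUES (101, '2000-03-01T08:00', 1, '4.1')")

# A new kind of fact is introduced by adding a row, not a column.
conn.execute("INSERT INTO attribute_def VALUES (2, 'haemoglobin', 'number')")
conn.execute("INSERT INTO patient_eav VALUES (101, '2000-03-01T08:00', 2, '13.5')")

fact_count = conn.execute("SELECT COUNT(*) FROM patient_eav").fetchone()[0]
column_count = len(conn.execute("PRAGMA table_info(patient_eav)").fetchall())
```

After the second attribute is added, the EAV table still has the same four columns; only the row counts have grown.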
Attribute-value pairs themselves are used in non-medical areas to manage extremely heterogeneous data, e.g., in Web “cookies” (text files written by a Web server to a user’s local machine when the site is being browsed), and the Microsoft Windows registries. The first major use of EAV for clinical data was in the pioneering HELP system built at LDS Hospital in Utah starting from the late 1970s. HELP originally stored all data (characters, numbers and dates) as ASCII text in a pre-relational database. (ASCII, for American Standard Code for Information Interchange, is the code used by computer hardware almost universally to represent characters. Standard ASCII defines 128 characters; the 256 characters available in its 8-bit extensions are adequate to represent the character sets of most European languages, but not ideographic scripts such as that used for Mandarin Chinese.) The modern version of HELP, as well as the 3M CDR, which is a commercialisation of HELP, uses a relational engine.
A team at Columbia University was the first to enhance EAV design to use relational database technology. The Columbia-Presbyterian CDR also separated numbers from text in separate columns. The advantage of storing numeric data as numbers instead of ASCII is that one can create useful indexes on these numbers. (Indexes are a feature of database technology that allow fast search for particular values in a table, e.g., laboratory parameters within or beyond a particular range. When numbers are stored as ASCII text, an index on such data is useless for range searches: the text “12.5” is greater than “11000”, because it comes later in alphabetical order.) Some EAV databases therefore segregate data by data type. That is, there are separate EAV tables for short text, long text (e.g., discharge summaries), numbers, dates, and binary data (signal and image data). For every parameter, the system records its data type so that one knows where it is stored. ACT/DB, a system for management of clinical trials data (which shares many features with CDRs) created at Yale University by a team led by this author, uses this approach.
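The ordering problem with numbers stored as text can be seen in a few lines. The sample values below are arbitrary and serve only to illustrate the point.

```python
# The same four laboratory values, stored as text and as numbers.
values_as_text = ["12.5", "11000", "9", "100"]

# Text is compared character by character, so "9" sorts last and
# "12.5" sorts after "11000".
text_order = sorted(values_as_text)

# Converting to numbers restores the ordering a range search needs.
numeric_order = sorted(float(v) for v in values_as_text)

text_says_bigger = "12.5" > "11000"   # True under text comparison
number_says_bigger = 12.5 > 11000     # False under numeric comparison
```

An index built on the text form would follow `text_order`, which is why a range query such as "all values above 100" cannot use it reliably.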
From the conceptual viewpoint (i.e., ignoring data type issues), one may therefore think of a single giant EAV table for patient data, containing one row per fact for a patient at a particular date and time. To answer the question “tell me everything that has happened to patient X”, one simply gathers all rows for this patient ID (this is a fast operation because the patient ID column is indexed), sorts them by the date/time column, and then presents this information after “joining” to the Attribute definitions table. The last operation ensures that attributes are presented to the user in ordinary language – e.g., “haemoglobin,” instead of as cryptic numerical IDs.
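The patient-centric retrieval just described (gather rows by patient ID, sort by time, join to the attribute definitions) might look like the following sketch with Python's sqlite3 module. The schema, the index name and the sample data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute_def (attribute_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE patient_eav (
    patient_id INTEGER, event_time TEXT, attribute_id INTEGER, value TEXT);
-- The index on patient ID is what makes gathering one patient's rows fast.
CREATE INDEX idx_eav_patient ON patient_eav (patient_id);
INSERT INTO attribute_def VALUES (1, 'haemoglobin'), (2, 'serum potassium');
INSERT INTO patient_eav VALUES
    (101, '2000-03-02T08:00', 2, '5.6'),
    (101, '2000-03-01T08:00', 1, '13.5'),
    (102, '2000-03-01T09:30', 1, '11.2');
""")


def patient_history(conn, patient_id):
    """Return all facts for one patient in chronological order,
    with attribute IDs replaced by their readable names."""
    return conn.execute(
        """
        SELECT e.event_time, a.name, e.value
        FROM patient_eav AS e
        JOIN attribute_def AS a ON a.attribute_id = e.attribute_id
        WHERE e.patient_id = ?
        ORDER BY e.event_time
        """,
        (patient_id,),
    ).fetchall()


history = patient_history(conn, 101)
```

Changing `ORDER BY e.event_time` to `ORDER BY e.event_time DESC` gives the reverse chronological view.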
One should mention that EAV database design has been employed primarily in medical databases because of the sheer heterogeneity of patient data. One hardly ever encounters it in “business” databases, though these will often use a restricted form of EAV termed “row modelling.” Examples of row modelling are the tables of laboratory test results and pharmacy orders, discussed earlier.
Note also that production “EAV” databases almost always contain components that are designed conventionally. EAV representation is suitable only for data that is sparse and highly variable. Certain kinds of data, such as patient demographics (name, sex, birth date, address, etc.), are standardized and recorded on all patients, and therefore there is no advantage in storing them in EAV form.
EAV is primarily a means of simplifying the physical schema of a database, to be used when simplification is beneficial. However, the users conceptualise the data as being segregated into protocol-specific tables and columns. Further, external programs used for graphical presentation or data analysis always expect to receive data as one column per attribute. The conceptual schema of a database reflects the users’ perception of the data. Because it implicitly captures a significant part of the semantics of the domain being modelled, the conceptual schema is domain-specific. A user-friendly EAV system completely conceals its EAV nature from its end-users: its interface conforms to the conceptual schema and creates the illusion of conventional data organisation. From the software perspective, this implies on-the-fly transformation of EAV data into conventional structure for presentation in forms, reports or data extracts that are passed to an analytic program. Conversely, changes to data by end-users through forms must be translated back into EAV form before they are saved.
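The two-way transformation between EAV rows and a conventional one-column-per-attribute layout can be sketched in plain Python. The function names `pivot` and `unpivot` and the sample data are hypothetical; a production system would drive the column list from its metadata.

```python
from collections import defaultdict

# EAV rows: (patient_id, event_time, attribute, value)
eav_rows = [
    (101, "2000-03-01T08:00", "haemoglobin", 13.5),
    (101, "2000-03-01T08:00", "serum potassium", 4.1),
    (102, "2000-03-01T09:30", "haemoglobin", 11.2),
]


def pivot(rows, columns):
    """Turn EAV rows into one record per (patient, time), one key per attribute.
    Attributes with no recorded fact appear as None."""
    grouped = defaultdict(dict)
    for patient_id, event_time, attribute, value in rows:
        grouped[(patient_id, event_time)][attribute] = value
    return [
        {"patient_id": pid, "event_time": t, **{c: facts.get(c) for c in columns}}
        for (pid, t), facts in sorted(grouped.items(), key=lambda kv: kv[0])
    ]


def unpivot(record, columns):
    """Translate an edited conventional record back into EAV rows,
    skipping attributes that have no value."""
    return [
        (record["patient_id"], record["event_time"], c, record[c])
        for c in columns
        if record[c] is not None
    ]


columns = ["haemoglobin", "serum potassium"]
table = pivot(eav_rows, columns)
round_trip = [row for record in table for row in unpivot(record, columns)]
```

Here `table` is what a form, report or statistics package would see, while `round_trip` is what would be written back to the EAV table after editing.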
To achieve this sleight-of-hand, an EAV system records the conceptual schema through metadata – “dictionary” tables whose contents describe the rest of the system. While metadata is important for any database, it is critical for an EAV system, which can seldom function without it. ACT/DB, for example, uses metadata such as the grouping of parameters into forms, their presentation to the user in a particular order, and validation checks on each parameter during data entry to automatically generate web-based data entry. The metadata architecture and the various data entry features that are supported through automatic generation are described elsewhere.
EAV is not a panacea. The simplicity and compactness of EAV representation is offset by a potential performance penalty compared to the equivalent conventional design. For example, the simple AND, OR and NOT operations on conventional data must be translated into the significantly less efficient set operations of Intersection, Union and Difference respectively. For queries that process potentially large amounts of data across thousands of patients, the impact may be felt in terms of increased time taken to process queries. A quantitative benchmarking study performed by the Yale group with microbiology data modelled both conventionally and in EAV form indicated that parameter-centric queries on EAV data ran anywhere from 2-12 times as slow as queries on equivalent conventional data. Patient-centric queries, on the other hand, run at the same speed or even faster with EAV schemas, if the data is highly heterogeneous. We have discussed the reason for the latter.
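The translation of Boolean criteria into set operations can be illustrated as follows. In a conventional table, "diabetes AND hypertension" is a single test on one row; with EAV data each criterion yields its own set of patient IDs, and the sets must then be combined. The attribute names and data below are hypothetical.

```python
# EAV rows: (patient_id, attribute, value)
eav_rows = [
    (101, "diabetes", "yes"),
    (101, "hypertension", "yes"),
    (102, "diabetes", "yes"),
    (103, "hypertension", "yes"),
]


def patients_with(rows, attribute, value):
    """Return the set of patient IDs having the given attribute-value fact.
    Each call corresponds to one pass (or one index lookup) over the EAV table."""
    return {pid for pid, attr, val in rows if attr == attribute and val == value}


diabetic = patients_with(eav_rows, "diabetes", "yes")
hypertensive = patients_with(eav_rows, "hypertension", "yes")

both = diabetic & hypertensive           # AND becomes Intersection
either = diabetic | hypertensive         # OR becomes Union
diabetic_only = diabetic - hypertensive  # AND NOT becomes Difference
```

Every additional criterion adds another set to be built and combined, which is one way to see where the extra cost of parameter-centric queries on EAV data comes from.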
A more practical problem with parameter-centric query is that the standard user-friendly tools (such as Microsoft Access’s Visual Query-by-Example) that are used to query conventional data do not help very much for EAV data, because the physical and conceptual schemas are completely different. Complicating the issue further is that some tables in a production database are conventionally designed. Special query interfaces need to be built for such purposes. The general approach is to use metadata that records whether a particular attribute has been stored conventionally or in EAV form: a program consults this metadata, and generates the appropriate query code in response to a user’s query. A query interface has been built with this approach for the ACT/DB system; this is currently being ported to the Web.
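A rough sketch of metadata-driven query generation follows. The metadata layout, the table names and the function `build_query` are all hypothetical and are not a description of how ACT/DB implements this; the sketch only shows a program choosing between two query shapes by consulting a dictionary.

```python
# Hypothetical metadata: where and how each attribute is physically stored.
ATTRIBUTE_METADATA = {
    "sex": {"storage": "conventional", "table": "patient", "column": "sex"},
    "haemoglobin": {"storage": "eav", "table": "patient_eav_number",
                    "attribute_id": 1},
}


def build_query(attribute):
    """Return (sql, leading_params) for 'patients whose attribute equals a value'.
    The caller appends the value being searched for to leading_params.
    Table and column names come from trusted metadata, never from user input."""
    meta = ATTRIBUTE_METADATA[attribute]
    if meta["storage"] == "conventional":
        sql = f"SELECT patient_id FROM {meta['table']} WHERE {meta['column']} = ?"
        return sql, []
    sql = (f"SELECT patient_id FROM {meta['table']} "
           "WHERE attribute_id = ? AND value = ?")
    return sql, [meta["attribute_id"]]


conventional_sql, conventional_params = build_query("sex")
eav_sql, eav_params = build_query("haemoglobin")
```

The user asks the same kind of question in both cases; only the generated SQL differs.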
So far, we have discussed how EAV systems can create the illusion of conventional data organization through the use of protocol-specific forms. However, the problem of how to record information that is not in a protocol (e.g., a clinician’s impressions) has not been addressed. One way to tackle this is to create a “general-purpose” form that allows the data entry person to pick attributes (by keyword search, etc.) from the thousands of attributes within the system, and then supply the values for each. (Because the user must directly add attribute-value pairs, this form reveals the EAV nature of the system.) In practice, however, this process, in which locating an individual attribute would take from several seconds to half a minute, would be far too tedious for use by a clinician.
Therefore, clinical patient record systems also allow the storage of “free text” – narrative in the doctor’s own words. Such text, which is of arbitrary size, may be entered in various ways. In the past, the clinician had to compose a note comprising such text in its entirety. Today, however, “template” programs can often provide structured data entry for particular domains (such as chest X-ray interpretations). These programs will generate narrative text, including boilerplate for findings that were normal, and can greatly reduce the clinician’s workload. Many of these programs use speech recognition software, thereby improving throughput even further.
Once the narrative has been recorded, it is desirable to encode the facts captured in the narrative in terms of the attributes defined within the system. (Among these attributes may be concepts derived from controlled vocabularies such as SNOMED, used by pathologists, or ICD-9, used for disease classification by epidemiologists as well as for billing records.) The advantage of encoding is that subsequent analysis of the data becomes much simpler, because one can use a single code to record the multiple synonymous forms of a concept as encountered in narrative, e.g., hepatic/liver, kidney/renal, vomiting/emesis and so on. In many medical institutions, there are non-medical personnel who are trained to scan narrative dictated by a clinician, and identify concepts from one or more controlled vocabularies by looking up keywords. This process is extremely labour-intensive, and there is ongoing informatics research focused on automating part of the process. Currently, it appears that a computer program cannot replace the human component entirely. This is because certain terms can match more than one concept. For example, “anaesthesia” refers to a procedure ancillary to surgery, or to a clinical finding of loss of sensation. Disambiguation requires some degree of domain knowledge as well as knowledge of the context where the phrase was encountered. The processing of narrative text is a computer-science speciality in its own right, and a preceding article has discussed it in depth.
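A toy keyword-lookup encoder shows both the benefit of coding synonyms and the ambiguity problem. The concept codes below (C_LIVER and so on) are invented placeholders, not SNOMED or ICD-9 codes, and real narrative processing is far more involved than splitting on whitespace.

```python
# Hypothetical keyword-to-concept dictionary; codes are made-up placeholders.
CONCEPTS = {
    "hepatic": ["C_LIVER"], "liver": ["C_LIVER"],
    "renal": ["C_KIDNEY"], "kidney": ["C_KIDNEY"],
    "vomiting": ["C_VOMITING"], "emesis": ["C_VOMITING"],
    # One keyword, two candidate concepts: needs a human (or context) to decide.
    "anaesthesia": ["C_ANAESTHESIA_PROCEDURE", "C_LOSS_OF_SENSATION"],
}


def encode(narrative):
    """Return (unambiguous concept codes, ambiguous keywords with candidates)."""
    coded, ambiguous = set(), {}
    words = narrative.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        candidates = CONCEPTS.get(word, [])
        if len(candidates) == 1:
            coded.add(candidates[0])
        elif len(candidates) > 1:
            ambiguous[word] = candidates
    return coded, ambiguous


coded, ambiguous = encode(
    "Emesis noted. Renal function normal, anaesthesia of left hand."
)
```

The synonyms collapse to single codes automatically, while the ambiguous keyword is set aside for review rather than guessed at.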
Medical knowledge-based consultation programs (“expert systems”) have always been an active area of medical informatics research, and a few of these, e.g., QMR, have attained production-level status. A drawback of many of these programs is that they are designed to be stand-alone. While useful for assisting diagnosis or management, they require that information that may already be in the patient’s electronic record be re-entered through a dialog between the program and the clinician. In the context of a hospital, it is desirable to implement embedded knowledge-based systems that can act on patient data as it is being recorded or generated, rather than after the fact (when it is often too late). Such a program might, for example, detect potentially dangerous drug interactions based on a particular patient’s prescription that had just been recorded in the pharmacy component of the CPRS. Alternatively, a program might send an alert (by pager) to a clinician if a particular patient’s monitored clinical parameters deteriorated severely.
The units of program code that operate on incoming patient data in real-time are called medical logic modules (MLMs), because they are used to express medical decision logic. While one could theoretically use any programming language (combined with a database access language) to express this logic, portability is an important issue: if you have spent much effort creating an MLM, you would like to share it with others. Ideally, others would not have to rewrite your MLM to run on their system, but could install and use it directly. Standardization is therefore desirable. In 1994, several CPRS researchers proposed a standard MLM language called the Arden syntax. Arden resembles BASIC (it is designed to be easy to learn), but has several functions that are useful to express medical logic, such as the concepts of the earliest and the latest patient events. One must first implement an Arden interpreter or compiler for a particular CPRS, and then write Arden modules that will be triggered after certain events. The Arden code is translated into specific database operations on the CPRS that retrieve the appropriate patient data items, and operations implementing the logic and decision based on that data. As with any programming language, interpreter implementation is not a simple task, but it has been done for the Columbia-Presbyterian and HELP CDRs: two of the informaticians responsible for defining Arden, Profs. George Hripcsak and T. Allan Pryor, are also lead developers for these respective systems. To assist Arden implementers, the specification of version 2 of Arden, which is now a standard supported by HL7, is available on-line.
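The flavour of an MLM can be conveyed with a small sketch. Note that this is Python, not Arden syntax, and that the helper `latest`, the rule `high_potassium_rule` and the threshold of 6.0 are hypothetical illustrations with no clinical authority.

```python
def latest(events, attribute):
    """Return the most recent event for an attribute, or None.
    Stands in for the 'latest patient event' concept mentioned above."""
    matching = [e for e in events if e["attribute"] == attribute]
    return max(matching, key=lambda e: e["time"]) if matching else None


def high_potassium_rule(events, threshold=6.0):
    """An if-then rule meant to be triggered when a new result is stored.
    The threshold is an arbitrary example value, not a clinical recommendation."""
    result = latest(events, "serum potassium")
    if result is not None and result["value"] > threshold:
        return f"ALERT: serum potassium {result['value']} at {result['time']}"
    return None


events = [
    {"attribute": "serum potassium", "time": "2000-03-01T08:00", "value": 4.1},
    {"attribute": "serum potassium", "time": "2000-03-02T08:00", "value": 6.4},
]
alert = high_potassium_rule(events)
```

In a real CPRS the `events` list would be replaced by database retrievals, which is the part an Arden implementation has to map onto each system's own storage.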
Arden-style MLMs, which are essentially “if-then-else” rules, are not the only way to implement embedded decision logic. In certain situations, there are sometimes more efficient ways of achieving the desired result. For example, to detect drug interactions in a pharmacy order, a program can generate all possible pairs of drugs from the list of prescribed drugs in a particular pharmacy order, and perform database lookups in a table of known interactions, where information is typically stored against a pair of drugs. (The table of interactions is typically obtained from sources such as First Data Bank.) This is a much more efficient (and more maintainable) solution than sequentially evaluating a large list of rules embodied in multiple MLMs.
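The pair-lookup approach to drug interactions can be sketched as follows. The drug names and the interaction note are placeholders, and the in-memory dictionary merely stands in for a table of known interactions obtained from a commercial source.

```python
from itertools import combinations

# Placeholder interaction table keyed by an unordered pair of drugs.
KNOWN_INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "hypothetical interaction note",
}


def find_interactions(order):
    """Generate every pair of drugs in the order and look each pair up.
    For n drugs this is n*(n-1)/2 lookups, independent of the table's size."""
    found = []
    for first, second in combinations(sorted(set(order)), 2):
        note = KNOWN_INTERACTIONS.get(frozenset({first, second}))
        if note is not None:
            found.append((first, second, note))
    return found


hits = find_interactions(["drug_c", "drug_b", "drug_a"])
```

Using an unordered pair as the key means the lookup succeeds regardless of the order in which the two drugs were prescribed. Adding a newly discovered interaction means adding a row to the table, not writing another rule.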
Nonetheless, appropriately designed MLMs can be an important part of the CPRS, and Arden deserves to become more widespread in commercial CPRSs. Its currently limited support in such systems is more due to the significant implementation effort than to any flaw in the concept of MLMs.
Patient management software in a hospital is typically acquired from more than one vendor: many vendors specialize in niche markets such as picture archiving systems or laboratory information systems. The patient record is therefore often distributed across several components, and it is essential that these components be able to inter-operate with each other. Also, for various reasons, an institution may choose to switch vendors, and it is desirable that migration of existing data to another system be as painless as possible. Data exchange/migration is facilitated by standardization of data interchange between systems created by different vendors, as well as of the metadata that supports system operation. Significant progress has been made on the former front. The standard formats used for the exchange of image data and non-image medical data are DICOM (Digital Imaging and Communications in Medicine) and HL-7 (Health Level 7) respectively. For example, all vendors who market digital radiography, CT or MRI devices are supposed to be able to support DICOM, irrespective of what data format their programs use internally. HL-7 is a hierarchical format that is based on a language specification syntax called ASN.1 (ASN=Abstract Syntax Notation), a standard originally created for exchange of data between libraries. HL-7’s specification is quite complex, and HL-7 is intended for computers rather than humans, to whom it can be quite cryptic. There is a move to wrap HL-7 within (or replace it with) an equivalent dialect of the more human-understandable XML (eXtensible Markup Language), which has rapidly gained prominence as a data interchange standard in E-commerce and other areas. XML also has the advantage that there are a very large number of third-party XML tools available: for a vendor just entering the medical field, an interchange standard based on XML would be considerably easier to implement.
CPRSs pose formidable informatics challenges, not all of which have been fully solved: solutions devised by researchers are not always successful when implemented in production systems. An issue for further discussion is security and confidentiality of patient records. In countries such as the US where health insurers and employers can arbitrarily reject individuals with particular illnesses as posing too high a risk to be profitably insured or employed, it is important that patient information should not fall into the wrong hands. Much also depends on the code of honour of the individual clinician who is authorised to look at patient data. In their book, “Freedom at Midnight,” authors Larry Collins and Dominique Lapierre cite the example of Mohammed Ali Jinnah’s anonymous physician (supposedly Rustom Jal Vakil) who had discovered that his patient was dying of lung cancer. Had Nehru and others come to know this, they might have prolonged the partition discussions indefinitely. Because Dr. Vakil respected his patient’s confidentiality, however, world history was changed.