Information retrieval in medicine: overview and applications
P. M. Nadkarni
Centre for Medical Informatics, Yale University School of Medicine, New Haven, Connecticut 06520-8009, USA.
PMID: 0011013482
Keywords: Databases, Bibliographic; Information Storage and Retrieval
Information Retrieval (IR) is a branch of computer science that is concerned with the processing of collections of documents containing “free text.” Examples of such collections are a set of hospital discharge summaries, radiology reports, or surgery notes, or (in a non-medical context) the full text of the complete works of Shakespeare or the Bible. In contrast to a spreadsheet or a database table, which is divided into rows and columns, such documents have no obvious structure: any structure that is imposed is purely artificial, highly variable, and is of little use in retrieving information. Thus, while the Bible is divided into books, chapters and, occasionally, verses, a reader is more interested in locating a section by the phrases it contains (e.g., the name of particular Biblical characters or a quote). Similarly, the “structure” of a Chest X-Ray report, in terms of the headings under which the radiograph is described, is different from the structure of a Barium swallow or Intravenous Pyelogram report.
IR began as an offshoot of “information theory,” a field defined in a classic paper by Claude Shannon of Bell Laboratories in 1949. (Shannon and Weaver’s text describes this work for a general audience.) However, “information” was defined in a very broad sense. Some of the work in this field considered practical problems such as how to compress data without loss (e.g., Ziv and Lempel) and how to add redundant (extra) information so that data transmission or storage would be reliable despite physical damage to the medium or noisy transmission (e.g., Reed and Solomon). The focus on textual information can be traced to several researchers, most notably the late Gerard Salton of Cornell University, who wrote the definitive textbook on IR. Salton’s group performed research (using a system called SMART) that defined the methodology of the field. Because of their applicability to the textual component of the clinical patient record, IR techniques have found ready application in medicine: a good reference for medical IR is Hersh.
For several decades, IR was an "orphan" technology researched by a handful of scientists, and commercial offerings were sparse and modestly featured. However, with the spread of the World Wide Web, IR has entered the mainstream, because most of the information on the Web is textual. Web search engines such as Yahoo, Excite, AltaVista and AskJeeves are used by millions of users to locate information on Web pages across the world on any topic. The use of search engines has spread to the point where, for people with access to the Internet, the World Wide Web has replaced the library as the reference tool of first choice. Search engines can also be purchased commercially by those who wish to provide search facilities for their own Web sites. For example, Microsoft bundles a component called Microsoft Index Server with its Web server software. IR technology is also the basis of MEDLINE access as implemented through PubMed. (A subsequent article in this series will discuss MEDLINE article retrieval from the user’s perspective. This article focuses on the principles underlying IR technology.)
Search engines rely on pre-processing a document collection for subsequent fast retrieval based on keywords. This pre-processing step is termed indexing. The steps of indexing documents by the words contained in them are described below. Some of the operations involved are also employed in concept-identification algorithms.
* Every document in a collection is scanned word-by-word, skipping words belonging to a stop-word list. Stop-words are very common words in the language or domain that are not useful for searching (e.g., "the", "an", "of", etc). It is important to note that, in specific contexts, words uncommon in the language as a whole may be stop-words. Thus, for a collection of surgical notes, "surgery" would be a stop word. Conversely, when processing medical text, certain common words should not be treated as stop words because they qualify medically relevant concepts. Thus, “greater” and “lesser” specify a femoral trochanter.
* One or more disk-based structures (indexes) are created. The global word-frequency index records, for each distinct word, how many times it occurs in the entire collection. The document word-frequency index records how often a particular word occurs in each document. An optional proximity index records the position of individual words within the document as word, sentence, or paragraph offsets. (A proximity index takes up lots of space, but is useful for searches where one can specify two or more words and require that they be within the same sentence, or next to each other, or within so many words of each other).
* The index stores words either as they occur in the text, or after transformation. One type of transformation is normalization, which involves lower-case conversion and removal of variations in person or tense. Thus, “children” is normalized to “child,” and “brought” is normalized to “bring.” Another transformation is stemming, where a word undergoes progressive suffix removal or letter substitutions to yield a “root” form. Stemming is more drastic than normalization, often yielding a root that is not a word in the language. Thus, “hypothesis,” “hypothesize” and “hypothetical” yield the same root, “hypothes.” Stemming uses simple linguistic rules (e.g., removing suffixes like “-ing”) rather than domain knowledge: information on synonyms (e.g., liver, hepatic) is not used in stemming. The most widely used technique for stemming is the Porter algorithm. Both normalization and stemming increase sensitivity, but at the cost of specificity. (In the context of IR, sensitivity, also called recall, is defined as the ratio of the number of relevant documents in a collection that are retrieved in response to a query to the number of relevant documents actually present in that collection. Specificity, also called precision in the context of IR, is defined as the ratio of the number of relevant documents retrieved by a query to the total number of documents retrieved: some retrieved documents may turn out to be irrelevant to the query. By analogy with the evaluation of screening tests in diagnostic medicine, the higher the sensitivity, the lower the false negative rate; similarly, specificity is negatively correlated with the false positive rate.) The original Porter algorithm is known to have problems in stemming words with Greco-Latin suffixes (-um, -ae, -ii, -us, etc.), which often occur in medicine. The published C language implementation of the Porter algorithm is, however, driven by rules stored in tables, and such suffixes can be managed by adding more rules.
* For a large document collection, it is wasteful of disk space to store the words themselves in the index. Instead, a table is created to store every unique word in the collection, and every word is assigned a numeric ID. This ID is used in the index.
* Every document in the collection is also given an ID. In the medical field, each document applies to a particular patient, and so one maintains a table of document IDs versus patient ID. This information is used as a bridge between the free text and other non-free-text information (e.g., laboratory test values) on that patient, which is stored elsewhere in the clinical patient record system.
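The indexing steps above can be sketched in Python. The documents and stop-word list below are toy examples invented for illustration; a production engine would use disk-based structures and a full stemmer such as Porter's.

```python
from collections import Counter, defaultdict

# Toy stop-word list and document collection (invented for illustration).
STOP_WORDS = {"the", "a", "an", "of", "was", "and", "in"}

documents = {  # document ID -> text
    1: "the carotid endarterectomy was uncomplicated",
    2: "complications of carotid surgery",
}

def tokenize(text):
    # Lower-case the text and skip stop-words (a minimal normalization).
    return [w for w in text.lower().split() if w not in STOP_WORDS]

word_ids = {}                    # word -> numeric ID (the word table)
global_freq = Counter()          # word ID -> frequency in the whole collection
doc_freq = defaultdict(Counter)  # doc ID -> (word ID -> frequency in that doc)
proximity = defaultdict(list)    # (doc ID, word ID) -> word offsets

for doc_id, text in documents.items():
    for offset, word in enumerate(tokenize(text)):
        wid = word_ids.setdefault(word, len(word_ids))
        global_freq[wid] += 1
        doc_freq[doc_id][wid] += 1
        proximity[(doc_id, wid)].append(offset)
```

Note how the proximity index is by far the bulkiest structure: it stores one entry per word occurrence, rather than one per distinct word.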
There are two methods of searching documents using such indexes. The conceptually simpler (and older) Boolean method (named after the 19th century British mathematician George Boole, who devised a mathematical approach to logic) allows the user to specify keywords combined with the operators AND, OR and NOT. (These will be described in greater depth in the subsequent article on MEDLINE.) Boolean methods have been (unjustly) criticized on the grounds that Boolean logic is hard for end-users to understand and specify correctly. A more legitimate criticism of a purely Boolean approach is that the returned documents are not ranked in any way: if numerous documents are retrieved, the user must scan each one visually to determine how relevant it is to the query.
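Boolean retrieval over an inverted index reduces to set operations on posting lists. The sketch below uses in-memory Python sets and an invented three-document collection.

```python
# Toy collection: document ID -> text.
documents = {
    1: "carotid endarterectomy complications",
    2: "renal disease with hypertension",
    3: "carotid stenosis without complications",
}

# Inverted index: word -> set of document IDs containing it.
index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

all_docs = set(documents)

def lookup(word):
    return index.get(word, set())

# "carotid AND complications" -> set intersection.
hits = lookup("carotid") & lookup("complications")

# "carotid AND NOT endarterectomy" -> intersection with a complemented set.
hits_not = lookup("carotid") & (all_docs - lookup("endarterectomy"))
```

Note that `hits` is an unranked set: the Boolean model by itself imposes no ordering on the results, which is precisely the criticism raised above.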
Modern IR search engines (e.g., Web-based engines), which may retrieve hundreds of matches for a user’s query, therefore perform automatic relevance ranking, which is robust enough to handle situations where no document matches every single keyword specified by the user. (In the case of Boolean search, zero documents would be returned, while relevance ranking will still return partially matching documents.)
Relevance ranking is typically done through the Document Vector approach, originally devised by Salton’s group. Several variants of this approach exist. The measure of similarity between the terms in a query and a document being screened to see how closely it matches the query is called the “normalized cosine metric.” While the intricacies of how the metric is computed will not be described, the principle is that keywords in the query that are uncommon in the document collection are given more weight than keywords that are relatively common, and that documents that contain one or more of the query keywords multiple times are given more weight than documents that contain the keywords infrequently or not at all. Thus, if we take a query with the keywords “carotid endarterectomy complications,” the first two words would be less common, and therefore more discriminating, than the third. Similarly, documents that contained the words “carotid” and “endarterectomy” many times would be likely to have these words as a theme, and would therefore be more relevant to the query than documents that mentioned the phrase in passing only once.
The word “normalized” in the metric refers to the fact that the frequency with which a particular term occurs in a document is adjusted for the length of the document. Thus, a 200-word abstract that mentioned “carotid” three times would be more relevant than a 10,000-word article that mentioned it four times. The normalized cosine metric can also be used to compare two documents for similarity: PubMed does this, as described later. The word “cosine” in the metric implies that the metric is a number varying between 0 and 1 (as does the cosine of an angle between 0 and 90 degrees). If two documents are identical with respect to the terms they contain and the frequencies of those terms, the “angle” between them is zero (and cos(0) = 1). If the two have absolutely no terms in common (i.e., they are completely dissimilar), the “angle” between them is 90 degrees (and cos(90°) = 0). While relevance ranking sounds very complex and highly mathematical, the programming turns out to be quite simple, and relevance ranking amounts to sorting the matching documents in descending order of cosine.
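The principle can be sketched as follows. The weighting scheme below (term frequency times inverse document frequency, with length normalization) is one common variant, not necessarily the exact formula used by SMART or PubMed, and the documents are invented.

```python
import math
from collections import Counter

# Toy collection; a real system precomputes these tables from its indexes.
documents = {
    1: "carotid endarterectomy complications of carotid endarterectomy",
    2: "postoperative complications of hip surgery",
    3: "carotid artery anatomy",
}
N = len(documents)
tokenized = {d: text.lower().split() for d, text in documents.items()}

df = Counter()  # document frequency: how many documents contain each word
for words in tokenized.values():
    df.update(set(words))

def tfidf_vector(words):
    # Uncommon words (low document frequency) get higher weight...
    tf = Counter(words)
    vec = {w: tf[w] * math.log(1 + N / df.get(w, 1)) for w in tf}
    # ...and the vector is scaled to unit length, so document length cancels out.
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()}

def cosine(u, v):
    # Dot product of two unit vectors: 0 (no terms shared) to 1 (identical).
    return sum(weight * v[w] for w, weight in u.items() if w in v)

query = tfidf_vector("carotid endarterectomy complications".split())
ranked = sorted(tokenized,
                key=lambda d: cosine(query, tfidf_vector(tokenized[d])),
                reverse=True)
```

Document 1, which contains all three query keywords (two of them repeatedly), ranks first.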
Modern search engines use a combination of Boolean and vector methods, with the former being used as a filter to restrict the search. Most search engines accept the character + before a word (with no intervening space) to mandate that all retrieved documents must contain that word. In other words, + implies the AND operator. Similarly the character - before a word indicates that the retrieved documents should NOT contain that word. (PubMed uses a slightly different syntax, as a subsequent article will illustrate.) All documents that match the Boolean criteria are then ranked by relevance.
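Parsing the + and - syntax into Boolean filters can be sketched as follows. This is a simplification: real engines also handle quoted phrases, field qualifiers, and other syntax.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and free terms.

    Required terms act as an AND filter, excluded terms as a NOT filter,
    and the remaining free terms are left for relevance ranking.
    """
    required, excluded, free_terms = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            free_terms.add(token)
    return required, excluded, free_terms

required, excluded, free_terms = parse_query(
    "+carotid -pediatric endarterectomy complications")
```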
Thus far, we have discussed electronic indexing of documents based on the words within them. For documents belonging to a single domain such as medicine, word indexing does not capitalize on domain knowledge. For example, synonymous phrases are not automatically recognized. Searches of medical free text that is indexed only by word would require a user to manually specify synonymous forms, or else risk missing relevant documents.
Using concepts in a domain-specific vocabulary (thesaurus) can enhance retrieval: one can index the concepts identified in a document. Indexing by concept is similar in principle to indexing by word. Text is processed to recognize potential terms, which are then looked up in a vocabulary. The ID of the concept in the vocabulary is used instead of a word ID. As will be discussed in a subsequent article, the US National Library of Medicine has historically used manual concept recognition using trained human indexers who read a publication carefully before assigning concepts to it. Here, we discuss automated approaches that have attempted to use computer programs for the same task.
In medicine, concept-indexing research has used a variety of vocabularies, depending on the material that must be indexed. The biggest thesaurus in biomedicine is the US National Library of Medicine’s Unified Medical Language System (UMLS) Metathesaurus. This incorporates numerous other vocabularies, such as SNOMED (the Systematized Nomenclature of Medicine, used for pathological classification of disease), ICD-10 (the International Classification of Diseases, 10th revision, typically used for billing codes), DSM-IV (the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition), as well as MeSH (Medical Subject Headings), used by human MEDLINE indexers.
The UMLS’s contents centre on concepts, terms, strings, and words. A concept, such as hypertension, may be expressed in different ways: e.g., “hypertension”, “high blood pressure disease” and “hypertensive vascular disease” are different synonymous forms, or terms, that refer to the same concept. (Every concept is given a unique ID.) The same term may be expressed in multiple string forms through variations such as transposition of words, punctuation, or differences in case, person and tense, e.g., “disease, hypertensive”. Finally, each string is composed of one or more words. The UMLS contains numerous cross-reference tables that greatly ease the programmer’s task. For example, one can directly locate all concepts containing a particular word. A concept can belong to one or more semantic categories (e.g., pharmacological substance, therapeutic procedure) and every term for a concept is tagged with the ID of the source vocabulary (e.g., ICD-10) from which it was taken. This information allows researchers to create UMLS data subsets for special purposes.
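The concept/term/word hierarchy, and the kind of word-to-concept cross-reference table the UMLS provides, can be illustrated with toy data. The IDs below are invented and are not real UMLS identifiers.

```python
# Concept ID -> preferred name (IDs invented for illustration).
concepts = {"C001": "hypertension"}

# Term (a synonymous form) -> concept ID.
terms = {
    "hypertension": "C001",
    "high blood pressure disease": "C001",
    "hypertensive vascular disease": "C001",
}

# Cross-reference table: word -> set of concept IDs whose terms contain it.
# This is the table that lets one "directly locate all concepts containing
# a particular word."
word_to_concepts = {}
for term, concept_id in terms.items():
    for word in term.split():
        word_to_concepts.setdefault(word, set()).add(concept_id)
```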
We now discuss the methods used for automated concept indexing.
Phrase-Based Methods use natural language processing (NLP) methods to scan the text and identify words or phrases of interest, which are then used to search the thesaurus. Most research has utilized phrases representing nouns (e.g., “obstructive lung disease”), as in the work of Elkin et al. Aronson and Rindflesch at the National Library of Medicine have augmented this approach, as described in several published papers. If the entire phrase does not have a match in the vocabulary, one generates subsets of the phrase and searches the vocabulary for matches against these subsets. Thus, “digitalis-induced atrial fibrillation” does not match any concept in the UMLS, but the sub-phrases “digitalis” and “atrial fibrillation” do.
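The sub-phrase fallback can be sketched as follows. For simplicity, this sketch considers only contiguous sub-phrases, tried longest first, against a tiny vocabulary standing in for a real thesaurus such as the UMLS.

```python
# Toy vocabulary (a stand-in for a real thesaurus).
vocabulary = {"digitalis", "atrial fibrillation", "fibrillation"}

def match_subphrases(phrase):
    """Match a phrase against the vocabulary, falling back to sub-phrases."""
    words = phrase.lower().replace("-", " ").split()
    matches = []
    covered = [False] * len(words)  # words already claimed by a match
    # Try longer sub-phrases first, so "atrial fibrillation" beats the
    # shorter match "fibrillation".
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            if any(covered[start:start + length]):
                continue
            candidate = " ".join(words[start:start + length])
            if candidate in vocabulary:
                matches.append(candidate)
                for i in range(start, start + length):
                    covered[i] = True
    return matches

found = match_subphrases("digitalis-induced atrial fibrillation")
```

The whole phrase fails to match, but the sub-phrases “atrial fibrillation” and “digitalis” are found; “induced” remains unmatched.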
A problem with using noun phrases alone is that, in medical text, many concepts can only be correctly identified through other parts of speech that are close to the noun phrase, e.g., “blood pressure was greatly elevated”, which implies hypertension, as opposed to blood pressure alone. Verb phrases such as “surgically resected” are intrinsically meaningful, and cannot be excluded. Further, the same concept may be divided across two noun phrases, e.g., “hypertension is secondary to renal disease,” which indicates renal hypertension.
In an attempt to overcome the limitations of single noun phrases, alternative approaches have attempted to use larger units of text, such as entire sentences, as in the SAPHIRE family of algorithms devised by Hersh. Using larger text units yields more sensitivity, but at the cost of specificity; that is, there is an increased risk of false positives. For example, the text segment “spleen rupture and normal stomach” in an emergency surgery note will match the concept of “stomach rupture” which, while it is a subset of this phrase, is not implied in the note. In addition, using an entire sentence is no guarantee of success. In the admittedly artificial example, “Blood pressure was recorded in the supine position. It was found to be greatly elevated,” the concept of hypertension is split across two sentences. Finally, the time required to generate subsets of the input phrase (and search the vocabulary) varies exponentially with the number of words in the phrase, and can be prohibitive for large sentences. Production systems therefore tend to favour phrase-based approaches.
For medical text, detection of a concept in a document does not, per se, make that document relevant for that concept. The concept may refer to a finding that was looked for but absent or ruled out, or that occurred in the remote past. The recording of “significant negatives” is important in medicine, and robust recognition of negation in narrative is currently a major challenge.
Concept recognition addresses the problem of synonyms (different phrases for the same concept) but fails in the presence of homonyms (words/phrases with multiple meanings). For example, the term “anaesthesia” by itself can indicate either a procedure for pain relief, or a clinical finding of loss of sensation. Disambiguation (the process of identifying a unique match from several possible matches) can be performed based on neighbouring words. Thus, the phrase “endotracheal anaesthesia” can be uniquely matched. However, if the phrase that assists disambiguation is present further away, e.g. in the previous or succeeding sentence, the problem is much more difficult. Elaborate rules have been manually devised for limited sets of words that would favour one concept over the other. For example, mention of an anatomic location (such as the dorsum of the foot) would favour the interpretation of “anaesthesia” as a clinical finding. However, the approach of manual rule creation does not scale up: the 1999 edition of the UMLS lists 13,688 ambiguous term entries, and devising rules for each entry manually would be a Herculean task. There may be other methods, however, that are less labour intensive than manually devising rules (e.g., machine learning) that have yet to be explored.
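A toy version of rule-based disambiguation by neighbouring words is sketched below. The cue-word lists here are invented for illustration; as noted above, a real system would need far richer rules (or a less labour-intensive method such as machine learning).

```python
# Cue words favouring each sense of "anaesthesia" (invented rules).
sense_cues = {
    "procedure": {"endotracheal", "spinal", "general", "induction"},
    "finding": {"dorsum", "foot", "sensation", "numbness", "distribution"},
}

def disambiguate(context_words):
    # Score each sense by how many of its cue words appear in the context.
    scores = {sense: len(cues & set(context_words))
              for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "ambiguous"

sense_a = disambiguate("endotracheal anaesthesia".split())
sense_b = disambiguate("anaesthesia of the dorsum of the foot".split())
```

The scheme works only when the disambiguating words fall inside the context window: if they occur in a neighbouring sentence, the function returns "ambiguous".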
The presence of elisions, neologisms, or abbreviations in medical text complicates automated concept recognition. An elision is a phrase with missing words: a domain expert can still disambiguate the phrase based on its context. For example, in the phrase “the white count was 1800”, “white count” is an elided form of “white blood cell count”. A neologism is a newly coined word that would not be expected to be in a dictionary. An example is “coumadinize”, for the act of administering coumadin (warfarin) until satisfactory anticoagulation is attained. Abbreviations that are also words in their own right can confuse a computer program, e.g., “RAT” can mean “recurrent acute tonsillitis” to an ENT surgeon. The UMLS contains numerous entries for elisions and abbreviations, but is by no means complete: it is therefore updated continually, with new data being released annually for use by informaticians.
Some of the practical problems faced in automated concept indexing are discussed elsewhere in the literature.
While medical text is part of the patient record, it must coexist with more structured parts of the record that are stored in tables of a traditional relational database, for example, demographic information or laboratory results. Historically, however, such integration has been lacking, and hospital-based software developers have had to do such integration themselves. Over the last two years, the situation has improved considerably as relational database system vendors have integrated IR technology into their products. Thus, Microsoft’s SQL Server version 7 incorporates a full-text indexing engine, as does the “ConText option” of Oracle versions 7 and greater. Such integration enables the vendor to offer a "total solution" for data management.
Feature-wise, however, these offerings currently lag greatly behind those of companies that specialize in offering IR technology alone. For example, dtSearch, a popular package that provides the search engine for the well-known Physician’s Desk Reference (PDR) database, goes beyond Microsoft and Oracle in several ways, e.g., letting the developer provide a custom thesaurus, as well as allowing proximity searching (which has been discussed earlier in the section on document indexing). While it is possible to build one’s own indexing and search engine if programming expertise is available (several engines, such as INQUERY, are freely available for researchers), it is best to go with a commercial engine for production purposes. The commercial engines have various extra features that would be highly tedious to program by oneself. For example, they understand documents in their native formats (e.g., HTML, MS-Word, and PDF). Other packages, such as IBM’s Intelligent Miner for Text, constitute a full-fledged development environment that consists of multiple text analysis tools, a “Web crawler” that will search the Web on a periodic basis for keywords of interest, and a relational database engine where the results that have been gathered can be stored.
For many queries of medical data, it makes sense to combine a query of data present in traditional relational tables with that present in free text. For example, one might want to search for patients of a particular age, sex, and diagnosis who presented with particular clinical findings. Age, sex, and diagnosis are stored in traditional tables, while clinical presentation information may be stored in narrative. Because querying relational tables for laboratory test values or diagnosis codes is much more efficient than querying free text, the relational-table query step can be used to reduce the number of documents that must be searched to identify patients matching a particular set of criteria. Vasanthakumar et al have described a system allowing integrated query of traditional and textual data. MS SQL Server and Oracle allow integrated query through proprietary extensions to the database system’s query language for querying text.
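The two-stage strategy can be sketched as follows: a cheap relational filter first, then a text search restricted to the surviving documents. All tables and values below are invented for illustration.

```python
# Structured (relational) data: patient ID -> demographics.
patients = {101: {"age": 67, "sex": "M"}, 102: {"age": 34, "sex": "F"}}

# Bridge table: document ID -> patient ID (as described earlier).
doc_to_patient = {1: 101, 2: 102, 3: 101}

# Free-text narratives.
doc_text = {
    1: "chest pain on exertion",
    2: "routine prenatal visit",
    3: "follow-up visit for chest pain",
}

# Stage 1: cheap relational filter -- male patients over 60.
eligible = {pid for pid, row in patients.items()
            if row["sex"] == "M" and row["age"] > 60}
candidates = {d for d, pid in doc_to_patient.items() if pid in eligible}

# Stage 2: free-text search restricted to the candidate documents.
hits = sorted(d for d in candidates if "chest pain" in doc_text[d])
```

In a production system, stage 1 would be an SQL query against indexed tables and stage 2 a call into the text engine; the saving comes from never scanning documents for ineligible patients.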
When a medical institution is computerizing its medical records, one must have a means of accessing old paper-based data that may go back several decades. Re-entering this information into a computer by hand is extremely labour-intensive, and document management systems (DMSs) allow one to scan the paper records instead. The resultant images can be cross-referenced against essential demographic data on the patient, such as medical record number, date of admission/discharge/death, primary diagnosis and so on. (There is often no choice but to hand-enter this demographic data.) If the information on paper has been typed or printed rather than hand-written, one can often use optical character recognition (OCR) on these images. OCR generates text files from the images, and one can then use IR technology, which is bundled into several DMSs, to index these files so that they can be searched usefully.
We have already discussed the document vector approach for relevance ranking. Using the same technique, one can compare two documents for similarity just as one compares a document vector with a query vector. When one gathers a series of documents (passages) written by a particular author, the document vectors for these passages collectively act, to some extent, as a literary fingerprint for the author, because particular authors tend to use some words more often than others. (Other statistics, such as sentence length and the proportion of verbs in the passive voice, can also be used for fingerprinting.) This knowledge has been used in various ways.
In the field of literature, if one collects several passages from different authors, document-vector methods can be used to cluster the documents into groups based on similarities between them. With enough passages per author, one typically observes documents from a particular author falling into their own special cluster. This method is reliable enough that if one had a suitably long passage by an author whose identity was uncertain (but about whom it was known that s/he was one of the profiled authors), document-vector methods could provide a reasonable guess as to the author’s identity. This technique has been used in detecting literary forgeries (as in the “Hitler diaries”, where the text did not quite match Hitler’s writing style in his autobiography, “Mein Kampf”). Similarly, a comparison of passages of William Shakespeare with those of Francis Bacon indicates that it is unlikely that Bacon wrote the plays of Shakespeare (as has been claimed by some). During the Cold War, "Kremlin Watchers" in the CIA were able to classify a particular communiqué published in Pravda as having come from one of a small handful of anonymous writers based upon which cluster the communiqué matched most closely. (These writers’ identities were never known, so they were given fictitious names.)
The PubMed database, maintained by the US National Center for Biotechnology Information (part of the National Library of Medicine), provides public access to all of MEDLINE over the Web. It uses an algorithm devised by NCBI's John Wilbur to compute similarity. For every article in MEDLINE, supercomputer-class machines periodically determine which other articles in MEDLINE are similar to it (in terms of passing a statistical significance threshold). This information is stored in the PubMed database so that, if one is looking at a particular reference, one can instantly find all other references in the database that have been determined to be similar, by clicking a button on the Web form. Computation of the similarity metric uses the full text of the article where available electronically: otherwise, only the abstract is used. The similarity algorithm is quite powerful, and it has been successfully used for a purpose not originally intended by Wilbur. The journal Science reported that a researcher using PubMed discovered a case of serial plagiarism. Several older documents flagged as highly similar to papers published by a particular Polish scientist had in fact been translated into Polish without acknowledgment, and passed off as the scientist’s own work. Wilbur’s algorithm was able to identify the similarities despite the fact that it was working with translations in the latter case.
Information retrieval technology is the hidden workhorse that underlies the powerful Web-searching facilities that all of us take for granted today. It constitutes a valuable addition to the armamentarium of the developer of computer software. Just as relational database packages have become progressively easier to use and administer, it is possible that future versions of microcomputer database packages will include IR components.