What is the coverage of SNOMED CT® on scientific medical corpora?

Author Affiliation: Centre for Language Technology, Department of Swedish Language, the Swedish Language Bank, University of Gothenburg, Gothenburg, Sweden. dimitrios.kokkinakis@svenska.gu.se

Keywords: Humans; Information Storage and Retrieval; Language; Medical Informatics - methods; Medical Records Systems, Computerized; Programming Languages; Reproducibility of Results; Sweden; Systematized Nomenclature of Medicine; Terminology as Topic; Vocabulary, Controlled

Abstract: This paper reports the results of a large-scale mapping of SNOMED CT onto scientific medical corpora. The aim is to automatically assess the validity, reliability and coverage of the Swedish SNOMED CT translation, the largest and most extensive available resource of medical terminology. The method is based on generating predominantly safe harbor term variants, which, together with simple linguistic processing and the existing SNOMED term content, are mapped to large corpora. The results show that term variation is very frequent, which may have implications for technological applications that use SNOMED CT, such as indexing and information retrieval, decision support systems, and text mining. Naïve approaches to terminology mapping and indexing would critically affect the performance, success and results of such applications. SNOMED CT does not appear well suited to automatically capturing the enormous variety of concepts in scientific corpora: only 6.3% of all SNOMED terms could be directly matched to the corpus. Raising that figure requires generating extensive variant forms and applying fuzzy and partial matching techniques, at the risk of recognizing a large number of false positives and spurious results.
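
The contrast between direct matching and variant-based matching can be illustrated with a minimal Python sketch. It is not the paper's implementation: the variant rules (hyphenated and closed-compound spellings), the function names `normalize`, `variants` and `coverage`, and the toy terms and corpus are all assumptions made for illustration, and the English examples stand in for Swedish SNOMED CT terms.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of hyphens/whitespace into single spaces."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()


def variants(term: str) -> set[str]:
    """Generate a few conservative spelling variants (hypothetical rules)."""
    base = normalize(term)
    forms = {base}
    if " " in base:
        forms.add(base.replace(" ", "-"))  # hyphenated spelling
        forms.add(base.replace(" ", ""))   # closed-compound spelling
    return forms


def coverage(terms: list[str], corpus: str, use_variants: bool = False) -> float:
    """Fraction of terms that occur in the corpus as whole-word matches."""
    text = corpus.lower()
    matched = 0
    for term in terms:
        candidates = variants(term) if use_variants else {term.lower()}
        # Lookarounds keep a candidate from matching inside a longer word.
        if any(
            re.search(r"(?<!\w)" + re.escape(c) + r"(?!\w)", text)
            for c in candidates
        ):
            matched += 1
    return matched / len(terms) if terms else 0.0


# Toy data, invented for illustration only.
toy_terms = [
    "heart failure",
    "diabetes mellitus",
    "blood pressure",
    "renal cell carcinoma",
]
toy_corpus = (
    "Patients with heart-failure and high bloodpressure were studied. "
    "Diabetes mellitus was common."
)

print(coverage(toy_terms, toy_corpus))                     # 0.25
print(coverage(toy_terms, toy_corpus, use_variants=True))  # 0.75
```

On this toy data, direct matching finds one term in four, while the variant forms recover two more. Looser rules, such as fuzzy or partial matching, would raise coverage further but would also admit the false positives the abstract warns about.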