Medical Document Categorization - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Medical Document Categorization


1
  • Medical Document Categorization
  • Using a Priori Knowledge

L. Itert (1,2), W. Duch (2,3), J. Pestian (1)
(1) Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, OH, USA
(2) Department of Informatics, Nicolaus Copernicus University, Torun, Poland
(3) School of Computer Engineering, Nanyang Technological University, Singapore
ICANN 2005, Warsaw, 10-14 Sept. 2005
2
Outline
  • Goals and questions
  • Medical data
  • Data preparation
  • Model of similarity
  • Computational experiments and results

3
Goals and Questions
  • What are the key clinical descriptors for a given
    disease?
  • In what sense are the records describing patients
    with the same diseases similar?
  • Can we capture experts' intuition in evaluating
    document similarity and diversity?
  • Include a priori knowledge in document
    categorization; this is especially important
    for rare diseases.
  • Use the UMLS ontology and NLM lexical tools.

4
Example of a clinical discharge summary
  • Jane is a 13yo WF who presented with CF
    bronchopneumonia. She has noticed increasing
    cough, greenish sputum production, and fatigue
    since prior to 12/8/03. She had 2 febrile
    episodes, but denied any nausea, vomiting,
    diarrhea, or change in appetite. Upon admission
    she had no history of diabetic or liver
    complications. Her FEV1 was 73% on 12/8 and she
    was treated with 2 z-paks, and on 12/29 FEV1 was
    72%, at which time she was started on Cipro. She
    noted no clinical improvement and was admitted
    for a 2 week IV treatment of Tobramycin and
    Meropenem.

5
Unified Medical Language System (UMLS)
  • Semantic types: "Virus", "Disease or Syndrome"
  • Semantic relation: "Virus" causes "Disease or
    Syndrome"
  • Other relations: interacts with, contains,
    consists of, result of, related to, ...
  • Other types: Body Location or Region, Injury
    or Poisoning, Diagnostic Procedure, ...

6
UMLS Example (keyword "virus")
  • Metathesaurus
  • Concept: Virus, CUI: C0042776, Semantic
    Type: Virus
  • Definition (1 of 3):
    Group of minute infectious agents
    characterized by a lack of independent metabolism
    and by the ability to replicate only within
    living host cells; have capsid, may have DNA or
    RNA (not both). (CRISP Thesaurus)
  • Synonyms: Virus, Vira, Viridae
  • Semantic Network
  • "Virus" causes "Disease or Syndrome"

7
Data
Disease name      Clinical data:   Clinical data:         Reference data:
                  no. of records   average size (bytes)   size (bytes)
Pneumonia         609              1451                   23583
Asthma            865              1282                   36720
Epilepsy          638              1598                   19418
Anemia            544              2849                   14282
UTI               298              1587                   13430
JRA               41               1816                   27024
Cystic fibrosis   283              1790                   7958
Cerebral palsy    177              1597                   35348
Otitis media      493              1420                   32416
Gastroenteritis   586              1375                   9906
JRA: Juvenile Rheumatoid Arthritis; UTI: Urinary tract infection
8
Data processing/preparation
MMTx discovers UMLS concepts in text.
Reference texts -> MMTx -> UMLS concepts (feature prototypes) -> Filtering (focus on 26 semantic types) -> Features (UMLS concept IDs)
Clinical documents -> MMTx -> UMLS concepts -> Filtering using existing space -> Final data
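The two flows above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (concept ID, semantic type) pairs are hand-made stand-ins for MMTx output, `CUI_A`, `CUI_B`, `CUI_C` and "Placeholder Type" are made-up values, the allowed set holds 2 semantic types rather than the 26 used in the talk, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: only two allowed semantic types here; the talk filters on 26.
ALLOWED_TYPES = {"Virus", "Disease or Syndrome"}

def build_feature_space(reference_concepts):
    """Keep reference concepts of allowed semantic types; map each ID to a column."""
    kept = sorted({cui for cui, sem_type in reference_concepts
                   if sem_type in ALLOWED_TYPES})
    return {cui: column for column, cui in enumerate(kept)}

def vectorize(document_concepts, feature_space):
    """Term-frequency vector of a document, restricted to the existing space."""
    counts = Counter(cui for cui, _ in document_concepts if cui in feature_space)
    vector = [0] * len(feature_space)
    for cui, n in counts.items():
        vector[feature_space[cui]] = n
    return vector

# Hand-made stand-ins for MMTx output (not real MMTx results).
reference = [
    ("C0042776", "Virus"),
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_B", "Placeholder Type"),      # dropped by the semantic-type filter
]
clinical = [
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_C", "Disease or Syndrome"),   # not in the reference space, so ignored
]

space = build_feature_space(reference)
vector = vectorize(clinical, space)
print(space, vector)
```

Concepts that occur only in clinical documents never become features in this sketch, which matches the "filtering using existing space" step.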
9
Semantic types used
Values indicate the actual numbers of concepts
found in (I) clinical texts and (II) reference texts.
[Table of semantic types not preserved in the transcript.]
10
Data - statistics
  • 10 classes
  • 4534 vectors
  • 807 features (out of 1097 found in reference
    texts)
  • Baseline:
      • Majority: 19.1% (asthma class)
      • Content-based: 34.6% (frequency of class name
        in text)
  • Remarks:
      • Very sparse vectors
      • Feature values represent term frequency (tf),
        i.e. the number of occurrences of a particular
        concept in text
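The majority baseline can be checked directly from the record counts on the Data slide. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Number of clinical records per disease, copied from the Data slide.
records = {
    "Pneumonia": 609, "Asthma": 865, "Epilepsy": 638, "Anemia": 544,
    "UTI": 298, "JRA": 41, "Cystic fibrosis": 283, "Cerebral palsy": 177,
    "Otitis media": 493, "Gastroenteritis": 586,
}

total = sum(records.values())                    # number of document vectors
majority_class = max(records, key=records.get)   # largest class
majority_baseline = 100.0 * records[majority_class] / total

print(total, majority_class, round(majority_baseline, 1))
```

Always predicting the largest class (asthma) gives 865 / 4534, about 19.1%, which agrees with the slide.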

11
Model of similarity I
  • Intuitions:
  • The initial distance between document $D$ and the
    reference vector $R_k$ should be proportional to
    $d_{0k} = \|D - R_k\| \propto 1/p(C_k) - 1$.
  • If a term $i$ appears in $R_k$ with frequency
    $R_{ik} > 0$ but does not appear in $D$, the
    distance $d(D, R_k)$ should increase by
    $\Delta_{ik} = a_1 R_{ik}$.
  • If a term $i$ does not appear in $R_k$ but has
    non-zero frequency $D_i$, the distance $d(D, R_k)$
    should increase by $\Delta_{ik} = a_2 D_i$.
  • If a term $i$ appears with frequency
    $R_{ik} > D_i > 0$ in both vectors, the distance
    $d(D, R_k)$ should decrease by
    $\Delta_{ik} = -a_3 D_i$.
  • If a term $i$ appears with frequency
    $0 < R_{ik} \le D_i$ in both vectors, the distance
    $d(D, R_k)$ should decrease by
    $\Delta_{ik} = -a_4 R_{ik}$.
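The four update rules above can be written as a small function. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the proportionality constant in the initial distance is assumed to be 1, and the adjusted distance is simply the initial distance plus the sum of the per-term changes.

```python
def term_change(d_i, r_ik, a):
    """Change in distance contributed by one term, following the four rules."""
    a1, a2, a3, a4 = a
    if r_ik > 0 and d_i == 0:      # term only in the reference vector
        return a1 * r_ik
    if r_ik == 0 and d_i > 0:      # term only in the document
        return a2 * d_i
    if r_ik > d_i > 0:             # in both, reference frequency larger
        return -a3 * d_i
    if 0 < r_ik <= d_i:            # in both, document frequency at least as large
        return -a4 * r_ik
    return 0.0                     # term absent from both vectors

def adjusted_distance(doc, ref, prior, a):
    """Initial distance 1/p(C_k) - 1 (constant assumed to be 1) plus all changes."""
    d0 = 1.0 / prior - 1.0
    return d0 + sum(term_change(d, r, a) for d, r in zip(doc, ref))

# Toy vectors over five terms, all four parameters set to 1 for readability.
doc = [0, 2, 1, 3, 0]
ref = [4, 0, 3, 2, 0]
result = adjusted_distance(doc, ref, prior=0.5, a=(1, 1, 1, 1))
print(result)
```

With these toy values the per-term changes are +4, +2, -1, -2 and 0, so the initial distance of 1 grows to 4.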

12
Model of Similarity II
Given the document $D$, a reference vector $R_k$ and
probabilities $p(i|C_k)$, the probability that the class of
$D$ is $C_k$ should be proportional to
[equation not preserved in the transcript]
where $\Delta_{ik}$ depends on adaptive parameters $a_1, \ldots, a_4$,
which may be specific for each class. A linear
programming technique can be used to estimate $a_i$
by maximizing the similarity between documents and
reference vectors, with the constraints
[equation not preserved in the transcript]
where $k$ indicates the correct class.
13
Results
Method           M0           M1           M2           M3           M4           M5
kNN              48.9         50.2         51.0         51.4         49.5         49.5
SSV              39.5         40.6         31.0         39.5         39.5         42.3
MLP (300 neur.)  66.0         56.5         60.7         63.2         72.3         71.0
SVM (C opt.)     59.3 (1.0)   60.4 (0.1)   60.9 (0.1)   60.5 (0.1)   59.8 (0.01)  60.0 (0.01)
10 Ref. vectors  71.6         -            71.4         71.3         70.7         70.1
10-fold crossvalidation accuracies in % for
different feature weightings. M0: tf
frequencies; M1: binary data.
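To make the first two weightings concrete, the sketch below turns a term-frequency vector (M0) into binary data (M1). It assumes that "binary" means presence or absence of a concept; the function name is illustrative, and the remaining weightings M2 to M5 are not defined in this transcript, so they are not shown.

```python
def to_binary(tf_vector):
    """M1-style features: 1 if the concept occurs at least once, else 0."""
    return [1 if count > 0 else 0 for count in tf_vector]

tf_vector = [0, 3, 1, 0]            # M0: raw concept counts for one document
binary_vector = to_binary(tf_vector)
print(binary_vector)
```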
14
Conclusions
  • Medical texts contain a large number of rare,
    specific concepts.
  • Vector representation using standard tf x idf
    weighting leads to poor results.
  • A priori knowledge was introduced using a
    single reference vector (this certainly needs
    improvement).
  • Expert intuitions were formalized in a model to
    measure similarity of text, with only 4
    parameters per class.
  • Linear programming has been used to optimize
    parameters.
  • Results are quite encouraging.
  • Finding the best set of reference vectors and
    similarity measures for medical documents is an
    interesting challenge.