Title: Modern Neoplasm Classification
 1 Modern Neoplasm Classification
Dept. of Pathology, University of Michigan
October 27, 2005
Jules J. Berman, Ph.D., M.D.
jjberman_at_alum.mit.edu
 2 What is a tumor classification?
A grouped taxonomy listing of all tumors, with the following properties:
- Inheritance: hierarchical structure, with each class of tumors inheriting the properties of its ancestors
- Uniqueness: each tumor occurs in only one place in the classification
- Comprehensive: all tumors are included
- Class-intransitive: a tumor from one class does not change into a tumor from another class (e.g. an adenocarcinoma does not become a lymphoma)
Ernst Mayr. The Growth of Biological Thought: Diversity, Evolution, and Inheritance. Cambridge: Belknap Press, 1982.
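The inheritance and uniqueness properties can be made concrete with a small data structure. The following is a minimal Python sketch, not the structure of the actual classification file: the class names `TumorClass` and `Classification`, the property keys, and the placement of adenocarcinoma under `endoderm_or_ectoderm` are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TumorClass:
    """One node in a hypothetical tumor classification tree."""
    name: str
    parent: Optional["TumorClass"] = None
    own_properties: Dict[str, str] = field(default_factory=dict)

    def lineage(self) -> List[str]:
        """Names from this class up to the root (the inheritance path)."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    def properties(self) -> Dict[str, str]:
        """Own properties merged over everything inherited from ancestors."""
        inherited = self.parent.properties() if self.parent else {}
        return {**inherited, **self.own_properties}


class Classification:
    """Registry that enforces uniqueness: one place per tumor name."""

    def __init__(self) -> None:
        self._classes: Dict[str, TumorClass] = {}

    def add(self, name, parent=None, **props) -> TumorClass:
        if name in self._classes:
            raise ValueError(f"{name!r} already has a place in the classification")
        parent_node = self._classes[parent] if parent else None
        node = TumorClass(name, parent_node, dict(props))
        self._classes[name] = node
        return node

    def get(self, name: str) -> TumorClass:
        return self._classes[name]


# Illustrative entries only; not taken from the real classification.
taxonomy = Classification()
taxonomy.add("neoplasm", is_neoplasm="yes")
taxonomy.add("endoderm_or_ectoderm", parent="neoplasm", lineage="endoderm/ectoderm")
taxonomy.add("adenocarcinoma", parent="endoderm_or_ectoderm")

print(taxonomy.get("adenocarcinoma").lineage())
print(taxonomy.get("adenocarcinoma").properties())
```

Because every class has exactly one parent and the registry rejects a second entry with the same name, a tumor cannot sit in two places. Class-intransitivity is a claim about biology rather than about the data structure, so the sketch does not try to enforce it.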
 3 Problems with current tumor classifications
A mixed bag of tumor classes based on:
- Anatomic site (roughly, the distance from the tumor to the floor, as in head and neck tumors)
- Clinical specialty (dermatologic tumors)
- Functional similarity of cell types (e.g. endocrine tumors)
Not based on any describable biologic premise.
 4 Molecular classification of cancer
The so-called molecular classifications (based largely on gene expression arrays of tumors) are simply a way of finding variants within a population. Mostly, you see experiments designed to cluster out variants of a tumor type (slow-growing, responsive to a specific treatment, prone to metastasize, etc.). This is simply not classification (it ignores the intransitive law), and in fact no classification has emerged from any of the work that's been done with molecular diagnostics.
My opinion: gene expression array studies do not create classifications, but they are very useful taxon finders.
 5 Developmental Lineage Classification and Taxonomy of Neoplasms
Similar to (but different from) the classification efforts of the 1950s (particularly Willis).
- Old hypothesis (more or less discredited): tumor development recapitulates embryologic development.
- New (my) hypothesis: tumors will tend to inherit the molecular pathways of their developmental ancestors. This may be helpful in selecting classes of tumors responsive to molecular targets.
Despite the difference in hypotheses, either way you end up with a classification that follows embryologic lines and that fits in with the stem cell hypothesis.
 7 Developmental Lineage Classification and Taxonomy of Neoplasms
- Now 145,000 terms (10 Megabytes)
- Publicly available and free
- The latest version is at www.pathologyinformatics.org
 8 53 ways of writing prostate cancer
Prostate cancer is the concept, the 53 synonyms are the terms for the concept, and C4863000 is the code.
<name nci-code="C4863000">prostate with adenoca</name>
<name nci-code="C4863000">adenoca arising in prostate</name>
<name nci-code="C4863000">adenoca involving prostate</name>
<name nci-code="C4863000">adenoca arising from prostate</name>
<name nci-code="C4863000">adenoca of prostate</name>
<name nci-code="C4863000">adenoca of the prostate</name>
<name nci-code="C4863000">prostate with adenocarcinoma</name>
<name nci-code="C4863000">adenocarcinoma arising in prostate</name>
<name nci-code="C4863000">adenocarcinoma involving prostate</name>
<name nci-code="C4863000">adenocarcinoma arising from prostate</name>
<name nci-code="C4863000">adenocarcinoma of prostate</name>
<name nci-code="C4863000">adenocarcinoma of the prostate</name>
<name nci-code="C4863000">adenocarcinoma arising in the prostate</name>
<name nci-code="C4863000">adenocarcinoma involving the prostate</name>
<name nci-code="C4863000">adenocarcinoma arising from the prostate</name>
<name nci-code="C4863000">prostate with ca</name>
<name nci-code="C4863000">ca arising in prostate</name>
<name nci-code="C4863000">ca involving prostate</name>
<name nci-code="C4863000">ca arising from prostate</name>
<name nci-code="C4863000">ca of prostate</name>
<name nci-code="C4863000">ca of the prostate</name>
<name nci-code="C4863000">prostate with cancer</name>
<name nci-code="C4863000">cancer arising in prostate</name>
<name nci-code="C4863000">cancer involving prostate</name>
<name nci-code="C4863000">cancer arising from prostate</name>
<name nci-code="C4863000">cancer of prostate</name>
 9 More
<name nci-code="C4863000">cancer of the prostate</name>
<name nci-code="C4863000">cancer arising in the prostate</name>
<name nci-code="C4863000">cancer involving the prostate</name>
<name nci-code="C4863000">cancer arising from the prostate</name>
<name nci-code="C4863000">prostate with carcinoma</name>
<name nci-code="C4863000">carcinoma arising in prostate</name>
<name nci-code="C4863000">carcinoma involving prostate</name>
<name nci-code="C4863000">carcinoma arising from prostate</name>
<name nci-code="C4863000">carcinoma of prostate</name>
<name nci-code="C4863000">carcinoma of the prostate</name>
<name nci-code="C4863000">carcinoma arising in the prostate</name>
<name nci-code="C4863000">carcinoma involving the prostate</name>
<name nci-code="C4863000">carcinoma arising from the prostate</name>
<name nci-code="C4863000">prostate adenoca</name>
<name nci-code="C4863000">prostate adenocarcinoma</name>
<name nci-code="C4863000">prostate ca</name>
<name nci-code="C4863000">prostate cancer</name>
<name nci-code="C4863000">prostate carcinoma</name>
<name nci-code="C4863000">prostatic cancer</name>
<name nci-code="C4863000">prostatic carcinoma</name>
<name nci-code="C4863000">prostatic adenocarcinoma</name>
<name nci-code="C4863000">prostate gland adenocarcinoma</name>
<name nci-code="C4863000">adenocarcinoma of the prostate gland</name>
<name nci-code="C4863000">adenocarcinoma of prostate gland</name>
<name nci-code="C4863000">prostate gland carcinoma</name>
<name nci-code="C4863000">carcinoma of the prostate gland</name>
<name nci-code="C4863000">carcinoma of prostate gland</name>
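The concept/term/code relationship maps naturally onto two lookup tables: many terms point to one code, and one code points back to all of its terms. Here is a minimal Python sketch, assuming the taxonomy lines have the `<name nci-code="...">term</name>` form shown on the slides; the function name `build_indexes` and the regular expression are illustrative, not part of the published taxonomy tools.

```python
import re
from collections import defaultdict

# Three of the 53 synonym lines from the slides.
SAMPLE = '''
<name nci-code="C4863000">adenoca of prostate</name>
<name nci-code="C4863000">prostate cancer</name>
<name nci-code="C4863000">carcinoma of the prostate gland</name>
'''

# Captures the code attribute and the term text of one <name> element.
NAME_LINE = re.compile(r'<name nci-code="([^"]*)">([^<]*)</name>')


def build_indexes(xml_text):
    """Return (term -> code, code -> list of terms) for every <name> line."""
    term_to_code = {}
    code_to_terms = defaultdict(list)
    for code, term in NAME_LINE.findall(xml_text):
        term = term.strip().lower()
        term_to_code[term] = code
        code_to_terms[code].append(term)
    return term_to_code, dict(code_to_terms)


term_to_code, code_to_terms = build_indexes(SAMPLE)
print(term_to_code["prostate cancer"])
print(len(code_to_terms["C4863000"]))
```

A regular expression is enough for this flat, one-element-per-line layout; a file with nested elements would call for a real XML parser.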
 10 Is the taxonomy comprehensive? Let's compare it 
with SNOMED. 
 11 Comparing the Developmental Lineage Classification with SNOMED
1. Used the 2005 version of UMLS (free from ww.nlm.gov)
2. Files (name, size in bytes, date):
   MRCON05    650,948,750    1-18-05
   MRCXT      1,610,612,736  1-18-05
   MRCXT2     1,610,612,736  1-18-05
   MRCXT3     1,610,612,736  1-18-05
   MRCXT4     1,610,612,736  1-18-05
   MRCXT5     1,610,612,736  1-18-05
   MRCXT6     1,610,612,736  1-18-05
   MRCXT7     1,196,031,492  1-18-05
4. Extracted the SNOMED CT terms from mrcon05 using the script
   MRCON05.PL 2,098          5-30-05
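 12 MRCON05.PL
#!/usr/bin/perl
# Extract the English-language SNOMED CT terms from the UMLS file mrcon05.
use strict;
use warnings;

my $start = time();
open(my $text, "<", "mrcon05") or die "Cannot open mrcon05: $!";
open(my $out, ">", "snom05") or die "Cannot open snom05: $!";
while (my $line = <$text>)
   {
   chomp $line;
   my @linearray = split(/\|/, $line);    # fields are pipe-delimited
   next if @linearray < 15;               # skip blank or short lines
   my $cuinumber = $linearray[0];         # concept unique identifier (CUI)
   my $language = $linearray[1];
   my $vocabulary = $linearray[11];       # source vocabulary
   next if ("ENG" ne $language);
   next if ("SNOMEDCT" ne $vocabulary);
   print $out "$cuinumber $linearray[14]\n";   # field 14 holds the term
   print "$cuinumber $linearray[14]\n";
   }
close($out);
close($text);
my $total = time() - $start;
print "\ntotal time was $total seconds\n";
exit;

Execution time of 132 seconds on a 2.89 GHz PC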
 13
5. This produced a 35 MByte file:
   SNOM05        35,127,210  5-30-05
6. Created a Perl script, neopull2.pl, that uses the mrcxt "Neoplasm" relationship to identify all the neoplasm CUIs in UMLS and to pull out any of the SNOMED terms that corresponded to a neoplasm CUI
7. The output file is:
   SNOM.OUT      567,372     5-30-05
8. This output file contains a lot of redundant terms and plurals, so I wrote snoclean.pl to get rid of the extraneous terms:
   SNOCLEAN.PL   1,092       5-30-05
9. The final output file is:
   SNOCLEAN.OUT  300,834     5-30-05
SNOMED contains 2,673 different neoplasm concepts and 7,696 neoplasm terms
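The cleaning step can be pictured as follows. snoclean.pl itself is not shown in the talk, so this Python sketch is only one plausible rule set, not the author's script: it lowercases terms, drops exact repeats within a concept, and drops a term when it is the simple "-s" plural of another term under the same CUI. The input layout ("CUI term" per line) follows what MRCON05.PL writes; the CUIs in the sample are placeholders.

```python
def clean_terms(lines):
    """Remove duplicate and simple-plural terms, keeping input order."""
    pairs = []
    for line in lines:
        cui, _, term = line.strip().partition(" ")
        if term:
            pairs.append((cui, term.lower()))

    present = set(pairs)          # every (CUI, term) seen anywhere in the input
    kept, emitted = [], set()
    for cui, term in pairs:
        # A plural is redundant only if its singular exists for the same concept.
        is_plural = term.endswith("s") and (cui, term[:-1]) in present
        if is_plural or (cui, term) in emitted:
            continue
        emitted.add((cui, term))
        kept.append(f"{cui} {term}")
    return kept


# Placeholder CUIs, for illustration only.
sample = ["CUI_A Lipomas", "CUI_A lipoma", "CUI_A Lipoma", "CUI_B lipomatosis"]
print(clean_terms(sample))
```

Checking the singular against the whole input, rather than only the lines read so far, makes the result independent of the order in which plural and singular forms appear. A term such as "lipomatosis" survives because its truncated form is not itself a term.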
 14
SNOMED
- The total number of neoplasm concepts is 2,673
- The total number of neoplasm terms is 7,696
Developmental Lineage
- The total number of neoplasm concepts is 6,193
- The total number of neoplasm terms is 146,666
The Developmental Lineage has 2.3 times as many neoplasm concepts as SNOMED and 19 times as many neoplasm terms.
Can one pathologist create a better nomenclature than the CAP? Maybe.
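The two ratios follow directly from the four counts on this slide. A short Python check is given here; the synonyms-per-concept figure is an extra derived quantity, not one stated in the talk.

```python
snomed = {"concepts": 2_673, "terms": 7_696}
lineage = {"concepts": 6_193, "terms": 146_666}

concept_ratio = lineage["concepts"] / snomed["concepts"]
term_ratio = lineage["terms"] / snomed["terms"]
print(f"{concept_ratio:.1f} times the concepts, {term_ratio:.0f} times the terms")

# Derived figure: average number of terms (synonyms) per concept.
for name, counts in (("SNOMED", snomed), ("Developmental Lineage", lineage)):
    print(f"{name}: {counts['terms'] / counts['concepts']:.1f} terms per concept")
```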
 15 The large curated nomenclatures can't be used for concept matching and are fast becoming obsolete for their intended mode of human-based implementation, due to the explosive growth of the data domain (terabytes and terabytes every day; think about all types of digital data in medical information systems).
Prakash Nadkarni, MD, Roland Chen, MD, Cynthia Brandt, MD, MPH. UMLS Concept Indexing for Production Databases: A Feasibility Study. J Am Med Inform Assoc. 2001;8:80-91.
Conclusions: Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.
 16 What is the value of a comprehensive neoplasm classification?
1. A modern classification is the key to retrieving, organizing, and integrating the data held in biomedical databases (including the data held in hospital information systems).
- Can we use the taxonomy to code our surgical pathology reports and other textual documents?
2. A classification is a hypothesis about the nature of reality.
- Can we use the classification to select classes of tumors (rather than single tumors) for molecularly targeted cancer therapy? We've done this with antibiotics, with astounding success.
- Can we learn something about the biology of tumors by using the classification to stratify the data found in large biological databases and inspecting the results?
 17 Autocoding Surgical Pathology Reports
What is the size of the data domain when we're talking about surgical pathology reports?
There are about 25 million surgical pathology reports generated in the U.S. each year (and about 50 million cytology reports).
 18 Autocoding Surgical Pathology Reports
Allowing 1000 bytes per report, these reports occupy 25 Gigabytes of text (25 thousand million bytes).
Here is what 1000 bytes looks like:

To be, or not to be,--that is the question:--
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing end them?--To die,--to sleep,--
No more; and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to,--'tis a consummation
Devoutly to be wish'd. To die,--to sleep;--
To sleep! perchance to dream:--ay, there's the rub;
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despis'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would these fardels bear,
To grunt and sweat under a weary life,
But

Compressed, all of the surgical pathology reports produced in the U.S. in one year will fit easily on one DVD (like 10 episodes of I Love Lucy).
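The storage estimate is simple arithmetic, sketched below in Python. The DVD capacity of 4.7 billion bytes is an assumption (a single-layer disc) that the slide does not state; the compression ratio printed is what that capacity would require, not a measured result.

```python
REPORTS_PER_YEAR = 25_000_000        # from the talk
BYTES_PER_REPORT = 1_000             # from the talk
DVD_CAPACITY_BYTES = 4_700_000_000   # assumption: single-layer DVD

total_bytes = REPORTS_PER_YEAR * BYTES_PER_REPORT
required_ratio = total_bytes / DVD_CAPACITY_BYTES

print(f"uncompressed: {total_bytes / 1e9:.0f} gigabytes")
print(f"compression needed to fit one disc: about {required_ratio:.1f} to 1")
```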
 19
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://www.purl.org/dc/elements/1.0/"
  xmlns:v="http://www.pathologyinformatics.org/informatics_r.htm">
<rdf:Description about="urn:PMID-16160487">
  <dc:title>
  interobserver and intraobserver variability in the diagnosis of hydatidiform mole
  </dc:title>
  <v:autocode term="mole" code="C0000000" />
  <v:autocode term="hydatidiform mole" code="" />
  <de_id>
    and   in the  of hydatidiform mole
  </de_id>
</rdf:Description>
<rdf:Description about="urn:PMID-16160486">
  <dc:title>
  primary glial tumor of the retina with features of myxopapillary ependymoma
  </dc:title>
  <v:autocode term="tumor" code="C0000000" />
  <v:autocode term="myxopapillary ependymoma" code="C0000000" />
  <v:autocode term="tumor of the retina" code="C0000000" />
  <v:autocode term="glial tumor" code="C3059000" />
  <v:autocode term="ependymoma" code="C0000000" />
  <de_id>
    glial tumor of the retina with  of myxopapillary ependymoma
  </de_id>
</rdf:Description>
<rdf:Description about="urn:PMID-16160485">
  <dc:title>
  cd20-negative t-cell-rich b-cell lymphoma as a progression of a nodular lymphocyte-predominant hodgkin lymphoma treated with rituximab a molecular analysis using laser capture microdissection
  </dc:title>
  <v:autocode term="lymphoma" code="C0000000" />
  <v:autocode term="hodgkin" code="C0000000" />
  <v:autocode term="b-cell lymphoma" code="C6858100" />
  <v:autocode term="t-cell-rich b-cell lymphoma" code="C9496100" />
  <v:autocode term="hodgkin lymphoma" code="" />
  <de_id>
    t-cell-rich b-cell lymphoma as a  of a   hodgkin lymphoma  with  a   using
  </de_id>
</rdf:Description>
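The `v:autocode` lines above list every nomenclature term found in a title, including terms nested inside longer ones. A minimal Python sketch of that behavior follows. It is a plain phrase lookup over word n-grams, not the author's published autocoding method; the function name `autocode`, the `max_words` limit, and the tiny nomenclature (terms and codes copied from the second record above) are illustrative assumptions.

```python
from xml.sax.saxutils import quoteattr

# Terms and codes copied from the second record on the slide.
NOMENCLATURE = {
    "tumor": "C0000000",
    "glial tumor": "C3059000",
    "tumor of the retina": "C0000000",
    "ependymoma": "C0000000",
    "myxopapillary ependymoma": "C0000000",
}


def autocode(text, nomenclature, max_words=5):
    """Return (term, code) for every nomenclature term found in the text.

    Every run of 1..max_words consecutive words is looked up, so nested
    terms ("tumor" inside "glial tumor") are all reported.
    """
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            phrase = " ".join(words[start:start + size])
            if phrase in nomenclature and phrase not in found:
                found.append(phrase)
    return [(term, nomenclature[term]) for term in found]


title = "primary glial tumor of the retina with features of myxopapillary ependymoma"
for term, code in autocode(title, NOMENCLATURE):
    print(f"<v:autocode term={quoteattr(term)} code={quoteattr(code)} />")
```

Each phrase lookup is a single dictionary access, so the cost per report grows with the number of words and `max_words`, not with the size of the nomenclature. That is the property that makes very fast autocoding against a large term list plausible.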
 20 The autocoder prepares an XML file in RDF format (a self-describing document). It autocodes and scrubs text concurrently, at a speed of about 8,000 reports per second... and does an incomparably better job than human coders!
This means that it will code and scrub the 25 million surgical pathology reports produced in the U.S. each year in about an hour, using a desktop PC.
If we had access to a supercomputer (operating more than 3,000 times faster than my desktop PC), we could autocode and scrub every pathology report produced in the country in about a second.
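The two time estimates can be checked from the figures in the talk. This Python sketch assumes throughput scales linearly with machine speed, which is a simplification.

```python
REPORTS = 25_000_000           # surgical pathology reports per year
REPORTS_PER_SECOND = 8_000     # desktop autocoding speed from the talk
SUPERCOMPUTER_SPEEDUP = 3_000  # "more than 3,000 times faster"

desktop_seconds = REPORTS / REPORTS_PER_SECOND
desktop_minutes = desktop_seconds / 60
supercomputer_seconds = desktop_seconds / SUPERCOMPUTER_SPEEDUP

print(f"desktop: {desktop_seconds:.0f} s (about {desktop_minutes:.0f} minutes)")
print(f"supercomputer: {supercomputer_seconds:.2f} s")
```

At 8,000 reports per second the desktop run takes 3,125 seconds, a little under an hour, and a 3,000-fold speedup brings that to roughly one second.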
 21 Why is it so important to autocode fast?
Because we're not really talking about coding (coded datasets cannot be justified on the basis of their scientific value). We're really talking about re-coding very large datasets as necessary.
You almost always need to re-code!!!
1. Whenever you want to change from one nomenclature to another (eliminates the problem of brand-name loyalty)
2. Whenever you introduce a new version of a nomenclature
3. Whenever you want to use a new coding algorithm (e.g. parsimonious versus comprehensive, linking code to a particular extracted portion of the report)
4. Whenever you add legacy data to your LIS
5. Whenever you merge different pathology datasets (forget mapping!!!)
 22 How can we integrate the neoplasm classification with OMIM to discover a new biological observation about tumors?
What is OMIM?
OMIM is a free, comprehensive listing of all the so-called Mendelian inherited diseases.
- OMIM is 103,610,906 bytes (over 100 million bytes)
- Shakespeare's Hamlet is 180,711 bytes
- OMIM is about 573 times larger than Hamlet
Each record of OMIM lists the name of the inherited disease and all the medical conditions (including neoplasms) that may be associated with the condition.
 23 Let's autocode all of OMIM and examine the results
1. The time to autocode was 92 seconds
2. The number of records in OMIM is 16,785
3. The number of records listing primitive tumors is 348
4. The number of records listing endoderm_or_ectoderm tumors is 1,220
5. The number of records listing mesoderm tumors is 1,766 (completely unlike what you might expect with non-inherited tumors)
6. The number of records listing neuroectoderm tumors is 747
So, because we have a class system, we can look at instance-coded datasets and make observations about CLASS.
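Moving from instance codes to class counts takes one extra lookup: each tumor term found in a record is replaced by the lineage class it belongs to, and each record is counted once per class. The Python sketch below is illustrative only. The term-to-class table is an assumption for the example, not an extract of the real classification; the records are invented; and plain substring matching stands in for the autocoder.

```python
from collections import Counter

# Illustrative assignments only; not copied from the classification file.
TERM_TO_CLASS = {
    "ependymoma": "neuroectoderm",
    "osteosarcoma": "mesoderm",
    "adenocarcinoma": "endoderm_or_ectoderm",
}

# Invented stand-ins for OMIM record text.
RECORDS = {
    "rec1": "associated with osteosarcoma and ependymoma",
    "rec2": "adenocarcinoma of the colon",
    "rec3": "osteosarcoma in childhood",
    "rec4": "no neoplasm mentioned",
}


def classes_in_record(text, term_to_class):
    """Set of lineage classes whose terms appear in one record."""
    text = text.lower()
    return {cls for term, cls in term_to_class.items() if term in text}


def count_records_by_class(records, term_to_class):
    """Number of records that list at least one tumor of each class."""
    counts = Counter()
    for text in records.values():
        counts.update(classes_in_record(text, term_to_class))
    return counts


print(count_records_by_class(RECORDS, TERM_TO_CLASS))
```

Using a set per record means a record that names three mesoderm tumors still adds one to the mesoderm count, which matches the "number of records listing" wording on the slide.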
 24 Easy to count the three combinations of two-lineage (discordant) records
The number of OMIM records with neoplasm concepts in the record text is 1,015.
   ectoderm/mesoderm         72 OMIM records
   ectoderm/neuroectoderm    24 OMIM records
   mesoderm/neuroectoderm    39 OMIM records
   total                    135 class-discordant OMIM records
So, 135/1,015 (13%) have a lineage discordance.
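Counting discordant records amounts to tallying every pair of distinct lineage classes that occur together in one record. The Python sketch below shows the tally on invented data and then re-derives the slide's total and percentage from the three published counts; the function name `discordant_pairs` and the toy records are assumptions.

```python
from collections import Counter
from itertools import combinations


def discordant_pairs(record_classes):
    """Tally each pair of different lineage classes found in the same record."""
    pairs = Counter()
    for classes in record_classes:
        for pair in combinations(sorted(classes), 2):
            pairs[pair] += 1
    return pairs


# Invented per-record class sets, for illustration only.
toy = [
    {"mesoderm"},
    {"ectoderm", "mesoderm"},
    {"mesoderm", "neuroectoderm"},
    {"ectoderm", "mesoderm"},
]
print(discordant_pairs(toy))

# The three counts reported on the slide.
slide_counts = {
    ("ectoderm", "mesoderm"): 72,
    ("ectoderm", "neuroectoderm"): 24,
    ("mesoderm", "neuroectoderm"): 39,
}
total_discordant = sum(slide_counts.values())
share = total_discordant / 1_015
print(total_discordant, f"{share:.1%}")
```

Note that a record containing all three lineages would be tallied under three pairs by this function, so the simple sum equals the number of discordant records only when each discordant record has exactly two lineages, as the slide's "two-lineage" wording implies.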
 25 Causes for 135 cases of class discordance
1. Inherited conditions with an (external) environmental factor
2. Physiologic (internal) effects that cross lineages (breast and ovarian cancers caused by an endocrine sensitivity that extends across lineages)
3. Conditions that included a tumor that occurs too infrequently to be correctly associated with the inherited condition
4. Mistakes in parsing OMIM (finding the name of a tumor in a record that was never intended to indicate that the condition is associated with the tumor)
5. Bad classification
How do you decide? In this case, you go back and read the 135 records and try to understand what went wrong in each case.
 26 Classification papers
Autocoding papers (Doublet Method: 20,000 times faster than other published methods)
Confidentiality/privacy papers
- De-identification and data scrubbing (Concept Match method)
- Zero-knowledge reconciliation of identities
- Threshold method for exchanging pieces of data
Data integration papers
www.pubmed.org: search on berman jj
 27 end 