Medical Sentence Parsing

1
Medical Sentence Parsing
  • Jules J. Berman, Ph.D., M.D.
    Program Director, Pathology Informatics
    Cancer Diagnosis Program
    National Cancer Institute
    The opinions expressed are solely those of the
    author, who does not represent the NIH.

2
If you don't have a way of parsing your text
into distinct sentences, you can't sensibly
derive information from the text. This is true
for both humans and computers. Unfortunately,
pathology reports are full of sentences that
do not follow any grammatical logic. This makes
natural language parsing of pathology free-text
based on grammar rules virtually impossible.
3
Pathology sentences are not grammatical sentences.
For example:
BASAL CELL CARCINOMA
MARGINS NOT INVOLVED
"Basal cell carcinoma" is not a grammatical
sentence and does not even end with a period. It
is simply a diagnostic term. It can be
thought of as an elliptical form of a more complete
but hidden sentence, such as:
BASAL CELL CARCINOMA IS PRESENT IN THE SPECIMEN.
THE MARGINS OF RESECTION OF THE SPECIMEN ARE NOT
INVOLVED.
4
So if I were given a million pathology reports,
and I wanted to create software that would map
the diagnoses to a standard nomenclature, such as
UMLS or SNOMED or ICD-O, I would want the
software to think that "basal cell carcinoma" is
a complete sentence when it appears in all caps
on its own line in the microscopic diagnosis
section.
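
The rule described above can be sketched in a few lines. This is a minimal illustration in Python, not the author's Perl code; the function name `split_diagnosis_lines` is hypothetical, and the sketch assumes the diagnosis section has already been isolated and that each diagnostic term sits on its own line.

```python
def split_diagnosis_lines(section_text):
    """Treat every all-caps line as one complete sentence.

    Assumption: section_text holds only the microscopic diagnosis
    section, one diagnostic term per line.
    """
    sentences = []
    for line in section_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        letters = [ch for ch in stripped if ch.isalpha()]
        # Keep the line only if it has letters and all of them are capitals.
        if letters and all(ch.isupper() for ch in letters):
            sentences.append(stripped)
    return sentences


if __name__ == "__main__":
    example = "BASAL CELL CARCINOMA\nMARGINS NOT INVOLVED\n"
    for sentence in split_diagnosis_lines(example):
        print(sentence)
```

Each returned line could then be looked up in a nomenclature such as UMLS, SNOMED or ICD-O; that lookup is outside this sketch.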
5
Microscopic Diagnosis:
INVASIVE SQUAMOUS CELL CARCINOMA
IN SITU COMPONENT IS NOT SEEN
Dr. Green notified Jan. 15, at 3:00 p.m. that
margins are positive.
Sentence-Blind Parser:
invasive squamous cell / carcinoma in situ /
component is not seen dr. / green notified jan. /
15, at 3 / 00 p./m./ that margins are positive.
6
United States Patent 5,001,633
Fukumochi, et al. Mar. 19, 1991
Computer assisted language translating machine
with sentence extracting function
"..... A discriminating device
determines whether the character data read out of
the first storage unit is a period, and whether a
space follows the period. In the event a period
and space are detected, the sentence associated
with the period is translated."
Appl. No. 397,188 Filed Aug. 23, 1989
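
To see why the period-plus-space rule falls short on medical text, here is a minimal Python sketch of that rule as the abstract describes it. It is an illustration only, not the patented implementation; the function name `split_on_period_space` is hypothetical, and the input is the physician-notification line from the report example.

```python
def split_on_period_space(text):
    """Cut a sentence wherever a period is followed by a space."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        # A period counts as a boundary only when a space follows it.
        if ch == "." and i + 1 < len(text) and text[i + 1] == " ":
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    note = "Dr. Green notified Jan. 15, at 3:00 p.m. that margins are positive."
    for piece in split_on_period_space(note):
        print(piece)
```

The single sentence comes back as four fragments, because "Dr.", "Jan." and "p.m." each end in a period followed by a space.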
7
Text::Sentence
Publicly available CPAN Perl Package
Philosophic approach to algorithm (as per
Tony Rose and Ave Wrigley):
"The main problem
with trying to split text into sentences is that
there are several uses for periods, such as
abbreviations. The performance of Text::Sentence
could be improved by taking into account special
cases like honorifics (Mr., Mrs., Dr.), common
abbreviations (e.g., etc., i.e.), and so
on. However, as with many language problems, this
obeys the law of diminishing returns: a little
bit of effort will do a decent 90% job, but that
last 10% is pretty difficult. For our purposes
the 90% is good enough."
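
The improvement mentioned in the quotation (special-casing honorifics and common abbreviations) can be sketched as follows. This is a Python illustration, not Text::Sentence code; the name `split_with_abbreviations` and the abbreviation list are assumptions made for the example.

```python
# Hypothetical exception list, taken from the examples in the quotation.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "etc.", "i.e."}


def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """End a sentence at any token ending in a period, unless the
    token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    text = "Dr. Green was notified. The margins are positive."
    for sentence in split_with_abbreviations(text):
        print(sentence)
```

The sketch also shows the diminishing returns: a sentence that really does end in "etc." is glued to the next one, and every new abbreviation needs its own entry in the list.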
8
A sentence parser for pathology really needs to
take into consideration the specific ways that
pathologists separate their sentences. It should
also try to deal with the way that pathologists
use sentence separators unintentionally.
9
Design. A medical sentence parser was created
in Perl (Sentence.PM). Perl is
platform-independent, running on any system
(including Unix, DOS, Windows, Mac). Perl
interpreters are widely available at no
cost. The algorithmic strategy of the parser is
to remove all punctuation unrelated to sentence
breaks and to introduce sentence breaks at
positions likely to indicate the beginning or end
of a "complete thought," independent of grammar
restrictions.
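
The two-part strategy (strip punctuation that does not mark a sentence break, then add breaks where a complete thought ends) might look roughly like this in Python. This is a sketch under stated assumptions and does not reproduce the rules in Sentence.PM: the abbreviation list is hypothetical, and each input line is assumed to be either a heading-style line or a complete paragraph.

```python
import re

# Hypothetical list of tokens whose periods are not sentence breaks.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Jan.", "p.m.", "a.m.")


def parse_sentences(text):
    # Step 1: remove punctuation unrelated to sentence breaks.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", ""))
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Step 2: introduce a break at the end of a line that lacks one
        # (assumption: such a line is a heading or diagnostic term).
        if line[-1] not in ".?!":
            line += "."
        # Step 3: split after ., ? or ! when whitespace follows.
        sentences.extend(p for p in re.split(r"(?<=[.?!])\s+", line) if p)
    return sentences


if __name__ == "__main__":
    report = (
        "INVASIVE SQUAMOUS CELL CARCINOMA\n"
        "IN SITU COMPONENT IS NOT SEEN\n"
        "Dr. Green notified Jan. 15, at 3:00 p.m. that margins are positive."
    )
    for sentence in parse_sentences(report):
        print(sentence)
```

On the report example this yields three sentences: the two diagnostic lines and the notification line, with the abbreviation periods removed.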
10
Results. The parser was tested on
fully-anonymized pathology reports from two
institutions as well as a publicly available
medical text (the 67 Mbyte OMIM, On-Line
Mendelian Inheritance in Man) and many other
texts. The latter two resemble the structure of
many pathology reports, containing numerous
heading "sentences" along with short lines
lacking standard sentence separators. A 2%
error rate (for run-on sentences) was found for
the OMIM text.
11
Conclusions. The sentence parser is fast and
accurate. The parser handles ASCII files of any
size, regardless of system RAM
constraints. Parsed text outputs at the rate
of 265 Kbytes/second (on a 550 MHz CPU).
Source code, as object-oriented class modules,
is available from the author at no cost.
12
How to Use the Parser
1. Send me an email
asking for it, and give me your contact
information (snail-mail, telephone, fax, and
preferred email). You don't need to give me any
more information than that. I'd like to keep
tabs on all the users so that I can send them
version updates. This is the kind of software
that is never finished. You're always
improving upon it.
13
How to Use the Parser
2. Make sure you have a Perl
interpreter. To the best of my knowledge, this
comes with every install of Linux. If you're
using Windows, my favorite is the ActiveState
Perl distribution, which you can download for
free: www.ActiveState.com
14
How to Use the Parser
3. Create a Sentence class
object and use a filename as an argument when you
call the method that parses files into
sentences. This sounds obscure, but it's easy.
15
Let's call this script sentence.pl:
    #!/usr/local/bin/perl
    use Parse;
    use Sentence;
    @ISA = (Parse, Sentence);
    # Create the parser object.
    $object1 = Sentence->new();
    print "\nWhat file would you like to parse. FULL PATHNAME.\n";
    $filename = <STDIN>;
    chomp($filename);
    # sentence_parse returns the name of the output file.
    $filename = $object1->sentence_parse($filename);
    print "Your output file is $filename\n";
    $dateheader = $object1->get_day();
    print $dateheader;
    exit;
16
C:\newinfo\parse>perl sentence.pl Sentence
What file would you like to parse. Include full pathname.
readpars.txt
.....
readpars.txt
Elapsed time 0 seconds
Your output file is TKWILTKG.LZM
This text was produced by Jules Berman's Sentence Class on Wednesday - September 19, 2001
17
Caveat: This will only work with plain ASCII
files. But the program doesn't care how big the
file is. It can be one line or it can be a
gigabyte.
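
One common way to make a parser indifferent to file size is to read the file a line at a time and keep only the unfinished sentence in memory. The Python sketch below shows that idea; it is an assumption about how this could be done, not a description of how the Sentence class does it, and the name `stream_sentences` is hypothetical. A plain period-plus-space rule stands in for the real sentence logic.

```python
import io


def stream_sentences(lines):
    """Yield sentences from any iterable of lines (such as an open file),
    holding only the current unfinished sentence in memory."""
    buffer = ""
    for line in lines:
        buffer = (buffer + " " + line.strip()).strip()
        # Emit every complete sentence already in the buffer.
        while True:
            cut = buffer.find(". ")
            if cut == -1:
                break
            yield buffer[:cut + 1]
            buffer = buffer[cut + 2:].lstrip()
        if buffer.endswith("."):
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # trailing text with no final period


if __name__ == "__main__":
    sample = io.StringIO("The margins are\npositive. No tumor\nis seen.\n")
    for sentence in stream_sentences(sample):
        print(sentence)
```

Passing an open file handle instead of the `io.StringIO` sample works the same way, since Python file objects are iterated line by line.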
18
How to Improve the Parser
Send all code improvements, suggestions, and
constructive criticisms to me. I hope to issue
new versions of the class methods and to create
new class methods.