Title: BioLINK Talks
BioLINK Talks
Linking Literature, Information and Knowledge for Biology
BioLINK, Detroit, June 24 (Edinburgh, July 11)
- Corpora and Corpus design (2)
- NER and Term Normalisation (3)
- Annotation and Zoning (2)
- Relation Extraction (2)
- Other
Corpus Design for Biomedical Natural Language Processing
Corpora and corpus design
K. Bretonnel Cohen et al (U of Colorado)
Main question: why are some (bio-)corpora more used than others? What makes them attractive?
Crucial points:
- format: XML
- encode several layers of information
- publicity: write specific papers about the corpus, publicise its availability
Take-home message: if you want people to use your corpus, use XML, publish the annotation guidelines, publicise the corpus with dedicated papers, and use it for competitions.
MedTag: a collection of biomedical annotations
Corpora and corpus design
L. Smith et al. (National Center for
Biotechnology Information, Bethesda, Maryland)
Main point: MedTag is a database that combines three corpora:
- MedPost (modified to include 1000 extra sentences)
- ABGene
- GENETAG (modified to reflect new definitions of genes and proteins)
The data is available as flat files, with software to facilitate loading the data into an SQL database.
Take-home message: integrated data, more accessible; you should try it.
Corpora and corpus design
MedPost:
- 6700 sentences
- annotated for POS and gerund arguments
- POS tagger trained on it (97.4% accuracy)
GENETAG:
- 15000 sentences currently released
- tagged for gene/protein identification
- used in BioCreative
ABGene:
- over 4000 sentences
- annotated for gene/protein names
- NER tagger trained on it (performance in the lower 70s)
Corpora and corpus design
GOOD (recommended uses):
- training and evaluating POS taggers
- training and evaluating NER taggers
- developing and evaluating a chunker (for PubMed phrase indexing)
- analysis of grammatical usage in medical text
- feature extraction for ML
BAD:
- entity annotation guidelines
- tokenisation! (white spaces were deleted)
NER and TN
Weakly Supervised Learning Methods for Improving
the Quality of Gene Name Normalization
Ben Wellner (MITRE)
Main points:
1. presents a method for improving the quality of the training data from BioCreative task 1b; system performance on the improved data is better than on the original data
2. weakly supervised methods can be successfully applied for re-labeling noisy training data
(next week)
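The slides do not spell out the re-labeling procedure; as a minimal sketch of the general idea (not Wellner's method), assuming binary labels, pre-extracted features and a plain scikit-learn classifier:

    # Minimal self-training-style sketch of re-labeling noisy training data.
    # Assumptions (not from the talk): binary 0/1 labels, features already
    # extracted into X, and a logistic regression as the classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def relabel_noisy_data(X, y_noisy, confidence_threshold=0.9):
        """Flip labels that the model confidently disagrees with."""
        model = LogisticRegression(max_iter=1000)
        # Out-of-fold probabilities, so each example is scored by a model
        # that never saw its (possibly wrong) label during training.
        probs = cross_val_predict(model, X, y_noisy, cv=5, method="predict_proba")
        predicted = probs.argmax(axis=1)   # valid because labels are 0/1
        confidence = probs.max(axis=1)
        y_clean = np.array(y_noisy).copy()
        flip = (predicted != y_clean) & (confidence >= confidence_threshold)
        y_clean[flip] = predicted[flip]
        return y_clean, int(flip.sum())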
NER and TN
Unsupervised gene/protein normalization using
automatically extracted dictionaries
A. Cohen (Oregon Health Science U., Portland,
Oregon)
Main point: a dictionary-based gene and protein NER and normalisation system; no supervised training, no human intervention.
- what curated databases are the best collections of names?
- are simple rules sufficient for generating orthographic variants?
- can common English words be used to decrease false positives?
- what is the normalization performance of a dictionary-based approach?
Results: near state-of-the-art, with savings on annotation.
NER and TN
METHOD
1. Building the dictionary
Automatically extracted from 5 databases (official symbol, unique identifier, name, symbol, synonym and alias fields)
2. Generating orthographic variants
Set of 7 simple rules applied iteratively
3. Separating common English words
Dictionary split in two parts: a confusion dictionary and a main dictionary
4. Screening out the most common English words
5. Searching the text
6. Disambiguation
Note: 5% ambiguous intra-species, 85% across species. Exploit non-ambiguous synonyms; exploit context.
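A rough sketch of steps 1-5 (not the authors' code: the variant rules, the records and the common-word list below are illustrative stand-ins, and matching is single-token only):

    # Sketch of dictionary-based gene/protein normalisation: build a
    # synonym -> identifier dictionary, add simple orthographic variants,
    # screen off common English words into a confusion dictionary, then
    # scan the text. Rules and word lists are illustrative stand-ins.
    import re

    def orthographic_variants(name):
        """Generate a few simple spelling variants of a name."""
        variants = {name, name.lower(), name.upper()}
        variants.add(name.replace("-", " "))
        variants.add(name.replace("-", ""))
        variants.add(re.sub(r"([A-Za-z])(\d)", r"\1 \2", name))  # "abc1" -> "abc 1"
        return variants

    def build_dictionary(db_records, common_english_words):
        """Map every synonym variant to its database identifier."""
        main, confusion = {}, {}
        for identifier, synonyms in db_records:
            for synonym in synonyms:
                for variant in orthographic_variants(synonym):
                    target = confusion if variant.lower() in common_english_words else main
                    target.setdefault(variant.lower(), identifier)
        return main, confusion

    def normalise(text, main_dictionary):
        """Return (mention, identifier) pairs found in the text."""
        hits = []
        for token in re.findall(r"[\w-]+", text):
            identifier = main_dictionary.get(token.lower())
            if identifier:
                hits.append((token, identifier))
        return hits

    # Example with made-up records:
    records = [("EG:2099", ["ESR1", "estrogen receptor 1", "ER-alpha"])]
    main, confusion = build_dictionary(records, {"was", "for", "the"})
    print(normalise("ER-alpha and ESR1 were upregulated", main))
    # -> [('ER-alpha', 'EG:2099'), ('ESR1', 'EG:2099')]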
NER and TN
A machine learning approach to acronym generation
Tsuruoka et al (Tokyo (Tsujii group), Japan and
Salford, UK)
Task: the system generates possible acronyms from a given expanded form
Main point: acronym generation cast as a sequence tagging problem
Method: ML approach (MaxEnt Markov Model)
Experiments:
- 1901 definition/acronym pairs
- several ranked options as output
- 75.4% coverage when including the top 5 candidates
- baseline: take the first letters and capitalise them
NER and TN
Classes (tags):
1. SKIP (generator skips the letter)
2. UPPER (generator upper-cases the letter)
3. LOWER (generator lower-cases the letter)
4. SPACE (generator converts the letter into a space)
5. HYPHEN (generator converts the letter into a hyphen)
Features:
- letter unigram
- letter bigram
- letter trigram
- action history (preceding action)
- orthographic (uppercase or not)
- length (words in the definition)
- letter sequence
- distance (between the target letter and the beginning/tail of the word)
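To make the tagging view concrete, here is a small sketch: applying a hand-supplied action sequence to an expanded form, plus a feature extractor along the lines of the list above. A real system, as in the talk, would predict the actions with a MaxEnt Markov Model; everything here is illustrative.

    # Acronym generation as per-letter sequence tagging: each character of
    # the expanded form gets one of five actions; applying them yields the
    # acronym. The action sequence is supplied by hand for illustration.
    def apply_actions(expanded_form, actions):
        """Turn a character-level action sequence into an acronym."""
        assert len(expanded_form) == len(actions)
        out = []
        for ch, action in zip(expanded_form, actions):
            if action == "UPPER":
                out.append(ch.upper())
            elif action == "LOWER":
                out.append(ch.lower())
            elif action == "SPACE":
                out.append(" ")
            elif action == "HYPHEN":
                out.append("-")
            # "SKIP": drop the character
        return "".join(out)

    def features(expanded_form, i, previous_action):
        """Feature sketch for position i (letter n-grams, history, shape...)."""
        word_start = expanded_form.rfind(" ", 0, i) + 1
        return {
            "unigram": expanded_form[i],
            "bigram": expanded_form[max(0, i - 1):i + 1],
            "trigram": expanded_form[max(0, i - 2):i + 1],
            "previous_action": previous_action,
            "is_upper": expanded_form[i].isupper(),
            "n_words": len(expanded_form.split()),
            "dist_from_word_start": i - word_start,
        }

    # Example: "maximum entropy" -> "ME"
    tags = ["UPPER"] + ["SKIP"] * 7 + ["UPPER"] + ["SKIP"] * 6
    print(apply_actions("maximum entropy", tags))  # ME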
Annotation/Zoning
Searching for High-Utility Text in the Biomedical
Lit.
Shatkay et al. (Queen's, Ontario; NYU; and NCBI, Maryland)
Task: identify text regions that are rich in scientific content, and retrieve documents that have many such regions
(Main idea: annotation guidelines)
High-utility regions: regions in the text that we identify as focusing on scientific findings, stated with high confidence, and preferably supported by experimental evidence.
Annotation/Zoning
Assertion: a sentence or fragment
Focus: type of information conveyed by the assertion (scientific, generic, methodology)
Polarity of the assertion: positive/negative
Certainty: from complete uncertainty (0) to complete certainty (3)
Evidence: whether the assertion is supported by experimental evidence
- E0: lack of evidence
- E1: evidence exists but is not reported ("it was shown...")
- E2: evidence not given directly, but a reference is provided
- E3: evidence provided
Direction/Trend: whether the assertion reports an increase/decrease in a specific phenomenon
Inter-annotator agreement (kappa values): .83, .81, .70, .73, .81
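As a rough rendering of the annotation scheme as a data structure (field names, enums and the example values are my own illustration, not the authors' format):

    # Rough rendering of the annotation scheme as a data structure.
    # Field names, enums and the example are illustrative only.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Focus(Enum):
        SCIENTIFIC = "scientific"
        GENERIC = "generic"
        METHODOLOGY = "methodology"

    class Evidence(Enum):
        E0 = "lack of evidence"
        E1 = "evidence exists but is not reported"
        E2 = "evidence not given directly, but a reference is provided"
        E3 = "evidence provided"

    @dataclass
    class Assertion:
        text: str                        # sentence or fragment
        focus: Focus
        polarity: bool                   # True = positive, False = negative
        certainty: int                   # 0 (complete uncertainty) .. 3 (complete certainty)
        evidence: Evidence
        direction: Optional[str] = None  # "increase" / "decrease", if reported

    example = Assertion(
        text="Overexpression of the gene increased cell proliferation (Fig. 2).",
        focus=Focus.SCIENTIFIC,
        polarity=True,
        certainty=3,
        evidence=Evidence.E3,
        direction="increase",
    )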
Annotation/Zoning
Automatic Highlighting of Bioscience Literature
H. Wang et al (CS Department, University of Iowa
- M. Light group)
Task: automatic highlighting of relevant passages
Approach: treated as an IR task
- a sentence is the passage unit
- each sentence is treated as a document
- the user provides a query: a query box for keywords, or an example passage highlighting
- the system ranks sentences by relevance to the query (with query expansion)
- the system is web-based
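A minimal sketch of the sentence-as-document ranking step, using TF-IDF cosine similarity as a stand-in for the Zettair ranking that the actual system uses:

    # Sentence-level ranking sketch: treat each sentence as a "document",
    # score it against the query and return the best matches for
    # highlighting. TF-IDF cosine similarity stands in for Zettair.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_sentences(sentences, query, top_k=5):
        """Return (score, sentence) pairs, best first."""
        vectorizer = TfidfVectorizer(stop_words="english")
        sentence_vectors = vectorizer.fit_transform(sentences)
        query_vector = vectorizer.transform([query])
        scores = cosine_similarity(query_vector, sentence_vectors).ravel()
        return sorted(zip(scores, sentences), reverse=True)[:top_k]

    article_sentences = [
        "The protein was shown to interact with the receptor.",
        "Samples were incubated for 24 hours.",
        "Binding of the ligand increased receptor activity.",
    ]
    print(rank_sentences(article_sentences, "receptor binding activity", top_k=2))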
Annotation/Zoning
- Corpus: 13 journal articles, each highlighted by a biology graduate student before the request for annotation
- Queries: constructed in retrospect; the annotators created the queries for the articles they had selected. The first highlighted region was also used as a query.
- Processing: tokenisation (LingPipe), indexing (Zettair), ranking of retrieved sentences (Zettair)
- Query expansion: definitions were used, via a Google "define:" lookup for each word (excluding stopwords). Over 80% of the query words had Google definitions.
Results:
- poor results overall
- the first highlighted passage works better than keywords
- Google expansion helps
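A sketch of the definition-based query expansion idea, with a local dictionary standing in for the Google definition lookups:

    # Definition-based query expansion sketch: each non-stopword query
    # term is expanded with the words of its definition. A local
    # dictionary stands in for the Google "define:" lookups.
    STOPWORDS = {"the", "of", "and", "a", "an", "in", "for", "that"}

    def expand_query(query, definitions):
        """Append definition words for each non-stopword query term."""
        expanded = query.lower().split()
        for term in list(expanded):
            if term in STOPWORDS:
                continue
            for word in definitions.get(term, "").lower().split():
                if word not in STOPWORDS and word not in expanded:
                    expanded.append(word)
        return " ".join(expanded)

    definitions = {"kinase": "an enzyme that transfers phosphate groups"}
    print(expand_query("kinase activity", definitions))
    # -> "kinase activity enzyme transfers phosphate groups"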
Rel Extr
Using biomedical literature mining to consolidate
the set of known human PPIs
A. Ramani et al (U of Texas at Austin -
Bunescu/Mooney group)
Task: construct a database of known human PPIs by
- combining and linking interactions from existing DBs
- mining additional interactions from 750000 Medline abstracts
Results:
- the quality of the automatically extracted interactions is comparable to that of those extracted manually
- the overall network has 31609 interactions between 7748 proteins
Rel Extr
1. Identify proteins in the text (CRF tagger)
2. Filter out less confident entities
3. Try to detect which pairs of the remaining ones are interactions
- use co-citation analysis
- train a model on an existing set
Trained model: a sentence containing 2 protein names is classified as correct/wrong. If a sentence has n proteins (n > 2), the sentence is replicated n times.
- ELCS: Extraction with Longest Common Subsequences (learned rules)
- ERK: Extraction using a Relation Kernel
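A toy sketch of the co-citation counting that step 3 refers to (abstracts are represented as pre-extracted sets of protein names; the scoring used in the actual work is not reproduced here):

    # Toy co-citation sketch: count, for every protein pair, how many
    # abstracts mention both. Abstracts are pre-extracted protein-name
    # sets; the actual system's scoring is not reproduced here.
    from collections import Counter
    from itertools import combinations

    def cocitation_counts(abstract_protein_sets):
        """Count the abstracts mentioning both members of each pair."""
        pair_counts = Counter()
        for proteins in abstract_protein_sets:
            for a, b in combinations(sorted(set(proteins)), 2):
                pair_counts[(a, b)] += 1
        return pair_counts

    abstracts = [{"TP53", "MDM2"}, {"TP53", "MDM2", "CDKN1A"}, {"BRCA1", "TP53"}]
    counts = cocitation_counts(abstracts)
    print(counts[("MDM2", "TP53")])  # 2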
Rel Extr
IntEx: a syntactic role driven PPI extractor for biomedical text
S. Ahmed et al (Arizona State University)
Task: detect PPIs by reducing complex sentences to simple clauses and then exploiting syntactic relations
- pronoun resolution (third person and reflexives; simple heuristics)
- entity tagging (dictionary lookup, heuristics)
- parsing (Link Grammar, dependency based, CMU?)
- complex sentence splitting (verb-based approach to extract simple clauses)
- interaction extraction (from simple clauses, exploiting syntactic roles)
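A toy skeleton of the clause-splitting pipeline described above; every step is a crude heuristic stand-in (IntEx itself relies on Link Grammar parses and syntactic roles, none of which is reproduced here):

    # Toy skeleton of the pipeline: resolve pronouns, tag entities, split
    # complex sentences into simple clauses, then read interactions off the
    # clauses. Every step is a crude heuristic stand-in.
    import re

    INTERACTION_VERBS = {"binds", "activates", "inhibits", "phosphorylates"}

    def resolve_pronouns(sentence, last_entity):
        """Toy heuristic: replace 'it'/'itself' with the last seen entity."""
        return re.sub(r"\b(it|itself)\b", last_entity, sentence, flags=re.IGNORECASE)

    def tag_entities(clause, protein_dictionary):
        """Dictionary lookup for protein mentions."""
        return [p for p in protein_dictionary if re.search(rf"\b{re.escape(p)}\b", clause)]

    def split_clauses(sentence):
        """Very rough clause splitting on conjunctions and commas."""
        return [c.strip() for c in re.split(r"\b(?:and|but|which)\b|,", sentence) if c.strip()]

    def extract_interactions(sentence, protein_dictionary, last_entity=""):
        sentence = resolve_pronouns(sentence, last_entity)
        interactions = []
        for clause in split_clauses(sentence):
            proteins = tag_entities(clause, protein_dictionary)
            verbs = [w for w in clause.lower().split() if w in INTERACTION_VERBS]
            if len(proteins) >= 2 and verbs:
                interactions.append((proteins[0], verbs[0], proteins[1]))
        return interactions

    proteins = ["MDM2", "TP53", "CDKN1A"]
    print(extract_interactions("MDM2 binds TP53 and it inhibits CDKN1A",
                               proteins, last_entity="MDM2"))
    # -> [('MDM2', 'binds', 'TP53'), ('MDM2', 'inhibits', 'CDKN1A')]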