Title: BioLINK Talks
BioLINK Talks
Linking Literature, Information and Knowledge for Biology
BioLINK, Detroit, June 24 (Edinburgh, July 11)
- Corpora and Corpus design (2)
- NER and Term Normalisation (3)
- Annotation and Zoning (2)
- Relation Extraction (2)
- Other
Corpus Design for Biomedical Natural Language Processing
Corpora and corpus design
K. Bretonnel Cohen et al (U of Colorado)
Main question: why are some (bio-)corpora more used than others? What makes them attractive?
Crucial points:
- format: XML
- encode several layers of information
- publicity: write specific papers about the corpus, publicise its availability
Take-home message: if you want people to use your corpus, use XML, publish the annotation guidelines, publicise the corpus with dedicated papers, and use it for competitions.
MedTag: a collection of biomedical annotations
Corpora and corpus design
L. Smith et al. (National Center for
Biotechnology Information, Bethesda, Maryland)
Main point: MedTag is a database that combines three corpora:
- MedPost (modified to include 1000 extra sentences)
- ABGene
- GENETAG (modified to reflect new definitions of genes and proteins)
The data is available as flat files, with software to facilitate loading the data into an SQL database.
Take-home message: integrated data, more accessible; you should try it.
Corpora and corpus design
MedPost:
- 6700 sentences
- annotated for POS and gerund arguments
- POS tagger trained on it (97.4% accuracy)
GENETAG:
- 15000 sentences currently released
- tagged for gene/protein identification
- used in BioCreative
ABGene:
- over 4000 sentences
- annotated for gene/protein names
- NER tagger trained on it (performance in the lower 70s)
Corpora and corpus design
GOOD (recommended uses):
- training and evaluating POS taggers
- training and evaluating NER taggers
- developing and evaluating a chunker (for PubMed phrase indexing)
- analysis of grammatical usage in medical text
- feature extraction for ML
BAD:
- entity annotation guidelines
- tokenisation! (white spaces were deleted)
NER and TN
Weakly Supervised Learning Methods for Improving
the Quality of Gene Name Normalization
Ben Wellner (MITRE)
Main points:
1. presents a method for improving the quality of the training data from BioCreative task 1b; system performance on the improved data is better than on the original data
2. weakly supervised methods can be successfully applied for re-labeling noisy training data
(next week)
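The slides do not spell out the re-labeling procedure; as a minimal sketch of the general idea (not Wellner's method), assuming binary labels, pre-extracted features and a plain scikit-learn classifier:

    # Minimal self-training-style sketch of re-labeling noisy training data.
    # Assumptions (not from the talk): binary 0/1 labels, features already
    # extracted into X, and a logistic regression as the classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def relabel_noisy_data(X, y_noisy, confidence_threshold=0.9):
        """Flip labels that the model confidently disagrees with."""
        model = LogisticRegression(max_iter=1000)
        # Out-of-fold probabilities, so each example is scored by a model
        # that never saw its (possibly wrong) label during training.
        probs = cross_val_predict(model, X, y_noisy, cv=5, method="predict_proba")
        predicted = probs.argmax(axis=1)   # valid because labels are 0/1
        confidence = probs.max(axis=1)
        y_clean = np.array(y_noisy).copy()
        flip = (predicted != y_clean) & (confidence >= confidence_threshold)
        y_clean[flip] = predicted[flip]
        return y_clean, int(flip.sum())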
NER and TN
Unsupervised gene/protein normalization using
automatically extracted dictionaries
A. Cohen (Oregon Health Science U., Portland,
Oregon)
Main point: a dictionary-based gene and protein NER and normalisation system; no supervised training, no human intervention.
- what curated databases are the best collections of names?
- are simple rules sufficient for generating orthographic variants?
- can common English words be used to decrease false positives?
- what is the normalization performance of a dictionary-based approach?
Results: near state-of-the-art, with savings on annotation.
NER and TN
METHOD
1. Building the dictionary
Automatically extracted from 5 databases (official symbol, unique identifier, name, symbol, synonym and alias fields)
2. Generating orthographic variants
Set of 7 simple rules applied iteratively
3. Separating common English words
Dictionary split in two parts: a confusion dictionary and a main dictionary
4. Screening out the most common English words
5. Searching the text
6. Disambiguation
Note: 5% ambiguous intra-species, 85% across species. Exploit non-ambiguous synonyms; exploit context.
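A rough sketch of steps 1-5 (not the authors' code: the variant rules, the records and the common-word list below are illustrative stand-ins, and matching is single-token only):

    # Sketch of dictionary-based gene/protein normalisation: build a
    # synonym -> identifier dictionary, add simple orthographic variants,
    # screen off common English words into a confusion dictionary, then
    # scan the text. Rules and word lists are illustrative stand-ins.
    import re

    def orthographic_variants(name):
        """Generate a few simple spelling variants of a name."""
        variants = {name, name.lower(), name.upper()}
        variants.add(name.replace("-", " "))
        variants.add(name.replace("-", ""))
        variants.add(re.sub(r"([A-Za-z])(\d)", r"\1 \2", name))  # "abc1" -> "abc 1"
        return variants

    def build_dictionary(db_records, common_english_words):
        """Map every synonym variant to its database identifier."""
        main, confusion = {}, {}
        for identifier, synonyms in db_records:
            for synonym in synonyms:
                for variant in orthographic_variants(synonym):
                    target = confusion if variant.lower() in common_english_words else main
                    target.setdefault(variant.lower(), identifier)
        return main, confusion

    def normalise(text, main_dictionary):
        """Return (mention, identifier) pairs found in the text."""
        hits = []
        for token in re.findall(r"[\w-]+", text):
            identifier = main_dictionary.get(token.lower())
            if identifier:
                hits.append((token, identifier))
        return hits

    # Example with made-up records:
    records = [("EG:2099", ["ESR1", "estrogen receptor 1", "ER-alpha"])]
    main, confusion = build_dictionary(records, {"was", "for", "the"})
    print(normalise("ER-alpha and ESR1 were upregulated", main))
    # -> [('ER-alpha', 'EG:2099'), ('ESR1', 'EG:2099')]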
NER and TN
A machine learning approach to acronym generation
Tsuruoka et al (Tokyo (Tsujii group), Japan and
Salford, UK)
Task: the system generates possible acronyms from a given expanded form
Main point: acronym generation cast as a sequence tagging problem
Method: ML approach (MaxEnt Markov Model)
Experiments:
- 1901 definition/acronym pairs
- several ranked options as output
- 75.4% coverage when including the top 5 candidates
- baseline: take the first letters and capitalise them
NER and TN
Classes (tags):
1. SKIP (generator skips the letter)
2. UPPER (generator upper-cases the letter)
3. LOWER (generator lower-cases the letter)
4. SPACE (generator converts the letter into a space)
5. HYPHEN (generator converts the letter into a hyphen)
Features:
- letter unigram
- letter bigram
- letter trigram
- action history (preceding action)
- orthographic (uppercase or not)
- length (words in the definition)
- letter sequence
- distance (between the target letter and the beginning/tail of the word)
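To make the tagging view concrete, here is a small sketch: applying a hand-supplied action sequence to an expanded form, plus a feature extractor along the lines of the list above. A real system, as in the talk, would predict the actions with a MaxEnt Markov Model; everything here is illustrative.

    # Acronym generation as per-letter sequence tagging: each character of
    # the expanded form gets one of five actions; applying them yields the
    # acronym. The action sequence is supplied by hand for illustration.
    def apply_actions(expanded_form, actions):
        """Turn a character-level action sequence into an acronym."""
        assert len(expanded_form) == len(actions)
        out = []
        for ch, action in zip(expanded_form, actions):
            if action == "UPPER":
                out.append(ch.upper())
            elif action == "LOWER":
                out.append(ch.lower())
            elif action == "SPACE":
                out.append(" ")
            elif action == "HYPHEN":
                out.append("-")
            # "SKIP": drop the character
        return "".join(out)

    def features(expanded_form, i, previous_action):
        """Feature sketch for position i (letter n-grams, history, shape...)."""
        word_start = expanded_form.rfind(" ", 0, i) + 1
        return {
            "unigram": expanded_form[i],
            "bigram": expanded_form[max(0, i - 1):i + 1],
            "trigram": expanded_form[max(0, i - 2):i + 1],
            "previous_action": previous_action,
            "is_upper": expanded_form[i].isupper(),
            "n_words": len(expanded_form.split()),
            "dist_from_word_start": i - word_start,
        }

    # Example: "maximum entropy" -> "ME"
    tags = ["UPPER"] + ["SKIP"] * 7 + ["UPPER"] + ["SKIP"] * 6
    print(apply_actions("maximum entropy", tags))  # ME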
Annotation/Zoning
Searching for High-Utility Text in the Biomedical
Lit.
Shatkay et al. (Queen's, Ontario; NYU; and NCBI, Maryland)
Task: identify text regions that are rich in scientific content, and retrieve documents that have many such regions
(Main idea: annotation guidelines)
High-utility regions: regions in the text that we identify as focusing on scientific findings, stated with high confidence, and preferably supported by experimental evidence.
Annotation/Zoning
Assertion: a sentence or fragment
Focus: type of information conveyed by the assertion (scientific, generic, methodology)
Polarity of the assertion: positive/negative
Certainty: from complete uncertainty (0) to complete certainty (3)
Evidence: whether the assertion is supported by experimental evidence
- E0: lack of evidence
- E1: evidence exists but is not reported ("it was shown...")
- E2: evidence not given directly, but a reference is provided
- E3: evidence provided
Direction/Trend: whether the assertion reports an increase/decrease in a specific phenomenon
Inter-annotator agreement (kappa values): .83, .81, .70, .73, .81
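As a rough rendering of the annotation scheme as a data structure (field names, enums and the example values are my own illustration, not the authors' format):

    # Rough rendering of the annotation scheme as a data structure.
    # Field names, enums and the example are illustrative only.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Focus(Enum):
        SCIENTIFIC = "scientific"
        GENERIC = "generic"
        METHODOLOGY = "methodology"

    class Evidence(Enum):
        E0 = "lack of evidence"
        E1 = "evidence exists but is not reported"
        E2 = "evidence not given directly, but a reference is provided"
        E3 = "evidence provided"

    @dataclass
    class Assertion:
        text: str                        # sentence or fragment
        focus: Focus
        polarity: bool                   # True = positive, False = negative
        certainty: int                   # 0 (complete uncertainty) .. 3 (complete certainty)
        evidence: Evidence
        direction: Optional[str] = None  # "increase" / "decrease", if reported

    example = Assertion(
        text="Overexpression of the gene increased cell proliferation (Fig. 2).",
        focus=Focus.SCIENTIFIC,
        polarity=True,
        certainty=3,
        evidence=Evidence.E3,
        direction="increase",
    )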
Annotation/Zoning
Automatic Highlighting of Bioscience Literature
H. Wang et al (CS Department, University of Iowa
- M. Light group)
Task: automatic highlighting of relevant passages
Approach: treated as an IR task
- a sentence is the passage unit
- each sentence is treated as a document
- the user provides a query: a query box for keywords, or an example passage highlighting
- the system ranks sentences by relevance to the query (with query expansion)
- the system is web-based
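A minimal sketch of the sentence-as-document ranking step, using TF-IDF cosine similarity as a stand-in for the Zettair ranking that the actual system uses:

    # Sentence-level ranking sketch: treat each sentence as a "document",
    # score it against the query and return the best matches for
    # highlighting. TF-IDF cosine similarity stands in for Zettair.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_sentences(sentences, query, top_k=5):
        """Return (score, sentence) pairs, best first."""
        vectorizer = TfidfVectorizer(stop_words="english")
        sentence_vectors = vectorizer.fit_transform(sentences)
        query_vector = vectorizer.transform([query])
        scores = cosine_similarity(query_vector, sentence_vectors).ravel()
        return sorted(zip(scores, sentences), reverse=True)[:top_k]

    article_sentences = [
        "The protein was shown to interact with the receptor.",
        "Samples were incubated for 24 hours.",
        "Binding of the ligand increased receptor activity.",
    ]
    print(rank_sentences(article_sentences, "receptor binding activity", top_k=2))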
Annotation/Zoning
- Corpus: 13 journal articles, each highlighted by a biology graduate student before the request for annotation
- Queries: constructed in retrospect; the annotators created the queries for the articles they had selected. The first highlighted region was also used as a query.
- Processing: tokenisation (LingPipe), indexing (Zettair), ranking of retrieved sentences (Zettair)
- Query expansion: definitions were used, via a Google "define:" lookup for each word (excluding stopwords). Over 80% of the query words had Google definitions.
Results:
- poor results overall
- the first highlighted passage works better than keywords
- Google expansion helps
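A sketch of the definition-based query expansion idea, with a local dictionary standing in for the Google definition lookups:

    # Definition-based query expansion sketch: each non-stopword query
    # term is expanded with the words of its definition. A local
    # dictionary stands in for the Google "define:" lookups.
    STOPWORDS = {"the", "of", "and", "a", "an", "in", "for", "that"}

    def expand_query(query, definitions):
        """Append definition words for each non-stopword query term."""
        expanded = query.lower().split()
        for term in list(expanded):
            if term in STOPWORDS:
                continue
            for word in definitions.get(term, "").lower().split():
                if word not in STOPWORDS and word not in expanded:
                    expanded.append(word)
        return " ".join(expanded)

    definitions = {"kinase": "an enzyme that transfers phosphate groups"}
    print(expand_query("kinase activity", definitions))
    # -> "kinase activity enzyme transfers phosphate groups"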
Rel Extr
Using biomedical literature mining to consolidate
the set of known human PPIs
A. Ramani et al (U of Texas at Austin -
Bunescu/Mooney group)
Task: construct a database of known human PPIs by
- combining and linking interactions from existing DBs
- mining additional interactions from 750000 Medline abstracts
Results:
- the quality of the automatically extracted interactions is comparable to that of those extracted manually
- the overall network has 31609 interactions between 7748 proteins
Rel Extr
1. Identify proteins in the text (CRF tagger)
2. Filter out less confident entities
3. Try to detect which pairs of the remaining ones are interactions
- use co-citation analysis
- train a model on an existing set
Trained model: a sentence containing 2 protein names is classified as correct/wrong. If a sentence has n proteins (n > 2), the sentence is replicated n times.
- ELCS: Extraction with Longest Common Subsequences (learned rules)
- ERK: Extraction using a Relation Kernel
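A toy sketch of the co-citation counting that step 3 refers to (abstracts are represented as pre-extracted sets of protein names; the scoring used in the actual work is not reproduced here):

    # Toy co-citation sketch: count, for every protein pair, how many
    # abstracts mention both. Abstracts are pre-extracted protein-name
    # sets; the actual system's scoring is not reproduced here.
    from collections import Counter
    from itertools import combinations

    def cocitation_counts(abstract_protein_sets):
        """Count the abstracts mentioning both members of each pair."""
        pair_counts = Counter()
        for proteins in abstract_protein_sets:
            for a, b in combinations(sorted(set(proteins)), 2):
                pair_counts[(a, b)] += 1
        return pair_counts

    abstracts = [{"TP53", "MDM2"}, {"TP53", "MDM2", "CDKN1A"}, {"BRCA1", "TP53"}]
    counts = cocitation_counts(abstracts)
    print(counts[("MDM2", "TP53")])  # 2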
Rel Extr
IntEx: a syntactic role driven PPI extractor for biomedical text
S. Ahmed et al (Arizona State University)
Task: detect PPIs by reducing complex sentences to simple clauses and then exploiting syntactic relations
- pronoun resolution (third person and reflexives; simple heuristics)
- entity tagging (dictionary lookup, heuristics)
- parsing (Link Grammar, dependency based, CMU?)
- complex sentence splitting (verb-based approach to extract simple clauses)
- interaction extraction (from simple clauses, exploiting syntactic roles)
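A toy skeleton of the clause-splitting pipeline described above; every step is a crude heuristic stand-in (IntEx itself relies on Link Grammar parses and syntactic roles, none of which is reproduced here):

    # Toy skeleton of the pipeline: resolve pronouns, tag entities, split
    # complex sentences into simple clauses, then read interactions off the
    # clauses. Every step is a crude heuristic stand-in.
    import re

    INTERACTION_VERBS = {"binds", "activates", "inhibits", "phosphorylates"}

    def resolve_pronouns(sentence, last_entity):
        """Toy heuristic: replace 'it'/'itself' with the last seen entity."""
        return re.sub(r"\b(it|itself)\b", last_entity, sentence, flags=re.IGNORECASE)

    def tag_entities(clause, protein_dictionary):
        """Dictionary lookup for protein mentions."""
        return [p for p in protein_dictionary if re.search(rf"\b{re.escape(p)}\b", clause)]

    def split_clauses(sentence):
        """Very rough clause splitting on conjunctions and commas."""
        return [c.strip() for c in re.split(r"\b(?:and|but|which)\b|,", sentence) if c.strip()]

    def extract_interactions(sentence, protein_dictionary, last_entity=""):
        sentence = resolve_pronouns(sentence, last_entity)
        interactions = []
        for clause in split_clauses(sentence):
            proteins = tag_entities(clause, protein_dictionary)
            verbs = [w for w in clause.lower().split() if w in INTERACTION_VERBS]
            if len(proteins) >= 2 and verbs:
                interactions.append((proteins[0], verbs[0], proteins[1]))
        return interactions

    proteins = ["MDM2", "TP53", "CDKN1A"]
    print(extract_interactions("MDM2 binds TP53 and it inhibits CDKN1A",
                               proteins, last_entity="MDM2"))
    # -> [('MDM2', 'binds', 'TP53'), ('MDM2', 'inhibits', 'CDKN1A')]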