Title: Presentation title
1. ProMiner: Organism-specific protein name detection using approximate string matching
2. Results group 16
3. The ProMiner System
PSB 2003: Playing Biology's Name Game: Identifying Protein Names in Scientific Text
- Dictionary generation and curation
- Approximate search
  - Accept permutations (synonym length > 3 words), small deletions and insertions
  - Tokens are assigned to different token classes with different weights, e.g.
    - numbers and Greek letters class,
    - modifier class (e.g. receptor),
    - description class (e.g. -, subunit)
  - Two associated scoring measures
    - The boundary score s_b controls the end of the extension
    - The acceptance score s_a is a linear combination of token-class-specific match and mismatch terms
- Filtering of ambiguous matches
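The acceptance score can be illustrated with a minimal Python sketch. The token lists, the weights, and the normalization are assumptions made for illustration; the slides name the token classes but not their values. Token order is ignored for every synonym here, whereas the slides restrict permutations to synonyms longer than 3 words, and the boundary score is not modeled.

```python
import re

# Hypothetical token lists; the slides name the classes but not their members.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
MODIFIERS = {"receptor", "kinase", "ligand"}
DESCRIPTORS = {"subunit", "protein", "gene", "-"}

# Assumed (match weight, mismatch penalty) per token class.
WEIGHTS = {
    "number_greek": (2.0, 3.0),
    "modifier": (1.5, 2.0),
    "description": (0.2, 0.1),
    "standard": (1.0, 1.0),
}


def tokenize(text):
    """Split into lowercase words, digit runs, and dashes."""
    return re.findall(r"[a-z]+|\d+|-", text.lower())


def token_class(token):
    if token.isdigit() or token in GREEK:
        return "number_greek"
    if token in MODIFIERS:
        return "modifier"
    if token in DESCRIPTORS:
        return "description"
    return "standard"


def acceptance_score(synonym, candidate):
    """Linear combination of class-specific match and mismatch terms,
    divided by the score of a perfect match (so 1.0 means identical token sets)."""
    syn_tokens = set(tokenize(synonym))
    cand_tokens = set(tokenize(candidate))
    score = 0.0
    best = 0.0
    for tok in syn_tokens:
        match_w, mismatch_w = WEIGHTS[token_class(tok)]
        best += match_w
        # A synonym token missing from the candidate counts as a deletion.
        score += match_w if tok in cand_tokens else -mismatch_w
    for tok in cand_tokens - syn_tokens:
        # An extra candidate token counts as an insertion.
        score -= WEIGHTS[token_class(tok)][1]
    return score / best if best else 0.0
```

With these assumed weights, a changed number is penalized heavily while an extra description token such as "subunit" barely lowers the score, which is the behavior the token classes are meant to produce.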
4. Generation and curation of dictionaries
- Yeast
  - the only modification was the addition of the letter p to each gene name.
- Mouse (described in the presentation of group 24)
  - all spelling variants in mouse,
  - manual adaptation to the development and training sets
  - removal of all unspecific synonyms
- Fly
  - obtained directly from the FlyBase database
  - entries were limited to D. melanogaster
  - standard curation was used
5. Rule-based classification of synonyms
- Case-sensitive synonyms
  - A synonym that occurs two or more times in different entries, with different case in the non-normalized form, must be considered case-sensitive.
- Questionable synonyms
  - highly unspecific, lead to a substantial number of false positives
  - occur frequently in a reference corpus,
  - occur reasonably often in a reference corpus and are contained in a dictionary of English words, or
  - match rules identifying potential sequence parts, (Roman or Arabic) numbers or subunit tags (e.g. alpha 1)
- Standard synonyms
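The three synonym classes can be sketched as a small rule cascade. This is a minimal sketch under assumptions: the frequency thresholds (`frequent`, `moderate`), the regular expression for numbers and subunit tags, and the precedence of "questionable" over "case-sensitive" are all hypothetical, since the slides do not give the actual rules or cut-offs.

```python
import re
from collections import defaultdict

# Assumed rule for Arabic numbers, small Roman numerals, and subunit tags
# such as "alpha 1"; the real rule set is not given in the slides.
NUMBER_OR_SUBUNIT = re.compile(
    r"\d+|[ivx]+|(?:alpha|beta|gamma|delta)(?: \d+)?", re.IGNORECASE
)


def classify_synonyms(entries, corpus_counts, english_words,
                      frequent=1000, moderate=100):
    """entries: {entry_id: [synonym, ...]} in non-normalized spelling.
    corpus_counts: {lowercased synonym: frequency in a reference corpus}.
    Returns {(entry_id, synonym): label}."""
    # For each lowercased form, record where and how it is spelled.
    spellings = defaultdict(set)
    for entry_id, synonyms in entries.items():
        for syn in synonyms:
            spellings[syn.lower()].add((entry_id, syn))

    labels = {}
    for entry_id, synonyms in entries.items():
        for syn in synonyms:
            key = syn.lower()
            count = corpus_counts.get(key, 0)
            entry_ids = {e for e, _ in spellings[key]}
            forms = {s for _, s in spellings[key]}
            if (count >= frequent
                    or (count >= moderate and key in english_words)
                    or NUMBER_OR_SUBUNIT.fullmatch(syn)):
                labels[(entry_id, syn)] = "questionable"
            elif len(entry_ids) >= 2 and len(forms) >= 2:
                # Same name, different case, in different entries.
                labels[(entry_id, syn)] = "case-sensitive"
            else:
                labels[(entry_id, syn)] = "standard"
    return labels
```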
6. Questionable entry
More specific synonyms are generated based on a supplied pattern file. For instance, the fly gene clipped (FlyBase identifier FBgn0000354) is expanded as "clipped locus", "clipped protein", "gene clipped", "insertion of clipped", "transposon clipped", ...
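The expansion of a questionable synonym can be sketched in a few lines of Python. The pattern syntax with a `{name}` placeholder is an assumption; the slides do not show the format of the pattern file, only the resulting expansions.

```python
# Hypothetical pattern-file contents; "{name}" marks where the synonym goes.
# The five patterns mirror the expansions listed on the slide.
PATTERNS = [
    "{name} locus",
    "{name} protein",
    "gene {name}",
    "insertion of {name}",
    "transposon {name}",
]


def expand_synonym(name, patterns=PATTERNS):
    """Generate more specific synonyms for an unspecific gene name."""
    return [pattern.format(name=name) for pattern in patterns]


print(expand_synonym("clipped"))
```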
7. Disambiguating object occurrences
- Positional match disambiguation
  - acceptance score, fraction and length of match
- Ambiguous object occurrence
  - an ambiguous synonym match is only accepted if a unique match of another synonym of the same object is also found.
  - A disambiguation threshold for the size of the final set of objects: D1, D3, D5
- Integration of
  - controlled vocabulary
    - GO cellular component,
    - body parts (fly)
  - acronym dictionaries
    - Biomedical Abbreviation Server [1]
    - putative abbreviations from all test and training abstracts provided

[1] J.T. Chang, H. Schütze, and R.B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612-620, 2002.
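The handling of ambiguous object occurrences can be sketched as follows. This is one plausible reading of the slide, not the ProMiner implementation: the function name, the data layout, and the interpretation of the threshold (D1, D3, D5 as the maximum size of the final object set) are assumptions.

```python
def disambiguate(matches, max_objects=1):
    """matches: list of (synonym, set of object ids) found in one abstract.
    max_objects: assumed meaning of the thresholds D1, D3, D5.
    Returns a list of (synonym, sorted accepted object ids)."""
    # Objects confirmed by at least one unambiguous synonym match.
    confirmed = {next(iter(objs)) for _, objs in matches if len(objs) == 1}

    accepted = []
    for synonym, objs in matches:
        candidates = objs
        if len(objs) > 1:
            # Narrow an ambiguous match to objects that also have a
            # unique match of another synonym in the same text.
            supported = objs & confirmed
            if supported:
                candidates = supported
        if len(candidates) <= max_objects:
            accepted.append((synonym, sorted(candidates)))
    return accepted


matches = [
    ("p53", {"A"}),
    ("tumor protein", {"A", "B"}),
    ("kinase 1", {"C", "D"}),
]
print(disambiguate(matches, max_objects=1))
```

With `max_objects=1` the ambiguous match "tumor protein" is accepted only for object A, which is confirmed by the unique match "p53", and "kinase 1" is dropped.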
8. Ontology filter
Input: the NCBI taxonomy (formalized as a directed acyclic graph in conjunction with a controlled vocabulary).
The filtering is based on co-occurrence of terms in a frame of reference, i.e. an abstract or a sentence.
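A co-occurrence filter of this kind might look like the sketch below. Everything specific in it is an assumption: the toy taxonomy excerpt, the vocabulary, and the decision rule (keep a frame if it names no organism, or names the target organism or one of its ancestors). The slides do not state the actual rule.

```python
import re

# Toy excerpt of a taxonomy as a DAG (node -> parents); not the NCBI data.
PARENTS = {
    "Mus musculus": ["Rodentia"],
    "Rattus norvegicus": ["Rodentia"],
    "Rodentia": ["Mammalia"],
    "Homo sapiens": ["Mammalia"],
    "Mammalia": [],
}

# Hypothetical controlled vocabulary: surface term -> taxonomy node.
VOCABULARY = {
    "mouse": "Mus musculus",
    "murine": "Mus musculus",
    "rat": "Rattus norvegicus",
    "human": "Homo sapiens",
    "rodent": "Rodentia",
}


def ancestors(node):
    """All nodes reachable from node by following parent links."""
    seen = set()
    stack = [node]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def keep_frame(frame_text, target):
    """Decide whether matches in a frame (abstract or sentence) are kept."""
    words = set(re.findall(r"[a-z]+", frame_text.lower()))
    mentioned = {VOCABULARY[w] for w in words if w in VOCABULARY}
    if not mentioned:
        return True  # no organism named, nothing to filter on
    compatible = {target} | ancestors(target)
    return bool(mentioned & compatible)
```

Choosing the sentence rather than the abstract as the frame makes the filter stricter, since an organism named elsewhere in the abstract no longer counts.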
9. Settings for the final runs
- Disambiguation threshold: D1, D3, D5
- Use of an ontology filter based on co-occurrence with organism names
- Significance of a dash at the end of a synonym: S-, S
  - (e.g. IL1-induced proliferation: accept, not accept)
Best search:
- D1: accept only unique matches!
- S: do not accept a match if there is a dash at the end of the synonym
- O: use the ontology filter in fly; O-: not in mouse
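The dash setting can be illustrated with a short check on the character that follows a match. The function name and signature are hypothetical; `reject_trailing_dash=True` corresponds to setting S and `False` to setting S-.

```python
def accept_match(text, start, end, reject_trailing_dash=True):
    """text[start:end] is the matched synonym.
    With reject_trailing_dash=True (setting S), a match directly followed
    by a dash is rejected; with False (setting S-), it is accepted."""
    followed_by_dash = end < len(text) and text[end] == "-"
    return not (reject_trailing_dash and followed_by_dash)


sentence = "IL1-induced proliferation"
print(accept_match(sentence, 0, 3))                              # setting S
print(accept_match(sentence, 0, 3, reject_trailing_dash=False))  # setting S-
```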
10. Impact of the ProMiner components: fly
11. Impact of different parts of the ProMiner system: mouse
Optimal search with the original dictionary without curation reaches an F-measure of 0.783
12. Impact of curation: mouse
13. Short analysis of false positives in run 3, mouse
- Ambiguity: 60
- TP, but not in gold standard: 6 cases
- Organism specificity: 13
  - ontology filter does not work for mouse?
  - but there are also organism inconsistencies in the gold standard
- "Cytokine-stimulated human osteosarcoma cells"
  - mouse_00084_testing MGI99512 Y
  - mouse_00084_testing MGI101878 Y
  - mouse_00084_testing MGI98259 Y
- "human cancer cells"
  - mouse_00152_testing MGI88139 Y
- "we have isolated genomic clones spanning the human PLA2L locus"
  - mouse_00099_testing MGI1313269 Y
- "To better characterize the regulation of human CRBP II"
  - mouse_00096_testing MGI97877 Y
- "identified a novel human protein termed Celtix-1 which binds to IRF-2"
  - mouse_00171_testing MGI96591 Y
  - mouse_00171_testing MGI1349766 Y
- "a yeast two-hybrid cDNA library from rat kidney glomeruli"
  - mouse_00098_testing MGI102784 Y
  - mouse_00098_testing MGI1916503 Y
- to identify modular components ('blocks') in the growth hormone (GH) gene promoter sequences of some 22 vertebrate species, from salmon to human
- Solar UVA, but not UVC, reaches the earth's surface and therefore is an important etiological factor for the induction of human skin cancer
- Moreover, FABD-mutated c-Abl stimulated the formation of F-actin branches in neurites of rat embryonic cortical neurons.
15. Conclusions
- ProMiner system
  - Splitting of the dictionary increases specificity and sensitivity and reduces the high manual effort of adapted curation for new dictionaries
  - Disambiguation leads to an important increase in specificity
  - Incorporation of controlled vocabulary and acronym dictionaries augments specificity further
  - The ontology filter raises specificity in fly, but does not work for the mouse data set
- Benchmark sets
  - training set only useful for rough adaptations, not for fine tuning
  - fine tuning was only possible with the dev-test set
  - organism impact not always obvious in the abstract (e.g. mouse)
16. Team
- Daniel Hanisch
- Theo Mevissen
- Katrin Fundel
- Ralf Zimmer