1
ProMiner: Organism-specific protein name detection
using approximate string matching
2
Results group 16
3
The ProMiner system (PSB 2003: "Playing Biology's
Name Game: Identifying Protein Names in
Scientific Text")
  • Dictionary generation and curation
  • Approximate search
  • Accepts permutations (synonym length > 3
    words) and small deletions and insertions
  • Tokens are assigned to different token classes
    with different weights, e.g.
  • numbers and Greek letters class,
  • modifier class (e.g. receptor),
  • description class (e.g. "-", subunit)
  • Two associated scoring measures
  • The boundary score s_b controls the end of the
    extension
  • The acceptance score s_a is a linear combination
    of token-class-specific match and mismatch terms
  • Filtering of ambiguous matches
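
As a rough illustration of the acceptance score described above, the following minimal sketch assigns tokens to classes and sums class-specific match and mismatch terms. The class members, the weights and the tokenizer are assumptions chosen for illustration, not the values used in ProMiner. The sketch ignores token order for all synonyms (the slide restricts permutations to synonyms longer than three words) and does not model the boundary score.

```python
import re

# Hypothetical token classes and weights (not the ProMiner values).
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
MODIFIERS = {"receptor", "ligand", "inhibitor"}
DESCRIPTORS = {"-", "subunit", "protein", "gene"}
MATCH_WEIGHT = {"number_greek": 3.0, "modifier": 2.0,
                "descriptor": 0.5, "standard": 2.5}
MISMATCH_WEIGHT = {"number_greek": 3.0, "modifier": 2.0,
                   "descriptor": 0.2, "standard": 2.5}


def tokenize(text):
    """Lower-case the text and split it into words, numbers and dashes."""
    return re.findall(r"[a-z]+|\d+|-", text.lower())


def token_class(token):
    """Map a token to one of the (assumed) token classes."""
    if token.isdigit() or token in GREEK:
        return "number_greek"
    if token in MODIFIERS:
        return "modifier"
    if token in DESCRIPTORS:
        return "descriptor"
    return "standard"


def acceptance_score(synonym, candidate):
    """Linear combination of class-specific match and mismatch terms.

    Token order is ignored, so a permutation scores like the original.
    """
    remaining = tokenize(candidate)
    score = 0.0
    for tok in tokenize(synonym):
        if tok in remaining:
            remaining.remove(tok)
            score += MATCH_WEIGHT[token_class(tok)]
        else:
            # Synonym token missing from the text (deletion).
            score -= MISMATCH_WEIGHT[token_class(tok)]
    for tok in remaining:
        # Extra token in the text (insertion).
        score -= MISMATCH_WEIGHT[token_class(tok)]
    return score
```

With these weights, dropping a low-weight descriptor such as "subunit" costs far less than dropping a number, which reflects the idea that numbers and Greek letters usually distinguish different proteins.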

4
Generation and curation of dictionaries
  • Yeast
  • the only modification was the addition of the
    letter "p" to each gene name
  • Mouse (described in the presentation of group 24)
  • all spelling variants in mouse,
  • manual adaptation to the development and training
    sets,
  • removal of all unspecific synonyms
  • Fly
  • obtained directly from the FlyBase database;
    entries were limited to D. melanogaster
  • standard curation was used
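
The yeast modification can be sketched in a few lines. The dictionary layout (identifier mapped to a list of names) and the function name are assumptions for illustration, and the sketch assumes that every name in an entry receives the "p" variant.

```python
def add_protein_variants(dictionary):
    """Return a copy in which every name also appears with a trailing 'p'."""
    expanded = {}
    for gene_id, names in dictionary.items():
        variants = []
        for name in names:
            for form in (name, name + "p"):
                if form not in variants:
                    variants.append(form)
        expanded[gene_id] = variants
    return expanded


# Illustrative entry only.
yeast = {"YBR160W": ["CDC28"]}
print(add_protein_variants(yeast))
```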

5
Rule-based classification of synonyms
  • Case-sensitive synonyms
  • A synonym that occurs two or more times, in
    different entries and with different case in its
    non-normalized form, must be treated as
    case-sensitive.
  • Questionable synonyms
  • highly unspecific; lead to a substantial number
    of false positives
  • occur frequently in a reference corpus,
  • occur reasonably often in a reference corpus and
    are contained in a dictionary of English words,
    or
  • match rules identifying potential sequence
    parts, (Roman or Arabic) numbers or subunit tags
    (e.g. alpha 1)
  • Standard synonyms
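
The three classes above could be assigned along the following lines. This is a minimal sketch: the thresholds, the English word list, the regular expression standing in for the sequence, number and subunit rules, and the precedence of the case-sensitive check are all assumptions, since the slides give no concrete values.

```python
import re

# Hypothetical thresholds and word list.
FREQUENT = 100     # "occurs frequently" in the reference corpus
REASONABLE = 10    # "occurs reasonably often"
ENGLISH_WORDS = {"clipped", "hedgehog", "for", "was"}
# Rough stand-in for the sequence-part, number and subunit-tag rules.
PATTERN_RULES = re.compile(
    r"^(\d+|[ivx]+|[acgt]{4,}|(alpha|beta|gamma|delta)( \d+)?)$",
    re.IGNORECASE,
)


def is_case_sensitive(synonym, dictionary):
    """True if the synonym occurs in >= 2 entries with differing case."""
    forms = set()
    entries = 0
    for names in dictionary.values():
        hits = {n for n in names if n.lower() == synonym.lower()}
        if hits:
            entries += 1
            forms |= hits
    return entries >= 2 and len(forms) >= 2


def classify(synonym, dictionary, corpus_counts):
    """Return 'case-sensitive', 'questionable' or 'standard'."""
    if is_case_sensitive(synonym, dictionary):
        return "case-sensitive"
    count = corpus_counts.get(synonym.lower(), 0)
    if (count >= FREQUENT
            or (count >= REASONABLE and synonym.lower() in ENGLISH_WORDS)
            or PATTERN_RULES.match(synonym)):
        return "questionable"
    return "standard"
```

For example, with a toy dictionary in which "Cat" and "CAT" belong to different entries, `classify("CAT", ...)` yields "case-sensitive", while a synonym such as "alpha 1" is flagged as questionable by the pattern rule.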

6
Questionable entry
More specific synonyms are generated based on a
supplied pattern file. For instance, the fly gene
"clipped" (FlyBase identifier FBgn0000354) is
expanded to "clipped locus", "clipped protein",
"gene clipped", "insertion of clipped",
"transposon clipped", ...
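
A minimal sketch of such a pattern-based expansion follows. The templates are taken from the examples on the slide, but the file format (one template per line, with "%s" marking the synonym) and the function names are assumptions.

```python
# Hypothetical pattern file contents: one template per line.
PATTERN_FILE = """\
%s locus
%s protein
gene %s
insertion of %s
transposon %s
"""


def load_patterns(text):
    """Read non-empty template lines from the pattern file text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def expand(synonym, patterns):
    """Generate more specific synonyms by filling each template."""
    return [p.replace("%s", synonym) for p in patterns]


print(expand("clipped", load_patterns(PATTERN_FILE)))
```

The questionable synonym itself would then no longer be searched on its own; only the more specific expansions are.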
7
Disambiguating object occurrences
  • Positional match disambiguation
  • acceptance score, fraction and length of match
  • Ambiguous object occurrence
  • an ambiguous synonym match is only accepted if
    a unique match of another synonym of the same
    object is also found
  • a disambiguation threshold limits the size of the
    final set of objects (D1, D3, D5)
  • Integration of
  • controlled vocabulary
  • GO cellular component,
  • body parts (fly)
  • acronym dictionaries
  • Biomedical Abbreviation Server [1]
  • putative abbreviations from all test and
    training abstracts provided

[1] J.T. Chang, H. Schütze, and R.B. Altman.
Creating an online dictionary of abbreviations
from MEDLINE. Journal of the American Medical
Informatics Association, 9(6):612-620, 2002.
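
The rule for ambiguous object occurrences can be sketched as follows. This is one possible reading of the slide: an ambiguous match keeps only those candidate objects that are also supported by a unique match, and it is accepted if the remaining set is non-empty and no larger than the threshold D. The data layout and names are assumptions.

```python
def disambiguate(matches, max_objects=1):
    """Filter synonym matches within one abstract.

    matches: list of (synonym, set_of_candidate_object_ids).
    max_objects: assumed meaning of the threshold D (1, 3 or 5).
    Returns a list of (synonym, accepted_object_ids) pairs.
    """
    # Objects supported by at least one unambiguous synonym match.
    unique_objects = set()
    for _, objects in matches:
        if len(objects) == 1:
            unique_objects |= objects

    accepted = []
    for synonym, objects in matches:
        if len(objects) == 1:
            accepted.append((synonym, set(objects)))
            continue
        # Keep only candidates confirmed by a unique match elsewhere.
        supported = objects & unique_objects
        if 1 <= len(supported) <= max_objects:
            accepted.append((synonym, supported))
    return accepted


found = [("syn1", {"A"}), ("syn2", {"A", "B"}), ("syn3", {"C", "D"})]
print(disambiguate(found))
```

Here "syn2" is resolved to object A because A also has a unique match, whereas "syn3" is dropped because neither of its candidates is supported.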
8
Ontology filter
Input: the NCBI taxonomy [9] (formalized as a
directed acyclic graph in conjunction with a
controlled vocabulary).
The filtering is based on the co-occurrence of terms
within a frame of reference, i.e. an abstract or a
sentence.
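
A co-occurrence filter of this kind might look like the sketch below. The taxonomy excerpt (with intermediate ranks omitted), the vocabulary and the decision rule are all assumptions: the slides do not specify how co-occurrence is evaluated, so the sketch simply rejects matches in a frame that mentions other organisms but not the target one.

```python
import re

# Toy excerpt of a taxonomy as a child -> parent map.
PARENT = {
    "Mus musculus": "Murinae",
    "Rattus norvegicus": "Murinae",
    "Murinae": "Mammalia",
    "Homo sapiens": "Mammalia",
    "Drosophila melanogaster": "Insecta",
}
# Hypothetical controlled vocabulary: surface term -> taxonomy node.
VOCABULARY = {
    "mouse": "Mus musculus",
    "murine": "Mus musculus",
    "rat": "Rattus norvegicus",
    "human": "Homo sapiens",
    "drosophila": "Drosophila melanogaster",
}


def is_within(node, ancestor):
    """True if `ancestor` is `node` itself or lies on its path to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def organisms_in(frame):
    """Taxonomy nodes whose vocabulary terms occur as words in the frame."""
    words = re.findall(r"[a-z]+", frame.lower())
    return {VOCABULARY[w] for w in words if w in VOCABULARY}


def keep_matches(frame, target):
    """Reject matches only if other organisms occur and the target does not."""
    found = organisms_in(frame)
    if not found:
        return True
    return any(is_within(node, target) for node in found)
```

The frame can be a sentence or a whole abstract; a larger frame makes it more likely that the target organism is mentioned somewhere, and so makes the filter more permissive.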
9
Settings for the final runs
  • Disambiguation threshold: D1, D3, D5
  • Use of an ontology filter based on co-occurrence
    with organism names: O, O-
  • Significance of a dash at the end of a synonym: S-, S
    (e.g. "IL1-induced proliferation": S- accepts
    the match, S does not)

Best search:
  • D1: accept only unique matches
  • S: do not accept a match if there is a dash at
    the end of the synonym
  • O: use the ontology filter in fly; O-: do not
    use it in mouse
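
The dash setting can be illustrated with a small check on the character that follows a match. The function name and its parameters are hypothetical; only the behaviour (reject a match that is directly followed by a dash) comes from the slide.

```python
def accept_match(text, end, reject_trailing_dash=True):
    """Decide whether a match ending at index `end` (exclusive) is kept.

    With reject_trailing_dash=True, a match that is directly followed
    by a dash is dropped; with False, the dash is ignored.
    """
    followed_by_dash = end < len(text) and text[end] == "-"
    return not (reject_trailing_dash and followed_by_dash)


text = "IL1-induced proliferation"
print(accept_match(text, 3))                              # dash rule on
print(accept_match(text, 3, reject_trailing_dash=False))  # dash rule off
```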
10
Impact of the ProMiner components: fly
11
Impact of different parts of the ProMiner system:
mouse
The optimal search with the original dictionary,
without curation, reaches an F-measure of 0.783.
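
For reference, the F-measure quoted here is presumably the balanced F-measure, i.e. the harmonic mean of precision P and recall R. The slides do not state the weighting, so equal weighting is an assumption.

```latex
F = \frac{2 \, P \, R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```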
12
Impact of curation: mouse
13
Short analysis of false positives in run 3, mouse
  • Ambiguity: 60
  • TP, but not in the gold standard: 6 cases
  • Organism specificity: 13

Does the ontology filter not work for mouse?
There are, however, also organism inconsistencies
in the gold standard:

Abstract              Gene ID      Flag   Text excerpt
mouse_00084_testing   MGI99512     Y      Cytokine-stimulated human osteosarcoma cells
mouse_00084_testing   MGI101878    Y
mouse_00084_testing   MGI98259     Y
mouse_00152_testing   MGI88139     Y      human cancer cells
mouse_00099_testing   MGI1313269   Y      we have isolated genomic clones spanning the human PLA2L locus
mouse_00096_testing   MGI97877     Y      To better characterize the regulation of human CRBP II
mouse_00171_testing   MGI96591     Y      identified a novel human protein termed Celtix-1 which binds to IRF-2
mouse_00171_testing   MGI1349766   Y
mouse_00098_testing   MGI102784    Y      a yeast two-hybrid cDNA library from rat kidney glomeruli
mouse_00098_testing   MGI1916503   Y

  • to identify modular components ('blocks') in the
    growth hormone (GH) gene promoter sequences of
    some 22 vertebrate species, from salmon to human
  • Solar UVA, but not UVC, reaches the earth's
    surface and therefore is an important
    etiological factor for the induction of human
    skin cancer
  • Moreover, FABD-mutated c-Abl stimulated the
    formation of F-actin branches in neurites of rat
    embryonic cortical neurons.

14
Conclusions
  • ProMiner system
  • Splitting the dictionary increases specificity
    and sensitivity and
  • reduces the high manual effort of adapted
    curation for new dictionaries
  • Disambiguation leads to an important increase in
    specificity
  • Incorporating a controlled vocabulary and an
    acronym dictionary increases specificity further
  • The ontology filter raises specificity in fly but
    does not work for the mouse data set

  • Benchmark sets
  • the training set was only useful for rough
    adaptations, not for fine tuning
  • fine tuning was only possible with the dev-test
    set
  • the organism is not always obvious from the
    abstract (e.g. mouse)

16
Team
  • Daniel Hanisch
  • Theo Mevissen
  • Katrin Fundel
  • Ralf Zimmer