Title: Literature Data Mining and Protein Ontology Development


1
Literature Data Mining and Protein Ontology Development
At the Protein Information Resource (PIR)
Hu ZZ, Mani I, Liu H, Hermoso V, Vijay-Shanker K, Nikolskaya A, Natale DA, and Wu CH
ISMB 2005, Detroit, Michigan
  • June 29, 2005
  • Zhang-Zhi Hu, M.D.
  • Senior Bioinformatics Scientist, PIR
  • Georgetown University Medical Center
  • Washington, DC 20007

2
PIR: Integrated Protein Informatics Resource for
Genomic/Proteomic Research (http://pir.georgetown.edu)
New version of PIR homepage
UniProt: Central international database of
protein sequence and function (http://www.uniprot.org)

3
Objective: Accurate, Consistent, and Rich
Annotation of Protein Sequence and Function
  • Literature-Based Curation: Extract Reliable
    Information from Literature
  • Function, domains/sites, developmental stages,
    catalytic activity, binding and modified
    residues, regulation, pathways, tissue
    specificity, subcellular location ...
  • Ensure high quality, accurate and up-to-date
    experimental data for each protein.
  • A major bottleneck!
  • Ontologies/Controlled Vocabularies: For
    Information Integration and Knowledge Management
  • UniProtKB entries will be annotated using widely
    accepted biological ontologies and other
    controlled vocabularies, e.g. Gene Ontology (GO)
    and EC nomenclature.

4
iProLINK: An integrated protein resource for
literature mining and literature-based curation
  1. Bibliography mapping - UniProt mapped citations
  2. Annotation extraction - annotation tagged literature
  3. Protein named entity recognition - dictionary, name tagged literature
  4. Protein ontology development - PIRSF-based ontology

5
iProLINK
http://pir.georgetown.edu/iprolink/
  • RLIMS-P text mining tool
  • Protein dictionaries
  • Name tagging guideline
  • Protein ontology

6
Protein Phosphorylation Annotation Extraction
  • Manual tagging assisted with computational
    extraction
  • Training sets of positive and negative samples

[Figure labels: Evidence attribution; RLIMS-P; 3 objects]

7
RLIMS-P: Rule-based LIterature Mining System for
Protein Phosphorylation
Download: http://pir.georgetown.edu/iprolink/
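
To illustrate what a rule-based extraction step can look like, here is a minimal sketch. It is not an actual RLIMS-P rule: the single pattern, the `PhosphoEvent` type, and the `extract` function are hypothetical, and it assumes the three objects to extract are kinase, substrate, and site, which the slides do not spell out.

```python
import re
from typing import List, NamedTuple, Optional


class PhosphoEvent(NamedTuple):
    kinase: str
    substrate: str
    site: Optional[str]  # None when the sentence names no residue


# Hypothetical pattern (an assumption, not taken from RLIMS-P):
#   "<kinase> phosphorylates <substrate> [at|on <Ser|Thr|Tyr><position>]"
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract(sentence: str) -> List[PhosphoEvent]:
    """Return every kinase/substrate/site triple matched in the sentence."""
    return [
        PhosphoEvent(m["kinase"], m["substrate"], m["site"])
        for m in PATTERN.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract("PKA phosphorylates CREB at Ser-133 in vitro."))
```

A real system would need many such patterns plus handling of passive voice, nominalizations, and protein name recognition; this sketch only shows the shape of one rule.
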
8
Benchmarking of RLIMS-P
High recall for paper retrieval and high
precision for information extraction
  • UniProtKB site feature annotation
  • Proteomics Mass Spec. data analysis: protein
    identification

9
Online RLIMS-P
(version 1.0)
http://pir.georgetown.edu/iprolink/rlimsp/
  • Search interface
  • Summary table with top hit of all sites
  • All sites and tagged text evidence

10
BioThesaurus: http://pir.georgetown.edu/iprolink/biothesaurus/
BioThesaurus v1.0 (May, 2005; m = million)
  UniProtKB entry            1.86m
  Source DB record           6.6m
  Gene/protein names/terms   3.6m

11
BioThesaurus Report
Synonyms for Metalloproteinase inhibitor 3
  • Gene/Protein Name Mapping
  • Search Synonyms
  • Resolve Name Ambiguity
  • Underlying ID Mapping
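
The name mapping, synonym search, and ambiguity check listed above can be sketched with two small indexes. This is only an illustration under stated assumptions: the record layout, the entry identifiers, and the names other than "Metalloproteinase inhibitor 3" are made up, and the real BioThesaurus data model is not shown in the slides.

```python
from collections import defaultdict

# Toy (entry_id, name) records; identifiers and most names are hypothetical.
RECORDS = [
    ("ENTRY_A", "Metalloproteinase inhibitor 3"),
    ("ENTRY_A", "NAME_X"),
    ("ENTRY_B", "NAME_X"),
    ("ENTRY_B", "Some other protein"),
]


def build_index(records):
    """Build name -> entry ids and entry id -> names lookups."""
    name_to_ids = defaultdict(set)
    id_to_names = defaultdict(set)
    for entry_id, name in records:
        name_to_ids[name.lower()].add(entry_id)  # case-insensitive lookup
        id_to_names[entry_id].add(name)
    return name_to_ids, id_to_names


def synonyms(name, name_to_ids, id_to_names):
    """All other names attached to any entry that carries this name."""
    found = set()
    for entry_id in name_to_ids.get(name.lower(), ()):
        found |= id_to_names[entry_id]
    return {n for n in found if n.lower() != name.lower()}


def is_ambiguous(name, name_to_ids):
    """A name is ambiguous when it maps to more than one entry."""
    return len(name_to_ids.get(name.lower(), ())) > 1


if __name__ == "__main__":
    n2i, i2n = build_index(RECORDS)
    print(synonyms("Metalloproteinase inhibitor 3", n2i, i2n))
    print(is_ambiguous("NAME_X", n2i))
```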

[Figure labels: ID Mapping; TMP3; Name ambiguity]

12
Protein Name Tagging
  • Tagging guideline versions 1.0 and 2.0
  • Generation of domain expert-tagged corpora
  • Inter-coder reliability: upper bound of machine
    tagging
  • Dictionary pre-tagging
  • F-measure: 0.412 (0.372 Precision, 0.462 Recall)
  • Advantages: helps with standardization and
    extent of tagging, reduces the fatigue problem,
    and improves inter-coder reliability.
  • BioThesaurus for pre-tagging
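
The reported pre-tagging scores are consistent with the balanced F-measure, the harmonic mean of precision and recall. A short sketch of that calculation follows; the function name and the `beta` parameter are illustrative choices, not something given in the slides.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was tagged correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for dictionary pre-tagging.
    print(round(f_measure(0.372, 0.462), 3))  # prints 0.412
```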

13
PIRSF in DAG View
PIRSF-Based Protein Ontology
  • PIRSF family hierarchy based on evolutionary
    relationships
  • Standardized PIRSF family names as hierarchical
    protein ontology
  • DAG: Network structure for PIRSF family
    classification system
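
A DAG differs from a tree in that a node may have more than one parent. The sketch below shows one simple way to hold such a hierarchy and walk it; the node names and the child-to-parents layout are hypothetical and do not come from the PIRSF database.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: each node lists its parents.
# "family_B" has two parents, which a strict tree could not express.
PARENTS: Dict[str, List[str]] = {
    "root_1": [],
    "root_2": [],
    "family_A": ["root_1"],
    "family_B": ["root_1", "root_2"],
    "subfamily_A1": ["family_A"],
    "subfamily_B1": ["family_B"],
}


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every node reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def leaves(parents: Dict[str, List[str]]) -> List[str]:
    """Nodes that are never listed as anyone's parent."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(n for n in parents if n not in has_child)


if __name__ == "__main__":
    print(ancestors("subfamily_B1", PARENTS))
    print(leaves(PARENTS))
```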

14
PIRSF to GO Mapping
  • Mapped 5363 curated PIRSF homeomorphic families
    and subfamilies to the GO hierarchy
  • 68 of the PIRSF families and subfamilies map to
    GO leaf nodes
  • 2329 PIRSFs have shared GO leaf nodes
  • Complements GO: PIRSF-based ontology can be used
    to analyze GO branches and concepts and to
    provide links between the GO sub-ontologies
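
Counts like those above can be derived from a family-to-term mapping plus the term hierarchy. The sketch below is one possible reading, assuming "shared leaf nodes" means that more than one family maps to the same leaf term; the term names, family names, and data layout are all hypothetical.

```python
from collections import defaultdict

# Hypothetical term hierarchy: term -> child terms (empty list = leaf).
TERM_CHILDREN = {
    "term_root": ["term_a", "term_b"],
    "term_a": ["term_a1", "term_a2"],
    "term_b": [],
    "term_a1": [],
    "term_a2": [],
}

# Hypothetical mapping of families to a single term each.
FAMILY_TO_TERM = {
    "FAM1": "term_a1",
    "FAM2": "term_a1",
    "FAM3": "term_a",
    "FAM4": "term_b",
}


def is_leaf(term, children=TERM_CHILDREN):
    """A term is a leaf when it has no child terms."""
    return not children.get(term)


def leaf_stats(family_to_term):
    """Return (families mapped to a leaf, families sharing a leaf with another)."""
    by_leaf = defaultdict(list)
    for family, term in family_to_term.items():
        if is_leaf(term):
            by_leaf[term].append(family)
    mapped = sum(len(fams) for fams in by_leaf.values())
    sharing = sum(len(fams) for fams in by_leaf.values() if len(fams) > 1)
    return mapped, sharing


if __name__ == "__main__":
    print(leaf_stats(FAMILY_TO_TERM))
```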

15
Protein Ontology Can Complement GO
GO-centric view
  • Expanding a Node: Identification of GO subtrees
    that can be expanded when GO concepts are too
    broad
  • IGFBP subfamilies and
  • High- vs. low-affinity binding for IGF between
    IGFBP and IGFBPrP

16
Exploration of Gene and Protein Ontology
PIRSF-centric view
Molecular function
Biological process
  • Systematic links between three GO sub-ontologies,
    e.g., linking molecular function and biological
    process
  • Estrogen receptor binding
  • Estrogen receptor signaling pathway

17
Summary
  • PIR iProLINK literature mining resource provides
    annotated data sets for NLP research on
    annotation extraction and protein ontology
    development
  • RLIMS-P text-mining tool extracts protein
    phosphorylation information from PubMed literature.
  • BioThesaurus can be used for name mapping to
    solve name synonym and ambiguity issues.
  • PIRSF-based protein ontology can complement other
    biological ontologies such as GO.

18
Acknowledgements
  • Research Projects
  • NIH NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
  • NSF SEIII (Entity Tagging)
  • NSF ITR (Ontology)
  • Collaborators
  • I. Mani from Georgetown University Department of
    Linguistics on protein name recognition and
    protein name ontology.
  • H. Liu from University of Maryland Department of
    Information System on protein name recognition
    and text mining.
  • Vijay K. Shanker from University of Delaware
    Department of Computer and Information Science on
    text mining of protein phosphorylation features.