Caderige: Event-based Information Extraction for the biomedical domain

Title: Caderige: Event-based Information Extraction for the biomedical domain


1
Caderige: Event-based Information Extraction for
the biomedical domain
  • E. Alphonse, S. Aubin, P. Bessières, G. Bisson,
    T. Hamon, S. Lagarrigue, A. Nazarenko, A.-P.
    Manine, C. Nédellec, M. Ould Abdel Vetah, T.
    Poibeau, D. Weissenbacher
  • http://www-leibniz.imag.fr/SICLAD/Caderige/
  • http://www-lipn.univ-paris13.fr/poibeau/Extra/

2
Caderige partnership
  • NLP, knowledge acquisition and machine learning
  • LEIBNIZ laboratory (IMAG, CNRS, Grenoble)
  • LIPN laboratory (University Paris 13, CNRS,
    Villetaneuse)
  • LRI laboratory (University Paris 11, CNRS, Orsay)
  • Biology and Bioinformatics
  • AÏDA project (IRISA, INRIA-Rennes)
  • MIG laboratory (INRA, Jouy-en-Josas;
    microbiology)
  • ENSAR laboratory (INRA, Rennes; animal genetics)

3
Caderige: aim of the project
  • Extract structured information from textual
    databases in genomics (e.g. Medline)

Example: "... the GerE protein inhibits transcription in
vitro of the sigK gene encoding sigmaK ..."
4
Extraction rules
  • Analyze and normalize the text
  • Apply extraction rules to fill the extraction
    template
  • Example sentence: "GerE stimulates cotD
    transcription and cotA transcription, and,
    unexpectedly, inhibits transcription of the
    gene (sigK)"
  • interaction(X,Y,Z) :-
    is-a(X,protein), is-a(Z,gene), is-a(Y,
    interaction), is-a(U, transcription),
    subject(X,Y), DObj(Y,U), NprepN(U,Z).
  • How can these rules be semi-automatically
    acquired from the text?
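
To make the rule on this slide concrete, here is a minimal sketch of how its body could be checked against facts produced by text normalization. The fact sets and the function name interactions are assumptions made for illustration; the slides do not describe the actual rule engine.

```python
# Hypothetical facts for a normalized sentence such as
# "GerE inhibits transcription of sigK".
IS_A = {
    ("GerE", "protein"),
    ("sigK", "gene"),
    ("inhibits", "interaction"),
    ("transcription", "transcription"),
}
SUBJECT = {("GerE", "inhibits")}          # subject(X, Y)
DOBJ = {("inhibits", "transcription")}    # DObj(Y, U)
NPREPN = {("transcription", "sigK")}      # NprepN(U, Z)


def interactions():
    """Return every (X, Y, Z) binding that satisfies the rule body."""
    results = []
    for x, y in SUBJECT:
        for y2, u in DOBJ:
            for u2, z in NPREPN:
                # The shared variables Y and U must bind to the same word.
                if y != y2 or u != u2:
                    continue
                if ((x, "protein") in IS_A and (z, "gene") in IS_A
                        and (y, "interaction") in IS_A
                        and (u, "transcription") in IS_A):
                    results.append((x, y, z))
    return results
```

Calling interactions() on these toy facts yields the single filled template (GerE, inhibits, sigK).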

5
Overview
  • Overall approach
  • NLP for text normalization
  • Ontology learning
  • Conclusion: toward extraction rule learning

6
Linguistic analysis example
  • Multi-level annotations
  • Normalization of the original text

7
IR and IE for textual database access
  • Naïve Bayes
  • Precision (P): 74
  • Recall (R): 85
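
As an illustration of the technique named on this slide, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, together with the precision and recall measures. It assumes the classifier is used to select relevant sentences; the labels, the toy training data and all function names are hypothetical and not taken from the project.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (tokens, label). Returns counts needed for prediction."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(predicted, gold, positive):
    """Precision and recall of the positive label."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy training set.
TRAIN = [
    ("gere inhibits sigk transcription".split(), "relevant"),
    ("sigk gene encodes sigmak".split(), "relevant"),
    ("samples were stored overnight".split(), "irrelevant"),
    ("cells were grown in medium".split(), "irrelevant"),
]
MODEL = train(TRAIN)
```

With this toy model, a sentence mentioning gere and transcription is classified as relevant, while one about stored cells is not.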

8
System architecture (IE engine)
9
Analysis principles
  • Provide robust and versatile NLP tools
  • Provide domain-specific resources
  • Existing certified resources
  • Acquisition from the corpus
  • Provide automatic methods for resource acquisition

10
NLP for text normalization
  • Named entity analysis
  • Terminology acquisition and filtering
  • Focused syntactic analysis

11
Named entity analysis
  • Group and normalize names corresponding to
    entities of the domain
  • Graphical variations: sigma K / sigma(K) / sigma-K
  • Morphological variations: Down syndrome / Down's
    syndrome
  • Syntactic variations: human cancer / cancers in
    human
  • Semantic variation: rat somatotropin / rat growth
    hormone
  • Synonymy due to renaming: SpoIIIG / sigma G
  • Ellipsis: EPO mimetic peptide / EPO
  • Abbreviations: Bacillus subtilis / B. subtilis
  • Acronyms: chloramphenicol acetyltransferase / CAT
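
A minimal sketch of how some of these variations could be grouped under one canonical name: graphical variation is collapsed by a surface key, and renaming, abbreviations and acronyms are handled by a lookup table. The table contents and the function names are assumptions for illustration only, not the project's actual resources.

```python
import re

# Hypothetical synonym table: surface key -> canonical entity name.
SYNONYMS = {
    "spoiiig": "sigma G",
    "sigmag": "sigma G",
    "b.subtilis": "Bacillus subtilis",
    "bacillussubtilis": "Bacillus subtilis",
    "cat": "chloramphenicol acetyltransferase",
    "chloramphenicolacetyltransferase": "chloramphenicol acetyltransferase",
}


def surface_key(name):
    """Collapse graphical variation: case, spaces, hyphens, parentheses."""
    return re.sub(r"[\s\-()]", "", name.lower())


def normalize(name):
    """Map a surface form to its canonical name, or to its key if unknown."""
    key = surface_key(name)
    return SYNONYMS.get(key, key)
```

For example, sigma K, sigma(K) and sigma-K all share the key sigmak, and SpoIIIG is mapped to sigma G. Morphological, syntactic and semantic variations would need richer resources than this lookup.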

12
Terminology processing
  • Terminology is necessary for:
  • Sentence filtering
  • Syntactic analysis
  • Ontology learning
  • Two main sources of terminology:
  • External validated resources (e.g. MeSH, Gene
    Ontology)
  • Acquisition from a representative corpus (Acabit,
    Daille 1995)
  • How to automatically filter terms?
  • Morpho-syntactic variation (Jacquemin 2001)
  • External certified terminology
  • Statistical filtering (Daille 1995)

13
Terminology filtering
Terms in corpus
14
Syntactic analysis
  • Syntactic analysis is necessary for:
  • Ontology acquisition
  • Structured information extraction
  • Dependency-based vs. constituent-based grammars
  • Constituent grammars are efficient at segmenting
    the text into syntactic phrases
  • But they fail to extract relevant functional
    relationships between phrases
  • Our choice: the Link Parser
    (www.link.cs.cmu.edu/link/)
  • Partial dependency-based analysis
  • Efficiency proven on useful syntactic relations
  • Accessible linguistic resources

15
Integration of the Link Parser
  • Normalized text input to limit linguistic
    ambiguity
  • Revised sentence segmentation
  • Integration of named entities (e.g. Bacillus
    subtilis)
  • Integration of terminology (e.g. in vitro)
  • Link Parser resource tuning for biology
  • Addition of unknown words from the biology domain
  • Addition of new rules (e.g. omission of the
    determiner before some nouns that require one)
  • Separate evaluation of each kind of dependency

16
Semantic annotation
  • Semantic annotation is necessary for:
  • Ontology acquisition
  • Structured information extraction
  • Resources for biology text annotation have been
    produced:
  • An annotation tool called Cadixe
  • DTD for the biological domain
  • Annotated corpora, mainly on Bacillus subtilis
    (MIG)
  • Easy adaptation to new DTDs and new domains
  • Cadixe is currently used by Swiss-Prot to annotate
    other biological corpora (with another DTD)

17
The Cadixe annotation tool
18
Ontology learning
  • Concept clustering
  • Hierarchical learning
  • Evaluation

19
Hierarchy of conceptual cluster learning
  • Learning goal: conceptual hierarchies where nodes
    are classes of semantically related terms
  • Method: conceptual clustering based on
    distributional analysis
  • The semantic distance between two terms is based
    on the number of contexts they share in the
    training corpus
  • Common contexts of two terms:
  • Co-occurrences in a window or in a document
  • Co-occurrences in syntactic contexts (Grishman
    et al., 78; Dagan, 98)
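
The idea of a distance based on shared syntactic contexts can be sketched as follows. This uses a plain Jaccard overlap of context sets as a stand-in; it is not the Asium distance, and the triples and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (head, relation, term) triples from a parsed corpus.
TRIPLES = [
    ("inhibit", "subject", "GerE"),
    ("stimulate", "subject", "GerE"),
    ("bind", "subject", "GerE"),
    ("inhibit", "subject", "SpoIIID"),
    ("stimulate", "subject", "SpoIIID"),
    ("encode", "subject", "sigK"),
]


def contexts_by_term(triples):
    """Collect, for each term, the set of (head, relation) contexts."""
    contexts = defaultdict(set)
    for head, relation, term in triples:
        contexts[term].add((head, relation))
    return contexts


def shared_context_distance(a, b, contexts):
    """1 - Jaccard overlap of the context sets (0 = identical contexts)."""
    ca, cb = contexts[a], contexts[b]
    union = ca | cb
    if not union:
        return 1.0
    return 1.0 - len(ca & cb) / len(union)


CONTEXTS = contexts_by_term(TRIPLES)
```

Here GerE and SpoIIID share two of three contexts and end up close, while GerE and sigK share none and are at maximal distance.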

20
Syntactic-based distance
  • Corpus parsing
  • Syntactic dependencies between heads and their
    arguments
  • Learning examples for Asium (Faure & Nédellec,
    1999)

21
Learning conceptual hierarchies by clustering
  • Concepts correspond to classes of terms
    co-occurring in different syntactic contexts
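
A minimal sketch of bottom-up clustering of terms into a hierarchy, assuming each term is described by its set of syntactic contexts. It repeatedly merges the two closest classes under a Jaccard distance; this is a generic agglomerative scheme chosen for illustration, not the Asium algorithm itself, and the data and names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two context sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def cluster(term_contexts):
    """Bottom-up clustering; returns a nested tuple describing the hierarchy."""
    # Each cluster is (tree, pooled context set).
    clusters = [(t, set(c)) for t, c in sorted(term_contexts.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = jaccard_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] | clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]


# Hypothetical context sets for four terms.
TERM_CONTEXTS = {
    "GerE": {("inhibit", "subject"), ("stimulate", "subject")},
    "SpoIIID": {("inhibit", "subject"), ("stimulate", "subject")},
    "sigK": {("transcribe", "object"), ("encode", "subject")},
    "cotD": {("transcribe", "object")},
}
```

On this toy input the two regulators are grouped first, then the two genes, and the two classes are joined at the root.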

22
Clustering results
  • Tentative evaluation
  • Ontology acquisition tested on a scientific
    corpus (Agrovoc)
  • Comparison with other distances (Greedy, Dagan)
  • The Asium semantic distance produces fewer
    induced examples (lower recall) but has higher
    precision than comparable methods

[Figure: plot with recall and precision axes]
23
Conclusion: toward extraction rule learning
  • Summary
  • Perspectives

24
Summary on Caderige
  • The approach: text normalization for information
    extraction
  • A two-tier architecture
  • Production tool: a chain of configurable NLP
    tools
  • Acquisition tools: corpus-based knowledge
    acquisition, based on ML techniques and text
    normalization
  • Current activities
  • Complete the NLP integration
  • Machine learning to acquire extraction rules

25
Example of learned IE rule
  • Annotated sentence
  • The <agent type="protein"> GerE </agent>
    protein <interaction type="negative">inhibits
    </interaction> <target type="expression">transcription
    of <source type="gene">the sigK gene</source>
    encoding <product> sigmaK</product> </target>
  • Example of learned rule (Propal; Alphonse et al.,
    01)
  • interaction(X, Y) :-
  • isa(X,protein), isa(Y,gene), isa(Z,
    neg_interaction), subject(X,Z), DObj(Z, U).
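
To show how such an annotated sentence could be turned into structured training data, here is a minimal sketch that reads the annotation with Python's standard xml.etree.ElementTree module. It assumes the annotation is well-formed XML with agent, interaction, target, source and product elements; the enclosing sentence element and the function name extract_event are hypothetical additions for the example.

```python
import xml.etree.ElementTree as ET

# Assumed XML form of the annotated sentence, wrapped in a hypothetical
# <sentence> root element so that it parses as a single document.
ANNOTATED = (
    '<sentence>The <agent type="protein">GerE</agent> protein '
    '<interaction type="negative">inhibits</interaction> '
    '<target type="expression">transcription of '
    '<source type="gene">the sigK gene</source> encoding '
    '<product>sigmaK</product></target></sentence>'
)


def extract_event(xml_text):
    """Return the agent, interaction and source of one annotated sentence."""
    root = ET.fromstring(xml_text)
    agent = root.find("agent")
    interaction = root.find("interaction")
    source = root.find("target/source")
    return {
        "agent": agent.text.strip(),
        "agent_type": agent.get("type"),
        "trigger": interaction.text.strip(),
        "polarity": interaction.get("type"),
        "source": source.text.strip(),
        "source_type": source.get("type"),
    }
```

The resulting dictionary holds the protein agent, the negative interaction and the gene source, which are the kinds of typed arguments that appear in the body of the learned rule.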