Title: Caderige: Event-based Information Extraction for the biomedical domain

1
Caderige: Event-based Information Extraction for
the biomedical domain

E. Alphonse, S. Aubin, P. Bessières, G. Bisson,
T. Hamon, S. Lagarrigue, A. Nazarenko, A.-P.
Manine, C. Nédellec, M. Ould Abdel Vetah, T.
Poibeau, D. Weissenbacher
http://www-leibniz.imag.fr/SICLAD/Caderige/
http://www-lipn.univ-paris13.fr/poibeau/Extra/

2
Caderige partnership

NLP, knowledge acquisition and machine learning
LEIBNIZ laboratory (IMAG, CNRS, Grenoble)
LIPN laboratory (University Paris 13, CNRS,
Villetaneuse)
LRI laboratory (University Paris 11, CNRS, Orsay)
Biology and Bioinformatics
AÏDA project (IRISA, INRIA-Rennes)
MIG laboratory (INRA, Jouy-en-Josas.
Microbiology)
ENSAR laboratory (INRA, Rennes. Animal genetics)

3
Caderige: aim of the project

Extract structured information from textual
databases in genomics (e.g. Medline)

... the GerE protein inhibits transcription in
vitro of the sigK gene encoding sigmaK ...

4
Extraction rules

Analyze and normalize the text
Apply extraction rules to fill the extraction
template
GerE stimulates cotD transcription and cotA transcription, and,
unexpectedly, inhibits transcription of the gene (sigK)
interaction(X,Y,Z) :-
  is-a(X,protein), is-a(Z,gene), is-a(Y,interaction),
  is-a(U,transcription),
  subject(X,Y), DObj(Y,U), NprepN(U,Z).
How can these rules be semi-automatically
acquired from the text?
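
As a rough illustration of how the interaction rule above could be matched against a normalized sentence, the sketch below hand-codes the rule body as a Python function over a small set of facts. The fact encoding, the token names and the function name are assumptions made for this example; they are not Caderige's actual representation.

```python
from itertools import product

# Hypothetical facts for "GerE ... inhibits transcription of the gene (sigK)".
IS_A = {("GerE", "protein"), ("sigK", "gene"),
        ("inhibits", "interaction"), ("transcription", "transcription")}
SUBJECT = {("GerE", "inhibits")}          # subject(X, Y)
DOBJ = {("inhibits", "transcription")}    # DObj(Y, U)
NPREPN = {("transcription", "sigK")}      # NprepN(U, Z)

def match_interaction(is_a, subject, dobj, nprepn):
    """Return every (X, Y, Z) binding that satisfies the rule body."""
    results = []
    for (x, y), (y2, u), (u2, z) in product(subject, dobj, nprepn):
        if y != y2 or u != u2:
            continue  # shared variables must bind to the same token
        if ((x, "protein") in is_a and (z, "gene") in is_a
                and (y, "interaction") in is_a
                and (u, "transcription") in is_a):
            results.append((x, y, z))
    return results

print(match_interaction(IS_A, SUBJECT, DOBJ, NPREPN))
```

The typing facts play the role of the semantic annotation and the three relation sets play the role of the syntactic dependencies; the rule only fires when both levels agree.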

5
Overview

Overall approach
NLP for text normalization
Ontology learning
Conclusion: toward extraction rule learning

6
Linguistic analysis example

Multi-level annotations
Normalization of the original text

7
IR and IE for textual database access

Naïve Bayes
P (precision) 74
R (recall) 85

8
System architecture (IE engine)

9
Analysis principles

Provide robust and versatile NLP tools
Provide domain-specific resources
Existing certified resources
Acquisition from the corpus
Provide automatic methods for resource acquisition

10
NLP for text normalization

Named entity analysis
Terminology acquisition and filtering
Focused syntactic analysis

11
Named entity analysis

Group and normalize names corresponding to
entities of the domain
Graphical variations: sigma K / sigma(K) / sigma-K
Morphological variations: Down syndrome / Down's
syndrome
Syntactic variations: human cancer / cancers in
human
Semantic variations: rat somatotropin / rat growth
hormone
Synonymy due to renaming: SpoIIIG / sigma G
Ellipsis: EPO mimetic peptide / EPO
Abbreviations: Bacillus subtilis / B. subtilis
Acronyms: chloramphenicol acetyltransferase / CAT
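
A minimal sketch of how such variants might be grouped under one canonical key is given below. The regular-expression rules and the small synonym table are assumptions for illustration only; a real system would rely on curated domain resources, and only the graphical, renaming and abbreviation cases are covered here.

```python
import re

# Hypothetical variant table (renaming and abbreviation cases from the slide).
SYNONYMS = {
    "spoiiig": "sigma g",
    "b. subtilis": "bacillus subtilis",
}

def normalize_name(name):
    """Map a surface form to a canonical key (illustrative rules only)."""
    key = name.lower().strip()
    # Graphical variation: "sigma(K)" and "sigma-K" become "sigma k".
    key = re.sub(r"[()\-]", " ", key)
    key = re.sub(r"\s+", " ", key).strip()
    # Renaming and abbreviations are resolved by table lookup.
    return SYNONYMS.get(key, key)

for form in ["sigma K", "sigma(K)", "sigma-K", "SpoIIIG", "B. subtilis"]:
    print(form, "->", normalize_name(form))
```

Morphological, syntactic and semantic variations need linguistic processing or lexical resources and are deliberately left out of this sketch.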

12
Terminology processing

Terminology is necessary for:
Sentence filtering
Syntactic analysis
Ontology learning
Two main sources for terminology:
External validated resources (e.g. MeSH, Gene
Ontology)
Acquisition from a representative corpus (Acabit,
Daille 1995)
How to automatically filter terms?
Morpho-syntactic variation (Jacquemin 2001)
External certified terminology
Statistical filtering (Daille 1995)
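
The sketch below shows one simple way two of these filters could be combined: a candidate term is kept if it appears in an external certified list or if it is frequent enough in the corpus. The plain frequency threshold is an assumption that stands in for a statistical score; it is not the measure used in the cited work, and the function and parameter names are hypothetical.

```python
from collections import Counter

def filter_terms(candidates, certified, min_count=2):
    """Keep candidates that are certified or occur at least min_count times."""
    counts = Counter(term.lower() for term in candidates)
    certified = {term.lower() for term in certified}
    return sorted(term for term, n in counts.items()
                  if term in certified or n >= min_count)

# Hypothetical candidate occurrences extracted from a corpus.
candidates = ["in vitro", "sigK gene", "sigK gene", "this study",
              "in vitro", "growth hormone"]
print(filter_terms(candidates, certified={"growth hormone"}))
```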

13
Terminology filtering

Terms in corpus

14
Syntactic analysis

Syntactic analysis is necessary for
Ontology acquisition
Structured information extraction
Dependency vs. constituent-based grammars
Constituent grammars are efficient at segmenting the
text into syntactic phrases
But fail to extract relevant functional
relationships between phrases
Our choice: the Link Parser (www.link.cs.cmu.edu/link/)
Partial dependency-based analysis
Efficiency proven on useful syntactic relations
Accessible linguistic resources

15
Integration of the Link Parser

Normalized text input to limit linguistic
ambiguity
Revised sentence segmentation
Integration of named entities (e.g. Bacillus
subtilis)
Integration of terminology (e.g. in vitro)
Link Parser resource tuning for biology
Addition of unknown words from the biology domain
Addition of new rules (e.g. omission of the
determiner before some nouns that require one)
Separate evaluation of each kind of dependency
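
One way to picture the integration of named entities and terms is as a pre-processing pass that fuses each known multi-word unit into a single token before the sentence reaches the parser. The sketch below assumes whitespace tokenization and an underscore-joining convention; both are illustrative choices, and no Link Parser API is called.

```python
def merge_multiword_units(tokens, units):
    """Greedily replace known multi-word units with single joined tokens."""
    # Longest units first so that longer matches win over shorter ones.
    unit_tokens = sorted(
        ([w.lower() for w in u.split()] for u in units if u.split()),
        key=len, reverse=True)
    lowered = [t.lower() for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        for unit in unit_tokens:
            n = len(unit)
            if lowered[i:i + n] == unit:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No unit starts here: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

sentence = "GerE inhibits transcription in vitro in Bacillus subtilis".split()
print(merge_multiword_units(sentence, ["in vitro", "Bacillus subtilis"]))
```

Treating "in vitro" or "Bacillus subtilis" as one token removes attachment decisions the parser would otherwise have to make inside the unit.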

16
Semantic annotation

Semantic annotation is necessary for
Ontology acquisition
Structured information extraction
Resources for biology text annotation have been
produced
An annotation tool called Cadixe
DTD for the biological domain
Annotated corpora, mainly on Bacillus subtilis
(MIG)
Easy adaptation to new DTDs and new domains
Cadixe is currently used by Swiss-Prot to annotate
other biological corpora (with another DTD)

17
The Cadixe annotation tool

18
Ontology learning

Concept clustering
Hierarchical learning
Evaluation

19
Learning a hierarchy of conceptual clusters

Learning goal: conceptual hierarchies where nodes
are classes of semantically-related terms
Method: conceptual clustering based on
distributional analysis
Semantic distance between terms is based on the
number of common contexts shared in the training
corpus.
Common contexts of two terms:
Co-occurrences in a window or in a document
Co-occurrences in syntactic contexts (Grishman
et al., 78; Dagan, 98)
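
To make the idea of a distance based on shared contexts concrete, the sketch below uses a Jaccard-style measure over sets of contexts. This particular formula is an assumption chosen for simplicity; it is not claimed to be the distance used by Asium or by the cited works, and the example contexts are invented.

```python
def context_distance(contexts_a, contexts_b):
    """Return 1 - (shared contexts / all contexts); 0 means identical sets."""
    a, b = set(contexts_a), set(contexts_b)
    if not a and not b:
        return 1.0  # no evidence at all: treat the terms as unrelated
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical syntactic contexts, written as (head verb, relation) pairs.
CONTEXTS = {
    "GerE":   {("inhibit", "subject"), ("stimulate", "subject"), ("bind", "subject")},
    "sigmaK": {("inhibit", "subject"), ("bind", "subject")},
    "sigK":   {("transcribe", "object")},
}

print(context_distance(CONTEXTS["GerE"], CONTEXTS["sigmaK"]))
print(context_distance(CONTEXTS["GerE"], CONTEXTS["sigK"]))
```

The same function works for window-based co-occurrence if each context is simply a neighbouring word instead of a (head, relation) pair.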

20
Syntactic-based distance

Corpus parsing
Syntactic dependencies between heads and their
arguments
Learning examples for Asium (Faure & Nédellec,
1999)
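
As a sketch of how parsed dependencies could be turned into learning examples, the code below groups each argument term with the (head, relation) contexts it occurs in. The triple format and the example data are assumptions for illustration; the actual input format of Asium is not described in these slides.

```python
from collections import defaultdict

def build_contexts(triples):
    """Group argument terms by the (head, relation) contexts they occur in."""
    contexts = defaultdict(set)
    for head, relation, argument in triples:
        contexts[argument].add((head, relation))
    return dict(contexts)

# Hypothetical (head, relation, argument) dependencies from a parsed corpus.
triples = [
    ("inhibit", "subject", "GerE"),
    ("inhibit", "object", "transcription"),
    ("stimulate", "subject", "GerE"),
]
print(build_contexts(triples))
```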

21
Learning conceptual hierarchies by clustering

Concepts correspond to classes of terms
co-occurring in different syntactic contexts
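
The following sketch illustrates bottom-up clustering of terms by the syntactic contexts they share. It is a generic agglomerative procedure with a Jaccard-style distance and a fixed threshold, all of which are assumptions for this example rather than a description of the Asium algorithm; it also returns flat classes only, whereas recording each merge would yield a hierarchy.

```python
def jaccard_distance(a, b):
    """Distance in [0, 1] based on the proportion of shared contexts."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def cluster_terms(term_contexts, threshold=0.5):
    """Repeatedly merge the two closest classes until none is close enough."""
    # Each cluster is a pair: (set of terms, pooled set of contexts).
    clusters = [({t}, set(c)) for t, c in sorted(term_contexts.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = jaccard_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining classes are too dissimilar to merge
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(terms) for terms, _ in clusters)

# Hypothetical contexts, written as "relation:head" strings.
TERM_CONTEXTS = {
    "GerE":   {"subject:inhibit", "subject:stimulate"},
    "sigmaK": {"subject:inhibit", "subject:stimulate", "subject:bind"},
    "sigK":   {"object:transcribe", "object:express"},
    "cotD":   {"object:transcribe"},
}
print(cluster_terms(TERM_CONTEXTS))
```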

22
Clustering results

Tentative evaluation
Ontology acquisition tested on a scientific
corpus (Agrovoc)
Comparison with other distances (Greedy, Dagan)
The Asium semantic distance produces fewer induced
examples (lower recall) but has a higher
precision than comparable methods
[Figure axes: recall, precision]

23
Conclusion: toward extraction rule learning

Summary
Perspectives

24
Summary on Caderige

The approach: text normalization for information
extraction
A two-tier architecture
Production tool: a chain of configurable NLP
tools
Acquisition tools: corpus-based knowledge
acquisition, based on ML techniques and text
normalization
Current activities
Complete the NLP integration
Machine learning to acquire extraction rules

25
Example of learned IE rule

Annotated sentence
The <agent type="protein"> GerE </agent>
protein <interaction type="negative">inhibits
</interaction> <target type="expression">transcription
of <source type="gene">the sigK gene</source>
encoding <product> sigmaK</product> </target>
Example of learned rule (Propal, Alphonse et al.,
01)
interaction(X, Y) :-
isa(X,protein), isa(Y,gene), isa(Z,
neg_interaction), subject(X,Z), DObj(Z, U).
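
The annotated sentence above can be read back into typed spans with a standard XML parser, as sketched below. The markup is reconstructed here with quoted attribute values and wrapped in a hypothetical <sentence> root element so that it parses as a single document; the output tuple layout is likewise an assumption, not Cadixe's export format.

```python
import xml.etree.ElementTree as ET

# Reconstructed annotation, wrapped in an assumed <sentence> root element.
ANNOTATED = (
    '<sentence>The <agent type="protein">GerE</agent> protein '
    '<interaction type="negative">inhibits</interaction> '
    '<target type="expression">transcription of '
    '<source type="gene">the sigK gene</source> encoding '
    '<product>sigmaK</product></target></sentence>'
)

def extract_annotations(xml_text):
    """Return (tag, type attribute, covered text) for each annotation."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get("type"), "".join(el.itertext()).strip())
            for el in root.iter() if el is not root]

for annotation in extract_annotations(ANNOTATED):
    print(annotation)
```

Spans such as the agent, the interaction and the source gene are the kind of typed arguments that the body of a learned rule refers to.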
