Title: Caderige: Event-based Information Extraction for the biomedical domain
1 Caderige: Event-based Information Extraction for the biomedical domain
- E. Alphonse, S. Aubin, P. Bessières, G. Bisson, T. Hamon, S. Lagarrigue, A. Nazarenko, A.-P. Manine, C. Nédellec, M. Ould Abdel Vetah, T. Poibeau, D. Weissenbacher
- http://www-leibniz.imag.fr/SICLAD/Caderige/
- http://www-lipn.univ-paris13.fr/poibeau/Extra/
2 Caderige partnership
- NLP, knowledge acquisition and machine learning
- LEIBNIZ laboratory (IMAG, CNRS, Grenoble)
- LIPN laboratory (University Paris 13, CNRS, Villetaneuse)
- LRI laboratory (University Paris 11, CNRS, Orsay)
- Biology and Bioinformatics
- AÏDA project (IRISA, INRIA-Rennes)
- MIG laboratory (INRA, Jouy-en-Josas; microbiology)
- ENSAR laboratory (INRA, Rennes; animal genetics)
3 Caderige: aim of the project
- Extract structured information from textual databases in genomics (e.g. Medline)
"... the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK ..."
4 Extraction rules
- Analyze and normalize the text
- Apply extraction rules to fill the extraction template
- "GerE stimulates cotD transcription and cotA transcription, and, unexpectedly, inhibits transcription of the gene (sigK)"
- interaction(X,Y,Z) :-
- is-a(X,protein), is-a(Z,gene), is-a(Y,interaction), is-a(U,transcription), subject(X,Y), DObj(Y,U), NprepN(U,Z).
- How can these rules be semi-automatically acquired from the text?
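To make the rule on this slide concrete, here is a minimal Python sketch of how such a rule could be matched against facts produced by text normalization. The fact sets are hand-written toy data for the example sentence, and the function name `interactions` is hypothetical; this is an illustration of rule application, not the Caderige engine.

```python
# Toy facts for "GerE ... inhibits transcription of the gene (sigK)".
# Predicate names mirror the rule on the slide; the data is hand-written.
IS_A = {
    ("GerE", "protein"),
    ("sigK", "gene"),
    ("inhibits", "interaction"),
    ("transcription", "transcription"),
}
SUBJECT = {("GerE", "inhibits")}          # subject(X, Y)
DOBJ = {("inhibits", "transcription")}    # DObj(Y, U)
NPREPN = {("transcription", "sigK")}      # NprepN(U, Z)


def interactions(is_a, subject, dobj, nprepn):
    """Return every (X, Y, Z) that satisfies the body of the rule."""
    found = []
    for x, y in sorted(subject):
        for y2, u in sorted(dobj):
            for u2, z in sorted(nprepn):
                # The shared variables Y and U must bind to the same words.
                if y2 != y or u2 != u:
                    continue
                needed = {(x, "protein"), (z, "gene"),
                          (y, "interaction"), (u, "transcription")}
                if needed <= is_a:
                    found.append((x, y, z))
    return found


print(interactions(IS_A, SUBJECT, DOBJ, NPREPN))
```

Each triple returned corresponds to one filled extraction template: the agent protein, the interaction word, and the target gene.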
5 Overview
- Overall approach
- NLP for text normalization
- Ontology learning
- Conclusion: toward extraction rule learning
6 Linguistic analysis example
- Multi-level annotations
- Normalization of the original text
7 IR and IE for textual database access
8 System architecture (IE engine)
9 Analysis principles
- Provide robust and versatile NLP tools
- Provide domain-specific resources
- Existing certified resources
- Acquisition from the corpus
- Provide automatic methods for resource acquisition
10 NLP for text normalization
- Named entity analysis
- Terminology acquisition and filtering
- Focused syntactic analysis
11 Named entity analysis
- Group and normalize names corresponding to entities of the domain
- Graphical variations: sigma K / sigma(K) / sigma-K
- Morphological variations: Down syndrome / Down's syndrome
- Syntactic variations: human cancer / cancers in human
- Semantic variations: rat somatotropin / rat growth hormone
- Synonymy due to renaming: SpoIIIG / sigma G
- Ellipsis: EPO mimetic peptide / EPO
- Abbreviations: Bacillus subtilis / B. subtilis
- Acronyms: chloramphenicol acetyltransferase / CAT
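Some of the variations listed on this slide can be handled with a small lookup, sketched below in Python. The helper names `graphical_key` and `normalize` and the `CANONICAL` table are assumptions made for illustration; a real system would need a curated dictionary, and bare acronyms such as CAT are ambiguous outside their context.

```python
import re


def graphical_key(name):
    """Collapse graphical variation: case, spaces, hyphens, parentheses."""
    return re.sub(r"[\s\-()]+", "", name.lower())


# Hypothetical table mapping collapsed variants to one canonical name.
CANONICAL = {
    "sigmak": "sigma K",
    "sigmag": "sigma G",
    "spoiiig": "sigma G",                      # synonymy due to renaming
    "bacillussubtilis": "Bacillus subtilis",
    "b.subtilis": "Bacillus subtilis",         # abbreviation
    "chloramphenicolacetyltransferase": "chloramphenicol acetyltransferase",
    "cat": "chloramphenicol acetyltransferase",  # acronym
}


def normalize(name):
    """Return the canonical form of a name, or the name itself if unknown."""
    return CANONICAL.get(graphical_key(name), name)


for variant in ["sigma(K)", "sigma-K", "SpoIIIG", "B. subtilis", "CAT"]:
    print(variant, "->", normalize(variant))
```

Graphical variants are resolved by the key function alone, while synonyms, abbreviations and acronyms rely on entries in the table.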
12 Terminology processing
- Terminology is necessary for:
- Sentence filtering
- Syntactic analysis
- Ontology learning
- 2 main sources for terminology:
- External validated resources (e.g. MeSH, Gene Ontology)
- Acquisition from a representative corpus (Acabit, Daille 1995)
- How to automatically filter terms?
- Morpho-syntactic variation (Jacquemin 2001)
- External certified terminology
- Statistical filtering (Daille 1995)
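As a rough illustration of combining two of the filters named above, the Python sketch below keeps a candidate term if it appears in a certified terminology or if it is frequent enough in the corpus. The plain frequency threshold is a stand-in chosen for brevity, not the statistical measure of the cited work, and the function name, the `min_count` parameter and the toy data are all assumptions.

```python
from collections import Counter


def filter_terms(candidates, certified, min_count=2):
    """Keep candidates that are certified or occur at least min_count times.

    candidates: list of candidate term occurrences extracted from a corpus.
    certified:  set of terms from an external validated resource.
    """
    counts = Counter(candidates)
    kept = [term for term, n in counts.items()
            if term in certified or n >= min_count]
    return sorted(kept)


# Toy data: three candidate terms, one of them certified.
occurrences = ["sigK gene", "sigK gene", "in vitro", "previous study"]
print(filter_terms(occurrences, certified={"in vitro"}))
```

Raising `min_count` makes the filter stricter for uncertified terms while certified terms are always kept.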
13 Terminology filtering
Terms in corpus
14 Syntactic analysis
- Syntactic analysis is necessary for:
- Ontology acquisition
- Structured information extraction
- Dependency vs constituent-based grammars
- Constituent grammars are efficient at segmenting the text into syntactic phrases
- But they fail to extract relevant functional relationships between phrases
- Our choice: the Link Parser (www.link.cs.cmu.edu/link/)
- Partial dependency-based analysis
- Efficiency proved on useful syntactic relations
- Linguistic resources accessibility
15 Integration of the Link Parser
- Normalized text input to limit linguistic ambiguity
- Revised sentence segmentation
- Integration of named entities (e.g. Bacillus subtilis)
- Integration of terminology (e.g. in vitro)
- Link Parser resource tuning for biology
- Addition of unknown words from the biology domain
- Addition of new rules (e.g. omission of the determiner before some nouns that require one)
- Separate evaluation of each kind of dependency
16 Semantic annotation
- Semantic annotation is necessary for:
- Ontology acquisition
- Structured information extraction
- Resources for biology text annotation have been produced:
- An annotation tool called Cadixe
- DTD for the biological domain
- Annotated corpora, mainly on Bacillus subtilis (MIG)
- Easy adaptation to new DTDs and new domains
- Cadixe is currently used by Swiss-Prot to annotate other biological corpora (with another DTD)
17 The Cadixe annotation tool
18 Ontology learning
- Concept clustering
- Hierarchical learning
- Evaluation
19 Hierarchy of conceptual cluster learning
- Learning goal: conceptual hierarchies where nodes are classes of semantically related terms
- Method: conceptual clustering based on distributional analysis
- Semantic distance between terms is based on the number of common contexts they share in the training corpus
- Common contexts of two terms:
- Co-occurrences in a window or in a document
- Co-occurrences in syntactic contexts (Grishman et al., 78; Dagan, 98)
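The idea of a distance based on shared contexts can be sketched in a few lines of Python. The Jaccard-style formula below is a generic stand-in chosen for illustration, not the Asium distance, and the syntactic contexts (verb and relation pairs) are invented toy data.

```python
def shared_contexts(contexts_a, contexts_b):
    """Number of contexts in which both terms were observed."""
    return len(set(contexts_a) & set(contexts_b))


def context_distance(contexts_a, contexts_b):
    """Jaccard-style distance: 0.0 for identical contexts, 1.0 for none shared."""
    a, b = set(contexts_a), set(contexts_b)
    union = a | b
    if not union:
        return 1.0  # no evidence at all: treat the terms as unrelated
    return 1.0 - len(a & b) / len(union)


# Toy syntactic contexts: (governing verb, grammatical relation).
sigK = {("transcribe", "DObj"), ("express", "DObj"), ("encode", "subject")}
cotD = {("transcribe", "DObj"), ("express", "DObj")}
gerE = {("inhibit", "subject")}

print(shared_contexts(sigK, cotD), context_distance(sigK, cotD))
print(shared_contexts(sigK, gerE), context_distance(sigK, gerE))
```

Terms that fill the same argument positions of the same verbs end up close to each other, which is what lets a clustering step group them into one class.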
20 Syntactic-based distance
- Corpus parsing
- Syntactic dependencies between heads and their arguments
- Learning examples for Asium (Faure & Nédellec, 1999)
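One plausible way to turn parsed dependencies into learning examples is to group the arguments seen with each head and relation, as in the Python sketch below. The triple format `(head, relation, argument)`, the function name `build_examples` and the data are assumptions made for illustration; the actual input format of Asium is not described on the slide.

```python
from collections import Counter, defaultdict


def build_examples(dependencies):
    """Group argument head words by (head, relation) with their frequencies."""
    examples = defaultdict(Counter)
    for head, relation, argument in dependencies:
        examples[(head, relation)][argument] += 1
    return examples


# Toy dependencies as a parser might output them after normalization.
deps = [
    ("inhibit", "subject", "GerE"),
    ("inhibit", "DObj", "transcription"),
    ("stimulate", "subject", "GerE"),
    ("stimulate", "DObj", "transcription"),
    ("inhibit", "DObj", "expression"),
]

for context, arguments in sorted(build_examples(deps).items()):
    print(context, dict(arguments))
```

Each resulting entry lists the terms observed in one syntactic context, which is the kind of evidence a distributional distance can compare.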
21 Learning conceptual hierarchies by clustering
- Concepts correspond to classes of terms
co-occurring in different syntactic contexts
22 Clustering results
- Tentative evaluation
- Ontology acquisition tested on a scientific corpus (Agrovoc)
- Comparison with other distances (Greedy, Dagan)
- The Asium semantic distance produces fewer induced examples (lower recall) but has a higher precision than comparable methods
[Figure: recall vs. precision]
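For reference, precision and recall as used in this comparison can be computed as in the Python sketch below. The sets are toy values chosen only to show a high-precision, lower-recall outcome; they are not results from the evaluation, and the function name is hypothetical.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy example: everything induced is correct, but half the reference is missed.
induced = {"a", "b"}
expected = {"a", "b", "c", "d"}
print(precision_recall(induced, expected))
```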
23 Conclusion: toward extraction rule learning
24 Summary on Caderige
- The approach: text normalization for information extraction
- A two-tier architecture
- Production tool: a chain of configurable NLP tools
- Acquisition tools: corpus-based knowledge acquisition, based on ML techniques and text normalization
- Current activities
- Complete the NLP integration
- Machine learning to acquire extraction rules
25 Example of learned IE rule
- Annotated sentence
- The <agent type="protein">GerE</agent> protein <interaction type="negative">inhibits</interaction> <target type="expression">transcription of <source type="gene">the sigK gene</source> encoding <product>sigmaK</product></target>
- Example of learned rule (Propal, Alphonse et al., 01)
- interaction(X, Y) :-
- isa(X,protein), isa(Y,gene), isa(Z,neg_interaction), subject(X,Z), DObj(Z,U).
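Annotated sentences of this kind can serve as training input once the markup is read back into structured form. The Python sketch below assumes the annotation is stored as well-formed XML with quoted attributes and a wrapping `sentence` element; both are assumptions, as is the helper name `read_annotations`.

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in a hypothetical <sentence> root element.
ANNOTATED = (
    '<sentence>The <agent type="protein">GerE</agent> protein '
    '<interaction type="negative">inhibits</interaction> '
    '<target type="expression">transcription of '
    '<source type="gene">the sigK gene</source> encoding '
    '<product>sigmaK</product></target></sentence>'
)


def read_annotations(xml_text):
    """Return (tag, type attribute, covered text) for each annotation."""
    root = ET.fromstring(xml_text)
    annotations = []
    for element in root.iter():
        if element is root:
            continue  # skip the wrapper itself
        text = "".join(element.itertext()).strip()
        annotations.append((element.tag, element.get("type"), text))
    return annotations


for annotation in read_annotations(ANNOTATED):
    print(annotation)
```

Nested annotations such as `source` and `product` inside `target` are reported both on their own and as part of the text covered by the enclosing element.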