Title: Unsupervised Information Extraction in the OntoBasis Project
1Unsupervised Information Extraction in the
OntoBasis Project
- Marie-Laure Reinberger, Walter Daelemans
- CNTS
- University of Antwerp
2Context
- OntoBasis: Foundation, Construction, Services and Applications of Ontologies
- Elaboration and adaptation of semantic knowledge extraction tools for building domain-specific ontologies
3Method 1
- raw text
- ↓ shallow parser
- parsed text, e.g. [NP1Subject The/DT patients/NNS NP1Subject] [VP1 followed/VBD VP1]
- ↓ selection
- classes, e.g. mask: use, wear, remove
- ↓ similarity
- clusters, e.g. mask, protective eyewear, face mask, glove
- ↓ pattern matching
- linked clusters, e.g. {mask, protective eyewear} during {operation, surgery}
4Corpus
- SwissProt corpus
- 13M words
- Extracted from Medline abstracts
- Protein and genes
5Shallow Parsing
- Processing of a large amount of text
- Syntactic analysis -> semantic information
- Detection of the syntactic structure subject-verb-object (demo: http://ilk.kub.nl)
6Example
The patients have followed a healthy diet .
The/DT patients/NNS have/VBP followed/VBD a/DT healthy/JJ diet/NN ./.
[NP The/DT patients/NNS NP] [VP have/VBP followed/VBD VP] [NP a/DT healthy/JJ diet/NN NP] ./.
[NP1Subject The/DT patients/NNS NP1Subject] [VP1 have/VBP followed/VBD VP1] [NP1Object a/DT healthy/JJ diet/NN NP1Object] ./.
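
The last annotation level above can be read off as a subject-verb-object triple. The following is a minimal sketch, assuming the chunks are delimited as `[LABEL token/TAG ... LABEL]`; that notation, the function name `extract_svo` and the choice of the last word of a chunk as its head are illustrative assumptions, not the parser's documented interface.

```python
import re

# Assumed chunk notation (hypothetical): "[LABEL tok/TAG tok/TAG LABEL]".
CHUNK_RE = re.compile(r"\[(\S+)\s+(.*?)\s+\1\]")

def extract_svo(parsed):
    """Return (subject, verb, object) head words from one parsed sentence."""
    roles = {}
    for label, body in CHUNK_RE.findall(parsed):
        # Strip the POS tag from each token: "diet/NN" -> "diet".
        roles[label] = [tok.rsplit("/", 1)[0] for tok in body.split()]
    subj = roles.get("NP1Subject")
    verb = roles.get("VP1")
    obj = roles.get("NP1Object")
    if not (subj and verb and obj):
        return None
    # Heuristic (assumption): the last word of a chunk is taken as its head.
    return subj[-1], verb[-1], obj[-1]

parsed = ("[NP1Subject The/DT patients/NNS NP1Subject] "
          "[VP1 have/VBP followed/VBD VP1] "
          "[NP1Object a/DT healthy/JJ diet/NN NP1Object] ./.")
triple = extract_svo(parsed)
```

Sentences lacking one of the three chunks return `None`, so they can simply be skipped when collecting verb-object pairs.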
7Selection (1): Syntactic structure verb-direct object
- Pairs main verb - nominal string (NS), the NS being a string of adjectives and nouns
- Building of classes:
  - amino_acid_sequence: deduce, predict, derive, compare, determine
  - arginine: change, convert, replace
  - mitosis: exit, enter, leave, initiate, complete
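
Building the classes amounts to grouping the verbs by the nominal string they take as direct object. A minimal sketch, assuming the verb-object pairs have already been extracted; `build_classes` is a hypothetical helper name and the pairs are a small subset of the slide's examples.

```python
from collections import defaultdict

def build_classes(pairs):
    """Group (verb, nominal_string) pairs into one verb class per nominal string."""
    classes = defaultdict(set)
    for verb, ns in pairs:
        classes[ns].add(verb)
    return dict(classes)

pairs = [
    ("deduce", "amino_acid_sequence"), ("predict", "amino_acid_sequence"),
    ("change", "arginine"), ("convert", "arginine"), ("replace", "arginine"),
    ("exit", "mitosis"), ("enter", "mitosis"),
]
classes = build_classes(pairs)
```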
8Selection (2): Filtering
- Use of a contrastive corpus: Wall Street Journal (WSJ)
- Elimination of the NS containing nouns appearing in the WSJ corpus
- Example: administration, board
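
The filtering step can be sketched as a set lookup. This is a simplified illustration: it assumes nominal strings are joined with underscores and, lacking POS tags, checks every word of the string rather than only its nouns; `filter_domain_strings` and the toy noun set are hypothetical.

```python
def filter_domain_strings(nominal_strings, contrastive_nouns):
    """Keep only nominal strings none of whose words occur in the contrastive corpus."""
    kept = []
    for ns in nominal_strings:
        words = ns.lower().split("_")
        if not any(w in contrastive_nouns for w in words):
            kept.append(ns)
    return kept

# Toy stand-in for the nouns observed in the contrastive (WSJ) corpus.
wsj_nouns = {"administration", "board", "market"}
candidates = ["amino_acid_sequence", "administration", "board", "arginine"]
domain_terms = filter_domain_strings(candidates, wsj_nouns)
```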
9Selection (3): Statistical measures
- Baseline: frequency measure
- Probability measure
- Resnik measure: the selectional preference value of a verb is strong when this verb combines with infrequent nominal strings
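
Resnik's selectional preference strength is the relative entropy between the distribution of arguments given a verb and their overall distribution, S(v) = sum over n of P(n|v) log2(P(n|v) / P(n)). The sketch below applies it directly to nominal strings, which is an assumption (Resnik's original formulation is over semantic classes); the slides do not give the exact variant used, and the counts are toy data.

```python
import math
from collections import Counter, defaultdict

def preference_strength(pairs):
    """Selectional preference strength per verb: sum_n P(n|v) * log2(P(n|v) / P(n))."""
    total = len(pairs)
    ns_counts = Counter(ns for _, ns in pairs)
    by_verb = defaultdict(Counter)
    for verb, ns in pairs:
        by_verb[verb][ns] += 1
    strength = {}
    for verb, counts in by_verb.items():
        n_verb = sum(counts.values())
        s = 0.0
        for ns, c in counts.items():
            p_ns_given_verb = c / n_verb
            p_ns = ns_counts[ns] / total
            s += p_ns_given_verb * math.log2(p_ns_given_verb / p_ns)
        strength[verb] = s
    return strength

# Toy counts: "convert" only combines with the infrequent string "arginine".
pairs = ([("convert", "arginine")] * 2 + [("show", "protein")] * 3
         + [("show", "gene")] * 3 + [("encode", "protein")] * 2)
strength = preference_strength(pairs)
```

In this toy data the verb that combines only with the least frequent nominal string receives the highest value, which matches the behaviour described on the slide.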
10Clustering of nominal strings
- Similarity between classes of verbs: nominal strings combining with a set of common verbs share semantic features
- Clusters:
  - alanine, glycine, arginine, cysteine
  - codon, antibody, monoclonal_antibody, pcr, primer, oligonucleotide
  - clue, insight, information, new_insight
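
The slide does not name the similarity measure or the clustering algorithm. As a sketch of the idea only, the code below assumes Jaccard overlap between verb classes and a greedy grouping with a fixed threshold; both choices, the function names and the toy verb sets are assumptions.

```python
def jaccard(a, b):
    """Overlap between two verb sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_nominal_strings(classes, threshold=0.5):
    """Greedy grouping: an NS joins the first cluster that holds a similar member."""
    clusters = []
    for ns, verbs in classes.items():
        for cluster in clusters:
            if any(jaccard(verbs, classes[other]) >= threshold for other in cluster):
                cluster.append(ns)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([ns])
    return clusters

# Toy verb classes (illustrative, not taken from the corpus).
classes = {
    "alanine": {"replace", "convert", "change"},
    "glycine": {"replace", "convert"},
    "insight": {"provide", "give", "gain"},
    "information": {"provide", "give"},
}
clusters = cluster_nominal_strings(classes)
```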
11Pattern Matching
- Nominal string - Preposition - Nominal string:
  - blood_vessel_growth on ribonucleolytic_activity
  - amino_acid_residue in polymerase
  - primer from amino_acid_sequence
- Trios organized in classes: amino_acid_residue in {polymerase, exon, region, N-terminal_region, catalysis}
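
The pattern can be matched over POS-tagged text. A minimal sketch, assuming Penn Treebank tags (`JJ*`/`NN*` for the words of a nominal string, `IN` for prepositions) and well-formed `word/TAG` tokens; the function names and the grouping key are illustrative choices.

```python
from collections import defaultdict

def is_nominal(tag):
    # Adjective or noun tags in the Penn Treebank tag set (JJ*, NN*).
    return tag.startswith(("JJ", "NN"))

def find_trios(tagged):
    """Return (NS1, prep, NS2) trios from a 'word/TAG word/TAG ...' string."""
    tokens = [tok.rsplit("/", 1) for tok in tagged.split()]
    # Collapse runs of adjectives/nouns into one nominal string joined by "_".
    units = []  # (text, kind) with kind in {"NS", "PREP", "OTHER"}
    for word, tag in tokens:
        if is_nominal(tag):
            if units and units[-1][1] == "NS":
                units[-1] = (units[-1][0] + "_" + word, "NS")
            else:
                units.append((word, "NS"))
        elif tag == "IN":
            units.append((word, "PREP"))
        else:
            units.append((word, "OTHER"))
    trios = []
    for (a, ka), (p, kp), (b, kb) in zip(units, units[1:], units[2:]):
        if (ka, kp, kb) == ("NS", "PREP", "NS"):
            trios.append((a, p, b))
    return trios

def group_trios(trios):
    """Organize trios into classes keyed by (NS1, preposition)."""
    classes = defaultdict(list)
    for ns1, prep, ns2 in trios:
        classes[(ns1, prep)].append(ns2)
    return dict(classes)

tagged = ("the/DT amino/NN acid/NN residue/NN in/IN polymerase/NN ./. "
          "a/DT primer/NN from/IN amino/NN acid/NN sequence/NN ./.")
trios = find_trios(tagged)
grouped = group_trios(trios)
```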
12Combining clustering with pattern matching
- Enlarging clusters and building links between clusters
- Initial link: {combination, sequence, use} of {oligonucleotide_probe, polymerase, chain_reaction, probe, set, pcr, primer}
- Enlarged and linked: {combination, sequence, use, gene} of/by {oligonucleotide_probe, polymerase, chain_reaction, probe, set, pcr, primer}; of/on {cDNA, exon, basis, sequence}; from {amino_acid_sequence, sequence, region}
13Evaluation
- Use of UMLS (Unified Medical Language System) thesaurus and semantic network
- Extraction of pairs of terms related in UMLS
- Recall and precision computed according to the quantity of UMLS pairs found in the clusters
- Only a part of the relations is evaluated!
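
One way to read this evaluation is pair-based scoring: every pair of terms sharing a cluster is checked against the reference pairs. The exact formulas are not given on the slide, so the definitions below (precision over cluster pairs, recall over reference pairs) are an assumption, and the reference pairs are toy data rather than real UMLS content.

```python
from itertools import combinations

def cluster_pairs(clusters):
    """All unordered term pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs

def precision_recall(clusters, gold_pairs):
    """Score cluster pairs against reference pairs (e.g. pairs related in UMLS)."""
    found = cluster_pairs(clusters)
    gold = {tuple(sorted(p)) for p in gold_pairs}
    hits = found & gold
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

clusters = [["alanine", "glycine", "arginine"], ["clue", "insight"]]
gold_pairs = [("glycine", "alanine"), ("arginine", "cysteine")]  # toy reference pairs
precision, recall = precision_recall(clusters, gold_pairs)
```

Because the reference covers only part of the relations, a pair missing from it is counted as wrong even when it is meaningful, which depresses the precision figure.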
14Results
- Without filtering, high occurrences: Recall 30, Precision 16
- Terminology filtered:
  - Frequency: Recall 59, Precision 4
  - Resnik: Recall 45, Precision 9
  - Probability: Recall 47, Precision 7
15Method 2
- raw text
- ↓ shallow parser
- parsed text, e.g. [NP1Subject The/DT patients/NNS NP1Subject] [VP1 followed/VBD VP1]
- ↓ pattern matching
- general relations
- ↓ selection
- highly rated relations
- ↓ pattern matching
- functional relations
- Examples: protein of amino_acid; RNA synthetize cDNA
16Shallow parser
- Trained on new terminology, including protein names, to get rid of wrongly detected SVO constructions such as "gene murine chromosome" or "expression acetyltransferase gene"
- Same corpus
17Filtering
- Selection of patterns: nominal string1 - prep - nominal string2
- Statistical measures, probability of: NS1-Prep, Prep-NS2, NS1-Prep-NS2
- Selection of the n NS with the best measures
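
The three probabilities can be estimated as relative frequencies over the extracted trios. How the slide's method combines them into one rating is not stated, so the sketch below simply reports all three and ranks by the trio probability; that ranking rule, the function name and the toy counts are assumptions.

```python
from collections import Counter

def rank_trios(trios, n):
    """Return the n most frequent (NS1, prep, NS2) trios with their relative frequencies."""
    total = len(trios)
    left = Counter((ns1, prep) for ns1, prep, _ in trios)    # NS1-Prep
    right = Counter((prep, ns2) for _, prep, ns2 in trios)   # Prep-NS2
    full = Counter(trios)                                    # NS1-Prep-NS2
    scored = []
    for (ns1, prep, ns2), count in full.items():
        scored.append({
            "trio": (ns1, prep, ns2),
            "p_ns1_prep": left[(ns1, prep)] / total,
            "p_prep_ns2": right[(prep, ns2)] / total,
            "p_trio": count / total,
        })
    # Assumed ranking rule: highest trio probability first.
    scored.sort(key=lambda s: s["p_trio"], reverse=True)
    return scored[:n]

# Toy occurrence counts (illustrative only).
trios = ([("primer", "from", "amino_acid_sequence")] * 3
         + [("short_arm", "of", "human_chromosome")] * 2
         + [("region", "of", "human_chromosome")])
best = rank_trios(trios, 2)
```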
18Examples
- Nominal string - Preposition - Nominal string:
  - short_arm of human_chromosome
  - oligonucleotide_probe from N-terminal_amino_acid_sequence
  - primer from amino_acid_sequence
  - apoptosis during neuronal_differentiation
19Next step
- Selection of subject-verb-object patterns containing the NS selected in the previous step
- Rated according to the previous measure and the probability of appearance of the verb
- Use of a (basic) stoplist
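
The selection part of this step can be sketched as a filter over subject-verb-object triples. The rating is left out because the slide does not say how the two measures are combined; the stoplist contents and the function name are hypothetical.

```python
# Hypothetical basic stoplist of uninformative verbs.
STOPLIST = {"be", "have", "do"}

def select_svo(triples, selected_ns, stoplist=STOPLIST):
    """Keep SVO triples that mention a selected NS and whose verb is not a stop word."""
    kept = []
    for subj, verb, obj in triples:
        if verb in stoplist:
            continue
        if subj in selected_ns or obj in selected_ns:
            kept.append((subj, verb, obj))
    return kept

triples = [
    ("protein", "encode", "gene"),
    ("protein", "be", "enzyme"),       # dropped: verb is in the stoplist
    ("author", "report", "result"),    # dropped: no selected NS
]
selected = select_svo(triples, {"protein", "gene"})
```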
20Examples
- Subject - Verb - Object:
  - protein encoding gene
  - lysine replaced glutamic_acid
  - meiosis expressed gene
  - RNA synthetizes cDNA
21Diagram: relations extracted around amino acids
- Terms: valine, alanine, isoleucine, methionine, glycine, leucine, glutamic_acid, aspartic_acid, signal, cleavage, Edman_degradation
- Link labels: replace, replaces, for, to, gave
- Legible relations: Edman_degradation gave glycine; aspartic_acid replaces glycine; glycine to glutamic_acid
22Diagram: relations extracted around protein
- protein: show immunoreactivity; show sequence_similarity; bind copper; represses transcription; encode gene; of amino_acid; onto membrane
- Other links: DNA inhibits protein_synthesis; induction requires; RNA requires
23Problems
- Possible negations
- Preposition link not always precise enough
(part_of relations)
24Evaluation by experts of the domain
- 261 relations: 165 functional / 96 prepositional
- False/irrelevant: 118 (95 functional / 23 prepositional)
- General info/weak relevance: 21 (12 functional / 9 prepositional)
- Specific info/strong relevance: 122 (58 functional / 64 prepositional)
- Precision: 55%
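
The reported precision is reproduced if relations of weak and of strong relevance both count as correct: (21 + 122) / 261 is about 0.55. That reading is inferred from the arithmetic rather than stated on the slide; the per-type figures computed below follow from the same assumption and are not reported in the source.

```python
# Counts copied from the evaluation slide: (functional, prepositional).
counts = {
    "false_or_irrelevant": (95, 23),
    "weak_relevance": (12, 9),
    "strong_relevance": (58, 64),
}

def precision(column=None):
    """Share of relations judged relevant (weak or strong).

    column: 0 for functional, 1 for prepositional, None for both together.
    """
    pick = (lambda c: sum(c)) if column is None else (lambda c: c[column])
    relevant = pick(counts["weak_relevance"]) + pick(counts["strong_relevance"])
    total = sum(pick(c) for c in counts.values())
    return relevant / total

overall = precision()          # 143 / 261
functional = precision(0)      # 70 / 165
prepositional = precision(1)   # 73 / 96
```

Under this reading the prepositional relations are judged relevant more often than the functional ones.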
25What next
- Sort and refine part_of relations: hydrogenase of alcaligenes; central_region of chromosome
- and others: phenylalanine to tyrosine ??
- Pattern Verb-prep-object: guanine changing arginine ?
26Possible applications
- Ontology adaptation to the medical/protein domain
- Document selection
- Information extraction