Unsupervised Information Extraction in the OntoBasis Project


1
Unsupervised Information Extraction in the
OntoBasis Project
  • Marie-Laure Reinberger, Walter Daelemans
  • CNTS
  • University of Antwerp

2
Context
  • OntoBasis Foundation, Construction, Services
    and Applications of Ontologies
  • Elaboration and adaptation of semantic knowledge
    extraction tools for the building of specific
    domain ontology

3
Method 1
  • raw text
  • → shallow parser
  • parsed text: [NP1Subject The/DT patients/NNS NP1Subject] [VP1 followed/VBD VP1]
  • → selection
  • classes: mask: use, wear, remove
  • → similarity
  • clusters: [mask, protective eyewear, face mask, glove]
  • → pattern matching
  • linked clusters: [mask, protective eyewear] during [operation, surgery]

4
Corpus
  • SwissProt corpus
  • 13M words
  • Extracted from Medline abstracts
  • Proteins and genes

5
Shallow Parsing
  • Processing of a large amount of text
  • Syntactic analysis → semantic information
  • Detection of the syntactic structure
    subject-verb-object (demo: http://ilk.kub.nl)

6
Example
The patients have followed a healthy diet .
The/DT patients/NNS have/VBP followed/VBD a/DT healthy/JJ diet/NN ./.
[NP The/DT patients/NNS NP] [VP have/VBP followed/VBD VP] [NP a/DT healthy/JJ diet/NN NP] ./.
[NP1Subject The/DT patients/NNS NP1Subject] [VP1 have/VBP followed/VBD VP1] [NP1Object a/DT healthy/JJ diet/NN NP1Object] ./.
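
The chunked output in this example can be turned into a simple data structure for the later steps. The sketch below is only an illustration and is not the parser used in the project: it assumes that any token without a "/" is a chunk label that opens a chunk and closes it when repeated, and that square brackets around chunks are optional. The function name parse_chunks is hypothetical.

```python
from typing import List, Tuple

# (label, [(word, tag), ...])
Chunk = Tuple[str, List[Tuple[str, str]]]

def parse_chunks(line: str) -> List[Chunk]:
    """Split one line of shallow-parser output into labelled chunks.

    Assumption: a token without "/" is a chunk label; its first occurrence
    opens the chunk and the next occurrence of the same label closes it.
    Square brackets around chunks are optional and ignored.
    """
    chunks: List[Chunk] = []
    label = None
    words: List[Tuple[str, str]] = []
    for raw in line.split():
        tok = raw.strip("[]")
        if not tok:
            continue
        if "/" in tok:
            # word/TAG token; tokens outside any chunk (e.g. "./.") are skipped
            word, tag = tok.rsplit("/", 1)
            if label is not None:
                words.append((word, tag))
        elif label is None:
            label, words = tok, []      # opening label
        elif tok == label:
            chunks.append((label, words))  # closing label
            label = None
    return chunks

parsed = parse_chunks(
    "NP1Subject The/DT patients/NNS NP1Subject "
    "VP1 have/VBP followed/VBD VP1 "
    "NP1Object a/DT healthy/JJ diet/NN NP1Object ./."
)
```

Here parsed holds three chunks (subject, verb phrase, object), from which a verb-object pair such as (followed, healthy diet) can be read off.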

7
Selection (1): Syntactic structure verb-direct object
  • Pairs main verb-nominal string (NS), the NS being a
    string of adjectives and nouns
  • Building of classes:
      amino_acid_sequence: deduce, predict, derive, compare, determine
      arginine: change, convert, replace
      mitosis: exit, enter, leave, initiate, complete
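
As a rough illustration of how such classes could be built, the sketch below groups verbs by the nominal string they take as direct object. It assumes the verb-object pairs have already been extracted from the parsed text; build_classes is a hypothetical name, not a function from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_classes(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each nominal string to the set of verbs it appears with as object."""
    classes: Dict[str, Set[str]] = defaultdict(set)
    for verb, nominal_string in pairs:
        classes[nominal_string].add(verb)
    return dict(classes)

# (verb, nominal string) pairs taken from the slide's examples
pairs = [
    ("change", "arginine"), ("convert", "arginine"), ("replace", "arginine"),
    ("exit", "mitosis"), ("enter", "mitosis"), ("leave", "mitosis"),
    ("initiate", "mitosis"), ("complete", "mitosis"),
]
classes = build_classes(pairs)
```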

8
Selection (2): Filtering
  • Use of a contrastive corpus: Wall Street Journal (WSJ)
  • Elimination of the NS containing nouns that appear
    in the WSJ corpus
  • Examples: administration, board
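
The filtering step might look like the sketch below. It is a simplification under two assumptions: nominal strings are joined with underscores, and every component is checked against the contrastive vocabulary (the slide restricts the check to nouns, which would require the POS tags). The names filter_nominal_strings and contrastive_nouns are hypothetical.

```python
from typing import Iterable, List, Set

def filter_nominal_strings(nominal_strings: Iterable[str],
                           contrastive_nouns: Set[str]) -> List[str]:
    """Drop nominal strings that share a component with the contrastive corpus."""
    kept = []
    for ns in nominal_strings:
        parts = ns.lower().split("_")
        if not any(part in contrastive_nouns for part in parts):
            kept.append(ns)
    return kept

# Tiny stand-in for the vocabulary of the contrastive (WSJ) corpus
contrastive_nouns = {"administration", "board"}
candidates = ["amino_acid_sequence", "administration", "board", "arginine"]
domain_terms = filter_nominal_strings(candidates, contrastive_nouns)
```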

9
Selection (3): Statistical measures
  • Baseline: frequency measure
  • Probability measure
  • Resnik measure: the selectional preference value
    of a verb is strong when this verb combines with
    infrequent nominal strings
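
Resnik's selectional preference strength is the relative entropy between the distribution of arguments given a verb and their overall distribution: S(v) = sum over ns of P(ns|v) * log2(P(ns|v) / P(ns)). The sketch below computes it directly over nominal strings, which is an assumption made for illustration (Resnik's original formulation uses semantic classes, and the slides do not give the exact variant). The pair counts are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def selectional_preference_strength(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return S(v) for every verb, from a list of (verb, nominal string) pairs."""
    total = len(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    ns_counts = Counter(ns for _, ns in pairs)
    pair_counts = Counter(pairs)
    strength: Dict[str, float] = {}
    for (verb, ns), n in pair_counts.items():
        p_ns_given_v = n / verb_counts[verb]   # P(ns | v)
        p_ns = ns_counts[ns] / total           # P(ns)
        term = p_ns_given_v * math.log2(p_ns_given_v / p_ns)
        strength[verb] = strength.get(verb, 0.0) + term
    return strength

# Hypothetical counts: "have" takes frequent objects, "replace" a rare one
pairs = ([("have", "protein")] * 4 + [("have", "gene")] * 4
         + [("replace", "arginine")] * 2)
strength = selectional_preference_strength(pairs)
```

With these counts the verb that combines only with the infrequent nominal string gets the higher value, which matches the behaviour described on the slide.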

10
Clustering of nominal strings
  • Similarity between classes of verbs: nominal
    strings that combine with a common set of verbs
    share semantic features
  • Clusters: alanine glycine arginine cysteine
    codon antibody monoclonal_antibody pcr primer
    oligonucleotide clue insight information
    new_insight
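
The slides do not say which similarity measure or clustering algorithm was used. As a minimal sketch, assuming Jaccard overlap between verb sets and a greedy grouping with a fixed threshold, the idea can be illustrated as follows. The verb sets are invented for the example.

```python
from typing import Dict, List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two verb sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(classes: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping: add a nominal string to the first cluster that
    contains a member whose verb set is similar enough, else start a new one."""
    clusters: List[List[str]] = []
    for ns in sorted(classes):
        for members in clusters:
            if any(jaccard(classes[ns], classes[m]) >= threshold for m in members):
                members.append(ns)
                break
        else:
            clusters.append([ns])
    return clusters

# Hypothetical verb classes for four nominal strings
classes = {
    "alanine": {"replace", "substitute", "convert"},
    "glycine": {"replace", "substitute"},
    "insight": {"provide", "give"},
    "clue": {"provide", "give", "offer"},
}
clusters = cluster(classes)
```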

11
Pattern Matching
  • Nominal string + Preposition + Nominal string:
      blood_vessel_growth on ribonucleolytic_activity
      amino_acid_residue in polymerase
      primer from amino_acid_sequence
  • Trios organized in classes:
      amino_acid_residue in: polymerase, exon, region,
      N-terminal_region, catalysis
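
A simple way to picture this step: scan each sequence for a preposition flanked by two nominal strings, then group the resulting trios by their left-hand side. The sketch assumes the input has already been reduced to nominal strings and prepositions; the preposition list and the function names are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PREPOSITIONS = {"of", "in", "on", "from", "during", "by", "to", "for", "onto"}

Trio = Tuple[str, str, str]

def extract_trios(tokens: List[str]) -> List[Trio]:
    """Find every (nominal string, preposition, nominal string) window."""
    trios = []
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        if (mid in PREPOSITIONS and left not in PREPOSITIONS
                and right not in PREPOSITIONS):
            trios.append((left, mid, right))
    return trios

def group_trios(trios: List[Trio]) -> Dict[Tuple[str, str], List[str]]:
    """Group trios sharing the same first nominal string and preposition."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for ns1, prep, ns2 in trios:
        if ns2 not in groups[(ns1, prep)]:
            groups[(ns1, prep)].append(ns2)
    return dict(groups)

sequences = [
    ["amino_acid_residue", "in", "polymerase"],
    ["amino_acid_residue", "in", "exon"],
    ["primer", "from", "amino_acid_sequence"],
]
trios = [t for seq in sequences for t in extract_trios(seq)]
groups = group_trios(trios)
```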

12
Combining clustering with pattern matching
  • Enlarging clusters and building links between
    clusters
  • (combination sequence use) of
    (oligonucleotide_probe polymerase chain_reaction
    probe set prc primer)
  • (combination sequence use gene) of/by
    (oligonucleotide_probe polymerase
    chain_reaction probe set prc primer) of/on
    (cDNA exon basis sequence)
    from (amino_acid_sequence sequence region)

13
Evaluation
  • Use of UMLS (Unified Medical Language System)
    thesaurus and semantic network
  • Extraction of pairs of terms related in UMLS
  • Recall and precision computed according to the
    quantity of UMLS pairs found in the clusters
  • Only a part of the relations is evaluated!
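
The exact scoring procedure is not given on the slide. One plausible reading, sketched below as an assumption, treats every pair of terms that share a cluster as a proposed relation and compares those pairs with the pairs related in UMLS. The gold pairs in the example are placeholders for illustration, not real UMLS content.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

def evaluate(clusters: List[List[str]],
             umls_pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (recall, precision) of cluster co-membership against gold pairs."""
    gold = {frozenset(pair) for pair in umls_pairs}
    found = {frozenset(pair)
             for members in clusters
             for pair in combinations(sorted(set(members)), 2)}
    hits = found & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(found) if found else 0.0
    return recall, precision

clusters = [["alanine", "glycine", "codon"], ["clue", "insight"]]
# Placeholder gold standard standing in for pairs related in UMLS
umls_pairs = [("alanine", "glycine"), ("antibody", "monoclonal_antibody")]
recall, precision = evaluate(clusters, umls_pairs)
```

Because only pairs present in the gold standard can count as hits, relations that are correct but absent from UMLS lower the measured precision, which is the limitation the last bullet points out.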

14
Results
  • Without filtering, high occurrences: Recall 30%, Precision 16%
  • Terminology filtered:
      Measure       Recall (%)   Precision (%)
      Frequency     59           4
      Resnik        45           9
      Probability   47           7

15
Method 2
  • raw text
  • → shallow parser
  • parsed text: [NP1Subject The/DT patients/NNS NP1Subject] [VP1 followed/VBD VP1]
  • → pattern matching
  • general relations: protein of amino_acid
  • → selection
  • highly rated relations
  • → pattern matching
  • functional relations: RNA synthetize cDNA

16
Shallow parser
  • Trained on new terminology including protein
    names, to get rid of spurious SVO constructions such
    as "gene murine chromosome" and "expression
    acetyltransferase gene"
  • same corpus

17
Filtering
  • Selection of patterns: nominal string1 prep
    nominal string2
  • Statistical measures: probability of
    NS1-Prep, Prep-NS2, NS1-Prep-NS2
  • Selection of the n NS with the best measures
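
The three probabilities can be estimated as relative frequencies over the extracted trios. How they were combined into a single rating is not stated, so the sketch below simply ranks by the probability of the full NS1-Prep-NS2 pattern; that ranking choice, the function names, and the counts are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Trio = Tuple[str, str, str]

def pattern_probabilities(trios: List[Trio]) -> Dict[Trio, Tuple[float, float, float]]:
    """For each trio return (P(NS1-Prep), P(Prep-NS2), P(NS1-Prep-NS2))."""
    total = len(trios)
    left = Counter((ns1, prep) for ns1, prep, _ in trios)
    right = Counter((prep, ns2) for _, prep, ns2 in trios)
    full = Counter(trios)
    return {
        trio: (left[trio[:2]] / total, right[trio[1:]] / total, n / total)
        for trio, n in full.items()
    }

def top_n(trios: List[Trio], n: int) -> List[Trio]:
    """Keep the n trios with the highest full-pattern probability."""
    probs = pattern_probabilities(trios)
    return sorted(probs, key=lambda t: probs[t][2], reverse=True)[:n]

# Hypothetical occurrence counts
trios = ([("primer", "from", "amino_acid_sequence")] * 2
         + [("oligonucleotide_probe", "from", "amino_acid_sequence")]
         + [("short_arm", "of", "human_chromosome")])
probs = pattern_probabilities(trios)
best = top_n(trios, 1)
```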

18
Examples
  • Nominal string + Preposition + Nominal string:
      short_arm of human_chromosome
      oligonucleotide_probe from N-terminal_amino_acid_sequence
      primer from amino_acid_sequence
      apoptosis during neuronal_differentiation

19
Next step
  • Selection of subject-verb-object patterns:
      - containing the NS selected in the previous step
      - rated according to the previous measure and
        the probability of appearance of the verb
      - use of a (basic) stoplist
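
A minimal sketch of this selection, leaving out the rating: keep a subject-verb-object triple when its subject or object is one of the previously selected nominal strings and its verb is not on the stoplist. Applying the stoplist to verbs and accepting a match on either argument are assumptions; the stoplist entries are invented.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

def select_svo(triples: Iterable[Triple],
               selected_ns: Set[str],
               stoplist: Set[str]) -> List[Triple]:
    """Filter SVO triples by selected nominal strings and a verb stoplist."""
    kept = []
    for subject, verb, obj in triples:
        if verb in stoplist:
            continue
        if subject in selected_ns or obj in selected_ns:
            kept.append((subject, verb, obj))
    return kept

triples = [
    ("RNA", "synthetizes", "cDNA"),
    ("protein", "is", "gene"),
    ("author", "describes", "method"),
]
selected = select_svo(triples,
                      selected_ns={"RNA", "cDNA", "protein", "gene"},
                      stoplist={"is", "has"})
```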

20
Examples
  • Subject + Verb + Object:
      protein encoding gene
      lysine replaced glutamic_acid
      meiosis expressed gene
      RNA synthetizes cDNA

21
  • valine - signal - cleavage - replace -
    alanine - for - isoleucine - for - alanine -
    to - methionine - for - glycine - for -
    leucine
  • Edman_degradation - gave - glycine
  • aspartic_acid - replaces - glycine
  • glycine - to - glutamic_acid - to - aspartic_acid

22
  • protein show immunoreactivity - show -
    sequence_similarity - bind copper - represses
    transcription - encode gene - of
    amino_acid - onto membrane
  • DNA inhibits - protein_synthesis
  • induction - requires RNA requires -

23
Problems
  • Possible negations
  • Preposition link not always precise enough
    (part_of relations)

24
Evaluation by experts of the domain
  • 261 relations: 165 functional / 96 prepositional
  • False/irrelevant: 118 (95 / 23)
  • General info/weak relevance: 21 (12 / 9)
  • Specific info/strong relevance: 122 (58 / 64)
  • Precision: 55%
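
The reported precision is consistent with counting both the weakly and the strongly relevant relations as correct: (21 + 122) / 261 is about 54.8%. The short check below reproduces that arithmetic from the counts on the slide; the per-type breakdown is derived from the same counts and is not stated on the slide.

```python
# Counts from the slide: (functional, prepositional)
counts = {
    "false_or_irrelevant": (95, 23),
    "general_weak": (12, 9),
    "specific_strong": (58, 64),
}

total_functional = sum(f for f, _ in counts.values())       # 165
total_prepositional = sum(p for _, p in counts.values())    # 96
total = total_functional + total_prepositional              # 261

relevant_functional = counts["general_weak"][0] + counts["specific_strong"][0]
relevant_prepositional = counts["general_weak"][1] + counts["specific_strong"][1]

precision = (relevant_functional + relevant_prepositional) / total
precision_functional = relevant_functional / total_functional
precision_prepositional = relevant_prepositional / total_prepositional
```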

25
What next
  • Sort and refine part_of relations: hydrogenase
    of alcaligenes, central_region of chromosome
  • and others: phenylalanine to tyrosine ??
  • Pattern Verb-prep-object: guanine changing
    arginine ?

26
Possible applications
  • Ontology adaptation to the medical/protein domain
  • Document selection
  • Information extraction