Terminological knowledge extraction - PowerPoint PPT Presentation

About This Presentation
Title:

Terminological knowledge extraction

Description:

Research into the subject field of linguistic signals (Jennifer Pearson, 1998) ... relation (ISA), the causal relation (CAUSE) or the partitive relation (PART) ...

Slides: 31
Provided by: LBS9

Transcript and Presenter's Notes


1
Terminological knowledge extraction
  • a machine learning approach
  • TKE 2005
  • Lone Bo Sisseck
  • PhD Student
  • Dept. of Computational Linguistics

2
Outline
  • Overall PhD Project
  • Applied method for the machine learning approach
  • Description of data
  • Results of the experiment
  • Conclusion and further research

3
Overall PhD project
  • Research into linguistic signals (Jennifer
    Pearson, 1998) that can indicate semantic
    relations between concepts in domain-specific
    text
  • knowledge probes (Khurshid Ahmad, 1993)
  • knowledge patterns (Ingrid Meyer, 2001)
  • lexico-syntactic patterns (Marti Hearst, 1992)

4
Overall PhD project
  • Example of linguistic signals
  • Chrom er et sporstof: generic-specific relation
    (ISA)
  • chromium is a trace mineral
  • Mangel på kobber medfører anæmi: causal relation
    (CAUSE)
  • lack of copper causes anaemia

5
Overall PhD project
  • Areas of interest
  • Identification of Danish linguistic signals that
    can indicate semantic relations between concepts
    in specialized texts (Lotte Weilgaard, 2002)
  • I.e. the generic-specific relation (ISA), the
    causal relation (CAUSE) or the partitive relation
    (PART)
  • Manifestation of concepts in writing and in the
    mind
  • The complexity of concepts
  • Automatic extraction of linguistic signals

6
Applied method for the machine learning approach
  • Brill tagging (Eric Brill, 1995), chosen because
    it is easily accessible
  • Learning from an already POS-tagged corpus
  • All occurrences of a specific semantic relation
    are marked up with the semantic relation tag in
    the training corpus

7
Applied method for the extraction experiment
  • For instance
  • chrom/N er/V_PRES et/PRON_UBST sporstof/N
  • Chromium/N is/V_PRES a/PRON_UBST trace mineral/N
  • chrom/N er/V_PRES_ISA et/PRON_UBST sporstof/N
  • Chromium/N is/V_PRES_ISA a/PRON_UBST trace
    mineral/N
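
The mark-up shown above can be pictured as a small data transformation. The following is a minimal sketch, assuming the corpus is stored as space-separated word/TAG tokens; the helper names (parse_tagged, mark_relation, format_tagged) are hypothetical and not part of the Brill tagger. In the presentation the relation tags are added only to those occurrences that actually express the relation, so the function below shows the resulting format rather than the annotation decision itself.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition keeps words that themselves contain '/' intact
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def mark_relation(pairs, verb, base_tag, relation):
    """Append the relation name to the tag of each matching verb token."""
    return [
        (w, f"{t}_{relation}") if w.lower() == verb and t == base_tag else (w, t)
        for w, t in pairs
    ]


def format_tagged(pairs):
    """Turn (word, tag) pairs back into the word/TAG notation."""
    return " ".join(f"{w}/{t}" for w, t in pairs)


sentence = parse_tagged("chrom/N er/V_PRES et/PRON_UBST sporstof/N")
marked = mark_relation(sentence, "er", "V_PRES", "ISA")
print(format_tagged(marked))
```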

8
Applied method for the extraction experiment
  • An example of a sequence of rules learned for the
    verb medføre (to cause), indicating the causal
    relation
  • V_PRES V_CAUSE PREV1OR2OR3TAG ADJ
  • V_INF V_CAUSE PREVBIGRAM N V_PRES
  • V_PRES V_CAUSE PREV1OR2TAG PRÆP
  • In the extraction process, the contextual rules
    are applied in succession
  • The rules are applied to a test corpus in order
    to extract linguistic signals that can indicate
    the causal relation
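
Applying the three rules in succession can be sketched as follows. This is an illustration under stated assumptions, not the tagger's actual code: each rule reads "change tag A to tag B if the context template matches", every rule is evaluated against the tags left by the previous rule, and a word may only receive V_CAUSE if it is in a small lexicon of forms seen with that tag (the ALLOWED table is a hypothetical stand-in for that restriction). The two test sentences are the tagged examples from the Experiment 3 slides.

```python
# (from_tag, to_tag, template, template arguments), in learned order
RULES = [
    ("V_PRES", "V_CAUSE", "PREV1OR2OR3TAG", ("ADJ",)),
    ("V_INF", "V_CAUSE", "PREVBIGRAM", ("N", "V_PRES")),
    ("V_PRES", "V_CAUSE", "PREV1OR2TAG", ("PRÆP",)),
]

# Assumption: only these word forms may ever receive the V_CAUSE tag.
ALLOWED = {"V_CAUSE": {"medføre", "medfører"}}


def context_matches(tags, i, template, args):
    """Check one contextual template against the tags preceding position i."""
    if template == "PREV1OR2TAG":
        return args[0] in tags[max(0, i - 2):i]
    if template == "PREV1OR2OR3TAG":
        return args[0] in tags[max(0, i - 3):i]
    if template == "PREVBIGRAM":
        return i >= 2 and tuple(tags[i - 2:i]) == tuple(args)
    raise ValueError(f"unsupported template: {template}")


def apply_rules(tokens, rules=RULES, allowed=ALLOWED):
    """Apply each rule to the whole sentence, one rule after another."""
    words = [w for w, _ in tokens]
    tags = [t for _, t in tokens]
    for old, new, template, args in rules:
        snapshot = list(tags)  # contexts are read from the previous rule's output
        for i, word in enumerate(words):
            if (snapshot[i] == old
                    and word.lower() in allowed.get(new, ())
                    and context_matches(snapshot, i, template, args)):
                tags[i] = new
    return list(zip(words, tags))


s1 = [("Store", "ADJ"), ("doser", "N"), ("kortisol", "N"),
      ("vil", "V_PRES"), ("medføre", "V_INF"), ("bivirkninger", "N")]
s2 = [("Mangel", "N"), ("på", "PRÆP"), ("ADH", "N"), ("medfører", "V_PRES"),
      ("udskillelse", "N"), ("af", "PRÆP"), ("urin", "N")]
for sentence in (s1, s2):
    print(apply_rules(sentence))
```

Without the lexicon restriction, the first rule would also retag vil in the first sentence, because an ADJ occurs within its three preceding tags.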

9
Data
  • The data consists of
  • 3 training corpora/variations of the same corpus
  • Nutrition corpus
  • 1 test corpus
  • Medical corpus

10
Data
  • Training corpus 1: 92 sentences, all containing
    the Danish verb er (is)
  • Part of the nutrition corpus of 20,000 words
  • POS tagged
  • Manually term tagged with unigrams, bigrams and
    trigrams

11
Data
  • Manual term mark up in training corpus 1
  • N (Noun) where the N is domain-specific
  • ADJ (adjective) N (noun) where the two words
    belong together and form a term, e.g. frie
    radikaler (free radicals).
  • N (noun) P (preposition) N (noun) when it is
    possible to convert the phrase into a one-word
    term, e.g. mangel på A-vitamin (lack of vitamin
    A) → A-vitaminmangel (vitamin A deficiency).
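
The three term shapes listed above (N, ADJ N, N P N) can be used to propose candidates for the manual mark-up. The sketch below is an assumption about how such a pre-selection could look; in the presentation the decision is made by hand (is the noun domain-specific, do the words belong together, can the phrase be converted into a compound). The function name term_candidates is hypothetical, and the preposition tag PRÆP is taken from the tagged examples elsewhere in the slides.

```python
# POS patterns for uni-, bi- and trigram terms, longest first
PATTERNS = [("N", "PRÆP", "N"), ("ADJ", "N"), ("N",)]


def term_candidates(tokens):
    """Return every word span whose tag sequence matches one of the patterns."""
    tags = [t for _, t in tokens]
    found = []
    for i in range(len(tokens)):
        for pattern in PATTERNS:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(w for w, _ in tokens[i:i + n]))
    return found


print(term_candidates([("frie", "ADJ"), ("radikaler", "N")]))
print(term_candidates([("mangel", "N"), ("på", "PRÆP"), ("A-vitamin", "N")]))
```

The output is a list for a human annotator to accept or reject, not a finished term list.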

12
Data
  • Training corpus 2: the whole nutrition corpus of
    20,000 words from the OQ project
  • POS tagged
  • Term tagged with unigrams

13
Data
  • Manual term mark up in training corpus 2
  • N (Noun) where the N is domain-specific
  • A word list from the nutrition corpus was
    compared with a list of Danish general language
    words from the Danish module of INTEX (Tine
    Lassen, 2004).
  • Reduced manually
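
The word-list comparison above amounts to a set difference. The sketch below uses a toy general-language list as a placeholder; it does not call INTEX, and the function name domain_candidates is hypothetical. The result corresponds to the list that is then reduced manually.

```python
def domain_candidates(corpus_words, general_words):
    """Keep corpus words absent from the general-language list, in first-seen order."""
    general = {w.lower() for w in general_words}
    seen = set()
    result = []
    for word in corpus_words:
        key = word.lower()
        if key not in general and key not in seen:
            seen.add(key)
            result.append(key)
    return result


corpus = ["Chrom", "er", "et", "sporstof", "og", "kobber", "er", "et", "sporstof"]
general = ["er", "et", "og"]  # toy stand-in for a general-language word list
print(domain_candidates(corpus, general))
```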

14
Data
  • Training corpus 3: nutrition corpus of 20,000
    words, not term tagged
  • Test corpus: medical corpus of 240,000 words
  • Extracted from the Internet, www.netdoktor.dk
  • Not term tagged

15
Results of the experiments
  • Experiment 1
  • Based on training corpus 1
  • Results from the paper
  • Experiment 2
  • Based on training corpus 2 and 3
  • Comparison of results with and without term mark
    up
  • Experiment 3
  • Testing contextual rules on test corpus

16
Experiment 1
  • Experiment no 1
  • Training corpus 1: corpus of 92 sentences, all
    containing the Danish verb er
  • Term tagged with unigrams, bigrams and trigrams

17
Experiment 1
  • Rules learned from the mark-up of er (is) as
    V_PRES_ISA
  • V_PRES V_PRES_ISA NEXT1OR2OR3TAG T
  • V_PRES_ISA V_PRES NEXT1OR2TAG PRÆP
  • V_PRES_ISA V_PRES NEXT1OR2TAG V_PARTC_PAST
  • V_PRES_ISA V_PRES NEXTBIGRAM T ADJ
  • V_PRES_ISA V_PRES NEXT1OR2WD ikke
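
The five rules form an ordered sequence: the first one proposes V_PRES_ISA broadly, and the following four take the tag back in contexts where er does not express a generic-specific relation. A minimal sketch of that behaviour follows, assuming that T is the term tag, that only the verb er is eligible, and that the POS tags in the two sample sentences (which the slides give untagged) are as shown; all of these are assumptions made for illustration.

```python
# (from_tag, to_tag, template, template arguments), in learned order
ISA_RULES = [
    ("V_PRES", "V_PRES_ISA", "NEXT1OR2OR3TAG", ("T",)),
    ("V_PRES_ISA", "V_PRES", "NEXT1OR2TAG", ("PRÆP",)),
    ("V_PRES_ISA", "V_PRES", "NEXT1OR2TAG", ("V_PARTC_PAST",)),
    ("V_PRES_ISA", "V_PRES", "NEXTBIGRAM", ("T", "ADJ")),
    ("V_PRES_ISA", "V_PRES", "NEXT1OR2WD", ("ikke",)),
]


def next_context_matches(words, tags, i, template, args):
    """Check one contextual template against the tokens following position i."""
    if template == "NEXT1OR2TAG":
        return args[0] in tags[i + 1:i + 3]
    if template == "NEXT1OR2OR3TAG":
        return args[0] in tags[i + 1:i + 4]
    if template == "NEXTBIGRAM":
        return tuple(tags[i + 1:i + 3]) == tuple(args)
    if template == "NEXT1OR2WD":
        return args[0] in [w.lower() for w in words[i + 1:i + 3]]
    raise ValueError(f"unsupported template: {template}")


def tag_isa(tokens, rules=ISA_RULES, verb="er"):
    """Run the rule sequence over one sentence and return the final tags."""
    words = [w for w, _ in tokens]
    tags = [t for _, t in tokens]
    for old, new, template, args in rules:
        snapshot = list(tags)  # each rule sees the output of the previous one
        for i, word in enumerate(words):
            if (word.lower() == verb and snapshot[i] == old
                    and next_context_matches(words, snapshot, i, template, args)):
                tags[i] = new
    return tags


isa = [("Chrom", "T"), ("er", "V_PRES"), ("et", "PRON_UBST"), ("sporstof", "T")]
not_isa = [("Hos", "PRÆP"), ("mennesker", "N"), ("er", "V_PRES"),
           ("mangelsymptomer", "T"), ("ikke", "ADV"), ("karakteristiske", "ADJ")]
print(tag_isa(isa))
print(tag_isa(not_isa))
```

In the second sentence the first rule tags er as V_PRES_ISA and the last rule reverts it because ikke is one of the next two words.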

18
Experiment 1
  • Rules learned from the mark-up of er (is) as
    V_PRES_ISA
  • V_PRES V_PRES_ISA NEXT1OR2OR3TAG T
  • Chrom er et sporstof
  • (Chromium is a trace mineral)
  • V_PRES_ISA V_PRES NEXT1OR2TAG PRÆP
  • Vitaminer er kendt for at forebygge
    mangelsygdomme.
  • (Vitamins are known for preventing deficiency
    diseases.)
  • V_PRES_ISA V_PRES NEXT1OR2TAG V_PARTC_PAST
  • I biologien er antioxidanter defineret som enhver
    substans . (Within biology antioxidants have
    been defined as any substance .)

19
Experiment 1
  • V_PRES_ISA V_PRES NEXTBIGRAM T ADJ
  • Uden behandling med protein, fx mælkepulver, er
    sygdommen dødelig.
  • (Without treatment with protein, e.g. milk
    powder, the illness is fatal.)
  • V_PRES_ISA V_PRES NEXT1OR2WD ikke
  • Hos mennesker er mangelsymptomer ikke
    karakteristiske og . (In humans, deficiency
    symptoms are not characteristic.)
  • Manual extraction of non-ISA senses from the test
    corpus (the whole nutrition corpus, 20,000
    words). Precision: 87%

20
Experiment 2
  • Experiment no 2
  • Training corpus 2: nutrition corpus of 20,000
    words
  • Term tagged with unigrams only

21
Experiment 2
  • Rules learned from the mark-up of er (is) as
    V_PRES_ISA
  • V_PRES V_PRES_ISA SURROUNDTAG N_T PRON_UBST
  • Kræft/N i/PRÆP nyrene/N_T er/V_PRES en/PRON_UBST
    alvorlig/ADJ sygdom/N_T
  • Cancer in the kidneys is a serious illness
  • V_PRES V_PRES_ISA NEXT1OR2TAG XX
  • de/PRON_DEMO mest/ADJ brugte/ADJ kosttilskud/N_T
    er/V_PRES kombinerede/V_PAST vitamin-/XX og/SKONJ
    mineraltilskud/N_T
  • the most used dietary supplements are combined
    vitamin and mineral supplements

22
Experiment 2
  • V_PRES V_PRES_ISA NEXTWD betacaroten
  • det/PRON_DEMO mest/ADJ kendte/ADJ caroten/N_T
    er/V_PRES betacaroten/N_T
  • The best-known carotene is beta-carotene

23
Experiment 2
  • Training corpus 3: nutrition corpus of 20,000
    words
  • Not term tagged
  • Rules learned from the mark-up of er (is) as
    V_PRES_ISA
  • V_PRES V_PRES_ISA SURROUNDTAG N PRON_UBST
  • V_PRES V_PRES_ISA NEXT1OR2TAG XX
  • V_PRES V_PRES_ISA NEXTWD betacaroten
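
The only difference between the two rule sets is the tag on the left of the verb: N_T (a noun marked as a term) with term tagging, plain N without. The sketch below reads SURROUNDTAG as "the preceding tag is X and the following tag is Y" and compares both variants on two tag sequences. The first sequence is the kidney-cancer example; the second is a hypothetical sentence in which a non-term noun precedes the verb. The function name surroundtag_hits is an assumption.

```python
def surroundtag_hits(tags, target, prev_tag, next_tag):
    """Indices where `target` sits directly between prev_tag and next_tag."""
    return [
        i for i in range(1, len(tags) - 1)
        if tags[i] == target and tags[i - 1] == prev_tag and tags[i + 1] == next_tag
    ]


# Kræft i nyrene er en alvorlig sygdom (term-tagged, from the slides)
term_sentence = ["N", "PRÆP", "N_T", "V_PRES", "PRON_UBST", "ADJ", "N_T"]
# Hypothetical sentence with a general-language noun before the verb
plain_sentence = ["N", "V_PRES", "PRON_UBST", "N"]

print(surroundtag_hits(term_sentence, "V_PRES", "N_T", "PRON_UBST"))
print(surroundtag_hits(plain_sentence, "V_PRES", "N_T", "PRON_UBST"))
print(surroundtag_hits(plain_sentence, "V_PRES", "N", "PRON_UBST"))
```

The term-tagged rule fires only when a marked term precedes the verb, whereas the untagged rule fires after any noun.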

24
Experiment 2
  • Results of experiment 2
  • Extraction of ISA relations from training corpus
    2
  • Recall: 29%, precision: 95%
  • Extraction of ISA relations from training corpus
    3
  • Recall: 29%, precision: 85%
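
Recall and precision follow the standard definitions: precision is the share of extracted occurrences that are correct, and recall is the share of correct occurrences that were extracted. The sketch below uses invented occurrence identifiers purely to show the arithmetic; it does not reproduce the figures reported above, and the function name precision_recall is an assumption.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: ids of verb occurrences tagged V_PRES_ISA by the rules
# versus ids marked as ISA by the annotator.
extracted_ids = {1, 2, 3, 4}
gold_ids = {1, 2, 3, 5, 6, 7}
p, r = precision_recall(extracted_ids, gold_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```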

25
Experiment 3
  • Experiment no 3
  • Rules for the causal relation (CAUSE) learned
    from the verb medføre/medfører (to cause/causes)
  • V_PRES V_CAUSE PREV1OR2OR3TAG ADJ
  • Store/ADJ mængder/N kortisol/N medfører/V_PRES
    alvorlige/ADJ bivirkninger/N
  • large amounts of cortisol cause serious adverse
    effects
  • V_INF V_CAUSE PREVBIGRAM N V_PRES
  • Store/ADJ doser/N kortisol/N vil/V_PRES
    medføre/V_INF bivirkninger/N
  • large doses of cortisol will cause adverse
    effects

26
Experiment 3
  • V_PRES V_CAUSE PREV1OR2TAG PRÆP
  • Mangel/N på/PRÆP ADH/N medfører/V_PRES
    udskillelse/N af/PRÆP urin/N
  • lack of ADH causes secretion of urine

27
Experiment 3
  • Test corpus: medical corpus of 240,000 words
  • Extracting the ISA relation from the test corpus
  • Relative recall: 35%, precision: 40%
  • Extracting the CAUSE relation from the test corpus
  • Recall: 47%, precision: 70%

28
Experiment 3
  • 5 of the errors are due to errors in the POS
    tagging (ISA)
  • Undertiden/N er/V_PRES_ISA man/PRON_UBST nødt/ADJ
    til/PRÆP at/UNIK prøve/V_INF forskellige
    behandlingsformer/N
  • Sometimes/N you have to try different treatments
  • 3 of the errors are due to errors in the POS
    tagging (CAUSE)
  • Højresidigt/ADJ hjertesvigt/ADJ kan/V_PRES
    medføre/V_PRES hævelse/N af/PRÆP benene/N
  • Right-sided heart failure/ADJ can cause
    swelling of the legs

29
Conclusion and future work
  • The results with term tagging were better than
    without, but term tagging requires adequate term
    extraction tools
  • It is useful to continue the extraction process
    without term recognition
  • The precision is best when extracting the causal
    relation (CAUSE) from the test corpus
  • Future work: improve the system by adding more
    data and by running iterative learning processes
    on the results of the extraction process

30
  • Thank you!