Title: ASWC-08, Bangkok
1 ASWC-08, Bangkok
Extracting Semantic Frames from Thai Medical-Symptom Phrases with Unknown Boundaries
Peerasak Intarapaiboon, Ekawit Nantajeewarawat, Thanaruk Theeramunkong
School of ICT, Sirindhorn International Institute of Technology, Thammasat University, Thailand
2 Background: Thai Medical KB Construction Project
Funded by NECTEC
[Pipeline diagram; labels: Internet, Web Page Collection, Data Collection, Information Extraction, Keyword Extraction, Selected Keywords, Dictionary, KB Construction, Link Construction, KB]
3 Background: Thai Medical KB Construction Project
Thai medical textual information on the Web
Data Collection
Information Extraction
Keyword Extraction
KB Construction
- Disease characteristics
- Causes
- Treatment
- Drug information
- etc.
- Currently: 3,594 keywords and 22,122 information entries
Link Construction
Structured Data
4 Search in Context
In-Database Search
Disease List
Keyword Link
Relation Graph
5 The graph shows
- Occurrences of keywords in disease descriptions
- Description types, e.g., treatment, cause
6 Fine-grained semantic relation
Objective of this paper
7 Objective
Generate
Semantic Representation
Semantic Frame
Text
A framework for
- Extracting semantic information from symptom descriptions in the Thai language
- Representing the extracted information in an ontology-based, machine-processable form
8 Objective
Generate
Semantic Representation
Semantic Frame
Text
Example: [Thai example text lost in extraction]
9 Objective
Pattern-Based Information Extraction Rules
Semantic Representation
Semantic Frame
Text
Example: [Thai example text lost in extraction]
10 Pattern-Based Information Extraction: An Example
Rule
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Tagged text: [Thai] sym [Thai] org [Thai] ptime 6-12 [Thai]
English gloss: experience a [sym pain] in [org chest] which lasts [ptime 6-12 days]
12 Pattern-Based Information Extraction: An Example
Rule
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Tagged text: [Thai] sym [Thai] org [Thai] ptime 6-12 [Thai]
Extracted frame: OBS sym [Thai], LOC org [Thai], PER ptime 6-12 [Thai]
Frame as a graph: Symptom (with Type); LOC: Organ, Type Chest; PER: 6-12 days
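The rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the WHISK implementation: the `Rule` class, the `apply_rule` function, the (text, tag) token representation, and the English tokens (taken from the gloss, with "lasts" standing in for the lost Thai literal) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# A token is (surface text, semantic class tag or None). Hypothetical representation.
Token = tuple[str, Optional[str]]

@dataclass
class Rule:
    # ("class", tag) binds a slot filler; ("word", text) must match literally.
    pattern: list[tuple[str, str]]
    # Slot names of the output template, one per "class" element, in order.
    slots: list[str]

def apply_rule(rule: Rule, tokens: list[Token]) -> Optional[dict[str, str]]:
    """Match elements left to right, skipping any tokens in between
    (a simple stand-in for wildcards). Returns a frame, or None on failure."""
    fillers: list[str] = []
    pos = 0
    for kind, value in rule.pattern:
        while pos < len(tokens):
            text, tag = tokens[pos]
            pos += 1
            if kind == "class" and tag == value:
                fillers.append(text)
                break
            if kind == "word" and text == value:
                break
        else:
            return None  # tokens exhausted before this element matched
    return dict(zip(rule.slots, fillers))

# Hypothetical tokens following the English gloss of the slide's example.
tokens = [("experience", None), ("pain", "sym"), ("in", None),
          ("chest", "org"), ("lasts", None), ("6-12 days", "ptime")]
rule = Rule(
    pattern=[("class", "sym"), ("class", "org"), ("word", "lasts"), ("class", "ptime")],
    slots=["OBS", "LOC", "PER"],
)
frame = apply_rule(rule, tokens)
print(frame)
```

Under these assumptions the three class elements fill OBS, LOC, and PER in order, mirroring the numbered slots of the output template.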
13 Thai IE: Difficulty
A rule-based IE framework
- Training corpus: Free Text → Preprocessing → Shallow-Parsed Text → Rule Extraction → IE Rules
- Test corpus: Free Text → Preprocessing → Shallow-Parsed Text → Extraction (with the IE rules) → Output
- Preprocessing: Word Segmentation; POS-tagging, Word Sense; Shallow Parsing
14 Thai IE: Difficulty
Same framework; supplementary techniques are necessary.
15 Proposed Framework
16 Proposed Framework
17 Proposed Framework
Use the WHISK algorithm.
18 Proposed Framework
[Framework diagram; components numbered 1, 2, 3]
19 Rule Application using Sliding Window (RAW)
Rule (predefined window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
20 [1, 10]-portion
21 [2, 11]-portion
22 [3, 12]-portion
23 Rule Application using Sliding Window (RAW)
Rule (predefined window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 18-30: col [Thai] sym [Thai] org [Thai] ptime 6-12 [Thai]
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
Portion | Extraction frame | Correctness
[21, 30] | OBS sym [Thai], LOC org [Thai], PER ptime 6-12 [Thai] | Correct
[33, 42] | OBS sym [Thai, first sym], LOC org [Thai], PER ptime 3-4 [Thai] | Incorrect
[34, 43] | OBS sym [Thai, second sym], LOC org [Thai], PER ptime 3-4 [Thai] | Correct
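The sliding-window step can be sketched as below. This is only an illustration under stated assumptions: the function names `portions`, `raw`, and `match_sym_org_ptime`, the toy matcher, and the 12-token example text are hypothetical, and the window here is 10 as on the slide. Positions are 1-based and inclusive to match the [start, end]-portion notation.

```python
from typing import Callable, Iterator, Optional

Token = tuple[str, Optional[str]]  # (surface text, class tag or None)
Frame = dict[str, str]

def portions(n_tokens: int, window: int) -> Iterator[tuple[int, int]]:
    """Yield 1-based inclusive windows: (1, w), (2, w + 1), ..."""
    for start in range(1, n_tokens - window + 2):
        yield start, start + window - 1

def match_sym_org_ptime(portion: list[Token]) -> Optional[Frame]:
    """Toy matcher: first sym, then the next org, then the next ptime."""
    wanted = [("sym", "OBS"), ("org", "LOC"), ("ptime", "PER")]
    frame: Frame = {}
    i = 0
    for text, tag in portion:
        if i < len(wanted) and tag == wanted[i][0]:
            frame[wanted[i][1]] = text
            i += 1
    return frame if i == len(wanted) else None

def raw(tokens: list[Token], window: int,
        match: Callable[[list[Token]], Optional[Frame]]):
    """Apply the matcher to every portion; keep the portions where it fires."""
    results = []
    for start, end in portions(len(tokens), window):
        frame = match(tokens[start - 1:end])
        if frame is not None:
            results.append(((start, end), frame))
    return results

# Hypothetical text: two symptom phrases share one organ and one period.
tokens: list[Token] = [(f"w{i}", None) for i in range(1, 13)]
tokens[1] = ("fever", "sym")      # position 2, belongs to an earlier phrase
tokens[3] = ("pain", "sym")       # position 4
tokens[5] = ("chest", "org")      # position 6
tokens[10] = ("3-4 days", "ptime")  # position 11
for portion, frame in raw(tokens, 10, match_sym_org_ptime):
    print(portion, frame)
```

In this toy text the [2, 11]-portion pairs the earlier symptom with the later organ and period, the same kind of cross-boundary extraction the slide marks as incorrect, while the [3, 12]-portion yields the intended frame.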
24 Proposed Framework
25 Wildcard Instantiation
Rule (window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
[33, 42]-portion: a wildcard is instantiated across a phrase boundary → incorrect extraction
26 Classifier Learning
Pattern: (sym) (org) [Thai word] (ptime)
Training corpus
27 Classifier Learning
The 1st internal wildcard instantiation
Candidate features (classes, spaces, words): [Thai], [Thai], [Thai], sym, org, [Thai]
28 Classifier Learning
Feature selection: information gain
Selected features: spaces, words, sym, [Thai word]
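Information gain for a single candidate feature can be computed as in the sketch below. The function names and the four-example data are hypothetical; the slides do not say how features were discretized or what threshold was used for selection, so none is assumed here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after
    splitting the examples on the feature's value."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group)
                    for group in by_value.values())
    return entropy(labels) - remainder

# Hypothetical data: 1 = correct instantiation, -1 = incorrect.
labels = [1, 1, -1, -1]
separating = [0, 0, 1, 1]     # e.g., count of sym tags inside the wildcard
uninformative = [0, 1, 0, 1]  # e.g., a word that appears in both classes
print(information_gain(separating, labels))
print(information_gain(uninformative, labels))
```

A feature that splits the labels perfectly gets the full one bit of gain, while one that is spread evenly over both classes gets none, which is the basis for ranking candidate features.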
29 Classifier Learning
Pattern: (sym) (org) [Thai word] (ptime)
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
[33, 42]-portion, 1st internal wildcard
Instantiation: [Thai] sym [Thai]
Features (spaces, words, sym, [Thai word]): (0, 3, 1, 1)
Wildcard-instantiation feature vector: (0, 3, 1, 1)
30 Classifier Learning
[33, 42]-portion, 2nd internal wildcard
Instantiation: [Thai]
Features (spaces, words, [Thai word]): (0, 1, 0)
Wildcard-instantiation feature vector: (0, 3, 1, 1, 0, 1, 0)
31 Classifier Learning
[33, 42]-portion, 3rd internal wildcard
Features (spaces, words, space): (0, 0, 1)
Wildcard-instantiation feature vector: (0, 3, 1, 1, 0, 1, 0, 0, 0, 1)
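The per-wildcard counting and concatenation can be sketched as follows. The feature kinds, the function names, and the English stand-in tokens are assumptions; the example reproduces only the first two segments of the slide's vector, since the third wildcard's feature names are not recoverable.

```python
from typing import Optional

Token = tuple[str, Optional[str]]  # (surface text, class tag or None)
FeatureSpec = list[tuple[str, str]]  # (kind, name); name unused for counts

def wildcard_features(instantiation: list[Token], spec: FeatureSpec) -> list[int]:
    """Count each selected feature within one wildcard instantiation."""
    values = []
    for kind, name in spec:
        if kind == "spaces":
            values.append(sum(1 for text, _ in instantiation if text == " "))
        elif kind == "words":
            values.append(sum(1 for text, _ in instantiation if text != " "))
        elif kind == "tag":
            values.append(sum(1 for _, tag in instantiation if tag == name))
        elif kind == "word":
            values.append(sum(1 for text, _ in instantiation if text == name))
        else:
            raise ValueError(f"unknown feature kind: {kind}")
    return values

def feature_vector(instantiations: list[list[Token]],
                   specs: list[FeatureSpec]) -> list[int]:
    """Concatenate the feature values of all internal wildcards, in order."""
    vector: list[int] = []
    for instantiation, spec in zip(instantiations, specs):
        vector.extend(wildcard_features(instantiation, spec))
    return vector

# Hypothetical stand-ins for the lost Thai words.
specs = [
    [("spaces", ""), ("words", ""), ("tag", "sym"), ("word", "and")],
    [("spaces", ""), ("words", ""), ("word", "for")],
]
instantiations = [
    [("and", None), ("fever", "sym"), ("around", None)],  # 1st wildcard
    [("during", None)],                                   # 2nd wildcard
]
print(feature_vector(instantiations, specs))
```

With these stand-ins the first wildcard yields (0, 3, 1, 1) and the second (0, 1, 0), so the concatenation matches the seven-element vector shown on slide 30.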
32 Proposed Framework
33 Overlapping Frame Filtering (OFF)
Annotated text: SYM S1, LOC L1; SYM S2, LOC L2; SYM S3, LOC L3; ...; SYM Sn, LOC Ln
Processing steps: RAW, WIF, OFF
Overlapping frames: (SYM S2, LOC L2) and (SYM S3, LOC L2)
Some slot fillers are from the same text position.
Use the classifier score.
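One way to use the classifier score on overlapping frames is a greedy filter, sketched below. The greedy strategy, the function name `filter_overlapping`, and the scores and positions are assumptions for illustration; the slide only states that the classifier score is used.

```python
def filter_overlapping(frames):
    """frames: list of (score, {slot: token position}).
    Two frames overlap when they share a slot-filler position.
    Greedy assumption: visit frames by descending score and keep a frame
    only if none of its positions was already claimed."""
    kept = []
    used_positions = set()
    for score, slots in sorted(frames, key=lambda f: f[0], reverse=True):
        positions = set(slots.values())
        if positions & used_positions:
            continue  # overlaps a higher-scoring frame
        kept.append((score, slots))
        used_positions |= positions
    return kept

# Hypothetical frames: the first two share the LOC filler at position 7.
frames = [
    (-0.4, {"SYM": 3, "LOC": 7}),
    (0.9, {"SYM": 5, "LOC": 7}),
    (0.6, {"SYM": 12, "LOC": 14}),
]
print(filter_overlapping(frames))
```

Here the frame with score -0.4 is dropped because it shares position 7 with the higher-scoring frame, while the non-overlapping frame at positions 12 and 14 is kept.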
34 Experiments: Output Templates
Type-MD1 template: abnormal characteristics of some observable entities
Symptom
- OBS: Observed Entity
- ATTR: Attribute
- PER: Period of time
(entities carry Type labels)
35 Experiments: Output Templates
Type-MD1 template, example
Symptom
- OBS: Observed Entity = Secretion, Type NasalMucus
- ATTR: Attribute = Color, Type Green
- PER: Period of time = 6-10 days
36 Experiments: Output Templates
Type-MD2 template: human-body locations at which primitive symptoms appear
Symptom
- Type: Primitive symptom
- LOC: Organ (human body)
- PER: Period of time
37 Experiments: Output Templates
Type-MD2 template, example
Symptom
- Type: Primitive symptom
- LOC: Organ, Type Rib
- PER: Period of time = 6-10 days
38 Data Characteristics
Data sets grouped by disease groups
- D1 (Training set)
  - The circulatory system
  - The urology system
  - The reproductive system
  - The eye system
  - The ear system
- D2 (Test set)
  - The skin/dermal system
  - The skeletal system
  - The endocrine system
  - The nervous system
  - Parasitic system
  - Venereal system
- D3 (Test set)
  - The respiratory system
  - The gastrointestinal tract system
  - Infectious diseases
  - Accidental diseases
41 Example of Rules
42 Experimental Results
Use SVM as a classifier
- RAW: high recall, low precision
- WIF: preserves recall, improves precision
- OFF: preserves recall, improves precision
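For reference, precision and recall over extracted frames can be computed as in this small sketch. The frame identifiers and counts are made up for illustration and are not the reported results.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical frame identifiers.
gold = {"f1", "f2"}
after_raw = {"f1", "f2", "x1", "x2"}   # every gold frame plus spurious ones
after_filter = {"f1", "f2", "x1"}      # one spurious frame removed
print(precision_recall(after_raw, gold))
print(precision_recall(after_filter, gold))
```

Removing a spurious frame without touching the gold ones raises precision and leaves recall unchanged, which is the pattern the slide describes for the filtering steps.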
43 Classifier Comparisons: SVM, kNN, NB, DT
44 Recall Improvement by Rule Generalization
- To improve the performance of rules by rule generalization (RG)
Overfitting rules
- Pattern: (org)(ch); output template: OBS 1, ATTR 2
- Pattern: (org)(col); output template: OBS 1, ATTR 2
Generalize to
- Pattern: (org)(gq); output template: OBS 1, ATTR 2
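Rule generalization of this kind can be sketched as replacing a class in the pattern by a more general class and merging the rules that become identical. The `PARENT` mapping (ch and col both generalizing to gq) is inferred from the slide's example, and the function names are hypothetical.

```python
# Assumed class hierarchy: both specific classes generalize to "gq".
PARENT = {"ch": "gq", "col": "gq"}

def generalize(rule):
    """rule: (pattern classes, output slots). Replace each class that has
    a parent in the hierarchy by that parent; keep the template as is."""
    pattern, template = rule
    return tuple(PARENT.get(cls, cls) for cls in pattern), template

def generalize_rules(rules):
    """Generalize every rule and drop duplicates, preserving order."""
    seen = set()
    result = []
    for rule in rules:
        general = generalize(rule)
        if general not in seen:
            seen.add(general)
            result.append(general)
    return result

rules = [
    (("org", "ch"), ("OBS", "ATTR")),
    (("org", "col"), ("OBS", "ATTR")),
]
print(generalize_rules(rules))
```

The two specific rules collapse into a single rule with pattern (org)(gq), which can then match attribute classes that neither original rule covered.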
45 Recall Improvement by Doubling Window Size and Rule Generalization
2W: window size doubling; RG: rule generalization
- RAW: high recall, low precision
- WIF: preserves recall, improves precision
- OFF: preserves recall, improves precision
46 Classifier Comparisons: SVM, kNN, NB, DT
47 Compared with Known-Boundary Test
Known-boundary extraction
- Manually locate target phrases in the test corpus
- Apply the IE rules to the located target phrases
49 Compared with Known-Boundary Test
Known-boundary extraction vs. our framework: insignificant differences
50 Experimental Results: Other Domains
- The proposed framework is applied to other domains, i.e.,
  - Soccer match reports (SR),
  - Soccer player transfers (ST),
  - Stock market (SM), and
  - Dividend yield (DY)
51 Domain Characteristics
Long target-phrases
A few target-phrases
52 Experimental Results: Other Domains
- RAW: high recall, low precision
- WIF: preserves recall, improves precision
- OFF: preserves recall, improves precision
53 Classifier Comparisons: SVM, kNN, NB, DT
54 Compared with Known-Boundary Test
Known-boundary extraction vs. our framework: insignificant differences
55 Conclusions
- In this work
  - We apply IE rules using the sliding window technique.
  - We use WIF and OFF to classify extracted frames.
  - We apply the framework to medical-symptom descriptions, soccer match reports, soccer player transfers, stock market, and dividend yield.
- Further work
  - How the semantic representations of symptom descriptions facilitate automated reasoning, e.g., medical diagnosis reasoning.
56 Thank You
57 Conclusion
59 RAW: An Example
Rule (window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 18-30: col [Thai] sym [Thai] org [Thai] ptime 6-12 [Thai]
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
Incorrect extractions are likely to be produced.
60 Wildcard Instantiation
Rule (window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Internal wildcards
66 Components of IE Systems
Word segmentation
POS-tagging, Word Sense
Full parsing, Shallow parsing
67 Wildcard Instantiation
Rule (window size: 10)
Pattern: (sym) (org) [Thai word] (ptime)
Output template: OBS 1, LOC 2, PER 3
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
[34, 43]-portion
- An internal wildcard is instantiated across a boundary → unrelated slots are extracted
- An external wildcard is instantiated across a boundary → unrelated slots are extracted
68 Classifier Learning
Pattern: (sym) (org) [Thai word] (ptime)
Text, positions 32-44: [Thai] sym [Thai] sym [Thai] org [Thai] ptime 3-4 [Thai]
Feature-1 (spaces, words, sym, [Thai word])
Portion | Instantiation | Feature-1
[33, 42] | [Thai] sym [Thai] | (0, 3, 1, 1)
[34, 43] | [Thai] | (0, 1, 0, 1)
72 Classifier Learning
Portion | Feature-1 | Feature-2 | Feature-3 | Feature vector | Label
[33, 42] | (0, 3, 1, 1) | (1, 1, 0, 0, 1) | (1, 2, 1) | (0, 3, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1) | -1
[34, 43] | (0, 1, 0, 1) | (2, 1, 0, 1, 0) | (0, 1, 1) | (0, 1, 0, 1, 2, 1, 0, 1, 0, 0, 1, 1) | 1
73 Experimental Results: Recall Improvement
- To improve the performance of rules by rule generalization (RG)
- To improve the performance of rules by doubling window size (2W)
75 Components of IE Systems
Word segmentation
POS-tagging, Word Sense
Supplementary techniques are necessary
Full parsing, Shallow parsing