Title: Mining Domain Specific Words from Hierarchical Web Documents
1 Mining Domain Specific Words from Hierarchical Web Documents
- Jing-Shin Chang
- Department of Computer Science and Information Engineering
- National Chi Nan University
- 1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
- jshin_at_csie.ncnu.edu.tw
- CJNLP-04, 2004/11/10-15, City U., H.K.
2 Personal Information, MT/NLP Labs in NCNU
- Jing-Shin Chang
- Assistant Professor (2000), CSIE, NCNU
- was with NTHU NLP Lab and BDC (Behavior Design Corp.) (1986-2000)
- PhD, EE, NTHU (Natl. Tsing-Hua Univ., Hsinchu, TW), 1997
- Research Fields and Interests
- Machine Translation (since 1986)
- Statistical models of all MT subtasks
- Alignment (parallel/non-parallel)
- Parsing: LPCFG, Generalized Probabilistic Semantic Model
- New Statistical MT Models
- Chinese Language Processing
- Word Segmentation, Lexicon Acquisition, Chinese Abbreviation, ...
3 TOC
- Motivation
- What are DSWs?
- Why DSW Mining? (Applications)
- WSD with DSWs, without a sense-tagged corpus
- Constructing a Hierarchical Lexicon Tree w/o Clustering
- Other applications
- How to Mine DSWs from Hierarchical Web Documents
- Preliminary Results
- Error Sources
- Remarks
4 Motivation
- Is there a quick and easy (engineering) way to construct a large-scale WordNet or the like, now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)?
- This triggers a new view for constructing a lexicon tree with hierarchical semantic links
- DSW identification turns out to be a key to such construction
- and it can be used in various applications, including DSW-based WSD without using sense-tagged corpora
5 What Are Domain-Specific Words (DSWs)?
- Words that appear frequently in some particular domains
- (a) Multiple-sense words that are frequently used with special meanings or usages in particular domains
- E.g., piston: 活塞 (mechanics) or 活塞隊 (sports)
- (b) Single-sense words that are used frequently in particular domains
- Their presence suggests that some words in the current document might be related to this particular sense
- They serve as anchor words/tags in the context for disambiguating other multiple-sense words
6 What to Do in DSW Mining
- DSW Mining Task
- Find lists of words that occur frequently in the same domain, and associate with each list (and the words within it) a domain (implicit sense) tag
- E.g., entertainment: singer, pop songs, rock & roll, Chang Hui-Mei (Ah-Mei), album, ...
- As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSWs
- When applied to mining DSWs associated with each node of a hierarchical directory/document tree
- Each node is annotated with a domain tag
7 Why DSW Mining from Document Trees
- The semantic hierarchy among words, and the membership of words in that hierarchy, are important resources for WSD
- A hierarchical lexicon tree or a wordnet-lookalike is invaluable for such a purpose
- Multiple-sense words tend to be used differently in different domains, but are highly related to the document domain in which they appear
- Identifying the particular domain in use suggests the particular sense for the words
- Words of the same domain impose sense constraints over the possible senses
8 Why DSW Mining from Document Trees
- Important for word sense disambiguation (WSD)
- Multi-sense words appearing in the same document tend to be tagged with the same word sense if they belong to the same common domain in the semantic hierarchy [Yarowsky 95]
- Pretty much true for different words in the same document as well
9 Why DSW Mining from Document Trees
- WSD by using DSWs (anchor words)
- The domain is easier to identify than the sense
- More sources are available that are directly tagged with a domain label rather than with a linguistically motivated sense tag set
- Disambiguation by referring to, or based on, the senses of semantically unambiguous words
- Manual construction of a lexicon hierarchy, and finding the members of each node, is costly and labor-intensive
- New words and new usages are produced every few days, so (semi-)automatic updating is preferred
- E.g., "Piston" is now better known as a basketball team than as a mechanical part
10 DSW Applications (1)
- Technical term extraction
- W(d) = { w | w ∈ DSW(d) }
- d ∈ { computer, traveling, food, ... }
11 DSW Applications (2)
- Generic WSD based on DSWs
- s* = argmax_s Σ_d P(s|d,W)·P(d|W) = argmax_s Σ_d P(s|d,W)·P(W|d)·P(d) (see the sketch below)
- Useful when a large-scale sense-tagged corpus is not available, which is often the case
- Machine translation
- Helps select translation lexicon candidates
- E.g., money bank (when used with payment, loan, etc.), river bank, memory bank (in PC, Intel, MS Windows domains)
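To make the decomposition concrete, here is a minimal Python sketch of DSW-based sense scoring, under the simplifying assumption that each domain votes for exactly one sense (so P(s|d,W) reduces to an indicator); the probability tables, the domain-to-sense mapping, and the smoothing constant are hypothetical toy values, not the paper's trained model.

```python
# Toy sketch of s* = argmax_s sum_d P(s|d,W) P(W|d) P(d).
import math

p_w_given_d = {                       # P(w|d): hypothetical unigram models
    "mechanics":  {"piston": 0.02, "engine": 0.03, "team": 0.0001},
    "basketball": {"piston": 0.01, "engine": 0.0001, "team": 0.04},
}
p_d = {"mechanics": 0.5, "basketball": 0.5}      # P(d): domain priors
sense_of = {"mechanics": "engine-part", "basketball": "NBA-team"}

def disambiguate(words):
    scores = {}
    for d, sense in sense_of.items():
        # log P(W|d) + log P(d), with crude smoothing for unseen words
        logp = math.log(p_d[d]) + sum(
            math.log(p_w_given_d[d].get(w, 1e-6)) for w in words)
        scores[sense] = scores.get(sense, 0.0) + math.exp(logp)
    return max(scores, key=scores.get)

print(disambiguate(["piston", "team"]))   # -> 'NBA-team'
```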
12 DSW Applications (2)
(Annotated formula diagram: conventional WSD needs sense-tagged corpora for training, which are not widely available; generic WSD based on DSWs instead uses implicitly domain-tagged corpora, which are widely available on the web; the sum is taken over the domains where w0 is a DSW.)
13 DSW Applications (3)
- Document classification
- N-class classification based on DSWs
- Anti-spamming (Two-class classification)
- Words in spam (uninteresting) mails vs. normal (interesting) mails help block spam
- Interesting domains vs. uninteresting domains
- Compare P(W|S)·P(S) vs. P(W|¬S)·P(¬S)
14 DSW Applications (3.a)
- Document classification based on DSWs
- d: document class label
- w_1..n: bag of words in the document
- |D| = n > 2: number of document classes
- Anti-spamming based on DSWs
- |D| = n = 2 (two-class classification; see the sketch below)
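A minimal sketch of the two-class case, assuming naive-Bayes unigram models; the word probabilities, priors, and smoothing value below are invented for illustration.

```python
# Compare P(W|S)P(S) with P(W|not-S)P(not-S) in log space.
import math

models = {
    "spam":   {"prior": 0.4, "p_w": {"prize": 0.03, "free": 0.03}},
    "normal": {"prior": 0.6, "p_w": {"meeting": 0.02, "report": 0.03}},
}

def classify(words):
    def logscore(m):
        return math.log(m["prior"]) + sum(
            math.log(m["p_w"].get(w, 1e-6)) for w in words)  # smoothed
    return max(models, key=lambda label: logscore(models[label]))

print(classify(["free", "prize"]))      # -> 'spam'
print(classify(["meeting", "report"]))  # -> 'normal'
```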
15 DSW Applications (4)
- Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents
- Membership: semantic links among words of the same domain, which are close (context), similar (synonym, thesaurus), or negated concepts (antonym)
- Hierarchy: the hierarchy of the lexicon suggests some ontological relationships
16 Conventional Methods for Constructing Lexicon Trees
- Construction by Clustering
- Collect words in a large corpus
- Evaluate word association as a distance (or closeness) measure for all word pairs
- Use clustering criteria to build the lexicon hierarchy
- Adjust the hierarchy and assign semantic/sense tags to the nodes of the lexicon tree
- Thus assigning sense tags to the members of each node
17 Clustering Methods for Constructing Lexicon Trees
18 Clustering Methods for Constructing Lexicon Trees
- Disadvantages
- Do not take advantage of the hierarchical information of the document tree (flattened when collecting words)
- Word-association and clustering criteria are not directly related to human perception
- Most clustering algorithms conduct binary merging (or division) in each step, for simplicity
- The automatically generated semantic hierarchy may not reflect human perception
- Hierarchy boundaries are not clearly detected automatically
- Adjustment of the hierarchy may not be easy (since human perception is not used to guide the clustering)
- Pairwise association evaluation is costly
19 Hierarchical Information Loss when Collecting Words
20 Clustering Methods for Constructing Lexicon Trees
(Diagram of a clustering-built hierarchy, annotated with the questions: Does it reflect human perception? Why binary merging? Where is the hierarchy?)
21 Alternative View for Constructing Lexicon Trees
- Construction by Retaining DSWs
- Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters
- Associate each node with DSWs as members, and tag each DSW with the directory/domain name
- Optionally adjust the tree hierarchy and the members of each node
22 Constructing Lexicon Trees by Preserving DSWs
(Diagram legend: O = DSW, X = non-DSW)
23 Constructing Lexicon Trees by Preserving DSWs
(Diagram legend: O = DSW, X = non-DSW)
24 Constructing Lexicon Trees by Preserving DSWs
- Advantages
- The hierarchy reflects human perception
- Adjustment could be easier, if necessary
- Directory names are highly correlated with sense tags
- A domain-based model can be used when sense-tagged corpora are not available
- Pairwise word-association evaluation is replaced by computing domain specificity against the domains
- O(W×W) vs. O(W×D)
- Requirements
- A well-organized web site
- Mining DSWs from such a site
25 Constructing Lexicon Trees by Preserving DSWs
(Diagram: a parent node X with a child node Y linked by is_a/hypernym relations (Y is_a X; B is_a X, or A1); membership (closeness, similarity) relationships hold among the DSWs within a node; synonym and antonym links connect individual members.)
26 Alternative View for Constructing Lexicon Trees
- Benefits
- No similarity computation: closeness (incl. similarity) is already implicitly encoded by human judges
- No binary clustering: clustering is already done (implicitly) with human judgment
- Hierarchical links available: some well-developed relationships are already in place
- Although not perfect
27 Why Mining From Web Pages
- Rich tagging information, including
- Explicit tags: HTML/XML tags, added by webmasters
- Implicit tags: directory names
- Domain-specific words associated with the same node of the document hierarchy have the same (unknown) tags most of the time
- The unknown tag is highly correlated with the directory name assigned by the webmasters
- Explicit links: the hierarchy among the implicit tags
- A free and huge amount of training material
- Explicitly/implicitly tagged (semantically) by webmasters all over the world
- Cleaner than BBS/newsgroup articles
- Dialogue-like noise is rare
28 Proposed Method for Mining
- Web Hierarchy as a Large Document Tree
- Each document was generated by applying DSWs to some generic document template
- Remove non-specific words from the documents, leaving a lexicon tree with DSWs associated with each node
- Leaving only domain-specific words
- Forming a lexicon tree from a document tree
- Label the domain-specific words
- Characteristics
- Get associated words by measuring domain specificity against a known and common domain, instead of measuring pairwise association plus clustering
29 Mining Criteria: Cross-Domain Entropy
- Domain-independent terms tend to be distributed evenly across all domains.
- Distributional evenness can be measured with the Cross-Domain Entropy (CDE), defined as
- CDE_i = -Σ_j P_ij log P_ij, with P_ij = f_ij / Σ_k f_ik
- P_ij: probability of word i in domain j
- f_ij: normalized frequency of word i in domain j
30 Mining Criteria: Cross-Domain Entropy
- Example
- W_i = piston, with frequencies (normalized to [0,1]) in various domains
- f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
- → Domain-specific (unevenly distributed) in the 2nd and 4th domains (see the sketch below)
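The example can be reproduced with a short helper that renormalizes a word's f_ij row into P_ij and computes the entropy; using base-2 logarithms (bits) is an assumption here, since any base works for thresholding.

```python
import math

def cde(freqs):
    """Cross-domain entropy of one word's normalized frequencies f_ij."""
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]  # P_ij = f_ij / sum_k f_ik
    return -sum(p * math.log2(p) for p in probs)

f_piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]  # uneven -> low CDE
f_common = [0.21, 0.20, 0.19, 0.20, 0.20]       # even -> CDE near log2(5)

print(round(cde(f_piston), 2))   # ~1.01: mass concentrated in 2 domains
print(round(cde(f_common), 2))   # ~2.32: spread over all 5 domains
```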
31 Mining Algorithm: Step 1
- Step 1 (Data Collection): Acquire a large collection of web documents using a web spider, while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages. (A sketch of the storage and stripping parts follows below.)
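A hedged sketch of the storage and tag-stripping parts of Step 1 (the crawl loop itself is omitted); the local corpus layout, the regex-based stripper, and the example URL are simplifying assumptions, not the original spider.

```python
import os
import re
from urllib.parse import urlparse

def strip_tags(html):
    # Crude markup removal; a real pipeline would use an HTML parser.
    return re.sub(r"<[^>]+>", " ", html)

def save_page(url, html, root="corpus"):
    # Map the URL path onto a local directory tree so that the site's
    # hierarchy is preserved for the per-directory statistics of Step 3.
    path = urlparse(url).path.lstrip("/") or "index.html"
    local = os.path.join(root, path)
    os.makedirs(os.path.dirname(local) or root, exist_ok=True)
    with open(local + ".txt", "w", encoding="utf-8") as out:
        out.write(strip_tags(html))

save_page("http://news.example.com/sports/baseball/game1.html",
          "<html><body>A <b>baseball</b> story.</body></html>")
```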
32 Mining Algorithm: Step 2
- Step 2 (Word Segmentation or Chunking): Identify word (or compound-word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest. (A toy segmenter is sketched below.)
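For illustration only, here is a forward maximum-matching segmenter; it is a far simpler stand-in for the cited segmentation methods (Chiang 92; Lin 93), and the tiny lexicon is invented.

```python
def max_match(text, lexicon, max_len=4):
    # Greedy forward maximum matching; unknown characters fall back
    # to single-character words, so the loop always makes progress.
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
    return words

print(max_match("活塞隊獲勝", {"活塞隊", "獲勝"}))  # -> ['活塞隊', '獲勝']
```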
33 Mining Algorithm: Step 3
- Step 3 (Acquiring Normalized Term Frequencies for All Words in Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all the documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size, N_j = Σ_i n_ij, of that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list for all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j. (See the sketch below.)
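Step 3 transcribes directly into Python; the assumption here is that the documents arrive as token lists grouped by directory.

```python
from collections import Counter, defaultdict

def normalized_frequencies(docs_by_domain):
    # docs_by_domain: {d_j: [token lists]}  ->  f[w_i][d_j] = f_ij
    f = defaultdict(dict)
    for d, docs in docs_by_domain.items():
        counts = Counter(w for doc in docs for w in doc)  # n_ij
        N = sum(counts.values())                          # N_j = sum_i n_ij
        for w, n in counts.items():
            f[w][d] = n / N                               # f_ij = n_ij / N_j
    return f

f = normalized_frequencies({
    "baseball": [["pitcher", "strike", "strike"]],
    "finance":  [["stock", "strike"]],
})
print(f["strike"])   # {'baseball': 0.666..., 'finance': 0.5}
```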
34 Mining Algorithm: Step 3 (cont.)
35 Mining Algorithm: Step 4
- Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms that are distributed evenly across all domains, that is, terms with a large Cross-Domain Entropy (CDE), CDE_i = -Σ_j P_ij log P_ij, as defined earlier.
- Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2). (A pruning sketch follows below.)
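Combining the cde() helper from slide 30 with the f_ij table of Step 3 gives a short pruning sketch; the CDE threshold and the top-k value are illustrative assumptions, not tuned parameters.

```python
import math

def cde(freqs):   # as in the slide-30 sketch
    total = sum(freqs)
    return -sum(f/total * math.log2(f/total) for f in freqs if f > 0)

def prune_domain_independent(f, threshold=1.5, k=2):
    # Keep each low-CDE term in its top-k domains; drop high-CDE terms.
    lexicon_tree = {}
    for w, by_domain in f.items():
        if cde(list(by_domain.values())) > threshold:
            continue                  # evenly spread: domain-independent
        top = sorted(by_domain, key=by_domain.get, reverse=True)[:k]
        for d in top:
            lexicon_tree.setdefault(d, []).append(w)
    return lexicon_tree               # domain -> list of its DSWs
```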
36 Mining Algorithm: Optional Step
- Adjusting the hierarchy and members of each node
(optional, or manual)
37 Mining Algorithm: Step 5
- Step 5 (Clustering Pairs of Terms with High Association within Each Domain): For each domain, we can further partition the terms into clusters of high association, to get finer-grained domain-specific association lists. The clustering method can use association metrics such as mutual information as the distance measure between words, where the mutual information metric is defined as
- MI(x, y; d_j) = log [ P(x, y | d_j) / ( P(x | d_j) · P(y | d_j) ) ]
- This definition is identical to that of [Church 89], but the probabilities are taken over a particular domain (d_j). (A counting sketch follows below.)
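A rough sketch of the domain-conditioned MI, estimating the probabilities from window co-occurrence counts within one domain's documents; the window size and the counting scheme are assumptions.

```python
import math

def domain_mi(docs, x, y, window=5):
    # docs: token lists belonging to one domain d_j.
    n = n_x = n_y = n_xy = 0
    for doc in docs:
        for i, w in enumerate(doc):
            n += 1
            if w == x:
                n_x += 1
                if y in doc[max(0, i - window): i + window + 1]:
                    n_xy += 1         # y co-occurs within the window
            elif w == y:
                n_y += 1
    if 0 in (n, n_x, n_y, n_xy):
        return float("-inf")          # unseen pair: no association
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))
```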
38 Mining Algorithm: Step 6
- Step 6 (Splitting a Domain into Subdomains): After clustering, a few subdirectories are created, one for each cluster, as the children of the current domain. The names of the subdirectories are the terms with the highest relative frequencies of occurrence.
39 Mining Algorithm: Step 7
- Step 7 (Moving Common Terms with Higher CDE Upward): Terms that are common to siblings in the directory tree and have a higher cross-domain entropy (measured over the siblings) are moved up one level in the directory tree (starting from the leaf nodes and proceeding toward the root). These terms are moved up to a more general domain because they are sufficiently general with respect to the sibling domains (though not to the degree of being domain-independent). (See the sketch below.)
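Step 7 can be sketched by re-measuring CDE over the sibling domains only, reusing the f table from Step 3; the threshold is again an assumed parameter.

```python
import math

def cde(freqs):   # as in the slide-30 sketch
    total = sum(freqs)
    return -sum(f/total * math.log2(f/total) for f in freqs if f > 0)

def move_common_terms_up(tree, parent, siblings, f, threshold=1.2):
    # tree: {node: [words]}; f: f[w][d] from Step 3.
    candidates = {w for s in siblings for w in tree.get(s, [])}
    for w in candidates:
        sib_freqs = [f[w][s] for s in siblings if f[w].get(s, 0) > 0]
        if len(sib_freqs) > 1 and cde(sib_freqs) > threshold:
            for s in siblings:                 # remove from every sibling...
                if w in tree.get(s, []):
                    tree[s].remove(w)
            tree.setdefault(parent, []).append(w)  # ...attach to the parent
```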
40 Mining Algorithm: Step 8
- Step 8 (Repeat Steps 5-7): The clustering (i.e., splitting) and moving (i.e., merging) operations can be repeated until a stopping criterion is satisfied or a set number of iterations is reached.
41 Mining Algorithm: Step 9
- Step 9 (Output): The directory tree now represents a hierarchical classification of the terms for the different domains, and thus carries the lists of associated words for ambiguous words in different contexts.
42 Experiments
- Domains
- News articles from a local news site
- 138 distinct domains
- including the leaf nodes of the directory tree and their parents
- leaves with the same name are considered the same domain
- Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment, finance, food, ...
- Size: 200 MB (HTML files)
- 16K unique words after word segmentation
43 Domains (Hierarchy not shown)
44 Sample Output (4 Selected Domains)
Table 1. Sampled domain-specific words with low entropies.
45 Sample Output (4 Selected Domains, cont.)
Table 1 (cont.). Sampled domain-specific words with low entropies.
46 Preliminary Results
- Domain-specific words and the assigned domain tags are well associated (e.g., ?? is used specifically in the baseball domain)
- Extraction with the cross-domain entropy (CDE) metric is well founded
- Domain-independent (or irrelevant) words (such as those from webmasters' advertisements) are correctly rejected as DSW candidates, due to their high cross-domain entropy
- DSWs are mostly nouns and verbs (open-class words)
47 Preliminary Results
- Low cross-domain-entropy words (DSWs) in their respective domains are generally highly correlated (e.g., ????, ??)
- New usages of words, such as 活塞 (Pistons) in the basketball sense, can also be identified
- Both properties make DSWs good contextual evidence for WSD tasks
48 Error Sources
- A single CDE metric may not be sufficient to capture all characteristics of domain specificity
- Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0)
- Probably due to low occurrence counts (a kind of estimation error)
- Type I error: some multiple-sense words may have too many senses and thus be misrecognized as non-specific in every domain (although the senses are unique in their respective domains)
49 Error Sources
- The well-organized-website assumption may not hold all the time
- The hierarchical directory tags may not be appropriate representatives of the document words within a website
- The hierarchies may not be consistent from website to website
50 Future Works
- Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]
- E.g., with other term-weighting metrics
- E.g., a stop-list acquisition metric for identifying common words (to address Type II errors)
- Explore methods and criteria to adjust the hierarchy of a single directory tree
- Explore methods to merge directory trees from different sites
51 Concluding Remarks
- A simple metric for automatic/semi-automatic identification of DSWs
- At a low sense-tagging cost
- Rich web resources, almost free
- Implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free)
- A simple method to build semantic links and degrees of closeness among DSWs
- May be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets
- A good knowledge source for WSD-related applications
- WSD, machine translation, document classification, anti-spamming, ...
52 Thanks for your attention!!