Title: Mining Domain Specific Words from Hierarchical Web Documents
1 Mining Domain Specific Words from Hierarchical Web Documents
- Jing-Shin Chang
- Department of Computer Science and Information Engineering
- National Chi Nan University
- 1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
- jshin_at_csie.ncnu.edu.tw
- CJNLP-04, 2004/11/10-15, City U., H.K.
2 Personal Information, MT/NLP Labs in NCNU
- Jing-Shin Chang
- Assistant Professor (2000), CSIE, NCNU
- was with NTHU NLP Lab and BDC (Behavior Design Corp.) (1986-2000)
- PhD, EE, NTHU (Natl. Tsing-Hua Univ., Hsinchu, TW), 1997
- Research Fields and Interests
- Machine Translation (since 1986)
- Statistical models of all MT subtasks
- Alignment (parallel/non-parallel)
- Parsing: LPCFG, Generalized Probabilistic Semantic Model
- New Statistical MT Models
- Chinese Language Processing
- Word Segmentation, Lexicon Acquisition, Chinese Abbreviation, ...
3 TOC
- Motivation
- What are DSWs?
- Why DSW Mining? (Applications)
- WSD with DSWs, without a sense-tagged corpus
- Constructing a Hierarchical Lexicon Tree w/o Clustering
- Other applications
- How to Mine DSWs from Hierarchical Web Documents
- Preliminary Results
- Error Sources
- Remarks
4 Motivation
- Is there a quick and easy (engineering) way to construct a large-scale WordNet or the like, now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)?
- This triggers a new view for constructing a lexicon tree with hierarchical semantic links
- DSW identification turns out to be a key to such construction
- and it can be used in various applications, including DSW-based WSD without using sense-tagged corpora
5 What Are Domain-Specific Words (DSWs)?
- Words that appear frequently in some particular domains
- (a) Multiple-sense words that are frequently used with special meanings or usages in particular domains
- E.g., piston: 活塞 (mechanics) or 活塞隊 (sports)
- (b) Single-sense words that are used frequently in particular domains
- Their presence suggests that some words in the current document might be related to this particular sense
- They serve as anchor words/tags in the context for disambiguating other multiple-sense words
6 What to Do in DSW Mining
- DSW Mining Task
- Find lists of words that occur frequently in the same domain, and associate with each list (and the words within it) a domain (implicit sense) tag
- E.g., entertainment: singer, pop songs, rock & roll, Chang Hui-Mei (Ah-Mei), album, ...
- As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSWs
- When applied to mining DSWs associated with each node of a hierarchical directory/document tree
- Each node is annotated with a domain tag
7 Why DSW Mining from Document Trees
- The semantic hierarchy among words, and the membership of words in that hierarchy, are important resources for WSD
- A hierarchical lexicon tree or a wordnet-lookalike is invaluable for such a purpose
- Multiple-sense words tend to be used differently in different domains, but are highly related to the document domain in which they appear
- Identifying the particular domain in use suggests the particular sense for the words
- Words of the same domain impose sense constraints over the possible senses
8 Why DSW Mining from Document Trees
- Important for word sense disambiguation (WSD)
- Multi-sense words appearing in the same document tend to be tagged with the same word sense if they belong to the same common domain in the semantic hierarchy [Yarowsky 95]
- Pretty much true for different words in the same document as well
9 Why DSW Mining from Document Trees
- WSD by using DSWs (anchor words)
- The domain is easier to identify than the sense
- More sources are available that are directly tagged with a domain label rather than with a linguistically motivated sense tag set
- Disambiguation by referring to, or based on, the senses of semantically unambiguous words
- Manual construction of a lexicon hierarchy, and finding the members of each node, is costly and labor-intensive
- New words and new usages are produced every few days, so (semi-)automatic updating is preferred
- E.g., "Piston" is now better known as a basketball team than as a mechanical part
10 DSW Applications (1)
- Technical term extraction
- W(d) = { w | w ∈ DSW(d) }
- d ∈ { computer, traveling, food, ... }
11 DSW Applications (2)
- Generic WSD based on DSWs
- s* = argmax_s Σ_d P(s|d,W)·P(d|W) = argmax_s Σ_d P(s|d,W)·P(W|d)·P(d) (see the sketch below)
- Useful when a large-scale sense-tagged corpus is not available, which is often the case
- Machine translation
- Helps select translation lexicon candidates
- E.g., money bank (when used with payment, loan, etc.), river bank, memory bank (in PC, Intel, MS Windows domains)
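To make the decomposition concrete, here is a minimal Python sketch of DSW-based sense scoring, under the simplifying assumption that each domain votes for exactly one sense (so P(s|d,W) reduces to an indicator); the probability tables, the domain-to-sense mapping, and the smoothing constant are hypothetical toy values, not the paper's trained model.

```python
# Toy sketch of s* = argmax_s sum_d P(s|d,W) P(W|d) P(d).
import math

p_w_given_d = {                       # P(w|d): hypothetical unigram models
    "mechanics":  {"piston": 0.02, "engine": 0.03, "team": 0.0001},
    "basketball": {"piston": 0.01, "engine": 0.0001, "team": 0.04},
}
p_d = {"mechanics": 0.5, "basketball": 0.5}      # P(d): domain priors
sense_of = {"mechanics": "engine-part", "basketball": "NBA-team"}

def disambiguate(words):
    scores = {}
    for d, sense in sense_of.items():
        # log P(W|d) + log P(d), with crude smoothing for unseen words
        logp = math.log(p_d[d]) + sum(
            math.log(p_w_given_d[d].get(w, 1e-6)) for w in words)
        scores[sense] = scores.get(sense, 0.0) + math.exp(logp)
    return max(scores, key=scores.get)

print(disambiguate(["piston", "team"]))   # -> 'NBA-team'
```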
12 DSW Applications (2)
(Annotated formula diagram: conventional WSD needs sense-tagged corpora for training, which are not widely available; generic WSD based on DSWs instead uses implicitly domain-tagged corpora, which are widely available on the web; the sum is taken over the domains where w0 is a DSW.)
13 DSW Applications (3)
- Document classification
- N-class classification based on DSWs
- Anti-spamming (Two-class classification)
- Words in spam (uninteresting) mails vs. normal (interesting) mails help block spam
- Interesting domains vs. uninteresting domains
- Compare P(W|S)·P(S) vs. P(W|¬S)·P(¬S)
14 DSW Applications (3.a)
- Document classification based on DSWs
- d: document class label
- w_1..n: bag of words in the document
- |D| = n > 2: number of document classes
- Anti-spamming based on DSWs
- |D| = n = 2 (two-class classification; see the sketch below)
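A minimal sketch of the two-class case, assuming naive-Bayes unigram models; the word probabilities, priors, and smoothing value below are invented for illustration.

```python
# Compare P(W|S)P(S) with P(W|not-S)P(not-S) in log space.
import math

models = {
    "spam":   {"prior": 0.4, "p_w": {"prize": 0.03, "free": 0.03}},
    "normal": {"prior": 0.6, "p_w": {"meeting": 0.02, "report": 0.03}},
}

def classify(words):
    def logscore(m):
        return math.log(m["prior"]) + sum(
            math.log(m["p_w"].get(w, 1e-6)) for w in words)  # smoothed
    return max(models, key=lambda label: logscore(models[label]))

print(classify(["free", "prize"]))      # -> 'spam'
print(classify(["meeting", "report"]))  # -> 'normal'
```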
15 DSW Applications (4)
- Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents
- Membership: semantic links among words of the same domain, which are close (context), similar (synonym, thesaurus), or negated concepts (antonym)
- Hierarchy: the hierarchy of the lexicon suggests some ontological relationships
16 Conventional Methods for Constructing Lexicon Trees
- Construction by Clustering
- Collect words in a large corpus
- Evaluate word association as a distance (or closeness) measure for all word pairs
- Use clustering criteria to build the lexicon hierarchy
- Adjust the hierarchy and assign semantic/sense tags to the nodes of the lexicon tree
- Thus assigning sense tags to the members of each node
17 Clustering Methods for Constructing Lexicon Trees
18 Clustering Methods for Constructing Lexicon Trees
- Disadvantages
- Do not take advantage of the hierarchical information of the document tree (flattened when collecting words)
- Word-association and clustering criteria are not directly related to human perception
- Most clustering algorithms conduct binary merging (or division) in each step, for simplicity
- The automatically generated semantic hierarchy may not reflect human perception
- Hierarchy boundaries are not clearly detected automatically
- Adjustment of the hierarchy may not be easy (since human perception is not used to guide the clustering)
- Pairwise association evaluation is costly
19 Hierarchical Information Loss when Collecting Words
20 Clustering Methods for Constructing Lexicon Trees
(Diagram of a clustering-built hierarchy, annotated with the questions: Does it reflect human perception? Why binary merging? Where is the hierarchy?)
21 Alternative View for Constructing Lexicon Trees
- Construction by Retaining DSWs
- Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters
- Associate each node with DSWs as members, and tag each DSW with the directory/domain name
- Optionally adjust the tree hierarchy and the members of each node
22 Constructing Lexicon Trees by Preserving DSWs
(Diagram legend: O = DSW, X = non-DSW)
23 Constructing Lexicon Trees by Preserving DSWs
(Diagram legend: O = DSW, X = non-DSW)
24 Constructing Lexicon Trees by Preserving DSWs
- Advantages
- The hierarchy reflects human perception
- Adjustment could be easier, if necessary
- Directory names are highly correlated with sense tags
- A domain-based model can be used when sense-tagged corpora are not available
- Pairwise word-association evaluation is replaced by computing domain specificity against the domains
- O(W×W) vs. O(W×D)
- Requirements
- A well-organized web site
- Mining DSWs from such a site
25 Constructing Lexicon Trees by Preserving DSWs
(Diagram: a parent node X with a child node Y linked by is_a/hypernym relations (Y is_a X; B is_a X, or A1); membership (closeness, similarity) relationships hold among the DSWs within a node; synonym and antonym links connect individual members.)
26 Alternative View for Constructing Lexicon Trees
- Benefits
- No similarity computation: closeness (incl. similarity) is already implicitly encoded by human judges
- No binary clustering: clustering is already done (implicitly) with human judgment
- Hierarchical links available: some well-developed relationships are already in place
- Although not perfect
27 Why Mining From Web Pages
- Rich tagging information, including
- Explicit tags: HTML/XML tags, added by webmasters
- Implicit tags: directory names
- Domain-specific words associated with the same node of the document hierarchy have the same (unknown) tags most of the time
- The unknown tag is highly correlated with the directory name assigned by the webmasters
- Explicit links: the hierarchy among the implicit tags
- A free and huge amount of training material
- Explicitly/implicitly tagged (semantically) by webmasters all over the world
- Cleaner than BBS/newsgroup articles
- Dialogue-like noise is rare
28 Proposed Method for Mining
- Web Hierarchy as a Large Document Tree
- Each document was generated by applying DSWs to some generic document template
- Remove non-specific words from the documents, leaving a lexicon tree with DSWs associated with each node
- Leaving only domain-specific words
- Forming a lexicon tree from a document tree
- Label the domain-specific words
- Characteristics
- Get associated words by measuring domain specificity against a known and common domain, instead of measuring pairwise association plus clustering
29 Mining Criteria: Cross-Domain Entropy
- Domain-independent terms tend to be distributed evenly across all domains.
- Distributional evenness can be measured with the Cross-Domain Entropy (CDE), defined as
- CDE_i = -Σ_j P_ij log P_ij, with P_ij = f_ij / Σ_k f_ik
- P_ij: probability of word i in domain j
- f_ij: normalized frequency of word i in domain j
30 Mining Criteria: Cross-Domain Entropy
- Example
- W_i = piston, with frequencies (normalized to [0,1]) in various domains
- f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
- → Domain-specific (unevenly distributed) in the 2nd and 4th domains (see the sketch below)
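The example can be reproduced with a short helper that renormalizes a word's f_ij row into P_ij and computes the entropy; using base-2 logarithms (bits) is an assumption here, since any base works for thresholding.

```python
import math

def cde(freqs):
    """Cross-domain entropy of one word's normalized frequencies f_ij."""
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]  # P_ij = f_ij / sum_k f_ik
    return -sum(p * math.log2(p) for p in probs)

f_piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]  # uneven -> low CDE
f_common = [0.21, 0.20, 0.19, 0.20, 0.20]       # even -> CDE near log2(5)

print(round(cde(f_piston), 2))   # ~1.01: mass concentrated in 2 domains
print(round(cde(f_common), 2))   # ~2.32: spread over all 5 domains
```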
31 Mining Algorithm: Step 1
- Step 1 (Data Collection): Acquire a large collection of web documents using a web spider, while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages. (A sketch of the storage and stripping parts follows below.)
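A hedged sketch of the storage and tag-stripping parts of Step 1 (the crawl loop itself is omitted); the local corpus layout, the regex-based stripper, and the example URL are simplifying assumptions, not the original spider.

```python
import os
import re
from urllib.parse import urlparse

def strip_tags(html):
    # Crude markup removal; a real pipeline would use an HTML parser.
    return re.sub(r"<[^>]+>", " ", html)

def save_page(url, html, root="corpus"):
    # Map the URL path onto a local directory tree so that the site's
    # hierarchy is preserved for the per-directory statistics of Step 3.
    path = urlparse(url).path.lstrip("/") or "index.html"
    local = os.path.join(root, path)
    os.makedirs(os.path.dirname(local) or root, exist_ok=True)
    with open(local + ".txt", "w", encoding="utf-8") as out:
        out.write(strip_tags(html))

save_page("http://news.example.com/sports/baseball/game1.html",
          "<html><body>A <b>baseball</b> story.</body></html>")
```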
32 Mining Algorithm: Step 2
- Step 2 (Word Segmentation or Chunking): Identify word (or compound-word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest. (A toy segmenter is sketched below.)
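For illustration only, here is a forward maximum-matching segmenter; it is a far simpler stand-in for the cited segmentation methods (Chiang 92; Lin 93), and the tiny lexicon is invented.

```python
def max_match(text, lexicon, max_len=4):
    # Greedy forward maximum matching; unknown characters fall back
    # to single-character words, so the loop always makes progress.
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
    return words

print(max_match("活塞隊獲勝", {"活塞隊", "獲勝"}))  # -> ['活塞隊', '獲勝']
```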
33 Mining Algorithm: Step 3
- Step 3 (Acquiring Normalized Term Frequencies for All Words in Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all the documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size, N_j = Σ_i n_ij, of that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list for all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j. (See the sketch below.)
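Step 3 transcribes directly into Python; the assumption here is that the documents arrive as token lists grouped by directory.

```python
from collections import Counter, defaultdict

def normalized_frequencies(docs_by_domain):
    # docs_by_domain: {d_j: [token lists]}  ->  f[w_i][d_j] = f_ij
    f = defaultdict(dict)
    for d, docs in docs_by_domain.items():
        counts = Counter(w for doc in docs for w in doc)  # n_ij
        N = sum(counts.values())                          # N_j = sum_i n_ij
        for w, n in counts.items():
            f[w][d] = n / N                               # f_ij = n_ij / N_j
    return f

f = normalized_frequencies({
    "baseball": [["pitcher", "strike", "strike"]],
    "finance":  [["stock", "strike"]],
})
print(f["strike"])   # {'baseball': 0.666..., 'finance': 0.5}
```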
34 Mining Algorithm: Step 3 (cont.)
35 Mining Algorithm: Step 4
- Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms that are distributed evenly across all domains, that is, terms with a large Cross-Domain Entropy (CDE), CDE_i = -Σ_j P_ij log P_ij, as defined earlier.
- Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2). (A pruning sketch follows below.)
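Combining the cde() helper from slide 30 with the f_ij table of Step 3 gives a short pruning sketch; the CDE threshold and the top-k value are illustrative assumptions, not tuned parameters.

```python
import math

def cde(freqs):   # as in the slide-30 sketch
    total = sum(freqs)
    return -sum(f/total * math.log2(f/total) for f in freqs if f > 0)

def prune_domain_independent(f, threshold=1.5, k=2):
    # Keep each low-CDE term in its top-k domains; drop high-CDE terms.
    lexicon_tree = {}
    for w, by_domain in f.items():
        if cde(list(by_domain.values())) > threshold:
            continue                  # evenly spread: domain-independent
        top = sorted(by_domain, key=by_domain.get, reverse=True)[:k]
        for d in top:
            lexicon_tree.setdefault(d, []).append(w)
    return lexicon_tree               # domain -> list of its DSWs
```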
36 Mining Algorithm: Optional Step
- Adjusting the hierarchy and members of each node
(optional, or manual)
37 Mining Algorithm: Step 5
- Step 5 (Clustering Pairs of Terms with High Association within Each Domain): For each domain, we can further partition the terms into clusters of high association, to get finer-grained domain-specific association lists. The clustering method can use association metrics such as mutual information as the distance measure between words, where the mutual information metric is defined as
- MI(x, y; d_j) = log [ P(x, y | d_j) / ( P(x | d_j) · P(y | d_j) ) ]
- This definition is identical to that of [Church 89], but the probabilities are taken over a particular domain (d_j). (A counting sketch follows below.)
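A rough sketch of the domain-conditioned MI, estimating the probabilities from window co-occurrence counts within one domain's documents; the window size and the counting scheme are assumptions.

```python
import math

def domain_mi(docs, x, y, window=5):
    # docs: token lists belonging to one domain d_j.
    n = n_x = n_y = n_xy = 0
    for doc in docs:
        for i, w in enumerate(doc):
            n += 1
            if w == x:
                n_x += 1
                if y in doc[max(0, i - window): i + window + 1]:
                    n_xy += 1         # y co-occurs within the window
            elif w == y:
                n_y += 1
    if 0 in (n, n_x, n_y, n_xy):
        return float("-inf")          # unseen pair: no association
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))
```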
38 Mining Algorithm: Step 6
- Step 6 (Splitting a Domain into Subdomains): After clustering, a few subdirectories are created, one for each cluster, as the children of the current domain. The names of the subdirectories are the terms with the highest relative frequencies of occurrence.
39 Mining Algorithm: Step 7
- Step 7 (Moving Common Terms with Higher CDE Upward): Terms that are common to siblings in the directory tree and have a higher cross-domain entropy (measured over the siblings) are moved up one level in the directory tree (starting from the leaf nodes and proceeding toward the root). These terms are moved up to a more general domain because they are sufficiently general with respect to the sibling domains (though not to the degree of being domain-independent). (See the sketch below.)
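Step 7 can be sketched by re-measuring CDE over the sibling domains only, reusing the f table from Step 3; the threshold is again an assumed parameter.

```python
import math

def cde(freqs):   # as in the slide-30 sketch
    total = sum(freqs)
    return -sum(f/total * math.log2(f/total) for f in freqs if f > 0)

def move_common_terms_up(tree, parent, siblings, f, threshold=1.2):
    # tree: {node: [words]}; f: f[w][d] from Step 3.
    candidates = {w for s in siblings for w in tree.get(s, [])}
    for w in candidates:
        sib_freqs = [f[w][s] for s in siblings if f[w].get(s, 0) > 0]
        if len(sib_freqs) > 1 and cde(sib_freqs) > threshold:
            for s in siblings:                 # remove from every sibling...
                if w in tree.get(s, []):
                    tree[s].remove(w)
            tree.setdefault(parent, []).append(w)  # ...attach to the parent
```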
40 Mining Algorithm: Step 8
- Step 8 (Repeat Steps 5-7): The clustering (i.e., splitting) and moving (i.e., merging) operations can be repeated until a stopping criterion is satisfied or a set number of iterations is reached.
41 Mining Algorithm: Step 9
- Step 9 (Output): The directory tree now represents a hierarchical classification of the terms for the different domains, and thus carries the lists of associated words for ambiguous words in different contexts.
42 Experiments
- Domains
- News articles from a local news site
- 138 distinct domains
- including the leaf nodes of the directory tree and their parents
- leaves with the same name are considered the same domain
- Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment, finance, food, ...
- Size: 200 MB (HTML files)
- 16K unique words after word segmentation
43 Domains (Hierarchy not shown)
44 Sample Output (4 Selected Domains)
Table 1. Sampled domain-specific words with low entropies.
45 Sample Output (4 Selected Domains, cont.)
Table 1 (cont.). Sampled domain-specific words with low entropies.
46 Preliminary Results
- Domain-specific words and the assigned domain tags are well associated (e.g., ?? is used specifically in the baseball domain)
- Extraction with the cross-domain entropy (CDE) metric is well founded
- Domain-independent (or irrelevant) words (such as those from webmasters' advertisements) are correctly rejected as DSW candidates, due to their high cross-domain entropy
- DSWs are mostly nouns and verbs (open-class words)
47 Preliminary Results
- Low cross-domain-entropy words (DSWs) in their respective domains are generally highly correlated (e.g., ????, ??)
- New usages of words, such as 活塞 (Pistons) in the basketball sense, can also be identified
- Both properties make DSWs good contextual evidence for WSD tasks
48 Error Sources
- A single CDE metric may not be sufficient to capture all characteristics of domain specificity
- Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0)
- Probably due to low occurrence counts (a kind of estimation error)
- Type I error: some multiple-sense words may have too many senses and thus be misrecognized as non-specific in every domain (although the senses are unique in their respective domains)
49 Error Sources
- The well-organized-website assumption may not hold all the time
- The hierarchical directory tags may not be appropriate representatives of the document words within a website
- The hierarchies may not be consistent from website to website
50 Future Works
- Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]
- E.g., with other term-weighting metrics
- E.g., a stop-list acquisition metric for identifying common words (to address Type II errors)
- Explore methods and criteria to adjust the hierarchy of a single directory tree
- Explore methods to merge directory trees from different sites
51 Concluding Remarks
- A simple metric for automatic/semi-automatic identification of DSWs
- At a low sense-tagging cost
- Rich web resources, almost free
- Implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free)
- A simple method to build semantic links and degrees of closeness among DSWs
- May be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets
- A good knowledge source for WSD-related applications
- WSD, machine translation, document classification, anti-spamming, ...
52 Thanks for your attention!!