Mining Domain Specific Words from Hierarchical Web Documents

Transcript and Presenter's Notes
1
Mining Domain Specific Words from Hierarchical
Web Documents
  • Jing-Shin Chang
  • Department of Computer Science & Information
    Engineering
  • National Chi-Nan (暨南) University
  • 1, Univ. Road, Puli, Nantou 545, Taiwan, ROC.
  • jshin@csie.ncnu.edu.tw
  • CJNLP-04, 2004/11/10-15, City U., H.K.

2
Personal Information, MT&NLP Labs in NCNU
  • Jing-Shin Chang
  • Assistant Professor (2000), CSIE, NCNU
  • was with NTHU NLP Lab & BDC (Behavior Design
    Corp.) (1986-2000)
  • PhD, EE, NTHU (Natl. Tsing-Hua Univ., Hsinchu,
    TW), 1997
  • Research Fields & Interests
  • Machine Translation (since 1986)
  • Statistical models of all MT subtasks
  • Alignment (parallel/non-parallel)
  • Parsing: LPCFG, Generalized Probabilistic
    Semantic Model
  • New statistical MT models
  • Chinese Language Processing
  • Word segmentation, lexicon acquisition, Chinese
    abbreviation, ...

3
TOC
  • Motivation
  • What are DSWs?
  • Why DSW Mining? (Applications)
  • WSD with DSWs, without a sense-tagged corpus
  • Constructing Hierarchical Lexicon Tree w/o
    Clustering
  • Other applications
  • How to Mine DSWs from Hierarchical Web Documents
  • Preliminary Results
  • Error Sources
  • Remarks

4
Motivation
  • Is there a quick and easy (engineering) way to
    construct a large-scale WordNet or the like, now
    that everyone is talking about ontological
    knowledge sources and X-WordNets (whatever you
    call them)?
  • Triggers a new view for constructing a lexicon
    tree with hierarchical semantic links
  • DSW identification turns out to be a key to such
    construction
  • and can be used in various applications,
    including DSW-based WSD without using
    sense-tagged corpora

5
What Are Domain Specific Words (DSWs)
  • Words that appear frequently in some particular
    domains
  • (a) Multiple-sense words that are used frequently
    with special meanings or usages in particular
    domains
  • E.g., piston: 活塞 (mechanics) or 活塞隊 (the
    Pistons, sports)
  • (b) Single-sense words that are used frequently
    in particular domains
  • Suggesting that some words in the current
    document might be related to this particular
    sense
  • Serve as anchor words/tags in the context for
    disambiguating other multiple-sense words

6
What to Do in DSW Mining
  • DSW Mining Task
  • Find lists of words that occur frequently in the
    same domain, and associate each list (and the
    words within it) with a domain (implicit sense)
    tag
  • E.g., entertainment: singer, pop songs, rock &
    roll, Chang Hui-Mei (Ah-Mei), album, ...
  • As a side effect, find the hierarchical or
    network-like relationships between adjacent sets
    of DSWs
  • When applied to mining DSWs associated with each
    node of a hierarchical directory/document tree
  • Each node being annotated with a domain tag

7
Why DSW Mining from Document Trees
  • The semantic hierarchy among words, and the
    membership of words in that hierarchy, are
    important resources for WSD
  • A hierarchical lexicon tree or a
    wordnet-lookalike is invaluable for such a
    purpose
  • Multiple-sense words tend to be used differently
    in different domains, but each use is highly
    related to the domain of the document in which
    it appears
  • Identifying the particular domain in use suggests
    the particular sense for the words
  • Words of the same domain impose constraints over
    the possible senses

8
Why DSW Mining from Document Trees
  • Important for word sense disambiguation (WSD)
  • Multi-sense words appearing in the same document
    tend to be tagged with the same word sense if
    they belong to the same common domain in the
    semantic hierarchy [Yarowsky 95]
  • Pretty much true for different words in the same
    document as well

9
Why DSW Mining from Document Trees
  • WSD by using DSWs (anchor words)
  • Domains are easier to identify than fine-grained
    senses
  • More sources are available that are directly
    tagged with a domain label, rather than with a
    linguistically motivated sense tag set
  • Disambiguation by referring to the senses of
    semantically unambiguous words
  • Manual construction of a lexicon hierarchy, and
    of the membership of each node, is costly and
    labor intensive
  • New words and new usages are produced every few
    days, so (semi-)automatic updating is preferred
  • E.g., Piston is now better known as a basketball
    team than as a mechanical part

10
DSW Applications (1)
  • Technical term extraction
  • W(d) = { w | w ∈ DSW(d) }
  • d ∈ { computer, traveling, food, ... }

11
DSW Applications (2)
  • Generic WSD based on DSWs
  • s* = argmax_s Σ_d P(s|d,W) P(d|W)
        = argmax_s Σ_d P(s|d,W) P(W|d) P(d)
  • If a large-scale sense-tagged corpus is not
    available, which is often the case
  • Machine translation
  • Help select translation lexicon candidates
  • E.g., money bank (when used with payment,
    loan, etc.), river bank, memory bank (in PC,
    Intel, MS Windows domains)
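
A minimal Python sketch of this domain-based WSD idea. All probability tables, DSW lists, and words below are invented for illustration, and the DSW-overlap count is a crude stand-in for P(d|W); none of this is data from the talk.

```python
# Toy sketch of DSW-based WSD: pick the sense whose supporting domains
# best match the domain-specific words (DSWs) seen in the context.
# Hypothetical P(sense | domain) table for the ambiguous word "bank".
P_sense_given_domain = {
    "finance":   {"money_bank": 0.9,  "river_bank": 0.05, "memory_bank": 0.05},
    "geography": {"money_bank": 0.1,  "river_bank": 0.85, "memory_bank": 0.05},
    "computing": {"money_bank": 0.05, "river_bank": 0.05, "memory_bank": 0.9},
}

# Hypothetical DSW lists per domain (what the mining algorithm would output).
DSW = {
    "finance":   {"loan", "payment", "interest"},
    "geography": {"river", "shore", "flood"},
    "computing": {"memory", "Intel", "Windows"},
}

def disambiguate(context_words):
    """Score each sense by sum over domains of P(s|d) weighted by DSW overlap."""
    scores = {}
    for domain, senses in P_sense_given_domain.items():
        # crude stand-in for P(d|W): how many DSWs of d occur in the context
        overlap = len(DSW[domain] & set(context_words))
        for sense, p in senses.items():
            scores[sense] = scores.get(sense, 0.0) + p * overlap
    return max(scores, key=scores.get)

print(disambiguate(["loan", "payment", "due"]))    # -> money_bank
print(disambiguate(["memory", "Intel", "slot"]))   # -> memory_bank
```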

12
DSW Applications
  • Generic WSD based on DSWs
  • Needs sense-tagged corpora for training under the
    direct sense model (not widely available)
  • Implicitly domain-tagged corpora, however, are
    widely available on the web
  • The domain-based formulation sums over the
    domains where w0 is a DSW
13
DSW Applications (3)
  • Document classification
  • N-class classification based on DSWs
  • Anti-spamming (Two-class classification)
  • Words in spamming (uninteresting) mails vs.
    normal (interesting) mails help block spamming
    mails
  • Interesting domains vs. uninteresting domains
  • P(W|S)P(S) vs. P(W|¬S)P(¬S)

14
DSW Applications (3.a)
  • Document classification based on DSWs
  • d: document class label
  • w_1..w_n: bag of words in the document
  • |D| = n > 2: number of document classes
  • Anti-spamming based on DSWs
  • |D| = n = 2 (two-class classification)
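
A minimal sketch of such a DSW-based classifier in the two-class (anti-spam) setting: argmax_d P(d) Π_i P(w_i|d), Naive-Bayes style. The training counts, priors, and add-one smoothing are hypothetical choices for illustration, not from the talk.

```python
import math
from collections import Counter

# Hypothetical per-domain DSW counts and class priors.
train = {
    "spam":   Counter({"free": 30, "winner": 20, "credit": 25, "click": 25}),
    "normal": Counter({"meeting": 30, "report": 25, "project": 30, "credit": 15}),
}
prior = {"spam": 0.4, "normal": 0.6}

def classify(words):
    """Return argmax_d log P(d) + sum_i log P(w_i|d)."""
    best, best_lp = None, float("-inf")
    for d, counts in train.items():
        total, vocab = sum(counts.values()), len(counts)
        lp = math.log(prior[d])
        for w in words:
            # add-one smoothing so unseen words do not zero out the product
            lp += math.log((counts[w] + 1) / (total + vocab))
        if lp > best_lp:
            best, best_lp = d, lp
    return best

print(classify(["free", "credit", "click"]))       # -> spam
print(classify(["project", "meeting", "report"]))  # -> normal
```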

15
DSW Applications (4)
  • Building large lexicon tree or wordnet-lookalike
    (semi-) automatically from hierarchical web
    documents
  • Membership: semantic links among words of the
    same domain are close (context), similar
    (synonym, thesaurus), or negated concepts
    (antonym)
  • Hierarchy: the hierarchy of the lexicon suggests
    some ontological relationships

16
Conventional Methods for Constructing Lexicon
Trees
  • Construction by Clustering
  • Collect words in a large corpus
  • Evaluate word association as distance (or
    closeness) measure for all word pairs
  • Use clustering criteria to build lexicon
    hierarchy
  • Adjust the hierarchy and assign semantic/sense
    tags to nodes of the lexicon tree
  • thus assigning sense tags to the members of each
    node
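
For contrast, a sketch of this conventional route using SciPy's agglomerative clustering. The word vectors are invented co-occurrence profiles; note that every merge is binary, which is exactly the limitation discussed on the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented co-occurrence profiles, purely for illustration.
words = ["piston", "engine", "pitcher", "inning"]
profiles = np.array([
    [5, 4, 0, 1],   # piston
    [4, 6, 1, 0],   # engine
    [0, 1, 7, 5],   # pitcher
    [1, 0, 5, 6],   # inning
], dtype=float)

# linkage() performs the binary merging the talk criticizes: each step
# joins exactly two clusters, and the resulting hierarchy still needs
# manual sense tags for its internal nodes.
Z = linkage(profiles, method="average", metric="cosine")
print(Z)  # each row: (cluster_a, cluster_b, distance, cluster_size)
```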

17
Clustering Methods for Constructing Lexicon Trees
18
Clustering Methods for Constructing Lexicon Trees
  • Disadvantages
  • Does not take advantage of the hierarchical
    information in the document tree (flattened when
    collecting words)
  • Word association & clustering criteria are not
    directly related to human perception
  • Most clustering algorithms conduct binary merging
    (or division) in each step, for simplicity
  • The automatically generated semantic hierarchy
    may not reflect human perception
  • Hierarchy boundaries are not clearly detected
    automatically
  • Adjusting the hierarchy may not be easy (since
    human perception is not used to guide clustering)
  • Pairwise association evaluation is costly

19
Hierarchical Information Loss when Collecting
Words
20
Clustering Methods for Constructing Lexicon Trees
(Figure callouts: Reflect human perception? Why
binary? Hierarchy?)
21
Alternative View for Constructing Lexicon Trees
  • Construction by Retaining DSWs
  • Preserve hierarchical structure of web documents
    as baseline of semantic hierarchy, which is
    already mildly confirmed by webmasters
  • Associate each node with DSWs as members and tag
    each DSW with the directory/domain name
  • Optionally adjust the tree hierarchy and members
    of each nodes

22
Constructing Lexicon Trees by Preserving DSWs
(Legend: O = DSW, X = non-DSW)
23
Constructing Lexicon Trees by Preserving DSWs
(Legend: O = DSW, X = non-DSW)
24
Constructing Lexicon Trees by Preserving DSWs
  • Advantages
  • The hierarchy reflects human perception
  • Adjustment could be easier, if necessary
  • Directory names are highly correlated with sense
    tags
  • A domain-based model can be used when
    sense-tagged corpora are not available
  • Pairwise word association evaluation is replaced
    by computing domain specificity against the
    domains
  • O(|W| × |W|) vs. O(|W| × |D|)
  • Requirements
  • A well-organized web site
  • Mining DSWs from such a site

25
Constructing Lexicon Trees by Preserving DSWs
(Figure: a lexicon-tree node X and its child node
Y. Membership links (closeness, similarity) relate
words within a node; synonym and antonym links
relate members; is_a/hypernym links connect child
Y to its parent X.)
26
Alternative View for Constructing Lexicon Trees
  • Benefits
  • No similarity computation: closeness (incl.
    similarity) is already implicitly encoded by
    human judges
  • No binary clustering: clustering is already done
    (implicitly) with human judgment
  • Hierarchical links available: some well-developed
    relationships are already in place
  • Although not perfect

27
Why Mining From Web Pages
  • Rich tagging information, including
  • Explicit tags: HTML/XML tags, added by webmasters
  • Implicit tags: directory names
  • Domain-specific words associated with the same
    node of the document hierarchy have the same
    (unknown) tags most of the time
  • The unknown tag is highly correlated with the
    directory name assigned by the webmasters
  • Explicit link hierarchy among the implicit tags
  • Free and huge amounts of training material
  • Explicitly/implicitly tagged (semantically) by
    webmasters all over the world
  • Cleaner than BBS/newsgroup articles
  • Dialogue-like noise is rare

28
Proposed Method for Mining
  • Web Hierarchy as a Large Document Tree
  • Each document was generated by applying DSWs to
    some generic document templates
  • Remove non-specific words from the documents,
    leaving a lexicon tree with DSWs associated with
    each node
  • Leaving only domain-specific words
  • Forming a lexicon tree from a document tree
  • Label the domain-specific words
  • Characteristics
  • Get associated words by measuring domain
    specificity against a known, common domain,
    instead of measuring pairwise association plus
    clustering

29
Mining Criteria: Cross-Domain Entropy
  • Domain-independent terms tend to be distributed
    evenly in all domains.
  • Distributional evenness can be measured with the
    Cross-Domain Entropy (CDE), defined as follows:
    CDE(w_i) = -Σ_j P_ij log P_ij,
    where P_ij = f_ij / Σ_k f_ik
  • P_ij: probability of word i in domain j
  • f_ij: normalized frequency of word i in domain j
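
A direct transcription of this definition into Python (assuming the natural log; the slide does not fix a log base):

```python
import math

def cross_domain_entropy(freqs):
    """Cross-Domain Entropy of one word.

    freqs: the normalized frequencies f_ij of word i over domains j.
    P_ij renormalizes f_ij over the domains; CDE = -sum_j P_ij * log P_ij.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]  # zero cells contribute nothing
    return -sum(p * math.log(p) for p in probs)
```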

30
Mining Criteria: Cross-Domain Entropy
  • Example
  • w_i = piston, with frequencies (normalized to
    [0,1]) in various domains
  • f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
  • → Domain-specific (unevenly distributed), peaked
    at the 2nd and the 4th domains
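
Continuing the sketch above, the piston vector from this slide gives a visibly lower entropy than an evenly spread word:

```python
# The "piston" vector from the slide: sharply peaked on two domains.
piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]
uniform = [0.2] * 5  # a domain-independent word spreads evenly

print(cross_domain_entropy(piston))   # ~0.70: low, domain-specific
print(cross_domain_entropy(uniform))  # ln(5) ~ 1.61: the maximum for 5 domains
```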

31
Mining Algorithm Step1
  • Step 1 (Data Collection): Acquire a large
    collection of web documents using a web spider,
    while preserving the directory hierarchy of the
    documents. Strip unused markup tags from the web
    pages.
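
A minimal sketch of this step, assuming the site has already been mirrored to a local directory (e.g., by a recursive crawler) so the directory hierarchy is preserved by construction; the regex tag-stripping is a crude stand-in for a real HTML parser.

```python
import os
import re

TAG = re.compile(r"<[^>]+>")  # crude: removes anything that looks like a tag

def load_documents(root):
    """Yield (domain_path, plain_text) for every HTML file under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = TAG.sub(" ", fh.read())
                # the subdirectory path serves as the (implicit) domain tag
                domain = os.path.relpath(dirpath, root)
                yield domain, text
```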

32
Mining Algorithm Step2
  • Step 2 (Word Segmentation or Chunking): Identify
    word (or compound word) boundaries in the
    documents by applying a word segmentation
    process, such as (Chiang 92; Lin 93), to
    Chinese-like documents (where word boundaries are
    not explicit), or by applying a compound-word
    chunking algorithm to English-like documents, in
    order to identify the word entities of interest.
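
The cited segmenters are statistical; as a stand-in, here is a toy forward maximum-matching segmenter over a hypothetical lexicon, just to make the step concrete. It is not the (Chiang 92; Lin 93) method.

```python
# Hypothetical lexicon; a real system would use a large dictionary.
LEXICON = {"棒球", "投手", "全壘打", "球場"}
MAX_LEN = max(map(len, LEXICON))

def segment(text):
    """Greedy forward maximum matching: always take the longest known word."""
    words, i = [], 0
    while i < len(text):
        # try the longest dictionary match starting at position i
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:  # no match: emit a single character and move on
            words.append(text[i])
            i += 1
    return words

print(segment("棒球投手全壘打"))  # -> ['棒球', '投手', '全壘打']
```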

33
Mining Algorithm Step3
  • Step 3 (Acquiring Normalized Term Frequencies for
    all Words in Various Domains): For each
    subdirectory d_j, find the number of occurrences
    n_ij of each term w_i in all the documents, and
    derive the normalized term frequency f_ij =
    n_ij/N_j by normalizing n_ij with the total
    document size, N_j = Σ_i n_ij, in that directory.
    The directory is then associated with a set of
    <w_i, d_j, f_ij> tuples, where w_i is the i-th
    word of the complete word list for all documents,
    d_j is the j-th directory name (referred to as
    the domain hereafter), and f_ij is the normalized
    relative frequency of occurrence of w_i in domain
    d_j.
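
Step 3 transcribed into Python, assuming documents arrive as (domain, token_list) pairs (e.g., the load_documents and segment sketches above, composed):

```python
from collections import Counter, defaultdict

def term_frequencies(docs):
    """docs: iterable of (domain, token_list).

    Returns {domain: {word: f_ij}} with f_ij = n_ij / N_j,
    where n_ij counts word w_i in domain d_j and N_j = sum_i n_ij.
    """
    counts = defaultdict(Counter)
    for domain, tokens in docs:
        counts[domain].update(tokens)
    freqs = {}
    for domain, c in counts.items():
        N = sum(c.values())
        freqs[domain] = {w: n / N for w, n in c.items()}
    return freqs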

34
Mining Algorithm Step 3
  • Input: { <w_i, d_j, f_ij> }
  • where f_ij = n_ij / N_j and N_j = Σ_i n_ij

35
Mining Algorithm Step4
  • Step 4 (Removing Domain-Independent Terms):
    Domain-independent terms are identified as those
    terms which are distributed evenly in all
    domains, that is, terms with a large
    Cross-Domain Entropy (CDE) as defined above.
  • Terms whose CDE is above a threshold can be
    removed from the lexicon tree, since such terms
    are unlikely to be closely associated with any
    domain. Terms with a low CDE are retained in the
    few domains with the highest normalized
    frequencies (e.g., top-1 and top-2).
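
A sketch of Step 4, reusing the cross_domain_entropy function above. The threshold and top-k values are illustrative knobs, not values from the paper.

```python
def retain_dsws(freqs, cde_threshold=1.2, top_k=2):
    """freqs: {domain: {word: f_ij}}; returns {domain: set of retained DSWs}."""
    domains = list(freqs)
    vocab = {w for d in domains for w in freqs[d]}
    tree = {d: set() for d in domains}
    for w in vocab:
        fvec = [freqs[d].get(w, 0.0) for d in domains]
        if cross_domain_entropy(fvec) >= cde_threshold:
            continue  # evenly spread across domains: not domain-specific
        # keep the term only in its top-k domains by normalized frequency
        best = sorted(domains, key=lambda d: freqs[d].get(w, 0.0), reverse=True)
        for d in best[:top_k]:
            if freqs[d].get(w, 0.0) > 0:
                tree[d].add(w)
    return tree
```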

36
Mining Algorithm (optional)
  • Adjusting the hierarchy and members of each node
    (optional, or manual)

37
Mining Algorithm Step5
  • Step 5 (Clustering Pairs of Terms with High
    Association within each Domain): For each domain,
    we can further partition the terms into clusters
    of high association, to get finer-grained
    domain-specific association lists. The clustering
    method can use an association metric like mutual
    information as the distance measure between
    words, where the mutual information metric is
    defined as
    I_j(x, y) = log [ P_j(x, y) / (P_j(x) P_j(y)) ]
  • This definition is identical to that of [Church
    89], but the probabilities are taken over a
    particular domain (d_j).
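
A sketch of this domain-restricted mutual information, assuming co-occurrence pair counts have already been collected within one domain (the counting scheme, e.g. the co-occurrence window, is left open by the slide):

```python
import math

def domain_mutual_information(pair_counts, word_counts, total_pairs):
    """MI within ONE domain d_j (cf. Church 89, restricted to the domain).

    pair_counts: {(x, y): co-occurrence count of x and y in domain d_j}
    word_counts: {word: unigram count in domain d_j}
    total_pairs: total number of co-occurrence pairs in domain d_j
    """
    total_words = sum(word_counts.values())
    mi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi
```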

38
Mining Algorithm Step6
  • Step 6 (Splitting a Domain into Subdomains):
    After clustering, a few subdirectories are
    created, one for each cluster, as the children of
    the current domain. The names of the
    subdirectories are the terms with the highest
    relative frequencies of occurrence.

39
Mining Algorithm Step7
  • Step 7 (Moving Common Terms with Higher CDE
    Upward): Terms that are common to siblings in the
    directory tree, with higher cross-domain entropy
    (measured over the siblings), are moved upward
    one level in the directory tree (starting from
    the leaf nodes toward the root). Such terms are
    moved up to a more general domain since they are
    sufficiently general with respect to the sibling
    domains (though not to the degree of being
    domain-independent).
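
A sketch of Step 7 for one parent node, again reusing cross_domain_entropy; the sibling-CDE threshold is a hypothetical parameter.

```python
def promote_common_terms(tree, freqs, children, parent, sibling_threshold=0.9):
    """Move terms shared by all of `children` up to `parent`.

    tree:  {domain: set of DSWs}, as built by retain_dsws
    freqs: {domain: {word: f_ij}}
    Entropy here is measured over the SIBLINGS only, not all domains.
    """
    if not children:
        return
    common = set.intersection(*(tree[c] for c in children))
    for w in common:
        fvec = [freqs[c].get(w, 0.0) for c in children]
        if cross_domain_entropy(fvec) > sibling_threshold:
            # spread evenly across the siblings: general w.r.t. this family
            for c in children:
                tree[c].discard(w)
            tree.setdefault(parent, set()).add(w)
```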

40
Mining Algorithm Step8
  • Step 8 (Repeat Steps 5-7): The clustering (i.e.,
    splitting) and moving (i.e., merging) operations
    can be repeated until a stopping criterion is
    satisfied or a maximum number of iterations is
    reached.

41
Mining Algorithm Step9
  • Step 9 (Output): The directory tree now
    represents a hierarchical classification of the
    terms for different domains, and thus carries the
    lists of associated words for ambiguous words in
    different contexts.

42
Experiments
  • Domains
  • News articles from a local news site
  • 138 distinct domains
  • including leaf nodes of the directory tree and
    their parents
  • leaves with the same name are considered to be in
    the same domain
  • Examples: baseball, basketball, broadcasting,
    car, communication, culture, digital,
    edu(cation), entertainment, finance, food, ...
  • Size: 200 MB (HTML files)
  • 16K unique words after word segmentation

43
Domains (hierarchy not shown)
44
Sample Output (4 Selected Domains)
Table 1. Sampled domain specific words with low
entropies.
45
Sample Output (4 Selected Domains)
Table 1. Sampled domain specific words with low
entropies.
46
Preliminary Results
  • Domain-specific words and the assigned domain
    tags are well associated (e.g., ?? is used
    specifically in the baseball domain)
  • Extraction with the cross-domain entropy (CDE)
    metric is well founded
  • Domain-independent (or irrelevant) words (such as
    those from webmasters' advertisements) are well
    rejected as DSW candidates, owing to their high
    cross-domain entropy
  • DSWs are mostly nouns and verbs (open-class
    words)

47
Preliminary Results
  • Low cross-domain entropy words (DSWs) in their
    respective domains are generally highly
    correlated (e.g., ????, ??)
  • New usages of words, such as 活塞 (Pistons) with
    the basketball sense, could also be identified
  • Both properties make DSWs good contextual
    evidence for WSD tasks

48
Error Sources
  • A single CDE metric may not be sufficient to
    capture all characteristics of domain specificity
  • Type II error: some general (non-specific) words
    may have low entropy simply because they appear
    in only one domain (CDE = 0)
  • Probably due to low occurrence counts (a kind of
    estimation error)
  • Type I error: some multiple-sense words may have
    too many senses, and thus be mis-recognized as
    non-specific in each domain (although the senses
    are unique in their respective domains)

49
Error Sources
  • The well-organized-website assumption may not
    hold all the time
  • The hierarchical directory tags may not be
    appropriate representatives of the document
    words within a website
  • The hierarchies may not be consistent from
    website to website

50
Future Works
  • Use other knowledge sources, beyond the single
    CDE measure, to co-train the model in a manner
    similar to [Chang 97b, c]
  • E.g., with other term-weighting metrics
  • E.g., a stop-list acquisition metric for
    identifying common words (for Type II errors)
  • Explore methods and criteria for adjusting the
    hierarchy of a single directory tree
  • Explore methods for merging directory trees from
    different sites

51
Concluding Remarks
  • A simple metric for automatic/semi-automatic
    identification of DSWs
  • At a low sense-tagging cost
  • Rich web resources, almost free
  • Implicit semantic tagging implied by the
    directory hierarchy (an imperfect hierarchy, but
    free)
  • A simple method to build semantic links and
    degrees of closeness among DSWs
  • May be helpful for building large semantically
    tagged lexicon trees or network-linked x-wordnets
  • A good knowledge source for WSD-related
    applications
  • WSD, machine translation, document
    classification, anti-spamming, ...

52
Thanks for your attention!!
  • Thanks!!