Word Weighting based on User's Browsing History - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Word Weighting based on User's Browsing History
  • Yutaka Matsuo
  • National Institute of Advanced Industrial Science
    and Technology (JPN)
  • Presenter: Junichiro Mori
  • University of Tokyo (JPN)

2
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • System architecture
  • Evaluation
  • Conclusion

3
Introduction
Introduction
  • Many information support systems with NLP use
    tfidf to measure the weight of words.
  • Tfidf is based on statistics of word occurrence
    in a target document and a corpus (a minimal
    sketch follows below).
  • It is effective in many practical systems,
    including summarization systems and retrieval
    systems.
  • However, a word that is important to one user is
    sometimes not important to others.
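For reference, a minimal tfidf sketch; the tokenization, corpus interface, and smoothing choice are illustrative assumptions, not the systems described in these slides.

    import math
    from collections import Counter

    def tfidf(document_tokens, corpus):
        """Weight each word by term frequency x inverse document frequency.

        document_tokens: list of tokens of the target document.
        corpus: list of token lists (one per document) for the idf statistics.
        """
        tf = Counter(document_tokens)
        n_docs = len(corpus)
        weights = {}
        for word, count in tf.items():
            doc_freq = sum(1 for doc in corpus if word in doc)
            idf = math.log(n_docs / (1 + doc_freq))  # +1 smoothing avoids division by zero
            weights[word] = count * idf
        return weights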

4
Example
Introduction
  • 'Suzuki hitting streak ends at 23 games'
  • Ichiro Suzuki is a Japanese MLB player, MVP in
    2001.
  • Those who are greatly interested in MLB would
    think 'hitting streak ends' is important,
  • while a user who has no interest in MLB would
    note words such as 'game' or 'Seattle Mariners'
    as informative, because those words indicate
    that the subject of the article is baseball.
  • If a user is not familiar with the topic, he/she
    may think general words related to the topic are
    important.
  • On the other hand, if a user is familiar with the
    topic, he/she may think more detailed words are
    important.

Our main hypothesis
5
Goal of this research
Introduction
  • This research addresses context-based word
    weighting, focusing on the statistical features
    of word co-occurrence.
  • In order to measure the weight of words more
    accurately, contextual information about the user
    (what we call familiar words) is used.

6
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

7
IRM
  • A new measure, IRM, is based on a word-weighting
    algorithm applied to a single document.
  • [Matsuo 03] 'Keyword Extraction from a Single
    Document using Word Co-occurrence Statistical
    Information', FLAIRS 2003

8
We take a paper as an example.
Previous work Matsuo03
COMPUTING MACHINERY AND INTELLIGENCE
A.M. TURING
1. The Imitation Game
I PROPOSE to consider the question, 'Can machines
think?' This should begin with definitions of the
meaning of the terms 'machine' and 'think'. The
definitions might be framed so as to reflect so
far as possible the normal use of the words, but
this attitude is dangerous. If the meaning of the
words 'machine' and 'think' are to be found by
examining how they are commonly used it is
difficult to escape the conclusion that the
meaning and the answer to the question, 'Can
machines think?' is to be sought in a statistical
survey such as a Gallup poll. But this is absurd.
Instead of attempting such a definition I shall
replace the question by another, which is closely
related to it and is expressed in relatively
unambiguous words. The new form of the problem
can be described in terms of a game which we
call the 'imitation game'. It is played with
three people, a man (A), a woman (B), and an
interrogator (C) who may be of either
9
Distribution of frequent terms
Previous work Matsuo03
10
Next, count co-occurrences
Previous work Matsuo03
  • The new form of the problem can be described in
    terms of a game which we call the 'imitation
    game'.
  • (after stemming, stop-word elimination, and
    phrase extraction)
  • 'new' and 'form' co-occur once.
  • 'new' and 'problem' co-occur once.
  • ...
  • 'call' and 'imitation game' co-occur once (a
    counting sketch follows below).
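A minimal sketch of this sentence-level counting; the stop list is illustrative, and the paper's actual stemming and phrase extraction are assumed to have been applied already.

    from collections import defaultdict
    from itertools import combinations

    STOP_WORDS = {"the", "of", "a", "which", "we", "can", "be", "in"}  # illustrative stop list

    def count_cooccurrences(sentences):
        """Count how often two terms appear together in the same sentence.

        sentences: list of token lists, one per sentence, assumed already
        stemmed and with phrases joined (e.g. "imitation game" as one token).
        """
        cooc = defaultdict(int)
        for tokens in sentences:
            terms = sorted({t for t in tokens if t not in STOP_WORDS})
            for a, b in combinations(terms, 2):
                cooc[(a, b)] += 1
        return cooc

    # The slide's example sentence after (hypothetical) preprocessing:
    sentence = ["new", "form", "problem", "described", "terms", "game", "call", "imitation game"]
    print(count_cooccurrences([sentence])[("call", "imitation game")])  # prints 1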

11
Co-occurrence matrix
Previous work Matsuo03
12
Co-occurrences of 'kind' and frequent terms,
and of 'make' and frequent terms
Previous work Matsuo03
  • A general term such as 'kind' or 'make' is used
    relatively impartially with each frequent term,
    but

13
Co-occurrence matrix
Previous work Matsuo03
(co-occurrence matrix figure; columns are the frequent terms)
14
Co-occurrences of 'imitation' and frequent terms,
and of 'digital computer' and frequent terms
Previous work Matsuo03
  • while a term such as 'imitation' or 'digital
    computer' shows co-occurrence especially with
    particular terms.

15
Biases of co-occurrence
Previous work Matsuo03
  • A general term such as 'kind' or 'make' is used
    relatively impartially with each frequent
    term, while a term such as 'imitation' or
    'digital computer' shows co-occurrence especially
    with particular terms.
  • Therefore, the degree of bias of co-occurrence
    can be used as a surrogate for term importance.

16
χ²-measure
Previous work Matsuo03
  • We use the χ²-test, which is very common for
    evaluating biases between expected and observed
    frequencies.
  • G: the set of frequent terms.
  • freq(w, g): frequency of co-occurrence of term w
    and term g.
  • p_g: unconditional probability (the expected
    probability) of g.
  • f(w): the total number of co-occurrences of term
    w with the frequent terms G.
  • A large bias of co-occurrence indicates the
    importance of a word (a reconstruction of the
    formula follows below).
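The formula itself did not survive the transcript. Reconstructed from the definitions above, it is χ²(w) = Σ_{g∈G} (freq(w, g) − f(w)·p_g)² / (f(w)·p_g). A minimal sketch, assuming the co-occurrence counts and the expected probabilities p_g have already been computed:

    def chi_square(word, frequent_terms, freq, p):
        """Bias of co-occurrence between `word` and the frequent terms G.

        frequent_terms: the set G of frequent terms.
        freq(w, g):     observed co-occurrence count of terms w and g.
        p[g]:           expected (unconditional) probability of g.
        """
        f_w = sum(freq(word, g) for g in frequent_terms)  # f(w): total co-occurrence with G
        score = 0.0
        for g in frequent_terms:
            expected = f_w * p[g]
            if expected > 0:
                score += (freq(word, g) - expected) ** 2 / expected
        return score

Sorting candidate terms by this score (next slide) yields the document's important words.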

17
Sort by χ²-value
Previous work Matsuo03
We can get important words based on co-occurrence
information in a document.
18
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

19
Personalize the calculation of word importance
IRM, proposed measure
  • The previous method is useful for extracting
    reader-independent important words from a
    document.
  • However, the importance of words depends not only
    on the document itself but also on the reader.

20
If we change the columns to pick up
IRM, proposed measure
a: machine, b: computer, c: question, d: digital,
e: answer, f: game, g: argument, h: make, i: state,
j: number; u: imitation, v: digital computer,
w: kind, x: make
21
If we change the columns to pick up
IRM, proposed measure
(figure: the matrix columns are changed from the frequent terms to other selected words, e.g. 'logic' and 'God')
Words relevant to the selected words get high χ²
values, because they co-occur with them often.
22
Familiarity instead of frequency
IRM, proposed measure
  • We focus on words familiar to the user, instead
    of frequent words in the document.
  • Definition: familiar words are words which a
    user has frequently seen in the past.

23
Interest Relevancy Measure (IRM)
IRM, proposed measure
  • where H_k is the set of familiar words for user k
    (a reconstruction of the formula follows below)
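The formula on this slide was an image and is missing from the transcript. As the surrounding slides describe, IRM is the same χ²-style bias computed over the user's familiar words H_k instead of the frequent terms G, i.e. (reconstructed) IRM_k(w) = Σ_{h∈H_k} (freq(w, h) − f(w)·p_h)² / (f(w)·p_h), where f(w) is now the total co-occurrence of w with H_k. A hedged sketch reusing the chi_square function above:

    def irm(word, familiar_words, freq, p):
        """Interest Relevancy Measure for one user: the co-occurrence bias of
        `word` toward the user's familiar words H_k, computed like the
        chi-square measure but over H_k instead of the frequent terms G."""
        return chi_square(word, familiar_words, freq, p)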

24
IRM
IRM, proposed measure
  • If the value of IRM is large, word w_ij is
    relevant to the user's familiar words.
  • The word is relevant to the user's interests, so
    it is a keyword for the user.
  • Conversely, if the value of IRM is small, word
    w_ij is not specifically relevant to any of the
    familiar words.

25
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

26
Browsing support system
  • It is difficult to evaluate IRM objectively
    because the weight of words depends on a user's
    familiar words, and therefore varies among users.
  • Therefore, we evaluate IRM by constructing a Web
    browsing support system.
  • Web pages accessed by a user are monitored by a
    proxy server.
  • The count of each word is stored in a database (a
    sketch of this step follows below).
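A minimal sketch of this counting step, assuming a SQLite table and a naive tokenizer; the slides do not specify the actual storage or preprocessing.

    import re
    import sqlite3

    def record_page(db_path, user_id, page_text):
        """Increment per-user word counts for a page observed by the proxy."""
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS word_counts ("
            "user_id TEXT, word TEXT, count INTEGER, "
            "PRIMARY KEY (user_id, word))"
        )
        for word in re.findall(r"[a-z]+", page_text.lower()):
            conn.execute(
                "INSERT INTO word_counts (user_id, word, count) VALUES (?, ?, 1) "
                "ON CONFLICT(user_id, word) DO UPDATE SET count = count + 1",  # upsert (SQLite 3.24+)
                (user_id, word),
            )
        conn.commit()
        conn.close()

Words the user has seen frequently (high counts) then serve as the familiar word set H_k used by IRM.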

27
System architecture of the browsing support system
(architecture figure: the browser connects through the proxy server to the word-count database)
28
Sample screenshot
29
(No Transcript)
30
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

31
Evaluation
  • For the evaluation, ten people tried this system
    for more than one hour.
  • Three methods were implemented for comparison:
  • (I) word frequency
  • (II) tfidf
  • (III) IRM

32
Evaluation Result(1)
  • After using each system (blind), we asked the
    following questions on a 5-point Likert scale
    from 1 (not at all) to 5 (very much).
  • Q1: Does this system help you browse the Web?
  • (I) 2.8  (II) 3.2  (III) 3.2
  • Q2: Are the red-colored words (high-IRM words)
    interesting to you?
  • (I) 3.2  (II) 4.0  (III) 4.1
  • Q3: Are the interesting words colored red?
  • (I) 2.9  (II) 3.3  (III) 3.8
  • Q4: Are the blue-colored words (familiar words)
    interesting to you?
  • (I) 2.7  (II) 2.5  (III) 2.0
  • Q5: Are the interesting words colored blue?
  • (I) 2.7  (II) 2.5  (III) 2.4

(I) word frequency (II) tfidf (III) IRM
33
Evaluation Result(2)
  • After evaluating all three systems, we asked the
    following two questions.
  • Q6: Which one helps your browsing the most?
  • (I) 1 person  (II) 3  (III) 6
  • Q7: Which one detects your interests the most?
  • (I) 0 people  (II) 2  (III) 8
  • Overall, IRM detects words of interest to the
    user the best.

(I) word frequency (II) tfidf (III) IRM
34
Outline of the talk
  • Introduction
  • Context-based word weighting
  • Proposed measure
  • Previous work
  • IRM (Interest Relevance Measure)
  • System architecture
  • Evaluation
  • Conclusion

35
Conclusion
  • We developed a context-based word-weighting
    measure (IRM) based on the relevance (i.e., the
    co-occurrence) to a user's familiar words.
  • If a user is not familiar with the topic, he/she
    may think general words related to the topic are
    important.
  • On the other hand, if a user is familiar with the
    topic, he/she may think more detailed words are
    important.
  • We implemented IRM in a browsing support system
    and showed its effect.