Event Detection and Summarization in Weblogs with Temporal Collocations

Learn more at: http://www.lrec-conf.org

1
Event Detection and Summarization in Weblogs with
Temporal Collocations
  • Chun-Yuan Teng and Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University
  • Taipei, Taiwan
  • hhchen_at_csie.ntu.edu.tw

2
Outline
  • Motivation
  • Temporal collocation
  • Event detection and summarization using temporal
    collocations
  • Experiments
  • Datasets
  • Evaluation of event detection
  • Evaluation of event summarization
  • Conclusion

3
Motivation
  • Weblogs
  • containing abundant life experiences and public
    opinions toward different topics
  • highly sensitive to the events occurring in the
    real world
  • associated with the personal information of
    bloggers
  • Problem
  • How to know what bloggers write and discuss over
    time?
  • Event detection is fundamental

4
Google Trends
  • Google Trends
  • Plots the frequency of a word and the frequency of
    news over time
  • E.g., select the news with the highest frequency of
    "president"
  • Ambiguous peak
  • We don't know which president caused the peak for
    "president".

5
Collocations
  • A combination of words gives a specific meaning.
  • Collocation measures such as mean and variance,
    hypothesis testing, mutual information, etc. are
    used to model the relationship between terms.
  • Can we model collocations over time?

6
Temporal Collocation
  • Mutual Information
  • I(x,y) = log( P(x,y) / (P(x) P(y)) )
  • Temporal Mutual Information
  • I(x,y|t) = log( P(x,y|t) / (P(x|t) P(y|t)) )
  • P(x,y|t) denotes the probability of co-occurrence
    of terms x and y in timestamp t.
  • P(x|t) and P(y|t) denote the probabilities of x and
    y in timestamp t.
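The temporal mutual information above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that I(x,y|t) = log( P(x,y|t) / (P(x|t) P(y|t)) ) and that probabilities are estimated as the fraction of sentences at timestamp t containing the terms. The function name and the `sentences_by_time` layout are hypothetical.

```python
import math

def temporal_mutual_information(sentences_by_time, x, y, t):
    """Estimate I(x,y|t) from sentence-level counts at timestamp t.

    sentences_by_time maps a timestamp to a list of sentences, each
    sentence given as a set of terms (an assumed data layout).
    """
    sentences = sentences_by_time.get(t, [])
    n = len(sentences)
    if n == 0:
        return None
    count_x = sum(1 for s in sentences if x in s)
    count_y = sum(1 for s in sentences if y in s)
    count_xy = sum(1 for s in sentences if x in s and y in s)
    if count_xy == 0:
        return None  # the log of zero is undefined; treat as "no association"
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    return math.log(p_xy / (p_x * p_y))
```

Returning None for unseen pairs avoids taking the log of zero. Smoothing the counts would be another reasonable choice.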

7
Temporal Collocation
  • Change of Temporal Mutual Information
  • C(x,y,t1,t2) is the change of temporal mutual
    information of terms x and y in the time interval
    [t1, t2]
  • I(x,y|t1) and I(x,y|t2) are the temporal mutual
    information at timestamps t1 and t2, respectively

8
Event Detection
  • Identify the collocations resulting in events
  • Retrieve the descriptions of events

9
System Architecture
  • Pre-processing phase
  • parse the weblogs
  • retrieve the collocations
  • Event detection phase
  • detect the unusual peak of the change of temporal
    mutual information
  • identify the set of collocations resulting in an
    event in a specific time duration
  • Event summarization phase
  • extract the collocations related to the seed
    collocations found in a specific time duration

10
Pre-processing Phase
  • Retrieve the collocations from the sentences in
    blog posts
  • Propose the candidates within a window size
  • Remove those candidates containing stop-words or
    with low change of temporal mutual information
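The pre-processing steps can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, a tiny hand-written stop-word list, a window counted in tokens, and a dictionary of precomputed change values. All names and the default window size are hypothetical, since the slides do not give them.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # illustrative only

def candidate_collocations(sentence, window_size=3, stop_words=STOP_WORDS):
    """Propose ordered word pairs that co-occur within window_size tokens."""
    tokens = sentence.lower().split()
    pairs = []
    for i, left in enumerate(tokens):
        # Pair each token with the next (window_size - 1) tokens.
        for right in tokens[i + 1 : i + window_size]:
            if left in stop_words or right in stop_words or left == right:
                continue
            pairs.append((left, right))
    return pairs

def filter_by_change(pairs, change_of_mi, min_change):
    """Drop candidates whose change of temporal MI is small.

    change_of_mi maps a pair to its change value. Comparing the absolute
    value against min_change is an assumption of this sketch.
    """
    return [p for p in pairs if abs(change_of_mi.get(p, 0.0)) >= min_change]
```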

11
Event Detection Phase
  • Remove regular patterns using a seasonal index
  • Measure unusual peaks of temporal mutual
    information to detect plausible events
  • Change of temporal mutual information, MI2 - MI1:
    favors events with high frequency
  • Relative change of temporal mutual information,
    (MI2 - MI1)/MI1: favors events with low mutual
    information
  • MI1 and MI2 are the temporal mutual information at
    timestamps t1 and t2
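The two peak measures can be written out as a short sketch. It assumes each collocation has a list of temporal mutual information values, one per consecutive timestamp, and that a fixed threshold flags a peak. The seasonal-index step is left out because the slides do not say how the index is computed. The function names and the threshold are hypothetical.

```python
def change_of_mi(mi1, mi2):
    """Change of temporal mutual information: MI2 - MI1."""
    return mi2 - mi1

def relative_change_of_mi(mi1, mi2):
    """Relative change: (MI2 - MI1) / MI1. Undefined when MI1 is zero."""
    if mi1 == 0:
        return None
    return (mi2 - mi1) / mi1

def detect_peaks(mi_by_pair, score=relative_change_of_mi, threshold=1.0):
    """Return (pair, timestamp index, score) for every step whose score
    reaches the threshold, highest score first."""
    events = []
    for pair, series in mi_by_pair.items():
        for t in range(1, len(series)):
            s = score(series[t - 1], series[t])
            if s is not None and s >= threshold:
                events.append((pair, t, s))
    return sorted(events, key=lambda e: e[2], reverse=True)
```

Passing `score=change_of_mi` switches to the absolute measure. Note that the relative measure only behaves intuitively when MI1 is positive, because a negative MI1 flips the sign of the ratio.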
12
Event Summarization Phase
  • Select the collocations with the highest mutual
    information with the word w in a seed collocation
  • Place the seed collocation into a collocation
    network
  • Add the collocation having the highest mutual
    information
  • Compute the mutual information of the multiword
    collocations when a new collocation is added
  • Stop and return the words in the collocation
    network if the multiword mutual information is
    lower than a threshold
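The greedy growth of the collocation network might look like the sketch below. Two points are assumptions, because the slides do not define them: the multiword mutual information is taken as the average pairwise mutual information over all word pairs in the network, and the collocation that pushes the score below the threshold is not added. The keys of `pair_mi` are assumed to be alphabetically ordered word pairs, and all names are hypothetical.

```python
from itertools import combinations

def multiword_mi(words, pair_mi):
    """Assumed definition: mean pairwise MI over all pairs in the network."""
    pairs = list(combinations(sorted(words), 2))
    return sum(pair_mi.get(p, 0.0) for p in pairs) / len(pairs)

def summarize_event(seed, pair_mi, threshold):
    """Grow a collocation network from a seed collocation (a word pair)."""
    network = set(seed)
    while True:
        # Find the highest-MI collocation linking the network to a new word.
        best = None
        for (a, b), mi in pair_mi.items():
            if (a in network) != (b in network):  # exactly one word inside
                if best is None or mi > best[1]:
                    best = ((a, b), mi)
        if best is None:
            return network  # nothing left to add
        grown = network | set(best[0])
        if multiword_mi(grown, pair_mi) < threshold:
            return network  # stop before the score drops below the threshold
        network = grown
```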

13
A Collocation Network
14
Data Sets
  • ICWSM weblog data set
  • collected from May 1, 2006 through May 20, 2006
  • about 20 GB
  • 2,734,518 English weblog articles used for
    analysis
  • Gold standard
  • http://en.wikipedia.org/wiki/May_2006
  • The events listed in Wikipedia are not always
    complete, so we use recall rate for evaluation
  • The events listed in Wikipedia are not always
    discussed in weblogs, so we remove the events
    that are listed in Wikipedia but not referenced
    in the weblogs
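The evaluation protocol described above amounts to a recall computation over a filtered gold standard. A minimal sketch, assuming events are represented as comparable identifiers; the function and parameter names are hypothetical:

```python
def recall_against_gold(detected, gold, mentioned_in_weblogs):
    """Recall over gold-standard events that are also referenced in weblogs.

    All three arguments are sets of event identifiers.
    """
    usable_gold = gold & mentioned_in_weblogs  # drop events never blogged about
    if not usable_gold:
        return None
    return len(detected & usable_gold) / len(usable_gold)
```

Precision is not computed because the gold standard is incomplete: a detected event that is missing from Wikipedia is not necessarily a false alarm.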

15
Evaluation of Event Detection Phase
Recall rate: 75%
16
Performance of Event Detection Phase
17
Discussion
  • Change of MI (left) favors regular events and
    events with high frequency
  • Time: May 03
  • Feeling fell left
  • Relative change (right) favors persons or special
    events
  • Terrorists killed on May 3: zacarias moussaoui,
    parad mahajan
  • Best actress award at the Golden Globe Awards on May
    3: Geena Davis

18
Evaluation of Event Summarization
  • Method 1: Employ the highest temporal mutual
    information
  • Method 2: Utilize the highest product of temporal
    mutual information and change of temporal mutual
    information
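The difference between the two methods comes down to the ranking key. A small sketch, assuming each candidate collocation carries its temporal mutual information (`mi`) and its change of temporal mutual information (`change`). The field names are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the experiments.

```python
def rank_method1(candidates):
    """Method 1: rank by temporal mutual information only."""
    return sorted(candidates, key=lambda c: c["mi"], reverse=True)

def rank_method2(candidates):
    """Method 2: rank by the product of temporal MI and its change."""
    return sorted(candidates, key=lambda c: c["mi"] * c["change"], reverse=True)

# Made-up values: a pair that is always strongly associated but did not
# change, and a pair whose association rose sharply in the interval.
candidates = [
    {"pair": ("typhoon", "earthquake"), "mi": 4.0, "change": 0.1},
    {"pair": ("typhoon", "chanchu"), "mi": 3.5, "change": 2.0},
]
top_by_method1 = rank_method1(candidates)[0]["pair"]
top_by_method2 = rank_method2(candidates)[0]["pair"]
```

With these values, Method 1 ranks the steadily associated pair first, while Method 2 prefers the pair whose association changed during the interval.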

19
An Example of Event Retrieval
  • Typhoon Chanchu
  • Typhoon Chanchu appeared in the Pacific Ocean
    around May 10, then passed through the
    Philippines and China, causing disasters in
    these areas.

20
Event Summarization for Typhoon Chanchu Using
Method 1
21
Event Summarization for Typhoon Chanchu Using
Method 2
22
Some Observations
  • The appearance of Typhoon Chanchu cannot be
    found among the events listed in Wikipedia for May
    10.
  • We can identify the appearance of Typhoon Chanchu
    from collocations describing the typhoon's appearance,
    such as "typhoon named" and "typhoon eye".
  • Typhoon Chanchu's path can also be inferred
    from retrieved collocations such as
    "Philippine China" and "near China".
  • The responses of bloggers, such as "unexpected
    typhoon" and "8 typhoons", are also extracted.

23
Method 1 vs. Method 2
  • Method 1 shows more noise than Method 2.
  • The term "typhoon earthquake" is extracted by
    Method 1.
  • The term "typhoon earthquake" is not retrieved
    by Method 2 because it also considers the
    change of temporal mutual information.

24
Concluding Remarks
  • What we have done
  • Introduce temporal mutual information to capture
    term-term association over time in weblogs
  • Select the extracted collocation with unusual
    peak in terms of relative change of temporal
    mutual information to represent an event
  • Collect those collocations with the highest
    product of mutual information and change of
    temporal mutual information to summarize the
    specific event
  • Future work
  • Model the collocations over time and location
  • Model the relationship between the user-preferred
    usage of collocations and the profile of users

25
Thanks