From WebArchive to WebDigest: concepts and examples

Transcript and Presenter's Notes
1
From WebArchive to WebDigest: concepts and examples
  • Li Xiaoming
  • Peking University, China
  • APNG Camp 2007, Xi'an
  • August 28, 2007

With Huang Lianen and Yao Conglei
2
Preface
  • Yesterday morning, I gave an introduction to Web
    InfoMall, the Chinese WebArchive, at the Digital
    Archive Workshop.
  • I was told that many APNG Camp members were
    attending.
  • Yesterday afternoon, I gave another 15-minute
    presentation at the Digital Archive Workshop,
    about the issues of accessing the WebArchive.
  • I saw that no APNG Camp members were present.
  • So I have prepared my talk for the next 45
    minutes as follows.

3
Outline
  • A quick run through WebArchive
  • Just in case some of you were not there yesterday
    morning
  • A summary of potential ways of accessing (making
    use of) the WebArchive
  • To set the stage for the coming of WebDigest
  • WebDigest
  • Concepts and examples

I apologize in advance that some of the examples
are going to be shown in Chinese.
4
On the elapsing nature of web data
  • The average life cycle of a random web page is
    about 100 days.
  • .com shorter, .edu longer, etc.
  • 50% of currently viewable web pages will
    disappear in about 1 year.

The goal of Web InfoMall: fetch and archive as
many web pages as possible before they are gone.
5
The progress and status
  • On Jan 18, 2002, the first batch of web pages was
    archived.
  • About 1 million pages have been added per day
    since then.
  • As of today, Web InfoMall has accumulated over
    2.5 billion Chinese web pages.
  • Total online data volume: 30 TB, with an offline
    backup.

6
[Screenshot: Web InfoMall home page, 2007-08-27]
7
[Screenshot: an archived page of www.sina.com.cn]
8
[Screenshot: page captured on Jan 18, 2002]
The headquarters of Bin Laden was bombed.
9
[Screenshot]
The first air strike of the new year: the American
Air Force bombed the headquarters of Bin Laden.
10
[Screenshot]
11
APNG home page, 2002
12
APNG Camp page, 2002
13
Seriously, what's the use?
  • Preserving historical information before it's
    lost
  • Great opportunities for deep text mining
    (information, knowledge, etc.)
  • Providing access to previous information, much
    more conveniently than libraries, even if they
    have kept it.

Test: where was the second APNG Camp held?
14
Shallow accesses
  • Fetching a historical page: <url, date> → page
  • Browsing the previous web: start from <url, date>
    and follow the links (see the lookup sketch below)
  • Backward browsing, to see potential
    customers/collaborators
  • Full-text index to support general search
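To make the <url, date> access concrete, here is a minimal sketch of such a lookup against a toy in-memory store. The class and method names (ArchiveStore, add, fetch) are hypothetical illustrations, not the actual Web InfoMall interface.

```python
# A toy <url, date> lookup, not the actual Web InfoMall implementation.
from datetime import date

class ArchiveStore:
    """Maps each URL to its dated snapshots, kept sorted by capture date."""

    def __init__(self):
        self._snapshots = {}  # url -> list of (capture_date, page_bytes)

    def add(self, url, captured, page):
        versions = self._snapshots.setdefault(url, [])
        versions.append((captured, page))
        versions.sort(key=lambda v: v[0])

    def fetch(self, url, when):
        """Return the latest snapshot of `url` captured on or before `when`."""
        best = None
        for captured, page in self._snapshots.get(url, []):
            if captured <= when:
                best = page
            else:
                break
        return best

store = ArchiveStore()
store.add("http://www.sina.com.cn/", date(2002, 1, 18), b"<html>...</html>")
print(store.fetch("http://www.sina.com.cn/", date(2002, 6, 1)))  # latest capture up to that date
```

Browsing the previous web then amounts to repeating fetch, with the same date, on the links extracted from each returned page.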

15
Deep accesses: WebDigest
  • Temporal search, especially useful for studying
    historical events
  • Pattern mining (structures, named entities, etc.)
  • We see at least four meaningful layers of the web
    structure: page, host, organization, province. As
    graphs, what do they look like?
  • We may ask, for example: what were the top 100
    person names on the web during 2003?

16
Web structure studies
  • Four layers of interest
  • Page layer, e.g. http://net.pku.edu.cn/people/lxm.htm
  • Host layer, e.g. http://net.pku.edu.cn/
  • Organization layer, e.g. http://*.pku.edu.cn/
  • Province layer, the collection of organizations
    headquartered in the province
  • Clearly, given a snapshot of a country's web
    (pages), four directed graphs can be constructed,
    conceptually (see the layer-mapping sketch below).
  • We are interested in the shapes of the graphs and
    their relations.
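As an illustration of how page-level URLs collapse into the host and organization layers, here is a small sketch. The second-level-suffix list and the organization rule are simplifying assumptions for illustration; a real system would use a public-suffix list and registration data.

```python
# Sketch: map a page URL to its host and organization layers.
from urllib.parse import urlparse

# Assumed list of common Chinese second-level suffixes (simplified).
SECOND_LEVEL = {"edu.cn", "com.cn", "gov.cn", "org.cn", "net.cn", "ac.cn"}

def host_of(url):
    """Host layer: the network location of the URL."""
    return urlparse(url).netloc.lower()

def organization_of(url):
    """Organization layer: the registered domain, e.g. pku.edu.cn."""
    labels = host_of(url).split(".")
    suffix = ".".join(labels[-2:])
    keep = 3 if suffix in SECOND_LEVEL else 2
    return ".".join(labels[-keep:])

page = "http://net.pku.edu.cn/people/lxm.htm"
print(host_of(page))          # net.pku.edu.cn
print(organization_of(page))  # pku.edu.cn
```

A hyperlink between two pages then induces an edge between their hosts and between their organizations, which is how the higher-layer graphs can be built from the page graph.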

17
An experiment with the snapshot of February 2006
  • 800 million pages
  • SCC: strongly connected component
  • Given a web graph's computer representation, how
    do you determine this (bow-tie) shape efficiently?

18
How to figure it out ?
  • Representation of the 800M-node graph: adjacency
    list (100 GB)
  • Pick some seeds that are surely in the SCC
  • WFS (width-first search) forward, obtaining FS
  • WFS backward, obtaining BS
  • The intersection of FS and BS is the SCC
  • FS − SCC is OUT
  • BS − SCC is IN
  • WFS starting from the union of FS and BS, ignoring
    direction, obtains the WCC
  • Total − WCC is the DISKs (disconnected components)
  • WCC − SCC − IN − OUT is the Tendrils

What if we did not know the bow-tie shape at the
beginning? Can we discover different and
interesting shapes for the other three layers?
(A sketch of the decomposition above follows.)
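Below is a minimal in-memory sketch of the decomposition just described, assuming the adjacency list fits in a Python dict; the real 800M-node graph lived in a roughly 100 GB on-disk adjacency list and needs an external-memory traversal. `graph` maps each node to the nodes it links to, and `seed` is assumed to lie inside the SCC.

```python
# Sketch of the seed-based bow-tie decomposition on an in-memory graph.
from collections import deque

def bfs(adj, starts):
    """Width-first search from `starts` over adjacency mapping `adj`."""
    seen, queue = set(starts), deque(starts)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(graph, seed):
    # Build the reversed and undirected views of the adjacency list.
    reverse, undirected = {}, {}
    for u, outs in graph.items():
        for v in outs:
            reverse.setdefault(v, []).append(u)
            undirected.setdefault(u, []).append(v)
            undirected.setdefault(v, []).append(u)

    fs = bfs(graph, [seed])            # forward-reachable set FS
    bs = bfs(reverse, [seed])          # backward-reachable set BS
    scc = fs & bs                      # SCC = FS ∩ BS
    out, in_ = fs - scc, bs - scc      # OUT = FS − SCC, IN = BS − SCC
    wcc = bfs(undirected, fs | bs)     # undirected search gives the WCC
    tendrils = wcc - scc - in_ - out   # Tendrils = WCC − SCC − IN − OUT
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    disks = nodes - wcc                # DISKs = Total − WCC
    return scc, in_, out, tendrils, disks

g = {"a": ["b"], "b": ["a", "c"], "c": [], "d": ["a"], "e": ["f"], "f": []}
print(bowtie(g, "a"))  # SCC={a,b}, IN={d}, OUT={c}, Tendrils=∅, DISKs={e,f}
```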
19
WebDigest: looking for the 5 Ws
  • When
  • Time of an event, time of a report about an event
  • Where
  • Venue of an event, location of the publisher
  • Who
  • Not only persons, but also organizations
  • What
  • Planned events, breaking news
  • How
  • Good or bad

20
Example about persons
  • Problem 1: given a set of person names, find
    all web pages about each of them
  • Easy: a search engine will do
  • Not easy: what about two persons with the same
    name?
  • Problem 2: given a snapshot (say 1B pages) of the
    web, find the top N celebrities
  • Not easy: we don't even know who should be
    compared!
  • Problem 3: given a snapshot of the web, find all
    the people who were mentioned
  • Not easy: where to start?

These should be done efficiently.
21
Determine the top N
  • Brute-force approach
  • Analyze each page of the snapshot and extract
    every person name.
  • Compare the occurrences of each person and
    declare success!
  • It is not going to work!
  • Typically, analyzing a web page to extract names
    takes about 5 seconds; for 1 billion pages, some
    50,000 days (well over a century) would be needed!

22
Assumptions and observations
  • The top N must be famous people (celebrities), if
    N is not too big
  • For a celebrity, there are many web pages
    describing him/her, in terms of not only the name
    but also many other attributes
  • E.g., age, job, representative work, height,
    weight, birth place, ...
  • That information often occurs in certain common
    patterns
  • [Two example patterns in Chinese]
  • Of course, we don't have complete knowledge of
    the patterns and relations in advance.

23
Extended DIPRE (Sergey Brin, 1998)
  • Dual Iterative Pattern Relation Expansion
  • Use these two kinds of incomplete information to
    iteratively enrich each other and discover more
    and more celebrities
  • Start from some seed persons and their known
    relations; search to get related pages and
    discover patterns from those pages
  • [Worked example in Chinese]

24
DIPRE
  • With these patterns, search again to find pages
    containing other attributes
  • [Worked example in Chinese]
  • In the next round, the new relation [in Chinese]
    is used to get some new pages, and probably to
    discover a new pattern, such as [pattern in
    Chinese]; the new pattern then helps us to find
    new relations, and so on (see the sketch of this
    loop below)
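Here is a minimal sketch of the DIPRE loop described on these two slides, assuming a toy corpus of page strings. The helpers `search`, `learn_patterns`, and `apply_patterns`, and the naive pattern shape (the short text between the two names), are illustrative simplifications, not the actual extended-DIPRE implementation.

```python
# Sketch of the DIPRE loop: relations and patterns enrich each other.
import re

def search(corpus, text):
    """Return the pages (plain strings here) that contain `text`."""
    return [page for page in corpus if text in page]

def learn_patterns(corpus, pairs):
    """For each known (person, attribute) pair, keep the text between them."""
    patterns = set()
    for person, attribute in pairs:
        for page in search(corpus, person):
            m = re.search(re.escape(person) + r"(.{1,20}?)" + re.escape(attribute), page)
            if m:
                patterns.add(m.group(1))  # the "middle" of the pattern
    return patterns

def apply_patterns(corpus, patterns):
    """Use the learned middles to extract new (person, attribute) pairs."""
    found = set()
    for middle in patterns:
        # Assumed name/attribute lengths (2-4 and 2-10 word characters).
        regex = re.compile(r"(\w{2,4})" + re.escape(middle) + r"(\w{2,10})")
        for page in corpus:
            found.update(regex.findall(page))
    return found

def dipre(corpus, seed_pairs, rounds=3):
    """Iterate: pairs -> patterns -> more pairs -> more patterns -> ..."""
    pairs = set(seed_pairs)
    for _ in range(rounds):
        pairs |= apply_patterns(corpus, learn_patterns(corpus, pairs))
    return pairs
```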

25
[Screenshot: top 100 person names, July 2006]
26
Why can you claim they are really the top 100?
  • Proof: it suffices to show that if somebody
    belongs to the top 100, then he will be caught by
    the above process
  • If he belongs to the top 100, then he must have a
    lot of occurrences
  • Some of the occurrences must follow some common
    pattern
  • The common pattern will be discovered sooner or
    later in the iteration
  • Then he will be discovered when the pattern is
    issued for search
  • Once discovered, his number of occurrences can be
    obtained and then compared with others'.

27
Who was mentioned on the Web?
  • Not necessarily famous people, so we cannot assume
    many occurrences and common patterns. As such,
    DIPRE is not applicable in this case.
  • Instead, we use the small-world idea as an
    assumption: if someone occurs in one web page, the
    probability that he co-occurs with other persons
    in a page is very high, and the co-occurrence
    graph has a small diameter.
  • Thus, we start from some people (seeds), get
    related pages, extract new names, and use them to
    get new pages, and so on (see the sketch below)
  • (this way, only pages containing names are
    processed)
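A sketch of this seed-expansion crawl follows. The helpers `search_pages` and `extract_person_names` are hypothetical stand-ins for the archive's full-text index and a named-entity extractor, respectively.

```python
# Sketch: breadth-first expansion over the name co-occurrence graph.
from collections import deque

def expand_names(seed_names, search_pages, extract_person_names,
                 max_names=2_000_000):
    """Grow the set of known names from the seeds, page by page."""
    known = set(seed_names)
    queue = deque(seed_names)
    processed_pages = set()
    while queue and len(known) < max_names:
        name = queue.popleft()
        for page_id, text in search_pages(name):
            if page_id in processed_pages:
                continue  # only pages containing known names are processed
            processed_pages.add(page_id)
            for new_name in extract_person_names(text):
                if new_name not in known:
                    known.add(new_name)
                    queue.append(new_name)
    return known
```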

28
The program ran for 7 days and obtained over 2.1
million person names
  • It got 2.1 million names by the time the seeds
    reached 1,500
  • Among pages containing names, there are on average
    32 names per page.
  • There is one page containing 11,480 names!

29
[Screenshot, 2006: the page containing the largest
number of person names, 11,480]
30
Summary: WebDigest
  • WebArchive implies an unprecedented and exciting
    potential, not only for computer scientists but
    also for social scientists
  • If we ask Google who the top 100 popular persons
    of 2004 were, it will not be able to answer
  • Google basically provides information contained
    in one current page, but here we need information
    collectively contained in a set of previous
    pages.
  • WebDigest is built on top of WebArchive and tries
    to answer such questions.

31
Thanks for your attention, and you are welcome to
collaborate on WebDigest
  • lxm@pku.edu.cn