Title: From WebArchive to WebDigest: concepts and examples
1 From WebArchive to WebDigest: concepts and examples
- Li Xiaoming
- Peking University, China
- APNG Camp 2007, Xi'an
- August 28, 2007
Huang Lianen, Yao Conglei
2 Preface
- Yesterday morning, I gave an introduction to Web InfoMall, the Chinese WebArchive, at the Digital Archive Workshop
- I was told that many APNG Camp members were attending.
- Yesterday afternoon, I gave another 15-minute presentation at the Digital Archive Workshop, about the issues of accessing the WebArchive
- I saw no APNG Camp members present.
- So I have prepared my talk for the next 45 minutes as follows.
3 Outline
- A quick run through WebArchive
- Just in case some of you were not there yesterday morning
- A summary of potential ways of accessing (making use of) the WebArchive
- To set the stage for the WebDigest ideas to come
- WebDigest
- Concepts and examples
I apologize in advance that some of the examples will be shown in Chinese.
4 On the ephemeral nature of web data
- The average life cycle of a random web page is about 100 days.
- Pages under .com live shorter, .edu longer, etc.
- About 50% of the currently viewable web pages will disappear within about 1 year.
The goal of Web InfoMall: fetch and archive as many web pages as possible before they are gone.
5 Progress and status
- On Jan 18, 2002, the first batch of web pages was archived.
- About 1 million pages have been added per day since then.
- As of today, Web InfoMall has accumulated over 2.5 billion Chinese web pages.
- The total online data volume is about 30 TB, with an offline backup.
6 [Screenshot: the InfoMall home page, 2007-08-27]
7 [Screenshot: an archived copy of www.sina.com.cn]
8 [Archived front page from Jan 18, 2002]
The headquarters of Bin Laden was bombed.
9 [Archived news page]
The first air strike of the new year: the American Air Force bombed the headquarters of Bin Laden.
10 [Archived page, title in Chinese]
11 APNG home page, 2002
12 APNG Camp page, 2002
13 Seriously, what's the use?
- Preserving historical information before it is lost
- Great opportunities for deep text mining (information, knowledge, etc.)
- Providing access to previous information, much more conveniently than libraries, even if they have kept it
Test: where was the second APNG Camp held?
14 Shallow accesses
- Fetching a historical page: <url, date> → page (a minimal sketch of such a lookup follows this list)
- Browsing the previous web: start from <url, date> and follow the links
- Backward browsing, to see potential customers/collaborators
- A full-text index to support general search
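To make the <url, date> access concrete, here is a minimal Python sketch of what such a lookup could look like; WebArchiveStore and its methods are hypothetical names for illustration, not the actual Web InfoMall interface.

from datetime import date

class WebArchiveStore:
    """Toy in-memory stand-in for the archive: url -> list of (crawl_date, html)."""
    def __init__(self):
        self.snapshots = {}

    def add(self, url, crawl_date, html):
        self.snapshots.setdefault(url, []).append((crawl_date, html))
        self.snapshots[url].sort()

    def fetch(self, url, when):
        """<url, date> -> page: return the latest snapshot taken on or before `when`."""
        earlier = [(d, h) for d, h in self.snapshots.get(url, []) if d <= when]
        return earlier[-1][1] if earlier else None

store = WebArchiveStore()
store.add("http://www.sina.com.cn/", date(2002, 1, 18), "<html>...</html>")
print(store.fetch("http://www.sina.com.cn/", date(2002, 6, 1)))  # returns the Jan 18, 2002 snapshot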
15 Deep accesses: WebDigest
- Temporal search, especially useful for studying historical events
- Pattern mining (structures, named entities, etc.)
- We see at least four meaningful layers of web structure: page, host, organization, and province. As graphs, what do they look like?
- We may ask: what were the top 100 person names on the web during 2003?
16 Web structure studies
- Four layers of interest
- Page layer, e.g. http://net.pku.edu.cn/people/lxm.htm
- Host layer, e.g. http://net.pku.edu.cn/
- Organization layer, e.g. http://*.pku.edu.cn/ (all hosts under pku.edu.cn)
- Province layer: the collection of organizations headquartered in a province
- Clearly, given a snapshot of a country's web (pages), four directed graphs can be constructed, conceptually; the coarser layers are obtained by collapsing the page-level link graph (see the sketch after this list).
- We are interested in the shapes of these graphs and their relations.
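A rough Python sketch of how the coarser layers can be derived from page-level links; host_of and org_of below are simplifying assumptions for illustration only (real organization or province assignment needs more than string manipulation).

from urllib.parse import urlsplit

def host_of(url):
    # e.g. http://net.pku.edu.cn/people/lxm.htm -> net.pku.edu.cn
    return urlsplit(url).netloc.lower()

def org_of(url):
    # Crude organization key, assumed for illustration:
    # keep the last three labels of the host, e.g. net.pku.edu.cn -> pku.edu.cn.
    return ".".join(host_of(url).split(".")[-3:])

def aggregate(page_edges, key):
    """Collapse page-level links (src_url, dst_url) into a coarser directed graph."""
    edges = set()
    for src, dst in page_edges:
        a, b = key(src), key(dst)
        if a != b:                 # drop links that stay inside the same unit
            edges.add((a, b))
    return edges

page_links = [("http://net.pku.edu.cn/people/lxm.htm", "http://www.apng.org/")]
print(aggregate(page_links, host_of))   # host-layer edge: net.pku.edu.cn -> www.apng.org
print(aggregate(page_links, org_of))    # organization-layer edge: pku.edu.cn -> www.apng.org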
17 An experiment with the snapshot of February 2006
- 800 million pages
- SCC: strongly connected component
- Given a computer representation of the web graph, how do you determine this bow-tie shape efficiently?
18 How to figure it out?
- Representation of the 800M-node graph: adjacency lists (about 100 GB)
- Pick some seed pages that are certain to be in the SCC
- WFS (width-first search) forward, obtaining the forward-reachable set FS
- WFS backward, obtaining the backward-reachable set BS
- The intersection of FS and BS is the SCC
- FS minus SCC is OUT
- BS minus SCC is IN
- WFS starting from the union of FS and BS, ignoring edge direction, obtains the WCC (weakly connected component)
- The total node set minus the WCC gives the disconnected components
- The WCC minus SCC, IN, and OUT gives the Tendrils (a sketch of the whole procedure follows)
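A minimal in-memory Python sketch of this decomposition; the function and variable names are mine, and the real computation of course ran over a 100 GB on-disk adjacency list rather than Python dictionaries.

from collections import deque

def wfs(adj, seeds):
    """Width-first search over an adjacency mapping; returns the set of reached nodes."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(adj, radj, nodes, seeds):
    """Decompose a directed graph into bow-tie regions, given seeds known to lie in the SCC.
    adj / radj: forward and reverse adjacency dicts; nodes: the full set of node ids."""
    fs = wfs(adj, seeds)                     # everything reachable from the seeds
    bs = wfs(radj, seeds)                    # everything that can reach the seeds
    scc = fs & bs
    out, in_ = fs - scc, bs - scc
    undirected = {u: set(adj.get(u, ())) | set(radj.get(u, ())) for u in nodes}
    wcc = wfs(undirected, fs | bs)           # weakly connected component around the core
    disconnected = nodes - wcc
    tendrils = wcc - scc - in_ - out
    return scc, in_, out, tendrils, disconnected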
What if we did not know the bow-tie shape in the beginning? Can we discover different and interesting shapes for the other three layers?
19 WebDigest: looking for the 5 Ws
- When
- Time of an event, time of a report about an event
- Where
- Venue of an event, location of the publisher
- Who
- Not only persons, but also organizations
- What
- Planned events, breaking news
- How
- Good or bad
20 Examples about persons
- Problem 1: given a set of person names, find all web pages about each of them
- Easy: a search engine will do
- Not easy: what about two persons with the same name?
- Problem 2: given a snapshot (say 1B pages) of the web, find the top N celebrities
- Not easy: we do not even know who should be compared!
- Problem 3: given a snapshot of the web, find all the people who were mentioned
- Not easy: where to start?
These should be done efficiently.
21 Determining the top N
- Brute-force approach
- Analyze each page of the snapshot and extract every person name.
- Compare the occurrence counts of each person and declare success!
- It is not going to work!
- Typically, analyzing a web page to extract names takes about 5 seconds; for 1 billion pages, that is roughly 58,000 days!
22 Assumptions and observations
- The top N must be famous people (celebrities), if N is not too big
- For a celebrity, there are many web pages describing him or her, in terms of not only the name but also many other attributes
- E.g., age, job, representative work, height, weight, birthplace, ...
- Such information often occurs in certain common patterns
- (Example patterns shown in Chinese)
- Of course, we do not have complete knowledge of the patterns and relations in advance.
23 Extended DIPRE (Sergey Brin, 1998)
- Dual Iterative Pattern-Relation Expansion
- Use these two kinds of incomplete information to iteratively enrich each other and discover more and more celebrities
- Start from some seed persons and their known relations; search to get related pages and discover patterns from those pages
- (Example in Chinese: a seed person with a known attribute, and the textual pattern discovered from pages mentioning them)
24 DIPRE (continued)
- With these patterns, search again to find pages containing other attributes
- (Example in Chinese: a pattern of the form <name, attribute> matched against new pages to yield a new relation)
- In the next round, the new relation is used to get new pages and probably to discover a new pattern; the new pattern then helps us find new relations, and so on (a sketch of the loop follows this list).
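A highly simplified Python sketch of the iteration, assuming a hypothetical search() helper that queries the archive's full-text index and returns text snippets; real DIPRE patterns also keep prefix/suffix context and URL information.

import re

def dipre(seed_relations, search, max_rounds=3):
    """Dual Iterative Pattern-Relation Expansion (sketch).
    seed_relations: set of (person, attribute) string pairs.
    search(query): hypothetical helper returning snippets of archived pages containing the query."""
    relations, patterns = set(seed_relations), set()
    for _ in range(max_rounds):
        # Step 1: from known relations, find pages and induce the text joining name and attribute.
        for person, attr in list(relations):
            for snippet in search(person + " " + attr):
                m = re.search(re.escape(person) + r"(.{1,10})" + re.escape(attr), snippet)
                if m:
                    patterns.add(m.group(1))
        # Step 2: from known patterns, find pages and extract new (person, attribute) relations.
        for middle in patterns:
            for snippet in search(middle):
                m = re.search(r"(\S+)" + re.escape(middle) + r"(\S+)", snippet)
                if m:
                    relations.add((m.group(1), m.group(2)))
    return relations, patterns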
25 [Result: the top 100 person names, July 2006]
26 Why can you claim they are really the top 100?
- Proof: it suffices to show that if somebody belongs to the top 100, then he will be caught by the above process
- If he belongs to the top 100, then he must have a lot of occurrences
- Some of the occurrences must follow some common pattern
- That common pattern will be discovered sooner or later in the iteration
- Then he will be discovered when the pattern is issued as a search
- Once discovered, his number of occurrences can be obtained and then compared with others.
27 Who was mentioned on the Web?
- These are not necessarily famous people, so we cannot assume many occurrences and common patterns; as such, DIPRE is not applicable in this case.
- Instead, we use the small-world idea as an assumption: if someone appears on one web page, the probability that he co-occurs with another person on some page is very high, and the co-occurrence graph has a small diameter.
- Thus, we start from some people (seeds), get related pages, extract new names, and use them to get new pages, and so on (a sketch of this expansion follows this list).
- (This way, only pages containing names are processed.)
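A minimal Python sketch of this seed-driven expansion; search() and extract_names() are hypothetical stand-ins for the archive's full-text lookup and the name extractor.

from collections import deque

def expand_names(seed_names, search, extract_names, max_names=2_100_000):
    """Seed-driven discovery of person names (sketch).
    search(name): hypothetical lookup returning (url, text) pairs of pages mentioning `name`.
    extract_names(text): hypothetical extractor returning person names found in the text."""
    known = set(seed_names)
    queue = deque(seed_names)
    seen_pages = set()
    while queue and len(known) < max_names:
        name = queue.popleft()
        for url, text in search(name):
            if url in seen_pages:          # only pages containing names are ever processed
                continue
            seen_pages.add(url)
            for new_name in extract_names(text):
                if new_name not in known:
                    known.add(new_name)
                    queue.append(new_name)
    return known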
28 The program ran for 7 days and obtained over 2.1 million person names
- It reached 2.1 million names when the seed set grew to 1,500
- Among the pages containing names, there are 32 names per page on average.
- There is one page containing 11,480 names!
29 [Screenshot: the 2006 page containing the most person names: 11,480]
30 Summary: WebDigest
- The WebArchive implies unprecedented and exciting potential, not only for computer scientists but also for social scientists
- If we ask Google who the top 100 popular persons of 2004 were, it will not be able to answer
- Google basically provides information contained in one current page, but here we need information collectively contained in a set of previous pages.
- WebDigest is built on top of the WebArchive and tries to answer such questions.
31 Thanks for your attention, and you are welcome to collaborate on WebDigest!