Title: From WebArchive to WebDigest: concepts and examples
1 From WebArchive to WebDigest: concepts and examples
- Li Xiaoming
- Peking University, China
- APNG Camp 2007, Xi'an
- August 28, 2007
Huang Lianen, Yao Conglei
2 Preface
- Yesterday morning, I gave an introduction to Web InfoMall, the Chinese WebArchive, at the Digital Archive Workshop
- I was told that many APNG Camp members were attending.
- Yesterday afternoon, I gave another 15-minute presentation at the Digital Archive Workshop, about the issues of accessing the WebArchive
- I saw no APNG Camp members present.
- So I have prepared my talk for the next 45 minutes as follows.
3 Outline
- A quick run through WebArchive
- Just in case some of you were not there yesterday morning
- A summary of potential ways of accessing (making use of) the WebArchive
- To set the stage for the WebDigest ideas to come
- WebDigest
- Concepts and examples
I apologize in advance that some of the examples will be shown in Chinese.
4 On the ephemeral nature of web data
- The average life cycle of a random web page is about 100 days.
- Pages under .com live shorter, .edu longer, etc.
- About 50% of the currently viewable web pages will disappear within about 1 year.
The goal of Web InfoMall: fetch and archive as many web pages as possible before they are gone.
5 Progress and status
- On Jan 18, 2002, the first batch of web pages was archived.
- About 1 million pages have been added per day since then.
- As of today, Web InfoMall has accumulated over 2.5 billion Chinese web pages.
- The total online data volume is about 30 TB, with an offline backup.
6 [Screenshot: the InfoMall home page, 2007-08-27]
7 [Screenshot: an archived copy of www.sina.com.cn]
8 [Archived front page from Jan 18, 2002]
The headquarters of Bin Laden was bombed.
9 [Archived news page]
The first air strike of the new year: the American Air Force bombed the headquarters of Bin Laden.
10 [Archived page, title in Chinese]
11 APNG home page, 2002
12 APNG Camp page, 2002
13 Seriously, what's the use?
- Preserving historical information before it is lost
- Great opportunities for deep text mining (information, knowledge, etc.)
- Providing access to previous information, much more conveniently than libraries, even if they have kept it
Test: where was the second APNG Camp held?
14 Shallow accesses
- Fetching a historical page: <url, date> → page (a minimal sketch of such a lookup follows this list)
- Browsing the previous web: start from <url, date> and follow the links
- Backward browsing, to see potential customers/collaborators
- A full-text index to support general search
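To make the <url, date> access concrete, here is a minimal Python sketch of what such a lookup could look like; WebArchiveStore and its methods are hypothetical names for illustration, not the actual Web InfoMall interface.

from datetime import date

class WebArchiveStore:
    """Toy in-memory stand-in for the archive: url -> list of (crawl_date, html)."""
    def __init__(self):
        self.snapshots = {}

    def add(self, url, crawl_date, html):
        self.snapshots.setdefault(url, []).append((crawl_date, html))
        self.snapshots[url].sort()

    def fetch(self, url, when):
        """<url, date> -> page: return the latest snapshot taken on or before `when`."""
        earlier = [(d, h) for d, h in self.snapshots.get(url, []) if d <= when]
        return earlier[-1][1] if earlier else None

store = WebArchiveStore()
store.add("http://www.sina.com.cn/", date(2002, 1, 18), "<html>...</html>")
print(store.fetch("http://www.sina.com.cn/", date(2002, 6, 1)))  # returns the Jan 18, 2002 snapshot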
15 Deep accesses: WebDigest
- Temporal search, especially useful for studying historical events
- Pattern mining (structures, named entities, etc.)
- We see at least four meaningful layers of web structure: page, host, organization, and province. As graphs, what do they look like?
- We may ask: what were the top 100 person names on the web during 2003?
16 Web structure studies
- Four layers of interest
- Page layer, e.g. http://net.pku.edu.cn/people/lxm.htm
- Host layer, e.g. http://net.pku.edu.cn/
- Organization layer, e.g. http://*.pku.edu.cn/ (all hosts under pku.edu.cn)
- Province layer: the collection of organizations headquartered in a province
- Clearly, given a snapshot of a country's web (pages), four directed graphs can be constructed, conceptually; the coarser layers are obtained by collapsing the page-level link graph (see the sketch after this list).
- We are interested in the shapes of these graphs and their relations.
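A rough Python sketch of how the coarser layers can be derived from page-level links; host_of and org_of below are simplifying assumptions for illustration only (real organization or province assignment needs more than string manipulation).

from urllib.parse import urlsplit

def host_of(url):
    # e.g. http://net.pku.edu.cn/people/lxm.htm -> net.pku.edu.cn
    return urlsplit(url).netloc.lower()

def org_of(url):
    # Crude organization key, assumed for illustration:
    # keep the last three labels of the host, e.g. net.pku.edu.cn -> pku.edu.cn.
    return ".".join(host_of(url).split(".")[-3:])

def aggregate(page_edges, key):
    """Collapse page-level links (src_url, dst_url) into a coarser directed graph."""
    edges = set()
    for src, dst in page_edges:
        a, b = key(src), key(dst)
        if a != b:                 # drop links that stay inside the same unit
            edges.add((a, b))
    return edges

page_links = [("http://net.pku.edu.cn/people/lxm.htm", "http://www.apng.org/")]
print(aggregate(page_links, host_of))   # host-layer edge: net.pku.edu.cn -> www.apng.org
print(aggregate(page_links, org_of))    # organization-layer edge: pku.edu.cn -> www.apng.org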
17 An experiment with the snapshot of February 2006
- 800 million pages
- SCC: strongly connected component
- Given a computer representation of the web graph, how do you determine this bow-tie shape efficiently?
18 How to figure it out?
- Representation of the 800M-node graph: adjacency lists (about 100 GB)
- Pick some seed pages that are certain to be in the SCC
- WFS (width-first search) forward, obtaining the forward-reachable set FS
- WFS backward, obtaining the backward-reachable set BS
- The intersection of FS and BS is the SCC
- FS minus SCC is OUT
- BS minus SCC is IN
- WFS starting from the union of FS and BS, ignoring edge direction, obtains the WCC (weakly connected component)
- The total node set minus the WCC gives the disconnected components
- The WCC minus SCC, IN, and OUT gives the Tendrils (a sketch of the whole procedure follows)
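A minimal in-memory Python sketch of this decomposition; the function and variable names are mine, and the real computation of course ran over a 100 GB on-disk adjacency list rather than Python dictionaries.

from collections import deque

def wfs(adj, seeds):
    """Width-first search over an adjacency mapping; returns the set of reached nodes."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(adj, radj, nodes, seeds):
    """Decompose a directed graph into bow-tie regions, given seeds known to lie in the SCC.
    adj / radj: forward and reverse adjacency dicts; nodes: the full set of node ids."""
    fs = wfs(adj, seeds)                     # everything reachable from the seeds
    bs = wfs(radj, seeds)                    # everything that can reach the seeds
    scc = fs & bs
    out, in_ = fs - scc, bs - scc
    undirected = {u: set(adj.get(u, ())) | set(radj.get(u, ())) for u in nodes}
    wcc = wfs(undirected, fs | bs)           # weakly connected component around the core
    disconnected = nodes - wcc
    tendrils = wcc - scc - in_ - out
    return scc, in_, out, tendrils, disconnected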
What if we did not know the bow-tie shape in the beginning? Can we discover different and interesting shapes for the other three layers?
19 WebDigest: looking for the 5 Ws
- When
- Time of an event, time of a report about an event
- Where
- Venue of an event, location of the publisher
- Who
- Not only persons, but also organizations
- What
- Planned events, breaking news
- How
- Good or bad
20 Examples about persons
- Problem 1: given a set of person names, find all web pages about each of them
- Easy: a search engine will do
- Not easy: what about two persons with the same name?
- Problem 2: given a snapshot (say 1B pages) of the web, find the top N celebrities
- Not easy: we do not even know who should be compared!
- Problem 3: given a snapshot of the web, find all the people who were mentioned
- Not easy: where to start?
These should be done efficiently.
21 Determining the top N
- Brute-force approach
- Analyze each page of the snapshot and extract every person name.
- Compare the occurrence counts of each person and declare success!
- It is not going to work!
- Typically, analyzing a web page to extract names takes about 5 seconds; for 1 billion pages, that is roughly 58,000 days!
22 Assumptions and observations
- The top N must be famous people (celebrities), if N is not too big
- For a celebrity, there are many web pages describing him or her, in terms of not only the name but also many other attributes
- E.g., age, job, representative work, height, weight, birthplace, ...
- Such information often occurs in certain common patterns
- (Example patterns shown in Chinese)
- Of course, we do not have complete knowledge of the patterns and relations in advance.
23 Extended DIPRE (Sergey Brin, 1998)
- Dual Iterative Pattern-Relation Expansion
- Use these two kinds of incomplete information to iteratively enrich each other and discover more and more celebrities
- Start from some seed persons and their known relations; search to get related pages and discover patterns from those pages
- (Example in Chinese: a seed person with a known attribute, and the textual pattern discovered from pages mentioning them)
24 DIPRE (continued)
- With these patterns, search again to find pages containing other attributes
- (Example in Chinese: a pattern of the form <name, attribute> matched against new pages to yield a new relation)
- In the next round, the new relation is used to get new pages and probably to discover a new pattern; the new pattern then helps us find new relations, and so on (a sketch of the loop follows this list).
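A highly simplified Python sketch of the iteration, assuming a hypothetical search() helper that queries the archive's full-text index and returns text snippets; real DIPRE patterns also keep prefix/suffix context and URL information.

import re

def dipre(seed_relations, search, max_rounds=3):
    """Dual Iterative Pattern-Relation Expansion (sketch).
    seed_relations: set of (person, attribute) string pairs.
    search(query): hypothetical helper returning snippets of archived pages containing the query."""
    relations, patterns = set(seed_relations), set()
    for _ in range(max_rounds):
        # Step 1: from known relations, find pages and induce the text joining name and attribute.
        for person, attr in list(relations):
            for snippet in search(person + " " + attr):
                m = re.search(re.escape(person) + r"(.{1,10})" + re.escape(attr), snippet)
                if m:
                    patterns.add(m.group(1))
        # Step 2: from known patterns, find pages and extract new (person, attribute) relations.
        for middle in patterns:
            for snippet in search(middle):
                m = re.search(r"(\S+)" + re.escape(middle) + r"(\S+)", snippet)
                if m:
                    relations.add((m.group(1), m.group(2)))
    return relations, patterns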
25 [Result: the top 100 person names, July 2006]
26 Why can you claim they are really the top 100?
- Proof: it suffices to show that if somebody belongs to the top 100, then he will be caught by the above process
- If he belongs to the top 100, then he must have a lot of occurrences
- Some of the occurrences must follow some common pattern
- That common pattern will be discovered sooner or later in the iteration
- Then he will be discovered when the pattern is issued as a search
- Once discovered, his number of occurrences can be obtained and then compared with others.
27 Who was mentioned on the Web?
- These are not necessarily famous people, so we cannot assume many occurrences and common patterns; as such, DIPRE is not applicable in this case.
- Instead, we use the small-world idea as an assumption: if someone appears on one web page, the probability that he co-occurs with another person on some page is very high, and the co-occurrence graph has a small diameter.
- Thus, we start from some people (seeds), get related pages, extract new names, and use them to get new pages, and so on (a sketch of this expansion follows this list).
- (This way, only pages containing names are processed.)
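A minimal Python sketch of this seed-driven expansion; search() and extract_names() are hypothetical stand-ins for the archive's full-text lookup and the name extractor.

from collections import deque

def expand_names(seed_names, search, extract_names, max_names=2_100_000):
    """Seed-driven discovery of person names (sketch).
    search(name): hypothetical lookup returning (url, text) pairs of pages mentioning `name`.
    extract_names(text): hypothetical extractor returning person names found in the text."""
    known = set(seed_names)
    queue = deque(seed_names)
    seen_pages = set()
    while queue and len(known) < max_names:
        name = queue.popleft()
        for url, text in search(name):
            if url in seen_pages:          # only pages containing names are ever processed
                continue
            seen_pages.add(url)
            for new_name in extract_names(text):
                if new_name not in known:
                    known.add(new_name)
                    queue.append(new_name)
    return known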
28 The program ran for 7 days and obtained over 2.1 million person names
- It reached 2.1 million names when the seed set grew to 1,500
- Among the pages containing names, there are 32 names per page on average.
- There is one page containing 11,480 names!
29 [Screenshot: the 2006 page containing the most person names: 11,480]
30 Summary: WebDigest
- The WebArchive implies unprecedented and exciting potential, not only for computer scientists but also for social scientists
- If we ask Google who the top 100 popular persons of 2004 were, it will not be able to answer
- Google basically provides information contained in one current page, but here we need information collectively contained in a set of previous pages.
- WebDigest is built on top of the WebArchive and tries to answer such questions.
31 Thanks for your attention, and you are welcome to collaborate on WebDigest!