Title: Web Characterization
1Web Characterization
- Week 11
- LBSC 690
- Information Technology
2The Why of the Web (in 1995)
- Affordable storage
- 300,000 words/$
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate last mile bandwidth
- 1 second/screen
- Display capability
- 10% of US population
- Effective search capabilities
- Lycos, Yahoo
3Defining the Web
- HTTP, HTML, or URL?
- Static, dynamic or streaming?
- Public, protected, or internal?
- Content or behavior?
4Number of Web Sites
5Discussion Topic: What's a Web Site?
- OCLC counted any server at port 80
- Misses many servers at other ports
- Some servers host unrelated content
- Geocities
- Some content requires specialized servers
- rtsp
6Crawling the Web
7Link Structure of the Web
8Web Crawl Challenges
- Discovering islands and peninsulas
- Duplicate and near-duplicate content
- 30-40% of total content
- Server and network loads
- Dynamic content generation
- Link rot
- Changes at 1% per week
- Temporary server interruptions
9Duplicate Detection
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, ...)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad)
- Related content (e.g., translated)
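As a rough illustration of the syntactic and semantic levels above, the following Python sketch compares two pages. It assumes a SHA-256 hash for the "identical bytes" test and word shingles with Jaccard overlap for the "similar content" test. These are common choices, not necessarily what any particular crawler uses, and the function names and the 0.5 threshold are hypothetical.

```python
import hashlib


def byte_fingerprint(data: bytes) -> str:
    """Hash of the raw bytes; equal hashes indicate identical bytes."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences (shingles) from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_pair(doc_a: str, doc_b: str, threshold: float = 0.5) -> str:
    """Label a pair of pages; the threshold is an illustrative assumption."""
    if byte_fingerprint(doc_a.encode("utf-8")) == byte_fingerprint(doc_b.encode("utf-8")):
        return "identical bytes"
    if jaccard(shingles(doc_a), shingles(doc_b)) >= threshold:
        return "similar content"
    return "different"


if __name__ == "__main__":
    page = "the web is large and it keeps growing every single week"
    with_banner = "BUY NOW " + page
    print(classify_pair(page, page))
    print(classify_pair(page, with_banner))
```

Related content such as translations would not be caught by either test; that level needs language-aware comparison, which this sketch does not attempt.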
10Robots Exclusion Protocol
- Requires voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server's top level
- Indicate which directories not to crawl
- Exclusion by document (in HTML head)
- Not implemented by all crawlers
- <meta name="robots" content="noindex,nofollow">
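A minimal sketch of how a compliant crawler might honor both forms of exclusion, using only the Python standard library. The robots.txt contents, the example.org URLs, the "ExampleCrawler" agent name, and the helper names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt listing directories that should not be crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str, agent: str = "ExampleCrawler") -> bool:
    """Exclusion by site: ask whether this agent may fetch this URL."""
    return parser.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Exclusion by document: collect directives from <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def meta_directives(html: str) -> set:
    """Return the set of robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    print(may_crawl(rules, "http://www.example.org/index.html"))
    print(may_crawl(rules, "http://www.example.org/private/notes.html"))
    print(meta_directives('<head><meta name="robots" content="noindex,nofollow"></head>'))
```

Nothing here enforces compliance: the crawler has to choose to call these checks, which is the sense in which the protocol is voluntary.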
11Hands On: The Internet Archive
- alexa.com Web crawls since 1997
- http://archive.org
- Check out the CLIS Web site from 1998!
- Check out the history of your favorite site
12Discussion Point
- Can we save everything?
- Should we?
- Do people have a right to remove things?
13The Deep Web
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
15Content of the Deep Web
16Deep Web
- 60 Deep Sites Exceed Surface Web by 40 Times
17Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
18Global Internet Users
Native speakers, Global Reach projection for 2004 (as of Sept. 2003)
19Global Internet Users
Web Pages
Native speakers, Global Reach projection for 2004 (as of Sept. 2003)
20World Trade in 2001
Source: World Trade Organization
21European Web Content
Source: European Commission, "Evolution of the Internet and the World Wide Web in Europe," 1997
22Blogs
- 18.9 million weblogs tracked
- Doubling in size approximately every 5 months
- Consistent doubling over the last 36 months
23Blue: Mainstream Media
Red: Blog
Challenge: Fight, or Embrace?
24Daily Posting Volume
- 1.2 million legitimate posts/day
- Spam posts marked in red
- On average, an additional 5.8% are spam posts
- Some spam spikes as high as 18%
- Events labeled on the chart: Katrina, London Bombings, Justice O'Connor, Live 8 Concerts, Deep Throat Revealed, Kryptonite Lock Controversy, Newsweek Koran, Schiavo Dies, US Election Day, Super Bowl, Indian Ocean Tsunami
26A Web of Speech?
                                              Web in 1995              Speech in 2005
Storage (words per $)                         300K                     1.5M
Internet Backbone (simultaneous users)        250K                     30M
Last Mile (download time)                     1 second (no graphics)   Streaming
Display Capability (computers/US population)  10%                      100%
Search Systems                                Lycos, Yahoo
27Rethinking the Spoken Word
- Speech is better for some things than writing
- Spoken bits are as persistent as written bits
- Storage cost is 80 times that of text
- Disk cost falls by a factor of 80 in 16 years
- If speech is searchable, we will keep lots of it
28A Little Math
- Collectable spoken words: 10 Tw/day
- 1 billion users × 100 words/min × 200 min/day / 2
- Compressed speech: 2 words/kiloByte
- (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage: 5 PetaBytes/day
29A Little Math
- Collectable spoken words: 10 Tw/day
- 1 billion users × 100 words/min × 200 min/day / 2
- Compressed speech: 2 words/kiloByte
- (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage: 5 PetaBytes/day
- Storage array sales > 5 PB/day
- 457 PB in 2Q 2005 (increasing 59% per year)
- $22/person/year (decreasing at 31%/year)
Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
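The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative. Two assumptions go beyond the slide: a kiloByte is taken as 1,000 bytes, and 2Q 2005 is taken as 91 days in order to turn quarterly sales into a daily rate.

```python
def speech_storage_estimate() -> dict:
    """Reproduce the slide's back-of-the-envelope numbers."""
    users = 1_000_000_000          # 1 billion users
    words_per_min = 100            # speaking rate
    minutes_per_day = 200          # talk time per user per day

    # Collectable spoken words per day (the slide divides by 2).
    words_per_day = users * words_per_min * minutes_per_day / 2

    # Compressed speech density: words per second over kilobytes per second.
    words_per_sec = words_per_min / 60
    kilobytes_per_sec = 6.5 / 8    # 6.5 kilobits/sec at 8 bits per byte
    words_per_kb = words_per_sec / kilobytes_per_sec

    # Required storage, assuming 1 kB = 1,000 bytes and 1 PB = 1e15 bytes.
    petabytes_per_day = words_per_day / words_per_kb * 1_000 / 1e15

    # Storage array sales: 457 PB in 2Q 2005, assumed to span 91 days.
    sales_pb_per_day = 457 / 91

    return {
        "words_per_day": words_per_day,          # 1e13, i.e. 10 trillion
        "words_per_kb": words_per_kb,            # about 2.05
        "petabytes_per_day": petabytes_per_day,  # about 4.9
        "sales_pb_per_day": sales_pb_per_day,    # about 5.0
    }


if __name__ == "__main__":
    for name, value in speech_storage_estimate().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions the unrounded requirement comes out slightly below 5 PB/day, and daily storage array sales slightly above it, which matches the comparison the slide draws.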
30Human History
Oral Tradition
Writing
31Hands On: Speech on the Web
- audio.search.yahoo.com
- blinkx.com
- ocw.mit.edu
- podcasts.net
32 View
Select
Listen
Print
Bookmark
Save
Purchase
Subscribe
Delete
Copy / paste
Forward
Quote
Reply
Link
Cite
Mark up
Tag
Organize
Publish
Type
Edit
35Estimating Authority from Links
(Figure: link diagram with one node labeled Hub and two nodes labeled Authority)
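One well-known way to turn the hub and authority idea into numbers is the iterative HITS scheme: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The slide does not name an algorithm, so the Python sketch below is only an assumed illustration; the function names, the toy graph, and the fixed iteration count are hypothetical.

```python
def _normalize(scores: dict) -> dict:
    """Scale scores so the sum of squares is 1 (unchanged if all are zero)."""
    norm = sum(value * value for value in scores.values()) ** 0.5
    if norm == 0:
        return dict(scores)
    return {page: value / norm for page, value in scores.items()}


def hits(links: dict, iterations: int = 20):
    """Return (hub, authority) score dicts for a graph {page: [linked pages]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link here.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in set(targets):
                authority[target] += hub[source]
        authority = _normalize(authority)

        # Hub: sum of authority scores of the pages linked to.
        hub = {
            page: sum(authority[target] for target in set(links.get(page, ())))
            for page in pages
        }
        hub = _normalize(hub)

    return hub, authority


if __name__ == "__main__":
    toy_graph = {"hub": ["a1", "a2"], "other": ["a1"]}
    hub_scores, authority_scores = hits(toy_graph)
    print(sorted(authority_scores, key=authority_scores.get, reverse=True))
```

In the toy graph, the page linked from both sources ends up with the highest authority, and the page that links to both authorities ends up with the highest hub score.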
36Collecting Click Streams
- Browsing histories are easily captured
- Make all links initially point to a central site
- Encode the desired URL as a parameter
- Build a time-annotated transition graph for each user
- Cookies identify users (when they use the same machine)
- Redirect the browser to the desired page
- Reading time is correlated with interest
- Can be used to build individual profiles
- Used to target advertising by doubleclick.com
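The redirect technique described above can be sketched in Python as follows. The central site address (tracker.example.org), the "url" parameter name, and the ClickLog class are hypothetical, and the user ID stands in for whatever a cookie would supply. The sketch only shows the URL encoding and the bookkeeping for a time-annotated transition graph, not a working web server.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical central site that every link initially points to.
REDIRECT_BASE = "http://tracker.example.org/r"


def wrap_link(target_url: str) -> str:
    """Encode the desired URL as a parameter of a link to the central site."""
    return REDIRECT_BASE + "?" + urlencode({"url": target_url})


def unwrap_link(redirect_url: str) -> str:
    """Recover the desired URL so the browser can be redirected to it."""
    query = parse_qs(urlparse(redirect_url).query)
    return query["url"][0]


class ClickLog:
    """Per-user transition graph annotated with time between clicks."""

    def __init__(self):
        self.last_page = {}                   # user_id -> (url, timestamp)
        self.transitions = defaultdict(list)  # user_id -> [(from, to, seconds)]

    def record(self, user_id: str, redirect_url: str, timestamp: float) -> str:
        """Log one click and return the URL to redirect the browser to."""
        target = unwrap_link(redirect_url)
        if user_id in self.last_page:
            previous_url, previous_time = self.last_page[user_id]
            # Time since the previous click approximates reading time.
            self.transitions[user_id].append(
                (previous_url, target, timestamp - previous_time)
            )
        self.last_page[user_id] = (target, timestamp)
        return target


if __name__ == "__main__":
    log = ClickLog()
    log.record("cookie-123", wrap_link("http://example.org/a"), 0.0)
    log.record("cookie-123", wrap_link("http://example.org/b"), 30.0)
    print(log.transitions["cookie-123"])
```

The elapsed time between two clicks is only a proxy for reading time, since the user may have been interrupted or may have left the page open.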
37Search Engine Query Logs
A: Southeast Asia (Dec 27, 2004)
B: Indonesia (Mar 29, 2005)
C: Pakistan (Oct 10, 2005)
D: Hawaii (Oct 16, 2006)
E: Indonesia (Aug 8, 2007)
F: Peru (Aug 16, 2007)
38Search Engine Query Logs
- http://hannu.biz/aolsearch/
39AOL User 4417749
40Gaining Access to Observations
- Observe public behavior
- Hypertext linking, publication, citing, ...
- Policy protection
- EU: Privacy laws
- US: Privacy policies, FTC enforcement
- Statistical assurance of privacy
- Distributed architecture
- Model and mitigate privacy risks
41(Figure: bar chart of Reading Time (seconds), axis 0 to 180, by Rating (No Interest, Low Interest, Moderate Interest, High Interest) for Full Text Articles (Telecommunications); bar labels 58, 50, 43, 32)
42More Complete Observations
- User selects an article
- Interpretation: Summary was interesting
- User quickly prints the article
- Interpretation: They want to read it
- User selects a second article
- Interpretation: Another interesting summary
- User scrolls around in the article
- Interpretation: Parts with high dwell time and/or repeated revisits are interesting
- User stops scrolling for an extended period
- Interpretation: User was interrupted
43(Figure: chart by Rating (No Interest, Low Interest, Moderate Interest, High Interest) for Abstracts (Pharmaceuticals); value labels 55, 42, 51, 52)
44Critical Issues
- Protecting privacy
- What absolute assurances can we provide?
- How can we make remaining risks understood?
- Scalable rating servers
- Is a fully distributed architecture practical?
- Non-cooperative users
- How can the effect of spamming be limited?