Title: Characteristics of Streaming Media Stored on the Web
1 Characteristics of Streaming Media Stored on the Web
Mingzhe Li, Mark Claypool, Robert Kinicki and James Nichols
ACM Transactions on Internet Technology (TOIT) (Accepted for Publication, probably 2005)
2 Introduction (1 of 2)
- Improvements to Internet enable users to stream from Web browsers
- Across national and cultural boundaries
- Web users expect point and click to stream
- 2001, RealNetworks says 350,000 hours [1]
- 2002, CAIDA says streaming is significant fraction of traffic
- Going to increase with cellular networks
- Concern drives new protocols, routers, etc. to deal with traffic better
3 Introduction (2 of 2)
- Much work characterizes streaming applications to better understand them
- Unfortunately, little shows what current streams stored on Web look like
- Previous study in 1997 [19]
- Looked at every video on the Web
- Found Internet could not support streaming
- RealPlayer and Media Player not yet created
- In 1985, papers by Ousterhout et al. [21] studied characteristics of files
- Fundamental in designing new file system
- → Need study of streaming media stored on the Web to help research today
4 Investigation (1 of 2)
- What are the most popular streaming media products?
- Previous studies 12 show very different results
- Earlier, prevalence of MPEG, AVI, QuickTime made it difficult for newcomers
- What is the ratio of streaming audio versus streaming video?
- Audio has lower bitrate cap (voice, music) than video
- Can give current bitrate expectations
- Are media durations long-tailed?
- Long-tailed can contribute to self-similarity
- Self-similar traffic difficult to manage
5 Investigation (2 of 2)
- What are typical streaming media target bitrates?
- Direct impact on network traffic
- Provides insight into frame resolution, frame rates, color depth
- What fraction of each streaming codec is in use?
- Codecs determine compression efficiency
- Knowledge of codec prevalence suggests how fast improvements are incorporated
6 Focus
- Focus on commercial products
- Big 3: Media Player, RealPlayer, QuickTime
- Other studies looked at server side or one client
- This study broader
- Have been p2p studies, but p2p not streamed (mostly)
- Instead downloaded, as is file transfer
- Build specialized crawler, crawl over 17 million URLs from different starting points, and analyze about 30 thousand clips
7 Teasers
- Volume and relative amount increased since 1997
- Proprietary most prevalent
- RealPlayer 1st, Media Player 2nd
- Most clips short, with long-tailed duration
- Encoded at low-resolution, less than current monitors can handle
- Work useful for:
- Selecting clip workloads
- Generating streaming models
8 Outline
- Introduction (done)
- Methodology
- Analysis
- Sampling Issues
- Conclusions
9 Methodology (Mini-Outline)
- Media Crawler
- Starting Pages
- Measurement
10 Media Crawler
- Modify Larbin Web crawler
- Recursively traverses URLs
- Avoid loops by caching previously visited URLs
- Identify streaming media based on protocol type
- Ex: mms://, rtsp://
- Also examine HTTP extensions
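The crawler steps above (recursive traversal, loop avoidance by remembering visited URLs, and media detection by protocol or extension) can be sketched as follows. This is a minimal illustration, not the modified Larbin code: the extension list, the function names, and the `get_links` callback are assumptions made for the example, and only the mms and rtsp protocols come from the slide.

```python
from collections import deque
from posixpath import splitext
from urllib.parse import urlsplit

# Protocols named on the slide.
STREAMING_SCHEMES = {"mms", "rtsp"}
# Assumed list of streaming file extensions served over HTTP (illustrative only).
STREAMING_EXTENSIONS = {".rm", ".ram", ".asf", ".asx", ".wmv", ".wma", ".mov"}


def is_streaming_url(url):
    """Return True if the URL looks like streaming media by scheme or extension."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme in STREAMING_SCHEMES:
        return True
    if scheme in ("http", "https"):
        return splitext(parts.path)[1].lower() in STREAMING_EXTENSIONS
    return False


def crawl(start_url, get_links):
    """Breadth-first traversal that skips URLs it has already visited.

    get_links is a hypothetical callback returning the URLs linked from a page.
    """
    visited = set()
    media = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # cached from an earlier visit, so loops terminate
        visited.add(url)
        if is_streaming_url(url):
            media.append(url)
        else:
            queue.extend(get_links(url))
    return media
```

Passing a function that looks pages up in an in-memory dictionary is enough to try the traversal without touching the network.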
11 Starting Pages
- Wanted international and popular
- International: chose 10 most wired countries
- Allow for cross-cultural analysis
- If Nielsen gave no additional info, chose domestic newspaper as starting point
- USA: chose 7 popular themes
- Allow for cross-content analysis
- Feb 13, 2003, crawl 1 million URLs from each
- Took 4 to 24 hours, based on RTT
12 Measurement of Content Characteristics
- Use specialized tools to access each Media URL
- Collect encoding, bitrate, duration, size,
- Tools built from SDK, use player core
- RealNetworks
- RealAnalyzer, TestPlay (could not do levels)
- Microsoft Media
- Media Analyzer, Wmprop (could do levels)
- MPlayer
- Open source (could not do bitrate)
13 Outline
- Introduction (done)
- Methodology (done)
- Analysis
- Aggregate analysis
- Commercial products
- Video
- Audio
- Codec
- Sampling Issues
- Conclusions
14 Aggregate Analysis (1 of 3)
- Remove duplicates, giving about 11 million unique URLs
- About 54,000 were streaming
- In 1997, about 25 million URLs
- About 22,000 were streaming
- Extrapolating
- → Today, about 15 million total
- → Increase from 0.09% to 0.47%
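As a quick check of the arithmetic behind the two fractions above, the streaming share can be recomputed from the counts on the slide. The function name is an assumption for illustration. Because the slide's counts are rounded ("about"), the results should land near the quoted 0.09 and 0.47 (read as percentages) rather than match them exactly.

```python
def streaming_percent(streaming_urls, total_urls):
    """Return streaming URLs as a percentage of all URLs examined."""
    return 100.0 * streaming_urls / total_urls


# Rounded counts taken from the slide.
pct_1997 = streaming_percent(22_000, 25_000_000)  # 1997 study
pct_2003 = streaming_percent(54_000, 11_000_000)  # this crawl
```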
15 Aggregate Analysis (2 of 3)
Some heavy hitters, more so than typical Web servers
16 Aggregate Analysis (3 of 3)
- Real almost ½ of all streaming content
- In 1997, MPEG, AVI, QuickTime were all, but now only 10% combined
- MP3 is most popular non-proprietary format
18 Commercial Product Analysis
- Run custom tools on commercial product clips
- Of original 39,000 only about 29,000 valid
- 50% cannot find specified file
- 25% cannot connect to server
- 10% authorization failure
- Can be from playlist
- But 97% only 1 clip
19 Live versus Pre-Recorded
- Most pre-recorded
- 98% is pre-recorded, 2% live
20 Percentage of Audio and Video
- More RealAudio than MP3 Audio
- Proportionally less WSM is audio
- Almost no QuickTime is audio
21 Duration
- In 1997, 90% only 45 seconds or less
- Still, today much shorter than T.V. show or movie
22 Self-Similar Analysis (1 of 2)
Definitive test: Is tail flat?
Looks flat, but that is not good enough [31]
23 Self-Similar Analysis (2 of 2)
- Measure curve of tail (1/16th of distro, others same)
- Curve defined as 3 point estimate, take derivative
- Estimate Pareto (long-tailed) slope α
- Used aest tool
- Generate 1000 samples from Pareto with α
- Each sample has same number of points, n, as original
- Calculate curvature of each sample tail, take the mean
- Calculate difference (d) between the mean and original
- Count number out of 1000 that differ by d
- 495 (video) and 498 (audio), about ½
- Cannot reject null-hypothesis → May be long-tailed
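The Monte Carlo procedure above can be sketched roughly as follows. Several choices here are assumptions, not the authors' implementation: the tail index is estimated with a Hill estimator standing in for the aest tool, curvature is taken as the difference between two secant slopes through three points of the log-log complementary distribution tail, and a simulated sample "differs by d" when its curvature is at least d away from the simulated mean. All function names are hypothetical, and the input is assumed to hold positive values with well over a handful of points.

```python
import math
import random


def tail_curvature(values, tail_fraction=1 / 16):
    """Three-point curvature of the log-log complementary distribution tail."""
    xs = sorted(values)
    n = len(xs)
    k = max(3, int(n * tail_fraction))  # number of points in the tail
    # Sorted point i (0-based) has empirical P[X >= x_i] = (n - i) / n.
    pts = [(math.log10(xs[i]), math.log10((n - i) / n)) for i in range(n - k, n)]
    (x0, y0), (x1, y1), (x2, y2) = pts[0], pts[len(pts) // 2], pts[-1]
    if x1 == x0 or x2 == x1:
        raise ValueError("tied tail values; curvature undefined")
    # Difference of the two secant slopes: zero for a perfectly straight tail.
    return (y2 - y1) / (x2 - x1) - (y1 - y0) / (x1 - x0)


def hill_alpha(values, k):
    """Hill estimate of the Pareto tail index from the k largest values."""
    xs = sorted(values)
    threshold = xs[-k - 1]
    return k / sum(math.log(x / threshold) for x in xs[-k:])


def curvature_test(values, trials=1000, tail_fraction=1 / 16, seed=0):
    """Return (alpha, count): count of simulated curvatures at least d from their mean."""
    rng = random.Random(seed)
    n = len(values)
    k = max(3, int(n * tail_fraction))
    alpha = hill_alpha(values, k)
    observed = tail_curvature(values, tail_fraction)
    # Curvature on log-log axes does not depend on scale, so unit-scale Pareto suffices.
    simulated = [
        tail_curvature([rng.paretovariate(alpha) for _ in range(n)], tail_fraction)
        for _ in range(trials)
    ]
    mean = sum(simulated) / trials
    d = abs(observed - mean)
    count = sum(1 for c in simulated if abs(c - mean) >= d)
    return alpha, count
```

Under this reading, a count that is a large share of the trials (such as roughly half) means the observed curvature is unremarkable among Pareto samples, so the long-tailed hypothesis is not rejected.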
25 Video Encoded Bitrate
In 1997, 1% stream for modem, 50% for broadband, 20% for T1
- Said, modem could not support streaming
Note, today, broadband still not targeted
26 Streams Encoded Per Clip
Audio is one stream
Media Scaling will be difficult!
Note, earlier study [15] found Real at 65%
27 Aspect Ratios
Very uniform, but a few odd-balls 30% above or below
Take product for size (next)
28 Video Resolution
- Most much smaller than typical monitors (1024 x 768 would be 786,432 pixels)
- Room to grow!
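The size measure used here is the product of width and height, as the previous slide notes. A small sketch of that arithmetic follows. The 320 x 240 clip is an assumed example resolution, not a figure from the study, and the function names are illustrative.

```python
def pixel_count(width, height):
    """Frame size as the product of width and height, in pixels."""
    return width * height


MONITOR = (1024, 768)  # monitor resolution quoted on the slide


def fraction_of_monitor(width, height, monitor=MONITOR):
    """Share of the monitor's pixels that a clip of the given size would cover."""
    return pixel_count(width, height) / pixel_count(*monitor)


monitor_pixels = pixel_count(*MONITOR)   # 786432, matching the slide
example = fraction_of_monitor(320, 240)  # assumed clip size, under a tenth of the monitor
```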
30 Audio Encoded Bitrates
- Most for modems, but 10% for broadband
- In 1999, 100% found for modems
- Will likely increase (MP3 128 kbps), but has a cap
31 Video Codecs
v8 buffers differently than v9
- Newest versions, v9, still not deployed much
- Useful as snapshot in time
32 Outline
- Introduction (done)
- Methodology (done)
- Analysis (done)
- Sampling Issues
- Conclusions
33 Sampling Issues
- In 1997, could analyze all on Web
- Today, impractical
- Would take 16 years to crawl and analyze clips
- Is 17 million large enough sample?
- Is it possible to obtain same results with fewer starting points?
- Is it possible to obtain same results with fewer than 1 million URLs per starting point?
- How does sampling affect distributions?
- How does choice of starting point affect distribution?
34 Percentage of Media versus URLs
Took 200k from each, build set
Overall, above 400k from each is stable → ½ million
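One way to read the stability check above is: take the first k URLs from each starting point, pool them, and see how the percentage of media URLs changes as k grows. The sketch below follows that reading. The data layout (a list of booleans per starting point, in crawl order) and the function name are assumptions made for illustration.

```python
def media_percent_by_sample_size(crawls, sizes):
    """Percentage of media URLs when only the first `size` URLs per start are kept.

    crawls maps a starting point to a list of booleans in crawl order,
    where True marks a streaming media URL (hypothetical layout).
    """
    result = {}
    for size in sizes:
        total = 0
        media = 0
        for flags in crawls.values():
            prefix = flags[:size]  # first `size` URLs from this starting point
            total += len(prefix)
            media += sum(prefix)
        result[size] = 100.0 * media / total if total else 0.0
    return result
```

When the percentages for successive sizes stop changing much, the smaller sample can be treated as sufficient for this statistic.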
35 Duration of Video for Number of URLs
Can get away with far fewer and have same distribution of durations
36 Media Type versus Starting Points
9 starting points sufficient
37 Duration for Number of Starting Points
38 Media Type in USA versus International
- International similar
- May be because cross-cultural Web
39 Duration for USA and Non-USA
40 Summary
- Many researchers worry about volume increase of video
- Video characterizations based on old data
- Current data on media stored on Web
- Crawled 17 million URLs, analyzed 30k clips
41 Conclusions
- Streaming media increased 600% in past 5 years
- Real Media 1st, Microsoft Media 2nd
- Audio and video about equal
- Vast majority pre-recorded (not live)
- Most targets still for modem
- Potential to be large since monitor resolutions much larger than video
42 Future Work?
43 Future Work
- Correlate to actual data streamed
- Congestion responsiveness
- P2P
- Future study (now 1.5 years old!)