Supporting Content-Addressable Caching with CZIP Compression

1
Supporting Content-Addressable Caching with CZIP
Compression
  • KyoungSoo Park, Sunghwan Ihm, Mic Bowman and
    Vivek Pai
  • Princeton University
  • Intel Research

2
Content-Based Naming (CBN)
  • Naming scheme based on its content
  • Name = one-way hash(content)
  • Hashing function: MD5, SHA-1, etc.
  • Rabin's fingerprint for chunk detection
  • Redundancy elimination
  • Network-traffic/storage systems
  • Research/commercial systems
  • Special-purpose systems
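
As a minimal sketch of content-based naming, the name of a piece of data can be computed as a one-way hash of its bytes. The function name content_name is hypothetical and used only for illustration; hashlib is Python's standard hashing module.

```python
import hashlib

def content_name(data: bytes, algorithm: str = "sha1") -> str:
    """Return a content-based name: the hex digest of the bytes themselves."""
    return hashlib.new(algorithm, data).hexdigest()

# Identical content always maps to the same name, wherever it is stored.
a = content_name(b"same bytes")
b = content_name(b"same bytes")
c = content_name(b"different bytes")
print(a == b)  # identical content, identical name
print(a == c)  # different content, different name
```

Because the name depends only on the content, two copies of the same chunk found in different files or on different servers can be recognized as redundant without comparing the bytes directly.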

3
Where Can CBN be Applied?
  • Similar file distribution
  • Linux distribution mirror
  • DVD ISO contains all CD ISOs
  • Virtual machine image migration
  • Base OS takes up majority of content
  • httpd VM vs. httpd+mysqld VM
  • Uncacheable Web content
  • Some dynamic content doesn't change

4
Contribution of This Work
  • Generic CBN tool
  • Easy to build new systems
  • Easy to upgrade existing non-CBN systems
  • CZIP compression + CZIP-aware apps
  • Can be used on existing platforms
  • Provides benefit to non-CZIP apps
  • Demonstrate sample systems
  • Reduces FC6 mirror memory footprint by half
  • Compression speed comparable to GZIP's
  • 2x throughput for CZIP-aware Apache
  • 4x origin server BW reduction for CZIP-aware CDN

5
CZIP Compression
  • Compression scheme like GZIP, BZIP2
  • Export CBN information in the header

[Figure: CZIP and UNCZIP conversion, with the CZIP header]

6
CZIP Header
  • Header = global attributes + chunk info
  • Global attributes
  • One-way hash function (SHA-1/MD5)
  • Chunk data compression (GZIP/BZIP2)
  • Convergent encryption (on/off)
  • Header CRC, File Hash, etc.
  • Chunk information
  • Content hash, start offset, chunk size
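
The header fields listed above can be pictured as a small data structure. The sketch below is an assumption for illustration only: the class names, field names, and the fixed 4096-byte chunking are hypothetical, it does not reproduce the real CZIP on-disk format, and it omits the header CRC.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChunkInfo:
    content_hash: str   # content-based name of the chunk
    start_offset: int   # where the chunk starts in the original file
    chunk_size: int     # chunk length in bytes

@dataclass
class CzipHeader:
    # Global attributes (hypothetical field names)
    hash_function: str = "sha1"        # or "md5"
    chunk_compression: str = "gzip"    # or "bzip2"
    convergent_encryption: bool = False
    file_hash: str = ""
    # Per-chunk information
    chunks: List[ChunkInfo] = field(default_factory=list)

def build_header(data: bytes, chunk_size: int = 4096) -> CzipHeader:
    """Describe `data` as a list of fixed-sized chunks named by SHA-1."""
    header = CzipHeader(file_hash=hashlib.sha1(data).hexdigest())
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        header.chunks.append(
            ChunkInfo(hashlib.sha1(chunk).hexdigest(), start, len(chunk))
        )
    return header

header = build_header(b"a" * 10000)
for info in header.chunks:
    print(info.start_offset, info.chunk_size, info.content_hash[:8])
```

In this example the first two chunks hold identical bytes, so they share one content hash; a CBN-aware reader could fetch or store that chunk once.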

7
Deployment Scenario
  • CZIP-aware server

[Figure: CZIP-aware server with a CBN cache. Client A fetches file1.cz, whose header lists chunk hashes xyzlo5g and asdfghk (Chunks A and B); Client B fetches file2.cz, whose header lists xyzlo5g, asdfghk, and qoiertty (Chunks A, B, and C).]

8
Deployment Scenario
  • CZIP-aware client-side proxy

[Figure: CZIP-aware client-side proxy with a CBN cache, placed between Clients A and B and the server. file1.cz lists chunk hashes xyzlo5g and asdfghk (Chunks A and B); file2.cz lists xyzlo5g, asdfghk, and qoiertty (Chunks A, B, and C).]
  • X-SHA-1 field helps CZIP-aware server
  • Browser cache can support CBN too!
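
The deployment scenario can be sketched as a cache keyed by chunk hash. Everything below is a hypothetical illustration, not the CZIP implementation: the CbnCache class, the fetch callback, and the in-memory origin dictionary are assumptions.

```python
import hashlib
from typing import Callable, Dict, List

class CbnCache:
    """Cache that stores chunks under their content hash."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_hash: str, fetch: Callable[[str], bytes]) -> bytes:
        if chunk_hash in self._chunks:
            self.hits += 1
            return self._chunks[chunk_hash]
        self.misses += 1
        data = fetch(chunk_hash)
        # The name doubles as an integrity check on the fetched bytes.
        if hashlib.sha1(data).hexdigest() != chunk_hash:
            raise ValueError("chunk does not match its content hash")
        self._chunks[chunk_hash] = data
        return data

def fetch_file(chunk_hashes: List[str], cache: CbnCache,
               fetch: Callable[[str], bytes]) -> bytes:
    """Rebuild a file from the chunk hashes listed in its header."""
    return b"".join(cache.get(h, fetch) for h in chunk_hashes)

# Hypothetical origin holding three chunks; file2 extends file1 by one chunk.
chunk_a, chunk_b, chunk_c = b"A" * 100, b"B" * 100, b"C" * 100
origin = {hashlib.sha1(c).hexdigest(): c for c in (chunk_a, chunk_b, chunk_c)}
file1 = [hashlib.sha1(c).hexdigest() for c in (chunk_a, chunk_b)]
file2 = [hashlib.sha1(c).hexdigest() for c in (chunk_a, chunk_b, chunk_c)]

cache = CbnCache()
fetch_file(file1, cache, origin.__getitem__)   # Client A
fetch_file(file2, cache, origin.__getitem__)   # Client B
print(cache.hits, cache.misses)
```

When Client B requests the second file, only the chunk that the first file did not contain has to come from the origin; the two shared chunks are served from the cache even though the file names differ.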
9
Compressibility
  • Fedora Core 6 ISOs/ All files/ Wikipedia DB

[Figure: data compression ratio (0 to 1) for CZIP plain, CZIP gzip, CZIP bzip2, GZIP, and BZIP2 on FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB), and Wikipedia_DB.tar (7.9 GB)]

10
Compression speed
  • On Pentium D 2.8GHz with 4GB memory

[Figure: compression time; labeled values 29,004 secs, 3,151 secs, and 3,964 secs]

11
Virtual Machine Images
  • Server consolidation/management
  • Much redundancy among similar VMs
  • Xen FC4 base image (X)
  • X + httpd (Y) / Y + mysqld (Z)
  • Investigating content overlap over
  • Chunk size
  • Chunking methods
  • Rabin's fingerprint vs. fixed-sized
  • After extensive use

12
Chunk Size / Chunking Methods
  • Compare three VM images
  • Base: Xen FC4 image / Apache: Base + httpd
  • Both: Apache + mysqld

[Figure panels: Rabin's fingerprint; fixed-sized chunking]
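
To illustrate why the chunking method matters, the following sketch compares fixed-sized chunking with content-defined chunking after a one-byte insertion. It is not Rabin's fingerprint: a simple polynomial rolling hash stands in for it, and the window size, mask, and function names are assumptions chosen for illustration.

```python
import hashlib
import random
from typing import List

def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 16,
                           mask: int = 0x3F) -> List[bytes]:
    """Cut where a rolling hash of the last `window` bytes has zero low bits."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + byte) % mod                  # add the newest byte
        if i + 1 - start >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_chunks(old: List[bytes], new: List[bytes]) -> int:
    """Count chunks of `new` whose content hash already appears in `old`."""
    old_names = {hashlib.sha1(c).hexdigest() for c in old}
    return sum(1 for c in new if hashlib.sha1(c).hexdigest() in old_names)

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = b"!" + original   # one byte inserted at the front

fixed_shared = shared_chunks(fixed_chunks(original), fixed_chunks(edited))
cdc_shared = shared_chunks(content_defined_chunks(original),
                           content_defined_chunks(edited))
print("fixed-sized chunks reused:", fixed_shared)
print("content-defined chunks reused:", cdc_shared)
```

With fixed-sized chunking, the inserted byte shifts every later boundary, so the old chunks no longer line up. Content-defined boundaries depend only on nearby bytes, so they fall back into the same places shortly after the edit.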
13
Real VM Images
EC1-EC5: VMs based on Xen FC-4 + standard tools
Used daily by five different engineers for three weeks
14
Dynamic Web Pages
  • Observed the front page of these sites
  • Google News
  • CNN
  • Slashdot
  • Digg.com
  • Fark.com
  • New York Times
  • All of them non-cacheable
  • no-cache, no-store or private

15
Average Content Overlap
Downloaded pages every 10 minutes for 18 days
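
A content-overlap measurement of this kind can be sketched as the fraction of bytes in a new snapshot that fall in chunks already present in the previous snapshot. The sketch is an assumption, not the method used for the slides: it splits pages on line breaks as a simple stand-in for Rabin's fingerprint, and the function names are hypothetical.

```python
import hashlib
from typing import List

def line_chunks(page: bytes) -> List[bytes]:
    """Stand-in chunker: one chunk per line, keeping the line ending."""
    return page.splitlines(keepends=True)

def overlap_ratio(previous: bytes, current: bytes) -> float:
    """Fraction of `current` bytes whose chunks also appear in `previous`."""
    seen = {hashlib.sha1(c).digest() for c in line_chunks(previous)}
    chunks = line_chunks(current)
    total = sum(len(c) for c in chunks)
    if total == 0:
        return 0.0
    shared = sum(len(c) for c in chunks if hashlib.sha1(c).digest() in seen)
    return shared / total

# Two hypothetical snapshots of a front page taken 10 minutes apart.
earlier = b"<h1>News</h1>\n<p>old story</p>\n<footer>x</footer>\n"
later = b"<h1>News</h1>\n<p>new story</p>\n<footer>x</footer>\n"
print(overlap_ratio(earlier, later))
```

Bytes counted as shared are the ones a CBN cache would not need to fetch again, even though the page as a whole is marked non-cacheable.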
16
Potential Data Savings via CZIP
[Figure: potential data savings via CZIP for the six sites; labeled values 37%, 39%, 61%, 24%, 57%, 90%]

17
Summary So far
  • CZIP is comparable to GZIP in speed and
    performance
  • CZIP is far better on files with much
    redundancy
  • Redundancy decreases as chunk size increases
  • Rabin's fingerprint exposes a good deal of
    redundancy regardless of chunk sizes
  • Optimal chunk size varies over workload
  • Bigger chunk size is better for network transfer
  • Dynamic content also exposes redundancy
  • CZIP can save 24-90% of BW compared with GZIP

18
Server Performance
  • CZIP Apache Module
  • Test scenario (FC mirror simulation)
  • 1.5 GB from FC6 DVD
  • 1.5 GB is split into three 0.5 GB images
  • Each file is requested in round-robin fashion
  • 100-300 clients simulated by six machines in LAN
  • Server is a 2.8GHz Pentium D w/ 2GB physical memory and 2 Gbps NICs

19
CZIP Apache Module
[Figure: throughput with the CZIP Apache module; 90%: 2.56 times, median: 2.07 times]

20
CBN-Aware Content Distribution
  • CoBlitz: large-file CDN (NSDI '06)
  • Serving 1-2 TB every day on PlanetLab
  • http://coblitz.codeen.org/URL
  • University channel podcast/vodcast
  • Fedora Core mirror, Citeseer etc.
  • Chunk is basic caching unit
  • Parallel chunk requests/responses
  • Chunk request in HTTP byte-range query

21
Making CoBlitz CZIP-Aware
  • CoBlitz's chunk request
  • GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0
  • Host: coblitz.codeen.org
  • CZIP-aware CoBlitz (C-CoBlitz) request
  • GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0
  • Host: czip.codeen.org
  • X-URL: www.cs.princeton.edu/bigfile.cz
  • X-Range: byte=1000-1999
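
The C-CoBlitz request above can be assembled as plain text. This is a sketch under assumptions: the function name is hypothetical, and the exact value syntax of the X-Range header (written here as byte=start-end) is a guess, since the transcript lost its punctuation.

```python
import hashlib

def czip_chunk_request(chunk_hash: str, url: str, start: int, end: int) -> str:
    """Build a C-CoBlitz style chunk request keyed by the chunk's SHA-1 hash."""
    lines = [
        f"GET /czip.codeen.org/{chunk_hash} HTTP/1.0",
        "Host: czip.codeen.org",
        f"X-URL: {url}",                   # where the chunk can be found on a miss
        f"X-Range: byte={start}-{end}",    # assumed syntax, see the note above
        "",
        "",
    ]
    return "\r\n".join(lines)

# Hypothetical chunk: bytes 1000-1999 of the file, named by their SHA-1 hash.
chunk = b"x" * 1000
name = hashlib.sha1(chunk).hexdigest()
print(czip_chunk_request(name, "www.cs.princeton.edu/bigfile.cz", 1000, 1999))
```

Because the request path carries the chunk hash rather than the file name and byte range, two files that contain the same chunk map to the same cache key in the CDN.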

22
CZIP-Aware CoBlitz Testing
  • Two content-overlapping files
  • Simultaneously fetch from 100 PlanetLab nodes
  • Origin server is at Princeton
  • Testing cases
  • Regular: download original files by regular
    CoBlitz
  • File-CZIP: download CZIPed files by regular
    CoBlitz
  • CZIP-CDN: download CZIPed files by C-CoBlitz

23
100 MB File Downloading
[Figure: 100 MB file download for Regular, File-CZIP, and CZIP-CDN; labeled value 388 MB]

24
50 MB File Downloading
[Figure: 50 MB file download for Regular, File-CZIP, and CZIP-CDN; labeled value 183 MB]

25
Conclusion
  • CZIP is a generic compression tool providing CBN
    benefits
  • CZIP is comparable to GZIP in compression
    performance
  • CZIP helps greatly reduce memory footprint in
    serving similar files
  • It is very easy to support CZIP and the benefit
    is transparent

26
Thank you!
  • More information can be found at
    http://codeen.cs.princeton.edu/czip/
  • CZIP code will be released soon!

27
200/300 Clients
[Figure: throughput with 200 and 300 clients; labeled values at 90%: 2.27 times and 2.11 times; median: 1.95 times and 1.84 times; other labels: 80, 65]