Title: Distributed Data Storage and Parallel Processing Engine
1. Distributed Data Storage and Parallel Processing Engine
Sector/Sphere
Yunhong Gu, University of Illinois at Chicago
2. What is Sector/Sphere?
- Sector: Distributed File System
- Sphere: Parallel Data Processing Engine (generic MapReduce)
- Open source software (GPL/BSD), written in C++
- Started in 2006; the current version is 1.23
- http://sector.sf.net
3. Overview
- Motivation
- Sector
- Sphere
- Experimental Results
4. Motivation
Supercomputer model: expensive, with a data I/O bottleneck.
Sector/Sphere model: inexpensive, with parallel data I/O and data locality.
5. Motivation
Parallel/distributed programming with MPI, etc.: flexible and powerful, but too complicated.
Sector/Sphere (cloud) model: the cluster appears as a single entity to the developer, with a simplified programming interface, but it is limited to certain data-parallel applications.
6. Motivation
Systems built for a single data center: require additional effort to locate and move data.
Sector/Sphere model: supports wide-area data collection and distribution.
7. Sector Distributed File System
[Architecture diagram] The security server maintains user accounts, data protection, and system security. The masters handle the metadata and scheduling and act as the service provider. Clients access the system through tools and application programming interfaces. Slaves provide storage and processing. The masters communicate with the security server and with clients over SSL; data moves between clients and slaves over UDT, with optional encryption.
8. Sector Distributed File System
- Sector stores files on the native/local file system of each slave node.
- Sector does not split files into blocks.
- Pro: simple and robust, suitable for wide-area deployment, fast and flexible data processing.
- Con: users need to manage file sizes appropriately.
- The master nodes maintain the file system metadata; no permanent metadata is needed.
- Topology aware.
9. Sector Performance
- The data channel is set up directly between a slave and a client.
- Multiple active-active masters (load balancing), starting from version 1.24.
- UDT is used for high-speed data transfer.
- UDT is a high-performance UDP-based data transfer protocol, much faster than TCP over wide-area networks.
10. UDT: UDP-based Data Transfer
- http://udt.sf.net
- Open source UDP-based data transfer protocol
- With reliability control and congestion control
- Fast, firewall friendly, easy to use
- Already used in many commercial and research software packages
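To make the API concrete, here is a minimal client-side sketch using UDT's BSD-socket-style calls (startup, socket, connect, send, close, cleanup). The server address, port, and message are placeholders, and error handling is reduced to a single check.

#include <udt.h>          // UDT SDK from http://udt.sf.net
#include <netinet/in.h>
#include <arpa/inet.h>
#include <cstring>
#include <iostream>

int main()
{
   UDT::startup();                                    // initialize the UDT library

   UDTSOCKET client = UDT::socket(AF_INET, SOCK_STREAM, 0);

   sockaddr_in serv_addr;
   std::memset(&serv_addr, 0, sizeof(serv_addr));
   serv_addr.sin_family = AF_INET;
   serv_addr.sin_port = htons(9000);                  // placeholder port
   inet_pton(AF_INET, "192.168.0.1", &serv_addr.sin_addr);  // placeholder server

   if (UDT::ERROR == UDT::connect(client, (sockaddr*)&serv_addr, sizeof(serv_addr)))
   {
      std::cerr << "connect: " << UDT::getlasterror().getErrorMessage() << std::endl;
      return 1;
   }

   const char* msg = "hello over UDT";
   UDT::send(client, msg, (int)std::strlen(msg), 0);  // reliable, congestion-controlled send

   UDT::close(client);
   UDT::cleanup();                                    // release library resources
   return 0;
}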
11. Sector Fault Tolerance
- Sector uses replication for better reliability and availability.
- Replicas can be created either at write time (instantly) or periodically.
- Sector supports multiple active-active masters for high availability.
12. Sector Security
- Sector uses a security server to maintain user accounts and IP access control for masters, slaves, and clients.
- Control messages are encrypted (not completely finished in the current version).
- Data transfer can optionally be encrypted.
- The data transfer channel is set up by rendezvous; there is no listening server.
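As an illustration of how a rendezvous data channel can avoid a listening server, the sketch below uses UDT's UDT_RENDEZVOUS socket option: both endpoints bind a local port and call connect() toward each other. The helper name, addresses, and ports are placeholders, and UDT::startup() is assumed to have been called already.

#include <udt.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <cstring>

// Rendezvous connection sketch: neither side listens; both sides bind a
// local port and connect toward the peer, and the handshake meets in the
// middle. Parameters are illustrative placeholders.
UDTSOCKET rendezvous_connect(const char* peer_ip, int peer_port, int local_port)
{
   UDTSOCKET u = UDT::socket(AF_INET, SOCK_STREAM, 0);

   bool rendezvous = true;
   UDT::setsockopt(u, 0, UDT_RENDEZVOUS, &rendezvous, sizeof(bool));

   sockaddr_in local;
   std::memset(&local, 0, sizeof(local));
   local.sin_family = AF_INET;
   local.sin_port = htons(local_port);
   local.sin_addr.s_addr = INADDR_ANY;
   UDT::bind(u, (sockaddr*)&local, sizeof(local));    // bind the local port first

   sockaddr_in peer;
   std::memset(&peer, 0, sizeof(peer));
   peer.sin_family = AF_INET;
   peer.sin_port = htons(peer_port);
   inet_pton(AF_INET, peer_ip, &peer.sin_addr);

   UDT::connect(u, (sockaddr*)&peer, sizeof(peer));   // both sides call connect()
   return u;
}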
13. Sector Tools and API
- Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download.
- Wildcard characters are supported.
- System monitoring: sysinfo.
- C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo (a usage sketch follows).
- FUSE.
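The sketch below only mirrors the operations listed above; the SectorClient type and its method names are illustrative stand-ins, not the verbatim Sector client API, and the stub bodies simply print what a real client call would do.

#include <iostream>
#include <string>

// Stand-in client wrapper for the sketch; real calls would go to the
// master and slave nodes.
struct SectorClient {
   bool login(const std::string& master, int port,
              const std::string& user, const std::string& pass) {
      std::cout << "login to " << master << ":" << port << " as " << user << "\n";
      return true;
   }
   bool mkdir(const std::string& path)  { std::cout << "mkdir "  << path << "\n"; return true; }
   bool upload(const std::string& local, const std::string& remote) {
      std::cout << "upload " << local << " -> " << remote << "\n"; return true;
   }
   bool stat(const std::string& path)   { std::cout << "stat "   << path << "\n"; return true; }
   void logout()                        { std::cout << "logout\n"; }
};

int main()
{
   SectorClient client;
   // Host, port, and credentials are placeholders.
   client.login("master.example.org", 6000, "user", "password");
   client.mkdir("/benchmark");
   client.upload("input.dat", "/benchmark/input.dat");
   client.stat("/benchmark/input.dat");
   client.logout();
   return 0;
}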
14. Sphere: Simplified Data Processing
- Data-parallel applications.
- Data is processed where it resides, or on the nearest possible node (locality).
- The same user-defined function (UDF) is applied to all elements (records, blocks, or files).
- Processing output can be written to Sector files or sent back to the client.
- Generalized Map/Reduce.
15. Sphere: Simplified Data Processing

The serial program:

for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, ...)

becomes, with Sphere:

SphereStream sdss;
sdss.init("sdss files");
SphereProcess* myproc;
myproc->run(sdss, "findBrownDwarf", ...);
myproc->read(result);

findBrownDwarf(char* image, int isize, char* result, int rsize)
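A minimal UDF skeleton, assuming the four-argument shape shown above (input buffer and size, output buffer and capacity). The body, the return-value convention, and the use of C linkage are illustrative placeholders, not the actual brown-dwarf detector.

#include <cstring>

// extern "C" keeps the symbol name simple, assuming the UDF is loaded from
// a shared library by name (an assumption for this sketch).
extern "C" int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
   // ... scan the image bytes for candidate objects (omitted) ...

   // Write whatever was found into the caller-provided result buffer.
   const char* found = "candidate list placeholder";
   int len = (int)std::strlen(found) + 1;
   if (len > rsize)
      return -1;                  // output buffer too small
   std::memcpy(result, found, len);
   return 0;                      // 0 = segment processed successfully
}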
16. Sphere Data Movement
- Slave -> Slave (local)
- Slave -> Slaves (shuffle/hash)
- Slave -> Client
17Sphere/UDF vs. MapReduce
- Record Offset Index
- UDF
- Hashing / Bucket
- -
- UDF
- -
- Parser / Input Reader
- Map
- Partition
- Compare
- Reduce
- Output Writer
18. Sphere/UDF vs. MapReduce
- Sphere is more straightforward and flexible.
- A UDF can be applied directly to records, blocks, files, and even directories.
- Native binary data support.
- Sorting is required before Reduce, but it is optional in Sphere.
- Sphere uses a PUSH model for data movement, which is faster than the PULL model used by MapReduce.
19. Why Doesn't Sector Split Files?
- Certain applications need to process a whole file or even a whole directory.
- Certain legacy applications need a file or a directory as input.
- Certain applications need multiple inputs, e.g., everything in a directory.
- In Hadoop, all blocks would have to be moved to one node for processing, so there is no data locality benefit.
20. Load Balancing
- The number of data segments is much larger than the number of Sphere Processing Engines (SPEs). When an SPE completes a data segment, a new segment is assigned to it (see the sketch below).
- Data transfer is balanced across the system to optimize network bandwidth usage.
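The sketch below models this assignment policy conceptually: a queue holding many more segments than workers, with each simulated SPE claiming the next pending segment as soon as it finishes one. It illustrates the idea only and is not the Sphere scheduler.

#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main()
{
   std::queue<int> segments;                // pending data-segment IDs
   for (int i = 0; i < 1000; ++i)           // far more segments than workers
      segments.push(i);
   std::mutex m;

   auto spe_worker = [&](int /*spe_id*/) {
      for (;;) {
         int seg;
         {
            std::lock_guard<std::mutex> lock(m);
            if (segments.empty())
               return;                       // nothing left to process
            seg = segments.front();
            segments.pop();                  // claim the next segment
         }
         // ... run the UDF on segment `seg` (omitted) ...
         (void)seg;
      }
   };

   std::vector<std::thread> spes;
   for (int id = 0; id < 8; ++id)            // 8 simulated SPEs
      spes.emplace_back(spe_worker, id);
   for (auto& t : spes)
      t.join();

   std::cout << "all segments processed\n";
   return 0;
}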
21. Fault Tolerance
- Map failure is recoverable.
- If one SPE fails, the data segment assigned to it is re-assigned to another SPE and processed again.
- Reduce failure is unrecoverable.
- In small to medium systems, machine failure at run time is rare.
- If necessary, developers can split the input into multiple sub-tasks to reduce the cost of a reduce failure.
22. Open Cloud Testbed
- 4 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2).
- 10 Gb/s inter-site connections over CiscoWave.
- 2 Gb/s inter-rack connections.
- Each node: two dual-core AMD CPUs, 12 GB RAM, one 1 TB disk.
- Capacity will be doubled by Sept. 2009.
23. Open Cloud Testbed
24. The TeraSort Benchmark
- Data is split into small files scattered across all slaves.
- Stage 1: on each slave, an SPE scans the local files and sends each record to a bucket file on a remote node according to its key.
- Stage 2: on each destination node, an SPE sorts all data inside each bucket.
25. TeraSort
[Diagram] Each 100-byte record consists of a 10-byte key and a 90-byte value. Stage 1: hash each record on the first 10 bits of its key into buckets 0-1023. Stage 2: sort each bucket on its local node.
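A small sketch of the Stage 1 bucket assignment described above: the first 10 bits of the 10-byte key select one of 1024 buckets. This illustrates the hashing step only and is not the benchmark code itself.

#include <cstddef>

// Map a 100-byte TeraSort record to one of 1024 buckets using the first
// 10 bits of its key, as described on the slide.
constexpr std::size_t RECORD_SIZE = 100;   // 10-byte key + 90-byte value
constexpr int NUM_BUCKETS = 1024;          // 2^10 buckets, IDs 0-1023

int bucket_for_record(const unsigned char* record)
{
   // First 10 bits = all 8 bits of key byte 0 plus the top 2 bits of byte 1.
   const int high = record[0];
   const int low  = record[1] >> 6;
   const int bucket = (high << 2) | low;   // value in [0, 1023]
   return bucket % NUM_BUCKETS;            // defensive, already in range
}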
26. Performance Results: TeraSort
Run time in seconds, Sector v1.16 vs. Hadoop 0.17.

Sites                            Data Size   Sphere   Hadoop (3 replicas)   Hadoop (1 replica)
UIC                              300 GB      1265     2889                  2252
UIC + StarLight                  600 GB      1361     2896                  2617
UIC + StarLight + Calit2         900 GB      1430     4341                  3069
UIC + StarLight + Calit2 + JHU   1.2 TB      1526     6675                  3702
27. Performance Results: TeraSort
- Sorting 1.2 TB on 120 nodes:
- Sphere: hash 981 sec + local sort 545 sec
- Hadoop: 3702/6675 seconds
- Sphere hash stage: CPU 130%, MEM 900 MB
- Sphere local sort stage: CPU 80%, MEM 1.4 GB
- Hadoop: CPU 150%, MEM 2 GB
28. The MalStone Benchmark
- Drive-by problem: visit a web site and get compromised by malware.
- MalStone-A: compute the infection ratio of each site.
- MalStone-B: compute the infection ratio of each site from the beginning to the end of every week.
- http://code.google.com/p/malgen/
29. MalStone
[Diagram] Each text record contains an Event ID, Timestamp, Site ID, Compromise Flag, and Entity ID. Records are transformed into (key, value) pairs with the Site ID as the key and (Time, Flag) as the value. Stage 1: process each record and hash it into one of 1000 buckets (site-000 through site-999) based on a 3-byte site-ID prefix. Stage 2: compute the infection rate for each merchant site.
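A sketch of the Stage 2 computation for MalStone-A as described above: within one bucket, count total and flagged visits per site and report the ratio. The Visit struct and the sample data are illustrative stand-ins for the parsed (Site ID, Time, Flag) records.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Visit {
   std::string site_id;
   bool        compromised;   // compromise flag from the log record
};

int main()
{
   // A tiny hand-made bucket; in the benchmark this data comes from the
   // bucket files produced in Stage 1.
   std::vector<Visit> bucket = {
      {"site-001", false}, {"site-001", true},
      {"site-002", false}, {"site-001", false},
   };

   std::map<std::string, std::pair<long, long>> counts;  // site -> (total, flagged)
   for (const Visit& v : bucket) {
      counts[v.site_id].first += 1;
      if (v.compromised)
         counts[v.site_id].second += 1;
   }

   for (const auto& entry : counts) {
      double ratio = (double)entry.second.second / (double)entry.second.first;
      std::cout << entry.first << " infection ratio: " << ratio << "\n";
   }
   return 0;
}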
30. Performance Results: MalStone
Processing 10 billion records on 20 OCT nodes (local).

                          MalStone-A   MalStone-B
Hadoop                    454m 13s     840m 50s
Hadoop Streaming/Python   87m 29s      142m 32s
Sector/Sphere             33m 40s      43m 44s

Courtesy of Collin Bennet and Jonathan Seidman of Open Data Group.
31. System Monitoring (Testbed)
32. System Monitoring (Sector/Sphere)
33. For More Information
- Sector/Sphere code and docs: http://sector.sf.net
- Open Cloud Consortium: http://www.opencloudconsortium.org
- NCDM: http://www.ncdm.uic.edu