Title: The Data-Centric Revolution in Networking
1The Data-Centric Revolution in Networking?
- Scott Shenker
- International Computer Science Institute
- U. C. Berkeley
Liberally stealing the insight and work of
others, particularly Hari Balakrishnan, Deborah
Estrin, Ramesh Govindan, Joe Hellerstein, and Ion
Stoica
2Two Communities Apart
- Networking (Internet) researchers
- don't know and don't care about databases
- Vast gap between communities
- much more overlap with other systems communities
- But data-centrism has narrowed the gap
- metaphors and algorithms
- This talk will tell that story in reverse order
- Internet, then sensornets
3Our Central Mission
Get data from here to there
4Host-centric Protocols
- Protocols defined in terms of IP addresses
- Unicast: IP address identifies a host
- Multicast: IP address identifies a set of hosts
- Destination address is given to protocol
- Protocol delivers data from one host to another
- unicast: conceptually trivial
- multicast: address is logical, not physical
5Host-centric Applications
- Classic applications: destination is intrinsic
- telnet: target machine
- FTP: location of files
- electronic mail: email address turns into mail server
- multimedia conferencing: machines of participants
- Destination is specified by user (not network)
- Usually specified by hostname not address
- DNS translates names into addresses
6Domain Name System (DNS)
- DNS is built around recursive delegation
- Top-level domains (TLDs): .com, .net, .edu, etc.
- TLDs delegate authority to subdomains
- berkeley.edu
- Subdomains can further delegate
- cs.berkeley.edu
- Hierarchy fits host administrative structure
- Local decentralized control
- Crucial to efficient hostname resolution
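The delegation chain above can be sketched as a walk down a tree of zones, one label per step, starting from the TLD. This is only an illustration: the `ROOT` table, the `resolve` helper, and the 192.0.2.x addresses are hypothetical, and real DNS adds caching, referrals between servers, and record types that are left out here.

```python
# Hypothetical delegation tree: each zone maps a label to a child zone (dict)
# or to an address (str). The addresses are documentation-only placeholders.
ROOT = {
    "edu": {
        "berkeley": {
            "cs": {"www": "192.0.2.10"},
            "www": "192.0.2.20",
        }
    }
}

def resolve(hostname, root=ROOT):
    """Walk the delegation chain from the TLD down, one label per step.

    Returns (address or None, list of labels successfully followed).
    """
    zone = root
    steps = []
    for label in reversed(hostname.split(".")):
        if not isinstance(zone, dict) or label not in zone:
            return None, steps
        steps.append(label)
        zone = zone[label]
    # A complete name must end at an address, not at another zone.
    return (zone if isinstance(zone, str) else None), steps
```

Each level only needs to know its own children, which is the local decentralized control the slide refers to.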
7Network Research in Early 90s
- Consumed by a few obsessions
- Quality of service for streaming media
- Multicast
- Congestion control
- But nobody questioned host-centricity
- assumed to be the only way to build Internet
8Surprise 1: The web catches on!
9The Web
- Web URLs have host-name/path format
- Essentially the same information as FTP
- Early web
- browsers basically a GUI for FTP
- URLs were easily transmitted pointers
- Early web was host-centric
- and largely ignored (but used) by net researchers
10Modern Web
- URLs often function as names of data
- users think of www.cnn.com as data, not a host
- Fact that www.cnn.com is a hostname is irrelevant
- Users want data, not access to particular host
- The web is now data-centric
11Data-centric App in Host-centric World
- Data still associated with host names (URLs)
- administrative structure of data same as hosts
- weak point in current web
- Key enabler: search engines
- Searchable databases map keywords to URLs
- Allowed users to find desired data
- Networkers focused on technical problems
- HTTP, persistence (URNs), replication (CDNs), ...
12We Missed the Point!
- We thought
- web was an aberration
- search engines were a sufficient hack
- No networker (except Jacobson) articulated that
- web had gone from host-centric to data-centric
- it was a harbinger of future applications
13Surprise 2: Stolen Music is Popular!
- And we finally get the message...
14The P2P Filesharing Phenomenon
- Napster: fastest growing Internet application
- Music sharing is intrinsically data-centric
- data never associated with hosts
- Centralized searchable database
- listed IP addresses where content could be found
- analogous to Google + DNS in the web
- Legal problems forced decentralization
- Led to Gnutella and other distributed programs
15Gnutella-style File Sharing
- Gnutella nodes form an overlay network
- each node has a few neighbors in a virtual network
- virtual link: node knows other's IP address
- do app-level networking on this graph
16Gnutella-style Searching
- Keyword queries are flooded (within scope)
- query is processed locally at each node
- all nodes having hits respond to source
- many variations on this theme (freenet, etc.)
- Clearly not scalable
- P2P traffic now sizable fraction of overall load
- We finally realize that we need a scalable way to
find data for data-centric applications
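A rough sketch of scoped flooding, to make the scaling concern concrete. The overlay is a plain adjacency dictionary, and `flood_query`, its parameters, and the message counting are assumptions made for illustration, not Gnutella's actual wire protocol.

```python
from collections import deque

def flood_query(overlay, files, source, keyword, ttl):
    """Breadth-first flood of a keyword query, limited to `ttl` overlay hops.

    Returns (list of nodes with hits, number of query messages sent).
    """
    seen = {source}
    hits = []
    messages = 0
    frontier = deque([(source, ttl)])
    while frontier:
        node, hops_left = frontier.popleft()
        # Each node processes the query against its own local files.
        if any(keyword in name for name in files.get(node, [])):
            hits.append(node)
        if hops_left == 0:
            continue
        for nbr in overlay[node]:
            messages += 1  # sent even if the neighbor has already seen the query
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, hops_left - 1))
    return hits, messages
```

The message count grows with the number of overlay links inside the scope, regardless of how few nodes actually hold matching data.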
17Is there life outside the Internet?
- Yes, and we should have been listening!
18Sensornets (predating P2P)
- Vision
- Many sensing devices with radio and processor
- Enable fine-grained measurements over large areas
- Huge potential impact on science, and society
- Technical challenges
- untethered: power consumption must be limited
- unattended: robust and self-configuring
- wireless: ad hoc networking
19Conceptual Challenge
- Sensornets are inherently data-centric
- Users know what data they want, not where it is
- Estrin, Govindan, Heidemann (2000, etc.)
- Centralized database infeasible
- vast amount of data, constantly being updated
- small fraction of data will ever be queried
- sending to single site expends too much energy
20Flood-then-Aggregate
- General class of methods
- Flood query to all nodes (or in region)
- Nodes with data matching query respond
- Responses are aggregated as appropriate
- Examples
- Directed diffusion: reinforce based on data
- TAG: tree for flood and return-path aggregation
- Etc....
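A minimal sketch of the flood-then-aggregate pattern, loosely in the spirit of a tree built during the flood with aggregation along the return path. The function name, the `matches` and `combine` callbacks, and the adjacency-dictionary network are all assumptions for illustration; neither directed diffusion nor TAG is reproduced here.

```python
from collections import deque

def flood_then_aggregate(links, readings, sink, matches, combine):
    """Flood a query from `sink`, then merge matching readings up the tree.

    Returns the combined value seen at the sink, or None if nothing matched.
    """
    # Flood: a breadth-first pass records each node's parent on the way out.
    parent = {sink: None}
    order = []
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in links[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)

    # Aggregate: children appear after parents in BFS order, so walking the
    # order backwards handles every child before its parent.
    partial = {}
    for node in reversed(order):
        mine = partial.get(node)
        if matches(readings[node]):
            value = readings[node]
            mine = value if mine is None else combine(mine, value)
        partial[node] = mine
        up = parent[node]
        if up is not None and mine is not None:
            partial[up] = mine if partial.get(up) is None else combine(partial[up], mine)
    return partial[sink]
```

Only one partial result travels over each tree link, but the query itself still reaches every node, which is the cost the next slide is concerned with.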
21Scaling Problems
- This approach suffers as
- systems get bigger
- queries more frequent and more specific
- For current deployments, not an issue
- systems are small, queries primitive
- But if technology progresses as hoped
- want to get relevant data without flooding
- similar to situation in Internet
22Is Data-centric Flooding Necessary?
- The initial decentralized data-centric designs (in both Internet and sensornets) used flooding
- unscalable and unsustainable
- Since data-centrism is here to stay, we can't ignore this problem
- We had to broaden our research charter
23Our Revised Mission
Get data from here to there
24A DNS for Data?
- Can we map data names into addresses?
- a data-centric DNS, distributed and scalable
- doesn't alter net protocols, but aids data location
- not just about stolen music, but a general facility
- A formidable challenge
- Data does not have a clear administrative hierarchy
- Likely need to support a flat namespace
- Can one do this scalably?
- Data-centrism requires scalable flat lookups
25Distributed Hash Tables (DHTs)
- The latest networking fad....
Presented from the Internet perspective but
applies to sensornets as well
26An Internet-scale Distributed Index
- Interface: put(key,object), get(key)
- DHTs form a structured overlay network
- nodes choose particular neighbors
- all objects have keys, usually hash(name)
- each node responsible for range of keys
- puts/gets routed to appropriate node
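To make the interface concrete, here is a single-process stand-in: names are hashed to keys, and each key is stored at the node whose key is the first at or after it on the ring. `ToyDHT`, `key_of`, and the 32-bit ring are assumptions for illustration; there is no network, routing, or failure handling in this sketch.

```python
import bisect
import hashlib

def key_of(name, bits=32):
    """Hash a name to an integer key on a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class ToyDHT:
    """Single-process stand-in for a DHT: put/get only, no real routing."""

    def __init__(self, node_names):
        self.node_keys = sorted({key_of(n) for n in node_names})
        self.store = {k: {} for k in self.node_keys}

    def owner(self, key):
        # The owner is the first node key >= key, wrapping around the ring.
        i = bisect.bisect_left(self.node_keys, key)
        return self.node_keys[i % len(self.node_keys)]

    def put(self, name, obj):
        k = key_of(name)
        self.store[self.owner(k)][k] = obj

    def get(self, name):
        k = key_of(name)
        return self.store[self.owner(k)].get(k)
```

A caller writes `dht.put("song.mp3", data)` and later `dht.get("song.mp3")` without ever naming the node that holds the object.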
27Example Design: Chord
- Node and object keys
- random location around a circle
- Neighbors
- nodes 2^-i around the circle
- found by routing to desired key
- Routing: greedy
- pick nbr closest to destination
- Storage: own interval
- node owns key range between her key and previous node's key
[Figure: nodes placed around the Chord circle, with one node's ownership range marked]
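The routing rule on this slide can be sketched on a small ring. This is a simplified simulation, assuming a ring of 2**M positions and global knowledge of the node list when building neighbor sets; `M`, `RING`, and the helper names are hypothetical, and "closest to destination" is read here as the neighbor that gets nearest the key's owner without passing it.

```python
import bisect

M = 8            # assumed ring size for the sketch: 2**M positions
RING = 2 ** M

def successor(nodes, key):
    """First node at or clockwise after `key`; `nodes` is a sorted list."""
    i = bisect.bisect_left(nodes, key % RING)
    return nodes[i % len(nodes)]

def fingers(nodes, n):
    """Neighbors of n: owners of the points a fraction 2**-i around the circle."""
    return {successor(nodes, n + RING // 2 ** i) for i in range(1, M + 1)}

def clockwise(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % RING

def route(nodes, start, key):
    """Greedy routing: hop to the neighbor nearest the owner without passing it."""
    owner = successor(nodes, key)
    path = [start]
    current = start
    while current != owner:
        # Keep only neighbors that make progress and do not overshoot the owner.
        ok = [f for f in fingers(nodes, current)
              if 0 < clockwise(current, f) <= clockwise(current, owner)]
        current = max(ok, key=lambda f: clockwise(current, f))
        path.append(current)
    return path
```

Because the nearest neighbor (the immediate successor) is always a candidate, every hop makes progress, and the larger jumps are what keep the hop count logarithmic in practice.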
28Key Properties
- Large aggregate capacity: O(n) storage/bandwidth
- Scalable
- O(log n) routing hops and state
- O((log n)^2) update costs for node joins/leaves
- Robust: self-configuring and resilient to failures
- Non-property: strict guarantees under failures
29Our Version of Data Independence
- DHT interface allows us to get data by name
- We no longer care where data is
- A radical transition in databases
- perhaps it will be one in networking as well
- Apologies to Joe Hellerstein...
- see latest SIGMOD Record for his article
30Caveat!
- DHTs are a work-in-progress
- A flurry of research activity on
- security
- replication
- proximity
- real operational experience
- .....
- For rest of talk, we put these worries aside...
31Why Not Centralized Solutions?
- Ugh! (and infeasible for sensornets)
- Fault tolerance: avoid single point of failure
- Economic
- DNS: donated machines, scales organically
- Centralized solutions require business model
- Issue still open... but irrelevant to data-centrism
- need to support interface
- DHTs allow us to choose between centralized and decentralized
32Multiple Roles for DHTs
- Application-specific
- rolled into P2P application, run on peers
- General-purpose service
- run on managed nodes
- Intrinsic part of Internet architecture
- run on managed nodes
33Multiple Roles for DHTs
- Application-specific
- rolled into P2P app, run on peers
- General-purpose service
- run on managed nodes
- Intrinsic part of Internet architecture
- run on managed infrastructure nodes
34Some Applications using DHTs
- Partial list
- File sharing
- Storage repositories and file systems
- Backup systems
- Event notification systems
- Electronic mail
- App-layer multicast and streaming media
- .....
- Useful substrate for many (not all) large
distributed applications because HTs are useful
35Multiple Roles for DHTs
- Application-specific
- rolled into P2P app, run on peers
- General-purpose service
- run on managed nodes
- Intrinsic part of Internet architecture
- run on managed infrastructure nodes
36Internet-scale Query Processing
- Superficial motivation
- Joins can be implemented with hash tables so...
- Distributed joins can be implemented with DHTs
- Scaling: latency O(log n) while computation O(n)
- PIER (talk later today in session A9!)
- joins, aggregation, recursive and continuous queries
- Intended targets
- data in the wild (filesharing, net monitoring, etc.)
- schema provided by standardized protocols
- no need for ACID semantics
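The "superficial motivation" can be sketched as a rehash join: every tuple from both relations is sent to the node responsible for its join value, so matching tuples meet at the same node and are joined locally. This is an illustrative sketch, not PIER's implementation; `node_for` stands in for a DHT lookup, and the function and column names are assumptions.

```python
import hashlib
from collections import defaultdict

def node_for(value, n_nodes):
    """Stand-in for a DHT lookup: hash the join value to one of n_nodes."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_nodes

def dht_join(r_rows, s_rows, r_col, s_col, n_nodes):
    """Equi-join: rehash both relations on the join value, then join per node."""
    # Phase 1: every tuple is "put" at the node responsible for its join value.
    r_at = defaultdict(lambda: defaultdict(list))
    s_at = defaultdict(lambda: defaultdict(list))
    for row in r_rows:
        r_at[node_for(row[r_col], n_nodes)][row[r_col]].append(row)
    for row in s_rows:
        s_at[node_for(row[s_col], n_nodes)][row[s_col]].append(row)

    # Phase 2: matching tuples are guaranteed to sit on the same node.
    out = []
    for node, r_table in r_at.items():
        for value, r_matches in r_table.items():
            for r in r_matches:
                for s in s_at[node].get(value, []):
                    out.append({**r, **s})
    return out
```

For example, joining flow records on `src` with alert records on `ip` returns only the rows whose addresses agree, and no single node ever sees either relation in full.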
37More Complex Queries
- Range search
- using prefix hash table
- no need to walk tree
- Keyword search
- engineering the boolean approach
- Active research on DHT-based distributed data
structures for search (net and db communities)
38Multiple Roles for DHTs
- Application-specific
- rolled into P2P app, run on peers
- General-purpose service
- run on managed nodes
- Intrinsic part of Internet architecture
- run on managed infrastructure nodes
39Cleaning Up the Architecture
- Making URNs a reality
- webNG based on flat and opaque DHT keys
- enables persistence and eliminates branding
- Host identifiers versus routing information
- IP addresses currently (and stupidly) serve as both
- DHT key = host id, resolves to routing address
- Architectural challenge for basic protocols
40Subverting the Architecture
- Use DHT for forwarding, not just lookup!
- e.g., Internet Indirection Infrastructure (i3)
- similar in spirit to multicast (logical addressing)
- transcends current naming/addressing structures
- Make overlay the real network layer
- turn IP into a link layer technology
- Leverages, not limited by, current infrastructure
- New network layer is still simple, but not IP
41New Generation of Networking?
- Current Internet relies on hierarchies to scale
- DNS naming, IP addressing, etc.
- Hierarchies limit flexibility
- addresses and names have to fit given structure
- need to care where data/machines are
- Scalable flat lookup avoids hierarchy
- network would be structure independent
- Less of a distinction between hosts and data
42Do DHTs Apply to Sensornets?
- Can we build them?
- Do they help?
43Finding Sensornet Data w/o Flooding
- Extract high-level features or events
- Temperature spikes, toxins, animal sightings
- Name these events
- Store/Access events with DHT-like structure
- Can later get detailed data from specific nodes
- Call this data-centric storage (DCS)
- Good for frequent specific queries
- Not good for long-running or aggregate queries
- But how do you build a sensornet DHT?
44Geographic Routing
- Nodes know own and neighbors' positions
- Packets routed to geographic destination
- Greedy forwarding, when possible
- If greedy fails at a void, use the right-hand rule to navigate around the void
[Figure: nodes A and B and a geographic destination (x,y)]
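The greedy part of geographic routing can be sketched as follows. Only greedy forwarding is shown; the right-hand-rule recovery around voids is deliberately left out, so the function simply stops at a local minimum. The function name and the dictionary-based topology are assumptions for illustration.

```python
import math

def greedy_route(positions, neighbors, start, target):
    """Forward toward `target` (x, y); stop when no neighbor is closer.

    positions: node -> (x, y); neighbors: node -> list of nodes in radio range.
    Returns the list of nodes visited.
    """
    def dist(node):
        return math.dist(positions[node], target)

    path = [start]
    current = start
    while True:
        best = min(neighbors[current], key=dist, default=current)
        if dist(best) >= dist(current):
            # Local minimum: either the closest node overall, or the edge of a void.
            return path
        current = best
        path.append(current)
```

Each node needs only its own position and its neighbors' positions, so no routing tables are built or maintained.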
45Geographic Hash Table (GHT)
- Keys hashed to random coordinates
- Likely no node exists at that location!
- Forwarding ends at node closest to destination
- Closest node stores the data
[Figure: node A and the hashed location (x,y)]
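A toy version of the idea: an event name is hashed to a point in the field, and the node closest to that point stores the data. In this sketch the closest node is found by a global search rather than by geographic routing, and `ToyGHT`, `key_to_point`, and the field dimensions are assumptions for illustration.

```python
import hashlib
import math

def key_to_point(name, width, height):
    """Hash an event name to a pseudo-random (x, y) inside the field."""
    d = hashlib.sha1(name.encode("utf-8")).digest()
    x = int.from_bytes(d[:4], "big") / 2 ** 32 * width
    y = int.from_bytes(d[4:8], "big") / 2 ** 32 * height
    return (x, y)

class ToyGHT:
    """Stores each named event at the node nearest the name's hashed location."""

    def __init__(self, positions, width, height):
        self.positions = positions            # node -> (x, y)
        self.size = (width, height)
        self.store = {node: {} for node in positions}

    def home_node(self, name):
        point = key_to_point(name, *self.size)
        # Likely no node sits exactly at `point`, so take the closest one.
        return min(self.positions, key=lambda n: math.dist(self.positions[n], point))

    def put(self, name, value):
        self.store[self.home_node(name)].setdefault(name, []).append(value)

    def get(self, name):
        return self.store[self.home_node(name)].get(name, [])
```

Any node that knows the event name can compute the same location, so a query goes straight to the home node instead of being flooded.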
46Additional Algorithms
- Caching and replication
- Cache around perimeter, replicate independently
- Structured replication (SR)
- Hierarchical decomposition of key space
- Tree of mirror images
[Figure: Hash(event) location and its mirror images]
47More Complex Queries
- Using GHT+SR (which has spatial structure)
- Range searches in space and value
- Wavelet analysis
- New data structures
- Higher-dimensional range searches
- Active research in distributed data structures
for sensornet queries (in net and db communities)
48We are finally on our way to the land of data independence...
- We ask for your guidance....
49Areas of Common Interest
- Algorithmic
- distributed data structures for search
- Metaphoric
- thinking about data independence