Title: Reliable Transport and Code Distribution in Wireless Sensor Networks
1. Reliable Transport and Code Distribution in Wireless Sensor Networks
- Thanos Stathopoulos
- CS 213, Winter 04
2. Reliability Introduction
- Not an issue on wired networks
- TCP does a good job
- Link bit-error rates are usually around 10^-15
- No energy cost
- However, WSNs have
- Low power radios
- Error rates of up to 30% or more
- Limited range
- Energy constraints
- Retransmissions reduce lifetime of network
- Limited storage
- Buffer size cannot be too large
- Highly application-specific requirements
- No single TCP-like solution
3. Approaches
- Loss-tolerant algorithms
- Leverage spatial and temporal redundancy
- Good enough for some applications
- But what about code updates?
- Add retransmission mechanism
- At the link layer (e.g. SMAC)
- At the routing/transport layer
- At the application layer
- Hop-by-hop or end-to-end?
4. Relevant papers
- PSFQ: A Reliable Transport Protocol for Wireless Sensor Networks
- RMST: Reliable Data Transport in Sensor Networks
- ESRT: Event-to-Sink Reliable Transport in Wireless Sensor Networks
5. PSFQ Overview
- Key ideas
- Slow data distribution (pump slowly)
- Quick error recovery (fetch quickly)
- NACK-based
- Data caching guarantees ordered delivery
- Assumption: no congestion; losses due only to poor link quality
- Goals
- Ensure data delivery with minimum support from transport infrastructure
- Minimize signaling overhead for detection/recovery operations
- Operate correctly in poor link quality environments
- Provide loose delay bounds for data delivery to all intended receivers
- Operations
- Pump
- Fetch
- Report
6. End-to-end considered harmful?
- Probability of reception degrades exponentially over multiple hops
- Not an issue in the Internet
- Serious problem if error rates are considerable
- ACKs/NACKs are also affected
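The exponential degradation is easy to quantify: an end-to-end scheme must get a packet (and its ACKs/NACKs) across every hop at once. A minimal sketch, assuming independent per-hop losses (a simplified model, not from the papers):

```python
def end_to_end_success(p_hop: float, hops: int) -> float:
    """Probability a packet survives all hops when each hop
    independently succeeds with probability p_hop."""
    return p_hop ** hops
```

With a 10% per-hop loss rate, ten hops already push end-to-end success below 35%, while hop-by-hop recovery only ever fights the constant per-hop loss.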
7. Proposed solution: Hop-by-hop error recovery
- Intermediate nodes now responsible for error detection and recovery
- NACK-based: loss detection probability is now constant
- Not affected by network size (scalability)
- Exponential decrease in the end-to-end case
- Cost: keeping state on each node
- Potentially not as bad as it sounds!
- Cluster/group-based communication
- Intermediate nodes are usually receivers as well
8. Pump operation
- Node broadcasts a packet to its neighbors every Tmin
- Data cache used for duplicate suppression
- Receiver checks for gaps in sequence numbers
- If all is fine, it decrements TTL and schedules a transmission
- Tmin < Ttransmit < Tmax
- By delaying transmission, quick fetch operations are possible
- Reduce redundant transmissions (don't transmit if 4 or more nodes have forwarded the packet already)
- Tmax can provide a loose delay bound for the last hop
- D(n) = Tmax * (# of fragments) * (# of hops)
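The loose delay bound can be read as a worst case in which every fragment waits the full Tmax at every forwarding node; a small sketch under that reading:

```python
def pump_delay_bound(t_max: float, fragments: int, hops: int) -> float:
    """Worst-case last-hop delivery delay, D(n) = Tmax * fragments * hops:
    each fragment may wait up to Tmax at each forwarding node."""
    return t_max * fragments * hops
```

With the Tmax = 0.3 s used in the experiments, a 100-fragment file over 5 hops is bounded by 150 s.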
9. Fetch operation
- Sequence number gap is detected
- Node will send a NACK message upstream
- Window specifies the range of missing sequence numbers
- NACK receivers will randomize their transmissions to reduce redundancy
- The node will NOT forward any packets downstream
- NACK scope is 1 hop
- NACKs are generated every Tr if there are still gaps
- Tr < Tmax
- This is the pump/fetch ratio
- NACKs can be cancelled if neighbors have sent similar NACKs
10. Proactive Fetch
- Last segments of a file can get lost
- Loss detection impossible: no next segment exists!
- Solution: timeouts (again)
- Node enters "proactive fetch" mode if the last segment hasn't been received and no packet has been delivered after Tpro
- Timing must be right
- Too early: wasted control messages
- Too late: increased delivery latency for the entire file
- Tpro = a * (Smax - Smin) * Tmax
- A node will wait long enough until all upstream nodes have received all segments
- If the data cache isn't infinite:
- Tpro = a * k * Tmax (Tpro is proportional to cache size)
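Both Tpro expressions are simple proportionalities; a minimal sketch, with a as the proportionality constant from the slide (parameter names assumed):

```python
def t_pro_unbounded(a: float, s_max: int, s_min: int, t_max: float) -> float:
    """Proactive-fetch timeout with an unbounded data cache:
    proportional to the number of segments still expected upstream."""
    return a * (s_max - s_min) * t_max

def t_pro_bounded(a: float, k: int, t_max: float) -> float:
    """With a finite cache of k segments, the wait is capped by cache size."""
    return a * k * t_max
```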
11. Report Operation
- Used as a feedback/monitoring mechanism
- Only the last hop will respond immediately (create a new packet)
- Other nodes will piggyback their state info when they receive the report reply
- If there is no space left in the message, a new one will be created
12. Experimental results
- Tmax = 0.3s, Tr = 0.1s
- 100 30-byte packets sent
- Exponential increase in delay happens at 11% loss rate or higher
13. PSFQ Conclusion
- Slow data dissemination, fast data recovery
- All transmissions are broadcast
- NACK-based, hop-by-hop recovery
- End-to-end behaves poorly in lossy environments
- NACKs are superior to ACKs in terms of energy savings
- No out-of-order delivery allowed
- Uses data caching extensively
- Several timers and duplicate suppression mechanisms
- Implementing any of those on motes is challenging (non-preemptive FIFO scheduler)
14. RMST Overview
- A transport layer protocol
- Uses diffusion for routing
- Selective NACK-based
- Provides
- Guaranteed delivery of all fragments
- In-order delivery not guaranteed
- Fragmentation/reassembly
15. Placement of reliability for data transport
- RMST considers 3 layers
- MAC
- Transport
- Application
- Focus is on MAC and Transport
16. MAC Layer Choices
- No ARQ
- All transmissions are broadcast
- No RTS/CTS or ACK
- Reliability deferred to upper layers
- Benefits: no control overhead, no erroneous path selection
- ARQ always
- All transmissions are unicast
- RTS/CTS and ACKs used
- One-to-many communication done via multiple unicasts
- Benefits: packets traveling on established paths have a high probability of delivery
- Selective ARQ
- Use broadcast for one-to-many and unicast for one-to-one
- Data and control packets traveling on established paths are unicast
- Route discovery uses broadcast
17. Transport Layer Choices
- End-to-End Selective Request NACK
- Loss detection happens only at sinks (endpoints)
- Repair requests travel on the reverse (multihop) path from sinks to sources
- Hop-by-Hop Selective Request NACK
- Each node along the path caches data
- Loss detection happens at each node along the path
- Repair requests sent to immediate neighbors
- If data isn't found in the caches, NACKs are forwarded to the next hop towards the source
18. Application Layer Choices
- End-to-End Positive ACK
- Sink requests a large data entity
- Source fragments the data
- Sink keeps sending interests until all fragments have been received
- Used only as a baseline
19. RMST details
- Implemented as a Diffusion Filter
- Takes advantage of Diffusion mechanisms for
- Routing
- Path recovery and repair
- Adds
- Fragmentation/reassembly management
- Guaranteed delivery
- Receivers responsible for fragment retransmission
- Receivers aren't necessarily end points
- Caching or non-caching mode determines classification of a node
20. RMST Details (cont'd)
- NACKs triggered by
- Sequence number gaps
- Watchdog timer inspects the fragment map periodically for holes that have aged for too long
- Transmission timeouts
- Last fragment problem
- NACKs propagate from sinks to sources
- Unicast transmission
- NACK is forwarded only if the segment is not found in the local cache
- Back-channel required to deliver NACKs to upstream neighbors
21. Evaluation
- NS-2 simulation
- 802.11 MAC
- 21 nodes
- Single sink, single source
- 6 hops
- MAC ARQ set to 4 retries
- Image size: 5K
- 50 100-byte fragments
- Total cost of sending the entire file: 87,818 bytes
- Includes diffusion control message overhead
- All results normalized to this value
22. Results: Baseline (no RMST)
- ARQ and S-ARQ have high overhead when error rates are low
- S-ARQ is better in terms of efficiency
- Also helps with route selection
- No-ARQ results drop considerably as error rates increase
- Exponential decay of end-to-end reliability mechanisms
23. Results: RMST with H-b-H Recovery and Caching
- Slight improvement for ARQ and S-ARQ results over baseline
- No ARQ is better even in the 10% error rate case
- But many more exploratory packets were sent before the route was established
24. Results: RMST with E-2-E Recovery
- No ARQ doesn't work for the 10% error rate case
- Numerous holes that required NACKs couldn't make it from source to sink without link-layer retransmissions
- ARQ and S-ARQ results are statistically indistinguishable from the H-b-H results
- NACKs were very rare when any form of ARQ was used
25. Results: Performance under High Error Rates
- No ARQ doesn't work for the 30% error rate case
- Diffusion control messages could not establish routes most of the time
- In the 20% case, it took several minutes to establish routes
26. RMST Conclusion
- ARQ helps with unicast control and data packets
- In high error-rate environments, routes cannot be established without ARQ
- Route discovery packets shouldn't use ARQ
- Erroneous path selection can occur
- RMST combines a NACK-based transport layer protocol with S-ARQ to achieve the best results
27. Congestion Control
- Sensor networks are usually idle
- Until an event occurs
- High probability of channel overload
- Information must reach users
- Solution congestion control
28. ESRT Overview
- Places interest on events, not individual pieces of data
- Application-driven
- Application defines what its desired event reporting rate should be
- Includes a congestion-control element
- Runs mainly on the sink
- Main goal: adjust the reporting rate of the sources to achieve optimal reliability requirements
29. Problem Definition
- Assumption
- Detection of an event is related to the number of packets received during a specific interval
- Observed event reliability ri
- # of packets received in decision interval i
- Desired event reliability R
- # of packets required for reliable event detection
- Application-specific
- Goal: configure the reporting rate of the nodes
- Achieve required event detection
- Minimize energy consumption
30. Reliability vs. Reporting frequency
- Initially, reliability increases linearly with reporting frequency
- There is an optimal reporting frequency (fmax), after which congestion occurs
- fmax decreases when the # of nodes increases
31. Characteristic Regions
- n: normalized reliability indicator
- (NC, LR): No congestion, Low reliability
- f <= fmax, n < 1-e
- (NC, HR): No congestion, High reliability
- f <= fmax, n > 1+e
- (C, HR): Congestion, High reliability
- f > fmax, n > 1
- (C, LR): Congestion, Low reliability
- f > fmax, n <= 1
- OOR: Optimal Operating Region
- f <= fmax, 1-e <= n <= 1+e
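The five regions partition the (f, n) plane, so they can be expressed as a simple classifier; a sketch with e as the tolerance parameter (exact boundary handling assumed):

```python
def esrt_region(f: float, f_max: float, n: float, e: float) -> str:
    """Classify reporting frequency f and normalized reliability n
    into one of the ESRT characteristic regions."""
    if f <= f_max:                      # no congestion
        if n < 1 - e:
            return "(NC, LR)"
        if n > 1 + e:
            return "(NC, HR)"
        return "OOR"                    # 1-e <= n <= 1+e
    return "(C, HR)" if n > 1 else "(C, LR)"
```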
32. Characteristic Regions (figure)
33. ESRT Requirements
- Sink is powerful enough to reach all source nodes (i.e. single-hop)
- Nodes must listen to the sink broadcast at the end of each decision interval and update their reporting rates
- A congestion-detection mechanism is required
34. Congestion Detection and Reliability Level
- Both done at the sink
- Congestion
- Nodes monitor their buffer queues and inform the sink if overflow occurs
- Reliability Level
- Calculated by the sink at the end of each interval, based on the packets received
35. ESRT Protocol Operation
- (NC, LR)
- (NC, HR)
- (C, HR)
- (C, LR)
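The slide leaves the per-region actions implicit; the sketch below keeps only the direction each region pushes the reporting frequency (the exact ESRT update equations differ in detail, so treat these formulas as illustrative, not the paper's):

```python
def next_rate(f: float, f_max: float, n: float, e: float) -> float:
    """Illustrative per-region reporting-rate update: increase when
    reliability is low without congestion, back off otherwise."""
    if f <= f_max and n < 1 - e:        # (NC, LR): increase aggressively
        return f / n
    if f <= f_max and n > 1 + e:        # (NC, HR): reduce cautiously to save energy
        return f * (1 + 1 / n) / 2
    if f > f_max and n > 1:             # (C, HR): congested but reliable, reduce
        return f / n
    if f > f_max:                       # (C, LR): congested and unreliable, cut hard
        return f / 2
    return f                            # OOR: hold the current rate
```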
36. ESRT Conclusion
- Reliability notion is application-based
- No delivery guarantees for individual packets
- Reliability and congestion control achieved by changing the reporting rate of the nodes
- Pushes all complexity to the sink
- Single-hop operation only
37. Code Distribution Introduction
- Nature of sensor networks
- Expected to operate for long periods of time
- Human intervention impractical or detrimental to the sensing process
- Nevertheless, code needs to be updated
- Add new functionality
- Incomplete knowledge of environment
- Predicting the right set of actions is not always feasible
- Fix bugs
- Maintenance
38. Approaches
- Transfer the entire binary to the motes
- Advantage
- Maximum flexibility
- Disadvantage
- High energy cost due to large volume of data
- Use a VM and transfer capsules
- Advantage
- Low energy cost
- Disadvantages
- Not as flexible as full binary update
- VM required
- Reliability is required regardless of approach
39. Papers
- A Remote Code Update Mechanism for Wireless Sensor Networks
- Trickle: A Self-Regulating Algorithm for Code Propagation and Maintenance in Wireless Sensor Networks
40. MOAP Overview
- Code distribution mechanism specifically targeted at Mica2 motes
- Full binary updates
- Multi-hop operation achieved through recursive single-hop broadcasts
- Energy and memory efficient
41. Requirements and Properties of Code Distribution
- The complete image must reach all nodes
- Reliability mechanism required
- If the image doesn't fit in a single packet, it must be placed in stable storage until the transfer is complete
- Network lifetime shouldn't be significantly reduced by the update operation
- Memory and storage requirements should be moderate
42. Resource Prioritization
- Energy: most important resource
- Radio operations are expensive
- TX: 12 mA
- RX: 4 mA
- Stable storage (EEPROM)
- Everything must be stored, and Write()s are expensive
- Memory usage
- Static RAM
- Only 4K available on current generation of motes
- Code update mechanism should leave ample space for the real application
- Program memory
- MOAP must transfer itself
- Large image size means more packets transmitted!
- Latency
- Updates don't respond to real-time phenomena
- Update rate is infrequent
- Can be traded off for reduced energy usage
43. Design Choices
- Dissemination protocol: How is data propagated?
- All at once (flooding)
- Fast
- Low energy efficiency
- Neighborhood-by-neighborhood (ripple)
- Energy efficient
- Slow
- Reliability mechanism
- Repair scope: local vs. global
- ACKs vs. NACKs
- Segment management
- Indexing segments and gap detection: memory hierarchy vs. sliding window
44. Ripple Dissemination
- Transfer data neighborhood-by-neighborhood
- Single-hop
- Recursively extended to multi-hop
- Very few sources in each neighborhood
- Preferably, only one
- Receivers attempt to become sources when they have the entire image
- Publish-subscribe interface prevents nodes from becoming sources if another source is present
- Leverage the broadcast medium
- If data transmission is in progress, a source will always be one hop away!
- Allows local repairs
- Increased latency
45. Reliability Mechanism
- Loss responsibility lies with the receiver
- Only one node to keep track of (the sender)
- NACK-based
- In line with IP multicast and WSN reliability schemes
- Local scope
- No need to route NACKs
- Energy and complexity savings
- All nodes will eventually have the same image
46. Retransmission Policies
- Broadcast RREQ, no suppression
- Simple
- High probability of successful reception
- Highly inefficient
- Zero latency
- Broadcast RREQ, suppression based on randomized timers
- Quite efficient
- Complex
- Latency and successful reception depend on the randomization interval
47. Retransmission Policies (cont'd)
- Broadcast RREQ, fixed reply probability
- Simple
- Good probability of successful reception
- Latency depends on probability of reply
- Average efficiency
- Broadcast RREQ, adaptive reply probability
- More complex than the static case
- Similar latency/reception behavior
- Unicast RREQ, single reply
- Smallest probability of successful reception
- Highest efficiency
- Simple
- Complexity increases if source fails
- Zero latency
- High latency if source fails
48. Segment Management: Discovering if a segment is present
- No indexing
- Nothing kept in RAM
- Need to read from EEPROM to find out if segment i is missing
- Full indexing
- Entire segment (bit)map is kept in RAM
- Look at entry i (in RAM) to find out if the segment is missing
- Partial indexing
- Map kept in RAM
- Each entry represents k consecutive segments
- Combination of RAM and EEPROM lookups needed to find out if segment i is missing
49. Segment Management (cont'd)
- Hierarchical full indexing
- First-level map kept in RAM
- Each entry points to a second-level map stored in EEPROM
- Combination of RAM and EEPROM lookups needed to find out if segment i is missing
- Sliding window
- Bitmap of up to w segments kept in RAM
- Starting point: last segment received in order
- RAM lookup
- Limited out-of-order tolerance!
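A minimal sketch of the sliding-window bookkeeping, including why the out-of-order tolerance is limited (data structure and names assumed, not MOAP's actual code):

```python
class SlidingWindow:
    """Track received segments with a w-entry bitmap anchored at the
    last segment received in order (hypothetical sketch)."""
    def __init__(self, w: int):
        self.w = w
        self.base = 0              # next segment expected in order
        self.bits = [False] * w    # bits[i] marks segment base+i

    def receive(self, seq: int) -> bool:
        """Record a segment; return False if it falls outside the window."""
        if seq < self.base:
            return True            # old duplicate, already have it
        if seq >= self.base + self.w:
            return False           # no tolerance past w out-of-order segments
        self.bits[seq - self.base] = True
        while self.bits[0]:        # slide past the in-order prefix
            self.bits.pop(0)
            self.bits.append(False)
            self.base += 1
        return True

    def missing(self, upto: int) -> list:
        """Sequence numbers not yet received below `upto` (gap detection)."""
        return [s for s in range(self.base, min(upto, self.base + self.w))
                if not self.bits[s - self.base]]
```

A RAM-only lookup suffices for any segment inside the window, which is the trade-off against the full-indexing schemes above.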
50. Retransmission Policies Comparison
51. Segment Management Comparison
52. Results: Energy efficiency
- Significant reduction in traffic when using Ripple
- Up to 90% for dense networks
- Full Indexing performs 5-15% better than Sliding Window
- Reason: better out-of-order tolerance
- Differences diminish as network density grows
53. Results: Latency
- Flooding is 5 times faster than Ripple
- Full Indexing is 20-30% faster than Sliding Window
- Again, the reason is out-of-order tolerance
54. Results: Retransmission Policies
- Order-of-magnitude reduction when using unicasts
55. Current Mote Implementation
- Using Ripple + sliding window with the unicast retransmission policy
- User builds code on the PC
- Packetizer creates segments out of the binary
- Mote attached to the PC becomes the original source and sends a PUBLISH message
- Receivers 1 hop away will subscribe if the version number is greater than their own
- When a receiver gets the full image, it will send a PUBLISH message
- If it doesn't receive any subscriptions for some time, it will COMMIT the new code and invoke the bootloader
- If a subscription is received, the node becomes a source
- Eventually, sources will also commit
56. Current Mote Implementation (cont'd)
- Retransmissions have higher priority than data packets
- Duplicate requests are suppressed
- Nodes keep track of their source's activity with a keepalive timer
- Solves the NACK last-packet problem
- If the source dies, the keepalive expiration will trigger a broadcast repair request
- Late-joiner mechanism allows motes that have just recovered from failure to participate in the code transfer
- Requires all nodes to periodically advertise their version
their version - Footprint
- 700 bytes RAM
- 4.5K bytes ROM
57. MOAP Conclusion
- Full binary updates over multiple hops
- Ripple dissemination reduces energy consumption significantly
- Sliding window method and unicast retransmission policy also reduce energy consumption and complexity
- Successful updates of images up to 30K in size
- Next steps
- Larger experiments
- Better late-joiner mechanism
- Verification phase
- Sending DIFFs instead of the full image
58. Trickle Overview
- State synchronization / code propagation mechanism
- Suitable for VM environments, where the transmitted code is small
- Uses "polite gossip" dissemination
- Periodic broadcasts of state summary
- Nodes overhear transmissions and stay quiet unless they need to update
- Goals
- Propagation: install new code
- Maintenance: detect propagation need
59. Basic Mechanism
- A node will periodically transmit information
- Only if fewer than a threshold number of neighbors have sent the same data
- Cells (neighborhoods) can be in two states
- All nodes up to date
- Update needed
- Node learns about new code
- Node detects neighbor with old code
- Since communication can be transmission or reception, ideally only one node per cell needs to transmit
- Similar to MOAP's ideal single-source scenario
60. Maintenance
- Time is split into periods
- Nodes pick a random slot from [0, T]
- Transmission occurs if a node has heard fewer than k other identical transmissions
- Otherwise, the node stays quiet
- k is small (usually 1 or 2)
- If a node detects a neighbor that is out of date, it transmits the newer code
- If a node detects it is out of date, it transmits its state
- Update is triggered when other nodes receive this transmission
- Nodes transmit at most once per period
- In the presence of losses, the scaling property is O(log n)
61. Maintenance and timesync
- When nodes are synchronized, everything works fine
- If nodes are out of sync, some might transmit before others have had a chance to listen
- "Short-listen" problem
- O(sqrt(n)) scaling
- Solution: enforce a listen-only period
- Pick a slot from [T/2, T] for transmission
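One maintenance period then reduces to picking a slot in the second half and applying the suppression counter; a minimal sketch (function shape assumed):

```python
import random

def maintenance_round(t: float, k: int, heard_identical: int) -> tuple:
    """One Trickle maintenance period: pick a transmission slot in
    [T/2, T] so the first half of the period is listen-only (avoiding
    the short-listen problem), and suppress the broadcast if at least
    k identical summaries were already overheard."""
    slot = random.uniform(t / 2, t)   # listen-only first half
    transmit = heard_identical < k    # polite-gossip suppression
    return transmit, slot
```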
62. Maintenance and timesync (cont'd)
63. Propagation
- Large T
- Low communication overhead (less probable to pick the same slot)
- High latency
- Small T: the reverse
- Solution: dynamic scaling of T
- Use two bounds, TL and TH
- When T expires, it doubles until it reaches TH
- When newer state is overheard, T = TL
- When older state is overheard, immediately send updates
- When new code is installed, T = TL
- Helps spread new code quickly
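The period-scaling rules above can be sketched in a few lines (parameter names assumed):

```python
def next_period(t: float, t_low: float, t_high: float,
                heard_newer_state: bool, installed_new_code: bool) -> float:
    """Dynamic scaling of the Trickle period T: reset to TL when newer
    state is overheard or new code is installed (spread updates fast),
    otherwise double up to TH (save energy when the cell is stable)."""
    if heard_newer_state or installed_new_code:
        return t_low
    return min(2 * t, t_high)
```

This is what makes Trickle both quiet when nothing changes and responsive when new code appears.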
64. Propagation Summary
65. Trickle Conclusion
- Efficient state synchronization protocol
- Fast
- Limits transmissions (localized flood)
- Scales well with network size
- Does not propagate code per se
- Instead, it notifies the system that an update is needed
- In many cases, determining when to propagate can be more expensive than the propagation itself
- Contrast with MOAP's simple late-joiner algorithm
66. The End!