Title: Flow Stats Module -- Control Design
1Flow Stats Module--Control Design
John DeHart
2SPP V1 LC Egress with 1x10Gb/s Tx
[Figure: LC egress block diagram with 1x10Gb/s Tx. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup, Hdr Format, TCAM, Flow Stats1, Flow Stats2, QM0-QM3, Port Splitter, 1x10G Tx1, 1x10G Tx2, TBUF, MSF, RTM, Stats (1 ME), SRAM1-SRAM3, XScale. Labeled paths: NAT Miss Scratch Ring, NAT Pkt return, Archive Records.]
3SPP V1 LC Egress with 10x1Gb/s Tx
[Figure: LC egress block diagram with 10x1Gb/s Tx. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup, Hdr Format, TCAM, Flow Stats1, Flow Stats2, QM0-QM3, Port Splitter, 5x1G Tx1 (P0-P4), 5x1G Tx2 (P5-P9), TBUF, MSF, RTM, Stats (1 ME), SRAM1-SRAM3, XScale. Labeled paths: NAT Miss Scratch Ring, NAT Pkt return, Archive Records.]
4Overview of Flow Stats
- 2 MEs in Fastpath to collect flow data for each pkt
- Byte counter per flow
- Pkt counter per flow
- Archive data to XScale via SRAM ring every 5 minutes
- XScale control daemon(s) to process data
- Receive flow information from MEs
- Reformat to put into PlanetFlow format
- Maintain databases for PlanetLab archiving and for identifying internal flows (pre-NAT translation) when an external flow (post-NAT) has a complaint lodged against it.
5SPP V1 LC Egress with 10x1Gb/s Tx
[Figure: LC egress block diagram with 10x1Gb/s Tx, repeated from the earlier slide. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup, Hdr Format, TCAM, Flow Stats1, Flow Stats2, QM0-QM3, Port Splitter, 5x1G Tx1 (P0-P4), 5x1G Tx2 (P5-P9), TBUF, MSF, RTM, Stats (1 ME), SRAM1-SRAM3, XScale. Labeled paths: NAT Miss Scratch Ring, NAT Pkt return, Archive Records.]
6Flow Record
- Total Record Size 8 32-bit words
- V is valid bit
- Only needed at head of chain
- 1 for valid record
- 0 for invalid record
- Start timestamp (16-bits) is set when record starts counting flow
- Reset to zero when record is archived
- End timestamp (16-bits) is set each time a packet is seen for the given flow
- Packet and Byte counters are incremented for each packet on the given flow
- Reset to zero when record is archived
- For TCP Flows, the TCP Flags are ORed in from each packet
- Next Record Number is next record in hash chain
- 0x1FFFF if record is tail
- Address of next record = (next_record_num * record_size) + collision_table_base_addr
LW0: Source Address (32b)
LW1: Destination Address (32b)
LW2: SrcPort (16b), DestPort (16b)
LW3: Protocol (8b), Slice ID (VLAN) (12b), Reserved (6b), TCP Flags (6b)
LW4: Next Record Number (17b), Reserved (14b), V (1b)
LW5: Packet Counter (32b)
LW6: Byte Counter (32b)
LW7: Start Timestamp (16b), End Timestamp (16b)
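The chaining rule above can be sketched in a few lines. Python is used only for illustration (the MEs are not programmed in Python). The 8-word record size and the 0x1FFFF tail marker come from the slide; byte addressing, the helper names, and the `next_num_at` callback are assumptions made for this sketch.

```python
RECORD_WORDS = 8                  # from the slide: 8 32-bit words per record
RECORD_SIZE = RECORD_WORDS * 4    # bytes, assuming byte-addressed memory
TAIL = 0x1FFFF                    # Next Record Number value that marks the tail


def next_record_addr(next_record_num, collision_table_base_addr):
    """Return the address of the next chained record, or None at the tail."""
    if next_record_num == TAIL:
        return None
    return next_record_num * RECORD_SIZE + collision_table_base_addr


def chain_addresses(first_next_num, next_num_at, collision_table_base_addr):
    """Yield the addresses of the records chained behind a head record.

    next_num_at(addr) is a hypothetical callback that returns the 17-bit
    Next Record Number stored in the record at addr.
    """
    num = first_next_num
    while num != TAIL:
        addr = next_record_addr(num, collision_table_base_addr)
        yield addr
        num = next_num_at(addr)
```

A head record whose Next Record Number is 0x1FFFF has no chained records, so `chain_addresses` yields nothing for it.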
7Archiving Hash Table Records
- Send all valid records in hash table to XScale for archiving every 5 minutes
- Set Command field to indicate FLOW_RECORD
- For each record in the main table (i.e. start of chain) ...
  - For each record in hash chain ...
    - If record is valid ...
      - If packet count > 0 then
        - Send record to XScale via SRAM ring
        - Set packet count to 0
        - Set byte count to 0
        - Leave record in table
      - If packet count == 0 then
        - Flow has already been archived
        - No packet has arrived on flow in 5 minutes
        - Record is no longer valid
        - Delete record from hash table to free memory
Info Sent to XScale for each flow every 5 minutes:
LW0: Source Address (32b)
LW1: Destination Address (32b)
LW2: SrcPort (16b), DestPort (16b)
LW3: Protocol (8b), Slice ID (VLAN) (12b), TCP Flags (6b), Command (6b)
LW4: Packet Counter (32b)
LW5: Byte Counter (32b)
LW6: Start Timestamp_high (32b)
LW7: Start Timestamp_low (32b)
LW8: End Timestamp_high (32b)
LW9: End Timestamp_low (32b)
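The archive pass described on this slide can be sketched as follows. This is an illustration in Python rather than ME code. The dictionary-based table, the record keys, and the `send_to_xscale` callback (standing in for the SRAM ring write) are all assumptions. Records that are not valid are simply left alone here, since the slide does not say what happens to them.

```python
def archive_pass(table, send_to_xscale):
    """Run one 5-minute archive pass over the hash table.

    table maps a bucket index to a list of records (the hash chain).
    Each record is a dict with 'valid', 'pkt_count' and 'byte_count' keys.
    """
    for bucket, chain in list(table.items()):
        kept = []
        for rec in chain:
            if not rec["valid"]:
                kept.append(rec)              # assumption: leave untouched
            elif rec["pkt_count"] > 0:
                # Send a snapshot tagged as a flow record, then reset counters.
                send_to_xscale(dict(rec, command="FLOW_RECORD"))
                rec["pkt_count"] = 0
                rec["byte_count"] = 0
                kept.append(rec)              # record stays in the table
            # else: no packet for a whole interval, so the record is deleted
        table[bucket] = kept
```

With this scheme a flow that goes quiet is reported once on the next pass and freed on the pass after that.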
8Sending Time Records to XScale
- ME precedes a series of Flow Records with a time record.
- Set Command field to indicate TIME_RECORD
- Time Record must be same size as Flow Record, currently 10 Words
Time Record Sent to XScale Preceding Flow Records:
LW0: Timestamp_high (32b)
LW1: Timestamp_low (32b)
LW2: Reserved (32b)
LW3: Reserved (26b), Command (6b)
LW4: Reserved (32b)
LW5: Reserved (32b)
LW6: Reserved (32b)
LW7: Reserved (32b)
LW8: Reserved (32b)
LW9: Reserved (32b)
9Overview of Flow Stats Control
- Main functions
- Collection of Flow Information for PlanetLab Node
- Used when a complaint is lodged about a misbehaving flow
- Must be able to identify flow and the Slice that produced it.
- Aggregation of Flow Information from
- Multiple GPEs
- Multiple NPEs
- Correlation with NAT records to identify internal flow and external flow
- External flow is what the complaint will be about.
- Internal flow is what the PlanetLab researcher involved will know about.
10Translations needed
- NPE Flow Records
- VLAN to SliceID
- Comes from SRM
- IXP timestamp to wall clock time
- SCD records wall clock time it started IXP
- How do we manage time slip between clocks?
- GPE Flow Records
- NAT Port translations
- Src Port from GPE record becomes SPP Orig Src Port
- Src Port from natd translation record becomes Src Port
- natd provides port translation updates
11Merging of DBs
- NPE Flows
- No NAT
- Goes directly into Ext PF DB
- SPP Orig Src Port = SrcPort
- Do they need SliceID translation?
- We use the VLAN, but this probably needs to be the PlanetLab version of a Slice ID.
- SRM will provide a VLAN to SliceID translation
- Where and When?
- GPE Configured Flows
- How do we identify configured flow pkt?
- Because they don't match a NAT Record?
- No NAT
- Goes directly into Ext PF DB
- SPP Orig Src Port = SrcPort
- GPE NAT Flows
- Find corresponding NAT Record, extract Translated SrcPort
- Insert record into Ext PF DB with original SrcPort moved to SPP Orig Src Port
- Set Src Port to translated SrcPort
- CP Traffic?
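The three merge cases can be sketched as one small function. This is an illustrative Python sketch only. The dictionary keys and the `nat_lookup` callable are hypothetical, and a real lookup would presumably also match on address, protocol, and time rather than on the source port alone.

```python
def to_ext_pf_record(flow, nat_lookup=None):
    """Build an Ext PF DB record from a flow record (a dict).

    nat_lookup is a hypothetical callable that maps an original source
    port to its translated source port, or returns None when no NAT
    record matches (NPE flows and GPE configured flows).
    """
    rec = dict(flow)
    translated = nat_lookup(flow["src_port"]) if nat_lookup else None
    # The original port is always kept in SPP Orig Src Port.
    rec["spp_orig_src_port"] = flow["src_port"]
    if translated is not None:
        rec["src_port"] = translated          # GPE NAT flow
    return rec
```

For flows without a NAT record the two port fields end up equal, matching "SPP Orig Src Port = SrcPort" above.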
12Overview of PlanetFlow
- PlanetFlow
- Unprivileged slice
- Flow Collector
- Ulogd (fprobe-ulog)
- Netlink socket
- Uses VSys for privileged operations
- Every 5 minutes dumps its cache to DB
- DB
- On PlanetLab Node
- 5-minute records
- Flows spanning 5-minute intervals aggregated daily.
- Central Archive
- At Princeton?
- Updated periodically by using rsync to retrieve new DB entries from ALL PlanetLab nodes.
13PlanetFlow Raw Data
- 0005 0011 8e10638b 48a40477 00062638
- 0000371d 0000 0000 80fc99cd 80fc99d3
- 00000000 0000 0004 0000000b 0000062d
- 8dae5570 8dae558b cc1f 01bb 00 1f 0600
- 0000 0000 02000000 80fc99cd 80fc99d3
- 00000000 0000 0004 0000001a 000008b7
- 8dae54eb 8dae5533 cc1e 01bb 001e 0600
- 0000 0000 02000000
NetFlow Header (beginning of file and repeats every 30 flow records): Uptime, Flow Sequence, Engine Type (unused), Engine Id (unused), Pad16 (unused)
NetFlow Flow Record (first): SA = 128.252.153.205, DA = 128.252.153.211, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count = 11, Byte Count = 1581, First Switched (flow creation time), Last Switched (time of last pkt), Src Port = 52255, Dst Port = 443, Pad, Tcp flags, Proto, Src Tos, Src As (Unused), Dst As (Unused), XID (SliceID)
NetFlow Flow Record (second): SA = 128.252.153.205, DA = 128.252.153.211, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count = 26, Byte Count = 2231, First Switched (flow creation time), Last Switched (time of last pkt), Src Port = 52254, Dst Port = 443, Pad, Tcp flags, Proto, Src Tos, Src As (Unused), Dst As (Unused), XID (SliceID)
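The dump above can be decoded with a short sketch. It assumes the standard 24-byte NetFlow v5 header and 48-byte flow record, with the last 32 bits of each record read as the XID, as the slide labels indicate. The dictionary key names are chosen here for illustration.

```python
import socket
import struct

# NetFlow v5 header (24 bytes) and flow record (48 bytes), big-endian.
HEADER = struct.Struct(">HHIIIIBBH")
RECORD = struct.Struct(">4s4s4sHHIIIIHHBBBBHHI")

# Header plus the first flow record, copied from the dump above.
RAW = bytes.fromhex(
    "0005 0011 8e10638b 48a40477 00062638 0000371d 0000 0000"
    "80fc99cd 80fc99d3 00000000 0000 0004 0000000b 0000062d"
    "8dae5570 8dae558b cc1f 01bb 00 1f 0600 0000 0000 02000000"
)


def parse_header(buf):
    """Decode the header fields at the start of buf."""
    (version, count, uptime_ms, unix_secs, unix_nsecs,
     flow_seq, engine_type, engine_id, _pad16) = HEADER.unpack_from(buf, 0)
    return {"version": version, "count": count, "uptime_ms": uptime_ms,
            "unix_secs": unix_secs, "unix_nsecs": unix_nsecs,
            "flow_seq": flow_seq, "engine_type": engine_type,
            "engine_id": engine_id}


def parse_record(buf, index):
    """Decode flow record number index (0-based) that follows the header."""
    f = RECORD.unpack_from(buf, HEADER.size + index * RECORD.size)
    return {"sa": socket.inet_ntoa(f[0]), "da": socket.inet_ntoa(f[1]),
            "pkts": f[5], "bytes": f[6], "first_ms": f[7], "last_ms": f[8],
            "src_port": f[9], "dst_port": f[10], "tcp_flags": f[12],
            "proto": f[13], "xid": f[17]}
```

Calling `parse_record(RAW, 0)` gives the addresses, counts, and ports listed for the first flow record.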
14SPP/PlanetFlow Raw Data
- 0005 0011 8e10638b 48a40477 00062638
- 0000371d xx yy 0000 80fc99cd 80fc99d3
- 00000000 0000 0004 0000000b 0000062d
- 8dae5570 8dae558b cc1f 01bb 00 1f 0600
- zzzz 0000 02000000 80fc99cd 80fc99d3
- 00000000 0000 0004 0000001a 000008b7
- 8dae54eb 8dae5533 cc1e 01bb 001e 0600
- zzzz 0000 02000000
NetFlow Header (beginning of file and repeats every 30 flow records): Uptime (msecs), Flow Sequence, SPP Engine Type (xx), SPP Engine Id (yy), Pad16 (unused)
NetFlow Flow Record (first): SA = 128.252.153.205, DA = 128.252.153.211, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count = 11, Byte Count = 1581, First Switched (msec) (flow creation time), Last Switched (msec) (time of last pkt), Src Port = 52255, Dst Port = 443, Pad, Tcp flags, Proto, Src Tos, SPP Orig Src Port (zzzz), Dst As (Unused), XID (SliceID)
NetFlow Flow Record (second): SA = 128.252.153.205, DA = 128.252.153.211, IPv4 NextHop (Unused), In SNMP (if_nametoindex), Out SNMP (if_nametoindex), Pkt Count = 26, Byte Count = 2231, First Switched (flow creation time), Last Switched (time of last pkt), Src Port = 52254, Dst Port = 443, Pad, Tcp flags, Proto, Src Tos, SPP Orig Src Port (zzzz), Dst As (Unused), XID (SliceID)
15Issues and Notes
- Time
- Keeping time in sync among various machines
- Flow Stats ME timestamps with IXP clock ticks.
- Something has to convert this to a Unix time.
- GPE(s) timestamps with Unix gettimeofday().
- CP collects flow records and aggregates based on time.
- Proposal
- XScale, GPE(s) and CP will use ntp to keep their Unix times in sync
- At the beginning of each reporting cycle, the Flow Stats ME should send a timestamp record so that the XScale and CP can keep the time in sync.
- OR: can the XScale read the IXP clock tick and report that to the CP along with the XScale's Unix time?
- What are the times that are recorded in the Header and Flow Records?
- Header
- Uptime (msecs): msecs since a base start time
- Time since Unix Epoch: time since January 1, 1970
- Unix secs
- Unix nSecs
- Uptime and Unix (secs, nSecs) represent the SAME time
- So that the Flow times can be calculated based on them.
- Flow Record
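Because the header's Uptime and Unix (secs, nSecs) fields describe the same instant, a flow's First/Last Switched value can be turned into a Unix time by subtraction. A minimal sketch, assuming the flow times are msec values on the same uptime clock as the header and ignoring 32-bit wraparound of the uptime counter (the function and parameter names are illustrative):

```python
def flow_unix_time(switched_ms, hdr_uptime_ms, hdr_unix_secs, hdr_unix_nsecs):
    """Convert a First/Last Switched value (msec of uptime) to Unix seconds."""
    hdr_time = hdr_unix_secs + hdr_unix_nsecs / 1e9
    # The flow event happened (uptime - switched) msec before the header time.
    return hdr_time - (hdr_uptime_ms - switched_ms) / 1000.0
```

A flow time equal to the header Uptime maps exactly to the header's Unix time, and earlier flow times map to correspondingly earlier Unix times.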
16Issues and Notes (continued)
- NetFlow Header
- Filled in AFTER 30 flow records are filled in OR we get a timeout (10 minutes)
- COUNT field tells how many flow records are valid.
- File or data packet is ALWAYS padded out to a size that would hold 30 flow records
- Flow Sequence: running total of number of flow records emitted.
- Flow Header and Flow Records
- Emitted in chunks of 30 flow records plus a Flow Header
- Emitted either by writing to a file or sending over a socket to a mirror site.
- Padded out to a size that would hold 30 flow records.
- A flow is emitted when it has been inactive for at least a minute or when it has been active for at least 5 minutes.
- Fprobe-ulog threads
- emit_thread
- scan_thread
- cap_thread
- unpending_thread
- Flow lists
- flows: hashed array of flows, buckets chained off head of list
- These are flows that have been reported over netlink socket
17Issues and Notes (continued)
- VLANs and SliceIDs
- NPE and LC use VLANs to differentiate Slices
- Flow records must record slice IDs
- SRM will provide VLAN to SliceID translation
- GPE(s) do not differentiate Slices by VLAN.
- All flows from a GPE will use the same VLAN
- GPE keeps flow records locally using Slice ID
- Flow Stats ME could ignore GPE flow packets if it was told what the default GPE VLAN was.
- Otherwise, one of the fs daemons could drop the flow records for the GPE flows that the Flow Stats ME reports.
- Slice ID
- What exactly is it?
- Is the XID that is recorded by PlanetFlow actually the slice id or is it the VServer id?
18Issues and Notes (continued)
- NAT Port Translations
- GPE flow records are the ones that need the NAT Port translation data
- GPE flow records will come across from the GPE(s) to the CP via rsync or similar
- natd will report NAT port translations with timestamps to the fs daemon
- fs daemon will have to maintain NAT port translations (with their timestamps) for possible later correlation with GPE flow records
- GPE(s) will all use the same default VLAN
- SRM will send this VLAN to scd so it can write it to SRAM for the fs ME to read in
- Fs ME will then filter out GPE flow records.
- SRM <-> fsd messaging
- srm will push out VLAN -> SliceID translation creation and deletion messages
- srm will wait 10 minutes before re-using a VLAN
- srm will send the delete VLAN message after waiting the 10 minutes.
- fsd should not have to keep any history of VLAN/SliceID translations
- It should get the creation before it receives any flow records for it
- It should get the last flow record before it gets the deletion
- fsd will also be able to query SRM for current translation
- This will facilitate a restart of the fsd while the SRM maintains current state.
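The fsd side of the VLAN to SliceID messaging might look like the sketch below. The class, its method names, and the `query_srm` callable are hypothetical. The sketch only illustrates the stated behavior: no history is kept, and fsd can fall back to querying SRM after a restart.

```python
class VlanSliceMap:
    """Current VLAN -> SliceID translations as seen by fsd (no history)."""

    def __init__(self, query_srm=None):
        self._map = {}
        # Hypothetical callable: vlan -> slice ID, or None if SRM has none.
        self._query_srm = query_srm

    def start_translation(self, vlan, slice_id):
        """Handle a creation message pushed by srm."""
        self._map[vlan] = slice_id

    def stop_translation(self, vlan, slice_id):
        """Handle a deletion message (sent after srm's 10 minute wait)."""
        if self._map.get(vlan) == slice_id:
            del self._map[vlan]

    def lookup(self, vlan):
        """Return the slice ID for vlan, asking SRM if it is not cached."""
        if vlan not in self._map and self._query_srm is not None:
            slice_id = self._query_srm(vlan)
            if slice_id is not None:
                self._map[vlan] = slice_id
        return self._map.get(vlan)
```

Because srm delays the delete message, a flow record for a VLAN should always find its translation still present when it arrives.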
19Issues and Notes (continued)
- rsync of flow record files from GPE(s) to CP
- A particular run of rsync may get a file that is still being written to by fprobe-ulog on the GPE
- A subsequent rsync may get the file again with additional records in it.
- Sample rsync command
- rsync --timeout 15 -avzu -e "ssh -i /vservers/plc1/etc/planetlab/root_ssh_key.rsa" root@drn02:/vservers/pl_netflow/pf /root/pf
- This will report the files that have been copied over
20Issues and Notes (continued)
- Sample fprobe-ulog command
- /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000
- Started from /etc/rc.d/rc2345.d/S56fprobe-ulog
- All linked to /etc/init.d/fprobe-ulog
- GPE Flow record collection daemon: fprobe-ulog
- Scan thread
- Collects flow records into a linked list
- Emit thread
- Periodically writes flow records out to a file
- Every 600 seconds (ten minutes)!
- Daemon can also send flow records to a remote collector!
- So we could have the GPEs emit their flow records directly to the flow stats daemon on the CP.
- Sample command
- /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000 <remote>:<port>/<local>/<type>
- There can be multiple remote host specifications
- Where
- remote: remote host to send to
- port: destination port to send to
- local: local hostname to use
21SPP PlanetFlow
[Figure: SPP PlanetFlow component diagram. Components shown: CP, fsd, srm, rsync, two GPEs each with fprobe, Ingress XScale, Egress XScale, scd, natd, MEs (HF, LK, FS2). Labeled paths: FlowStats SRAM Ring, NAT Scratch Rings.]
Central Archive Record: <time, sliceID, Proto, SrcIP, SrcPort, DstIP, DstPort, PktCnt, ByteCnt>
Ext PF DB Record: <Central Archive Record>
22Plan/Design
- Flow Stats daemon, fsd, runs on CP
- Collects flow records from GPE(s) and NPE(s) and writes them into a series of PlanetFlow2 files with names
- pf2.<N>, where <N> is (0-162)
- Current file is closed after N minutes, <N> is incremented, and a new file is opened and started.
- This mimics what fprobe-ulog does now on the GPE(s)
- These files are then collected periodically by PLC for use and archiving
- I don't think there is any explicit indication that PLC has picked up the files, but the timing must be such that we know it is done before we roll over the file names and overwrite an old file.
- Gets NAT data from natd
- Keep records of this with timestamps so we can correlate with flow records coming from GPE(s)
- Keep NAT records on a per Src IP Address basis.
- One set of NAT records per external interface
- Check with Mart on how this will work
- Gets VLAN to sliceID data from srm
- srm will send start translation, stop translation msgs with a 10 minute wait period when stopping a translation to make sure we are done with flow records for that slice
- FS ME archives records every 5 minutes.
- Slices are long lived (right?) so this should not be a problem
- Fsd can also request a translation from srm
- This is in case fsd has to be restarted while srm and other daemons continue running.
23Plan/Design (continued)
- Fsd gathers records from GPE(s) and NPE(s)
- Gathers flow records from GPE(s) via socket(s) from fprobe-ulog on GPE(s)
- Come across as one data packet with up to 30 flow records
- Packet is padded out to full 30 flow records with Count in Header indicating how many of them are valid
- Update NetFlow header to indicate that this is an SPP and which SPP node it is, using the Engine Type and Engine ID fields
- Update with NAT data and write immediately out to current pf2 file keeping its NetFlow header.
- Gathers flow records from NPE(s) via socket from scd on XScale
- Come across one flow record at a time
- No NetFlow Header
- Create NetFlow Header
- With appropriate Uptime and UnixTime (secs, nsecs)
- With SPP Engine Type and SPP Engine ID
- Modify Flow Record times to be msecs correlated with Uptime
- Update NPE flow record with SliceID from srm.
- Collect NPE records for a period of time or until we get 30 and then write them out to current pf2 file with NetFlow header.
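Batching NPE records into a padded chunk could be sketched as below. It assumes the standard 24-byte NetFlow v5 header layout and 48-byte flow records that have already been encoded. The function name, its parameters, and the use of zero bytes for padding are assumptions.

```python
import struct

MAX_RECORDS = 30                       # a chunk always has room for 30 records
RECORD_SIZE = 48                       # bytes per NetFlow v5 flow record
HEADER = struct.Struct(">HHIIIIBBH")   # 24-byte NetFlow v5 header


def build_chunk(records, uptime_ms, unix_secs, unix_nsecs,
                flow_seq, engine_type, engine_id):
    """Return header + records, zero-padded to the size of 30 records.

    records is a list of already-encoded 48-byte flow records.
    """
    if len(records) > MAX_RECORDS:
        raise ValueError("at most 30 flow records per chunk")
    if any(len(r) != RECORD_SIZE for r in records):
        raise ValueError("each flow record must be 48 bytes")
    # Count tells the reader how many of the 30 slots are valid.
    header = HEADER.pack(5, len(records), uptime_ms, unix_secs, unix_nsecs,
                         flow_seq, engine_type, engine_id, 0)
    padding = bytes((MAX_RECORDS - len(records)) * RECORD_SIZE)
    return header + b"".join(records) + padding
```

Every chunk produced this way is 24 + 30 * 48 = 1464 bytes, regardless of how many records are valid.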
24Plan/Design (continued)
- FS ME and scd
- Use a command field in records coming across from FS ME to scd
- Use one command to set current time
- When FS ME is starting an archive cycle, first it sends a timestamp command
- When scd gets this timestamp command it associates it with a gettimeofday() time and sends the FS ME time and the gettimeofday() time to the fsd on the CP so it can associate ME times with Unix times.
- Use another command to indicate flow records
- Flow records can be sent directly on to fsd on CP
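The association between ME times and Unix times could be kept in fsd roughly as follows. This is a sketch under stated assumptions: the tick rate is supplied by the caller (the slides do not give it), timestamps arrive as high/low 32-bit halves as in the records sent to the XScale, and the class and function names are hypothetical.

```python
def join64(high, low):
    """Combine the high and low 32-bit timestamp words into one value."""
    return (high << 32) | low


class MeClock:
    """Maps ME timestamps to Unix time using the latest timestamp command."""

    def __init__(self, ticks_per_second):
        self.ticks_per_second = ticks_per_second   # assumption: known to fsd
        self.ref_ticks = None
        self.ref_unix = None

    def on_time_record(self, me_ticks, unix_time):
        """Store the (ME time, gettimeofday() time) pair forwarded by scd."""
        self.ref_ticks = me_ticks
        self.ref_unix = unix_time

    def to_unix(self, me_ticks):
        """Convert a flow record's ME timestamp to Unix seconds."""
        if self.ref_ticks is None:
            raise RuntimeError("no timestamp command received yet")
        return self.ref_unix + (me_ticks - self.ref_ticks) / self.ticks_per_second
```

Refreshing the reference pair at the start of every archive cycle limits how much drift between the IXP clock and the Unix clock can accumulate.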
25Data to fsd
- srm -> fsd
- Start_vlan_to_sliceId_translate(vlan, sliceId)
- Stop_vlan_to_sliceId_translate(vlan, sliceId)
- scd -> fsd
- Timestamp command
- ME Timestamp
- Unix time
- flowRecord(Saddr, Daddr, Sport, Dport, tcpFlags, VLAN, protocol, pktCnt, ByteCnt, startTimeStampHigh, startTimestampLow, endTimestampHigh, endTimestampLow)
26Data to fsd (continued)
- natd -> fsd
- startNatTranslation(Saddr, Daddr, internalPort, externalPort, protocol, srcMAC, timeStampHigh)
- stopNatTranslation(Saddr, Daddr, internalPort, externalPort, protocol, srcMAC, timeStampHigh)
- gpe -> fsd
- NetFlow Header
- 30 NetFlow Flow Records