Title: Learning Communication Rules
1. Learning Communication Rules
Srikanth Kandula
Ranveer Chandra and Dina Katabi
2. Network Admins Are Groping in the Dark
- Focus on Traffic Volume
- TCP 80%, HTTP 30%
- Adapt report categories (e.g., AutoFocus)
- Much traffic from ports 500-600
- But what's going on?
- Does traffic follow the plan?
- Misconfigurations
- Suspicious Traffic
Besides focusing on volume, learn the rules underlying the traffic
- (Active) user browsing the web, reading/sending mail
- (Automatic) SMS scan on a network, Outlook refresh
3.
[Figure: timeline of occurrences of flow X and flow Y over time t]
Whenever flow Y happens, flow X is likely to occur
Rule: flow Y ⇒ flow X (e.g., http ⇒ DNS)
If you could learn such rules directly from a trace,
- Infer the actual behavior of applications
- AFS root servers direct traffic to volume servers evenly
- mail to the incoming MX is forwarded onto group MXes
- Notice misconfigurations and badness
- these clients should not be talking on known command-control ports
- this server should not be responding to DHCP requests
- this mail server should not attempt connections to non-existent MXes
4. Report all significant rules with no specific knowledge about a trace
5. Mining for Rules is Hard
- How to define significance?
- When is a group of flows interesting enough to report?
- Avoid observer bias but cannot evaluate everything
- Focus on one server, miss what you are not looking for
- Practical: deal with noise, search quickly
- eXpose
- A scoring function for significance
- Heuristics that bias search toward high hit-rate
- Empirical validation on enterprise traces
6. Overview
[Figure: Packet Trace → Activity Matrix (rows time1 … timeR, columns flow1 … flowK) → Rules]
- Packet trace to Activity Matrix
- Rows are 1s windows; columns are flows
- Is the flow active in [time_(i-1), time_i)? (at least one packet)
- Association rule mining (X, Y are r.v. for columns)
- Need not worry about interleaving
- Dependencies are at these time-scales (an RTT, a server response)
All windows in the [.25s, 2s] range yield similar rules
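The packet-trace-to-activity-matrix step above can be sketched as follows. This is a minimal illustration, not the eXpose implementation: the `(timestamp, flow_id)` input format, the flow labels, and the function name `activity_matrix` are assumptions made for the example.

```python
def activity_matrix(packets, window=1.0):
    """Build a binary activity matrix from (timestamp, flow_id) pairs.

    Row i covers [t0 + i*window, t0 + (i+1)*window). An entry is 1 when
    the flow sent at least one packet in that window, else 0.
    The input format is an assumption made for this sketch.
    """
    if not packets:
        return [], []
    t0 = min(ts for ts, _ in packets)
    flows = sorted({flow for _, flow in packets})
    col = {flow: j for j, flow in enumerate(flows)}
    n_rows = int((max(ts for ts, _ in packets) - t0) // window) + 1
    matrix = [[0] * len(flows) for _ in range(n_rows)]
    for ts, flow in packets:
        matrix[int((ts - t0) // window)][col[flow]] = 1
    return flows, matrix


# Hypothetical trace: a client's DNS and HTTP flows over three 1s windows.
packets = [
    (0.0, "client:dns"),
    (0.2, "client:http"),
    (1.4, "client:http"),
    (2.1, "client:dns"),
    (2.3, "client:http"),
]
flows, matrix = activity_matrix(packets)
```

Each column of `matrix` is then one binary random variable for rule mining; changing `window` is how one would check that nearby window sizes yield similar rules.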
7. Which Rules are Significant?
X ⇒ Y
- High Joint Probability?
- X, Y may occur very often individually (e.g., breeze, sun shining)
- High Conditional Probability?
- Say Y occurs only when X does, but both are rare (lottery, buy a jet)
8. Which Rules are Significant?
X ⇒ Y
- High Joint Probability?
- High Conditional Probability?
- We use mutual information (combines the two)
Measures fraction of change in Y due to X
Score = 0, if Y is independent of X
Score = Max, if Y is fully dependent on X
Trades off dependency and frequency
Encodes Directionality
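A minimal sketch of such a score, assuming the standard J-measure over two binary activity columns. The backup slides name a "Modified JMeasure" but do not give its formula, so the function below and its `positive_only` flag (which drops the term driven by P(¬Y | X)) are illustrative assumptions rather than the exact eXpose score.

```python
import math


def jmeasure(x, y, positive_only=False):
    """J-measure-style score for the rule X => Y over binary columns x, y.

    score = P(x,y) * log2(P(y|x) / P(y)) + P(x,~y) * log2(P(~y|x) / P(~y))
    With positive_only=True the second (negative-correlation) term is
    dropped; that flag is an assumption made for this sketch.
    """
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(1 for a, b in zip(x, y) if a and b) / n
    p_x_noty = sum(1 for a, b in zip(x, y) if a and not b) / n

    def term(p_joint, p_cond, p_prior):
        # Zero-probability events contribute nothing to the score.
        if p_joint == 0 or p_prior == 0:
            return 0.0
        return p_joint * math.log2(p_cond / p_prior)

    if p_x == 0:
        return 0.0
    score = term(p_xy, p_xy / p_x, p_y)
    if not positive_only:
        score += term(p_x_noty, p_x_noty / p_x, 1 - p_y)
    return score


independent = jmeasure([1, 1, 0, 0], [1, 0, 1, 0])
dependent = jmeasure([1, 1, 0, 0], [1, 1, 0, 0])
disjoint = jmeasure([1, 1, 0, 0], [0, 0, 1, 1])
disjoint_positive = jmeasure([1, 1, 0, 0], [0, 0, 1, 1], positive_only=True)
```

In this toy data the independent pair scores 0 and the fully dependent pair scores highest. The disjoint pair scores just as high unless the negative term is dropped, which is the negative-correlation problem the next slide raises.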
9. Modifying Scores for Networking
- Negative Correlation
- Flows with little overlap
P(¬Y | X) ≈ 1 leads to a high score
10. Modifying Scores for Networking
- Negative Correlation
- Flows with little overlap
- Long Running Flows
- Large downloads, ssh/remote desktop
- Trivial overlaps with long flow
- Distinguish new vs. present
- Present rules reported only if small mismatch in freq.
- Too Many Possibilities
- Bias, focus on pairs with at least one common IP
- Miss rules, but hit-rate up 1000x and costs down 10x
P(Y | X) ≈ 1
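The "at least one common IP" bias can be sketched as a filter over candidate flow pairs. The `(src_ip, dst_ip, dst_port)` flow tuple, the addresses, and the name `candidate_pairs` are assumptions for illustration; the slides do not specify a flow representation.

```python
from itertools import combinations


def candidate_pairs(flows):
    """Yield pairs of flows that share at least one endpoint IP.

    Each flow is a hypothetical (src_ip, dst_ip, dst_port) tuple.
    Pairs with no common IP are never scored, which shrinks the search
    space at the price of missing some rules.
    """
    for a, b in combinations(flows, 2):
        if {a[0], a[1]} & {b[0], b[1]}:
            yield a, b


flows = [
    ("10.0.0.1", "10.0.0.53", 53),  # client -> DNS
    ("10.0.0.1", "10.0.0.80", 80),  # same client -> web proxy
    ("10.0.0.2", "10.0.0.99", 22),  # unrelated ssh session
]
pairs = list(candidate_pairs(flows))
```

Only the DNS and web flows of the shared client are paired here; the unrelated ssh flow is never considered.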
11. Generics
- Miss, if no client accesses server often
- Rules that abstract away parts of a flow
[Figure: Client, Server, Database]
Client-Server ⇒ Server-Database (any client)
[Figure: Client, Reservation server, Kerberos]
Client-Rsrv. ⇒ Client-Kerberos (any client, but same on both sides)
- To do this automatically,
- what to abstract? (IP addresses at non-server port)
- which pairs to consider for rule?
- flows match IP, generics match abstracted IP
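A simplified sketch of abstracting the IP address at a non-server port. The set of server ports, the `((ip, port), (ip, port))` flow layout, the `"*"` wildcard, and the name `to_generic` are all assumptions for this example. Choosing which pairs of generics to consider, and keeping the client "the same on both sides", is not covered here.

```python
SERVER_PORTS = {25, 53, 80, 88}  # assumption: a hand-picked set of server ports


def to_generic(flow, server_ports=SERVER_PORTS):
    """Replace any endpoint on a non-server port with a ('*', '*') wildcard.

    flow is a hypothetical ((ip, port), (ip, port)) pair. Endpoints on a
    server port are kept as they are.
    """
    return tuple(
        (ip, port) if port in server_ports else ("*", "*")
        for ip, port in flow
    )


# Two different clients talking to the same web server on port 80.
flow_a = (("10.0.0.1", 51234), ("10.0.0.80", 80))
flow_b = (("10.0.0.2", 40000), ("10.0.0.80", 80))
generic_a = to_generic(flow_a)
generic_b = to_generic(flow_b)
```

Both concrete flows collapse to the same generic, so their activity can be pooled even when no single client accesses the server often.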
12. Mining for Rules
- Techniques extend to arbitrary sized rules
- Instead,
- Focus on pair-wise rules (simpler is likelier)
- Group similar rules
- Eliminate weak rules between strongly connected groups
- Transitive closure to read off clusters
Cost: O(f^2) for pair-wise rules vs. O(f^(n+1)) for rules of size n
Recursive Spectral Partitioning (VKV00)
[Figure labels: Rule Score, Rule Mining]
Digests 10^5-10^6 flows into 10^2-10^3 rule clusters
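Reading clusters off by transitive closure can be sketched with a union-find over the surviving rules. The fixed `min_score` cut-off below is a simple stand-in for eliminating weak rules; it is not the recursive spectral partitioning cited on the slide. The rule tuples and scores are made up for the example.

```python
def rule_clusters(rules, min_score):
    """Group flows connected by rules scoring at least min_score.

    rules is an iterable of (flow_x, flow_y, score). Flows that appear
    only in weak rules are omitted. The threshold replaces spectral
    partitioning in this sketch.
    """
    parent = {}

    def find(f):
        parent.setdefault(f, f)
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    for x, y, score in rules:
        if score >= min_score:
            parent[find(x)] = find(y)

    clusters = {}
    for f in list(parent):
        clusters.setdefault(find(f), set()).add(f)
    return sorted(clusters.values(), key=lambda c: sorted(c))


rules = [
    ("http", "dns", 0.4),
    ("dns", "kerberos", 0.3),
    ("ssh", "x11", 0.2),
    ("http", "ssh", 0.001),  # weak rule bridging two groups
]
clusters = rule_clusters(rules, min_score=0.01)
```

Dropping the weak http-ssh rule keeps the two groups apart; without the cut-off, transitive closure would merge them into one cluster.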
13. Recap: eXpose Mines for Rules
[Figure: Packet Trace → Activity Matrix (rows time1 … timeR, columns flow1 … flowK, entries present/new) → Rules (flow_i.new ⇒ flow_j.present, ...) → Rule Clusters]
Contributions
- Learn all significant rules without prior knowledge
- Scoring function for rule significance
- Avoids observer bias, yet stays feasible by focusing on high hit-rate
- Algorithms to mine and prune
14. Related Work
- Semi-Automated Discovery of App. Session Structure (KJPK06)
- Sherlock (Diagnosing Performance Problems, BCGKMZ07)
- Autofocus (ESV03)
- BLINC (KPF05)
- Stepping Stones (ZP00)
- Learn all significant rules without prior knowledge
- Avoids observer bias, yet stays feasible by focusing on high hit-rate
- Scoring function for rule significance
- Algorithms to mine and prune
15. Results
16. Evaluation Setup
Inside Microsoft
Before CSAIL's Servers
Access Link of Conf. LANs
CSAIL's Access
- Traces at access and internal server-facing links
- Packet Headers, Connection Records (Bro), some anon.
- Operational n/w with on the order of 10^3 clients, diverse traffic mix
- Corroborated on test-bed traffic vetted by admins.
- Ran eXpose on a 2.4GHz x86 with 8GB RAM
17. Rules Discovered by eXpose
- Dependencies for Major Applications
email @ microsoft
Client. PFS1.X
Client. PFS2.X
Client. Proxy.80
Client. DC.88
Client. Mail.X
Client. Mail.135
18. Rules Discovered by eXpose
- Dependencies for Major Applications
afs @ csail
AFS1.7000 Root.7002
C.7001 .
C.7001 AFS2.7000
C.7001 Root.7003
C.7001 AFS1.7000
19. Rules Discovered by eXpose
- Dependencies for Major Applications
- web, e-mail, file-servers, IM, print, video broadcast
web @ microsoft
Proxy3.80 .
Proxy2.80 .
Proxy1.80 .
Proxy4.80 .
20. Rules Discovered by eXpose
- Dependencies for Major Applications
- web, e-mail, file-servers, IM, print, video broadcast
- Configuration Errors and Other Badness
smtp and IDENT @ csail
Client.113 MailServer.
Client. MailServer.25
21. Rules Discovered by eXpose
- Dependencies for Major Applications
- web, e-mail, file-servers, IM, print, video broadcast
- Configuration Errors and Other Badness
- IDENT, Legacy emails, ssh scans, wingate
Legacy email ids @ csail
UnivMail. Old1.25
UnivMail. Old3.25
UnivMail. Old2.25
22. Rules Discovered by eXpose
- Dependencies for Major Applications
- web, e-mail, file-servers, IM, print, video broadcast
- Configuration Errors and Other Badness
- IDENT, Legacy emails, ssh scans, wingate
- Rules for stuff we didn't know before
Nagios monitors @ csail
23. Rules Discovered by eXpose
- Dependencies for Major Applications
- web, e-mail, file-servers, IM, print, video broadcast
- Configuration Errors and Other Badness
- IDENT, Legacy emails, ssh scans, wingate
- Rules for stuff we didn't know before
- Nagios, LLMNR, iTunes
Link-Local Multicast Name Resolution @ hotspots
H.137 Wins.137
H. Multicast.5355
H. DNS.53
Black box: little prior knowledge about servers, applications, or users ⇒ can evolve
24. Correctness and Completeness
- False Positives
- 13% of rule-clusters in the CSAIL trace we couldn't explain
- False Negatives
- Main CSAIL Web Server (too many different activities)
- Dependencies on Personal Web Pages (too little traffic)
- PlanetLab Traffic (punted)
- Other Limitations
- IPSec, Anonymized, Cover Traffic
- Extensions
- Rules repeat over time, and across traces
- Application whitelisting, Customize Generics
25. Time to Mine for Rules
[Figure: time to mine rules for each trace; flows per trace (x 10^6): .6, .2, .6, .9, 2.8]
At CSAIL's access link, high fan-out with many distinct flows
Stream Mining Appears Feasible!
26. eXpose
[Figure: Packet Trace → Rules for frequently reoccurring flow sets]
- Learn all significant rules with no specific knowledge
- Avoids observer bias, but feasible by focusing on high hit-rate
- Scoring function for rule significance
- Algorithms to mine and prune
- Empirical validation on enterprise traces
- found configurations and protocols that we didn't know existed
- learnt rules for actual behavior of applications
- found config. errors, bot scans, infected machines
http://research.microsoft.com/srikanth
27. Backup
28. Expanding Search Space (# of flows)
[Figure: # of Discovered Rules vs. Rule Score (Modified JMeasure)]
exposes few significant rules!
29. Expanding Search Space (# of flows)
[Figure: Time to Mine Rules (s) and Memory Footprint (million rules) vs. Top Active Flows]
exposes few rules, costs a lot in time, memory
30. Varying Size of Time Windows
[Figure: # of Discovered Rules vs. Rule Score (Modified JMeasure)]
All window sizes in [.25s, 2s] produce similar rules!
31. For all rules X ⇒ Y
[Figure: Joint Probability, Prob.(X), Prob.(Y)]