Title: Network Anomalies: Origins, Structure and Diagnosis
1. Network Anomalies: Origins, Structure and Diagnosis
- Anukool Lakhina
- Oral Qualifier Exam
- Committee: Azer Bestavros, John Byers, Mark Crovella (advisor)
2. Defining Anomalies
- anomaly [Merriam-Webster, 2004]
- deviation from the common rule; an irregularity
- something different, abnormal, peculiar, or not easily classified
- Anomalies are unusual events, relative to some baseline "normal" state
- Anomalies are relative events → difficult to pin down precisely
3. Network Anomalies
- We want to capture unusual events that network operators care about
- Network anomalies are unexpected events that can adversely impact the availability and performance of networks
4. Examples
- Survey of problem reports to NANOG
- 1994-99: 241 serious problems reported
- 2001-04: 375 problems of the same kinds reported
- Routing loops, instability, hijacks, filtering, blackholes, outages, etc. have not declined in 10 years! [Feamster04]
5. More Examples
6. Talk Organization
- Origin of Anomalies
- How and why do anomalies arise?
- Structure of Anomalies
- How do anomalies differ?
- How can we organize them?
- Diagnosing Anomalies
- Strategies to detect, identify, and mitigate anomalies
- This talk focuses on detection and identification
7. Origin of Anomalies
- Operational events
- Misconfigurations, accidents, ...
- Design/implementation consequences
- Software bugs, unexpected protocol interactions, ...
- Network abuse (malicious intent)
- DOS attacks, worms, intrusions, ...
- Unusual end-user behavior (not malicious)
- Flash crowds, peer-to-peer traffic, ...
8. Operational Anomalies
- Broadly arise due to operator error and accidents
- Famous example: the AS 7007 incident
- In 1997, a small ISP announced routes for the entire Internet
- Most ISPs believed the announcements
- Result: all Internet traffic sent to one router for several hours, disrupting Internet connectivity to hundreds of networks [Farrow02]
- BGP configuration errors are pervasive [Mahajan02]
- Cause erroneous updates to 0.1-1% of the BGP table each day; 4% of these can cause an outage
- Many errors are due to primitive router configuration
- A distributed program without the ability to compile or test [Feamster03, Caldwell03]
9. Operator error is not exclusive to the Internet
Operator error is the single largest contributor to failures in Tandem transaction processing systems, accounting for 42% of all failures. [Gray85]
10. Operator error is prevalent in yet more domains
- Dominant in large Internet services [Oppenheimer03]
- Studied failure data from 3 geographically distributed Internet services
- 51% of failures due to operator error
- Dominant in Public Switched Telephone Networks
- [Kuhn97] studied FCC disruption reports from 1992-1994
- 50% of outages due to operator error
- 54% in 2000 [Enriquez02]
11. Reasons for Human Error
- Humans make errors, even when they know what they are doing
- Because understanding the state of large, tightly-coupled systems is difficult
- Humans are not good at diagnosing problems from first principles
- Especially in an emergency
- Automation solves common tasks, leaving the rare, complex ones for operators
- Automation irony: poor automation reduces system visibility → harder for operators to diagnose anomalies
- Lessons: errors will always occur; automation should work in synergy with the operator
[Reason90]
12. Design and Implementation Flaws
- Anomalies that arise due to implementation bugs or from flawed design. Examples:
- 1988: Internet congestion collapse [Jacobson88]
- 2004: Unexpected protocol interactions, e.g., interaction between inter-domain and intra-domain routing
- Design goal: isolate the Internet from routing changes within an AS
- Reality: small changes in internal routing weights can cause large traffic shifts in neighbor networks
- Result: operators set IGP metrics that make BGP sense, rather than view them separately → more complexity
- Lesson: routing protocols were not designed with interactions, network design and configurations, and dynamics in mind [Teixeira04, Teixeira05]
13. Reasons for Design Flaws
- Latent design errors are exposed in stress conditions
- Congestion collapse, routers reboot when overloaded
- Correlated failures occur frequently
- Evidence in IP networks [CBI04] and in Internet-scale distributed services [Yalagandula04]
- Traditional fault-tolerant designs assume independent failures
- Margin of Safety (in Civil Engineering)
- 25% of all railroad bridges failed between 1850 and the 1890s! [Petroski92]
- What is the equivalent of a margin of safety for computer systems and networks? [Patterson02]
[Perrow90, Petroski92]
14. Abuse Anomalies
- Defining characteristic: arise from malicious intent and violate the target's confidentiality, integrity, and availability [RS91]
- Examples
- Intrusions target confidentiality
- Route hijacks target integrity
- DOS attacks target availability
15. Reasons for Abuse Anomalies
- Technical enablers
- Unrestricted connectivity, platform homogeneity, anonymity, few defenses
- Attacker uses automation to target all systems at once; defender must defend all systems
- Traditional threat: attacker targets a high-value target, defender allocates more resources to defend it
- Economic motivations
- Profit: spam forwarding, extortion, ...
- Emerging marketplace: can buy and sell zombie machines
- Bad guys now have a financial incentive to get better [Savage05]
- Political reasons
- Cyber-terror, cyber-warfare, political protest
[Weaver04]
16. Unusual End-User Behavior
- Not malicious in intent, but can have harmful impact on availability and performance
- Important to manage these anomalies to provision network resources
- e.g., high-rate flows, peer-to-peer traffic, measurement experiments, flash crowds
- Flash crowds: unusual demand for a resource
- e.g., Starr report, the slashdot effect, ...
17. Flash Crowds hit MSNBC.com
"MSNBC is experiencing high site traffic. We have temporarily moved your personalized news to a separate page: click here"
[Figure: MSNBC.com homepage during a flash crowd]
18. Lessons from Anomaly Origins
- Network anomalies span a broad range, are prevalent, and can have catastrophic impact
- Anomalies are here to stay; new anomalies will arise
- Cannot prevent anomalies, but can try to accommodate them
- Despite anomalies, the PSTN still had 99.999% availability. Two reasons:
- 1) Error detection built into the design: designers devote half of the software in telephone switches to error detection and correction [Kuhn97]
- 2) Quick and correct human intervention (and capabilities to intervene)
19. Talk Organization
- Origin of Anomalies
- How and why do anomalies arise?
- Structure of Anomalies
- How do anomalies differ?
- How can we organize them?
- Diagnosing Anomalies
- Strategies to detect, identify, and mitigate anomalies
- This talk focuses on detection and identification
20. A Common Feature: Anomalies Create Unusual Network Traffic
- Despite their diversity, many serious anomalies create unusual traffic
- e.g., DDOS, flash crowds, outages, scans, worms
- Some anomalies may not affect traffic until they become serious, e.g., dormant configuration or design errors that result in an outage event
- Challenges
- How do we mine anomalies in traffic?
- What type of traffic to analyze?
[Figure labels: All Anomalies; Anomalies visible in traffic]
21. Anomalies by Layer
- Application Layer: generates and interprets data
- Transport Layer: reliable data transfer (TCP)
- Network Layer: address assignment and routing (BGP, IGP, ...)
- Physical Layer: MAC addressing and bit transmission
[Figure: Layered TCP/IP Model]
22. Abuse Examples, by Layer
- DDOS floods
- TCP attacks
- Route hijacks
- MAC flooding
- MAC flooding in detail
- Overwhelm the address-to-physical-port mappings at a switch with spoofed packets
- Switch enters fail-open mode → all incoming traffic is now broadcast
- Result: eavesdropping on legitimate network traffic
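The MAC flooding mechanism can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: `ToySwitch`, its fixed `capacity`, and the host names are all hypothetical, and real switches also age out table entries, which is ignored here. It only shows why a full address table causes frames for a legitimate host to be flooded to every port, including the attacker's.

```python
class ToySwitch:
    """Toy learning switch with a fixed-size MAC table (hypothetical model)."""

    def __init__(self, ports, capacity):
        self.ports = list(ports)
        self.capacity = capacity
        self.mac_table = {}  # source MAC -> port it was last seen on

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source address, then return the ports the frame goes to."""
        # Learn (or refresh) the source only if the table has room for it.
        if src_mac in self.mac_table or len(self.mac_table) < self.capacity:
            self.mac_table[src_mac] = in_port
        # Known destination: forward to a single port.
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in self.ports if p != in_port]


switch = ToySwitch(ports=[1, 2, 3], capacity=4)

# The attacker on port 3 fills the table with spoofed source addresses.
for i in range(4):
    switch.receive(src_mac=f"spoofed-{i}", dst_mac="nobody", in_port=3)

# Host B (port 2) can no longer be learned, so A's frame to B is flooded
# and also reaches the attacker's port.
switch.receive(src_mac="host-b", dst_mac="host-a", in_port=2)
out_ports = switch.receive(src_mac="host-a", dst_mac="host-b", in_port=1)
```

Without the spoofed frames, the same exchange would deliver A's frame to port 2 only, since host B would have been learned.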
Deny service to the victim by 1) overwhelming a resource (DOS, MAC flooding), 2) masquerading as a resource (hijacks), or 3) timing attacks (TCP attacks, RoQ).
RoQ attacks are timing-based attacks that target adaptation mechanisms instead of the victim directly, and are thus capable of attacking at multiple layers [Guirguis04, Guirguis05]
23. Anomaly Origin and Structure
24. Examples by Origin and Structure
25. Multi-Layer Propagation
- Anomalies travel across layers
- Each layer has containment capability (e.g., checksums)
- Going up: if error checks fail to contain anomalies, e.g., link failures
- Going down: e.g., routers reboot on overload; the Nimda worm caused routing instability [Andersen04, Wang02]
- Multi-layer anomalies
- Anomalies at multiple layers
- e.g., spammers hijack route prefixes (network layer) to send spam at the application layer [Bellovin01]
- Makes diagnosis challenging
- Especially root-cause analysis
26. Talk Organization
- Origin of Anomalies
- How and why do anomalies arise?
- Structure of Anomalies
- How do anomalies differ?
- How can we organize them?
- Diagnosing Anomalies
- Strategies to detect, identify, and mitigate anomalies
- This talk focuses on detection and identification
27. Anomaly Diagnosis
- Key ingredients of anomaly diagnosis
- Detection: stating when an anomaly has occurred or is occurring
- Identification: isolating the anomaly from normal, stating its type, and where possible, exposing its structure and origin
- Diagnosis also includes
- Mitigation: avoiding, managing, and controlling the adverse impact of anomalies
- Lots of research here, most centered on re-design of protocols and architectures
- Not in this talk
28. Identification Approaches
- Anomaly identification: given a detected anomaly, what is its structure and cause?
- Problems
- Origin and structure are typically not observable
- Evidence is partial, ambiguous, inconsistent
- Strategies
- Model-driven: e.g., traversing system dependency models
- Rule-based: e.g., expert systems, case-based reasoning
- Learning-based: e.g., clustering and classification
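As a minimal sketch of the model-driven strategy, the snippet below walks a dependency model from the components that show symptoms down to what they depend on, and keeps the components that could explain every symptom. The graph `DEPENDS_ON`, the component names, and the helper `candidate_causes` are illustrative assumptions, not taken from the talk.

```python
from collections import deque

# Hypothetical dependency model: component -> components it depends on.
DEPENDS_ON = {
    "web": ["dns", "router_a"],
    "mail": ["dns", "router_b"],
    "dns": ["router_a"],
    "router_a": ["link_1"],
    "router_b": ["link_1"],
    "link_1": [],
}


def reachable(component, depends_on):
    """Return the component plus everything it transitively depends on."""
    seen = {component}
    queue = deque([component])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def candidate_causes(symptomatic, depends_on):
    """Components whose failure could explain every observed symptom."""
    closures = [reachable(c, depends_on) for c in symptomatic]
    if not closures:
        return []
    return sorted(set.intersection(*closures))


candidates = candidate_causes(["web", "mail"], DEPENDS_ON)
```

The result is a set of candidates rather than a single root cause, which reflects the point above that evidence is partial and ambiguous; narrowing it further would need additional observations.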
29. Detection Approaches
- 1) Anomalies as known signatures to match
- Key challenge: define a broad set of signatures (without causing false alarms)
- Advantage: identification for free
- Problem: cannot detect new anomalies
- 2) Anomalies as deviations from normality
- Key challenge: define a notion of normality
- Advantage: can detect new anomalies
- Problem: identification is difficult
- 3) Hybrid schemes
- Match the anomaly against known signatures; if no match, then check for deviations from normality
- this talk
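A hybrid scheme of this kind might be sketched as follows. Everything concrete here is an assumption made for illustration: the feature names (`packets`, `syn_ratio`, `distinct_ports`), the two signatures and their thresholds, and the use of a simple mean and standard deviation over past packet counts as the notion of normality.

```python
from statistics import mean, stdev

# Hypothetical signatures: name -> predicate over a dict of traffic features.
SIGNATURES = {
    "syn_flood": lambda f: f["syn_ratio"] > 0.9 and f["packets"] > 10_000,
    "port_scan": lambda f: f["distinct_ports"] > 1_000,
}


def diagnose(features, baseline_packets, k=3.0):
    """Return (detected, label).

    A signature match gives the label for free; a deviation from the
    baseline is detected but remains unidentified ("unknown").
    """
    # Step 1: try the known signatures first.
    for name, matches in SIGNATURES.items():
        if matches(features):
            return True, name
    # Step 2: fall back to a deviation-from-normal check on packet volume.
    mu = mean(baseline_packets)
    sigma = stdev(baseline_packets)
    if sigma > 0 and abs(features["packets"] - mu) > k * sigma:
        return True, "unknown"
    return False, None


baseline = [100, 110, 90, 105, 95]
flood = diagnose({"packets": 20_000, "syn_ratio": 0.95, "distinct_ports": 3}, baseline)
novel = diagnose({"packets": 500, "syn_ratio": 0.1, "distinct_ports": 5}, baseline)
quiet = diagnose({"packets": 102, "syn_ratio": 0.1, "distinct_ports": 5}, baseline)
```

The `"unknown"` label makes the trade-off listed above visible: the deviation check catches an anomaly no signature describes, but leaves identification open.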
30. Deviation from normal
"The model is based on the hypothesis that exploitation of a system's vulnerabilities involves abnormal use of the system; therefore, security violations could be detected from abnormal patterns of system usage." [Denning86]
31. Capturing normal behavior
- Broadly, three strategies to model normal behavior
- System-knowledge models
- Use only a priori knowledge of the system
- Useful in static settings when normal behavior is known (e.g., configuration correctness checking)
- Data-driven models
- Use measurements to model normality via correlation in data, e.g., using temporal or spatial correlation
- Useful in dynamic settings or when normal is unknown (e.g., for traffic anomalies)
- Hybrid schemes
- Use measurements as input to a system model, and predict expected behavior, e.g., via dependency graphs
- Useful when normal behavior for the system is known, but the workload is unknown (e.g., vulnerability filters for end-hosts)
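One simple data-driven model that relies on temporal correlation is an exponentially weighted moving average (EWMA) forecast: each new measurement is compared with a forecast built from recent history. The sketch below is only an illustration; the function name `ewma_anomalies` and the parameters `alpha`, `k`, and `warmup` are assumptions, and the talk does not prescribe this particular model.

```python
def ewma_anomalies(series, alpha=0.3, k=3.0, warmup=5):
    """Return indices whose value is far from an EWMA forecast.

    A point is flagged when its absolute forecast error exceeds k times
    the smoothed absolute error seen so far. Flagged points are not used
    to update the model, so an anomaly does not pollute the baseline.
    """
    if not series:
        return []
    forecast = series[0]
    smoothed_err = 0.0
    flagged = []
    for i, x in enumerate(series[1:], start=1):
        err = abs(x - forecast)
        # Only start flagging once the model has seen a few points.
        if i >= warmup and smoothed_err > 0 and err > k * smoothed_err:
            flagged.append(i)
            continue
        forecast = alpha * x + (1 - alpha) * forecast
        smoothed_err = alpha * err + (1 - alpha) * smoothed_err
    return flagged


traffic = [100, 102, 98, 101, 99, 100, 101, 500, 101, 99]
spikes = ewma_anomalies(traffic)
```

Here the model learns what is normal from the measurements alone, which suits the dynamic settings mentioned above; the cost is sensitivity to the choice of `alpha` and `k`.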
32. Measurements by Layer
- Measurements are placed in a layer if they are available at that layer (and useful to diagnose anomalies)
33. How methods use data to capture normal
A broad categorization of methods
34. Measurements by Location
- Measure at an individual end-host (edge)
- Detailed payloads possible
- Limited global visibility, limited mitigation
- e.g., network intrusion detection systems [Paxson99]
- Measure at multiple end-hosts (overlay)
- Detailed payloads, which are exchanged
- Better global visibility, but still limited mitigation
- e.g., collaborative intrusion detection [Yegneswaran04]
- Measure at the core (ISP)
- Sampled packet headers
- Network-wide visibility, effective mitigation possible
- But mining network-wide traffic is difficult [LCD05]
35. Recent Trends in Diagnosis
- Network-wide diagnosis
- Exploit correlation across links in order to
- build models of normal traffic,
- trace how anomalies move in the network [LCD04, LCD05]
- Multi-layer diagnosis
- Correlate traffic data from multiple layers simultaneously [Roughan04]
- Enables sophisticated identification and root-cause analysis
- Also a recent thrust in the fault management literature [Steinder04]
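The slide does not spell out how correlation across links is exploited. One plausible instance, sketched here as an assumption, is principal component analysis: the dominant component of a time-by-link load matrix captures the traffic pattern shared by all links, and time bins with a large residual outside that component are flagged. The function names, the single-component model, the median-based threshold `k`, and the synthetic `loads` data are all illustrative choices.

```python
import math
from statistics import median


def center(rows):
    """Subtract each link's mean load from its column."""
    means = [sum(col) / len(col) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, iters=100):
    """Dominant principal direction via power iteration on X^T X."""
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(p * row[j] for p, row in zip(proj, rows)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def residual_energy(rows):
    """Squared size of each time bin's load vector outside the top component."""
    x = center(rows)
    v = top_component(x)
    energies = []
    for row in x:
        p = sum(a * b for a, b in zip(row, v))
        energies.append(sum((a - p * b) ** 2 for a, b in zip(row, v)))
    return energies


def anomalous_bins(rows, k=10.0):
    """Time bins whose residual energy exceeds k times the median (assumed rule)."""
    spe = residual_energy(rows)
    threshold = max(k * median(spe), 1e-9)
    return [t for t, e in enumerate(spe) if e > threshold]


# Synthetic loads: three links that scale together, plus a spike on link 0 at bin 2.
base = [10, 12, 14, 16, 18, 16, 14, 12]
loads = [[b, 2 * b, 3 * b] for b in base]
loads[2][0] += 8
flagged = anomalous_bins(loads)
```

The spike is small compared with the daily swing of the busier links, so a per-link threshold could miss it; it stands out here because it breaks the pattern the links otherwise share.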
36. Final thoughts
- Network anomalies span a wide range and can have severe impact
- Network anomalies are increasing in prevalence and unlikely to go away
- Effective diagnosis and mitigation methods are needed to manage anomalies
- Despite their diversity, many anomalies disturb network traffic
- General anomaly diagnosis may be possible by mining anomalies in network traffic at different layers and topological locations