1. Backward and forward looking at dependable and secure computing
- Yinghua Min
- Fellow of the IEEE
- Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- At PRDC 2009, 2009/11/16
2. Outline
- Historical review of dependable computing
- FTCS
- DSN
- IFIP WG10.4
- PRDC
- New challenges of dependable and secure computing
- Old techniques facing new environments
- Concentrate on practical problems rather than conceptual games
3. FTCS
- Established in 1970
- Fault-tolerant computing (FTC) for critical applications
- Aviation
- Spaceflight
- Railway transportation
- A highly academic symposium
4. Dependable computing
- People understood that our area needed some extension.
- A. Avizienis and Jean-Claude Laprie proposed the concept of Dependable Computing at FTCS-15 in 1985.
- Human beings were then included in systems.
- Malicious faults
- FTCS
- DCCA (Dependable Computing for Critical Applications)
- FTCS + DCCA → DSN in 2000
5. DSN
- Since 2000
- DSN has pioneered the fusion between security and dependability,
- understanding the need to simultaneously fight against cyber attacks, accidental faults, design errors, and unexpected operating conditions.
6PRDC
- 1989 Joint Symposium on Fault--Tolerant
Computing, Chongqing, China, July 18-20, 1989 - 1991 Pacific Rim international symposium on FTS,
Kawasaki, Japan - 1999 Pacific Rim international symposium on
Dependable Computing, Hong Kong, China. - Keynote Computer Crime in Hong Kong (Mr.
Anthony Fung) - From the HK police department
- Computer Crime and Internet Fraud
- Its evidence for litigation support
7. Trusted Computing
- Trusted Computing Platform Alliance (TCPA) in 1999
- TCG (Trusted Computing Group) since 2003
- TPM → TCM (Trusted Cryptography Module) in 2008
- Trusted root → security chip → trusted BIOS → trusted OS → trusted systems
- Basically for PCs in the area of secure computing
8. IEEE Transactions on Dependable and Secure Computing
- Since 2004
- Separate dependable computing from secure computing
9. System dependability
- The system dependability situation has been getting worse rather than improving in recent years. Quoting the AMSD Roadmap: the availability typically achieved by (wired) telecommunication services and computer systems in the 1990s was 99.999 to 99.9 percent, whereas cellular phone services and web-based services now typically achieve an availability of only 99 to 90 percent (AMSD Roadmap 2003, p. 31). (A conversion of these levels into annual downtime is sketched below.)
- AMSD: the European Commission's Accompanying Measure on System Dependability
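To make these availability figures concrete, here is a back-of-envelope sketch (plain Python, not from the AMSD Roadmap) converting each level into expected downtime per year:

    # Convert an availability level into expected downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.99999, 0.999, 0.99, 0.90):
        downtime_min = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.3%} available -> "
              f"{downtime_min:,.0f} minutes of downtime per year")

Five nines allows roughly five minutes of downtime a year; 90 percent availability allows more than five weeks.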
10. New challenges
- Three key requirements for computers:
- High performance
- Low power
- Dependability
- Nano-ICs are more vulnerable
- to transient (or soft) errors
- to permanent malfunctions due to material aging or wearout mechanisms
- Nano-scale IC reliability
- Counterfeit ICs
- Dependability and security in cloud computing
- Signal integrity
- Dependable software needs evidence.
11. Nano-scale IC reliability
- The SIA's "International Technology Roadmap for Semiconductors" estimates that by 2019 the feature size of process technology will reach 7 nm, but only between 10 and 20 percent of chips will be defect-free.
- Power densities will skyrocket and on-chip temperatures will increase.
- Small-delay defects, adjacent-line coupling, crosstalk, and process-variation-induced unreliability
- Variability-tolerant design
- Appropriate measures must be taken, such as fault tolerance, redundancy, repair, and reconfiguration (a minimal redundancy sketch follows this list).
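As one illustration of the redundancy measures listed above, here is a minimal sketch of triple modular redundancy (TMR) with majority voting; the function names are mine, not from any cited design:

    from collections import Counter

    def tmr_vote(replica_outputs):
        # Majority-vote over three replicated module outputs (TMR).
        # A single faulty replica is masked; the absence of a majority
        # means more than one replica has failed.
        value, votes = Counter(replica_outputs).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: more than one replica failed")
        return value

    # One replica hit by a transient fault; the voter masks it.
    print(tmr_vote([42, 42, 17]))  # -> 42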
12. Counterfeit Electronic Components
- Incidents involving counterfeit components jeopardize the performance and reliability of electronic systems.
13. Baofeng.com incident in China
- Network outages in Jiangsu, Anhui, Guangxi, Henan, Gansu, and Zhejiang, China, May 19, 2009
- The network failure was triggered by the domain name system (DNS) failure of Baofeng.com, the website of the Chinese media-player provider.
- The failure then caused a surge of DNS server visits and a decrease in the processing performance of the network.
- The servers of DNSPod had been attacked by a malicious virus.
- Was the incident caused by a software fault or an attack? Perhaps both.
14. Bohrbugs and Mandelbugs
- Bohrbugs
- A software bug that consistently manifests under conditions that are well-defined, though possibly unknown.
- Mandelbugs
- A bug whose causes are so complex that its behavior appears chaotic or non-deterministic; for example, it may manifest only after errors have accumulated for some time (a toy contrast of the two bug types follows this list).
- Bohrbugs behaving like Mandelbugs
- Becoming an attack
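The following toy sketch (illustrative only, not from the talk) contrasts the two: the Bohrbug fails on every run with the triggering input, while the Mandelbug is a data race whose manifestation depends on thread timing:

    import threading

    # Bohrbug: fails consistently whenever the triggering condition holds.
    def average(values):
        return sum(values) / len(values)   # crashes on every empty list

    # Mandelbug: a data race; whether updates are lost depends on
    # thread scheduling, so failures appear chaotic and hard to reproduce.
    counter = 0

    def increment_many(n):
        global counter
        for _ in range(n):
            counter += 1                   # read-modify-write is not atomic

    threads = [threading.Thread(target=increment_many, args=(100_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # may be less than 400000, and not reproducibly so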
15. Dependability in the Cloud
- On April 26, 2008, Amazon's Elastic Compute Cloud (EC2) had an outage
- due to a single customer applying a very large set of unusual firewall rules,
- triggering a performance-degradation bug in Amazon's distributed firewall.
- Availability and privacy are serious challenges for applications hosted on cloud infrastructure.
16. Challenges on cloud infrastructure
- Cloud applications increase risk levels.
- Cloud resources are shared by entities that engage in a wide range of behaviors and employ best practices to varying degrees.
- An environment with a few large cloud infrastructure providers
- increases the risk of common-mode outages affecting a large number of applications, and
- provides highly visible targets for attackers.
- Multiple administrative domains between the application and infrastructure operators reduce end-to-end system visibility and error-propagation information, making problem detection and diagnosis very difficult.
- A cloud provider's economies of scale allow levels of investment in redundancy and dependability that smaller operators may not afford.
17. Old FTC techniques facing new environments
- Checkpointing
- Redundancy
- Software fault-tolerance in middleware
- ECC in mass storage systems
- Fault detection and diagnosis in virtual machines
- Assessment of dependability and security
18. Checkpointing for supercomputers
- Periodic checkpointing → cooperative checkpointing
- At runtime, the application requests a checkpoint.
- The system grants or denies the checkpoint (skipping some of them)
- based on various system-wide heuristics, including disk or network usage and reliability information (a sketch of such a grant/deny policy follows this list).
- In one instance, cooperative checkpointing
- reduced bounded slowdown by a factor of nine,
- improved system utilization, and lost no more work to failures than periodic checkpointing,
- even when event prediction had a 90 percent false-negative rate.
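A minimal sketch of the grant/deny decision, with hypothetical inputs and thresholds (the actual heuristics in the cited work differ):

    def grant_checkpoint(io_load, predicted_failure_prob, work_since_last_ckpt,
                         io_threshold=0.8, risk_threshold=0.05):
        # Decide whether to grant an application's checkpoint request.
        # Skip the checkpoint when the I/O system is saturated and the
        # reliability forecast says a failure is unlikely soon; grant it
        # when enough unsaved work is at risk. All thresholds here are
        # illustrative, not taken from the cited study.
        if predicted_failure_prob >= risk_threshold:
            return True                      # failure likely: save now
        if io_load >= io_threshold and work_since_last_ckpt < 3600:
            return False                     # disks busy, little work at risk
        return True

    # The application requests a checkpoint; the system grants or denies it.
    print(grant_checkpoint(io_load=0.9, predicted_failure_prob=0.01,
                           work_since_last_ckpt=600))   # -> False (skipped)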
19. Checkpointing at the micro-operation level
[Figure: processor-state timeline showing committed states, a violation occurring, and the violation being detected]
- Sliding window based on sensor delay
- Delayed commit: completed results are buffered until verified to be correct
- Noise-speculative
- Noise-verified
- Roll back to a previous noise-verified state when a violation is detected (a toy model follows this list).
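A toy software model of the delayed-commit scheme, under my own naming; real designs implement this with hardware buffers:

    class DelayedCommitCore:
        # Toy model of delayed commit with rollback (names illustrative).
        # Completed results stay noise-speculative in a buffer for a
        # window matching the sensor delay; they commit only once
        # verified, and a detected violation rolls the core back to the
        # last noise-verified state.
        def __init__(self, window):
            self.window = window      # sliding window (sensor-delay cycles)
            self.committed = []       # noise-verified, architecturally visible
            self.buffer = []          # noise-speculative results

        def execute(self, result):
            self.buffer.append(result)
            if len(self.buffer) > self.window:
                # Oldest result has outlived the sensor delay: commit it.
                self.committed.append(self.buffer.pop(0))

        def violation_detected(self):
            # Discard speculative results; resume from the verified state.
            self.buffer.clear()

    core = DelayedCommitCore(window=2)
    for r in ["r1", "r2", "r3"]:
        core.execute(r)
    core.violation_detected()         # noise violation within the window
    print(core.committed)             # -> ['r1']; r2 and r3 are re-executed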
20. Redundancy
- At the application level and at the hardware level
- Byzantine fault tolerance
- Algorithms that are robust to arbitrary types of failures in distributed systems
- They do not require any centralized control that has some guarantee of always working correctly.
- Data integrity
- Redundancy in different places
- RAID (redundant array of independent disks): fault-tolerant storage that uses data redundancy (an XOR-parity sketch follows this list).
- Synchronization is a big challenge.
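As a concrete taste of data redundancy, here is a minimal sketch of the XOR parity used by RAID levels such as RAID 5: any single lost block can be rebuilt from the survivors and the parity block:

    def xor_blocks(*blocks):
        # Byte-wise XOR of equal-length data blocks.
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    data = [b"disk0data", b"disk1data", b"disk2data"]
    parity = xor_blocks(*data)            # stored on the parity disk

    # Disk 1 fails: rebuild its block from the survivors and the parity.
    rebuilt = xor_blocks(data[0], data[2], parity)
    assert rebuilt == data[1]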
21. Software fault-tolerance in middleware
- Optimal fault-tolerance strategies for both stateless and stateful Web services (retry and recovery-block sketches follow this list):
- Retry
- Recovery block
- N-version programming
- Network characteristics
- Freedom
- Dynamic
- Multi-tier services
- Debugging performance problems of multi-tier services composed of black boxes
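Minimal sketches of two of these strategies, with hypothetical callables standing in for real Web-service invocations:

    import time

    def with_retry(invoke, attempts=3, delay=1.0):
        # Retry: re-invoke a (typically stateless) service on failure.
        for attempt in range(attempts):
            try:
                return invoke()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    def recovery_block(variants, acceptance_test):
        # Recovery block: try alternative implementations in order until
        # one produces a result that passes the acceptance test.
        for variant in variants:
            try:
                result = variant()
                if acceptance_test(result):
                    return result
            except Exception:
                continue
        raise RuntimeError("all variants failed the acceptance test")

Retry suits transient network faults; recovery blocks and N-version programming also tolerate design faults, at the cost of developing multiple variants.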
22. Soft errors
- Soft errors involve changes to data.
- Cosmic rays create energetic neutrons and protons.
- The importance of soft errors increases as chip technology advances.
- Chip-level soft errors
- Radioactive atoms in the chip's material decay and release alpha particles into the chip.
- Built-in Soft Error Resilience (BISER) cell
- System-level soft errors
- The data being processed is hit by a noise phenomenon (an error-correction sketch follows this list).
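BISER itself is a circuit-level latch design; as a software-visible analogue of soft-error protection, here is a Hamming(7,4) single-error-correcting code that locates and repairs one flipped bit (an illustration of the general idea, not of BISER):

    def hamming74_encode(d):
        # Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7,
        # parity bits at positions 1, 2, and 4).
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        # Locate and flip a single erroneous bit, then return the data bits.
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # checks positions 1, 3, 5, 7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # checks positions 2, 3, 6, 7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # checks positions 4, 5, 6, 7
        syndrome = s1 + 2 * s2 + 4 * s3   # 0 = no error, else bit position
        if syndrome:
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]

    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                          # a particle strike flips one bit
    assert hamming74_correct(word) == [1, 0, 1, 1]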
23. Transient Faults
- Program replication
- N-version programming
- Time-redundant techniques
- Virtual duplex systems
- The Tandem NonStop Cyclone is a custom system designed to use process replicas for transaction-processing workloads.
- Transient fault tolerance for multi-core architectures
- Redundancy at the process level
- Ensuring correct hardware execution or ensuring correct software execution (a time-redundancy sketch follows this list)
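A minimal sketch of time redundancy, assuming a deterministic computation: run it twice and accept the result only when the two runs agree, since a disagreement signals a transient fault:

    def time_redundant(compute, inputs, max_rounds=3):
        # Run the computation twice and compare outputs; a mismatch
        # indicates a transient fault, so re-execute until two runs agree.
        for _ in range(max_rounds):
            first = compute(inputs)
            second = compute(inputs)
            if first == second:
                return first
        raise RuntimeError("persistent disagreement: suspect a permanent fault")

    print(time_redundant(lambda xs: sum(xs), [1, 2, 3]))  # -> 6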
24. Assessment of dependability and security
- The original definition of dependability is the ability to deliver service that can justifiably be trusted.
- Justification
- Evaluation
- Benchmarking
- Standardization
- A dependability and security gap is often perceived by users as a lack of trustworthiness in computer applications; it is in fact undermining the network and service infrastructures that constitute the very core of the knowledge-based society.
25. Difficulties for assessment
- Assessing dependability in a standard and comparable way, considering all of:
- Component failures
- Software bugs
- Human mistakes
- Interaction mistakes
- Malicious attacks
- The quality of measurements
- Assessing dependability in component-based, dynamic, and adaptive systems and networks
- Integration with the development process
26Denial of service (DoS)
- Effects of DoS attacks are experienced by users
as a severe slowdown, service quality
degradation, or service disruption. - We need accurate, quantitative, and versatile DoS
impact metrics regardless of the underlying
mechanism for service denial, attack dynamics,
legitimate traffic mix, or network topology. - Measuring DoS through selected legitimate traffic
parameters - packet loss,
- traffic throughput or goodput,
- request/response delay,
- transaction duration, and
- allocation of resources.
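A minimal sketch of two such metrics, packet loss and goodput, computed from hypothetical per-flow counters over a measurement interval:

    def dos_impact(sent_pkts, delivered_pkts, useful_bytes, interval_s):
        # Legitimate-traffic metrics over one measurement interval.
        loss_rate = 1 - delivered_pkts / sent_pkts if sent_pkts else 0.0
        goodput_bps = useful_bytes * 8 / interval_s
        return {"packet_loss": loss_rate, "goodput_bps": goodput_bps}

    # Baseline vs. under-attack measurements for one legitimate flow.
    print(dos_impact(sent_pkts=10_000, delivered_pkts=9_950,
                     useful_bytes=12_000_000, interval_s=10))   # baseline
    print(dos_impact(sent_pkts=10_000, delivered_pkts=6_200,
                     useful_bytes=2_500_000, interval_s=10))    # under attack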
27. Conceptual games
28. Concluding remarks
- Dependable computing is a perennial topic for information technology.
- Dependability is as important as high performance and low power.
- New challenges are coming with the advance of IT.
- The gap between academia and industry
- Concentrate on practical problems rather than conceptual games.
29. Thank you for your attention!