Title: Addressing Complexity in Emerging Cyber-Ecosystems
1. Addressing Complexity in Emerging Cyber-Ecosystems: Experiments with Autonomic Computational Science
- Manish Parashar
- Center for Autonomic Computing
- The Applied Software Systems Laboratory
- Rutgers, The State University of New Jersey
- In collaboration with S. Jha and O. Rana
2. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
3. The Cyberinfrastructure Vision
- Cyberinfrastructure "integrates hardware for computing, data and networks, digitally-enabled sensors, observatories and experimental facilities, and an interoperable suite of software and middleware services and tools"
  - NSF's "Cyberinfrastructure Vision for 21st Century Discovery"
- A global phenomenon: several LARGE deployments
  - UK National Grid Service (NGS)/European Grid Infrastructure (EGI), TeraGrid, Open Science Grid (OSG), EGEE, Cybera, DEISA, etc.
- New capabilities for computational science and engineering
  - seamless access: resources, services, data, information, expertise, ...
  - seamless aggregation
  - seamless (opportunistic) interactions/couplings
4. Cyberinfrastructure > Cyber-Ecosystems
- 21st century Science and Engineering
  - New Paradigms and Practices
  - Fundamentally data-driven/data-intensive
  - Fundamentally collaborative
5. Unprecedented Opportunities for Science/Engineering
- Knowledge-based, information/data-driven, context/content-aware, computationally intensive, pervasive applications
  - Crisis management; monitoring and predicting natural phenomena; monitoring and managing engineered systems; optimizing business processes
- Addressing applications in an end-to-end manner!
  - Opportunistically combine computations, experiments, observations, and data to manage, control, predict, adapt, optimize, ...
- New paradigms and practices in science and engineering?
  - How can it benefit current applications?
  - How can it enable new thinking in science?
6. The Instrumented Oil Field (with UT-CSM, UT-IG, OSU, UMD, ANL)
Detect and track changes in data during production. Invert data for reservoir properties. Detect and track reservoir changes. Assimilate data and reservoir properties into the evolving reservoir model. Use simulation and optimization to guide future production.
[Figure: the data-driven / model-driven loop]
7. Many Application Areas ...
- Hazard prevention, mitigation and response
  - Earthquakes, hurricanes, tornados, wild fires, floods, landslides, tsunamis, terrorist attacks
- Critical infrastructure systems
  - Condition monitoring and prediction of future capability
- Transportation of humans and goods
  - Safe, speedy, and cost-effective transportation networks and vehicles (air, ground, space)
- Energy and environment
  - Safe and efficient power grids; safe and efficient operation of regional collections of buildings
- Health
  - Reliable and cost-effective health care systems with improved outcomes
- Enterprise-wide decision making
  - Coordination of dynamic distributed decisions for supply chains under uncertainty
- Next generation communication systems
  - Reliable wireless networks for homes and businesses

Report of the Workshop on Dynamic Data Driven Applications Systems, F. Darema et al., March 2006, www.dddas.org
Source: M. Rotea, NSF
8. The Challenge: Managing Complexity, Uncertainty
- System
  - Very large scales
  - Disruptive trends: many/multi-cores, accelerators, clouds
  - Heterogeneity: capability, connectivity, reliability, guarantees, QoS
  - Dynamics: ad hoc structures, failure
  - Distributed system!
    - Lack of guarantees, common/complete knowledge, ...
  - Emerging concerns: power, resilience, ...
- Data and Information
  - Scale, heterogeneity
  - Availability, resolution, quality
  - Semantics, metadata, data models, provenance
  - Trust in data, ...
- Application
  - Compositions
  - Dynamic behaviors
  - Dynamic and complex couplings
  - Software/systems engineering issues
  - Emergent rather than by design
9. The Challenge: Managing Complexity, Uncertainty (I)
- Increasing application, data/information, and system complexity
  - Scale, heterogeneity, dynamism, unreliability, ...
- New application formulations, practices
  - Data intensive and data driven, coupled, multiple physics/scales/resolutions, adaptive, compositional, workflows, etc.
- Complexity/uncertainty must be simultaneously addressed at multiple levels
  - Algorithms/application formulations
    - Asynchronous/chaotic, failure tolerant, ...
  - Abstractions/programming systems
    - Adaptive, application/system aware, proactive, ...
  - Infrastructure/systems
    - Decoupled, self-managing, resilient, ...
10. The Challenge: Managing Complexity, Uncertainty (II)
- The ability of scientists to realize the potential of computational ecosystems is severely hampered by the increasing complexity and dynamism of applications and computing environments.
- To be productive, scientists often have to comprehend and manage complex computing configurations, software tools and libraries, as well as application parameters and behaviors.
- Autonomics and self-* can help?
  - (with the plumbing, for starters)
11. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
12. The Autonomic Computing Metaphor
- Current paradigms, mechanisms, and management tools are inadequate to handle the scale, complexity, dynamism and heterogeneity of emerging systems and applications
- Nature has evolved to cope with scale, complexity, heterogeneity, dynamism, unpredictability, and lack of guarantees
  - self-configuring, self-adapting, self-optimizing, self-healing, self-protecting, highly decentralized, heterogeneous architectures that work!!!
- The goal of autonomic computing is to enable self-managing systems/applications that address these challenges using high-level guidance
- Unlike AI, duplication of human thought is not the ultimate goal!

"Autonomic Computing: An Overview," M. Parashar and S. Hariri, Hot Topics, Lecture Notes in Computer Science, Springer Verlag, Vol. 3566, pp. 247-259, 2005.
13. Motivations for Autonomic Computing
Source: http://www.almaden.ibm.com/almaden/talks/Morris_AC_10-02.pdf
2/27/07: Dow fell 546 points. Since the worst plunge took place after 2:30 pm, trading limits were not activated.
Source: IDC, 2006
8/3/07: (EPA) datacenter energy use by 2011 will cost $7.4B, require 15 power plants, 15 GW peak
8/1/06: UK NHS hit with massive computer outage; 72 primary care and 8 acute hospital trusts affected.
8/12/07: 20,000 people and 60 planes held at LAX after a computer failure prevented customs from screening arrivals
Key Challenge: Current levels of scale, complexity and dynamism make it infeasible for humans to effectively manage and control systems and applications
14. Autonomic Computing: A Pragmatic Approach
- Separation + Integration + Automation!
  - Separation of knowledge, policies and mechanisms for adaptation
  - Integration of self-configuration, healing, protection, optimization, ...
  - Self-* behaviors build on automation concepts and mechanisms
- Increased productivity, reduced operational costs, timely and effective response
- System/application self-management is more than the sum of the self-management of its individual components (see the sketch below)
M. Parashar and S. Hariri, Autonomic Computing: Concepts, Infrastructure, and Applications, CRC Press, Taylor & Francis Group, ISBN 0-8493-9367-1, 2007.
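To make the separation concrete, here is a minimal sketch (Python; purely illustrative, not Accord or any specific framework) of an autonomic manager in which adaptation policies are swappable data, kept apart from the sensing/actuation mechanisms. All names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Policy:
    condition: Callable[[Dict], bool]  # predicate over monitored state
    action: str                        # name of the mechanism to invoke

class AutonomicManager:
    """Separates policies (data) from mechanisms (code)."""
    def __init__(self, mechanisms: Dict[str, Callable[[], None]]):
        self.mechanisms = mechanisms
        self.policies: List[Policy] = []

    def add_policy(self, policy: Policy) -> None:
        self.policies.append(policy)   # policies can change at runtime

    def step(self, state: Dict) -> None:
        # one monitor -> analyze -> plan -> execute pass over observed state
        for p in self.policies:
            if p.condition(state):
                self.mechanisms[p.action]()

# Usage: the manager is retargeted by swapping policies, not code
mgr = AutonomicManager({"throttle": lambda: print("throttling stream")})
mgr.add_policy(Policy(lambda s: s["buffer_occupancy"] > 0.8, "throttle"))
mgr.step({"buffer_occupancy": 0.9})    # fires the throttle mechanism
```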
15. Autonomic Computing Theory
- Integrates and advances several fields
  - Distributed computing: algorithms and architectures
  - Artificial intelligence: models to characterize, predict and mine data and behaviors
  - Security and reliability: designs and models of robust systems
  - Systems and software architecture: designs and models of components at different IT layers
  - Control theory: feedback-based control and estimation
  - Systems and signal processing theory: system and data models and optimization methods
- Requires experimental validation

(From S. Dobson et al., ACM Transactions on Autonomous and Adaptive Systems, Vol. 1, No. 2, Dec. 2006.)
16. Some Information Sources
- Autonomic Computing: Concepts, Infrastructure and Applications, M. Parashar and S. Hariri (Eds.), CRC Press, ISBN 0-8493-9367-1 (available at http://www.crcpress.com/)
- NSF Center on Autonomic Computing
  - http://nsfcac.rutgers.edu
  - http://www.nsfcac.org
- Autonomic Computing Portal
  - http://www.autonomiccomputing.org
- IEEE International Conference on Autonomic Computing
  - http://www.autonomic-conference.org
- IEEE Task Force on Autonomous and Autonomic Systems
  - http://tab.computer.org/aas/
17. Autonomics for Science and Engineering?
- Autonomic computing aims at developing systems and applications that can manage and optimize themselves with only high-level guidance or intervention from users
  - dynamically adapt to changes in accordance with business policies and objectives, and take care of routine elements of management
- Separation of management and optimization policies from enabling mechanisms
  - allows a repertoire of mechanisms to be automatically orchestrated at runtime to respond to heterogeneity, dynamics, etc.
  - e.g., develop strategies capable of identifying and characterizing patterns at design time and at runtime and, using relevant (dynamically defined) policies, managing and optimizing those patterns
- Application, Middleware, Infrastructure
- Manage application/information/system complexity
  - not just hide it!
- Enabling new thinking, formulations
  - how do I think about/formalize my problem differently?
18. A Conceptual Framework for ACS (GMAC 07, with S. Jha and O. Rana)
- Hierarchical
- Within and across levels
19. Cross-layer Autonomics
20. Existing Autonomic Practices in Computational Science (GMAC 09, SOAR 09, with S. Jha and O. Rana)
- Autonomic tuning of the application
- Autonomic tuning by the application
21. Spatial, Temporal and Computational Heterogeneity and Dynamics in SAMR
[Figure: temperature and OH profile from a simulation of combustion based on SAMR (H2-air mixture ignition via 3 hot-spots). Courtesy Sandia National Lab]
22. Autonomics in SAMR
- Tuning by the application
  - Application level: when and where to refine
  - Runtime/middleware level: when, where, and how to partition and load balance
  - Resource level: allocate/de-allocate resources
- Tuning of the application and runtime (see the sketch below)
  - When/where to refine
  - Latency-aware ghost synchronization
  - Heterogeneity/load-aware partitioning and load-balancing
  - Checkpoint frequency
  - Asynchronous formulations
  - ...
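As one concrete, hypothetical instance of runtime-level tuning: the sketch below triggers repartitioning of the grid hierarchy only when measured load imbalance exceeds a threshold, trading repartitioning cost against idle time. The 20% threshold and function names are assumptions, not the actual SAMR runtime.

```python
def load_imbalance(loads):
    """Relative imbalance across per-processor work estimates."""
    avg = sum(loads) / len(loads)
    return (max(loads) - avg) / avg if avg > 0 else 0.0

def maybe_repartition(loads, threshold=0.2, repartition=None):
    # Policy: repartition only when imbalance exceeds the (assumed) threshold.
    if load_imbalance(loads) > threshold and repartition:
        repartition()
        return True
    return False

# Usage: per-processor work estimates gathered at each regrid step
maybe_repartition([120, 95, 210, 80],
                  repartition=lambda: print("repartitioning grid hierarchy"))
```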
23. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
24. Autonomics for Science and Engineering: Application-level Examples
- Autonomics to address complexity in science and engineering
- Autonomics as a paradigm for science and engineering
- Some examples
  - Autonomic runtime management: multiphysics, adaptive mesh refinement
  - Autonomic data streaming and in-network data processing: coupled simulations
  - Autonomic deployment/scheduling: HPC Grid/Cloud integration
  - Autonomic workflows: simulation-based optimization
- (Many system-level examples not presented here ...)
25. Adaptive Methods in Science and Engineering
26. Autonomic (Physics/Model/System-Driven) Runtime Management
"Hybrid Runtime Management of Space-Time Heterogeneity for Dynamic SAMR Applications," X. Li and M. Parashar, IEEE TPDS 18(8), pp. 1202-1214, August 2007.
27. Cross-layer Adaptations for SAMR
- When resources are under-utilized, ALP trades space (resource) for time (performance)
- When resources are scarce, ALOC trades time (performance) for space (resource)
(A schematic sketch of this choice follows.)
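A schematic sketch of how the cross-layer choice might be mechanized; the memory-fraction thresholds and example actions are invented for illustration and are not the published ALP/ALOC policy.

```python
# Plenty of memory -> ALP (use more space to run faster);
# scarce memory   -> ALOC (spend time to shrink the footprint).

def choose_adaptation(free_mem_fraction, low=0.15, high=0.5):
    if free_mem_fraction > high:
        return "ALP"    # e.g., replicate/prefetch data, deeper ghost regions
    if free_mem_fraction < low:
        return "ALOC"   # e.g., compress state, move data out-of-core
    return "none"       # current configuration is acceptable

assert choose_adaptation(0.7) == "ALP"
assert choose_adaptation(0.05) == "ALOC"
```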
28. Experimental Results - ALP
Performance gain of up to 40% on 512 processors.
Experiment setup: IBM SP4 cluster (DataStar at the San Diego Supercomputing Center, 1632 processors total); SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops.
29. Effects of Finite Memory - ALOC
Intel Pentium 4 CPU 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB.
30. Experimental Results - ALOC
Beowulf cluster (Frea at Rutgers, 64 processors): Intel Pentium 4 CPU 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB.
31. Coupled Fusion Simulations: A Data-Intensive Workflow
32. Autonomic Data Streaming and In-Transit Processing for Data-Intensive Workflows
- Large-scale distributed environments and data-intensive workflows
  - Application entities separated in space and time
  - Seamless interactions and couplings across entities
- Distributed application entities need to interact at runtime
  - Data processing, interactive data monitoring, online data analysis, visualization, data/service/VM migration, data archiving, collaboration, etc.
- Large data volumes and rates, heterogeneous data types
  - Must be streamed efficiently and effectively between distributed application components
  - Application-specific manipulations need to be applied in-transit

"A Self-Managing Wide-Area Data Streaming Service," V. Bhat, M. Parashar, H. Liu, M. Khandekar, N. Kandasamy, S. Klasky, and S. Abdelwahed, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Volume 10, Issue 7, pp. 365-383, December 2007.
33. Autonomic Data Streaming and In-Transit Processing for Data-Intensive Workflows
- Workflow with coupled simulation codes, i.e., the edge turbulence particle-in-cell (PIC) code (GTC) and the microscopic MHD code (M3D), running simultaneously on separate HPC resources
- Data streamed and processed en route, e.g., data from the PIC code is filtered through noise-detection processes before it can be coupled with the MHD code
- Efficient data streaming between live simulations, arriving just-in-time: if data arrives too early, time and resources are wasted buffering it; if it arrives too late, the application wastes resources waiting for it to come in
- Opportunistic use of in-transit resources
34. Autonomic Data Streaming and In-Transit Processing
- Application level
  - Proactive QoS management strategies using a model-based LLC controller
  - Capture constraints for in-transit processing using a slack metric
- In-transit level
  - Opportunistic data processing using a dynamic in-transit resource overlay
  - Adaptive run-time management at in-transit nodes based on the slack metric generated at the application level
  - Adaptive buffer management and forwarding (a minimal sketch follows)
35. Autonomics for Coupled Fusion Simulation Workflows
36. Autonomic Streaming: Implementation/Deployment
- Simulation Workflow
  - SS: Simulation Service (GTC)
  - ADSS: Autonomic Data Streaming Service
  - CBMS: LLC-Controller-based Buffer Management Service
  - DTS: Data Transfer Service
  - DAS: Data Analysis Service
  - SLAMS: Slack Manager Service
  - PS: Processing Service
  - BMS: Buffer Management Service
  - ArchS: Archiving Service (archives data at the sink)
- Simulations execute on leadership-class machines at ORNL and NERSC
- In-transit nodes located at PPPL and Rutgers
37. Adaptive Data Transfer
- No congestion in intervals 1-9
  - Data transferred over the WAN
- Congestion at intervals 9-19
  - The controller recognizes this congestion and advises the Element Manager, which in turn adapts the DTS to transfer data to local storage (LAN)
  - Adaptation continues until the network is no longer congested
  - Data sent to local storage by the DTS falls to zero at the 19th controller interval
38. Adaptation of the Workflow
- Create multiple instances of the Autonomic Data Streaming Service (ADSS) when the effective network transfer rate dips below a threshold (in our case, around 100 Mbps); a scale-out sketch follows
[Diagram: the simulation feeds ADSS-0, ADSS-1 and ADSS-2, each with its own buffer and data-transfer path; available network throughput is the difference between the maximum and current network transfer rates]
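The scale-out rule might look like the following sketch; the threshold comes from the slide, but the names and cap are assumptions, not the actual ADSS code.

```python
def adapt_instances(effective_rate_mbps, instances,
                    threshold_mbps=100, max_instances=4):
    """Return the new ADSS instance count given the observed transfer rate."""
    if effective_rate_mbps < threshold_mbps and instances < max_instances:
        return instances + 1      # scale out: add another ADSS instance
    return instances

# Usage: observed effective rates per controller interval
n = 1
for rate in [140, 90, 70, 120]:
    n = adapt_instances(rate, n)
print(n)  # -> 3 after the two congested intervals
```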
39. Buffer Occupancy at In-Transit Nodes, With and Without Coupling
- Buffer occupancy at in-transit nodes before congestion is around 50%
- During congestion, the application-level controller throttles data items
- Buffer occupancy at in-transit nodes reduces from 80% without coupling to 60.8% with coupling
- Higher buffer occupancies at in-transit nodes lead to failures and loss of data
40. Reservoir Characterization: EnKF-based History Matching (with S. Jha)
- Black Oil Reservoir Simulator
  - simulates the movement of oil and gas in subsurface formations
- Ensemble Kalman Filter
  - computes the Kalman gain matrix and updates the model parameters of the ensembles (the update equations are sketched below)
- Heterogeneous, dynamic workflows
- Based on Cactus, PETSc
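For reference, the standard EnKF analysis step that the Kalman-gain stage performs, in generic textbook notation (assumed, not transcribed from the talk): C_f is the forecast ensemble covariance, H the measurement operator, R the observation-error covariance, d^(i) the (perturbed) observations, and x_f^(i), x_a^(i) the forecast and analysis states of ensemble member i.

```latex
K = C_f H^{T} \left( H C_f H^{T} + R \right)^{-1}
\qquad
x_a^{(i)} = x_f^{(i)} + K \left( d^{(i)} - H\, x_f^{(i)} \right),
\quad i = 1, \dots, N
```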
41. Experiment Background and Set-Up (2/2)
- Key metrics
  - Total Time to Completion (TTC)
  - Total Cost of Completion (TCC)
- Basic assumptions
  - TG gives the best performance but is the relatively more restricted resource
  - EC2 is relatively more freely available but not as capable
- Note that the motivation of our experiments is to understand each of the usage scenarios and their feasibility, behaviors and benefits, not to optimize the performance of any one scenario.
42. Establishing Baseline Performance
Baseline TTC for EC2 and TG for a 1-stage, 128-ensemble-member EnKF run. The first 4 bars represent the TTC as the number of EC2 VMs increases; the next 4 bars represent the TTC as the number of TG CPUs (nodes) used increases.
43. Autonomic Integration of HPC Grids and Clouds (with S. Jha)
- Acceleration: Clouds used as accelerators to improve the application time-to-completion
  - alleviate the impact of queue wait times, or exploit an additional level of parallelism, by offloading appropriate tasks to Cloud resources
- Conservation: Clouds used to conserve HPC Grid allocations, given appropriate runtime and budget constraints
- Resilience: Clouds used to handle unexpected situations
  - handle unanticipated HPC Grid downtime, inadequate allocations, or unanticipated queue delays
(A schematic scheduler sketch follows.)
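A schematic decision rule covering the three usage modes; all names, estimates and thresholds are assumptions, and this is not the actual autonomic scheduler used in the experiments.

```python
class Task:
    def __init__(self, est_grid_mins):
        self.est_grid_mins = est_grid_mins   # estimated TG runtime

def place_task(task, grid_available, grid_alloc_mins, queue_wait_mins,
               deadline_mins, budget_left):
    """Decide whether a task runs on the HPC Grid (TG) or the Cloud (EC2)."""
    if not grid_available or grid_alloc_mins <= 0:
        return "EC2"                     # resilience: grid lost or exhausted
    if grid_alloc_mins < task.est_grid_mins and budget_left > 0:
        return "EC2"                     # conservation: preserve the allocation
    if queue_wait_mins > deadline_mins and budget_left > 0:
        return "EC2"                     # acceleration: skip the queue
    return "TG"

print(place_task(Task(30), True, 800, 5, 60, budget_left=10))  # -> TG
print(place_task(Task(30), True, 20, 5, 60, budget_left=10))   # -> EC2
```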
44. Objective I: Using Clouds as Accelerators for HPC Grids (1/2)
- Explore how Clouds (EC2) can be used as accelerators for HPC Grid (TG) workloads
  - 16 TG CPUs (1 node on Ranger)
  - average queuing time for TG was set to 5 and 10 minutes
  - the number of EC2 nodes varied from 20 to 100 in steps of 20
  - VM start-up time was about 160 seconds
45. Objective I: Using Clouds as Accelerators for HPC Grids (2/2)
The TTC and TCC for Objective I with 16 TG CPUs and queuing times set to 5 and 10 minutes. As expected, the more VMs that are made available, the greater the acceleration, i.e., the lower the TTC. The reduction in TTC is roughly linear, but not perfectly so, because of a complex interplay between the tasks in the workload and resource availability.
46. Objective II: Using Clouds for Conserving CPU-Time on the TeraGrid
- Explore how to conserve a fixed allocation of CPU hours by offloading tasks that perhaps don't need the specialized capabilities of the HPC Grid
Distribution of tasks across EC2 and TG, and the TTC and TCC, as the CPU-minute allocation on the TG is increased.
47. Objective III: Response to Changing Operating Conditions (Resilience) (1/4)
- Explore the situation where resources that were initially planned for become unavailable at runtime, either in part or in their entirety
- How can Cloud services be used to address these situations and allow the system/application to respond to a dynamic change in the availability of resources?
- Initially, 16 TG CPUs are allocated for 800 minutes. After about 50 minutes of execution (i.e., 3 tasks completed on the TG), the available CPU time is changed so that only 20 CPU-minutes remain
48. Objective III: Response to Changing Operating Conditions (Resilience) (2/4)
Allocation of tasks to TG CPUs and EC2 nodes for usage mode III. As the 16 allocated TG CPUs become unavailable after only 70 minutes, rather than the planned 800 minutes, the bulk of the tasks are completed by EC2 nodes.
49. Objective III: Response to Changing Operating Conditions (Resilience) (3/4)
Number of TG cores and EC2 nodes as a function of time for usage mode III. Note that the TG CPU allocation goes to zero after about 70 minutes, causing the autonomic scheduler to increase the EC2 nodes by 8.
50. Objective III: Response to Changing Operating Conditions (Resilience) (4/4)
Overheads of resilience on TTC and TCC.
51. Autonomic Formulations/Programming
52. LLC-based Self-Management in Accord
- Element/Service Managers are augmented with LLC Controllers
  - the manager monitors the state/execution context of elements
  - enforces adaptation actions determined by the controller
  - controller outputs augment human-defined rules (a small sketch follows)
53. The Instrumented Oil Field
- Production of oil and gas can take advantage of installed sensors that monitor the reservoir's state as fluids are extracted
- Knowledge of the reservoir's state during production can result in better engineering decisions
  - economical evaluation; physical characteristics (bypassed oil, high-pressure zones); production techniques for safe operating conditions in complex and difficult areas

"Application of Grid-Enabled Technologies for Solving Optimization Problems in Data-Driven Reservoir Studies," M. Parashar, H. Klie, U. Catalyurek, T. Kurc, V. Matossian, J. Saltz and M. Wheeler, FGCS: The International Journal of Grid Computing: Theory, Methods and Applications, Elsevier Science Publishers, Vol. 21, Issue 1, pp. 19-26, 2005.
54. Effective Oil Reservoir Management: Well Placement/Configuration
- Why is it important?
  - Better utilization/cost-effectiveness of existing reservoirs
  - Minimizing adverse effects on the environment
[Figure: bad management leaves much bypassed oil; better management leaves less bypassed oil]
55. Autonomic Reservoir Management: Closing the Loop Using Optimization
- Dynamic decision system, dynamic data-driven assimilation
- Optimize economic revenue, environmental hazard, ..., based on the present subsurface knowledge and numerical model
[Diagram: a closed loop starting from subsurface characterization and proceeding through management decision, data assimilation, acquisition of remote sensing data, updating knowledge of the model, planning optimal data acquisition, and experimental design, running on autonomic Grid middleware with Grid data management and processing middleware]
56. An Autonomic Well Placement/Configuration Workflow
[Figure: the workflow, driven by external inputs such as oil prices, weather, etc.]
57. Autonomic Oil Well Placement/Configuration
[Figure: contours of NEval(y,z,500); pressure contours for 3 wells over a 2D permeability profile. Exhaustive search requires NY × NZ (450) evaluations; the minimum is found by the VFSA solution walk after 20 (81) evaluations]
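For intuition, a generic very-fast-simulated-annealing (VFSA) loop over a toy one-dimensional objective; the temperature schedule and move generator follow Ingber's standard VFSA formulation, but the objective, bounds and constants are invented for illustration and are not the paper's actual setup.

```python
import math, random

def vfsa_step(x, lo, hi, T):
    """Cauchy-like VFSA move: heavy-tailed steps that shrink with T."""
    u = random.random()
    y = math.copysign(T * ((1 + 1 / T) ** abs(2 * u - 1) - 1), u - 0.5)
    return min(max(x + y * (hi - lo), lo), hi)

def vfsa(objective, lo, hi, t0=1.0, c=1.0, iters=200):
    x = random.uniform(lo, hi)
    fx = objective(x)
    best_x, best_f = x, fx
    for k in range(1, iters + 1):
        T = t0 * math.exp(-c * math.sqrt(k))   # VFSA temperature schedule
        cand = vfsa_step(x, lo, hi, T)
        f = objective(cand)
        # accept improvements always; worse moves with Boltzmann probability
        if f < fx or random.random() < math.exp(-(f - fx) / max(T, 1e-12)):
            x, fx = cand, f
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

# Usage: minimize a toy "bypassed oil" proxy over a 1-D well position
print(vfsa(lambda y: (y - 3.2) ** 2, lo=0.0, hi=10.0))
```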
58. Autonomic Oil Well Placement/Configuration (VFSA)
"An Autonomic Reservoir Framework for the Stochastic Optimization of Well Placement," V. Matossian, M. Parashar, W. Bangerth, H. Klie, M.F. Wheeler, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 8, No. 4, pp. 255-269, 2005.
"Autonomic Oil Reservoir Optimization on the Grid," V. Matossian, V. Bhat, M. Parashar, M. Peszynska, M. Sen, P. Stoffa and M. F. Wheeler, Concurrency and Computation: Practice and Experience, John Wiley and Sons, Volume 17, Issue 1, pp. 1-26, 2005.
59. Summary
- CI and emerging computational ecosystems
  - Unprecedented opportunity: new thinking and practices in science and engineering
  - Unprecedented research challenges: scale, complexity, heterogeneity, dynamism, reliability, uncertainty, ...
- Autonomic computing can address complexity and uncertainty
  - Separation + Integration + Automation
- Experiments with autonomics for science and engineering
  - Autonomic data streaming and in-transit data manipulation, autonomic workflows, autonomic runtime management, ...
- However, there are implications
  - Added uncertainty
  - Correctness, predictability, repeatability
  - Validation
60. Thank You!
Email: parashar@rutgers.edu