Title: Addressing Complexity in Emerging Cyber-Ecosystems
1. Addressing Complexity in Emerging Cyber-Ecosystems: Experiments with Autonomic Computational Science
- Manish Parashar
- Center for Autonomic Computing
- The Applied Software Systems Laboratory
- Rutgers, The State University of New Jersey
- In collaboration with S. Jha and O. Rana
2. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
3. The Cyberinfrastructure Vision
- Cyberinfrastructure "integrates hardware for computing, data and networks, digitally-enabled sensors, observatories and experimental facilities, and an interoperable suite of software and middleware services and tools"
  - NSF's "Cyberinfrastructure Vision for 21st Century Discovery"
- A global phenomenon: several LARGE deployments
  - UK National Grid Service (NGS)/European Grid Infrastructure (EGI), TeraGrid, Open Science Grid (OSG), EGEE, Cybera, DEISA, etc.
- New capabilities for computational science and engineering
  - seamless access: resources, services, data, information, expertise, ...
  - seamless aggregation
  - seamless (opportunistic) interactions/couplings
4. Cyberinfrastructure > Cyber-Ecosystems
- 21st century Science and Engineering
  - New Paradigms and Practices
  - Fundamentally data-driven/data-intensive
  - Fundamentally collaborative
5. Unprecedented Opportunities for Science/Engineering
- Knowledge-based, information/data-driven, context/content-aware, computationally intensive, pervasive applications
  - Crisis management; monitoring and predicting natural phenomena; monitoring and managing engineered systems; optimizing business processes
- Addressing applications in an end-to-end manner!
  - Opportunistically combine computations, experiments, observations, and data to manage, control, predict, adapt, optimize, ...
- New paradigms and practices in science and engineering?
  - How can it benefit current applications?
  - How can it enable new thinking in science?
6. The Instrumented Oil Field (with UT-CSM, UT-IG, OSU, UMD, ANL)
Detect and track changes in data during production. Invert data for reservoir properties. Detect and track reservoir changes. Assimilate data and reservoir properties into the evolving reservoir model. Use simulation and optimization to guide future production.
[Figure: the data-driven / model-driven loop]
7. Many Application Areas ...
- Hazard prevention, mitigation and response
  - Earthquakes, hurricanes, tornados, wild fires, floods, landslides, tsunamis, terrorist attacks
- Critical infrastructure systems
  - Condition monitoring and prediction of future capability
- Transportation of humans and goods
  - Safe, speedy, and cost-effective transportation networks and vehicles (air, ground, space)
- Energy and environment
  - Safe and efficient power grids; safe and efficient operation of regional collections of buildings
- Health
  - Reliable and cost-effective health care systems with improved outcomes
- Enterprise-wide decision making
  - Coordination of dynamic distributed decisions for supply chains under uncertainty
- Next generation communication systems
  - Reliable wireless networks for homes and businesses

Report of the Workshop on Dynamic Data Driven Applications Systems, F. Darema et al., March 2006, www.dddas.org
Source: M. Rotea, NSF
8. The Challenge: Managing Complexity, Uncertainty
- System
  - Very large scales
  - Disruptive trends: many/multi-cores, accelerators, clouds
  - Heterogeneity: capability, connectivity, reliability, guarantees, QoS
  - Dynamics: ad hoc structures, failure
  - Distributed system!
    - Lack of guarantees, common/complete knowledge, ...
  - Emerging concerns: power, resilience, ...
- Data and Information
  - Scale, heterogeneity
  - Availability, resolution, quality
  - Semantics, metadata, data models, provenance
  - Trust in data, ...
- Application
  - Compositions
  - Dynamic behaviors
  - Dynamic and complex couplings
  - Software/systems engineering issues
  - Emergent rather than by design
9. The Challenge: Managing Complexity, Uncertainty (I)
- Increasing application, data/information, and system complexity
  - Scale, heterogeneity, dynamism, unreliability, ...
- New application formulations, practices
  - Data intensive and data driven, coupled, multiple physics/scales/resolutions, adaptive, compositional, workflows, etc.
- Complexity/uncertainty must be simultaneously addressed at multiple levels
  - Algorithms/application formulations
    - Asynchronous/chaotic, failure tolerant, ...
  - Abstractions/programming systems
    - Adaptive, application/system aware, proactive, ...
  - Infrastructure/systems
    - Decoupled, self-managing, resilient, ...
10. The Challenge: Managing Complexity, Uncertainty (II)
- The ability of scientists to realize the potential of computational ecosystems is severely hampered by the increasing complexity and dynamism of applications and computing environments.
- To be productive, scientists often have to comprehend and manage complex computing configurations, software tools and libraries, as well as application parameters and behaviors.
- Autonomics and self-* can help?
  - (with the plumbing, for starters)
11. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
12. The Autonomic Computing Metaphor
- Current paradigms, mechanisms, and management tools are inadequate to handle the scale, complexity, dynamism and heterogeneity of emerging systems and applications
- Nature has evolved to cope with scale, complexity, heterogeneity, dynamism, unpredictability, and lack of guarantees
  - self-configuring, self-adapting, self-optimizing, self-healing, self-protecting, highly decentralized, heterogeneous architectures that work!!!
- The goal of autonomic computing is to enable self-managing systems/applications that address these challenges using high-level guidance
- Unlike AI, duplication of human thought is not the ultimate goal!

"Autonomic Computing: An Overview," M. Parashar and S. Hariri, Hot Topics, Lecture Notes in Computer Science, Springer Verlag, Vol. 3566, pp. 247-259, 2005.
13. Motivations for Autonomic Computing
Source: http://www.almaden.ibm.com/almaden/talks/Morris_AC_10-02.pdf
2/27/07: Dow fell 546 points. Since the worst plunge took place after 2:30 pm, trading limits were not activated.
Source: IDC, 2006
8/3/07: (EPA) datacenter energy use by 2011 will cost $7.4B, require 15 power plants, 15 GW peak
8/1/06: UK NHS hit with massive computer outage; 72 primary care and 8 acute hospital trusts affected.
8/12/07: 20,000 people and 60 planes held at LAX after a computer failure prevented customs from screening arrivals
Key Challenge: Current levels of scale, complexity and dynamism make it infeasible for humans to effectively manage and control systems and applications
14. Autonomic Computing: A Pragmatic Approach
- Separation + Integration + Automation!
  - Separation of knowledge, policies and mechanisms for adaptation
  - Integration of self-configuration, healing, protection, optimization, ...
  - Self-* behaviors build on automation concepts and mechanisms
- Increased productivity, reduced operational costs, timely and effective response
- System/application self-management is more than the sum of the self-management of its individual components (see the sketch below)
M. Parashar and S. Hariri, Autonomic Computing: Concepts, Infrastructure, and Applications, CRC Press, Taylor & Francis Group, ISBN 0-8493-9367-1, 2007.
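To make the separation concrete, here is a minimal sketch (Python; purely illustrative, not Accord or any specific framework) of an autonomic manager in which adaptation policies are swappable data, kept apart from the sensing/actuation mechanisms. All names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Policy:
    condition: Callable[[Dict], bool]  # predicate over monitored state
    action: str                        # name of the mechanism to invoke

class AutonomicManager:
    """Separates policies (data) from mechanisms (code)."""
    def __init__(self, mechanisms: Dict[str, Callable[[], None]]):
        self.mechanisms = mechanisms
        self.policies: List[Policy] = []

    def add_policy(self, policy: Policy) -> None:
        self.policies.append(policy)   # policies can change at runtime

    def step(self, state: Dict) -> None:
        # one monitor -> analyze -> plan -> execute pass over observed state
        for p in self.policies:
            if p.condition(state):
                self.mechanisms[p.action]()

# Usage: the manager is retargeted by swapping policies, not code
mgr = AutonomicManager({"throttle": lambda: print("throttling stream")})
mgr.add_policy(Policy(lambda s: s["buffer_occupancy"] > 0.8, "throttle"))
mgr.step({"buffer_occupancy": 0.9})    # fires the throttle mechanism
```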
15. Autonomic Computing Theory
- Integrates and advances several fields
  - Distributed computing: algorithms and architectures
  - Artificial intelligence: models to characterize, predict and mine data and behaviors
  - Security and reliability: designs and models of robust systems
  - Systems and software architecture: designs and models of components at different IT layers
  - Control theory: feedback-based control and estimation
  - Systems and signal processing theory: system and data models and optimization methods
- Requires experimental validation

(From S. Dobson et al., ACM Transactions on Autonomous and Adaptive Systems, Vol. 1, No. 2, Dec. 2006.)
16. Some Information Sources
- Autonomic Computing: Concepts, Infrastructure and Applications, M. Parashar and S. Hariri (Eds.), CRC Press, ISBN 0-8493-9367-1 (available at http://www.crcpress.com/)
- NSF Center on Autonomic Computing
  - http://nsfcac.rutgers.edu
  - http://www.nsfcac.org
- Autonomic Computing Portal
  - http://www.autonomiccomputing.org
- IEEE International Conference on Autonomic Computing
  - http://www.autonomic-conference.org
- IEEE Task Force on Autonomous and Autonomic Systems
  - http://tab.computer.org/aas/
17. Autonomics for Science and Engineering?
- Autonomic computing aims at developing systems and applications that can manage and optimize themselves with only high-level guidance or intervention from users
  - dynamically adapt to changes in accordance with business policies and objectives, and take care of routine elements of management
- Separation of management and optimization policies from enabling mechanisms
  - allows a repertoire of mechanisms to be automatically orchestrated at runtime to respond to heterogeneity, dynamics, etc.
  - e.g., develop strategies capable of identifying and characterizing patterns at design time and at runtime and, using relevant (dynamically defined) policies, managing and optimizing those patterns
- Application, Middleware, Infrastructure
- Manage application/information/system complexity
  - not just hide it!
- Enabling new thinking, formulations
  - how do I think about/formalize my problem differently?
18. A Conceptual Framework for ACS (GMAC 07, with S. Jha and O. Rana)
- Hierarchical
- Within and across levels
19. Cross-layer Autonomics
20. Existing Autonomic Practices in Computational Science (GMAC 09, SOAR 09, with S. Jha and O. Rana)
- Autonomic tuning of the application
- Autonomic tuning by the application
21. Spatial, Temporal and Computational Heterogeneity and Dynamics in SAMR
[Figure: temperature and OH profile from a simulation of combustion based on SAMR (H2-air mixture ignition via 3 hot-spots). Courtesy Sandia National Lab]
22. Autonomics in SAMR
- Tuning by the application
  - Application level: when and where to refine
  - Runtime/middleware level: when, where, and how to partition and load balance
  - Resource level: allocate/de-allocate resources
- Tuning of the application and runtime (see the sketch below)
  - When/where to refine
  - Latency-aware ghost synchronization
  - Heterogeneity/load-aware partitioning and load-balancing
  - Checkpoint frequency
  - Asynchronous formulations
  - ...
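As one concrete, hypothetical instance of runtime-level tuning: the sketch below triggers repartitioning of the grid hierarchy only when measured load imbalance exceeds a threshold, trading repartitioning cost against idle time. The 20% threshold and function names are assumptions, not the actual SAMR runtime.

```python
def load_imbalance(loads):
    """Relative imbalance across per-processor work estimates."""
    avg = sum(loads) / len(loads)
    return (max(loads) - avg) / avg if avg > 0 else 0.0

def maybe_repartition(loads, threshold=0.2, repartition=None):
    # Policy: repartition only when imbalance exceeds the (assumed) threshold.
    if load_imbalance(loads) > threshold and repartition:
        repartition()
        return True
    return False

# Usage: per-processor work estimates gathered at each regrid step
maybe_repartition([120, 95, 210, 80],
                  repartition=lambda: print("repartitioning grid hierarchy"))
```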
23. Outline of My Presentation
- Computational Ecosystems
- Unprecedented opportunities, challenges
- Autonomic computing: a pragmatic approach for addressing complexity!
- Experiments with autonomics for science and engineering
- Concluding Remarks
24. Autonomics for Science and Engineering: Application-level Examples
- Autonomics to address complexity in science and engineering
- Autonomics as a paradigm for science and engineering
- Some examples
  - Autonomic runtime management: multiphysics, adaptive mesh refinement
  - Autonomic data streaming and in-network data processing: coupled simulations
  - Autonomic deployment/scheduling: HPC Grid/Cloud integration
  - Autonomic workflows: simulation-based optimization
- (Many system-level examples not presented here ...)
25. Adaptive Methods in Science and Engineering
26. Autonomic (Physics/Model/System-Driven) Runtime Management
"Hybrid Runtime Management of Space-Time Heterogeneity for Dynamic SAMR Applications," X. Li and M. Parashar, IEEE TPDS 18(8), pp. 1202-1214, August 2007.
27. Cross-layer Adaptations for SAMR
- When resources are under-utilized, ALP trades space (resource) for time (performance)
- When resources are scarce, ALOC trades time (performance) for space (resource)
(A schematic sketch of this choice follows.)
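A schematic sketch of how the cross-layer choice might be mechanized; the memory-fraction thresholds and example actions are invented for illustration and are not the published ALP/ALOC policy.

```python
# Plenty of memory -> ALP (use more space to run faster);
# scarce memory   -> ALOC (spend time to shrink the footprint).

def choose_adaptation(free_mem_fraction, low=0.15, high=0.5):
    if free_mem_fraction > high:
        return "ALP"    # e.g., replicate/prefetch data, deeper ghost regions
    if free_mem_fraction < low:
        return "ALOC"   # e.g., compress state, move data out-of-core
    return "none"       # current configuration is acceptable

assert choose_adaptation(0.7) == "ALP"
assert choose_adaptation(0.05) == "ALOC"
```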
28. Experimental Results - ALP
Performance gain of up to 40% on 512 processors.
Experiment setup: IBM SP4 cluster (DataStar at the San Diego Supercomputing Center, 1632 processors total); SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops.
29. Effects of Finite Memory - ALOC
Intel Pentium 4 CPU 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB.
30. Experimental Results - ALOC
Beowulf cluster (Frea at Rutgers, 64 processors): Intel Pentium 4 CPU 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB.
31. Coupled Fusion Simulations: A Data-Intensive Workflow
32. Autonomic Data Streaming and In-Transit Processing for Data-Intensive Workflows
- Large-scale distributed environments and data-intensive workflows
  - Application entities separated in space and time
  - Seamless interactions and couplings across entities
- Distributed application entities need to interact at runtime
  - Data processing, interactive data monitoring, online data analysis, visualization, data/service/VM migration, data archiving, collaboration, etc.
- Large data volumes and rates, heterogeneous data types
  - Must be streamed efficiently and effectively between distributed application components
  - Application-specific manipulations need to be applied in-transit

"A Self-Managing Wide-Area Data Streaming Service," V. Bhat, M. Parashar, H. Liu, M. Khandekar, N. Kandasamy, S. Klasky, and S. Abdelwahed, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Volume 10, Issue 7, pp. 365-383, December 2007.
33. Autonomic Data Streaming and In-Transit Processing for Data-Intensive Workflows
- Workflow with coupled simulation codes, i.e., the edge turbulence particle-in-cell (PIC) code (GTC) and the microscopic MHD code (M3D), running simultaneously on separate HPC resources
- Data streamed and processed en route, e.g., data from the PIC code is filtered through noise-detection processes before it can be coupled with the MHD code
- Efficient data streaming between live simulations, arriving just-in-time: if data arrives too early, time and resources are wasted buffering it; if it arrives too late, the application wastes resources waiting for it to come in
- Opportunistic use of in-transit resources
34. Autonomic Data Streaming and In-Transit Processing
- Application level
  - Proactive QoS management strategies using a model-based LLC controller
  - Capture constraints for in-transit processing using a slack metric
- In-transit level
  - Opportunistic data processing using a dynamic in-transit resource overlay
  - Adaptive run-time management at in-transit nodes based on the slack metric generated at the application level
  - Adaptive buffer management and forwarding (a minimal sketch follows)
35. Autonomics for Coupled Fusion Simulation Workflows
36. Autonomic Streaming: Implementation/Deployment
- Simulation Workflow
  - SS: Simulation Service (GTC)
  - ADSS: Autonomic Data Streaming Service
  - CBMS: LLC-Controller-based Buffer Management Service
  - DTS: Data Transfer Service
  - DAS: Data Analysis Service
  - SLAMS: Slack Manager Service
  - PS: Processing Service
  - BMS: Buffer Management Service
  - ArchS: Archiving Service (archives data at the sink)
- Simulations execute on leadership-class machines at ORNL and NERSC
- In-transit nodes located at PPPL and Rutgers
37. Adaptive Data Transfer
- No congestion in intervals 1-9
  - Data transferred over the WAN
- Congestion at intervals 9-19
  - The controller recognizes this congestion and advises the Element Manager, which in turn adapts the DTS to transfer data to local storage (LAN)
  - Adaptation continues until the network is no longer congested
  - Data sent to local storage by the DTS falls to zero at the 19th controller interval
38. Adaptation of the Workflow
- Create multiple instances of the Autonomic Data Streaming Service (ADSS) when the effective network transfer rate dips below a threshold (in our case, around 100 Mbps); a scale-out sketch follows
[Diagram: the simulation feeds ADSS-0, ADSS-1 and ADSS-2, each with its own buffer and data-transfer path; available network throughput is the difference between the maximum and current network transfer rates]
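The scale-out rule might look like the following sketch; the threshold comes from the slide, but the names and cap are assumptions, not the actual ADSS code.

```python
def adapt_instances(effective_rate_mbps, instances,
                    threshold_mbps=100, max_instances=4):
    """Return the new ADSS instance count given the observed transfer rate."""
    if effective_rate_mbps < threshold_mbps and instances < max_instances:
        return instances + 1      # scale out: add another ADSS instance
    return instances

# Usage: observed effective rates per controller interval
n = 1
for rate in [140, 90, 70, 120]:
    n = adapt_instances(rate, n)
print(n)  # -> 3 after the two congested intervals
```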
39. Buffer Occupancy at In-Transit Nodes, With and Without Coupling
- Buffer occupancy at in-transit nodes before congestion is around 50%
- During congestion, the application-level controller throttles data items
- Buffer occupancy at in-transit nodes reduces from 80% without coupling to 60.8% with coupling
- Higher buffer occupancies at in-transit nodes lead to failures and loss of data
40. Reservoir Characterization: EnKF-based History Matching (with S. Jha)
- Black Oil Reservoir Simulator
  - simulates the movement of oil and gas in subsurface formations
- Ensemble Kalman Filter
  - computes the Kalman gain matrix and updates the model parameters of the ensembles (the update equations are sketched below)
- Heterogeneous, dynamic workflows
- Based on Cactus, PETSc
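For reference, the standard EnKF analysis step that the Kalman-gain stage performs, in generic textbook notation (assumed, not transcribed from the talk): C_f is the forecast ensemble covariance, H the measurement operator, R the observation-error covariance, d^(i) the (perturbed) observations, and x_f^(i), x_a^(i) the forecast and analysis states of ensemble member i.

```latex
K = C_f H^{T} \left( H C_f H^{T} + R \right)^{-1}
\qquad
x_a^{(i)} = x_f^{(i)} + K \left( d^{(i)} - H\, x_f^{(i)} \right),
\quad i = 1, \dots, N
```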
41. Experiment Background and Set-Up (2/2)
- Key metrics
  - Total Time to Completion (TTC)
  - Total Cost of Completion (TCC)
- Basic assumptions
  - TG gives the best performance but is the relatively more restricted resource
  - EC2 is relatively more freely available but not as capable
- Note that the motivation of our experiments is to understand each of the usage scenarios and their feasibility, behaviors and benefits, not to optimize the performance of any one scenario.
42. Establishing Baseline Performance
Baseline TTC for EC2 and TG for a 1-stage, 128-ensemble-member EnKF run. The first 4 bars represent the TTC as the number of EC2 VMs increases; the next 4 bars represent the TTC as the number of TG CPUs (nodes) used increases.
43. Autonomic Integration of HPC Grids and Clouds (with S. Jha)
- Acceleration: Clouds used as accelerators to improve the application time-to-completion
  - alleviate the impact of queue wait times, or exploit an additional level of parallelism, by offloading appropriate tasks to Cloud resources
- Conservation: Clouds used to conserve HPC Grid allocations, given appropriate runtime and budget constraints
- Resilience: Clouds used to handle unexpected situations
  - handle unanticipated HPC Grid downtime, inadequate allocations, or unanticipated queue delays
(A schematic scheduler sketch follows.)
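A schematic decision rule covering the three usage modes; all names, estimates and thresholds are assumptions, and this is not the actual autonomic scheduler used in the experiments.

```python
class Task:
    def __init__(self, est_grid_mins):
        self.est_grid_mins = est_grid_mins   # estimated TG runtime

def place_task(task, grid_available, grid_alloc_mins, queue_wait_mins,
               deadline_mins, budget_left):
    """Decide whether a task runs on the HPC Grid (TG) or the Cloud (EC2)."""
    if not grid_available or grid_alloc_mins <= 0:
        return "EC2"                     # resilience: grid lost or exhausted
    if grid_alloc_mins < task.est_grid_mins and budget_left > 0:
        return "EC2"                     # conservation: preserve the allocation
    if queue_wait_mins > deadline_mins and budget_left > 0:
        return "EC2"                     # acceleration: skip the queue
    return "TG"

print(place_task(Task(30), True, 800, 5, 60, budget_left=10))  # -> TG
print(place_task(Task(30), True, 20, 5, 60, budget_left=10))   # -> EC2
```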
44. Objective I: Using Clouds as Accelerators for HPC Grids (1/2)
- Explore how Clouds (EC2) can be used as accelerators for HPC Grid (TG) workloads
  - 16 TG CPUs (1 node on Ranger)
  - average queuing time for TG was set to 5 and 10 minutes
  - the number of EC2 nodes varied from 20 to 100 in steps of 20
  - VM start-up time was about 160 seconds
45. Objective I: Using Clouds as Accelerators for HPC Grids (2/2)
The TTC and TCC for Objective I with 16 TG CPUs and queuing times set to 5 and 10 minutes. As expected, the more VMs that are made available, the greater the acceleration, i.e., the lower the TTC. The reduction in TTC is roughly linear, but not perfectly so, because of a complex interplay between the tasks in the workload and resource availability.
46. Objective II: Using Clouds for Conserving CPU-Time on the TeraGrid
- Explore how to conserve a fixed allocation of CPU hours by offloading tasks that perhaps don't need the specialized capabilities of the HPC Grid
Distribution of tasks across EC2 and TG, and the TTC and TCC, as the CPU-minute allocation on the TG is increased.
47. Objective III: Response to Changing Operating Conditions (Resilience) (1/4)
- Explore the situation where resources that were initially planned for become unavailable at runtime, either in part or in their entirety
- How can Cloud services be used to address these situations and allow the system/application to respond to a dynamic change in the availability of resources?
- Initially, 16 TG CPUs are allocated for 800 minutes. After about 50 minutes of execution (i.e., 3 tasks completed on the TG), the available CPU time is changed so that only 20 CPU-minutes remain
48. Objective III: Response to Changing Operating Conditions (Resilience) (2/4)
Allocation of tasks to TG CPUs and EC2 nodes for usage mode III. As the 16 allocated TG CPUs become unavailable after only 70 minutes, rather than the planned 800 minutes, the bulk of the tasks are completed by EC2 nodes.
49. Objective III: Response to Changing Operating Conditions (Resilience) (3/4)
Number of TG cores and EC2 nodes as a function of time for usage mode III. Note that the TG CPU allocation goes to zero after about 70 minutes, causing the autonomic scheduler to increase the EC2 nodes by 8.
50. Objective III: Response to Changing Operating Conditions (Resilience) (4/4)
Overheads of resilience on TTC and TCC.
51. Autonomic Formulations/Programming
52. LLC-based Self-Management in Accord
- Element/Service Managers are augmented with LLC Controllers
  - the manager monitors the state/execution context of elements
  - enforces adaptation actions determined by the controller
  - controller outputs augment human-defined rules (a small sketch follows)
53. The Instrumented Oil Field
- Production of oil and gas can take advantage of installed sensors that monitor the reservoir's state as fluids are extracted
- Knowledge of the reservoir's state during production can result in better engineering decisions
  - economical evaluation; physical characteristics (bypassed oil, high-pressure zones); production techniques for safe operating conditions in complex and difficult areas

"Application of Grid-Enabled Technologies for Solving Optimization Problems in Data-Driven Reservoir Studies," M. Parashar, H. Klie, U. Catalyurek, T. Kurc, V. Matossian, J. Saltz and M. Wheeler, FGCS: The International Journal of Grid Computing: Theory, Methods and Applications, Elsevier Science Publishers, Vol. 21, Issue 1, pp. 19-26, 2005.
54. Effective Oil Reservoir Management: Well Placement/Configuration
- Why is it important?
  - Better utilization/cost-effectiveness of existing reservoirs
  - Minimizing adverse effects on the environment
[Figure: bad management leaves much bypassed oil; better management leaves less bypassed oil]
55. Autonomic Reservoir Management: Closing the Loop Using Optimization
- Dynamic decision system, dynamic data-driven assimilation
- Optimize economic revenue, environmental hazard, ..., based on the present subsurface knowledge and numerical model
[Diagram: a closed loop starting from subsurface characterization and proceeding through management decision, data assimilation, acquisition of remote sensing data, updating knowledge of the model, planning optimal data acquisition, and experimental design, running on autonomic Grid middleware with Grid data management and processing middleware]
56. An Autonomic Well Placement/Configuration Workflow
[Figure: the workflow, driven by external inputs such as oil prices, weather, etc.]
57. Autonomic Oil Well Placement/Configuration
[Figure: contours of NEval(y,z,500); pressure contours for 3 wells over a 2D permeability profile. Exhaustive search requires NY × NZ (450) evaluations; the minimum is found by the VFSA solution walk after 20 (81) evaluations]
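For intuition, a generic very-fast-simulated-annealing (VFSA) loop over a toy one-dimensional objective; the temperature schedule and move generator follow Ingber's standard VFSA formulation, but the objective, bounds and constants are invented for illustration and are not the paper's actual setup.

```python
import math, random

def vfsa_step(x, lo, hi, T):
    """Cauchy-like VFSA move: heavy-tailed steps that shrink with T."""
    u = random.random()
    y = math.copysign(T * ((1 + 1 / T) ** abs(2 * u - 1) - 1), u - 0.5)
    return min(max(x + y * (hi - lo), lo), hi)

def vfsa(objective, lo, hi, t0=1.0, c=1.0, iters=200):
    x = random.uniform(lo, hi)
    fx = objective(x)
    best_x, best_f = x, fx
    for k in range(1, iters + 1):
        T = t0 * math.exp(-c * math.sqrt(k))   # VFSA temperature schedule
        cand = vfsa_step(x, lo, hi, T)
        f = objective(cand)
        # accept improvements always; worse moves with Boltzmann probability
        if f < fx or random.random() < math.exp(-(f - fx) / max(T, 1e-12)):
            x, fx = cand, f
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

# Usage: minimize a toy "bypassed oil" proxy over a 1-D well position
print(vfsa(lambda y: (y - 3.2) ** 2, lo=0.0, hi=10.0))
```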
58. Autonomic Oil Well Placement/Configuration (VFSA)
"An Autonomic Reservoir Framework for the Stochastic Optimization of Well Placement," V. Matossian, M. Parashar, W. Bangerth, H. Klie, M.F. Wheeler, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 8, No. 4, pp. 255-269, 2005.
"Autonomic Oil Reservoir Optimization on the Grid," V. Matossian, V. Bhat, M. Parashar, M. Peszynska, M. Sen, P. Stoffa and M. F. Wheeler, Concurrency and Computation: Practice and Experience, John Wiley and Sons, Volume 17, Issue 1, pp. 1-26, 2005.
59. Summary
- CI and emerging computational ecosystems
  - Unprecedented opportunity: new thinking and practices in science and engineering
  - Unprecedented research challenges: scale, complexity, heterogeneity, dynamism, reliability, uncertainty, ...
- Autonomic computing can address complexity and uncertainty
  - Separation + Integration + Automation
- Experiments with autonomics for science and engineering
  - Autonomic data streaming and in-transit data manipulation, autonomic workflows, autonomic runtime management, ...
- However, there are implications
  - Added uncertainty
  - Correctness, predictability, repeatability
  - Validation
60. Thank You!
Email: parashar@rutgers.edu