Title: Séminaire Ingénierie des Systèmes Complexes
1Séminaire Ingénierie des Systèmes Complexes
- 15 May 2006
- SLA-based routing for middleware: a step towards self-optimizing BPM
- Yves Caseau
- Bouygues Telecom
How can we build an adaptive integration infrastructure driven by business priorities and Quality of Service commitments?
2Problem Statement
- Given (1) a set of components that execute processes
[Figure labels: Help, PFS, Customer Base, Provisioning, CRM, adapter, Bus, Processflow Engine]
- (2) a service contract, (3) disruptions
- Activity peaks
- Failures
- Other processes
Service contract: 20 customers per hour, in less than 2 minutes
- Question: can process control be automated?
3Glossary of Acronyms
- EAI: Enterprise Application Integration
- SOA: Service-Oriented Approach
- BPM: Business Process Management
- QoS: Quality of Service
- SLA: Service Level Agreement
- And more: XML, UML, UDDI, WSDL, BPEL, ETL, BAM
4Outline
- I Bouygues Telecom's IT Architecture (EAI, BPM)
- II Optimization of Application Integration
- III Self-Adaptive Middleware and Passing Strategies
- IV Control Strategies and Rules
- V Conclusion
The problem
10% of a solution
5Bouygues Telecom's IT history
- 95-99 (exponential growth): IT built around the BSCS integrated package (Billing, CRM, Provisioning, Customer database, ...)
- Why change?
- Capacity and performance problems
- Too much ad-hoc development (costs)
- Time-to-market increase, decrease in flexibility
- 99-2000 strategy
- Take ownership of IT: business objects, process, integration
- Performance and scalability: component architecture
- Flexibility: BPM architecture, flexible components (meta-data)
- Quality of Service: redundant, secure infrastructure, SLA monitoring
I Enterprise Architecture and Re-engineering
6A Focus on Business Processes
[Figure labels: IT Systems (C. Management, CRM, DWH, Billing, Provisioning); Business Processes (P1, P2, Tasks); task distribution; EAI Infrastructure (Process Management, Transport, Business Objects Management, Technical Processes, Directories)]
Each transition is defined through business object updates
7Three dimensions of Enterprise Architecture
8Fractal Enterprise Architecture
- Two recursive patterns
- Recursive decomposition supports local vs. global perspectives
- Scale
- Constraints
- Performance
- Technology
-
- Deployment
- Common Features
- Object Model
- Gateway: Web service technology
[Figure labels: gateway (double proxy), internal bus, gateway]
9Our Enterprise Architecture
- Demand Management
- (process consistency)
- BPM
- Many instances
- standards
- EAI: 2-level processflow (dynamic object distribution)
- Customer repository
- Directory/reference
- Synchronization/resynchronization
- Data consistency
10Business Objects
- The cornerstone of our IT is our business object model, organized into a hierarchy of models
- UML model -> XML schema -> automated data transformation
- Business objects are distributed into many components ("keep the data where it is" philosophy)
[Figure: model hierarchy]
11Data Architecture
- Timeless problems to be solved
- Copy Synchronization
- Manage synchronization flows
- Maintain snapshot coherence
- General case is impossible (too costly)
- OK if coherence is restricted to a set of observations that is structured around business processes
- Interactions
- Activities interact through (1) messages/services, (2) shared resources (objects)
- Coherence -> signalization / exclusion / serialization
12The Truth about BPM ?
- Agility is not a matter of technology but of design
- Modularity, flexibility of functional analysis
- Upward compatibility of XML exchange formats
- Agile testing: easier said than done
- Synchronization of distributed heterogeneous data sources is an old and hard problem
- Coherence of synchronization and process control flows
- Need for re-synchronization (recovery)
- Shared resources mean that process executions are not independent (serialization and transaction mechanisms are needed)
- Business process operations: the hard part
- Monitoring is more difficult because the system is more robust?
- Incidents must be resolved on active systems
- A new culture
13Part II
- I Enterprise Architecture and Re-engineering
- II Optimization of Application Integration
- III Self-Adaptive Middleware and Passing Strategies
- IV Control Strategies and Rules
- V Conclusion
II Optimization of Application Integration
14Motivations for OAI
- Quality of Service is the foremost IT objective for a mobile operator
- IT is re-engineered around business processes (BPM)
- QoS is defined through SLAs (Service Level Agreements)
- Throughput: a flow of 3000 new subscriptions per day
- Latency: end-to-end processing time for a new subscription
- Example: less than 4 hours
- Availability: share of 7/24 time when the subscription service is available
- The challenge is to optimize the QoS of a chain from the specification of its links
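To make the "chain from its links" idea concrete, here is a minimal sketch that derives end-to-end figures from per-link figures. It assumes a purely serial, synchronous chain with independent failures, so throughput is the minimum over links, latency is the sum, and availability is the product. The asynchronous case is not covered by this simplification. The `LinkQoS` structure, the `chain_qos` function and all numbers are hypothetical; only the SLA targets (3000 subscriptions per day, under 4 hours) come from the slide.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class LinkQoS:
    """QoS figures of one link (hypothetical structure, not from the slides)."""
    name: str
    throughput_per_day: float  # messages the link can absorb per day
    latency_hours: float       # time the link needs to process one message
    availability: float        # fraction of time the link is up, in [0, 1]


def chain_qos(links):
    """End-to-end figures for a serial chain, under the stated assumptions."""
    return LinkQoS(
        name="chain",
        throughput_per_day=min(link.throughput_per_day for link in links),
        latency_hours=sum(link.latency_hours for link in links),
        availability=prod(link.availability for link in links),
    )


# Illustrative numbers only; they are not measurements from Bouygues Telecom.
links = [
    LinkQoS("CRM", 5000, 0.5, 0.995),
    LinkQoS("Provisioning", 3500, 2.0, 0.990),
    LinkQoS("Billing", 4000, 1.0, 0.998),
]
chain = chain_qos(links)

# SLA targets quoted on the slide: 3000 subscriptions per day, under 4 hours.
meets_sla = chain.throughput_per_day >= 3000 and chain.latency_hours < 4
print(chain, meets_sla)
```

Even in this simple model the chain is less available than any of its links, which is one reason the SLA has to be reasoned about end to end rather than system by system.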
15Business Process and Priorities
- i-mode launch example
- i-mode subscription is one of many business processes
- Others include billing / account management / ...
- SLA goals seemed straightforward
[Figure: three layers (Processes, Systems, Infrastructure); labels: Customer Base, CRM, Service Platform, Provisioning, Order Management, Fraud, Help, Accounts, Network]
III OAI and Processes
16OAI: Optimization of Application Integration
Goals: (1) sizing rules, (2) monitoring strategy, (3) operations: incident protocols, (4) design: routing / sorting rules
- IT systems: throughput, latency, availability, message protocol
- Goals (SLA): availability, latency, throughput, for each process
- Middleware: throughput, latency, availability, message routing
[Figure label: Processes]
17The challenge of OAI
- Why is OAI hard?
- Asynchronous availability is hard to compute
- Sizing (multi-commodity flow)
- Stochastic (irregular flows, bursts)
- Non-linear behavior (message protocol)
- Monitoring is difficult (for explanations)
- Functional dependencies between processes (QoS/QoD)
- Culture problem
- Batch, client/server, 3/3 architecture have been around for a while -> incident-solving know-how
- Distributed, asynchronous systems that exchange messages are far less common
- BP culture takes a long time to grow (cf. next slide)
18Business Process Monitoring
First step: taking ownership of business processes
Operations 7/7, 24/24 (alerts)
Client (Excel)
IT experts (score cards)
- BPM architecture is process-oriented -> better monitoring
- BAM monitoring tools are more and more relevant
- BUT
- Double cycle of maturity
- True complexity
[Figure labels: Business Maturity, Technical Maturity; Processes, SLA, Applications, Errors, Systems, Incident]
19Quality of service and Quality of Data
- References
- Sterling: "Data synchronization: What is Bad Data Costing Your Company?"
- DWHI: "Data Quality and the bottom line: achieving business success through a commitment to high quality data"
- Error rates ranging from a few % up to a few tens of %!
- Direct impact: loss of revenue
- Bouygues Telecom's experience: coupled degradation
- QoS -> QoD
- De-synchronization -> functional errors
- QoS degradation -> process exception handling -> errors (input coherence)
- QoD -> QoS
- Data mapping inconsistencies -> errors with adaptors and gateways -> pending customer requests
- More exception handling -> longer processing time -> non-respect of SLA
20Part III
- I Enterprise Architecture and Re-engineering
- II Optimization of Application Integration
- III Self-Adaptive Middleware and Passing Strategies
- IV Control Strategies and Rules
- V Conclusion
III Self-Adaptive Middleware
21SLAs, Priorities and Adaptive Strategies
- Each process has an SLA (throughput, latency, availability)
- Business processes have different priorities
- An adaptive strategy should balance the load according to priorities and SLAs
- Self-adaptive: tolerance to bursts
- Self-healing: tolerance to short failures (fail-over)
- Two approaches
- Message handling rules: modify the order in which messages are handled (higher priority first)
- Control rules: slow down lower-priority flows
22Simulation Model
- 5 processes (simplified real problem)
- P1 is a high-priority subscription process (high latency)
- P2 is a medium-priority automated barring process
- P3 is a lower-priority (3) barring process
- P4 is a high-priority de-barring process (low latency)
- P5 is a query process of medium priority
- Finite-event model
- Scenarios to evaluate graceful degradation
[Figure labels: Infrastructure, Processflow Engine, System, Monitor; events: StartTask, EndTask, ReceivedTask, TimeOutAlert, StartProcess, EndProcess, SetStatus, Failure]
23Routing Strategies
- FCFS (FIFO)
- Default method for most middleware: respects temporal constraints
- However, temporal ordering is not preserved by load distribution
- LCFS (FILO)
- Good strategy for handling backlogs
- SLA routing
- Prediction of processing time based on SLA
- Combination with priorities
- Process high priority messages first
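The strategies listed above can be read as different sort keys applied to a system's mailbox. The sketch below illustrates that reading. The `Message` fields, the strategy names and the use of a deadline (process start plus SLA latency) as a stand-in for the SLA-based prediction are all assumptions made for illustration; the slides do not give the actual projection formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """A mailbox entry (hypothetical fields, chosen for illustration)."""
    msg_id: str
    arrival: float      # time the message entered the mailbox
    priority: int       # priority of its process, 1 = highest
    started: float      # time the process instance started
    sla_latency: float  # end-to-end latency target of the process

    @property
    def deadline(self):
        # Latest completion time that still respects the SLA latency.
        return self.started + self.sla_latency


# Each strategy is a sort key; the message with the smallest key goes first.
STRATEGIES = {
    "FCFS": lambda m: m.arrival,                          # oldest first
    "LCFS": lambda m: -m.arrival,                         # newest first
    "SLA": lambda m: m.deadline,                          # closest deadline first
    "PRIORITY_SLA": lambda m: (m.priority, m.deadline),   # priority, then deadline
}


def next_message(mailbox, strategy):
    """Pick the message that the given strategy would handle next."""
    return min(mailbox, key=STRATEGIES[strategy])


mailbox = [
    Message("a", arrival=1.0, priority=3, started=0.0, sla_latency=10.0),
    Message("b", arrival=2.0, priority=1, started=1.0, sla_latency=20.0),
    Message("c", arrival=3.0, priority=2, started=2.0, sla_latency=4.0),
]
for name in STRATEGIES:
    print(name, next_message(mailbox, name).msg_id)
```

Expressing every strategy as a key function keeps the mailbox code identical across policies, which makes it easy to compare them in a simulation.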
III Message Passing
24Scenarios
- 3 types of scenarios
- Reference: static (with overload)
- Burst
- Component failure
- Different event distributions (uniform, Poisson, ...)
- Performance evaluation
- Multiple runs
- Average, standard deviation of SLA achievement
- Goal is to observe graceful degradation
(lower priority processes degrade first)
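As an illustration of the evaluation protocol (multiple runs, then average and standard deviation of SLA achievement), here is a deliberately small sketch. It is not the simulation engine used in the study. It assumes a single system with a FCFS mailbox, a fixed service time and Poisson arrivals, and every parameter value is hypothetical.

```python
import random
import statistics


def sla_achievement(rate, n_messages, service_time, sla_latency, seed):
    """Fraction of messages completed within the SLA latency in one run."""
    rng = random.Random(seed)
    arrival = 0.0      # arrival time of the current message
    server_free = 0.0  # time at which the system becomes idle again
    on_time = 0
    for _ in range(n_messages):
        arrival += rng.expovariate(rate)   # Poisson process: exponential gaps
        start = max(arrival, server_free)  # wait if the system is busy
        server_free = start + service_time
        if server_free - arrival <= sla_latency:
            on_time += 1
    return on_time / n_messages


def evaluate(rate, runs=20, **params):
    """Average and standard deviation of SLA achievement over several runs."""
    scores = [sla_achievement(rate, seed=s, **params) for s in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)


params = dict(n_messages=500, service_time=1.0, sla_latency=5.0)
light_mean, light_std = evaluate(0.5, **params)  # load at 50% of capacity
burst_mean, burst_std = evaluate(1.5, **params)  # sustained overload
print(f"light load: {light_mean:.3f} +/- {light_std:.3f}")
print(f"overload:   {burst_mean:.3f} +/- {burst_std:.3f}")
```

Fixing the seed per run makes the comparison between scenarios reproducible; a fuller model would add several processes with distinct priorities so that graceful degradation can be observed per process.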
25Results
- Priority routing works. The algorithms that use process priority as part of the sorting strategy are able to maintain the SLA of high-priority processes much longer.
- The second lesson is that FCFS is not a good default algorithm. LCFS does better as soon as the event flow becomes tight.
- The combination of priority and SLA sorting is the best approach.
26Part IV
- I Enterprise Architecture and Re-engineering
- II Optimization of Application Integration
- III Self-Adaptive Middleware and Passing Strategies
- IV Control Strategies and Rules
- V Conclusion
IV Control Strategies
27Flow Rules
- The first intuition at Bouygues Telecom was to implement flow control mechanisms ("emergency mode")
- Before actually implementing them in the EAI adapter, we used the simulation engine to evaluate two strategies
- RS1: When the QoS of a system X falls below 90% of its SLA level (cf. Section 3), we reduce the flow of systems that are providers of X and whose priority is lower than X's. A dual rule restores the default setting once the QoS of X reaches 90%.
- RS2: This is a similar rule, but the triggering condition is based on processes. When the QoS of a process P falls below 90%, we reduce the flow of all systems that have a lower priority than P and that are providers of a system that supports P.
- Flow control is more complex to operate, but it is not necessarily part of the middleware infrastructure
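A minimal sketch of how RS1 and its dual rule could be expressed follows. The `System` structure, the `apply_rs1` function and the throttling factor are hypothetical; only the 90% threshold and the provider/priority condition come from the slide. Recomputing every flow factor from scratch on each evaluation is one simple way to get the restoring behavior of the dual rule.

```python
from dataclasses import dataclass, field

QOS_THRESHOLD = 0.90  # from the slide: 90% of the SLA level
REDUCED_FLOW = 0.5    # hypothetical throttling factor, not given in the slides


@dataclass
class System:
    """A connected system (hypothetical structure for illustration)."""
    name: str
    priority: int            # 1 = highest priority
    qos_ratio: float = 1.0   # measured QoS divided by the SLA target
    providers: list = field(default_factory=list)  # systems that feed this one
    flow_factor: float = 1.0  # 1.0 = default flow, below 1.0 = throttled


def apply_rs1(systems):
    """Throttle lower-priority providers of any system below the threshold."""
    for system in systems:
        system.flow_factor = 1.0  # dual rule: default setting unless throttled
    for x in systems:
        if x.qos_ratio < QOS_THRESHOLD:
            for provider in x.providers:
                # A larger number means a lower priority than X.
                if provider.priority > x.priority:
                    provider.flow_factor = REDUCED_FLOW


orders = System("Orders", priority=1)
crm = System("CRM", priority=3)
provisioning = System("Provisioning", priority=2, providers=[orders, crm])
systems = [orders, crm, provisioning]

provisioning.qos_ratio = 0.80  # Provisioning falls below 90% of its SLA
apply_rs1(systems)
print({s.name: s.flow_factor for s in systems})
```

RS2 would follow the same pattern with the trigger evaluated per process and the throttled set computed from the systems that support that process.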
28Routing Rules
- We implemented rules that dynamically change the message handling strategy (using a status: FAST means use PRL to process a backlog)
- RS3: When the QoS of a system X drops below 95%, the system is switched to FAST status. The system resumes normal status once the QoS returns above 95%.
- RS4: When the QoS of a process P drops below 95%, all systems that support this process are switched to FAST status.
- RS5: A system is switched to FAST status whenever its mailbox size grows over 100. Obviously, the triggering size is a constant that depends on the volume that is processed by the EAI and the number of connected systems.
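The three routing rules can be sketched as a single status function. This is an illustrative reading, not the implementation evaluated in the study: the function name, its parameters and the option to enable rules selectively are assumptions, and QoS is expressed as a ratio of the SLA level. The thresholds (95% and a mailbox size of 100) come from the slide.

```python
FAST, NORMAL = "FAST", "NORMAL"
QOS_THRESHOLD = 0.95  # from the slide: 95%
MAILBOX_LIMIT = 100   # from the slide; depends on volume and connected systems


def system_status(system_qos, process_qos, mailbox_size,
                  rules=("RS3", "RS4", "RS5")):
    """Return the status of one system given the enabled rules.

    system_qos:   QoS ratio of the system itself (used by RS3)
    process_qos:  QoS ratios of the processes the system supports (RS4)
    mailbox_size: number of pending messages in its mailbox (RS5)
    """
    if "RS3" in rules and system_qos < QOS_THRESHOLD:
        return FAST
    if "RS4" in rules and any(q < QOS_THRESHOLD for q in process_qos):
        return FAST
    if "RS5" in rules and mailbox_size > MAILBOX_LIMIT:
        return FAST
    # No rule fires: the system is in, or returns to, normal status.
    return NORMAL


print(system_status(0.97, [0.99], 10))    # healthy system
print(system_status(0.90, [0.99], 10))    # RS3 fires
print(system_status(0.97, [0.90], 10))    # RS4 fires
print(system_status(0.97, [0.99], 150))   # RS5 fires
```

Because the status is recomputed from current measurements, a system leaves FAST status as soon as none of the enabled conditions holds. In this sketch a QoS exactly at the threshold counts as normal, which is a simplification of "returns above 95%".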
29Results
Does not provide any stable improvement
- Small improvement
- Simpler is better
30Conclusions
- A first step towards autonomic BPM
- Self-optimization
- Priority handling works: it is possible and fairly simple to take process priority into account for routing messages, and the results show a real improvement.
- The routing (mailbox sorting) algorithm matters: the more sophisticated SLA projection technique showed a real improvement over an FCFS policy.
- Control rules are interesting, but they are secondary to the routing policy: it is more efficient to deal with congestion problems with a distributed routing strategy than with a global rule schema.
- Self-healing: some form of self-healing is demonstrated, but true self-healing requires collaboration with HW
- Self-configuration: the goal is to make configuration declarative (e.g., SLA) rather than defining time-resource configuration (e.g., schedules)
V Conclusions
31Next Steps for Bouygues Telecom
- Promote SLA in BPM standards (BPEL <- WSLA, QML, ...)
- Priorities in BPM engines (lobbying)
- Organic operations
- From a mechanical towards a biological vision of fault-tolerance?
- Incidents do occur: handling them is part of business know-how, and often relies on a deep understanding of business logic.
- Incident recovery strategies and tools are first-class citizens of the IT infrastructure.
[Figure: system-based monitoring / recovery (ST1, ST2, ST3, ST4, with ST1 backup and ST3 backup) and process monitoring / recovery (ST1, ST2, ST3)]
32References
- General problem
- Urbanisation et BPM, 2nd edition, Dunod, March 2006.
- OAI experiments
- Self-Adaptive and Self-Healing Message Passing Strategies for Process-Oriented Integration Infrastructures. ECBS 2004: 506-512
- Self-adaptive middleware: Supporting business process priorities and service level agreements. Advanced Engineering Informatics 19(3): 199-211 (2005)
- Complex systems: organizational architecture
- How can information flows in a company be modeled (starting from processes) as a function of its organization?
- http://organisationarchitecture.blogspot.com/
- Book in preparation for January 2006