Title: ILC Global Control System
1 ILC Global Control System
2 ILC Accelerator overview
- Major accelerator systems
- Polarized PC gun electron source and undulator-based positron source.
- 5-GeV electron and positron damping rings, 6.7 km circumference.
- Beam transport from damping rings to bunch compressors.
- Two 11 km long 250-GeV linacs with 15,000 cavities and 600 RF units.
- A 4.5-km beam delivery system with a single interaction point.
J. Bagger
3 Control System Requirements and Challenges
- General requirements are largely similar to those of any large-scale experimental physics machine, but there are some challenges:
- Scalability
- 100,000 devices, several million control points.
- Large geographic scale: 31 km end to end.
- Multi-region, multi-lab development team.
- Support ILC accelerator availability goal of 85%.
- Intrinsic control system availability of 99% by design.
- Cannot rely on a fix-in-place approach.
- May require 99.999% (five nines) availability from each crate.
- Functionality to help minimize overall accelerator downtime.
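A rough worked example of why a 99% control system target can push the per-crate figure toward five nines. It assumes, purely for illustration, 1,000 crates that must all be up at once and that fail independently; the crate count and the function names are not from the slides.

```python
def series_availability(unit_availability: float, n_units: int) -> float:
    """Availability of n independent units that must all be up at once."""
    return unit_availability ** n_units


def required_unit_availability(system_target: float, n_units: int) -> float:
    """Per-unit availability needed so n units in series meet the target."""
    return system_target ** (1.0 / n_units)


N_CRATES = 1000  # hypothetical crate count, chosen only for illustration

for a in (0.999, 0.9999, 0.99999):
    print(f"crate A = {a}: system A = {series_availability(a, N_CRATES):.4f}")

needed = required_unit_availability(0.99, N_CRATES)
print(f"needed per crate for a 99% system: {needed:.6f}")
```

Under these assumptions, three or four nines per crate leave the system well short of 99%, while five nines just reaches it.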
4 Requirements and Challenges (2)
- Precision timing synchronization
- Distribute precision timing and RF phase references to many technical systems throughout the accelerator complex.
- Requirements consistent with LLRF requirements of 0.1% amplitude and 0.1 degree phase stability.
- Support remote operations / remote access (GAN / GDN)
- Allow collaborators to participate in machine commissioning, operation, optimization, and troubleshooting.
- At the technical equipment level there is little difference between on-site and off-site access: the Control Room is already remote.
- There are both technical and sociological challenges.
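To give a feel for the phase requirement, 0.1 degree of RF phase can be converted into a timing tolerance. This sketch assumes an RF frequency of 1.3 GHz, which is not stated on the slides; the function name is illustrative.

```python
def phase_to_time(phase_deg: float, rf_hz: float) -> float:
    """Time interval in seconds that corresponds to a phase angle at rf_hz."""
    return (phase_deg / 360.0) / rf_hz


RF_HZ = 1.3e9  # assumed RF frequency, not given in the slides
tolerance_s = phase_to_time(0.1, RF_HZ)
print(f"0.1 degree at {RF_HZ / 1e9:.1f} GHz is about {tolerance_s * 1e15:.0f} fs")
```

With that assumed frequency the tolerance comes out at roughly 0.2 ps, which is why timing and phase reference distribution is listed as a challenge of its own.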
5 Requirements and Challenges (3)
- Extensive reliance on machine automation
- Manage accelerator operations of the many accelerator systems, e.g. 15,000 cavities, 600 RF units.
- Automate machine startup, cavity conditioning, tuning, etc.
- Extensive reliance on beam-based feedback
- Multiple beam-based feedback loops at 5 Hz, e.g.
- Trajectory control, orbit control
- Dispersion measurement & control
- Beam energies
- Emittance correction
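A minimal sketch of what one pulse-by-pulse feedback loop might look like, reduced to a single BPM reading and a single corrector. The integral-feedback form, the gain, and all numbers are assumptions for illustration, not the ILC feedback design.

```python
def run_feedback(disturbance_mm, response_mm_per_unit, gain=0.5, pulses=20):
    """Integral feedback on one BPM reading using one corrector.

    Each pulse: measure the offset, then step the corrector by a fraction
    (gain) of the change that would cancel the measured offset.
    Returns the list of measured offsets, one per pulse.
    """
    corrector = 0.0
    history = []
    for _ in range(pulses):
        measured = disturbance_mm + response_mm_per_unit * corrector
        history.append(measured)
        corrector -= gain * measured / response_mm_per_unit
    return history


PULSE_RATE_HZ = 5  # loop rate quoted on the slide
offsets = run_feedback(disturbance_mm=1.0, response_mm_per_unit=2.0)
print(f"offset after 1 s ({PULSE_RATE_HZ} pulses): {offsets[PULSE_RATE_HZ]:.5f} mm")
```

With a gain of 0.5 the residual offset halves on every pulse, so a static 1 mm disturbance drops to about 0.03 mm within one second at 5 Hz. A real loop would use many BPMs and correctors and would have to cope with noise and failed sensors.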
6 Control System Functional Model
- Client Tier
- GUIs
- Scripting
- Services Tier
- Business Logic
- Device abstraction
- Feedback engine
- State machines
- Online models
- Front-End Tier
- Technical Systems Interfaces
- Control-point level
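The three tiers can be sketched in a few lines of code. Every class name, control-point name, and calibration constant below is hypothetical and only illustrates the layering; none of it is from a real control system.

```python
class FrontEnd:
    """Front-end tier: holds raw control points for one technical system."""

    def __init__(self):
        # Hypothetical raw ADC counts for one control point.
        self._points = {"klystron1:voltage_raw": 2048}

    def read(self, point: str) -> int:
        return self._points[point]


class KlystronDevice:
    """Services tier: device abstraction that hides raw control points."""

    COUNTS_TO_KV = 120.0 / 4096  # made-up calibration

    def __init__(self, front_end: FrontEnd, name: str):
        self._fe = front_end
        self._name = name

    def voltage_kv(self) -> float:
        return self._fe.read(f"{self._name}:voltage_raw") * self.COUNTS_TO_KV


# Client tier: a script that talks only to the device abstraction.
device = KlystronDevice(FrontEnd(), "klystron1")
print(f"klystron1 voltage: {device.voltage_kv():.1f} kV")
```

The point of the layering is that the client script never sees control-point names or raw counts, so front-end hardware can change without touching client code.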
7 Physical Model as applied to main linac (Front-end)
8 Some representative component counts
9 Which Control System?
- Established accelerator control system?
- EPICS, DOOCS, TANGO, ACNET, ...
- Development from scratch?
- Commercial solution?
- Too early to down-select for ILC, and there are benefits to not down-selecting during the R&D phase.
10 Availability Design Philosophy for the ILC
- Design for availability up front.
- Budget 15% downtime total. Keep an extra 10% as contingency.
- Try to get high availability for the minimum cost.
- Will need to iterate as the design progresses.
- Quantities are not final.
- Engineering studies may show that the cost minimum would be attained by moving some of the unavailability budget from one item to another.
- This means some MTBFs may be allowed to go down, but others will have to go up.
- Availability/reliability modeling (Availsim)
11 Availability budgets by system (percentage of total downtime)
12 MTBF/MTTR requirements from Availsim
13 High Availability primer
- Availability A = MTBF / (MTBF + MTTR)
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Repair
- If MTBF approaches infinity, A approaches 1.
- If MTTR approaches zero, A approaches 1.
- Both are impossible on a unit basis.
- Both are possible on a system basis.
- Key features for HA, i.e. A approaching 1:
- Modular design
- Built-in 1/n redundancy
- Hot standby systems
- Hot-swap capable at subsystem unit or subunit level
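The availability formula and the effect of one built-in spare can be checked numerically. The MTBF and MTTR values and the 10-unit group size are made up for illustration, and the units are assumed to fail independently.

```python
from math import comb


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR) for a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def k_of_n_availability(unit_a: float, k: int, n: int) -> float:
    """Probability that at least k of n independent identical units are up."""
    return sum(
        comb(n, i) * unit_a**i * (1 - unit_a) ** (n - i) for i in range(k, n + 1)
    )


unit = availability(mtbf_hours=10_000, mttr_hours=10)  # made-up numbers
no_spare = k_of_n_availability(unit, k=10, n=10)   # all 10 units required
one_spare = k_of_n_availability(unit, k=10, n=11)  # 10 needed, 1 spare installed
print(f"unit {unit:.5f}, no spare {no_spare:.5f}, one spare {one_spare:.6f}")
```

This is the sense in which high availability is impossible on a unit basis but possible on a system basis: no single unit improves, yet one spare moves the group of ten from roughly two nines to roughly four.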
14 Systems That Never Shut Down
- Any large telecom system will have a few redundant shelves, so loss of a whole unit does not bring down the system, like the RF system in the linac.
- Load is auto-rerouted to a hot spare, again like the linac.
- Key: all equipment is always accessible for hot swap.
- Other features
- Open system, non-proprietary: very important for non-telecom customers like ILC.
- Developed by an industry consortium¹ of major companies sharing in a $100B market.
- 20X larger market than any of the old standards, including VME, leads to competitive prices.
¹ PICMG -- PCI Industrial Computer Manufacturers Group
15 Controls Cluster
Features:
- Dual Star 1/N redundant backplanes
- Redundant fabric switches
- Dual Star/Loop/Mesh serial links
- Dual Star serial links to/from Level 2 sector nodes
[Figure: controls cluster, Dual Star/Loop/Mesh. Labels: Applications Modules; Dual Fabric Switches; Dual Star to/from Sector Nodes]
16 HA Concept: DR Kicker Systems
- Approx. 50 unit drivers
- n/N redundancy at system level (extra kickers)
- n/N redundancy at unit level (extra cards)
- Diagnostics on each card, networked, local wireless
17 Physical Model as applied to main linac (Front-end)
18 High Availability Control System
- Control system itself must be highly available.
- Redundant and hot-swap hardware platform (baseline ATCA).
- Redundancy functionality in control system software.
- In many cases, redundancy and hot-swap/hot-reconfigure can only be implemented at the accelerator system level, e.g.
- Rebalance RF systems if a klystron fails.
- Modify control algorithm on loss of a critical sensor.
- Control system will provide high-availability functionality at the accelerator system level.
- Technical systems must provide a high level of diagnostics to support remote troubleshooting and re-configuration.
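As an illustration of system-level redundancy, rebalancing after a klystron failure could be sketched as below. The even-split policy, the unit names, and the energy-gain figures are assumptions for illustration only; a real scheme would depend on cavity limits and beam dynamics.

```python
def rebalance_rf(nominal_mv, failed, max_mv):
    """Spread the energy gain lost with failed RF units evenly over the survivors.

    nominal_mv: dict of unit name -> energy gain in MV
    failed:     set of unit names that have tripped
    max_mv:     largest energy gain any single unit can provide
    Raises ValueError if the survivors cannot make up the loss.
    """
    survivors = [u for u in nominal_mv if u not in failed]
    if not survivors:
        raise ValueError("no RF units left to rebalance onto")
    lost = sum(nominal_mv[u] for u in nominal_mv if u in failed)
    extra = lost / len(survivors)
    settings = {
        u: (0.0 if u in failed else nominal_mv[u] + extra) for u in nominal_mv
    }
    if max(settings.values()) > max_mv:
        raise ValueError("survivors lack the headroom to cover the loss")
    return settings


# Made-up example: five units at 800 MV each, each able to reach 1000 MV.
units = {f"rf{i}": 800.0 for i in range(1, 6)}
after_trip = rebalance_rf(units, failed={"rf3"}, max_mv=1000.0)
print(after_trip)
```

The total energy gain is preserved only while the surviving units have headroom; in this example a second failure exhausts it and the function raises instead of returning settings.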
19 ATCA as a reference platform
[Figure: ATCA shelves, front and rear views. Labels: 5-Slot Crate w/ Shelf Manager & Fabric Switch; Dual IOC Processors; 4 Hot-Swappable Fans; 16 Slot Dual Star Backplane; Shelf Manager; Dual 48VDC Power Interface; Dual IOCs & Fabric Switch; Rear View. Credit: R. Larsen]
21 ATCA as reference platform for Front-end electronics
- Representative of the breadth of high-availability functions needed
- Hot-swappable components: circuit boards, fans, power supplies, ...
- Remote power management: power on/off each circuit board
- Supports redundancy: processors, comms links, power supplies, ...
- Remote resource management through Shelf Manager
- µTCA offers lower cost but with a reduced feature set.
- There is growing interest in the physics community in exploring ATCA for instrumentation and DAQ applications.
- As a candidate technology for the ILC, ATCA/µTCA have strong potential; currently it is an emerging standard.
22 Read Out evolution LHC --> ILC
[Figure: read-out chain comparison. Labels: Subdetector; Digital Buffer; Read Out Crate (VME 9U); Read Out Driver; 92; AMC; SLink; 400 ROBin (PCI); Read Out Buffer (3 ROBin); ROS (150 PCs); ATCA Module; ATCA Crate; Gbit Link to GbE Switch (60 PCs)]
23 Cost/Benefit Analysis of HA Techniques
[Chart: HA techniques plotted by availability (benefit) against cost (some effort laden, some materials laden)]
13. Automatic failover
12. Model-based automated diagnosis
11. Manual failover (e.g. bad memory, live patching)
10. Hot swap hardware
9. Application design (error code checking, etc.)
8. Development methodology (testing, standards, patterns)
7. Adaptive machine control (detect failed BPM, modify feedback)
6. Model-based configuration management (change management)
5. Extensive monitoring (hardware and software)
4. COTS redundancy (switches, routers, NFS, RAID disks, database, etc.)
3. Automation (supporting RF tune-up, magnet conditioning, etc.)
2. Disk volume management
1. Good administrative practices
24 HA R&D objectives
- Learn about HA (High Availability) in the context of accelerator controls
- Bring in expertise (RTES, training, NASA, military, ...)
- Develop (adopt) a methodology for examining control system failures
- Fault tree analysis
- FMEA or scenario-based FMEA
- Supporting software (CAFTA, SAPPHIRE, ...)
- Others?
- Develop policies for detecting and managing identified failure modes
- Development and testing methodology
- Workaround
- Redundancy
- Develop a full vertical prototype implementation
- I.e. how we might implement the above policies
- Integrate portions of vertical prototype with test stands (LLRF)
- Feed some software-oriented data to SLAC availability simulation?
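For readers new to fault tree analysis, the basic arithmetic is small enough to sketch. The tree below (one front-end crate with redundant supplies and links but a single CPU) and all probabilities are invented for illustration, and the events are assumed independent.

```python
from math import prod


def and_gate(*probs):
    """All inputs must fail (independent events)."""
    return prod(probs)


def or_gate(*probs):
    """Any single input failing is enough (independent events)."""
    return 1.0 - prod(1.0 - p for p in probs)


# Made-up failure probabilities over one running period.
P_SUPPLY, P_CPU, P_LINK = 0.01, 0.002, 0.005

power_lost = and_gate(P_SUPPLY, P_SUPPLY)  # both redundant supplies fail
comms_lost = and_gate(P_LINK, P_LINK)      # both redundant links fail
crate_lost = or_gate(power_lost, P_CPU, comms_lost)

print(f"P(crate lost) = {crate_lost:.6f}")
```

In this toy tree the non-redundant CPU dominates the top event, which is the kind of result a failure analysis is meant to surface before deciding where redundancy is worth its cost.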
25 High Availability Software
- What are the most common and critical failure
modes in control system software?
- Mis-configuration
- Network buffer overruns
- Application logic bugs
- Task deadlock
- Accepting conflicting commands
- Ungraceful handling of failed sensors/actuators
- Flying blind (lack of monitoring)
- Introduction of untested features
- More
- How do we mitigate these, and what is the
cost/benefit?
26 Sample of Techniques: Shelf Management
- Shelf Manager
- Identify all boards on shelf
- Power cycle boards (individually)
- Reset boards
- Monitor voltages/temps
- Manage Hot-Swap LED state
- Switch to backup flash memory bank
- More
[Figure labels: Client Tier; Services Tier; Controls Protocol; Front-end tier; Custom; CPU1; I/O; CPU2; sensor]
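The Shelf Manager functions listed on this slide can be pictured with a toy in-memory model. This is not a real IPMI or ATCA interface: the class, its methods, and the board data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Board:
    slot: int
    name: str
    powered: bool = True
    temp_c: float = 40.0
    flash_bank: int = 0  # 0 = primary, 1 = backup


class ShelfManagerModel:
    """Toy in-memory stand-in for a shelf manager (not a real IPMI/ATCA API)."""

    def __init__(self, boards):
        self._boards = {b.slot: b for b in boards}
        self.log = []

    def identify(self):
        """Return {slot: board name} for every board on the shelf."""
        return {slot: b.name for slot, b in sorted(self._boards.items())}

    def power_cycle(self, slot):
        """Power one board off and back on without touching its neighbours."""
        board = self._boards[slot]
        board.powered = False
        self.log.append(f"slot {slot} off")
        board.powered = True
        self.log.append(f"slot {slot} on")

    def over_temperature(self, limit_c):
        """Slots whose monitored temperature exceeds the limit."""
        return [s for s, b in sorted(self._boards.items()) if b.temp_c > limit_c]

    def switch_to_backup_flash(self, slot):
        """Select the backup flash bank for the board in this slot."""
        self._boards[slot].flash_bank = 1


shelf = ShelfManagerModel(
    [Board(1, "CPU1"), Board(2, "CPU2", temp_c=71.5), Board(3, "I/O")]
)
print(shelf.identify())
print("hot slots:", shelf.over_temperature(limit_c=65.0))
```

The value for remote troubleshooting is that each action targets one slot, so a misbehaving board can be inspected, reset, or rolled back to its backup flash bank without a tunnel access.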
27 SAF Availability Management Framework
A simple example of software component runtime lifecycle management.
[Figure: AMF logical entities (Node U, Node V, Service Group, Service Units, Components, Service Instance, active and standby assignments) and Service Unit administrative states (Unlocked, Locked, Locked-Instantiation, Shutting down)]
1. Service unit starts out un-instantiated.
2. State changed to locked, meaning software is instantiated on node, but not assigned work.
3. State changed to unlocked, meaning software is assigned work (Service Instance).
Service Instance is the work assigned to a Service Unit.
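The three lifecycle steps above can be expressed as a small state machine. This is a simplified reading of the slide, not the SAF programming interface: the class names are hypothetical and the transition table is an assumption about which state changes are allowed.

```python
from enum import Enum


class AdminState(Enum):
    LOCKED_INSTANTIATION = "locked-instantiation"  # software not instantiated
    LOCKED = "locked"                # instantiated, no work assigned
    UNLOCKED = "unlocked"            # work (service instance) assigned
    SHUTTING_DOWN = "shutting-down"  # finishing work before locking


# Assumed transition table for this sketch.
ALLOWED = {
    AdminState.LOCKED_INSTANTIATION: {AdminState.LOCKED},
    AdminState.LOCKED: {AdminState.UNLOCKED, AdminState.LOCKED_INSTANTIATION},
    AdminState.UNLOCKED: {AdminState.LOCKED, AdminState.SHUTTING_DOWN},
    AdminState.SHUTTING_DOWN: {AdminState.LOCKED},
}


class ServiceUnit:
    """Hypothetical service unit tracking only its administrative state."""

    def __init__(self, name):
        self.name = name
        self.state = AdminState.LOCKED_INSTANTIATION  # step 1: un-instantiated

    def set_state(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state

    @property
    def has_work(self):
        return self.state in (AdminState.UNLOCKED, AdminState.SHUTTING_DOWN)


su = ServiceUnit("ioc-su-1")
su.set_state(AdminState.LOCKED)    # step 2: instantiated, no work yet
su.set_state(AdminState.UNLOCKED)  # step 3: service instance assigned
print(su.name, su.state.value, "has work:", su.has_work)
```

Making the allowed transitions explicit is what lets a framework refuse, for example, to assign work to software that was never instantiated.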
28 SAF (Service Availability Forum) Specifications
[Figure: layered stack. Labels: HA Applications; Other Middleware and Application Services; Application Interface Specification (AIS Middleware); Hardware Platform Interface (HPI Middleware); Carrier Grade Operating System; Managed Hardware Platform]
Diagram courtesy of Service Availability Forum
29 SAF Availability Management Framework
- AMF: Availability Management Framework
- Manages software runtime lifecycle, fault reporting, failover policies, etc.
- Works in combination with a collection of well-defined services to provide a powerful environment for application software components.
- CLM: Cluster Membership Service
- LOG: Log Service
- CKPT: Checkpoint Service
- EVT: Event Service
- LCK: Lock Service
- More
- An open standard from the telecom industry geared towards supporting a highly available, highly distributed system.
- Potential application to critical core control system software such as IOCs, device servers, gateways, nameservers, data reduction, etc.
- Know exactly what software is running where.
- Be able to gracefully restart components, or manage state while hot-swapping underlying hardware.
- Uniform diagnostics to troubleshoot problems.
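One way to picture "manage state while hot-swapping" is an active/standby pair that shares checkpoints. The sketch below is a toy with hypothetical class names; it imitates the idea of a checkpoint service and does not use the SAF interfaces.

```python
class CheckpointStore:
    """Toy checkpoint store; a stand-in for a checkpoint service."""

    def __init__(self):
        self._data = {}

    def save(self, key, value):
        self._data[key] = value

    def load(self, key, default=None):
        return self._data.get(key, default)


class Counter:
    """Hypothetical component whose only state is a running count."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.role = "standby"
        self.count = 0

    def activate(self):
        """Take over as active, resuming from the last checkpoint."""
        self.count = self.store.load("count", 0)
        self.role = "active"

    def process(self):
        """Handle one unit of work and checkpoint the new state."""
        if self.role != "active":
            raise RuntimeError(f"{self.name} is not active")
        self.count += 1
        self.store.save("count", self.count)
        return self.count


store = CheckpointStore()
primary, backup = Counter("node-U", store), Counter("node-V", store)
primary.activate()
for _ in range(3):
    primary.process()

primary.role = "failed"  # simulated fault on the active component
backup.activate()        # failover: the standby resumes from the checkpoint
print(backup.name, "continues at", backup.process())
```

The standby picks up the count where the failed component left it, which is the behaviour a framework such as AMF is meant to coordinate; deciding what state to checkpoint, and how often, remains application work.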
30 An HA software framework is just the start
- SAF (Service Availability Forum) implementations won't solve the HA problem.
- You still have to determine what you want to do and encode it in the framework: this is where the work lies.
- What are the failures?
- How to identify a failure
- How to compensate (failover, adaptation, hot-swap)
- Is the resultant software complexity manageable?
- Potential fix worse than the problem
- Always evaluate: am I actually improving availability?
31 R&D Engineering Design (EDR) Phase
- Main focus of R&D efforts is on high availability.
- Gain experience with high availability tools & techniques to be able to make value-based judgments of cost versus benefit.
- Four broad categories
- Control system failure mode analysis
- High-availability electronics platforms (ATCA)
- High-availability integrated control systems
- Conflict avoidance & failover, model-based resource monitoring.
- Control system as a tool for implementing system-level HA
- Fault detection methods, failure modes & effects
32 HA means doing things differently
- ILC must apply techniques not typically used at an accelerator, particularly in software.
- Development culture must be different this time.
- Cannot build ad hoc with in-situ testing.
- Build modeling, simulation, testing, and monitoring into hardware and software methodology up front.
- Reliable hardware
- Instrumentation electronics to servers and disks.
- Redundancy where feasible, otherwise adapt in software.
- Modeling and simulation (T. Himel).
- Reliable software
- Equally important.
- Software has many more internal states: difficult to predict.
- Modeling and simulation needed here for networking and software.
33 Controls topic areas
- LLRF algorithms
- RF phase & timing distribution, synchronization
- Machine automation, beam-based feedback
- ATCA evaluation as front-end instrumentation platform
- ATCA evaluation for control system integration
- HA integrated control system
- Integrated control system as a tool for system-level HA
- Remote access, remote operations (GAN/GDN)
- Failure modes analysis
- Lots of opportunities to get involved