ILC Global Control System - PowerPoint PPT Presentation

Title:

ILC Global Control System

Description:

ILC9 but table B MTBFs and 6% linac energy overhead. Device. Nominal MTBF (hours) all electronics modules. power supply controllers. magnets - water cooled. power supplies. – PowerPoint PPT presentation

Slides: 34

Provided by: wwwadFna

Learn more at: https://pingprod.fnal.gov

Transcript and Presenter's Notes

Title: ILC Global Control System

1
ILC Global Control System

John Carwardine, ANL

2
ILC Accelerator overview

Major accelerator systems
Polarized PC gun electron source and
undulator-based positron source.
5-GeV electron and positron damping rings, 6.7 km
circumference.
Beam transport from damping rings to bunch
compressors.
Two 11-km-long 250-GeV linacs with 15,000
cavities and 600 RF units.
A 4.5-km beam delivery system with a single
interaction point.

J. Bagger
3
Control System Requirements and Challenges

General requirements are largely similar to those
of any large-scale experimental physics machine,
but there are some challenges:
Scalability
100,000 devices, several million control points.
Large geographic scale: 31 km end to end.
Multi-region, multi-lab development team.
Support ILC accelerator availability goals of
85%.
Intrinsic control system availability of 99% by
design.
Cannot rely on a fix-in-place approach.
May require 99.999% (five nines) availability
from each crate.
Functionality to help minimize overall
accelerator downtime.
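
A rough sketch of why a 99% system target can translate into five nines per crate: if the control system needs every crate up at the same time and crates fail independently, the per-crate availabilities multiply. The crate count of 1000 is an assumption made only for illustration (the slides do not give one), and the function names are hypothetical.

```python
def series_availability(unit_availability: float, n_units: int) -> float:
    """Availability of n independent units that must all be up at once."""
    return unit_availability ** n_units


def required_unit_availability(system_target: float, n_units: int) -> float:
    """Per-unit availability needed for n units in series to meet the target."""
    return system_target ** (1.0 / n_units)


if __name__ == "__main__":
    n_crates = 1000  # assumed crate count, for illustration only
    print("system availability with five-nines crates:",
          series_availability(0.99999, n_crates))
    print("per-crate availability needed for a 99% system:",
          required_unit_availability(0.99, n_crates))
```

Under these assumptions, five-nines crates give a system availability of roughly 99%, which matches the order of magnitude quoted on the slide.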

4
Requirements and Challenges (2)

Precision timing synchronization
Distribute precision timing and RF phase
references to many technical systems throughout
the accelerator complex.
Requirements consistent with LLRF requirements of
0.1% amplitude and 0.1 degree phase stability.
Support remote operations / remote access (GAN /
GDN)
Allow collaborators to participate with machine
commissioning, operation, optimization, and
troubleshooting.
At technical equipment level there is little
difference between on-site and off-site access -
Control Room is already remote.
There are both technical and sociological
challenges.

5
Requirements and Challenges (3)

Extensive reliance on machine automation
Manage accelerator operations of the many
accelerator systems, e.g. 15,000 cavities, 600 RF
units.
Automate machine startup, cavity conditioning,
tuning, etc.
Extensive reliance on beam-based feedback
Multiple beam-based feedback loops at 5 Hz, e.g.
Trajectory control, orbit control
Dispersion measurement control
Beam energies
Emittance correction
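
As a minimal sketch of a pulse-by-pulse beam-based feedback loop, the toy below applies one integral-feedback update per pulse and ignores monitors that report no reading. The plant model (every monitor sees disturbance plus corrector), the gain of 0.5, and all names are assumptions for illustration, not the ILC design.

```python
from statistics import fmean


def valid_readings(readings):
    """Drop readings from monitors flagged as failed (reported as None)."""
    return [r for r in readings if r is not None]


def feedback_step(corrector, readings, gain=0.5):
    """One integral-feedback update: steer against the mean measured offset."""
    good = valid_readings(readings)
    if not good:
        return corrector  # no usable data: hold the last setting
    return corrector - gain * fmean(good)


def simulate(pulses=20, disturbance=1.0, failed_monitor=1):
    """Toy plant: each monitor sees disturbance + corrector on every pulse."""
    corrector = 0.0
    for _ in range(pulses):
        offset = disturbance + corrector
        readings = [offset, offset, offset]
        readings[failed_monitor] = None  # one monitor has failed
        corrector = feedback_step(corrector, readings)
    return disturbance + corrector  # residual offset after the last pulse


if __name__ == "__main__":
    print("residual offset:", simulate())
```

With this gain the residual offset halves on every pulse, and the loop keeps converging with one monitor out of service.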

6
Control System Functional Model

Client Tier
GUIs
Scripting

Services Tier
Business Logic
Device abstraction
Feedback engine
State machines
Online models

Front-End Tier
Technical Systems Interfaces
Control-point level
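
The three tiers can be sketched as follows. This is only an illustration of the layering: the class names, the control-point name, and the calibration constant are all hypothetical and do not come from any particular control system toolkit.

```python
class FrontEnd:
    """Front-end tier: raw control points of one technical system."""

    def __init__(self):
        self._points = {"mag1:current_raw": 0}  # hypothetical control point

    def read(self, point):
        return self._points[point]

    def write(self, point, value):
        self._points[point] = value


class MagnetService:
    """Services tier: device abstraction in engineering units."""

    COUNTS_PER_AMP = 100  # assumed calibration, for illustration only

    def __init__(self, front_end, point):
        self._front_end = front_end
        self._point = point

    def set_current(self, amps):
        self._front_end.write(self._point, round(amps * self.COUNTS_PER_AMP))

    def get_current(self):
        return self._front_end.read(self._point) / self.COUNTS_PER_AMP


def client_script(service):
    """Client tier: a script that only sees the abstract device."""
    service.set_current(12.5)
    return service.get_current()


if __name__ == "__main__":
    fe = FrontEnd()
    print(client_script(MagnetService(fe, "mag1:current_raw")))
```

The client never touches raw counts, so the front-end hardware could change without changing the script.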

7
Physical Model as applied to main linac
(Front-end)
8
Some representative component counts
9
Which Control System?

Established accelerator control system?
EPICS, DOOCS, TANGO, ACNET, ...
Development from scratch?
Commercial solution?
Too early to down-select for ILC, and there are
benefits to not down-selecting during the R&D phase.

10
Availability Design Philosophy for the ILC

Design for Availability up front.
Budget 15% downtime total. Keep an extra 10% as
contingency.
Try to get the high availability for the minimum
cost.
Will need to iterate as design progresses.
Quantities are not final
Engineering studies may show that the cost
minimum would be attained by moving some of the
unavailability budget from one item to another.
This means some MTBFs may be allowed to go down,
but others will have to go up.
Availability/reliability modeling (Availsim)

11
Availability budgets by system (percentage total
downtime)
12
MTBF/MTTR requirements from Availsim
13
High Availability primer

Availability A = MTBF / (MTBF + MTTR)
MTBF = Mean Time Before Failure
MTTR = Mean Time To Repair
If MTBF approaches infinity A approaches 1
If MTTR approaches zero A approaches 1
Both are impossible on a unit basis
Both are possible on a system basis.
Key features for HA, i.e. A approaching 1
Modular design
Built-in 1/n redundancy
Hot standby systems
Hot-swap capable at subsystem unit or subunit
level
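
The primer can be checked numerically with a short sketch. It uses the availability formula from this slide plus the standard result for independent redundant units (the system is up if at least one unit is up); the independence assumption and the function names are additions for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(unit_availability: float, n_units: int) -> float:
    """System is up if at least one of n independent units is up."""
    return 1.0 - (1.0 - unit_availability) ** n_units


if __name__ == "__main__":
    a_unit = availability(mtbf_hours=99.0, mttr_hours=1.0)
    print("single unit:", a_unit)
    print("two units, one needed:", redundant_availability(a_unit, 2))
    # Shorter repair time (e.g. via hot swap) also pushes A toward 1.
    print("same MTBF, 10x faster repair:", availability(99.0, 0.1))
```

This shows the two levers named on the slide: redundancy raises availability at the system level, and hot swap lowers the effective MTTR.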

14
Systems That Never Shut Down

Any large telecom system will have a few
redundant shelves, so loss of a whole unit does
not bring down the system, like the RF system in
the linac.
Load auto-rerouted to hot spare, again like the
linac.
Key: all equipment always accessible for hot
swap.
Other features
Open system, non-proprietary: very important for
non-telecom customers like ILC.
Developed by an industry consortium¹ of major
companies sharing in a $100B market.
20X larger market than any of the old standards,
including VME, which leads to competitive prices.

¹ PICMG -- PCI Industrial Computer Manufacturers
Group
15
Controls Cluster
Dual Star/ Loop/Mesh
FEATURES
Dual Star 1/N Redundant Backplanes
Redundant Fabric Switches
Dual Star/Loop/Mesh Serial Links
Dual Star Serial Links To/From Level 2 Sector Nodes
Applications Modules
Dual Fabric Switches
Dual Star to/From Sector Nodes
16
HA Concept: DR Kicker Systems

Approx. 50 unit drivers
n/N redundancy at system level (extra kickers)
n/N redundancy at unit level (extra cards)
Diagnostics on each card, networked, local
wireless

17
Physical Model as applied to main linac
(Front-end)
18
High Availability Control System

Control system itself must be highly available
Redundant and hot-swap hardware platform
(baseline ATCA).
Redundancy functionality in control system
software.
In many cases, redundancy and
hot-swap/hot-reconfigure can only be implemented
at the accelerator system level, e.g.
Rebalance RF systems if a klystron fails.
Modify control algorithm on loss of critical
sensor.
Control System will provide High Availability
functionality at the accelerator system level.
Technical systems must provide high level of
diagnostics to support remote troubleshooting and
re-configuration.
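
As a sketch of system-level redundancy, the function below shares a target energy gain equally among the RF units that are still working and refuses when the per-unit limit would be exceeded. The equal-share policy, the numbers, and the function name are assumptions for illustration, not the ILC rebalancing algorithm.

```python
def rebalance(target_energy, unit_ok, max_per_unit):
    """Share the target energy gain equally among working RF units.

    unit_ok is a list of booleans, one per RF unit.
    Returns a dict mapping unit index to its assigned energy gain.
    """
    working = [i for i, ok in enumerate(unit_ok) if ok]
    if not working:
        raise RuntimeError("no working RF units")
    share = target_energy / len(working)
    if share > max_per_unit:
        raise RuntimeError("overhead exhausted: cannot reach target energy")
    return {i: share for i in working}


if __name__ == "__main__":
    # Five units, one klystron failed; limit chosen to leave some overhead.
    print(rebalance(100.0, [True, True, False, True, True], max_per_unit=26.0))
```

The energy overhead built into the linac is what makes the first failure recoverable; once it is used up, the function signals that the target can no longer be met.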

19
ATCA as a reference platform
5-Slot Crate w/ Shelf Manager Fabric Switch
Dual IOC Processors
4 Hot-Swappable Fans
16 Slot Dual Star Backplane
Shelf Manager
Dual 48VDC Power Interface
Dual IOCs Fabric Switch
Rear View
R. Larsen
21
ATCA as reference platform for Front-end
electronics

Representative of the breadth of
high-availability functions needed
Hot-swappable components: circuit boards, fans,
power supplies, ...
Remote power management: power on/off each
circuit board
Supports redundancy: processors, comms links,
power supplies, ...
Remote resource management through Shelf Manager
µTCA offers lower cost but with a reduced feature
set.
There is growing interest in the physics
community in exploring ATCA for instrumentation
and DAQ applications.
As candidate technologies for the ILC, ATCA/µTCA
have strong potential; currently it is an
emerging standard.

22
Read Out evolution: LHC -> ILC
Subdetector
Subdetector
Digital Buffer
??C??CTA

Read Out Crate (VME 9U)
Read Out Driver
92
AMC
SLink
400 Robin (PCI)

Read Out Buffer (3 ROBin)
ROS (150 PCs)
ATCA Module
ATCA Crate
Gbit Link to Gbe Switch (60 PCs)
23
Cost/Benefit Analysis of HA Techniques
Chart axes: Availability (benefit) versus Cost (some effort laden, some materials laden)
13. Automatic failover
12. Model-based automated diagnosis
11. Manual failover (e.g. bad memory, live patching)
10. Hot swap hardware
9. Application design (error code checking, etc.)
8. Development methodology (testing, standards,
patterns)
7. Adaptive machine control (detect failed BPM,
modify feedback)
6. Model-based configuration management (change
management)
5. Extensive monitoring (hardware and software)
4. COTS redundancy (switches, routers, NFS, RAID
disks, database, etc.)
3. Automation (supporting RF tune-up, magnet
conditioning, etc.)
2. Disk volume management
1. Good administrative practices
24
HA R&D objectives

Learn about HA (High Availability) in the context
of accelerator controls
Bring in expertise (RTES, training, NASA,
military, ...)
Develop (adopt) a methodology for examining
control system failures
Fault tree analysis
FMEA or scenario-based FMEA
Supporting software (CAFTA, SAPPHIRE, ...)
Others?
Develop policies for detecting and managing
identified failure modes
Development and testing methodology
Workaround
Redundancy
Develop a full vertical prototype
implementation
I.e., how we might implement the above policies
Integrate portions of vertical prototype with
test stands (LLRF)
Feed some software-oriented data to SLAC
availability simulation?

25
High Availability Software

What are the most common and critical failure
modes in control system software?

Mis-configuration
Network buffer overruns
Application logic bugs
Task deadlock
Accepting conflicting commands

Ungraceful handling of failed sensors/actuators
Flying blind (lack of monitoring)
Introduction of untested features
More

How do we mitigate these, and what is the
cost/benefit?

26
Sample of Techniques: Shelf Management
Client Tier
Services Tier
Controls Protocol

Shelf Manager
Identify all boards on shelf
Power cycle boards (individually)
Reset boards
Monitor voltages/temps
Manage Hot-Swap LED state
Switch to backup flash mem bank
More

Custom
CPU1
I/O
CPU2
Front-end tier
sensor
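
The Shelf Manager functions listed on this slide can be sketched as a small in-memory model. This is a hypothetical interface written for illustration; it is not the ATCA or IPMI API, and the temperature limit is an assumed value.

```python
class Board:
    """One board in a shelf, with a single monitored temperature."""

    def __init__(self, slot, temp_c):
        self.slot = slot
        self.temp_c = temp_c
        self.powered = True
        self.power_cycles = 0


class ShelfManager:
    """Toy shelf manager: identify boards, power cycle, monitor temperatures."""

    def __init__(self, boards, max_temp_c=70.0):  # assumed limit
        self.boards = {b.slot: b for b in boards}
        self.max_temp_c = max_temp_c

    def identify(self):
        """Return the occupied slot numbers in order."""
        return sorted(self.boards)

    def power_cycle(self, slot):
        """Power one board off and on again without touching the others."""
        board = self.boards[slot]
        board.powered = False
        board.powered = True
        board.power_cycles += 1

    def overheated(self):
        """Return slots whose temperature exceeds the limit."""
        return [s for s, b in sorted(self.boards.items())
                if b.temp_c > self.max_temp_c]


if __name__ == "__main__":
    shelf = ShelfManager([Board(1, 45.0), Board(2, 82.5), Board(3, 50.0)])
    for slot in shelf.overheated():
        shelf.power_cycle(slot)
    print(shelf.identify(), shelf.boards[2].power_cycles)
```

A services-tier process could call such an interface remotely to troubleshoot a crate without anyone entering the tunnel.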
27
SAF Availability Management Framework
A simple example of software component runtime
lifecycle management
Diagram labels
AMF logical entities: Node U, Node V, Service Group, Service Unit, Component, Service Instance
Service Unit administrative states: Locked-Instantiation, Locked, Unlocked, Shutting down
Other labels: active, standby
1. Service unit starts out un-instantiated.
2. State changed to locked, meaning software is
instantiated on node, but not assigned work.
3. State changed to unlocked, meaning software is
assigned work (Service Instance).
A Service Instance is the work assigned to a Service Unit.
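
The three lifecycle steps above can be sketched as a small state machine. The state names follow the slide; the operation names and the exact set of allowed transitions are assumptions modeled loosely on the AMF administrative operations, and the class is not an implementation of the SAF API.

```python
from enum import Enum


class AdminState(Enum):
    LOCKED_INSTANTIATION = "locked-instantiation"  # not instantiated
    LOCKED = "locked"                              # instantiated, no work
    UNLOCKED = "unlocked"                          # may be assigned work
    SHUTTING_DOWN = "shutting-down"


# (current state, operation) -> next state; assumed transition table
TRANSITIONS = {
    (AdminState.LOCKED_INSTANTIATION, "unlock_instantiation"): AdminState.LOCKED,
    (AdminState.LOCKED, "unlock"): AdminState.UNLOCKED,
    (AdminState.UNLOCKED, "shutdown"): AdminState.SHUTTING_DOWN,
    (AdminState.UNLOCKED, "lock"): AdminState.LOCKED,
    (AdminState.SHUTTING_DOWN, "lock"): AdminState.LOCKED,
    (AdminState.LOCKED, "lock_instantiation"): AdminState.LOCKED_INSTANTIATION,
}


class ServiceUnit:
    def __init__(self, name):
        self.name = name
        self.state = AdminState.LOCKED_INSTANTIATION  # step 1

    def apply(self, operation):
        """Apply an administrative operation or raise if it is not allowed."""
        key = (self.state, operation)
        if key not in TRANSITIONS:
            raise ValueError(
                f"{operation!r} not allowed in state {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state

    @property
    def can_take_work(self):
        return self.state is AdminState.UNLOCKED


if __name__ == "__main__":
    su = ServiceUnit("example-unit")          # hypothetical name
    su.apply("unlock_instantiation")          # step 2: locked
    su.apply("unlock")                        # step 3: unlocked
    print(su.state.value, su.can_take_work)
```

Making the transitions explicit is what lets a framework know exactly what software is running where and refuse operations that would leave a unit in an undefined state.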
28
SAF (Service Availability Forum) Specifications
Application Interface Specification
HA Applications
Other Middleware and Application Services
HPI Middleware
AIS Middleware
Carrier Grade Operating System
Managed Hardware Platform
Hardware Platform Interface
Diagram courtesy of Service Availability Forum
29
SAF Availability Management Framework

AMF: Availability Management Framework
Manages software runtime lifecycle, fault
reporting, failover policies, etc.
Works in combination with a collection of
well-defined services to provide a powerful
environment for application software components.
CLM: Cluster Membership Service
LOG: Log Service
CKPT: Checkpoint Service
EVT: Event Service
LCK: Lock Service
More
An open standard from telecom industry geared
towards supporting a highly available, highly
distributed system.
Potential application to critical core control
system software such as IOCs, device servers,
gateways, nameservers, data reduction, etc.
Know exactly what software is running where.
Be able to gracefully restart components, or
manage state while hot-swapping underlying
hardware.
Uniform diagnostics to troubleshoot problems.

30
An HA software framework is just the start

SAF (Service Availability Forum) implementations
won't solve the HA problem
You still have to determine what you want to do
and encode it in the framework; this is where the
work lies
What are the failures?
How to identify a failure
How to compensate (failover, adaptation,
hot-swap)
Is the resultant software complexity manageable?
Potential fix worse than the problem
Always evaluate: am I actually improving
availability?
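
One way to make the last question concrete: a redundant pair only helps if the failover mechanism that manages it is itself reliable enough. The model below (a failover manager in series with two independent units, one of which must be up) and its numbers are assumptions for illustration.

```python
def redundant_pair(unit_availability: float, failover_availability: float) -> float:
    """Availability of two redundant units behind a shared failover manager."""
    pair = 1.0 - (1.0 - unit_availability) ** 2
    return failover_availability * pair


def improves(unit_availability: float, failover_availability: float) -> bool:
    """True if the redundant design beats a single unit with no failover."""
    return redundant_pair(unit_availability, failover_availability) > unit_availability


if __name__ == "__main__":
    unit = 0.99
    for failover in (0.999, 0.98):
        print(failover, redundant_pair(unit, failover), improves(unit, failover))
```

Under these assumptions a failover layer that is less available than the unit it protects makes the system worse, which is the sense in which a fix can be worse than the problem.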

31
R&D Engineering Design (EDR) Phase

Main focus of R&D efforts is on high
availability
Gain experience with high availability tools and
techniques to be able to make value-based
judgments of cost versus benefit.
Four broad categories
Control system failure mode analysis
High-availability electronics platforms (ATCA)
High-availability integrated control systems
Conflict avoidance, failover, model-based
resource monitoring.
Control System as a tool for implementing
system-level HA
Fault detection methods, failure modes and effects

32
HA means doing things differently

ILC must apply techniques not typically used at
an accelerator, particularly in software
Development culture must be different this time.
Cannot build ad-hoc with in-situ testing.
Build modeling, simulation, testing, and
monitoring into hardware and software methodology
up front.
Reliable hardware
Instrumentation electronics to servers and disks.
Redundancy where feasible, otherwise adapt in
software.
Modeling and simulation (T. Himel).
Reliable software
Equally important.
Software has many more internal states, which
makes it difficult to predict.
Modeling and simulation needed here for
networking and software.