Title: Deploying a High Throughput Computing Cluster
1Deploying a High Throughput Computing Cluster
- Jim Basney and Miron Livny
- Presented by
- Vishal Singh
2Seminar Overview
- I Introduction
- Primary Goal of Condor
- Condor Overview
- II Challenges of deploying an HTC environment
- Layered Software Architecture
- Protocol flexibility
- Remote file access
- Checkpointing
- III System administration of an HTC environment
- Access policies
- Reliability
- System log file management
- Security
- IV Summary
3Goals
- Globus: The Globus project is developing
fundamental technologies needed to build
computational grids. Grids are persistent
environments that enable software applications to
integrate instruments, displays, computational
and information resources that are managed by
diverse organizations in widespread locations.
- Condor: The goal of the Condor project is to
develop, implement, deploy, and evaluate
mechanisms and policies that support High
Throughput Computing (HTC) on large collections
of distributively owned computing resources.
4Condor Overview
- Three entities
- Customer Agent: Manages a queue of
application descriptions and sends resource
requests to the matchmaker.
- Resource Agent: Implements the policies of the
resource owner and sends resource offers to the
matchmaker.
- Matchmaker: Finds a match between the resource
requests and the resource offers and notifies the
agents when a match is found.
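The interaction between the three entities can be sketched roughly as follows. This is a minimal illustration, not Condor's actual data model or matching algorithm: the `Request` and `Offer` classes, the attribute names (`memory_mb`, `arch`, `owner`), and the first-fit strategy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Attrs = Dict[str, object]

@dataclass
class Request:
    """Resource request sent by a customer agent (hypothetical structure)."""
    job_id: str
    attrs: Attrs
    requirements: Callable[[Attrs], bool]  # what the job needs from a resource

@dataclass
class Offer:
    """Resource offer sent by a resource agent (hypothetical structure)."""
    machine: str
    attrs: Attrs
    requirements: Callable[[Attrs], bool]  # owner policy applied to the job

def match(requests: List[Request], offers: List[Offer]) -> List[Tuple[str, str]]:
    """Pair each request with the first free offer where both sides agree."""
    free = list(offers)
    matches = []
    for req in requests:
        for off in free:
            if req.requirements(off.attrs) and off.requirements(req.attrs):
                matches.append((req.job_id, off.machine))
                free.remove(off)  # a matched resource is no longer on offer
                break
    return matches

requests = [
    Request("job1", {"owner": "alice"}, lambda m: m["memory_mb"] >= 256),
    Request("job2", {"owner": "bob"}, lambda m: m["arch"] == "sparc"),
]
offers = [
    Offer("ws1", {"memory_mb": 128, "arch": "x86"}, lambda j: True),
    Offer("ws2", {"memory_mb": 512, "arch": "x86"}, lambda j: j["owner"] != "bob"),
]
print(match(requests, offers))
```

A match requires agreement from both sides: the job's requirements must hold for the resource, and the owner's policy must hold for the job. In this example job2 stays in the queue because no offered workstation has the architecture it asks for.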
5Four Primary Challenges
- Utilization of heterogeneous resources
- Evolution of network protocols
- Remote access to data files
- Utilization of non-dedicated resources
6Layered Software Architecture
Reason: Portability of the HTC system
- Network API provides both connection-oriented
and connectionless, reliable and unreliable interfaces.
- Process management API provides the ability to
create, suspend, unsuspend, and kill a process.
- Workstation statistics API reports the
information necessary to
1. implement the resource owner policies
2. validate customer application
requirements.
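As an illustration of the layering idea, the process management API above could be wrapped as in the sketch below. It assumes a POSIX system (it relies on `SIGSTOP` and `SIGCONT`); the `ProcessManager` class and its method names are hypothetical and only mirror the four operations listed on the slide. The rest of the HTC system would call this class and never touch operating-system calls directly, so porting means rewriting only this layer.

```python
import signal
import subprocess
import sys
from typing import List

class ProcessManager:
    """Hypothetical portability layer: only this class uses OS-specific calls."""

    def create(self, argv: List[str]) -> subprocess.Popen:
        # Start the customer's application as a child process.
        return subprocess.Popen(argv)

    def suspend(self, proc: subprocess.Popen) -> None:
        # POSIX-specific: stop the process without terminating it.
        proc.send_signal(signal.SIGSTOP)

    def unsuspend(self, proc: subprocess.Popen) -> None:
        # POSIX-specific: let a stopped process continue.
        proc.send_signal(signal.SIGCONT)

    def kill(self, proc: subprocess.Popen) -> int:
        # Terminate the process and return its exit status.
        proc.kill()
        return proc.wait()

pm = ProcessManager()
job = pm.create([sys.executable, "-c", "import time; time.sleep(60)"])
pm.suspend(job)      # e.g. the workstation owner has returned
pm.unsuspend(job)    # e.g. the workstation is idle again
exit_status = pm.kill(job)
```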
7Layered resource management architecture: Condor
8PROTOCOL FLEXIBILITY
Why?
It is inconvenient to frequently update components in an
HTC environment, so new features are not deployed until a
future major system upgrade.
A general-purpose data format may help
Example of protocol data format
Backward compatibility is ensured.
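One way a general-purpose data format can give backward compatibility is sketched below. JSON is used here purely as a stand-in for a self-describing attribute/value format; the slides do not say which format is used, and the attribute names (`machine`, `memory_mb`, `disk_mb`) are assumptions. The point is that a receiver reads only the attributes it knows, ignores the rest, and supplies defaults for missing ones, so old and new components can keep talking to each other.

```python
import json

def encode(message: dict) -> bytes:
    """Serialize a message as self-describing attribute/value pairs."""
    return json.dumps(message).encode("utf-8")

def decode_offer(data: bytes) -> dict:
    """Older receiver: reads the attributes it knows and ignores the rest."""
    msg = json.loads(data.decode("utf-8"))
    return {
        "machine": msg["machine"],
        # Attribute missing from very old senders: fall back to a default.
        "memory_mb": msg.get("memory_mb", 0),
    }

# A newer sender adds "disk_mb"; the older receiver is unaffected.
new_sender_msg = encode({"machine": "ws1", "memory_mb": 512, "disk_mb": 2048})
# An older sender omits "memory_mb"; the receiver substitutes the default.
old_sender_msg = encode({"machine": "ws2"})

print(decode_offer(new_sender_msg))
print(decode_offer(old_sender_msg))
```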
9Remote File Access
- Guarantees an HTC application access to data
files from any workstation in the cluster.
Three implementation options
- Requires authentication of the customer application to
the file system.
- Privileges need to be assigned.
- Large data files result in high start-up and
tear-down costs.
10Remote File Access (cont.)
Redirect file I/O system calls
HTC environment must interpose itself
between application and operating system and
service file system calls.
System call interposition
How?
- Linking the application with an interposition
library or trapping system calls through the operating system
- HTC environment invokes an RPC
Benefits
No file system requirements on remote station
Drawbacks
- Many high latency operations reduce performance
of application.
- Developing and maintaining a portable
interposition system is difficult.
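The interposition idea can be sketched as below. This is a simplified illustration, not how any real interposition library is built: the RPC is simulated by a direct function call, and `rpc_read`, `InterposedFile`, `interposed_open`, and the file path are hypothetical names. In a real system the application would be linked against replacements for the system-call wrappers, and `rpc_read` would cross the network to the machine that holds the data.

```python
from typing import Dict

# Files that exist only on the submitting machine (simulated in memory).
SUBMIT_MACHINE_FILES: Dict[str, bytes] = {"/home/alice/input.dat": b"0123456789"}

def rpc_read(path: str, offset: int, size: int) -> bytes:
    """Stand-in for a remote procedure call served on the submit machine."""
    return SUBMIT_MACHINE_FILES[path][offset:offset + size]

class InterposedFile:
    """File-like object whose reads are redirected to the RPC stub."""

    def __init__(self, path: str):
        self.path = path
        self.offset = 0

    def read(self, size: int) -> bytes:
        # Every read is one round trip, which is where the latency cost arises.
        data = rpc_read(self.path, self.offset, size)
        self.offset += len(data)
        return data

def interposed_open(path: str) -> InterposedFile:
    """Replacement for open() that the application would be linked against."""
    return InterposedFile(path)

f = interposed_open("/home/alice/input.dat")
first = f.read(4)
second = f.read(4)
print(first, second)
```

The remote workstation needs no access to the customer's file system, which is the benefit listed above; the cost is that each small read becomes a network round trip.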
11Checkpointing
A snapshot of the state of an executing program.
Uses
- Enable preemptive-resume scheduling
Can be
- kernel-level checkpointing
- Often not provided by workstation
operating systems.
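Preemptive-resume scheduling can be illustrated with the toy sketch below. It is an assumption-laden simplification: the program saves its own state explicitly with `pickle`, which is application-level state saving rather than a snapshot of a whole process as in kernel-level checkpointing. The function names and the checkpoint file name are hypothetical.

```python
import os
import pickle
import tempfile

def run(state: dict, steps: int) -> dict:
    """Advance the computation by `steps` iterations (a running sum)."""
    for _ in range(steps):
        state["total"] += state["i"]
        state["i"] += 1
    return state

def checkpoint(state: dict, path: str) -> None:
    """Write a snapshot of the program state to stable storage."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restart(path: str) -> dict:
    """Load the most recent snapshot."""
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")

state = run({"i": 0, "total": 0}, 5)   # runs until the owner reclaims the machine
checkpoint(state, ckpt)                # snapshot taken before preemption
resumed = run(restart(ckpt), 5)        # resumed later, possibly on another workstation

uninterrupted = run({"i": 0, "total": 0}, 10)
print(resumed, uninterrupted)
```

The resumed run ends in the same state as a run that was never interrupted, so no work done before the preemption is lost.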
12Progress
- I Introduction
- Primary Goal of Condor
- Condor Overview
- II Challenges of deploying an HTC environment
- Layered Software Architecture
- Protocol flexibility
- Remote file access
- Checkpointing
- III System administration of an HTC environment
- Access policies
- Reliability
- System log file management
- Security
- IV Summary
13System Administration
The administrator has to answer to:
- Resource owners
- Must enforce the access policies of resource owners.
- Customers
- Must ensure customers receive valuable services from the HTC
environment.
- Policy makers
- Must demonstrate that the HTC environment is meeting the
stated goals.
14Access policies
- Answer the question of who can use a resource
and when.
One method of policy specification is through
expressions
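A policy written as expressions might look like the sketch below. The expressions are ordinary Python functions over a dictionary of workstation statistics; the statistic names (`keyboard_idle_s`, `load_avg`) and the thresholds are hypothetical and do not reproduce any real policy language or syntax.

```python
from typing import Dict

Stats = Dict[str, float]

def start_policy(s: Stats) -> bool:
    """Hypothetical owner policy: start a job only on an idle workstation."""
    return s["keyboard_idle_s"] > 15 * 60 and s["load_avg"] < 0.3

def suspend_policy(s: Stats) -> bool:
    """Hypothetical owner policy: suspend the job when the owner returns."""
    return s["keyboard_idle_s"] < 60

def decide(stats: Stats, job_running: bool) -> str:
    """Evaluate the owner's expressions against current workstation statistics."""
    if job_running:
        return "suspend" if suspend_policy(stats) else "continue"
    return "start" if start_policy(stats) else "wait"

print(decide({"keyboard_idle_s": 1200, "load_avg": 0.1}, job_running=False))
print(decide({"keyboard_idle_s": 5, "load_avg": 0.1}, job_running=True))
```

Keeping the policy in expressions rather than in code paths lets each resource owner state who may use the workstation and when without changing the HTC software itself.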
15Access policies (cont.)
- Can be optimized for throughput
- E.g.
- For low-bandwidth networks a longer Vacate
interval may be negotiated.
- Vacate need not be attempted when the chances of a
successful checkpoint are low.
- Administrator may steer matchmaking to utilize
resources efficiently when network bandwidth is limited.
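The trade-off between checkpoint size, network bandwidth, and the Vacate interval comes down to simple arithmetic, sketched below. The function names and the example figures (a 64 MB checkpoint, 10 and 2 Mbit/s links) are illustrative assumptions, and the estimate ignores protocol overhead and competing traffic.

```python
def checkpoint_transfer_seconds(checkpoint_mb: float, bandwidth_mbit_s: float) -> float:
    """Estimated time to write a checkpoint over the network (megabytes to megabits)."""
    return checkpoint_mb * 8 / bandwidth_mbit_s

def should_attempt_vacate(checkpoint_mb: float, bandwidth_mbit_s: float,
                          vacate_interval_s: float) -> bool:
    """Attempt a checkpoint only if it is expected to finish within the interval."""
    return checkpoint_transfer_seconds(checkpoint_mb, bandwidth_mbit_s) <= vacate_interval_s

# 64 MB checkpoint, 10 Mbit/s network, 60 s interval: the checkpoint fits.
print(should_attempt_vacate(64, 10, 60))
# Same checkpoint on a 2 Mbit/s network needs 256 s: skip it, or negotiate
# a longer interval such as 300 s.
print(should_attempt_vacate(64, 2, 60))
print(should_attempt_vacate(64, 2, 300))
```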
16Reliability
Complications
- Distinguish between normal and abnormal
terminations
- Choose the correct checkpoint to use for restart
- Decide when it is safe to restart the application
- Problem of one bad node in the HTC environment
Heuristically determine
- if an application fails consistently on different
nodes
- if different applications fail on the same node
Implication: The HTC environment must be prepared for failures and must
automate failure recovery for common failures.
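The two heuristics above can be sketched as a small classifier over a failure history. The `diagnose` function, the threshold of three, and the application and node names are hypothetical; a real system would also need to weigh successes, time windows, and the kind of failure.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

def diagnose(failures: Iterable[Tuple[str, str]],
             threshold: int = 3) -> Tuple[Set[str], Set[str]]:
    """Classify suspects from (application, node) abnormal-termination records.

    An application that failed on `threshold` or more distinct nodes is
    suspected to be faulty itself; a node on which `threshold` or more
    distinct applications failed is suspected to be a bad node.
    """
    nodes_per_app = defaultdict(set)
    apps_per_node = defaultdict(set)
    for app, node in failures:
        nodes_per_app[app].add(node)
        apps_per_node[node].add(app)
    bad_apps = {a for a, nodes in nodes_per_app.items() if len(nodes) >= threshold}
    bad_nodes = {n for n, apps in apps_per_node.items() if len(apps) >= threshold}
    return bad_apps, bad_nodes

history = [
    ("sim", "n1"), ("sim", "n2"), ("sim", "n3"),          # one app, many nodes
    ("render", "n7"), ("fit", "n7"), ("align", "n7"),     # many apps, one node
]
bad_apps, bad_nodes = diagnose(history)
print(bad_apps, bad_nodes)
```

With such a classification the environment could stop restarting a faulty application and stop matching jobs to a bad node, instead of repeatedly losing work.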
17Problem Diagnosis via System Logs
System logs are the primary tools for diagnosing
system failures.
HTC Environment Logs
18Monitoring and Accounting
HTC environment provides system monitoring and
accounting facilities to the administrator
Observations
1. Approximately 100 resources
were added to the cluster during the month.
2. Resource availability followed a daily cyclic
pattern, where more resources were available for
HTC during the night.
3. On average, more resources were available on
weekends compared to weekdays.
19Security
- An HTC environment is potentially vulnerable to
- Resource Attack: An unauthorized user gains
access to a resource via the HTC environment, or
an authorized user violates the resource owner's
access policy.
- Customer Attack: A customer's account or data
files are compromised via the HTC environment.
Steps to be taken
- Protecting the resources requires an effective
user authentication mechanism.
- The HTC environment must ensure that all
resource agents are trustworthy
- Unencrypted network streams and buffer-overflow
attacks are potential vulnerabilities.
20Summary
- The HTC software must be portable, reliable,
and maintainable.
- Layered architecture with flexible network
protocols provides such a framework.
- Remote file access and checkpointing allow HTC
to utilize distributively owned, non-dedicated resources.
- Development and maintenance costs must be
balanced.
- The HTC software must provide secure services
with effective logging.
21Conclusion
Deploying an HTC environment means efficiently
managing all the complexities described for all
three entities: resource owners, customers, and
policy makers. It is not exotic scheduling
algorithms and mechanisms that make an HTC
environment successful, but an emphasis on
usability, flexibility, reliability, and
maintainability.
Web site
Condor website: http://www.cs.wisc.edu/condor