First Steps with Grid Computing & Oracle Application Server 10g

Transcript and Presenter's Notes

Title: First Steps with Grid Computing


1
(No Transcript)
2
First Steps with Grid Computing & Oracle
Application Server 10g
Session ID: 40187
  • Venkata Ravipati, Product Manager, Oracle
    Corporation
  • Sastry Malladi, CMTS, Oracle Corporation
  • Jamie Shiers, IT Division, CERN
    (Jamie.Shiers@cern.ch)
3
Agenda
  • Introduction: Grid Computing
  • OracleAS 10g Features
  • CERN Case Study
  • OracleAS 10g Roadmap
  • Q&A

Introduction: Grid Computing
4
IT Challenges
  • Enterprise IT is highly fragmented, leading to
    poor utilization, excess capacity, and system
    inflexibility
  • Adding capacity is complex and labor-intensive
  • Systems are fragmented into inflexible islands
  • Expensive server capacity sits underutilized
  • Installing, configuring, and managing application
    infrastructure is slow and expensive
  • Poorly integrated applications with redundant
    functionality increase costs and limit business
    responsiveness

5
Grid Computing Solves IT Problems
IT Problem → Grid Solution
  • High cost of adding capacity → pool modular,
    low-cost hardware components
  • Islands of inflexible systems → virtualize
    system resources
  • Underutilized server capacity → dynamically
    allocate workloads and information
  • Hard to configure and manage → unify management
    and automate provisioning
  • Poorly integrated applications with redundant
    functions → compose applications from reusable
    services

6
What is Grid Computing?
  • Grid computing is a hardware and software
    infrastructure that enables
  • Transparent resource sharing across an
    enterprise: divisions, data centers, resources
  • Resource categories: computers, storage,
    databases, application servers, applications
  • Coordination of resources that are not subject
    to centralized control
  • Using standard, open, general-purpose protocols
    and interfaces
  • To deliver nontrivial qualities of service

7
Enterprise Grid Infrastructure Must Be
Comprehensive
  • Middleware
  • Management
  • Database
  • Storage
8
Agenda
  • Introduction: Grid Computing
  • OracleAS 10g Features
  • CERN Case Study
  • OracleAS 10g Roadmap
  • Q&A

OracleAS 10g Features
9
Introducing Oracle 10g
  • Complete, integrated grid infrastructure

10
Oracle Application Server 10g
Workload Management
11
Workload Management
IT Problem:
  • Adding and allocating computing capacity is
    expensive and too slow to adapt to changing
    business requirements

Oracle 10g Solution:
  • Virtualize servers as modular HW resources
  • Virtualize software as reusable run-time services
  • Manage workloads automatically based on
    pre-defined policies

12
Virtualized Hardware Resources
Add Capacity Quickly and Economically
13
Virtualized Middleware Services
Group collections of resources and runtime
services into logical applications, e.g. an
Accounting Application
14
Policy-based Workload Management

Workload Manager components (a sketch follows):
  • Dispatcher / Scheduler: distributes workloads
    based on application-specific policies
  • Policy Manager: stores application-specific
    policies
  • Resource Manager: manages resource availability
    and status
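
To make the division of labor concrete, here is a minimal Java sketch. None of these class names are Oracle APIs; it only illustrates the idea of dispatching work to the least-loaded node that an application-specific policy still permits.

```java
import java.util.*;

// Hypothetical sketch of policy-based dispatch; not Oracle's implementation.
public class WorkloadManager {

    // Resource Manager's view of a node: availability and current load.
    static class Node {
        final String host;
        int activeRequests;
        boolean up = true;
        Node(String host) { this.host = host; }
    }

    // Policy Manager's stored rule, here a simple per-node request cap.
    record Policy(String application, int maxRequestsPerNode) {}

    private final Map<String, Policy> policies = new HashMap<>();
    private final List<Node> nodes = new ArrayList<>();

    void register(Node n) { nodes.add(n); }
    void store(Policy p)  { policies.put(p.application(), p); }

    // Dispatcher/Scheduler: pick the least-loaded node the application's
    // policy still allows; null would signal "add capacity to the pool".
    Node dispatch(String application) {
        Policy p = policies.get(application);
        return nodes.stream()
                .filter(n -> n.up && n.activeRequests < p.maxRequestsPerNode())
                .min(Comparator.comparingInt(n -> n.activeRequests))
                .orElse(null);
    }

    public static void main(String[] args) {
        WorkloadManager wm = new WorkloadManager();
        wm.register(new Node("app01"));
        wm.register(new Node("app02"));
        wm.store(new Policy("WebStore", 100));
        // Picks app01 (least loaded; first node wins on ties).
        System.out.println(wm.dispatch("WebStore").host);
    }
}
```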
15
Middleware Services
  • HTTP servers
  • Web caches
  • J2EE servers
  • EJB processes
  • Portal services
  • Wireless services
  • Web services
  • Integration services
  • Directory services
  • Authentication services
  • Authorization services
  • Enterprise Reporting services
  • Query Analysis services

16
Metrics-based Workload Reallocation
  • Employee Portal: Portal
  • Accounting: Discoverer, Reports
  • Web Store: HTTP, J2EE Server

Unexpected demand! → shift more capacity to the
Web Store
17
Scheduled Workload Reallocation
[Diagram: capacity shared between General Ledger
and Order Entry shifts between the start and the
end of the quarter]
18
Policy-based Edge Caching
  • Virtualized pools of storage enable sharing and
    transfer of data between nodes
  • Adaptive caching policies flexibly accommodate
    changing demand (sketch below)

[Diagram: clients reach a virtual HTTP server
backed by a pool of grid caches]
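
As an illustration of an adaptive caching policy (not OracleAS Web Cache's actual configuration model), here is a small Java sketch in which the cache stretches its TTL under heavy demand to absorb load at the edge, and shrinks it when demand drops so content stays fresh. All names are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a policy-driven edge cache.
public class EdgeCache {
    record Entry(String body, long expiresAtMillis) {}

    private final Map<String, Entry> store = new ConcurrentHashMap<>();
    private volatile long ttlMillis;   // the adaptive "policy" knob

    EdgeCache(long initialTtlMillis) { this.ttlMillis = initialTtlMillis; }

    // Adaptive policy: under heavy demand, serve longer from the edge to
    // offload the origin; under light demand, keep content fresher.
    void adjustForDemand(double requestsPerSecond) {
        ttlMillis = requestsPerSecond > 1000 ? 300_000 : 30_000;
    }

    String get(String url, java.util.function.Function<String, String> origin) {
        Entry e = store.get(url);
        if (e != null && e.expiresAtMillis() > System.currentTimeMillis())
            return e.body();                          // cache hit
        String body = origin.apply(url);              // miss: fetch from origin
        store.put(url, new Entry(body, System.currentTimeMillis() + ttlMillis));
        return body;
    }

    public static void main(String[] args) {
        EdgeCache cache = new EdgeCache(30_000);
        cache.adjustForDemand(1500);                  // heavy demand: longer TTL
        System.out.println(cache.get("/index.html", u -> "origin copy"));
        System.out.println(cache.get("/index.html", u -> "not called: hit"));
    }
}
```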
19
Oracle Application Server 10g
20
Software Provisioning
IT Problem:
  • Installing, configuring, upgrading, and patching
    systems is labor-intensive and too slow to adapt
    to changing business requirements

Oracle 10g Solution:
  • Manage virtualized HW and SW resources as one
    system
  • Automate installation, configuration, upgrading,
    and patching processes

21
Software Provisioning
  • Grid Control Repository (GCR) with centralized
    inventories for installation and configuration
  • Provision servers
  • Provision software
  • Provision users

22
Automated Deployment
  • Install and configure a single server node
  • Register configuration to the Repository
  • Automatically deploy to nodes as they are added
    to the grid

Grid Control Repository
23
Software Cloning
  • Automated provisioning based on a master node
  • Archive & replicate specific configurations
  • e.g. a Payroll config. optimized for Fridays at
    4:00pm
  • Context-specific adjustments
  • e.g. IP address, host name, web listener

  1. Select software and instances to clone
  2. Clone to selected targets
  3. Update the configuration inventory in GCR
24
Patch and Update Management
  • Real-time discovery of new patches
  • Automated staging and application of patches
  • Rolling application upgrades
  • Patch history tracking

25
Oracle Application Server 10g
26
User Provisioning
IT Problem:
  • It takes too long to register new users
  • Users have too many accounts, passwords, and
    privileges to manage
  • Developers re-implement authentication for each
    new application

Oracle 10g Solution:
  • Centralized identity management
  • Shared authentication service

27
Single Sign-on Across the Grid
  • Consolidate accounts
  • Simplify management
  • Facilitate re-use

[Diagram: Accounting, Sales Portal, and Support
Portal applications all authenticating against one
central Directory]
28
User Provisioning
  • Create users once (see the sketch after this
    list)
  • Centrally manage roles, privileges, preferences
  • Support single password for all applications
  • Delegate administration
  • Locally administered departments, LOBs, etc.
  • User self-service
  • Interoperate with existing security infrastructure
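
As a sketch of "create users once", the following uses the standard JNDI API to add a single entry to a central LDAP directory (such as Oracle Internet Directory). The host, credentials, and directory layout are placeholders, not a real deployment.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.*;

// Sketch: provision a user once in the central directory; every
// SSO-enabled application then authenticates against that entry
// instead of maintaining a local account.
public class ProvisionUser {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://oid.example.com:389"); // placeholder
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "cn=orcladmin");         // placeholder
        env.put(Context.SECURITY_CREDENTIALS, "secret");             // placeholder

        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = new BasicAttributes(true);
            Attribute oc = new BasicAttribute("objectClass");
            oc.add("inetOrgPerson");
            attrs.put(oc);
            attrs.put("cn", "Jane Doe");
            attrs.put("sn", "Doe");
            attrs.put("uid", "jdoe");
            attrs.put("userPassword", "initial-password");
            // Placeholder DIT layout.
            ctx.createSubcontext("uid=jdoe,ou=People,dc=example,dc=com", attrs);
        } finally {
            ctx.close();
        }
    }
}
```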

29
Oracle Application Server 10g
30
Application Availability
IT Problem:
  • Ensuring required levels of availability is too
    expensive

Oracle 10g Solution:
  • Modular components provide inexpensive redundancy
  • Coordinated response to system failures ensures
    application availability

31
Application Availability
  • Transparent Application Failover (TAF):
    automatic session migration
  • Fast-Start Fault Recovery: automatic failure
    detection and recovery
  • Multi-tier Failover Notification (FaN): speeds
    end-to-end application failover time from 15
    minutes to < 15 seconds

32
Transparent Application Failover
  • Employee Portal: Portal
  • Accounting: Discoverer, Reports
  • Web Store: HTTP, J2EE Server

Resource failure! → fail over the service to
additional nodes
33
Fast-Start Fault Recovery
  • Employee Portal: Portal
  • Accounting: Discoverer, Reports
  • Web Store: HTTP, J2EE Server

Nodes recovered → reinstated automatically
34
Multi-tier Failover Notification (FaN)
  • Overcomes TCP/IP timeout delays associated with
    cross-tier application failovers

                 RAC Failover   AS Detection   Total Downtime
   Without FaN   < 8 secs       15 mins        > 15 mins
   With FaN      < 8 secs       < 4 secs       < 12 secs
35
Oracle Application Server 10g
36
Application Monitoring
IT Problem:
  • Insufficient performance data to plan, tune, and
    manage systems effectively

Oracle 10g Solution:
  • Software pre-instrumented to provide status and
    fine-grained performance data
  • Centralized console analyzes and summarizes Grid
    performance

37
Application Monitoring
  • Monitor virtual application resources
  • e.g. J2EE containers, HTTP servers, Web caches,
    firewalls, routers, software components, etc.
  • Root cause diagnostics
  • Track real-time and historic performance metrics
  • App. availability, business transactions, end
    user perf.
  • Notifications and alerts
  • Administer service level agreements (SLAs)

38
Repository-based Management
  • Centralized repository-based management provides
    a unified view of entire infrastructure
  • Manage all your end-to-end application
    infrastructure from any device

Grid Control Repository
39
Performance Monitoring
  • Capture real-time and historical performance data
  • Analyze and tune workload policies
  • Answer questions like (a hand-rolled version of
    the first appears below):
  • How much time is being spent in just the JDBC
    part of this application?
  • What was the average response time over the past
    3, 6, and 9 months?
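
OracleAS 10g gathers such data through pre-instrumentation; purely to make the first question concrete, here is a hand-rolled Java sketch that separates time spent in the JDBC call from total request time. The connection URL, credentials, and query are placeholders, and the Oracle JDBC driver is assumed to be on the classpath.

```java
import java.sql.*;

// Sketch of the measurement the monitoring console automates:
// isolate the JDBC portion of a request from its total time.
public class JdbcTiming {
    public static void main(String[] args) throws SQLException {
        long requestStart = System.nanoTime();

        long jdbcNanos;
        try (Connection c = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db.example.com:1521/ORCL",  // placeholder
                "scott", "tiger")) {                              // placeholder
            long t0 = System.nanoTime();
            try (Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM orders")) {
                rs.next();
            }
            jdbcNanos = System.nanoTime() - t0;
        }

        long totalNanos = System.nanoTime() - requestStart;
        System.out.printf("JDBC: %.1f ms of %.1f ms total (%.0f%%)%n",
                jdbcNanos / 1e6, totalNanos / 1e6,
                100.0 * jdbcNanos / totalNanos);
    }
}
```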

40
Policy-based Alerts
  • User-specified targets, metrics, and thresholds
  • e.g. CPU utilization, user response times, etc.
  • Flexible notification methods
  • e.g. phone, e-mail, fax, SMS, etc.
  • Self-correction via pre-defined responses (a
    sketch follows this list)
  • e.g. execute a script to shut down low-priority
    jobs
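
A minimal sketch of the alert mechanism just described: a user-defined threshold on a metric, a pluggable notification method, and a pre-defined corrective response. All names are hypothetical, not Grid Control APIs.

```java
// Hypothetical sketch of a policy-based alert with self-correction.
public class AlertPolicy {
    interface Notifier { void send(String message); }   // e-mail, SMS, phone, ...

    private final String metricName;
    private final double threshold;
    private final Notifier notifier;
    private final Runnable selfCorrection;              // pre-defined response

    AlertPolicy(String metricName, double threshold,
                Notifier notifier, Runnable selfCorrection) {
        this.metricName = metricName;
        this.threshold = threshold;
        this.notifier = notifier;
        this.selfCorrection = selfCorrection;
    }

    // Called with each new metric sample.
    void observe(double value) {
        if (value > threshold) {
            notifier.send(metricName + " = " + value + " exceeds " + threshold);
            selfCorrection.run();                       // e.g. shed low-priority work
        }
    }

    public static void main(String[] args) {
        AlertPolicy cpu = new AlertPolicy("cpu.utilization", 0.90,
                msg -> System.out.println("ALERT: " + msg),  // stand-in for e-mail/SMS
                () -> System.out.println("shutting down low-priority jobs"));
        cpu.observe(0.95);                              // fires alert + correction
    }
}
```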

41
Agenda
  • Introduction: Grid Computing
  • OracleAS 10g Features
  • CERN Case Study
  • OracleAS 10g Roadmap
  • Q&A

42
LHC Computing Grid Project
  • Oracle-based Production Services for LCG 1

43
Goals
  • To offer production-quality services for LCG 1 to
    meet the requirements of forthcoming (and
    current!) data challenges
  • e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb
    CDC04
  • To provide distribution kits, scripts and
    documentation to assist other sites in offering
    production services
  • To leverage the many years' experience in running
    such services at CERN and other institutes
  • Monitoring, backup & recovery, tuning, capacity
    planning, ...
  • To understand the experiments' requirements in how
    these services should be established and extended,
    and to clarify current limitations
  • Not targeting small/medium-scale DB apps that
    need to be run and administered locally (to user)

44
What Services?
  • POOL file catalogue using EDG-RLS (also
    non-POOL!)
  • LRC & RLI services + client APIs
  • For GUID <-> PFN mappings (a lookup sketch
    follows this list)
  • ... and EDG-RMC
  • For file-level meta-data; POOL currently stores
    filetype (e.g. ROOT file), fully registered, job
    status
  • Expect also 10 items from CMS DC04; others?
  • ... plus (the service behind) the EDG Replica
    Manager client tools
  • Need to provide robustness, recovery,
    scalability, performance, ...
  • The file catalogue is a critical component of the
    Grid!
  • Job scheduling, data access, ...
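
To make the GUID <-> PFN mapping concrete, here is a JDBC sketch of what an LRC lookup amounts to. The table and column names are invented for illustration; the real EDG-RLS client API hides the backing schema behind its own calls.

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Sketch of an LRC-style lookup: given a file's GUID, return the
// physical file names (PFNs) of all locally registered replicas.
public class LrcLookup {
    static List<String> physicalFileNames(Connection c, String guid)
            throws SQLException {
        List<String> pfns = new ArrayList<>();
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT pfn FROM lrc_mapping WHERE guid = ?")) {  // hypothetical schema
            ps.setString(1, guid);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) pfns.add(rs.getString(1));
            }
        }
        return pfns;  // all local replicas registered for this GUID
    }
}
```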

45
The Supported Configuration
  • All participating sites should run
  • A Local Replica Catalogue (LRC)
  • Contains GUID <-> PFN mappings for all local files
  • A Replica Location Index (RLI) <-- independent
    of EDG deadlines
  • Allows files at other sites to be found
  • All LRCs are configured to publish to all remote
    RLIs
  • Scalability beyond O(10) sites?? (a full mesh
    needs N(N-1) publishing links, already ~90 links
    at 10 sites)
  • Hierarchical and other configurations may come
    later
  • A Replica Metadata Catalogue (RMC)
  • Not proposing a single, central RMC
  • Jobs should use the local RMC
  • Short-term: handle synchronisation across RMCs
  • In principle possible today on the POOL side
    (to be tested)
  • Long-term: middleware re-engineering?

46
Component Overview
[Diagram: four sites (CERN, CNAF, RAL, IN2P3),
each running a Storage Element, a Local Replica
Catalog, and a Replica Location Index]
47
Where should these services be run?
  • At sites that can provide supported h/w & O/S
    configurations (next slide)
  • At sites with an existing Oracle support team
  • We do not yet know whether we can make
    Oracle-based services easy enough to set up
    (surely?) and run (should be for canned apps?)
    where existing Oracle experience is not available
  • Will learn a lot from the current roll-out
  • Pros: can benefit from scripts / doc / tools etc.
  • Other sites simply re-extract the catalog subset
    from the nearest Tier1 in case of problems?
  • Need to understand use-cases and service levels

48
Requirements for Deployment
  • A farm node running Red Hat Enterprise Linux and
    Oracle9iAS
  • Runs the Java middleware for LRC, RLI etc.
  • One per VO
  • A disk server running Red Hat Enterprise Linux
    and Oracle9i
  • Data volume for LCG 1 is small (10^5 to 10^6
    entries, each < 1 KB)
  • Query / lookup rate is low (1 every 3 seconds)
  • Projection to 2008: 100 to 1000 Hz, 10^9 entries
  • Shared between all VOs at a given site
  • Site responsible for acquiring and installing h/w
    and RHEL
  • $349 for the basic edition:
    http://www.redhat.com/software/rhel/es/

49
What if?
  • DB server dies
  • No access to the catalog until a new server is
    configured & the DB restored
  • Hot standby or a clustered solution offers
    protection against the most common cases
  • Regular dump of the full catalog into an
    alternate format, e.g. POOL XML?
  • Application server dies
  • Stateless, hence a relatively simple move to a
    new host
  • Could share with another VO
  • Handled automatically with application server
    clusters
  • Data corrupted
  • Restore or switch to an alternate catalog
  • Software problems
  • Hardest to predict and protect against
  • Could cause running jobs to fail and drain batch
    queues!
  • Very careful testing, including by the
    experiments, before a move to a new version of
    the middleware (weeks, including a smallish
    production run?)
  • Need to foresee all possible problems, establish
    a recovery plan, and test!

What happens during the period when the catalog is
unavailable?
50
Backup & Recovery, Monitoring
  • Backend DB included in the standard backup scheme
  • Daily full, hourly incrementals + archive log
    allow point-in-time recovery
  • Need additional logging, plus agreement with the
    experiments to understand which point in time to
    recover to, and testing!
  • Monitoring both at box level (FIO) and at the
    DB/AS/middleware level
  • Need to ensure problems (inevitable, even if
    undesirable) are handled gracefully
  • Recovery tested regularly, by several members of
    the team
  • Need to understand expectations
  • Catalog entries guaranteed for ever?
  • Granularity of recovery?

51
Recommended Usage - Now
  • POOL jobs: recommend extracting a catalog sub-set
    prior to the job and post-cataloguing new entries
    as a separate step
  • Non-POOL jobs, e.g. EDG-RM client: at minimum,
    test the RC and implement a simple retry (see the
    sketch after this list); provide enough output in
    the job log for manual recovery if necessary
  • Perpetual retry is inappropriate if there is e.g.
    a configuration error
  • In all cases, need to foresee hiccoughs in the
    service, e.g. 1 hour, particularly during the
    ramp-up phase
  • Please provide us with examples of your usage so
    that we can ensure adequate coverage by the test
    suite!
  • A strict naming convention is essential for any
    non-trivial catalogue maintenance
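
A minimal Java sketch of the bounded "simple retry" recommended above, assuming a hypothetical registerFile call in place of the real catalog client. It gives up after a fixed number of attempts rather than retrying perpetually, and logs each failure so manual recovery remains possible.

```java
import java.util.concurrent.Callable;

// Sketch: bounded retry around a catalog operation; names are
// illustrative, not an EDG-RLS or POOL API.
public class CatalogRetry {
    static <T> T withRetry(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Log enough for manual recovery if all attempts fail.
                System.err.println("catalog attempt " + attempt + " failed: " + e);
                if (attempt < maxAttempts) Thread.sleep(pauseMillis);
            }
        }
        throw last;  // give up; the job log above allows manual re-cataloguing
    }

    public static void main(String[] args) throws Exception {
        // e.g. registering a freshly produced file with the replica catalog
        String guid = withRetry(() -> registerFile("/data/run1234.root"),
                                5, 30_000);
        System.out.println("registered as " + guid);
    }

    static String registerFile(String pfn) {
        return "guid-0001";  // stand-in for the real catalog call
    }
}
```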

52
Status
  • RLS/RLI/RMC services deployed at CERN for each
    experiment + DTEAM
  • An RLSTEST service is also available, but should
    not be used for production!
  • Distribution mechanism, including kits, scripts
    and documentation, available and well debugged
  • Only 1 outside site deployed so far (Taiwan);
    others in the pipeline:
  • FZK, RAL, FNAL, IN2P3, NIKHEF
  • We need help to define the list and priorities!
  • Actual installation is rather fast (max a few
    hours)
  • Lead time can be long
  • Assigning resources etc. can take a few weeks!
  • Plan is (still) to target first the sites with
    Oracle experience, to make scripts & doc as clear
    and smooth as possible
  • Then see if it makes sense to go further

53
Registration for Access to Oracle Kits
  • Well-known method of account registration in a
    dedicated group (OR)
  • Names will be added to a mailing list to announce
    e.g. new releases of Oracle s/w, patch sets etc.
  • Foreseeing a much more gentle roll-out than for
    previous packages
  • Initially just DBAs supporting canned apps
  • RLS backend; later potentially the conditions DB
    if appropriate
  • For simple, moderate-scale DB apps, consider use
    of the central Sun cluster, already used by all
    LHC experiments
  • Distribution kits, scripts etc. in AFS:
  • /afs/cern.ch/project/oracle/export/
  • Documentation also via the Web:
  • http://cern.ch/db/RLS/

54
Links
  • http://cern.ch/wwwdb/grid-data-management.html
  • High-level overview of the various components +
    pointers to presentations on use-cases etc.
  • http://cern.ch/wwwdb/RLS/
  • Detailed installation & configuration
    instructions
  • http://pool.cern.ch/talksandpubl.html
  • File catalog use-cases, DB requirements, many
    other talks

55
Future Possibilities
  • Investigating resilience against h/w failure
    using Application Server & Database clusters
  • AS clusters also facilitate the move of machines,
    addition of resources, optimal use of resources,
    etc.
  • DB clusters (RAC) can be combined with standby
    databases and other techniques for even greater
    robustness
  • (Greatly?) simplified deployment, monitoring and
    recovery can be expected with Oracle 10g

56
Summary
  • Addressing production-quality DB services for LCG
    1
  • Clearly work in progress, but basic elements in
    place at CERN, deployment just starting outside
  • Based on experience and knowledge of Oracle
    products, offering distribution kits,
    documentation and other tools to those sites that
    are interested
  • Need more input on requirements and priorities of
    experiments regarding production plans

57
Q & A
58
(No Transcript)