Transcript and Presenter's Notes

Title: The Data Deluge


1
The Data Deluge and the Grid
  • The Data Deluge
  • The Large Hadron Collider
  • The LHC Data Challenge
  • The Grid
  • Grid Applications
  • GridPP
  • Conclusion

Steve Lloyd, Queen Mary University of London
s.l.lloyd@qmul.ac.uk
2
The Data Deluge
  • Expect massive increases in amount of data being
    collected in several diverse fields over the next
    few years
  • Astronomy - Massive sky surveys
  • Biology - Genome databases etc.
  • Earth Observing
  • Digitisation of paper, film, tape records etc to
    create Digital Libraries, Museums . . .
  • Particle Physics - Large Hadron Collider
  • . . .
  • 1 PByte (Petabyte) = 1,000 TBytes (Terabytes) =
    1,000,000 GBytes (Gigabytes), roughly 1.4 million CDs
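As a quick illustration of those scales, a minimal Python sketch (assuming decimal units and a ~700 MB CD, which the slide does not state explicitly):

  # Rough scale of a petabyte, assuming decimal units and a ~700 MB CD.
  PETABYTE = 10**15            # bytes
  TERABYTE = 10**12
  GIGABYTE = 10**9
  CD = 700 * 10**6             # assumed CD capacity in bytes

  print(PETABYTE / TERABYTE)   # 1,000 TBytes per PByte
  print(PETABYTE / GIGABYTE)   # 1,000,000 GBytes per PByte
  print(PETABYTE / CD)         # ~1.4 million CDs per PByte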

3
Digital Sky Project
Federating new astronomical surveys: 40,000
square degrees, 1/2 trillion pixels (1 arc
second), 1 TB x multiple wavelengths, > 1
billion sources
  • Integrated catalogue and image database
  • Digital Palomar Observatory Sky Survey
  • 2 Micron All Sky Survey (2MASS)
  • NRAO VLA Sky Survey
  • VLA FIRST Radio Survey
  • Later
  • ROSAT
  • IRAS
  • Westerbork 327 MHz Survey

4
Sloan Digital Sky Survey
Survey 10,000 square degrees of Northern Sky over
5 years
  • 1 million spectra
  • positions and images of 100 million objects
  • 5 wavelength bands
  • 40 TB

5
VISTA
Visible and Infrared Survey Telescope for
Astronomy
6
Virtual Observatories
The jet in M87 observed at multiple wavelengths:
Chandra (X-ray), HST (optical), Gemini (mid-IR)
and the VLA (radio)
7
NASA's Earth Observing System
Galapagos Oil Spill
1 TB/day
8
ESA EO Facilities
A diagram of the ESA Earth Observation ground segment:
  • Missions/sensors: LANDSAT 7, TERRA/MODIS, AVHRR,
    SEAWIFS, SPOT, IRS-P3
  • Acquisition stations: KIRUNA (S) - ESRANGE,
    TROMSO (N), MATERA (I), MASPALOMAS (E),
    NEUSTRELITZ (D)
  • Historical archives and standard production
    chains, coordinated through ESRIN, deliver
    products to users
GOME analysis detected ozone thinning over
Europe, 31 Jan 2002
9
Species 2000
To enumerate all 1.7 million known species of
plants, animals, fungi and microbes on Earth
A federation of initially 18 taxonomic databases
- eventually 200 databases
10
Genomics
11
The LHC
  • The Large Hadron Collider (LHC) will be a 14 TeV
    centre-of-mass proton-proton collider operating
    in the existing 26.7 km LEP tunnel at CERN. Due
    to start operation after 2006
  • 1,232 superconducting main dipoles of 8.3 Tesla
  • 788 quadrupoles
  • 2,835 bunches of 10^11 protons per bunch, spaced
    by 25 ns

12
Particle Physics Questions
  • Need to discover (or confirm) the Higgs particle
  • Study its properties
  • Prove that Higgs couplings depend on masses
  • Other unanswered questions
  • Does Supersymmetry exist?
  • How are quarks and leptons related?
  • Why are there 3 sets of quarks and leptons?
  • What about Gravity?
  • Anything unexpected?

13
The LHC
14
The LEP/LHC Tunnel
15
LHC Experiments
  • LHC will house 4 experiments
  • ATLAS and CMS are large 'General Purpose'
    detectors designed to detect everything and
    anything
  • LHCb is a specialised experiment designed to
    study CP violation in the b quark system
  • ALICE is a dedicated Heavy Ion Physics Detector

16
Schematic View of the LHC
17
The ATLAS Experiment
  • ATLAS consists of
  • An inner tracker to measure the momentum of each
    charged particle
  • A calorimeter to measure the energies carried by
    the particles
  • A muon spectrometer to identify and measure muons
  • A huge magnet system for bending charged
    particles for momentum measurement
  • A total of > 10^8 electronic channels

18
The ATLAS Detector
19
Simulated ATLAS Higgs Event
20
LHC Event Rates
  • The LHC proton bunches collide every 25ns and
    each collision yields 20 proton proton
    interactions superimposed in the Detector i.e.
  • 40 MHz x 20 8x108 pp interactions/sec
  • The (110 GeV) Higgs cross section is 24.2pb.
  • A good channel is H ? ?? with a branching ratio
    of 0.19 and a detector acceptance 50
  • At full (1034cm-2s-1) LHC luminosity this gives
    1034 x 24.2x10-12 x 10-24 x 0.0019 x 0.5
  • 2x10-4 H ? ?? per second
  • A 2x10-4 needle in a 8x108 Haystack
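As a rough cross-check of the rate quoted above, here is the same arithmetic as a minimal Python sketch (all numbers are taken from this slide):

  # Higgs -> gamma gamma rate at full LHC luminosity, using the numbers above.
  luminosity = 1e34                  # cm^-2 s^-1
  cross_section = 24.2e-12 * 1e-24   # 24.2 pb expressed in cm^2 (1 pb = 1e-36 cm^2)
  branching_ratio = 0.0019           # H -> gamma gamma (0.19%)
  acceptance = 0.5                   # detector acceptance (50%)

  signal_rate = luminosity * cross_section * branching_ratio * acceptance
  print(signal_rate)                    # ~2.3e-4 H -> gamma gamma per second

  background_rate = 40e6 * 20           # ~8e8 pp interactions per second
  print(signal_rate / background_rate)  # ~3e-13: the needle-to-haystack ratio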

21
'Online' Data Reduction
Selecting interesting events based on
progressively more detector information:
  • Collision rate: 40 MHz (40 TB/sec)
  • Level 1 Special Hardware Trigger: 10^4 - 10^5 Hz
    (10-100 GB/sec)
  • Level 2 Embedded Processor Trigger: 10^2 - 10^3
    Hz (1-10 GB/sec)
  • Level 3 Processor Farm: 10 - 100 Hz (100-200
    MB/sec)
  • Raw Data Storage, then Offline Data
    Reconstruction
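The rejection factors implied by the stages above can be read off directly; a minimal sketch using the upper-end rates quoted on the slide (illustrative only, not a trigger implementation):

  # Event-rate reduction through the trigger chain, using the rates quoted above.
  collision_rate = 40e6    # Hz: bunch-crossing rate
  level1_out = 1e5         # Hz: upper end of the Level 1 output
  level2_out = 1e3         # Hz: upper end of the Level 2 output
  level3_out = 100         # Hz: upper end of the Level 3 output

  stages = [("Level 1", collision_rate, level1_out),
            ("Level 2", level1_out, level2_out),
            ("Level 3", level2_out, level3_out)]
  for name, rate_in, rate_out in stages:
      print(name, "rejects a factor of about", rate_in / rate_out)

  print("Overall reduction:", collision_rate / level3_out)  # ~4x10^5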
22
Offline Analysis
  • Raw Data from Detector: 1-2 MB/event at 100-400
    Hz
  • Total data per year from one experiment: 1 to 8
    PBytes (10^15 Bytes) - see the arithmetic sketch
    below
  • Data Reconstruction (digits to energy/momentum
    etc) produces Event Summary Data: 0.5 MB/event
  • Analysis Event Selection produces Analysis
    Object Data: 10 kB/event
  • Physics Analysis (on the Analysis Object Data)
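The yearly total above follows from the event size and rate; a minimal sketch of the arithmetic, assuming roughly 10^7 seconds of effective data-taking per year (a standard rule of thumb, not stated on the slide):

  # Yearly raw-data volume per experiment: event size x trigger rate x live time.
  seconds_per_year = 1e7                  # assumed effective data-taking time
  low  = 1e6 * 100 * seconds_per_year     # 1 MB/event at 100 Hz
  high = 2e6 * 400 * seconds_per_year     # 2 MB/event at 400 Hz

  print(low / 1e15, "to", high / 1e15, "PBytes per year")   # 1.0 to 8.0 PBytes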
23
Computing Resources Required
  • CPU Power (Reconstruction, Simulation, User
    Analysis etc)
  • 2 Million SpecInt95
  • (A 1 GHz PC is rated at 40 SpecInt95)
  • i.e. 50,000 of today's PCs (see the quick check
    below)
  • 'Tape' Storage
  • 20,000 TB
  • Disk Storage
  • 2,500 TB
  • Analysis carried out throughout the world by
    hundreds of Physicists
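The 50,000-PC figure above is just the ratio of the two SpecInt95 numbers; as a quick check:

  # Number of ~1 GHz PCs needed to supply the required CPU power (figures from the slide).
  required_power = 2_000_000   # SpecInt95 needed in total
  per_pc = 40                  # SpecInt95 rating of a 1 GHz PC

  print(required_power / per_pc)  # 50,000 PCs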

24
Worldwide Collaboration
CMS: 1,800 physicists, 150 institutes, 32 countries
25
Solutions
  • Centralised Solution
  • Put all resources at CERN
  • Funding agencies certainly won't place all their
    investment at CERN
  • Sociological problems
  • Distributed solution
  • exploit established computing expertise and
    infrastructure in national labs and universities
  • reduce dependence on links to CERN
  • tap additional funding sources (spin-off)
  • Is the Grid the solution?

26
What is The Grid?
  • Analogy with the Electricity Power Grid
  • Unlimited ubiquitous distributed computing
  • Transparent access to multipetabyte distributed
    databases
  • Easy to plug in
  • Complexity of infrastructure hidden

27
The Grid
  • Five emerging models
  • Distributed Computing
  • - synchronous processing
  • High-Throughput Computing
  • - asynchronous processing
  • On-Demand Computing
  • - dynamic resources
  • Data-Intensive Computing
  • - databases
  • Collaborative Computing
  • - scientists

Ian Foster and Carl Kesselman, editors, "The
Grid: Blueprint for a New Computing
Infrastructure", Morgan Kaufmann, 1999,
http://www.mkp.com/grids
28
The Grid
  • Ian Foster / Carl Kesselman
  • "A computational Grid is a hardware and
    software infrastructure that provides dependable,
    consistent, pervasive and inexpensive access to
    high-end computational capabilities."

29
The Grid
  • Dependable - Need to rely on remote equipment as
    much as the machine on your desk
  • Consistent - Machines need to communicate, so
    they need consistent environments and interfaces
  • Pervasive - The more resources that participate
    in the same system the more useful they all are
  • Inexpensive - Important for pervasiveness - i.e.
    built using commodity PCs and disks

30
The Grid
  • You simply submit your job to the 'Grid' - you
    shouldn't have to know where the data you want is
    or where the job will run. The Grid software
    (Middleware) will take care of either
  • running the job where the data is, or
  • moving the data to where there is CPU power
    available
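As a toy illustration of that decision (the sites, numbers and function below are invented for the sketch and do not correspond to any real Grid middleware), a scheduler can compare the cost of shipping the data against the cost of queueing where the data already sits:

  # Toy 'run the job at the data, or move the data to free CPUs' decision.
  # All site names, capacities and costs below are made up for illustration.
  sites = {
      "SiteA": {"has_data": True,  "free_cpus": 0,   "network_gbps": 10.0},
      "SiteB": {"has_data": False, "free_cpus": 200, "network_gbps": 2.5},
  }

  def plan_job(dataset_tb, cpu_hours):
      best = None
      for name, s in sites.items():
          # Hours to copy the dataset in, if this site does not already hold it.
          transfer_h = 0.0 if s["has_data"] else dataset_tb * 8000 / s["network_gbps"] / 3600
          # Crude estimate of how long the CPU work takes with the free CPUs here.
          queue_h = cpu_hours / max(s["free_cpus"], 1)
          total = transfer_h + queue_h
          if best is None or total < best[1]:
              best = (name, total)
      return best

  print(plan_job(dataset_tb=5, cpu_hours=1000))  # picks whichever option is cheaper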

31
The Grid for the Scientist
Putting the bottleneck back in the Scientist's
mind
32
Grid Tiers
  • For the LHC we envisage a 'Hierarchical'
    structure based on several 'Tiers' since the data
    mostly originates at one place
  • Tier-0 - CERN - the source of the data
  • Tier-1 - 10 Major Regional Centres (inc UK)
  • Tier-2 - smaller more specialised Regional
    Centres (4 in UK?)
  • Tier-3 - University Groups
  • Tier-4 - My laptop? Mobile phone?
  • Doesn't need to be hierarchical - e.g. for
    Biologists it is probably not desirable

33
Grid Services
A layered view (shown as a diagram on the slide):
  • Applications: Cosmology, Chemistry, Environment,
    Biology, Particle Physics
  • Application Toolkits: Distributed Computing,
    Data-Intensive, Collaborative, Remote
    Visualization, Problem Solving and Remote
    Instrumentation applications toolkits
  • Grid Services (Middleware):
    resource-independent and
    application-independent services -
    authentication, authorization, resource
    location, resource allocation, events,
    accounting, remote data access, information,
    policy, fault detection
  • Grid Fabric (Resources): resource-specific
    implementations of basic services - e.g.
    transport protocols, name servers,
    differentiated services, CPU schedulers, public
    key infrastructure, site accounting, directory
    service, OS bypass
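A minimal sketch of what the middleware service list above might look like as a programming interface (purely illustrative, not the API of any real Grid toolkit):

  # Illustrative interface for the middleware services named above (not a real API).
  from abc import ABC, abstractmethod

  class GridMiddleware(ABC):
      @abstractmethod
      def authenticate(self, credentials): ...          # establish who is asking
      @abstractmethod
      def authorize(self, user, action, resource): ...  # check what they may do
      @abstractmethod
      def locate_resource(self, requirements): ...      # find suitable CPUs/storage
      @abstractmethod
      def allocate_resource(self, resource, job): ...   # reserve it for the job
      @abstractmethod
      def access_data(self, logical_file_name): ...     # remote data access
      @abstractmethod
      def record_accounting(self, job, usage): ...      # who used what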
34
Problems
  • Scalability
  • Will it scale to thousands of processors,
    thousands of disks, PetaBytes of data,
    Terabits/sec of IO?
  • Wide-area distribution
  • How to distribute, replicate, cache, synchronise
    and catalogue the data? (a toy catalogue sketch
    follows this list)
  • How to balance local ownership of resources with
    the requirements of the whole?
  • Adaptability/Flexibility
  • Need to adapt to rapidly changing hardware and
    costs, new analysis methods etc.
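As a toy illustration of the cataloguing question above (all file names, sites and URLs are invented; real Grid replica catalogues are far more elaborate), the core data structure is a mapping from a logical file name to its physical copies:

  # Toy replica catalogue: logical file name -> physical replicas at different sites.
  replica_catalogue = {
      "lfn:experiment/raw/run001.dat": [
          "srm://site-a.example.org/data/run001.dat",
          "srm://site-b.example.org/tape/run001.dat",
      ],
  }

  def find_replica(lfn, preferred_site):
      """Return a copy at the preferred site if one exists, otherwise any copy."""
      replicas = replica_catalogue.get(lfn, [])
      for url in replicas:
          if preferred_site in url:
              return url
      return replicas[0] if replicas else None

  print(find_replica("lfn:experiment/raw/run001.dat", "site-b"))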

35
SETI@home
  • A distributed computing project - not really a
    Grid project
  • You pull the data from them rather than them
    submitting a job to you
  • total of 3,864,230 users
  • 564,194,228 results received
  • 1,063,104 years of cpu time
  • 1.8x10^21 floating point operations
  • 77 different cpu types
  • 100 different OS

Arecibo telescope in Puerto Rico
36
SETI@home
37
Entropia
  • Uses idle cycles on Home PCs for profit and
    non-profit projects
  • Mersenne Prime Search
  • 146,622 machines
  • 784,360,165 cpu hours
  • FightAIDS@Home
  • 13,944 Machines
  • 1,652,126 cpu hours

38
NASA Information Power Grid
  • Knit together widely distributed computing, data,
    instrumentation and human resources
  • to address complex large scale computing and data
    analysis problems

39
Collaborative Engineering
Unitary Plan Wind Tunnel
Multi-source Data Analysis
Real-time collection
Archival storage
40
Other Grid Applications
  • Distributed Supercomputing
  • Simultaneous execution across multiple
    supercomputers
  • Smart Instruments
  • Enhance the power of scientific instruments by
    providing access to data archives and online
    processing capabilities and visualisation

e.g. coupling Argonne's Photon Source to a
supercomputer
41
GridPP
http://www.gridpp.ac.uk
42
GridPP Overview
  • Provide architecture and middleware
  • Future LHC experiments
  • Running US experiments
  • Build prototype Tier-1 and Tier-2s in the UK and
    implement middleware in experiments
  • Use the Grid with simulation data
  • Use the Grid with real data
43
The Prototype UK Tier-1
  • Jan 2002: Central facilities used by all
    experiments
  • 250 CPUs (450 MHz - 1 GHz)
  • 10 TB Disk
  • 35 TB Tape in use (theoretical tape capacity 330
    TB)
  • March 2002: Extra resources for LHC and BaBar
  • 312 CPUs
  • 40 TB Disk
  • extra 36 TB of tape and three new drives

44
Conclusions
  • Enormous data challenges in the next few years.
  • The Grid is the likely solution.
  • The Web gives ubiquitous access to distributed
    information.
  • The Grid will give ubiquitous access to computing
    resources and hence knowledge.
  • Many Grid projects and testbeds are starting to
    take off.
  • GridPP is building a UK Grid for Particle
    Physicists to prepare for future LHC Data.