Transcript and Presenter's Notes

Title: EOSDIS Alternate Architecture Study


1
EOSDIS Alternate Architecture Study
  • Jim Gray
  • McKay Fellow, UC Berkeley, 1 May 1995, gray@crl.com
  • 1. Background: the problem and the proposed solution
  • 2. What California proposed
  • Co-workers:
  • Mike Stonebraker: Producer / Director / Script Writer / Propeller Head
  • Bill Farrell: Ramrod and Computer-literate DirtBag
  • Jeff Dozier: Godfather
  • Special effects:
  • Earth Science: Frank Davis, C. Roberto Mechoso, Jim Frew
  • Computer Science: Reagan Moore, Jim Gray, Joe Pasquale
  • Administration: Claire Mosher
  • Writing: Stephanie Sides
  • Prototypes: many, many people

2
What's the Problem?
  • Antarctica is melting: 77% of fresh water is liberated,
  • sea level rises 70 meters,
  • Chico and Memphis are beachfront property,
  • New York, Washington, SF, LA, London, Paris are under water.
  • Let's study it! Mission to Planet Earth
  • EOS: Earth Observing System ($17B, later $10B)
  • 50 instruments on 10 satellites, 1997-2001
  • Plus Landsat (added later)
  • EOSDIS: the EOS Data Information System
  • 3-5 MB/s raw, 30-50 MB/s processed.
  • 4 TB/day, 15 PB by year 2007 (arithmetic check below)
  • Issues:
  • How to store it?
  • How to serve it to users?
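A back-of-envelope check of those rates, using the slide's own numbers (my arithmetic, not from the deck):

$$
50\ \text{MB/s} \times 86{,}400\ \text{s/day} \approx 4.3\ \text{TB/day},
\qquad
4\ \text{TB/day} \times 365\ \text{days/yr} \times 10\ \text{yr} \approx 15\ \text{PB}.
$$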

3
What Happened?
  • 1986: Mission to Planet Earth
  • 1989: Bids from Hughes and TRW
  • 1993: Contract grant, public review;
  • customers do not want it (tape/mainframe centric)
  • 1994: Alternate Architecture study,
  • three outside teams:
  • Wyoming: Internet 20,000,000
  • Maryland: Software Engineering
  • California: DB-centric
  • and one home team: CORBA + Z39.50 + UNIX
  • 1995: Drifting in the Sequoia direction

4
The Hughes Plan
  • 8 DAACs (Distributed Active Archive Centers): the Bytes
  • (one per congressional district?)
  • N SCFs (Scientific Computation Facilities): the MIPS
  • (typically instrument or science teams)
  • Thin wires among them
  • 90% of DAAC processing is PUSH:
  • building standard data products
  • in a fixed pipeline: calibrate, grid, derive
  • Typical subscriber gets tapes or CD-ROMs
  • (standard data products)
  • One chauffeur per 10 customers (high ops costs)
  • Build everything (operations, HSM, DBMS, ...) from scratch
  • CORBA and Z39.50 are the glue.
  • Criticism: not evolvable, not open, not online, not useful.

5
What California Proposed
  • 0. Design for success: expect that millions will use the system (online).
  • 1. DBMS-centric design: automates discovery, access, management.
  • 2. Object-relational databases enable:
  • Automate access to data so that the NASA 500,
  • Global Change 10,000, and Internet 20,000,000 can use the system.
  • Cache popular results, not all results (saves 3x or more).
  • Compute on demand (saves lots of storage and CPU).
  • Emphasize pull processing rather than push processing.
  • Use parallelism to get scaleup.
  • Do batch as a data pump.
  • 3. Be smart shoppers:
  • Use COTS hardware/software (saves $400M).
  • Just-in-time acquisition (saves $400M).
  • Use workstation, not mainframe, technology (gives 10x more stuff).
  • Depreciate over 3 years (ends in 2007 with "fresh" equipment).
  • 4. 2+N node architecture:
  • 2 Super-DAACs for fault tolerance and for growth.

6
Meta-Model for Sequoia Proposal
  • Be technological optimists:
  • couldn't build it today; count on progress.
  • Ride the technology wave (not water-cooled).
  • Buy or seed, do not build.
  • Use COTS where possible.
  • Fund 2 or more COTS vendors if a needed product is missing:
  • OR DBMS
  • HSM
  • Operations.
  • Replace people with technology (OR DBMS):
  • automate data discovery, access, visualization.
  • DBMS-centric view.

7
DBMS-Centric View
  • This is a database problem (no kidding)! (A query sketch follows this slide.)
  • This is not:
  • a file-system problem (files are the wrong abstraction),
  • an RPC problem (CORBA is the wrong abstraction),
  • a Z39.50 problem (Z39.50 is a FAP).
  • This is an operations problem:
  • hierarchical storage management,
  • network management,
  • source code control,
  • client-server tools.
  • You can BUY all this stuff. Fund COTS.
  • BUILD AS LITTLE AS POSSIBLE.
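A minimal sketch of what "DBMS-centric" means in practice. The landsat_tiles table and the gc_overlaps(), region_box(), and cloud_cover() routines are hypothetical stand-ins for the proposed Global Change schema and class libraries (the deck does not define them); the point is that discovery and access collapse into one declarative query instead of file-system and RPC plumbing.

    -- Hypothetical OR-DBMS query: find recent, mostly cloud-free Landsat
    -- tiles covering a study area, instead of locating files by hand.
    SELECT t.tile_id, t.acquired, t.image
    FROM   landsat_tiles t                                      -- invented tile table
    WHERE  gc_overlaps(t.footprint,
                       region_box(-123.6, 37.2, -121.2, 38.9))  -- invented spatial GC functions
      AND  t.acquired BETWEEN DATE '1994-01-01' AND DATE '1994-12-31'
      AND  cloud_cover(t.image) < 0.10                          -- invented class-library method
    ORDER BY t.acquired;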

8
What California Proposed
  • (Agenda slide; see slide 5 for the full list.)

9
Design for Success: Expect Lots of Users
  • Expect that millions will use the system (online).
  • Three user categories:
  • NASA 500 -- funded by NASA to do science
  • Global Change 10K -- other dirt bags
  • Internet 20M -- everyone else:
  • grain speculators,
  • environmental impact reports,
  • new applications.
  • Discovery and access must be automatic.
  • Allow anyone to set up a Peer-DAAC / SCF.
  • Design for ad hoc queries, not standard data products.
  • If push is 90%, then 10% of data is read (on average).
  • That would be a failure: no one uses the data. In DSS, push is 1% or less.
  • Computation demand is 100x the Hughes estimate
  • (pull is 10x to 100x greater than push).

10
The Process Flow
  • Data arrives and is pre-processed:
  • instrument data is calibrated,
  • gridded,
  • averaged;
  • geophysical data is derived.
  • Users ask for stored data,
  • OR ask to analyze and combine data.
  • The pull/push split can be made dynamically.

11
The Software Model: Global View
  • SQL is the FAP and API.
  • Applications use it to access data.
  • It includes:
  • stored procedures
  • (so it subsumes RPC; sketch below),
  • GC class libraries.
  • Computation is data-driven.
  • Gateways for other interfaces:
  • HTTP, Z39.50, CORBA, COM.
  • TP or TP-lite manages workflow.
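A sketch of "SQL as the API", written in modern PostgreSQL syntax purely for illustration. The avhrr_grid table, the gc_region type, and the snow_index(), gc_overlaps(), and region_box() routines are invented stand-ins for Global Change class-library types and methods; the idea is that a derived product is a stored procedure any gateway can call through plain SQL, so the stored procedure doubles as the RPC.

    -- Hypothetical on-demand derived product: average snow cover of a region on a day.
    CREATE FUNCTION snow_cover(region gc_region, day date) RETURNS double precision AS $$
      SELECT avg(snow_index(g.cell))      -- invented class-library method
      FROM   avhrr_grid g                 -- invented gridded-instrument table
      WHERE  gc_overlaps(g.footprint, region)
        AND  g.observed = day;
    $$ LANGUAGE sql;

    -- Any client (HTTP gateway, Z39.50 gateway, visualization tool) invokes it with SQL:
    SELECT snow_cover(region_box(-122.5, 37.0, -119.0, 39.0), DATE '1995-03-01');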

12
Automate access to data
  • Invest in:
  • Design a global change schema;
  • cooperate with standards groups.
  • OR DBMS class libraries for GC datatypes.
  • Develop a browser to do resource discovery.
  • The community will develop access and vis tools.
  • The OR DBMS will do:
  • PUSH processing: triggers and workflow (trigger sketch below);
  • PULL processing: query optimization
  • (some assembly required).
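A sketch of push processing driven by triggers, again in modern PostgreSQL syntax for illustration only; the raw_granules and product_queue tables and the step names are invented, not from the study. An arriving raw granule queues the standard-product steps as workflow rows, which a TP or TP-lite job scheduler would then drain.

    -- Hypothetical workflow queue for standard-product generation.
    CREATE TABLE product_queue (
      granule_id bigint,
      step       text,
      queued_at  timestamp DEFAULT now()
    );

    -- Trigger function: enqueue the fixed pipeline steps for each new granule.
    CREATE FUNCTION queue_standard_products() RETURNS trigger AS $$
    BEGIN
      INSERT INTO product_queue (granule_id, step)
      VALUES (NEW.granule_id, 'calibrate'),
             (NEW.granule_id, 'grid'),
             (NEW.granule_id, 'derive');
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER on_raw_ingest
      AFTER INSERT ON raw_granules         -- invented raw-data table
      FOR EACH ROW EXECUTE FUNCTION queue_standard_products();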

13
How Well Did SQL Work?
  • Bill Farrell and others did 30 user scenarios:
  • schema, application, SQL, performance.
  • Snow cover, CO2, GCM, ...
  • The average ad hoc scenario generated about 30% of the
  • EOSDIS baseline processing;
  • this validated PULL over PUSH demand.
  • SQL was indeed a power tool:
  • many scenarios became a few simple SQL queries.
  • Need a spatial-temporal SQL.
  • Personal view:
  • It's great! Much better than Farrell or I expected.

14
Compute on demand
  • 90% of data is NEVER used (according to Hughes).
  • Some data is used only once.
  • Data is often re-calculated:
  • to repair hardware/software bugs,
  • for new, better algorithms.
  • Optimization: store only popular data.
  • Compute this based on past use
  • (of this data and related data).
  • Balance two costs (formalized in the note below):
  • 1. Re_Compute_Cost / Re_Use_Interval
  • 2. Storage_Cost x Re_Use_Interval
  • Recompute is often cheaper (saves 3x, we think).
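One way to formalize the slide's two cost terms (my reading, not from the deck), writing Re_Compute_Cost as \(C_r\), Storage_Cost per unit time as \(c_s\), and Re_Use_Interval as \(T\): recompute a product on each use rather than store it whenever

$$
\frac{C_r}{T} \;<\; c_s
\quad\Longleftrightarrow\quad
C_r \;<\; c_s \times T .
$$

Rarely used products (large \(T\)) favor recomputation, consistent with the slide's "saves 3x, we think".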

15
Use parallelism to get scaleup.
  • Many queries look at 100s or 1,000s of data tiles.
  • e.g., Berkeley weekly Landsat images since 1972:
  • 1,000 tape accesses,
  • 4,000 tape minutes (about 6 days).
  • Done 1,000-way parallel: 4 minutes.
  • Disk and tape demands are huge: multi-GOX.
  • Computation demands are huge: tera-ops.
  • The only solution:
  • use parallel execution,
  • use parallel data access.
  • SQL does this for you automatically (query sketched below).
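A sketch of the kind of query behind that example, over the same invented landsat_tiles schema as in the earlier sketch (none of these names come from the study). The point is that the SQL is identical whether the executor mounts one tape at a time or runs 1,000 drives in parallel; parallelism is the optimizer's and executor's job, not the scientist's.

    -- Hypothetical query: weekly Landsat coverage of the Berkeley area since 1972.
    SELECT date_trunc('week', t.acquired) AS week,
           count(*)                       AS tiles
    FROM   landsat_tiles t
    WHERE  gc_overlaps(t.footprint, region_box(-122.35, 37.80, -122.15, 37.95))
      AND  t.acquired >= DATE '1972-01-01'
    GROUP  BY 1
    ORDER  BY 1;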

16
Data Pump
  • Compute on demand handles small jobs:
  • less than 1,000 tape mounts,
  • less than 100 M disk accesses,
  • less than 100 TeraOps
  • (less than 30-minute response time).
  • For BIG JOBS, scan the entire 15 PB database
  • once a day / week.
  • Any BIG JOB can piggyback on this data scan.
  • (Figure: a DAAC in 2007; not transcribed.)

17
What California Proposed
  • (Agenda slide; see slide 5 for the full list.)

18
Use COTS hardware/software (saves $400M)
  • Defense contractors want to build (and maintain) stuff.
  • (They do it for the money.)
  • Fund SQL (SQL-2007): Object-Relational (extensible),
  • supporting Global Change data types.
  • Automates access.
  • Reliable storage.
  • Tertiary storage.
  • Parallel data search (automatic).
  • Workflow (job control).
  • Reliable.
  • Fund operations software companies (Tivoli, ...).

19
Use workstation technology (NOW)
  • Use workstation hardware technology,
  • not supercomputers:
  • $0.5/MB of disk vs $30/MB of disk,
  • $100/MIPS vs $18,000/MIPS,
  • $3K/tape drive vs $50K/tape drive.
  • Processor, disk, and tape ARRAYS connected by ATM:
  • a NOW (network of workstations).
  • Gives 10x (100x?) more stuff for the same dollars (ratios below).
  • Allows an ad hoc query load.
  • Allows a scaleable design.
  • Allows the same hardware for SuperDAACs and PeerDAACs.
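Taking the slide's price points at face value, the per-unit price ratios are (my arithmetic):

$$
\frac{\$30/\text{MB}}{\$0.50/\text{MB}} = 60,
\qquad
\frac{\$18{,}000/\text{MIPS}}{\$100/\text{MIPS}} = 180,
\qquad
\frac{\$50\text{K/drive}}{\$3\text{K/drive}} \approx 17,
$$

so for the same budget, workstation-class components buy one to two orders of magnitude more capacity, which is the "10x (100x?) more stuff" claim.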

20
Use workstation technology (NOW)
  • Study used RS/6000 and DEC 7000 as the workstation
  • (they are $100K/slice).
  • Should have used Compaq.
  • Price for 20 GFlops, 24 TB disk, 2 PB tape TODAY:
  • (price table not transcribed).

Compaq/DLT prices computed by Gray. 10 Peer DAAC costs $3M today; 1 Micro DAAC (200 TB) costs $300K.
21
Just-in-time acquisition (saves $400M)
  • Hardware prices decline 20%-40%/year,
  • so buy at the last moment.
  • Buy the best commodity product that day.
  • Depreciate over 3 years so that the facility is fresh
  • (after 3 years, cost is 23% of the original; arithmetic below).
  • (Chart: 60% decline, peaks at $10M; not transcribed.)
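A quick check of the depreciation arithmetic implied by the slide's 20%-40% annual price decline (my arithmetic):

$$
(1-0.20)^3 = 0.8^3 \approx 0.51,
\qquad
(1-0.40)^3 = 0.6^3 \approx 0.22,
$$

so equipment replaced at the end of a 3-year cycle costs roughly one half to one fifth of its original price, in line with the figure on the slide.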

22
What California Proposed
  • (Agenda slide; see slide 5 for the full list.)

23
2+N DAAC Architecture
  • 2 Super-DAACs: two BIG sites which
  • each store ALL the data (they back each other up);
  • there is no other way to archive these 15 PB databases.
  • Each services 1/2 the queries and runs a data pump.
  • Each produces 1/2 the standard data products.
  • Each has a BIG MIPS farm next to the byte farm
  • (an SCF, science computation facility).
  • N Peer-DAACs:
  • each stores part of the data (obtained from a Super-DAAC);
  • can be NASA-sponsored or private;
  • same software and hardware as the Super-DAACs.
  • Super-DAACs are banks (careful); Peer-DAACs are pubs (anything goes).

24
Minimize Operations Costs
  • Reduced sites (DAACs) have reduced costs.
  • Use a Mosaic, email, telephone user-support model.
  • Count on vendors to provide:
  • network management (NetView, SNMP),
  • data replication,
  • application software version control,
  • workflow control,
  • help desk software,
  • more reliable hardware/software.

25
Unify data storage centers with data analysis
  • Data analysis (Science Computation Facilities)
  • needs quick, high-bandwidth access to the DB.
  • WAN technology is good, but not that good.
  • WAN technology is not free.
  • Co-locate DAACs and SCFs:
  • two super SCFs, many peer SCFs.
  • Instrument teams often find a bug or a new algorithm
  • and reprocess all the base data to make a new data set.
  • This ripples to data consumers:
  • must track data lineage (schema sketch below).
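A minimal sketch of lineage tracking inside the OR DBMS, so a reprocessing run can find every derived product that depends on a corrected input. The tables are invented for illustration (the study does not specify a lineage schema); a full ripple across several derivation levels would use a recursive query over the same table.

    -- Hypothetical lineage tables: each product records its inputs and algorithm version.
    CREATE TABLE products (
      product_id   bigint PRIMARY KEY,
      algorithm    text,
      algo_version text,
      produced_at  timestamp
    );

    CREATE TABLE lineage (
      product_id bigint REFERENCES products(product_id),
      input_id   bigint                -- a raw granule or another product
    );

    -- Which products must be regenerated after input granule 42 is recalibrated?
    SELECT DISTINCT p.product_id
    FROM   products p
    JOIN   lineage  l ON l.product_id = p.product_id
    WHERE  l.input_id = 42;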

26
Budget
  • We had a VERY difficult time discovering a
    budget.
  • So we did our own.
  • It was less.
  • Big savings in operations and development
  • Hardware savings could give bigger DAACs

27
What California Proposed
  • (Agenda slide; see slide 5 for the full list.)

28
Challenging Problems
  • Design the Global Change Schema.
  • Understand data lineage.
  • Build discovery, analysis, visualization tools.
  • Build an OR DBMS, including:
  • distributed,
  • parallel,
  • workflow,
  • lazy-eager evaluation,
  • tertiary storage,
  • SQL.
  • Build a decent, reliable HSM.
  • Build a way to operate a 1,000-node NOW.