Title: EOSDIS Alternate Architecture Study
1. EOSDIS Alternate Architecture Study
- Jim Gray
- McKay Fellow, UC Berkeley, 1 May 1995, gray _at_ crl.com
- 1. Background: problem and proposed solution
- 2. What California proposed
- Co-workers
- Mike Stonebraker: Producer / Director / Script Writer / Propeller Head
- Bill Farrell: Ramrod and Computer-Literate DirtBag
- Jeff Dozier: Godfather
- Special effects
- Earth Science: Frank Davis, C. Roberto Mechoso, Jim Frew
- Computer Science: Reagan Moore, Jim Gray, Joe Pasquale
- Administration: Claire Mosher
- Writing: Stephanie Sides
- Prototypes: many, many ... people
2. What's The Problem?
- Antarctica is melting -- 77% of the world's fresh water liberated
- Sea level rises 70 meters
- Chico and Memphis are beach-front property
- New York, Washington, SF, LA, London, Paris ...
- Let's study it! Mission to Planet Earth
- EOS: Earth Observing System ($17B, later descoped to $10B)
- 50 instruments on 10 satellites, 1997-2001
- Plus Landsat (added later)
- EOSDIS: EOS Data and Information System
- 3-5 MB/s raw, 30-50 MB/s processed
- 4 TB/day, 15 PB by year 2007 (arithmetic check below)
- Issues:
- How to store it?
- How to serve it to users?
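A back-of-the-envelope check of those rates (my arithmetic, using only the figures above):

    30-50 MB/s x 86,400 s/day ≈ 2.6-4.3 TB/day, consistent with the quoted 4 TB/day
    4 TB/day x 365 days x ~10 years of operation ≈ 15 PB by 2007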
3. What Happened?
- 1986: Mission to Planet Earth
- 1989: Bids from Hughes and TRW
- 1993: Contract granted; public review
- Customers do not want it (tape/mainframe centric)
- 1994: Alternate Architecture study
- Three outside teams:
- Wyoming: Internet 20,000,000
- Maryland: Software Engineering
- California: DB centric
- One home team: CORBA + Z39.50 + UNIX
- 1995: Drifting in the Sequoia direction
4. The Hughes Plan
- 8 DAACs (Distributed Active Archive Centers) hold the bytes
- (one per congressional district?)
- N SCFs (Scientific Computation Facilities) hold the MIPS
- (typically instrument or science teams)
- Thin wires among them
- 90% of DAAC processing is PUSH: building standard data products
- Fixed pipeline: calibrate, grid, derive
- Typical subscriber gets tapes or CD-ROMs (standard data products)
- One chauffeur per 10 customers (high ops costs)
- Build everything (operations, HSM, DBMS, ...) from scratch
- CORBA and Z39.50 are the glue
- Criticism: not evolvable, not open, not online, not useful
5. What California Proposed
- 0. Design for success: expect that millions will use the system (online)
- 1. DBMS-centric design: automates discovery, access, management
- 2. Object-relational databases enable:
- Automate access to data so that the NASA 500, Global Change 10,000, and Internet 20,000,000 can use the system
- Cache popular results, not all results (saves 3x or more)
- Compute on demand (saves lots of storage and CPU)
- Emphasize pull processing rather than push processing
- Use parallelism to get scaleup
- Do batch as a data pump
- 3. Be smart shoppers:
- Use COTS hardware/software (saves $400M)
- Just-in-time acquisition (saves $400M)
- Use workstation, not mainframe, technology (gives 10x more stuff)
- Depreciate over 3 years (ends in 2007 with "fresh" equipment)
- 4. 2+N node architecture:
- 2 Super-DAACs for fault tolerance and for growth
6. Meta-Model for the Sequoia Proposal
- Be technological optimists
- couldn't build it today; count on progress
- ride the technology wave (not water-cooled)
- Buy or seed, do not build
- Use COTS where possible
- Fund 2 or more COTS vendors if we need a product:
- OR DBMS
- HSM
- Operations
- Replace people with technology (OR DBMS)
- automate data discovery, access, visualization
- DBMS-centric view
7. DBMS-Centric View
- This is a database problem (no kidding)!
- This is not:
- a file system problem (file is the wrong abstraction)
- an RPC problem (CORBA is the wrong abstraction)
- a Z39.50 problem (Z39.50 is a FAP)
- This is an operations problem:
- hierarchical storage management
- network management
- source code control
- client-server tools
- You can BUY all this stuff. Fund COTS.
- BUILD AS LITTLE AS POSSIBLE
8. What California Proposed
- 0. Design for success: expect that millions will use the system (online)
- 1. DBMS-centric design: automates discovery, access, management
- 2. Object-relational databases enable:
- Automate access to data so that the NASA 500, Global Change 10,000, and Internet 20,000,000 can use the system
- Cache popular results, not all results (saves 3x or more)
- Compute on demand (saves lots of storage and CPU)
- Emphasize pull processing rather than push processing
- Use parallelism to get scaleup
- Do batch as a data pump
- 3. Be smart shoppers:
- Use COTS hardware/software (saves $400M)
- Just-in-time acquisition (saves $400M)
- Use workstation, not mainframe, technology (gives 10x more stuff)
- Depreciate over 3 years (ends in 2007 with "fresh" equipment)
- 4. 2+N node architecture:
- 2 Super-DAACs for fault tolerance and for growth
9. Design for Success: Expect Lots of Users
- Expect that millions will use the system (online)
- Three user categories:
- NASA 500 -- funded by NASA to do science
- Global Change 10 K -- other dirt bags
- Internet 20 M -- everyone else:
- grain speculators
- environmental impact reports
- new applications
- Discovery and access must be automatic
- Allow anyone to set up a Peer-DAAC / SCF
- Design for ad hoc queries, not standard data products
- If push is 90%, then 10% of the data is read (on average)
- A failure: no one uses the data. In DSS (decision support), push is 1% or less.
- Computation demand is 100x the Hughes estimate (pull is 10x to 100x greater than push)
10. The Process Flow
- Data arrives and is pre-processed:
- instrument data is calibrated, gridded, averaged
- geophysical data is derived
- Users ask for stored data, or ask to analyze and combine data
- The pull-push split can be made dynamically (sketch below)
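To make the pull/push distinction concrete, here is a minimal OR-DBMS sketch. The table, column, and function names are hypothetical (calibrate() and grid() stand in for Global Change class-library routines; assume they take and return byte arrays); none of them come from the study itself.

    -- Raw instrument granules land here as they arrive.
    CREATE TABLE raw_granule (
        granule_id   INTEGER PRIMARY KEY,
        instrument   TEXT,
        acquired_at  TIMESTAMP,
        footprint    POLYGON,   -- spatial extent, an OR-DBMS extended type
        payload      BYTEA      -- raw sensor bytes
    );

    -- PULL: a derived product defined as a view is computed only when queried.
    CREATE VIEW calibrated_granule AS
    SELECT granule_id, instrument, acquired_at, footprint,
           grid(calibrate(payload, instrument)) AS gridded_radiance
    FROM raw_granule;

    -- PUSH: a popular standard product is instead materialized eagerly, once for everyone.
    CREATE TABLE std_gridded AS
    SELECT granule_id, instrument, acquired_at,
           grid(calibrate(payload, instrument)) AS gridded_radiance
    FROM raw_granule;

The system can move a product between the two regimes (view versus materialized table) according to how often it is actually requested, which is the dynamic pull-push split mentioned above.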
11. The Software Model: Global View
- SQL is the FAP and the API
- Applications use it to access data
- It includes:
- stored procedures (so it subsumes RPC; sketch below)
- GC class libraries
- Computation is data-driven
- Gateways for other interfaces:
- HTTP, Z39.50, CORBA, COM
- TP or TP-lite manages workflow
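One way to read "stored procedures, so RPC": a client calls a stored procedure over the same SQL connection it uses for queries, so no separate RPC layer is needed. A minimal sketch, continuing the hypothetical schema above (get_scene is my name, not the study's):

    -- The "remote procedure" is just a SQL function executed inside the DBMS.
    CREATE FUNCTION get_scene(gid INTEGER) RETURNS BYTEA AS $$
        SELECT grid(calibrate(payload, instrument))
        FROM raw_granule
        WHERE granule_id = gid;
    $$ LANGUAGE sql;

    SELECT get_scene(42);   -- the client's "RPC"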
12. Automate access to data
- Invest in:
- designing the Global Change schema
- cooperating with standards groups
- OR DBMS class libraries for GC datatypes
- developing a browser to do resource discovery
- The community will develop access and visualization tools
- The OR DBMS will do:
- PUSH processing: triggers and workflow (trigger sketch below)
- PULL processing: query optimization
- (some assembly required)
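As a concrete illustration of DBMS-driven push processing, here is a minimal trigger sketch in PostgreSQL-style syntax, continuing the hypothetical schema above (the function and trigger names are mine, not the study's):

    -- Eagerly materialize the standard product whenever a new raw granule arrives.
    CREATE FUNCTION push_calibrate() RETURNS trigger AS $$
    BEGIN
        INSERT INTO std_gridded (granule_id, instrument, acquired_at, gridded_radiance)
        VALUES (NEW.granule_id, NEW.instrument, NEW.acquired_at,
                grid(calibrate(NEW.payload, NEW.instrument)));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER on_new_granule
    AFTER INSERT ON raw_granule
    FOR EACH ROW EXECUTE FUNCTION push_calibrate();

Pull processing needs no extra machinery: an ad hoc query against the calibrated_granule view simply lets the optimizer decide what to compute and where to parallelize.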
13. How Well Did SQL Work?
- Bill Farrell and others did 30 user scenarios: schema, application, SQL, performance
- Snow cover, CO2, GCM, ...
- The average ad hoc scenario generated about 30% of the EOSDIS baseline processing
- This validated PULL over PUSH demand
- SQL was indeed a power tool:
- many scenarios became a few simple SQL queries (example below)
- we need a spatial / temporal SQL
- Personal view:
- It's great! Much better than Farrell or I expected.
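To give a feel for such a scenario, here is a hedged sketch of a snow-cover question against the hypothetical schema above (snow_fraction() and the region parameter are illustrative placeholders, not the study's actual queries):

    -- Monthly average snow cover over a region of interest, 1997-2001.
    -- snow_fraction() is assumed to return a value between 0 and 1.
    SELECT date_trunc('month', acquired_at)      AS month,
           avg(snow_fraction(gridded_radiance))  AS avg_snow_cover
    FROM   calibrated_granule
    WHERE  footprint && :region_of_interest      -- spatial overlap against a host-variable polygon
      AND  acquired_at BETWEEN '1997-01-01' AND '2001-12-31'
    GROUP  BY date_trunc('month', acquired_at)
    ORDER  BY month;

A question that would otherwise be a custom tape-reading program collapses into one declarative query, which is the "power tool" point above.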
14. Compute on demand
- 90% of the data is NEVER used (according to Hughes)
- Some data is used only once
- Data is often re-calculated:
- to repair hardware/software bugs
- for new, better algorithms
- Optimization: store only popular data
- Decide this based on past use (of this data and related data)
- Balance two costs (break-even rule sketched below):
- 1. Re_Compute_Cost / Re_Use_Interval
- 2. Storage_Cost x Re_Use_Interval
- Recompute is often cheaper (saves 3x, we think)
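My reading of that break-even rule, assuming Re_Compute_Cost is the one-time cost of regenerating a product, Storage_Cost is the cost per unit time of keeping it online, and Re_Use_Interval is the expected time between uses:

    store the product    if  Storage_Cost x Re_Use_Interval  <  Re_Compute_Cost
    recompute on demand  if  Storage_Cost x Re_Use_Interval  >  Re_Compute_Cost

Equivalently, compare the two rates: Re_Compute_Cost / Re_Use_Interval versus Storage_Cost.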
15. Use parallelism to get scaleup
- Many queries look at 100s or 1,000s of data tiles
- e.g., Berkeley's weekly Landsat images since 1972:
- 1,000 tape accesses
- 4,000 tape minutes, about 6 days
- done 1,000-way parallel: 4 minutes (arithmetic below)
- Disk and tape demands are huge: multi-GOX
- Computation demands are huge: tera-ops
- The only solution:
- use parallel execution
- use parallel data access
- SQL does this for you automatically
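The arithmetic behind that speedup (my gloss on the numbers above): 1,000 tape accesses at roughly 4 tape-minutes each is about 4,000 tape-minutes of work. Done serially on a handful of drives that is days of elapsed time; spread across an array it is

    4,000 tape-minutes / 1,000 drives in parallel ≈ 4 minutes of elapsed time.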
16. Data Pump
- Compute on demand for small jobs:
- less than 1,000 tape mounts
- less than 100 M disk accesses
- less than 100 TeraOps
- (less than 30-minute response time)
- For BIG JOBS: scan the entire 15 PB database once a day / week
- Any BIG JOB can piggyback on this data scan (bandwidth arithmetic below)
- (Figure: a DAAC in 2007)
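A rough sense of what the pump implies (my arithmetic, not the study's): scanning 15 PB once a week needs about

    15 PB / 604,800 s ≈ 25 GB/s of aggregate sequential bandwidth

and a daily scan needs roughly 175 GB/s -- achievable only as hundreds of disks and tape drives read in parallel, which is the NOW-style hardware argued for below.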
17. What California Proposed
- 0. Design for success: expect that millions will use the system (online)
- 1. DBMS-centric design: automates discovery, access, management
- 2. Object-relational databases enable:
- Automate access to data so that the NASA 500, Global Change 10,000, and Internet 20,000,000 can use the system
- Cache popular results, not all results (saves 3x or more)
- Compute on demand (saves lots of storage and CPU)
- Emphasize pull processing rather than push processing
- Use parallelism to get scaleup
- Do batch as a data pump
- 3. Be smart shoppers:
- Use COTS hardware/software (saves $400M)
- Just-in-time acquisition (saves $400M)
- Use workstation, not mainframe, technology (gives 10x more stuff)
- Depreciate over 3 years (ends in 2007 with "fresh" equipment)
- 4. 2+N node architecture:
- 2 Super-DAACs for fault tolerance and for growth
18. Use COTS hardware/software (saves $400M)
- Defense contractors want to build (and maintain) stuff
- (they do it for the money)
- Fund SQL ("SQL-2007"): Object-Relational (extensible)
- supports Global Change data types
- automates access
- reliable storage
- tertiary storage
- parallel data search (automatic)
- workflow (job control)
- reliable
- Fund operations software companies (Tivoli, ...)
19. Use workstation technology (NOW)
- Use workstation hardware technology, not supercomputers:
- $0.50/MB of disk vs. $30/MB of disk
- $100/MIPS vs. $18,000/MIPS
- $3K per tape drive vs. $50K per tape drive
- Processor, disk, and tape ARRAYS connected by ATM: a NOW (network of workstations)
- Gives 10x (100x?) more stuff for the same dollars (ratios below)
- Allows an ad hoc query load
- Allows a scaleable design
- Allows the same hardware for Super-DAACs and Peer-DAACs
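The per-unit ratios behind that claim (my arithmetic from the prices above):

    disk:        $30/MB  / $0.50/MB       =  60x
    CPU:         $18,000 / $100 per MIPS  = 180x
    tape drive:  $50K    / $3K            ≈  17x

which is roughly where the "10x, maybe 100x, more stuff for the same dollars" figure comes from.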
20. Use workstation technology (NOW)
- The study used the RS/6000 and DEC 7000 as the "workstation"
- (they are $100K per slice)
- Should have used Compaq
- Price for 20 GFlops, 24 TB of disk, 2 PB of tape TODAY:
- (Compaq/DLT prices computed by Gray; table not reproduced)
- 1 Peer DAAC costs $3M today; 1 Micro DAAC (200 TB) costs $300K
21. Just-in-time acquisition (saves $400M)
- Hardware prices decline 20-40% per year
- So buy at the last moment
- Buy the best commodity product available that day
- Depreciate over 3 years so that the facility is fresh
- (after 3 years, cost is 2/3 less than the original; arithmetic below)
- (Chart: 60% decline; spending peaks at $10M; not reproduced)
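The depreciation arithmetic behind "fresh" (mine, using the 20-40% per year figure above): after 3 years of such declines the same capability costs

    0.8^3 ≈ 0.51  to  0.6^3 ≈ 0.22

of the original price, i.e. roughly a half to a quarter, so rolling the equipment over every 3 years keeps the facility near the technology curve at modest cost.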
22. What California Proposed
- 0. Design for success: expect that millions will use the system (online)
- 1. DBMS-centric design: automates discovery, access, management
- 2. Object-relational databases enable:
- Automate access to data so that the NASA 500, Global Change 10,000, and Internet 20,000,000 can use the system
- Cache popular results, not all results (saves 3x or more)
- Compute on demand (saves lots of storage and CPU)
- Emphasize pull processing rather than push processing
- Use parallelism to get scaleup
- Do batch as a data pump
- 3. Be smart shoppers:
- Use COTS hardware/software (saves $400M)
- Just-in-time acquisition (saves $400M)
- Use workstation, not mainframe, technology (gives 10x more stuff)
- Depreciate over 3 years (ends in 2007 with "fresh" equipment)
- 4. 2+N node architecture:
- 2 Super-DAACs for fault tolerance and for growth
23. 2+N DAAC architecture
- 2 Super-DAACs: two BIG sites, each of which
- stores ALL the data (they back each other up)
- (no other way to archive these 15 PB databases)
- services 1/2 the queries and runs a data pump
- produces 1/2 the standard data products
- has a BIG MIPS farm next to the byte farm
- (an SCF: science computation facility)
- N Peer-DAACs:
- each stores part of the data (obtained from a Super-DAAC)
- can be NASA-sponsored or private
- same software and hardware as the Super-DAACs
- Super-DAACs are banks, Peer-DAACs are pubs:
- careful vs. anything goes
24. Minimize Operations Costs
- Fewer sites (DAACs) mean lower costs
- Use a Mosaic / email / telephone user-support model
- Count on vendors to provide:
- network management (NetView, SNMP)
- data replication
- application software version control
- workflow control
- help desk software
- more reliable hardware/software
25. Unify data storage centers with data analysis
- Data analysis (the Science Computation Facilities) needs quick, high-bandwidth access to the DB
- WAN technology is good, but not that good
- WAN technology is not free
- So co-locate the DAACs and SCFs:
- two super SCFs, many peer SCFs
- Instrument teams often find a bug or a new algorithm:
- reprocess all the base data to make a new data set
- ripple effect on data consumers
- must track data lineage
26. Budget
- We had a VERY difficult time discovering a budget
- So we did our own
- It was less
- Big savings in operations and development
- Hardware savings could give bigger DAACs
27. What California Proposed
- 0. Design for success: expect that millions will use the system (online)
- 1. DBMS-centric design: automates discovery, access, management
- 2. Object-relational databases enable:
- Automate access to data so that the NASA 500, Global Change 10,000, and Internet 20,000,000 can use the system
- Cache popular results, not all results (saves 3x or more)
- Compute on demand (saves lots of storage and CPU)
- Emphasize pull processing rather than push processing
- Use parallelism to get scaleup
- Do batch as a data pump
- 3. Be smart shoppers:
- Use COTS hardware/software (saves $400M)
- Just-in-time acquisition (saves $400M)
- Use workstation, not mainframe, technology (gives 10x more stuff)
- Depreciate over 3 years (ends in 2007 with "fresh" equipment)
- 4. 2+N node architecture:
- 2 Super-DAACs for fault tolerance and for growth
28. Challenging Problems
- Design the Global Change schema
- Understand data lineage
- Build discovery, analysis, and visualization tools
- Build an OR DBMS, including:
- distributed,
- parallel,
- workflow,
- lazy-eager evaluation,
- tertiary storage,
- SQL
- Build a decent, reliable HSM
- Build a way to operate a 1,000-node NOW