Title: Surviving The Information Avalanche
1 Surviving The Information Avalanche
- Jim Gray
- Microsoft Research
- Talk at Adobe Developers Conference
- 26 April 2004
- http://research.microsoft.com/gray/talks
2 Outline
- Historical trends imply that in 20 years
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, prioritizing information is a key technology.
[Scale graphic: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]
3 Things Have Changed
1956
- IBM 305 RAMAC
- 10 MB disk
- $1M (y2004 $)
4 The Next 50 Years Will See MORE CHANGE: ops/s/$ Had Three Growth Curves 1890-1990
Combination of Hans Moravec, Larry Roberts, and Gordon Bell: WordSize*ops/s/sysprice
- 1890-1945
- Mechanical
- Relay
- 7-year doubling
- 1945-1985
- Tube, transistor,..
- 2.3 year doubling
- 1985-2004
- Microprocessor
- 1.0 year doubling
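The doubling times above can be turned into growth factors with a line of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes smooth exponential growth, which the slide does not claim. It also shows what doubling time corresponds to the talk's "100x per decade" figure.

```python
import math

def growth_factor(doubling_years, span_years):
    """Improvement factor over span_years, given a fixed doubling time."""
    return 2 ** (span_years / doubling_years)

def doubling_time(factor, span_years):
    """Doubling time (in years) implied by a total factor over span_years."""
    return span_years * math.log(2) / math.log(factor)

if __name__ == "__main__":
    # Doubling times taken from the three eras on the slide.
    for era, years in (("1890-1945", 7.0), ("1945-1985", 2.3), ("1985-2004", 1.0)):
        print(f"{era}: {growth_factor(years, 10):,.1f}x per decade")
    # 100x per decade (equivalently 10,000x in 20 years) as a doubling time.
    print(f"100x per decade = doubling every {doubling_time(100, 10):.2f} years")
```

Under this model a 1.0-year doubling gives about 1,000x per decade, while 100x per decade corresponds to doubling roughly every 1.5 years.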
5 Constant Cost or Constant Function?
- 100x improvement per decade
- Same function 100x cheaper
- 100x more function for same price
[Figure labels, in slide order: Mainframe, SMP, Constellation, Cluster; "Constant Price"; Mini, SMP, Constellation, Workstation, Graphics/storage; "Lower Price", "New Category"; PDA, Camera/browser]
6 Growth Comes From NEW Apps
- The $10M computer of 1980 costs $1k today
- If we were still doing the same things, IT would be a $0 B/y industry
- NEW things absorb the new capacity
7 The Surprise-Free Future in 20 years.
- 10,000x more power for same price
- Personal supercomputer
- Personal petabyte stores
- Same function for 10,000x less cost.
- Smart dust --the penny PC?
- The 10 peta-op computer (for $1,000).
8 10,000x would change things
- Human computer interface
- Decent computer vision
- Decent computer speech recognition
- Decent computer speech synthesis
- Vast information stores
- Ability to search and abstract the stores.
9 How Good is HCI Today?
- Surprisingly good.
- Demo of making faces
- http://research.microsoft.com/research/pubs/view.aspx?pubid=290
- Demo of speech synthesis
- Daisy, Hal
- Synthetic voice
- Speech recognition is improving fast.
- Vision getting better
- Pen computing finally a reality.
- Displays improving fast (compared to last 30 years)
10 Outline
- Historical trends imply that in 20 years
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, prioritizing information is a key technology.
[Scale graphic: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]
11 How much information is there?
- Almost everything is recorded digitally.
- Most bytes are never seen by humans.
- Data summarization, trend detection, and anomaly detection are key technologies
- See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman and Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale graphic: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Markers, from largest to smallest: Everything! Recorded; All Books MultiMedia; All books (words); Movie; A Photo; A Book]
12 And 90% in Cyberspace Because
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
[Figure labels: Point-to-Point OR Broadcast; Immediate OR Time Delayed; Locate, Process, Analyze, Summarize]
13 MyLifeBits: The guinea pig
- Gordon Bell is digitizing his life
- Has now scanned virtually all
- Books written (and read when possible)
- Personal documents (correspondence, memos, email, bills, legal, ...)
- Photos
- Posters, paintings, photos of things (artifacts, medals, plaques)
- Home movies and videos
- CD collection
- And, of course, all PC files
- Recording phone, radio, TV, web pages, conversations
- Paperless throughout 2002. 12 scanned, 12 discarded.
- Only 30 GB, excluding videos
- Video is 2 TB and growing fast
14 Capture and encoding
15 I mean everything
16 25Kday life Personal Petabyte
1PB
Will anyone look at web pages in 2020?
Probably new modalities and media will dominate then.
17 Challenges
- Capture: Get the bits in
- Organize: Index them
- Manage: No worries about loss or space
- Curate/Annotate: automate where possible
- Privacy: Keep safe from theft.
- Summarize: Give thumbnail summaries
- Interface: how to ask/anticipate questions
- Present: show it in understandable ways.
18 Memex: As We May Think, Vannevar Bush, 1945
- A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility
- yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely
19 Too much storage? Try to fill a terabyte in a year
Petabyte volume has to be some form of video.
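To get a feel for "fill a terabyte in a year", the following sketch counts how many items fit in 1 TB and how many must be captured per day to fill it in a year. The item sizes are assumptions chosen only for illustration; they are not taken from the slide.

```python
TB = 10**12            # decimal terabyte, in bytes
DAYS_PER_YEAR = 365

# Hypothetical item sizes in bytes, for illustration only.
ITEM_SIZES = {
    "300 KB photo": 300 * 10**3,
    "1 MB document": 10**6,
    "1 GB hour of video": 10**9,
}

def items_to_fill(capacity_bytes, item_bytes):
    """Return (items that fit, items per day needed to fill in one year)."""
    total = capacity_bytes // item_bytes
    return total, total / DAYS_PER_YEAR

if __name__ == "__main__":
    for name, size in ITEM_SIZES.items():
        total, per_day = items_to_fill(TB, size)
        print(f"{name}: {total:,} per TB, {per_day:,.1f} per day")
```

With these assumed sizes, only the video case gives a daily rate a person could plausibly sustain, which is consistent with the slide's point that petabyte volumes have to be some form of video.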
20 How Will We Find Anything?
- Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
- If you don't use a DBMS, you will implement one!
- Simple logical structure
- Blob and link is all that is inherent
- Additional properties (facets = extra tables) and methods on those tables (encapsulation)
- More than a file system
- Unifies data and meta-data
SQL DBMS
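The "blob and link, plus facets as extra tables" structure can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the MyLifeBits schema: every table, column, and function name below is hypothetical.

```python
import sqlite3

# Hypothetical schema: names are illustrative only.
SCHEMA = """
CREATE TABLE item (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
CREATE TABLE link (
    src INTEGER NOT NULL REFERENCES item(id),
    dst INTEGER NOT NULL REFERENCES item(id),
    PRIMARY KEY (src, dst)
);
-- A facet is an extra table keyed by item id.
CREATE TABLE photo_facet (
    item_id  INTEGER PRIMARY KEY REFERENCES item(id),
    taken_on TEXT,   -- ISO date, so text order is date order
    caption  TEXT
);
CREATE INDEX photo_by_date ON photo_facet(taken_on);
"""

def build_store():
    """Create an empty in-memory store."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db

def add_item(db, content):
    """Store a blob and return its id."""
    return db.execute("INSERT INTO item(content) VALUES (?)", (content,)).lastrowid

def add_photo(db, content, taken_on, caption):
    """Store a blob together with its photo facet row."""
    item_id = add_item(db, content)
    db.execute("INSERT INTO photo_facet VALUES (?, ?, ?)", (item_id, taken_on, caption))
    return item_id

def link(db, src, dst):
    """Record a directed link between two items."""
    db.execute("INSERT INTO link(src, dst) VALUES (?, ?)", (src, dst))

def linked_photos(db, src):
    """Captions of photos linked from src, oldest first."""
    rows = db.execute(
        "SELECT p.caption FROM link AS l "
        "JOIN photo_facet AS p ON p.item_id = l.dst "
        "WHERE l.src = ? ORDER BY p.taken_on",
        (src,),
    )
    return [caption for (caption,) in rows]

if __name__ == "__main__":
    db = build_store()
    album = add_item(db, b"album")
    link(db, album, add_photo(db, b"jpeg bytes", "2001-05-01", "later photo"))
    link(db, album, add_photo(db, b"jpeg bytes", "2000-07-07", "earlier photo"))
    print(linked_photos(db, album))
```

Queries, indexing, and set-oriented access come from the DBMS here rather than from application code, which is the slide's argument for not building on a bare file system.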
21 Photos
22 Searching: the most useful app?
- Challenge: What questions for useful results?
- Many ways to present answers
23 (No Transcript)
24 Detail view
25 Resource explorer: Ancestor (collections), annotations, descendant and preview panes turned on
26 Synchronized timelines with histogram guide
27 Value of media depends on annotations
- It's just bits until it is annotated
28 System annotations provide base level of value
29 Tracking usage even better
- Date 7/7/2000. Opened 30 times, emailed to 10 people (it's valued by the user!)
30 Getting the user to say a little something is a big jump
- Date 7/7/2000. Opened 30 times, emailed to 10 people. BARC dim sum intern farewell Lunch
31 Getting the user to tell a story is the ultimate in media value
- A story is a layout in time and space
- Most valuable content (by selection, and by being well annotated)
- Stories must include links to any media they use (for future navigation/search: transclusion).
- Cf. MovieMaker; Creative Memories PhotoAlbums
32 Value of media depends on annotations
It's just bits until it is annotated
- Auto-annotate whenever possible, e.g. GPS cameras
- Make manual annotation as easy as possible: XP photo capture, voice, photos with voice, etc.
- Support gang annotation
- Make stories easy
33 80% of data is personal / individual. But, what about the other 20%?
- Business
- Wal-Mart online: 1 PB and growing.
- Paradox: most transaction systems
- Have to go to image/data monitoring for big data
- Government
- Government is the biggest business.
- Science
- LOTS of data.
34 Instruments: CERN LHC, Peta Bytes per Year
- Looking for the Higgs Particle
- Sensors: 1000 GB/s (1 TB/s, about 30 EB/y)
- Events: 75 GB/s
- Filtered: 5 GB/s
- Reduced: 0.1 GB/s, about 2 PB/y
- Data pyramid: 100 GB, 1 TB, 100 TB, 1 PB, 10 PB
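The per-second rates above convert to yearly volumes by multiplying by the seconds in a year. The sketch below is a rough check under stated assumptions: decimal units, a 365-day year, and a hypothetical `duty_cycle` parameter (the fraction of the year the instrument runs) that the slide does not mention.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

def yearly_volume(rate_gb_per_s, unit="PB", duty_cycle=1.0):
    """Data volume per year, in `unit`, for a sustained rate in GB/s."""
    bytes_per_year = rate_gb_per_s * UNITS["GB"] * SECONDS_PER_YEAR * duty_cycle
    return bytes_per_year / UNITS[unit]

if __name__ == "__main__":
    # Rates for each stage of the filtering pipeline, as listed on the slide.
    for stage, rate in (("Sensors", 1000), ("Events", 75),
                        ("Filtered", 5), ("Reduced", 0.1)):
        print(f"{stage}: {yearly_volume(rate, 'PB'):,.1f} PB/y if run continuously")
```

Running continuously, 1 TB/s comes to roughly 31.5 EB/y and 0.1 GB/s to roughly 3.2 PB/y. The slide's 2 PB/y figure would fit a duty cycle near 0.63 in this model, though that reading is an assumption rather than something the slide states.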
35 Information Avalanche
- Both
- better observational instruments and
- better simulations
- are producing a data avalanche
- Examples
- Turbulence: 100 TB simulation, then mine the information
- BaBar: grows 1 TB/day; 2/3 simulation information, 1/3 observational information
- CERN: LHC will generate 1 GB/s, 10 PB/y
- VLBA (NRAO) generates 1 GB/s today
- NCBI: only ½ TB but doubling each year; very rich dataset.
- Pixar: 100 TB/Movie
Image courtesy of C. Meneveau and A. Szalay at JHU
36 Q: Where will the Data Come From? A: Sensor Applications
- Earth Observation
- 15 PB by 2007
- Medical Images, Information, Health Monitoring
- Potential 1 GB/patient/y -> 1 EB/y
- Video Monitoring
- 1E8 video cameras at 1E5 MBps -> 10 TB/s -> 100 EB/y -> filtered???
- Airplane Engines
- 1 GB sensor data/flight,
- 100,000 engine hours/day
- 30 PB/y
- Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
37 The Big Picture
[Figure: Experiments and Instruments, Other Archives, Literature, and Simulations each contribute facts; questions go in and answers come out]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling
38 FTP - GREP
- Download (FTP and GREP) is not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1 PB is about 3,000 disks
- At some point we need indices to limit search, and parallel data search and analysis
- This is where databases can help
- Next generation technique: Data Exploration
- Bring the analysis to the data!
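The GREP figures above follow from dividing data size by sequential scan throughput. The sketch below assumes a single stream at 10 MB/s; that rate is a hypothetical value picked because it roughly reproduces the slide's petabyte figure, not a number given in the talk.

```python
ASSUMED_BYTES_PER_SECOND = 10 * 10**6  # hypothetical single-stream scan rate

SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}

def scan_seconds(size_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Time to read size_bytes once, start to finish."""
    return size_bytes / bytes_per_second

def humanize(seconds):
    """Render a duration in the largest unit that fits."""
    for unit, length in (("years", 365 * 86400), ("days", 86400),
                         ("hours", 3600), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name}: {humanize(scan_seconds(size))}")
```

Each step up the scale multiplies scan time by 1,000, so a full scan stops being interactive somewhere between a gigabyte and a terabyte. That gap is what indices and parallel search are meant to close.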
39 The Speed Problem
- Many users want to search the whole DB: ad hoc queries, often combinatorial
- Want 1 minute response
- Brute force (parallel search)
- 1 disk = 50 MBps, so about 1M disks/PB, about $300M/PB
- Indices (limit search, do column store)
- 1,000x less equipment: $1M/PB
- Pre-compute answer
- No one knows how to do it for all questions.
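The brute-force option can be sized with simple arithmetic: how many disks must scan in parallel to read the whole dataset within the response deadline? The sketch below uses the slide's 50 MBps per disk and 1-minute target. The function name and the `selectivity` parameter (the fraction of data an index leaves to be scanned) are assumptions made for illustration.

```python
import math

PB = 10**15
DISK_BYTES_PER_SECOND = 50 * 10**6  # 50 MBps per disk, from the slide
DEADLINE_SECONDS = 60               # 1 minute response, from the slide

def disks_for_deadline(data_bytes, selectivity=1.0,
                       disk_rate=DISK_BYTES_PER_SECOND,
                       deadline=DEADLINE_SECONDS):
    """Disks needed to scan `selectivity` of the data within the deadline."""
    bytes_to_scan = data_bytes * selectivity
    return math.ceil(bytes_to_scan / (disk_rate * deadline))

if __name__ == "__main__":
    print("brute force:", disks_for_deadline(PB), "disks")
    print("index leaving 1/1000 of the data:", disks_for_deadline(PB, 0.001), "disks")
```

Under these assumptions a full scan needs a few hundred thousand disks, the same order of magnitude as the slide's rounded 1M disks/PB, and an index that cuts the scanned data by 1,000x cuts the equipment by the same factor.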
40 Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at same rate, we can only keep up with N log N
- A way out?
- Relax notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Combination of statistics and computer science
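The scaling argument above can be checked numerically. If both the data size and the available compute grow by the same factor, the sketch below reports how much longer each class of algorithm takes than it does today. The starting size, the growth factor, and the function names are illustrative assumptions.

```python
import math

# Cost models for the algorithm classes named on the slide.
COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

def relative_runtime(n0, growth, cost):
    """Future runtime divided by today's, when data and compute both grow by `growth`."""
    work_ratio = cost(n0 * growth) / cost(n0)
    return work_ratio / growth  # compute got `growth` times faster

if __name__ == "__main__":
    n0, growth = 10**6, 100  # hypothetical: a million objects, 100x growth
    for name, cost in COSTS.items():
        print(f"{name}: {relative_runtime(n0, growth, cost):,.2f}x slower")
```

With these numbers the N log N method slows down by about a third, while the N^2 method becomes 100 times slower and the N^3 method 10,000 times slower, which is why only near-linear methods keep up.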
41 Analysis and Databases
- Much statistical analysis deals with
- Creating uniform samples
- Data filtering
- Assembling relevant subsets
- Estimating completeness
- Censoring bad data
- Counting and building histograms
- Generating Monte-Carlo subsets
- Likelihood calculations
- Hypothesis testing
- Traditionally these are performed on files
- Most of these tasks are much better done inside a database
- Move Mohamed to the mountain, not the mountain to Mohamed.
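As one example of doing such a task inside the database, the sketch below censors bad rows and builds a histogram in a single SQL query, so only the bin counts leave the database. It uses Python's built-in sqlite3 module. The `measurement` table, its columns, and the function names are hypothetical, and the binning assumes non-negative values because the cast truncates toward zero.

```python
import sqlite3

def load(rows):
    """Create an in-memory table of (value, is_bad) measurements."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE measurement (value REAL NOT NULL, is_bad INTEGER NOT NULL)")
    db.executemany("INSERT INTO measurement VALUES (?, ?)", rows)
    return db

def histogram(db, bin_width):
    """Return [(bin_index, count)] for good rows, computed by the database."""
    cursor = db.execute(
        "SELECT CAST(value / ? AS INTEGER) AS bin, COUNT(*) "
        "FROM measurement WHERE is_bad = 0 "   # censoring bad data
        "GROUP BY bin ORDER BY bin",           # counting per bin
        (bin_width,),
    )
    return cursor.fetchall()

if __name__ == "__main__":
    db = load([(0.2, 0), (0.7, 0), (1.1, 0), (1.9, 0), (2.5, 0), (9.9, 1)])
    for bin_index, count in histogram(db, 1.0):
        print(f"[{bin_index}, {bin_index + 1}): {count}")
```

The client receives one row per bin regardless of how many measurements are stored, which is the sense in which the analysis moves to the data.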
42 DataGrid Computing
- Store exabytes twice (for redundancy)
- Access them from anywhere
- Implies huge archive/data centers
- Supercomputer centers become super data centers
- Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, ...
43 Outline
- Historical trends imply that in 20 years
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, prioritizing information is a key technology.
[Scale graphic: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]