Surviving The Information Avalanche - PowerPoint PPT Presentation


Title: Surviving The Information Avalanche


1
Surviving The Information Avalanche
  • Jim Gray
  • Microsoft Research
  • Talk at Adobe Developers Conference
  • 26 April 2004
  • http://research.microsoft.com/gray/talks

2
Outline
Yotta Zetta Exa Peta Tera Giga Mega Kilo
  • Historical trends imply that in 20 years:
  • we can store everything in cyberspace. The
    personal petabyte.
  • computers will have natural interfaces: speech
    recognition/synthesis, vision, object recognition
    beyond OCR
  • Implications
  • The information avalanche will only get worse.
  • The user interface will change: less typing,
    more writing, talking, gesturing, more seeing
    and hearing
  • Organizing, summarizing, prioritizing information
    is a key technology.

We are here
3
Things Have Changed
1956
  • IBM 305 RAMAC
  • 10 MB disk
  • $1M (in year-2004 dollars)

4
The Next 50 years will see MORE CHANGE
ops/s/$ Had Three Growth Curves 1890-1990
Combination of Hans Moravec, Larry Roberts, and
Gordon Bell: WordSize*ops/s/sysprice
  • 1890-1945
  • Mechanical
  • Relay
  • 7-year doubling
  • 1945-1985
  • Tube, transistor, ...
  • 2.3 year doubling
  • 1985-2004
  • Microprocessor
  • 1.0 year doubling
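
As a rough check on the three growth eras above, a doubling time T compounds to a factor of 2^(years/T). The sketch below is illustrative only: the function name `growth_factor` and the era labels are my own, and the doubling times are the ones quoted on the slide.

```python
def growth_factor(doubling_time_years: float, span_years: float) -> float:
    """Improvement factor after span_years at the given doubling time."""
    return 2.0 ** (span_years / doubling_time_years)


# Doubling times taken from the slide; the labels are for display only.
eras = {
    "1890-1945 (mechanical, relay)": 7.0,
    "1945-1985 (tube, transistor)": 2.3,
    "1985-2004 (microprocessor)": 1.0,
}

for label, doubling in eras.items():
    print(f"{label}: {growth_factor(doubling, 10):,.1f}x per decade")
```

Per decade this works out to roughly 2.7x, 20x, and 1,000x for the three eras, which is why the curve looks like three distinct slopes.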

5
Constant Cost or Constant Function?
  • 100x improvement per decade
  • Same function 100x cheaper
  • 100x more function for same price

(Figure labels: Mainframe, SMP, Constellation, Cluster; Constant Price; Mini, SMP, Constellation; Workstation, Graphics/storage; Lower Price, New Category; PDA, Camera/browser)
6
Growth Comes From NEW Apps
  • The $10M computer of 1980 costs $1k today
  • If we were still doing the same things, IT would
    be a $0 B/y industry
  • NEW things absorb the new capacity

7
The Surprise-Free Future in 20 years.
  • 10,000x more power for same price
  • Personal supercomputer
  • Personal petabyte stores
  • Same function for 10,000x less cost.
  • Smart dust: the penny PC?
  • The 10 peta-op computer (for $1,000).

8
10,000x would change things
  • Human computer interface
  • Decent computer vision
  • Decent computer speech recognition
  • Decent computer speech synthesis
  • Vast information stores
  • Ability to search and abstract the stores.

9
How Good is HCI Today?
  • Surprisingly good.
  • Demo of making faces
  • http://research.microsoft.com/research/pubs/view.aspx?pubid=290
  • Demo of speech synthesis
  • Daisy, Hal
  • Synthetic voice
  • Speech recognition is improving fast.
  • Vision getting better
  • Pen computing finally a reality.
  • Displays improving fast (compared to last 30
    years)

10
Outline
Yotta Zetta Exa Peta Tera Giga Mega Kilo
  • Historical trends imply that in 20 years:
  • we can store everything in cyberspace. The
    personal petabyte.
  • computers will have natural interfaces: speech
    recognition/synthesis, vision, object recognition
    beyond OCR
  • Implications
  • The information avalanche will only get worse.
  • The user interface will change: less typing,
    more writing, talking, gesturing, more seeing
    and hearing
  • Organizing, summarizing, prioritizing information
    is a key technology.

We are here
11
How much information is there?
Yotta Zetta Exa Peta Tera Giga Mega Kilo
  • Almost everything is recorded digitally.
  • Most bytes are never seen by humans.
  • Data summarization, trend detection, anomaly
    detection are key technologies
  • See Mike Lesk: How much information is there?
    http://www.lesk.com/mlesk/ksg97/ksg.html
  • See Lyman & Varian:
  • How much information?
  • http://www.sims.berkeley.edu/research/projects/how-much-info/

(Figure labels along the Kilo-to-Yotta scale: A Book, A Photo, Movie, All books (words), All Books MultiMedia, Everything! Recorded)
12
And 90% in Cyberspace Because
  • Low rent: min $/byte
  • Shrinks time: now or later
  • Shrinks space: here or there
  • Automate processing: knowbots
  • Point-to-Point OR Broadcast
  • Immediate OR Time Delayed
  • Locate, Process, Analyze, Summarize
13
MyLifeBits: The guinea pig
  • Gordon Bell is digitizing his life
  • Has now scanned virtually all:
  • Books written (and read when possible)
  • Personal documents (correspondence, memos,
    email, bills, legal, ...)
  • Photos
  • Posters, paintings, photos of things (artifacts,
    medals, plaques)
  • Home movies and videos
  • CD collection
  • And, of course, all PC files
  • Recording: phone, radio, TV, web pages,
    conversations
  • Paperless throughout 2002. 12 scanned, 12
    discarded.
  • Only 30 GB, excluding videos
  • Video is 2 TB and growing fast

14
Capture and encoding
15
I mean everything
16
25K-day life: Personal Petabyte
1PB
Will anyone look at web pages in 2020?
Probably new modalities and media will dominate
then.
17
Challenges
  • Capture: Get the bits in
  • Organize: Index them
  • Manage: No worries about loss or space
  • Curate/Annotate: automate where possible
  • Privacy: Keep safe from theft.
  • Summarize: Give thumbnail summaries
  • Interface: how to ask/anticipate questions
  • Present: show it in understandable ways.

18
Memex: As We May Think, Vannevar Bush, 1945
  • A memex is a device in which an individual
    stores all his books, records, and
    communications, and which is mechanized so that
    it may be consulted with exceeding speed and
    flexibility
  • yet if the user inserted 5000 pages of material
    a day it would take him hundreds of years to fill
    the repository, so that he can be profligate and
    enter material freely

19
Too much storage? Try to fill a terabyte in a year
Petabyte volume has to be some form of video.
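
To make the two storage targets concrete, the sketch below converts "a terabyte in a year" and "a petabyte over a 25K-day life" into sustained data rates. It assumes decimal units (1 TB = 10^12 bytes) and round-the-clock capture; the helper name `sustained_rate` is hypothetical.

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def sustained_rate(total_bytes: int, days: float) -> float:
    """Average bytes per second needed to accumulate total_bytes in `days`."""
    return total_bytes / (days * SECONDS_PER_DAY)


tb_per_year = sustained_rate(TB, 365)      # fill a terabyte in a year
pb_per_life = sustained_rate(PB, 25_000)   # a petabyte over a 25K-day life

print(f"1 TB/year ~ {tb_per_year / 1e3:.1f} KB/s")
print(f"1 PB/life ~ {pb_per_life / 1e3:.0f} KB/s, "
      f"{PB / 25_000 / 1e9:.0f} GB/day")
```

A continuous rate of several hundred KB/s for a lifetime is far beyond what text, email, or still photos produce, which is consistent with the slide's point that petabyte volumes have to be some form of video.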
20
How Will We Find Anything?
  • Need Queries, Indexing, Pivoting, Scalability,
    Backup, Replication, Online update, Set-oriented
    access
  • If you don't use a DBMS, you will implement one!
  • Simple logical structure
  • Blob and link is all that is inherent
  • Additional properties (facets as extra
    tables) and methods on those tables
    (encapsulation)
  • More than a file system
  • Unifies data and meta-data
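
One way to picture the "blob and link" structure with facets as extra tables is the minimal SQLite sketch below. It is not the MyLifeBits schema: the table and column names (`item`, `link`, `annotation`) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (            -- the blob: any captured object
        id      INTEGER PRIMARY KEY,
        kind    TEXT NOT NULL,     -- e.g. 'photo', 'story' (illustrative)
        content BLOB
    );
    CREATE TABLE link (            -- the link: a typed edge between items
        src  INTEGER NOT NULL REFERENCES item(id),
        dst  INTEGER NOT NULL REFERENCES item(id),
        kind TEXT NOT NULL
    );
    CREATE TABLE annotation (      -- a facet: extra properties in their own table
        item_id INTEGER NOT NULL REFERENCES item(id),
        key     TEXT NOT NULL,
        value   TEXT NOT NULL
    );
    CREATE INDEX annotation_by_key ON annotation(key, value);
""")

conn.execute("INSERT INTO item VALUES (1, 'photo', ?)", (b"\xff\xd8",))
conn.execute("INSERT INTO item VALUES (2, 'story', ?)", (b"farewell lunch",))
conn.execute("INSERT INTO link VALUES (2, 1, 'uses')")
conn.execute("INSERT INTO annotation VALUES (1, 'date', '2000-07-07')")

# Set-oriented access: every item a story uses, with its date annotation.
rows = conn.execute("""
    SELECT i.id, i.kind, a.value
    FROM link l
    JOIN item i ON i.id = l.dst
    JOIN annotation a ON a.item_id = i.id AND a.key = 'date'
    WHERE l.src = 2 AND l.kind = 'uses'
""").fetchall()
print(rows)
```

Data (the blob) and meta-data (links and annotations) live in the same store and are queried together, which is the sense in which this is more than a file system.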

SQL DBMS
21
Photos
22
Searching: the most useful app?
  • Challenge: What questions for useful results?
  • Many ways to present answers

23
(No Transcript)
24
Detail view
25
Resource explorer: Ancestor (collections),
annotations, descendant, and preview panes turned on
26
Synchronized timelines with histogram guide
27
Value of media depends on annotations
  • It's just bits until it is annotated

28
System annotations provide base level of value
  • Date: 7/7/2000

29
Tracking usage even better
  • Date: 7/7/2000. Opened 30 times, emailed to 10
    people (it's valued by the user!)

30
Getting the user to say a little something is a big
jump
  • Date: 7/7/2000. Opened 30 times, emailed to 10
    people. BARC dim sum intern farewell lunch

31
Getting the user to tell a story is the ultimate
in media value
  • A story is a layout in time and space
  • Most valuable content (by selection, and by being
    well annotated)
  • Stories must include links to any media they use
    (for future navigation/search transclusion).
  • Cf. MovieMaker, Creative Memories, PhotoAlbums

32
Value of media depends on annotations
It's just bits until it is annotated
  • Auto-annotate whenever possible, e.g. GPS cameras
  • Make manual annotation as easy as possible. XP
    photo capture, voice, photos with voice, etc.
  • Support gang annotation
  • Make stories easy

33
80% of data is personal / individual. But, what
about the other 20%?
  • Business
  • Wal-Mart online: 1 PB and growing.
  • Paradox: most transaction systems
  • Have to go to image/data monitoring for big data
  • Government
  • Government is the biggest business.
  • Science
  • LOTS of data.

34
Instruments: CERN LHC, Petabytes per Year
  • Looking for the Higgs Particle
  • Sensors: 1000 GB/s (1 TB/s, ~30 EB/y)
  • Events: 75 GB/s
  • Filtered: 5 GB/s
  • Reduced: 0.1 GB/s, ~2 PB/y
  • Data pyramid: 100 GB, 1 TB, 100 TB, 1 PB, 10 PB
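
The per-year figures on this slide follow from multiplying each stage's rate by the seconds in a year. The sketch below does that conversion under the assumption of continuous running and decimal units; `bytes_per_year` and its `duty_cycle` parameter are hypothetical names, not anything from the talk.

```python
SECONDS_PER_YEAR = 365 * 86_400  # 31,536,000 s, assuming continuous running


def bytes_per_year(gb_per_second: float, duty_cycle: float = 1.0) -> float:
    """Annual volume in bytes for a stream of gb_per_second (decimal GB)."""
    return gb_per_second * 1e9 * SECONDS_PER_YEAR * duty_cycle


# Stage rates in GB/s, as quoted on the slide.
stages = {"Sensors": 1000, "Events": 75, "Filtered": 5, "Reduced": 0.1}

for name, rate in stages.items():
    print(f"{name:8s} {rate:7.1f} GB/s -> {bytes_per_year(rate) / 1e15:10.1f} PB/y")
```

Continuous running gives about 31.5 EB/y at the sensors and about 3.2 PB/y after reduction. The slide's 2 PB/y would correspond to the detector taking data roughly two-thirds of the time; that duty cycle is an inference here, not something the source states.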

35
Information Avalanche
  • Both
  • better observational instruments and
  • Better simulations
  • are producing a data avalanche
  • Examples
  • Turbulence: 100 TB simulation, then mine the
    information
  • BaBar: grows 1 TB/day; 2/3 simulation information,
    1/3 observational information
  • CERN: LHC will generate 1 GB/s, 10 PB/y
  • VLBA (NRAO) generates 1 GB/s today
  • NCBI: only ½ TB but doubling each year, very
    rich dataset.
  • Pixar: 100 TB/movie

Image courtesy of C. Meneveau and A. Szalay @ JHU
36
Q: Where will the Data Come From? A: Sensor
Applications
  • Earth Observation
  • 15 PB by 2007
  • Medical Images Information Health Monitoring
  • Potential 1 GB/patient/y → 1 EB/y
  • Video Monitoring
  • 1E8 video cameras @ 1E5 MBps → 10 TB/s → 100
    EB/y → filtered???
  • Airplane Engines
  • 1 GB sensor data/flight,
  • 100,000 engine hours/day
  • 30PB/y
  • Smart Dust ?? EB/y
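
The sensor estimates above are order-of-magnitude arithmetic, and the sketch below reruns them. Several readings are assumptions on my part: the per-camera rate is taken as 1E5 bytes/s because that is what reproduces the quoted 10 TB/s total, and the airplane figure is treated as 1 GB per engine hour. All variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 86_400

# Video: 1E8 cameras. Per-camera rate read as 1E5 bytes/s (an assumption
# chosen because it reproduces the slide's 10 TB/s aggregate).
cameras = 1e8
bytes_per_camera_per_s = 1e5
video_rate = cameras * bytes_per_camera_per_s       # bytes/s
video_per_year = video_rate * SECONDS_PER_YEAR      # bytes/y

# Airplane engines: assume 1 GB of sensor data per engine hour.
engine_hours_per_day = 100_000
engine_per_year = engine_hours_per_day * 365 * 1e9  # bytes/y

# Medical: at 1 GB/patient/y, 1 EB/y implies this many patients.
patients_for_1_eb = 1e18 / 1e9

print(f"video:    {video_rate / 1e12:.0f} TB/s, {video_per_year / 1e18:.0f} EB/y")
print(f"engines:  {engine_per_year / 1e15:.1f} PB/y")
print(f"patients: {patients_for_1_eb:.0e}")
```

Under these assumptions the results land within a small factor of the slide's round numbers (hundreds of EB/y for video, tens of PB/y for engines, about a billion patients), which is all a back-of-the-envelope estimate is meant to show.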

http://robotics.eecs.berkeley.edu/pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
37
The Big Picture
(Figure: Experiments and Instruments, Other Archives, Literature, and Simulations each supply facts; questions go in, answers come out.)
The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it
  • How to coexist with others
  • Query and Vis tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch query scheduling

38
FTP - GREP
  • Download (FTP and GREP) are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1 PB ~ 3,000 disks
  • At some point we need indices to limit
    search, and parallel data search and analysis
  • This is where databases can help
  • Next generation technique: Data Exploration
  • Bring the analysis to the data!
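
The GREP times above can be turned into the sequential scan rate each one implies. The sketch below does that, assuming decimal units and a 365-day year; `implied_mb_per_s` is a hypothetical helper, and the size and time pairs are the ones quoted on the slide.

```python
DAY = 86_400
YEAR = 365 * DAY

# (bytes scanned, seconds taken) as quoted on the slide
quoted = {
    "1 MB": (1e6, 1),
    "1 GB": (1e9, 60),
    "1 TB": (1e12, 2 * DAY),
    "1 PB": (1e15, 3 * YEAR),
}


def implied_mb_per_s(total_bytes: float, seconds: float) -> float:
    """Sequential scan rate, in decimal MB/s, implied by a size and a time."""
    return total_bytes / seconds / 1e6


for size, (nbytes, secs) in quoted.items():
    print(f"{size}: {implied_mb_per_s(nbytes, secs):.1f} MB/s")

# 1 PB spread over 3,000 disks implies this capacity per disk, in GB.
disk_size_gb = 1e15 / 3_000 / 1e9
print(f"{disk_size_gb:.0f} GB per disk")
```

Every row implies a single stream of roughly 1 to 20 MB/s, so the scan time grows linearly with the data. That is the argument for indices and parallel search rather than download-and-scan.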

39
The Speed Problem
  • Many users want to search the whole DB: ad hoc
    queries, often combinatorial
  • Want 1 minute response
  • Brute force (parallel search)
  • 1 disk: 50 MBps; 1M disks/PB; $300M/PB
  • Indices (limit search, do column store)
  • 1,000x less equipment: $1M/PB
  • Pre-compute answer
  • No one knows how to do it for all questions.
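
For the brute-force option, the number of disks needed is the total data divided by what one disk can stream within the response-time target. The sketch below assumes perfectly parallel reads at 50 MB/s per disk with no overhead; `disks_for_scan` is a hypothetical name.

```python
import math


def disks_for_scan(total_bytes: float, disk_bytes_per_s: float,
                   target_seconds: float) -> int:
    """Disks that must stream in parallel to read total_bytes in time."""
    return math.ceil(total_bytes / (disk_bytes_per_s * target_seconds))


full_scan = disks_for_scan(1e15, 50e6, 60)  # scan 1 PB in one minute
indexed = disks_for_scan(1e12, 50e6, 60)    # index narrows search 1,000x

print(f"full scan: {full_scan:,} disks")
print(f"indexed:   {indexed:,} disks")
```

The idealized count comes out at a few hundred thousand disks, the same order of magnitude as the slide's round figure of 1M disks per petabyte, and an index that cuts the data touched by 1,000x cuts the hardware by the same factor.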

40
Next-Generation Data Analysis
  • Looking for
  • Needles in haystacks: the Higgs particle
  • Haystacks: Dark matter, Dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
  • Correlation functions are N^2, likelihood
    techniques N^3
  • As data and computers grow at same rate, we can
    only keep up with N log N
  • A way out?
  • Relax notion of optimal (data is fuzzy, answers
    are approximate)
  • Don't assume infinite computational resources or
    memory
  • Combination of statistics and computer science
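
The scaling claim can be checked with a small calculation: if the data size N and the compute speed both grow by the same factor, an algorithm's running time changes by cost(growth x N) / (growth x cost(N)). The sketch below evaluates that ratio for the three cost models named on the slide; the function name and the sample values of N and growth are arbitrary choices for illustration.

```python
import math


def runtime_ratio(cost, n: float, growth: float) -> float:
    """Runtime after data and compute both grow by `growth`, relative to before."""
    return cost(n * growth) / (growth * cost(n))


costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n, growth = 1e9, 100.0
for name, cost in costs.items():
    print(f"{name}: {runtime_ratio(cost, n, growth):,.2f}x slower")
```

With 100x more data and 100x faster machines, an N log N method takes only slightly longer, while N^2 takes 100x longer and N^3 takes 10,000x longer.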

41
Analysis and Databases
  • Much statistical analysis deals with
  • Creating uniform samples
  • data filtering
  • Assembling relevant subsets
  • Estimating completeness
  • censoring bad data
  • Counting and building histograms
  • Generating Monte-Carlo subsets
  • Likelihood calculations
  • Hypothesis testing
  • Traditionally these are performed on files
  • Most of these tasks are much better done inside a
    database
  • Move Mohamed to the mountain, not the mountain to
    Mohamed.
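
As an illustration of doing the filtering and histogram steps inside a database instead of on files, here is a minimal sketch using SQLite from Python. The table `obs`, its columns `mag` and `flag`, and the synthetic data are all assumptions invented for the example.

```python
import random
import sqlite3

random.seed(0)  # reproducible synthetic data, not real measurements
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
conn.executemany(
    "INSERT INTO obs (mag, flag) VALUES (?, ?)",
    [(random.uniform(10.0, 20.0), int(random.random() < 0.1))
     for _ in range(10_000)],
)

# Censor bad rows and build a histogram with 1-unit bins, all inside the database.
histogram = conn.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE flag = 0              -- drop rows flagged as bad
    GROUP BY bin
    ORDER BY bin
""").fetchall()

for bin_start, count in histogram:
    print(f"[{bin_start}, {bin_start + 1}): {count}")
```

Only the handful of histogram rows leave the database; the 10,000 raw rows never have to be copied to the client, which is the point of moving the analysis to the data.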

42
DataGrid Computing
  • Store exabytes twice (for redundancy)
  • Access them from anywhere
  • Implies huge archive/data centers
  • Supercomputer centers become super data centers
  • Examples: Google, Yahoo!, Hotmail, BaBar, CERN,
    Fermilab, SDSC, ...

43
Outline
Yotta Zetta Exa Peta Tera Giga Mega Kilo
  • Historical trends imply that in 20 years:
  • we can store everything in cyberspace. The
    personal petabyte.
  • computers will have natural interfaces: speech
    recognition/synthesis, vision, object recognition
    beyond OCR
  • Implications
  • The information avalanche will only get worse.
  • The user interface will change: less typing,
    more writing, talking, gesturing, more seeing
    and hearing
  • Organizing, summarizing, prioritizing information
    is a key technology.

We are here