The Structure of Computer Scientific Revolutions - PowerPoint PPT Presentation

About This Presentation
Title:

The Structure of Computer Scientific Revolutions

Description:

Dow Jones Enterprise Ventures. May 2006. Michael Franklin. UC Berkeley. Amalgamated Insight ... Dow Jones EV Summit May 2006. Whither Structured Data? ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: The Structure of Computer Scientific Revolutions


1
The Structure of (Computer) Scientific Revolutions
Michael Franklin, UC Berkeley and Amalgamated
Insight
  • Dow Jones Enterprise Ventures
  • May 2006

2
Data Management Then
Structured Data Processing
3
Data Management Now
4
The Structure Spectrum
  • Structured data (schema-first)
  • regular, known, conforming
  • e.g., relational database
  • Unstructured data (schema-never)
  • freeform, irregular
  • e.g., plain text, images, audio
  • Semi-structured data (schema-later)
  • provides structural information, but is less
    constrained
  • e.g., XML, tagged text/media

5
Whither Structured Data?
  • Conventional wisdom: 20% of data is structured
    currently.
  • Consumer apps, enterprise search, media apps are
    placing downward pressure on this.

6
A Contrarian View?
  • Two reasons why structured data is where the
    action will be:
  • The Data Industrial Revolution: data
    used to be hand-crafted, now it's
    generated by computers!!!
  • The Data Integration quagmire: structure provides
    crucial cues for making data usable.

7
The New Landscape
  • Bell's Law: Every decade, a new, lower-cost
    class of computers emerges, defined by platform,
    interface, and interconnect
  • Mainframes: 1960s
  • Minicomputers: 1970s
  • Microcomputers/PCs: 1980s
  • Web-based computing: 1990s
  • Devices (cell phones, PDAs, wireless sensors,
    RFID): 2000s

Enabling a new generation of applications
for Operational Visibility, monitoring, and
alerting.
8
Data Streams → Data Flood
PoS System
Barcodes
Phones
Sensors
RFID
  • Exponential data growth
  • New challenges: continuous, inter-connected,
    distributed, physical
  • Shrinking business cycles
  • More complex decisions

Inventory
Transactional Systems
Telematics
Clickstream
9
State of the Art
  • Custom-coded implementations that are expensive
    and often unsuccessful.
  • Can we develop the right infrastructure to
    support large-scale data streaming apps?

10
High Fan In Systems
  • A data management infrastructure for large-scale
    data streaming environments.
  • Uniform Declarative Framework
  • Every node is a data stream processor that speaks
    SQL-ese
  • → stream-oriented queries at all levels
  • Hierarchical, stream-based views as an organizing
    principle.
  • Can impose a view over messy devices.

11
HiFi - Taming the Data Flood
Hierarchical Aggregation: Spatial, Temporal
Headquarters
Regional Centers
In-network Stream Query Processing and Storage
Warehouses, Stores
Fast Data Path vs. Slow Data Path
Dock doors, Shelves
Receptors
12
Device Issues example
Shelf RFID Test - Ground Truth
13
Actual RFID Readings
Restock every time inventory goes below 5
14
Query-based Data Cleaning
Smooth
CREATE VIEW smoothed_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM cleaned_rfid_stream
        range by 5 sec, slide by 5 sec
   GROUP BY receptor_id, tag_id
   HAVING count(*) > count_T)
Point
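
The windowed count filter in the Smooth view can be sketched in ordinary Python. This is only an illustration under stated assumptions, not the HiFi implementation: the function name `smooth`, the `(timestamp, receptor_id, tag_id)` tuple layout, and the sample readings are all hypothetical, and a single fixed window stands in for the stream engine's sliding window.

```python
from collections import Counter

def smooth(readings, window_start, window_sec=5.0, count_threshold=1):
    """Keep (receptor_id, tag_id) pairs read more than count_threshold
    times inside one window [window_start, window_start + window_sec).

    readings: iterable of (timestamp_sec, receptor_id, tag_id) tuples
    (a hypothetical layout chosen for this sketch).
    """
    window_end = window_start + window_sec
    counts = Counter(
        (receptor_id, tag_id)
        for ts, receptor_id, tag_id in readings
        if window_start <= ts < window_end
    )
    # Mirrors HAVING count(*) > count_T
    return sorted(pair for pair, n in counts.items() if n > count_threshold)

# Invented sample data: tagA is read steadily, tagB only once per window.
readings = [
    (0.5, "shelf0", "tagA"),
    (1.5, "shelf0", "tagA"),
    (2.5, "shelf0", "tagA"),
    (3.0, "shelf0", "tagB"),
    (6.0, "shelf0", "tagB"),
]
smoothed = smooth(readings, window_start=0.0)
```

With a range and slide of 5 seconds each, consecutive windows do not overlap, so calling `smooth` once per 5-second interval approximates the view's behaviour on this toy input.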
15
Query-based Data Cleaning
Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM smoothed_rfid_stream rs
        range by 5 sec, slide by 5 sec
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= ALL
     (SELECT count(*)
      FROM smoothed_rfid_stream
           range by 5 sec, slide by 5 sec
      WHERE tag_id = rs.tag_id
      GROUP BY receptor_id))
Smooth
Point
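
The Arbitrate step assigns each tag to the receptor that saw it most often within a window. The sketch below is a hedged illustration, not the HiFi code: `arbitrate`, the `(receptor_id, tag_id)` pair layout, and the sample window are hypothetical. It reads the comparison as "at least as many readings as every receptor for that tag", since a strict comparison against a set that includes the receptor's own count could never hold; tied receptors are therefore all kept.

```python
from collections import Counter, defaultdict

def arbitrate(window_readings):
    """For each tag, keep the receptor(s) with the highest reading count.

    window_readings: iterable of (receptor_id, tag_id) pairs that fall in
    one window (a hypothetical layout chosen for this sketch).
    """
    counts = Counter(window_readings)

    # Regroup counts as tag_id -> {receptor_id: count}
    by_tag = defaultdict(dict)
    for (receptor_id, tag_id), n in counts.items():
        by_tag[tag_id][receptor_id] = n

    winners = []
    for tag_id, per_receptor in by_tag.items():
        best = max(per_receptor.values())
        for receptor_id, n in per_receptor.items():
            if n == best:  # at least as many readings as every other receptor
                winners.append((receptor_id, tag_id))
    return sorted(winners)

# Invented sample window: tagA is seen by two adjacent shelves.
window = (
    [("shelf0", "tagA")] * 4
    + [("shelf1", "tagA")] * 1
    + [("shelf1", "tagB")] * 3
)
arbitrated = arbitrate(window)
```

On this toy input, tagA is attributed to shelf0 and tagB to shelf1; a real deployment would also need a policy for breaking ties.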
16
After Query-based Cleaning
Restock every time inventory goes below 5
17
Once you have the right abstractions
  • Soft Sensors
  • Quality and lineage
  • Optimization (power, etc.)
  • Pushdown of external validation information
  • Data archiving
  • Model-based sensing
  • Imperative processing

18
Data Integration
  • Integration is the ultimate schema-first problem.
  • Structure is both a key enabler and a key
    impediment here.

19
Search vs. Query
  • What if you wanted to find out which actors
    donated to John Kerry's presidential campaign?

20
Search vs. Query
21
Search vs. Query
  • What if you wanted to find out which actors
    donated to John Kerry's presidential campaign?

22
Search vs. Query
  • Search can return only what's been previously
    stored.

23
Also
  • What if you wanted to find out the average
    donation of actors to each candidate?
  • What if you wanted to compare actor donations
    this campaign to the last one?
  • What if you wanted to find out who gave the most
    to each candidate?
  • What if you wanted to know where the information
    came from, and how old it was?

24
A Deep-Web Query Approach
SELECT y.name, f.occupation
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
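
To make the difference between search and query concrete, the join above and one of the follow-up questions (average donation per candidate) can be sketched over in-memory rows. Everything here is an assumption for illustration: the row layouts, the `candidate` and `amount` fields, the function names, and all sample values are invented and do not reflect real Yahoo or FEC data.

```python
from collections import defaultdict

# Invented stand-ins for the two sources named in the query.
yahoo_actors = [{"name": "Actor A"}, {"name": "Actor B"}]
fec_info = [
    {"name": "Actor A", "occupation": "Actor", "candidate": "Candidate X", "amount": 1000.0},
    {"name": "Actor A", "occupation": "Actor", "candidate": "Candidate X", "amount": 500.0},
    {"name": "Person C", "occupation": "Engineer", "candidate": "Candidate X", "amount": 250.0},
    {"name": "Actor B", "occupation": "Actor", "candidate": "Candidate Y", "amount": 200.0},
]

def join_on_name(actors, donations):
    """Equi-join on name, assuming names are unique in the actor list."""
    actor_names = {a["name"] for a in actors}
    return [d for d in donations if d["name"] in actor_names]

def average_by_candidate(rows):
    """Average donation amount per candidate over the joined rows."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["candidate"]].append(row["amount"])
    return {cand: sum(vals) / len(vals) for cand, vals in amounts.items()}

joined = join_on_name(yahoo_actors, fec_info)
averages = average_by_candidate(joined)
```

Exact string equality on names is a strong assumption; differently formatted names for the same person would silently drop out of the join.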
25
Yahoo Actors JOIN FECInfo
Q: Did it Work?
26
The Fundamental Tradeoff
Structure enables computers to help users
manipulate and maintain the data.
Semi-Structured (schema-later)
Structured (schema-first)
Unstructured (schema-less)
27
Dataspaces
  • Deal with all the data from an enterprise in
    whatever form
  • Data co-existence
  • no integrated schema, no single warehouse
  • Pay-as-you-go services
  • Keyword search is bare minimum.
  • Data manipulation and increased consistency as
    you add work.

"From Databases to Dataspaces: A New
Abstraction for Information Management," Michael
Franklin, Alon Halevy, David Maier, SIGMOD
Record, December 2005.
28
Dataspaces vs. Databases
Databases
  • Single Schema
  • Centralized Administration
  • Structured Query
  • Strict Integrity Constraints
Dataspaces
  • Data Coexistence
  • Autonomous Sources
  • Search, Browse, Approximate Answer
  • Best Effort Guarantees

29
The World of Dataspaces
[Figure: Web Search, Virtual Organization, Federated DBMS, Desktop
Search, and DBMS plotted on two axes: Administrative Proximity
(Far to Near) and Semantic Integration (High to Low)]
30
Conclusions
  • Structured data not going away.
  • In fact, there will be lots more of it.
  • and it must be processed as fast as it is
    created.
  • Structure is crucial for successful data
    integration and manipulation.
  • Much effort will be expended to add structural
    information to text and media.
  • Traditional (structured) database technology is
    not up to the task.
  • Great opportunities for innovation.
  • HiFi and Dataspaces are examples.