Title: The Structure of Computer Scientific Revolutions
1. The Structure of (Computer) Scientific Revolutions
Michael Franklin, UC Berkeley and Amalgamated Insight
- Dow Jones Enterprise Ventures
- May 2006
2. Data Management Then
Structured Data Processing
3. Data Management Now
4. The Structure Spectrum
- Structured data (schema-first)
  - regular, known, conforming
  - e.g., relational database
- Unstructured data (schema-never)
  - freeform, irregular
  - e.g., plain text, images, audio
- Semi-structured data (schema-later)
  - provides structural information, but less constrained
  - e.g., XML, tagged text/media
5. Whither Structured Data?
- Conventional wisdom: 20% of data is structured currently.
- Consumer apps, enterprise search, and media apps are placing downward pressure on this.
6. A Contrarian View?
- Two reasons why structured data is where the action will be:
  - The Data Industrial Revolution: data used to be hand-crafted, now it's generated by computers!
  - The data integration quagmire: structure provides crucial cues for making data usable.
7. The New Landscape
- Bell's Law: every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
  - Mainframes: 1960s
  - Minicomputers: 1970s
  - Microcomputers/PCs: 1980s
  - Web-based computing: 1990s
  - Devices (cell phones, PDAs, wireless sensors, RFID): 2000s
Enabling a new generation of applications for operational visibility, monitoring, and alerting.
8. Data Streams → Data Flood
[Figure: data sources: PoS systems, barcodes, phones, sensors, RFID, inventory, transactional systems, telematics, clickstream]
- Exponential data growth
- New challenges: continuous, inter-connected, distributed, physical
- Shrinking business cycles
- More complex decisions
9. State of the Art
- Custom-coded implementations that are expensive and often unsuccessful.
- Can we develop the right infrastructure to support large-scale data streaming apps?
10. High Fan-In Systems
- A data management infrastructure for large-scale data streaming environments.
- Uniform declarative framework
  - Every node is a data stream processor that speaks "SQL-ese"
  - → stream-oriented queries at all levels
- Hierarchical, stream-based views as an organizing principle.
  - Can impose a view over messy devices.
11. HiFi - Taming the Data Flood
[Figure: hierarchy levels: Headquarters; Regional Centers; Warehouses, Stores; Dock doors, Shelves; Receptors]
- Hierarchical aggregation: spatial, temporal
- In-network stream query processing and storage
- Fast data path vs. slow data path
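One way to read the hierarchy on this slide is as repeated roll-ups: each level first counts readings per time window (temporal aggregation), then hands those counts to its parent (spatial aggregation). The sketch below illustrates that reading only. The node names, the `PARENT` map, the 5-second window, and the `(timestamp, node)` reading format are all assumptions made for the example, not details of HiFi.

```python
from collections import Counter

# Hypothetical topology: receptors report to shelves, shelves to a store.
PARENT = {
    "receptor-1": "shelf-A",
    "receptor-2": "shelf-A",
    "receptor-3": "shelf-B",
    "shelf-A": "store-1",
    "shelf-B": "store-1",
}


def temporal_rollup(readings, window_sec=5):
    """Count readings per (node, window index) from (timestamp_sec, node) pairs."""
    counts = Counter()
    for ts, node in readings:
        counts[(node, int(ts // window_sec))] += 1
    return counts


def spatial_rollup(counts):
    """Move per-node counts one level up the hierarchy, keeping the window index."""
    rolled = Counter()
    for (node, window), n in counts.items():
        rolled[(PARENT[node], window)] += n
    return rolled


readings = [
    (0.5, "receptor-1"),
    (1.2, "receptor-2"),
    (3.9, "receptor-3"),
    (6.0, "receptor-1"),
]
per_receptor = temporal_rollup(readings)
per_shelf = spatial_rollup(per_receptor)
per_store = spatial_rollup(per_shelf)
print(dict(per_store))
```

Each level only sees the counts from the level below, so the volume shipped upward shrinks as data moves toward headquarters.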
12. Device Issues: Example
Shelf RFID Test - Ground Truth
13. Actual RFID Readings
Restock every time inventory goes below 5
14. Query-based Data Cleaning
Smooth
CREATE VIEW smoothed_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) > count_T)
[Figure: cleaning pipeline stages: Point, Smooth]
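The Smooth view groups readings by receptor and tag within each window and keeps only pairs read more often than the threshold count_T. Because range and slide are both 5 seconds, the windows do not overlap. A minimal Python sketch of the same logic follows. The function name `smooth`, the tuple layout of a reading, and the sample data are assumptions for illustration, and this is not the stream engine's own implementation.

```python
from collections import Counter, defaultdict


def smooth(readings, window_sec=5, count_t=1):
    """Return {window_index: set of (receptor_id, tag_id)} passing the threshold.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    A pair is kept for a window only if it was read more than count_t times
    inside that window, so sporadic stray reads are dropped.
    """
    counts = defaultdict(Counter)
    for ts, receptor_id, tag_id in readings:
        counts[int(ts // window_sec)][(receptor_id, tag_id)] += 1
    return {
        window: {pair for pair, n in per_pair.items() if n > count_t}
        for window, per_pair in counts.items()
    }


# Hypothetical readings: tag t2 is seen only once, so it is filtered out.
sample = [
    (0.1, "r1", "t1"),
    (1.0, "r1", "t1"),
    (2.5, "r1", "t2"),
    (6.0, "r1", "t1"),
    (7.0, "r1", "t1"),
    (8.0, "r1", "t1"),
]
smoothed = smooth(sample, window_sec=5, count_t=1)
print(smoothed)
```

Raising `count_t` makes the filter stricter: it removes more spurious reads but also risks dropping tags that were genuinely present and only weakly detected.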
15. Query-based Data Cleaning
Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= ALL
     (SELECT count(*)
      FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
      WHERE tag_id = rs.tag_id
      GROUP BY receptor_id))
[Figure: cleaning pipeline stages: Point, Smooth, Arbitrate]
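The Arbitrate view resolves the case where several receptors report the same tag in one window: the tag is attributed to the receptor that read it most often. The sketch below shows that rule in Python under stated assumptions. The function name `arbitrate`, the tuple layout, the sample data, and the choice to keep ties are all illustrative, not taken from the slides.

```python
from collections import Counter, defaultdict


def arbitrate(readings, window_sec=5):
    """Assign each tag, per window, to the receptor(s) that read it most often.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Ties are kept (an assumption of this sketch), so a tag read equally often
    by two receptors is reported for both.
    """
    counts = defaultdict(Counter)  # (window, tag_id) -> Counter of receptor_id
    for ts, receptor_id, tag_id in readings:
        counts[(int(ts // window_sec), tag_id)][receptor_id] += 1

    result = defaultdict(set)
    for (window, tag_id), per_receptor in counts.items():
        best = max(per_receptor.values())
        for receptor_id, n in per_receptor.items():
            if n == best:
                result[window].add((receptor_id, tag_id))
    return dict(result)


# Hypothetical readings: both receptors see t1, but r1 sees it more often.
sample = [
    (0.1, "r1", "t1"),
    (0.2, "r1", "t1"),
    (0.3, "r2", "t1"),
    (1.0, "r2", "t2"),
]
arbitrated = arbitrate(sample)
print(arbitrated)
```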
16. After Query-based Cleaning
Restock every time inventory goes below 5
17. Once you have the right abstractions
- Soft Sensors
- Quality and lineage
- Optimization (power, etc.)
- Pushdown of external validation information
- Data archiving
- Model-based sensing
- Imperative processing
18. Data Integration
- Integration is the ultimate schema-first problem.
- Structure is both a key enabler and a key impediment here.
19. Search vs. Query
- What if you wanted to find out which actors donated to John Kerry's presidential campaign?
22. Search vs. Query
- Search can return only what's been previously stored.
23. Also
- What if you wanted to find out the average donation of actors to each candidate?
- What if you wanted to compare actor donations this campaign to the last one?
- What if you wanted to find out who gave the most to each candidate?
- What if you wanted to know where the information came from, and how old it was?
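Questions like these are aggregations over structured records rather than lookups of stored pages, which is why they suit a query language. The sketch below uses Python's built-in `sqlite3` module to answer two of them. The `donations` table, its columns, and every row in it are invented placeholders, not real campaign data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, for illustration only.
conn.execute("CREATE TABLE donations (actor TEXT, candidate TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?)",
    [
        ("Actor A", "Candidate X", 1000.0),
        ("Actor B", "Candidate X", 500.0),
        ("Actor C", "Candidate Y", 2000.0),
        ("Actor A", "Candidate Y", 250.0),
    ],
)

# Average donation of actors to each candidate.
avg_by_candidate = conn.execute(
    "SELECT candidate, AVG(amount) FROM donations "
    "GROUP BY candidate ORDER BY candidate"
).fetchall()

# Who gave the most to each candidate (correlated subquery on the maximum).
top_donor = conn.execute(
    """
    SELECT d.candidate, d.actor, d.amount
    FROM donations d
    WHERE d.amount = (SELECT MAX(amount) FROM donations
                      WHERE candidate = d.candidate)
    ORDER BY d.candidate
    """
).fetchall()

print(avg_by_candidate)
print(top_donor)
```

Neither answer exists as a stored document anywhere. Both are computed on demand, and that is only possible because the records share a known structure.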
24. A Deep-Web Query Approach
SELECT y.name, f.occupation
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
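To make the join concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. The table names follow the query on this slide. The column layout, the sample names, and the differing name formats are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables standing in for the two deep-web sources.
conn.execute("CREATE TABLE Yahoo_Actors (name TEXT)")
conn.execute("CREATE TABLE FECInfo (name TEXT, occupation TEXT)")
conn.executemany(
    "INSERT INTO Yahoo_Actors VALUES (?)",
    [("Pat Example",), ("Sam Sample",)],
)
conn.executemany(
    "INSERT INTO FECInfo VALUES (?, ?)",
    [
        ("Pat Example", "Actor"),
        ("SAMPLE, SAM", "Actor"),  # same person, different name format
        ("Lee Other", "Engineer"),
    ],
)

# Exact-equality join on name, as in the query on this slide.
rows = conn.execute(
    "SELECT y.name, f.occupation "
    "FROM Yahoo_Actors y, FECInfo f "
    "WHERE y.name = f.name"
).fetchall()
print(rows)
```

With this toy data only one actor is matched. The second is missed because the two sources format the name differently, so an equality join across independently maintained sources will usually need some name normalization first.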
25. Yahoo Actors JOIN FECInfo
Q: Did it work?
26. The Fundamental Tradeoff
Structure enables computers to help users manipulate and maintain the data.
[Figure: the structure spectrum: Unstructured (schema-less), Semi-Structured (schema-later), Structured (schema-first)]
27. Dataspaces
- Deal with all the data from an enterprise, in whatever form
- Data co-existence
  - no integrated schema, no single warehouse
- Pay-as-you-go services
  - Keyword search is the bare minimum.
  - Data manipulation and increased consistency as you add work.
"From Databases to Dataspaces: A New Abstraction for Information Management", Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
28. Dataspaces vs. Databases
Databases:
- Single Schema
- Centralized Administration
- Structured Query
- Strict Integrity Constraints
Dataspaces:
- Data Coexistence
- Autonomous Sources
- Search, Browse, Approximate Answer
- Best Effort Guarantees
29. The World of Dataspaces
[Figure: systems plotted by Administrative Proximity (Far, Near) and Semantic Integration (High, Low): Web Search, Virtual Organization, Federated DBMS, Desktop Search, DBMS]
30. Conclusions
- Structured data is not going away.
  - In fact, there will be lots more of it,
  - and it must be processed as fast as it is created.
- Structure is crucial for successful data integration and manipulation.
- Much effort will be expended to add structural information to text and media.
- Traditional (structured) database technology is not up to the task.
- Great opportunities for innovation.
  - HiFi and Dataspaces are examples.