Title: The Next 700 Data Description Languages
1The Next 700 Data Description Languages
- Yitzhak Mandelbaum
- Princeton University
- Computer Science
Collaborators: Kathleen Fisher and David Walker
2The Next 700
3What Data Needs Describing?
- There's much data in databases and common formats like XML, but there's also much data that's ad hoc.
- Ad hoc data lacks readily available parsing, querying, analysis, or transformation tools.
- It's all over the place: financial, telecom, chemistry, physics, biology, etc.
4Ad Hoc Data in Biology
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
 <biological_process ; GO:0008150
  %behavior ; GO:0007610 ; synonym:behaviour
   %adult behavior ; GO:0030534 ; synonym:adult behaviour
    %adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631
    %adult locomotory behavior ; GO:0008344 ...
from www.geneontology.org
5Ad Hoc Data in Chemistry
OC(C@@H2OC(C)O)C@@3(C)C@(C@(CO4) (OC(C)
O)C@H4CC@@H3O)(H)C@H (OC(C7CCCCC7)O)C
@@1(O)C@@(C)(C)C2C(C) C@@H(OC(C@H(O)C@@H
(NC(C6CCCCC6)O) C5CCCCC5)O)C1
6Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
7Ad Hoc Data DNS packets
00000000 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co
8Data Description Languages
- Data description languages describe many ad hoc formats and provide the following features:
- Descriptions serve as documentation, including the semantics of the data.
- Compiler generates tools from the description: parser, printer, query engine, converter to XML, statistical profiler, etc.
- Parser includes robust error detection and recovery.
- Parsers can handle high data volume.
- > 1GB/second: Netflow traffic from Cisco routers.
9Many Data Description Languages
- Logical Descriptions
- ASN.1
- ASDL
- Physical Descriptions
- PacketTypes (SIGCOMM '00)
- DataScript (GPCE '02)
- PADS (PLDI '05): basis for current work
(Diagram labels: Physical, 001010101 101001001 111001010 001010100, Logical.)
10Contributions
- A core data description calculus (DDC)
- Based on dependent type theory
- Simple, orthogonal, composable types
- Types are transducers from external data source to internal data representation.
- Encodings of high-level DDLs in low-level DDC
- Explain semantics of PADS language in particular.
11Base Types and Sequences
- C(e): base type; can be parameterized by expression e.
- Σx:T.T': dependent product; describes a sequence of values.
- Variable x gives a name to the first value in the sequence.
- Examples
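The slide's own examples did not survive in this transcript. As a stand-in, here is a minimal Python sketch of a parameterized base type and a dependent product. The names `uint`, `fixed_string`, and `sigma`, and the choice to report errors as a list of messages, are assumptions made for illustration; they are not part of the DDC.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (data, offset) to (new offset, value, error messages).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def uint(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Hypothetical base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        return off, None, [f"expected a digit at offset {off}"]
    return end, int(data[off:end]), []

def fixed_string(width: int) -> Parser:
    """Hypothetical base type C(e): a string whose width is the parameter e."""
    def parse(data: str, off: int):
        if off + width > len(data):
            return off, None, [f"expected {width} characters at offset {off}"]
        return off + width, data[off:off + width], []
    return parse

def sigma(first: Parser, rest: Callable[[Any], Parser]) -> Parser:
    """Dependent product: parse `first`, call its value x, then parse rest(x)."""
    def parse(data: str, off: int):
        off1, x, errs1 = first(data, off)
        if errs1:
            # Simplification: without a value for x, do not attempt the rest.
            return off1, (x, None), errs1
        off2, y, errs2 = rest(x)(data, off1)
        return off2, (x, y), errs2
    return parse

# A length-prefixed string: x names the count, which sets the width that follows.
length_prefixed = sigma(uint, fixed_string)

print(length_prefixed("5hello!", 0))  # (6, (5, 'hello'), [])
print(length_prefixed("9abc", 0))     # (1, (9, None), ['expected 9 characters at offset 1'])
```

The second call shows why the dependency matters: the count parses correctly, and the error is attributed to the field whose width it determines.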
12Constraints
- {x:T | e}: set types allow you to constrain the type T and express relationships between elements of the data.
- Examples
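As an illustration of a set type (not the example shown in the talk), the sketch below wraps a parser with a predicate on the parsed value. The helper names `uint` and `such_that` are hypothetical, and the decision to keep the value when the constraint fails is an assumption about how a semantic error might be surfaced.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (data, offset) to (new offset, value, error messages).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def uint(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Hypothetical base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        return off, None, [f"expected a digit at offset {off}"]
    return end, int(data[off:end]), []

def such_that(inner: Parser, pred: Callable[[Any], bool]) -> Parser:
    """Set type {x:T | e}: parse T, then check the predicate e on the value x."""
    def parse(data: str, off: int):
        new_off, x, errs = inner(data, off)
        if not errs and not pred(x):
            # A semantic error: the data parsed, so the value is still returned.
            errs = [f"constraint failed for {x!r} at offset {off}"]
        return new_off, x, errs
    return parse

# An integer constrained to fit in one byte.
octet = such_that(uint, lambda x: x <= 255)

print(octet("192.", 0))  # (3, 192, [])
print(octet("300", 0))   # (3, 300, ['constraint failed for 300 at offset 0'])
```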
13Unions and the Empty String
- true: matches the empty string.
- T + T': deterministic, exclusive or: try T; on failure, try T'.
- Examples
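The following Python sketch shows one way the two constructs could behave; it is not taken from the talk. The names `true`, `union`, and `uint`, and the `"left"`/`"right"` tags on the result, are assumptions chosen for illustration.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (data, offset) to (new offset, value, error messages).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def uint(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Hypothetical base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        return off, None, [f"expected a digit at offset {off}"]
    return end, int(data[off:end]), []

def true(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Matches the empty string: consumes nothing and never fails."""
    return off, (), []

def union(left: Parser, right: Parser) -> Parser:
    """Deterministic union: try `left`; only if it fails, try `right`."""
    def parse(data: str, off: int):
        new_off, value, errs = left(data, off)
        if not errs:
            return new_off, ("left", value), []
        # Restart the second branch from the original offset.
        new_off, value, errs = right(data, off)
        return new_off, ("right", value), errs
    return parse

# Combining a type with `true` yields an optional field.
optional_uint = union(uint, true)

print(optional_uint("42x", 0))  # (2, ('left', 42), [])
print(optional_uint("x", 0))    # (0, ('right', ()), [])
```

Because the first branch always wins when it succeeds, the order of the branches is part of the description's meaning.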
14Array Features
- What features do we need to handle data sequences?
- Elements
- Separator between elements
- Termination condition (are we done yet?)
- Terminator after sequence
- Examples
- 192.168.1.1
- BillCathyJaneBob
15False and Arrays
- T seq(Ts; e, Tt) specifies:
- Element type T.
- Separator type Ts.
- Termination condition e.
- Terminator type Tt.
- false: reads nothing, flagging an error.
- Example: IP address.
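The IP address example itself is missing from this transcript, so here is a hedged Python sketch of how the four parts of a sequence type might fit together. All names (`uint8`, `lit`, `false`, `seq`) are hypothetical. Two simplifications are assumed: the terminator is only looked for, not consumed, and parsing stops at the first element that cannot be read.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (data, offset) to (new offset, value, error messages).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def uint8(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Hypothetical base type: a decimal integer in the range 0..255."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        return off, None, [f"expected a digit at offset {off}"]
    value = int(data[off:end])
    errs = [] if value <= 255 else [f"{value} exceeds 255 at offset {off}"]
    return end, value, errs

def lit(text: str) -> Parser:
    """Match an exact string."""
    def parse(data: str, off: int):
        if data.startswith(text, off):
            return off + len(text), text, []
        return off, None, [f"expected {text!r} at offset {off}"]
    return parse

def false(data: str, off: int) -> Tuple[int, Any, List[str]]:
    """Reads nothing and always flags an error."""
    return off, None, [f"false cannot match (offset {off})"]

def seq(elem: Parser, sep: Parser, done: Callable[[list], bool], term: Parser) -> Parser:
    """Elements separated by `sep`, ending when `done` holds or `term` matches."""
    def parse(data: str, off: int):
        values: list = []
        errs: List[str] = []
        # Continue while the condition is false and the terminator does not match.
        while not done(values) and term(data, off)[2]:
            start = off
            if values:
                off, _, sep_errs = sep(data, off)
                if sep_errs:
                    return start, values, errs + sep_errs
            off, value, elem_errs = elem(data, off)
            errs += elem_errs
            if value is None:
                return off, values, errs
            values.append(value)
            if off == start:
                return off, values, errs  # no progress: avoid looping forever
        return off, values, errs
    return parse

# Four octets separated by dots; `false` as terminator means only the
# termination condition can end the sequence.
ip_address = seq(uint8, lit("."), lambda vs: len(vs) == 4, false)

print(ip_address("192.168.1.1", 0))  # (11, [192, 168, 1, 1], [])
print(ip_address("192.168.1", 0))    # (9, [192, 168, 1], ["expected '.' at offset 9"])
```

Using `false` as the terminator is what makes the length condition the sole stopping rule; swapping in `lit(";")` would instead end the sequence when a semicolon is next in the input.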
16Abstraction and Application
- Can parameterize types over values: λx.T
- Correspondingly, can apply types to values: T e
- Example: IP address with terminator
17Absorb, Compute and Scan
- Absorb, Compute and Scan are active types.
- absorb(T): consume data from source; produce nothing.
- compute(e:σ): consume nothing; output result of computation e.
- scan(T): scan data source for type T.
- Examples
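In place of the missing slide examples, the Python sketch below gives one possible reading of the three active types. The function names mirror the slide, but the details are assumptions: `absorb` returns `None` as its value, `compute` takes a zero-argument function, and `scan` reports how many characters it skipped alongside the matched value.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (data, offset) to (new offset, value, error messages).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def lit(text: str) -> Parser:
    """Hypothetical helper: match an exact string."""
    def parse(data: str, off: int):
        if data.startswith(text, off):
            return off + len(text), text, []
        return off, None, [f"expected {text!r} at offset {off}"]
    return parse

def absorb(inner: Parser) -> Parser:
    """Consume what `inner` consumes, but discard its value."""
    def parse(data: str, off: int):
        new_off, _, errs = inner(data, off)
        return new_off, None, errs
    return parse

def compute(fn: Callable[[], Any]) -> Parser:
    """Consume nothing; the value is the result of calling `fn`."""
    def parse(data: str, off: int):
        return off, fn(), []
    return parse

def scan(inner: Parser) -> Parser:
    """Skip forward until `inner` parses without error."""
    def parse(data: str, off: int):
        for start in range(off, len(data) + 1):
            new_off, value, errs = inner(data, start)
            if not errs:
                return new_off, (start - off, value), []
        return off, None, [f"scan found no match from offset {off}"]
    return parse

print(absorb(lit("|"))("|x", 0))        # (1, None, [])
print(compute(lambda: 42)("abc", 1))    # (1, 42, [])
print(scan(lit("|"))("abc|def", 0))     # (4, (3, '|'), [])
```

Composing `scan` with `absorb` gives a parser that skips ahead to a delimiter and throws it away, which is a common need when resynchronizing after an error.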
18Type Kinding
- Kinding ensures types are well formed.
19Parsing Semantics of Types
- Semantics expressed as parsing functions written in the polymorphic λ-calculus.
- Sem(T): DDC Type → Function
- Input: data and offset. Output: new offset, value, and parse descriptor.
- For specifics, see upcoming technical report.
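The input/output shape of a parsing function can be sketched in Python as follows. This is only an illustration of the signature: the `ParseDescriptor` fields and the `parse_digit` function are hypothetical, and bytes stand in for the bit-level input.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ParseDescriptor:
    """Hypothetical parse descriptor: an error count plus messages."""
    nerr: int = 0
    messages: List[str] = field(default_factory=list)

def parse_digit(data: bytes, offset: int) -> Tuple[int, Optional[int], ParseDescriptor]:
    """Data and offset in; new offset, value, and parse descriptor out."""
    if offset < len(data) and 48 <= data[offset] <= 57:  # ASCII '0'..'9'
        return offset + 1, data[offset] - 48, ParseDescriptor()
    # On failure the offset does not advance and there is no value to return.
    pd = ParseDescriptor(nerr=1, messages=[f"expected a digit at offset {offset}"])
    return offset, None, pd

print(parse_digit(b"7x", 0))
print(parse_digit(b"7x", 1))
```

Keeping the descriptor separate from the value lets a caller use whatever data did parse while still seeing exactly where the input deviated from the description.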
20Types of Parser Output
- Parsers produce values with the following type in the host language.
(Diagram: host-language types for Base Types, Products, Abs. and App., Union, and Set types, annotated "unrecoverable error", "dependency erased", and "semantic error".)
21Properties of the Calculus
- Theorem: If Γ ⊢ T : κ then
- Sem(T) = F: well-formed types yield parsers.
- Γ ⊢ F : bits × offset → offset × [T]rep × [T]pd: a T-parser returns values with types that correspond to T.
- Theorem: Parsers report errors accurately.
- Errors in parse descriptor correspond to actual errors in data.
- Parsers check all semantic constraints.
- More
22Making Use of the Calculus
IPADS
t ::= C(e) | Pfun(x:s) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts tdef} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e, t] | Pcompute e | Plit c
fields ::= fields x:t
alts ::= alts e => t
Γ ⊢ t ⇓ T
23Example Popt and Plit
DDC types: true, T1 + T2, C(e), {x:T | e}, absorb(T), scan(T)
24Example Pswitch
25Future work
- What is the set of languages recognized by the DDC?
- How does the expressive power of the DDC relate to CFGs and regular expressions?
- Implement recursive types in PADS system based on the recursive types of the DDC.
- Add polymorphism to DDC and PADS.
26Summary
- Data description languages are well-suited to describing ad hoc data.
- No one DDL will ever be right: different domains and applications will demand different languages with differing levels of expressiveness and abstraction.
- Our work defines the first semantics for data description languages.
- For more information, visit www.padsproj.org.
27Cut slides follow
28A Brief History
- In the beginning, there was just one program (maybe two).
- No need for programming language.
- That program was copied and changed until there were many programs.
- High-level programming language was invented.
- Nice, but not right for all situations: many new programming languages appeared.
- How do these languages relate to each other?
- Programming language semantics was born.
29A Brief History
- In the beginning, there was just one data format (binary).
- No need for data description language.
- That format evolved until there were many formats.
- Data description language was invented.
- One language did not suit all, and many new data description languages appeared.
- This is where we are today.
- We'd like to help answer that question by devising the first data description language semantics.