Title: Database Application Design
1Database Application Design
March 24, 2000
2Course information
- Instructor Dragomir R. Radev (radev_at_si.umich.edu)
- Office 305A, West Hall
- Phone (734) 615-5225
- Office hours Thursdays 3-4 and Fridays 1-2
- Course page http//www.si.umich.edu/radev/654w00
- Class meets on Fridays, 230 - 530 PM, 311 WH
3Web-based databases
4Types of databases
- Textual databases
- Semi-structured databases
5Indexing textual data
- Inverted files
- Boolean queries
- Signature files
- Signature S1 matches signature S2 if S2S1S2
6XML-QL
7XML-QL
Two slides from Johannes Gehrke, Cornell
University
ltIMG SRCxysq.gif ALT(xy)2gt
ltapplygt ltpower/gt ltapplygt ltplus/gt ltcigtxlt/cigt
ltcigtylt/cigt lt/applygt ltcngt2lt/cngt lt/applygt
WHERE ltBOOKgt ltNAMEgtltLASTgt1lt/LASTgtlt/NAME
gt lt/BOOKgt in www.booklist.com/books.xml CONSTR
UCT ltRESULTgt 1 lt/RESULTgt
8XML-QL (continued)
WHERE ltBOOKgt b ltBOOKgt IN www.booklist.com/boo
ks.xml, ltAUTHORgt n lt/AUTHORgt ltPUBLISHEDgt
p lt/PUBLISHEDgt in e CONSTRUCT ltRESULTgt
ltPUBLISHEDgt p lt/PUBLISHEDgt WHERE
ltLASTgt l lt/LASTgt IN n CONSTRUCT ltLASTgt
l lt/LASTgt lt/RESULTgt
9XML-QL (continued)
- lt!ELEMENT book (author, title, publisher)gt
- lt!ATTLIST book year CDATAgt
- lt!ELEMENT article (author, title, year?,
(shortversionlongversion))gt - lt!ATTLIST article type CDATAgt
- lt!ELEMENT publisher (name, address)gt
- lt!ELEMENT author (firstname?, lastname)gt
10XML-QL (continued)
- WHERE ltbookgt
- ltpublishergtltnamegtAddison-Wesleylt/namegtlt/publisher
gt - lttitlegt tlt/titlegt
- ltauthorgt alt/authorgt
- lt/bookgt IN "www.a.b.c/bib.xml"
- CONSTRUCT a
11XML-QL (continued)
- WHERE ltbookgt
- ltpublishergtltnamegtAddison-Wesleylt/gtlt/gt
- lttitlegt tlt/gt
- ltauthorgt alt/gt
- lt/gt IN "www.a.b.c/bib.xml"
- CONSTRUCT a
12XML-QL (continued)
- WHERE ltbookgt
- ltpublishergtltnamegtAddison-Wesleylt/gtlt/gt
- lttitlegt tlt/gt
- ltauthorgt alt/gt
- lt/gt IN "www.a.b.c/bib.xml"
- CONSTRUCT ltresultgt
- ltauthorgt alt/gt
- lttitlegt tlt/gt
- lt/gt
13XML-QL (continued)
- ltbibgt
- ltbook year"1995"gt
- lt!-- A good introductory text --gt
- lttitlegt An Introduction to Database
Systems lt/titlegt - ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - ltpublishergt ltnamegt Addison-Wesley lt/name gt
lt/publishergt - lt/bookgt
- ltbook year"1998"gt
- lttitlegt Foundation for Object/Relational
Databases The Third Manifesto lt/titlegt - ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - ltauthorgt ltlastnamegt Darwen lt/lastnamegt
lt/authorgt - ltpublishergt ltnamegt Addison-Wesley lt/name gt
lt/publishergt - lt/bookgt
- lt/bibgt
14XML-QL (continued)
- ltresultgt
- ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - lttitlegt An Introduction to Database Systems
lt/titlegt - lt/resultgt
- ltresultgt
- ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - lttitlegt Foundation for Object/Relational
Databases The Third Manifesto lt/titlegt - lt/resultgt
- ltresultgt
- ltauthorgt ltlastnamegt Darwen lt/lastnamegt
lt/authorgt - lttitlegt Foundation for Object/Relational
Databases The Third Manifesto lt/titlegt - lt/resultgt
15XML-QL (continued)
- WHERE ltbook gt plt/gt IN "www.a.b.c/bib.xml",
- lttitle gt tlt/gt,
- ltpublishergtltnamegtAddison-Wesleylt/gtgt IN p
- CONSTRUCT ltresultgt
- lttitlegt t lt/gt
- WHERE ltauthorgt a lt/gt IN p
- CONSTRUCT ltauthorgt alt/gt
- lt/gt
16XML-QL (continued)
- ltresultgt
- lttitlegt An Introduction to Database Systems
lt/titlegt - ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - lt/resultgt
- ltresultgt
- lttitlegt Foundation for Object/Relational
Databases The Third Manifesto lt/titlegt - ltauthorgt ltlastnamegt Date lt/lastnamegt
lt/authorgt - ltauthorgt ltlastnamegt Darwen lt/lastnamegt
lt/authorgt - lt/resultgt
17XML-QL (continued)
- WHERE ltarticlegt
- ltauthorgt
- ltfirstnamegt f lt/gt // firstname f
- ltlastnamegt l lt/gt // lastname l
- lt/gt
- lt/gt CONTENT_AS a IN "www.a.b.c/bib.xml"
- ltbook yearygt
- ltauthorgt
- ltfirstnamegt f lt/gt // join on same
firstname f - ltlastnamegt l lt/gt // join on same
lastname l - lt/gt
- lt/gt IN "www.a.b.c/bib.xml",
- y gt 1995
- CONSTRUCT ltarticlegt a lt/gt
18XML-QL (continued)
19XML-QL (continued)
- lt!ATTLIST person ID ID REQUIREDgt
- lt!ATTLIST article author IDREFS IMPLIEDgt
20XML-QL (continued)
- ltperson ID"o123"gt
- ltfirstnamegtJohnlt/firstnamegt
- ltlastnamegtSmithltlastnamegt
- lt/persongt
- ltperson ID"o234"gt
- . . .
- lt/persongt
- ltarticle author"o123 o234"gt
- lttitlegt ... lt/titlegt
- ltyeargt 1995 lt/yeargt
- lt/articlegt
21XML-QL (continued)
22XML-QL (continued)
- WHERE ltarticlegtltauthorgtltlastnamegt nlt/gtlt/gtlt/gt IN
"abc.xml
WHERE ltarticle authorigt lttitlegt lt/gt
ELEMENT_AS t lt/gt, ltperson IDigt
ltlastnamegt lt/gt ELEMENT_AS l
lt/gt CONSTRUCT ltresultgt t llt/gt
23Scalar values
- lttitlegtA Trip to lttitlepartgt the Moon
lt/titlepartgtlt/titlegt NOT!
lttitlegtltCDATAgt A Trip to lt/CDATAgtlttitlepartgtltCDATA
gt the Moonlt/CDATAgtlt/titlepartgtlt/titlegt
YES
24Tag variables
- WHERE ltpgt
- lttitlegt t lt/titlegt
- ltyeargt1995lt/gt
- ltegt Smith lt/gt
- lt/gt IN "www.a.b.c/bib.xml",
- e IN author, editor
- CONSTRUCT ltpgt
- lttitlegt t lt/titlegt
- ltegt Smith lt/gt
- lt/gt
25Transforming data
- lt!ELEMENT book (author, title, publisher)gt
- lt!ATTLIST book year CDATAgt
- lt!ELEMENT article (author, title, year?,
(shortversionlongversion))gt - lt!ATTLIST article type CDATAgt
- lt!ELEMENT publisher (name, address)gt
- lt!ELEMENT author (firstname?, lastname)gt
lt!ELEMENT person (lastname, firstname,
address?, phone?, publicationtitle)gt
26Transforming data (contd)
- WHERE ltgt ltauthorgt ltfirstnamegt fn lt/gt
- ltlastnamegt ln lt/gt
- lt/gt
- lttitlegt t lt/gt
- lt/gt IN "www.a.b.c/bib.xml",
- CONSTRUCT ltperson IDPersonID(fn, ln)gt
- ltfirstnamegt fn lt/gt
- ltlastnamegt ln lt/gt
- ltpublicationtitlegt t lt/gt
- lt/gt
27Integrating data from different sources
- WHERE ltpersongt
- ltnamegtlt/gt ELEMENT_AS n
- ltssngt ssnlt/gt
- lt/gt IN "www.a.b.c/data.xml",
- lttaxpayergt
- ltssngt ssnlt/gt
- ltincomegtlt/gt ELEMENT_AS i
- lt/gt IN "www.irs.gov/taxpayers.xml"
- CONSTRUCT ltresultgt n i lt/gt
28Query blocks
- WHERE ltegt lttitlegt t lt/gt
- ltyeargt 1995 lt/gt lt/gt CONTENT_A p
- IN "www.a.b.c/bib.xml"
- CONSTRUCT ltresult IDResultID(p)gt lttitlegt t
lt/gt lt/gt - WHERE e "journal-paper",
- ltmonthgt m lt/gt IN p
- CONSTRUCT ltresult IDResultID(p)gt ltmonthgt
m lt/gt lt/gt -
- WHERE e "book",
- ltpublishergtq lt/gt IN p
- CONSTRUCT ltresult IDResultID(p)gt
ltpublishergtq lt/gt lt/gt -
29WSQ
30Web-supported queries
- SIGMOD2000 (Goldman and Widom)
- WebPages (SearchExp,T1,T2,,Tn,URL,Rank, Date)
SELECT NAME, COUNTFROM STATES, WEBCOUNTWHERE
NAME T1ORDER BY COUNT DESC
31XHTML
32Simple example
- lt?xml version"1.0" encoding"UTF-8"?gt
- lt!DOCTYPE html
- PUBLIC "-//W3C//DTD XHTML
1.0 Strict//EN" - "DTD/xhtml1-strict.dtd"gt
- lthtml xmlns"http//www.w3.org/199
9/xhtml" xmllang"en" lang"en"gt - ltheadgt
- lttitlegtVirtual
Librarylt/titlegt - lt/headgt
- ltbodygt
- ltpgtMoved to lta
href"http//vlib.org/"gtvlib.orglt/agt.lt/pgt - lt/bodygt
- lt/htmlgt
33SI 760Language and information(Fall 2000)
34SI 760 (1)
- Classes 1-3 Introduction to the course and
linguistic background - The study of language. Computational Linguistics
and Psycholinguistics. - Classes 4-5 Elementary probability and statistics
- Describing data. Measures of central tendency.
The z score. Hypothesis testing. - Classes 6-8 Information theory
- Entropy, joint entropy, conditional entropy.
Relative entropy and mutual information. Chain
rules. - Classes 9-10 Data compression and coding
- Entropy rate. Language modeling. Examples of
codes. Optimal codes. Huffman codes. Arithmetic
coding. The entropy of English.
35SI 760 (2)
- Classes 11-12 Clustering
- Cluster analysis. Clustering of terms according
to semantic similarity. Distributional
clustering. - Classes 13-14 Concordancing and collocations
- Concordances. Collocations. Syntactic criteria
for collocability. - Classes 15-16 Literary detective work
- The statistical analysis of writing style.
Decipherment and translation. - Classes 17-18 Information extraction
- Message understanding. Trainable methods.
36SI 760 (3)
- Classes 19-20 Word sense disambiguation and
lexical acquisition - Supervised disambiguation. Unsupervised
disambiguation. Attachment ambiguity.
Computational lexicography. - Classes 21-22 Part-of-speech tagging
- Statistical taggers. Transformation-based
learning of tags. Maximum entropy models.
Weighted finite- state transducers. - Classes 23-24 Question answering
- Semantic representation. Predictive annotation.
37SI 760 (4)
- Classes 25-26 Text summarization
- Single-document summarization. Multi-document
summarization. Language models. Maximal Marginal
Relevance. Cross-document structure theory.
Trainable methods. Text categorization. - Classes 27-28 (30) Other topics
- Text alignment. Word alignment. Statistical
machine translation. Discourse segmentation. Text
categorization. Maximum entropy modeling.
38SI 760 (5)
- Manning and Schuetze. Foundations of Statistical
Natural Language Processing. MIT Press. 1999. - Jurafsky and Martin. Speech and Language
Processing. Prentice-Hall 2000. - Cover Thomas. Elements of Information Theory.
John Wiley and Sons 1991. - Baeza-Yates and Ribeiro-Neto. Modern Information
Retrieval. Addison-Wesley 1999. - Oakes. Statistics for Corpus Linguistics.
Edinburgh University Press 1998.
39Course URL
- http//www.si.umich.edu/radev/760f00
40Readings for next time
- Web-based readings
- Asilomar report
- http//www.acm.org/sigmod/record/issues/9812/asilo
mar.html - White paper on XML
- http//www-db.stanford.edu/widom/xml-whitepaper.h
tml