CS246: Web Information Systems


1
CS246 Web Information Systems
  • Junghoo John Cho
  • Fall 2007

2
Course Information
  • Web page: http://oak.cs.ucla.edu/cs246/
  • Topic: Web information management
  • Time: TuTh 10:00 -- 11:50 am
  • Place: 5419 Boelter Hall
  • Instructor: Junghoo John Cho
  • office: 3532E Boelter Hall
  • email: cho_at_cs.ucla.edu
  • please use subject CS246
  • office hours: M 4-5 pm

3
Today's Topics
  • Overview of the course topics
  • Central Indexing
  • Dynamic Integration
  • Data Extraction
  • Course logistics
  • Paper reading assignments
  • Class project

4
Who is this class for?
  • Strong interest in research
  • Interest in Web information systems
  • Time commitment
  • Around 2-3 papers every week
  • Typically one full day of paper reading
  • One independent project
  • Similar to paper writing
  • In fact, we read papers from past student
    projects!
  • Or interesting application implementation

5
Prerequisite
  • Introductory database, e.g., CS143
  • e.g. query? SQL?
  • Basic algorithms and data structures
  • Predicate logic
  • ?x, Taller (Fred, x)
  • Design and implementation experience
  • Basic C
  • Quick test: grab a sample paper
  • See if you can read, understand and build it

6
Tell Us About You
  • Name
  • Department Program
  • Before coming to UCLA
  • Brief history at UCLA
  • Technical/research interests
  • Expectation from the class

7
Information Galore
Biblio server
Legacy database
Plain text files
8
Central Problem
  • How to manage/access information on the Web?
  • Three major approaches
  • Central indexing
  • E.g., Web search engine
  • Dynamic integration
  • E.g., comparison shopping services
  • Data extraction
  • E.g., spamming companies

9
Topic: Web Search (Central Indexing)
Central Index
10
Topic: Web Search (Central Indexing)
  • Web: a collection of passive HTML pages
  • Find Web pages relevant to a query
  • Traditional Information Retrieval
  • Web: a collection of HTML pages
  • HTML page: a bag of words
  • More than that?
  • Links, structure of the Web
  • User access patterns
  • HTML tags (markups)
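
The bag-of-words view and the central index can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: the page ids, page texts, and the function names `bag_of_words`, `build_index`, and `search` are all made up for this example, and the query semantics (every query word must appear) is an assumption.

```python
from collections import Counter, defaultdict
import re


def bag_of_words(text):
    """Lowercase the text and count word occurrences, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Build a central (inverted) index: word -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in bag_of_words(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = list(bag_of_words(query))
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-page "Web" used only for illustration.
pages = {
    "p1": "Used cars for sale",
    "p2": "Apartments for rent",
    "p3": "Cars and apartments",
}
index = build_index(pages)
```

Here `search(index, "cars apartments")` looks up each word in the index and intersects the page sets. Links, access patterns, and HTML tags are exactly what this representation throws away.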

11
Topic: Dynamic Integration
Amazon.com
Cars.com
401carfinder.com
Apartments.com
12
Topic: Dynamic Integration
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
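
One way to read the mediator/wrapper picture, as a rough Python sketch: each wrapper translates one source's native records into a common schema, and the mediator sends the same query to every wrapper and merges the answers. The class names, the field names (`name`, `price`), the sample records, and the price-filter query are all assumptions made up for this illustration.

```python
class DictWrapper:
    """Hypothetical wrapper: adapts one source's records to a common schema."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records
        self.field_map = field_map  # common field name -> source field name

    def query(self, max_price):
        results = []
        for rec in self.records:
            # Translate the source-specific record into the common schema.
            item = {common: rec[native] for common, native in self.field_map.items()}
            if item["price"] <= max_price:
                item["source"] = self.name
                results.append(item)
        return results


class Mediator:
    """Hypothetical mediator: fans a query out to all wrappers and merges results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, max_price):
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(max_price))
        return sorted(merged, key=lambda item: item["price"])


# Two made-up sources that describe the same kind of item with different fields.
source1 = DictWrapper(
    "Source 1",
    [{"title": "Book A", "cost": 12}, {"title": "Book B", "cost": 30}],
    {"name": "title", "price": "cost"},
)
source2 = DictWrapper(
    "Source 2",
    [{"item": "Book A", "usd": 10}],
    {"name": "item", "price": "usd"},
)
mediator = Mediator([source1, source2])
offers = mediator.query(max_price=20)
```

The sources are queried at request time, so nothing is copied into a central index in advance. A real wrapper would fetch and parse a live site rather than read an in-memory list.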
13
Topic: Data Extraction
Structured data
WWW
Beatles 10
Madonna 20
NSync 20
  • How can we extract structured data from free
    text automatically?
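
As a very small illustration of the question above, a hand-written pattern can pull (name, number) rows out of free text. This sketch assumes the numbers on the slide are prices and that every mention follows the phrasing "<Artist> album for $<price>". Both assumptions, the sample sentence, and the names `PATTERN` and `extract` are invented for this example. Real extraction systems have to learn such patterns rather than rely on a single fixed one.

```python
import re

# Assumed phrasing: "<Artist> album for $<price>".
PATTERN = re.compile(r"(?P<artist>[A-Z][A-Za-z]+) album for \$(?P<price>\d+)")


def extract(text):
    """Return (artist, price) tuples for every match of PATTERN in the text."""
    return [(m.group("artist"), int(m.group("price"))) for m in PATTERN.finditer(text)]


# Made-up free text containing the three rows shown on the slide.
text = (
    "Get the Beatles album for $10 today. "
    "We also have the Madonna album for $20, and the NSync album for $20."
)
rows = extract(text)
```

With this input, `rows` holds one tuple per artist. Any sentence worded differently is silently missed, which is why the automatic version of the problem is hard.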

14
High-Level Goal
  • Learn core ideas and techniques
  • Some of the techniques can be useful for other
    fields
  • Learn how to read papers
  • Hopefully learn what it is like to do research
  • Sometimes very frustrating but often very
    rewarding

15
Main Course Workload
  • Paper reading
  • Paper reading assignments
  • Class discussion
  • Independent projects

16
Paper Reading
  • Why
  • Something that you will do all the time as a
    researcher
  • Learn to be critical and communicate well
  • Acquire knowledge to conduct research/project
  • About 20 papers from
  • Conferences: SIGMOD, VLDB, WWW, and
  • Before the class
  • Everyone reads and reviews the paper
  • During the class
  • Instructor presents his own understanding and
    leads the class discussion
  • Everyone participates!

17
How to Get Papers
  • From the class homepage
  • http://oak.cs.ucla.edu/cs246/
  • Some of the materials are password protected
  • User name: cs246
  • Password: papers
  • Let me know if there is any problem

18
How to Read Papers
  • Understand the Big Picture
  • What is the problem?
  • Why is it important?
  • Why is it difficult?
  • What has this paper done?
  • What have others done?
  • How do they support their claim?

19
Paper Reviews (1)
  • Due by the preceding Sunday
  • Submit through our Web submission interface on
    the class Web page
  • Required components: at most 5 paragraphs
  • Summary (1 paragraph): in your own words
  • This paper discusses how to optimize queries
    with...
  • Comments/criticisms (2-4 paragraphs): the good
    and the bad
  • It addresses a real problem and the solution is
    interesting
  • But I feel the experiments are not realistic
    because...
  • Optional: questions, as many as you want
  • Why do the authors assume that queries are
    independent?

20
Paper Reviews (2)
  • May skip 3 paper summaries without penalty
  • Graded by students (Excellent, Good, Poor)
  • 10% Excellent
  • 80% Good
  • 10% Poor

21
Class Project
  • Why
  • Work on a specific problem and learn to find a
    solution
  • 40% of the class grade
  • Team of up to 3
  • Topic: any problem related to the general problem
  • Open style
  • Rigorous study of a research problem or
  • Any interesting system implementation

22
Class Project Schedule
  • Important Milestones
  • Group formation: 10/7 (Su)
  • Project proposal: 10/21 (Su)
  • Project progress: 11/06 (Tu)
  • Final report: 11/25 (Su)
  • Project presentation: 11/27, 11/29, 12/4, 12/6
  • You are responsible for staying on track
  • Make appointments with the instructor as needed

23
Project: Please Remember
  • Aim high, but be realistic
  • Expect to read at least 4-5 papers along the way
  • Start early
  • Don't do it right before the deadline
  • There are always unexpected obstacles
  • Most students could not finish in previous
    quarters
  • Please, please start early
  • You are responsible for staying on track

24
WebBase dataset
  • 70 million Web pages
  • 600GB before compression
  • C access interface similar to file interface
  • repository: open(), read(), seek(), tell()
  • webcat
  • Sunflower.cs.ucla.edu
  • cho_at_cs.ucla.edu

25
Grading
  • Midterm: 40%
  • Paper reviews: 20%
  • Project: 40%

26
Announcements
  • First review is due Sunday, 09/30
  • Two papers for class 3
  • Graph structure in the Web
  • Extracting Patterns and Relations from the Web
  • Only for this week. All of next week's papers are
    due later
  • Please sign up for the paper reviews!
  • Everyone has to do it at least once
  • Unassigned papers are assigned randomly at the
    beginning of the second week