Web-based Information Architectures MSEC 20-760 Mini II

Location: GSIA Simon Auditorium. Time: 1:30-3:20pm, Tues. & Thurs. Instructor: Prof. Jaime Carbonell

Slides: 24
Provided by: cji5

1
Web-based Information Architectures
MSEC 20-760, Mini II
  • Location: GSIA Simon Auditorium
  • Time: 1:30-3:20 pm, Tues. & Thurs.
  • Instructor: Prof. Jaime Carbonell
    Office: NSH 4519, Email: jgc_at_cs.cmu.edu, Tel: 268-7279
  • Augmented with expert guest lectures
  • Teaching assistant: Jian Zhang
    Office: NSH 4605, Email: jianzhang_at_cmu.edu, Tel: 268-6521,
    Office hours: TBD
  • Administrative assistant: TBD
    Office: NSH 4517, Email: rcpomp_at_cs.cmu.edu, Tel: 268-4788
2
Administrative Issues
  • Prerequisites
    • Basic programming skills (preferably Java)
    • Familiarity with the web (HTML, browsing, etc.)
    • Fundamentals of Web Programming (20-753)
  • Grading
    • 30% homeworks (2 programming assignments)
    • 30% miniproject (student teams will propose)
    • 15% midterm (5 pages notes, calculator OK, no laptops)
    • 25% final (10 pages notes, calculator OK, no laptops)
  • Bulletin Board
    • Schedule/syllabus
    • Lecture notes (in PowerPoint)
    • Homework
    • Announcements & discussions
3
Textbook and Reference Materials (1)
  • Required: Class notes (slides on web site) and handouts (to be
    provided)
  • Required: "Understanding Search Engines: Mathematical Modeling and
    Text Retrieval" by Michael W. Berry and Murray Browne.
    Available at http://www.siam.org (tel. 1-800-447-7426)
  • Optional: Background reading material provided
4
Textbook and Reference Materials (2)
  • Optional: "Advances in Information Retrieval", edited by Croft,
    Kluwer Academic Pub., 2000. A more detailed, state-of-the-art IR
    book.
  • Optional: "Machine Learning" by Tom M. Mitchell, WCB McGraw-Hill.
    Tools for text categorization and data mining.
5
Information Retrieval: The Challenge (1)
  • Text DB includes:
  • (1) Rainfall measurements in the Sahara continue to show a steady
    decline starting from the first measurements in 1961. In 1996 only
    12mm of rain were recorded in upper Sudan, and 1mm in Southern
    Algiers...
  • (2) Dan Marino states that professional football risks losing the
    number one position in the heart of fans across this land.
    Declines in TV audience ratings are cited...
  • (3) Alarming reductions in precipitation in desert regions are
    blamed for desert encroachment of previously fertile farmland in
    Northern Africa. Scientists measured both yearly precipitation and
    groundwater levels...
6
Information Retrieval: The Challenge (2)
  • User query states: "Decline in rainfall and impact on farms near
    Sahara"
  • Challenges
    • How to retrieve (1) and (3) and not (2)?
    • How to rank (3) as best?
    • How to cope with no shared words?
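The "no shared words" challenge can be made concrete with a minimal sketch that counts the content words the query shares with each of the three sample documents. The tokenizer, the tiny stopword list, and the names `content_words`, `docs`, and `shared` are illustrative assumptions, not part of the course material.

```python
import re

# Small illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "and", "are", "for", "from", "in", "of", "on",
             "that", "the", "this", "to", "were"}

def content_words(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

# The three sample documents from the slide, without the trailing ellipses.
docs = {
    1: "Rainfall measurements in the Sahara continue to show a steady "
       "decline starting from the first measurements in 1961. In 1996 only "
       "12mm of rain were recorded in upper Sudan, and 1mm in Southern "
       "Algiers",
    2: "Dan Marino states that professional football risks losing the "
       "number one position in the heart of fans across this land. "
       "Declines in TV audience ratings are cited",
    3: "Alarming reductions in precipitation in desert regions are blamed "
       "for desert encroachment of previously fertile farmland in Northern "
       "Africa. Scientists measured both yearly precipitation and "
       "groundwater levels",
}
query = "Decline in rainfall and impact on farms near Sahara"

# Content words each document has in common with the query.
shared = {doc_id: content_words(query) & content_words(text)
          for doc_id, text in docs.items()}

for doc_id, words in shared.items():
    print(doc_id, sorted(words))
```

Under these assumptions only document (1) shares content words with the query. Document (3), the one the user most wants, shares none, because "precipitation" and "farmland" never literally match "rainfall" and "farms".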
7
Information Retrieval in eCommerce (1)
  • Bringing in Customers
    • How do Web-search engines work?
    • How to maximize hits on my eCommerce pages?
    • How to maximize preselection of customers who will transact?
8
Information Retrieval in eCommerce (2)
  • Analyzing the Competition
    • How do we find the competition?
    • How will customers find the competition?
    • Can we do preemptive information strikes?
  • Text Mining
    • How to learn what customers want most?
    • How to find out what they missed, but wanted?
    • How to discover customer search/browsing patterns?
9
Information Retrieval Assumption (1)
  • Basic IR task
    • There exists a document collection {Dj}
    • User enters an ad hoc query Q
    • Q correctly states the user's interest
    • User wants the subset {Di} ⊂ {Dj} most relevant to Q
10
Information Retrieval Assumption (2)
  • "Shared Bag of Words" assumption
    • Every query = {wi}
    • Every document = {wk}
    • ...where wi and wk are in the same S
  • All syntax is irrelevant (e.g. word order)
  • All document structure is irrelevant
  • All meta-information is irrelevant (e.g. author, source, genre)
  • => Words suffice for relevance assessment
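A minimal sketch of what the bag-of-words assumption discards: two sentences with opposite meanings reduce to the same bag once word order is ignored. The helper name `bag_of_words` and the example sentences are illustrative assumptions.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to a multiset of lowercase word tokens; order is lost."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")

# Same words, different order: the two bags are indistinguishable.
same_bag = (a == b)
print(same_bag)
```

Under this representation a retrieval system cannot tell the two sentences apart, which is exactly the simplification the assumption accepts.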
11
Information Retrieval Assumption (3)
  • Retrieval by shared words
  • If Q and Dj share some wi , then Relevant(Q, Dj )
  • If Q and Dj share all wi , then Relevant(Q, Dj )
  • If Q and Dj share over K of wi , then
    Relevant(Q, Dj)
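The three candidate criteria above can be sketched as predicates over word sets. This is only an illustration: the function names are hypothetical, and K is interpreted here as a fraction of the query words, which the slide does not specify.

```python
def shares_some(query_words, doc_words):
    """Relevant if at least one query word occurs in the document."""
    return bool(query_words & doc_words)

def shares_all(query_words, doc_words):
    """Relevant only if every query word occurs in the document."""
    return query_words <= doc_words

def shares_over(query_words, doc_words, k):
    """Relevant if more than fraction k of the query words occur
    in the document (k as a fraction is an assumption)."""
    if not query_words:
        return False
    return len(query_words & doc_words) / len(query_words) > k

# Toy query and documents, invented for illustration.
q = {"cheap", "laser", "printers"}
d1 = {"cheap", "laser", "printers", "sale"}
d2 = {"laser", "surgery"}
d3 = {"garden", "tools"}

for name, d in (("d1", d1), ("d2", d2), ("d3", d3)):
    print(name, shares_some(q, d), shares_all(q, d), shares_over(q, d, 0.5))
```

The "some" rule favors recall (d2 is accepted on one shared word), the "all" rule favors precision, and the threshold rule sits between the two.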

12
Boolean Queries (1)
  • Industrial use of Silver
  • Q: silver
  • R: "The Count's silver anniversary..."
    "Even the crash of '87 had a silver lining..."
    "The Lone Ranger lived on in syndication..."
    "Silver dropped to a new low in London..."
    ...
  • Q: silver AND photography
  • R: "Posters of Tonto and the Lone Ranger..."
    "The Queen's Silver Anniversary photos..."
    ...
13
Boolean Queries (2)
  • Q: ((silver AND (NOT anniversary)
    AND (NOT lining)
    AND emulsion)
    OR (AgI AND crystal
    AND photography))
  • R: "Silver Iodide Crystals in Photography..."
    "The emulsion was worth its weight in silver..."
    ...
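A Boolean query of this shape can be evaluated with set operations over an inverted index. The sketch below is a minimal illustration: the toy documents are invented (they are not the slide's truncated snippets), and the helper names `build_inverted_index`, `term`, and `NOT` are hypothetical.

```python
from collections import defaultdict

# Toy collection, invented for illustration.
docs = {
    1: "silver anniversary gift ideas",
    2: "every cloud has a silver lining",
    3: "silver emulsion coating for film",
    4: "AgI crystal growth in photography",
    5: "silver anniversary photography",
}

def build_inverted_index(docs):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_inverted_index(docs)
all_ids = set(docs)

def term(word):
    """Posting set for one term (empty if the term never occurs)."""
    return index.get(word.lower(), set())

def NOT(ids):
    """Complement relative to the whole collection."""
    return all_ids - ids

# AND is set intersection (&), OR is set union (|).
result = ((term("silver") & NOT(term("anniversary"))
           & NOT(term("lining")) & term("emulsion"))
          | (term("AgI") & term("crystal") & term("photography")))

print(sorted(term("silver")))                        # Q: silver
print(sorted(term("silver") & term("photography")))  # Q: silver AND photography
print(sorted(result))                                # the compound query
```

On this toy collection the bare query `silver` matches four documents, most of them off-topic, while the compound query keeps only documents 3 and 4. The answer is an unordered set with no ranking.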

14
Boolean Queries (3)
  • Boolean queries are
  • a) easy to implement
  • b) confusing to compose
  • c) seldom used (except by librarians)
  • d) prone to low recall
  • e) all of the above

15
Beyond the Boolean Boondoggle (1)
  • Desiderata (1)
  • Query must be natural for all users
  • Sentence, phrase, or word(s)
  • No ANDs, ORs, NOTs, ...
  • No parentheses (no structure)
  • System focuses on important words
  • Q: I want laser printers now

16
Beyond the Boolean Boondoggle (2)
  • Desiderata (2)
  • Find what I mean, not just what I say
  • Q: cheap car insurance
  • (pAND (pOR "cheap" 1.0
    "inexpensive" 0.9
    "discount" 0.5)
    (pOR "car" 1.0
    "auto" 0.8
    "automobile" 0.9
    "vehicle" 0.5)
    (pOR "insurance" 1.0
    "policy" 0.3))
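One way to read the expanded query is as a weighted scoring expression. The slide does not define pAND or pOR, so the semantics below are assumptions made only for illustration: pOR takes the highest weight among the synonyms present in a document, and pAND multiplies the pOR scores. The names `p_or`, `p_and`, and `score` are hypothetical.

```python
# The expanded query from the slide: one weighted synonym group per
# original query word.
query = [
    {"cheap": 1.0, "inexpensive": 0.9, "discount": 0.5},
    {"car": 1.0, "auto": 0.8, "automobile": 0.9, "vehicle": 0.5},
    {"insurance": 1.0, "policy": 0.3},
]

def p_or(weights, doc_words):
    """ASSUMED semantics: best weight among synonyms found in the document."""
    return max((w for t, w in weights.items() if t in doc_words), default=0.0)

def p_and(groups, doc_words):
    """ASSUMED semantics: product of the group scores, so a document
    missing an entire group scores zero."""
    result = 1.0
    for group in groups:
        result *= p_or(group, doc_words)
    return result

def score(text):
    """Score a document text against the expanded query."""
    return p_and(query, set(text.lower().split()))

for text in ("cheap car insurance",
             "inexpensive auto insurance quotes",
             "discount vehicle policy",
             "cheap flights"):
    print(f"{score(text):.3f}  {text}")
```

Under these assumed semantics, a document that uses "inexpensive auto insurance" is still retrieved, with a somewhat lower score than an exact match, which is the "find what I mean, not just what I say" behavior the slide asks for.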

17
Beyond the Boolean Boondoggle (3)
  • Desiderata (3)
  • Speech-recognized queries
  • Coming soon, to a system near you
  • longer queries
  • more fluff words to filter
  • acoustic recognition errors

18
INFORMATION RETRIEVAL
[Diagram with the labels: User, The Web, Spider, Search Engine, Inverted Index, Library, etc.]
19
INFORMATION RETRIEVAL APPLICATIONS
  • Searching Document Archives
    • Libraries (title, subject, full-text)
    • Databases of patents and applications
    • DBs of legal cases (e.g. Lexis, Westlaw)
  • Searching the Web
    • Pure search engines (Google, Inktomi, ...)
    • Browsing & Search (Yahoo, Terra-Lycos, ...)
    • Meta-search (Metacrawler, Vivisimo, ...)
  • Corporate or Government Intranets
  • Non-traditional (e.g. Software DBs, News)

20
INFORMATION RETRIEVAL (IR) EVOLUTION
  • IR in the 1980s
    • Single collection with < 10^6 documents (archive)
    • Boolean queries with unordered-set answer
  • IR circa 2000
    • Single collection with > 10^9 documents (web)
    • Free-form queries with ranked-list answer
  • IR circa 2010
    • Multiple collections, > 10^12 docs (invisible web)
    • "Find what I mean" queries with clustering,
      summarization and customization

21
Content for Rest of the Course (1)
  • See the course BB for the latest updates to the
    course schedule.
  • Under the Hood
  • The vector space model for retrieval
  • Building an inverted index
  • Term weighting and selection
  • Web spidering
  • Automated text categorization

22
Content for Rest of the Course (2)
  • IR Uses in eCommerce
  • How to make search engines work for you
  • How to build optimal search-attractive web sites
  • The business(es) of web-based information
  • Beyond Web Search Engines
  • Speech processing primer
  • Information extraction from web pages
  • Data mining primer
  • Multi-media applications
  • Business models

23
Optional Quick Review of Linear Algebra
  • If you know n-dimensional vectors, matrices,
    computing inner products, etc., then you do not
    need this review. You may take a break.
  • If you learned this material, but do not remember
    it, please stay and listen to refresh your
    knowledge.
  • If you never learned linear algebra, stay, listen
    and (optionally) read either:
  • G. Hadley. Linear Algebra. Addison-Wesley, 1961.
    Ch. 3.
  • Or Stephen W. Goode. An Introduction to
    Differential Equations and Linear Algebra.
    Prentice Hall, 1991. Ch. 3.