Title: Jaime Carbonell and Raj Reddy
1 Research Problems in Digital Libraries: Data Mining and Text Mining
- Jaime Carbonell and Raj Reddy
- Carnegie Mellon University
- April 21, 2006
- Talk presented at CS50 symposium at CMU
2 Keepers of the Faith
3 Digital Libraries and Universal Access to Information
- Create a Universal Digital Library containing all the books ever published
- Unfortunately many of the books are in English
- Not readable by over 80% of the population
4 Information Overload
- If we read a book every day, we can only read, at most, 40,000 books in a lifetime
- Having millions of books online and accessible creates an information overload
- "We have a wealth of information and scarcity of (human) attention!" (Herbert Simon)
- Multilingual search technology can help to reduce the overload
- Permits users to search very large databases quickly and reliably
- Independent of language and location
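The 40,000 figure can be checked with a few lines of arithmetic. This is a minimal sketch: the 100-year reading lifetime and the one-million-book library size are assumptions chosen as generous round numbers, not figures from the slide.

```python
BOOKS_PER_DAY = 1
DAYS_PER_YEAR = 365
READING_YEARS = 100  # assumption: a generous upper bound on a reading lifetime

# Upper bound on books read in a lifetime at one book per day.
lifetime_books = BOOKS_PER_DAY * DAYS_PER_YEAR * READING_YEARS

# Assumption: a library of one million books, for scale.
library_size = 1_000_000
fraction_readable = lifetime_books / library_size

print(lifetime_books)               # 36500
print(f"{fraction_readable:.2%}")   # 3.65%
```

Even under these generous assumptions a reader covers under 4% of a million-book library, which is why search, not browsing, has to carry the load.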
5 Understanding Language
- Books in non-native languages remain incomprehensible to most people
- Translation and Summarization essential for worldwide use
- Current translation systems are not yet perfect
- Significant improvements in language understanding systems in the past few decades
- Systems based on statistical and linguistic techniques have shown significant performance improvements
- Improve performance using machine learning
- Digitization projects will act as test bed for validating Language Understanding Systems Research
- e.g. The Million Book Digital Library Project
6 The Million Book Digital Library
- Collaborative venture among many countries including USA, China and India
- So far 400,000 books have been scanned in China and 200,000 in India
- Content is made freely available around the globe
- Those wishing to see the Video in the next slide should download from
- http://www.rr.cs.cmu.edu/MSRI.zip
7 (No Transcript)
8 Million Book Project Status
- 21 centers in India
- 17 centers in China
- 1 center in Egypt
- Planned: Australia and Europe
- About 600,000 books scanned
- About 120,000 accessible on the web from India
- http://dli.iiit.ac.in/
- Uses 8 TB of storage
- 10 TB server at CMU Library planned for July 2005
- 1,000,000 books by the end of 2007
- Capacity to scan a million pages a day expected to be operational by the end of 2006
9-20 (No Transcript)
21 Million Book Project Research Challenges
- Providing Access to Billions every day
- Distributed Cached Servers in every country and region
- Self-Healing Databases
- Easy-to-use interfaces for Billions
- Text Mining Challenges
- Multilingual Information Retrieval
- Summarization
- Text Categorization
- Named-Entity identification
- Novelty Detection
- Translation
22 Information Bill of Rights
- Get the right information
- To the right people
- At the right time
- On the right medium
- In the right language
- With the right level of detail
23 Relevant Text Mining Technologies
- Right information: IR (search engines)
- Right people: Classification, routing
- Right time: Anticipatory analysis
- Right medium: Info extraction, speech
- Right language: Machine translation
- Right level of detail: Summarization
24 The Right Information: Next Generation Search Engines
- Search Criteria Beyond Query-Relevance
- Google: Popularity (link density, click freq, ...)
- Vivisimo: Panoramic view (clustering and labeling)
- Information novelty (content differential, recency)
- Trustworthiness of source
- Appropriateness to user (difficulty level, ...)
- Hidden web: 10X visible web (Federated search)
- Find What I Mean Principle
- Search on semantically related terms
- Induce user profile from past history, etc.
- Disambiguate terms (e.g. Jordan)
25 Clustering (Vivisimo-style) Search vs Standard IR
(Figure: documents retrieved for a query, comparing standard IR output with cluster summaries)
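The contrast between a flat ranked list and cluster summaries can be sketched as follows. This is a toy illustration, not Vivisimo's algorithm: the leader-based greedy grouping, the Jaccard overlap measure, the `threshold` value, and the function names are all assumptions made for the example.

```python
from collections import Counter


def terms(text, query_terms):
    """Lower-cased word set of a result, minus the query's own words."""
    return {w for w in text.lower().split() if w not in query_terms}


def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(query, results, threshold=0.15, label_size=2):
    """Group results greedily and label each group with its most frequent terms."""
    query_terms = set(query.lower().split())
    term_sets = [terms(r, query_terms) for r in results]
    clusters = []  # lists of result indices; the first index is the leader
    for i, t in enumerate(term_sets):
        for members in clusters:
            # Join the first cluster whose leader overlaps enough with this result.
            if jaccard(t, term_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    summaries = []
    for members in clusters:
        counts = Counter(w for i in members for w in term_sets[i])
        # Sort by frequency, then alphabetically, so labels are deterministic.
        top = sorted(counts, key=lambda w: (-counts[w], w))[:label_size]
        summaries.append((" ".join(top), members))
    return summaries


results = [
    "Michael Jordan basketball career",
    "Jordan river water levels",
    "Michael Jordan basketball highlights",
    "Jordan river valley map",
]
for label, members in cluster_results("jordan", results):
    print(label, members)
```

For the ambiguous query "jordan", the sketch separates the basketball results from the river results and gives each group a short label, which is the "panoramic view" a user would scan instead of a single mixed list.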
26 MMR Ranking vs Standard IR
(Figure: documents around a query, ranked by MMR and by standard IR; λ controls spiral curl)
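MMR (Maximal Marginal Relevance) picks each next document by trading relevance to the query against redundancy with documents already picked, with a parameter conventionally written λ setting the balance. The sketch below is a minimal illustration under stated assumptions: bag-of-words cosine similarity stands in for both similarity measures, and the function names and the `lam` parameter name are chosen for the example.

```python
import math
from collections import Counter
from typing import List, Optional


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rank(query: str, docs: List[str], lam: float = 0.5,
             k: Optional[int] = None) -> List[int]:
    """Return document indices in MMR order.

    lam = 1.0 reduces to plain relevance ranking;
    smaller values penalize similarity to already selected documents.
    """
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    remaining = list(range(len(docs)))
    selected: List[int] = []
    k = len(docs) if k is None else min(k, len(docs))
    while len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(vecs[i], q)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [
    "jordan river water",
    "jordan river water",        # exact duplicate of the first document
    "jordan basketball player",
]
print(mmr_rank("jordan river", docs, lam=1.0))  # [0, 1, 2]
print(mmr_rank("jordan river", docs, lam=0.5))  # [0, 2, 1]
```

With `lam=1.0` the duplicate is ranked second because only relevance counts; with `lam=0.5` the duplicate drops below the less relevant but novel document.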
27 In The Right Level of Detail: Synthetic Document Summary
- Extractive combo (tracking, MMR, ...)
- Centrality of info
- KIT model relevant
- Novelty (vs last time)
- Entities, relations, dates, raw text
- Later: contradiction and attitude detection
- Combine CMU, IBM (NE and rel extraction), UMD (user model, summ), Stanford (contradiction detection)
(Figure labels: Sources; Texts (Eng, Arabic, Chinese, ...); Audio transcripts; Textual summary; Entities, Relations, ...; Novel; Attitude; mixed; Analyst zoom-in)
28 In the Right Language (MT)
(Figure: MT pyramid from Source (Arabic) to Target (English), labeled Syntactic Parsing, Semantic Analysis, Interlingua, Sentence Planning, Text Generation, Transfer Rules, and Direct: EBMT, SMT)
29 EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
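The "new" output on this slide comes from gluing together fragments seen in the two stored examples. The sketch below reproduces only that recombination step. It is a toy under stated assumptions: the fragment table is hand-aligned from the slide (the alignment is implied by the "new" output, not given explicitly), and the greedy longest-match lookup and the names `FRAGMENTS` and `translate` are hypothetical.

```python
# Hypothetical fragment table, hand-aligned from the two example pairs on the slide.
FRAGMENTS = {
    ("i", "would", "like", "to", "meet"): "Ayükefun trawüael",
    ("the", "tallest", "man"): "Chi doy fütra chi wentru",
}


def translate(sentence, fragments=FRAGMENTS):
    """Translate by greedy longest-match lookup of known source fragments."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            target = fragments.get(tuple(words[i:j]))
            if target is not None:
                out.append(target)
                i = j
                break
        else:
            # No stored fragment covers this word: mark it as untranslated.
            out.append(f"<{words[i]}>")
            i += 1
    return " ".join(out)


print(translate("I would like to meet the tallest man"))
# Ayükefun trawüael Chi doy fütra chi wentru
```

The result matches the slide's "new" line, not its "correct" line: plain concatenation ignores the agreement and word-form changes at the fragment boundary, which is the gap the slide's comparison highlights.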
30 Illustration of Multi-Engine MT
31 (Figure: map of MT projects by category)
Categories: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable
Projects: Letras, Avenue, MEMT, METEOR, Diplomat, Tongues, GEBMT, KANT, MT Lab, KBMT-89, JANUS, C-STAR I, Pangloss, RADD - MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, ThaiLator, Nespole, Lingwear, Semantic Annotation, Speechalator, Q&A, Extraction, CALL
32 Language of Life vocabulary
chemical groups, properties of AA
33 Evolutionary Methods for Discovering Sequence → Structure Mapping
(Figure: a multiple sequence alignment across Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm and Yeast, with the distribution of amino acids and the conserved properties across Rhodopsin)
34 Results: β-Helical Rung Prediction
- 1DBG: correctly identified 10 out of 11 rungs
35 Concluding Observations and Exaggerations
- Everything can be reduced to Information
- Information is the key to everything
- All natural information has an underlying language (genomics, linguistics, ...)
- Information is at all levels of granularity
- Subatomic → DNA/proteins → society → ...
- Information + language + computation = lifetime employment