Jaime Carbonell and Raj Reddy - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Research Problems in Digital Libraries: Data Mining and Text Mining
  • Jaime Carbonell and Raj Reddy
  • Carnegie Mellon University
  • April 21, 2006
  • Talk presented at CS50 symposium at CMU

2
Keepers of the Faith
3
Digital Libraries and Universal Access to Information
  • Create a Universal Digital Library containing all
    the books ever published
  • Unfortunately, many of the books are in English
  • Not readable by over 80% of the population

4
Information Overload
  • If we read a book every day, we can read, at most,
    40,000 books in a lifetime
  • Having millions of books online and accessible
    creates an information overload
  • "We have a wealth of information and a scarcity of
    (human) attention!" (Herbert Simon)
  • Multilingual search technology can help to reduce
    the overload
  • It permits users to search very large databases
    quickly and reliably, independent of language and
    location
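
The 40,000 figure on this slide can be checked with simple arithmetic. The sketch below is only illustrative: the 80-year reading span and the helper names are assumptions, not numbers from the talk.

```python
DAYS_PER_YEAR = 365.25  # average year length, leap years included


def books_in_lifetime(years_of_reading: float, books_per_day: float = 1.0) -> int:
    """Number of books finished at a steady daily reading rate."""
    return int(years_of_reading * DAYS_PER_YEAR * books_per_day)


def years_needed(total_books: int, books_per_day: float = 1.0) -> float:
    """Years of reading needed to finish total_books at that rate."""
    return total_books / (DAYS_PER_YEAR * books_per_day)


# Assumed 80 years of daily reading (hypothetical figure).
print(books_in_lifetime(80))           # 29220
# Years required to reach the slide's upper bound of 40,000 books.
print(round(years_needed(40_000), 1))  # 109.5
```

At one book a day, even the 40,000 bound takes more than a century of reading, which is why a collection of millions of books cannot be read, only searched.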

5
Understanding Language
  • Books in non-native languages remain
    incomprehensible to most people
  • Translation and Summarization are essential for
    worldwide use
  • Current translation systems are not yet perfect
  • Language understanding systems have improved
    significantly in the past few decades
  • Systems based on statistical and linguistic
    techniques have shown significant performance
    improvements
  • Performance can be improved using machine learning
  • Digitization projects will act as a test bed for
    validating Language Understanding Systems Research,
    e.g. the Million Book Digital Library Project

6
The Million Book Digital Library
  • Collaborative venture among many countries,
    including the USA, China and India
  • So far 400,000 books have been scanned in China
    and 200,000 in India
  • Content is made freely available around the globe
  • Those wishing to see the video in the next slide
    should download it from
    http://www.rr.cs.cmu.edu/MSRI.zip

7
(No Transcript)
8
Million Book Project Status
  • 21 centers in India
  • 17 centers in China
  • 1 center in Egypt
  • Planned: Australia and Europe
  • About 600,000 books scanned
  • About 120,000 accessible on the web from India:
    http://dli.iiit.ac.in/
  • Uses 8 TB of storage
  • 10 TB server at CMU Library planned for July 2005
  • 1,000,000 books by the end of 2007
  • Capacity to scan a million pages a day expected
    to be operational by the end of 2006

21
Million Book Project Research Challenges
  • Providing Access to Billions every day
  • Distributed Cached Servers in every country and
    region
  • Self-Healing Databases
  • Easy-to-use interfaces for Billions
  • Text Mining Challenges:
  • Multilingual Information Retrieval
  • Summarization
  • Text Categorization
  • Named-Entity identification
  • Novelty Detection
  • Translation

22
Information Bill of Rights
  • Get the right information
  • To the right people
  • At the right time
  • On the right medium
  • In the right language
  • With the right level of detail

23
Relevant Text Mining Technologies
  • IR (search engines): right information
  • Classification, routing: right people
  • Anticipatory analysis: right time
  • Info extraction, speech: right medium
  • Machine translation: right language
  • Summarization: right level of detail

24
The Right Information: Next Generation Search Engines
  • Search Criteria Beyond Query-Relevance
  • Google: Popularity (link density, click freq, ...)
  • Vivisimo: Panoramic view (clustering, labeling)
  • Information novelty (content differential,
    recency)
  • Trustworthiness of source
  • Appropriateness to user (difficulty level, ...)
  • Hidden web: 10X visible web (Federated search)
  • "Find What I Mean" Principle
  • Search on semantically related terms
  • Induce user profile from past history, etc.
  • Disambiguate terms (e.g. Jordan)

25
Clustering (Vivisimo-style) Search vs Standard IR
[Figure: a query and its documents, comparing standard IR output with cluster summaries]
26
MMR Ranking vs Standard IR
[Figure: documents around a query, comparing MMR ranking with standard IR ranking]
λ controls spiral curl
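
The parameter on this slide is presumably the trade-off weight of Maximal Marginal Relevance (MMR), usually written λ. MMR repeatedly picks the unselected document D that maximizes λ·Sim(D, query) − (1 − λ)·max Sim(D, D') over the already selected documents D'. The sketch below is a minimal illustration of that greedy rule. The word-overlap similarity, the function names, and the sample documents are assumptions made for the example, not material from the talk.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for any real similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mmr_rank(
    query: str,
    docs: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    lam: float = 0.7,
    k: int = 3,
) -> List[int]:
    """Return indices of up to k documents chosen greedily by MMR."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = sim(docs[i], query)
            # Redundancy is the similarity to the closest selected document.
            redundancy = max(
                (sim(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = "digital library search"
docs = [
    "digital library search engine",
    "digital library search engine software",
    "multilingual search",
]

print(mmr_rank(query, docs, lam=1.0))  # [0, 1, 2]  pure relevance ordering
print(mmr_rank(query, docs, lam=0.3))  # [0, 2, 1]  near-duplicate pushed down
```

With λ = 1 the ranking reduces to standard relevance ranking. Lowering λ penalizes documents that repeat what has already been selected, so the near-duplicate second document drops below the less relevant but novel third one.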
27
In the Right Level of Detail: Synthetic Document Summary
  • Extractive combo (tracking, MMR, ...)
  • Centrality of info
  • KIT model relevant
  • Novelty (vs last time)
  • Entities, relations, dates, raw text
  • Later: contradiction and attitude detection
  • Combine CMU, IBM (NE and rel extraction), UMD
    (user model, summ), Stanford (contradiction
    detection)

[Figure labels: Sources (mixed); Texts (Eng, Arabic, Chinese, ...); Audio transcripts; Entities, Relations, ...; Novel, Attitude; Textual summary; Analyst zoom-in]
28
In the Right Language (MT)
[Figure: MT approaches between Source (Arabic) and Target (English). Labels: Direct (EBMT, SMT); Syntactic Parsing; Transfer Rules; Text Generation; Semantic Analysis; Sentence Planning; Interlingua]
29
EBMT example
  • English: I would like to meet her.
    Mapudungun: Ayükefun trawüael fey engu.
  • English: The tallest man is my father.
    Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
  • English: I would like to meet the tallest man
    Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
    Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
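
The "(new)" output on this slide can be reproduced by a toy example-based translator that covers the input with the longest known fragments and concatenates their translations. The sketch below is only an illustration under stated assumptions: the sub-sentence alignments in FRAGMENTS are guessed from the two example pairs, and the function name is hypothetical. It is not the system described in the talk.

```python
from typing import Dict, List

# Fragment pairs assumed to be aligned from the slide's two example
# sentences. The split points are an illustrative guess.
FRAGMENTS: Dict[str, str] = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}


def ebmt_translate(sentence: str, fragments: Dict[str, str]) -> str:
    """Cover the input greedily with the longest known fragments."""
    words = sentence.lower().rstrip(".").split()
    output: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                output.append(fragments[chunk])
                i = j
                break
        else:
            # No fragment starts here: mark the word as untranslated.
            output.append(f"<{words[i]}>")
            i += 1
    return " ".join(output)


naive = ebmt_translate("I would like to meet the tallest man", FRAGMENTS)
print(naive)  # Ayükefun trawüael Chi doy fütra chi wentru
```

The concatenated result matches the slide's "(new)" line but not its "(correct)" line, which is the point of the example: recombining fragments without adjusting the pieces to their new context yields an imperfect translation.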
30
Illustration of Multi-Engine MT
31
[Figure: MT projects grouped by approach.
Approaches: Interlingua, Spoken Language, Multi-Engine, Example-Based, Statistical, Low Resource, Automatic MT Evaluation, Portable.
Projects: Letras, Avenue, MEMT, METEOR, Diplomat, Tongues, GEBMT, KANT, MT Lab, KBMT-89, JANUS, C-STAR I, Pangloss, RADD - MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, ThaiLator, Nespole, Lingwear, Speechalator.
Related areas: Semantic Annotation, Q&A, Extraction, CALL]
32
Language of Life vocabulary
chemical groups, properties of AA
33
Evolutionary Methods for Discovering Sequence → Structure Mapping
Distribution of amino acids
A Multiple Sequence Alignment
Human Monkey Mouse Rat Cow Dog Fly Worm Yeast
Conserved Properties across Rhodopsin
34
Results: β-Helical Rung Prediction
  • 1DBG: correctly identified 10 out of 11 rungs

35
Concluding Observations and Exaggerations
  • Everything can be reduced to Information
  • Information is the key to everything
  • All natural information has an underlying
    language (genomics, linguistics, ...)
  • Information is at all levels of granularity
  • Subatomic → DNA/proteins → society → ...
  • Information language computation lifetime
    employment