Title: Multimodal Information Access and Synthesis Research Review
1Multimodal Information Access and Synthesis
Research Review
- Dan Roth
- Department of Computer Science
- University of Illinois at Urbana/Champaign
2MIAS Mission
- Most of the data today is unstructured
  - books, newspaper articles, journal publications, reports, images, and audio and video streams.
- How to deal with the huge amount of unstructured data as if it was organized in a database with a known schema.
  - how to locate, organize, access and analyze unstructured data.
- MIAS Mission
  - develop the theories, algorithms, and tools for analysts to
    - access a variety of data formats and models
    - integrate them with existing resources
    - transform raw data into useful and understandable information.
3Task Perspective
- In the next decade, some people will need to
  - Monitor a multimodal stream of interesting events, entities, threats
  - Formulate and evaluate hypotheses with respect to them.
- Impossible to touch even a fraction of the data available
- Requires interaction, at the appropriate level of semantic abstraction, with a system that can
  - synthesize, summarize and interpret vast amounts of multimodal information,
  - integrate observed data with multimodal domain models and background information in multiple formats,
  - propose hypotheses, and help verify them.
4Information Access Management
- The key to allowing access to information is to
  - Match methods to evoke an information need and to
  - Transform information from one form to another to make it more accessible to the abilities available to a user.
- The solution lies in understanding the meaning of information and of people's interaction with it.
- Current tools tend to ignore the meaning of information, and operate on surface phenomena (particular words, image segments, and so on).
  - Doing so complicates all stages of the information use process.
- The key is matching the demand of interacting with information to the abilities available to a user when they do it.
  - This will allow access to information under unusual, but valuable, circumstances (e.g., when some modalities cannot be used temporarily)
5Scenario
- Consider an intelligence analyst researching a problem
  - Iranian nuclear program: generate a list of Iranian nuclear scientists, affiliations, specialties, biographies, photos, and notable recent activities.
- Current technologies have solved the problem of collecting and storing huge amounts of information; it would be reasonable to assume that the information she is after does exist
- However, multiple barriers exist on the way to successful analysis, synthesis and decision support, posing significant research challenges.
- Medical treatment: what is known about it, who are the experts, what do users say about it, what side effects have been reported
- Disease Outbreaks: what is known about it (say, Ebola, SARS), who are the experts, evidence for outbreak, side effects reported, where, when.
- Food Protection, Water Safety, Societal Infrastructures and Development project
6Multimodal Information Access Synthesis
[Figure: from online data sources to analysis and synthesis]
- Online Data Sources: Web pages, News Articles, Specific Web sites, Text Repositories, Relational databases, Surveillance Videos, ...
- Example labels in the figure: Saeed Zakeri, attended, U. of Tehran, visited, Elkhan Factory
- Modalities shown: Text documents, Relational data, Images
- Relational data: loc Northern Iran, name Elkhan Factory, topics fertilizer, enrichment
- Inferred annotations: Semantic categories, Temporal categories, Subjectivity/Opinions
- Analysis Synthesis
  - Discovering unusual events, entities, trends, threats, associations.
  - Tracking of events, entities, associations
  - Rapid retrieval of all multimodal information about a particular entity
  - Mapping to and augmenting institutional resources
  - Efficient search, querying, question answering, browsing.
- Processing stages
  - Focused Multimodal Data Retrieval
  - Infer Metadata, Semantic entities
  - Semantic Disambiguation, Integration across multiple sources and modalities
  - Discover Trends of Relations Between Semantic Entities
  - Meaning Based Transformation of Data for Presentation and Analysis Support.
  - Support Information Analysis, Knowledge Discovery, Monitoring
7MIAS Processes
Tools: Text Processing and Analysis, Semantic Analysis, Information Extraction, Information Integration, Machine Learning, Knowledge Discovery, Integrating Text and Images
- Focused data retrieval and integration
  - Identify and collect relevant data from multiple sources
- Semantic data enrichment: Real world Entities and Relations among them
  - Infer semantics from unstructured data and images
  - Identify real-world entities and relations among them
  - Extraction of attributes and relations features into a common framework (generalized graphs)
  - Relate them to existing institutional resources for information integration
- Trend Analysis
  - Tracking of events, social content, entities and topics
  - Knowledge discovery and hypotheses generation and verification
  - Construct the rich semantic structure and hidden networks of entity linkages
- Multifaceted output
  - Information extraction
  - Allow semantic based navigation and search across disparate data modalities
  - KR: Multi-view representation of the information as input to visualization tools.
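The "common framework (generalized graphs)" mentioned above can be pictured as an attributed graph whose nodes are entities and whose edges are typed relations tagged with the modality they came from. The following is only a minimal sketch under assumptions: the class `EntityGraph`, its method names, and the modality tags are hypothetical, and the edges are one possible reading of the example labels shown on the earlier overview slide (Saeed Zakeri, attended, U. of Tehran, visited, Elkhan Factory).

```python
from collections import defaultdict


class EntityGraph:
    """Hypothetical attributed graph: entities as nodes, typed relations as edges."""

    def __init__(self):
        self.attrs = defaultdict(dict)  # entity name -> attribute dict
        self.edges = []                 # (source, relation, target, modality)

    def add_entity(self, name, **attrs):
        # Merge new attributes into whatever is already known about the entity.
        self.attrs[name].update(attrs)

    def add_relation(self, src, rel, dst, modality):
        # Make sure both endpoints exist as nodes, then record the typed edge.
        self.add_entity(src)
        self.add_entity(dst)
        self.edges.append((src, rel, dst, modality))

    def neighbors(self, name):
        """Outgoing (relation, target) pairs for an entity, in insertion order."""
        return [(rel, dst) for s, rel, dst, _ in self.edges if s == name]


g = EntityGraph()
# Attributes taken from the relational-data labels on the overview slide.
g.add_entity("Elkhan Factory", loc="Northern Iran",
             topics=["fertilizer", "enrichment"])
# Assumed reading of the figure; the modality tags are illustrative only.
g.add_relation("Saeed Zakeri", "attended", "U. of Tehran", modality="text")
g.add_relation("Saeed Zakeri", "visited", "Elkhan Factory", modality="web page")

print(g.neighbors("Saeed Zakeri"))
print(g.attrs["Elkhan Factory"]["loc"])
```

Keeping the modality on each edge is one simple way to let later stages (trend analysis, navigation across modalities) trace a relation back to the kind of source it was extracted from.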
8Integrated Mission Research Education
- Develop diverse human resources to enhance the scientific research, educational, and governmental workforce in MIAS
- Educational and Outreach Initiatives
  - Target students from small research programs, minority-serving
  - Expose them to the national labs
  - Open opportunities for bigger impact
- A comprehensive education program designed to increase participation in the study and practice of MIAS topics
  - Provide substantive training for a new generation of experts in the field,
  - Serve as a tool for recruiting an experienced group of undergraduates into graduate study in one of the broad fields of information science
  - Be an intellectual community center, where participants at all levels of expertise come together in an enriched environment of collaboration.
9Data Science Summer Institute at UIUC
1st DSSI: May 2007, Huge Success. 2nd DSSI: May 2008
- Intensive Course in The Math of Data Sciences
  - Probability and Statistics
  - Linear Algebra
  - Data Structures and Algorithms
  - Optimization
  - Learning, Clustering
- Research Projects
  - Led by co-PIs and Grad Students
  - Topics
    - Virtual Web, Focused Crawling
    - Relations and Entities
    - Text and Images
- 8 weeks course, 27 students, Faculty from UIUC, Kansas State, UTSA, UTEP
- Advanced MIAS Related Tutorials
- Speaker Series
12Our Team
- Leading researchers in intelligent information access, analysis and its foundations
  - Machine Learning
  - Data Bases, Data Integration and Knowledge Discovery
  - Information Retrieval
  - Natural Language Processing
  - Machine Vision
  - Knowledge Representation and Reasoning
- A large number of affiliates/consultants covering all areas of interest to the MIAS center.
13Kevin C. Chang Data Integration and Retrieval
- Deep Web
  - MetaQuerier: Large-scale integration over deep Web
  - pioneered the holistic integration paradigm
- Widely published at SIGMOD, VLDB, ICDE
- System building and demo at CIDR, SIGMOD, VLDB
- PC/editor/organizers of SIGMOD, ICDE, WWW, SIGKDD special issue, WIRI06, IIWeb06 workshops
- Awards: NSF CAREER, IBM Faculty, NCSA Faculty, VLDB00 Best Paper Selection
14AnHai Doan Data Integration
- Data integration
  - Matching Schemas, Ontologies, Entities
  - Integrating databases and text
- ACM Doctoral Dissertation Award 2004
- Sloan Fellowship 2006
- Edited special issues on Data Integration
- Co-chaired workshops on data integration, Web technologies, machine learning
- Co-writing a book titled Data Integration
Monitoring People and Events
15David Forsyth Computer Vision and Learning
- Linking Text and Images
  - Labeling images via (a lot of) caption text
- Leading Computer Vision researcher,
  - Over 110 papers on vision, graphics, learning applications
  - Program Chair, CVPR 2000, General chair CVPR 2006, regular member of PC in all major vision conferences
  - IEEE Technical Achievement Award, 2006
  - Lead author of main textbook, widely adopted
16German supermodel Claudia Schiffer gave birth to a baby boy by Caesarian section January 30, 2003, her spokeswoman said. The baby is the first child for both Schiffer, 32, and her husband, British film producer Matthew Vaughn, who was at her side for the birth. Schiffer is seen on the German television show 'Bet It...?!' ('Wetten Dass...?!') in Braunschweig, on January 26, 2002. (Alexandra Winkler/Reuters)
British director Sam Mendes and his partner actress Kate Winslet arrive at the London premiere of 'The Road to Perdition', September 18, 2002. The films stars Tom Hanks as a Chicago hit man who has a separate family life and co-stars Paul Newman and Jude Law. REUTERS/Dan Chung
US President George W. Bush (L) makes remarks while Secretary of State Colin Powell (R) listens before signing the US Leadership Against HIV/AIDS, Tuberculosis and Malaria Act of 2003 at the Department of State in Washington, DC. The five-year plan is designed to help prevent and treat AIDS, especially in more than a dozen African and Caribbean nations (AFP/Luke Frazza)
18Jiawei Han Knowledge Discovery
- Patterns analysis and knowledge discovery from massive data
  - Research focus: Data streams, frequent patterns, sequential patterns, graph patterns, and their applications
  - Privacy preserving Data Analysis
- Developed many popular data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, and CrossClus
- Over 300 research papers published in conferences and journals
- Editor-in-Chief, ACM Transactions on Knowledge Discovery from Data
- Textbook, Data Mining: Concepts and Techniques, adopted worldwide
19Cinda Heeren Knowledge Discovery, Education
- MIAS Summer School Director
- Mathematical Foundation of Data Science Discrete
- UIUC CS Department Director of Diversity Programs
- Lecturer for Discrete Math, Data Structures courses at UIUC
- One of the leading teachers and educational leaders at UIUC.
- Research in Algorithmic Data Analysis and Data bases
- Speaker and regular presenter at conferences for young women, including
  - GAMES and WYSE summer camps
  - Expanding Your Horizons careers conference
  - Grace Hopper Celebration of Women in Computing 2004 - panel on best practices for recruiting women into undergraduate programs in CS.
  - SIGCSE 2006 - workshop, How to host your own Small Regional Celebration of Women in Computing.
Summer School
20ChengXiang Zhai Information Retrieval and Text
Analysis
- Probabilistic Paradigms for Information Retrieval
- Personalized/Context Dependent Search
- Relation Identification and Data Integration
- Leading expert in information retrieval and search technologies
- Recipient of the 2004 Presidential Early Career Award for Scientists and Engineers (PECASE),
- Main architect and key contributor of the Lemur Toolkit (being used by many research groups and IR companies around the world)
- ACM SIGIR04 best paper award
- Selected services include Program Chair of ACM CIKM 2004, HLT/NAACL 06, SIGIR09
21Dan Roth Machine Learning, NLP, Inference
- Semantic Analysis and Data enrichment
  - Entity and Relation Identification and Integration
  - Textual Entailment
  - Machine Learning Methods for NLP and IE
- Leading Researcher in Machine Learning, NLP, AI
  - Developed Popular Machine Learning system and machine learning based NLP tools used in industry and NLP classes.
  - Program Chair, ACL03, CoNLL02. Regular senior PC member in all major Machine Learning, NLP and AI conferences
  - Associate Editor: Journal of Artificial Intelligence Research, Machine Learning
  - Multiple paper awards
22Machine Learning, NLP, Reasoning Optimization
- Foundations
  - Learning Theory: Algorithmic and representational issues, high dimensions, dimensionality reduction
  - Learning protocols: how to minimize interaction (supervision), how to map domain/task information to supervision, semi-supervised learning, active learning, ranking, adaptation.
  - Constrained Conditional Models: Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. NLP Inference as Constrained Optimization.
- Natural Language Processing
  - Semantic Parsing
  - Question answering
  - Semantic Entailment
- Intelligent Information Access
  - Information Extraction
  - Named Entities and Relations
  - Matching Entities Mentions within and across documents and data bases
- Software
  - Many NLP and IE tools that are being used in research labs and industry
  - Basic tools development: SNoW, FEX, shallow parser, POS tagger, semantic parser, NER
  - Learning Based Programming
23Textual Entailment
Phrasal verb paraphrasing (Connor and Roth 07)
- Given
  - Q: Who acquired Overture?
- Determine
  - A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
Entity matching (Li et al., AAAI04, NAACL04)
Semantic Role Labeling
Inference for Entailment (AAAI05, TE07)
Is it true that? (Textual Entailment)
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
?
- Yahoo acquired Overture
- Overture is a search company
- Google is a search company
- Google owns Overture
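To make the difficulty of the entailment question concrete, here is a deliberately naive lexical-overlap baseline. It is not the inference approach cited on this slide; it is a sketch under assumptions, with a hypothetical one-entry paraphrase table (`took over` mapped to `acquired`) and a small hypothetical stop-word list. It scores a hypothesis by the fraction of its content words that also occur in the text.

```python
import re

# Hypothetical resources, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "by", "to"}
PARAPHRASES = {"took over": "acquired"}

TEXT = ("Eyeing the huge market potential, currently led by Google, "
        "Yahoo took over search company Overture Services Inc. last year")


def content_words(sentence):
    """Lower-case, apply the paraphrase table, and drop stop words."""
    s = sentence.lower()
    for phrase, canonical in PARAPHRASES.items():
        s = s.replace(phrase, canonical)
    return {w for w in re.findall(r"[a-z]+", s) if w not in STOP_WORDS}


def coverage(text, hypothesis):
    """Fraction of the hypothesis' content words that appear in the text."""
    hyp = content_words(hypothesis)
    return len(hyp & content_words(text)) / len(hyp)


for hypothesis in ["Yahoo acquired Overture",
                   "Google owns Overture",
                   "Overture acquired Yahoo"]:
    print(hypothesis, round(coverage(TEXT, hypothesis), 2))
```

The baseline gives full coverage to "Yahoo acquired Overture" once the phrasal verb is paraphrased, and a lower score to "Google owns Overture". It also gives full coverage to the reversed claim "Overture acquired Yahoo", because a bag of words cannot tell who did what to whom; that gap is what the entity matching, semantic role labeling, and inference components listed on the slide are meant to address.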
24Constrained Conditional Models
Subject to constraints
(Soft) constraints component
How to solve (for best assignment)? This is an Integer Linear Program. Solving it using ILP packages gives an exact solution. Search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
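The inference question on this slide can be illustrated on a toy problem. The sketch below assumes one simple form of the objective: the sum of local label scores minus a fixed penalty for each violated (soft) constraint. The scores, the label set, the penalty value, and the helper names are all hypothetical. Exhaustive enumeration stands in for an ILP solver here; it is exact but only feasible for very small inputs.

```python
from itertools import product

LABELS = ["AUTHOR", "TITLE", "DATE"]


def at_most_one_block(label):
    """Constraint factory: `label` may occupy at most one contiguous block."""
    def check(seq):
        starts = sum(1 for i, lab in enumerate(seq)
                     if lab == label and (i == 0 or seq[i - 1] != label))
        return starts <= 1
    return check


def ccm_infer(local_scores, constraints, penalty):
    """Best assignment under: sum(local scores) - penalty * (#violated constraints)."""
    best_seq, best_val = None, float("-inf")
    for seq in product(LABELS, repeat=len(local_scores)):
        val = sum(local_scores[i][lab] for i, lab in enumerate(seq))
        val -= penalty * sum(1 for c in constraints if not c(seq))
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val


# Hypothetical per-token scores from some local model.
scores = [
    {"AUTHOR": 2.0, "TITLE": 0.5, "DATE": 0.0},
    {"AUTHOR": 0.4, "TITLE": 0.6, "DATE": 0.0},
    {"AUTHOR": 1.0, "TITLE": 0.9, "DATE": 0.0},
    {"AUTHOR": 0.0, "TITLE": 0.2, "DATE": 1.5},
]
constraints = [at_most_one_block(label) for label in LABELS]

unconstrained, _ = ccm_infer(scores, [], penalty=0.0)
constrained, _ = ccm_infer(scores, constraints, penalty=5.0)
print(unconstrained)
print(constrained)
```

Without constraints, each token simply takes its locally best label, which splits AUTHOR into two separate blocks. With the penalty in place, the third token switches to TITLE even though AUTHOR scores slightly higher locally, which is the kind of global decision over interdependent local decisions the slide describes.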
25Semantic Categories
- Information Access and Extraction requires the identification of semantic categories in text.
Query: AIDS Treatment
Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No.1 killer of AIDS victims. (AP890616-0048, TIPSTER Vol. 1)
Relevant documents may mention specific types of treatments for AIDS
Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. (AP890118-0146, TIPSTER Vol. 1)
Many irrelevant documents mention AIDS and treatments for other diseases
- There is a need to identify that this phrase represents a name of an organization, a name of a person, a name of a disease, a medicine, etc.
- A narrow version of the problem is called named entity recognition (NER)
26Adaptation of Named Entity Recognition
- Entities are inherently ambiguous (e.g. JFK can be both a location and a person depending on the context)
- Can appear in various forms. Can be nested.
- Using lists is not sufficient
  - New entities are always being introduced
[Figure labels: New NE seen, NE seen]
- A lot of Machine Learning work, significant overfitting
- Key difficulties: Adaptation to
  - New domains/corpora
  - Slightly new definition of an entity
  - New languages
  - New types of entities
- How to reduce the requirements on the resources needed to produce a semantic categorization for a new domain/new language/new type of entities
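The two list-related points above (ambiguity, and new entities that no list contains) can be seen in a few lines. This is a minimal sketch with a hypothetical three-entry gazetteer and hypothetical helper names; it is not one of the NER tools described in this talk.

```python
# Hypothetical gazetteer: lower-cased surface form -> set of possible entity types.
GAZETTEER = {
    "jfk": {"PERSON", "LOCATION"},
    "dallas": {"LOCATION"},
    "kennedy": {"PERSON"},
}


def gazetteer_tag(tokens):
    """Pair each token with the sorted list of types the gazetteer allows."""
    return [(tok, sorted(GAZETTEER.get(tok.lower(), set()))) for tok in tokens]


def ambiguous(tagged):
    """Tokens for which the list alone cannot pick a single type."""
    return [tok for tok, types in tagged if len(types) > 1]


def unseen_capitalized(tagged):
    """Capitalized tokens missing from the list: candidate new entities."""
    return [tok for tok, types in tagged if not types and tok[:1].isupper()]


tagged = gazetteer_tag("Kennedy flew from JFK to Elkhan".split())
print(ambiguous(tagged))           # the list cannot decide person vs. location
print(unseen_capitalized(tagged))  # the list has never seen this name
```

Resolving the first case needs context, and handling the second needs a model that generalizes beyond the list, which is where the adaptation difficulties named on this slide come in.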
27NER Tools
Screen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp
- Work in progress
  - Un-supervised discovery of entities in other languages
  - Quick adaptation to new entity types and new domains.
28Extracting Relations
- Information Access and Extraction requires the identification of relations between concepts in text.
  - Relations expressed within a single sentence or paragraph
  - Relations uncovered by processing large quantities of text (over time)
- There is a need to identify concepts (e.g., entities) and relations that hold between them in a given sentence.
- Closed set of relations
  - A causes B
  - A works for B
  - A prevents B
  - A lives in B
- Open ended set of relations
  - Every predicate can be a relation
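For the closed-set case, a surface-pattern extractor is the simplest possible baseline. The sketch below is an assumption-laden illustration: the regular expressions, the relation names `works_for` and `lives_in`, and the example sentences are all hypothetical, and arguments are crudely approximated as runs of capitalized words.

```python
import re

# A run of capitalized words, used as a crude stand-in for an entity mention.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical surface patterns for two relations from the closed set.
PATTERNS = {
    "works_for": re.compile(rf"(?P<a>{NAME}) works for (?P<b>{NAME})"),
    "lives_in": re.compile(rf"(?P<a>{NAME}) lives in (?P<b>{NAME})"),
}


def extract(text):
    """Return (relation, A, B) triples for every pattern match in the text."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            triples.append((relation, m.group("a"), m.group("b")))
    return triples


print(extract("Alice Smith works for Acme Corp. Bob lives in Urbana."))
print(extract("Acme Corp employs Alice Smith."))
```

The second call returns an empty list: the same relation expressed with a different verb is invisible to fixed patterns. That variability is the reason a deeper level of analysis is needed for relation extraction.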
29Extracting Relations via Semantic Analysis
Screen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp
- Semantic parsing reveals several relations in the sentence along with their arguments.
- This level of analysis, however, cannot abstract over the inherent variability in expressing the relations.
  - Kill and Explode can be expressed in many different ways.
30Information extraction ACL07
with Background Knowledge (Constraints)
- Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
- Prediction result of a trained HMM
  - AUTHOR: Lars Ole Andersen . Program analysis and
  - TITLE: specialization for the
  - EDITOR: C
  - BOOKTITLE: Programming language
  - TECH-REPORT: . PhD thesis .
  - INSTITUTION: DIKU , University of Copenhagen , May
  - DATE: 1994 .
Violates lots of constraints!
31Information Extraction ACL07
with (background Knowledge) Constraints
- Learn simple models.
- Add constraints, to improve model expressivity and get correct results!
  - AUTHOR: Lars Ole Andersen .
  - TITLE: Program analysis and specialization for the C Programming language .
  - TECH-REPORT: PhD thesis .
  - INSTITUTION: DIKU , University of Copenhagen ,
  - DATE: May, 1994 .
- If incorporated into semi-supervised training, better results mean Better Feedback!
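One way to see what "violates lots of constraints" means is to check a single, easily stated constraint against both segmentations of the citation. The constraint used here (a field may only end on a punctuation token) is an assumption chosen for illustration, as are the function name and the pairing of labels with segments, which follows the order in which they appear on the slides.

```python
PUNCTUATION = {".", ",", ";", ":"}


def boundary_violations(segments):
    """Count field boundaries that do not fall right after a punctuation token."""
    # The last segment ends the citation, so it has no following boundary.
    return sum(1 for _, tokens in segments[:-1] if tokens[-1] not in PUNCTUATION)


# Segmentation predicted by the trained HMM.
hmm_output = [
    ("AUTHOR", "Lars Ole Andersen . Program analysis and".split()),
    ("TITLE", "specialization for the".split()),
    ("EDITOR", ["C"]),
    ("BOOKTITLE", "Programming language".split()),
    ("TECH-REPORT", ". PhD thesis .".split()),
    ("INSTITUTION", "DIKU , University of Copenhagen , May".split()),
    ("DATE", "1994 .".split()),
]

# Segmentation obtained once constraints are added.
constrained_output = [
    ("AUTHOR", "Lars Ole Andersen .".split()),
    ("TITLE", "Program analysis and specialization for the C Programming language .".split()),
    ("TECH-REPORT", "PhD thesis .".split()),
    ("INSTITUTION", "DIKU , University of Copenhagen ,".split()),
    ("DATE", "May 1994 .".split()),
]

print(boundary_violations(hmm_output))
print(boundary_violations(constrained_output))
```

A check like this can be used as a hard filter or, as in a constrained conditional model, turned into a penalty term during inference, so that a simple learned model is steered toward well-formed outputs.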
32Why is it difficult?
Meaning
Language
33The Reference Problem
The same problem exists with other types of entities
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
34Entity/Concept Identification in Text
- Goal: Given names in text documents and their semantic types, identify real-world entities they represent.
  - A similarity measure between names, entity type dependent
  - A way to group different looking strings into one group
  - A context sensitive way to distinguish between identical/similar strings that represent different entities
- A generative Model
  - Li, Morie, Roth, NAACL04
- A discriminative approach
  - Li, Morie, Roth, AAAI04
- Summary: AI Magazine Special Issue on Semantic Integration 05
- Goal: Semantic Integration: Text, Databases and Institutional Resources
  - Map concepts identified in text to entries in databases.
  - Construct/augment databases from textual information.
  - Aid discovery in text using existing knowledge bases.
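The first two ingredients listed on this slide, a name similarity measure and a way to group different-looking strings, can be sketched for person names as follows. This is not the generative or discriminative model cited above; it is a hand-written heuristic under assumptions (family name last, a hypothetical title list, initials compatible with full given names, greedy grouping with an arbitrary threshold).

```python
TITLES = {"sen", "president", "mr", "mrs", "dr"}  # hypothetical title list


def name_tokens(name):
    """Lower-case tokens with periods stripped and titles removed."""
    tokens = [t.strip(".").lower() for t in name.replace(",", " ").split()]
    return [t for t in tokens if t and t not in TITLES]


def compatible(a, b):
    """Given-name tokens agree if equal or if one is an initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def name_similarity(n1, n2):
    """0.0 on conflict; otherwise the share of name parts the two strings agree on."""
    t1, t2 = name_tokens(n1), name_tokens(n2)
    if not t1 or not t2 or t1[-1] != t2[-1]:
        return 0.0  # family names must match
    short, long_ = sorted((t1[:-1], t2[:-1]), key=len)
    matched = sum(1 for s in short if any(compatible(s, g) for g in long_))
    if matched < len(short):
        return 0.0  # conflicting given names, e.g. David vs. John
    return (matched + 1) / (len(long_) + 1)


def group_names(names, threshold=0.5):
    """Greedy grouping: join the first group containing a similar enough member."""
    groups = []
    for name in names:
        for group in groups:
            if max(name_similarity(name, member) for member in group) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


mentions = ["John F. Kennedy", "Sen. John F. Kennedy", "J. Kennedy", "David Kennedy"]
print(group_names(mentions))
print(name_similarity("Kennedy", "John F. Kennedy"),
      name_similarity("Kennedy", "David Kennedy"))
```

A bare "Kennedy" is weakly and equally similar to every full name that shares the family name, so string similarity alone cannot place it. That is exactly the case where the third ingredient on the slide, a context sensitive way to distinguish similar strings, is required.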
35Demo
Screen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp
More work on this problem: Scaling up, Integration with DBs, Temporal Integration/Inference
Related Entities, Context
36MIAS Processes
- Focused data retrieval and integration
  - Identify and collect relevant data from multiple sources
- Semantic data enrichment: Real world Entities and Relations among them
  - Infer semantics from unstructured data and images
  - Identify real-world entities and relations among them
  - Extraction of attributes and relations features into a common framework (generalized graphs)
  - Relate them to existing institutional resources for information integration
- Trend Analysis
  - Tracking of events, social content, entities and topics
  - Knowledge discovery and hypotheses generation and verification
  - Construct the rich semantic structure and hidden networks of entity linkages
- Multifaceted output
  - Information extraction
  - Allow semantic based navigation and search across disparate data modalities
  - KR: Multi-view representation of the information as input to visualization tools.
Tools: Text Processing and Analysis, Semantic Analysis, Information Extraction, Information Integration, Machine Learning, Data Mining, Integrating Text and Images
37MIAS - DSSI