Research Overview for Harvard Medical Library - PowerPoint PPT Presentation

About This Presentation
Title:

Research Overview for Harvard Medical Library

Description:

Research Overview for Harvard Medical Library Andrew McCallum Associate Professor Computer Science Department University of Massachusetts Amherst – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Research Overview for Harvard Medical Library


1
Research Overview for Harvard Medical Library
  • Andrew McCallum
  • Associate Professor
  • Computer Science Department
  • University of Massachusetts Amherst

2
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

3
Personal History
  • PhD, University of Rochester
  • Machine Learning, Reinforcement Learning
  • Eye movements and short-term memory
  • Postdoc, Carnegie Mellon University
  • Machine Learning for Text, with Tom Mitchell
  • WebKB Project (met Hooman there)
  • Research Scientist, Just Research Labs
  • Information extraction from text, clustering...
  • Built Cora, a precursor to CiteSeer, in 1997
  • VP Research and Development, WhizBang Labs
  • Information extraction from the Web, $50M VC
    funding, 170 people
  • Job search subsidiary, FlipDog.com, sold to
    Monster.com
  • Associate Professor, UMass Amherst
  • #5 CS department in Artificial Intelligence
  • Strong ML; NSF Center for Intelligent
    Information Retrieval

4
Information Extraction and Synthesis
Laboratory (IESL)
  • Assoc. Prof. Andrew McCallum, Director
  • 9 PhD students
  • 2 postdocs
  • 3 undergraduates
  • 2 full-time staff programmers
  • 40 publications in the past 2 years
  • Grants from NSF, DARPA, DHS, Microsoft, IBM, IMS.
  • Collaborations with BBN, Aptima, BAE, IBM, SRI,
    ... MIT, Stanford, CMU, Princeton, UPenn, UWash,
    ...
  • 70 compute servers, 10 TB disk storage

5
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

6
Goal of my research
Mine actionable knowledge from unstructured text.
7
Extracting Job Openings from the Web
8
A Portal for Job Openings
9
Job Openings: Category = High Tech, Keyword = Java,
Location = U.S.
10
Data Mining the Extracted Job Information
11
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy
of Sciences
200k documents several millennia old:
  • Qing Dynasty Archives
  • memos
  • newspaper articles
  • diaries
12
IE from Research Papers
McCallum et al 98
13
IE from Research Papers
14
Mining Research Papers
Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004
Giles et al
15
What is Information Extraction
As a family of techniques
Information Extraction = segmentation +
classification + association + clustering
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying...
Extracted segments:
  • Microsoft Corporation
  • CEO
  • Bill Gates
  • Microsoft
  • Gates
  • Microsoft
  • Bill Veghte
  • Microsoft
  • VP
  • Richard Stallman
  • founder
  • Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
19
Larger Context
[Figure: processing pipeline]
Document collection → Spider → Filter → IE →
Database → Data Mining → Actionable knowledge
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links /
relations, events)
Actionable knowledge: Prediction, Outlier detection,
Decision support
20
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

21
Hidden Markov Models
HMMs: the standard sequence modeling tool in
genomics, music, speech, NLP, ...
[Figure: graphical model and finite state model.
States S_{t-1}, S_t, S_{t+1} are linked by
transitions; each state emits an observation
O_{t-1}, O_t, O_{t+1}. The model generates a state
sequence and an observation sequence
o1 o2 o3 o4 o5 o6 o7 o8]
Parameters, for all states S = {s1, s2, ...}:
  • Start state probabilities $P(s_t)$
  • Transition probabilities $P(s_t \mid s_{t-1})$
  • Observation (emission) probabilities
    $P(o_t \mid s_t)$, usually a multinomial over an
    atomic, fixed alphabet
Training: maximize probability of training
observations (with prior)
22
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example
sentence.
and a trained HMM with states:
  • person name
  • location name
  • background
Find the most likely state sequence (Viterbi):
Yesterday Rich Caruana spoke this example
sentence.
Any words said to be generated by the designated
"person name" state are extracted as a person name.
Person name: Rich Caruana
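The decoding step above can be sketched as follows. This is a minimal Viterbi sketch, not the system from the talk: the two-state model, all probabilities, and the `unseen` smoothing constant are hand-set assumptions chosen only so the example sentence decodes sensibly (the slide's "location name" state is left out for brevity).

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely state sequence for the observations."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s])
                + math.log(emit_p[s].get(observations[0], unseen))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s].get(observations[t], unseen))
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy, hand-set parameters (illustrative assumptions, not from the talk)
states = ["background", "person"]
start_p = {"background": 0.9, "person": 0.1}
trans_p = {
    "background": {"background": 0.8, "person": 0.2},
    "person": {"background": 0.5, "person": 0.5},
}
emit_p = {
    "background": {"Yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence.": 0.2},
    "person": {"Rich": 0.5, "Caruana": 0.5},
}

words = "Yesterday Rich Caruana spoke this example sentence.".split()
labels = viterbi(words, states, start_p, trans_p, emit_p)
# Extract every word the decoder assigned to the "person" state
person_name = " ".join(w for w, s in zip(words, labels) if s == "person")
print(person_name)
```

Working in log space avoids numerical underflow on long sequences; a trained HMM would supply the three probability tables instead of the toy values used here.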
23
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001 (500 citations)
Undirected graphical model, trained to
maximize conditional probability of output
sequence given input sequence
[Figure: finite state model and graphical model.
Output sequence (FSM states) y_{t-1}, y_t, y_{t+1},
y_{t+2}, y_{t+3}, labeled OTHER PERSON OTHER ORG
TITLE; input sequence (observations) x_{t-1}, x_t,
x_{t+1}, x_{t+2}, x_{t+3}: said Jones a Microsoft
VP]
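The formula on this slide did not survive in the transcript. For reference, the standard linear-chain CRF definition from Lafferty, McCallum and Pereira (2001) can be written as below; the symbols $\lambda_k$ (learned weights), $f_k$ (feature functions) and $Z(\mathbf{x})$ (normalizer) are the usual textbook notation and are assumed here, not taken from the slide.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\!\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Each feature $f_k$ may inspect the whole input sequence $\mathbf{x}$, which is what allows arbitrary, overlapping observation features.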
24
Linear-chain CRFs vs. HMMs
  • Comparable computational efficiency for inference
  • Features may be arbitrary functions of any or all
    observations
  • Parameters need not fully specify generation of
    observations; can require less training data
  • Easy to incorporate domain knowledge

25
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995
at 19.9 billion dollars, was slightly below
1994. Producer returns averaged $12.93 per
hundredweight, $0.19 per hundredweight
below 1994. Marketings totaled 154 billion
pounds, 1 percent above 1994. Marketings
include whole milk sold to plants and dealers as
well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk
were used on farms where produced, 8 percent
less than 1994. Calves were fed 78 percent of
this milk with the remainder consumed in
producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
                      Production of Milk and Milkfat 2/
        Number     -------------------------------------------------------------
Year    of         Per Milk Cow         Percentage of Fat   Total
        Milk Cows  -------------------  in All Milk         --------------------
        1/         Milk      Milkfat    Produced            Milk       Milkfat
--------------------------------------------------------------------------------
        1,000 Head --- Pounds ---       Percent             Million Pounds

1993    9,589      15,704    575        3.66                150,582    5,514.4
1994    9,500      16,175    592        3.66                153,664    5,623.7
1995    9,461      16,451    602        3.66                155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

26
Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003 SIGIR
100 documents from www.fedstats.gov
Labels
CRF
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ... (12 in all)

Features
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5 consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, time
    offsets {0,0}, {-1,0}, {0,1}, {1,2}.
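
The features listed above can be sketched roughly as follows. This is an illustrative sketch, not the feature code from Pinto et al.: the function names, the 0.5 activation threshold, and the whitespace-alignment measure are all assumptions made for the example.

```python
import re

# Offset pairs listed on the slide: features of line t+a conjoined with line t+b
OFFSET_PAIRS = [(0, 0), (-1, 0), (0, 1), (1, 2)]

def line_features(line, prev_line=""):
    """Compute a few layout features for one text line."""
    n = max(len(line), 1)
    # Column positions holding a space, used for the alignment check
    gaps = {i for i, ch in enumerate(line) if ch == " "}
    prev_gaps = {i for i, ch in enumerate(prev_line) if ch == " "}
    return {
        "pct_digit": sum(ch.isdigit() for ch in line) / n,
        "pct_alpha": sum(ch.isalpha() for ch in line) / n,
        "indented": line.startswith((" ", "\t")),
        "five_spaces": re.search(r" {5}", line) is not None,
        # Assumed measure: share of this line's spaces that sit in a column
        # where the previous line also has a space
        "ws_aligned": len(gaps & prev_gaps) / len(gaps) if gaps else 0.0,
    }

def active_names(feats, threshold=0.5):
    """Booleans are active when True; ratios when above an assumed threshold."""
    return {k for k, v in feats.items()
            if (v if isinstance(v, bool) else v > threshold)}

def conjunctions(active, t):
    """Conjoin active features of lines t+a and t+b for each offset pair."""
    out = set()
    for a, b in OFFSET_PAIRS:
        if 0 <= t + a < len(active) and 0 <= t + b < len(active):
            out.update(f"{f}@{a}&{g}@{b}"
                       for f in active[t + a] for g in active[t + b])
    return out

lines = [
    "Year    Milk Cows    Milk",
    "1993     9,589    15,704",
    "1994     9,500    16,175",
]
feats = [line_features(l, lines[i - 1] if i else "") for i, l in enumerate(lines)]
active = [active_names(f) for f in feats]
conj = conjunctions(active, 1)
print(sorted(conj))
```

In a CRF these indicator strings would become binary features on each line label; the conjunctions let the model notice, for example, that a digit-heavy line is followed by another digit-heavy line.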

27
Table Extraction Experimental Results
Pinto, McCallum, Wei, Croft, 2003 SIGIR
Line labels, percent correct
HMM               65
Stateless MaxEnt  85
CRF               95
28
IE from Research Papers
McCallum et al 99
29
IE from Research Papers
Field-level F1
Hidden Markov Models (HMMs)       75.6  (Seymore, McCallum, Rosenfeld, 1999)
Support Vector Machines (SVMs)    89.7  (Han, Giles, et al, 2003)
Conditional Random Fields (CRFs)  93.9  (Peng, McCallum, 2004)
Δ error: 40%
(Word-level accuracy is >99%)
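The "40" on this slide appears to be the relative error reduction of the CRF over the SVM result. A quick check, under the assumption that error is taken as 100 minus field-level F1:

```latex
\frac{(100 - 89.7) - (100 - 93.9)}{100 - 89.7}
  = \frac{10.3 - 6.1}{10.3}
  \approx 0.41
```

which is consistent with the roughly 40% figure shown.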
30
Other Successful Applications of CRFs
  • Information Extraction of gene and protein names
    from text
  • Winning teams from UPenn, UWisc, Stanford
  • ...using UMass CRF software
  • Gene finding in DNA sequences
  • Culotta, Kulp, McCallum 2005
  • New work at UPenn
  • Computer vision, OCR, music, robotics, reference
    matching, author resolution, ...protein fold
    recognition, ...

31
Automatically Annotating MedLine Abstracts
32
CRF String Edit Distance
[Figure: alignment of string 1 (x1) with string 2 (x2)]
x1 = W i l l i a m _ W . _ C o h o n
x2 = W i l l l e a m _ C o h e n
Alignment steps a = (a.i1, a.e, a.i2):
a.i1  a.e     a.i2
1     copy    1
2     copy    2
3     copy    3
4     copy    4
4     insert  5
5     subst   6
6     copy    7
7     copy    8
8     delete  8
9     delete  8
10    delete  8
11    copy    9
12    copy    10
13    copy    11
14    copy    12
15    subst   13
16    copy    14
joint complete data likelihood
conditional complete data likelihood
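To make the copy / subst / insert / delete operations concrete, here is a sketch of the classic unit-cost edit distance with a traceback. It is plain dynamic programming, not the CRF model from the talk (which learns operation weights and sums over all alignments); the function name `align` is an assumption for illustration.

```python
def align(x1, x2):
    """Return (cost, ops) for one minimum-cost alignment of x1 to x2.

    Unit cost for subst, insert and delete; zero cost for copy.
    """
    n, m = len(x1), len(x2)
    # cost[i][j]: cheapest way to turn x1[:i] into x2[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (x1[i - 1] != x2[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back one optimal alignment from the bottom-right cell
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (x1[i - 1] != x2[j - 1])):
            ops.append("copy" if x1[i - 1] == x2[j - 1] else "subst")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("delete")   # consume a character of x1 only
            i -= 1
        else:
            ops.append("insert")   # consume a character of x2 only
            j -= 1
    ops.reverse()
    return cost[n][m], ops

cost, ops = align("William_W._Cohon", "Willleam_Cohen")
print(cost, ops)
```

Several alignments can share the minimum cost, so the traceback here may pick a different one than the slide shows; the CRF version sidesteps that choice by summing probability over all alignments.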
33
CRF String Edit Distance FSM
[Figure: finite state machine with one state per edit
operation: subst, copy, insert, delete]
34
CRF String Edit Distance FSM
[Figure: from a Start state, two copies of the
edit-operation states (subst, copy, insert, delete):
one set for match (m = 1) and one set for
non-match (m = 0)]
35
CRF String Edit Distance FSM
x1 = Tommi Jaakkola, x2 = Tommi Jakola
Probability summed over all alignments in match
states (m = 1): 0.8
Probability summed over all alignments in
non-match states (m = 0): 0.2
36
CRF String Edit Distance FSM
x1 = Tom Dietterich, x2 = Tom Dean
Probability summed over all alignments in match
states (m = 1): 0.1
Probability summed over all alignments in
non-match states (m = 0): 0.9
37
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

38
Managing and Understanding Connections of People
in our Email World
Workplace effectiveness: ability to leverage
network of acquaintances. But filling the Contacts DB
by hand is tedious and incomplete.
[Figure: Email Inbox and WWW automatically fill the
Contacts DB]
39
System Overview
[Figure: system components: Email, CRF, names, WWW]
40
An Example
To: Andrew McCallum mccallum_at_cs.umass.edu
Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,
Key Words: Information extraction, social network,
Search for new people
41
Summary of Results
Example keywords extracted
Person: Keywords
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Contact info and name extraction performance (25
fields)
      Token Acc  Field Prec  Field Recall  Field F1
CRF   94.50      85.73       76.33         80.76
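As a consistency check on the table, the field-level F1 is the harmonic mean of field precision and recall (the standard definition, assumed here to be the one used):

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 85.73 \times 76.33}{85.73 + 76.33}
    \approx 80.76
```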
  1. Expert Finding: When solving some task, find
    friends-of-friends with relevant expertise.
    Avoid "stove-piping" in large orgs by
    automatically suggesting collaborators. Given a
    task, automatically suggest the right team for
    the job. (Hiring aid!)
  2. Social Network Analysis: Understand the social
    structure of your organization. Suggest
    structural changes for improved efficiency.

42
Social Network in an Email Dataset
43
Clustering words into topics with Latent
Dirichlet Allocation
Blei, Ng, Jordan 2003
Generative process (example in parentheses)
For each document:
    Sample a distribution over topics, θ
    (70% Iraq war, 30% US election)
    For each word in doc:
        Sample a topic, z (Iraq war)
        Sample a word from the topic, w (bombing)
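The generative process on this slide can be sketched in a few lines. This is a toy sketch using only the standard library: the two topics echo the slide's example, but the word lists, their probabilities, and the symmetric Dirichlet prior `alpha` are assumptions made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """topics: dict mapping topic name -> dict mapping word -> probability."""
    names = list(topics)
    # Sample a distribution over topics for this document
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Sample a topic z, then a word w from that topic
        z = rng.choices(names, weights=theta)[0]
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append((z, w))
    return theta, words

# Hypothetical topic-word distributions (not from the talk)
topics = {
    "Iraq war": {"bombing": 0.5, "troops": 0.3, "Baghdad": 0.2},
    "US election": {"vote": 0.5, "campaign": 0.3, "poll": 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(topics, alpha=[1.0, 1.0], n_words=10, rng=rng)
print(theta)
print(doc)
```

Inference runs this story in reverse: given only the words, it recovers plausible values of θ and z, for example by Gibbs sampling.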
44
Example topics induced from a large collection of
text

JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT
OPPORTUNITIES WORKING TRAINING SKILLS CAREERS
POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE
OPPORTUNITY EARN ABLE

SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK
RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS
BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD
SCIENTIST STUDYING SCIENCES

BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY
FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT
TENNIS TEAMS GAMES SPORTS BAT TERRY

FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL
POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION
FORCE MAGNETS BE MAGNETISM POLE INDUCED

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR
READ TOLD SETTING TALES PLOT TELLING SHORT FICTION
ACTION TRUE EVENTS TELLS TALE NOVEL

MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT
THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS
STRANGE FEELING WHOLE BEING MIGHT HOPE

DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED
SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS
PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY
INFECTIONS CERTAIN

WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK
TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL
DIVE DOLPHIN UNDERWATER

Tenenbaum et al
46
From LDA to Author-Recipient-Topic
(ART)
47
Inference and Estimation
  • Gibbs Sampling
  • Easy to implement
  • Reasonably fast

48
Enron Email Corpus
  • 250k email messages
  • 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has
requested an electronic copy of our final draft?
Are you OK with this? If so, the only version I
have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
49
Topics, and prominent senders / receivers
discovered by ART
Topic names, by hand
50
Topics, and prominent senders / receivers
discovered by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
51
Comparing Role Discovery
connection strength (A,B), by model:
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
52
Comparing Role Discovery: Tracy Geaconne vs. Dan
McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
53
Comparing Role Discovery: Lynn Blair vs. Kimberly
Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson:
Pipeline facilities planning
54
ART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
55
Groups and Topics
  • Input
  • Observed relations between people
  • Attributes on those relations (text, or
    categorical)
  • Output
  • Attributes clustered into topics
  • Groups of people---varying depending on topic

56
Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis,
Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C,
B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D,
E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F,
A) Acad(E, C) Acad(F, C)
[Figure: three adjacency matrices over students A-F:
in roster order A B C D E F; in roster order with
group labels G1 G2 G1 G2 G3 G3; and reordered as
A C B D E F with group labels G1 G1 G2 G2 G3 G3,
which makes the block structure visible]
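The reordering shown in the figure can be reproduced with a short sketch. The relation pairs and group labels are taken from the slide; the helper name `adjacency` and the sorting trick are assumptions for illustration.

```python
people = ["A", "B", "C", "D", "E", "F"]

# Acad(X, Y) pairs from the slide: X academically admires Y
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

# Group labels from the slide
group = {"A": "G1", "B": "G2", "C": "G1", "D": "G2", "E": "G3", "F": "G3"}

def adjacency(order, pairs):
    """Build a 0/1 matrix with rows and columns in the given order."""
    idx = {p: i for i, p in enumerate(order)}
    m = [[0] * len(order) for _ in order]
    for src, dst in pairs:
        m[idx[src]][idx[dst]] = 1
    return m

original = adjacency(people, acad)

# Reorder rows and columns so that members of a group are adjacent
by_group = sorted(people, key=lambda p: (group[p], p))
blocked = adjacency(by_group, acad)

for name, row in zip(by_group, blocked):
    print(name, group[name], " ".join(str(v) for v in row))
```

After reordering, every nonzero entry falls into a rectangular block (G1 admires G2, G2 admires G3, G3 admires G1), which is the pattern a group model tries to discover without being given the labels.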
57
Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations (Nowicki,
Snijders 2001)
[Figure: graphical model with Beta, Dirichlet,
Multinomial and Binomial distributions]
S: number of entities; G: number of groups
Enhanced with arbitrary number of groups in
Kemp, Griffiths, Tenenbaum 2004
58
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis,
Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C,
B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D,
E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F,
A) Acad(E, C) Acad(F, C)
Social Admiration: Soci(A, B) Soci(A, D) Soci(A,
F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B)
Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C)
Soci(D, E) Soci(E, B) Soci(E, D) Soci(E,
F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: two adjacency matrices. Academic
admiration, ordered A C B D E F, with groups
G1 G1 G2 G2 G3 G3; social admiration, ordered
A C E B D F, with groups G1 G1 G1 G2 G2 G2]
59
The Group-Topic Model: Discovering Groups and
Topics Simultaneously
[Figure: graphical model; node distributions: Beta,
Uniform, Dirichlet, Multinomial, Dirichlet,
Binomial, Multinomial]
60
Inference and Estimation
  • Gibbs Sampling
  • Many r.v.s can be integrated out
  • Easy to implement
  • Reasonably fast

We assume the relationship is symmetric.
61
Dataset 1: U.S. Senate
  • 16 years of voting records in the US Senate
    (1989-2005)
  • a Senator may respond Yea or Nay to a resolution
  • 3423 resolutions with text attributes (index
    terms)
  • 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit
insurance, protect the deposit insurance funds,
recapitalize the Bank Insurance Fund, improve
supervision and regulation of insured depository
institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. MI (introduced
3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law
No 102-242.
Index terms: Banks and banking; Accounting;
Administrative fees; Cost control; Credit; Deposit
insurance; Depressed areas; and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea;
Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond
(R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND),
Nay
62
Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education    Energy     Military Misc.  Economic
education    energy     government      federal
school       power      military        labor
aid          water      foreign         insurance
children     nuclear    tax             aid
drug         gas        congress        tax
students     petrol     aid             business
elementary   research   law             employee
prevention   pollution  policy          care

Group-Topic Model
Education + Domestic  Foreign       Economic   Social Security + Medicare
education             foreign       labor      social
school                trade         insurance  security
federal               chemicals     tax        insurance
aid                   tariff        congress   medical
government            congress      income     care
tax                   drugs         minimum    medicare
energy                communicable  wage       disability
research              diseases      business   assistance
63
Groups Discovered (US Senate)
Groups from topic Education + Domestic
64
Senators Who Change Coalition the Most, Dependent
on Topic
e.g. Senator Shelby (D-AL) votes
  • with the Republicans on Economic
  • with the Democrats on Education + Domestic
  • with a small group of maverick Republicans on
    Social Security + Medicaid
65
Dataset 2: The UN General Assembly
  • Voting records of the UN General Assembly (1990 -
    2003)
  • A country may choose to vote Yes, No or Abstain
  • 931 resolutions with text attributes (titles)
  • 192 countries in total
  • Also experiments later with resolutions from
    1960-2003

Vote on Permanent Sovereignty of Palestinian
People, 87th plenary meeting
The draft resolution on permanent sovereignty of the
Palestinian people in the occupied Palestinian
territory, including Jerusalem, and of the Arab
population in the occupied Syrian Golan over
their natural resources (document A/54/591) was
adopted by a recorded vote of 145 in favour to 3
against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil,
Canada, China, France, Germany, India, Japan, Mexico,
Netherlands, New Zealand, Pakistan, Panama,
Russian Federation, South Africa, Spain, Turkey,
and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan,
Uzbekistan, Zambia.
66
Topics Discovered (UN)
Mixture of Unigrams
Everything Nuclear  Human Rights  Security in Middle East
nuclear             rights        occupied
weapons             human         israel
use                 palestine     syria
implementation      situation     security
countries           israel        calls

Group-Topic Model
Nuclear Non-proliferation  Nuclear Arms Race  Human Rights
nuclear                    nuclear            rights
states                     arms               human
united                     prevention         palestine
weapons                    race               occupied
nations                    space              israel
67
Groups Discovered (UN)
The country lists for each group are ordered by
their 2005 GDP (PPP), and only 5 countries are
shown in groups that have more than 5 members.
68
Do We Get Better Groups with the GT Model?
Baseline Model
  1. Cluster bills into topics using mixture of
    unigrams
  2. Apply group model on topic-specific subsets of
    bills.
GT Model
  1. Jointly cluster topics and groups at the same time
    using the GT model.

Datasets  Avg. AI for Baseline  Avg. AI for GT  p-value
Senate    0.8198                0.8294          <.01
UN        0.8548                0.8664          <.01
Agreement Index (AI) measures group cohesion.
Higher is better.
69
Groups and Topics, Trends over Time (UN)
70
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

71
Previous Systems
73
Previous Systems
Cites
Research Paper
74
More Entities and Relations
[Figure: entity-relation diagram. Labels: Research
Paper, Person, Grant, University, Venue, Groups,
Expertise, Cites]
95
Neural Information Processing Conference Dataset
Volumes 0-12, spanning 1987-1999. Prepared by
Sam Roweis.
  • 1740 Papers
  • 13649 Unique words
  • 2,301,375 Words

96
Trends in 17 years of NIPS proceedings
97
Topic Distributions Conditioned on Time
topic mass (in vertical height)
time
98
Finding Topics in 1 million CS papers
200 topics and keywords automatically discovered.
99
Topical Bibliometric Impact Measures
Mann, Mimno, McCallum, 2006
  • Topical Citation Counts
  • Topical Impact Factors
  • Topical Longevity
  • Topical Diversity
  • Topical Precedence
  • Topical Transfer

100
Topical Diversity
101
Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
Low Diversity
High Diversity
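The entropy measure described above can be sketched as follows. This is a minimal illustration, assuming each citing paper has been assigned a single topic label; the function name and the toy topic labels are hypothetical, and the measure in Mann, Mimno and McCallum (2006) may be computed over soft topic distributions instead.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the topic distribution among citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All citations come from one topic: lowest possible diversity
low = topical_diversity(["speech"] * 8)

# Citations spread evenly over four topics: higher diversity
high = topical_diversity(["speech", "vision", "theory", "databases"] * 2)

print(low, high)
```

A paper cited from a single topic scores zero, while one cited evenly from four topics scores two bits, so higher values indicate broader cross-topic influence.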
102
Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
  • Speech Recognition
  • Some experiments on the recognition of speech,
    with one and two ears, E. Colin Cherry (1953)
  • Spectrographic study of vowel reduction, B.
    Lindblom (1963)
  • Automatic Lipreading to enhance speech
    recognition, Eric D. Petajan (1965)
  • Effectiveness of linear prediction
    characteristics of the speech wave for..., B.
    Atal (1974)
  • Automatic Recognition of Speakers from Their
    Voices, B. Atal (1976)

103
Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
104
Outline
  • Self and Lab Intro
  • Information Extraction and Data Mining
  • Research vignette: Conditional Random Fields
  • Research vignette: Social network analysis and
    Topic models
  • Demo: Rexa, a new Web portal for research
    literature
  • Future work, brainstorming, collaboration, next
    steps

105
Collaborative Avenues
  • Run and extend Rexa on Countway data
  • Research on bibliometric IE and matching
  • citation extraction in bio/med literature
  • author entity resolution (authority, coreference)
    in all PubMed
  • IE for proteins, genes, diseases, treatments,...
    from abstracts.
  • Research on bibliometric / paper-body mining
  • Topic analysis, discovery, impact / influence
    mapping
  • Keyphrase, topic-hierarchy discovery
  • Group-topic discovery on diseases, drugs,
    treatments, success...or on genes, proteins,
    etc.
  • Routing CFPs to potential PIs. Expert-finding.
  • Research on machine learning for bioinformatics
  • Research on bi-clustering gene data
  • Research on CRF string edit distance for bio seq
  • How might I help i2b2?
  • Some additional grant we might write together?
  • Next concrete steps, responsibilities,
    deliverables, timeline.

106
Summary
  • Traditionally, SNA examines links, but not the
    language content on those links.
  • Presented ART, a Bayesian network for messages
    sent in a social network; it captures topics and
    role-similarity.
  • RART explicitly represents roles.
  • Additional work
  • Group-Topic model discovers groups and clusters
    attributes of relations. (Wang, Mohanty,
    McCallum, LinkKDD 2005)

107
End of Talk