Title: Research Overview for Harvard Medical Library
1Research Overview for Harvard Medical Library
- Andrew McCallum
- Associate Professor
- Computer Science Department
- University of Massachusetts Amherst
2Outline
- Self and Lab Intro
- Information Extraction and Data Mining
- Research vignette: Conditional Random Fields
- Research vignette: Social network analysis and Topic models
- Demo: Rexa, a new Web portal for research literature
- Future work, brainstorming, collaboration, next steps
3Personal History
- PhD, University of Rochester
- Machine Learning, Reinforcement Learning,
- Eye movements and short-term memory
- Postdoc, Carnegie Mellon University
- Machine Learning for Text, with Tom Mitchell
- WebKB Project (met Hooman there)
- Research Scientist, Just Research Labs
- Information extraction from text, clustering...
- Built Cora, a precursor to CiteSeer, in 1997.
- VP Research & Development, WhizBang Labs
- Information extraction from the Web, $50M VC funding, 170 people
- Job search subsidiary, FlipDog.com, sold to Monster.com
- Associate Professor, UMass Amherst
- 5 CS department in Artificial Intelligence
- Strong ML, NSF Center for Intelligent Information Retrieval
4Information Extraction and Synthesis Laboratory (IESL)
- Assoc. Prof. Andrew McCallum, Director
- 9 PhD students
- 2 postdocs
- 3 undergraduates
- 2 full-time staff programmers
- 40 publications in the past 2 years
- Grants from NSF, DARPA, DHS, Microsoft, IBM, IMS.
- Collaborations with BBN, Aptima, BAE, IBM, SRI, ... MIT, Stanford, CMU, Princeton, UPenn, UWash, ...
- 70 compute servers, 10 TB disk storage
5Outline
- Self Lab Intro
- Information Extraction and Data Mining.
- Research vingette Conditional Random Fields
- Research vingette Social network analysis
Topic models - Demo Rexa, a new Web portal for research
literature - Future work, brainstorming, collaboration, next
steps
6Goal of my research
Mine actionable knowledge from unstructured text.
7Extracting Job Openings from the Web
8A Portal for Job Openings
9Job Openings Category High Tech Keyword Java
Location U.S.
10Data Mining the Extracted Job Information
11IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k documents several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
12 IE from Research Papers
McCallum et al 98
13IE from Research Papers
14Mining Research Papers
Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004
Giles et al
15What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying...
Extracted segments:
- Microsoft Corporation
- CEO
- Bill Gates
- Microsoft
- Gates
- Microsoft
- Bill Veghte
- Microsoft
- VP
- Richard Stallman
- founder
- Free Software Foundation
18What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
19Larger Context
Document collection
-> Spider
-> Filter
-> IE: Segment, Classify, Associate, Cluster
-> Database
-> Data Mining: Discover patterns (entity types, links / relations, events)
-> Actionable knowledge: Prediction, Outlier detection, Decision support
20Outline
- Self and Lab Intro
- Information Extraction and Data Mining
- Research vignette: Conditional Random Fields
- Research vignette: Social network analysis and Topic models
- Demo: Rexa, a new Web portal for research literature
- Future work, brainstorming, collaboration, next steps
21Hidden Markov Models
HMMs---the standard sequence modeling tool in genomics, music, speech, NLP, ...
Graphical model / finite state model: a chain of states ... S_{t-1}, S_t, S_{t+1} ... linked by transitions, with each state emitting one observation ... O_{t-1}, O_t, O_{t+1} ...
Generates:
- State sequence
- Observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8
Parameters, for all states S = {s_1, s_2, ...}:
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior)
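The three parameter families on this slide combine into the joint probability of a state sequence and an observation sequence. The formula below is the standard first-order HMM factorization, written as a sketch in the slide's own symbols; the sequence length T is an added symbol, and the term P(s_1 | s_0) is read as the start state probability.

```latex
P(\mathbf{s}, \mathbf{o}) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)
```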
22IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi):
Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Rich Caruana
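A minimal sketch of the Viterbi step described on this slide, assuming a tiny hand-made HMM. All probabilities below are invented for illustration (they are not trained values from the talk), and the `unk` floor for unseen words is a crude stand-in for real smoothing.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for the observations."""
    def emit(state, word):
        # Unseen words get a small floor probability (assumed smoothing).
        return math.log(emit_p[state].get(word, unk))

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + emit(s, observations[0]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + emit(s, observations[t])
            back[t][s] = prev

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model with the three states named on the slide.
states = ["background", "person", "location"]
start_p = {"background": 0.8, "person": 0.1, "location": 0.1}
trans_p = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.4, "person": 0.5, "location": 0.1},
    "location": {"background": 0.4, "person": 0.1, "location": 0.5},
}
emit_p = {
    "background": {w: 0.2 for w in ["Yesterday", "spoke", "this", "example", "sentence"]},
    "person": {"Rich": 0.5, "Caruana": 0.5},
    "location": {"Amherst": 1.0},
}

tokens = "Yesterday Rich Caruana spoke this example sentence".split()
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
person_tokens = [w for w, s in zip(tokens, labels) if s == "person"]
print(" ".join(person_tokens))
```

Working in log space avoids underflow on longer sequences; the extraction rule is then just "keep the tokens whose decoded state is the person-name state".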
23(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
Graphical model / finite state model (the FSM states are the output labels):
output seq y:    y_{t-1}  y_t     y_{t+1}  y_{t+2}    y_{t+3}
                 OTHER    PERSON  OTHER    ORG        TITLE
input seq x:     x_{t-1}  x_t     x_{t+1}  x_{t+2}    x_{t+3}
(observations)   said     Jones   a        Microsoft  VP
(500 citations)
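The equation on this slide did not survive transcription. For reference, the usual way to write the linear-chain CRF conditional probability is sketched below; the weights \lambda_k, feature functions f_k, and sequence length T are symbols assumed here, not taken from the transcript.

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad \text{where} \qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Each feature function may inspect the whole input sequence, which is what allows features to be arbitrary functions of any or all observations.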
24Linear-chain CRFs vs. HMMs
- Comparable computational efficiency for inference
- Features may be arbitrary functions of any or all observations
- Parameters need not fully specify generation of observations; can require less training data
- Easy to incorporate domain knowledge
25Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
------------------------------------------------------------------------------
                     Production of Milk and Milkfat 2/
       Number   --------------------------------------------------------------
Year   of       Per Milk Cow        Percentage of Fat    Total
       Milk     -----------------   in All Milk          ------------------
       Cows 1/  Milk     Milkfat    Produced             Milk      Milkfat
------------------------------------------------------------------------------
       1,000    --- Pounds ---      Percent              Million Pounds
       Head
1993   9,589    15,704   575        3.66                 150,582   5,514.4
1994   9,500    16,175   592        3.66                 153,664   5,623.7
1995   9,461    16,451   602        3.66                 155,644   5,694.3
------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
26Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003 SIGIR
100 documents from www.fedstats.gov
Labels
CRF
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ... (12 in all)
Features
- Percentage of digit chars
- Percentage of alpha chars
- Indented
- Contains 5 consecutive spaces
- Whitespace in this line aligns with prev.
- ...
- Conjunctions of all previous features, time offsets: {0,0}, {-1,0}, {0,1}, {1,2}.
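The per-line features listed above can be sketched as a small function. This is an illustrative guess at how such features might be computed, not the code used in the paper: the denominator (non-whitespace characters), the alignment test (a shared column start after a run of two or more spaces), and the names `line_features` and `column_starts` are all assumptions.

```python
import re

def column_starts(line):
    """Column positions where a token starts right after 2+ spaces."""
    return {m.end() for m in re.finditer(r" {2,}(?=\S)", line)}

def line_features(line, prev_line=""):
    """Layout features for one text line, given the previous line."""
    chars = [c for c in line if not c.isspace()]
    n = len(chars) or 1  # avoid division by zero on blank lines
    return {
        "pct_digit": sum(c.isdigit() for c in chars) / n,
        "pct_alpha": sum(c.isalpha() for c in chars) / n,
        "indented": line.startswith((" ", "\t")),
        "five_spaces": " " * 5 in line,
        "aligns_with_prev": bool(column_starts(line) & column_starts(prev_line)),
    }

row_1993 = "1993     9,589    15,704"
row_1994 = "1994     9,500    16,175"
prose = "Cash receipts from marketings of milk during 1995"

table_feats = line_features(row_1994, prev_line=row_1993)
prose_feats = line_features(prose)
print(table_feats)
print(prose_feats)
```

A data row scores high on digits and aligns with its neighbour, while a prose line is almost all letters and has no column structure, which is the contrast a line-labeling model can exploit.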
27Table Extraction Experimental Results
Pinto, McCallum, Wei, Croft, 2003 SIGIR
Line labels, percent correct:
HMM               65
Stateless MaxEnt  85
CRF               95
28 IE from Research Papers
McCallum et al 99
29IE from Research Papers
Field-level F1:
Hidden Markov Models (HMMs)       75.6  Seymore, McCallum, Rosenfeld, 1999
Support Vector Machines (SVMs)    89.7  Han, Giles, et al, 2003
Conditional Random Fields (CRFs)  93.9  Peng, McCallum, 2004
Δ error: 40%
(Word-level accuracy is >99%)
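The error figure of 40 on this slide is consistent with a relative error reduction from the SVM result to the CRF result, if error is taken to be 100 minus field-level F1. That reading is an assumption; the slide does not say how the figure was computed. A short check:

```python
def relative_error_reduction(f1_old, f1_new):
    """Fraction of the old error (100 - F1) removed by the new system."""
    err_old = 100.0 - f1_old
    err_new = 100.0 - f1_new
    return (err_old - err_new) / err_old

# SVM (89.7) versus CRF (93.9), the two strongest rows on the slide.
reduction = relative_error_reduction(89.7, 93.9)
print(f"{reduction:.1%}")
```

The error falls from 10.3 to 6.1 points, so about four tenths of the remaining error is removed.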
30Other Successful Applications of CRFs
- Information Extraction of gene and protein names from text
- Winning teams from UPenn, UWisc, Stanford
- ...using UMass CRF software
- Gene finding in DNA sequences
- Culotta, Kulp, McCallum 2005
- New work at UPenn
- Computer vision, OCR, music, robotics, reference matching, author resolution, ...protein fold recognition, ...
31Automatically Annotating MedLine Abstracts
32CRF String Edit Distance
x1 (string 1): W i l l i a m _ W . _ C o h o n
x2 (string 2): W i l l l e a m _ C o h e n
alignment:
a.i1: 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
a.e:  copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
a.i2: 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14
joint complete data likelihood
conditional complete data likelihood
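To make the copy / subst / insert / delete alignment concrete, here is a sketch of classical unit-cost edit distance with a backtrace that emits the same four operation names. This is the plain, unlearned dynamic program, not the CRF model of the talk (which learns operation weights and sums over alignments), and ties between equally cheap alignments may be broken differently from the slide.

```python
def align(x1, x2):
    """Return (edit distance, list of operations turning x1 into x2)."""
    n, m = len(x1), len(x2)
    # dist[i][j]: edit distance between x1[:i] and x2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (x1[i - 1] != x2[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one best alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (x1[i - 1] != x2[j - 1]):
            ops.append("copy" if x1[i - 1] == x2[j - 1] else "subst")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")  # consumes a character of x1 only
            i -= 1
        else:
            ops.append("insert")  # consumes a character of x2 only
            j -= 1
    ops.reverse()
    return dist[n][m], ops

cost, ops = align("William_W._Cohon", "Willleam_Cohen")
print(cost, ops)
```

As on the slide, "delete" advances only the position in string 1 and "insert" advances only the position in string 2.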
33CRF String Edit Distance FSM
Edit operations: subst, copy, insert, delete
34CRF String Edit Distance FSM
Start branches into two copies of the edit-operation states:
- match (m = 1): subst, copy, insert, delete
- non-match (m = 0): subst, copy, insert, delete
35CRF String Edit Distance FSM
x1 = Tommi Jaakkola, x2 = Tommi Jakola
- match (m = 1): probability summed over all alignments in match states = 0.8
- non-match (m = 0): probability summed over all alignments in non-match states = 0.2
36CRF String Edit Distance FSM
x1 = Tom Dietterich, x2 = Tom Dean
- match (m = 1): probability summed over all alignments in match states = 0.1
- non-match (m = 0): probability summed over all alignments in non-match states = 0.9
37Outline
- Self and Lab Intro
- Information Extraction and Data Mining
- Research vignette: Conditional Random Fields
- Research vignette: Social network analysis and Topic models
- Demo: Rexa, a new Web portal for research literature
- Future work, brainstorming, collaboration, next steps
38Managing and Understanding Connections of People in our Email World
Workplace effectiveness depends on the ability to leverage a network of acquaintances.
But filling Contacts DB by hand is tedious, and incomplete.
[Diagram: Email Inbox and WWW -> Automatically -> Contacts DB]
39System Overview
[Diagram components: Email, names, WWW, CRF]
40An Example
To: Andrew McCallum mccallum_at_cs.umass.edu
Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, ...
Key Words: Information extraction, social network, ...
Search for new people
41Summary of Results
Example keywords extracted (Person: Keywords)
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Contact info and name extraction performance (25 fields)
Token Acc Field Prec Field Recall Field F1
CRF 94.50 85.73 76.33 80.76
- Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid stove-piping in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
- Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
42Social Network in an Email Dataset
43Clustering words into topics with Latent Dirichlet Allocation
Blei, Ng, Jordan 2003
Generative Process (with example)
For each document:
- Sample a distribution over topics, θ (example: 70% Iraq war, 30% US election)
For each word in doc:
- Sample a topic, z (example: Iraq war)
- Sample a word from the topic, w (example: bombing)
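The generative process on this slide can be sketched with the Python standard library alone. The two topics, their word probabilities, and the Dirichlet parameter `alpha` below are invented for illustration; only the three sampling steps come from the slide.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """topics: list of {word: probability}. Returns (theta, [(z, w), ...])."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # sample a topic
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]  # sample a word
        words.append((z, w))
    return theta, words

# Hypothetical topics echoing the slide's example labels.
topics = [
    {"bombing": 0.5, "troops": 0.5},   # "Iraq war"
    {"vote": 0.5, "campaign": 0.5},    # "US election"
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[1.0, 1.0], n_words=20, rng=rng)
print(theta)
print(words)
```

Inference runs this process in reverse: given only the words, it recovers plausible values for θ and for the topic assignment z of each word.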
44Example topics induced from a large collection of text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Tenenbaum et al
46From LDA to Author-Recipient-Topic (ART)
47Inference and Estimation
- Gibbs Sampling
- Easy to implement
- Reasonably fast
48Enron Email Corpus
- 250k email messages
- 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
49Topics, and prominent senders / receivers discovered by ART
Topic names, by hand
50Topics, and prominent senders / receivers discovered by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
51Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
52Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
53Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson: Pipeline facilities planning
54ART Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
55Groups and Topics
- Input
- Observed relations between people
- Attributes on those relations (text, or categorical)
- Output
- Attributes clustered into topics
- Groups of people---varying depending on topic
56Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
[Figure: adjacency matrix with rows and columns ordered A B C D E F (groups G1 G2 G1 G2 G3 G3), and the same matrix reordered to A C B D E F (groups G1 G1 G2 G2 G3 G3)]
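The effect of reordering the adjacency matrix by group can be reproduced from the relations and group labels given on this slide. The sketch below builds the matrix in group order; the helper name `adjacency` is hypothetical.

```python
people = ["A", "B", "C", "D", "E", "F"]
group = {"A": "G1", "B": "G2", "C": "G1", "D": "G2", "E": "G3", "F": "G3"}
# Acad(X, Y) pairs exactly as listed on the slide.
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, edges):
    """0/1 matrix with rows (source) and columns (target) in the given order."""
    index = {p: k for k, p in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

# Sorting by group label gives the slide's second ordering: A C B D E F.
by_group = sorted(people, key=lambda p: (group[p], p))
reordered = adjacency(by_group, acad)
group_links = {(group[s], group[d]) for s, d in acad}

for name, row in zip(by_group, reordered):
    print(name, group[name], " ".join(map(str, row)))
```

Once rows and columns are grouped, every relation falls into one of three solid blocks (G1 to G2, G2 to G3, G3 to G1), which is the block structure a stochastic blockstructure model is designed to find.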
57Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations, Nowicki, Snijders 2001
[Graphical model with nodes labeled Beta, Dirichlet, Multinomial, Binomial]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in Kemp, Griffiths, Tenenbaum 2004
58Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: two adjacency matrices. Academic admiration ordered A C B D E F (groups G1 G1 G2 G2 G3 G3); social admiration ordered A C E B D F (groups G1 G1 G1 G2 G2 G2)]
59The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Graphical model with nodes labeled Beta, Uniform, Dirichlet, Multinomial, Dirichlet, Binomial, Multinomial]
60Inference and Estimation
- Gibbs Sampling
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
61Dataset 1: U.S. Senate
- 16 years of voting records in the US Senate (1989-2005)
- a Senator may respond Yea or Nay to a resolution
- 3423 resolutions with text attributes (index terms)
- 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. MI (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay
62Topics Discovered (U.S. Senate)
Mixture of Unigrams:
Education    Energy     Military + Misc.  Economic
education    energy     government        federal
school       power      military          labor
aid          water      foreign           insurance
children     nuclear    tax               aid
drug         gas        congress          tax
students     petrol     aid               business
elementary   research   law               employee
prevention   pollution  policy            care
Group-Topic Model:
Education + Domestic  Foreign       Economic   Social Security + Medicare
education             foreign       labor      social
school                trade         insurance  security
federal               chemicals     tax        insurance
aid                   tariff        congress   medical
government            congress      income     care
tax                   drugs         minimum    medicare
energy                communicable  wage       disability
research              diseases      business   assistance
63Groups Discovered (US Senate)
Groups from topic Education + Domestic
64Senators Who Change Coalition the most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education + Domestic
- with a small group of maverick Republicans on Social Security + Medicaid
65Dataset 2: The UN General Assembly
- Voting records of the UN General Assembly (1990-2003)
- A country may choose to vote Yes, No or Abstain
- 931 resolutions with text attributes (titles)
- 192 countries in total
- Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
66Topics Discovered (UN)
Mixture of Unigrams:
Everything Nuclear  Human Rights  Security in Middle East
nuclear             rights        occupied
weapons             human         israel
use                 palestine     syria
implementation      situation     security
countries           israel        calls
Group-Topic Model:
Nuclear Non-proliferation  Nuclear Arms Race  Human Rights
nuclear                    nuclear            rights
states                     arms               human
united                     prevention         palestine
weapons                    race               occupied
nations                    space              israel
67Groups Discovered (UN)
The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
68Do We Get Better Groups with the GT Model?
Baseline Model:
- Cluster bills into topics using mixture of unigrams
- Apply group model on topic-specific subsets of bills.
GT Model:
- Jointly cluster topics and groups at the same time using the GT model.
Datasets  Avg. AI for Baseline  Avg. AI for GT  p-value
Senate    0.8198                0.8294          <.01
UN        0.8548                0.8664          <.01
Agreement Index (AI) measures group cohesion. Higher is better.
69Groups and Topics, Trends over Time (UN)
70Outline
- Self and Lab Intro
- Information Extraction and Data Mining
- Research vignette: Conditional Random Fields
- Research vignette: Social network analysis and Topic models
- Demo: Rexa, a new Web portal for research literature
- Future work, brainstorming, collaboration, next steps
71Previous Systems
73Previous Systems
Cites
Research Paper
74More Entities and Relations
Expertise
Cites
Research Paper
Person
Grant
University
Venue
Groups
95Neural Information Processing Conference Dataset
Volumes 0-12, spanning 1987-1999. Prepared by Sam Roweis.
- 1740 Papers
- 13649 Unique words
- 2,301,375 Words
96Trends in 17 years of NIPS proceedings
97Topic Distributions Conditioned on Time
topic mass (in vertical height)
time
98Finding Topics in 1 million CS papers
200 topics keywords automatically discovered.
99Topical Bibliometric Impact Measures
Mann, Mimno, McCallum, 2006
- Topical Citation Counts
- Topical Impact Factors
- Topical Longevity
- Topical Diversity
- Topical Precedence
- Topical Transfer
101Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
Low Diversity
High Diversity
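Topical diversity, as defined on this slide, is the entropy of the topic distribution among citing papers. A minimal sketch follows, assuming each citing paper carries a single topic label and using base-2 logarithms; both choices are assumptions, as are the example labels.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in bits) of the topic labels of the papers citing one paper."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical citing-paper topics for two papers.
low = topical_diversity(["speech"] * 4)
high = topical_diversity(["speech", "vision", "learning", "robotics"])
print(low, high)
```

A paper cited only from within one topic scores zero, while one cited evenly from many topics scores the logarithm of the number of topics.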
102Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
- Speech Recognition
- Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
- Spectrographic study of vowel reduction, B. Lindblom (1963)
- Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
- Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
- Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
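The precedence question above (within a topic, which are the earliest papers with more than n citations?) amounts to a filter followed by a sort. A sketch under assumed record fields; the placeholder papers and citation counts are made up and do not describe the papers listed on the slide.

```python
def topical_precedence(papers, topic, n, k=5):
    """Earliest k papers in `topic` that received more than n citations."""
    qualifying = [p for p in papers if p["topic"] == topic and p["citations"] > n]
    return sorted(qualifying, key=lambda p: p["year"])[:k]

# Hypothetical records; field names are assumptions.
papers = [
    {"title": "paper-a", "year": 1953, "topic": "speech", "citations": 120},
    {"title": "paper-b", "year": 1950, "topic": "speech", "citations": 3},
    {"title": "paper-c", "year": 1963, "topic": "speech", "citations": 45},
    {"title": "paper-d", "year": 1948, "topic": "vision", "citations": 500},
]
earliest = topical_precedence(papers, "speech", n=10)
print([(p["title"], p["year"]) for p in earliest])
```

The citation threshold keeps obscure early papers, such as paper-b here, from crowding out the influential ones.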
103Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
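Counting citations from one topic to another can be sketched as a tally over (citing topic, cited topic) pairs. The paper identifiers and topic assignments below are invented, and assigning one topic per paper is a simplifying assumption.

```python
from collections import Counter

def topical_transfer(citations, paper_topic):
    """Count citations by (citing paper's topic, cited paper's topic)."""
    return Counter((paper_topic[src], paper_topic[dst]) for src, dst in citations)

# Hypothetical data: (citing, cited) pairs and one topic per paper.
paper_topic = {"p1": "speech", "p2": "speech", "p3": "vision", "p4": "learning"}
citations = [("p3", "p1"), ("p3", "p2"), ("p4", "p1"), ("p2", "p1")]

transfer = topical_transfer(citations, paper_topic)
for (src_topic, dst_topic), count in sorted(transfer.items()):
    print(f"{src_topic} -> {dst_topic}: {count}")
```

Topics that are cited mostly from outside themselves act as producers of ideas, and topics that mostly cite outward act as consumers.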
104Outline
- Self and Lab Intro
- Information Extraction and Data Mining
- Research vignette: Conditional Random Fields
- Research vignette: Social network analysis and Topic models
- Demo: Rexa, a new Web portal for research literature
- Future work, brainstorming, collaboration, next steps
105Collaborative Avenues
- Run and extend Rexa on Countway data
- Research on bibliometric IE and matching
- citation extraction in bio/med literature
- author entity resolution (authority, coreference) in all PubMed
- IE for proteins, genes, diseases, treatments,... from abstracts.
- Research on bibliometric / paper-body mining
- Topic analysis, discovery, impact / influence mapping
- Keyphrase, topic-hierarchy discovery
- Group-topic discovery on diseases, drugs, treatments, success...or on genes, proteins, etc.
- Routing CFPs to potential PIs. Expert-finding.
- Research on machine learning for bioinformatics
- Research on bi-clustering gene data
- Research on CRF string edit distance for bio seq
- How might I help i2b2?
- Some additional grant we might write together?
- Next concrete steps, responsibilities, deliverables, timeline.
106Summary
- Traditionally, SNA examines links, but not the language content on those links.
- Presented ART, a Bayesian network for messages sent in a social network; it captures topics and role-similarity.
- RART explicitly represents roles.
- Additional work
- Group-Topic model discovers groups and clusters attributes of relations. Wang, Mohanty, McCallum, LinkKDD 2005
107End of Talk