Title: EPCA Integration
1 People in CALO's World: Contact Info, Expertise, Groups, Roles
Information Extraction, Coreference, Group/Topic Models
Andrew McCallum
Aron Culotta, Xuerui Wang, Charles Sutton, Wei Li
UMass Amherst
2 The Application
Workplace effectiveness: the ability to leverage your network of acquaintances.
The power of your "little black book".
But filling a Contacts DB by hand is tedious, and the result is incomplete.
[Diagram: the Contacts DB is filled automatically from the Email Inbox and the WWW]
3 DEX Overview
[Diagram labels: CRF, WWW, Email, names]
4 DEX Example
To: Andrew McCallum mccallum_at_cs.umass.edu Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,
Key Words: Information extraction, social network,
Search for new people
5 Summary of Results
Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Contact info and name extraction performance (25 fields)

       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
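
The Field F1 value in the row above is consistent with the usual harmonic mean of field precision and field recall. A minimal sketch of that check follows; the function name `f1_score` is an illustrative choice, not something from the slides.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field-level precision and recall from the CRF row, in percent.
crf_f1 = f1_score(85.73, 76.33)
print(f"{crf_f1:.2f}")  # prints 80.76
```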

- Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid stove-piping in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
- Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
6 Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
8 0. Segmenting and labeling sequence data: Linear-chain CRFs
Lafferty, McCallum, Pereira 2001
y (named entity labels): PER   O  O    TIME    O        O     ORG     O   LOC
x (CALO email words):    Dave  ,  The  Friday  meeting  with  Tembec  in  NY
Leveraging data from KnowItAll (Etzioni et al., 2004); UPenn help.
Enron email labeled by Michael Collins et al.: 1200 entities.

Field          F1
DATE           0.8483
TIME           0.7939
LOCATION       0.6476
PERSON         0.8439
ORGANIZATION   0.5987
ACRONYM        0.2804
PHONE          0.7943
MONEY          0.7143
PERCENT        0.9091
OVERALL        0.7282

From: monika.causholli_at_enron.com
Dave, The Friday meeting with Tembec in NY has been postponed until next week. Attached is the information you requested. Let me know if you need anything else. Also did Doug give you the data about consumer products?
Cheers, Monica
Li, McCallum, unpublished, 2004
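
A linear-chain CRF labels a whole token sequence at once, and the best label sequence is typically found with Viterbi decoding. The sketch below shows only that decoding step on the example sentence from this slide. The lexicon, the emission scores, and the transition penalty are hand-set, hypothetical values chosen for illustration; a trained CRF would supply learned scores instead.

```python
from typing import Callable, List

LABELS = ["PER", "TIME", "ORG", "LOC", "O"]

# Hypothetical lexicon standing in for learned emission weights.
TOY_LEXICON = {"Dave": "PER", "Friday": "TIME", "Tembec": "ORG", "NY": "LOC"}


def toy_emit(token: str, label: str) -> float:
    """Illustrative log-score for tagging `token` with `label`."""
    if TOY_LEXICON.get(token) == label:
        return 2.0
    if label == "O" and token not in TOY_LEXICON:
        return 1.0
    return 0.0


def toy_trans(prev: str, cur: str) -> float:
    """Illustrative penalty for jumping directly between two different entity types."""
    return -0.5 if prev != "O" and cur != "O" and prev != cur else 0.0


def viterbi(tokens: List[str], labels: List[str],
            emit: Callable[[str, str], float],
            trans: Callable[[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence under additive scores."""
    # best[i][y] is the best score of any labeling of tokens[:i + 1] ending in y.
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(tok, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tokens = ["Dave", ",", "The", "Friday", "meeting", "with", "Tembec", "in", "NY"]
decoded = viterbi(tokens, LABELS, toy_emit, toy_trans)
print(list(zip(tokens, decoded)))
```

With these toy scores the decoder reproduces the label row shown on the slide (PER O O TIME O O ORG O LOC).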
9 User feedback in the wild as labeling
Labeling for Classification
Seminar: How to Organize your Life, by Jane Smith, Stevenson Smith
Mezzanine Level, Papadopoulos Sq.
3:30 pm, Thursday March 31
In this seminar we will learn how to use CALO to...
Seminar announcement
Todo request
Other
Easy. Often found in user interfaces, e.g. CALO IRIS, Apple Mail.
10 Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level, Papadopoulos Sq.
11 Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level, Papadopoulos Sq.
Interface presents top hypothesized segmentations
[Three candidate segmentations of the line above, differing in field boundaries]
User corrects labels, not segmentations
12 Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level, Papadopoulos Sq.
Interface presents top hypothesized segmentations
[Three candidate segmentations of the line above, differing in field boundaries]
29 percent reduction in user actions needed to train
13 Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
14 Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Emailed seminar announcement entities
Email English words
60k words training.
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell, School of Computer Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Too little labeled training data.
15 Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Train on related task with more data.
Newswire named entities
Newswire English words
200k words training.
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
16 Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
At test time, label email with newswire NEs...
Newswire named entities
Email English words
17 Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
...then use these labels as features for the final task
Emailed seminar announcement entities
Newswire named entities
Email English words
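
The cascaded step described on slides 16 and 17 (label the email with a newswire-trained tagger, then feed those labels to the final model as features) can be sketched as a feature extractor. This is only an illustration under stated assumptions: `toy_newswire_tagger` is a hypothetical lexicon lookup standing in for a tagger trained on newswire, and the feature names are invented for this example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a named-entity tagger trained on newswire text.
TOY_NEWSWIRE_LEXICON = {
    "Jaime": "PER", "Carbonell": "PER",
    "Carnegie": "ORG", "Mellon": "ORG", "University": "ORG",
}


def toy_newswire_tagger(tokens: List[str]) -> List[str]:
    """Tag each token with a newswire entity label, defaulting to O."""
    return [TOY_NEWSWIRE_LEXICON.get(tok, "O") for tok in tokens]


def cascaded_features(tokens: List[str],
                      tagger: Callable[[List[str]], List[str]]) -> List[Dict[str, float]]:
    """Build per-token features for the final task from words plus first-stage labels."""
    ne_labels = tagger(tokens)
    features = []
    for i, (tok, ne) in enumerate(zip(tokens, ne_labels)):
        feats = {f"word={tok.lower()}": 1.0, f"newswire_ne={ne}": 1.0}
        if i > 0:
            feats[f"prev_newswire_ne={ne_labels[i - 1]}"] = 1.0
        features.append(feats)
    return features


tokens = "Jaime Carbonell School of Computer Science Carnegie Mellon University".split()
features = cascaded_features(tokens, toy_newswire_tagger)
print(features[1])
```

A seminar-announcement tagger trained on these features can learn, for example, that a newswire PER label is evidence for the speaker field, without needing access to the parameters of the newswire model.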
18 Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Use joint inference at test time.
Seminar Announcement entities
Newswire named entities
English words
An alternative to hierarchical Bayes. Needn't know anything about the parameterization of the subtask.
Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer
19 CRF Transfer Learning Results
Sutton, McCallum, 2005
Seminar Announcements Dataset (Freitag 1998)

CRF                 location   speaker   stime   etime   overall
No transfer         73.7       81.0      99.1    97.3    87.8
Cascaded transfer   74.2       84.3      99.2    96.0    88.4
Joint transfer      76.3       85.3      99.1    96.0    89.2

New best published accuracy on common dataset
20 Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
21 Joint Co-reference Decisions, Discriminative Model
Culotta & McCallum 2005
[Figure: People mentions "Stuart Russell", "Stuart Russell", and "S. Russel", with a Y/N coreference decision on each pair of mentions]
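
The pairwise Y/N decisions on this slide have to agree with each other: if mention A matches B and B matches C, then A and C must be the same entity. The sketch below enforces that transitivity with a simple union-find over positive pairwise scores. This is a deliberate simplification for illustration, not the joint inference procedure of Culotta & McCallum; `toy_score` is a hypothetical last-name heuristic standing in for a learned pairwise model.

```python
from itertools import combinations
from typing import Callable, List


def toy_score(a: str, b: str) -> float:
    """Hypothetical pairwise score: positive when one last name is a prefix of the other."""
    last_a, last_b = a.split()[-1].lower(), b.split()[-1].lower()
    return 1.0 if last_a.startswith(last_b) or last_b.startswith(last_a) else -1.0


def cluster_mentions(mentions: List[str],
                     score: Callable[[str, str], float]) -> List[List[str]]:
    """Merge every positively scored pair, so the result is a consistent partition."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if score(mentions[i], mentions[j]) > 0:
            parent[find(i)] = find(j)

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


mentions = ["Stuart Russell", "Stuart Russell", "S. Russel", "Andrew McCallum"]
clusters = cluster_mentions(mentions, toy_score)
print(clusters)
```

Merging on any positive score is greedy, so one wrong "Y" can chain unrelated mentions together; scoring whole partitions jointly is one way to avoid that.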
22 Co-reference for Multiple Entity Types
Culotta & McCallum 2005
[Figure: People mentions "Stuart Russell", "Stuart Russell", "S. Russel" and Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley", with a Y/N coreference decision on each pair of mentions within a type]
23 Joint Co-reference of Multiple Entity Types
Culotta & McCallum 2005
[Figure: People mentions "Stuart Russell", "Stuart Russell", "S. Russel" and Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley", with Y/N coreference decisions made jointly across both types]
Reduces error by 22%
24 Joint Co-reference Experimental Results
Culotta & McCallum 2005
CiteSeer Dataset: 1500 citations, 900 unique papers, 350 unique venues

                Paper            Venue
                indep   joint    indep   joint
constraint      88.9    91.0     79.4    94.1
reinforce       92.2    92.2     56.5    60.1
face            88.2    93.7     80.9    82.8
reason          97.4    97.0     75.6    79.5
Micro Average   91.7    93.4     73.1    79.1
Δerror          20%              22%
25 Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
26 Social network from my email
27 Clustering words into topics with Latent Dirichlet Allocation
Blei, Ng, Jordan 2003
28 Example topics induced from a large collection of text
- JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
- SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
- BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
- FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
- STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
- MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
- DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
- WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Tenenbaum et al.
30 From LDA to Author-Recipient-Topic (ART)
31 Inference and Estimation
- Gibbs Sampling
- Easy to implement
- Reasonably fast
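
To illustrate why Gibbs sampling counts as easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA. It is a sketch of the simpler LDA case only, not of ART, which additionally conditions on author and recipient. The hyperparameter values, the function name `lda_gibbs`, and the toy documents are assumptions made for this example.

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[int]], num_topics: int, vocab_size: int,
              alpha: float = 0.1, beta: float = 0.01, iters: int = 100,
              seed: int = 0) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                               # tokens assigned to topic k
    assign = []

    def update(d: int, w: int, z: int, delta: int) -> None:
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # remove the token's current assignment
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                update(d, w, z, +1)
    return assign, doc_topic, topic_word


# Toy corpus built from words that appear in the example topics earlier in the deck.
vocab = ["ball", "game", "team", "science", "study", "research"]
word_id = {w: i for i, w in enumerate(vocab)}
docs_words = [["ball", "game", "team", "game"], ["science", "study", "research", "study"],
              ["team", "ball", "game"], ["research", "science", "study"]]
docs = [[word_id[w] for w in doc] for doc in docs_words]

assign, doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: -counts[w])[:3]
    print(k, [vocab[w] for w in top])
```

The whole sampler is a few count tables plus one resampling step per token, which is the sense in which the method is simple; the speed depends on corpus size and number of sweeps.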
32 Enron Email Corpus
- 250k email messages
- 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
33 Topics, and prominent sender/receivers discovered by ART
Titles chosen by me
34 Topics, and prominent sender/receivers discovered by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
35 Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
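
Each method on this slide describes a person by some distribution (over recipients or over topics), so comparing two people's roles reduces to comparing two distributions. The slide does not say which measure is used; the sketch below uses Jensen-Shannon divergence as one reasonable assumption. The example distributions are made up for illustration.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions for three people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.0, 0.1, 0.9]

print(js_divergence(person_a, person_b))  # small: similar roles
print(js_divergence(person_a, person_c))  # larger: different roles
```

Jensen-Shannon divergence is symmetric and stays finite when one distribution has zeros, which plain KL divergence does not.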
36 Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
37 Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
Traditional SNA
Author-Topic
ART
Very similar
Not very similar
Different roles
Geaconne: Secretary; Hayslett: Vice President & CTO
38 Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson: Pipeline facilities planning
39 Comparing Group Discovery: Enron TransWestern Division
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
40 McCallum Email Corpus 2004
- January - October 2004
- 23k email messages
- 825 people
From: kate_at_cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum_at_cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt.
Thanks, Kate
41 McCallum Email Blockstructure
42 Four most prominent topics in discussions with ____?
44 Two most prominent topics in discussions with ____?
45 Topic 37
46 Topic 40
48 Pairs with highest rank difference between ART and SNA
5 other professors, 3 other ML researchers
49 Role-Author-Recipient-Topic Models
50 Year Three Plans: People
- Extraction, for Expert-finding and Group/Role Analysis
  - Make learning-in-the-wild practical for extraction.
  - Transfer from noisy/incomplete databases to improve IE.
  - Support questions about contact info, organizational affiliation, etc.
- Identity Uncertainty
  - Central problem for going from text to knowledge base.
  - Many interacting entity types, relationships.
- Group/Role/Topic Analysis
  - Explicit topic models of groups, roles, expertise, tasks, and their interaction with extraction...
  - Support Qs about topical expertise, forwarding messages, team building.
- Etc.
  - Continue to support and enhance MALLET toolkit, in collaboration with UPenn and others.