Title: Social Network Inspired Models of NLP and Language Evolution
1Social Network Inspired Models of NLP and
Language Evolution
Monojit Choudhury (Microsoft Research India)
Animesh Mukherjee (IIT Kharagpur)
Niloy Ganguly (IIT Kharagpur)
2What is a Social Network?
- Nodes: social entities (people, organizations, etc.)
- Edges: interactions/relationships between entities (friendship, collaboration, sex)
Courtesy http://blogs.clickz.com
3Social Network Inspired Computing
- Society and the nature of human interaction constitute a Complex System
- Complex Network: a generic tool to model complex systems
- There is a growing body of work on Complex Network Theory (CNT)
- Applied to a variety of fields: Social, Biological, Physical and Cognitive sciences, Engineering and Technology
- Language is a complex system
4Objective of this Tutorial
- To show that SNIC (Soc. Net. Inspired Comp.) is an emerging and promising technique
- Apply it to model Natural Languages
  - NLP, Quantitative Linguistics, Language Evolution, Historical Linguistics, Language Acquisition
- Familiarize with tools and techniques in SNIC
- Compare it with other standard approaches to NLP
5Outline of the Tutorial
- Part I: Background
  - Introduction (25 min)
  - Network Analysis Techniques (25 min)
  - Network Synthesis Techniques (25 min)
- Break: 3:20 pm to 3:40 pm
- Part II: Case Studies
  - Self-organization of Sound Systems (20 min)
  - Modeling the Lexicon (20 min)
  - Unsupervised Labeling (Syntax and Semantics) (20 min)
- Conclusion and Discussions (20 min)
6Complex System
- Non-trivial properties and patterns emerging from the interaction of a large number of simple entities
- Self-organization: the process through which these patterns evolve without any external intervention or central control
- Emergent Property or Emergent Behavior: the pattern that emerges due to self-organization
7The best example from nature
A termite "cathedral" mound produced by a
termite colony
8Emergence of a networked life
Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities
9Language a complex system
- Language: a medium for communication through an arbitrary set of symbols
- Constantly evolving
- An outcome of self-organization at many levels
  - Neurons
  - Speakers and listeners
  - Phonemes, morphemes, words
- 80-20 rule at every level of structure
10Three Views of a System
- MICROSCOPY: may be too difficult to analyze or simulate the macroscopic behavior
- MACROSCOPY: may not give a complete picture or explanation of what goes on
- MESOSCOPY: a useful trade-off between the two
11Language as a physical system
- Microscopic: a collection of utterances by individual speakers
- Mesoscopic: an interaction between phonemes, syllables, words, phrases
- Macroscopic: a set of grammar rules with a lexicon
12Syntactic Network of Words
[Figure: a syntactic network over the words sky, color, weight, light, blue, blood, heavy, red, with edge weights such as 1, 20 and 100.]
13Complex Network Theory
- Handy toolbox for modeling mesoscopy
- Marriage of Graph theory and Statistics
- Complex because
- Non-trivial topology
- Difficult to specify completely
- Usually large (in terms of nodes and edges)
- Provides insight into the nature and evolution of
the system being modeled
14Internet
15Genetic interaction network
169-11 Terrorist Network
"Social Network Analysis is a mathematical methodology for connecting the dots -- using science to fight terrorism. Connecting multiple pairs of dots soon reveals an emergent network of organization."
17CNT Examples Road and Airlines Network
18What Questions can be asked
- Does these networks display some symmetry?
- Are these networks creation of intelligent
objects or they have emerged? - How have these networks emerged
- What are the underlying simple rules leading
to their complex formation?
19Bi-directional Approach
- Analysis of the real-world networks
  - Global topological properties
  - Community structure
  - Node-level properties
- Synthesis of the network by means of some simple rules
  - Preferential attachment models
  - Small-world models, …
20Application of CNT in Linguistics - I
- Quantitative linguistics
  - Invariance and typology (Zipf's law, syntactic dependencies)
- Natural Language Processing
  - Unsupervised methods for text labeling (POS tagging, NER, WSD, etc.)
  - Textual similarity (automatic evaluation, document clustering)
  - Evolutionary models (NER, multi-document summarization)
21Application of CNT in Linguistics - II
- Language Evolution
- How did sound systems evolve?
- Development of syntax
- Language Change
- Innovation diffusion over social networks
- Language as an evolving network
- Language Acquisition
- Phonological acquisition
- Evolution of the mental lexicon of the child
22Linguistic Networks
Name | Nodes | Edges | Why?
PhoNet | Phonemes | Co-occurrence likelihood in languages | Evolution of sound systems
WordNet | Words | Ontological relation | Host of NLP applications
Syntactic Network | Words | Similarity between syntactic contexts | POS tagging
Semantic Network | Words, Names | Semantic relation | IR, Parsing, NER, WSD
Mental Lexicon | Words | Phonetic similarity and semantic relation | Cognitive modeling, spell checking
Tree-banks | Words | Syntactic dependency links | Evolution of syntax
Word Co-occurrence | Words | Co-occurrence | IR, WSD, LSA, …
23Summarizing
- SNIC and CNT are emerging techniques for modeling complex systems at the mesoscopic level
- Applied to Physics, Biology, Sociology, Economics, Logistics
- Language: an ideal application domain for SNIC
- SNIC models in NLP, Quantitative Linguistics, language change, evolution and acquisition
24Topological Characterization of Networks
25Types Of Networks and Representation
- Unipartite: binary/weighted, undirected/directed
- Bipartite: binary/weighted, undirected/directed
- Representation
  - Adjacency Matrix
  - Adjacency List

Adjacency matrix:
      a  b  c
  a   0  1  1
  b   1  0  1
  c   1  1  0

Adjacency list:
  a: b, c
  b: a, c
  c: a, b
26Properties of Adjacency Matrix
- A = [a_ij], where i and j are nodes and a_ij = 1 if there is an edge between i and j
- A² = A·A; its entries give the number of paths of length 2 between any two nodes: (A²)_ij = Σ_k a_ik a_kj
- In general, Aⁿ gives the number of paths of length n
- Trace(A) = Σ_i a_ii
- How is the trace of A³ related to the number of triangles in the n/w?
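The matrix-power facts above can be checked directly. A minimal sketch on an assumed toy graph (a triangle plus one pendant vertex): (A²)_ij counts length-2 paths, and trace(A³) equals six times the number of triangles, since each triangle yields a closed walk of length 3 starting at any of its 3 vertices in either of 2 directions.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Triangle a-b-c plus a pendant vertex d attached to a.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]

A2 = matmul(A, A)
A3 = matmul(A, A2)

print(A2[1][2])        # 1: one length-2 path between b and c (via a)
print(trace(A3) // 6)  # 1: the graph has exactly one triangle
```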
27Characterization of Complex N/ws??
- They have a non-trivial topological structure
- Properties:
  - Heavy tail in the degree distribution (non-negligible probability mass towards the tail, more than in the case of an exponential distribution)
  - High clustering coefficient
  - Centrality properties
  - Social roles and equivalence
  - Assortativity
  - Community structure
  - Small average path length (as in random graphs)
  - Preferential attachment
  - Small-world properties
28Degree Distribution (DD)
- Let p_k be the fraction of vertices in the network that have degree k
- The k versus p_k plot is defined as the degree distribution of the network
- For most real-world networks these distributions are right-skewed, with a long right tail of values far above the mean: p_k varies as k^(-α)
- Due to noisy and insufficient data, the definition is sometimes slightly modified
  - The cumulative degree distribution is plotted instead
  - Probability that the degree of a node is greater than or equal to k
29A Few Examples
Power law: p_k ~ k^(-α)
30Friend of Friends
- Consider the following scenario
  - Sourish and Ravi are friends
  - Sourish and Shaunak are friends
  - Are Shaunak and Ravi friends?
- If so, the "friend of" relation carries over; this property is known as transitivity
31Measuring Transitivity Clustering Coefficient
- The clustering coefficient for a vertex v in a network is defined as the ratio of the total number of connections among the neighbors of v to the total number of possible connections between the neighbors
- A high clustering coefficient means my friends know each other with high probability, a typical property of social networks
32Mathematically
- The clustering coefficient of a vertex i is C_i = 2E_i / (k_i(k_i − 1)), where k_i is the degree of i and E_i is the number of edges among the neighbors of i
- The clustering coefficient of the whole network is the average: C = (1/N) Σ_i C_i
- Alternatively, C = 3 × (number of triangles) / (number of connected triples)
33Centrality
- Centrality measures are commonly described as indices of the 4 Ps -- prestige, prominence, importance, and power
- Degree: count of immediate neighbors
- Betweenness: nodes that form a bridge between two regions of the n/w
  - C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st, where σ_st is the total number of shortest paths between s and t, and σ_st(v) is the number of those paths that pass through v
34Eigenvector centrality Bonacich (1972)
- It is not just how many people know me that counts towards my popularity (or power), but how many people know people who know me; this is recursive!
- In the context of HIV transmission: a person x with one sex partner is less prone to the disease than a person y with multiple partners
- But imagine what happens if the partner of x has multiple partners
- This is the basic idea of eigenvector centrality
35Definition
- Eigenvector centrality is defined as the principal eigenvector of the adjacency matrix
- An eigenvector of a symmetric matrix A = [a_ij] is any vector e such that Ae = λe, i.e., λe_i = Σ_j a_ij e_j, where λ is a constant and e_i is the centrality of node i
- What does it imply? The centrality of a node is proportional to the centrality of the nodes it is connected to (recursively)
- Practical example: Google PageRank
36Node Equivalence
- Social roles: nodes (actors) in a social n/w who have similar patterns of relations (ties) with other nodes
- Three different ways to find equivalence classes:
  - Structural equivalence
  - Automorphic equivalence
  - Regular equivalence
37Structural Equivalence
- Two nodes are said to be exactly structurally equivalent if they have the same relationships to all other nodes.
Computation: let A be the adjacency matrix. Compute the Euclidean distance / Pearson correlation between a pair of rows/columns representing the neighbor profiles of two nodes (say i and j). This value shows how structurally similar i and j are.
38Automorphic Equivalence
- The idea of automorphic equivalence is that sets of actors can be equivalent by being embedded in local structures that have the same patterns of ties -- "parallel" structures.
Swap (B, D) along with all their neighbors: the distances among all the actors in the graph would be exactly identical. Path vector of i: how many nodes are at distance 1, 2, … from node i. Amount of equivalence: distance between path vectors.
39Regular Equivalence
- Two nodes are said to be regularly equivalent if
they have the same profile of ties with members
of other sets of actors that are also regularly
equivalent.
Class I: 1 tie with Class II, no tie with Class III
Class II: 1 tie with Class I, 1/2 tie(s) with Class III
Class III: no tie with Class I, 1 tie with Class II
40Assortativity (homophily)
- The rich go with the rich (selective linking)
- A famous actor (e.g., Shah Rukh Khan) would prefer to pair up with another famous actor (e.g., Rani Mukherjee) in a movie rather than a newcomer to the film industry.
Assortative scale-free network
Disassortative scale-free network
41Measures of Assortativity
- ANND (Average nearest neighbor degree)
- Find the average degree of the neighbors of each
node i with degree k - Find the Pearson correlation (r) between the
degree of i and the average degree of its
neighbors - For further reference see the supplementary
material
42Community structure
- Community structure: a group of vertices that have a high density of edges within the group and a low density of edges between groups
- Examples
  - Friendship n/w of children
  - Citation n/ws: research interest
  - World Wide Web: subject matter of pages
  - Metabolic networks: functional units
  - Linguistic n/ws: similar linguistic categories
43Some Examples
Community Structure in Political Books
Community structure in a Social n/w of Students
(American High School)
44Community Identification Algorithms
- Hierarchical
  - Girvan-Newman
  - Radicchi et al.
- Chinese Whispers
- Spectral bisection
- See (Newman 2004) for a comprehensive survey (you will find the ref. in the supplementary material)
45Girvan-Newman Algorithm
- Bisection method
  1. Calculate the betweenness for all edges in the network.
  2. Remove the edge with the highest betweenness.
  3. Recalculate betweennesses for all edges affected by the removal.
  4. Repeat from step 2 until no edges remain.
46Evolution of Networks / Processes on Networks
47Random Graphs Small Average Path Length
Q: What do we mean by a random graph?
A: The Erdős-Rényi random graph model: for every pair of nodes, draw an edge between them with equal probability p.

Degrees of Separation in a Random Graph
- N nodes
- z neighbors per node, on average; z = ⟨k⟩
- D degrees of separation: z^D ≈ N, so D ≈ log N / log z
P(k) = e^(−⟨k⟩) ⟨k⟩^k / k!  (Poisson degree distribution)
48Degree Distributions
49Degree distributions for various networks
- World-Wide Web
- Coauthorship networks computer science, high
energy physics, condensed matter physics,
astrophysics - Power grid of the western United States and
Canada - Social network of 43 Mormons in Utah
50How do Power law DDs arise?
Barabási-Albert Model of Preferential Attachment (the rich get richer)
(1) GROWTH: starting with a small number of nodes (m0), at every timestep we add a new node with m (≤ m0) edges (connected to the nodes already present in the system).
(2) PREFERENTIAL ATTACHMENT: the probability Π that the new node will be connected to node i depends on the connectivity k_i of that node: Π(k_i) = k_i / Σ_j k_j
A.-L. Barabási, R. Albert, Science 286, 509 (1999)
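A minimal sketch of the two rules above. Sampling a target with probability k_i / Σ_j k_j is done with the standard trick of drawing uniformly from a list that repeats each node once per incident edge; the sizes (n = 2000, m = 2) are illustrative assumptions.

```python
import random

def barabasi_albert(n, m, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    targets = list(range(m))   # initial m0 = m seed nodes
    repeated = []              # each node appears once per incident edge
    for v in range(m, n):
        # GROWTH: connect the new node v to m chosen targets
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
        repeated.extend(targets)
        repeated.extend([v] * m)
        # PREFERENTIAL ATTACHMENT: pick m distinct degree-proportional
        # targets for the next node
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(repeated))
    return adj

G = barabasi_albert(2000, 2)
degs = sorted((len(nbrs) for nbrs in G.values()), reverse=True)
print(degs[:5])   # the heavy tail: a handful of large hubs
print(degs[-1])   # the low-degree end stays near m
```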
51Mean Field Theory
52The World is Small!
- Registration fees for IJCNLP 2008 are being waived for all participants; collect yours from the registration counter
- How long do you think the above information will take to spread among yourselves?
- Experiments say it will spread very fast: within 6 hops from the initiator it would reach all
- This is the famous Milgram's "six degrees of separation"
53The Small World Effect
- Even in very large social networks, the average distance between nodes is usually quite short.
- Milgram's small world experiment:
  - Target individual in Boston
  - Initial senders in Omaha, Nebraska
  - Each sender was asked to forward a packet to a friend who was closer to the target
  - Friends asked to do the same
- Result: average of six degrees of separation.
- S. Milgram, The small world problem, Psych. Today, 2 (1967), pp. 60-67.
54Measure of Small-Worldness
- Low average geodesic path length
- High clustering coefficient
- Geodesic path: shortest path through the network from one vertex to another
- Mean path length: ℓ = 2 Σ_{i≥j} d_ij / (n(n+1)), where d_ij is the geodesic distance from vertex i to vertex j
- Most networks observed in the real world have ℓ ≤ 6
  - Film actors: 3.48
  - Company directors: 4.60
  - Emails: 4.95
  - Internet: 3.33
  - Electronic circuits: 4.34
55Clustering
C: probability that two of a node's neighbors are themselves connected. In a random graph, C_rand ~ 1/N (if the average degree is held constant).
56Watts-Strogatz Small World Model
Watts and Strogatz introduced this simple model to show how networks can have both short path lengths and high clustering.
D. J. Watts and S. H. Strogatz, Collective dynamics of "small-world" networks, Nature, 393 (1998), pp. 440-442.
57Small-world model
- Used for modeling network transitivity
- Many networks assume some kind of geographical proximity
- Small-world model:
  - Start with a low-dimensional regular lattice
  - Rewire:
    - Add/remove edges to create shortcuts that join remote parts of the lattice
    - For each edge, with probability p move the other end to a random vertex
- Rewiring allows us to interpolate between a regular lattice and a random graph
58Small-world model
- Regular lattice (p = 0)
  - Clustering coefficient C = (3k − 3)/(4k − 2) ≈ 3/4
  - Mean distance ℓ ≈ L/4k
- Almost random graph (p = 1)
  - Clustering coefficient C ≈ 2k/L
  - Mean distance ℓ ≈ log L / log k
- No power-law degree distribution
[Figure: degree distribution as a function of the rewiring probability p.]
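A minimal sketch of the Watts-Strogatz construction: a ring of L nodes, each linked to its k nearest neighbors on either side, then each lattice edge rewired to a random vertex with probability p. The parameters (L = 100, k = 2) are illustrative assumptions.

```python
import random

def watts_strogatz(L, k, p, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(L)}
    # ring lattice: connect each node to its k nearest neighbors per side
    for v in range(L):
        for d in range(1, k + 1):
            w = (v + d) % L
            adj[v].add(w)
            adj[w].add(v)
    # rewire each original lattice edge with probability p
    for v in range(L):
        for d in range(1, k + 1):
            w = (v + d) % L
            if w in adj[v] and rng.random() < p:
                u = rng.randrange(L)
                if u != v and u not in adj[v]:
                    adj[v].discard(w); adj[w].discard(v)
                    adj[v].add(u); adj[u].add(v)
    return adj

ring = watts_strogatz(100, 2, 0.0)  # p = 0: pure lattice, C = 1/2 here
sw = watts_strogatz(100, 2, 0.1)    # small p: a few long-range shortcuts
print(len(ring[0]))  # 4 neighbors: 2 on each side
```

The edge count is conserved by rewiring (each move swaps one endpoint), which is what lets the model interpolate cleanly between lattice and random graph.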
59Resilience of Networks
- We consider the resilience of the network to the
removal of its vertices (site percolation) or
edges (bond percolation). - As vertices (or edges) are removed from the
network, the average path length will increase. - Ultimately, the giant component will
disintegrate. - Networks vary according to their level of
resilience to vertex (or edge) removal.
60Stability MetricPercolation Threshold
f: fraction of nodes removed. Starting from a single connected component, a giant component still exists while f < fc; beyond fc the entire graph breaks into small fragments. Therefore fc = 1 − qc becomes the percolation threshold.
61Ordinary Percolation on Lattices
Fill in each link (bond percolation) or site
(site percolation) with probability p and ask
questions about the sizes of connected
components.
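A minimal sketch of site percolation as described above: delete a random fraction f of nodes from a substrate graph and measure the surviving giant component. The substrate (an Erdős-Rényi graph with N = 500, ⟨k⟩ = 4) and the two values of f are illustrative assumptions.

```python
import random
from collections import deque

def erdos_renyi(N, p, seed=1):
    rng = random.Random(seed)
    adj = {v: set() for v in range(N)}
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_component(adj):
    """Size of the largest connected component, via BFS."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    q.append(w)
        seen |= comp
        best = max(best, len(comp))
    return best

def percolate(adj, f, seed=0):
    """Remove a random fraction f of nodes; return largest surviving component."""
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(adj), int(len(adj) * (1 - f))))
    sub = {v: [w for w in adj[v] if w in keep] for v in keep}
    return largest_component(sub) if sub else 0

L = 500
G = erdos_renyi(L, 4 / (L - 1))
print(percolate(G, 0.1) / L)  # mild damage: the giant component survives
print(percolate(G, 0.9) / L)  # past the threshold: only tiny fragments remain
```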
62Percolation in Poisson and Scale free networks
Exponential Network
Scale free Network
63CASE STUDY I Self-Organization of the Sound
Inventories
64Human Communication
- Human beings and many other living organisms produce sound signals
- Unlike other organisms, humans can concatenate these sounds to produce new messages: Language
- Language is one of the primary causes/effects of human intelligence
65Human Speech Sounds
- Human speech sounds are called phonemes: the smallest units of a language
- Phonemes are characterized by certain distinctive features, like:
66Types of Phonemes
- Consonants: e.g., /t/, /p/, /k/
- Vowels: e.g., /i/, /a/, /u/
- Diphthongs: e.g., /ai/
67Choice of Phonemes
- How does a language choose a set of phonemes to build its sound inventory?
- Is the process arbitrary?
- Certainly not!
- What are the forces affecting this choice?
68Forces of Choice
A linguistic system: how does it look?
The speaker desires ease of articulation; the listener desires perceptual contrast / ease of learnability.
The forces shaping the choice are opposing; hence there has to be a non-trivial solution.
69Vowels A (Partially) Solved Mystery
- Languages choose vowels based on maximal perceptual contrast.
- For instance, if a language has three vowels, then in more than 95% of the cases they are /a/, /i/ and /u/.
70Consonants A puzzle
- Research: from 1929 to date
- No single satisfactory explanation of the organization of the consonant inventories
- The set of features that characterize consonants is much larger than that of vowels
- No single force is sufficient to explain this organization
- Rather, a complex interplay of forces goes on in shaping these inventories
71Principle of Occurrence
- PlaNet: the Phoneme-Language Network
- A bipartite network N(V_L, V_C, E)
  - V_L: nodes representing languages of the world
  - V_C: nodes representing consonants
  - E: set of edges which run between V_L and V_C
- There is an edge e ∈ E between two nodes v_l ∈ V_L and v_c ∈ V_C if the consonant c occurs in the language l.
Choudhury et al. 2006, ACL; Mukherjee et al. 2007, Int. Jnl. of Modern Physics C
The Structure of PlaNet
72Construction of PlaNet
- Data source: UCLA Phonological Segment Inventory Database (UPSID)
- Number of nodes in V_L is 317
- Number of nodes in V_C is 541
- Number of edges in E is 7022
73Degree Distribution of PlaNet
The DD of the language nodes follows a β-distribution.
The DD of the consonant nodes follows a power-law with an exponential cut-off.
The distribution of consonants over languages follows a power-law.
74Synthesis of PlaNet
- Non-linear preferential attachment
- Iteratively construct the language inventories, given their inventory sizes
Pr(C_i) = (d_i^α + ε) / Σ_{x∈V_C} (d_x^α + ε), where d_i is the current degree of consonant node C_i
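The attachment kernel above can be sketched as follows. This is a hedged toy version: languages pick distinct consonants by roulette-wheel selection with weight d^α + ε; α = 1.44 and ε = 0.5 are the values reported on the next slide, while the inventory sizes and number of consonants are invented for illustration (the real data come from UPSID).

```python
import random

def synthesize(inventory_sizes, n_consonants, alpha=1.44, eps=0.5, seed=0):
    """Grow language inventories by nonlinear preferential attachment."""
    rng = random.Random(seed)
    degree = [0] * n_consonants
    inventories = []
    for size in inventory_sizes:
        # kernel computed from the degrees before this language attaches
        weights = [degree[c] ** alpha + eps for c in range(n_consonants)]
        chosen = set()
        while len(chosen) < size:
            # roulette-wheel selection over the not-yet-chosen consonants
            total = sum(w for c, w in enumerate(weights) if c not in chosen)
            r = rng.random() * total
            acc = 0.0
            for c, w in enumerate(weights):
                if c in chosen:
                    continue
                acc += w
                if acc >= r:
                    chosen.add(c)
                    break
        for c in chosen:
            degree[c] += 1
        inventories.append(chosen)
    return inventories, degree

invs, deg = synthesize([5, 8, 6, 7, 5, 9], 30)
print(sorted(deg, reverse=True)[:3])  # a few consonants recur across languages
```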
75Simulation Result
The parameters α and ε are 1.44 and 0.5 respectively. The results are averaged over 100 runs.
76Principle of Co-occurrence
- Consonants tend to co-occur in groups or communities
- These groups tend to be organized around a few distinctive features (based on manner of articulation, place of articulation, phonation): the principle of feature economy
If a language has certain consonants in its inventory, then it will also tend to have other consonants that share their distinctive features.
77How to Capture these Co-occurrences?
- PhoNet: the Phoneme-Phoneme Network
- A weighted network N(V_C, E)
  - V_C: nodes representing consonants
  - E: set of edges which run between the nodes in V_C
- There is an edge e ∈ E between two nodes v_c1, v_c2 ∈ V_C if the consonants c1 and c2 co-occur in a language. The number of languages in which c1 and c2 co-occur defines the edge weight of e. The number of languages in which c1 occurs defines the node weight of v_c1.
78Construction of PhoNet
- Data Source UPSID
- Number of nodes in VC is 541
- Number of edges is 34012
PhoNet
79Community Structures in PhoNet
- Radicchi et al. algorithm (for unweighted networks): counts the number of triangles that an edge is a part of. Inter-community edges will have a low count, so remove them.
- Modification for a weighted network like PhoNet:
  - Look for triangles where the weights on the edges are comparable.
  - If they are comparable, then the group of consonants co-occurs highly; else it does not.
  - Measure a strength S for each edge (u, v) in PhoNet.
  - Remove edges with S less than a threshold η.
80Community Formation
For different values of η we get different sets of communities.
81Consonant Societies!
η = 0.35
η = 0.60
η = 0.72
η = 1.25
That the communities are good can be quantitatively shown by measuring the feature entropy.
82Problems to ponder on
- Physical significance of PA
- Functional forces
- Historical/Evolutionary process
- Labeled synthesis of PlaNet and PhoNet
- Language diversity vs. Preferential attachment
83CASE STUDY II Modeling the Mental Lexicon
84Mental Lexicon (ML) Basics
- It refers to the repository of word forms that resides in the human brain
- Two questions:
  - How are words stored in the long-term memory, i.e., what is the organization of the ML?
  - How are words retrieved from the ML (lexical access)?
- The above questions are highly inter-related: to predict the organization, one can investigate how words are retrieved, and vice versa.
85Different Possible Ways of Organization
- Un-organized (a bag full of words), or
- Organized
  - By sound (phonological similarity)
    - E.g., start the same: banana, bear, bean
    - End the same: look, took, book
    - Number of phonological segments they share
  - By meaning (semantic similarity)
    - Banana, apple, pear, orange
  - By age at which the word is acquired
  - By frequency of usage
  - By POS
  - Orthographically
86The Hierarchical Model of ML
- Proposed by Collins and Quillian in 1969
- Concepts are organized in a hierarchy
- Taxonomic and attributive relations are represented
- Cognitive economy: put each attribute at the highest of all appropriate levels; e.g., "reproduces" applies to the whole animal kingdom
87Hierarchical Model
- According to the principle of cognitive economy:
  - "Animals eat" < "mammals eat" < "humans eat"
  - However, "a shark is a fish" = "a salmon is a fish"
- What do < and = mean?
  - <: less time to judge
  - =: equal time to judge
88Spreading Activation Model of ML
- Not a hierarchical structure but a web of inter-connected nodes (first proposed by Collins and Loftus in 1975)
- Distance between nodes is determined by the structural characteristics of the word forms (by sound, by meaning, by age, by …)
- Combining the above two: a plethora of complex networks
89Phonological Neighborhood Network
- (Vitevitch 2004)
- (Gruenenfelder Pisoni, 2005)
- (Kapatsinski 2006)
- Sound Similarity Relations in the Mental
Lexicon Modeling the Lexicon as a Complex
Network
90N/W Definition
- Nodes: words
- Edge: an edge is drawn from node A to node B if at least 2/3 of the segments that occur in the word represented by A also occur in the word represented by B
- I.e., if the word represented by A is 6 segments long, then one can derive all its neighbors (B) from it by at most two phoneme changes (insertions, deletions or substitutions).
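The neighborhood rule above can be sketched with a standard edit-distance computation: connect A to B when B is reachable from A in at most len(A)/3 insertions, deletions or substitutions. The toy "transcriptions" below are plain strings standing in for segment sequences, an assumption for illustration (the real data are phonological transcriptions).

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def neighborhood_network(words):
    """Directed edges: A -> B if B is within len(A)/3 edits of A."""
    edges = {w: set() for w in words}
    for a in words:
        for b in words:
            if a != b and edit_distance(a, b) <= len(a) / 3:
                edges[a].add(b)
    return edges

net = neighborhood_network(['kat', 'bat', 'katnip', 'dog'])
print(sorted(net['kat']))  # ['bat']: one substitution on a 3-segment word
```

Note the asymmetry the slide mentions: the threshold depends on the source word's length, so a long word can reach a short one without the reverse holding.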
91N/W Construction
- Database
  - Hoosier Mental Lexicon (Nusbaum et al., 1984)
  - Phonologically transcribed words → n/w using the metric defined earlier
- Nodes with no links correspond to hermit words, i.e., words that have no neighbors
- Random networks (E-R) for comparison
- Directed n/w → a long word can have a short word as a neighbor, but not vice versa
  - Have a link only if the duration of the difference in the word pair < (duration of a word)/3 (the factor 1/3 is experimentally derived; see the paper for further info.)
92Neighborhood Density
- The node whose neighbors are searched → the base word
- Neighborhood density of a base word is expressed as the out-degree of the node representing the base word
- It is an estimate of the number of words activated by the base word when the base word is presented → spreading activation
- Something like semantic priming (however, at the phonological level)
93Results of the N/W Analysis
- Small-world properties?
- High clustering but also long average path length: like a SW network, the lexicon has densely connected neighborhoods, but links between two nodes of different neighborhoods are harder to find than in SW networks
94Visualization A Disconnected Graph with a
Giant Component (GC)
- The GC is elongated: there are some nodes with really long chains of intermediates, and hence the mean path length is long
95Low Degree Nodes are Important!!!
- Removal of low-degree nodes renders the n/w almost disconnected
- A bottleneck is formed between longer (more than 7 segments long) and shorter words
- This bottleneck consists of the -tion final words (coalition, passion, nation, fixation/fission); they form short-cuts between the high-degree nodes (i.e., they are low-degree stars that connect mega-neighborhoods)
96Removal of Nodes with Degree < 40
Removal of low-degree nodes disconnects the n/w, as opposed to the removal of hubs like "pastor" (deg. 112).
[Figure: 2-4 segment words vs. 8-10 segment words.]
97Why low connectivity between neighborhoods?
- Spreading activation should not reach neighbors of the stimulus' neighbors that are non-neighbors of the stimulus itself (and are, therefore, not similar to the stimulus), as these would inhibit recognition
- A low mean path → complete traversal of n/ws, e.g., in general-purpose search
- Search in the lexicon does not need to traverse links between distant nodes; rather it involves activation of the structured neighborhood sharing a single sub-lexical chunk that could be acoustically related during word-recognition (Marslen-Wilson, 1990).
98Degree Distribution (DD)
- Exponential rather than power-law
[Figure: DD for the entire lexicon, 5-7 segment words, and 8-10 segment words.]
99Other Works (see supplementary material for references)
- Vitevitch (2005)
  - Similar to the above work, but builds n/ws of nodes that are just one segment different
- Choudhury et al. (2007)
  - Builds weighted n/ws in Hindi, Bengali and English based on orthographic proximity (nodes: words; edges: orthographic edit-distance): SpellNet
  - Thresholds (θ) to make the n/ws binary (at θ = 1, 3, 5)
  - They also obtain exponential DDs
  - Observe that the occurrence of real-word errors in a language is proportional to the avg. weighted deg. of the SpellNet of that language
100Other Works
- Sigman et al. (2002)
  - Analyzes the English WordNet
  - All semantic relationships are scale-invariant
  - Inclusion of polysemy makes the n/w SW
- Ferrer i Cancho et al. (2000, 2001)
  - Word co-occurrence (in a sentence) based definitions of the lexicon
  - Lexicon = Kernel Lexicon + Peripheral Lexicon
  - Finds a 2-regime DD: one regime comprises words in the kernel lexicon, the other words in the peripheral lexicon
  - Finds that these n/ws are small-world
101Some Unsolved Mysteries You Can Give a Try
- What can be a model for the evolution of the ML?
- How is the ML acquired by a child learner?
- Is there a single optimal structure for the ML, or is it organized based on multiple criteria (i.e., a combination of the different n/ws)?
Towards a single framework for studying the ML!
102CASE STUDY III Syntax: Unsupervised POS Tagging
103Labeling of Text
- Lexical category (POS tags)
- Syntactic category (phrases, chunks)
- Semantic role (agent, theme, …)
- Sense
- Domain-dependent labeling (genes, proteins, …)
- How to define the set of labels?
- How to (learn to) predict them automatically?
104Nothing makes sense, unless in context
- Distribution-based definition of
- Lexical category
- Sense (meaning)
- The X is
- If you X then I shall
- looking at the star PP
105General Approach
- Represent the context of a word (token)
- Define some notion of similarity between the
contexts - Cluster the contexts of the tokens
- Get the label of the tokens
w1
w3
w2
w4
106Issues
- How to define the context?
- How to define similarity?
- How to cluster?
- How to evaluate?
107Unsupervised Parts-of-Speech Tagging Employing
Efficient Graph Clustering
- Chris Biemann
- COLING-ACL 2006
108Stages
- Input: raw text corpus
- Identify feature words, and define a graph for high- and medium-frequency words (top 10000)
- Cluster the graph to identify the word classes
- For low-frequency words, use context similarity
- Lexicon of word classes → tag the same text → learn a Viterbi tagger
109Feature Words
- Estimate the unigram frequencies
- Feature words: the 200 most frequent words
110Feature Vector
- "From the familiar to the exotic, the collection is a delight"
- Each token is represented by four binary vectors over the 200 feature words (fw1 = the, fw2 = to, …, fw199, fw200), one per context position p−2, p−1, p+1, p+2; e.g., for the target "familiar":

        the  to  is  from
  p−2:   0    0   0   1
  p−1:   1    0   0   0
  p+1:   0    1   0   0
  p+2:   1    0   0   0
111Syntactic Network of Words
[Figure: syntactic network of words (sky, color, weight, light, blue, blood, heavy, red); edges are weighted by the similarity of the words' context feature vectors, e.g., cos(red, blue).]
112The Chinese Whisper Algorithm
[Figure: weighted word graph (sky, color, weight, light, blue, blood, heavy, red) with similarity weights such as 0.9, 0.8, 0.7, 0.5 and −0.5, on which Chinese Whispers iterates.]
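The iteration depicted above can be sketched compactly: every node starts in its own class, and in randomized order each node repeatedly adopts the label carrying the highest total edge weight among its neighbors, so that negative weights (dissimilar contexts) keep classes apart. The edge weights below are illustrative, not the slide's exact figure.

```python
import random

def chinese_whispers(edges, iters=20, seed=0):
    """Chinese Whispers label propagation on a weighted undirected graph."""
    rng = random.Random(seed)
    nodes = sorted({v for e in edges for v in e})
    adj = {v: {} for v in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    label = {v: v for v in nodes}  # every node starts as its own class
    for _ in range(iters):
        order = nodes[:]
        rng.shuffle(order)
        for v in order:
            # sum the weights supporting each neighboring label
            scores = {}
            for u, w in adj[v].items():
                scores[label[u]] = scores.get(label[u], 0.0) + w
            if scores:
                label[v] = max(scores, key=scores.get)
    return label

# Toy version of the slide's graph: two word clusters linked only by a
# negative (dissimilarity) edge between 'color' and 'blue'.
edges = {
    ('sky', 'color'): 0.9, ('color', 'blood'): 0.8, ('sky', 'blood'): 0.7,
    ('blue', 'red'): 0.9, ('red', 'heavy'): 0.5, ('blue', 'heavy'): 0.7,
    ('color', 'blue'): -0.5,
}
label = chinese_whispers(edges)
print(label['sky'] == label['blood'], label['blue'] == label['red'])
```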
115Medium and Low Frequency Words
- Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded
- Two words are connected iff they share at least 4 neighbors

Language | English | Finnish | German
Nodes | 52857 | 85627 | 137951
Edges | 691241 | 702349 | 1493571
116Construction of Lexicon
- Each word is assigned a unique tag based on the word class it belongs to
  - Class 1: sky, color, blood, weight
  - Class 2: red, blue, light, heavy
- Ambiguous words
  - High- and medium-frequency words that formed singleton clusters
  - Possible tags: those of the neighboring clusters
117Training and Evaluation
- Unsupervised training of a trigram HMM using the clusters and lexicon
- Evaluation
  - Tag a text for which a gold standard is available
  - Estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C)
- Final results
  - English 2.05 (619/345), Finnish 3.22 (625/466), German 1.79 (781/440)
118Example
From the familiar to the exotic, the collection is a delight
Gold tags: Prep At JJ Prep At JJ At NN V At NN
Clusters: C200 C1 C331 C5 C1 C331 C1 C221 C3 C1 C220
119Word Sense Disambiguation
- Véronis, J. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223-252.
- Let the word to be disambiguated be "light"
- Select a subcorpus of paragraphs which have at least one occurrence of "light"
- Construct the word co-occurrence graph
120HyperLex
- "A beam of white light is dispersed into its component colors by its passage through a prism."
- "Energy-efficient light fixtures including solar lights, night lights, energy star lighting, ceiling lighting, wall lighting, lamps …"
- "What enables us to see the light and experience such wonderful shades of colors during the course of our everyday lives?"
[Figure: co-occurrence graph around "light": prism, beam, dispersed, white, colors, shades, energy, fixtures, efficient, lamps.]
121Hub Detection and MST
[Figure: hubs detected in the co-occurrence graph around "light" (beam, prism, dispersed, white, colors, shades; energy, fixtures, efficient, lamps), and the minimum spanning tree rooted at the hubs.]
"White fluorescent lights consume less energy than incandescent lamps"
122Other Related Works
- Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005. Unsupervised learning of natural languages. PNAS, 102(33):11629-11634.
- Ferrer i Cancho, R. 2007. Why do syntactic links not cross? Europhysics Letters.
- Also applied to IR, summarization, sentiment detection and categorization, script evaluation, author detection, …
123Discussions Conclusions
- What we learnt
- Advantages of SNIC in NLP
- Comparison to standard techniques
- Open problems
- Concluding remarks and QA
124What we learnt
- What SNIC and Complex Networks are
- Analytical tools for SNIC
- Applications to human languages
- Three case studies:

Case | Area | Perspective | Technique
I | Sound systems | Language evolution and change | Synthesis models
II | Lexicon | Psycholinguistic modeling and linguistic typology | Topology and search
III | Syntax & Semantics | Applications to NLP | Clustering
125What we saw
- Language features complex structure at every level of organization
- Linguistic networks have non-trivial properties: scale-free, small-world
- Therefore, language, and engineering systems involving language, should be studied within the framework of complex systems, esp. CNT
126Advantages of SNIC
- Fully unsupervised techniques
  - No labeled data required: a good answer to resource scarcity
  - The problem of evaluation circumvented by semi-supervised techniques
- Ease of computation
  - Simple and scalable
  - Distributed and parallel computable
- Holistic treatment
  - Language evolution and psycho-linguistic theories
127Comparison to Standard Techniques
- Rule-based vs. Statistical NLP
- Graphical Models
- Generative models in machine learning
- HMM, CRF, Bayesian belief networks
JJ
NN
RB
VF
128Graphical Models vs. SNIC
Graphical models:
- Principled: based on Bayesian theory
- Structure is assumed and parameters are learnt
- Focus: decoding and parameter estimation
- Data-driven or computationally intensive
- The generative process is easy to visualize, but there is no visualization of the data

SNIC:
- Heuristic, but with underlying principles of linear algebra
- Structure is discovered and studied
- Focus: topology and evolutionary dynamics
- Unsupervised and computationally easy
- Easy visualization of the data
129Language Modeling
- A network of words as a model of language vs. n-gram models
- Hierarchical, hyper-graph based models
- Smoothing through holistic analysis of the network topology
- Jedynak, B. and Karakos, D. 2007. Unigram Language Models using Diffusion Smoothing over Graphs. Proc. of TextGraphs-2.
130Open Problems
- Universals and variables of linguistic networks
- Superimposition of networks phonetic, syntactic,
semantic - Which clustering algorithm for which topology?
- Metrics for network comparison important for
language modeling - Unsupervised dependency parsing using networks
- Mining translation equivalents
131Resources
- Conferences
  - TextGraphs, Sunbelt, EvoLang, ECCS
- Journals
  - PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS, Complexity, Social Networks
- Tools
  - Pajek, CUNG, http://www.insna.org/INSNA/soft_inf.html
- Online resources
  - Bibliographies, courses on CNT
132Contact
- Monojit Choudhury
  - monojitc_at_microsoft.com
  - http://www.cel.iitkgp.ernet.in/monojit/
- Animesh Mukherjee
  - animeshm_at_cse.iitkgp.ernet.in
  - http://www.cel.iitkgp.ernet.in/animesh/
- Niloy Ganguly
  - niloy_at_cse.iitkgp.ernet.in
  - http://www.facweb.iitkgp.ernet.in/niloy/
133Thank you!!