Title: Social Network Inspired Models of NLP and Language Evolution
1Social Network Inspired Models of NLP and
Language Evolution
Monojit Choudhury (Microsoft Research India)
Animesh Mukherjee (IIT Kharagpur)
Niloy Ganguly (IIT Kharagpur)
2What is a Social Network?
- Nodes: social entities (people, organizations, etc.)
- Edges: interactions/relationships between entities (friendship, collaboration, sex)
Courtesy http://blogs.clickz.com
3Social Network Inspired Computing
- Society and the nature of human interaction constitute a Complex System
- Complex Network: a generic tool to model complex systems
- There is a growing body of work on Complex Network Theory (CNT)
- Applied to a variety of fields: Social, Biological, Physical and Cognitive sciences, Engineering and Technology
- Language is a complex system
4Objective of this Tutorial
- To show that SNIC (Soc. Net. Inspired Comp.) is an emerging and promising technique
- Apply it to model Natural Languages
  - NLP, Quantitative Linguistics, Language Evolution, Historical Linguistics, Language Acquisition
- Familiarize with tools and techniques in SNIC
- Compare it with other standard approaches to NLP
5Outline of the Tutorial
- Part I: Background
  - Introduction (25 min)
  - Network Analysis Techniques (25 min)
  - Network Synthesis Techniques (25 min)
- Break: 3:20 pm to 3:40 pm
- Part II: Case Studies
  - Self-organization of Sound Systems (20 min)
  - Modeling the Lexicon (20 min)
  - Unsupervised Labeling (Syntax and Semantics) (20 min)
- Conclusion and Discussions (20 min)
6Complex System
- Non-trivial properties and patterns emerging from the interaction of a large number of simple entities
- Self-organization: the process through which these patterns evolve without any external intervention or central control
- Emergent Property or Emergent Behavior: the pattern that emerges due to self-organization
7The best example from nature
A termite "cathedral" mound produced by a
termite colony
8Emergence of a networked life
Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities
9Language a complex system
- Language: a medium for communication through an arbitrary set of symbols
- Constantly evolving
- An outcome of self-organization at many levels
  - Neurons
  - Speakers and listeners
  - Phonemes, morphemes, words
- 80-20 rule at every level of structure
10Three Views of a System
- MICROSCOPY: may be too difficult to analyze or simulate the macroscopic behavior
- MACROSCOPY: may not give a complete picture or explanation of what goes on
- MESOSCOPY: a useful trade-off between the two
11Language as a physical system
- Microscopic: a collection of utterances by individual speakers
- Mesoscopic: an interaction between phonemes, syllables, words, phrases
- Macroscopic: a set of grammar rules with a lexicon
12Syntactic Network of Words
[Figure: a syntactic network over the words sky, color, weight, light, blue, blood, heavy, red, with edge weights such as 1, 20 and 100.]
13Complex Network Theory
- Handy toolbox for modeling mesoscopy
- Marriage of Graph theory and Statistics
- Complex because
- Non-trivial topology
- Difficult to specify completely
- Usually large (in terms of nodes and edges)
- Provides insight into the nature and evolution of
the system being modeled
14Internet
15Genetic interaction network
169-11 Terrorist Network
"Social Network Analysis is a mathematical methodology for connecting the dots -- using science to fight terrorism. Connecting multiple pairs of dots soon reveals an emergent network of organization."
17CNT Examples Road and Airlines Network
18What Questions can be asked
- Does these networks display some symmetry?
- Are these networks creation of intelligent
objects or they have emerged? - How have these networks emerged
- What are the underlying simple rules leading
to their complex formation?
19Bi-directional Approach
- Analysis of the real-world networks
  - Global topological properties
  - Community structure
  - Node-level properties
- Synthesis of the network by means of some simple rules
  - Preferential attachment models
  - Small-world models, …
20Application of CNT in Linguistics - I
- Quantitative linguistics
  - Invariance and typology (Zipf's law, syntactic dependencies)
- Natural Language Processing
  - Unsupervised methods for text labeling (POS tagging, NER, WSD, etc.)
  - Textual similarity (automatic evaluation, document clustering)
  - Evolutionary models (NER, multi-document summarization)
21Application of CNT in Linguistics - II
- Language Evolution
- How did sound systems evolve?
- Development of syntax
- Language Change
- Innovation diffusion over social networks
- Language as an evolving network
- Language Acquisition
- Phonological acquisition
- Evolution of the mental lexicon of the child
22Linguistic Networks
Name | Nodes | Edges | Why?
PhoNet | Phonemes | Co-occurrence likelihood in languages | Evolution of sound systems
WordNet | Words | Ontological relation | Host of NLP applications
Syntactic Network | Words | Similarity between syntactic contexts | POS tagging
Semantic Network | Words, Names | Semantic relation | IR, Parsing, NER, WSD
Mental Lexicon | Words | Phonetic similarity and semantic relation | Cognitive modeling, spell checking
Tree-banks | Words | Syntactic dependency links | Evolution of syntax
Word Co-occurrence | Words | Co-occurrence | IR, WSD, LSA, …
23Summarizing
- SNIC and CNT are emerging techniques for modeling complex systems at the mesoscopic level
- Applied to Physics, Biology, Sociology, Economics, Logistics
- Language: an ideal application domain for SNIC
- SNIC models in NLP, Quantitative Linguistics, language change, evolution and acquisition
24Topological Characterization of Networks
25Types Of Networks and Representation
- Unipartite: binary/weighted, undirected/directed
- Bipartite: binary/weighted, undirected/directed
- Representation
  - Adjacency Matrix
  - Adjacency List

Adjacency matrix:
      a  b  c
  a   0  1  1
  b   1  0  1
  c   1  1  0

Adjacency list:
  a: b, c
  b: a, c
  c: a, b
26Properties of Adjacency Matrix
- A = [a_ij], where i and j are nodes and a_ij = 1 if there is an edge between i and j
- A² = A·A; its entries give the number of paths of length 2 between any two nodes: (A²)_ij = Σ_k a_ik a_kj
- In general, Aⁿ gives the number of paths of length n
- Trace(A) = Σ_i a_ii
- How is the trace of A³ related to the number of triangles in the n/w?
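The matrix-power facts above can be checked directly. A minimal sketch on an assumed toy graph (a triangle plus one pendant vertex): (A²)_ij counts length-2 paths, and trace(A³) equals six times the number of triangles, since each triangle yields a closed walk of length 3 starting at any of its 3 vertices in either of 2 directions.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Triangle a-b-c plus a pendant vertex d attached to a.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]

A2 = matmul(A, A)
A3 = matmul(A, A2)

print(A2[1][2])        # 1: one length-2 path between b and c (via a)
print(trace(A3) // 6)  # 1: the graph has exactly one triangle
```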
27Characterization of Complex N/ws??
- They have a non-trivial topological structure
- Properties:
  - Heavy tail in the degree distribution (non-negligible probability mass towards the tail, more than in the case of an exponential distribution)
  - High clustering coefficient
  - Centrality properties
  - Social roles and equivalence
  - Assortativity
  - Community structure
  - Small average path length (as in random graphs)
  - Preferential attachment
  - Small-world properties
28Degree Distribution (DD)
- Let p_k be the fraction of vertices in the network that have degree k
- The k versus p_k plot is defined as the degree distribution of the network
- For most real-world networks these distributions are right-skewed, with a long right tail of values far above the mean: p_k varies as k^(-α)
- Due to noisy and insufficient data, the definition is sometimes slightly modified
  - The cumulative degree distribution is plotted instead
  - Probability that the degree of a node is greater than or equal to k
29A Few Examples
Power law: p_k ~ k^(-α)
30Friend of Friends
- Consider the following scenario
  - Sourish and Ravi are friends
  - Sourish and Shaunak are friends
  - Are Shaunak and Ravi friends?
- If so, the "friend of" relation carries over; this property is known as transitivity
31Measuring Transitivity Clustering Coefficient
- The clustering coefficient for a vertex v in a network is defined as the ratio of the total number of connections among the neighbors of v to the total number of possible connections between the neighbors
- A high clustering coefficient means my friends know each other with high probability, a typical property of social networks
32Mathematically
- The clustering coefficient of a vertex i is C_i = 2E_i / (k_i(k_i − 1)), where k_i is the degree of i and E_i is the number of edges among the neighbors of i
- The clustering coefficient of the whole network is the average: C = (1/N) Σ_i C_i
- Alternatively, C = 3 × (number of triangles) / (number of connected triples)
33Centrality
- Centrality measures are commonly described as indices of the 4 Ps -- prestige, prominence, importance, and power
- Degree: count of immediate neighbors
- Betweenness: nodes that form a bridge between two regions of the n/w
  - C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st, where σ_st is the total number of shortest paths between s and t, and σ_st(v) is the number of those paths that pass through v
34Eigenvector centrality Bonacich (1972)
- It is not just how many people know me that counts towards my popularity (or power), but how many people know people who know me; this is recursive!
- In the context of HIV transmission: a person x with one sex partner is less prone to the disease than a person y with multiple partners
- But imagine what happens if the partner of x has multiple partners
- This is the basic idea of eigenvector centrality
35Definition
- Eigenvector centrality is defined as the principal eigenvector of the adjacency matrix
- An eigenvector of a symmetric matrix A = [a_ij] is any vector e such that Ae = λe, i.e., λe_i = Σ_j a_ij e_j, where λ is a constant and e_i is the centrality of node i
- What does it imply? The centrality of a node is proportional to the centrality of the nodes it is connected to (recursively)
- Practical example: Google PageRank
36Node Equivalence
- Social roles: nodes (actors) in a social n/w who have similar patterns of relations (ties) with other nodes
- Three different ways to find equivalence classes:
  - Structural equivalence
  - Automorphic equivalence
  - Regular equivalence
37Structural Equivalence
- Two nodes are said to be exactly structurally equivalent if they have the same relationships to all other nodes.
Computation: let A be the adjacency matrix. Compute the Euclidean distance / Pearson correlation between a pair of rows/columns representing the neighbor profiles of two nodes (say i and j). This value shows how structurally similar i and j are.
38Automorphic Equivalence
- The idea of automorphic equivalence is that sets of actors can be equivalent by being embedded in local structures that have the same patterns of ties -- "parallel" structures.
Swap (B, D) along with all their neighbors: the distances among all the actors in the graph would be exactly identical. Path vector of i: how many nodes are at distance 1, 2, … from node i. Amount of equivalence: distance between path vectors.
39Regular Equivalence
- Two nodes are said to be regularly equivalent if
they have the same profile of ties with members
of other sets of actors that are also regularly
equivalent.
Class I: 1 tie with Class II, no tie with Class III
Class II: 1 tie with Class I, 1/2 tie(s) with Class III
Class III: no tie with Class I, 1 tie with Class II
40Assortativity (homophily)
- The rich go with the rich (selective linking)
- A famous actor (e.g., Shah Rukh Khan) would prefer to pair up with another famous actor (e.g., Rani Mukherjee) in a movie rather than a newcomer to the film industry.
Assortative scale-free network
Disassortative scale-free network
41Measures of Assortativity
- ANND (Average nearest neighbor degree)
- Find the average degree of the neighbors of each
node i with degree k - Find the Pearson correlation (r) between the
degree of i and the average degree of its
neighbors - For further reference see the supplementary
material
42Community structure
- Community structure: a group of vertices that have a high density of edges within the group and a low density of edges between groups
- Examples
  - Friendship n/w of children
  - Citation n/ws: research interest
  - World Wide Web: subject matter of pages
  - Metabolic networks: functional units
  - Linguistic n/ws: similar linguistic categories
43Some Examples
Community Structure in Political Books
Community structure in a Social n/w of Students
(American High School)
44Community Identification Algorithms
- Hierarchical
  - Girvan-Newman
  - Radicchi et al.
- Chinese Whispers
- Spectral bisection
- See (Newman 2004) for a comprehensive survey (you will find the ref. in the supplementary material)
45Girvan-Newman Algorithm
- Bisection method
  1. Calculate the betweenness for all edges in the network.
  2. Remove the edge with the highest betweenness.
  3. Recalculate betweennesses for all edges affected by the removal.
  4. Repeat from step 2 until no edges remain.
46Evolution of Networks / Processes on Networks
47Random Graphs Small Average Path Length
Q: What do we mean by a random graph?
A: The Erdős-Rényi random graph model: for every pair of nodes, draw an edge between them with equal probability p.

Degrees of Separation in a Random Graph
- N nodes
- z neighbors per node, on average; z = ⟨k⟩
- D degrees of separation: z^D ≈ N, so D ≈ log N / log z
P(k) = e^(−⟨k⟩) ⟨k⟩^k / k!  (Poisson degree distribution)
48Degree Distributions
49Degree distributions for various networks
- World-Wide Web
- Coauthorship networks computer science, high
energy physics, condensed matter physics,
astrophysics - Power grid of the western United States and
Canada - Social network of 43 Mormons in Utah
50How do Power law DDs arise?
Barabási-Albert Model of Preferential Attachment (the rich get richer)
(1) GROWTH: starting with a small number of nodes (m0), at every timestep we add a new node with m (≤ m0) edges (connected to the nodes already present in the system).
(2) PREFERENTIAL ATTACHMENT: the probability Π that the new node will be connected to node i depends on the connectivity k_i of that node: Π(k_i) = k_i / Σ_j k_j
A.-L. Barabási, R. Albert, Science 286, 509 (1999)
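A minimal sketch of the two rules above. Sampling a target with probability k_i / Σ_j k_j is done with the standard trick of drawing uniformly from a list that repeats each node once per incident edge; the sizes (n = 2000, m = 2) are illustrative assumptions.

```python
import random

def barabasi_albert(n, m, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    targets = list(range(m))   # initial m0 = m seed nodes
    repeated = []              # each node appears once per incident edge
    for v in range(m, n):
        # GROWTH: connect the new node v to m chosen targets
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
        repeated.extend(targets)
        repeated.extend([v] * m)
        # PREFERENTIAL ATTACHMENT: pick m distinct degree-proportional
        # targets for the next node
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(repeated))
    return adj

G = barabasi_albert(2000, 2)
degs = sorted((len(nbrs) for nbrs in G.values()), reverse=True)
print(degs[:5])   # the heavy tail: a handful of large hubs
print(degs[-1])   # the low-degree end stays near m
```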
51Mean Field Theory
52The World is Small!
- Registration fees for IJCNLP 2008 are being waived for all participants; collect yours from the registration counter
- How long do you think the above information will take to spread among yourselves?
- Experiments say it will spread very fast: within 6 hops from the initiator it would reach all
- This is the famous Milgram's "six degrees of separation"
53The Small World Effect
- Even in very large social networks, the average distance between nodes is usually quite short.
- Milgram's small world experiment:
  - Target individual in Boston
  - Initial senders in Omaha, Nebraska
  - Each sender was asked to forward a packet to a friend who was closer to the target
  - Friends asked to do the same
- Result: average of six degrees of separation.
- S. Milgram, The small world problem, Psych. Today, 2 (1967), pp. 60-67.
54Measure of Small-Worldness
- Low average geodesic path length
- High clustering coefficient
- Geodesic path: shortest path through the network from one vertex to another
- Mean path length: ℓ = 2 Σ_{i≥j} d_ij / (n(n+1)), where d_ij is the geodesic distance from vertex i to vertex j
- Most networks observed in the real world have ℓ ≤ 6
  - Film actors: 3.48
  - Company directors: 4.60
  - Emails: 4.95
  - Internet: 3.33
  - Electronic circuits: 4.34
55Clustering
C: probability that two of a node's neighbors are themselves connected. In a random graph, C_rand ~ 1/N (if the average degree is held constant).
56Watts-Strogatz Small World Model
Watts and Strogatz introduced this simple model to show how networks can have both short path lengths and high clustering.
D. J. Watts and S. H. Strogatz, Collective dynamics of "small-world" networks, Nature, 393 (1998), pp. 440-442.
57Small-world model
- Used for modeling network transitivity
- Many networks assume some kind of geographical proximity
- Small-world model:
  - Start with a low-dimensional regular lattice
  - Rewire:
    - Add/remove edges to create shortcuts that join remote parts of the lattice
    - For each edge, with probability p move the other end to a random vertex
- Rewiring allows us to interpolate between a regular lattice and a random graph
58Small-world model
- Regular lattice (p = 0)
  - Clustering coefficient C = (3k − 3)/(4k − 2) ≈ 3/4
  - Mean distance ℓ ≈ L/4k
- Almost random graph (p = 1)
  - Clustering coefficient C ≈ 2k/L
  - Mean distance ℓ ≈ log L / log k
- No power-law degree distribution
[Figure: degree distribution as a function of the rewiring probability p.]
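A minimal sketch of the Watts-Strogatz construction: a ring of L nodes, each linked to its k nearest neighbors on either side, then each lattice edge rewired to a random vertex with probability p. The parameters (L = 100, k = 2) are illustrative assumptions.

```python
import random

def watts_strogatz(L, k, p, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(L)}
    # ring lattice: connect each node to its k nearest neighbors per side
    for v in range(L):
        for d in range(1, k + 1):
            w = (v + d) % L
            adj[v].add(w)
            adj[w].add(v)
    # rewire each original lattice edge with probability p
    for v in range(L):
        for d in range(1, k + 1):
            w = (v + d) % L
            if w in adj[v] and rng.random() < p:
                u = rng.randrange(L)
                if u != v and u not in adj[v]:
                    adj[v].discard(w); adj[w].discard(v)
                    adj[v].add(u); adj[u].add(v)
    return adj

ring = watts_strogatz(100, 2, 0.0)  # p = 0: pure lattice, C = 1/2 here
sw = watts_strogatz(100, 2, 0.1)    # small p: a few long-range shortcuts
print(len(ring[0]))  # 4 neighbors: 2 on each side
```

The edge count is conserved by rewiring (each move swaps one endpoint), which is what lets the model interpolate cleanly between lattice and random graph.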
59Resilience of Networks
- We consider the resilience of the network to the
removal of its vertices (site percolation) or
edges (bond percolation). - As vertices (or edges) are removed from the
network, the average path length will increase. - Ultimately, the giant component will
disintegrate. - Networks vary according to their level of
resilience to vertex (or edge) removal.
60Stability MetricPercolation Threshold
f: fraction of nodes removed. Starting from a single connected component, a giant component still exists while f < fc; beyond fc the entire graph breaks into small fragments. Therefore fc = 1 − qc becomes the percolation threshold.
61Ordinary Percolation on Lattices
Fill in each link (bond percolation) or site
(site percolation) with probability p and ask
questions about the sizes of connected
components.
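A minimal sketch of site percolation as described above: delete a random fraction f of nodes from a substrate graph and measure the surviving giant component. The substrate (an Erdős-Rényi graph with N = 500, ⟨k⟩ = 4) and the two values of f are illustrative assumptions.

```python
import random
from collections import deque

def erdos_renyi(N, p, seed=1):
    rng = random.Random(seed)
    adj = {v: set() for v in range(N)}
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_component(adj):
    """Size of the largest connected component, via BFS."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    q.append(w)
        seen |= comp
        best = max(best, len(comp))
    return best

def percolate(adj, f, seed=0):
    """Remove a random fraction f of nodes; return largest surviving component."""
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(adj), int(len(adj) * (1 - f))))
    sub = {v: [w for w in adj[v] if w in keep] for v in keep}
    return largest_component(sub) if sub else 0

L = 500
G = erdos_renyi(L, 4 / (L - 1))
print(percolate(G, 0.1) / L)  # mild damage: the giant component survives
print(percolate(G, 0.9) / L)  # past the threshold: only tiny fragments remain
```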
62Percolation in Poisson and Scale free networks
Exponential Network
Scale free Network
63CASE STUDY I Self-Organization of the Sound
Inventories
64Human Communication
- Human beings and many other living organisms produce sound signals
- Unlike other organisms, humans can concatenate these sounds to produce new messages: Language
- Language is one of the primary causes/effects of human intelligence
65Human Speech Sounds
- Human speech sounds are called phonemes: the smallest units of a language
- Phonemes are characterized by certain distinctive features, like:
66Types of Phonemes
- Consonants: e.g., /t/, /p/, /k/
- Vowels: e.g., /i/, /a/, /u/
- Diphthongs: e.g., /ai/
67Choice of Phonemes
- How does a language choose a set of phonemes to build its sound inventory?
- Is the process arbitrary?
- Certainly not!
- What are the forces affecting this choice?
68Forces of Choice
A linguistic system: how does it look?
The speaker desires ease of articulation; the listener desires perceptual contrast / ease of learnability.
The forces shaping the choice are opposing; hence there has to be a non-trivial solution.
69Vowels A (Partially) Solved Mystery
- Languages choose vowels based on maximal perceptual contrast.
- For instance, if a language has three vowels, then in more than 95% of the cases they are /a/, /i/ and /u/.
70Consonants A puzzle
- Research: from 1929 to date
- No single satisfactory explanation of the organization of the consonant inventories
- The set of features that characterize consonants is much larger than that of vowels
- No single force is sufficient to explain this organization
- Rather, a complex interplay of forces goes on in shaping these inventories
71Principle of Occurrence
- PlaNet: the Phoneme-Language Network
- A bipartite network N(V_L, V_C, E)
  - V_L: nodes representing languages of the world
  - V_C: nodes representing consonants
  - E: set of edges which run between V_L and V_C
- There is an edge e ∈ E between two nodes v_l ∈ V_L and v_c ∈ V_C if the consonant c occurs in the language l.
Choudhury et al. 2006, ACL; Mukherjee et al. 2007, Int. Jnl. of Modern Physics C
The Structure of PlaNet
72Construction of PlaNet
- Data source: UCLA Phonological Segment Inventory Database (UPSID)
- Number of nodes in V_L is 317
- Number of nodes in V_C is 541
- Number of edges in E is 7022
73Degree Distribution of PlaNet
The DD of the language nodes follows a β-distribution.
The DD of the consonant nodes follows a power-law with an exponential cut-off.
The distribution of consonants over languages follows a power-law.
74Synthesis of PlaNet
- Non-linear preferential attachment
- Iteratively construct the language inventories, given their inventory sizes
Pr(C_i) = (d_i^α + ε) / Σ_{x∈V_C} (d_x^α + ε), where d_i is the current degree of consonant node C_i
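The attachment kernel above can be sketched as follows. This is a hedged toy version: languages pick distinct consonants by roulette-wheel selection with weight d^α + ε; α = 1.44 and ε = 0.5 are the values reported on the next slide, while the inventory sizes and number of consonants are invented for illustration (the real data come from UPSID).

```python
import random

def synthesize(inventory_sizes, n_consonants, alpha=1.44, eps=0.5, seed=0):
    """Grow language inventories by nonlinear preferential attachment."""
    rng = random.Random(seed)
    degree = [0] * n_consonants
    inventories = []
    for size in inventory_sizes:
        # kernel computed from the degrees before this language attaches
        weights = [degree[c] ** alpha + eps for c in range(n_consonants)]
        chosen = set()
        while len(chosen) < size:
            # roulette-wheel selection over the not-yet-chosen consonants
            total = sum(w for c, w in enumerate(weights) if c not in chosen)
            r = rng.random() * total
            acc = 0.0
            for c, w in enumerate(weights):
                if c in chosen:
                    continue
                acc += w
                if acc >= r:
                    chosen.add(c)
                    break
        for c in chosen:
            degree[c] += 1
        inventories.append(chosen)
    return inventories, degree

invs, deg = synthesize([5, 8, 6, 7, 5, 9], 30)
print(sorted(deg, reverse=True)[:3])  # a few consonants recur across languages
```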
75Simulation Result
The parameters α and ε are 1.44 and 0.5 respectively. The results are averaged over 100 runs.
76Principle of Co-occurrence
- Consonants tend to co-occur in groups or communities
- These groups tend to be organized around a few distinctive features (based on manner of articulation, place of articulation, phonation): the principle of feature economy
If a language has certain consonants in its inventory, then it will also tend to have other consonants that share their distinctive features.
77How to Capture these Co-occurrences?
- PhoNet: the Phoneme-Phoneme Network
- A weighted network N(V_C, E)
  - V_C: nodes representing consonants
  - E: set of edges which run between the nodes in V_C
- There is an edge e ∈ E between two nodes v_c1, v_c2 ∈ V_C if the consonants c1 and c2 co-occur in a language. The number of languages in which c1 and c2 co-occur defines the edge weight of e. The number of languages in which c1 occurs defines the node weight of v_c1.
78Construction of PhoNet
- Data Source UPSID
- Number of nodes in VC is 541
- Number of edges is 34012
PhoNet
79Community Structures in PhoNet
- Radicchi et al. algorithm (for unweighted networks): counts the number of triangles that an edge is a part of. Inter-community edges will have a low count, so remove them.
- Modification for a weighted network like PhoNet:
  - Look for triangles where the weights on the edges are comparable.
  - If they are comparable, then the group of consonants co-occurs highly; else it does not.
  - Measure a strength S for each edge (u, v) in PhoNet.
  - Remove edges with S less than a threshold η.
80Community Formation
For different values of η we get different sets of communities.
81Consonant Societies!
η = 0.35
η = 0.60
η = 0.72
η = 1.25
That the communities are good can be quantitatively shown by measuring the feature entropy.
82Problems to ponder on
- Physical significance of PA
- Functional forces
- Historical/Evolutionary process
- Labeled synthesis of PlaNet and PhoNet
- Language diversity vs. Preferential attachment
83CASE STUDY II Modeling the Mental Lexicon
84Mental Lexicon (ML) Basics
- It refers to the repository of word forms that resides in the human brain
- Two questions:
  - How are words stored in the long-term memory, i.e., what is the organization of the ML?
  - How are words retrieved from the ML (lexical access)?
- The above questions are highly inter-related: to predict the organization, one can investigate how words are retrieved, and vice versa.
85Different Possible Ways of Organization
- Un-organized (a bag full of words), or
- Organized
  - By sound (phonological similarity)
    - E.g., start the same: banana, bear, bean
    - End the same: look, took, book
    - Number of phonological segments they share
  - By meaning (semantic similarity)
    - Banana, apple, pear, orange
  - By age at which the word is acquired
  - By frequency of usage
  - By POS
  - Orthographically
86The Hierarchical Model of ML
- Proposed by Collins and Quillian in 1969
- Concepts are organized in a hierarchy
- Taxonomic and attributive relations are represented
- Cognitive economy: put each attribute at the highest of all appropriate levels; e.g., "reproduces" applies to the whole animal kingdom
87Hierarchical Model
- According to the principle of cognitive economy:
  - "Animals eat" < "mammals eat" < "humans eat"
  - However, "a shark is a fish" = "a salmon is a fish"
- What do < and = mean?
  - <: less time to judge
  - =: equal time to judge
88Spreading Activation Model of ML
- Not a hierarchical structure but a web of inter-connected nodes (first proposed by Collins and Loftus in 1975)
- Distance between nodes is determined by the structural characteristics of the word forms (by sound, by meaning, by age, by …)
- Combining the above two: a plethora of complex networks
89Phonological Neighborhood Network
- (Vitevitch 2004)
- (Gruenenfelder Pisoni, 2005)
- (Kapatsinski 2006)
- Sound Similarity Relations in the Mental
Lexicon Modeling the Lexicon as a Complex
Network
90N/W Definition
- Nodes: words
- Edge: an edge is drawn from node A to node B if at least 2/3 of the segments that occur in the word represented by A also occur in the word represented by B
- I.e., if the word represented by A is 6 segments long, then one can derive all its neighbors (B) from it by at most two phoneme changes (insertions, deletions or substitutions).
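The neighborhood rule above can be sketched with a standard edit-distance computation: connect A to B when B is reachable from A in at most len(A)/3 insertions, deletions or substitutions. The toy "transcriptions" below are plain strings standing in for segment sequences, an assumption for illustration (the real data are phonological transcriptions).

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def neighborhood_network(words):
    """Directed edges: A -> B if B is within len(A)/3 edits of A."""
    edges = {w: set() for w in words}
    for a in words:
        for b in words:
            if a != b and edit_distance(a, b) <= len(a) / 3:
                edges[a].add(b)
    return edges

net = neighborhood_network(['kat', 'bat', 'katnip', 'dog'])
print(sorted(net['kat']))  # ['bat']: one substitution on a 3-segment word
```

Note the asymmetry the slide mentions: the threshold depends on the source word's length, so a long word can reach a short one without the reverse holding.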
91N/W Construction
- Database
  - Hoosier Mental Lexicon (Nusbaum et al., 1984)
  - Phonologically transcribed words → n/w using the metric defined earlier
- Nodes with no links correspond to hermit words, i.e., words that have no neighbors
- Random networks (E-R) for comparison
- Directed n/w → a long word can have a short word as a neighbor, but not vice versa
  - Have a link only if the duration of the difference in the word pair < (duration of a word)/3 (the factor 1/3 is experimentally derived; see the paper for further info.)
92Neighborhood Density
- The node whose neighbors are searched → the base word
- Neighborhood density of a base word is expressed as the out-degree of the node representing the base word
- It is an estimate of the number of words activated by the base word when the base word is presented → spreading activation
- Something like semantic priming (however, at the phonological level)
93Results of the N/W Analysis
- Small-world properties?
- High clustering but also long average path length: like a SW network, the lexicon has densely connected neighborhoods, but links between two nodes of different neighborhoods are harder to find than in SW networks
94Visualization A Disconnected Graph with a
Giant Component (GC)
- The GC is elongated: there are some nodes with really long chains of intermediates, and hence the mean path length is long
95Low Degree Nodes are Important!!!
- Removal of low-degree nodes renders the n/w almost disconnected
- A bottleneck is formed between longer (more than 7 segments long) and shorter words
- This bottleneck consists of the -tion final words (coalition, passion, nation, fixation/fission); they form short-cuts between the high-degree nodes (i.e., they are low-degree stars that connect mega-neighborhoods)
96Removal of Nodes with Degree < 40
Removal of low-degree nodes disconnects the n/w, as opposed to the removal of hubs like "pastor" (deg. 112).
[Figure: 2-4 segment words vs. 8-10 segment words.]
97Why low connectivity between neighborhoods?
- Spreading activation should not reach neighbors of the stimulus' neighbors that are non-neighbors of the stimulus itself (and are, therefore, not similar to the stimulus), as these would inhibit recognition
- A low mean path → complete traversal of n/ws, e.g., in general-purpose search
- Search in the lexicon does not need to traverse links between distant nodes; rather it involves activation of the structured neighborhood sharing a single sub-lexical chunk that could be acoustically related during word-recognition (Marslen-Wilson, 1990).
98Degree Distribution (DD)
- Exponential rather than power-law
[Figure: DD for the entire lexicon, 5-7 segment words, and 8-10 segment words.]
99Other Works (see supplementary material for references)
- Vitevitch (2005)
  - Similar to the above work, but builds n/ws of nodes that are just one segment different
- Choudhury et al. (2007)
  - Builds weighted n/ws in Hindi, Bengali and English based on orthographic proximity (nodes: words; edges: orthographic edit-distance): SpellNet
  - Thresholds (θ) to make the n/ws binary (at θ = 1, 3, 5)
  - They also obtain exponential DDs
  - Observe that the occurrence of real-word errors in a language is proportional to the avg. weighted deg. of the SpellNet of that language
100Other Works
- Sigman et al. (2002)
  - Analyzes the English WordNet
  - All semantic relationships are scale-invariant
  - Inclusion of polysemy makes the n/w SW
- Ferrer i Cancho et al. (2000, 2001)
  - Word co-occurrence (in a sentence) based definitions of the lexicon
  - Lexicon = Kernel Lexicon + Peripheral Lexicon
  - Finds a 2-regime DD: one regime comprises words in the kernel lexicon, the other words in the peripheral lexicon
  - Finds that these n/ws are small-world
101Some Unsolved Mysteries You Can Give a Try
- What can be a model for the evolution of the ML?
- How is the ML acquired by a child learner?
- Is there a single optimal structure for the ML, or is it organized based on multiple criteria (i.e., a combination of the different n/ws)?
Towards a single framework for studying the ML!
102CASE STUDY III Syntax: Unsupervised POS Tagging
103Labeling of Text
- Lexical category (POS tags)
- Syntactic category (phrases, chunks)
- Semantic role (agent, theme, …)
- Sense
- Domain-dependent labeling (genes, proteins, …)
- How to define the set of labels?
- How to (learn to) predict them automatically?
104Nothing makes sense, unless in context
- Distribution-based definition of
- Lexical category
- Sense (meaning)
- The X is
- If you X then I shall
- looking at the star PP
105General Approach
- Represent the context of a word (token)
- Define some notion of similarity between the
contexts - Cluster the contexts of the tokens
- Get the label of the tokens
w1
w3
w2
w4
106Issues
- How to define the context?
- How to define similarity?
- How to cluster?
- How to evaluate?
107Unsupervised Parts-of-Speech Tagging Employing
Efficient Graph Clustering
- Chris Biemann
- COLING-ACL 2006
108Stages
- Input: raw text corpus
- Identify feature words, and define a graph for high- and medium-frequency words (top 10000)
- Cluster the graph to identify the word classes
- For low-frequency words, use context similarity
- Lexicon of word classes → tag the same text → learn a Viterbi tagger
109Feature Words
- Estimate the unigram frequencies
- Feature words: the 200 most frequent words
110Feature Vector
- "From the familiar to the exotic, the collection is a delight"
- Each token is represented by four binary vectors over the 200 feature words (fw1 = the, fw2 = to, …, fw199, fw200), one per context position p−2, p−1, p+1, p+2; e.g., for the target "familiar":

        the  to  is  from
  p−2:   0    0   0   1
  p−1:   1    0   0   0
  p+1:   0    1   0   0
  p+2:   1    0   0   0
111Syntactic Network of Words
[Figure: syntactic network of words (sky, color, weight, light, blue, blood, heavy, red); edges are weighted by the similarity of the words' context feature vectors, e.g., cos(red, blue).]
112The Chinese Whisper Algorithm
[Figure: weighted word graph (sky, color, weight, light, blue, blood, heavy, red) with similarity weights such as 0.9, 0.8, 0.7, 0.5 and −0.5, on which Chinese Whispers iterates.]
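The iteration depicted above can be sketched compactly: every node starts in its own class, and in randomized order each node repeatedly adopts the label carrying the highest total edge weight among its neighbors, so that negative weights (dissimilar contexts) keep classes apart. The edge weights below are illustrative, not the slide's exact figure.

```python
import random

def chinese_whispers(edges, iters=20, seed=0):
    """Chinese Whispers label propagation on a weighted undirected graph."""
    rng = random.Random(seed)
    nodes = sorted({v for e in edges for v in e})
    adj = {v: {} for v in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    label = {v: v for v in nodes}  # every node starts as its own class
    for _ in range(iters):
        order = nodes[:]
        rng.shuffle(order)
        for v in order:
            # sum the weights supporting each neighboring label
            scores = {}
            for u, w in adj[v].items():
                scores[label[u]] = scores.get(label[u], 0.0) + w
            if scores:
                label[v] = max(scores, key=scores.get)
    return label

# Toy version of the slide's graph: two word clusters linked only by a
# negative (dissimilarity) edge between 'color' and 'blue'.
edges = {
    ('sky', 'color'): 0.9, ('color', 'blood'): 0.8, ('sky', 'blood'): 0.7,
    ('blue', 'red'): 0.9, ('red', 'heavy'): 0.5, ('blue', 'heavy'): 0.7,
    ('color', 'blue'): -0.5,
}
label = chinese_whispers(edges)
print(label['sky'] == label['blood'], label['blue'] == label['red'])
```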
115Medium and Low Frequency Words
- Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded
- Two words are connected iff they share at least 4 neighbors

Language | English | Finnish | German
Nodes | 52857 | 85627 | 137951
Edges | 691241 | 702349 | 1493571
116Construction of Lexicon
- Each word is assigned a unique tag based on the word class it belongs to
  - Class 1: sky, color, blood, weight
  - Class 2: red, blue, light, heavy
- Ambiguous words
  - High- and medium-frequency words that formed singleton clusters
  - Possible tags: those of the neighboring clusters
117Training and Evaluation
- Unsupervised training of a trigram HMM using the clusters and lexicon
- Evaluation
  - Tag a text for which a gold standard is available
  - Estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C)
- Final results
  - English 2.05 (619/345), Finnish 3.22 (625/466), German 1.79 (781/440)
118Example
From the familiar to the exotic, the collection is a delight
Gold tags: Prep At JJ Prep At JJ At NN V At NN
Clusters: C200 C1 C331 C5 C1 C331 C1 C221 C3 C1 C220
119Word Sense Disambiguation
- Véronis, J. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223-252.
- Let the word to be disambiguated be "light"
- Select a subcorpus of paragraphs which have at least one occurrence of "light"
- Construct the word co-occurrence graph
120HyperLex
- "A beam of white light is dispersed into its component colors by its passage through a prism."
- "Energy-efficient light fixtures including solar lights, night lights, energy star lighting, ceiling lighting, wall lighting, lamps …"
- "What enables us to see the light and experience such wonderful shades of colors during the course of our everyday lives?"
[Figure: co-occurrence graph around "light": prism, beam, dispersed, white, colors, shades, energy, fixtures, efficient, lamps.]
121Hub Detection and MST
[Figure: hubs detected in the co-occurrence graph around "light" (beam, prism, dispersed, white, colors, shades; energy, fixtures, efficient, lamps), and the minimum spanning tree rooted at the hubs.]
"White fluorescent lights consume less energy than incandescent lamps"
122Other Related Works
- Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005. Unsupervised learning of natural languages. PNAS, 102(33):11629-11634.
- Ferrer i Cancho, R. 2007. Why do syntactic links not cross? Europhysics Letters.
- Also applied to IR, summarization, sentiment detection and categorization, script evaluation, author detection, …
123Discussions Conclusions
- What we learnt
- Advantages of SNIC in NLP
- Comparison to standard techniques
- Open problems
- Concluding remarks and QA
124What we learnt
- What SNIC and Complex Networks are
- Analytical tools for SNIC
- Applications to human languages
- Three case studies:

Case | Area | Perspective | Technique
I | Sound systems | Language evolution and change | Synthesis models
II | Lexicon | Psycholinguistic modeling and linguistic typology | Topology and search
III | Syntax & Semantics | Applications to NLP | Clustering
125What we saw
- Language features complex structure at every level of organization
- Linguistic networks have non-trivial properties: scale-free, small-world
- Therefore, language, and engineering systems involving language, should be studied within the framework of complex systems, esp. CNT
126Advantages of SNIC
- Fully unsupervised techniques
  - No labeled data required: a good answer to resource scarcity
  - The problem of evaluation circumvented by semi-supervised techniques
- Ease of computation
  - Simple and scalable
  - Distributed and parallel computable
- Holistic treatment
  - Language evolution and psycho-linguistic theories
127Comparison to Standard Techniques
- Rule-based vs. Statistical NLP
- Graphical Models
- Generative models in machine learning
- HMM, CRF, Bayesian belief networks
JJ
NN
RB
VF
128Graphical Models vs. SNIC
Graphical models:
- Principled: based on Bayesian theory
- Structure is assumed and parameters are learnt
- Focus: decoding and parameter estimation
- Data-driven or computationally intensive
- The generative process is easy to visualize, but there is no visualization of the data

SNIC:
- Heuristic, but with underlying principles of linear algebra
- Structure is discovered and studied
- Focus: topology and evolutionary dynamics
- Unsupervised and computationally easy
- Easy visualization of the data
129Language Modeling
- A network of words as a model of language vs. n-gram models
- Hierarchical, hyper-graph based models
- Smoothing through holistic analysis of the network topology
- Jedynak, B. and Karakos, D. 2007. Unigram Language Models using Diffusion Smoothing over Graphs. Proc. of TextGraphs-2.
130Open Problems
- Universals and variables of linguistic networks
- Superimposition of networks phonetic, syntactic,
semantic - Which clustering algorithm for which topology?
- Metrics for network comparison important for
language modeling - Unsupervised dependency parsing using networks
- Mining translation equivalents
131Resources
- Conferences
  - TextGraphs, Sunbelt, EvoLang, ECCS
- Journals
  - PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS, Complexity, Social Networks
- Tools
  - Pajek, CUNG, http://www.insna.org/INSNA/soft_inf.html
- Online resources
  - Bibliographies, courses on CNT
132Contact
- Monojit Choudhury
  - monojitc_at_microsoft.com
  - http://www.cel.iitkgp.ernet.in/monojit/
- Animesh Mukherjee
  - animeshm_at_cse.iitkgp.ernet.in
  - http://www.cel.iitkgp.ernet.in/animesh/
- Niloy Ganguly
  - niloy_at_cse.iitkgp.ernet.in
  - http://www.facweb.iitkgp.ernet.in/niloy/
133Thank you!!