Title: Lexical networks
1Lecture 19 Lexical networks
Slides modified from Dragomir R. Radev
2Social data
- Blog postings
- News stories
- Speeches in Congress
- Query logs
- Movie and book reviews
- Scientific papers
- Financial reports
- Encyclopedia entries
- Email
- Chat room discussions
- Social networking sites
WHAT DO ALL OF THESE HAVE IN COMMON?
3Natural language processing
- Part of speech tagging
- Prepositional phrase attachment
- Parsing
- Word sense disambiguation
- Document indexing
- Text summarization
- Machine translation
- Question answering
- Information retrieval
- Social network extraction
- Topic modeling
4Talk outline
- Lexical networks
- Semantic networks
- Lexical centrality
- Latent networks
- Conclusion
5Lexical networks
6Lexical networks
- A special case of networks where nodes are words
or documents and edges link semantically related
nodes
- Other examples
- Words used in dictionary definitions
- Names of people mentioned in the same story
- Words that translate to the same word
- A semantic network consists of a set of nodes
that are connected by labeled arcs.
- The nodes represent concepts and
- The arcs represent relations between concepts.
7Semantic network
8Free word associations
M. Steyvers, J. B. Tenenbaum (2005). The large-scale
structure of semantic networks: statistical analyses
and a model of semantic growth. Cognitive Science, 29(1)
9Dependency network
[Figure: dependency graph over the words bought, Meredith, yesterday, apples, green]
10Dependency network
11Semantic Networks
12So again A Semantic Network is
- A semantic (or associative) network is a simple
representation scheme which uses a graph of
labeled nodes and labeled, directed arcs to
encode knowledge.
- Labeled nodes: objects/classes/concepts.
- Labeled links: relations/associations between
nodes
- Labels define the semantics of nodes and links
- Usually used to represent static, taxonomic,
concept dictionaries
13Nodes and Arcs
- Nodes denote objects/classes
- Arcs define binary relationships between objects.
[Figure: network with nodes john, Sue, Max, 5, 34 and arcs labeled mother, father, wife, husband, age]
mother(john,sue) age(john,5) wife(sue,max) age(sue,34) ...
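The facts listed on this slide can be stored directly as labeled arcs. The following is a minimal sketch, not a standard representation: the triple layout and the helper names `build_network` and `query` are assumptions made for illustration, and only the four facts written out on the slide are used.

```python
from collections import defaultdict

# Facts copied from the slide, stored as (relation, subject, value) triples.
FACTS = [
    ("mother", "john", "sue"),
    ("age", "john", 5),
    ("wife", "sue", "max"),
    ("age", "sue", 34),
]


def build_network(facts):
    """Index labeled arcs by source node: node -> {label: [targets]}."""
    network = defaultdict(lambda: defaultdict(list))
    for relation, subject, value in facts:
        network[subject][relation].append(value)
    return network


def query(network, relation, subject):
    """Return every node reached from `subject` by an arc labeled `relation`."""
    return list(network.get(subject, {}).get(relation, []))


network = build_network(FACTS)
print(query(network, "mother", "john"))
print(query(network, "age", "sue"))
```

Each binary relation becomes one arc, so a query is a lookup on the source node followed by a lookup on the arc label.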
14Common Semantic Relations
- There is no standard set of relations for
semantic networks, but the following relations
are very common:
- INSTANCE: X is an INSTANCE of Y if X is a
specific example of the general concept Y.
- Example: Elvis is an INSTANCE of Human
- ISA: X ISA Y if X is a subset of the more general
concept Y.
- Example: sparrow ISA bird
- HASPART: X HASPART Y if the concept Y is a part
of the concept X.
- Or this can be any other property
- Example: sparrow HASPART tail
16ISA hierarchy
- The ISA (is a) or AKO (a kind of) relation is
often used to link a class and its superclass.
- And sometimes an instance and its class.
- Some links (e.g. has-part) are inherited along
ISA paths.
- The semantics of a semantic net can be relatively
informal or very formal
- often defined at the implementation level
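Inheritance along ISA paths can be sketched as a walk up the hierarchy that collects a given link type at every level. In the sketch below only "sparrow ISA bird" and "sparrow HASPART tail" come from the slides; the remaining arcs and the function names are hypothetical and are marked as such in the code.

```python
# Arcs as (source, label, target) triples.
ARCS = [
    ("sparrow", "ISA", "bird"),          # from the slides
    ("sparrow", "HASPART", "tail"),      # from the slides
    ("bird", "ISA", "animal"),           # hypothetical
    ("bird", "HASPART", "wings"),        # hypothetical
    ("tweety", "INSTANCE", "sparrow"),   # hypothetical
]


def parents(node):
    """Classes directly above `node` via ISA or INSTANCE arcs."""
    return [t for s, label, t in ARCS if s == node and label in ("ISA", "INSTANCE")]


def inherited(node, label):
    """Values of `label` on `node` and on every ancestor reachable via ISA/INSTANCE."""
    found, seen, stack = [], set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for source, arc_label, target in ARCS:
            if source == current and arc_label == label and target not in found:
                found.append(target)
        stack.extend(parents(current))
    return found


print(inherited("tweety", "HASPART"))
```

The instance has no HASPART arcs of its own; everything it reports is picked up from its class and superclass.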
17Inference by association
- Red (a robin) is related to Air Force One by
association (directed paths originating from
these two nodes join at the nodes Wings and Fly)
- Bob and George are not related (no paths
originating from them join in this network)
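The idea of two nodes being related when directed paths from them join can be sketched as an intersection of reachable sets. The slide's figure is not part of this transcript, so every arc below is a hypothetical reconstruction chosen only to agree with the two bullets; the names `reachable` and `join_nodes` are likewise assumptions.

```python
# Hypothetical network: arcs are assumed, not recovered from the slide's figure.
ARCS = {
    "Red": ["Robin"],
    "Robin": ["Bird"],
    "Bird": ["Wings", "Fly"],
    "Air Force One": ["Airplane"],
    "Airplane": ["Wings", "Fly"],
    "Bob": ["Employee"],
    "George": ["Student"],
}


def reachable(start):
    """All nodes reachable from `start` by following directed arcs."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in ARCS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def join_nodes(a, b):
    """Nodes where directed paths from `a` and from `b` meet."""
    return reachable(a) & reachable(b)


print(sorted(join_nodes("Red", "Air Force One")))
print(sorted(join_nodes("Bob", "George")))
```

An empty intersection corresponds to the "not related" case on the slide.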
18Frames A Semantic Network with properties
- A frame represents an entity as a set of slots
(attributes) and associated values.
- Frames act, look, etc. like objects in C
- A more robust/compact version of a semantic
network
- Each slot may have constraints that describe
legal values that the slot can take.
- A frame can represent a specific entity, or a
general concept.
- Frames are implicitly associated with one another
because the value of a slot can be another frame.
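A frame with slots, slot constraints, and frame-valued slots can be sketched in a few lines. This is an illustrative sketch, not a reference frame system: the `Frame` class, its method names, the `parent` link used for inheritance, and the example slot names are all assumptions.

```python
class Frame:
    """A frame: named slots, optional slot constraints, optional parent frame."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # assumed link to a more general frame
        self.slots = {}
        self.constraints = {}

    def constrain(self, slot, predicate):
        """Register a predicate describing the legal values of `slot`."""
        self.constraints[slot] = predicate

    def _constraint_for(self, slot):
        frame = self
        while frame is not None:
            if slot in frame.constraints:
                return frame.constraints[slot]
            frame = frame.parent
        return None

    def set(self, slot, value):
        check = self._constraint_for(slot)
        if check is not None and not check(value):
            raise ValueError(f"{value!r} is not a legal value for slot {slot!r}")
        self.slots[slot] = value

    def get(self, slot):
        """Look the slot up locally, then in parent frames."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


person = Frame("Person")                      # general concept
person.constrain("age", lambda v: isinstance(v, int) and 0 <= v <= 150)
person.set("legs", 2)

max_frame = Frame("Max", parent=person)       # specific entity
sue = Frame("Sue", parent=person)
sue.set("age", 34)
sue.set("husband", max_frame)                 # slot value is another frame
print(sue.get("age"), sue.get("legs"), sue.get("husband").name)
```

The constraint is declared once on the general frame and enforced on every specific frame below it, which is one simple way to read "typed inheritance".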
20Semantic Networks
- Rules are appropriate for some types of
knowledge,
- but do not easily map to others.
- Semantic nets can easily represent inheritance
and exceptions,
- but are not well-suited for representing
negation, disjunction, preferences, conditionals,
and cause/effect relationships.
- Frames allow arbitrary functions (demons) and
typed inheritance.
- Implementation is a bit more cumbersome.
21Lexical Centrality
22LexRank Centrality in Text Graphs
Vertices: Units of text (sentences or documents)
Edges: Pairwise similarity between text units
23LexRank Centrality in Text Graphs
Intuition: LexRank score is propagated through
edges. Central vertices are those that are
similar to other central vertices.
24LexRank Centrality in Text Graphs
Recurrence Relation
[Figure: example similarity graph around a vertex s, with edge weights 0.3, 0.1, 0.9, 0.3, 0.5, 0.8, 0.2, 0.4, 0.2]
Can guarantee solution by allowing jump
probability d/N.
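A minimal sketch of how such a recurrence can be solved by power iteration, assuming the published LexRank form p(u) = d/N + (1 - d) * sum over neighbors v of [w(v,u) / sum_z w(v,z)] * p(v), where d/N is the jump probability named on the slide. The graph topology, the value d = 0.15, and the function name `lexrank` are assumptions for illustration; the edge weights are not mapped from the slide's figure.

```python
def lexrank(weights, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for p(u) = d/N + (1-d) * sum_v w(v,u)/sum_z w(v,z) * p(v).

    `weights` is a symmetric dict-of-dicts of pairwise similarities.
    """
    nodes = sorted(weights)
    n = len(nodes)
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    scores = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new = {}
        for u in nodes:
            incoming = sum(
                weights[v][u] / out_sum[v] * scores[v]
                for v in nodes
                if u in weights[v] and out_sum[v] > 0
            )
            new[u] = d / n + (1 - d) * incoming
        delta = max(abs(new[u] - scores[u]) for u in nodes)
        scores = new
        if delta < tol:
            break
    return scores


# Hypothetical similarity graph centered on a vertex "s".
GRAPH = {
    "s": {"a": 0.9, "b": 0.8, "c": 0.5},
    "a": {"s": 0.9, "b": 0.3},
    "b": {"s": 0.8, "a": 0.3},
    "c": {"s": 0.5},
}
scores = lexrank(GRAPH)
print(max(scores, key=scores.get))
```

The jump term makes the underlying random walk irreducible and aperiodic, which is why the iteration settles on a single score vector regardless of the starting point.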
26http://tangra.si.umich.edu/clair/lexrank/
27NLP and network analysis
28Part of speech tagging [Biemann 2006]
Word sense disambiguation [Mihalcea et al 2004]
Document indexing [Mihalcea et al 2004]
Subjectivity analysis [Pang and Lee 2004]
Semantic class induction [Widdows and Dorow 2002]
Passage retrieval [Otterbacher, Erkan, Radev 05]
[Figure labels: relevance, inter-similarity, Q]
29MavenRank Centrality in Speech Graphs
Vertices: Speech transcripts from a given topic
Edges: tf-idf cosine similarity (with threshold)
Hypothesis: Key speakers will have speeches with
high centrality.
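Building a thresholded tf-idf cosine graph can be sketched as follows. The tf-idf variant (raw term frequency times log(N/df)), the threshold of 0.1, the toy documents, and all function names are assumptions; the slide does not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_graph(docs, threshold=0.1):
    """Keep an edge (i, j) only when the cosine similarity reaches the threshold."""
    vectors = tfidf_vectors(docs)
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                edges[(i, j)] = sim
    return edges


# Toy "speeches" (hypothetical), already tokenized.
speeches = [
    "budget tax bill".split(),
    "budget tax vote".split(),
    "weather rain storm".split(),
]
edges = similarity_graph(speeches)
print(sorted(edges))
```

The resulting weighted edges are the input a centrality computation over speeches would need.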
30MavenRank Example
Speech scores
Speech  Score
1       0.13
2       0.13
3       0.10
4       0.19
5       0.10
6       0.14
7       0.08
8       0.13
Speaker scores (mean speech score)
Speaker  Score
1        0.12
2        0.15
3        0.12
[Figure: graph of speeches 1 to 8 grouped into Speaker 1 Speeches, Speaker 2 Speeches, Speaker 3 Speeches]
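The speaker score on this slide is the mean of that speaker's speech scores, which can be sketched as below. The speech scores are the ones listed on the slide. The grouping of speeches by speaker is an assumption, since the figure is not recoverable from the transcript; it was chosen because its means land near the listed speaker scores, not because the slide states it.

```python
from statistics import mean

# Speech scores as listed on the slide.
SPEECH_SCORES = {1: 0.13, 2: 0.13, 3: 0.10, 4: 0.19,
                 5: 0.10, 6: 0.14, 7: 0.08, 8: 0.13}

# ASSUMPTION: which speeches belong to which speaker is not recoverable
# from the transcript; this grouping is illustrative only.
SPEECHES_BY_SPEAKER = {
    "Speaker 1": [1, 2, 3],
    "Speaker 2": [4, 5],
    "Speaker 3": [6, 7, 8],
}


def speaker_scores(speech_scores, speeches_by_speaker):
    """Speaker score = mean centrality score of that speaker's speeches."""
    return {
        speaker: mean(speech_scores[s] for s in speeches)
        for speaker, speeches in speeches_by_speaker.items()
    }


scores = speaker_scores(SPEECH_SCORES, SPEECHES_BY_SPEAKER)
for speaker, score in scores.items():
    print(speaker, score)
```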
32GIN Gene Interaction Network
- Motivation
- Biomedical literature is growing rapidly.
Manually curated databases cover a small portion of
the available information
- Most protein interaction information is uncovered
in biomedical articles
- Approach
- Text mining and network analysis for
- Automatic extraction of molecule interactions
- Automatic article summarization
- Interaction and citation networks
- Inferring gene-disease associations
33Feature Extraction from Dependency Trees
The results demonstrated that KaiC interacts
rhythmically with KaiA, KaiB, and SasA.
- Path1: KaiC - nsubj - interacts - obj - SasA
- Path2: KaiC - nsubj - interacts - obj - SasA - conj_and - KaiA
- Path3: KaiC - nsubj - interacts - obj - SasA - conj_and - KaiB
- Path4: SasA - conj_and - KaiA
- Path5: SasA - conj_and - KaiB
- Path6: KaiA - prep_with - SasA - conj_and - KaiB
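Path features of this kind can be read off a dependency tree with a breadth-first search between two entity nodes. In the sketch below the edges are the ones implied by Path1, Path4, and Path5; the head/dependent direction of each edge, the decision to traverse edges in both directions, and the function name `dependency_path` are assumptions.

```python
from collections import deque

# (head, relation, dependent) edges implied by Path1, Path4 and Path5.
# The head/dependent orientation is an assumption.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "obj", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]


def dependency_path(edges, source, target):
    """Shortest path between two words, alternating words and relation labels."""
    neighbors = {}
    for head, relation, dependent in edges:
        # Edges are traversed in both directions when extracting a path.
        neighbors.setdefault(head, []).append((relation, dependent))
        neighbors.setdefault(dependent, []).append((relation, head))
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [relation, nxt]))
    return None


print(" - ".join(dependency_path(EDGES, "KaiC", "KaiA")))
```

With these edges the KaiC to KaiA path comes out in the same form as Path2.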
34Inferring Genes Related to Prostate Cancer
- Hypothesis
- Genes that are interacting with many genes that
are known to be related to prostate cancer are
likely to be related to prostate cancer
- Approach
- Extract the interaction network of genes (seed
genes) that are known to be related to prostate
cancer automatically from the literature
- Infer new genes related to prostate cancer from
the network topology
- Use eigenvector centrality to rank gene-prostate
cancer associations
- Hypothesis restatement
- Genes central in the constructed network are most
probably related to prostate cancer.
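Ranking genes by eigenvector centrality can be sketched with power iteration on a small interaction network. The edges below are a toy network read off the sample interaction sentences quoted elsewhere in this deck (with p53 written as TP53, which is an assumed normalization); it is not the extracted network from the study, and the function name is hypothetical.

```python
import math

# Toy undirected interaction network (illustrative, not the study's data).
EDGES = [
    ("ATM", "CHEK2"),
    ("CHEK2", "TP53"),
    ("CHEK2", "BRCA1"),
    ("RAD51", "TP53"),
    ("RAD51", "RPA"),
    ("RAD51", "BRCA2"),
]


def eigenvector_centrality(edges, iterations=500, tol=1e-12):
    """Leading eigenvector of the adjacency matrix, found by power iteration."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Multiply by (A + I) instead of A: the eigenvectors are the same,
        # but the iteration also converges on bipartite graphs such as trees.
        new = {
            node: scores[node] + sum(scores[m] for m in neighbors[node])
            for node in neighbors
        }
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        delta = max(abs(new[node] - scores[node]) for node in neighbors)
        scores = new
        if delta < tol:
            break
    return scores


scores = eigenvector_centrality(EDGES)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

A gene scores highly when its neighbors score highly, which matches the hypothesis that genes central in the network are the most likely candidates.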
35Approach
- Corpus
- PMCOA (PubMed Central Open Access) full text
articles
- Articles in PMCOA split into sentences and
sentences tagged with GeniaTagger
- Compile seed list of genes known to be related to
prostate cancer
- 20 genes compiled from OMIM (Online Mendelian
Inheritance in Man) Database
- Extend seed gene list with synonyms from HGNC
(HUGO Gene Nomenclature Committee) database.
- Use the automatic interaction extraction pipeline
to extract the interaction network of the seed
genes and their neighbors (genes interacting with
the seed genes).
36Seed Genes
- 20 genes that are reported in OMIM to be related
to prostate cancer
37Interactions of the seed genes (gene names
normalized to their HGNC symbols)
38Sample Extracted Interaction Sentences
- A study by Jin et al. [20] indicated that the
association of Tax with hsMAD1, a mitotic spindle
checkpoint (MSC) protein, led to the
translocation of both MAD1 and MAD2 to the
cytoplasm.
- PTEN is transcriptionally regulated by
transcription factors such as p53, Egr-1, NFκB
and SMADs, while protein levels and activity are
modulated by phosphorylation, oxidation,
subcellular localisation, phospholipid binding
and protein stability [29].
- Interestingly, one of these, HPC1, is linked to
RNASEL [10,11].
- In response to DNA damage, the cell-cycle
checkpoint kinase CHEK2 can be activated by ATM
kinase to phosphorylate p53 and BRCA1, which are
involved in cell-cycle control, apoptosis, and
DNA repair [1,2].
- The interactions of RAD51 with TP53, RPA and the
BRC repeats of BRCA2 are relatively well
understood (see Discussion).
- The interaction of BRCA2 with HsRad51 is
significantly more different to both RadA and
RecA (Figure 2c).
- Max interactor protein, MXI1 (gene L07648)
competes for MAX thus negatively regulates MYC
function and may play a role in insulin
resistance.
- Mad2 binds to Cdc20, an activator of the
anaphase-promoting complex (APC), to inhibit APC
activity and arrest cells in metaphase in
response to checkpoint activation.
39Inferred Genes (evaluation of top-20 scoring
genes)
- 6 are seed genes; 14 genes are inferred to be
related to prostate cancer
- (Check GeneGo Pathway database; if no evidence
there, check PubMed literature)
- 9 genes: marked as being related to prostate
cancer by GeneGo Pathway Database
- 1 gene: evidence found in PubMed that the gene is
related to prostate cancer
- 4 genes: no evidence found
41Other networks
- Diabetes Type I
- Diabetes Type II
- Bipolar Disorder
42Properties of lexical networks
43Dependency network
44Random network
45Analyzing networks
- Properties of networks
- Clustering coefficient
- Watts/Strogatz cc: triangles/triples
- Power law coefficient α
- Diameter (longest shortest path)
- Average shortest path (ASP)
- Properties of nodes
- Centrality: degree, closeness, betweenness,
eigenvector
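The network-level measures listed above can be computed on a small graph with plain breadth-first search. This is a minimal sketch that assumes an undirected, connected graph; the clustering coefficient follows the Watts/Strogatz definition (mean over nodes of links among a node's neighbors divided by the number of neighbor pairs), and the toy graph and function names are assumptions.

```python
from collections import deque
from itertools import combinations


def to_adjacency(edges):
    """Undirected edge list -> {node: set of neighbors}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj):
    """Watts/Strogatz: average of per-node (triangles / connected triples)."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def asp_and_diameter(adj):
    """Average shortest path and diameter; assumes the graph is connected."""
    lengths = [
        d
        for s in adj
        for t, d in shortest_path_lengths(adj, s).items()
        if t != s
    ]
    return sum(lengths) / len(lengths), max(lengths)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = to_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
cc = clustering_coefficient(adj)
asp, diameter = asp_and_diameter(adj)
print(cc, asp, diameter)
```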
46Types of networks
- Regular networks
- Uniform degree distribution
- Random networks
- Memoryless
- Poisson degree distribution
- Characteristic value
- Low clustering coefficient
- Large ASP
- Small world networks
- High transitivity
- Presence of hubs (memory)
- High clustering coefficient
- (e.g., 1000 times higher than random)
- Small ASP
- Power law degree distribution
- (typical value of the coefficient α between 2 and 3)
47Comparing the dependency graph to a random
(Poisson) graph
48Properties of lexical networks
- Entries in a thesaurus [Motter et al. 2002]
- c/c0 = 260 (n = 30,000)
- Co-occurrence networks [Dorogovtsev and Mendes
2001, Sole and Ferrer i Cancho 2001]
- c/c0 = 1,000 (n = 400,000)
- Mental lexicon [Vitevitch 2005]
- c/c0 = 278 (n = 19,340)
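A ratio of this kind can be sketched as follows, assuming c is the observed clustering coefficient and c0 is the value expected for a random (Erdős–Rényi) graph with the same number of nodes and mean degree, where the expected clustering equals the edge probability <k>/(n - 1). That reading of c0, the function names, and the numbers in the example are assumptions; they are not taken from the cited studies.

```python
def random_graph_clustering(n, mean_degree):
    """Expected clustering of an Erdos-Renyi graph: the edge probability <k>/(n-1)."""
    return mean_degree / (n - 1)


def clustering_ratio(c, n, mean_degree):
    """How many times more clustered the observed network is than a random one."""
    return c / random_graph_clustering(n, mean_degree)


# Hypothetical network: 1,000 nodes, mean degree 10, observed clustering 0.5.
ratio = clustering_ratio(c=0.5, n=1000, mean_degree=10)
print(ratio)
```

Because c0 shrinks as n grows for a fixed mean degree, large sparse networks with even moderate clustering produce ratios in the hundreds.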