Title: The Analysis of Patterns
1The Analysis of Patterns
2The Value of Patterns
- Patterns are everywhere, and people have always
been fascinated by them. - Detecting patterns confers an advantage to an
organism
Temperature and Rainfall in Lake Shasta over 5
years
3Patterns Help Us in Many Ways
e.g., compress, predict, remove errors
4Benefits of Detecting Patterns
5Patterns and Intelligence
- We care so much about pattern finding skills,
that we even use them (partly) to quantify
intelligence -
6The Instinct for Patterns
- We see patterns everywhere
- Even where there are no patterns
7Patterns and Randomness
- We are poorly equipped to deal with randomness
- 3.141592653589793238462643383279502884197169399375
10582097494459230781640628620899862803482534211706
79... - In first million digits we see Erices ZIP code
11 times - Does it mean anything?
8Patterns and Randomness
- ABRAHAM LINCOLN WAS ELECTED TO CONGRESS IN 1846.
- JOHN F. KENNEDY WAS ELECTED TO CONGRESS IN
1946. -
- ABRAHAM LINCOLN WAS ELECTED PRESIDENT IN 1860.
- JOHN F. KENNEDY WAS ELECTED PRESIDENT IN 1960.
-
- THE NAME LINCOLN AND KENNEDY EACH CONTAIN SEVEN
LETTERS. - BOTH WIVES LOST CHILDREN WHILE LIVING IN THE
WHITE HOUSE. -
- BOTH PRESIDENTS WERE SHOT ON FRIDAY.
- BOTH WERE SHOT IN THE HEAD.
- BOTH SUCCESSORS WERE NAMED JOHNSON.
- ANDREW JOHNSON, WHO SUCCEEDED LINCOLN, WAS BORN
IN 1808. - LYNDON JOHNSON, WHO SUCCEEDED KENNEDY, WAS BORN
IN 1908. -
- JOHN WILKES BOOTH, REPORTEDLY ASSASSINATED
LINCOLN. - LEE HARVEY OSWALD, REPORTEDLY ASSASSINATED
KENNEDY. - BOTH ASSASSINS WERE KNOWN BY THREE NAMES.
- BOTH NAMES CONTAINED FIFTEEN LETTERS.
-
- BOOTH AND OSWALD WERE ASSASSINATED BEFORE THEIR
TRIALS.
Coincidences ?
9Visualizing Patterns
- We are naturally equipped to detect CERTAIN types
of patterns, not others
5.9400 8.6100 12.2800 11.6100 20.2800
23.8300 25.8300 27.3900 24.1700 19.0600
11.1700 8.8900 8.3300 7.8900 12.1100
15.3900 19.1100 24.6100 28.1100 25.7800
23.1100 16.8400 13.1700 8.4400 5.8900
10.8300 12.1100 15.7800 18.8300 26.5600
27.5600 25.0000 23.4400 15.5600 10.7200
7.1700 7.8300 11.1700 9.7800
14.9400 20.5000 23.3300 27.8300 29.2200
25.1100 20.6700 12.8900 11.8900 9.1700
9.8300 14.2800 18.5000 19.0000 26.3900
29.6100 26.7200 22.6700 20.3900 13.8900
8.8900
10Finding Patterns
- We are naturally interested in finding relations
in data. - We are naturally ill-equipped in dealing with
randomness. - We have developed sophisticated technology to do
this for us. - In last decade we have made one more step
- As a society we rely on pattern discovery
technology, in many ways
11Computational Pattern Finding
- We want to find relations
- They need to be reliable
- They need to be explored efficiently
- We want to do it automatically
- On MASSIVE amounts of data
12The Analysis of Patterns
- Data driven approach to
- Science
- Business
- Technology
- Modern society relies on our capability to
automatically detect reliable patterns in vast
sets of data
13The Analysis of Patterns
- Science
- The Genome Project
- Surveys of the Universe
- Business
- Amazon automatically exploiting trends and
relations in transactions database - Fraud Detection in Credit Card Companies
- Technology
- Voice recognition
- Handwriting recognition
14A Scientific Gold Rush
- 1 GATCACAGGT CTATCACCCT ATTAACCACT
CACGGGAGCT CTCCATGCAT TTGGTATTTT - 61 CGTCTGGGGG GTGTGCACGC GATAGCATTG
CGAGACGCTG GAGCCGGAGC ACCCTATGTC - 121 GCAGTATCTG TCTTTGATTC CTGCCTCATT
CTATTATTTA TCGCACCTAC GTTCAATATT - 181 ACAGGCGAAC ATACCTACTA AAGTGTGTTA
ATTAATTAAT GCTTGTAGGA CATAATAATA - 241 ACAATTGAAT GTCTGCACAG CCGCTTTCCA
CACAGACATC ATAACAAAAA ATTTCCACCA - 301 AACCCCCCCC TCCCCCCGCT TCTGGCCACA
GCACTTAAAC ACATCTCTGC CAAACCCCAA - 361 AAACAAAGAA CCCTAACACC AGCCTAACCA
GATTTCAAAT TTTATCTTTA GGCGGTATGC - 421 ACTTTTAACA GTCACCCCCC AACTAACACA
TTATTTTCCC CTCCCACTCC CATACTACTA - 481 ATCTCATCAA TACAACCCCC GCCCATCCTA
CCCAGCACAC ACACACCGCT GCTAACCCCA - 541 TACCCCGAAC CAACCAAACC CCAAAGACAC
CCCCCACAGT TTATGTAGCT TACCTCCTCA - 601 AAGCAATACA CTGAAAATGT TTAGACGGGC
TCACATCACC CCATAAACAA ATAGGTTTGG - 661 TCCTAGCCTT TCTATTAGCT CTTAGTAAGA
TTACACATGC AAGCATCCCC GTTCCAGTGA - 721 GTTCACCCTC TAAATCACCA CGATCAAAAG
GGACAAGCAT CAAGCACGCA GCAATGCAGC - 781 TCAAAACGCT TAGCCTAGCC ACACCCCCAC
GGGAAACAGC AGTGATTAAC CTTTAGCAAT - 841 AAACGAAAGT TTAACTAAGC TATACTAACC
CCAGGGTTGG TCAATTTCGT GCCAGCCACC - 901 GCGGTCACAC GATTAACCCA AGTCAATAGA
AGCCGGCGTA AAGAGTGTTT TAGATCACCC - 961 CCTCCCCAAT AAAGCTAAAA CTCACCTGAG
TTGTAAAAAA CTCCAGTTGA CACAAAATAG - 1021 ACTACGAAAG TGGCTTTAAC ATATCTGAAC
ACACAATAGC TAAGACCCAA ACTGGGATTA - 1081 GATACCCCAC TATGCTTAGC CCTAAACCTC
AACAGTTAAA TCAACAAAAC TGCTCGCCAG
15(No Transcript)
16Yeast protein interaction map (Barabasi)
17Another Gold Rush
- The world wide web contains billion of pages,
with text, images, data - Semantic web, XML-based, provides high quality
annotated information - Soon all books ever written will be in digital
form - Are we ready?
182001
2004
2005
19The Analysis of Patterns
- Traditionally, the role of analyzing data belongs
to Statistics. - Or does it ?
- Data analysis performed by physicists,
biologists, engineers each with their own set of
tools. - Even the task of making or validating these tools
is not just part of statistics.
20The Analysis of Patterns
- Signal processing
- Data mining
- Information retrieval
- Pattern recognition ()
- Pattern matching
- Machine Learning
21The Analysis of Patterns
- Pattern Recognition
- Syntactical / Structural
- Statistical
- Visual
- Pattern Discovery vs Pattern Matching
- In sequences
- In graphs
- In images
22The Analysis of Patterns
- Grammatical Inference
- Mining for Association Rules
- Patterns in Vector Data (classical multivariate
statistics neural networks machine learning
etc) - Etc, Etc
23The Analysis of Patterns
- Many communities working almost independently
- Occasionally re-discovering the same things
- A small and fairly stable set of ideas
- Efficient search for patterns in data
- Statistical validation issues
- Pattern visualization
- Often same tools and concepts
24Searching for Patterns
- The search problem can be framed within
Operational Research / Optimization. - (e.g., Integer Programming, Convex Programming,
etc) - Many key ideas from exact optimization have
revolutionized this field in recent years - Where exact solution are theoretically
impractical (and only then!) we can use
approximations, then heuristic approaches. - Again same heuristics appear in many fields
25Statistics
- How do we know that a relation found in a finite
set of data is reliable, or significant, or even
interesting? - Many issues of hypothesis testing
- Classical statistics vs statistical learning
theory
26What Are Patterns?
- This is a rather difficult question to answer. I
hope we will have an answer by the end of this
meeting. - I encourage all speakers and participants to
suggest some definitions.
27Gregory Chaitin "Patterns, Randomness and
Information"
- Information, Complexity, Patterns, Randomness and
Compression. - What are regularities in data? How can they be
defined? And quantified? - Predictability and Compressibility are connected.
- Randomness can be defined in algorithmic ways.
28Gregory Chaitin
- Chaitin will explain what it means that a
sequence has no pattern, and some far reaching
consequences - ideas can be traced back through Hermann Weyl to
Leibniz in 1686, - connect with Godel Turing
- the question of how math compares contrasts
with physics and with biology
29Patterns in Sets of Points (Vectors)
- Probably the most developed part of pattern
analysis - Includes much multivariate stats, much
statistical pattern recognition (e.g., Duda and
Hart) and machine learning
30Tijl De Bie Patterns in Sets of Points
- Patterns in sets of points an overview
- the role of optimization
- examples of patterns
- Dimensionality reduction, classification,
clustering - Emphasize linear patterns (connect to later
kernels talk) - Patterns in sets of points the myriad virtues of
eigenproblems ? - the eigenvalue problem.
- principal component analysis, canonical
correlation analysis, Fisher's discriminant,
partial least squares, and spectral clustering. - More from thiss area will be covered in Kernel
Methods talk
31Patterns in Sequences
- After vectors, probably the most important type
of data - DNA
- Text (web)
- How to find patterns within and among sequences?
- What data structures? What statistical models?
32Suffix Tree and Hidden Markov techniques for
pattern analysis
Esko Ukkonen
- Efficient Pattern Discovery in sequences requires
appropriate data structures - Suffix tree construction.
- linear time array constructions
- using suffix trees for finding motifs with gaps
- finding cis-regulatory motifs by comparative
genomics - Hidden Markov techniques for haplotyping
33Dan Gusfield Trees, Arrays, Networks and
Optimization for Finding Patterns in Biological
Sequences
- The use of suffix trees and integer programming
for finding optimal virus signatures. - A current treatment of suffix-arrays and their
uses. - Algorithms for finding signatures (patterns) of
historical recombination and gene-conversion in
SNP (binary) sequences.
34Raffaele GiancarloPatterns and Compression
- Patterns are not just necessary for prediction,
they are also needed for data compression. - Many relations between PA and Data Compression.
- Raffaele Giancarlo (University of Palermo) - On
Indexing and Compression Two Sides of the Same
Coin
35Conceptual Foundations
- Alberto Apostolico (University of Padova and
Georgia Tech) - "Algorithmic and Combinatorial
Foundations of Pattern Discovery" - Will discuss various aspects of the interplay
between algorithmics and statistics, as well as
the notion itself of pattern.
36Kernel Methods
- An idea if we are so good at finding (linear)
patterns in sets of points - Why not transforming all other problems into a
points problem? - Good idea
- Kernels Methods (from machine learning) can do
this automatically
37Bernhard Schoelkopf Kernel Methods
- Kernel methods combine ideas from statistics and
optimization - State of the art machine learning systems
- Operate on general types of data
- Work by embedding data into a euclidean space
- The structure of the space determined by choosing
a special kernel function - KMs connect various aspects of PA
38Patterns in Sets
- The most classic textbook example of data mining
- You shop at the supermarket, and the
market-basket contents are recorded by the
computer system at check-out - Discover when some items are associated, when
it is possible to predict your next purchase,
etc - (this is what Amazon does automatically)
39Heikki Mannila Finding frequent patterns
- Part I Finding frequent patterns from data
- Discovery of frequent patterns finding positive
conjunctions that are true for a given fraction
of the observations - this basic idea can be instantiated in many ways
- finding frequent sets from 0/1 data (association
mining) - finding frequent episodes in sequences
- finding frequent subgraphs in graphs etc.
- efficient algorithms exist -- the levelwise
approach - theoretical analysis of the algorithms is not
trivial (leads to connections to hypergraph
transversals etc.) - Part II how can the patterns be used?
- sometimes interesting in themselves
- can be used to approximate the joint distribution
- maximum entropy approaches
- combining information from several patterns -
ordering patterns
40When can we trust the patterns we found?
- Statistical issues
- Patterns can be the result of chance
- Multiple testing increases this risk
- Small samples, interest in weak patterns, etc
are other factors - Statistical learning theory and Classical
statistics have developed tools to deal with this - These criteria can also guide the search
algorithms towards more reliable patterns
41John Shawe-Taylor Statistical Aspects of Pattern
Analysis
- We want significant / reliable patterns
- Reliable give us predictive power
- Significant cannot be explained by chance
- Factors affecting pattern reliability
- Pattern magnitude (how strong is the relation)
- Sample size (how large is the support from data?)
- Multiple testing (how many other patterns have
been tested at the same time) - This translates into classical machine learning
and statistics themes
42Nicolo' Cesa-Bianchi On-line linear learning
algorithms
- Machine learning has various ways to model the
pattern discovery process. - An approach completely different from classical
statistics on-line learning. - Prediction with expert advice.
- Learning with linear experts.
- The Perceptron algorithm and its extensions.
- On-line learning with kernels.
- Mistake bounds.
- From mistake bounds to risk bounds
43Grammatical Inference
- Very classical theme in pattern recognition,
based on Chomskys theory of formal languages and
grammars - Given a finite sample from a language, infer the
grammar that generates it (with various
constraints). - A childs game
44Colin de la Higuera "Grammatical Inference, a
Tutorial"
- The lectures will introduce the key ideas of
grammatical inference and concentrate specially
on the algorithmic aspects. - Some algorithms that will be described are
- The "State merging" family Gold, Rpni, Edsm...
- The "Window" languages Local and k-testable
- Learning with queries.
-
- This class of approaches often goes under the
name Syntactical Pattern Recognition
45Patterns in Graphs
- Edwin Hancock (University of York, UK) -
Pattern Analysis with Graphs and Trees' - Spectral representations of graphs,
- Pattern spaces from graph spectra,
- Spectral approaches to matching,
- Heat kernel methods
- Probabilistic and spectral methods for graph
matching and clustering. - Applications in computer vision.
46(No Transcript)