The Analysis of Patterns

About This Presentation

Title:

The Analysis of Patterns

Description:

BOTH WIVES LOST CHILDREN WHILE LIVING IN THE WHITE HOUSE. BOTH PRESIDENTS WERE SHOT ... The world wide web contains billion of pages, with text, images, data... – PowerPoint PPT presentation

Number of Views:43

Avg rating:3.0/5.0

Slides: 47

Provided by: stat290

Category:

more less

Transcript and Presenter's Notes

Title: The Analysis of Patterns

1
The Analysis of Patterns

Nello Cristianini

2
The Value of Patterns

Patterns are everywhere, and people have always
been fascinated by them.
Detecting patterns confers an advantage to an
organism

Temperature and Rainfall in Lake Shasta over 5
years
3
Patterns Help Us in Many Ways
e.g., compress, predict, remove errors
4
Benefits of Detecting Patterns
5
Patterns and Intelligence

We care so much about pattern finding skills,
that we even use them (partly) to quantify
intelligence

6
The Instinct for Patterns

We see patterns everywhere
Even where there are no patterns

7
Patterns and Randomness

We are poorly equipped to deal with randomness
3.141592653589793238462643383279502884197169399375
10582097494459230781640628620899862803482534211706
79...
In first million digits we see Erices ZIP code
11 times
Does it mean anything?

8
Patterns and Randomness

ABRAHAM LINCOLN WAS ELECTED TO CONGRESS IN 1846.
JOHN F. KENNEDY WAS ELECTED TO CONGRESS IN
1946.
ABRAHAM LINCOLN WAS ELECTED PRESIDENT IN 1860.
JOHN F. KENNEDY WAS ELECTED PRESIDENT IN 1960.
THE NAME LINCOLN AND KENNEDY EACH CONTAIN SEVEN
LETTERS.
BOTH WIVES LOST CHILDREN WHILE LIVING IN THE
WHITE HOUSE.
BOTH PRESIDENTS WERE SHOT ON FRIDAY.
BOTH WERE SHOT IN THE HEAD.
BOTH SUCCESSORS WERE NAMED JOHNSON.

ANDREW JOHNSON, WHO SUCCEEDED LINCOLN, WAS BORN
IN 1808.
LYNDON JOHNSON, WHO SUCCEEDED KENNEDY, WAS BORN
IN 1908.
JOHN WILKES BOOTH, REPORTEDLY ASSASSINATED
LINCOLN.
LEE HARVEY OSWALD, REPORTEDLY ASSASSINATED
KENNEDY.
BOTH ASSASSINS WERE KNOWN BY THREE NAMES.
BOTH NAMES CONTAINED FIFTEEN LETTERS.
BOOTH AND OSWALD WERE ASSASSINATED BEFORE THEIR
TRIALS.

Coincidences ?
9
Visualizing Patterns

We are naturally equipped to detect CERTAIN types
of patterns, not others

5.9400 8.6100 12.2800 11.6100 20.2800
23.8300 25.8300 27.3900 24.1700 19.0600
11.1700 8.8900 8.3300 7.8900 12.1100
15.3900 19.1100 24.6100 28.1100 25.7800
23.1100 16.8400 13.1700 8.4400 5.8900
10.8300 12.1100 15.7800 18.8300 26.5600
27.5600 25.0000 23.4400 15.5600 10.7200
7.1700 7.8300 11.1700 9.7800
14.9400 20.5000 23.3300 27.8300 29.2200
25.1100 20.6700 12.8900 11.8900 9.1700
9.8300 14.2800 18.5000 19.0000 26.3900
29.6100 26.7200 22.6700 20.3900 13.8900
8.8900
10
Finding Patterns

We are naturally interested in finding relations
in data.
We are naturally ill-equipped in dealing with
randomness.
We have developed sophisticated technology to do
this for us.
In last decade we have made one more step
As a society we rely on pattern discovery
technology, in many ways

11
Computational Pattern Finding

We want to find relations
They need to be reliable
They need to be explored efficiently
We want to do it automatically
On MASSIVE amounts of data

12
The Analysis of Patterns

Data driven approach to
Science
Business
Technology
Modern society relies on our capability to
automatically detect reliable patterns in vast
sets of data

13
The Analysis of Patterns

Science
The Genome Project
Surveys of the Universe
Business
Amazon automatically exploiting trends and
relations in transactions database
Fraud Detection in Credit Card Companies
Technology
Voice recognition
Handwriting recognition

14
A Scientific Gold Rush

1 GATCACAGGT CTATCACCCT ATTAACCACT
CACGGGAGCT CTCCATGCAT TTGGTATTTT
61 CGTCTGGGGG GTGTGCACGC GATAGCATTG
CGAGACGCTG GAGCCGGAGC ACCCTATGTC
121 GCAGTATCTG TCTTTGATTC CTGCCTCATT
CTATTATTTA TCGCACCTAC GTTCAATATT
181 ACAGGCGAAC ATACCTACTA AAGTGTGTTA
ATTAATTAAT GCTTGTAGGA CATAATAATA
241 ACAATTGAAT GTCTGCACAG CCGCTTTCCA
CACAGACATC ATAACAAAAA ATTTCCACCA
301 AACCCCCCCC TCCCCCCGCT TCTGGCCACA
GCACTTAAAC ACATCTCTGC CAAACCCCAA
361 AAACAAAGAA CCCTAACACC AGCCTAACCA
GATTTCAAAT TTTATCTTTA GGCGGTATGC
421 ACTTTTAACA GTCACCCCCC AACTAACACA
TTATTTTCCC CTCCCACTCC CATACTACTA
481 ATCTCATCAA TACAACCCCC GCCCATCCTA
CCCAGCACAC ACACACCGCT GCTAACCCCA
541 TACCCCGAAC CAACCAAACC CCAAAGACAC
CCCCCACAGT TTATGTAGCT TACCTCCTCA
601 AAGCAATACA CTGAAAATGT TTAGACGGGC
TCACATCACC CCATAAACAA ATAGGTTTGG
661 TCCTAGCCTT TCTATTAGCT CTTAGTAAGA
TTACACATGC AAGCATCCCC GTTCCAGTGA
721 GTTCACCCTC TAAATCACCA CGATCAAAAG
GGACAAGCAT CAAGCACGCA GCAATGCAGC
781 TCAAAACGCT TAGCCTAGCC ACACCCCCAC
GGGAAACAGC AGTGATTAAC CTTTAGCAAT
841 AAACGAAAGT TTAACTAAGC TATACTAACC
CCAGGGTTGG TCAATTTCGT GCCAGCCACC
901 GCGGTCACAC GATTAACCCA AGTCAATAGA
AGCCGGCGTA AAGAGTGTTT TAGATCACCC
961 CCTCCCCAAT AAAGCTAAAA CTCACCTGAG
TTGTAAAAAA CTCCAGTTGA CACAAAATAG
1021 ACTACGAAAG TGGCTTTAAC ATATCTGAAC
ACACAATAGC TAAGACCCAA ACTGGGATTA
1081 GATACCCCAC TATGCTTAGC CCTAAACCTC
AACAGTTAAA TCAACAAAAC TGCTCGCCAG

15
(No Transcript)
16
Yeast protein interaction map (Barabasi)
17
Another Gold Rush

The world wide web contains billion of pages,
with text, images, data
Semantic web, XML-based, provides high quality
annotated information
Soon all books ever written will be in digital
form
Are we ready?

18
2001
2004
2005
19
The Analysis of Patterns

Traditionally, the role of analyzing data belongs
to Statistics.
Or does it ?
Data analysis performed by physicists,
biologists, engineers each with their own set of
tools.
Even the task of making or validating these tools
is not just part of statistics.

20
The Analysis of Patterns

Signal processing
Data mining
Information retrieval
Pattern recognition ()
Pattern matching
Machine Learning

21
The Analysis of Patterns

Pattern Recognition
Syntactical / Structural
Statistical
Visual
Pattern Discovery vs Pattern Matching
In sequences
In graphs
In images

22
The Analysis of Patterns

Grammatical Inference
Mining for Association Rules
Patterns in Vector Data (classical multivariate
statistics neural networks machine learning
etc)
Etc, Etc

23
The Analysis of Patterns

Many communities working almost independently
Occasionally re-discovering the same things
A small and fairly stable set of ideas
Efficient search for patterns in data
Statistical validation issues
Pattern visualization
Often same tools and concepts

24
Searching for Patterns

The search problem can be framed within
Operational Research / Optimization.
(e.g., Integer Programming, Convex Programming,
etc)
Many key ideas from exact optimization have
revolutionized this field in recent years
Where exact solution are theoretically
impractical (and only then!) we can use
approximations, then heuristic approaches.
Again same heuristics appear in many fields

25
Statistics

How do we know that a relation found in a finite
set of data is reliable, or significant, or even
interesting?
Many issues of hypothesis testing
Classical statistics vs statistical learning
theory

26
What Are Patterns?

This is a rather difficult question to answer. I
hope we will have an answer by the end of this
meeting.
I encourage all speakers and participants to
suggest some definitions.

27
Gregory Chaitin "Patterns, Randomness and
Information"

Information, Complexity, Patterns, Randomness and
Compression.
What are regularities in data? How can they be
defined? And quantified?
Predictability and Compressibility are connected.
Randomness can be defined in algorithmic ways.

28
Gregory Chaitin

Chaitin will explain what it means that a
sequence has no pattern, and some far reaching
consequences
ideas can be traced back through Hermann Weyl to
Leibniz in 1686,
connect with Godel Turing
the question of how math compares contrasts
with physics and with biology

29
Patterns in Sets of Points (Vectors)

Probably the most developed part of pattern
analysis
Includes much multivariate stats, much
statistical pattern recognition (e.g., Duda and
Hart) and machine learning

30
Tijl De Bie Patterns in Sets of Points

Patterns in sets of points an overview
the role of optimization
examples of patterns
Dimensionality reduction, classification,
clustering
Emphasize linear patterns (connect to later
kernels talk)
Patterns in sets of points the myriad virtues of
eigenproblems ?
the eigenvalue problem.
principal component analysis, canonical
correlation analysis, Fisher's discriminant,
partial least squares, and spectral clustering.
More from thiss area will be covered in Kernel
Methods talk

31
Patterns in Sequences

After vectors, probably the most important type
of data
DNA
Text (web)
How to find patterns within and among sequences?
What data structures? What statistical models?

32
Suffix Tree and Hidden Markov techniques for
pattern analysis
Esko Ukkonen

Efficient Pattern Discovery in sequences requires
appropriate data structures
Suffix tree construction.
linear time array constructions
using suffix trees for finding motifs with gaps
finding cis-regulatory motifs by comparative
genomics
Hidden Markov techniques for haplotyping

33
Dan Gusfield Trees, Arrays, Networks and
Optimization for Finding Patterns in Biological
Sequences

The use of suffix trees and integer programming
for finding optimal virus signatures.
A current treatment of suffix-arrays and their
uses.
Algorithms for finding signatures (patterns) of
historical recombination and gene-conversion in
SNP (binary) sequences.

34
Raffaele GiancarloPatterns and Compression

Patterns are not just necessary for prediction,
they are also needed for data compression.
Many relations between PA and Data Compression.
Raffaele Giancarlo (University of Palermo) - On
Indexing and Compression Two Sides of the Same
Coin

35
Conceptual Foundations

Alberto Apostolico (University of Padova and
Georgia Tech) - "Algorithmic and Combinatorial
Foundations of Pattern Discovery"
Will discuss various aspects of the interplay
between algorithmics and statistics, as well as
the notion itself of pattern.

36
Kernel Methods

An idea if we are so good at finding (linear)
patterns in sets of points
Why not transforming all other problems into a
points problem?
Good idea
Kernels Methods (from machine learning) can do
this automatically

37
Bernhard Schoelkopf Kernel Methods

Kernel methods combine ideas from statistics and
optimization
State of the art machine learning systems
Operate on general types of data
Work by embedding data into a euclidean space
The structure of the space determined by choosing
a special kernel function
KMs connect various aspects of PA

38
Patterns in Sets

The most classic textbook example of data mining
You shop at the supermarket, and the
market-basket contents are recorded by the
computer system at check-out
Discover when some items are associated, when
it is possible to predict your next purchase,
etc
(this is what Amazon does automatically)

39
Heikki Mannila Finding frequent patterns

Part I Finding frequent patterns from data
Discovery of frequent patterns finding positive
conjunctions that are true for a given fraction
of the observations
this basic idea can be instantiated in many ways
finding frequent sets from 0/1 data (association
mining)
finding frequent episodes in sequences
finding frequent subgraphs in graphs etc.
efficient algorithms exist -- the levelwise
approach
theoretical analysis of the algorithms is not
trivial (leads to connections to hypergraph
transversals etc.)
Part II how can the patterns be used?
sometimes interesting in themselves
can be used to approximate the joint distribution
maximum entropy approaches
combining information from several patterns -
ordering patterns

40
When can we trust the patterns we found?

Statistical issues
Patterns can be the result of chance
Multiple testing increases this risk
Small samples, interest in weak patterns, etc
are other factors
Statistical learning theory and Classical
statistics have developed tools to deal with this
These criteria can also guide the search
algorithms towards more reliable patterns

41
John Shawe-Taylor Statistical Aspects of Pattern
Analysis

We want significant / reliable patterns
Reliable give us predictive power
Significant cannot be explained by chance
Factors affecting pattern reliability
Pattern magnitude (how strong is the relation)
Sample size (how large is the support from data?)
Multiple testing (how many other patterns have
been tested at the same time)
This translates into classical machine learning
and statistics themes

42
Nicolo' Cesa-Bianchi On-line linear learning
algorithms

Machine learning has various ways to model the
pattern discovery process.
An approach completely different from classical
statistics on-line learning.
Prediction with expert advice.
Learning with linear experts.
The Perceptron algorithm and its extensions.
On-line learning with kernels.
Mistake bounds.
From mistake bounds to risk bounds

43
Grammatical Inference

Very classical theme in pattern recognition,
based on Chomskys theory of formal languages and
grammars
Given a finite sample from a language, infer the
grammar that generates it (with various
constraints).
A childs game

44
Colin de la Higuera "Grammatical Inference, a
Tutorial"

The lectures will introduce the key ideas of
grammatical inference and concentrate specially
on the algorithmic aspects.
Some algorithms that will be described are
The "State merging" family Gold, Rpni, Edsm...
The "Window" languages Local and k-testable
Learning with queries.
This class of approaches often goes under the
name Syntactical Pattern Recognition

45
Patterns in Graphs

Edwin Hancock (University of York, UK) -
Pattern Analysis with Graphs and Trees'
Spectral representations of graphs,
Pattern spaces from graph spectra,
Spectral approaches to matching,
Heat kernel methods
Probabilistic and spectral methods for graph
matching and clustering.
Applications in computer vision.

46
(No Transcript)

Write a Comment

User Comments (0)

About PowerShow.com

The Analysis of Patterns - PowerPoint PPT Presentation

The Analysis of Patterns

BOTH WIVES LOST CHILDREN WHILE LIVING IN THE WHITE HOUSE. BOTH PRESIDENTS WERE SHOT ... The world wide web contains billion of pages, with text, images, data... – PowerPoint PPT presentation