Title: When Computing Meets Statistics
1When Computing Meets Statistics
- Trần Thế Truyền
- Department of Computing
- Curtin University of Technology
- t.tran2_at_curtin.edu.au
- http://truyen.vietlabs.com
2Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
3Data as a starting point
- The ultimate goal is to make sense of data
- It is a capital mistake to theorize before one
has data. (Sir Arthur Conan Doyle)
4How big is the data?
- Google currently indexes 10^12 Web pages
- At NIPS'09 they have shown how to estimate logistic regression for 10^8 documents
- The MIT dataset has 10^8 images
- 10^6 sentence pairs for machine translation
- The Netflix data has 10^8 entries
- Dimensions for language are typically 10^7, for bioinformatics up to 10^12
5Mathematics for data processing
- Information theory: entropy, mutual information, divergence, data compression, differential entropy, channel capacity
- Statistics: probabilistic graphs, exponential family, kernels, Bayesian, non-parametric, random processes, high-dimensional
- Abstract spaces: projection, linear algebra, Hilbert spaces, metric spaces, topology, differential geometry
- Optimization: duality, sparsity, sub-modularity, linear programming, integer programming, non-convexity, combinatorics
6Why does computing need statistics?
- The world is uncertain
- Making sense of data, e.g. sufficient statistics, clustering
- Convergence proofs
- Performance bound
- Consistency
- Bayes optimal
- Confidence estimate
- Most probable explanation
- Symmetry breaking
- Randomness as a solution to NP-hard problems
7What computing has to offer
- Massive data and computing power
- Computational algorithms
- Less memory
- Fast processing
- Dynamic programming
- Parallel processing
- Clusters
- GPUs
8Conferences and Journals
- Most important and current results in computing are published in conferences, some followed by journal versions
- Relevant conferences
- AAAI/IJCAI
- COLT/ICML/NIPS/KDD
- UAI/AISTATS
- CVPR/ICCV
- ACL/COLING
- Relevant journals
- Machine Learning
- Journal of Machine Learning Research
- Neural Computation
- Pattern Analysis and Machine Intelligence
- Pattern Recognition
9Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
10Probabilistic graphical models
- Directed models
- Markov chains
- Hidden Markov models
- Kalman filters
- Bayesian networks
- Dynamic Bayesian networks
- Probabilistic neural networks
- Undirected models
- Ising models
- Markov random fields
- Boltzmann machines
- Factor graphs
- Relational Markov networks
- Markov logic networks
- Non-i.i.d. (not independently and identically distributed) data
- Variable dependencies
- Graph theory + probability theory
11Representing variable dependencies using graphs
(Diagram: causes, effects and hidden factors as nodes in a graph)
12Directed graphs: decomposition
- Suitable to encode causality
- Domain knowledge can be expressed in conditional probability tables
- Graph must be acyclic
13DAG examples: Markov chains
(a) Markov chain
(b) Hidden Markov model
(c) Hidden semi-Markov model
(d) Factorial hidden Markov model
14DAG examples: Abstract hidden Markov models (Bui et al., 2002)
15Some more DAG examples (some borrowed from Bishop's slides)
- Hidden Markov models
- Kalman filters
- Factor analysis
- Probabilistic principal component analysis
- Independent component analysis
- Probabilistic canonical correlation analysis
- Mixtures of Gaussians
- Probabilistic expert systems
- Sigmoid belief networks
- Hierarchical mixtures of experts
- Probabilistic Latent Semantic Indexing
- Latent Dirichlet Allocation
- Chinese restaurant processes
- Indian buffet processes
16Undirected graphs: factorisation
- Suitable to encode correlation
- More flexible than directed graphs
- But lose the notion of causality
17Undirected graph examples: Markov random fields
(Image from LabelMe)
18Undirected graph examples: Restricted Boltzmann machines
- Useful to discover hidden aspects
- Can theoretically represent any binary distribution, given enough hidden units
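To make the "hidden aspects" idea concrete, here is a minimal sketch of one blocked-Gibbs step in a binary RBM; the layer sizes and weights below are invented for illustration, not a trained model:

```python
import math, random

# Toy binary RBM: 4 visible units, 2 hidden units, invented small weights.
random.seed(0)
NV, NH = 4, 2
W = [[random.gauss(0, 0.1) for _ in range(NH)] for _ in range(NV)]
b_v = [0.0] * NV
b_h = [0.0] * NH

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_h(v):
    # P(h_j = 1 | v) = sigmoid(b_h[j] + sum_i v_i W[i][j])
    return [1 if random.random() < sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(NV)))
            else 0 for j in range(NH)]

def sample_v(h):
    # P(v_i = 1 | h) = sigmoid(b_v[i] + sum_j h_j W[i][j])
    return [1 if random.random() < sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(NH)))
            else 0 for i in range(NV)]

v0 = [1, 0, 1, 0]
h0 = sample_h(v0)   # up pass: infer hidden aspects from data
v1 = sample_v(h0)   # down pass: reconstruct the visible layer
print(h0, v1)
```

Because hidden units are conditionally independent given the visibles (and vice versa), each layer can be sampled in one block, which is what makes training procedures such as contrastive divergence cheap.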
19Conditional independence
20Bad news
- Inference in general graphs is intractable
- Some reduce to combinatorial optimization
- Model selection is really hard!
- There are exponentially many graphs of a given size
- Each of them is likely to be intractable
21Approximate inference
- Belief propagation
- Variational methods
- MCMC
22Belief propagation
- Introduced by J. Pearl (1980s)
- A major breakthrough
- Exact and guaranteed to converge on trees
- Often a good approximation on graphs with loops
- Related to statistical physics (Bethe and Kikuchi free energies)
- Related to turbo decoding
- Local operation, global effect
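The "local operation, global effect" point can be sketched on the smallest non-trivial case: a three-node binary chain with an invented pairwise potential. On a tree like this, the sum-product belief matches the brute-force marginal exactly:

```python
from itertools import product

# Sum-product on a binary chain x1 - x2 - x3; the pairwise potential
# (favouring equal neighbours) and uniform unaries are invented for the example.
STATES = (0, 1)

def psi(a, b):
    return 2.0 if a == b else 1.0

# Messages from the two leaves into x2.
m12 = {b: sum(psi(a, b) for a in STATES) for b in STATES}
m32 = {b: sum(psi(c, b) for c in STATES) for b in STATES}

# Belief at x2 = product of incoming messages, normalised.
belief = {b: m12[b] * m32[b] for b in STATES}
Z = sum(belief.values())
belief = {b: v / Z for b, v in belief.items()}

# Brute-force marginal for comparison: exact on a tree.
joint = {(a, b, c): psi(a, b) * psi(b, c) for a, b, c in product(STATES, repeat=3)}
Zj = sum(joint.values())
marg = {b: sum(v for (a, bb, c), v in joint.items() if bb == b) / Zj for b in STATES}

print(belief, marg)  # the two marginals agree on this tree
```

Each message only looks at one neighbouring edge, yet the combined belief recovers the global marginal, which is the essence of the algorithm.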
23Variational methods
24MCMC
- Metropolis-Hasting
- Gibbs/importance/slice sampling
- Rao-Blackwellisation
- Reversible jump MCMC
- Contrastive divergence
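As a minimal illustration of the first item, here is a random-walk Metropolis-Hastings sketch; the unnormalised standard-normal target and the uniform proposal are chosen purely for the example:

```python
import math, random

# Random-walk Metropolis-Hastings targeting an unnormalised N(0, 1)
# density exp(-x^2 / 2); only density ratios are needed, so the
# normalising constant never appears.
random.seed(0)

def target(x):
    return math.exp(-0.5 * x * x)

x, samples = 0.0, []
for _ in range(20000):
    prop = x + random.uniform(-1.0, 1.0)          # symmetric proposal
    if random.random() < min(1.0, target(prop) / target(x)):
        x = prop                                   # accept, else stay put
    samples.append(x)

burned = samples[2000:]                            # discard burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(mean, var)                                   # should be near 0 and 1
```

The same accept/reject skeleton underlies the fancier variants on the slide; Gibbs sampling, for instance, is the special case where proposals come from exact conditionals and are always accepted.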
25?
26Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
27Statistical machine learning
- (Mitchell, 2006)
- How can we build computer systems that automatically improve with experience, and
- What are the fundamental laws that govern all learning processes?
- More concerned about prediction performance on unseen data
- Need consistency guarantees
- Need error bounds
28Statistical machine learning
- Inverse problems
- Supervised learning: regression/classification
- Unsupervised learning: density estimation/clustering
- Semi-supervised learning
- Manifold learning
- Transfer learning: domain adaptation
- Multi-task learning
- Gaussian processes
- Non-parametric Bayesian
32Classifier example: naïve Bayes
(Diagram: a class node taking values Sport, Social, Health generating word nodes)
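The slide's topic classifier can be sketched in a few lines; the mini-corpus and classes below are invented stand-ins for the Sport/Social/Health setting:

```python
import math
from collections import Counter

# Toy naive Bayes text classifier: topics as classes, word counts as
# features, with add-one (Laplace) smoothing. Training data is invented.
train = [
    ("sport",  "goal match team win"),
    ("sport",  "player goal score"),
    ("health", "doctor diet exercise"),
    ("health", "diet sleep doctor"),
]

classes = {c for c, _ in train}
prior = {c: sum(1 for cc, _ in train if cc == c) / len(train) for c in classes}
counts = {c: Counter() for c in classes}
for c, doc in train:
    counts[c].update(doc.split())
vocab = {w for _, d in train for w in d.split()}

def log_post(doc, c):
    # log P(c) + sum over words of log P(w | c), smoothed
    total = sum(counts[c].values())
    lp = math.log(prior[c])
    for w in doc.split():
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(doc):
    return max(classes, key=lambda c: log_post(doc, c))

print(classify("goal team"), classify("diet doctor"))
```

Working in log-space avoids underflow, and the conditional independence of words given the class is exactly the naive Bayes graph the slide depicts.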
33Classifier example: MaxEnt
- Maximum entropy principle: out of all distributions which are consistent with the data, select the one that has the maximum entropy (Jaynes, 1957)
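Solving that constrained entropy maximisation yields a log-linear (exponential-family) model; in the usual notation, with features f_i and Lagrange multipliers lambda_i:

```latex
% Among all p consistent with the observed feature expectations, the
% maximum-entropy solution has log-linear form:
\begin{aligned}
p_\lambda(y \mid x) &= \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \\
Z_\lambda(x) &= \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big).
\end{aligned}
```

This is why "MaxEnt classifier" and "multinomial logistic regression" name the same model.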
34Gaussian and Laplace priors
- Parameter estimation is an ill-posed problem
- Needs regularisation theory
- Gaussian prior
- Laplace prior
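The effect of the two priors can be sketched in one dimension: for linear regression through the origin, a Gaussian prior on the weight gives L2 (ridge) shrinkage with a closed form, while a Laplace prior gives the L1 soft-thresholding solution. The data below is invented for illustration:

```python
# MAP estimation for y ~ w * x under different priors (toy data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

w_mle = sxy / sxx                    # flat prior: maximum likelihood
lam = 5.0
w_ridge = sxy / (sxx + lam)          # Gaussian prior: L2 shrinkage

# Laplace prior: soft-thresholding, the 1-D lasso solution
w_lasso = max(0.0, abs(sxy) - lam) / sxx * (1 if sxy > 0 else -1)

print(w_mle, w_ridge, w_lasso)       # both priors shrink toward zero
```

The ridge estimate shrinks smoothly, while the lasso estimate is driven exactly to zero once the penalty outweighs the evidence, which is why Laplace priors induce sparsity.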
35Transfer learning
- Moving from one domain to another
- There may be distribution shifts
- The goal is to use as little data as possible to estimate the second task
36Multitask learning
- Multiple predictions based on a single dataset
- E.g., for each image, we want to do
- Object recognition
- Scene classification
- Human and car detection
37Open problems
- Many learning algorithms are not consistent
- Many performance bounds are not tight
- The dimensions are high, so feature selection is important
- Most data is unlabelled
- Structured data is pervasive, but most statistical methods assume i.i.d. data
38Dealing with unlabelled data
39?
40Content
- Data as a starting point
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
41Applications
- Computational linguistics
- Accent restoration
- Language modelling
- Statistical machine translation
- Speech recognition
- Multimedia and computer vision
- Information filtering
- Named Entity Recognition
- Collaborative filtering
- Web/Text classification
- Ranking in search engines
42Accent restoration
http://vietlabs.com/vietizer.html
Chiến thắng Real trong trận siêu kinh điển cuối tuần qua cũng như phong độ ấn tượng mùa này khiến HLV trẻ của Barca nhận được những lời tán tụng từ người nhà cũng như đông đảo các cổ động viên.
Chien thang Real trong tran sieu kinh dien cuoi tuan qua cung nhu phong do an tuong mua nay khien HLV tre cua Barca nhan duoc nhung loi tan tung tu nguoi nha cung nhu dong dao cac co dong vien.
(top: text with accents restored; bottom: the same text in accentless terms. English gloss: the win over Real in last weekend's clásico, and this season's impressive form, earned Barca's young coach praise from his own club and from large numbers of fans.)
43Decoding using Nth-order hidden Markov models
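The first-order (bigram) special case can be sketched as a Viterbi decoder: hidden states are accented words, observations are their accentless forms. The states and transition probabilities below are invented toy values, not the actual trained model:

```python
import math

# Toy Viterbi for accent restoration: which accented sequence best
# explains the accentless observation "ma ma"?
states = ["ma", "má", "mà"]
# P(next | prev): an invented bigram language model over accented words.
trans = {
    "ma": {"ma": 0.2, "má": 0.4, "mà": 0.4},
    "má": {"ma": 0.5, "má": 0.3, "mà": 0.2},
    "mà": {"ma": 0.5, "má": 0.2, "mà": 0.3},
}
start = {"ma": 0.3, "má": 0.3, "mà": 0.4}

def emit(state, obs):
    # Each accented word deterministically strips to "ma".
    return 1.0 if obs == "ma" else 0.0

obs_seq = ["ma", "ma"]
# Dynamic programming over log-scores, keeping back-pointers.
score = {s: math.log(start[s] * emit(s, obs_seq[0])) for s in states}
back = []
for o in obs_seq[1:]:
    prev, score, bp = score, {}, {}
    for s in states:
        best = max(prev, key=lambda p: prev[p] + math.log(trans[p][s]))
        score[s] = prev[best] + math.log(trans[best][s] * emit(s, o))
        bp[s] = best
    back.append(bp)

# Trace back the highest-scoring accented sequence.
last = max(score, key=score.get)
path = [last]
for bp in reversed(back):
    path.append(bp[path[-1]])
path.reverse()
print(path)
```

An Nth-order model works the same way but with states that remember the previous N-1 words, so the state space (and the cost of each dynamic-programming step) grows accordingly.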
44Accent restoration (cont.)
- Online news corpus
- 426,000 sentences for training
- 28,000 sentences for testing
- 1,400 accentless terms
- compared to 10,000 accented terms
- 7,000 unique unigrams
- 842,000 unique bigrams
- 3,137,000 unique trigrams
45Language modelling
- This is the key to all linguistics problems
- The most useful models are N-grams
- Equivalent to (N-1)th-order Markov chains
- Usually N = 3
- Google offers N = 5 with multiple billions of entries
- Smoothing is the key to dealing with data sparseness
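A bigram (N = 2) model with add-one smoothing can be sketched in a few lines; the corpus is an invented toy, but it shows why smoothing matters, since unseen pairs would otherwise get probability zero:

```python
from collections import Counter

# Bigram language model with add-one (Laplace) smoothing on a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def p(w2, w1):
    # P(w2 | w1), smoothed over the vocabulary size
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

print(p("cat", "the"), p("dog", "the"))  # seen pair vs unseen pair
```

Add-one smoothing is the crudest option; practical systems use Kneser-Ney or similar schemes, but the structure of the estimator is the same.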
46Statistical machine translation
- Estimate P(Vietnamese unit | English unit)
- Usually, unit = sentence
- Current training size: 10^6 sentence pairs
- Statistical methods are the state of the art
- Followed by major labs
- Google translation services
47SMT: source-channel approach
- P(V) is the language model of Vietnamese
- P(E|V) is the translation model from Vietnamese to English
- Subcomponents
- Translation table from Vietnamese phrases to English phrases
- Alignment: position distortion, syntax, idioms
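Putting the two components together, decoding an English sentence E into Vietnamese V is the usual noisy-channel argmax:

```latex
% Source-channel decomposition matching the components above:
V^{*} = \arg\max_{V} P(V \mid E)
      = \arg\max_{V} \underbrace{P(E \mid V)}_{\text{translation model}}
        \; \underbrace{P(V)}_{\text{language model}}
```

By Bayes' rule P(V | E) is proportional to P(E | V) P(V), and the denominator P(E) is constant in V, so it drops out of the maximisation.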
48SMT: maximum conditional entropy approach
- f is called a feature
- f may be the estimate from the source-channel approach
49Speech recognition
- Estimate P(words | sound signals)
- Usually the source-channel approach
- P(words) is the language model
- P(sound | words) is called the acoustic model
- Hidden Markov models are the state of the art
(Diagram: HMM with a start state, hidden states, an end state and acoustic features)
50Multimedia
- Mix of audio, video, text, user interaction, hyperlinks, context
- Social media
- Diffusion, random walks, Brownian motion
- Cross-modality
- Probabilistic canonical correlation analysis
51Computer vision
- Scene labelling
- Face recognition
- Object recognition
- Video surveillance
52Information filtering
53Named entity recognition
54Boltzmann machines for collaborative filtering
- Boltzmann machines are one of the main methods in the $1M Netflix competition
- This is essentially the matrix completion problem
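The matrix-completion view can be sketched with plain matrix factorisation (a simpler stand-in for the RBM the slide mentions): observed ratings are fit as dot products of latent user and item vectors, and the learned factors then fill in the missing entries. The ratings below are invented:

```python
import random

# Rank-2 matrix factorisation by SGD on a toy 3-user x 3-item rating matrix
# with missing entries; this illustrates matrix completion, not an RBM.
random.seed(0)
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 4.0, (2, 2): 5.0}
K, lr, reg = 2, 0.05, 0.01
U = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]  # users
V = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]  # items

def pred(u, i):
    return sum(U[u][k] * V[i][k] for k in range(K))

for _ in range(2000):                       # SGD over observed entries only
    for (u, i), r in ratings.items():
        err = r - pred(u, i)
        for k in range(K):
            U[u][k] += lr * (err * V[i][k] - reg * U[u][k])
            V[i][k] += lr * (err * U[u][k] - reg * V[i][k])

rmse = (sum((r - pred(u, i)) ** 2 for (u, i), r in ratings.items())
        / len(ratings)) ** 0.5
print(rmse, pred(1, 1))   # low training error; pred(1, 1) fills a missing entry
```

The regularisation term plays the same role as the priors discussed earlier: without it, a model this flexible would overfit the handful of observed ratings.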
55Ranking in search engines
56Ranking in search engines (cont.)
- This is an object ordering problem
- We want to estimate the probability of a permutation
- There are exponentially many permutations
- Permutations are query-dependent
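One standard way to put a tractable distribution over permutations (an illustrative choice here, not necessarily the model the slide has in mind) is the Plackett-Luce model: items are drawn one by one, without replacement, in proportion to positive scores. The scores below are invented:

```python
from itertools import permutations

# Plackett-Luce distribution over orderings of three items with toy scores.
scores = {"a": 3.0, "b": 2.0, "c": 1.0}

def p_perm(order):
    # Probability of drawing the items in this exact order.
    prob, remaining = 1.0, list(order)
    for item in order:
        prob *= scores[item] / sum(scores[r] for r in remaining)
        remaining.remove(item)
    return prob

probs = {pi: p_perm(pi) for pi in permutations(scores)}
total = sum(probs.values())
print(max(probs, key=probs.get), total)  # most probable ordering, total mass
```

Although there are n! permutations, each one's probability is a product of n factors, so scoring a given ranking stays cheap even when enumerating all permutations is impossible.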
57Content
- Data as a starting point
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
58Collaboration
- IMPCA: Institute for Multi-sensor Processing and Content Analysis
- http://impca.cs.curtin.edu.au
- Led by Prof. Svetha Venkatesh
- Some Vietnamese researchers
- Phùng Quốc Định, http://computing.edu.au/phung/
- Probabilistic graphical models
- Topic modelling
- Non-parametric Bayesian
- Multimedia
- Phạm Đức Sơn, http://computing.edu.au/dsp/
- Statistical learning theory
- Compressed sensing
- Robust signal processing
- Bayesian methods
- Trần Thế Truyền, http://truyen.vietlabs.com
- Probabilistic graphical models
- Learning structured output spaces
- Deep learning
- Permutation modelling
59Scholarships
- Master by research
- 2 years full, may upgrade to PhD
- PhD
- 3 years full
- Strong background in maths and good programming skills
- Postdoc
- 1-2 year contract
- Research fellows
- 3-5 year contract
- Visiting scholars
- 3-12 months
60Discussion