Title: When Computing Meets Statistics
1When Computing Meets Statistics
- Trần Thế Truyền
- Department of Computing
- Curtin University of Technology
- t.tran2_at_curtin.edu.au
- http://truyen.vietlabs.com
2Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
3Data as a starting point
- The ultimate goal is to make sense of data
- It is a capital mistake to theorize before one
has data. (Sir Arthur Conan Doyle)
4How big is the data?
- Google currently indexes 10^12 Web pages
- At NIPS'09 they have shown how to estimate logistic regression for 10^8 documents
- The MIT dataset has 10^8 images
- 10^6 sentence pairs for machine translation
- The Netflix data has 10^8 entries
- Dimensions for language are typically 10^7, for bioinformatics up to 10^12
5Mathematics for data processing
- Information theory: entropy, mutual information, divergence, data compression, differential entropy, channel capacity
- Statistics: probabilistic graphs, exponential family, kernels, Bayesian, non-parametric, random processes, high-dimensional
- Abstract spaces: projection, linear algebra, Hilbert spaces, metric spaces, topology, differential geometry
- Optimization: duality, sparsity, sub-modularity, linear programming, integer programming, non-convexity, combinatorics
6Why does computing need statistics?
- The world is uncertain
- Making sense of data, e.g. sufficient statistics, clustering
- Convergence proofs
- Performance bound
- Consistency
- Bayes optimal
- Confidence estimate
- Most probable explanation
- Symmetry breaking
- Randomness as a solution to NP-hard problems
7What computing has to offer
- Massive data and computing power
- Computational algorithms
- Less memory
- Fast processing
- Dynamic programming
- Parallel processing
- Clusters
- GPUs
8Conferences and Journals
- Most important and current results in computing are published in conferences, some followed by journal versions
- Relevant conferences
- AAAI/IJCAI
- COLT/ICML/NIPS/KDD
- UAI/AISTATS
- CVPR/ICCV
- ACL/COLING
- Relevant journals
- Machine Learning
- Journal of Machine Learning Research
- Neural Computation
- Pattern Analysis and Machine Intelligence
- Pattern Recognition
9Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
10Probabilistic graphical models
- Directed models
- Markov chains
- Hidden Markov models
- Kalman filters
- Bayesian networks
- Dynamic Bayesian networks
- Probabilistic neural networks
- Undirected models
- Ising models
- Markov random fields
- Boltzmann machines
- Factor graphs
- Relational Markov networks
- Markov logic networks
- Non-i.i.d. (not independently and identically distributed) data
- Variable dependencies
- Graph theory + probability theory
11Representing variable dependencies using graphs
(Diagram: causes, effects and hidden factors as nodes in a graph)
12Directed graphs: decomposition
- Suitable to encode causality
- Domain knowledge can be expressed in conditional probability tables
- Graph must be acyclic
13DAG examples: Markov chains
(a) Markov chain
(b) Hidden Markov model
(c) Hidden semi-Markov model
(d) Factorial hidden Markov model
14DAG examples: Abstract hidden Markov models (Bui et al., 2002)
15Some more DAG examples (some borrowed from Bishop's slides)
- Hidden Markov models
- Kalman filters
- Factor analysis
- Probabilistic principal component analysis
- Independent component analysis
- Probabilistic canonical correlation analysis
- Mixtures of Gaussians
- Probabilistic expert systems
- Sigmoid belief networks
- Hierarchical mixtures of experts
- Probabilistic Latent Semantic Indexing
- Latent Dirichlet Allocation
- Chinese restaurant processes
- Indian buffet processes
16Undirected graphs: factorisation
- Suitable to encode correlation
- More flexible than directed graphs
- But lose the notion of causality
17Undirected graph examples: Markov random fields
(Image from LabelMe)
18Undirected graph examples: Restricted Boltzmann machines
- Useful to discover hidden aspects
- Can theoretically represent any binary distribution, given enough hidden units
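To make the "hidden aspects" idea concrete, here is a minimal sketch of one blocked-Gibbs step in a binary RBM; the layer sizes and weights below are invented for illustration, not a trained model:

```python
import math, random

# Toy binary RBM: 4 visible units, 2 hidden units, invented small weights.
random.seed(0)
NV, NH = 4, 2
W = [[random.gauss(0, 0.1) for _ in range(NH)] for _ in range(NV)]
b_v = [0.0] * NV
b_h = [0.0] * NH

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_h(v):
    # P(h_j = 1 | v) = sigmoid(b_h[j] + sum_i v_i W[i][j])
    return [1 if random.random() < sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(NV)))
            else 0 for j in range(NH)]

def sample_v(h):
    # P(v_i = 1 | h) = sigmoid(b_v[i] + sum_j h_j W[i][j])
    return [1 if random.random() < sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(NH)))
            else 0 for i in range(NV)]

v0 = [1, 0, 1, 0]
h0 = sample_h(v0)   # up pass: infer hidden aspects from data
v1 = sample_v(h0)   # down pass: reconstruct the visible layer
print(h0, v1)
```

Because hidden units are conditionally independent given the visibles (and vice versa), each layer can be sampled in one block, which is what makes training procedures such as contrastive divergence cheap.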
19Conditional independence
20Bad news
- Inference in general graphs is intractable
- Some reduce to combinatorial optimization
- Model selection is really hard!
- There are exponentially many graphs of a given size
- Each of them is likely to be intractable
21Approximate inference
- Belief propagation
- Variational methods
- MCMC
22Belief propagation
- Introduced by J. Pearl (1980s)
- A major breakthrough
- Exact and guaranteed to converge on trees
- Often a good approximation on graphs with loops
- Related to statistical physics (Bethe and Kikuchi free energies)
- Related to turbo decoding
- Local operation, global effect
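The "local operation, global effect" point can be sketched on the smallest non-trivial case: a three-node binary chain with an invented pairwise potential. On a tree like this, the sum-product belief matches the brute-force marginal exactly:

```python
from itertools import product

# Sum-product on a binary chain x1 - x2 - x3; the pairwise potential
# (favouring equal neighbours) and uniform unaries are invented for the example.
STATES = (0, 1)

def psi(a, b):
    return 2.0 if a == b else 1.0

# Messages from the two leaves into x2.
m12 = {b: sum(psi(a, b) for a in STATES) for b in STATES}
m32 = {b: sum(psi(c, b) for c in STATES) for b in STATES}

# Belief at x2 = product of incoming messages, normalised.
belief = {b: m12[b] * m32[b] for b in STATES}
Z = sum(belief.values())
belief = {b: v / Z for b, v in belief.items()}

# Brute-force marginal for comparison: exact on a tree.
joint = {(a, b, c): psi(a, b) * psi(b, c) for a, b, c in product(STATES, repeat=3)}
Zj = sum(joint.values())
marg = {b: sum(v for (a, bb, c), v in joint.items() if bb == b) / Zj for b in STATES}

print(belief, marg)  # the two marginals agree on this tree
```

Each message only looks at one neighbouring edge, yet the combined belief recovers the global marginal, which is the essence of the algorithm.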
23Variational methods
24MCMC
- Metropolis-Hasting
- Gibbs/importance/slice sampling
- Rao-Blackwellisation
- Reversible jump MCMC
- Contrastive divergence
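As a minimal illustration of the first item, here is a random-walk Metropolis-Hastings sketch; the unnormalised standard-normal target and the uniform proposal are chosen purely for the example:

```python
import math, random

# Random-walk Metropolis-Hastings targeting an unnormalised N(0, 1)
# density exp(-x^2 / 2); only density ratios are needed, so the
# normalising constant never appears.
random.seed(0)

def target(x):
    return math.exp(-0.5 * x * x)

x, samples = 0.0, []
for _ in range(20000):
    prop = x + random.uniform(-1.0, 1.0)          # symmetric proposal
    if random.random() < min(1.0, target(prop) / target(x)):
        x = prop                                   # accept, else stay put
    samples.append(x)

burned = samples[2000:]                            # discard burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(mean, var)                                   # should be near 0 and 1
```

The same accept/reject skeleton underlies the fancier variants on the slide; Gibbs sampling, for instance, is the special case where proposals come from exact conditionals and are always accepted.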
25?
26Content
- Introduction
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
27Statistical machine learning
- (Mitchell, 2006)
- How can we build computer systems that automatically improve with experience, and
- What are the fundamental laws that govern all learning processes?
- More concerned about prediction performance on unseen data
- Need consistency guarantees
- Need error bounds
28Statistical machine learning
- Inverse problems
- Supervised learning: regression/classification
- Unsupervised learning: density estimation/clustering
- Semi-supervised learning
- Manifold learning
- Transfer learning: domain adaptation
- Multi-task learning
- Gaussian processes
- Non-parametric Bayesian
32Classifier example: naïve Bayes
(Diagram: a class node taking values Sport, Social, Health generating word nodes)
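The slide's topic classifier can be sketched in a few lines; the mini-corpus and classes below are invented stand-ins for the Sport/Social/Health setting:

```python
import math
from collections import Counter

# Toy naive Bayes text classifier: topics as classes, word counts as
# features, with add-one (Laplace) smoothing. Training data is invented.
train = [
    ("sport",  "goal match team win"),
    ("sport",  "player goal score"),
    ("health", "doctor diet exercise"),
    ("health", "diet sleep doctor"),
]

classes = {c for c, _ in train}
prior = {c: sum(1 for cc, _ in train if cc == c) / len(train) for c in classes}
counts = {c: Counter() for c in classes}
for c, doc in train:
    counts[c].update(doc.split())
vocab = {w for _, d in train for w in d.split()}

def log_post(doc, c):
    # log P(c) + sum over words of log P(w | c), smoothed
    total = sum(counts[c].values())
    lp = math.log(prior[c])
    for w in doc.split():
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(doc):
    return max(classes, key=lambda c: log_post(doc, c))

print(classify("goal team"), classify("diet doctor"))
```

Working in log-space avoids underflow, and the conditional independence of words given the class is exactly the naive Bayes graph the slide depicts.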
33Classifier example: MaxEnt
- Maximum entropy principle: out of all distributions which are consistent with the data, select the one that has the maximum entropy (Jaynes, 1957)
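Solving that constrained entropy maximisation yields a log-linear (exponential-family) model; in the usual notation, with features f_i and Lagrange multipliers lambda_i:

```latex
% Among all p consistent with the observed feature expectations, the
% maximum-entropy solution has log-linear form:
\begin{aligned}
p_\lambda(y \mid x) &= \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \\
Z_\lambda(x) &= \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big).
\end{aligned}
```

This is why "MaxEnt classifier" and "multinomial logistic regression" name the same model.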
34Gaussian and Laplace priors
- Parameter estimation is an ill-posed problem
- Needs regularisation theory
- Gaussian prior
- Laplace prior
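The effect of the two priors can be sketched in one dimension: for linear regression through the origin, a Gaussian prior on the weight gives L2 (ridge) shrinkage with a closed form, while a Laplace prior gives the L1 soft-thresholding solution. The data below is invented for illustration:

```python
# MAP estimation for y ~ w * x under different priors (toy data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

w_mle = sxy / sxx                    # flat prior: maximum likelihood
lam = 5.0
w_ridge = sxy / (sxx + lam)          # Gaussian prior: L2 shrinkage

# Laplace prior: soft-thresholding, the 1-D lasso solution
w_lasso = max(0.0, abs(sxy) - lam) / sxx * (1 if sxy > 0 else -1)

print(w_mle, w_ridge, w_lasso)       # both priors shrink toward zero
```

The ridge estimate shrinks smoothly, while the lasso estimate is driven exactly to zero once the penalty outweighs the evidence, which is why Laplace priors induce sparsity.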
35Transfer learning
- Moving from one domain to another
- There may be distribution shifts
- The goal is to use as little data as possible to estimate the second task
36Multitask learning
- Multiple predictions based on a single dataset
- E.g., for each image, we want to do
- Object recognition
- Scene classification
- Human and car detection
37Open problems
- Many learning algorithms are not consistent
- Many performance bounds are not tight
- The dimensions are high, so feature selection is important
- Most data is unlabelled
- Structured data is pervasive, but most statistical methods assume i.i.d. data
38Dealing with unlabelled data
39?
40Content
- Data as a starting point
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
41Applications
- Computational linguistics
- Accent restoration
- Language modelling
- Statistical machine translation
- Speech recognition
- Multimedia and computer vision
- Information filtering
- Named Entity Recognition
- Collaborative filtering
- Web/Text classification
- Ranking in search engines
42Accent restoration
http://vietlabs.com/vietizer.html
Chiến thắng Real trong trận siêu kinh điển cuối tuần qua cũng như phong độ ấn tượng mùa này khiến HLV trẻ của Barca nhận được những lời tán tụng từ người nhà cũng như đông đảo các cổ động viên.
Chien thang Real trong tran sieu kinh dien cuoi tuan qua cung nhu phong do an tuong mua nay khien HLV tre cua Barca nhan duoc nhung loi tan tung tu nguoi nha cung nhu dong dao cac co dong vien.
(top: text with accents restored; bottom: the same text in accentless terms. English gloss: the win over Real in last weekend's clásico, and this season's impressive form, earned Barca's young coach praise from his own club and from large numbers of fans.)
43Decoding using Nth-order hidden Markov models
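The first-order (bigram) special case can be sketched as a Viterbi decoder: hidden states are accented words, observations are their accentless forms. The states and transition probabilities below are invented toy values, not the actual trained model:

```python
import math

# Toy Viterbi for accent restoration: which accented sequence best
# explains the accentless observation "ma ma"?
states = ["ma", "má", "mà"]
# P(next | prev): an invented bigram language model over accented words.
trans = {
    "ma": {"ma": 0.2, "má": 0.4, "mà": 0.4},
    "má": {"ma": 0.5, "má": 0.3, "mà": 0.2},
    "mà": {"ma": 0.5, "má": 0.2, "mà": 0.3},
}
start = {"ma": 0.3, "má": 0.3, "mà": 0.4}

def emit(state, obs):
    # Each accented word deterministically strips to "ma".
    return 1.0 if obs == "ma" else 0.0

obs_seq = ["ma", "ma"]
# Dynamic programming over log-scores, keeping back-pointers.
score = {s: math.log(start[s] * emit(s, obs_seq[0])) for s in states}
back = []
for o in obs_seq[1:]:
    prev, score, bp = score, {}, {}
    for s in states:
        best = max(prev, key=lambda p: prev[p] + math.log(trans[p][s]))
        score[s] = prev[best] + math.log(trans[best][s] * emit(s, o))
        bp[s] = best
    back.append(bp)

# Trace back the highest-scoring accented sequence.
last = max(score, key=score.get)
path = [last]
for bp in reversed(back):
    path.append(bp[path[-1]])
path.reverse()
print(path)
```

An Nth-order model works the same way but with states that remember the previous N-1 words, so the state space (and the cost of each dynamic-programming step) grows accordingly.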
44Accent restoration (cont.)
- Online news corpus
- 426,000 sentences for training
- 28,000 sentences for testing
- 1,400 accentless terms
- compared to 10,000 accented terms
- 7,000 unique unigrams
- 842,000 unique bigrams
- 3,137,000 unique trigrams
45Language modelling
- This is the key to all linguistics problems
- The most useful models are N-grams
- Equivalent to (N-1)th-order Markov chains
- Usually N = 3
- Google offers N = 5 with multiple billions of entries
- Smoothing is the key to dealing with data sparseness
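A bigram (N = 2) model with add-one smoothing can be sketched in a few lines; the corpus is an invented toy, but it shows why smoothing matters, since unseen pairs would otherwise get probability zero:

```python
from collections import Counter

# Bigram language model with add-one (Laplace) smoothing on a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def p(w2, w1):
    # P(w2 | w1), smoothed over the vocabulary size
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

print(p("cat", "the"), p("dog", "the"))  # seen pair vs unseen pair
```

Add-one smoothing is the crudest option; practical systems use Kneser-Ney or similar schemes, but the structure of the estimator is the same.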
46Statistical machine translation
- Estimate P(Vietnamese unit | English unit)
- Usually, unit = sentence
- Current training size: 10^6 sentence pairs
- Statistical methods are the state of the art
- Followed by major labs
- Google translation services
47SMT: source-channel approach
- P(V) is the language model of Vietnamese
- P(E|V) is the translation model from Vietnamese to English
- Subcomponents
- Translation table from Vietnamese phrases to English phrases
- Alignment: position distortion, syntax, idioms
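Putting the two components together, decoding an English sentence E into Vietnamese V is the usual noisy-channel argmax:

```latex
% Source-channel decomposition matching the components above:
V^{*} = \arg\max_{V} P(V \mid E)
      = \arg\max_{V} \underbrace{P(E \mid V)}_{\text{translation model}}
        \; \underbrace{P(V)}_{\text{language model}}
```

By Bayes' rule P(V | E) is proportional to P(E | V) P(V), and the denominator P(E) is constant in V, so it drops out of the maximisation.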
48SMT: maximum conditional entropy approach
- f is called a feature
- f may be the estimate from the source-channel approach
49Speech recognition
- Estimate P(words | sound signals)
- Usually the source-channel approach
- P(words) is the language model
- P(sound | words) is called the acoustic model
- Hidden Markov models are the state of the art
(Diagram: HMM with a start state, hidden states, an end state and acoustic features)
50Multimedia
- Mix of audio, video, text, user interaction, hyperlinks, context
- Social media
- Diffusion, random walks, Brownian motion
- Cross-modality
- Probabilistic canonical correlation analysis
51Computer vision
- Scene labelling
- Face recognition
- Object recognition
- Video surveillance
52Information filtering
53Named entity recognition
54Boltzmann machines for collaborative filtering
- Boltzmann machines are one of the main methods in the $1M Netflix competition
- This is essentially the matrix completion problem
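The matrix-completion view can be sketched with plain matrix factorisation (a simpler stand-in for the RBM the slide mentions): observed ratings are fit as dot products of latent user and item vectors, and the learned factors then fill in the missing entries. The ratings below are invented:

```python
import random

# Rank-2 matrix factorisation by SGD on a toy 3-user x 3-item rating matrix
# with missing entries; this illustrates matrix completion, not an RBM.
random.seed(0)
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 4.0, (2, 2): 5.0}
K, lr, reg = 2, 0.05, 0.01
U = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]  # users
V = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]  # items

def pred(u, i):
    return sum(U[u][k] * V[i][k] for k in range(K))

for _ in range(2000):                       # SGD over observed entries only
    for (u, i), r in ratings.items():
        err = r - pred(u, i)
        for k in range(K):
            U[u][k] += lr * (err * V[i][k] - reg * U[u][k])
            V[i][k] += lr * (err * U[u][k] - reg * V[i][k])

rmse = (sum((r - pred(u, i)) ** 2 for (u, i), r in ratings.items())
        / len(ratings)) ** 0.5
print(rmse, pred(1, 1))   # low training error; pred(1, 1) fills a missing entry
```

The regularisation term plays the same role as the priors discussed earlier: without it, a model this flexible would overfit the handful of observed ratings.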
55Ranking in search engines
56Ranking in search engines (cont.)
- This is an object ordering problem
- We want to estimate the probability of a permutation
- There are exponentially many permutations
- Permutations are query-dependent
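One standard way to put a tractable distribution over permutations (an illustrative choice here, not necessarily the model the slide has in mind) is the Plackett-Luce model: items are drawn one by one, without replacement, in proportion to positive scores. The scores below are invented:

```python
from itertools import permutations

# Plackett-Luce distribution over orderings of three items with toy scores.
scores = {"a": 3.0, "b": 2.0, "c": 1.0}

def p_perm(order):
    # Probability of drawing the items in this exact order.
    prob, remaining = 1.0, list(order)
    for item in order:
        prob *= scores[item] / sum(scores[r] for r in remaining)
        remaining.remove(item)
    return prob

probs = {pi: p_perm(pi) for pi in permutations(scores)}
total = sum(probs.values())
print(max(probs, key=probs.get), total)  # most probable ordering, total mass
```

Although there are n! permutations, each one's probability is a product of n factors, so scoring a given ranking stays cheap even when enumerating all permutations is impossible.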
57Content
- Data as a starting point
- Probabilistic graphical models
- Statistical machine learning
- Applications
- Collaboration
58Collaboration
- IMPCA: Institute for Multi-sensor Processing and Content Analysis
- http://impca.cs.curtin.edu.au
- Led by Prof. Svetha Venkatesh
- Some Vietnamese researchers
- Phùng Quốc Định, http://computing.edu.au/phung/
- Probabilistic graphical models
- Topic modelling
- Non-parametric Bayesian
- Multimedia
- Phạm Đức Sơn, http://computing.edu.au/dsp/
- Statistical learning theory
- Compressed sensing
- Robust signal processing
- Bayesian methods
- Trần Thế Truyền, http://truyen.vietlabs.com
- Probabilistic graphical models
- Learning structured output spaces
- Deep learning
- Permutation modelling
59Scholarships
- Master by research
- 2 years full, may upgrade to PhD
- PhD
- 3 years full
- Strong background in maths and good programming skills
- Postdoc
- 1-2 year contract
- Research fellows
- 3-5 year contract
- Visiting scholars
- 3-12 months
60Discussion