When Computing Meets Statistics - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: When Computing Meets Statistics


1
When Computing Meets Statistics
  • Trần Thế Truyền
  • Department of Computing
  • Curtin University of Technology
  • t.tran2@curtin.edu.au
  • http://truyen.vietlabs.com

2
Content
  • Introduction
  • Probabilistic graphical models
  • Statistical machine learning
  • Applications
  • Collaboration

3
Data as a starting point
  • The ultimate goal is to make sense of data
  • It is a capital mistake to theorize before one
    has data. (Sir Arthur Conan Doyle)

4
How big is the data?
  • Google currently indexes 10^12 Web pages
  • At NIPS'09 it was shown how to estimate
    logistic regression for 10^8 documents
  • The MIT dataset has 10^8 images
  • 10^6 sentence pairs for machine translation
  • The Netflix data has 10^8 entries
  • Dimensions for language are typically 10^7, for
    bioinformatics up to 10^12

5
Mathematics for data processing
  • Information theory: entropy, mutual information,
    divergence, data compression, differential
    entropy, channel capacity
  • Statistics: probabilistic graphs, exponential
    family, kernels, Bayesian, non-parametric,
    random processes, high dimensional
  • Abstract spaces: projection, linear algebra,
    Hilbert spaces, metric spaces, topology,
    differential geometry
  • Optimization: duality, sparsity, sub-modularity,
    linear programming, integer programming,
    non-convexity, combinatorics
6
Why does computing need statistics?
  • The world is uncertain
  • Making sense of data, e.g. sufficient statistics,
    clustering
  • Convergence proof
  • Performance bound
  • Consistency
  • Bayes optimal
  • Confidence estimate
  • Most probable explanation
  • Symmetry breaking
  • Randomness as a solution to NP-hard problems

7
What computing has to offer
  • Massive data and computing power
  • Computational algorithms
  • Less memory
  • Fast processing
  • Dynamic programming
  • Parallel processing
  • Clusters
  • GPUs

8
Conferences and Journals
  • The most important and most recent results in
    computing are published at conferences, some
    followed by journal versions
  • Relevant conferences
  • AAAI/IJCAI
  • COLT/ICML/NIPS/KDD
  • UAI/AISTATS
  • CVPR/ICCV
  • ACL/COLING
  • Relevant journals
  • Machine Learning
  • Journal of Machine Learning Research
  • Neural Computation
  • Pattern Analysis and Machine Intelligence
  • Pattern Recognition

9
Content
  • Introduction
  • Probabilistic graphical models
  • Statistical machine learning
  • Applications
  • Collaboration

10
Probabilistic graphical models
  • Directed models
  • Markov chains
  • Hidden Markov models
  • Kalman filters
  • Bayesian networks
  • Dynamic Bayesian networks
  • Probabilistic neural networks
  • Undirected models
  • Ising models
  • Markov random fields
  • Boltzmann machines
  • Factor graphs
  • Relational Markov networks
  • Markov logic networks
  • Non-i.i.d. (not independently and identically
    distributed) data
  • Variable dependencies

Graph theory + Probability theory
11
Representing variable dependencies using graphs
(Diagrams: example graphs with causes, effects, and
hidden factors as nodes)
12
Directed graphs: decomposition
  • Suitable to encode causality
  • Domain knowledge can be expressed in conditional
    probability tables
  • Graph must be acyclic
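A worked form of the decomposition this slide refers to (the standard Bayesian network factorisation; the notation is mine):
  P(x_1, ..., x_n) = \prod_i P(x_i | pa(x_i))
where pa(x_i) denotes the parents of node x_i in the DAG, and each factor is one conditional probability table.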

13
DAG examples: Markov chains
(Diagrams: (a) Markov chain, (b) hidden Markov model,
(c) hidden semi-Markov model, (d) factorial hidden
Markov model)
14
DAG examples: Abstract hidden Markov models (Bui et al., 2002)
15
Some more DAG examples (some borrowed from
Bishop's slides)
  • Hidden Markov models
  • Kalman filters
  • Factor analysis
  • Probabilistic principal component analysis
  • Independent component analysis
  • Probabilistic canonical correlation analysis
  • Mixtures of Gaussians
  • Probabilistic expert systems
  • Sigmoid belief networks
  • Hierarchical mixtures of experts
  • Probabilistic Latent Semantic Indexing
  • Latent Dirichlet Allocation
  • Chinese restaurant processes
  • Indian buffet processes

16
Undirected graphs: factorisation
  • Suitable to encode correlation
  • More flexible than directed graphs
  • But lose the notion of causality
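The corresponding undirected factorisation (standard Markov random field form; the notation is mine):
  P(x) = (1/Z) \prod_C \psi_C(x_C),   with   Z = \sum_x \prod_C \psi_C(x_C)
where C ranges over the cliques of the graph and Z is the partition function that makes exact inference hard in general.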

17
Undirected graph examples: Markov random fields
(Image from LabelMe)
18
Undirected graph examples: Restricted Boltzmann machines
  • Useful to discover hidden aspects
  • Can theoretically represent all binary
    distributions
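A minimal sketch of the standard binary RBM energy (the symbols a, b, W are my notation for the biases and weights):
  E(v, h) = -a^T v - b^T h - v^T W h,   P(v, h) = (1/Z) exp(-E(v, h))
The bipartite structure makes P(h | v) and P(v | h) factorise over units, which is what makes these models practical.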

19
Conditional independence
  • Separator
  • Markov blanket

20
Bad news
  • Inference in general graphs is intractable
  • Some cases reduce to combinatorial optimization
  • Model selection is really hard!
  • There are exponentially many graphs of a given size
  • Each of them is likely to be intractable

21
Approximate inference
  • Belief propagation
  • Variational methods
  • MCMC

22
Belief propagation
  • Introduced by J. Pearl (1980s)
  • A major breakthrough
  • Guaranteed to converge for trees
  • Good approximation for non-trees
  • Related to statistical physics (Bethe and Kikuchi
    free energies)
  • Related to Turbo decoding
  • Local operation, global effect
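For reference, the sum-product message update on a pairwise graph (standard form; the notation is mine):
  m_{i->j}(x_j) = \sum_{x_i} \psi_i(x_i) \psi_{ij}(x_i, x_j) \prod_{k in N(i)\j} m_{k->i}(x_i)
Beliefs are obtained by multiplying incoming messages with the local potential and normalising; on trees this is exact.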

23
Variational methods
24
MCMC
  • Metropolis-Hastings
  • Gibbs/importance/slice sampling
  • Rao-Blackwellisation
  • Reversible jump MCMC
  • Contrastive divergence
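The Metropolis-Hastings acceptance probability underlying most of these samplers (standard form):
  alpha(x -> x') = min(1, [p(x') q(x | x')] / [p(x) q(x' | x)])
where q is the proposal distribution; Gibbs sampling is the special case in which proposals are full conditionals and alpha = 1.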

25
?
26
Content
  • Introduction
  • Probabilistic graphical models
  • Statistical machine learning
  • Applications
  • Collaboration

27
Statistical machine learning
  • (Mitchell, 2006)
  • How can we build computer systems that
    automatically improve with experience, and
  • What are the fundamental laws that govern all
    learning processes?
  • More concerned with prediction performance on
    unseen data
  • Need consistency guarantee
  • Need error bounds

28
Statistical machine learning
  • Inverse problems
  • Supervised learning: regression/classification
  • Unsupervised learning: density estimation/clustering
  • Semi-supervised learning
  • Manifold learning
  • Transfer learning / domain adaptation
  • Multi-task learning
  • Gaussian processes
  • Non-parametric Bayesian

29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
Classifier example: naïve Bayes
(Diagram: a class variable taking values Sport, Social,
or Health, generating the observed words)
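A worked form of the classifier sketched on this slide (standard naïve Bayes; the class set {Sport, Social, Health} comes from the slide):
  P(c | w_1, ..., w_n) \propto P(c) \prod_i P(w_i | c)
i.e. the words are assumed conditionally independent given the class, and the predicted class is the one maximising this product.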
33
Classifier example: MaxEnt
  • Maximum entropy principle: out of all
    distributions that are consistent with the data,
    select the one that has the maximum entropy
    (Jaynes, 1957)
  • The solution (sketched below)
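The slide's formula is missing from the transcript; the standard maximum-entropy (log-linear) solution it most likely refers to is:
  P(y | x) = (1/Z(x)) exp( \sum_k \lambda_k f_k(x, y) ),   Z(x) = \sum_{y'} exp( \sum_k \lambda_k f_k(x, y') )
where the f_k are feature functions and the weights \lambda_k are chosen so that model feature expectations match their empirical values.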

34
Gaussian and Laplace priors
  • Parameter estimation is an ill-posed problem
  • Needs regularisation theory
  • Gaussian prior
  • Laplace prior
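A hedged sketch of how the two priors act as regularisers in MAP estimation (a standard correspondence; \lambda is my notation for the regularisation strength):
  Gaussian prior:  w* = argmax_w log P(D | w) - \lambda ||w||_2^2   (ridge / weight decay)
  Laplace prior:   w* = argmax_w log P(D | w) - \lambda ||w||_1     (lasso, encourages sparsity)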

35
Transfer learning
  • Moving from one domain to another
  • There may be distribution shifts
  • The goal is to use as little data as possible to
    estimate the second task

36
Multitask learning
  • Multiple predictions based on a single dataset
  • E.g., for each image, we want to do
  • Object recognition
  • Scene classification
  • Human and car detection

37
Open problems
  • Many learning algorithms are not consistent
  • Many performance bounds are not tight
  • The dimensionality is high, so feature selection is
    important
  • Most data is unlabelled
  • Structured data is pervasive, but most
    statistical methods assume i.i.d. data

38
Dealing with unlabelled data
39
?
40
Content
  • Data as a starting point
  • Probabilistic graphical models
  • Statistical machine learning
  • Applications
  • Collaboration

41
Applications
  • Computational linguistics
  • Accent restoration
  • Language modelling
  • Statistical machine translation
  • Speech recognition
  • Multimedia / computer vision
  • Information filtering
  • Named Entity Recognition
  • Collaborative filtering
  • Web/Text classification
  • Ranking in search engines

42
Accent restoration
http://vietlabs.com/vietizer.html
Accentless input:
Chien thang Real trong tran sieu kinh dien cuoi
tuan qua cung nhu phong do an tuong mua nay khien
HLV tre cua Barca nhan duoc nhung loi tan tung tu
nguoi nha cung nhu dong dao cac co dong vien.
Restored output (with accents):
Chiến thắng Real trong trận siêu kinh điển cuối
tuần qua cũng như phong độ ấn tượng mùa này khiến
HLV trẻ của Barca nhận được những lời tán tụng từ
người nhà cũng như đông đảo các cổ động viên.
(Rough English gloss: the win over Real in last
weekend's "clásico" and this season's impressive form
earned Barca's young coach praise from his own camp as
well as from a large number of supporters.)
43
Decoding using N-th order hidden Markov models
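Since decoding is the algorithmic core here, a minimal first-order Viterbi sketch in Python may help. This is a hedged illustration, not the talk's implementation: the talk uses higher-order HMMs, which amount to expanding each state into an (N-1)-gram of accented words. All identifiers are illustrative.

  # Hedged sketch: first-order Viterbi decoding in log space.
  # For accent restoration, `observations` would be the accentless tokens and
  # `states` the candidate accented forms.
  def viterbi(observations, states, log_start, log_trans, log_emit):
      """Most probable state sequence; arguments are dicts of log-probabilities."""
      # Best log-score of any path ending in each state after the first observation
      delta = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
      backpointers = []
      for obs in observations[1:]:
          prev, delta, back = delta, {}, {}
          for t in states:
              # Best predecessor state for landing in t at this step
              best = max(states, key=lambda s: prev[s] + log_trans[s][t])
              delta[t] = prev[best] + log_trans[best][t] + log_emit[t][obs]
              back[t] = best
          backpointers.append(back)
      # Trace the best path backwards from the best final state
      path = [max(states, key=lambda s: delta[s])]
      for back in reversed(backpointers):
          path.append(back[path[-1]])
      return list(reversed(path))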
44
Accent restoration (cont.)
  • Online news corpus
  • 426,000 sentences for training
  • 28,000 sentences for testing
  • 1,400 accentless terms
  • compared to 10,000 accented terms
  • 7,000 unique unigrams
  • 842,000 unique bigrams
  • 3,137,000 unique trigrams

45
Language modelling
  • This is the key component of many linguistics
    problems
  • The most useful models are N-grams
  • Equivalent to (N-1)th-order Markov chains
  • Usually N = 3
  • Google offers N = 5 with several billion entries
  • Smoothing is the key to dealing with data sparseness
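A worked form of the N-gram assumption stated above, for the usual N = 3 case:
  P(w_1, ..., w_T) \approx \prod_t P(w_t | w_{t-2}, w_{t-1})
Smoothing (e.g. interpolating trigram, bigram and unigram estimates) keeps these probabilities non-zero for unseen word combinations.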

46
Statistical machine translation
  • Estimate P(Vietnamese unit | English unit)
  • Usually, unit = sentence
  • Current training size: 10^6 sentence pairs
  • Statistical methods are the state of the art
  • Followed by major labs
  • Google translation services

47
SMT: the source-channel approach
  • P(V) is the language model of Vietnamese
  • P(E|V) is the translation model from Vietnamese to
    English
  • Subcomponents
  • Translation table from Vietnamese phrases to
    English phrases
  • Alignment: position distortion, syntax, idioms
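The decoding rule behind the source-channel decomposition described above (standard noisy-channel form; E is the English input, V the Vietnamese output):
  V* = argmax_V P(V | E) = argmax_V P(E | V) P(V)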

48
SMT: maximum conditional entropy approach
  • f is called a feature
  • f may be the estimate from the source-channel
    approach
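A sketch of the log-linear (maximum conditional entropy) form this slide refers to (standard in SMT; the notation is mine):
  P(V | E) = (1/Z(E)) exp( \sum_k \lambda_k f_k(V, E) )
One of the features f_k can be the log-probability from the source-channel model, as the slide notes.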

49
Speech recognition
  • Estimate P(words | sound signals)
  • Usually the source-channel approach
  • P(words) is the language model
  • P(sound | words) is called the acoustic model
  • Hidden Markov models are the state of the art
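The corresponding decoding rule (standard; w = words, s = sound signals):
  w* = argmax_w P(w | s) = argmax_w P(s | w) P(w)
with P(s | w) the acoustic model and P(w) the language model, exactly as listed above.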

(Diagram: an HMM with a start state, hidden states, an
end state, and acoustic features emitted from the
hidden states)
50
Multimedia
  • Mix of audio, video, text, user interaction,
    hyperlinks, context
  • Social media
  • Diffusion, random walks, Brownian motion
  • Cross-modality
  • Probabilistic canonical correlation analysis

51
Computer vision
  • Scene labelling
  • Face recognition
  • Object recognition
  • Video surveillance

52
Information filtering
53
Named entity recognition
54
Boltzmann machines for collaborative filtering
  • Boltzmann machines were one of the main methods in
    the $1M Netflix Prize competition
  • This is essentially the matrix completion problem
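For context, a common way to state the matrix completion problem the slide mentions (a sketch only, not the Netflix-specific formulation):
  min_{U,V} \sum_{(i,j) in \Omega} (R_{ij} - U_i^T V_j)^2 + \lambda (||U||_F^2 + ||V||_F^2)
where \Omega is the set of observed (user, movie) ratings; RBM-based approaches model the same observed entries probabilistically instead.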

55
Ranking in search engines
56
Ranking in search engines (cont.)
  • This is an object ordering problem
  • We want to estimate the probability of a
    permutation
  • There are exponentially many permutations
  • Permutations are query-dependent
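One tractable way to put a probability on permutations, mentioned here only as an illustrative sketch (the Plackett-Luce model; the s_i are per-document scores in my notation):
  P(\pi) = \prod_{i=1}^{n} s_{\pi(i)} / \sum_{j=i}^{n} s_{\pi(j)}
which avoids enumerating all n! permutations explicitly.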

57
Content
  • Data as a starting point
  • Probabilistic graphical models
  • Statistical machine learning
  • Applications
  • Collaboration

58
Collaboration
  • IMPCA: Institute for Multi-sensor Processing and
    Content Analysis
  • http://impca.cs.curtin.edu.au
  • Led by Prof. Svetha Venkatesh
  • Some Vietnamese guys
  • Phùng Quốc Định, http://computing.edu.au/phung/
  • Probabilistic graphical models
  • Topic modelling
  • Non-parametric Bayesian
  • Multimedia
  • Phạm Đức Sơn, http://computing.edu.au/dsp/
  • Statistical learning theory
  • Compressed sensing
  • Robust signal processing
  • Bayesian methods
  • Trần Thế Truyền, http://truyen.vietlabs.com
  • Probabilistic graphical models
  • Learning structured output spaces
  • Deep learning
  • Permutation modelling

59
Scholarships
  • Master by research
  • 2 years full-time, may upgrade to PhD
  • PhD
  • 3 years full-time
  • Strong background in maths and good programming
    skills
  • Postdoc
  • 1-2 year contract
  • Research fellows
  • 3-5 year contract
  • Visiting scholars
  • 3-12 months

60
Discussion
  • Collaboration mode