Title: Analyzing Quantitative EEG Using Machine Learning and Information Theoretic Techniques
1 Analyzing Quantitative EEG Using Machine Learning and Information Theoretic Techniques
Pedja Neskovic
Booz Allen Hamilton, Arlington, VA
Institute for Brain and Neural Systems, Brown University, Providence, RI
2 Objective and Methods
- Objective
  - Classify EEG signals into different tasks (learning, memory, etc.) within and across subjects: decoding brain signals
- Methods
  - To extract informative characteristics, use information theory; to learn/detect patterns in the EEG, use machine learning
3 The n-back task
- The n-back task requires subjects to decide whether the currently presented stimulus matches the one presented n trials previously
4 Challenges
- Low signal-to-noise ratio: the noise level is typically about 20, and the signal of interest about 5
- The signal is contaminated by artifacts (eye movements/blinks, muscle contractions)
- Processing large amounts of data in real time
- Designing a method for extracting the most discriminative features
5 Representing EEG signals
- Representation: what characteristics/features to use?
- Most widely used features: power spectrum (PS), coherence analysis, linear correlation (LC), autoregression (AR) analysis, PCA, etc.
  - Main drawbacks: they capture only linear dependences and 2nd-order statistics
- Introduce new features: Entropy (H) and Mutual Information (MI)
  - In contrast to the power spectrum, which is based on second-order statistics, H encompasses higher-order statistics
  - In contrast to linear approaches (e.g., correlation, coherence, partial coherence analysis, linear Granger causality, and the Directed Transfer Function (DTF)), MI captures both linear and nonlinear dependences
6 Definitions
- Entropy measures the uncertainty of a discrete random variable X
  - If H = 0, every measurement occurs with probability 1 or 0
- Mutual information is the reduction in the uncertainty of X due to the knowledge of Y (see the definitions written out below)
  - MI = 0 iff X and Y are statistically independent (in contrast with the Pearson correlation, which quantifies only linear dependencies)
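The slide's original equations are not reproduced in this text; for reference, the standard discrete-form definitions of the two quantities are:

```latex
H(X)   = -\sum_{x} p(x)\,\log p(x)
I(X;Y) = \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)
```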
7 Calculating p(x)
- Standard approach: use histograms to estimate p(x) (see the sketch below)
(figure: signal amplitude v over time t, discretized into histogram bins with per-bin counts)
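A minimal sketch of this plug-in histogram estimate, assuming a 1-D array of samples and an illustrative bin count of 10:

```python
import numpy as np

def histogram_probs(x, bins=10):
    """Estimate p(x) by discretizing the signal into equal-width bins."""
    counts, _ = np.histogram(x, bins=bins)
    return counts / counts.sum()

def plugin_entropy(x, bins=10):
    """Plug-in (histogram) entropy estimate in bits."""
    p = histogram_probs(x, bins=bins)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# Example on a simulated signal
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
print(plugin_entropy(signal, bins=10))
```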
8 Stochastic process
- Shortcoming of using H and MI: temporal dependences are ignored
- Model the outputs as a Markov stochastic process: represent the outputs of one electrode as a sequence of random variables
(figure: the electrode output v represented as a sequence of random variables over time t)
9 Capturing temporal dependences
- Conditional entropy: captures temporal dependences within a single electrode
- Conditional MI: captures both spatial and temporal dependences (see the sketch below)
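A minimal sketch of plug-in estimators for these quantities on discretized electrode outputs; the exact variables being conditioned on are not fully specified in this text, so the lag of 1 sample, the bin count of 8, and the particular conditional-MI form below are illustrative assumptions:

```python
import numpy as np

def discretize(x, bins=8):
    """Map a continuous signal to integer bin labels."""
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)

def joint_entropy(*labels):
    """Plug-in joint entropy (bits) of one or more discrete label sequences."""
    keys = np.stack(labels, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, bins=8, lag=1):
    """H(X_t | X_{t-lag}): temporal dependence within one electrode."""
    s = discretize(x, bins)
    return joint_entropy(s[lag:], s[:-lag]) - joint_entropy(s[:-lag])

def conditional_mi(x, y, bins=8, lag=1):
    """I(X_t ; Y_t | X_{t-lag}): spatial plus temporal dependence."""
    sx, sy = discretize(x, bins), discretize(y, bins)
    xt, yt, xp = sx[lag:], sy[lag:], sx[:-lag]
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return (joint_entropy(xt, xp) + joint_entropy(yt, xp)
            - joint_entropy(xt, yt, xp) - joint_entropy(xp))
```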
10 How to estimate entropy?
- The histogram approach is not good if we have to estimate joint probabilities of many variables
  - Some of the bins will have zero counts; sometimes there are more bins than data points
- To calculate the expected entropy, use a Bayesian approach over
  - the vector of true (unknown) probabilities, and
  - the vector of counts observed in each bin
11 Entropy estimation
- Likelihood: multinomial distribution
- Prior: Dirichlet distribution
  - The Dirichlet prior parameter reflects the prior knowledge of the number of points in each bin
- The resulting expected entropy has a closed form involving the digamma function (a sketch follows)
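A minimal sketch of this Bayesian estimator under a symmetric Dirichlet prior; the closed-form posterior mean of the entropy below is the standard one (it uses the digamma function), while the per-bin concentration value is an illustrative assumption:

```python
import numpy as np
from scipy.special import digamma

def bayesian_entropy(counts, alpha=1.0):
    """Posterior-mean entropy (bits) of a discrete distribution.

    Likelihood: multinomial with bin counts `counts`.
    Prior: symmetric Dirichlet with concentration `alpha` per bin
    (alpha acts as a pseudo-count, i.e., prior knowledge of how many
    points to expect in each bin).
    """
    counts = np.asarray(counts, dtype=float)
    a = counts + alpha                  # Dirichlet posterior parameters
    A = a.sum()
    # E[H | counts] = psi(A + 1) - sum_i (a_i / A) * psi(a_i + 1)   (in nats)
    h_nats = digamma(A + 1.0) - np.sum((a / A) * digamma(a + 1.0))
    return h_nats / np.log(2.0)         # convert nats to bits

# Example: heavily undersampled histogram (many zero-count bins)
print(bayesian_entropy([3, 0, 0, 1, 0, 0, 0, 0], alpha=0.5))
```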
12 Classification task: subject and task
- Goal: associate a given EEG segment with both the subject and the class
- Classifiers: Naïve Bayes (NB) and SVM
- Features: 62 for H, and 1,891 for MI
- 24 classes: 6 subjects and 4 tasks (see the sketch below)
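A minimal sketch of this setup with scikit-learn, assuming the information-theoretic features have already been extracted into a matrix X (one row per EEG segment) and y holds the 24 subject/task labels; the data shapes and hyperparameters here are illustrative assumptions, not those of the original study:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 1891))     # e.g., MI features, one row per segment
y = np.repeat(np.arange(24), 10)     # 24 classes: 6 subjects x 4 tasks

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("SVM", make_pipeline(StandardScaler(), SVC(kernel="linear")))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```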
13 Dependence on prior parameters
(figures: single-trial and five-trial classification rates with SVM as a function of the prior parameter)
- The prior parameter acts like an effective number of prior points, so it becomes less important as more observations become available
14 SVM classification rates: 64 features
(figures: classification rates for five-trial-long and single-trial-long segments)
- PS: Power Spectrum
- H: Entropy, H(X)
- CH: Conditional Entropy, H(X_t | X_{t-1}), which captures temporal dependences within a single electrode
- Band A: 1-20 Hz
- Band B: 20-40 Hz
- Band C: 40-60 Hz
15 SVM classification rates: 1,891 features
(figures: classification rates for five-trial-long and single-trial-long segments)
- LC: Linear Correlation
- MI(2): MI between two electrodes, MI(2)(X, Y); captures "spatial" dependences (see the sketch below)
- MI(3): MI between two electrodes using 3 random variables, MI(3)(X_t, Y_t, X_{t-1}); captures spatial and temporal dependences
- MI(4): MI using 4 random variables, MI(4)(X_t, Y_t, X_{t-1}, Y_{t-1})
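A minimal sketch of how the 1,891 pairwise MI(2) features could be assembled from a 62-electrode segment (62 * 61 / 2 = 1,891 unordered pairs); the plug-in 2-D histogram estimator and the bin count are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def mi_bits(a, b, bins=8):
    """Plug-in mutual information (bits) between two continuous signals."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

def pairwise_mi_features(segment, bins=8):
    """MI(2) feature vector for one (n_electrodes, n_samples) EEG segment.

    With 62 electrodes this yields 62 * 61 / 2 = 1,891 features.
    """
    n = segment.shape[0]
    return np.array([mi_bits(segment[i], segment[j], bins=bins)
                     for i, j in combinations(range(n), 2)])

# Example: 62 electrodes, 512 samples
rng = np.random.default_rng(0)
seg = rng.normal(size=(62, 512))
print(pairwise_mi_features(seg).shape)   # (1891,)
```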
16 Classifying EEG across subjects
- Using all of the features, classification is around chance!
- Problem: too many features (e.g., 62 x 62 for the MI features), and not all of them are important (noisy features, irrelevant features, etc.)
- Solution: reduce the number of features by measuring the statistical dependence between features and the class variable (e.g., using MI)
17 Classifying EEG across subjects
- Problem: we have to calculate a joint probability that involves many features and the class variable
- Solution: select features sequentially; treat one feature at a time and calculate the MI between every feature and the class variable
- Maximize the dependence between a feature and the class variable, and minimize the redundancy among the selected features (see the sketch after this list)
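A minimal sketch of this sequential selection, in the spirit of max-relevance / min-redundancy selection; the plug-in MI helper, the bin count, and the simple relevance-minus-redundancy score are illustrative assumptions:

```python
import numpy as np

def mi_discrete(a, b, bins=8):
    """Plug-in mutual information (bits) between two 1-D variables."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

def select_features(X, y, n_select=40, bins=8):
    """Greedy selection: maximize MI(feature; class), minimize redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mi_discrete(X[:, j], y, bins) for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while len(selected) < n_select and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            redundancy = (np.mean([mi_discrete(X[:, j], X[:, k], bins) for k in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy     # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
y = rng.integers(0, 4, size=300)
print(select_features(X, y, n_select=5))
```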
18 Classification across subjects
- Goal: associate a given EEG segment with the task across subjects (test on a subject that has not been used for training)
- Classifiers
  - Naïve Bayes (Bayes)
  - Support Vector Machine (SVM)
  - Nearest Neighbor (NN)
- Features
  - For MI, start with 1,891 features and use only 40 highly discriminative ones
- Use 6 subjects: 5 for training and 1 for testing (see the sketch below)
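A minimal sketch of this leave-one-subject-out protocol with scikit-learn; the arrays X (selected features), y (task labels), and subjects (which subject produced each segment), as well as all hyperparameters, are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))            # 40 selected MI features per segment
y = rng.integers(0, 4, size=600)          # 4 tasks
subjects = np.repeat(np.arange(6), 100)   # 6 subjects, 100 segments each

classifiers = {
    "Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "NN": KNeighborsClassifier(n_neighbors=1),
}

logo = LeaveOneGroupOut()                 # train on 5 subjects, test on the held-out one
for name, clf in classifiers.items():
    accs = []
    for train, test in logo.split(X, y, groups=subjects):
        clf.fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
    print(f"{name}: mean accuracy across held-out subjects = {np.mean(accs):.2f}")
```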
19 Results
(figures: classification rates across subjects using Power Spectra, Conditional Entropy, and Conditional MI features)
20 Conclusions
- Information-theoretic features (H, CH, and MI(i)) outperform conventional features (power spectrum, linear correlation)
- Although H and MI features capture non-linear dependences, they assume that the signal is stationary
- To capture temporal dependences, treat the electrode outputs as a stochastic process and introduce conditional entropy and conditional mutual information features
- To detect more difficult patterns (classifying across subjects from single trials), it is necessary to reduce the dimensionality of the feature space
21 Collaborators
- Liang Wu and Leon Cooper
- Physics Department
- Institute for Brain and Neural Systems
- Brown University, Providence, RI
- Bill Heindel and Elena Festa
- Department of Psychology
- Brown University, Providence, RI