Title: Earthquake Prediction using Data Mining Tools
- Mrinalini Kabbur 
- Ritu Chinya 
- Progress Report
Introduction
- An earthquake is a sudden movement of the Earth, 
 caused by the abrupt release of strain that has
 accumulated over a long time.
- Earthquakes remain one of the natural hazards 
 that cannot yet be predicted reliably.
- The goal of earthquake prediction is to give 
 warning of potentially damaging earthquakes early
 enough to allow appropriate response to the
 disaster, enabling people to minimize loss of
 life and property.
Project Design
- This project deals with earthquake classification 
 and prediction using data mining tools.
- Weka was used to develop the model. 
- A Naïve Bayesian classifier was used to predict 
 unknown class labels.
- C4.5 with a 66% split was used to classify the 
 data, and 10-fold cross-validation to evaluate
 accuracy.
Method
- Installation of Weka 
- Weka is a suite of software for machine learning 
 and data mining
- Developed at the University of Waikato in New 
 Zealand
- Available for free 
- Easy to use Graphical User Interface 
- Learning Weka 
- Both of us were new to Weka 
- Used the tutorial by Svetlana Aksanova 
- Searched the internet for additional information 
 on Weka
- Gathering the earthquake data set 
- Consists of earthquakes that occurred in the 
 Northern California region during 2005.
- Data gathered from United States Geological 
 Survey (USGS) website.
- Data preprocessing 
- Weka algorithms work on ARFF format 
- But the data was in HTML format as shown below. 
Data Preprocessing (Contd)
- So the data had to be transferred to an Excel 
 file.
- Converting directly from HTML to Excel was 
 difficult.
- So the data was first saved in Word format. 
Excel Format 
- Conversion from Excel to ARFF format. 
- Saved the Excel file as CSV. 
- Used awk commands to format the data. 
- Keyed in some missing data. 
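The conversion step can also be sketched in Python instead of awk. This is only an illustration under stated assumptions: the function name csv_to_arff, the sample rows, and the choice to declare every column as numeric are hypothetical and not taken from the project.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text.

    Assumes every column is numeric; empty cells become '?',
    which is how ARFF marks a missing value.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for name in header:
        lines.append("@attribute " + name.strip() + " numeric")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(cell.strip() or "?" for cell in row))
    return "\n".join(lines) + "\n"


# Made-up sample rows, for illustration only.
sample = (
    "Latitude,Longitude,Depth,Magnitude\n"
    "37.5,-121.8,8.2,3.1\n"
    "38.8,-122.8,,3.4\n"
)
arff = csv_to_arff(sample, "Earthquake")
print(arff)
```

Writing missing values as '?' lets Weka load the file even before the gaps are keyed in by hand.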
- Data Cleansing 
- The earthquake data contained many parameters. 
 They include:
- Date and time 
- Longitude 
- Latitude 
- Depth 
- Magnitude 
- Event ID 
- Source 
- Magt 
- Nst 
- Gap 
- Clo 
- Attributes of interest include 
- Date and Time 
- Longitude 
- Latitude 
- Depth 
- Magnitude 
Date and time fields are not considered while 
applying the classification algorithm. The filter 
weka.filters.unsupervised.attribute.Remove is 
applied to remove the date and time attribute. 
This is shown below. 
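Outside Weka, the effect of removing an attribute can be sketched in a few lines of Python. This is not the Weka filter itself; the helper remove_columns, the column name DateTime, and the sample record are hypothetical.

```python
def remove_columns(header, rows, to_remove):
    """Return a copy of the table without the named columns."""
    keep = [i for i, name in enumerate(header) if name not in to_remove]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Made-up record, for illustration only.
header = ["DateTime", "Latitude", "Longitude", "Depth", "Magnitude"]
rows = [["2005/01/01 00:00:00", "37.5", "-121.8", "8.2", "3.1"]]
new_header, new_rows = remove_columns(header, rows, {"DateTime"})
print(new_header)
```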
- Discretize 
- Attributes contain numeric data. 
- Some Weka algorithms, such as ID3, require nominal 
 attribute values.
- Conversion of numeric attributes to nominal. 
- The attributes Longitude, Latitude, Depth and 
 Magnitude are all discretized using the filter
 weka.filters.unsupervised.attribute.Discretize.
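A minimal sketch of equal-width discretization into 10 bins follows. It is an illustration, not Weka's code: the helper names are hypothetical, and the magnitude range of 3.0 to 7.1 is an assumption inferred from the class intervals in the reported output, not a value stated in the slides.

```python
def equal_width_cut_points(lo, hi, bins):
    """Return the bins - 1 interior cut points of an equal-width binning."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_label(value, cuts):
    """Name the interval a value falls in, e.g. '(3.41-3.82]'."""
    for i, cut in enumerate(cuts):
        if value <= cut:
            left = "-inf" if i == 0 else f"{cuts[i - 1]:.2f}"
            return f"({left}-{cut:.2f}]"
    return f"({cuts[-1]:.2f}-inf)"


# Assumed magnitude range; not stated in the slides.
cuts = equal_width_cut_points(3.0, 7.1, 10)
print([round(c, 2) for c in cuts])
print(bin_label(3.2, cuts), bin_label(3.5, cuts), bin_label(7.0, cuts))
```

Under that assumed range, each bin is 0.41 wide, which agrees with the interval boundaries that appear as class labels in the Naïve Bayesian results.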
- Apply classification algorithms to come up with 
- Decision trees 
- Rule sets 
- Algorithms used for modelling 
- C4.5 
- Naïve Bayesian 
C4.5
- We considered two cases. 
- Cross-validation: evaluates the classifier by 
 cross-validation, using the number of folds
 entered in the Folds text field.
- Percentage split: evaluates the classifier on how 
 well it predicts a certain percentage of the
 data, which is held out for testing. The amount
 of data held out depends on the value entered in
 the % field.
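The two evaluation modes can be sketched in Python as follows. This is an illustration only: the function names and the seed are hypothetical, the folds here are plain shuffled folds rather than stratified ones, and the rounding of the training-set size is an assumption.

```python
import random


def percentage_split(n, train_percent, seed=1):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_percent / 100)
    return indices[:n_train], indices[n_train:]


def k_folds(n, k, seed=1):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


# 113 instances, as in the earthquake relation.
train, test = percentage_split(113, 66)
print(len(train), len(test))
fold_sizes = [len(test_fold) for _, test_fold in k_folds(113, 10)]
print(fold_sizes)
```

With a percentage split, each instance is used either for training or for testing. With cross-validation, every instance is tested once, which makes better use of a small data set.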
First we evaluate the classifier with a 66% 
percentage split: 66% of the data is used for 
training and the rest is held out for testing, 
as shown below. 
Run Analysis 
Run Information gives you the following 
information:
- the algorithm you used: J48 
- the relation name: Earthquake 
- number of instances in the relation: 113 
- number of attributes in the relation: 4, and the 
 list of the attributes: Longitude, Latitude,
 Depth, Magnitude
- the test mode you selected: split 66% 
Classifier model is an unpruned decision tree in 
textual form that was produced on the full 
training data. As you can see, the first split 
is on the Longitude attribute; at the second 
level, the splits are on Latitude and 
Longitude. 
Below the tree structure, the program reports the 
number of leaves (10) and the size of the tree, 
that is, the number of nodes (19). It also gives 
the time it took to build the model, which is 
0.06 seconds.
In this case only 67% of the instances have been 
classified correctly. This indicates that the 
results obtained from the training data are not 
optimistic compared with what might be obtained 
from an independent test set from the same 
source. 
Weka also lets you visualize the decision tree. 
- Accuracy Estimation 
- Ten-fold cross-validation 
- Snapshot of Naïve Bayesian classification using 
 Weka
Run Information
- === Run information === 
- Scheme: weka.classifiers.bayes.NaiveBayes 
- Relation: Earthquake-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rlast
- Instances: 113 
- Attributes: 4 
-  Latitude 
-  Longitude 
-  Depth 
-  Magnitude 
- Test mode: 10-fold cross-validation 
- === Classifier model (full training set) === 
- Naive Bayes Classifier 
- Time taken to build model: 0.06 seconds 
- === Stratified cross-validation === 
- === Summary === 
- Correctly Classified Instances: 69 (61.0619 %) 
- Incorrectly Classified Instances: 44 (38.9381 %) 
- Kappa statistic: -0.0061 
- Mean absolute error: 0.1187 
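The two percentages in the summary follow directly from the instance counts. A short check in Python (the helper name percent is hypothetical):

```python
def percent(count, total):
    """Return count as a percentage of total, rounded to four decimals."""
    return round(100.0 * count / total, 4)


correct, incorrect = 69, 44      # counts from the summary above
total = correct + incorrect      # 113 instances in the relation
print(percent(correct, total), percent(incorrect, total))
```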
Run Information (Cont)
- === Detailed Accuracy By Class === 
  TP Rate  FP Rate  Precision  Recall  F-Measure  Class
  0.972    0.976    0.627      0.972   0.762      '(-inf-3.41]'
  0        0        0          0       0          '(3.41-3.82]'
  0        0.019    0          0       0          '(3.82-4.23]'
  0        0        0          0       0          '(4.23-4.64]'
  0        0        0          0       0          '(4.64-5.05]'
  0        0        0          0       0          '(5.05-5.46]'
  0        0        0          0       0          '(5.46-5.87]'
  0        0        0          0       0          '(5.87-6.28]'
  0        0        0          0       0          '(6.28-6.69]'
  0        0.009    0          0       0          '(6.69-inf)'
- === Confusion Matrix === 
   a  b  c  d  e  f  g  h  i  j   <-- classified as
  69  0  1  0  0  0  0  0  0  1 |  a = '(-inf-3.41]'
  24  0  1  0  0  0  0  0  0  0 |  b = '(3.41-3.82]'
   8  0  0  0  0  0  0  0  0  0 |  c = '(3.82-4.23]'
   6  0  0  0  0  0  0  0  0  0 |  d = '(4.23-4.64]'
   2  0  0  0  0  0  0  0  0  0 |  e = '(4.64-5.05]'
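As a sanity check, the TP rate (recall) of the first class can be recomputed from its confusion-matrix row, and its F-measure from the reported precision and recall. The function names below are hypothetical; the numbers are the ones reported above.

```python
def recall(row, class_index):
    """TP rate: correct predictions in the row over all instances in it."""
    return row[class_index] / sum(row)


def f_measure(precision, rec):
    """Harmonic mean of precision and recall."""
    return 2 * precision * rec / (precision + rec)


row_a = [69, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # first confusion-matrix row
recall_a = round(recall(row_a, 0), 3)
f_a = round(f_measure(0.627, 0.972), 3)
print(recall_a, f_a)
```

Both values agree with the first row of the accuracy table. The high FP rate of that class reflects that almost every instance, whatever its true magnitude range, was assigned to the lowest bin.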
Learnings from the project
- We were both new to Weka and learnt to use the 
 software.
- It was challenging to analyze a large amount of 
 data compared to what we did in our homework.
- We realized that data pre-processing indeed takes 
 a long time.
- We got a clear understanding of C4.5 and Naïve 
 Bayesian classification algorithms.
Division of work
- We worked together on all the tasks.
 Conclusion
We realized that data mining tools are very 
powerful and save a lot of time when classifying 
huge amounts of data. We found that using the C4.5 
algorithm with 66% of the data as training data 
gave an accuracy of 67%, whereas 10-fold 
cross-validation gave an accuracy of 62% in the 
case of the earthquake data. The Naïve Bayesian 
algorithm also correctly classified 61% of the 
test data. So, the results were pretty close. All 
in all, the project was very interesting and 
challenging and we enjoyed working on it. 
Reference
- http://www.studentprogress.com/appln/colleges/cogrec/Papers/D_05.pdf
- www.meteoquake.org/our.html 
- http://www.cs.waikato.ac.nz/ml/weka/index.html 
- http://gaia.ecs.csus.edu/mei/215/tutorial.html 
- http://www.ngdc.noaa.gov/seg/hazard/sig_srch_idb.shtml
- Weka Explorer tutorial by Svetlana Aksanova