Mining Baseball Statistics - PowerPoint PPT Presentation

Provided by: cse6

Transcript and Presenter's Notes

1
Mining Baseball Statistics
Data Mining CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL: http://kmp-cse881.appspot.com/

2
Overview of Baseball

Baseball is a team sport
There are two major leagues: AL (American) and NL (National)
Many statistics characterizing player performance are published yearly
Each league names one player MVP (Most Valuable Player) each year according to a vote
People place bets on who will be MVP

3
Overview

Application (motivation)
Can we predict who will be named MVP?
Learn how to do data mining
Learn about baseball
Impress sabermetricians
Baseball: it's not diseases, crime, or pollution
Baseball statistics
Main task: predict MVPs for a given year
Use SVM to rank players

4
Overview of Data and Mining

Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
Data Mining
Ranking (similar to classification)
Anomaly detection (maybe)

5
Methodology - Preprocessing

Initial data: 90,000 rows in Batting table, 1871-2007
One row = one player/year/stint/team
Cut to 1985-2007 (28,000 rows) because Salary data begin then and because of rule changes
Perl script to merge tables by playerID/yearID/stint
Batting -> Fielding -> Awards (MVP) -> Salaries -> Master
48 columns
14 hours, but I got to relearn Perl!
Discovered it was infeasible to use WEKA; needed to use SVM-Light
Reformatted from CSV to space-delimited SVM-Light format
replace every value with attribute:value
replace commas, spaces
deleted 131 rows without a fielding record (3-max 26, 21, 16 at-bats)
create (binary) rank value based on MVP status
replace all MM/DD/YYYY with YYYY
insert qid column according to year/league (46 qids)
...
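The CSV-to-SVM-Light reformatting steps above can be sketched roughly as follows. This is a minimal illustration, not the team's Perl script: the column names `lgID` and `MVP`, the sample rows, and the helper names `build_qids`, `to_svmlight_line`, and `convert` are all assumptions made for the example. It writes each row as a binary target, a `qid` for the year/league pair, and numbered `index:value` features, which is the layout SVM-Light expects in ranking mode.

```python
import csv
import io

# Hypothetical sample data; player IDs and numbers are made up for illustration.
SAMPLE_CSV = """playerID,yearID,lgID,HR,RBI,MVP
playerA,2006,AL,35,120,1
playerB,2006,AL,20,80,0
playerC,2006,NL,58,149,1
"""


def build_qids(rows):
    """Assign one integer query id per (year, league) pair, in first-seen order."""
    qids = {}
    for row in rows:
        key = (row["yearID"], row["lgID"])
        if key not in qids:
            qids[key] = len(qids) + 1  # qid values start at 1
    return qids


def to_svmlight_line(row, feature_names, qids):
    """Format one merged row as '<target> qid:<n> <index>:<value> ... # <playerID>'."""
    target = 1 if row["MVP"] == "1" else 0  # binary rank value from MVP status
    parts = [str(target), f"qid:{qids[(row['yearID'], row['lgID'])]}"]
    for index, name in enumerate(feature_names, start=1):
        # Assumption: an empty field is treated as 0.
        value = float(row[name] or 0)
        parts.append(f"{index}:{value:g}")
    parts.append(f"# {row['playerID']}")  # trailing comment keeps the player id
    return " ".join(parts)


def convert(csv_text, feature_names):
    """Convert CSV text into a list of SVM-Light ranking lines."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    qids = build_qids(rows)
    return [to_svmlight_line(row, feature_names, qids) for row in rows]


if __name__ == "__main__":
    for line in convert(SAMPLE_CSV, ["HR", "RBI"]):
        print(line)
```

Keeping the player id in the trailing comment is one way to make it easier to match ranks back to players during postprocessing.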

6
Methodology - Data Mining

Classification not apt to get good results, hence ranking with SVM-Light (Cornell University)
Training generates a model which can rank input
Training phase: leave one (year) out
Testing: rank the players for that year
Postprocessing
SVM-Light returns only ranks of the players as integers
match ranks with corresponding players
Reformat data for visualization
Ranked the data for each attribute
Anomaly detection (in progress)
KNN on 4 attributes (Gbat, R, HR, RBI) for players in > 10 games
Compute z-scores for each attribute/year
Rank players by distance from nearest neighbor
Compare ranks in various attributes for detecting anomalies
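The anomaly-detection steps listed above (marked as in progress on the slide) might look something like the sketch below for a single season. It is an assumption-laden illustration rather than the team's code: the function names, the dictionary layout of a player record, and the choice of Euclidean distance are all hypothetical. It standardizes the four attributes to z-scores, measures each player's distance to the nearest other player, and ranks players by that distance.

```python
import math
from statistics import mean, pstdev

# Attribute names as given on the slide.
ATTRIBUTES = ("Gbat", "R", "HR", "RBI")


def zscore_vectors(players, attributes=ATTRIBUTES):
    """Return one z-score vector per player, standardizing each attribute."""
    columns = []
    for name in attributes:
        values = [p[name] for p in players]
        mu, sigma = mean(values), pstdev(values)
        # A constant column carries no information, so map it to zeros.
        columns.append([(v - mu) / sigma if sigma else 0.0 for v in values])
    return [list(vec) for vec in zip(*columns)]


def nearest_neighbor_distances(vectors):
    """Euclidean distance from each vector to its closest other vector."""
    return [
        min(math.dist(a, b) for j, b in enumerate(vectors) if j != i)
        for i, a in enumerate(vectors)
    ]


def rank_anomalies(players, min_games=10):
    """Rank one season's players, most isolated first.

    Only players with more than `min_games` games are considered; at least
    two such players are needed for a nearest neighbor to exist.
    """
    eligible = [p for p in players if p["Gbat"] > min_games]
    dists = nearest_neighbor_distances(zscore_vectors(eligible))
    ranked = sorted(zip(eligible, dists), key=lambda pair: pair[1], reverse=True)
    return [(p["playerID"], d) for p, d in ranked]


if __name__ == "__main__":
    # Made-up season: three similar players, one outlier, one with too few games.
    season = [
        {"playerID": "a", "Gbat": 150, "R": 90, "HR": 20, "RBI": 80},
        {"playerID": "b", "Gbat": 152, "R": 92, "HR": 21, "RBI": 82},
        {"playerID": "c", "Gbat": 148, "R": 88, "HR": 19, "RBI": 78},
        {"playerID": "d", "Gbat": 155, "R": 140, "HR": 70, "RBI": 160},
        {"playerID": "e", "Gbat": 5, "R": 1, "HR": 0, "RBI": 1},
    ]
    for player_id, distance in rank_anomalies(season):
        print(player_id, round(distance, 3))
```

Running this once per year keeps the z-scores relative to that season, matching the "per attribute/year" step in the list.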

7
Methodology - Visualization

Bar charts of top 20 ranked players for various attributes
Python
Google App Engine
Google Charts tool
U.S. map of player birthState density

8
Team Roles

Roles of team members
Planning - Everyone
Preprocessing - Paul Cornwell
Data Mining - Kajal Miyan
Visualization - Mojtaba Solgi

9
Related Work

No apparent academic work on predicting MLB MVPs
PECOTA
Baseball Prospectus
www.baseballprospectus.com/pecota/
Baseball forecasting
Makes statistical predictions about players
No MVP prediction evident
subscription service
Books are available with baseball forecasts
apparently for one year only

10
Experimental Setup

Raw data downloaded from http://baseball1.com/content/view/58/82/
Preprocessing done using Perl, Nano, Excel, OOo, TextPad
Preprocessing yields a table with 28K rows and 45 columns
Experiments were conducted on a 2 GHz P4 machine running Kubuntu 8.04 with 1 GB RAM
Data Mining and postprocessing with SVM-Light, Visual C, Matlab
Visualization done using Python, Google App

11
Experimental Evaluation