Mining Baseball Statistics - PowerPoint PPT Presentation


Slides: 14
Provided by: cse6
1
Mining Baseball Statistics
Data Mining CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL: http://kmp-cse881.appspot.com/

2
Overview of Baseball
  • Baseball is a team sport
  • There are two major leagues: AL (American) and NL
    (National)
  • Many statistics characterizing player performance
    are published yearly
  • Each league names one player MVP (Most Valuable
    Player) each year according to a vote
  • People place bets on who will be MVP

3
Overview
  • Application (motivation)
  • Can we predict who will be named MVP?
  • Learn how to do data mining
  • Learn about baseball
  • Impress sabermetricians
  • Baseball: it's not diseases, crime, or pollution
  • Baseball statistics
  • Main task: predict MVPs for a given year
  • Use SVM to rank players

4
Overview of Data and Mining
  • Data: 5 CSV files (Batting, Fielding, Master,
    Awards, Salaries)
  • Data Mining
  • Ranking (similar to classification)
  • Anomaly detection (maybe)

5
Methodology - Preprocessing
  • Initial data: 90,000 rows in Batting table,
    1871-2007
  • One row per player/year/stint/team
  • Cut to 1985-2007 (28,000 rows) because Salaries
    data begin then and because of rule changes
  • Perl script to merge tables by
    playerID/yearID/stint
  • Batting + Fielding + Awards (MVP) + Salaries +
    Master: 48 columns
  • 14 hours, but I got to relearn Perl!
  • Discovered it was infeasible to use WEKA; needed
    to use SVM-Light
  • Reformatted from CSV to space-delimited SVM-Light
    format
  • replace every value with attribute:value
  • replace commas, spaces
  • deleted 131 rows without a fielding record
    (3 max: 26, 21, 16 at-bats)
  • create (binary) rank value based on MVP status
  • replace all MM/DD/YYYY with YYYY
  • insert qid column according to year/league (46
    qids)
  • ...
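
The reformatting step above can be sketched in Python. This is only an illustration, not the team's Perl script: the function names, the feature numbering, and the sample values are hypothetical. It assumes the usual SVM-Light ranking line layout (target, then qid, then index:value pairs in increasing index order) and one qid per league per season, which gives 46 qids for 1985-2007.

```python
def qid_for(year, league, first_year=1985, leagues=("AL", "NL")):
    """Map (year, league) to a query id: one id per league per season.

    The numbering scheme is an assumption; any unique id per group works.
    """
    return (year - first_year) * len(leagues) + leagues.index(league) + 1


def to_svmlight_line(rank, qid, features):
    """Format one player-season as an SVM-Light ranking example.

    rank     -- binary target (assumed here: 1 for the MVP, 0 otherwise)
    qid      -- query id grouping players who compete for the same award
    features -- dict mapping integer feature index (>= 1) to a numeric value
    """
    # SVM-Light expects feature indices in increasing order.
    pairs = " ".join(f"{idx}:{value:g}" for idx, value in sorted(features.items()))
    return f"{rank} qid:{qid} {pairs}"


if __name__ == "__main__":
    # Hypothetical feature values for a single player-season.
    print(to_svmlight_line(1, qid_for(2007, "NL"), {1: 0.296, 2: 50, 3: 130}))
```

Grouping by qid matters because the ranker only compares players within the same year and league.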

6
Methodology - Data Mining
  • Classification not apt to give good results, hence
    ranking with SVM-Light (Cornell University)
  • Training generates a model which can rank input
  • Training phase: leave one (year) out
  • Testing: rank the players for that year
  • Postprocessing
  • SVM-Light returns only ranks of the players as
    integers
  • match ranks with corresponding players
  • Reformat data for visualization
  • Ranked the data for each attribute
  • Anomaly detection (in progress)
  • KNN on 4 attributes (Gbat, R, HR, RBI) for
    players in > 10 games
  • Compute z-scores for each attribute/year
  • Rank players by distance from nearest neighbor
  • Compare ranks in various attributes for detecting
    anomalies
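
The anomaly-detection idea above (z-scores per attribute, then ranking players by distance to their nearest neighbor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the function names are hypothetical, Euclidean distance is assumed, and z-scores are taken over whatever rows are passed in (for example, one year's players with more than 10 games).

```python
import math
from statistics import mean, pstdev


def zscores(rows):
    """Standardize each column of a list of equal-length numeric rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Fall back to 1.0 so a constant column does not divide by zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def nearest_neighbor_distances(points):
    """Euclidean distance from each point to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def rank_anomalies(names, rows):
    """Return names ordered from most isolated to least isolated."""
    dists = nearest_neighbor_distances(zscores(rows))
    return [name for _, name in sorted(zip(dists, names), reverse=True)]


if __name__ == "__main__":
    # Made-up (Gbat, R, HR, RBI) rows; player "D" is deliberately extreme.
    names = ["A", "B", "C", "D"]
    rows = [(100, 50, 10, 40), (102, 52, 11, 42), (98, 49, 9, 39), (160, 130, 70, 150)]
    print(rank_anomalies(names, rows))
```

A player far from every other player in standardized space ends up at the top of the list, which is the sense of "anomaly" assumed here.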

7
Methodology - Visualization
  • Bar charts of top 20 ranked players for various
    attributes
  • Python
  • Google App Engine
  • Google Charts tool
  • U.S. map of player birthState density

8
Team Roles
  • Roles of team members
  • Planning - Everyone
  • Preprocessing: Paul Cornwell
  • Data Mining: Kajal Miyan
  • Visualization: Mojtaba Solgi

9
Related Work
  • No apparent academic work on predicting MLB MVPs
  • PECOTA
  • Baseball Prospectus
  • www.baseballprospectus.com/pecota/
  • Baseball forecasting
  • Makes statistical predictions about players
  • No MVP prediction evident
  • subscription service
  • Books are available with baseball forecasts
  • apparently for one year only

10
Experimental Setup
  • Raw data downloaded from
    http://baseball1.com/content/view/58/82/
  • Preprocessing done using Perl, Nano, Excel, OOo,
    TextPad
  • Preprocessing yields a table with 28K rows and
    45 columns
  • Experiments were conducted on a 2 GHz P4 machine
    running Kubuntu 8.04 with 1GB RAM
  • Data Mining and postprocessing with SVM-Light,
    Visual C, Matlab
  • Visualization done using Python, Google App

11
Experimental Evaluation
  • Preliminary results
  • SVM-Light trained on 1985-2006 data
  • tested on 2007
  • ranked actual MVPs 1 and 11 (out of 1242
    players) (2nd NL, 2)
  • (there is one MVP for each league each year: AL,
    NL)
  • 2006 ranks 7, 16 (1371 players)
  • 2005 ranks 1, 4 (1322 players)
  • 2004 ranks 1, 3 (1342 players)
  • 2003 ranks 3, 32 (1341 players)
  • 2002 ranks 1, 11 (1316 players)
  • Final evaluation (pending)
  • Leave-one-out
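
The leave-one-year-out evaluation and the "rank of the actual MVP" figures above could be computed along these lines. This is a hedged sketch, not the project's code: the helper names and the `yearID` accessor are hypothetical, and the scores stand in for whatever the trained ranker outputs (higher is assumed to mean more MVP-like).

```python
def leave_one_year_out(rows, year_of):
    """Yield (held_out_year, train_rows, test_rows) for every season in rows."""
    for year in sorted({year_of(r) for r in rows}):
        train = [r for r in rows if year_of(r) != year]
        test = [r for r in rows if year_of(r) == year]
        yield year, train, test


def rank_of(target, scores):
    """1-based position of target when players are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


if __name__ == "__main__":
    # Toy rows; real rows would carry the full merged feature set.
    rows = [{"yearID": 2005}, {"yearID": 2006}, {"yearID": 2006}, {"yearID": 2007}]
    for year, train, test in leave_one_year_out(rows, lambda r: r["yearID"]):
        print(year, len(train), len(test))
    # Made-up scores for three players in one held-out season.
    print(rank_of("playerB", {"playerA": 0.1, "playerB": 0.9, "playerC": 0.5}))
```

Holding out a whole season at a time keeps each test set equal to one year's candidates, which matches how the preliminary results are reported per year.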

12
Visualization Demo
  • http://kmp-cse881.appspot.com/

13
Conclusions
  • MVP ranking was surprisingly successful
  • Early results suggest that it is feasible to
    predict MVPs with some accuracy
  • Lessons learned
  • Data mining is hard work
  • Baseball statistics are actually sort of
    interesting
  • Future work
  • Leave-one-out validation
  • Incorporate team statistics in player evaluations
    (expert advice)
