Working With Large Datasets in Corporate Settings


1
Working With Large Datasets in Corporate Settings
  • Ed Bassin

www.profsoft-health.com
2
Background: About ProfSoft
  • Medical/pharma claims analysis software
  • Main uses are provider profiling, quality
    analysis
  • 14 clients range from 15K to 2.6M members
  • Databases from 900K to 110M claim lines
  • Compete with Fortune 100 companies by stressing
    content, task-appropriate technology
  • Stata is the core of our product
    • 25,000 lines of ado files
    • Stata do file generators

3
Challenges We Face
  • Managing that quantity of data
  • End-users are not statisticians
    • Want point-and-click tools
    • Do not understand complicated techniques
  • Stata is largely unknown at our clients; SAS is
    the standard heavy-duty data package
  • Integrating Stata with the technology of
    corporate America

4
Why We Chose Stata
  • Performance
  • Relative ease of programming
  • Chose for analytic capabilities, not UI
  • I knew it reasonably well

5
Interfacing with databases
  • Create_table ado reads Stata structure and writes
    appropriate SQL to build, load, and index tables
  • Write delimited text files with DBMS/Copy
  • Call native DBMS tools to load gigabytes of data
  • Support Oracle, Microsoft SQL Server, MySQL
  • Execsql ado calls native DBMS tools to run SQL
    scripts
  • Process is fast, easy, invisible
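
The first bullet above describes reading a dataset's structure and generating SQL from it. A minimal sketch of that idea follows, written in Python rather than as an ado file. The type mapping, function names, and index naming are assumptions made for illustration and are not taken from the Create_table ado itself.

```python
# Hypothetical sketch: the mapping and names below are assumptions,
# not the contents of ProfSoft's Create_table ado.
import re

# Stata numeric storage types mapped to generic SQL column types (assumed).
NUMERIC_TYPES = {
    "byte": "SMALLINT",
    "int": "SMALLINT",
    "long": "INTEGER",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
}


def sql_type(stata_type):
    """Translate one Stata storage type (e.g. 'long', 'str12') to SQL."""
    if stata_type in NUMERIC_TYPES:
        return NUMERIC_TYPES[stata_type]
    match = re.fullmatch(r"str(\d+)", stata_type)
    if match:
        return "VARCHAR(%s)" % match.group(1)
    raise ValueError("unknown Stata type: %s" % stata_type)


def create_table_sql(table, columns, index_on=()):
    """Return CREATE TABLE and CREATE INDEX statements for a dataset.

    columns is a list of (variable name, Stata storage type) pairs.
    """
    body = ",\n".join("    %s %s" % (name, sql_type(t)) for name, t in columns)
    statements = ["CREATE TABLE %s (\n%s\n);" % (table, body)]
    for name in index_on:
        statements.append(
            "CREATE INDEX idx_%s_%s ON %s (%s);" % (table, name, table, name)
        )
    return statements
```

Calling create_table_sql("claims", [("member_id", "long"), ("paid", "double")], index_on=["member_id"]) returns one CREATE TABLE statement and one CREATE INDEX statement. Real DBMS dialects differ in type names, so a production version would need a mapping per target database.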

6
Web-Based, Point-and-click Stata
  • Use PHP to write do files
  • PHP executes Stata, calls do file
  • Stata writes HTML and closes
  • PHP page displays output
  • End-user doesn't know Stata is in the background
  • Process can be synchronous or asynchronous
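
The flow above (generate a do file, run Stata on it, pick up the output) can be sketched as follows. The slides use PHP for this step. The sketch uses Python only to illustrate the pattern, and the file names, variable names, and do-file contents are assumptions. For brevity the generated do file writes a delimited text file rather than HTML.

```python
# Hypothetical sketch: file names, variable names, and the do-file
# contents are illustrative assumptions, not ProfSoft's generated code.
import os
import tempfile


def render_do_file(dataset, measure, group_var, out_file):
    """Return the text of a do file that totals `measure` by `group_var`."""
    lines = [
        'use "%s", clear' % dataset,
        "collapse (sum) %s, by(%s)" % (measure, group_var),
        'outsheet using "%s", replace' % out_file,
    ]
    return "\n".join(lines) + "\n"


def write_do_file(text, directory=None):
    """Save the do-file text to a temporary .do file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".do", dir=directory)
    with os.fdopen(handle, "w") as f:
        f.write(text)
    return path


def stata_command(do_path, executable="stata"):
    """Build the argument list for a batch run (Unix-style -b flag assumed)."""
    return [executable, "-b", "do", do_path]
```

On a machine with Stata installed, passing the result of stata_command to subprocess.run waits for the job to finish (synchronous), while subprocess.Popen returns immediately so the web page can poll for the output later (asynchronous).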

7
Integrating Stata with Excel
  • Excel is the everyday app for our users
  • Use Excel web queries to get to Stata
  • Build URL through forms or user actions
  • Two ways of getting Stata output to Excel:
    • Store Stata output in DBMS
    • Run Stata jobs through PHP
  • Create HTML table to return results to Excel
  • Excel manipulates and formats Stata output
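
Two pieces of the Excel integration above lend themselves to a short illustration: building the web-query URL from form values, and returning results as an HTML table that Excel can import. The sketch below is in Python, and the base URL, parameter names, and sample values are hypothetical, not ProfSoft's actual endpoints or data.

```python
# Hypothetical sketch: the base URL, parameter names, and sample rows are
# illustrative assumptions, not ProfSoft's actual endpoints or data.
from html import escape
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Append form values to the page URL as an encoded query string."""
    return base_url + "?" + urlencode(params)


def html_table(header, rows):
    """Render a header and data rows as a plain HTML table."""
    parts = ["<table>"]
    parts.append(
        "<tr>" + "".join("<th>%s</th>" % escape(str(h)) for h in header) + "</tr>"
    )
    for row in rows:
        parts.append(
            "<tr>" + "".join("<td>%s</td>" % escape(str(v)) for v in row) + "</tr>"
        )
    parts.append("</table>")
    return "\n".join(parts)
```

Encoding the parameters and escaping the cell values keeps characters such as & from corrupting either the URL or the table that the web query reads.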

8
What Works Well
  • Analytic flexibility
  • Performance
  • Calling Stata from a web server is easy
  • Getting Stata datasets to HTML
  • Integration with DBMS systems
  • Hiding Stata from end-users

9
Lessons Learned
  • Segment data as much as possible
  • Be prepared to write special programs to run
    routine statistical procedures
  • Stata statistical programs work with raw data,
    not aggregated data
  • If missing data is not an issue, write your own
    egen or collapse routines
  • Automate memory setting by examining structure of
    dataset you want to use
  • Use DBMS/Copy to handle reading/writing of large
    datasets
  • Version control with CVS
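
The lesson about automating the memory setting can be illustrated with a rough size estimate: observations multiplied by the summed byte widths of the variables, plus some headroom. The sketch below is in Python. The 1.5x headroom factor and the function names are assumptions, and the estimate ignores any per-observation overhead Stata may add.

```python
# Hypothetical sketch: the 1.5x headroom factor and all names are assumptions.
import math
import re

# Bytes per observation for Stata's numeric storage types.
NUMERIC_WIDTHS = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def type_width(stata_type):
    """Return the bytes one value of this storage type occupies."""
    if stata_type in NUMERIC_WIDTHS:
        return NUMERIC_WIDTHS[stata_type]
    match = re.fullmatch(r"str(\d+)", stata_type)
    if match:
        return int(match.group(1))  # strN uses N bytes per observation
    raise ValueError("unknown Stata type: %s" % stata_type)


def dataset_bytes(n_obs, types):
    """Estimate the data size: observations times summed variable widths."""
    return n_obs * sum(type_width(t) for t in types)


def set_memory_command(n_obs, types, headroom=1.5):
    """Return a `set memory` line sized to the dataset plus headroom."""
    megabytes = math.ceil(dataset_bytes(n_obs, types) * headroom / (1024 * 1024))
    return "set memory %dm" % max(megabytes, 1)
```

A do-file generator could emit the returned line before the use command, so each job requests only as much memory as its dataset needs.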

10
Problems
  • Integration with other data formats
  • Infile and outfile are very slow for large datasets
  • DBMS/Analyst was not maintained for Stata 8
  • Limitations of merge command
  • Abbreviations drive us nuts
  • No IDE (integrated development environment)
  • Stata datasets aren't indexed
  • Stata has no name in corporate America
  • Recruiting Stata programmers