1
STA 7923 Bioinformatics II Data Mining
  • Daijin Ko
  • MMS, UTSA
  • Spring, 2008

2
Course
  • Instructor
  • Daijin Ko
  • BB 4.03.18
  • daijin.ko@utsa.edu
  • MW 4:00-5:15 PM
  • Office hours: MW 1:00-1:45 PM or by appointment

3
Grading
  • Homework 20%
  • Project 20%
  • Midterms I and II 40%
  • Final Exam 20%

4
Project
  • Data Mining Project related to course subject
    matter.
  • We will provide some databases to mine; you are
    welcome to use your own data.

5
Team Projects
  • Working in pairs OK, but
  • We will expect more from a pair than from an
    individual.
  • The effort should be roughly evenly distributed.

6
Course Outline
  • Introduction
  • What is data mining?
  • Examples
  • Supervised Learning
  • Linear Regression and Classification models
  • R program
  • Model Assessment
  • Smoothing and Generalized Additive models
  • Classification and Regression Tree
  • Neural Network
  • Bagging and Boosting
  • Random Forest and Support Vector Machine

7
Course Outline (continued)
  • Unsupervised Learning
  • Clustering
  • Association Rules
  • Applications to Microarray Data

8
What is Data Mining?
  • Discovery of useful, possibly unexpected,
    patterns in data.
  • Subsidiary issues
  • Data cleansing: detection of bogus data.
  • E.g., age = 150.
  • Visualization: something better than megabyte
    files of output.
  • Warehousing of data (for retrieval).

9
Typical Kinds of Patterns
  • Decision trees: succinct ways to classify by
    testing properties.
  • Clusters: another succinct classification, by
    similarity of properties.
  • Bayes, hidden-Markov, and other statistical
    models, and frequent itemsets: expose important
    associations within data.
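
Frequent itemsets are the simplest of these patterns to compute: count how often items co-occur in baskets and keep the pairs above a support threshold. A minimal Python sketch over a made-up toy basket dataset (illustrative only, not code from the course):

```python
from itertools import combinations
from collections import Counter

# Toy transaction data (hypothetical baskets).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "chips"},
    {"beer", "diapers", "milk"},
]

def frequent_pairs(baskets, min_support):
    """Count item pairs and keep those appearing in >= min_support baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(baskets, min_support=3))
# ("beer", "diapers") appears in 3 of the 4 baskets
```

Real frequent-itemset miners (e.g. Apriori) extend this idea to larger itemsets while pruning candidates whose subsets are already infrequent.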

10
Supervised vs. Unsupervised Learning
  • Supervised: problem solving
  • Driven by real business problems and historical
    data
  • Quality of results dependent on quality of data
  • Unsupervised: exploration (aka clustering)
  • Relevance often an issue
  • Beer and baby diapers (who cares?)
  • Useful when trying to get an initial
    understanding of the data
  • Non-obvious patterns can sometimes pop out of a
    completed data analysis project

11
Example Clusters
[Figure: scatter plot of points grouped into three clusters]
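
Clusters like these can be found automatically. A minimal pure-Python k-means sketch over three made-up blobs of 2-D points (illustrative only; the data and parameters are invented):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest centroid, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                                + (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Three well-separated 3x3 blobs of points, like the picture above.
pts = [(cx + dx, cy + dy)
       for cx, cy in [(0, 0), (10, 0), (5, 8)]
       for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
centroids, clusters = kmeans(pts, k=3)
print(sorted(len(c) for c in clusters))
```

With well-separated blobs, k-means usually recovers the visual grouping; with poor initialization it can converge to a worse local optimum, which is why real implementations use multiple restarts.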
12
Clustering of Gene Markers
  • Patients clustered based on survival

13
Goal of Data Mining
  • Simplification and automation of the overall
    statistical process, from data source(s) to model
    application
  • Changed over the years
  • Replace statistical models → better models, less
    grunge work
  • Many different data mining algorithms / tools
    available
  • Statistical expertise required to compare
    different techniques
  • Build intelligence into the software

14
Related Fields
Machine Learning
Visualization

Data Mining/ Knowledge Discovery
Statistics
Databases
15
Cultures
  • Databases: concentrate on large-scale
    (non-main-memory) data.
  • AI (machine learning): concentrate on complex
    methods.
  • Statistics: concentrate on inferring models.

16
Applications
  • Science
  • astronomy, bioinformatics, drug discovery, ...
  • Business
  • advertising, CRM (Customer Relationship
    Management), investments, manufacturing,
    sports/entertainment, telecom, e-commerce,
    targeted marketing, health care, ...
  • Web
  • search engines, bots, ...
  • Government
  • law enforcement, profiling tax cheaters,
    anti-terror(?)

17
Why Data Mining? Why not Statistics?
  • Data Flood
  • Bank, telecom, other business transactions ...
  • Scientific data: astronomy, biology, etc.
  • Web, text, and e-commerce
  • Computer Power
  • Moore's law: computing power doubles every 18
    months
  • Powerful workstations became common
  • Cost-effective servers (SMPs) provide parallel
    processing to the mass market
  • Interesting tradeoff: a small number of large
    analyses vs. a large number of small analyses
  • Improved Algorithms
  • Techniques have often been waiting for computing
    technology to catch up
  • Statisticians already doing manual data mining
  • Good machine learning is just the intelligent
    application of statistical processes

18
Big Data Examples
  • Europe's Very Long Baseline Interferometry (VLBI)
    has 16 telescopes, each of which produces 1
    Gigabit/second of astronomical data over a 25-day
    observation session
  • storage and analysis a big problem
  • AT&T handles billions of calls per day
  • so much data, it cannot be all stored -- analysis
    has to be done on the fly, on streaming data

19
Largest databases in 2003
  • Commercial databases
  • Winter Corp. 2003 survey: France Telecom has the
    largest decision-support DB (30 TB); AT&T 26 TB
  • Web
  • Alexa internet archive: 7 years of data, 500 TB
  • Google: searches 4 billion pages, many hundreds
    of TB
  • IBM WebFountain, 160 TB (2003)
  • Internet Archive (www.archive.org), 300 TB

20
Data Growth Rate
  • Twice as much information was created in 2002 as
    in 1999 (30% growth rate)
  • Other growth rate estimates even higher
  • Very little data will ever be looked at by a
    human
  • Data Mining is NEEDED to make sense and use of
    data.

21
Meaningfulness of Answers
  • A big risk when data mining is that you will
    discover patterns that are meaningless.
  • Statisticians call it Bonferroni's principle:
    (roughly) if you look in more places for
    interesting patterns than your amount of data
    will support, you are bound to find crap.

22
Examples
  • A big objection to TIA was that it was looking
    for so many vague connections that it was sure to
    find things that were bogus and thus violate
    innocents' privacy.
  • The Rhine paradox: a great example of how not to
    conduct scientific research.

23
Rhine Paradox --- (1)
  • David Rhine was a parapsychologist in the 1950s
    who hypothesized that some people had
    Extra-Sensory Perception.
  • He devised an experiment where subjects were
    asked to guess 10 hidden cards --- red or blue.
  • He discovered that almost 1 in 1000 had ESP ---
    they were able to get all 10 right!
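
The arithmetic behind this "discovery" is simple: each all-correct run has probability 2^-10 by pure chance, so roughly one "psychic" per thousand subjects is inevitable. A quick check:

```python
# Each subject guesses 10 red/blue cards; getting all 10 right by
# pure chance has probability (1/2)^10.
p_all_right = 0.5 ** 10            # = 1/1024, about 1 in 1000
subjects = 1000
expected_psychics = subjects * p_all_right
# By chance alone we expect about one "ESP" subject per 1000 tested.
print(p_all_right, expected_psychics)
```

No ESP is needed to reproduce Rhine's headline result; the follow-up test then shows regression to chance, as the next slides describe.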

24
Rhine Paradox --- (2)
  • He told these people they had ESP and called them
    in for another test of the same type.
  • Alas, he discovered that almost all of them had
    lost their ESP.
  • What did he conclude?
  • Answer on next slide.

25
Rhine Paradox --- (3)
  • He concluded that you shouldn't tell people they
    have ESP; it causes them to lose it.

26
A Concrete Example
  • This example illustrates a problem with
    intelligence-gathering.
  • Suppose we believe that certain groups of
    evil-doers are meeting occasionally in hotels to
    plot doing evil.
  • We want to find people who at least twice have
    stayed at the same hotel on the same day.

27
The Details
  • 10^9 people being tracked.
  • 1,000 days.
  • Each person stays in a hotel 1% of the time (10
    days out of 1000).
  • Hotels hold 100 people (so 10^5 hotels).
  • If everyone behaves randomly (i.e., no
    evil-doers), will the data mining detect anything
    suspicious?

28
Calculations --- (1)
  • Probability that persons p and q will be at the
    same hotel on day d:
  • 1/100 × 1/100 × 10^-5 = 10^-9.
  • Probability that p and q will be at the same
    hotel on two given days:
  • 10^-9 × 10^-9 = 10^-18.
  • Pairs of days:
  • ≈ 5 × 10^5.

29
Calculations --- (2)
  • Probability that p and q will be at the same
    hotel on some two days:
  • 5 × 10^5 × 10^-18 = 5 × 10^-13.
  • Pairs of people:
  • ≈ 5 × 10^17.
  • Expected number of suspicious pairs of people:
  • 5 × 10^17 × 5 × 10^-13 = 250,000.
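
These figures are easy to verify numerically. A quick Python check of the slide's arithmetic, using exact pair counts rather than the rounded 5 × 10^5 and 5 × 10^17:

```python
# Back-of-envelope check of the slide's arithmetic (pure chance, no evil-doers).
people = 10 ** 9
days = 1000
p_in_hotel = 0.01        # each person is at some hotel on 1% of days
hotels = 10 ** 5

p_same_day = p_in_hotel * p_in_hotel / hotels   # both at hotels, same one: 1e-9
p_two_days = p_same_day ** 2                    # same hotel on two given days: 1e-18
day_pairs = days * (days - 1) / 2               # ~5e5 pairs of days
people_pairs = people * (people - 1) / 2        # ~5e17 pairs of people
expected_suspicious = people_pairs * day_pairs * p_two_days
print(round(expected_suspicious))               # close to the slide's 250,000
```

The exact count comes out just under 250,000; the slide's rounded figure is the same order of magnitude, which is all the argument needs.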

30
Conclusion
  • Suppose there are (say) 10 pairs of evil-doers
    who definitely stayed at the same hotel twice.
  • Analysts have to sift through 250,010 candidates
    to find the 10 real cases.
  • Not gonna happen.
  • But how can we improve the scheme?

31
Data Mining for Customer Modeling
  • Customer Tasks
  • attrition prediction
  • targeted marketing
  • cross-sell, customer acquisition
  • credit-risk
  • fraud detection
  • Industries
  • banking, telecom, retail sales, ...

32
Case Study I
  • Situation: attrition rate for mobile phone
    customers is around 25-30% a year!
  • Task
  • Given customer information for the past N months,
    predict who is likely to attrite next month.
  • Also, estimate customer value and what is the
    cost-effective offer to be made to this customer.

33
Results
  • Verizon Wireless built a customer data warehouse
  • Identified potential attriters
  • Developed multiple, regional models
  • Targeted customers with high propensity to accept
    the offer
  • Reduced attrition rate from over 2%/month to
    under 1.5%/month (huge impact, with >30M
    subscribers)
  • (Reported in 2003)

34
Case Study II Assessing Credit Risk
  • Situation: a person applies for a loan
  • Task: should the bank approve the loan?
  • Note: people who have the best credit don't need
    the loans, and people with the worst credit are
    not likely to repay. Banks' best customers are in
    the middle.

35
Results
  • Banks develop credit models using a variety of
    machine learning methods.
  • Mortgage and credit card proliferation are the
    result of being able to successfully predict
    whether a person is likely to default on a loan
  • Widely deployed in many countries

36
Successful e-commerce Case Study
  • A person buys a book (product) at Amazon.com.
  • Task: recommend other books (products) this
    person is likely to buy
  • Amazon does clustering based on books bought
  • Customers who bought "The Elements of Statistical
    Learning" also bought "Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations"
  • Recommendation program is quite successful
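
The core of co-purchase recommendation can be sketched very simply: rank other items by how often they share a basket with the item just bought. A toy Python version over made-up purchase histories (the customer names and abbreviated titles are illustrative; this is not Amazon's actual algorithm):

```python
from collections import Counter

# Hypothetical purchase histories (names and titles invented for illustration).
purchases = {
    "alice": {"ESL", "DM with Java", "R Cookbook"},
    "bob":   {"ESL", "DM with Java"},
    "carol": {"ESL", "R Cookbook"},
    "dave":  {"DM with Java"},
}

def also_bought(item, purchases):
    """Rank other items by how often they co-occur with `item` in a basket."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            for other in basket - {item}:
                counts[other] += 1
    return counts.most_common()

print(also_bought("ESL", purchases))
```

Production recommenders normalize these counts (so universally popular items don't dominate) and precompute item-to-item similarities offline, but the co-occurrence idea is the same.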

37
Unsuccessful e-commerce case study (KDD-Cup 2000)
  • Data: clickstream and purchase data from
    Gazelle.com, a legwear and legcare e-tailer
  • Q: characterize visitors who spend more than $12
    on an average order at the site
  • Dataset of 3,465 purchases, 1,831 customers
  • Very interesting analysis by Cup participants
  • thousands of hours -- $X,000,000 (millions) in
    consulting
  • Total sales -- $Y,000
  • Obituary: Gazelle.com out of business, Aug 2000

38
Genomic Microarrays Case Study
  • Given microarray data for a number of samples
    (patients), can we
  • Accurately diagnose the disease?
  • Predict outcome for given treatment?
  • Recommend best treatment?

39
Example ALL/AML data
  • 38 training cases, 34 test cases, 7,000 genes
  • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs.
    Acute Myeloid Leukemia (AML)
  • Use training data to build a diagnostic model

Results on test data: 33/34 correct; the 1 error may
be a mislabeled sample
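
To give a flavor of how such a diagnostic model might work, here is a nearest-centroid classifier on synthetic "expression" data. This is a hedged sketch under invented assumptions (5 fake genes, Gaussian values, a 2-unit mean shift between classes); the actual ALL/AML study used real microarray data and different methods:

```python
import random

def centroid(rows):
    """Mean expression profile of a class (column-wise average)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid_predict(x, centroids):
    """Assign x to the class whose centroid is closest in squared distance."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# Synthetic data: 5 "genes", two classes whose mean expression differs by 2.
rng = random.Random(1)
train = {"ALL": [[rng.gauss(0, 1) for _ in range(5)] for _ in range(19)],
         "AML": [[rng.gauss(2, 1) for _ in range(5)] for _ in range(19)]}
centroids = {label: centroid(rows) for label, rows in train.items()}

# A profile matching the AML-like mean should be classified as AML.
print(nearest_centroid_predict([2.0] * 5, centroids))
```

With 7,000 real genes the key extra step is gene selection (keeping only the most class-discriminating genes) before building any such model.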
40
Security and Fraud Detection - Case Study
  • Credit Card Fraud Detection
  • Detection of Money laundering
  • FAIS (US Treasury)
  • Securities Fraud
  • NASDAQ KDD system
  • Phone fraud
  • AT&T, Bell Atlantic, British Telecom/MCI
  • Bio-terrorism detection at Salt Lake Olympics 2002

41
Problems Suitable for Data-Mining
  • require knowledge-based decisions
  • have a changing environment
  • have sub-optimal current methods
  • have accessible, sufficient, and relevant data
  • provide high payoff for the right decisions!
  • Privacy considerations important if personal data
    is involved

42
Many Names of DM
  • Data Fishing, Data Dredging: 1960-
  • used by statisticians (as a pejorative)
  • Data Mining: 1990-
  • used by the DB and business communities
  • acquired a bad image in 2003 because of TIA
  • (Total / "Terrorism" Information Awareness)
  • Knowledge Discovery in Databases (1989-)
  • used by the AI / Machine Learning community
  • also Data Archaeology, Information Harvesting,
    Information Discovery, Knowledge Extraction, ...

Currently "Data Mining" and "Knowledge Discovery"
are used interchangeably
43
Good Names
  • Machine Learning
  • Statistical Learning
  • Knowledge Discovery
  • Data Mining

44
Summary
  • Technology trends lead to data flood
  • data mining is needed to make sense of data
  • Data Mining has many applications, successful and
    not
  • Data Mining Tasks
  • classification, clustering, ...