STA 7923 Bioinformatics II: Data Mining

1
STA 7923 Bioinformatics II: Data Mining

Daijin Ko
MMS, UTSA
Spring, 2008

2
Course

Instructor
Daijin Ko
BB 4.03.18
daijin.ko_at_utsa.edu
MW 4:00-5:15 PM
Office hours: MW 1:00-1:45 PM or by appointment

3
Grading

Homework: 20%
Project: 20%
Midterm I and II: 40%
Final Exam: 20%

4
Project

Data mining project related to the course subject
matter.
We will provide some databases to mine; you are
welcome to use your own data.

5
Team Projects

Working in pairs is OK, but we will expect more
from a pair than from an individual.
The effort should be roughly evenly distributed.

6
Course Outline

Introduction
What is data mining?
Examples
Supervised Learning
Linear Regression and Classification models
R program
Model Assessment
Smoothing and Generalized Additive models
Classification and Regression Tree
Neural Network
Bagging and Boosting
Random Forest and Support Vector Machine

7
Course Outline (continued)

Unsupervised Learning
Clustering
Association Rules
Applications to Microarray Data

8
What is Data Mining?

Discovery of useful, possibly unexpected,
patterns in data.
Subsidiary issues:
Data cleansing: detection of bogus data.
E.g., age = 150.
Visualization: something better than megabyte
files of output.
Warehousing of data (for retrieval).
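
The age example can be sketched as a simple plausibility check. This is only an illustration: the record layout, the field name `age`, the bounds 0 to 120, and the sample records are all assumptions, not part of the course material.

```python
# Hypothetical plausibility bounds; the slide only names "age = 150" as bogus.
MIN_AGE, MAX_AGE = 0, 120

def split_by_age(records):
    """Separate records into plausible and suspect lists using the age field."""
    plausible, suspect = [], []
    for rec in records:
        age = rec.get("age")
        if isinstance(age, (int, float)) and MIN_AGE <= age <= MAX_AGE:
            plausible.append(rec)
        else:
            # Missing or out-of-range ages are flagged for review, not deleted.
            suspect.append(rec)
    return plausible, suspect

# Made-up sample records for illustration.
records = [{"id": 1, "age": 34}, {"id": 2, "age": 150}, {"id": 3, "age": None}]
plausible, suspect = split_by_age(records)
print([r["id"] for r in suspect])
```

Flagging rather than dropping suspect records keeps the decision about what counts as bogus with the analyst.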

9
Typical Kinds of Patterns

Decision trees: succinct ways to classify by
testing properties.
Clusters: another succinct classification by
similarity of properties.
Bayes, hidden-Markov, and other statistical
models, frequent-itemsets: expose important
associations within data.

10
Supervised vs. Unsupervised Learning

Supervised: problem solving
Driven by real business problems and historical
data
Quality of results dependent on quality of data
Unsupervised: exploration (aka clustering)
Relevance often an issue
Beer and baby diapers (who cares?)
Useful when trying to get an initial
understanding of the data
Non-obvious patterns can sometimes pop out of a
completed data analysis project

11
Example Clusters
[Figure: scatter plot of points (x) that fall into several clusters]
12
Clustering of Gene Markers

Patients clustered based on survival

13
Goal of Data Mining

Simplification and automation of the overall
statistical process, from data source(s) to model
application
Changed over the years
Replace statistical models -> Better models, less
grunge work
Many different data mining algorithms / tools
available
Statistical expertise required to compare
different techniques
Build intelligence into the software

14
Related Fields

[Diagram: Data Mining / Knowledge Discovery at the
center, surrounded by Machine Learning,
Visualization, Statistics, and Databases]

15
Cultures

Databases: concentrate on large-scale
(non-main-memory) data.
AI (machine learning): concentrate on complex
methods.
Statistics: concentrate on inferring models.

16
Applications

Science: astronomy, bioinformatics, drug
discovery, ...
Business: advertising, CRM (Customer Relationship
Management), investments, manufacturing,
sports/entertainment, telecom, e-commerce,
targeted marketing, health care, ...
Web: search engines, bots, ...
Government: law enforcement, profiling tax
cheaters, anti-terror(?)

17
Why Data Mining? Why not Statistics?

Data Flood
Bank, telecom, other business transactions ...
Scientific data: astronomy, biology, etc.
Web, text, and e-commerce
Computer Power
Moore's law: computing power doubles every 18
months
Powerful workstations became common
Cost-effective servers (SMPs) provide parallel
processing to the mass market
Interesting tradeoff: small number of large
analyses vs. large number of small analyses
Improved Algorithms
Techniques have often been waiting for computing
technology to catch up
Statisticians already doing manual data mining
Good machine learning is just the intelligent
application of statistical processes

18
Big Data Examples

Europe's Very Long Baseline Interferometry (VLBI)
has 16 telescopes, each of which produces 1
Gigabit/second of astronomical data over a 25-day
observation session
storage and analysis are a big problem
AT&T handles billions of calls per day
so much data that it cannot all be stored --
analysis has to be done on the fly, on streaming data
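
To get a feel for the VLBI figures, the arithmetic can be sketched as below. It assumes every telescope records continuously at the full 1 Gigabit/second for the whole 25-day session, and it uses decimal units (1 PB = 10^15 bytes); both are assumptions made here for illustration.

```python
TELESCOPES = 16
BITS_PER_SECOND = 10**9          # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60

# Total raw data for one session, assuming uninterrupted recording.
total_bits = TELESCOPES * BITS_PER_SECOND * SESSION_DAYS * SECONDS_PER_DAY
total_bytes = total_bits // 8
total_petabytes = total_bytes / 10**15

print(f"{total_petabytes:.2f} PB per session")
```

Under these assumptions one session yields a few petabytes, an order of magnitude more than the largest 2003 databases listed on the next slide.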

19
Largest databases in 2003

Commercial databases
Winter Corp. 2003 Survey: France Telecom has the
largest decision-support DB, 30 TB; AT&T, 26 TB
Web
Alexa internet archive: 7 years of data, 500 TB
Google searches 4 billion pages, many hundreds
of TB
IBM WebFountain, 160 TB (2003)
Internet Archive (www.archive.org), 300 TB

20
Data Growth Rate

Twice as much information was created in 2002 as
in 1999 (roughly 30% annual growth rate)
Other growth rate estimates even higher
Very little data will ever be looked at by a
human
Data Mining is NEEDED to make sense and use of
data.
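
As a quick check on the growth figure, here is a small sketch that converts "twice as much in three years" into a compound annual rate. It assumes the slide's growth rate means a compound annual rate over 1999 to 2002; the function name is illustrative.

```python
def annual_growth_rate(ratio, years):
    """Compound annual growth rate implied by an overall ratio over a span of years."""
    return ratio ** (1 / years) - 1

# Doubling between 1999 and 2002 (three years).
rate = annual_growth_rate(2.0, 3)
print(f"implied annual growth: {rate:.1%}")

# Conversely, a flat 30% per year compounds to a bit more than doubling.
print(f"30% per year over 3 years: x{1.3 ** 3:.2f}")
```

Exact doubling corresponds to about 26% per year, so the slide's 30% is a rounded figure.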

21
Meaningfulness of Answers

A big risk when data mining is that you will
discover patterns that are meaningless.
Statisticians call it Bonferroni's principle:
(roughly) if you look in more places for
interesting patterns than your amount of data
will support, you are bound to find crap.

22
Examples

A big objection to TIA was that it was looking
for so many vague connections that it was sure to
find things that were bogus and thus violate
innocents' privacy.
The Rhine Paradox: a great example of how not to
conduct scientific research.

23
Rhine Paradox --- (1)

Joseph Rhine was a parapsychologist in the 1950s
who hypothesized that some people had
Extra-Sensory Perception.
He devised an experiment where subjects were
asked to guess 10 hidden cards --- red or blue.
He discovered that almost 1 in 1000 had ESP ---
they were able to get all 10 right!

24
Rhine Paradox --- (2)

He told these people they had ESP and called them
in for another test of the same type.
Alas, he discovered that almost all of them had
lost their ESP.
What did he conclude?
Answer on next slide.

25
Rhine Paradox --- (3)

He concluded that you shouldn't tell people they
have ESP; it causes them to lose it.
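
The chance explanation behind the paradox can be checked with a few lines of arithmetic. This sketch assumes each red/blue guess is an independent coin flip; the subject counts are illustrative.

```python
from fractions import Fraction

CARDS = 10

# Probability of guessing all 10 red/blue cards correctly by pure chance.
p_all_right = Fraction(1, 2) ** CARDS

def expected_perfect_scores(subjects):
    """Expected number of subjects who get every card right by luck alone."""
    return subjects * p_all_right

print(p_all_right)                     # 1/1024
print(expected_perfect_scores(1024))   # 1

# A "gifted" subject repeats the feat on a second test with the same probability.
p_repeat = p_all_right
print(float(p_repeat))
```

About 1 subject in 1000 passes the first test by luck, and almost none pass the retest, which is what was observed.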

26
A Concrete Example

This example illustrates a problem with
intelligence-gathering.
Suppose we believe that certain groups of
evil-doers are meeting occasionally in hotels to
plot doing evil.
We want to find people who at least twice have
stayed at the same hotel on the same day.

27
The Details

$10^{9}$ people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10
days out of 1000).
Hotels hold 100 people (so $10^{5}$ hotels).
If everyone behaves randomly (i.e., no
evil-doers), will the data mining detect anything
suspicious?

28
Calculations --- (1)

Probability that persons p and q will be at the
same hotel on day d:
$\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$.
Probability that p and q will be at the same
hotel on two given days:
$10^{-9} \times 10^{-9} = 10^{-18}$.
Pairs of days:
$5 \times 10^{5}$.

29
Calculations --- (2)

Probability that p and q will be at the same
hotel on some two days:
$5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
Pairs of people:
$5 \times 10^{17}$.
Expected number of suspicious pairs of people:
$5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.
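
The calculation on these two slides can be reproduced with a short script. This is a sketch under the slide's assumptions (independent, uniformly random hotel stays); the function and parameter names are illustrative. It uses exact pair counts rather than the rounded 5 x 10^5 and 5 x 10^17, so the result comes out slightly below the slide's rounded figure.

```python
from math import comb

def expected_suspicious_pairs(people, days, stay_prob, hotels):
    """Expected number of pairs seen at the same hotel on two different days."""
    # Both people are in some hotel that day, and they pick the same one.
    p_same_hotel_one_day = stay_prob * stay_prob / hotels
    # The same coincidence on two specific days.
    p_two_given_days = p_same_hotel_one_day ** 2
    # Approximate probability over any choice of two days.
    p_some_two_days = comb(days, 2) * p_two_given_days
    # Scale by the number of pairs of people.
    return comb(people, 2) * p_some_two_days

expected = expected_suspicious_pairs(
    people=10**9, days=1000, stay_prob=0.01, hotels=10**5
)
print(f"{expected:,.0f} suspicious pairs expected by chance")
```

Changing the parameters shows how quickly the number of chance coincidences grows with the number of people tracked.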

30
Conclusion

Suppose there are (say) 10 pairs of evil-doers
who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates
to find the 10 real cases.
Not gonna happen.
But how can we improve the scheme?

31
Data Mining for Customer Modeling

Customer Tasks
attrition prediction
targeted marketing
cross-sell, customer acquisition
credit-risk
fraud detection
Industries
banking, telecom, retail sales,

32
Case Study I

Situation: Attrition rate for mobile phone
customers is around 25-30% a year!
Task
Given customer information for the past N months,
predict who is likely to attrite next month.
Also, estimate customer value and what is the
cost-effective offer to be made to this customer.

33
Results

Verizon Wireless built a customer data warehouse
Identified potential attriters
Developed multiple, regional models
Targeted customers with high propensity to accept
the offer
Reduced attrition rate from over 2%/month to
under 1.5%/month (huge impact, with >30 M
subscribers)
(Reported in 2003)
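
To see why the impact is called huge, the reported rates can be turned into customer counts. This sketch takes the slide's bounds at face value (exactly 30 million subscribers, exactly 2% and 1.5% per month), so it gives a rough lower bound rather than a reported figure.

```python
SUBSCRIBERS = 30_000_000

# Monthly attrition rates in basis points (1 bp = 0.01%) to keep the math exact.
RATE_BEFORE_BP = 200   # 2% per month
RATE_AFTER_BP = 150    # 1.5% per month

lost_before = SUBSCRIBERS * RATE_BEFORE_BP // 10_000
lost_after = SUBSCRIBERS * RATE_AFTER_BP // 10_000
retained_per_month = lost_before - lost_after

print(f"lost per month before: {lost_before:,}")
print(f"lost per month after:  {lost_after:,}")
print(f"additional customers retained per month: {retained_per_month:,}")
```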

34
Case Study II Assessing Credit Risk

Situation: Person applies for a loan
Task: Should a bank approve the loan?
Note: People who have the best credit don't need
the loans, and people with the worst credit are
not likely to repay. Banks' best customers are in
the middle

35
Results

Banks develop credit models using a variety of
machine learning methods.
Mortgage and credit card proliferation are the
results of being able to successfully predict if
a person is likely to default on a loan
Widely deployed in many countries

36
Successful e-commerce Case Study

A person buys a book (product) at Amazon.com.
Task: Recommend other books (products) this
person is likely to buy
Amazon does clustering based on books bought:
customers who bought "The Elements of Statistical
Learning" also bought "Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations"
Recommendation program is quite successful
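
The "customers who bought X also bought Y" idea can be illustrated with simple co-purchase counting. This is not Amazon's actual method, which the slide does not describe beyond "clustering"; it is a minimal sketch, and the baskets and function names below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(item, baskets, top_n=3):
    """Items most often bought together with `item`, most frequent first."""
    counts = co_purchase_counts(baskets)
    related = Counter({b: n for (a, b), n in counts.items() if a == item})
    return [b for b, _ in related.most_common(top_n)]

# Made-up purchase histories, one list per customer.
baskets = [
    ["Elements of Statistical Learning", "Data Mining", "R Handbook"],
    ["Elements of Statistical Learning", "Data Mining"],
    ["Elements of Statistical Learning", "Mystery Novel"],
    ["Mystery Novel", "Cookbook"],
]
print(also_bought("Elements of Statistical Learning", baskets))
```

Raw counts favor popular items, so a practical system would normalize by how often each item is bought overall.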

37
Unsuccessful e-commerce case study (KDD-Cup 2000)

Data: clickstream and purchase data from
Gazelle.com, legwear and legcare e-tailer
Q: Characterize visitors who spend more than $12
on an average order at the site
Dataset of 3,465 purchases, 1,831 customers
Very interesting analysis by Cup participants:
thousands of hours - $X,000,000 (millions) of
consulting
Total sales -- $Y,000
Obituary: Gazelle.com out of business, Aug 2000

38
Genomic Microarrays Case Study

Given microarray data for a number of samples
(patients), can we
Accurately diagnose the disease?
Predict outcome for given treatment?
Recommend best treatment?

39
Example: ALL/AML data

38 training cases, 34 test, 7,000 genes
2 classes: Acute Lymphoblastic Leukemia (ALL) vs
Acute Myeloid Leukemia (AML)
Use train data to build diagnostic model

Results on test data: 33/34 correct; the 1 error
may be a mislabeled case

40
Security and Fraud Detection - Case Study

Credit card fraud detection
Detection of money laundering: FAIS (US Treasury)
Securities fraud: NASDAQ KDD system
Phone fraud: AT&T, Bell Atlantic, British
Telecom/MCI
Bio-terrorism detection at Salt Lake Olympics 2002

41
Problems Suitable for Data-Mining

require knowledge-based decisions
have a changing environment
have sub-optimal current methods
have accessible, sufficient, and relevant data
provide high payoff for the right decisions!
Privacy considerations are important if personal
data is involved

42
Many Names of DM

Data Fishing, Data Dredging: 1960-
used by statisticians (as a bad name)
Data Mining: 1990-
used by DB, business
in 2003, bad image because of TIA
(Total "Terrorism" Information Awareness)
Knowledge Discovery in Databases: 1989-
used by AI, machine learning community
also Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, ...

Currently, Data Mining and Knowledge Discovery
are used interchangeably

43
Good Names