Title: STA 7923 Bioinformatics II Data Mining
1STA 7923 Bioinformatics II Data Mining
- Daijin Ko
- MMS, UTSA
- Spring, 2008
2Course
- Instructor
- Daijin Ko
- BB 4.03.18
- daijin.ko_at_utsa.edu
- MW 4:00-5:15 PM
- Office hours: MW 1:00-1:45 PM or by appointment
3Grading
- Homework: 20%
- Project: 20%
- Midterm I and II: 40%
- Final Exam: 20%
4Project
- Data Mining Project related to course subject matter.
- We will provide some databases to mine; you are welcome to use your own data.
5Team Projects
- Working in pairs is OK, but
- We will expect more from a pair than from an individual.
- The effort should be roughly evenly distributed.
6Course Outline
- Introduction
- What is data mining?
- Examples
- Supervised Learning
- Linear Regression and Classification models
- R program
- Model Assessment
- Smoothing and Generalized Additive models
- Classification and Regression Tree
- Neural Network
- Bagging and Boosting
- Random Forest and Support Vector Machine
7Course Outline (continued)
- Unsupervised Learning
- Clustering
- Association Rules
- Applications to Microarray Data
8What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
- Data cleansing: detection of bogus data.
- E.g., age = 150.
- Visualization: something better than megabyte files of output.
- Warehousing of data (for retrieval).
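A minimal sketch of the data-cleansing idea, flagging values such as an age of 150. The records, field names, and the plausible range of 0 to 120 are assumptions made up for illustration, not part of the course material.

```python
# Hypothetical records; the field names and values are made up for illustration.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 150},   # implausible: likely a data-entry error
    {"id": 3, "age": -2},    # implausible: negative age
    {"id": 4, "age": 71},
]

# Assumed plausibility bounds; a real project would take these from domain experts.
MIN_AGE, MAX_AGE = 0, 120


def split_bogus(rows, lo=MIN_AGE, hi=MAX_AGE):
    """Separate rows whose age lies inside [lo, hi] from rows that do not."""
    clean = [r for r in rows if lo <= r["age"] <= hi]
    bogus = [r for r in rows if not (lo <= r["age"] <= hi)]
    return clean, bogus


clean, bogus = split_bogus(records)
print("kept:", [r["id"] for r in clean])
print("flagged:", [r["id"] for r in bogus])
```

Flagged rows are set aside rather than deleted, so that someone can check whether they are errors or genuine outliers.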
9Typical Kinds of Patterns
- Decision trees: succinct ways to classify by testing properties.
- Clusters: another succinct classification by similarity of properties.
- Bayes, hidden-Markov, and other statistical models, frequent-itemsets: expose important associations within data.
10Supervised vs. Unsupervised Learning
- Supervised: Problem solving
- Driven by a real business problem and historical data
- Quality of results dependent on quality of data
- Unsupervised: Exploration (aka clustering)
- Relevance often an issue
- Beer and baby diapers (who cares?)
- Useful when trying to get an initial understanding of the data
- Non-obvious patterns can sometimes pop out of a completed data analysis project
11Example Clusters
[Figure: scatter plot of points, each marked x, grouped into clusters]
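One common way to find clusters like these is k-means. The sketch below is a plain-Python illustration under stated assumptions: the points and starting centers are made up, and the slide does not say which clustering method produced the figure.

```python
import math

# Hypothetical 2-D points forming two visually separate groups (made up for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.8), (2.0, 1.2),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.2), (8.2, 9.5)]


def kmeans(pts, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of its assigned points; repeat a fixed number of times."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in pts]
        # Update step: recompute each center as the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers


# Starting centers are an arbitrary choice here; results can depend on them.
labels, centers = kmeans(points, centers=[points[0], points[4]])
print(labels)
print(centers)
```

The number of clusters has to be chosen in advance, and different starting centers can give different answers on less clearly separated data.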
12Clustering of Gene Markers
- Patients clustered based on survival
13Goal of Data Mining
- Simplification and automation of the overall statistical process, from data source(s) to model application
- Changed over the years
- Replace statistical models -> Better models, less grunge work
- Many different data mining algorithms / tools available
- Statistical expertise required to compare different techniques
- Build intelligence into the software
14Related Fields
Machine Learning
Visualization
Data Mining/ Knowledge Discovery
Statistics
Databases
15Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine-learning): concentrate on complex methods.
- Statistics: concentrate on inferring models.
16Applications
- Science
- astronomy, bioinformatics, drug discovery, ...
- Business
- advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, ...
- Web
- search engines, bots, ...
- Government
- law enforcement, profiling tax cheaters, anti-terror(?)
17Why Data Mining? Why not Statistics?
- Data Flood
- Bank, telecom, other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce
- Computer Power
- Moore's law: computing power doubles every 18 months
- Powerful workstations became common
- Cost-effective servers (SMPs) provide parallel processing to the mass market
- Interesting tradeoff: small number of large analyses vs. large number of small analyses
- Improved Algorithms
- Techniques have often been waiting for computing technology to catch up
- Statisticians already doing manual data mining
- Good machine learning is just the intelligent application of statistical processes
18Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
- storage and analysis a big problem
- AT&T handles billions of calls per day
- so much data, it cannot be all stored; analysis has to be done on the fly, on streaming data
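To get a feel for the VLBI figures, the total volume of one session can be worked out directly. This is a rough sketch that assumes every telescope records continuously for the full 25 days and reads "Giga" and "Peta" as decimal prefixes; neither assumption is stated on the slide.

```python
# Figures taken from the slide: 16 telescopes, 1 Gigabit/second each, 25 days.
telescopes = 16
bits_per_second_each = 1e9          # 1 Gigabit/second, taking "Giga" as 10**9
days = 25

seconds = days * 24 * 60 * 60       # length of the observation session in seconds
total_bits = telescopes * bits_per_second_each * seconds
total_bytes = total_bits / 8
total_petabytes = total_bytes / 1e15   # decimal petabytes (10**15 bytes)

print(f"{total_petabytes:.2f} PB over the session")
```

Under these assumptions a single session comes to a few petabytes, which is why storage and analysis are described as a big problem.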
19Largest databases in 2003
- Commercial databases
- Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, 30 TB; AT&T 26 TB
- Web
- Alexa internet archive: 7 years of data, 500 TB
- Google searches 4 billion pages, many hundreds of TB
- IBM WebFountain, 160 TB (2003)
- Internet Archive (www.archive.org), 300 TB
20Data Growth Rate
- Twice as much information was created in 2002 as in 1999 (about 30% growth rate)
- Other growth rate estimates even higher
- Very little data will ever be looked at by a human
- Data Mining is NEEDED to make sense and use of data.
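The link between "twice as much in three years" and a yearly growth rate can be checked with a short calculation. The sketch assumes constant compound growth between 1999 and 2002, which the slide does not state explicitly.

```python
import math

# Assumption: constant compound growth between 1999 and 2002 (three years).
years = 2002 - 1999
growth_factor = 2.0                      # "twice as much" information

# Annual rate r such that (1 + r) ** years == growth_factor.
annual_rate = growth_factor ** (1 / years) - 1
print(f"implied annual growth: {annual_rate:.1%}")

# Conversely, the number of years needed to double at a flat 30% per year.
doubling_years = math.log(2) / math.log(1.30)
print(f"doubling time at 30%/year: {doubling_years:.2f} years")
```

The implied rate comes out a little under 30% per year, so the slide's figure reads as a rounded value.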
21Meaningfulness of Answers
- A big risk when data mining is that you will discover patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
22Examples
- A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.
23Rhine Paradox --- (1)
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
24Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
- Answer on next slide.
25Rhine Paradox --- (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
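The "almost 1 in 1000" figure is what pure guessing predicts: ten red/blue guesses are all correct with probability (1/2)^10 = 1/1024. The sketch below simulates both rounds of the experiment under the assumption that every subject guesses at random; the number of subjects and the random seeds are hypothetical.

```python
import random

# Chance of guessing 10 red/blue cards correctly by luck alone: 1/1024.
p_all_correct = 0.5 ** 10
print(p_all_correct)


def lucky_guessers(n_subjects, rng, n_cards=10):
    """Count subjects who get every card right when each guess is a coin flip."""
    count = 0
    for _ in range(n_subjects):
        if all(rng.random() < 0.5 for _ in range(n_cards)):
            count += 1
    return count


n_subjects = 100_000                     # hypothetical size of the first round
first_round = lucky_guessers(n_subjects, random.Random(1))
# Only the first-round "ESP" subjects are retested, still guessing at random.
second_round = lucky_guessers(first_round, random.Random(2))

print("perfect in round 1:", first_round)
print("perfect again in round 2:", second_round)
```

With roughly one perfect score per thousand subjects in round one, hardly anyone is expected to repeat it in round two, and no loss of ESP is needed to explain that.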
26A Concrete Example
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find people who at least twice have stayed at the same hotel on the same day.
27The Details
- 10^9 people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so 10^5 hotels, enough for the 10^7 people in hotels on any given day).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
28Calculations --- (1)
- Probability that persons p and q will be at the same hotel on day d:
- 1/100 × 1/100 × 10^-5 = 10^-9.
- Probability that p and q will be at the same hotel on two given days:
- 10^-9 × 10^-9 = 10^-18.
- Pairs of days:
- 5×10^5.
29Calculations --- (2)
- Probability that p and q will be at the same hotel on some two days:
- 5×10^5 × 10^-18 = 5×10^-13.
- Pairs of people:
- 5×10^17.
- Expected number of suspicious pairs of people:
- 5×10^17 × 5×10^-13 = 250,000.
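The calculation on these two slides can be reproduced as a short sketch. The variable names are my own, and exact pair counts from `math.comb` are used instead of the rounded powers of ten, so the result lands slightly below the slide's 250,000 (about 249,750).

```python
from math import comb

# Parameters from the slides.
people = 10**9
days = 1000
hotels = 10**5
p_stay = 0.01   # each person is in some hotel on 1% of the days

# p and q are both in a hotel on day d, and it is the same one of the 10^5 hotels.
p_same_hotel_day = p_stay * p_stay / hotels
# Same hotel on two given days (the days are treated as independent).
p_two_given_days = p_same_hotel_day ** 2

pairs_of_days = comb(days, 2)        # about 5 x 10^5
pairs_of_people = comb(people, 2)    # about 5 x 10^17

# Expected number of pairs of people who look suspicious purely by chance.
expected_suspicious = pairs_of_people * pairs_of_days * p_two_given_days
print(round(expected_suspicious))
```

Like the slides, this ignores the very small chance of a pair coinciding on more than two days.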
30Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
- Not gonna happen.
- But how can we improve the scheme?
31Data Mining for Customer Modeling
- Customer Tasks
- attrition prediction
- targeted marketing
- cross-sell, customer acquisition
- credit-risk
- fraud detection
- Industries
- banking, telecom, retail sales, ...
32Case Study I
- Situation: Attrition rate for mobile phone customers is around 25-30% a year!
- Task:
- Given customer information for the past N months, predict who is likely to attrite next month.
- Also, estimate customer value and what is the cost-effective offer to be made to this customer.
33Results
- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the offer
- Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
- (Reported in 2003)
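A back-of-the-envelope sketch of why this reduction is a huge impact. It treats the figures above as round values of 2% and 1.5% per month and 30 million subscribers, which are illustrative assumptions because the slide only gives bounds.

```python
subscribers = 30_000_000     # slide gives more than 30 M; 30 M used as a round figure
rate_before = 0.02           # monthly attrition before, taken as exactly 2%
rate_after = 0.015           # monthly attrition after, taken as exactly 1.5%

# Customers kept each month who would otherwise have left.
retained_per_month = subscribers * (rate_before - rate_after)


def annual_attrition(monthly_rate):
    """Fraction of customers lost over 12 months if the monthly rate compounds."""
    return 1 - (1 - monthly_rate) ** 12


print(f"customers retained per month: {retained_per_month:,.0f}")
print(f"annual attrition before: {annual_attrition(rate_before):.1%}")
print(f"annual attrition after:  {annual_attrition(rate_after):.1%}")
```

Even a half-point change in the monthly rate corresponds to a six-figure number of customers per month at this scale.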
34Case Study II Assessing Credit Risk
- Situation: Person applies for a loan
- Task: Should a bank approve the loan?
- Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. A bank's best customers are in the middle.
35Results
- Banks develop credit models using a variety of machine learning methods.
- Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan
- Widely deployed in many countries
36Successful e-commerce Case Study
- A person buys a book (product) at Amazon.com.
- Task: Recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought:
- customers who bought "The Elements of Statistical Learning" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- Recommendation program is quite successful
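A "customers who bought X also bought Y" list can be approximated by simple co-purchase counting. The sketch below is that simpler technique, not Amazon's actual clustering method, and the customers and book titles are placeholders invented for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories; customer IDs and titles are placeholders.
orders = {
    "c1": {"Book A", "Book B"},
    "c2": {"Book A", "Book B", "Book C"},
    "c3": {"Book A", "Book C"},
    "c4": {"Book B", "Book D"},
}

# also_bought[x][y] = number of customers who bought both x and y.
also_bought = {}
for items in orders.values():
    for x, y in permutations(items, 2):
        also_bought.setdefault(x, Counter())[y] += 1


def recommend(title, k=2):
    """Return up to k titles most often bought together with `title`."""
    return [t for t, _ in also_bought.get(title, Counter()).most_common(k)]


print(recommend("Book A"))
print(recommend("Book D"))
```

Raw counts favor bestsellers that appear in many baskets, so a practical system would normalize by each title's overall popularity.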
37Unsuccessful e-commerce case study (KDD-Cup 2000)
- Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
- Q: Characterize visitors who spend more than $12 on an average order at the site
- Dataset of 3,465 purchases, 1,831 customers
- Very interesting analysis by Cup participants
- thousands of hours, worth $X,000,000 (millions) of consulting
- Total sales: $Y,000
- Obituary: Gazelle.com out of business, Aug 2000
38Genomic Microarrays Case Study
- Given microarray data for a number of samples (patients), can we
- Accurately diagnose the disease?
- Predict outcome for given treatment?
- Recommend best treatment?
39Example ALL/AML data
- 38 training cases, 34 test, 7,000 genes
- 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
- Use train data to build diagnostic model
- Results on test data: 33/34 correct, 1 error may be mislabeled
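The train-then-test workflow can be sketched with a nearest-centroid classifier. Everything here is an assumption for illustration: the data are synthetic random numbers rather than the real ALL/AML measurements, the gene count is reduced, and the slide does not say which model produced the 33/34 result.

```python
import random

rng = random.Random(0)
N_GENES = 200          # the real data set has about 7,000; kept small here


def make_sample(label):
    """Synthetic expression profile: the first 10 'genes' are shifted for class AML."""
    shift = 1.5 if label == "AML" else 0.0
    return [rng.gauss(shift if g < 10 else 0.0, 1.0) for g in range(N_GENES)]


def make_set(n):
    """Build n synthetic cases, half labeled ALL and the rest AML."""
    labels = ["ALL"] * (n // 2) + ["AML"] * (n - n // 2)
    return [make_sample(lab) for lab in labels], labels


def centroid(rows):
    """Per-gene mean over a list of profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


train_x, train_y = make_set(38)   # same sizes as on the slide
test_x, test_y = make_set(34)

# Training: one mean profile (centroid) per class, from training cases only.
centroids = {c: centroid([x for x, y in zip(train_x, train_y) if y == c])
             for c in ("ALL", "AML")}

# Testing: predict the class with the closest centroid, then score held-out cases.
pred = [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in test_x]
correct = sum(p == y for p, y in zip(pred, test_y))
print(f"{correct}/{len(test_y)} test cases correct")
```

The test cases play no part in building the centroids, which is what makes the score an honest estimate of diagnostic accuracy.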
40Security and Fraud Detection - Case Study
- Credit Card Fraud Detection
- Detection of Money laundering
- FAIS (US Treasury)
- Securities Fraud
- NASDAQ KDD system
- Phone fraud
- AT&T, Bell Atlantic, British Telecom/MCI
- Bio-terrorism detection at Salt Lake Olympics 2002
41Problems Suitable for Data-Mining
- require knowledge-based decisions
- have a changing environment
- have sub-optimal current methods
- have accessible, sufficient, and relevant data
- provides high payoff for the right decisions!
- Privacy considerations important if personal data
is involved
42Many Names of DM
- Data Fishing, Data Dredging: 1960-
- used by statisticians (as a bad name)
- Data Mining: 1990-
- used by DB, business
- in 2003 bad image because of TIA
- (Total "Terrorism" Information Awareness)
- Knowledge Discovery in Databases (1989-)
- used by AI, Machine Learning Community
- also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently Data Mining and Knowledge Discovery are used interchangeably
43Good Names
- Machine Learning
- Statistical Learning
- Knowledge Discovery
- Data Mining
44Summary
- Technology trends lead to data flood
- data mining is needed to make sense of data
- Data Mining has many applications, successful and not
- Data Mining Tasks
- classification, clustering, ...