Title: Course on Data Mining (581550-4)
1. Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2. Accepted to Autumn 2001 Course
- Arkko Jouko
- Asikainen Tomi
- Aunimo Lili
- Hyvönen Leena
- Johansson Carl
- Jokinen Sakari
- Kerminen Antti
- Kuokkanen Ville
- Lehmussaari Kari
- Lehtonen Miro
- Löfström Jaakko
- Malinen Johanna
- Mäkelä Eetu
- Ojala Petri
- Palin Kimmo
- Pasanen Janne
- Pietilä Mikko
- Pitkänen Esa
- Rapiokallio Maarit
- Roos Teemu
- Sahlberg Mauri
- Saikku Arja
- Sundman Jonas
- Tarvainen Tero
- Tiihonen Sami
- Tolvanen Juha
- Uusitalo Petri
- Vasankari Minna
- Virtanen Otso
3. Course Organization
Lecturers
Lectures
Course Material
Exercises
Contents
4. Course Organization
Dr. Mika Klemettinen
- PhD Mika Klemettinen
- Email: Mika.Klemettinen_at_nokia.com
- WWW: http://www.cs.helsinki.fi/u/mklemett/
- Room: B356
- Tel: 050-483 6661
- PhD in January 1999
- Thesis: "A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases"
- Data mining and SGML/XML related research at UH/CS (1994-2000) and at Nokia (2000-)
5. Course Organization
Dr. Pirjo Moen
- PhD Pirjo Moen
- Email: Pirjo.Moen_at_cs.helsinki.fi
- WWW: http://www.cs.helsinki.fi/pirjo.moen/
- Room: B350
- Tel: 191 44238
- PhD in February 2000
- Thesis: "Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining"
- Data mining related research at UH/CS (1994-)
6. Course Organization
DM/SGML/XML at UH/CS
- RATI (A structured text database system / Rakenteiset tekstitietokannat), 1988-91
- Data mining from telecommunication alarm data, 1994-97
- Structured and Intelligent Documents (SID), 1995-98
- From Data to Knowledge (FDK), 1995-
- Knowledge workers workstation (TYTTI), 2000-02
- DM Group (-99), DOREMI Group (00-)
Linux was invented here!
7. Course Organization
NRC in Short
- Nokia is the global leader in digital communication technologies, with around 60 000 employees all over the world
- Nokia Research Center (NRC) has around 1 200 employees in Finland, USA, Japan, China, Germany, Hungary, UK, etc.
- NRC's role is to enhance Nokia's technological competitiveness by exploring and developing new technologies
- Strongly involved in many European Union and national research projects
8. Course Organization
DM Group at NRC
- Background
- At the University of Helsinki, Department of Computer Science: data mining methods and theory of data mining since the late 80s
- Association and episode rule mining, time series similarity, analysis of telecommunication alarm data and web logs, etc.
- Other members include
- Dr. Heikki Mannila (group leader)
- Dr. Hannu Toivonen
9. Course Organization
Lectures (1)
- 24.10.-30.11.2001 (12 lectures)
- 7 normal lectures
- 5 seminar-like lectures
- Wed 14-16, Fri 12-14 (A217)
- Wed: normal lecture
- Fri: seminar-like lecture (except for 26.10.)
- Lectures are obligatory
- Normal lectures: 5/7
- Seminar-like lectures: 4/5
- Attendance lists are circulated
10. Course Organization
Lectures (2)
- Lecturing language is Finnish, slides are in English
- Students can also use English
- A foreign student group can be established
- Normal lectures
- Basics, terminology, standard methods
- Lecturer-driven teaching
- Seminar-like lectures
- Extensions to the basic methods
- Lecturer gives an introduction
- Student groups give short presentations
11. Course Organization
Lectures (3)
- Group for seminar (and exercise) work
- 10 groups à 3 persons, 2 groups/lecture
- Dates are agreed at the beginning of the course
- Articles are given on the previous week's Wed
- Seminar presentations
- Presentation as an HTML page (around 3-5 printed pages), due when the seminar starts
- Can be either an HTML page or a printable document in PostScript/PDF format
- 30 minutes of presentation
- 5-15 minutes of discussion
- Active participation
12. Course Organization
Course Material
- Lecture slides
- Original articles
- Seminar presentations
- Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
- Remember to check the course website and folder for the material!
13. Course Organization
Exercises
- Given by Pirjo Moen
- Email: Pirjo.Moen_at_cs.helsinki.fi
- Room: B350
- Tel: 191 44238
- 1.11.-29.11.2001 (5 exercises)
- Thu 12-14 (A318)
- Exercises are obligatory
- Exercises: 4/5
- Attendance lists are circulated
- Discussion is an essential part!
14. Course Organization
Exercises
- Usually around 3-4 exercises
- 2-3 "normal" exercises (with subtasks)
- Available due Thu mornings at 9
- 1 group work
- A practical exercise
- Available due Thu mornings at 9
- A written report (not hand-written!) must be returned at the exercise session
- Group: the seminar presentation group
- Foreign students
- Return all exercises in written format to Pirjo Moen
15. Course Organization
Home Exam
- The home exam is given on 28.11.2001
- Must be returned by 21.12.2001 (printed version, not hand-written, not by email)
- Tentatively:
- Course lectures, seminar presentations and exercises are the material for the exam
- Questions contain both theoretical and practical issues
- Around 4-6 smaller questions
- Around 1-2 bigger questions
16. Course Organization
Course Evaluation
- Scale: 1-/3 to 3/3, or rejected
- Grade = home exam + exercises + experiments + group presentation
- home exam: max 30 points
- (4 x 5p) + (1 x 10p)
- normal exercises (10): max 5 points
- 2 exercises = 1p, 4 = 2p, 6 = 3p, 8 = 4p, 10 = 5p
- experiments (5): max 15 points
- max 3 points/experiment
- group presentation: max 10 points
17. Course Organization
Course Evaluation
- Passing the course: min 30 points
- home exam: min 13 points (max 30 points)
- exercises/experiments: min 8 points (max 20 points)
- at least 3 returned and reported experiments
- group presentation: min 4 points (max 10 points)
- Remember also the other requirements
- Attending the lectures (5/7)
- Attending the seminars (4/5)
- Attending the exercises (4/5)
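The point rules on the two Course Evaluation slides can be checked mechanically. The following is only a sketch: the names `exercise_points` and `Result` are hypothetical, it assumes that exercise counts between the listed steps round down, and it does not model the attendance requirements.

```python
from dataclasses import dataclass

# Step table from the slide: normal exercises done -> points.
EXERCISE_POINTS = {2: 1, 4: 2, 6: 3, 8: 4, 10: 5}


def exercise_points(done: int) -> int:
    """Points for `done` normal exercises (0-10)."""
    # Assumption: counts between the listed steps round down (5 done -> 2p).
    eligible = [pts for step, pts in EXERCISE_POINTS.items() if done >= step]
    return max(eligible, default=0)


@dataclass
class Result:
    home_exam: int            # 0-30
    exercises_done: int       # 0-10 normal exercises
    experiment_points: list   # one entry (0-3) per returned experiment
    presentation: int         # 0-10

    def practical(self) -> int:
        """Exercise points plus experiment points (max 20)."""
        return exercise_points(self.exercises_done) + sum(self.experiment_points)

    def total(self) -> int:
        return self.home_exam + self.practical() + self.presentation

    def passed(self) -> bool:
        """Apply every minimum from the slide; attendance is not checked."""
        return (self.total() >= 30
                and self.home_exam >= 13
                and self.practical() >= 8
                and len(self.experiment_points) >= 3
                and self.presentation >= 4)
```

For example, `Result(18, 6, [3, 2, 2], 6)` totals 34 points and passes, while a student with 12 home exam points fails regardless of the other scores.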
18. Course Organization
Course Contents (1)
- Module/Week 1
- What is Data Mining?
- Association rules
- 24.10. normal lecture by Mika
- 26.10. normal lecture by Mika
- Module/Week 2
- Recurrent patterns
- Episode rules, minimal occurrences
- 31.10. normal lecture by Mika
- 2.11. seminar-like lecture by Pirjo
19. Course Organization
Course Contents (2)
- Module/Week 3
- Text mining
- 7.11. normal lecture by Mika
- 9.11. seminar-like lecture by Mika
- Module/Week 4
- Clustering
- Classification
- Similarity
- 14.11. normal lecture by Pirjo
- 16.11. seminar-like lecture by Mika
20. Course Organization
Course Contents (3)
- Module/Week 5
- Knowledge discovery process
- Pre- and postprocessing
- 21.11. normal lecture by Pirjo
- 23.11. seminar-like lecture by Pirjo
- Module/Week 6
- Data mining tools
- Summary, future
- 28.11. normal lecture by Pirjo
- 30.11. seminar-like lecture by Pirjo
21. Course Organization / Groups
Group Establishment
- Group is for both seminar and weekly group exercise work
- 10 groups à 3 persons
Get grouped!
22. Course Organization / Groups
- Group presentation time allocation
- Fri 2.11.: Group 1, Group 2 (associations)
- Fri 9.11.: Group 3, Group 4 (episodes)
- Fri 16.11.: Group 5, Group 6 (text mining)
- Fri 23.11.: Group 7, Group 8 (clustering)
- Fri 30.11.: Group 9, Group 10 (KDD process)
23. Course Organization / Groups
- Group 1
- Asikainen Tomi, Hyvönen Leena
- Group 2
- Löfström Jaakko, Pitkänen Esa, Tarvainen Tero
- Group 3
- Jokinen Sakari, Kuokkanen Ville, Tolvanen Juha
- Group 4
- Lehmussaari Kari, Pietilä Mikko, Uusitalo Petri
- Group 5
- Johansson Carl, Kerminen Antti, Sundman Jonas
24. Course Organization / Groups
- Group 6
- Malinen Johanna, Sahlberg Mauri, Vasankari Minna
- Group 7
- Arkko Jouko, Ojala Petri, Rapiokallio Maarit
- Group 8
- Palin Kimmo, Pasanen Janne (, X)
- Group 9
- Aunimo Lili, Lehtonen Miro, Saikku Arja
- Group 10
- X, X, X
25. Introduction to Data Mining (DM)
What? Why?
Applications
KDD Process
DM Views
Major Issues
26. Computers in 1940s (ENIAC)
27. Personal Home Network in 2000s
[Figure labels: Storage (repeated), Internet]
28. Evolution of Database Technology
- 1960s: data collection, database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: data mining and data warehousing, multimedia databases, and Web technology
29. Why Data Mining?
- Enormous amounts of data available
- Automated data collection tools and mature database technology lead to huge amounts of data stored in databases, data warehouses and other information repositories
- Manual inspection is either tedious or just impossible
30. What is Data Mining?
- Ultimately
- "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
- Often just
- "Tell something interesting about this data", "Describe this data"
- Exploratory, semi-automatic data analysis on large data sets
31. What is Data Mining?
- Rather established terminology
- Data mining
- Usually DM is one part of the KDD process
- Knowledge discovery in databases (KDD)
- The general term that covers, e.g., data preprocessing, DM, and post-processing
- Not so often used terms
- Knowledge extraction, data archeology
- Newest hype
- Business intelligence, knowledge management
32. What is DM Useful for?
Increase knowledge to base decisions upon.
E.g., impact on marketing.
The role and importance of KDD and DM have grown rapidly and are still growing!
But DM is not just marketing...
33. Potential Applications?
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Fraud detection and management
- Other applications
- Web mining
- Text mining
- etc.
34. Example (1)
- You are a marketing manager for a cellular telephone company
- Customers receive a free phone (worth 150) with a one-year contract; you pay a sales commission of 250 per contract
- Problem: turnover (after contract expires) is 25%
- Giving a new phone to everyone whose contract is expiring is very expensive
- Bringing back a customer after quitting is both difficult and expensive
35. Example (1)
- Three months before a contract expires, predict which customers will leave
- If you want to keep a customer who is predicted to leave, offer them a new phone
Yippee! I won't leave!
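One way to turn the predictions into retention offers is an expected-cost comparison. This is a sketch under stated assumptions: the figures 150 and 250 come from the slide, the churn probabilities are hypothetical, a lost customer is assumed to be replaced by a new contract (phone plus commission), and an offered phone is assumed to always keep the customer.

```python
PHONE_COST = 150   # from the slide (currency unit not given)
COMMISSION = 250   # from the slide, paid per new contract
# Assumption: replacing a lost customer costs a new phone plus a commission.
REPLACEMENT_COST = PHONE_COST + COMMISSION


def should_offer_phone(p_leave: float) -> bool:
    """Offer a phone when the expected replacement cost exceeds the phone's cost."""
    return p_leave * REPLACEMENT_COST > PHONE_COST


# Hypothetical model output: customer id -> predicted probability of leaving.
predictions = {"c1": 0.10, "c2": 0.45, "c3": 0.80, "c4": 0.30}

# Customers for whom a retention offer is expected to pay off.
offers = [cid for cid, p in predictions.items() if should_offer_phone(p)]
```

Under these assumptions the break-even probability is 150 / 400 = 0.375, so only customers predicted to leave with a higher probability receive an offer, instead of everyone whose contract is expiring.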
36. Example (2)
- You are an insurance officer and you should define a suitable monthly payment for an 18-year-old boy who has bought a Ferrari. What to do?
Oh, yes! I love my Ferrari!
37. Example (2)
- Analyze all previous customer data and paid compensations data
- What is the predicted accident probability based on
- Driver's gender (male/female) and age
- Car model and age, place of living
- etc.
- If the accident probability is higher than average, set the monthly payment accordingly!
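The idea of comparing a group's accident probability with the average can be sketched as follows. The records, the attribute grouping and the proportional pricing rule are all hypothetical; a real model would use far more data and attributes.

```python
from collections import defaultdict

# Hypothetical historical records: (gender, age band, car class, had accident).
history = [
    ("male", "18-25", "sports", True),
    ("male", "18-25", "sports", True),
    ("male", "18-25", "sports", False),
    ("female", "26-60", "family", False),
    ("female", "26-60", "family", False),
    ("male", "26-60", "family", True),
    ("female", "18-25", "family", False),
    ("male", "26-60", "family", False),
]


def accident_rates(records):
    """Return the overall accident rate and the rate per attribute group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [accidents, customers]
    for gender, age, car, accident in records:
        group = counts[(gender, age, car)]
        group[0] += int(accident)
        group[1] += 1
    overall = sum(a for a, _ in counts.values()) / len(records)
    return overall, {g: a / n for g, (a, n) in counts.items()}


def monthly_payment(base, group, records):
    """Scale a base payment by the group's risk relative to the average."""
    overall, rates = accident_rates(records)
    # Assumption: groups never seen before fall back to the overall rate.
    return base * rates.get(group, overall) / overall
```

With this toy data the overall rate is 3/8 and the rate for young male sports car drivers is 2/3, so `monthly_payment(100, ("male", "18-25", "sports"), history)` scales the base payment by roughly 1.78.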
38. Example (3)
- You are in a foreign country and somebody steals or duplicates your credit card or mobile phone
- Credit card companies
- use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances
- Phone companies
- analyze patterns that deviate from an expected norm (destination, duration, etc.)
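Deviation from an expected norm can be illustrated with a simple z-score test on call durations. This is a minimal sketch, not how any particular phone company does it: the function name, the threshold of 3 standard deviations and the data are assumptions.

```python
from statistics import mean, stdev


def unusual_calls(past_durations, new_durations, threshold=3.0):
    """Return new durations lying more than `threshold` standard deviations
    from the subscriber's historical mean."""
    mu = mean(past_durations)
    sigma = stdev(past_durations)
    if sigma == 0:
        # No variation in the history: anything different is unusual.
        return [d for d in new_durations if d != mu]
    return [d for d in new_durations if abs(d - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one subscriber.
past = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]
new = [4, 6, 55, 3]

suspicious = unusual_calls(past, new)
```

Here only the 55-minute call is flagged. The same test could be applied to other attributes mentioned on the slide, such as destination frequencies.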
39. Example (4)
- Web access logs can be analyzed for
- discovering customer preferences
- improving Web site organization
- Similarly
- all kinds of log information analysis
- user interface/service adaptation
Excellent surfing experience!
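A first step in Web access log analysis is counting which pages are actually requested. The sketch below assumes log lines in Common Log Format; the sample lines, host addresses and paths are hypothetical.

```python
import re
from collections import Counter

# Captures the path and status code of a request such as
# "GET /index.html HTTP/1.0" 200 in a Common Log Format line.
REQUEST = re.compile(r'"(?:GET|POST) (\S+) [^"]*" (\d{3})')


def page_hits(log_lines):
    """Count successful (2xx) requests per path."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(2).startswith("2"):
            hits[match.group(1)] += 1
    return hits


# Hypothetical log lines.
log = [
    '10.0.0.1 - - [24/Oct/2001:14:00:01 +0300] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [24/Oct/2001:14:00:05 +0300] "GET /courses/dm.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [24/Oct/2001:14:00:09 +0300] "GET /courses/dm.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [24/Oct/2001:14:01:12 +0300] "GET /missing.html HTTP/1.0" 404 210',
]

hits = page_hits(log)
```

Frequently requested pages hint at customer preferences, and pages that are rarely reached or often return errors are candidates for reorganizing the site.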
40. Knowledge Discovery Process (1)
Learning the domain
Creating a target data set
Data cleaning/preprocessing
Data reduction/projection
Choosing the DM task
41. Knowledge Discovery Process (2)
Choosing the DM algorithm(s)
Data mining: search
Pattern evaluation
Knowledge presentation
Use of discovered knowledge
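Some of the steps listed on these two slides can be pictured as a chain of small functions. This is a toy sketch only: the data, the function names and the choice of simple frequency counting as the "mining" step are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical target data set: (customer, product); None marks a missing value.
raw = [("ann", "phone"), ("bob", None), ("ann", "charger"),
       ("cid", "phone"), ("bob", "phone"), ("cid", "PHONE ")]


def clean(records):
    """Data cleaning/preprocessing: drop incomplete rows, normalise text."""
    return [(c, p.strip().lower()) for c, p in records if p is not None]


def project(records):
    """Data reduction/projection: keep only the attribute being mined."""
    return [p for _, p in records]


def mine(items):
    """Data mining (search): count how often each value occurs."""
    return Counter(items)


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns that are frequent enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}


interesting = evaluate(mine(project(clean(raw))))
```

Each function corresponds to one step of the process; learning the domain, choosing the task and using the discovered knowledge remain human activities around this chain.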
42. Typical KDD Process
[Figure labels: Operational Database, Data mining, Input data, Results, Utilization]
43. Utilization
Increasing potential to support business decisions (highest layer first):
- Making Decisions (End User)
- Data Presentation: Visualization Techniques (Business Analyst)
- Data Mining: Information Discovery (Data Analyst)
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA (DBA)
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
44. The Value Chain
- Decision
- Promote product A in region Z.
- Mail ads to families of profile P
- Cross-sell service B to clients C
- Knowledge
- A quantity Y of product A is used in region Z
- Customers of class Y use x of C during period D
- Information
- X lives in Z
- S is Y years old
- X and S moved
- W has money in Z
- Data
- Customer data
- Store data
- Demographical Data
- Geographical data
45. Data Mining Views
- General approaches
- Descriptive data mining
- Describe what interesting things can be found in this data!
- Explain this data to me!
- Predictive data mining
- Based on this and previous data, tell me what will happen in the future!
- Show me the future trends!
46. Data Mining Views
- Views based on
- Databases to be mined
- Knowledge to be discovered
- Techniques utilized
- Applications adapted
- Let's take a closer look at these views...
47. Data Mining Views
Databases to be mined
- Relational
- Transactional
- Object-oriented
- Object-relational
- Active
- Spatial
- Time-series
- Text, XML
- Multi-media
- Heterogeneous
- Legacy
- Inductive
- WWW
- etc.
48. Data Mining Views
Knowledge to be mined (tasks)
- Characterization
- Discrimination
- Association
- Classification
- Clustering
- Trend
- Deviation analysis
- Outlier analysis
- etc.
49. Data Mining Views
Techniques utilized
- Database-oriented
- Data warehouse (OLAP)
- Machine learning
- Statistics
- Visualization
- Neural networks
- Etc.
50. Data Mining Views
Applications adapted
- Retail (supermarkets etc.)
- Telecom
- Banking
- Fraud analysis
- DNA mining
- Stock market analysis
- Web mining
- Log data analysis
- etc.
51. Major Issues in Data Mining
- Mining methodologies and interaction
- Mining different kinds of knowledge
- Interactive mining of knowledge
- Incorporation of background knowledge
- DM query languages and ad-hoc DM
- Visualization of DM results
- Handling noise and incomplete data
- The interestingness problem
- Performance and scalability
- Efficiency and scalability of DM algorithms
- Parallel, distributed and incremental mining methods
52. Major Issues in Data Mining
- Diversity of data types
- Handling complex types of data
- Mining information from heterogeneous databases (Web etc.)
- Application and integration of discovered knowledge
- Domain-specific DM tools
- Intelligent query answering and decision making
- Integration of discovered knowledge with existing knowledge
- Protection of data
- Security
- Integrity
- Privacy
53. Historical Data Mining Activities
- 1989: IJCAI Workshop
- 1991-1994: KDD Workshops
- 1995-1998: KDD Conferences
- 1998: ACM SIGKDD
- 1999-: SIGKDD Conferences
- And many smaller/new DM conferences
- PAKDD, PKDD
- SIAM-Data Mining, (IEEE) ICDM
- etc.
54. Useful References on Data Mining
Standards
- DM: Conferences: KDD, PKDD, PAKDD, ...
- Journals: Data Mining and Knowledge Discovery, CACM
- DM/DB: Conferences: ACM-SIGMOD/PODS, VLDB, ...
- Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, ...
- AI/ML: Conferences: Machine Learning, AAAI, IJCAI, ...
- Journals: Machine Learning, Artific. Intell., ...
55. Conclusions
- Data mining: semi-automatic discovery of interesting patterns from large data sets
- Knowledge discovery is a process
- Preprocessing
- Data mining
- Postprocessing
- Different things can be mined, used or utilized:
- Databases (relational, object-oriented, spatial, WWW, ...)
- Knowledge (characterization, clustering, association, ...)
- Techniques (machine learning, statistics, visualization, ...)
- Applications (retail, telecom, Web mining, log analysis, ...)
56. Conclusions
- Module/Week 1
- What is Data Mining?
- Association rules
- 24.10. normal lecture by Mika
- 26.10. normal lecture by Mika
- Module/Week 2
- Episode rules, minimal occurrences
- 31.10. normal lecture by Mika
- 2.11. seminar-like lecture by Pirjo
- Module/Week 3
- Text mining
- 7.11. normal lecture by Mika
- 9.11. seminar-like lecture by Mika
57. Conclusions
- Module/Week 4
- Clustering, Classification, Similarity
- 14.11. normal lecture by Pirjo
- 16.11. seminar-like lecture by Mika
- Module/Week 5
- Knowledge discovery process
- Pre- and postprocessing
- 21.11. normal lecture by Pirjo
- 23.11. seminar-like lecture by Pirjo
- Module/Week 6
- Data mining tools, Summary, Future
- 28.11. normal lecture by Pirjo
- 30.11. seminar-like lecture by Pirjo
58. Seminar Presentations
- Seminar presentations
- Articles are given on the previous week's Wed
- Presentation as an HTML page (around 3-5 printed pages), due when the seminar starts
- Can be either an HTML page or a printable document in PostScript/PDF format
- 30 minutes of presentation
- 5-15 minutes of discussion
- Active participation
59. Seminar Presentations/Groups 1-2
Quantitative Rules
MINERULE
60. Seminar 1/2: Quantitative Rules
- R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996 Conference on Management of Data, Montreal, Canada, June 1996.
61. Seminar 2/2: MINERULE
- Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133
62. Introduction to Data Mining (DM)
Thank you for your attention and have a nice
course! Thanks to Jiawei Han from Simon Fraser
University for his slides which greatly helped
in preparing this lecture! Also thanks to Fosca
Giannotti and Dino Pedreschi from Pisa for their
slides.