Course on Data Mining (581550-4)

Author: Dept. of Computer Science. Last modified by: Mika Klemettinen. Created: 6/17/1995 11:31:02 PM.

1
Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2
Accepted to Autumn 2001 Course
  • Arkko Jouko
  • Asikainen Tomi
  • Aunimo Lili
  • Hyvönen Leena
  • Johansson Carl
  • Jokinen Sakari
  • Kerminen Antti
  • Kuokkanen Ville
  • Lehmussaari Kari
  • Lehtonen Miro
  • Löfström Jaakko
  • Malinen Johanna
  • Mäkelä Eetu
  • Ojala Petri
  • Palin Kimmo
  • Pasanen Janne
  • Pietilä Mikko
  • Pitkänen Esa
  • Rapiokallio Maarit
  • Roos Teemu
  • Sahlberg Mauri
  • Saikku Arja
  • Sundman Jonas
  • Tarvainen Tero
  • Tiihonen Sami
  • Tolvanen Juha
  • Uusitalo Petri
  • Vasankari Minna
  • Virtanen Otso

3
Course Organization
Lecturers
Lectures
Course Material
Exercises
Contents
4
Course Organization
Dr. Mika Klemettinen
  • PhD Mika Klemettinen
  • Email: Mika.Klemettinen_at_nokia.com
  • WWW: http://www.cs.helsinki.fi/u/mklemett/
  • Room: B356
  • Tel: 050-483 6661
  • PhD in January 1999
  • Thesis: A Knowledge Discovery Methodology for
    Telecommunication Network Alarm Databases
  • Data mining and SGML/XML related research at
    UH/CS (1994-2000) and at Nokia (2000-)

5
Course Organization
Dr. Pirjo Moen
  • PhD Pirjo Moen
  • Email: Pirjo.Moen_at_cs.helsinki.fi
  • WWW: http://www.cs.helsinki.fi/pirjo.moen/
  • Room: B350
  • Tel: 191 44238
  • PhD in February 2000
  • Thesis: Attribute, Event Sequence, and Event Type
    Similarity Notions for Data Mining
  • Data mining related research at UH/CS (1994-)

6
Course Organization
DM/SGML/XML at UH/CS
  • RATI (A structured text database system/
    Rakenteiset tekstitietokannat), 1988-91
  • Data mining from telecommunication alarm data,
    1994-97
  • Structured and Intelligent Documents (SID),
    1995-98
  • From Data to Knowledge (FDK), 1995-
  • Knowledge workers workstation (TYTTI), 2000-02
  • DM Group (-99), DOREMI Group (00-)

Linux was invented here!
7
Course Organization
NRC in Short
  • Nokia is the global leader in digital
    communication technologies with around 60 000
    employees all over the world
  • Nokia Research Center (NRC) has around 1 200
    employees in Finland, USA, Japan, China, Germany,
    Hungary, UK, etc.
  • NRC's role is to enhance Nokia's technological
    competitiveness by exploring and developing new
    technologies
  • Strongly involved in many European Union and
    national research projects

8
Course Organization
DM Group at NRC
  • Background
  • At the University of Helsinki, Dept. of Computer
    Science: data mining methods and theory of data
    mining since the late 80s
  • Association and episode rule mining, time series
    similarity, analysis of telecommunication alarm
    data and web logs, etc.
  • Other members include
  • Dr. Heikki Mannila (group leader)
  • Dr. Hannu Toivonen

9
Course Organization
Lectures (1)
  • 24.10.-30.11.2001 (12 lectures)
  • 7 normal lectures
  • 5 seminar like lectures
  • Wed 14-16, Fri 12-14 (A217)
  • Wed normal lecture
  • Fri seminar like lecture (except for 26.10.)
  • Lectures are obligatory
  • Normal lectures 5/7
  • Seminar like lectures 4/5
  • Lists are circulated

10
Course Organization
Lectures (2)
  • Lecturing language is Finnish, slides are in
    English
  • Students can also use English
  • A foreign student group can be established
  • Normal lectures
  • Basics, terminology, standard methods
  • Lecturer driven teaching
  • Seminar like lectures
  • Extensions to the basic methods
  • Lecturer gives an introduction
  • Student groups give short presentations

11
Course Organization
Lectures (3)
  • Group for seminar (and exercise) work
  • 10 groups, à 3 persons, 2 groups/lecture
  • Dates are agreed at the beginning of course
  • Articles are handed out on the previous week's Wed
  • Seminar presentations
  • Written presentation (around 3-5 printed pages)
    due at the start of the seminar
  • Can be either an HTML page or a printable document
    in PostScript/PDF format
  • 30 minutes of presentation
  • 5-15 minutes of discussion
  • Active participation

12
Course Organization
Course Material
  • Lecture slides
  • Original articles
  • Seminar presentations
  • Book: "Data Mining: Concepts and Techniques" by
    Jiawei Han and Micheline Kamber, Morgan Kaufmann
    Publishers, August 2000. 550 pages. ISBN
    1-55860-489-8
  • Remember to check course website and folder for
    the material!

13
Course Organization
Exercises
  • Given by Pirjo Moen
  • Email Pirjo.Moen_at_cs.helsinki.fi
  • Room B350
  • Tel 191 44238
  • 1.11.-29.11.2001 (5 exercises)
  • Thu 12-14 (A318)
  • Exercises are obligatory
  • Exercises 4/5
  • Lists are circulated
  • Discussion is an essential part!

14
Course Organization
Exercises
  • Usually around 3-4 exercises
  • 2-3 "normal" exercises (with subtasks)
  • Available due Thu mornings at 9
  • 1 group work
  • A practical exercise
  • Available due Thu mornings at 9
  • A written report (not hand-written!) must be
    returned at the exercise session
  • Group: the seminar presentation group
  • Foreign students
  • Return all exercises in written format to Pirjo
    Moen

15
Course Organization
Home Exam
  • The home exam is given on 28.11.2001
  • Must be returned by 21.12.2001 (printed version,
    not hand-written, not by email)
  • Tentatively
  • Course lectures, seminar presentations and
    exercises are the material for the exam
  • Questions contain both theoretical and practical
    issues
  • Around 4-6 smaller questions
  • Around 1-2 bigger questions

16
Course Organization
Course Evaluation
  • Scale: 1-/3 to 3/3, or rejected
  • Grade = home exam + exercises + experiments +
    group presentation
  • home exam: max 30 points
  • (4 x 5p) + (1 x 10p)
  • normal exercises (10): max 5 points
  • 2 -> 1p, 4 -> 2p, 6 -> 3p, 8 -> 4p, 10 -> 5p
  • experiments (5): max 15 points
  • max 3 points/experiment
  • group presentation: max 10 points

17
Course Organization
Course Evaluation
  • Passing the course: min 30 points
  • home exam: min 13 points (max 30 points)
  • exercises/experiments: min 8 points (max 20
    points)
  • at least 3 returned and reported experiments
  • group presentation: min 4 points (max 10 points)
  • Remember also the other requirements
  • Attending the lectures (5/7)
  • Attending the seminars (4/5)
  • Attending the exercises (4/5)
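
The evaluation rules on the two slides above can be sketched as a small check. This is only an illustration, assuming the grade is the sum of the four components and that the exercise table means one point per two completed exercises; the `Student` record and all function names are made up for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Student:
    home_exam: int           # points, 0..30
    exercises_done: int      # normal exercises completed, 0..10
    experiments: List[int]   # points per returned experiment, 0..3 each
    presentation: int        # points, 0..10
    lectures: int            # normal lectures attended, out of 7
    seminars: int            # seminar like lectures attended, out of 5
    exercise_sessions: int   # exercise sessions attended, out of 5


def exercise_points(done: int) -> int:
    # Assumed reading of the slide: every 2 exercises give 1 point, max 5.
    return min(done, 10) // 2


def total_points(s: Student) -> int:
    return (s.home_exam + exercise_points(s.exercises_done)
            + sum(s.experiments) + s.presentation)


def passes(s: Student) -> bool:
    # Exercises and experiments share one minimum (8 of 20 points).
    work = exercise_points(s.exercises_done) + sum(s.experiments)
    return (total_points(s) >= 30
            and s.home_exam >= 13
            and work >= 8
            and len(s.experiments) >= 3
            and s.presentation >= 4
            and s.lectures >= 5
            and s.seminars >= 4
            and s.exercise_sessions >= 4)
```

Note that a high point total is not enough on its own: a student with only two returned experiments fails the check regardless of the other scores.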

18
Course Organization
Course Contents (1)
  • Module/Week 1
  • What is Data Mining?
  • Association rules
  • 24.10. normal lecture by Mika
  • 26.10. normal lecture by Mika
  • Module/Week 2
  • Recurrent patterns
  • Episode rules, minimal occurrences
  • 31.10. normal lecture by Mika
  • 2.11. seminar like lecture by Pirjo

19
Course Organization
Course Contents (2)
  • Module/Week 3
  • Text mining
  • 7.11. normal lecture by Mika
  • 9.11. seminar like lecture by Mika
  • Module/Week 4
  • Clustering
  • Classification
  • Similarity
  • 14.11. normal lecture by Pirjo
  • 16.11. seminar like lecture by Mika

20
Course Organization
Course Contents (3)
  • Module/Week 5
  • Knowledge discovery process
  • Pre- and postprocessing
  • 21.11. normal lecture by Pirjo
  • 23.11. seminar like lecture by Pirjo
  • Module/Week 6
  • Data mining tools
  • Summary, future
  • 28.11. normal lecture by Pirjo
  • 30.11. seminar like lecture by Pirjo

21
Course Organization / Groups
Group Establishment
  • Group is for both seminar and weekly group
    exercise work
  • 10 groups à 3 persons

Get grouped!
22
Course Organization / Groups
  • Group presentation time allocation
  • Fri 2.11. Group 1, Group 2 (associations)
  • Fri 9.11. Group 3, Group 4 (episodes)
  • Fri 16.11. Group 5, Group 6 (text mining)
  • Fri 23.11. Group 7, Group 8 (clustering)
  • Fri 30.11. Group 9, Group 10 (KDD process)

23
Course Organization / Groups
  • Group 1
  • Asikainen Tomi, Hyvönen Leena
  • Group 2
  • Löfström Jaakko, Pitkänen Esa, Tarvainen Tero
  • Group 3
  • Jokinen Sakari, Kuokkanen Ville, Tolvanen Juha
  • Group 4
  • Lehmussaari Kari, Pietilä Mikko, Uusitalo Petri
  • Group 5
  • Johansson Carl, Kerminen Antti, Sundman Jonas

24
Course Organization / Groups
  • Group 6
  • Malinen Johanna, Sahlberg Mauri, Vasankari Minna
  • Group 7
  • Arkko Jouko, Ojala Petri, Rapiokallio Maarit
  • Group 8
  • Palin Kimmo, Pasanen Janne (, X)
  • Group 9
  • Aunimo Lili, Lehtonen Miro, Saikku Arja
  • Group 10
  • X, X, X

25
Introduction to Data Mining (DM)
What? Why?
Applications
KDD Process
DM Views
Major Issues
26
Computers in 1940s (ENIAC)
27
Personal Home Network in 2000s
[Figure: home network diagram showing eight Storage nodes and the Internet]
28
Evolution of Database Technology
  • 1960s
  • Data collection, database creation, IMS and
    network DBMS
  • 1970s
  • Relational data model, relational DBMS
    implementation
  • 1980s
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.) and application-oriented
    DBMS (spatial, scientific, engineering, etc.)
  • 1990s
  • Data mining and data warehousing, multimedia
    databases, and Web technology

29
Why Data Mining?
  • Enormous amounts of data available
  • Automated data collection tools and mature
    database technology lead to huge amounts of data
    stored in databases, data warehouses and other
    information repositories
  • Manual inspection is either tedious or just
    impossible

30
What is Data Mining?
  • Ultimately
  • "Extraction of interesting (non-trivial,
    implicit, previously unknown, potentially useful)
    information or patterns from data in large
    databases"
  • Often just
  • "Tell something interesting about this data",
    "Describe this data"
  • Exploratory, semi-automatic data analysis on
    large data sets

31
What is Data Mining?
  • Rather established terminology
  • Data mining
  • Usually DM is one part of KDD process
  • Knowledge discovery in databases (KDD)
  • The general term that covers, e.g., data
    preprocessing, DM, and post-processing
  • Not so often used terms
  • Knowledge extraction, data archeology
  • Newest hype
  • Business intelligence, knowledge management

32
What is DM Useful for?
  • Increase knowledge to base decisions upon
  • E.g., impact on marketing
  • The role and importance of KDD and DM have grown
    rapidly, and are still growing!
  • But DM is not just marketing...
33
Potential Applications?
  • Database analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Fraud detection and management
  • Other applications
  • Web mining
  • Text mining
  • etc.

34
Example (1)
  • You are a marketing manager for a cellular
    telephone company
  • Customers receive a free phone (worth 150) with
    one-year contract; you pay a sales commission of
    250 per contract
  • Problem: Turnover (after contract expires) is 25%
  • Giving a new phone to everyone whose contract is
    expiring is very expensive
  • Bringing back a customer after quitting is both
    difficult and expensive

35
Example (1)
  • Three months before a contract expires, predict
    which customers will leave
  • If you want to keep a customer that is predicted
    to leave, offer them a new phone

Yippee! I won't leave!
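
The economics behind Example (1) can be worked through with a rough cost sketch. It uses the slide's figures (phone 150, commission 250, 25% turnover; the currency is not stated) and adds several assumptions that are not in the slides: a lost customer is replaced at the cost of one phone plus one commission, every flagged leaver accepts the offer, and the predictor is described by a hypothetical `recall` and `precision`.

```python
# Figures from the slide (currency unit not stated there).
PHONE_COST = 150     # value of the free phone
COMMISSION = 250     # sales commission per contract
CHURN_RATE = 0.25    # share of customers leaving when the contract expires


def cost_phone_for_everyone(n_customers):
    """Give every customer with an expiring contract a new phone."""
    return n_customers * PHONE_COST


def cost_replace_leavers(n_customers):
    """Do nothing, then win one new customer for each one who left.

    Assumption: a replacement costs one phone plus one commission.
    """
    return n_customers * CHURN_RATE * (PHONE_COST + COMMISSION)


def cost_targeted_offers(n_customers, recall, precision):
    """Offer a phone only to customers predicted to leave.

    Assumptions: `recall` is the share of true leavers that get flagged,
    `precision` is the share of flagged customers who would really leave,
    flagged leavers always stay, and missed leavers are replaced.
    """
    leavers = n_customers * CHURN_RATE
    caught = leavers * recall
    offers = caught / precision
    missed = leavers - caught
    return offers * PHONE_COST + missed * (PHONE_COST + COMMISSION)
```

For 1000 customers this gives 150 000 for the blanket offer and 100 000 for replacing leavers, which is why a phone for everyone is "very expensive". Under the stated assumptions, a predictor with recall 0.8 and precision 0.5 brings the cost to about 80 000.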
36
Example (2)
  • You are an insurance officer and you should
    define a suitable monthly payment for an
    18-year-old boy who has bought a Ferrari. What to
    do?

Oh, yes! I love my Ferrari!
37
Example (2)
  • Analyze all previous customer data and data on
    paid compensations
  • What is the predicted accident probability based
    on
  • Driver's gender (male/female) and age
  • Car model and age, place of living
  • etc.
  • If the accident probability is higher than on
    average, set the monthly payment accordingly!
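
A minimal sketch of this idea: estimate the accident rate per customer group from past records and scale a base payment by how far the group is from the overall average. The records below are invented purely for illustration, and proportional scaling is just one possible pricing rule, not something the slides prescribe.

```python
from collections import defaultdict

# Hypothetical past records: (age band, car class, had an accident?)
RECORDS = [
    ("18-25", "sports", True),
    ("18-25", "sports", True),
    ("18-25", "sports", False),
    ("18-25", "family", False),
    ("18-25", "family", True),
    ("40-60", "sports", False),
    ("40-60", "family", False),
    ("40-60", "family", False),
    ("40-60", "family", True),
    ("40-60", "family", False),
]


def accident_rates(records):
    """Empirical accident rate for each (age band, car class) group."""
    counts = defaultdict(lambda: [0, 0])   # group -> [accidents, customers]
    for age, car, accident in records:
        counts[(age, car)][0] += int(accident)
        counts[(age, car)][1] += 1
    return {group: a / n for group, (a, n) in counts.items()}


def monthly_payment(group, records, base_payment):
    """Scale the base payment by group rate / overall rate."""
    overall = sum(r[2] for r in records) / len(records)
    if overall == 0:
        return base_payment                # no accidents seen at all
    rate = accident_rates(records).get(group, overall)
    return base_payment * rate / overall
```

A group never seen before falls back to the overall rate and therefore pays the base payment.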

38
Example (3)
  • You are in a foreign country and somebody steals
    or duplicates your credit card or mobile phone
  • Credit card companies
  • use historical data to build models of fraudulent
    behaviour and use data mining to help identify
    similar instances
  • Phone companies
  • analyze patterns that deviate from an expected
    norm (destination, duration, etc.)
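
Deviation from an expected norm can be illustrated with a very simple rule: flag a call whose duration is more than a few standard deviations away from the customer's own history. This is a toy sketch only; the threshold of 3 and the function name are assumptions, and real systems would look at many more attributes than duration.

```python
from statistics import mean, stdev


def unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that deviate strongly from history.

    `history` needs at least two values so that a standard deviation exists.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past calls were identical: anything different is unusual.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]


# Usage with made-up call durations in minutes.
past = [3, 5, 4, 6, 5, 4, 5, 3]
flagged = unusual_calls(past, [4, 55, 6])
```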

39
Example (4)
  • Web access logs can be analyzed for
  • discovering customer preferences
  • improving Web site organization
  • Similarly
  • all kinds of log information analysis
  • user interface/service adaptation

Excellent surfing experience!
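
As a first step towards the log analysis in Example (4), one can simply count which pages are requested successfully. The sketch below assumes log lines in the common "request in quotes, then status code" layout used by many web servers; the sample lines are invented.

```python
import re
from collections import Counter

# Matches e.g.: "GET /index.html HTTP/1.0" 200
REQUEST = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+" (\d{3})')


def page_counts(lines):
    """Count successful (status 200) requests per requested path."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match and match.group(2) == "200":
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines for illustration.
SAMPLE = [
    '10.0.0.1 - - [24/Oct/2001:14:00:01 +0300] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [24/Oct/2001:14:00:05 +0300] "GET /courses.html HTTP/1.0" 200 2210',
    '10.0.0.1 - - [24/Oct/2001:14:00:09 +0300] "GET /courses.html HTTP/1.0" 200 2210',
    '10.0.0.3 - - [24/Oct/2001:14:00:12 +0300] "GET /missing.html HTTP/1.0" 404 210',
]
counts = page_counts(SAMPLE)
```

Frequently requested pages hint at customer preferences, while failed requests (skipped here) may point to problems in the site organization.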
40
Knowledge Discovery Process (1)
Learning the domain
Creating a target data set
Data cleaning/preprocessing
Data reduction/projection
Choosing the DM task
41
Knowledge Discovery Process (2)
Choosing the DM algorithm(s)
Data mining: search
Pattern evaluation
Knowledge presentation
Use of discovered knowledge
42
Typical KDD Process
[Figure: diagram with the labels Operational Database, Input data, Data mining, Results, and Utilization]
43
Utilization
Increasing potential to support business decisions
(layers from top to bottom)
  • Making Decisions (End User)
  • Data Presentation: Visualization Techniques
    (Business Analyst)
  • Data Mining: Information Discovery (Data Analyst)
  • Data Exploration: Statistical Analysis, Querying
    and Reporting
  • Data Warehouses / Data Marts: OLAP, MDA
  • Data Sources: Paper, Files, Information
    Providers, Database Systems, OLTP (DBA)

44
The Value Chain
  • Decision
  • Promote product A in region Z.
  • Mail ads to families of profile P
  • Cross-sell service B to clients C
  • Knowledge
  • A quantity Y of product A is used in region Z
  • Customers of class Y use x of C during period
    D
  • Information
  • X lives in Z
  • S is Y years old
  • X and S moved
  • W has money in Z
  • Data
  • Customer data
  • Store data
  • Demographic data
  • Geographical data

45
Data Mining Views
  • General approaches
  • Descriptive data mining
  • Describe what interesting things can be found in
    this data!
  • Explain this data to me!
  • Predictive data mining
  • Based on this and previous data, tell me what
    will happen in the future!
  • Show me the future trends!

46
Data Mining Views
  • Views based on
  • Databases to be mined
  • Knowledge to be discovered
  • Techniques utilized
  • Applications adapted
  • Let's take a closer look at these views...

47
Data Mining Views
Databases to be mined
  • Relational
  • Transactional
  • Object-oriented
  • Object-relational
  • Active
  • Spatial
  • Time-series
  • Text, XML
  • Multi-media
  • Heterogeneous
  • Legacy
  • Inductive
  • WWW
  • etc.

48
Data Mining Views
Knowledge to be mined (tasks)
  • Characterization
  • Discrimination
  • Association
  • Classification
  • Clustering
  • Trend
  • Deviation analysis
  • Outlier analysis
  • etc.

49
Data Mining Views
Techniques utilized
  • Database-oriented
  • Data warehouse (OLAP)
  • Machine learning
  • Statistics
  • Visualization
  • Neural networks
  • Etc.

50
Data Mining Views
Applications adapted
  • Retail (supermarkets etc.)
  • Telecom
  • Banking
  • Fraud analysis
  • DNA mining
  • Stock market analysis
  • Web mining
  • Log data analysis
  • etc.

51
Major Issues in Data Mining
  • Mining methodologies and interaction
  • Mining different kinds of knowledge
  • Interactive mining of knowledge
  • Incorporation of background knowledge
  • DM query languages and ad-hoc DM
  • Visualization of DM results
  • Handling noise and incomplete data
  • The interestingness problem
  • Performance and scalability
  • Efficiency and scalability of DM algorithms
  • Parallel, distributed and incremental mining
    methods

52
Major Issues in Data Mining
  • Diversity of data types
  • Handling complex types of data
  • Mining information from heterogeneous databases
    (Web etc.)
  • Application and integration of discovered
    knowledge
  • Domain-specific DM tools
  • Intelligent query answering and decision making
  • Integration of discovered knowledge with existing
    knowledge
  • Protection of data
  • Security
  • Integrity
  • Privacy

53
Historical Data Mining Activities
  • 1989 IJCAI Workshop
  • 1991-1994 KDD Workshops
  • 1995-1998 KDD Conferences
  • 1998 ACM SIGKDD
  • 1999- SIGKDD Conferences
  • And many smaller/new DM conferences
  • PAKDD, PKDD
  • SIAM-Data Mining, (IEEE) ICDM
  • etc.

54
Useful References on Data Mining
Standards
  • DM
  • Conferences: KDD, PKDD, PAKDD, ...
  • Journals: Data Mining and Knowledge Discovery,
    CACM
  • DM/DB
  • Conferences: ACM-SIGMOD/PODS, VLDB, ...
  • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, ...
  • AI/ML
  • Conferences: Machine Learning, AAAI, IJCAI, ...
  • Journals: Machine Learning, Artific. Intell., ...

55
Conclusions
  • Data mining: semi-automatic discovery of
    interesting patterns from large data sets
  • Knowledge discovery is a process
  • Preprocessing
  • Data mining
  • Postprocessing
  • Different things can be mined, used or utilized
  • Databases (relational, object-oriented, spatial,
    WWW, ...)
  • Knowledge (characterization, clustering,
    association, ...)
  • Techniques (machine learning, statistics,
    visualization, ...)
  • Applications (retail, telecom, Web mining, log
    analysis, ...)

56
Conclusions
  • Module/Week 1
  • What is Data Mining?
  • Association rules
  • 24.10. normal lecture by Mika
  • 26.10. normal lecture by Mika
  • Module/Week 2
  • Episode rules, minimal occurrences
  • 31.10. normal lecture by Mika
  • 2.11. seminar like lecture by Pirjo
  • Module/Week 3
  • Text mining
  • 7.11. normal lecture by Mika
  • 9.11. seminar like lecture by Mika

57
Conclusions
  • Module/Week 4
  • Clustering, Classification, Similarity
  • 14.11. normal lecture by Pirjo
  • 16.11. seminar like lecture by Mika
  • Module/Week 5
  • Knowledge discovery process
  • Pre- and postprocessing
  • 21.11. normal lecture by Pirjo
  • 23.11. seminar like lecture by Pirjo
  • Module/Week 6
  • Data mining tools, Summary, Future
  • 28.11. normal lecture by Pirjo
  • 30.11. seminar like lecture by Pirjo

58
Seminar Presentations
  • Seminar presentations
  • Articles are handed out on the previous week's Wed
  • Written presentation (around 3-5 printed pages)
    due at the start of the seminar
  • Can be either an HTML page or a printable document
    in PostScript/PDF format
  • 30 minutes of presentation
  • 5-15 minutes of discussion
  • Active participation

59
Seminar Presentations/Groups 1-2
Quantitative Rules
MINERULE
60
Seminar 1/2 Quantitative Rules
  • R. Srikant, R. Agrawal: "Mining Quantitative
    Association Rules in Large Relational Tables",
    Proc. of the ACM-SIGMOD 1996 Conference on
    Management of Data, Montreal, Canada, June 1996.

61
Seminar 2/2 MINERULE
  • Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New
    SQL-like Operator for Mining Association Rules",
    VLDB 1996: 122-133.

62
Introduction to Data Mining (DM)
Thank you for your attention and have a nice
course! Thanks to Jiawei Han from Simon Fraser
University for his slides which greatly helped
in preparing this lecture! Also thanks to Fosca
Giannotti and Dino Pedreschi from Pisa for their
slides.