CSE/CBS 572 Data Mining - PowerPoint PPT Presentation


1
CSE/CBS 572 Data Mining
  • Huan Liu, CSE, CEAS, ASU
  • http://www.public.asu.edu/huanliu/DM06S/cse572.html

2
CSE 591
  • Contents of basic and advanced topics
  • Classification, Clustering, Association, and
    Applications
  • Format: An interactive course with ample
    opportunities to work, create and share
  • Paper reading, discussion, project, presentation
  • Assessment
  • Class participation, assignments, project
    proposal, presentations, exam(s)

3
  • You: our future data miner
  • TA: Trevor Lei Tang, l.tang_at_asu.edu
  • Me: Huan Liu, huanliu_at_asu.edu
  • Where: Brickyard 566
  • When: see the class website, or by appointment
  • "No pain, no gain", or "As you sow, so you shall
    reap"; we will also learn the principle of "No
    Free Lunch".
  • MyASU will be used, so make sure your email
    address is correct and you won't miss important
    announcements

4
Course Format
  • What is the effective teaching of graduate data
    mining? Your help is wanted.
  • Research papers - the main categories can be
    found on the course web site.
  • You can choose one of the textbooks listed. It is
    an entry point for you to access related
    subjects. It is a fast-changing field.
  • Everyone is expected to read research papers and
    participate in class discussion.
  • Selected research paper presentation.
  • Project presentations will be evaluated during
    presentation.

5
Point distribution (tentative)
  • Projects (35%)
  • Reading/presentation assignment (10%)
  • Exam(s) (40%)
  • Assignments (15%), and class participation,
    quizzes (up to 10% extra credit)
  • Late penalty: YES, increasing exponentially.
  • Academic integrity
    (http://www.public.asu.edu/huanliu/conduct.html)

6
Research paper reading
  • We provide a reading list and you can also choose
    your favorite
  • All are expected to search for and read the
    selected papers.
  • What is it about (e.g., key idea, basic
    algorithm)?
  • What are points to discuss and improve?
  • What can we do with it?
  • What to submit? (see more on the class website)
  • A brief report that describes the above and 2
    questions suitable for quizzes/tests with
    solutions
  • A set of presentation slides for 20 minutes
  • Due date: TBA; use the digital drop box
  • Grading criteria include (1) quality of
    additional papers you select, (2) slides for
    presentation, (3) the report, and (4) oral
    presentation. Oral presentations will be selected
    from among the best submissions, and presenters
    will be given extra credit based on their
    presentation
  • Presentations can start as early as February,
    we hope

7
Project
  • Proposal
  • Proposal presentation, discussion, revision
  • A project that can be completed in a semester
  • Project
  • Class presentation and/or demo
  • Report
  • One key goal of this course is to take advantage
    of your intelligence and experience to create
    something useful and with impact

8
Topic Distribution (tentative)

9
Categories of interests (including design and
implementation)
  • Data and application security
  • Data mining and privacy
  • Data reduction and selection
  • Streaming data reduction
  • Dealing with large data (column- and row-wise)
  • Search bias, overfitting
  • Learning algorithms
  • Ensemble methods
  • Semi-supervised learning
  • Active learning and co-training
  • Bioinformatics for CBS 572 or others
  • A discussion board will be created

10
Your first assignment
  • Think about what you want to accomplish.
  • List 2 of your areas of interest (don't be
    restricted by the previous list).
  • Pick an area of interest and choose a general
    topic for paper presentation.
  • Submission via MyASU or hardcopy
  • Complete the above and submit it in the 3rd class
    (next Wednesday 1/25).

11
2nd Assignment due on Feb 1
  • First, choose your category of interest
  • Second, form presentation groups (3-4 a group)
  • Third, each group picks a paper from the given
    list of papers and finds 2 additional
    high-quality relevant papers
  • Submit it through MyASU
  • TA will help you and compile a list of all papers
    at the end
  • Write a summary for each paper including
  • What is it about
  • Why is it significant and relevant
  • Where is it published and when

12
Introduction
  • The need for data mining
  • Data mining
  • Text mining
  • Image mining
  • Web mining (log, link, content)
  • Bioinformatics
  • Many products and abundant applications
  • Where do we stand

13
What is data mining
  • Data mining is
  • extraction of useful patterns from data sources,
    e.g., databases, texts, web, image.
  • the analysis of (often large) observational data
    sets to find unsuspected relationships and to
    summarize the data in novel ways that are both
    understandable and useful to the data owner.

14
Patterns (1)
  • Patterns are the relationships and summaries
    derived through a data mining exercise.
  • Patterns must be
  • valid
  • novel
  • potentially useful
  • understandable

15
Patterns (2)
  • Patterns are used for
  • prediction or classification
  • describing the existing data
  • segmenting the data (e.g., the market)
  • profiling the data (e.g., your customers)
  • Detection (e.g., intrusion, fault, anomaly)

16
Data (1)
  • Data mining typically deals with data that have
    already been collected for some purpose other
    than data mining.
  • Data miners usually have no influence on data
    collection strategies.
  • Large bodies of data cause new problems:
    representation, storage, retrieval, analysis, ...

17
Data (2)
  • Even with a very large data set, we are usually
    faced with just a sample from the population.
  • Data exist in many types (continuous, nominal)
    and forms (credit card usage records, supermarket
    transactions, government statistics, text,
    images, medical records, human genome databases,
    molecular databases).

18
Typical DM tasks
  • Classification
  • mining patterns that can classify future data
    into known classes.
  • Association rule mining
  • mining any rule of the form X → Y, where X and Y
    are sets of data items.
  • Clustering
  • identifying a set of similar groups in the data
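
To make the association rule form above (X implies Y, with X and Y sets of items) more concrete, here is a minimal sketch that computes two commonly used rule measures, support and confidence, on a toy transaction list. The transactions, the item names, and the helper `rule_stats` are hypothetical and chosen only for illustration; they are not part of the course materials.

```python
from typing import FrozenSet, List, Tuple

# Toy market-basket data (hypothetical, for illustration only).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]


def rule_stats(
    x: FrozenSet[str],
    y: FrozenSet[str],
    transactions: List[FrozenSet[str]],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule X -> Y.

    support    = fraction of transactions containing every item of X and Y
    confidence = fraction of transactions containing X that also contain Y
    """
    if not transactions:
        return 0.0, 0.0
    # `a <= b` on sets tests whether a is a subset of b.
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


support, confidence = rule_stats(
    frozenset({"bread"}), frozenset({"milk"}), TRANSACTIONS
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here the rule {bread} -> {milk} holds in 2 of the 4 transactions, and in 2 of the 3 transactions that contain bread.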

19
  • Sequential pattern mining
  • A sequential rule A → B says that event A will
    be immediately followed by event B with a certain
    confidence
  • Deviation/anomaly/exception detection
  • discovering the most significant changes in data
  • Data visualization: using graphical methods to
    show patterns in data.
  • High performance computing
  • Bioinformatics
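
The confidence of a sequential rule, as described above (event A immediately followed by event B), can be sketched as a simple ratio over one event sequence. This is only one plausible reading: the function name `sequential_confidence`, the event log, and the choice to ignore an occurrence of A that is the last event (it has no successor) are all assumptions made for illustration.

```python
from typing import Sequence


def sequential_confidence(events: Sequence[str], a: str, b: str) -> float:
    """Fraction of occurrences of `a` that are immediately followed by `b`.

    An occurrence of `a` at the very end of the sequence is not counted,
    because it has no following event (an assumption of this sketch).
    """
    followed = 0
    total = 0
    # Walk over consecutive pairs (events[i], events[i + 1]).
    for current, nxt in zip(events, events[1:]):
        if current == a:
            total += 1
            if nxt == b:
                followed += 1
    return followed / total if total else 0.0


# Hypothetical event log.
LOG = ["login", "search", "login", "search", "login", "logout", "search"]
conf = sequential_confidence(LOG, "login", "search")
print(f"confidence(login -> search) = {conf:.2f}")
```

In this log, "login" occurs three times and is immediately followed by "search" twice.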

20
Why data mining
  • Rapid computerization of businesses produces huge
    amounts of data
  • How to make best use of data?
  • A growing realization: knowledge discovered from
    data can be used for competitive advantage and to
    increase business intelligence.

21
  • Make use/sense of your data assets
  • Many interesting things you want to find cannot
    be found using database queries
  • "Find me people likely to buy my products"
  • "Who is likely to respond to my promotion?"
  • Quickly identify underlying relationships and
    respond to emerging opportunities

22
Why now and for the near future
  • The data is abundant.
  • The data is being collected or warehoused.
  • The computing power is affordable.
  • The competitive pressure is increasing.
  • Data mining tools have become available.
  • New challenges
  • New data types evolve
  • New applications emerge

23
DM fields
  • Data mining is an emerging multi-disciplinary
    field
  • Statistics
  • Machine learning
  • Databases
  • Visualization
  • OLAP and data warehousing
  • High-performance computing
  • ...

24
Summary
  • What is data mining?
  • KDD - knowledge discovery in databases:
    non-trivial extraction of implicit, previously
    unknown and potentially useful information
  • Why do we need data mining?
  • Wide use of computer systems - data explosion -
    knowledge is power, but we're data rich,
    knowledge poor - useful, understandable and
    actionable knowledge ...
  • Data mining is not plug-and-play, so we are not
    done yet and need to continue this class

25
An Overview of KDD Process (Guess which is which)

26
Web mining: an application
  • The Web is a massive database
  • Semi-structured data
  • XML and RDF
  • Web mining
  • Content
  • Structure
  • Usage
  • Link analysis