Research Problems in Data Mining

1
Research Problems in Data Mining

Jiawei Han
Database Systems Research Lab
Department of Computer Science
University of Illinois at Urbana-Champaign,
U.S.A.
http://www.cs.uiuc.edu/~hanj

2
Several Research Issues in Data Mining

Web mining and text mining
Biomedical/DNA data mining
On-line, real-time, stream data mining
Cube exploration: iceberg, cube-gradient, trends, etc.
Mining max/closed, long, and error-tolerant frequent and sequential patterns
Intrusion detection and anomaly mining
Invisible data mining

3
Challenges in Web Mining

The Web: a huge, widely distributed, highly heterogeneous, semi-structured, interconnected, evolving hypertext/hypermedia information repository
Problems:
the abundance problem
limited coverage of the Web (hidden Web sources)
limited query interface: keyword-oriented search
limited customization to individual users
DBMS, DBers, and data miners will play an increasingly important role in the new generation of the Internet

4
Web Mining: Lots Can Be Done!

A taxonomy of Web mining:
Web content mining
Web usage mining
Some interesting problems in Web mining:
Mining what Web search engines find
Weblog mining (usage, access, and evolution)
Identification of authoritative Web pages
Web document classification
Warehousing a Meta-Web: Web yellow page service
Intelligent query answering in Web search

5
Mine What Web Search Engines Find

Current Web search engines: a convenient source for mining
keyword-based, return too many answers, low-quality answers, still missing a lot, not customized, etc.
Data mining will help:
coverage: enlarge and then shrink, using synonyms and conceptual hierarchies
better search primitives: user preferences/hints
linkage analysis: authoritative pages and clusters
Web-based languages: XML, WebSQL, WebML
customization: home page, Weblog, user profiles

6
Web Log Mining

Weblogs provide rich information about Web dynamics
Multidimensional Weblog analysis: disclose potential customers, users, markets, etc.
Web access association/sequential pattern analysis: Web caching, prefetching, swapping
Web linkage adjustment
Trend analysis: dynamics of the Web, i.e., what has been changing?
Customized to individual users
Need additional information in order to discover truly useful patterns
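
As a rough illustration of access-pattern analysis on Weblogs, the sketch below counts which page-to-page transitions recur across user sessions; such frequent transitions are the kind of regularity that could drive prefetching or caching. The session data, the function name `frequent_transitions`, and the `min_support` threshold are all hypothetical, and real logs would first need sessionization and cleaning.

```python
from collections import Counter

def frequent_transitions(sessions, min_support=2):
    """Return consecutive page pairs that occur in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # Count each pair at most once per session, so the count is a session support.
        counts.update(set(zip(pages, pages[1:])))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical, already-sessionized click paths.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/help"],
    ["/home", "/about"],
]
patterns = frequent_transitions(sessions)
print(patterns)
```

Only length-2 patterns are handled here; longer sequential patterns would need a dedicated sequential pattern miner.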

7
Discovery of Authoritative Pages in WWW

PageRank method (Brin and Page, 1998):
Rank the "importance" of Web pages, based on a model of a "random browser."
Hub/authority method (Kleinberg, 1998):
Prominent authorities often do not endorse one another directly on the Web.
Hub pages have a large number of links to many relevant authorities.
Thus hubs and authorities exhibit a mutually reinforcing relationship.
Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW, e.g., Google.
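
The "random browser" idea can be sketched as a power iteration over a small link graph. This is a minimal illustration, not the production algorithm of any search engine: the damping factor of 0.85, the iteration count, and the four-page toy graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical toy Web: D links to C, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page C collects links from three pages and ends up ranked highest, while D, which nobody links to, keeps only the random-jump share.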

8
Web Document Classification

Automatic classification of Web pages vs. human classification (e.g., Yahoo!)
Training set: existing, typical good classification sites, e.g., Yahoo!, CS term hierarchies
Classification methods:
Typical methods: naïve Bayesian, decision trees, etc.
Keyword-based classification is different from multi-dimensional classification
Association- or clustering-based classification is often more effective
Multi-level classification is important

9
Warehousing a High-Level Web: An MLDB Approach

ML-Web: a structure that summarizes the contents, structure, linkage, and access of the Web and that evolves with the Web
Layer 0: the Web itself
Layer 1: the lowest layer of the ML-Web
An entry: a Web page summary, including class, time, URL, contents, keywords, popularity, rank, links, etc.
Layer 2 and up: summary/classification/clustering in various ways, distributed for various applications
ML-Web can be warehoused and incrementally updated
Querying and mining can be performed on or assisted by ML-Web (an ML digital library catalogue, yellow page)

10
A Multiple Layered Meta-Web Architecture

[Figure: layers stacked from Layer 0 at the bottom, through Layer 1 (generalized descriptions), up to Layer n (more generalized descriptions)]

11
Construction of Multi-Layer Meta-Web

XML: facilitates structured and meta-information extraction
Hidden Web: DB schema extraction and other meta info
Automatic classification of Web documents: based on Yahoo!, etc. as training set, with keyword-based correlation/classification analysis (IR/AI assistance)
Automatic ranking of important Web pages: authoritative site recognition and clustering of Web pages
Generalization-based multi-layer ML-Web construction: with the assistance of clustering and classification analysis

12
Use of Multi-Layer Meta Web

Benefits of Multi-Layer Meta-Web:
Multi-dimensional Web info summary analysis
Approximate and intelligent query answering
Web high-level query answering (WebSQL, WebML)
Web content and structure mining
Observing the dynamics/evolution of the Web
Is it realistic to construct such a meta-Web?
Benefits even if it is partially constructed
Benefits may justify the cost of tool development, standardization, and partial restructuring

13
Intelligent Web Query Answering

What is intelligent query answering?
Smart alternative answers, summary information, etc.
Based on users' profiles or history
Web queries need a more intelligent query answering mechanism
How to develop it?
Data warehouse and Web Yellow Page service will help
Data mining will help too!

14
Biomedical Data Mining and DNA Analysis

DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have around 50,000 genes
Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
Semantic integration of heterogeneous, distributed genome databases
Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
Data cleaning and data integration methods developed in data mining will help

15
Discovery and Comparison of DNA Sequences

Finding tandem repeats
Fault-tolerant sequential patterns (Is BLAST enough?)
Similarity search and comparison among DNA sequences
Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
Identify gene sequence patterns that play roles in various diseases
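
To make "finding tandem repeats" concrete, here is a small sketch that scans a DNA string for a short unit repeated back to back. It handles exact repeats of one fixed unit length only; the fault-tolerant matching raised above is not covered. The function name, parameters, and sample sequence are hypothetical.

```python
def tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for runs of a unit repeated back to back."""
    found = []
    i = 0
    while i + unit_len * min_copies <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        # Extend the run while the next block equals the unit.
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies:
            found.append((i, unit, copies))
            i += copies * unit_len  # skip past the run just reported
        else:
            i += 1
    return found

# Hypothetical sequence containing "AC" four times in a row, starting at index 2.
hits = tandem_repeats("GGACACACACTT", unit_len=2, min_copies=3)
print(hits)
```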

16
Association and Path Analysis in Bio-Medical and DNA Data Mining

Association analysis: identification of co-occurring gene sequences
Most diseases are not triggered by a single gene but by a combination of genes acting together
Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
Path analysis: linking genes to different disease development stages
Different genes may become active at different stages of the disease
Develop pharmaceutical interventions that target the different stages separately
Visualization tools and genetic data analysis

17
Stream Data and Applications

Prevalence of data streams:
Network flow analysis and management
Telephone call details and fraud detection
On-line sensor monitoring
Characteristics of streams:
Massive and continuous amounts of data: O(100 GB) per day
Storage constraints: small space (e.g., main memory or cache)
Online continuous queries, i.e., agg: stream -> stream

18
On-Line Mining of Stream Data

Single-pass aggregates
Basic, simple (exact) aggregates: trivial, inherently single-pass
Approximate "fancy" aggregates (e.g., median)
Discovering correlations, associations, models, cause-and-effect relationships between patterns
Discovering trends, clusters, changes, and outliers in data flow: visual mining will help as well
Data stream warehousing: saving regularities?
Constraint- or goal-directed stream mining?
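
The contrast between exact and approximate single-pass aggregates can be sketched as follows. Count, sum, min, and max are maintained exactly in constant space, while the median is only estimated from a fixed-size reservoir sample. Reservoir sampling is one possible technique chosen for this illustration; the class name, sample size, and seed are assumptions.

```python
import random

class StreamSummary:
    """One-pass summary: exact count/sum/min/max plus a fixed-size random sample."""

    def __init__(self, sample_size=100, seed=0):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.sample = []
        self.sample_size = sample_size
        self._rng = random.Random(seed)

    def add(self, x):
        self.count += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        # Reservoir sampling: each item seen so far stays in the sample
        # with equal probability, using only sample_size slots.
        if len(self.sample) < self.sample_size:
            self.sample.append(x)
        else:
            j = self._rng.randrange(self.count)
            if j < self.sample_size:
                self.sample[j] = x

    def mean(self):
        return self.total / self.count

    def approx_median(self):
        ordered = sorted(self.sample)
        return ordered[len(ordered) // 2]

summary = StreamSummary(sample_size=200)
for value in range(1, 10001):  # stand-in for an unbounded stream
    summary.add(value)
print(summary.mean(), summary.approx_median())
```

The mean is exact, whereas the median estimate varies with the sample; a larger reservoir narrows the error at the cost of more memory.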

19
Multidimensional Data and Data Cubes

Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Product, Month, and Region]

20
Mining and Explorative Analysis of Data Cubes

Efficient computation of data or iceberg cubes
Discovery-driven data cube analysis
Cube-gradient analysis
What are the changes in the average house value in Silicon Valley in 2001 compared with 2000?
Under what conditions did the average house value increase 10% per year in the Chicago area in the 1990s?
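
A naive sketch of an iceberg cube may help here: aggregate over every subset of the dimensions, but keep only the cells whose row count reaches a threshold. The code below does this by brute force, so it shows what is computed rather than how to compute it efficiently. The house-price rows, names, and threshold are hypothetical, and the last lines do a simple gradient-style comparison between two cells.

```python
from collections import defaultdict
from itertools import combinations

def iceberg_cube(rows, dimensions, measure, min_count):
    """Aggregate over every subset of dimensions; keep cells with >= min_count rows."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cells = defaultdict(lambda: [0, 0.0])  # key -> [row count, sum of measure]
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key][0] += 1
                cells[key][1] += row[measure]
            for key, (count, total) in cells.items():
                if count >= min_count:
                    cube[(dims, key)] = (count, total / count)
    return cube

# Hypothetical house sales (price in thousands).
rows = [
    {"city": "Chicago", "year": 2000, "price": 200.0},
    {"city": "Chicago", "year": 2000, "price": 220.0},
    {"city": "Chicago", "year": 2001, "price": 231.0},
    {"city": "Chicago", "year": 2001, "price": 231.0},
    {"city": "Urbana", "year": 2001, "price": 120.0},
]
cube = iceberg_cube(rows, ["city", "year"], "price", min_count=2)

# Gradient-style question: how did the Chicago average change from 2000 to 2001?
before = cube[(("city", "year"), ("Chicago", 2000))][1]
after = cube[(("city", "year"), ("Chicago", 2001))][1]
change = after / before - 1.0
print(f"{change:.1%}")
```

The single Urbana sale falls below the threshold, so its cells are never materialized; that pruning is what distinguishes an iceberg cube from a full cube.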

21
What More Can Be Done in Mining Data Cubes?

Trend analysis in data cubes?
Which kinds of companies have asset growth trends similar to Microsoft's?
Cluster customers based on how similarly their shopping behavior changes over time
Model-based class comparison
What are the critical features that distinguish winners and losers?
Association and correlation analysis in data cubes
If a company's average profit is high, what other features will go with it?

22
Further Development of Frequent and Sequential Pattern Analysis

Efficient frequent pattern mining methods
Association: Apriori (1994), FP-growth (2000)
Sequential pattern: GSP (1996), PrefixSpan (2001)
Mining max patterns, closed patterns, approximately closed, top-n frequent patterns
Error-tolerant frequent and sequential patterns (e.g., DNA sequences)
Constraint-based mining of frequent and sequential patterns
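
For readers unfamiliar with the level-wise approach, the following is a compact, unoptimized sketch in the spirit of Apriori: count candidate itemsets, keep the frequent ones, and build the next level only from itemsets whose subsets are all frequent. The basket data and the support threshold are hypothetical, and the code makes no attempt at the efficiency that FP-growth and later methods target.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining; returns {frozenset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```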

23
Intrusion Detection and Anomaly Mining

Fighting against crimes and terrorists
Linking and mining dynamic and huge amounts of data
Sifting irregularities from regularities: mining regularities as a base for comparison and finding outliers
Classification: normal vs. alarming classes and models
Clustering and outlier analysis
Human-instructed conditions and condition-guided classification, clustering, and outlier analysis
Information visualization and stream data analysis
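
The idea of mining regularities first and then flagging deviations can be sketched with a very simple statistical baseline. This is only one of many possible outlier tests; the three-standard-deviation threshold, the function names, and the login counts are assumptions made for the example.

```python
from statistics import mean, stdev

def fit_baseline(normal_values):
    """Summarize regular behavior as (mean, sample standard deviation)."""
    return mean(normal_values), stdev(normal_values)

def find_outliers(values, baseline, threshold=3.0):
    """Flag values further than threshold standard deviations from the baseline mean."""
    mu, sigma = baseline
    return [x for x in values if abs(x - mu) > threshold * sigma]

# Hypothetical logins per hour observed during normal operation.
logins_per_hour = [10, 12, 11, 9, 10, 13, 11, 10]
baseline = fit_baseline(logins_per_hour)

# New observations are compared against the learned regularity.
alarms = find_outliers([11, 9, 48, 12], baseline)
print(alarms)
```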

24
Invisible Data Mining

Embed mining functions into information services
Web search engines (link analysis, authoritative pages, user profiles), adaptive Web sites, etc.
Improvement of query processing: history data
Making services smart and efficient
Benefits from/to data mining research
Data mining research has produced many scalable, efficient, novel mining solutions
Applications feed new challenge problems to research

25
Conclusions

Data mining: a young and promising discipline
A confluence of multiple disciplines: database, data warehouse, machine learning, statistics, high-performance computing, Web technology, etc.
Great progress in the last decade
Lots of research issues; a few are identified here:
Web mining and text mining
Biomedical/DNA data mining
On-line, real-time, stream data mining
Cube exploration: iceberg, cube-gradient, trends, etc.
Mining max/closed, long, and error-tolerant frequent and sequential patterns
Intrusion detection and anomaly mining
Invisible data mining

26
http://www.cs.uiuc.edu/~hanj

Thank you!!!

27
Selected Publications (2001)

A. K. H. Tung, J. Hou, and J. Han, "Spatial Clustering in the Presence of Obstacles", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based Clustering in Large Databases", Proc. 2001 Int. Conf. on Database Theory (ICDT'01), London, U.K., Jan. 2001.
H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
Y. Bedard, T. Merrett, and J. Han, "Fundamentals of Geospatial Data Warehousing for Geographic Knowledge Discovery", H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
J. Han, M. Kamber, and A. K. H. Tung, "Spatial Clustering Methods in Data Mining: A Survey", H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules", ACM Transactions on Information Systems, 2001.

28
Selected Publications (2000)

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
K. Wang, Y. He, and J. Han, "Mining Frequent Itemsets Using Support Constraints", Proc. 2000 Int. Conf. on Very Large Data Bases (VLDB'00), Cairo, Egypt, Sept. 2000, pp. 43-52.
E. D. Kim, J. M. W. Lam, and J. Han, "AIM: Approximate Intelligent Matching for Time Series Data", Proc. 2000 Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'00), Greenwich, U.K., Sept. 2000.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", submitted to 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
J. Pei and J. Han, "Can We Push More Constraints into Frequent Pattern Mining?", submitted to 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets for Association Rules", submitted to 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
D. Cheung, C. Hwang, A. Fu, and J. Han, "Efficient Rule-Based Attribute-Oriented Induction for Data Mining", Journal of Intelligent Information Systems, 15(2): 175-200, 2000.
N. Stefanovic, J. Han, and K. Koperski, "Object-Based Selective Materialization for Efficient Implementation of Spatial Data Cubes", IEEE Transactions on Knowledge and Data Engineering, 12(6), 2000.