Title: Research Problems in Data Mining
1. Research Problems in Data Mining
- Jiawei Han
- Database Systems Research Lab
- Department of Computer Science
- University of Illinois at Urbana-Champaign, U.S.A.
- http://www.cs.uiuc.edu/hanj
2. Several Research Issues in Data Mining
- Web mining and text mining
- Biomedical/DNA data mining
- On-line, real-time, stream data mining
- Cube exploration: iceberg, cube-gradient, trends, etc.
- Mining max/closed, long, and error-tolerant frequent and sequential patterns
- Intrusion detection and anomaly mining
- Invisible data mining
3. Challenges in Web Mining
- Web: a huge, widely distributed, highly heterogeneous, semi-structured, interconnected, evolving hypertext/hypermedia information repository
- Problems:
- the abundance problem
- limited coverage of the Web (hidden Web sources)
- limited query interface: keyword-oriented search
- limited customization to individual users
- DBMS, DBers, and data miners will play an increasingly important role in the new generation of the Internet
4. Web Mining: Lots Can Be Done!
- A taxonomy of Web mining
- Web content mining
- Web usage mining
- Some interesting problems on Web mining
- Mining what Web search engine finds
- Weblog mining (usage, access, and evolution)
- Identification of authoritative Web pages
- Web document classification
- Warehousing a Meta-Web: Web yellow page service
- Intelligent query answering in Web search
5. Mine What Web Search Engine Finds
- Current Web search engines: a convenient source for mining
- keyword-based, return too many answers, low-quality answers, still missing a lot, not customized, etc.
- Data mining will help:
- coverage: enlarge and then shrink, using synonyms and conceptual hierarchies
- better search primitives: user preferences/hints
- linkage analysis: authoritative pages and clusters
- Web-based languages: XML, WebSQL, WebML
- customization: home page, Weblog, user profiles
6. Web Log Mining
- Weblog provides rich information about Web dynamics
- Multidimensional Weblog analysis:
- disclose potential customers, users, markets, etc.
- Web accessing association/sequential pattern analysis:
- Web caching, prefetching, swapping
- Web linkage adjustment
- Trend analysis:
- Dynamics of the Web: what has been changing?
- Customized to individual users
- Need additional information in order to discover truly useful patterns
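As a rough illustration of the access-pattern analysis above, the sketch below counts page-to-page transitions in sessionized Weblog data and suggests a prefetch candidate. The session format, the function names `transition_counts` and `prefetch_candidate`, the `min_support` threshold, and the sample paths are all assumptions made for this example; the talk does not prescribe a method.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is followed directly by another page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts

def prefetch_candidate(counts, page, min_support=2):
    """Return the most frequent next page, or None if support is too low."""
    if page not in counts:
        return None
    following, n = counts[page].most_common(1)[0]
    return following if n >= min_support else None

# Hypothetical sessions, already split per visitor (an assumption).
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
counts = transition_counts(sessions)
print(prefetch_candidate(counts, "/home"))
```

A first-order transition table like this ignores longer paths; sequential pattern mining would be needed to capture them.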
7. Discovery of Authoritative Pages in WWW
- Page-rank method (Brin and Page, 1998):
- Rank the "importance" of Web pages, based on a model of a "random browser."
- Hub/authority method (Kleinberg, 1998):
- Prominent authorities often do not endorse one another directly on the Web.
- Hub pages have a large number of links to many relevant authorities.
- Thus hubs and authorities exhibit a mutually reinforcing relationship.
- Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW, e.g., Google.
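The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch of that idea on a toy graph; the function name `hits`, the fixed iteration count, and the page names are assumptions for illustration, not details from the talk.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        auth = _normalize(auth)
        # Hub: sum of authority scores of the pages linked to.
        hub = _normalize({p: sum(auth[t] for t in links.get(p, ())) for p in pages})
    return hub, auth

# Toy graph: three hub pages point at authorities that do not link to each other.
links = {"hub1": ["a", "b"], "hub2": ["a", "c"], "hub3": ["a"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))
```

In this toy graph the authorities never endorse one another, yet the page cited by all three hubs still ends up with the top authority score.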
8. Web Document Classification
- Automatic classification of Web pages vs. human classification (e.g., Yahoo!)
- Training set:
- Existing, typical good classification sites, e.g., Yahoo!, CS term hierarchies
- Classification methods:
- Typical methods: Naïve Bayesian, decision trees, etc.
- Keyword-based classification is different from multi-dimensional classification
- Association- or clustering-based classification is often more effective
- Multi-level classification is important
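To make the keyword-based Naïve Bayesian option concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The function names `train_nb` and `classify_nb`, the whitespace tokenization, and the four training snippets are hypothetical; a real system would train on a labeled directory such as the ones mentioned above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs is a list of (label, text) pairs; returns the counts NB needs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify_nb(model, text):
    """Pick the label with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training pages, reduced to a few keywords each.
training = [
    ("sports", "football match score goal"),
    ("sports", "tennis match player"),
    ("computing", "database query index"),
    ("computing", "compiler code database"),
]
model = train_nb(training)
print(classify_nb(model, "database index tuning"))
```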
9. Warehousing a High-Level Web: An MLDB Approach
- ML-Web: a structure which summarizes the contents, structure, linkage, and access of the Web and which evolves with the Web
- Layer 0: the Web itself
- Layer 1: the lowest layer of the ML-Web
- An entry: a Web page summary, including class, time, URL, contents, keywords, popularity, rank, links, etc.
- Layer 2 and up: summary/classification/clustering in various ways and distributed for various applications
- ML-Web can be warehoused and incrementally updated
- Querying and mining can be performed on or assisted by ML-Web (an ML digital library catalogue, yellow page).
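One way to picture the layers is as data structures: a Layer 1 entry is a per-page summary record, and a Layer 2 record generalizes many entries. The sketch below assumes a hypothetical `PageSummary` record with a subset of the fields listed above and a hypothetical `generalize` function that groups entries by class; the field names, the grouping rule, and the example URLs are illustrative assumptions, not the author's design.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class PageSummary:
    """Hypothetical Layer 1 entry: a summary of one Web page."""
    url: str
    page_class: str
    keywords: list
    popularity: int = 0
    links: list = field(default_factory=list)

def generalize(layer1):
    """Build hypothetical Layer 2 records: one summary per page class."""
    groups = defaultdict(list)
    for entry in layer1:
        groups[entry.page_class].append(entry)
    layer2 = {}
    for page_class, entries in groups.items():
        keyword_counts = Counter(k for e in entries for k in e.keywords)
        layer2[page_class] = {
            "pages": len(entries),
            "popularity": sum(e.popularity for e in entries),
            "top_keywords": [k for k, _ in keyword_counts.most_common(3)],
        }
    return layer2

# Placeholder entries; the URLs and values are made up for the example.
layer1 = [
    PageSummary("http://example.org/a", "database", ["sql", "index"], 10),
    PageSummary("http://example.org/b", "database", ["sql", "olap"], 5),
    PageSummary("http://example.org/c", "sports", ["tennis"], 7),
]
print(generalize(layer1)["database"])
```

Higher layers could be produced by applying the same kind of generalization again, for example by grouping classes into broader categories.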
10. A Multiple Layered Meta-Web Architecture
[Figure: stacked layers from Layer 0 at the bottom, through Layer 1 (generalized descriptions), up to Layer n (more generalized descriptions)]
11. Construction of Multi-Layer Meta-Web
- XML: facilitates structured and meta-information extraction
- Hidden Web: DB schema extraction and other meta info
- Automatic classification of Web documents:
- based on Yahoo!, etc. as training set; keyword-based correlation/classification analysis (IR/AI assistance)
- Automatic ranking of important Web pages:
- authoritative site recognition and clustering Web pages
- Generalization-based multi-layer ML-Web construction:
- With the assistance of clustering and classification analysis
12. Use of Multi-Layer Meta-Web
- Benefits of Multi-Layer Meta-Web
- Multi-dimensional Web info summary analysis
- Approximate and intelligent query answering
- Web high-level query answering (WebSQL, WebML)
- Web content and structure mining
- Observing the dynamics/evolution of the Web
- Is it realistic to construct such a meta-Web?
- Benefits even if it is partially constructed
- Benefits may justify the cost of tool
development, standardization and partial
restructuring
13. Intelligent Web Query Answering
- What is intelligent query answering?
- Smart alternative answers, summary information, etc.
- Based on users' profiles or history
- Web query needs a more intelligent query answering mechanism
- How to develop it?
- Data warehouse and Web Yellow Page service will help
- Data mining will help too!
14. Biomedical Data Mining and DNA Analysis
- DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
- Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
- Humans have around 50,000 genes
- Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
- Semantic integration of heterogeneous, distributed genome databases:
- Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
- Data cleaning and data integration methods developed in data mining will help
15. Discovery and Comparison of DNA Sequences
- Finding tandem repeats
- Fault-tolerant sequential patterns (Is BLAST enough?)
- Similarity search and comparison among DNA sequences:
- Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
- Identify gene sequence patterns that play roles in various diseases
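Finding tandem repeats can be illustrated with a naive scan that, at each position, checks whether a short unit is repeated back to back. The sketch below handles exact repeats only; it is not fault-tolerant, and the function name `tandem_repeats`, its parameters, and the example sequence are assumptions for illustration.

```python
def tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats, left to right."""
    found = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(min_unit, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for this unit length
            copies = 1
            # Extend while the next k characters equal the unit.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            covered = copies * k
            if copies >= min_copies and (best is None or covered > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best:
            found.append(best)
            i += best[2] * len(best[1])  # skip past the repeat just reported
        else:
            i += 1
    return found

# Made-up sequence containing (AC)x4 and (AGC)x3.
sequence = "GG" + "AC" * 4 + "TT" + "AGC" * 3 + "T"
print(tandem_repeats(sequence))
```

This brute-force scan is quadratic in the worst case and tolerates no mismatches, which is exactly the gap the error-tolerant pattern mining above is meant to address.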
16. Association and Path Analysis in Bio-Medical and DNA Data Mining
- Association analysis: identification of co-occurring gene sequences
- Most diseases are not triggered by a single gene but by a combination of genes acting together
- Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
- Path analysis: linking genes to different disease development stages
- Different genes may become active at different stages of the disease
- Develop pharmaceutical interventions that target the different stages separately
- Visualization tools and genetic data analysis
17. Stream Data and Applications
- Prevalence of Data Streams
- Network flow analysis and management
- Telephone call details and fraud detection
- On-line sensor monitoring
- Characteristics of Streams
- Massive and continuous amounts of data
- O(100 GB) per day
- Storage constraints
- small space (e.g., main memory or cache)
- Online continuous queries
- i.e., agg: stream -> stream
18. On-Line Mining of Stream Data
- Single-pass aggregates:
- Basic, simple (exact) aggregates
- trivial: inherently single-pass
- Approximate: fancy aggregates (e.g., median)
- Discovering correlations, associations, models, cause-and-effect relationships between patterns
- Discovering trends, clusters, changes, and outliers in data flow: visual mining will help as well
- Data stream warehousing: saving regularities? Constraint- or goal-directed stream mining?
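The contrast between exact and approximate single-pass aggregates can be sketched as follows. The mean and variance are computed exactly in one pass with constant memory (Welford's update), while the median is only estimated from a fixed-size reservoir sample. The class names `RunningStats` and `ReservoirMedian`, the sample size, and the synthetic stream are assumptions for this example; reservoir sampling is one of several possible approximation techniques, not one named in the talk.

```python
import random

class RunningStats:
    """Exact count, mean, and variance in one pass (Welford's update)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        return self._m2 / self.n if self.n else 0.0

class ReservoirMedian:
    """Approximate median from a uniform random sample of fixed size k."""
    def __init__(self, k, seed=0):
        self.k = k
        self.seen = 0
        self.sample = []
        self._rng = random.Random(seed)

    def update(self, x):
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(x)
        else:
            j = self._rng.randrange(self.seen)  # keep x with probability k/seen
            if j < self.k:
                self.sample[j] = x

    def median(self):
        ordered = sorted(self.sample)
        return ordered[len(ordered) // 2]

stats, approx = RunningStats(), ReservoirMedian(k=500)
for value in range(1, 10001):  # stand-in for an unbounded stream
    stats.update(value)
    approx.update(value)
print(stats.mean, approx.median())
```

Both structures use memory independent of the stream length, which fits the small-space constraint on stream processing.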
19. Multidimensional Data and Data Cubes
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
- Product: Industry -> Category -> Product
- Location: Region -> Country -> City -> Office
- Time: Year -> Quarter -> Month or Week -> Day
[Figure: data cube with axes Product, Region, and Month]
20. Mining and Explorative Analysis of Data Cubes
- Efficient computation of data or iceberg cubes
- Discovery-driven data cube analysis
- Cube-gradient analysis:
- What are the changes in the average house value in Silicon Valley in 2001 compared with 2000?
- Under what conditions did the average house value increase 10% per year in the Chicago area in the 1990s?
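An iceberg cube keeps only the aggregate cells whose support meets a threshold. The sketch below computes every group-by over a tiny fact table and discards cells with fewer than `min_count` rows. It is a naive enumeration for illustration, not one of the efficient algorithms the bullet above alludes to; the function name `iceberg_cube`, the row format, and the sample sales rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def iceberg_cube(rows, dims, measure, min_count):
    """Aggregate over every subset of dims; keep cells with count >= min_count."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            cells = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
            for row in rows:
                key = tuple(row[d] for d in group)
                cells[key][0] += 1
                cells[key][1] += row[measure]
            for key, (count, total) in cells.items():
                if count >= min_count:  # the iceberg condition
                    cube[(group, key)] = (count, total)
    return cube

# Made-up sales facts.
rows = [
    {"product": "TV", "region": "East", "month": "Jan", "sales": 100},
    {"product": "TV", "region": "East", "month": "Feb", "sales": 120},
    {"product": "TV", "region": "West", "month": "Jan", "sales": 80},
    {"product": "Radio", "region": "East", "month": "Jan", "sales": 30},
]
cube = iceberg_cube(rows, ("product", "region", "month"), "sales", min_count=2)
for cell, (count, total) in sorted(cube.items()):
    print(cell, count, total)
```

Because the count condition is anti-monotone (a cell below the threshold cannot have a qualifying descendant), efficient methods can prune whole branches rather than enumerate every group-by as this sketch does.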
21. What More Can Be Done in Mining Data Cubes?
- Trend analysis in data cubes?
- What kinds of companies have asset-increase trends similar to Microsoft's?
- Cluster customers based on their similar shopping behavior with regard to the change of time
- Model-based class comparison
- What are the critical features that distinguish winners and losers?
- Association and correlation analysis in data cubes
- If a company's average profit is high, what other features will go with it?
22. Further Development of Frequent and Sequential Pattern Analysis
- Efficient frequent pattern mining methods:
- Association: Apriori (94), FP-growth (00)
- Sequential pattern: GSP (96), PrefixSpan (01)
- Mining max patterns, closed patterns, approximately closed, top-n frequent patterns
- Error-tolerant frequent and sequential patterns (e.g., DNA sequences)
- Constraint-based mining of frequent and sequential patterns
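The relationship between frequent, closed, and max patterns can be shown on a toy example. The sketch below uses a simple level-wise (Apriori-style) search to find frequent itemsets, then filters them: an itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all. The function names, the absolute `min_support` count, and the transactions are assumptions for illustration; this is not an implementation of FP-growth or of any specific closed-pattern algorithm.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search; returns {itemset: support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        counted = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the Apriori property.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in current for sub in combinations(c, k - 1))]
    return frequent

def closed_and_max(frequent):
    """Split out closed and maximal itemsets from the frequent ones."""
    closed = {x for x, s in frequent.items()
              if not any(x < y and frequent[y] == s for y in frequent)}
    maximal = {x for x in frequent if not any(x < y for y in frequent)}
    return closed, maximal

transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
frequent = apriori(transactions, min_support=2)
closed, maximal = closed_and_max(frequent)
print(len(frequent), len(closed), len(maximal))
```

On long patterns the number of frequent itemsets grows exponentially, which is why mining only the closed or max patterns matters in practice.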
23. Intrusion Detection and Anomaly Mining
- Fighting against crimes and terrorists
- Linking and mining dynamic and huge amounts of data
- Sifting irregularities from regular ones: mining regularities as a base for comparison and finding outliers
- Classification: normal vs. alarming classes and models
- Clustering and outlier analysis
- Human-instructed conditions and condition-guided classification, clustering, and outlier analysis
- Information visualization and stream data analysis
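The idea of mining regularities as a base for comparison can be sketched with a robust baseline: summarize normal behavior by its median and median absolute deviation (MAD), then flag observations that deviate strongly. The function names, the 3.5 cutoff on the modified z-score, and the hourly login counts are assumptions for illustration; the talk does not commit to a specific outlier test.

```python
import statistics

def fit_baseline(values):
    """Summarize regular behavior by its median and median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad

def is_outlier(x, baseline, threshold=3.5):
    """Flag x when its modified z-score against the baseline exceeds threshold."""
    med, mad = baseline
    if mad == 0:
        return x != med  # degenerate baseline: any deviation is irregular
    score = 0.6745 * abs(x - med) / mad
    return score > threshold

# Hypothetical hourly login counts observed during normal operation.
normal_logins = [10, 12, 11, 9, 10, 13, 11, 10]
baseline = fit_baseline(normal_logins)
for observed in (12, 50):
    print(observed, is_outlier(observed, baseline))
```

The median and MAD are used here instead of the mean and standard deviation because they are less distorted if a few irregular values slip into the baseline data.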
24. Invisible Data Mining
- Embed mining functions into information services:
- Web search engine (link analysis, authoritative pages, user profiles), adaptive web sites, etc.
- Improvement of query processing: history data
- Making service smart and efficient
- Benefits from/to data mining research:
- Data mining research has produced many scalable, efficient, novel mining solutions
- Applications feed new challenge problems to research
25. Conclusions
- Data mining: a young and promising discipline
- A confluence of multiple disciplines: database, data warehouse, machine learning, statistics, high-performance computing, Web technology, etc.
- Great progress in the last decade
- Lots of research issues, and a few identified here:
- Web mining and text mining
- Biomedical/DNA data mining
- On-line, real-time, stream data mining
- Cube exploration: iceberg, cube-gradient, trends, etc.
- Mining max/closed, long, and error-tolerant frequent and sequential patterns
- Intrusion detection and anomaly mining
- Invisible data mining
26. http://www.cs.uiuc.edu/hanj
27. Selected Publications (2001)
- A. K. H. Tung, J. Hou, and J. Han, "Spatial Clustering in the Presence of Obstacles", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
- J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
- A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based Clustering in Large Databases", Proc. 2001 Int. Conf. on Database Theory (ICDT'01), London, U.K., Jan. 2001.
- H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
- Y. Bedard, T. Merrett, and J. Han, "Fundamentals of Geospatial Data Warehousing for Geographic Knowledge Discovery", in H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
- J. Han, M. Kamber, and A. K. H. Tung, "Spatial Clustering Methods in Data Mining: A Survey", in H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
- H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules", ACM Transactions on Information Systems, 2001.
28. Selected Publications (2000)
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
- K. Wang, Y. He, and J. Han, "Mining Frequent Itemsets Using Support Constraints", Proc. 2000 Int. Conf. on Very Large Data Bases (VLDB'00), Cairo, Egypt, Sept. 2000, pp. 43-52.
- E. D. Kim, J. M. W. Lam, and J. Han, "AIM: Approximate Intelligent Matching for Time Series Data", Proc. 2000 Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'00), Greenwich, U.K., Sept. 2000.
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", submitted to 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
- J. Pei and J. Han, "Can We Push More Constraints into Frequent Pattern Mining?", submitted to 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
- J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
- J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm of Mining Frequent Closed Itemsets for Association Rules", submitted to 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
- D. Cheung, C. Hwang, A. Fu, and J. Han, "Efficient Rule-Based Attribute-Oriented Induction for Data Mining", Journal of Intelligent Information Systems, 15(2): 175-200, 2000.
- N. Stefanovic, J. Han, and K. Koperski, "Object-Based Selective Materialization for Efficient Implementation of Spatial Data Cubes", IEEE Transactions on Knowledge and Data Engineering, 12(6), 2000.