Title: Data
1???????
2 Outline
- Introduction
- Data
- Steps of Data Mining
- Association Rules
- Classification
- Clustering
- Applications
- Conclusion
3 What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery in databases (KDD)
4 Data Mining: Confluence of Multiple Disciplines
5 Why Not Traditional Data Analysis?
- Tremendous amount of data
- High complexity of data
6 Why Data Mining?
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
7 Business Objectives
8 What is Data?
- Collection of data objects and their attributes
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, or instance
(Figure: data table labeled with "Attributes" and "Objects")
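As a minimal sketch of the terminology above, a data set can be held as a list of records, one per object, with attribute names as keys. The people and values below are hypothetical and serve only to illustrate the vocabulary.

```python
# Hypothetical data set: each dict is one object (record, instance);
# each key is an attribute (variable, field, feature).
people = [
    {"name": "Ann", "eye_color": "brown", "temperature_c": 36.6},
    {"name": "Bob", "eye_color": "blue", "temperature_c": 37.1},
    {"name": "Eve", "eye_color": "brown", "temperature_c": 36.9},
]


def attribute_names(objects):
    """Return the attribute names shared by every object, sorted."""
    return sorted(set.intersection(*(set(o) for o in objects)))


def attribute_values(objects, attribute):
    """Return one attribute's value for each object, in order."""
    return [o[attribute] for o in objects]


print(attribute_names(people))
print(attribute_values(people, "eye_color"))
```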
9 Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web
10 Steps of Data Mining
11 Steps of a KDD Process
- Learning the application domain
  - Relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
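The first few steps listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the attribute names, and the choice of min-max scaling and a mean as the "mining" step are all hypothetical stand-ins, not part of the slides.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "age": 25, "income": 30000, "note": "x"},
    {"id": 2, "age": None, "income": 52000, "note": "y"},
    {"id": 3, "age": 40, "income": 58000, "note": "z"},
    {"id": 4, "age": 55, "income": 90000, "note": "w"},
]


def select(records, attributes):
    """Data selection: keep only the attributes relevant to the goal."""
    return [{a: r[a] for a in attributes} for r in records]


def clean(records):
    """Data cleaning: drop records that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def min_max_scale(records, attribute):
    """Transformation: rescale one attribute to the range [0, 1]."""
    values = [r[attribute] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [{**r, attribute: (r[attribute] - lo) / span} for r in records]


def summarize(records, attribute):
    """A very simple mining function: summarization by the mean."""
    return mean(r[attribute] for r in records)


target = select(RAW, ["age", "income"])
cleaned = clean(target)
scaled = min_max_scale(cleaned, "income")
average_income = summarize(scaled, "income")
print(len(cleaned), round(average_income, 3))
```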
12 Data Mining Functionalities
- Outlier analysis
  - Outlier: data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
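One simple way to flag a data object that "does not comply with the general behavior of the data" is a z-score test. The slides do not name a method, so the technique, the threshold of 2.0, and the sample amounts below are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily transaction amounts with one unusual value.
daily_amounts = [10, 12, 11, 13, 12, 11, 95]
print(z_score_outliers(daily_amounts))
```

Whether the flagged value is noise or a meaningful exception (for example, fraud) still has to be decided from domain knowledge.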
13 Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
(Table: Market-Basket transactions)
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
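Rules such as Diaper → Beer are usually scored by support and confidence. Those two measures are not defined on the slide, and the market-basket table did not survive extraction, so the transactions below are hypothetical. A minimal sketch:

```python
# Hypothetical market-basket transactions (one set of items per basket).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of baskets that contain the whole itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


print(support({"Diaper", "Beer"}, transactions))
print(confidence({"Diaper"}, {"Beer"}, transactions))
```

With these five baskets, Diaper and Beer appear together in three of them, and Diaper appears in four, so the rule Diaper → Beer has support 3/5 and confidence 3/4.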
14 Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
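The train/test procedure described above can be sketched in a few lines. The one-attribute records, the fixed every-third-record split, and the use of a 1-nearest-neighbor rule as the "model" are all assumptions made for illustration; any classifier could take its place.

```python
# Hypothetical records: (attribute value, class label).
records = [
    (1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
    (8.0, "high"), (8.5, "high"), (9.0, "high"), (9.5, "high"),
    (3.0, "low"),
]


def split(data, test_every=3):
    """Hold out every `test_every`-th record as the test set."""
    train = [r for i, r in enumerate(data) if i % test_every != 0]
    test = [r for i, r in enumerate(data) if i % test_every == 0]
    return train, test


def predict_1nn(train, x):
    """Assign the class of the closest training record."""
    nearest = min(train, key=lambda r: abs(r[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class is correct."""
    correct = sum(predict_1nn(train, x) == label for x, label in test)
    return correct / len(test)


train_set, test_set = split(records)
print(len(train_set), len(test_set), accuracy(train_set, test_set))
```

The key design point is that the test records are never used to build the model, so the measured accuracy estimates performance on previously unseen records.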
15 Classification Task
Decision Tree
16 Weather Data: Play or Not Play?
Note: Outlook is the forecast; it has no relation to the Microsoft email program
17 Example Tree for "Play?"
Outlook
  sunny: Humidity
    high: No
    normal: Yes
  overcast: Yes
  rain: Windy
    false: Yes
    true: No
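A decision tree is a set of nested tests on attribute values. Assuming the tree has the shape usually drawn for this weather data (sunny leads to a Humidity test, overcast is always Yes, rain leads to a Windy test), it can be written as plain conditionals. The function name and argument encoding are hypothetical.

```python
def play(outlook, humidity, windy):
    """Classify one day with the example tree.

    outlook: "sunny", "overcast", or "rain"
    humidity: "high" or "normal" (only consulted when sunny)
    windy: True or False (only consulted when raining)
    """
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    if outlook == "rain":
        return "No" if windy else "Yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


print(play("sunny", "high", False))
print(play("rain", "normal", False))
```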
18 Clustering
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis
  - Grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
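To make "no predefined classes" concrete, here is a minimal one-dimensional k-means sketch. K-means is one common clustering algorithm and is not named in the slides; the points and the starting centroids are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning each point to its nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(points, centroids=[0.0, 5.0])
print(centers)
print(groups)
```

No class labels are supplied anywhere: the two groups emerge only from the distances between the points.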
19 Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
20 Data Mining for Retail Industry
- Retail industry: huge amounts of data on sales, customer shopping history, etc.
- Applications of retail data mining
  - Identify customer buying behaviors
  - Discover customer shopping patterns and trends
  - Improve the quality of customer service
  - Achieve better customer retention and satisfaction
  - Enhance goods consumption ratios
  - Design more effective goods transportation and distribution policies
21 Data Mining in Retail Industry: Examples
- Design and construction of data warehouses based on the benefits of data mining
  - Multidimensional analysis of sales, customers, products, time, and region
- Analysis of the effectiveness of sales campaigns
- Customer retention: analysis of customer loyalty
  - Use customer loyalty card information to register sequences of purchases of particular customers
  - Use sequential pattern mining to investigate changes in customer consumption or loyalty
  - Suggest adjustments on the pricing and variety of goods
- Purchase recommendation and cross-reference of items
22 Financial Data Mining
- Classification and clustering of customers for targeted marketing
  - Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group
- Detection of money laundering and other financial crimes
  - Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
  - Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
24 Data Mining for Telecomm. Industry (1)
- A rapidly expanding and highly competitive industry with a great demand for data mining
  - Understand the business involved
  - Identify telecommunication patterns
  - Catch fraudulent activities
  - Make better use of resources
  - Improve the quality of service
- Multidimensional analysis of telecommunication data
  - Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
25 Data Mining for Telecomm. Industry (2)
- Fraudulent pattern analysis and the identification of unusual patterns
  - Identify potentially fraudulent users and their atypical usage patterns
  - Detect attempts to gain fraudulent entry to customer accounts
  - Discover unusual patterns which may need special attention
- Multidimensional association and sequential pattern analysis
  - Find usage patterns for a set of communication services by customer group, by month, etc.
  - Promote the sales of specific services
  - Improve the availability of particular services in a region
- Use of visualization tools in telecommunication data analysis
26 Biomedical and DNA Data Analysis
- DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
- Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
- Humans have around 30,000 genes
- Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
- Semantic integration of heterogeneous, distributed genome databases
  - Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  - Data cleaning and data integration methods developed in data mining will help
27 DNA Analysis: Examples
- Similarity search and comparison among DNA sequences
  - Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  - Identify gene sequence patterns that play roles in various diseases
- Association analysis: identification of co-occurring gene sequences
  - Most diseases are not triggered by a single gene but by a combination of genes acting together
  - Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
- Path analysis: linking genes to different disease development stages
  - Different genes may become active at different stages of the disease
  - Develop pharmaceutical interventions that target the different stages separately
- Visualization tools and genetic data analysis
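Comparing the frequently occurring patterns of each class can be illustrated with short subsequences of fixed length k (k-mers). This is a deliberately simple stand-in for real sequence analysis: the sequences, the class names, and the choice of k = 3 are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k):
    """How often each k-mer occurs in one sequence."""
    return Counter(kmers(sequence, k))


def class_specific_kmers(target, other, k):
    """k-mers present in every `target` sequence and in no `other` one."""
    shared = set.intersection(*(set(kmers(s, k)) for s in target))
    seen_elsewhere = set().union(*(set(kmers(s, k)) for s in other))
    return shared - seen_elsewhere


# Hypothetical sequences for two classes of samples.
diseased = ["ACGTACGT", "TTACGTAA"]
healthy = ["AAAATTTT", "CCCCGGGG"]

print(kmer_counts("ACGTACGT", 3))
print(sorted(class_specific_kmers(diseased, healthy, 3)))
```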
28 Other Applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
29 Web Mining Taxonomy
- Web structure mining
- Web content mining
- Web usage mining
  - Mining the usage data
  - Users' behaviour
    - Click the link
    - Browsing time
    - Transaction
30 What is Web Log Mining?
- Web servers register a log entry for every single access they get.
- A huge number of accesses (hits) are registered and collected in an ever-growing web log.
- Web log mining
  - Enhance server performance
  - Improve web site navigation
  - Improve system design of web applications
  - Target customers for e-commerce (EC)
  - Identify potential prime advertisement locations
31 Log Data
- Each entry records the client's IP address, the date and time the request is received, the time zone where the server is located, the request command, the URL (Uniform Resource Locator) of the requested page, the protocol of the request, the return code of the server, and the size of the page.
- mac04cville.wam.umd.edu - - 01/Apr/1997:00:00:00 -0600 - "GET /ka/graphics/mel/melting_glass.html HTTP/1.0" 200 11880
- mac04cville.wam.umd.edu - - 01/Apr/1997:00:00:20 -0600 - "GET /ka/graphics/mel/glass.html HTTP/1.0" 200 11880
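Before any mining can happen, each log entry has to be split into the fields listed above. The sketch below assumes the timestamp is written as day/month/year:hour:minute:second followed by a time-zone offset, as in the common web server log layout, and tolerates the extra " - " seen in the sample entries. The pattern and field names are assumptions, not a specification of any particular server's format.

```python
import re
from datetime import datetime

# Assumed layout: host, two ignored fields, timestamp with zone offset
# (optionally in square brackets), request line in quotes, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[?(?P<time>\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]?'
    r'(?: -)? '
    r'"(?P<command>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if the
    line does not match the assumed layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    return entry


sample = ('mac04cville.wam.umd.edu - - 01/Apr/1997:00:00:00 -0600 - '
          '"GET /ka/graphics/mel/melting_glass.html HTTP/1.0" 200 11880')
entry = parse_log_line(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Once entries are parsed into records like this, usage mining steps such as counting hits per URL or per client become ordinary operations on a list of dictionaries.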
32 Conclusion
- Data mining is a young discipline with wide and diverse applications
- There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications
33
34
- 1. ( )???????????? (A)KDD (B)DPP (C)KBB (D)DBP.
- 2. ( )?????????????? (A) A.I. (B) Machine Learning (C) Data Base (D) ????.
- 3. ( )?????????,??????????. (A) ???? (B) ???????domain (C)????? (D) ??????.
- 4. ( )???????????,?????????????. (A)???????? (B)???????????????? (C)??????????? (D)?????????????.
- 5. ( )?sensor?????????(A) spatial data (B) Text data (C) stream data (D) Structure data.
- 6. ( )????????????????, ???data????? (A) spatial data (B) Time-series data (C) stream data (D) Structure data.
- 7. ( )???????????,???data????? (A) spatial data (B) Text data (C) stream data (D) Structure data.
- 8. ( )????????text data ???(attribute) (A)????? (B) ???? (C)???????? (D) ??????.
35
- 9. ( ) ??????????????????????, ????????????(A)?????(B)??? (C)??? (D)??????.
- 10. ( ) ????????, ????, ???, ??????????????????? ??????. ???????????(A)?????(B)??? (C)??? (D)??????.
- 11. ( ) ?????????????, ???????,????????????(A)?????(B)??? (C)??? (D)??????.
- 12. ( )?????????, ???????, ????????(A)?????(B)??? (C)??? (D) ????
- 13. ( ) ?????, ?????????????????, ???????????, ???????(A)??? (B) ??? (C) ??? (D) ???.
- 14. ( )???????????????? (A) ?????? (B) ????????? (C) ???????? (D) ????.
- 15. ( )???????????????data (A) spatial data (B) Time-series data (C) stream data (D) ????.