Title: Knowledge Engineering
1. Knowledge Engineering
Data mining
2. We are deluged by data!
Scientific data, medical data, demographic data, financial data, and marketing data.
People have no time to look at this data!
We therefore need a tool to automatically analyze the data, classify it, summarize it, discover and characterize trends in it, and flag anomalies.
This "magic tool" is data mining.
3. The data explosion
Increased use of electronic data-gathering devices, e.g. point-of-sale and remote-sensing devices. Data storage became easier and cheaper with increasing computing power.
4. What is Data Mining?
Definition:
the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
OR
the variety of techniques used to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. The data is often voluminous but, as it stands, of low value, since no direct use can be made of it; it is the hidden information in the data that is useful
OR
the extraction of hidden predictive information from large databases
5. Data Mining and DBMS
DBMS
Queries based on the data held, e.g.:
- last month's sales for each product
- sales grouped by customer age
- list of customers who lapsed their policy
Data Mining
Infers knowledge from the data held to answer queries, e.g.:
- what characteristics do customers who lapsed their policies share, and how do they differ from those who renewed their policies?
- why is the Cleveland division so profitable?
(A sketch contrasting the two styles of query follows below.)
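To make the contrast concrete, here is a minimal sketch in Python with pandas; the table and its column names ("age", "premium", "lapsed") are invented for illustration:

    import pandas as pd

    # Hypothetical policy records; the column names are illustrative only.
    policies = pd.DataFrame({
        "customer": ["A", "B", "C", "D", "E", "F"],
        "age":      [23, 45, 31, 52, 28, 60],
        "premium":  [300, 520, 410, 610, 350, 700],
        "lapsed":   [True, False, True, False, True, False],
    })

    # DBMS-style query: list facts already held in the data.
    lapsed_customers = policies.loc[policies["lapsed"], "customer"].tolist()

    # Mining-style question: how do lapsed customers differ from renewers?
    # A crude first characterization: compare the two groups' statistics.
    profile = policies.groupby("lapsed")[["age", "premium"]].mean()
    print(lapsed_customers)
    print(profile)

The DBMS answers "who lapsed?"; the mining step starts to answer "what do lapsed customers have in common?".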
6. Characteristics of a Data Mining System
- Large quantities of data: the volume of data is so great it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
- Noisy, incomplete data: imprecise data is characteristic of all data collection; databases are usually contaminated by errors, and one cannot assume that the data they contain is entirely correct, e.g. some attributes rely on subjective or measurement judgments.
- Complex data structure: conventional statistical analysis is not possible.
- Heterogeneous data stored in legacy systems.
7. Data Mining Goals
- Classification
- Association
- Sequence/temporal analysis
- Cluster/outlier analysis
8. Data Mining and Machine Learning
- Data Mining, or Knowledge Discovery in Databases (KDD), is about finding understandable knowledge. Machine Learning is concerned with improving the performance of an agent; e.g. training a neural network to balance a pole is part of ML, but not of KDD.
- DM is concerned with very large, real-world databases; ML typically looks at smaller data sets.
- DM deals with real-world data, which tends to have problems such as missing values, dynamic data, noise, and pre-existing data; ML has laboratory-type examples for the training set.
- Efficiency and scalability of the algorithm are more important in DM/KDD.
9. Issues in Data Mining
- Noisy data
- Missing values
- Static data
- Sparse data
- Dynamic data
- Relevance
- Interestingness
- Heterogeneity
- Algorithm efficiency
- Size and complexity of data
10. Data Mining Process
- Data pre-processing (a small sketch follows below)
  - heterogeneity resolution
  - data cleansing
  - data warehousing
- Data mining tools applied
  - extraction of patterns from the pre-processed data
- Interpretation and evaluation
  - user bias, i.e. the user can direct DM tools to areas of interest:
    - attributes of interest in databases
    - goal of discovery
    - domain knowledge
    - prior knowledge or belief about the domain
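A minimal pre-processing sketch in Python with pandas; the two invented quirks (inconsistent country codes and a missing height) stand in for heterogeneity and dirty data:

    import pandas as pd
    import numpy as np

    # Hypothetical raw extract combining two legacy sources.
    raw = pd.DataFrame({
        "height_cm": [180, np.nan, 165, 172],
        "country":   ["UK", "uk", "U.K.", "France"],
    })

    # Heterogeneity resolution: map variant codes to one convention.
    raw["country"] = raw["country"].str.upper().str.replace(".", "", regex=False)

    # Data cleansing: fill the missing value with the column median
    # (one simple policy among many).
    raw["height_cm"] = raw["height_cm"].fillna(raw["height_cm"].median())
    print(raw)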
11. Techniques
- Object-oriented database methods
- Statistics
- Clustering
- Visualization
- Neural networks
- Rule Induction
12. Techniques
- Object-oriented approaches/databases
  - Making use of DBMSs to discover knowledge; SQL is limiting.
  - Advantages:
    - Easier maintenance; objects may be understood as stand-alone entities.
    - Objects are appropriate reusable components.
    - For some systems, there may be an obvious mapping from real-world entities to system objects.
13. Techniques
- Statistics
  - Can be used in several data mining stages:
    - data cleansing, i.e. the removal of erroneous or irrelevant data known as outliers
    - EDA (exploratory data analysis), e.g. frequency counts, histograms, etc.
    - data selection: sampling facilities reduce the scale of computation
    - attribute re-definition, e.g. Body Mass Index (BMI), which is weight/height²
    - data analysis: measures of association and relationships between attributes, interestingness of rules, classification, etc.
(A sketch of attribute re-definition and simple cleansing follows below.)
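A minimal sketch of two of these stages in Python with pandas; the records and the plausibility bounds are invented:

    import pandas as pd

    # Hypothetical patient records; 400 kg is a deliberate entry error.
    df = pd.DataFrame({
        "weight_kg": [70, 82, 55, 400],
        "height_m":  [1.75, 1.80, 1.60, 1.70],
    })

    # Attribute re-definition: BMI = weight / height^2.
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    # EDA: quick summary statistics of the new attribute.
    print(df["bmi"].describe())

    # Cleansing: drop rows with implausible BMI values (assumed bounds).
    clean = df[df["bmi"].between(10, 80)]
    print(clean)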
14. Techniques
Visualization enhances EDA and makes patterns more visible: 1-D, 2-D, and 3-D visualizations.
Example: NETMAP, a commercial data mining tool, uses this technique.
15. Techniques
- Cluster/outlier analysis
  - Clustering according to similarity.
  - Partitioning the database so that each partition or group is similar according to some criterion or metric.
  - Appears in many disciplines, e.g. in chemistry the clustering of molecules.
  - Data mining applications make use of it, e.g. to segment a client/customer base.
  - Provides sub-groups of a population for further analysis or action; very important when dealing with very large databases.
  - Can be used for profile generation for target marketing, as in the sketch below.
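A minimal segmentation sketch in Python with scikit-learn's KMeans; the customer features ([age, annual spend]) and the choice of k=2 are assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer base: one row per customer, [age, annual spend].
    customers = np.array([
        [22, 300], [25, 350], [47, 900],
        [52, 950], [46, 870], [23, 320],
    ])

    # Partition into two groups by similarity (Euclidean distance here).
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(km.labels_)           # cluster membership per customer
    print(km.cluster_centers_)  # the "profile" of each segment

The cluster centers serve as the segment profiles mentioned above, e.g. "young low spenders" versus "older high spenders".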
16. Techniques
- Artificial Neural Networks (ANN)
  - A trained ANN can be thought of as an "expert" in the category of information it has been given to analyze.
  - It provides projections for new situations of interest and answers "what if" questions (see the sketch below).
  - Problems include:
    - the resulting network is viewed as a black box: no explanation of the results is given, i.e. it is difficult for the user to interpret the results
    - it is difficult to incorporate user intervention
    - they are slow to train due to their iterative nature
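A minimal sketch using scikit-learn's MLPClassifier; the tiny [age, income] training set and the network size are invented for illustration:

    from sklearn.neural_network import MLPClassifier

    # Hypothetical training data: [age, income in k] -> buys product (0/1).
    X = [[25, 30], [40, 80], [35, 60], [50, 90], [28, 35], [45, 85]]
    y = [0, 1, 1, 1, 0, 1]

    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0).fit(X, y)

    # "What if" a 30-year-old earning 70k appears? The net answers,
    # but gives no explanation -- the black-box problem noted above.
    print(net.predict([[30, 70]]))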
17. Techniques
Artificial Neural Networks (ANN): a data mining example using neural networks (figure in the original slides).
18. Techniques
- Decision trees
  - Built using a training set of data; can then be used to classify new objects.
  - Description:
    - an internal node is a test on an attribute
    - a branch represents an outcome of the test, e.g. Color = red
    - a leaf node represents a class label or class-label distribution
  - At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
  - A new case is classified by following a matching path to a leaf node.
19. Techniques
- Building a decision tree
  - Top-down tree construction
    - At the start, all training examples are at the root.
    - Partition the examples recursively by choosing one attribute at a time.
  - Bottom-up tree pruning
    - Remove sub-trees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
20. Techniques
Example: decision tree for "Play?"

    Outlook = sunny    -> test Humidity:
                            high   -> No
                            normal -> Yes
    Outlook = overcast -> Yes
    Outlook = rain     -> test Windy:
                            true   -> No
                            false  -> Yes
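The same tree can be learned from data. A minimal sketch with scikit-learn, using ten rows of the classic weather data; the one-hot encoding is our own workaround, since scikit-learn trees need numeric inputs:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "outlook":  ["sunny", "sunny", "overcast", "rain", "rain",
                     "rain", "overcast", "sunny", "sunny", "rain"],
        "humidity": ["high", "high", "high", "high", "normal",
                     "normal", "normal", "high", "normal", "normal"],
        "windy":    [False, True, False, False, False,
                     True, True, False, False, False],
        "play":     ["No", "No", "Yes", "Yes", "Yes",
                     "No", "Yes", "No", "Yes", "Yes"],
    })

    # Top-down construction on the training set, then print the rules.
    X = pd.get_dummies(data.drop(columns="play"))
    tree = DecisionTreeClassifier(random_state=0).fit(X, data["play"])
    print(export_text(tree, feature_names=list(X.columns)))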
21. Techniques
- Rule induction
  - The extraction of useful if-then rules from data, based on statistical significance.
  - Example format: If X Then Y (scored in the sketch below).
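A minimal sketch, in plain Python, of how a candidate rule might be scored; support and confidence are standard rule statistics, though the slides do not commit to a particular algorithm:

    # Score the candidate rule: If outlook=sunny Then play=No.
    records = [
        {"outlook": "sunny", "play": "No"},
        {"outlook": "sunny", "play": "No"},
        {"outlook": "rain",  "play": "Yes"},
        {"outlook": "sunny", "play": "Yes"},
    ]
    x  = [r for r in records if r["outlook"] == "sunny"]   # where X holds
    xy = [r for r in x if r["play"] == "No"]               # where X and Y hold
    support    = len(xy) / len(records)  # how often X and Y co-occur
    confidence = len(xy) / len(x)        # estimate of P(Y | X)
    print(support, confidence)           # 0.5 0.666...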
22. Techniques
- Frames
  - Frames are templates for holding clusters of related knowledge about a very particular subject.
  - They are a natural way to represent knowledge.
  - They take a taxonomic approach.
  - Problem: they are more complex than rule representation (see the sketch below).
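A minimal sketch of a frame in Python, with named slots, defaults, and a taxonomic "is-a" link; the representation is our own illustration, not a standard library:

    # A frame is a template of named slots; "is_a" links it into a taxonomy.
    vehicle = {"is_a": None, "wheels": 4, "powered": True}
    car     = {"is_a": vehicle, "doors": 4}   # inherits wheels and powered

    def get_slot(frame, slot):
        """Look up a slot, climbing the is-a taxonomy if needed."""
        while frame is not None:
            if slot in frame:
                return frame[slot]
            frame = frame["is_a"]
        raise KeyError(slot)

    print(get_slot(car, "wheels"))  # 4, inherited from the vehicle frame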
24. Data Warehousing
Definition: any centralized data repository which can be queried for business benefit.
- Warehousing makes it possible to:
  - extract archived operational data
  - overcome inconsistencies between different legacy data formats
  - integrate data throughout an enterprise, regardless of location, format, or communication requirements
  - incorporate additional or expert information
25. Characteristics of a Data Warehouse
- Subject-oriented: data organized by subject instead of application, e.g. an insurance company would organize its data by customer, premium, and claim, instead of by product (auto, life, etc.); contains only the information necessary for decision-support processing.
- Integrated: encoding of data is often inconsistent, e.g. gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention (see the sketch below).
- Time-variant: the data warehouse contains a place for storing data that are five to ten years old, or older, e.g. for comparisons, trends, and forecasting; these data are not updated.
- Non-volatile: data are not updated or changed in any way once they enter the data warehouse; data are only loaded and accessed.
26. Data Warehousing Processes
- Insulate data, i.e. the current operational information: preserves the security and integrity of mission-critical OLTP applications and gives access to the broadest possible base of data.
- Retrieve data from a variety of heterogeneous operational databases; data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition).
- Metadata: information describing the model and the definition of the source data elements.
- Data cleansing: removal of certain aspects of operational data, such as low-level transaction information, which slow down query times.
- Transfer: processed data are transferred to the data warehouse, a large database on a high-performance box.
27. Data Warehouse Architecture
28. Criteria for Data Warehouses
1. Load performance: requires incremental loading of new data on a periodic basis; must not artificially constrain the volume of data.
2. Load processing: data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update.
3. Data quality management: ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
4. Query performance: must not be slowed or inhibited by the performance of the data warehouse RDBMS.
5. Terabyte scalability: data warehouse sizes are growing at astonishing rates, so the RDBMS must have no architectural limitations; it must support modular and parallel management.
29. Criteria for Data Warehouses (continued)
6. Mass user scalability: access to warehouse data must not be limited to an elite few; the system has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
7. Networked data warehouse: data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation.
8. Warehouse administration: the large scale and time-cyclic nature of the data warehouse demand administrative ease and flexibility.
9. The RDBMS must integrate dimensional analysis: dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools.
10. Advanced query functionality: end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data.
30. Problems with Data Warehousing
"The rush of companies to jump on the bandwagon, as these companies have slapped 'data warehouse' labels on traditional transaction-processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category." (Chris Erickson, Red Brick)
31. Data Warehousing and OLTP
32. OLTP Systems
OLTP systems are designed to maximize transaction capacity. But they:
- cannot be repositories of facts and historical data for business analysis
- cannot quickly answer ad hoc queries: rapid retrieval is almost impossible
- contain data that is inconsistent and changing; duplicate entries exist, and entries can be missing
- offer large amounts of raw data, which is not easily understood
A typical OLTP query is a simple aggregation, e.g. "what is the current account balance for this customer?"
Data warehouse systems, by contrast, are oriented toward query processing, as opposed to transaction processing (the sketch below contrasts the two).
33. OLAP: On-Line Analytical Processing
- The problem is how to process larger and larger databases. OLAP involves many data items (many thousands or even millions) in complex relationships.
- Fast response is crucial in OLAP.
- Difference between OLAP and OLTP:
  - OLTP servers handle mission-critical production data accessed through simple queries.
  - OLAP servers handle management-critical data accessed through iterative analytical investigation (sketched below).
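A minimal sketch of one OLAP-style "slice" in Python with pandas; the sales table and its dimensions are invented:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["N", "N", "S", "S", "N", "S"],
        "product": ["x", "y", "x", "y", "x", "x"],
        "year":    [2023, 2023, 2023, 2024, 2024, 2024],
        "revenue": [10, 14, 9, 16, 12, 11],
    })

    # Revenue by region x year; an analyst would iterate, re-pivoting
    # along product, quarter, etc. -- the iterative investigation above.
    cube = sales.pivot_table(index="region", columns="year",
                             values="revenue", aggfunc="sum")
    print(cube)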
34. The End
We hope you enjoyed it.
Thanks!