Title: Knowledge Engineering
1. Knowledge Engineering
Data mining
2. We are deluged by data!
Scientific data, medical data, demographic data, financial data, and marketing data.
People have no time to look at this data!
We therefore need a tool to automatically analyze the data, classify it, summarize it, discover and characterize trends in it, and flag anomalies.
This "magic tool" is data mining.
3. The data explosion
Increased use of electronic data-gathering devices, e.g. point-of-sale and remote-sensing devices. Data storage became easier and cheaper with increasing computing power.
4. What is Data Mining?
Definition:
the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
OR
the variety of techniques used to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. The data is often voluminous but, as it stands, of low value, since no direct use can be made of it; it is the hidden information in the data that is useful
OR
the extraction of hidden predictive information from large databases
5. Data Mining and DBMS
DBMS
Queries based on the data held, e.g.:
- last month's sales for each product
- sales grouped by customer age
- list of customers who lapsed their policy
Data Mining
Infers knowledge from the data held to answer queries, e.g.:
- what characteristics do customers who lapsed their policies share, and how do they differ from those who renewed their policies?
- why is the Cleveland division so profitable?
(A sketch contrasting the two styles of query follows below.)
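To make the contrast concrete, here is a minimal sketch in Python with pandas; the table and its column names ("age", "premium", "lapsed") are invented for illustration:

    import pandas as pd

    # Hypothetical policy records; the column names are illustrative only.
    policies = pd.DataFrame({
        "customer": ["A", "B", "C", "D", "E", "F"],
        "age":      [23, 45, 31, 52, 28, 60],
        "premium":  [300, 520, 410, 610, 350, 700],
        "lapsed":   [True, False, True, False, True, False],
    })

    # DBMS-style query: list facts already held in the data.
    lapsed_customers = policies.loc[policies["lapsed"], "customer"].tolist()

    # Mining-style question: how do lapsed customers differ from renewers?
    # A crude first characterization: compare the two groups' statistics.
    profile = policies.groupby("lapsed")[["age", "premium"]].mean()
    print(lapsed_customers)
    print(profile)

The DBMS answers "who lapsed?"; the mining step starts to answer "what do lapsed customers have in common?".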
6. Characteristics of a Data Mining System
- Large quantities of data: the volume of data is so great it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
- Noisy, incomplete data: imprecise data is characteristic of all data collection; databases are usually contaminated by errors, and one cannot assume that the data they contain is entirely correct, e.g. some attributes rely on subjective or measurement judgments.
- Complex data structure: conventional statistical analysis is not possible.
- Heterogeneous data stored in legacy systems.
7. Data Mining Goals
- Classification
- Association
- Sequence/temporal analysis
- Cluster/outlier analysis
8. Data Mining and Machine Learning
- Data Mining, or Knowledge Discovery in Databases (KDD), is about finding understandable knowledge. Machine Learning is concerned with improving the performance of an agent; e.g. training a neural network to balance a pole is part of ML, but not of KDD.
- DM is concerned with very large, real-world databases; ML typically looks at smaller data sets.
- DM deals with real-world data, which tends to have problems such as missing values, dynamic data, noise, and pre-existing data; ML has laboratory-type examples for the training set.
- Efficiency and scalability of the algorithm are more important in DM/KDD.
9. Issues in Data Mining
- Noisy data
- Missing values
- Static data
- Sparse data
- Dynamic data
- Relevance
- Interestingness
- Heterogeneity
- Algorithm efficiency
- Size and complexity of data
10. Data Mining Process
- Data pre-processing (a small sketch follows below)
  - heterogeneity resolution
  - data cleansing
  - data warehousing
- Data mining tools applied
  - extraction of patterns from the pre-processed data
- Interpretation and evaluation
  - user bias, i.e. the user can direct DM tools to areas of interest:
    - attributes of interest in databases
    - goal of discovery
    - domain knowledge
    - prior knowledge or belief about the domain
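A minimal pre-processing sketch in Python with pandas; the two invented quirks (inconsistent country codes and a missing height) stand in for heterogeneity and dirty data:

    import pandas as pd
    import numpy as np

    # Hypothetical raw extract combining two legacy sources.
    raw = pd.DataFrame({
        "height_cm": [180, np.nan, 165, 172],
        "country":   ["UK", "uk", "U.K.", "France"],
    })

    # Heterogeneity resolution: map variant codes to one convention.
    raw["country"] = raw["country"].str.upper().str.replace(".", "", regex=False)

    # Data cleansing: fill the missing value with the column median
    # (one simple policy among many).
    raw["height_cm"] = raw["height_cm"].fillna(raw["height_cm"].median())
    print(raw)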
11. Techniques
- Object-oriented database methods
- Statistics
- Clustering
- Visualization
- Neural networks
- Rule Induction
12. Techniques
- Object-oriented approaches/databases
  - Making use of DBMSs to discover knowledge; SQL is limiting.
  - Advantages:
    - Easier maintenance; objects may be understood as stand-alone entities.
    - Objects are appropriate reusable components.
    - For some systems, there may be an obvious mapping from real-world entities to system objects.
13. Techniques
- Statistics
  - Can be used in several data mining stages:
    - data cleansing, i.e. the removal of erroneous or irrelevant data known as outliers
    - EDA (exploratory data analysis), e.g. frequency counts, histograms, etc.
    - data selection: sampling facilities reduce the scale of computation
    - attribute re-definition, e.g. Body Mass Index (BMI), which is weight/height²
    - data analysis: measures of association and relationships between attributes, interestingness of rules, classification, etc.
(A sketch of attribute re-definition and simple cleansing follows below.)
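A minimal sketch of two of these stages in Python with pandas; the records and the plausibility bounds are invented:

    import pandas as pd

    # Hypothetical patient records; 400 kg is a deliberate entry error.
    df = pd.DataFrame({
        "weight_kg": [70, 82, 55, 400],
        "height_m":  [1.75, 1.80, 1.60, 1.70],
    })

    # Attribute re-definition: BMI = weight / height^2.
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    # EDA: quick summary statistics of the new attribute.
    print(df["bmi"].describe())

    # Cleansing: drop rows with implausible BMI values (assumed bounds).
    clean = df[df["bmi"].between(10, 80)]
    print(clean)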
14. Techniques
Visualization enhances EDA and makes patterns more visible: 1-D, 2-D, and 3-D visualizations.
Example: NETMAP, a commercial data mining tool, uses this technique.
15. Techniques
- Cluster/outlier analysis
  - Clustering according to similarity.
  - Partitioning the database so that each partition or group is similar according to some criterion or metric.
  - Appears in many disciplines, e.g. in chemistry the clustering of molecules.
  - Data mining applications make use of it, e.g. to segment a client/customer base.
  - Provides sub-groups of a population for further analysis or action; very important when dealing with very large databases.
  - Can be used for profile generation for target marketing, as in the sketch below.
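A minimal segmentation sketch in Python with scikit-learn's KMeans; the customer features ([age, annual spend]) and the choice of k=2 are assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer base: one row per customer, [age, annual spend].
    customers = np.array([
        [22, 300], [25, 350], [47, 900],
        [52, 950], [46, 870], [23, 320],
    ])

    # Partition into two groups by similarity (Euclidean distance here).
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(km.labels_)           # cluster membership per customer
    print(km.cluster_centers_)  # the "profile" of each segment

The cluster centers serve as the segment profiles mentioned above, e.g. "young low spenders" versus "older high spenders".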
16. Techniques
- Artificial Neural Networks (ANN)
  - A trained ANN can be thought of as an "expert" in the category of information it has been given to analyze.
  - It provides projections for new situations of interest and answers "what if" questions (see the sketch below).
  - Problems include:
    - the resulting network is viewed as a black box: no explanation of the results is given, i.e. it is difficult for the user to interpret the results
    - it is difficult to incorporate user intervention
    - they are slow to train due to their iterative nature
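A minimal sketch using scikit-learn's MLPClassifier; the tiny [age, income] training set and the network size are invented for illustration:

    from sklearn.neural_network import MLPClassifier

    # Hypothetical training data: [age, income in k] -> buys product (0/1).
    X = [[25, 30], [40, 80], [35, 60], [50, 90], [28, 35], [45, 85]]
    y = [0, 1, 1, 1, 0, 1]

    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0).fit(X, y)

    # "What if" a 30-year-old earning 70k appears? The net answers,
    # but gives no explanation -- the black-box problem noted above.
    print(net.predict([[30, 70]]))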
17. Techniques
Artificial Neural Networks (ANN): a data mining example using neural networks (figure in the original slides).
18. Techniques
- Decision trees
  - Built using a training set of data; can then be used to classify new objects.
  - Description:
    - an internal node is a test on an attribute
    - a branch represents an outcome of the test, e.g. Color = red
    - a leaf node represents a class label or class-label distribution
  - At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
  - A new case is classified by following a matching path to a leaf node.
19. Techniques
- Building a decision tree
  - Top-down tree construction
    - At the start, all training examples are at the root.
    - Partition the examples recursively by choosing one attribute at a time.
  - Bottom-up tree pruning
    - Remove sub-trees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
20. Techniques
Example: decision tree for "Play?"

    Outlook = sunny    -> test Humidity:
                            high   -> No
                            normal -> Yes
    Outlook = overcast -> Yes
    Outlook = rain     -> test Windy:
                            true   -> No
                            false  -> Yes
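The same tree can be learned from data. A minimal sketch with scikit-learn, using ten rows of the classic weather data; the one-hot encoding is our own workaround, since scikit-learn trees need numeric inputs:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "outlook":  ["sunny", "sunny", "overcast", "rain", "rain",
                     "rain", "overcast", "sunny", "sunny", "rain"],
        "humidity": ["high", "high", "high", "high", "normal",
                     "normal", "normal", "high", "normal", "normal"],
        "windy":    [False, True, False, False, False,
                     True, True, False, False, False],
        "play":     ["No", "No", "Yes", "Yes", "Yes",
                     "No", "Yes", "No", "Yes", "Yes"],
    })

    # Top-down construction on the training set, then print the rules.
    X = pd.get_dummies(data.drop(columns="play"))
    tree = DecisionTreeClassifier(random_state=0).fit(X, data["play"])
    print(export_text(tree, feature_names=list(X.columns)))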
21. Techniques
- Rule induction
  - The extraction of useful if-then rules from data, based on statistical significance.
  - Example format: If X Then Y (scored in the sketch below).
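A minimal sketch, in plain Python, of how a candidate rule might be scored; support and confidence are standard rule statistics, though the slides do not commit to a particular algorithm:

    # Score the candidate rule: If outlook=sunny Then play=No.
    records = [
        {"outlook": "sunny", "play": "No"},
        {"outlook": "sunny", "play": "No"},
        {"outlook": "rain",  "play": "Yes"},
        {"outlook": "sunny", "play": "Yes"},
    ]
    x  = [r for r in records if r["outlook"] == "sunny"]   # where X holds
    xy = [r for r in x if r["play"] == "No"]               # where X and Y hold
    support    = len(xy) / len(records)  # how often X and Y co-occur
    confidence = len(xy) / len(x)        # estimate of P(Y | X)
    print(support, confidence)           # 0.5 0.666...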
22. Techniques
- Frames
  - Frames are templates for holding clusters of related knowledge about a very particular subject.
  - They are a natural way to represent knowledge.
  - They take a taxonomic approach.
  - Problem: they are more complex than rule representation (see the sketch below).
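A minimal sketch of a frame in Python, with named slots, defaults, and a taxonomic "is-a" link; the representation is our own illustration, not a standard library:

    # A frame is a template of named slots; "is_a" links it into a taxonomy.
    vehicle = {"is_a": None, "wheels": 4, "powered": True}
    car     = {"is_a": vehicle, "doors": 4}   # inherits wheels and powered

    def get_slot(frame, slot):
        """Look up a slot, climbing the is-a taxonomy if needed."""
        while frame is not None:
            if slot in frame:
                return frame[slot]
            frame = frame["is_a"]
        raise KeyError(slot)

    print(get_slot(car, "wheels"))  # 4, inherited from the vehicle frame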
24. Data Warehousing
Definition: any centralized data repository which can be queried for business benefit.
- Warehousing makes it possible to:
  - extract archived operational data
  - overcome inconsistencies between different legacy data formats
  - integrate data throughout an enterprise, regardless of location, format, or communication requirements
  - incorporate additional or expert information
25. Characteristics of a Data Warehouse
- Subject-oriented: data organized by subject instead of application, e.g. an insurance company would organize its data by customer, premium, and claim, instead of by product (auto, life, etc.); contains only the information necessary for decision-support processing.
- Integrated: encoding of data is often inconsistent, e.g. gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention (see the sketch below).
- Time-variant: the data warehouse contains a place for storing data that are five to ten years old, or older, e.g. for comparisons, trends, and forecasting; these data are not updated.
- Non-volatile: data are not updated or changed in any way once they enter the data warehouse; data are only loaded and accessed.
26. Data Warehousing Processes
- Insulate data, i.e. the current operational information: preserves the security and integrity of mission-critical OLTP applications and gives access to the broadest possible base of data.
- Retrieve data from a variety of heterogeneous operational databases; data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition).
- Metadata: information describing the model and the definition of the source data elements.
- Data cleansing: removal of certain aspects of operational data, such as low-level transaction information, which slow down query times.
- Transfer: processed data are transferred to the data warehouse, a large database on a high-performance box.
27. Data Warehouse Architecture
28. Criteria for Data Warehouses
1. Load performance: requires incremental loading of new data on a periodic basis; must not artificially constrain the volume of data.
2. Load processing: data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update.
3. Data quality management: ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
4. Query performance: must not be slowed or inhibited by the performance of the data warehouse RDBMS.
5. Terabyte scalability: data warehouse sizes are growing at astonishing rates, so the RDBMS must have no architectural limitations; it must support modular and parallel management.
29. Criteria for Data Warehouses (continued)
6. Mass user scalability: access to warehouse data must not be limited to an elite few; the system has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
7. Networked data warehouse: data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation.
8. Warehouse administration: the large scale and time-cyclic nature of the data warehouse demand administrative ease and flexibility.
9. The RDBMS must integrate dimensional analysis: dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools.
10. Advanced query functionality: end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data.
30. Problems with Data Warehousing
"The rush of companies to jump on the bandwagon, as these companies have slapped 'data warehouse' labels on traditional transaction-processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category." (Chris Erickson, Red Brick)
31. Data Warehousing and OLTP
32. OLTP Systems
OLTP systems are designed to maximize transaction capacity. But they:
- cannot be repositories of facts and historical data for business analysis
- cannot quickly answer ad hoc queries: rapid retrieval is almost impossible
- contain data that is inconsistent and changing; duplicate entries exist, and entries can be missing
- offer large amounts of raw data, which is not easily understood
A typical OLTP query is a simple aggregation, e.g. "what is the current account balance for this customer?"
Data warehouse systems, by contrast, are oriented toward query processing, as opposed to transaction processing (the sketch below contrasts the two).
33. OLAP: On-Line Analytical Processing
- The problem is how to process larger and larger databases. OLAP involves many data items (many thousands or even millions) in complex relationships.
- Fast response is crucial in OLAP.
- Difference between OLAP and OLTP:
  - OLTP servers handle mission-critical production data accessed through simple queries.
  - OLAP servers handle management-critical data accessed through iterative analytical investigation (sketched below).
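A minimal sketch of one OLAP-style "slice" in Python with pandas; the sales table and its dimensions are invented:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["N", "N", "S", "S", "N", "S"],
        "product": ["x", "y", "x", "y", "x", "x"],
        "year":    [2023, 2023, 2023, 2024, 2024, 2024],
        "revenue": [10, 14, 9, 16, 12, 11],
    })

    # Revenue by region x year; an analyst would iterate, re-pivoting
    # along product, quarter, etc. -- the iterative investigation above.
    cube = sales.pivot_table(index="region", columns="year",
                             values="revenue", aggfunc="sum")
    print(cube)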
34. The End
We hope you enjoyed it.
Thanks!