CS245A

Slides: 71

Provided by: wesle4

Learn more at: http://web.cs.ucla.edu

1
CS245A Syllabus (2005)

Knowledge Discovery in Databases
Query Processing With Domain Semantics
Capture Database Semantics by Rule Induction
Intensional Query Answering
Fault Tolerant DDBMS Via Data Inference
Intelligent Dictionary Directory
Uncertainty Management Using Rough Sets
Data Mining Techniques (Ch. 4-7, Han and Kamber)
Active Databases
Mediators in Information Systems
KQML: A Language and Protocol for Knowledge and
Information Exchange

2
CS 245A - Syllabus (contd)

CoBase
CoSent
Relaxation for XML Documents
Query Formation From High-level Concepts
Knowledge Acquisition for Query Relaxation
Principles of Case-based Reasoning
A Case-based Reasoning Approach to AQA
CoXML
Data Mining for Sequence Data
Extracting key features from Free Text
Knowledge based Approach for Free Text Retrieval
Content-based Information Retrieval
Digital Library

3
References

Course notes: Intelligent Information Systems,
CS245A, Course Reader Material, 1141 Westwood
Blvd, 310-443-3303
Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, August
2000.
Wesley Chu and T.Y. Lin (eds.), Foundations and
Advances in Data Mining, Springer, 2005.

4
CS 245A: Intelligent Information Systems

Wesley W. Chu
Computer Science Department
U. of California
Los Angeles, CA

5
Knowledge Discovery In Databases

Information Explosion
Information doubles every 20 months
Increase in the number and size of DBs
NASA - Earth observation satellites, 1
picture/sec
Human genome - several billion genetic bases
US census data - lifestyle and subculture of the
US
How to analyze these databases (raw data)
There is a gap between
Data generation and data understanding
Intelligent data analysis will be useful and
valuable
AA uses frequent flyer DB to find its better
customers for specific market promotions

6
Knowledge Discovery In Databases (Contd)

Banks use customers' loan and credit information
to derive better loan approval and bankruptcy
protection
Package-goods manufacturers use the scanned
supermarket data to measure the effect of their
promotions and to look for shopping patterns
Techniques
Machine Learning
Statistics
Information Theory
Fuzzy Set

7
Knowledge Discovery

Extraction of implicit, previously unknown and
potentially useful information from Data
Given a set of facts (data) F, a language L, and a
measure of certainty C,
a pattern is a statement S in L that describes the
relationship among a subset Fs of F with
certainty C, such that S is a simpler
representation than the enumeration of all facts
in Fs
Discovered Knowledge
The output of a program that monitors the set
of facts in a DB and produces patterns.

8
Patterns

Expressed in a high-level language
Understood and used directly by people
Can be input to another program (e.g., an expert
system)
e.g.
If age < 25 and Driver-Education-Course = No
Then At-Fault-Accident = Yes
with likelihood 0.3
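
A minimal sketch of how the likelihood attached to such a rule could be estimated from records. The field names and the sample records are made up for illustration and do not come from the course material.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass
class Pattern:
    """An if-then pattern: condition and conclusion are predicates on a record."""
    description: str
    condition: Callable[[Mapping], bool]
    conclusion: Callable[[Mapping], bool]


def likelihood(pattern: Pattern, records: Sequence[Mapping]) -> Optional[float]:
    """Fraction of records matching the condition that also match the conclusion."""
    matching = [r for r in records if pattern.condition(r)]
    if not matching:
        return None  # the pattern says nothing about this data set
    hits = sum(1 for r in matching if pattern.conclusion(r))
    return hits / len(matching)


young_untrained = Pattern(
    description="age < 25 and Driver-Education-Course = No => At-Fault-Accident = Yes",
    condition=lambda r: r["age"] < 25 and r["driver_ed"] == "No",
    conclusion=lambda r: r["at_fault"] == "Yes",
)

# Hypothetical sample records.
records = [
    {"age": 19, "driver_ed": "No", "at_fault": "Yes"},
    {"age": 22, "driver_ed": "No", "at_fault": "No"},
    {"age": 24, "driver_ed": "No", "at_fault": "No"},
    {"age": 23, "driver_ed": "Yes", "at_fault": "No"},
    {"age": 40, "driver_ed": "No", "at_fault": "Yes"},
]

estimate = likelihood(young_untrained, records)  # 1 of the 3 matching records
```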

9
Patterns (Contd)

Patterns that are completely unrelated to current
goals are not considered as knowledge.
e.g.
A pattern relating at-fault accidents to
a driver's age is not useful for analyzing auto sales
figures.
Pattern + interesting results = knowledge
Age > 16 is not an interesting pattern for
drivers, since all drivers must be older than 16.

10
Knowledge Discovery in DB Exhibits Four Main
Characteristics

High-Level Language
Understood by human users
Accuracy
Expressed by measure of uncertainty
Interesting Results
Patterns are novel and potentially useful
Efficiency
Running times for large-sized DB are predictable
and acceptable

11
Efficiency

The discovery process should be efficiently
implemented on a computer.
An algorithm is considered efficient if the run
time and space used are a polynomial function of
low degree of input length.
e.g.
efficient algorithms for restricted concept
classes
Conjunctive concepts, e.g., (A ∧ B ∧ C)
Conjunctions of clauses, each a disjunction of no more
than k literals, e.g.,
(A ∨ B) ∧ (C ∨ D) ∧ (E ∨ F), k = 2.
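
To illustrate the restricted concept class, here is a small sketch that represents a conjunction of clauses as a list of tuples of literal names, checks the bound k, and evaluates the concept against a truth assignment in time linear in the formula size. The representation and function names are assumptions made for this example.

```python
from typing import Mapping, Sequence, Tuple

Clause = Tuple[str, ...]  # a disjunction of literal names


def is_k_cnf(clauses: Sequence[Clause], k: int) -> bool:
    """True if every clause has no more than k literals."""
    return all(len(clause) <= k for clause in clauses)


def evaluate_cnf(clauses: Sequence[Clause], assignment: Mapping[str, bool]) -> bool:
    """A conjunction of clauses holds if every clause has at least one true literal."""
    return all(any(assignment[lit] for lit in clause) for clause in clauses)


# (A or B) and (C or D) and (E or F), k = 2
concept = [("A", "B"), ("C", "D"), ("E", "F")]

satisfying = {"A": True, "B": False, "C": False, "D": True, "E": True, "F": False}
violating = {"A": True, "B": False, "C": False, "D": True, "E": False, "F": False}
```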

12
Machine Learning

A learning algorithm takes the data set and its
accompanying information as input and returns a
statement (e.g., a concept) representing the
results of the learning as output
Data sets can be a file of records in DB
Problems in learning DB
DB are
Dynamic
Incomplete
Noisy
Much larger than typical machine learning data
sets
Much of the work on learning from DBs focuses on
overcoming these complications!

13
Related Approaches

DB Management
Integrity
Querying in DB
Deduction in DB
OODBM
Expert Systems
Expert-generated knowledge is usually of higher
quality than the data in a DB
Only cover the important cases
Experts are available to confirm the validity and
usefulness of discovered patterns
Autonomy of discovery is lacking in expert systems

14
Related Approaches (Contd)

Statistics
Ill suited for the nominal and structured data
types
Precluding the use of domain knowledge
Difficult to interpret
Require the guidance of the user to specify when
and how to analyze the data

15
Scientific Discovery

Knowledge discovery in databases (DBKD) is less
purposeful and less controlled than scientific
discovery (SD)
Scientists can reformulate and rerun their
experiment should they find the initial design
was inadequate
Database managers rarely have the luxury of
redesigning their data fields and recollecting
the data

16
A Framework for Knowledge Discovery

Input
Raw data from DB
Information from data dictionary
Additional domain knowledge
User defined biases that provide high level focus
Output
New Domain Knowledge
Feedback of the discovered knowledge to generate
new knowledge
DB issues
Dynamic data (time-sensitive, e.g., weight,
height, pulse rate)
Irrelevant fields (zip codes, pulse rate, sex)
Missing data
Noise and uncertainty
Missing field

17
Translation Between Database Management and
Machine Learning Terms
18
Conflicting Viewpoints Between Database
Management and Machine Learning
19
A Framework for Knowledge Discovery in Databases
20
Database and Knowledge

Domain knowledge assists discovery by narrowing the
search scope
Data Dictionary
Inter-field Knowledge
e.g., weight and height
Inter-instance knowledge
e.g., age height seniority
age weight seniority
Contradictory - can rule out valuable discoveries
"Trucks don't drive over water"
eliminates the potentially interesting solution
"Trucks drive over frozen lakes in winter."

21
Discovered Knowledge

Form
Inter-field patterns - relate values of fields in
the same record
e.g., procedure = surgery implies days in
hospital > 5
Inter-record patterns - aggregated over groups of
records, or identify useful clusters
(e.g., profit-making companies)
Rules X => Y1, A => B
form causal chains or networks

22
Discovered Knowledge (contd)

Representation
Discovery must be represented in a form
appropriate for the intended user.
Human natural language, formal logic, visual
depictions of information
Computer program (expert system shells)
Programming language, declarative formalisms
Discovery System Feedback as domain knowledge
Need common representation
Uncertainty
Patterns are often probabilistic rather than
deterministic
missing and erroneous data
inherent indeterminism of the underlying real
world causes (50% chance of rain tomorrow)
sampling

23
Discovered Knowledge (contd)

Measures
Proof of success
Standard deviation
Belief measures
Linguistic uncertainty - fuzzy sets
Visual presentations by density, size, and
shading
Sampling technique for large DBs: accuracy of
results depends on sample size
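
One way to see the dependence on sample size is the textbook standard error of a proportion estimated from a simple random sample. This sketch assumes simple random sampling and uses made-up numbers; it is not a method taken from the slides.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)


# A rule that holds with likelihood 0.3, estimated from samples of two sizes.
se_100 = standard_error(0.3, 100)
se_400 = standard_error(0.3, 400)  # four times the sample, half the error
```

Under this assumption the error shrinks with the square root of the sample size, so halving the uncertainty requires four times as many sampled records.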

24
Discovery Algorithms

Machine Learning
Unsupervised Learning
Supervised Learning
Unsupervised Learning
Pattern identification: identifying interesting
patterns and describing them in a concise and
meaningful manner
Examples
customers with income > 25,000/yr
questionable insurance claims

25
Discovery Algorithms (Contd)

Methods
Traditional Clustering
Minimize similarity between classes
Maximize similarity within classes
Drawbacks
Based on Euclidean distance; works well only on
numerical data
Inability to use background information such as
likely cluster shape
Conceptual clustering
Based on attributes similarity, conceptual
cohesiveness (defined by background information)
Interactive clustering
Combines the human user's knowledge with the
computation power of the computer
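
The objective of traditional clustering can be sketched with Euclidean distance standing in for dissimilarity: a good partition has small distances within clusters and large distances between them. The points and both partitions below are made up for illustration, and the two scoring functions are only one of many possible ways to measure cluster quality.

```python
import math
from itertools import combinations


def mean_within_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    dists = [math.dist(a, b) for pts in clusters for a, b in combinations(pts, 2)]
    return sum(dists) / len(dists)


def mean_between_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    dists = [
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(dists) / len(dists)


# Two partitions of the same six hypothetical points.
good = [[(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)], [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]]
bad = [[(1.0, 1.0), (9.0, 9.5), (2.0, 1.0)], [(8.0, 8.0), (1.5, 2.0), (8.5, 9.0)]]
```

Because the score relies on `math.dist`, it only applies to numerical attributes, which is the drawback named above.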

26
Discovery Algorithms (Contd)

Supervised Learning
Description process
Summarizes relevant qualities of the identified
class
In discovery systems, user supervision can occur
in either the identification or description
process.

27
Concept Description(Supervised Concept Learning)

Discovery in large, complex databases requires
both empirical methods to detect the statistical
regularity of patterns and knowledge-based
approaches to incorporate available domain
knowledge.
Discovery tasks
Summarization - Summarize class records by
describing their common or characteristic
features
Discrimination - Describe qualities sufficient to
discriminate records of one class from another
Comparison - Describe the class in a way that
facilitates comparison and analysis with other
records

28
Future Directions

Domain Knowledge - how to effectively use domain
knowledge to discover knowledge
Efficient Algorithms
Restrict rule type
Heuristic and approximate algorithms
Sampling
Parallel computing
OODBM
Deductive DB
Incremental methods
Efficiently keep pace with changes in Data
Incremental discovery systems reuse their
discoveries and make more complex discoveries

29
Future Directions (contd)

Interactive systems
Knowledge analyst included in the discovery loop
Use human judgement, machine computation power
Need information to be presented in a
human-oriented form (text, sound, visuals)
Integration

30
Applications of Discovery in DB

Medicine
Finance
Agriculture
Social
Marketing and Sales
Insurance
Engineering
Physics and Chemistry
Military
Law Enforcement
Space Science
Publishing

31
Applications of Discovery in DB (Contd)

Discovery of Quantitative Laws
Data Driven Discovery of Quantitative Laws
Using Knowledge in Discovery
Data Summarization
Domain Specific Discovery Methods
Integrated Multi-Paradigm Systems
Methodology and Application Issues

32
Query Processing With Domain Semantics

Wesley W. Chu

33
Query Optimization Problem

To find a sequence of operations, which has the
minimal processing cost.

34
Conventional Query Optimization (CQO)

For a given query
Generate a set of queries that are equivalent to
the given query
Determine the processing cost of each such query
Select the lowest cost query processing strategy
among these equivalent queries

35
Limitations of CQO

There are certain queries that cannot be
optimized by Conventional Query Optimization.
For example, given the query
Which ships have deadweight greater than 200
thousand tons?
A search of the entire database may be required
to answer this query.

36
The Use of Knowledge

ASSUMING EXPERT KNOWS THAT
1. SHIP relation is indexed on ShipType. There
are about 10 different ship types, and
2. the ship must be a SuperTanker (one of the
ShipTypes) if the deadweight is greater than 150K
tons.
AUGMENTED QUERY
Which SuperTankers have deadweight greater than
200K tons?
RESULT
About 90% of the search time is saved.
The technique of improving queries with semantic
knowledge is called Semantic Query Optimization.
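
The effect of the augmented query can be sketched with a toy in-memory relation indexed on ship type. The rows, the index structure, and the function names are hypothetical; only the rule (deadweight above 150K tons implies SuperTanker) comes from the slide.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, ship_type, deadweight in thousand tons).
SHIPS = [
    ("S1", "SuperTanker", 250),
    ("S2", "SuperTanker", 180),
    ("S3", "Tanker", 90),
    ("S4", "Freighter", 40),
    ("S5", "Freighter", 55),
    ("S6", "SuperTanker", 310),
    ("S7", "Tug", 1),
    ("S8", "Tanker", 120),
]

# Index on ShipType, as assumed on the slide.
BY_TYPE = defaultdict(list)
for ship in SHIPS:
    BY_TYPE[ship[1]].append(ship)

RULE_THRESHOLD = 150  # semantic rule: deadweight > 150K implies SuperTanker


def original_query(min_dw):
    """Full scan: every row is examined. Returns (answers, rows examined)."""
    hits = [s[0] for s in SHIPS if s[2] > min_dw]
    return hits, len(SHIPS)


def augmented_query(min_dw):
    """Add ShipType = SuperTanker when the rule guarantees it, then use the index."""
    if min_dw < RULE_THRESHOLD:
        return original_query(min_dw)  # rule does not apply; fall back to a scan
    candidates = BY_TYPE["SuperTanker"]
    hits = [s[0] for s in candidates if s[2] > min_dw]
    return hits, len(candidates)
```

Both functions return the same answers for a 200K threshold, but the augmented version examines only the rows under one index entry.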

37
Semantic Query Optimization (SQO)

Uses domain knowledge to transform the original
query into a more efficient query that still
yields the same answer.
Assuming a set of integrity constraints is
available as the domain knowledge,
Represent each integrity constraint as Pi → Ci,
where 1 ≤ i ≤ n.
Translate (augment) the original query Q into Q'
subject to C1, C2, ..., Cn, such that Q' yields
lower processing cost than Q.
Query Optimization Problem: find C1, C2, ..., Cm
that yield minimal query processing cost, that
is,
$C(Q') = \min_{C_i} C(Q \wedge C_1 \wedge \dots \wedge C_m)$

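The minimization can be sketched as a brute-force search over subsets of the constraints that are already known to be implied by the query. The cost model, its numbers, and the `Draft > 10` constraint are hypothetical placeholders; a real optimizer would use its own cost estimates and would not enumerate every subset.

```python
from itertools import combinations

TOTAL_ROWS = 10_000
PER_PREDICATE_OVERHEAD = 50  # hypothetical cost of evaluating one extra predicate

# Hypothetical: factor by which an index on the predicate cuts the rows scanned
# (1 means the predicate has no usable index).
INDEX_REDUCTION = {
    "Deadweight > 200K": 1,
    "ShipType = SuperTanker": 10,
    "Draft > 10": 1,
}


def toy_cost(predicates):
    """Rows scanned after using available indexes, plus per-predicate overhead."""
    rows = TOTAL_ROWS
    for p in predicates:
        rows //= INDEX_REDUCTION[p]
    return rows + PER_PREDICATE_OVERHEAD * len(predicates)


def cheapest_augmentation(query, constraints, cost):
    """Evaluate Q AND every subset of constraints; return (cost, chosen subset)."""
    best = (cost(query), ())
    for m in range(1, len(constraints) + 1):
        for subset in combinations(constraints, m):
            c = cost(query + subset)
            if c < best[0]:
                best = (c, subset)
    return best


query = ("Deadweight > 200K",)
implied = ("ShipType = SuperTanker", "Draft > 10")
best = cheapest_augmentation(query, implied, toy_cost)
```

In this toy setting only the indexed constraint is worth adding; the unindexed one raises the cost and is left out.
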
38
Semantic Equivalence

Domain knowledge of the database application
may be used to transform the original query into
semantically equivalent queries.
Semantic Equivalence
Two queries are considered to be semantically
equivalent if they result in the same answer in
any state of the database that conforms to the
Integrity Constraints.
Integrity Constraints
A set of if-then rules that force the database
to be an accurate instance of the real-world
database application. Examples of
constraints include
state snapshot constraints
e.g., if deadweight > 150K then ShipType =
SuperTanker.
state transition constraints
e.g., salary can only be increased,
i.e., salary(new) > salary(old)
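
The difference between the two kinds of constraint can be sketched as two checks: a snapshot constraint looks at a single database state, while a transition constraint compares the state before and after an update. The record layout and function names are assumptions for this example.

```python
def satisfies_snapshot(ship):
    """State snapshot constraint: if deadweight > 150K then ShipType = SuperTanker."""
    return ship["deadweight_k"] <= 150 or ship["ship_type"] == "SuperTanker"


def satisfies_transition(old_salary, new_salary):
    """State transition constraint: salary(new) > salary(old)."""
    return new_salary > old_salary


valid_ship = {"ship_type": "SuperTanker", "deadweight_k": 220}
invalid_ship = {"ship_type": "Freighter", "deadweight_k": 220}
```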

39
Limitations of Current Approach

Current approach of SQO using
Integrity constraints as knowledge
Conventional data models

40
Limitations of Integrity Constraints

Integrity constraints are often too general to be
useful in SQO, because
Integrity constraints describe every possible
database state
User is only concerned with the current database
content.
Most databases do not provide integrity checking
due to
Unavailability of integrity constraints
Overhead of checking the integrity
Thus, the usefulness of integrity constraints in
SQO is quite limited.

41
Limitations Of Conventional Data Models

Conventional data models lack expressive
modeling capability. Many
useful semantics are ignored. Therefore, only
limited knowledge is collected.
FOR EXAMPLE
Which employees earn more than 70K a year?
The integrity constraint
The salary range of employees is between 20K and
90K.
is useless in improving this query.

42
Augmentation Of SQO With Semantic Data Models

If the employees are divided into three
categories: MANAGERS, ENGINEERS, STAFF
and each category is associated with some
constraints
The salary range of MANAGERS is from 35K to 90K.
The salary range of ENGINEERS is from 25K to 60K.
The salary range of STAFF is from 20K to 35K.
A better query can be obtained
Which managers earn more than 70K a year?
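
The narrowing step can be sketched as a lookup over the per-category salary ranges listed above: a category can contribute answers only if its upper bound exceeds the query threshold. The dictionary layout and function name are assumptions for illustration.

```python
# Salary ranges in thousands, taken from the constraints above.
SALARY_RANGES = {
    "MANAGERS": (35, 90),
    "ENGINEERS": (25, 60),
    "STAFF": (20, 35),
}


def categories_earning_more_than(threshold_k, ranges=SALARY_RANGES):
    """Categories whose salary range can contain a value above the threshold."""
    return [name for name, (low, high) in ranges.items() if high > threshold_k]


candidates = categories_earning_more_than(70)
```

With the single 20K to 90K range for all employees the same test prunes nothing, which is why the coarser constraint does not help.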

44
CLASS (Type, Class, Name, Displacement, Draft,
Enlist)
45
Rule Statistics
46
SQP Performance for Selected Database Structure
47
Performance Improvement for Selected Attributes
Attributes: Class, Enlist
Metric      CQP   SQP
cpu (ms)    505   432
dio          11    11
dio           3     4
cpu (ms)    129   130
49
Summary

Contributions
Providing a model-based methodology for acquiring
knowledge from the database by rule induction.
Applications
1. Semantic Query Processing - use semantic
knowledge to improve query processing
performance.
2. Deductive Database Systems - use induced rules
to provide intensional answers.
3. Data Inference Applications - use rules to
improve data availability by inferring
inaccessible data from accessible data.

50
Capture Database Semantics By Rule Induction

Wesley W. Chu
Rei-Chi Lee

51
Database Semantics

Database semantics can be classified into
Database Structure - the description of the
interrelationships between database objects.
Database Characteristics - defines the
characteristics and properties of each object
type.
However, only tools for modeling database
structure are available. Very few tools exist for
gathering and maintaining the database
characteristics.

52
An Example of Database Characteristics

The following table illustrates the US Navy
battleship characteristics that classify ships
into ship types with different displacement
ranges.

53
Knowledge Acquisition

A major problem in the development of a
knowledge-based data processing system.
Knowledge Engineers - persons skilled in the use of
expert system tools
Domain Experts - persons with the expertise of
the application domain
The Process
Studying literature to obtain fundamental
background.
Interacting with domain experts to get their
expertise.
Translating the expertise into knowledge
representation.
Refining knowledge base through testing and
further interacting with domain experts.
A VERY TIME-CONSUMING TASK!

54
Knowledge Acquisition from Database

Database schema is defined according to database
semantics, and
Database instances are constrained by the
database characteristics.
Thus,
Database characteristics can be induced as the
semantic knowledge from the database.
Database schema can be a useful tool to guide the
knowledge acquisition.

55
Knowledge Acquisition By Rule Induction

Given an object hierarchy and a set of database
instances contained in the object hierarchy, a
set of classification rules can be induced by
inductive learning techniques.
Given
H - an object type hierarchy H1, ..., Hn
S - object schema
I - database instances representing H
Find
D - a set of descriptions, D1, ..., Dn such
that
for all x, x in I,
if Di (x) is true, then x ISA Hi
Example
SUBMARINES contains SSN, SSBN
D_SSN: 2145 ≤ Displacement ≤ 6955
D_SSBN: 7250 ≤ Displacement ≤ 30000
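
Applying the descriptions can be sketched as a range lookup: an instance x is classified as Hi when Di(x) is true. The sketch assumes the bounds are inclusive and that the descriptions do not overlap; the function name is hypothetical.

```python
from typing import Optional

# Descriptions D_i for the subtypes of SUBMARINE, as displacement ranges.
# Assumption: both bounds are inclusive.
DESCRIPTIONS = {
    "SSN": (2145, 6955),
    "SSBN": (7250, 30000),
}


def classify(displacement: int) -> Optional[str]:
    """Return the subtype whose description holds for this displacement, if any."""
    for subtype, (low, high) in DESCRIPTIONS.items():
        if low <= displacement <= high:
            return subtype
    return None  # no description applies, so nothing can be inferred
```

A displacement that falls in the gap between the two ranges is left unclassified rather than guessed.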

56
Model-Based Knowledge Acquisition Methodology

The methodology consists of
a Knowledge-based ER (KER) Model,
a knowledge acquisition methodology, and
a rule induction algorithm.
KER is used as a knowledge acquisition tool when
no knowledge specification is provided, or
the database already exists.

57
Knowledge-Based ER (KER) Model

To capture the database characteristics, a
Knowledge-based Entity Relationship (KER) model is
proposed to extend the basic ER model to provide
knowledge specification capability.
A KER schema is defined by the following
constructs
1. has-attribute/with (aggregation)
This construct links an object with other
objects and specifies certain properties of the
object.
2. isa/with (generalization)
This construct specifies a type/subtype
relationship between object types.
3. has-instance (classification)
This construct links a type to an object that is
an instance of that type.
The knowledge specification is represented by the
with-constraint specification.

58
Components of the KER Diagram
59
A KER Diagram Example
60
Classification of Semantic Knowledge

Domain Knowledge
Specifying the static properties of entities and
relationships.
e.g., displacement in the range of (0 - 30,000).
Intra-Structure Knowledge
Specifying the relationships between attributes
within an object (an entity or a relationship).
e.g., if the displacement is less than 7000, then
it is a nuclear submarine.
Inter-Structure Knowledge
Specifying the relationship that is related to
attributes of several entities of the aggregation
relationship.
e.g., the instructor's department must be the
same as the department of the class offered.

61
Knowledge Acquisition Methodology

To provide a systematic way of collecting
domain knowledge guided by the database schema.
It consists of three steps
1. Schema Generation - using KER
a. Identify entities and associated attributes.
b. Identify type hierarchies by determining the
class attributes of each type hierarchy.
c. Identify aggregation relationships. Define
each referential key as a class attribute.
2. Rule Induction
3. Knowledge Base Refinement

62
Rule Induction Algorithm

Semantic rules for pair-wise attributes (X -->
Y) are induced using the relational operations.
Sketch of the Algorithm
1. Retrieving (X,Y) value pairs.
Retrieve the instances of the (X,Y) pair from the
database.
Let S be the result.
2. Removing inconsistent (X,Y) value pairs.
Retrieve all the (X,Y) pairs for which the same
value of X has multiple values of Y. Let T be
the result.
Let S = S - T.
3. Constructing rules.
For each distinct value of Y in S, say y,
determine the value range [x1, x2] of X and create a
rule in the form of
if x1 ≤ X ≤ x2 then Y = y.
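
The three steps can be sketched over an in-memory list of (X, Y) pairs instead of relational operations. This assumes inclusive range bounds, and the sample instances are made up (only the range endpoints match the submarine example). The sketch follows the slide literally, so it does not handle the case where the ranges induced for two different Y values overlap.

```python
from collections import defaultdict


def induce_rules(pairs):
    """Induce rules (x1, x2, y), read as: if x1 <= X <= x2 then Y = y."""
    # Step 1: retrieve the distinct (X, Y) value pairs.
    s = set(pairs)

    # Step 2: remove pairs whose X value occurs with more than one Y value.
    ys_by_x = defaultdict(set)
    for x, y in s:
        ys_by_x[x].add(y)
    t = {(x, y) for x, y in s if len(ys_by_x[x]) > 1}
    s -= t

    # Step 3: for each distinct y, take the value range of X and build a rule.
    xs_by_y = defaultdict(list)
    for x, y in s:
        xs_by_y[y].append(x)
    return sorted((min(xs), max(xs), y) for y, xs in xs_by_y.items())


# Hypothetical (Displacement, Type) instances; 7000 is inconsistent on purpose.
instances = [
    (2145, "SSN"), (4000, "SSN"), (5000, "SSN"), (6955, "SSN"),
    (7250, "SSBN"), (16600, "SSBN"), (30000, "SSBN"),
    (7000, "SSN"), (7000, "SSBN"),
]

rules = induce_rules(instances)
```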

63
Examples Of Induced Rules

A prototype system was implemented at UCLA using
a naval ship database as a test bed. Examples of
rules induced are
Entity SUBMARINE
x isa SUBMARINE
R1: if 0101 ≤ x.Class ≤ 0103 then x isa SSBN
R2: if 0201 ≤ x.Class ≤ 0215 then x isa SSN
R3: if Skate ≤ x.ClassName ≤ Thresher then x isa SSN
R4: if 2145 ≤ x.Displacement ≤ 6955 then x isa SSN
R5: if 7250 ≤ x.Displacement ≤ 30000 then x isa SSBN

64
Examples of Induced Rules (Contd)

Relationship INSTALL
x isa SUBMARINE and y isa SONAR
R1: if SSN582 ≤ x.Id ≤ SSN601 then y isa BQS
R2: if SSN604 ≤ x.Id ≤ SSN671 then y isa BQQ
R3: if x.Class = 0203 then y isa BQQ
R4: if 0205 ≤ x.Class ≤ 0207 then y isa BQQ
R5: if 0208 ≤ x.Class ≤ 0215 then y isa BQS
R6: if y.Sonar = BQS-04 then x isa SSN
65
Pruning the Rule Set

When the number of rules generated becomes too
large, the system must reduce the size of the
knowledge base.
Two Criteria for Rule Pruning
1. Coverage
Keep the rules that are satisfied by more than
Nc instances and drop those rules that are
satisfied by fewer than Nc instances.
2. Completeness
Keep a rule scheme (X --> Y) only if the total
number of instances satisfied by the rules of that
scheme is greater than a coverage threshold
Cc.
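
Both pruning criteria can be sketched as filters over rules that carry their scheme and the number of instances they satisfy. The `Rule` type, the sample rules, and the thresholds are hypothetical. The slide does not say what happens to a rule satisfied by exactly Nc instances; this sketch drops it.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    scheme: Tuple[str, str]  # (X attribute, Y attribute)
    text: str
    support: int             # number of instances satisfying the rule


def prune_by_coverage(rules, nc):
    """Keep only the rules satisfied by more than nc instances."""
    return [r for r in rules if r.support > nc]


def prune_by_completeness(rules, cc):
    """Keep the rules of a scheme only if the scheme's total support exceeds cc."""
    totals = defaultdict(int)
    for r in rules:
        totals[r.scheme] += r.support
    return [r for r in rules if totals[r.scheme] > cc]


# Hypothetical rule base with made-up support counts.
rule_base = [
    Rule(("Class", "Type"), "R1", 12),
    Rule(("Class", "Type"), "R2", 2),
    Rule(("Id", "Sonar"), "R3", 3),
    Rule(("Id", "Sonar"), "R4", 1),
]
```

Coverage prunes individual rules, while completeness keeps or drops a whole scheme, so a weak rule can survive if its scheme as a whole is well supported.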

66
Induced Rules from Relation PORT
67
Summary

Contributions
Providing a model-based methodology for
acquiring knowledge from the database by rule
induction.
Applications
Semantic Query Processing - use semantic
knowledge to improve query processing
performance.
Deductive Database Systems - use induced rules to
provide intensional answers.
Data Inference Applications - use rules to
improve data availability by inferring
inaccessible data from accessible data.