Title: CS245A
1 CS245A Syllabus (2005)
- Knowledge Discovery in Databases
- Query Processing With Domain Semantics
- Capture Database Semantics by Rule Induction
- Intensional Query Answering
- Fault Tolerant DDBMS Via Data Inference
- Intelligent Dictionary Directory
- Uncertainty Management Using Rough Sets
- Data Mining Techniques (Ch. 4-7, Han & Kamber)
- Active Databases
- Mediators in Information Systems
- KQML: A Language and Protocol for Knowledge and Information Exchange
2 CS 245A - Syllabus (contd)
- CoBase
- CoSent
- Relaxation for XML Documents
- Query Formation From High-level Concepts
- Knowledge Acquisition for Query Relaxation
- Principles of Case-based Reasoning
- A Case-based Reasoning Approach to AQA
- CoXML
- Data Mining for Sequence Data
- Extracting key features from Free Text
- Knowledge based Approach for Free Text Retrieval
- Content-based Information Retrieval
- Digital Library
3 References
- Course notes: Intelligent Information Systems, CS245A, Course Reader Material, 1141 Westwood Blvd, 310-443-3303
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
- Wesley Chu and T.Y. Lin (eds.), Foundations and Advances in Data Mining, Springer, 2005.
4 CS 245A Intelligent Information Systems
- Wesley W. Chu
- Computer Science Department
- U. of California
- Los Angeles, CA
5 Knowledge Discovery in Databases
- Information Explosion
- Information doubles every 20 months
- Increase in the number and size of DBs
- NASA - Earth observation satellites, 1 picture/sec
- Human genome - several billion genetic bases
- US census data - lifestyle and subculture of the US
- How to analyze these databases (raw data)?
- There is a gap between data generation and data understanding
- Intelligent data analysis will be useful and valuable
- AA uses its frequent flyer DB to find its better customers for specific market promotions
6 Knowledge Discovery in Databases (Contd)
- Banks use customers' loan and credit information to derive better loan approval and bankruptcy protection
- Package-goods manufacturers use scanned supermarket data to measure the effect of their promotions and to look for shopping patterns
- Techniques
- Machine Learning
- Statistics
- Information Theory
- Fuzzy Sets
7 Knowledge Discovery
- Extraction of implicit, previously unknown, and potentially useful information from data
- Given a set of facts (data) F, a language L, and a measure of certainty C,
- a pattern is a statement S in L that describes the relationship among a subset Fs of F with certainty C, such that S is simpler than the enumeration of all facts in Fs
- Discovered Knowledge
- The output of a program that monitors the set of facts in a DB and produces patterns.
8 Patterns
- Expressed in a high-level language
- Understood and used directly by people
- Can be input to another program (e.g., an expert system)
- e.g.
- If age < 25 and Driver-Education-Course = No
- Then At-Fault-Accident = Yes
- with likelihood 0.3
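As a rough illustration of how the likelihood attached to such a pattern could be estimated, the sketch below counts how often the conclusion holds among records that match the condition. The records, field names, and the helper `rule_likelihood` are hypothetical and chosen only to mirror the rule on this slide.

```python
# Hypothetical driver records; field names mirror the slide's example rule.
records = [
    {"age": 22, "driver_ed": "No", "at_fault": "Yes"},
    {"age": 23, "driver_ed": "No", "at_fault": "No"},
    {"age": 24, "driver_ed": "No", "at_fault": "No"},
    {"age": 21, "driver_ed": "No", "at_fault": "No"},
    {"age": 40, "driver_ed": "No", "at_fault": "No"},
    {"age": 19, "driver_ed": "Yes", "at_fault": "No"},
]


def rule_likelihood(rows, condition, conclusion):
    """Fraction of rows matching `condition` that also satisfy `conclusion`."""
    matched = [r for r in rows if condition(r)]
    if not matched:
        return None  # the rule says nothing about this data set
    return sum(1 for r in matched if conclusion(r)) / len(matched)


likelihood = rule_likelihood(
    records,
    condition=lambda r: r["age"] < 25 and r["driver_ed"] == "No",
    conclusion=lambda r: r["at_fault"] == "Yes",
)
print(likelihood)
```

The same function can score any if-then pattern, which is one simple way to attach a certainty measure to a discovered rule.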
9 Patterns (Contd)
- Patterns that are completely unrelated to current goals are not considered knowledge.
- e.g.
- A pattern relating at-fault accidents to a driver's age is not useful for auto sales figures.
- Pattern + interesting results = knowledge
- Age > 16 is not an interesting pattern for drivers, since all drivers must be older than 16.
10 Knowledge Discovery in DB Exhibits Four Main Characteristics
- High-Level Language
- Understood by human users
- Accuracy
- Expressed by measure of uncertainty
- Interesting Results
- Patterns are novel and potentially useful
- Efficiency
- Running times for large-sized DB are predictable
and acceptable
11 Efficiency
- The discovery process should be efficiently implemented on a computer.
- An algorithm is considered efficient if the run time and space used are a low-degree polynomial function of the input length.
- e.g.
- Efficient algorithms exist for restricted concept classes
- Conjunctive concepts, (A ∧ B ∧ C)
- Conjunctions of clauses, each a disjunction of no more than k literals
- (A ∨ B) ∧ (C ∨ D) ∧ (E ∨ F), k = 2.
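To make the restricted concept class concrete, here is a minimal sketch that evaluates a conjunction of clauses with at most k literals each on one record. The encoding of a literal as a `(name, expected_value)` pair and the function name `eval_kcnf` are assumptions for illustration; evaluation takes time linear in the size of the formula.

```python
def eval_kcnf(clauses, assignment, k):
    """Evaluate a conjunction of clauses, each a disjunction of at most k literals.

    A literal is a (variable_name, expected_value) pair; this encoding is an
    assumption made for illustration.
    """
    for clause in clauses:
        if len(clause) > k:
            raise ValueError("clause has more than k literals")
    return all(
        any(assignment[name] == expected for name, expected in clause)
        for clause in clauses
    )


# (A or B) and (C or D) and (E or F), with k = 2, as on the slide.
formula = [
    [("A", True), ("B", True)],
    [("C", True), ("D", True)],
    [("E", True), ("F", True)],
]
example = {"A": True, "B": False, "C": False, "D": True, "E": False, "F": True}
satisfied = eval_kcnf(formula, example, k=2)
print(satisfied)
```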
12 Machine Learning
- A learning algorithm takes the data set and its accompanying information as input and returns a statement (e.g., a concept) representing the results of the learning as output
- Data sets can be a file of records in a DB
- Problems in learning from DBs
- DBs are
- Dynamic
- Incomplete
- Noisy
- Much larger than typical machine learning data sets
- Much of the work on learning from DBs focuses on overcoming these complications!
13 Related Approaches
- DB Management
- Integrity
- Querying in DB
- Deduction in DB
- OODBM
- Expert Systems
- Expert-generated knowledge is usually of higher quality than the data in a DB
- Only covers the important cases
- Experts are available to confirm the validity and usefulness of discovered patterns
- Autonomy of discovery is lacking in expert systems
14 Related Approaches (Contd)
- Statistics
- Ill suited to nominal and structured data types
- Precludes the use of domain knowledge
- Difficult to interpret
- Requires the guidance of the user to specify when and how to analyze the data
15 Scientific Discovery
- Knowledge discovery in databases (DBKD) is less purposeful and less controlled than scientific discovery (SD)
- Scientists can reformulate and rerun their experiments should they find that the initial design was inadequate
- Database managers rarely have the luxury of redesigning their data fields and recollecting the data
16 A Framework for Knowledge Discovery
- Input
- Raw data from DB
- Information from data dictionary
- Additional domain knowledge
- User-defined biases that provide high-level focus
- Output
- New domain knowledge
- Feedback of the discovered knowledge to generate new knowledge
- DB issues
- Dynamic data (time-sensitive, e.g., weight, height, pulse rate)
- Irrelevant fields (zip codes, pulse rate, sex)
- Missing data
- Noise and uncertainty
- Missing fields
17 Translation Between Database Management and Machine Learning Terms
18 Conflicting Viewpoints Between Database Management and Machine Learning
19 A Framework for Knowledge Discovery in Databases
20 Database and Knowledge
- Domain knowledge assists discovery by narrowing the search scope
- Data Dictionary
- Inter-field knowledge
- e.g., weight and height
- Inter-instance knowledge
- e.g., age height seniority
- age weight seniority
- Contradictory knowledge can rule out valuable discoveries
- "Trucks don't drive over water"
- eliminates a potentially interesting solution:
- trucks drive over frozen lakes in winter.
21 Discovered Knowledge
- Form
- Inter-field patterns - relate values of fields in the same record
- e.g., (procedure = surgery implies days in hospital > 5)
- Inter-record patterns - aggregate over groups of records or identify useful clusters (e.g., profit-making companies)
- Rules X > Y1, A > B
- form causal chains or networks
22 Discovered Knowledge (contd)
- Representation
- Discoveries must be represented in a form appropriate for the intended user.
- Human: natural language, formal logic, visual depictions of information
- Computer program (expert system shells): programming languages, declarative formalisms
- Discovery system: feedback as domain knowledge
- Need a common representation
- Uncertainty
- Patterns are often probabilistic rather than deterministic, due to
- missing and erroneous data
- inherent indeterminism of the underlying real-world causes (50% chance of rain tomorrow)
- sampling
23 Discovered Knowledge (contd)
- Measures
- Proof of success
- Standard deviation
- Belief measures
- Linguistic uncertainty - fuzzy sets
- Visual presentations by density, size, and shading
- Sampling techniques for large DBs: accuracy of results depends on sample size
24 Discovery Algorithms
- Machine Learning
- Unsupervised Learning
- Supervised Learning
- Unsupervised Learning
- Pattern identification: identifying interesting patterns and describing them in a concise and meaningful manner
- Examples
- customers with income > 25,000/yr
- questionable insurance claims
25 Discovery Algorithms (Contd)
- Methods
- Traditional clustering
- Minimize similarity between classes
- Maximize similarity within classes
- Drawbacks
- Based on Euclidean distance; works well only on numerical data
- Inability to use background information such as likely cluster shape
- Conceptual clustering
- Based on attribute similarity and conceptual cohesiveness (defined by background information)
- Interactive clustering
- Combines the human user's knowledge with the computational power of the computer
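The dependence of traditional clustering on Euclidean distance can be seen in a single assignment step, sketched below. The points, the fixed centroids, and the function name `assign_to_clusters` are hypothetical; a full clustering method would also update the centroids iteratively.

```python
import math


def assign_to_clusters(points, centroids):
    """Assign each numeric point to the nearest centroid by Euclidean distance."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(p, centroids[i]),
        )
        clusters[nearest].append(p)
    return clusters


# Hypothetical two-dimensional numeric records and starting centroids.
points = [(0, 0), (1, 1), (9, 9), (10, 10)]
centroids = [(0, 0), (10, 10)]
clusters = assign_to_clusters(points, centroids)
print(clusters)
```

Because `math.dist` is defined only for numeric coordinates, nominal fields such as ship type would need a different similarity measure, which is the drawback the slide points out.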
26 Discovery Algorithms (Contd)
- Supervised Learning
- Description process
- Summarizes relevant qualities of the identified class
- In discovery systems, user supervision can occur in either the identification or the description process.
27 Concept Description (Supervised Concept Learning)
- Discovery in large, complex databases requires both empirical methods to detect the statistical regularity of patterns and knowledge-based approaches to incorporate available domain knowledge.
- Discovery tasks
- Summarization - Summarize class records by describing their common or characteristic features
- Discrimination - Describe qualities sufficient to discriminate records of one class from another
- Comparison - Describe the class in a way that facilitates comparison and analysis with other records
28 Future Directions
- Domain Knowledge - how to effectively use domain knowledge to discover knowledge
- Efficient Algorithms
- Restrict rule type
- Heuristic and approximate algorithms
- Sampling
- Parallel computing
- OODBM
- Deductive DB
- Incremental methods
- Efficiently keep pace with changes in data
- Incremental discovery systems reuse their discoveries to make more complex discoveries
29 Future Directions (contd)
- Interactive systems
- Knowledge analyst included in the discovery loop
- Use human judgement and machine computation power
- Need information to be presented in a human-oriented form (text, sound, visuals)
- Integration
30 Applications of Discovery in DB
- Medicine
- Finance
- Agriculture
- Social
- Marketing & Sales
- Insurance
- Engineering
- Physics & Chemistry
- Military
- Law Enforcement
- Space Science
- Publishing
31 Applications of Discovery in DB (Contd)
- Discovery of Quantitative Laws
- Data Driven Discovery of Quantitative Laws
- Using Knowledge in Discovery
- Data Summarization
- Domain Specific Discovery Methods
- Integrated Multi-Paradigm Systems
- Methodology and Application Issues
32 Query Processing With Domain Semantics
33 Query Optimization Problem
- To find a sequence of operations, which has the
minimal processing cost.
34 Conventional Query Optimization (CQO)
- For a given query
- Generate a set of queries that are equivalent to the given query
- Determine the processing cost of each such query
- Select the lowest-cost query processing strategy among these equivalent queries
35 Limitations of CQO
- There are certain queries that cannot be optimized by Conventional Query Optimization.
- For example, given the query
- "Which ships have deadweight greater than 200 thousand tons?"
- A search of the entire database may be required to answer this query.
36 The Use of Knowledge
- ASSUMING THE EXPERT KNOWS THAT
- 1. The SHIP relation is indexed on ShipType. There are about 10 different ship types, and
- 2. the ship must be a SuperTanker (one of the ShipTypes) if the deadweight is greater than 150K tons.
- AUGMENTED QUERY
- "Which SuperTankers have deadweight greater than 200K tons?"
- RESULT
- About 90% of the time is saved in searching for the answers.
- The technique of improving queries with semantic knowledge is called Semantic Query Optimization.
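The effect of the augmented query can be sketched by comparing how many tuples each version has to examine. The ship tuples, the in-memory index, and the function names below are hypothetical; only the rule "deadweight > 150K implies SuperTanker" and the index on ShipType come from the slide.

```python
from collections import defaultdict

# Hypothetical SHIP tuples: (name, ship_type, deadweight in thousand tons).
ships = [
    ("Alpha", "SuperTanker", 250),
    ("Bravo", "SuperTanker", 180),
    ("Charlie", "Tanker", 90),
    ("Delta", "Freighter", 40),
    ("Echo", "Container", 60),
]

# Index on ShipType, as assumed in item 1 of the slide.
index = defaultdict(list)
for ship in ships:
    index[ship[1]].append(ship)


def full_scan(min_dw):
    """Original query: examine every tuple in the relation."""
    hits = [s[0] for s in ships if s[2] > min_dw]
    return hits, len(ships)


def augmented_scan(min_dw):
    """Augmented query: use the rule to search only the SuperTanker bucket."""
    if min_dw < 150:
        raise ValueError("the rule applies only to thresholds of 150K or more")
    bucket = index["SuperTanker"]
    hits = [s[0] for s in bucket if s[2] > min_dw]
    return hits, len(bucket)


original_hits, original_examined = full_scan(200)
augmented_hits, augmented_examined = augmented_scan(200)
print(original_hits, original_examined, augmented_hits, augmented_examined)
```

Both versions return the same answer, but the augmented one examines only the tuples of one ship type, which is where the saving on the slide comes from.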
37 Semantic Query Optimization (SQO)
- Uses domain knowledge to transform the original query into a more efficient query that still yields the same answer.
- Assuming a set of integrity constraints is available as the domain knowledge,
- Represent each integrity constraint as $P_i \rightarrow C_i$, where $1 \le i \le n$.
- Translate (augment) the original query $Q$ into $Q'$ subject to $C_1, C_2, \dots, C_n$, such that $Q'$ yields lower processing cost than $Q$.
- Query Optimization Problem: find $C_1, C_2, \dots, C_m$ that yield minimal query processing cost, that is,
- $C(Q') = \min_{C_i} C(Q \wedge C_1 \wedge \dots \wedge C_m)$
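One naive way to read the optimization problem is as a search over subsets of applicable constraints for the one that minimizes estimated cost. The sketch below does exactly that by brute force; the constraint names, the `effects` table, and `toy_cost` are invented for illustration and are not the cost model used in the course material.

```python
from itertools import combinations


def best_augmentation(constraints, cost):
    """Return (subset, cost) minimizing cost(subset) over all subsets of constraints."""
    best = ((), cost(()))
    for m in range(1, len(constraints) + 1):
        for subset in combinations(constraints, m):
            c = cost(subset)
            if c < best[1]:
                best = (subset, c)
    return best


# Hypothetical effect of adding each constraint to the query:
# (fraction of tuples still scanned, fixed overhead of evaluating it).
effects = {"C1": (0.1, 5), "C2": (0.9, 20), "C3": (0.5, 400)}


def toy_cost(subset):
    """Toy cost model: remaining scan work plus per-constraint overhead."""
    scan = 1000.0
    overhead = 0.0
    for name in subset:
        factor, extra = effects[name]
        scan *= factor
        overhead += extra
    return scan + overhead


chosen, chosen_cost = best_augmentation(["C1", "C2", "C3"], toy_cost)
print(chosen, chosen_cost)
```

Enumerating every subset is exponential in the number of constraints, so a practical optimizer would need heuristics to limit the search; the sketch only illustrates the objective being minimized.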
38 Semantic Equivalence
- Domain knowledge of the database application may be used to transform the original query into semantically equivalent queries.
- Semantic Equivalence
- Two queries are considered semantically equivalent if they result in the same answer in any state of the database that conforms to the integrity constraints.
- Integrity Constraints
- A set of if-then rules that force the database to be an accurate instance of the real-world database application. Examples of constraints include
- state snapshot constraints
- e.g., if deadweight > 150K then ShipType = SuperTanker.
- state transition constraints
- e.g., salary can only be increased,
- i.e., salary (new) > salary (old)
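The two kinds of constraint differ in what they inspect: a snapshot constraint looks at one database state, while a transition constraint compares the state before and after an update. A minimal sketch, with hypothetical record layouts and function names:

```python
def snapshot_ok(ship):
    """State snapshot constraint: deadweight > 150K implies ShipType = SuperTanker.

    Deadweight is assumed to be stored in thousands of tons.
    """
    return ship["deadweight"] <= 150 or ship["ship_type"] == "SuperTanker"


def transition_ok(old, new):
    """State transition constraint: salary (new) > salary (old)."""
    return new["salary"] > old["salary"]


valid_ship = {"ship_type": "SuperTanker", "deadweight": 220}
invalid_ship = {"ship_type": "Freighter", "deadweight": 220}
raise_ok = transition_ok({"salary": 50}, {"salary": 55})
cut_ok = transition_ok({"salary": 50}, {"salary": 45})
print(snapshot_ok(valid_ship), snapshot_ok(invalid_ship), raise_ok, cut_ok)
```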
39 Limitations of Current Approach
- Current approach of SQO using
- Integrity constraints as knowledge
- Conventional data models
40 Limitations of Integrity Constraints
- Integrity constraints are often too general to be useful in SQO, because
- Integrity constraints describe every possible database state
- The user is only concerned with the current database content.
- Most databases do not provide integrity checking due to
- Unavailability of integrity constraints
- Overhead of checking the integrity
- Thus, the usefulness of integrity constraints in SQO is quite limited.
41 Limitations of Conventional Data Models
- Conventional data models lack expressive capability for modeling convenience. Many useful semantics are ignored. Therefore, only limited knowledge is collected.
- FOR EXAMPLE
- "Which employees earn more than 70K a year?"
- The integrity constraint
- "The salary range of employees is between 20K and 90K."
- is useless in improving this query.
42 Augmentation of SQO With Semantic Data Models
- If the employees are divided into three categories, MANAGERS, ENGINEERS, and STAFF,
- and each category is associated with some constraints
- The salary range of MANAGERS is from 35K to 90K.
- The salary range of ENGINEERS is from 25K to 60K.
- The salary range of STAFF is from 20K to 35K.
- A better query can be obtained
- "Which managers earn more than 70K a year?"
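The rewrite on this slide amounts to discarding every category whose salary range cannot satisfy the query predicate. A small sketch using the ranges given above (the dictionary layout and the function name `categories_for` are assumptions):

```python
# Salary ranges in thousands per year, taken from the slide.
salary_ranges = {
    "MANAGERS": (35, 90),
    "ENGINEERS": (25, 60),
    "STAFF": (20, 35),
}


def categories_for(min_salary):
    """Categories that can contain an employee earning more than min_salary."""
    return [name for name, (low, high) in salary_ranges.items() if high > min_salary]


candidates = categories_for(70)
print(candidates)
```

With a threshold of 70K only MANAGERS survives, so the query can be restricted to that category; with a threshold of 30K all three categories remain and the rewrite gives no benefit.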
44 CLASS (Type, Class, Name, Displacement, Draft, Enlist)
45 Rule Statistics
46 SQP Performance for Selected Database Structure
47 Performance Improvement for Selected Attributes (attributes: Class, Enlist)
metric    | CQP | SQP
cpu (ms)  | 505 | 432
dio       | 11  | 11
dio       | 3   | 4
cpu (ms)  | 129 | 130
49 Summary
- Contributions
- Providing a model-based methodology for acquiring knowledge from the database by rule induction.
- Applications
- 1. Semantic Query Processing - use semantic knowledge to improve query processing performance.
- 2. Deductive Database Systems - use induced rules to provide intensional answers.
- 3. Data Inference Applications - use rules to improve data availability by inferring inaccessible data from accessible data.
50 Capture Database Semantics By Rule Induction
- Wesley W. Chu
- Rei-Chi Lee
51 Database Semantics
- Database semantics can be classified into
- Database Structure - the description of the interrelationships between database objects.
- Database Characteristics - define the characteristics and properties of each object type.
- However, only tools for modeling database structure are available. Very few tools exist for gathering and maintaining the database characteristics.
52 An Example of Database Characteristics
- The following table illustrates the US Navy
battleship characteristics that classify ships
into ship types with different displacement
ranges.
53 Knowledge Acquisition
- A major problem in the development of a knowledge-based data processing system.
- Knowledge Engineers - persons experienced in the use of expert system tools
- Domain Experts - persons with expertise in the application domain
- The Process
- Studying literature to obtain fundamental background.
- Interacting with domain experts to get their expertise.
- Translating the expertise into a knowledge representation.
- Refining the knowledge base through testing and further interaction with domain experts.
- A VERY TIME-CONSUMING TASK!
54 Knowledge Acquisition from Database
- Database schema is defined according to database semantics, and
- Database instances are constrained by the database characteristics.
- Thus,
- Database characteristics can be induced as the semantic knowledge from the database.
- Database schema can be a useful tool to guide the knowledge acquisition.
55 Knowledge Acquisition By Rule Induction
- Given an object hierarchy and a set of database instances contained in the object hierarchy, a set of classification rules can be induced by inductive learning techniques.
- Given
- H - an object type hierarchy H1, ..., Hn
- S - object schema
- I - database instances representing H
- Find
- D - a set of descriptions D1, ..., Dn such that
- for all x in I,
- if Di(x) is true, then x ISA Hi
- Example
- SUBMARINES contains SSN, SSBN
- D_SSN: 2145 <= Displacement <= 6955
- D_SSBN: 7250 <= Displacement <= 30000
56 Model-Based Knowledge Acquisition Methodology
- The methodology consists of
- a Knowledge-based ER (KER) Model,
- a knowledge acquisition methodology, and
- a rule induction algorithm.
- KER is used as a knowledge acquisition tool when
- no knowledge specification is provided, or
- the database already exists.
57 Knowledge-Based ER (KER) Model
- To capture the database characteristics, a Knowledge-based Entity Relationship (KER) model is proposed to extend the basic ER model with knowledge specification capability.
- A KER schema is defined by the following constructs
- 1. has-attribute/with (aggregation)
- This construct links an object with other objects and specifies certain properties of the object.
- 2. isa/with (generalization)
- This construct specifies a type/subtype relationship between object types.
- 3. has-instance (classification)
- This construct links a type to an object that is an instance of that type.
- The knowledge specification is represented by the with-constraint specification.
58 Components of the KER Diagram
59 A KER Diagram Example
60 Classification of Semantic Knowledge
- Domain Knowledge
- Specifying the static properties of entities and relationships.
- e.g., displacement in the range of (0 - 30,000).
- Intra-Structure Knowledge
- Specifying the relationships between attributes within an object (an entity or a relationship).
- e.g., if the displacement is less than 7000, then it is a nuclear submarine.
- Inter-Structure Knowledge
- Specifying the relationship among attributes of several entities in an aggregation relationship.
- e.g., the instructor's department must be the same as the department of the class offered.
61 Knowledge Acquisition Methodology
- To provide a systematic way of collecting domain knowledge guided by the database schema. It consists of three steps
- 1. Schema Generation - using KER
- a. Identify entities and associated attributes.
- b. Identify type hierarchies by determining the class attributes of each type hierarchy.
- c. Identify aggregation relationships. Define each referential key as a class attribute.
- 2. Rule Induction
- 3. Knowledge Base Refinement
62 Rule Induction Algorithm
- Semantic rules for pair-wise attributes (X --> Y) are induced using relational operations.
- Sketch of the Algorithm
- 1. Retrieving (X,Y) value pairs.
- Retrieve the instances of the (X,Y) pair from the database.
- Let S be the result.
- 2. Removing inconsistent (X,Y) value pairs.
- Retrieve all the (X,Y) pairs in which the same value of X has multiple values of Y. Let T be the result.
- Let S = S - T.
- 3. Constructing Rules.
- For each distinct value of Y in S, say y, determine the value range [x1, x2] of X and create a rule of the form
- if x1 <= X <= x2 then Y = y.
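The three steps of the sketch can be written out over an in-memory list of (X, Y) pairs. This is a minimal illustration, not the prototype's implementation: the function name `induce_rules` and the sample pairs are hypothetical (the range endpoints are borrowed from the displacement example elsewhere in these slides), and it assumes the X values for different Y values do not interleave.

```python
from collections import defaultdict


def induce_rules(pairs):
    """Induce rules 'if x1 <= X <= x2 then Y = y' from (X, Y) value pairs."""
    # Step 1: S is the set of distinct (X, Y) pairs.
    s = set(pairs)

    # Step 2: T holds pairs whose X value maps to more than one Y value.
    ys_for_x = defaultdict(set)
    for x, y in s:
        ys_for_x[x].add(y)
    t = {(x, y) for x, y in s if len(ys_for_x[x]) > 1}
    s -= t

    # Step 3: one rule per distinct Y, spanning the observed range of X.
    xs_for_y = defaultdict(list)
    for x, y in s:
        xs_for_y[y].append(x)
    return sorted((min(xs), max(xs), y) for y, xs in xs_for_y.items())


# Hypothetical (Displacement, Type) pairs; 6000 is deliberately inconsistent.
pairs = [
    (2145, "SSN"), (4000, "SSN"), (6955, "SSN"),
    (7250, "SSBN"), (30000, "SSBN"),
    (6000, "SSN"), (6000, "SSBN"),
]
rules = induce_rules(pairs)
for x1, x2, y in rules:
    print(f"if {x1} <= X <= {x2} then Y = {y}")
```

In a relational system each step maps onto ordinary operations: a projection for step 1, a grouping with a count of distinct Y values for step 2, and a grouped min/max for step 3.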
63 Examples of Induced Rules
- A prototype system was implemented at UCLA using a naval ship database as a test bed. Examples of rules induced are
- Entity SUBMARINE
- x isa SUBMARINE
- R1: if 0101 <= x.Class <= 0103 then x isa SSBN
- R2: if 0201 <= x.Class <= 0215 then x isa SSN
- R3: if Skate <= x.ClassName <= Thresher then x isa SSN
- R4: if 2145 <= x.Displacement <= 6955 then x isa SSN
- R5: if 7250 <= x.Displacement <= 30000 then x isa SSBN
64 Examples of Induced Rules (Contd)
- Relationship INSTALL
- x isa SUBMARINE and y isa SONAR
- R1: if SSN582 <= x.Id <= SSN601 then y isa BQS
- R2: if SSN604 <= x.Id <= SSN671 then y isa BQQ
- R3: if x.Class = 0203 then y isa BQQ
- R4: if 0205 <= x.Class <= 0207 then y isa BQQ
- R5: if 0208 <= x.Class <= 0215 then y isa BQS
- R6: if y.Sonar = BQS-04 then x isa SSN
65 Pruning the Rule Set
- When the number of rules generated becomes too large, the system must reduce the size of the knowledge base.
- Two Criteria for Rule Pruning
- 1. Coverage
- Keep the rules that are satisfied by more than Nc instances and drop those rules that are satisfied by fewer than Nc instances.
- 2. Completeness
- Keep the rule schema (X --> Y) for which the total number of instances satisfied by the rules of the same schema is greater than a coverage threshold Cc.
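The two criteria can be combined into a simple filter, sketched below. Several details are assumptions, since the slide leaves them open: rules are dictionaries with hypothetical `schema` and `support` fields, a rule satisfied by exactly Nc instances is kept, and completeness is computed over the rules that survive the coverage test.

```python
from collections import defaultdict


def prune_rules(rules, n_c, c_c):
    """Apply the coverage test per rule, then the completeness test per schema."""
    # Coverage: keep rules satisfied by at least n_c instances (tie kept by assumption).
    covered = [r for r in rules if r["support"] >= n_c]

    # Completeness: total support of the surviving rules of each schema.
    totals = defaultdict(int)
    for r in covered:
        totals[r["schema"]] += r["support"]
    return [r for r in covered if totals[r["schema"]] > c_c]


# Hypothetical induced rules with the number of instances each one satisfies.
rules = [
    {"name": "A", "schema": "Class --> Type", "support": 10},
    {"name": "B", "schema": "Class --> Type", "support": 2},
    {"name": "C", "schema": "Id --> Sonar", "support": 3},
    {"name": "D", "schema": "Id --> Sonar", "support": 4},
]
kept = [r["name"] for r in prune_rules(rules, n_c=3, c_c=8)]
print(kept)
```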
66 Induced Rules from Relation PORT
67 Summary
- Contributions
- Providing a model-based methodology for acquiring knowledge from the database by rule induction.
- Applications
- Semantic Query Processing - use semantic knowledge to improve query processing performance.
- Deductive Database Systems - use induced rules to provide intensional answers.
- Data Inference Applications - use rules to improve data availability by inferring inaccessible data from accessible data.
68 Rule Induction
70 Generate the Rules
- Select targets
- Targets are the RHS attributes of rules.
- Method of selection
- Use indices as targets
- Use selectivity
- selectivity = (number of tuples with distinct values) / (total number of tuples)
- Targets are chosen based on database schema (e.g., type hierarchy).
- Generate rules for each target
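As a rough sketch of the selectivity measure, the function below reads it as the number of distinct values of an attribute divided by the number of tuples. That reading, the sample rows, and the function name `selectivity` are assumptions; the slide does not say which selectivity values make an attribute a good target, so the sketch only computes the ratio.

```python
def selectivity(rows, attribute):
    """Number of distinct values of `attribute` divided by the number of tuples."""
    values = [row[attribute] for row in rows]
    return len(set(values)) / len(values)


# Hypothetical submarine tuples.
rows = [
    {"id": "SSN582", "type": "SSN"},
    {"id": "SSN601", "type": "SSN"},
    {"id": "SSBN130", "type": "SSBN"},
    {"id": "SSN671", "type": "SSN"},
]
type_selectivity = selectivity(rows, "type")
id_selectivity = selectivity(rows, "id")
print(type_selectivity, id_selectivity)
```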