Title: Database Integration
1??????? ???? ???? ??? ???????? ???? ??????
- ??????? ?????
- ??????? ???- ??????? ??? ? ????????
???? ???? ????? ????? ?????? ???? ????? ?????
2????? ?????
- ????? ???? ????????
- ????? ??????
- ????? ?????
- ????? ??????
- ????? ????
- ?????? ??????? ???? ????
- ??????? ???? ??? ??
- ?????? ? ??????? ??? ????? ?? ???? ?? (Data
interoperability)
- ??????? ???? ??? ???????
- ?????? ????? ??? ???? ???????? ?????????
- ?????? ???? ???? ??
- ??? ??? ??????? ????? (Data alignment)
- ???????? ???? ??
- ????? ????????? ?? ???? ??
- ??????
- ???
3????? ????? (?????)
- ???? ??? ?????? ???? ??? ???????? ????????????
- ?????? ??? ??????? ???? ???? ??? ???????? ????
??????
4????? ???? ????????
?????? ?? ????? ???? ???????? ?? ???? ?? ????
????? ????? ?? ???? ????? ?? ???? ???????? ??????
?? ???. ?? ?? ???? ???????? ????? ??? ???? ?????
????? ?????? ???? ??? ????? ?? ?? ????? ? ??????
??? ??? ???? ?? ? ????? ?? ?? ???? ????.
??????? ???? ???????? ????? ?? ??? ???? ????
???? ???????? ????? ?? ???. ????? ?? ????? ????
???????? ?? ???? ?? ???? ???????? ????? ?? ????
??? ??? ???? ????? ?? ???? ??????? ?????? ????
???????? ????? ?? ???.
5????? ???? ???????? (?????)
- ???? ????? ????? ???? ????????
- ????? ??????? ?? ???? ?? ?????? ? ????? ?????
???? -
- ????? ?????? ?? ???? ?? ??????? ?? ? ???? ???? ?
????? ????? ???? - ????? ??????? ?? ????? ???? ??? ?? ????? ?????
- ????? ????? ????? ????? ????? ?? Oracle ?? SQL
Server
6????? ??????
?? ????? ?????? ?? ?? ?? ??? ???? ?? ?????? ???
???? ?? ??? ?? ???? ???? ?????? ? ????? ???
??????. ??? ???? ????? ?????? ??? ????? ????? ??
?????? ?? ????? ?? ??? ? ??? ???? ?? ?? ????
????? ?? ???? ?? ???. ?? ???? ??? ??? ???? ????
??? ??????? ??? ????? ????? ? ????? ?? ????? ??
???. ?? ???? ?? ??? ??? ?? ???? ????? ??? ??????
?? ?????? ?? ???? ?? ??? ? ?? ????? ?? ???????
??? ???????? ??? ???? ?? ???? ??????? ???? ??
???. ??? ???? ?? ????? ?????? ? ?????? ??? ????
?? ????? ?? ???. ??? ??? ???? ?? ?? ???? ????????
?? ??? ?? ????? ???? ????? ?? ??? ? ??????? ???
????? ????. ??? ?? ????? ?????? ????? ?? ??????
???? ???? ???????????? ?? ????. ??? ???? ??????
????? ?? ???????? ?? ???. ?? ???? ????? ?????? ??
?? ???? ????? ????? ????? ????? ?? ?? ?? ?? ????
? ?? ?????? ???? ????? ???? ??? ?????.
7????? ?????
?? ??????? ????? ?? ??? ???? ?? ???? ?? ????? ?
???? ??? ?? ?? ???? ?? ?????? ?? ? ??????? ??? ??
???????? ?????? ???? ?? ???? ???? ???. ??? ???
???? ???? ????? ?? ?? ????? ??? ?????? ????
???????? ?? ????? ????? ???? ?? ???????? ????
??????? ???? ??? ?? ??? ???? ????? ????? ????? ??
? ????? ???? ??? ????? ?? ??? ??? ?? ???? ? ?? ??
??? XML ??????? ???. ??? ?????? ?? ???? ??? ????
?? ?????? ?? ?? ?? ??? ????? ?? ?? ???????? ????
??????? ??? ???? ? ?? ?? ???? ??? ???? ?? ??????
?? ?? ?? ?????? ????? ? ??????? ??????? ?? ????
?????? ???? ??? ???? ?? ????? ?? ????.
8????? ??????
??? ?????? ?? ???? ????? ?? ??? ?? ???? ?? ?????
?? ??? ????? ?????? ????? ???? ????? ?? ????. ???
????? ???? ?? ????? ?? ????? ?????? ? ?????? ? ??
????? ????? ??? ???? ????. ???? ??? ?????? ??
???? ????? ????? ?????? ?? ???? ? ????? ???? ?
???? ?????? ?? ?? ????
9????? ????
?? ?? ???? ?? ?????? ????? ??????? ??????? ???.
?? ?? ???? ????? ?? ?? ?? ?? ?? ???? ??? ??????
??? ???? ?? ??????? ???? ????? ?? ???? ?????.
???? ?? Oracle ??????? ?? ?? ???? ???? ??? ??
?? ?? ??? ???? ????? ?? ????? Oracle ???????
????. ?? Oracle ????? ?? ????? ????? ????? ??
???? ?? ?? ?? ?????? ???? ????? ?????. ??????
???? ??? ?????? ???????? ???? ??? ??????????
???????? ?????? ? ... ????? ???? ?? ??????? ??
?????
10???? ??????
- ?????? ??????? ???? ????
- ??????? ???? ??? ??
- ?????? ? ??????? ??? ????? ?? ???? ?? (Data
interoperability)
- ??????? ???? ??? ???????
- ?????? ???? ????? ????? ??? ???? ????????
????????? - ?????? ???? ???? ??
- ??? ??? ??????? ????? (Data alignment)
11??????? ???? ????? (1965-75)
- ???? ?? ???? ??? ????? ?????? ??? ??????? ?? ???
????? ???? ????????
???? ??? ?????? ??????? 1
???? ??? ?????? ??????? 2
???? ??? ?????? ??????? 3
???? ??? ?????? n ???????
????? ?????? ??????? ???? ???
12?????? ? ??????? ??? ?????? ?? ??????? (1975-80)
INTEGRATION
- ???????? ???? ???? ?? ???? (Location
Transparency) - ? ????? ??????
- ?????? ?????? ???? ??? ???????? ????? ???
- ?????? ?????? ?????? ????
???? ??? ???????? ????????? - ??? ???????? ???? ???? ?? ???? ( LOCATION
VISIBILITY) - ? ??? ???? ????? ??????
- multi-DB views, multi-DB access language
MULTIDATABASE SYSTEMS - ???? ??? ?????? ?????? ? ?? ???? ?????? ?????
- (files, repositories, knowledge bases,
spreadsheets, ...)
- information exchange protocols / languages
13??????? ???? ??? ??????? (80-95)
????? Interoperability ??????? ?????? ? ?????
??????? ????? ?? ??? ??? ????? ? ??????? ?? ???
???????
14??????? ???? ??? ??????? (?????)
Courtesy Oracle
15?????? ????? ??? ???? ???????? ?????????
Filtering
Integration
Translation Wrappers / Mediators
local export schemas
16?????? ???? ???? ?? (1995-2000)
- ?????? ?? ?????? ???? ????
17????? ???????? ??????????? ?????? ?? ?????? ????
???? ??
- ????? ??????? ???? ???? ??? ???? ????
- ??????? ????
18??????? ???? ?? ??? ??
- ????? ???????
- ???? ???? ?? ???? ???? ????? ?? ????? ????? ????
(????? eBay) ?????? ??? ?? ??? ????? ??????? ???
?? ???? ? ??????? ??? ????? ?? ????? ????? ??????
??? ???? ???? ?? ???? ???. - ?? ?? ????? ??? ?????? ??? ?? ????? ????? ???
???? ????? ????? ???? ??? ?? ?? ???? ????????
???? ?? ?? ?? ??????? ??? ????? ????? ? ??
??????? ???? ?? ??????? ?? ?? ??????? ?? ????
?????? ?? ????. - ?? ?? ?????? ??? ??????? ?? ??????? ?? ???? ????
? ??????? ?? ?????? ?? ???? ???? ???? ??? ???
?????? ?????? ????.
19????? ???????? ???????
20?? ???????? 4 ???? ???? ??????? ???? ???? ???
???????? ??????
21???? ???????? (????)
- ??????? ?????? ?????? ??? ??????
Schema 1
Schema 2
22???? ????? ???? ???????? (??????)
[Figure: schema integration example; recoverable labels only]
Schema S1 (OO): Person (Pin, Name); Student (GPA); Faculty (Rank)
Schema S2 (relational): Thesis (Phd-advisor, Phd-student, title)
The integrated schema (OO): Person (Pin, Name); Student (GPA); Faculty (Rank); PhD Student; Phd-advisor; Thesis (Title, Adv., Student)
- ??? ??? ???? ?? ??????? ??????? ????? ????????
23??? ?????? ???? ??
- ???? ???? ???? ??
- ????? ??? ?????? ??????? ??? ????? ??? ? ??
????? ???? ??? ???? ?? ? ??? ????????? ?? ? ?????
?? - ??????? ???? ???? ??
- ??????? ???? ???? ??? ???????? ?? ???? ???
???????? ? ?? ???? ???????
- ????? ???? ?? (Data transformation)
- ?????? ? ?? ????? ???? ???? ?? (Normalization
and aggregation)
- ???? ??? ???? ?? (Data reduction)
- ???????? ???? ?? ?? ???? ???? ?? ???? ?? ?? ?????
?????? ????? ???? ????? ???? ???? ?????? ????
(????? ???? ????? ????? ?????)
24????? ????????? ?? ?? ??????? ????
- ????????? ??? ??????
- ????????? ??? ???
25????????? ??? ??????
- Naming Conflicts
- In any data model, the schemata incorporate
names for various entities/objects represented by
them. Since these schemata are designed
independently, the designer of each schema uses
his or her own vocabulary to name these objects.
Objects in different schemata representing the
same real world concept may contain dissimilar
names.
26Semantic Incompatibilities (cont.)
- Homonyms: This inconsistency arises when the
same name is used for two different concepts. For
example, 'SALARY' may mean weekly salary in
one database and monthly salary in another.
- Synonyms: This type of naming conflict arises
when the same concept is identified by two or
more names. For example, the term 'DOMESTIC
CUSTOMER' in one database may refer to the same
concept as the term 'BUYERS' in another database.
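The two naming conflicts above can be illustrated with a small translation table. This is only a sketch: the source labels `db_a` and `db_b`, the global names `customer` and `salary_per_year`, and the choice of a yearly figure as the common unit are assumptions made for the example, not part of the original material.

```python
# Hypothetical per-source vocabularies: local name -> shared (global) name.
# Synonyms: two different local names map to one global concept.
SYNONYMS = {
    "db_a": {"DOMESTIC CUSTOMER": "customer"},
    "db_b": {"BUYERS": "customer"},
}

# Homonyms: the same local name means different things per source.
# Here 'SALARY' is assumed weekly in db_a and monthly in db_b, so each
# source gets its own conversion factor to a yearly figure.
HOMONYMS = {
    ("db_a", "SALARY"): ("salary_per_year", 52),  # weekly -> yearly
    ("db_b", "SALARY"): ("salary_per_year", 12),  # monthly -> yearly
}


def to_global(source, name, value=None):
    """Translate a local name (and optional value) into the global vocabulary."""
    if (source, name) in HOMONYMS:
        global_name, factor = HOMONYMS[(source, name)]
        return global_name, None if value is None else value * factor
    # Unknown names pass through unchanged.
    return SYNONYMS.get(source, {}).get(name, name), value
```

Keeping the mapping keyed by source is what resolves the homonym: the name alone is not enough to know what 'SALARY' means.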
27Semantic Incompatibilities (cont.)
Type Conflicts: These conflicts arise when the
same concept is represented by different coding
constructs in different schemata. For example, an
object may be represented as an entity in one
schema and as an attribute in another schema.
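One way to reconcile an entity-versus-attribute conflict is to lift the attribute into its own entity so both schemata use the same construct. The sketch below assumes rows are plain dictionaries; the function name and the generated `_id` column are hypothetical.

```python
def lift_attribute_to_entity(rows, attribute):
    """Turn an attribute's distinct values into entity records.

    Returns (entities, new_rows), where each new row refers to the
    entity through a generated '<attribute>_id' field.
    """
    ids = {}  # attribute value -> generated entity id
    new_rows = []
    for row in rows:
        value = row[attribute]
        entity_id = ids.setdefault(value, len(ids) + 1)
        new_row = {k: v for k, v in row.items() if k != attribute}
        new_row[attribute + "_id"] = entity_id
        new_rows.append(new_row)
    entities = [{"id": i, "name": v} for v, i in ids.items()]
    return entities, new_rows
```

The opposite direction (collapsing an entity back into an attribute) is equally valid; which side to adapt depends on the integrated schema that is chosen.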
28Semantic Incompatibilities (cont.)
Key Conflicts: Different keys may be assigned
to the same concept in different schemata [15,
46]. For example, SS# and EMP-ID may be keys for
employees in two component schemata.
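A key conflict of this kind is often handled with a crosswalk that links the two keys for the same employee. The following is a minimal sketch under that assumption; the crosswalk contents and all identifiers are made up for illustration.

```python
# Hypothetical crosswalk linking the two keys of the same employee.
CROSSWALK = [
    {"ss": "123-45-6789", "emp_id": "E001"},
    {"ss": "987-65-4321", "emp_id": "E002"},
]
EMP_ID_TO_SS = {row["emp_id"]: row["ss"] for row in CROSSWALK}


def unify_keys(records_by_emp_id):
    """Re-key records from EMP-ID to the social-security key.

    Returns (rekeyed, unmatched); EMP-IDs with no crosswalk entry are
    reported rather than silently dropped.
    """
    rekeyed, unmatched = {}, []
    for emp_id, record in records_by_emp_id.items():
        ss = EMP_ID_TO_SS.get(emp_id)
        if ss is None:
            unmatched.append(emp_id)
        else:
            rekeyed[ss] = record
    return rekeyed, unmatched
```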
29Semantic Incompatibilities (cont.)
Behavioral Conflicts: These conflicts arise
when different insertion/deletion policies are
associated with the same class of objects in
different schemata. For example, in one database
the relation DEPT may exist without any employee
records associated with it, whereas in another
database the deletion of the last employee record
may also delete the relation DEPT from the
database.
30Semantic Incompatibilities (cont.)
Missing Data: Different attributes may be
defined for the same concept in different
schemata. For example, EMP1(SSN, NAME, AGE) and
EMP2(SSN, NAME, ADDRESS) may represent the same
concept in two database schemata. Attribute 'AGE'
is missing in EMP2, and attribute 'ADDRESS' is
missing in EMP1.
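The two employee relations above can be combined with an outer merge on SSN, leaving attributes that a source does not supply as `None`. This is a sketch using plain dictionaries; the sample rows in the usage note are invented.

```python
def merge_employees(emp1, emp2):
    """Outer-merge two lists of row dicts on SSN.

    Attributes missing from a source stay None in the merged row.
    """
    columns = ["SSN", "NAME", "AGE", "ADDRESS"]
    merged = {}
    for row in list(emp1) + list(emp2):
        target = merged.setdefault(row["SSN"], dict.fromkeys(columns))
        for key, value in row.items():
            if value is not None:
                target[key] = value
    return [merged[ssn] for ssn in sorted(merged)]
```

For instance, an employee that appears only in the second relation comes out with an ADDRESS but with AGE set to `None`, which makes the gap explicit instead of hiding it.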
31Semantic Incompatibilities (cont.)
Levels of Abstraction: This incompatibility is
encountered when information about an entity is
stored at dissimilar levels of detail in two
databases. For example, 'LABOR-COST' and
'MATERIAL-COST' may be stored separately in one
database and combined together as 'TOTAL-COST' in
a second database.
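Because a total cannot be split back into its parts, the usual direction is to bring the more detailed source up to the coarser level. A minimal sketch, assuming rows are dictionaries that use the attribute names from the example:

```python
def to_total_cost(row):
    """Return a copy of the row expressed at the coarser TOTAL-COST level."""
    if "TOTAL-COST" in row:
        return dict(row)  # already at the coarse level
    detail = ("LABOR-COST", "MATERIAL-COST")
    coarse = {k: v for k, v in row.items() if k not in detail}
    coarse["TOTAL-COST"] = row["LABOR-COST"] + row["MATERIAL-COST"]
    return coarse
```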
32Semantic Incompatibilities (cont.)
Identification of Related Concepts: For concepts
in the component schemata that are not the same
but are related, one needs to discover all the
inter-schema properties that relate to them. For
example, two entities belonging to two different
databases may not be equivalent but one entity
may be a generalization of the other entity.
33Semantic Incompatibilities (cont.)
Scaling Conflicts: This incompatibility arises
when the same attribute of an entity is stored in
dissimilar units in different databases. For
example, the attribute 'LENGTH' of an entity may
be stored in centimeters in one database and in
inches in another database.
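A scaling conflict is resolved by converting every source to one agreed unit. The sketch below picks centimeters as that unit, which is an arbitrary choice for the example; the unit labels `cm` and `in` are assumed metadata that each source would have to supply.

```python
CM_PER_INCH = 2.54  # exact by definition of the international inch

# Conversion factors from each source unit to centimeters.
TO_CM = {"cm": 1.0, "in": CM_PER_INCH}


def length_in_cm(value, unit):
    """Convert a LENGTH value from its source unit to centimeters."""
    return value * TO_CM[unit]
```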
34Quantitative Data Incompatibilities
Different Levels of Precision: Different
databases may store an attribute at dissimilar
levels of precision. For example, one database
may contain the weight of a particular part to a
precision of a milligram, whereas another
database may specify it only to the nearest gram.
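Values stored at different precisions can only be compared fairly at the coarser of the two. The sketch below assumes weights arrive as decimal strings in grams and that the coarser source rounded half up; a source that truncates instead would need a different rounding mode.

```python
from decimal import Decimal, ROUND_HALF_UP


def agree_at_coarser_precision(a, b):
    """Compare two decimal strings after rounding both to the coarser precision."""
    da, db = Decimal(a), Decimal(b)
    # The exponent is minus the number of decimal places, e.g. -3 for "12.345",
    # so the larger exponent belongs to the coarser value.
    exponent = max(da.as_tuple().exponent, db.as_tuple().exponent)
    quantum = Decimal(1).scaleb(exponent)
    return (da.quantize(quantum, rounding=ROUND_HALF_UP)
            == db.quantize(quantum, rounding=ROUND_HALF_UP))
```

With these assumptions, "12.345" and "12" agree at gram precision, while "12.600" and "12" do not.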
35Quantitative Data Incompatibilities (Cont.)
Asynchronous Updates: Since each database is
managed independently, the databases may not all
update a value simultaneously.
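If each copy carries a last-update timestamp, a simple policy is to take the most recent value and flag that the copies disagree. The timestamps and the "newest wins" rule are assumptions for this sketch; the slide itself does not prescribe a resolution policy.

```python
from datetime import datetime


def latest_value(copies):
    """Pick the most recently updated value among database copies.

    copies: list of (value, updated_at) pairs, updated_at as an ISO 8601 string.
    Returns (value, stale), where stale is True if any copy differs from it.
    """
    newest = max(copies, key=lambda c: datetime.fromisoformat(c[1]))
    stale = any(value != newest[0] for value, _ in copies)
    return newest[0], stale
```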
36Quantitative Data Incompatibilities (Cont.)
Lack of Security: Due to a lack of information
security at component databases, unauthorized
users may have changed the data.
37Challenges of Bioinformatics Databases Management
- Bioinformatics database formats:
- Flat files: GenBank, EMBL, DDBJ, PDB.
- Relational databases: HGMD, MGMD.
- Object-oriented database: AceDB.
- XML databases: PIR, SwissProt, InterPro.
- Characteristics:
- The diversity/variety of data.
- The representational heterogeneity.
- Autonomous and web-based sources.
- Varied interface and query capabilities.
38Motivation
- Very large heterogeneous databases.
- Need to:
- link.
- integrate.
- Complex relations.
39Volume and Variety
- Two interacting issues in generating
information:
- 1. The volume is large
- we need automation
- 2. The data is varied and heterogeneous
- many autonomous sources
- many distinct objectives
- many incompatibilities, errors
40Diversity & Heterogeneity
- A wide variety of knowledge is needed to
interpret the data
- A large variety of experts is developing this
knowledge
- The scope of interests differs among those
experts
- The knowledge is expressed in diverse ways
- The terms differ in precise meaning (semantics)
- A large variety of data types is needed
- A wide variety of representations is used
- The database and file schemas differ
- The openness and accessibility of the
information differ
41Heterogeneity inhibits Integration
- An essential feature of science
- autonomy of fields
- differing granularity and scope of focus
- growth of fields requires new terms
- A feature of technological process
- standards require stability
- yesterday's innovations are today's
infrastructure
- Must be dealt with explicitly
- sharing, integration, and aggregation are
essential
- Large quantities of data require precision
42Heterogeneity among domains is natural
- Interoperation creates mismatch
- Autonomy conflicts with consistency,
- Local Needs have Priority,
- Outside uses are a Byproduct
- Heterogeneity must be addressed
- Platform and Operating Systems
- Data Representation and Access Conventions
- Metadata Annotations, Naming, and Ontology
- needed to share data from distinct sources
43Obstacles to Integration
- Data spread over multiple, heterogeneous dbs
- Not all are easily queried
- flat file sequence dbs, web sites, BLAST
alignments
- Some are not even easily parsed!
- Not all represent biology optimally
- GenBank is sequence-centric, not gene-centric
- SwissProt is sequence-centric, not
domain-centric
- Hard to keep results up-to-date
- Non-traditional query approaches are needed to
exclude extraneous results
44What are the Data Sources?
- Flat Files
- URLs
- Proprietary Databases
- Public Databases
- Data Marts
- Spreadsheets
- Emails