Title: Data Integration in Digital Libraries: Approaches and Challenges
1Data Integration in Digital Libraries: Approaches and Challenges
- Dr. Ismail Khalil Ibrahim
- ismail.khalil-ibrahim_at_scch.at
- 43 7236 3343 852
- www.scch.at
- Bringing Digital Libraries together
2Biography
Dr. Ismail Khalil Ibrahim is a senior software
developer and AgenCom project manager at the
Software Competence Center Hagenberg, Austria.
He worked at the University of Technology,
Baghdad, Iraq from 1985 to 1990 as a lecturer, at
the Human Resources Training and Development
Institute, Iraq from 1990 to 1996 as head of
the academic studies department, and at Gadjah Mada
University from 1996 to 2000 as a teaching and
research assistant. His main research interests
lie in the fields of E-commerce and I-Commerce,
Database Applications and Techniques for the Web,
Practical Experience and Applications in
Information Integration Systems, Logic
Programming for Information Integration, Agents
for Information Retrieval and Knowledge Discovery,
XML and Semistructured Data Management,
Information Systems Management and Development,
Information Technology Impact, and Economic
Analysis. Ismail is a member of ACM, SIGMOD,
SIGKDD, and SIGecom; General Secretary of the
Indonesian Information Society Initiative (IISI);
member of the Iraqi Engineers Association (IEA);
overseas collaborator in the E-commerce Lab at
the National University of Singapore; member of the
editorial board of the Colombian Journal of Computing
(Revista Colombiana de Computación); chairman of
the organizing committee of the 1st and 2nd
International Workshop on Information Integration
and Web-based Applications & Services (IIWAS'99,
IIWAS'00), Yogyakarta, Indonesia; and chairman of
the organizing committee of the 3rd International
Conference on Information Integration and
Web-based Applications & Services (IIWAS'2001),
Linz, Austria. Ismail holds a B.Sc. in
Electrical Engineering from the University of
Technology, Iraq (1985), and an M.Sc. and Ph.D. in
Computer Engineering and Information Systems from
Gadjah Mada University (1998, 2001).
3Outline
- Data Integration
- What is it?
- What does a data integration system look like?
- What are some data integration challenges?
4What Is Data Integration?
- uniform: sources are transparent to the user
- access: query, and eventually updates
- multiple: even two is a problem
- autonomous: must not affect the behavior of the sources
- heterogeneous: different data models, schemas
- unstructured: at least semi-structured
- information sources: not only databases
5Example Scenario
6Example Scenario (cont.)
Retrieve the titles and subjects of all the
technical reports written by Stephane Bressan
and published by MIT Press.
- q1 → amazon → (Title, Stephane Bressan, subject)
- q2 → book-a-million → (ISBN, Title, MIT Press)
- Join the results
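The final "join the results" step can be sketched in a few lines of Python. This is only an illustration: the rows below are invented placeholders, not real catalog data, and the function name `join_on_title` is hypothetical. It assumes q1 and q2 have already been run against the two sources and that the title is the shared join attribute.

```python
# Placeholder rows standing in for the two partial answers (assumed data).
q1_rows = [  # from the first source: (title, author, subject)
    ("Report A", "Stephane Bressan", "Databases"),
    ("Report B", "Stephane Bressan", "Information Systems"),
]
q2_rows = [  # from the second source: (isbn, title, publisher)
    ("isbn-1", "Report A", "MIT Press"),
    ("isbn-2", "Report C", "MIT Press"),
]

def join_on_title(author_rows, publisher_rows):
    """Keep (title, subject) pairs whose title appears in both partial answers."""
    published_titles = {title for (_isbn, title, _publisher) in publisher_rows}
    return [
        (title, subject)
        for (title, _author, subject) in author_rows
        if title in published_titles
    ]

answer = join_on_title(q1_rows, q2_rows)
print(answer)
```

Neither source alone can answer the question: one knows the author and subject, the other knows the publisher, so the answer only exists after the join.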
7So What is the Problem?
- Virtual vs. Materialized Architectures
- Access: query only, or query and update?
- Problem similar to updating through views
- need distributed transactional services
- Mediated schema: yes or no?
- without a mediated schema we lose the advantages
- a mediated schema requires schema integration
- schema integration needs query transformation
- query transformation needs query optimization
8Additional Dimensions
- How many sources are we accessing?
- how autonomous are the sources?
- how much knowledge do we have about sources?
- how structured are the data in the sources?
- Requirements from responses
- accuracy
- completeness
- machine readable vs. human readable
- handling inconsistencies
- speed
- Closed World Assumption vs. Open World Assumption
9Related Technologies / Issues
- Distributed databases
- sources are homogeneous
- data is distributed a priori
- sources are not autonomous
- Similarities at the optimization and execution level
- Information retrieval
- keyword search
- no semantics
- Data mining: discovering properties and patterns in data
10Current Applications
- Intranets
- enterprise data integration
- web-site construction
- World Wide Web
- digital libraries
- comparison shopping (Netbot, Junglee)
- portals: integrating data from multiple resources
- XML integration
- Science and Culture
- Medical genetics: integrating genomic data
- Astrophysics: monitoring events in the sky
- Environment: Puget Sound Regional Synthesis Model
- Culture: uniform access to all the cultural databases
11Paradigms of Data Integration
[Diagram: taxonomy of integration paradigms. Labels: global defined from local vs. global independent of local; CWA vs. OWA; global-schema-as-view, global-as-view-of-local, local-as-view-of-global; Database Schema Integration, Data Warehousing, Mediation]
12Paradigms of Data Integration II
- Data Warehousing (materialization architecture)
- data of interest is collected in a central place and a web site is built on top of it
- queries are applied to the data warehouse
- advantage: easy to support queries and transactions
- disadvantage: hard to modify; the warehouse is not connected to the providers of information, etc.
13Data Warehousing Architecture
Application
Data Warehouse
Data Extraction
14Paradigms of Data Integration III
- Information Mediation (virtual architecture)
- data remains in the web sources
- rules relate external data to the internal application
- advantage: data is not replicated and is guaranteed to be up to date
- disadvantage: query optimization and execution are more complex
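The freshness trade-off between the materialized (warehouse) and virtual (mediation) architectures can be illustrated with a small Python sketch. The class names `Warehouse` and `Mediator`, the in-memory lists standing in for web sources, and the sample rows are all assumptions made for illustration only.

```python
# Two in-memory lists stand in for remote sources (assumed placeholder data).
sources = {
    "DB1": [("Report A", "Author X", 1999)],
    "DB2": [("Report B", "Author Y", 2000)],
}

class Warehouse:
    """Materialized: copies the source data once, at load time."""
    def __init__(self, sources):
        self.rows = [row for rows in sources.values() for row in rows]

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]

class Mediator:
    """Virtual: keeps only references and reads the sources at query time."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for rows in self.sources.values() for row in rows if predicate(row)]

warehouse = Warehouse(sources)
mediator = Mediator(sources)

# A source changes after the warehouse was loaded.
sources["DB1"].append(("Report C", "Author X", 2001))

def by_author_x(row):
    return row[1] == "Author X"

stale = warehouse.query(by_author_x)   # does not see Report C
fresh = mediator.query(by_author_x)    # sees Report C
print(len(stale), len(fresh))
```

The warehouse answers from its local copy, which is simple and fast but can go stale; the mediator pays the cost of contacting every source for every query.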
15Mediation Architecture
Application
Global Data Model
Local Data Model
16Running Example
- World Relations
- Book(title,year,author,subject)
- BookYear(title,year)
- BookRev(title,author,review)
- Source Relations
- DB1(title,author,year)
- DB2(title,author,year)
- DB3(title,review)
17Global As View (GAV)
- Define a global schema of objects and write down rules to collect these objects
- for each relation R in the mediated schema, we write a query over the sources' relations specifying how to obtain R's tuples from the sources (query unfolding)
- advantage: traditional query processing applies
- disadvantage: requires the right sources to be available and compliant
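A minimal Python sketch of the GAV idea on the running example follows. The view definitions are assumptions chosen for illustration (the slides do not give them): BookYear is taken to be the union of DB1 and DB2 projected on title and year, and BookRev the join of DB1 or DB2 with DB3 on title. The sample rows are placeholders.

```python
# Source relations from the running example, with placeholder rows.
DB1 = [("Report A", "Author X", 1999)]          # (title, author, year)
DB2 = [("Report B", "Author Y", 2000)]          # (title, author, year)
DB3 = [("Report A", "useful"), ("Report B", "dense")]  # (title, review)

def book_year():
    """Assumed GAV rule: BookYear(t, y) :- DB1(t, a, y) or DB2(t, a, y)."""
    return {(title, year) for (title, _author, year) in DB1 + DB2}

def book_rev():
    """Assumed GAV rule: BookRev(t, a, r) :- (DB1 or DB2)(t, a, y), DB3(t, r)."""
    reviews = {}
    for title, review in DB3:
        reviews.setdefault(title, []).append(review)
    return {
        (title, author, review)
        for (title, author, _year) in DB1 + DB2
        for review in reviews.get(title, [])
    }

# Each global relation is defined as a query over the sources.
GLOBAL_VIEWS = {"BookYear": book_year, "BookRev": book_rev}

def answer(relation, predicate=lambda row: True):
    """Unfold a query on a global relation into its source-level definition."""
    return sorted(row for row in GLOBAL_VIEWS[relation]() if predicate(row))

reviews_by_x = answer("BookRev", lambda row: row[1] == "Author X")
print(reviews_by_x)
```

Here unfolding amounts to replacing the global relation by the function that defines it, after which ordinary query evaluation applies. If DB3 were unavailable, `book_rev` would have nothing to fall back on, which is the availability requirement noted above.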
18Local As View (LAV)
- For every information source S, we write a query over the relations in the mediated schema that describes which tuples are found in S (query folding, or answering queries using views)
- may be able to answer a query based on the available partial information
- generally, may not be able to answer the query
- needs non-standard query processing techniques
- potentially high complexity
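A deliberately simplified Python sketch of the LAV idea follows. Each source is described as a projection of one mediated relation, and a query is answered from whichever source descriptions cover the requested attributes. The `SourceView` class, the source descriptions, and the rows are assumptions for illustration; real algorithms for answering queries using views also handle joins across several views, which this sketch does not attempt.

```python
from dataclasses import dataclass

@dataclass
class SourceView:
    name: str
    over: str      # mediated relation the source is described over
    attrs: tuple   # attributes of that relation the source exposes
    rows: list

# Assumed source descriptions for the running example, with placeholder rows.
DB1 = SourceView("DB1", "Book", ("title", "author", "year"), [("Report A", "Author X", 1999)])
DB2 = SourceView("DB2", "Book", ("title", "author", "year"), [("Report B", "Author Y", 2000)])
DB3 = SourceView("DB3", "BookRev", ("title", "review"), [("Report A", "useful")])

def answer(relation, wanted, sources):
    """Answer a projection query on one mediated relation from usable views."""
    used, result = [], set()
    for source in sources:
        if source.over != relation or not set(wanted) <= set(source.attrs):
            continue  # this view cannot contribute to the query
        positions = [source.attrs.index(attr) for attr in wanted]
        used.append(source.name)
        result |= {tuple(row[i] for i in positions) for row in source.rows}
    return used, sorted(result)

used, rows = answer("Book", ("title", "year"), [DB1, DB2, DB3])
unused, no_rows = answer("Book", ("title", "subject"), [DB1, DB2, DB3])
print(used, rows)
print(unused, no_rows)
```

The first query is answered from the partial information in DB1 and DB2. The second cannot be answered at all, because no source is described as exposing `subject`. Adding a new source only requires adding one more description, without touching the existing ones.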
19Challenges
- Complexity over traditional DBs: heterogeneous, autonomous, network-bounded sources
- Query reformulation now understood
- map queries over mediated schemas to wrapped sources (heterogeneity)
- Issues remain in query processing
- few statistics (autonomous sources)
- unanticipated delays and failures (network-bounded sources)
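One common way to cope with unanticipated delays and failures is to query the sources concurrently and return whatever arrives before a deadline. The Python sketch below illustrates this under stated assumptions: the three source functions are stand-ins that simulate a fast, a slow, and an unreachable source, and `query_all` is a hypothetical helper, not part of any system described in these slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fast_source():
    return [("Report A", 1999)]

def slow_source():
    time.sleep(0.6)  # simulated network delay
    return [("Report B", 2000)]

def failing_source():
    raise ConnectionError("source unreachable")  # simulated failure

def query_all(sources, timeout):
    """Query every source in parallel; return partial rows and skipped sources."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout)
    rows = []
    skipped = [futures[f] for f in not_done]  # sources that missed the deadline
    for future in done:
        try:
            rows.extend(future.result())
        except Exception:
            skipped.append(futures[future])   # sources that failed outright
    pool.shutdown(wait=False)  # do not block on the slow source
    return sorted(rows), sorted(skipped)

rows, skipped = query_all(
    {"fast": fast_source, "slow": slow_source, "down": failing_source},
    timeout=0.2,
)
print(rows, skipped)
```

Returning the list of skipped sources alongside the rows lets the application tell the user that the answer may be incomplete, which ties back to the completeness requirement on responses.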
20Conclusions
- Data integration handles many problems needed for embedded systems applications
- Many data sources
- Easy addition and deletion of sources
- Different source capabilities
- Dealing with network delays
- Easy for user
21Publications
- Semantic Query Transformation for the Integration of Autonomous Information Sources (INAP99, Tokyo)
- IKA: Unity in Heterogeneity (IIWAS99, Yogyakarta)
- Information Retrieval Agents for the Intelligent Integration of Information Sources (MulNet 2000, Bandung)
- A Multilingual Natural Language Interface for Mediating E-Commerce Product Catalogs (INAP2000, Tokyo)
- Semantic Query Transformation for the Intelligent Integration of Information Sources over the Web (WIIW2001, Rio de Janeiro)
- Rewriting Rules for Semantic Query Transformation in E-Commerce Applications (DS9, Hong Kong)
- Data Integration in Digital Libraries: Challenges and Approaches (IndonesiaDL, Bandung)
22Thank you for your attention!