Title: Content Integration for E-Business
1. Content Integration for E-Business
2. New Generation of e-Business on the Internet
- Companies moving beyond marketing, storefronts
- Attempting to do operations on the Internet
- procurement
- supply chain
- customer relationships
- etc.
- In a cross-enterprise environment
- Requires cross-enterprise content integration
- catalog integration is the procurement instance of this problem
3. Content Integration
- Content integration across enterprises
- Not the in-house data warehousing problem
- Not the Enterprise App Integration (EAI) problem
- Operational data must be integrated
- As opposed to historical (trend) data
- E.g. pricing, availability, supply chain
- Structured and unstructured data
- Not just relational or XML queries
- Not just text search
- A combination of the two: logic meets statistics
4. The Butterfly
- Everybody's favorite picture c. 1/2000
- In question (6/2001): how many butterflies, and who owns them
- Not a startup opportunity (Transora vs. Chemdex)
- Perhaps one of the wings is smaller than the other (HomeDepot)
[Figure labels: Marketplace, Suppliers, Buyers]
5. Road Map
- Setting
- Scenarios & Terminology
- Characteristics and Challenges of Content Integration
- Research Evangelism
6. Some Scenarios for Content Integration
- Catalog Management: Integration and Syndication
- MRO (Maintenance, Repair and Operations) a la Grainger
- Thousands of suppliers, run by a content manager
- Availability and Pricing
- Travel industry
- Necessitates live, cross-enterprise querying
- Supply Chain Management
- E.g. auto industry
- Increase in production requires the entire supply chain (the cows)
- Contractual information along with catalog and availability
7. Marketing: The EcoSystem and its Terminology
- Enterprise Application Integration (EAI): App Glue
- Imperative, message-oriented programming (scripting languages)
- Transactional networking (persistent queues)
- Gateways to popular packaged apps
- Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQSeries, etc.
- Data Integration: Warehousing and associated processes
- Intra-enterprise, for business intelligence (historical trends)
- Vendors: Informatica, Ascential, DBMS vendors
- Content Management: Tools for content creation
- Web page and graphic design
- Versioning and configuration management
- Vendors: Vignette, Interwoven, etc.
8. Road Map
- Setting
- Scenarios & Terminology
- Characteristics and Challenges of Content Integration
- Content Access, Mapping and Transformation
- Query Processing
- Research Evangelism
9. Content Integration: Characteristics and Challenges
- New integration challenges for e-business
- cross-enterprise
- operational
- data-centric (not app-centric)
- structured/unstructured
- Two main thrusts
- Content Access, Mapping and Transformation
- Query Processing
10. Content Access: Relationships with Providers
- Varying relationships with content providers
- Direct DBMS access (typically in-house)
- Direct access to federated apps (SAP, etc.)
- Gateway vendors a la Merant, NEON, Attunity, etc.
- Arm's-length relationships
- HTML screen scraping
- XML messaging
- Relationships evolve over time!
- MySimon example
11. Content Mapping
- Syntactic and semantic integration
- Formatting/normalization is one piece of the puzzle
- XML, HTML, Relational, etc.
- Semantics is much harder
- E.g. price. E.g. delivery.
- Semantics gate the process
- A content manager must own the transformation task
- Ease of use critical
- Home Depot has 60,000 suppliers!
- Standards can help a bit (e.g. UDDI)
- But graphical tools are the name of the game
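
To make the "price" example concrete, here is a minimal sketch of why semantics is harder than formatting: two suppliers can both supply a well-formed price field while meaning different things by it (currency, pack size, shipping included or not). The record layout, the exchange rate, and the function name are all assumptions made up for illustration; nothing here reflects an actual Cohera interface.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical exchange rates, invented for illustration only.
USD_PER_UNIT = {"USD": Decimal("1.00"), "EUR": Decimal("0.90")}


@dataclass(frozen=True)
class SupplierPrice:
    sku: str
    amount: Decimal          # price exactly as quoted by the supplier
    currency: str            # e.g. "USD", "EUR"
    units_per_pack: int      # the quote covers this many items
    includes_shipping: bool  # whether shipping is folded into the quote


def unit_price_usd(p: SupplierPrice,
                   shipping_per_pack_usd: Decimal = Decimal("0")) -> Decimal:
    """Map a supplier quote onto one shared meaning of 'price':
    USD per single item, shipping excluded."""
    usd = p.amount * USD_PER_UNIT[p.currency]
    if p.includes_shipping:
        usd -= shipping_per_pack_usd
    return (usd / p.units_per_pack).quantize(Decimal("0.01"))


# Two quotes that look different but mean the same thing once normalized.
quote_a = SupplierPrice("ink-1", Decimal("12.00"), "USD", 12, False)
quote_b = SupplierPrice("ink-1", Decimal("10.00"), "EUR", 6, True)
```

The syntactic step (parsing the number) is trivial here; the content manager's real work is deciding which of the flags apply to each supplier, which is why ease of use gates the process.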
12. Cohera Workbench
13. Schemas and Taxonomies
- Cross-enterprise: multiple schemas
- Even if standards prevail (very optimistic)
- Early e-catalog systems were locked into one schema
- Great for service companies, e.g. Requisite
- Tools are sounding the death knell
- Taxonomies are critical
- Natural for browsing, especially with dirty data
- "Black Ink", "India ink", "fountain pen ink", "black"
- Taxonomy per vertical market, plus standards like UNSPSC
- Office Supplies > Ink and lead refills > India ink
- Taxonomy as data: query it, browse it, etc.
- Integration task includes taxonomy integration!
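
A small sketch of "taxonomy as data": if the category tree is stored as ordinary rows, the same structure can be browsed (list children) and queried (compute a path). The row layout and node IDs are assumptions for the example; the labels come from the slide's own path.

```python
# Taxonomy stored as plain (node_id, parent_id, label) rows, so it can be
# queried and browsed like any other table. IDs are made up for the example.
TAXONOMY = [
    (1, None, "Office Supplies"),
    (2, 1, "Ink and lead refills"),
    (3, 2, "India ink"),
    (4, 2, "Fountain pen ink"),
]


def children(parent_id):
    """Browse: labels directly under a node (None means the root level)."""
    return [label for _, parent, label in TAXONOMY if parent == parent_id]


def path(node_id):
    """Query: full category path from the root down to a node."""
    by_id = {nid: (parent, label) for nid, parent, label in TAXONOMY}
    labels = []
    while node_id is not None:
        parent, label = by_id[node_id]
        labels.append(label)
        node_id = parent
    return " > ".join(reversed(labels))
```

Integrating two such taxonomies then becomes a mapping problem between two row sets, which is the sense in which the integration task includes taxonomy integration.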
16. Themes in Content Access and Mapping
- Scalability in human terms
- Content managers, not geeks
- The name of the game: semi-automatic tools
- Statistical (fuzzy) techniques to provide hints (not silver bullets)
- Integrated into graphical programming-by-example interfaces
- Problem domains
- Wrapper generation
- Data cleaning
- Schema mapping
- Taxonomy mapping
- Syndication
- One of the key systems challenges today
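
As a rough illustration of "statistical techniques to provide hints", the sketch below ranks candidate target columns for each source column by string similarity. It uses Python's standard `difflib` as a stand-in for a real matcher; the function name and the column names are hypothetical. The output is a ranked suggestion list for a content manager to confirm, not an automatic mapping.

```python
from difflib import SequenceMatcher


def mapping_hints(source_columns, target_columns, top_n=2):
    """Rank candidate target columns for each source column by name similarity.

    Returns {source: [(target, score), ...]} with the best guesses first.
    These are hints for a human to accept or reject, not decisions.
    """
    hints = {}
    for src in source_columns:
        scored = sorted(
            ((SequenceMatcher(None, src.lower(), tgt.lower()).ratio(), tgt)
             for tgt in target_columns),
            reverse=True,
        )
        hints[src] = [(tgt, round(score, 2)) for score, tgt in scored[:top_n]]
    return hints


hints = mapping_hints(
    ["prod_name", "unit_price"],
    ["product_name", "price", "description"],
)
```

Name similarity alone will miss mappings such as "cost" to "price"; a real tool would also look at the data values, which is where the user in the loop matters.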
17. Road Map
- Setting
- Scenarios & Terminology
- Characteristics and Challenges of Content Integration
- Research Evangelism
18. Query Processing Issues
- Content to be integrated is increasingly uncacheable
- Arm's-length accessibility
- Business rules, not data
- E.g. custom content throughout the dataflow
- Volatile information
- E.g. Availability
- Yet a great deal of content is cacheable and slowly changing
- Upshot: need a combined technology
- Prefetch/Cache/Replicate when possible
- Query live when impossible
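
The "combined technology" can be sketched as a lookup that serves slowly changing attributes from a local copy and goes to the source for anything volatile. This is a minimal sketch under stated assumptions: the class name, the per-attribute TTL table, and the `fetch_live` callable are all hypothetical, and a real system would make this decision inside the query optimizer rather than in application code.

```python
import time


class ContentCache:
    """Serve cacheable attributes locally; query the source live otherwise."""

    def __init__(self, fetch_live, ttl_seconds, clock=time.monotonic):
        self._fetch_live = fetch_live  # hypothetical callable: (sku, attr) -> value
        self._ttl = ttl_seconds        # attr -> seconds; a missing attr is uncacheable
        self._clock = clock
        self._store = {}               # (sku, attr) -> (value, fetched_at)

    def get(self, sku, attr):
        ttl = self._ttl.get(attr)
        if ttl is None:
            # Volatile or arm's-length content: always query live.
            return self._fetch_live(sku, attr)
        now = self._clock()
        hit = self._store.get((sku, attr))
        if hit is not None and now - hit[1] <= ttl:
            return hit[0]
        # Cache miss or stale entry: fall back to the live source and refill.
        value = self._fetch_live(sku, attr)
        self._store[(sku, attr)] = (value, now)
        return value


# Usage with a fake source that records every live call.
calls = []


def fake_fetch(sku, attr):
    calls.append((sku, attr))
    return f"{attr}-{len(calls)}"


cache = ContentCache(fake_fetch, {"description": 3600})
first = cache.get("A", "description")      # live fetch, then cached
second = cache.get("A", "description")     # served from the cache
avail_1 = cache.get("A", "availability")   # uncacheable: live
avail_2 = cache.get("A", "availability")   # uncacheable: live again
```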
19. Federated Query Processing
- DBMS community must shed our materialization myopia!
- ETL/Warehousing was inelegant and limited
- What do we do on a cache miss??
- Should be no distinction between materialized views and queries!
- Federated Query Processing
- Query across multiple sources
- Choose among multiple replicas, materialized views
- Consider staleness
- This is the natural extension of the modern database vision
- Cohera uses Mariposa's economic model to do this
- Decouples optimization, cost estimation, storage and processing
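
One way to picture "choose among replicas, consider staleness" in an economic setting: each site that can answer a query offers a bid, and a broker takes the cheapest bid that is fresh enough. The sketch below is only loosely inspired by that idea; the `Bid` fields, the site names, and the selection rule are assumptions for illustration and are not the actual Mariposa or Cohera protocol.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Bid:
    site: str
    cost: float         # what the site charges to answer (abstract units)
    staleness_s: float  # age of the site's copy; 0.0 means a live source


def choose_site(bids: Sequence[Bid], max_staleness_s: float) -> Optional[Bid]:
    """Pick the cheapest bid whose data is fresh enough for this query."""
    acceptable = [b for b in bids if b.staleness_s <= max_staleness_s]
    return min(acceptable, key=lambda b: b.cost, default=None)


bids = [
    Bid("supplier-live", cost=10.0, staleness_s=0.0),
    Bid("marketplace-replica", cost=2.0, staleness_s=600.0),
    Bid("warehouse-view", cost=1.0, staleness_s=86400.0),
]
```

Because each site prices its own work, the broker needs no central cost model, which is one reading of "decouples optimization, cost estimation, storage and processing". A pricing query that tolerates an hour of staleness goes to the replica; an availability query with zero tolerance goes to the live supplier.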
20. Standard Queries Required
- Hand-coded queries are brittle: you want ad-hoc
- Don't buy a handful of beans
- Need support for standard query languages
- SQL and XPath today
- SQL/XQuery tomorrow
- Everybody knows this!
- Part of industrial religion
- Oracle on one side
- Dotcoms on the other side
- You might get by claiming to be XML compliant
- But most people have cottoned on by now
21. IR capabilities need to be in the engine
- The best-integrated data will still be noisy (product names, etc)
- Text search on taxonomies, names, descriptions
- Still no good integration of DBMS and IR engines
- Storage (compression huge in IR)
- Index concurrency (many updates per doc in IR)
- Query optimization challenges
- Note: this is not semi-structured querying!
- Integration of logic & statistics is the real model/query challenge
- Plus HCI issues
- Unify query, browse, mine, rank
- Cohera integrates AltaVista into the engine optimizer
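
A toy illustration of "logic & statistics" in a single query: exact predicates (in stock, price ceiling) filter the rows, and a text-similarity score ranks what survives. `difflib.SequenceMatcher` stands in for a real IR ranking function, and the catalog rows, function names, and threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Toy catalog rows; all values are invented for the example.
CATALOG = [
    {"sku": "A1", "name": "India ink, black, 30ml", "price": 4.50, "in_stock": True},
    {"sku": "B2", "name": "Black Ink fountain pen refill", "price": 6.00, "in_stock": True},
    {"sku": "C3", "name": "Blk india-ink 1oz", "price": 3.75, "in_stock": False},
    {"sku": "D4", "name": "Lead refills 0.5mm", "price": 2.00, "in_stock": True},
]


def text_score(query, text):
    """Statistics: a crude similarity in [0, 1] standing in for an IR ranker."""
    return SequenceMatcher(None, query.lower(), text.lower()).ratio()


def search(query, max_price, min_score=0.3):
    """Logic (exact predicates) and statistics (ranked text match) together."""
    hits = [
        (text_score(query, row["name"]), row["sku"])
        for row in CATALOG
        if row["in_stock"] and row["price"] <= max_price
    ]
    return [sku for score, sku in sorted(hits, reverse=True) if score >= min_score]


results = search("india ink", max_price=5.0)
```

Even this toy raises the optimizer question from the slide: whether to apply the cheap predicates first or the text match first depends on selectivity, which is why the text engine has to be visible to the optimizer rather than bolted on.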
22. Core Systems Issues Remain Important
- Availability, Scalability, Load Balancing
- All critically important in the B2B space
- Availability: you don't even control the components! Outage = news.
- Scalability: MRO wants to grow up to very big installations
- Load Balancing: need to respect SLAs, etc.
- Need adaptive, load-balancing, federated QP
- 100s to 1000s of sites
- Replication is key to availability, but optimizer must understand it
- Cohera's economic model adapts for each query
- Other models being studied (see DE Bulletin 6/2000)
- Compile-time, centralized optimizers (R, et al) will break
23. Query Processing Themes
- Standards
- Logic & Statistics
- Adaptivity to changing performance, load, failures
- Optimizer Scalability
24. So What Really Matters Today?
- Cohera sells because:
- Customers need the content integration workbench today
- They are in integration pain!
- Comes in multiple guises (e-catalog, supplier enablement, etc.)
- Smart tools start cutting the pain immediately
- Customers want an open, standard solution
- Plain old SQL and relational schemas (vs. Requisite, e.g.)
- XML in the bottom, out the top for messaging/integration
- Customers want federated querying... tomorrow
- For today, they'll settle for a centralized solution
- Want the flexibility to grow in that direction
- Federated query engine works fine centralized
- The converse clearly not true
25. Road Map
- Setting
- Scenarios & Terminology
- Characteristics and Challenges of Content Integration
- Research Evangelism
26. Research Evangelism
- Semi-Automatic Tools
- Statistical & logical techniques, with a user in the loop
- E.g. Potter's Wheel (Raman/Hellerstein, VLDB 01), http://control.cs.berkeley.edu
- schema integration algebra
- interactive visualization
- programming-by-example
- statistical inferencing for discrepancies and domain detection
- A new class of systems work!
- Tools/Apps must be part of our agenda
- Many systems challenges here, especially on the stat/HCI side
- Architectural elegance, API design, extensibility, scalability, etc.
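
To give a flavor of "inferencing for discrepancies and domain detection", the sketch below abstracts each value in a column to a structural shape, takes the most common shape as the column's inferred domain, and flags values that do not fit. This is a deliberately simplified stand-in and not the Potter's Wheel algorithm; the function names and sample data are assumptions.

```python
import re
from collections import Counter


def shape(value):
    """Abstract a value to its structure: digit runs -> 9, letter runs -> a."""
    s = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "a", s)


def discrepancies(column):
    """Return the dominant shape of a column and the values that do not fit it."""
    counts = Counter(shape(v) for v in column)
    dominant, _ = counts.most_common(1)[0]
    return dominant, [v for v in column if shape(v) != dominant]


prices = ["$4.50", "$12.00", "3.75 USD", "$2.00"]
dominant, outliers = discrepancies(prices)
```

The flagged values are candidates for the user to inspect and transform, which keeps the user in the loop instead of silently "fixing" the data.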
27. Research Evangelism, Cont.
- Adaptive Query Processing
- Critical to the federated B2B space
- Unpredictable world, you don't control the components
- Also critical to the ubiquitous computing space
- Sensors are the next challenge
- Who's the DBA of your housepaint? The freeway lines?
- Economic optimization (Mariposa) is one model
- Finer-grained adaptivity possible (Eddies, SIGMOD 2K)
- See http://telegraph.cs.berkeley.edu for examples, ideas, SW
28. Research Evangelism, Cont.
- Tired of research on relational? Choose wisely!
- One big direction here is to integrate IR
- Another is to abandon languages in favor of interfaces
- query/browse/mine: semi-automatic GUIs again!
- XML is critical to business, but under control
- We're doing fine in this space, thank you
- XQuery will push (merge with?) SQL
- The end-result will resemble things you've seen before
- But text search is eating our lunch!
- Intellectual impact in the last decade?
- Industrial impact in the last decade?
- Text search is mostly just an access method and a sort metric
- Integrate into our composable algebras and architectures!
- Teach it in our undergrad classes
29. Summary
- Content Integration is a new, challenging industrial space
- Cohera provides the first complete solution
- Access with varying relationships, formats
- Support for multiple schemas and taxonomies
- Support for custom syndication
- Support for distributed data, both cacheable and uncacheable
- Ad hoc querying
- Fuzzy structured search
- Availability, Scalability, Load Balancing
- Smart graphical tools for content managers
- A fertile area for research as well
- Join the fun!
30. Custom Syndication
- Multiple audiences for content syndication
- Personalization, e.g. buyer-dependent pricing
- Passing the buck?
- Receiver-makes-right (marketplace integration)
- Sender-makes-right (supplier enablement)
- Both cause pain to the elephants!
- Another instance of the mapping problem
- A new space, so XML is a given
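
Buyer-dependent pricing can be sketched as a per-audience view over one base catalog: each buyer receives only the SKUs in their contract, at their negotiated price. The catalog, the contract table, and the `syndicate` function are hypothetical names and data invented for illustration.

```python
from decimal import Decimal

# Hypothetical base catalog and per-buyer contract terms.
BASE_CATALOG = {
    "ink-india-30ml": Decimal("4.50"),
    "lead-0.5mm": Decimal("2.00"),
}
CONTRACTS = {
    "buyer-a": {"discount": Decimal("0.10"),
                "skus": {"ink-india-30ml", "lead-0.5mm"}},
    "buyer-b": {"discount": Decimal("0.00"),
                "skus": {"lead-0.5mm"}},
}


def syndicate(buyer_id):
    """Build the catalog view one buyer should receive: only the SKUs in
    their contract, each at that buyer's negotiated price."""
    terms = CONTRACTS[buyer_id]
    factor = Decimal("1") - terms["discount"]
    return {
        sku: (price * factor).quantize(Decimal("0.01"))
        for sku, price in BASE_CATALOG.items()
        if sku in terms["skus"]
    }
```

Whether this function runs at the supplier (sender-makes-right) or at the marketplace (receiver-makes-right) is exactly the "passing the buck" question; the mapping logic is the same either way.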
31. Background on Cohera
- Engine --> Focused Application --> Broad Solution
- Began as Mariposa, Inc.
- Federated SQL engine shipped 1/99
- Hard to fight warehouses on their turf
- Entered the E-Catalog Space, 12/99
- Federated SQL, screen scraper, catalog apps
- Strong presence in this space
- Broaden to Content Integration, 2001
32. Cohera Content Integration System
- Cohera Workbench
- Cohera Integrate
- Cohera Adapters
33. Cohera Workbench
- Browser-based application used by content managers to integrate content
- Scalable architecture enables distributed deployment
- Enables suppliers to manage and control their content remotely
- Standards-based implementation: J2EE, EJB, XML infrastructure
- Provides a complete workflow for managing business content
[Figure labels: Manage Taxonomy, Auto-categorize, Define, Cleanse, Import, Review, Publish]
34. Cohera Integrate
- IntegrationServer: Integrates and aggregates static and dynamic business content from multiple disparate sources
- SiteManager: Retrieves and transforms content from local or remote data sources
- PowerSearch: Expands search capabilities by providing keyword and synonym based searching via SQL
[Figure labels: Mariposa, Embedded AltaVista]
35. Cohera Adapters
- StandardAdapters: Provide connectivity to all major databases and common file structures
- Connect: Accesses and retrieves content from XML documents and HTML pages residing locally or remotely
- ExtendedAdapters: Accesses content residing in traditional mainframe or application data sources
[Figure labels: DBC; Screen Scraping XSLT by example, HTTP(s), etc.; App Gateways]