Content Integration for EBusiness - PowerPoint PPT Presentation

Slides: 30
Provided by: joehell
Learn more at: https://dsf.berkeley.edu
Transcript and Presenter's Notes

1
Content Integration for E-Business
  • Joe Hellerstein

2
New Generation of e-Business on the Internet
  • Companies moving beyond marketing, storefronts
  • Attempting to do operations on the Internet
  • procurement
  • supply chain
  • customer relationships
  • etc.
  • In a cross-enterprise environment
  • Requires cross-enterprise content integration
  • catalog integration is the procurement instance
    of this problem

3
Content Integration
  • Content integration across enterprises
  • Not the in-house data warehousing problem
  • Not the Enterprise App Integration (EAI) problem
  • Operational data must be integrated
  • As opposed to historical (trend) data
  • E.g. pricing, availability, supply chain
  • Structured and unstructured data
  • Not just relational or XML queries
  • Not just text search
  • A combination of the two: logic meets statistics

4
The Butterfly
  • Everybody's favorite picture, c. 1/2000
  • In question (6/2001): how many butterflies, and
    who owns them
  • Not a startup opportunity (Transora vs. Chemdex)
  • Perhaps one of the wings is smaller than the
    other (HomeDepot)

(Figure labels: Marketplace, Suppliers, Buyers)
5
Road Map
  • Setting
  • Scenarios & Terminology
  • Characteristics and Challenges of Content
    Integration
  • Research Evangelism

6
Some Scenarios for Content Integration
  • Catalog Management: Integration and Syndication
  • MRO (Maintenance, Repair and Operations) à la
    Grainger
  • Thousands of suppliers, run by a content
    manager
  • Availability and Pricing
  • Travel industry
  • Necessitates live, cross-enterprise querying
  • Supply Chain Management
  • E.g. auto industry
  • Increase in production requires the entire supply
    chain (the cows)
  • Contractual information along with catalog and
    availability

7
Marketing: The Ecosystem and its Terminology
  • Enterprise Application Integration (EAI): App
    Glue
  • Imperative, message-oriented programming
    (scripting languages)
  • Transactional networking (persistent queues)
  • Gateways to popular packaged apps
  • Vendors: WebMethods, BEA, CrossWorlds, Netfish,
    MQSeries, etc.
  • Data Integration: Warehousing and associated
    processes
  • Intra-enterprise, for business intelligence
    (historical trends)
  • Vendors: Informatica, Ascential, DBMS vendors
  • Content Management: Tools for content creation
  • Web page and graphic design
  • Versioning and configuration management
  • Vendors: Vignette, Interwoven, etc.

8
Road Map
  • Setting
  • Scenarios & Terminology
  • Characteristics and Challenges of Content
    Integration
  • Content Access, Mapping and Transformation
  • Query Processing
  • Research Evangelism

9
Content Integration: Characteristics and Challenges
  • New integration challenges for e-business
  • cross-enterprise
  • operational
  • data-centric (not app-centric)
  • structured/unstructured
  • Two main thrusts
  • Content Access, Mapping and Transformation
  • Query Processing

10
Content Access: Relationships with Providers
  • Varying relationships with content providers
  • Direct DBMS access (typically in-house)
  • Direct access to federated apps (SAP, etc.)
  • Gateway vendors à la Merant, NEON, Attunity, etc.
  • Arm's-length relationships
  • HTML screen scraping
  • XML messaging
  • Relationships evolve over time!
  • MySimon example
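
A minimal sketch of the arm's-length "HTML screen scraping" style of access listed above, using only Python's standard library. The page content, column names, and the `scrape` helper are hypothetical; a real wrapper would need to cope with far messier markup and with pages that change over time.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical supplier page; the first row is treated as the header.
PAGE = (
    "<table>"
    "<tr><th>SKU</th><th>Price</th></tr>"
    "<tr><td>A1</td><td>$4.50</td></tr>"
    "</table>"
)


def scrape(page: str) -> List[Dict[str, str]]:
    """Turn an HTML table into a list of records keyed by header text."""
    parser = TableScraper()
    parser.feed(page)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape(PAGE)
```

The scraped values are still raw strings ("$4.50"); giving them meaning is the mapping problem, not the access problem.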

11
Content Mapping
  • Syntactic and semantic integration
  • Formatting/normalization is one piece of the
    puzzle
  • XML, HTML, Relational, etc.
  • Semantics is much harder
  • E.g. price. E.g. delivery.
  • Semantics gate the process
  • A content manager must own the transformation
    task
  • Ease of use critical
  • Home Depot has 60,000 suppliers!
  • Standards can help a bit (e.g. UDDI)
  • But graphical tools are the name of the game
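
To illustrate why the semantics of a field like price are harder than its format, here is a small sketch in which two hypothetical suppliers both publish a "price" but mean different things by it. The supplier names, field names, and rule layout are all assumptions made for illustration; the point is only that the mapping is data a content manager could own, not code.

```python
from typing import Dict

# Hypothetical per-supplier mapping rules, expressed as data.
# supplier_a quotes cents per single unit; supplier_b quotes dollars per dozen.
MAPPINGS: Dict[str, Dict[str, object]] = {
    "supplier_a": {"field": "PRICE_CENTS", "scale": 0.01, "units_per_pack": 1},
    "supplier_b": {"field": "price_per_dozen", "scale": 1.0, "units_per_pack": 12},
}


def unit_price_usd(supplier: str, record: Dict[str, object]) -> float:
    """Normalize a supplier record to dollars per single unit."""
    rule = MAPPINGS[supplier]
    raw = float(record[rule["field"]])
    return round(raw * rule["scale"] / rule["units_per_pack"], 4)


price_a = unit_price_usd("supplier_a", {"PRICE_CENTS": "450"})
price_b = unit_price_usd("supplier_b", {"price_per_dozen": 54.0})
```

Both records normalize to the same per-unit dollar price even though neither the field names nor the raw numbers agree.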

12
Cohera Workbench
13
Schemas and Taxonomies
  • Cross-enterprise means multiple schemas
  • Even if standards prevail (very optimistic)
  • Early e-catalog systems were locked into one
    schema
  • Great for service companies, e.g. Requisite
  • Tools are sounding the death knell
  • Taxonomies are critical
  • Natural for browsing, especially with dirty data
  • "Black Ink", "India ink", "fountain pen ink,
    black"
  • Taxonomy per vertical market, plus standards
    like UNSPSC
  • Office Supplies > Ink and lead refills > India ink
  • Taxonomy as data: query it, browse it, etc.
  • Integration task includes taxonomy integration!
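
One way to read "taxonomy as data" is that the category tree lives in ordinary rows that can be queried and browsed like any other content. The sketch below assumes a simple (category, parent) representation; the "Paper" row and the helper names are hypothetical additions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Each row is (category, parent); a parent of None marks a root.
TAXONOMY: List[Tuple[str, Optional[str]]] = [
    ("Office Supplies", None),
    ("Ink and lead refills", "Office Supplies"),
    ("India ink", "Ink and lead refills"),
    ("Paper", "Office Supplies"),  # hypothetical sibling category
]

PARENT: Dict[str, Optional[str]] = dict(TAXONOMY)


def path(category: str) -> str:
    """Browse upward: render the full path from the root to a category."""
    parts: List[str] = []
    node: Optional[str] = category
    while node is not None:
        parts.append(node)
        node = PARENT[node]
    return " > ".join(reversed(parts))


def children(category: str) -> List[str]:
    """Browse downward: list the direct subcategories."""
    return sorted(c for c, p in TAXONOMY if p == category)


def search(keyword: str) -> List[str]:
    """Query the taxonomy itself: paths of categories matching a keyword."""
    kw = keyword.lower()
    return [path(c) for c, _ in TAXONOMY if kw in c.lower()]


india_ink_path = path("India ink")
```

Because the taxonomy is just rows, integrating two taxonomies becomes another mapping task over data rather than a code change.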

14
(No Transcript)
15
(No Transcript)
16
Themes in Content Access and Mapping
  • Scalability in human terms
  • Content managers, not geeks
  • The name of the game: semi-automatic tools
  • Statistical (fuzzy) techniques to provide hints
    (not silver bullets)
  • Integrated into graphical programming-by-example
    interfaces
  • Problem domains
  • Wrapper generation
  • Data cleaning
  • Schema mapping
  • Taxonomy mapping
  • Syndication
  • One of the key systems challenges today
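
As a rough illustration of "statistical techniques to provide hints", the sketch below scores candidate column matches for schema mapping using string similarity from Python's `difflib`. This is a stand-in technique chosen for brevity, not the method of any product named in the talk; the column names are hypothetical, and the output is meant as ranked suggestions for a content manager to confirm or reject.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'Unit_Price' and 'unit price' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mappings(
    source_cols: List[str], target_cols: List[str], top_k: int = 2
) -> Dict[str, List[Tuple[str, float]]]:
    """For each source column, return the top_k target columns with similarity scores."""
    hints: Dict[str, List[Tuple[str, float]]] = {}
    for src in source_cols:
        scored = [
            (tgt, round(SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), 2))
            for tgt in target_cols
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        hints[src] = scored[:top_k]
    return hints


hints = suggest_mappings(
    ["Unit_Price", "ItemDesc", "qty_avail"],
    ["unit_price", "description", "quantity_available"],
)
```

Returning several scored candidates rather than a single answer keeps the user in the loop, which is the "hint, not silver bullet" idea from the slide.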

17
Road Map
  • Setting
  • Scenarios & Terminology
  • Characteristics and Challenges of Content
    Integration
  • Research Evangelism

18
Query Processing Issues
  • Content to be integrated is increasingly
    uncacheable
  • Arm's-length accessibility
  • Business rules, not data
  • E.g. custom content throughout the dataflow
  • Volatile information
  • E.g. Availability
  • Yet a great deal of content is cacheable and
    slowly changing
  • Upshot: need a combined technology
  • Prefetch/Cache/Replicate when possible
  • Query live when impossible
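
The "cache when possible, query live when impossible" policy can be sketched as a per-field time-to-live, where a TTL of zero marks a field as uncacheable. The class, field names, and the `fake_supplier` function below are hypothetical stand-ins for a real remote source.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class HybridContentStore:
    """Serve slowly changing fields from a cache; always query volatile fields live."""

    def __init__(
        self,
        fetch_live: Callable[[str, str], object],
        ttl_seconds: Dict[str, float],
    ) -> None:
        self.fetch_live = fetch_live    # hypothetical call to a remote provider
        self.ttl_seconds = ttl_seconds  # field -> TTL; 0 (or missing) means uncacheable
        self._cache: Dict[Tuple[str, str], Tuple[float, object]] = {}

    def get(self, item_id: str, field: str, now: Optional[float] = None) -> object:
        now = time.time() if now is None else now
        ttl = self.ttl_seconds.get(field, 0.0)
        key = (item_id, field)
        if ttl > 0 and key in self._cache:
            stored_at, value = self._cache[key]
            if now - stored_at <= ttl:
                return value            # fresh enough: no remote call
        value = self.fetch_live(item_id, field)
        if ttl > 0:
            self._cache[key] = (now, value)
        return value


calls: List[Tuple[str, str]] = []


def fake_supplier(item_id: str, field: str) -> str:
    """Stand-in for a remote source; records every live request."""
    calls.append((item_id, field))
    return f"{field}-of-{item_id}"


store = HybridContentStore(fake_supplier, {"description": 3600.0, "availability": 0.0})
store.get("pen-1", "description", now=0.0)    # live, then cached
store.get("pen-1", "description", now=10.0)   # served from cache
store.get("pen-1", "availability", now=10.0)  # live
store.get("pen-1", "availability", now=11.0)  # live again: never cached
```

Passing `now` explicitly is only a convenience for making the example deterministic.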

19
Federated Query Processing
  • DBMS community must shed our materialization
    myopia!
  • ETL/Warehousing was inelegant and limited
  • What do we do on a cache miss??
  • Should be no distinction between materialized
    views and queries!
  • Federated Query Processing
  • Query across multiple sources
  • Choose among multiple replicas, materialized
    views
  • Consider staleness
  • This is the natural extension of the modern
    database vision
  • Cohera uses Mariposa's economic model to do this
  • Decouples optimization, cost estimation, storage
    and processing
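
A toy sketch of choosing among replicas and materialized views while considering staleness. It is loosely inspired by the idea of sites bidding for work, but it is not the Mariposa or Cohera protocol; the `Bid` fields, site names, and prices are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Bid:
    site: str
    price: float        # what the site asks to run the subquery (hypothetical units)
    staleness_s: float  # age of the site's copy in seconds; 0.0 means the live source


def choose_site(bids: List[Bid], max_staleness_s: float) -> Optional[Bid]:
    """Pick the cheapest bid whose data is fresh enough; break ties by freshness."""
    acceptable = [b for b in bids if b.staleness_s <= max_staleness_s]
    if not acceptable:
        return None
    return min(acceptable, key=lambda b: (b.price, b.staleness_s))


bids = [
    Bid("supplier-live", price=9.0, staleness_s=0.0),
    Bid("regional-replica", price=2.0, staleness_s=300.0),
    Bid("nightly-view", price=0.5, staleness_s=86_400.0),
]

catalog_choice = choose_site(bids, max_staleness_s=604_800.0)  # a week old is fine
availability_choice = choose_site(bids, max_staleness_s=60.0)  # must be near-live
```

The same query interface covers both cases: a tolerant query lands on the cheap materialized view, while a strict one falls through to the live source, which is one reading of "no distinction between materialized views and queries".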

20
Standard Queries Required
  • Hand-coded queries are brittle; you want ad hoc
  • Don't buy a handful of beans
  • Need support for standard query languages
  • SQL and XPath today
  • SQL/XQuery tomorrow
  • Everybody knows this!
  • Part of industrial religion
  • Oracle on one side
  • Dotcoms on the other side
  • You might get by claiming to be XML compliant
  • But most people have cottoned on by now

21
IR capabilities need to be in the engine
  • The best-integrated data will still be noisy
    (product names, etc)
  • Text search on taxonomies, names, descriptions
  • Still no good integration of DBMS and IR engines
  • Storage (compression huge in IR)
  • Index concurrency (many updates per doc in IR)
  • Query optimization challenges
  • Note this is not semi-structured querying!
  • Integration of logic & statistics is the real
    model/query challenge
  • Plus HCI issues
  • Unify query, browse, mine, rank
  • Cohera integrates AltaVista into the engine &
    optimizer

22
Core Systems Issues Remain Important
  • Availability, Scalability, Load Balancing
  • All critically important in the B2B space
  • Availability: you don't even control the
    components! Outage = news.
  • Scalability: MRO wants to grow up to very big
    installations
  • Load Balancing: need to respect SLAs, etc.
  • Need adaptive, load-balancing, federated QP
  • 100s to 1000s of sites
  • Replication is key to availability, but optimizer
    must understand it
  • Cohera's economic model adapts for each query
  • Other models being studied (see DE Bulletin
    6/2000)
  • Compile-time, centralized optimizers (R, et al)
    will break

23
Query Processing Themes
  • Standards
  • Logic & Statistics
  • Adaptivity to changing performance, load,
    failures
  • Optimizer Scalability

24
So What Really Matters Today?
  • Cohera sells because
  • Customers need the content integration workbench
    today
  • They are in integration pain!
  • Comes in multiple guises (e-catalog, supplier
    enablement, etc.)
  • Smart tools start cutting the pain immediately
  • Customers want an open, standard solution
  • Plain old SQL and relational schemas (vs.
    Requisite, e.g.)
  • XML in the bottom, out the top for
    messaging/integration
  • Customers want federated querying... tomorrow
  • For today, they'll settle for a centralized
    solution
  • Want the flexibility to grow in that direction
  • Federated query engine works fine centralized
  • The converse clearly not true

25
Road Map
  • Setting
  • Scenarios & Terminology
  • Characteristics and Challenges of Content
    Integration
  • Research Evangelism

26
Research Evangelism
  • Semi-Automatic Tools
  • Statistical & logical techniques, with a user in
    the loop
  • E.g. Potter's Wheel [Raman/Hellerstein, VLDB
    '01], http://control.cs.berkeley.edu
  • schema integration algebra
  • interactive visualization
  • programming-by-example
  • statistical inferencing for discrepancies and
    domain detection
  • A new class of systems work!
  • Tools/Apps must be part of our agenda
  • Many systems challenges here, especially on the
    stat/HCI side
  • Architectural elegance, API design,
    extensibility, scalability, etc.

27
Research Evangelism, Cont.
  • Adaptive Query Processing
  • Critical to the federated B2B space
  • Unpredictable world, you don't control the
    components
  • Also critical to the ubiquitous computing space
  • Sensors are the next challenge
  • Who's the DBA of your house paint? The freeway
    lines?
  • Economic optimization (Mariposa) is one model
  • Finer-Grained adaptivity possible (Eddies, SIGMOD
    2K)
  • See http://telegraph.cs.berkeley.edu for
    examples, ideas, SW

28
Research Evangelism, Cont.
  • Tired of research on relational? Choose wisely!
  • One big direction here is to integrate IR
  • Another is to abandon languages in favor of
    interfaces
  • query + browse + mine: semi-automatic GUIs again!
  • XML is critical to business, but under control
  • We're doing fine in this space, thank you
  • XQuery will push (merge with?) SQL
  • The end result will resemble things you've seen
    before
  • But text search is eating our lunch!
  • Intellectual impact in the last decade?
  • Industrial impact in the last decade?
  • Text search is mostly just an access method plus
    a sort metric
  • Integrate into our composable algebras and
    architectures!
  • Teach it in our undergrad classes

29
Summary
  • Content Integration is a new, challenging
    industrial space
  • Cohera provides the first complete solution
  • Access with varying relationships, formats
  • Support for multiple schemas and taxonomies
  • Support for custom syndication
  • Support for distributed data, both cacheable and
    uncacheable
  • Ad hoc querying
  • Fuzzy structured search
  • Availability, Scalability, Load Balancing
  • Smart graphical tools for content managers
  • A fertile area for research as well
  • Join the fun!

30
Custom Syndication
  • Multiple audiences for content syndication
  • Personalization, e.g. buyer-dependent pricing
  • Passing the buck?
  • Receiver-makes-right (marketplace integration)
  • Sender-makes-right (supplier enablement)
  • Both cause pain to the elephants!
  • Another instance of the mapping problem
  • A new space, so XML is a given
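
Custom syndication can be pictured as producing a different view of the same catalog for each audience. The sketch below assumes a hypothetical `Audience` record carrying a negotiated discount and a list of visible fields; real syndication would also involve format and schema mapping for each receiver.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Audience:
    name: str
    discount: float                  # fraction off list price, e.g. 0.15
    visible_fields: Tuple[str, ...]  # fields this audience is allowed to see


# Hypothetical catalog row; "cost" is internal and should never be syndicated.
CATALOG: List[Dict[str, object]] = [
    {"sku": "A1", "name": "Black Ink", "list_price": 10.00, "cost": 6.00},
]


def syndicate(catalog: List[Dict[str, object]], audience: Audience) -> List[Dict[str, object]]:
    """Build the feed one audience receives: buyer-specific price, filtered fields."""
    feed = []
    for item in catalog:
        view = dict(item)
        view["price"] = round(float(item["list_price"]) * (1.0 - audience.discount), 2)
        feed.append({k: view[k] for k in audience.visible_fields})
    return feed


public = Audience("public", 0.0, ("sku", "name", "price"))
contract = Audience("contract-buyer", 0.15, ("sku", "name", "price"))

public_feed = syndicate(CATALOG, public)
contract_feed = syndicate(CATALOG, contract)
```

Whether this transformation runs at the sender or at the receiver is exactly the sender-makes-right versus receiver-makes-right choice on the slide.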

31
Background on Cohera
  • Engine --> Focused Application --> Broad
    Solution
  • Began as Mariposa, Inc.
  • Federated SQL engine shipped 1/99
  • Hard to fight warehouses on their turf
  • Entered the E-Catalog Space, 12/99
  • Federated SQL + screen scraper + catalog apps
  • Strong presence in this space
  • Broaden to Content Integration, 2001

32
  • Cohera Content Integration System

Cohera's Content Integration System
Cohera Workbench
Cohera Integrate
Cohera Adapters
33
Cohera Workbench
  • Cohera Workbench
  • Browser-based application used by content
    managers to integrate content
  • Scalable architecture enables distributed
    deployment
  • Enables suppliers to manage and control their
    content remotely
  • Standards based implementation - J2EE, EJB, XML
    infrastructure
  • Provides a complete workflow for managing
    business content

(Workflow diagram labels: Manage Taxonomy, Auto-categorize, Define, Cleanse, Import, Review, Publish)
34
  • Cohera Integrate

  • IntegrationServer: Integrates and aggregates
    static and dynamic business content from
    multiple disparate sources
  • SiteManager: Retrieves and transforms content
    from local or remote data sources
  • PowerSearch: Expands search capabilities by
    providing keyword and synonym based searching
    via SQL
Mariposa

Embedded AltaVista
35
  • Cohera Adapters

  • StandardAdapters: Provide connectivity to all
    major databases and common file structures
  • Connect: Accesses and retrieves content from
    XML documents and HTML pages residing locally
    or remotely
  • ExtendedAdapters: Accesses content residing in
    traditional mainframe or application data
    sources
DBC
Screen Scraping XSLT by example, HTTP(s), etc.
App Gateways