XTF in Depth

1
XTF in Depth
  • Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library
January 2009 presentation at University of Sydney
2
XTF in Depth
  • Part 1
  • What is XTF and how does it compare?
  • Who is using it?
  • What needs does it address?
  • New features in 2.1
  • Design and data flow
  • Adapting Lucene and Saxon
  • Planned improvements
  • Part 2
  • Interactive demos

3
XTF in 5 minutes
  • eXtensible Text Framework
  • Search and display technology from CDL
  • Open-source Java framework
  • Powerful and highly configurable
  • All about rapid prototyping, fast deployment,
    and incremental improvement
  • XML Full text search
  • Also indexes PDF, HTML, Word
  • Excel and Powerpoint coming soon

4
XTF in 5 minutes
  • Search: Query power/speed of Lucene, plus
  • search results shown in context
  • keyword search, facets, spelling, lots more
  • View: Processing power of Saxon, plus
  • large file optimizations, hit markup
  • Configure and customize exclusively in XSLT
  • Flexible, overlapping collections
  • Mature, tightly integrated, well documented
  • In use at CDL and many other places

5
What XTF is not
  • It is not a content management system
  • Creation (conversion, scanning, manual)
  • Ingest / administration
  • Editing
  • Preservation
  • Not built for remote administration
  • Not a true XML database
  • but close
  • Not Google
  • Google: one interface to a vast grab-bag of data
  • XTF: crafted interfaces to high-quality data sets

6
How does XTF compare?
[Chart: Greenstone, Solr, XTF 2.0, and XTF 2.1 positioned along two axes, "Turn-key / easy" and "Customizable / powerful"]
Caveat: based on my limited experience with Greenstone and Solr
7
Online Archive of California
8
eScholarship Editions
9
calisphere
10
Mark Twain Project Online
11
UC Berkeley
12
University of Sydney
13
Encyclopedia of Chicago
14
Indiana University Newton
15
Indiana University Swinburne
16
Sweden
17
Brazil
18
Italy
19
Needs
  • Let's look at four needs that XTF was created to
    address
  • Diverse data
  • Open software
  • Rapid deployment
  • Community involvement

20
Needs 1. Diverse data
  • Our collections: many and diverse
  • eScholarship (TEI, PDF)
  • UC Press monographs (a text may be > 10 megs)
  • 25,000 scholarly articles in PDF
  • Mark Twain
  • Hand-crafted critical edition (TEI & MODS)
  • OAC finding aids, images, books, manuscripts
  • Japanese American Relocation Digital Archives
  • TEI, EAD, MODS
  • Book scanning projects (Google, Internet Archive)
  • Thousands of scanned books (PDF & DC)
  • Millions of Melvyl catalog records (MARC)

21
Needs 2. Open software
  • Digital Publishing Products
  • Black box (no control over fixes & features)
  • Often not standards-based
  • Tech companies have short lifespans
  • Support often spotty
  • Data can be held hostage, or even lost

22
Needs 3. Rapid deployment
  • New collections arriving
  • Users don't want to wait a year for access
  • Many "what if" and "wouldn't it be cool" requests
    from our staff
  • Java programmers are expensive
  • Look & feel goes stale quickly
  • Barrage of feature requests

23
Needs 4. Community involvement
  • We want to share the load
  • For XTF 2.1, we asked the XTF community to vote
    for features they wanted
  • At CDL we try to align our development to needs
    of the community
  • Result: Everybody benefits

24
New and improved in 2.1
  • Faceted browse
  • Search flexibility
  • Bookbag
  • Spelling correction
  • Similar items
  • OAI-PMH

25
Faceted browse
  • Previously, implementing faceted browse required
    lots of XSLT programming.
  • Hierarchical facets were even harder
  • Required us to deeply refactor the stylesheets,
    but now it's simple to add new facets.

26
Faceted browse
27
Faceted browse
28
Hierarchical facets
29
Hierarchical facets
30
Search flexibility
  • Keyword search: single box (now default).
    Internally, searches multiple fields.
  • Advanced search: explicitly fill in constraints
    for various fields
  • Freeform search (new): text-based field
    specifiers, AND, OR, parentheses, etc.

31
Keyword search
32
Advanced search
33
Freeform search
34
OAI-PMH
  • This fit nicely into XTF's architecture
  • Simple but conforming implementation

35
Bookbag
  • Refactored the AJAX to use YUI (Yahoo User
    Interface widgets)
  • Still session based
  • Now supports emailing the bookbag

36
Bookbag
37
Bookbag
38
Bookbag
39
Spelling correction
  • Unicode bug fixes
  • On by default and fully integrated

40
Spelling correction
41
Spelling correction
42
Similar items
  • Allows user to see "more like this"
  • Improved AJAX integration
  • On by default - no configuration needed

43
Similar items
44
Similar items
45
Other changes in XTF 2.1
  • Built-in NLM Blue, TEI P5, MS Word support
    (still support TEI P4, EAD, PDF, HTML, text)
  • Valid XHTML output
  • RawQuery servlet to provide a query back-end to a
    (e.g. Ruby) front-end or mash-up.
  • Bug fixes and minor changes (many
    reported/requested by users)

46
Wiki documentation
47
Wiki documentation
48
Design philosophy
  • Adaptation through programming
  • XTF is still about building what you want using a
    set of powerful tools
  • But now:
  • Stylesheets are more modular
  • Build interfaces faster using honed widgets
  • Prettier UI to start with

49
XTF is open, standards based
  • Based on free, open-source tools
  • Java SDK 1.5
  • Lucene 2.1 full-text search toolkit
  • Saxon 8.9 XSLT processor
  • UNICODE support throughout
  • XTF itself is open-source (BSD license)
  • No native code: pure Java and XSLT 2.0
  • Runs on Windows, Solaris, Linux, MacOS
  • Drops right in to Tomcat or Resin
  • Lots of user-fixable documentation

50
Modular
  • Use crossQuery servlet to search, dynaXML to
    display and navigate. Deploy one or both.
  • Stylesheets govern flow of data; no Java
    programming required
  • Easy to add features incrementally
  • 100% configurable look and feel
  • Skin & slice: one system can have several
    interfaces and multiple brands
  • Collection subsetting driven by meta-data

51
Why XSLT?
  • XSLT is a natural fit for XML
  • Powerful, dynamic language
  • Incredibly high-quality, free processor (Saxon)
  • Why not Java/Struts?
  • Poor for rapid prototyping, steep learning curve
  • Why not Ruby?
  • Not necessarily a good match for XML data
  • Can be too clever by half
  • But a smart mash-up might be cool...

52
Indexing Process
53
Indexing
  • Input filters adapt to many doc types
  • Any XML doc type
  • PDF, MS Word, plain text, untidy HTML
  • XTF is agnostic regarding:
  • Document identifiers
  • Filesystem organization
  • Uses document selector stylesheet to identify and
    classify documents in filesystem
  • Meta-data storage
  • Incremental indexing
  • Simply update filesystem then run indexer.

54
crossQuery servlet
55
Flexible Search/Display
  • One query, many collections
  • XTF enables Virtual collections
  • Output filters for various result views
  • e.g. simple vs. advanced search form, results in
    brief vs. long format, etc.
  • Query parsers for different search interfaces
  • Interface to other query protocols
  • SRU and OAI-PMH already implemented
  • Should be easy to adapt other queries
  • Very extensive set of query operators
  • Flexible query composition
  • Faceted browse

56
Query Power
  • Many operators
  • AND, OR, NEAR, NOT, phrase, range, wildcard
  • Or-Near, multi-field AND, more like this
  • Arbitrarily complex queries
  • Combine full-text search with meta-data
  • Unusual queries like "dynamic duo" near "red
    phone"
  • Structure-aware searching
  • e.g. search only headings, or only bibliographies
  • But must pre-define which structures to search

57
More Power
  • Fixed-length snippets
  • Highlight the hit and just the hit
  • Sort by relevance, or any meta-data fields
  • Spelling correction
  • No penalty for huge documents
  • XTF lazily pulls in only those parts used by a
    particular request (e.g. show just Chapter 1)
  • Scalable
  • Proven with 10 million records / 14 gigs data
  • but beyond that, Solr looks better
  • Authentication: IP lists, LDAP, or external

58
dynaXML servlet
59
Adapting Lucene and Saxon
  • Adapting Lucene
  • Chunking, flattening, hit marking, stop-words,
    setting limits, insensitivity, special queries,
    faceted browsing, spelling correction
  • Adapting Saxon
  • Lazy trees, misc. extensions

60
Adapting Lucene: Chunking
  • Why
  • Lucene's proximity searches perform best on small
    documents
  • Small chunks enable efficient generation of
    80-character snippet surrounding each hit
  • How
  • XTF breaks text blocks into 200-word chunks
  • Chunks overlap to detect a hit starting in one
    and ending in the next.
  • Each chunk carries structural info, plus pointer
    to location in XML doc.
  • Only first chunk carries meta-data for doc
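
A minimal sketch of the overlapping-chunk idea, written in Python rather than XTF's Java. The 200-word chunk size comes from the slide; the 20-word overlap, the function name, and the (offset, words) return shape are assumptions for illustration, not XTF's actual settings or API.

```python
def chunk_words(words, chunk_size=200, overlap=20):
    """Split a word list into overlapping chunks.

    Returns (start_offset, chunk) pairs. The start offset stands in for
    the pointer back to the chunk's location in the source document.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

Because each chunk repeats the tail of the previous one, a phrase that straddles a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.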

61
Adapting Lucene: Flattening XML
  • XSLT prefilter flattens XML structure
  • Series of text blocks
  • Block tagged with structural info for search
  • Prefilter can boost or suppress sections
  • Fine control over proximity matching
  • Prefilter gathers/marks meta-data
  • Can come from within the document, from an XML
    doc in filesystem, or fetched from a URL.
  • Synthesize meta-data (e.g. sort fields, facets)
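
XTF does this flattening in an XSLT prefilter; the sketch below only illustrates the idea in Python. The element names (`head`, `p`), the slash-separated path used as the structural tag, and the function name are hypothetical choices, not XTF conventions.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, block_tags=("head", "p")):
    """Flatten an XML document into a series of text blocks.

    Each block keeps its text plus the element path it came from, so a
    search could later be restricted to, say, headings only.
    """
    root = ET.fromstring(xml_text)
    blocks = []

    def walk(elem, path):
        path = path + [elem.tag]
        if elem.tag in block_tags:
            # Collapse inline markup and whitespace into plain text.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                blocks.append({"path": "/".join(path), "text": text})
            return
        for child in elem:
            walk(child, path)

    walk(root, [])
    return blocks
```

Usage: `flatten("<doc><div><head>Intro</head><p>Hello <i>big</i> world.</p></div></doc>")` yields one block for the heading and one for the paragraph, each tagged with its path.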

62
Adapting Lucene: Hit Marking
  • Marking search hits in context
  • Lucene doesn't pinpoint location of hits, only
    gives a score per-document
  • Custom enhancements to Lucene's span logic
    score and locate each hit.
  • dynaXML dynamically adds ranked hits to original
    XML doc, then sends to XSLT formatter.
  • crossQuery forms a snippet around and highlights
    each hit.
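
A rough illustration of forming a fixed-width snippet around a hit and marking only the hit itself. The 80-character default echoes the chunking slide; the function name, the bracket markers, and the character-offset interface are assumptions, and XTF's span-based scoring is not shown here.

```python
def make_snippet(text, hit_start, hit_end, width=80, mark=("[", "]")):
    """Return roughly `width` characters of context with the hit marked.

    hit_start and hit_end are character offsets of the hit within text.
    """
    hit_len = hit_end - hit_start
    context = max(width - hit_len, 0)
    # Put about half of the spare context before the hit.
    start = max(hit_start - context // 2, 0)
    end = min(max(start + width, hit_end), len(text))
    return (
        text[start:hit_start]
        + mark[0] + text[hit_start:hit_end] + mark[1]
        + text[hit_end:end]
    )
```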

63
Adapting Lucene: Stop-words
  • Robust, efficient stop-word handling
  • the, a, an, it, on...
  • People do use them, and expect corresponding
    results.
  • Lucene normally ignores stop-words, for speed.
  • XTF quietly joins stop-words to adjacent words,
    forming n-grams
  • Example: "man on the moon" -> man-on on-the
    the-moon
  • Queries are internally rewritten to search for
    n-grams automatically.
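
The joining step can be sketched as below. This reproduces only the slide's example; the stop-word set, the function name, and the choice to return just the joined pairs (leaving aside how ordinary words are indexed) are assumptions.

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}  # illustrative subset only


def stop_word_bigrams(text, stop_words=STOP_WORDS):
    """Join each stop-word to its neighbours, forming hyphenated pairs."""
    words = text.lower().split()
    return [
        f"{left}-{right}"
        for left, right in zip(words, words[1:])
        if left in stop_words or right in stop_words
    ]
```

Running the same function over the indexed text and over the user's query is what lets a phrase containing stop-words match: both sides end up as the same sequence of joined pairs.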

64
Adapting Lucene: Setting Limits
  • Limits on aberrant queries
  • Adjustable limits on number of terms matched by
    range or wildcard queries
  • N-grams naturally make most queries efficient
  • Configurable limits on amount of work performed
    by a single query.
  • Numeric range query
  • Avoids term expansion
  • Efficiently filters very granular data, e.g.
    timestamps 2006-11-14:12:46:03.77
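
One way to picture an adjustable limit on wildcard expansion is shown below. The exception name, the `max_terms` parameter, and the use of Python's `fnmatch` for pattern matching are all assumptions made for the sketch; XTF's own configuration and matching code are not described on the slide.

```python
import fnmatch


class TooManyTermsError(Exception):
    """Raised when a wildcard would expand to more terms than allowed."""


def expand_wildcard(pattern, index_terms, max_terms=50):
    """Expand a wildcard against the index vocabulary, up to a limit."""
    matches = []
    for term in index_terms:
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
            if len(matches) > max_terms:
                # Stop early instead of building an enormous query.
                raise TooManyTermsError(
                    f"{pattern!r} matches more than {max_terms} terms"
                )
    return matches
```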

65
Adapting Lucene: Insensitivity
  • Accent/diacritic marks
  • Many users can't or don't know how to type them
  • XTF indexer uses configurable map to remove
    accents
  • crossQuery maps query terms
  • Plural
  • Convenient for "cat" to match "cats" also
  • Configurable map of plural to singular used at
    index and query time
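
A small sketch of applying the same normalization at index and query time. Note the substitution: XTF uses a configurable accent map, whereas this example strips accents via Unicode decomposition from Python's standard library. The plural map contents and function names are hypothetical.

```python
import unicodedata

# Hypothetical plural-to-singular map; a real deployment would configure its own.
PLURAL_TO_SINGULAR = {"cats": "cat", "mice": "mouse"}


def strip_accents(term):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(term, plural_map=PLURAL_TO_SINGULAR):
    """Fold case and accents, then map plural forms to singular."""
    folded = strip_accents(term.lower())
    return plural_map.get(folded, folded)
```

As long as the indexer and the query path both call `normalize_term`, a query typed without accents or in the plural lands on the same index term.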

66
Adapting Lucene: Special Queries
  • OR-NEAR
  • Standard OR query doesn't use proximity
  • OR-NEAR: if words are nearby, score is boosted
  • Multi-field AND
  • All terms must be present, in any field.
  • Essential for certain keyword searches, e.g.
    "against all enemies clarke" (matches against title
    and author)
  • More like this
  • Auto-calculates interesting terms in meta-data
  • Creates OR-NEAR query to find similar docs
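
The multi-field AND behaviour can be illustrated with a toy matcher. The record layout, field names, and function names are assumptions for the example, and whitespace tokenization stands in for real analysis; scoring and OR-NEAR boosting are left out.

```python
def multi_field_and(query, record):
    """True if every query term appears in at least one field of the record."""
    terms = query.lower().split()
    words = set()
    for value in record.values():
        words.update(value.lower().split())
    return all(term in words for term in terms)


def single_field_and(query, record):
    """True only if some single field contains every query term."""
    terms = query.lower().split()
    return any(
        all(term in value.lower().split() for term in terms)
        for value in record.values()
    )
```

For a record with title "Against All Enemies" and author "Richard A. Clarke", the query from the slide matches under `multi_field_and` but not under `single_field_and`, which is why the per-field form is not enough for a single keyword box.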

67
Adapting Lucene: Faceted Browsing
  • Draws facet term list from Lucene index
  • Each facet cached in-memory
  • Counts per group created dynamically
  • Special mini-language to sort/select (esp. useful
    for hierarchical facets)
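
Counting groups dynamically over a result set might look like the sketch below. The `::` level separator, the dictionary-based hit records, and the function name are assumptions for illustration; the caching and the sort/select mini-language are not modelled.

```python
from collections import Counter


def facet_counts(hits, field, separator="::"):
    """Count documents per facet group, including each ancestor level."""
    counts = Counter()
    for doc in hits:
        groups = set()  # a document counts once per group
        for value in doc.get(field, []):
            parts = value.split(separator)
            for depth in range(1, len(parts) + 1):
                groups.add(separator.join(parts[:depth]))
        counts.update(groups)
    return counts
```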

68
Adapting Lucene: Spelling Correction
  • Any standard dictionary won't match place and
    proper names
  • Idea: use the index as source of suggestions
  • XTF searches words within edit distance 2
  • Candidates ranked by weighted score
  • Edit distance (transpositions discounted)
  • Frequency of use in the index
  • Double-metaphone match
  • Multi-word correction uses pair frequencies
  • On test data, 80% right suggestion
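
A simplified sketch of index-driven suggestions. It covers two of the ranking signals named above: edit distance with a discounted transposition, and frequency in the index. The 0.5 transposition cost, the scoring formula, and the weights are invented for illustration; double-metaphone matching and pair frequencies are omitted.

```python
import math


def edit_distance(a, b, transposition_cost=0.5):
    """Optimal-string-alignment distance with cheaper adjacent swaps."""
    prev2 = None
    prev = [float(j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [float(i)] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + transposition_cost)
        prev2, prev = prev, cur
    return prev[len(b)]


def suggest(word, term_freqs, max_distance=2.0, freq_weight=0.1):
    """Rank index terms within max_distance of word, best first."""
    scored = []
    for term, freq in term_freqs.items():
        dist = edit_distance(word, term)
        if term != word and dist <= max_distance:
            # Closer and more frequent terms score higher (made-up weighting).
            score = freq_weight * math.log1p(freq) - dist
            scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored]
```

Since candidates come from the index itself, place names and proper names that appear in the collection can be suggested even though no standard dictionary contains them.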

69
Adapting Saxon: Lazy Trees
  • The need: display small parts of large (> 10MB)
    XML documents
  • Solution: create a binary, random-access version
    of each document
  • XSL keys calc'd once and stored
  • Only elements accessed by a given request are
    loaded from disk
  • Care must be taken in stylesheets
  • Profile mode is useful for optimization
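
The lazy-tree file format is not described here, so the following is only an analogy for the random-access idea: build an on-disk store once, then read just the section a request needs. The class name, the `.idx` sidecar file, and the section-level granularity are all hypothetical.

```python
import json


class LazyStore:
    """Random-access store: an offset table plus concatenated section bytes."""

    def __init__(self, path):
        self.path = path
        with open(path + ".idx", encoding="utf-8") as f:
            self.offsets = json.load(f)
        self.loaded = []  # names of sections actually read from disk

    @staticmethod
    def build(path, sections):
        """Write all sections once, recording where each one starts."""
        offsets = {}
        with open(path, "wb") as f:
            for name, text in sections.items():
                data = text.encode("utf-8")
                offsets[name] = [f.tell(), len(data)]
                f.write(data)
        with open(path + ".idx", "w", encoding="utf-8") as f:
            json.dump(offsets, f)

    def get(self, name):
        """Seek to one section and read only its bytes."""
        start, length = self.offsets[name]
        with open(self.path, "rb") as f:
            f.seek(start)
            data = f.read(length)
        self.loaded.append(name)
        return data.decode("utf-8")
```

The cost of a request then depends on the size of the part displayed, not on the size of the whole document.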

70
Adapting Saxon: Extensions
  • More complete SQL database connection
  • Ability to call external tools
  • Automatic XML conversion in/out
  • Timeout enforcement
  • File utilities
  • Check file existence
  • Get file length and timestamp
  • Session data
  • Key/value pairs
  • Value can be XML or plain string

71
The future
  • XTF 2.2
  • Better out-of-box for large EADs
  • Fixes for incremental indexing & other bug fixes
  • Specify any number of sub-dirs to index
  • Possible TEI P5 refactoring
  • Background auto-warming of new index
  • Support for indexing Powerpoint and Excel files
  • Further out
  • A page-turner for scanned texts and converted
    PDFs
  • Pop-up image/PDF page snippets
  • And of course, features suggested by users

72
Demos
  • I'll demonstrate the features we talked about on
    several different XTF sites out in the wild.

73
Fin
  • Project: xtf.sourceforge.net
  • Docs: xtf.wiki.sourceforge.net
  • Discuss: groups.google.com/group/xtf-user
  • This talk: xtf.sourceforge.net/talks/2009-01-23.ppt
  • Me: martin.haye_at_ucop.edu