OTMI Open Text Mining Interface

Provided by: howardratn

Transcript and Presenter's Notes

Title: OTMI Open Text Mining Interface


1
OTMI Open Text Mining Interface
  • opentextmining.org

2
OTMI Opportunity
  • Whenever a publisher opens its content to users
    for search and discovery there is always the
    threat of stolen content and lost business

3
OTMI Text Crawling in a Googled World
  • Spider Collection
  • Spider Results
  • User jumps from site to site following imprecise
    search results

4
OTMI What is OTMI?
  • A common XML format in which publishers can issue
    their content
  • A data format allowing search partners and text
    mining companies to:
      • Dig deep into a publisher's content in a way that
        protects the publisher's full text
      • Enable richer user discovery than HTML-harvesting
        alone can provide

5
OTMI XML File Description
  • Machine-readable text for indexing and text
    mining
  • Nonlinear text presentation
  • XML file contents:
      • Bibliographic metadata
      • List of word vectors
      • List of text snippets
      • Figure captions
      • References
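
A minimal sketch of how the word vectors and nonlinear snippets listed above might be derived from an article's text. The function names, the tiny stopword set, and the choice of one sentence per snippet are assumptions made for illustration; they are not taken from the OTMI specification.

```python
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def word_vector(text):
    """Count each lowercased token in the text, skipping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def snippets(text, seed=0):
    """Split the text into sentences and return them in shuffled order,
    so the snippets cannot be read back as the linear full text."""
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    random.Random(seed).shuffle(parts)
    return parts


sample = "The cell divides. The cell grows in the dish. Growth is measured."
print(word_vector(sample).most_common(3))
print(snippets(sample))
```

The word counts support indexing and text mining, while the shuffled snippets let a search engine show matching context without exposing the article in reading order.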

6
OTMI Standards-Based
  • Atom Entry document selected as the XML format
  • Prototype uses NLM (National Library of Medicine)
    standards
  • XML entity/char ref lookup table
  • Stopword list
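
A rough sketch of how such data could be wrapped in an Atom Entry document using Python's standard library. The Atom namespace is the real one; the `otmi` namespace URI, the element names (`vectors`, `vector`, `snippets`, `snippet`), and the entry id are placeholders invented for this example and should not be read as the actual OTMI schema. The entry is also incomplete as Atom (for example, it has no `updated` element).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Placeholder namespace for illustration; NOT the real OTMI namespace.
OTMI_NS = "http://example.org/otmi-sketch"

ET.register_namespace("", ATOM_NS)
ET.register_namespace("otmi", OTMI_NS)


def build_entry(title, entry_id, vectors, snippet_list):
    """Build an Atom entry carrying word counts and text snippets."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    vec_el = ET.SubElement(entry, f"{{{OTMI_NS}}}vectors")
    for word, count in sorted(vectors.items()):
        # One element per word; the count is stored as an attribute.
        ET.SubElement(vec_el, f"{{{OTMI_NS}}}vector", count=str(count)).text = word
    snip_el = ET.SubElement(entry, f"{{{OTMI_NS}}}snippets")
    for text in snippet_list:
        ET.SubElement(snip_el, f"{{{OTMI_NS}}}snippet").text = text
    return entry


entry = build_entry(
    "Sample article",
    "urn:example:article-1",  # hypothetical identifier
    {"cell": 2, "dish": 1},
    ["The cell grows in the dish.", "The cell divides."],
)
print(ET.tostring(entry, encoding="unicode"))
```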

7
OTMI Generator (High Level)
  • Input: Publisher HTML or XML Full Text Document
  • OTMI Generator Process: Conversion Tailored to
    Publisher-Specific Source File Format
  • Output: Publisher OTMI Document

8
OTMI Text Crawling in an OTMIed World
  • Spider Collection
  • Spider Results
  • User goes to sites armed with highly precise
    search results

9
OTMI Sample Article Text

10
OTMI Sample OTMI Text
  • OTMI Header
  • OTMI Snippets

11
OTMI Autodiscovery
  • Embedded in the HTML of the abstract and
    full-text file for each article is a tag like
    this:
    <link rel="otmi" type="application/xml"
    href="../otmi/nature04614.otmi" />
  • which redirects a spider/crawler to the
    associated OTMI file
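
A minimal sketch of how a crawler might follow this autodiscovery tag, using only the Python standard library. The class and function names and the example page URL are assumptions for illustration; only the `link` tag itself comes from the slide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OtmiLinkFinder(HTMLParser):
    """Collect the href values of <link rel="otmi"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<link ... />) are also routed here by default.
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "otmi" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def discover_otmi(html, page_url):
    """Return absolute URLs of the OTMI files advertised by a page."""
    finder = OtmiLinkFinder()
    finder.feed(html)
    return [urljoin(page_url, href) for href in finder.hrefs]


page = (
    '<html><head><link rel="otmi" type="application/xml" '
    'href="../otmi/nature04614.otmi" /></head><body></body></html>'
)
# The page URL below is a made-up example, not a real article address.
print(discover_otmi(page, "http://example.org/journal/abs/nature04614.html"))
```

Resolving the relative `href` against the page URL matters because the tag points to the OTMI file relative to the abstract or full-text page that embeds it.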

12
OTMI Possible Improvements
  • References to associated data files and/or
    database entries
  • References from corresponding RSS feed items (and
    from the login page where content is
    access-controlled)
  • Allow text in normal human-readable form (for
    open-access journals)

13
OTMI Exposure
  • NPG presented a brief summary of the idea at Bio
    IT World Conference (Boston, 3-5 Apr. 06)
  • Very positive reactions
  • NPG shared the idea with NPG's Library Committee
    (NY, 24 Apr. 06)
  • "When can we have this? This will solve so many
    problems!"
  • Nascent (NPG Web Publishing Blog) on 24 Apr. 06
    and 13 Jun. 06
  • Nature, 27 Apr. 06, Editorial on Machine
    Readability
  • CrossRef Member Meeting, 1 Nov. 06
  • Discussions continue on many web sites and blogs
  • arstechnica.com
  • radar.oreilly.com
  • Open Access News

14
OTMI Wiki
  • Dedicated wiki for OTMI at
  • opentextmining.org
  • Organizes
  • Information about OTMI
  • General Resources for OTMI
  • Other Web Resources
  • Facilitates
  • Community participation

15
OTMI Industry Standard
  • OTMI is potentially an industry-wide revolution
    leading to a new channel of business
    opportunities
  • A common standard can be powerful when it comes
    to aggregating content from multiple sources
    (cf. DOI and RSS)
  • We need to work with industry organizations
    toward developing a standard
  • CrossRef (2300 Publishers and Societies)
  • Search Partners (Microsoft, Google)
  • Text Mining Companies (Pharmas and Academic
    Librarians)

16
  • To learn more about OTMI see
  • opentextmining.org
  • Or to participate contact
  • Greg Suprock (g.suprock_at_natureny.com)
  • Howard Ratner (h.ratner_at_natureny.com)
  • Timo Hannay (t.hannay_at_nature.com)
  • Tony Hammond (t.hammond_at_nature.com)