Title: Alan Tonge


1
SPECTRa-T Project
  • Alan Tonge

Semantic Web Data Repositories from Chemistry e-Thesis Data Mining
Open Repositories 2008, Southampton University, 2 April 2008
2
Project Overview
Submission, Preservation and Exposure of
Chemistry Teaching and Research Data in Theses
  • 12-month project between University of
    Cambridge and Imperial College London to
    develop text- and data-mining tools to extract
    chemical data from e-theses
  • Part of the JISC Digital Repositories programme
3
Background
Chemistry is an experimental science.
Synthetic Organic Chemistry is the basis of the
Pharmaceutical and Agrochemical industries.
Where does the information to make this molecule
come from?
Systematic Name: Ethyl 4,5-epoxy-hex-2-enoate
Molecular Formula: C8H12O3
4
Search chemical patent and journal abstracting
services, e.g.
  • Chemical Abstracts (9000 journals, 12,000
    structures/day)
  • Beilstein (180 core journals)
  • Patents (CAS, Derwent, MDL) (400,000/annum)

Academic chemistry publications are largely derived
from PhD theses
  • Perhaps 10K published per year worldwide
  • A synthetic thesis contains 50-60 preparations;
    only 20% are published in detail
5
  • List of Starting Materials & Reagents
  • Recipe: Reactions Conditions & Work-up
  • Product Characterization: spectroscopic &
    physical properties

6
Sample preparation from synthetic chemistry thesis
7
The Problem
  • 80% of (academic) synthetic preparations remain
    locked in theses
  • Manual abstraction (cf journals/patents) not an
    option

The Solution
  • OSCAR3: automatic high-throughput chemical name
    and chemical term recognition
  • OSCAR (Open Source Chemistry Analysis Routines)
    is an extensible Open Source framework which can
    identify much of the chemical terminology in
    electronic articles
  • Semantic Web: deposit extracted terms in a
    searchable RDF triplestore

8
OSCAR Name recognition
1. Dictionary of chemical names/terms (ChEBI Ontology)
2. Rules: chemical suffix filters
3. Regular expressions to recognise data, formulae
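
The three recognition layers listed on this slide can be illustrated with a small sketch. This is not OSCAR3's code or API: the dictionary entries, the suffix list, the regular expression and the function names `classify` and `recognise` are all assumptions chosen for illustration, and the heuristics are deliberately crude (a word such as "everyone" would pass the suffix filter).

```python
import re
from typing import List, Optional, Tuple

# Hypothetical mini-dictionary; a real system would load ChEBI terms instead.
DICTIONARY = {"ethanol", "chloroform", "benzene"}

# Hypothetical suffix filter: endings common in systematic chemical names.
SUFFIXES = ("ane", "ene", "yne", "ol", "one", "oate", "amine")

# Molecular formulae such as C8H12O3: at least two element symbols,
# each optionally followed by a count, and at least one digit overall.
FORMULA_RE = re.compile(r"^(?=.*\d)(?:[A-Z][a-z]?\d*){2,}$")


def classify(token: str) -> Optional[str]:
    """Return which layer recognised the token, or None."""
    raw = token.strip(".,;:")
    word = raw.lower()
    if word in DICTIONARY:
        return "dictionary"
    if FORMULA_RE.match(raw):
        return "formula"
    if len(word) > 5 and word.endswith(SUFFIXES):
        return "suffix"
    return None


def recognise(text: str) -> List[Tuple[str, str]]:
    """Collect (token, layer) pairs for every recognised token."""
    hits = []
    for token in text.split():
        kind = classify(token)
        if kind:
            hits.append((token.strip(".,;:"), kind))
    return hits


print(recognise("Dissolve in chloroform to give C8H12O3."))
```

The layers are tried in order of reliability: an exact dictionary hit first, then the formula pattern, and the suffix rule last because it produces the most false positives.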
9
(No Transcript)
10
Input: PDF (Legacy Format)
PDF is the de facto format for electronic document
deposition in digital repositories

Problem: PDF is a page description format
optimized for human, not machine, readability
  • irregular word order
  • line-breaks: loss of continuous text; paragraphs
    difficult to identify
  • loss of subscripts and superscripts
  • non-printing characters
  • erroneous character assignment with OCR

11
(No Transcript)
12
Programmatic modifications to
  • Remove linebreaks from extended chemical names
  • Remove text fragments derived from Figures and
    Tables
  • Correct whitespace in chemical names
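
Two of these clean-up steps (removing linebreaks from extended chemical names and correcting whitespace) can be sketched with regular expressions. This is a minimal illustration, not the project's code: the function name `clean_chemical_text` and the specific patterns are assumptions, and the rules would also close up ordinary hyphenated words and number lists, so a real pipeline would apply them only to spans already flagged as chemical names. Removing text fragments from figures and tables is not covered.

```python
import re


def clean_chemical_text(text: str) -> str:
    """Rejoin chemical names broken by PDF text extraction."""
    # Rejoin a name split after a hyphen at a line end, keeping the hyphen.
    text = re.sub(r"(?<=\w)-\s*\n\s*(?=\w)", "-", text)
    # Close up stray spaces after an in-name hyphen ("hex- 2-enoate").
    text = re.sub(r"(?<=\w)-[ \t]+(?=\w)", "-", text)
    # Close up spaces inside locant lists ("4, 5-epoxy").
    text = re.sub(r"(?<=\d),\s+(?=\d)", ",", text)
    # Any remaining line break becomes a single space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


broken = "Ethyl 4, 5-epoxy-\nhex- 2-enoate was\nisolated"
print(clean_chemical_text(broken))
# prints: Ethyl 4,5-epoxy-hex-2-enoate was isolated
```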

[Workflow diagram. Labels: PDF, UTF-8 text, OSCAR3,
SAF XML, XSLT, RDF statements]
OSCAR used as is on PDF e-theses
  • Gives 5000 terms / thesis (80% duplicates)
  • Cannot identify chemical objects (spectra,
    assignments, properties)
13
Input: MS Office Open XML (docx)
  • No information loss from student's deposited
    thesis (written with MS software)
  • Identification of experimental sections no
    longer a problem -> Chemical Objects
  • Conversion of COs into Chemical Markup Language
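
Why docx avoids the information loss seen with PDF can be shown with a short sketch. A .docx file is a zip archive whose main text lives in `word/document.xml`, where paragraphs (`w:p`), runs (`w:r`) and subscript or superscript formatting (`w:vertAlign`) are explicit. The function name `docx_paragraphs` and the `_{...}` / `^{...}` output notation are assumptions for illustration; this is not the SPECTRa-T extraction code.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return one string per paragraph, marking sub/superscript runs.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            align = run.find(f"{W}rPr/{W}vertAlign")
            style = align.get(W + "val") if align is not None else None
            if style == "subscript":
                text = "_{" + text + "}"
            elif style == "superscript":
                text = "^{" + text + "}"
            parts.append(text)
        paragraphs.append("".join(parts))
    return paragraphs
```

Called on a thesis file, for example `docx_paragraphs("thesis.docx")` (a hypothetical path), this yields continuous paragraphs in reading order, so a formula typed as CDCl with a subscript 3 comes back as `CDCl_{3}` rather than losing the subscript.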

[Workflow diagram. Labels: DocX; Extract chemical
terms; OSCAR3; RDF statements; Extract chemical
objects; CML data files; Data Repository (GET, PUT,
SEARCH); URI; Link together]
14
Sample preparation from synthetic chemistry thesis
15
CML Infra-Red ASSIGNMENTS
<cml:spectrum type="cml:ir">
  <cml:conditionList>
    <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
  </cml:conditionList>
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
    <cml:peak id="p4" xValue="2922" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
    <cml:peak id="p6" xValue="1604" title="C=C" />
    <cml:peak id="p7" xValue="1496" title="unassigned" />
    <cml:peak id="p8" xValue="1454" title="unassigned" />
    <cml:peak id="p9" xValue="1366" title="unassigned" />
    <cml:peak id="p10" xValue="1299" title="unassigned" />
    <cml:peak id="p11" xValue="1135" title="unassigned" />
    <cml:peak id="p12" xValue="1078" title="unassigned" />
    <cml:peak id="p13" xValue="974" title="unassigned" />
  </cml:peakList>
</cml:spectrum>
CML C-13 NMR ASSIGNMENTS
<cml:spectrum type="cml:cnmr">
  <cml:parameterList>
    <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
  </cml:parameterList>
  <cml:substanceList>
    <cml:substance ref="" />
  </cml:substanceList>
  <cml:peakList>
    <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="C=O" />
    <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
    <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
    <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
  </cml:peakList>
</cml:spectrum>
16
RDF - Resource Description Framework. A
component of the Semantic Web, it is based upon
the concept of linking statements about
resources/data in the form of a subject -
predicate - object (or resource - property -
value) expression (called a triple), e.g.
My_thesis has_chemical_entity 2,4-dinitrobenzene.
When created in URI form, the value of one
property can in turn be used as the resource for
another; this allows the data to be re-used and
new inferences about data to be created.
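
The chaining idea, where the value of one triple becomes the resource of the next, can be sketched with plain tuples. Everything here is illustrative: the `example.org` URIs, the predicate names `has_name` and `has_spectrum`, and the helper functions are assumptions, and a set of tuples stands in for a real triplestore.

```python
# Each statement is a (subject, predicate, object) tuple.
# The URIs below are hypothetical placeholders.
THESIS = "http://example.org/thesis/my_thesis"
COMPOUND = "http://example.org/compound/1"

triples = {
    (THESIS, "has_chemical_entity", COMPOUND),
    (COMPOUND, "has_name", "2,4-dinitrobenzene"),
    (COMPOUND, "has_spectrum", "http://example.org/data/ir-1.cml"),
}


def objects(store, subject, predicate):
    """All values of `predicate` for `subject`, sorted for stable output."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


def names_in(store, thesis):
    """Follow thesis -> compound -> name.

    The compound URI is the object of the first triple and the subject
    of the second, which is what lets the two statements be linked.
    """
    return [
        name
        for compound in objects(store, thesis, "has_chemical_entity")
        for name in objects(store, compound, "has_name")
    ]


print(names_in(triples, THESIS))
# prints: ['2,4-dinitrobenzene']
```

Had the compound been stored as a literal string instead of a URI, the second hop would not be possible, which is the point the slide makes about creating values in URI form.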
17
SPARQL QUERY
PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t>
PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author .
  ?thesis st:hasChemicalName ?annot .
  ?annot st:chemicalName ?chemical .
  ?annot st:hasHNMRSpectrum ?hnmr .
  FILTER regex(?chemical, ".bicyclo.") .
}
RDF TRIPLESTORE ENTRY
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
        <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
        <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
        <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
        <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
        <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
        <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
        <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>
RESULT
<rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
</rdf:Description>
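
The logic of the query on this slide (join a document to its annotations, require an H NMR spectrum, filter names by a regular expression) can be mimicked in a few lines. This is only a sketch of the matching logic, not how a triplestore evaluates SPARQL: the store contents, the identifiers `doc1` and `_:a1` to `_:a3`, the data file names and the function name `bicyclo_with_hnmr` are all hypothetical, and predicates are written as prefixed strings.

```python
import re

# Hypothetical store of (subject, predicate, object) triples.
store = [
    ("doc1", "dcrdf:creator", "N.R.Champness"),
    ("doc1", "st:hasChemicalName", "_:a1"),
    ("_:a1", "st:chemicalName", "5-Acetyl-bicyclo[4.2.1]nona-4,7-diene"),
    ("_:a1", "st:hasHNMRSpectrum", "data-1.cml"),
    ("doc1", "st:hasChemicalName", "_:a2"),
    ("_:a2", "st:chemicalName", "CDCl3"),
    ("_:a2", "st:hasHNMRSpectrum", "data-2.cml"),
    ("doc1", "st:hasChemicalName", "_:a3"),
    ("_:a3", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
]


def values(triples, subject, predicate):
    """All objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def bicyclo_with_hnmr(triples):
    """Return sorted (document, chemical name, author) matches."""
    results = set()
    for doc, predicate, annot in triples:
        if predicate != "st:hasChemicalName":
            continue
        # The annotation must carry an H NMR spectrum.
        if not values(triples, annot, "st:hasHNMRSpectrum"):
            continue
        for name in values(triples, annot, "st:chemicalName"):
            # Same pattern as the FILTER clause in the query.
            if re.search(".bicyclo.", name):
                for author in values(triples, doc, "dcrdf:creator"):
                    results.add((doc, name, author))
    return sorted(results)


print(bicyclo_with_hnmr(store))
```

In this toy store only the first annotation satisfies every condition: CDCl3 has a spectrum but fails the name filter, and the phenyl compound matches the filter but has no H NMR spectrum attached.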
18
Message to repository managers
  • PDF is a limited format for data extraction
    from e-theses
  • Docx allows chemical data object extraction
    (80% precision / recall)

Solutions
  • Domain ontology development
  • Make your e-theses public!

Caveats (Proof-of-concept)
  • Single subject area (synthetic organic chemistry)
  • Single institution docx (limited variation in
    document structure)
  • Limited thesis availability
19
Acknowledgements
  • Project Director: Peter Morgan, UL Cambridge
  • Chemistry leads: Henry Rzepa, Peter Murray-Rust
  • Developers: Jim Downing, Diana Stewart,
    Joe Townsend, Matt Harvey
  • Project Manager: Alan Tonge

http://www.lib.cam.ac.uk/spectra-t/
20
SPECTRa Tools Workshop
Autumn 2008, Unilever Centre, Cambridge, UK
Contact: Peter Murray-Rust (pm286_at_cam.ac.uk),
Peter Morgan (pbm2_at_cam.ac.uk)