BioText Conference, Birkbeck College, London



1
BioText Conference, Birkbeck College, London
  • Stephen Edwards BSc.
  • University of Edinburgh
  • BioNLP meeting 14th November 2005

2
Speakers
  • EBIMed
  • SBSS
  • AstraZeneca
  • Rob Gaizauskas (Sheffield)
  • BioRAT
  • Andrew Clegg (UoL)

3
EBIMed
  • Co-occurrence based IE (looking at parse methods)
  • Created on the fly
  • 40 sentences retrieved useful for PPI
  • Includes navigation to databases (important)
  • Assessible data
  • max 10,000 (moving to full papers)
  • Whatizit modules
  • High speed tagging modules
  • Can be hooked up to any dictionary
  • RegExp and ML approaches should be combined
  • Recall: the whole truth
  • Precision: nothing but the truth
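
The recall and precision definitions above can be made concrete with a minimal sketch. The function name and the interaction pairs are hypothetical, chosen only for illustration; the F-measure shown is the usual harmonic mean of the two.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: how much of what we extracted is correct ("nothing but the truth").
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how much of the correct set we extracted ("the whole truth").
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical protein-protein interaction pairs, not real data.
gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
predicted = {("A", "B"), ("A", "C"), ("A", "D")}
p, r, f = precision_recall_f1(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the four gold pairs are found, so precision is higher than recall.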

6
SBSS
  • Business uses
  • Patent recognisers
  • IP protection
  • Drug design
  • Author networks, competition and funding
  • Marketing
  • NLP system, statistical linking between concepts
  • External/Internal databases
  • Includes web-sites, forums (negate false
    rumours!)
  • Microarray -> Text-Mining
  • User interface to add new synonyms
  • IBM Unstructured Information Management
    Architecture

7
AstraZeneca
  • Drug discovery process -> masses of data
    (chem/bio assays, clinical trials, reports etc.)
  • Track competition, groups
  • GCLit - gene summaries
  • - MeSH/gene co-occurrences
  • Produce similarity matrices for two genes
  • Back-dating to trap now-known associations

8
Rob Gaizauskas
  • GO tagging (19,022 terms), GOSlims
  • Many tools
  • GoPubMed (weighted GO -> doc assignment)
  • GO-KDS (commercial; assigns GO terms to PubMed,
    rubbish!)
  • BLAST -> Lit, cluster and structure by GO code
  • AMBIT
  • combines IR/IE
  • Termino module: GO, UniProt, UMLS
  • DiscoveryNet
  • data management software, includes Termino
  • Created complete GO corpus
  • fuzzy match + manual to get complete GO corpus
  • High results achieved assigning GO to abstracts
  • F-measure 0.8
  • dubious, difficult to replicate evaluation as
    GO codes incomplete
  • User view applet GO Abstracts
  • Glass ceiling: too much tinkering, more
    fundamental ideas needed

9
Bio Research Assistant (BioRAT)
  • 100 words/sec
  • PhD grads in India pay them!
  • Tagging, PPI extraction, based on GATE (further
    funding 5 yrs)
  • User defines concepts of interest
  • program defines templates
  • select or reject, most are poor, time costly
  • Or,
  • ML: sequence-aligns sentences to produce templates
  • requires less effort but less reliable

10
NER (Andrew Clegg)
  • Trees: discard the parts of the tree you don't need
  • NER
  • achieve max recall then filter through ABNER
    (high precision)
  • Create every possible variant, strip punctuation,
    substitute greek, remove stop words, long/short
    names
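
The variant-generation step above can be sketched as follows. This is an illustration only, not the code presented at the meeting: the lookup tables are tiny hypothetical stand-ins, and the short-to-long mapping borrows the alpha-CN example used elsewhere in these slides.

```python
import re
import string

# Illustrative lookup tables; a real system would use far larger resources.
GREEK = {"alpha": "α", "beta": "β", "gamma": "γ", "kappa": "κ"}
STOP_WORDS = {"of", "the", "and", "in"}
LONG_FORMS = {"cn": "casein"}  # short name -> long name (hypothetical entry)

PUNCT_RUN = re.compile(f"[{re.escape(string.punctuation)}]+")


def name_variants(name):
    """Generate spelling variants of an entity name for high-recall matching."""
    variants = {name, name.lower()}

    # Strip punctuation two ways: delete it, or turn it into a space.
    for v in list(variants):
        variants.add(PUNCT_RUN.sub("", v))
        variants.add(PUNCT_RUN.sub(" ", v))

    # Swap spelled-out Greek letters and their symbols (naive substring swap).
    for v in list(variants):
        for word, symbol in GREEK.items():
            variants.add(v.replace(word, symbol))
            variants.add(v.replace(symbol, word))

    # Drop stop words and expand short names, token by token.
    for v in list(variants):
        tokens = [t for t in v.split() if t.lower() not in STOP_WORDS]
        variants.add(" ".join(tokens))
        variants.add(" ".join(LONG_FORMS.get(t.lower(), t) for t in tokens))

    # Normalise whitespace and discard empty strings.
    return {" ".join(v.split()) for v in variants if v.strip()}


variants = name_variants("alpha-CN")
```

Generating every variant deliberately over-produces; the slide's point is that a high-precision filter (ABNER in the talk) then prunes the matches.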

11
MMTx: Mapping the UMLS to text
  • Stephen Edwards BSc.
  • University of Edinburgh
  • BioNLP meeting 14th November 2005

12
Overview
  • UMLS
  • MMTx
  • Hypothesis generation
  • milkER
  • Future use

13
UMLS
  • Unified Medical Language System
  • Multi-source vocabulary (>60 families)
  • 2.5 million terms
  • Concepts in semantic network
  • 12,000,000 relations between concepts
  • Lexicon
  • Many IDs
  • AUI
  • SUI
  • CUI
  • TUI
  • Customisable (MetamorphoSys)

14
MMTx
  • Preparatory filtering
  • Relaxed (manual, lexical; 87)
  • Moderate (relaxed + type-based; 75)
  • Strict (moderate + syntactic)
  • Highly computationally expensive
  • Options
  • restrict to sources
  • Restrict to semantic types
  • Show CUIs, semantic types, treecodes

15
MMTx parsing
  • Parsed into noun phrases
  • SPECIALIST minimal commitment parser/MedPost SKR
  • Variant generation
  • Largely preprocessed
  • Candidate retrieval
  • Candidate evaluation
  • Centrality
  • Variation
  • Coverage
  • Cohesiveness
  • Mapping
  • Combines candidates
  • Mapping evaluation (as with candidates)
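
A minimal sketch of how the four evaluation measures might be combined into one score. It assumes the weighting as I understand it from Aronson's MetaMap description (coverage and cohesiveness counted twice, result scaled to 0-1000); the function name is hypothetical, the four inputs are taken as already computed values in [0, 1], and this is not MMTx code.

```python
def metamap_style_score(centrality, variation, coverage, cohesiveness):
    """Combine four measures in [0, 1] into a 0-1000 score.

    Assumed weighting: coverage and cohesiveness count twice as much as
    centrality and variation.
    """
    for value in (centrality, variation, coverage, cohesiveness):
        if not 0 <= value <= 1:
            raise ValueError("each measure must lie in [0, 1]")
    weighted = (centrality + variation + 2 * coverage + 2 * cohesiveness) / 6
    return round(1000 * weighted)


perfect = metamap_style_score(1, 1, 1, 1)
non_head = metamap_style_score(0, 1, 0.5, 0.5)
```

Under this weighting, a candidate that matches the head of the phrase scores 1000/6 (about 167) points more than an otherwise identical non-head candidate, which is consistent with the 861 and 694 scores in the sample output on the next slide.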

16
  • Sentence 00183: Progress is described on the
    advanced stages in design of an instrument for
    the study of red blood cell aggregation and blood
    viscosity under near-zero gravity
    conditions. 115406091
  • Phrase "Progress"
  • Meta Mapping (1000)
  • 1000 C1280477: Progress [Functional Concept]
  • Phrase "is"
  • Meta Candidates (0) <none>
  • Meta Mappings <none>
  • Phrase "described"
  • Meta Candidates (0) <none>
  • Meta Mappings <none>
  • Phrase "on the advanced stages"
  • Meta Mapping (888)
  • 694 C0205179: Advanced [Qualitative Concept]
  • 861 C1306673: Stages [Functional Concept]

17
MMTx customisation
  • Advised to customise
  • English only sources used
  • Removed inappropriate sources
  • 2 secs/sentence (12x improved performance)
  • Can limit to sources, semantic types
  • Running on Windows, FC2 Linux
  • Lots of fudging required!

18
Hypothesis generation
  • Aim to extract interactions and diseases
  • Swanson (Fish oil - Blood viscosity - Raynaud's
    disease)
  • Srinivasan (Turmeric - NF-κB - Crohn's disease)
  • Weeber (Thalidomide - IL-4 - Pancreatitis)
  • Confirmed experimentally

19
Hypothesis generation
  • Open/Closed
  • Co-occurrence relationship extraction
  • A (Raynaud's disease)
  • B1
  • B2
  • B3 (Blood Viscosity)
  • B4

20
Hypothesis generation
  • B3 (Blood viscosity)
  • C1
  • C2
  • C3 (Fish Oil)
  • C4

21
Hypothesis generation
  • Closed
  • A
  • B1
  • B2
  • B3
  • B4

  • C: B5, B2, B6, B1
  • Need to remove known A-C relationships
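
The open and closed discovery modes on the last few slides can be sketched with plain set operations over co-occurring terms. Everything here is an assumption for illustration: the function names are hypothetical, and the five "documents" are a toy corpus made up around Swanson's example, not real MEDLINE data.

```python
from collections import defaultdict


def cooccurring_terms(documents, term):
    """Map each term sharing a document with `term` to its document count."""
    counts = defaultdict(int)
    for terms in documents:
        if term in terms:
            for other in terms:
                if other != term:
                    counts[other] += 1
    return dict(counts)


def open_discovery(documents, a_term):
    """A -> B -> C: propose C terms linked to A only through B terms."""
    b_terms = cooccurring_terms(documents, a_term)
    hypotheses = defaultdict(set)
    for b in b_terms:
        for c in cooccurring_terms(documents, b):
            # Skip A itself and remove known A-C relationships.
            if c != a_term and c not in b_terms:
                hypotheses[c].add(b)
    return dict(hypotheses)


def closed_discovery(documents, a_term, c_term):
    """Return the B terms shared by a given A and a given C."""
    a_side = set(cooccurring_terms(documents, a_term))
    c_side = set(cooccurring_terms(documents, c_term))
    return a_side & c_side


# Toy corpus: each document is the set of terms found in it.
docs = [
    {"Raynaud's disease", "blood viscosity"},
    {"Raynaud's disease", "platelet aggregation"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "haematocrit"},
]
candidates = open_discovery(docs, "Raynaud's disease")
shared = closed_discovery(docs, "Raynaud's disease", "fish oil")
```

In this toy corpus, fish oil is proposed through two B terms and haematocrit through one, so counting supporting B terms gives a crude first ranking of the C candidates.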

22
Other systems
  • Manjal: MeSH only, basic
  • LitLinker: shows associations by frequency
  • TransMiner: can be linked to microarray
  • DAD: Drug, Adverse Drug Reactions
  • iHOP: slick, informative sentences (e.g.
    experimental evidence, synonyms, hyperlinked,
    BUT 5 species only)
  • (Refs cited at end)

25
Other systems
  • Manjal: MeSH only, basic
  • LitLinker: shows associations by frequency
  • TransMiner: can be linked to microarray
  • DAD: Drug, Adverse Drug Reactions
  • iHOP: slick, informative sentences
    (e.g. experimental evidence, synonyms,
    hyperlinked, BUT five species only)
  • linkouts to external databases
  • EBIMed
  • (Refs cited at end)

26
milkER program
  • Manjal
  • DAD

27
milkER program
  • Input: MEDLINE A/B/C term
  • Extract titles, abstracts, MeSH and Substance terms
  • Tag milk proteins/peptides or term in titles and
    abstracts
  • Sort and count MeSH and Substance terms
  • Extract sentences containing the protein/peptide
    or term
  • Partial standardisation, remove overmatching
  • UMLS tagging (customised MMTx)

28
  • Filter and sort terms
  • Physiological function
  • Entity
  • Location
  • Combined
  • No filter
  • (filter by MMTx weight?)

  • Group terms by concept (removes plurals and
    variants)
  • Remove terms that are too general, on the second or
    third level of the UMLS hierarchy
  • Variable parameters
  • Remove parent or child terms of search term
  • Cluster concepts by using the head of the noun?
    (e.g. common and right migraine)
  • Remove over-abundant terms, e.g. >15,000 documents
  • Calculate weighting of term
  • TF-IDF
  • Level of support of relationship (e.g. must occur
    in >5 titles with A term or is spurious)

Select B terms for subsequent analysis
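
The weighting and filtering steps on this slide can be sketched as below. This is not the milkER implementation: the function, its parameters and the counts are hypothetical, and TF-IDF is taken in its common form, term frequency times log(N / document frequency). Only the two thresholds (more than 5 titles, no more than 15,000 documents) come from the slide.

```python
import math


def rank_b_terms(term_counts, doc_freq, title_support, n_docs,
                 min_titles=5, max_doc_freq=15000):
    """Rank candidate B terms by TF-IDF after two filters.

    term_counts:   term -> occurrences in the A-term result set (TF)
    doc_freq:      term -> documents in the whole collection with the term (DF)
    title_support: term -> titles in which the term co-occurs with the A term
    """
    ranked = []
    for term, tf in term_counts.items():
        df = doc_freq.get(term, 0)
        # Remove over-abundant terms (and terms with no frequency data).
        if df == 0 or df > max_doc_freq:
            continue
        # Require more than `min_titles` supporting titles, else treat as spurious.
        if title_support.get(term, 0) <= min_titles:
            continue
        ranked.append((term, tf * math.log(n_docs / df)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts, for illustration only.
ranking = rank_b_terms(
    term_counts={"blood viscosity": 40, "patients": 300, "vasospasm": 3},
    doc_freq={"blood viscosity": 5000, "patients": 400000, "vasospasm": 2000},
    title_support={"blood viscosity": 12, "patients": 90, "vasospasm": 2},
    n_docs=1_000_000,
)
```

With these made-up numbers, "patients" is dropped as over-abundant and "vasospasm" for lack of title support, leaving "blood viscosity" as the only B term.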
29
Features
  • User defined gazetteer
  • Removes overmatches
  • <prot>casein</prot> kinase
  • Currently hard-coded
  • Some standardisation
  • E.g. alpha-CN -> alpha casein
  • prevents loss of data from MMTx
  • Each sent/title given unique ID
  • Main MeSH terms, MeSH terms, Substance terms
  • Can use any PubMed query, PMID etc
  • Did you mean?
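
A minimal sketch of gazetteer tagging with standardisation and overmatch removal, assuming a regular-expression approach. The term lists are hypothetical examples built from the slide (casein, casein kinase, alpha-CN), and the real program's hard-coded rules are not known.

```python
import re

GAZETTEER = ["alpha casein", "casein"]         # user-defined terms (examples)
STANDARD_FORMS = {"alpha-CN": "alpha casein"}  # variant -> standard form
OVERMATCH_FOLLOWERS = ["kinase"]               # hypothetical blocklist


def standardise(text):
    """Rewrite known name variants to their standard form."""
    for variant, standard in STANDARD_FORMS.items():
        text = re.sub(re.escape(variant), standard, text)
    return text


def tag_proteins(text):
    """Wrap gazetteer terms in <prot> tags, skipping known overmatches."""
    text = standardise(text)
    # Longest terms first, so "alpha casein" wins over "casein".
    terms = sorted(GAZETTEER, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in terms)
    blocked = "|".join(re.escape(w) for w in OVERMATCH_FOLLOWERS)
    # The negative lookahead rejects a term followed by a blocked word,
    # so "casein kinase" is not tagged as the milk protein casein.
    pattern = re.compile(
        rf"\b({alternatives})\b(?!\s+(?:{blocked})\b)", re.IGNORECASE
    )
    return pattern.sub(r"<prot>\1</prot>", text)


tagged = tag_proteins("alpha-CN is phosphorylated by casein kinase")
```

Standardising before tagging means the later UMLS step sees "alpha casein" rather than an abbreviation it may not recognise, which is the data loss the slide refers to.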

30
Comparative filtration
  • Compare filter combinations
  • Calibrate with known link (RD - Fish Oil)
  • Highest rank of blood viscosity
  • Dependence on topic?
  • Combine type ranks
  • MeSH terms
  • Substance terms
  • Title
  • Abstract

31
Targets
  • Milk proteins
  • Largely digested
  • Maternal regulation
  • Milk peptides
  • Can reach blood stream, stable
  • Receptor binding
  • Protein binding
  • Immunoresponse

32
Targets
  • Plasmin remodelling
  • Plasmin levels increase during parturition and
    involution
  • Hypothesis: peptides involved in restructuring
  • Extension: are peptides involved in apoptosis,
    hyperplasia?
  • Role of the abundant proteins
  • MFGL
  • Xanthine Oxidase
  • CD36
  • β-lactoglobulin

34
Information kept
  • Defined area (milk) therefore can store detailed
    info., unlike generic system
  • Known assoc with strength
  • Unknown assoc with strength
  • LinkOuts
  • Main MeSH terms
  • MeSH terms
  • Substance terms
  • MMTx concepts

37
Problems
  • No directionality on relationships
  • Incorrect MMTx tagging
  • Peptide literature
  • Small(ish) amount of named peptide data
  • Need to text-mine peptides; however, this is also a
    strength, as the data are more disparate
  • Species/age differentiation (by MeSH?)

38
Conclusions
  • Co-occurrence relationships derived for milk
    protein/peptides and other terms
  • Hypothesis generation to identify new knowledge
  • Information stored for user access

39
Future work
  • Debug!
  • Species/age specificity by MeSH term?
  • Check incorrect MMTx tagging
  • add bioactive peptides to source data
  • Link proteins to milkER sequence database
  • Finish user interface
  • Learn Java ?

40
Acknowledgements
  • Prof. Lindsay Sawyer
  • Dr. Carl Holt (Hannah Research Institute, Ayr)
  • Prof. Bonnie Webber (Informatics)
  • Dr. Alistair Kerr and Gail Sinclair (technical
    support)

41
Miscellaneous
  • ArrayPaths, Stratagene
  • Huang et al., 2005 PPI extractor program
  • Metis (Mitchell et al) flags interesting
    sentences to user from a UniProt sequence search,
    crap but nice to have BLAST
  • MELISA (Abasolo et al) ontology based IE
  • Genomes to Systems Conference, Manchester,
    22-24 March 2006

42
References
  • Abasolo JM, Gomez M: MELISA. An ontology-based
    agent for information retrieval in medicine. ECDL
    Workshop on the Semantic Web 2000.
  • Aronson AR: Effective mapping of biomedical text
    to the UMLS Metathesaurus: the MetaMap program.
    Proc AMIA Symp 2001:17-21.
  • Aronson AR: Filtering the UMLS Metathesaurus for
    MetaMap. 2001.
  • Bodenreider O: The Unified Medical Language
    System (UMLS): integrating biomedical
    terminology. Nucleic Acids Res 2004, 32(Database
    issue):D267-270.
  • iHOP (Information Hyperlinked over Proteins):
    http://www.pdg.cnb.uam.es/UniPub/iHOP/
  • Hoffmann R, Valencia A: Implementing the iHOP
    concept for navigation of biomedical literature.
    Bioinformatics 2005, 21 Suppl 2:ii252-ii258.
  • Mitchell AL, Divoli A, Kim JH, Hilario M, Selimas
    I, Attwood TK: METIS: multiple extraction
    techniques for informative sentences.
    Bioinformatics 2005, 21(22):4196-4197.
  • Narayanasamy V, Mukhopadhyay S, Palakal M, Potter
    DA: TransMiner: mining transitive associations
    among biological objects from text. J Biomed Sci
    2004, 11(6):864-873.
  • Pratt W, Yetisgen-Yildiz M: LitLinker: Capturing
    Connections Across the Biomedical Literature.
    K-CAP 2003.

43
References (2)
  • Pratt W, Yetisgen-Yildiz M: A study of biomedical
    concept identification: MetaMap vs. people. AMIA
    Annu Symp Proc 2003:529-533.
  • EBIMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
  • Whatizit: http://www.ebi.ac.uk/Rebholz-srv/whatizit
  • Shatkay H: Hairpins in bookstacks: information
    retrieval from biomedical text. Brief Bioinform
    2005, 6(3):222-238.
  • Srinivasan P: Text mining: Generating hypotheses
    from MEDLINE. J Am Soc Inf Sci Technol 2004,
    55(5):396-413.
  • Srinivasan P, Libbus B: Mining MEDLINE for
    implicit links between dietary substances and
    diseases. Bioinformatics 2004, 20 Suppl
    1:I290-I296.
  • Weeber M, Klein H, Aronson AR, Mork JG, de
    Jong-van den Berg LT, Vos R: Text-based discovery
    in biomedicine: the architecture of the
    DAD-system. Proc AMIA Symp 2000:903-907.
  • Weeber M, Klein H, de Jong-van den Berg LTW, Vos
    R: Using concepts in literature-based discovery:
    Simulating Swanson's Raynaud-fish oil and
    migraine-magnesium discoveries. J Am Soc Inf Sci
    Technol 2001, 52(7):548-557.
  • Weeber M, Vos R, Klein H, de Jong-van den Berg
    LTW, Aronson AR, Molema G: Generating hypotheses
    by discovering implicit associations in the
    literature: A case report of a search for new
    potential therapeutic uses for thalidomide. J Am
    Med Inf Assoc 2003, 10(3):252-259.
  • Wren JD: Extending the mutual information measure
    to rank inferred literature relationships. BMC
    Bioinformatics 2004, 5:145.