Title: BioText Conference Birkbeck College, London
1BioText Conference, Birkbeck College, London
- Stephen Edwards BSc.
- University of Edinburgh
- BioNLP meeting 14th November 2005
2Speakers
- EBIMed
- SBSS
- Astra-Zeneca
- Rob Gaizauskas (Sheffield)
- BioRAT
- Andrew Clegg (UoL)
3EBIMed
- Co-occurrence based IE (looking at parse methods)
- Created on the fly
- 40 sentences retrieved useful for PPI
- Includes navigation to databases (important)
- Accessible data
- max 10,000 (moving to full papers)
- Whatizit modules
- High speed tagging modules
- Can be hooked up to any dictionary
- RegExp and ML should be combined
- Recall: the whole truth
- Precision: nothing but the truth
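The two slogans above correspond to the standard definitions of recall and precision. A minimal Python sketch follows; the function name and the toy sets of protein pairs are invented for illustration and are not taken from the talk.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision ("nothing but the truth"): share of returned items that are correct.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall ("the whole truth"): share of correct items that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of the two.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: hypothetical gold-standard and extracted interaction pairs.
gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
found = {("A", "B"), ("A", "C"), ("A", "D")}
p, r, f = precision_recall_f1(found, gold)
```

Here two of the three extracted pairs are correct, and two of the four gold pairs were found.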
6SBSS
- Business uses
- Patent recognisers
- IP protection
- Drug design
- Author networks, competition and funding
- Marketing
- NLP system, statistical linking between concepts
- External/Internal databases
- Includes web-sites, forums (negate false rumours!)
- Microarray -> Text-Mining
- User interface to add new synonyms
- IBM Unstructured Information Management
Architecture
7Astra-Zeneca
- Drug discovery process > masses of data (chem/bio assays, clinical trials, reports etc)
- Track competition, groups
- GCLit - gene summaries
- MeSH/gene co-occurrences
- Produce similarity matrices for two genes
- Back-dating to trap now-known associations
8Rob Gaizauskas
- GO tagging (19,022 terms), GOSlims
- Many tools
- GOPubmed (weighted GO->doc assignment)
- GO-KDS (comm. Assigns GO terms to PubMed, rubbish!)
- BLAST -> Lit, cluster and structure by GO code
- AMBIT
- combines IR/IE
- Termino module GO, UniProt, UMLS
- DiscoveryNet
- data management software, includes Termino
- Created complete GO corpus
- fuzzy match + manual to get complete GO corpus
- High results achieved assigning GO to abstracts
- F-measure 0.8
- dubious, difficult to replicate evaluation as GO codes incomplete
- User view applet GO Abstracts
- Glass ceiling: too much tinkering, more fundamental ideas needed
9Bio Research Assistant (BioRAT)
- 100 words/sec
- PhD grads in India pay them!
- Tagging, PPI extraction, based on GATE (further funding 5 yrs)
- User defines concepts of interest
- program defines templates
- select or reject, most are poor, time costly
- Or,
- ML: sequence-aligns sentences, produces templates
- requires less effort but less reliable
10NER (Andrew Clegg)
- Trees: discard parts of the tree you don't need
- NER
- achieve max recall then filter through ABNER (high precision)
- Create every possible variant, strip punctuation, substitute greek, remove stop words, long/short names
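The variant-generation step for maximising recall can be sketched roughly as below. This is an illustration only, not the speaker's code: the Greek-letter table, the stop-word list and the function name are assumptions, and the long/short name expansion is left out.

```python
import string

# Assumed lookup tables; a real system would use much larger ones.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "κ": "kappa"}
STOP_WORDS = {"of", "the", "and", "a", "an"}

PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})
PUNCT_DELETE = str.maketrans("", "", string.punctuation)


def name_variants(name):
    """Return a set of spelling variants for one entity name."""
    variants = {name}

    # Substitute Greek letters with their spelled-out names.
    spelled = name
    for letter, word in GREEK.items():
        spelled = spelled.replace(letter, word)
    variants.add(spelled)

    # Strip punctuation, both as a word break and as a plain deletion.
    for v in list(variants):
        variants.add(" ".join(v.translate(PUNCT_TO_SPACE).split()))
        variants.add(v.translate(PUNCT_DELETE))

    # Remove stop words.
    for v in list(variants):
        kept = [tok for tok in v.split() if tok.lower() not in STOP_WORDS]
        if kept:
            variants.add(" ".join(kept))

    # Add lower-cased forms of everything generated so far.
    variants |= {v.lower() for v in variants}
    return variants


beta_lg = name_variants("β-lactoglobulin")
inhibitor = name_variants("inhibitor of NF-κB")
```

Over-generating like this is deliberately noisy; the slide's point is that a high-precision filter (ABNER in the talk) is applied afterwards.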
11MMTx Mapping the UMLS to text
- Stephen Edwards BSc.
- University of Edinburgh
- BioNLP meeting 14th November 2005
12Overview
- UMLS
- MMTx
- Hypothesis generation
- milkER
- Future use
13UMLS
- Unified Medical Language System
- Multi-source vocabulary (>60 families)
- 2.5 million terms
- Concepts in semantic network
- 12,000,000 relations between concepts
- Lexicon
- Many IDs
- AUI
- SUI
- CUI
- TUI
- Customisable (MetaMorphosys)
14MMTx
- Preparatory filtering
- Relaxed (manual, lexical; 87%)
- Moderate (relaxed + type-based; 75%)
- Strict (moderate + syntactic)
- Highly computationally expensive
- Options
- restrict to sources
- Restrict to semantic types
- Show CUIs, semantic types, treecodes
15MMTx parsing
- Parsed into noun phrases
- SPECIALIST minimal commitment parser/MedPost SKR
- Variant generation
- Largely preprocessed
- Candidate retrieval
- Candidate evaluation
- Centrality
- Variation
- Coverage
- Cohesiveness
- Mapping
- Combines candidates
- Mapping evaluation (as with candidates)
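As a rough illustration of how the four evaluation measures could be combined into one candidate score, the sketch below assumes each measure is already normalised to the range 0 to 1, and uses a weighting in which coverage and cohesiveness count twice as much as centrality and variation, scaled to 0-1000. That weighting is recalled from published descriptions of MetaMap and should be treated as an assumption, not as the MMTx source code; the function name is hypothetical.

```python
def combined_score(centrality, variation, coverage, cohesiveness):
    """Combine four measures (each in [0, 1]) into a 0-1000 score.

    Assumed weighting: coverage and cohesiveness count double.
    """
    for value in (centrality, variation, coverage, cohesiveness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each measure must lie between 0 and 1")
    weighted = (centrality + variation + 2 * coverage + 2 * cohesiveness) / 6
    return round(1000 * weighted)


perfect = combined_score(1, 1, 1, 1)
partial = combined_score(1, 1, 0.5, 0.5)
```

A candidate that matches the whole phrase exactly reaches the maximum of 1000, which is consistent with the "Progress" mapping shown on the next slide.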
16- Sentence 00183: Progress is described on the advanced stages in design of an instrument for the study of red blood cell aggregation and blood viscosity under near-zero gravity conditions. 115406091
- Phrase "Progress"
- Meta Mapping (1000)
- 1000 C1280477:Progress [Functional Concept]
- Phrase "is"
- Meta Candidates (0) <none>
- Meta Mappings <none>
- Phrase "described"
- Meta Candidates (0) <none>
- Meta Mappings <none>
- Phrase "on the advanced stages"
- Meta Mapping (888)
- 694 C0205179:Advanced [Qualitative Concept]
- 861 C1306673:Stages [Functional Concept]
17MMTx customisation
- Advised to customise
- English only sources used
- Removed inappropriate sources
- 2 secs/sentence (12x improved performance)
- Can limit to sources, semantic types
- Running on Windows, FC2 Linux
- Lots of fudging required!
18Hypothesis generation
- Aim to extract interactions and diseases
- Swanson (Fish oil - Blood viscosity - Raynaud's disease)
- Srinivasan (Turmeric - NF-κB - Crohn's disease)
- Weeber (Thalidomide - IL-4 - Pancreatitis)
- Confirmed experimentally
19Hypothesis generation
- Open/Closed
- Co-occurrence relationship extraction
- A (Raynaud's disease)
- B1
- B2
- B3 (Blood Viscosity)
- B4
20Hypothesis generation
- B3 (Blood viscosity)
- C1
- C2
- C3 (Fish Oil)
- C4
21Hypothesis generation
C B5 B2 B6 B1
- Need to remove known A-C relationships
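The open A -> B -> C search sketched on these slides, including the removal of already-known A-C links, might look like this in Python. It is a toy illustration under stated assumptions: each document is reduced to a set of terms, "co-occurrence" means sharing a document, and the example documents (including "drug X") are invented.

```python
from collections import defaultdict


def cooccurring_terms(docs, term):
    """Return every term that shares at least one document with `term`."""
    found = set()
    for terms in docs:
        if term in terms:
            found |= terms - {term}
    return found


def open_discovery(docs, a_term):
    """Rank candidate C terms reachable from A only through B terms."""
    b_terms = cooccurring_terms(docs, a_term)
    known = b_terms | {a_term}  # anything directly linked to A is already known
    candidates = defaultdict(set)
    for b in b_terms:
        for c in cooccurring_terms(docs, b):
            if c not in known:  # remove known A-C relationships
                candidates[c].add(b)
    # More linking B terms first, then alphabetical for a stable order.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Invented mini-corpus: one set of terms per document.
docs = [
    {"Raynaud's disease", "blood viscosity"},
    {"Raynaud's disease", "vasoconstriction"},
    {"Raynaud's disease", "drug X"},
    {"blood viscosity", "fish oil"},
    {"vasoconstriction", "fish oil"},
    {"blood viscosity", "drug X"},
]
ranked = open_discovery(docs, "Raynaud's disease")
```

In this toy corpus "drug X" is dropped because it already co-occurs with the A term, while "fish oil" survives as a candidate supported by two B terms.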
25Other systems
- ManJal: MeSH only, basic
- LitLinker: shows associations by frequency
- TransMiner: can be linked to MicroArray
- DAD: Drug, Adverse Drug Reactions
- i-HOP: slick, informative sentences (e.g. experimental evidence, synonyms, hyperlinked, BUT five species only)
- linkouts to external databases
- EBIMed
- (Refs cited at end)
26milkER program
27milkER program
Input: MEDLINE A/B/C term
Extract Titles, Abstracts, MeSH, Substance Terms
Tag milk proteins/peptides or term in title and abstracts
Sort and count MeSH and Substance terms
Extract sentences containing the protein/peptide or term
Partial standardisation, remove overmatching
UMLS tagging (customised MMTx)
28- Filter and sort terms
- Physiological function
- Entity
- Location
- Combined
- No filter
- (filter by MMTx weight?)
Group terms by concept (removes plurals and variants)
Remove terms that are too general, on the second or third level of the UMLS hierarchy
Variable parameters
Remove parent or child terms of search term
Cluster concepts by using the head of the noun (? e.g. common and right migraine)
Remove over-abundant terms, e.g. >15,000 documents
- Calculate weighting of term
- TF-IDF
- Level of support of relationship (e.g. must occur in >5 titles with A term or is spurious)
Select B terms for subsequent analysis
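The weighting and support steps above could be sketched as follows. The slides do not say which TF-IDF variant milkER uses, so this assumes the plain form (count within the A-term result set times the log of inverse document frequency); the function name, parameter names and toy counts are hypothetical, while the thresholds of 5 and 15,000 are the examples quoted on the slide.

```python
import math
from collections import Counter


def weight_b_terms(a_docs, doc_freq, total_docs,
                   min_support=5, max_doc_freq=15000):
    """Weight candidate B terms found in the documents retrieved for A.

    a_docs:     list of term sets, one per document retrieved for the A term
    doc_freq:   term -> number of documents in the whole collection
    total_docs: size of the whole collection
    """
    support = Counter(term for terms in a_docs for term in set(terms))
    weights = {}
    for term, count in support.items():
        if count <= min_support:
            continue  # too little support: treat the link as spurious
        if doc_freq[term] > max_doc_freq:
            continue  # over-abundant term
        idf = math.log(total_docs / doc_freq[term])
        weights[term] = count * idf
    return weights


# Invented counts for illustration only.
a_docs = [{"blood viscosity", "human"}] * 6 + [{"rare term"}]
doc_freq = {"blood viscosity": 100, "human": 10000, "rare term": 1}
weights = weight_b_terms(a_docs, doc_freq, total_docs=10000)
```

A term found in every document of the collection gets a weight of zero, and a term seen only once with the A term is discarded regardless of how rare it is.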
29Features
- User defined gazetteer
- Removes overmatches
- <prot>casein</prot> kinase
- Currently hard-coded
- Some standardisation
- E.g. alpha-CN > alpha casein
- prevents loss of data from MMTx
- Each sent/title given unique ID
- Main MeSH terms, MeSH terms, Substance terms
- Can use any PubMed query, PMID etc
- Did you mean?
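A minimal sketch of the gazetteer tagging with overmatch removal and standardisation described above. The `<prot>` tag and the alpha-CN and casein kinase examples come from the slide; the gazetteer contents, the hard-coded list of blocking words and the function name are assumptions made for illustration.

```python
import re

# Assumed, hard-coded tables (the slide notes the real ones are hard-coded too).
STANDARD_FORMS = {"alpha-CN": "alpha casein"}
GAZETTEER = ["casein", "alpha casein"]
OVERMATCH_FOLLOWERS = ["kinase"]  # "casein kinase" is not a milk protein


def tag_proteins(sentence):
    """Standardise names, then wrap gazetteer hits in <prot> tags."""
    for short, full in STANDARD_FORMS.items():
        sentence = re.sub(r"\b%s\b" % re.escape(short), full, sentence)

    # Longest names first so "alpha casein" wins over "casein".
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(
        r"\b(%s)\b(?!\s+(?:%s)\b)" % (
            "|".join(re.escape(n) for n in names),
            "|".join(re.escape(w) for w in OVERMATCH_FOLLOWERS),
        ),
        re.IGNORECASE,
    )
    return pattern.sub(r"<prot>\1</prot>", sentence)


tagged_standardised = tag_proteins("alpha-CN binds calcium")
tagged_overmatch = tag_proteins("casein kinase phosphorylates casein")
```

The negative lookahead is what prevents the overmatch: a gazetteer name directly followed by a blocking word is left untagged.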
30Comparative filtration
- Compare filter combinations
- Calibrate with known link (Raynaud's disease - Fish Oil)
- Highest rank of blood viscosity
- Dependence on topic?
- Combine type ranks
- MeSH terms
- Substances terms
- Title
- Abstract
31Targets
- Milk proteins
- Largely digested
- Maternal regulation
- Milk peptides
- Can reach blood stream, stable
- Receptor binding
- Protein binding
- Immunoresponse
32Targets
- Plasmin remodelling
- Plasmin levels increase during parturition and involution
- Hypothesis: peptides involved in restructure
- Extension: are peptides involved in apoptosis, hyperplasia?
- Role of the abundant proteins
- MFGL
- Xanthine Oxidase
- CD36
- β-lactoglobulin
34Information kept
- Defined area (milk) therefore can store detailed info., unlike generic system
- Known assoc with strength
- Unknown assoc with strength
- LinkOuts
- Main MeSH terms
- MeSH terms
- Substance terms
- MMTx concepts
37Problems
- No directionality on relationships
- Incorrect MMTx tagging
- Peptide literature
- Small(ish) amount of named peptide data
- Need to text-mine peptides; however, also a strength as more disparate data
- Species/age differentiation (by MeSH?)
38Conclusions
- Co-occurrence relationships derived for milk protein/peptides and other terms
- Hypothesis generation to identify new knowledge
- Information stored for user access
39Future work
- Debug!
- Species/age specificity by MeSH term?
- Check incorrect MMTx tagging
- add bioactive peptides to source data
- Link proteins to milkER sequence database
- Finish user interface
- Learn Java ?
40Acknowledgements
- Prof. Lindsay Sawyer
- Dr. Carl Holt (Hannah Research Institute, Ayr)
- Prof. Bonnie Webber (Informatics)
- Dr. Alistair Kerr and Gail Sinclair: technical support
41Miscellaneous
- ArrayPaths, Stratagene
- Huang et al., 2005: PPI extractor program
- Metis (Mitchell et al): flags interesting sentences to user from a UniProt sequence search, crap but nice to have BLAST
- MELISA (Abasolo et al): ontology-based IE
- Genomes to Systems Conference, Manchester, 22-24th March 2006
42References
- Abasolo JM, Gomez M: MELISA. An ontology-based agent for information retrieval in medicine. ECDL Workshop on the Semantic Web 2000.
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001:17-21.
- Aronson AR: Filtering the UMLS Metathesaurus for MetaMap. 2001.
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, 32(Database issue):D267-270.
- iHOP (Information Hyperlinked over Proteins): http://www.pdg.cnb.uam.es/UniPub/iHOP/
- Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 2005, 21 Suppl 2:ii252-ii258.
- Mitchell AL, Divoli A, Kim JH, Hilario M, Selimas I, Attwood TK: METIS: multiple extraction techniques for informative sentences. Bioinformatics 2005, 21(22):4196-4197.
- Narayanasamy V, Mukhopadhyay S, Palakal M, Potter DA: TransMiner: mining transitive associations among biological objects from text. J Biomed Sci 2004, 11(6):864-873.
- Pratt W, Yetisgen-Yildiz M: LitLinker: Capturing Connections Across the Biomedical Literature. K-CAP 2003, 2003.
43References (2)
- Pratt W, Yetisgen-Yildiz M: A study of biomedical concept identification: MetaMap vs. people. AMIA Annu Symp Proc 2003:529-533.
- EBIMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
- Whatizit: http://www.ebi.ac.uk/Rebholz-srv/whatizit
- Shatkay H: Hairpins in bookstacks: information retrieval from biomedical text. Brief Bioinform 2005, 6(3):222-238.
- Srinivasan P: Text mining: Generating hypotheses from MEDLINE. J Am Soc Inf Sci Technol 2004, 55(5):396-413.
- Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics 2004, 20 Suppl 1:I290-I296.
- Weeber M, Klein H, Aronson AR, Mork JG, de Jong-van den Berg LT, Vos R: Text-based discovery in biomedicine: the architecture of the DAD-system. Proc AMIA Symp 2000:903-907.
- Weeber M, Klein H, de Jong-van den Berg LTW, Vos R: Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. J Am Soc Inf Sci Technol 2001, 52(7):548-557.
- Weeber M, Vos R, Klein H, de Jong-van den Berg LTW, Aronson AR, Molema G: Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. J Am Med Inf Assoc 2003, 10(3):252-259.
- Wren JD: Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics 2004, 5:145.