Lecture 2 Tools - PowerPoint PPT Presentation

Provided by: skipg
1
Lecture 2 - Tools
Objective - To familiarize you with the available
WWW resources, so that you can weave your way
through the analysis of your data, and to help you
interpret the analysis results.
2
Computational Biology vs. Biologist using
Computers - Two Different Things
A biologist or medical researcher typically
supplies data to or retrieves data from a
database and analyzes their data using available
tools created by others.
A computational biologist develops the original
tools, applies tools in new ways to make
discoveries in the data, develops and maintains
databases, and attempts to form a bigger picture
from large amounts of complex data.
An applied computational biologist uses
computational tools and laboratory skills
together.
3
Most Important - Use every database and tool
available, public and private. Check them
regularly, for there is new data every
day. Remember: a computer search can save you
years!
4
Tools
  • BLAST - preparing the input, interpreting the
    output.
  • Multiple alignment - assembly.
  • Protein structure visualization.
  • Coding region determination.
  • Feature extraction - CpG islands, polymorphism,
    visualization.

5
Entrez is a search and retrieval system that
integrates information from databases at NCBI.
These databases include nucleotide sequences,
protein sequences, macromolecular structures,
whole genomes, and MEDLINE, through PubMed.
6
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is an
algorithm and a computer program that compares a
query DNA or protein sequence to a database of
other DNA or protein sequences. The results of
that comparison are ranked according to a score,
and then each high-scoring hit is shown with
the residues of the query and the hit aligned to
show the regions of similarity. Search engines
like BLAST can find distant relationships
between a query and a database entry, i.e.
similarities that are far from identity. An
adjustable scoring matrix is used by these
programs to assign a value for a match and a penalty
for a mismatch. This matrix reflects
biological and evolutionary information, such as how
often one residue is replaced by another in related
sequences.
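The match/mismatch idea can be sketched in a few lines. This is only an illustration for an ungapped DNA alignment: the function name and the values +1 and -3 are assumptions chosen for the example, not BLAST's own code or defaults.

```python
def ungapped_score(query, subject, match=1, mismatch=-3):
    """Add a reward for every identical position and a penalty for every difference."""
    if len(query) != len(subject):
        raise ValueError("an ungapped alignment needs two strings of equal length")
    return sum(match if q == s else mismatch
               for q, s in zip(query.upper(), subject.upper()))


# Illustrative values only: a perfect 15-base match, then the same pair with one mismatch.
print(ungapped_score("GCGAGCGTGTGGAAT", "GCGAGCGTGTGGAAT"))
print(ungapped_score("GCGAGCGTGTGGAAT", "GCGAGCGTCTGGAAT"))
```

Raising the mismatch penalty makes the search stricter; lowering it lets more distant similarities through.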
7
Select your database. Be careful!
Searching - BLAST
  • blastp compares an amino acid query sequence
    against a protein sequence database.
  • blastn compares a nucleotide query sequence
    against a nucleotide sequence database.
  • blastx compares the six-frame conceptual
    translation products of a nucleotide query
    sequence (both strands) against a protein
    sequence database.
  • tblastn compares a protein query sequence
    against a nucleotide sequence database
    dynamically translated in all six reading
    frames (both strands).
  • tblastx compares the six-frame translations of
    a nucleotide query sequence against the
    six-frame translations of a nucleotide
    sequence database.

http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
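The five programs differ only in the type of query and the type of database, so the choice can be written as a lookup table. The dictionary and function names below are made up for this sketch; only the program names come from the slide.

```python
# (query type, database type) -> BLAST programs that accept that combination
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def programs_for(query_type, database_type):
    """Return the candidate programs for a query/database pairing (empty if none)."""
    return BLAST_PROGRAMS.get((query_type, database_type), [])


print(programs_for("nucleotide", "protein"))     # a DNA query against a protein database
print(programs_for("nucleotide", "nucleotide"))  # two options: compare as DNA or as translations
```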
8
Pre-Filtering before a BLAST search
DNA sequences, especially those of mammals and
plants, contain a large number of repeated
sequences, like CACACACACACACA... The purpose
of these sequences is unknown at
this time. Since database entries and queries
often contain many repeat sequences,
spurious similarities due only to
the presence of these sequences often occur. This
detracts from identifying important real
similarities. To eliminate spurious hits to
repeat sequences, query sequences are usually
filtered and masked so that the repeats make
no contribution to the overall similarity score.
Careful, these simple sequence databases are
incomplete, and vary from species to species!
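As a rough illustration of masking, the sketch below replaces runs of a short repeated unit with N so they cannot contribute to a score. It is an assumption-laden toy: real filters rely on repeat databases and low-complexity detection, and the function name and thresholds (`max_unit`, `min_copies`) are invented here.

```python
import re


def mask_simple_repeats(seq, max_unit=3, min_copies=5):
    """Replace runs such as CACACACACA (a 1-3 base unit repeated 5+ times) with N's."""
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1),
        re.IGNORECASE,
    )
    # Keep the sequence length unchanged so alignment coordinates still line up.
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_simple_repeats("GATTACACACACACACAGGCT"))
```

Masking with N rather than deleting the repeat keeps positions in the masked query comparable to positions in the original.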
9
BLAST inputs.
  • Query, usually in FASTA format:
    >Skipgene
    CAGTATAGTATATCAT
  • Search Parameters
  • Number of Hits to save.
  • Search as DNA (4 nucleotide) or translated
    protein (amino acid) sequence
  • Similarity (PAM) matrix, a matrix of penalties
    used to compute the similarity score given the
    types of discrepancies between the query and the
    database entries.
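A FASTA query is a header line starting with ">" followed by one or more sequence lines. A minimal reader might look like the sketch below; the function name is an assumption, and it ignores edge cases that a full parser would handle.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


print(parse_fasta(">Skipgene\nCAGTATAGTATATCAT\n"))
```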

10
BLAST output components.
  • Execution statistics - database, its size, size
    of the query.
  • Hits - entries in the databases that have the
    highest similarity to the query.
  • Alignments - a base-by-base or residue-by-residue
    comparison that can be inspected by eye to
    confirm regions of similarity.

11
How the researcher uses similarity results.
  • Each user typically inspects the score or the
    P(N) value and has a particular threshold beyond
    which that individual considers a hit significant;
    others use alignments, voodoo, etc.
  • The user then inspects the short description for
    keywords or information of biological interest to
    him/her. The biological background and specific
    research objective greatly affects what is of
    interest.
  • For hits of interest, the user typically will
    inspect the alignment to confirm a real
    similarity.
  • For each hit of interest, the user might retrieve
    the full database entry and inspect the complete
    annotation.

12
Let's start somewhere: how about a short set of
sequences you saw as a marker in some paper?
Where can we go from there?
GCGAGCGTGTGGAAT GACGACCACAACTA
How about complementing one of them to put on the
same strand, concatenate with an n so that you
know where you joined them, and submit to BLASTn,
WITH GAPs.
GCGAGCGTGTGGAATnCTGCTGGTGTTGAT
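The complement-and-join step is easy to script. The helper names below are made up for this sketch; the two marker sequences are the ones on the slide.

```python
# Translation table for the base-by-base complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Complement each base without reversing the order, as done on the slide."""
    return seq.translate(COMPLEMENT)


def join_markers(first, second, spacer="n"):
    """Put the second marker on the same strand as the first and mark the joint."""
    return first + spacer + complement(second)


print(join_markers("GCGAGCGTGTGGAAT", "GACGACCACAACTA"))
# GCGAGCGTGTGGAATnCTGCTGGTGTTGAT
```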
13
You get this back:

Sequences producing significant alignments                        Score (bits)  E Value
gb|AF110314|AF110314 Homo sapiens herpesvirus immunoglobuli...    36            0.27
gb|AF060231|AF060231 Homo sapiens herpesvirus entry protein...    36            0.27
ref|NM_002855.1|HVEC Homo sapiens herpesvirus entry mediat...     36            0.27
emb|Z34275|UUTUFG U.urealyticum tuf gene for elongation fac...    32            4.2
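Applying a personal significance threshold to a hit list like this one takes only a filter and a sort. The scores and E values below are copied from the result above; the cutoff of 1.0 and the function name are assumptions for illustration, not a recommended setting.

```python
# (accession, score in bits, E value) for the four hits returned above
hits = [
    ("AF110314", 36, 0.27),
    ("AF060231", 36, 0.27),
    ("NM_002855.1", 36, 0.27),
    ("Z34275", 32, 4.2),
]


def significant(hits, max_e_value=1.0):
    """Keep hits at or below the E-value cutoff, best (lowest E, highest score) first."""
    kept = [h for h in hits if h[2] <= max_e_value]
    return sorted(kept, key=lambda h: (h[2], -h[1]))


for accession, bits, e_value in significant(hits):
    print(accession, bits, e_value)
```

With this cutoff the U.urealyticum hit (E value 4.2) drops out and the three human herpesvirus-related entries remain.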
14
The alignment on the second looks good, so click
on it and let's see what is up.

gb|AF060231|AF060231 Homo sapiens herpesvirus entry protein C (HVEC) mRNA, complete cds
Length = 1710

Score = 36.2 bits (18), Expect = 0.27
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 1    gcgagcgtgtggaatncctgc 21
Sbjct: 437  gcgagcgtgtggaattcctgc 457

Score = 32.2 bits (16), Expect = 4.2
Identities = 16/16 (100%)
Strand = Plus / Plus

Query: 17   cctgctggtgttgatt 32
Sbjct: 1248 cctgctggtgttgatt 1263
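The "Identities" figure can be checked by counting matching columns in the aligned pair. A small sketch, with an invented function name, using the two aligned segments above:

```python
def identities(query, subject):
    """Return (identical columns, alignment length, rounded percent identity)."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    # A gap character never counts as an identity.
    same = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return same, len(query), round(100 * same / len(query))


print(identities("gcgagcgtgtggaatncctgc", "gcgagcgtgtggaattcctgc"))
print(identities("cctgctggtgttgatt", "cctgctggtgttgatt"))
```

The single non-identical column in the first segment is the "n" spacer inserted where the two markers were joined.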
15
The HVEC DNA sequence can be retrieved.
GCGAGCGTGTGGAATTCCTGCGGCCCTCCTTCACCGATGGCACTATCCGC
CTCTCCCGCCTGGAGCTGGA GGATGAGGGTGTCTACATCTGCGAGTTTG
CTACCTTCCCTACGGGCAATCGAGAAAGCCAGCTCAATCTC ACGGTGAT
GGCCAAACCCACCAATTGGATAGAGGGTACCCAGGCAGTGCTTCGAGCCA
AGAAGGGGCAGG ATGACAAGGTCCTGGTGGCCACCTGCACCTCAGCCAA
TGGGAAGCCTCCCAGTGTGGTATCCTGGGAAAC TCGGTTAAAAGGTGAG
GCCAGAGTACCAGGAGACTCCGGAACCCCAATGGCACCAGTGACGGTCAT
CAGC CGCTACCGCCTGGTGCCCAGCAGGGAAGCCCACCAGCAGTCCTTG
GCCTGCATCGTCAACTACCACATGG ACCGCTTCAAGGAAAGCCTCACTC
TCAACGTGCAGTATGAGCCTGAGGTAACCATTGAGGGGTTTGATGG CAA
CTGGTACCTGCAGCGGATGGACGTGAAGCTCACCTGCAAAGCTGATGCTA
ACCCCCCAGCCACTGAG TACCACTGGACCACGCTAAATGGCTCTCTCCC
CAAGGGTGTGGAGGCCCAGAACAGAACCCTCTTCTTCA AGGGACCCATC
AACTACAGCCTGGCAGGGACCTACATCTGTGAGGCCACCAACCCCATCGG
TACACGCTC AGGCCAGGTGGAGGTCAATATCACAGAATTCCCCTACACC
CCGTCTCCTCCCGAACATGGGCGGCGCGCC GGGCCGGTGCCCACGGCCA
TCATTGGGGGCGTGGCGGGGAGCATCCTGCTGGTGTTGATTGTGGTCGGC
G
There are a lot of directions one can go from
here.
16
M_002855 . Homo sapiens herpe...gi4506336

LOCUS       HVEC         1557 bp    mRNA            PRI       10-NOV-1999
DEFINITION  Homo sapiens herpesvirus entry mediator C (poliovirus
            receptor-related 1 nectin) (HVEC), mRNA.
ACCESSION   NM_002855
NID         g4506336
VERSION     NM_002855.1  GI:4506336
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1557)
  AUTHORS   Lopez,M., Eberle,F., Mattei,M.G., Gabert,J., Birg,F.,
            Bardin,F., Maroc,C. and Dubreuil,P.
  TITLE     Complementary DNA characterization and chromosomal
            localization of a human gene related to the poliovirus
            receptor-encoding gene
  JOURNAL   Gene 155 (2), 261-265 (1995)
  MEDLINE   95237621
REFERENCE   2  (bases 1 to 1557)
  AUTHORS   Geraghty RJ, Krummenacher C, Cohen GH, Eisenberg RJ and
            Spear PG.
  TITLE     Entry of alphaherpesviruses mediated by poliovirus
            receptor-related protein 1 and poliovirus receptor
  JOURNAL   Science 280 (5369), 1618-1620 (1998)
  MEDLINE   98279152
REFERENCE   3  (bases 1 to 1557)
  AUTHORS   Cocchi F, Menotti L, Mirandola P, Lopez M and
            Campadelli-Fiume G.
  TITLE     The ectodomain of a novel member of the immunoglobulin
            subfamily related to the poliovirus receptor has the
            attributes of a bona fide receptor for herpes simplex
            virus types 1 and 2 in human cells
  JOURNAL   J. Virol. 72 (12), 9992-10002 (1998)
  MEDLINE   99030909
COMMENT     REFSEQ: This reference sequence was derived from X76400.1.
            PROVISIONAL RefSeq: This is a provisional reference
            sequence record that has not yet been subject to human
            review. The final curated reference sequence record may
            be somewhat different from this one.

Inspect the annotation in the GenBank entry.
17
FEATURES             Location/Qualifiers
     source          1..1557
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /map="11q23-q24"
                     /clone_lib="cDNA in pSPORT"
     gene            1..1557
                     /gene="HVEC"
                     /note="PVRL1 HIGR PRR1 PVRR1 SK-12"
                     /db_xref="LocusID:5818"
                     /db_xref="MIM:600644"
     CDS             1..1557
                     /gene="HVEC"
                     /codon_start=1
                     /db_xref="LocusID:5818"
                     /db_xref="MIM:600644"
                     /product="herpesvirus entry mediator C
                     (poliovirus receptor-related 1 nectin)"
                     /protein_id="NP_002846.1"
                     /db_xref="PID:g4506337"
                     /db_xref="GI:4506337"
                     /db_xref="SPTREMBL:Q15223"
                     /translation="MARMGLAGAAGRWWGLALGLTAFFLPGVHSQVVQVN
DSMYGFIG TDVVLHCSFANPLPSVKITQ
VTWQKSTNGSKQNVAIYNPSMGVSVLAPYRERVEFLRP
SFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNLTV
MAKPTNWIEGTQAVLRAK
KGQDDKVLVATCTSANGKPPSVVSWETRLKGEARVPGDSGTPMAPVTVIS
RYRLVPSR EAHQQSLACIVNYHMDRFKE
SLTLNVQYEPEVTIEGFDGNWYLQRMDVKLTCKADANP
PATEYHWTTLNGSLPKGVEAQNRTLFFKGPINYSLAGTYI
CEATNPIGTRSGQVEVNI
TEFPYTPSPPEHGRRAGPVPTAIIGGVAGSILLVLIVVGGIVVALRRRRH
TFKGDYST KKHVYGNGYSKAGIPQHHPP
MAQNLQYPDDSDDEKKAGPLGGSSYEEEEEEEEGGGGG
ERKVGGPHPKYDEDAKRPYFTVDEAEARQDGYGDRTLGYQ
YDPEQLDLAENMVSQNDG
18
Let's see if there is any information we can dig
up on the protein.
19
>gi|4506337|ref|NP_002846.1|pHVEC herpesvirus entry mediator C (poliovirus receptor-related 1 nectin)
MARMGLAGAAGRWWGLALGLTAFFLPGVHSQVVQVNDSMYG
FIGTDVVLHCSFANPLPSVKITQVTWQKS TNGSKQNVAIYNPSMGVSVL
APYRERVEFLRPSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNL
TVMAKPTNWIEGTQAVLRAKKGQDDKVLVATCTSANGKPPSVVSWETRL
KGEARVPGDSGTPMAPVTVIS RYRLVPSREAHQQSLACIVNYHMDRFKE
SLTLNVQYEPEVTIEGFDGNWYLQRMDVKLTCKADANPPATE YHWTTLN
GSLPKGVEAQNRTLFFKGPINYSLAGTYICEATNPIGTRSGQVEVNITEF
PYTPSPPEHGRRA GPVPTAIIGGVAGSILLVLIVVGGIVVALRRRRHTF
KGDYSTKKHVYGNGYSKAGIPQHHPPMAQNLQYP DDSDDEKKAGPLGGS
SYEEEEEEEEGGGGGERKVGGPHPKYDEDAKRPYFTVDEAEARQDGYGDR
TLGYQ YDPEQLDLAENMVSQNDGSFISKKEWYV
Let's work from the FASTA format of the protein
sequence.

ref|NP_002846.1|PHVEC herpesvirus entry mediator C (poliovirus receptor-related 1 nectin)
>gi|1082702|pir||JC4024 poliovirus receptor-related protein - human
>gi|732796|emb|CAA53980| (X76400) PRR1 Homo sapiens
Length = 518

Score = 57.0 bits (135), Expect = 6e-08
Identities = 33/119 (27%), Positives = 61/119 (50%), Gaps = 5/119 (4%)

Query: 2  VVYTDREVYGAVGSQVTLHCSFWSSEWVSDDISFTWRYQPEGGRDAISIFHYAKGQPYID 61
Sbjct: 32 VVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKSTNGSKQNVAIYNPSMG---VS 88

Query: 62 EVGTFKERIQWVGDPSWKDGSIVIHNLDYSDNGTFTCDVKNPPDIVGKTSQVTLYVFEK 120
Sbjct: 89 VLAPYRERVEFL-RPSFTDGTIRLSRLELEDEGVYICEFATFP-TGNRESQLNLTVMAK 145

Myelin Membrane Adhesion Molecule is one thing we
get back.
20
And the sequence of the other hit, for HIgR
>gi|4154346|gb|AAD04944.1| herpesvirus immunoglobulin-like receptor HIgR
MARMGLAGAAGRWWGL
ALGLTAFFLPGVHSQVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVT
WQKS TNGSKQNVAIYNPSMGVSVLAPYRERVEFLRPSFTDGTIRLSRLE
LEDEGVYICEFATFPTGNRESQLNL TVMAKPTNWIEGTQAVLRAKKGQD
DKVLVATCTSANGKPPSVVSWETRLKGEAEYQEIRNPNGTVTVISR YRL
VPSREAHQQSLACIVNYHMDRFKESLTLNVQYEPEVTIEGFDGNWYLQRM
DVKLTCKADANPPATEY HWTTLNGSLPKGVEAQNRTLFFKGPINYSLAG
TYICEATNPIGTRSGQVEVNITEKPRPQRGLGSAARLL AGTVAVFLILV
AVLTVFFLYNRQQKSPPETDGAGTDQPLSQKPEPSPSRQSSLVPEDIQVV
HLDPGRQQQ QEEEDLQKLSLQPPYYDLGVSPSYHPSVRTTEPRGECP
Can you identify motifs, or highly conserved
regions, in these sequences? Try
http://www.sdsc.edu/MEME/meme/website/
What about conserved regions for Myelin and HVEC, for
which sequence homology was found?
21
Let's put HVEC and Myelin into MEME.
The following motifs are found.

DATABASE meme.30154.data (peptide)
Last updated on Tue Nov 23 05:59:31 1999
Database contains 1 sequences, 642 residues

MOTIFS meme.30154.results (peptide)
MOTIF WIDTH BEST POSSIBLE MATCH
----- ----- -------------------
  1     8   VYTCEFAN
  2    12   ERHEQSLTCNVD
  3    12   RSSQVNLNVFEK
  4     8   PSWNDGSI
  5    12   VSWQKRLKGEKR
Myelin HVEC
22
Myelin membrane adhesion molecule, which has a
solved structure, shares motifs 1, 3, and 4
with HVEC.
Myelin / HVEC homology: motif hits along the combined sequence

Start  Motif  p-value  Best possible match
76     4      7.7e-11  PSWNDGSI
       1      1.0e-05  VYTCEFAN
       3      1.1e-11  RSSQVNLNVFEK
PSWKDGSIVIHNLDYSDNGTFTCDVKNPPDIVGKTSQVTLYVFEKVPTRMARMGLAGAAGRWWGLALGLTAFFLP

151    1      1.3e-07  VYTCEFAN
       5      3.4e-12  VSWQKRLKGEKR
GVHSQVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKSTNGSKQNVAIYNPSMGVSVLAPYRERVEFLR

226    4      1.2e-08  PSWNDGSI
       1      5.0e-08  VYTCEFAN
       3      8.5e-09  RSSQVNLNVFEK
       1      5.2e-06  VYTCEFAN
PSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNLTVMAKPTNWIEGTQAVLRAKKGQDDKVLVATCTSAN

301    5      4.3e-13  VSWQKRLKGEKR
       2      4.3e-09  ERHEQSLTCNVD
       2      2.6e-08  ERHEQSLTCNVD
GKPPSVVSWETRLKGEARVPGDSGTPMAPVTVISRYRLVPSREAHQQSLACIVNYHMDRFKESLTLNVQYEPEVT
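To see where a motif sits in a sequence, one crude approach is to slide the motif's best possible match along the sequence and count identical residues in each window. This is a simplification and an assumption of this sketch: MEME scores motifs with position-specific probabilities, not plain identity counts, and the function name is invented.

```python
def best_match(motif, sequence):
    """Return (start index, identical residues) of the best ungapped window."""
    best_pos, best_score = -1, -1
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        score = sum(1 for m, w in zip(motif, window) if m == w)
        if score > best_score:  # strict '>' keeps the leftmost of any ties
            best_pos, best_score = start, score
    return best_pos, best_score


# Part of the HVEC region that carries motif 1 in the diagram above.
hvec_segment = "PSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNLTVMAK"
pos, score = best_match("VYTCEFAN", hvec_segment)
print(pos, score, hvec_segment[pos:pos + 8])
```

The best window is VYICEFAT, which agrees with the motif's best possible match at 6 of 8 positions.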
23
There is no solved structure for our protein, so
we take the protein sequence BLAST results and
see if we turn up any that are solved, and then
look at those.

Sequences producing significant alignments                          Score (bits)  E Value
pdb|1NEU|   Structure Of Myelin Membrane Adhesion Molecule P0       57            4e-09
pdb|1BIH|A  Chain A, Crystal Structure Of The Insect Immune ...     34            0.025
pdb|2H1P|L  Chain L, The Three-Dimensional Structures Of A P...     34            0.033
pdb|1A3L|L  Chain L, Catalysis Of A Disfavored Reaction An ...      33            0.056
pdb|1A4J|L  Chain L, Diels Alder Catalytic Antibody Germline...     33            0.056

Lucky, Myelin Membrane Adhesion Molecule is
solved!
Region of high homology on the outside of the
protein, perhaps a hint as to a domain involved
in some kind of interaction, maybe not.
24
Can we find a genomic clone for this sequence?
Why would we want to?

gb|AC015907.1|AC015907 Homo sapiens clone RP11-48A13, LOW-PASS SEQUENCE SAMPLING
Length = 55317

Score = 48.1 bits (24), Expect = 0.005
Identities = 24/24 (100%)
Strand = Plus / Plus

Query: 878   gtgtggaggcccagaacagaaccc 901
Sbjct: 44103 gtgtggaggcccagaacagaaccc 44126

Maybe, only maybe, because this is probably a
repeat sequence that passed the filters (WHY?),
but it might be worth trying to see what the rest
of the sequence looked like from this Roswell
Park clone, but the link is not good. Dead end?
25
Maybe we can find it by electronic PCR.
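The idea behind electronic PCR is to look for a forward primer and, downstream of it, the reverse complement of a reverse primer within a plausible product size. The sketch below uses exact string matching only; it is not the e-PCR program itself, and the function names, primers, and template are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def electronic_pcr(template, forward, reverse, max_product=2000):
    """Return (start, product size) for each primer pair placement found."""
    template, forward, reverse = template.upper(), forward.upper(), reverse.upper()
    reverse_site = reverse_complement(reverse)  # how the reverse primer appears on the top strand
    products = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        if end != -1:
            size = end + len(reverse_site) - start
            if size <= max_product:
                products.append((start, size))
        start = template.find(forward, start + 1)
    return products


# Hypothetical template: 4 bases, forward site, 10-base spacer, reverse site, 4 bases.
template = "AAAA" + "GCGAGCGTGT" + "TTTTTTTTTT" + "CCTGCTGGTG" + "AAAA"
reverse_primer = reverse_complement("CCTGCTGGTG")
print(electronic_pcr(template, "GCGAGCGTGT", reverse_primer))
```

A practical version would also tolerate mismatches and search both strands.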
26
What is missing?
  • Unification and integration of the analysis.
  • In-depth analysis.
  • The big picture?
  • Tools that work on many pieces of data at once.
    Data mining.
  • Expression database - mRNA, proteins.
  • Other?

Now, a quick look at a couple of stabs at this
list!
27
Local Software Projects
  • BANAL - NLP/Bayesian Network analysis of
    Expression Arrays
  • ARROGANT - Optimized Expression Array Design
    and Analysis
  • X-Hyb - Looking for cross-hybridization in
    Expression Arrays
  • MAD PAD - Expression Array database and layout
  • Protein Molecular Dynamics - Sequence
    polymorphism effects on solved protein
    structures
  • SNIDE - SNP prediction
  • Rep-X (aka UniPOMPOUS) - simple sequence repeat
    polymorphism prediction
28
PANORAMA - a new server for Integrated Genomic
Sequence Analysis
  • Genomic sequence features visualization
  • Preparation for Expert System Based Analysis
  • GenBank (EST and non-EST) homologies
  • Gene prediction (GenScan)
  • POMPOUS
  • New - control / recognition sequences,
  • Transcription factors, CpG islands, enhancers,
    termination sequences; more on the way!

29

30

PANORAMA - Integrated Analysis on The WWW
BLAST, CpG islands, GenScan, Repeats, POMPOUS ...
Java soon.
31
Polymorphism prediction software
  • SNIDE (SNp IDEntification) - Predict high-impact,
    high-probability SNPs.
  • POMPOUS - Prediction of polymorphic markers for
    allelotyping (PNAS, June 98, Vol. 95 p7514-19)
  • Rep-X (UniPOMPOUS) - Improvement of POMPOUS code
    and application to expressed gene sequences via
    UniGene

32
Rep-X (Repeat eXpansions within mRNAs)
Background on Nucleic acid repetitive elements
  • Repeating sequence units (microsatellites) have
    been known for a long time to undergo expansion
    and contraction of the base repeat unit
  • Slipped-strand mispairing and unequal
    recombination thought to be responsible
  • Well known polymorphic sequence units CA
    (intervening sequence) and CAG (linked to several
    neurological disorders).
  • Polymorphic repeat units mentioned in the
    literature range from 1 to over 250 bp.
  • Impact of polymorphisms found in all regions

Region    Example disorder
5' UTR    Hyperandrogenaemia
CDS       Haw River Syndrome
Intron    Friedreich's Ataxia
3' UTR    Myotonic Dystrophy
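Candidate repeat tracts like these can be located with a regular expression that looks for a short unit followed by copies of itself. This is a generic sketch, not the Rep-X or POMPOUS code; the function names and the thresholds (`max_unit`, `min_copies`) are assumptions.

```python
import re


def is_primitive(unit):
    """True unless the unit is itself a repeat of a shorter unit (e.g. 'CACA')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_tandem_repeats(seq, max_unit=6, min_copies=4):
    """Return sorted (start, unit, copies) for perfect tandem repeats in a DNA string."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # report CA x 6, not CACA x 3
                hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)


print(find_tandem_repeats("TTGCACACACACACAGGT"))
```

Only perfect repeats are reported; interrupted tracts would need a more tolerant search.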
33
Reasons to study
  • Candidates for genetic diseases
  • Candidates for phenotypic variations
  • Polymorphism profile indicative of functional
    role for protein region
  • Nature may use non-degenerate codon repeats for
    more rapid evolutionary response to selection
    pressure
  • Learn more about roles of peptide repeats

34
Computational Process
  • Download UniGene (Unique Gene) dataset of
    assembled EST sequences
  • Longest, cleanest sequence obtained for each
    Unigene cluster
  • Program run on entire Unigene database (10/99
    build 85,639 entries)
  • Candidates for follow-up experimental study
    picked by repeat type, location and interest in
    the gene

35
Example Follow-up on 30 patient DNAs
  • Herpes Virus Entry Protein C AGG(8)

5'-----------start--------------------------X---------stop----------------------->3'
  • Variable resistance to HSV infection in
    population
  • HSV unable to penetrate cell in C-terminal
    deletion experiment (including Glu repeat)
  • Glu region bears homology to calcium-based
    transporters
  • HSV unable to enter without calcium present

36
Experimental Verification Results
  • Out of 146 genes chosen for testing, 102
    amplified and 54 were polymorphic (53%)
  • Tested on 30 patient DNA samples.

37
We can predict repeat polymorphisms, and there
are a lot of them.
We have found defective entries in UniGene that
result in overprediction of the number of genes
by 20%.
38
Data Mining - What is it? Certainly a
fashionable term.
39
and public servers are available for SQL queries
to linked data
[Figure: MEDLINE, GenBank, Protein Sequences, Genomes and Structures joined by Links, with example DNA and protein sequences]
40
Some are using simple mining methods on titles
  • pieces of evidence extracted from titles of
    articles in the biomedical literature (Swanson
    1988, Swanson and Smalheiser 1997)

  • stress can lead to loss of magnesium
  • magnesium is a natural calcium channel blocker
  • high levels of magnesium inhibit SCD
  • magnesium can suppress platelet aggregability

  • stress is associated with migraines
  • calcium channel blockers prevent some migraines
  • spreading cortical depression (SCD) is
    implicated in some migraines
  • migraine patients have high platelet
    aggregability

  • led to the discovery
  • magnesium deficiency may play a role in migraine
    headaches
  • confirmed by subsequent study (Ramadan et al.
    1989)
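The reasoning in this example can be mimicked with a tiny graph: two concepts that never appear together in a title may still be connected through concepts they each share with others. The pairs below are the slide's statements reduced to two concepts each; the reduction and the function name are assumptions of this sketch, not Swanson's actual procedure.

```python
from collections import defaultdict

# One pair per statement on the slide.
STATEMENTS = [
    ("stress", "magnesium"),
    ("magnesium", "calcium channel blocker"),
    ("magnesium", "SCD"),
    ("magnesium", "platelet aggregability"),
    ("stress", "migraine"),
    ("calcium channel blocker", "migraine"),
    ("SCD", "migraine"),
    ("platelet aggregability", "migraine"),
]


def linking_concepts(statements, a, c):
    """Concepts mentioned with both a and c, provided a and c never co-occur."""
    neighbours = defaultdict(set)
    for x, y in statements:
        neighbours[x].add(y)
        neighbours[y].add(x)
    if c in neighbours[a]:
        return set()  # already directly connected, nothing hidden to find
    return neighbours[a] & neighbours[c]


print(sorted(linking_concepts(STATEMENTS, "magnesium", "migraine")))
```

Four independent links between magnesium and migraine, with no direct statement connecting them, are what made the hypothesis worth testing.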

41
and there is still enormous opportunity!!
  • Traditional approach
  • one thread is followed through several databases
  • Result
  • finding A is related to sequence B and structure C

text databases
A
sequence databases
structure databases
42
and there is still enormous opportunity!!
  • Directed approaches
  • keyword/grammar based SQL
  • simple data mining on titles and abstracts
  • Result
  • finding A is related to other findings of the
    same data type

text databases
A
sequence databases
structure databases
43
and there is still enormous opportunity!!
  • Machine learning
  • data mining on full texts and other biomedical
    data
  • Result
  • finding A is related to other findings of the
    same data type through connections found among
    other datatypes

text databases
A
sequence databases
structure databases
44
because tree overlap may be new associations
?
45
and EMILE (Perot Systems) for knowledge discovery
  • Entity Modeling Intelligent Learning Engine
  • language-independent grammar induction
  • cluster analysis on text to identify semantic and
    syntactic clusters
  • clustering of biomedical data based on concepts
  • will use for biological knowledge base
    construction
  • concept clustering may reveal previously
    undiscovered knowledge
  • molecular interaction networks (MINe) to serve as
    basis for static cell and dynamic cell modeling

46
EMILE results: dataset too small to learn the
language
91 PubMed abstracts; keywords: cancer, polymorphism
47
EMILE results: interesting associations are
discovered when clusters are inspected manually
  • EMILE realized that these are related
  • 94 --> Chinese
  • 94 --> Japanese
  • 94 --> Polish
  • And found a biological connection among diverse
    verb use
  • 11 --> LOH was 105
  • 105 --> identified in 13 cases (72
  • 105 --> detected in 9 of 87 informative cases
    (10
  • 105 --> observed in 5 (55

48
Some vision of what is to come.
  • Data mining will become important given the
    amount of data becoming available.
  • Patient/phenotypes become increasingly important
    for identifying genes and their function within
    existing genomic sequence.
  • Genome-Transcriptome-Proteome are unified.
  • Understanding of complex systems (humans)
    possible from network analysis and computers.
  • Novel genes will continue to be discovered as we
    sequence more organisms.

49
Closing message
The intent of this set of lectures was to
introduce you to the wide variety of data and
tools that are available, mainly on the WWW, and
to encourage you to use these tools. For an
in-depth understanding of the organization of the
data and the algorithms that are the basis of the
tools, there may be a new course next year.