Medical Informatics - PowerPoint PPT Presentation

About This Presentation
Title:

Medical Informatics

Slides: 65
Provided by: anilC1

Transcript and Presenter's Notes

Title: Medical Informatics


1
Medical Informatics
Bioinformatics
2
Biomedical informatics: The broad discipline
concerned with the study and application of
computer science, information science,
informatics, cognitive science and human-computer
interaction in the practice of biological
research, biomedical science, medicine and
healthcare. Bioinformatics, clinical informatics
and public health informatics or medical
informatics can be considered as sub-domains
within biomedical informatics.
Bioinformatics: The merger of biotechnology and
information technology with the goal of revealing
new insights and principles in biology, OR the
science of managing and analyzing biological data
using advanced computing techniques. Especially
important in analyzing genomic research data.
Health Informatics or Medical Informatics: The
intersection of information science, computer
science, and health care. It deals with the
resources, devices, and methods required to
optimize the acquisition, storage, retrieval, and
use of information in health and biomedicine.
Wikipedia
3
Mining Bio-Medical Mountains: How Computer Science
can help Biomedical Research and Health Sciences
Anil Jegga, Division of Biomedical Informatics,
Cincinnati Children's Hospital Medical Center
(CCHMC), Department of Pediatrics, University of
Cincinnati
http://anil.cchmc.org
Anil.Jegga@cchmc.org
4
Algorithm: A fixed procedure embodied in a
computer program.
Base: One of the molecules that form DNA and RNA
molecules.
Base pair: Two nitrogenous bases (adenine and
thymine or guanine and cytosine) held together by
weak bonds. Two strands of DNA are held together
in the shape of a double helix by the bonds
between base pairs.
Wikipedia
5
Nucleotide: A subunit of DNA or RNA consisting of
a nitrogenous base (adenine, guanine, thymine, or
cytosine in DNA; adenine, guanine, uracil, or
cytosine in RNA), a phosphate molecule, and a
sugar molecule (deoxyribose in DNA and ribose in
RNA). Thousands of nucleotides are linked to form
a DNA or RNA molecule.
Genome: All the genetic material in the
chromosomes of a particular organism; its size is
generally given as its total number of base pairs.
Genomics: The study of genes and their function.
Functional Genomics: The study of genes, their
resulting proteins, and the role played by the
proteins in the body's biochemical processes.
Wikipedia
6
Two Separate Worlds..
Medical Informatics
Bioinformatics: the 'omes
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 'omes so far, and there is the UNKNOME too:
genes with no known function
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
With Some Data Exchange
7
The genome is our Genetic Blueprint
  • Nearly every human cell contains 23 pairs of
    chromosomes:
  • 1 to 22, and
  • XY or XX
  • XY: male
  • XX: female
  • The combined length of chromosomes 1 to 22, X,
    and Y is about 3.2 billion bases.

8
The Genome is Who We Are on the inside!
  • Chromosomes consist of DNA
  • molecular strings of A, C, G, T
  • base pairs, A-T, C-G
  • Genes
  • DNA sequences that encode proteins
  • less than 3% of the human genome

Information coded in DNA
9
5000 bases per page..
CACACTTGCATGTGAGAGCTTCTAATATCTAAATTAATGTTGAATCATT
ATTCAGAAACAGAGAGCTAACTGTTATCCCATCCTGACTTTATTCTTTAT
GAGAAAAATACAGTGATTCCAAGTTACCAAGTTAGTGCTGCTTGCTTTAT
AAATGAAGTAATATTTTAAAAGTTGTGCATAAGTTAAAATTCAGAAATAA
AACTTCATCCTAAAACTCTGTGTGTTGCTTTAAATAATCAGAGCATCTGC
TACTTAATTTTTTGTGTGTGGGTGCACAATAGATGTTTAATGAGATCCTG
TCATCTGTCTGCTTTTTTATTGTAAAACAGGAGGGGTTTTAATACTGGAG
GAACAACTGATGTACCTCTGAAAAGAGAAGAGATTAGTTATTAATTGAAT
TGAGGGTTGTCTTGTCTTAGTAGCTTTTATTCTCTAGGTACTATTTGATT
ATGATTGTGAAAATAGAATTTATCCCTCATTAAATGTAAAATCAACAGGA
GAATAGCAAAAACTTATGAGATAGATGAACGTTGTGTGAGTGGCATGGTT
TAATTTGTTTGGAAGAAGCACTTGCCCCAGAAGATACACAATGAAATTCA
TGTTATTGAGTAGAGTAGTAATACAGTGTGTTCCCTTGTGAAGTTCATAA
CCAAGAATTTTAGTAGTGGATAGGTAGGCTGAATAACTGACTTCCTATCA
TTTTCAGGTTCTGCGTTTGATTTTTTTTACATATTAATTTCTTTGATCCA
CATTAAGCTCAGTTATGTATTTCCATTTTATAAATGAAAAAAAATAGGCA
CTTGCAAATGTCAGATCACTTGCCTGTGGTCATTCGGGTAGAGATTTGTG
GAGCTAAGTTGGTCTTAATCAAATGTCAAGCTTTTTTTTTTCTTATAAAA
TATAGGTTTTAATATGAGTTTTAAAATAAAATTAATTAGAAAAAGGCAAA
TTACTCAATATATATAAGGTATTGCATTTGTAATAGGTAGGTATTTCATT
TTCTAGTTATGGTGGGATATTATTCAGACTATAATTCCCAATGAAAAAAC
TTTAAAAAATGCTAGTGATTGCACACTTAAAACACCTTTTAAAAAGCATT
GAGAGCTTATAAAATTTTAATGAGTGATAAAACCAAATTTGAAGAGAAAA
GAAGAACCCAGAGAGGTAAGGATATAACCTTACCAGTTGCAATTTGCCGA
TCTCTACAAATATTAATATTTATTTTGACAGTTTCAGGGTGAATGAGAAA
GAAACCAAAACCCAAGACTAGCATATGTTGTCTTCTTAAGGAGCCCTCCC
CTAAAAGATTGAGATGACCAAATCTTATACTCTCAGCATAAGGTGAACCA
GACAGACCTAAAGCAGTGGTAGCTTGGATCCACTACTTGGGTTTGTGTGT
GGCGTGACTCAGGTAATCTCAAGAATTGAACATTTTTTTAAGGTGGTCCT
ACTCATACACTGCCCAGGTATTAGGGAGAAGCAAATCTGAATGCTTTATA
AAAATACCCTAAAGCTAAATCTTACAATATTCTCAAGAACACAGTGAAAC
AAGGCAAAATAAGTTAAAATCAACAAAAACAACATGAAACATAATTAGAC
ACACAAAGACTTCAAACATTGGAAAATACCAGAGAAAGATAATAAATATT
TTACTCTTTAAAAATTTAGTTAAAAGCTTAAACTAATTGTAGAGAAAAAA
CTATGTTAGTATTATATTGTAGATGAAATAAGCAAAACATTTAAAATACA
AATGTGATTACTTAAATTAAATATAATAGATAATTTACCACCAGATTAGA
TACCATTGAAGGAATAATTAATATACTGAAATACAGGTCAGTAGAATTTT
TTTCAATTCAGCATGGAGATGTAAAAAATGAAAATTAATGCAAAAAATAA
GGGCACAAAAAGAAATGAGTAATTTTGATCAGAAATGTATTAAAATTAAT
AAACTGGAAATTTGACATTTAAAAAAAGCATTGTCATCCAAGTAGATGTG
TCTATTAAATAGTTGTTCTCATATCCAGTAATGTAATTATTATTCCCTCT
CATGCAGTTCAGATTCTGGGGTAATCTTTAGACATCAGTTTTGTCTTTTA
TATTATTTATTCTGTTTACTACATTTTATTTTGCTAATGATATTTTTAAT
TTCTGACATTCTGGAGTATTGCTTGTAAAAGGTATTTTTAAAAATACTTT
ATGGTTATTTTTGTGATTCCTATTCCTCTATGGACACCAAGGCTATTGAC
ATTTTCTTTGGTTTCTTCTGTTACTTCTATTTTCTTAGTGTTTATATCAT
TTCATAGATAGGATATTCTTTATTTTTTATTTTTATTTAAATATTTGGTG
ATTCTTGGTTTTCTCAGCCATCTATTGTCAAGTGTTCTTATTAAGCATTA
TTATTAAATAAAGATTATTTCCTCTAATCACATGAGAATCTTTATTTCCC
CCAAGTAATTGAAAATTGCAATGCCATGCTGCCATGTGGTACAGCATGGG
TTTGGGCTTGCTTTCTTCTTTTTTTTTTAACTTTTATTTTAGGTTTGGGA
GTACCTGTGAAAGTTTGTTATATAGGTAAACTCGTGTCACCAGGGTTTGT
TGTACAGATCATTTTGTCACCTAGGTACCAAGTACTCAACAATTATTTTT
CCTGCTCCTCTGTCTCCTGTCACCCTCCACTCTCAAGTAGACTCCGGTGT
CTGCTGTTCCATTCTTTGTGTCCATGTGTTCTCATAATTTAGTTCCCCAC
TTGTAAGTGAGAACATGCAGTATTTTCTAGTATTTGGTTTTTTGTTCCTG
TGTTAATTTGCCCAGTATAATAGCCTCCAGCTCCATCCATGTTACTGCAA
AGAACATGATCTCATTCTTTTTTATAGCTCCATGGTGTCTATATACCACA
TTTTCTTTATCTAAACTCTTATTGATGAGCATTGAGGTGGATTCTATGTC
TTTGCTATTGTGCATATTGCTGCAAGAACATTTGTGTGCATGTGTCTTTA
TGGTAGAATGATATATTTTCTTCTGGGTATATATGCAGTAATGCGATTGC
TGGTTGGAATGGTAGTTCTGCTTTTATCTCTTTGAGGAATTGCCATGCTG
CTTTCCACAATAGTTGAACTAACTTACACTCCCACTAACAGTGTGTAAGT
GTTTCCTTTTCTCCACAACCTGCCAGCATCTGTTATTTTTTGACATTTTA
ATAGTAGCCATTTTAACTGGTATGAAATTATATTTCATTGTGGTTTTAAT
TTGCATTTCTCTAATGATCAGTGATATTGAGTTTGTTTTTTTTCACATGC
TTGTTGGCTGCATGTATGTCTTCTTTTAAAAAGTGTCTGTTCATGTACTT
TGCCCACATTTTAATGGGGTTGTTTTTCTCTTGTAAATTTGTTTAAATTC
CTTATAGGTGCTGGATTTTAGACATTTGTCAGACGCATAGTTTGCAAATA
GTTTCTCCCATTCTGTAGGTTGTCTGTTTATTTTGTTAATAGTTTCTTTT
GCTATGCAGAAGCTCTTAATAAGTTTAATGAGATCCTGATATGTTAGGCT
TTGTGTCCCCACCCAAATCTCATCTTGAATTATATCTCCATAATCACCAC
ATGGAGAGACCAGGTGGAGGTAATTGAATCTGGGGGTGGTTTCACCCATG
CTGTTCTTGTGATAGTGAATGAGTTCTCACGAGATCTAATGGTTTTATGA
GGGGCTCTTCCCAGCTTTGCCTGGTACTTCTCCTTCCTGCCGCTTTGTGA
AAAAGGTGCATTGCGTCCCTTTCACCTTCTTCTATAATTGTAAGTTTCCT
GAGGCCTTCCCAGCCATGCTGAACTTCAAGTCAATTAAACCTTTTTCTTT
ATAAATTACTCAGTCTCTGGTGGTTCTTTATAGCAGTGTGAAAATGGACT
AATGAAGTTCCCATTTATGAATTTTTGCTTTTGTTGCAATTGCTTTTGAC
ATCTTAGTCATGAAATCCTTGCCTGTTCTAAGTACAGGACGGTATTGCCT
AGGTTGTCTTCCAGGGTTTTTCTAATTTTGTGTTTTGCATTTAAGTGTTT
AATCCATCTTGAGTTGATTTTTGTATATTGTGTAAGGAAGGGGTCCAGTT
TCAATCTTTTGCATATGGCTAGTTAGTTATCCCAGTACCATTTATTGAAA
AGACAGTCTTTTCCCCATCGCTCGTTTTTGTCAGTTTTATTGATGATCAG
ATAATCATAGCTGTGTGGCTTTATTTCTGGGTTCTTTATTCTGTTCTATT
GGTTTATGTCCCTGTTTTTGTGCCAGTACCATGCTGTTTTGGTTAACATA
GCCCTGTAGTATAGTTTGAGGTCAGATAGCCTGATGCTTCCAGCTTTGTT
CTTTTTCTTAAGATTGCCTTGGCTATTTGGCCTCTTTTTTGGTTCCACAT
GAATTTTAAAACAGTTGTTTCTAGTTTTTGAAGAATGTCATTGGTAGTTT
GATAGAAATAGCATTTAATCTGTAAATTGATTTGTGCAGTATGGCCTTTT
AATGATATTGATTCTTCCTATCCATGAGCATGATATGTTTTCCATTTTGT
TTGTATCCTCTCTGATTTCTTTGTGCAGTGTTTTGTAATTCTCATTGTAG
AGATTTTTCACCTCCCTGGTTAGTTGTATTTTACCCTAGATATTTTATTC
TTTTTGTGAAAATTGTGAATGGGATTGCCTTCCTGATTTGACTGCCAGCT
TGGTTACTGTTGGTTTATAGAAATGCTAGTGATTTTTGTACATTGATTTT
CTTTCTAAAACTTTGCTGAAGTTTTTTTTATTAGCAGAAGGAGCTTTGGG
GCTGAGACTATGGGGTTTTCTAGATATAGAATCATGTCAGCTTCAAATAG
GGATAATTTTACTTCCTCTCTTCCTATTTGGATGCCCTTTATTTCTTTCT
CTTGCCTGATTACTCTGGCTGGGATTTCCTATGTTGAATAGGAGTCATGA
GAGAGGGCATCAAATCTACACATATCAAATACTAACCTTGAATGTCTAGA
T
10
How much data make up the human genome?
  • 3 pallets with 40 boxes per pallet x 5000 pages
    per box x 5000 bases per page = 3,000,000,000
    bases!
  • Getting an accurate sequence requires 6-fold
    coverage!
  • Now imagine shredding 18 pallets and
    reassembling them!
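
The arithmetic on this slide can be checked in a few lines. This is only a sketch: the figures come from the slide, while the variable names are made up for the example.

```python
# Figures taken from the slide; the names are illustrative only.
PALLETS = 3
BOXES_PER_PALLET = 40
PAGES_PER_BOX = 5000
BASES_PER_PAGE = 5000
COVERAGE = 6  # 6-fold coverage: each base is read about six times

# One printed copy of the genome.
genome_bases = PALLETS * BOXES_PER_PALLET * PAGES_PER_BOX * BASES_PER_PAGE

# Six copies' worth of paper to shred and reassemble.
pallets_to_shred = PALLETS * COVERAGE
bases_to_reassemble = genome_bases * COVERAGE

print(f"{genome_bases:,} bases in one copy")
print(f"{pallets_to_shred} pallets, {bases_to_reassemble:,} bases to reassemble")
```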

11
Human Genome Project: Initial Stages
  • Most of the initial phases were primarily focused
    on improving and speeding up the technology to
    sequence and analyze DNA.
  • Scientists all around the world worked to make
    detailed maps of our chromosomes and sequence
    model organisms, like worm, fruit fly, and mouse.

Image Courtesy Google Images
12
Overwhelming Challenges
  • First there was the Assembly
  • The DNA sequence is so long that no
    technology can read it all at once, so it was
    broken into pieces.
  • There were millions of clones (small sequence
    fragments).
  • The assembly process included finding where
    the pieces overlapped in order to put the draft
    together.

3,200,000 piece puzzle anyone?
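
To make the "find where the pieces overlap" idea concrete, here is a toy sketch. It assumes error-free fragments and simply merges the pair with the longest overlap, over and over. That greedy strategy is a teaching simplification chosen for this example, not the method used by the Human Genome Project; real assemblers also have to cope with sequencing errors, repeats, and reads from both strands. The function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that overlap the most."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_len:
                        best_len, best_i, best_j = size, i, j
        if best_len == 0:
            return frags  # no overlaps left to merge
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Three overlapping pieces cut from the start of the sequence on slide 9.
reads = ["CACACTTGCATG", "GCATGTGAGAGC", "GAGAGCTTCTAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With millions of fragments, comparing every pair like this would be far too slow, which is part of why assembly was such a challenge.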
13
(No Transcript)
14
The Completion of the Human Genome Sequence
  • On June 26, 2000, President Clinton, with J.
    Craig Venter and Francis Collins, announced
    completion of "the first survey of the entire
    human genome" - an 80% working draft.
  • Publication of 90 percent of the sequence in the
    February 2001 issue of the journal Nature.
  • Completion of 99.99% of the genome as finished
    sequence in July 2003.

Image Courtesy Google Images
15
But the Project is not Done
Human Genome is finally Sequenced!!!
  • Next there is the Annotation
  • The sequence is like a topographical map, the
    annotation would include cities, towns, schools,
    libraries and coffee shops!
  • So, where are the genes?
  • How do genes function?
  • How do we use this information for scientific
    understanding?
  • How does it benefit or improve health care?

16
What do genes do anyway?
  • As per current estimate, we only have 27,000
    genes! That means each gene has to do a lot!
  • Genes make proteins that make up nearly all we
    are (bones, muscles, hair, eyes, etc.).
  • Almost everything that happens in our bodies
    happens because of proteins (walking, digestion,
    fighting disease).

Image Courtesy Google Images
17
Of Mice and Men: It's all in the genes
  • Humans and Mice have about the same number of
    genes. But then why are we so different from each
    other, how is this possible?

Did you say cheese?
Mmm, Cheese!
  • While one human gene can make many different
    proteins, a mouse gene can probably make only a
    few!

Image Courtesy Google Images
18
Genes are important
  • By selecting different pieces of a gene, your
    body can make many kinds of proteins. (This
    process is called alternative splicing.)
  • If a gene is expressed, that means it is turned
    on and will make proteins.
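
A toy model of "selecting different pieces of a gene": assume a gene is a list of exons, that the first and last exon are always kept, and that any of the middle exons may be skipped. That rule and the exon sequences below are assumptions made for this sketch; real splicing is governed by far more than this.

```python
from itertools import combinations


def splice_variants(exons: list[str]) -> list[str]:
    """Every transcript that keeps the first and last exon plus any subset
    of the middle exons, always in their original order."""
    first, *middle, last = exons
    variants = []
    for keep in range(len(middle), -1, -1):
        for chosen in combinations(middle, keep):
            variants.append("".join([first, *chosen, last]))
    return variants


# Made-up exon sequences, for illustration only.
exons = ["ATGGCC", "TTTAAA", "GGGCCC", "TGA"]
for transcript in splice_variants(exons):
    print(transcript)
```

Two optional exons already give four different transcripts; ten optional exons would give 1,024.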

19
What we've learned from our genome so far
  • There are a relatively small number of human
    genes, less than 30,000, but they have a complex
    architecture that we are only beginning to
    understand and appreciate.
  • We know where 85% of genes are in the sequence.
  • We don't know where the other 15% are because we
    haven't seen them turned on (they may only be
    expressed during fetal development).
  • We only know what about 50% of our genes do so
    far.
  • So it is relatively easy to locate genes in the
    genome, but it is hard to figure out what they
    do.

20
How do scientists find genes?
  • The genome is so large that useful information is
    hard to find.
  • Researchers use a "computational microscope" to
    search the genome.
  • Just as you would use Google to find something
    on the internet, researchers can use the Genome
    Browser to find information in the human genome.

Image Courtesy Google Images
21
The Continuing Project
  • Finding the complete set of genes and annotating
    the entire sequence. Annotation is like
    detailing: scientists annotate a sequence by
    listing what has been learnt experimentally and
    computationally about its function.
  • Proteomics is studying the structure and function
    of groups of proteins. Proteins are really
    important, but we don't really understand how
    they work.
  • Comparative Genomics is the process of comparing
    different genomes in order to better understand
    what they do and how they work. Like comparing
    humans, chimpanzees, and mice that are all
    mammals but all quite different.

Image Courtesy Google Images
22
Who works on this stuff anyway?
  • Biologists and Chemists understand the physical
    sciences - they take biology and chemistry
    classes.
  • Computer Scientists program the computers (the
    same people who make video games!) - they take
    math and computer classes.
  • Computer Engineers try to build better, faster,
    smarter computers - they take math, physics and
    computer classes.
  • Social Scientists try to understand how this new
    information and technology will impact our
    lives - they take sociology and philosophy
    classes.

23
How can I work on this project, or something like
it?
  • Read about it online at http://www.genome.gov,
    or in Nature, Science, or other scientific
    magazines.
  • Take classes in biology, chemistry, mathematics
    and physics at high school.
  • Go to college and get a degree in science,
    engineering, mathematics, or social sciences.

24
Bioinformatics Opportunities
Ph.D.: Director/Professor - University, Company
(Pharmaceutical), National Laboratory, Research
Foundation
M.S. (M.A.): Research Staff - Company/University,
National Laboratory, Research Foundation; Teaching
- Community College, Public Schools
B.S. (B.A.): Entry-Level - Company, National
Laboratory; Teaching - Private Schools
Majors: Bioinformatics, Biochemistry, Biology,
Computer Science, Computer Engineering,
Mathematics, Physics, Linguistics (also Education,
Sociology, Philosophy, Psychology, Community
Studies). A research degree in any of these majors
will take you far!
25
Now, the number 1 FAQ:
How much biology should I know?
No simple or straightforward answer,
unfortunately!
But the mantra is: take the classes and interact
routinely with biologists, OR work with the
biologists or the biological data.
High School Senior Summer Internship:
http://www.cincinnatichildrens.org/ed/research/undergrad/hs/default.htm
Summer Undergraduate Research Fellowship:
http://www.cincinnatichildrens.org/ed/research/undergrad/surf/default.htm
26
But I want to start with some basics..
  • http://www.ncbi.nlm.nih.gov/Education
  • http://www.ebi.ac.uk/2can/
  • http://www.genome.gov/Education/
  • http://genomics.energy.gov/
  • Books
  • Introduction to Bioinformatics by Teresa Attwood,
    David Parry-Smith
  • A Primer of Genome Science by Gibson G and Muse
    SV
  • Bioinformatics: A Practical Guide to the Analysis
    of Genes and Proteins, Second Edition by Andreas
    D. Baxevanis, B. F. Francis Ouellette
  • Algorithms on Strings, Trees, and Sequences:
    Computer Science and Computational Biology by Dan
    Gusfield
  • Bioinformatics: Sequence and Genome Analysis by
    David W. Mount
  • Discovering Genomics, Proteomics, and
    Bioinformatics by A. Malcolm Campbell and Laurie
    J. Heyer

27
Biological Challenges - Computer Engineers
  • Post-genomic Era and the goal of bio-medicine
  • to develop a quantitative understanding of how
    living things are built from the genome that
    encodes them.
  • Deciphering the genome code
  • Identifying unknown genes and assigning function
    by computational analysis of genomic sequence
  • Identifying the regulatory mechanisms
  • Identifying their role in normal
    development/states vs disease states

28
Biological Challenges - Computer Engineers
  • Data deluge: exponential growth of data silos and
    different data types
  • Human-computer interaction specialists need to
    work closely with academic and clinical
    biomedical researchers to not only manage the
    data deluge but to convert information into
    knowledge.
  • Biological data is very complex and interlinked!
  • Creating information systems that allow
    biologists to seamlessly follow these links
    without getting lost in a sea of information - a
    huge opportunity for computer scientists.

29
Biological Challenges - Computer Engineers
A major goal in molecular biology is Functional
Genomics: the study of the relationships among
genes in DNA and their function in normal and
disease states
  • Networks, networks, and networks!
  • Each gene in the genome is not an independent
    entity. Multiple genes interact to perform a
    specific function.
  • Environmental influences: genotype-environment
    interaction
  • Integrating genomic and biochemical data together
    into quantitative and predictive models of
    biochemistry and physiology
  • Computer scientists, mathematicians, and
    statisticians will ALL be an integral and
    critical part of this effort.

30
Informatics: Biologists' Expectations
  • Representation, Organization, Manipulation,
    Distribution, Maintenance, and Use of
    information, particularly in digital form.
  • Functional aspect of bioinformatics:
    representation, storage, and distribution of
    data.
  • Intelligent design of data formats and databases
  • Creation of tools to query those databases
  • Development of user interfaces or visualizations
    that bring together different tools to allow the
    user to ask complex questions or put forth
    testable hypotheses.

31
Informatics: Biologists' Expectations
  • Developing analytical tools to discover knowledge
    in data
  • Levels at which biological information is used
  • comparing sequences to predict the function of a
    newly discovered gene
  • breaking down known 3D protein structures into
    bits to find patterns that can help predict how
    the protein folds
  • modeling how proteins and metabolites in a cell
    work together to make the cell function.
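
The first level, comparing sequences to guess the function of a new gene, can be sketched as follows. This sketch swaps in a simple alignment-free measure (shared 3-letter words, scored with the Jaccard index) instead of the alignment-based comparison a real tool would use. The sequences, gene names, and function labels are placeholders invented for the example.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two k-mer sets (1.0 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def predict_function(query: str, annotated: dict[str, tuple[str, str]]):
    """Return (name, function, score) of the most similar annotated gene."""
    name = max(annotated, key=lambda n: similarity(query, annotated[n][0]))
    sequence, function = annotated[name]
    return name, function, similarity(query, sequence)


# Placeholder sequences and annotations, for illustration only.
known = {
    "gene_A": ("ATGGCCATTGTAATGGGCCGCTGA", "placeholder function A"),
    "gene_B": ("ATGTTTCCCGGGAAATTTCCCTAG", "placeholder function B"),
}
new_gene = "ATGGCCATTGTTATGGGCCGCTGA"  # gene_A with one base changed
print(predict_function(new_gene, known))
```

The prediction is only a hypothesis: similar sequence suggests, but does not prove, similar function.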

32
Finally: what does informatics mean to
biologists?
  • The ultimate goal of analytical bioinformaticians
    is to develop predictive methods that allow
    biomedical researchers and scientists to model
    the function and phenotype of an organism based
    only on its genomic sequence. This is a grand
    goal, and one that will be approached only in
    small steps, by many scientists from different
    but allied disciplines working cohesively.

33
Biology Data Structures
  • Four broad categories
  • Strings: to represent DNA, RNA, and amino acid
    sequences of proteins
  • Trees: to represent the evolution of various
    organisms (Taxonomy) or structured knowledge
    (Ontologies)
  • Sets of 3D points and their linkages: to
    represent protein structures
  • Graphs: to represent metabolic, regulatory, and
    signaling networks or pathways
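
One possible way to write the four categories down in code is sketched below. The class names, coordinates, and pathway labels are invented for illustration; real formats carry much more detail.

```python
from dataclasses import dataclass, field

# 1. Strings: a DNA sequence is text over the alphabet A, C, G, T.
dna = "CACACTTGCATG"


# 2. Trees: nested nodes, here a tiny taxonomy.
@dataclass
class TreeNode:
    name: str
    children: list["TreeNode"] = field(default_factory=list)


mammals = TreeNode("Mammals", [
    TreeNode("Primates", [TreeNode("Human"), TreeNode("Chimpanzee")]),
    TreeNode("Rodents", [TreeNode("Mouse")]),
])


def leaves(node: TreeNode) -> list[str]:
    """Names of the nodes without children, left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# 3. Sets of 3D points and their linkages: atoms plus bonds between them.
@dataclass
class Atom:
    element: str
    x: float
    y: float
    z: float


atoms = [Atom("N", 0.0, 0.0, 0.0), Atom("C", 1.5, 0.0, 0.0),
         Atom("C", 2.0, 1.4, 0.0)]  # made-up coordinates
bonds = [(0, 1), (1, 2)]  # pairs of indices into `atoms`

# 4. Graphs: each node maps to the nodes it acts on (a made-up pathway).
pathway = {
    "signal": ["receptor"],
    "receptor": ["kinase"],
    "kinase": ["transcription factor"],
    "transcription factor": [],
}

print(leaves(mammals), len(atoms), pathway["receptor"])
```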

34
Biology Data Structures
  • Biologists are also interested in
  • Substrings
  • Subtrees
  • Subsets of points and linkages, and
  • Subgraphs.

Beware: biological data is often characterized by
huge size, the presence of laboratory errors
(noise), duplication, and sometimes unreliability.
35
Support Complex Queries: A typical demand
  • Get me all genes involved in or associated with
    brain development that are differentially
    expressed in the Central Nervous System.
  • Get me all genes involved in brain development in
    human and mouse that also show iron ion binding
    activity.
  • For this set of genes, what aspects of function
    and/or cellular localization do they share?
  • For this set of genes, what mutations are
    reported to cause pathological conditions?
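
Queries like these boil down to filtering annotated records and intersecting their annotations. The sketch below assumes a small in-memory list of records; the record layout, gene symbols, and annotations are placeholders, not entries from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    symbol: str
    species: str
    processes: frozenset[str]
    functions: frozenset[str]
    tissues: frozenset[str]


# Placeholder records, invented for the example.
RECORDS = [
    GeneRecord("geneA", "human", frozenset({"brain development"}),
               frozenset({"iron ion binding", "DNA binding"}),
               frozenset({"central nervous system"})),
    GeneRecord("geneB", "mouse", frozenset({"brain development"}),
               frozenset({"iron ion binding"}),
               frozenset({"central nervous system", "liver"})),
    GeneRecord("geneC", "human", frozenset({"heart development"}),
               frozenset({"iron ion binding"}),
               frozenset({"heart"})),
]


def select(records, process=None, function=None, tissue=None, species=None):
    """Keep the records that match every criterion that was supplied."""
    hits = []
    for r in records:
        if process and process not in r.processes:
            continue
        if function and function not in r.functions:
            continue
        if tissue and tissue not in r.tissues:
            continue
        if species and r.species not in species:
            continue
        hits.append(r)
    return hits


# "Genes involved in brain development in human and mouse that also show
# iron ion binding activity."
hits = select(RECORDS, process="brain development",
              function="iron ion binding", species={"human", "mouse"})

# "For this set of genes, what aspects of function do they share?"
shared = (frozenset.intersection(*(r.functions for r in hits))
          if hits else frozenset())
print([r.symbol for r in hits], sorted(shared))
```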



36
Model Organism Databases: Common Issues
  • Heterogeneous Data Sets - Data Integration
  • From Genotype to Phenotype
  • Experimental and Consensus Views
  • Incorporation of Large Datasets
  • Whole genome annotation pipelines
  • Large scale mutagenesis/variation projects
    (dbSNP)
  • Computational vs. Literature-based Data
    Collection and Evaluation (MedLine)
  • Data Mining
  • extraction of new knowledge
  • testable hypotheses (Hypothesis Generation)

37
Human Genome Project: Data Deluge
No. of human gene records currently in NCBI:
29413 (excluding pseudogenes, mitochondrial genes
and obsolete records). Includes 460 microRNAs.
NCBI Human Genome Statistics as of February 12,
2008
38
The Gene Expression Data Deluge
Till 2000: 413 papers on microarrays!
Problems: Deluge! Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray data analysis: from
disarray to consolidation and consensus. Nat Rev
Genet. 7(1): 55-65.
39
Information Deluge..
  • 3 scientific journals in 1750
  • Now: >120,000 scientific journals!
  • >500,000 medical articles/year
  • >4,000,000 scientific articles/year
  • >16 million abstracts in PubMed derived from
    >32,500 journals

40
Data-driven Problems..
  • Generally, the names refer to some feature of the
    mutant phenotype
  • Dickie's small eye (Theiler et al., 1978, Anat
    Embryol (Berl), 155: 81-86) is now Pax6
  • Gleeful: "This gene encodes a C2H2 zinc finger
    transcription factor with high sequence
    similarity to vertebrate Gli proteins, so we have
    named the gene gleeful (Gfl)." (Furlong et al.,
    2001, Science 293: 1632)

What's in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
  • Disease names
  • Mobius Syndrome with Poland's Anomaly
  • Werner's syndrome
  • Down's syndrome
  • Angelman's syndrome
  • Creutzfeldt-Jakob disease
  • Gene names
  • Accelerin
  • Antiquitin
  • Bang Senseless
  • Bride of Sevenless
  • Christmas Factor
  • Cockeye
  • Crack
  • Draculin
  • Dickie's small eye
  • Fidgetin
  • Gleeful
  • Knobhead
  • Lunatic Fringe
  • Mortalin
  • Orphanin
  • Profilactin
  • Sonic Hedgehog

41
Rose is a rose is a rose is a rose.. Not Really!
What is a cell?
  • any small compartment
  • (biology) the basic structural and functional
    unit of all organisms; they may exist as
    independent units of life (as in monads) or may
    form colonies or tissues as in higher plants and
    animals
  • a device that delivers an electric current as a
    result of chemical reaction
  • a small unit serving as part of or as the nucleus
    of a larger political movement
  • cellular telephone: a hand-held mobile
    radiotelephone for use in an area divided into
    small sections, each with its own short-range
    transmitter/receiver
  • small room in which a monk or nun lives
  • a room where a prisoner is kept

Image Sources Somewhere from the internet and
Google Images
42
Foundation Model Explorer
43
  • COLORECTAL CANCER 3-BP DEL, SER45DEL
  • COLORECTAL CANCER SER33TYR
  • PILOMATRICOMA, SOMATIC SER33TYR
  • HEPATOBLASTOMA, SOMATIC THR41ALA
  • DESMOID TUMOR, SOMATIC THR41ALA
  • PILOMATRICOMA, SOMATIC ASP32GLY
  • OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC
    SER37CYS
  • HEPATOCELLULAR CARCINOMA SOMATIC SER45PHE
  • HEPATOCELLULAR CARCINOMA SOMATIC SER45PRO
  • MEDULLOBLASTOMA, SOMATIC SER33PHE

The REAL Problems
Many disease states are complex because of many
genes (alleles, ethnicity, gene families, etc.),
environmental effects (lifestyle, exposure,
etc.) and their interactions.
44
The REAL Problems
45
Methods for Integration
  • Link-driven federations
  • Explicit links between databanks.
  • Warehousing
  • Data is downloaded, filtered, integrated and
    stored in a warehouse. Answers to queries are
    taken from the warehouse.
  • Others.. Semantic Web, etc

46
Link-driven Federations
  • Creates explicit links between databanks
  • Query, get interesting results, and use web links
    to reach related data in other databanks
  • Examples: NCBI-Entrez, SRS
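
The "query, then follow links" pattern can be mimicked with a few dictionaries. This is a sketch of the idea only: the databank names, record IDs, and the `follow_links` helper are made up, and it does not use or imitate the Entrez or SRS interfaces.

```python
from collections import deque

# Each databank maps a record ID to its data plus explicit links to
# records in other databanks. All names and IDs are placeholders.
DATABANKS = {
    "gene": {"G1": {"title": "example gene",
                    "links": [("protein", "P1"), ("literature", "L1")]}},
    "protein": {"P1": {"title": "example protein",
                       "links": [("structure", "S1")]}},
    "literature": {"L1": {"title": "example article", "links": []}},
    "structure": {"S1": {"title": "example structure", "links": []}},
}


def follow_links(bank: str, record_id: str, max_hops: int = 2):
    """Breadth-first walk over explicit links, at most max_hops away."""
    start = (bank, record_id)
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        (cur_bank, cur_id), hops = queue.popleft()
        if hops == max_hops:
            continue  # do not look beyond the hop limit
        for target in DATABANKS[cur_bank][cur_id]["links"]:
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, hops + 1))
    return order


for bank, rid in follow_links("gene", "G1"):
    print(bank, rid, DATABANKS[bank][rid]["title"])
```

Note that the user still has to know which record to start from and which links are worth following.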

47
http://www.ncbi.nlm.nih.gov/Database/datamodel/
48
http://www.ncbi.nlm.nih.gov/Database/datamodel/
49
http://www.ncbi.nlm.nih.gov/Database/datamodel/
50
http://www.ncbi.nlm.nih.gov/Database/datamodel/
51
http://www.ncbi.nlm.nih.gov/Database/datamodel/
52
Link-driven Federations
  • Advantages
  • complex queries
  • Fast
  • Disadvantages
  • require good knowledge
  • syntax based
  • terminology problem not solved

53
Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
  • Advantages
  • Good for very-specific, task-based queries and
    studies.
  • Since it is custom-built and usually
    expert-curated, relatively less error-prone.
  • Disadvantages
  • Can quickly become outdated; needs constant
    updates.
  • Limited functionality: e.g., restricted to one
    disease or one system.
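
The download, filter, integrate, store, and query steps can be sketched with Python's built-in sqlite3 module. The two "downloaded" tables below are stand-ins with invented rows, and the table layout is an assumption made for the example.

```python
import sqlite3

# Stand-ins for files downloaded from two source databanks (made-up rows).
gene_rows = [("geneA", "human"), ("geneB", "human"), ("geneC", "mouse")]
disease_rows = [("geneA", "example disease 1"),
                ("geneC", "example disease 2"),
                ("geneZ", "example disease 3")]


def build_warehouse(gene_rows, disease_rows, species="human"):
    """Filter, integrate, and store the downloaded data in one local table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gene_disease (symbol TEXT, species TEXT, disease TEXT)")
    # Filter: keep only genes from the species this warehouse is about.
    genes = {symbol: sp for symbol, sp in gene_rows if sp == species}
    # Integrate: join the two sources on the gene symbol.
    integrated = [(symbol, genes[symbol], disease)
                  for symbol, disease in disease_rows if symbol in genes]
    db.executemany("INSERT INTO gene_disease VALUES (?, ?, ?)", integrated)
    db.commit()
    return db


warehouse = build_warehouse(gene_rows, disease_rows)
# Answers to queries are taken from the warehouse, not from the sources.
answer = warehouse.execute(
    "SELECT symbol, disease FROM gene_disease ORDER BY symbol").fetchall()
print(answer)
```

If a source databank changes after the download, the warehouse will not notice until it is rebuilt, which is the "quickly outdated" disadvantage listed above.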

54
Algorithms in Bioinformatics
  • Finding similarities among strings
  • Detecting certain patterns within strings
  • Finding similarities among parts of spatial
    structures (e.g. motifs)
  • Constructing trees
  • Phylogenetic or taxonomic trees: evolution of an
    organism
  • Ontologies: structured/hierarchical
    representation of knowledge
  • Classifying new data according to previously
    clustered sets of annotated data
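
As a small example of detecting patterns within strings, the sketch below searches a DNA string for a motif written with standard IUPAC ambiguity codes (for instance W means A or T, N means any base). Using a regular expression is one simple choice for illustration; the `find_motif` helper and the motif searched for are made up for the example.

```python
import re

# Standard IUPAC nucleotide codes used in this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_motif(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (position, matched text) for every occurrence of the motif,
    including occurrences that overlap each other."""
    body = "".join(IUPAC[code] for code in motif)
    pattern = re.compile(f"(?=({body}))")  # lookahead allows overlaps
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


# First line of the sequence shown on slide 9.
seq = "CACACTTGCATGTGAGAGCTTCTAATATCTAAATTAATGTTGAATCATT"
for position, text in find_motif(seq, "TAAW"):
    print(position, text)
```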

55
Algorithms in Bioinformatics
  • Reasoning about microarray data and the
    corresponding behavior of pathways
  • Predictions of deleterious effects of changes in
    DNA sequences
  • Computational linguistics: NLP/text-mining of
    published literature or patient records
  • Graph theory: gene regulatory networks,
    functional networks, etc.
  • Visualization and GUIs (networks, application
    front ends, etc.)

56
Disease Gene Identification and Prioritization
Hypothesis: Functionally similar or related genes
cause similar disease.
  • Functional similarity: common/shared features
  • Gene Ontology term
  • Pathway
  • Phenotype
  • Chromosomal location
  • Expression
  • Cis regulatory elements (Transcription factor
    binding sites)
  • miRNA regulators
  • Interactions
  • Other features..
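
One simple way to turn "shared features" into a ranking is sketched below: score each candidate gene by how much its feature set overlaps with that of its most similar known disease gene. The Jaccard index and the best-match scoring are assumptions chosen for this example, not the method behind these slides, and all gene names and feature labels are placeholders.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Shared features divided by all features (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prioritize(training: dict[str, set[str]],
               candidates: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Rank candidates by their best similarity to any training gene."""
    scores = {
        name: max(jaccard(features, known) for known in training.values())
        for name, features in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder feature sets (ontology terms, pathways, phenotypes, ...).
training = {
    "known1": {"go:a", "pathway:p", "phenotype:x"},
    "known2": {"go:b", "pathway:p"},
}
candidates = {
    "cand1": {"go:a", "pathway:p"},
    "cand2": {"go:c"},
    "cand3": {"go:b", "pathway:p"},
}
for name, score in prioritize(training, candidates):
    print(f"{name}: {score:.2f}")
```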

57
PPI - Predicting Disease Genes
  • Direct protein-protein interactions (PPI) are one
    of the strongest manifestations of a functional
    relation between genes.
  • Hypothesis: Interacting proteins lead to the same
    or similar diseases when mutated.
  • Several genetically heterogeneous hereditary
    diseases are shown to be caused by mutations in
    different interacting proteins, for example
    Hermansky-Pudlak syndrome and Fanconi anaemia.
    Hence, protein-protein interactions might in
    principle be used to identify potentially
    interesting disease gene candidates.

58
  • Prioritize candidate genes in the interacting
    partners of the disease-related genes
  • Training sets: disease-related genes
  • Test sets: interacting partners of the training
    genes
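
A minimal sketch of this setup: take the known disease genes as the training set, collect their interacting partners as the test set, and rank each partner by how many different disease genes it interacts with. The slide does not say how candidates are scored, so counting interactions is an assumption made here; the gene names and interactions are invented.

```python
from collections import Counter

# Made-up interaction pairs; D1..D3 play the role of known disease genes.
interactions = [("D1", "X"), ("D2", "X"), ("D1", "Y"),
                ("D3", "Z"), ("X", "Z"), ("D1", "D2")]
disease_genes = {"D1", "D2", "D3"}  # training set


def rank_partners(interactions, disease_genes):
    """Score each non-disease partner by the number of distinct disease
    genes it interacts with; ties are broken alphabetically."""
    neighbors: dict[str, set[str]] = {}
    for a, b in interactions:  # interactions are undirected
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    counts = Counter()
    for gene in disease_genes:
        for partner in neighbors.get(gene, set()):
            if partner not in disease_genes:  # test set only
                counts[partner] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_partners(interactions, disease_genes))
```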

59
  • Example: Breast cancer

15
342
2469
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
PubMed
OMIM
64
http://sbw.kgi.edu/