Title: Bioinformatics in Cancer Biotechnology
1Bioinformatics in Cancer Biotechnology
- Bob Stephens, Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick
- April 19, 2007
2Objectives
- Overview/introduce bioinformatics concepts, applications and databases.
- Describe interplay between bioinformatics, technologies and the web.
- Profile importance of bioinformatics in cancer research.
3What is bioinformatics ?
- Bioinformatics is the application of computational methods to the analysis of any type of biological data.
- Bioinformatics has become a diverse and multidisciplinary field that originally derived from computer science and biological science.
4Evolution of bioinformatics
- Rapid technological advances in sequence determination set the pace for data acquisition.
- Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments.
- Co-evolution with web browser and programming language technologies.
5Bioinformatics evolution (contd.)
- Additional high-throughput technologies becoming available almost daily: microarrays, proteomics, population and genetic data, medical literature, etc.
- Data volume is increasing at the same time as data complexity.
- Data distribution/synchronization becoming an increasingly difficult task.
6Interplay between technology and bioinformatics
- New HT (high-throughput) technologies, e.g. mRNA microarray
- Analysis and storage software
- Computational infrastructure
- Data integration
7Example
- mRNA expression chip (20,000 genes x 16 probes per gene), a few MB per sample.
- Data normalization software.
- Exon array - multiple probes for each exon for each of the 20,000 genes - one file about 1 GB.
- New normalization method requires all samples to be loaded simultaneously.
- More complex analysis reveals alternative splicing etc.
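The storage figures in this example can be checked with some rough arithmetic. The sketch below is only illustrative: it assumes each probe is stored as one 8-byte number, and the 100-sample study size is a hypothetical value, not one from the slides. The function names are made up for this illustration.

```python
def sample_bytes(n_genes, probes_per_gene, bytes_per_value=8):
    """Raw size of one sample if every probe stores one numeric value.

    bytes_per_value=8 is an assumption (one double per probe).
    """
    return n_genes * probes_per_gene * bytes_per_value


def simultaneous_load_gb(n_samples, gb_per_sample):
    """Memory needed when a method must hold every sample at once."""
    return n_samples * gb_per_sample


# Expression chip from the slide: 20,000 genes x 16 probes per gene.
expression_mb = sample_bytes(20000, 16) / 1e6

# Exon array: about 1 GB per file; 100 samples is a hypothetical study size.
exon_study_gb = simultaneous_load_gb(100, 1.0)

print(f"expression chip: {expression_mb:.2f} MB per sample")
print(f"exon arrays, 100 samples in memory: {exon_study_gb:.0f} GB")
```

Under these assumptions the expression chip lands in the "few MB per sample" range, while a normalization method that needs all exon-array files in memory scales linearly with the number of samples.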
8Interface of technologies and biology
- Experimental design very important in HT biology
- Experiments shaped by data access and availability
- Re-analysis of old data with new methods important
10Bioinformatics historical perspective
- Stage 1 - the term bioinformatics is coined to represent what had been DNA and protein sequence analysis (ca. 1995).
- Stage 2 - additional disciplines become rolled into bioinformatics, including literature mining, statistical analysis, and virtually anything to do with computational analysis of biological data (ca. 2000).
11Bioinformatics - historical perspective (contd)
- Realization that bioinformatics is too broad a term; other disciplines break away, e.g. the "omics" fields (genomics, proteomics and others) (ca. 2001).
- Still later (current), the realization is made that we won't be able to make any sense of individual disciplines without integrating them; the term now changes to integrative biology or systems biology (ca. 2003).
12Importance of bioinformatics
- Bioinformatics has become a major part of both the NCI 2015 directive and the NIH Roadmaps.
- Virtually impossible to perform biological research without some form of computer-aided analysis, especially in areas like genomics and proteomics.
- Important to keep scientific community in touch with developing technologies and capabilities for highest return on research investment.
13Bioinformatics infrastructures
- Command-line implementations.
- Primitive GUI implementations.
- Sophisticated GUI interfaces and application packaging.
- Web interface and Java language give platform-independent access.
- PC-based, web-based and server-based architectures.
- Multi-tier infrastructures distribute computational burden.
14What does bioinformatics technology involve ?
- Computer-readable form of some type or types of biological data (instruments).
- Automation also requires programmable robotics capabilities (process science).
- Computer infrastructure for storing and analyzing the data.
- As data volume and complexity grow, the dependency on computer analysis increases.
15Sources of bioinformatics technology
- Computer science leveraged technologies including algorithms and data representation models, visualization frameworks and programming languages.
- Web industry leveraged technologies including communication protocols, web servers and secure access.
- Database industry derived connectivity and technologies.
- Robotics and process engineering technologies for faster, cheaper throughput.
16What can bioinformatics technology do for
biological science ?
- Develop uniform data standards and controlled vocabularies to allow for integration of disparate sources/types of data.
- Connect scientists to entire wealth of knowledge, from basic science results to clinical trial data, in context-sensitive manner.
- Fully integrate worldwide volume of knowledge, for example patient information (disease -> treatment -> outcome) across multiple centers to allow for cross-comparisons.
17NCI Resources
- caBIG NCICB initiatives to develop integrated data/tool environment.
- Long-term project requiring unprecedented cooperation, sharing.
- Short-term solutions for day-to-day problems.
- Solution - use multiple approaches, staged implementation and layered technologies.
19ABCC hardware
- 128 CPU Linux cluster (3.0 GHz processors).
- 256 CPU Linux SMP box with 1 TB memory.
- 64 CPU IRIX SMP box with 256 GB memory.
- 32 CPU IBM AIX SMP computers.
- 16 CPU IBM HPC AIX SMP computer.
- 8 x 8 CPU IRIX computers.
- Other miscellaneous computers, disk storage, tape backup and network connectivity.
- Graphics visualization wall
21ABCC Organization
- Networking and Security
- System administration
- Scientific program development
- Bioinformatics support
- Staff: 40
22ABCC Training Programs
- Classes for NIH/NCI scientists
- Unix, GCG, Java, High throughput sequence analysis, Geospiza (LIMS)
- Eudora, Advanced Eudora, Webmail
- Homology, Docking, QSAR, Intro to Modeling, Phred, Phrap, Consed
- One-on-one consulting services and training.
- Organize and host vendor-specific training in genomics, pathways, and modeling
23ABCC Support within ATP
Proteomics and Analytical Technologies (LPAT)
Computational Support, Database Tools/Pathways, Mass Storage and Archive, Pattern Analysis and Clustering
Molecular Technologies (LMT)
Image Analysis (IAL)
Computational Support, Database Tools and LIMS, Mass Storage and Archive, Bioinformatics/Web, Pattern/SNP Analysis
ABCC
Algorithm and Software, Image Database, Mass Storage and Archive, Viz Technology Development
Gene Expression (GEL)
Protein Chemistry (PCL)
Software Support
Gene Assembly and Validation
Animal Sciences (LASP)
Protein Expression (PEL)
Mass Storage Database
POET/Web
24ABCC applications
- Sequence analysis - protein and nucleic acid, GCG and EMBOSS.
- Sequence assembly, SNP detection.
- Gene finders, analysis tools.
- Molecular modeling, docking.
- Molecular evolution and phylogeny.
- Computational chemistry.
- Linkage analysis.
- Proteomics.
- Classification tools (microarray and proteomics).
25ABCC databases
- GenBank and derived divisions.
- RefSeq, WGS, UniGene divisions.
- dbSNP, Gene, OMIM, HomoloGene.
- UCSC, EBI and NCBI genome datasets.
- LIMS systems, data management.
- UniProt, PDB, PIR, iProClass, Swiss-Prot.
- CGAP, MGC data files, pathways.
- MEDLINE, TRANSFAC and repeats data files.
26ABCC web resources
- ABCC General information web page http://www.abcc.ncifcrf.gov
- ABCC account application information http://www.abcc.ncifcrf.gov/apps_apply.shtml
- ABCC Training web page http://www.abcc.ncifcrf.gov/training/courses.shtml
- ABCC scientific applications webpage http://www.abcc.ncifcrf.gov/app/htdocs/appdb/index.php
- ABCC GRID Database web page http://grid.abcc/ncifcrf.gov
- ABCC Pipelines web page http://www.abcc.ncifcrf.gov/app/login/login.php
27The role of bioinformatics in cancer research
- Diagnosis - identify classifiers to better sub-divide cancer etiologies into groups. Better individual data to match treatment to the individual.
- Treatment - identify better methods to track treatment progress and indicate problems earlier.
- Prevention - understand mechanisms for cancer initiation, progression and development and identify targets in this process.
- Connect cancer patient data from geographically distributed cancer patients for more complete analysis.
28Protein analysis tools
- Protein composition, isoelectric point, molecular weight analysis tools.
- Comparable alignment/searching tools for proteins.
- Protein secondary structure prediction tools.
- Protein structure modeling tools.
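As a rough illustration of what the simplest of these tools compute, the sketch below counts amino acid composition and estimates molecular weight from a one-letter protein sequence. It is not the GCG or EMBOSS implementation. The function names are hypothetical, the residue masses are the commonly tabulated average values, and the result is an approximation that ignores modifications.

```python
from collections import Counter

# Average residue masses in daltons (commonly tabulated values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def composition(seq):
    """Count each residue in a one-letter protein sequence."""
    return Counter(seq.upper())


def molecular_weight(seq):
    """Approximate average molecular weight of an unmodified peptide."""
    counts = composition(seq)
    unknown = set(counts) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER


print(composition("ACCA"))
print(round(molecular_weight("GG"), 2))
```

Isoelectric point estimation follows the same pattern but needs a table of side-chain pKa values, which vary between references, so it is left out here.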
29Genomics tools
- Gene finder and general genome annotation tools.
- Cross genome comparison tools and databases.
- Large scale sequence assembly and polymorphism identification tools.
- Genomic visualization tools (UCSC, NCBI, Ensembl).
- Data cleansing tools - vector screening, repeat masking.
30Gene expression tools
- EST clustering and differential expression analysis tools and databases.
- SAGE analysis tools and databases.
- Microarray data collection, calibration and analysis tools and databases.
- Gene clustering and visualization tools.
- Integration tools - pathways, regulatory networks and medical literature.
- Databases for housing and querying the data.
31Proteomics tools
- Mass spectrometry tools for peptide identification.
- Fragment classification tools for identification of diagnostics.
- Peptide fragment resolution tools - identification of protein mixtures from peptide sets.
- Databases for storing and querying the data.
32Inherent bioinformatics problems
- Keeping data sources synchronized and up to date.
- Keeping applications up to date.
- Remaining aware of current palette of available tools and resources.
- Separation between computer developers and biologist users of software and databases.
- The silo concept - separate dysfunctional units.
- Lack of common language or database schema.
33Data Analysis
- Pathway analysis
- Polymorphism
- Proteomics
- Image analysis
- Homology Modeling
- Live polymorphism analysis (if time permits)
34Pathway Analysis
- Identify specific requirements of individual tumor.
- Advance to detection from diagnosis.
- Multiple points to cause aberrations and multiple points to act to correct them.
- Identify/characterize tissue, cell specific targets.
35Pathway Gene Set Analysis
- Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches, etc.
- Clustering genes based on expression etc. provides only first dimension.
- View prospective pathways impacted by changes in expression, protein levels, phosphorylation, etc.
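One common way to ask whether a gene set points at a pathway is an over-representation test based on the hypergeometric distribution. The slides do not say which statistic the ABCC pathway tools use, so the sketch below is a generic illustration rather than their method. The function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_list, n_overlap):
    """Hypergeometric upper-tail probability.

    Chance of seeing at least n_overlap pathway genes when n_list genes
    are drawn at random from n_universe genes, n_pathway of which belong
    to the pathway.
    """
    total = comb(n_universe, n_list)
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 genes measured, 5 in the pathway,
# 4 genes in the experimental list, 2 of them in the pathway.
p = enrichment_p_value(n_universe=20, n_pathway=5, n_list=4, n_overlap=2)
print(f"p = {p:.4f}")
```

A small p-value suggests the gene list overlaps the pathway more than chance would explain. In practice many pathways are tested at once, so a multiple-testing correction would normally follow.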
36[Figure; sample labels: G5G8Tg1Liver, G5G8Tg2Liver, G5G8-/-1Liver, G5G8-/-2Liver, G5G8-/-3Liver]
37[Figure; sample labels: G5G8Tg1Liver, G5G8Tg2Liver, G5G8-/-1Liver, G5G8-/-2Liver, G5G8-/-3Liver]
38Integrative Strategy for Microarray Analysis
Microarray Data
Clustering Analysis
Load into WPS
WSCP
Unassigned Genes
Integrate with WPS
Lists of Genes
Assign to uncharacterized pathway(s)
Assign to known pathway(s)
Putative Pathway
PSCP
PSCP
PSCP
39Project Goal: Integrate Biological Data and/or Information Databases into Biological Networks
User input Microarray Data, Proteomics
Protein Interaction Database (BIND, DIP etc.)
Comparative Genomics
P1
P2
Protein Modification Phos., Glyco.
Gene regulation (Promoter etc)
Gene Ontology
SNP Haplotype Database (SNPinfo etc)
Literature DB (e.g. Pubgene ResNet)
NCBI resources OMIM etc
Statistical Evaluation Network Expansion (high,
low confidence)
40One example of analysis scenario
microarray data pathway analysis or clustering
in local PC
Candidate gene sets Candidate pathway sets
Pre-computed DBs or Run-time computed
Internet-enabled
SNP Haplotype data (SNPinfo Disease association Promoter Comparison 1. CGI generator 2. CoreSearch 3. ConsInspector)
Protein interaction
Literature-based (Pubgene etc NCBI OMIM etc)
GO
Known gene training
Weighted scoring (Statistic analysis, filtering)
Final set of candidate genes (visualization and
re-creation of the new subnetwork within the
whole network)
Pathway expansion
41Polymorphism Impacts
- Variation within species as great as differences between closely related species
- Confounds correlation analysis
- Impacts gene structure and expression
- Start with complete sequence for individual, obtain polymorphism data for populations/strains and breeds etc.
- Strains/breeds allow for good start
42Polymorphism Types
- SNPs
- Indels
- STRs
- Tandem
- NonTandem (Copy number variation)
- Retroelement
- Complex
- Inversion/translocation
43STR Polymorphism View
44Strain Trace and Contig Coverage View
45InDel Polymorphism Information View
46Location Polymorphism Locator Query
47STR Query results
48Polymorphism Visualization
49Proteomics Initiative: ABCC Projects
- Disk Storage and Archiving (centralized storage)
- LAN Support
- Software Development
- Spectral Filtering
- Clustering/Biomarker Identification
- Database Development and Update
- Peptide identification DB
- MS Integration with Pathways
- ABCC Pathway tool
- Provide Scalable Computational Resources
- Software Optimization
- Sequest (working with LPAT, Yates Lab, and Thermoelectron)
50Raw Data
Binning
Biological Marker
Clustering
51Need for effective classification schemes for
correlating large amounts of data with Cancer
markers
- Large amounts of data.
- Many features (data points) to fit but few samples.
- Problems are under-determined (more features than samples).
- Solutions may be purely mathematical with no biological basis.
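The risk described above can be shown with a small simulation. The sketch below assumes six samples (three cases, three controls) and generates features that are pure random noise, then counts how many of them happen to separate the two groups perfectly with a single threshold. The function name, sample size and feature count are hypothetical choices for illustration.

```python
import random


def count_perfect_separators(labels, n_features, seed=0):
    """Count noise features that split the two classes with one threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_features):
        values = [rng.random() for _ in labels]  # pure noise, no biology
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        # Perfect separation: every case lies above (or below) every control.
        if min(pos) > max(neg) or max(pos) < min(neg):
            hits += 1
    return hits


labels = [1, 1, 1, 0, 0, 0]  # 3 cases, 3 controls
print(count_perfect_separators(labels, n_features=2000))
```

With three cases and three controls, each noise feature has about a 1-in-10 chance of separating the groups perfectly (2 x 3! x 3! / 6!), so thousands of features yield many apparent "markers" that mean nothing biologically. This is why classifiers built from such data need validation on independent samples.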
52Image Processing
- Confocal Microscopy, Whole Animal Imaging
- 3D Segmentation
- Traditional/Real-time Microscopy
- Automated Quantitative Feature Analysis
53Confocal Imaging
- Confocal microscopy captures 3D volumes of tissue in situ
- Cancer appearance / development is related to the cellular neighborhood
- Therefore, segmentation and interpretation of cellular clusters is required
- NCI Developed Algorithms
- Segmentation needs human review
54Imaging/Confocal Microscopy
55Homology Modeling
- Many new chemotherapeutic molecules are specific enzyme inhibitors.
- Structural biology plays key role in design/enhancement of these compounds.
- Identify better inhibitors, understand specific differences and mechanisms.
56Homology Modeling of Cysteine Finger in 3 Human Raf Proteins
57ABCC Bioinformatics Support Group
- Anney Che
- Jack Chen
- Jin Chen
- Qingrong Chen
- David Liu
- Uma Mudunuri
- Jigui Shan
- Wei Shao
- Gary Smythers
- Hong Mei Sun
- Natalia Volfovsky
- Xinyu Wen
- Ming Yi
- Jack Zhu
58Bob Stephens
bobs_at_ncifcrf.gov
www.abc.ncifcrf.gov
Query tool
GBrowse