Title: Bioinformatics tools and techniques: Into the heart of darkness
1 Bioinformatics tools and techniques: Into the heart of darkness
- Elaine Kenny
- Colm O'Dushlaine
- 15/11/07
2 Summary
- Simple overviews of some of the tools and methods used by EK and COD
- TK notebook
- get_hapmap_snps.pl: retrieve HM genotype information for a list of SNPs
- GeneViewer.pl / cross_ref.pl: visualise e.g. SNPs in the context of other genomic landmarks. Score SNPs depending on how many of these landmarks they overlap with
- ld_expander.pl: find SNPs in LD with SNPs of interest, based on user-specified r2 and LD window (distance between SNPs)
- STATA
- VIM: command line text editor
- Lab website
3 TK notebook
- Application for saving notes, to-do lists, daily logs, and any other kind of textual information in a place where you can find it all again, and where related information is easily found
- Easy to edit and rapidly searchable
- DEMO: editing
- DEMO: search
4 get_hapmap_snps.pl
- Simple script to read in a 1-column list of SNPs and retrieve HapMap genotypes
- Can select population and strand
- DEMO
- Retrieved data can be loaded into HaploView
- DEMO
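The selection step can be sketched in Python. This is not get_hapmap_snps.pl itself and it does not contact HapMap. It assumes the genotypes are already in a local whitespace-delimited file with a header row and the SNP id in the first column. That layout, the function names and the sample data are all hypothetical.

```python
import io

def read_snp_list(handle):
    """Read a 1-column list of SNP ids, skipping blank lines and duplicates."""
    seen = set()
    snps = []
    for line in handle:
        snp = line.strip()
        if snp and snp not in seen:
            seen.add(snp)
            snps.append(snp)
    return snps

def select_genotypes(genotype_handle, wanted):
    """Keep genotype rows whose first column is one of the wanted SNP ids.

    Assumes a header row followed by whitespace-delimited rows with the
    SNP id in column 1 (a hypothetical local layout).
    """
    wanted = set(wanted)
    header = next(genotype_handle).split()
    rows = []
    for line in genotype_handle:
        fields = line.split()
        if fields and fields[0] in wanted:
            rows.append(fields)
    return header, rows

# Tiny made-up inputs standing in for real files.
snp_file = io.StringIO("rs1\nrs3\n\nrs1\n")
geno_file = io.StringIO(
    "rsid alleles chrom pos NA1 NA2\n"
    "rs1 A/G chr22 100 AA AG\n"
    "rs2 C/T chr22 200 CC CT\n"
    "rs3 G/T chr22 300 GG GT\n"
)
snps = read_snp_list(snp_file)
header, rows = select_genotypes(geno_file, snps)
for row in rows:
    print("\t".join(row))
```

With real data the two `io.StringIO` objects would be replaced by opened files.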
5 cross_ref_scored.pl
- Score SNPs based on how many putatively functional regions they overlap with
- On a per gene / chromosome basis
- Gene basis
- Type: perl cross_ref_scored.pl file_A file_B file_C ...
- where
- file_A - 2-column file of SNPs (format: id, location)
- file_B - 3-column file of EXONS (format: id/name, start, stop)
- file_C ... - whatever you want (format: id/name, start, stop), i.e. other regions like CpGs, TFBS, clusters. Any order.
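The scoring idea can be sketched in Python using the file formats described above. This is not the Perl script. It assumes a SNP earns one point for each region file it overlaps, with inclusive start and stop coordinates; the real script may count differently. All names and data are hypothetical.

```python
def load_regions(lines):
    """Parse 3-column region lines (id/name, start, stop) into tuples."""
    regions = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions

def score_snps(snps, region_sets):
    """Score each SNP by the number of region sets it overlaps.

    snps: list of (snp_id, location) pairs, as in file_A.
    region_sets: one list of (name, start, stop) per region file.
    One point per region set, not per region (an assumption).
    """
    scores = {}
    for snp_id, pos in snps:
        score = 0
        for regions in region_sets:
            if any(start <= pos <= stop for _, start, stop in regions):
                score += 1
        scores[snp_id] = score
    return scores

# Made-up example: three SNPs, one exon file, one CpG file.
snps = [("rs1", 150), ("rs2", 500), ("rs3", 950)]
exons = load_regions(["exon1 100 200", "exon2 900 1000"])
cpgs = load_regions(["cpg1 120 180"])
scores = score_snps(snps, [exons, cpgs])
print(scores)
```

The linear scan over regions is fine for a single gene. For whole chromosomes, sorted intervals or an interval tree would be a more sensible choice.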
6 cross_ref_scored.pl example output
Can then be merged with HapMap / Perlegen to retrieve MAF data for SNPs
7 Merge cross_ref_scored data with HapMap / Perlegen data using merge_per_hap.pl
- Type:
- perl merge_per_hap.pl perlegen.txt hapmap.txt overlapped_region_scored.txt
- Where
- hapmap.txt: 3-column file (format: rsid, ref_allele, ref_allele_freq)
- perlegen.txt: 3-column file (format: rsid, ref_allele, ref_allele_freq)
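The merge can be sketched in Python. This is not merge_per_hap.pl. It assumes the scored file has been reduced to an rsid-to-score mapping, and it takes MAF as the smaller of the reference allele frequency and its complement, which holds for biallelic SNPs. The function names and data are hypothetical.

```python
def load_freqs(lines):
    """Parse 3-column lines (rsid, ref_allele, ref_allele_freq) into a dict."""
    freqs = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            rsid, allele, freq = fields
            freqs[rsid] = (allele, float(freq))
    return freqs

def maf(ref_allele_freq):
    """Minor allele frequency of a biallelic SNP from its reference allele frequency."""
    return min(ref_allele_freq, 1.0 - ref_allele_freq)

def merge_scores(scores, perlegen, hapmap):
    """Attach Perlegen and HapMap MAFs to each scored SNP (None if absent)."""
    merged = []
    for rsid, score in scores.items():
        row = {"rsid": rsid, "score": score}
        for label, source in (("perlegen", perlegen), ("hapmap", hapmap)):
            entry = source.get(rsid)
            row[label + "_maf"] = maf(entry[1]) if entry else None
        merged.append(row)
    return merged

# Made-up example data.
scores = {"rs1": 2, "rs2": 0}
perlegen = load_freqs(["rs1 A 0.5", "rs2 C 0.25"])
hapmap = load_freqs(["rs1 A 0.75"])
merged = merge_scores(scores, perlegen, hapmap)
for row in merged:
    print(row)
```

SNPs missing from one source are kept with `None`, so the merged table shows which dataset covered each SNP.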
8 cross_ref.pl applied to WGA data
- cross_ref.pl: scoring SNPs throughout the genome
- Data analysed on a coding/non-coding basis
- (coding)
- perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.coding.txt 22 WTCCC_T2D_chr22_without_inferred.forCrossRef WGA_databases/coding_non_synon_SNPs_UCSC.clean3 WGA_databases/coding_synon_SNPs_UCSC.clean2 WGA_databases/RefSeq_Genes_UCSC.byExon.uniqid1 WGA_databases/Triplexes_may2006.bed2 WGA_databases/splice_site_SNPs_UCSC.clean2 > Overlapped_regions_scored.WTCCC.chr22.coding.log
- (input-dependent, coding/non-coding dependent, arbitrary)
- (noncoding)
- perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.NONcoding.txt 22 WTCCC_T2D_chr22_without_inferred.forCrossRef WGA_databases/TFBS.chr221 WGA_databases/CpG_islands_UCSC.uniqid1 WGA_databases/Most_conserved_phastConsElements17way_UCSC.clean1 WGA_databases/promoters_knowngene_hg18.txt1 WGA_databases/sno_or_miRNA_UCSC.uniqid1 > Overlapped_regions_scored.WTCCC.chr22.NONcoding.log
9 cross_ref.pl
- cross_ref.pl output
- Load into STATA. If SNPs have e.g. association p-values, calculate adjusted p-value (R. Anney) as -log10P cross_ref_score
10 GeneViewer.pl
- GeneViewer.pl: visualise overlapping features (e.g. exons, SNPs etc.) along e.g. your gene of interest (html output)
11 ld_expander.pl
- Find proxies (SNPs in LD) for a list of SNPs
- User specifies the r2 and LD window
- Currently configured to obtain proxies from HM CEU
- Result is a list of additional proxy SNPs that have been obtained by LD expansion
- DEMO
- Note: don't LD expand >150000 SNPs, or HapMap will ban you! COD has an alternative version that uses local pre-computed pairwise LD SNP files
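LD expansion against local pre-computed pairwise LD can be sketched in Python. This is neither ld_expander.pl nor COD's alternative version. It assumes each LD record is a tuple of (snp_a, pos_a, snp_b, pos_b, r2). The default thresholds, the names and the data are hypothetical.

```python
def expand_by_ld(query_snps, ld_pairs, min_r2=0.8, max_distance=250000):
    """Return the sorted proxy SNPs for a list of query SNPs.

    ld_pairs: iterable of (snp_a, pos_a, snp_b, pos_b, r2) tuples.
    A pair contributes a proxy when r2 >= min_r2, the two SNPs lie
    within max_distance bases of each other, and exactly one of them
    is a query SNP.
    """
    query = set(query_snps)
    proxies = set()
    for snp_a, pos_a, snp_b, pos_b, r2 in ld_pairs:
        if r2 < min_r2 or abs(pos_a - pos_b) > max_distance:
            continue
        if snp_a in query and snp_b not in query:
            proxies.add(snp_b)
        elif snp_b in query and snp_a not in query:
            proxies.add(snp_a)
    return sorted(proxies)

# Made-up pairwise LD records.
ld_pairs = [
    ("rs1", 1000, "rs2", 5000, 0.95),    # kept: high r2, close
    ("rs1", 1000, "rs3", 900000, 0.90),  # dropped: outside the window
    ("rs4", 2000, "rs1", 1000, 0.50),    # dropped: r2 too low
    ("rs5", 3000, "rs1", 1000, 0.85),    # kept: query SNP is second in pair
]
proxies = expand_by_ld(["rs1"], ld_pairs)
print(proxies)
```

Because this works on local files, it sends no requests to HapMap however long the SNP list is.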
12 STATA
- Extremely powerful and flexible
- >65k rows handled, shock horror!
- Can write scripts to automate tasks, e.g. read in file, do analysis, save results
- When you use the GUI to run commands, they are shown in the command window, so you can save them in a do file
- COD, EK and R. Anney strongly advocate this as a platform for both file manipulation and statistical analysis
13 STATA example using WTCCC data
Bipolar Disorder, Coronary Artery Disease, Crohn's Disease, Hypertension, Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes
14 DATA FORMAT
- 3 folders
- Basic: each case collection against the pooled control groups 58C and UKBS
- Combined cases: combining phenotypically relevant case collections (e.g. RA/T1D, autoimmune)
- Combined controls: combining other case collections as controls
- Data are split by chromosome
15 Questions
- How do I get all of the chromosome data for my gene of interest into one file?
- How do I search easily all of the SNP information for my gene(s) of interest?
- Create a .do file for all manipulations that you want to carry out to the data
- DEMO
- Good starting resource: http://www.ats.ucla.edu/stat/stata/
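The slide recommends a STATA .do file for these manipulations. The same selection logic is sketched below in Python for illustration only. It assumes each per-chromosome file is tab-delimited with header columns named snp, chrom and pos. Those column names, the gene coordinates and the data are hypothetical, not the WTCCC file layout.

```python
import csv
import io

def rows_in_region(handles, chrom, start, stop):
    """Collect rows from several tab-delimited tables that fall in one region.

    Each table is assumed to have header columns 'snp', 'chrom' and 'pos'.
    """
    selected = []
    for handle in handles:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["chrom"] == chrom and start <= int(row["pos"]) <= stop:
                selected.append(row)
    return selected

# Two made-up per-chromosome tables standing in for real files.
chr21 = io.StringIO("snp\tchrom\tpos\tp\nrs10\t21\t150\t0.20\n")
chr22 = io.StringIO(
    "snp\tchrom\tpos\tp\n"
    "rs20\t22\t120\t0.01\n"
    "rs21\t22\t250\t0.50\n"
    "rs22\t22\t900\t0.03\n"
)
# Hypothetical gene of interest on chromosome 22, bases 100 to 300.
gene_rows = rows_in_region([chr21, chr22], chrom="22", start=100, stop=300)
for row in gene_rows:
    print(row["snp"], row["pos"], row["p"])
```

A .do file would follow the same steps: read each file, keep the rows inside the gene, append and save.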
16 VIM
- Vi Improved. Mainly UNIX but cross-platform text editor (available for Windows).
- Full list of commands outside scope of this demonstration
- Very fast and efficient, esp. with search and replace functions on large datasets
- Regular expression pattern matching
- DEMO
- Integrates with Cygwin (www.cygwin.com, a very useful UNIX emulator for Windows)
17 Group website
- Some useful stuff up there!
- Please send information about current projects etc. Good for our image as a group and minimal effort required on your part
- DEMO
18 Conclusions
- Small summary of some things you can do
- Slides and video demonstrations will be online at http://www.medicine.tcd.ie/psychiatry/research/neuropsychiatry/Protocols/
- COD and EK available for advice (Fridays 9-9.02am)
- These things will help you in your work!!