Title: Bioinformatics Software
1Bioinformatics Software
Stuart M. Brown, Research Computing, NYU School of Medicine
2Bioinformatics Beyond Websites
- A lot of sophisticated bioinformatics work can be done using free websites
- Some biology projects generate a LOT of data
- how we define "a lot" is a moving target (now TB)
- The only solution is to have a dedicated bioinformatics computer, database, and custom programs.
- May require more processor power and more hard drive space than a typical desktop personal computer
4Computer Hardware is not Free
- However, you can build a powerful Linux cluster for $20K, $50K, or $100K (depending on how much power you need)
- The big cost is for people with the skills to manage the machines, install the software, and train scientists to use it.
- Small schools can join together or affiliate with a larger neighbor.
5Open Source Bioinformatics
- All of the bioinformatics software that you need to do complex analyses is free for UNIX computers
- The Open Source software ethic is very strong among biologists
- Bioinformatics.org
- Bioperl.org
- Open-bio.org
- BioConductor
- New algorithms generally appear first as free
software (a publication requirement)
6What is "Open Source?"
- Software distributed with source code under licenses guaranteeing anybody rights to freely use, modify, and redistribute the code.
- When users can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs. User communities provide free tech support.
7Open Source
- Free
- Users can see and change code
- Easy to modify and add new functions
- Send changes back to developer - share with everyone - grows as a community
- Can be incorporated into other products
- Developers can lose interest - no guarantee of support
8Popular Open Source Projects
- Linux
- Apache
- PHP
- MySQL
- Mozilla/Firefox
- OpenOffice
- Wiki
9Science is Open Source
- Must publish results in a form that lets others check and recreate your work
- If your scientific work is a program then it must be open source
10Free Software
- Linux operating system, MySQL database
- Perl - programming language
- R, BioConductor - statistics package
- Blast and Fasta - similarity search
- Clustal - multiple alignment
- Phylip - phylogenetics
- Phred/Phrap/Consed - sequence assembly and SNP detection
- EMBOSS - a complete sequence analysis package created by the EMBL
11Assemble your Own Bioinformatics computer
- We will install some bioinformatics tools in the CS computer room
- You will learn more if you install these tools on your own computer
- Every installation has its own style and its own hassles
- What tools will be needed?
- Interface?
12Bioinformatics Requires Powerful Computers
- One definition of bioinformatics is "the use of computers to analyze biological problems."
- As biological data sets have grown larger and biological problems have become more complex, the requirements for computing power have also grown.
- Computers that can provide this power generally use the Unix operating system - so you must learn Unix
13Why Unix?
- Bioinformatics applications are typically written for Unix
- It is the operating system of choice for high performance computers (multi-processor or clusters)
- Ability to use lots of software written by others
- Pipelining and database integration
14Unix Runs the Internet
- Unix is an operating system with a command line interface, used by most large, powerful computers.
- In fact, Unix is the underlying structure for most of the Internet and most large scale bioinformatics operations.
- A knowledge of Unix is likely to be helpful in your future career, regardless of where you pursue it.
15When Do I Need Unix?
- Big sets of data
- Integrating raw data, results, and display
- Speed of large batch processing
- Integration of web server and database with
automated data processing programs
16Unix Advantages
- It is very popular, so it is easy to find information and get help
- pick up books at the local bookstore (or street vendor)
- plenty of helpful websites
- USENET discussions and e-mail lists
- most Comp. Sci. students know Unix
- Unix can run on virtually any computer (IBM, Sun, Compaq, Macintosh, etc.)
- Unix is free or nearly free
- Linux/open source software movement
- RedHat, FreeBSD, MKLinux, LinuxPPC, etc.
17Stable and Efficient
- Unix is very stable - computers running Unix almost never crash
- Unix is very efficient
- it gets maximum number crunching power out of your processor (and multiple processors)
- it can smoothly manage extremely large amounts of data
- it can give a new life to otherwise obsolete Macs and PCs
- Most new bioinformatics software is created for Unix - it's easy for the programmers
18Unix has some Drawbacks
- Unix computers are controlled by a command line interface
- NOT user-friendly
- difficult to learn, even more difficult to truly master
- Hackers love Unix
- there are lots of security holes
- most computers on the Internet run Unix, so hackers can apply the same tricks to many different computers
- There are many different versions of Unix with subtle (or not so subtle) differences
19Flavors of Unix
- DEC Alpha Tru64
- Sun Solaris / SunOS
- IBM AIX
- HP-UX
- DG/UX
- Minix
- KNOPPIX
- Gentoo
- Linux
- Redhat/Fedora
- SuSe
- Mandrake
- Mandriva
- Slackware
- Debian
20Linux is Unix
- Linux is a free (or cheap) operating system
- Works on any computer
- Increasingly popular for both desktops and servers
- It is an open implementation of Unix
- Uses all normal Unix commands
- Most new bioinformatics programs are written
first for Linux
21General Unix Tips
- UNIX is case sensitive!!
- myfile.txt and MyFile.txt do not mean the same thing
- I like to use capital letters for directory names - it puts them at the top of an alphabetical listing
- Every program is independent
- the core operating system (known as the kernel) manages each program as a distinct process with its own little chunk of dedicated memory.
- If one program runs into trouble, it dies, but does not affect the kernel or the other programs running on the computer.
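
The process independence described on this slide can be seen in a minimal Python sketch. The helper name `run_child` is made up for this illustration; the sketch only assumes a Python interpreter is available. One child process fails on purpose, and neither the second child nor the parent is affected.

```python
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a snippet of Python in a separate process and return its exit status.

    run_child is a hypothetical helper written for this example.
    """
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


# A child process that fails on purpose (exit status 3).
crashed = run_child("import sys; sys.exit(3)")
# A second, independent child that finishes normally.
healthy = run_child("pass")

print("failed child exit status:", crashed)
print("healthy child exit status:", healthy)
print("parent process is still running")
```

The kernel reports the failed child's exit status back to the parent, which carries on and starts the next program.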
22Shell scripts
- Part of the UNIX operating system
- a powerful program in a 3-line text file
- Very simple method to run an existing program on multiple files (foreach)
- Can chain together several programs (a pipeline)
- Can make simple decisions (if then)
- Can read output files and match patterns (grep)
- Can manipulate text files (sed, awk)
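
The loop-over-files and pattern-matching ideas above can be sketched in Python rather than in a shell. This is an illustration only: the function names `count_matches` and `count_in_all`, the file names, and the sequences are all invented for the example. It plays the role of a `foreach` loop that runs `grep -c` on every FASTA file in a directory.

```python
import re
import tempfile
from pathlib import Path


def count_matches(path: Path, pattern: str) -> int:
    """Count the lines in one file that match a regular expression (like grep -c)."""
    regex = re.compile(pattern)
    with path.open() as handle:
        return sum(1 for line in handle if regex.search(line))


def count_in_all(directory: Path, glob: str, pattern: str) -> dict:
    """Apply count_matches to every file matching the glob (like a foreach loop)."""
    return {p.name: count_matches(p, pattern) for p in sorted(directory.glob(glob))}


# Demo on two small made-up FASTA files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.fasta").write_text(">seq1\nACGT\n>seq2\nGGCC\n")
    (tmp_dir / "b.fasta").write_text(">seq3\nTTAA\n")
    # Lines starting with ">" are FASTA headers, so this counts sequences per file.
    counts = count_in_all(tmp_dir, "*.fasta", r"^>")
    print(counts)
```

In a real shell script the same job is usually a three-line loop, which is the point of the slide; the Python version just makes each step explicit.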
23Unix Help on the Web
- Here is a list of a few online Unix tutorials
- Unix for Beginners
- http://www.ee.surrey.ac.uk/Teaching/Unix/
- Getting Started in Unix
- http://its.ucsc.edu/services/web/unix/start.php
- Unix Guru Universe
- http://www.ugu.com/sui/ugu/show?help.beginners
- Getting Started With The Unix Operating System
- http://www.leeds.ac.uk/iss/documentation/beg/beg8/beg8.html
24Standalone Tools
- BLAST - sequence similarity
- free from NCBI
- works on most operating systems
- versions for local, web client, web server
- EMBOSS - complete package of tools
- free from EMBL
- many handy little tools
- BioConductor - genomics/statistics
- free from BioConductor.org (requires "R" and Java)
- works on most operating systems (best on Windows)
25BLAST
26EMBOSS
- EMBOSS is a suite of hundreds of simple programs that work with DNA and protein sequences
- Free UNIX software
- works on Mac OS X
- has a Windows implementation
- Somewhat challenging to download, compile, and configure
- Has web and Java interfaces
28- transeq
- Translate nucleic acid sequences
- Input sequence(s): xelrhodop.fasta
- Output sequence [xelrhodop.pep]:
- transeq -opt
- Translate nucleic acid sequences
- Input sequence(s): xelrhodop.fasta
- Translation frames
- 1 : 1
- 2 : 2
- 3 : 3
- F : Forward three frames
- -1 : -1
- -2 : -2
- -3 : -3
- 6 : All six frames
- Frame(s) to translate [1]: 1
- Genetic codes
- 0 : Standard
- Minimalist interface
- Fast, powerful
- Easy to script (all options for all programs can be input on the command line)
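
To make the "translation frames" idea concrete, here is a minimal Python sketch of frame-by-frame translation with the standard genetic code. It is not how transeq itself is implemented. The function names and test sequence are invented for the example, and the reverse-frame labels (-1, -2, -3) here simply mean frames 1 to 3 of the reverse complement, which may not match transeq's own numbering.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int = 1) -> str:
    """Translate one forward frame (1, 2 or 3); unrecognized codons become 'X'."""
    seq = dna.upper()
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string containing A, C, G, T."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict:
    """Translate all six frames. Assumption: frame -n is frame n of the
    reverse complement, which may differ from transeq's convention."""
    frames = {f: translate(dna, f) for f in (1, 2, 3)}
    rc = reverse_complement(dna)
    frames.update({-f: translate(rc, f) for f in (1, 2, 3)})
    return frames


# Made-up test sequence; '*' marks a stop codon.
for frame, protein in six_frames("ATGAAATAG").items():
    print(frame, protein)
```

Choosing frame 1, 2 or 3 only changes where the first codon starts, which is why the same short sequence can give six quite different peptides.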
31Ugly EMBOSS on unix command line has now turned
into a beautiful Jemboss swan!
32programs are highlighted based on unambiguous letters
35GDE - multiple alignment viewer
36GDE - supports phylogenetic Apps. and graphics
37Programming Languages
- Like religions
- Perl is the most commonly used language for simple bioinformatics apps.
- Really a fast prototyping language
- Useful for linking other programs and shell scripts
- SLOW compared to compiled languages
- C and Java are used for many more polished programs
38Perl
- Perl is a powerful language
- convert file formats
- search a file for something you need
- change things in a file
- run a program and select just some lines of the output
- process your sequences
- build a pipeline that runs on many different systems
- Very powerful text handling
- has its own Regular Expression language
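
The kinds of text handling listed above (converting a file format, searching with a regular expression) look much the same in any scripting language. The sketch below uses Python rather than Perl, with made-up sequences and helper names (`parse_fasta`, `fasta_to_tab`, `find_sites`) that are assumptions of this example. It converts FASTA text to tab-delimited lines and finds every occurrence of GAATTC, the EcoRI recognition site.

```python
import re

# Made-up FASTA text used only for this demonstration.
FASTA_TEXT = """>seq1 example record
ACGTGAATTCAC
GGGAATTCTT
>seq2 another record
TTTTCCCC
"""


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_tab(text):
    """Convert FASTA text to one 'identifier<TAB>sequence' line per record."""
    return "\n".join(f"{name}\t{seq}" for name, seq in parse_fasta(text))


def find_sites(seq, pattern="GAATTC"):
    """Return 1-based start positions of every regular-expression match."""
    return [m.start() + 1 for m in re.finditer(pattern, seq)]


records = dict(parse_fasta(FASTA_TEXT))
print(fasta_to_tab(FASTA_TEXT))
for name, seq in records.items():
    print(name, find_sites(seq))
```

Note that the second site in seq1 spans nothing unusual here, but because the parser joins wrapped lines before searching, a site split across two lines of the file would still be found.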
39Compiled vs. Interpreted
- Perl is an Interpreted language
- A Perl program is just a bit of text with commands in a (semi) human readable form
- Must have a Perl Interpreter (program) on your computer to execute this program
- Can run on any operating system
- Very slow
- Compiled programming language like C
- Executable program is a bunch of gibberish
- Can only run on one specific operating system
- Much faster
40BioPerl
- Don't spend your time reinventing the wheel
- most types of bioinformatics data manipulations have been done before
- grab a module and go from there (there are hundreds of them)
- object oriented
41Role of CS in Biology Project
- The CS Bioinformatics specialist works closely with biologists
- Provides consulting in project design
- collect the data in a way that makes it easy to analyze
- Works with biologists to establish scope of the project
- Chooses appropriate technologies
- Builds the programs/database and processes the data
- Works with biologists to produce data visualization/output that works for THEM!
42It's all about the INTERFACE
- Biologists are (generally) visual, holistic thinkers
- A good tool needs a good interface
- Biologists will tolerate an essential tool with a lame interface, but it will dramatically reduce the quality of the science (and the utility of the tool)
- As we play with each bioinformatics tool this semester, think about the interface and how it aids or impedes your work.
43UI Elements
- Most biologists are not comfortable with the command line
- A simple web form is vastly better than a command line tool
- Graphical display is better than text (although a text output option is also highly desirable)
- The Visual Display of Quantitative Information
- User feedback for navigation, runtime indicator, record of options used, etc. is also important
44- The chart shows 6 variables
- size of the army
- location on a 2D surface
- direction of the army's movement
- Temperature
- location of major river crossings
46Toolkits
- Existing bioinformatics tools and reference datasets are dispersed and uncoordinated
- Each scientist is required to discover or re-invent appropriate tools for their data analysis
- Toolkits are extremely useful when they group related tools with a common interface
49Project Management
- Building the project
- Basic Unix
- Databases
- Shells
- Pipelining / Scaling
- Basic BioPerl / BioJava / Object-Oriented Programming
- Translating Biology to IT
- Integration
- Publishing / Managing the Project
- Data Security
50BIG DATA
51New Technology Big Data
- The past 2-3 years have seen a huge increase in the size of data files generated in scientific projects
- This has been driven entirely by new technologies
- High-throughput (Next-Generation) DNA sequencing
- Genome tiling chips
- High density SNP chips
52Computing Challenges
- There are many practical challenges associated with very large data files
- Data storage for TB of laboratory information
- Can't easily manipulate millions of records in traditional SQL databases
- Difficult to run Perl scripts on data files millions of lines long
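
One common way around the "millions of lines" problem is to stream a file one line at a time instead of loading it all into memory. The sketch below shows that pattern in Python under stated assumptions: the function name `stream_base_counts` and the tiny in-memory input are invented for the example, and a real run would pass an open file handle instead.

```python
import io
from collections import Counter


def stream_base_counts(handle):
    """Tally records and bases one line at a time.

    Only the current line and the running totals are held in memory,
    so memory use stays flat however long the input is.
    """
    counts = Counter()
    n_records = 0
    for line in handle:  # file-like objects yield one line per iteration
        if line.startswith(">"):
            n_records += 1
        else:
            counts.update(line.strip().upper())
    return n_records, counts


# Stand-in for a very large FASTA file; with real data use open("some_file.fasta").
demo = io.StringIO(">r1\nACGT\nAC\n>r2\nGGTT\n")
n_records, counts = stream_base_counts(demo)
print(n_records, dict(counts))
```

This is the same idea that makes grep, sed, and awk scale well: each tool looks at one line, updates a small amount of state, and moves on.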
53Need massive new CPU power
- Traditional software/algorithms may not run on clusters (MPI challenges)
- Biologists are not parallel programming experts
- Familiar bioinformatics skills/tools run into roadblocks of scale
- What's old is new again (hello to grep, sed, awk, etc.)
54New technology enables new science
- Just like a new microscope, the development of these new big data technologies is creating opportunities for new kinds of science
- Quantitative changes in technology enable qualitative changes in science
- ChIP-seq, Transcript profiling
- Metagenomics
- Genome Wide Association
55Biology is a big new market for computing vendors
- Compute Clusters
- Storage
- LIMS databases
- MPI compliant bioinformatics tools
- Consultants and 3rd party VARs
- Job opportunities for bioinformatics-savvy CS
people.