Bioinformatics Software - PowerPoint PPT Presentation

Provided by: stuart67
1
Bioinformatics Software
Stuart M. Brown, Research Computing, NYU School of
Medicine
2
Bioinformatics: Beyond Websites
  • A lot of sophisticated bioinformatics work can be
    done using free websites
  • Some biology projects generate a LOT of data
  • How we define "a lot" is a moving target (TB)
  • The only solution is to have a dedicated
    bioinformatics computer, database, and custom
    programs.
  • May require more processor power and more hard
    drive space than a typical desktop personal
    computer

4
Computer Hardware is not Free
  • However, you can build a powerful Linux cluster
    for $20K, $50K, or $100K (depending on how much
    power you need)
  • The big cost is for people with the skills to
    manage the machines, install the software, and
    train scientists to use it.
  • Small schools can join together or affiliate with
    a larger neighbor.

5
Open Source Bioinformatics
  • All of the bioinformatics software that you need
    to do complex analyses is free for UNIX computers
  • The Open Source software ethic is very strong
    among biologists
  • Bioinformatics.org
  • Bioperl.org
  • Open-bio.org
  • BioConductor
  • New algorithms generally appear first as free
    software (a publication requirement)

6
What is "Open Source?"
  • Software distributed with source code under
    licenses guaranteeing anybody rights to freely
    use, modify, and redistribute the code.
  • When users can read, redistribute, and modify the
    source code for a piece of software, the software
    evolves. People improve it, people adapt it,
    people fix bugs. User communities provide free
    tech support.

7
Open Source
  • Free
  • Users can see and change code
  • Easy to modify and add new functions
  • Send changes back to developer - share with
    everyone - grows as a community
  • Can be incorporated into other products
  • Developers can lose interest - no guarantee of
    support

8
Popular Open Source Projects
  • Linux
  • Apache
  • PHP
  • MySQL
  • Mozilla/Firefox
  • OpenOffice
  • Wiki

9
Science is Open Source
  • Must publish results in a form that lets others
    check and recreate your work
  • If your scientific work is a program, then it
    must be open source

10
Free Software
  • Linux operating system, MySQL database
  • Perl - programming language
  • R, BioConductor - statistics package
  • Blast and Fasta - similarity search
  • Clustal - multiple alignment
  • Phylip - phylogenetics
  • Phred/Phrap/Consed - sequence assembly and SNP
    detection
  • EMBOSS - a complete sequence analysis package
    created by the EMBL

11
Assemble your Own Bioinformatics computer
  • We will install some bioinformatics tools in the
    CS computer room
  • You will learn more if you install these tools on
    your own computer
  • Every installation has its own style and its own
    hassles
  • What tools will be needed?
  • Interface?

12
Bioinformatics Requires Powerful Computers
  • One definition of bioinformatics is "the use of
    computers to analyze biological problems."
  • As biological data sets have grown larger and
    biological problems have become more complex, the
    requirements for computing power have also grown.
  • Computers that can provide this power generally
    use the Unix operating system - so you must learn
    Unix

13
Why Unix?
  • Bioinformatics applications are typically written
    for Unix
  • It is the operating system of choice for high
    performance computers (multi-processor or
    clusters)
  • Ability to use lots of software written by others
  • Pipelining and database integration

14
Unix Runs the Internet
  • Unix is an operating system with a command line
    interface, used by most large, powerful computers.
  • In fact, Unix is the underlying structure for
    most of the Internet and most large scale
    bioinformatics operations.
  • A knowledge of Unix is likely to be helpful in
    your future career, regardless of where you
    pursue it.

15
When Do I Need Unix?
  • Big sets of data
  • Integrating raw data, results, and display
  • Speed of large batch processing
  • Integration of web server and database with
    automated data processing programs

16
Unix Advantages
  • It is very popular, so it is easy to find
    information and get help
  • pick up books at the local bookstore (or street
    vendor)
  • plenty of helpful websites
  • USENET discussions and e-mail lists
  • most Comp. Sci. students know Unix
  • Unix can run on virtually any computer (IBM,
    Sun, Compaq, Macintosh, etc.)
  • Unix is free or nearly free
  • Linux/open source software movement
  • RedHat, FreeBSD, MKLinux, LinuxPPC, etc.

17
Stable and Efficient
  • Unix is very stable - computers running Unix
    almost never crash
  • Unix is very efficient
  • it gets maximum number crunching power out of
    your processor (and multiple processors)
  • it can smoothly manage extremely large amounts of
    data
  • it can give a new life to otherwise obsolete Macs
    and PCs
  • Most new bioinformatics software is created for
    Unix - it's easy for the programmers

18
Unix has some Drawbacks
  • Unix computers are controlled by a command line
    interface
  • NOT user-friendly
  • difficult to learn, even more difficult to truly
    master
  • Hackers love Unix
  • there are lots of security holes
  • most computers on the Internet run Unix, so
    hackers can apply the same tricks to many
    different computers
  • There are many different versions of Unix with
    subtle (or not so subtle) differences

19
Flavors of Unix
  • DEC Alpha Tru64
  • Sun Solaris / SunOS
  • IBM AIX
  • HP-UX
  • DG/UX
  • Minix
  • KNOPPIX
  • Gentoo
  • Linux
  • Red Hat/Fedora
  • SuSE
  • Mandrake
  • Mandriva
  • Slackware
  • Debian

20
Linux is Unix
  • Linux is a free (or cheap) operating system
  • Works on any computer
  • Increasingly popular for both desktops and
    servers
  • It is an open implementation of Unix
  • Uses all normal Unix commands
  • Most new bioinformatics programs are written
    first for Linux

21
General Unix Tips
  • UNIX is case sensitive!!
  • myfile.txt and MyFile.txt do not mean the same
    thing
  • I like to use capital letters for directory names
    - it puts them at the top of an alphabetical
    listing
  • Every program is independent
  • the core operating system (known as the kernel)
    manages each program as a distinct process with
    its own little chunk of dedicated memory.
  • If one program runs into trouble, it dies, but
    does not affect the kernel or the other programs
    running on the computer.
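
The independence of processes can be illustrated with a minimal Python sketch. Python is used here only for illustration, and the two child commands are made-up examples: one exits with an error on purpose, the other behaves normally. The parent program starts each child as a separate process and simply reads back its exit status.

```python
import subprocess
import sys

def run_child(code: str) -> int:
    """Run a snippet of Python in a separate process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode

# The first child dies with a non-zero exit status ...
crashed = run_child("raise SystemExit(3)")
# ... while a second, well-behaved child is unaffected by that failure.
healthy = run_child("print('still running')")

print("crashed child exit status:", crashed)
print("healthy child exit status:", healthy)
```

The parent keeps running after the first child fails, which is the behavior the kernel guarantees for distinct processes.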

22
Shell scripts
  • Part of the UNIX operating system
  • a powerful program in a 3-line text file
  • Very simple method to run an existing program on
    multiple files (foreach)
  • Can chain together several programs (a pipeline)
  • Can make simple decisions (if then)
  • Can read output files and match patterns (grep)
  • Can manipulate text files (sed, awk)
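
The same ideas (loop over many files, make a simple decision, match a pattern) can be sketched in Python for comparison. This is only an illustration, not a replacement for a real shell script: the file names, file contents, and the helper names count_matches and summarize are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

def count_matches(path: Path, pattern: str) -> int:
    """Count lines in a text file that match a regular expression (like grep -c)."""
    regex = re.compile(pattern)
    with path.open() as handle:
        return sum(1 for line in handle if regex.search(line))

def summarize(directory: Path, pattern: str) -> dict:
    """Apply count_matches to every .txt file in a directory (like a foreach loop)."""
    summary = {}
    for path in sorted(directory.glob("*.txt")):
        hits = count_matches(path, pattern)
        if hits > 0:  # a simple decision (if ... then)
            summary[path.name] = hits
    return summary

# Hypothetical demo data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "run1.txt").write_text("gene ABC1 found\nno hit\ngene XYZ2 found\n")
    (demo_dir / "run2.txt").write_text("no hit\n")
    result = summarize(demo_dir, r"^gene \w+ found$")

print(result)
```

Files with no matching lines are left out of the summary, so only run1.txt is reported here.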

23
Unix Help on the Web
  • Here is a list of a few online Unix tutorials
  • Unix for Beginners
  • http://www.ee.surrey.ac.uk/Teaching/Unix/
  • Getting Started in Unix
  • http://its.ucsc.edu/services/web/unix/start.php
  • Unix Guru Universe
  • http://www.ugu.com/sui/ugu/show?help.beginners
  • Getting Started With The Unix Operating System
  • http://www.leeds.ac.uk/iss/documentation/beg/beg8/beg8.html

24
Standalone Tools
  • BLAST - sequence similarity
  • free from NCBI
  • works on most operating systems
  • versions for local, web client, web server
  • EMBOSS - complete package of tools
  • free from EMBL
  • many handy little tools
  • BioConductor - genomics/statistics
  • free from BioConductor.org (requires "R" and Java)
  • works on most operating systems (best on Windows)

25
BLAST
26
EMBOSS
  • EMBOSS is a suite of hundreds of simple programs
    that work with DNA and protein sequences
  • Free UNIX software
  • works on Mac OS X
  • has a Windows implementation
  • Somewhat challenging to download, compile, and
    configure
  • Has web and Java interfaces

28
  • transeq
  • Translate nucleic acid sequences
  • Input sequence(s): xelrhodop.fasta
  • Output sequence [xelrhodop.pep]:
  • transeq -opt
  • Translate nucleic acid sequences
  • Input sequence(s): xelrhodop.fasta
  • Translation frames
  •  1 : 1
  •  2 : 2
  •  3 : 3
  •  F : Forward three frames
  • -1 : -1
  • -2 : -2
  • -3 : -3
  •  6 : All six frames
  • Frame(s) to translate [1]: 1
  • Genetic codes
  •  0 : Standard

Minimalist interface
Fast, powerful
Easy to script (all options for all programs can be
input on the command line)
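
What a frame-by-frame translation involves can be sketched in a few lines of Python. This is not the EMBOSS implementation, only an illustration using the standard genetic code. The function names are hypothetical, and frame numbering conventions for the reverse strand differ between tools: here frame -1 simply means reading the reverse complement from its first base.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an (unambiguous) DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna: str, frame: int = 1) -> str:
    """Translate DNA in frame 1, 2, 3 (forward) or -1, -2, -3 (reverse strand)."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    seq = dna.upper() if frame > 0 else reverse_complement(dna)
    start = abs(frame) - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    # Codons containing anything other than A, C, G, T are reported as "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna: str) -> dict:
    """Translate all six frames, keyed by frame number."""
    return {frame: translate(dna, frame) for frame in (1, 2, 3, -1, -2, -3)}

print(six_frames("ATGGCCTAA"))
```

Trailing bases that do not fill a complete codon are dropped, so shifted frames can give a shorter protein than frame 1.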
31
Ugly EMBOSS on the Unix command line has now turned
into a beautiful Jemboss swan!
32
Programs are highlighted based on unambiguous letters
35
GDE - multiple alignment viewer
36
GDE - supports phylogenetic apps and graphics
37
Programming Languages
  • Like religions
  • Perl is the most commonly used language for
    simple bioinformatics apps.
  • Really a fast prototyping language
  • Useful for linking other programs and shell
    scripts
  • SLOW compared to compiled languages
  • C and Java are used for many more polished programs

38
Perl
  • Perl is a powerful language
  • convert file formats
  • search a file for something you need
  • change things in a file
  • run a program and select just some lines of the
    output
  • process your sequences
  • build a pipeline that runs on many different
    systems
  • Very powerful text handling
  • has its own Regular Expression language
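
The kind of text handling listed above (match a pattern, convert a file format) can be sketched briefly. The example is written in Python rather than Perl so that it stays consistent with the other sketches in this transcript. The regular expression, the function name fasta_to_tab, and the sample records are illustrative assumptions, not part of the original slides.

```python
import re

# Matches a FASTA header such as ">seq1 some description"; group 1 is the ID.
HEADER = re.compile(r"^>(\S+)\s*(.*)$")

def fasta_to_tab(text: str) -> list:
    """Convert FASTA-formatted text into 'id<TAB>sequence' lines."""
    records = []  # list of [identifier, sequence-so-far]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            records.append([match.group(1), ""])
        elif records:
            # Sequence lines are joined onto the most recent record.
            records[-1][1] += line
    return [f"{name}\t{seq}" for name, seq in records]

# Hypothetical example records.
example = """>seq1 hypothetical example record
ACGT
ACGT
>seq2
GGCC
"""
rows = fasta_to_tab(example)
for row in rows:
    print(row)
```

Each multi-line FASTA record becomes a single tab-delimited line, which is easier to feed into spreadsheet or database tools.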

39
Compiled vs. Interpreted
  • Perl is an interpreted language
  • A Perl program is just a bit of text with
    commands in a (semi) human readable form
  • Must have a Perl Interpreter (program) on your
    computer to execute this program
  • Can run on any operating system
  • Very slow
  • A compiled programming language, like C
  • Executable program is a bunch of gibberish
  • Can only run on one specific operating system
  • Much faster

40
BioPerl
  • Don't spend your time reinventing the wheel
  • most types of bioinformatics data manipulations
    have been done before
  • grab a module and go from there (there are
    hundreds of them)
  • object oriented

41
Role of CS in Biology Project
  • The CS Bioinformatics specialist works closely
    with biologists
  • Provides consulting in project design
  • collect the data in a way that makes it easy to
    analyze
  • Works with biologists to establish scope of the
    project
  • Chooses appropriate technologies
  • Builds the programs/database and processes the
    data
  • Works with biologists to produce data
    visualization/output that works for THEM!

42
It's all about the INTERFACE
  • Biologists are (generally) visual, holistic
    thinkers
  • A good tool needs a good interface
  • Biologists will tolerate an essential tool with a
    lame interface, but it will dramatically reduce
    the quality of the science (and the utility of
    the tool)
  • As we play with each bioinformatics tool this
    semester, think about the interface and how it
    aids or impedes your work.

43
UI Elements
  • Most biologists are not comfortable with the
    command line
  • A simple web form is vastly better than a command
    line tool
  • Graphical display is better than text (although a
    text output option is also highly desirable)
  • The Visual Display of Quantitative Information
  • User feedback for navigation, runtime indicator,
    record of options used, etc. is also important

44
  • The chart shows 6 variables
  • size of the army
  • location on a 2D surface
  • direction of the army's movement
  • Temperature
  • location of major river crossings

46
Toolkits
  • Existing bioinformatics tools and reference
    datasets are dispersed and uncoordinated
  • Each scientist is required to discover or
    re-invent appropriate tools for their data
    analysis
  • Toolkits are extremely useful when they group
    related tools with a common interface

49
Project Management
  • Building the project
  • Basic Unix
  • Databases
  • Shells
  • Pipelining / Scaling
  • Basic BioPerl / BioJava / Object-Oriented
    Programming
  • Translating Biology to IT
  • Integration
  • Publishing / Managing the Project
  • Data Security

50
BIG DATA
51
New Technology Big Data
  • The past 2-3 years have seen a huge increase in
    the size of data files generated in scientific
    projects
  • This has been driven entirely by new
    technologies
  • High-throughput (Next-Generation) DNA sequencing
  • Genome tiling chips
  • High density SNP chips

52
Computing Challenges
  • There are many practical challenges associated
    with very large data files
  • Data storage for TB of laboratory information
  • Can't easily manipulate millions of records in
    traditional SQL databases
  • Difficult to run Perl scripts on data files
    millions of lines long

53
Need massive new CPU power
  • Traditional software/algorithms may not run on
    clusters (MPI challenges)
  • Biologists are not parallel computing programming
    experts
  • Familiar bioinformatics skills/tools run into
    roadblocks of scale
  • What's old is new again (hello to grep, sed, awk,
    etc.)
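
One reason tools like grep, sed, and awk cope with very large files is that they read the input one line at a time instead of loading it all into memory. A minimal Python sketch of that streaming style follows. The function name tally_column and the sample rows are hypothetical, and the in-memory text stands in for a real file that would be opened with open(path).

```python
import io
from collections import Counter

def tally_column(handle, column: int, sep: str = "\t") -> Counter:
    """Count values in one column while reading the input one line at a time."""
    counts = Counter()
    for line in handle:  # file objects yield lines lazily
        fields = line.rstrip("\n").split(sep)
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts

# Hypothetical stand-in for a very large tab-delimited file.
demo = io.StringIO("read1\tchr1\t100\nread2\tchr2\t250\nread3\tchr1\t300\n")
counts = tally_column(demo, column=1)
print(counts.most_common())
```

Memory use depends on the number of distinct values in the column, not on the number of lines in the file.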

54
New technology enables new science
  • Just like a new microscope, the development of
    these new big data technologies is creating
    opportunities for new kinds of science
  • Quantitative changes in technology enable
    qualitative changes in science
  • ChIP-seq, transcript profiling
  • Metagenomics
  • Genome Wide Association

55
Biology is a big new market for computing vendors
  • Compute Clusters
  • Storage
  • LIMS databases
  • MPI compliant bioinformatics tools
  • Consultants and 3rd party VARs
  • Job opportunities for bioinformatics-savvy CS
    people.