Title: Managing complexity Advanced Perl
1Managing complexity(Advanced Perl)
- Using perl for specific tasks with help from
Bioperl and others
2Login
- Username bioinfouser
- Password loginbioinfo
3Funny?
4Goals
- I already assume you know Perl basics, so this covers some more advanced features
- Learn how to write OO code
- More flexible modules
- Understand other modules
- Some APIs that you may need.
- Bioperl
- PerlDBI
5What I assume you already know
- Scalars
- Arrays
- Hashes
- Control structures (if-then, for, foreach, while, etc.)
- File IO
6Managing complexity
- By managing complexity
- Make hard tasks easy(er)
- Perl itself does this
- Regular expressions, text manipulations
- Extensions (modules) do this
- May come at the expense of execution speed
- You may not care
- Consider the big picture
- Development time
- Errors
- Extremely custom software
- Some things need speed
7How complex is it now?
- Perl is a very compact language compared with human languages
- Perl is large compared with other programming languages
- TMTOWTDI
- Perl has approximately 233 reserved words
- Java has approximately 47 reserved words
- Both are easy to learn, but harder to use effectively
8General practices
- Always use #!/usr/bin/perl -w or use warnings
- Consider use strict for scripts longer than 10 lines
- You can't have too many comments
- #
- =head
- =cut
- perldoc
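A minimal sketch of a script that follows these practices: warnings and strict are switched on, the documentation is written as POD between =head1 and =cut (readable with perldoc), and ordinary comments start with #. The script name and the gc_fraction helper are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

=head1 NAME

gc_content - hypothetical example: report the GC fraction of a DNA string

=cut

# gc_fraction is an illustrative helper, not part of any library.
sub gc_fraction {
    my ($dna) = @_;
    return 0 unless length $dna;
    my $gc = ($dna =~ tr/GCgc//);   # tr/// returns how many characters matched
    return $gc / length $dna;
}

printf "GC fraction: %.2f\n", gc_fraction("ATGCGC");
```

Running perldoc on the file shows only the POD section; the interpreter skips it.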
9Getting values into the program or subroutine.
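- Perl passes a flat list of arguments; copying them out of @_ gives pass-by-value behavior (the elements of @_ themselves are aliases to the caller's variables)
- A scalar can have as a value a reference (pointer) to an array, hash, function etc.
- The args to a function arrive in a special variable called @_ (the args to a program arrive in @ARGV)
- my $first_value = shift @_;
- my $first_value = $_[0];
- my $first_value = shift;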
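The three ways of reading the first argument can be compared side by side. This is a small sketch; all subroutine names are invented for the example. The last two subroutines show the difference between working on a copy and writing through @_.

```perl
use strict;
use warnings;

# Three equivalent ways to read the first argument.
sub first_by_shift      { my $first_value = shift @_; return $first_value; }
sub first_by_index      { my $first_value = $_[0];    return $first_value; }
sub first_by_bare_shift { my $first_value = shift;    return $first_value; }

# Works on a copy: the caller's variable is untouched.
sub upcase_copy {
    my ($text) = @_;
    $text = uc $text;
    return $text;
}

# Writes through @_: the caller's variable changes.
sub upcase_in_place {
    $_[0] = uc $_[0];
    return;
}

my $name = "atp7b";
my $copy = upcase_copy($name);
print "after upcase_copy:     name=$name copy=$copy\n";
upcase_in_place($name);
print "after upcase_in_place: name=$name\n";
```

Copying the arguments first, as the slide does, is usually the safer habit.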
10References
my @array = ("one", "two", "three", "four");
function_call(@array);
function_call(\@array);
function_call(["one", "two", "three"]);

sub function_call {
    my $passed = shift @_;
    print "$passed\n";
}

Output:
one
ARRAY(0x80601a0)
ARRAY(0x804c9a0)
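The following sketch extends the slide's example to show what the subroutine actually receives in each case and how to get back to the elements through a reference. The helper describe_first is hypothetical.

```perl
use strict;
use warnings;

my @array = ("one", "two", "three", "four");

# Report whether the first argument is a plain value or a reference.
sub describe_first {
    my $passed = shift @_;
    return ref($passed)
        ? "reference to " . ref($passed) . " with " . scalar(@$passed) . " elements"
        : "plain scalar '$passed'";
}

print describe_first(@array), "\n";                     # array is flattened into @_
print describe_first(\@array), "\n";                    # reference to the named array
print describe_first(["one", "two", "three"]), "\n";    # reference to an anonymous array

# Dereferencing: three ways to reach the elements behind a reference.
my $ref = \@array;
print "$$ref[0] $ref->[1] @{$ref}[2,3]\n";
```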
11Debugging complex data structures.
- Print the reference
- It will tell you a little bit of information
- Use the Data::Dumper module.
- This will give you a snapshot of the whole data structure
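Both debugging approaches can be tried on a small nested structure. This is a sketch only; the %gene hash is made-up sample data, and the two Data::Dumper settings are optional conveniences.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;   # compact indentation

# Made-up nested structure for illustration.
my %gene = (
    name       => "ATP7B",
    aliases    => [ "Wilson disease-associated protein", "Copper-transporting ATPase 2" ],
    references => { RefSeq => "NM_000053", LocusLink => 540 },
);

my $ref = \%gene;
print "$ref\n";        # prints only the type and address, e.g. HASH(0x...)
print Dumper($ref);    # prints the whole structure
```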
12Some more advanced features
13Regular expressions
- Not Perl specific
- Very useful
- What they do
- String comparisons
- String substitutions
- Substring selection
14Regex
Could put m before the pattern: m/find/
$string =~ /find/

.    Match any character (except newline)
\w   Match "word" character (alphanumeric plus "_")
\W   Match non-word character
\s   Match whitespace character
\S   Match non-whitespace character
\d   Match digit character
\D   Match non-digit character
\t   Match tab
\n   Match newline
\r   Match return
15Repetition
$string =~ /(ti){2}/
$string =~ /ATG?C{3}A{3,}T{4,6}/
Character Classes
$string =~ /[ATGCN]/
$string =~ /[ATGCNatgcn]/i
16Selection/Replacement
$string =~ /(A{3,8})/; print $1;
$string =~ s/a/A/;
$string =~ tr/atgc/ATGC/;
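A short sketch that exercises repetition, character classes, capture, substitution and transliteration together. The sample strings are invented for the example.

```perl
use strict;
use warnings;

my $string = "ccAAAAAtititiGG";

# Repetition: "ti" at least twice in a row.
my $has_repeat = $string =~ /(ti){2}/ ? 1 : 0;

# Selection: capture a run of 3 to 8 A's (also available as $1 after a match).
my ($run) = $string =~ /(A{3,8})/;

# Character class with /i: only DNA letters, either case.
my $only_dna = "atgcNATG" =~ /^[ATGCN]+$/i ? 1 : 0;

# Substitution without /g replaces the first match only.
(my $first_upper = "banana") =~ s/a/A/;

# Transliteration maps characters one for one.
(my $upper = "atgcatgc") =~ tr/atgc/ATGC/;

print "repeat=$has_repeat run=$run only_dna=$only_dna\n";
print "$first_upper $upper\n";
```

Adding /g to the substitution (s/a/A/g) would replace every match instead of just the first.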
17Additional syntax
$string =~ /AT?AT/
$string =~ m#/var/log/messages#
$_ = "ATATATAGTGTGCGTGATATGGG";
($one, $two, $three) = /AT..AT/g;
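The two ideas on this slide can be sketched as follows: a different delimiter avoids escaping the slashes in a path, and a /g match in list context returns every non-overlapping match. The second DNA string is made up to show more than one hit.

```perl
use strict;
use warnings;

# Alternate delimiters: no need to escape "/" inside the pattern.
my $path = "/var/log/messages";
my $is_messages = ($path =~ m{/var/log/messages}) ? 1 : 0;

# With no pattern target, the match is applied to $_.
$_ = "ATATATAGTGTGCGTGATATGGG";
my ($one, $two, $three) = /AT..AT/g;    # only one non-overlapping match here

# A made-up string with two matches.
my @hits = "ATGGATCCATTTAT" =~ /AT..AT/g;

print "path matched: $is_messages\n";
print "first: $one, second: ", defined $two ? $two : "(no match)", "\n";
print "hits: @hits\n";
```

Matches found by /g do not overlap, which is why the slide's string yields a single hit even though "AT" repeats many times.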
18What is a module
- Two types
- Object-oriented type
- Provides something similar to a class definition
- Remote function call
- Provides a method to import subroutines or
variables for the main program to use
19Howto Making a module
Create a file called workSaver.pm

package workSaver;
sub doStuff {
    print "Stuff done\n";
}
1;   # statement that evaluates to true

Now you can use it with

use workSaver;

Some restrictions apply
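To try the idea without creating a second file, the package can be declared inline, as in this sketch. In real use the package block would sit in workSaver.pm and the main program would start with use workSaver.

```perl
use strict;
use warnings;

# Normally this block is the whole content of workSaver.pm (ending with "1;").
package workSaver;

sub doStuff {
    print "Stuff done\n";
}

# Back in the main program. Nothing is exported by this simple module,
# so the subroutine is called by its full package name.
package main;

workSaver::doStuff();
```

One of the restrictions: the directory holding workSaver.pm must be in @INC. Since Perl 5.26 the current directory is no longer searched by default, so a line such as use lib '.' may be needed.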
20Howto Making a module cont.
- This method would work very well for subroutines that are used in several programs.
- Reduces the clutter in your program
- Provides one maintenance point instead of an unknown number.
- Eases bug fixes
- Careful of boundaries
21More Complete method
- Allows you to pollute the namespace of the original program selectively.
- Makes the use of functions and variables easier
- Still used about the same way as the simple method but things are clearer
22More Complete
package functional;
use strict;
use Exporter;
our @ISA       = ("Exporter");
our @EXPORT    = qw();
our @EXPORT_OK = qw($variable1 $variable2 printout);
our $VERSION   = 2.0;
our $variable1 = "var1";
our $variable2 = "var2";
my  $variable3 = "var3";

sub printout {
    my $passed_variable = shift;
    print "Your variable is $passed_variable mine are $variable1 , $variable2, $variable3 \n";
}
1;
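A sketch of how a program might import selectively from this module. To keep it in one file, the module is declared inside a BEGIN block and marked as already loaded through %INC; that trick is an assumption of this example, not something the slides use. With a real functional.pm on disk, only the use line and the calls below it are needed.

```perl
use strict;
use warnings;

# Stand-in for functional.pm so the sketch fits in one file.
BEGIN {
    package functional;
    use Exporter;
    our @ISA       = ("Exporter");
    our @EXPORT    = qw();
    our @EXPORT_OK = qw($variable1 $variable2 printout);
    our $variable1 = "var1";
    our $variable2 = "var2";
    my  $variable3 = "var3";    # lexical, so it can never be exported

    sub printout {
        my $passed_variable = shift;
        print "Your variable is $passed_variable, mine are $variable1, $variable2, $variable3\n";
    }

    $INC{'functional.pm'} = __FILE__;   # tell "use" the module is already loaded
}

# Import only what is asked for; @EXPORT is empty, so nothing comes in by default.
use functional qw($variable1 printout);

printout("mine");
print "Imported: $variable1\n";
print "Not imported, so the full name is needed: $functional::variable2\n";
```

Listing names in @EXPORT_OK rather than @EXPORT is what makes the namespace pollution selective: the caller decides what to bring in.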
23CPAN
- Wouldn't it be nice to have a place where
- You could find a bunch of Perl modules
- It would be browsable
- Searchable
- Big pipe for people to download stuff
- Other people would be encouraged to submit fixes and updates
- And it was all free
24Sources of modules/Information
- www.CPAN.org
- www.bioperl.org
- www.perl.com
- www.cetus-links.org/oo_infos.html
25Bioperl
- Set of modules that are extremely useful for working with biological data. Actively maintained.
- www.bioperl.org is a very good place to get the basics of Bioperl
- We will go through an example to see a typical use
26Bioperl has several basic types of objects
- Seq: a sequence, the most common type (Bio::Seq)
- Location objects: where it is, how long it is, etc.
- Interface objects (Bio::xyzI): no implementation, mostly documentation
27Bioperl documentation
- Several different ways to find out about a module
- perldoc Bio::Seq
- bioperl.org
- /usr/lib/perl5/site_perl/5.8.0/bptutorial.pl 100 Bio::Seq
- Data::Dumper to print the data structure
- Print the variable
28Bio perl demo
29Why use a database
- Transaction control: only one user can modify the data at any one time.
- Access control: some people can modify data, some can read data, others can create data structures.
- Fast handling of lots of data
- Precise definition of data (mostly).
- Easy to share data resources with others
30Many choices
- There are many types: MS Access, Excel (sort of), Sybase, Oracle, Postgres, mSQL, MySQL
- They each have their niche and function best in certain cases; there is also considerable overlap.
- SQL (Structured Query Language) is a common thread
31MySQL is better than YourSQL
- Free on Unix
- Good developer support
- Constant bug fixes and feature addition
- Good scalability to medium size and load, OK performance.
- Easy to install.
- Used at the Ensembl and UCSC genome browsers, so a lot of information is readily available in that format.
32Table Structure - Schema
Gene table: Gene_ID | Name
Alias table: Alias_ID | Gene_ID | Alias
Reference table: Reference_ID | Gene_ID | Reference | DataSource

Example record
Gene: ATP7B
Aliases:
- Wilson disease-associated protein
- Copper-transporting ATPase 2
References (DataSource, Reference):
- Enzyme Commission 3.6.3.4
- UniGene Hs.84999
- AffyProbeU133 204624_at
- AffyProbeU95 37930_at
- RefSeq NM_000053
- GenBank AF034838
- GenBank U11700
- LocusLink 540
33SQL (MySQL dialect)
- SELECT col_name FROM table WHERE col_name = value
- SELECT COUNT(*) FROM table WHERE col_name LIKE value
- SELECT COUNT(DISTINCT(col_name)) FROM table WHERE col_name IS NOT NULL
- CREATE, UPDATE, DELETE, INSERT have similar forms
34SQL cont.
- USE database_name
- Also can be specified on the command line with -D
- SHOW TABLES lists all the tables in that database (also SHOW DATABASES).
- DESCRIBE table_name lists the columns and datatypes for each column
- or SHOW COLUMNS FROM table_name
35More advanced SELECTS
- SELECT (column_list) FROM (table_list) WHERE (constraints) GROUP BY (grouping columns) ORDER BY (sorting columns) LIMIT (limit number)
- SELECT col_name FROM (table1, table2) WHERE table1_val = table2_val AND table1_val2 > value
- Example of an equi-join
36Getting the names right
- If you only have one table you only need to use the column name
- When you are using joins this may not be adequate.
- If two tables both have a column named primary you would need to call the column table1.primary or table2.primary
37Data Types
- INT
- Tinyint: -128 to 127
- Smallint: -32768 to 32767
- Mediumint: -8388608 to 8388607
- Int: -2147483648 to 2147483647
- Bigint: -9223372036854775808 to 9223372036854775807
- FLOAT
- Float 4 bytes
- Double 8 bytes
38Data Types cont.
- CHAR
- Char(n): character string of length n, n bytes
- Varchar(n): character string up to n long, L+1 bytes (L is the actual length)
- Text: up to 2^16 bytes
- BLOBs: Binary Large OBjects
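The signed integer ranges listed for the INT types follow from the storage size: n bytes give the range -2^(8n-1) to 2^(8n-1)-1. The sketch below recomputes them, assuming the usual MySQL sizes of 1, 2, 3, 4 and 8 bytes; Math::BigInt (a core module) is used so the 8-byte case stays exact. The helper signed_range is hypothetical.

```perl
use strict;
use warnings;
use Math::BigInt;

# Assumed storage size in bytes for each MySQL integer type.
my @types = (
    [ TINYINT   => 1 ],
    [ SMALLINT  => 2 ],
    [ MEDIUMINT => 3 ],
    [ INT       => 4 ],
    [ BIGINT    => 8 ],
);

# Range of a signed two's-complement integer stored in $bytes bytes.
sub signed_range {
    my ($bytes) = @_;
    my $half = Math::BigInt->new(2)->bpow(8 * $bytes - 1);   # 2^(bits-1)
    my $min  = $half->copy->bneg;                            # -2^(bits-1)
    my $max  = $half->copy->bdec;                            #  2^(bits-1) - 1
    return ($min->bstr, $max->bstr);
}

for my $type (@types) {
    my ($name, $bytes) = @$type;
    my ($min, $max) = signed_range($bytes);
    printf "%-9s %s to %s\n", $name, $min, $max;
}
```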
39Perl DBI
- Method for perl to connect to a database
(virtually any database) and read or modify data.
- The statements are constructed very similar to
SQL statements that would be entered on the
command line so learning SQL is still necessary
40Statements in DBI
- Connect
- Used to establish initial connection
- Prepare
- Prepare a statement to execute
- Execute
- Execute the statement
- Do
- prepare a statement that does not return results
and execute it
41Statements in DBI cont.
- Fetch
- Several types used to get returned data
- Disconnect
- Disconnect from the server
42Types of fetch
- fetchrow_array
- Used to fetch an array of scalars each time
- Can also use fetchrow_arrayref
- fetchrow_hashref
- Used to fetch a reference to a hash indexed by column name (DBI has no fetchrow_hash).
- Slower but cleaner code.
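A sketch of the two fetch styles side by side. It assumes DBI and the DBD::SQLite driver are installed and uses an in-memory SQLite database instead of MySQL, so no server is needed; the script skips the database part if the modules are missing. The Gene table follows the schema slide, and the second row is made-up data.

```perl
use strict;
use warnings;

# Assumption: DBI and DBD::SQLite are available; skip the demo if not.
my $have_dbi = eval { require DBI; require DBD::SQLite; 1 };

my (@by_position, @by_name);
if ($have_dbi) {
    my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
    $dbh->do("CREATE TABLE Gene (Gene_ID INTEGER PRIMARY KEY, Name TEXT)");
    $dbh->do("INSERT INTO Gene (Gene_ID, Name) VALUES (1, 'ATP7B')");
    $dbh->do("INSERT INTO Gene (Gene_ID, Name) VALUES (2, 'EXAMPLE2')");   # made-up row

    my $sth = $dbh->prepare("SELECT Gene_ID, Name FROM Gene ORDER BY Gene_ID");

    # fetchrow_array: columns by position.
    $sth->execute();
    while (my @row = $sth->fetchrow_array) {
        push @by_position, "$row[0]:$row[1]";
    }

    # fetchrow_hashref: columns by name (NAME_lc forces lower-case keys).
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref('NAME_lc')) {
        push @by_name, "$row->{gene_id}:$row->{name}";
    }

    $dbh->disconnect;
    print "by position: @by_position\n";
    print "by name:     @by_name\n";
}
else {
    print "DBI or DBD::SQLite not installed; skipping the database demo\n";
}
```

The hashref version keeps working if the column order in the SELECT changes, which is the "cleaner code" trade-off mentioned above.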
43More advanced statements
- Quote
- Used to properly quote data for use with a prepare statement
- $value = $dbh->quote($blast_result);
- Placeholders
- Speeds up execution, optional
- my $prep = $dbh->prepare("select x from y where z = ?");
- loop_start
- $prep->bind_param(1, $z);
- $prep->execute();
- loop_end
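The quote and placeholder steps can be sketched as a complete loop. As an assumption, this uses DBI with the DBD::SQLite driver and an in-memory database rather than MySQL, and it skips the database part if those modules are missing. The Alias table follows the schema slide; the alias values are illustrative.

```perl
use strict;
use warnings;

# Assumption: DBI and DBD::SQLite are available; skip the demo if not.
my $have_dbi = eval { require DBI; require DBD::SQLite; 1 };

my %gene_id_for;
if ($have_dbi) {
    my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
    $dbh->do("CREATE TABLE Alias (Alias_ID INTEGER PRIMARY KEY, Gene_ID INTEGER, Alias TEXT)");

    # quote: escapes the apostrophe so the value is safe inside literal SQL.
    my $tricky = "Wilson's disease protein";    # illustrative value
    my $value  = $dbh->quote($tricky);
    $dbh->do("INSERT INTO Alias (Gene_ID, Alias) VALUES (1, $value)");

    # Placeholder: prepare once, then bind and execute inside the loop.
    my $prep = $dbh->prepare("SELECT Gene_ID FROM Alias WHERE Alias = ?");
    for my $z ($tricky, "no such alias") {
        $prep->bind_param(1, $z);
        $prep->execute();
        my ($gene_id) = $prep->fetchrow_array;
        $prep->finish;
        $gene_id_for{$z} = $gene_id;
        print "$z => ", defined $gene_id ? $gene_id : "(not found)", "\n";
    }

    $dbh->disconnect;
}
else {
    print "DBI or DBD::SQLite not installed; skipping the database demo\n";
}
```

Bound values need no manual quoting, so placeholders also remove a common source of broken or unsafe SQL.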