Title: Strings in Python
1. Strings in Python
2. Computers store text as strings
>>> s = "GATTACA"
Each letter in s (G, A, T, T, A, C, A) is a character.
3. Why are strings important?
- Sequences are strings
  ..catgaaggaa ccacagccca gagcaccaag ggctatccat..
- Database records contain strings
  LOCUS AC005138
  DEFINITION Homo sapiens chromosome 17, clone hRPK.261_A_13, complete sequence
  AUTHORS Birren,B., Fasman,K., Linton,L., Nusbaum,C. and Lander,E.
- HTML is one (big) string
4. Getting Characters
>>> s = "GATTACA"
>>> s[0]
'G'
>>> s[1]
'A'
>>> s[-1]
'A'
>>> s[-2]
'C'
>>> s[7]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
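The indexing rules above can be checked with a short script. This is only a sketch, written for Python 3 (an assumption; the session above is Python 2). It shows that a negative index counts back from the end, so s[-1] is the same character as s[len(s) - 1], and that an index past the end raises IndexError.

```python
s = "GATTACA"

# Positive indexes count from 0 at the left end.
first = s[0]

# Negative indexes count back from the right end.
last = s[-1]
same_as_last = s[len(s) - 1]

# Indexing past the end raises IndexError instead of returning a value.
try:
    s[7]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, same_as_last, out_of_range)
```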
5. Getting substrings
>>> s[1:3]
'AT'
>>> s[:3]
'GAT'
>>> s[4:]
'ACA'
>>> s[3:5]
'TA'
>>> s[:]
'GATTACA'
>>> s[::2]
'GTAA'
>>> s[-2:2:-1]
'CAT'
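As a rough summary of the slices above: the general form is s[start:stop:step], where start is included, stop is excluded, and any part left out takes a default. The sketch below assumes Python 3 (the slides use Python 2, but slicing behaves the same way); the variable names are illustrative only.

```python
s = "GATTACA"

# s[start:stop] includes start and excludes stop.
middle = s[1:3]

# Omitted bounds default to the ends of the string.
prefix = s[:3]
suffix = s[4:]

# A third value is the step; a negative step walks right to left.
every_other = s[::2]
backwards = s[-2:2:-1]
reverse = s[::-1]

# Unlike single-character indexing, an out-of-range slice does not raise;
# it is simply cut off at the end of the string.
beyond = s[5:100]

print(middle, prefix, suffix, every_other, backwards, reverse, beyond)
```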
6. Creating strings
Strings start and end with single or double quote characters (the opening and closing quotes must be the same).
"This is a string"
"This is another string"
""
"Strings can be in double quotes"
'Or in single quotes.'
'There's no difference.'
'Okay, there\'s a small one.'
7. Special Characters and Escape Sequences
Backslashes (\) are used to introduce special characters.
>>> s = 'Okay, there\'s a small one.'
The \ escapes the following single quote.
>>> print s
Okay, there's a small one.
8. Some special characters
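A few of the escape sequences Python recognises, as a brief sketch (Python 3 syntax assumed; the example strings are made up for illustration). Each escape is typed as two characters but stored as one.

```python
# Each escape sequence is written with a backslash but stored as one character.
newline = "line one\nline two"   # \n starts a new line
tab = "name\tvalue"              # \t is a tab
backslash = "C:\\temp"           # \\ is a single backslash
single = 'there\'s'              # \' is a single quote inside single quotes
double = "say \"hi\""            # \" is a double quote inside double quotes

print(newline)
print(tab)
print(backslash, single, double)

# Both of these strings have length 1.
print(len("\n"), len("\\"))
```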
9. Working with strings
>>> len("GATTACA")              # length
7
>>> "GAT" + "TACA"              # concatenation
'GATTACA'
>>> "A" * 10                    # repeat
'AAAAAAAAAA'
>>> "G" in "GATTACA"            # substring test
True
>>> "GAT" in "GATTACA"
True
>>> "AGT" in "GATTACA"
False
>>> "GATTACA".find("ATT")       # substring location
1
>>> "GATTACA".count("T")        # substring count
2
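Two details of find and count are easy to miss, so here is a small sketch of them (Python 3 assumed; variable names are illustrative). find returns -1 rather than raising when the substring is absent, and count only tallies non-overlapping matches.

```python
seq = "GATTACA"

# find returns the index of the first match...
first_t = seq.find("T")

# ...and -1 when the substring is absent, so compare against -1 explicitly.
missing = seq.find("AGT")

# An optional second argument starts the search further along the string.
second_t = seq.find("T", first_t + 1)

# count tallies non-overlapping matches: "AAAA" holds two "AA", not three.
overlapping = "AAAA".count("AA")

print(first_t, missing, second_t, overlapping)
```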
10. Converting from/to strings
>>> "38" + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
>>> int("38") + 5
43
>>> "38" + str(5)
'385'
>>> int("38"), str(5)
(38, '5')
>>> int("2.71828")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for int(): 2.71828
>>> float("2.71828")
2.71828
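Because int and float raise ValueError on text they cannot parse, one way to convert user-supplied text safely is to catch that exception. The helper below is a hypothetical sketch (to_number is not a built-in, and Python 3 is assumed), not the only way to do it.

```python
def to_number(text):
    """Return text as an int if possible, else as a float, else None.

    to_number is an illustrative helper, not part of Python itself.
    """
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None


print(to_number("38"))
print(to_number("2.71828"))
print(to_number("GATTACA"))
```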
11. Change a string?
Strings cannot be modified. They are immutable. Instead, create a new one.
>>> s = "GATTACA"
>>> s[3] = "C"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'
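The slice-and-concatenate pattern above can be wrapped in a small function. This is a sketch under stated assumptions: replace_at is a hypothetical name, Python 3 is assumed, and the index is assumed to be non-negative and within the string.

```python
def replace_at(s, index, new_char):
    """Return a copy of s with the character at index replaced.

    Illustrative helper; assumes 0 <= index < len(s).
    """
    return s[:index] + new_char + s[index + 1:]


original = "GATTACA"
changed = replace_at(original, 3, "C")

# The original string is untouched; a new string was built.
print(original, changed)
```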
12. Some more methods
>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "")
'GTACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False
13. Ask for a string
The Python function raw_input asks the user (that's you!) for a string.
>>> seq = raw_input("Enter a DNA sequence: ")
Enter a DNA sequence: ATGTATTGCATATCGT
>>> seq.count("A")
4
>>> print "There are", seq.count("T"), "thymines"
There are 7 thymines
>>> "ATA" in seq
True
>>> substr = raw_input("Enter a subsequence to find: ")
Enter a subsequence to find: GCA
>>> substr in seq
True
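The session above is Python 2. In Python 3, raw_input is named input and print is a function. The sketch below assumes Python 3; ask_and_count and its reader parameter are hypothetical names, added so the example can run without anyone typing at a prompt.

```python
def ask_and_count(base, reader=input):
    """Prompt for a DNA sequence and return how often base occurs in it.

    reader defaults to the built-in input; passing another function
    lets the example run unattended.
    """
    seq = reader("Enter a DNA sequence: ")
    return seq.count(base)


# In an interactive session you would call: ask_and_count("T")
# Here a stand-in reader supplies the same sequence used above.
thymines = ask_and_count("T", reader=lambda prompt: "ATGTATTGCATATCGT")
print("There are", thymines, "thymines")
```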
14. Assignment 1
Ask the user for a sequence, then print its length.
Enter a sequence: ATTAC
It is 5 bases long
15. Assignment 2
Modify the program so it also prints the number of A, T, C, and G characters in the sequence.
Enter a sequence: ATTAC
It is 5 bases long
adenine: 2
thymine: 2
cytosine: 1
guanine: 0
16. Assignment 3
Modify the program to allow both lower-case and upper-case characters in the sequence.
Enter a sequence: ATTgtc
It is 6 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1
17. Assignment 4
Modify the program to print the number of unknown characters in the sequence.
Enter a sequence: ATTUgtc
It is 7 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1
unknown: 1