Information Infrastructure II

1
Information Infrastructure II
  • I211 Week 6
  • Rajarshi Guha

2
Outline
  • Documenting Python code
  • Project Overview
  • Connecting to the web

3
Why Document
  • Simple code can be self-explanatory
  • Code that is more than 1 screenful will require
    some sort of explanation
  • For other people
  • For yourself, after 6 months
  • Trivial documentation is nearly as bad as no
    documentation
  • No need to say that i is a loop counter!

4
Python Docstrings
  • Simple way to document a function, a class, etc.
  • The docstring is a multi-line comment as the
    first thing in your function, method, or class

def myfunc(a, b):
    """A short summary of the function.

    What is the type of a and what is it for?
    What is the type of b and what is it for?
    What does the function return?
    """
    # the rest of your code
5
Making Use of Docstrings
  • If you have a set of functions in a file, the
    file is a module
  • After importing the module, you can access the
    functions
  • For any given function f you can do f.__doc__,
    which will output the docstring for that function

6
Making Use of Docstrings
myfunc.py:

def f1(a):
    """The docstring for f1"""
    return a

def f2(b, a):
    """The docstring for f2"""
    return a*b

>>> from myfunc import *
>>> print f1.__doc__
The docstring for f1
>>> print f2.__doc__
The docstring for f2
>>> help(f1)
7
Practicing Docstrings
  • From here on, all classes and methods developed
    as assignments will need to have docstrings
  • They don't need to be long
  • A short description of what the function or class
    does
  • What they take as input
  • What they will output (if a function or method)
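A docstring that follows these guidelines might look like this (a hypothetical function invented for illustration, shown in modern Python 3 syntax; the slides themselves use Python 2):

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius.

    temp_f: a number (int or float), the temperature in
    degrees Fahrenheit.

    Returns a float, the temperature in degrees Celsius.
    """
    return (temp_f - 32) * 5.0 / 9.0
```

A short summary, the input, and the output: that is all a docstring for an assignment needs.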

8
Project Overview
  • Unified interface to PubMed and PubChem
  • Along with some other functionality based on
    local databases/services
  • Web page front end to the whole thing

9
PubMed
  • Broad collection of databases
  • Medical terms
  • Literature
  • Genes
  • Proteins

10
PubChem
  • A collection of chemical structures (> 10
    million)
  • Biological data
  • Searchable

11
Goals of the Project
  • Utilize the Entrez utilities to retrieve
    information from PubChem/PubMed
  • A set of URLs which allow you to query these
    databases
  • Look up some databases located at IU
  • Cache results, so that if a query gets repeated
    you can pull it from a local DB
  • Utilize web services to get 2D images
  • Put a web page front end onto all of this

12
Requirements
  • You'll be dealing with
  • Bibliographic information
  • Compound information
  • You'll need to write classes that represent these
    things and provide methods
  • To easily get specific pieces of information
  • Interact with the databases

13
What Will Be Given?
  • I'll provide the relevant URLs, keywords, etc.
  • You'll have to use various string methods and
    URL-related methods to construct the appropriate
    URLs and get the information
  • I'll create the database schema
  • You will just need to perform inserts, updates,
    etc.
  • I'll give a brief overview of SQL, but it will be
    enough to get the job done
  • We won't need any SQL magic!
  • I'll provide an overview of HTML
  • You can easily get HTML tutorials on the web

14
Procedure?
  • Each week I'll list out tasks that need to be
    completed
  • Tasks will include the code to do the job
  • Test code that will show that the actual code
    works
  • You will need to submit the file(s) by the
    following week
  • 5 points for submission and correct running of
    the test code
  • At the end of the semester, I'll run the code and
    test the web page.
  • 10 points when it all comes together
  • The total is scaled down to the range 0-30

15
Connecting to the Web
  • Python is very good for network related stuff
  • Sending email
  • Doing FTP, SSH
  • Doing HTTP (i.e., WWW)
  • We'll just focus on using the WWW from Python

16
The urllib module
  • Provides various functions to
  • Open a connection to a URL
  • Read stuff from a URL
  • Quote URLs
  • Getting stuff from the web is equivalent to
    opening a file and reading from it
  • Here a file is a URL

17
Opening a Connection
  • Simply specify the URL
  • Need to add http://
  • Just www.google.com will not work
  • http:// tells Python what protocol to use to
    connect to the URL
  • For more useful things you'll construct a URL
  • Have to worry about quoting

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
18
Get Data from the Connection
  • The result of urlopen is a connection object
  • Behaves the same as a file object
  • Except you can't write to it
  • To get the page from that URL just call
    readlines()

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
19
What Did We Get?
  • readlines() returns a list containing the lines
    of the page
  • REMEMBER You get back HTML which is not the same
    as what you see in your browser
  • To view the data, just dump it to a file
  • Open up page.html in your browser

import urllib
url = 'http://www.google.com'
con = urllib.urlopen(url)
page = con.readlines()
f = open('page.html', 'w')
for x in page:
    f.write(x)
f.close()
20
So What's the Big Deal?
  • It's very easy to get a page
  • The real utility is extracting information from
    the page
  • In many cases, the URL is actually a CGI program
    that can accept arguments
  • You can get different answers depending on what
    URL you construct
  • Depending on what the developer provides you may
  • Curse, tear your hair out, get drunk
  • Write the code in 5 minutes and go party

21
Getting Stock Quotes
  • Yahoo provides a stock quote service
  • Visit http://finance.yahoo.com/q?s=GOOG
  • Lots of useful info
  • Last trade
  • Volume
  • Market cap
  • I want to keep track of 10 stocks
  • Navigate the page 10 times?

22
Getting Stock Quotes
  • Perfect job for a Python program
  • If we open a connection using that URL we get the
    page for the Google stock
  • What about others?

http://finance.yahoo.com/q?s=GOOG

"http://finance.yahoo.com/q?s=" is the constant URL;
"GOOG" changes for each company
23
Question
s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'
urlForGoogle = ?
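One plausible answer to the question above is plain string concatenation (variable names taken from the slide; shown with Python 3 print syntax):

```python
s = 'http://finance.yahoo.com/q?s='
googleSymbol = 'GOOG'

# The full URL is just the constant part plus the ticker symbol
urlForGoogle = s + googleSymbol
print(urlForGoogle)  # http://finance.yahoo.com/q?s=GOOG
```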
24
Getting the Stock Pages
  • The code is basically the same as before
  • This time we construct the URL for each company

import urllib
baseurl = 'http://finance.yahoo.com/q?s='
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    con = urllib.urlopen(baseurl + company)
    lines = con.readlines()
    page = ''.join(lines)
    # do something with the page
25
Extracting Data (?)
  • So we've gotten the stock page for GOOG
  • We want the last trade, market cap, etc.
  • Should be a matter of just looking at the text we
    downloaded and finding "Last Trade", etc.?
  • Yes
  • You will not want to do it this way

26
<html><head><meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1"><title>GOOG:
Summary for GOOGLE - Yahoo! Finance</title><link
rel="stylesheet" type="text/css" ...>
[... hundreds more lines of raw HTML, CSS, and JavaScript ...]
View the page in your browser and then right
click and view source - this is what your program
sees
27
Extracting Data
  • In many cases this is all you get
  • You have to look at this type of data and
    identify patterns
  • In many cases it's doable, but it is very tiresome
  • Luckily Yahoo provides a much easier way to get
    the data we need

28
Getting Stock Data Easily
  • Yahoo provides a CGI program that returns stock
    quote data in CSV format
  • The URL for Google stock is
  • If we connect to this URL we don't get back an
    HTML page
  • Instead we get a single line

http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27,522.22,2765940,165.0B,524.78,"+0.76%",523.18,"392.74 - 558.58",12.306,42.64,"GOOGLE"
29
What Are We Getting?
  • If you look at the components of the line, you'll
    see that they match the web page
  • What this URL returns to you is
  • Symbol, last trade, time, change
  • Open, high, low, volume
  • Market cap, previous close, previous change
  • Year low, year high, EPS, P/E ratio
  • And it's all in a nicely splittable string!
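Pulling fields out of such a line is a one-line split; a sketch using a shortened version of the sample line (Python 3 print syntax):

```python
line = '"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27'

fields = line.split(',')
symbol = fields[0].strip('"')   # remove the surrounding double quotes
last_trade = float(fields[1])   # the second field is the last trade
print(symbol, last_trade)       # GOOG 528.75
```

Note that a naive split breaks if a quoted field itself contains a comma; the standard csv module handles that case more robustly.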

30
Getting the Last Trade
  • Now we can easily get the last trade for a set of
    stocks

import urllib
baseurl = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1d1t1c1ohgvj1pp2owern&e=.csv'
cos = ['GOOG', 'AAPL', 'MSFT', 'SNE']
for company in cos:
    url = baseurl % (company)
    con = urllib.urlopen(url)
    lines = con.readlines()
    data = lines[0]
    data = data.split(',')
    print 'Last trade for %s was %3.2f' % (company, float(data[1]))
31
Quoting
  • A CGI program is something that anybody can
    access
  • If it takes arguments, anybody can send anything
  • People will try to hack you
  • People will send invalid input

32
Quoting
  • In general, URLs support the letters A-Z, a-z,
    0-9, and the symbols / and .
  • You can put in other symbols, such as ~ or a space,
    but those are special characters
  • Ideally, you should escape them
  • Just as we use \n or \t in an ordinary string
  • When working with URLs we call this quoting

33
Quoting Hex
  • We don't use the \ symbol to do quoting
  • Instead special characters are converted to their
    hexadecimal codes
  • In the computer a single character is represented
    by a single integer (ASCII code)
  • 'A' = 65, 'a' = 97, '~' = 126, and so on
  • We can get the integer code for a character using
    the ord function
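For example (Python 3 print syntax; the %X format code renders an integer in uppercase hexadecimal):

```python
print(ord('A'))          # 65
print(ord('a'))          # 97
print(ord('~'))          # 126
print('%X' % ord('~'))   # 7E, the hex form used in quoting
```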

34
Quoting Hex
  • So quoting special characters in a string means
  • Identify the special character
  • Get its integer code
  • Convert the code to hex
  • Finally replace the special character with %XX,
    where XX is the hex code

char = '~'
n = ord(char)
h = '%X' % (n)
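The four steps above can be sketched as a small helper (hypothetical function names, Python 3 syntax; for real work you would use urllib's quote function instead):

```python
def quote_char(char):
    """Return the %XX form of a single character."""
    n = ord(char)          # integer (ASCII) code
    return '%%%02X' % n    # hex code, prefixed with %

def quote_string(s, special=' :~'):
    """Replace every special character in s with its %XX form."""
    return ''.join(quote_char(c) if c in special else c for c in s)

print(quote_string('http://localhost/~rguha'))
# http%3A//localhost/%7Erguha
```

The set of "special" characters here is deliberately tiny; a real quoting function escapes everything outside the small set of URL-safe characters.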
35
Examples
  • http://localhost/~rguha becomes
    http%3A//localhost/%7Erguha
  • : becomes %3A and ~ becomes %7E
  • http://moin.org/A Wiki Page becomes
    http%3A//moin.org/A%20Wiki%20Page
  • Spaces become %20

36
urllib Lets Us Do This Easily
  • Use the quote function
  • In many cases you don't need to bother
  • It's a good habit to quote any URL that you use
    to connect to something
  • Very easy to do!

>>> import urllib
>>> url = 'http://rguha.ath.cx/~rguha'
>>> url = urllib.quote(url)
>>> print url
http%3A//rguha.ath.cx/%7Erguha
37
Going the Other Way
  • You might receive a quoted URL
  • Special chars are already replaced by hex
  • It's a pain to work with them
  • Convert it back to the special char form using
    unquote

>>> u = 'http%3A//moin.org/A%20Wiki%20Page'
>>> newu = urllib.unquote(u)
>>> print newu
http://moin.org/A Wiki Page
38
Connecting to CGI Programs
  • CGI programs are basically programs behind a web
    page
  • They are represented as URLs
  • Take arguments just like a function
  • Basically, a CGI program can be viewed as a
    function located on someone else's computer
  • The return values can vary
  • Could be tough-to-parse HTML
  • Could be as simple as a comma separated list

39
Connecting to a CGI Program
  • Simply construct the proper URL
  • The URL for a CGI program is of the form

http://finance.yahoo.com/q?s=GOOG

http://quote.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1c1ohgvj1pp2owern&e=.csv

URL_TO_CGI?arg1=value1&arg2=value2
40
An Example
http://quote.yahoo.com/d/quotes.csv? s=GOOG &f=sl1d1t1c1oh... &e=.csv

URL for the CGI: http://quote.yahoo.com/d/quotes.csv
Parameter names and values: s=GOOG, f=sl1d1t1c1oh..., e=.csv

Basically, making calls to CGIs is just string
manipulation. Handling the return value is where
all the effort goes.
41
urllib Makes it Easier
  • Since the parameters are basically name/value
    pairs, what type of object should we use to
    represent them?

42
urllib Makes it Easier
  • Create a dictionary of the name, value pairs and
    then call urlencode
  • And pass the result to urlopen along with the URL
    of the program
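In Python 3 the same functions live in urllib.parse (urlencode and the quoting functions moved out of the old urllib module used on these slides); a sketch of the dictionary approach:

```python
from urllib.parse import urlencode

# Name/value pairs for the CGI program, held in a dictionary
params = {'s': 'GOOG', 'f': 'sl1d1t1c1ohgvj1pp2owern', 'e': '.csv'}

query = urlencode(params)   # builds 'name1=value1&name2=value2&...'
url = 'http://quote.yahoo.com/d/quotes.csv?' + query
```

As a bonus, urlencode also percent-quotes each name and value for you, so the two steps (quoting and joining) collapse into one call.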

43
Example
>>> url = 'http://quote.yahoo.com/d/quotes.csv?'
>>> d = {'s': 'GOOG', 'f': 'sl1d1t1c1ohgvj1pp2owern', 'e': '.csv'}
>>> params = urllib.urlencode(d)
>>> print params
s=GOOG&e=.csv&f=sl1d1t1c1ohgvj1pp2owern
>>> con = urllib.urlopen(url + params)
>>> print con.readlines()
['"GOOG",528.75,"9/14/2007","4:00pm",+3.97,523.18,530.27...']