Title: cfgPres
1 Security Oriented Data Grids for Microarray Expression Profiles
Prof. Richard O. Sinnott
Technical Director, National e-Science Centre
Deputy Director (Technical), Bioinformatics Research Centre
University of Glasgow
25th April 2007
2 BRIDGES Project
3 Lessons learned
- Scientists wary of new paradigms
- myData culture
- Need simple but secure access to data / computational resources
- Hide the Grid as much as possible
- Take certificates away from end users
- Present the Grid like the Internet, i.e. browser point and click!
- Need to work outside of silos
- Inter-disciplinary, translational research
- Seamlessly move between associated resources
- Need fine grained single sign-on across these resources
4 Security Usability
- Grid Security
- AAAA
- Users like usernames/passwords
- Provide them (once!)
- Users don't like/understand X.509 based PKI
- Forget training, education for most users!
- > openssl pkcs12 -in cert.p12 -clcerts -nokeys -out usercert.pem!
- The vast majority most certainly won't jump through hoops to get on the Grid
- me-Science culture
5 AAAA
- Identity management issues
- Certificate Revocation Lists
- When revoked? By whom? How timely?
- Strong passwords for private keys
- Users write them down, share them, forget them
- Privilege Management
- Numerous domains where users never get access to a local account to do stuff
- I need to access your NHS DB to run queries, change tables, run arbitrary code
- At NeSC Glasgow we have focused on
- improving AAAA and AAAA
6 Improving AAAA
- Best to exploit local authentication
- Sites know best if users are still at the institution and are best placed to state what their privileges are/should be
- Introducing Shibboleth
- will replace Athens as access management system across UK academia
- i.e. this is mainstream and not (weird) Grid solutions!
- Federations based on trust
- or more accurately "trust but verify"
- numerous international federations exist: MAMS, SWITCH, HAKA, SDSS
7 Typical Shibboleth Scenario
[Diagram labels: User; W.A.Y.F.; Federation; Home Institution; Identity Provider; AuthN; Service provider; Grid resource / portal; step "5. User accesses resource"]
8 It's a start, but
- Benefit from local authentication but really want finer grained control
- I know you have authenticated, but I need to know that you have sufficient/correct privileges to access my VO resources
- can also return various other information needed to support authorisation decisions
- At NeSC we have been working extensively with PERMIS and portal content configuration
9 Finer Grained Shibboleth Scenario
[Diagram labels: User; W.A.Y.F.; Federation; Home Institution; Identity Provider; AuthN; Service provider; Shib Frontend; Grid Portal; Grid Application; step "5. Pass authentication info and attributes to authZ function"; step "6. Make final AuthZ decision"; caption "Browser based single sign-on"]
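Steps 5 and 6 of this scenario (pass authentication info and attributes to an authZ function, then make the final AuthZ decision) might be sketched as follows. This is only an illustration under stated assumptions: `eduPersonEntitlement` is a real eduPerson attribute name, but the `Assertion` shape, the `POLICY` table, the `urn:example:...` values and the `authorise` function are hypothetical and are not the PERMIS or Shibboleth API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """Attributes released by the home institution (hypothetical shape)."""
    principal: str
    attributes: dict = field(default_factory=dict)


# Hypothetical VO policy: resource -> entitlement values that grant access.
POLICY = {
    "gemeps:query": {"urn:example:gemeps:member", "urn:example:gemeps:admin"},
    "gemeps:upload": {"urn:example:gemeps:admin"},
}


def authorise(assertion, resource, policy=POLICY):
    """Final authZ decision: allow only if a held entitlement is listed for the resource."""
    required = policy.get(resource)
    if not required:
        return False  # deny by default when the resource has no policy entry
    held = set(assertion.attributes.get("eduPersonEntitlement", []))
    return bool(held & required)
```

The point of the sketch is the split of responsibilities: the home institution asserts who the user is and which attributes they hold, while the resource owner keeps the policy and makes the final decision locally.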
10 Inter-disciplinary e-Life Science Research
[Diagram labels: GRID Security; Nucleotide sequences; Nucleotide structures; Gene expressions; Protein Structures; Protein functions; Protein-protein interaction (pathways); Cell; Cell signalling; Tissues; Organs; Physiology; Organisms; Populations]
11 Grid Enabled Microarray Expression Profile Search (GEMEPS) Project
- 1 year BBSRC funded project just completed
- Involves Glasgow; Cornell University, US; Riken Institute, Japan
- Aim to provide tools for discovery, comparison and analysis of microarray data sets
- BRIDGES focused on genes of interest
- GEMEPS focuses on microarrays consisting of many thousands of genes of interest
- Gene expression profiles
12 Messy
3.5MB per experiment, thousands of experiments,
...
13 Key Scenarios
- Key questions to be answered by GEMEPS infrastructure
- who has run a microarray experiment and generated similar results to mine?
- How similar were these results?
- who has undertaken experiments and produced data relevant to my own interests,
- for a particular phenotype,
- for a particular cell type,
- for a particular pathogen,
- on a particular platform
-
- show me the conditions and analysis associated with experimental results similar to mine
14 GEMEPS Context
- Levels of gene expression or differential expression important
- Requires security focused data access, integration and data mining
- Microarrays expensive to run and contain potentially important (academically/commercially) data sets
- Key aspect is that scientists keep their own data and define their own policies on access and usage
15 Microarray Repositories and Data Formats
- MIAME goal is
- minimum information required to interpret unambiguously and potentially reproduce and verify an array based gene expression monitoring experiment
- Several data formats/controlled vocabularies and ontologies defined and applied across different sites/communities including
- MAGE-ML - Microarray Gene Expression Markup Language
- SOFTtext - Simple Omnibus Format in Text
- MINiML - MIAME Notation in Markup Language
- SOFTmatrix - Simple Omnibus Format in Matrix
-
- Numerous major repositories now exist including
- Gene Expression Omnibus (GEO),
- ArrayExpress,
- CIBEX
16 GEMEPS Discovery and Analysis of Matching Profiles
- Step 1
- Query over appropriate meta-data and return matching experiments
- (appropriate to level of privilege)
- Step 2
- For matching results
- extract gene ordering (based on level of expression values)
- Can include filtering/cut off, e.g. so only compare the 10, 100, 1000 most expressed genes from experiments
- run similarity algorithm to determine best match of own data results against experiment gene expression ordering
- currently support Spearman Rank/Kendall Tau
- Step 3
- Merge the results and display
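Steps 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the GEMEPS implementation: the function names, the dict-of-expression-values data shape, the alphabetical tie-break and the choice to compare only genes present in both cut-off orderings are all assumptions. The `similarity` argument stands for whichever rank correlation is plugged in (Spearman Rank or Kendall Tau).

```python
def gene_ordering(expression, top_k=None):
    """Gene IDs sorted from most to least expressed, optionally cut off at top_k.

    Ties in expression value are broken alphabetically by gene ID (an assumption).
    """
    ordered = sorted(expression, key=lambda gene: (-expression[gene], gene))
    return ordered if top_k is None else ordered[:top_k]


def shared_ranks(own, other, top_k=None):
    """Ranks (1 = most expressed) of the genes found in both cut-off orderings."""
    own_order = gene_ordering(own, top_k)
    other_order = gene_ordering(other, top_k)
    common = set(own_order) & set(other_order)
    own_kept = [gene for gene in own_order if gene in common]
    other_kept = [gene for gene in other_order if gene in common]
    own_rank = {gene: rank for rank, gene in enumerate(own_kept, start=1)}
    other_rank = {gene: rank for rank, gene in enumerate(other_kept, start=1)}
    genes = sorted(common)
    return [own_rank[g] for g in genes], [other_rank[g] for g in genes]


def best_matches(own, experiments, similarity, top_k=None):
    """Score each candidate experiment; return (experiment_id, score) pairs, best first."""
    scored = []
    for experiment_id, profile in experiments.items():
        x, y = shared_ranks(own, profile, top_k)
        if len(x) >= 2:  # a rank correlation needs at least two shared genes
            scored.append((experiment_id, similarity(x, y)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch the Step 1 meta-data query (filtered by level of privilege) is assumed to have already produced the `experiments` mapping, so the similarity step only ever sees data the user is allowed to compare against.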
17 Experiment Similarity
- Based on correlation coefficient
- Measuring correspondence between two rankings, and assessing the significance of this correspondence
- rank correlation coefficient given in interval [-1, 1] where
- If the agreement between the two rankings is perfect, i.e., the two rankings are the same, the coefficient has value 1
- for GEMEPS implies that the same sets of genes in the same ordering exist
- If the disagreement between the two rankings is perfect, i.e., one ranking is the reverse of the other, then the coefficient has value -1
- For all other arrangements the value lies between -1 and 1, and increasing values imply increasing agreement between the rankings.
- If the rankings are completely independent, the coefficient is expected to be approximately 0.
- Spearman Rank correlation coefficient given by
  $\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
- where
- $d_i$ = the difference between each rank of corresponding values of x and y, and
- $n$ = the number of pairs of values
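A minimal sketch of the Spearman Rank calculation, using the standard no-ties form 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) with d_i and n as defined on this slide. The function names are illustrative, and the code assumes there are no tied values; with ties a tie-corrected variant would be needed.

```python
def ranks(values):
    """Rank values from 1 to n, where rank 1 is the largest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences without ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two pairs of values")
    # d_i is the difference between the ranks of corresponding values of x and y.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Identical orderings give 1 and fully reversed orderings give -1, matching the interval described above.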
18 Kendall Tau
- Kendall Tau correlation coefficient given by
  $\tau = \frac{4P}{n(n - 1)} - 1$
- where
- $n$ is the number of items
- $P$ is the sum, over all the items, of items ranked after the given item by both rankings.
- Example of ranking between height/weight
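A minimal sketch of the Kendall Tau calculation in the form 4P / (n(n - 1)) - 1, with n and P as defined on this slide. The function name is illustrative, the inputs are assumed to be two rankings of the same items with no ties, and the quadratic loop is chosen for clarity rather than speed.

```python
def kendall_tau(rank_x, rank_y):
    """Kendall tau for two rankings of the same n items (assumes no ties)."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_x)
    if n < 2:
        raise ValueError("need at least two items")
    # P: for each item i, count the items j ranked after i by both rankings.
    p = sum(
        1
        for i in range(n)
        for j in range(n)
        if rank_x[j] > rank_x[i] and rank_y[j] > rank_y[i]
    )
    return 4 * p / (n * (n - 1)) - 1
```

For n items there are n(n - 1) / 2 pairs, so P equal to that maximum gives 1 and P equal to 0 (one ranking the reverse of the other) gives -1.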
19 Demo?
- (or death by snapshot?)
24 A different user logs in to Bioinformatics/GEMEPS portal
32 Single sign-on!!!
35 Conclusions
- Shibboleth model is aligned with wider UK academic community (eduPerson attributes)
- No distinction to end user from accessing e-journal or Grid resource more generally
- Readily supports inter-disciplinary hopping!
- Many other projects in this space at NeSC
- Drug Discovery Portal
- Paediatric Endocrinology Registry for Congenital Anomalies
- Generation Scotland Scottish Family Health Study
- Brain Trauma
36 LSGrid 2007 www.lsgrid.org/2007
- 4th International Life Science Grid Conference
- Will take place at University of Glasgow on 6-7th September 2007
- Paper submission deadline 29th June 2007
- 10-13th September 2007 UK e-Science All Hands Meeting, Nottingham, UK
- Braemar Highland Games 1st September 2007
- Usually attended by the Queen