Title: What is a LIMS
1What is a LIMS?
- LIMS: Laboratory Information Management System
- Computerized system that tracks and manages samples through a protocol
- interfaces for both laboratory personnel and instruments
- helps support high-throughput operations
2Types of LIMS
- Enterprise
- cover all aspects of scientific research
- data capture
- reagent use and purchasing tracking
- Protocol-specific
- cover a specific protocol
- data capture
3sample management
inventory management
data collection
instrument management
data warehouse
chain of custody
resource management
data analysis
5Microarrays
- large-scale sequencing projects like the Human Genome Project have given us the ability to examine the complete transcriptome (the transcriptional response to an environmental challenge)
- new (and expensive) technology
- large output of data
6Microarray Data
- produced in a tabular format (rows and columns)
- users are relatively unsophisticated in computational and informatics skills
- much data ends up in spreadsheets, which lack the capability to handle rich datasets (no complex query or visualization capabilities)
7Microarray Databases
- plethora of databases and schemas
- three types of interactions
- local data management
- publication of data in a repository
- analysis of repository data
- the latter two interactions require a certain
level of sophistication to consolidate exogenous
data
8Microarrays Concept
9Microarrays Raw Data
10Microarrays Data
1 AC3.5 Member of the aminopeptidase protein family 5 10337580 AC3.5 20834 2/25/00 0 849 196 653 650 144 506 438 97 341 199 155 161 924 1.290513 0.774885 1.913864 0.522503 0.734 0.688 0.870632 0.71 0.71 1787 51 1802 66 1 1 1 1 A 1 0 U
2 AC3.7 Member of the UDP-glucuronosyltransferase protein family 5 10344769 AC3.7 20835 2/25/00 4 234 186 48 188 154 34 127 104 23 187 163 79 594 1.411764 0.708333 2.093682 0.477628 1.2 0.116 0.219089 0.32 0.21 1798 953 1809 964 1 1 1 1 A 2 2 U
3 AC3.8 Member of the UDP-glucuronosyltransferase protein family 5 10347864 AC3.8 20836 2/25/00 0 363 198 165 348 155 193 235 105 130 254 221 121 593 0.854922 1.169696 1.267871 0.788724 1.241 1.046 0.858487 0.25 0.29 1788 71 1801 84 1 1 1 1 A 3 0 U
11Local Databases
- make data available to local researchers
- may have WWW-based tools
- database and compute server centralized and
closely linked
12GeneX
- National Center for Genome Resources
- www.ncgr.org/research/genex
- relational database with Perl, R, and Java
components
13GeneX Features
- Free
- integrated and extensible toolset
- multiple types of array technology in a single database
- experiment-centric design
- supports an XML specification to allow
interchange between databases
14BASE
- BioArray Software Environment
- http://base.thep.lu.se/
- Relational database (MySQL) with WWW interface built upon C/JavaScript/PHP
15BASE Features
- Free
- MIAME compliant
- user administration
- array production
- sample management
17Repositories
- provide public access to multiple datasets
- create a standard database similar to those for sequence data
- automatic deposition of data upon publication
18Stanford Microarray Database
- genome-www4.stanford.edu/MicroArray
- WWW-based database and a dataset distribution system
- relational database
- Perl/Java toolset
- supports some complex querying as well as browsing for datasets
- datasets distributed as compressed flat files and/or graphical images
19GEO
- Gene Expression Omnibus
- www.ncbi.nlm.nih.gov/geo/
- data repository and distribution system
- precomputed definitions and descriptions of data
to aid in data set retrieval
20Data Interchange
- Proposed interchange standard
- MIAME
- Proposed OMG exchange standards
- MAML
- GEML
- NetGenics
21MIAME
- Minimal Information About a Microarray Experiment
- www.mged.org/Annotations-wg/
- Goal
- specify the minimum amount of information needed to ensure interpretability
- facilitate creation of repositories
- encourage journals and funding agencies to
require submission of data to repositories
22Design Considerations
- reflect data accurately
- efficient access to data
- efficient storage of data
- compatibility with other databases
23Data Representation
[Diagram: External Sequence Databases; GIPO; spots; Conditions; ????; Experiment; Sample; Tissue; Species; Protocol]
24MIAME Considerations
- Experimental design: the set of hybridization experiments as a whole
- Array design: each array used and each element (spot) on the array
- Samples: samples used, extract preparation and labeling
- Hybridizations: procedures and parameters
- Measurements: images, quantitation, specifications
- Normalization controls: types, values, specifications
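A minimal sketch of how the six MIAME parts listed above could be used as a completeness checklist. The section key names and the `missing_miame_sections` helper are hypothetical names chosen for illustration; they are not part of the MIAME specification or of any tool mentioned in these slides.

```python
# Hypothetical keys, one per MIAME part named on the slide.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)

def missing_miame_sections(annotation):
    """Return the MIAME sections that are absent or empty in an annotation dict."""
    return [section for section in MIAME_SECTIONS if not annotation.get(section)]

# Made-up annotation with two parts still to be filled in.
annotation = {
    "experimental_design": "disease vs. normal, 12 hybridizations",
    "array_design": "cDNA array, in-house print",
    "samples": "liver biopsy RNA",
    "hybridizations": "",
    "measurements": "scanned images plus quantitation tables",
}
missing = missing_miame_sections(annotation)
print(missing)
```

A check like this only confirms that each part is present; it says nothing about whether the content is sufficient to interpret the experiment.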
26Background
- Center for Biomedical Genomics and Informatics
- Engaged in a number of gene expression studies, including liver disease, osteoarthritis, and cancer
- Species studied: human and rat
- cDNA slides printed in-house (5K human chip, 40K human chip)
27GMU Clinical Genomics
- studying the relationship between disease and genome expression
- clinical measurements
- standard battery of tests
- genomic measurements
- gene expression levels
- genetic variation
- derive correlation between clinical/genomic
factors and treatment outcome
28Gene Expression Queries
Patient Demographic Queries
Microarray Data
Clinical Data
Clinical Database
Expression Database
29Dataflow
Clinical Tests and Samples
Clinical Database
Analysis (Genespring, etc.)
RNA Extraction Protocols
LIMS
WWW Access (GENet)
Researchers
Microarray Experiment Protocols
BASE
30Generic difference in gene expression patterns
- We do this via visual inspection following clustering (genes and samples)
- Often we will reduce the number of genes by some criterion (e.g., cluster only on genes that are 2-fold differentially expressed in at least one sample/category)
- Often we will group the samples by condition in order to compensate for the lack of replicates
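The gene-reduction criterion above can be sketched as a simple filter applied before clustering. This is an illustration under stated assumptions: expression values are taken to be sample/reference ratios, "2-fold" is read as a change in either direction, and the gene names and numbers are made up.

```python
import math

def filter_by_fold_change(ratios, min_fold=2.0):
    """Keep genes whose ratio is at least min_fold up or down in any sample."""
    threshold = math.log2(min_fold)
    kept = {}
    for gene, values in ratios.items():
        # Non-positive ratios have no logarithm, so they are ignored here.
        if any(v > 0 and abs(math.log2(v)) >= threshold for v in values):
            kept[gene] = values
    return kept

# Hypothetical ratios for three genes across three samples.
example = {
    "geneA": [1.1, 0.9, 2.5],  # up 2.5-fold in one sample
    "geneB": [1.2, 0.8, 1.4],  # never reaches 2-fold
    "geneC": [0.4, 1.0, 1.1],  # down 2.5-fold in one sample
}
filtered = filter_by_fold_change(example)
print(sorted(filtered))
```

Working on the log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically, which is why the threshold is compared against the absolute value.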
31Clustering of genes and samples
32Disease vs. Normal
33Clinical Data Challenges
- Collection
- text formats
- dispersed sources
- Storage
- heterogeneous
- incomplete
- degenerate
- Protection
- HIPAA regulations
34Large Clinical Databases
- Nadkarni and Brandt (1998) JAMIA 5, 511
- Issues involved in data mining EAV databases
- Nadkarni et al. (1999) JAMIA 6, 478
- Extension of EAV with classes and relationships
- Chen et al. (2000) JAMIA 7, 475
- Performance of EAV/CR
35Issues with Clinical Data
- Too many columns
- Over 43,000 attributes
- Sybase capacity
- 1024 columns per table
- 32 indexed
- up to 50 tables per query
- Sparse data
- Multiple entries
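A quick worked calculation of why the column limit matters, using only the figures on this slide. It assumes one column per attribute and ignores key columns, so the real table count would be somewhat higher.

```python
import math

attributes = 43_000            # lower bound from the slide ("over 43,000")
max_columns_per_table = 1024   # Sybase limit quoted on the slide
max_tables_per_query = 50      # Sybase limit quoted on the slide

# Minimum number of conventional tables needed to hold every attribute as a column.
tables_needed = math.ceil(attributes / max_columns_per_table)
print(tables_needed)                          # 42
print(tables_needed <= max_tables_per_query)  # True, but with little headroom
```

With at least 42 wide tables, a query touching attributes from across the whole schema would already sit close to the 50-table-per-query ceiling.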
36Sample Clinical Table
37Solution: EAV
- Entity-Attribute-Value
- form of row modeling
- turns columns into rows
- eliminates sparse data
- reduction in database size
- Faster single value queries
- Pushes depth rather than width
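A minimal sketch of the "columns into rows" idea, assuming a wide table is represented as a list of dictionaries. The `to_eav` helper, the attribute names other than BMI, and all values are made up for illustration.

```python
def to_eav(wide_rows, key="patient"):
    """Turn column-per-attribute rows into (entity, attribute, value) triples,
    skipping empty cells so that sparse columns take no space."""
    triples = []
    for row in wide_rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute == key or value is None:
                continue
            triples.append((entity, attribute, value))
    return triples

# Two sparse wide rows: six attribute cells, only three of them filled.
wide = [
    {"patient": 1017, "BMI": 27.4, "ALT": None, "glucose": 98},
    {"patient": 1018, "BMI": None, "ALT": 41, "glucose": None},
]
eav = to_eav(wide)
for triple in eav:
    print(triple)
```

The three resulting rows hold exactly the filled cells, which is the sense in which row modeling removes sparse data and trades width for depth.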
38EAV Clinical Table
39Accessing Single Attributes
Traditional
SELECT patient, date, BMI FROM relTable
WHERE patient = 1017 AND BMI IS NOT NULL
EAV
SELECT patient, date, value FROM EAVTable
WHERE patient = 1017 AND test = 'BMI'
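The two single-attribute queries can be tried side by side with an in-memory SQLite database. This is a sketch, not the system described in the slides: the table and column names follow the slide, with the comparison operators written out explicitly, while the column types, the extra ALT attribute, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relTable (patient INTEGER, date TEXT, BMI REAL, ALT REAL);
    CREATE TABLE EAVTable (patient INTEGER, date TEXT, test TEXT, value REAL);
    -- The same two visits stored both ways (made-up values).
    INSERT INTO relTable VALUES (1017, '2002-01-15', 27.4, NULL);
    INSERT INTO relTable VALUES (1017, '2002-03-02', NULL, 38.0);
    INSERT INTO EAVTable VALUES (1017, '2002-01-15', 'BMI', 27.4);
    INSERT INTO EAVTable VALUES (1017, '2002-03-02', 'ALT', 38.0);
""")

# Conventional table: the attribute is a column, so empty cells must be filtered out.
traditional = conn.execute(
    "SELECT patient, date, BMI FROM relTable "
    "WHERE patient = 1017 AND BMI IS NOT NULL"
).fetchall()

# EAV table: the attribute is a value in the 'test' column.
eav = conn.execute(
    "SELECT patient, date, value FROM EAVTable "
    "WHERE patient = 1017 AND test = 'BMI'"
).fetchall()

print(traditional)
print(eav)
```

Both queries return the same single BMI row; the difference is that the EAV form never stores or scans the empty cells.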
40Limitations for Data Mining
- Complex boolean queries tough
- no set operations
- Complex SQL
- nested subqueries
- self-joins
- Performance
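To illustrate why boolean queries get harder, here is a sketch of a two-criterion search against an EAV table in SQLite. The table layout, the glucose attribute, and the thresholds are assumptions for the example. In a conventional table this would be a single WHERE clause over two columns; in EAV each criterion sits on a different row, so the table is joined to itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EAVTable (patient INTEGER, test TEXT, value REAL);
    INSERT INTO EAVTable VALUES (1017, 'BMI', 31.2);
    INSERT INTO EAVTable VALUES (1017, 'glucose', 140);
    INSERT INTO EAVTable VALUES (1018, 'BMI', 33.0);
    INSERT INTO EAVTable VALUES (1018, 'glucose', 95);
    INSERT INTO EAVTable VALUES (1019, 'glucose', 150);
""")

# Patients with BMI > 30 AND glucose > 126: one alias of the table per criterion.
both = conn.execute("""
    SELECT DISTINCT a.patient
    FROM EAVTable AS a
    JOIN EAVTable AS b ON b.patient = a.patient
    WHERE a.test = 'BMI' AND a.value > 30
      AND b.test = 'glucose' AND b.value > 126
    ORDER BY a.patient
""").fetchall()
print(both)
```

Every additional criterion adds another alias and another join, which is where both the SQL complexity and the performance cost come from.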
41Ad Hoc Query Interface
- Presents a user interface which generates the
required complex SQL queries
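A minimal sketch of what such a query generator might do behind the user interface, assuming an EAV table with patient, test, and value columns. The `build_eav_query` function and its criteria format are hypothetical, not the interface described in the cited papers.

```python
import sqlite3

def build_eav_query(criteria, table="EAVTable"):
    """Build SQL and parameters selecting patients that meet every
    (test, operator, value) criterion, using one self-join per criterion."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    allowed_ops = {"<", "<=", "=", ">=", ">"}
    joins, wheres, params = [], [], []
    for i, (test, op, value) in enumerate(criteria):
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        alias = f"t{i}"
        if i == 0:
            joins.append(f"FROM {table} AS {alias}")
        else:
            joins.append(f"JOIN {table} AS {alias} ON {alias}.patient = t0.patient")
        # Values are passed as parameters rather than pasted into the SQL text.
        wheres.append(f"{alias}.test = ? AND {alias}.value {op} ?")
        params.extend([test, value])
    sql = ("SELECT DISTINCT t0.patient " + " ".join(joins)
           + " WHERE " + " AND ".join(wheres) + " ORDER BY t0.patient")
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EAVTable (patient INTEGER, test TEXT, value REAL);
    INSERT INTO EAVTable VALUES (1017, 'BMI', 31.2);
    INSERT INTO EAVTable VALUES (1017, 'glucose', 140);
    INSERT INTO EAVTable VALUES (1018, 'BMI', 33.0);
""")
sql, params = build_eav_query([("BMI", ">", 30), ("glucose", ">", 126)])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The user only supplies the list of criteria; the nested aliases and joins stay hidden in the generated statement.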
42EAV/CR
- Simulation of a complex logical schema using an extensive yet simple physical schema
- Addition of object tables to contain like attributes
- strong data typing
- Creates metadata about objects to help describe the relationships between data objects
45Testing EAV/CR
- Data sources
- used microbiology data from VA patients
- extracted from existing DB
- loaded in EAV/CR schema
- scaled by replicating data with new IDs
- Benchmarking
- two attribute-centered queries
- two entity-centered queries
48Results
- Comparable speeds for entity queries
- massive hit for attribute query
- up to 10-fold worse
- "ancestor" improvement
- represents denormalization
- space for performance trade-off
49EAV for Clinical Genomics?
- performance issues a problem
- data mining on attributes
- I/O issues
- full EAV not feasible
- partial row modeling a good option
50Clinical Database
- Used the CGO database from the Univ. of Arkansas as a template
- Myeloma database
- Want to generalize it for any cancer
52Altering CGO
- remove gene chip references
- Affymetrix
- MIAME/MAGE non-compliant
- attach to GeneX
- generalize clinical system
- row model test results
- row model questionnaires
53Patient: id, birthdate, race, occupation, ...
LabReport: id, test_cat_id (FK), test_id (FK), patient_id (FK), test_date, result
LabTest: test_cat_id, test_id, test_cat_desc, test_desc
Questionaire: id, patient_id
Alcohol: id, study_id (FK), Q01, Q02, Q03, ...
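A sketch of how the row-modeled lab results could look as a working schema, using the Patient, LabReport, and LabTest names and columns from this slide. The column types, the composite key on LabTest, and the sample data are assumptions; the slide gives only column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patient (
        id INTEGER PRIMARY KEY, birthdate TEXT, race TEXT, occupation TEXT
    );
    CREATE TABLE LabTest (
        test_cat_id INTEGER, test_id INTEGER,
        test_cat_desc TEXT, test_desc TEXT,
        PRIMARY KEY (test_cat_id, test_id)
    );
    -- One row per test result: new tests add rows to LabTest, not columns here.
    CREATE TABLE LabReport (
        id INTEGER PRIMARY KEY,
        test_cat_id INTEGER, test_id INTEGER,
        patient_id INTEGER REFERENCES Patient(id),
        test_date TEXT, result TEXT,
        FOREIGN KEY (test_cat_id, test_id) REFERENCES LabTest(test_cat_id, test_id)
    );
    INSERT INTO Patient VALUES (1, '1950-04-02', 'unknown', 'teacher');
    INSERT INTO LabTest VALUES (10, 1, 'chemistry', 'glucose');
    INSERT INTO LabTest VALUES (10, 2, 'chemistry', 'ALT');
    INSERT INTO LabReport VALUES (100, 10, 1, 1, '2002-01-15', '98');
    INSERT INTO LabReport VALUES (101, 10, 2, 1, '2002-02-20', '41');
""")

report = conn.execute("""
    SELECT t.test_desc, r.test_date, r.result
    FROM LabReport AS r
    JOIN LabTest AS t
      ON t.test_cat_id = r.test_cat_id AND t.test_id = r.test_id
    WHERE r.patient_id = 1
    ORDER BY r.test_date
""").fetchall()
print(report)
```

Only the test results are row-modeled here; Patient stays a conventional table, in line with the partial row modeling approach.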
54HIPAA
- Health Insurance Portability and Accountability Act
- ensure the integrity and confidentiality of patient information
- protect against reasonably anticipated threats or hazards to the security or integrity of the information
- protect against unauthorized uses or disclosures of the information
55Clinical Data Flow
clinical database
redacted database
cleansing protocol
publication services
redacted database
research protocol
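One way the cleansing step between the clinical database and a redacted database might look, as a rough sketch only. The field names, the list of identifiers, and the `redact` helper are hypothetical, and nothing here should be read as a statement of what HIPAA requires.

```python
import hashlib
import hmac

# Hypothetical field names; a real cleansing protocol would define its own list.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "ssn"}

def redact(record, secret_key):
    """Copy a clinical record without direct identifiers, replacing the
    patient id with a keyed pseudonym so records can still be linked."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "patient_id"
    }
    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    cleaned["subject"] = digest.hexdigest()[:12]
    return cleaned

# Made-up record and key, for illustration only.
clinical = {"patient_id": 1017, "name": "Jane Doe", "phone": "555-0100", "BMI": 27.4}
redacted = redact(clinical, secret_key=b"example-key-not-for-real-use")
print(sorted(redacted))
```

Because the pseudonym is derived with a secret key, the same patient maps to the same subject code across exports, while the original id cannot be recomputed without the key.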