Effective design and analysis of bioinformation Unit 3 - PowerPoint PPT Presentation

Description:

Office hours: Wednesday, 4pm-6pm (Room 554, phone: ... Bitter Taste Perception. TAS2R38. Earwax Type. ABCC11. Lactose Intolerance. LCT. Muscle Performance ... – PowerPoint PPT presentation

Slides: 55
Provided by: irenegab
Transcript and Presenter's Notes

1
Effective design and analysis of
bioinformation: Unit 3
  • BIOL221T Advanced Bioinformatics for
    Biotechnology

Irene Gabashvili, PhD igabashvili_at_yahoo.com
2
Course availability
  • Lectures and lab: every Wednesday, Duncan Hall,
    Room 550, 6:00 pm to 9:45 pm
  • Office hours: Wednesday, 4pm-6pm (Room 554,
    phone 92404831) and by appointment
  • Lecture notes will be posted at
    http://home.comcast.net/igabashvili/221T.htm
  • Or the SJSU page --
  • The user name is ewok\biostudents (don't enter
    the quotation marks)
  • And the password is 4biolecture (don't enter
    the quotation marks).

3
Consumer genomics gets crowded
In the News
  • http://www.seqwright.com/ (SOLiD, ABI)
  • http://www.decodeme.com/ (Illumina)
  • https://www.23andme.com/ (Illumina)
  • http://www.navigenics.com/ (Affymetrix)
  • http://www.knome.com/ (ABI, Amersham, Illumina)

4
https://www.23andme.com/experts/letters/science/
5
List from deCODE genetics
  • Our current list of diseases includes:
    Age-related Macular Degeneration, Asthma,
    Alzheimer's Disease, Atrial Fibrillation, Breast
    Cancer, Celiac Disease, Colorectal Cancer,
    Exfoliation Glaucoma (XFG), Crohn's Disease,
    Multiple Sclerosis, Myocardial Infarction,
    Obesity, Prostate Cancer, Psoriasis, Restless
    Legs, Rheumatoid Arthritis, Type 1 Diabetes and
    Type 2 Diabetes.

6
Three important sub-disciplines within
bioinformatics
  • the development of new algorithms and statistics
    with which to assess relationships among members
    of large data sets
  • the analysis and interpretation of various types
    of data including nucleotide and amino acid
    sequences, protein domains, and protein
    structures
  • the development and implementation of tools that
    enable efficient access and management of
    different types of biological information.

7
Main tasks of biomedical informatics
  • Storage, Analysis, Visualization and Management
    of biomedical data
  • Mining for new knowledge, hypothesis formulation
    and testing
  • Development of tools and resources for the above

8
Brief History of Bioinformatics
  • 1920 - The term "genome" is introduced by H. Winkler
    to denote the complete set of chromosomal and
    extrachromosomal genes
  • 1933 - A new technique, electrophoresis, is
    introduced by Tiselius for separating proteins in
    solution.
  • 1951 - Pauling and Corey propose the structure
    for the alpha-helix and beta-sheet

9
Brief History of Bioinformatics
  • 1953 - Watson and Crick propose the double helix
    model for DNA (data by Franklin and Wilkins)
  • 1954 - Perutz's group develop methods to solve
    the phase problem in protein crystallography.
  • 1955 - The sequence of the first protein to be
    analyzed, bovine insulin, announced by F.Sanger.
  • 1956 - The first protein sequence reported was
    that of bovine insulin, consisting of 51 residues

10
Brief History of Bioinformatics
  • 1962 - Pauling's theory of molecular evolution
  • 1965 - M. Dayhoff's Atlas of Protein Sequences
  • 1970 - Needleman-Wunsch algorithm
  • 1972 - The Protein Data Bank
  • 1980 - The first complete gene sequence for an
    organism (ΦX174): 5,386 bp, nine proteins.
  • 1981 - The Smith-Waterman algorithm. IBM
    introduces its PC to the market. The concept of
    a sequence motif (Doolittle).

11
Brief History of Bioinformatics
  • 1983 - Sequence DB searching (Wilbur-Lipman)
  • 1986 - Human Genome Initiative announcement
  • 1987 - SWISS-PROT protein sequence database
  • 1988 - NCBI created at NIH/NLM (databases)
  • 1988 - FASTA by Pearson and Lipman. EMBL
    establishes sequence database network
  • 1990 - BLAST by Altschul et al.
  • 2003 - Human Genome Project completion

12
The data of biomedical informatics
  • Public and private databases store biological data
    in various formats
  • Sequences: DNA, RNA, proteins
  • Structures: X-ray, NMR, microscopy
  • Expression: microarrays, gels
  • Interaction: two-hybrid, mass spec
  • Metabolism: GC-MS, NMR
  • Physiology: medical images, PK/PD

13
Search Engines
  • AND, OR, NOT
  • Specifying database fields (Organism, Author)
  • Order of words: neonatal pre/3 screening
    (neonatal within 3 words before screening)
  • Spaces wom?n cats

14
Search and Download
  • Entrez: integrated, text-based search and
    retrieval system for PubMed, Nucleotide and
    Protein Sequences, Protein Structures, Complete
    Genomes, Taxonomy, etc.; batch download
  • http://www.ncbi.nlm.nih.gov/sites/batchentrez
  • term[field] OPERATOR term[field]
  • 110[ESTC] AND Homo sapiens[ORGN] AND
    deafness[dis] (BSND: Bartter syndrome,
    infantile, with sensorineural deafness)
  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene
  • More on the course's website
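
The "term[field] OPERATOR term[field]" pattern above can be sketched as a small string builder. This is a minimal illustration only: the helper names entrez_term and entrez_query are hypothetical and are not part of any NCBI tool, and the sketch only assembles the query text without contacting Entrez.

```python
def entrez_term(term, field=None):
    """Format one search term, optionally restricted to a field tag."""
    return f"{term}[{field}]" if field else term


def entrez_query(terms, operator="AND"):
    """Join (term, field) pairs with one Boolean operator (AND, OR, NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(entrez_term(t, f) for t, f in terms)


# ORGN restricts the first term to the organism field; the second term is unrestricted.
query = entrez_query([("Homo sapiens", "ORGN"), ("deafness", None)])
print(query)  # Homo sapiens[ORGN] AND deafness
```

The resulting string is what would be pasted into the Entrez search box.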

15
DATA FORMATS AND DATA INTEGRATION
  • It is widely recognized that successful data
    integration is one of the keys to improved
    productivity in biopharmaceutical R&D
  • Success in most bioinformatics-related
    activities, from functional characterization of
    genomic sequences to prioritization of drug
    targets, requires an integrated view of all
    relevant data in a drug discovery R&D program
  • Bioinformatics data sources often have large,
    complex data structures, reflecting the richness
    of the scientific concepts they model. Many
    bioinformatics data sources cover similar
    domains, such as genes, proteins, sequence
    annotations or microarray results.

16
Database design links
  • http://www.devx.com/ibm/Article/20702
  • http://www.campus.ncl.ac.uk/databases/design/
  • http://www.dbazine.com/mullins_datamodel.shtml
  • http://www.extropia.com/tutorials/sql/toc.html
  • http://www.surfermall.com/relational/lesson_1.htm

17
Database Definition
  • A collection of data that
  • is organized
  • usually computer-based
  • represents repetitive information implicitly
  • supports retrieval
  • A set of rules to manipulate data
  • A method to mold information into knowledge

18
Database applications
  • Who uses computerized databases?
  • Stores to keep track of inventory
  • Hospitals to keep track of patient info
  • Travel agents to keep up with their customers
    and reservations
  • Biologists to efficiently manage and manipulate
    their data
  • DATA → INFORMATION → KNOWLEDGE

19
Paper Database as Expert System
20
HISTORY
  • 1960's: Two main data models are developed:
    network model (CODASYL) and hierarchical (IMS). A
    user would need to know the physical structure of
    the database in order to query for information.
    SABRE (IBM/AA).
  • 1970-72: E.F. Codd proposes the relational model. He
    disconnects the schema (logical organization) of
    a database from the physical storage methods.

21
HISTORY
  • 1970's
  • Ingres (UCB) → Ingres Corp., Sybase, MS SQL
    Server, Britton-Lee, Wang's PACE. This system
    used QUEL as query language.
  • System R (IBM) → IBM's SQL/DS and DB2, Oracle, HP's
    Allbase, Tandem's Non-Stop SQL. This system used
    SEQUEL as query language.
  • The term Relational Database Management System
    (RDBMS) is coined

22
HISTORY
  • 1976: P. Chen proposes the Entity-Relationship
    (ER) model for database design
  • Early 1980's: Commercialization of relational
    systems begins as a boom
  • Mid-1980's: SQL (Structured Query Language)
    becomes "intergalactic standard". DB2 becomes
    IBM's flagship product. Network and hierarchical
    models fade into the background

23
HISTORY
  • Early 1990's: Application and personal
    productivity tool development: PowerBuilder
    (Sybase), Oracle Developer, VB (Microsoft),
    Excel/Access (MS) and ODBC. First Object Database
    Management Systems (ODBMS) prototypes.
  • Mid-1990's: Internet/WWW. Web/DB grows
    exponentially, usable for average users

24
HISTORY
  • Late-1990's: Boom for Web/Internet/DB connectors.
    Open source solutions with widespread use of gcc,
    cgi, Apache, MySQL, etc. Online transaction
    processing (OLTP) and online analytic processing
    (OLAP) come of age
  • Early 21st century: Burst of the dot-com bubble,
    but solid growth of DB applications. PDAs, POS
    transactions, IBM, Microsoft, Oracle.

25
FUTURE
  • Terabyte and Petabyte databases of everything
  • Mobile databases
  • Semantic Web
  • Object Oriented Everything, includes databases
  • Object Database Management Group (ODMG) standards
    are proposed and accepted
  • Security issues

26
Database advantages
  • An advantage of a database program is
  • Can find a specific file quickly
  • Can easily add records
  • Can alphabetize and sort data faster than most
    people
  • Is as accurate as the data that is entered
  • Can make many different types of reports
  • Is invaluable for large amounts of data

27
Database Parts
  • Parts of a relational database
  • Fields: categories of information
  • <table>
  • Entry: data in a field
  • Record: all of the information about one item
    (row)
  • File: document of all of the records
  • To sort a field: ascend or descend (Excel, Works)

28
Database types
  • Flat (spreadsheet)
  • Hierarchical
  • Network (two fundamental constructs, called
    records and sets)
  • Relational

29
Relational Databases
  • Relational databases started to get to be a big
    deal in the 1970's, and they're still a big deal
    today, which is a little peculiar, because
    they're a 1960's technology.
  • A relational database is a bunch of rectangular
    tables. Each row of a table is a record about one
    person or thing; the record contains several
    pieces of information called fields.
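
A rectangular table of records and fields can be sketched with Python's built-in sqlite3 module. The table name gene and its two columns are illustrative assumptions, not something defined in the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.row_factory = sqlite3.Row      # allows reading each field by column name

# One rectangular table: each row is a record, each column is a field.
conn.execute("CREATE TABLE gene (symbol TEXT, trait TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [
        ("TAS2R38", "bitter taste perception"),
        ("ABCC11", "earwax type"),
        ("LCT", "lactose intolerance"),
    ],
)

records = conn.execute("SELECT * FROM gene ORDER BY symbol").fetchall()
for record in records:
    print(record["symbol"], "-", record["trait"])
```

Sorting is requested in the query (ORDER BY) rather than stored in the table, which is one reason the same table can feed many different reports.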

30
Entities and Relationships
Entities: things we store information about
Relationships: links between the entities
(many-to-many, one-to-one, one-to-many)
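
As a rough sketch of how these relationship types tend to appear in tables: a one-to-many link is usually a foreign-key column, and a many-to-many link is usually a separate junction table. All table names, column names, and sample rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (departmentID INTEGER PRIMARY KEY, name TEXT);
-- One-to-many: each employee row points at exactly one department.
CREATE TABLE employee (employeeID INTEGER PRIMARY KEY, name TEXT,
                       departmentID INTEGER REFERENCES department(departmentID));
-- Many-to-many: a junction table pairs employees with projects.
CREATE TABLE project (projectID INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE assignment (employeeID INTEGER REFERENCES employee(employeeID),
                         projectID INTEGER REFERENCES project(projectID),
                         PRIMARY KEY (employeeID, projectID));
INSERT INTO department VALUES (1, 'Sequencing');
INSERT INTO employee VALUES (10, 'Ada', 1);
INSERT INTO employee VALUES (11, 'Grace', 1);
INSERT INTO project VALUES (100, 'Assembly');
INSERT INTO project VALUES (101, 'Annotation');
INSERT INTO assignment VALUES (10, 100);
INSERT INTO assignment VALUES (10, 101);
INSERT INTO assignment VALUES (11, 100);
""")

# Follow the one-to-many link: who works in the Sequencing department?
staff = conn.execute(
    "SELECT e.name FROM employee e "
    "JOIN department d ON e.departmentID = d.departmentID "
    "WHERE d.name = 'Sequencing' ORDER BY e.name").fetchall()

# Follow the many-to-many link: which projects is Ada assigned to?
ada_projects = conn.execute(
    "SELECT p.title FROM project p "
    "JOIN assignment a ON a.projectID = p.projectID "
    "JOIN employee e ON e.employeeID = a.employeeID "
    "WHERE e.name = 'Ada' ORDER BY p.title").fetchall()
print(staff, ada_projects)
```

A one-to-one relationship could be modeled the same way as one-to-many, with a uniqueness constraint added to the foreign-key column.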
31
A Table is a Relation
Columns, Fields, Attributes.
Rows, Records, Tuples, Entities.
Records of data, composed of fields, stored in tables.
32
Keys and Functional Dependencies
  • Key field (superkey, key): a field that uniquely
    identifies a record
  • If there is a functional dependency between
    column A and column B in a given table
    (A → B), then the value of column A determines
    the value of column B (employeeID → name).
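
One way to make the definition concrete is to test whether a dependency is contradicted by sample data. The function name holds and the sample rows are hypothetical. Note that data can only refute a functional dependency: a dependency that happens to hold in a sample is not thereby a rule of the schema.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not contradicted by rows."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False  # same A value paired with two different B values
    return True


employees = [
    {"employeeID": 1, "name": "Ada", "job": "Analyst"},
    {"employeeID": 2, "name": "Grace", "job": "Analyst"},
    {"employeeID": 3, "name": "Ada", "job": "Curator"},
]
print(holds(employees, "employeeID", "name"))  # True
print(holds(employees, "name", "job"))         # False
```

Here employeeID → name survives because every ID appears with one name, while name → job fails because the two rows named Ada carry different jobs.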

33
Schema
  • Database schema is the structure or design of the
    database, a blueprint for the data in the
    database.
  • employee(employeeID, name, job, cube,
    departmentID)
  • What information needs to be stored? (things or
    entities)
  • What questions will we ask of the database?
    (queries.)

34
Flawed schemas
This schema design leads to redundancies:
Employee(employee ID, name, job, department ID)
Department(department ID, department name)
35
Flawed schemas
Insertion Anomaly
Deletion Anomaly
Update Anomaly
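
The anomalies can be reproduced in a few lines. The sketch below assumes a hypothetical combined table, employee_department, that stores the department name on every employee row; the table, its columns, and the sample rows are illustrative and are not taken from the slide's figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_department ("
    "employeeID INTEGER, name TEXT, departmentID INTEGER, departmentName TEXT)")
conn.executemany("INSERT INTO employee_department VALUES (?, ?, ?, ?)", [
    (1, "Ada", 42, "Sequencing"),
    (2, "Grace", 42, "Sequencing"),
    (3, "Linus", 7, "Imaging"),
])

# Update anomaly: renaming the department in one row leaves the other row stale.
conn.execute(
    "UPDATE employee_department SET departmentName = 'Genomics' WHERE employeeID = 1")
names_for_42 = {row[0] for row in conn.execute(
    "SELECT departmentName FROM employee_department WHERE departmentID = 42")}
print(names_for_42)  # two conflicting names for the same department

# Deletion anomaly: removing the only Imaging employee erases the department too.
conn.execute("DELETE FROM employee_department WHERE employeeID = 3")
remaining = conn.execute(
    "SELECT COUNT(*) FROM employee_department WHERE departmentID = 7").fetchone()[0]
print(remaining)  # 0
```

The insertion anomaly is the mirror image: in this table a new department cannot be recorded until it has at least one employee, unless the employee columns are left null.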
36
Avoid Null Values
37
Normalization
Unnormalized table: lists instead of atomic
values. This violates the rules of first normal
form
38
Normalization
This schema is in first normal form, 1NF
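
As a sketch of the step from a list-valued column to atomic values, the helper below emits one row per list element. The function name, field names, and sample rows are hypothetical, since the slide's own table is not in the transcript.

```python
unnormalized = [
    {"employeeID": 1, "name": "Ada", "skills": ["SQL", "Python"]},
    {"employeeID": 2, "name": "Grace", "skills": ["Perl"]},
]


def to_first_normal_form(rows, list_field, atomic_field):
    """Emit one row per list element so that every value is atomic."""
    flat = []
    for row in rows:
        for item in row[list_field]:
            atomic = {k: v for k, v in row.items() if k != list_field}
            atomic[atomic_field] = item
            flat.append(atomic)
    return flat


first_nf = to_first_normal_form(unnormalized, "skills", "skill")
for row in first_nf:
    print(row)
```

The result is in 1NF but repeats the name for every skill: the key is now (employeeID, skill), and name depends on only part of it, which is the problem second normal form addresses.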
39
Second Normal Form, 2NF
2NF: Attributes must depend on the whole key
40
3NF and BCNF (Boyce-Codd)
3NF: Attributes must depend on nothing but the key
BCNF: All the functional dependencies must
have a superkey on the left side
41
Concepts
  • Entities are things, and relationships are the
    links between them.
  • Relations or tables hold a set of data in tabular
    form.
  • Columns belonging to tables describe the
    attributes that each data item possesses.
  • Rows in tables hold data items with values for
    each column in a table.
  • Keys are used to identify a single row.
  • Functional dependencies identify which attributes
    determine the values of other attributes.
  • Schemas are the blueprints for a database.

42
Design Principles
  • Minimize redundancy without losing data.
  • Insertion, deletion, and update anomalies are
    problems that occur when trying to insert,
    delete, or update data in a table with a flawed
    structure.
  • Avoid designs that will lead to large quantities
    of null values.

43
Normalization
  • Normalization is a formal process for improving
    database design.
  • First normal form (1NF) means atomic column or
    attribute values.
  • Second normal form (2NF) means that all
    attributes outside the key must depend on the
    whole key.
  • Third normal form (3NF) means no transitive
    dependencies.
  • Boyce-Codd normal form (BCNF) means that all
    attributes must be functionally determined by a
    superkey.
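
The BCNF condition can be checked mechanically with the standard attribute-closure computation. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, and the example schema and dependencies describe an illustrative combined employee/department table rather than one from the slides.

```python
def closure(attributes, dependencies):
    """Attribute closure: everything functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for left, right in dependencies:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result


def bcnf_violations(schema, dependencies):
    """Return the nontrivial dependencies whose left side is not a superkey."""
    return [
        (left, right)
        for left, right in dependencies
        if not set(right) <= set(left)                       # skip trivial FDs
        and not set(schema) <= closure(left, dependencies)   # left is no superkey
    ]


schema = {"employeeID", "name", "departmentID", "departmentName"}
fds = [
    (("employeeID",), ("name", "departmentID")),
    (("departmentID",), ("departmentName",)),
]
violations = bcnf_violations(schema, fds)
print(violations)  # [(('departmentID',), ('departmentName',))]
```

employeeID determines every attribute, so it is a superkey. departmentID determines only departmentName, so that dependency breaks BCNF, and it is also the transitive dependency that 3NF forbids. Moving departmentID and departmentName into their own table removes the violation.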

44
Hierarchical Databases
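[Figure: tree for patient 1234567 with branches "Sandiego, Carmen", "123 Main Street", and Labs; Labs holds two Chem7 panels, one with K 3.9 and Na 142, the other with K 4.3 and Na 136]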
45
Hierarchical Databases
  • Easy to use
  • Efficient storage
  • Tree walking is fast
  • Queries across trees are slow
  • Flexible
  • Too flexible: chaos is allowed
  • Too easy to modify
  • Difficult to document complex structures

46
Hierarchical Databases
  • EMR(1234567) = "Sandiego, Carmen"
  • EMR(1234567, Address) = "123 Main Street"
  • EMR(1234567, Chem7, 2/2/02, Na) = 136
  • EMR(1234567, Chem7, 2/2/02, K) = 4.3
  • EMR(1234567, Chem7, 2/3/02, Na) = 142
  • EMR(1234567, Chem7, 2/3/02, K) = 3.9
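
The subscripted entries above can be imitated with a Python dictionary keyed by the path from the root. This is only a sketch of the idea, not of any particular hierarchical database product, and the helper name subtree is hypothetical.

```python
emr = {
    ("1234567",): "Sandiego, Carmen",
    ("1234567", "Address"): "123 Main Street",
    ("1234567", "Chem7", "2/2/02", "Na"): 136,
    ("1234567", "Chem7", "2/2/02", "K"): 4.3,
    ("1234567", "Chem7", "2/3/02", "Na"): 142,
    ("1234567", "Chem7", "2/3/02", "K"): 3.9,
}


def subtree(store, *prefix):
    """Walk one branch: every node whose key path starts with `prefix`."""
    return {path: value for path, value in store.items()
            if path[:len(prefix)] == prefix}


# Within one tree: everything under one patient's Chem7 panel on one date.
panel = subtree(emr, "1234567", "Chem7", "2/3/02")
print(panel)

# Across trees: a question such as "all potassium values" must visit every node.
potassium = [value for path, value in emr.items() if path[-1] == "K"]
print(potassium)  # [4.3, 3.9]
```

This sketch scans the whole dictionary in both cases. A real hierarchical store keeps keys ordered so that a subtree is contiguous, which is what makes tree walking fast while cross-tree queries stay slow.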

47
Hierarchical Chaos
[Figure: tree for record 1234567 with nodes labeled Admissions, Admission 1, Admit Date 2/2/02, Primary DX CHF, Other DX, AODM, A Fib, Flag S, Flag P]
48
Network Databases
[Figure: network diagram with linked nodes labeled 1234567, Gyn Clinic, 2 Main St., Sandiego, 305-2500, Secretary, Gyn Clinic, 8AM-5PM, Ms Smith, 305-1000, Service, Pap, Dr. Jones, Gyn Visit, Beeper 34]
49
Extensible Markup Language (XML) Databases
  • SGML is a metalanguage
  • SGML is used to write Document Type Definitions
    (DTDs) that define languages
  • HTML is a language with an SGML DTD
  • Tags are for formatting/presentation syntax
  • XML is a proper subset of SGML
  • XML defines tags that convey semantics
  • We could write Health Markup Language (HML)
    in XML (if we could agree on the semantics and
    tags)
  • Tags may or may not be stored with data

50
<document> </document>
<document.id>CXR001</document.id> <doc.date>19991101</doc.date>
<document.type> </document.type> <document.body> </document.body>
<identifier>P5-00010</identifier>
<text>Chest X-Ray</text>
<findings>No infiltrate, cardiac shadow not enlarged...</findings>
<impression>Normal X-ray</impression>
51
<patient> </patient>
<patient.id> </patient.id> <patient.name> </patient.name>
<patient.dob>19230113</patient.dob> <patient.sex value="male"/> <inpatient/>
<id.value>1234789</id.value>
<family.name>Sandiego</family.name> <given.name>Carmen</given.name>
<suffix>M.D.</suffix>
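
A record like this can be read with Python's standard xml.etree.ElementTree module. The element names and values below come from the slide, but the nesting (id.value inside patient.id, the name parts inside patient.name) is an assumption, because the transcript lists the tags without their hierarchy.

```python
import xml.etree.ElementTree as ET

# Nesting is assumed; element names and values are taken from the slide.
record = """
<patient>
  <patient.id><id.value>1234789</id.value></patient.id>
  <patient.name>
    <family.name>Sandiego</family.name>
    <given.name>Carmen</given.name>
    <suffix>M.D.</suffix>
  </patient.name>
  <patient.dob>19230113</patient.dob>
  <patient.sex value="male"/>
  <inpatient/>
</patient>
"""

patient = ET.fromstring(record)
family = patient.findtext("patient.name/family.name")  # text inside a nested tag
sex = patient.find("patient.sex").get("value")          # value stored as an attribute
is_inpatient = patient.find("inpatient") is not None    # empty tag used as a flag
print(family, sex, is_inpatient)  # Sandiego male True
```

Each value travels with the tag that names it, so a missing field simply does not appear. That is the "data carries its field assignment" and "sparse data handled compactly" trade-off, paid for with verbosity.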
52
Extensible Markup Language (XML) Databases
  • Strengths
  • Flexibility to represent wide range of data
  • Data carries its field assignment
  • Sparse data handled compactly
  • Tags can have platform-specific display
  • Weaknesses
  • Immature database tools
  • Verbose
  • I/O intensive
  • A trade-off of decreased efficiency for increased
    flexibility ? scalability

53
Relational Databases - Advantages
  • Comprehensible
  • Multiple views possible
  • Easy to modify
  • New elements don't break programs
  • Database management systems (DBMS)
  • Referential integrity
  • Reorg for efficiency
  • Access control
  • Locking for multiple simultaneous use
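
Referential integrity, one of the DBMS services listed above, can be seen in a short sqlite3 sketch. The tables and rows are hypothetical. SQLite is used only because it ships with Python, and it enforces foreign keys only after the PRAGMA shown below is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE department (departmentID INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee (employeeID INTEGER PRIMARY KEY, name TEXT, "
    "departmentID INTEGER NOT NULL REFERENCES department(departmentID))")
conn.execute("INSERT INTO department VALUES (42, 'Sequencing')")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', 42)")  # accepted: department 42 exists

try:
    conn.execute("INSERT INTO employee VALUES (2, 'Grace', 99)")  # no department 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The database, rather than each application program, refuses the dangling reference, so every program that shares the data gets the same guarantee.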

54
Relational Databases - Disadvantages
  • Storage overhead
  • I/O-intense
  • Cost