Title: AI
1 AI and Molecular Biology: A Growing Success Story
2 How has AI been successful in molecular biology?
- Wide, daily use of AI-based tools by biologists
- Thriving AI/MolBio community
  - Intelligent Systems for Molecular Biology (ISMB) conference now 11 years old, with >1,000 attendees
- Significant scientific publications, e.g.
- Successful businesses based on AI techniques
  - http://www.medicalscientists.com
3 Medical Scientists, Inc.
- Predictive modeling in health care cost domain
- Patented multistrategy constructive induction algorithm
- Privately held and profitable.
4 MolBio even creeping into mainstream AI
- KDD Cup competitions in the last two years involved learning in molecular biology domains
- TREC launched a genomics track this year.
- AI Magazine special issue on MolBio in Spring '04
5 Why success in biology?
- Big open questions, e.g.
  - Drug design, engineering novel organisms, evolution
- Rich sources of new information about life, e.g.
  - Genome sequencing
  - Expression array chips
- No common sense issues
  - Everything anyone knows about MolBio is written down
- Significant community investment
  - Biologists built a gene ontology, construct curated knowledge bases, and are eager consumers of software
6 The Irony of AI and MolBio
- Human understanding of the overwhelming complexity of our own genome will require partnership with biognostic machines
7 What is a biognostic machine?
- From the Greek bios (life) and gnostikos (knowing)
- Two kinds of biognostic machines
  - Instruments that produce data about living things in molecular detail and with genomic breadth
  - Bioinformatics systems that bring to bear existing knowledge in the computational analysis of data
8 Biognostic instruments
- Gene chips read out the expression of each gene in a tissue sample
  - 10,000 genes/chip and dozens of chips per study
- High throughput SNP genotyping automation
  - Finds millions of tiny genetic differences among people
9 Drinking from a firehose
- 150 published genomes, 19 Eukaryotes (human, mouse, wheat, rice, fruit fly, etc.); 798 ongoing projects (243 Eukaryotes)
- 12,661,480 articles in MedLine; 12,824 new in the last week; 372 journals provide free full text (>100,000 full text articles)
10 What AI technologies are used in bioinformatics?
- Some of the key AI technologies that have been broadly adopted in computational biology
  - Hidden Markov Models
  - Ontologies and related knowledge-based computation
  - Clustering, e.g. Self-Organizing Maps
  - Supervised learning, e.g. Support Vector Machines
  - Information extraction / natural language parsing
11 HMMs in molecular biology
- HMMs (trained with expectation-maximization, EM) are the main mechanism used to represent patterns in DNA and protein sequences
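To make the idea of an HMM over sequence concrete, here is a minimal sketch of Viterbi decoding for a toy two-state DNA model. The state names (`AT_rich`, `GC_rich`) and all probabilities are assumptions invented for illustration; they are not taken from any real model, and the sketch decodes with fixed parameters rather than training them with EM.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + math.log(emit_p[s][seq[i]])
            back[i][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy parameters (illustrative only)
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

path = viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)
```

Profile HMMs used for protein domains have many more states (match, insert, delete per column), but the decoding recurrence has the same shape.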
12 The Gene Ontology
- Actively developed, community curated ontology: http://geneontology.org
- About 12,000 defined concepts, in a DAG with two link types (part-of, is-a) under three roots
  - Cellular component
  - Biological process
  - Molecular function
- Used as annotations for genes (>80,000 so far), HMMs of domain patterns, etc.
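The DAG structure described above can be sketched as a small graph with typed links. The mini-ontology below (`TOY_GO`) is a hypothetical example made up for illustration, not actual Gene Ontology content; only the two link types come from the slide.

```python
# Hypothetical mini-ontology: term -> list of (link_type, parent_term)
TOY_GO = {
    "mitochondrial inner membrane": [("part_of", "mitochondrion"), ("is_a", "membrane")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular_component")],
    "membrane": [("is_a", "cellular_component")],
    "cellular_component": [],  # root
}

def ancestors(term, graph, link_types=("is_a", "part_of")):
    """Collect every term reachable from `term` by following the given link types upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for link, parent in graph.get(current, []):
            if link in link_types and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

all_parents = ancestors("mitochondrial inner membrane", TOY_GO)
isa_only = ancestors("mitochondrial inner membrane", TOY_GO, link_types=("is_a",))
```

Because a term can have several parents, a gene annotated with one term is implicitly annotated with every ancestor, which is why the traversal keeps a `seen` set rather than assuming a tree.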
14 A closer look at a biognostic instrument
- Gene expression arrays (gene chips)
  - Produces 10,000 measurements/chip, generally 10s-100s of chips/experiment
- Huge computational challenges
  - Many novel statistical and data management issues
  - Interpretation of results can be overwhelming; must transcend one-gene-at-a-time methods.
  - Linking data to prior knowledge is crucial.
15 What is gene expression?
- Not all of the genes in a genome are used in all circumstances
- In order for a gene to play a role in a cell, it must be expressed.
- A gene is expressed when the protein it encodes is synthesized
- Transcription of DNA to mRNA is the first step in protein production
- Measuring abundance of mRNA assays the level of gene expression
16 Expression is central because...
- Differentiation: All cells in a body have the same genome. Expression is what differentiates, e.g., brain cells from liver.
- Physiology: Cells do their business (dividing, sending signals, digesting, etc.) largely via changes in expression
- Response to stimuli: Environmental changes (like drugs or disease) often cause changes in expression
- Disease markers and drug targets: changes in expression associated with disease can be diagnostic markers and/or suggest novel pharmaceutical approaches.
17 Laboratory robotics, too
- One form of expression array places controlled
quantities (and shapes) of thousands of different
DNA sequences on glass slides
18 Statistical challenges!
- Many basic tools for analysis of expression data (normalization, statistical tests, visualization, clustering) are open source in the R language, see http://bioconductor.org
- Novel approaches still needed, e.g. for multiple testing corrections, finding gene-gene interaction terms, etc.
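To illustrate why multiple testing matters with thousands of genes per chip, here is a sketch of one standard, existing correction, the Benjamini-Hochberg false discovery rate adjustment. It is offered as a baseline example only, not as one of the novel approaches the slide calls for; the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Genes whose adjusted value falls below a chosen threshold are reported; the threshold then bounds the expected fraction of false positives among the reported genes rather than the chance of any single false positive.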
19 Clustering approaches
- Gene expression changes are coordinated, so levels should cluster meaningfully, but
  - Clusters change with situation (biclustering)
  - Expression levels have complex correlational structure
  - Distance measures unknown
- Approaches include
  - SOMs (Slonim)
  - PRMs (Koller and Friedman)
  - Trajectory clustering
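Since the right distance measure is an open question, any concrete choice is an assumption. The sketch below shows one commonly used option, Pearson correlation distance, which treats two genes as close when their expression profiles rise and fall together regardless of absolute level. The profiles are invented for illustration.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly co-varying profiles, 2 for opposite ones.

    Assumes both profiles have the same length and are not constant
    (a constant profile has zero variance and would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

# Hypothetical expression profiles across four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend

d_ab = pearson_distance(gene_a, gene_b)
d_ac = pearson_distance(gene_a, gene_c)
```

Euclidean distance would instead put `gene_a` and `gene_b` far apart because of scale, which is one reason the choice of measure changes the clusters.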
20 Discrimination tasks
- Given expression array results from e.g. tumors that were successfully treated vs. not, develop a predictive model
- High dimensionality, interactions, but
  - Feature selection
  - Support vector machines
  - Interesting kernels!
- Meet FDA regulations?
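As a sketch of the feature selection step, the code below ranks genes by the absolute value of a Welch t-statistic between two outcome groups and keeps the top few. This is one simple filter method chosen for illustration; the gene names, values, and the `rank_genes` helper are all hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expression, responded, top_k):
    """Return the top_k genes whose expression best separates the two groups.

    expression: gene name -> list of values, one per sample
    responded:  list of booleans, one per sample (True = successfully treated)
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, r in zip(values, responded) if r]
        neg = [v for v, r in zip(values, responded) if not r]
        scores[gene] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical data: six tumors, two genes
responded = [True, True, True, False, False, False]
expression = {
    "geneA": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],  # clearly different between groups
    "geneB": [2.0, 3.0, 1.0, 2.1, 2.9, 1.2],  # nearly identical between groups
}
selected = rank_genes(expression, responded, top_k=1)
```

With far more genes than samples, the selection has to be repeated inside each cross-validation fold; selecting on the full data first and validating afterwards gives optimistically biased accuracy estimates.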
21 Understanding expression changes in context
- Long lists of differentially expressed genes are difficult to interpret meaningfully
- Much knowledge about structure, function and interactions of genes
  - Hundreds of public databases: http://nar.oupjournals.org/
  - Best information in the literature.
- Key computational challenge: Bring prior knowledge to bear on understanding expression (and other high-throughput) data
22 Data integration
- Just tracking down all of the information about a list of genes isn't easy
  - Dozens of general and hundreds of specialized data sources available (many public and free)
  - No universal IDs: Sometimes heuristic key matching is necessary to link data sources
  - Inference is often required (e.g. about the applicability of information from a different species).
- Rapid change as new information becomes available
- Errors and inconsistencies abound.
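A minimal sketch of heuristic key matching, assuming the only differences between sources are case, punctuation, and whitespace in gene symbols. The normalization rule, the record contents, and the function names are illustrative assumptions; real sources also need synonym tables and species checks.

```python
import re

def normalize_key(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def link_records(source_a, source_b):
    """Link records whose normalized keys match exactly one record in the other source.

    Returns (linked, unmatched): linked maps a key from source_a to a pair of
    records; unmatched lists source_a keys with zero or ambiguous matches.
    """
    index = {}
    for key, record in source_b.items():
        index.setdefault(normalize_key(key), []).append(record)
    linked, unmatched = {}, []
    for key, record in source_a.items():
        hits = index.get(normalize_key(key), [])
        if len(hits) == 1:
            linked[key] = (record, hits[0])
        else:
            unmatched.append(key)  # leave ambiguous or missing keys for review
    return linked, unmatched

# Hypothetical records from two databases
db_a = {"BRCA-1": "record a1", "TP53": "record a2", "XYZ": "record a3"}
db_b = {"brca1": "record b1", "Tp53 ": "record b2"}
linked, unmatched = link_records(db_a, db_b)
```

Refusing to link when a normalized key matches more than one record is a deliberate choice here: a wrong link silently propagates errors, while an unmatched key can be resolved later.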
23 Semantic interpretation tools
- Mapping gene lists to the Gene Ontology
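One common way to score such a mapping, assumed here for illustration, is a hypergeometric over-representation test: given a gene list, ask whether a Gene Ontology term annotates more of its genes than chance would predict. The counts below are made up.

```python
from math import comb

def enrichment_pvalue(total_genes, annotated, list_size, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    total_genes: genes on the array
    annotated:   genes on the array annotated with the term
    list_size:   genes in the differentially expressed list
    overlap:     genes in the list annotated with the term
    """
    upper = min(annotated, list_size)
    favorable = sum(
        comb(annotated, i) * comb(total_genes - annotated, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, list_size)

# Hypothetical counts: 10 genes on the array, 5 annotated, 5 in the list, all 5 overlap
p_all = enrichment_pvalue(10, 5, 5, 5)
p_none = enrichment_pvalue(10, 5, 5, 0)
```

Because one such test is run per ontology term, the resulting p-values themselves need a multiple testing correction, and terms are not independent since annotations propagate up the DAG.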
24 Literature-based approaches
- Many active areas of research
  - Information extraction: to transform the biomedical literature into more computationally useful form
  - Information retrieval and presentation: making large collections of relevant documents comprehensible
  - Document meta-analysis: finding potential linkages among biomolecules from patterns of use in documents.
- Great resources
  - PubMed NLM indexers (e.g. GeneRIFs)
  - Growing full text repositories
25 Meta-analysis for gene-gene interactions
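A minimal sketch of document meta-analysis, assuming the simplest possible signal: two genes are potentially linked if their symbols appear in the same document. The documents, the gene set, and the tokenization rule are illustrative assumptions; real systems must also handle synonyms and ambiguous symbols.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(documents, gene_symbols):
    """Count, for each pair of gene symbols, the documents that mention both."""
    counts = Counter()
    for text in documents:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
        present = sorted(g for g in gene_symbols if g in tokens)
        counts.update(combinations(present, 2))  # each pair counted once per document
    return counts

# Hypothetical abstracts
docs = [
    "TP53 interacts with MDM2.",
    "MDM2 regulates TP53 stability",
    "BRCA1 and TP53",
]
pairs = cooccurrence(docs, {"TP53", "MDM2", "BRCA1"})
```

Raw counts favor heavily studied genes, so in practice the counts are usually compared against what each gene's overall frequency would predict before a link is suggested.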
26 Towards The Biological Knowledge-base
- Inferential potential of a unified knowledge-base transcends human ability
- Even heroic bioscientists can't keep up with the flood of information as disciplinary boundaries break down.
- Integrated database search isn't enough
  - Semantic issues in integration
  - Meta-analysis
  - Making a compelling story from disparate bits of evidence
- A grand challenge for AI
27 Minsky, AI and Common Sense
- Marvin Minsky in the August '03 Wired on "Why AI is brain dead"
  - There is no computer that has common sense. We're only getting the kinds of things that are capable of making an airline reservation.
  - The elderly segment of the population is growing to the point where there won't be enough doctors, nurses, and nurses' aides. We should be working to get robots to pick up the slack.
- I think Marvin has the right diagnosis, but the wrong prescription
28 But AI isn't psychology
- AI should be about general principles of intelligence; people are just one example
- Turing test: Is this program indistinguishable from a person?
  - Human idiosyncrasies as the sine qua non of intelligence?
- My alternative approach: Is this a mind worth wanting to know?
  - Also an approach to the other minds problem
29 Pharmacology as a test of intelligence?
- Making a contribution to inventing a new drug as a test for computational theories of intelligence
  - Lots of existing, declarative background knowledge
  - Clear metric for success: FDA approval
  - Credit assignment exists (but note Hollywood accounting)
  - and improvements in human health riding on it
- Reasonable incremental tasks
  - Passing graduate pharmacology exams
  - Making contributions to subtasks
30 Pharmacology 101
- Find a target: a naturally occurring molecule to be enhanced or inhibited
- Find a lead: a drug-like molecule that interacts specifically with the target
- Optimize: find a compound in the same family as the lead that is specific and effective enough to be a drug
- ADMET: absorption, distribution, metabolism, excretion, and toxicity
31 Biognosticopoeia
- Our first steps
- Integrate human-curated databases
- Exploit 10Ms years of effort
- Requires dynamic and heuristic approaches
- Extend GO to many other relationships
- IE from literature using DMAP
- Explicit representation of procedural computation tasks
- IBM p690 w/ 8x Power4 processors, 64GB RAM, Lisp Machine
32 Come visit!
- The UCHSC Center for Computational Pharmacology, http://compbio.uchsc.edu
- International Society for Computational Biology, http://iscb.org
- Medical Scientists, Inc., http://medicalscientists.com
- Larry Hunter, Larry.Hunter_at_uchsc.edu