Data and Text Mining for Computational Biology presentation

1
Data and Text Mining for Computational Biology

Introduction

2
Course information

CS 6365
Data and Text Mining for Computational Biology
Meets Tuesday and Thursday 7:00-8:15 pm
at ECSS 2.412

3
Instructor

Vasileios Hatzivassiloglou
Associate Professor, Computer Science
Founding Professor, Bioengineering
Research focus: Discover knowledge from massive
amounts of raw data
Data is not the same as information
Information overload

4
Research Interests

Text analysis, machine learning, intelligent
information retrieval, summarization, question
answering, bioinformatics, medical informatics

5
Contact information

Office hours: Tuesday and Thursday 6:00-7:00 pm
and by appointment
Office location: ECSS 3.406
vh_at_hlt.utdallas.edu
(972) 883-4342
Teaching Assistant: TBA

6
Course goals

Introduce the field of bioinformatics
Discuss primary techniques used for data mining
Introduce text mining and additional issues it
brings to data mining methods
Use examples from computational biology

7
Intended audience

For both computer scientists and biologists
Not an easy task to balance the two
Focus on data and text mining algorithms and
applications
Coverage of machine learning background
No extensive algorithmic analysis / computational
complexity
Medium level of programming

8
Prerequisites

Officially: CS 6325, Introduction to
Bioinformatics
Waived for this offering of the course
You should know
Basic data structures (multidimensional arrays,
hash tables, binary trees)
One high-level programming language and be able
to adapt to a new one as needed
How to install and use external software
packages

9
You need not know

Molecular biology
Machine learning
Data mining (in general)
Text analysis / natural language processing
Information retrieval
Artificial intelligence

10
Course level

Introductory graduate course (MS or first-year
PhD)
Maturity in programming and data structures at
the level of a Computer Science senior
Ability to access (and interest in accessing) the
primary literature in a guided fashion

11
Course structure

6 lectures on biological background and
bioinformatics in general
6 lectures on data similarity
8 lectures on data mining methods
3 lectures on text mining and knowledge mining
methods
Student presentations of research papers (3-4
sessions)

12
Expected work load

Two homework sets given in mid-to-late September
and mid-to-late October
Two weeks to turn in each homework set
Mid-term exam in early October
Each student selects two or three research papers
to review in late October
Student presentations of research papers in the
last week of November / first week of December
Final exam

13
Course project

In lieu of the research papers and presentation,
students may elect to work on a project in teams
of two or three
Project is chosen by the students with the advice
and consent of the instructor
Project investigation/implementation should be
approximately 1.5-2 times the work required for a
regular homework

14
Programming

Each student selects their own programming
language (must be available at UTD and accessible
to TA)
Examples: C, C++, Java, Perl, Python
Can also use a package/programming environment
specifically tailored to bioinformatics

15
One likely package

R (http://www.r-project.org/)
R is the free alternative to S-Plus developed at
AT&T research
S-Plus is the extensible, programmable
alternative to statistical packages like SAS and
SPSS
If you know C, you will be right at home with R

16
Another likely package

BioPerl (http://bio.perl.org/)
A collection of library modules in Perl written
by and for bioinformaticians
Perl supports high-level operations such as
hashes as a basic data structure, string
matching, and regular expressions
Perl is really bad at OOP and efficiency
Easy to learn
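
The high-level operations named above (hashes, string matching, regular expressions) can be sketched briefly. The sketch below uses Python, one of the example languages listed on the Programming slide, rather than Perl or BioPerl itself; the DNA fragment and the motif pattern are made up for illustration and do not come from the course materials.

```python
import re
from collections import Counter

# Hypothetical DNA fragment, used only for illustration.
sequence = "ATGCGATATAAGGCTATAAT"

# Hash (dictionary) as a basic data structure: count each nucleotide.
base_counts = Counter(sequence)

# Plain string matching: index of the first literal occurrence, or -1.
first_tata = sequence.find("TATAA")

# Regular expression: start positions of every non-overlapping match
# of TATA followed by A or T.
motif_starts = [m.start() for m in re.finditer(r"TATA[AT]", sequence)]

print(dict(base_counts))
print(first_tata)
print(motif_starts)
```

Perl expresses the same three operations with built-in hashes, index(), and the =~ match operator.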

17
Grading

Class participation: 20%
Homework assignments: 30% (total)
Midterm: 10%
Research paper presentation or project: 20%
Final exam: 20%
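
As a rough illustration of how these weights combine, the sketch below reads the five numbers on this slide as percentages of the final grade and computes a weighted average. The function name, the dictionary keys, and the sample scores are hypothetical; the slides do not say how component scores are scaled or rounded.

```python
# Weights from the grading slide, read as percentages of the final grade.
WEIGHTS = {
    "participation": 20,
    "homework": 30,
    "midterm": 10,
    "presentation_or_project": 20,
    "final_exam": 20,
}


def course_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical component scores for one student.
example_scores = {
    "participation": 90,
    "homework": 80,
    "midterm": 70,
    "presentation_or_project": 85,
    "final_exam": 75,
}
print(course_grade(example_scores))
```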

18
Textbooks

No good integrated textbook on data mining from a
computational biology perspective
We will use a textbook covering bioinformatics
algorithms and another textbook on data mining
in general, and additional chapters from other
books and research articles
Copies of chapters / research articles will be
provided

19
Recommended textbook 1

An Introduction to Bioinformatics Algorithms
(Computational Molecular Biology), by Neil C.
Jones and Pavel A. Pevzner, MIT Press, 2004.
ISBN 0262101068
448 pages
Available on Amazon.com for $41, Barnes and Noble
for $60

20
Recommended textbook 2

Data Mining: Concepts and Techniques by Jiawei
Han and Micheline Kamber, Elsevier, second
edition, 2006.
ISBN 1558609016
800 pages
Available on Amazon.com for $52, Barnes and Noble
for $65

21
Supplementary textbooks

Bioinformatics: The Machine Learning Approach
by Pierre Baldi and Soren Brunak, 2nd edition,
2001.
Data Mining: Multimedia, Soft Computing, and
Bioinformatics by Sushmita Mitra and Tinku
Acharya, 2003.
Both of the above are available as full-text
eBooks via http://library.utdallas.edu.

22
Background reading

Biology: Molecular Biology of the Cell by Bruce
Alberts et al., 4th edition, 2002.
Machine learning: Machine Learning by Tom
Mitchell, 1997.

23
Background reading (II)

Statistics: The Elements of Statistical
Learning: Data Mining, Inference, and Prediction
by Trevor Hastie, Robert Tibshirani and Jerome
Friedman, 2001.
Data structures and algorithms: Introduction to
Algorithms, by Thomas H. Cormen, Charles E.
Leiserson, Ronald L. Rivest, and Clifford Stein,
2nd edition, 2001.

24
So what is it all about?

Three parts
Bioinformatics / computational biology
Data mining
Text mining

25
Bioinformatics

A fast developing discipline
We will discuss
basic concepts of molecular biology
databases of biological data
structure and function of DNA, RNA, proteins
sequence searching (BLAST)
sequence similarity and comparison
protein structure (2D and 3D)
protein motifs and patterns
microarrays
phylogenetics

26
Data mining

Given a large amount of data of known types,
extract useful information
We will discuss
data cleanup and outliers
model construction
data and dimensionality reduction
classification
prediction / probability estimation
clustering
measuring performance

27
Text mining

Not only do we have a large amount of raw data,
but we don't know what each item means
We will discuss
tokenization and basics of text processing
recognition of terms and entities
classification
dictionary creation
relationship learning and extraction
document-level clustering and information
retrieval
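
To give a flavor of the first topic in this list, here is a minimal tokenization sketch in Python. It is not the method taught in the course: the rule of keeping hyphenated names such as "IL-2" as single tokens is an assumption chosen for illustration, and the sample sentence is made up.

```python
import re

# One token = a run of letters/digits, optionally continued by
# hyphen-joined runs, so that "IL-2" and "T-cells" stay whole.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into word-like tokens, dropping punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


# Made-up sentence, used only to exercise the tokenizer.
print(tokenize("The IL-2 gene was expressed in T-cells."))
```

Splitting on whitespace alone would leave the final period attached to "T-cells", while splitting on every non-letter would break "IL-2" apart; choices like these are part of what makes text harder to mine than data of known types.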
