Data and Text Mining for Computational Biology - PowerPoint PPT Presentation

Slides: 28
Provided by: VasileiosH9
Transcript and Presenter's Notes

Title: Data and Text Mining for Computational Biology


1
Data and Text Mining for Computational Biology
  • Introduction

2
Course information
  • CS 6365
  • Data and Text Mining for Computational Biology
  • Meets Tuesday and Thursday, 7:00-8:15 pm, at
    ECSS 2.412

3
Instructor
  • Vasileios Hatzivassiloglou
  • Associate Professor, Computer Science
  • Founding Professor, Bioengineering
  • Research focus: Discover knowledge from massive
    amounts of raw data
  • Data is not the same as information
  • Information overload

4
Research Interests
  • Text analysis, machine learning, intelligent
    information retrieval, summarization, question
    answering, bioinformatics, medical informatics

5
Contact information
  • Office hours: Tuesday and Thursday 6:00-7:00 pm
    and by appointment
  • Office location: ECSS 3.406
  • vh_at_hlt.utdallas.edu
  • (972) 883-4342
  • Teaching Assistant: TBA

6
Course goals
  • Introduce the field of bioinformatics
  • Discuss primary techniques used for data mining
  • Introduce text mining and additional issues it
    brings to data mining methods
  • Use examples from computational biology

7
Intended audience
  • For both computer scientists and biologists
  • Not an easy task to balance the two
  • Focus on data and text mining algorithms and
    applications
  • Coverage of machine learning background
  • No extensive algorithmic analysis / computational
    complexity
  • Medium level of programming

8
Prerequisites
  • Officially: CS 6325 Introduction to
    Bioinformatics
  • Waived for this offering of the course
  • You should know:
  • Basic data structures (multidimensional arrays,
    hash tables, binary trees)
  • One high-level programming language and be able
    to adapt to a new one as needed
  • Be able to install and use external software
    packages

9
You need not know
  • Molecular biology
  • Machine learning
  • Data mining (in general)
  • Text analysis / natural language processing
  • Information retrieval
  • Artificial intelligence

10
Course level
  • Introductory graduate course (MS or first-year
    PhD)
  • Maturity in programming and data structures at
    the level of a Computer Science senior
  • Ability to access (and interest in accessing) the
    primary literature in a guided fashion

11
Course structure
  • 6 lectures on biological background and
    bioinformatics in general
  • 6 lectures on data similarity
  • 8 lectures on data mining methods
  • 3 lectures on text mining and knowledge mining
    methods
  • student presentations of research papers (3-4
    sessions)

12
Expected work load
  • Two homework sets given in mid-to-late September
    and mid-to-late October
  • Two weeks to turn in each homework set
  • Mid-term exam in early October
  • Each student selects two or three research papers
    to review in late October
  • Student presentations of research papers in the
    last week of November / first week of December
  • Final exam

13
Course project
  • In lieu of the research papers and presentation,
    students may elect to work on a project in teams
    of two or three
  • Project is chosen by the students with the advice
    and consent of the instructor
  • Project investigation/implementation should be
    approximately 1.5-2 times the work required for a
    regular homework

14
Programming
  • Each student selects their own programming
    language (must be available at UTD and accessible
    to TA)
  • Examples: C, C++, Java, Perl, Python
  • Can also use a package/programming environment
    specifically tailored to bioinformatics

15
One likely package
  • R (http://www.r-project.org/)
  • R is the free alternative to S-Plus developed at
    AT&T Research
  • S-Plus is the extensible, programmable
    alternative to statistical packages like SAS and
    SPSS
  • If you know C, you will be right at home with R

16
Another likely package
  • BioPerl (http://bio.perl.org/)
  • A collection of library modules in Perl written
    by and for bioinformaticians
  • Perl supports high-level operations such as
    hashes as a basic data structure, string
    matching, and regular expressions
  • Perl is really bad at OOP and efficiency
  • Easy to learn
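
The high-level operations named on this slide (hashes, string matching, regular expressions) can be sketched as follows. The sketch is in Python, one of the languages the course allows, rather than Perl or BioPerl. The DNA fragment, the motif pattern, and the function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k, using a hash (dict)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_motif(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping regex match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Hypothetical DNA fragment, not taken from any database.
dna = "ATGCGATATAAGGCTATAAT"

# Hash-based counting of all 3-letter substrings.
kmers = count_kmers(dna, 3)

# Hypothetical motif: "TATAA" followed by any single base.
hits = find_motif(dna, r"TATAA[ACGT]")

print(kmers.most_common(3))
print(hits)
```

The dictionary-backed `Counter` plays the role of a Perl hash here, and `re.finditer` stands in for Perl's built-in pattern matching.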

17
Grading
  • Class participation: 20%
  • Homework assignments: 30% (total)
  • Midterm: 10%
  • Research paper presentation or project: 20%
  • Final exam: 20%
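
The five weights on this slide add up to 100, so they can be read as percentages. The slide does not say how component scores are combined; the sketch below assumes a plain weighted average with every component scored on a 0-100 scale. The dictionary keys and the example scores are hypothetical.

```python
# Component weights in percent, as listed on the grading slide.
WEIGHTS = {
    "class_participation": 20,
    "homework": 30,
    "midterm": 10,
    "paper_or_project": 20,
    "final_exam": 20,
}


def weighted_grade(scores):
    """Combine per-component scores (each 0-100) into one weighted average.

    Assumption: a linear weighted average; the slide gives only the weights.
    """
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must total 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example_scores = {
    "class_participation": 90,
    "homework": 80,
    "midterm": 70,
    "paper_or_project": 85,
    "final_exam": 75,
}

print(weighted_grade(example_scores))
```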

18
Textbooks
  • No good integrated textbook on data mining from a
    computational biology perspective
  • We will use a textbook covering bioinformatics
    algorithms and another textbook on data mining
    in general, and additional chapters from other
    books and research articles
  • Copies of chapters / research articles will be
    provided

19
Recommended textbook 1
  • An Introduction to Bioinformatics Algorithms
    (Computational Molecular Biology), by Neil C.
    Jones and Pavel A. Pevzner, MIT Press, 2004.
  • ISBN 0262101068
  • 448 pages
  • Available on Amazon.com for $41, Barnes and Noble
    for $60

20
Recommended textbook 2
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber, Elsevier, second
    edition, 2006.
  • ISBN 1558609016
  • 800 pages
  • Available on Amazon.com for $52, Barnes and Noble
    for $65

21
Supplementary textbooks
  • Bioinformatics: The Machine Learning Approach
    by Pierre Baldi and Soren Brunak, 2nd edition,
    2001.
  • Data Mining: Multimedia, Soft Computing, and
    Bioinformatics by Sushmita Mitra and Tinku
    Acharya, 2003.
  • Both of the above are available as full-text
    eBooks via http://library.utdallas.edu.

22
Background reading
  • Biology: Molecular Biology of the Cell by Bruce
    Alberts et al., 4th edition, 2002.
  • Machine learning: Machine Learning by Tom
    Mitchell, 1997.

23
Background reading (II)
  • Statistics: The Elements of Statistical
    Learning: Data Mining, Inference, and Prediction
    by Trevor Hastie, Robert Tibshirani and Jerome
    Friedman, 2001.
  • Data structures and algorithms: Introduction to
    Algorithms, by Thomas H. Cormen, Charles E.
    Leiserson, Ronald L. Rivest, and Clifford Stein,
    2nd edition, 2001.

24
So what is it all about?
  • Three parts:
  • Bioinformatics / computational biology
  • Data mining
  • Text mining

25
Bioinformatics
  • A fast developing discipline
  • We will discuss:
  • basic concepts of molecular biology
  • databases of biological data
  • structure and function of DNA, RNA, proteins
  • sequence searching (BLAST)
  • sequence similarity and comparison
  • protein structure (2D and 3D)
  • protein motifs and patterns
  • microarrays
  • phylogenetics

26
Data mining
  • Given a large amount of data of known types,
    extract useful information
  • We will discuss:
  • data cleanup and outliers
  • model construction
  • data and dimensionality reduction
  • classification
  • prediction / probability estimation
  • clustering
  • measuring performance

27
Text mining
  • Not only do we have a large amount of raw data,
    but we also don't know what each item means
  • We will discuss:
  • tokenization and basics of text processing
  • recognition of terms and entities
  • classification
  • dictionary creation
  • relationship learning and extraction
  • document level clustering and information
    retrieval
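
The first two topics on this slide, tokenization and recognition of terms and entities, can be illustrated with a minimal sketch. It assumes the simplest possible approach: split text on non-alphanumeric characters, then look each token up in a small term dictionary. The dictionary contents, the sentence, and the function names are hypothetical; the slide does not say which methods the course covers.

```python
import re

# Hypothetical mini-dictionary mapping lowercase terms to entity types.
TERM_DICTIONARY = {
    "p53": "protein",
    "brca1": "gene",
    "apoptosis": "process",
}


def tokenize(text):
    """Split text into runs of letters and digits, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def recognize_terms(tokens):
    """Return (token, entity_type) for each token found in the dictionary."""
    return [
        (token, TERM_DICTIONARY[token.lower()])
        for token in tokens
        if token.lower() in TERM_DICTIONARY
    ]


# Hypothetical sentence, for illustration only.
sentence = "Loss of p53 function blocks apoptosis in BRCA1-deficient cells."

tokens = tokenize(sentence)
entities = recognize_terms(tokens)

print(tokens)
print(entities)
```

The tokenizer choice matters: splitting on the hyphen is what lets `BRCA1` be matched inside `BRCA1-deficient`. A tokenizer that kept hyphenated words whole would miss it with this exact-match lookup.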