An Introduction to Data and Text Mining for Biology - PowerPoint PPT Presentation

Slides: 56

Provided by: bioinform2

Transcript and Presenter's Notes

Title: An Introduction to Data and Text Mining for Biology

1
An Introduction to Data and Text Mining for
Biology

Intelligent Agents and Soft-Computing Group
Dept. of Electrical and Electronic Engineering
University of Cagliari

2
Outline of the Talk

Intended Audience and Goal
Data Mining: from concepts to applications
Text Mining: from concepts to applications
Conclusions

3
Intended Audience and Goal
4
Intended Audience and Goal

Intended Audience
Biologists and bioinformaticians
Goal
Make biologists (and, if needed, bioinformaticians)
familiar with relevant concepts and techniques
related to data and text mining

5
Data Mining: from Concepts to Applications
6
Data Mining

What is data mining?
Some concepts and techniques for data mining
Can data mining be helpful in biology?
Some relevant systems, tools, and applications of
data mining in biology

7
The Context
What is Data Mining?
8
The Context
What is Data Mining?

Data Mining is a step in the process of Knowledge
Discovery in Databases (KDD)
[Fayyad et al., 1996] KDD is a nontrivial process
of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data
Data and patterns are the starting and ending
points, respectively, in a KDD process, with
large volumes of data processing, iterative
testing and analysis in between

Sometimes DM and KDD are used as synonyms
9
The Context
What is Data Mining?

KDD (steps)
Select the target data (filtering)
Preprocess and transform them if necessary
(encoding)
Perform data mining
Interpret and assess the discovered structures

10
A definition of Data Mining
What is Data Mining?

Data Mining is ...
extraction of potentially useful patterns or
knowledge from (typically large) observational
data sets
What kind of patterns or knowledge?
previously unknown
non-trivial
implicit

11
A definition of Data Mining
What is Data Mining?

Goal
discover information that allows one to summarize
the data in novel ways that are understandable and
useful to the data owner / to the user

12
Data Mining Tasks
What is Data Mining?

Descriptive / predictive modeling
Give description or summarization
Discovering patterns and rules
Retrieving similar objects

13
Examples of Data Mining
What is Data Mining?

descriptive modeling
Identifying groups of people with similar hobbies
predictive modeling
Are chances of getting cancer higher if you live
near a power line?
give description or summarization
Find a characteristic description for people that
may be interested in buying Italian red wine

14
Examples of Data Mining
What is Data Mining?

discovering patterns and rules
Recall the beers and diapers story
retrieving similar objects
Finding similar images in a database

15
An Algorithmic View of Data Mining
What is Data Mining?

Generic steps ...
1. Determine the nature and structure of the
representation to be used
2. Decide the method to be used for quantifying
and comparing how well different representations
fit the data (score function)
3. Choose a technique to be used for optimizing
the score function
--continue--

16
An Algorithmic View of Data Mining
What is Data Mining?

4. Decide which transformations on data (somewhat
related with data warehousing) and what
principles of data management are required to
implement the technique efficiently
5. Perform data mining
6. Test results
Repeat until ...
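
The first three generic steps can be made concrete with a toy example. The sketch below is only an illustration under assumed choices that are not in the slides: a straight line as representation, mean squared error as score function, and plain gradient descent as optimization technique. The data points are made up.

```python
# Step 1: representation. A straight line y = a*x + b (an assumed toy model).
def predict(params, x):
    a, b = params
    return a * x + b

# Step 2: score function. Mean squared error between predictions and data.
def score(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

# Step 3: optimization technique. Plain gradient descent on the score.
def fit(data, lr=0.05, steps=2000):
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in data) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up points lying exactly on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = fit(data)
print(params, score(params, data))
```

Swapping any one of the three functions (a different model, score, or optimizer) leaves the other two untouched, which is the point of keeping the steps separate.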

17
Some Concepts and Techniques ...

Main concerns
Relevant concepts
Main techniques
Representation issues
Assessment

18
Main Concerns for Data Mining
Concepts and Techniques

The complexity problem
Due to the huge amount of data to be processed,
time (and space) complexity must be kept as low
as possible
Solution
Inherit methods from statistical theory, machine
learning, and pattern recognition provided that
they fulfill the constraints about time (and
space) complexity

19
Relevant Concepts for Data Mining
Concepts and Techniques

Clustering
to partition a data set into subsets (clusters)
Classification
to place into groups individual items
Regression
to estimate the relationships between one or more
response variables Y and some independent
variables X
Prediction
to state that an event will occur in the future

20
Main Techniques for Data Mining
Concepts and Techniques

Techniques (from ML and PR)
Symbolic Rule Induction
Decision Tree Learning
Artificial Neural Networks (MLP, RANN)
Bayesian Networks
Bayesian Classifiers
Hidden Markov Models
K-Means Clustering
Fuzzy c-means Clustering

21
Main Techniques for Data Mining
Concepts and Techniques

Techniques (from ML and PR)
At the end of the talk we will briefly illustrate
some of the main techniques...

22
Representation Issues (from PR / ML)
What is Data Mining?

Modeling (global, summary of a data set)
Linear or higher order equations (for regression
tasks), graphs, tree structures, conditional
probabilities, ...
Related tasks: regression / prediction,
classification, interpretation, clustering
Pattern Finding (local, restricted regions)
Recurrent patterns, symbolic rules, prototypes,
non symbolic descriptions, ...
Related tasks: detection of characteristic
patterns, discriminant patterns, anomalies

23
Assessing Data Mining Techniques
Concepts and Techniques

Assessment (from statistical theory, etc.)
Distance measures
Confusion matrix (accuracy / precision / recall)
k-fold cross-validation
... and also user acknowledgement
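
As a rough illustration of k-fold cross-validation, the sketch below splits item indices into k folds and yields one train/test pair per fold. The function name `k_fold_splits` is hypothetical, and the sketch assumes the items have already been shuffled.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each item is used for testing exactly once; the k resulting scores are then averaged.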

24
Can Data Mining be Helpful in Biology?
Can Data Mining Be Helpful in Biology?

Candidate tasks
Pathway analysis
Gene finding
Secondary and tertiary structure prediction
Docking
Microarray experiments
...

25
Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications

Pathway analysis
VisANT: Integrative Visual Analysis Tool for
biological networks and pathways [Hu et al.,
2005]
Gene finding
Merck Gene Index browser: an extensible data
integration system for gene finding, gene
characterization and EST data mining [Eckman et
al., 1998]

26
Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications

Secondary structure prediction
Consensus Data Mining: secondary structure
prediction by combining Garnier-Osguthorpe-Robson
(GOR) and Fragment Database Mining (FDM) [Sen et
al., 2006]
Microarray analysis
Cleaver: Classification of Expression Array,
tools to assist in the analysis of microarray
expression data (http://classify.stanford.edu/)

27
Text Mining: from Concepts to Applications
28
Text Mining

Text Mining
What is text mining?
Some concepts and techniques for text mining
Can text mining be helpful in biology?
Some relevant systems, tools, and applications of
text mining in biology

29
The Context
What is Text Mining?

Text Mining is part of Information Retrieval (IR)
[Wikipedia] IR is the science of searching for
information in documents, searching for documents
themselves, searching for metadata which describe
documents, or searching within databases, whether
relational stand-alone databases or hypertext
networked databases such as the Internet or World
Wide Web or intranets, for text, sound, images or
data

30
A definition of Text Mining
What is Text Mining?

Text Mining is ...
the discovery of new, previously unknown
information, by automatically extracting it from
different text resources
one of the major tasks is to link together the
extracted information to form new facts to be
explored further
input encoding is important

Text Mining is sometimes also referred to as text
data mining
31
A definition of Text Mining
What is Text Mining?

Goal
deriving high-quality, previously unknown,
information from text

32
Some Concepts and Techniques ...

Main concerns
Typical Text Mining tasks (with relevant concepts
and techniques)
Assessment

33
Main Concerns - I
Concepts and Techniques

The encoding problem
Due to the huge amount of possible features (each
word could in fact be a feature) encoding input
data is very important for text mining
Solution
Reduce the number of features by
resorting to feature selection / extraction (both
techniques imply the evaluation of some form of
entropy)

34
Encoding Input Data
Concepts and Techniques

Encoding is very important in text mining
activities
A text document is usually represented by a
vector of n weighted index terms (bag of words
approach)
Each term of a document vector can be weighted
according to different measures of information content
Term frequency
TF-IDF (term frequency-inverse document frequency)

35
Encoding Input Data
Concepts and Techniques

The high dimensionality of the search space
should be reduced (dimensionality reduction)
Feature selection
The set of terms is reduced through the selection
of the most discriminant terms (Document
frequency, Information Gain, Chi-square,...).
Feature extraction
The original n-dimensional feature/term space is
mapped to a new m-dimensional space (m < n) through
a linear transformation (PCA, LDA, ...)
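
Feature selection with the chi-square criterion can be sketched as below. The term/category counts are invented for illustration, and the function name `chi_square` is hypothetical; the formula is the usual statistic for a 2x2 contingency table.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/category 2x2 contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category without the term
    n00: documents outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    # A term that occurs in every document (or in none) carries no information.
    return numerator / denominator if denominator else 0.0

# Made-up counts over 100 documents, 50 of them in the category of interest.
counts = {
    "kinase": (40, 10, 10, 40),
    "cell": (30, 20, 20, 30),
    "the": (50, 50, 0, 0),
}
scores = {term: chi_square(*c) for term, c in counts.items()}
selected = sorted(scores, key=scores.get, reverse=True)[:2]  # keep the 2 best terms
print(scores, selected)
```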

36
Main Concerns - II
Concepts and Techniques

Evaluation metrics
Depending on the specific task (and possibly on
the user's preferences), a different trade-off
must be found between keeping the number of
false positives low and keeping the number of
false negatives low in text categorization tasks
Solution
Use ROC analysis ...

ROC: Receiver Operating Characteristic
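
A ROC curve can be traced by lowering the decision threshold over the classifier scores and recording the false and true positive rates at each step. The sketch below assumes binary labels (1 = positive), at least one example of each class, and distinct scores; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one document at a time, from the highest score down.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

points = roc_points(scores=[0.9, 0.8, 0.7, 0.4], labels=[1, 0, 1, 0])
print(points, auc(points))
```

Each point on the curve is one possible trade-off between false positives and false negatives; the task and the user decide which one to operate at.
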
37
Evaluation Metrics
Concepts and Techniques

Accuracy
Precision
Recall
Other metrics
Specificity
True/False positive rate
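
The listed metrics follow directly from the four cells of a binary confusion matrix. A small sketch, with made-up counts and a hypothetical helper name `metrics`:

```python
def metrics(tp, fp, fn, tn):
    """Common evaluation metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),            # also the true positive rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }

m = metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

Note that specificity and false positive rate always sum to one.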

38
Text Categorization
Concepts and Techniques

The task of assigning a label (topic) to a text
document.
This task is performed through supervised
classification algorithms.
Main supervised classification techniques
kNN (k-Nearest Neighbor)
NB (Naïve Bayes)
SVM (Support Vector Machines)
MLPs (Multilayer Perceptrons) with
backpropagation

39
Text Clustering
Concepts and Techniques

The task of discovering natural groups in
datasets, without having any background knowledge
of characteristics of the data.
This task is performed through unsupervised
classification algorithms
Main unsupervised classification algorithms
k-Means
Fuzzy c-means clustering

40
Entity Relation Modeling
Concepts and Techniques

The task of seeking and classifying atomic
elements in text into predefined categories such
as the names of persons, organizations,
locations, etc.
This task is performed through three main
different approaches
Manually created rule-based systems
Fully automatic learning-based systems
Mixed approach

41
Assessment
Concepts and Techniques

Assessment is usually performed by resorting to
some sort of ROC analysis, possibly in
combination with averaging techniques (such as
k-fold cross validation)

42
Can Text Mining Be Helpful in Biology?
Can Text Mining Be Helpful in Biology?

Candidate Tasks
Text in microarray analysis
Text-enhanced sequence alignments
Information extraction
...

43
Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications

Text in microarray analysis
MedMiner: an Internet text-mining tool for
biomedical information [Tanabe et al., 99]
Text-enhanced sequence alignments
SAWTED: structure assignment with text
description-enhanced detection of remote
homologues with automated SWISS-PROT annotation
comparisons [MacCallum et al., 2000]

44
Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications

Information extraction
Constructing biological knowledge bases by
extracting information from text sources [Craven
and Kumlien, 99]

45
Technical Issues
46
Some Technical Details...

Common issues on classification tasks
Technical specifications
kNN
Naive Bayes
SVM
MLP
k-Means
Fuzzy c-Means

47
Common Issues

The classification task assigns a score to each
text document. That score
can be an estimate of the a posteriori
probability (NB, kNN)
can be related to a specific aspect of the
algorithm, such as the geometric distance from a
separation surface (SVM).
It may not be possible to attach a specific
meaning to the score (ANN).

48
Technical Specifications: kNN

The score is assigned using nonparametric
density estimation of the a posteriori
probability
determines the volume V that contains the k
nearest prototypes of each unlabeled document.
estimates the a posteriori probability of each
text category
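
One common way to read the kNN estimate is that the a posteriori probability of a category is approximated by the fraction of the k nearest prototypes carrying that label. The sketch below assumes Euclidean distance and tiny two-dimensional feature vectors; the prototypes are made up.

```python
import math
from collections import Counter

def knn_scores(query, prototypes, k):
    """Estimate P(category | query) as the share of each label among the k nearest prototypes."""
    nearest = sorted(prototypes, key=lambda p: math.dist(query, p[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

# Made-up labeled prototypes: (feature vector, category).
prototypes = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"),
]
scores = knn_scores((4.0, 4.0), prototypes, k=3)
print(scores)
```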

49
Technical Specifications: Naive Bayes

The score is assigned a posteriori through a
simplified model of the Bayes theorem.
The Bayes theorem
Through the strong (naive) independence
assumption it states
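
The formulas themselves are not reproduced in the transcript. In standard notation, and assuming symbols that do not come from the slides (c for a category, d for a document described by terms t_1, ..., t_n), the two statements are usually written as:

```latex
% Bayes theorem for a category c and a document d
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}

% With the naive independence assumption over the terms t_1, \dots, t_n of d
P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(t_i \mid c)
```

The denominator P(d) is the same for every category, so it can be dropped when only the ranking of categories matters.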

50
Technical Specifications: SVM

The score is related to the distance of the
document from a separation hyperplane
Defines the optimal separation hyperplane (OSH)
OSH is specified by a weighted sum of prototypes
(support vectors, SV)

51
Technical Specifications: SVM

The score is related to the distance of the
document from a separation hyperplane
using kernel functions, SVM builds the OSH in a
high dimensional hyperspace defining complex
separation surfaces in the feature domain.
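
To illustrate how the score comes out of a weighted sum of support vectors, the sketch below evaluates a kernel decision function. The support vectors, weights and bias are hand-picked assumptions, not the output of any training procedure, and the RBF kernel is just one possible choice.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_score(x, support_vectors, bias, kernel=rbf_kernel):
    """Signed score: weighted sum of kernel values between x and each support vector."""
    return sum(weight * kernel(sv, x) for sv, weight in support_vectors) + bias

# Hypothetical support vectors with signed weights (alpha_i * y_i); not trained.
support_vectors = [((0.0, 0.0), 1.0), ((2.0, 2.0), -1.0)]
for x in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    print(x, svm_score(x, support_vectors, bias=0.0))
```

The sign of the score gives the side of the separation surface, and its magnitude grows as the document moves away from that surface.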

52
Technical Specifications: MLP

The score is typically related to the similarity
of a document to each category
MLPs are the most common Artificial Neural
Networks (ANNs).
An MLP is composed of several layers of artificial
neurons, interconnected in a feed-forward way.
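
The feed-forward structure can be sketched as a chain of fully connected layers. The example assumes sigmoid activations and hand-picked weights (no training, so no backpropagation is shown); all numbers are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each neuron applies a sigmoid to a weighted sum of its inputs."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Propagate the inputs through the layers in order (feed-forward)."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical network: 3 input features, 2 hidden neurons, 2 output categories.
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),   # hidden layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),             # output layer
]
outputs = mlp_forward([1.0, 0.0, 2.0], layers)
print(outputs)
```

Each output value is the score of the document for one category.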

53
Technical Specifications: k-Means

Tries to find the natural clustering of
documents.
Splits the feature space into k partitions by
fixing k random centres
Assigns each document to the nearest centre
Calculates the mean of each cluster and takes
it as the new centre.
Repeats the former two steps until the documents
no longer switch clusters.
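
The procedure above can be sketched as follows, assuming Euclidean distance, initial centres drawn at random from the data points, and a fixed random seed for repeatability. The points are made up.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centres, assignment) after iterating until no point switches cluster."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # fix k random centres
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centre.
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                          for p in points]
        if new_assignment == assignment:         # no point switched cluster
            break
        assignment = new_assignment
        # Recompute each centre as the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep the old centre if a cluster is empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, assignment = k_means(points, k=2)
print(centres, assignment)
```

The result depends on the initial centres in general; with well-separated groups such as these, the two obvious clusters are recovered.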

54
Technical Specifications: Fuzzy c-Means