Title: An Introduction to Data and Text Mining for Biology
1An Introduction to Data and Text Mining for
Biology
- Intelligent Agents and Soft-Computing Group
- Dept. of Electrical and Electronic Engineering
- University of Cagliari
2Outline of the Talk
- Intended Audience and Goal
- Data Mining from concepts to applications
- Text Mining from concepts to applications
- Conclusions
3Intended Audience and Goal
4Intended Audience and Goal
- Intended Audience
- Biologists and bioinformaticians
- Goal
- Make biologists (and, if needed, bioinformaticians)
familiar with relevant concepts and techniques
related to data and text mining
5Data Mining: from Concepts to Applications
6Data Mining
- What is data mining?
- Some concepts and techniques for data mining
- Can data mining be helpful in biology?
- Some relevant systems, tools, and applications of
data mining in biology
7The Context
What is Data Mining?
8The Context
What is Data Mining?
- Data Mining is a step in the process of Knowledge
Discovery in Databases (KDD)
- According to Fayyad et al. (1996), KDD is a nontrivial process
of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data
- Data and patterns are the starting and ending
points, respectively, in a KDD process, with
large volumes of data processing, iterative
testing and analysis in between
Sometimes DM and KDD are used as synonyms
9The Context
What is Data Mining?
- KDD (steps)
- Select the target data (filtering)
- Preprocess and transform them if necessary
(encoding)
- Perform data mining
- Interpret and assess the discovered structures
10A definition of Data Mining
What is Data Mining?
- Data Mining is ...
- extraction of potentially useful patterns or
knowledge from (typically large) observational
data sets
- What kind of patterns or knowledge?
- previously unknown
- non-trivial
- implicit
11A definition of Data Mining
What is Data Mining?
- Goal
- discover information that summarizes the
data in novel ways that are understandable and useful to
the data owner or user
12Data Mining Tasks
What is Data Mining?
- Descriptive / predictive modeling
- Give description or summarization
- Discovering patterns and rules
- Retrieving similar objects
13Examples of Data Mining
What is Data Mining?
- descriptive modeling
- Identifying groups of people with similar hobbies
- predictive modeling
- Are chances of getting cancer higher if you live
near a power line?
- give description or summarization
- Find a characteristic description for people that
may be interested in buying Italian red wine
14Examples of Data Mining
What is Data Mining?
- discovering patterns and rules
- Recall the beers and diapers story
- retrieving similar objects
- Finding similar images in a data base
15An Algorithmic View of Data Mining
What is Data Mining?
- Generic steps ...
- 1. Determine the nature and structure of the
representation to be used
- 2. Decide the method to be used for quantifying
and comparing how well different representations
fit the data (score function)
- 3. Choose a technique to be used for optimizing
the score function
- (continued)
16An Algorithmic View of Data Mining
What is Data Mining?
- 4. Decide which transformations on data (somewhat
related to data warehousing) and what
principles of data management are required to
implement the technique efficiently
- 5. Perform data mining
- 6. Test results
- Repeat until ...
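The generic steps can be illustrated with a minimal sketch. It assumes a straight line as the representation, mean squared error as the score function, and plain gradient descent as the optimization technique. The data values, function names, learning rate, and step count are invented for illustration and are not taken from the talk.

```python
# Hypothetical toy data: y equals 2*x + 1 (values invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def predict(a, b, x):
    """Step 1, representation: a straight line y = a*x + b."""
    return a * x + b


def score(a, b):
    """Step 2, score function: mean squared error of the line on the data."""
    return sum((predict(a, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(steps=5000, learning_rate=0.01):
    """Step 3, optimization: gradient descent on the score function."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to a and b.
        grad_a = sum(2 * (predict(a, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(a, b, x) - y) for x, y in zip(xs, ys)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Steps 5 and 6: run the mining step, then test the result.
a, b = fit()
final_score = score(a, b)
```

On this toy data the fitted slope and intercept should approach 2 and 1. In a real setting the "repeat until" loop would revisit the representation or the score function when the test step is unsatisfactory.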
17Some Concepts and Techniques ...
- Main concerns
- Relevant concepts
- Main techniques
- Representation issues
- Assessment
18Main Concerns for Data Mining
Concepts and Techniques
- The complexity problem
- Due to the huge amount of data to be processed,
time (and space) complexity must be kept as low
as possible
- Solution
- Inherit methods from statistical theory, machine
learning, and pattern recognition, provided that
they fulfill the constraints on time (and
space) complexity
19Relevant Concepts for Data Mining
Concepts and Techniques
- Clustering
- to partition a data set into subsets (clusters)
- Classification
- to place individual items into predefined groups (classes)
- Regression
- to estimate the relationships between one or more
response variables Y and some independent
variables X
- Prediction
- to state that an event will occur in the future
20Main Techniques for Data Mining
Concepts and Techniques
- Techniques (from ML and PR)
- Symbolic Rule Induction
- Decision Tree Learning
- Artificial Neural Networks (MLP, RANN)
- Bayesian Networks
- Bayesian Classifiers
- Hidden Markov Models
- K-Means Clustering
- Fuzzy c-means Clustering
21Main Techniques for Data Mining
Concepts and Techniques
- Techniques (from ML and PR)
- At the end of the talk we will briefly illustrate
some of the main techniques...
22Representation Issues (from PR / ML)
What is Data Mining?
- Modeling (global, summary of a data set)
- Linear or higher order equations (for regression
tasks), graphs, tree structures, conditional
probabilities, ...
- Related tasks: regression / prediction,
classification, interpretation, clustering
- Pattern Finding (local, restricted regions)
- Recurrent patterns, symbolic rules, prototypes,
non-symbolic descriptions, ...
- Related tasks: detection of characteristic
patterns, discriminant patterns, anomalies
23 Assessing Data Mining Techniques
Concepts and Techniques
- Assessment (from statistical theory, etc.)
- Distance measures
- Confusion matrix (accuracy / precision / recall)
- k-fold cross-validation
- ... and also user acknowledgement
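As a rough illustration of k-fold cross-validation, the sketch below splits a labeled data set into k folds, uses each fold once for testing, and averages the resulting accuracies. The classifier is a deliberately trivial majority-class guesser. The labels and all function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def majority_label(labels):
    """A trivial 'classifier': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_accuracy(labels, k):
    """Average the accuracy measured on each of the k held-out folds."""
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        guess = majority_label([labels[j] for j in train])
        correct = sum(1 for j in test if labels[j] == guess)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Hypothetical labels, invented for illustration.
labels = ["a", "a", "a", "b", "a", "a", "b", "a", "a"]
mean_accuracy = cross_validated_accuracy(labels, k=3)
```

Every item is used for testing exactly once, so the averaged figure is less dependent on one particular train/test split than a single hold-out estimate.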
24Can Data Mining be Helpful in Biology?
Can Data Mining Be Helpful in Biology?
- Candidate tasks
- Pathway analysis
- Gene finding
- Secondary and tertiary structure prediction
- Docking
- Microarray experiments
- ...
25Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications
- Pathway analysis
- VisANT: Integrative Visual Analysis Tool for
biological networks and pathways (Hu et al.,
2005)
- Gene finding
- Merck Gene Index browser: an extensible data
integration system for gene finding, gene
characterization and EST data mining (Eckman et
al., 1998)
26Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications
- Secondary structure prediction
- Consensus Data Mining: secondary structure
prediction by combining Garnier-Osguthorpe-Robson
(GOR) and Fragment Database Mining (FDM) (Sen et
al., 2006)
- Microarray analysis
- Cleaver: Classification of Expression Array,
tools to assist in the analysis of microarray
expression data (http://classify.stanford.edu/)
27Text Mining: from Concepts to Applications
28Text Mining
- Text Mining
- What is text mining?
- Some concepts and techniques for text mining
- Can text mining be helpful in biology?
- Some relevant systems, tools, and applications of
text mining in biology
29The Context
What is Text Mining?
- Text Mining is part of Information Retrieval (IR)
- Wikipedia: IR is the science of searching for
information in documents, searching for documents
themselves, searching for metadata which describe
documents, or searching within databases, whether
relational stand-alone databases or hypertext
networked databases such as the Internet or World
Wide Web or intranets, for text, sound, images or
data
30A definition of Text Mining
What is Text Mining?
- Text Mining is ...
- the discovery of new, previously unknown
information, by automatically extracting it from
different text resources
- one of the major tasks is to link together the
extracted information to form new facts to be
explored further
- input encoding is important
Text Mining is sometimes also referred to as text
data mining
31A definition of Text Mining
What is Text Mining?
- Goal
- deriving high-quality, previously unknown,
information from text
32Some Concepts and Techniques ...
- Main concerns
- Typical Text Mining tasks (with relevant concepts
and techniques)
- Assessment
33Main Concerns - I
Concepts and Techniques
- The encoding problem
- Due to the huge amount of possible features (each
word could in fact be a feature), encoding input
data is very important for text mining
- Solution
- Reduce the number of relevant features by
resorting to feature selection / extraction (both
techniques typically involve evaluating some measure
of information content, such as entropy)
34Encoding Input Data
Concepts and Techniques
- Encoding is very important in text mining
activities
- A text document is usually represented by a
vector of n weighted index terms (bag of words
approach)
- Each term of a document vector can be weighted
using different measures of information content
- Term frequency
- TF-IDF (term frequency-inverse document frequency)
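The bag of words encoding with TF-IDF weights can be sketched as follows. This assumes one common variant (relative term frequency multiplied by the natural logarithm of N / document frequency), whitespace tokenization, and non-empty documents. Other weighting variants exist. The example documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document (bag of words)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical mini-corpus, invented for illustration.
docs = [
    "gene expression in yeast",
    "gene finding in bacteria",
    "protein structure prediction",
]
vectors = tf_idf(docs)
```

In the first document, "yeast" occurs in one document only and therefore receives a higher weight than "gene", which occurs in two. A term occurring in every document would receive weight zero under this variant.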
35Encoding Input Data
Concepts and Techniques
- The high dimensionality of the search space
should be reduced (dimensionality reduction)
- Feature selection
- The set of terms is reduced through the selection
of the most discriminant terms (Document
frequency, Information Gain, Chi-square, ...).
- Feature extraction
- The original n-dimensional feature/term space is
mapped to a new m-dimensional space (m < n) through a
linear transformation (PCA, LDA, ...)
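Of the feature selection criteria listed above, document frequency is the simplest to sketch: terms that occur in too few documents are dropped, and each document is then encoded over the reduced vocabulary. The threshold, function names, and documents below are assumptions made for illustration. Information Gain, Chi-square, PCA, and LDA are not shown.

```python
from collections import Counter


def select_by_document_frequency(tokenized_docs, min_df=2, max_features=None):
    """Keep the terms that occur in at least min_df documents."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    kept = [term for term, n in df.items() if n >= min_df]
    # Most frequent terms first; ties broken alphabetically for determinism.
    kept.sort(key=lambda term: (-df[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return kept


def project(tokens, vocabulary):
    """Encode a document as term counts over the reduced vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["gene", "expression", "yeast"],
    ["gene", "finding", "yeast"],
    ["gene", "structure"],
]
vocabulary = select_by_document_frequency(docs, min_df=2)
reduced = [project(tokens, vocabulary) for tokens in docs]
```

Here the original five-term space shrinks to two terms, so each document becomes a two-dimensional vector.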
36Main Concerns - II
Concepts and Techniques
- Evaluation metrics
- Depending on the specific task (and possibly on
the user's preferences), a different trade-off
must be found between keeping the number of
false positives low and keeping the number of
false negatives low in text categorization tasks
- Solution
- Use ROC analysis ...
ROC: Receiver Operating Characteristic
37Evaluation Metrics
Concepts and Techniques
- Accuracy
- Precision
- Recall
- Other metrics
- Specificity
- True/False positive rate
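The metrics above all derive from the four cells of a binary confusion matrix, and a ROC curve is obtained by sweeping a decision threshold over the classifier scores. The sketch below assumes binary labels (1 = positive), that both classes are present, and that at least one document is predicted positive. The labels and scores are invented.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where truthy means positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called the true positive rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }


def roc_points(y_true, scores):
    """One (false positive rate, true positive rate) point per threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [s >= threshold for s in scores]
        tp, fp, fn, tn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical labels and classifier scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
report = metrics(y_true, [s >= 0.5 for s in scores])
curve = roc_points(y_true, scores)
```

Lowering the threshold moves along the curve towards more true positives at the cost of more false positives, which is the trade-off a ROC analysis makes explicit.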
38Text Categorization
Concepts and Techniques
- The task of assigning a label (topic) to a text
document.
- This task is performed through supervised
classification algorithms.
- Main supervised classification techniques
- kNN (k-Nearest Neighbor)
- NB (Naïve Bayes)
- SVM (Support Vector Machines)
- MLP (Multilayer Perceptron) with
Backpropagation
39Text Clustering
Concepts and Techniques
- The task of discovering natural groups in
datasets, without having any background knowledge
of the characteristics of the data.
- This task is performed through unsupervised
classification algorithms
- Main unsupervised classification algorithms
- k-Means
- Fuzzy c-means clustering
40Entity Relation Modeling
Concepts and Techniques
- The task of seeking and classifying atomic
elements in text into predefined categories such
as the names of persons, organizations,
locations, etc.
- This task is performed through three main
approaches
- Manually created rule-based systems
- Fully automatic learning-based systems
- Mixed approach
41Assessment
Concepts and Techniques
- Assessment is usually performed by resorting to
some sort of ROC analysis, possibly in
combination with averaging techniques (such as
k-fold cross validation)
42Can Text Mining Be Helpful in Biology?
Can Text Mining Be Helpful in Biology?
- Candidate Tasks
- Text in microarray analysis
- Text-enhanced sequence alignments
- Information extraction
- ...
43Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications
- Text in microarray analysis
- MedMiner: an Internet text-mining tool for
biomedical information (Tanabe et al., 1999)
- Text-enhanced sequence alignments
- SAWTED: structure assignment with text
description-enhanced detection of remote
homologues with automated SWISS-PROT annotation
comparisons (MacCallum et al., 2000)
44Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications
- Information extraction
- Constructing biological knowledge bases by
extracting information from text sources (Craven
and Kumlien, 1999)
45Technical Issues
46Some Technical Details...
- Common issues on classification tasks
- Technical specifications
- kNN
- Naive Bayes
- SVM
- MLP
- k-Means
- Fuzzy c-Means
47Common Issues
- The classification task assigns a score to each
text document. That score
- can be an estimate of the a posteriori
probability (NB, kNN)
- can be related to a specific aspect of the
algorithm, like the geometric distance from the
separation surface (SVM).
- It may not be possible to attach a specific
meaning to the score (ANN).
48Technical Specifications: kNN
- The score is assigned using non-parametric
density estimation of the a posteriori
probability
- determines the volume V that contains the k
nearest prototypes of each unlabeled document.
- estimates the a posteriori probability of each
text category from the labels of those k prototypes
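A minimal sketch of the kNN score follows. It assumes Euclidean distance and estimates the a posteriori probability of a category as the fraction of the k nearest prototypes that carry its label. The prototypes, the query point, and the function name are invented, and k is assumed not to exceed the number of prototypes.

```python
import math
from collections import Counter


def knn_posteriors(query, prototypes, k):
    """Estimate P(category | query) as k_category / k.

    prototypes is a list of (vector, label) pairs.
    """
    nearest = sorted(prototypes, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}


# Hypothetical labeled prototypes in a two-dimensional feature space.
prototypes = [
    ((0.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"),
    ((5.0, 6.0), "B"),
]
posteriors = knn_posteriors((4.0, 4.0), prototypes, k=3)
predicted = max(posteriors, key=posteriors.get)
```

With k = 3, two of the three nearest prototypes are labeled "B", so the estimated probabilities are 2/3 for "B" and 1/3 for "A".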
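49Technical Specifications: Naive Bayes
- The score is assigned a posteriori through a
simplified model of the Bayes theorem.
- The Bayes theorem, for a category c and a document d:
P(c | d) = P(d | c) P(c) / P(d)
- Through the strong (naive) independence
assumption on the terms t_1, ..., t_n of d, it states:
P(c | d) ∝ P(c) · P(t_1 | c) · ... · P(t_n | c)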
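A minimal Naive Bayes text classifier might look like the sketch below. It assumes a multinomial model with add-one (Laplace) smoothing and works with log-probabilities to avoid numerical underflow. Those choices, the function names, and the training documents are assumptions made for illustration, not details taken from the talk.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs is a list of (tokens, category) pairs."""
    doc_counts = Counter()              # number of documents per category
    term_counts = defaultdict(Counter)  # term occurrences per category
    vocabulary = set()
    for tokens, category in labeled_docs:
        doc_counts[category] += 1
        term_counts[category].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, term_counts, vocabulary


def log_scores(tokens, model):
    """Return log P(c) + sum of log P(t | c) for each category c."""
    doc_counts, term_counts, vocabulary = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_terms = sum(term_counts[category].values())
        score = math.log(doc_counts[category] / n_docs)  # log prior
        for term in tokens:
            if term in vocabulary:  # terms never seen in training are ignored
                smoothed = (term_counts[category][term] + 1) / (
                    total_terms + len(vocabulary)
                )
                score += math.log(smoothed)
        scores[category] = score
    return scores


def classify(tokens, model):
    scores = log_scores(tokens, model)
    return max(scores, key=scores.get)


# Hypothetical training documents, invented for illustration.
training = [
    (["gene", "expression", "microarray"], "genomics"),
    (["gene", "sequence", "alignment"], "genomics"),
    (["protein", "structure", "folding"], "proteomics"),
    (["protein", "docking", "structure"], "proteomics"),
]
model = train_naive_bayes(training)
label = classify(["gene", "microarray"], model)
```

The scores are unnormalized log-posteriors: they rank the categories correctly, but turning them into probabilities would require dividing by the evidence term P(d).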
50Technical Specifications: SVM
- The score is related to the distance of the
document from a separation hyperplane
- Defines the optimal separation hyperplane (OSH)
- OSH is specified by a weighted sum of prototypes
(support vectors, SV)
51Technical Specifications: SVM
- The score is related to the distance of the
document from a separation hyperplane
- Using kernel functions, SVM builds the OSH in a
high-dimensional space, defining complex
separation surfaces in the feature domain.
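The SVM score can be sketched as a weighted sum over support vectors, score(x) = sum_i alpha_i * y_i * K(sv_i, x) + b. The code below only evaluates an already trained machine. Training (finding the weights alpha_i) is not shown, and the support vectors, weights, bias, and gamma value are hypothetical.

```python
import math


def linear_kernel(x, z):
    """Plain dot product: the hyperplane lives in the original feature space."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel: implicitly maps the data to a high-dimensional space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_score(x, support_vectors, kernel=linear_kernel, bias=0.0):
    """Signed score of x; support_vectors holds (vector, label, alpha) triples."""
    return sum(
        alpha * label * kernel(sv, x) for sv, label, alpha in support_vectors
    ) + bias


# Hypothetical trained machine with two support vectors and labels +1 / -1.
# For the linear kernel these weights give the hyperplane 0.5*x1 + 0.5*x2 = 0.
support_vectors = [
    ((1.0, 1.0), +1, 0.25),
    ((-1.0, -1.0), -1, 0.25),
]
linear_score = svm_score((2.0, 0.0), support_vectors)
rbf_score = svm_score((2.0, 0.0), support_vectors, kernel=rbf_kernel)
```

The sign of the score gives the predicted class, and with the linear kernel its magnitude grows with the distance from the hyperplane. Swapping the kernel changes the shape of the separation surface without changing the scoring formula.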
52Technical Specifications: MLP
- The score is typically related to the similarity
of a document to each category
- MLPs are the most common Artificial Neural
Networks (ANNs).
- An MLP is composed of several layers of artificial
neurons, interconnected in a feed-forward way.
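The feed-forward computation of an MLP can be sketched as below, assuming sigmoid activations in every layer. Training with backpropagation is not shown, and the layer sizes and weight values are arbitrary placeholders chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One layer: each neuron applies a sigmoid to a weighted sum of the inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the input through the layers in order (feed-forward)."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 input features, 2 hidden neurons, 2 output neurons
# (one score per category). The weights are arbitrary, not trained values.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),            # output layer
]
category_scores = mlp_forward([1.0, 0.0, 2.0], layers)
```

Each output lies between 0 and 1 and can be read as a per-category score, although it does not by itself carry a probabilistic meaning.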
53Technical Specifications: k-Means
- Tries to find the natural clustering of
documents.
- Splits the feature space into k partitions by
fixing k random centres
- Assigns each document to the nearest centre
- Calculates the mean of each cluster and
reassigns each document to the nearest new centre.
- Repeats the previous two steps until the documents no
longer switch clusters.
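The k-means steps can be sketched as follows. The sketch assumes Euclidean distance and picks the k initial centres at random among the data points, which is one common way of "fixing k random centres". The seed, the iteration cap, and the sample points are invented for illustration.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centres, assignment), where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k random data points as initial centres
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point switched cluster: converged
        assignment = new_assignment
        # Move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centres, assignment


# Hypothetical two-dimensional document vectors forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, assignment = k_means(points, k=2)
```

On data this well separated the result does not depend on the initial centres. In general k-means can converge to different local optima for different initializations, so several restarts are often used.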
54Technical Specifications: Fuzzy c-Means
- A variant of the k-means algorithm
- Assigns a continuous degree of membership in
each cluster to each document
- Each document can belong to several clusters
- May converge more quickly than k-means
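A sketch of fuzzy c-means follows, using the standard update rules with fuzzifier m: memberships u_ji = 1 / sum_k (d_ji / d_jk)^(2/(m-1)), and centres computed as means weighted by u^m. The value m = 2, the fixed number of iterations, the initial centres, and the sample points are assumptions made for illustration.

```python
import math


def memberships(points, centres, m=2.0):
    """Return u[j][i], the degree to which point j belongs to cluster i.

    Each row sums to 1.
    """
    power = 2.0 / (m - 1.0)
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centres]
        if 0.0 in dists:
            # The point coincides with a centre: full membership in that cluster.
            hit = dists.index(0.0)
            result.append([1.0 if i == hit else 0.0 for i in range(len(centres))])
        else:
            result.append(
                [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
            )
    return result


def update_centres(points, u, m=2.0):
    """Each centre is the mean of all points, weighted by membership ** m."""
    dims = len(points[0])
    centres = []
    for i in range(len(u[0])):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        centres.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ))
    return centres


def fuzzy_c_means(points, centres, m=2.0, iterations=50):
    for _ in range(iterations):
        u = memberships(points, centres, m)
        centres = update_centres(points, u, m)
    return centres, memberships(points, centres, m)


# Hypothetical data and initial centres, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, u = fuzzy_c_means(points, centres=[(0, 0), (10, 10)])
```

Unlike k-means, every document contributes to every centre, with a weight that decreases with its distance from that centre. A document halfway between two centres receives a membership of 0.5 in each.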
55That's All Folks!
- Contacts
- IASC Group: http://iasc.diee.unica.it
- Prof. G. Armano: armano_at_diee.unica.it
- Dott. E. Vargiu: vargiu_at_diee.unica.it