1
An Introduction to Data and Text Mining for
Biology
  • Intelligent Agents and Soft-Computing Group
  • Dept. of Electrical and Electronic Engineering
  • University of Cagliari

2
Outline of the Talk
  • Intended Audience and Goal
  • Data Mining from concepts to applications
  • Text Mining from concepts to applications
  • Conclusions

3
Intended Audience and Goal
4
Intended Audience and Goal
  • Intended Audience
  • Biologists and bioinformaticians
  • Goal
  • Make biologists and, if needed, bioinformaticians
    familiar with relevant concepts and techniques
    related to data and text mining

5
Data Mining: from Concepts to Applications
6
Data Mining
  • What is data mining?
  • Some concepts and techniques for data mining
  • Can data mining be helpful in biology?
  • Some relevant systems, tools, and applications of
    data mining in biology

7
The Context
What is Data Mining?
8
The Context
What is Data Mining?
  • Data Mining is a step in the process of Knowledge
    Discovery in Databases (KDD)
  • [Fayyad et al., 1996] KDD is a nontrivial process
    of identifying valid, novel, potentially useful,
    and ultimately understandable patterns in data
  • Data and patterns are the starting and ending
    points, respectively, in a KDD process, with
    large volumes of data processing, iterative
    testing and analysis in between

Sometimes DM and KDD are used as synonyms
9
The Context
What is Data Mining?
  • KDD (steps)
  • Select the target data (filtering)
  • Preprocess and transform them if necessary
    (encoding)
  • Perform data mining
  • Interpret and assess the discovered structures
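
The four KDD steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record fields (`organism`, `length`), the function names, and the toy "mining" step (a simple summary) are hypothetical and do not come from the slides.

```python
def select(records, organism):
    """Step 1: select the target data (filtering)."""
    return [r for r in records if r["organism"] == organism]

def preprocess(records):
    """Step 2: encode each record as a number (hypothetical: gene length in kb)."""
    return [r["length"] / 1000.0 for r in records]

def mine(values):
    """Step 3: a toy mining step that summarizes the encoded data."""
    return {"mean": sum(values) / len(values), "min": min(values), "max": max(values)}

def interpret(summary, threshold=2.0):
    """Step 4: assess the discovered structure against a user-chosen threshold."""
    return "long genes" if summary["mean"] > threshold else "short genes"

# Hypothetical input records, invented for the example.
records = [
    {"organism": "yeast", "length": 1500},
    {"organism": "yeast", "length": 900},
    {"organism": "human", "length": 27000},
]
summary = mine(preprocess(select(records, "yeast")))
label = interpret(summary)
```

In a real KDD process each step is far richer, and the steps are usually iterated rather than run once.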

10
A definition of Data Mining
What is Data Mining?
  • Data Mining is ...
  • extraction of potentially useful patterns or
    knowledge from (typically large) observational
    data sets
  • What kind of patterns or knowledge?
  • previously unknown
  • non-trivial
  • implicit

11
A definition of Data Mining
What is Data Mining?
  • Goal
  • discover information that summarizes the data in
    novel ways that are understandable and useful to
    the data owner / to the user

12
Data Mining Tasks
What is Data Mining?
  • Descriptive / predictive modeling
  • Give description or summarization
  • Discovering patterns and rules
  • Retrieving similar objects

13
Examples of Data Mining
What is Data Mining?
  • descriptive modeling
  • Identifying groups of people with similar hobbies
  • predictive modeling
  • Are chances of getting cancer higher if you live
    near a power line?
  • give description or summarization
  • Find a characteristic description for people that
    may be interested in buying Italian red wine

14
Examples of Data Mining
What is Data Mining?
  • discovering patterns and rules
  • Recall the beers and diapers story
  • retrieving similar objects
  • Finding similar images in a database

15
An Algorithmic View of Data Mining
What is Data Mining?
  • Generic steps ...
  • 1. Determine the nature and structure of the
    representation to be used
  • 2. Decide the method to be used for quantifying
    and comparing how well different representations
    fit the data (score function)
  • 3. Choose a technique to be used for optimizing
    the score function
  • --continue--

16
An Algorithmic View of Data Mining
What is Data Mining?
  • 4. Decide which transformations on data (somewhat
    related with data warehousing) and what
    principles of data management are required to
    implement the technique efficiently
  • 5. Perform data mining
  • 6. Test results
  • Repeat until ...
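
As one possible instance of the generic steps above, the sketch below uses a straight line as the representation (step 1), the mean squared error as the score function (step 2), and plain gradient descent as the optimization technique (step 3). The data, learning rate, and number of iterations are assumptions chosen for the example.

```python
def predict(params, x):
    """Representation: a line y = a * x + b."""
    a, b = params
    return a * x + b

def score(params, data):
    """Score function: mean squared error between the line and the data."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def optimize(data, lr=0.05, steps=2000):
    """Optimization: gradient descent on the score function."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in data) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy data lying exactly on y = 2x + 1 (invented for the example).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = optimize(data)
```

Testing the result (step 6) would normally be done on data that were not used for fitting.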

17
Some Concepts and Techniques ...
  • Main concerns
  • Relevant concepts
  • Main techniques
  • Representation issues
  • Assessment

18
Main Concerns for Data Mining
Concepts and Techniques
  • The complexity problem
  • Due to the huge amount of data to be processed,
    time (and space) complexity must be kept as low
    as possible
  • Solution
  • Inherit methods from statistical theory, machine
    learning, and pattern recognition provided that
    they fulfill the constraints about time (and
    space) complexity

19
Relevant Concepts for Data Mining
Concepts and Techniques
  • Clustering
  • to partition a data set into subsets (clusters)
  • Classification
  • to place individual items into groups
  • Regression
  • to estimate the relationships between one or more
    response variables Y and some independent
    variables X
  • Prediction
  • to state that an event will occur in the future

20
Main Techniques for Data Mining
Concepts and Techniques
  • Techniques (from ML and PR)
  • Symbolic Rule Induction
  • Decision Tree Learning
  • Artificial Neural Networks (MLP, RANN)
  • Bayesian Networks
  • Bayesian Classifiers
  • Hidden Markov Models
  • K-Means Clustering
  • Fuzzy c-means Clustering

21
Main Techniques for Data Mining
Concepts and Techniques
  • Techniques (from ML and PR)
  • At the end of the talk we will briefly illustrate
    some of the main techniques...

22
Representation Issues (from PR / ML)
What is Data Mining?
  • Modeling (global, summary of a data set)
  • Linear or higher order equations (for regression
    tasks), graphs, tree structures, conditional
    probabilities, ...
  • Related tasks: regression / prediction,
    classification, interpretation, clustering
  • Pattern Finding (local, restricted regions)
  • Recurrent patterns, symbolic rules, prototypes,
    non symbolic descriptions, ...
  • Related tasks: detection of characteristic
    patterns, discriminant patterns, anomalies

23
Assessing Data Mining Techniques
Concepts and Techniques
  • Assessment (from statistical theory, etc.)
  • Distance measures
  • Confusion matrix (accuracy / precision / recall)
  • k-fold cross-validation
  • ... and also user acknowledgement
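
The confusion-matrix measures listed above can be computed as follows for a two-class problem. This is a minimal sketch; the label vectors are invented, and the convention that 1 marks the positive class is an assumption.

```python
def confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of true positives that were retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
```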

24
Can Data Mining be Helpful in Biology?
Can Data Mining Be Helpful in Biology?
  • Candidate tasks
  • Pathway analysis
  • Gene finding
  • Secondary and tertiary structure prediction
  • Docking
  • Microarray experiments
  • ...

25
Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications
  • Pathway analysis
  • VisANT: Integrative Visual Analysis Tool for
    biological networks and pathways [Hu et al.,
    2005]
  • Gene finding
  • Merck Gene Index browser: an extensible data
    integration system for gene finding, gene
    characterization and EST data mining [Eckman et
    al., 1998]

26
Some Relevant Systems, ...
Some Relevant Systems, Tools, and Applications
  • Secondary structure prediction
  • Consensus Data Mining: secondary structure
    prediction by combining Garnier-Osguthorpe-Robson
    (GOR) and Fragment Database Mining (FDM) [Sen et
    al., 2006]
  • Microarray analysis
  • Cleaver: Classification of Expression Array,
    tools to assist in the analysis of microarray
    expression data, http://classify.stanford.edu/

27
Text Mining: from Concepts to Applications
28
Text Mining
  • Text Mining
  • What is text mining?
  • Some concepts and techniques for text mining
  • Can text mining be helpful in biology?
  • Some relevant systems, tools, and applications of
    text mining in biology

29
The Context
What is Text Mining?
  • Text Mining is part of Information Retrieval (IR)
  • [Wikipedia] IR is the science of searching for
    information in documents, searching for documents
    themselves, searching for metadata which describe
    documents, or searching within databases, whether
    relational stand-alone databases or hypertext
    networked databases such as the Internet or World
    Wide Web or intranets, for text, sound, images or
    data

30
A definition of Text Mining
What is Text Mining?
  • Text Mining is ...
  • the discovery of new, previously unknown
    information, by automatically extracting it from
    different text resources
  • one of the major tasks is to link together the
    extracted information to form new facts to be
    explored further
  • input encoding is important

Text Mining is sometimes also referred to as text
data mining
31
A definition of Text Mining
What is Text Mining?
  • Goal
  • deriving high-quality, previously unknown,
    information from text

32
Some Concepts and Techniques ...
  • Main concerns
  • Typical Text Mining tasks (with relevant concepts
    and techniques)
  • Assessment

33
Main Concerns - I
Concepts and Techniques
  • The encoding problem
  • Because the number of possible features is huge
    (each word could in fact be a feature), encoding
    input data is very important for text mining
  • Solution
  • Reduce the number of features by resorting to
    feature selection / extraction (both techniques
    imply the evaluation of some form of entropy)

34
Encoding Input Data
Concepts and Techniques
  • Encoding is very important in text mining
    activities
  • A text document is usually represented by a
    vector of n weighted index terms (bag of words
    approach)
  • Each term of a document vector can be weighted
    by different measures of information content
  • Term frequency
  • TF-IDF (term frequency-inverse document frequency)
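
The bag-of-words encoding with TF-IDF weights can be sketched as below. Several TF-IDF variants exist; this sketch assumes relative term frequency and idf = log(N / df), with whitespace tokenization. The three example documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Encode each document as a vector of TF-IDF weights over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        vectors.append([
            (counts[t] / length) * math.log(n_docs / df[t]) for t in vocab
        ])
    return vocab, vectors

docs = [
    "gene expression gene",
    "protein structure",
    "gene protein interaction",
]
vocab, vectors = tf_idf(docs)
```

With this weighting, a term that occurs in every document gets weight zero, since it cannot help to tell documents apart.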

35
Encoding Input Data
Concepts and Techniques
  • The high dimensionality of the search space
    should be reduced (dimensionality reduction)
  • Feature selection
  • The set of terms is reduced through the selection
    of the most discriminant terms (Document
    frequency, Information Gain, Chi-square,...).
  • Feature extraction
  • The original n-dimensional feature/term space is
    mapped to a new m-dimensional space (m ≪ n),
    through a linear transformation (PCA, LDA,...)
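
Feature selection with the chi-square statistic can be sketched as follows (feature extraction is not covered here). The sketch assumes documents are given as sets of terms and uses the usual 2x2 contingency table between "term present" and "document in the target class"; the documents and class names are invented.

```python
def chi_square(term, docs, labels, target):
    """Chi-square statistic between the presence of a term and a target class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target
        if present and in_class:
            a += 1  # term present, document in class
        elif present:
            b += 1  # term present, document not in class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_terms(docs, labels, target, m):
    """Keep the m terms with the highest chi-square score for the target class."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(vocab, key=lambda t: chi_square(t, docs, labels, target),
                    reverse=True)
    return ranked[:m]

docs = [
    {"gene", "expression"},
    {"gene", "microarray"},
    {"protein", "folding"},
    {"protein", "docking"},
]
labels = ["genomics", "genomics", "structure", "structure"]
selected = select_terms(docs, labels, "genomics", 2)
```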

36
Main Concerns - II
Concepts and Techniques
  • Evaluation metrics
  • Depending on the specific task (and possibly on
    the user's preferences), a different trade-off
    must be found between keeping the number of
    false positives low and keeping the number of
    false negatives low in text categorization tasks
  • Solution
  • Use ROC analysis ...

ROC: Receiver Operating Characteristic
37
Evaluation Metrics
Concepts and Techniques
  • Accuracy
  • Precision
  • Recall
  • Other metrics
  • Specificity
  • True/False positive rate
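
The true and false positive rates listed above are the two axes of a ROC curve. The sketch below computes the ROC points by thresholding classifier scores and estimates the area under the curve with the trapezoidal rule. The scores and labels are invented, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        # TPR equals recall; specificity equals 1 - FPR.
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

Each point on the curve corresponds to one trade-off between false positives and false negatives, so the user can pick the threshold that suits the task.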

38
Text Categorization
Concepts and Techniques
  • The task of assigning a label (topic) to a text
    document.
  • This task is performed through supervised
    classification algorithms.
  • Main supervised classification techniques
  • kNN (k-Nearest Neighbor)
  • NB (Naïve Bayes)
  • SVM (Support Vector Machines)
  • MLP (Multilayer Perceptron) with
    Backpropagation

39
Text Clustering
Concepts and Techniques
  • The task of discovering natural groups in
    datasets, without having any background knowledge
    of characteristics of the data.
  • This task is performed through unsupervised
    classification algorithms
  • Main unsupervised classification algorithms
  • k-Means
  • Fuzzy c-means clustering

40
Entity Relation Modeling
Concepts and Techniques
  • The task of seeking and classifying atomic
    elements in text into predefined categories such
    as the names of persons, organizations,
    locations, etc.
  • This task is performed through three main
    approaches
  • Manually created rule-based systems
  • Fully automatic learning-based systems
  • Mixed approach

41
Assessment
Concepts and Techniques
  • Assessment is usually performed by resorting to
    some sort of ROC analysis, possibly in
    combination with averaging techniques (such as
    k-fold cross validation)
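
The averaging step of k-fold cross-validation can be sketched as below. The `train_fn` and `eval_fn` callbacks are hypothetical placeholders for any classifier and any metric; the sketch does not shuffle or stratify the data, which a real experiment usually would.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_items))
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(items, labels, k, train_fn, eval_fn):
    """Average the evaluation score over k train/test splits."""
    scores = []
    for train, test in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        scores.append(eval_fn(model,
                              [items[i] for i in test],
                              [labels[i] for i in test]))
    return sum(scores) / len(scores)
```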

42
Can Text Mining Be Helpful in Biology?
Can Text Mining Be Helpful in Biology?
  • Candidate Tasks
  • Text in microarray analysis
  • Text-enhanced sequence alignments
  • Information extraction
  • ...

43
Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications
  • Text in microarray analysis
  • MedMiner: an Internet text-mining tool for
    biomedical information [Tanabe et al., 99]
  • Text-enhanced sequence alignments
  • SAWTED: structure assignment with text
    description-enhanced detection of remote
    homologues with automated SWISS-PROT annotation
    comparisons [MacCallum et al., 2000]

44
Some Text Mining Systems, Tools, ...
Some Text Mining Systems, Tools, and Applications
  • Information extraction
  • Constructing biological knowledge bases by
    extracting information from text sources [Craven
    and Kumlien, 99]

45
Technical Issues
46
Some Technical Details...
  • Common issues on classification tasks
  • Technical specifications
  • kNN
  • Naive Bayes
  • SVM
  • MLP
  • k-Means
  • Fuzzy c-Means

47
Common Issues
  • The classification task assigns a score to each
    text document. That score
  • can be an estimate of the a posteriori
    probability (NB, kNN)
  • can be related to a specific aspect of the
    algorithm, such as the geometric distance from
    the separation surface (SVM).
  • It may not be possible to attach a specific
    meaning to the score (ANN).

48
Technical Specifications: kNN
  • The score is assigned using non-parametric
    density estimation of the a posteriori
    probability
  • Determines the volume V that contains the k
    nearest prototypes of each unlabeled document.
  • Estimates the a posteriori probability of each
    text category
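
A minimal sketch of the kNN score: the a posteriori probability of a category is estimated as the fraction of the k nearest prototypes that carry that label. Euclidean distance and the two-dimensional toy prototypes are assumptions made for the example; document vectors would normally be TF-IDF vectors.

```python
import math
from collections import Counter

def knn_posteriors(query, prototypes, labels, k):
    """Estimate P(category | query) as k_c / k over the k nearest prototypes."""
    nearest = sorted(
        (math.dist(query, p), label) for p, label in zip(prototypes, labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}

# Toy labeled prototypes (invented for the example).
prototypes = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["a", "a", "a", "b", "b"]
posteriors = knn_posteriors((4, 4), prototypes, labels, k=3)
```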

49
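Technical Specifications: Naive Bayes
  • The score is an a posteriori probability obtained
    through a simplified application of Bayes'
    theorem.
  • Bayes' theorem, for a category c and a document d:
    P(c | d) = P(d | c) P(c) / P(d)
  • Under the strong (naive) independence assumption
    on the terms t_1, ..., t_n of d, it states
    P(c | d) ∝ P(c) P(t_1 | c) ... P(t_n | c)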
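
A minimal sketch of a Naive Bayes text classifier follows. It assumes the multinomial variant with add-one (Laplace) smoothing and works with log-probabilities to avoid underflow; neither choice is stated in the slides. The returned scores are unnormalized log-posteriors, and the training documents are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, term_counts, vocab

def log_posteriors(tokens, model):
    """Return log P(c) + sum of log P(t | c) for each class c."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(n_c / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores

def classify(tokens, model):
    """Pick the class with the highest (unnormalized) posterior."""
    scores = log_posteriors(tokens, model)
    return max(scores, key=scores.get)

docs = [
    ["gene", "expression", "gene"],
    ["gene", "microarray"],
    ["protein", "folding"],
    ["protein", "docking", "protein"],
]
labels = ["genomics", "genomics", "structure", "structure"]
model = train_nb(docs, labels)
predicted = classify(["gene", "protein", "gene"], model)
```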

50
Technical Specifications: SVM
  • The score is related to the distance of the
    document from a separation hyperplane
  • Defines the optimal separation hyperplane (OSH)
  • OSH is specified by a weighted sum of prototypes
    (support vectors, SV)

51
Technical Specifications: SVM (continued)
  • The score is related to the distance of the
    document from a separation hyperplane
  • using kernel functions, SVM builds the OSH in a
    high dimensional hyperspace defining complex
    separation surfaces in the feature domain.
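
The SVM score can be sketched as a weighted sum over the support vectors, f(x) = sum_i alpha_i y_i K(x_i, x) + b, where K is the kernel function. The sketch below only evaluates this decision function; it does not train anything. The support vectors, weights `alphas`, and bias are hand-picked for a two-point toy problem, and the RBF `gamma` value is an arbitrary assumption.

```python
import math

def linear_kernel(u, v):
    """Plain dot product: the OSH lives in the original feature space."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel: the OSH lives in an implicit high-dimensional space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_score(x, support_vectors, alphas, ys, bias, kernel=linear_kernel):
    """Signed score of x; its sign gives the predicted class."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, ys)) + bias

# Hand-picked toy model: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.25, 0.25]
ys = [1, -1]
bias = 0.0

score_linear = svm_score((2.0, 0.0), support_vectors, alphas, ys, bias)
score_rbf = svm_score((2.0, 0.0), support_vectors, alphas, ys, bias,
                      kernel=rbf_kernel)
```

With the linear kernel, the two support vectors score exactly +1 and -1, that is, they lie on the margins of the separating hyperplane.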

52
Technical Specifications: MLP
  • The score is typically related to the similarity
    of a document to each category
  • MLPs are the most common Artificial Neural
    Networks (ANNs).
  • An MLP is composed of several layers of
    artificial neurons, interconnected in a
    feed-forward way.
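
The feed-forward computation of an MLP can be sketched as below. Training by backpropagation is omitted; the weights are hand-picked (an assumption for illustration) so that a network with one hidden layer computes XOR, a function that no single-layer network can represent. Sigmoid activations are also an assumption.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron applies the sigmoid to a weighted sum of the inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(inputs, layers):
    """Propagate the inputs through the layers in feed-forward order."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hand-picked weights: hidden neurons approximate OR and AND,
# and the output neuron approximates "OR and not AND", i.e. XOR.
xor_net = [
    ([[20.0, 20.0], [20.0, 20.0]], [-10.0, -30.0]),  # hidden layer
    ([[20.0, -20.0]], [-10.0]),                      # output layer
]
outputs = {x: mlp_forward(x, xor_net)[0]
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

In text categorization the output layer would typically have one neuron per category, and the weights would be learned from labeled documents.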

53
Technical Specifications: k-Means
  • Tries to find the natural clustering of
    documents.
  • Splits the feature space into k partitions by
    fixing k random centres
  • Assigns each document to the nearest centre
  • Calculates the mean of each cluster and takes
    these means as the new centres.
  • Repeats the former two steps until the documents
    no longer switch clusters.
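
The k-means steps above can be sketched as follows. The sketch assumes Euclidean distance, picks the initial centres at random among the data points (one common choice), and uses a fixed `seed` and an iteration cap `max_iter` for reproducibility; the toy points are invented.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centres, assignment) after alternating assignment and mean updates."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k random initial centres
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point switched cluster
        assignment = new_assignment
        # Move each centre to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, assignment = k_means(points, 2)
```

On harder data the result depends on the initial centres, so k-means is often run several times with different seeds.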

54
Technical Specifications: Fuzzy c-Means
  • A variant of the k-means algorithm
  • Assigns to each document a continuous degree of
    membership in each cluster
  • Each document can belong to several clusters
  • May converge more quickly than k-means
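
A minimal sketch of fuzzy c-means, assuming the standard update rules: the membership of a point in cluster i is 1 / sum_k (d_i / d_k)^(2 / (m - 1)), and each centre is the mean of the points weighted by membership^m. The fuzzifier `m = 2`, the fixed number of iterations, and the explicit initial centres are assumptions made for the example.

```python
import math

def memberships(point, centres, m=2.0):
    """Degrees of membership of one point in each cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centres]
    zeros = dists.count(0.0)
    if zeros:  # the point coincides with one or more centres
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

def update_centres(points, u, n_clusters, m=2.0):
    """Each centre is the mean of all points, weighted by membership ** m."""
    dims = len(points[0])
    centres = []
    for i in range(n_clusters):
        weights = [u_j[i] ** m for u_j in u]
        total = sum(weights)  # assumed non-zero for this sketch
        centres.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ))
    return centres

def fuzzy_c_means(points, centres, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = [memberships(p, centres, m) for p in points]
        centres = update_centres(points, u, len(centres), m)
    return centres, [memberships(p, centres, m) for p in points]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, u = fuzzy_c_means(points, [(0, 0), (10, 10)])
```

Unlike k-means, every point contributes to every centre, with a weight that decreases with distance.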

55
That's All Folks!
  • Contacts
  • IASC Group: http://iasc.diee.unica.it
  • Prof. G. Armano armano_at_diee.unica.it
  • Dott. E. Vargiu vargiu_at_diee.unica.it