Title: Concept Extraction from Biological Corpora
1. Concept Extraction from Biological Corpora: A Text Mining Approach Scalable in Parallel Architectures
Supervised by Christos Makris, University of Patras, Department of Computer Science and Engineering
2. Motivation
- The literature of biomedical papers is vast.
- New discoveries appear constantly in the biomedical sciences.
- Text mining tools are needed to help the researcher gather the scattered knowledge.
3. Aim of Text Mining Tools
- Text mining tools for biological corpora can have several targets, such as:
  - Extracting gene relations
  - Extracting evolution paths
  - Discovering biomolecule interactions
- We aim at a general approach capable of extracting groups of correlated terms from a collection of documents.
4. Common Text Mining Practices
- Some of the popular text mining techniques are:
  - Rule-based methods
  - Methods depending on knowledge databases
  - Applying statistical measures
  - Applying natural language processing
5. Our Approach
6. Phase 1: Text Retrieval
- Boolean retrieval of biological papers.
- The set of retrieved documents is the outcome of a boolean search in the database.
- The full text of each document is stored in a separate file.
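Boolean retrieval over an inverted index can be sketched as follows; the tiny corpus and the helper names are illustrative assumptions, not the actual BioMed setup:

```python
# Minimal sketch of boolean (AND) retrieval via an inverted index.
# The three toy documents below are assumptions for illustration only.
docs = {
    1: "transcription factors regulate gene expression",
    2: "signaling cascades and transcription factors",
    3: "protein folding dynamics",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Return the documents containing ALL the given terms (boolean AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("transcription", "factors"))  # -> {1, 2}
```

Each matching document's full text would then be written to its own file, as the slide describes.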
7. Phase 2: Linguistic Processing
- Input: full text of the biological papers
- Output: term-by-document matrix
- Stemming is applied (Porter stemmer).
- A stoplist is used to remove common words.
- The TF-IDF metric is used.
- Only terms that occur in a significant number of different documents are kept; less significant terms are discarded.
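The weighting and filtering steps above can be sketched like this; the toy tokenized documents, the `min_df` threshold, and the exact TF-IDF variant are assumptions (stemming and stoplisting are taken as already applied):

```python
import math
from collections import Counter

# Toy tokenized documents (assumed already stemmed and stoplisted).
docs = [
    ["gene", "express", "regul"],
    ["gene", "signal", "cascad"],
    ["signal", "cascad", "regul", "regul"],
]
N = len(docs)
min_df = 2  # keep only terms occurring in at least min_df documents (assumed threshold)

# Document frequency of each term.
df = Counter(t for d in docs for t in set(d))
vocab = sorted(t for t, c in df.items() if c >= min_df)

# Term-by-document matrix with TF-IDF weights: tf * log(N / df).
matrix = [[d.count(t) * math.log(N / df[t]) for d in docs] for t in vocab]
```

Terms below the document-frequency threshold (here, "express") never enter the vocabulary, which is exactly the discarding step the slide describes.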
8. Phase 3: Latent Semantic Indexing
- Based on the singular value decomposition (SVD) of the term-by-document matrix: A = U S V^T.
9. Reasons for Using LSI
- LSI provides the rank-k approximation of the term-by-document matrix.
- It reduces noise in the representation due to synonymy.
- LSI reduces the dimensionality: let the rows be the document vectors; then instead of the original rows we can represent the documents with the rows of the matrix U_k S_k.
- The document vectors then have a fixed dimensionality k.
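A minimal NumPy sketch of this reduction, with documents as rows; the toy matrix and the choice k = 2 are illustrative assumptions:

```python
import numpy as np

# Toy document-by-term matrix (rows = documents); weights are illustrative.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
])

k = 2  # target dimensionality (assumed)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values; each document becomes a row of U_k S_k.
doc_vectors = U[:, :k] * s[:k]   # shape: (num_docs, k)
```

Every document is now a k-dimensional vector regardless of the vocabulary size, which is what makes the later clustering phase cheaper.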
10. Phase 4: Clustering
- Input: the new document vectors
- Output: document clusters
- The intuition behind clustering:
  - The documents carry the semantic structure.
  - Clustering will group semantically similar documents together.
  - Those groups form the answers to different queries in the vector space model.
  - We will later have to recover those queries, i.e. their terms.
11. Phase 5: Concept Extraction
- For each cluster of documents we compute the union of its indexing terms.
- For each term we compute the log-odds formula.
- Terms of a cluster whose score exceeds a threshold θ show a specific preference for that cluster.
- These terms formulate the query and express the core concept of a document cluster.
- Under this assumption the query terms are expected to be correlated.
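The slides do not give the exact log-odds formula, so the sketch below uses a common smoothed variant (log odds of a term inside the cluster minus its log odds outside); the smoothing constant and the example counts are assumptions:

```python
import math

def log_odds(in_freq, in_total, out_freq, out_total, eps=0.5):
    """Log-odds preference of a term for a cluster over the rest of the
    collection. Smoothed with eps; the exact formula in the slides is
    assumed to be of this general shape."""
    p = (in_freq + eps) / (in_total + 2 * eps)      # rate inside the cluster
    q = (out_freq + eps) / (out_total + 2 * eps)    # rate outside the cluster
    return math.log(p / (1 - p)) - math.log(q / (1 - q))

# A term appearing in 8 of 10 cluster docs but only 3 of 60 others
# scores high and would exceed the threshold, joining the cluster's query.
score = log_odds(8, 10, 3, 60)
```

Terms scoring above the threshold θ form the query that expresses the cluster's core concept.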
12. Computational Issues: Linguistic Processing
- Linguistic processing is a major time bottleneck, since every single character must be parsed.
- To cope with this we propose the following parallelization scheme.
13. Computational Issues: Linguistic Processing (figure: parallelization scheme)
14. Computational Issues: Linguistic Processing (figure: parallelization scheme)
15. Computational Issues of LSI
- LSI constitutes a major time bottleneck due to the SVD. To increase capacity:
  - We reduced the number of indexing terms (stemming, IDF filtering).
  - We applied a parallelization scheme for the one-sided Jacobi method that works as follows: each cell of the scheduling table indicates which pair of columns is being orthogonalized, and we notice that diagonal pairs can be orthogonalized simultaneously.
- This gave a speedup of 2.07 on 4 processors.
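The independent-pairs idea can be illustrated with a classic round-robin schedule that produces disjoint column pairs in each round; this is a sketch of the scheduling principle, not necessarily the authors' exact table:

```python
# Round-robin schedule for one-sided Jacobi sweeps: within each round the
# column pairs are disjoint, so they can be orthogonalized in parallel.
def jacobi_rounds(n):
    """Yield n-1 rounds of disjoint column pairs covering all (i, j), i < j."""
    cols = list(range(n))
    for _ in range(n - 1):
        yield [(min(cols[i], cols[n - 1 - i]), max(cols[i], cols[n - 1 - i]))
               for i in range(n // 2)]
        # Rotate all columns except the first (round-robin tournament trick).
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]

rounds = list(jacobi_rounds(4))
# e.g. for 4 columns: [(0, 3), (1, 2)], [(0, 2), (1, 3)], [(0, 1), (2, 3)]
```

Since no column appears twice within a round, the rotations of a round can be handed to different processors, which is the source of the reported speedup.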
16. Computational Issues: Clustering
- Clustering is another bottleneck, both in time and in space.
- Reducing the dimensionality with LSI improves both space and time efficiency.
- The k-windows unsupervised clustering algorithm was applied.
- The algorithm iteratively tries to capture clusters in a number of d-dimensional rectangles (windows), which are then merged based on some criterion.
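A highly simplified sketch of the two window primitives involved (capturing the points inside a d-dimensional box, and testing box overlap as a merge criterion); this is an illustration of the idea, not the published k-windows algorithm:

```python
# Windows are axis-aligned boxes given as (lower_corner, upper_corner) tuples.

def points_in(window, points):
    """Points captured by a d-dimensional window (brute force for clarity;
    the actual system uses range trees / R-trees for this query)."""
    lo, hi = window
    return [p for p in points
            if all(lo[i] <= p[i] <= hi[i] for i in range(len(lo)))]

def overlap(w1, w2):
    """True if two boxes intersect in every dimension (a possible merge test;
    the real merge criterion of the algorithm is assumed, not shown here)."""
    (lo1, hi1), (lo2, hi2) = w1, w2
    return all(lo1[i] <= hi2[i] and lo2[i] <= hi1[i] for i in range(len(lo1)))

# Two overlapping 2-D windows would be candidates for merging into one cluster.
w1 = ((0.0, 0.0), (1.0, 1.0))
w2 = ((0.5, 0.5), (1.5, 1.5))
```

In the actual system these range queries are answered by the spatial data structures of the next slide rather than by brute force.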
17. Computational Issues: Clustering
- For the d-dimensional range operations we used:
  - Range trees
  - R-trees
- To cope with time, we parallelized k-windows.
- To cope with the space requirements of the d-dimensional data structures, we chose R-trees, which show similar time behavior to range trees.
18. Computational Issues: Clustering
- Parallel k-windows:
  - Movement and enlargement of windows are independent procedures, so their parallelization is straightforward.
  - Merging was parallelized by distributing the merge operations for a specific window i to the processors.
  - Two single merge operations for window i can affect only different windows, so they can be executed in parallel.
  - When the operations for window i are finished, we proceed to the next window and follow the same parallelization scheme.
- This gave a speedup of 2.4 on a 4-processor machine.
19. Results
- The input originates from the online journal BioMed Central (www.biomedcentral.com).
- Boolean query: "transcription factors AND signaling cascades"
- Final input: 73 documents, 3.7 MB in total.
20. Resulting Clusters
21. Remarks about the Results
- The previous table demonstrates the effect of dimensionality on the quality of the clusters.
- We noticed that the results remain the same despite increasing the number of dimensions we keep.
- Keeping the dimensionality low is vital, as high dimensionality dramatically increases the cost of clustering.
22. Biological Meaning of the Results
- In the final clusters we distinguish:
  - The yellow cluster, containing documents 3 and 25.
  - These documents refer to osteoarthritis and rheumatoid arthritis respectively, describing procedures for inhibiting the action of interleukins, which are responsible for the deterioration of those diseases.