Title: KDDRG Research Projects
1KDDRG Research Projects
- Prof. Carolina Ruiz
- ruiz_at_cs.wpi.edu
- Department of Computer Science
- Worcester Polytechnic Institute
2Some Current Analytical Data Mining Research
Projects at WPI
- Mining Complex Data: Set and Sequence Mining
  - Systems performance data
  - Sleep data
  - Financial data
  - Web data
- Data Mining for Genetic Analysis
  - Correlating genetic information with diseases
  - Predicting gene expression patterns
- Data Mining for Electronic Commerce
  - Collaborative and content-based filtering
  - Using association rules and using neural networks
3Analyzing Sleep Data
- Purpose
  - Associations between sleep patterns and health/pathology
  - Obtain patterns of the different sleep stages (4 sleep stages + REM + Wake)
- Data set
  - Clinical (sequential)
    - Electro-encephalogram (EEG)
    - Electro-oculogram (EOG)
    - Electro-myogram (EMG)
    - Probe measuring flow of oxygen in blood, etc.
  - Diagnostic (tabular)
    - Questionnaire responses
    - Patients' demographic info
    - Patients' medical history
  (Source: http://www.blsc.com)
- Potential rules
  - (A) Association rules
    - (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy; confidence = 92%, support = 13%
  - (B) Classification rules
    - (snoring = HEAVY) & (AHI > 30/hour) & severe OSA => (Race = Caucasian); confidence = 70%, support = 8%
  - AHI = Apnea Hypopnea Index, OSA = Obstructive Sleep Apnea
- WPI, UMass Medical, BC
4Input Data
- Each instance: tabular + set + sequential attributes

      attr1                      attr2             attr3  attr4                attr5   class
      illnesses                  heart rate        age    oxygen               gender  Epworth
  P1  depression, fatigue                          27                          M       5
  P2  stroke, dementia, fatigue  97,72,67,80,...   73     90,92,96,89,86,...   F       23
  P3  arthritis                  102,99,87,96,...  49     97,100,82,80,70,...  M       14
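
An instance that mixes tabular, set-valued, and sequential attributes could be represented as sketched below. This is only an illustration: the class name `SleepInstance` and its field names are hypothetical, and the sequences hold just the leading values listed for P2 and P3 (no sequence values are given for P1).

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SleepInstance:
    """One patient record; all names here are hypothetical."""
    illnesses: FrozenSet[str]     # set-valued attribute
    heart_rate: Tuple[int, ...]   # sequential attribute (leading values only)
    age: int                      # tabular attribute
    oxygen: Tuple[int, ...]       # sequential attribute (leading values only)
    gender: str                   # tabular attribute
    epworth: int                  # class attribute


p2 = SleepInstance(frozenset({"stroke", "dementia", "fatigue"}),
                   (97, 72, 67, 80), 73, (90, 92, 96, 89, 86), "F", 23)
p3 = SleepInstance(frozenset({"arthritis"}),
                   (102, 99, 87, 96), 49, (97, 100, 82, 80, 70), "M", 14)

# Set attributes support membership tests; sequences keep their time order.
has_fatigue = [p for p in (p2, p3) if "fatigue" in p.illnesses]
min_oxygen = {p.age: min(p.oxygen) for p in (p2, p3)}
print(has_fatigue, min_oxygen)
```

Keeping the three attribute kinds as distinct types makes it explicit which operations (subset tests, sequence matching, plain comparisons) apply to which column.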
5Analyzing Financial Data
- Sequential data: daily stock values
- Normal (tabular/relational) data
  - sector (computers, agricultural, educational, ...), type of government, product releases, companies' awards, ...
- Desired rules
  - If DELL's stock value increases & 1999 < year < 2002 => IBM's stock value decreases
6Events: Financial Data
- Basic events: 16 or so financial templates [Little & Rhodes 78]
- Difficult pattern matching: alignments and time warping
- Sample templates
  - Panic Reversal
  - Head & Shoulders Reversal
  - Rounding Top Reversal
  - Descending Triangle Reversal
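
Time warping is commonly implemented as dynamic time warping (DTW), which aligns two sequences that trace the same shape at different speeds. The sketch below is a textbook DTW distance, not necessarily the matching procedure used in this project; the function name `dtw_distance` and the toy series are made up for illustration.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only (b[j-1] is stretched)
                cost[i][j - 1],      # advance in b only (a[i-1] is stretched)
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


template = [1, 2, 3, 2, 1]            # hypothetical rise-and-fall shape
stretched = [1, 1, 2, 2, 3, 3, 2, 1]  # same shape, traced more slowly
inverted = [3, 2, 1, 2, 3]            # opposite shape

d_same = dtw_distance(template, stretched)
d_diff = dtw_distance(template, inverted)
print(d_same, d_diff)
```

The stretched series aligns to the template at zero cost even though the two have different lengths, which is the property that makes warping attractive for matching price patterns of varying duration.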
7WPI Weka Tool for mining complex
temporal/spatial associations
8Data Mining for Genetic Analysis w/ Profs. Ryder
(BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS,
WPI), and Alvarez (CS, BC)
- SNP analysis
  - discovering correlations between sequence variations and diseases
- Gene expression
  - discovering patterns that cause a gene to be expressed in a particular cell
9Correlating Genetics with Diseases
- Utilize data mining techniques with actual genetic data sampled from research
- Spinal Muscular Atrophy (SMA): inherited disease that results in progressive muscle degeneration and weakness.
10Genomic Data Resources
  Patient Gender  SMA Type (Severity)  SNP Location  C212 (Father / Mother)  AG1-CA (Father / Mother)
  Female          Severe               Y272C         31 / 28 29              102 / 108 112
  Male            Mild                 Y272C         28 29 / 25              108 112 / 114
Wirth, B. et al. Journal of Human Molecular
Genetics
11Our System CAGE
- To predict gene expression based on DNA sequences.
[Figure: CAGE diagram showing Gene 1, Gene 2, and Gene 3 for each of Muscle Cell, Neural Cell, and Seam Cells; legend: On / Off]
12Gene expression Analysis
  PROMOTER  MOTIFS       SEQUENCE                                     CELL TYPE
  PR1       M1 M2 M4 M5  CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT   neural
  PR2       M1 M4 M5     AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA   neural
  PR3       M1 M4        CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA   muscle
  PR4       M1 M2 M5     GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA   neural
  PR5       M1 M4        ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA   muscle
  PR6       M4 M5 M3     GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC   neural
  PR7       M1 M2 M5 M3  TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA   neural
  PR8       M2 M4 M5     ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC  neural
  PR9       M4 M3        ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA   muscle
13Gene Expression
- Transcription of DNA into RNA
[Figure: promoter region ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA with motifs M1, M2, M4 and transcriptional proteins, in a muscle cell]
14PROMOTER(S) and CELL TYPES: promoters PR1 to PR9 with the same motifs, sequences, and cell types as on slide 12, annotated with two rules
- R1: M1, M4, M5 => Neural; supp = 22%, conf = 100%; supp. instances: PR1, PR2
- R2: M2, M4, M5 => Neural; supp = 22%, conf = 100%; supp. instances: PR1, PR8
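
The support and confidence of rules such as R1 and R2 can be checked directly against the nine promoters. The sketch below assumes that the cell types are listed in promoter order (PR1 to PR9) and that support is the fraction of all promoters matching both sides of a rule; the helper name `rule_stats` is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Motif sets and cell types transcribed from the promoter slide
# (cell types assumed to be listed in promoter order).
PROMOTERS: Dict[str, Tuple[FrozenSet[str], str]] = {
    "PR1": (frozenset({"M1", "M2", "M4", "M5"}), "neural"),
    "PR2": (frozenset({"M1", "M4", "M5"}), "neural"),
    "PR3": (frozenset({"M1", "M4"}), "muscle"),
    "PR4": (frozenset({"M1", "M2", "M5"}), "neural"),
    "PR5": (frozenset({"M1", "M4"}), "muscle"),
    "PR6": (frozenset({"M4", "M5", "M3"}), "neural"),
    "PR7": (frozenset({"M1", "M2", "M5", "M3"}), "neural"),
    "PR8": (frozenset({"M2", "M4", "M5"}), "neural"),
    "PR9": (frozenset({"M4", "M3"}), "muscle"),
}


def rule_stats(motifs: Iterable[str], cell_type: str) -> Tuple[float, float, List[str]]:
    """Return (support, confidence, supporting promoters) for motifs => cell_type."""
    antecedent = frozenset(motifs)
    covered = [p for p, (ms, _) in PROMOTERS.items() if antecedent <= ms]
    matching = [p for p in covered if PROMOTERS[p][1] == cell_type]
    support = len(matching) / len(PROMOTERS)
    confidence = len(matching) / len(covered) if covered else 0.0
    return support, confidence, matching


r1 = rule_stats({"M1", "M4", "M5"}, "neural")
r2 = rule_stats({"M2", "M4", "M5"}, "neural")
print(r1)
print(r2)
```

Under these assumptions both rules are supported by 2 of the 9 promoters (about 22%) with 100% confidence, in agreement with the values on the slide.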
15Well-clustered motifs
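  PROMOTER  MOTIFS (in order)  DISTANCES SHOWN BETWEEN MOTIFS
  PR1       M1 M2 M4 M5        240, 150, 100
  PR2       M1 M4 M5           260, 210
  PR3       M1 M4              360
  PR4       M1 M2 M5           100, 350
  PR5       M1 M4              190
  PR6       M4 M5 M3           150, 120
  PR7       M1 M2 M5 M3        210, 100, 110
  PR8       M2 M4 M5           21, 18
  PR9       M4 M3              60
- IR1: M1, M2, M5; σ(M1,M2) = 120.1, μ(M1,M2) = 216.6, cvd(M1,M2) = 0.55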
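
The three numbers attached to IR1 are consistent with cvd being a coefficient of variation of distances, that is, the standard deviation of the distances between M1 and M2 divided by their mean. Reading 120.1 as the standard deviation and 216.6 as the mean is an assumption, but the arithmetic agrees:

```latex
\mathrm{cvd}(M_1, M_2) = \frac{\sigma(M_1, M_2)}{\mu(M_1, M_2)} = \frac{120.1}{216.6} \approx 0.55
```

A small cvd means the two motifs sit at a nearly constant distance from each other across promoters, which is presumably what "well-clustered" refers to.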
16Distance-based Association Rules
Sample distance-based assoc. rule
- Given
  - min-support, min-confidence, and max-cvd thresholds
- Mine
  - all distance-based association rules
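
The mining task stated above can be sketched as a brute-force enumeration. Several assumptions are made here and none is confirmed by the slides: the numbers on the "Well-clustered motifs" slide are read as gaps between consecutive motifs, cell types are taken in promoter order, cvd is computed as population standard deviation over mean across the supporting promoters, and every pair of motifs in a rule must satisfy max-cvd. The names `positions`, `cvd`, and `mine` are hypothetical, and the output is not meant to reproduce the IR1 statistics.

```python
from itertools import combinations
from statistics import mean, pstdev

# (motifs in order, assumed gaps between consecutive motifs, cell type)
PROMOTERS = {
    "PR1": (["M1", "M2", "M4", "M5"], [240, 150, 100], "neural"),
    "PR2": (["M1", "M4", "M5"], [260, 210], "neural"),
    "PR3": (["M1", "M4"], [360], "muscle"),
    "PR4": (["M1", "M2", "M5"], [100, 350], "neural"),
    "PR5": (["M1", "M4"], [190], "muscle"),
    "PR6": (["M4", "M5", "M3"], [150, 120], "neural"),
    "PR7": (["M1", "M2", "M5", "M3"], [210, 100, 110], "neural"),
    "PR8": (["M2", "M4", "M5"], [21, 18], "neural"),
    "PR9": (["M4", "M3"], [60], "muscle"),
}


def positions(motifs, gaps):
    """Offset of each motif from the first one, accumulated from the gaps."""
    offsets, total = {motifs[0]: 0}, 0
    for motif, gap in zip(motifs[1:], gaps):
        total += gap
        offsets[motif] = total
    return offsets


def cvd(values):
    """Coefficient of variation: population std. dev. divided by the mean."""
    return pstdev(values) / mean(values)


def mine(min_support, min_confidence, max_cvd, max_size=3):
    """Enumerate rules `motif set => cell type` meeting all three thresholds."""
    all_motifs = sorted({m for ms, _, _ in PROMOTERS.values() for m in ms})
    cell_types = sorted({c for _, _, c in PROMOTERS.values()})
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(all_motifs, size):
            covered = [p for p, (ms, _, _) in PROMOTERS.items()
                       if set(itemset) <= set(ms)]
            for cell in cell_types:
                hits = [p for p in covered if PROMOTERS[p][2] == cell]
                if not hits:
                    continue
                support = len(hits) / len(PROMOTERS)
                confidence = len(hits) / len(covered)
                offsets = [positions(*PROMOTERS[p][:2]) for p in hits]
                # Largest cvd over all motif pairs in the candidate rule.
                worst = max(cvd([abs(o[a] - o[b]) for o in offsets])
                            for a, b in combinations(itemset, 2))
                if (support >= min_support and confidence >= min_confidence
                        and worst <= max_cvd):
                    rules.append((itemset, cell, support, confidence, worst))
    return rules


rules = mine(min_support=0.2, min_confidence=1.0, max_cvd=0.5)
for rule in rules:
    print(rule)
```

With these thresholds the motif set M1, M4, M5 is kept while M2, M4, M5 is rejected, because under the gap reading the distance between M2 and M4 differs widely between PR1 and PR8. A practical miner would prune candidates Apriori-style instead of enumerating every motif set.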
17Grad. & Undergrad. Students
- Jonathan Rudolph
- Eduardo Paredes
- Iavor N. Trifonov.
- Takeshi Kawato
- Cindy Leung and Sam Holmes.
- John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young.
- Zachary Stoecker-Sylvia, Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB).
- Wendy Kogel, Brooke LeClair, Christopher St. Yves.
- Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB).
- Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB).
- Christopher Cole.
- Michael Ciman and John Gulbrandsen.
- Tara Halwes
- Christopher Martino.
- Matthew Berube.
- Anna Novikov.
- Amy Kao and Dana Rock.
- Ali Benamara.
- Dharmesh Thakkar.
- Senthil K Palanisamy.
- Zachary Stoecker-Sylvia.
- Keith A. Pray.
- Jonathan Freyberger.
- Maged El-Sayed.
- Parameshvyas Laxminarayan.
- Aleksandar Icev.
- Wendy Kogel.
- Michael Sao Pedro.
- Christopher Shoemaker.
- Weiyang Lin.