Title: Data Mining Approaches for Water Quality Protection
1Data Mining Approaches for Water Quality
Protection
Edwin Brands, R. Rajagopal, Lifang Huang, Katie Foreman, Bethany Gast, David Riley
Department of Geography
Center for Health Effects of Environmental Contamination
The University of Iowa, Iowa City, IA
Cooperative State Research, Education, and Extension Service
Grant 2001-51130-11373
2Data Mining Exercise
- Goal: Create a word from the following
76 83 69 68 73
- What do we need to do first?
3Data Mining Exercise
- Recognize the form of the data and translate
76 = L, 83 = S, 69 = E, 68 = D, 73 = I
4Data Mining Exercise
76 83 69 68 73
Re-order these characters
? ? ? ? ?
120 possible orderings (5! = 120)
S L I D E
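The exercise above can be sketched in a few lines of Python, assuming the numbers are ASCII character codes (as the translation to L, S, E, D, I suggests). The `KNOWN_WORDS` set is a hypothetical stand-in for a dictionary and is not part of the original exercise.

```python
from itertools import permutations

codes = [76, 83, 69, 68, 73]

# Cleaning: translate each ASCII code into its character.
letters = [chr(code) for code in codes]

# Processing: generate every ordering of the five letters (5! = 120).
candidates = {"".join(p) for p in permutations(letters)}

# Winnowing: keep only orderings that are recognized words.
# KNOWN_WORDS is a hypothetical stand-in for a dictionary lookup.
KNOWN_WORDS = {"SLIDE", "WATER", "DATA"}
words = sorted(candidates & KNOWN_WORDS)

print(letters)          # the translated characters
print(len(candidates))  # number of distinct orderings
print(words)            # orderings that survive the winnowing step
```

With a fuller dictionary, other anagrams such as IDLES or SIDLE would also survive the filter, so the last step (choosing the right word) still depends on context.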
5Data Mining Steps
- Cleaning (Standardization, Translation)
- Processing (Reorientation, Re-ordering)
- Winnowing (Sifting, Narrowing Down, Choosing the
Right Word)
6Data Mining (our definition)
- Relies on cognition, pattern recognition, statistics, and computer programming in order to sift through large databases in an organized manner and uncover valuable relationships.
- In the context of this project: the process of squeezing the juice out of the SDWA and other datasets.
7Background Water Quality in Agroecosystems
- 3-Year Effort
  - Year 1: SDWA
  - Year 2: CWA/ambient water
  - Year 3: Integration
8SDWA, Overarching Goal
- To protect public health: 250 million consumers of water from 54,000 Community Water Supplies
- Treatment/Filtration
- Monitoring/Reporting: 86 constituents
9SDWA and Data Collection
- Each year, millions of dollars' worth of data are collected for compliance with SDWA
- Total investment (1974-present): > $1 billion
- Data used for
  - making treatment/blending decisions
  - determining compliance
  - adjusting sampling frequency
  - informing consumers about water quality
10SDWA What is Measured?
- Organic Chemicals (e.g. pesticides)
- Inorganic Chemicals (e.g. arsenic)
- Disinfectants/Disinfection Byproducts (e.g. chlorine)
- Microorganisms (e.g. coliform bacteria)
- Radionuclides (e.g. radium)
11SDWA CONSTITUENTS
86 Contaminants
Organic Chemicals (53): Acrylamide; Alachlor; Atrazine; Benzene; Benzo(a)pyrene (PAHs); Carbofuran; Carbon tetrachloride; Chlordane; Chlorobenzene; 2,4-D; Dalapon; 1,2-Dibromo-3-chloropropane; o-Dichlorobenzene; p-Dichlorobenzene; 1,2-Dichloroethane; 1,1-Dichloroethylene; cis-1,2-Dichloroethylene; trans-1,2-Dichloroethylene; Dichloromethane; 1,2-Dichloropropane; Di(2-ethylhexyl) adipate; Di(2-ethylhexyl) phthalate; Dinoseb; Dioxin (2,3,7,8-TCDD); Diquat; Endothall; Endrin; Epichlorohydrin; Ethylbenzene; Ethylene dibromide; Glyphosate; Heptachlor; Heptachlor epoxide; Hexachlorobenzene; Hexachlorocyclopentadiene; Lindane; Methoxychlor; Oxamyl (Vydate); Polychlorinated biphenyls (PCBs); Pentachlorophenol; Picloram; Simazine; Styrene; Tetrachloroethylene; Toluene; Toxaphene; 2,4,5-TP (Silvex); 1,2,4-Trichlorobenzene; 1,1,1-Trichloroethane; 1,1,2-Trichloroethane; Trichloroethylene; Vinyl chloride; Xylenes (total)
Inorganic Chemicals (16): Antimony; Arsenic; Asbestos; Barium; Beryllium; Cadmium; Chromium (total); Copper; Cyanide (as free cyanide); Fluoride; Lead; Mercury (inorganic); Nitrate (measured as Nitrogen); Nitrite (measured as Nitrogen); Selenium; Thallium
Microorganisms (7): Cryptosporidium; Giardia lamblia; Heterotrophic plate count; Legionella; Total Coliforms; Turbidity; Viruses (enteric)
Disinfection Byproducts (4): Bromate; Chlorite; Haloacetic acids (HAA5); Total Trihalomethanes (TTHMs)
Disinfectants (3): Chloramines (as Cl2); Chlorine (as Cl2); Chlorine dioxide (as ClO2)
Radionuclides (4): Alpha particles; Beta particles and photon emitters; Radium 226 and Radium 228; Uranium
12Public Water Supplies in Eastern and Central Iowa
Watersheds
[Map. Legend: Public Water Supply; State Boundary; Iowa River Watershed; Des Moines River Watershed]
13Data Received from CHEEC
14Data Received from CHEEC
15Data Standardization and Transformation
- Problems with original data set
  - Incomplete information (sample IDs missing in some records)
  - Difficult to collapse and generalize (due to file structure)
- Data Standardization
  - Create unique sample ID for records missing IDs
- Database Transformation
  - Transform original database into tabular format
  - Remove redundancy
16Data Standardization
Missing Sample IDs
Replace with unique IDs
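The standardization step can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the field names, the `GEN-` prefix, and the sample values are all hypothetical, and it assumes each record with a missing ID is its own sample (a real dataset might instead share one ID across records from the same supply and date).

```python
from itertools import count

def fill_missing_ids(records, prefix="GEN"):
    """Return copies of the records, giving blank sample IDs a generated unique ID."""
    existing = {r["sample_id"] for r in records if r["sample_id"]}
    counter = count(1)
    filled = []
    for record in records:
        new_record = dict(record)  # leave the original record untouched
        if not new_record["sample_id"]:
            candidate = f"{prefix}-{next(counter):05d}"
            # Skip any generated ID that collides with an existing one.
            while candidate in existing:
                candidate = f"{prefix}-{next(counter):05d}"
            new_record["sample_id"] = candidate
            existing.add(candidate)
        filled.append(new_record)
    return filled

# Hypothetical records; an empty string or None marks a missing ID.
records = [
    {"sample_id": "A100", "constituent": "Nitrate", "value": 4.2},
    {"sample_id": "", "constituent": "Atrazine", "value": 0.5},
    {"sample_id": None, "constituent": "Arsenic", "value": 2.0},
]

for row in fill_missing_ids(records):
    print(row["sample_id"], row["constituent"])
```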
17Original File Structure
18Transformation Steps/Rules
- Create a list of all measured constituents
- Transpose them to read horizontally rather than vertically
- Eliminate all duplicate analyses (same day, place, and sample)
- Average all replicate analyses (same day, place, different sample)
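A minimal sketch of the four rules above, using only the Python standard library. The record layout `(place, date, sample_id, constituent, value)` and the sample values are assumptions made for illustration; this is not the original code mentioned later in the deck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical one-record-per-analysis data:
# (place, date, sample_id, constituent, value)
records = [
    ("PWS-1", "2001-05-01", "S1", "Nitrate", 4.0),
    ("PWS-1", "2001-05-01", "S1", "Nitrate", 4.0),   # duplicate analysis
    ("PWS-1", "2001-05-01", "S2", "Nitrate", 6.0),   # replicate (different sample)
    ("PWS-1", "2001-05-01", "S1", "Atrazine", 0.5),
    ("PWS-2", "2001-05-02", "S3", "Nitrate", 1.0),
]

def to_tabular(records):
    # Rule 1: list all measured constituents (these become the columns).
    constituents = sorted({r[3] for r in records})

    # Rule 3: drop duplicates, keeping one value per
    # (place, date, sample, constituent).
    unique = {}
    for place, date, sample, constituent, value in records:
        unique.setdefault((place, date, sample, constituent), value)

    # Rule 4: collect replicate values per (place, date, constituent).
    grouped = defaultdict(list)
    for (place, date, _sample, constituent), value in unique.items():
        grouped[(place, date, constituent)].append(value)

    # Rule 2: transpose so each row is one place/date and
    # each column is one constituent, averaging replicates.
    rows = defaultdict(dict)
    for (place, date, constituent), values in grouped.items():
        rows[(place, date)][constituent] = mean(values)

    table = [
        [place, date] + [cells.get(c) for c in constituents]  # None = not measured
        for (place, date), cells in sorted(rows.items())
    ]
    return ["place", "date"] + constituents, table

header, table = to_tabular(records)
print(header)
for row in table:
    print(row)
```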
19File Transformation: 4 Potential Methods
- Manual
- Excel Pivot Table
- Access Cross Tab Query
- Original Computer Code (Lifang)
20Transformed File (Flat File, or Tabular Format)
22File Structure Comparison
- New Structure
  - Tabular
  - Each row = 1 sample; each column = 1 constituent
  - Easy to collapse or generalize
  - Large amount of blank space
- Old Structure
  - Relational
  - Each row = 1 record
  - Not easy to collapse or generalize
  - Easy to retrieve individual items
  - Efficient storage
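The trade-off in this comparison can be made concrete with a toy example. The sample IDs and values below are hypothetical; the sketch only shows why the tabular form carries blank cells while the one-record-per-analysis form does not.

```python
# Hypothetical tabular data: one row per sample, one column per constituent.
header = ["sample_id", "Nitrate", "Atrazine", "Arsenic"]
table = [
    ["S1", 4.0, 0.5, None],
    ["S2", 6.0, None, None],
    ["S3", None, None, 2.0],
]

# Tabular form holds a cell for every sample/constituent pair, measured or not.
blank_cells = sum(cell is None for row in table for cell in row[1:])
total_cells = sum(len(row) - 1 for row in table)

# The old structure holds one record per analysis actually performed.
relational = [
    (row[0], constituent, value)
    for row in table
    for constituent, value in zip(header[1:], row[1:])
    if value is not None
]

# Collapsing is easy in tabular form: a column is already one constituent.
nitrate_values = [row[1] for row in table if row[1] is not None]

print(blank_cells, "blank of", total_cells, "cells in the tabular form")
print(len(relational), "records in the one-record-per-analysis form")
print("max nitrate:", max(nitrate_values))
```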
23What do we want to know?
- In the context of SDWA, protection of public health is the main goal, so:
  - Which SDWA contaminants are of public health concern?
  - Which SDWA contaminants occur, and in what concentrations?
24Narrowing the Database
- Create a summary table (min/max, mean/median, percentiles)
- Merge summary table with SDWA requirements
- Keep only SDWA contaminants
- Delete all non-occurring contaminants
- Delete contaminants whose maximum found value is less than 1/5th of the SDWA standard
- Generate a final table of contaminants
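The narrowing rules above can be sketched as a simple filter. All constituent values and standards below are illustrative placeholders, not project data or regulatory figures. The sketch also assumes that "non-occurring" means a maximum of zero, and it omits percentiles for brevity.

```python
from statistics import mean, median

# Hypothetical measured values per constituent (same units as the standards).
measurements = {
    "Nitrate": [0.0, 4.0, 6.0, 11.0],
    "Atrazine": [0.0, 0.1, 0.2],
    "Toluene": [0.0, 0.0, 0.0],
    "Sodium": [10.0, 25.0],   # measured, but given no standard in this sketch
}

# Illustrative standards table; real values should come from the regulation.
standards = {"Nitrate": 10.0, "Atrazine": 3.0, "Toluene": 1.0}

def summarize(values):
    """One row of the summary table."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "median": median(values)}

def narrow(measurements, standards, fraction=0.2):
    """Apply the narrowing rules and return the final table."""
    final = {}
    for name, values in measurements.items():
        summary = summarize(values)
        if name not in standards:        # keep only contaminants with a standard
            continue
        if summary["max"] <= 0:          # delete non-occurring contaminants
            continue
        if summary["max"] < fraction * standards[name]:  # max below 1/5 of standard
            continue
        final[name] = {**summary, "standard": standards[name]}
    return final

final_table = narrow(measurements, standards)
for name, row in final_table.items():
    print(name, row)
```

The `fraction` parameter makes the 1/5th cutoff explicit, so the effect of a different rule can be checked by changing one value.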
25Summary Table
26Narrowing the Database
27Final Table
28Results from 18 Supplies
Average number of contaminants: 7
29SDWA CONSTITUENTS
30Conclusion/Question
- Only about 3.3% (7 out of 210) of contaminants are found in significant concentrations in our 18 supplies.
- If similar findings hold for the state of Iowa and/or the entire U.S., what does this say about SDWA monitoring requirements?
31Limitations
- Small number of supplies
- Similar types of supplies: would such an approach be useful with a homogeneous group of systems, or would additional processing steps be necessary?
- Different data mining rules or procedures might produce different outcomes
32Summary
- Cleaning
- Processing
- Winnowing