Title: The Ethics of Large-Scale Web Data Analysis (Webmetrics)
The Ethics of Large-Scale Web Data Analysis (Webmetrics)
- Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
- Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS)
Information Studies
Contents
- What is webmetrics?
- Context: online access to personal information
- Researchers' use of personal information
- Confidentiality and anonymity
- Resource issues
- What ethical considerations apply to collecting and analysing web data on a large scale from unaware web publishers?
1. What is webmetrics?
- Large-scale analysis of web-based data
- Collecting and quantitatively analysing online information
- Objective is not to find information about individuals but to identify trends
- Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, ...
Example
- VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006)
- Node size proportional to outdegree; 76 nodes
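A rough illustration of this kind of analysis: the sketch below uses the networkx and matplotlib libraries with made-up party sites (an assumption for illustration only; VOSON was the actual tool) to draw a small hyperlink network with node size proportional to outdegree.

    # Minimal sketch: a directed hyperlink network with node size
    # proportional to outdegree. Sites and links are hypothetical.
    import networkx as nx
    import matplotlib.pyplot as plt

    links = [("party-a.org", "party-b.org"),
             ("party-a.org", "party-c.org"),
             ("party-b.org", "party-c.org"),
             ("party-c.org", "party-a.org")]

    g = nx.DiGraph(links)                                      # directed hyperlink graph
    sizes = [300 * (g.out_degree(n) + 1) for n in g.nodes()]   # node size ~ outdegree
    nx.draw(g, with_labels=True, node_size=sizes, font_size=8)
    plt.show()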
Example: Links between EU universities
[Figure: country-level link network (Austria, Switzerland, Belgium, Germany, France, Spain, NL, UK, Norway, Italy, Poland, Finland, Sweden), with geopolitically connected countries grouped]
- AltaVista link searches
- Normalised linking; smallest countries removed
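The slide does not say how the linking was normalised; one common approach (an assumption here, not necessarily the one used in the study) is to divide raw inter-country link counts by the sizes of the two countries' webs, so that large countries do not dominate. A minimal sketch with made-up numbers:

    # Hypothetical raw link counts and web sizes; the normalisation shown
    # (count divided by the product of the two countries' page counts) is
    # only one possible choice.
    raw_links = {("UK", "Germany"): 5200,
                 ("UK", "Norway"): 640,
                 ("Germany", "Norway"): 410}
    pages = {"UK": 2_000_000, "Germany": 1_500_000, "Norway": 200_000}

    normalised = {pair: count / (pages[pair[0]] * pages[pair[1]])
                  for pair, count in raw_links.items()}
    for pair, value in sorted(normalised.items(), key=lambda x: -x[1]):
        print(pair, f"{value:.2e}")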
Example: Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
- Blogs, social network sites, and personal web sites contain information that is:
  - Private and protected (invisible to researchers)
  - Intentionally public
  - Publicly private [1] (intended for friends but allowed to be public)
  - Unintentionally public (public but believed by the owner to be private)

1. Lange (2007)
Accessing public information
- Commercial search engines
- Web crawlers
- Internet Archive (includes deleted info)
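As one concrete example of the Internet Archive point above, the Wayback Machine offers a public availability API. The sketch below (the page URL is hypothetical) checks whether an archived copy of a page exists; such a copy may survive even after the live page has been deleted.

    # Minimal sketch: query the Internet Archive's Wayback Machine
    # availability API for a (hypothetical) page.
    import json
    import urllib.parse
    import urllib.request

    page = "http://www.example.org/old-blog/"
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode({"url": page})
    with urllib.request.urlopen(api) as response:
        data = json.load(response)

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        print("Archived copy:", snapshot["url"])  # may exist even if the live page is gone
    else:
        print("No archived copy found")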
Who is using dataveillance?
- Dataveillance [1]: downloading or otherwise gathering data on internet users in order to influence their behaviour
- Google can use email, searching, blogging, and social network activities to target advertising (and may report to the US government)
- Amazon can use past activities to target adverts or improve its web site

1. Zimmer (2008)
3. Researchers' use of personal information
- Key issue for large-scale research: data from/about unaware people is used without their approval, and possibly for purposes that they might disagree with
- Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. documents
- Traditionally, documents can be researched without approval, but people can't
- Even harsh criticism is fair practice (e.g., a book review/analysis)
- Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy? Natural vs. normative
- A situation is naturally private [1] if a reasonable person would expect privacy
- A situation is normatively private [1] if a reasonable person would expect others to protect their privacy
- Non-secure web pages/data are typically naturally private
- Accessing them is not normally an invasion of privacy, even if undesired by page owners and with negative consequences

1. Moor (2004)
4. Confidentiality and anonymity
- When should anonymity be granted to research subjects (page owners)?
  - When a possibly undesired label is attached (e.g., hate group, terrorist)
  - When undesired groups might benefit (e.g., a league table of hate groups)
  - When publicly private individuals are singled out (e.g., a detailed analysis of an average blogger)
- Should data be anonymised, as for Census data used for research?
5. Resource issues
- Accessing a web page uses the owner's server time/bandwidth
- Crawling a web site can use a lot of the owner's server time/bandwidth
- May incur charges or loss of service quality
Robots.txt protocol
- This file lists the pages/folders in a web site that may not be crawled
- It does not restrict crawling speed
- It should be obeyed in research (see the sketch below)
- Most individual users are probably unaware of this and so don't use its protection
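A minimal sketch of obeying robots.txt, using Python's standard urllib.robotparser module (the site, crawler name, and path are hypothetical):

    # Minimal sketch: check robots.txt before fetching a page.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.ac.uk/robots.txt")  # hypothetical site
    rp.read()  # download and parse the robots.txt file

    url = "https://www.example.ac.uk/staff/private-report.html"
    if rp.can_fetch("MyResearchCrawler", url):
        print("Allowed to crawl:", url)
    else:
        print("robots.txt disallows crawling:", url)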
Crawling speed
- Web crawlers should not run so fast that they cause service problems
- Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
- Use judgement to decide how quickly to crawl and how long to pause between requests (see the sketch below)
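A minimal sketch of pausing between requests (the URLs and pause length are hypothetical; the appropriate pause is a judgement call based on the server's likely capacity):

    # Minimal sketch: pause between requests so a crawl does not overload
    # the target server.
    import time
    import urllib.request

    urls = ["https://www.example.ac.uk/",
            "https://www.example.ac.uk/research/",
            "https://www.example.ac.uk/contact/"]

    PAUSE_SECONDS = 5  # use longer pauses for small or resource-limited servers

    for url in urls:
        with urllib.request.urlopen(url) as response:
            html = response.read()
        print(url, len(html), "bytes")
        time.sleep(PAUSE_SECONDS)  # be polite: wait before the next request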
How many pages to crawl?
- Crawling too many pages puts unnecessary strain on the server being crawled
- Use judgement to decide the minimum number of pages/crawl depth that is enough (see the sketch below)
- Use search engine queries as a substitute, if possible
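A minimal sketch of capping a crawl by page count and link depth (the limits and the fetch_links helper are hypothetical placeholders):

    # Minimal sketch: breadth-first crawl bounded by page count and depth,
    # so no more of the server's resources are used than needed.
    from collections import deque

    MAX_PAGES = 200   # stop after this many pages
    MAX_DEPTH = 2     # do not follow links deeper than this

    def crawl(start_url, fetch_links):
        """fetch_links(url) -> list of linked URLs (placeholder helper)."""
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue and len(seen) < MAX_PAGES:
            url, depth = queue.popleft()
            if depth >= MAX_DEPTH:
                continue
            for link in fetch_links(url):
                if link not in seen and len(seen) < MAX_PAGES:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen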
Automatic search engine searches
- Research can piggyback on the crawling of commercial search engines
- No resource implications for site owners
- Uses search engine Application Programming Interfaces (APIs)
- Search engines specify the maximum number of searches per day (see the sketch below)
- Results are limited to the imperfect web crawling/coverage of the search engine's crawlers
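A minimal sketch of staying within a daily query quota; the endpoint, key, and parameters are hypothetical stand-ins, not any real search engine's API:

    # Minimal sketch: rate-limited queries against a (hypothetical)
    # search engine API, respecting a daily maximum.
    import time
    import urllib.parse
    import urllib.request

    API_ENDPOINT = "https://api.searchengine.example/search"  # hypothetical
    API_KEY = "YOUR-KEY"                                       # hypothetical
    MAX_QUERIES_PER_DAY = 1000                                 # set by the engine

    def run_queries(queries):
        results = []
        for i, query in enumerate(queries):
            if i >= MAX_QUERIES_PER_DAY:
                break  # respect the daily limit; resume another day
            url = API_ENDPOINT + "?" + urllib.parse.urlencode(
                {"q": query, "key": API_KEY})
            with urllib.request.urlopen(url) as response:
                results.append(response.read())
            time.sleep(1)  # spread queries out rather than bursting
        return results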
Summary
- Researchers need to be aware of potential issues when doing large-scale data analysis research
- Judgement is called for on all issues
- Research does not normally need participant permission
- Be sensitive to the impact of findings and any need for anonymity
References
- Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from http://jcmc.indiana.edu/vol13/issue1/lange.html
- Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.
- Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.