Title: The Ethics of Large-Scale Web Data Analysis (Webmetrics)
The Ethics of Large-Scale Web Data Analysis (Webmetrics)
- Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
- Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS)
Information Studies
Contents
- What is webmetrics?
- Context: online access to personal information
- Researchers' use of personal information
- Confidentiality and anonymity
- Resource issues
- What ethical considerations apply to collecting and analysing web data on a large scale from unaware web publishers?
1. What is webmetrics?
- Large-scale analysis of web-based data
- Collecting and quantitatively analysing online information
- Objective is not to find information about individuals but to identify trends
- Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, ...
Example
- VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006)
- Node size proportional to outdegree; 76 nodes
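A rough illustration of this kind of analysis: the sketch below uses the networkx and matplotlib libraries with made-up party sites (an assumption for illustration only; VOSON was the actual tool) to draw a small hyperlink network with node size proportional to outdegree.

    # Minimal sketch: a directed hyperlink network with node size
    # proportional to outdegree. Sites and links are hypothetical.
    import networkx as nx
    import matplotlib.pyplot as plt

    links = [("party-a.org", "party-b.org"),
             ("party-a.org", "party-c.org"),
             ("party-b.org", "party-c.org"),
             ("party-c.org", "party-a.org")]

    g = nx.DiGraph(links)                                      # directed hyperlink graph
    sizes = [300 * (g.out_degree(n) + 1) for n in g.nodes()]   # node size ~ outdegree
    nx.draw(g, with_labels=True, node_size=sizes, font_size=8)
    plt.show()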
Example: Links between EU universities
[Figure: country-level link network (Austria, Switzerland, Belgium, Germany, France, Spain, NL, UK, Norway, Italy, Poland, Finland, Sweden), with geopolitically connected countries grouped]
- AltaVista link searches
- Normalised linking; smallest countries removed
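The slide does not say how the linking was normalised; one common approach (an assumption here, not necessarily the one used in the study) is to divide raw inter-country link counts by the sizes of the two countries' webs, so that large countries do not dominate. A minimal sketch with made-up numbers:

    # Hypothetical raw link counts and web sizes; the normalisation shown
    # (count divided by the product of the two countries' page counts) is
    # only one possible choice.
    raw_links = {("UK", "Germany"): 5200,
                 ("UK", "Norway"): 640,
                 ("Germany", "Norway"): 410}
    pages = {"UK": 2_000_000, "Germany": 1_500_000, "Norway": 200_000}

    normalised = {pair: count / (pages[pair[0]] * pages[pair[1]])
                  for pair, count in raw_links.items()}
    for pair, value in sorted(normalised.items(), key=lambda x: -x[1]):
        print(pair, f"{value:.2e}")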
Example: Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
- Blogs, social network sites, and personal web sites contain information that is:
  - Private and protected (invisible to researchers)
  - Intentionally public
  - Publicly private [1] (intended for friends but allowed to be public)
  - Unintentionally public (public but believed by the owner to be private)

1. Lange (2007)
Accessing public information
- Commercial search engines
- Web crawlers
- Internet Archive (includes deleted info)
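As one concrete example of the Internet Archive point above, the Wayback Machine offers a public availability API. The sketch below (the page URL is hypothetical) checks whether an archived copy of a page exists; such a copy may survive even after the live page has been deleted.

    # Minimal sketch: query the Internet Archive's Wayback Machine
    # availability API for a (hypothetical) page.
    import json
    import urllib.parse
    import urllib.request

    page = "http://www.example.org/old-blog/"
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode({"url": page})
    with urllib.request.urlopen(api) as response:
        data = json.load(response)

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        print("Archived copy:", snapshot["url"])  # may exist even if the live page is gone
    else:
        print("No archived copy found")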
Who is using dataveillance?
- Dataveillance [1]: downloading or otherwise gathering data on internet users in order to influence their behaviour
- Google can use email, searching, blogging, and social network activities to target advertising (and may report to the US government)
- Amazon can use past activities to target adverts or improve its web site

1. Zimmer (2008)
3. Researchers' use of personal information
- Key issue for large-scale research: data from/about unaware people is used without their approval, and possibly for purposes that they might disagree with
- Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. documents
- Traditionally, documents can be researched without approval, but people can't
- Even harsh criticism is fair practice (e.g., a book review/analysis)
- Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy? Natural vs. normative
- A situation is naturally private [1] if a reasonable person would expect privacy
- A situation is normatively private [1] if a reasonable person would expect others to protect their privacy
- Non-secure web pages/data are typically naturally private
- Accessing them is not normally an invasion of privacy, even if undesired by page owners and with negative consequences

1. Moor (2004)
4. Confidentiality and anonymity
- When should anonymity be granted to research subjects (page owners)?
  - When a possibly undesired label is attached (e.g., hate group, terrorist)
  - When undesired groups might benefit (e.g., a league table of hate groups)
  - When publicly private individuals are singled out (e.g., a detailed analysis of an average blogger)
- Should data be anonymised, as for Census data used for research?
5. Resource issues
- Accessing a web page uses the owner's server time/bandwidth
- Crawling a web site can use a lot of the owner's server time/bandwidth
- May incur charges or loss of service quality
Robots.txt protocol
- This file lists the pages/folders in a web site that may not be crawled
- It does not restrict crawling speed
- It should be obeyed in research (see the sketch below)
- Most individual users are probably unaware of this and so don't use its protection
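A minimal sketch of obeying robots.txt, using Python's standard urllib.robotparser module (the site, crawler name, and path are hypothetical):

    # Minimal sketch: check robots.txt before fetching a page.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.ac.uk/robots.txt")  # hypothetical site
    rp.read()  # download and parse the robots.txt file

    url = "https://www.example.ac.uk/staff/private-report.html"
    if rp.can_fetch("MyResearchCrawler", url):
        print("Allowed to crawl:", url)
    else:
        print("robots.txt disallows crawling:", url)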
Crawling speed
- Web crawlers should not run so fast that they cause service problems
- Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
- Use judgement to decide how quickly to crawl and how long to pause between requests (see the sketch below)
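A minimal sketch of pausing between requests (the URLs and pause length are hypothetical; the appropriate pause is a judgement call based on the server's likely capacity):

    # Minimal sketch: pause between requests so a crawl does not overload
    # the target server.
    import time
    import urllib.request

    urls = ["https://www.example.ac.uk/",
            "https://www.example.ac.uk/research/",
            "https://www.example.ac.uk/contact/"]

    PAUSE_SECONDS = 5  # use longer pauses for small or resource-limited servers

    for url in urls:
        with urllib.request.urlopen(url) as response:
            html = response.read()
        print(url, len(html), "bytes")
        time.sleep(PAUSE_SECONDS)  # be polite: wait before the next request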
How many pages to crawl?
- Crawling too many pages puts unnecessary strain on the server being crawled
- Use judgement to decide the minimum number of pages/crawl depth that is enough (see the sketch below)
- Use search engine queries as a substitute, if possible
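A minimal sketch of capping a crawl by page count and link depth (the limits and the fetch_links helper are hypothetical placeholders):

    # Minimal sketch: breadth-first crawl bounded by page count and depth,
    # so no more of the server's resources are used than needed.
    from collections import deque

    MAX_PAGES = 200   # stop after this many pages
    MAX_DEPTH = 2     # do not follow links deeper than this

    def crawl(start_url, fetch_links):
        """fetch_links(url) -> list of linked URLs (placeholder helper)."""
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue and len(seen) < MAX_PAGES:
            url, depth = queue.popleft()
            if depth >= MAX_DEPTH:
                continue
            for link in fetch_links(url):
                if link not in seen and len(seen) < MAX_PAGES:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen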
Automatic search engine searches
- Research can piggyback on the crawling of commercial search engines
- No resource implications for site owners
- Uses search engine Application Programming Interfaces (APIs)
- Search engines specify the maximum number of searches per day (see the sketch below)
- Results are limited to the imperfect web crawling/coverage of the search engine's crawlers
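A minimal sketch of staying within a daily query quota; the endpoint, key, and parameters are hypothetical stand-ins, not any real search engine's API:

    # Minimal sketch: rate-limited queries against a (hypothetical)
    # search engine API, respecting a daily maximum.
    import time
    import urllib.parse
    import urllib.request

    API_ENDPOINT = "https://api.searchengine.example/search"  # hypothetical
    API_KEY = "YOUR-KEY"                                       # hypothetical
    MAX_QUERIES_PER_DAY = 1000                                 # set by the engine

    def run_queries(queries):
        results = []
        for i, query in enumerate(queries):
            if i >= MAX_QUERIES_PER_DAY:
                break  # respect the daily limit; resume another day
            url = API_ENDPOINT + "?" + urllib.parse.urlencode(
                {"q": query, "key": API_KEY})
            with urllib.request.urlopen(url) as response:
                results.append(response.read())
            time.sleep(1)  # spread queries out rather than bursting
        return results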
Summary
- Researchers need to be aware of potential issues when doing large-scale data analysis research
- Judgement is called for on all issues
- Research does not normally need participant permission
- Be sensitive to the impact of findings and any need for anonymity
References
- Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from http://jcmc.indiana.edu/vol13/issue1/lange.html
- Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.
- Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.