Title: Balancing Evidence-Based Librarianship and Protecting Patron Privacy through the Bibliomining Process
1Balancing Evidence-Based Librarianship and
Protecting Patron Privacy through the
Bibliomining Process
Protecting Patron Privacy
Evidence-Based Librarianship
Bibliomining Process
- Scott Nicholson
- Assistant Professor
- Syracuse University School of Information Studies
2Overview
- Evidence-Based Librarianship
- Information Seeking in Context
- Threats to Patron Privacy
- USA PATRIOT Act
- Bibliomining Process
- Data Warehousing
3Evidence-Based Librarianship
4Evidence-Based Librarianship Idea
- Basic Idea
- Use data-based evidence to make decisions
5Don't we already do that?
- Do we?
- Many times, the decision is
- Made on beliefs about user needs based on tacit knowledge
- Evaluated afterwards
- EBL is focused on using data-based evidence first to make the decision
- Then evaluate as well
6Evidence-Based Librarianship conceptualization
- EBL is based upon Evidence-Based Medicine (EBM)
- EBM concept: Combine study results (evidence) to make a decision.
- EBL translation: Use library research projects (best available evidence) and combine results to better understand a phenomenon
7EBL Levels of Evidence
- Reviews of rigorous studies
- Reviews of less rigorous studies
- Randomized controlled trials
- Controlled comparison studies
- Cohort studies
- Descriptive surveys
- Case studies
- Decision analysis
- Qualitative research (focus groups, etc.)
From Eldredge, J. (2000). Evidence-based librarianship: An overview. Bulletin of the Medical Library Association, 88(4), 289-302. Table 2.
8EBL Conceptualization
Patrons
Patron representations
Generalization from study
Generalizations across studies Evidence
Problems? Bias in elicitation, different elicitation methods, "not in my library"
9Problems with Traditional EBL
- Small base of library research that is similar enough to be used together
- EBM has a much larger base of controlled randomized research
- Different elicitation methods yield different types of generalizations
- Argument: "Those studies don't apply to _my_ patrons."
10Thinking about Context
- Information Seeking in Context framework
- Importance of capturing context and not just seeking behavior
- Dervin, Kuhlthau, Taylor
11Contexts for Library Decision-Making
- Taylor's Information Use Environments (IUE)
- Assumption: People within a group are more similar than people outside the group (in some way)
- Similar tasks, settings, constraints
- Similar value of what is useful
12Groups of Users
- Professions
- Entrepreneurs
- Special interest groups
- Special socioeconomic groups
- Institution-specific groupings
- Department / Major
- Educational Level
- Groups of users = Communities
13Model combining ISiC and IR Evaluation
Järvelin, K., & Ingwersen, P. (2004). "Information seeking research needs extension toward tasks and technology". Information Research, 10(1), paper 212.
14One User, Several Communities
Profession
Special Interest Group
Socioeconomic Group
But what about Privacy?
15Protecting User Privacy
16Privacy Issues
- Librarians want to provide a safe space for users
- US government threats to library data
- Library Awareness Program (70s-80s, tracking reading patterns)
- Patchwork of state laws covering data
- USA PATRIOT and USA PATRIOT II
17Problem with PATRIOT
- Agents can request data on one person
- They aren't picky - they can get much more data
- But, of course, they won't use it
- Result: Anything the library keeps could be taken.
18First Response: Destroy!
- Just don't keep it!
- Libraries were discussed in the news as deleting records to protect patrons
- Why? "We're not using it"
- Permanent solution
- Causes more problems
19Ways of protecting user privacy
- Ignorance: data audit
- Backups?
- Deletion
- Potential problems
- Encoding
- Being selective in archival
- Data Warehouse
20The Data Warehouse
21The Data Warehouse
- Data Warehouse: One place for data
- Collected from Different Systems
- Cleaned and Joined
- Outside of Functional System
- Place for Reporting and Analysis
- But also support for normal library operations
- Creates awareness of library data
- Key Concept: Operational vs. Archival data
22Data Produced through Library Services
- Data about Materials
- Data about Users
- Data about Services
23Work
24Representation of Work
25Bibliographic surrogate
- Information taken from the work
- Title, Author, Abstract (?), Publisher
- Information created to describe the work (metadata)
- Subject headings, Classification, Type, Keywords
- Information about access to the work
- Call Number, General Location, Form(s)
26Bibliographic Surrogate
27(No Transcript)
28User
29Data about Users
- Personal information (remove)
- Demographic Surrogate
- Information Use Environment
- Collected during application or enrollment process
- External, from other sources
- Matching Zip code to demographic database
- Match Proxy ID to student database or company database
- Assumptions
- IP address -> physical location (on or off-campus)
30Protecting the Privacy of Users
- Demographic Surrogate
- Contexts without Identification
- HIPAA
- Upcoming in JASIST
- 18 Items in 4 groups
- Direct identifiers and identifiers that connect into other databases
- Address and location information
- Dates related to an individual
- Contact information
31Methods for dealing with Personally Identifiable Information
- Use codes and IDs for matching, then discard them
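The match-and-discard idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the patron table, the circulation records, and every field name (`patron_id`, `department`, `level`) are hypothetical and do not come from any particular library system.

```python
# Hypothetical operational data; all field names are illustrative assumptions.
patrons = {
    "P001": {"name": "A. Reader", "department": "History", "level": "Graduate"},
    "P002": {"name": "B. Borrower", "department": "Biology", "level": "Undergraduate"},
}
circulation = [
    {"patron_id": "P001", "call_number": "D16 .B87", "days_out": 21},
    {"patron_id": "P002", "call_number": "QH308 .C3", "days_out": 7},
]

# Only these non-identifying fields are copied into the warehouse.
SURROGATE_FIELDS = ("department", "level")


def to_warehouse_row(record, patron_table):
    """Match on the patron ID, copy surrogate fields, then discard the ID."""
    patron = patron_table[record["patron_id"]]
    # Keep everything from the transaction except the identifier itself.
    row = {key: value for key, value in record.items() if key != "patron_id"}
    for field in SURROGATE_FIELDS:
        row[field] = patron[field]
    return row


warehouse_rows = [to_warehouse_row(r, patrons) for r in circulation]
```

The ID is used once, to look up the demographic fields, and never reaches the archival copy.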
32Coding and not discarding
- Use when some component of ID is important
- Example: IP addresses
- Useful to know when two records came from the same address
- Extract important info
- Recode into new variable
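One possible way to code an IP address without keeping it is sketched below. The campus network range, the secret key, and the function name `recode_ip` are assumptions made for illustration. The keyed hash is a swapped-in technique, not something named on the slide; it lets two records be recognized as coming from the same address without storing the address.

```python
import hashlib
import hmac
import ipaddress

# Assumed values: a documentation-reserved address block stands in for the
# campus network, and the key would be held privately by the library.
CAMPUS_NETWORK = ipaddress.ip_network("192.0.2.0/24")
SECRET_KEY = b"replace-with-a-locally-held-secret"


def recode_ip(ip_text):
    """Extract the useful parts of an IP address and drop the address itself."""
    address = ipaddress.ip_address(ip_text)
    # Extract important info: was the user on or off campus?
    location = "on-campus" if address in CAMPUS_NETWORK else "off-campus"
    # Recode into a new variable: a keyed hash that is stable for the same
    # address but does not reveal it to someone without the key.
    digest = hmac.new(SECRET_KEY, ip_text.encode("ascii"), hashlib.sha256)
    return {"location": location, "ip_token": digest.hexdigest()[:12]}
```

A token like this is a pseudonym rather than full anonymization: anyone holding the key could recompute it, so the key needs the same protection as the original logs.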
33Methods for dealing with Personally Identifiable
Information
- Use for matching and discard
34Dealing with categories
Make sure that combinations of categories don't identify an individual.
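A rough check for this can be sketched by counting how many records share each combination of categories. The threshold of 5, the field names, and the sample rows are assumptions chosen for illustration; an appropriate minimum group size is a local policy decision.

```python
from collections import Counter


def small_groups(rows, fields, minimum=5):
    """Return category combinations shared by fewer than `minimum` rows."""
    counts = Counter(tuple(row[field] for field in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < minimum}


# Hypothetical demographic surrogates.
rows = (
    [{"department": "History", "level": "Undergraduate"}] * 6
    + [{"department": "History", "level": "Graduate"}] * 5
    + [{"department": "Naval Science", "level": "Faculty"}]
)

# Any combination listed here may point to a single person and should be
# merged into a broader category before archiving.
risky = small_groups(rows, ("department", "level"), minimum=5)
```

Here the lone Naval Science faculty member is flagged, while the two larger History groups pass.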
35Demographic Surrogate
36Enter the Library: Connecting Users to Information
- Different methods of connection (based upon material)
- Searching
- Circulation
- In-House Use
- Reference
- Interlibrary Loan
37Baseline for Library Services
- Time (length of time when appropriate)
- Date
- Location
- Method
- Physical
- Digital
- Staff involved
- Concurrent with other resources/services
38Differences in Services
- Searching
- Path of search, success of search
- Reference
- Content of transaction, Path of referral
- ILL
- Cost, speed, pattern of use
- All have same baseline, but different additional
fields
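The idea of a shared baseline with service-specific additions could be modeled as follows. This is a sketch only: the class names, field names, and types are assumptions, and a real warehouse schema would more likely be defined in database tables.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ServiceEvent:
    """Baseline fields recorded for every library service."""
    timestamp: datetime                        # date and time
    location: str
    method: str                                # "physical" or "digital"
    staff_involved: bool
    duration_minutes: Optional[float] = None   # length of time, when appropriate
    concurrent_with: tuple = ()                # other resources/services in use


@dataclass
class SearchEvent(ServiceEvent):
    """Searching adds the path and success of the search."""
    search_path: str = ""
    successful: Optional[bool] = None


@dataclass
class ILLEvent(ServiceEvent):
    """Interlibrary loan adds cost and speed."""
    cost: float = 0.0
    days_to_fill: Optional[int] = None


loan = ILLEvent(datetime(2004, 10, 1, 14, 30), "Main", "digital", True, cost=12.5)
```

Because every event type inherits the same baseline, reports that only need time, location, and method can treat all services alike.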
39Library Services
- Services: Searching, Circulation/Use, Reference, Outreach/Training, ILL/Request
- Fields: Time, Date, Location, Format, Staff Involved, Cost, Concurrent use
- Demographic Surrogate
40Data-Based Differences: Practitioners and Researchers
- Problems in cooperation
- Different purposes to research
- Practitioners
- Specific to own library operations
- Generalizability not as important
- Researchers
- Look across multiple library operations
- Different applications of research
- Data Warehouse can serve both needs
41Library Operations
- Library Operations: Library personnel, Selection/Acquisition, Cataloging, Staffing
- Library Services: Searching, Circulation/Use, Reference, Outreach/Training, ILL/Request
- Fields: Time, Date, Location, Format, Staff Involved, Concurrent use
- Demographic Surrogate
42Library Operations
- Library Operations: Library personnel, Selection/Acquisition, Cataloging, Staffing
- Library Services: Searching, Circulation/Use, Reference, Outreach/Training, ILL/Request
- Fields: Time, Date, Location, Format, Staff Involved, Concurrent use
- Bibliometric Data: Social networks, Citations / Links, Disciplines, Affiliations
- Demographic Surrogate
43Library Operations
- Library Operations: Library personnel, Selection/Acquisition, Cataloging, Staffing
- Bibliometric Data: Social networks, Citations / Links, Disciplines, Affiliations
- Library Services: Searching, Circulation/Use, Reference, Outreach/Training, ILL/Request
- Fields: Time, Date, Location, Format, Staff Involved, Concurrent use
- E-Resource Activity: From vendors (currently: aggregates; desired: individual items); proxy server
- Demographic Surrogate
44Dealing with Textual data
- Digital Reference transactions
- Easy to deal with the metadata
- Hard to deal with the text
- Manual cleaning of PII
- Similar problem with deidentification of medical records
- Natural Language Processing research
45Finding the Connections
- Data mining is about patterns
- Patterns can come from links between works
- Connecting the data sources allows for more links
between works
46Links used for Bibliometrics
(Diagram: Citations link Works; each Work also connects to Authors, Collections (journal, package, vendor, etc.), and Subjects. Caption: Patterns from Creation and Publication.)
47Links used for Data Mining
(Diagram: Works and Authors connect to Demographic surrogates, which can represent ANY context. Caption: Patterns from user selection.)
48Anonymization
(Diagram: Works and Authors connect to Demographic surrogates, which can represent ANY context.)
49Data sources for Bibliomining
(Diagram: Bibliometrics + Data Mining = Bibliomining. Works connect through Citations to Authors, Collections, and Subjects, and to Demographic surrogates.)
50Library Data Warehouse
Data Warehouse: Cleaned, Archived, Anonymized data kept separate from the operational systems
51Traditional method of applying concepts
- Data Warehouse: Bibliographic Surrogate, Library Service, Demographic Surrogate
- Bibliometrics: Add Citations, Author Affiliation, Connections
- Library Operations: Add Publishers/Vendors, Plans, Staff
- Practitioner Data Mart (Internal Validity)
- Researcher Data Mart (External Validity)
52Current method of applying concepts
- Data Warehouse
- Practitioner Data Mart: Reports, Models, Tools
- Results: Improve local services; occasionally external
- Researcher Data Mart: Reports, Models, Tools
- Results: Generalized scholarship; sometimes local
53Open Efforts Project
- Data Warehouse
- Standards and Schema
- Practitioner Data Mart: Reports, Models, Tools
- Researcher Data Mart: Reports, Models, Tools
- Infrastructure
- Results: Librarians benefit directly from researchers; researchers get data and gain a better understanding of librarian needs
- Results: Improve local services; occasionally external
- Results: Generalized scholarship; sometimes local
54You've collected the data
55Dealing with Data
- Traditional Methods of Analysis
- Online Analytical Processing
- Visualization
- Data Mining
- Rolling it all into the Bibliomining Process
56Traditional Analysis
- Aggregates and Averages
- Individual Reports
- Who has large late fees?
- What books haven't circulated in years?
- What databases get the heaviest usage?
- Stable, dependable reports
- Useful for comparing over time
- Great as a baseline
57Penn Library Data Farm
- Web-based front end for ad-hoc reporting
- More data going in from additional sources
- Project is moving from exploratory to central in reporting
- Library staff producing more quantitative reports
- Trickle-down and grass-roots effects
58(No Transcript)
59Humanities
Archeology
Patterns? Bibliometric laws predict that there
are.
60The Problem with Aggregates
- Topics are set ahead of time
- Time-consuming to ask new questions
- Aggregates and averages mask differences in groups
- Example
- "On average, our service is rated very good."
- The reality
- 10 people rate the service Excellent (5 out of 5)
- 5 people rate the service Good (3 out of 5)
- 5 people rate the service Poor (1 out of 5)
- Average: 3.5
- Much more useful to understand the breakdown
- More importantly, what variables correspond with each group?
- Time of day? Demographics? Other services used?
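The arithmetic in this example is easy to reproduce. The short sketch below uses only the 20 ratings given on the slide and shows the single average next to the breakdown it hides.

```python
from collections import Counter
from statistics import mean

# The 20 ratings from the slide: ten 5s, five 3s, five 1s.
ratings = [5] * 10 + [3] * 5 + [1] * 5

average = mean(ratings)        # (50 + 15 + 5) / 20 = 3.5
breakdown = Counter(ratings)   # how many people gave each score

for score in sorted(breakdown, reverse=True):
    print(f"{score} out of 5: {breakdown[score]} people")
print(f"Average: {average}")
```

The average of 3.5 gives no hint that a quarter of the respondents rated the service Poor; the counts do.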
61On-Line Analytical Processing (OLAP)
- OLAP: Reports on demand
- Excel Pivot Tables
- Interactive headings and entries
- Designed to be used by managers
- Allow for easy ad-hoc reporting
- Demo: Normative Data Project
- Sirsi/Dynix
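The core of a pivot-table report is a count of records for each pair of heading values. A minimal sketch, assuming hypothetical circulation rows with `patron_class` and `lc_class` fields (neither name comes from the Normative Data Project):

```python
from collections import defaultdict


def pivot_count(rows, row_field, col_field):
    """Count rows for every (row heading, column heading) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_field]][row[col_field]] += 1
    # Convert to plain dictionaries for easier printing and comparison.
    return {heading: dict(columns) for heading, columns in table.items()}


# Hypothetical anonymized circulation records.
rows = [
    {"patron_class": "Undergraduate", "lc_class": "Q"},
    {"patron_class": "Undergraduate", "lc_class": "P"},
    {"patron_class": "Faculty", "lc_class": "Q"},
    {"patron_class": "Undergraduate", "lc_class": "Q"},
]

report = pivot_count(rows, "patron_class", "lc_class")
```

Swapping the two field arguments turns the table on its side, which is the kind of ad-hoc re-cutting an OLAP tool offers interactively.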
62Normative Data Project - Circulation
63(No Transcript)
64(No Transcript)
65OLAP Concept
66Data mining
- History of data mining
- Extraction of patterns from large data sets
- Requires the same metadata about each record in
the dataset - Uses statistical, visual, and AI techniques
- And, more importantly, people.
- Different types of data mining
- Directed vs. Undirected
- Descriptive vs. Predictive
67Data Mining Goal
- Locate patterns that are
- Novel
- Meaningful
- Actionable
- Many patterns will be
- Trivial (if freshman, then undergrad)
- Not meaningful (Odd birth month -> Late)
- Not actionable (anti-redlining laws)
- To sort requires a domain expert
68Data Mining Concept
OLAP
DM
69Case Study - Library
- Penn Library Data Farm
- 10,000 circulation records from Fall 2004
- Items that had been returned
- Patron classification, circulation information
70(No Transcript)
71- Data Cleaning
- Dropped non-LC
- Took only first letter
- Started with first 2
- Too much
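The cleaning step described here could look something like the sketch below. The regular expression is a rough heuristic of my own for spotting Library of Congress call numbers (one to three letters followed by a digit), not the rule used in the case study, and the sample call numbers are made up.

```python
import re

# Heuristic assumption: an LC call number starts with 1-3 letters and a digit.
LC_PATTERN = re.compile(r"^([A-Z]{1,3})\s*\d")


def lc_first_letter(call_number):
    """Return the first letter of an LC class, or None for non-LC numbers."""
    match = LC_PATTERN.match(call_number.strip().upper())
    if match is None:
        return None            # dropped: not an LC call number
    return match.group(1)[0]   # keep only the first letter of the class


samples = ["QA76.9 .D343", "813.54 SMI", " pn1995 .A1", "VIDEO 1234"]
cleaned = [lc_first_letter(s) for s in samples]
```

Keeping one letter leaves about twenty broad classes; as the slide notes, the first two letters produced too many categories to work with.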
72(No Transcript)
73(No Transcript)
74More about Prediction
- Assign a value or category
- Requires training data from the past
- Result may be rules or formulas
- Important: Correlation is not causation.
- People with sun stroke are sunburned.
- Does sun stroke cause sunburn?
- Requires domain expert to confirm
75Considering Patterns
Question asked: What are the best predictors of circulation length?
Clementine says: If renewal count is 0 or 1, the average circulation is 15 days. If renewal count is higher than 1, the average circulation is 60 days.
What about this rule?
76Considering Patterns
- 1. If item_type in "audio" "bound jrnl" "music" "non-circ" "reference" "reserve" "special" "video"
- Ave 2.812
- 2. If item_type in "bestseller" "book/seria" "twoweek"
- Ave 28.044
- then if LCName in "MEDICINE" "MUSIC" "NAVAL SCIENCE" "SCIENCE"
- -> 14.645
- else -> 33.888
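Written as code, the rule set on this slide might read as follows. The nesting is my reading of the slide layout (the LCName test is taken to apply only inside rule 2), and the function and constant names are hypothetical; the item types, subject names, and averages are the ones shown.

```python
SHORT_LOAN_TYPES = {
    "audio", "bound jrnl", "music", "non-circ",
    "reference", "reserve", "special", "video",
}
LONG_LOAN_TYPES = {"bestseller", "book/seria", "twoweek"}
SHORTER_SUBJECTS = {"MEDICINE", "MUSIC", "NAVAL SCIENCE", "SCIENCE"}


def predicted_circulation_days(item_type, lc_name):
    """Return the average circulation length (in days) for the matching rule."""
    if item_type in SHORT_LOAN_TYPES:
        return 2.812                      # rule 1
    if item_type in LONG_LOAN_TYPES:      # rule 2, overall average 28.044
        if lc_name in SHORTER_SUBJECTS:
            return 14.645
        return 33.888
    return None                           # item type not covered by the rules
```

Seeing the rules this way makes the domain expert's question concrete: the first branch mostly restates loan policy for item types, while the subject split inside the second branch may be the genuinely novel pattern.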
77Case of Prediction
- Virginia Tech: Paul Metz and John Cosgriff
- Gathered a data source with both bibliometric and library use data to determine which journals to keep.
- Received one or more individual or departmental votes
- Were profiled on CARL Reveal by five or more individuals
- Were borrowed twenty or more times on ILL
- Contained ten or more publications by Virginia Tech authors
- Were cited fifty or more times by Virginia Tech authors
- Were reshelved fifty or more times
- Think about the data for each criterion. Where did it come from?
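The six criteria can be expressed as thresholds against one combined record per journal. In this sketch the thresholds are the ones listed above, while the field names and the sample journal are hypothetical. The slide does not say how many criteria a journal had to meet, so the function only reports which ones are satisfied.

```python
# Minimum value for each criterion, taken from the list above.
CRITERIA = {
    "votes": 1,              # individual or departmental votes
    "reveal_profiles": 5,    # individuals profiling it on CARL Reveal
    "ill_borrows": 20,       # times borrowed on ILL
    "vt_publications": 10,   # publications by Virginia Tech authors
    "vt_citations": 50,      # citations by Virginia Tech authors
    "reshelvings": 50,       # times reshelved
}


def criteria_met(journal):
    """List the criteria a journal record satisfies; missing data counts as 0."""
    return [
        name for name, threshold in CRITERIA.items()
        if journal.get(name, 0) >= threshold
    ]


# Hypothetical journal record combining data from several sources.
journal = {
    "votes": 0, "reveal_profiles": 6, "ill_borrows": 3,
    "vt_publications": 12, "vt_citations": 49, "reshelvings": 50,
}
met = criteria_met(journal)
```

Each key draws on a different system (surveys, alerting profiles, ILL, citation data, shelving counts), which is why the sources had to be joined before the decision could be made.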
78Prediction Uses
- Use data from the past to predict staffing needs for the future
- Use past examples to determine when to intervene with late material
- Predict what a user needs based upon
- Searching behavior
- Works examined
- Predict when a digital reference question could be handled by a match to a database
79The Bibliomining Process
- Data collection
- Data cleaning
- Asking questions
- Analysis (data mining, bibliometrics)
- Presenting patterns
- Asking new questions
80EBL Conceptualization Multiple Library
Research Project
Patrons
Data warehouse of patron representations
- Warehouse holds evidence for decision-making
- Many analysis options
- Maintain library identity
- Libraries can benefit from consortial research
81Importance of standard-creation projects (COUNTER, DREW)
- Standards for library records make multi-consortia data warehouses a possibility.
- Allows for generalizable evidence for EBL
- Supports the creation and testing of theories (patterns over multiple settings)
- Entices researchers, who can create models and tools
82Power of Bibliomining in Consortia
- Library consortia using the same systems
- Different systems need bridge programs
- Data warehouse of information from multiple libraries
- Consortia can make much stronger decisions
- Takes competitive advantage away from e-publishers
- Researchers create generalizable results
- Services can apply research to own library
83Conclusions
Bibliomining is the combination of data
warehousing, data mining, bibliometrics,
statistics, and reporting tools used to extract
patterns of behavior-based artifacts from library
systems.
- Bibliomining = Data Mining + Bibliometrics
84Goals of Bibliomining
- Improved decision-making through better understanding of
- Patterns in Resource Creation
- Patron Behavior
- Library Staff Behavior
- Behavior of outside organizations
- Can provide justification for
- Library management policies and decisions
- Acquisitions and ILL source selection
- Collection development decisions
- Use of library services (funding bodies)
85Concerns with Bibliomining
- May get no novel, actionable patterns
- Threatening to domain experts
- Must keep them involved
- Time-consuming startup
- Ensure they have input regarding patterns
- Model may go out of date with changing circumstances
- Monitoring procedures to detect when modeling variables change considerably
86But
- Bibliomining doesn't provide the whole story
- What people didn't do
- Who didn't visit the library
- What was useful
- It provides a strong baseline
- Gives you models of communities
87People to Involve
- Institutional Review Board (IRB)
- Legal counsel
- Ensures you are following state laws for library data
- Library administration / Board
- Patrons
- If there are policies, follow them
- If there are not, create them
88For More Information
- Bibliomining.com
- Discussion list
- Bibliography
- OCLC WebJunction
- Learning Center
- Introduction to Bibliomining
- Free!
89Striking a Balance
- A well-designed data warehouse strikes the balance between
- Protecting Privacy
- and
- Maintaining a Data-Based History
- For Evidence-Based Librarianship
90Remember: If we delete our data-based history, then none of this is possible.
91Thank you for your attention!