Title: Lynn Silipigni Connaway, Ph.D.
1Beyond Data Mining: Delivering the Next
Generation of Services from Library Data
- Lynn Silipigni Connaway, Ph.D.
- Senior Research Scientist
- OCLC
- Timothy J. Dickey, Ph.D.
- Post-Doctoral Researcher
- OCLC
2WorldCat as an Aggregate Collection
- Data Mining and Analysis of WorldCat
- "affords high-level perspective on historical
patterns, suggests future trends, and supplies
useful intelligence with which to inform decision
making."
- Lavoie, B. F., Connaway, L. S., & O'Neill, E. T.
(2007). Mapping WorldCat's digital landscape.
Library Resources & Technical Services, 51,
106-115, at 107.
3WorldCat July 2008
- Manifestations (records): 108,828,533
- Works: 84,096,107
- Total holdings: 1,292,763,300
- Digital Items: 3,182,550
- Institutions: 69,000
- Physical Items: 1.2 billion
4Global Origins of WorldCat Materials
- Germany 10%
- Rest of World 27%
- Unknown 17%
- France 4%
- Canada 3%
- UK 8%
- US 28%
5Global Origins of WorldCat Materials
- Materials w/non-US origins: 57.9 million (55%)
- Top 5: Germany 10.0 million, UK 8.8 million, France 4.2 million, Netherlands 2.9 million, Canada 2.9 million
- Content Languages: 478; 49% of WC non-English
- Top 5 non-English: German 12 million, French 6.1 million, Spanish 3.5 million, Dutch 2.6 million, Japanese 2.4 million
- Non-English Metadata Language: 28 million (66 languages)
- Top 5: German 11 million, French 1.8 million, Dutch 5.0 million, Finnish 0.7 million, Swedish 1.9 million
6WorldCat as a Decision-Making Resource
- Collection management
- Cooperative collection development
- Comparative collection analysis
- Collection assessment
- Mass digitization
- Off-site storage
- Preservation
7WorldCat as a Decision-Making Resource
- Services
- Virtual reference
- Recommender services
- Social networking
- Systems
- Precision
8WorldCat as a Decision-Making Resource
- Three Areas of Data Mining Research
- OCLC WorldMap
- Audience Level
- Publisher Name Server
9OCLC WorldMap
10OCLC WorldMap™ Objectives
- Geographically represent WorldCat data
- Titles published in each country
- Holdings for titles published in each country
- Languages represented for titles published in
each country
11OCLC WorldMap™ Objectives
- Geographically represent data from UNESCO, ARL,
and NCES for each country
- Number of:
- Libraries
- Library volumes
- Certified/degreed librarians
- Registered library users
- Library expenditures
- Cultural heritage institutions (museums and
archives)
- Publishers
12OCLC WorldMap™ Objectives
- Research prototype
- Support OCLC data mining research
- Visually display data for review and analysis
- Internal use
- Sales and marketing
- External use
- Library collection assessment and comparison
- Data may be processed AT A GLANCE
- Complement the AAU/ARL Global Resources Network
project
- Project of the Council on Library and Information
Resources (CLIR)
22- http://pubserv.oclc.org:12223/WorldMap/
23OCLC Audience Level
24Audience Level Rationale and Objectives
- Holdings represent selection decisions by
librarians, which implies more than 1
billion individual selection decisions in the
WorldCat holdings file
- Selections serve the interests of a library's
target community
- Associate community (audience level) with library
profiles
- e.g., ARL, non-ARL academic, public,
K-12 school
- Thus we can infer a material's audience level from
holdings patterns, which in turn can support
- Collection management
- Readers' advisory services
- Reference services
- Information retrieval
29Example Computation: Build Community
Library symbol  Library name                     Library type  Weight
OHI             State Library of Ohio            Other         x
OCO             Columbus Metropolitan Library    Public        0.33
CDC             Cedarville University            Academic      0.67
LIM             Lima Public Library              Public        0.33
OUN             Ohio University                  Research      1.00
OSD             SEO Automation Consortium        Other         x
BGU             Bowling Green State University   Academic      0.67
MIA             Miami University                 Academic      0.67
AKR             University of Akron              Academic      0.67
BGF             Firelands College                Academic      0.67
CIN             University of Cincinnati         Research      1.00
TOL             University of Toledo             Academic      0.67
KSU             Kent State University            Research      1.00
HIR             Hiram College                    Academic      0.67
YNG             Youngstown State University      Academic      0.67
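The weights in this table can be combined into a single score. The following is a minimal sketch, assuming the audience level of an item is the mean weight of its usable holdings, with "Other" libraries (weight x) left out. The slides do not state the formula, so the averaging step and the names `TYPE_WEIGHTS` and `audience_level` are assumptions made for illustration.

```python
# Library-type weights from the table; "Other" has no weight (x) and is skipped.
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}

# (library symbol, library type) for each holding in the example community.
HOLDINGS = [
    ("OHI", "Other"), ("OCO", "Public"), ("CDC", "Academic"),
    ("LIM", "Public"), ("OUN", "Research"), ("OSD", "Other"),
    ("BGU", "Academic"), ("MIA", "Academic"), ("AKR", "Academic"),
    ("BGF", "Academic"), ("CIN", "Research"), ("TOL", "Academic"),
    ("KSU", "Research"), ("HIR", "Academic"), ("YNG", "Academic"),
]


def audience_level(holdings):
    """Return (usable_count, mean_weight); mean_weight is None if nothing is usable."""
    weights = [TYPE_WEIGHTS[lib_type] for _, lib_type in holdings
               if lib_type in TYPE_WEIGHTS]
    if not weights:
        return 0, None
    return len(weights), sum(weights) / len(weights)


usable, level = audience_level(HOLDINGS)
print(f"total={len(HOLDINGS)} usable={usable} level={level:.4f}")
```

Under this assumption, 13 of the 15 holdings are usable and the community scores about 0.69, between the Academic and Research weights.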
30FRBRizing Audience Level Results
- Calculate Audience Level for each Manifestation
- Aggregate weighted holdings for Work
OCLC Number  Total Holdings  Usable Holdings  Manifestation Audience Level
15504400     147             114              0.783825
29613712     172             117              0.769453
40393191     207             136              0.789426
62762763     190             124              0.758274
81016224     1               0                x
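The "aggregate weighted holdings for Work" step might look like the sketch below. It assumes the five records are manifestations of one work and that the work-level score is the average of the manifestation scores weighted by usable holdings. Neither assumption is confirmed by the slides, and `work_audience_level` is a hypothetical name.

```python
# (OCLC number, total holdings, usable holdings, manifestation audience level)
MANIFESTATIONS = [
    (15504400, 147, 114, 0.783825),
    (29613712, 172, 117, 0.769453),
    (40393191, 207, 136, 0.789426),
    (62762763, 190, 124, 0.758274),
    (81016224, 1, 0, None),  # no usable holdings, so no level ("x" in the table)
]


def work_audience_level(manifestations):
    """Average manifestation levels, weighting each by its usable holdings."""
    scored = [(usable, level) for _, _, usable, level in manifestations
              if usable > 0 and level is not None]
    total_usable = sum(usable for usable, _ in scored)
    if total_usable == 0:
        return None
    return sum(usable * level for usable, level in scored) / total_usable


work_level = work_audience_level(MANIFESTATIONS)
print(f"work audience level: {work_level:.4f}")
```

Weighting by usable holdings means the record with no usable holdings contributes nothing, and heavily held manifestations pull the work-level score toward their own value.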
32Evaluating the OCLC Audience Level
- Random sample of 30 Zoology books, all audience
levels
- Human subjects
- Ranked books in increasing order of difficulty
- Strong statistical correlation between human
subjects' ranking and programmatic ranking
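One common way to compare two rankings of the same items is Spearman's rank correlation. The slides do not say which statistic was used, so the sketch below is an illustration of the general idea only, and the ranks in it are invented rather than taken from the study.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length of at least 2")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks for five books (1 = easiest); not data from the study.
human_ranks = [1, 2, 3, 4, 5]
program_ranks = [2, 1, 3, 4, 5]
rho = spearman_rho(human_ranks, program_ranks)
print(f"rho = {rho:.2f}")
```

A value near 1 means the two rankings largely agree, a value near -1 means one is roughly the reverse of the other.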
33Evaluating the OCLC Audience Level
34- http://audiencelevel.oclc.org/
35OCLC Publisher Name Server
36Publisher Name Server Research Objectives
- Resolve for data mining and quality of WorldCat
- ISBN prefixes to publisher name
- Variant publisher names to a preferred form
- Complement Collection Analysis Service
- Librarians
- Publishers
- Capture and profile attributes of individual
publishers
- Location(s)
- Language(s) of materials published
- Genre(s)/format(s)
- Dominant subject domain(s)
- Parent company and subsidiaries
37Publisher Name Server Methodology
- Programmatically cluster publisher records
using ISBN prefixes
- Data clustering (The Free Dictionary)
- "The science of extracting useful information
from large data sets or databases"
- Classification of similar objects into different
groups
- Partitioning of a data set into
subsets (clusters)
- Data in each subset (ideally) share some common
trait
- Hand parse the entities and resolve ISBN prefixes
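The clustering step can be sketched as grouping publisher name strings by ISBN prefix. This is a minimal illustration, assuming the ISBNs are already hyphenated so that the first two elements (group and registrant) form the prefix. The sample ISBNs are placeholders rather than real numbers, and the function names are hypothetical. Picking the most frequent string as a candidate preferred form is also an assumption; the slides describe hand parsing for that step.

```python
from collections import Counter, defaultdict


def isbn_prefix(hyphenated_isbn10):
    """Return 'group-registrant' from a hyphenated ISBN-10, e.g. '0-19'."""
    parts = hyphenated_isbn10.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 hyphen-separated elements: {hyphenated_isbn10!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Map each ISBN prefix to a Counter of the publisher strings seen with it."""
    clusters = defaultdict(Counter)
    for isbn, publisher_string in records:
        clusters[isbn_prefix(isbn)][publisher_string] += 1
    return clusters


# Placeholder records: (hyphenated ISBN, publisher string as transcribed).
RECORDS = [
    ("0-19-111111-1", "Oxford University Press"),
    ("0-19-222222-2", "Oxford Univ. Press"),
    ("0-19-333333-3", "Oxford University Press"),
    ("0-13-444444-4", "Prentice-Hall"),
    ("0-13-555555-5", "Prentice Hall, Inc."),
    ("0-13-666666-6", "Prentice-Hall"),
]

clusters = cluster_by_prefix(RECORDS)
# Candidate preferred form per prefix: the most frequent variant (to be reviewed by hand).
candidates = {prefix: names.most_common(1)[0][0] for prefix, names in clusters.items()}
print(candidates)
```

Each cluster then holds the variant forms for one prefix, which is the material a cataloger would review when resolving variants to a preferred form.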
38Publisher Name Server Database
- 1750 publishing entities
- Relational database, preserving hierarchical
relationships
- Begins with high-occurrence entities
- Top 10 lists (USA, UK, Canada, Australia,
Germany, France, Netherlands, Japan, Italy,
China, Russia, Spain, Finland, Australia, Taiwan,
New Zealand)
- Top 10 university presses
- Mergers and acquisitions, last 8 years
39Publisher Name Server Data Captured
- Database Fields
- Publisher Name, Preferred Form
- Source of Preferred Form
- Former Names
- Variant Forms
- ISBN Prefixes
- HQ City
- HQ Country
- Other Cities
- URL
- -----
- Languages
- Formats
- Conspectus Subjects
- Data Sources
- U.S. Library of Congress, Name Authority
File, 110 (Corporate Name) field
- Books In Print Online (R.R. Bowker)
- The International ISBN Registry (K.G. Saur)
- Publishers Weekly Online
- Hoover's Handbook Online
- Standard & Poor's Corporate Descriptions
- The Directory of Corporate Affiliations (DIALOG)
- Company websites
- DATA MINING
41Publisher Name Server Database
- More than 56,000 separate strings mapped to 1750
entities
- 8.5 million OCLC records
- 22% of these are Library of Congress records
- 490 million holdings
- Hierarchical relationships maintained
42Entity-Parsing in a World of Mergers and
Acquisitions
Pearson PLC
Pearson Canada
Pearson Technology Group
Penguin Books
Copp Clark
Adobe Press
Cisco Press
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Prentice-Hall, Inc.
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
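A relational table with a parent pointer is one way to preserve such a hierarchy and still roll record counts up to the parent company. The sketch below uses SQLite from Python. The table and column names are hypothetical, the record counts are invented, and the parent links shown are illustrative, since the exact tree on this slide cannot be recovered from the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE publisher (
        id             INTEGER PRIMARY KEY,
        preferred_name TEXT NOT NULL,
        parent_id      INTEGER REFERENCES publisher(id),
        record_count   INTEGER NOT NULL
    )
""")

# Illustrative rows only: counts are made up, parent links are assumptions.
rows = [
    (1, "Pearson PLC", None, 10),
    (2, "Penguin Books", 1, 40),
    (3, "Puffin Books", 2, 25),
    (4, "Pearson Education, Inc.", 1, 30),
    (5, "Prentice-Hall, Inc.", 4, 50),
]
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?, ?)", rows)


def aggregate_records(conn, root_name):
    """Sum record counts for an entity and all of its descendants."""
    query = """
        WITH RECURSIVE family(id) AS (
            SELECT id FROM publisher WHERE preferred_name = ?
            UNION ALL
            SELECT p.id FROM publisher p JOIN family f ON p.parent_id = f.id
        )
        SELECT COALESCE(SUM(record_count), 0)
        FROM publisher
        WHERE id IN (SELECT id FROM family)
    """
    return conn.execute(query, (root_name,)).fetchone()[0]


pearson_total = aggregate_records(conn, "Pearson PLC")
penguin_total = aggregate_records(conn, "Penguin Books")
print(pearson_total, penguin_total)
```

Keeping each imprint as its own row means a merger or acquisition only requires changing one `parent_id`, while aggregate figures for the parent are recomputed by the query.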
43Publisher Profiles
- Oxford University Press
- 119,237 records with ISBNs mapped to 210,095
records (0.19% of WorldCat)
- Pearson PLC
- Includes 14 subsidiaries and acquisitions
- Aggregate 291,433 records (0.27% of WorldCat)
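The WorldCat shares quoted here are consistent with dividing each publisher's record count by the total number of WorldCat records. The check below assumes the denominator is the July 2008 figure of 108,828,533 manifestations given earlier in the deck; the slides do not say which snapshot was actually used.

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), WorldCat July 2008

PUBLISHER_RECORDS = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
}


def share_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Return a publisher's records as a percentage of all WorldCat records."""
    return 100 * records / total


shares = {name: round(share_of_worldcat(count), 2)
          for name, count in PUBLISHER_RECORDS.items()}
print(shares)
```

Rounded to two decimals, the results are 0.19 and 0.27 percent, matching the figures on the slide.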
44Publisher Profiles: Top Languages (%)
- Pearson PLC
- English 95.27
- Spanish 1.43
- German 1.33
- French 0.60
- Dutch 0.55
- Latin 0.26
- Malay 0.06
- Ancient Greek 0.05
- Portuguese 0.05
- Italian 0.04
- Oxford Univ. Press
- English 96.74
- Latin 0.51
- German 0.39
- Chinese 0.39
- French 0.37
- Spanish 0.28
- Afrikaans 0.14
- Middle English 0.13
- Malay 0.09
- Swahili 0.09
45Publisher Profiles: Conspectus Divisions (%)
- Pearson PLC
- Language/ Literature 18.67
- Business/ Economics 13.30
- Computer Science 9.42
- Engineering 8.04
- History 7.59
- Mathematics 6.04
- Education 5.64
- Sociology 4.18
- Philosophy/ Religion 3.81
- Physical Sciences 2.75
- Oxford Univ. Press
- Language/ Literature 27.12
- History 11.92
- Music 9.78
- Philosophy/ Religion 9.55
- Business/ Economics 6.15
- Medicine 4.36
- Law 3.85
- Sociology 3.75
- Political Science 3.58
- Biology 2.60
46Publisher Profiles: Conspectus Categories (%)
- Pearson PLC
- English language 7.74
- Business admin. 4.62
- English literature 3.63
- Economics 2.94
- Comp. programming 2.39
- Electrical engineering 2.24
- Early childhood ed. 2.05
- Computer software 1.88
- U.S. federal law 1.80
- Computer Science 1.54
- Oxford Univ. Press
- English literature 10.66
- English language 5.86
- Instrumental music 3.48
- Vocal music 3.09
- Literature on music 2.26
- History Britain 1.82
- Economic history 1.38
- American lit. 1.35
- History S. Asia 1.30
- General history 1.29
47Publisher Profiles: Conspectus Subjects (%)
- Pearson PLC
- English modern 7.68
- Management 2.53
- Programming 1.74
- Arithmetic 1.09
- Economic theory 1.06
- Marketing 1.06
- General algebra 1.04
- Accounting 0.97
- Juvenile lit. 0.93
- English lit 19th c. 0.89
- Oxford Univ. Press
- English modern 5.57
- English lit prose 2.51
- English lit 19th c. 2.23
- Juvenile lit. 1.06
- English lit poetry 1.03
- English lit collections 0.80
- Biographies 0.76
- English lit 1900-1960 0.74
- Shakespeare 0.68
- Sacred choruses 0.66
48Projected MARC coding of Authorized Forms
- 710 Added Entry-Corporate Name
- Add $4 for publisher name
- Add $2 NAF where preferred form matches existing
authority record (44% of current PNAF)
- 752 Added Entry-Hierarchical Place Name
- Add $2 FAST where place of publication matches
FAST geographical subject headings
49Future Research
- Further data mining
- Profile aspects of publication output
- Deeper scaling into WorldCat (beyond ISBN)
- Plan for long-term maintenance
- ISBN-13 compliance
- File expansion for ongoing merger/acquisition
activities
50Thank You!
- Questions and Discussion
- Lynn Silipigni Connaway connawal_at_oclc.org
- Timothy J. Dickey dickeyt_at_oclc.org