Title: Lynn Silipigni Connaway, Ph.D.
1Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File
- Lynn Silipigni Connaway, Ph.D.
- Senior Research Scientist
- OCLC Research
- Timothy J. Dickey, Ph.D.
- Post-Doctoral Researcher
- OCLC Research
2Overall Research Goals
- To Build a Database that Will
- Identify
- Authoritative strings for publisher names
- Common variants for names and locations
- Hierarchical references indicating relationships and nesting of subsidiaries
- Definitions of publishing entities
3Overall Research Goals
- To Build a Database that Will
- Produce
- Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers
- Conform
- to international authority and standards practice, and
- inter-operate with other OCLC products
4Issues & Challenges
- Database Quality
- Historical Practices
- "the shortest form in which it can be understood." (AACR2, 2004)
- Different versions of cataloging rules
- Abbreviations
- Errors and misspellings
- Local Practices
5Method: Data Mining in an Aggregate Collection
- Data Mining and Analysis of WorldCat
- "affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making."
- Lavoie, B. F., Connaway, L. S., & O'Neill, E. T. (2007). Mapping WorldCat's digital landscape. Library Resources & Technical Services, 51, 106-115, at 107.
6WorldCat July 2008
Manifestations (records) 108,828,533
Works 84,096,107
Total holdings 1,292,763,300
Digital Items 3,182,550
Institutions 69,000
Physical Items 1.2 billion
7Global Origins of WorldCat Materials
Germany 10%
Rest of World 27%
Unknown 17%
France 4%
Canada 3%
UK 8%
US 28%
8Global Origins of WorldCat Materials
- Materials w/non-US origins: 57.9 million (55%)
- Top 5: Germany 10.0 million, UK 8.8 million, France 4.2 million, Netherlands 2.9 million, Canada 2.9 million
- Content Languages: 478 (49% of WC non-English)
- Top 5 non-English: German 12 million, French 6.1 million, Spanish 3.5 million, Dutch 2.6 million, Japanese 2.4 million
- Non-English Metadata Language: 28 million (66 languages)
- Top 5: German 11 million, French 1.8 million, Dutch 5.0 million, Finnish 0.7 million, Swedish 1.9 million
9OCLC Publisher Name Server
10Publisher Name Server Objectives
- Resolve for data mining and quality of WorldCat
- ISBN prefixes to publisher name
- Variant publisher names to a preferred form
- Complement Collection Analysis Service
- Librarians & Publishers
11Publisher Name Server Objectives
- Capture and profile attributes of individual publishers
- Location(s)
- Language(s) of materials published
- Genre(s)/format(s)
- Dominant subject domain(s)
- Parent company and subsidiaries
12Publisher Name Server Methodology
- Programmatically cluster publishers' records using ISBN prefixes
- Data clustering
- Classification of similar objects into different groups
- Partitioning of a data set into subsets (clusters)
- Hand parse the entities and resolve ISBN prefixes
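The clustering step on this slide can be sketched roughly as follows. This is a minimal illustration and not the OCLC implementation: the record layout, the sample ISBNs (made up, check digits not validated), and the `KNOWN_PREFIXES` table are all hypothetical. Real ISBN publisher prefixes vary in length and would have to come from the ISBN registry.

```python
from collections import defaultdict

# Hypothetical prefix table (group digit plus publisher digits of an ISBN-10).
# In practice these would be loaded from the ISBN registry, not hard-coded.
KNOWN_PREFIXES = ["019", "013", "0201", "3540"]

def normalize_isbn(raw):
    """Strip hyphens and spaces so prefixes can be compared as plain digits."""
    return raw.replace("-", "").replace(" ", "").upper()

def publisher_prefix(isbn, prefixes=KNOWN_PREFIXES):
    """Return the longest known prefix the ISBN starts with, or None."""
    matches = [p for p in prefixes if isbn.startswith(p)]
    return max(matches, key=len) if matches else None

def cluster_by_prefix(records):
    """Partition (isbn, publisher_string) records into clusters by ISBN prefix."""
    clusters = defaultdict(list)
    for raw_isbn, name in records:
        prefix = publisher_prefix(normalize_isbn(raw_isbn))
        clusters[prefix].append(name)
    return dict(clusters)

# Made-up sample records; the publisher strings show typical variant forms.
records = [
    ("0-19-000000-0", "Oxford University Press"),
    ("0-19-111111-1", "Oxford Univ. Press"),
    ("0-201-00000-0", "Addison-Wesley"),
    ("9-99-000000-0", "Unknown Press"),
]
clusters = cluster_by_prefix(records)
```

Each cluster collects the publisher strings that share a prefix, which is the material a person would then hand parse. Records with no known prefix fall into the `None` cluster for manual review.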
13Publisher Name Server Database
- 1750 publishing entities
- Relational database, preserving hierarchical relationships
- Begins with high-occurrence entities
- Top 10 lists
- Top 10 university presses
- Mergers and acquisitions, last 8 years
14Example: Top U.S. Publishing Entities by ISBN
15Publisher Name Server Data Captured
- Data
- Publisher Name, Preferred Form
- Source of Preferred Form
- Former Names
- Variant Forms
- ISBN Prefixes
- HQ City
- HQ Country
- Other Cities
- URL
- -----
- Languages
- Formats
- Conspectus Subjects
- Sources
- U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
- Books In Print Online (R.R. Bowker)
- The International ISBN Registry (K.G. Saur)
- Publishers Weekly Online
- Hoover's Handbook Online
- Standard and Poor's Corporate Descriptions
- The Directory of Corporate Affiliations (DIALOG)
- Company websites
- DATA MINING
17Publisher Name Server Current Scope
- More than 56,000 separate strings mapped to 1750 entities
- 8.5 million OCLC records
- 22% of these are Library of Congress records
- 490 million holdings
- Hierarchical relationships maintained
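One way to picture how variant strings map to entities while the hierarchy is preserved is the sketch below. It is only an illustration under stated assumptions: the entity names, ids, variant strings, and the flat `ENTITIES` and `VARIANTS` tables are hypothetical stand-ins for the relational database, and the normalization is deliberately simple.

```python
# Hypothetical entity table: entity id -> (preferred name, parent id or None).
ENTITIES = {
    1: ("Parent Group PLC", None),
    2: ("Imprint Books", 1),
    3: ("Imprint Juvenile", 2),
}

# Hypothetical variant strings, already normalized, mapped to entity ids.
VARIANTS = {
    "imprint bks.": 2,
    "imprint books ltd": 2,
    "imprint juv.": 3,
}

def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())

def resolve(name):
    """Map a publisher string to an entity id, or None if it is unknown."""
    key = normalize(name)
    if key in VARIANTS:
        return VARIANTS[key]
    for entity_id, (preferred, _parent) in ENTITIES.items():
        if normalize(preferred) == key:
            return entity_id
    return None

def top_parent(entity_id):
    """Follow parent links up to the root of the hierarchy."""
    while ENTITIES[entity_id][1] is not None:
        entity_id = ENTITIES[entity_id][1]
    return entity_id

def preferred_name(entity_id):
    """Return the preferred form of the entity's name."""
    return ENTITIES[entity_id][0]
```

For example, `preferred_name(top_parent(resolve("Imprint Juv.")))` rolls a subsidiary's variant string up to the top-level parent, which is how aggregate counts for a parent company could be assembled.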
18Entity-Parsing in a World of Mergers and Acquisitions
Pearson PLC
Pearson Canada
Pearson Technology Group
Penguin Books
Copp Clark
Adobe Press
Cisco Press
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Prentice-Hall, Inc.
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
19Publisher Profiles within WorldCat
- Oxford University Press
- 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
- Pearson PLC
- Includes 14 subsidiaries and acquisitions
- Aggregate 291,433 records (0.27% of WorldCat)
- Springer (Firm)
- 197,263 records (0.18% of WorldCat)
- Reed Elsevier PLC
- Includes dozens of subsidiaries
- Aggregate 370,029 records (0.34% of WorldCat)
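The shares on this slide can be checked with simple arithmetic. The sketch below assumes the denominator is the 108,828,533 manifestation records reported on the WorldCat July 2008 slide; the presentation does not say which total was used, but the figures agree when the shares are read as percentages.

```python
# Total manifestation records from the "WorldCat July 2008" slide (assumed denominator).
WORLDCAT_RECORDS = 108_828_533

# Record counts per publishing entity, as given on this slide.
PUBLISHER_RECORDS = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}

def share_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Return a publisher's share of WorldCat as a percentage, to two decimals."""
    return round(100 * records / total, 2)

shares = {name: share_of_worldcat(n) for name, n in PUBLISHER_RECORDS.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% of WorldCat")
```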
20WorldCat Publisher Profiles: Top Languages (% of records)
- Pearson PLC
- English 95.27
- Spanish 1.43
- German 1.33
- French 0.60
- Dutch 0.55
- Latin 0.26
- Malay 0.06
- Ancient Greek 0.05
- Portuguese 0.05
- Italian 0.04
- Oxford Univ. Press
- English 96.74
- Latin 0.51
- German 0.39
- Chinese 0.39
- French 0.37
- Spanish 0.28
- Afrikaans 0.14
- Middle English 0.13
- Malay 0.09
- Swahili 0.09
21WorldCat Publisher Profiles: Top Languages (% of records)
- Reed Elsevier PLC
- English 83.64
- French 9.34
- Dutch 2.32
- Spanish 0.95
- Italian 0.60
- Latin 0.27
- Afrikaans 0.16
- Ancient Greek 0.12
- Portuguese 0.09
- Polish 0.06
- Springer (Firm)
- English 61.25
- German 37.10
- French 1.02
- Italian 0.29
- Polish 0.13
- Czech 0.04
- Spanish 0.04
- Hungarian 0.03
- Dutch 0.02
- Danish 0.02
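A profile like the ones above can be produced by counting one attribute across all of a publisher's records and converting the counts to shares. The sketch below is a hedged illustration, not the method used for these slides: it reads the slide figures as percentages of each publisher's records (an assumption), and the `profile` helper and the sample language codes are made up.

```python
from collections import Counter

def profile(values, top_n=10):
    """Return the top_n most common values with their percentage share."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, round(100 * count / total, 2))
        for value, count in counts.most_common(top_n)
    ]

# Made-up language codes for ten records belonging to one publisher.
languages = ["eng"] * 6 + ["ger"] * 3 + ["fre"]
top_two = profile(languages, top_n=2)
```

The same helper would apply unchanged to formats or subject headings, since it only needs one value per record.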
22WorldCat Publisher Profiles: Formats (% of records)
- Oxford University Press
- Printed Material 89.57
- Computer File 8.23
- Microform 1.39
- Sound Recording 0.50
- Video Recording 0.16
- Springer (Firm)
- Printed Material 81.69
- Computer file 17.51
- Microform 0.71
- Video Recording 0.05
- Pearson PLC
- Printed Material 92.98
- Microform 2.82
- Computer File 2.15
- Video Recording 0.70
- Sound Recording 0.67
- Reed Elsevier PLC
- Printed Material 92.31
- Computer File 5.46
- Microform 1.85
- Video Recording 0.14
23WorldCat Publisher Profiles: Conspectus Divisions (% of records)
- Pearson PLC
- Language/ Literature 18.67
- Business/ Economics 13.30
- Computer Science 9.42
- Engineering 8.04
- History 7.59
- Mathematics 6.04
- Education 5.64
- Sociology 4.18
- Philosophy/ Religion 3.81
- Physical Sciences 2.75
- Oxford Univ. Press
- Language/ Literature 27.12
- History 11.92
- Music 9.78
- Philosophy/ Religion 9.55
- Business/ Economics 6.15
- Medicine 4.36
- Law 3.85
- Sociology 3.75
- Political Science 3.58
- Biology 2.60
24WorldCat Publisher Profiles: Conspectus Categories (% of records)
- Pearson PLC
- English language 7.74
- Business admin. 4.62
- English literature 3.63
- Economics 2.94
- Comp. programming 2.39
- Electrical engineering 2.24
- Early childhood ed. 2.05
- Computer software 1.88
- U.S. federal law 1.80
- Computer Science 1.54
- Oxford Univ. Press
- English literature 10.66
- English language 5.86
- Instrumental music 3.48
- Vocal music 3.09
- Literature on music 2.26
- History Britain 1.82
- Economic history 1.38
- American lit. 1.35
- History S. Asia 1.30
- General history 1.29
25WorldCat Publisher Profiles: Conspectus Subjects (% of records)
- Pearson PLC
- English modern 7.68
- Management 2.53
- Programming 1.74
- Arithmetic 1.09
- Economic theory 1.06
- Marketing 1.06
- General algebra 1.04
- Accounting 0.97
- Juvenile lit. 0.93
- English lit. 19th c. 0.89
- Oxford Univ. Press
- English modern 5.57
- English lit. prose 2.51
- English lit. 19th c. 2.23
- Juvenile lit. 1.06
- English lit. poetry 1.03
- English lit. collections 0.80
- Biographies 0.76
- English lit. 1900-1960 0.74
- Shakespeare 0.68
- Sacred choruses 0.66
26WorldCat Publisher Profiles: Conspectus Divisions (% of records)
- Reed Elsevier PLC
- Language/ Literature 14.18
- Law 11.78
- Engineering 11.73
- Business/ Economics 6.82
- Medicine 6.50
- Physical Sciences 5.01
- History 4.57
- Biology 4.32
- Health Professions 3.70
- Chemistry 3.51
- Springer (Firm)
- Computer Science 16.83
- Engineering 15.12
- Mathematics 12.96
- Medicine 9.93
- Physical Sciences 9.83
- Biology 5.22
- Business/ Economics 5.13
- Health Professions 4.48
- Chemistry 3.14
- Geography 2.58
27WorldCat Publisher Profiles: Conspectus Categories (% of records)
- Reed Elsevier PLC
- English literature 5.84
- Health professions 3.40
- English language 2.79
- U.S. federal law 2.32
- General engineering 2.26
- Electrical engineering 2.10
- General law 1.70
- Industrial economics 1.65
- Business admin. 1.53
- U.S. state law 1.46
- Springer (Firm)
- Computer science 5.23
- General math 4.48
- Health professions 4.03
- Electrical engineering 3.73
- General engineering 3.25
- Mathematical analysis 3.06
- Computer software 2.37
- Comp. programming 2.34
- Probability/ Statistics 2.20
- Mech. engineering 2.17
28WorldCat Publisher Profiles: Conspectus Subjects (% of records)
- Reed Elsevier PLC
- English modern 2.68
- English - prose 2.06
- Health professions 1.92
- U.S. state law 1.37
- Industrial management 1.22
- Legal periodicals 1.16
- English lit. - 1900-1960 1.15
- Engineering materials 0.86
- English fiction 0.83
- Nuclear physics 0.68
- Springer (Firm)
- Health professions 3.56
- Math collections 2.76
- Computer science 1.84
- Programming 1.46
- Access/ security 1.10
- Artificial intelligence 1.03
- Mathematical stats 1.03
- Analytical physics 1.02
- Industrial management 0.99
- Engineering materials 0.90
29Projected MARC coding of Authorized Forms
- 710 Added Entry Corporate Name
- Add $4 for publisher name
- Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
- 752 Added Entry Hierarchical Place Name
- Add $2 FAST where place of publication matches FAST geographical subject headings
30Ongoing Research
- Further data mining
- Profile other aspects of publication output
- Profile other publishers
- Trends over time
- Author clusters
- Geographic holdings patterns
- Collection Analysis
31Ongoing Research
- Plan for long-term maintenance
- ISBN-13 compliance
- File expansion of ongoing mergers/acquisition activities
- Deeper scaling into WorldCat (beyond ISBN)
32OCLC Publisher Name Server
- Project page
- http://www.oclc.org/research/projects/publisherns/
33Thank You!
- Questions and Discussion
- Lynn Silipigni Connaway connawal_at_oclc.org
- Timothy J. Dickey dickeyt_at_oclc.org