Lynn Silipigni Connaway, Ph.D. - PowerPoint PPT Presentation

Slides: 51
Provided by: oclc3
Learn more at: https://www.oclc.org

Transcript and Presenter's Notes


1
Beyond Data Mining: Delivering the Next
Generation of Services from Library Data
  • Lynn Silipigni Connaway, Ph.D.
  • Senior Research Scientist
  • OCLC
  • Timothy J. Dickey, Ph.D.
  • Post-Doctoral Researcher
  • OCLC

2
WorldCat as an Aggregate Collection
  • Data mining and analysis of WorldCat affords
    high-level perspective on historical patterns,
    suggests future trends, and supplies useful
    intelligence with which to inform decision
    making.
  • Lavoie, B. F., Connaway, L. S., & O'Neill, E. T.
    (2007). Mapping WorldCat's digital landscape.
    Library Resources & Technical Services, 51,
    106-115, at 107.

3
WorldCat, July 2008
Manifestations (records): 108,828,533
Works: 84,096,107
Total holdings: 1,292,763,300
Digital items: 3,182,550
Institutions: 69,000
Physical items: 1.2 billion
4
Global Origins of WorldCat Materials
US 28%
Rest of World 27%
Unknown 17%
Germany 10%
UK 8%
France 4%
Canada 3%
5
Global Origins of WorldCat Materials
  • Materials w/non-US origins: 57.9 million (55%)
  • Top 5: Germany 10.0 million, UK 8.8 million,
    France 4.2 million, Netherlands 2.9 million,
    Canada 2.9 million
  • Content languages: 478 (49% of WorldCat is
    non-English)
  • Top 5 non-English: German 12 million, French
    6.1 million, Spanish 3.5 million, Dutch 2.6
    million, Japanese 2.4 million
  • Non-English metadata language: 28 million (66
    languages)
  • Top 5: German 11 million, Dutch 5.0 million,
    Swedish 1.9 million, French 1.8 million,
    Finnish 0.7 million
6
WorldCat as a Decision-Making Resource
  • Collection management
  • Cooperative collection development
  • Comparative collection analysis
  • Collection assessment
  • Mass digitization
  • Off-site storage
  • Preservation

7
WorldCat as a Decision-Making Resource
  • Services
  • Virtual reference
  • Recommender services
  • Social networking
  • Systems
  • Precision

8
WorldCat as a Decision-Making Resource
  • Three Areas of Data Mining Research
  • OCLC WorldMap
  • Audience Level
  • Publisher Name Server

9
OCLC WorldMap
10
OCLC WorldMap™ Objectives
  • Geographically represent WorldCat data
  • Titles published in each country
  • Holdings for titles published in each country
  • Languages represented for titles published in
    each country

11
OCLC WorldMap™ Objectives
  • Geographically represent data from UNESCO, ARL,
    and NCES for each country
  • Number of:
      • Libraries
      • Library volumes
      • Certified/degreed librarians
      • Registered library users
      • Library expenditures
      • Cultural heritage institutions (museums and
        archives)
      • Publishers

12
OCLC WorldMap™ Objectives
  • Research prototype
  • Support OCLC data mining research
  • Visually display data for review and analysis
  • Internal use
  • Sales and marketing
  • External use
  • Library collection assessment and comparison
  • Data may be processed AT A GLANCE
  • Complement the AAU/ARL Global Resources Network
    project
  • Project of the Council on Library and Information
    Resources (CLIR)

22
  • http://pubserv.oclc.org:12223/WorldMap/

23
OCLC Audience Level
24
Audience Level: Rationale and Objectives
Holdings represent selection decisions by
librarians, which implies more than 1 billion
individual selection decisions in the WorldCat
holdings file
  • Selections serve the interests of a library's
    target community
  • Associate community (audience level) to library
    profiles - e.g., ARL, non-ARL academic, public,
    K-12 school

  • Thus we can infer a material's audience level
    from holdings patterns, which in turn can support:
  • Collection management
  • Readers' advisory services
  • Reference services
  • Information retrieval

29
Example Computation: Build Community
Library symbol  Library name                     Library type  Weight
OHI             State Library of Ohio            Other         x
OCO             Columbus Metropolitan Library    Public        0.33
CDC             Cedarville University            Academic      0.67
LIM             Lima Public Library              Public        0.33
OUN             Ohio University                  Research      1.00
OSD             SEO Automation Consortium        Other         x
BGU             Bowling Green State University   Academic      0.67
MIA             Miami University                 Academic      0.67
AKR             University of Akron              Academic      0.67
BGF             Firelands College                Academic      0.67
CIN             University of Cincinnati         Research      1.00
TOL             University of Toledo             Academic      0.67
KSU             Kent State University            Research      1.00
HIR             Hiram College                    Academic      0.67
YNG             Youngstown State University      Academic      0.67
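
The community-building step above can be sketched in Python. This is a minimal illustration, not OCLC's implementation: it assumes that a weight of "x" means the library is excluded from the computation, and that the audience level is the plain mean of the remaining weights. Neither assumption is stated on the slide, and the names TYPE_WEIGHTS, HOLDINGS, and audience_level are made up for the example.

```python
# Weights by library type, as listed on the slide. "Other" carries an "x"
# there; this sketch ASSUMES that means "excluded from the computation".
TYPE_WEIGHTS = {"Research": 1.00, "Academic": 0.67, "Public": 0.33}

# (library symbol, library type) for the holding libraries on the slide.
HOLDINGS = [
    ("OHI", "Other"), ("OCO", "Public"), ("CDC", "Academic"),
    ("LIM", "Public"), ("OUN", "Research"), ("OSD", "Other"),
    ("BGU", "Academic"), ("MIA", "Academic"), ("AKR", "Academic"),
    ("BGF", "Academic"), ("CIN", "Research"), ("TOL", "Academic"),
    ("KSU", "Research"), ("HIR", "Academic"), ("YNG", "Academic"),
]


def audience_level(holdings):
    """Return (total holdings, usable holdings, level).

    The level is the mean weight over usable holdings (an assumed formula),
    or None when no holding library has a usable weight.
    """
    weights = [TYPE_WEIGHTS[kind] for _, kind in holdings if kind in TYPE_WEIGHTS]
    level = sum(weights) / len(weights) if weights else None
    return len(holdings), len(weights), level


total, usable, level = audience_level(HOLDINGS)
print(f"total={total} usable={usable} level={level:.4f}")
```

Under these assumptions the 15 listed libraries yield 13 usable holdings, which mirrors the "Total Holdings" versus "Usable Holdings" distinction in the results that follow.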
30
FRBRizing Audience Level Results
  • Calculate Audience Level for each Manifestation
  • Aggregate weighted holdings for Work

OCLC Number  Total Holdings  Usable Holdings  Manifestation Audience Level
15504400     147             114              0.783825
29613712     172             117              0.769453
40393191     207             136              0.789426
62762763     190             124              0.758274
81016224     1               0                x
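
One way to read "aggregate weighted holdings for Work" is a mean of the manifestation levels weighted by usable holdings. The sketch below takes that reading as an assumption; the slide does not give the formula. The figures are the ones in the table above, and the names MANIFESTATIONS and work_audience_level are illustrative only.

```python
# (OCLC number, total holdings, usable holdings, manifestation audience level)
# copied from the slide; None stands for the "x" in the last row.
MANIFESTATIONS = [
    (15504400, 147, 114, 0.783825),
    (29613712, 172, 117, 0.769453),
    (40393191, 207, 136, 0.789426),
    (62762763, 190, 124, 0.758274),
    (81016224, 1, 0, None),
]


def work_audience_level(manifestations):
    """Usable-holdings-weighted mean of manifestation levels (assumed formula).

    Manifestations with no usable holdings are skipped. Returns None when
    nothing usable remains.
    """
    usable = [(n, lvl) for _, _, n, lvl in manifestations
              if n > 0 and lvl is not None]
    weight_sum = sum(n for n, _ in usable)
    if weight_sum == 0:
        return None
    return sum(n * lvl for n, lvl in usable) / weight_sum


print(f"work-level audience level: {work_audience_level(MANIFESTATIONS):.4f}")
```

Weighting by usable holdings keeps a manifestation held by a single library from pulling the work-level value as hard as one held by more than a hundred.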
32
Evaluating the OCLC Audience Level
  • Random sample of 30 zoology books, all audience
    levels
  • Human subjects
  • Ranked books in increasing order of difficulty
  • Strong statistical correlation between human
    subjects' ranking and programmatic ranking
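
A rank comparison of this kind can be sketched with Spearman's rank correlation. The slides do not say which statistic was used, so Spearman is an assumption here, and the two rankings below are invented purely to exercise the function; they are not the study's data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between the two ranks given to the same item.
    """
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length of at least 2")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented example: six books ranked by a person and by a program.
human_ranks = [1, 2, 3, 4, 5, 6]
program_ranks = [1, 3, 2, 4, 6, 5]
print(f"rho = {spearman_rho(human_ranks, program_ranks):.3f}")
```

A value near 1 means the two rankings largely agree, and a value near -1 means they are reversed.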

33
Evaluating the OCLC Audience Level
34
  • http://audiencelevel.oclc.org/

35
OCLC Publisher Name Server
36
Publisher Name Server: Research Objectives
  • Resolve, for data mining and quality of WorldCat:
      • ISBN prefixes to publisher name
      • Variant publisher names to a preferred form
  • Complement Collection Analysis Service
      • Librarians
      • Publishers
  • Capture and profile attributes of individual
    publishers
      • Location(s)
      • Language(s) of materials published
      • Genre(s)/format(s)
      • Dominant subject domain(s)
      • Parent company and subsidiaries

37
Publisher Name Server: Methodology
  • Programmatically cluster publisher records
    using ISBN prefixes
  • Data clustering (The Free Dictionary)
      • "The science of extracting useful information
        from large data sets or databases"
      • Classification of similar objects into different
        groups
      • Partitioning of a data set into
        subsets (clusters)
      • Data in each subset (ideally) share some common
        trait
  • Hand-parse the entities and resolve ISBN prefixes
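
The clustering step can be sketched as grouping publisher name strings by ISBN prefix. This is a rough illustration under stated assumptions, not the project's code: it assumes ISBNs arrive already hyphenated (so the registrant element can be read off without range tables), and it proposes the most frequent string in each cluster as a candidate preferred form, whereas the slides say entities were resolved by hand. The sample records use placeholder item numbers and check digits, and all function names are made up.

```python
from collections import Counter, defaultdict


def isbn_prefix(hyphenated_isbn):
    """Return 'group-registrant' from a hyphenated ISBN, or None if malformed."""
    parts = hyphenated_isbn.strip().split("-")
    if len(parts) == 5:      # ISBN-13: drop the leading 978/979 element
        parts = parts[1:]
    if len(parts) != 4:      # expect group, registrant, publication, check digit
        return None
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Group publisher name strings by ISBN prefix, counting each variant."""
    clusters = defaultdict(Counter)
    for isbn, publisher in records:
        prefix = isbn_prefix(isbn)
        if prefix is not None:
            clusters[prefix][publisher.strip()] += 1
    return clusters


def candidate_preferred_forms(clusters):
    """Pick the most frequent variant per prefix as a candidate for review."""
    return {prefix: names.most_common(1)[0][0] for prefix, names in clusters.items()}


# Placeholder records: item numbers and check digits are not real.
RECORDS = [
    ("0-19-000000-0", "Oxford University Press"),
    ("0-19-000001-0", "Oxford Univ. Press"),
    ("978-0-19-000002-0", "Oxford University Press"),
    ("0-13-000000-0", "Prentice-Hall, Inc."),
    ("0-13-000001-0", "Prentice Hall"),
    ("0-13-000002-0", "Prentice-Hall, Inc."),
    ("not an isbn", "Unknown"),
]

clusters = cluster_by_prefix(RECORDS)
for prefix, name in sorted(candidate_preferred_forms(clusters).items()):
    print(prefix, "->", name, dict(clusters[prefix]))
```

The output of such a pass is a starting point for the hand-parsing step: a reviewer still has to confirm each preferred form and decide how prefixes shared across mergers should be assigned.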

38
Publisher Name Server Database
  • 1750 publishing entities
  • Relational database, preserving hierarchical
    relationships
  • Begins with high-occurrence entities
  • Top 10 lists (USA, UK, Canada, Australia,
    Germany, France, Netherlands, Japan, Italy,
    China, Russia, Spain, Finland, Australia, Taiwan,
    New Zealand)
  • Top 10 university presses
  • Mergers and acquisitions, last 8 years

39
Publisher Name Server: Data Captured
  • Database Fields
      • Publisher Name, Preferred Form
      • Source of Preferred Form
      • Former Names
      • Variant Forms
      • ISBN Prefixes
      • HQ City
      • HQ Country
      • Other Cities
      • URL
      • Languages
      • Formats
      • Conspectus Subjects
  • Data Sources
      • U.S. Library of Congress, Name Authority
        File, 110 (Corporate Name) field
      • Books In Print Online (R.R. Bowker)
      • The International ISBN Registry (K.G. Saur)
      • Publishers Weekly Online
      • Hoover's Handbook Online
      • Standard & Poor's Corporate Descriptions
      • The Directory of Corporate Affiliations (DIALOG)
      • Company websites
      • DATA MINING

41
Publisher Name Server Database
  • More than 56,000 separate strings mapped to 1750
    entities
  • 8.5 million OCLC records
  • 22% of these are Library of Congress records
  • 490 million holdings
  • Hierarchical relationships maintained
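
A relational table with a self-referencing parent column is one common way to keep hierarchical relationships while still allowing counts to be rolled up to a parent entity. The sketch below uses Python's sqlite3 module and a recursive query. The schema, the column names, and the record counts are all invented for illustration; only the entity names come from the slides, and the parent links shown are a small assumed subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entity (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES entity(id),  -- NULL for a top-level entity
        records   INTEGER NOT NULL                -- invented counts
    )
    """
)
conn.executemany(
    "INSERT INTO entity VALUES (?, ?, ?, ?)",
    [
        (1, "Pearson PLC", None, 10),
        (2, "Penguin Books", 1, 40),
        (3, "Pearson Education, Inc.", 1, 30),
        (4, "Addison-Wesley Publishing Company", 3, 20),
    ],
)


def aggregate_records(conn, root_name):
    """Sum record counts for an entity and all of its descendants."""
    sql = """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id FROM entity e JOIN subtree s ON e.parent_id = s.id
        )
        SELECT SUM(records) FROM entity WHERE id IN (SELECT id FROM subtree)
    """
    return conn.execute(sql, (root_name,)).fetchone()[0]


print("Pearson PLC aggregate:", aggregate_records(conn, "Pearson PLC"))
print("Pearson Education aggregate:", aggregate_records(conn, "Pearson Education, Inc."))
```

Storing the parent link rather than a flattened name lets a merger or acquisition be recorded by changing a single row, after which aggregate profiles follow automatically.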

42
Entity-Parsing in a World of Mergers and
Acquisitions
Pearson PLC
Pearson Canada
Pearson Technology Group
Penguin Books
Copp Clark
Adobe Press
Cisco Press
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Prentice-Hall, Inc.
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
43
Publisher Profiles
  • Oxford University Press
      • 119,237 records with ISBNs mapped to 210,095
        records (0.19% of WorldCat)
  • Pearson PLC
      • Includes 14 subsidiaries and acquisitions
      • Aggregate 291,433 records (0.27% of WorldCat)
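
The WorldCat shares quoted for the two publishers can be checked with simple arithmetic, assuming the base is the July 2008 count of 108,828,533 manifestation records given earlier in the deck. The slides do not state the denominator, so that choice is an assumption, as are the names used below.

```python
# Assumed denominator: manifestations (records) in WorldCat, July 2008.
WORLDCAT_RECORDS = 108_828_533


def percent_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Return the share of WorldCat records as a percentage."""
    return 100 * records / total


for publisher, records in [("Oxford University Press", 210_095),
                           ("Pearson PLC", 291_433)]:
    print(f"{publisher}: {percent_of_worldcat(records):.2f}%")
```

With that base, both results round to the figures on the slide, which supports reading 0.19 and 0.27 as percentages rather than fractions.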

44
Publisher Profiles: Top Languages (%)
  • Pearson PLC
      • English 95.27
      • Spanish 1.43
      • German 1.33
      • French 0.60
      • Dutch 0.55
      • Latin 0.26
      • Malay 0.06
      • Ancient Greek 0.05
      • Portuguese 0.05
      • Italian 0.04
  • Oxford Univ. Press
      • English 96.74
      • Latin 0.51
      • German 0.39
      • Chinese 0.39
      • French 0.37
      • Spanish 0.28
      • Afrikaans 0.14
      • Middle English 0.13
      • Malay 0.09
      • Swahili 0.09

45
Publisher Profiles: Conspectus Divisions (%)
  • Pearson PLC
      • Language/Literature 18.67
      • Business/Economics 13.30
      • Computer Science 9.42
      • Engineering 8.04
      • History 7.59
      • Mathematics 6.04
      • Education 5.64
      • Sociology 4.18
      • Philosophy/Religion 3.81
      • Physical Sciences 2.75
  • Oxford Univ. Press
      • Language/Literature 27.12
      • History 11.92
      • Music 9.78
      • Philosophy/Religion 9.55
      • Business/Economics 6.15
      • Medicine 4.36
      • Law 3.85
      • Sociology 3.75
      • Political Science 3.58
      • Biology 2.60

46
Publisher Profiles: Conspectus Categories (%)
  • Pearson PLC
      • English language 7.74
      • Business admin. 4.62
      • English literature 3.63
      • Economics 2.94
      • Comp. programming 2.39
      • Electrical engineering 2.24
      • Early childhood ed. 2.05
      • Computer software 1.88
      • U.S. federal law 1.80
      • Computer Science 1.54
  • Oxford Univ. Press
      • English literature 10.66
      • English language 5.86
      • Instrumental music 3.48
      • Vocal music 3.09
      • Literature on music 2.26
      • History, Britain 1.82
      • Economic history 1.38
      • American lit. 1.35
      • History, S. Asia 1.30
      • General history 1.29

47
Publisher Profiles: Conspectus Subjects (%)
  • Pearson PLC
      • English, modern 7.68
      • Management 2.53
      • Programming 1.74
      • Arithmetic 1.09
      • Economic theory 1.06
      • Marketing 1.06
      • General algebra 1.04
      • Accounting 0.97
      • Juvenile lit. 0.93
      • English lit., 19th c. 0.89
  • Oxford Univ. Press
      • English, modern 5.57
      • English lit., prose 2.51
      • English lit., 19th c. 2.23
      • Juvenile lit. 1.06
      • English lit., poetry 1.03
      • English lit., collections 0.80
      • Biographies 0.76
      • English lit., 1900-1960 0.74
      • Shakespeare 0.68
      • Sacred choruses 0.66

48
Projected MARC Coding of Authorized Forms
  • 710 Added Entry: Corporate Name
      • Add $4 for publisher name
      • Add $2 NAF where preferred form matches existing
        authority record (44% of current PNAF)
  • 752 Added Entry: Hierarchical Place Name
      • Add $2 FAST where place of publication matches
        FAST geographical subject headings

49
Future Research
  • Further data mining
  • Profile aspects of publication output
  • Deeper scaling into WorldCat (beyond ISBN)
  • Plan for long-term maintenance
  • ISBN-13 compliance
  • File expansion for ongoing merger/acquisition
    activities

50
Thank You!
  • Questions and Discussion
  • Lynn Silipigni Connaway connawal@oclc.org
  • Timothy J. Dickey dickeyt@oclc.org