LIS510 lecture 12 - PowerPoint PPT Presentation

About This Presentation
Title:

LIS510 lecture 12

Description:

Investigation what to buy. Negotiation of the purchase. Acquisition of access to a service ... mp3 file containing the recording. repositories ... – PowerPoint PPT presentation

Slides: 63
Provided by: open6
Learn more at: http://openlib.org
Category:
Tags: lecture | lis510


Transcript and Presenter's Notes

Title: LIS510 lecture 12


1
LIS510 lecture 12
  • Thomas Krichel
  • 2006-12-13

2
today
  • Leftovers from last time.
  • I discuss some elements of Bill Arms's book on
    Digital Libraries.
  • It is an introductory book that is general, but
    smartly written.
  • It is not a book to teach someone to become a
    digital librarian.
  • LIS650 and LIS651 are for that. They really deal
    with the introduction to digital information.
  • I also talk generally about understanding some
    digital contents.

3
definition
  • An informal definition of a digital library is a
    managed collection of information, with
    associated services, where the information is
    stored in digital formats and accessible over a
    network.
  • "Managed" is the key word here.

4
benefits of digital libraries
  • The digital library brings the library to the
    user.
  • Computer power is used for searching and
    browsing.
  • Information can be shared.
  • Information is easier to keep current.
  • The information is always available.
  • New forms of information become possible.

5
costs
  • Non-digital libraries are very expensive.
  • Digital libraries are also expensive. Many
    publishers charge more for online editions than
    for traditional print.
  • However, the cost of the infrastructure is
    dropping.
  • There is also potential for change in the way
    information is supplied in digital libraries.

6
technical change
  • Electronic storage is becoming cheaper than
    paper.
  • Personal computer displays are becoming more
    pleasant to use.
  • High-speed networks are becoming widespread.
  • Computers have become portable.

7
libraries adapt
  • Libraries get wired
  • They offer electronic access, even to the home
    user.
  • Other actions depend on the library type
  • Some shift from information access to community
    center.
  • Some adopt digital reference with 24/7
    asynchronous help.
  • Some get involved in digital archiving of
    institutional assets.

8
digital library cost
  • Digital library material will cost more
    initially because publishers want to see a return
    on the extra functionality they have developed.
  • In the longer run, digital library costs may be
    lower than print costs because of:
  • lower storage cost
  • less risk to the items
  • lower staffing requirements (but differently
    trained staff)

9
classic roles for the library with digital
material
  • Investigation of what to buy
  • Negotiation of the purchase
  • Acquisition of access to a service
  • Installation of access devices
  • Training of users
  • Maintenance: update, migrate, replace

10
beyond the library
  • The classic roles will be at best a stagnating,
    if not declining, source of jobs for information
    professionals.
  • The rise of open access will mean that fewer
    assets than before will have to be purchased.
    Today's example:
  • http://dme.mozarteum.at
  • Training needs of users decline as digital media
    are getting easier to use.

11
new roles for information professionals
  • The information age does not happen without
    information professionals.
  • There is a huge demand for tech-savvy information
    professionals out there. Examples include:
  • web site maintenance
  • digital archiving

12
impact of technology on staff
  • Information professionals that are
    technologically savvy will thrive better than
    those who are not.
  • Fortunately the Palmer School offers LIS508,
    LIS650, LIS651.
  • It still does not have a system administration
    class, but that may come as well.

13
impact of technology on staff
  • Constant computer use can cause serious health
    problems
  • Problem areas are:
  • bad posture at the desk
  • eye strain
  • The use of the mouse is particularly bad. Learn
    how to avoid using it.
  • Injuries take a long time to heal.

14
digital libraries are hard
  • In digital libraries, terminology is a serious
    problem. Basic concepts are hard to pin down.
  • These definition problems also hurt efforts to
    build sophisticated information systems by
    semi-automated means.
  • We live in the age of the brute-force
    calculation, not the age of artificial
    intelligence.

15
data and metadata
  • Metadata is data about data. The distinction
    between data and metadata often depends on the
    context.
  • Metadata is often divided into:
  • descriptive metadata
  • structural metadata
  • administrative metadata

16
what's in the digital library?
  • Items ?
  • Material ?
  • Documents ?
  • Objects?
  • Digital Items ?
  • Digital Material ?
  • Digital Documents ?
  • Digital Objects ?

17
storage and dissemination
  • Items are stored in digital format in a way we
    can call the stored form of the item.
  • When the item is shown to the user, it is shown
    as a presentation or dissemination. This is
    the way the object leaves the server.
  • When it arrives at the user's machine, that
    machine has to render the presentation.

18
users and clients
  • A user is someone who uses a digital library.
    Often the user is anonymous and cannot be
    identified.
  • A client is a piece of software that the user
    runs to use the digital library. Sometimes this
    is called a user agent. Most people simply refer
    to it as a browser.

19
work and contents
  • These are difficult things to discuss. Take the
    song Der Lindenbaum as an example. The name
    could mean:
  • song as sound and words
  • score
  • performance
  • recording
  • mp3 file containing the recording

20
repositories
  • This is a general term used to talk about a
    computer system whose primary function is
    storing contents.
  • When long-run storage is involved, a repository
    becomes an archive.
  • A server is a computer that is switched on
    constantly to provide services to the public.

21
an example of terminology
  • A data model is an abstraction (or an extra
    level of indirection) for digital objects such
    that each digital object can be seen as an
    instance of the class defined by the data model.
  • A surrogate is a transmittable serialization or
    representation of a digital object that can be
    passed back and forth so we can do things with
    it. Possible serialization techniques include XML
    and RDF/XML.

22
a digital library from scratch
  • Much of the data that is stored in digital
    libraries is text.
  • Most other material that is not textual in
    nature, such as sound files and graphics, needs
    textual metadata in order to be found.
  • Current technology is not able to find it
    otherwise.

23
Information
  • Information is best understood as what it takes
    to answer a question.
  • The simplest question has a yes or no answer.
    Therefore a bit is the natural measure of
    information.
  • Term first used by John Tukey in 1946.
  • It is a contraction of "binary digit".

24
Usage of bits
  • Computers are sometimes classified by the number
    of bits they can process at one time, e.g. a
    "32-bit processor".
  • Graphics are also often described by the number
    of bits used to represent each dot.

25
bits and bytes
  • A bit can take the values 0 or 1, thus it can
    describe 2 possibilities.
  • Two bits can take the values 00, 01, 10, 11, thus
    they can describe four (2^2) possibilities.
  • n bits can encode 2^n possibilities.
  • Early chips processed 8 bits at a time. It became
    customary to refer to a group of 8 bits as a
    byte. A byte can encode 2^8 = 256 possibilities.
  • We can use binary numbers just like decimal
    numbers.
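
A minimal Python sketch of the counting rule on this slide. The helper name bit_patterns is made up for illustration; it simply lists every pattern of n bits so that the 2^n count can be checked directly.

```python
from itertools import product


def bit_patterns(n: int) -> list[str]:
    """Return every pattern of n bits as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]


two_bit = bit_patterns(2)            # all patterns of two bits
byte_count = len(bit_patterns(8))    # number of values a byte can hold

# Binary numbers work like decimal ones, only with base 2:
# 1010 in binary is 1*8 + 0*4 + 1*2 + 0*1.
ten = int("1010", 2)

print(two_bit, byte_count, ten)
```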

26
application of bytes
  • IP (Internet Protocol) numbers are used as the
    addresses of computers on the Internet.
  • In IP version 4 (the one that is most commonly
    used), each IP number has 4 bytes.
  • It is represented as x.x.x.x where x is a number
    between 0 and 255 (why?)
  • How many computers can there be on the Internet
    at any one time?
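
One way to work through the two questions on this slide, as a hedged Python sketch. The function name dotted_quad_to_int and the sample address 192.0.2.1 are assumptions for illustration. The result 2^32 is only an upper bound on the number of distinct addresses; it says nothing about reserved ranges or machines that share an address.

```python
def dotted_quad_to_int(address: str) -> int:
    """Turn an IPv4 address written as x.x.x.x into one 4-byte number."""
    parts = [int(piece) for piece in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for p in parts:
        value = value * 256 + p   # shift left by one byte, add the next byte
    return value


max_byte = 2 ** 8 - 1          # largest value one byte can hold
address_space = 2 ** (4 * 8)   # 4 bytes = 32 bits of address

print(max_byte, address_space, dotted_quad_to_int("192.0.2.1"))
```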

27
Many bytes
  • Larger units are:
  • A kilobyte is 2^10 bytes (1024 bytes).
  • A megabyte is 2^20 bytes.
  • A gigabyte is 2^30 bytes.
  • A terabyte is 2^40 bytes.
  • The prefixes come from ancient Greek words for
    "thousand", "large", "giant", and "monster",
    respectively. The prefix kilo dates back to the
    French Revolution; the others were adopted later.

28
Hex numbers
  • A byte is often represented by two hex digits.
  • Each hex digit can encode 16 values.
  • Written 0 to 9, then A B C D E F. F is 15.
  • Conventionally prefixed with 0x.
  • Use the Microsoft calculator in scientific mode
    to convert.
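
The conversion can also be done in a few lines of Python instead of a calculator. This is a small sketch; the function names byte_to_hex and hex_to_int are invented for illustration.

```python
def byte_to_hex(value: int) -> str:
    """Write one byte (0 to 255) as exactly two hex digits."""
    if not 0 <= value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return format(value, "02X")


def hex_to_int(text: str) -> int:
    """Read a hex number, with or without the conventional 0x prefix."""
    return int(text, 16)


print(byte_to_hex(255), byte_to_hex(10), hex_to_int("F"), hex_to_int("0xFF"))
```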

29
applications of hex numbers
  • Media Access Control (MAC) addresses of hardware
    that allows access to computer networks. They are
    6-byte numbers, each byte written as 2 hex
    digits, e.g. 006008F520A9
  • Character numbers that you see when you are
    inserting a special symbol in Microsoft software,
    e.g. PowerPoint.
  • Color codes on web pages use 6 hex digits.
  • 000000 is black
  • FFFFFF is white
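
A short Python sketch of how these hex strings break down into bytes. The function names split_mac and parse_color are assumptions for illustration. The color reading relies on the usual web convention that the six digits are red, green, and blue, two digits each.

```python
def split_mac(mac: str) -> list[int]:
    """Split a 12-hex-digit MAC address into its 6 byte values."""
    raw = bytes.fromhex(mac)
    if len(raw) != 6:
        raise ValueError("a MAC address has exactly 6 bytes")
    return list(raw)


def parse_color(code: str) -> tuple[int, int, int]:
    """Split a 6-hex-digit web color into red, green, blue (0 to 255)."""
    raw = bytes.fromhex(code)
    if len(raw) != 3:
        raise ValueError("a color code has exactly 6 hex digits")
    red, green, blue = raw
    return (red, green, blue)


print(split_mac("006008F520A9"))
print(parse_color("000000"), parse_color("FFFFFF"))
```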

30
Information in a computer file
  • A file is a piece of data stored on a computer.
  • Any file contains a sequence of 0s and 1s, like
    1010100101010011110101010101
  • For a computer to make sense of a file, it has
    to know what type of file it is.

31
executable files
  • Files that are executable are files that make the
    computer do something. For example, the file
    starts a program, say PowerPoint. An executable
    on one computer may not run on another one.
  • Non-executable files hold data that is used by an
    executable file. We will call them data files.
    Example: a PowerPoint slides file.

32
Characters
  • Much of the information processed by computers is
    in the form of characters.
  • From Wikipedia:
  • A character is a unit of information that roughly
    corresponds to a grapheme, or written symbol, of
    a natural language, such as a letter, numeral, or
    punctuation mark.
  • A character is not a grapheme because there are
    ligatures.

33
control characters
  • The concept also includes control characters,
    which do not correspond to natural language
    symbols but to other bits of information used to
    process texts of the language, such as
    instructions to printers or other devices that
    display such texts.
  • An example of such a control character is the
    newline character.

34
text files
  • Many data files contain textual data.
  • Textual data is a sequence of characters.
  • A character is an elementary symbol that has some
    meaning, for example:
  • alphabet letter
  • hieroglyph
  • Example: an email file
  • Text files can be read by many computer programs.

35
non-text files
  • Examples of non-text files are:
  • graphics files
  • movie files
  • sound files
  • Non-text files are of minor significance in
    library settings
  • There is no way to organize information retrieval
    for non-text files. They have to be retrieved
    using a textual surrogate.
  • Traditional library materials are textual.
  • We will talk about this later.

36
Representing characters
  • Computers don't understand text, they only
    understand numbers. For computers to be able to
    treat text, there must be a correspondence
    between numbers and text characters. Such a
    correspondence is called a character set.
  • Examples of characters are:
  • a
  • c
  • ë

37
Legacy character sets
  • In the early days, computers were a lot less
    powerful than they are today.
  • They could only deal with the characters that
    are most commonly used.
  • Such sets are:
  • ASCII
  • ISO-8859-1
  • cp1252

38
ASCII
  • American Standard Code for Information
    Interchange
  • 7-bit character set. There is no such thing as
    8-bit ASCII
  • 95 printable symbols
  • 33 control characters (0-31, 127)
  • http://www.ccmr.cornell.edu/helpful_data/ascii2.html
    has a list up to 127
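
The numbers on this slide can be checked with a few lines of Python. This is a sketch that takes the slide's own split (control characters at 0 to 31 and 127, everything else printable) as given; the variable names are invented for illustration.

```python
ASCII_SIZE = 2 ** 7   # a 7-bit set has 128 positions, numbered 0 to 127

control = [code for code in range(ASCII_SIZE) if code < 32 or code == 127]
printable = [code for code in range(ASCII_SIZE) if 32 <= code <= 126]

# The highest code, 127, needs exactly 7 bits.
bits_needed = (ASCII_SIZE - 1).bit_length()

print(len(control), len(printable), bits_needed)
```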

39
some ASCII control characters
  • CR (13, ^M) is the carriage return
  • LF (10, ^J) is the linefeed
  • FF (12, ^L) is the form feed (new page)
  • BS (8, ^H) is the backspace
  • DEL (127, ALT-127) is delete
  • ESC (27, ^[) is escape
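
Control characters are often written in caret notation, where code 13 is shown as ^M (Control-M). The Python sketch below looks up the codes listed on this slide and derives the caret form by flipping the bit with value 64. The names CONTROL_CHARS and caret_notation are assumptions for illustration.

```python
# Python escape sequences for the control characters on the slide.
CONTROL_CHARS = {
    "CR": "\r",
    "LF": "\n",
    "FF": "\f",
    "BS": "\b",
    "ESC": "\x1b",
    "DEL": "\x7f",
}

codes = {name: ord(char) for name, char in CONTROL_CHARS.items()}


def caret_notation(code: int) -> str:
    """Caret form of a control code: flip the bit with value 64."""
    return "^" + chr(code ^ 0x40)


for name, code in codes.items():
    print(name, code, caret_notation(code))
```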

40
ISO-8859-1
  • ISO-8859-1, aka ISO-latin-1 extends ASCII with
    characters that are commonly used by the western
    European languages.
  • It is the default character set of HTML.
  • Positions 128 to 159 are not used for printable
    characters.
  • Cp1252 fills these with graphic chars. It is a
    Microsoft character set.
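
The difference between the two sets shows up when the same byte is decoded both ways. A small Python sketch using the codecs that ship with Python; byte 0x80 (position 128) and byte 0xE9 are chosen here only as examples.

```python
import unicodedata

low_byte = bytes([0x80])   # position 128, inside the 128 to 159 range
shared = bytes([0xE9])     # a position above 159

as_latin1 = low_byte.decode("latin-1")   # ISO-8859-1: a control code
as_cp1252 = low_byte.decode("cp1252")    # cp1252: a graphic character

# "Cc" is the Unicode category for control characters.
latin1_category = unicodedata.category(as_latin1)

# Outside that range the two sets agree.
same_above = shared.decode("latin-1") == shared.decode("cp1252")

print(repr(as_latin1), latin1_category, as_cp1252, same_above)
```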

41
This is not enough
  • There are around 6800 different languages around.
  • Some of these languages use character sets that
    are not finite, i.e. folks can make up new
    characters out of existing ones!
  • Setting up a character set for all languages is
    almost impossible.

42
ISO 10646-1
  • Defines the Universal Character Set (UCS)
  • UCS contains the characters used by many known
    languages, even the likes of Oriya, Telugu,
    Bopomofo, Runic.
  • ISO 10646 formally defines a 31-bit character
    set. Characters are represented as 32 bits, i.e.
    4 bytes, or 8 hex digits.
  • Not finished.

43
Unicode
  • ISO is an inter-governmental agency. Slow and
    bureaucratic.
  • Industry has come together to work on Unicode,
    originally a 2-byte character set.
  • With some minor exceptions, the Unicode
    characters are the same as the first 65536
    characters in UCS.
  • Much better documented standard.

44
Unicode and legacy sets
  • The first 128 characters are identical to those
    in ASCII
  • The next 128 characters are identical to ISO
    8859-1 (Latin-1).
  • Unicode is well documented and the Unicode book
    can be downloaded from the Internet. A must-have
    for the serious digital librarian.
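
Both compatibility claims on this slide can be checked in Python, whose strings use Unicode code points. This is a sketch: it decodes every byte with the legacy codec and compares the result to the Unicode character with the same number. The variable names are made up for illustration.

```python
# Every ASCII position decodes to the Unicode character with that number.
ascii_match = all(
    bytes([code]).decode("ascii") == chr(code) for code in range(128)
)

# The same holds for all 256 positions of ISO 8859-1 (Latin-1).
latin1_match = all(
    bytes([code]).decode("latin-1") == chr(code) for code in range(256)
)

print(ascii_match, latin1_match)
```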

45
Beyond characters
  • There is more to text than a string of
    characters.
  • There is layout
  • titles
  • abstracts
  • mathematical formula spacing

46
Layout
  • Layout can be conveyed by additional text that
    has special meaning. Examples:
  • LaTeX
  • HTML
  • PostScript
  • Another way is to do non-textual layout by adding
    some other digital signals. Examples:
  • DVI
  • MS Word
  • MS PowerPoint
  • These cannot be shown in these slides!

47
Example LaTeX
  • \bigskip\textbf{Class structure}
  • Classes will be held in the computer lab in the
    Palmer School between 18:15 and 20:45. An
    optional practice session will last until 21:15.
  • \begin{tabular}{@{}llll@{}}
  • 0 & 2006--09--12 & introduction to the course \\
  • 1 & 2006--09--19 & libraries and food \\
  • 2 & 2006--09--26 & introduction to shushing \\

48
Example HTML
  • <p><strong>Class structure</strong><p>Classes
    will be held in the computer lab in the Palmer
    School between 18:15 and 20:45. An optional
    practice session will last until 21:15.<p>Class
    details
  • <p><center><table width=100 border=1>
  • <tr><td align=left> 0 </td><td align=left>
    2006&#8211;09&#8211;12 </td><td align=left><a
    href="lis510w06a-00.ppt">introduction to the
    course</a> </td></tr><tr><td align=left> 1
    </td><td align=left> 2006&#8211;09&#8211;19
    </td><td align=left><a href="lis510w06a-01.ppt">libraries
    and food</a> </td>

49
Example PostScript
  • Fc(Class)g(structur)o(e)-104 3956 y
    Fd(Classes)26b(will)g(be)e(held)g(in)h(the)f(compu
    ter)f(lab)i(in)f(the)h(P)o(almer)f(School)g(betwee
    n)f(1815)h(and)g(2045.)36 b(An)25
    b(optional)e(practice)h(session)-104 4055
    y(will)d(last)g(until)f(2115.)-104 4155
    y(Class)i(details)-104 4307 y(0)141
    b(2003\22609\22623)94b(introduction)18
    b(to)i(the)h(course)-104 4407 y(1)141
    b(2002\22609\22630)94 b(bits)21
    b(bytes)f(and)g(characters)-104 4507 y(2)141
    b(2003\22610\22607)94 b(databases)20
    b(and)g(markup)e(languages)-

50
DVI (rendition, "class structure")
  • 1659: fntnum27 current font is ptmb8t
  • 1660: setchar67 h:=-820459+473168=-347291,
    hh:=-22
  • 1661: setchar108 h:=-347291+182183=-165108,
    hh:=-10
  • 1662: setchar97 h:=-165108+327680=162572, hh:=11
  • 1663: setchar115 h:=162572+254928=417500, hh:=27
  • 1664: setchar115 h:=417500+254928=672428, hh:=43
  • 1665: right3 163840 h:=672428+163840=836268,
    hh:=53
  • 1669: setchar115 h:=836268+254928=1091196, hh:=69
  • 1670: setchar116 h:=1091196+218232=1309428,
    hh:=83
  • 1671: setchar114 h:=1309428+290976=1600404,
    hh:=101
  • 1672: setchar117 h:=1600404+364376=1964780,
    hh:=124
  • 1673: setchar99 h:=1964780+290976=2255756,
    hh:=142
  • 1674: setchar116 h:=2255756+218232=2473988,
    hh:=156
  • 1675: setchar117 h:=2473988+364376=2838364,
    hh:=179
  • 1676: setchar114 h:=2838364+290976=3129340,
    hh:=197

51
XML
  • XML is the Extensible Markup Language. It has
    become the lingua franca for structured textual
    data.
  • It is also increasingly used on the web.
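
To give a feel for what structured textual data looks like, here is a hedged Python sketch that reads a tiny XML record with the standard library. The element and attribute names (movie, id, title, year) are invented for illustration and do not come from any particular standard.

```python
import xml.etree.ElementTree as ET

# A made-up record: the tags say what each piece of text means.
record = """
<movie id="M3">
  <title>High Noon</title>
  <year>1974</year>
</movie>
"""

root = ET.fromstring(record)

movie_id = root.get("id")            # attribute on the outer element
title = root.findtext("title")       # text inside <title>
year = int(root.findtext("year"))    # text inside <year>, as a number

print(movie_id, title, year)
```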

52
Databases
  • Databases are collections of data with some
    organization to them.
  • The classic example is the relational database.
  • But not all databases need to be relational
    databases.

53
Relational databases
  • A relational database is a set of tables. There
    may be relations between the tables.
  • Each table has a number of records. Each record
    has a number of fields.
  • When the database is being set up, we fix:
  • the size of each field
  • relationships between tables

54
Example Movie database
  • ID | title | director | date
  • M1 | Gone with the wind | F. Ford Coppola | 1963
  • M2 | Room with a view | Coppola, F Ford | 1985
  • M3 | High Noon | Woody Allan | 1974
  • M4 | Star Wars | Steve Spielberg | 1993
  • M5 | Alien | Allen, Woody | 1987
  • M6 | Blowing in the Wind | Spielberg, Steven | 1962
  • Single table
  • No relations between tables, of course

55
Problem with this database
  • All the data is wrong, but this is just for
    illustration.
  • Names are recorded inconsistently. There is no
    way to find films by Woody Allan without having
    to go through all spelling variations.
  • Mistakes are difficult to correct. We have to
    wade through all records, a masochist's pleasure.

56
Better movie database
  • ID | title | director | year
  • M1 | Gone with the wind | D1 | 1963
  • M2 | Room with a view | D1 | 1985
  • M3 | High Noon | D2 | 1974
  • M4 | Star Wars | D3 | 1993
  • M5 | Alien | D2 | 1987
  • M6 | Blowing in the Wind | D3 | 1962
  • ID | director name | birth year
  • D1 | Ford Coppola, Francis | 1942
  • D2 | Allan, Woody | 1957
  • D3 | Spielberg, Steven | 1942

57
Relational database
  • We have a one-to-many relationship between
    directors and films:
  • Each film has one director.
  • Each director may have directed many films.
  • Here it becomes possible for the computer:
  • To know which films have been directed by Woody
    Allen
  • To find which films have been directed by a
    director born in 1942
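
A sketch of the two-table design using Python's built-in sqlite3 module, with the illustrative (and deliberately wrong) data from the slides. The table names, column names, and function names are assumptions; any relational database would do, SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (
    id TEXT PRIMARY KEY,
    title TEXT,
    director_id TEXT REFERENCES director(id),  -- the one-to-many link
    year INTEGER
);
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])


def films_by_director(name: str) -> list[str]:
    """Titles of all films whose director has the given name."""
    rows = conn.execute(
        "SELECT m.title FROM movie m JOIN director d ON m.director_id = d.id "
        "WHERE d.name = ? ORDER BY m.id", (name,))
    return [row[0] for row in rows]


def films_by_birth_year(year: int) -> list[str]:
    """Titles of all films directed by someone born in the given year."""
    rows = conn.execute(
        "SELECT m.title FROM movie m JOIN director d ON m.director_id = d.id "
        "WHERE d.birth_year = ? ORDER BY m.id", (year,))
    return [row[0] for row in rows]


print(films_by_director("Allan, Woody"))
print(films_by_birth_year(1942))
```

Because each director's name is stored once, a spelling mistake is corrected in a single record of the director table.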

58
Many-to-many relationships
  • Each film has one director, but many actors star
    in it. The relationship between actors and films
    is a many-to-many relationship.
  • Here are a few actors
  • ID | sex | actor name | birth year
  • A1 | f | Brigitte Bardot | 1972
  • A2 | m | George Clooney | 1927
  • A3 | f | Marilyn Monroe | 1934

59
Actor/Movie table
  • actor id movie id
  • A1 M4
  • A2 M3
  • A3 M2
  • A1 M5
  • A1 M3
  • A2 M6
  • A3 M4
  • as many lines as required

60
SQL
  • Once we have the relational database, we can ask
    sophisticated questions:
  • Which director has had the most female actors
    working for him?
  • In which years were films shot that starred
    actors born between 1926 and 1935?
  • Such questions can be encoded in a language known
    as Structured Query Language or SQL. All
    relational database vendors implement a dialect
    of SQL.
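
The two questions can be written in SQL as follows. This is a sketch in the SQLite dialect, run from Python, using the illustrative data from the earlier slides. All table and column names are assumptions, and "most female actors" is read here as the number of distinct actresses per director, which is one possible interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT,
                    director_id TEXT, year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT, name TEXT,
                    birth_year INTEGER);
CREATE TABLE casting (actor_id TEXT, movie_id TEXT);  -- many-to-many link
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942), ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942)])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974), ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987), ("M6", "Blowing in the Wind", "D3", 1962)])
conn.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
    ("A1", "f", "Brigitte Bardot", 1972), ("A2", "m", "George Clooney", 1927),
    ("A3", "f", "Marilyn Monroe", 1934)])
conn.executemany("INSERT INTO casting VALUES (?, ?)", [
    ("A1", "M4"), ("A2", "M3"), ("A3", "M2"), ("A1", "M5"),
    ("A1", "M3"), ("A2", "M6"), ("A3", "M4")])

# Question 1: the director with the most distinct female actors.
top_director = conn.execute("""
    SELECT d.name, COUNT(DISTINCT a.id) AS actresses
    FROM director d
    JOIN movie m ON m.director_id = d.id
    JOIN casting c ON c.movie_id = m.id
    JOIN actor a ON a.id = c.actor_id
    WHERE a.sex = 'f'
    GROUP BY d.id
    ORDER BY actresses DESC
    LIMIT 1
""").fetchone()

# Question 2: years of films starring actors born between 1926 and 1935.
years = [row[0] for row in conn.execute("""
    SELECT DISTINCT m.year
    FROM movie m
    JOIN casting c ON c.movie_id = m.id
    JOIN actor a ON a.id = c.actor_id
    WHERE a.birth_year BETWEEN 1926 AND 1935
    ORDER BY m.year
""")]

print(top_director, years)
```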

61
databases in libraries
  • Relational databases dominate the world of
    structured data
  • But they are not so popular in libraries:
  • They are slow on very large databases (such as
    catalogs).
  • Library data has nasty ad-hoc relationships, e.g.
  • Translation of the first edition of a book
  • CD supplement that comes with the print version
  • These are difficult to deal with in a system
    where all relations and fields have to be set up
    at the start and cannot easily be changed later.

62
http://openlib.org/home/krichel
  • Thank you for your attention!