ISYS 300 Week 2 Document Representation - PowerPoint PPT Presentation

Provided by: xia52

1
ISYS 300 - Week 2: Document Representation
  • Dr. Xia Lin
  • Associate Professor
  • College of Information Science and Technology
  • Drexel University

2
QUIZ
  • Question 1: What are the two major components of
    IR systems?
  • Question 2: What are the two abstraction
    principles of IR?

3
Review of Last Week
  • Information retrieval systems
  • The user requests information
  • The system processes queries
  • An IR system matches two abstractions:
  • abstraction of information needs
  • abstraction of data from text
  • Differences between
  • data, information, and knowledge

4
Documents
  • Logical Units of Text
  • Units of records (text + other components)
  • Units that can be stored, retrieved, and
    displayed as a unique entity
  • Units of semantic entity
  • units of text grouped together for a purpose
  • Units of unformatted text
  • Text as written by authors of documents.

5
Document Representation
  • Documents need to be represented in concise and
    identifiable formats/structures.
  • Not every word of the text is meaningful for
    searching/retrieval.
  • Documents themselves do not have identifiable
    attributes such as author and title.

6
Document Representation
  • Document representation helps users identify and
    receive information from the system.
  • identify authors and titles
  • identify subjects
  • provide summaries/abstracts
  • classify subject categories
  • Document Citation
  • A set of information to make it easy to identify
    a document.

7
Document Surrogates
  • Each document should have a unique identifier
  • Accession (sequential) number
  • Classification number
  • Barcodes
  • ISBN number
  • Good for the computer but not enough for the
    user?
  • Go to bookstore and get the book 0-471-14338-3.
  • Do you want to have 200737-103146 for dinner?

8
Document Indexing
  • Computerized Indexing
  • Indexing based on citations
  • Indexing based on full text
  • Subject indexing
  • Creating a set of controlled vocabularies
    (thesauri or subject headings) to represent
    documents
  • Assigning terms from the controlled vocabularies
    to documents

9
Computerized Indexing
  • The computer creates index files based on
    document surrogates
  • to improve access speed
  • to increase access points
  • to improve precision
  • to reduce false drops
  • to identify similar documents
  • How?

10
Computerized Indexing
  • Title indexing
  • Sort all the titles alphabetically
  • Ignore a leading "a" or "the"
  • Convert all letters to uppercase.
  • Matching always starts from the beginning of the
    title (not individual words).
  • Most early IR systems (such as library catalogs)
    used title indexing
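
The title-indexing rules above can be sketched in a few lines of Python. This is only an illustration, not how any particular catalog worked: the function names are invented for this example, treating "an" as a leading article is an added assumption, and the three course titles are borrowed from the examples later in this deck.

```python
import bisect

# "A" and "THE" come from the slide; "AN" is an added assumption.
LEADING_ARTICLES = ("A ", "AN ", "THE ")

def title_key(title):
    """Return the filing form of a title: uppercase, leading article dropped."""
    key = " ".join(title.upper().split())
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

def build_title_index(titles):
    """Sort (filing key, original title) pairs alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def search_by_title_start(index, query):
    """Return titles whose filing form begins with the query."""
    prefix = title_key(query)
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key, title in index[start:]:
        if not key.startswith(prefix):
            break
        matches.append(title)
    return matches

title_index = build_title_index([
    "Introduction to information systems",
    "Human computer interaction",
    "Information retrieval systems",
])
print(search_by_title_start(title_index, "information"))
print(search_by_title_start(title_index, "systems"))
```

The second search finds nothing even though two titles contain "systems", because matching starts from the beginning of the title. That limitation is what keyword indexing addresses.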

11
Keyword indexing
  • Parse every individual word from the documents
  • First decision: What is a word?
  • Are digits words?
  • How about letter and digit combinations such as
    B6 and B12?
  • Is F-16 one word or two words?
  • Hyphens
  • Online, on-line, on line?
  • F-16
  • List all the words alphabetically with pointers
    back to the documents: inverted indexing.
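
The "what is a word?" decisions can be made explicit as parameters of a tokenizer. The sketch below is one possible set of choices, not a standard; the function name and both flags are hypothetical.

```python
import re

def tokenize(text, keep_digits=True, split_hyphens=False):
    """Split text into lowercase words under explicit parsing decisions.

    keep_digits:   if False, tokens made only of digits are dropped.
    split_hyphens: if True, "F-16" becomes "f" and "16";
                   otherwise it stays one word, "f-16".
    """
    text = text.lower()
    if split_hyphens:
        text = text.replace("-", " ")
    # A word is a run of letters/digits, optionally joined by single hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    if not keep_digits:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

sample = "The F-16 and vitamin B12 on-line"
print(tokenize(sample))
print(tokenize(sample, split_hyphens=True))
print(tokenize(sample, keep_digits=False, split_hyphens=True))
```

Whatever choices are made at indexing time must also be applied to queries, otherwise a search for "on-line" will not match a document indexed under "on" and "line".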

12
Phrase indexing
  • There is no safe way to parse phrases out of
    titles or full text of documents.
  • One way to do phrase indexing is by position: if
    two words are used next to each other, they are
    (potentially) a phrase.
  • Most phrase indexes are done manually.
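
Phrase indexing by position can be sketched as pairing each word with its right-hand neighbor. The function name and the small stop-word list are assumptions made for this illustration.

```python
# Assumed stop-word list; a pair containing one of these is not a candidate.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})

def candidate_phrases(text):
    """Return adjacent word pairs that are potentially phrases."""
    words = text.lower().split()
    phrases = []
    for first, second in zip(words, words[1:]):
        if first not in STOP_WORDS and second not in STOP_WORDS:
            phrases.append(f"{first} {second}")
    return phrases

print(candidate_phrases("Information retrieval theories and systems"))
```

The result contains "information retrieval", a real phrase, but also "retrieval theories", which is only an accident of adjacency. This is why position alone gives potential phrases and why phrase indexes are often built manually.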

13
Inverted Indexing
  • Purpose
  • Preparing documents for search engines to search
  • Objective
  • Create a sorted list of words with pointers
    indicating in which documents, and WHERE in
    them, the words appear.
  • Process the list in many different ways to meet
    the retrieval needs

14
Inverted Indexing
  • An inverted index consists of an ordered list of
    indexing terms; each indexing term is associated
    with some document identification numbers.
  • Retrieval is done by first searching the
    ordered list to find the indexing term, then
    using the document identification numbers to
    locate the documents.
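
A minimal sketch of this two-stage retrieval follows, assuming the index is held as two parallel lists. The variable and function names are invented here, and the terms, identifiers, and titles are the course examples used in this deck.

```python
import bisect

# Ordered list of indexing terms and, in parallel, their document IDs.
TERMS = ["computer", "human", "information", "interaction",
         "introduction", "retrieval", "systems", "theories"]
POSTINGS = [["ISYS110"], ["ISYS110"], ["ISYS102", "ISYS300"], ["ISYS110"],
            ["ISYS102"], ["ISYS300"], ["ISYS102", "ISYS300"], ["ISYS300"]]

DOCUMENTS = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}

def lookup(term):
    """Stage 1: binary-search the ordered term list for the term."""
    term = term.lower()
    i = bisect.bisect_left(TERMS, term)
    if i < len(TERMS) and TERMS[i] == term:
        return POSTINGS[i]
    return []

def retrieve(term):
    """Stage 2: use the document IDs to locate the documents."""
    return [DOCUMENTS[doc_id] for doc_id in lookup(term)]

print(retrieve("systems"))
print(retrieve("unix"))
```

Because the term list is sorted, the search never has to scan the documents themselves, which is where the speed of an inverted index comes from.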

15
Examples
  • ISYS102 Introduction to information systems
  • ISYS110 Human computer interaction
  • ISYS300 Information retrieval systems

16
Step 1: Generate a list of all the words
  • ISYS102 Introduction to information systems
  • ISYS110 Human computer interaction
  • ISYS300 Information retrieval theories and
    systems

  • ISYS102 Introduction
  • ISYS102 to
  • ISYS102 information
  • ISYS102 systems
  • ISYS110 human
  • ISYS110 computer
  • ISYS110 interaction
  • ISYS300 information
  • ISYS300 retrieval
  • ISYS300 theories
  • ISYS300 and
  • ISYS300 systems

17
Step 2: Remove stop words
  • ISYS102 Introduction
  • ISYS102 to (stop word, removed)
  • ISYS102 information
  • ISYS102 systems
  • ISYS110 human
  • ISYS110 computer
  • ISYS110 interaction
  • ISYS300 information
  • ISYS300 retrieval
  • ISYS300 theories
  • ISYS300 and (stop word, removed)
  • ISYS300 systems

18
Step 3: Invert the list
  • Introduction ISYS102
  • Information ISYS102
  • Systems ISYS102
  • Human ISYS110
  • Computer ISYS110
  • Interaction ISYS110
  • Information ISYS300
  • Retrieval ISYS300
  • Theories ISYS300
  • Systems ISYS300

19
Step 4: Sort the list
  • Computer ISYS110
  • Human ISYS110
  • Information ISYS102
  • Information ISYS300
  • Interaction ISYS110
  • Introduction ISYS102
  • Retrieval ISYS300
  • Systems ISYS102
  • Systems ISYS300
  • Theories ISYS300

20
Step 5: Merge same words in the list
  • Computer ISYS110
  • Human ISYS110
  • Information ISYS102, ISYS300
  • Interaction ISYS110
  • Introduction ISYS102
  • Retrieval ISYS300
  • Systems ISYS102, ISYS300
  • Theories ISYS300
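
The five steps can be followed literally in code. This is a sketch rather than a production indexer: the function name is invented, and the stop-word list contains only the two words dropped in this example ("to" and "and").

```python
STOP_WORDS = {"to", "and"}  # only the stop words dropped in this example

DOCUMENTS = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}

def build_inverted_index(documents):
    # Step 1: generate a list of all the words as (document, word) pairs.
    pairs = [(doc_id, word.lower())
             for doc_id, title in documents.items()
             for word in title.split()]
    # Step 2: remove stop words.
    pairs = [(doc_id, word) for doc_id, word in pairs
             if word not in STOP_WORDS]
    # Step 3: invert the list so that the word comes first.
    inverted = [(word, doc_id) for doc_id, word in pairs]
    # Step 4: sort the list alphabetically by word, then by document.
    inverted.sort()
    # Step 5: merge entries that share the same word.
    index = {}
    for word, doc_id in inverted:
        postings = index.setdefault(word, [])
        if doc_id not in postings:
            postings.append(doc_id)
    return index

index = build_inverted_index(DOCUMENTS)
for word, doc_ids in index.items():
    print(word, ", ".join(doc_ids))
```

Words are lowercased here so that "Information" and "information" merge into one entry; the slides show them capitalized, which is only a display choice.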

21
Example: Retrieving in an inverted file
  • computer 110TI02
  • design 110TX01
  • human 110TI01
  • information 102TI02, 102TX01, 300TI01, 300TX02
  • interaction 110TI03
  • interface 110TX03
  • introduction 102TI01
  • management 102TX03
  • retrieval 300TI02, 300TX03
  • systems 102TI03, 102TX02, 300TI03, 300TX04
  • text 300TX01
  • user 110TX02
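
Each code in this file packs three things together: a document number, a field, and a word position. The sketch below reads TI as the title field and TX as a text field, which is an assumption since the slide does not spell out TX; the function names are also invented for illustration.

```python
import re

# Inverted file copied from the slide: term -> posting codes.
INVERTED_FILE = {
    "computer": ["110TI02"],
    "design": ["110TX01"],
    "human": ["110TI01"],
    "information": ["102TI02", "102TX01", "300TI01", "300TX02"],
    "interaction": ["110TI03"],
    "interface": ["110TX03"],
    "introduction": ["102TI01"],
    "management": ["102TX03"],
    "retrieval": ["300TI02", "300TX03"],
    "systems": ["102TI03", "102TX02", "300TI03", "300TX04"],
    "text": ["300TX01"],
    "user": ["110TX02"],
}

def parse_posting(code):
    """Split a code such as '110TI02' into (document, field, position)."""
    match = re.fullmatch(r"(\d+)([A-Z]{2})(\d+)", code)
    if match is None:
        raise ValueError(f"unrecognized posting code: {code!r}")
    return match.group(1), match.group(2), int(match.group(3))

def postings(term):
    return [parse_posting(c) for c in INVERTED_FILE.get(term.lower(), [])]

def search(term, field=None):
    """Documents containing the term, optionally limited to one field."""
    return {doc for doc, fld, _ in postings(term) if field in (None, fld)}

def search_all(terms, field=None):
    """Boolean AND: documents containing every term."""
    results = [search(term, field) for term in terms]
    return set.intersection(*results) if results else set()

def search_phrase(first, second):
    """Documents where `second` directly follows `first` in one field."""
    following = set(postings(second))
    return {doc for doc, fld, pos in postings(first)
            if (doc, fld, pos + 1) in following}

print(search("retrieval", field="TI"))
print(search_all(["information", "systems"]))
print(search_phrase("information", "systems"))
```

The last two calls show why positions matter: both documents 102 and 300 contain "information" and "systems", but only in 102 do the two words stand next to each other.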

22
Second Example
  • 1 TI: Cats and dogs: Best friends of Kate
  • DE: Cats Dogs fiction
  • 2 TI: New methods of feeding cats and dogs
  • DE: cats dogs feeding behaviors
  • 3 TI: Canine mandibular structure
  • DE: Dogs Anatomy Musculature skeleton
  • 4 TI: It rains like cats and dogs last night
  • DE: mystery fiction

23
The Inverted Indexing File
  • anatomy 03DE02
  • behaviors 02DE04
  • canine 03TI01
  • cats 01DE01, 01TI01, 02DE01, 02TI04
  • cultural 01DE04
  • dogs 01DE02, 01TI02, 02DE02, 02TI05,
    03DE01
  • enemies 01TI04
  • feeding 02DE03, 02TI03
  • mandibular 03TI02
  • methods 02TI02
  • misunderstood 01TI06
  • mortal 01TI03
  • musculature 03DE03
  • new 02TI01
  • simply 01TI05
  • skeleton 03DE04
  • structure 03TI03
  • studies 01DE05
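
The posting codes in this file follow a pattern: two-digit record number, field (TI or DE), and two-digit word position. A sketch of a builder that generates such codes is shown below for records 2 and 3. Two things are inferred rather than stated in the deck: the stop words "of" and "and", and the rule that positions are counted after stop words are removed (suggested by "cats 02TI04"). The function name is invented.

```python
STOP_WORDS = {"and", "of"}  # inferred from the positions in the file

RECORDS = {
    2: {"TI": "New methods of feeding cats and dogs",
        "DE": "cats dogs feeding behaviors"},
    3: {"TI": "Canine mandibular structure",
        "DE": "Dogs Anatomy Musculature skeleton"},
}

def build_positional_index(records):
    """Map each word to codes of the form <record><field><position>."""
    index = {}
    for record_id, fields in records.items():
        for field, text in fields.items():
            # Positions are counted after stop words are dropped.
            words = [w for w in text.lower().split() if w not in STOP_WORDS]
            for position, word in enumerate(words, start=1):
                code = f"{record_id:02d}{field}{position:02d}"
                index.setdefault(word, []).append(code)
    return {word: sorted(codes) for word, codes in sorted(index.items())}

positional_index = build_positional_index(RECORDS)
for word, codes in positional_index.items():
    print(word, ", ".join(codes))
```

For these two records the generated codes can be compared entry by entry with the corresponding lines of the inverted index file.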


24
Example: Create an inverted index for the following

25
Unix Basics
  • Unix
  • A powerful operating system
  • Multitasking, multithreaded, multi-user OS
  • Excellent host for IR systems and databases as
    well as web servers.
  • Command-based access

26
Subject Indexing
  • A human analytic process for identifying,
    selecting, and representing document concepts
  • Create indexing languages
  • Using standardized, limited vocabularies for
    indexing purposes.
  • Assign indexing terms to documents
  • Using only the terms in the index language
    selected.

27
Second Example
  • 1 TI: Cats and dogs: Best friends of Kate
  • DE: Cats Dogs fiction
  • 2 TI: New methods of feeding cats and dogs
  • DE: cats dogs feeding behaviors
  • 3 TI: Canine mandibular structure
  • DE: Dogs Anatomy Musculature skeleton
  • 4 TI: It rains like cats and dogs last night
  • DE: mystery fiction

28
Second Example
  • 1 TI: Cats and dogs: Best friends of Kate
  • 2 TI: New methods of feeding cats and dogs
  • 3 TI: Canine mandibular structure
  • 4 TI: It rains like cats and dogs last night

29
Metadata
  • Metadata are data about data
  • to describe features of the data (digital
    objects)
  • Content: what the object is about
  • Context: the who, what, why, where, and how
    aspects associated with the object
  • Structure: associations within or among
    individual objects

30
Example: Identify content, context, and
structure in the following
  • Author: Arms, William Y.
  • Title: Digital libraries / William Y. Arms.
  • Imprint: Cambridge, Mass.: MIT Press, c2000.
  • Call number: Z692.C65 A76 2000
  • Description: x, 287 p.: ill.; 24 cm.
  • Series: Digital libraries and electronic
    publishing
  • Note: Includes index.
  • Subject:
  • Libraries -- United States -- Special collections
    -- Electronic information resources.
  • Digital libraries -- United States.
  • ISBN: 0262011808 (alk. paper)

31
Why Metadata?
  • Metadata is a key to ensuring that resources will
    survive and continue to be accessible into the
    future.
  • Standards
  • Structures and organization
  • Content and context

32
Functions of Metadata
  • To help organize resources
  • To facilitate resource discovery
  • To facilitate interoperability
  • To support digital identification
  • To support archiving and preservation

33
Types of Metadata
  • Descriptive
  • Title, abstract, keywords
  • Administrative
  • Who created it and how
  • Rights management
  • Structural
  • Relationships among objects

34
Attributes of Metadata
  • Source of metadata
  • Nature of metadata
  • Structure
  • Conform to a standard
  • Semantics
  • Controlled vocabulary or not
  • Level
  • How detailed the metadata are.

35
Metadata Schemes
  • A metadata schema provides a formal structure
    designed to identify the knowledge structure of a
    given discipline and to link that structure to
    the information of the discipline through the
    creation of an information system that will
    assist the identification, discovery and use of
    information within that discipline.

36
  • Schemes are sets of metadata elements used to
    describe a resource
  • Semantics: definitions and meanings of the
    metadata elements
  • Contents: values given to metadata elements
  • Content rules: what values should be used and how
    the values should be formulated.
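
To make the three parts of a scheme concrete, here is a toy scheme with three elements, each carrying its semantics and a content rule. The scheme itself, the element names, and the function names are invented for illustration and do not correspond to any published standard; the sample record reuses the catalog entry for "Digital libraries" from this deck, and the ISBN rule is the usual ISBN-10 check-digit test.

```python
import re

def is_isbn10(value):
    """Content rule: the weighted digit sum of an ISBN-10 is divisible by 11."""
    digits = value.replace("-", "")
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0

def is_non_empty(value):
    """Content rule: the value must contain some text."""
    return bool(value.strip())

# Hypothetical scheme: element -> semantics and content rule.
SCHEME = {
    "title": {"semantics": "name given to the resource", "rule": is_non_empty},
    "author": {"semantics": "person responsible for the content",
               "rule": is_non_empty},
    "isbn": {"semantics": "identifier of the book", "rule": is_isbn10},
}

def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for element, spec in SCHEME.items():
        if element not in record:
            problems.append(f"{element}: missing")
        elif not spec["rule"](record[element]):
            problems.append(f"{element}: breaks content rule")
    return problems

record = {"title": "Digital libraries", "author": "Arms, William Y.",
          "isbn": "0262011808"}
print(validate(record))
print(validate({"title": "Digital libraries", "isbn": "0262011809"}))
```

The values in `record` are the contents, the `semantics` strings say what each element means, and the `rule` functions are the content rules.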

37
XML
  • XML stands for eXtensible Markup Language
  • Designed to separate style, content, context,
    and presentation in the web environment
  • Designed to deploy content-specific tags for
    content indexing and retrieval.
  • Designed as a subset of SGML

38
Example
  <?xml version="1.0" encoding="utf-8" ?>
  <book isbn="0836217462">
    <title>Being a Dog Is a Full-Time Job</title>
    <author>Charles M. Schulz</author>
    <character>
      <name>Snoopy</name>
      <friend-of>Peppermint Patty</friend-of>
      <since>1950-10-04</since>
      <qualification>extroverted beagle</qualification>
    </character>
    <character>
      <name>Peppermint Patty</name>
      <since>1966-08-22</since>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
  </book>
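
Content-specific tags make it possible to retrieve by meaning rather than by position in the text. As a small sketch, the book record can be read with Python's standard `xml.etree.ElementTree` module; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0" encoding="utf-8" ?>
<book isbn="0836217462">
  <title>Being a Dog Is a Full-Time Job</title>
  <author>Charles M. Schulz</author>
  <character>
    <name>Snoopy</name>
    <friend-of>Peppermint Patty</friend-of>
    <since>1950-10-04</since>
    <qualification>extroverted beagle</qualification>
  </character>
  <character>
    <name>Peppermint Patty</name>
    <since>1966-08-22</since>
    <qualification>bold, brash and tomboyish</qualification>
  </character>
</book>"""

# Parse from bytes so the encoding declaration is honored.
book = ET.fromstring(BOOK_XML.encode("utf-8"))

isbn = book.get("isbn")                      # attribute of the root element
title = book.findtext("title")               # text of a child element
names = [c.findtext("name") for c in book.findall("character")]
# Characters that declare a friend, found through the <friend-of> tag.
friends = {c.findtext("name"): c.findtext("friend-of")
           for c in book.findall("character")
           if c.find("friend-of") is not None}

print(isbn, title)
print(names)
print(friends)
```

A plain keyword search for "Peppermint Patty" could not tell whether she is a character or somebody's friend; the tags can.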

39
XML is an industry in itself
  • Major software companies have implemented some
    kind of XML-related software
  • XML-related standards are under continual
    development.
  • XSL: Extensible Stylesheet Language
  • XSLT: Extensible Stylesheet Language
    Transformations
  • XSLT enables interoperability
  • XLink: XML Linking Language
  • Assigns meanings to links
  • RDF: Resource Description Framework

40
XML Example (www.XML.com)
  <?xml version="1.0"?>
  <artistinfo>
    <surname>Modigliani</surname>
    <name>Amadeo</name>
    <born>July 12, 1884</born>
    <died>January 24, 1920</died>
    <biography>
      <p>In 1906, Modigliani settled in Paris, where ...</p>
    </biography>
  </artistinfo>

41
Example
  <?xml version="1.0"?>
  <period>
    <city>Paris</city>
    <country>France</country>
    <timeframe begin="1900" end="1920"/>
    <title>Paris in the early 20th century (up to the twenties)</title>
    <end>Amadeo</end>
    <description>
      <p>During this period, Russian, Italian, ...</p>
    </description>
  </period>

42
  <environment xmlns:xlink="http://www.w3.org/1999/xlink"
               xlink:type="extended">
    <!-- The resources involved in our link are the artist -->
    <!-- himself, his influences and the historical references -->
    <artist xlink:type="locator" xlink:label="artist"
            xlink:href="modigliani.xml"/>
    <influence xlink:type="locator" xlink:label="inspiration"
               xlink:href="cezanne.xml"/>
    <influence xlink:type="locator" xlink:label="inspiration"
               xlink:href="lautrec.xml"/>
    <influence xlink:type="locator" xlink:label="inspiration"
               xlink:href="rouault.xml"/>
    <history xlink:type="locator" xlink:label="period"
             xlink:href="paris.xml"/>
    <history xlink:type="locator" xlink:label="period"
             xlink:href="kisling.xml"/>
  </environment>
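
The labels on the locators are what give the links meaning. A short sketch of reading them with the standard `xml.etree.ElementTree` module follows; the function name is invented, and the XML comments are left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

ENVIRONMENT_XML = """<environment xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
  <artist xlink:type="locator" xlink:label="artist"
          xlink:href="modigliani.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="cezanne.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="lautrec.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="rouault.xml"/>
  <history xlink:type="locator" xlink:label="period"
           xlink:href="paris.xml"/>
  <history xlink:type="locator" xlink:label="period"
           xlink:href="kisling.xml"/>
</environment>"""

def locators_by_label(xml_text):
    """Group the linked resources of an extended link by their label."""
    root = ET.fromstring(xml_text)
    groups = {}
    for element in root:
        # ElementTree spells namespaced attributes as {namespace}name.
        if element.get(f"{{{XLINK_NS}}}type") == "locator":
            label = element.get(f"{{{XLINK_NS}}}label")
            href = element.get(f"{{{XLINK_NS}}}href")
            groups.setdefault(label, []).append(href)
    return groups

links = locators_by_label(ENVIRONMENT_XML)
print(links["inspiration"])
```

Unlike a plain HTML hyperlink, each target here carries a role ("artist", "inspiration", "period") that a program can act on.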

43
Discussion
  • Differences between XML and HTML?
  • Relationships between XML and metadata?

44
XML/Metadata Tools
  • Reggie
  • a metadata editor
  • Output: RDF, HTML, ...
  • a Java application
  • URL: http://metadata.net/dstc/

45
DC DOT
  • http://www.ukoln.ac.uk/metadata/dcdot/
  • Exercises
  • Add Dublin Core Headings to the class Web page.
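
For the exercise, Dublin Core headings in a Web page are commonly written as `<meta>` tags whose names start with `DC.`, preceded by a `<link>` that declares the DC namespace. The sketch below generates such tags; the function name is invented, and the field values are taken from the title slide of this deck purely as sample data.

```python
from html import escape

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

def dublin_core_meta(fields):
    """Render Dublin Core elements as HTML <meta> tags for a page head."""
    lines = [f'<link rel="schema.DC" href="{DC_NAMESPACE}">']
    for element, value in fields.items():
        # escape() protects quotes and angle brackets inside values.
        lines.append(
            f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)

headings = dublin_core_meta({
    "title": "ISYS 300 Week 2 Document Representation",
    "creator": "Xia Lin",
    "publisher": "Drexel University",
})
print(headings)
```

The printed lines would be pasted inside the `<head>` element of the page.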

46
Writing an XML Document
  • An XML document must be well formed:
  • A root element is required.
  • Closing tags are required.
  • Elements must be properly nested.
  • Case matters.
  • Entity references other than the five predefined
    ones must be declared in a DTD.
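
Each rule can be tried out with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well formed; the helper name `is_well_formed` is invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

checks = {
    "well formed": "<period><city>Paris</city></period>",
    "two root elements": "<city>Paris</city><country>France</country>",
    "missing closing tag": "<period><city>Paris</period>",
    "improper nesting": "<period><city>Paris</period></city>",
    "case mismatch": "<City>Paris</city>",
    "undeclared entity": "<city>Paris&nbsp;</city>",
    "predefined entity": "<city>Paris &amp; Lyon</city>",
}
for description, text in checks.items():
    print(f"{description}: {is_well_formed(text)}")
```

Only the first and last samples pass. Note that this checks well-formedness only; checking a document against a DTD or schema (validity) is a separate step.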

47
XML document content
  <title>NASA Image Exchange</title>
  <site>http://nix.nasa.gov/</site>
  <metadata>
    <repository-name>NASA Image Exchange</repository-name>
    <category>
      <label>CATEGORY</label>
      <data>images</data>
    </category>

48
XML Schema

49
XML Document Headings
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/css"
    href="http://project.cis.drexel.edu/classes/isys300/XML/repository.css" ?>
  <repository
    xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://project.cis.drexel.edu/classes/info653/XML/DLRepository.xsd">

50
Style Sheet
  repository { display: block; font-size: large; color: Maroon }
  title { display: block; font-size: large; text-align: center }
  site { display: block; text-align: center }
  metadata { float: right; clear: right; width: 225px;
    border: thin solid Teal; padding: 10px }
  repository-name { display: block; font-size: medium;
    background: Navy; color: Yellow; text-align: center }
  label { display: block; font-size: medium }
  data { display: block; font-size: small; color: blue;
    position: relative; left: 9px }
  description { display: block }
  review { display: block; color: black }
  name { display: block; text-align: right; color: Blue;
    font-size: small }
  term { display: none }

51
Controlled Vocabulary
  • Goals
  • To permit easy location of documents by topic.
  • To define topic areas, and hence relate one
    document to another.
  • To provide multiple access points to documents.
  • To enforce uniformity throughout an information
    retrieval system.

52
Controlled Vocabulary
  • Formats
  • Hierarchical Classified list
  • hierarchical subject descriptors
  • associative cross references
  • classification notation (codes)
  • Alphabetical list
  • include both descriptors and other lead-in terms

53
Main Components in a Controlled Vocabulary
  • Broader Term
  • Keyword/Descriptor
  • Synonymous Term
  • Related Term
  • Narrower Term

54
Example
  • Keyword/Descriptor: Cancer
  • Broader Terms: Diseases, Neoplasms
  • Related Terms: Abdominal Neoplasms, Hyperplasia,
    Seminoma
  • Synonyms: Malignancy, Malignant tumor, Cancer
    morphology
  • Narrower Terms: Malignant neoplasm of skin,
    Breast Cancer, Primary malignant neoplasm of
    liver

55
Controlled Vocabulary
  • Examples
  • Case studies (descriptor)
  • SN: Detailed analyses, usually focusing on a
    particular problem of an individual, group, or
    organization (note: do not confuse with medical
    case histories)
  • NT:
  • Cross sectional studies
  • Longitudinal studies

56
Examples (Case Studies)
  • BT:
  • Evaluation methods
  • Research
  • RT:
  • Case records
  • Counseling
  • Qualitative research
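
The "Case studies" entry can be held in a small data structure and used to widen a search. This is a sketch under assumptions: the dictionary layout and the function name `expand_query` are invented for illustration, while the BT, NT, and RT terms are the ones listed for this descriptor.

```python
# One thesaurus entry: BT = broader, NT = narrower, RT = related terms.
THESAURUS = {
    "case studies": {
        "BT": ["evaluation methods", "research"],
        "NT": ["cross sectional studies", "longitudinal studies"],
        "RT": ["case records", "counseling", "qualitative research"],
    },
}

def expand_query(term, relations=("NT",)):
    """Return the term plus its thesaurus terms for the given relations."""
    term = term.lower()
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        for other in entry.get(relation, []):
            if other not in expanded:
                expanded.append(other)
    return expanded

print(expand_query("Case studies"))
print(expand_query("Case studies", relations=("NT", "RT")))
print(expand_query("Unix"))
```

Adding narrower terms usually finds more documents on the same topic, while adding broader or related terms widens the search at some cost in precision.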

57
Advantages of Subject Indexing
  • facilitates concept search
  • search by topics/subjects, not just by words
  • link related documents by subject terms
  • Makes implicit information explicit
  • Provides a standard terminology to index and
    search documents.
  • Uses a small indexing vocabulary
  • Helps the searcher find related terms

58
Disadvantages of Subject Indexing
  • Expensive manual operations
  • To construct the controlled vocabulary
  • To assign terms to documents
  • Difficult to keep up to date
  • Terminology changes very fast
  • New terms are added daily.
  • Inconsistent process of human indexing
  • Same documents are assigned different indexing
    terms by different indexers
  • The user may not use the same terms to find
    documents as the indexer would use to index the
    documents.

59
Two Examples of Document Representation
  • Controlled Vocabulary
  • human-based indexing
  • subject-based indexing
  • Inverted indexing
  • computer-based indexing
  • statistics-based indexing

60
Considerations of Document Representation
  • Discriminating power
  • to identify a document uniquely
  • to reduce ambiguity
  • Examples
  • ISBNs for books
  • bar codes for products

61
  • Descriptiveness
  • describe all the information as completely as
    possible
  • full text
  • abstracts
  • extracts
  • reviews
  • Completeness and correctness

62
Considerations of DR
  • Similarity Identification
  • to group similar documents
  • keywords or subject indexing
  • book classification numbers
  • It is difficult for the computer to assign
    keywords, subject descriptors, or classification
    numbers to documents

63
Considerations of DR
  • Conciseness
  • simple and clear
  • reduce processing time and storage space
  • Examples
  • authors and titles
  • Needed by both the computer and the user

64
Relationships of four considerations
  • Higher discriminating power may lower the
    capability of identifying similarities among
    documents.
  • Good descriptiveness may defeat conciseness.
  • What's good for the computer may not always be
    good for the user.
  • A good representation should seek a balance of
    the four, and take into consideration both the
    computer and the user.