Title: ISYS 300 Week 2 Document Representation
1ISYS 300 - Week 2: Document Representation
- Dr. Xia Lin 
- Associate Professor 
- College of Information Science and Technology 
- Drexel University
2QUIZ
- Question 1: What are the two major components of 
 IR systems?
-   
-   
- Question 2: What are the two abstraction 
 principles of IR?
3Review of Last Week 
- Information retrieval systems 
- The user requests information 
- The system processes queries 
- An IR system matches two abstractions 
- abstraction of information needs 
- abstraction of data from text 
- Differences between 
- Data, information, knowledge
4Documents
- Logical Units of Text 
- Units of records (text + other components) 
- Units that can be stored, retrieved, and 
 displayed as a unique entity
- Units of semantic entity 
- units of text grouped together for a purpose 
- Units of unformatted text 
- Text as written by authors of documents. 
5Document Representation
- Documents need to be represented in concise and 
 identifiable formats/structures.
- Not every word of the text is meaningful for 
 searching/retrieval.
- Documents themselves do not have identifiable 
 attributes such as authors and titles.
6Document Representation
- Document representation helps users identify and 
 receive information from the system.
- identify authors and titles 
- identify subjects 
- provide summaries/abstracts 
- classify subject categories 
- Document Citation 
- A set of information to make it easy to identify 
 a document.
7Document Surrogates
- Each document should have a unique identifier 
- Accession (sequential) number 
- Classification number 
- Barcodes 
- ISBN number 
- Good for the computer but not enough for the 
 user?
- Go to bookstore and get the book 0-471-14338-3. 
- Do you want to have 200737-103146 for dinner?
8Document Indexing
- Computerized Indexing 
- Indexing based on citations 
- Indexing based on full text 
- Subject indexing 
- Creating a set of controlled vocabularies (thesaurus 
 or subject headings) to represent documents
- Assigning terms from the controlled vocabularies to 
 documents
9Computerized Indexing
- The computer creates index files based on 
 document surrogates
- to improve access speed 
- to increase access points 
- to improve precision 
- to reduce false drops 
- to identify similar documents 
- How? 
10Computerized Indexing
- Title indexing 
- Sort all the titles alphabetically 
- Ignore a leading "a" or "the" 
- Convert all letters to uppercase. 
- Matching always starts from the beginning of the 
 title (not individual words).
- Most early IR systems (such as library catalogs) 
 used title indexing
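The title-indexing rules above can be sketched in a few lines of Python. This is only an illustration: the sample titles, the function names, and the extra article "AN" are assumptions, not part of the slides.

```python
import bisect

# Articles ignored at the start of a title ("AN" is an added assumption).
LEADING_ARTICLES = ("A ", "AN ", "THE ")

def title_key(title):
    """Uppercase the title and drop one leading article."""
    key = " ".join(title.upper().split())
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def search_title_prefix(index, query):
    """Match from the beginning of the title only, not individual words."""
    q = title_key(query)
    keys = [k for k, _ in index]
    start = bisect.bisect_left(keys, q)
    hits = []
    for key, title in index[start:]:
        if not key.startswith(q):
            break
        hits.append(title)
    return hits

# Hypothetical titles, for illustration only.
titles = [
    "The Information Retrieval Handbook",
    "A Guide to Indexing",
    "Information Systems",
]
index = build_title_index(titles)
print(search_title_prefix(index, "information"))  # both "Information ..." titles
print(search_title_prefix(index, "retrieval"))    # [] because no title starts with it
```

The second query returns nothing even though "retrieval" occurs inside a title, which is the limitation that keyword indexing addresses.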
11Keyword indexing
- Parsing every individual word from documents 
- First decision: What is a word? 
- Are digits words? 
- How about letter and digit combinations such as 
 B6, B12?
- Is F-16 one word or two words? 
- Hyphens 
- Online, on-line, on line? 
- F-16 
- List all the words alphabetically with pointers 
 back to documents: inverted indexing.
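The "what is a word?" decision can be made concrete with a small tokenizer. The sketch below assumes one possible policy (digits and letter-digit combinations count as words, and hyphens are either kept or split); the regular expression and the `split_hyphens` flag are illustrative choices, not rules from the slides.

```python
import re

# A token is a run of letters/digits, optionally joined by single hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text, split_hyphens=False):
    """Return lowercase tokens; optionally break hyphenated tokens apart."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if split_hyphens:
        tokens = [part for t in tokens for part in t.split("-")]
    return tokens

print(tokenize("F-16 on-line B12"))
# ['f-16', 'on-line', 'b12']
print(tokenize("F-16 on-line B12", split_hyphens=True))
# ['f', '16', 'on', 'line', 'b12']
```

Either policy is defensible; what matters is that the same policy is applied to both the documents and the queries.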
12Phrase indexing
- There is no safe way to parse phrases out of 
 titles or full text of documents.
- One way to do phrase indexing is by position: if 
 two words are used next to each other, they are
 (potentially) a phrase.
- Most phrase indexes are done manually. 
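Position-based phrase detection can be sketched as pairing each word with its right-hand neighbor. The function name is hypothetical, and the output shows why such pairs are only "potentially" phrases.

```python
def candidate_phrases(words):
    """Pair every word with the word that immediately follows it."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

words = "information retrieval theories and systems".split()
print(candidate_phrases(words))
# ['information retrieval', 'retrieval theories', 'theories and', 'and systems']
```

Only the first pair is a meaningful phrase; the rest are accidents of adjacency, which is one reason phrase indexes are often built manually.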
13Inverted Indexing
- Purpose 
- Preparing documents for search engines to search 
- Objective 
- Create a sorted list of words with pointers 
 indicating which documents the words appear in
 and WHERE in those documents.
- Process the list in many different ways to meet 
 the retrieval needs
14Inverted Indexing
- An inverted index consists of an ordered list of 
 indexing terms; each indexing term is associated
 with some document identification numbers.
- Retrieval is done by first searching in the 
 ordered list to find the indexing term, then
 using the document identification numbers to
 locate documents
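The two-stage retrieval described above (find the term in the ordered list, then follow the document numbers) might look like the following sketch. The data are taken from the course-title example used in this deck; the variable and function names are assumptions.

```python
import bisect

# Ordered list of indexing terms and, in parallel, their document numbers.
TERMS = ["computer", "human", "information", "interaction",
         "introduction", "retrieval", "systems", "theories"]
POSTINGS = [["ISYS110"], ["ISYS110"], ["ISYS102", "ISYS300"], ["ISYS110"],
            ["ISYS102"], ["ISYS300"], ["ISYS102", "ISYS300"], ["ISYS300"]]

# Document store keyed by document identification number.
DOCUMENTS = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}

def retrieve(term):
    """Binary-search the ordered term list, then locate the documents."""
    key = term.lower()
    i = bisect.bisect_left(TERMS, key)
    if i == len(TERMS) or TERMS[i] != key:
        return []
    return [(doc_id, DOCUMENTS[doc_id]) for doc_id in POSTINGS[i]]

print(retrieve("Systems"))
print(retrieve("database"))  # [] because the term is not in the index
```

Because the term list is sorted, the lookup never has to scan the documents themselves.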
15Examples
- ISYS102 Introduction to information systems 
- ISYS110 Human computer interaction 
-   
- ISYS300 Information retrieval systems 
16Step 1 Generate a list of all the words
- ISYS102 Introduction to information systems 
- ISYS110 Human computer interaction 
- ISYS300 Information retrieval theories and 
 systems
- ISYS102 Introduction 
- ISYS102 to 
- ISYS102 information 
- ISYS102 systems 
- ISYS110 human 
- ISYS110 computer 
- ISYS110 interaction 
- ISYS300 information 
- ISYS300 retrieval 
- ISYS300 theories 
- ISYS300 and 
- ISYS300 systems 
17Step 2 Remove stop words
- ISYS102 Introduction 
- ISYS102 to (stop word; removed) 
- ISYS102 information 
- ISYS102 systems 
- ISYS110 human 
- ISYS110 computer 
- ISYS110 interaction 
- ISYS300 information 
- ISYS300 retrieval 
- ISYS300 theories 
- ISYS300 and (stop word; removed) 
- ISYS300 systems
18Step 3 Invert the list
- Introduction ISYS102 
- Information ISYS102 
- Systems ISYS102 
- Human ISYS110 
- Computer ISYS110 
- Interaction ISYS110 
- Information ISYS300 
- Retrieval ISYS300 
- Theories ISYS300 
- Systems ISYS300
19Step 4 Sort the list
- Computer ISYS110 
- Human ISYS110 
- Information ISYS102 
- Information ISYS300 
- Interaction ISYS110 
- Introduction ISYS102 
- Retrieval ISYS300 
- Systems ISYS102 
- Systems ISYS300 
- Theories ISYS300 
20Step 5 Merge same words in the list
- Computer ISYS110 
- Human ISYS110 
- Information ISYS102, ISYS300 
- Interaction ISYS110 
- Introduction ISYS102 
- Retrieval ISYS300 
- Systems ISYS102, ISYS300 
- Theories ISYS300 
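Steps 1 through 5 can be reproduced with a short Python sketch. The stop-word set contains only the two words removed in this example ("to" and "and"); a real system would use a longer list. Variable names are illustrative.

```python
docs = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}
STOP_WORDS = {"to", "and"}  # only the stop words that occur in this example

# Step 1: generate a list of (document, word) pairs.
pairs = [(doc_id, word.lower())
         for doc_id, title in docs.items()
         for word in title.split()]

# Step 2: remove stop words.
pairs = [(doc_id, word) for doc_id, word in pairs if word not in STOP_WORDS]

# Step 3: invert the list so the word comes first.
inverted = [(word, doc_id) for doc_id, word in pairs]

# Step 4: sort the list alphabetically by word, then by document.
inverted.sort()

# Step 5: merge entries that share the same word.
index = {}
for word, doc_id in inverted:
    ids = index.setdefault(word, [])
    if not ids or ids[-1] != doc_id:  # skip repeats within one document
        ids.append(doc_id)

for word, ids in index.items():
    print(word.capitalize(), ", ".join(ids))
```

The printed result matches the merged list on the Step 5 slide, for example "Information ISYS102, ISYS300".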
21Example retrieving in an inverted file
- computer 110TI02 
- design 110TX01 
- human 110TI01 
- information 102TI02, 102TX01, 300TI01, 300TX02 
- interaction 110TI03 
- interface 110TX03 
- introduction 102TI01 
- management 102TX03 
- retrieval 300TI02, 300TX03 
- systems 102TI03, 300TI03, 300TX04, 102TX02 
- text 300TX01 
- user 110TX02
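The pointers in this inverted file appear to combine a document number, a field code, and a word position (for example, 300TI02 read as document 300, field TI, second word). Under that assumed reading, a phrase search can check that two words sit next to each other in the same field. Only three entries of the file are copied here, and the function names are hypothetical.

```python
import re

# A subset of the inverted file shown above.
INDEX = {
    "information": ["102TI02", "102TX01", "300TI01", "300TX02"],
    "retrieval": ["300TI02", "300TX03"],
    "systems": ["102TI03", "300TI03", "300TX04", "102TX02"],
}

def parse(code):
    """Split a pointer such as '300TI02' into (document, field, position)."""
    doc, field, pos = re.fullmatch(r"(\d+)([A-Z]{2})(\d{2})", code).groups()
    return doc, field, int(pos)

def phrase_search(first, second):
    """Return documents where `second` directly follows `first` in one field."""
    followers = {parse(code) for code in INDEX.get(second, [])}
    hits = set()
    for doc, field, pos in map(parse, INDEX.get(first, [])):
        if (doc, field, pos + 1) in followers:
            hits.add(doc)
    return sorted(hits)

print(phrase_search("information", "retrieval"))  # ['300']
print(phrase_search("information", "systems"))    # ['102']
```

Recording WHERE a word appears is what makes this kind of adjacency test possible without rereading the documents.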
22Second Examples
- 1 TI Cats and dogs Best friends of Kate 
-  DE Cats Dogs fiction  
- 2 TI New methods of feeding cats and dogs 
-  DE cats dogs feeding behaviors 
- 3 TI Canine mandibular structure 
-  DE Dogs Anatomy Musculature skeleton 
 
- 4 TI It rains like cats and dogs last night 
-  DE mystery fiction 
23The Inverted Indexing File
- anatomy 03DE02 
- behaviors 02DE04 
- canine 03TI01 
- cats 01DE01, 01TI01, 02DE01, 02TI04 
- cultural 01DE04 
- dogs 01DE02, 01TI02, 02DE02, 02TI05, 
 03DE01
- enemies 01TI04 
- feeding 02DE03, 02TI03 
- mandibular 03TI02 
- methods 02TI02 
- misunderstood 01TI06 
- mortal 01TI03 
- musculature 03DE03 
- new 02TI01 
- simply 01TI05 
- skeleton 03DE04 
- structure 03TI03 
- studies 01DE05
24Example: Create an inverted index for the following 
 25Unix Basics
- Unix 
- A powerful operating system 
- Multitasking, multithreaded, multi-user OS 
- Excellent host for IR systems and databases as 
 well as web servers.
- Command-based access 
26Subject Indexing
- A human analytic process for identifying, 
 selecting, and representing document concepts
- Create indexing languages 
- Using standardized, limited vocabularies for 
 index purposes.
- Assign indexing terms to documents 
- Using only the terms in the index language 
 selected.
28Second Examples
- 1 TI Cats and dogs Best friends of Kate 
- 2 TI New methods of feeding cats and dogs 
- 3 TI Canine mandibular structure 
- 4 TI It rains like cats and dogs last night 
-  
29Metadata
- Metadata are data about data 
- to describe features of the data (digital 
 objects)
- Content: what the object is about 
- Context: who, what, why, where and how aspects 
 associated with the object
- Structure: associations within or among 
 individual objects
30Example: Identify content, context, and 
structure in the following 
- Author: Arms, William Y. 
- Title: Digital libraries / William Y. Arms. 
- Imprint: Cambridge, Mass. : MIT Press, c2000. 
- CALL #: Z692.C65 A76 2000 
- Description: x, 287 p. : ill. ; 24 cm. 
- Series: Digital libraries and electronic 
 publishing
- Note: Includes index. 
- Subject: 
- Libraries -- United States -- Special collections 
 -- Electronic information resources.
- Digital libraries -- United States. 
- ISBN: 0262011808 (alk. paper) 
31Why Metadata? 
- Metadata is a key to ensuring that resources will 
 survive and continue to be accessible into the
 future.
- Standards 
- Structures and organization 
- Content and context 
32Functions of Metadata
- To help organize resources 
- To facilitate resource discovery 
- To facilitate interoperability 
- To support digital identification 
- To support archiving and preservation
33Types of Metadata
- Descriptive 
- Title, abstract, keywords 
- Administrative 
- Who created it and how 
- Rights management 
- Structural 
- Relationships among objects 
34Attributes of Metadata
- Source of metadata 
- Nature of metadata 
- Structure 
- Conform to a standard 
- Semantics 
- Controlled vocabulary or not 
- Level 
- How detailed the metadata are. 
35Metadata Schemes
- A metadata schema provides a formal structure 
 designed to identify the knowledge structure of a
 given discipline and to link that structure to
 the information of the discipline through the
 creation of an information system that will
 assist the identification, discovery and use of
 information within that discipline.
36- Schemes are sets of metadata elements to describe 
 a resource
- Semantics: definitions and meanings of the 
 metadata elements
- Contents: values given to metadata elements 
- Content rules: what values should be used, how 
 the values should be formulated.
37XML 
- XML stands for eXtensible Markup Language 
- Designed to separate content, context, and 
 presentation (style) in the web environment
- Designed to deploy content-specific tags for 
 content indexing and retrieval.
- Designed as a subset of SGML 
38Example
<?xml version="1.0" encoding="utf-8" ?>
<book isbn="0836217462">
 <title>Being a Dog Is a Full-Time Job</title>
 <author>Charles M. Schulz</author>
 <character>
  <name>Snoopy</name>
  <friend-of>Peppermint Patty</friend-of>
  <since>1950-10-04</since>
  <qualification>extroverted beagle</qualification>
 </character>
 <character>
  <name>Peppermint Patty</name>
  <since>1966-08-22</since>
  <qualification>bold, brash and tomboyish</qualification>
 </character>
</book>
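Because the tags name the content they hold, a program can pull out individual fields for indexing and retrieval. The sketch below reads the book example with Python's standard `xml.etree.ElementTree` module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0" encoding="utf-8" ?>
<book isbn="0836217462">
 <title>Being a Dog Is a Full-Time Job</title>
 <author>Charles M. Schulz</author>
 <character>
  <name>Snoopy</name>
  <friend-of>Peppermint Patty</friend-of>
  <since>1950-10-04</since>
  <qualification>extroverted beagle</qualification>
 </character>
 <character>
  <name>Peppermint Patty</name>
  <since>1966-08-22</since>
  <qualification>bold, brash and tomboyish</qualification>
 </character>
</book>"""

# Parse from bytes so the encoding declaration is honored.
book = ET.fromstring(BOOK_XML.encode("utf-8"))

print(book.get("isbn"))        # attribute on the root element
print(book.findtext("title"))  # text of the <title> child

# Collect one field from every <character> element.
names = [c.findtext("name") for c in book.findall("character")]
print(names)
```

Each value could then feed its own index (title index, author index, character index) rather than one undifferentiated keyword list.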
39XML Is an Industry in Itself 
- All the major software companies have implemented 
 some type of XML-related software
- XML-related standards are under continual 
 development.
- XSL: Extensible Stylesheet Language 
- XSLT: Extensible Stylesheet Language 
 Transformations
- XSLT enables and empowers interoperability 
- XLink: XML Linking Language 
- Assign meanings to links 
- RDF: Resource Description Framework 
40XML Example (www.XML.com)
<?xml version="1.0"?>
<artistinfo>
 <surname>Modigliani</surname>
 <name>Amadeo</name>
 <born>July 12, 1884</born>
 <died>January 24, 1920</died>
 <biography>
  <p>In 1906, Modigliani settled in Paris, where ...</p>
 </biography>
</artistinfo>
41Example
<?xml version="1.0"?>
<period>
 <city>Paris</city>
 <country>France</country>
 <timeframe begin="1900" end="1920"/>
 <title>Paris in the early 20th century (up to the twenties)</title>
 <end>Amadeo</end>
 <description>
  <p>During this period, Russian, Italian, ...</p>
 </description>
</period>
42
<environment xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
 <!-- The resources involved in our link are the artist -->
 <!-- himself, his influences and the historical references -->
 <artist xlink:type="locator" xlink:label="artist"
         xlink:href="modigliani.xml"/>
 <influence xlink:type="locator" xlink:label="inspiration"
            xlink:href="cezanne.xml"/>
 <influence xlink:type="locator" xlink:label="inspiration"
            xlink:href="lautrec.xml"/>
 <influence xlink:type="locator" xlink:label="inspiration"
            xlink:href="rouault.xml"/>
 <history xlink:type="locator" xlink:label="period"
          xlink:href="paris.xml"/>
 <history xlink:type="locator" xlink:label="period"
          xlink:href="kisling.xml"/>
</environment>
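One way to see how XLink assigns meanings to links is to group the linked files by their label. The sketch below uses a shortened copy of the example (three locators instead of six) and Python's standard `xml.etree.ElementTree`, which exposes namespaced attributes as `{namespace-URI}name`. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Shortened copy of the extended link: one locator per label.
LINK_XML = """<environment xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
 <artist xlink:type="locator" xlink:label="artist"
         xlink:href="modigliani.xml"/>
 <influence xlink:type="locator" xlink:label="inspiration"
            xlink:href="cezanne.xml"/>
 <history xlink:type="locator" xlink:label="period"
          xlink:href="paris.xml"/>
</environment>"""

root = ET.fromstring(LINK_XML)

# Group the target files by the meaning (label) attached to each link.
by_label = {}
for locator in root:
    label = locator.get(f"{{{XLINK}}}label")
    href = locator.get(f"{{{XLINK}}}href")
    by_label.setdefault(label, []).append(href)

print(by_label)
# {'artist': ['modigliani.xml'], 'inspiration': ['cezanne.xml'], 'period': ['paris.xml']}
```

Unlike a plain HTML hyperlink, each target here carries a label that says what role the linked resource plays.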
43Discussion
- Differences between XML and HTML? 
- Relationships between XML and metadata? 
44XML/Metadata Tools
- Reggie 
- a metadata editor 
- Output: RDF, HTML, 
- a Java application 
- URL: http://metadata.net/dstc/ 
45DC DOT
- http://www.ukoln.ac.uk/metadata/dcdot/ 
- Exercises 
- Add Dublin Core Headings to the class Web page. 
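For the exercise, Dublin Core headings are commonly embedded in a Web page as `<meta>` tags named `DC.<element>`, with a `<link>` declaring the DC element set. The sketch below generates such tags; the record values are hypothetical placeholders for the class page, and the function name is an assumption.

```python
from html import escape

DC_SCHEMA = "http://purl.org/dc/elements/1.1/"

def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML <link>/<meta> tags."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, value in record.items():
        content = escape(value, quote=True)  # protect quotes and ampersands
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)

# Hypothetical values for the class Web page.
record = {
    "title": "ISYS 300 - Week 2: Document Representation",
    "creator": "Lin, Xia",
    "subject": "Document representation; indexing; metadata",
}
print(dublin_core_meta(record))
```

The generated lines belong inside the page's `<head>` element; a tool such as DC-dot produces similar output automatically.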
46Writing an XML Document
- An XML document must be well formed 
- A root element is required. 
- Closing tags are required. 
- Elements must be properly nested. 
- Case matters. 
- Entity references other than the five predefined 
 ones must be declared in a DTD.
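The well-formedness rules can be checked mechanically: a conforming parser rejects any document that breaks them. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name and the tiny test documents are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

examples = {
    "well formed": "<book><title>Dogs</title></book>",
    "two root elements": "<book/><book/>",
    "missing closing tag": "<book><title>Dogs</title>",
    "improper nesting": "<book><title>Dogs</book></title>",
    "case mismatch": "<book><Title>Dogs</title></book>",
    "undeclared entity": "<book>&nbsp;</book>",
}
for label, text in examples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first example is accepted; each of the others violates one of the rules listed above.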
47XML document content
<title>NASA Image Exchange</title>
<site>http://nix.nasa.gov/</site>
<metadata>
<repository-name>NASA Image Exchange</repository-name>
<category>
 <label>CATEGORY</label>
 <data>images</data>
</category>
-  
48XML Schema 
 49XML Document Headings
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css"
  href="http://project.cis.drexel.edu/classes/isys300/XML/repository.css" ?>
<repository
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="http://project.cis.drexel.edu/classes/info653/XML/DLRepository.xsd">
50Style Sheet
repository { display: block; font-size: large; color: Maroon }
title { display: block; font-size: large; text-align: center }
site { display: block; text-align: center }
metadata { float: right; clear: right; width: 225px; border: thin solid Teal; padding: 10px }
repository-name { display: block; font-size: medium; background: Navy; color: Yellow; text-align: center }
label { display: block; font-size: medium }
data { display: block; font-size: small; color: blue; position: relative; left: 9px }
description { display: block }
review { display: block; color: black }
name { display: block; text-align: right; color: Blue; font-size: small }
term { display: none }
51Controlled Vocabulary
- Goals 
- To permit easy location of documents by topic. 
- To define topic areas, and hence relate one 
 document to another.
- To provide multiple access points to documents 
- To enforce uniformity throughout an information 
 retrieval system
52Controlled Vocabulary
- Formats 
- Hierarchical Classified list 
- hierarchical subject descriptors 
- associative cross references 
- classification notation (codes) 
- Alphabetical list 
- include both descriptors and other lead-in terms
53Main Components in a Controlled Vocabulary
Broader Term
Keyword/ Descriptor
Synonymous Term
Related Term
Narrower Term 
 54Example
- Keyword/Descriptor: Cancer
- Broader Terms: Diseases; Neoplasms
- Related Terms: Abdominal Neoplasms; Hyperplasia; Seminoma
- Synonyms: Malignancy; Malignant tumor; Cancer morphology
- Narrower Terms: Malignant neoplasm of skin; Breast Cancer; 
 Primary malignant neoplasm of liver
 55Controlled Vocabulary
- Examples 
- Case studies (Descriptor) 
- SN Detailed analyses, usually focusing on a 
 particular problem of an individual, group, or
 organization (Note: do not confuse with medical
 case histories)
- NT 
-  Cross sectional studies 
-  Longitudinal studies
56Examples (Case Studies)
- BT 
-  Evaluation methods 
-  Research 
- RT 
-  Case records 
-  Counseling 
-  Qualitative research 
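A thesaurus entry like "Case studies" maps naturally onto a small record with one list per relationship type. The sketch below stores the entry from these two slides and uses it to expand a search; the class name, field names, and the expansion policy (descriptor plus narrower terms, related terms optional) are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    term: str
    scope_note: str = ""                          # SN
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT

case_studies = Descriptor(
    term="Case studies",
    scope_note="Detailed analyses, usually focusing on a particular "
               "problem of an individual, group, or organization",
    broader=["Evaluation methods", "Research"],
    narrower=["Cross sectional studies", "Longitudinal studies"],
    related=["Case records", "Counseling", "Qualitative research"],
)

def expand_query(descriptor, include_related=False):
    """Return the descriptor plus its narrower (and optionally related) terms."""
    terms = [descriptor.term] + descriptor.narrower
    if include_related:
        terms += descriptor.related
    return terms

print(expand_query(case_studies))
# ['Case studies', 'Cross sectional studies', 'Longitudinal studies']
```

Expanding to narrower terms tends to raise recall without leaving the topic, while adding related terms widens the search further.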
57Advantages of Subject Indexing
- facilitates concept search 
- search by topics/subjects, not just by words 
- link related documents by subject terms 
- Make implicit information explicit 
- Provides a standard terminology to index and 
 search documents.
- Uses a small indexing vocabulary 
- Help the searcher find related terms
58Disadvantages of Subject Indexing
- Expensive manual operations 
- To construct the controlled vocabulary 
- To assign terms to documents 
- Difficult to keep up to date 
- Terminology changes very fast 
- New terms are added daily. 
- Inconsistent process of human indexing 
- Same documents are assigned different indexing 
 terms by different indexers
- The user may not use the same terms to find 
 documents as the indexer would use to index the
 documents.
59Two Examples of Document Representation
- Controlled Vocabulary 
- human-based indexing 
- subject-based indexing 
- Inverted indexing 
- computer-based indexing 
- statistical-based indexing
60Considerations of Document Representation 
- Discriminating power 
- to identify a document uniquely 
- to reduce ambiguity 
- Examples 
- ISBN number for book 
- bar codes for products 
61- Descriptiveness 
- describe all the information as completely as 
 possible
- fulltext 
- abstracts 
- extracts 
- reviews 
- Completeness and correctness 
62Considerations of DR
- Similarity Identification 
- to group similar documents 
- keywords or subject indexing 
- book classification numbers 
- Difficult for the computer to assign keywords, 
 subject descriptors, or classification numbers to
 documents
63Considerations of DR
- Conciseness 
- simple and clear 
- reduce process time and storage space 
- Examples 
- authors and titles 
- Needed by both the computer and the user 
64Relationships of four considerations
- Higher discrimination power may lower the 
 capability of identifying similarities among
 documents.
- Good descriptiveness may defeat conciseness. 
- What's good for the computer may not always be 
 good for the user.
- A good representation should seek a balance of 
 the four, and take both the computer and the user
 into consideration.