Title: Will XML and Information Retrieval Make Society Transparent?
1. Will XML and Information Retrieval Make Society Transparent?
- Gregory B. Newby, School of Information and Library Science, University of North Carolina at Chapel Hill
- http://ils.unc.edu/gbnewby
2. Basic Premise
- Information retrieval will be facilitated by XML because of the additional structure that XML adds.
- This will result in better IR abilities compared to plain text or HTML.
3. IR is Not Database Retrieval
- Bibliographic retrieval: controlled vocabulary
- Database query: structured data
- Natural language: (semi-)structured or unstructured data
4. Information Retrieval in One Slide
- IR is about matching information to information needs
- Information may be contained in documents, extracts, document surrogates, or newly created documents
- Information needs may be poorly defined, changeable, and context-specific
- We evaluate IR systems by the numbers of relevant documents they identify
- Recall: the proportion of all relevant documents that are retrieved
- Precision: the proportion of retrieved documents that are judged relevant
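The two measures above can be written out directly. The following is a minimal sketch in Python, assuming relevance judgments are available as a simple set of document ids; the function name and the sample ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids the system returned
    relevant:  iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: the system returns 4 documents, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved_ids = ["d1", "d2", "d3", "d7"]
relevant_ids = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

In this example three of the four retrieved documents are relevant (precision 0.75), but only three of the six relevant documents were found (recall 0.50).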
5. Why IR Sucks
- Human language is ambiguous
- Polysemy: the same word can mean different things
- Synonymy: different words can mean the same thing
- The topic or "aboutness" of a document is hard to assess
- Queries are short and ambiguous
- Information needs are moving and vague targets
6. Things that help IR
- Structure: matching based on known types of content (e.g., a list vs. discourse)
- Relationships: knowing how groups of documents are related
- Metadata: terms or phrases that are of assuredly high importance
- User knowledge: context, user models, history
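To illustrate how structure and metadata can help, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The two-document collection and its element names (`doc`, `title`, `subject`, `body`) are hypothetical, not taken from any real DTD. A plain-text match on an ambiguous word hits both documents, while a match restricted to a metadata element separates them.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document collection; element names are made up for illustration.
COLLECTION = """
<collection>
  <doc id="a">
    <title>Jaguar maintenance manual</title>
    <subject>automobiles</subject>
    <body>Service intervals for the engine and brakes.</body>
  </doc>
  <doc id="b">
    <title>Big cats of South America</title>
    <subject>zoology</subject>
    <body>The jaguar hunts near rivers. A car was used on the survey.</body>
  </doc>
</collection>
""".strip()


def search_plain(root, term):
    """Match the term anywhere in a document's text, ignoring structure."""
    term = term.lower()
    return [doc.get("id") for doc in root.iter("doc")
            if term in " ".join(doc.itertext()).lower()]


def search_field(root, field, term):
    """Match the term only inside one named child element of each document."""
    term = term.lower()
    return [doc.get("id") for doc in root.iter("doc")
            if term in (doc.findtext(field) or "").lower()]


root = ET.fromstring(COLLECTION)
plain_hits = search_plain(root, "jaguar")              # both documents mention "jaguar"
field_hits = search_field(root, "subject", "zoology")  # only the animal document
print(plain_hits, field_hits)
```

The substring matching here is deliberately naive; the point is only that the `subject` element gives the system something unambiguous to match against.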
7. Transparency through Information Access (utopic view)
- What if organizations (government, corporations, etc.) are less able to hide their actions?
- What if individuals' information is readily accessible to all?
- What if nearly all information that is generated is available to all seekers?
8. Inequity through Information Access (dystopic view)
- Organizations share their data only when and with whom they choose
- Individuals' information is hoarded by businesses, government and the people themselves
- Information is available on a fee and authority basis
9. XML can't make societal decisions
- But XML brings about the opportunity for such decisions to be made
- If information is readily available to all, XML will help make it more searchable
- If information is only available to the privileged, XML will make them more powerful
10. XML Uncertainties
- Will XML be used for markup? Or only at the back end?
- Will standards such as Z39.50 or EDI make it easier to share XML data? Or will translation and mapping be difficult?
- What sort of variety will exist in DTDs? How difficult will it be for IR and database systems to map between DTDs?
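As a rough illustration of what mapping between DTDs involves, the sketch below renames elements from one vocabulary to another using Python's standard `xml.etree.ElementTree`. The element names and the `TAG_MAP` table are hypothetical. Real mappings are harder than a rename table, since structure and meaning can differ as well as names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one organization's element names to another's.
TAG_MAP = {"person": "customer", "surname": "lastName", "forename": "firstName"}


def translate(elem, tag_map):
    """Return a copy of elem with tags renamed according to tag_map.

    Tags missing from the map are kept unchanged, which is where real
    mappings become difficult: not every element has a counterpart.
    """
    new = ET.Element(tag_map.get(elem.tag, elem.tag), dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(translate(child, tag_map))
    return new


source = ET.fromstring(
    "<person id='7'><forename>Ada</forename><surname>Lovelace</surname>"
    "<shoeSize>6</shoeSize></person>"
)
target = translate(source, TAG_MAP)
target_xml = ET.tostring(target, encoding="unicode")
print(target_xml)
```

Note that `shoeSize` passes through untranslated because the table has no entry for it. A system consuming the result would have to decide whether to ignore, reject, or guess at such elements.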
11. XML stakeholders: Big organizations
- Organizations with lots of internal data
- (The IRS, Time-Warner, others big and small)
- These organizations will benefit from XML and IR by being able to match database-type items with IR-type information needs.
- E.g., for people who purchase these products, what email and chat messages have they exchanged?
12. XML stakeholders: Organizations who share
- Organizations who broker, repackage or resell information will benefit from XML and IR
- (Credit bureaus, investigative services)
- XML will make it easier to submit IR queries against multiple datasets and merge the results
- E.g., see what this person's public Web pages say before deciding whether to hire him or her.
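Merging results from multiple datasets raises a practical problem: each source scores documents on its own scale. The sketch below shows one simple approach, assumed here for illustration and not something the slides prescribe: min-max normalize each result list, then keep each document's best normalized score. The dataset names, ids and scores are invented.

```python
def normalize(results):
    """Rescale (doc_id, score) pairs to the 0..1 range (min-max)."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat all results as equally good.
    return [(doc_id, (score - lo) / span if span else 1.0)
            for doc_id, score in results]


def merge(*result_lists):
    """Merge several ranked lists, keeping each document's best normalized score."""
    best = {}
    for results in result_lists:
        for doc_id, score in normalize(results):
            if score > best.get(doc_id, -1.0):
                best[doc_id] = score
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw scores from two datasets that use different scales.
web_pages = [("page-1", 12.0), ("page-2", 7.0), ("page-3", 2.0)]
records = [("rec-9", 0.9), ("page-2", 0.8), ("rec-4", 0.4)]
merged = merge(web_pages, records)
for doc_id, score in merged:
    print(f"{doc_id}\t{score:.2f}")
```

Taking the best score per document is only one of several possible choices; summing or averaging scores across sources would reward documents that appear in more than one dataset.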
13. XML stakeholders: Individuals
- Ultimately, lots of the most valuable information is by or about individuals
- (Lifestyle, health, purchasing, travel)
- IR systems that understand us better will be able to serve us better
- E.g., recommend a book based on my past reading, movies and available time to read.
14. What we know, revisited
- IR sucks, but is better to the extent that language is disambiguated and structure is present
- People have information needs, but have trouble expressing those needs
- Documents can address some needs, but often real-world information needs are better met by assembling answers from diverse sources
15. What we don't know, revisited
- XML: in the background or the foreground?
- How will organizations share XML data (will they?)
- What external forces might make data in all forms more accessible across organizations and to individuals?
16. XML and IR
- Despite problems, IR has continued to make good progress
- Despite problems, XML appears to be making a strong contribution to storing, organizing and presenting data of all types
- With IR, XML will be more searchable for a variety of purposes
- With XML, IR will gain better precision and ability to serve the needs of individuals and organizations