Informetrics - PowerPoint PPT Presentation

Provided by: DonTur5

Transcript and Presenter's Notes

Title: Informetrics


1
Informetrics & IR
  • Presentation
  • Readings Discussion & Review
  • Projects & Papers

2
Why use metrics?
  • Apply theory from another field to solve IS
    problems
  • We need new modeling techniques or metaphors to
    examine these complex systems
  • An attempt to apply some new models and metaphors
    to complex systems
  • Bibliometrics
  • Direct Citation Counting
  • Bib Coupling
  • Co-Citation Analysis
  • Bibliometric Laws
  • Web Servers
  • Server Log
  • Log Analysis

3
How does Informetrics impact IR?
  • Measures of
  • Content: subject area
  • Relationships
  • Use: popularity
  • An information-based view of communications,
    focused on documents
  • Instead of the text in a document, focus on the
    document properties (metadata?)
  • Author(s)
  • Dates
  • Publication source(s)
  • Front matter: titles & contact info
  • Back matter: citations & support

4
What are these metrics?
  • Bibliometrics
  • "series of techniques that seek to quantify the
    process of written communication" (Ikpaahindi)
  • counting and analyzing citations
  • consistently observable patterns
  • referenced in key places: Science Citation Index,
    Social Science Citation Index, Arts and
    Humanities Citation Index
  • Webometrics
  • Applying bibliometric methods to Web pages & Web
    sites
  • Informetrics
  • Wider scale application of methods to networked
    information sources

5
Citing & Linking
  • paying homage to pioneers
  • giving credit for related work (homage to peers)
  • identifying methodology, equipment, etc.
  • background reading
  • correcting one's own work
  • correcting the work of others
  • criticizing previous work
  • substantiating claims
  • alerting to forthcoming work
  • providing leads to poorly disseminated, poorly
    indexed, or un-cited work
  • authenticating data and classes of fact -
    physical constants, etc.
  • identifying original publications in which an
    idea or concept was discussed
  • identifying the original publication or other
    work describing an eponymic concept or term
    (Hodgkin's Disease)
  • disclaiming work or ideas of others (negative
    claims)
  • disputing priority claims of others (negative
    homage)

6
Direct Citation Counting
  • How many citations over a given period of time?
  • Impact formula
  • number of citations to the journal / number of
    citable articles published
  • Immediacy index
  • number of citations received by articles during
    the year / total number of citable articles
    published
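As a rough illustration of the two ratios above, the sketch below computes both for a hypothetical journal. The function names and the sample counts are assumptions made for this example; the slide gives only the ratios.

```python
def impact_factor(citations, citable_articles):
    """Citations to the journal divided by citable articles published."""
    if citable_articles <= 0:
        raise ValueError("citable_articles must be positive")
    return citations / citable_articles


def immediacy_index(citations_this_year, citable_articles_this_year):
    """Same ratio, restricted to a single publication year."""
    if citable_articles_this_year <= 0:
        raise ValueError("citable_articles_this_year must be positive")
    return citations_this_year / citable_articles_this_year


# Invented numbers for a hypothetical journal.
print(impact_factor(150, 60))    # 2.5
print(immediacy_index(12, 48))   # 0.25
```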

7
Bibliographic Coupling
  • "a number of papers bear a meaningful relation
    to each other when they have one or more
    references in common" (Kessler)
  • What's the Web equivalent?
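A minimal sketch of coupling as described above: two papers are coupled when their reference lists overlap, and the size of the overlap is taken as the strength. The paper and reference identifiers are invented, and `coupling_strength` is a name chosen for this example. For the Web question, one could feed in pages and their outgoing links instead, though that analogy is an assumption rather than something the slide states.

```python
from itertools import combinations


def coupling_strength(references):
    """Map each pair of papers to the number of references they share."""
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths


# Hypothetical papers and the references each one cites.
papers = {
    "P1": {"r1", "r2", "r3"},
    "P2": {"r2", "r3", "r4"},
    "P3": {"r5"},
}
print(coupling_strength(papers))  # {('P1', 'P2'): 2}
```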

8
Co-Citation Analysis
  • if two references are cited together in a later
    literature, the two references are themselves
    related. The greater the number of times they are
    cited together, the greater their co-citation
    strength. (Marshakova and Small, independently,
    1973)
  • How about Web citations?
  • What's a set of Web pages? A site, a long page?
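Co-citation strength can be sketched as a pair count over the reference lists of later (citing) documents. The data below is invented and `cocitation_counts` is a hypothetical helper name. A Web version could treat each page's outgoing links as its reference list, which is an assumption for illustration only.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of references appears in the same citing document."""
    counts = Counter()
    for refs in reference_lists:
        # set() avoids counting a pair twice if a document repeats a reference
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Each inner list is the reference list of one hypothetical citing paper.
citing = [["A", "B", "C"], ["A", "B"], ["B", "C"]]
counts = cocitation_counts(citing)
print(counts[("A", "B")], counts[("A", "C")], counts[("B", "C")])  # 2 1 2
```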

9
Finer Points
  • Classification of references
  • is the reference conceptual or operational
  • is the reference organic or perfunctory
  • is the reference evolutionary or juxtapositional
    (built on a preceding or an alternative to it)
  • is the reference confirmative or negational
  • Citation reference errors
  • multiple authors (not primary, or et al.): what
    contribution/influence by order of names?
  • self-citations
  • like-names, initial/full names, different fields
  • field variation of citation amounts/purposes
  • fluctuation of influence/use
  • typos

10
Bibliometric Laws
  • Seek to describe the workings of science by
    mathematical means. In general, a few entities
    account for most of the citations.
  • Bradford's Law of Scattering
  • Lotka's Law
  • Zipf's Law

11
Bradford's Law of Scattering
  • How literature on a subject is distributed among
    journals.
  • If scientific journals are arranged in order of
    decreasing productivity of articles on a given
    subject, they may be divided into a nucleus of
    periodicals more particularly devoted to the
    subject and several other groups or zones
    containing the same number of articles as the
    nucleus.
  • 9 journals had 429 articles, the next 59 had 499,
    the last 258 had 404.
  • Bradford discovered this regularity by
    calculating the number of titles in each of the
    three groups: 9 titles, 9x5 titles, 9x5x5 titles.
  • Can be influenced by sample size, area of
    specialization and journal policies.
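To make the zoning idea concrete, here is a sketch that ranks journals by productivity and cuts the ranking into zones holding roughly equal numbers of articles. The function name and the toy counts are assumptions; the toy data is built so that the zones come out as 1, 3 and 9 journals, mirroring the 9, 9x5, 9x5x5 pattern with a multiplier of 3.

```python
def bradford_zones(article_counts, zones=3):
    """Split journals, ranked by productivity, into zones of roughly equal article totals."""
    ranked = sorted(article_counts, reverse=True)
    total = sum(ranked)
    result, current, running = [], [], 0
    for count in ranked:
        current.append(count)
        running += count
        # Close a zone once the running total reaches the next equal-share boundary.
        if len(result) < zones - 1 and running >= total * (len(result) + 1) / zones:
            result.append(current)
            current = []
    result.append(current)  # whatever remains forms the last zone
    return result


# Invented data: 1 journal with 27 articles, 3 with 9 each, 9 with 3 each.
toy_counts = [27] + [9] * 3 + [3] * 9
toy_zones = bradford_zones(toy_counts)
print([len(z) for z in toy_zones])  # [1, 3, 9]
print([sum(z) for z in toy_zones])  # [27, 27, 27]
```

Real data rarely splits this cleanly, as the figures of 429, 499 and 404 articles on the slide show.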

12
Brookes on Bradford's Formula
  • The index terms assigned to documents also
    follow a Bradford distribution because those
    terms most frequently assigned become less and
    less specific and therefore increasingly
    ineffective in retrieval.

13
Bradford's Formula Itself
  • Bradford's Formula makes it possible to estimate
    how many of the most productive sources would
    yield any specified fraction p of the total
    number of items. The formula is
  • $R(n) = N \log(n/s)$, $1 \le n \le N$
  • where R(n) = cumulative total of items
    contributed by the sources of rank 1 to n,
  • N = total number of contributing sources,
  • s = a constant characteristic of the literature;
  • then
  • $R(N) = N \log(N/s)$
  • is the total number of items contributed by N
    sources.
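Reading the formula as R(n) = N log(n/s), the fraction question can be worked out directly: setting R(n) = p R(N) gives log(n/s) = p log(N/s), so n = s (N/s)^p. The sketch below encodes that step. The function names and the values N = 100, s = 1 are assumptions chosen for illustration, not figures from the slides.

```python
import math


def cumulative_items(n, N, s):
    """R(n) = N * log(n / s): items contributed by the sources of rank 1..n."""
    return N * math.log(n / s)


def sources_for_fraction(p, N, s):
    """Rank n at which R(n) reaches the fraction p of R(N): n = s * (N / s) ** p."""
    return s * (N / s) ** p


# With N = 100 sources and s = 1, half of all items come from the top 10 sources.
n_half = sources_for_fraction(0.5, 100, 1)
print(n_half)  # 10.0
print(cumulative_items(n_half, 100, 1) / cumulative_items(100, 100, 1))
```

The base of the logarithm does not matter here because it cancels in the ratio R(n) / R(N).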

14
More on Bradford's Law
  • Citations originally counted year by year can be
    expressed as the geometric sequence
  • $R, Ra, Ra^2, Ra^3, Ra^4, \ldots, Ra^{t-1}$
  • where R = presumed number of citations during the
    first year, some of which do not immediately
    emerge in publication. But as $a < 1$, the sum of
    the sequence converges to the finite limit
    $R/(1-a)$.
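The convergence claim can be checked numerically. This sketch sums the first t terms of the sequence and compares the result with R/(1-a). The values R = 100 and a = 0.5 are invented for the example, and the function names are hypothetical.

```python
def cumulative_citations(R, a, years):
    """Sum of R, R*a, R*a**2, ..., R*a**(years - 1)."""
    return sum(R * a ** t for t in range(years))


def citation_limit(R, a):
    """Limit of the sum as the number of years grows, valid for 0 <= a < 1."""
    if not 0 <= a < 1:
        raise ValueError("a must satisfy 0 <= a < 1 for the sum to converge")
    return R / (1 - a)


print(cumulative_citations(100, 0.5, 3))   # 175.0
print(citation_limit(100, 0.5))            # 200.0
```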

15
Lotka's Law
  • An inverse square law: for every 100 authors
    contributing one article, 25 will contribute 2, 11
    will contribute 3 and 6 will contribute 4.
  • The formula is $1/n^2$.
  • Voos found $1/n^{3.5}$ for Info Science (1974).
  • What are other similar analysis tasks you could
    use Lotka's law for?
  • Are users, browsers, bloggers like authors?
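The figures on this slide follow from dividing the number of single-article authors by n squared. A small sketch, with `lotka_authors` as a hypothetical helper name and the exponent left as a parameter so that other values, such as the 3.5 reported by Voos, can be tried:

```python
def lotka_authors(single_article_authors, n_articles, exponent=2.0):
    """Expected number of authors contributing n_articles each."""
    return single_article_authors / n_articles ** exponent


# Reproduces the slide's figures for the inverse square case.
print([round(lotka_authors(100, n)) for n in (1, 2, 3, 4)])  # [100, 25, 11, 6]
```

The same function could be pointed at counts of posts per blogger or requests per user to see whether they fall off in a similar way. That is a suggestion for exploration, not a result from the slides.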

16
Zipf's Law
  • The distribution, as applied to word frequency
    in a text, states that the nth-ranking word will
    appear k/n times, where k is a constant for that
    text.
  • It is easier to choose and use familiar words,
    therefore the probability of occurrence of
    familiar words is higher. $r \times f = C$, where
    r is rank, f is frequency and C is a constant.
  • This can be applied by counting all of the words
    in a document (minus some words in a stop list of
    common words such as "the" and "therefore"), with
    the most frequent occurrences representing the
    subject matter of the document. Could also use
    relative frequency (more often than expected)
    instead of absolute frequency.
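The counting procedure described above can be sketched in a few lines. The stop list, the sample sentence and the function names are all assumptions made for the example. A real stop list would be much longer.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "therefore"}


def top_terms(text, k=3):
    """Most frequent non-stop words, as a crude indicator of subject matter."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(k)


def rank_frequency(text):
    """(rank, word, frequency, rank * frequency) rows for checking r * f = C."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


sample = "the citation of a citation is a citation and the journal cites the journal"
print(top_terms(sample, 2))  # [('citation', 3), ('journal', 2)]
```

On a text this short the rank-frequency product will not be close to constant. The relationship only shows up in much larger samples.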

17
Wyllys on Zipf's Law
  • Surprisingly constrained relationship between
    rank and frequency in natural language.
  • Zipf said the fundamental reason for human
    behavior is the striving to minimize effort.
  • Mandelbrot: further refinement of Zipf's law,
    $(r + m)^B f = c$, where r is the rank of a word,
    f is its frequency, and m, B and c are constants
    dependent on the corpus. m has the greatest
    effect when r is small.
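Reading Mandelbrot's refinement as (r + m)^B f = c, the frequency is f = c / (r + m)^B, and plain Zipf is the special case m = 0, B = 1. The sketch below uses invented constants to show why m matters most at small ranks.

```python
def mandelbrot_frequency(rank, c, m, B):
    """f = c / (rank + m) ** B, solved from (r + m)**B * f = c."""
    return c / (rank + m) ** B


# Hypothetical constants: c = 1000, B = 1, comparing m = 0 (Zipf) with m = 1.
for r in (1, 100):
    zipf = mandelbrot_frequency(r, 1000, 0, 1)
    shifted = mandelbrot_frequency(r, 1000, 1, 1)
    print(r, zipf, shifted)
# At rank 1 the shift halves the frequency (1000.0 vs 500.0);
# at rank 100 it changes it by about 1 percent.
```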

18
Optimum utility of articles?
  • the most compact library is not the least costly
    because you get rid of articles more quickly
    therefore you buy more.
  • fewer articles are acquired and kept longer but
    more shelf space and maintenance is needed.
  • the challenge is to keep the most frequently
    accessed available.

19
Goffman's Theory
  • His General Theory of Information Systems
  • Ideas are endemic, with minor outbreaks
    occurring from time to time. Cycles of use. Like
    memes and paradigm shifts (Kuhn). Based on
    epidemiology and Shannon's communication theory.

20
Online Article Life
  • Burton proposed a measure for the decay in
    citations to older literature, a half-life
  • How is this different on the net?
  • a shorter life?
  • older sites referred less, more?
  • commercial sites vs. private sites.
  • advertised vs word of mouth?
  • linked from popular pages?
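One common way to operationalize a citation half-life is the median age of the references cited in a given year. Whether this matches Burton's exact definition is an assumption here. The years below are invented, and `citation_half_life` is a hypothetical name. The same calculation could be run on the ages of pages linked from a site to explore the questions above.

```python
from statistics import median


def citation_half_life(citing_year, cited_years):
    """Median age, in years, of the items cited in citing_year."""
    ages = [citing_year - year for year in cited_years]
    return median(ages)


# Hypothetical reference list of a paper published in 2006.
print(citation_half_life(2006, [2005, 2004, 2004, 2000, 1990]))  # 2
```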

21
Price's Law
  • half of the scientific papers are contributed by
    the square root of the total number of scientific
    authors
  • Leads to
  • bibliographic coupling: the number of references
    two papers have in common, as a measure of their
    similarity. A clustering based on this measure
    yields meaningful groupings of papers for
    information retrieval.
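Price's square-root claim is easy to test against any list of per-author paper counts. The sketch below takes the top sqrt(N) authors and reports their share of all papers. The data is invented so that the share comes out at exactly one half, and `price_core_share` is a hypothetical name.

```python
import math


def price_core_share(papers_per_author):
    """Return (core size, share of papers) for the top sqrt(N) authors."""
    ranked = sorted(papers_per_author, reverse=True)
    core_size = round(math.sqrt(len(ranked)))
    return core_size, sum(ranked[:core_size]) / sum(ranked)


# 16 hypothetical authors: the top 4 wrote 54 papers, the other 12 wrote 54.
output = [20, 15, 10, 9, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3]
print(price_core_share(output))  # (4, 0.5)
```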

22
Cumulative advantage model
  • Price noticed this advantage
  • Success breeds success. Also implies that an
    obsolescence factor is at work. If you get
    mentioned a lot, you get mentioned in more and
    more cited papers.
  • Polya describes this as contagion
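A small simulation can illustrate "success breeds success". In this urn-style sketch each new citation goes to a paper with probability proportional to one plus the citations it already has. The model details, the function name and the parameter values are assumptions for illustration, not Price's or Polya's exact formulation.

```python
import random


def simulate_cumulative_advantage(n_papers, n_citations, seed=0):
    """Hand out citations one at a time, favoring already-cited papers."""
    rng = random.Random(seed)
    weights = [1] * n_papers  # every paper starts with the same chance
    for _ in range(n_citations):
        winner = rng.choices(range(n_papers), weights=weights, k=1)[0]
        weights[winner] += 1  # each citation raises the odds of the next one
    return [w - 1 for w in weights]  # citation counts per paper


citations = simulate_cumulative_advantage(n_papers=10, n_citations=200, seed=1)
print(sorted(citations, reverse=True))
```

Runs of this kind typically end with a few papers holding a large share of the citations, which is the skew the bibliometric laws describe.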

23
Bibliometrics on the Web
  • We can use these techniques, rules and formulas
    to analyze Web usage.
  • Like a bibliometric index for historical
    analysis.
  • Key question: are citations like page
    browsing/using?
  • Using Web Servers Effectively
  • Server Logs give us much data to mine
  • Studies on the Web

24
Understanding the Web
  • User-based data collection
  • Surveys
  • GVU, Nielsen and GNN
  • Qualitative questions
  • phone
  • web forms
  • Self-selected sample problems
  • random selection
  • oversample

25
Understanding the Web
  • Web Servers
  • Serve
  • text
  • graphics
  • CGI
  • XMLHttpRequest (REST, AJAX)
  • Web services (SOAP)
  • other MIME types
  • Server Logs represent this activity
  • A lot of empirical, quantitative data on use

26
Problems with Web Servers
  • Not as Foolproof as Print
  • No State Information
  • Interaction with Web pages or Web apps is
    difficult to log & analyze
  • Server Hits not Representative
  • Counters inaccurate
  • Effects of different, non-HTTP requests
  • Floods/Bandwidth can Stop intended usage
  • Robots, Spam, (D)DoS, Caching, etc.

27
Web Server Records
  • Server-based
  • Proxy-based
  • Client-based
  • Network-based

28
Clever Web Content Setup
  • unique file and directory names
  • clear, consistent structure
  • FTP server for file transfer
  • frees up logs and server!
  • Judicious use of links
  • Wise MIME types
  • some hard/impossible to log

29
Clever Web Server Setup
  • Redirect CGI to find referrer
  • Use a database
  • store web content
  • record usage data
  • create state information with programming
  • NSAPI
  • ActiveX
  • Have contact information
  • Have purpose statements
  • Bibliometric Servlets?

30
Managing Log Files
  • Backup
  • Store Results or Logs?
  • Beginning New Logs
  • Posting Results

31
Log File Format
  • see Appendix
  • key advantage
  • computer storage cost decreases while paper cost
    rises
  • every server generates slightly different logs
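Since the appendix is not reproduced here, the sketch below assumes the widely used NCSA Common Log Format (host, ident, user, timestamp, request line, status, bytes). The sample line is invented, and servers that log extra fields would need a different pattern.

```python
import re

# Assumed layout: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None for a line that does not fit the pattern."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None  # would be reported as a corrupt logfile line
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample_line = '192.0.2.1 - - [02/Dec/2003:23:59:00 -0600] "GET /index.html HTTP/1.0" 200 2326'
print(parse_line(sample_line)["request"])  # GET /index.html HTTP/1.0
```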

32
Extended Log File Formats
  • WWW Consortium Standards
  • Will automatically record much of what is
    programmatically done now.
  • faster
  • more accurate
  • standard baselines for comparison
  • graphics standards

33
Log Analysis Tools
  • Analog
  • WWWStat
  • GetStats
  • Perl Scripts
  • Commercial Tools

34
Log Analysis Cumulative Sample
  • Program started at Tue-03-Dec-2006 01:20 local
    time.
  • Analysed requests from Thu-28-Jul-2003 20:31 to
    Mon-02-Dec-2003 23:59 (858.1 days).
  • Total successful requests: 4 282 156 (88 952)
  • Average successful requests per day: 4 990 (12
    707)
  • Total successful requests for pages: 1 058 526
    (17 492)
  • Total failed requests: 88 633 (1 649)
  • Total redirected requests: 14 457 (197)
  • Number of distinct files requested: 9 638 (2 268)
  • Number of distinct hosts served: 311 878 (11 284)
  • Number of new hosts served in last 7 days: 7 020
  • Corrupt logfile lines: 262
  • Unwanted logfile entries: 976
  • Total data transferred: 23 953 Mbytes (510 619
    kbytes)
  • Average data transferred per day: 28 582 kbytes
    (72 946 kbytes)
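Several of the totals in a summary like this one are simple counts over parsed log records. The sketch below works on already-parsed records (dicts with `host`, `path`, `status` and `size` keys, all hypothetical names) and uses a simplified status classification: 2xx is successful, 3xx is redirected, 4xx and 5xx are failed. A real analysis tool may classify some codes differently.

```python
def summarize(records, days):
    """Compute a few headline figures from parsed log records."""
    ok = [r for r in records if 200 <= r["status"] < 300]
    redirected = [r for r in records if 300 <= r["status"] < 400]
    failed = [r for r in records if r["status"] >= 400]
    return {
        "successful": len(ok),
        "avg_per_day": len(ok) / days,
        "redirected": len(redirected),
        "failed": len(failed),
        "distinct_files": len({r["path"] for r in ok}),
        "distinct_hosts": len({r["host"] for r in ok}),
        "bytes": sum(r["size"] for r in ok),
    }


# Invented records standing in for two days of traffic.
records = [
    {"host": "a.example", "path": "/index.html", "status": 200, "size": 1000},
    {"host": "b.example", "path": "/index.html", "status": 200, "size": 1000},
    {"host": "a.example", "path": "/old.html", "status": 301, "size": 0},
    {"host": "c.example", "path": "/missing.html", "status": 404, "size": 0},
    {"host": "a.example", "path": "/paper.pdf", "status": 200, "size": 5000},
]
summary = summarize(records, days=2)
print(summary["successful"], summary["avg_per_day"], summary["bytes"])  # 3 1.5 7000
```

The per-day average is the total divided by the length of the period: in the sample, 4 282 156 successful requests over 858.1 days gives the reported 4 990.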

35
Downie and Web Usage
  • User-based analyses
  • who
  • where
  • what
  • File-based analyses
  • amount
  • Request analyses
  • conform (loosely) to Zipf's Law
  • Byte-based analyses

36
Neat Bibliometric Web Tricks
  • use a search engine to find references
  • link:www.ischool.utexas/donturn
  • key to using unique names
  • use many engines
  • update times different
  • blocking mechanisms are different
  • use Google News (and the like)
  • look for references
  • look for IP addresses of users

37
Neat Tricks, cont.
  • Walking up the Links
  • follow URLs upward
  • Reverse Sort
  • look for relations
  • Use your own robot to index
  • test
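"Walking up the links" can be sketched as stripping one path segment at a time from a URL and visiting each parent. The function name and the example URL are assumptions for illustration. Fetching the parent pages is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def walk_up(url):
    """List the parent URLs of url, from the nearest directory up to the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    while segments:
        segments.pop()  # drop the last path segment
        path = "/" + "/".join(segments) + ("/" if segments else "")
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for parent in walk_up("http://example.org/a/b/page.html"):
    print(parent)
# http://example.org/a/b/
# http://example.org/a/
# http://example.org/
```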

38
Projects
  • capture current and previous user information
    seeking behavior and modify interface and content
    to meet needs
  • Dynamic Web Publishing System
  • anticipate information seeking behavior
  • based on recorded preferences and pre-supplied
    rules, generate and guide users through a
    document space.

39
Summary
  • Bibliometrics, now Informetrics
  • Bradford's: distribution of documents in a
    specific discipline
  • Lotka's: number of authors of varying
    productivity
  • Zipf's: word frequency rankings
  • The Web
  • out of control in growth & opportunities
  • wise setup can help
  • use good analysis tools

40
Projects & Papers
  • Does everyone have a topic or project?
  • Let's talk more (via email too) about ideas and
    projects