Title: Masters Thesis Defense
1Masters Thesis Defense
- Bibliographic Tools In The Context Of WWW And LaTeX
- Munushree Thummala
- Committee members
- Dr. Prabhaker Mateti (Advisor)
- Dr. Thomas Hartrum
- Dr. T.K. Prasad
2Agenda
- Introduction
- BiBTeX Primer
- Bibliographic Tool Survey
- Requirements for the BiBTeXTools
- Design Discussion
- Conclusion
- Future Work
- Questions & Answers Session
- Demonstration
3Introduction
- Preparing academic papers
- Collecting bibliographic entries
- Tools used to prepare the papers
- Common problems
4BibTeX Primer
- What is BibTeX?
- Helps authors prepare the References section in their documents
- Defines entry types and required/optional fields
- Uses style files to define the format of references
- Standards for publications are specified in style files
- Used with LaTeX
- LaTeX collects \cites in the .tex file
- BibTeX extracts corresponding references from .bib file
- BibTeX formats and sorts according to the .bst style
- Output of BibTeX program is LaTeX formatted text
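The first step above (collecting the \cite keys from a .tex file) can be sketched as follows. This is an illustration only, not how LaTeX itself does it: the regular expression and the function name `collect_cite_keys` are assumptions made for this sketch, and the pattern covers only common \cite variants.

```python
import re

# Assumed pattern: \cite, \citep, \citet*, etc., with optional [..] arguments.
# It is a simplification and does not cover every citation package.
CITE_PATTERN = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def collect_cite_keys(tex_source: str) -> list[str]:
    """Return citation keys in order of first appearance, without repeats."""
    keys: list[str] = []
    for match in CITE_PATTERN.finditer(tex_source):
        # One \cite command may list several comma-separated keys.
        for key in match.group(1).split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


sample = r"See \cite{Thummala-2007} and \cite{Holland-1975, Thummala-2007}."
print(collect_cite_keys(sample))
```

BibTeX would then look up each collected key in the .bib file and format the matching entries according to the .bst style.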
5Sample BibTeX entry
@mastersthesis{Thummala-2007,
  author = {Munushree Thummala},
  title = {Bibliographic tools in the context of WWW and \latex},
  month = {November},
  year = {2007},
  school = {Wright State University},
  OPTkey = {},
  OPTtype = {},
  OPTaddress = {},
  OPTnote = {},
  OPTannote = {},
  advisor = {Prabhaker Mateti}
}
6Contribution Of Thesis
- Evaluation of Bibliographic tools
- BiBTeX to Database Suite of Tools
- Database to store BibTeX entries
- LoadBiBTeX
- BibSearch
- Discovery of Duplicate BiBTeX entries
- Normalization of BiBTeX entries
- Text to BiBTeX Translation
- TextToBiBTeX command line tool API
- PDFrefsToBiBTeX command line tool
- Integration of TextToBiBTeX into Aigaion
7Bibliographic Tools
- There are 100 tools
- In this thesis 87 are reviewed
- Tools were evaluated for the following
- Formats supported
- Navigating, Searching and Sorting capabilities
- Ease of maintaining bibliographic entries
- Duplicate discovery
- Import/Export to other formats
8Bibliographic Tools
- Web browser based tools
- Aigaion, Bibsonomy, CiteULike, Zotero, BibORB, Basilic, PubsOnline, etc.
- Desktop/Small scale tools
- JabRef, KBibTeX, TkBibTeX, BibDB, BibEdit, Open Office Bibliographic Manager, Tellico, etc.
- Commercial tools
- Scholars Aid, Bookends, NotaBene, ProCite, etc.
- Utilities
- Bib2html, Bibclean, Bp, Bibdup, Sixpack, etc.
9A Few Notable Tools
- Aigaion
- Zotero
- Bibsonomy
- JabRef
10Aigaion
- Web application, Open source
- Easy to use
- Supports basic editing features
- Supports Multiple Users
- Native format is BiBTeX
- Organizes references by Topics & Sub Topics
- Maintains a list of authors to eliminate duplication
- Duplicate discovery present in import feature
11Aigaion (Contd. 2)
12Aigaion (Contd. 3)
13Aigaion (Contd. 4) Author Profile
14Zotero
- Firefox Browser Extension
- Easy to use
- Organizes entries in collections
- Captures bibliographic entries from websites automatically
- Some drawbacks
- Loses BiBTeX citation keys and custom fields while importing
- Not well suited for managing BiBTeX bibliographies
- Local storage
15Zotero (Contd. 2)
16Zotero (Contd. 3)
17Zotero (Contd. 4)
18Zotero (Contd. 5)
19Bibsonomy
- Web browser based, hosted service
- Easy to use
- References
- Users upload refs and bookmarks to Bibsonomy
- Made available to other users
- Tagged with keywords for categorization and search
- Can be exported as BiBTeX
- Browser shortcuts to capture entries from web
20Bibsonomy (Contd. 2)
21Bibsonomy (Contd. 3)
22Bibsonomy (Contd. 4)
23Bibsonomy (Contd. 5)
24JabRef
- Desktop Application
- Easy to use
- Multiple bib files can be edited
- Search online
- CiteSeer, Medline, IEEE Xplore, arXiv.org
- Native format is BibTeX
- Auto generate BiBTeX keys
- Imports/Exports multiple formats
25JabRef (Contd. 2)
26JabRef (Contd. 3)
27JabRef (Contd. 4)
28Requirements for New Tools
- Text to BiBTeX translation
- Translating free style text into BibTeX
- Customizing the translation
- Certainty of Recognition measure
- Extract references section from PDF papers
- Provide an API for other developers to integrate free style translation into their applications
- Command line invocation
- GUI also
- Normalized BiBTeX output
29Requirements (Contd. 2)
- Database of Bibliographic entries
- Database to store BiBTeX files
- Tool to Detect duplicates
- Command line invocation
- Normalized BiBTeX output
30Requirements (Contd. 3)
- Search and Generate BiBTeX files
- Flexible searches
- Command line invocation
- Outputs BiBTeX format
- Normalized BiBTeX output
- Platform Independent
31Database on Local Machine
- Tables to store
- BiBTeX entries
- lookup data for text to BiBTeX translation
- search index data for fast and flexible searching
32Database Of BiBTeX Entries
- A schema to store BiBTeX entries
- including string macros
- Ability to specify a tag for each entry
- Tag defaults to .bib filename
33Database Of Lookup Data
- A database Schema to store lookup tables
- Lookup Tables
- Author Sub Names
- Journal Names
- Publishers
- Cities
- States
- Months
- Organizations
34Database Of Search Indexes
- A database Schema to store BiBTeX Search Index data
- Stores data as sequence of tokens
- Provides ability to search
- Any field(s)
- Any keyword(s)
- Citation key also stored as tokens
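A token-level index of this kind can be sketched in memory as below. This is a minimal illustration, not the thesis schema: the class `TokenIndex`, the `citekey` pseudo-field, and the tokenization rule are assumptions, and the two sample entries are only illustrative data.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class TokenIndex:
    """Maps (field, token) pairs to the citation keys that contain them."""

    def __init__(self) -> None:
        self._index = defaultdict(set)

    def add_entry(self, cite_key: str, fields: dict[str, str]) -> None:
        for field, value in fields.items():
            for token in tokenize(value):
                self._index[(field.lower(), token)].add(cite_key)
        # The citation key itself is also stored as tokens.
        for token in tokenize(cite_key):
            self._index[("citekey", token)].add(cite_key)

    def search(self, fields: list[str], keywords: list[str]) -> set[str]:
        """Keys of entries where every keyword occurs in one of the fields."""
        tokens = [t for k in keywords for t in tokenize(k)]
        result = None
        for token in tokens:
            hits = set()
            for field in fields:
                hits |= self._index.get((field.lower(), token), set())
            result = hits if result is None else result & hits
        return result or set()


index = TokenIndex()
index.add_entry("Knuth-1984", {"author": "Donald E. Knuth", "title": "The TeXbook"})
index.add_entry("Lamport-1994", {"author": "Leslie Lamport",
                                 "title": "LaTeX: A Document Preparation System"})
print(index.search(["author"], ["Donald", "Knuth"]))
```

Because matching happens per token, a query for Donald and Knuth still finds an author stored as "Donald E. Knuth", which a plain substring comparison on "Donald Knuth" would miss.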
35LoadBiBTeX Tool
- Loads BiBTeX files into the database and updates the search index tables
- Loads the lookup tables used by Text to BiBTeX tool
- Detects duplicates
36LoadBibTeX Loads BiBTeX Files
- Program Usage
- LoadBiBTeX loadentries bibtag thesis2007 bibfile thesis.bib
- Any entries that have errors are not loaded and are shown in the output
- Updates the index tables used by the BibSearch tool
37LoadBibTeX Populate Lookup Tables
- Program Usage
- LoadBiBTeX loadauthors loadpublishers loadjournals bibfile thesis.bib
- Only new values are loaded
- The above command does not load the BiBTeX entries
38LoadBibTeX Duplicate Discovery
- Program Usage
- LoadBiBTeX dupdisc bibtag thesis2007 bibfile thesis.bib
- The BiBTeX entries in thesis.bib are read and compared to the entries in the database corresponding to the bibtag thesis2007
- Any entries considered to be duplicates are displayed for the user
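The slides do not spell out the rules LoadBiBTeX uses to decide that two entries are duplicates. The sketch below shows one way a comparison that ignores case, punctuation and spacing could work. The signature (title tokens, year, author tokens) and the function names are assumptions made for illustration only.

```python
import re


def _tokens(text: str) -> frozenset:
    """Lowercased alphanumeric tokens, ignoring punctuation and spacing."""
    return frozenset(t for t in re.split(r"[^0-9a-z]+", text.lower()) if t)


def logical_signature(entry: dict[str, str]) -> tuple:
    """Hypothetical signature: title tokens, year, and author name tokens."""
    authors = _tokens(entry.get("author", "")) - {"and"}
    return (_tokens(entry.get("title", "")), entry.get("year", "").strip(), authors)


def find_duplicates(new_entries: dict, stored_entries: dict) -> list[tuple[str, str]]:
    """Pair each new citation key with stored keys sharing its signature."""
    stored: dict = {}
    for key, entry in stored_entries.items():
        stored.setdefault(logical_signature(entry), []).append(key)
    pairs = []
    for key, entry in new_entries.items():
        for match in stored.get(logical_signature(entry), []):
            pairs.append((key, match))
    return pairs


stored = {"Thummala-2007": {"author": "Munushree Thummala", "year": "2007",
                            "title": "Bibliographic tools in the context of WWW and LaTeX"}}
new = {"thummala07": {"author": "Munushree  Thummala", "year": "2007",
                      "title": "Bibliographic Tools in the Context of WWW and LaTeX."},
       "Holland-1975": {"author": "J. H. Holland", "year": "1975",
                        "title": "Adaptation in Natural and Artificial Systems"}}
print(find_duplicates(new, stored))
```

The two Thummala entries differ in citation key, capitalization, spacing and trailing punctuation, so a string comparison would treat them as distinct, while the token-based signature pairs them.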
39BibSearch Searching The Database
- Program Usage
- BibSearch bibtag thesis2007 fields author keywords Donald Knuth
- The database is searched for entries with the tag thesis2007 and the words Donald and Knuth in the author field
- The resulting BiBTeX entries and any required @String constructs are normalized and written to the output
40Normalization
- Make BiBTeX entries consistent
- Some of the rules
- Citation Keys are consistent
- Fields are enclosed in { } to preserve formatting
- Month field abbreviations are expanded
- Missing required fields are indicated to the user appropriately
- Order of the fields in the output
- Where is it implemented?
- In whichever tool a particular rule makes sense
- Spread across TextToBiBTeX, LoadBibTeX, BibSearch
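A few of these rules can be sketched as below. This is a minimal illustration under stated assumptions: the key scheme (author last names plus year, as in Thummala-2007), the leading field order, and the names `make_cite_key` and `normalize_entry` are inferred from the examples in this deck or invented for the sketch, and are not the tools' actual code.

```python
MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "may": "May", "jun": "June", "jul": "July", "aug": "August",
    "sep": "September", "oct": "October", "nov": "November", "dec": "December",
}
# Assumed output order: these fields first, all others in their input order.
LEADING_FIELDS = ["author", "title"]


def make_cite_key(author: str, year: str) -> str:
    """Assumed key scheme: last names joined by '-', then the year."""
    last_names = [name.split()[-1] for name in author.split(" and ") if name.strip()]
    return "-".join(last_names + [year])


def normalize_entry(entry_type: str, fields: dict[str, str]) -> str:
    normalized = {}
    for name, value in fields.items():
        name, value = name.lower(), value.strip()
        if name == "month":
            # Expand abbreviations such as "Nov" or "nov." to the full name.
            value = MONTHS.get(value.lower().rstrip(".")[:3], value)
        normalized[name] = value
    key = make_cite_key(normalized.get("author", ""), normalized.get("year", ""))
    ordered = [n for n in LEADING_FIELDS if n in normalized]
    ordered += [n for n in normalized if n not in LEADING_FIELDS]
    lines = [f"@{entry_type.upper()}{{{key},"]
    # Every value is wrapped in braces to preserve its formatting.
    lines += [f"  {n.upper()} = {{{normalized[n]}}}," for n in ordered]
    lines.append("}")
    return "\n".join(lines)


print(normalize_entry("mastersthesis", {
    "title": "Bibliographic tools in the context of WWW and \\latex",
    "year": "2007", "school": "Wright State University",
    "month": "Nov", "author": "Munushree Thummala",
}))
```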
41Normalization (Example 2)
- Before normalization:
@mastersthesis{Thummala2007,
  title = {Bibliographic tools in the context of WWW and \latex},
  year = {2007},
  school = {Wright State University},
  month = {Nov},
  author = {Munushree Thummala},
  advisor = {Prabhaker Mateti},
}
- After normalization:
@MASTERSTHESIS{Thummala-2007,
  AUTHOR = {Munushree Thummala},
  TITLE = {Bibliographic tools in the context of WWW and \latex},
  MONTH = {November},
  YEAR = {2007},
  SCHOOL = {Wright State University},
  ADVISOR = {Prabhaker Mateti},
}
42Normalization (Example 3)
- Before normalization:
@InCollection{lawrence01access,
  author = "Steve Lawrence",
  title = "Access to Scientific Literature",
  journal = "The \it Nature Yearbook of Science and Technology",
  editor = "Declan Butler",
  publisher = "Macmillan",
  address = "London, England",
  pages = "86-88",
  year = 2001
}
- After normalization:
@INCOLLECTION{Lawrence-2001,
  AUTHOR = {Steve Lawrence},
  TITLE = {Access to Scientific Literature},
  BOOKTITLE = {},
  YEAR = {2001},
  JOURNAL = {The \it Nature Yearbook of Science and Technology},
  EDITOR = {Declan Butler},
  PUBLISHER = {Macmillan},
  ADDRESS = {London, England},
}
43Text to BiBTeX Translation
- What are Free Style References and where would authors find these?
- References at the end of academic papers
- References on Internet sites like CiteSeer
- A jotted-down text description
- How do authors benefit from this translation?
- No need to manually convert to BiBTeX
- Significantly better accuracy
- Speeds the process of translating multiple references
44Text to BiBTeX Translation (Contd. 2)
- Ways to translate free style text
- Write a routine to analyze the strings and guess the fields
- Develop
- Language Grammar
- Recursive Descent Parser
- Which method did we pick?
- Recursive Descent Parsing
- Tried other methods with varying degrees of
success
45Text to BiBTeX Translation (Contd. 3)
- How does the Parser work?
- Extent: A sequence of tokens
- Field type: An extent that matches the set of okTokens for that field and ends when a notOkToken (including a delimiting token) is hit.
- Backtrack: If the current token in an extent does not match the field, the parser backtracks to the beginning token of the extent, which is then given a chance to match other field types.
- Unrecognized: If the current token does not match any field type, it is appended to the unrecognized field list and the above process is repeated starting at the next token.
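The extent, backtrack and unrecognized steps can be sketched as a simplified loop. This is not the thesis parser: the lookup sets, the field order, and the function `parse` are assumptions, and real okToken and notOkToken rules would be considerably richer than a single predicate per field.

```python
from typing import Callable

# Hypothetical lookup data; the actual tool reads lookup tables from a database.
AUTHOR_NAMES = {"holland"}
PUBLISHER_WORDS = {"the", "university", "of", "michigan", "press"}
DELIMITERS = {",", ".", "(", ")"}


def is_year(token: str) -> bool:
    return (len(token) == 4 and token.isascii() and token.isdigit()
            and 1900 <= int(token) <= 2100)


# Each field type pairs a name with a predicate playing the role of okTokens.
FIELD_TYPES: list[tuple[str, Callable[[str], bool]]] = [
    ("author", lambda t: t.lower() in AUTHOR_NAMES),
    ("year", is_year),
    ("publisher", lambda t: t.lower() in PUBLISHER_WORDS),
]


def parse(tokens: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    fields: dict[str, list[str]] = {}
    unrecognized: list[str] = []
    pos = 0
    while pos < len(tokens):
        if tokens[pos] in DELIMITERS:
            pos += 1  # a delimiting token ends any extent
            continue
        for name, ok in FIELD_TYPES:
            end = pos
            while end < len(tokens) and ok(tokens[end]):
                end += 1  # grow the extent while tokens are ok for this field
            if end > pos and name not in fields:
                fields[name] = tokens[pos:end]
                pos = end
                break
            # Otherwise backtrack: pos is unchanged and the next type is tried.
        else:
            unrecognized.append(tokens[pos])  # no field type matched
            pos += 1
    return fields, unrecognized


tokens = ["Holland", ",", "Adaptation", "The", "University", "of",
          "Michigan", "Press", ",", "1975"]
print(parse(tokens))
```

In this reduced example "Adaptation" lands in the unrecognized list because the sketch has no title rule. The title heuristic described on the later slides would claim it instead.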
46Text to BiBTeX Translation (Contd. 4)
- How is a series of tokens recognized as a field?
- Author, Journal fields - lookup table and heuristics
- Title field - quoted strings or heuristics
- Pages field
- PAGES | PP. | P. <number>-<number>
- Year field - a four digit number between 1900 and 2100
- Volume field
- VOL. | VOLUME <number>
- Number field
- NO. | NUMBER <number>
- Abbrev field
- <volume>(<number>)<startpage>-<endpage>
- Edition field
- EDITION <number> or <number> EDITION
- Publisher field, Place, State - Lookup table
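The pattern-based fields above lend themselves to regular expressions. The sketch below is one possible rendering, assuming case-insensitive keywords and optional periods. The expressions and the helper `recognize` are illustrative and not taken from the tool.

```python
import re

# A four digit number between 1900 and 2100.
YEAR = re.compile(r"\b(19\d\d|20\d\d|2100)\b")
# PAGES | PP. | P. followed by <number>-<number>
PAGES = re.compile(r"\b(?:PAGES|PP\.?|P\.?)\s*(\d+)\s*-+\s*(\d+)", re.IGNORECASE)
# VOL. | VOLUME followed by <number>
VOLUME = re.compile(r"\b(?:VOLUME|VOL\.?)\s*(\d+)", re.IGNORECASE)
# NO. | NUMBER followed by <number>
NUMBER = re.compile(r"\b(?:NUMBER|NO\.?)\s*(\d+)", re.IGNORECASE)
# Abbreviated form: <volume>(<number>)<startpage>-<endpage>
ABBREV = re.compile(r"\b(\d+)\((\d+)\)\s*(\d+)\s*-+\s*(\d+)")


def recognize(text: str) -> dict[str, str]:
    """Return the pattern-based fields found in a free style reference."""
    found: dict[str, str] = {}
    if m := PAGES.search(text):
        found["pages"] = f"{m.group(1)}-{m.group(2)}"
    if m := VOLUME.search(text):
        found["volume"] = m.group(1)
    if m := NUMBER.search(text):
        found["number"] = m.group(1)
    if m := YEAR.search(text):
        found["year"] = m.group(1)
    return found


print(recognize("Vol. 20, No. 3, pp. 59-101, 1983"))
print(ABBREV.search("12(3)45-67").groups())
```

Fields that depend on lookup tables (author, journal, publisher, place, state) cannot be captured this way and are left out of the sketch.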
47Text to BiBTeX Translation (Contd. 5)
- A lexical analyzer tokenizes
- Holland, J. H. Adaptation in Natural and
Artificial Systems. The University of Michigan
Press, Ann Arbor, MI (1975).
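A lexical analyzer for this input can be sketched in a few lines. The token rule used here (runs of word characters, plus each punctuation mark as its own token) is an assumption for illustration; the slides do not specify the actual token classes.

```python
import re

# Assumed token rule: a run of word characters, or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(reference: str) -> list[str]:
    """Split a free style reference into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(reference)


reference = ("Holland, J. H. Adaptation in Natural and Artificial Systems. "
             "The University of Michigan Press, Ann Arbor, MI (1975).")
print(tokenize(reference))
```

Keeping punctuation as separate tokens lets the later recognition steps treat commas, periods and parentheses as delimiting tokens.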
48Text to BiBTeX Translation (Contd. 6)
- Author Field Recognition
- Holland was present in author lookup table
- J., H. are initials and the author is recognized as present in the form lastname, firstname
- Author Field is set to J. H. Holland
49Text to BiBTeX Translation (Contd. 7)
- Title Field Recognition
- Since Adaptation is not recognized as a
possible starting token of any other field,
tokens are gathered till the next punctuation as
title field
50Text to BiBTeX Translation (Contd. 8)
- Publisher Field Recognition
- The sequence of tokens "The", "University", "of", "Michigan" and "Press" represents a valid publisher name in the publishers lookup table
- Thus "The University of Michigan Press" is the publisher field
51Text to BiBTeX Translation (Contd. 9)
- Place and State Field Recognition
- The sequence of tokens "Ann" and "Arbor" represents a valid place name in the cities lookup table
- The token "MI" represents a valid state name in the states lookup table
52Text to BiBTeX Translation (Contd. 10)
- Year Field Recognition
- The token 1975 is a valid year value in the range 1900 - 2100. As such it becomes the year field
53Text to BiBTeX Translation (Contd. 11)
- Citation Entry Type
- Since there are no distinguishing fields recognized, the entry type is defaulted to Misc
- CORN calculations
- Author field is fully recognized → a CORN of 100
- Title field follows Author field → a CORN of 100
- Publisher field is in lookup table → a CORN of 100
- There are no required fields for Misc entry type. So multiplier is 1
- Entry CORN = AVG(Author, Title, Publisher) × multiplier = 100
54Text to BiBTeX Translation (Contd. 12)
-- Entry CORN = 100, Author = 100, Title = 100,
-- Publisher = 100
@MISC{Holland-1975,
  AUTHOR = {J. H. Holland},
  TITLE = {Adaptation in Natural and Artificial Systems},
  YEAR = {1975},
  PUBLISHER = {The University of Michigan Press},
  PLACE = {Ann Arbor},
  STATE = {MI},
}
55Text to BiBTeX Translation Example 1
- Werner Damm and Bernhard Josko. A sound and relatively complete Hoare-logic for a language with higher type procedures. Acta Informatica, 20:59-101, 1983.
-- Entry CORN = 87, Author = 50, Title = 100, Journal = 100, Pages = 100
@ARTICLE{Damm-Josko-1983,
  AUTHOR = {Werner Damm and Bernhard Josko},
  TITLE = {A sound and relatively complete Hoare-logic for a language with higher type procedures},
  YEAR = {1983},
  JOURNAL = {Acta Informatica},
  PAGES = {59-101},
  VOLUME = {20},
}
56Text to BiBTeX Translation Example 2
- Collins R. J. and Jefferson D. R. "AntFarm: towards simulated evolution." In C. G. Langton, C. Taylor, J. D. Farmer, and S. Rasmussen (Eds.), Artificial Life II, Vol. X of SFI Studies in the Sciences of Complexity. Redwood City, CA: Addison-Wesley, 1991, pp.579-601.
@INPROCEEDINGS{J-R-1991,
  AUTHOR = {Collins R. J. and Jefferson D. R.},
  TITLE = {AntFarm: towards simulated evolution.},
  YEAR = {1991},
  EDITOR = {G. Langton and C. Taylor and J. D. Farmer and S. Rasmussen},
  PAGES = {579-601},
  PUBLISHER = {Addison - Wesley},
  JOURNAL = {In C},
  PLACE = {Redwood City},
  STATE = {CA},
  OPTERRORFIELD0 = {Artificial Life II},
  OPTERRORFIELD1 = {Vol. X of SFI Studies in the Sciences of Complexity},
}
57Correctness Of Recognition Number
- CORN for entire BiBTeX entry is based on
- CORN for each field recognized
- Completeness of the entry (proportion of required fields present)
- CORN is calculated for
- Author field
- Editor field
- Title field
- Journal field
- Publisher field
- Pages field
58CORN Example 1
@INPROCEEDINGS{Wegener-2002,
  AUTHOR = {I. Wegener},
  TITLE = {Methods for the Analysis of Evolutionary Algorithms on PseudoBoolean Functions},
  BOOKTITLE = {},
  YEAR = {2002},
  PUBLISHER = {Kluwer Academic Publishers},
  JOURNAL = {In Evolutionary Optimization},
}
59CORN Example 1 (Contd.)
- Author, Title and Publisher were correctly recognized and their field CORN is set to 100 each.
- The journal field was recognized due to the presence of the string "In". As such it is assigned a CORN of 50.
- The required field Booktitle is not present so the multiplier is ¾.
- This reduces the entry CORN to 65: (100 + 100 + 100 + 50)/4 × 3/4 = 65.625, reported as 65.
60CORN Example 2
@MISC{Luckham-1990,
  AUTHOR = {David Luckham},
  TITLE = {Programming with Specifications},
  YEAR = {1990},
  EDITION = {1},
  OPTERRORFIELD0 = {Springer},
  OPTERRORFIELD1 = {Berlin},
}
61CORN Example 2 (Contd.)
- One of the Author names is not fully recognized and hence reduces the CORN for author field to 1/2 × 100 = 50
- Title is correctly recognized and its field CORN is set to 100.
- Year and Edition fields are correctly recognized but do not impact entry CORN.
- Entry CORN = (100 + 50)/2 = 75. Since the entry type is MISC, the multiplier is 1.
62CORN Example 3
@INPROCEEDINGS{Collins-Jefferson-1990,
  AUTHOR = {Robert J. Collins and David R. Jefferson},
  TITLE = {AntFarm: Towards simulated evolution},
  BOOKTITLE = {},
  YEAR = {1990},
  PAGES = {579--601},
  MONTH = {February},
  PUBLISHER = {Addison - Wesley},
  JOURNAL = {In Artificial Life II: Proceedings of the Workshop on Artificial Life},
  PLACE = {Santa Fe},
  STATE = {NM},
}
63CORN Example 3 (Contd.)
- Author names are fully recognized and hence CORN is set to 100.
- Title is correctly recognized and its field CORN is set to 100.
- Pages is recognized and the page range is valid so CORN is 100.
- Journal is recognized with a heuristic, so CORN is set to 50.
- Publisher is in the publishers lookup table, so CORN is set to 100.
- Entry CORN = (100 + 100 + 50 + 100 + 100)/5 × (3/4) = 67.5, reported as 67. The multiplier ¾ is due to the missing booktitle required field.
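The entry CORN arithmetic used in the examples can be checked with a short sketch. The function name `entry_corn` and its parameters are assumptions. Truncating to a whole number is inferred from the reported values (65.625 shown as 65, 67.5 shown as 67) and may not match the tool's actual rounding rule.

```python
def entry_corn(field_corns: dict[str, int],
               required_present: int,
               required_total: int) -> int:
    """Average of the field CORNs times the completeness multiplier."""
    if not field_corns:
        return 0
    average = sum(field_corns.values()) / len(field_corns)
    # Entry types with no required fields (such as MISC) get a multiplier of 1.
    multiplier = required_present / required_total if required_total else 1.0
    return int(average * multiplier)  # assumed: truncate, do not round


# CORN Example 1: booktitle missing, so 3 of 4 required fields are present.
print(entry_corn({"author": 100, "title": 100, "publisher": 100, "journal": 50}, 3, 4))
# CORN Example 2: MISC entry, no required fields.
print(entry_corn({"author": 50, "title": 100}, 0, 0))
# CORN Example 3: five scored fields, booktitle missing.
print(entry_corn({"author": 100, "title": 100, "pages": 100,
                  "journal": 50, "publisher": 100}, 3, 4))
```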
64TextToBiBTeX API
- SetupDbConnection
- setInputString
- setMarkupStream re colorized HTML
- setBiBTeXStream re BiBTeX entries
- textToBiBTeX text to BiBTeX translation
- getEntriesCount
- getBibTeXEntryFieldCount
- getBibTeXEntryField
65TextToBiBTeX API (Contd.)
- Java library jar
- Non-java programs can invoke
- TextToBiBTeX
- PDFrefsToBiBTeX
66TextToBiBTeX Command line tool
- Free style input in a file
- BiBTeX output
- Marked up HTML output
- Uses TextToBiBTeX API
- Usage
- TextToBiBTeX <txt file> bib file
67PDFrefsToBiBTeX Command line tool
- PDF file as input
- BiBTeX output
- Marked up HTML output
- Uses 3rd party tool PDFBox for parsing PDF file
- Uses TextToBiBTeX API
- Usage
- PDFrefsToBiBTeX -clean <pdf file> bib file
68Integrating into Aigaion
- Free Style translation functionality integrated into Aigaion
- Free Style recognition from PDF files
- Logic to clean the text recognized from PDF
- Synchronizing TextToBiBTeX lookup tables with entries from Aigaion database
69Integrating Into Aigaion (Contd. 2)
70Integrating Into Aigaion (Contd. 3)
71Integrating Into Aigaion (Contd. 4)
72Integrating Into Aigaion (Contd. 5)
73Sync Tables with Aigaion (Contd. 6)
74Sync Tables with Aigaion (Contd. 7)
75Conclusion
- Tool Survey
- Evaluated over 80 tools
- Tool Recommendations
- Database of BiBTeX entries
- Store BiBTeX files as database entries
- Searching works at the token level rather than the string level, which yields good results
- Duplicates are detected logically rather than by string comparison
76Conclusion (Contd.)
- Text to BiBTeX translation
- TextToBiBTeX saves scholars time and effort by relieving them of the burden of translating and maintaining BiBTeX entries
- TextToBiBTeX API allows other tools to reuse free style functionality
- Integrated into Aigaion tool
- Converted PDF references into BiBTeX format
77Future Work
- Better duplicate detection by letting the users configure the base rules for detecting duplicates
- Recognizing more variations in Free style text
- Recognizing more fields
- Optimizing the database loading speed for BiBTeX entries
78Demonstration
- Integration of free style into Aigaion
- Text file input
- PDF file input
- LoadBiBTeX Duplicate Discovery
- BibSearch Searching the database
- LoadBiBTeX loading a BiBTeX file
- LoadBiBTeX updating lookup tables