1
PDF Metadata Optimization
  • Master's Project Presentation
  • Advisor: Dr. Stewart Shen
  • Student: Roopa Kadiri
  • Dec 08, 2006

2
Overview
  • Introduction
  • Process Overview
  • PDF Metadata Update Process
  • PDF Metadata and Keyword Search
  • Conclusion

3
Introduction
  • Search Engine Optimization (SEO) is the practice
    of manipulating aspects of a Web site to improve
    its ranking in search engines.
  • PDF Metadata Optimization applies this idea to PDF
    documents by updating them with proper metadata.

4
Technical Challenges
  • Automatically assign metadata (Title, Author,
    Keywords) to each PDF.
  • Process each PDF and analyze the text inside to
    identify candidates for Title, Author and
    Keywords.
  • Meeting these challenges requires choosing an
    appropriate platform and toolkits and developing
    the relevant scripts.

5
Process Overview
  • PDF Metadata Update Process
      • Upload PDF document
      • Uncompress PDF
      • Analyzing and Parsing
      • Identifying Title and Author
      • Identifying Keyword
      • Updating PDF Metadata
  • PDF Metadata and Keyword Search
      • PDF Metadata Search
      • PDF Text Keyword Search

6
PDF Metadata Update Process Diagram
7
Upload PDF Screenshot
8
Uploaded PDF Document Screenshot
9
Script Screenshot (Uncompress PDF)
10
Uncompress PDF
  • The PDF is uncompressed using the pdftk tool.
  • pdftk is a tool for manipulating PDF files.
  • Here it uncompresses a PDF into a temporary file
    that can be parsed and analyzed fairly easily.
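
The uncompress step can be sketched as below. The project's scripts were written in Perl; Python is used here only for illustration, and the function and file names are hypothetical. The command itself follows pdftk's documented form `pdftk <input> output <output> uncompress`.

```python
import subprocess


def build_uncompress_command(src_pdf, dst_pdf):
    """Return the pdftk argument list that writes an uncompressed copy of src_pdf."""
    # 'uncompress' is a pdftk output option that removes page-stream compression,
    # which leaves the text operators readable in the output file.
    return ["pdftk", src_pdf, "output", dst_pdf, "uncompress"]


def uncompress_pdf(src_pdf, dst_pdf):
    """Run pdftk; this requires the pdftk binary to be installed and on PATH."""
    subprocess.run(build_uncompress_command(src_pdf, dst_pdf), check=True)
    return dst_pdf
```

Calling `uncompress_pdf("upload.pdf", "upload.tmp.pdf")` would produce the temporary file that the later parsing step reads.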

11
Analyzing and Parsing
  • Developed a Perl script that parses the
    uncompressed PDFs, identifies the text inside the
    pattern (...) Tj, and groups it by font value,
    which is specified by the Tf operator.
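
A minimal sketch of this kind of parsing is shown below, in Python rather than the project's Perl. It assumes that "font value" means the font size given to the Tf operator, and that text appears as simple `(...) Tj` strings; TJ arrays, hex strings and escape decoding are not handled. The names `TOKEN` and `group_text_by_font_size` are hypothetical.

```python
import re
from collections import defaultdict

# Matches either a font selection ("/F1 24 Tf") or a text-showing operator ("(Hello) Tj").
TOKEN = re.compile(
    r"/(?P<font>\S+)\s+(?P<size>\d+(?:\.\d+)?)\s+Tf"
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"
)


def group_text_by_font_size(content):
    """Group the strings shown with Tj by the font size of the most recent Tf."""
    groups = defaultdict(list)
    current_size = 0.0  # size in effect before any Tf has been seen
    for match in TOKEN.finditer(content):
        if match.group("size") is not None:
            # A Tf operator: remember the size for the text that follows.
            current_size = float(match.group("size"))
        else:
            # A Tj operator: file the string under the current size.
            groups[current_size].append(match.group("text"))
    return dict(groups)
```

The result maps each font size to the text fragments drawn at that size, in document order.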

12
Uncompressed PDF
13
Identifying Title and Author
  • Text streams collected in a list are grouped
    based on their font values.
  • The list is sorted to get the two text streams
    with the largest font values.
  • These two text streams are good candidates for
    Title and Author respectively.
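
The selection step can be illustrated with the following Python sketch. It assumes the grouped text is available as a mapping from font size to a list of text fragments; that data shape and the function name are assumptions for illustration, not the project's actual Perl structures.

```python
def pick_title_and_author(groups):
    """groups maps font size -> list of text fragments; return (title, author)."""
    # Sort so the largest font sizes come first.
    ranked = sorted(groups.items(), key=lambda item: item[0], reverse=True)
    # Join the fragments of the two largest groups into candidate strings.
    candidates = [" ".join(fragments).strip() for _, fragments in ranked[:2]]
    # Pad so documents with fewer than two font sizes still return a pair.
    candidates += [""] * (2 - len(candidates))
    return candidates[0], candidates[1]
```

This heuristic relies on the title being set in the largest font and the author in the second largest, which holds for many papers but not all.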

14
Identifying Keyword
  • Developed a script that parses the PDF document
    and looks for the text "Keywords".
  • If it is found, the words preceding the Keywords
    text are used as a candidate for the Keywords
    metadata field.
  • If not, the script instead counts how often each
    word occurs in the PDF text (after filtering out
    stop words) and uses the 15 most frequent words
    as the metadata Keywords.
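
Both branches can be sketched in Python as below. Two assumptions are made that the slides do not confirm: the keyword list is taken from the text that follows a "Keywords" label on the same line (the slide says "preceding", so the project may have located it differently), and the stop-word list is a tiny illustrative one. All names are hypothetical.

```python
import re
from collections import Counter

# A deliberately small stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "on", "the", "to", "with"}


def extract_keywords(text, limit=15):
    """Return keywords from an explicit 'Keywords' line, else the most frequent words."""
    # Assumption: keywords follow a "Keywords" label on one line, separated by , or ;
    labelled = re.search(r"(?im)^\s*keywords\s*[:\-]?\s*(.+)$", text)
    if labelled:
        return [k.strip() for k in re.split(r"[,;]", labelled.group(1)) if k.strip()]
    # Fallback: rank the remaining words by frequency after dropping stop words.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(limit)]
```

The default `limit=15` mirrors the top-15 cut-off described on the slide.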

15
Updating PDF Metadata
  • PDF::API2 is the Perl module used to assign the
    metadata for PDF files.
  • This module was used to update the PDF files with
    the Title, Author and Keywords identified above.

16
Updated PDF Document Screenshot
17
PDF Screenshot
18
Updated PDF Document Properties Screenshot
19
PDF Metadata Search
  • Boolean search on Title, Author and Keywords.
  • Used PDF::API2 to extract the metadata from each
    PDF and search it for the user-entered Title,
    Author and Keywords.
  • Display the matched PDF documents.
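
The matching logic might look like the Python sketch below. The slides do not say how the three fields are combined, so this assumes an AND across whichever fields the user filled in, with case-insensitive substring matching. The metadata is assumed to have been extracted already into plain dictionaries; the extraction itself (done with PDF::API2 in the project) is not shown, and all names are hypothetical.

```python
def matches(metadata, title="", author="", keywords=""):
    """True when every non-empty query term occurs in the corresponding metadata field."""
    query = {"Title": title, "Author": author, "Keywords": keywords}
    return all(
        term.lower() in metadata.get(field, "").lower()
        for field, term in query.items()
        if term  # fields the user left blank are ignored
    )


def search_metadata(library, **query):
    """library maps file name -> metadata dict; return the matching file names, sorted."""
    return sorted(name for name, metadata in library.items() if matches(metadata, **query))
```

Switching `all` to `any` would turn the same sketch into an OR search.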

20
PDF Metadata Search Screenshot
21
PDF Metadata Search Results Screenshot
22
PDF Text Keyword Search
  • Used the xpdf tool to extract the text from each
    PDF.
  • Loop through each PDF's text and look for
    occurrences of the user-entered keyword.
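
A rough Python sketch of this search follows. It assumes text extraction uses `pdftotext`, the command-line extractor shipped with xpdf, and that matching is a case-insensitive substring count; the slides do not specify either detail, and the function names are hypothetical.

```python
import subprocess


def extract_text(pdf_path, txt_path):
    """Run xpdf's pdftotext (must be on PATH) and return the extracted text."""
    # "-enc UTF-8" asks pdftotext to write UTF-8 so the file can be read back reliably.
    subprocess.run(["pdftotext", "-enc", "UTF-8", pdf_path, txt_path], check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def count_keyword(texts, keyword):
    """texts maps file name -> extracted text; return {file name: count} for files with hits."""
    needle = keyword.lower()
    hits = {}
    for name, text in texts.items():
        count = text.lower().count(needle)
        if count:
            hits[name] = count
    return hits
```

Files with no occurrences are left out of the result, so the returned mapping can be displayed directly as the list of matches.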

23
PDF Text Keyword Search Screenshot
24
PDF Text Keyword Search Results Screenshot
25
Conclusion
  • Accomplished automatic updating of PDFs with
    Title, Author and Keywords so that search
    engines can pick up the updated metadata while
    crawling.
  • Additionally, I have demonstrated metadata and
    full-text keyword search of PDF documents.

26
References
  • ActiveState Perl 5.8
      http://www.activestate.com/Products/ActivePerl/?tn1
  • pdftk tool
      http://www.accesspdf.com/pdftk/
  • xpdf
      http://www.foolabs.com/xpdf/
  • PDF::API2
      http://search.cpan.org/dist/PDF-API2/

27
  • Thank You