Title: ITEC810 Final Report Inferring Document Structure
1 ITEC810 Final Report: Inferring Document Structure
- Wieyen Lin/41348133
- Supervised by
- Jette Viethen
2 Outline
- Part A
- Introduction
- Related work
- Part B
- Material
- Methodology
- Part C
- Implementation
- Conclusion
3 Part A: Introduction
4 Introduction
5 Introduction (contd)
- Research Objective
- Analyze a document image and detect its logical structure with annotated labels
- Project Scope
- Focus on academic articles
- Source corpus: Association for Computational Linguistics (ACL) Anthology Corpus
6 Related Work
- Physical Layout Analysis
- Top-down methods
- Bottom-up methods
- Logical Structure Analysis
- Syntactic methods
- Rule-based methods
7 Part B: Methodology
8 Material: XML Source by Text
An example input file for the project
9 Methodology
10 Methodology (contd)
11 Methodology (contd)
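Algorithm for aggregating blocks in Phase II (flowchart)
- Read in 3 lines at a time
- Check dominant font size
- ABC: A, B, and C are separate blocks
- A1BA2: A1, B, and A2 are separate blocks
- ABB: A is one block, BB is another
- AAB: AA is one block, B is another
- A1A2A3: check spacing
- Check spacing
- Branch conditions: s1 = s2, s1 > s2, s1 > s2
- Outcomes: AAA belongs to the same block; A2A3 with A1 separate; A1A2 with A3 separate
Legend:
- A, B, C: lines of text with different dominant font sizes
- A1, A2: lines of text with the same dominant font size
- s1: spacing between A1 and A2
- s2: spacing between A2 and A3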
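The three-line window described above can be sketched in Python. This is a minimal illustration, not the project's implementation: the `Line` record, the `gap` helper, the `group_three` function, and the tolerance `tol` are all hypothetical names introduced here. The mapping of the two unequal-spacing branches to their outcomes is an assumption (the wider gap is taken to be the block boundary), since the flowchart labels do not make it explicit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    font_size: float  # dominant font size of the line
    top: float        # y of the line's top edge (y grows down the page)
    bottom: float     # y of the line's bottom edge


def gap(upper: Line, lower: Line) -> float:
    """Vertical spacing between two consecutive lines."""
    return lower.top - upper.bottom


def group_three(a: Line, b: Line, c: Line, tol: float = 0.5) -> List[List[Line]]:
    """Split a window of three consecutive lines into blocks."""
    ab = a.font_size == b.font_size
    bc = b.font_size == c.font_size
    if ab and bc:
        # A1 A2 A3: same dominant font size, so decide by spacing.
        s1, s2 = gap(a, b), gap(b, c)
        if abs(s1 - s2) <= tol:
            return [[a, b, c]]      # evenly spaced: one block
        if s1 > s2:
            return [[a], [b, c]]    # assumed: wider gap above the middle line
        return [[a, b], [c]]        # assumed: wider gap below the middle line
    if ab:
        return [[a, b], [c]]        # AAB
    if bc:
        return [[a], [b, c]]        # ABB
    return [[a], [b], [c]]          # ABC and A1 B A2


def make(text: str, size: float, top: float) -> Line:
    """Helper for the example: every line is 10 units tall."""
    return Line(text, size, top, top + 10)


# A heading in a larger font followed by two body lines (the ABB case).
blocks = group_three(make("Title", 14, 0), make("body 1", 10, 20), make("body 2", 10, 32))
print([[ln.text for ln in blk] for blk in blocks])
```

Comparing font sizes with exact equality assumes the sizes come straight from the source XML; with measured or rounded sizes a tolerance would be safer.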
12 Part C: Outcomes
13 Current Outcome
Original PDF document
14 Current Outcome (contd)
Logical structure outcome in HTML
15 Implementation: Class Diagram
16 Implementation: User Interfaces
17 Conclusion: Information Evaluation

| Error Type | Errors Found | Accuracy of Detection |
|---|---|---|
| Incorrect or missing title | 1 | 97.5% (39/40) |
| Incorrect or missing Abstract heading | 4 | 90.0% (36/40) |
| Incorrect or missing Abstract | 4 | 90.0% (36/40) |
| Incorrect or missing Affiliation(s) | 11 | 72.5% (29/40) |
| Missing >50% of page number(s), or erroneous page number(s) found | 15 | 62.5% (25/40) |
| Missing >50% of section heading(s), or erroneous section heading(s) found | 11 | 72.5% (29/40) |

Summary of detection results for 40 randomly selected documents
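The accuracy figures in the summary follow from the error counts: each is the share of the 40 documents without that error type. A small Python check, where the `accuracy` function and the shortened dictionary keys are illustrative names rather than part of the project:

```python
TOTAL_DOCS = 40  # number of randomly selected documents in the evaluation

# Documents with at least one error of each type, as reported in the summary.
errors_found = {
    "Title": 1,
    "Abstract heading": 4,
    "Abstract": 4,
    "Affiliation(s)": 11,
    "Page number(s)": 15,
    "Section heading(s)": 11,
}


def accuracy(errors: int, total: int = TOTAL_DOCS) -> float:
    """Percentage of documents in which the element was detected correctly."""
    return 100.0 * (total - errors) / total


for name, count in errors_found.items():
    print(f"{name}: {accuracy(count):.1f}% ({TOTAL_DOCS - count}/{TOTAL_DOCS})")
```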
18 Conclusion: Future Work
- Improving Algorithms
- Aggregation of homogeneous blocks
- Detection of Abstract Heading, Section Heading, and Paragraph
- Removing Noise
- Incomplete table contents
- Incomplete mathematical formulas