Concepts, Semantics and Syntax in E-Discovery

1
Concepts, Semantics and Syntax in E-Discovery
  • David Eichmann
  • Institute for Clinical and Translational Science
  • The University of Iowa

2
Our Approach
  • Analyze the human-generated metadata available
    for document collections for organizational and
    individual interactions
  • Explore the syntactic and semantic nature of
    document content and the potential for automatic
    generation of metadata
  • Explore the concept space generated by the
    previous step and its correspondence to Boolean
    predicate specification in discovery

3
Our Target Corpus
  • The Illinois Institute of Technology Complex
    Document Information Processing Test Collection
    (IIT CDIP), v. 1.0
  • Derived from the tobacco master settlement
    agreement
  • Comprises 6,910,192 documents
  • Or more properly the OCR output from those
    documents
  • Two merged XML tag sets of metadata, with
    overlapping content
  • <A>
  • <LTDLWOCR>

4
Metadata Entity Frequencies
5
Metadata Entity Frequencies
6
Metadata Entity Frequencies
7
Metadata Entity Frequencies
8
Database Schema
  • We map the XML structure to a set of relational
    database tables
  • Non-recurring fields are collected in a table
    named document
  • docid
  • title
  • description
  • OCR text
  • Recurring elements each get a table
  • docid
  • value
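
The mapping described on this slide can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema actually used: the slide names only the fields, so the column types, the ocr_text column name, and the person table (standing in for one recurring element) are assumptions.

```python
import sqlite3

# Assumed names and types; the slide lists only docid, title,
# description, OCR text, and (docid, value) for recurring elements.
SCHEMA = """
CREATE TABLE document (
    docid       TEXT PRIMARY KEY,
    title       TEXT,
    description TEXT,
    ocr_text    TEXT
);
-- One table per recurring element; 'person' is a hypothetical example.
CREATE TABLE person (
    docid TEXT NOT NULL REFERENCES document(docid),
    value TEXT NOT NULL
);
"""

def build_db(path=":memory:"):
    """Create the two-table layout and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def load_document(conn, docid, title, description, ocr_text, people):
    """Store non-recurring fields once and each recurring value as its own row."""
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?, ?)",
        (docid, title, description, ocr_text),
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(docid, name) for name in people],
    )
```

Keeping each recurring element in a narrow (docid, value) table lets a document carry any number of values for that element without widening the document table.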

9
Identifying an Individual
10
How Many Reininghaus?
  • Reininghaus,R
  • Reininghaus,W
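
The question on this slide can be made concrete with a small sketch that groups metadata name strings by surname and collects the distinct first initials seen for each. It assumes the "Surname,Given" form shown above; the function names are hypothetical, and distinct initials only suggest, rather than prove, distinct individuals.

```python
from collections import defaultdict

def split_name(raw):
    """Split a 'Surname,Given' string into (lowercased surname, first initial)."""
    surname, _, given = raw.partition(",")
    given = given.strip()
    initial = given[0].upper() if given else ""
    return surname.strip().lower(), initial

def group_by_surname(names):
    """Map each surname to the set of first initials observed with it."""
    groups = defaultdict(set)
    for raw in names:
        surname, initial = split_name(raw)
        groups[surname].add(initial)
    return dict(groups)
```

Applied to the two strings on this slide, the sketch yields one surname with two initials, R and W, which is the ambiguity the slide title points at.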

11
Co-mention Connections
12
Co-mention Connections
13
Co-mention Connections
14
Co-mention Affiliations
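
One plausible reading of "co-mention" is that two names appear in the metadata of the same document. Under that assumption, co-mention counts could be derived from (docid, value) rows as sketched below; the function name and the row format are illustrative, not taken from the slides.

```python
from collections import Counter
from itertools import combinations

def co_mentions(rows):
    """Count how many documents mention each unordered pair of names.

    rows: iterable of (docid, name) tuples.
    """
    names_by_doc = {}
    for docid, name in rows:
        names_by_doc.setdefault(docid, set()).add(name)

    pairs = Counter()
    for names in names_by_doc.values():
        # Sorting gives each pair a single canonical order.
        for a, b in combinations(sorted(names), 2):
            pairs[(a, b)] += 1
    return pairs
```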
15
Semantics and Structure
  • Our analysis of content involves the following
    phases
  • Lexical analysis
  • Sentence boundary detection
  • Named entity recognition
  • Sentence parsing
  • Relationship extraction
  • The nature of the OCR data seriously impacts each
    of the phases (sometimes in different ways)
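
To illustrate how OCR noise can affect the early phases, here is a deliberately naive sketch of the first two (lexical analysis and sentence boundary detection) using regular expressions. It is not the analysis pipeline described in the talk; the rules are assumptions chosen for brevity.

```python
import re

def tokenize(text):
    """Lexical analysis: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Naive boundary rule: ., ! or ? followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]
```

With this rule, a period misrecognized as a comma removes a sentence boundary entirely, so every later phase sees one long sentence where the clean text has two.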

16
CDIP Parse Tree Complexity
17
Clean Text Parse Tree Complexity
18
Next Steps
  • Experiment with custom lexical analysis of the
    OCR
  • Start with simple white space detection
  • Construct a lexicon and treat out-of-lexicon
    tokens as OCR error candidates
  • Rewrite the analyzer to support OCR error
    correction
  • Run sentence boundary detection and parsing on
    the full corpus
  • Generate entity relationships using our question
    answering framework
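
The first three steps above (white space tokenization, lexicon construction, flagging unknown tokens) can be sketched as follows. The min_count threshold, the punctuation stripping, and the function names are assumptions made for illustration; the slides do not specify how the lexicon is built.

```python
from collections import Counter

PUNCT = ".,;:!?()[]\"'"

def whitespace_tokens(text):
    """Simple white space tokenization."""
    return text.split()

def normalize(token):
    """Strip surrounding punctuation and lowercase."""
    return token.strip(PUNCT).lower()

def build_lexicon(texts, min_count=2):
    """Keep words seen at least min_count times across the texts (assumed rule)."""
    counts = Counter(
        normalize(tok) for text in texts for tok in whitespace_tokens(text)
    )
    return {word for word, n in counts.items() if word and n >= min_count}

def ocr_error_candidates(text, lexicon):
    """Return tokens whose normalized form is missing from the lexicon."""
    return [
        tok for tok in whitespace_tokens(text)
        if normalize(tok) and normalize(tok) not in lexicon
    ]
```

A frequency threshold is one simple way to keep rare misrecognitions out of the lexicon, at the cost of also flagging rare but legitimate words.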

19
And Beyond That
  • Return to the document images and analyze
    document layout
  • Regenerate OCR to include token coordinates
  • Use our PDF structure extraction framework to
    generate logical document structure
  • Generate a set of document models based upon
    similar layout
  • Use the document models to map OCR text to
    metadata elements
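
The last step above, mapping OCR text to metadata elements through a document model, might look like the sketch below if a model is represented as a list of rectangular page regions and each OCR token carries coordinates. The Token and Region classes, the field names, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    x: float
    y: float

@dataclass
class Region:
    field: str   # metadata element this region feeds (hypothetical)
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, tok):
        return self.x0 <= tok.x <= self.x1 and self.y0 <= tok.y <= self.y1

def map_tokens(tokens, model):
    """Assign each token to the first region containing it; drop the rest."""
    words_by_field = {}
    for tok in tokens:
        for region in model:
            if region.contains(tok):
                words_by_field.setdefault(region.field, []).append(tok.text)
                break
    return {field: " ".join(words) for field, words in words_by_field.items()}
```

Documents with similar layout would share one model, so the regions need to be defined only once per layout family.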

20
For Example
21
For Example