Multiple Retrieval Models and Regression Models for Prior Art Search

Transcript and Presenter's Notes

1
Multiple Retrieval Models and Regression
Models for Prior Art Search
  • Participating institution: Humboldt Universität
    zu Berlin - IDSL

Patrice Lopez (also at EPO), Berlin, Germany
Laurent Romary, INRIA Gemo - Saclay,
France; HUB-IDSL, Berlin, Germany
2
Plan
  • Searching Scientific and Technical Documents
  • Issues related to Prior Art Search
  • Overview of PATATRAS
  • Patent Document processing
  • Combining metadata and text in four steps
  • Results
  • Future work

3
PATATRAS !!
  • PATent and Article Tracking, Retrieval and AccesS:
    addresses Scientific and Technical Publications
    in general
  • Scientific and Technical Publications have 5
    dimensions:
    • 1. metadata
    • 2. document structure
    • 3. textual content
    • 4. supporting content
    • 5. experimental data
  • How is this instantiated in patent publications?

4
Patent Publications
  • Metadata encode procedure-related data:
    • Date, applicant, inventors, language(s)
    • Classification: hierarchy of technical fields
      (IPC, ECLA (ICO)), e.g. G06F17/30T2P2X
    • Citations

[Figure labels: Information retrieval; Query expansion]
5
EPO Citation Statistics
EPO Search Reports produced in the last 5 years
(total: 775,000)
6
Patent Publications
  • Metadata encode procedure-related data:
    • Date, applicant, inventors, language(s)
    • Classification: hierarchy of technical fields
      (IPC, ECLA (ICO)), e.g. G06F17/30T2P2X
    • Citations
  • Patent Document Structure: Title, Abstract,
    Claims, Description (description of prior art,
    "subjective" technical problem, description of
    embodiments)
    • Strong interrelations between these structures
    • Each of these structures serves different goals

[Figure labels: Information retrieval; Query expansion]
7
Patent Publications
  • Textual content of patents:
    • Attornish, multilinguality
  • Supporting content:
    • tables, mathematical and chemical formulas,
      citations, technical drawings, etc.
  • Experimental data: absent

8
PATATRAS !!
  • Scientific and Technical Publications have 5
    dimensions:
    • 1. metadata
    • 2. document structure
    • 3. textual content
    • 4. supporting content
    • 5. experimental data
  • What are the known practices in prior art search?

9
Prior art search
10
PATATRAS !!
  • Scientific and Technical Publications have 5
    dimensions:
    • 1. metadata
    • 2. document structure
    • 3. textual content
    • 4. supporting content
    • 5. experimental data
  • We investigated only dimensions 1 (metadata) and 3
    (textual content) in CLEF-IP 2009
  • However...
    • How to combine metadata-based and text-content
      retrieval?
    • How to combine results in different languages?
    • How to combine different retrieval approaches?

11
Overview of PATATRAS
12
Overview of PATATRAS
[Architecture diagram. Retrieval engine: Lemur 4.9 (KL divergence, Okapi BM25). Stages shown: Init, Merging, Post-Ranking]
13
Overview of PATATRAS
14
Overview of PATATRAS
15
Patent Document Processing: Text Indexing
  • Sound linguistic processing as groundwork
    • No stemming: POS tagging and lemmatisation
    • No stop words: only open grammatical categories
      are considered (N, V, Adj., Adv., numbers)
  • A total of 5 indexes
    • One word form (lemma) index per language (en,
      fr, de)
    • English phrase indexing (Dice coefficient)
    • Conceptual indexing
  • ISO/DIS 24611, Language resource management:
    Morpho-syntactic annotation framework
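
The English phrase index listed above is built with the Dice coefficient, which scores a word pair (x, y) as 2·f(x,y) / (f(x) + f(y)). The following is a minimal sketch of that selection step, assuming adjacent lemma pairs as phrase candidates; the `threshold` and `min_count` values are illustrative assumptions, not settings reported in the talk.

```python
from collections import Counter


def dice_phrases(lemmas, threshold=0.5, min_count=2):
    """Return adjacent lemma pairs whose Dice coefficient reaches `threshold`.

    Dice(x, y) = 2 * f(x, y) / (f(x) + f(y)), where f counts occurrences.
    `threshold` and `min_count` are hypothetical cut-offs for illustration.
    """
    unigram_counts = Counter(lemmas)
    bigram_counts = Counter(zip(lemmas, lemmas[1:]))
    phrases = {}
    for (x, y), pair_count in bigram_counts.items():
        if pair_count < min_count:
            continue  # too rare to be a reliable phrase candidate
        dice = 2 * pair_count / (unigram_counts[x] + unigram_counts[y])
        if dice >= threshold:
            phrases[(x, y)] = dice
    return phrases


sample = "prior art search uses prior art documents and search engines".split()
selected = dice_phrases(sample)
```

On this toy input only the pair ("prior", "art") occurs twice, and both words always appear together, so it is the only candidate kept.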

16
Conceptual indexing
  • Creation of a multilingual terminological
    database based on a conceptual model
    covering scientific and technical fields
  • Sources: MeSH, UMLS, Gene Ontology, SUMO,
    WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de
  • Merging of concepts based on:
    • Domain matches (manual mappings between sources)
    • Term matches
  • Represents terms/term variants/synonyms/acronyms
    and multilingual correspondences
  • Term disambiguation based on IPC class
  • 2.6 million terms for en, 190,000 for de,
    140,000 for fr
  • 1.4 million concepts (71,000 realized in de,
    65,000 in fr)
  • ISO 16642:2003, Computer applications in
    terminology: Terminological markup framework
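
To make the idea of conceptual indexing with IPC-based disambiguation more concrete, here is a small sketch. Everything in it is hypothetical: the `TERMBASE` entries, the concept identifiers, and the IPC classes attached to each sense are invented for illustration and do not come from the PATATRAS database. It only shows the principle that terms from different languages map to one language-independent concept, and that an ambiguous term is resolved with the IPC class of the patent.

```python
# Hypothetical mini-termbase: (language, lemma) -> candidate concepts, each
# paired with the 3-character IPC classes in which that sense is expected.
BIO = ("concept:biological_cell", {"C12", "A61"})
ELEC = ("concept:electrochemical_cell", {"H01"})
TERMBASE = {
    ("en", "cell"): [BIO, ELEC],
    ("de", "zelle"): [BIO, ELEC],
    ("fr", "cellule"): [BIO, ELEC],
}


def to_concepts(lemmas, lang, ipc_class):
    """Map lemmas to concept ids, using the patent's IPC class to pick a sense."""
    concepts = []
    for lemma in lemmas:
        candidates = TERMBASE.get((lang, lemma.lower()), [])
        matching = [cid for cid, classes in candidates if ipc_class[:3] in classes]
        if matching:
            concepts.append(matching[0])
        elif len(candidates) == 1:
            concepts.append(candidates[0][0])  # unambiguous term: keep it
        # ambiguous term with no matching class: left unindexed in this sketch
    return concepts


english = to_concepts(["cell", "membrane"], "en", "C12N")
german = to_concepts(["Zelle"], "de", "H01M")
```

Because the index stores concept identifiers rather than word forms, an English query and a German document can match through the same concept.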

17
(No Transcript)
18
Limitations of text-only retrieval
  • Queries are based on all the textual content of
    the topic patent documents
  • MAP per model, index and language:

    Model   Index     Language   Base     With citation text
    KL      lemma     en         0.1068   0.1083
    KL      lemma     fr         0.0611   0.0612
    KL      lemma     de         0.0627   0.0634
    KL      phrase    en         0.0717   0.0720
    KL      concept   all        0.0671   0.0680
    Okapi   lemma     en         0.0806   0.0813
    Okapi   lemma     fr         0.0301   0.0303
    Okapi   lemma     de         0.0598   0.0612
    Okapi   phrase    en         0.0328   0.0330
    Okapi   concept   all        0.0510   0.0516
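
Okapi BM25 is one of the two retrieval models compared above. As a reminder of what such a model computes, the sketch below implements the textbook BM25 formula over tokenised documents. It is not Lemur's implementation, whose parameters and IDF variant may differ; `k1`, `b` and the non-negative IDF form are common defaults chosen here as assumptions. The KL-divergence language model is not shown.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (id -> list of tokens) against `query`."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


docs = {
    "d1": ["patent", "retrieval", "model"],
    "d2": ["chemical", "formula", "table"],
    "d3": ["patent", "claim", "patent"],
}
scores = bm25_scores(["patent", "retrieval"], docs)
```

In the setting described on the slide the query would be the full text of the topic patent rather than two keywords, which makes queries very long compared with typical ad hoc search.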

19
Overview of PATATRAS
20
Overview of PATATRAS
21
Patent Document Processing: Metadata
  • Additional extraction of cited patents in the
    descriptions (regular expressions)
    • 7,960 additional cited EP documents found in the XL set
  • Metadata representation: basic normalization
    (author, applicant)
  • Storage in a MySQL database (total 2.48 GB for
    the collection)
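
The regular expressions used to find cited patents in the description text are not given on the slide. The sketch below shows one plausible way to do it, assuming a few common spellings of EP publication numbers ("EP 0 123 456", "EP-A-1234567", "EP1234567"); the pattern `EP_CITATION` and the seven-digit normalisation are assumptions made for illustration.

```python
import re

# Hypothetical pattern: "EP", optional kind code (A, B, A1, ...), then 6 or 7
# digits that may be separated by single spaces or dots.
EP_CITATION = re.compile(r"\bEP[\s-]*(?:[AB]\d?[\s-]+)?(\d(?:[\s.]?\d){5,6})\b")


def extract_ep_citations(description):
    """Return normalised EP numbers cited in a description, in order of appearance."""
    found = []
    for match in EP_CITATION.finditer(description):
        digits = re.sub(r"\D", "", match.group(1))
        number = "EP" + digits.zfill(7)  # pad to a uniform 7-digit form
        if number not in found:
            found.append(number)
    return found


text = ("as disclosed in EP 0 123 456 and EP-A-1234567; "
        "see also EP1234567 and US 5,000,000.")
citations = extract_ep_citations(text)
```

Normalising every match to one canonical form is what allows the extracted citations to be joined with the citation metadata already stored in the database.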

22
Prior working sets
  • Goal: for a given patent topic, create the
    smallest set of patents containing the relevant
    documents
  • Iterative expansion from a core list of documents
    based on metadata: citation tree, common
    applicant/author, patent family relation,
    classifications → patent examiner's strategies
  • Result: micro-recall of 0.7303, approx. 2,600 documents
    per patent topic (415 results per topic after
    final cutoff)
  • Significant improvement of MAP results:

    Model   Index    Language   With cit. text   With prior sets
    KL      lemma    en         0.1083           0.1516 (+40%)
    KL      lemma    de         0.0634           0.1145 (+81%)
    KL      phrase   en         0.0720           0.1268 (+76%)
    Okapi   lemma    en         0.0813           0.1365 (+68%)
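
The iterative expansion of a prior working set can be pictured as a bounded breadth-first traversal of metadata links. The sketch below assumes that all relations (citations, shared applicant or inventor, patent family, shared classification) have already been merged into one `related` mapping, and that expansion stops after `max_depth` steps; both choices are simplifying assumptions, not the actual PATATRAS procedure. A helper for micro-averaged recall, the measure quoted above, is included.

```python
from collections import deque


def expand_working_set(topic, related, max_depth=2):
    """Collect documents reachable from `topic` within `max_depth` metadata links."""
    seen = {topic}
    working_set = []
    queue = deque([(topic, 0)])
    while queue:
        doc, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in related.get(doc, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                working_set.append(neighbour)
                queue.append((neighbour, depth + 1))
    return working_set


def micro_recall(working_sets, relevant):
    """Relevant documents found, summed over all topics, over all relevant documents."""
    found = sum(len(set(working_sets[t]) & relevant[t]) for t in relevant)
    total = sum(len(docs) for docs in relevant.values())
    return found / total if total else 0.0


# Toy metadata graph with hypothetical document ids.
related = {"T1": ["A", "B"], "A": ["C"], "B": ["A", "D"], "C": ["E"], "T2": ["X"]}
sets = {t: expand_working_set(t, related) for t in ("T1", "T2")}
recall = micro_recall(sets, {"T1": {"A", "E"}, "T2": {"X"}})
```

The depth limit expresses the trade-off stated in the goal: a deeper expansion raises recall but enlarges the set that the text retrieval models then have to rank.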

23
Overview of PATATRAS
24
Overview of PATATRAS
25
Merging of results
  • Strong complementarities between the result sets
  • So many examples! → fully supervised ML
  • Regression model for estimating, for each patent
    topic, the pertinence of a result set
  • Features: language, query size, init. working set
    size, max./min. range of retrieval scores, IPC
    main class, average phrase length
  • Training set: 500, plus 4,131 additional patents of
    the collection
  • Linear combination of weights
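
The merging step can be sketched as a weighted linear combination of the result sets, where each weight stands for the pertinence that the regression model estimates for that result set on the current topic. In the sketch below the weights are hard-coded hypothetical values (the regression model itself is not shown), and the min-max normalisation of retrieval scores is an assumption needed to make KL and Okapi scores comparable.

```python
def min_max(scores):
    """Rescale one result set's scores to [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def merge_result_sets(result_sets, weights):
    """Combine result sets (name -> {doc: score}) with per-set weights."""
    merged = {}
    for name, scores in result_sets.items():
        weight = weights[name]
        for doc, score in min_max(scores).items():
            merged[doc] = merged.get(doc, 0.0) + weight * score
    # Highest combined score first; ties broken by document id.
    return sorted(merged, key=lambda doc: (-merged[doc], doc))


result_sets = {
    "kl_lemma_en": {"EP1": 2.0, "EP2": 1.0, "EP3": 0.0},
    "okapi_lemma_de": {"EP2": 10.0, "EP4": 5.0},
}
# Hypothetical per-topic weights, standing in for the regression output.
weights = {"kl_lemma_en": 0.6, "okapi_lemma_de": 0.4}
ranking = merge_result_sets(result_sets, weights)
```

A document found by several complementary runs (here EP2) can overtake the top document of any single run, which is the effect the merging step exploits.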

26
Merging of results

    Feat.   LeastMedSq        MP                SMO               ν-SVM
    f1      0.1681 (+5.8%)    0.1711 (+7.7%)    0.1706 (+7.4%)    0.1691 (+6.4%)
    f1-6    0.1689 (+6.3%)    0.1797 (+13.1%)   0.1807 (+13.7%)   0.1976 (+24.3%)
    all     0.1786 (+12.4%)   0.1898 (+19.4%)   0.2016 (+26.9%)   0.2281 (+43.5%)

  • f1: language
  • f2-6: related to the retrieval score
  • f7-8: IPC (domains)
  • f9: av. phrase length

27
Overview of PATATRAS
28
Overview of PATATRAS
29
Post-ranking
  • Regression model for estimating the pertinence of
    a patent in the result set for a given patent
    topic
  • Features: citation, number of common IPC and ECLA
    classes, prob. of citation, same applicant and
    inventors
  • Training set: 500, plus 4,131 additional patents of
    the collection
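
Post-ranking can be illustrated by computing a few of the features named above for each candidate and feeding them to a linear scoring function. The sketch is only an approximation of the idea: the `Patent` record, the feature subset, and the weights are hypothetical, and a trained regression model would replace the hand-set weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patent:
    number: str
    classes: frozenset          # IPC/ECLA classes
    applicants: frozenset
    cited: frozenset = frozenset()  # patents cited in the document


def features(topic, candidate, retrieval_score):
    """Feature vector for one (topic, candidate) pair."""
    return [
        retrieval_score,
        1.0 if candidate.number in topic.cited else 0.0,           # cited by the topic
        float(len(topic.classes & candidate.classes)),             # common classes
        1.0 if topic.applicants & candidate.applicants else 0.0,   # same applicant
    ]


def post_rank(topic, results, weights):
    """Re-rank (candidate, score) pairs with a linear model over the features."""
    scored = []
    for candidate, score in results:
        value = sum(w * x for w, x in zip(weights, features(topic, candidate, score)))
        scored.append((value, candidate.number))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [number for _, number in scored]


topic = Patent("EP100", frozenset({"G06F17/30", "H04L29/06"}),
               frozenset({"ACME"}), frozenset({"EP2"}))
ep1 = Patent("EP1", frozenset({"A61K31/00"}), frozenset({"OTHER"}))
ep2 = Patent("EP2", frozenset({"G06F17/30"}), frozenset({"ACME"}))
# Hypothetical weights; a regression model would supply these.
reranked = post_rank(topic, [(ep1, 0.9), (ep2, 0.5)], [1.0, 2.0, 0.5, 0.5])
```

Here EP2 starts with the lower text retrieval score but moves to the top because it is cited by the topic, shares a class with it, and has the same applicant.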

30
Final Results

    Measures      S        M        XL       en-XL    fr-XL    de-XL
    MAP           0.2714   0.2783   0.2802   0.2358   0.1787   0.2092
    Prec. at 5    0.2780   0.2766   0.2768   0.2365   0.1855   0.2122
    Prec. at 10   0.1768   0.1748   0.1776   0.1575   0.1338   0.1467

  • On average approx. 43 s per topic
  • Final runs (10,000 patent topics) for all, en,
    fr, de took 5 days on 4 machines
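
For readers less familiar with the measures reported above, the following sketch gives the standard definitions of precision at k and (mean) average precision. The document ids are toy values, and the official CLEF-IP evaluation scripts may differ in details such as result cut-offs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document found."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """`runs` is a list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}
p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Average precision divides by the number of relevant documents, not by the number retrieved, so a relevant document that is never returned (here "x") lowers the score.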

31
Conclusion
  • We have proposed an architecture for retrieving
    Scientific and Technical Publications
  • We have adapted the architecture to patent search
    practices
  • Need to:
    • improve terminological representations
    • address document structures
    • refine query representations
  • Full text available in HAL:
    http://hal.archives-ouvertes.fr/hal-00411835/fr/