Multiple Retrieval Models and Regression Models for Prior Art Search

1
Multiple Retrieval Models and Regression Models for Prior Art Search

Participating institution: Humboldt Universität zu Berlin - IDSL

Patrice Lopez (also at EPO), Berlin, Germany
Laurent Romary, INRIA Gemo, Saclay, France, and HUB-IDSL, Berlin, Germany
2
Plan

Searching Scientific and Technical Documents
Issues related to Prior Art Search
Overview of PATATRAS
Patent Document processing
Combining metadata and text in four steps
Results
Future work

3
PATATRAS !!

PATent and Article Tracking, Retrieval and AccesS
Addresses Scientific and Technical Publications in general
Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
How is this instantiated in patent publications?

4
Patent Publications

Metadata encode procedure-related data:
Date, applicant, inventors, language(s)
Classification: hierarchy of technical fields
IPC, ECLA (ICO), e.g. G06F17/30T2P2X (Information retrieval, Query expansion)
Citations
5
EPO Citation Statistics
EPO Search Reports produced in the last 5 years (total: 775,000)
6
Patent Publications

Metadata encode procedure-related data:
Date, applicant, inventors, language(s)
Classification: hierarchy of technical fields
IPC, ECLA (ICO), e.g. G06F17/30T2P2X (Information retrieval, Query expansion)
Citations
Patent document structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)
Strong interrelations between these structures
Each of these structures serves different goals
7
Patent Publications

Textual content of patents: "attornish" style, multilinguality
Supporting content: tables, mathematical and chemical formulas, citations, technical drawings, etc.
Experimental data: absent

8
PATATRAS !!

Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
What are the known practices in prior art search?

9
Prior art search
10
PATATRAS !!

Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
We investigated only dimensions 1 (metadata) and 3 (textual content) in CLEF-IP 2009
However...
How to combine metadata-based and text-content retrieval?
How to combine results in different languages?
How to combine different retrieval approaches?

11
Overview of PATATRAS
12
Overview of PATATRAS
[Architecture figure; labels: Init, Lemur 4.9 (KL divergence, Okapi BM25), Merging, Post-Ranking]
15
Patent Document Processing: Text Indexing

Sound linguistic processing as groundwork
No stemming: POS tagging and lemmatisation instead
No stop-word list: only open grammatical categories are considered (N, V, Adj., Adv., numbers)
A total of 5 indexes:
One word form (lemma) index per language (en, fr, de)
English phrase indexing (Dice coefficient)
Conceptual indexing
ISO/DIS 24611, Language resource management: Morpho-syntactic annotation framework
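
The slide names the Dice coefficient as the basis for English phrase indexing but gives no formula or threshold. A minimal sketch, assuming the usual definition 2*f(xy) / (f(x) + f(y)) over adjacent lemmas; the token list and the 0.8 cutoff are made up for illustration and are not taken from PATATRAS:

```python
from collections import Counter

def dice_scores(tokens):
    """Score each adjacent word pair with the Dice coefficient 2*f(xy) / (f(x) + f(y))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Hypothetical lemmatised token stream (closed-class words already filtered out).
tokens = ["fuel", "cell", "stack", "fuel", "cell", "membrane", "stack"]
scores = dice_scores(tokens)

# Illustrative cutoff: keep only strongly associated pairs as phrase candidates.
phrases = [pair for pair, score in scores.items() if score >= 0.8]
```

Pairs whose words almost always occur together score close to 1 and become phrase candidates; pairs of words that also occur frequently on their own score lower.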

16
Conceptual indexing

Creation of a multilingual terminological database based on a conceptual model covering scientific and technical fields
Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de
Merging of concepts based on:
Domain matches (manual mappings between sources)
Term matches
Represents terms, term variants, synonyms, acronyms and multilingual correspondences
Term disambiguation based on IPC class
2.6 million terms for en, 190,000 for de, 140,000 for fr
1.4 million concepts (71,000 realized in de, 65,000 in fr)
ISO 16642:2003, Computer applications in terminology: Terminological markup framework
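
The idea of disambiguating a term by the IPC class of the document can be sketched as follows. This is only an illustration under stated assumptions: the termbase entries, the concept identifiers and the rule of matching on the IPC section letter are all hypothetical, and the slides do not describe the actual procedure.

```python
# Hypothetical termbase: surface term -> candidate concepts, each tagged with the
# IPC sections in which that sense is expected. All entries are made up.
TERMBASE = {
    "cell": [("C_BIO_CELL", {"A", "C"}), ("C_BATTERY_CELL", {"H"})],
    "zelle": [("C_BIO_CELL", {"A", "C"}), ("C_BATTERY_CELL", {"H"})],
    "cellule": [("C_BIO_CELL", {"A", "C"}), ("C_BATTERY_CELL", {"H"})],
}

def concept_for(term, ipc_class):
    """Return the first candidate concept whose domain covers the document's IPC section."""
    section = ipc_class[0]  # the first letter of an IPC symbol is its section
    for concept_id, sections in TERMBASE.get(term.lower(), []):
        if section in sections:
            return concept_id
    return None  # unknown term, or no sense fits this technical field

battery_sense = concept_for("Zelle", "H01M8/10")
biology_sense = concept_for("cell", "C12N5/00")
```

Because the English, German and French variants map to the same concept identifiers, a single concept index can serve all three languages.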

18
Limitations of text-only retrieval

Queries are based on all the textual content of the topic patent documents
MAP per model, index and language:
Model   Index     Language   Base     With citation text
KL      lemma     en         0.1068   0.1083
KL      lemma     fr         0.0611   0.0612
KL      lemma     de         0.0627   0.0634
KL      phrase    en         0.0717   0.0720
KL      concept   all        0.0671   0.0680
Okapi   lemma     en         0.0806   0.0813
Okapi   lemma     fr         0.0301   0.0303
Okapi   lemma     de         0.0598   0.0612
Okapi   phrase    en         0.0328   0.0330
Okapi   concept   all        0.0510   0.0516
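
For readers unfamiliar with the Okapi rows, the following is a minimal sketch of a textbook BM25 scorer. It is not Lemur's implementation: the parameter defaults (k1 = 1.2, b = 0.75) and the smoothed, non-negative idf variant are assumptions chosen for illustration, and the three toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document (a list of index terms) against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            # Smoothed idf that stays non-negative (an assumed variant).
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection of lemmatised documents (hypothetical).
docs = [["fuel", "cell", "membrane"], ["fuel", "pump"], ["optical", "fiber"]]
doc_scores = bm25_scores(["fuel", "cell"], docs)
```

In the setting described on the slide the query would be the full text of the topic patent rather than two terms, which makes query length itself a relevant variable.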

21
Patent Document Processing: Metadata

Additional extraction of cited patents in the descriptions (regular expressions)
7,960 additional cited EP documents found in the XL set

Metadata representation: basic normalization (author, applicant)
Storage in a MySQL database (2.48 GB in total for the collection)
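
The slides state only that regular expressions were used to find cited patents in the descriptions. A minimal sketch of what such an extraction could look like; the pattern, the accepted separators and the seven-digit normalization are assumptions, not the expressions used in PATATRAS:

```python
import re

# Assumed pattern: "EP", an optional separator, a 6 or 7 digit number that may be
# grouped in threes, and an optional kind code such as A1 or B1.
EP_REF = re.compile(
    r"\bEP[\s\-]?(\d{1,3}(?:[\s.]?\d{3}){1,2}|\d{6,7})(?:[\s\-]?([AB][1-9]))?\b"
)

def extract_ep_citations(description):
    """Return normalized EP publication numbers in order of first appearance."""
    found = []
    for match in EP_REF.finditer(description):
        number = re.sub(r"\D", "", match.group(1)).zfill(7)  # strip separators, pad
        reference = "EP" + number
        if reference not in found:
            found.append(reference)
    return found

# Hypothetical description text.
text = "As disclosed in EP 0 123 456 A1 and EP-1234567, see also EP1005678B1."
citations = extract_ep_citations(text)
```

Normalizing to one canonical form matters here because the extracted numbers have to be matched against the publication numbers stored in the metadata database.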

22
Prior working sets

Goal: for a given patent topic, create the smallest set of patents containing the relevant documents
Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications (following patent examiners' strategies)
Result: micro-recall of 0.7303, approx. 2,600 documents per patent topic (415 results per topic after final cutoff)
Significant improvement of MAP results (relative gain in parentheses):
Model   Index    Language   With cit. text   With prior sets
KL      lemma    en         0.1083           0.1516 (+40%)
KL      lemma    de         0.0634           0.1145 (+81%)
KL      phrase   en         0.0720           0.1268 (+76%)
Okapi   lemma    en         0.0813           0.1365 (+68%)
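
The iterative expansion and the micro-recall figure can be sketched as below. This is an illustration under assumptions: the `neighbours` map stands in for all metadata links (citations, common applicant, family, classification), the fixed number of rounds is a made-up stopping rule, and the tiny graph is invented.

```python
def build_working_set(topic, neighbours, max_rounds=2):
    """Expand from the topic by following metadata links for a fixed number of rounds."""
    working_set, frontier = set(), {topic}
    for _ in range(max_rounds):
        linked = {n for doc in frontier for n in neighbours.get(doc, ())}
        frontier = linked - working_set - {topic}
        if not frontier:
            break
        working_set |= frontier
    return working_set

def micro_recall(working_sets, relevant):
    """Relevant documents retrieved over all topics divided by all relevant documents."""
    found = sum(len(working_sets[t] & relevant[t]) for t in relevant)
    total = sum(len(relevant[t]) for t in relevant)
    return found / total

# Hypothetical link graph: T1 is the topic, the other keys are collection patents.
neighbours = {"T1": ["A", "B"], "A": ["C"], "B": ["T1"], "C": ["D"]}
working = build_working_set("T1", neighbours)
recall = micro_recall({"T1": working}, {"T1": {"A", "D"}})
```

The trade-off is visible even in this toy case: more rounds would reach document D and raise recall, at the cost of a larger set for the text retrieval step to rank.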

25
Merging of results

Strong complementarities between the result sets
Many training examples are available, hence fully supervised ML
Regression model estimating, for each patent topic, the pertinence of a result set
Features: language, query size, initial working set size, max./min. range of retrieval scores, IPC main class, average phrase length
Training set: 500 topics plus 4,131 additional patents of the collection
Result sets merged by linear combination, using the estimated weights
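
A weighted linear combination of result sets can be sketched as follows. The weights would come from the regression model for the given topic; here they are hypothetical constants, and the min-max normalization of scores is an assumption added so that KL and Okapi scores become comparable.

```python
def merge_result_sets(result_sets, weights):
    """Min-max normalize each result set, then sum the scores weighted per result set."""
    merged = {}
    for name, scores in result_sets.items():
        lowest, highest = min(scores.values()), max(scores.values())
        span = (highest - lowest) or 1.0  # avoid division by zero for flat score lists
        for doc, score in scores.items():
            contribution = weights[name] * (score - lowest) / span
            merged[doc] = merged.get(doc, 0.0) + contribution
    return sorted(merged, key=merged.get, reverse=True)

# Hypothetical result sets (document -> retrieval score) and per-set weights.
result_sets = {
    "kl_lemma_en": {"EP1": 3.0, "EP2": 1.0},
    "okapi_lemma_de": {"EP2": 10.0, "EP3": 5.0},
}
weights = {"kl_lemma_en": 0.6, "okapi_lemma_de": 0.4}
ranking = merge_result_sets(result_sets, weights)
```

A document found by several retrieval models accumulates weight from each, which is how the complementarity between result sets is exploited.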

26
Merging of results

MAP after merging, per feature set and regression model (relative gain in parentheses):
Feat.   LeastMedSq        MP                SMO               ν-SVM
f1      0.1681 (+5.8%)    0.1711 (+7.7%)    0.1706 (+7.4%)    0.1691 (+6.4%)
f1-6    0.1689 (+6.3%)    0.1797 (+13.1%)   0.1807 (+13.7%)   0.1976 (+24.3%)
all     0.1786 (+12.4%)   0.1898 (+19.4%)   0.2016 (+26.9%)   0.2281 (+43.5%)
f1: language
f2-6: related to the retrieval score
f7-8: IPC (domains)
f9: average phrase length

29
Post-ranking

Regression model estimating the pertinence of a patent in the result set for a given patent topic
Features: citation, number of common IPC and ECLA classes, probability of citation, same applicant, same inventors
Training set: 500 topics plus 4,131 additional patents of the collection
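
The post-ranking features can be sketched as a feature vector computed per (topic, candidate) pair. This is an illustration only: the `Patent` record, the fixed weights standing in for the trained regression model, and the example patents are hypothetical, and the "probability of citation" feature is omitted because the slides do not define it.

```python
from dataclasses import dataclass

@dataclass
class Patent:
    number: str
    applicant: str
    inventors: frozenset
    ipc: frozenset
    ecla: frozenset
    cites: frozenset = frozenset()

def features(topic, candidate):
    """Metadata features relating a candidate patent to the topic patent."""
    return [
        1.0 if candidate.number in topic.cites else 0.0,         # citation
        float(len(topic.ipc & candidate.ipc)),                   # common IPC classes
        float(len(topic.ecla & candidate.ecla)),                 # common ECLA classes
        1.0 if topic.applicant == candidate.applicant else 0.0,  # same applicant
        1.0 if topic.inventors & candidate.inventors else 0.0,   # shared inventor
    ]

def post_rank(topic, candidates, weights):
    """Re-rank candidates with a linear scorer standing in for the regression model."""
    def score(candidate):
        return sum(w * f for w, f in zip(weights, features(topic, candidate)))
    return sorted(candidates, key=score, reverse=True)

topic = Patent("EP1000001", "ACME", frozenset({"Ann"}), frozenset({"G06F17/30"}),
               frozenset({"G06F17/30T2P2X"}), frozenset({"EP0900002"}))
c1 = Patent("EP0900001", "Other", frozenset({"Bob"}), frozenset({"H01M8/10"}), frozenset())
c2 = Patent("EP0900002", "ACME", frozenset({"Ann"}), frozenset({"G06F17/30"}), frozenset())
reranked = post_rank(topic, [c1, c2], weights=[2.0, 0.5, 0.5, 1.0, 1.0])
```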

30
Final Results

Measures      S        M        XL       en-XL    fr-XL    de-XL
MAP           0.2714   0.2783   0.2802   0.2358   0.1787   0.2092
Prec. at 5    0.2780   0.2766   0.2768   0.2365   0.1855   0.2122
Prec. at 10   0.1768   0.1748   0.1776   0.1575   0.1338   0.1467

On average, approx. 43 s per topic
Final runs (10,000 patent topics) for all, en, fr, de took 5 days on 4 machines
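
The measures in the table can be computed as in the following sketch, which uses the standard definitions of precision at k and (mean) average precision; the ranked list and relevance judgments are made up. The last lines check the timing figures, assuming the four runs of 10,000 topics were spread evenly over the 4 machines.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of the per-topic average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical topic: two of three relevant documents are retrieved.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}
p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)

# Timing check: 4 runs x 10,000 topics x 43 s, divided over 4 machines, in days.
days = 4 * 10_000 * 43 / 4 / 86_400
```

Under that assumption the estimate comes to just under 5 days, which agrees with the duration reported on the slide.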

31
Conclusion

We have proposed an architecture for retrieving
Scientific and Technical Publications
We have adapted the architecture to patent search
practices
Still needed:
improve terminological representations
address document structures
refine query representations
Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/