Title: Multiple Retrieval Models and Regression Models for Prior Art Search
1 Multiple Retrieval Models and Regression Models for Prior Art Search
- Participating institution: Humboldt Universität zu Berlin (IDSL)
- Patrice Lopez (also at EPO), Berlin, Germany
- Laurent Romary, INRIA Gemo, Saclay, France, and HUB-IDSL, Berlin, Germany
2 Plan
- Searching Scientific and Technical Documents
- Issues related to Prior Art Search
- Overview of PATATRAS
- Patent Document processing
- Combining metadata and text in four steps
- Results
- Future work
3 PATATRAS !!
- PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general
- Scientific and Technical Publications have 5 dimensions:
  1. metadata
  2. document structure
  3. textual content
  4. supporting content
  5. experimental data
- How is this instantiated in patent publications?
4 Patent Publications
- Metadata encode procedure-related data
  - Date, applicant, inventors, language(s)
  - Classification: hierarchy of technical fields (IPC, ECLA (ICO)), e.g. G06F17/30T2P2X (Information retrieval > Query expansion)
  - Citations
5 EPO Citation Statistics
- EPO Search Reports produced in the last 5 years (total: 775,000)
6 Patent Publications
- Metadata encode procedure-related data
  - Date, applicant, inventors, language(s)
  - Classification: hierarchy of technical fields (IPC, ECLA (ICO)), e.g. G06F17/30T2P2X (Information retrieval > Query expansion)
  - Citations
- Patent document structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)
  - Strong interrelations between these structures
  - Each of these structures serves different goals
7 Patent Publications
- Textual content of patents
  - "Attornish", multilinguality
- Supporting content
  - Tables, mathematical and chemical formulas, citations, technical drawings, etc.
- Experimental data: absent
8 PATATRAS !!
- Scientific and Technical Publications have 5 dimensions:
  1. metadata
  2. document structure
  3. textual content
  4. supporting content
  5. experimental data
- What are the known practices in prior art search?
9 Prior art search
10 PATATRAS !!
- Scientific and Technical Publications have 5 dimensions:
  1. metadata
  2. document structure
  3. textual content
  4. supporting content
  5. experimental data
- We investigated only dimensions 1 and 3 in CLEF IP 2009
- However...
  - How to combine metadata-based and text-content retrieval?
  - How to combine results in different languages?
  - How to combine different retrieval approaches?
11 Overview of PATATRAS
[Architecture diagram. Labels: Lemur 4.9 (KL divergence, Okapi BM25); Init; Merging; Post-Ranking]
15 Patent Document Processing: Text Indexing
- Sound linguistic processing as groundwork
  - No stemming: POS tagging and lemmatisation
  - No stop words: only open grammatical categories are considered (N, V, Adj., Adv., numbers)
- A total of 5 indexes
  - One word form (lemma) index per language (en, fr, de)
  - English phrase indexing (Dice coefficient)
  - Conceptual indexing
- ISO/DIS 24611, Language resource management: Morpho-syntactic annotation framework
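The English phrase index relies on the Dice coefficient to decide which word pairs behave like phrases. The slides do not give the exact procedure, so the following is only a minimal sketch of the standard measure, 2·f(xy) / (f(x) + f(y)), applied to adjacent lemma pairs. The function name and the toy sentences are illustrative assumptions, not part of PATATRAS.

```python
from collections import Counter


def dice_scores(sentences):
    """Score adjacent lemma pairs with the Dice coefficient.

    `sentences` is a list of lemma lists (assumed to be already
    POS-filtered and lemmatised). Returns {(x, y): score}.
    """
    unigrams = Counter()
    bigrams = Counter()
    for lemmas in sentences:
        unigrams.update(lemmas)
        bigrams.update(zip(lemmas, lemmas[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


# Toy input (hypothetical, not taken from the patent collection).
toy_sentences = [
    ["fuel", "cell", "stack"],
    ["fuel", "cell", "membrane"],
    ["solar", "cell", "module"],
]
toy_scores = dice_scores(toy_sentences)
```

Pairs scoring above a chosen threshold would be kept as phrase index terms; the threshold itself is a tuning decision that the slides do not specify.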
16 Conceptual indexing
- Creation of a multilingual terminological database based on a conceptual model covering scientific and technical fields
- Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de
- Merging of concepts based on
  - Domain matches (manual mappings between sources)
  - Term matches
- Represents terms/term variants/synonyms/acronyms and multilingual correspondences
- Term disambiguation based on IPC class
- 2.6 million terms for en, 190,000 for de, 140,000 for fr
- 1.4 million concepts (71,000 realized in de, 65,000 in fr)
- ISO 16642:2003, Computer applications in terminology: Terminological markup framework
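Term disambiguation based on IPC class can be pictured as choosing, among the concepts a term may denote, the one whose technical domain best matches the classes of the patent. The sketch below is an assumption about how such a lookup might work. The index layout, the concept identifiers and the function name are hypothetical and do not come from the actual terminological database.

```python
def disambiguate(term, patent_ipc_classes, term_index):
    """Return the concept whose IPC prefixes overlap most with the patent.

    `term_index` maps a lower-cased term to a list of
    (concept_id, ipc_prefixes) candidates (hypothetical layout).
    Returns None when no candidate shares a class with the patent.
    """
    best, best_overlap = None, 0
    for concept_id, ipc_prefixes in term_index.get(term.lower(), []):
        overlap = sum(
            1
            for ipc in patent_ipc_classes
            if any(ipc.startswith(prefix) for prefix in ipc_prefixes)
        )
        if overlap > best_overlap:
            best, best_overlap = concept_id, overlap
    return best


# Hypothetical ambiguous entry: "cell" in biology vs. electrochemistry.
toy_index = {
    "cell": [
        ("biology:cell", ["C12N", "A61K"]),
        ("energy:electrochemical_cell", ["H01M"]),
    ]
}
```

With this toy index, a patent classified under H01M resolves "cell" to the electrochemical concept, while a patent with no matching class leaves the term unresolved.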
18 Limitations of text-only retrieval
- Queries are based on all the textual content of the topic patent documents

  Model  Index    Language  Base    With citation text
  KL     lemma    en        0.1068  0.1083
  KL     lemma    fr        0.0611  0.0612
  KL     lemma    de        0.0627  0.0634
  KL     phrase   en        0.0717  0.0720
  KL     concept  all       0.0671  0.0680
  Okapi  lemma    en        0.0806  0.0813
  Okapi  lemma    fr        0.0301  0.0303
  Okapi  lemma    de        0.0598  0.0612
  Okapi  phrase   en        0.0328  0.0330
  Okapi  concept  all       0.0510  0.0516
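For readers unfamiliar with the Okapi model compared above, here is a minimal sketch of a common textbook form of BM25 scoring. It is not Lemur's implementation, and the parameter values k1 = 1.2 and b = 0.75 are conventional defaults assumed for illustration; the slides do not state the settings used.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of index terms) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

In the setting described on this slide the query would be built from the full text of the topic patent, so it is far longer than a typical keyword query.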
21 Patent Document Processing: Metadata
- Additional extraction of cited patents in the descriptions (regular expressions)
  - 7960 additional cited EP doc. found in XL set
- Metadata representation: basic normalization (author, applicant)
- Storage in a MySQL database (total 2.48 GB for the collection)
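The regular expressions used to find cited patents in the descriptions are not shown in the slides. The following sketch illustrates the general idea with one assumed pattern for EP publication numbers (optional "A"/"B" marker, 6 or 7 digits possibly separated by spaces or dots, optional kind code). The pattern, the zero-padding to 7 digits and the function name are illustrative assumptions.

```python
import re

# Assumed pattern, e.g. "EP 0 123 456 A1", "EP-A-1234567", "EP 123456".
EP_CITATION = re.compile(
    r"\bEP[\s-]?(?:[AB][\s-]?)?(\d(?:[\s.]?\d){5,6})(?!\d)(?:[\s-]?([AB]\d))?"
)


def extract_ep_citations(description):
    """Return normalised EP numbers cited in the text, in order, without repeats."""
    found = []
    for match in EP_CITATION.finditer(description):
        digits = re.sub(r"\D", "", match.group(1)).zfill(7)
        number = "EP" + digits
        if number not in found:
            found.append(number)
    return found


sample = (
    "As described in EP 0 123 456 A1 and EP-A-1234567, the device of "
    "EP 0 123 456 differs from US 5,555,555."
)
cited = extract_ep_citations(sample)  # ['EP0123456', 'EP1234567']
```

A production extractor would need further patterns for other offices and number formats, and the normalised numbers would then be matched against the collection.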
22 Prior working sets
- Goal: for a given patent topic, create the smallest set of patents containing the relevant documents
- Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies
- Result: micro-recall of 0.7303, approx. 2600 doc. per patent topic (415 results per topic after final cutoff)
- Significant improvement of MAP results

  Model  Index   Language  With cit. text  With prior sets
  KL     lemma   en        0.1083          0.1516 (+40%)
  KL     lemma   de        0.0634          0.1145 (+81%)
  KL     phrase  en        0.0720          0.1268 (+76%)
  Okapi  lemma   en        0.0813          0.1365 (+68%)
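The iterative expansion and the micro-recall figure above can be sketched as follows. This is a simplified illustration under stated assumptions: all metadata relations (citations, shared applicant, family, classification) are collapsed into a single hypothetical `neighbours` mapping, and the number of expansion rounds is a free parameter. The actual expansion rules and stopping criteria are not given in the slides.

```python
def build_working_set(core, neighbours, max_rounds=2):
    """Expand a core document list along metadata links, round by round."""
    working = set(core)
    frontier = set(core)
    for _ in range(max_rounds):
        next_frontier = set()
        for doc in frontier:
            next_frontier.update(neighbours.get(doc, ()))
        next_frontier -= working  # keep only newly reached documents
        if not next_frontier:
            break
        working |= next_frontier
        frontier = next_frontier
    return working


def micro_recall(working_sets, relevant):
    """Relevant documents found over all topics / all relevant documents."""
    found = sum(len(working_sets[topic] & relevant[topic]) for topic in relevant)
    total = sum(len(docs) for docs in relevant.values())
    return found / total if total else 0.0
```

Micro-recall pools all topics before dividing, so topics with many relevant documents weigh more than they would in a per-topic average.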
25 Merging of results
- Strong complementarities between the result sets
- So many examples! → fully supervised ML
- Regression model for estimating, for each patent topic, the pertinence of a result set
- Features: language, query size, init. working set size, max./min. range of retrieval scores, IPC main class, average phrase length
- Training set: 500, plus an addition of 4131 patents of the collection
- Linear combination of weights
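One way to read the merging step is as a weighted linear combination of the per-index result lists, with the weight of each list supplied by the regression model for the topic at hand. The sketch below assumes min-max normalisation of the retrieval scores before combining, which the slides do not state. The function names, run names and weights are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge(result_sets, weights):
    """Combine result sets linearly; `weights` holds one value per set.

    In the approach described above these weights would be predicted
    per topic by the regression model; here they are plain inputs.
    """
    merged = {}
    for name, scores in result_sets.items():
        for doc, s in min_max(scores).items():
            merged[doc] = merged.get(doc, 0.0) + weights[name] * s
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

A document retrieved by several models accumulates score from each of them, which is how complementary result sets can reinforce one another.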
26 Merging of results

  Feat.  LeastMedSq       MP               SMO              ?-SVM
  f1     0.1681 (+5.8%)   0.1711 (+7.7%)   0.1706 (+7.4%)   0.1691 (+6.4%)
  f1-6   0.1689 (+6.3%)   0.1797 (+13.1%)  0.1807 (+13.7%)  0.1976 (+24.3%)
  all    0.1786 (+12.4%)  0.1898 (+19.4%)  0.2016 (+26.9%)  0.2281 (+43.5%)

- f1: language
- f2-6: related to the retrieval score
- f7-8: IPC (domains)
- f9: av. phrase length
29 Post-ranking
- Regression model for estimating the pertinence of a patent in the result set for a given patent topic
- Features: citation, number of common IPC and ECLA classes, prob. of citation, same applicant and inventors
- Training set: 500, plus an addition of 4131 patents of the collection
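To make the post-ranking features concrete, the sketch below builds a small feature vector for a (topic, candidate) pair and re-orders candidates with a linear scoring function. It is an illustration only: the record fields, the choice of four features and the weight values are hypothetical stand-ins for the trained regression model, and the "prob. of citation" feature is left out because the slides do not define it.

```python
def features(topic, candidate):
    """Metadata features for one candidate patent (hypothetical record layout)."""
    return [
        1.0 if candidate["id"] in topic["cited_ids"] else 0.0,
        float(len(set(topic["ipc"]) & set(candidate["ipc"]))),
        1.0 if topic["applicant"] == candidate["applicant"] else 0.0,
        float(len(set(topic["inventors"]) & set(candidate["inventors"]))),
    ]


def post_rank(topic, candidates, weights, bias=0.0):
    """Re-order candidates by a linear score over their features."""
    scored = [
        (c["id"], bias + sum(w * x for w, x in zip(weights, features(topic, c))))
        for c in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In a trained system the weights would be fitted on the training topics, and the initial retrieval score would normally be one more input alongside the metadata features.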
30 Final Results

  Measure      S       M       XL      en-XL   fr-XL   de-XL
  MAP          0.2714  0.2783  0.2802  0.2358  0.1787  0.2092
  Prec. at 5   0.2780  0.2766  0.2768  0.2365  0.1855  0.2122
  Prec. at 10  0.1768  0.1748  0.1776  0.1575  0.1338  0.1467

- On average approx. 43 s per topic
- Final runs (10,000 patent topics) for all, en, fr, de took 5 days on 4 machines
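The measures reported above follow standard definitions. As a reminder of what they compute, here is a minimal sketch of precision at k and (mean) average precision. The function names are illustrative, and this is not the official CLEF IP evaluation script.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Because average precision divides by the number of relevant documents, a relevant document that is never retrieved lowers the score even though it does not appear in the ranking.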
31 Conclusion
- We have proposed an architecture for retrieving Scientific and Technical Publications
- We have adapted the architecture to patent search practices
- Needs:
  - improve terminological representations
  - address document structures
  - refine query representations
- Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/