Second Global Symposium of Intellectual Property Authorities

M. Barrou DIALLO (Ph.D., Computer Sciences) is currently the Head of Research at the European Patent Office. He manages IT global research projects in cooperation ...

1
Second Global Symposium of Intellectual Property
Authorities
An approach to process multilingual patents
corpora
Dr. Barrou DIALLO Head of Research, European
Patent Office, Rijswijk
  • Geneva, September 17, 2010

2
About the R&D Department
EPO R&D Department
  • At the origin of the 1st Machine Translation
    System for patents
  • Entry point for testing and evaluating available
    solutions
  • Portfolio of academic international
    collaborations
  • Strong background in algorithmics and linguistics
  • Network of active users and testers

3
Our Mission
Providing an instrument to translate user needs
into Projects
Main tasks
  • Technical advice
  • Market studies and research
  • Technological Forecasting
  • Risk analysis and Strategic planning

Resulting in
Supporting, on request, the EPO MT Task Force
and/or IP5 MT activities
Management support on ICT issues
  • Performing quantitative analysis
  • Advising decision-makers on technical
    solutions
  • Providing users with sensible options and
    recommending courses of action

Coordinating research initiatives across IT
  • Identifying and communicating business
    opportunities
  • Ensuring smooth transition from research to
    development
  • Communicating practices and experiences
  • Formalising research work across all departments

4
Current Research Subjects
Our Expertise
Machine Translation for Asian Languages
Semantic Search Engines
Graphical Visualization
5
Our Mission
Our Vision: Turning Technology into an effective
IP Process
R&D center as a source of Efficiency
  • Efficient Reading
  • Accurate Searching
  • Fast Granting

6
Example of an R&D platform for linguistic purposes
7
Logical view of a document in an IR system
Figure: processing stages applied to a document (document structure, accents and spacing, stopwords, noun groups, stemming, manual indexing), repeated for each translated version (Doc - Translated x1 ... xn)
Multiple languages add another dimension to
Retrieval systems for patents.
adapted from J. H. Wang, 2008
8
MT: Setup of an evaluation platform
Goal: To provide the users with a clear
assessment of the quality of MT systems
  • Unix server hosting full-text patent data of
    source and target languages
  • "mteval" scoring script for the Open MT
    Evaluation (http://www.itl.nist.gov/iad/mig//tools/)
  • Case of a small set of Japanese patent documents:
  • 54 JP patents
  • 54 JP priority documents published at the USPTO
  • analysis over the claim section

Which indicators of quality can be considered as
valid?
9
Absolute score computation: processing scheme
  • For each document:
  • choose candidate sentences (ca. 10 segments)
  • find the corresponding HT
  • compute the BLEU score
  • compute the NIST score
  • compute the document average BLEU score
  • compute the document average NIST score
  • compute the BLEU - NIST correlation
  • compute the BLEU - HT correlation
  • compute the NIST - HT correlation
  • store the IPC class
  • For the collection (per IPC class):
  • compute the average BLEU score
  • compute the average NIST score
  • compute the correlation between scores in each
    IPC class
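The aggregation steps of this scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: the segment scores are taken as already computed (for instance by a scoring script), the field names, the function names and the toy values are hypothetical, and the Pearson coefficient is one possible choice of correlation measure, computed here at collection level only.

```python
import math
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarise(documents):
    """Average segment scores per document, then per IPC class.

    `documents` is a list of dicts with the (hypothetical) keys
    'ipc', 'bleu' and 'nist'; the last two hold per-segment scores.
    """
    per_doc = [(d["ipc"], mean(d["bleu"]), mean(d["nist"])) for d in documents]

    by_class = defaultdict(list)
    for ipc, bleu, nist in per_doc:
        by_class[ipc].append((bleu, nist))
    per_class = {
        ipc: (mean([b for b, _ in rows]), mean([n for _, n in rows]))
        for ipc, rows in by_class.items()
    }

    # Collection-level correlation between document-average BLEU and NIST.
    corr = pearson([b for _, b, _ in per_doc], [n for _, _, n in per_doc])
    return per_doc, per_class, corr


# Toy, made-up scores purely for illustration (not results from the study).
docs = [
    {"ipc": "H01L", "bleu": [0.05, 0.15], "nist": [3.5, 4.5]},
    {"ipc": "F02D", "bleu": [0.10, 0.30], "nist": [4.0, 5.0]},
    {"ipc": "F02D", "bleu": [0.20, 0.40], "nist": [4.5, 5.5]},
]
per_doc, per_class, corr = summarise(docs)
print(per_class)
print(round(corr, 3))
```

Keeping the IPC class next to each document average makes it straightforward to add a per-class correlation later, once each class holds enough documents for the coefficient to be meaningful.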

10
Example of raw results
  • High variation of scores
  • High correlation between BLEU and NIST
  • Extreme cases
  • BLEU = 0 at 9th position
  • BLEU = 0.30 at 25th position

BLEU score JPO 54 documents
NIST score JPO 54 documents
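To see why a BLEU score of exactly 0 can occur, the following Python sketch implements a simplified, unsmoothed sentence-level BLEU with a single reference. It is not the mteval script: tokenisation is plain whitespace splitting, and the function names are hypothetical. Without smoothing, the geometric mean collapses to 0 as soon as one n-gram order has no match.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference sentence."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # one empty n-gram order zeroes the geometric mean
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))
print(simple_bleu("capacitor voltage is read", "reading voltage of this capacitor"))
```

The second call shares two words with its reference but no bigram, so the score is 0 even though some of the content overlaps.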
11
NIST vs. BLEU correlation
JP system case
Google case
12
Our findings based on a limited example
BLEU is consistent with Human translations
  • Correlation between BLEU scores and
    Human-translated documents
  • high scores correlate with understandable
    translations
  • low scores correlate with non-understandable
    translations
  • Differences between documents from different IPC
    classes
  • Spread of scores is large (cf. std dev.)

But
  • Results are absolute: they need to be compared
    with other systems
  • Bias can be introduced by the origin of data
    (IPC class, complexity, ...)

13
JP: BLEU and NIST vs. IPC classes
To address the issue of data origin
BLEU score
NIST score
14
MT systems: relative score computation scheme
  • BLEU score: JP translation system vs. Google system

BLEU score Google
BLEU score JP
15
Best, medium and worst case examples
Mean scores for the whole collection
  • JP translation system: NIST 4.7962, BLEU 0.1443
  • Google: NIST 4.5796, BLEU 0.1185

Worst case: BLEU score 0 for JP
Medium case: BLEU score 0.15 for JP
Best case: BLEU score 0.30 for JP
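One simple way to read the two systems against each other is the relative difference of their mean scores. The Python sketch below uses the mean values reported on this slide. Expressing the comparison as a relative difference is an assumption for illustration, not necessarily the scheme used in the study, and the names are hypothetical.

```python
# Mean scores reported for the whole collection.
MEAN_SCORES = {
    "JP": {"NIST": 4.7962, "BLEU": 0.1443},
    "Google": {"NIST": 4.5796, "BLEU": 0.1185},
}


def relative_difference(system, baseline, metric):
    """Relative gap of `system` over `baseline` for one metric."""
    s = MEAN_SCORES[system][metric]
    b = MEAN_SCORES[baseline][metric]
    return (s - b) / b


for metric in ("BLEU", "NIST"):
    gap = relative_difference("JP", "Google", metric)
    print(f"{metric}: JP vs. Google {gap:+.1%}")
```

On these means the JP system is ahead by roughly a fifth in BLEU but only by a few percent in NIST, which illustrates why the two metrics are best read together.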
16
Worst case example: JP (BLEU = 0)
JP Human Translation, claim 1: 1. A low-level
light detector, comprising an avalanche
photodiode with a bias voltage adjusted to
produce a multiplication factor of up to 30 a
capacitor connected to the avalanche photodiode
for accumulating carriers produced and multiplied
in the avalanche photodiode; biasing means of the
avalanche photodiode outputting means of a
capacitor voltage change and control means of
the biasing and outputting means wherein the
low-level light detector detects an intensity of
light impinging on the avalanche photodiode by
periodically reading capacitor voltages and
obtaining differences between the voltages.
JP MT, claim 1: an avalanche photo-diode (APD)
which adjusted bias voltage so that a
multiplication factor might become 30 or less
A microscopic weak optical power detector
detecting intensity of light irradiated by above
APD by connecting a capacitor for generating
inside this APD and accumulating a (ed)
carrier, reading voltage of this capacitor
periodically, and taking the difference
JP MT Google, claim 1: 01 claim Avalanche adjusted
so that the bias voltage multiplication factor of
30 or less (APD), and comprises APD occurs
within, connect a capacitor for storing carriers
multiplication reads regularly voltage capacitor
comprises, by taking the difference above, APD
was irradiated characterized by intensity of
light to detect Ru, very faint light detector.
  • Remarks
  • "APD" is not in Human Translation
  • "Avalanche photodiode" appears 5 times in Human
    vs. different occurrences in MT
  • "photo diode" is missing in Google
  • Much more information in HT than MT

17
Medium case example: JP (BLEU = 0.15)
JP Human Translation: 1. An electronic throttle
control device of an internal-combustion engine
that controls an engine output by computing a
quantity of a throttle opening degree on the
basis of a manipulation quantity of an
accelerator pedal by a driver by means of a
computation portion in an electronic control
unit, and by controlling a throttle opening
degree using a specific actuator on the basis of
a computed command value of the throttle opening
degree, wherein the electronic control unit
includes a judgment function portion
JP MT: Claims 1. It has the following and is
characterized by choosing a predetermined map
from said two or more characteristic conversion
factor maps, and calculating a target throttle
opening command value corresponding to a judgment
result of said judgment function part. being
based on the amount of operations of a driver's
accelerator by a calculating means of an
electronic control unit (ECU) -- a throttle -- an
opening -- quantity calculating and,
Google MT: claims claim 1 electronic control
unit (ECU) by means of operation, the driver's
accelerator operation is calculated caliber
throttle opening based on the amount that means
actuator given on the command throttle opening by
this operation, to control the opening of the
throttle control, electronic throttle the
internal combustion engine to control engine
output apparatus, the electronic control unit,
the normal operating conditions and engine
systems, engine control unit to determine the
abnormality detection capabilities,
Conclusion: Quality is not good enough for
understanding the content
18
Best case example: JP (BLEU = 0.30)
Human Translation JP: 1. A signal processing
circuit comprising: a pulse generation part that
generates a pulse signal corresponding to an
input signal; an integration part that generates
an integrated voltage having a time slope
proportional to an input voltage with a duration
specified by said pulse signal being set as an
integration period; and a hold part that holds
and outputs a difference voltage between a start
voltage and an end voltage of said integrated
voltage in said integration period.
JP Machine Translation: A signal-processing
circuit comprising A pulse generating means which
generates a pulse signal according to an input
signal. An integrating means which generates
integration voltage which has a time slope which
is proportional to input voltage by making into
an integration period a period specified with
said pulse signal. A hold means which holds and
outputs difference voltage of starting potential
of said integration voltage and end voltage in
said integration period.
Google MT: 01 and pulse generation to generate a
pulse signal corresponding to the input signal,
the pulsed integration time period as specified
in the signal integration means for generating a
voltage gradient with an integration time
proportional to the input voltage, the voltage
difference between voltage and hold the start
voltage and end voltage of said integration of
said integration period hold, and a signal
processing circuit means and output.
Conclusion: Tiny differences between JP MT and HT
19
Rank-ordered N-gram co-occurrence scores
NIST scores for MT vs. Human translations
6 commercial MT systems and 7 professional
translators
Is NIST 0.4 sufficient for patent professionals?
Maximum score for MT
(c) NIST N-gram scoring study
20
Manual vs. Automatic evaluation: Result
Interpretations
  • Scores have to be carefully interpreted: no
    statistical significance at the moment.
  • There is a clear correlation between manual
    scores and automatic scores
  • Both scores, NIST and BLEU, are complementary and
    show different aspects
  • Relative scores should be calculated to assess
    systems against each other
  • Both end-user assessments AND automatic scores
    are necessary for testing a system

21
Cross-Language Patent Retrieval: a preliminary
approach
22
The general problem
Finding documents written in any language using
queries expressed in a single language
Source language: the language that gives access
to the required information (the query language)
Target language: the language of the content in
the database
Usage: patent query translation and/or patent
translation from the source language.
  • Main strategies for query translation
  • dictionary-based methods
  • Limitations of dictionaries
  • Inflected word forms
  • Phrases and compound words
  • Lexical ambiguity
  • Possible solution: Approximate string matching
  • corpus-based methods
  • frequency analysis (aboutness of the 2
    collections should be similar)
  • machine translation
  • use of a morphological parser
  • Translates source language texts into the target
    language using:
  • Translation dictionaries
  • Other linguistic resources
  • Syntax analysis
  • Limited availability
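The approximate string matching idea mentioned for dictionary-based methods can be sketched with Python's standard `difflib` module: an inflected query word that is not a dictionary headword is mapped to its closest entry. The tiny English-to-German dictionary, the function name and the similarity cutoff below are assumptions made for illustration only.

```python
import difflib

# Hypothetical miniature English -> German translation dictionary.
DICTIONARY = {
    "photodiode": "Photodiode",
    "capacitor": "Kondensator",
    "throttle": "Drosselklappe",
    "detector": "Detektor",
}


def translate_approx(word, dictionary=DICTIONARY, cutoff=0.8):
    """Translate `word`, falling back to the closest headword if needed."""
    if word in dictionary:
        return dictionary[word]
    # get_close_matches returns headwords whose similarity ratio >= cutoff.
    matches = difflib.get_close_matches(
        word.lower(), list(dictionary), n=1, cutoff=cutoff
    )
    return dictionary[matches[0]] if matches else None


print(translate_approx("photodiodes"))  # plural form, not a headword
print(translate_approx("xyz"))          # nothing close enough
```

A high cutoff keeps the fallback conservative. Lowering it recovers more inflected forms but also increases the risk of matching an unrelated headword.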

23
Cross-language Retrieval in a nutshell
Mohsen Jamali, Sharif Univ. of technology
24
How to start Cross-language Information Retrieval
for Patents?
Classic CLIR system tree: which strategy for
patent documents?
25
The main issue of CLIR: Term disambiguation
How to deal with ambiguity?
  • Solution 1: Selecting the most likely translation
    (the 1st one offered by a dictionary? the longest
    term?)
  • Problem: a low probability of success.
  • Solution 2: Use of all possible translations in
    the query with the OR operator.
  • Problem: it includes the correct translation, but
    also introduces noise into the query. This can
    lead to the retrieval of many irrelevant
    documents.
  • Solution 3 (most popular): Term co-occurrence
    models.
  • A query defines a single concept or an
    information need, thus the terms in a query are
    assumed to exhibit a relatively strong
    relationship. Therefore, the correct translation
    of one query term would be expected to show a
    strong correlation with the other translated query
    words.
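Solution 2 can be sketched in a few lines of Python. The dictionary content, the function name and the query syntax (quoted terms, AND between query words) are assumptions, since Boolean syntax differs between search engines.

```python
def or_expand(query_terms, dictionary):
    """Build a Boolean query keeping every candidate translation."""
    groups = []
    for term in query_terms:
        # Terms without a dictionary entry are kept untranslated.
        translations = dictionary.get(term, [term])
        quoted = [f'"{t}"' for t in translations]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical French -> English entries with ambiguous translations.
fr_en = {
    "cellule": ["cell", "cubicle"],
    "batterie": ["battery", "drums"],
}
print(or_expand(["cellule", "batterie"], fr_en))
```

The resulting query contains the intended pair ("cell", "battery") but also matches documents about office cubicles or drum kits, which is the noise referred to above.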

26
A proposed measure: Mutual Information
Mutual Information (MI) is a technique based on
co-occurrence statistics
  • The relationship between query terms can be
    quantified with a co-occurrence model
  • The Mutual Information measure quantifies the
    distance between the joint distribution of terms
    X and Y and the product of their marginal
    distributions
  • x, y are the translations of two query terms
  • f(x), f(y), f(x,y) are the frequency with which x
    appears, the frequency with which y appears and
    the frequency with which x and y appear together,
    respectively
  • N is the size of the corpus
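A minimal Python sketch of a frequency-based MI estimate using the symbols defined above. The formula is not given in the text of this slide, so the common form MI(x, y) = log( N · f(x, y) / (f(x) · f(y)) ) is an assumption here, as are the function name and the use of the natural logarithm.

```python
import math


def mutual_information(f_x, f_y, f_xy, n):
    """MI(x, y) = log( N * f(x, y) / (f(x) * f(y)) ).

    f_x, f_y: frequencies of x and y; f_xy: frequency with which they
    appear together; n: size of the corpus.
    """
    if f_xy == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log(n * f_xy / (f_x * f_y))


# Toy counts: x and y each occur in 3 of 6 documents, together in 2.
print(mutual_information(3, 3, 2, 6))
# Independent terms: the joint frequency equals f(x) * f(y) / N.
print(mutual_information(2, 3, 1, 6))
```

A positive value means the two translations occur together more often than chance would predict, a value of 0 corresponds to independence, and a negative value means they tend to avoid each other.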

27
Translation selection: Total Correlation
Measure
  • We have a list of translation candidates.
  • Goal is to find the correct translation from the
    candidate list.
  • The correct translation will be selected using MI

Decision
  • Total correlation - a generalization of the
    Mutual Information to calculate the relationship
    between the query words is proposed
  • xi are the translations of the query words
  • f(xi) is the frequency with which xi appears in
    the corpus
  • f(x1,x2,x3,...) is the frequency with which all
    query words appear together in the corpus.
  • N is the size of the corpus

If a set of translated query terms has a high MI
value, then this set of translated terms is to be
considered as correct
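The selection step can be sketched in Python as follows. The formula is not given in the text of this slide, so the sketch assumes the frequency-based form C(x1, ..., xn) = log( N^(n-1) · f(x1, ..., xn) / (f(x1) · ... · f(xn)) ), counts frequencies as document frequencies, and searches all candidate combinations exhaustively. The toy corpus and all names are hypothetical.

```python
import math
from itertools import product


def doc_frequency(corpus, terms):
    """Number of documents that contain every term in `terms`."""
    return sum(1 for doc in corpus if all(t in doc for t in terms))


def total_correlation(corpus, terms):
    """log( N^(n-1) * f(x1..xn) / prod f(xi) ) over document frequencies."""
    joint = doc_frequency(corpus, terms)
    singles = [doc_frequency(corpus, [t]) for t in terms]
    if joint == 0 or 0 in singles:
        return float("-inf")  # the terms never occur together
    return ((len(terms) - 1) * math.log(len(corpus))
            + math.log(joint)
            - sum(math.log(f) for f in singles))


def select_translation(corpus, candidates):
    """Pick one translation per query word, maximising total correlation.

    `candidates` holds one list of candidate translations per query word.
    """
    return max(product(*candidates),
               key=lambda combo: total_correlation(corpus, combo))


# Toy target-language collection: each document is a set of terms.
corpus = [
    {"battery", "cell", "electrode"},
    {"battery", "cell", "charger"},
    {"cell", "membrane", "protein"},
    {"drums", "cymbal", "stick"},
    {"cubicle", "office", "partition"},
    {"battery", "charger", "voltage"},
]
candidates = [["cell", "cubicle"], ["battery", "drums"]]
print(select_translation(corpus, candidates))
```

Exhaustive search grows quickly with the number of query words and candidates, so a real system would probably prune the candidate lists first. For short patent queries the brute-force version keeps the idea easy to follow.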
28
Conclusion on Term disambiguation
Mutual Information associated with Total
Correlation is proposed as a measure for
cross-language patent retrieval
  • MI is a simple measure and not too
    computationally intensive
  • It performs as well as other co-occurrence
    approaches (Maeda et al., 2000).
  • Co-occurrence frequencies can be obtained from
    the document collection.

This approach is compatible with a collaborative
view
  • Make use of, or build, test collections to
    evaluate the systems
  • example of CLEF (Cross Language Evaluation Forum)
  • collect set of queries (rare items in IP)
  • collect sets of relevance judgments (which
    documents are relevant to which queries)

29
Visualization and analysis of Patent Queries
Another solution for Term disambiguation
  • Graphical and textual editing of queries
  • Visual support of different search engines
  • Full-text search
  • Semantic search
  • Image similarity
  • Metadata search
  • Query management functionality
  • Storing of queries
  • Parameterization of queries using variables

Checking and amending queries interactively when
necessary increases the chance of good results
30
Perspective and conclusion
The field of patent processing is still in a
maturing mode
  • The number of subjects to be addressed is large
    (MT, IR, SE theory, Scoring and Evaluation,
    etc.)
  • The difficulty of retrieving patents raises
    theoretical problems. Testing theory needs a
    large amount of:
  • clean datasets and queries
  • CPU power
  • feedback from user communities
  • Current implementations do not entirely satisfy
    users' needs (usability, language independence,
    etc.)
  • Metrics in place need to be revisited and/or
    replaced by patent-specific metrics (e.g.
    PRES/Univ. Dublin)
  • Patents not only represent technical texts, but
    also a set of environmental attributes which have
    to be consulted in order to achieve the goals
    (IPC classes, patent searcher behaviours, legal
    changes, ...)

31
Thank you for your attention. Any questions?
  • Barrou DIALLO bdiallo_at_epo.org