ICRA2011WS - PowerPoint PPT Presentation

About This Presentation

Title:

ICRA2011WS

Description:

Addressing Uncertainty in Performance Measurement of Intelligent Systems. Raj Madhavan (1,2), Elena Messina (1), Hui-Min Huang (1), Craig Schlenoff (1). (1) Intelligent Systems Division

Slides: 19

Provided by: RMadh2

Learn more at: http://telerobot.cs.tamu.edu

Transcript and Presenter's Notes

Title: ICRA2011WS

1
Addressing Uncertainty in Performance Measurement of Intelligent Systems
Raj Madhavan (1,2), Elena Messina (1), Hui-Min Huang (1), Craig Schlenoff (1)
(1) Intelligent Systems Division, National Institute of Standards and Technology (NIST)
(2) Institute for Systems Research (ISR), University of Maryland, College Park
Commercial equipment and materials are identified in this presentation in order to adequately specify certain procedures. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose. The views and opinions expressed are those of the presenter and do not necessarily reflect those of the organizations he is affiliated with.
2
Measuring Performance of Intelligent Systems

Performance evaluation, benchmarking, and standardization are critical enablers for wider acceptance and proliferation of existing and emerging technologies.
They are crucial for fostering technology transfer and driving industry innovation.
Currently, neither consensus nor standards exist on:
  key metrics for determining the performance of a system;
  objective evaluation procedures to quantitatively deduce/measure the performance of robotic systems against user-defined requirements.
The lack of ways to quantify and characterize the performance of technologies and systems has precluded researchers working towards a common goal from:
  exchanging and communicating results,
  intercomparing robot performance, and
  leveraging previous work, which would otherwise avoid duplication and expedite technology transfer.

3
Measuring Performance of Intelligent Systems

The lack of ways to quantify and characterize technologies and systems also hinders adoption of new systems:
  Users don't trust claims by developers.
  There is a lack of knowledge about how to match a solution with a problem.
  Users may be reluctant to try a new technology for fear of expensive failure.
  Think of the graveyards of unused equipment in some places.

4
Challenges in Measuring Performance of IS

Diversity of applications and deployment scenarios for the IS
Complexity of the Intelligent System itself:
  software components
  hardware components
  interactions between components (system of systems)
Lack of a well-defined mathematical foundation for dealing with uncertainty in a complex system:
  methods for computing performance measures and related uncertainties
  techniques for combining uncertainties and making inferences based on those uncertainties
  approaches for estimating uncertainties for predicted performance
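
One widely used technique for combining uncertainties, offered here only as an illustration and not as the method the presenters have in mind, is the root-sum-of-squares rule for independent standard uncertainties. The sketch below assumes the components are uncorrelated; the function name and the example values are hypothetical.

```python
import math


def combined_standard_uncertainty(components):
    """Root-sum-of-squares of standard uncertainties.

    Assumes the components are independent (uncorrelated) and
    expressed in the same unit.
    """
    return math.sqrt(sum(u * u for u in components))


# Hypothetical component uncertainties, both in the same unit.
u_c = combined_standard_uncertainty([3.0, 4.0])
print(u_c)  # 5.0
```

Correlated components would need covariance terms, which this sketch deliberately leaves out.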

5
Uncertainty and Complexity

Uncertainty and complexity are often closely
related
The abilities to handle uncertainty and
complexity are directly related to the levels of
autonomy and performance

6
Autonomy Levels for Unmanned Systems (ALFUS)
Framework

Standard terms and definitions for characterizing the levels of autonomy for unmanned systems
Metrics, methods, and processes for measuring autonomy of unmanned systems
Contextual Autonomous Capability
http://www.nist.gov/el/isd/ks/autonomy_levels.cfm/ (Hui-Min Huang)

7
(No Transcript)
8
Addressing Uncertainty in Performance Measurement
via Complexity

In this context, the performance that we are trying to measure is taken to mean the successful completion of the mission.
Being able to handle higher levels of mission and environmental complexity results in higher system performance.
We can determine whether program-specific performance requirements are achievable.

Mobility Example
9
Test Methods (1): Hurdle Test Method
The purpose of this test method is to quantitatively evaluate the vertical step surmounting capabilities of a robot, including variable chassis configurations and coordinated behaviors, while being remotely teleoperated in confined areas with lighted and dark conditions.
Metrics: maximum elevation (cm) surmounted for 10 repetitions; average time per repetition.

Hurdle Test Method Results: numbers indicate successful repetitions. Ten successful repetitions correspond to a reliability (probability of success) of 80% that the robot can successfully perform the task at the associated apparatus setting.
Measurement Uncertainty (in measuring Obstacle Traverse Capability): one half of the obstacle size increment (5 cm) and the elapsed time unit (30 s).
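
The slide does not state how 10 successful repetitions map to a reliability figure. One common reading, assumed here and reading the slide's "80" as 80%, is the zero-failure (success-run) relation C = 1 - R^n, where R is the demonstrated reliability, n the number of consecutive successes, and C the statistical confidence. The function name is hypothetical.

```python
def success_run_confidence(reliability, n_successes):
    """Confidence that the true success probability is at least
    `reliability` after `n_successes` trials with no failures.

    Assumed model (not stated in the slides): C = 1 - R**n.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if n_successes < 0:
        raise ValueError("n_successes must be non-negative")
    return 1.0 - reliability ** n_successes


c = success_run_confidence(0.80, 10)
print(round(c, 3))  # 0.893
```

Under this assumed model, 10 straight successes would support an 80% reliability claim with roughly 89% confidence.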

10
Comms Example
11
Test Methods (2): Radio Comms (LOS) Test Method
The purpose of this test method is to quantitatively evaluate the line-of-sight (LOS) radio communications range for a remotely teleoperated robot.
Metric: maximum distance (m) downrange at which the robot completes tasks to verify the functionality of control, video, and audio transmissions.
Line-of-Sight Radio Comms Test Method: stations every 100 m for testing two-way communications. The multiple testing tasks at each test station together provide the repetitions used to assess repeatability.
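
As a rough illustration of the range metric, the sketch below walks the stations from nearest to farthest and reports the last distance at which every task succeeded. It assumes the first station sits one spacing (100 m) from the start and that the range ends at the first station with any failed task; neither detail is given in the slides, and the function name and sample results are hypothetical.

```python
def max_verified_range(station_results, spacing_m=100):
    """Return the distance (m) of the farthest station reached
    before any task fails.

    station_results[i] maps a task name to True/False for the
    i-th station, ordered nearest first.
    """
    distance = 0
    for index, tasks in enumerate(station_results, start=1):
        if not all(tasks.values()):
            break  # assumed rule: stop at the first failing station
        distance = index * spacing_m
    return distance


# Hypothetical results for three stations.
results = [
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": False, "audio": True},
]
print(max_verified_range(results))  # 200
```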
12
SCORE

System: a set of interacting or interdependent components forming an integrated whole intended to accomplish a specific goal
Component: a constituent part or feature of a system that contributes to its ability to accomplish a goal
Capability: a specific purpose or functionality that the system is designed to accomplish
Technical Performance: metrics related to quantitative factors (such as accuracy, precision, time, distance, etc.) as required to meet end-user expectations
Utility Assessment: metrics related to qualitative factors that gauge the quality or condition of being useful to the end-user

SCORE (System, Component and Operationally
Relevant Evaluations)
Is a unified set of criteria and software tools
for defining a performance evaluation approach
for complex intelligent systems
Provides a comprehensive evaluation blueprint
that assesses the technical performance of a
system, its components and its capabilities
through isolating and changing variables as well
as capturing end-user utility of the system in
realistic use-case environments

13
How SCORE Handles Complexity

The complexity of the system under test grows as more components are introduced into the evaluation.
Components evaluated in the elemental tests are less complex than sub-systems (which contain multiple components), which are less complex than the whole system.
SCORE tests at these various levels of complexity.
Data in the following slides indicate that the results of the elemental tests can accurately predict the performance of the sub-system test (which is more complex), and so on.

14
TRANSTAC

GOAL: Demonstrate capabilities to rapidly develop and field free-form, two-way speech-to-speech translation systems enabling English and foreign language speakers to communicate with one another in real-world tactical situations.
NIST was funded over the past three years to serve as the Independent Evaluation Team for this effort.
METRICS (as specified by DARPA):
  System usability testing: providing overall scores for the capabilities of the whole system
  Software component testing: evaluating components of a system to see how well they perform in isolation

15
TRANSTAC: A QUICK TUTORIAL ON SPEECH TRANSLATION
16
TRANSTAC METRICS

Automated Metrics: For speech recognition, we calculated Word-Error-Rate (WER). For machine translation, we calculated BLEU and METEOR.
TTS Evaluation: Human judges listened to the audio outputs of the TTS evaluation and compared them to the text string of what was fed into the TTS engine. They then gave a Likert score to indicate how understandable the audio file was. WER was also used to judge the TTS output.
Low-Level Concept Transfer: A directly quantitative measure of the transfer of the low-level elements of meaning. In this context, a low-level concept is a specific content word (or words) in an utterance. For example, the phrase "The house is down the street from the mosque." is one high-level concept, but is made up of three low-level concepts (house, down the street, mosque).
Likert Judgment: A panel of bilingual judges rated the semantic adequacy of the translations, an utterance at a time, choosing from a seven-point scale.
High-Level Concept Transfer: The number of utterances that are judged to have been successfully transferred. The high-level concept metric is an efficiency metric which shows the number of successful utterances per unit of time, as well as accuracy.
Surveys/Semi-Structured Interviews: After each live scenario, the Soldiers/Marines and the foreign language speakers filled out a detailed survey asking them about their experiences with the TRANSTAC systems. In addition, semi-structured interviews were performed with all participants in which questions such as "What did you like?", "What didn't you like?" and "What would you change?" were explored.
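
The Word-Error-Rate named among the automated metrics is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that conventional definition; it is not the evaluation team's actual tooling, the tokenization by whitespace is an assumption, and the hypothesis sentence is made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the house is down the street from the mosque"
hypothesis = "the house is down street from a mosque"  # hypothetical output
print(round(word_error_rate(reference, hypothesis), 3))  # 0.222
```

Here one deletion and one substitution against nine reference words give a WER of 2/9.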

17
TRANSTAC
SCORE Level            Metric                        Team 1   Team 2   Team 3
Elemental              BLEU                          1        2        2
Elemental              METEOR                        1        2        2
Elemental              TTS                           1        1        2
Sub-System             Low-Level Concept Transfer    1        2        2
System                 Likert Judgment               1        2        2
System                 High-Level Concept Transfer   1        2        3
System (Qualitative)   User Surveys                  1        2        3
Complexity

From these data, it appears that:
  the quantitative performance of the elements of the systems has a direct correlation to the quantitative performance of the sub-systems;
  the quantitative performance of the sub-systems has a direct correlation to the quantitative performance of the overall system;
  the quantitative performance of the overall system has a direct correlation to the qualitative perception of the soldiers using the systems.
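
One simple way to check the consistency claimed above is to count, for every pair of metrics, how many team pairs are ordered in strictly opposite directions. The sketch below uses the values from the TRANSTAC table; treating them as ranks is an assumption, and the function name is hypothetical. It is a sanity check on ordering, not a formal correlation test.

```python
from itertools import combinations

# Values copied from the TRANSTAC table (Team 1, Team 2, Team 3).
# Interpreting them as ranks is an assumption.
scores = {
    "BLEU": (1, 2, 2),
    "METEOR": (1, 2, 2),
    "TTS": (1, 1, 2),
    "Low-Level Concept Transfer": (1, 2, 2),
    "Likert Judgment": (1, 2, 2),
    "High-Level Concept Transfer": (1, 2, 3),
    "User Surveys": (1, 2, 3),
}


def discordant_pairs(a, b):
    """Count team pairs that two metrics order in strictly opposite ways."""
    count = 0
    for i, j in combinations(range(len(a)), 2):
        if (a[i] - a[j]) * (b[i] - b[j]) < 0:
            count += 1
    return count


total = sum(discordant_pairs(scores[m], scores[n])
            for m, n in combinations(scores, 2))
print(total)  # 0
```

No metric pair reverses the order of any two teams, though ties mean the agreement is not perfect and three teams are too few for a statistical conclusion.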

18
In Conclusion