Transcript and Presenter's Notes

Title: ICRA2011WS


1
Addressing Uncertainty in Performance
Measurement of Intelligent Systems
Raj Madhavan1,2 Elena Messina1 Hui-Min
Huang1 Craig Schlenoff1 1Intelligent Systems
Division National Institute of Standards and
Technology (NIST) 2Institute for Systems
Research (ISR) University of Maryland, College
Park
Commercial equipment and materials are identified
in this presentation in order to adequately
specify certain procedures. Such identification
does not imply recommendation or endorsement by
NIST, nor does it imply that the materials
or equipment identified are necessarily the best
available for the purpose. The views and opinions
expressed are those of the presenter and do not
necessarily reflect those of the organizations he
is affiliated with.
2
Measuring Performance of Intelligent Systems
  • Performance Evaluation, Benchmarking, and
    Standardization are critical enablers for wider
    acceptance and proliferation of existing and
    emerging technologies
  • Crucial for fostering technology transfer and
    driving industry innovation
  • Currently, no consensus or standards exist on
    • key metrics for determining the performance of a
      system
    • objective evaluation procedures to quantitatively
      deduce/measure the performance of robotic systems
      against user-defined requirements
  • The lack of ways to quantify and characterize
    the performance of technologies and systems has
    precluded researchers working towards a common
    goal from
    • exchanging and communicating results,
    • intercomparing robot performance, and
    • leveraging previous work that could otherwise
      avoid duplication and expedite technology
      transfer.

3
Measuring Performance of Intelligent Systems
  • The lack of ways to quantify and characterize
    technologies and systems also hinders adoption of
    new systems
    • Users don't trust claims made by developers
    • There is a lack of knowledge about how to match a
      solution with a problem
    • Users may be reluctant to try a new technology
      for fear of expensive failure; think of the
      graveyards of unused equipment in some places

4
Challenges in Measuring Performance of IS
  • Diversity of applications and deployment
    scenarios for the IS
  • Complexity of the Intelligent System itself
    • Software components
    • Hardware components
    • Interactions between components (System of
      Systems)
  • Lack of a well-defined mathematical foundation
    for dealing with uncertainty in a complex system:
    • methods for computing performance measures and
      related uncertainties
    • techniques for combining uncertainties and making
      inferences based on those uncertainties (see the
      sketch after this list)
    • approaches for estimating uncertainties for
      predicted performance
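
One standard way to combine uncertainties from independent
sources is root-sum-of-squares propagation in the style of the
GUM (Guide to the Expression of Uncertainty in Measurement). A
minimal Python sketch, assuming independent, uncorrelated error
sources; the contribution values below are hypothetical, not
from the presentation:

import math

def combined_standard_uncertainty(components):
    # GUM-style root-sum-of-squares combination of independent,
    # uncorrelated standard uncertainties.
    return math.sqrt(sum(u ** 2 for u in components))

# Hypothetical contributions to one timing measurement (seconds):
u_timer, u_trigger, u_course = 0.5, 0.2, 0.3
u_c = combined_standard_uncertainty([u_timer, u_trigger, u_course])
print(f"combined standard uncertainty: {u_c:.2f} s")  # ~0.62 s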

5
Uncertainty and Complexity
  • Uncertainty and complexity are often closely
    related
  • The abilities to handle uncertainty and
    complexity are directly related to the levels of
    autonomy and performance

6
Autonomy Levels for Unmanned Systems (ALFUS)
Framework
  • Standard terms and definitions for characterizing
    the levels of autonomy for unmanned systems
  • Metrics, methods, and processes for measuring
    autonomy of unmanned systems
  • Contextual Autonomous Capability
  • http://www.nist.gov/el/isd/ks/autonomy_levels.cfm/
    (Hui-Min Huang)

7
(No Transcript)
8
Addressing Uncertainty in Performance Measurement
via Complexity
  • In this context, the performance we are trying
    to measure is taken to mean successful
    completion of the mission
  • Being able to handle higher levels of mission and
    environmental complexity results in higher
    system performance
  • We can determine whether program-specific
    performance requirements are achievable

Mobility Example
9
Test Methods (1): Hurdle Test Method
The purpose of this test method is to
quantitatively evaluate the vertical step
surmounting capabilities of a robot, including
variable chassis configurations and coordinated
behaviors, while being remotely teleoperated in
confined areas under lighted and dark
conditions.
Metrics: maximum elevation (cm) surmounted over 10
repetitions; average time per repetition.
  • Hurdle Test Method Results: numbers indicate
    successful repetitions. 10 corresponds to a
    reliability of 80 % (probability of success) that
    the robot can successfully perform the task at
    the associated apparatus setting (see the sketch
    after this list).
  • Measurement Uncertainty (in measuring obstacle
    traverse capability): one half of the obstacle
    size increment (5 cm) and the elapsed time unit
    (30 s).
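
The 80 % figure is consistent with the standard success-run
argument: n consecutive successes demonstrate, at confidence C, a
per-trial reliability r satisfying r**n = 1 - C. The Python
sketch below assumes that reading (the slide does not spell out
the derivation, and the apparatus setting is hypothetical):

def demonstrated_reliability(n_successes, confidence=0.90):
    # Lower bound on the per-trial success probability r
    # demonstrated by n consecutive successes:
    # P(run | r) = r**n, so solve r**n = 1 - confidence for r.
    return (1.0 - confidence) ** (1.0 / n_successes)

r = demonstrated_reliability(10)
print(f"reliability demonstrated by 10/10 repetitions: {r:.1%}")  # ~79.4 %

# Report alongside the stated measurement uncertainty: one half
# of the obstacle size increment (5 cm).
best_setting_cm = 30  # hypothetical apparatus setting
print(f"max elevation surmounted: {best_setting_cm} +/- 5 cm")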

10
Comms Example
11
Test Methods (2): Radio Comms (LoS) Test Method
The purpose of this test method is to
quantitatively evaluate the line-of-sight (LOS)
radio communications range for a remotely
teleoperated robot.
Metric: maximum distance (m) downrange at which
the robot completes tasks to verify the
functionality of control, video, and audio
transmissions.
Line-of-Sight Radio Comms Test Method: stations
are placed every 100 m for testing two-way
communications. Multiple testing tasks at each
test station are summed to establish repeatability
(see the sketch below).
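
To illustrate how the metric might be scored, the following
Python sketch folds per-station task outcomes into the single
range number. Only the 100 m spacing and the control/video/audio
task list come from the slide; the outcome data are hypothetical:

def max_los_range(stations_m, tasks_ok):
    # Metric: farthest downrange station at which the robot
    # completed ALL verification tasks (control, video, audio).
    best = 0
    for distance, tasks in zip(stations_m, tasks_ok):
        if all(tasks.values()):
            best = max(best, distance)
    return best

# Hypothetical outcomes at stations placed every 100 m:
stations_m = [100, 200, 300]
tasks_ok = [
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": False, "audio": True},  # video dropout
]
print(f"max LoS range: {max_los_range(stations_m, tasks_ok)} m")  # 200 m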
12
SCORE
  • System: a set of interacting or interdependent
    components forming an integrated whole intended
    to accomplish a specific goal
  • Component: a constituent part or feature of a
    system that contributes to its ability to
    accomplish a goal
  • Capability: a specific purpose or functionality
    that the system is designed to accomplish
  • Technical Performance: metrics related to
    quantitative factors (such as accuracy,
    precision, time, and distance) as required to
    meet end-user expectations
  • Utility Assessment: metrics related to
    qualitative factors that gauge the quality or
    condition of being useful to the end-user
  • SCORE (System, Component and Operationally
    Relevant Evaluations)
    • is a unified set of criteria and software tools
      for defining a performance evaluation approach
      for complex intelligent systems
    • provides a comprehensive evaluation blueprint
      that assesses the technical performance of a
      system, its components, and its capabilities by
      isolating and changing variables, as well as
      capturing end-user utility of the system in
      realistic use-case environments (a structural
      sketch follows this list)
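
A minimal structural sketch of this vocabulary in Python; the
class and field names are my own shorthand, not part of SCORE's
published tooling:

from dataclasses import dataclass, field

@dataclass
class Capability:
    name: str
    technical_metrics: dict = field(default_factory=dict)  # quantitative
    utility_scores: dict = field(default_factory=dict)     # qualitative

@dataclass
class Component:
    name: str
    capabilities: list = field(default_factory=list)

@dataclass
class System:
    name: str
    components: list = field(default_factory=list)

# Hypothetical usage: a translation system with one component.
asr = Component("speech recognizer",
                [Capability("transcription", {"WER": 0.21})])
translator = System("speech-to-speech translator", [asr])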

13
How SCORE Handles Complexity
  • The complexity of the system under test grows
    as more components are introduced into the
    evaluation
  • Components evaluated in the elemental tests are
    less complex than sub-systems (which contain
    multiple components), which in turn are less
    complex than the whole system
  • SCORE tests at these various levels of complexity
  • Data in the following slides indicate that the
    results of the elemental tests can accurately
    predict the performance of the sub-system test
    (which is more complex), and so on.

14
TRANSTAC
  • GOAL: demonstrate capabilities to rapidly
    develop and field free-form, two-way
    speech-to-speech translation systems enabling
    English and foreign language speakers to
    communicate with one another in real-world
    tactical situations.
  • NIST was funded over the past three years to
    serve as the Independent Evaluation Team for this
    effort.
  • METRICS (as specified by DARPA):
    • System usability testing: provides overall
      scores for the capabilities of the whole system
    • Software component testing: evaluates components
      of a system to see how well they perform in
      isolation

15
TRANSTAC: A QUICK TUTORIAL ON SPEECH TRANSLATION
16
TRANSTAC METRICS
  • Automated Metrics: for speech recognition, we
    calculated Word Error Rate (WER); for machine
    translation, we calculated BLEU and METEOR (a
    WER sketch follows this list).
  • TTS Evaluation: human judges listened to the
    audio outputs of the TTS evaluation and compared
    them to the text string that was fed into the
    TTS engine. They then gave a Likert score to
    indicate how understandable the audio file was.
    WER was also used to judge the TTS output.
  • Low-Level Concept Transfer: a directly
    quantitative measure of the transfer of the
    low-level elements of meaning. In this context, a
    low-level concept is a specific content word (or
    words) in an utterance. For example, the phrase
    "The house is down the street from the mosque."
    is one high-level concept, but is made up of
    three low-level concepts (house, down the street,
    mosque).
  • Likert Judgment: a panel of bilingual judges
    rated the semantic adequacy of the translations,
    one utterance at a time, choosing from a
    seven-point scale.
  • High-Level Concept Transfer: the number of
    utterances that are judged to have been
    successfully transferred. The high-level concept
    metric is an efficiency metric that shows the
    number of successful utterances per unit of time,
    as well as accuracy.
  • Surveys/Semi-Structured Interviews: after each
    live scenario, the Soldiers/Marines and the
    foreign language speakers filled out a detailed
    survey asking about their experiences with
    the TRANSTAC systems. In addition,
    semi-structured interviews were conducted with
    all participants, exploring questions such as
    "What did you like?", "What didn't you like?",
    and "What would you change?"
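
WER, named under Automated Metrics above, is the standard
word-level edit-distance measure. A minimal Python
implementation; the example sentences are illustrative only:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference
    # length, computed via word-level Levenshtein distance.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the house is down the street",
                      "the house down a street"))
# 0.333..., i.e. 1 deletion + 1 substitution over 6 reference words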

17
TRANSTAC
SCORE Level           Metric                       Team 1  Team 2  Team 3
Elemental             BLEU                           1       2       2
Elemental             METEOR                         1       2       2
Elemental             TTS                            1       1       2
Sub-System            Low-Level Concept Transfer     1       2       2
System                Likert Judgment                1       2       2
System                High-Level Concept Transfer    1       2       3
System (Qualitative)  User Surveys                   1       2       3

(Complexity increases from the elemental rows down to the
system-level rows.)
  • From these data, it appears that
    • the quantitative performance of the elements of
      the systems has a direct correlation to the
      quantitative performance of the sub-systems,
    • the quantitative performance of the sub-systems
      has a direct correlation to the quantitative
      performance of the overall system, and
    • the quantitative performance of the overall
      system has a direct correlation to the
      qualitative perception of the soldiers using the
      systems (a rank-agreement sketch follows below).
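
One simple way to check this claim against the table is pairwise
rank agreement between levels. The rankings below are read off
the table; the concordance measure itself is my illustration, not
the analysis the presenters used:

# Team rankings from the table above (1 = best).
ranks = {
    "elemental BLEU":              [1, 2, 2],
    "sub-system concept transfer": [1, 2, 2],
    "system Likert":               [1, 2, 2],
    "system high-level transfer":  [1, 2, 3],
}

def concordance(a, b):
    # Fraction of team pairs ordered the same way by both metrics
    # (a tie in one metric matches only a tie in the other).
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    agree = sum((a[i] < a[j]) == (b[i] < b[j]) and
                (a[i] > a[j]) == (b[i] > b[j])
                for i, j in pairs)
    return agree / len(pairs)

print(concordance(ranks["elemental BLEU"], ranks["system Likert"]))  # 1.0
print(concordance(ranks["elemental BLEU"],
                  ranks["system high-level transfer"]))  # ~0.67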

18
In Conclusion
  • Thank you!