1
Experimental Implementations
Ian Horrocks, University of Manchester
3
To Implement or Not To Implement?
  • That is the question: make sure you have a good
    answer!
  • Implementation/evaluation can be very time
    consuming
  • Implementation may not in itself be very
    interesting or contribute much to paper/thesis
    writing
  • Results may be difficult to evaluate and/or
    inconclusive
  • Often done for the wrong reasons
  • Typically, as a displacement activity (easier
    than thinking)

4
Reasons to Implement
  • Can improve understanding of problem and prompt
    new ideas
  • Develop theories to explain observed behaviour
  • Theoretical analysis may be difficult
  • Large/complex systems/problems can be very hard
    to analyse
  • Hypothesis may require empirical analysis
  • Subjective hypotheses
  • Easier to use, promotes better design, etc
  • Relates to typical rather than worst case
  • Problem worst case intractable but procedure
    effective in typical cases
  • It can be fun and very satisfying (when things
    work out well)
  • And you may be able to sell it and become
    fabulously wealthy

5
Example: Understanding → Idea
  • ALC concept satisfiability w.r.t. a TBox is known
    to be ExpTime-hard
  • Tests with data from application KB confirm bad
    performance
  • But careful study of behaviour shows that
    satisfiable cases generally easier
  • If they exist, then models often easy to find
  • Problems generally under-constrained
  • Structure of axioms means that search space can
    be dramatically reduced by lucky guessing
  • Idea: rewrite all axioms to maximise such lucky
    guessing
  • Results in speedups of several orders of magnitude

6
What is Your Hypothesis?
  • Some hypotheses may be better/more useful than
    others
  • Algorithm A is better than algorithm B
  • Bad: how can the claim be tested? When/how/why is
    it better?
  • CPU time for A is less than B on problem P
  • Slightly better: clear how the claim can be
    tested, but how useful is it?
  • CPU time for A is less than B on problems of type
    P
  • Better: generalises the hypothesis to a class of
    problems
  • CPU time for A is less than B on problems of type
    P because
  • Much better: the explanation suggests a more
    precise analysis

7
Example Hypothesis
In spite of the known ExpTime-hardness of the
underlying satisfiability problem, an optimised
Description Logic classifier can exhibit good
behaviour with large application-derived
knowledge bases, because the structure of the KB
means that the search space can be effectively
pruned by suitable optimisations.
  • Clearly establishes context
  • Identifies test data
  • Suggests form of analysis
  • Test/compare various optimisation techniques
  • Measure CPU time and size of search space

8
Reality Check: Is It Worth the Effort?
  • Addresses important and/or interesting problem?
  • Will be of interest/benefit to other researchers?
  • Will resolve long-standing open problem?
  • Will eliminate world hunger?
  • Will get me a PhD?

9
How Will Hypothesis be Evaluated?
  • Consider evaluation before starting on
    implementation!
  • Important points include
  • Availability of suitable test data
  • Data from applications / benchmark suites
  • Hand crafted test data
  • Programmatically generated data
  • Availability of other implementations
  • Suggests that problem is of interest
  • Useful for performance testing/comparison
  • Also useful for correctness testing

10
First Catch Your Data
  • It's never too early to think about test data
  • Suitable test data is essential for
    effective/meaningful evaluation
  • Suitable test data may be hard to find
  • Different kinds of test data
  • Data from applications / benchmark suites
  • E.g., subsumption tests from DL knowledge bases
  • E.g., TPTP library
  • Hand crafted data
  • E.g., Heuerding & Schwendimann's K, KT, and S4
    satisfiability tests
  • Programmatically generated data
  • E.g., randomly generated 3SAT tests (a generator
    sketch follows this list)
  • Combination of above
  • E.g., data generated using pattern from
    application data
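
A minimal sketch of the classic fixed-clause-length random 3SAT
generator; the function and parameter names are illustrative, not
taken from any particular test suite:

    import random

    def random_3sat(num_vars, num_clauses, seed=None):
        """Random 3SAT instance as a list of clauses (DIMACS-style ints).

        Each clause picks 3 distinct variables and negates each with
        probability 0.5; hardness peaks near num_clauses/num_vars ~ 4.3.
        """
        rng = random.Random(seed)  # keep the seed so tests are reproducible
        clauses = []
        for _ in range(num_clauses):
            chosen = rng.sample(range(1, num_vars + 1), 3)
            clauses.append([v if rng.random() < 0.5 else -v for v in chosen])
        return clauses

    # e.g. an instance near the hard region for 100 variables
    instance = random_3sat(num_vars=100, num_clauses=430, seed=42)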

11
Data From Applications
  • Advantages
  • Specificity: good performance in an application
    is often our goal
  • May reveal structure that can be exploited in
    algorithm design
  • Many CS/AI problems are worst case intractable
    but quite easy in typical cases
  • Disadvantages
  • May be in short supply (chicken-and-egg problem)
  • May be too specific/unrepresentative
  • May be too hard/easy
  • Irregularity may make it difficult to understand
    characteristics of problem in general and/or
    behaviour of implementation
  • May be difficult to perform controlled experiments

12
E.g. Galen Medical Terminology KB
  • Specific goal was to test/improve performance
    of tableaux algorithms for classifying DL KBs
  • ExpTime-hard in worst case, but regular structure
    can be exploited by optimisations
  • Some optimisations proved to be quite specific to
    this KB
  • Highly optimised reasoner needed to solve any
    problems
  • But most subsumption tests now easy for optimised
    reasoners
  • Answers not known a priori, so doesn't test
    correctness
  • Experiments performed by
  • Measuring performance -v- number of axioms added
    to KB
  • Measuring performance with different
    (combinations of) optimisations

13
Measuring Performance
14
Hand Crafted Data
  • Advantages
  • Can control/vary structure, difficulty etc
  • Can explore pathological cases and exercise code
  • Useful for correctness testing (answers may be
    known)
  • Disadvantages
  • May be unrepresentative/unbalanced
  • Can be very time consuming to create
  • May become outdated (too easy) as a result of new
    procedures
  • May be difficult to design hard problems
  • Requires deep insight to create interesting
    problems

15
E.g. Heuerding & Schwendimann
  • Suite of formulae designed to test LWB system
  • Several classes of hard problem were devised
  • Satisfiable and unsatisfiable versions of each
    class
  • Problems could be scaled up to increase
    difficulty
  • Measuring the largest problem solvable in a fixed
    time reduces the impact of constant factors
    (processor speed, etc); see the harness sketch
    after this list
  • Known satisfiability useful for correctness
    testing
  • Took years of effort to develop
  • Many classes of problem rendered trivial by
    optimised reasoners
  • Hardest problems are encodings of known
    pathological cases
  • Not representative of problems occurring in
    applications
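
One way such a "largest problem solvable in a fixed time" measure
might be implemented; solve and make_problem are hypothetical
placeholders for the reasoner under test and the scalable problem
generator:

    import multiprocessing

    def solves_within(solve, problem, timeout):
        """True iff solve(problem) finishes within timeout seconds."""
        proc = multiprocessing.Process(target=solve, args=(problem,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()
            proc.join()
            return False
        return True

    def largest_solvable(solve, make_problem, timeout=100.0, max_size=50):
        """Largest size n solved within the time limit.

        Assumes difficulty grows with n, so we stop at the first
        failure, as in scaled benchmark suites of this kind.
        """
        best = 0
        for n in range(1, max_size + 1):
            if solves_within(solve, make_problem(n), timeout):
                best = n
            else:
                break
        return best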

16
Test Trivialised by Optimisations
17
Programmatically Generated Data
  • Advantages
  • Plenty of it: can generate as much as you like
  • Can control/vary the structure
  • Can explore the whole problem space (in theory)
  • Can conduct carefully controlled experiments
  • Disadvantages
  • May be unrepresentative/unbalanced
  • May be difficult/time consuming to create
    suitable generator, particularly for problems
    that can vary in many dimensions
  • May be difficult to generate reasonably sized
    hard problems

18
E.g. Giunchiglia & Sebastiani
  • Developed a random formula generator to test the
    KSAT Km satisfiability tester
  • Based on well-known random 3SAT generators
  • Extended to generate modal as well as
    propositional atoms
  • Additional parameters controlled the number of
    modalities, the proportion of modal atoms and the
    maximum modal depth (see the generator sketch
    after this list)
  • Using the generator to test KSAT/FaCT suggested that
  • Fast propositional reasoning more important than
    modal reasoning
  • Semantic branching optimisation crucial to
    performance
  • NP-style phase shift observable for a PSpace problem
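
A rough sketch in the spirit of such a generator; the parameter names
and the formula representation are illustrative assumptions, not the
actual KSAT test generator:

    import random

    def random_modal_cnf(num_props, num_clauses, clause_len=3,
                         num_modalities=1, modal_prob=0.5, max_depth=1,
                         rng=None):
        """Random CNF over propositional and modal (boxed) atoms in Km."""
        rng = rng or random.Random()

        def atom(depth):
            # with probability modal_prob, build a boxed subclause
            if depth > 0 and rng.random() < modal_prob:
                m = rng.randrange(num_modalities)
                return ('box', m, clause(depth - 1))
            return ('prop', rng.randrange(num_props))

        def literal(depth):
            a = atom(depth)
            return ('not', a) if rng.random() < 0.5 else a

        def clause(depth):
            return [literal(depth) for _ in range(clause_len)]

        return [clause(max_depth) for _ in range(num_clauses)]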

19
KSAT: A Closer Look!
  • Unsatisfiable tests only generated with very
    small numbers of propositional variables
  • High likelihood of duplicating modal atoms in
    disjunctive clauses
  • Larger formulae increasingly likely to be
    trivially unsatisfiable, and observed phase shift
    may be a consequence of this effect
  • Duplicated modal atoms exaggerate significance of
    semantic branching optimisation
  • Increasing modal depth reduces constrainedness
  • Deep formulae are trivially satisfiable (unless
    very large)
  • Shallowness and size of harder tests exaggerate
    the significance of fast propositional reasoning
  • Extending the FaCT system with a fast
    propositional reasoner and semantic branching
    took months of effort and was totally ineffective
    on the Galen KB; in fact, performance got worse

20
Test Data Summary
  • Good tests using suitable test data should be
  • Reproducible (by self and others)
  • Preserve/publish data sets, generators, random
    seeds etc
  • Representative
  • Test data should be representative of (relevant)
    input space
  • Balanced
  • E.g., mixture of satisfiable and unsatisfiable
    problems
  • Of varying difficulty
  • Including easy, hard and impossible (for existing
    systems) tests

21
Implementation
  • This is the easy part, but still some points
    worth considering
  • Choose an implementation language for rapid
    prototyping rather than raw speed (not interested
    in constant factors)
  • Keep it simple, and don't waste time on
    unimportant details
  • Fancy data structures that will improve
    performance by a (relatively) small factor
  • Fancy user interfaces
  • Use profiling to find where the time is going
  • If it ain't broke, don't fix it
  • Make it as flexible/tuneable as possible
  • Include lots of performance monitoring/statistics
    gathering
  • But be able to switch it off, as it may adversely
    affect performance (see the sketch after this
    list)
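
One simple way such switchable statistics gathering might look; the
class and counter names are illustrative:

    from collections import Counter

    class Stats:
        """Cheap counters that can be switched off for timing runs."""

        def __init__(self, enabled=True):
            self.enabled = enabled
            self.counters = Counter()

        def bump(self, name, amount=1):
            if self.enabled:          # a single branch when disabled
                self.counters[name] += amount

        def report(self):
            return dict(self.counters)

    stats = Stats(enabled=True)       # pass enabled=False for timing runs
    # inside the reasoner's inner loop, e.g.:
    #     stats.bump('branch_points')
    #     stats.bump('clashes')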

22
Experimental Method
  • Excellent set of resources collected by Toby
    Walsh at http://www-users.cs.york.ac.uk/tw/empirical.html
  • Cohen, Gent and Walsh tutorial is particularly
    good
  • Key points include
  • Careful choice of test data
  • Collect as many measurements as possible
  • Always include CPU time
  • Try to vary all relevant factors, one at a time
    (see the runner sketch after this list)
  • But this may not be easy if there are many such
    factors
  • Watch out for interactions between different
    factors (tricky)
  • Collect as much data as possible
  • Make sure it is repeatable
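
A minimal sketch of a runner that varies one factor at a time around a
fixed baseline and records CPU time; solve, make_problem and the
parameter names are hypothetical:

    import csv
    import time

    def run_experiments(solve, make_problem, baseline, factors,
                        repeats=3, out_path='results.csv'):
        """Vary each factor in turn, holding the others at baseline values."""
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['factor', 'value', 'run', 'cpu_seconds'])
            for name, values in factors.items():
                for value in values:
                    params = dict(baseline, **{name: value})
                    for run in range(repeats):
                        problem = make_problem(**params)
                        start = time.process_time()   # CPU time, not wall clock
                        solve(problem)
                        cpu = time.process_time() - start
                        writer.writerow([name, value, run, cpu])

    # e.g. run_experiments(solve, make_problem,
    #                      baseline={'size': 50, 'depth': 2},
    #                      factors={'size': [25, 50, 100], 'depth': [1, 2, 3]})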

23
Weighing Up the Competition
  • Comparisons with other systems can be
    useful/convincing
  • Availability of such systems indicates
    significance of problem
  • Improving on a state-of-the-art system is pretty
    convincing
  • Problems
  • Established implementations may be very well
    engineered and so hard to beat
  • Other systems may be difficult to
    obtain/install/use
  • Comparison may not be fair, as the other system may
  • Be aimed at a different class of problems
  • Require expert tuning to get the best out of it
  • Ideal solution is a SOTA system that is easy to
    install and use, and performs very badly on all
    classes of problem

24
FaCT -v- Kris
Galen KB classification times FaCT (left) -v-
Kris (right)
25
Presenting Results
  • Presentation is important
  • Graphs can be good, but can also be very bad
    (inadequate/illegible labelling, too much
    information, etc); a minimally labelled plot
    sketch follows this list
  • 3D plots can be particularly dangerous
  • Tables may sometimes be better/clearer
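
A minimal matplotlib sketch of the kind of labelling a performance
graph should always carry (axis names, units, legend, log scale for
wide ranges); the data and system names are placeholders:

    import matplotlib.pyplot as plt

    def plot_runtimes(sizes, times_a, times_b, out_path='runtimes.png'):
        """Plot CPU time against problem size with explicit labels."""
        fig, ax = plt.subplots()
        ax.plot(sizes, times_a, marker='o', label='System A')
        ax.plot(sizes, times_b, marker='s', label='System B')
        ax.set_xlabel('Problem size (number of axioms)')
        ax.set_ylabel('CPU time (seconds)')
        ax.set_yscale('log')   # runtimes often span orders of magnitude
        ax.legend()
        fig.savefig(out_path, dpi=150)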

26
Lessons Learned
  • Stay close to theory but not too close
  • Success of DL implementations heavily dependent
    on theoretical foundations
  • But theoretical algorithms designed for ease of
    proof not ease of implementation or efficiency
  • Beware of encodings, even if they don't increase
    worst-case complexity
  • Direct implementations usually much more
    efficient
  • Extend theory before extending implementation
  • Intuitions about trivial extensions can often
    be wrong

27
Lessons Learned
  • Don't ignore strange results
  • Tempting to treat occasional unexpected results
    as glitches, but trying to explain them can
    lead to important insights, e.g., the case of the
    MOMS heuristic
  • Tried adding a popular literal selection
    heuristic to FaCT
  • Surprisingly, performance became considerably
    worse
  • Ignored it for a while as just one of many failed
    optimisations
  • Eventually decided to investigate cause
  • The heuristic disturbed the natural FIFO
    ordering, which adversely affected the crucial
    dependency-directed backtracking optimisation
  • This gave me the idea to try an oldest-first
    heuristic, which turned out to be very effective
    and widely applicable (see the sketch after this
    list)
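
A toy illustration of the two selection strategies discussed above,
not FaCT's actual code: MOMS picks the literal occurring most often in
the shortest clauses, while oldest-first simply expands the
disjunction that entered the label earliest:

    from collections import Counter

    def moms_literal(clauses):
        """MOMS: most frequent literal among the minimum-size clauses."""
        min_len = min(len(c) for c in clauses)
        counts = Counter(lit for c in clauses if len(c) == min_len
                         for lit in c)
        return counts.most_common(1)[0][0]

    def oldest_first(open_disjunctions):
        """Oldest-first: expand the disjunction added earliest.

        Keeping FIFO order means related expressions are expanded close
        together, which preserves the effectiveness of dependency-
        directed backtracking.
        """
        return open_disjunctions[0]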

28
Lessons Learned
  • Be suspicious of (too) good results
  • Tempting to be grateful for good results and not
    ask too many questions, e.g., the case of FaCT
    -v- KSAT
  • Compared FaCT with KSAT using improved random
    generator from Hustadt and Schmidt
  • On harder problems, FaCT outperformed KSAT by
    several orders of magnitude
  • I was very happy, and didn't question the results
  • G & S later pointed out that a crucial
    optimisation in KSAT required literals in clauses
    to be sorted (according to some total ordering)
  • KSAT could do the sorting, but its default
    parameters had sorting turned off (the G & S
    generator produced sorted clauses)
  • Turning sorting on made KSAT much more
    competitive

29
Lessons Learned
  • Be organised
  • Take the time to write scripts to automate
    testing
  • It is a good way to document exactly what you
    did, and makes tests easily repeatable
  • Carefully organise and document your results
  • At first I used only file names, but I had to
    repeat many time-consuming tests when the system
    broke down and results got confused
  • Subsequently had the system automatically write
    the test details at the start of every results
    file (see the sketch after this list)
  • Be considerate to those whose machines you want
    to hijack
  • Use cron to run test jobs at night and/or with
    low priority
  • Allow realistic amounts of time for
    implementation and testing
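
A small sketch of the sort of script that writes the test details (and
the random seed) at the start of every results file; the field names
and the assumption that git is available are illustrative:

    import json
    import platform
    import subprocess
    import sys
    import time

    def open_results_file(path, test_params):
        """Open a results file, writing run metadata as a JSON header line."""
        header = {
            'date': time.strftime('%Y-%m-%d %H:%M:%S'),
            'host': platform.node(),
            'python': sys.version.split()[0],
            # assumes the tests live in a git checkout
            'git_commit': subprocess.run(
                ['git', 'rev-parse', 'HEAD'],
                capture_output=True, text=True).stdout.strip(),
            'params': test_params,
        }
        f = open(path, 'w')
        f.write('# ' + json.dumps(header) + '\n')
        return f

    # e.g.
    # results = open_results_file('galen_run.txt',
    #                             {'optimisations': ['absorption'],
    #                              'timeout': 100, 'seed': 42})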

30
And Finally
Don't Panic!
It will all turn out well in the end
31
You may succeed in standing theory on its head!
And if not
32
You can always go mountain-biking instead