Title: Experimental Implementations
1. Experimental Implementations
Ian Horrocks, University of Manchester
3. To Implement or Not To Implement?
- That is the question: make sure you have a good answer!
- Implementation/evaluation can be very time consuming
- Implementation may not in itself be very interesting or contribute much to paper/thesis writing
- Results may be difficult to evaluate and/or inconclusive
- Often done for the wrong reasons
  - Typically as a displacement activity (easier than thinking)
4. Reasons to Implement
- Can improve understanding of the problem and prompt new ideas
  - Develop theories to explain observed behaviour
- Theoretical analysis may be difficult
  - Large/complex systems/problems can be very hard to analyse
- Hypothesis may require empirical analysis
  - Subjective hypotheses
    - Easier to use, promotes better design, etc.
  - Relates to typical rather than worst case
    - Problem worst-case intractable but procedure effective in typical cases
- It can be fun and very satisfying (when things work out well)
- And you may be able to sell it and become fabulously wealthy
5. Example: Understanding → Idea
- ALC concept satisfiability w.r.t. a TBox known to be ExpTime-hard
- Tests with data from an application KB confirm bad performance
- But careful study of behaviour shows that satisfiable cases are generally easier
  - If they exist, models are often easy to find
  - Problems generally under-constrained
- Structure of axioms means that the search space can be dramatically reduced by lucky guessing
- Idea: rewrite all axioms to maximise such lucky guessing
- Results in speedups of several orders of magnitude
6. What is Your Hypothesis?
- Some hypotheses may be better/more useful than others
- "Algorithm A is better than algorithm B"
  - Bad: how can the claim be tested? When/how/why is it better?
- "CPU time for A is less than B on problem P"
  - Slightly better: clear how the claim can be tested, but how useful is it?
- "CPU time for A is less than B on problems of type P"
  - Better: generalises the hypothesis to a class of problems
- "CPU time for A is less than B on problems of type P because ..."
  - Much better: the explanation suggests a more precise analysis
7Example Hypothesis
In spite of the known ExpTime-hardness of the
underlying satisfiability problem, an optimised
Description Logic classifier can exhibit good
behaviour with large application derived
knowledge bases because the structure of the KB
means that the search space can be effectively
pruned by suitable optimisations
- Clearly establishes context
- Identifies test data
- Suggests form of analysis
- Test/compare various optimisation techniques
- Measure CPU time and size of search space
8. Reality Check: Is It Worth The Effort?
- Addresses important and/or interesting problem?
- Will be of interest/benefit to other researchers?
- Will resolve long-standing open problem?
- Will eliminate world hunger?
- Will get me a PhD?
9. How Will the Hypothesis be Evaluated?
- Consider evaluation before starting on implementation!
- Important points include
  - Availability of suitable test data
    - Data from applications / benchmark suites
    - Hand-crafted test data
    - Programmatically generated data
  - Availability of other implementations
    - Suggests that the problem is of interest
    - Useful for performance testing/comparison
    - Also useful for correctness testing
10. First Catch Your Data
- It's never too early to think about test data
  - Suitable test data is essential for effective/meaningful evaluation
  - Suitable test data may be hard to find
- Different kinds of test data
  - Data from applications / benchmark suites
    - E.g., subsumption tests from DL knowledge bases
    - E.g., the TPTP library
  - Hand-crafted data
    - E.g., Heuerding & Schwendimann's K, KT and S4 satisfiability tests
  - Programmatically generated data
    - E.g., randomly generated 3SAT tests (a generator of this kind is sketched after this list)
  - Combination of the above
    - E.g., data generated using patterns from application data
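To make the last kind concrete, here is a minimal sketch (not from the talk) of a programmatic generator for random 3SAT tests; the function names, the DIMACS output format and the clause/variable ratio are illustrative assumptions.

```python
# A minimal sketch of a programmatic test-data generator: random 3SAT
# instances written out in DIMACS CNF format. Recording the seed in the
# output keeps the data reproducible.
import random

def random_3sat(num_vars, num_clauses, seed=0):
    """Generate a random 3SAT instance as a list of 3-literal clauses."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)  # 3 distinct variables
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        clauses.append(clause)
    return clauses

def to_dimacs(clauses, num_vars, seed):
    lines = [f"c random 3SAT, seed={seed}",
             f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, c)) + " 0" for c in clauses]
    return "\n".join(lines)

if __name__ == "__main__":
    n, seed = 100, 42
    # A clause/variable ratio of about 4.26 is the classic hard region for 3SAT.
    clauses = random_3sat(n, int(4.26 * n), seed)
    print(to_dimacs(clauses, n, seed))
```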
11. Data From Applications
- Advantages
  - Specificity: good performance in an application is often our goal
  - May reveal structure that can be exploited in algorithm design
    - Many CS/AI problems are worst-case intractable but quite easy in typical cases
- Disadvantages
  - May be in short supply (chicken-and-egg problem)
  - May be too specific/unrepresentative
  - May be too hard/easy
  - Irregularity may make it difficult to understand the characteristics of the problem in general and/or the behaviour of the implementation
  - May be difficult to perform controlled experiments
12. E.g., the Galen Medical Terminology KB
- Specific goal was to test/improve the performance of tableaux algorithms for classifying DL KBs
- ExpTime-hard in the worst case, but regular structure can be exploited by optimisations
  - Some optimisations proved to be quite specific to this KB
- Highly optimised reasoner needed to solve any problems
  - But most subsumption tests are now easy for optimised reasoners
- Answers not known a priori, so doesn't test correctness
- Experiments performed by
  - Measuring performance -v- number of axioms added to the KB
  - Measuring performance with different (combinations of) optimisations
13. Measuring Performance
14. Hand-Crafted Data
- Advantages
  - Can control/vary structure, difficulty, etc.
  - Can explore pathological cases and exercise code
  - Useful for correctness testing (answers may be known)
- Disadvantages
  - May be unrepresentative/unbalanced
  - Can be very time consuming to create
  - May become outdated (too easy) as a result of new procedures
  - May be difficult to design hard problems
  - Requires deep insight to create interesting problems
15. E.g., Heuerding & Schwendimann
- Suite of formulae designed to test the LWB system
  - Several classes of hard problem were devised
  - Satisfiable and unsatisfiable versions of each class
  - Problems could be scaled up to increase difficulty
  - Measuring the largest problem solvable in a fixed time reduces the impact of constant factors (processor speed, etc.); see the sketch after this list
  - Known satisfiability useful for correctness testing
- Took years of effort to develop
- Many classes of problem rendered trivial by optimised reasoners
  - Hardest problems are encodings of known pathological cases
  - Not representative of problems occurring in applications
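A minimal sketch of the "largest problem solvable in a fixed time" measure mentioned above; `make_problem` and `solve` are hypothetical placeholders for a scalable problem generator and the system under test, and in practice the solver call would be run under an external timeout rather than checked after the fact.

```python
import time

TIME_LIMIT = 100.0  # CPU seconds per problem size (arbitrary choice)

def largest_solvable(make_problem, solve, max_size=100):
    """Return the largest problem size solved within TIME_LIMIT CPU seconds."""
    largest = 0
    for size in range(1, max_size + 1):
        problem = make_problem(size)
        start = time.process_time()
        solve(problem)
        elapsed = time.process_time() - start
        if elapsed > TIME_LIMIT:
            break  # sizes are assumed to get monotonically harder
        largest = size
    return largest
```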
16. Test Trivialised by Optimisations
17. Programmatically Generated Data
- Advantages
  - Plenty of it: can generate as much as you like
  - Can control/vary the structure
  - Can explore the whole problem space (in theory)
  - Can conduct carefully controlled experiments
- Disadvantages
  - May be unrepresentative/unbalanced
  - May be difficult/time consuming to create a suitable generator, particularly for problems that can vary in many dimensions
  - May be difficult to generate reasonably sized hard problems
18. E.g., Giunchiglia & Sebastiani
- Developed a random formula generator to test the KSAT Km satisfiability tester
  - Based on well-known 3SAT random generators
  - Extended to generate modal as well as propositional atoms
  - Additional parameters controlled the number of modalities, the proportion of modal atoms and the maximum modal depth (a generator in this style is sketched after this list)
- Using the generator to test KSAT/FaCT suggested that
  - Fast propositional reasoning is more important than modal reasoning
  - The semantic branching optimisation is crucial to performance
  - An NP-style phase shift is observable for a PSpace problem
19. KSAT: A Closer Look!
- Unsatisfiable tests only generated with very small numbers of propositional variables
- High likelihood of duplicating modal atoms in disjunctive clauses
  - Larger formulae increasingly likely to be trivially unsatisfiable, and the observed phase shift may be a consequence of this effect
  - Duplicated modal atoms exaggerate the significance of the semantic branching optimisation
- Increasing modal depth reduces constrainedness
  - Deep formulae are trivially satisfiable (unless very large)
  - Shallowness and size of the harder tests exaggerate the significance of fast propositional reasoning
- Extending the FaCT system with a fast propositional reasoner and semantic branching took months of effort and was totally ineffective on the Galen KB; in fact, performance got worse
20. Test Data Summary
- Good tests using suitable test data should be
  - Reproducible (by self and others)
    - Preserve/publish data sets, generators, random keys, etc.
  - Representative
    - Test data should be representative of the (relevant) input space
  - Balanced
    - E.g., a mixture of satisfiable and unsatisfiable problems
  - Of varying difficulty
    - Including easy, hard and impossible (for existing systems) tests
21. Implementation
- This is the easy part, but there are still some points worth considering
  - Choose an implementation language for rapid prototyping rather than raw speed (not interested in constant factors)
  - Keep it simple, and don't waste time on unimportant details
    - Fancy data structures that will improve performance by a (relatively) small factor
    - Fancy user interfaces
  - Use profiling to find where the time is going
    - If it ain't broke, don't fix it
  - Make it as flexible/tuneable as possible
  - Include lots of performance monitoring/statistics gathering (a switchable sketch is given below)
    - But be able to switch it off: it may adversely affect performance
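A minimal sketch of switchable statistics gathering along the lines suggested above: all counting goes through one object whose bookkeeping can be turned off for timed runs. The class and counter names are illustrative.

```python
from collections import Counter

class Stats:
    def __init__(self, enabled=True):
        self.enabled = enabled
        self.counters = Counter()

    def count(self, name, n=1):
        if self.enabled:  # a single cheap test when monitoring is switched off
            self.counters[name] += n

    def report(self):
        for name, value in sorted(self.counters.items()):
            print(f"{name}: {value}")

# Hypothetical call sites inside the reasoner/solver:
stats = Stats(enabled=True)
stats.count("branch_points")
stats.count("backjumps")
stats.report()
```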
22. Experimental Method
- Excellent set of resources collected by Toby Walsh at http://www-users.cs.york.ac.uk/tw/empirical.html
  - The Cohen, Gent and Walsh tutorial is particularly good
- Key points include
  - Careful choice of test data
  - Collect as many measurements as possible
    - Always include CPU time (see the timing sketch after this list)
  - Try to vary all relevant factors, one at a time
    - But this may not be easy if there are many such factors
    - Watch out for interactions between different factors (tricky)
  - Collect as much data as possible
  - Make sure it is repeatable
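A minimal sketch of this measurement discipline, assuming a hypothetical `run_solver` callable: record CPU time (not wall-clock time) and take the median over several repeats of each run.

```python
import time
import statistics

def timed_run(run_solver, problem):
    start = time.process_time()  # CPU time, not wall-clock time
    result = run_solver(problem)
    return result, time.process_time() - start

def measure(run_solver, problems, repeats=3):
    """Median CPU time per problem over several repeated runs."""
    times = []
    for problem in problems:
        runs = [timed_run(run_solver, problem)[1] for _ in range(repeats)]
        times.append(statistics.median(runs))
    return times
```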
23. Weighing Up the Competition
- Comparisons with other systems can be useful/convincing
  - Availability of such systems indicates the significance of the problem
  - Improving on a state-of-the-art system is pretty convincing
- Problems
  - Established implementations may be very well engineered and so hard to beat
  - Other systems may be difficult to obtain/install/use
  - Comparison may not be fair, as the other system may
    - Be aimed at a different class of problems
    - Require expert tuning to get the best out of it
- Ideal solution is a SOTA system that is easy to install and use, and performs very badly on all classes of problem
24. FaCT -v- Kris
Galen KB classification times: FaCT (left) -v- Kris (right)
25. Presenting Results
- Presentation is important
  - Graphs can be good, but can also be very bad (inadequate/illegible labelling, too much information, etc.); a minimal labelled-plot sketch is given below
  - 3D plots can be particularly dangerous
  - Tables may sometimes be better/clearer
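A minimal sketch of the kind of graph that avoids the problems listed above: one curve per system, labelled axes with units, and a legend; the data and file name are placeholders.

```python
import matplotlib.pyplot as plt

sizes = [10, 20, 40, 80, 160]
cpu_a = [0.1, 0.3, 1.2, 5.0, 21.0]   # hypothetical CPU times for system A
cpu_b = [0.2, 0.7, 2.9, 12.0, 55.0]  # hypothetical CPU times for system B

plt.plot(sizes, cpu_a, marker="o", label="System A")
plt.plot(sizes, cpu_b, marker="s", label="System B")
plt.xlabel("Problem size (number of clauses)")
plt.ylabel("CPU time (s)")
plt.yscale("log")  # exponential growth reads better on a log scale
plt.legend()
plt.savefig("cpu-times.png", dpi=150)
```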
26. Lessons Learned
- Stay close to theory, but not too close
  - Success of DL implementations is heavily dependent on theoretical foundations
  - But theoretical algorithms are designed for ease of proof, not ease of implementation or efficiency
- Beware of encodings, even if they don't increase worst-case complexity
  - Direct implementations usually much more efficient
- Extend theory before extending implementation
  - Intuitions about "trivial" extensions can often be wrong
27. Lessons Learned
- Don't ignore strange results
  - Tempting to treat occasional unexpected results as glitches, but trying to explain them can lead to important insights, e.g., the case of the MOMS heuristic
    - Tried adding a popular literal selection heuristic to FaCT
    - Surprisingly, performance became considerably worse
    - Ignored it for a while as just one of many failed optimisations
    - Eventually decided to investigate the cause
      - The heuristic disturbed the natural FIFO ordering, which adversely affected the crucial dependency-directed backtracking optimisation
    - This gave me the idea to try an oldest-first heuristic, which turned out to be very effective and widely applicable
28. Lessons Learned
- Be suspicious of (too) good results
  - Tempting to be grateful for good results and not ask too many questions, e.g., the case of FaCT -v- KSAT
    - Compared FaCT with KSAT using the improved random generator from Hustadt and Schmidt
    - On harder problems, FaCT outperformed KSAT by several orders of magnitude
    - I was very happy, and didn't question the results
    - G&S later pointed out that a crucial optimisation in KSAT required literals in clauses to be sorted (according to some total ordering)
    - KSAT could do the sorting, but its default parameters had sorting turned off (the G&S generator produced sorted clauses)
    - Turning sorting on made KSAT much more competitive
29. Lessons Learned
- Be organised
  - Take the time to write scripts to automate testing
    - It is a good way to document exactly what you did, and it makes tests easily repeatable
  - Carefully organise and document your results
    - At first I used only file names, but I had to repeat many time-consuming tests when the system broke down and the results got confused
    - Subsequently had the system automatically write test details at the start of every results file (a small sketch is given below)
  - Be considerate to those whose machines you want to hijack
    - Use cron to run test jobs at night and/or with low priority
  - Allow realistic amounts of time for implementation and testing
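A minimal sketch of the "write test details at the start of every results file" habit; the recorded fields (date, host, Python version, git revision, parameters) are illustrative choices.

```python
import datetime
import platform
import subprocess
import sys

def open_results_file(path, params):
    """Open a results file and write a metadata header documenting the run."""
    f = open(path, "w")
    f.write(f"# date: {datetime.datetime.now().isoformat()}\n")
    f.write(f"# host: {platform.node()}\n")
    f.write(f"# python: {sys.version.split()[0]}\n")
    try:  # record the code version if the tests live in a git repository
        rev = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
        f.write(f"# git revision: {rev}\n")
    except Exception:
        pass
    f.write(f"# parameters: {params}\n")
    return f

with open_results_file("results.txt", {"seed": 42, "semantic_branching": True}) as f:
    f.write("problem1 0.13\n")  # one line per test result
```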
30. And Finally
Don't Panic!
It will all turn out well in the end
31. You may succeed in standing theory on its head!
And if not ...
32. You can always go mountain-biking instead