Title: Experimental Implementations
1. Experimental Implementations
Ian Horrocks, University of Manchester
3. To Implement or Not To Implement?
- That is the question: make sure you have a good answer!
- Implementation/evaluation can be very time consuming
- Implementation may not in itself be very interesting or contribute much to paper/thesis writing
- Results may be difficult to evaluate and/or inconclusive
- Often done for the wrong reasons
  - Typically as a displacement activity (easier than thinking)
4. Reasons to Implement
- Can improve understanding of the problem and prompt new ideas
  - Develop theories to explain observed behaviour
- Theoretical analysis may be difficult
  - Large/complex systems/problems can be very hard to analyse
- Hypothesis may require empirical analysis
  - Subjective hypotheses
    - Easier to use, promotes better design, etc.
  - Relates to typical rather than worst case
    - Problem worst-case intractable but procedure effective in typical cases
- It can be fun and very satisfying (when things work out well)
- And you may be able to sell it and become fabulously wealthy
5. Example: Understanding → Idea
- ALC concept satisfiability w.r.t. a TBox known to be ExpTime-hard
- Tests with data from an application KB confirm bad performance
- But careful study of behaviour shows that satisfiable cases are generally easier
  - If they exist, models are often easy to find
  - Problems generally under-constrained
- Structure of axioms means that the search space can be dramatically reduced by lucky guessing
- Idea: rewrite all axioms to maximise such lucky guessing
- Results in speedups of several orders of magnitude
6. What is Your Hypothesis?
- Some hypotheses may be better/more useful than others
- "Algorithm A is better than algorithm B"
  - Bad: how can the claim be tested? When/how/why is it better?
- "CPU time for A is less than B on problem P"
  - Slightly better: clear how the claim can be tested, but how useful is it?
- "CPU time for A is less than B on problems of type P"
  - Better: generalises the hypothesis to a class of problems
- "CPU time for A is less than B on problems of type P because ..."
  - Much better: the explanation suggests a more precise analysis
7Example Hypothesis
In spite of the known ExpTime-hardness of the
underlying satisfiability problem, an optimised
Description Logic classifier can exhibit good
behaviour with large application derived
knowledge bases because the structure of the KB
means that the search space can be effectively
pruned by suitable optimisations
- Clearly establishes context
- Identifies test data
- Suggests form of analysis
- Test/compare various optimisation techniques
- Measure CPU time and size of search space
8. Reality Check: Is It Worth The Effort?
- Addresses important and/or interesting problem?
- Will be of interest/benefit to other researchers?
- Will resolve long-standing open problem?
- Will eliminate world hunger?
- Will get me a PhD?
9. How Will the Hypothesis be Evaluated?
- Consider evaluation before starting on implementation!
- Important points include
  - Availability of suitable test data
    - Data from applications / benchmark suites
    - Hand-crafted test data
    - Programmatically generated data
  - Availability of other implementations
    - Suggests that the problem is of interest
    - Useful for performance testing/comparison
    - Also useful for correctness testing
10. First Catch Your Data
- It's never too early to think about test data
  - Suitable test data is essential for effective/meaningful evaluation
  - Suitable test data may be hard to find
- Different kinds of test data
  - Data from applications / benchmark suites
    - E.g., subsumption tests from DL knowledge bases
    - E.g., the TPTP library
  - Hand-crafted data
    - E.g., Heuerding & Schwendimann's K, KT and S4 satisfiability tests
  - Programmatically generated data
    - E.g., randomly generated 3SAT tests (a generator of this kind is sketched after this list)
  - Combination of the above
    - E.g., data generated using patterns from application data
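To make the last kind concrete, here is a minimal sketch (not from the talk) of a programmatic generator for random 3SAT tests; the function names, the DIMACS output format and the clause/variable ratio are illustrative assumptions.

```python
# A minimal sketch of a programmatic test-data generator: random 3SAT
# instances written out in DIMACS CNF format. Recording the seed in the
# output keeps the data reproducible.
import random

def random_3sat(num_vars, num_clauses, seed=0):
    """Generate a random 3SAT instance as a list of 3-literal clauses."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)  # 3 distinct variables
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        clauses.append(clause)
    return clauses

def to_dimacs(clauses, num_vars, seed):
    lines = [f"c random 3SAT, seed={seed}",
             f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, c)) + " 0" for c in clauses]
    return "\n".join(lines)

if __name__ == "__main__":
    n, seed = 100, 42
    # A clause/variable ratio of about 4.26 is the classic hard region for 3SAT.
    clauses = random_3sat(n, int(4.26 * n), seed)
    print(to_dimacs(clauses, n, seed))
```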
11. Data From Applications
- Advantages
  - Specificity: good performance in an application is often our goal
  - May reveal structure that can be exploited in algorithm design
    - Many CS/AI problems are worst-case intractable but quite easy in typical cases
- Disadvantages
  - May be in short supply (chicken-and-egg problem)
  - May be too specific/unrepresentative
  - May be too hard/easy
  - Irregularity may make it difficult to understand the characteristics of the problem in general and/or the behaviour of the implementation
  - May be difficult to perform controlled experiments
12. E.g., the Galen Medical Terminology KB
- Specific goal was to test/improve the performance of tableaux algorithms for classifying DL KBs
- ExpTime-hard in the worst case, but regular structure can be exploited by optimisations
  - Some optimisations proved to be quite specific to this KB
- Highly optimised reasoner needed to solve any problems
  - But most subsumption tests are now easy for optimised reasoners
- Answers not known a priori, so doesn't test correctness
- Experiments performed by
  - Measuring performance -v- number of axioms added to the KB
  - Measuring performance with different (combinations of) optimisations
13. Measuring Performance
14. Hand-Crafted Data
- Advantages
  - Can control/vary structure, difficulty, etc.
  - Can explore pathological cases and exercise code
  - Useful for correctness testing (answers may be known)
- Disadvantages
  - May be unrepresentative/unbalanced
  - Can be very time consuming to create
  - May become outdated (too easy) as a result of new procedures
  - May be difficult to design hard problems
  - Requires deep insight to create interesting problems
15. E.g., Heuerding & Schwendimann
- Suite of formulae designed to test the LWB system
  - Several classes of hard problem were devised
  - Satisfiable and unsatisfiable versions of each class
  - Problems could be scaled up to increase difficulty
  - Measuring the largest problem solvable in a fixed time reduces the impact of constant factors (processor speed, etc.); see the sketch after this list
  - Known satisfiability useful for correctness testing
- Took years of effort to develop
- Many classes of problem rendered trivial by optimised reasoners
  - Hardest problems are encodings of known pathological cases
  - Not representative of problems occurring in applications
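A minimal sketch of the "largest problem solvable in a fixed time" measure mentioned above; `make_problem` and `solve` are hypothetical placeholders for a scalable problem generator and the system under test, and in practice the solver call would be run under an external timeout rather than checked after the fact.

```python
import time

TIME_LIMIT = 100.0  # CPU seconds per problem size (arbitrary choice)

def largest_solvable(make_problem, solve, max_size=100):
    """Return the largest problem size solved within TIME_LIMIT CPU seconds."""
    largest = 0
    for size in range(1, max_size + 1):
        problem = make_problem(size)
        start = time.process_time()
        solve(problem)
        elapsed = time.process_time() - start
        if elapsed > TIME_LIMIT:
            break  # sizes are assumed to get monotonically harder
        largest = size
    return largest
```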
16. Test Trivialised by Optimisations
17. Programmatically Generated Data
- Advantages
  - Plenty of it: can generate as much as you like
  - Can control/vary the structure
  - Can explore the whole problem space (in theory)
  - Can conduct carefully controlled experiments
- Disadvantages
  - May be unrepresentative/unbalanced
  - May be difficult/time consuming to create a suitable generator, particularly for problems that can vary in many dimensions
  - May be difficult to generate reasonably sized hard problems
18. E.g., Giunchiglia & Sebastiani
- Developed a random formula generator to test the KSAT Km satisfiability tester
  - Based on well-known 3SAT random generators
  - Extended to generate modal as well as propositional atoms
  - Additional parameters controlled the number of modalities, the proportion of modal atoms and the maximum modal depth (a generator in this style is sketched after this list)
- Using the generator to test KSAT/FaCT suggested that
  - Fast propositional reasoning is more important than modal reasoning
  - The semantic branching optimisation is crucial to performance
  - An NP-style phase shift is observable for a PSpace problem
19. KSAT: A Closer Look!
- Unsatisfiable tests only generated with very small numbers of propositional variables
- High likelihood of duplicating modal atoms in disjunctive clauses
  - Larger formulae increasingly likely to be trivially unsatisfiable, and the observed phase shift may be a consequence of this effect
  - Duplicated modal atoms exaggerate the significance of the semantic branching optimisation
- Increasing modal depth reduces constrainedness
  - Deep formulae are trivially satisfiable (unless very large)
  - Shallowness and size of the harder tests exaggerate the significance of fast propositional reasoning
- Extending the FaCT system with a fast propositional reasoner and semantic branching took months of effort and was totally ineffective on the Galen KB; in fact, performance got worse
20. Test Data Summary
- Good tests using suitable test data should be
  - Reproducible (by self and others)
    - Preserve/publish data sets, generators, random keys, etc.
  - Representative
    - Test data should be representative of the (relevant) input space
  - Balanced
    - E.g., a mixture of satisfiable and unsatisfiable problems
  - Of varying difficulty
    - Including easy, hard and impossible (for existing systems) tests
21. Implementation
- This is the easy part, but there are still some points worth considering
  - Choose an implementation language for rapid prototyping rather than raw speed (not interested in constant factors)
  - Keep it simple, and don't waste time on unimportant details
    - Fancy data structures that will improve performance by a (relatively) small factor
    - Fancy user interfaces
  - Use profiling to find where the time is going
    - If it ain't broke, don't fix it
  - Make it as flexible/tuneable as possible
  - Include lots of performance monitoring/statistics gathering (a switchable sketch is given below)
    - But be able to switch it off: it may adversely affect performance
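A minimal sketch of switchable statistics gathering along the lines suggested above: all counting goes through one object whose bookkeeping can be turned off for timed runs. The class and counter names are illustrative.

```python
from collections import Counter

class Stats:
    def __init__(self, enabled=True):
        self.enabled = enabled
        self.counters = Counter()

    def count(self, name, n=1):
        if self.enabled:  # a single cheap test when monitoring is switched off
            self.counters[name] += n

    def report(self):
        for name, value in sorted(self.counters.items()):
            print(f"{name}: {value}")

# Hypothetical call sites inside the reasoner/solver:
stats = Stats(enabled=True)
stats.count("branch_points")
stats.count("backjumps")
stats.report()
```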
22. Experimental Method
- Excellent set of resources collected by Toby Walsh at http://www-users.cs.york.ac.uk/tw/empirical.html
  - The Cohen, Gent and Walsh tutorial is particularly good
- Key points include
  - Careful choice of test data
  - Collect as many measurements as possible
    - Always include CPU time (see the timing sketch after this list)
  - Try to vary all relevant factors, one at a time
    - But this may not be easy if there are many such factors
    - Watch out for interactions between different factors (tricky)
  - Collect as much data as possible
  - Make sure it is repeatable
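A minimal sketch of this measurement discipline, assuming a hypothetical `run_solver` callable: record CPU time (not wall-clock time) and take the median over several repeats of each run.

```python
import time
import statistics

def timed_run(run_solver, problem):
    start = time.process_time()  # CPU time, not wall-clock time
    result = run_solver(problem)
    return result, time.process_time() - start

def measure(run_solver, problems, repeats=3):
    """Median CPU time per problem over several repeated runs."""
    times = []
    for problem in problems:
        runs = [timed_run(run_solver, problem)[1] for _ in range(repeats)]
        times.append(statistics.median(runs))
    return times
```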
23. Weighing Up the Competition
- Comparisons with other systems can be useful/convincing
  - Availability of such systems indicates the significance of the problem
  - Improving on a state-of-the-art system is pretty convincing
- Problems
  - Established implementations may be very well engineered and so hard to beat
  - Other systems may be difficult to obtain/install/use
  - Comparison may not be fair, as the other system may
    - Be aimed at a different class of problems
    - Require expert tuning to get the best out of it
- Ideal solution is a SOTA system that is easy to install and use, and performs very badly on all classes of problem
24. FaCT -v- Kris
Galen KB classification times: FaCT (left) -v- Kris (right)
25. Presenting Results
- Presentation is important
  - Graphs can be good, but can also be very bad (inadequate/illegible labelling, too much information, etc.); a minimal labelled-plot sketch is given below
  - 3D plots can be particularly dangerous
  - Tables may sometimes be better/clearer
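A minimal sketch of the kind of graph that avoids the problems listed above: one curve per system, labelled axes with units, and a legend; the data and file name are placeholders.

```python
import matplotlib.pyplot as plt

sizes = [10, 20, 40, 80, 160]
cpu_a = [0.1, 0.3, 1.2, 5.0, 21.0]   # hypothetical CPU times for system A
cpu_b = [0.2, 0.7, 2.9, 12.0, 55.0]  # hypothetical CPU times for system B

plt.plot(sizes, cpu_a, marker="o", label="System A")
plt.plot(sizes, cpu_b, marker="s", label="System B")
plt.xlabel("Problem size (number of clauses)")
plt.ylabel("CPU time (s)")
plt.yscale("log")  # exponential growth reads better on a log scale
plt.legend()
plt.savefig("cpu-times.png", dpi=150)
```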
26. Lessons Learned
- Stay close to theory, but not too close
  - Success of DL implementations is heavily dependent on theoretical foundations
  - But theoretical algorithms are designed for ease of proof, not ease of implementation or efficiency
- Beware of encodings, even if they don't increase worst-case complexity
  - Direct implementations usually much more efficient
- Extend theory before extending implementation
  - Intuitions about "trivial" extensions can often be wrong
27. Lessons Learned
- Don't ignore strange results
  - Tempting to treat occasional unexpected results as glitches, but trying to explain them can lead to important insights, e.g., the case of the MOMS heuristic
    - Tried adding a popular literal selection heuristic to FaCT
    - Surprisingly, performance became considerably worse
    - Ignored it for a while as just one of many failed optimisations
    - Eventually decided to investigate the cause
      - The heuristic disturbed the natural FIFO ordering, which adversely affected the crucial dependency-directed backtracking optimisation
    - This gave me the idea to try an oldest-first heuristic, which turned out to be very effective and widely applicable
28. Lessons Learned
- Be suspicious of (too) good results
  - Tempting to be grateful for good results and not ask too many questions, e.g., the case of FaCT -v- KSAT
    - Compared FaCT with KSAT using the improved random generator from Hustadt and Schmidt
    - On harder problems, FaCT outperformed KSAT by several orders of magnitude
    - I was very happy, and didn't question the results
    - G&S later pointed out that a crucial optimisation in KSAT required literals in clauses to be sorted (according to some total ordering)
    - KSAT could do the sorting, but its default parameters had sorting turned off (the G&S generator produced sorted clauses)
    - Turning sorting on made KSAT much more competitive
29. Lessons Learned
- Be organised
  - Take the time to write scripts to automate testing
    - It is a good way to document exactly what you did, and it makes tests easily repeatable
  - Carefully organise and document your results
    - At first I used only file names, but I had to repeat many time-consuming tests when the system broke down and the results got confused
    - Subsequently had the system automatically write test details at the start of every results file (a small sketch is given below)
  - Be considerate to those whose machines you want to hijack
    - Use cron to run test jobs at night and/or with low priority
  - Allow realistic amounts of time for implementation and testing
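A minimal sketch of the "write test details at the start of every results file" habit; the recorded fields (date, host, Python version, git revision, parameters) are illustrative choices.

```python
import datetime
import platform
import subprocess
import sys

def open_results_file(path, params):
    """Open a results file and write a metadata header documenting the run."""
    f = open(path, "w")
    f.write(f"# date: {datetime.datetime.now().isoformat()}\n")
    f.write(f"# host: {platform.node()}\n")
    f.write(f"# python: {sys.version.split()[0]}\n")
    try:  # record the code version if the tests live in a git repository
        rev = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
        f.write(f"# git revision: {rev}\n")
    except Exception:
        pass
    f.write(f"# parameters: {params}\n")
    return f

with open_results_file("results.txt", {"seed": 42, "semantic_branching": True}) as f:
    f.write("problem1 0.13\n")  # one line per test result
```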
30. And Finally
Don't Panic!
It will all turn out well in the end
31. You may succeed in standing theory on its head!
And if not ...
32. You can always go mountain-biking instead