Title: Java Development for HLT
1Java Development for HLT
- Lars Degerstedt
- Linköping university, IDA
- larde_at_ida.liu.se
- Towards available and useful NLP software
2This Lecture
- 1st hour - Course introduction
- purpose
- motivation
- course overview
- relevance for NLP
- 2nd hour - Experiences of NLPLAB
3Aim of the Course
- Use of the Java platform for NLP
- Experience from software design
- Experience of mainstream techniques
How can basic NLP research lead to products?
4Trad. NLP(LAB) Results
- Larger projects: slower growth/person
- Conflict between results
- paper or code?
- Subexpert vs. holistic view
- the GUI is not important
- Code is (at best) stable but not mature, and even less useful.
5Weak points of NLP Software
- Closed architecture
- Weak on software methodology
- No differentiation of users
- Difficult to use
- Difficult to integrate
- Unclear in functionality
- No reuse
- Weak maintenance
- Imposes new formalisms
- A lot of bugs...
6Weak points of NLP Development
- Waterfall methods
- Little real usage during development
- think-a-year then code-a-week
- Prolog, Lisp, Java
- No research value
- Large projects
- Subspecialists
- Lack of programmers
7So, What are the Solutions?
- Use commercially available techniques
- what can we learn from industry?
- Global cooperation on code-level
- join mainstream technology?
- Adjust our working methods
- how do we better interact with society/real usage?
8Selected Course Topics
- Lecture 2 - Java
- Language and platform
- Lecture 3 - object-oriented design
- Basic concepts/techniques
- Lecture 4 - design patterns
- Extremely useful architectural techniques
9History of OO-related Concepts (My View)
[Figure: OO-related concepts placed by time of creation (60s, 70s, 80s, 90s) and by scale, from programming in the small to programming in the large. This is just a sketch!]
Concepts shown: component systems, iterative development/software evolution, system architecture, operating system design/scripts, components, subsystems/modules, OO frameworks, design patterns, objects, object-oriented design, web-centered development/Open Source, interfaces/APIs, contracts, protocols, code-level design, formal specification, idioms, high-level languages, declarative languages, extensive free libraries.
10What is Java?
- In short: C syntax, byte code, platform-neutral
- High-level platform
- Unix/C is more low level
- Easy access beyond the desktop...
- Sub-platforms: J2SE, J2EE, J2ME, JINI
- Buzz: security, connectivity, heterogeneous, multimedia, Swing, XML, beans, distributed
11Why use Java for NLP?
- HL-quality information available.
- Mature community: free code!
- Rewrite for Java - not C
- Utilities: sound, 3D graphics, XML
- Integration with industry.
- Joining the OO-movement.
12What is software development?
[Figure: two views of software development. One Project View (Development in the Small): a project with the phases Analysis, Specification, Design, Implementation, Evaluation, Testing. Evolutionary Process View (Development in the Large): a series of projects.]
13What is Design? (Part 1: The Ws)
- What: software units, UI, interaction, language
- When: role in dev. model, time constraints
- Who (by/for): product design, linguistics, hackers
- Where: organization, legacy, single/multiple project
- Why: internal/external readers/publication
14What is Design? (Part 2: Definition)
- Theory of something (not everything)
- Design is sold (not proven)
- Defines the system, rather than realizes it
- Partitioning of the system
- Contracts for the interaction between the parts
- Design phase result: a specification, e.g.
- Interfaces/APIs (with comments)
- documents
- Conceptual prototype
- Intertwined concepts: architecture, development, requirements analysis/capture, implementation
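A minimal sketch of a design-phase result in the form of an interface with comments. The names `Tokenizer` and `WhitespaceTokenizer` are hypothetical illustrations, not part of the course material. The interface states the contract (it defines the part); the class is just one possible realization.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical design-level contract: splits text into tokens.
 * Contract: never returns null; returns an empty list for blank input;
 * tokens appear in the order they occur in the text.
 */
interface Tokenizer {
    List<String> tokenize(String text);
}

/** One possible realization; clients depend only on the Tokenizer contract. */
class WhitespaceTokenizer implements Tokenizer {
    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of whitespace and drop the empty piece a blank input produces.
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}

class TokenizerDemo {
    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        System.out.println(tokenizer.tokenize("  birds of  Sweden "));
    }
}
```

Because the caller only sees `Tokenizer`, the realization can be replaced later without touching client code.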
15What is Object-Oriented Design?
Design in the Small
- Object-Oriented Modeling
- Finding the objects
- Domain and artefact models: problem vs domain
- Taxonomy and aggregation
- Real-life mapping/customer satisfaction: UI prototypes and scenarios
- Object Interface Design
- Abstract Data Types (ADTs): data+op, information-hiding, ...
- Object Roles: information, system, passive, active, ...
- Objects as Machines: state+method, orthogonal methods, ...
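A small sketch of an abstract data type with information hiding. The `Lexicon` class and its operations are assumptions made up for illustration. The data (a map) is private; clients only see the operations.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical ADT: a lexicon mapping word forms to part-of-speech tags. */
final class Lexicon {
    // Hidden representation: clients never see the map, so it can change freely.
    private final Map<String, Set<String>> entries = new HashMap<>();

    /** Operation: record that a word can carry a tag. */
    void add(String word, String tag) {
        entries.computeIfAbsent(word, w -> new TreeSet<>()).add(tag);
    }

    /** Operation: read-only view of the tags for a word (empty if unknown). */
    Set<String> tagsOf(String word) {
        Set<String> tags = entries.get(word);
        return tags == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(tags);
    }
}

class LexiconDemo {
    public static void main(String[] args) {
        Lexicon lexicon = new Lexicon();
        lexicon.add("duck", "noun");
        lexicon.add("duck", "verb");
        System.out.println(lexicon.tagsOf("duck"));
    }
}
```

Returning an unmodifiable view keeps callers from changing the hidden state behind the object's back.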
16What are Design Patterns?
Design in the Middle
- Micro architecture - abstract designs.
- not a concept - a catalogue!
- Useful: reuse of successful design.
- Used: abstracts from experience.
- Usable: includes coding details.
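As one entry from the catalogue, the Strategy pattern from the Gamma et al. book could be sketched like this. The token-normalizing classes are hypothetical examples chosen for an NLP flavour; they are not taken from the course code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Strategy role: one interchangeable way of normalizing a token. */
interface Normalizer {
    String normalize(String token);
}

/** Concrete strategy: fold case. */
class LowercaseNormalizer implements Normalizer {
    public String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }
}

/** Concrete strategy: strip a trailing plural "s" (deliberately naive). */
class NaivePluralStripper implements Normalizer {
    public String normalize(String token) {
        return token.length() > 1 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }
}

/** Context role: applies whichever strategy it was configured with. */
class TokenPipeline {
    private final Normalizer normalizer;

    TokenPipeline(Normalizer normalizer) {
        this.normalizer = normalizer;
    }

    List<String> run(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(normalizer.normalize(token));
        }
        return result;
    }
}

class StrategyDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Owls", "Hunt");
        System.out.println(new TokenPipeline(new LowercaseNormalizer()).run(tokens));
        System.out.println(new TokenPipeline(new NaivePluralStripper()).run(tokens));
    }
}
```

The pipeline never changes when a new normalizer is added, which is the reuse the pattern is meant to capture.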
17Why Design Patterns for NLP?
Design patterns are truly useful!!
- Fill a gap between library modules and system architectures.
- Patterns are open-ended, not straitjackets.
- Codify the (OO) design expertise.
- Open question: What do NLP design patterns look like?
18This is a Project Course!
Use your own code - write code you would want to use.
- Not a basic programming course.
- Creative ideas but concrete results.
- Write reusable (generic enough) code; try to reuse when possible.
19Course Examination
- Individual examination
- cooperation is encouraged.
- Two parts: term paper and project
- Term paper (1/3): 75 hours
- Project (2/3): 125 hours
- Metrics for finished project
- Two iterations (with deliverables)
- Well-designed code (document how/why)
20Course Literature
- Recommended readings
- See the course pm at GSLT course page
- Recommended book to buy
- Erich Gamma et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley 1994
- Further readings
- Stefan Sigfried, Understanding OO Software Engineering, IEEE Press 1996
- Clemens Szyperski, Component Software: Beyond Object-Oriented Programming, Addison-Wesley 2001.
- www.javaworld.com
21Related NLP activities
- nlpFarm and openNLP
- NLP OSS development and platform
- GATE 2
- tool-box for NLP processing
- SVENSK
- Swedish NLP platform
- NLSR
- NLP software registry (DFKI)
22nlpFarm
- An OSS Java software project at SourceForge.
- Farmstead mission
- A place where early research prototypes evolve into robust and useful open source.
- practical work towards useful things
- Global/Scandinavian cooperation?
- Will nlpFarm work? It is an OpenNLP experiment sponsored by Vinnova.
23SVENSK
- Language processing tool-box for Swedish.
- Reuse of existing NLP components.
- Based on the GATE architecture.
- Its successor: Kaba, for information access and refinement only.
24GATE 2
- GATE: document manager, GUI, components.
- Installed at > 250 sites.
- GATE 2: rewritten in Java
- A platform for Language engineering.
- Broad range of packages
- gate.sgml, gate.swing, gate.email
25This Lecture (2nd Hour)
- 1st hour - Course introduction
- 2nd hour - Experiences of NLPLAB
- Evolutionary Process Model
- Iterative Method
- BirdQuest
- TvGuide
- nlpFarm
26NLPLAB Projects of Today
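[Figure: current NLPLAB projects with their application (App) and facility software and their team sizes. Project labels: TvGuide, BirdQuest, TvGuide. Software labels: Quaks, JavaChart, PGP, ...; QUAC, DM, FS, TGEN, Guidia, ...; MOLINC, FUNs, JavaChart, ... Team sizes: 2 persons, 5 persons, 4 persons.]
Iterative, incremental with free evolution (mixing bottom-up and top-down design)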
27Evolutionary Process Model for NLP/LE
[Figure: four interdependent areas, Application Artefacts, Artefact Construction Theory, Language Modeling, and Facility Artefacts, connected by links labelled p and n, where p = possibilities and n = needs.]
Multi-dimensional approach to NLP/LE development: avoid one-sided approaches!
28Issues in Evolutionary Design
- Iterative and Incremental Design
- Robust for change: formal revisions
- Refactorings
- Respect for legacy: both theory and code
- ...but don't be a slave to it!
- Free evolution of design
- Mixed bottom-up and top-down design
- Multiple-project approach
- Use feedback (both pos. and neg.) seriously!
- It is a bumpy ride!
- ...sometimes improvements make it worse!
29Application-Driven Dialogue System Development
30Two Problems
- Too much time is devoted to discussions on features of the system that are interesting but often rare and hard to realise.
- It is not easy to subdivide the design and implementation work into manageable pieces when developing a dialogue system.
What does the incremental evolution path of a dialogue application look like?
31Development space for DM
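[Figure: the development space for dialogue management (DM), with three dimensions.]
- DM Framework Customisation: tools, framework templates, code patterns
- DM Capabilities: sub-dialogue control, history, atomic request handling
- DM Design: knowledge representation, modularisation, interfaces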
32BirdQuest: Two GUIs, Phase-Based Design
- Bird encyclopaedia
- Corpus with user questions
- Dialogue systems framework
33Client-Server Design for BirdQuest
[Figure: client-server design variants for BirdQuest. Labels: Application, Browser, UI Layout, UI Feature Code, Web Server, Web Servlet (UI Feature Code), HTTP, RMI, RMI Server, Server Code, JDBC, Bird Database.]
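The labels on this slide (RMI, server code, JDBC, bird database) suggest server code reached through a remote interface. A rough sketch of what such a contract might look like is given below. `BirdService` and `InMemoryBirdService` are hypothetical names, the actual BirdQuest interfaces are not shown in the slides, and an in-memory map stands in for the JDBC-backed database.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical remote contract that an RMI client and a servlet could share. */
interface BirdService extends Remote {
    String describe(String birdName) throws RemoteException;
}

/** Server code; a map stands in for the JDBC-backed bird database (assumption). */
class InMemoryBirdService implements BirdService {
    private final Map<String, String> facts = new HashMap<>();

    InMemoryBirdService() {
        facts.put("magpie", "A black and white member of the crow family.");
    }

    @Override
    public String describe(String birdName) {
        String fact = facts.get(birdName.toLowerCase(Locale.ROOT));
        return fact != null ? fact : "No entry for " + birdName;
    }
}

class BirdServiceDemo {
    public static void main(String[] args) throws RemoteException {
        // Called locally here. A real deployment would export the object,
        // for example with java.rmi.server.UnicastRemoteObject, and look it up remotely.
        BirdService service = new InMemoryBirdService();
        System.out.println(service.describe("Magpie"));
    }
}
```

Keeping the UI feature code behind one interface is what would let the application client and the web servlet reuse the same server code.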
34Phase-based NLP of BirdQuest
35TvGuide: Evolutionary Re-design
[Figure: two development rounds linked by an evaluation and re-design step.]
System Development (round 1): Application Artefact, Facility Software
Evaluation, Dialogue Model, Re-design
System Development (round 2): Application Artefact, Facility Software
36Encapsulation for TvGuide
[Figure: (non-strict) layering for TvGuide. Labels: Subsystems, Application, Components, ?, Framelets/Tools, Libraries, KR?]
37Summing up: Two Basic Design Dimensions
- Problem Division (Horizontal Design): splits the problem! E.g. Parsing, Access, Generate.
- Sign of success: high cohesion and low coupling between modules
- Abstraction/Layering (Vertical Design): creates a language! E.g. Agenda, List, Array.
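The vertical dimension (Agenda on top of List on top of Array) could be sketched as follows. This `Agenda` class is an assumption for illustration, not the nlpFarm implementation. Each layer offers a vocabulary to the one above: the parser talks about agenda items, not list indices or array slots.

```java
import java.util.ArrayList;
import java.util.List;

/** Upper layer: speaks the parser's language (agenda items), not list indices. */
final class Agenda<T> {
    // Lower layer: a List, which ArrayList in turn builds on an array.
    private final List<T> items = new ArrayList<>();

    void push(T item) {
        items.add(item);
    }

    /** Removes and returns the most recently added item (LIFO order). */
    T pop() {
        if (items.isEmpty()) {
            throw new IllegalStateException("agenda is empty");
        }
        return items.remove(items.size() - 1);
    }

    boolean isEmpty() {
        return items.isEmpty();
    }
}

class AgendaDemo {
    public static void main(String[] args) {
        Agenda<String> agenda = new Agenda<>();
        agenda.push("NP -> Det N");
        agenda.push("S -> NP VP");
        while (!agenda.isEmpty()) {
            System.out.println(agenda.pop());
        }
    }
}
```

Swapping the LIFO policy for a priority order would change only this class, which is the low coupling the slide names as a sign of success.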
38http://nlpfarm.sourceforge.net
- Public web resource with open source
- A place to work
- Cooperation over time and place
- Development support
- Mostly facility software
- Formal release system
- Towards robust and useful code
- Link between research and industrial products
39Experiences from nlpFarm - Method
- Separate application from facility
- Different structure and methods
- Interdependent artefacts: needs and possibilities
- Variation of evolutionary approach, e.g.
- Bottom-up vs top-down
- Theory vs code
- Background of personnel / type of result
- Discriminate beginners from experts
- Newbies have the creative eyes of a child
- Experts should focus on hidden continuity work
- Software experts should make the overall design
- Don't work alone: find feedback
40Experiences from nlpFarm - Design
Facility Software (library modules, framelets)
1. Non-strict layering
2. Work bottom-up with real applications
3. Add code only
4. Design patterns in kernel
5. Inheritance/taxonomy in external layer
[Figure: layers of the facility software. Labels: Kernel, External API package, 2nd Layer, Kernel Packages, ...?]
Application Artefact
1. Method important
2. Focus on the possible
3. Look at the whole
4. Avoid duplication
5. Reuse from legacy
[Figure: Old Applications and a New Application on top of the Facility Software.]
41Experience from nlpFarm - Implementation
- Inter-project conventions are hard to follow
- Code conventions important for continuity
- Project build support saves time and improves the result
- Version management is hard with beginners' code
- Automatic testing is important
- Context-independent unit-tests for facility software
- System-tests for applications with support for incremental evolution
- Code quality is generally low and programming is time-consuming
- Stay focused and make existing solutions a little better
- There is no script-layer where everything becomes easy
- Software construction is inherently creative, where every problem is unique: don't kid yourself!
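A context-independent unit test for facility software might look roughly like this. The `Tokens` utility is a made-up example, and plain Java checks are used only to keep the sketch self-contained; in practice a test framework such as JUnit would normally play this role.

```java
/** Hypothetical facility code: a small, application-independent utility. */
final class Tokens {
    private Tokens() {
    }

    /** Counts whitespace-separated tokens; blank input has zero tokens. */
    static int count(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

/** Context-independent test: needs no files, network, or application state. */
class TokensTest {
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    static void testCount() {
        check(Tokens.count("") == 0, "empty input");
        check(Tokens.count("   ") == 0, "blank input");
        check(Tokens.count("tv guide tonight") == 3, "three tokens");
    }

    public static void main(String[] args) {
        testCount();
        System.out.println("all tests passed");
    }
}
```

Because the test depends on nothing outside the class under test, it can run automatically on every build of the facility code.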
42Experiences from nlpFarm - Community
- Too early?
- Not all can be users or script fillers
- Kernel of developers must exist (> 3?)
- Projects/community are not important, but results are
- Are linguists like programmers?
- Will the Open Source/free software manifesto work outside Hackerdale?
- Willingness to engage in the e-society for its own sake
- What is the modern (90s) evolutionary society vision of NLP?
- OpenNLP needs a vision like GNU, but still lacks one...
- A talking thinking computer? Hm,...?
43Summing Up
- Java, the language, for NLP?
- It has kept its promise so far!
- Java 1.5 is coming...
- Higher-orderness/meta-programming is still a problem
- The Java platform for NLP?
- Better than promised in many ways
- Example of well-handled software evolution
- Many elegant designs
- Still open: knowledge representation and mainstream technology?
- XML in Java shows both possibilities and problems
- XML is a format at a low layer in the formalism stack!
- XML as a script language, e.g. the build tool Ant, shows the way?
- W3C is an example of evolution of representation formats...
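The point that XML sits at a low layer can be illustrated with the standard JAXP/DOM API. The `<s>`/`<w>` markup and the `pos` attribute below are invented for the example. The parser delivers only elements and strings; what a `pos` value means linguistically has to be defined in a layer above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XmlTokensDemo {
    /** Reads the text of every <w> element from a (hypothetical) token markup. */
    static String[] words(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("w");
            String[] result = new String[nodes.getLength()];
            for (int i = 0; i < nodes.getLength(); i++) {
                result[i] = nodes.item(i).getTextContent();
            }
            return result;
        } catch (Exception e) {
            // Parser configuration, I/O, and syntax errors all end up here.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<s><w pos=\"noun\">owls</w><w pos=\"verb\">hunt</w></s>";
        System.out.println(String.join(" ", words(xml)));
    }
}
```

The possibility is that a few library calls give a portable format; the problem is that everything above the tree of elements still has to be designed.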