JavaNLP Jumpstart

Transcript and Presenter's Notes
1
JavaNLP Jumpstart
  • Jenny Finkel
  • JavaNLP
  • October 8, 2009

2
What is JavaNLP?
  • Shared repository for members of the NLP group
  • Some completed research code (parser, POS tagger,
    NER tagger, ...)
  • Some dead research code
  • Some current research code
  • Lots of utility stuff
  • Lots of shared representation stuff
  • Meant to make your life easier

3
What this talk is
  • Basic JavaNLP policies
  • Web-based resources
  • Repository organization
  • Useful high-level stuff in JavaNLP
  • Useful utility stuff in JavaNLP

4
What this talk is not
  • Java tutorial
  • ScalaNLP
  • Eclipse/IntelliJ tutorial
  • How to get JavaNLP functional on your computer

5
Basic JavaNLP policies
  • Repository should always compile
  • Your committed code does not need to function,
    just compile (we have a script to check
    compilation)
  • Everyone messes up sometimes, so aim for this
    goal but don't kill yourself over it
  • If you modify code other than your own:
  • If non-trivial, email the list
  • If it is someone else's active research code,
    email them (or the list if you are unsure)
  • Check the machine info page, or use top, when
    running jobs
  • If you have a question, email the list!

6
Web-based Resources
  • JavaNLP homepage
  • http://nlp.stanford.edu/javanlp/
  • Has links to everything
  • Has getting-started guides for everything
    (including a new user's guide)
  • Javadocs
  • http://www-nlp.stanford.edu/nlp/javadoc/javanlp/
  • Web SVN browser
  • http://nlp.stanford.edu/viewvc/
  • Machine Info Page
  • http://nlp.stanford.edu/local/machines.shtml
  • Username/password: nlp/lundard

7
Repository Organization
  • 7 top level projects
  • Minerva
  • MT
  • Pubcrawl (dead)
  • Rte
  • Core
  • Research
  • Periphery
  • The only links between projects are things
    linking to core
  • Don't modify non-core stuff unless you are on
    that project (or have permission from the owners)

8
High Level Stuff
  • Lexicalized Parser
  • edu.stanford.nlp.parser.lexparser.LexicalizedParser
  • Two choices: PCFG (faster), factored (better)
  • /u/nlp/data/lexparser/englishPCFG.ser.gz or
    englishFactored.ser.gz
  • POS tagger
  • edu.stanford.nlp.tagger.maxent.MaxentTagger
  • /u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/train-wsj-0-18.holder
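A minimal usage sketch for the two tools above (not from the original
slides). It assumes the newer static loadModel entry point and
parse(String) convenience method; 2009-era snapshots constructed the
parser directly from the path. The example sentence is invented.

  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.tagger.maxent.MaxentTagger;
  import edu.stanford.nlp.trees.Tree;

  public class HighLevelDemo {
    public static void main(String[] args) throws Exception {
      // Parse one sentence with the serialized PCFG grammar.
      LexicalizedParser lp =
          LexicalizedParser.loadModel("/u/nlp/data/lexparser/englishPCFG.ser.gz");
      Tree tree = lp.parse("The strongest rain shut down the financial hub.");
      tree.pennPrint();

      // POS-tag raw text with the serialized bidirectional model.
      MaxentTagger tagger = new MaxentTagger(
          "/u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/train-wsj-0-18.holder");
      System.out.println(tagger.tagString(
          "The strongest rain shut down the financial hub."));
    }
  }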

9
High Level Stuff
  • NER tagger
  • Old: edu.stanford.nlp.ie.crf.CRFClassifier
  • Faster, less memory, less customizable
  • New: edu.stanford.nlp.sequences.SequenceClassifier
  • Slower, more memory, more customizable
  • Both use the same code for generating features (well,
    technically, the new CRF can use the old CRF's
    code, but not necessarily vice versa)
  • Lots of serialized versions, depending on the
    speed/memory/accuracy tradeoff desired (talk to
    me)
  • /u/nlp/data/ner/goodClassifiers/all.3class.distsim.ser.gz
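A sketch of loading and running the serialized model above (it assumes
the static getClassifier factory and the classifyToString convenience
method; the input sentence is invented).

  import edu.stanford.nlp.ie.crf.CRFClassifier;

  public class NerDemo {
    public static void main(String[] args) throws Exception {
      // Load the serialized 3-class (PERSON/LOCATION/ORGANIZATION) model.
      CRFClassifier crf = CRFClassifier.getClassifier(
          "/u/nlp/data/ner/goodClassifiers/all.3class.distsim.ser.gz");
      // Prints the text with a /TAG appended to each token.
      System.out.println(crf.classifyToString(
          "Jenny Finkel is at Stanford University in Palo Alto."));
    }
  }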

10
High Level Stuff
  • Semantic Role Labeling?
  • We have old code, maybe it works? Machine
    reading people?
  • Coreference Resolution
  • Newer, and so less optimized and less easy to use
    as a system component
  • Should be working soon if not already, because of
    the machine reading project

11
Classifiers
  • edu.stanford.nlp.classify
  • Datum / RVFDatum (RVF = real-valued feature)
  • GeneralDataset / Dataset / RVFDataset
  • Classifier
  • LinearClassifierFactory: how to train
  • lcf.trainClassifier()
  • Can specify:
  • Initial value, optimizer, type/parameters of
    prior, weight adaptation, cross-validation for
    sigma
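A toy end-to-end sketch of the classify API described above (features
and labels are invented, and factory defaults stand in for the
optimizer/prior/sigma knobs the slide mentions).

  import java.util.Arrays;
  import edu.stanford.nlp.classify.Dataset;
  import edu.stanford.nlp.classify.LinearClassifier;
  import edu.stanford.nlp.classify.LinearClassifierFactory;
  import edu.stanford.nlp.ling.BasicDatum;

  public class ClassifyDemo {
    public static void main(String[] args) {
      // A Datum is a bag of features plus a label.
      Dataset<String, String> train = new Dataset<String, String>();
      train.add(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "small"), "cat"));
      train.add(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "large"), "dog"));

      // The factory is where training options live (prior, optimizer, sigma).
      LinearClassifierFactory<String, String> lcf =
          new LinearClassifierFactory<String, String>();
      LinearClassifier<String, String> classifier = lcf.trainClassifier(train);

      // Classify an unlabeled datum.
      System.out.println(classifier.classOf(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "small"))));
    }
  }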

12
Optimization
  • edu.stanford.nlp.optimization
  • We have lots of optimizers; LBFGS (QNMinimizer)
    and stochastic gradient descent (SGDMinimizer)
    are my favorites
  • Everything is set up for minimization, so you may
    need to negate your function
  • The interface is DiffFunction, but you never want
    to use that directly. Instead, use
  • AbstractCachingDiffFunction
  • AbstractStochasticCachingDiffFunction
  • You just need to compute the value and partial
    derivatives for a given parameter array
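Concretely, that looks something like this sketch (a toy quadratic
minimized with QNMinimizer; the inherited value/derivative fields and
the minimize signature are as I recall them, so check the javadocs).

  import edu.stanford.nlp.optimization.AbstractCachingDiffFunction;
  import edu.stanford.nlp.optimization.QNMinimizer;

  public class OptimizeDemo {
    // f(x, y) = (x - 1)^2 + (y + 2)^2, minimized at (1, -2).
    static class Quadratic extends AbstractCachingDiffFunction {
      public int domainDimension() { return 2; }
      public void calculate(double[] x) {
        // Fill in the inherited value and derivative fields.
        value = Math.pow(x[0] - 1, 2) + Math.pow(x[1] + 2, 2);
        derivative[0] = 2 * (x[0] - 1);
        derivative[1] = 2 * (x[1] + 2);
      }
    }

    public static void main(String[] args) {
      QNMinimizer minimizer = new QNMinimizer();
      // Arguments: function, convergence tolerance, starting point.
      double[] best = minimizer.minimize(new Quadratic(), 1e-6, new double[2]);
      System.out.println(best[0] + " " + best[1]);  // roughly 1.0 and -2.0
    }
  }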

13
CoreLabel
  • edu.stanford.nlp.util.CoreMap
  • edu.stanford.nlp.util.ArrayCoreMap
  • edu.stanford.nlp.ling.CoreLabel
  • edu.stanford.nlp.ling.CoreAnnotation
  • Basically, a glorified heterogeneous typed Map
  • Make a CoreAnnotation, which specifies the value type
  • These CoreAnnotations are the keys
  • cl.set(WordAnnotation.class, "bug")
  • String word = cl.get(WordAnnotation.class)
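Putting the pieces together, defining and using your own key looks like
this sketch (WordAnnotation is defined locally for illustration; common
keys like this already ship with the code).

  import edu.stanford.nlp.ling.CoreAnnotation;
  import edu.stanford.nlp.ling.CoreLabel;

  public class CoreLabelDemo {
    // A CoreAnnotation is just a key class that declares its value type.
    public static class WordAnnotation implements CoreAnnotation<String> {
      public Class<String> getType() { return String.class; }
    }

    public static void main(String[] args) {
      CoreLabel cl = new CoreLabel();
      cl.set(WordAnnotation.class, "bug");         // typed put
      String word = cl.get(WordAnnotation.class);  // typed get, no cast needed
      System.out.println(word);
    }
  }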

14
ObjectBank
  • Utility code for reading in collections of data
  • edu.stanford.nlp.objectbank
  • ObjectBank(ReaderIteratorFactory, IteratorFromReaderFactory)
  • ReaderIteratorFactory: you give it files,
    directories, URLs, strings, whatever, and it
    vends readers
  • IteratorFromReaderFactory (interface): takes a
    reader, vends your objects
  • Lots of implementations exist
  • Sample code in the javadocs for the package
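For the common case you don't even need to build the factories yourself;
a convenience method vends the lines of a file (a sketch; data.txt is a
placeholder path).

  import edu.stanford.nlp.objectbank.ObjectBank;

  public class ObjectBankDemo {
    public static void main(String[] args) {
      // getLineIterator pairs the two factories for you: one String per line.
      for (String line : ObjectBank.getLineIterator("data.txt")) {
        System.out.println(line);
      }
    }
  }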

15
edu.stanford.nlp.stats.Counter
  • Anything to do with associating objects with numbers
  • Counting items
  • increment/decrement
  • Normalize
  • Log-space functionality
  • Feature weights
  • Counter interface
  • ClassicCounter: map-based
  • OpenAddressCounter: uses fastutil
  • Better unless you are also removing lots of
    stuff?
  • Counters
  • Useful utility stuff
  • Similar to Collections, Arrays, etc.
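A quick sketch of the Counter idiom (toy counts; Counters.argmax and
Counters.normalize are the sort of utility methods meant above).

  import edu.stanford.nlp.stats.ClassicCounter;
  import edu.stanford.nlp.stats.Counter;
  import edu.stanford.nlp.stats.Counters;

  public class CounterDemo {
    public static void main(String[] args) {
      Counter<String> counts = new ClassicCounter<String>();
      counts.incrementCount("the");
      counts.incrementCount("the");
      counts.incrementCount("cat");

      System.out.println(counts.getCount("the"));   // 2.0
      System.out.println(Counters.argmax(counts));  // the
      Counters.normalize(counts);                   // in place; now sums to 1.0
      System.out.println(counts.getCount("cat"));   // ~0.33
    }
  }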

16
Other Utility Stuff
  • edu.stanford.nlp.util
  • ArrayUtils
  • CollectionValuedMap
  • Function
  • Index
  • Pair / Triple / Quadruple
  • StringUtils
  • TwoDimensionalMap
  • TwoDimensionalCollectionValuedMap
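A few of these in action (a sketch; HashIndex is assumed as the Index
implementation, and all the data is invented).

  import edu.stanford.nlp.util.CollectionValuedMap;
  import edu.stanford.nlp.util.HashIndex;
  import edu.stanford.nlp.util.Index;
  import edu.stanford.nlp.util.Pair;

  public class UtilDemo {
    public static void main(String[] args) {
      // Pair: a typed 2-tuple.
      Pair<String, Integer> p = new Pair<String, Integer>("cat", 3);
      System.out.println(p.first() + " " + p.second());

      // Index: a bidirectional object <-> int mapping, handy for features.
      Index<String> index = new HashIndex<String>();
      int i = index.indexOf("fuzzy", true);  // true = add if absent
      System.out.println(index.get(i));      // fuzzy

      // CollectionValuedMap: a map from keys to collections of values.
      CollectionValuedMap<String, String> cvm =
          new CollectionValuedMap<String, String>();
      cvm.add("animal", "cat");
      cvm.add("animal", "dog");
      System.out.println(cvm.get("animal"));  // [cat, dog]
    }
  }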

17
Other Utility Stuff
  • edu.stanford.nlp.math
  • SloppyMath
  • ArrayMath
  • edu.stanford.nlp.io
  • IOUtils
  • edu.stanford.nlp.process
  • Morphology
  • WordShapeClassifier
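A grab-bag sketch of these utilities (data.txt is a placeholder path;
method names are as I recall them, so double-check the javadocs).

  import edu.stanford.nlp.io.IOUtils;
  import edu.stanford.nlp.math.ArrayMath;
  import edu.stanford.nlp.math.SloppyMath;

  public class MiscDemo {
    public static void main(String[] args) {
      // SloppyMath: fast/stable math; logAdd is log(e^a + e^b) without overflow.
      System.out.println(SloppyMath.logAdd(-1000.0, -1000.5));

      // ArrayMath: vector operations over double[].
      double[] v = {1.0, 2.0, 3.0};
      System.out.println(ArrayMath.sum(v));

      // IOUtils: one-liners for common I/O chores.
      for (String line : IOUtils.readLines("data.txt")) {
        System.out.println(line);
      }
    }
  }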

18
Parting Words
  • JavaNLP is not a chore. We are really lucky to
    have this resource, even if it is far from
    perfect
  • JavaNLP is subject to the tragedy of the commons.
    If you see problems, fix them. Do tasks. Allow
    your code to be used by others. Don't be
    selfish.
  • If you get in-house testers, it's easier to
    release your code
  • If you use shared representations, it will make
    it easier for others to use your stuff, but it
    will also make it easier for you to use other
    people's stuff
  • If everyone contributes, everyone benefits
  • Newbies: Ask questions! Use the mailing list!
    We are friendly! Identify the things you had
    problems with, and try to prevent those problems
    for the next new person (update the new user's
    webpage).