JavaNLP Jumpstart

Transcript and Presenter's Notes
1
JavaNLP Jumpstart
  • Jenny Finkel
  • JavaNLP
  • October 8, 2009

2
What is JavaNLP?
  • Shared repository for members of the NLP group
  • Some completed research code (parser, POS tagger,
    NER tagger, ...)
  • Some dead research code
  • Some current research code
  • Lots of utility stuff
  • Lots of shared representation stuff
  • Meant to make your life easier

3
What this talk is
  • Basic JavaNLP policies
  • Web-based resources
  • Repository organization
  • Useful high-level stuff in JavaNLP
  • Useful utility stuff in JavaNLP

4
What this talk is not
  • Java tutorial
  • ScalaNLP
  • Eclipse/IntelliJ tutorial
  • How to get JavaNLP functional on your computer

5
Basic JavaNLP policies
  • Repository should always compile
  • Your committed code does not need to function,
    just compile (we have a script to check
    compilation)
  • Everyone messes up sometimes, so aim for this
    goal but don't kill yourself over it
  • If you modify code other than your own:
  • If non-trivial, email the list
  • If it is someone else's active research code,
    email them (or the list if you are unsure)
  • Check the machine info page, or use top, when
    running jobs
  • If you have a question, email the list!

6
Web-based Resources
  • JavaNLP homepage
  • http://nlp.stanford.edu/javanlp/
  • Has links to everything
  • Has getting-started guides for everything
    (including a new user's guide)
  • Javadocs
  • http://www-nlp.stanford.edu/nlp/javadoc/javanlp/
  • Web SVN browser
  • http://nlp.stanford.edu/viewvc/
  • Machine Info Page
  • http://nlp.stanford.edu/local/machines.shtml
  • Username/password: nlp/lundard

7
Repository Organization
  • 7 top level projects
  • Minerva
  • MT
  • Pubcrawl (dead)
  • Rte
  • Core
  • Research
  • Periphery
  • The only links between projects are things
    linking to core
  • Don't modify non-core stuff unless you are on
    that project (or have permission from the owners)

8
High Level Stuff
  • Lexicalized Parser
  • edu.stanford.nlp.parser.lexparser.LexicalizedParser
  • Two choices: PCFG (faster), factored (better)
  • /u/nlp/data/lexparser/englishPCFG.ser.gz or
    englishFactored.ser.gz
  • POS tagger
  • edu.stanford.nlp.tagger.maxent.MaxentTagger
  • /u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/train-wsj-0-18.holder
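A minimal usage sketch for the two tools above (not from the original
slides). It assumes the newer static loadModel entry point and
parse(String) convenience method; 2009-era snapshots constructed the
parser directly from the path. The example sentence is invented.

  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.tagger.maxent.MaxentTagger;
  import edu.stanford.nlp.trees.Tree;

  public class HighLevelDemo {
    public static void main(String[] args) throws Exception {
      // Parse one sentence with the serialized PCFG grammar.
      LexicalizedParser lp =
          LexicalizedParser.loadModel("/u/nlp/data/lexparser/englishPCFG.ser.gz");
      Tree tree = lp.parse("The strongest rain shut down the financial hub.");
      tree.pennPrint();

      // POS-tag raw text with the serialized bidirectional model.
      MaxentTagger tagger = new MaxentTagger(
          "/u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/train-wsj-0-18.holder");
      System.out.println(tagger.tagString(
          "The strongest rain shut down the financial hub."));
    }
  }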

9
High Level Stuff
  • NER tagger
  • Old: edu.stanford.nlp.ie.crf.CRFClassifier
  • Faster, less memory, less customizable
  • New: edu.stanford.nlp.sequences.SequenceClassifier
  • Slower, more memory, more customizable
  • Both use the same code for generating features (well,
    technically, the new CRF can use the old CRF's
    code, but not necessarily vice versa)
  • Lots of serialized versions, depending on the
    speed/memory/accuracy tradeoff desired (talk to
    me)
  • /u/nlp/data/ner/goodClassifiers/all.3class.distsim.ser.gz
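A sketch of loading and running the serialized model above (it assumes
the static getClassifier factory and the classifyToString convenience
method; the input sentence is invented).

  import edu.stanford.nlp.ie.crf.CRFClassifier;

  public class NerDemo {
    public static void main(String[] args) throws Exception {
      // Load the serialized 3-class (PERSON/LOCATION/ORGANIZATION) model.
      CRFClassifier crf = CRFClassifier.getClassifier(
          "/u/nlp/data/ner/goodClassifiers/all.3class.distsim.ser.gz");
      // Prints the text with a /TAG appended to each token.
      System.out.println(crf.classifyToString(
          "Jenny Finkel is at Stanford University in Palo Alto."));
    }
  }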

10
High Level Stuff
  • Semantic Role Labeling?
  • We have old code, maybe it works? Machine
    reading people?
  • Coreference Resolution
  • Newer, and so less optimized and less easy to use
    as a system component
  • Should be working soon if not already, because of
    the machine reading project

11
Classifiers
  • edu.stanford.nlp.classify
  • Datum / RVFDatum (RVF = real-valued feature)
  • GeneralDataset / Dataset / RVFDataset
  • Classifier
  • LinearClassifierFactory: how to train
  • lcf.trainClassifier()
  • Can specify:
  • Initial value, optimizer, type/parameters of
    prior, weight adaptation, cross-validation for
    sigma
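A toy end-to-end sketch of the classify API described above (features
and labels are invented, and factory defaults stand in for the
optimizer/prior/sigma knobs the slide mentions).

  import java.util.Arrays;
  import edu.stanford.nlp.classify.Dataset;
  import edu.stanford.nlp.classify.LinearClassifier;
  import edu.stanford.nlp.classify.LinearClassifierFactory;
  import edu.stanford.nlp.ling.BasicDatum;

  public class ClassifyDemo {
    public static void main(String[] args) {
      // A Datum is a bag of features plus a label.
      Dataset<String, String> train = new Dataset<String, String>();
      train.add(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "small"), "cat"));
      train.add(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "large"), "dog"));

      // The factory is where training options live (prior, optimizer, sigma).
      LinearClassifierFactory<String, String> lcf =
          new LinearClassifierFactory<String, String>();
      LinearClassifier<String, String> classifier = lcf.trainClassifier(train);

      // Classify an unlabeled datum.
      System.out.println(classifier.classOf(new BasicDatum<String, String>(
          Arrays.asList("fuzzy", "small"))));
    }
  }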

12
Optimization
  • edu.stanford.nlp.optimization
  • We have lots of optimizers; LBFGS (QNMinimizer)
    and stochastic gradient descent (SGDMinimizer)
    are my favorites
  • Everything is set up for minimization, so you may
    need to negate your function
  • The interface is DiffFunction, but you never want
    to use that directly. Instead, use
  • AbstractCachingDiffFunction
  • AbstractStochasticCachingDiffFunction
  • You just need to compute the value and partial
    derivatives for a given parameter array
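Concretely, that looks something like this sketch (a toy quadratic
minimized with QNMinimizer; the inherited value/derivative fields and
the minimize signature are as I recall them, so check the javadocs).

  import edu.stanford.nlp.optimization.AbstractCachingDiffFunction;
  import edu.stanford.nlp.optimization.QNMinimizer;

  public class OptimizeDemo {
    // f(x, y) = (x - 1)^2 + (y + 2)^2, minimized at (1, -2).
    static class Quadratic extends AbstractCachingDiffFunction {
      public int domainDimension() { return 2; }
      public void calculate(double[] x) {
        // Fill in the inherited value and derivative fields.
        value = Math.pow(x[0] - 1, 2) + Math.pow(x[1] + 2, 2);
        derivative[0] = 2 * (x[0] - 1);
        derivative[1] = 2 * (x[1] + 2);
      }
    }

    public static void main(String[] args) {
      QNMinimizer minimizer = new QNMinimizer();
      // Arguments: function, convergence tolerance, starting point.
      double[] best = minimizer.minimize(new Quadratic(), 1e-6, new double[2]);
      System.out.println(best[0] + " " + best[1]);  // roughly 1.0 and -2.0
    }
  }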

13
CoreLabel
  • edu.stanford.nlp.util.CoreMap
  • edu.stanford.nlp.util.ArrayCoreMap
  • edu.stanford.nlp.ling.CoreLabel
  • edu.stanford.nlp.ling.CoreAnnotation
  • Basically, a glorified heterogeneous typed Map
  • Make a CoreAnnotation, which specifies the value type
  • These CoreAnnotations are the keys
  • cl.set(WordAnnotation.class, "bug")
  • String word = cl.get(WordAnnotation.class)
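Putting the pieces together, defining and using your own key looks like
this sketch (WordAnnotation is defined locally for illustration; common
keys like this already ship with the code).

  import edu.stanford.nlp.ling.CoreAnnotation;
  import edu.stanford.nlp.ling.CoreLabel;

  public class CoreLabelDemo {
    // A CoreAnnotation is just a key class that declares its value type.
    public static class WordAnnotation implements CoreAnnotation<String> {
      public Class<String> getType() { return String.class; }
    }

    public static void main(String[] args) {
      CoreLabel cl = new CoreLabel();
      cl.set(WordAnnotation.class, "bug");         // typed put
      String word = cl.get(WordAnnotation.class);  // typed get, no cast needed
      System.out.println(word);
    }
  }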

14
ObjectBank
  • Utility code for reading in collections of data
  • edu.stanford.nlp.objectbank
  • ObjectBank(ReaderIteratorFactory, IteratorFromReaderFactory)
  • ReaderIteratorFactory: you give it files,
    directories, URLs, strings, whatever, and it
    vends readers
  • IteratorFromReaderFactory (interface): takes a
    reader, vends your objects
  • Lots of implementations exist
  • Sample code in the javadocs for the package
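For the common case you don't even need to build the factories yourself;
a convenience method vends the lines of a file (a sketch; data.txt is a
placeholder path).

  import edu.stanford.nlp.objectbank.ObjectBank;

  public class ObjectBankDemo {
    public static void main(String[] args) {
      // getLineIterator pairs the two factories for you: one String per line.
      for (String line : ObjectBank.getLineIterator("data.txt")) {
        System.out.println(line);
      }
    }
  }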

15
edu.stanford.nlp.stats.Counter
  • Anything to do with associating objects with numbers
  • Counting items
  • increment/decrement
  • Normalize
  • Log-space functionality
  • Feature weights
  • Counter interface
  • ClassicCounter: map-based
  • OpenAddressCounter: uses fastutil
  • Better unless you are also removing lots of
    stuff?
  • Counters
  • Useful utility stuff
  • Similar to Collections, Arrays, etc.
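A quick sketch of the Counter idiom (toy counts; Counters.argmax and
Counters.normalize are the sort of utility methods meant above).

  import edu.stanford.nlp.stats.ClassicCounter;
  import edu.stanford.nlp.stats.Counter;
  import edu.stanford.nlp.stats.Counters;

  public class CounterDemo {
    public static void main(String[] args) {
      Counter<String> counts = new ClassicCounter<String>();
      counts.incrementCount("the");
      counts.incrementCount("the");
      counts.incrementCount("cat");

      System.out.println(counts.getCount("the"));   // 2.0
      System.out.println(Counters.argmax(counts));  // the
      Counters.normalize(counts);                   // in place; now sums to 1.0
      System.out.println(counts.getCount("cat"));   // ~0.33
    }
  }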

16
Other Utility Stuff
  • edu.stanford.nlp.util
  • ArrayUtils
  • CollectionValuedMap
  • Function
  • Index
  • Pair / Triple / Quadruple
  • StringUtils
  • TwoDimensionalMap
  • TwoDimensionalCollectionValuedMap
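A few of these in action (a sketch; HashIndex is assumed as the Index
implementation, and all the data is invented).

  import edu.stanford.nlp.util.CollectionValuedMap;
  import edu.stanford.nlp.util.HashIndex;
  import edu.stanford.nlp.util.Index;
  import edu.stanford.nlp.util.Pair;

  public class UtilDemo {
    public static void main(String[] args) {
      // Pair: a typed 2-tuple.
      Pair<String, Integer> p = new Pair<String, Integer>("cat", 3);
      System.out.println(p.first() + " " + p.second());

      // Index: a bidirectional object <-> int mapping, handy for features.
      Index<String> index = new HashIndex<String>();
      int i = index.indexOf("fuzzy", true);  // true = add if absent
      System.out.println(index.get(i));      // fuzzy

      // CollectionValuedMap: a map from keys to collections of values.
      CollectionValuedMap<String, String> cvm =
          new CollectionValuedMap<String, String>();
      cvm.add("animal", "cat");
      cvm.add("animal", "dog");
      System.out.println(cvm.get("animal"));  // [cat, dog]
    }
  }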

17
Other Utility Stuff
  • edu.stanford.nlp.math
  • SloppyMath
  • ArrayMath
  • edu.stanford.nlp.io
  • IOUtils
  • edu.stanford.nlp.process
  • Morphology
  • WordShapeClassifier
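A grab-bag sketch of these utilities (data.txt is a placeholder path;
method names are as I recall them, so double-check the javadocs).

  import edu.stanford.nlp.io.IOUtils;
  import edu.stanford.nlp.math.ArrayMath;
  import edu.stanford.nlp.math.SloppyMath;

  public class MiscDemo {
    public static void main(String[] args) {
      // SloppyMath: fast/stable math; logAdd is log(e^a + e^b) without overflow.
      System.out.println(SloppyMath.logAdd(-1000.0, -1000.5));

      // ArrayMath: vector operations over double[].
      double[] v = {1.0, 2.0, 3.0};
      System.out.println(ArrayMath.sum(v));

      // IOUtils: one-liners for common I/O chores.
      for (String line : IOUtils.readLines("data.txt")) {
        System.out.println(line);
      }
    }
  }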

18
Parting Words
  • JavaNLP is not a chore. We are really lucky to
    have this resource, even if it is far from
    perfect
  • JavaNLP is subject to the tragedy of the commons.
    If you see problems, fix them. Do tasks. Allow
    your code to be used by others. Don't be
    selfish.
  • If you get in-house testers, it's easier to
    release your code
  • If you use shared representations, it will make
    it easier for others to use your stuff, but it
    will also make it easier for you to use other
    people's stuff
  • If everyone contributes, everyone benefits
  • Newbies: Ask questions! Use the mailing list!
    We are friendly! Identify the things you had
    problems with, and try to prevent those problems
    for the next new person (update the new user's
    webpage).