Title: Solr Performance
1. Solr Performance & Key Innovations
- Yonik Seeley, Lucid Imagination, yonik_at_lucidimagination.com, May 26 2011
2. Solr 3.1 Highlights
- Numeric range facets (similar to date faceting).
- New spatial search, including spatial filtering, boosting and sorting capabilities.
- Example Velocity-driven search UI at http://localhost:8983/solr/browse
- A new, faster termvector-based highlighter.
- Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
- Distributed search support for the SpellCheck and Terms components.
3. Solr 3.1 Highlights (continued)
- Suggester, a fast trie-based autocomplete component.
- Sort results by any function query.
- JSON document indexing.
- CSV response format.
- Apache UIMA integration for metadata extraction.
- Tons of optimizations, bug fixes, and new analysis capabilities via Apache Lucene 3.1.
4. What's not in 3.1?
- Result Grouping (AKA Field Collapsing)
- Pivot Faceting
- SolrCloud
- Pseudo-fields
- Pseudo-join
- Relevancy function queries
- Per-segment faceting
- Tons of new Lucene performance/efficiency goodness
5. Recent Lucene Performance
- TieredMergePolicy: the new default
- Much better for incremental indexing / NRT
- Ignores segment order when selecting best merge
- Takes deletes into account
- Does not over-merge (no cascading merges)
- Finite State Transducer (FST) based terms index
6. DocumentsWriterPerThread (DWPT)
- Flushing a new segment is now concurrent with indexing
- Use multiple indexing threads/connections
- When max memory is hit, the biggest DWPT is concurrently flushed
[Diagram: indexing threads write to in-memory DWPTs inside the IndexWriter; each DWPT flushes its own segment to disk: _1_0.tiv _1_0.prx _1_0.frq, _2_0.tiv _2_0.prx _2_0.frq, _3_0.tiv _3_0.prx _3_0.frq]
7. Solr Cloud
[Diagram: a request to http://.../solr/collection1?distrib=true is split into load-balanced sub-requests, one to shard1 (replica1, replica2, replica3) and one to shard2 (replica1, replica2, replica3). Cluster state lives in a ZooKeeper quorum of ZK nodes, laid out as follows.]
/livenodes
  server1:8983/solr
  server2:8983/solr
  server2:8983/solr
/collections
  /collection1  configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml
8. Solr Cloud: Getting Started
- http://wiki.apache.org/solr/SolrCloud
- java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
- -Dbootstrap_confdir and -Dcollection.configName: upload ./solr/conf to ZK and call it myconf
- -DzkRun: run an internal ZK server
- http://localhost:8983/solr/collection1/admin/zookeeper.jsp
9. Distributed Requests
- Explicitly specify node addresses to load-balance across
- shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- A list of equivalent nodes is separated by "|"
- Different phases of the same distributed request use the same node
- Specify logical shard ids to search across
- shards=NY_shard,NJ_shard
- Query across all shards in the collection
- http://localhost:8983/solr/collection1/select?distrib=true
- public CloudSolrServer(String zkHost)
- SolrJ Java client that load-balances across all nodes in the cluster
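The explicit shards syntax above can be sketched in a few lines of Python. This is only an illustration of how the value is assembled ("|" between equivalent nodes, "," between shards); the helper name build_shards_param and the query q=ipod are assumptions made for the example, not part of Solr or SolrJ.

```python
from urllib.parse import urlencode


def build_shards_param(shards):
    """Join equivalent nodes of one shard with '|' and distinct shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


# Each inner list holds the equivalent nodes (replicas) of one shard.
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]
shards_param = build_shards_param(shards)

# q=ipod is a hypothetical query; safe= keeps the separators readable.
query_string = urlencode({"q": "ipod", "shards": shards_param}, safe=":/|,")
print(query_string)
```

Keeping the replica lists as nested Python lists makes it hard to mix up the two separators when the node list is generated from configuration.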
10. Extended Dismax Parser
- Superset of dismax
- Designed to directly handle user queries without exceptions
- defType=edismax&q=foo&qf=body
- Fixes edge cases where dismax could still throw exceptions
- OR AND NOT -
- Full Lucene syntax support
- Tries Lucene syntax first
- Smart escaping is done if there are syntax errors
- Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
- Fielded queries (e.g. myfield:foo) even in degraded mode
- uf parameter controls what field names may be directly specified in q
11. Extended Dismax Parser (continued)
- boost parameter for multiplicative boost-by-function
- Pure negative query clauses
- Example: solr OR (-solr)
- Enhanced term proximity boosting
- pf2=myfield results in term bigrams in sloppy phrase queries
- myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
- Enhanced stopword handling
- Stopwords are omitted in the main query, but added in the optional proximity boosting part
- Example: q=solr is awesome&qf=myfield&pf2=myfield ->
- myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
- Currently controlled by the absence of StopWordFilter in the index analyzer, and its presence in the query analyzer
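The bigram expansion behind pf2 can be sketched as follows. This is a toy illustration of the idea, not the edismax implementation: the function name bigram_phrases and the one-word stopword set are assumptions, and real analysis is done by the field's analyzers rather than by whitespace splitting.

```python
def bigram_phrases(field, text):
    """Return one quoted two-word phrase clause per adjacent word pair."""
    words = text.split()
    return ['{}:"{} {}"'.format(field, a, b) for a, b in zip(words, words[1:])]


# Term proximity boosting: a three-word query becomes two phrase clauses.
proximity = " ".join(bigram_phrases("myfield", "aa bb cc"))

# Stopword handling: stopwords drop out of the main clause but stay in the
# optional boosting clauses. The stopword set here is hypothetical.
stopwords = {"is"}
user_query = "solr is awesome"
main_terms = [w for w in user_query.split() if w not in stopwords]
main_clause = "myfield:({})".format(" ".join(main_terms))
boost_clause = "({})".format(" ".join(bigram_phrases("myfield", user_query)))

print(proximity)
print(main_clause, boost_clause)
```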
12. Faceting Performance Improvements
- For facet.method=enum, speed up initial population of the filterCache (i.e. first-time facet): from 30 to 32x improvement
- Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
- Optimized deep facet paging: up to 10x faster with really large facet.offsets
- Less memory consumed by field cache entries
- Per-segment faceting with facet.method=fcs
- Only faster when re-opening the index frequently (many times a second)
- Only works for single-valued fields
13. Pivot Faceting
- Other names that could have made sense
- Grid Faceting, Cross-Product Faceting, Matrix Faceting
- Syntax: facet.pivot=field1,field2,field3,...
- Example: facet.pivot=cat,inStock
value | docs | docs w/ inStock:true | docs w/ inStock:false
cat:electronics | 14 | 10 | 4
cat:memory | 3 | 3 | 0
cat:connector | 2 | 0 | 2
cat:graphics card | 2 | 0 | 2
cat:hard drive | 2 | 2 | 0
14. Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          (continued)
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
- "count":14 means 14 docs w/ cat:electronics
- "count":5 means 5 docs w/ cat:electronics and popularity:6
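To make the nested counts concrete, here is a minimal sketch that computes a pivot structure of the same shape from an in-memory list of documents. It is not how Solr computes pivots internally; the function name pivot, the single-valued fields, and the four toy documents are all assumptions for illustration.

```python
from collections import Counter


def pivot(docs, fields):
    """Count docs by the first field, then recurse into the remaining fields."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    counts = Counter(d[field] for d in docs if field in d)
    result = []
    for value, count in counts.most_common():
        entry = {"field": field, "value": value, "count": count}
        if rest:
            # Only docs with this value contribute to the nested counts.
            matching = [d for d in docs if d.get(field) == value]
            entry["pivot"] = pivot(matching, rest)
        result.append(entry)
    return result


# Hypothetical documents, single-valued for simplicity.
docs = [
    {"cat": "electronics", "popularity": 6},
    {"cat": "electronics", "popularity": 6},
    {"cat": "electronics", "popularity": 7},
    {"cat": "memory", "popularity": 5},
]
facet_pivot = pivot(docs, ["cat", "popularity"])
print(facet_pivot[0])
```

Each nested "pivot" list is computed only over the documents that matched the parent value, which is why the inner counts never exceed the outer count.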
15. Range Faceting
"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}
- Like date faceting, but more generic
- http://...&facet=true
- &facet.range=price
- &facet.range.start=0
- &facet.range.end=500
- &facet.range.gap=50
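The bucketing that start, end and gap describe can be sketched like this. It is a simplified model that assumes lower-inclusive, upper-exclusive buckets and ignores Solr's other range options; the function name range_facet and the price list are hypothetical.

```python
def range_facet(values, start, end, gap):
    """Count values into [lower, lower + gap) buckets between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[str(float(lower))] = 0  # keys look like "0.0", "50.0", ...
        lower += gap
    for v in values:
        if start <= v < end:
            bucket = start + gap * int((v - start) // gap)
            counts[str(float(bucket))] += 1
    return {"counts": counts, "gap": float(gap),
            "start": float(start), "end": float(end)}


# Hypothetical prices; not the data behind the slide's response.
prices = [11.5, 19.95, 74.99, 185.0, 399.0]
facet = range_facet(prices, 0, 500, 50)
print(facet["counts"])
```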
16. Spatial Search
Step 1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
&pt=44.0153371,-73.16734&d=1&sfield=store
Step 3: Profit!
- Spatial filter: fq={!geofilt}
- Bounding box: fq={!bbox}
- Distance function: sort=geodist() asc
- Returning the distance: fl=geodist()
Pseudo-fields!
Note: You can now sort by any arbitrary function query!
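As a rough illustration of what a distance filter decides, the sketch below computes the great-circle (haversine) distance between the indexed store and the query point from the slide, then checks it against d. It assumes d is in kilometers and uses a mean Earth radius of 6371 km; neither the function name nor the constant comes from Solr.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)  # the indexed "store" field
pt = (44.0153371, -73.16734)     # the pt parameter
distance = haversine_km(pt[0], pt[1], store[0], store[1])
within_d = distance <= 1.0       # d=1, assumed to mean 1 km
print(round(distance, 3), within_d)
```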
17. Pseudo-Fields
- Returns other info along with document stored fields
- Function queries
- fl=name,location,geodist(),add(myfield,10)
- Fieldname globs
- fl=id,attr_*
- Multiple fl (field list) values
- fl=id,attr_*&fl=geodist()&fl=termfreq(text,solr)
- Aliasing
- fl=id,location:loc,_dist_:geodist()
- Future: inlined highlighting, explain, sort-values, group-value
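Repeating fl is plain query-string repetition, which most HTTP client libraries support directly. A small sketch with Python's standard library, where q=ipod is a hypothetical query:

```python
from urllib.parse import parse_qs, urlencode

# A list value plus doseq=True emits one fl=... pair per entry.
params = {"q": "ipod", "fl": ["id,attr_*", "geodist()"]}
query_string = urlencode(params, doseq=True)

# Round-trip to confirm that both fl values survive percent-encoding.
decoded = parse_qs(query_string)
print(query_string)
print(decoded["fl"])
```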
18. Result Grouping / Field Collapsing
- Goal
- Limit the number of results per "category"
- "category" is normally defined by unique values in a field
- Uses
- Web search: collapse by web site
- Email threads: collapse by thread id
- Ecommerce/retail
- Show the top 5 items for each store category (music, movies, etc.)
19. Field Collapsing by Site
20. Field Collapse on Product Type
Result Grouping by Category
21. Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}
22. Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}
23. Grouping Params
parameter | meaning | default
group.field=<field> | Like facet.field: group by unique field values |
group.query=<query> | Like facet.query: top docs that also match |
group.function=<function query> | Group by unique values produced by the function query |
group.limit=<n> | How many docs per group | 1
group.sort=<sort spec> | How to sort documents within a group | Same as sort
rows=<n> | How many groups to return | 10
sort=<sort spec> | How to sort the groups relative to each other (based on top doc) |
group.format=<format> | grouped/simple: if simple, a single flat list is used and rows units are docs | grouped
group.main=true/false | If true, the first field grouping command is used as the main result set | false
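How group.field, group.limit and rows interact can be modeled with a short in-memory sketch. This is not Solr's grouping code: the function name group_by_field is hypothetical, the input is assumed to be already sorted by relevance, and the manu_exact value given to F8V7067-APL-KIT is an assumption.

```python
from collections import OrderedDict


def group_by_field(docs, field, group_limit=1, rows=10):
    """Group ranked docs by a field; keep group_limit docs in each of rows groups."""
    groups = OrderedDict()
    for doc in docs:  # first occurrence of a value fixes the group order
        groups.setdefault(doc.get(field), []).append(doc)
    result = []
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "doclist": {"numFound": len(members), "start": 0,
                        "docs": members[:group_limit]},
        })
    return {"matches": len(docs), "groups": result}


# Ranked results for a hypothetical q=ipod.
docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
grouped = group_by_field(docs, "manu_exact")
print(grouped["groups"][0])
```

With the default group.limit of 1, each group reports its full numFound but returns only its top document, which matches the shape of a grouped response.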
24. Pseudo-Join
id: post1, blog_id: blog1, author: Yonik Seeley, title: Solr relevancy function queries, body: Lucene's default ranking...
id: post2, blog_id: blog1, author: Yonik Seeley, title: Solr result grouping, body: Result Grouping, also called...
id: post3, blog_id: blog2, author: Whitson Gordon, title: How to Install Netflix on Almost Any Android Device
id: blog1, name: Solr 'n Stuff, owner: Yonik Seeley, started: 2007-10-26
id: blog2, name: lifehacker, owner: Gawker Media, started: 2005-1-31
Restrict to blogs mentioning netflix:
fq={!join from=blog_id to=id}body:netflix
- Finds all documents matching netflix
- Maps to different docs by following blog_id to id
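The two steps of the join (match, then map from one field to another) can be sketched over a small in-memory document list. This only mirrors the semantics; it says nothing about how Solr executes the join. The helper names are hypothetical, and post3's body text is an assumption, since the slide does not show it.

```python
def field_values(doc, field):
    """Return a field's values as a list, for single- or multi-valued fields."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def join(from_docs, all_docs, from_field, to_field):
    """Collect from_field values of from_docs, then keep docs whose to_field matches."""
    keys = {v for d in from_docs for v in field_values(d, from_field)}
    return [d for d in all_docs if keys.intersection(field_values(d, to_field))]


docs = [
    {"id": "post1", "blog_id": "blog1", "body": "Lucene's default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "Result Grouping, also called"},
    {"id": "post3", "blog_id": "blog2", "body": "install netflix"},  # assumed body
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
]

# Step 1: find all documents matching netflix.
matching = [d for d in docs if "netflix" in d.get("body", "").lower()]
# Step 2: follow blog_id to id.
blogs = join(matching, docs, "blog_id", "id")
print([d["id"] for d in blogs])
```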
25. Pseudo-Join Examples
- Only show posts from blogs started after 2010
- q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
- If any post in a blog mentions obama, then search all posts in that blog for bomb (self-join)
- q=bomb&fq={!join from=blog_id to=blog_id}obama
- If any blog post mentions obama, then search all websites with the same blog owner for bomb
- q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
26. Cross-Core Join
- http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john
27. Pseudo-Join vs Grouping
Pseudo-Join | Result Grouping / Field Collapsing
O(n_terms_in_join_fields) | O(n_docs_in_result)
Single or multi-valued fields | Single-valued fields only
Filters only (no info currently passed from the "from" docs to the "to" docs). | Can order docs within a group and groups by top doc within that group using normal sort criteria.
Chainable (one join can be the input to another) | Not currently chainable: can only group one field deep
Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs) | Grouping does not currently affect the set of documents matching the query, so faceting is unaffected.
28. Auto-Suggest
- Many people previously used the terms component
- Can be slow for a large corpus
- New auto-suggest builds off the SpellCheck component
- TST implementation: compact, memory-based trie
- FST implementation: slower to build, but smaller, with faster lookup
- Based on a field in the main index, or on a dictionary file
- http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}
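For intuition about prefix lookup, here is a deliberately simpler stand-in: a binary search over a sorted term list. It is neither the TST nor the FST implementation mentioned above, only a sketch of the same query (all terms starting with a prefix); the term list and the function name suggest are hypothetical.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to limit terms starting with prefix from a sorted list."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms sharing a prefix are contiguous in sorted order.
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


terms = sorted(["ipod", "ultra", "ultrasharp", "usb", "video"])
suggestions = suggest(terms, "ult")
print(suggestions)
```

A trie or FST answers the same question with less memory on large dictionaries, which is the trade-off the bullets above describe.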
29. Index with JSON
URL=http://localhost:8983/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id": "978-0641723445",
    "cat": ["book","hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": true,
    "price": 12.50,
    "pages_i": 384
  }
]'
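The same request can be prepared from Python's standard library. The sketch below only builds the request object and does not send it, so it runs without a Solr server. Wrapping the document in a JSON array is an assumption about the expected payload shape, since the extracted slide text has lost its brackets.

```python
import json
import urllib.request

url = "http://localhost:8983/solr/update/json"

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],  # multi-valued field
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# Assumption: the update handler accepts a JSON array of documents.
payload = json.dumps([doc]).encode("utf-8")
request = urllib.request.Request(
    url, data=payload, headers={"Content-type": "application/json"}
)
# urllib.request.urlopen(request) would send it; omitted here on purpose.
print(request.get_method(), request.full_url)
```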
30. Query Results in CSV
- http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
- Can handle multi-valued fields (see cat field in example)
- Completely compatible with the CSV update handler (can round-trip)
- Results are streamed: good for dumping entire parts of the index
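A client can read such a response with any CSV parser. In this sketch the response text is pasted in as a string rather than fetched from a server, and splitting cat on commas assumes the separator for multi-valued fields is the comma seen in the example.

```python
import csv
import io

# The CSV response from the example, inlined as a string.
csv_text = (
    'name,price,cat,popularity\n'
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # The quoted cat cell holds several values joined by commas.
    row["cat"] = row["cat"].split(",")

print(rows[0]["name"], rows[0]["cat"])
```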
31. http://localhost:8983/solr/browse
32. Q&A