Lucene
1
Lucene & Nutch
  • Lucene
    • Project name
    • Started as a text index engine
  • Nutch
    • A complete web search engine, including crawling, indexing, and searching
    • Indexes 100M pages, crawls >10M pages/day
    • Provides a distributed architecture
  • Written in Java
    • Other-language ports are works in progress

2
Lucene
  • Open source search project
    • http://lucene.apache.org
  • Index & search local files
    • Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
    • Extract the files
  • Build an index for a directory
    • java org.apache.lucene.demo.IndexFiles dir_path
  • Try a search at the command line
    • java org.apache.lucene.demo.SearchFiles
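  The demo classes above are thin wrappers around the core API. As a minimal
  sketch of what they do, assuming Lucene 2.2 on the classpath (the class name,
  index path, and field values below are illustrative, not part of the demo):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class MiniDemo {  // hypothetical class, not shipped with Lucene
      public static void main(String[] args) throws Exception {
        // Build an index in ./index (created, or overwritten if present)
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("contents", "Lucene started as a text index engine",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Query the same index
        IndexSearcher searcher = new IndexSearcher("index");
        Query q = new QueryParser("contents", new StandardAnalyzer()).parse("lucene");
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++)
          System.out.println(hits.doc(i).get("contents"));
        searcher.close();
      }
    }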

3
Deploy Lucene
  • Copy luceneweb.war to tomcat-home/webapps
  • Browse to http://localhost:8080/luceneweb
    • Tomcat will deploy the web app
  • Edit webapps/luceneweb/configuration.jsp
    • Point indexLocation to your indexes (see the snippet below)
  • Search at http://localhost:8080/luceneweb
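  As a sketch, the edit in configuration.jsp amounts to changing one JSP
  variable; the path below is an assumption, to be replaced with the directory
  that IndexFiles created:

    <%
      // assumed shape of the shipped file; only the path needs to change
      String indexLocation = "/path/to/your/index";
    %>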

4
Nutch
  • A complete search engine: http://lucene.apache.org/nutch/release/
  • Modes
    • Intranet/local search
    • Internet search
  • Usage
    • Crawl
    • Index
    • Search

5
Intranet Search
  • Configuration
    • Input URLs: create a directory and a seed file
      • mkdir urls
      • echo http://www.cs.ucsb.edu > urls/ucsb
    • Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
    • Edit conf/nutch-site.xml (see the sketch below)
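  As a sketch of those two edits: the domain filter line in
  conf/crawl-urlfilter.txt becomes

    +^http://([a-z0-9]*\.)*cs.ucsb.edu/

  and conf/nutch-site.xml needs at least an http.agent.name property inside
  its <configuration> element, or the fetcher refuses to run (the value below
  is an assumption; pick your own):

    <property>
      <name>http.agent.name</name>
      <value>my-nutch-crawler</value>  <!-- assumed value -->
    </property>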

6
Intranet: Running the Crawl
  • Crawl options include:
    • -dir dir names the directory to put the crawl in
    • -threads threads determines the number of threads that will fetch in parallel
    • -depth depth indicates the link depth from the root page that should be crawled
    • -topN N determines the maximum number of pages that will be retrieved at each level, up to the depth
  • E.g.:
    • bin/nutch crawl urls -dir crawl -depth 3 -topN 50

7
Intranet Search
  • Deploy the Nutch war file
    • rm -rf TOMCAT_DIR/webapps/ROOT
    • cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
  • The webapp finds indexes in ./crawl, relative to where you start Tomcat
    • TOMCAT_DIR/bin/catalina.sh start
  • Search at http://localhost:8080/
  • CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080
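  If starting Tomcat from the crawl's parent directory is inconvenient, the
  webapp can instead be pointed at the indexes explicitly. A sketch, assuming
  the searcher.dir property set in the deployed webapp's
  WEB-INF/classes/nutch-site.xml:

    <property>
      <name>searcher.dir</name>
      <value>/path/to/crawl</value>  <!-- assumed absolute path to the crawl directory -->
    </property>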

8
Internet Crawling
  • Concepts
    • crawldb: all URL info
    • linkdb: list of known links to each URL
    • segments: each is a set of URLs that are fetched as a unit
    • indexes: Lucene-format indexes
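  On disk, these all live under the crawl directory. A sketch of the layout
  after a couple of fetch rounds (segment names are timestamps; the ones below
  are made up):

    crawl/
      crawldb/                  # fetch state and score for every known URL
      linkdb/                   # for each URL, the links pointing to it
      segments/20080301120000/  # one fetch unit: fetchlist, content, parse data
      segments/20080302090000/
      indexes/                  # Lucene-format indexes served by the webapp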

9
Internet Crawling Process
  1. Get seed URLs
  2. Fetch
  3. Update the crawl DB
  4. Compute top URLs; go to step 2
  5. Create the index
  6. Deploy
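  As a sketch, the loop above spelled out with the bin/nutch commands from the
  following slides, assuming the kids/ directory layout used there and a
  made-up depth of three rounds:

    bin/nutch inject kids/crawldb 67k-url/      # or dmoz/urls, per the next slide
    for i in 1 2 3; do                          # assumed number of rounds
      bin/nutch generate kids/crawldb kids/segments -topN 50000
      s=`ls -d kids/segments/2* | tail -1`      # newest segment
      bin/nutch fetch $s
      bin/nutch updatedb kids/crawldb $s
    done
    bin/nutch invertlinks kids/linkdb kids/segments/*
    bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*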

10
Seed URL
  • URLs from the DMOZ Open Directory
    • wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
    • gunzip content.rdf.u8.gz
    • mkdir dmoz
    • bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
  • Kids-search URLs from ask.com
  • Inject the URLs
    • bin/nutch inject kids/crawldb 67k-url/
  • Edit conf/nutch-site.xml (as on the intranet-search slide)

11
Fetch
  • Generate a fetchlist from the database
    • bin/nutch generate kids/crawldb kids/segments
  • Save the name of the fetchlist in the variable s1
    • s1=`ls -d kids/segments/2* | tail -1`
  • Run the fetcher on this segment
    • bin/nutch fetch $s1

12
Update Crawl DB and Re-fetch
  • Update the crawl db with the results of the fetch
    • bin/nutch updatedb kids/crawldb $s1
  • Generate the top-scoring 50K pages
    • bin/nutch generate kids/crawldb kids/segments -topN 50000
  • Re-fetch
    • s1=`ls -d kids/segments/2* | tail -1`
    • bin/nutch fetch $s1

13
Index, Deploy, and Search
  • Create the inverted link database
    • bin/nutch invertlinks kids/linkdb kids/segments/*
  • Index the segments
    • bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
  • Deploy & search
    • Same as in intranet search
  • Demo of 1M pages (570K + 500K)?

14
Issues
  • The default re-crawl cycle is 30 days for all URLs
  • Duplicates are pages with the same URL or the same MD5 hash of page content
    (see the sketch below)
  • The JavaScript parser uses regular expressions to extract URL literals
    from code
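  As a sketch, such duplicates can be pruned from the indexes with the dedup
  tool (the directory follows the kids/ example above):

    bin/nutch dedup kids/indexes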