Title: Webmaster Search Awareness
1Webmaster Search Awareness
- improving visitor search experience
California County Information Architecture
Project Meeting, Friday, March 25, 2005
2Presentation Outline
- State of CA Search Architecture
- Platform
- Collection Building
- Spidering (crawling)
- Indexing
- Query parsing
- Visitor Experience Improvements
- Webmaster Guidelines
- Future Enhancements
- Site Search Options
3Platform Architecture
- 2 search servers (host collections)
- 2 spider servers for 293 websites
- once scheduled, spidering 24/7
- 10 million documents (projected)
- software: Verity K2 Enterprise v5.5, on Solaris 5.8
- hardware: Sun Fire V440, 4 GB RAM, 2 CPUs
(simply put, we're a bit underpowered)
5Collection Building - Spider
- How it works
- Port 80 requests
- From user agent "State of CA Spider, Teale Data Center"
- Six connections
- The original Verity spider ignored robots directives, so we looked like a DoS. We now honor all robots directives and strongly encourage their use.
- Document meta-data and content are indexed into searchable lists called collections.
- What's in
- HTML, PDF, Word, Excel, PPT and Flash (maybe)
- What's out
- all other MIME types
- pages with no Title tag or blank tag content
- documents with no Title property or Untitled
- path segments > 50
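The "what's in / what's out" rules can be read as a simple filter. The following is a minimal Python sketch, not the spider's actual logic: the function name `is_indexable` is hypothetical, the MIME strings are the ones used in the command file for www.water.ca.gov, and treating "Untitled" case-insensitively is an assumption.

```python
from urllib.parse import urlparse

# MIME types the deck lists as "in" (strings as they appear in the command file).
ALLOWED_MIME_TYPES = {
    "text/html",
    "application/pdf",
    "application/msword",
    "application/excel",
    "application/powerpoint",
    "application/x-shockwave-flash",
}
MAX_PATH_SEGMENTS = 50  # URLs with more than 50 path segments are left out


def is_indexable(url, mime_type, title):
    """Return True if a document passes the (assumed) inclusion rules."""
    if mime_type not in ALLOWED_MIME_TYPES:
        return False
    # Missing, blank, or "Untitled" titles are excluded.
    if title is None or not title.strip() or title.strip().lower() == "untitled":
        return False
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) <= MAX_PATH_SEGMENTS
```

A webmaster could run a check like this over a site inventory to see which documents would likely be dropped before the spider ever visits.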
6Collection Building - Index
- Database Repositories
- Portal customers
- Broadvision data
- extracted via SQL from Oracle DB
- Perl script indexes, builds and deploys collections
- HTTP Repositories (your web sites)
- K2 spider calls K2 indexer every 1024 docs
- parameter driven
- Other Possible Repositories
- file systems, Exchange, Notes, Documentum, etc.
- Example: www.water.ca.gov
7Collection Building - Index (cont.)
- -collection /data/verity5/colls/ca_water
- -style /data/verity5/stylesets/Def_HTTP
- -start http://www.water.ca.gov
- -domain water.ca.gov
- -mimeinclude text/html
- -mimeinclude application/pdf
- -mimeinclude application/msword
- -mimeinclude application/excel
- -mimeinclude application/powerpoint
- -mimeinclude application/x-shockwave-flash
Example spider and index job (command file) for
www.water.ca.gov
8Collection Building - Index (cont.)
- -regexp
- -exclude 'file://'
- -exclude '/mailto'
- -exclude '/calendar/'
- -exclude '/guestbook/'
- -exclude '/espanol/'
- -exclude '../'
- -indskip title '\s'
- -indskip title ''
- -indskip title '[uU]ntitled'
- -agentname "State of CA Spider, Teale Data Center (916)464-4233"
- -connections 6
- -indexers 3
- -jumps 10
- -cgiok
- -retry 0
- -delay 0
- -timeout 10
Spider and index command file for www.water.ca.gov
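To see how the -domain and -exclude options narrow a crawl, here is a rough Python sketch. It is an illustration only: the function `should_crawl` is hypothetical, only the three unambiguous directory patterns are used, and searching each pattern anywhere in the URL is an assumption about how the spider applies -regexp.

```python
import re
from urllib.parse import urlparse

# Patterns copied from the -exclude lines; treating them as regular
# expressions searched anywhere in the URL is an assumption.
EXCLUDE_PATTERNS = [r"/calendar/", r"/guestbook/", r"/espanol/"]
DOMAIN = "water.ca.gov"  # value of -domain in the command file


def should_crawl(url):
    """Return True if the URL is inside the domain and matches no exclude."""
    host = urlparse(url).hostname or ""
    in_domain = host == DOMAIN or host.endswith("." + DOMAIN)
    if not in_domain:
        return False
    return not any(re.search(pattern, url) for pattern in EXCLUDE_PATTERNS)
```

For example, `should_crawl("http://www.water.ca.gov/calendar/2005.html")` is rejected by the `/calendar/` pattern, while an ordinary page on the same host passes.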
10Query Parsing
- JSP calls to Verity Java API
- var vs = null
- vs = CreateJavaObject('com.verity.search.VSearch')
- vs.setServerSpec("localhost:9910")
- if ( qparser == "simple" )
- etc., etc.
- Weighted by field (tag or property)
- Title, 95
- 70 chars max (imposed tentative limit)
- Subject, 92
- 300 chars max (imposed definite limit)
- Keywords, 90
- no limit
- URL, 85
- Content, 80
- Weighting can be used to tweak score by emphasizing certain fields over others
- <accrue>(95((unclaimed, property) <in> title), 92((unclaimed, property) <in> subject), 90((unclaimed, property) <in> keywords), 90((unclaimed, property) <in> url), 85(unclaimed, property))
- Portal customers can use custom weights
- Tourism
- CalOHI
- Governor
- Film Commission
- extensive customization possible
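The effect of field weighting can be illustrated with a toy ranking function. This is a sketch under stated assumptions, not Verity's accrue algorithm: it uses the listed weights (Title 95 through Content 80), scores a document by the heaviest field containing the term, and the names `score` and `rank` are hypothetical.

```python
# Field weights from the slide; the scoring rule itself is an illustrative
# assumption, not Verity's actual accrue algorithm.
FIELD_WEIGHTS = {"title": 95, "subject": 92, "keywords": 90, "url": 85, "content": 80}


def score(document, term):
    """Score a document by the heaviest field that contains the term."""
    term = term.lower()
    matched = [
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in document.get(field, "").lower()
    ]
    return max(matched, default=0)


def rank(documents, term):
    """Return matching documents, best score first."""
    scored = [(score(doc, term), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for value, doc in scored if value > 0]
```

Under this model a page with the search term in its Title outranks a page that only mentions it in body content, which is why the Title and Subject guidelines matter.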
12Visitor Experience
- Search Box enhancements (since 11-15-2004)
- Wider
- more text visible without horizontal scrolling
- psychologically encourages entry of more terms (I hope)
- internet average: 1.3 words per query
- State of CA average: 2.3 words per query
- Text remains visible in box from search to search
- Cursor focus stays in box. Just type to add terms
- Firefox: nice
- IE: ok
13Visitor Experience (cont.)
- Standard Results Data added
- Date
- Doc size (open a 5 MB PDF over dialup?)
- MIME type
- Sorted by Score/Date
- Simple advanced search from either search box
- "Nobody uses Advanced Search" - Tim Bray
- Phrases "", grouping (), AND, OR and NOT. That's it.
- Recommended Links
- Hit Score v. numbering v. nothing (like Google). Jury still out.
15Webmaster Guidelines
- Use a robots.txt file. (Please!)
- User-agent: *
- Disallow: /directory1/calendar.pl
- Disallow: /directory2/exchange/
- Disallow: /directory3/images/
- or, at least, use the
- Robots meta-tag <meta name="robots" content="noindex,nofollow">
- Publish only PDF documents. Unless you have a specific need for Word, Excel, txt, etc. files on your web sites, convert every format to PDF.
- Test every PDF to ensure it is not corrupted.
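Before publishing a robots.txt file, it can help to check what it actually blocks. The sketch below uses Python's standard `urllib.robotparser` with the Disallow rules from this slide. The `User-agent: *` line is an assumption about which agents the rules target, and the host name `www.example.ca.gov` and the helper `allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Disallow lines from the slide; applying them to all agents is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory1/calendar.pl
Disallow: /directory2/exchange/
Disallow: /directory3/images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="State of CA Spider"):
    """Return True if the parsed rules let the agent fetch the URL."""
    return parser.can_fetch(agent, url)
```

Calling `allowed("http://www.example.ca.gov/directory2/exchange/inbox.html")` reports the page as blocked, while the site's home page stays crawlable.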
16Webmaster Guidelines (page 2)
- Meta-Data: Application Properties
- The Big 4: Title, Subject, Date and Keywords
- Indexer uses modified date (Date changes every time document is saved.)
- Educate your source authors and PDF clerks
- Always enter a Title for application documents (Word, PDF, etc.). Do not use hard spaces (underscores) in the Title like this:
- Licensing_and_Certification_Individual_Licenses_Issued_in_the_Last_15_Days
- Documents without a Title are excluded.
- Always enter a Subject. Any doc destined for the Web should have human-keyed text in the Subject property. 300 character max. The Subject is the search results Summary! The search engine will dynamically generate text if Subject is blank.
- Keywords: very important if title and subject do not fully identify the document content.
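The Title and Subject rules on this slide lend themselves to an automated pre-publish check. The following is a minimal Python sketch, assuming the property values have already been read from the document by some other tool; the function name `check_properties` and the message wording are hypothetical.

```python
MAX_SUBJECT_CHARS = 300  # limit stated on the slide


def check_properties(title, subject):
    """Return a list of problems found in a document's Title and Subject."""
    problems = []
    if not title or not title.strip():
        problems.append("missing Title (document would be excluded)")
    elif "_" in title:
        problems.append("Title uses underscores instead of spaces")
    if not subject or not subject.strip():
        problems.append("missing Subject (summary would be auto-generated)")
    elif len(subject) > MAX_SUBJECT_CHARS:
        problems.append("Subject longer than 300 characters")
    return problems
```

An empty list means the document passes these checks; anything else is a prompt to send the file back to its author.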
17Webmaster Guidelines (page 3)
- Meta-data: HTML tags
- Title -- maps to search results hyperlink
- <title> State of California <sDynamicTitle></title>
- No Title, no document.
- Keywords -- maps to Keywords (hidden)
- <META name="keywords" content="California, portal, homepage">
- Description -- maps to Summary
- <META name="description" content="Homepage for myCalifornia site, contains sections for Online Services, What's New, Navigational links, Feature links, Quick Hits, and Picture Contest. When registered user logs in, page will be personalized to show various info dependent on profile selection.">
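To preview what a search engine would pick up from a page, the three tags on this slide can be pulled out with Python's standard `html.parser`. This is a sketch of the general idea, not the Verity indexer's behavior; the class `MetaExtractor` and function `extract_fields` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title>, meta keywords and meta description from a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data


def extract_fields(html):
    """Return the title, keywords and description found in an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["title"] = fields["title"].strip()
    return fields
```

A page whose `extract_fields` result has an empty title is one the deck says will not appear in search results at all.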
18Webmaster Guidelines (page 4)
- Which sites are included in CA Search?
- Only State Agency Index sites.
- Be consistent. If you list a site once as www.smogcheck.ca.gov, try not to list another link as an alias (www.autorepair.ca.gov). The consequence could be duplicate documents in search results.
- Domain Names
- Register domain names in the sub-level/top-level domain ca.gov. At least register a .ca.gov alias for each site.
- What is the CA State Fair URL? Why?
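The duplicate-document problem with aliases can be sketched in a few lines of Python. This is an illustration under assumptions, not how CA Search handles aliases: the alias table `HOST_ALIASES` is hypothetical (it pairs the two hosts named on the slide), and ports and fragments are dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: maps an alias host to the host the site is
# listed under, following the slide's smogcheck/autorepair example.
HOST_ALIASES = {"www.autorepair.ca.gov": "www.smogcheck.ca.gov"}


def canonical_url(url):
    """Rewrite a URL so aliased hosts collapse onto one canonical host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    # Ports and fragments are intentionally dropped in this sketch.
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Without such a table the same page reached through both host names looks like two different documents, which is why consistent listing matters.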
19Webmaster Guidelines (page 5)
- No bad docs! Test every PDF before publishing.
- If you can't open it, neither can the search indexer. Bad docs not only make you look bad, but the search engine wastes time and bandwidth retrying.
- No bad links!
- Run a link checker regularly. Standard protocol for webmasters.
- Spider does not check URI.
- Indexer does check and wastes time and bandwidth retrying bad links.
- Online Guide: Important Webmaster Search Information
- The CA Top 40 (the big hits)
- A single collection (not a collection of singles) with documents containing the most searched-for terms
- Dynamic priority collection. Sites spidered and indexed frequently
- Log file analysis determines the play list. (no payola, please)
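The log analysis behind the Top 40 can be approximated with a frequency count. A minimal Python sketch, assuming the log has already been reduced to a list of query strings; the function name `top_terms` and the whitespace tokenization are assumptions, not the production method.

```python
from collections import Counter


def top_terms(queries, n=40):
    """Return the n most frequent search terms across logged queries."""
    counts = Counter()
    for query in queries:
        # Lowercase so "Smog" and "smog" count as the same term.
        counts.update(query.lower().split())
    return [term for term, _ in counts.most_common(n)]
```

Documents containing the resulting terms would be the candidates for the priority collection.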
20Webmaster Guidelines (page 6)
- Helpful Links
- Google's Guidelines
- Better Document Titles and Summaries
- Fixing Poor Data Quality Part 1
- Fixing Poor Data Quality Part 2
- Enterprise Search
- Pew Report on Internet Search
- Stephen E. Arnold Search Articles
- Other Google Ranking Tips
21Webmaster Guidelines (page 7)
- Current Statistics (www.ca.gov)
- Reported on 02-15-2005 for the last 10 days.
- Total number of visitor queries: 332,275
- Total number of unique queries: 109,478
- Average number of queries per day: 33,228
- Average number of words per query: 2.36
- 293 State of California web sites.
- 20 collections, 2 million documents
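The daily average follows directly from the totals. A short Python check of the arithmetic, using only the figures on this slide; the "unique share" ratio is an extra derived number, not one the slide reports.

```python
# Figures reported on the slide for a 10-day window.
total_queries = 332_275
unique_queries = 109_478
days = 10

queries_per_day = total_queries / days         # 33227.5, shown as 33,228
unique_share = unique_queries / total_queries  # fraction of distinct queries

print(f"queries per day: {queries_per_day:,.1f}")
print(f"unique share:    {unique_share:.1%}")
```

Roughly one query in three is distinct, which suggests a fairly small set of repeated searches drives much of the traffic.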
23Future Enhancements
- Vague term filter?
- Spelling suggestion
- Visitor Configurable
- Sort by date only
- Adjust number of results
- View as HTML (convert PDFs on our side)
- Icons for file types
- Automated recommendation engine
- (existing recommended links derive from a manual process)
- Parametric search
- (field selectable terms based on categories)
- State-wide taxonomies
- (documents categorically arranged in a hierarchical presentation)
- 3rd party (Google, Yahoo) handoff?
- Clairvoyance engine
- (solves all remaining relevancy problems)
25Search Service Options
- Appliances
- Google Appliance, Mini - BOE, FTB
- Thunderstone, Webinator - Caltrans
- Purchase Software
- Verity UltraSeek, Webinator, etc.
- Vendors galore
- Free and Internet-based Search Services
- Teale Data Center! - 24/7, data center security, etc.
- Htdig (GNU open source)
- FreeFind
- PicoSearch
- Atomz - CalHum
- Many, many others
26Q & A
27Contact Information
- Kevin Paddock
- Teale Data Center
- Internet Services Division
- 916-464-4233
- kpaddock_at_teale.ca.gov
28Webmaster Search Awareness
- improving visitor search experience
Wednesday, November 11, 2009
29The Top 40
30Application Properties
- Acrobat File Document Properties
31Search Vendors (to name a few)
32Teale Site Search Service
- Bad Link/Document Analysis
- Malformed keys (file paths, URLs)
- Bad links, non-existent scripts, docs, etc. (404 errors)
- Stream Errors (corrupted files)
- Connection timeouts (sites down, bad sites, etc.)
- Authorization failures
- Full-domain (multiple site) spidering
- One collection
- Sized and priced to fit
- 1,000 - 10,000 documents
- 10k - 50k
- 50k - 150k
- 150k - 250k
- Any mime-types you want