Slides: 31
Provided by: DCCN


Title: Webmaster Search Awareness

Webmaster Search Awareness
  • improving visitor search experience

California County Information Architecture
Project Meeting, Friday, March 25, 2005

Presentation Outline
  • State of CA Search Architecture
  • Platform
  • Collection Building
  • Spidering (crawling)
  • Indexing
  • Query parsing
  • Visitor Experience Improvements
  • Webmaster Guidelines
  • Future Enhancements
  • Site Search Options

Platform Architecture
  • 2 search servers (host collections)
  • 2 spider servers for 293 websites
  • once scheduled, spidering runs 24/7
  • 10 million documents (projected)
  • software: Verity K2 Enterprise v5.5, on Solaris
  • hardware: Sun Fire V440, 4 GB RAM, 2 CPUs

(simply put, we're a bit underpowered)

Collection Building - Spider
  • How it works
  • Port 80 requests
  • From user agent: State of CA Spider, Teale Data
  • Six connections
  • The original Verity spider ignored robots
    directives, so we looked like a DoS attack. We now
    honor all robots directives and strongly encourage
    their use.
  • Document meta-data and content are indexed into
    searchable lists called collections.
  • What's in
  • HTML, PDF, Word, Excel, PPT and Flash (maybe)
  • What's out
  • all other MIME types
  • pages with no Title tag or blank tag content
  • documents with no Title property or titled "Untitled"
  • path segments > 50
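
The "what's in / what's out" rules can be pictured as a simple filter. The sketch below is only an illustration, not the spider's actual logic: the function name is hypothetical, the MIME types are copied from the example command file later in the deck, and it assumes the "path segments > 50" rule counts the "/"-separated segments of the URL path.

```python
from urllib.parse import urlparse

# MIME types copied from the -mimeinclude lines of the example command file.
INCLUDED_MIME_TYPES = {
    "text/html",
    "application/pdf",
    "application/msword",
    "application/excel",
    "application/powerpoint",
    "application/x-shockwave-flash",
}
# Assumption: the limit applies to the number of "/"-separated path segments.
MAX_PATH_SEGMENTS = 50


def is_indexable(url, mime_type, title):
    """Return True if a document passes the in/out rules listed on the slide."""
    if mime_type not in INCLUDED_MIME_TYPES:
        return False  # all other MIME types are out
    cleaned = (title or "").strip()
    if not cleaned or cleaned.lower() == "untitled":
        return False  # no Title, blank Title, or "Untitled"
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) <= MAX_PATH_SEGMENTS


if __name__ == "__main__":
    print(is_indexable("http://www.water.ca.gov/docs/report.pdf",
                       "application/pdf", "Annual Report"))
```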

Collection Building - Index
  • Database Repositories
  • Portal customers
  • Broadvision data
  • extracted via SQL from Oracle DB
  • Perl script indexes, builds and deploys
  • HTTP Repositories (your web sites)
  • K2 spider calls K2 indexer every 1024 docs
  • parameter driven
  • Other Possible Repositories
  • file systems, Exchange, Notes, Documentum, etc.
  • Example: www.water.ca.gov

Collection Building - Index (cont.)
  • -collection /data/verity5/colls/ca_water
  • -style /data/verity5/stylesets/Def_HTTP
  • -start http://www.water.ca.gov
  • -domain water.ca.gov
  • -mimeinclude text/html
  • -mimeinclude application/pdf
  • -mimeinclude application/msword
  • -mimeinclude application/excel
  • -mimeinclude application/powerpoint
  • -mimeinclude application/x-shockwave-flash

Example spider and index job (command file) for

Collection Building - Index (cont.)
  • -regexp
  • -exclude 'file://'
  • -exclude '/mailto'
  • -exclude '/calendar/'
  • -exclude '/guestbook/'
  • -exclude '/espanol/'
  • -exclude '../'
  • -indskip title '\s'
  • -indskip title ''
  • -indskip title 'uUntitled'
  • -agentname "State of CA Spider, Teale Data Center
  • -connections 6
  • -indexers 3
  • -jumps 10
  • -cgiok
  • -retry 0
  • -delay 0
  • -timeout 10

spider and index command file for www.water.ca.gov
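
To make the effect of the -exclude lines concrete, here is a minimal sketch of regex-based URL filtering. It is not the K2 spider's implementation: the patterns are adapted from the command file and treated as Python regular expressions (an assumption about the spider's syntax), and the function names are hypothetical.

```python
import re

# Patterns adapted from the -exclude lines of the command file. Treating them
# as Python regular expressions is an assumption made for this illustration.
EXCLUDE_PATTERNS = [
    r"^file://",
    r"mailto",
    r"/calendar/",
    r"/guestbook/",
    r"/espanol/",
]
COMPILED = [re.compile(p) for p in EXCLUDE_PATTERNS]


def is_excluded(url):
    """Return True if any exclude pattern matches somewhere in the URL."""
    return any(p.search(url) for p in COMPILED)


def filter_urls(urls):
    """Keep only the URLs that no exclude pattern matches."""
    return [u for u in urls if not is_excluded(u)]


if __name__ == "__main__":
    candidates = [
        "http://www.water.ca.gov/index.html",
        "http://www.water.ca.gov/calendar/2005/03.html",
        "mailto:webmaster@water.ca.gov",
    ]
    for url in filter_urls(candidates):
        print(url)
```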

Query Parsing
  • JSP calls to Verity Java API
  • var vs = null
  • vs = CreateJavaObject('com.verity.search.VSearch')
  • vs.setServerSpec("localhost:9910")
  • if ( qparser == "simple" )
  • etc., etc.
  • Weighted by field (tag or property)
  • Title, 95
  • 70 chars max (imposed tentative limit)
  • Subject, 92
  • 300 chars max (imposed definite limit)
  • Keywords, 90
  • no limit
  • URL, 85
  • Content, 80
  • Weighting can be used to tweak score by
    emphasizing certain fields over others
  • <accrue>(95((unclaimed, property) <in> title),
    92((unclaimed, property) <in> subject),
    90((unclaimed, property) <in> keywords),
    90((unclaimed, property) <in> url),
    85(unclaimed, property))
  • Portal customers can use custom weights
  • Tourism
  • CalOHI
  • Governor
  • Film Commission
  • extensive customization possible
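
As an illustration of how a field-weighted query like the one on this slide could be assembled from a visitor's terms, here is a small sketch. The function name is hypothetical, the weights are the ones listed in the field table (Title 95, Subject 92, Keywords 90, URL 85, Content 80), and the output only mimics the notation as it appears in this transcript; it is not guaranteed to be valid Verity query syntax.

```python
# Field weights as listed on the slide; Content is the unfielded clause.
FIELD_WEIGHTS = [("title", 95), ("subject", 92), ("keywords", 90), ("url", 85)]
CONTENT_WEIGHT = 80


def build_accrue_query(terms, field_weights=FIELD_WEIGHTS,
                       content_weight=CONTENT_WEIGHT):
    """Build a weighted query string in the notation shown on the slide."""
    term_list = "(" + ", ".join(terms) + ")"
    # One weighted clause per field, e.g. 95((a, b) <in> title).
    clauses = [f"{weight}({term_list} <in> {field})"
               for field, weight in field_weights]
    # Final clause searches document content with the lowest weight.
    clauses.append(f"{content_weight}{term_list}")
    return "<accrue>(" + ", ".join(clauses) + ")"


if __name__ == "__main__":
    print(build_accrue_query(["unclaimed", "property"]))
```

Passing a different list of (field, weight) pairs is one way the per-customer custom weights mentioned above could be represented.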

Visitor Experience
  • Search Box enhancements (since 11-15-2004)
  • Wider
  • more text visible without horizontal scrolling
  • psychologically encourages entry of more terms (I
  • internet average: 1.3 words per query
  • State of CA average: 2.3 words per query
  • Text remains visible in box from search to search
  • Cursor focus stays in box. Just type to add terms
  • Firefox: nice
  • IE: ok

Visitor Experience (cont.)
  • Standard Results Data added
  • Date
  • Doc size (open a 5MB pdf over dialup?)
  • Mime-type
  • Sorted by Score/Date
  • Simple advanced search from either search box
  • "Nobody uses Advanced Search" - Tim Bray
  • Phrases "", grouping (), AND, OR and NOT. That's
  • Recommended Links
  • Hit Score v. numbering v. nothing (like Google).
    Jury still out.

Webmaster Guidelines
  • Use a robots.txt file. (Please!)
  • User-agent: *
    Disallow: /directory1/calendar.pl
    Disallow: /directory2/exchange/
    Disallow: /directory3/images/
  • or, at least, use the
  • Robots meta-tag <meta name="robots" content="noinde
  • Publish only PDF documents. Unless you have a
    specific need for Word, Excel, txt, etc. files on
    your web sites, convert every format to PDF.
  • Test every PDF to ensure it is not corrupted.
Webmaster Guidelines (page 2)
  • Meta-Data: Application Properties
  • The Big 4: Title, Subject, Date and Keywords
  • indexer uses modified date (Date changes every
    time document is saved.)
  • Educate your source authors and PDF clerks
  • Always enter a Title for application documents
    (Word, PDF, etc.). Do not use hard spaces in the
    Title like this:
  • Licensing_and_Certification_Individual_Licenses_Is
  • Documents without a Title are excluded.
  • Always enter a Subject. Any doc destined for the
    Web should have human-keyed text in the Subject
    property. 300 character max. The Subject is the
    search results Summary! The search engine will
    dynamically generate text if Subject is blank.
  • Keywords are very important if the title and
    subject do not fully identify the document content.
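
The property guidelines above lend themselves to a pre-publication check. The following sketch is a hypothetical helper, not part of any Teale tooling: it takes document properties as a plain dictionary (how they are read from a Word or PDF file is left out), interprets "hard spaces" as the underscores shown in the example title, and uses the 70 and 300 character limits from the Query Parsing slide.

```python
TITLE_MAX = 70     # tentative Title limit from the Query Parsing slide
SUBJECT_MAX = 300  # definite Subject limit from the Query Parsing slide


def check_properties(props):
    """Return a list of guideline problems for one document's properties."""
    problems = []
    title = (props.get("Title") or "").strip()
    subject = (props.get("Subject") or "").strip()
    keywords = (props.get("Keywords") or "").strip()

    if not title or title.lower() == "untitled":
        problems.append("missing Title: document would be excluded")
    else:
        # Assumption: "hard spaces" means underscores used in place of spaces.
        if "_" in title:
            problems.append("Title uses underscores instead of spaces")
        if len(title) > TITLE_MAX:
            problems.append(f"Title longer than {TITLE_MAX} characters")

    if not subject:
        problems.append("missing Subject: summary would be auto-generated")
    elif len(subject) > SUBJECT_MAX:
        problems.append(f"Subject longer than {SUBJECT_MAX} characters")

    if not keywords:
        problems.append("no Keywords")
    return problems


if __name__ == "__main__":
    for problem in check_properties({"Title": "Licensing_and_Certification"}):
        print(problem)
```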

Webmaster Guidelines (page 3)
  • Meta-data: HTML tags
  • Title -- maps to search results hyperlink
    <title> State of California <sDynamicTitle></ti
  • No Title, no document.
  • Keywords -- maps to Keywords (hidden)
  • <META name="keywords" content="California,
    portal, homepage">
  • Description -- maps to Summary
  • <META name="description" content="Homepage for
    myCalifornia site, contains sections for Online
    Services, What's New, Navigational links, Feature
    links, Quick Hits, and Picture Contest. When
    registered user logs in, page will be
    personalized to show various info dependent on
    profile selection.">
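
To see what a search engine would pick up from a page, a webmaster can extract the same three fields locally. This is a minimal sketch using Python's standard html.parser; the class and function names are hypothetical, the sample page is a shortened stand-in for the example above, and the real indexer may parse pages differently.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data


def extract_search_fields(html):
    """Return title, keywords and description as a dictionary."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    fields["title"] = " ".join(fields["title"].split())  # collapse whitespace
    return fields


SAMPLE = """<html><head>
<title> State of California </title>
<META name="keywords" content="California, portal, homepage">
<META name="description" content="Homepage for myCalifornia site">
</head><body></body></html>"""

if __name__ == "__main__":
    print(extract_search_fields(SAMPLE))
```

An empty "title" in the result corresponds to the "No Title, no document" case.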

Webmaster Guidelines (page 4)
  • Which sites are included in CA Search?
  • Only State Agency Index sites.
  • Be consistent. If you list a site once as
    www.smogcheck.ca.gov, try not to list another link
    under an alias such as www.autorepair.ca.gov. The
    consequence could be duplicate documents in
    search results.
  • Domain Names
  • Register domain names in the sub-level/top-level
    domain ca.gov. At least register a .ca.gov
    alias for each site.
  • What is the CA State Fair URL? Why?

Webmaster Guidelines (page 5)
  • No bad docs! Test every PDF before publishing.
  • If you can't open it, neither can the search
    indexer. Bad docs not only make you look bad, but
    the search engine wastes time and bandwidth retrying.
  • No bad links!
  • Run a link checker regularly. Standard protocol
    for webmasters.
  • Spider does not check URI.
  • Indexer does check and wastes time and bandwidth
    retrying bad links.
  • Online Guide Important Webmaster Search
  • The CA Top 40 (the big hits)
  • A single collection (not a collection of singles)
    with documents containing the most searched-for
  • Dynamic priority collection. Sites spidered and
    indexed frequently
  • Log file analysis determines the play list. (no
    payola, please)
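
The "play list" idea, where log file analysis picks the most searched-for queries, can be sketched with a simple counter. This is only an illustration under assumptions: the real log format is not shown in the deck, so the sketch assumes one raw query string per log line, and the function name is hypothetical.

```python
from collections import Counter


def top_queries(log_lines, n=40):
    """Return the n most frequent queries as (query, count) pairs.

    Assumes each log line holds exactly one visitor query. Queries are
    lowercased and whitespace-normalized so trivial variants count together.
    """
    counts = Counter()
    for line in log_lines:
        query = " ".join(line.lower().split())
        if query:  # skip blank lines
            counts[query] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample_log = [
        "unclaimed property",
        "Unclaimed  Property",
        "dmv",
        "",
        "jobs",
        "dmv",
        "unclaimed property",
    ]
    for query, count in top_queries(sample_log, n=2):
        print(count, query)
```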

Webmaster Guidelines (page 6)
  • Helpful Links
  • Google's Guidelines
  • Better Document Titles and Summaries
  • Fixing Poor Data Quality Part 1
  • Fixing Poor Data Quality Part 2
  • Enterprise Search
  • Pew Report on Internet Search
  • Stephen E. Arnold Search Articles
  • Other Google Ranking Tips

Webmaster Guidelines (page 7)
  • Current Statistics (www.ca.gov)
  • Reported on 02-15-2005 for the last 10 days.
  • Total number of visitor queries: 332,275
  • Total number of unique queries: 109,478
  • Average number of queries per day: 33,228
  • Average number of words per query: 2.36
  • 293 State of California web sites.
  • 20 collections, 2 million documents
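
The per-day average follows directly from the totals, and the same numbers show how often visitors repeat each other's queries. A short worked check, using only the figures on this slide (the function name is hypothetical):

```python
def summarize(total_queries, unique_queries, days):
    """Derive per-day and uniqueness figures from the reported totals."""
    # Integer rounding to the nearest whole query per day.
    per_day = (total_queries + days // 2) // days
    unique_share = unique_queries / total_queries
    return {
        "queries_per_day": per_day,
        "unique_share": unique_share,        # fraction of queries that are distinct
        "repeat_share": 1.0 - unique_share,  # fraction that repeat an earlier query
    }


if __name__ == "__main__":
    stats = summarize(total_queries=332_275, unique_queries=109_478, days=10)
    print(stats["queries_per_day"])          # 33228, matching the slide
    print(f"{stats['unique_share']:.1%} of queries were unique")
```

Roughly a third of the queries were unique, which is consistent with a small set of popular queries accounting for much of the traffic.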

Future Enhancements
  • Vague term filter?
  • Spelling suggestion
  • Visitor Configurable
  • Sort by date only
  • Adjust number of results
  • View as HTML (convert PDFs on our side)
  • Icons for file types
  • Automated recommendation engine
  • (existing recommended links derive from a manual
  • Parametric search
  • (field selectable terms based on categories)
  • State-wide taxonomies
  • (documents categorically arranged in a
    hierarchical presentation)
  • 3rd party (Google, Yahoo) handoff?
  • Clairvoyance engine
  • (solves all remaining relevancy problems)

Search Service Options
  • Appliances
  • Google Appliance, Mini - BOE, FTB
  • Thunderstone, Webinator - Caltrans
  • Purchase Software
  • Verity UltraSeek, Webinator, etc.
  • Vendors galore
  • Free and Internet-based Search Services
  • Teale Data Center! - 24/7, data center security,
  • Htdig (GNU open source)
  • FreeFind
  • PicoSearch
  • Atomz - CalHum
  • Many, many others

  • Your turn

Contact Information
  • Kevin Paddock
  • Teale Data Center
  • Internet Services Division
  • 916-464-4233
  • kpaddock_at_teale.ca.gov

Webmaster Search Awareness
  • improving visitor search experience

Wednesday, November 11, 2009
The Top 40
Application Properties
  • Word File Properties
  • Acrobat File Document Properties

Search Vendors (to name a few)
Teale Site Search Service
  • Bad Link/Document Analysis
  • Malformed keys (file paths, URLs)
  • Bad links, non-existent scripts, docs, etc. (404
  • Stream Errors (corrupted files)
  • Connection timeouts (sites down, bad sites, etc.)
  • Authorization failures
  • Full-domain (multiple site) spidering
  • One collection
  • Sized and priced to fit
  • 1,000 - 10,000 documents
  • 10k - 50k
  • 50k - 150k
  • 150k - 250k
  • Any mime-types you want