Title: vinashak @ google
1- From the Inside Out
- Michael Hunter
- Reference Librarian
- Hobart and William Smith Colleges
2Google from the Inside Out
- Hardware and Database Creation
- Relevance Ranking and Link Analysis
- Advanced and Hidden Search Features
- Hands-on Session
- Pay-for-Placement and Revenue Issues
- Our Google Wish List
- Other Services to Keep Our Eyes On
3Google's Beginnings
- 1996 -- Sergey Brin and Larry Page of Stanford
develop BackRub, based on analysis of links TO
a page from other sites
- Sept. 7, 1998, Menlo Park, CA: Google launches
in beta with over 10,000 queries a day
- December 1998: Listed in PC Magazine's Top 100
Websites
5What's in a name?
- Google is a play on googol, a term coined by
mathematician Milton Sirotta to refer to the
number one followed by 100 zeros
6Google's Hardware
- Over 10,000 servers in two locations containing
hundreds of copies of the database
- Index of more than 3 billion web documents
- Handles thousands of queries on a sub-second
basis
- Interviews in MP3 format with Chief Operations
Engineer Jim Reese
- http://technetcast.com/tnc_play_stream.html?stream_id=420 (1 hr. 13 min.)
- http://technetcast.com/tnc_play_stream.html?stream_id=421 (15 min.)
7Google's Multi-faceted Database
- Indexed html pages
- Unindexed html pages
- Other file types
- Html pages that are re-indexed daily
8Multi-faceted Database
9What types of pages are unindexed? (25)
- Dead or inaccurate links
- Duplicate pages
- Database-generated URLs
- Pages with robots.txt or noindex meta tags
- Pages on an intranet
- Pages waiting to be indexed fully
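As a rough illustration of the robots.txt case in the list above, the sketch below uses Python's standard `urllib.robotparser` module to test whether a crawler is allowed to fetch a URL. The robots.txt rules and the example.com URLs are invented for illustration, and the helper name `may_crawl` is hypothetical; the slides do not describe how Google's crawler applies these rules internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/notes.html",
    ):
        print(url, may_crawl(ROBOTS_TXT, "Googlebot", url))
```

A URL blocked this way can still be known to a search engine through links on other pages, which is the situation the following slides describe.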
10How did they get into Google?
- Google crawls and downloads links in the
documents it encounters
- Some of these links are dead or inaccurate, or
cannot be crawled for other reasons (intranets,
robots.txt)
- The URLs are in the database, but the documents
are not
11Why does Google leave them in?
- They are not COMPLETELY unindexed
- Indexed elements include
- Words in the URL
- http://members.home.net/gourdeaud/
- Words in the anchor text on indexed pages that
link to the unindexed URL
- <a href="members.home.net/gourdeaud/">Gourdeaud's
biography</a>
- Can be useful in URL searches or unique term
queries and PageRank
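To make the two indexed elements concrete, here is a minimal sketch using Python's standard `html.parser` and `re` modules. It pulls the words out of a URL and collects the anchor text of links found on another page. The class and function names are hypothetical, and this is only an illustration of the idea, not Google's indexing code.

```python
import re
from html.parser import HTMLParser


def url_words(url):
    """Split a URL into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", url.lower())


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return the (href, anchor text) pairs found in an HTML fragment."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<p><a href="http://members.home.net/gourdeaud/">Gourdeaud\'s biography</a></p>'
    print(url_words("http://members.home.net/gourdeaud/"))
    print(extract_anchors(page))
```

Both outputs can be produced without ever downloading the target page, which is why such a URL can appear in results with no extract, size, or cached copy.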
12How can I distinguish unindexed pages in search
results?
- No extract
- No page size
- No cached copy of the page
14Deep Web Components: Non-html filetypes (1.75)
SEARCH SYNTAX: california power shortage filetype:pdf
- Adobe Portable Document Format (pdf)
- Adobe PostScript (ps)
- Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk
- Lotus WordPro (lwp)
- MacWrite (mw)
- Microsoft Excel (xls)
- Microsoft PowerPoint (ppt)
- Microsoft Word (doc)
- Microsoft Works (wks, wps, wdb)
- Microsoft Write (wri)
- Rich Text Format (rtf)
- Text (ans, txt)
15Google Non-html Filetypes: Warning!
- FOR NON-HTML FILES
- Clicking on a title in the results list opens the
application as well, involving risk of a virus or
worm that may be attached to the file
- INSTEAD, click the View as HTML option: no
applications will be opened and there is no risk
of a virus or worm
- NOTE: Titles for non-html files are frequently
not descriptive of content
16Non-html filetypes in Google: Notess's Study,
March 6, 2002: 25 One-Word Searches
18homeland security filetype:ppt
19Deep Web Components: Daily re-indexed pages
(.15)
- Over 3 million
- Regular html pages that Google has noticed are
frequently updated
- Google re-indexes these every day or so
- Date of Google's last visit to the page appears
in the results listing
21Google's Database
22Database Freshness
- Refreshes its entire web index on a roughly
monthly basis, about every 28 days
- On-going process
- Some segments fresher than others
23Notess's Study, April 6, 2002: Pages that are
updated daily and report that date
24Database Breadth (Size)
- About 3 billion documents (indexed and unindexed)
- Daily figure on the homepage
- 3,083,324,652 on March 8, 2003
- (Not including Images or Usenet)
- FAST (alltheweb.com) claimed
- 2.1 billion indexed documents,
- March 8, 2003
26Database Depth
- Google typically downloads the first 110 K of a
web document
- Download includes URLs of outgoing links
27Database Blending
- Results from Google's News vertical engine are
included in results for all searches
- Blending is increasingly common among search
services
- News
- Shopping
- Directory
28Relevance Ranking and Link Analysis
- Google's PageRank
- Demystified
29Relevance Ranking
- Processing and presenting retrieved results
- Proprietary information
- Search Engine Optimization Industry has made it
even more so
- "How can I make my site rank high in Google?"
30What happens when I enter a search at Google?
- Check of search syntax and spelling
- Query routed to the appropriate server based on
the database segment on which the answer is
likely to be found
31What happens when I enter a search at Google?
- Processing of Visible text
- Search term(s) position title, heading, text
- Search term(s) frequency
- Search term(s) proximity
- Processing of Invisible text
- Meta tags
- Anchor text (within the <a> tag href)
- <a href="www.hws.edu">Hobart and William Smith
Colleges</a>
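The visible-text signals listed above (position, frequency, proximity) can be sketched with a few lines of Python. This is a toy illustration only: the function names are hypothetical, the sample title and text are made up, and nothing here reflects how Google actually weighs these signals, which the slides describe as proprietary.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def in_title(term, title):
    """Position signal: does the term occur in the page title?"""
    return term.lower() in tokenize(title)


def term_frequency(term, text):
    """Frequency signal: how many times the term occurs in the text."""
    return tokenize(text).count(term.lower())


def min_proximity(term_a, term_b, text):
    """Proximity signal: smallest distance in words between the two terms.

    Returns None if either term is absent.
    """
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


if __name__ == "__main__":
    title = "Synchrotron Radiation Basics"
    body = "Synchrotron radiation is emitted by charged particles. The radiation is bright."
    print(in_title("synchrotron", title))
    print(term_frequency("radiation", body))
    print(min_proximity("synchrotron", "radiation", body))
```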
32What happens when I enter a search at Google?
- PageRank link analysis applied
- Click popularity (Google Toolbar voting data)
- Link context (Proximity of links to your search
term(s) within the document)
- Final dynamic mix of about 25 factors
33PageRank Demystified
- Patented link analysis program
- Part of Google since its beginnings
- Objective: To make ranking more of a human
process
- Assigns each page in Google a PageRank score,
which is dynamic (changeable)
- Weighs heavily in final ranking of results
34PageRank's Multi-layered processing
- Layer I
- Do others think your site is of value as
demonstrated by linking to you?
- IF SO
- Layer II
- Are these others in turn linked to by sites
recognized through linkage within web
communities?
35PageRank's Multi-layered processing
- A Favorable Ranking Scenario
- A .com site selling prosthetics, linked TO by
- A local orthopedic association, in turn linked TO by
- A national orthopedic group, in turn linked TO by
- The National Institutes of Health
36Visualizing Linkage in Google's Database with
TouchGraph
- Browser
- http://www.touchgraph.com/TGGoogleBrowser.html
- Instructions
- http://www.touchgraph.com/TGGB_FullInstructions.html
38How Does Google Identify Web Communities?
- Mutual linkage patterns
- Metadata elements and keywords found in common
- Human examination/verification of the quality of
key sites within the community
- Other proprietary factors ???????
39PageRank Nitty Gritty
- Every page of a site can have a PageRank score,
not just the main page
- The value of a link from Site B to Site A is
decreased with each additional link from Site B
to any other site
- Rationale: If Site B has only a few links,
each one could be more important than if Site B
has hundreds of outgoing links
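The dilution described above matches the formula Brin and Page published in 1998, in which a page's score is divided evenly among its outgoing links. The arithmetic can be sketched as follows; the function name `link_share` is hypothetical, and Google's production scoring may differ from the published formula.

```python
def link_share(page_score: float, outgoing_links: int) -> float:
    """Score passed along each outgoing link when a page's score is split evenly."""
    if outgoing_links <= 0:
        raise ValueError("a page must have at least one outgoing link to pass score")
    return page_score / outgoing_links


if __name__ == "__main__":
    # Same page score, different numbers of outgoing links on Site B.
    print(link_share(1.0, 4))    # few links: each carries a larger share
    print(link_share(1.0, 200))  # hundreds of links: each carries a small share
```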
40PageRank Nitty Gritty
- Requires human adjustment in the case of large
subject directories and quality lists of links
- PageRank scoring is a dynamic process always in
flux
- To find a page's PageRank score, go to the
Toolbar and click on the green meter
41PageRank Feedback
- Site A has NO outgoing links, but is linked TO by
Site B
- Site A decides to create a single link to Site B
- This increases Site B's PageRank score
- Site B's increased score in turn automatically
increases Site A's score
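The feedback scenario can be tried out with a small iteration of the formula from Brin and Page's 1998 paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn link to A and C(T) is the number of outgoing links on T. This is a minimal sketch under that assumption: the damping value 0.85 is the one suggested in the paper, the function name and the two-site graph are illustrative, and the result says nothing definitive about Google's production system.

```python
def pagerank(links, damping=0.85, iterations=200):
    """Iterate the published (unnormalized) PageRank formula.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        updated = {}
        for page in pages:
            # Each linking page passes on its score divided by its out-link count.
            incoming = sum(
                scores[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            updated[page] = (1 - damping) + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    before = pagerank({"B": ["A"]})              # B links to A; A has no out-links
    after = pagerank({"B": ["A"], "A": ["B"]})   # A adds a single link back to B
    print("before:", before)
    print("after: ", after)
```

Under this particular formula both scores rise once Site A links back. Variants that normalize the scores so they sum to a constant behave differently, so the outcome should be read as a property of the sketch rather than a general guarantee.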
42Sounds easy to manipulate
- Possibilities include
- Spam
- Link farms
- Cloaking (sneaky re-directs)
- Google is vigilant
- If Google detects any manipulation of PageRank,
it eliminates the domain from its database and
never crawls there again.
43PageRank Processing
- How does Google know who has linked to Site A,
for example?
- By searching its database for all sites with
links to Site A
- No way to do this by examining Site A, as there
is no physical change to a document when it is
linked TO
44Implications of PageRank
- PageRank is entirely dependent on linkage data
derived from the Google database
- Breadth, depth and freshness of the crawl are
critical to accurate and current data for
PageRank scoring
45A Different Perspective on PR: Anti-Google
- Daniel Brandt claims
- PageRank discriminates against new web sites
(which may not yet be linked to by other sites)
- Google is a careless custodian of private
information (Google associates each search with
a cookie, set to last 36 years)
- Maintains googlewatch.org
46PageRank: A Summary
- All links are not created equal
- Is this site linked TO by good web pages
associated with this topic?
- EXAMPLE: If a page is linked to by a subject
directory (Yahoo, OD, LII), its rank will be
higher than another page with many links from
personal web pages, link farms, etc.
- NOTE: Link Analysis (PageRank) is not the same as
Link Popularity (number of links)
47Searching Google: Touring the Known and the
Unknown
- Please share your discoveries with us!
48Command Searching with Google's Fields (aka
Search Operators)
- Field Searches that cannot be combined with other
search elements
- NOTE: No space allowed between operator and
following text
- cache: retrieves cached version of the specified
URL
- link: retrieves pages that have links to the
specified URL
- related: retrieves pages that are similar to
the specified URL (same as Similar Pages feature
in results listing)
49Command Searching with Google's Fields (aka
Search Operators)
- Field Searches that cannot be combined with other
search elements
- info: retrieves information that Google has about
the specified URL
- stocks: retrieves stock information about the
companies whose ticker symbols follow the stocks:
operator
- stocks:intc (Intel)
50Command Searching with Google's Fields (aka
Search Operators)
- Field Searches that can be combined with other
search elements
- site: restrict results to those from the
specified domain
- site:www.google.com PageRank
- NOTE: retrieves all pages from www.google.com
that contain PageRank anywhere
51Field Searches that can be combined with other
search elements
- allintitle: restrict results to those with all
terms present in the html title element
- allintitle:synchrotron radiation
- intitle: restrict results to those with this
single term in the title element
- intitle:synchrotron intitle:radiation
- NOTE: intitle:synchrotron radiation retrieves
- synchrotron in title and radiation anywhere
52Field Searches that can be combined with other
search elements
- allinurl: restrict results to those with all
terms present in the URL
- Note: ignores all punctuation
- allinurl:usda pesticides
- inurl: restrict results to those with this
single term in the URL
- inurl:usda inurl:pesticides
- NOTE: inurl:usda pesticides retrieves
- usda in URL and pesticides anywhere
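The combinable field searches above are plain query strings, so they are easy to assemble programmatically. The sketch below builds a few of the queries from these slides and URL-encodes one with Python's standard `urllib.parse`. The helper names `field` and `build_query` are hypothetical; only the operator syntax comes from the slides.

```python
from urllib.parse import urlencode


def field(operator, value):
    """Join an operator and its value with a colon and no space in between."""
    return f"{operator}:{value}"


def build_query(*parts):
    """Combine field searches and plain terms into one query string."""
    return " ".join(parts)


if __name__ == "__main__":
    queries = [
        build_query(field("site", "www.google.com"), "PageRank"),
        build_query(field("intitle", "synchrotron"), field("intitle", "radiation")),
        field("allinurl", "usda pesticides"),
    ]
    for query in queries:
        print(query)
    # Encoded form of the first query, suitable for a q= URL parameter.
    print(urlencode({"q": queries[0]}))
```

Keeping the colon-joining in one helper makes it harder to introduce the space between operator and text that the earlier NOTE warns against.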
53Google Answers
- Fee-based answer service
- User sets fee ($2.50 and up) and time frame for
question (Guidelines offered)
- Searchable archive available
- Comments can be added (by anyone) to unanswered
questions
- Users rate answers
54Google Answers: Who are the researchers?
- Must be 18 years old
- Write an essay on why you want to be a researcher
- Answer 5 sample questions
- Training manual available at
- http://answers.google.com/answers/researchertraining.html
55Google API: Application Program Interface
- Free programs for developers and researchers
interested in incorporating Google in their
applications
- Iterative searches on a topic (SDI)
- Search via non-html interfaces
- Games that play with Web information
- Daily limit of 1,000 queries
- Uses SOAP (Simple Object Access Protocol), which is
XML-based
- More at http://google.com/apis/index.html
56Froogle
- New service launched in Dec. 2002
- Locates information about products for sale
online
- Gives URLs of sites offering the item
- Provides links to exact page in the site where
you can make the purchase
57Froogle
- Ranking follows normal Google ranking processes
- Paid placements always clearly marked
- Sort by price may be a future enhancement
- Access at http://froogle.google.com or via Google
Advanced Search
58Google's Hidden Features
- Daterange search
- Wildcard words
- Phonebook command search
- info field search
- Dictionary feature
59Daterange
- Not officially supported at google.com
(unreliable)
- Reliable only through API programs
- At google.com, MAY be most reliable for the past
1 or 2 days
- Searches the date of the document's entry into
the database, not its creation
60Daterange Search Results (???): each day's
entries for dog search executed on Oct. 9,
- Oct. 9: No hits
- Oct. 8: 6
- Oct. 7: about 212,000 (many dated 10/7)
- Oct. 6: about 8,980 (many dated 10/7)
- Oct. 5: about 5,900 (many dated 10/7)
- Oct. 6-7: about 57,100 !!!!
- NOT TRUE DATERANGE FUNCTIONALITY
61With those caveats ...
- Daterange uses the Julian date (Julian day number),
a continuous count of days since noon, UTC, of
Jan. 1, 4713 BC
- Date changes at noon, not midnight
- 2452565 = 12:00 pm Oct. 16 to 11:59 am Oct. 17
- Often used in astronomical and military contexts
- JD converter
- http://aa.usno.navy.mil/data/docs/JulianDate.html
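For readers who prefer to compute the number rather than use a converter, here is a minimal Python sketch. It relies on the fact that `datetime.date.toordinal()` counts days from 0001-01-01 in the proleptic Gregorian calendar, so adding a fixed offset gives the Julian day number of the day that begins at noon UTC on that date. The function names are hypothetical, and which integer Google expects for a given calendar day is an assumption here, not something the slides specify.

```python
from datetime import date

# Offset between Python's date ordinal (0001-01-01 is day 1) and the
# Julian day number of the day that begins at noon UTC on that date.
ORDINAL_TO_JDN = 1721425


def julian_day_number(d: date) -> int:
    """Julian day number for the day starting at noon UTC on date d."""
    return d.toordinal() + ORDINAL_TO_JDN


def daterange_term(start: date, end: date) -> str:
    """Build a daterange: search term covering start through end."""
    return f"daterange:{julian_day_number(start)}-{julian_day_number(end)}"


if __name__ == "__main__":
    # Jan. 1, 2000 at noon UTC is the well-known reference value JD 2451545.
    print(julian_day_number(date(2000, 1, 1)))
    print("news " + daterange_term(date(2000, 1, 1), date(2000, 1, 2)))
```

Some converters report the Julian date at 0h UTC, a value ending in .5; truncating that gives a number one lower than the noon-based value computed here, so it is worth checking a result against a converter before relying on it.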
62Daterange Search for Oct. 14: news
daterange:2452561-2452561 (4,450 hits)
63Phonebook Command Search
- Searches US residential (rphonebook:) and
business (bphonebook:) listings of Yahoo,
MapQuest and other services
- rphonebook:
- MUST INCLUDE
- Last name, City and/or State
- MAY INCLUDE
- First name
- bphonebook:
- MUST INCLUDE
- Business name (min. 1 word), City and/or State
- MAY INCLUDE
- Full Business name
64Wildcard Words
- Google offers a word-sized asterisk to function
as a wildcard
- Stands for a whole word
- Cannot be used for part of a word
- three * mice: 22,000
- three bl* mice: 0
65Wildcard Words
- Several can be used together
- milosevic International Hague
- Retrieves military tribunal OR
- military court OR war tribunal OR military
tribunal
66info:
- Not exactly hidden, but not well-known
- Searches for any information Google has about a
site
- Convenient way to monitor linkage
68Dictionary Feature
- Term(s) in a query for which Google has
definitions are underlined in the text above the
results listing (Searched the Web for ...)
- Clicking on the term(s) sends you to the
dictionary provider (you leave Google)
- Definitions are provided from sources selected
solely on the basis of quality
69A Few Good Alternatives to Google
- FAST - http://alltheweb.com
- Teoma - http://teoma.com
- Gigablast - http://gigablast.com
70Pay-For-Placement and Other Revenue Issues
71Revenue at Google: Selling Search Software
- Provides search software and interface for
portals and corporate intranets ("Powered by
Google")
- Over 150 customers worldwide (Yahoo, Sony,
AOL/Netscape, Cisco Systems)
- Google charges an initial set-up fee and a charge
per 1,000 searches
72Revenue at Google: Advertising (AdWords)
- Ads located to the right of search results
- Cost-per-click model (pay only if someone
actually clicks into your site from Google)
- No monthly minimum charge
73Revenue at Google: Advertising (AdWords)
- Highest bidder does NOT take top placement
- Google measures number of visitors to an
advertiser's site and length of visits
- This popularity-based relevance helps determine
position of an ad
- Offers smaller businesses a chance to compete for
visibility
74Revenue at Google: Premium Sponsorships
- Launched in mid-2002
- Advertisers purchase keywords or phrases
- Limited to no more than two sites per keyword or
phrase
- Highest bidder's site appears at the top of
results listing, labeled Sponsored Site
76 and Ranking: A Mini-Glossary
- Pay-for-Placement
- Paying for a specific position within search
results retrieved using specific search terms
- Pay-for-Inclusion
- Paying for inclusion anywhere within search
results retrieved using specific search terms
- Pay-for-Submission
- Paying to be included in the database (no special
ranking treatment)
- To date, no pay for inclusion or submission at
Google
77Revenue at Google: The Professionals' View
- To date, advertising clearly labeled at Google
- If revenues decline, database size and quality may
be affected
- Development and support of search features and
enhancements will be driven by commercial sector
- Change in ownership can alter the nature and
educational value of any search service
78The Last 12 Months at Google
- Dec. 2001 - Database is at 3 billion
- 2 Billion Web documents (all types)
- 700 Million Usenet Postings
- 330 Million Image files
- March - 3rd party sells advertising based on
PageRank scores
- Ongoing - Accused of censorship and manipulation
of ranking algorithms
79The Last 12 Months at Google
- Sept. 2 - Access to Google (and AltaVista) blocked
in China by Chinese government
- Sept. 11 - Chinese government restores access, but
continues to monitor Google
- Sept. 23 - Re-designed News Service launched
- December - Froogle launched
- Year-End Zeitgeist at
- http://www.google.com/press/zeitgeist2002.html
80Google is Good, but here's a Wish List for Future
Improvements
- Categorization of Results (Folders)
- Teoma, WiseNut, FAST all do
- Nesting
- Way to limit link search to external links only
- Indexing XML documents that have no html
equivalents
- Crawling Deep Web databases
- Advanced NEWS search
- OTHERS??????
81Thank you and best of luck in Getting MORE from
Google!!!
- Michael Hunter
- Reference Librarian
- Hobart and William Smith Colleges
- Geneva, NY 14456
- (315) 781-3552 hunter_at_hws.edu