vinashak @ google

Slides: 82
Provided by: vinashak


1
  • From the Inside Out
  • Michael Hunter
  • Reference Librarian
  • Hobart and William Smith Colleges

2
Google from the Inside Out
  • Hardware and Database Creation
  • Relevance Ranking and Link Analysis
  • Advanced and Hidden Search Features
  • Hands-on Session
  • Pay-for-Placement and Revenue Issues
  • Our Google Wish List
  • Other Services to Keep Our Eyes On

3
Google's Beginnings
  • 1996: Sergey Brin and Larry Page of Stanford
    develop BackRub, based on analysis of links TO
    a page from other sites
  • Sept. 7, 1998, Menlo Park, CA: Google launches
    in beta with over 10,000 queries a day
  • December 1998: Listed in PC Magazine's Top 100
    Websites

4
(No Transcript)
5
What's in a name?
  • Google is a play on googol, a term coined by
    Milton Sirotta, nephew of mathematician Edward
    Kasner, to refer to the number one followed by
    100 zeros

6
Google's Hardware
  • Over 10,000 servers in two locations containing
    hundreds of copies of the database
  • Index of more than 3 billion web documents
  • Handles thousands of queries on a sub-second
    basis
  • Interviews in MP3 format with Chief Operations
    Engineer Jim Reese:
  • http://technetcast.com/tnc_play_stream.html?stream_id=420
    (1 hr. 13 min)
  • http://technetcast.com/tnc_play_stream.html?stream_id=421
    (15 min.)

7
Google's Multi-faceted Database
  • Indexed HTML pages
  • Unindexed HTML pages
  • Other file types
  • HTML pages that are re-indexed daily

8
Multi-faceted Database
9
What types of pages are unindexed? (25%)
  • Dead or inaccurate links
  • Duplicate pages
  • Database-generated URLs
  • Pages with robots.txt or noindex meta tags
  • Pages on an intranet
  • Pages waiting to be indexed fully

10
How did they get into Google?
  • Google crawls and downloads links in the
    documents it encounters
  • Some of these links are dead, inaccurate, or
    cannot be crawled for other reasons (intranets,
    robots.txt)
  • The URLs are in the database, but the documents
    are not
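
The robots.txt case above can be illustrated with Python's standard-library robots.txt parser. This is only a sketch of how a crawler might decide that a discovered URL must not be fetched; the robots.txt content, the user-agent name "ExampleBot", and the example.com URLs are all made up for illustration and do not come from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A crawler can still record both URLs, but may only download the first one.
allowed = can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html")
blocked = can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/private/a.html")
print(allowed, blocked)
```

A URL that fails this check can still be known to the crawler from links on other pages, which is how a URL ends up in the database without its document.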

11
Why does Google leave them in?
  • They are not COMPLETELY unindexed
  • Indexed elements include:
  • Words in the URL, e.g.
    http://members.home.net/gourdeaud/
  • Words in the anchor text on indexed pages that
    link to the unindexed URL, e.g.
    <a href="members.home.net/gourdeaud/">Gourdeaud's biography</a>
  • Can be useful in URL searches or unique term
    queries and PageRank
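
As a rough illustration of the point above, the sketch below pulls the href and anchor text out of a linking page and turns them into searchable words for a URL whose own document was never downloaded. The class and function names are hypothetical, and the word-splitting rule is an assumption; the slides do not describe how Google actually tokenizes URLs or anchor text.

```python
from html.parser import HTMLParser
import re

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def index_terms(href, anchor_text):
    """Words available for a URL even though its document was never fetched."""
    words = re.findall(r"[a-z0-9]+", (href + " " + anchor_text).lower())
    return sorted(set(words))

page = ('<p>See <a href="http://members.home.net/gourdeaud/">'
        "Gourdeaud's biography</a>.</p>")
collector = AnchorCollector()
collector.feed(page)
href, text = collector.links[0]
terms = index_terms(href, text)
print(href, text, terms)
```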

12
How can I distinguish unindexed pages in search
results?
  • No extract
  • No page size
  • No cached copy of the page

13
(No Transcript)
14
Deep Web Components: Non-HTML Filetypes (1.75%)
SEARCH SYNTAX: california power shortage filetype:pdf
  • Adobe Portable Document Format (pdf)
  • Adobe PostScript (ps)
  • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk
  • Lotus WordPro (lwp)
  • MacWrite (mw)
  • Microsoft Excel (xls)
  • Microsoft PowerPoint (ppt)
  • Microsoft Word (doc)
  • Microsoft Works (wks, wps, wdb)
  • Microsoft Write (wri)
  • Rich Text Format (rtf)
  • Text (ans, txt)

15
Google Non-HTML Filetypes: Warning!
  • FOR NON-HTML FILES
  • Clicking on a title in the results list opens the
    application as well, involving risk of a virus or
    worm that may be attached to the file
  • INSTEAD, click the View as HTML option: no
    application is opened, so there is no risk of a
    virus or worm
  • NOTE: Titles for non-HTML files are frequently
    not descriptive of content

16
Non-HTML Filetypes in Google: Notess Study, March
6, 2002, 25 One-Word Searches
17
(No Transcript)
18
homeland security filetype:ppt
19
Deep Web Components: Daily Re-indexed Pages
(.15%)
  • Over 3 million
  • Regular HTML pages that Google has noticed are
    frequently updated
  • Google re-indexes these every day or so
  • Date of Google's last visit to the page appears
    in the results listing

20
(No Transcript)
21
Google's Database
  • Freshness
  • Breadth
  • Depth

22
Database Freshness
  • Refreshes its entire web index on a roughly
    monthly basis, about every 28 days.
  • On-going process
  • Some segments fresher than others

23
Notess Study, April 6, 2002: Pages that are
updated daily and report that date
24
Database Breadth (Size)
  • About 3 billion documents (indexed and unindexed)
  • Daily figure on the homepage
  • 3,083,324,652 on March 8, 2003
  • (Not including Images or Usenet)
  • FAST (alltheweb.com) claimed 2.1 billion indexed
    documents on March 8, 2003

25
(No Transcript)
26
Database Depth
  • Google typically downloads the first 110 K of a
    web document
  • Download includes URLs of outgoing links

27
Database Blending
  • Results from Google's News vertical engine are
    included in results for all searches
  • Blending is increasingly common among search
    services
  • News
  • Shopping
  • Directory

28
Relevance Ranking and Link Analysis
  • Google's PageRank
  • Demystified

29
Relevance Ranking
  • Processing and presenting retrieved results
  • Proprietary information
  • Search Engine Optimization Industry has made it
    even more so
  • How can I make my site rank high in Google?

30
What happens when I enter a search at Google?
  • Check of search syntax and spelling
  • Query routed to the appropriate server based on
    the database segment on which the answer is
    likely to be found

31
What happens when I enter a search at Google?
  • Processing of Visible text
  • Search term(s) position: title, heading, text
  • Search term(s) frequency
  • Search term(s) proximity
  • Processing of Invisible text
  • Meta tags
  • Anchor text (within the <a> tag href)
  • <a href="www.hws.edu">Hobart and William Smith
    Colleges</a>
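
The three visible-text signals named above (position, frequency, proximity) can be measured on a toy page as follows. This is only a sketch under stated assumptions: the function names, the sample page, and the way each signal is counted are invented for illustration, and no weighting is applied because the slides describe the real mix as proprietary.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_signals(title, heading, body, terms):
    """Measure position, frequency, and proximity of query terms on one page."""
    terms = [t.lower() for t in terms]
    fields = {
        "title": tokenize(title),
        "heading": tokenize(heading),
        "text": tokenize(body),
    }
    # Position: which parts of the page contain each term.
    position = {t: [name for name, words in fields.items() if t in words]
                for t in terms}
    # Frequency: how often each term occurs in the body text.
    frequency = {t: fields["text"].count(t) for t in terms}
    # Proximity: smallest word distance between the first two terms in the body.
    proximity = None
    if len(terms) >= 2:
        first = [i for i, w in enumerate(fields["text"]) if w == terms[0]]
        second = [i for i, w in enumerate(fields["text"]) if w == terms[1]]
        if first and second:
            proximity = min(abs(i - j) for i in first for j in second)
    return {"position": position, "frequency": frequency, "proximity": proximity}

signals = term_signals(
    title="Synchrotron Radiation Basics",
    heading="What is synchrotron light?",
    body="Synchrotron radiation is emitted when charged particles curve. "
         "This radiation is very bright.",
    terms=["synchrotron", "radiation"],
)
print(signals)
```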

32
What happens when I enter a search at Google?
  • PageRank link analysis applied
  • Click popularity (Google Toolbar voting data)
  • Link context (Proximity of links to your search
    term(s) within the document)
  • Final dynamic mix of about 25 factors

33
PageRank Demystified
  • Patented link analysis program
  • Part of Google since its beginnings
  • Objective: To make ranking more of a human
    process
  • Assigns each page in Google a PageRank score,
    which is dynamic (changeable)
  • Weighs heavily in final ranking of results

34
PageRank's Multi-layered Processing
  • Layer I
  • Do others think your site is of value as
    demonstrated by linking to you?
  • IF SO
  • Layer II
  • Are these others in turn linked to by sites
    recognized through linkage within web
    communities?

35
PageRank's Multi-layered Processing
  • A Favorable Ranking Scenario:
  • A .com site selling prosthetics, linked TO by
  • A local orthopedic association, in turn linked TO
    by
  • A national orthopedic group, in turn linked TO by
  • The National Institutes of Health

36
Visualizing Linkage in Google's Database with
TouchGraph
  • Browser
  • http://www.touchgraph.com/TGGoogleBrowser.html
  • Instructions
  • http://www.touchgraph.com/TGGB_FullInstructions.html

37
(No Transcript)
38
How Does Google Identify Web Communities?
  • Mutual linkage patterns
  • Metadata elements and keywords found in common
  • Human examination/verification of the quality of
    key sites within the community
  • Other proprietary factors ???????

39
PageRank Nitty Gritty
  • Every page of a site can have a PageRank score,
    not just the main page
  • The value of a link from Site B to Site A is
    decreased with each additional link from Site B
    to any other site
  • Rationale: If Site B has only a few links,
    each one could be more important than if Site B
    has hundreds of outgoing links
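
The dilution effect described above follows from the formula Brin and Page published in 1998, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where C(T) is the number of outgoing links on page T and d is a damping factor. The sketch below iterates that published formula on tiny made-up graphs; it is not Google's production ranking, which the slides describe as proprietary, and the page names and damping value of 0.85 are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate the published PageRank formula.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page passes on its rank divided by its link count.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank

# Site B links only to A; Site C links to A and to 99 other pages.
few = pagerank({"B": ["A"]})
many = pagerank({"C": ["A"] + [f"X{i}" for i in range(99)]})
print(few["A"], many["A"])
```

In this toy setup Site A scores higher when its only backlink comes from a page with one outgoing link than from a page with a hundred, which is the rationale given above.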

40
PageRank Nitty Gritty
  • Requires human adjustment in the case of large
    subject directories and quality lists of links
  • PageRank scoring is a dynamic process always in
    flux
  • To find a page's PageRank score, go to the
    Google Toolbar and click on the green meter

41
PageRank Feedback
  • Site A has NO outgoing links, but is linked TO by
    Site B
  • Site A decides to create a single link to Site B
  • This increases Site B's PageRank score
  • Site B's increased score in turn automatically
    increases Site A's score

42
Sounds easy to manipulate
  • Possibilities include
  • Spam
  • Link farms
  • Cloaking (sneaky re-directs)
  • Google is vigilant
  • If Google detects any manipulation of PageRank,
    it eliminates the domain from its database and
    never crawls there again.

43
PageRank Processing
  • How does Google know who has linked to Site A,
    for example?
  • By searching its database for all sites with
    links to Site A
  • No way to do this by examining Site A, as there
    is no physical change to a document when it is
    linked TO
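
A minimal sketch of the idea above: since nothing on Site A records who links to it, the backlinks have to be derived by inverting the outgoing-link data gathered during the crawl. The site names and the dictionary layout are hypothetical; the slides do not say how Google stores its link data.

```python
from collections import defaultdict

def build_backlinks(outgoing):
    """Invert {page: [pages it links to]} into {page: {pages linking to it}}."""
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return dict(backlinks)

# Hypothetical crawl data: only outgoing links are visible on each page.
crawl = {
    "site-a.example": [],
    "site-b.example": ["site-a.example"],
    "site-c.example": ["site-a.example", "site-b.example"],
}
backlinks = build_backlinks(crawl)
print(sorted(backlinks["site-a.example"]))
```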

44
Implications of PageRank
  • PageRank is entirely dependent on linkage data
    derived from the Google database
  • Breadth, depth and freshness of the crawl are
    critical to accurate and current data for
    PageRank scoring

45
A Different Perspective on PR: Anti-Google
  • Daniel Brandt claims:
  • PageRank discriminates against new web sites
    (which may not yet be linked to by other sites)
  • Google is a careless custodian of private
    information (it associates each search with a
    cookie set to last 36 years)
  • Maintains googlewatch.org

46
PageRank: A Summary
  • All links are not created equal
  • Is this site linked TO by good web pages
    associated with this topic?
  • EXAMPLE: If a page is linked to by a subject
    directory (Yahoo, OD, LII), its rank will be
    higher than another page with many links from
    personal web pages, link farms, etc.
  • NOTE: Link Analysis (PageRank) is not the same as
    Link Popularity (number of links)

47
Searching Google: Touring the Known and the
Unknown
  • Please share your discoveries with us!

48
Command Searching with Google's Fields (aka
Search Operators)
  • Field Searches that cannot be combined with other
    search elements
  • NOTE: No space allowed between operator and
    following text
  • cache: retrieves the cached version of the
    specified URL
  • link: retrieves pages that have links to the
    specified URL
  • related: retrieves pages that are similar to
    the specified URL (same as Similar Pages feature
    in results listing)

49
Command Searching with Google's Fields (aka
Search Operators)
  • Field Searches that cannot be combined with other
    search elements
  • info: retrieves information that Google has about
    the specified URL
  • stocks: retrieves stock information about the
    companies whose ticker symbols follow the stocks:
    operator
  • stocks:intc (Intel)

50
Command Searching with Google's Fields (aka
Search Operators)
  • Field Searches that can be combined with other
    search elements
  • site: restricts results to those from the
    specified domain
  • site:www.google.com PageRank
  • NOTE: retrieves all pages from www.google.com
    that contain PageRank anywhere

51
Field Searches that can be combined with other
search elements
  • allintitle: restricts results to those with all
    terms present in the HTML title element
  • allintitle:synchrotron radiation
  • intitle: restricts results to those with this
    single term in the title element
  • intitle:synchrotron intitle:radiation
  • NOTE: intitle:synchrotron radiation retrieves
    synchrotron in the title and radiation anywhere

52
Field Searches that can be combined with other
search elements
  • allinurl: restricts results to those with all
    terms present in the URL
  • NOTE: ignores all punctuation
  • allinurl:usda pesticides
  • inurl: restricts results to those with this
    single term in the URL
  • inurl:usda inurl:pesticides
  • NOTE: inurl:usda pesticides retrieves
    usda in the URL and pesticides anywhere
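
The field searches above all follow one rule: operator, colon, value, with no space in between. The helper functions below are a hypothetical sketch for assembling such queries and URL-encoding them; the function names and the base search URL are assumptions for illustration, not part of the presentation.

```python
from urllib.parse import urlencode

def field(operator, value):
    """Join operator and value with a colon and no space between them."""
    return f"{operator}:{value}"

def build_query(*parts):
    """Combine field searches and plain terms into one query string."""
    return " ".join(part for part in parts if part)

def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query; the base URL is an assumed default."""
    return f"{base}?{urlencode({'q': query})}"

site_query = build_query(field("site", "www.google.com"), "PageRank")
title_query = build_query(field("intitle", "synchrotron"),
                          field("intitle", "radiation"))
url_query = build_query(field("inurl", "usda"), "pesticides")
print(site_query)
print(title_query)
print(search_url(url_query))
```

Repeating intitle: or inurl: for each word keeps every term restricted to that field, whereas a bare word after a single operator is matched anywhere, as the NOTE lines above point out.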

53
Google Answers
  • Fee Based answer service
  • User sets fee ($2.50 and up) and time frame for
    question (Guidelines offered)
  • Searchable archive available
  • Comments can be added (by anyone) to unanswered
    questions
  • Users rate answers

54
Google Answers: Who are the researchers?
  • Must be 18 years old
  • Write an essay on why you want to be a researcher
  • Answer 5 sample questions
  • Training manual available at
  • http://answers.google.com/answers/researchertraining.html

55
Google API: Application Program Interface
  • Free programs for developers and researchers
    interested in incorporating Google in their
    applications
  • Iterative searches on a topic (SDI)
  • Search via non-html interfaces
  • Games that play with Web information
  • Daily limit of 1,000 queries
  • Uses SOAP (Simple Object Access Protocol) that is
    XML-based
  • More at http://google.com/apis/index.html

56
Froogle
  • New Service launched in Dec, 2002
  • Locates information about products for sale
    online
  • Gives URLs of sites offering the item
  • Provides links to exact page in the site where
    you can make the purchase

57
Froogle
  • Ranking follows normal Google ranking processes
  • Paid placements always clearly marked
  • Sort by price may be a future enhancement
  • Access at http://froogle.google.com or via Google
    Advanced Search

58
Google's Hidden Features
  • Daterange search
  • Wildcard words
  • Phonebook command search
  • info: field search
  • Dictionary feature

59
Daterange
  • Not officially supported at google.com
    (unreliable)
  • Reliable only through API programs
  • At google.com, MAY be most reliable for the past
    1 or 2 days
  • Searches the date of the document's entry into
    the database, not its creation

60
Daterange Search Results (???): each day's
entries for a dog search executed on Oct. 9
  • Oct. 9: No hits
  • Oct. 8: 6
  • Oct. 7: about 212,000 (many dated 10/7)
  • Oct. 6: about 8,980 (many dated 10/7)
  • Oct. 5: about 5,900 (many dated 10/7)
  • Oct. 6-7: about 57,100 (!!!!)
  • NOT TRUE DATERANGE FUNCTIONALITY

61
With those caveats ...
  • Daterange uses Julian dates (not the Julian
    calendar): a continuous count of days since
    noon UTC on Jan. 1, 4713 BC
  • Date changes at noon, not midnight
  • 2452565 = 12:00 pm Oct. 16 to 11:59 am Oct. 17
  • Often used in astronomical and military contexts
  • JD converter:
  • http://aa.usno.navy.mil/data/docs/JulianDate.html
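
As an alternative to an online converter, a Julian Date can be computed from a calendar date with the Python standard library. The sketch below relies on a well-known fixed offset: Jan. 1, 2000 at 00:00 UTC is JD 2451544.5. Truncating the value to a whole number for the daterange operator is an assumption, inferred from the Oct. 14 example in this deck and taking the year to be 2002; the function names are hypothetical.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (day 1 = 0001-01-01)
# and the Julian Date at 00:00 UTC of the same calendar day.
ORDINAL_TO_JD = 1721424.5

def julian_date_at_midnight(day: date) -> float:
    """Julian Date at 00:00 UTC on the given calendar day."""
    return day.toordinal() + ORDINAL_TO_JD

def daterange_operator(start: date, end: date) -> str:
    """Build a daterange: term from the integer part of each Julian Date.

    Dropping the .5 fraction is an assumed convention, not documented here.
    """
    first = int(julian_date_at_midnight(start))
    last = int(julian_date_at_midnight(end))
    return f"daterange:{first}-{last}"

jd = julian_date_at_midnight(date(2000, 1, 1))
term = daterange_operator(date(2002, 10, 14), date(2002, 10, 14))
print(jd, term)
```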

62
Daterange Search for Oct. 14: news
daterange:2452561-2452561 (4,450 hits)
63
Phonebook Command Search
  • Searches US residential (rphonebook:) and
    business (bphonebook:) listings of Yahoo,
    MapQuest and other services
  • rphonebook:
  • MUST INCLUDE: Last name; City and/or State
  • MAY INCLUDE: First name
  • bphonebook:
  • MUST INCLUDE: Business name (min. 1 word); City
    and/or State
  • MAY INCLUDE: Full Business name

64
Wildcard Words
  • Google offers a word-sized asterisk to function
    as a wildcard
  • Stands for a whole word
  • Cannot be used for part of a word
  • three * mice: 22,000
  • three bl* mice: 0

65
Wildcard Words
  • Several can be used together
  • milosevic International Hague
  • Retrieves military tribunal OR
  • military court OR war tribunal OR military
    tribunal

66
info:
  • Not exactly hidden, but not well-known
  • Searches for any information Google has about a
    site
  • Convenient way to monitor linkage

67
(No Transcript)
68
Dictionary Feature
  • Term(s) in a query for which Google has
    definitions are underlined in the text above the
    results listing (Searched the Web for )
  • Clicking on the term(s) sends you to the
    dictionary provider (you leave Google).
  • Definitions are provided from sources selected
    solely on the basis of quality

69
A Few Good Alternatives to Google
  • FAST - http://alltheweb.com
  • Teoma - http://teoma.com
  • Gigablast - http://gigablast.com

70
Pay-For-Placement and Other Revenue Issues
71
Revenue at Google: Selling Search Software
  • Provides search software and interface for
    portals and corporate intranets (Powered by
    Google)
  • Over 150 customers worldwide (Yahoo, Sony,
    AOL/Netscape, Cisco Systems)
  • Google charges an initial set-up fee and a charge
    per 1,000 searches

72
Revenue at Google: Advertising (AdWords)
  • Ads located to the right of search results
  • Cost-per-click model (pay only if someone
    actually clicks into your site from Google)
  • No monthly minimum charge

73
Revenue at Google: Advertising (AdWords)
  • Highest bidder does NOT take top placement
  • Google measures number of visitors to an
    advertiser's site and length of visits
  • This popularity-based relevance helps determine
    position of an ad
  • Offers smaller businesses a chance to compete for
    visibility

74
Revenue at Google: Premium Sponsorships
  • Launched in mid-2002
  • Advertisers purchase keywords or phrases
  • Limited to no more than two sites per keyword or
    phrase
  • Highest bidder's site appears at the top of
    results listing, labeled Sponsored Site

75
(No Transcript)
76
and Ranking: A Mini-Glossary
  • Pay-for-Placement: Paying for a specific position
    within search results retrieved using specific
    search terms
  • Pay-for-Inclusion: Paying for inclusion anywhere
    within search results retrieved using specific
    search terms
  • Pay-for-Submission: Paying to be included in the
    database (no special ranking treatment)
  • To date, no pay for inclusion or submission at
    Google

77
Revenue at Google: The Professional's View
  • To date, advertising clearly labeled at Google
  • If revenues decline, database size and quality may
    be affected
  • Development and support of search features and
    enhancements will be driven by commercial sector
  • Change in ownership can alter the nature and
    educational value of any search service

78
The Last 12 Months at Google
  • Dec. 2001 - Database is at 3 billion:
  • 2 Billion Web documents (all types)
  • 700 Million Usenet Postings
  • 330 Million Image files
  • March - 3rd party sells advertising based on
    PageRank scores
  • Ongoing - Accused of censorship and manipulation
    of ranking algorithms

79
The Last 12 Months at Google
  • Sept 2 - Access to Google (and Altavista) blocked
    in China by Chinese Government
  • Sept 11 - Chinese government restores access, but
    continues to monitor Google
  • Sept. 23 - Re-designed News Service launched
  • December - Froogle launched
  • Year-End Zeitgeist at
    http://www.google.com/press/zeitgeist2002.html

80
Google is Good, but here's a Wish List for Future
Improvements
  • Categorization of Results (Folders)
  • Teoma, WiseNut, FAST all do
  • Nesting
  • Way to limit link search to external links only
  • Indexing XML documents that have no html
    equivalents
  • Crawling Deep Web databases
  • Advanced NEWS search
  • OTHERS??????

81
Thank you and best of luck in Getting MORE from
Google!!!
  • Michael Hunter
  • Reference Librarian
  • Hobart and William Smith Colleges
  • Geneva, NY 14456
  • (315) 781-3552 hunter_at_hws.edu