Title: Blog Search Engines
1Blog Search Engines
- Poonam Bhatti, Mimi Lam, Paul MacDonell, and
Barrie Olmstead
- LIBR 557 Advanced Information Retrieval
- November 21, 2005
2Brief History of Blogging
- Blogs are an outgrowth of the alternative
press that emerged in the 1960s.
- In the strictest sense, a blog is someone's
online record of the Web sites he or she visits.
- 1999: Brigitte Eaton starts the first portal
devoted to blogs, with about 50 listings.
- In July 1999, a Toronto programmer named Andrew
Smales launched the first do-it-yourself blog
tool, Pitas.com, helping to facilitate an
"online diary" community. Smales later developed
a sister site, Diaryland.
3Blogging History Contd
- Blogger.com, launched in August 1999 by
Evan Williams, Paul Bausch, and Meg Hourihan, is
a tool that enables anyone not only to create and
maintain a blog but also to store it on their own
server, with a personalized address, rather than
on a remote host.
- Blogger gave users a free, "for-dummies" choice of
templates, a short and easy-to-navigate
registration process, and Web hosting.
- In the November 2000 issue of The New Yorker,
Rebecca Mead christened the blogging phenomenon
"the CB radio of the Dave Eggers generation."
- The events of September 11, 2001, had a huge
impact on blogging activity, as people felt
compelled to discuss, argue, rant, and mourn
online. There was also a prevailing
dissatisfaction with Big Media.
- It is estimated that, as of summer 2005, there
are over 10 million blogs in the blogosphere.
4The Structure and Anatomy of Blogs
- Blogs are Web pages that consist of
individual posts arranged in reverse
chronological order.
- A blogger may read a piece of news or a tidbit of
interest and post it to his or her blog within
seconds. Feed aggregators then query the site
about every hour, checking for new feed content.
- Blogs have a low barrier to entry: one may
post using email, voicemail, Web forms, or a
downloaded WYSIWYG program.
- Many hosted blogs have preformatted template
choices that do not require a detailed knowledge
of HTML, CSS, or XML.
- Blogs have home pages with mostly static content
and a list of recent posts.
- Posts usually have the following fields: title,
date/time stamp, body, comments, trackbacks, and
permalinks.
- Archives are created contiguously.
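The post anatomy on this slide can be sketched as a small data structure. This is only an illustration: the class name `Post`, the helper `home_page`, and the sample permalinks are hypothetical and do not come from any particular blogging tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Post:
    """One blog post carrying the fields listed on the slide."""
    title: str
    timestamp: datetime
    body: str
    permalink: str
    comments: list = field(default_factory=list)
    trackbacks: list = field(default_factory=list)


def home_page(posts, limit=10):
    """Return the most recent posts, newest first (reverse chronological)."""
    return sorted(posts, key=lambda p: p.timestamp, reverse=True)[:limit]


# Hypothetical sample data for illustration only.
posts = [
    Post("First post", datetime(2005, 11, 1, 9, 0), "Hello", "/2005/11/first-post"),
    Post("Second post", datetime(2005, 11, 20, 18, 30), "Update", "/2005/11/second-post"),
]
recent_titles = [p.title for p in home_page(posts)]
```

Sorting on the timestamp is what puts the newest material at the top of the page; the permalink gives each post a stable address once it scrolls off the home page.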
5Feeds
- Feed: a file in which a blog lists its latest
content/posts
- RSS feed: new information from a website in a
format that an RSS reader can read
- RSS reader: a program that checks web pages for
new content in RSS format
- Lists the newest content in a format that software
can read; shows at once when a blog has been updated
- May contain only the title of the post, the title
plus the first few lines of a post, or the entire
post
- Most blog search engines do not crawl the entire
web; they crawl RSS feeds
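To make the idea of a feed concrete, here is a minimal sketch of how a reader or crawler might pull the latest items out of an RSS 2.0 file using Python's standard library. The sample feed, the blog address, and the function name `read_items` are made up for illustration; real feeds vary and often need more defensive parsing.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical RSS 2.0 feed. The second item carries only a title
# and link, which mirrors feeds that publish titles without any post text.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example.com/</link>
    <item>
      <title>Newest post</title>
      <link>http://blog.example.com/2005/11/newest</link>
      <description>First few lines of the newest post...</description>
    </item>
    <item>
      <title>Older post</title>
      <link>http://blog.example.com/2005/11/older</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a list of dicts, one per <item>, in feed order."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # Missing descriptions become an empty string.
            "description": item.findtext("description", default=""),
        })
    return items


items = read_items(SAMPLE_RSS)
```

Because the feed is small and structured, a search engine can re-fetch it far more often than it could re-crawl the whole site.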
6Why are blogs difficult to search?
- Content changes on a daily, or sometimes hourly,
basis
- Blogs are organized with the most recent material
at the top, while older material is further down
the page
- Most blogs don't have descriptive titles
- Many topics on the same page
- Difficult to search in traditional format
- While larger search indexes (such as Google)
index weblogs, they do not crawl the web frequently
enough to provide the most up-to-date information
- They miss the immediacy of weblogs
- Most larger search engines have changed their
algorithms so that blogs are not the most highly
ranked sites
7Blog Search Engines
- Need for specialized blog search engines
- Blog search engines crawl RSS feeds often, and
therefore provide the most recent material
- Blog search engines are more focused; they provide
today's internet
8Blog Search Engines
- Two types of blog search engines
- Directory or index style
- Weblogs organized into categories (language,
country, alphabetical order, topic, etc.)
- Examples: Globe of Blogs, Blogwise, Weblogs.com,
Blog Universe
- Free-text search engines
- Perform keyword searches
- Examples: Feedster, Waypath, Technorati, Daypop
9Weblogs.com
- Ping server that automatically notifies
subscribers when new content is posted to a
website or blog
- Receives millions of pings every day from blogs
that have configured their publishing software to
notify Weblogs.com the moment content is
published
- Must be told when a weblog has changed
- Doesn't check automatically
- A blogging tool or content management system can
be programmed to tell Weblogs.com about the change
- No information available on how many blogs it
tracks
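As a rough sketch of what "telling" a ping server looks like, the snippet below builds the body of an XML-RPC ping without sending it. It assumes the conventional `weblogUpdates.ping` method, which takes a blog name and a blog URL; the slides do not name the method or the endpoint, so treat both as assumptions. The blog name and URL are placeholders.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Build the XML-RPC request body for a change notification.

    Assumes the conventional method name "weblogUpdates.ping";
    no network call is made here.
    """
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


# Placeholder blog details for illustration only.
payload = build_ping_request("Example Blog", "http://blog.example.com/")

# Decoding the payload recovers the parameters and the method name.
params, method = xmlrpc.client.loads(payload)
```

A publishing tool would POST a body like this to the ping server each time a post goes live, which is why the server never has to poll the blog itself.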
10Weblogs.com
11Blog Universe
12Blog Universe - News
13Blog Universe
- Browse by topic; no other access points
- 7,682 websites in directory
- Blogs added by users
- Person adding the blog decides which category to
put it in
- Results change constantly
14Globe of Blogs
15Globe of Blogs Browse by Title
16Globe of Blogs - Results
17Globe of Blogs
- 28,641 weblogs registered
- Can register a blog with Globe of Blogs to have it
appear in the directory
- Weblogs indexed by author (name and birthday),
title, topic, and location
- Ability to search within the directory
- Links checked periodically throughout the year to
remove dead links
18Feedster
- Free-text blog search engine dedicated to
indexing and finding blogs
- Mission: allow organizations and individuals to
harness the rich quantity of information available
in the RSS universe
- Largest index of RSS feeds: searchable index of
over 17 million feeds and hundreds of millions of
XML documents
- Feedster's proprietary technology continuously
crawls the web, fetching updated posts and RSS feeds
- Index of over 17 million feeds refreshed several
times per hour; adds millions of new documents
daily
- Enables delivery of a fresh index of information
from millions of sources more frequently than
traditional search engines
19Feedster Basic Search
20Feedster Advanced Search
21Feedster Search Tools
- Supports Boolean searching
- Available operators: OR, NOT, NEAR; AND is
implied
- Proximity searches
- Use double quotes to search for a phrase
- NEAR: sets how close terms should be; NEAR can
also be used to specify the order of terms
- Wildcard/Truncation
- Stemming: automatically searches for singular and
plural forms of search terms
- Wildcards: supports single (?) and multiple (*)
character wildcards
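The difference between single- and multiple-character wildcards can be illustrated with a short sketch. It assumes the multi-character wildcard is the conventional asterisk and uses Python's `fnmatch` module, which treats `?` and `*` the same way; it is not Feedster's implementation, and the term list and function name `wildcard_search` are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, terms):
    """Return the terms matching a pattern where ? stands for exactly one
    character and * stands for any run of characters (including none)."""
    pattern = pattern.lower()
    return [t for t in terms if fnmatchcase(t.lower(), pattern)]


# Hypothetical index vocabulary for illustration only.
index_terms = ["blog", "blogs", "blogger", "blogging", "bog", "weblog"]

single = wildcard_search("blo?", index_terms)   # one character after "blo"
multi = wildcard_search("blog*", index_terms)   # anything after "blog"
```

Here `blo?` matches only four-letter terms beginning with "blo", while `blog*` also picks up longer forms such as "blogger" and "blogging".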
22Feedster Search Tools (contd)
- Search Syntax
- Supports field searching: can limit search to
title, description, author, feed ID, top-level
domain, site, host, URL, encoding
- Can limit search by date or time
- Can search for range of numbers . . .
- Available functions
- typo: use if you don't know how to spell a term or
want to find incorrect spellings of a term
- soundslike: finds variations of a term (e.g.,
soundslike:chicken will find chicken, chickennn,
and chukin)
- literal: use if you want to find an exact character
23Feedster Special Features
- Can save a customized search and check back hourly
or daily to see what the latest postings on a topic
are
- Provides a cached version of pages it finds; can
still look at a page even if it has disappeared
24Feedster - Performance
- Works quickly
- Results vary minute by minute
- Some of the same blogs appear on the first page of
the results list, but in a different order
- Not always immediately clear how results relate
to search terms
25Feedster Results Display
- Results ranked by date (newest items first)
- Can also choose to rank by relevance
- When ranking by relevance, can assign each term a
weight for importance, with 1 being the normal
value
- Each result looks like this:
26Feedster Search Results
Results sorted by date
Results sorted by relevance
27Feedster Help and Support
- Very detailed help and search tips
- Explains how to use Feedster
- Available search tools
- Search tips
- How to interpret results
28Feedster What's Missing?
- Cannot limit results to pages written in a certain
language
- Feedster is in the process of adding this feature
29Waypath
- A blog discovery engine
- Includes Blogs on News
- Utilizes Topic Streams
- Covers fewer blogs than such engines as
Technorati or PubSub, but it uses two types of
RSS feeds for results
- It is also one of the few search engines that
index the entire blog, and not just the feed.
30Waypath Basic Search
31Waypath Basic Search
32Waypath Advanced Search
33Waypath Search Tools
- Supports Boolean operators
- Supports using parentheses to group clauses to
form sub-queries
- Supports single (?) and multiple (*) character
wildcard searches
- Supports fuzzy searches, which look for terms that
are similar in spelling to the query term
- Supports proximity searching
34Waypath Special Features
- Ranks the posts it returns based on matches
between the query terms and the terms found in
individual posts
- Weighting terms
- By default, unmodified query terms have a weight
of 1.0.
- One may specify a new weight by appending it to
the end of a term, after a separator symbol
- The Waypath bookmarklet feature allows a searcher
to access Waypath's related weblog posts from any
page by clicking on a bookmark link.
35Waypath Performance
- Retrieval is slow
- Reasoning behind the order of the search results
is unclear
36Waypath Results Display
37Waypath Results Display
38Waypath Help and Support
- Waypath has help information on
- Single term searching
- Using phrases in quotes
- Boolean operators
- Grouping
- Wildcard operators
- Weighting terms
- Fuzzy searches
- Proximity searches
39Waypath What's Missing
- Does not have cached search results for free-text
searching
- Cannot search for blogs in languages other than
English
- Could be more comprehensive in terms of the
number of blogs indexed
40Technorati
- A real-time search engine
- Currently tracking 21.3 million blogs and 1.7
billion links
- Rankings are based on the number of sources that
point to a particular blog relative to other blogs
41Technorati Basic Search
- Search box available on the homepage
42Technorati Advanced Search
- On the Search page there is an options arrow that
expands to reveal advanced features
43Technorati Advanced Search
44Technorati Advanced Search
45Technorati Advanced Search
46Technorati Search Tools
- Boolean operators are integrated into search
boxes
- Exact phrase searching is also integrated
- URL search
- Metadata tags search
47Technorati Special Features
- Technorati Blog Finder Beta
- Blog Finder uses metadata tags to categorize
posts.
- Technorati is currently tracking 3 million tags
in numerous languages.
48Technorati Special Features
- Technorati Membership
- Offers users special features for signing up
- Membership is free
- Create a profile page
- Add a photo to your blog
- Claim your blog and trick it out with Technorati
tools
- Receive a Watchlist that keeps you up to
date on topics that you select
49Technorati Performance
- Retrieval was fairly fast
- Photos took longer to load, but everything loaded
within a reasonable time
- Results seemed to be relevant to search terms
- Did deliver what they promised: the results
ranked first were usually posted less than an
hour ago
50Technorati Results Display
51Technorati Results Display
- Website URL search result
52Technorati Results Display
53Technorati Help and Support
- Has various help topics pages to assist users
- Using Technorati
- Blogging 101
- Frequently Asked Questions
- Publisher Guide
- Tags
- Blog Finder
- There is also a Contact Us link beside the Help
section that users can resort to with questions.
54Technorati What's Missing
- Cannot limit search by time or date
- Does not support proximity searching
- Does not support wildcard or fuzzy searches
55Daypop http://www.daypop.com
- Well designed, easy to use, one of the most
respected news and blog search engines on the web
56Daypop
- Created and maintained by Daniel Chan
- Chan blogged the 2000 US election but could not
share and retrieve web info
- Daypop went online in August 2001
- A year later it searched over 7,500 sites
- Now covers over 59,000 news sites, weblogs, and
RSS feeds
57Daypop was down for a while recently
- Daypop has been described as "the front page of
the Internet"
- A couple of posts to the Daypop weblog when it
was down, and Dan was on vacation
58Daypop Search Engine
- Keywords automatically ANDed
- Uses + to force inclusion, - to force exclusion
- Phrase searching: "multiple phrase searching" or
multiple-phrase-searching (with dashes)
- Also period, slash (e.g., good for dates),
backslash, underscore, and ampersand
- Drop-down: search either news or blogs (or both
at the same time); also RSS news headlines or RSS
blog posts
- Advanced page: limit by language, country, time
period (3 hours min, 2 weeks max), and results per
page
59Daypop Pages that link to a URL
- E.g., Noam Chomsky's blog, Turning the Tide
- Use link: followed by the URL of Chomsky's blog
60Daypop Human and Automatic
- Daypop is different from most search engines:
it uses a human-edited list of sources to index
- Daypop also crawls using its daypopbot/0.2 spider
- Meta tag to exclude:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
- Major news sites (crawled every 3 hours), lesser
ones (every 24 hours)
- Blogs (crawled every 12 hours)
- Faster if using Weblogs.com ping notification
- Ranking algorithm: word placement and proximity
(e.g., a word in the title, two search words near
each other)
- Does not use the Daypop Score when returning
results
61Daypop Scoring for Authority or Importance
- Citation: a link (a web page cites another)
- Daypop Scoring gives more weight to citations
that come from popular blogs
- Idea is that this reflects relevance in a more
meaningful way
- Citation analysis simply counts total links
(i.e., how many bloggers link to that blog?)
- Technorati authority: links (citations)
- Daypop authority: measured 2 ways
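The contrast between plain citation counting and popularity-weighted scoring can be sketched with a toy link graph. This is not Daypop's actual formula, which the slides do not give; the weighting rule (one plus the linking site's own citation count), the site names, and the function names are all assumptions made for illustration.

```python
# Hypothetical link graph: each key is a site, each value the sites it links to.
links = {
    "popular-blog": ["target-a"],
    "small-blog-1": ["target-b"],
    "small-blog-2": ["target-b"],
    "reader-1": ["popular-blog"],
    "reader-2": ["popular-blog"],
    "reader-3": ["popular-blog"],
}


def citation_counts(links):
    """Plain citation analysis: count how many sites link to each target."""
    counts = {}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def weighted_scores(links):
    """Weighted variant: a citation counts for more when the citing site is
    itself heavily cited. The weight 1 + citations(source) is an assumption."""
    counts = citation_counts(links)
    scores = {}
    for source, targets in links.items():
        weight = 1 + counts.get(source, 0)
        for target in targets:
            scores[target] = scores.get(target, 0) + weight
    return scores


plain = citation_counts(links)
weighted = weighted_scores(links)
```

In this toy graph, target-b wins on raw counts (two links against one), but target-a wins once its single link from a widely cited blog is given extra weight.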
62Daypop Scoring Vs. Citation Ranking
63Daypop Trend Analysis
- Top 40: unfiltered list of links to what is
currently popular (updated as soon as crawled)
64Other Daypop Trend Analysis
- Filter results for Top News Posts or Top Blog
Posts; like the Top 40, ranked based on links
- Word bursts (blogs): measurement of words with
recent heightened usage (what is being written
versus what is being linked to)
- News bursts: same algorithm, but for front pages
of news sites
- Interesting feature is Top Wishlist, which tracks
the weblogging community's Amazon wishlists and the
items on those lists.
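One simple way to think about "recent heightened usage" is to compare how often a word appears in recent posts against how often it appeared in an earlier baseline. The sketch below does exactly that; it is an illustrative stand-in, not Daypop's algorithm, and the function name `word_bursts`, the smoothing, the `min_count` threshold, and the sample posts are all assumptions.

```python
from collections import Counter


def word_bursts(recent_posts, baseline_posts, min_count=2):
    """Rank words by how much more frequent they are in recent posts
    than in the baseline. Returns words, most "bursty" first."""
    recent = Counter(w for post in recent_posts for w in post.lower().split())
    baseline = Counter(w for post in baseline_posts for w in post.lower().split())
    recent_total = sum(recent.values()) or 1
    baseline_total = sum(baseline.values()) or 1

    scores = {}
    for word, count in recent.items():
        if count < min_count:
            continue  # ignore words seen only once recently
        recent_rate = count / recent_total
        # Add-one smoothing so words absent from the baseline do not divide by zero.
        baseline_rate = (baseline[word] + 1) / (baseline_total + 1)
        scores[word] = recent_rate / baseline_rate
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical post text for illustration only.
baseline_posts = ["the library opened a new branch",
                  "the conference was about the library"]
recent_posts = ["the avian flu story",
                "more on the avian flu",
                "the library and avian flu"]
bursting = word_bursts(recent_posts, baseline_posts)
```

Common words such as "the" are frequent in both periods and so score low, while words that suddenly appear in many recent posts rise to the top.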
65Daypop Top Wishlist
66Blog only Search Results
- Similar to Google (keyword in context, cached
copy, size)
- Citations: links; uses N (news source) and W
(weblogs)
67Daypop Comments and Criticism
- Limited number of sites indexed is both a plus
and a minus
- Approach to Analysis of Web Pages
- Element of Human Control
- A few duplicate links
- W (weblog) and N (news) labels occasionally
mistaken
- Engine can be slow
- Archives not working properly
- Occasionally whole site is down
- No subject indexing
- No related tags on results page, as with
Technorati
- Lack of Boolean OR (a single search for bird flu
or avian flu is not possible)
68Blogs and More
- End of formal part of presentation
- Remaining slides designed to initiate discussion
69UBC, Librarians, and Blogs
- Some SLAIS students have maintained blogs
- Emily Yearwood-Lee (temp summer blog):
http://coffeespoonsafternoons.blogspot.com
- Heidi Dolamore: http://quiddle.blogspot.com/
- Cheryl Hill: http://bc-scrapbook.blogspot.com/
- Recent SLAIS grad Sunni Nishimura added an RSS,
Wikis and Blog page to the UBC Library site
- Many libraries maintain various blogs, and
conferences are often blogged. The 2005 Internet
Library Conference blog is a recent example. An
example of a librarian blog is Sites and
Soundbytes
- UBC Librarian Dean Giustini keeps a blog on
Google Scholar
- Map of Blogging Librarians:
http://www.frappr.com/blogginglibrarians
- Any UBC student can start their own blog through
the UBC Office of Learning Technology
70Blogs as Controversial Items
- Little scholarly literature exists on blogs and
related issues
- Still less on blog search engines
- But the number of blogs keeps growing:
http://www.sifry.com/alerts/archives/000343.html
- Blogs as news items (or the creators of news)
occasionally make it into mainstream media;
examples include Dan Rather, Kryptonite locks,
and the current issue of TIME Magazine, with its
small piece about blogs and rioters in France
- So what is the deal with blogs? Egotistical
navel-gazing or real social force? Or something
else?
- Blogs as a means of archival preservation?
71Opinions, Rants, Insights: A Cross-Section of
What is Out There
- Libraries and Related Issues
- Library Web Logs (Laurel A. Clyde): An Actual
Study!
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/clyde.pdf
- Weblogs: Do They Belong in a Library (Penny
Garrod)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/garrod.htm
- Weblogs: Their Use and Application in Science and
Technology Libraries
http://stlq.info/archives/blogstl.pdf
- Revenge of the Blog People (Michael Gorman)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/gorman.htm
- I See Blog People (T. Scott Plutchak)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/plutchak.htm
- The Passion of the Blog (Irene McDermott)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/mcdermott.htm
- All Generalizations Are False, Including This One
(Marydee Ojala)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/ojala.pdf
72Opinions, Rants, Insights: A Cross-Section of
What is Out There (cont'd)
- Social Role and Meaning
- Blogs and the New Politics of Listening (Stephen
Coleman)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/coleman.pdf
- Blogs as Protected Space (Michelle Gumbrecht)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/gumbrecht.pdf
- Miscellaneous
- An Introduction to Teaching With Weblogs (Trey
Martindale)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/teaching.pdf
- Emerging Technologies (Bob Godwin-Jones)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/rssblogswikis.htm
- Finding a Blog in a Haystack (Stephen Baker)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/haystack.htm
- Your blog? who gives a _at_! (Aaron Weiss)
http://www.slais.ubc.ca/macdonell/bloglit/libraries_opinions/weiss.htm
73Bibliography: Search Engines
- "Daypop Search." In Metamend Software Design
Limited, database online. Cited 18 November
2005. Available from
http://www.metamend.com/daypop-search-engine.html.
- (Metamend also has pages on Feedster,
IceRocket, and Technorati)
- Bradley, Phil. "Search Engines: Weblog Search
Engines." Ariadne, no. 36. Journal on-line.
Available from
http://www.ariadne.ac.uk/issue36/search-engines/intro.html,
18 November 2005.
- Notess, G. "The Blog Realm: News Sources,
Searching with Daypop, and Content Management."
Online 26, no. 5 (Sep/Oct 2002): 70-72. 132354.
- http://www.slais.ubc.ca/macdonell/bloglit/engines/notess.pdf
- Pikas, Christina K. "Blog Searching for
Competitive Intelligence, Brand Image, and
Reputation Management." Online 29, no. 4 (Jul
2005-Aug 2005): 16-21. 375032.
- http://www.slais.ubc.ca/macdonell/bloglit/engines/pikas.pdf
- Vara, Vauhini. "New Search Engines Help Users
Find Blogs." Wall Street Journal - Eastern
Edition 246, no. 47 (09/07/2005): D1-D3.
- http://www.slais.ubc.ca/macdonell/bloglit/engines/vauhini.htm
74Bibliography: Creators and Creation of Blogs,
Users, and Types of Blogs
- Bar-Ilan, Judit. "Information Hub Blogs." Journal
of Information Science 31, no. 4 (2005): 297-307.
- http://www.slais.ubc.ca/macdonell/bloglit/types/infohubs.pdf
- Gouge, Marianne. "Blogs as a Means of
Preservation Selection for the World Wide Web."
MA diss., School of Information and Library
Science, University of North Carolina at Chapel
Hill, 2004.
http://etd.ils.unc.edu/dspace/bitstream/1901/108/1/mariannegouge.pdf
- Herring, Susan C., Lois Ann Scheidt, Sabrina
Bonus, and Elijah Wright. "Bridging the Gap: A
Genre Analysis of Weblogs." Proceedings of the
37th Hawaii International Conference on System
Sciences (2004).
- http://www.slais.ubc.ca/macdonell/bloglit/types/herring.pdf
- Lindahl, Charlie, and Elise Blount. "Weblogs:
Simplifying Web Publishing." Computer 36, no. 11
(2003): 114-116.
- http://www.slais.ubc.ca/macdonell/bloglit/types/lindahl.pdf
75End of Presentation
- Questions?
- Any opinions that the class has on blogs or blog
search engines?
- Any searches or engines you'd like to look at?