Title: Graphs
 1Graphs  more on Web search
- 15-211 Fundamental Data Structures and Algorithms 
Stefan Niculescu & James Lyons, March 21, 2002 
 2Announcements 
 3Homework 5
- Homework Assignment 5 will be out on Friday. 
 - Must do some reading in order to complete it. 
 - Must take a progress quiz. 
 - Get started today and as usual, think b4 u hack!
 
  4Reading
- About graphs 
 - Chapter 14 
- About Web search 
 - http://www.cadenza.org/search_engine_terms/srchad.htm 
- An HTML tutorial 
 - http://www.cwru.edu/help/introHTML/toc.html 
 
  5Introduction to Graphs 
 6Graphs: an overview 
 7Graphs: an overview
[Figure: vertices BOS, DTW, SFO, PIT, JFK, LAX]
 8Graphs: an overview
[Figure: undirected graph on BOS, DTW, SFO, PIT, JFK, LAX]
 9Graphs: an overview
[Figure: undirected graph on BOS, DTW, SFO, PIT, JFK, LAX with edge weights 618, 2273, 211, 190, 318, 1987, 344, 2145, 2462]
 10Terminology
- Graph G = (V, E) 
 - Set V of vertices (nodes) 
 - Set E of edges 
 - Elements of E are pairs (v,w) where v,w ∈ V. 
 - An edge (v,v) is a self-loop. (Usually assume no self-loops.) 
- Weighted graph 
 - Elements of E are triples (v,w,x) where x is a weight.
 
  11Terminology, cont'd
- Directed graph (digraph) 
 - The edge pairs are ordered 
 - Every edge has a specified direction 
 - The Web is a directed graph 
- Undirected graph 
 - The edge pairs are unordered 
 - E is a symmetric relation 
 - (v,w) ∈ E implies (w,v) ∈ E 
 - In an undirected graph (v,w) and (w,v) are usually treated as though they are the same edge 
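The ordered versus unordered distinction can be sketched in a few lines of Python. The edge pairs below are made up for illustration (they are not the routes from the slides), and the helper names are hypothetical.

```python
# Hypothetical directed edges over airport codes (illustrative only).
directed_edges = {("PIT", "JFK"), ("JFK", "BOS"), ("BOS", "PIT")}

def is_symmetric(edges):
    """True if (v, w) in E implies (w, v) in E."""
    return all((w, v) in edges for (v, w) in edges)

def symmetric_closure(edges):
    """Add the reverse of every edge, giving a symmetric edge relation."""
    return edges | {(w, v) for (v, w) in edges}

def as_unordered(edges):
    """Treat (v, w) and (w, v) as the same edge by storing unordered pairs."""
    return {frozenset(edge) for edge in edges}

undirected_edges = symmetric_closure(directed_edges)
```

Storing each undirected edge once as a `frozenset` is one possible design; storing both ordered pairs is another, and it makes the "E is a symmetric relation" view explicit.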
  12Directed Graph (digraph)
[Figure: directed graph on BOS, DTW, SFO, PIT, JFK, LAX]
 13Undirected Graph
[Figure: undirected graph on BOS, DTW, SFO, PIT, JFK, LAX]
 14Terminology, cont'd
- v and w are adjacent (neighbors) if (v,w) ∈ E or (w,v) ∈ E 
- d(v) (degree of v) = # of neighbors of v (for undirected graphs) 
- d+(v) (out-degree of v) = # of edges (v,w) ∈ E 
- d-(v) (in-degree of v) = # of edges (w,v) ∈ E 
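As a small illustration of these degree definitions, here is a sketch that counts them from an edge list. The function names and the sample edges are assumptions made for this example.

```python
from collections import Counter

def out_degrees(edges):
    """Out-degree of v = number of edges (v, w) that leave v."""
    return Counter(v for v, _ in edges)

def in_degrees(edges):
    """In-degree of v = number of edges (w, v) that enter v."""
    return Counter(w for _, w in edges)

def degrees(edges):
    """Undirected degree = number of distinct neighbors (no self-loops assumed)."""
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(v, set()).add(w)
        neighbors.setdefault(w, set()).add(v)
    return {v: len(ns) for v, ns in neighbors.items()}

# Hypothetical edge list; read as directed for in/out-degree,
# and as undirected for degrees().
edges = [(1, 2), (1, 3), (3, 1), (4, 3)]
```

A `Counter` returns 0 for vertices that never appear on the relevant side of an edge, which matches the definitions.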
 
  15Terminology, cont'd
- Path 
 - a list of nodes (v1, v2, ..., vn) s.t. (vi, vi+1) ∈ E for all 0 < i < n 
 - The length of the above path is n-1 
- Cycle 
 - a path that begins and ends with the same node 
- Cyclic graph: contains at least one cycle 
- Acyclic graph: no cycles 
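The path and cycle definitions translate almost directly into code. This is a minimal sketch that assumes a directed edge set of ordered pairs; the sample edges and function names are hypothetical.

```python
def is_path(nodes, edges):
    """(v1, ..., vn) is a path if every consecutive pair (vi, vi+1) is an edge."""
    return all((a, b) in edges for a, b in zip(nodes, nodes[1:]))

def path_length(nodes):
    """A path on n nodes has length n - 1 (it uses n - 1 edges)."""
    return len(nodes) - 1

def is_cycle(nodes, edges):
    """A cycle is a path that begins and ends with the same node."""
    return len(nodes) > 1 and nodes[0] == nodes[-1] and is_path(nodes, edges)

# Hypothetical directed graph containing the cycle 1 -> 2 -> 3 -> 1.
edges = {(1, 2), (2, 3), (3, 1), (3, 4)}
```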
 
  16Elements of a Graph
[Figure: graph on BOS, DTW, SFO, PIT, JFK, LAX]
 17Terminology, cont'd
- Subgraph of a graph G 
 - a subset of V with the corresponding edges from E. 
- Connected graph 
 - a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other. 
- Connected component of a graph G 
 - a maximal connected subgraph of G. 
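One common way to find connected components of an undirected graph is a breadth-first search from each unvisited vertex. The sketch below takes a component to be a maximal connected subgraph; the function names are assumptions for this example.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return the vertex sets of the components of an undirected graph."""
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:                      # breadth-first search from start
            v = queue.popleft()
            component.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components

def is_connected(vertices, edges):
    """A graph is connected if it has at most one component."""
    return len(connected_components(vertices, edges)) <= 1
```

For example, `connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])` groups the vertices into three components, one of which is the isolated vertex 6.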
 
  18Elements of a Graph, cont'd
[Figure: graph on BOS, DTW, SFO, PIT, JFK, LAX]
 19Terminology, cont'd
- Unrooted (undirected) tree 
 - an acyclic connected undirected graph 
- Theorem: in any unrooted tree T = (V,E), |V| = |E| + 1. 
 - Proof by induction on |V| 
 - Base case: |V| = 1 (|E| = 0) 
 - Show there exists a node of degree one 
 - Remove that node and apply induction hypothesis 
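The proof outline can be written out as follows. This is one standard way to fill in the steps, not necessarily the argument given in lecture.

```latex
\textbf{Claim.} Every unrooted tree $T = (V, E)$ satisfies $|V| = |E| + 1$.

\textbf{Base case.} If $|V| = 1$ there are no edges (no self-loops),
so $|E| = 0$ and $|V| = 1 = |E| + 1$.

\textbf{A node of degree one exists.} Let $|V| \ge 2$ and take a longest
path $(v_1, \dots, v_k)$ in $T$. Since $T$ is connected, $k \ge 2$.
If $v_k$ had a neighbor other than $v_{k-1}$, that neighbor would either
extend the path (contradicting maximality) or already lie on it
(creating a cycle). Hence $d(v_k) = 1$.

\textbf{Inductive step.} Remove $v_k$ and its single edge to obtain
$T' = (V', E')$ with $|V'| = |V| - 1$ and $|E'| = |E| - 1$.
$T'$ is still connected and acyclic, so by the induction hypothesis
$|V'| = |E'| + 1$, that is,
\[
  |V| - 1 = (|E| - 1) + 1 \quad\Longrightarrow\quad |V| = |E| + 1 .
\]
```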
 
  20Example of an unrooted tree
[Figure: unrooted tree on BOS, DTW, SFO, PIT, JFK, LAX]
 21Quiz Break 
 22So, is this a connected graph?
- Cyclic or acyclic?
- Directed or undirected?
[Figure: graph on vertices 1 through 6]
 23Directed graph (unconnected)
- Cyclic or acyclic?
[Figure: directed graph on vertices 1 through 6]
 24Representing Graphs 
 25Representing graphs
- Adjacency matrix 
 - 2-dimensional array 
 - For each edge (u,v), set A[u][v] to true, otherwise false 
- Adjacency lists 
 - For each vertex, keep a list of adjacent vertices
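Both representations can be sketched briefly. This assumes vertices are numbered 0 to n-1 and edges are directed pairs; the function names and the sample edges are made up for the example.

```python
def adjacency_matrix(n, edges):
    """n x n boolean array: A[u][v] is True exactly when (u, v) is an edge."""
    A = [[False] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = True
    return A

def adjacency_lists(n, edges):
    """For each vertex u, the list of vertices v such that (u, v) is an edge."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    return adj

# Hypothetical directed graph on vertices 0, 1, 2.
edges = [(0, 1), (0, 2), (2, 1)]
A = adjacency_matrix(3, edges)
adj = adjacency_lists(3, edges)
```

Checking whether (u, v) is an edge is a single lookup `A[u][v]` in the matrix, but requires scanning `adj[u]` in the list form. The matrix always uses n*n cells, while the lists use space proportional to the number of vertices plus edges.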
 
  26Choosing a representation
- Size of V relative to size of E is a primary factor. 
 - Dense: E/V is large 
 - Sparse: E/V is small 
- Adjacency matrix is expensive in terms of space if the graph is sparse (O(V^2) > O(E+V)). 
- Adjacency list is expensive in terms of checking edges if the graph is dense. 
  27Size of a Graph
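- How many edges in an undirected graph with n vertices? 
 - Minimum: 0 
 - Maximum: n(n-1)/2 (assuming no self-loops) 
- How many undirected graphs for a set of n given vertices? 
 - Answer: 2^(n(n-1)/2), since each possible edge is either present or absent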
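Assuming no self-loops, the standard closed forms are n(n-1)/2 for the maximum number of edges and 2^(n(n-1)/2) for the number of undirected graphs on n labeled vertices. The brute-force sketch below checks both for small n; the function names are hypothetical.

```python
from itertools import combinations

def max_edges(n):
    """Count the unordered vertex pairs, i.e. the edges of the complete graph."""
    return len(list(combinations(range(n), 2)))

def count_graphs(n):
    """Count every subset of the possible edges; each subset is one graph."""
    possible = list(combinations(range(n), 2))
    return sum(
        1
        for k in range(len(possible) + 1)
        for _ in combinations(possible, k)
    )
```

The enumeration grows very quickly, so it is only practical for n up to about 5 or 6.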
 
  28Graphs are Everywhere 
 29Graphs as models
- The Internet 
 - Communication pathways 
 - DNS hierarchy 
 - The WWW 
 - The physical world 
 - Road topology and maps 
 - Airline routes and fares 
 - Electrical circuits 
 - Job and manufacturing scheduling
 
  30Graphs as models
- Physical objects are often modeled by meshes, which are a particular kind of graph structure. 
[Figure: mesh, by Jonathan Shewchuk] 
 31More graph models
[Figure: NASA CFD labs]
[Figure: by Paul Heckbert and David Garland]
See also http://java.sun.com/applets/jdk/1.1/demo/WireFrame/index.html and http://www.mapquest.com 
 32Structure of the Internet
[Figure: UUNET map. Source: Cisco Systems] 
 33Relationship graphs
- Graphs are also used to model relationships among entities. 
 - Scheduling and resource constraints. 
 - Inheritance hierarchies. 
 
  34Where are we right now? 
 35The Web Graph 
 36Web Graph
- Documents written in HTML 
 - HTML (HyperText Markup Language) 
 - TAGS 
 - <head>, <body>, <title> 
 - <H1>, <p> 
 - <a> (anchor, link)
 
  37A simple HTML example
 <html>
 <head> <TITLE>A Simple HTML Example</TITLE> </head>
 <body>
 <a href="http://www.cmu.edu"> Carnegie Mellon University </a>
 </body>
 </html> 
 38Web Graph
- A directed graph where 
 - V = (all web pages) 
 - E = (all HTML-defined links from one web page to another) 
  39Web Graph
[Figure: web pages pointing to one another through <a href> links]
- Web Pages are nodes (vertices) 
 - HTML references are links (edges)
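To make the "references are edges" idea concrete, here is a minimal sketch that extracts the out-edges of one page with Python's standard `html.parser` module. The page URL `http://example.org/start` and the helper names are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def out_edges(page_url, html):
    """Return the directed edges (page_url, target) defined by the page's links."""
    parser = LinkParser()
    parser.feed(html)
    return [(page_url, target) for target in parser.links]

page = ('<html><body><a href="http://www.cmu.edu">'
        'Carnegie Mellon University</a></body></html>')
edges = out_edges("http://example.org/start", page)
```

A real crawler would also need to resolve relative URLs and fetch pages over the network, which this sketch leaves out.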
 
  40Is the Web Graph connected?
- Sparse, unconnected graph 
- AUTHORITIES 
 - web pages containing a reasonable amount of relevant information about a specific topic 
- HUBS 
 - web pages that point (link) to many pages containing relevant information about a given topic 
 
  41Finding Hubs & Authorities
- Nice iterative algorithm by Jon Kleinberg 
 - www.cs.cornell.edu/home/kleinber/auth.ps 
- HUB: Avrim's Machine Learning page 
 - www.cs.cmu.edu/avrim/ML 
- AUTHORITY: www.java.sun.com 
 - Extra credit opportunity for homework 5 ? 
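The slides do not give the algorithm's details. The sketch below is a simplified version of the hub/authority iteration as it is commonly described: a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both vectors are normalized each round. The function name, the fixed iteration count, and the toy graph are assumptions. Consult the paper for the actual method, including how the set of pages is chosen.

```python
import math

def hubs_and_authorities(edges, iterations=50):
    """Return (hub, auth) score dicts for a directed graph given as (src, dst) pairs."""
    nodes = {v for edge in edges for v in edge}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize each score vector to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth

# Hypothetical toy web: h1 and h2 link to both a1 and a2, h3 links only to a1.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hubs_and_authorities(toy_edges)
```

In this toy graph `a1` ends up as the strongest authority, and `h1` and `h2` score higher as hubs than `h3`.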
 
  42Graphs  Application Search Engines 
 43Search Engines 
 44What are they?
- Tools for finding information on the Web 
 - Problem: hidden databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.) 
- Search engine 
 - A machine-constructed index (usually by keyword) 
- So many search engines, we need search engines to find them. Searchenginecollosus.com 
  45Did you know?
- Vivisimo was developed here at CMU 
 - Developed by Prof. Raul Valdes-Perez 
 - Developed in 2000
 
  46SE Architecture
- Spider 
 - Crawls the web to find pages. Follows hyperlinks. Never stops 
- Indexer 
 - Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon) 
- Retriever 
 - Query interface 
 - Database lookup to find hits 
 - 1 billion documents 
 - 1 TB RAM, many terabytes of disk 
 - Ranking
 
  47A look at
- 10,000 servers (WOW!) 
 - Web site traffic grows over 20% per month 
 - Spiders over 2 billion URLs 
 - Supports 28 language searches 
 - Over 100 million searches per day 
 - Even CMU uses it!
 
  48Google's server farm 
 49Web Crawlers
- Start with an initial page P0. Find URLs on P0 and add them to a queue 
- When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat 
- Can be specialized (e.g. only look for email addresses) 
- Issues 
 - Which page to look at next? (Special subjects, recency) 
 - How deep within a site do you go (depth search)? 
 - How frequently to visit pages?
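The queue-driven loop described above might look like the following sketch. To stay self-contained it crawls a toy in-memory "web" (a dict from URL to the URLs found on that page) instead of fetching real pages. The names `crawl`, `index_page`, `max_pages`, and `toy_web` are assumptions.

```python
from collections import deque

def crawl(web, start, index_page, max_pages=100):
    """Breadth-first crawl from start; returns pages in the order they were processed."""
    queue = deque([start])
    seen = {start}                 # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in web.get(page, []):   # find URLs on the page, add them to the queue
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)           # when done with the page, pass it to the indexer
        visited.append(page)
    return visited

# Hypothetical toy web of four pages.
toy_web = {"P0": ["P1", "P2"], "P1": ["P2", "P3"], "P2": ["P0"], "P3": []}
indexed = []
order = crawl(toy_web, "P0", indexed.append)
```

A first-in, first-out queue gives breadth-first order. Swapping in a priority queue is one way to address the "which page next?" issue, and `max_pages` is a crude stand-in for depth and frequency limits.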
 
  50So, why Spider the Web?
- Refresh Collection by deleting dead links 
 - OK if index is slightly smaller 
 - Done every 1-2 weeks in best engines 
 - Finding new sites 
 - Respider the entire web 
 - Done every 2-4 weeks in best engines
 
  51Cost of Spidering
- Spider can (and does) run in parallel on hundreds of servers 
- Very high network connectivity (e.g. T3 line) 
- Servers can migrate from spidering to query processing depending on time-of-day load 
- Running a full web spider takes days even with hundreds of dedicated servers 
  52Indexing
- Arrangement of data (data structure) to permit fast searching 
- Which list is easier to search? 
 - sow fox pig eel yak hen ant cat dog hog 
 - ant cat dog eel fox hen hog pig sow yak 
- Sorting helps. Why? 
 - Permits binary search. About log2(n) probes into list 
 - log2(1 billion) ≈ 30 
 - Permits interpolation search. About log2(log2(n)) probes on average (for uniformly distributed keys) 
 - log2(log2(1 billion)) ≈ 5
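Both searches can be sketched with a probe counter to make the estimates tangible. The function names are assumptions, and the interpolation version assumes numeric keys so that a position can be estimated arithmetically.

```python
import math

def binary_search(items, target):
    """Search a sorted list; return (index or -1, number of probes)."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

def interpolation_search(items, target):
    """Search sorted numeric keys, guessing the position from the key's value."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi and items[lo] <= target <= items[hi]:
        probes += 1
        if items[hi] == items[lo]:
            pos = lo
        else:
            pos = lo + (target - items[lo]) * (hi - lo) // (items[hi] - items[lo])
        if items[pos] == target:
            return pos, probes
        if items[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes

animals = "ant cat dog eel fox hen hog pig sow yak".split()
evenly_spaced = list(range(0, 1000, 10))
```

On the sorted animal list, `binary_search(animals, "hog")` needs 4 probes. On evenly spaced numeric keys the interpolation guess lands on the target immediately, which is the best case behind the log2(log2(n)) average.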
 
  53Inverted Files
- A file is a list of words by position 
 - First entry is the word in position 1 (first word) 
 - Entry 4562 is the word in position 4562 (4562nd word) 
 - Last entry is the last word 
- An inverted file is a list of positions by word!
[Figure: a FILE with positions (POS) 1, 10, 20, 30, 36 marked] 
 54Inverted Files for Multiple Documents
- "jezebel" occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
[Figure: LEXICON pointing into the WORD INDEX] 
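Both ideas, positions by word within one file and occurrence counts by word across documents, can be sketched with dictionaries. The function names and the whitespace tokenization are assumptions; real engines use far more elaborate structures.

```python
from collections import defaultdict

def invert(words):
    """A file is a list of words by position; return the positions by word (1-based)."""
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    return dict(positions)

def build_index(documents):
    """documents maps doc_id -> text; return word -> {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)
```

For instance, `invert("the cat saw the dog".split())` maps "the" to positions 1 and 4, and a lookup such as `build_index(docs)["jezebel"]` returns the per-document counts for that word.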
 55Ranking (Scoring) Hits
- Hits must be presented in some order 
- What order? 
 - Relevance, recency, popularity, reliability, alphabetic? 
- Some ranking methods 
 - Presence of keywords in title of document 
 - Closeness of keywords to start of document 
 - Frequency of keyword in document 
 - Link popularity (how many pages point to this one)
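As a purely illustrative sketch, the four ranking signals listed above could be combined into a single score. The weights, the dictionary layout of a hit (`title`, `body`, `in_links`), and the function names are all assumptions. Real engines keep their scoring formulas secret.

```python
def score(hit, keywords, weights=(3.0, 2.0, 1.0, 0.5)):
    """Combine the four signals into one number (arbitrary illustrative weights)."""
    w_title, w_early, w_freq, w_links = weights
    title_words = hit["title"].lower().split()
    body_words = hit["body"].lower().split()
    total = 0.0
    for kw in keywords:
        kw = kw.lower()
        if kw in title_words:                      # presence in title
            total += w_title
        if kw in body_words:
            first = body_words.index(kw)
            total += w_early / (1 + first)         # closeness to start of document
            total += w_freq * body_words.count(kw) / len(body_words)  # frequency
    total += w_links * hit["in_links"]             # link popularity
    return total

def rank(hits, keywords):
    """Order hits from highest to lowest score."""
    return sorted(hits, key=lambda h: score(h, keywords), reverse=True)

# Hypothetical hits for the query "graphs".
hits = [
    {"title": "Cooking", "body": "graphs appear once here", "in_links": 0},
    {"title": "Graphs", "body": "graphs graphs everywhere", "in_links": 2},
]
ranked = rank(hits, ["graphs"])
```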
 
  56Spamdexing & Link Popularity
- Spamdexing means influencing retrieval ranking by altering a web page. (Puts spam in the index) 
- www.linkpopularity.com 
- Link popularity is used for ranking 
 - Many measures 
 - Number of links in (In-links) 
 - Weighted number of links in (by weight of referring page) 
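The two measures named above can be sketched over a list of (source, target) links. The page names and weights are made up, and how a referring page's weight is obtained is left open here, as it is on the slide.

```python
from collections import Counter

def in_link_counts(edges):
    """Number of links pointing at each page."""
    return Counter(dst for _, dst in edges)

def weighted_in_links(edges, page_weight):
    """For each page, sum the weights of the pages that link to it."""
    totals = {}
    for src, dst in edges:
        totals[dst] = totals.get(dst, 0.0) + page_weight.get(src, 0.0)
    return totals

# Hypothetical links and page weights.
links = [("a", "c"), ("b", "c"), ("c", "d")]
weights = {"a": 1.0, "b": 0.5, "c": 2.0}
```

Here page "c" has two in-links while "d" has one, but "d" scores higher under the weighted measure because its single referrer carries more weight.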
  57Search Engine Sizes
[Figure: search engine sizes. Legend: AV = AltaVista, EX = Excite, FAST = FAST, GG = Google, INK = Inktomi, NL = Northern Light. Source: searchenginewatch.com] 
 58Historical Notes
- WebCrawler: first documented spider 
 - Lycos: first large-scale spider 
 - Top honors for most web pages spidered: first Lycos, then AltaVista, then Google... 
  59Overview
- Engines are a critical Web resource 
 - Very sophisticated, high technology, but secret 
 - Most spidering re-traverses stable web graph 
 - They don't spider the Web completely 
 - Spamdexing is a problem 
 - New paradigms needed as Web grows 
 - What about images, music, video? 
 - Google's image search engine 
 - Napster