Title: Algorithms (Contd.)
1. Algorithms (Contd.)
2. How do we describe algorithms?
- Pseudocode
- Combines English, simple code constructs
- Works with various types of primitives
- Could be simple arithmetic (+, -, *, /)
- Could be more complex operations
- Describes how data is organized
- Describes operations on the data
- Is meant to be higher level than programming
3. Searching with indices (pseudocode)
- Build the indices
- Do this by going through the list and determining where department names change
- Store the results in an array called Indices
- Search the indices
- Do a binary search on the array Indices
- Do this by comparing to the middle element
- Then use binary search to compare to the upper half
- Or use binary search to compare to the lower half
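The two steps above can be sketched in Python. This is a minimal illustration, assuming the list is already sorted by department name; the record layout and the function names `build_indices` and `find_department` are hypothetical, not part of the original slides.

```python
def build_indices(records):
    """Return [(department, first_position)] for a list sorted by department."""
    indices = []
    for pos, (dept, _name) in enumerate(records):
        # A new entry is stored each time the department name changes.
        if not indices or indices[-1][0] != dept:
            indices.append((dept, pos))
    return indices


def find_department(indices, dept):
    """Binary search the indices; return the first position of dept, or -1."""
    lo, hi = 0, len(indices) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name, pos = indices[mid]
        if name == dept:
            return pos
        if name < dept:
            lo = mid + 1  # continue in the upper half
        else:
            hi = mid - 1  # continue in the lower half
    return -1


# Made-up sample data, sorted by department.
records = [("Biology", "Ann"), ("Biology", "Bob"), ("Chemistry", "Cy"),
           ("Physics", "Di"), ("Physics", "Ed"), ("Physics", "Flo")]
indices = build_indices(records)
```

Here `find_department(indices, "Physics")` gives the position of the first Physics record, and a department that is not in the list gives -1.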
4. Building a web search engine
- Crawl/spider the web
- Organize the results for fast query processing
- Process queries
5. Crawl the web
- Every month use networking to go to as many reachable web pages as you can
- 10B pages, 10 Kbytes/page, so 100 terabytes
- Can compress an average page to 3 Kbytes
- Numeracy
- To crawl 10B pages in 100 days
- Crawl 100M pages per day
- Crawl 4M pages per hour
- Crawl 1,000 pages per second
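The numeracy on this slide can be checked with a few lines of Python. The sketch assumes decimal units (1 Kbyte = 1,000 bytes, 1 terabyte = 10^12 bytes); the variable names are illustrative only.

```python
PAGES = 10_000_000_000        # 10B pages
BYTES_PER_PAGE = 10_000       # 10 Kbytes per page (decimal units assumed)
COMPRESSED_BYTES = 3_000      # 3 Kbytes per compressed page
CRAWL_DAYS = 100

raw_terabytes = PAGES * BYTES_PER_PAGE / 1e12
compressed_terabytes = PAGES * COMPRESSED_BYTES / 1e12

pages_per_day = PAGES / CRAWL_DAYS
pages_per_hour = pages_per_day / 24
pages_per_second = pages_per_hour / 3600

print(f"raw: {raw_terabytes:.0f} TB, compressed: {compressed_terabytes:.0f} TB")
print(f"{pages_per_day:,.0f}/day, {pages_per_hour:,.0f}/hour, "
      f"{pages_per_second:,.0f}/second")
```

The slide's figures of 4M pages per hour and 1,000 pages per second are round-number versions of the values this computes.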
6. Organize the results
- Put into alphabetical order
- Build indices for faster lookup
- Make multiple copies so that searching can proceed in parallel
- When you update, you rebuild the indices
7. Process search queries
- Look up indices
- Look up words/phrases
- Advertiser can buy a word or phrase
- This search gives you internal addresses of web pages
- Look them up to build results page
- Ranking results: content match, popularity, price paid by advertisers
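One common way to go from words to internal page addresses is an inverted index. The slides do not say how the lookup is implemented, so the following is only a sketch under that assumption; the page ids, the sample texts, and the names `build_inverted_index` and `lookup` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawl results: internal page id -> page text.
pages = {1: "computer science department",
         2: "science news",
         3: "computer games"}
index = build_inverted_index(pages)
```

The ids returned by `lookup` would then be used to fetch the pages and build the results page; ranking is a separate step.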
8. Ranking by Popularity
- The web is a collection of links
- A document's importance is determined by
- How many pages point to it
- How important those pages are
- Used for determining
- How often to crawl a page
- How to order pages presented
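The circular definition above (a page is important if important pages point to it) can be resolved by repeated averaging. The sketch below is a simplified iteration in the spirit of PageRank, not a description of any production ranking; the damping value 0.85, the iteration count, and the name `link_rank` are assumptions made for the example.

```python
def link_rank(links, iterations=50, damping=0.85):
    """links maps each page to the list of pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page keeps a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and D all point to C; C points to A.
ranks = link_rank({"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]})
```

In this toy graph C ends up ranked highest because three pages point to it, and A ranks above B and D because the important page C points to A.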
9. Content Relevance
- Simple string matching
- Does the document/string contain the word "computer"?
- More complex string matching
- Did the word "computer" occur before or after the word "science"?
- Did it appear within 10 words of the word "science"?
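The three questions above can be sketched with word positions. This is a minimal illustration that splits on whitespace and ignores punctuation; the function names are hypothetical.

```python
def positions(words, target):
    """Indices at which target occurs in the word list."""
    return [i for i, w in enumerate(words) if w == target]


def contains(text, word):
    """Simple matching: does the text contain the word?"""
    return word in text.lower().split()


def occurs_before(text, first, second):
    """True if some occurrence of first precedes some occurrence of second."""
    words = text.lower().split()
    p1, p2 = positions(words, first), positions(words, second)
    return bool(p1 and p2) and min(p1) < max(p2)


def within(text, first, second, distance=10):
    """True if first and second appear within `distance` words of each other."""
    words = text.lower().split()
    return any(abs(i - j) <= distance
               for i in positions(words, first)
               for j in positions(words, second))


doc = "the department of computer science offers courses"
```

For the sample `doc`, "computer" is present, occurs before "science", and lies within 10 words of it.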
10. How does string matching work?
- State machines
- Move along states as long as you keep matching
- Back off when you miss a match
11. State machine looking for abcd
[Figure: state machine diagram; edge label: Read a]
What happens if the input is abccadbacabcd?
Sa Sb Sc Sd Sa Sb Sa Sa Sb Sa Sb Sc Sd OK
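12. State machine looking for abcd
[Figure: state machine diagram; edge label: Read a]
What happens if the input is abcabcd?
Sa Sb Sc Sd Sa Sa Sa Sa
The match is missed: the a read in state Sd sends the machine back to Sa instead of Sb.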
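13. State machine looking for abcd
[Figure: state machine diagram with states Sa, Sb, Sc, Sd, OK]
- Sa: Read a goes to Sb
- Sb: Read b goes to Sc; Read a stays in Sb; Other goes to Sa
- Sc: Read c goes to Sd; Read a goes to Sb; Other goes to Sa
- Sd: Read d goes to OK; Read a goes to Sb; Other goes to Sa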
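The state machines on these slides can be simulated in a few lines of Python. This is a sketch, not a general string matcher: it relies on the first letter of abcd not repeating later in the pattern. The function name `trace` and the flag `restart_on_a` are hypothetical; with the flag off, every mismatch returns to Sa, and with it on, reading an a after a mismatch goes to Sb.

```python
PATTERN = "abcd"
STATES = ["Sa", "Sb", "Sc", "Sd", "OK"]  # state i is waiting for PATTERN[i]


def trace(text, restart_on_a=True):
    """Return the list of states visited while reading text."""
    state = 0
    visited = [STATES[state]]
    for ch in text:
        if ch == PATTERN[state]:
            state += 1                    # keep matching: move along
        elif restart_on_a and ch == "a":
            state = 1                     # the a just read starts a new match
        else:
            state = 0                     # back off to the start state
        visited.append(STATES[state])
        if state == len(PATTERN):
            break                         # reached OK
    return visited


print(" ".join(trace("abcabcd", restart_on_a=False)))  # Sa Sb Sc Sd Sa Sa Sa Sa
print(" ".join(trace("abcabcd")))                      # Sa Sb Sc Sd Sb Sc Sd OK
```

Without the extra "Read a" transitions the machine never reaches OK on abcabcd; with them it does.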
14. Larger search challenges
- Allow strings to have don't cares
- Starts with a and ends with e
- Has some number of copies of the substring ab
- Finding strings similar to but not the same as your string
- For spelling correction
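Python's standard library gives one way to experiment with these challenges: `re` for patterns with don't cares and `difflib` for near matches. This is a sketch, not how a search engine does it. Reading "some number of copies of ab" as "one or more copies and nothing else" is an assumption, and the function names and word list are made up.

```python
import difflib
import re


def starts_with_a_ends_with_e(s):
    """'a', then any characters (the don't cares), then 'e'."""
    return re.fullmatch(r"a.*e", s) is not None


def is_copies_of_ab(s):
    """One or more copies of 'ab' and nothing else."""
    return re.fullmatch(r"(ab)+", s) is not None


def suggest(word, dictionary):
    """Return up to three dictionary words that look similar to word."""
    return difflib.get_close_matches(word, dictionary, n=3, cutoff=0.6)


# Hypothetical dictionary for the spelling-correction case.
words = ["correction", "collection", "banana"]
suggestions = suggest("corection", words)
```

Here `suggestions` lists the candidate spellings for the misspelled word, best match first.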
15. Algorithms -- summary
- Methods for solving problems
- Understand at a high level
- Make sure your reasoning is correct
- Worry about efficiency in situations where that matters
- Write as pseudocode
16. Distributed Algorithms
17. Distributed computing
- Key idea
- Buying 1000 machines of speed x is significantly cheaper than buying one machine of speed 1000x
- No one person has to buy all 1000 machines
- A lot of computational, communication, and storage resources are already in place and can be harvested for bigger things
- Key challenge
- Making the machines work together for effective speedup
- Communication between machines is a key challenge
- Approaches
- Find problems that can be distributed easily
18. Distributed problems
- Problems that can use decentralized computing
- Weather prediction
- Weather in a location is most affected by weather nearby
- Movie generation
- Individual frames can be generated separately
- Google search engine
- 10,000s of PCs, all of them cheap, many of them identical
- Can answer over 100,000,000 queries per day in ½ sec or less each
- Looking for the origin of the universe
- Can be localized like weather prediction
- File swapping and access (distributed storage)
- Looking for extraterrestrial intelligence
- Content caching and distribution
19. Distributed computers
- Scales of distributed computing
- Cluster-in-a-room: hundreds of machines
- All dedicated to the task
- PCs on a campus: thousands of machines
- Using spare cycles
- SETI cluster: millions of machines
- Screen saver situation
20. Cluster in a Room
- Machines are dedicated to the network
- All machines run similar software
- Problem is divided into pieces
- Each piece is assigned to a machine in the cluster
- Problem pieces should be loosely linked
- Computation is faster than communication
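Dividing a problem into loosely linked pieces can be illustrated on one computer, with worker threads standing in for the machines of a cluster. This is only an analogy: the task (a sum of squares), the worker count, and the names `split`, `work`, and `run_cluster` are assumptions for the example, and threads in Python will not actually speed up this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, pieces):
    """Divide data into at most `pieces` chunks of nearly equal size."""
    size = max(1, (len(data) + pieces - 1) // pieces)
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(piece):
    """The computation done on one piece; it needs no other piece."""
    return sum(x * x for x in piece)


def run_cluster(data, machines=4):
    """Assign each piece to a worker, then combine the partial results."""
    pieces = split(data, machines)
    with ThreadPoolExecutor(max_workers=machines) as pool:
        partial_results = list(pool.map(work, pieces))
    return sum(partial_results)


total = run_cluster(list(range(1000)))
```

The only communication is handing out the pieces and collecting one number from each worker, which is what "loosely linked" asks for.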
21. PCs on a Campus
- Loosely coupled on a local-area network
- PCs do other things some of the time
- When free cycles are available, they're used
- Many more machines, but less of each machine available
22. Workstation Network at Google
Retrieving machines
Searching machines
Fit 40-80 machines in a 7x2x3 rack
23. SETI
- Telescope at Arecibo, PR collects data
- Data is processed in real time by fast machines
- But, no one looks for weak signals
- Too costly
- SETI_at_Home project built to do this
24. SETI_at_Home
- Receive data from Arecibo
- 35 Gbytes per day by snail mail
- Break into Work Units
- 0.25 Mbyte each, so 140,000 WUs per day
- WU takes 20 hours to process
- Need about 117,000 dedicated machines to process one day
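The figures on this slide follow from simple arithmetic, sketched below. Decimal units are assumed (1 Gbyte = 1,000 Mbytes), and the variable names are illustrative only.

```python
DATA_PER_DAY_MB = 35_000      # 35 Gbytes per day (decimal units assumed)
WORK_UNIT_MB = 0.25           # size of one Work Unit
HOURS_PER_WORK_UNIT = 20      # processing time for one Work Unit

work_units_per_day = DATA_PER_DAY_MB / WORK_UNIT_MB
machine_hours_per_day = work_units_per_day * HOURS_PER_WORK_UNIT

# A dedicated machine supplies 24 machine-hours per day.
machines_needed = machine_hours_per_day / 24

print(f"{work_units_per_day:,.0f} work units per day")
print(f"{machines_needed:,.0f} dedicated machines to keep up")
```

One day of data therefore needs 2.8 million machine-hours, which is where the slide's figure of about 117,000 dedicated machines comes from.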
25. SETI_at_Home
- Get individual users to download software
- When the machine is idle, the screen saver runs the software
- Download WU
- Compute
- When finished, send back result
- Database at Berkeley reassembles results
- Progress to date -- Seti_at_HomeStats
26. Medical/Biological Applications
- Peer-to-Peer Medicine
- Cancer Research