Design a Data Structure - PowerPoint PPT Presentation

About This Presentation
Title:

Design a Data Structure

Description:

... build a web search engine, a la Alta Vista (so you can search for 'banana slugs' or 'zyzzyvas' ... index say 1 billion documents of 1000 words each 1 ... – PowerPoint PPT presentation

Slides: 18
Provided by: HarryPl6
Learn more at: http://cs.calvin.edu

Transcript and Presenter's Notes


1
Design a Data Structure
  • Suppose you wanted to build a web search engine,
    a la Alta Vista (so you can search for 'banana
    slugs' or 'zyzzyvas')
  • index, say, 1 billion documents of 1000 words each
    = 1 trillion word occurrences
  • average word length 8
  • speed determined largely by disk accesses
  • may want Boolean searches (e.g. banana AND
    slug)
  • order results by relevance (title, keywords,
    repetitions)
  • what data structure, algorithms?
  • what will the space requirements of your data
    structure be?
  • how long will a search take?

2
Search Engine Ideas
  • Binary search tree
  • With a node for each word occurrence, memory
    needed: 1 trillion nodes, 20-30 bytes each?
  • Insert, delete, find: O(log n). Would that be OK?
  • Or one node for all occurrences of a word, with a
    linked list of pointers to documents?
  • perhaps 10 million nodes, each with a 10,000
    element list?
  • keep nodes (but not lists) in RAM
  • each element of list has URL, title, excerpt: 8K
    bytes?
  • How about a list of documents with excerpts?
  • 1. Banana Slugs, http://, Banana slugs are
    yellow, 8" long
  • 8K per document would be 8 TB for the whole
    list (1 billion documents × 8K).
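The "one node per word, with a list of documents" idea can be sketched with an ordered map from each word to its list of document numbers. This is only an in-memory illustration under assumptions: `InvertedIndex`, `addDocument`, and `lookup` are names invented for this sketch, `std::map` (typically a balanced search tree) stands in for the BST, and documents are assumed to be added in increasing order of document number.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical inverted index: one tree node per distinct word,
// each holding the list of document numbers that contain the word.
struct InvertedIndex {
    std::map<std::string, std::vector<int>> hits;  // word -> document numbers

    // Add every whitespace-separated word of `text` under document `doc`.
    // Assumes documents arrive in increasing order of `doc`,
    // so each hit list stays sorted by document number.
    void addDocument(int doc, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            std::vector<int>& list = hits[word];
            if (list.empty() || list.back() != doc)  // skip repeats within one document
                list.push_back(doc);
        }
    }

    // Return the hit list for `word` (empty if the word was never seen).
    std::vector<int> lookup(const std::string& word) const {
        auto it = hits.find(word);
        return it == hits.end() ? std::vector<int>() : it->second;
    }
};
```

At the scale discussed on the slide, the lists would live on disk and only the tree would stay in RAM; the sketch keeps everything in memory for brevity.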

3
Getting results
  • What should we store at the nodes of the BST?
  • A hit list for a word? 10000 entries?
  • Store a pointer to a hit list instead, to
    minimize BST size
  • For each hit store document number and byte
    offset
  • Order hit list by relevance criteria
  • Size of hit list: 8GB?
  • How many disk accesses to find the hits in a BST?
  • At 10 million nodes, 20-30 bytes per node
    (200-300 MB), can we store it all in RAM?
  • How to perform a Boolean search?
  • OR: union two lists (merge)
  • AND: intersect two lists (merge-like algorithm)
  • Total disk accesses needed?
  • search BST + access hit list + access each
    document's info
  • total: one for getting the hit list + one per hit
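The merge-based Boolean operations can be sketched as follows. The sketch assumes each hit list is a vector of document numbers sorted in increasing order with no duplicates; a list ordered by relevance would first need a common sort key for the merge to work. The function names `intersectHits` and `unionHits` are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Boolean AND: keep only document numbers present in both sorted lists.
std::vector<int> intersectHits(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;                 // a[i] cannot be in b
        else if (b[j] < a[i]) ++j;            // b[j] cannot be in a
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

// Boolean OR: standard merge of two sorted lists, emitting shared entries once.
std::vector<int> unionHits(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) out.push_back(a[i++]);
        else if (b[j] < a[i]) out.push_back(b[j++]);
        else { out.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) out.push_back(a[i++]);  // leftovers from a
    while (j < b.size()) out.push_back(b[j++]);  // leftovers from b
    return out;
}
```

Both functions make a single pass, so the cost is proportional to the combined length of the two lists.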

4
A Better Data Structure
  • BSTs waste space. Much duplication in the keys
  • BSTs waste comparison time, for the same reason
  • Search by bit? or by letter?
  • Build a search tree, but
  • Go left if first bit is 0, right for 1
  • Or, nodes have 26 children, for a..z
  • Words at the leaves. (Different sort of node.)
  • Each leaf node is a hit list
  • Don't need to store the words!
  • How much space is needed?
  • suppose you have all 11.9M (26^5) 5-letter words.
  • space for tree: about 1 pointer per word, 4 bytes,
    vs. 20(?) in BST
  • Space savings possible--but what about wasted
    pointer space?

5
Radix Search (Ch. 15)
  • Radix-search methods provide reasonable
    worst-case performance without balanced-tree
    complexity
  • Space savings are also possible.
  • They work by comparing pieces (bytes) of the
    key rather than the whole key, as in a BST
  • Analogous to Radix Sorting methods

6
Symbol Tables (Ch. 12 quickie)
  • But first, a word about symbol tables and BSTs
    (review)
  • Symbol table: store items, retrieve them by key.
  • e.g. a compiler's symbol table
  • e.g. a database with primary key
  • e.g. Perl's hash data structure (essentially an
    array indexed by a word): $phone{john} =
    "x6789".
  • fundamental to much of computation
  • Symbol table ADT (with additional desirable ops)
  • insert, delete, find
  • select (kth largest)
  • sort
  • union (of two symbol tables)
  • Extensively studied and still an area of active
    research (e.g. web)

7
BSTs for Symbol Tables
  • The Binary Search Tree is a common data structure
    used to implement symbol tables
  • Operations
  • insert, delete, find: recursive algorithms, O(n)
    worst case
  • O(log n) worst case in balanced BSTs
  • sort: inorder traversal
  • O(n)
  • kth largest?
  • augment tree with number of descendants stored at
    each node
  • O(log n) time in a balanced BST
  • pred, succ?
  • union?
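The "kth largest" idea can be sketched by storing a subtree size at every node. This is a minimal, unbalanced version under assumptions: the names `SizedNode`, `bstInsert`, and `kthLargest` are invented here, and the O(log n) bound from the slide only holds if the tree is kept balanced, which this sketch does not do. Nodes are never freed, for brevity.

```cpp
#include <string>

// BST node augmented with the size of its subtree (itself plus descendants).
struct SizedNode {
    std::string key;
    int size = 1;
    SizedNode* left = nullptr;
    SizedNode* right = nullptr;
    explicit SizedNode(const std::string& k) : key(k) {}
};

int sizeOf(const SizedNode* n) { return n ? n->size : 0; }

// Plain BST insert that keeps the size fields up to date.
// Duplicate keys are ignored.
SizedNode* bstInsert(SizedNode* n, const std::string& key) {
    if (!n) return new SizedNode(key);
    if (key < n->key) n->left = bstInsert(n->left, key);
    else if (n->key < key) n->right = bstInsert(n->right, key);
    n->size = 1 + sizeOf(n->left) + sizeOf(n->right);
    return n;
}

// Return the node holding the kth largest key (k = 1 is the maximum),
// or nullptr if k is out of range. One step per level: O(height).
const SizedNode* kthLargest(const SizedNode* n, int k) {
    while (n) {
        int larger = sizeOf(n->right);          // keys greater than n->key
        if (k <= larger) n = n->right;          // answer is in the right subtree
        else if (k == larger + 1) return n;     // n itself is the kth largest
        else { k -= larger + 1; n = n->left; }  // skip n and its right subtree
    }
    return nullptr;
}
```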

8
Digital Search Trees (Ch. 15 again)
  • Like a BST, but go left for 0, right for 1 in the
    bit in question (stop when you hit a null
    pointer)
  • Store key at node
  • Root is most significant bit; ith level = ith
    bit from left
  • Search: like BST search, but compare appropriate
    bit
  • Insert: ditto
  • Note: not inorder!
  • Each key is somewhere along the path specified by
    its bits
  • Can't support sort, select
  • Search time?
  • O(b), b = number of bits

9
Digital Search Tree Insertion
  • How to insert Z?
  • Z = 11010
  • Trace down bits until you find an empty spot

Runtime?
O(b), b = number of bits
But this is not in BST order. Can we make it so?
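Digital search tree insertion and search can be sketched as below. The sketch assumes 5-bit keys (letter codes 1..26, so Z = 26 = 11010 as on the slide); the constant `B` and the names `DstNode`, `bitAt`, `dstInsert`, and `dstContains` are assumptions made for the example.

```cpp
const int B = 5;  // assumed key width in bits; keys must be below 2^B

struct DstNode {
    unsigned key;
    DstNode* child[2] = {nullptr, nullptr};
    explicit DstNode(unsigned k) : key(k) {}
};

// Bit number `level` of `key`, counting from the most significant of the B bits.
int bitAt(unsigned key, int level) { return (key >> (B - 1 - level)) & 1u; }

// Trace down the key's bits; store the key at the first empty spot.
DstNode* dstInsert(DstNode* root, unsigned key) {
    if (!root) return new DstNode(key);
    DstNode* n = root;
    for (int level = 0; level < B; ++level) {
        if (n->key == key) return root;                 // already present
        DstNode*& next = n->child[bitAt(key, level)];
        if (!next) { next = new DstNode(key); return root; }
        n = next;
    }
    return root;  // a node at depth B can only hold this same key
}

// Search follows the same path, comparing the full key at each node: O(B).
bool dstContains(const DstNode* n, unsigned key) {
    for (int level = 0; n; ++level) {
        if (n->key == key) return true;
        if (level == B) return false;                   // no bits left to follow
        n = n->child[bitAt(key, level)];
    }
    return false;
}
```

Because a key sits wherever its path first hit an empty spot, an inorder traversal does not visit keys in sorted order.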
10
Trie
  • How can we keep the BST order?
  • Trie: a binary tree with keys at the leaves
  • for an empty set, a null pointer
  • for a single key, a leaf containing it
  • for many keys, a node with keys starting with
    bit 0 in its left subtree and keys starting with
    1 in its right subtree
  • "trie" comes from "retrieval" but, ironically, is
    pronounced "try" to distinguish it from "tree"

11
Trie Insertion
  • Perform search as usual.
  • If search ends at null link, insert there
  • If the search ends on a leaf, we need to add
    enough nodes on the way down to differentiate the
    leaf and the inserted node
  • Runtime? O(b) -- or maybe better!
  • Inserting N random bitstrings requires lg N bit
    comparisons on average per insertion
  • Note that leaf nodes and internal nodes are
    different. Wasted space if we use only one sort.
    (This gets especially significant in a large
    radix!)
  • Even with different node types, there may be
    wasted space
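A binary trie insertion along these lines might look like the sketch below. It assumes 5-bit keys and, for brevity, uses a single node type with an `isLeaf` flag, which is exactly the space-wasting shortcut the slide warns about. The names `TrieNode`, `trieInsert`, and `collect` are invented for the example.

```cpp
#include <vector>

const int BITS = 5;  // assumed key width in bits; keys must be below 2^BITS

struct TrieNode {
    bool isLeaf;
    unsigned key;        // meaningful only in leaves
    TrieNode* child[2];  // used only in internal nodes
};

TrieNode* makeLeaf(unsigned key) { return new TrieNode{true, key, {nullptr, nullptr}}; }
TrieNode* makeInternal() { return new TrieNode{false, 0, {nullptr, nullptr}}; }

int bitOf(unsigned key, int level) { return (key >> (BITS - 1 - level)) & 1u; }

// Insert `key` into the trie rooted at `n`; `level` is the depth of `n`.
TrieNode* trieInsert(TrieNode* n, unsigned key, int level = 0) {
    if (!n) return makeLeaf(key);                 // search ended at a null link
    if (n->isLeaf) {
        if (n->key == key) return n;              // already present
        // Search ended on a leaf: push the old leaf down one level and retry.
        // This repeats until the two keys differ in a bit.
        TrieNode* inner = makeInternal();
        inner->child[bitOf(n->key, level)] = n;
        return trieInsert(inner, key, level);
    }
    int b = bitOf(key, level);
    n->child[b] = trieInsert(n->child[b], key, level + 1);
    return n;
}

// Visit the leaves left to right; unlike a digital search tree, this yields sorted keys.
void collect(const TrieNode* n, std::vector<unsigned>& out) {
    if (!n) return;
    if (n->isLeaf) { out.push_back(n->key); return; }
    collect(n->child[0], out);
    collect(n->child[1], out);
}
```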

12
R-way Tries
  • You can save search time by using a larger radix
  • (at the expense of wasted space)
  • For example, have 26 children of each node, one
    for each letter of the alphabet

13
Tries for strings
  • 26 pointers per internal node, one for each
    letter of the alphabet
  • What if one word is the prefix of another?
  • Example: aardvark and aardvarkish
  • How do you represent that aardvark is a word if
    that node's 'i' pointer points to another
    internal node?
  • Add a bit per letter which means "this is a word"
  • Keys are stored implicitly by the sequence of
    links taken to find them.

14
A Trie node for strings
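  struct node {
    char isword[26];   // isword[i] = 1: path so far plus letter i is a word
    node *links[26];   // links[i]: child for letter i, or NULL
    node() {
      for (int i = 0; i < 26; i++) {
        isword[i] = 0; links[i] = 0;
      }
    }
  };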
But where is the word stored?
15
Insertion (simplified)
  • How do you insert a string into a trie?

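int index(char ch) { return int(ch - 'a'); }   // assumes lowercase a..z

// Assumes a non-empty word; call with pos = 0 at the root.
void insert(string word, node *n, int pos)
{
  // Last letter: set the word bit in this node and stop.
  if (pos == int(word.size()) - 1) {
    n->isword[index(word[pos])] = 1;
    return;
  }
  // Otherwise create the child for this letter if needed, then recurse.
  if (n->links[index(word[pos])] == NULL)
    n->links[index(word[pos])] = new node;
  insert(word, n->links[index(word[pos])], pos + 1);
}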
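Search mirrors insertion, and can be sketched as below. The node layout and `insert` follow the slides (restated here so the block stands alone, with an empty-word guard added); `contains` is a name chosen for this sketch, and only lowercase 'a'..'z' input is assumed.

```cpp
#include <cstddef>
#include <string>

struct node {
    char isword[26];   // isword[i] != 0: path to this node plus letter i is a word
    node* links[26];   // links[i]: subtrie for words continuing past letter i
    node() {
        for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = nullptr; }
    }
};

int index(char ch) { return int(ch - 'a'); }   // assumes lowercase 'a'..'z' only

void insert(const std::string& word, node* n, std::size_t pos = 0) {
    if (word.empty()) return;                  // nothing to store
    int i = index(word[pos]);
    if (pos + 1 == word.size()) { n->isword[i] = 1; return; }
    if (n->links[i] == nullptr) n->links[i] = new node;
    insert(word, n->links[i], pos + 1);
}

// Follow one link per letter except the last, then test the word bit
// for the final letter: one character comparison per letter.
bool contains(const std::string& word, const node* n) {
    if (word.empty()) return false;
    for (std::size_t pos = 0; pos + 1 < word.size(); pos++) {
        n = n->links[index(word[pos])];
        if (n == nullptr) return false;        // no word continues this way
    }
    return n->isword[index(word.back())] != 0;
}
```

With `aardvark` and `aardvarkish` both inserted, `contains` finds both words but rejects the bare prefix `aard`, since no word bit is set for it.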
16
Experimental Results
  • In my implementation a node used 132 bytes
  • 20068 words were read in
  • 45747 nodes were allocated
  • Total space 6,038,604 bytes
  • (compared with 200k size of /usr/dict/words)
  • Average word length 7.4 characters
  • Average comparisons per search 7.4 one-character
    comparisons (compared to 15 word comparisons for
    a balanced BST)
  • Easier to implement than a balanced BST
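The 132-byte figure is consistent with 4-byte pointers: 26 flag bytes padded to 28, plus 26 × 4 bytes of links. The helper below is a rough estimate only, assuming the compiler pads the flag array up to a multiple of the pointer size; `nodeBytes` is a name invented for the sketch, and actual sizes depend on compiler and platform.

```cpp
#include <cstddef>

// Estimated size of a trie node holding 26 one-byte flags and 26 pointers,
// assuming the flag array is padded to a multiple of the pointer size.
std::size_t nodeBytes(std::size_t pointerSize) {
    std::size_t flags = (26 + pointerSize - 1) / pointerSize * pointerSize;
    return flags + 26 * pointerSize;
}

// Total space for `nodes` trie nodes at the given pointer size.
std::size_t trieBytes(std::size_t nodes, std::size_t pointerSize) {
    return nodes * nodeBytes(pointerSize);
}
```

Under that assumption, `trieBytes(45747, 4)` reproduces the 6,038,604 bytes reported above, and the same node grows to 240 bytes with 8-byte pointers.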

17
Using a Trie: Examples
  • Spell checker: fast but big
  • Symbol table with lots of short symbols
  • Boggle-playing program
  • read /usr/dict/words into a trie
  • generate a 4x4 square of random letters
  • DFS (backtracking search) starting at each
    square, not re-using letters, finding all words
    from trie
  • Anagrams
  • Read /usr/dict/words into a trie
  • Read in a string
  • Anagram(lettersToUse, wordPrefix)
  • Anagram("vilcan", ""), Anagram("vilan", "c"),
    Anagram("viln", "ca")
  • Anagram("n", "calvi") prints "calvin"
  • Anagram("vi", "caln") returns without expanding
    (no "caln" words)
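The anagram search can be sketched as a backtracking recursion that abandons any prefix the trie does not contain. The node layout follows the slides; `walk` and the output vector are assumptions of this sketch (the slide's version prints instead of collecting), and only lowercase 'a'..'z' input is assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct node {
    char isword[26];
    node* links[26];
    node() { for (int i = 0; i < 26; i++) { isword[i] = 0; links[i] = nullptr; } }
};

int index(char ch) { return int(ch - 'a'); }   // assumes lowercase 'a'..'z' only

void insert(const std::string& word, node* n, std::size_t pos = 0) {
    if (word.empty()) return;
    int i = index(word[pos]);
    if (pos + 1 == word.size()) { n->isword[i] = 1; return; }
    if (n->links[i] == nullptr) n->links[i] = new node;
    insert(word, n->links[i], pos + 1);
}

// Follow all letters of `s` except the last; return the node reached, or nullptr.
const node* walk(const std::string& s, const node* n) {
    for (std::size_t pos = 0; n && pos + 1 < s.size(); pos++)
        n = n->links[index(s[pos])];
    return n;
}

// Append to `out` every dictionary word that uses all of `lettersToUse`
// after `wordPrefix`.
void anagram(const std::string& lettersToUse, const std::string& wordPrefix,
             const node* root, std::vector<std::string>& out) {
    if (!wordPrefix.empty()) {
        const node* n = walk(wordPrefix, root);
        int i = index(wordPrefix.back());
        bool word = n && n->isword[i];
        bool longer = n && n->links[i];
        if (!word && !longer) return;          // no words start with wordPrefix
        if (lettersToUse.empty()) { if (word) out.push_back(wordPrefix); return; }
    }
    for (std::size_t i = 0; i < lettersToUse.size(); i++) {
        if (lettersToUse.find(lettersToUse[i]) != i) continue;  // repeated letter: already tried
        std::string rest = lettersToUse;
        rest.erase(i, 1);
        anagram(rest, wordPrefix + lettersToUse[i], root, out);
    }
}
```

The early return on a dead prefix is what keeps the search far below the 6! orderings of `vilcan`.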