Storage by Hashing - PowerPoint PPT Presentation

1
Storage by Hashing
  • Regio Florea
  • COT4810
  • Spring 2005
  • 1/24/05

2
Introduction
  • Familiarity with and prior knowledge of hashing
    is assumed.
  • Also, it is assumed that a good portion of
    hashing knowledge is lost or fuzzy.
  • This presentation is dual-purpose, serving as
    both an introduction to hashing and as a review,
    depending on experience.
  • After the presentation you should understand the
    fundamental concepts of hashing, and have
    knowledge of various hashing techniques.

3
Flow of Discussion
  • Background and history of hashing
  • Fundamentals of hashing
  • Hash functions
  • Collision resolution
  • Uses of storage by hashing

4
What is Hashing?
  • Hashing is a technique of storing data by means
    of a dynamic data structure known as a hash
    table.
  • A hash table can be visualized by means of an
    array.
  • Insertion, deletion, and searching of data items
    are allowed through the use of a hash table.

5
Background
  • hash
  • . . .
  • tr.v.
  • 1. To chop into pieces; mince.
  • . . .
  • hash browns
  • pl.n.
  • Chopped cooked potatoes, fried until brown.
    Also called hash brown potatoes.

6
Background
  • The term hash, as used in CS, comes from the
    everyday meaning of the word: a key is chopped
    up and mixed to produce an address.
  • It is believed that Hans Peter Luhn of IBM was
    the first to use the concept, in 1953.
  • The term hash itself came into use about 10
    years later.

7
Hash Example

8
Why Hashing?
  • To store and find records or data elements in
    large files, one may use one of the following
    three methods:
  • Sequentially searching through each of the n
    records, requiring O(n) steps to find a given
    record.
  • Using a data structure that needs to be ordered,
    or a tree, which would require O(log n) steps to
    find a particular record.
  • Or using certain data from a record to generate
    an address of a location in memory, which would
    require O(1) steps on average to find a specific
    record.
  • Which would you choose?
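
The three methods can be compared side by side. The following is a minimal Python sketch, not taken from the slides; the sample records and the function names (sequential_search, ordered_search, hashed_search) are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical sample data: (key, record) pairs, already sorted by key.
records = [(k, f"record-{k}") for k in range(0, 1000, 5)]
keys = [k for k, _ in records]
table = dict(records)  # Python's built-in dict is itself a hash table


def sequential_search(key):
    # O(n): examine the records one by one.
    for k, value in records:
        if k == key:
            return value
    return None


def ordered_search(key):
    # O(log n): binary search over the sorted keys.
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    return None


def hashed_search(key):
    # O(1) on average: the key is hashed directly to a slot.
    return table.get(key)
```

All three return the same answer for a given key; only the number of steps needed to reach it differs.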

9
Designations
  • k: the key, normally an alphanumeric string. We
    hash the key to generate a memory address.
  • h: the hash function, h(k), which takes a key as
    input and produces the hash code or index as
    output.
  • M: the size of the address range. Addresses run
    from 0 to M - 1 and can be memory addresses or
    simply array indices.
  • m: the number of addresses in the hash table that
    contain keys.

10
Hash Functions

11
Division
  • A good, common hash function uses the modulus
    operation.
  • e.g., h(k) = k mod M
  • We want to select an M that is PRIME so that the
    generated addresses are distributed well.
  • Also, M should not take the form r^h ± a, where r
    is the radix and a is a small integer (to ensure
    proper distribution).

12
Division
  • Example: If keys with the hash codes 200, 205,
    210, 215, ..., 600 are inserted in an array of size
    100, each hash code will collide with three
    others (four others for the multiples of 100).
  • But if we select a size of 101, no collisions
    will occur!
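
The example can be checked directly. This is a small Python sketch, assuming h(k) = k mod M with integer keys; the helper name worst_bucket is hypothetical.

```python
from collections import Counter


def worst_bucket(keys, M):
    """Largest number of keys that share one address under h(k) = k mod M."""
    counts = Counter(k % M for k in keys)
    return max(counts.values())


keys = range(200, 601, 5)          # 200, 205, ..., 600 (81 keys)

print(worst_bucket(keys, 100))     # 5: 200, 300, 400, 500, 600 all map to 0
print(worst_bucket(keys, 101))     # 1: every key gets its own address
```

With a table of size 100 the 81 keys are squeezed into only 20 distinct addresses; with size 101 they occupy 81 distinct addresses.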

13
Multiplication
  • The following steps produce a good hash function,
    provided x is an irrational number:
  • Multiply k by x.
  • Take the fractional part of the previous result.
  • Multiply by M.
  • Take the integer part of the previous result
    (floor function).
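
The four steps can be sketched in Python as follows. The function name mult_hash is hypothetical, and the default x is an assumption: it is set to the golden mean discussed on the next slide. Note that a floating-point value can only approximate an irrational number.

```python
import math

# Assumed default multiplier: the golden mean, (sqrt(5) - 1) / 2, about 0.618.
G = (math.sqrt(5) - 1) / 2


def mult_hash(k, M, x=G):
    """Multiplication-method hash of a non-negative integer key k into 0..M-1."""
    product = k * x                        # step 1: multiply k by x
    frac = product - math.floor(product)   # step 2: take the fractional part
    return math.floor(frac * M)            # steps 3 and 4: multiply by M, floor


print(mult_hash(1, 100))  # 61
print(mult_hash(2, 100))  # 23
print(mult_hash(3, 100))  # 85
```

Unlike the division method, M does not need to be prime here, because the spreading comes from the fractional part of k * x.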

14
Interesting Note
  • Selecting the golden mean (designated g) as the
    value of the irrational number x tends to produce
    a better hashing than other choices of x.
  • Golden ratio ≈ 1.6180339; golden mean
    g = 1 / 1.6180339 ≈ 0.61803399

15
Interesting Note
  • Evidence of the use and existence of golden
    mean/ratio can be found throughout time and
    history.
  • Present in architecture, art, music, the human
    body, and nature.
  • Related to Fibonacci sequence (also present in
    nature, flowers, etc.).
  • Often regarded as the most aesthetic proportion.

16
Collisions
  • Collisions occur when two or more records hash to
    the same address.
  • A collision resolution technique is required to
    handle each collision.
  • Two techniques:
  • 1.) Chaining
  • 2.) Open addressing

17
Chaining
  • More common, simpler collision resolution
    technique.
  • Hash table can be viewed as an array of linked
    lists.
  • When a collision occurs, the data record is
    linked into that hash index's linked list.
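
A minimal sketch of chaining in Python, under two assumptions not found in the slides: Python lists stand in for the linked lists, and the built-in hash() followed by mod M stands in for h(k). The class name ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, M=101):
        self.M = M
        # One chain per address; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(M)]

    def _h(self, key):
        # Assumed hash function: built-in hash() reduced mod M.
        return hash(key) % self.M

    def insert(self, key, value):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or empty chain: append

    def search(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._h(key)
        before = len(self.buckets[idx])
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]
        return len(self.buckets[idx]) < before   # True if something was removed
```

With M = 100, the keys 200 and 300 both hash to address 0 and simply end up in the same chain, so neither insertion fails.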

18
Chaining
  • See the following website at the University of
    Michigan for a view of chaining among other
    animated hashing visualizations:
  • http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm

19
Open Addressing
  • More complex, but no auxiliary structures are
    employed.
  • All records in a hash table are stored in the
    table itself, one element per bin.
  • When searching for a record we go through slots
    until it's found or it's determined that it's not
    in the table.
  • Table is successively examined (probed) when
    inserting/searching.

20
Open Addressing
  • Three types:
  • Linear probing
  • Quadratic probing
  • Double hashing

21
Linear Probing
  • Given an ordinary hash function h(k), the probe
    sequence becomes
  • h_i(k) = (h(k) + i) mod M,
  • i = 0, 1, 2, ..., M - 1
  • Examination sequence is h(k), h(k) + 1, ..., M - 1,
    0, ..., h(k) - 1.
  • Simple technique, but primary clustering occurs,
    where long runs of consecutive slots are full.
  • Primary clustering increases the average search
    time.
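
A minimal sketch of open addressing with linear probing in Python. It assumes non-negative integer keys and h(k) = k mod M; the class name LinearProbingTable is hypothetical. Deletion is left out on purpose, since removing a key from the middle of a probe sequence needs extra care (for example, tombstone markers).

```python
EMPTY = None


class LinearProbingTable:
    """Open-addressing table using h_i(k) = (h(k) + i) mod M."""

    def __init__(self, M=11):
        self.M = M
        self.slots = [EMPTY] * M

    def _probe(self, key):
        start = key % self.M                 # assumed h(k) = k mod M
        for i in range(self.M):
            yield (start + i) % self.M       # h(k), h(k)+1, ..., wrapping to 0

    def insert(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY or self.slots[idx] == key:
                self.slots[idx] = key
                return idx                   # slot where the key now lives
        raise RuntimeError("table is full")

    def search(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:     # empty slot: key cannot be further on
                return None
            if self.slots[idx] == key:
                return idx
        return None
```

Inserting 5, 16 and 27 into a table of size 11 shows primary clustering on a small scale: all three keys hash to address 5, so they occupy the consecutive slots 5, 6 and 7.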

22
Quadratic Probing
  • Similar to linear probing, but works better.
  • Probe sequence takes the form
  • h_i(k) = (h(k) + c1*i + c2*i^2) mod M,
  • i = 0, 1, 2, ..., M - 1
  • Clustering still occurs, but it is milder than in
    linear probing and is known as secondary
    clustering.

23
Quadratic Probing
  • If the table size isn't prime, then it's possible
    that an empty slot won't be found, even before
    the table is half full.
  • Logical conclusion: GO PRIME!
  • A prime table size is a safe default for both
    division hashing and quadratic probing.
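
The effect of table size on quadratic probing can be seen by counting how many distinct slots a probe sequence can reach. This is a small Python sketch assuming the simple constants c1 = 0 and c2 = 1; the function name reachable_slots is hypothetical.

```python
def reachable_slots(M, c1=0, c2=1, start=0):
    """Set of slots visited by (start + c1*i + c2*i^2) mod M for i = 0..M-1."""
    return {(start + c1 * i + c2 * i * i) % M for i in range(M)}


print(len(reachable_slots(16)))  # 4: only slots 0, 1, 4 and 9 are ever probed
print(len(reachable_slots(17)))  # 9: more than half of the table
```

With M = 16, an insertion can fail once those four slots are taken, even though three quarters of the table is still empty. With the prime size 17, the sequence reaches 9 of the 17 slots, so a free slot is always found while the table is at most half full.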

24
Double Hashing
  • Considered to be one of the best open addressing
    methods.
  • Takes the form
  • h_i(k) = (h(k) + i * h'(k)) mod M,
  • h(k) and h'(k) are hash functions
  • The entire probe sequence depends on the key.
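
A minimal sketch of the double hashing probe sequence in Python. The two hash functions are assumptions chosen for illustration: h(k) = k mod M and h'(k) = 1 + (k mod (M - 1)). The second function must never return 0, otherwise the probe would not move. The function name double_hash_probes is hypothetical.

```python
def double_hash_probes(k, M):
    """Probe sequence h_i(k) = (h(k) + i * h'(k)) mod M for i = 0..M-1."""
    h1 = k % M                 # assumed primary hash function h(k)
    h2 = 1 + (k % (M - 1))     # assumed secondary hash h'(k), always >= 1
    return [(h1 + i * h2) % M for i in range(M)]


print(double_hash_probes(10, 7))  # [3, 1, 6, 4, 2, 0, 5]
print(double_hash_probes(17, 7))  # [3, 2, 1, 0, 6, 5, 4]
```

Keys 10 and 17 collide on their first probe (address 3) but then follow different paths through the table, which is what avoids the clustering seen with linear and quadratic probing. Because M = 7 is prime, each sequence visits every slot exactly once.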

25
Practical Applications
  • Aside from being relevant in general computing
    areas (e.g., when data items need to be stored),
    there exist more specific areas where hashing
    plays a significant part.
  • Three main areas in computer science where
    hashing plays a role are the following:
  • Databases
  • Cryptography
  • Compiler Design

26
Databases
  • The need or at least possible use of hashing in
    databases should be obvious.
  • Namely, the storing and searching of DATA!
  • Hashing can be used effectively in a database
    management system (DBMS), for instance.

27
Databases
  • Seeing as adding records and searching for
    records are such common operations, deciding to
    use hashing is a no-brainer.
  • As databases come in various sizes, hashing can
    likely be used effectively in all but the
    largest databases (hashing isn't so
    memory-efficient!).

28
Cryptography
  • One use is in digital signatures, which are used
    to authenticate the sender of a message and to
    verify that the message was not altered.
  • The signature is transformed by the hash function
    and then the resulting value (aka message-digest)
    and the signature itself are sent to the receiver
    (in separate transmissions).

29
Cryptography
  • The receiver then uses the same hash function,
    derives a message-digest from the signature and
    then compares the resulting value to the
    message-digest it received from the sender.
  • These two values being the same confirms the
    authentication.
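
The compare step described above can be sketched in Python with the standard library. SHA-256 is an assumption here, since the slides do not name a hash function, and the function names message_digest and verify are hypothetical. This shows only the digest comparison, not a complete signature scheme.

```python
import hashlib
import hmac


def message_digest(data: bytes) -> str:
    # Assumed hash function: SHA-256 (the slides do not specify one).
    return hashlib.sha256(data).hexdigest()


def verify(received_data: bytes, received_digest: str) -> bool:
    # The receiver recomputes the digest and compares it with the one received.
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(message_digest(received_data), received_digest)


signature = b"example signature bytes"   # placeholder data for illustration
digest = message_digest(signature)

print(verify(signature, digest))              # True
print(verify(b"tampered bytes", digest))      # False
```

Any change to the transmitted data, however small, produces a different digest, so the comparison fails.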

30
Compiler Design
  • Hashing is often used in the design of compilers.
  • The most common place to find hashing used in a
    compiler is in the symbol table.
  • Well-designed compilers commonly use hashing for
    their symbol tables.
  • Symbol tables are simply the data areas where
    symbols and their associated values are held.
  • Some common symbols include labels in assembly
    language, and variable names in higher-level
    languages.
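
A toy symbol table can be sketched in a few lines of Python, relying on the fact that Python's built-in dict is a hash table. The function names declare and lookup, and the fields stored per symbol, are hypothetical and far simpler than what a real compiler keeps.

```python
symbol_table = {}   # dict: a hash table keyed by symbol name


def declare(name, kind, value=None):
    """Add a symbol; reject a second declaration of the same name."""
    if name in symbol_table:
        raise KeyError(f"duplicate symbol: {name}")
    symbol_table[name] = {"kind": kind, "value": value}


def lookup(name):
    """Return the symbol's entry, or None if it was never declared."""
    return symbol_table.get(name)


declare("loop_start", "label", 0x40)   # e.g. a label in assembly language
declare("count", "variable", 0)        # e.g. a variable in a higher-level language
print(lookup("loop_start"))            # {'kind': 'label', 'value': 64}
```

Each lookup hashes the symbol name straight to its entry, which matters because a compiler looks symbols up every time a name appears in the source.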

31
Summary
  • Hashing is an effective method for managing data.
  • Common hashing functions employ the
    previously-mentioned division and multiplication
    methods.
  • Collision resolution needs to be done, either in
    the form of chaining or open addressing.

32
Summary
  • Its strength lies in its speed, namely in tending
    towards O(1) (constant-time) performance when
    done correctly.
  • Its main weakness is that it doesn't always use
    memory effectively (e.g., the more data you have,
    the more empty slots will exist).
  • Hashing is part of the bread and butter of
    computer science, and is used in various related
    areas.

33
References
  • Baranova, Nadia. Hash Tables. Lecture
    presentation. 3 March 2004.
  • Dewdney, A.K. The New Turing Omnibus. New York:
    Henry Holt and Company, 2001.
  • Goodrich, Michael T., and Roberto Tamassia. Data
    Structures and Algorithms in Java. 3rd ed.
    Hoboken: John Wiley & Sons, 2004.
  • Hashing. searchDatabase.com. 20 Jan. 2005.
    http://search.database.techtarget.com/sDefinition/0,,sid13_gci212230,00.html
  • RSA Security. 2.1.6 What is a hash function? 2
    Jan. 2005.
    http://www.rsasecurity.com/rsalabs/node.asp?id=2176