Discussion - PowerPoint PPT Presentation

About This Presentation
Title:

Discussion


Slides: 47
Provided by: danrol

Transcript and Presenter's Notes

1
Discussion #35: Data Organization
2
Data Organization
3
A question of scale
  • 10 things
  • 100 things
  • 1,000 things
  • 10,000 things
  • 1,000,000 things
  • 1,000,000,000 things
  • 1,000,000,000,000 things
  • 1,000,000,000,000,000 things
  • Grains of sand on the earth

4
2,823 Terabytes
ADIC's Pathlight VX 2.0 is the scalable backup
and restore solution that increases the capacity
and reduces the cost of disk backup by
integrating disk and tape in a single, unified
system. Pathlight VX 2.0 starts by leveraging
EMC CLARiiON ATA disk arrays to double users'
backup performance and give it RAID
reliability. Pathlight integrates the capacity
and value of tape to deliver true enterprise
scalability and disaster recovery support, and
reduce system costs by half. ADIC's capacity of
tape with disk is from 3.8 to 2,823 TB (2.8 PB).
5
Terabyte
  • A terabyte is a unit of measurement in computers.
  • 1,000,000,000,000 bytes = 10^12. This definition
    is used in most contexts relating to disk
    storage, networking, or other hardware.
  • 1,099,511,627,776 bytes = 1024^4 or 2^40. This is
    1024 times a gigabyte (a binary gigabyte). This
    is the definition most often used in computer
    science and computer programming; most software
    uses this definition.
  • The symbol for terabyte is TB.

6
Petabyte
  • A petabyte is a unit of measurement in computers
    of one thousand million million (short-scale
    quadrillion) bytes.
  • 1,000,000,000,000,000 bytes, or 10^15.
  • 1,125,899,906,842,624 bytes = 1024^5, or 2^50. This
    is 1024 times larger than a terabyte. This is the
    definition most often used in information
    storage.
  • The symbol for petabyte is PB.
  • An exabyte is 1000 (or 1024) times a petabyte.
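
The decimal and binary definitions above can be compared directly. This is a minimal sketch; the constant and function names are made up for illustration and do not come from the slides.

```python
# Decimal (powers of 10) and binary (powers of 1024) unit sizes in bytes.
DECIMAL_TB = 10**12
BINARY_TB = 1024**4   # same as 2**40
DECIMAL_PB = 10**15
BINARY_PB = 1024**5   # same as 2**50


def percent_larger(binary, decimal):
    """How much larger the binary unit is than the decimal one, in percent."""
    return (binary - decimal) / decimal * 100


print(f"binary TB = {BINARY_TB:,} bytes")
print(f"binary PB = {BINARY_PB:,} bytes")
print(f"TB gap: about {percent_larger(BINARY_TB, DECIMAL_TB):.2f}%")
print(f"PB gap: about {percent_larger(BINARY_PB, DECIMAL_PB):.2f}%")
```

The gap between the two definitions widens with each step up in unit: roughly 10% for a terabyte and between 12% and 13% for a petabyte.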

7
Where are we headed?
  • A typical video store contains about 8 terabytes
    of video.
  • The books in the largest library in the world,
    the U.S. Library of Congress, contain about 20
    terabytes of text.
  • The San Diego Supercomputer Center (SDSC) has a 1
    petabyte hard disk store and a 6 petabyte robotic
    tape store, both attached to the National Science
    Foundation's TeraGrid network.
  • The Internet Archive Wayback Machine contains
    approximately 1 petabyte of data and is currently
    growing at a rate of 20 terabytes per month.

8
Internet Archive
  • The Internet Archive, located in the Presidio of
    San Francisco, was founded by Brewster Kahle in
    1996 and is dedicated to maintaining an archive
    of the Internet. Their collections include
  • snapshots of the World Wide Web and Usenet.
  • movies
  • audio recordings, mostly from live concerts
  • books
  • software
  • The archive also maintains the Wayback Machine.
    Once given a URL, this tool allows the user to
    see versions of the corresponding web page over
    time.

9
Wayback Machine
  • Examples of the Wayback Machine's archives
  • Amazon, Microsoft, BBC News, Google, Wikipedia
  • The archive always waits six months before
    putting pages online.
  • The name "Wayback Machine" is a reference to a
    Rocky and Bullwinkle Show cartoon serial. Mr.
    Peabody, a bowtie-endowed dog with a professorial
    air, and his assistant, a boy named "Sherman",
    use a time machine named the "Wayback Machine" to
    visit famous events in history, usually going
    awry for comic reasons.

10
Problem
  • Huge amounts of information
  • How do I find
  • Information that I know I want
  • Information related to what I want
  • How do I understand
  • Particular pieces of information
  • The whole collection of information

11
Limitations
  • Screen space
  • Network bandwidth
  • Bandwidth - how much information can be
    transmitted per second
  • Human attention

12
Kinds of things to organize
  • Menu items
  • MS Word - about 150 menu items
  • Text
  • Pages in a book - 500
  • Documents on the WWW - gazillions
  • Images
  • All of the pictures created in a commercial
    advertising company

13
Kinds of things to organize
  • Sounds
  • Sound tracks to all TV and Radio news broadcasts
  • Video
  • A complete collection of classic movies
  • Structured information (records)
  • People
  • Cars
  • Students
  • Electronic appliance parts

14
Three ways to find things
  • Lists
  • arrays
  • Trees
  • organize into categories
  • Search
  • describe what you want and have the computer find
    it

15
Finding things in lists
  • How long will it take to find Ron Dallin in the
    Provo/Orem phone book?
  • How long will it take to find 764-0588 in the
    Provo/Orem phone book?

16
Binary search - for Goodrich
Lower = 0, Upper = 10
Guess = (0 + 10)/2 = 5
17
Binary search - for Goodrich
Lower = 0, Upper = 5
Guess = (0 + 5)/2 = 2 (rounded down)
18
Binary search - for Goodrich
Lower = 2, Upper = 5
Guess = (2 + 5)/2 = 3 (rounded down)
19
Binary search - for Goodrich
Lower = 3, Upper = 5
Guess = (3 + 5)/2 = 4
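
The four steps above can be sketched in code. This is an illustrative version, not the presenter's implementation: the function name and the 11-entry name list are assumptions (the slides do not show the list contents), chosen so that Goodrich sits at position 4. As in the slides, the bound that moves is set to the guess itself.

```python
def binary_search_trace(items, target):
    """Search a sorted list; return (position or None, list of (lower, upper, guess))."""
    lower, upper = 0, len(items) - 1
    trace = []
    while lower <= upper:
        guess = (lower + upper) // 2          # midpoint, rounded down
        trace.append((lower, upper, guess))
        if items[guess] == target:
            return guess, trace
        if upper - lower <= 1:
            # Only the upper end is left unchecked; test it and stop.
            if items[upper] == target:
                return upper, trace
            return None, trace
        if target < items[guess]:
            upper = guess                     # target is lower in the list
        else:
            lower = guess                     # target is higher in the list
    return None, trace


# Hypothetical sorted phone-book names (positions 0 to 10).
names = ["Anderson", "Bilinski", "Clark", "Garcia", "Goodrich", "Hansen",
         "Jones", "Miller", "Olsen", "Smith", "Young"]

index, trace = binary_search_trace(names, "Goodrich")
for lower, upper, guess in trace:
    print(f"Lower = {lower}, Upper = {upper}, Guess = {guess}")
```

With this list the recorded (lower, upper, guess) values follow the same sequence as slides 16 to 19, and the fourth guess is a match.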
20
Binary search
  • If there are 64 things in a list, how many times
    can you divide that list in half?
  • 32, 16, 8, 4, 2, 1
  • 6 times

21
Binary search
  • If there are 1024 things in a list, how many
    times can you divide that list in half?
  • 512, 256, 128, 64, 32, 16, 8, 4, 2, 1
  • 10 times

22
Binary search
  • If the size of the list doubles, how many more
    steps are required in a binary search?
  • 1

23
Binary search
  • If there are N items in a list then binary search
    takes
  • log2(N) steps
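
The halving argument from the previous slides can be checked with a short sketch. The function name `halvings` is made up for illustration.

```python
import math


def halvings(n):
    """Count how many times a list of n items can be cut in half until 1 is left."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


for n in (64, 1024, 2048):
    print(n, halvings(n), math.log2(n))
```

Doubling the list from 1024 to 2048 items adds exactly one halving, which is why the step count grows as log2(N).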

24
Binary search
  • Estimating log2(N)
  • Count the number of digits and multiply by 2.5
  • 1000
  • 4 × 2.5 = 10 steps
  • 1,000,000
  • 7 × 2.5 = 17-18 steps
  • 1,000,000,000
  • 10 × 2.5 = 25 steps
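
The rule of thumb can be compared against the exact logarithm. This is a small sketch with a made-up function name. Note that the rule is only a rough estimate: each decimal digit is worth about 3.32 halvings (log2 of 10), so the estimate falls short of the exact value as the numbers get larger.

```python
import math


def estimate_steps(n):
    """Slide rule of thumb: number of decimal digits times 2.5."""
    return len(str(n)) * 2.5


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"N = {n:,}: estimate {estimate_steps(n)}, exact log2 {math.log2(n):.1f}")
```

The exact values are close to 10, 20, and 30 steps, so the estimate is accurate for 1,000 and a few steps low for the larger lists.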

25
Provo/Orem phone book
  • How long to find Ron Dallin?
  • 400,000 in Utah County
  • log2(400,000) ≈ 6 × 2.5 = 15 steps

26
How to find a phone number
  • 920-3231
  • 1 step
  • 130-2313
  • 11 steps
  • Average?
  • 5 steps
  • Average N?
  • N/2
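
The N/2 average can be confirmed by scanning an unsorted list for every entry in turn. This is a minimal sketch; the phone numbers are generated placeholders, not numbers from the slides.

```python
def linear_search_steps(items, target):
    """Scan from the front; return how many entries were examined, or None."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return None


# 1,000 made-up phone numbers in no useful order for lookup by number.
numbers = [f"555-{i:04d}" for i in range(1000)]

total = sum(linear_search_steps(numbers, n) for n in numbers)
average = total / len(numbers)
print(average)
```

The exact average is (N + 1)/2, which is effectively N/2 for a list the size of a phone book.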

27
Provo/Orem phone book
  • How many steps to find a phone number?
  • 400,000/2 = 200,000 steps on average
  • How can we improve this?

28
Sort by phone number
  • What if I want to search on both name and number?

29
Using an Index
30
Using an Index
Anderson
31
Using an Index
Anderson, Bilinski
32
Using an Index
Anderson, Bilinski, Clark
33
Using an Index
Anderson, Bilinski, Clark, Garcia
34
Using an Index
123-3123
35
Using an Index
123-3123, 130-2313
36
Using an Index
123-3123, 130-2313, 232-0312
37
Using an Index
123-3123, 130-2313, 232-0312, 238-1234
38
Search for Goodrich
Lower = 0, Upper = 10
Guess = 5: lower
39
Search for Goodrich
Lower = 0, Upper = 5
Guess = 2: above
40
Search for Goodrich
Lower = 2, Upper = 5
Guess = 3: above
41
Search for Goodrich
Lower = 3, Upper = 5
Guess = 4: above
42
Search for 823-1242
Lower = 0, Upper = 10
Guess = 5: above
43
Search for 823-1242
Lower = 5, Upper = 10
Guess = 7: below
44
Search for 823-1242
Lower = 5, Upper = 7
Guess = 6: MATCH
45
Using an Index
  • What about first name or city?
  • another index
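
One way to picture "another index" is a separate sorted list of keys for each field, all pointing back into the same unsorted records. The sketch below is an assumption about how this could look, not the presenter's design: the names and numbers appear in the slides, but the pairing of names to numbers and the helper names `build_index` and `lookup` are invented for illustration. The binary search is done with Python's standard `bisect` module.

```python
import bisect

# Records stay in their original, unsorted order.
records = [
    ("Garcia", "130-2313"),
    ("Anderson", "823-1242"),
    ("Clark", "123-3123"),
    ("Bilinski", "238-1234"),
    ("Goodrich", "232-0312"),
]


def build_index(records, field):
    """Return (sorted keys, record positions in the same order) for one field."""
    order = sorted(range(len(records)), key=lambda i: records[i][field])
    keys = [records[i][field] for i in order]
    return keys, order


def lookup(index, records, target):
    """Binary-search the index keys, then follow the position into the records."""
    keys, order = index
    pos = bisect.bisect_left(keys, target)
    if pos < len(keys) and keys[pos] == target:
        return records[order[pos]]
    return None


by_name = build_index(records, 0)      # index on the name field
by_number = build_index(records, 1)    # index on the phone number field

print(lookup(by_name, records, "Goodrich"))
print(lookup(by_number, records, "823-1242"))
```

Adding a first-name or city index would mean one more call to `build_index` on that field; the records themselves are never re-sorted.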

46
Data Organization
  • What are we organizing for?
  • Scale
  • 10 - 1,000 - 1,000,000 - 1,000,000,000
  • Lists
  • Unsorted: N/2 steps
  • Sorted: log2(N) steps
  • count the digits and multiply by 2.5
  • To access in many ways
  • Use many indices into the same data