Title: Discussion
1Discussion 35 Data Organization
2Data Organization
3A question of scale
- 10 things
- 100 things
- 1,000 things
- 10,000 things
- 1,000,000 things
- 1,000,000,000 things
- 1,000,000,000,000 things
- 1,000,000,000,000,000 things
- Grains of sand on the earth
4 2,823 Terabytes
ADIC's Pathlight VX 2.0 is the scalable backup
and restore solution that increases the capacity
and reduces the cost of disk backup by
integrating disk and tape in a single, unified
system. Pathlight VX 2.0 starts by leveraging
EMC CLARiiON ATA disk arrays to double users'
backup performance and give it RAID
reliability. Pathlight integrates the capacity
and value of tape to deliver true enterprise
scalability and disaster recovery support, and
reduce system costs by half. ADIC's capacity of
tape with disk is from 3.8 to 2,823 TB (2.8 PB).
5Terabyte
- A terabyte is a unit of measurement in computers.
- 1,000,000,000,000 bytes = 10^12. This definition
is used in most contexts relating to disk
storage, networking, or other hardware.
- 1,099,511,627,776 bytes = 1024^4 or 2^40. This is
1024 times a gigabyte (a binary gigabyte). This
is the definition most often used in computer
science and computer programming; most software
uses this definition.
- The symbol for terabyte is TB.
6Petabyte
- A petabyte is a unit of measurement in computers
of one thousand million million (short-scale
quadrillion) bytes.
- 1,000,000,000,000,000 bytes, or 10^15.
- 1,125,899,906,842,624 bytes = 1024^5, or 2^50. This
is 1024 times larger than a terabyte. This is the
definition most often used in information
storage.
- The symbol for petabyte is PB.
- An exabyte is 1000 (or 1024) times a petabyte.
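The gap between the decimal and binary definitions can be checked with a little arithmetic. The following is a minimal sketch in Python; the dictionary and function names are illustrative and do not come from the slides.

```python
# Decimal (powers of 1000) and binary (powers of 1024) sizes, in bytes.
DECIMAL = {"TB": 10**12, "PB": 10**15}
BINARY = {"TB": 1024**4, "PB": 1024**5}


def percent_larger(unit):
    """How much larger the binary unit is than the decimal one, in percent."""
    return (BINARY[unit] / DECIMAL[unit] - 1) * 100


for unit in ("TB", "PB"):
    print(f"{unit}: decimal {DECIMAL[unit]:,} bytes, "
          f"binary {BINARY[unit]:,} bytes, "
          f"binary is {percent_larger(unit):.1f}% larger")
```

The two definitions drift apart at each step up: the binary terabyte is roughly 10% larger than the decimal one, and the binary petabyte roughly 12.6% larger.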
7Where are we headed?
- A typical video store contains about 8 terabytes
of video.
- The books in the largest library in the world,
the U.S. Library of Congress, contain about 20
terabytes of text.
- The San Diego Supercomputer Center (SDSC) has a 1
petabyte hard disk store and a 6 petabyte robotic
tape store, both attached to the National Science
Foundation's TeraGrid network.
- The Internet Archive Wayback Machine contains
approximately 1 petabyte of data and is currently
growing at a rate of 20 terabytes per month.
8Internet Archive
- The Internet Archive, located in the Presidio of
San Francisco, was founded by Brewster Kahle in
1996 and is dedicated to maintaining an archive
of the Internet. Their collections include
- snapshots of the World Wide Web and Usenet
- movies
- audio recordings, mostly from live concerts
- books
- software
- The archive also maintains the Wayback Machine.
Once given a URL, this tool allows the user to
see versions of the corresponding web page over
time.
9Wayback Machine
- Examples of the Wayback Machine's archives
- Amazon, Microsoft, BBC News, Google, Wikipedia
- The archive always waits six months before
putting pages online.
- The name "Wayback Machine" is a reference to a
Rocky and Bullwinkle Show cartoon serial. Mr.
Peabody, a bowtie-endowed dog with a professorial
air, and his assistant, a boy named "Sherman",
use a time machine named the "Wayback Machine" to
visit famous events in history, usually going
awry for comic reasons.
10Problem
- Huge amounts of information
- How do I find
- Information that I know I want
- Information related to what I want
- How do I understand
- Particular pieces of information
- The whole collection of information
11Limitations
- Screen space
- Network bandwidth
- Bandwidth - how much information can be
transmitted per second
- Human attention
12Kinds of things to organize
- Menu items
- MS Word - about 150 menu items
- Text
- Pages in a book - 500
- Documents on the WWW - gazillions
- Images
- All of the pictures created in a commercial
advertising company
13Kinds of things to organize
- Sounds
- Sound tracks to all TV and Radio news broadcasts
- Video
- A complete collection of classic movies
- Structured information (records)
- People
- Cars
- Students
- Electronic appliance parts
14Three ways to find things
- Lists
- arrays
- Trees
- organize into categories
- Search
- describe what you want and have the computer find
it
15Finding things in lists
- How long will it take to find Ron Dallin in the
Provo/Orem phone book?
- How long will it take to find 764-0588 in the
Provo/Orem phone book?
16Binary search - for Goodrich
Lower = 0, Upper = 10
Guess = (0+10)/2 = 5
17Binary search - for Goodrich
Lower = 0, Upper = 5
Guess = (0+5)/2 = 2 (rounded down)
18Binary search - for Goodrich
Lower = 2, Upper = 5
Guess = (2+5)/2 = 3 (rounded down)
19Binary search - for Goodrich
Lower = 3, Upper = 5
Guess = (3+5)/2 = 4
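The four steps of the Goodrich search can be reproduced in code. This is a minimal sketch, not the slides' own implementation: the function name is illustrative, and the 11-entry list is hypothetical (the slide's figure is not reproduced in the text), arranged so that Goodrich sits at position 4. It follows the slides' convention that the guess becomes the new lower or upper bound, and adds a final check so the loop also ends when the name is absent.

```python
def binary_search_trace(items, target):
    """Search a sorted list, recording (lower, upper, guess) at each step."""
    if not items:
        return None, []
    lower, upper = 0, len(items) - 1
    steps = []
    # Narrow the range as on the slides: the guess becomes the new bound.
    while upper - lower > 1:
        guess = (lower + upper) // 2  # integer division rounds down
        steps.append((lower, upper, guess))
        if items[guess] == target:
            return guess, steps
        if target < items[guess]:
            upper = guess
        else:
            lower = guess
    # At most two candidates remain; check them directly.
    for i in (lower, upper):
        if items[i] == target:
            return i, steps
    return None, steps


# Hypothetical sorted list; the names after "Goodrich" are placeholders.
names = ["Anderson", "Bilinski", "Clark", "Garcia", "Goodrich", "Hansen",
         "Jensen", "Larsen", "Nielsen", "Olsen", "Young"]

position, steps = binary_search_trace(names, "Goodrich")
for lower, upper, guess in steps:
    print(f"Lower {lower}  Upper {upper}  Guess {guess}")
```

With this list the recorded steps match the slides: guesses 5, 2, 3, and finally 4, where Goodrich is found.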
20Binary search
- If there are 64 things in a list, how many times
can you divide that list in half?
- 32, 16, 8, 4, 2, 1
- 6 times
21Binary search
- If there are 1024 things in a list, how many
times can you divide that list in half?
- 512, 256, 128, 64, 32, 16, 8, 4, 2, 1
- 10 times
22Binary search
- If the size of the list doubles, how many more
steps are required in a binary search?
- 1
23Binary search
- If there are N items in a list then binary search
takes
- log2(N) steps
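The halving counts for 64 and 1024 items, and the one extra step per doubling, can be checked directly. A minimal sketch, with an illustrative function name that is not from the slides:

```python
import math


def halvings(n):
    """Count how many times n items can be halved before one item is left."""
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count


for n in (64, 128, 1024, 2048):
    print(n, halvings(n), math.log2(n))
```

For powers of two the count equals log2(N) exactly; for other sizes it is log2(N) rounded down.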
24Binary search
- Estimating log2(N)
- Count the number of digits and multiply by 2.5
- 1000
- 4 × 2.5 = 10 steps
- 1,000,000
- 7 × 2.5 = 17-18 steps
- 1,000,000,000
- 10 × 2.5 = 25 steps
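To see how close the digit-counting estimate comes, it can be compared with the exact logarithm. A minimal sketch; the function name `rule_of_thumb` is illustrative:

```python
import math


def rule_of_thumb(n):
    """The slide's estimate: number of decimal digits times 2.5."""
    return len(str(n)) * 2.5


for n in (1000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,}: estimate {rule_of_thumb(n):4.1f}, "
          f"log2 = {math.log2(n):.1f}")
```

The estimate is closest around a thousand items. For a million and a billion items the exact values are close to 20 and 30, so the rule runs somewhat low as N grows, because each extra decimal digit adds about 3.3 to log2(N).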
25Provo/Orem phone book
- How long to find Ron Dallin?
- 400,000 in Utah County
- log2(400,000) is approximately 6 × 2.5 = 15 steps
26How to find a phone number
- 920-3231
- 1 step
- 130-2313
- 11 steps
- Average?
- 5 steps
- Average N?
- N/2
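Finding a phone number in a list sorted by name means scanning entry by entry. The sketch below counts the steps; the function names are illustrative, and the list is hypothetical apart from placing 920-3231 first and 130-2313 last, as in the example above.

```python
def linear_search_steps(items, target):
    """Return how many entries are examined before target is found."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return None  # target is not in the list


def average_steps(items):
    """Average number of steps over a search for every entry in the list."""
    return sum(linear_search_steps(items, x) for x in items) / len(items)


# Hypothetical list of 11 numbers in no particular order; the 555-01xx
# entries are placeholders.
numbers = ["920-3231", "232-0312", "823-1242", "123-3123", "764-0588",
           "238-1234", "555-0101", "555-0102", "555-0103", "555-0104",
           "130-2313"]

print(linear_search_steps(numbers, "920-3231"))
print(linear_search_steps(numbers, "130-2313"))
print(average_steps(numbers))
```

The exact average for a list of N entries is (N+1)/2, which is 6 for 11 entries; for large N this is practically the same as N/2.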
27Provo/Orem phone book
- How many steps to find a phone number?
- 400,000/2 = 200,000 average
- How can we improve this?
28Sort by phone number
- What if I want to search on both name and number?
29Using an Index
30Using an Index
Anderson
31Using an Index
Anderson, Bilinski
32Using an Index
Anderson, Bilinski, Clark
33Using an Index
Anderson, Bilinski, Clark, Garcia
34Using an Index
123-3123
35Using an Index
123-3123, 130-2313
36Using an Index
123-3123, 130-2313, 232-0312
37Using an Index
123-3123, 130-2313, 232-0312, 238-1234
38Search for Goodrich
Lower = 0, Upper = 10
Guess = 5: Goodrich is lower
39Search for Goodrich
Lower = 0, Upper = 5
Guess = 2: Goodrich is above
40Search for Goodrich
Lower = 2, Upper = 5
Guess = 3: Goodrich is above
41Search for Goodrich
Lower = 3, Upper = 5
Guess = 4: Goodrich is above
42Search for 823-1242
Lower = 0, Upper = 10
Guess = 5: 823-1242 is above
43Search for 823-1242
Lower = 5, Upper = 10
Guess = 7: 823-1242 is below
44Search for 823-1242
Lower = 5, Upper = 7
Guess = 6: MATCH
45Using an Index
- What about first name or city?
- another index
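One way to picture several indices over the same data is sketched below. It is an illustration under assumptions, not the slides' implementation: the pairing of names with phone numbers is invented, the helper names are hypothetical, and Python's standard `bisect` module stands in for the hand-written binary search.

```python
import bisect

# Hypothetical records: the pairing of names with numbers is invented
# for illustration and is not taken from the slides.
records = [
    ("Garcia", "823-1242"),
    ("Anderson", "920-3231"),
    ("Goodrich", "232-0312"),
    ("Clark", "123-3123"),
    ("Bilinski", "130-2313"),
]


def build_index(records, field):
    """Return (sorted keys, matching record positions) for one field."""
    order = sorted(range(len(records)), key=lambda i: records[i][field])
    keys = [records[i][field] for i in order]
    return keys, order


def lookup(index, key, records):
    """Binary-search the index for key; return the record or None."""
    keys, order = index
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return records[order[pos]]
    return None


name_index = build_index(records, 0)   # sorted by name
phone_index = build_index(records, 1)  # sorted by phone number

print(lookup(name_index, "Goodrich", records))
print(lookup(phone_index, "823-1242", records))
```

The records themselves are stored once and never reordered. Each index holds only sorted keys plus positions into the records, so a further index on first name or city is just one more call to `build_index` on that field.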
46Data Organization
- What are we organizing for?
- Scale
- 10 - 1,000 - 1,000,000 - 1,000,000,000
- Lists
- Unsorted (N/2)
- Sorted (log2(N))
- count the digits and multiply by 2.5
- To access in many ways
- Use many indices into the same data