Title: MFM Observation Of Magnetization Reversal Process In Recording Media
1Finding Needles in the Internet Haystack
Ron K. Cytron
Washington University in Saint Louis
Department of Computer Science
http://www.cs.wustl.edu/cytron/
Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD), Mahesh Jayaram
Thanks: Ben Brodie
Center for Distributed Object Computing
Department of Computer Science
Washington University
Century Club, May 2002
7Outline
- Computers have come a long way
- Today's computers are never lonely
- Volumes and volumes of data
- Fast searching of magnetic media
- Internet packet filtering
- Conclusion
8A Grandchild's Gift
9If cars improved that much in 30 years
- $4000
- 60,000 miles per hour
- Seats 10,000 people
- Gets 20,000 miles per gallon
- Breaks every 70 years
10The Haystack
- The Internet is large and growing
- Content on the Internet is growing even faster
- A haystack sits still, but the Internet . . .
11Growth of the Internet (why computers aren't lonely anymore)
Y2K Problem (?)
More computers sold than TVs
12Growth of Internet Content (volumes and volumes of data)
Anybody can publish
Problem is how to find what you want
139/17/2001
Page 6B
What can tech companies do? Some say they're at a loss, but others offer budding solutions
By Kevin Maney
On July 7, 1940, as the nation edged toward World War II, IBM put out a statement that made headlines. The company offered all its facilities for national defense, ready to convert to making anything the government needed. Other leaders in the electro-mechanical technology of the day -- Ford Motor, General Motors, General Electric -- also threw their weight into defense efforts. They switched from making cars and washing machines to building tanks, aircraft engines and machine guns.
So here we are in 2001, readying for another war. The U.S. technology industry is the best and most innovative in the world. It is the nation's pride and joy. Shouldn't it do something?
14. . . One possibility is in data-mining technology. Data mining is a way to collect millions of pieces of information in a computer system, sift through that data, make sense of them and come up with something useful. "We (the U.S. tech industry) are experts at data mining and have vast resources of data to mine," says Tom Evslin, CEO of Internet communications company ITXC. "We have used it to target advertising. We can probably use it to identify suspicious activity or potential terrorists."
. . .
15Fast searching of magnetic media with Roger
Chamberlain, Mark Franklin, Ron Indeck, John
Lockwood
16Enabling Technology: Disk Drives
Almost 10,000,000x increase in 45 years!
Magnetic disk storage areal density vs. year of IBM product introduction (From D. A. Thompson)
17Cost per Megabyte
Cost decreasing 3% per week!
Price history of hard disk product vs. year of
product introduction (From D. A. Thompson)
18Massive Storage Data
- Storage industry will ship 4,000,000,000,000,000,000 Bytes this year
- FedEx generated 14 Terabytes of data last year
- US intelligence collects data equaling the printed collection of the US library every day!
19Massive Data Sets
- Employee records
- Consumer information
- Maps/mission/intelligence data
- Genome maps
- Data sets now measured in Terabytes, and are
dynamic!
20Genome Application
- Genome maps growing, expanded daily
- Wash U sequencing center
- Each of us has 80,000 genes found among 3 billion characters of DNA (A,C,G,T)
- Look for matches
- Identify function
- Disease: understand, diagnose, detect, medicine, therapy
- Biofuels, warfare, toxic waste
- Understand evolution
- Forensics, organ donors, authentication
- More effective crops, disease resistance
21DNA String Matching
- Looking for CACGTTAGTTAGC
- Interested in matches and near matches
- Search human genome and other gene oceans
- Need to search entire data sets
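As a minimal sketch of what "matches and near matches" could mean, the Python below slides the pattern from this slide over a sequence and reports every offset where at most a chosen number of characters differ. The function name, the mismatch budget, and the sample sequence are illustrative assumptions, not material from the talk, and substitution-only (Hamming) distance is just one possible notion of a near match.

```python
def near_matches(sequence, pattern, max_mismatches=1):
    """Return (offset, mismatches) for each window within the mismatch budget."""
    hits = []
    m = len(pattern)
    for offset in range(len(sequence) - m + 1):
        window = sequence[offset:offset + m]
        # Count the positions where the window and the pattern disagree.
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


PATTERN = "CACGTTAGTTAGC"  # pattern shown on the slide
# Hypothetical sequence: one exact copy and one copy with a single substitution.
SEQUENCE = "GGA" + "CACGTTAGTTAGC" + "TTT" + "CACGTTCGTTAGC" + "AA"

if __name__ == "__main__":
    print(near_matches(SEQUENCE, PATTERN))
```

Note that every window of the sequence is examined, which is the "need to search entire data sets" cost in miniature.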
22Bio Computation Problem
BIG Genome Databases
[Figure: DNA sequence (A C G T G) compared with DNA pattern (T A C A G). Match?]
Approximate matches are just as useful
23Finding a needel in a heystuck
- DNA and live text can contain errors
- We often seek an approximate match, for example
- needle
- No match? Try 2-transpositions
- enedle, needle, nedele, neelde, needel
- No match? Try 1-deletions
- eedle, nedle, nedle, neele, neede, needl
- No match? Try insertions, larger edits, . . .
- An exponential number of possibilities
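The variant lists on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names and the 26-letter alphabet used for insertions are assumptions, and duplicates are kept on purpose so the output lines up with the slide.

```python
def transpositions(word):
    """Swap each pair of adjacent letters, left to right (duplicates kept)."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def deletions(word):
    """Drop each letter in turn (duplicates kept)."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def insertions(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Insert every letter of the (assumed) alphabet at every position."""
    return [word[:i] + c + word[i:]
            for i in range(len(word) + 1) for c in alphabet]


if __name__ == "__main__":
    print(transpositions("needle"))
    print(deletions("needle"))
    print(len(insertions("needle")), "single insertions")
```

Even for a six-letter word, single insertions alone give 7 x 26 candidates, and combining edits multiplies the count, which is the growth the last bullet refers to.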
24How is this done today?
- Think of every way a word can be misspelled
- Present each misspelling to the computer for an
exact match
enedle needle nedele neelde needel
No
25How can we do better?
- Data is present on magnetic media
- Hardware at the disk is
- Already fault tolerant (more on this later)
- needel ? needle
- Distributed across all surfaces
We win if the number of misspellings is large and the number of false hits is small
26Another Application: Intelligence Data
- Lots of data
- Changing constantly
- Many perturbations
- Tzar, tsar, czar, . . .
- Don't know what we want to look for beforehand
27Google Search Engine
- Crawls the web once per month
- Caches web pages
- Fast, exact text-based search (see how soon)
28Image Database Applications
- Challenging database
- Unstructured
- Massive data sets
- Don't know what we need to look for in each picture
29Satellite Data
- Low-orbit fly-over every 90 minutes
- Look for differences in images
- Large objects
- Troops
- Changes to landscape
- Flag, transmit these differences immediately
- National Reconnaissance Office
- City assessors . . .
30Washington University
Hilltop Campus
31How do we find what we're looking for?!
32Conventional Structured Database
Word        Inverted list - pointers
agent       <1,2>
Bond        <1,4>
computer    <2>
James       <1,3,4>
Madison     <3>
mobile      <2>
movie       <3,4>
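A minimal sketch of how such an inverted list could be built and queried is shown below. The four document texts are hypothetical: they were chosen only so that the resulting index reproduces the table on this slide, and the function names are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical documents, picked so the index matches the slide's table.
DOCUMENTS = {
    1: "agent James Bond",
    2: "mobile agent computer",
    3: "James Madison movie",
    4: "James Bond movie",
}


def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def lookup(index, *words):
    """Return ids of documents that contain every query word (AND query)."""
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    index = build_inverted_index(DOCUMENTS)
    for word in sorted(index, key=str.lower):
        print(word, index[word])
    print(lookup(index, "James", "Bond"))
```

The lookup is fast because it touches only the lists for the query words, but the index has to exist before the question is asked.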
33Challenges in Searching Massive Databases
- Know what to search for
- need to build index beforehand
- maintain index as it changes
- Do not know what to search for
- need to search the whole database!
34Conventional Search
Processor
Hard drive
Memory bus
Memory
I/O bus
35Conventional Search
find .
Processor
Hard drive
Memory bus
Memory
I/O bus
36Conventional Search
yes, no, no, yes, yes .
Processor
Hard drive
contents
Memory bus
Memory
I/O bus
37Conventional Approach
38WUSTL's Approach
39Streaming Approach
Processor
Hard drive
Reconfigurable hardware
Memory/ processing
Memory Bus
Memory
I/O bus
40Streaming Approach
find
Processor
Hard drive
Reconfigurable hardware
Memory/ processing
Memory Bus
Memory
I/O bus
41Streaming Approach
Processor
find
Hard drive
Reconfigurable hardware
Memory/ processing
Memory Bus
Memory
I/O bus
42Streaming Approach
yes, no, no, yes, yes
Processor
find
Hard drive
Reconfigurable hardware
Memory/ processing
Memory Bus
Memory
I/O bus
Parallelism through each transducer and drive
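The difference between the two diagrams can be sketched as a toy cost model in Python. Everything here is an assumption made for illustration: the block contents, the function names, and in particular the accounting of one byte per yes/no answer. Real hardware would stream data past the matcher rather than call a Python function.

```python
def conventional_search(blocks, key):
    """Every block crosses the I/O bus; the processor does the matching."""
    bytes_over_bus = sum(len(block) for block in blocks)
    answers = [key in block for block in blocks]
    return answers, bytes_over_bus


def streaming_search(blocks, key):
    """Matching happens next to the drive; only the key and answers move."""
    # Stand-in for the reconfigurable hardware: test each block as it streams by.
    answers = [key in block for block in blocks]
    # Assumed cost model: the key goes down once, one byte comes back per block.
    bytes_over_bus = len(key) + len(blocks)
    return answers, bytes_over_bus


# Hypothetical disk contents, five blocks.
BLOCKS = [b"hay needle hay", b"hay hay hay", b"straw", b"needle", b"a needle here"]

if __name__ == "__main__":
    print(conventional_search(BLOCKS, b"needle"))
    print(streaming_search(BLOCKS, b"needle"))
```

Both functions return the same yes/no answers; only the amount of data that has to reach the processor differs.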
43Magnetic Recording Channel Schematic
[Figure: Input User Data -> Encoder -> Channel Bits -> Head / Disk -> Analog Readback -> Detector -> Decoder -> Decoded User Data -> To Bus or Cache; points A, B and C are marked along the channel]
44Key streaming over Data
45Disk Level Implementation
matches
score
100-bit-key matching through a pseudo-random
binary series
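One way to picture the "matches" and "score" outputs is a sliding comparison that counts agreeing bits at every alignment. The Python below is a sketch under stated assumptions: the seed, the series length, and the position where the key is planted are arbitrary, and a software loop only stands in for whatever the disk-level hardware actually does.

```python
import random


def match_scores(data_bits, key_bits):
    """Score each alignment by the number of bit positions that agree."""
    k = len(key_bits)
    return [sum(1 for d, b in zip(data_bits[i:i + k], key_bits) if d == b)
            for i in range(len(data_bits) - k + 1)]


def demo(seed=2002, data_len=2000, key_len=100, position=700):
    """Plant a key in a pseudo-random bit series and find it by its score."""
    rng = random.Random(seed)                              # fixed seed: repeatable
    data = [rng.randint(0, 1) for _ in range(data_len)]    # pseudo-random series
    key = [rng.randint(0, 1) for _ in range(key_len)]      # 100-bit key
    data[position:position + key_len] = key                # plant the key
    scores = match_scores(data, key)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best], scores


if __name__ == "__main__":
    best, score, _ = demo()
    print("best alignment:", best, "score:", score)
```

At alignments that do not contain the key, about half of the bits agree by chance, so a threshold somewhat below 100 would also admit near matches.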
46Status: Prototype in progress
Host ATAPI Controller
IDE_to_ATM module
Hard drive
47Internet Packet Filtering with Mahesh Jayaram
and George Varghese
48Finding Needles in a Moving Haystack
49Cost of Internet Request
- As technology improves, transmission time decreases but latency stays the same
[Plot: Time vs. Year]
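The point of this slide can be checked with simple arithmetic: total request time is a fixed latency plus size divided by bandwidth. The numbers in the Python sketch below (50 ms latency, a 100 kB response, three link speeds) are assumptions picked only for illustration.

```python
def request_time(size_bits, bandwidth_bps, latency_s):
    """Total time for one request: fixed latency plus transmission time."""
    return latency_s + size_bits / bandwidth_bps


LATENCY_S = 0.05                        # assumed 50 ms latency, held constant
SIZE_BITS = 8 * 100_000                 # assumed 100 kB response
BANDWIDTHS_BPS = [56e3, 1.5e6, 100e6]   # assumed link speeds, slow to fast

if __name__ == "__main__":
    for bandwidth in BANDWIDTHS_BPS:
        total = request_time(SIZE_BITS, bandwidth, LATENCY_S)
        share = LATENCY_S / total
        print(f"{bandwidth:>12.0f} bit/s: {total:8.3f} s total, "
              f"latency is {share:.0%} of it")
```

As bandwidth rises, the transmission term shrinks toward zero and the unchanged latency accounts for almost all of the wait.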
50Example: Garden Hose
Fire department and gardener suffer the same wait
51Example: Hot Shower
You want this water
Latency (time to get hot water) depends on distance
52Latency-Free Hot Shower
Convection circuit continuously circulates hot water
Latency = 0
53Better to receive than to give
- Cable broadcast
- Radio broadcast
- TV guide channel
- Gate connection announcements in flight
- Winning lottery number
Modern name: push technology
54Better to receive than to give
55How do you get what you want?
56Packet Filters
Filter F (Weather)
57Packet Filters
Filter F (Weather)
58Existing Approach
IBM Quote
Weather
Flight Schedule
59Our approach
IBM Quote
Weather
Flight Schedule
Composite filter makes just one pass
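The contrast between running filters one by one and running a single merged filter can be sketched in Python. This is a deliberately simplified stand-in, not the technique from the talk: each hypothetical filter tests one made-up destination port, and the composite is just a lookup table built from all of them.

```python
# Hypothetical filters: each accepts packets for a single (made-up) port.
FILTERS = {"IBM Quote": 5001, "Weather": 5002, "Flight Schedule": 5003}


def separate_filters(dst_port):
    """Run the filters one after another; return (match, checks made)."""
    checks = 0
    for name, port in FILTERS.items():
        checks += 1
        if dst_port == port:
            return name, checks
    return None, checks


# Composite filter: all filters merged into one table, built once up front.
COMPOSITE = {port: name for name, port in FILTERS.items()}


def composite_filter(dst_port):
    """One pass: a single lookup, however many filters were merged."""
    return COMPOSITE.get(dst_port), 1


if __name__ == "__main__":
    for port in (5001, 5003, 9999):
        print(port, separate_filters(port), composite_filter(port))
```

With separate filters the work per packet grows with the number of subscriptions; with the merged table it stays at one lookup.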
60How we do it
IBM Quote
Weather
Flight Schedule
61Sample grammar for TCP packet
TCPConnHeader -> EtherType IPHeader TCPPortPair
EtherType -> IP_TYPE
IPHeader -> Vers HlenPlusRest
Vers -> HalfByte
HlenPlusRest -> 0 1 0 1 FixedRest
  | 0 1 1 0 FixedRest OneIPOption
  | 0 1 1 1 FixedRest TwoIPOption
  | 1 0 0 0 FixedRest ThreeIPOption
  | 1 0 0 1 FixedRest FourIPOption
  | 1 0 1 0 FixedRest FiveIPOption
  | 1 0 1 1 FixedRest FiveIPOption OneIPOption
  | 1 1 0 0 FixedRest FiveIPOption TwoIPOption
  | 1 1 0 1 FixedRest FiveIPOption ThreeIPOption
  | 1 1 1 0 FixedRest FiveIPOption FourIPOption
  | 1 1 1 1 FixedRest FiveIPOption FiveIPOption
FixedRest -> ServiceType TotalLength Identification Flags FragmentOffset TimeToLive Protocol HeaderChecksum IPAddrPair
ServiceType -> Byte
TotalLength -> TwoByte
Identification -> TwoByte
Flags -> bit bit bit
FragmentOffset -> bit Byte HalfByte
TimeToLive -> Byte
Protocol -> TCP_PROTOCOL
HeaderChecksum -> TwoByte
IPAddrPair -> IP_SRC_DST_PAIR
FiveIPOption -> ThreeIPOption TwoIPOption
FourIPOption -> TwoIPOption TwoIPOption
ThreeIPOption -> TwoIPOption OneIPOption
TwoIPOption -> OneIPOption OneIPOption
OneIPOption -> Option Padding
Option -> ThreeByte
Padding -> Byte
TCPPortPair -> TCP_PORT_PAIR
FourByte -> TwoByte TwoByte
ThreeByte -> TwoByte Byte
TwoByte -> Byte Byte
Byte -> HalfByte HalfByte
HalfByte -> bit bit bit bit
bit -> 0 | 1
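To make the IP-header part of the grammar concrete, here is a hand-written Python sketch that pulls out the same fields from raw bytes. It is an illustration under assumptions, not the parser used in the talk: it starts at the IP header (the EtherType is taken as already checked), the function and key names are invented, and the sample packet is fabricated.

```python
import struct

TCP_PROTOCOL = 6  # IANA-assigned protocol number for TCP


def parse_ip_header(packet):
    """Split an IPv4 header into the fields named in the grammar."""
    vers_hlen = packet[0]
    version, hlen = vers_hlen >> 4, vers_hlen & 0x0F   # two HalfBytes
    if hlen < 5:
        raise ValueError("header length is below the 5-word fixed part")
    # FixedRest: the remaining 19 bytes of the 20-byte fixed header.
    (service, total_length, ident, flags_frag, ttl, protocol, checksum,
     src, dst) = struct.unpack("!BHHHBBH4s4s", packet[1:20])
    return {
        "version": version,
        "option_words": hlen - 5,                 # 0101 -> none, 0110 -> one, ...
        "header_bytes": hlen * 4,
        "total_length": total_length,
        "flags": flags_frag >> 13,                # 3 bits
        "fragment_offset": flags_frag & 0x1FFF,   # 13 bits
        "time_to_live": ttl,
        "is_tcp": protocol == TCP_PROTOCOL,
    }


# Fabricated 24-byte header: version 4, header length 6 words (one option word).
SAMPLE = bytes([0x46, 0, 0, 44, 0, 1, 0x40, 0, 64, 6, 0, 0,
                10, 0, 0, 1, 10, 0, 0, 2, 1, 1, 1, 0])

if __name__ == "__main__":
    print(parse_ip_header(SAMPLE))
```

The `option_words` value mirrors the eleven alternatives of HlenPlusRest: the 4-bit header length decides how many 4-byte option words follow the fixed part.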
62Results
The more things you want, the slower existing approaches get
Our performance doesn't degrade
63Conclusions
- The Internet and its content are growing explosively
- Disk storage is abundant, cheap, reliable
- Technology must provide fast, inexact searching of text and images
- As more data is hurled at and past us, fast filtering of Internet traffic is a must
64Questions?