Title: Trustworthy Keyword Search for Regulatory Compliant Record Retention
Slide 1: Trustworthy Keyword Search for Regulatory Compliant Record Retention
Windsor W. Hsu IBM Almaden
Marianne Winslett University of Illinois
Soumyadeb Mitra University of Illinois IBM Almaden
Slide 2: There is a need for trustworthy record keeping
[Figure: the digital information explosion (email, instant messaging, files) and corporate misconduct drive soaring discovery costs and a focus on compliance regulations such as HIPAA]
- IDC forecasts 60B business emails annually by 2006
- Sources: IDC, Network World (2003), Socha/Gelbmann (2004)
Slide 3: What is trustworthy record keeping?
- Establish solid proof of events that have occurred
[Figure: Alice commits a record to a storage device; some time later, despite an adversary, Bob should get back Alice's data]
Slide 4: This leads to a unique threat model
- Commit is trustworthy: the record is created properly
- Query is trustworthy: the record is queried properly
- In between, the adversary has super-user privileges
  - Access to the storage device
  - Access to any keys
- The adversary could be Alice herself
Slide 5: Traditional schemes do not work
- We cannot rely on Alice's signature: the adversary may be Alice herself, with access to her keys
Slide 6: WORM storage helps address the problem
- Write Once, Read Many: record overwrite/delete is blocked, while new records can still be written
- The adversary cannot delete Alice's record
Slide 7: An index is required due to the high volume of records
[Figure: Alice commits a record and the index is updated; Bob later queries from the index; in between, an adversary, or a regretful Alice, may interfere]
Slide 8: In effect, records can be hidden or altered by modifying the index
- Hide record B from the index, or replace B with a forged B'
- The index must also be secured (fossilized)
Slide 9: Most business records are unstructured, searched by inverted index
[Figure: example inverted index]
  Keyword   Posting list (doc IDs)
  Query     1, 3, 11, 17
  Data      3, 9
  Base      3, 19
  Worm      7, 36
  Index     3
- One WORM file for each posting list
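The structure on this slide can be sketched with a minimal in-memory model; the documents and keywords below are taken from the slide's table, while the code itself is an illustrative sketch, not the paper's implementation.

```python
# Minimal in-memory sketch of the slide's inverted index: each keyword maps
# to a posting list of document IDs in increasing order.  On WORM storage,
# each posting list would live in its own append-only file.
from collections import defaultdict

postings = defaultdict(list)      # keyword -> sorted list of doc IDs

def add_document(doc_id, words):
    # Document IDs arrive in increasing order, so appends keep lists sorted.
    for w in sorted(set(words)):
        postings[w].append(doc_id)

# Documents reconstructed from the slide's example index
docs = {1: ["query"], 3: ["query", "data", "base", "index"],
        7: ["worm"], 9: ["data"], 11: ["query"], 17: ["query"],
        19: ["base"], 36: ["worm"]}
for doc_id in sorted(docs):
    add_document(doc_id, docs[doc_id])

print(postings["query"])   # [1, 3, 11, 17]
```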
Slide 10: The index must be updated as new documents arrive
[Figure: the same inverted index, with a new document's ID appended to the posting list of each of its keywords]
- 500 keywords per document means 500 disk seeks
- Roughly 1 second per document
Slide 11: Amortize the cost by updating in batches
[Figure: incoming documents (Doc 79 through Doc 83) accumulate in a buffer before being merged into the posting lists in a single pass]
- 1 seek per keyword in the batch
- A large buffer is needed to benefit infrequent terms
- Over 100,000 documents must be buffered to achieve 2 docs/sec
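The batching scheme can be sketched as follows; the function names and flush policy are illustrative assumptions, not the paper's implementation.

```python
# Sketch of batched index updates: new documents accumulate in a buffer and
# are merged into the posting lists in one pass, so each keyword in the
# batch costs roughly one disk seek instead of one seek per document.
from collections import defaultdict

postings = defaultdict(list)      # on disk: one WORM file per list
buffer = []                       # recent documents awaiting a flush

def add_document(doc_id, words):
    buffer.append((doc_id, words))

def flush():
    batch = defaultdict(list)     # keyword -> new doc IDs in this batch
    for doc_id, words in buffer:
        for w in set(words):
            batch[w].append(doc_id)
    for w, ids in batch.items():  # one append (one seek) per keyword
        postings[w].extend(sorted(ids))
    buffer.clear()

add_document(79, ["query", "data"])
add_document(80, ["query", "worm"])
flush()
print(postings["query"])   # [79, 80]
```

Note that until `flush` runs, buffered documents are invisible to index queries, which is exactly the window the next slide worries about.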
Slide 12: The index is not updated immediately
[Figure: while a record waits in the buffer, the adversary has a window in which to alter or omit it]
- Prevailing practice: email must be committed before it is delivered
Slide 13: Can the storage server cache help?
- Storage servers have huge caches
- Data committed into the cache is effectively on disk
  - The cache is battery backed-up
  - It sits inside the WORM box, so it is trustworthy
Slide 14: Caching works in blocks
[Figure: appending Doc 80 to the posting lists; the frequently updated list ("Query") hits the cached tail block, the infrequently updated one ("Worm") misses]
- Caching does not benefit infrequent terms
Slide 15: Simulation results show caching is not enough
[Figure: simulation results]
Slide 16: Simulation results show caching is not enough
- What if the number of posting lists were at most the number of cache blocks? Then every update would hit the cache.
Slide 17: So, merge posting lists so that the tail blocks fit in cache
[Figure: the posting lists for Query (1, 3, 11), Data (3, 9, 31), and Base (3, 19) are merged into one list; each keyword gets a short encoding (Query = 00, Data = 01, Base = 10) and the merged list stores (encoding, doc ID) pairs: (00,1) (01,3) (10,3) (00,3) (01,9) (00,11) (10,19) (01,31); other keywords, such as Worm and Index, fall into other merged lists]
- Only 1 random I/O per document, for 4K block size
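The merged-list layout can be modeled directly; the encodings and entries below mirror the slide's figure, while the scan-and-filter lookup is an illustrative simplification.

```python
# Sketch of a merged posting list: several keywords share one list, and each
# entry carries a short keyword encoding, so a lookup scans the merged list
# and keeps the entries whose encoding matches.
codes = {"query": "00", "data": "01", "base": "10"}  # keyword encodings

# (encoding, doc ID) pairs appended in document-ID order, as on the slide
merged = [("00", 1), ("01", 3), ("10", 3), ("00", 3),
          ("01", 9), ("00", 11), ("10", 19), ("01", 31)]

def lookup(word):
    c = codes[word]
    return [d for code, d in merged if code == c]

print(lookup("query"))   # [1, 3, 11]
```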
Slide 18: The tradeoff is longer lists to scan during lookup
[Figure: a query is answered by scanning the entire merged posting list]
- Let t_w be the length of the posting list for keyword w, and q_w the number of times w is queried in the workload
- Workload lookup cost before merging: sum over w of t_w * q_w
- After merging into groups A = {A_1, ..., A_n}: sum over groups A of (sum of t_w over w in A) * (sum of q_w over w in A), i.e., (length of A) * (number of times A is searched)
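A small worked example of the cost formulas, with made-up values for t_w and q_w (the slide gives no concrete numbers):

```python
# Worked example of the lookup-cost tradeoff: t[w] is the posting-list
# length of keyword w, and q[w] is how often w is queried.  Merging lists
# saves update seeks, but every query must scan the whole merged list.
t = {"vldb": 100, "2006": 400, "worm": 10}   # illustrative list lengths
q = {"vldb": 50,  "2006": 20,  "worm": 5}    # illustrative query counts

def cost(groups):
    # cost of a partition: sum over groups of (total length) * (total queries)
    return sum(sum(t[w] for w in g) * sum(q[w] for w in g) for g in groups)

separate = [["vldb"], ["2006"], ["worm"]]
merged = [["vldb", "2006"], ["worm"]]
print(cost(separate))  # 100*50 + 400*20 + 10*5 = 13050
print(cost(merged))    # (100+400)*(50+20) + 10*5 = 35050
```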
Slide 19: Which lists to merge? We need a heuristic solution
- Choose A = {A_1, A_2, ..., A_n}, with n <= number of cache blocks
- Minimize sum over groups A of (sum of t_w over w in A) * (sum of q_w over w in A)
- The problem is NP-complete (reduction from minimum-sum-of-squares)
- So, try some merging heuristics on a real-world workload
  - 1 million documents from IBM's intranet
  - 300,000 queries
Slide 20: A few terms contribute most of the query workload cost
[Figure: distribution of per-term cost t_w * q_w across terms]
Slide 21: Different merging heuristics were tried
- Separate lists for high-contributor terms
- Merging heuristics based on q_w and t_w
- Random merging
- Details of the heuristics and their evaluation are in the paper
Slide 22: Additional index support is needed to answer conjunctive queries quickly
[Figure: answering "VLDB and 2006" by intersecting the two posting lists, of lengths m and n]
- Merge join: O(m + n)
- Index join: O(m log n)
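Both join strategies can be sketched over illustrative posting lists (the slide does not give the exact lists):

```python
# Two ways to intersect sorted posting lists for a conjunctive query such
# as "VLDB and 2006": a merge join scans both lists in lockstep, O(m + n);
# an index join binary-searches the longer list once per element of the
# shorter one, O(m log n).
from bisect import bisect_left

def merge_join(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long_list):
    out = []
    for x in short:                       # m probes, log n each
        k = bisect_left(long_list, x)
        if k < len(long_list) and long_list[k] == x:
            out.append(x)
    return out

vldb = [2, 7, 13, 24, 31]        # illustrative posting lists
y2006 = [3, 7, 13, 24, 30, 31]
print(merge_join(vldb, y2006))   # [7, 13, 24, 31]
```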
Slide 23: How to maintain B-trees on WORM
- B-trees require node splits and joins
- A B-tree on a posting list is a special case: document IDs are inserted in increasing order
- So it can be built bottom-up, without splits/joins
- Please refer to our paper
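A minimal sketch of the bottom-up build, assuming a fixed fanout and using the keys from the B-tree figure on the next slide; the separator-key choice is a simplification of a real B-tree layout:

```python
# Sketch of building a B-tree bottom-up over an increasing key sequence:
# because keys only ever arrive in sorted order, each node can be filled
# and written exactly once (WORM-friendly), with no splits or joins.
FANOUT = 4

def build_bottom_up(keys):
    # Leaves: consecutive runs of the sorted keys.
    level = [keys[i:i + FANOUT] for i in range(0, len(keys), FANOUT)]
    tree = [level]
    while len(level) > 1:
        # Each internal entry is the first key of its child node.
        parents = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            parents.append([c[0] for c in children])
        tree.append(parents)
        level = parents
    return tree  # tree[0] = leaf level, tree[-1] = root level

tree = build_bottom_up([2, 4, 7, 11, 13, 19, 23, 29, 31])
print(tree[-1])   # [[2, 13, 31]]
```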
Slide 24: A B-tree index is insecure, even on WORM
[Figure: a B-tree over the keys 4, 7, 11, 13, 19, 23, 29, 31; inserting key 2 reshapes the tree]
- The path to an element depends on elements inserted later, so an adversary can attack it
Slide 25: Our solution is jump indexes
- The path to an element depends only on elements inserted before it
- The jump index is provably trustworthy
- It leverages the fact that document IDs are increasing
- O(log N) lookup, where N = number of documents
- Supports range queries too
- Reasonable performance compared to B-trees for conjunctive queries, in experiments with a real workload
- For details, see our paper
Slide 26: Conclusions
- WORM storage by itself is not enough; we need a trustworthy index too
- Trustworthy inverted indexes can be built efficiently
  - 10-15% slowdown for non-conjunctive queries
  - Within 1.5x of optimal B-tree performance for conjunctive queries
- Other possible uses, e.g., indexing time
Slide 27: Future work
- Ranking attacks: the adversary can add a lot of junk documents to skew rankings
- A secure generic index structure
  - Path to an element is fixed
  - Supports range queries
Slide 28: Questions?
Slide 29: A new index structure is required: the jump index
- The ith jump pointer of element n points to an element n_i with n + 2^i <= n_i < n + 2^(i+1)
Slide 30: Jump index in action
[Figure: an element's jump pointers 0 through 4 being filled as later elements arrive]
Slide 31: The path to an element does not depend on future elements
[Figure: Lookup(7) follows the same jump pointers regardless of elements inserted later]
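The lookup walk can be sketched with an in-memory model. The element IDs are illustrative, and the pointer-setting loop on insert (which scans all existing elements) is a simplification for clarity, not the paper's block-based scheme.

```python
# Sketch of a jump index over increasing document IDs: element n keeps a
# pointer i to the FIRST element inserted in [n + 2**i, n + 2**(i+1)).
# Since IDs arrive in increasing order, "first inserted" is also smallest,
# so every hop moves strictly toward the target: the path to an element
# depends only on elements inserted before it, in O(log N) hops.
pointers = {}   # doc_id -> {i: doc_id of first element in pointer i's range}
elements = []   # insertion order = increasing doc IDs

def insert(x):
    for n in elements:
        i = (x - n).bit_length() - 1   # i such that 2**i <= x - n < 2**(i+1)
        pointers[n].setdefault(i, x)   # only the first arrival claims slot i
    pointers[x] = {}
    elements.append(x)

def lookup(x):
    n = elements[0]
    hops = [n]
    while n != x:
        i = (x - n).bit_length() - 1
        n = pointers[n][i]             # guaranteed to exist and be <= x
        hops.append(n)
    return hops

for d in [1, 3, 4, 7, 9, 12]:          # illustrative document IDs
    insert(d)
print(lookup(7))   # [1, 7]
```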
Slide 32: Jump index elements are stored in blocks
- Storing jump pointers with every element is inefficient, so p entries are grouped together per block
- Branch factor B
- Pointer (i, j) from a block b points to the block b' holding the smallest element x with l + j*B^i <= x < l + (j+1)*B^i
[Figure: a block of p entries with its jump pointers]
Slide 33: Jump index evaluation parameters
- p: number of elements grouped together
- B: branching factor
- L: block size
- These are related: a block of size L must hold the p entries plus the (B-1) * log_B N jump pointers
- Evaluation:
  - Index update performance (pointers have to be set)
  - Query performance
Slide 34: Update performance levels off at a reasonable cache size
[Figure: update performance vs. cache size]
Slide 35: Query performance is close to optimal (B-tree)
[Figure: query performance comparison against a B-tree]
Slide 36: B-tree on WORM
Slide 37: A B-tree for an increasing sequence can be created on WORM
[Figure: a B-tree built bottom-up over the increasing keys 2, 4, 7, 11, 13, 19]
Slide 38: Is this a real threat?
- Would someone want to delete a record a day after it is created?
- Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack
- A record may be regretted moments after creation
- Email best practice: must be committed before it is delivered