Trustworthy Keyword Search for Regulatory Compliant Record Retention
1
Trustworthy Keyword Search for Regulatory
Compliant Record Retention
Windsor W. Hsu, IBM Almaden
Marianne Winslett, University of Illinois
Soumyadeb Mitra, University of Illinois / IBM Almaden
2
There is a need for trustworthy record keeping
  • Digital information explosion: instant messaging, email, and files are now records (IDC forecasts 60B business emails annually by 2006)
  • Corporate misconduct and soaring discovery costs
  • Focus on compliance (e.g., HIPAA)
  • Sources: IDC, Network World (2003), Socha/Gelbmann (2004)
3
What is trustworthy record keeping?
Establish solid proof of events that have occurred
[Figure: timeline - Alice commits a record to the storage device; an adversary later takes over; Bob queries later still]
  • Bob should get back Alice's data
4
This leads to a unique threat model
[Figure: timeline - the commit is trustworthy (the record is created properly), the query is trustworthy (the record is queried properly), and in between the adversary is in control]
  • Adversary has super-user privileges
  • Access to storage device
  • Access to any keys
  • Adversary could be Alice herself
5
Traditional schemes do not work
  • Cannot rely on Alice's signature - the adversary may hold her keys or be Alice herself
6
WORM storage helps address the problem
[Figure: overwrites and deletes of existing records are rejected; only new records can be written]
  • Adversary cannot delete Alice's record
  • WORM = Write Once Read Many
7
Index required due to high volume of records
[Figure: timeline - Alice commits a record and the index is updated; Bob later queries from the index; in between, Alice may regret the record and become the adversary]
8
In effect, records can be hidden/altered by
modifying the index
  • Hide record B from the index
  • Or replace B with B' in the index
The index must also be secured (fossilized)
9
Most business records are unstructured, searched
by inverted index
[Figure: inverted index - keywords map to posting lists of document IDs]
  Query → 1, 3, 11, 17
  Data  → 3, 9
  Base  → 3, 19
  Worm  → 7, 36
  Index → 3
  • One WORM file for each posting list

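To make the layout concrete, here is a minimal sketch of an inverted index that keeps one append-only file per posting list; the directory name, tokenizer, and file format are illustrative stand-ins, and an ordinary directory stands in for the WORM device:

```python
import os
import re

# Illustrative stand-in for the WORM device: a directory holding one
# append-only file per keyword's posting list.
INDEX_DIR = "worm_index"

def tokenize(text):
    """Very simple tokenizer; a real system would use a proper analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def commit_document(doc_id, text):
    """Append the document's ID to the posting list of every keyword it
    contains. On real WORM storage, appends are allowed but rewrites are not."""
    os.makedirs(INDEX_DIR, exist_ok=True)
    for word in tokenize(text):
        with open(os.path.join(INDEX_DIR, word), "a") as posting_list:
            posting_list.write(f"{doc_id}\n")

def lookup(word):
    """Read one keyword's posting list back from its file."""
    path = os.path.join(INDEX_DIR, word.lower())
    if not os.path.exists(path):
        return []
    with open(path) as posting_list:
        return [int(line) for line in posting_list]

commit_document(1, "query the worm index")
commit_document(3, "data base query index")
print(lookup("query"))   # [1, 3]
```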
10
Index must be updated as new documents arrive
[Figure: the same inverted index - the new document's ID must be appended to the posting list of every keyword it contains]
  • 500 keywords → 500 disk seeks per document
  • ≈ 1 sec per document

11
Amortize cost by updating in batch
[Figure: incoming documents (Doc 79 ... Doc 83) accumulate in a buffer; their postings are then appended to the posting lists (Query, Data, Base, Worm, Index) in one batch]
  • 1 seek per keyword per batch
  • A large buffer is needed for infrequent terms to benefit
  • Over 100,000 documents must be buffered to achieve 2 docs/sec

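A small sketch of the batched variant, assuming an in-memory dictionary as the buffer (the slide's buffer would live in trusted storage) and the same illustrative one-file-per-posting-list layout as above:

```python
import os
from collections import defaultdict

INDEX_DIR = "worm_index"   # illustrative stand-in for the WORM device

class BatchedIndexer:
    """Buffer postings in memory and flush them with one append (one seek)
    per keyword per batch, instead of one seek per keyword per document."""

    def __init__(self, flush_every=100_000):
        self.buffer = defaultdict(list)   # keyword -> buffered document IDs
        self.buffered_docs = 0
        self.flush_every = flush_every

    def add_document(self, doc_id, keywords):
        for word in keywords:
            self.buffer[word].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_every:
            self.flush()

    def flush(self):
        """One sequential append per keyword that appeared in the batch."""
        os.makedirs(INDEX_DIR, exist_ok=True)
        for word, doc_ids in sorted(self.buffer.items()):
            with open(os.path.join(INDEX_DIR, word), "a") as posting_list:
                posting_list.writelines(f"{d}\n" for d in doc_ids)
        self.buffer.clear()
        self.buffered_docs = 0

indexer = BatchedIndexer(flush_every=3)
indexer.add_document(79, ["query", "data"])
indexer.add_document(80, ["base", "query"])
indexer.add_document(81, ["worm", "index"])   # third document triggers a flush
```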
12
Index is not updated immediately
[Figure: timeline - Alice commits a record, but its postings sit in the buffer until the next batch update; in the meantime the adversary can alter or omit them]
  • Prevailing practice: email must be committed before it is delivered
13
Can storage server cache help?
  • Storage servers have a huge cache
  • Data committed to the cache is effectively on disk
  • The cache is battery backed-up
  • The cache sits inside the WORM box, so it is trustworthy

14
Caching works in blocks
[Figure: appending Doc 80's postings - the tail block of a frequently updated list (e.g., Query) is already in cache, while the tail blocks of infrequent terms miss]
  • Caching does not benefit infrequent terms

15
Simulation results show caching is not enough
16
Simulation results show caching is not enough
  • What if the number of posting lists ≤ the number of cache blocks?
  • Then every update would hit the cache

17
So, merge posting lists so that the tail blocks fit in cache
[Figure: the posting lists for Data, Base, Worm, and Index are merged into one list whose entries pair a 2-bit keyword encoding (00, 01, 10, ...) with a document ID; the frequently queried Query list is kept separate]
  • Only 1 random I/O per document, for 4K block size

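A minimal sketch of such a merged list, assuming the 2-bit keyword codes shown in the figure; the code assignments and the in-memory list are illustrative (the real merged list is an append-only WORM file):

```python
# Illustrative keyword codes for the merged list (assumed, not from the paper).
KEYWORD_CODE = {"data": 0b00, "base": 0b01, "worm": 0b10, "index": 0b11}

merged_list = []  # stand-in for one append-only WORM file shared by 4 keywords

def commit_posting(keyword, doc_id):
    """Append one (keyword code, doc ID) entry to the shared list.
    All 4 keywords share one tail block, so it stays cache-resident."""
    merged_list.append((KEYWORD_CODE[keyword], doc_id))

def lookup(keyword):
    """Scan the whole merged list and keep entries matching the keyword's code.
    This is the tradeoff: a lookup scans the other keywords' postings too."""
    code = KEYWORD_CODE[keyword]
    return [doc_id for c, doc_id in merged_list if c == code]

for kw, d in [("data", 3), ("base", 3), ("worm", 7), ("data", 9), ("base", 19)]:
    commit_posting(kw, d)
print(lookup("data"))   # [3, 9]
```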
18
The tradeoff is longer lists to scan during lookup
  • Example: with VLDB and 2006 merged, a query for either keyword scans both posting lists
  • Workload lookup cost before merging: Σ_w t_w · q_w
  • t_w = length of the posting list for keyword w; q_w = number of times w is queried in the workload
  • After merging the keywords into groups A = A_1, ..., A_n: Σ_i (Σ_{w ∈ A_i} t_w) · (Σ_{w ∈ A_i} q_w)
  • Σ_{w ∈ A_i} t_w = length of merged list A_i; Σ_{w ∈ A_i} q_w = number of times A_i is searched
19
Which lists to merge? We need a heuristic
solution
  • Choose a grouping A = A_1, A_2, ..., A_n
  • n ≤ number of cache blocks
  • Minimize Σ_i (Σ_{w ∈ A_i} t_w) · (Σ_{w ∈ A_i} q_w)  (see the sketch below)
  • The problem is NP-complete
  • Reduction from minimum-sum-of-squares
  • So, try some merging heuristics on a real-world workload
  • 1 million documents from IBM's intranet
  • 300,000 queries

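The sketch below evaluates this cost model on toy numbers; the round-robin grouping is only an illustrative stand-in, not one of the paper's heuristics:

```python
# t[w] = posting-list length of keyword w, q[w] = query count of w.

def workload_cost(groups, t, q):
    """Cost = sum over groups of (total list length) x (total query count)."""
    return sum(sum(t[w] for w in g) * sum(q[w] for w in g) for g in groups)

def toy_partition(keywords, t, q, n_groups):
    """Toy heuristic: rank keywords by t*q and deal them round-robin into
    n_groups merged lists, so only n_groups tail blocks exist."""
    ranked = sorted(keywords, key=lambda w: t[w] * q[w], reverse=True)
    groups = [[] for _ in range(n_groups)]
    for i, w in enumerate(ranked):
        groups[i % n_groups].append(w)
    return groups

t = {"vldb": 200, "2006": 300, "worm": 50, "index": 40}
q = {"vldb": 90, "2006": 80, "worm": 5, "index": 2}

unmerged = [[w] for w in t]                 # one list per keyword
merged = toy_partition(list(t), t, q, 2)    # e.g., only 2 cache blocks
print(workload_cost(unmerged, t, q), workload_cost(merged, t, q))
```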
20
A few terms contribute most of the query workload
cost
[Figure: distribution of per-term contributions t_w · q_w to the workload cost]
21
Different merging heuristics were tried
  • Separate lists for high-contributor terms
  • Merging heuristics based on q_w and t_w
  • Random merging
  • Details of the heuristics and their evaluation are in the paper

22
Additional index support is needed to answer
conjunctive queries quickly
  • Example: the conjunctive query "VLDB and 2006"
[Figure: the VLDB and 2006 posting lists, of lengths m and n, are intersected]
  • Merge join: O(m + n)
  • Index join: O(m log n)
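A small sketch of the two join strategies over sorted posting lists; the binary-search probe stands in for the on-WORM index (B-tree or jump index) lookup:

```python
from bisect import bisect_left

def merge_join(a, b):
    """Intersect two sorted posting lists by scanning both once: O(m + n)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long):
    """For each ID in the shorter list, probe the longer one: O(m log n).
    A binary search stands in for the B-tree / jump-index probe."""
    out = []
    for doc_id in short:
        pos = bisect_left(long, doc_id)
        if pos < len(long) and long[pos] == doc_id:
            out.append(doc_id)
    return out

vldb = [2, 3, 7, 13, 24, 31]       # posting list for "VLDB" (m entries)
y2006 = [3, 13, 24, 31]            # posting list for "2006" (n entries)
print(merge_join(vldb, y2006))     # [3, 13, 24, 31]
print(index_join(y2006, vldb))     # same result, with log-time probes
```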
23
How to maintain B-trees on WORM
  • B-trees normally require node splits and joins
  • A B-tree on a posting list is a special case: document IDs are inserted in increasing order
  • It can be built bottom-up, without splits or joins (see the sketch below)
  • Please refer to our paper

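A simplified sketch of the bottom-up idea, assuming the document IDs are already available in sorted order (the paper builds the tree incrementally as IDs arrive; the fanout and node layout here are illustrative):

```python
# Because keys arrive in increasing order, each node is written once and never
# split - the property that makes a WORM-resident B-tree possible. This batch
# build only illustrates the bottom-up construction.

FANOUT = 4  # illustrative branching factor

def smallest(node):
    """Smallest key in a node's subtree (first key of first leaf)."""
    return node[1][0]

def build_bottom_up(sorted_ids):
    # Leaf level: chop the sorted IDs into fixed-size leaves.
    nodes = [("leaf", sorted_ids[i:i + FANOUT])
             for i in range(0, len(sorted_ids), FANOUT)]
    # Internal levels: each parent records its children's smallest keys.
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), FANOUT):
            children = nodes[i:i + FANOUT]
            keys = [smallest(c) for c in children]
            parents.append(("internal", keys, children))
        nodes = parents
    return nodes[0]

def search(node, doc_id):
    if node[0] == "leaf":
        return doc_id in node[1]
    _, keys, children = node
    # Descend into the rightmost child whose smallest key is <= doc_id.
    idx = max(i for i, k in enumerate(keys) if k <= doc_id) if keys[0] <= doc_id else 0
    return search(children[idx], doc_id)

root = build_bottom_up([2, 4, 7, 11, 13, 19, 23, 29, 31, 37, 41, 43, 47])
print(search(root, 19), search(root, 20))   # True False
```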
24
B-tree index is insecure, even on WORM
[Figure: B-tree over document IDs 2, 4, 7, 11, 13, 19, 23, 29, 31; internal nodes hold separator keys such as 7, 13, 23, 31]
  • The path to an element depends on elements inserted later, so the adversary can attack it

25
Our solution is jump indexes
  • The path to an element depends only on elements inserted before it
  • The jump index is provably trustworthy
  • It leverages the fact that document IDs are increasing
  • O(log N) lookup, where N = number of documents
  • Supports range queries too
  • Performance comparable to B-trees for conjunctive queries in experiments with a real workload
  • For details, see our paper

26
Conclusions
  • WORM storage by itself is not enough; we need a trustworthy index too
  • Trustworthy inverted indexes can be built efficiently
  • 10-15% slowdown for non-conjunctive queries
  • Within 1.5x of optimal B-tree performance for conjunctive queries
  • Other possible uses: indexing time

27
Future work
  • Ranking attacks: the adversary can add a lot of junk documents
  • A secure generic index structure
  • Path to an element is fixed
  • Supports range queries

28
Questions
29
A new index structure is required: the Jump Index
  • The i-th pointer of an element n points to an element n_i with n + 2^i ≤ n_i < n + 2^(i+1)

30
Jump index in action
[Figure: jump index construction example, showing pointer slots 0 through 4]
31
Path to an element does not depend on future
elements
[Figure: Lookup(7) follows pointers set before element 7 was inserted, using pointer slots 0 through 4]
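A minimal sketch of the unblocked form from slide 29, assuming the pointer ranges [n + 2^i, n + 2^(i+1)); the class names and the pointer-array size are illustrative, and the blocked, B-ary variant on the next slide groups p entries per block:

```python
class JumpNode:
    """One committed document ID plus write-once forward pointers.
    Pointer i only ever points to an element n_i with
    value + 2^i <= n_i < value + 2^(i+1)."""
    def __init__(self, value, max_pointers=32):
        self.value = value
        self.ptr = [None] * max_pointers   # write-once slots (WORM-style)

class JumpIndex:
    def __init__(self):
        self.head = None

    def insert(self, value):
        """Append a new, strictly larger document ID."""
        node = JumpNode(value)
        if self.head is None:
            self.head = node
            return
        cur = self.head
        while True:
            i = (value - cur.value).bit_length() - 1   # 2^i <= gap < 2^(i+1)
            if cur.ptr[i] is None:
                cur.ptr[i] = node    # first element routed here in this range
                return
            cur = cur.ptr[i]         # pointed node was inserted earlier,
                                     # so its value is smaller than `value`

    def lookup(self, value):
        """Retrace the pointer choices the insert of `value` would have made,
        so the path depends only on elements inserted before it."""
        cur = self.head
        while cur is not None and cur.value < value:
            i = (value - cur.value).bit_length() - 1
            cur = cur.ptr[i]
        return cur is not None and cur.value == value

idx = JumpIndex()
for doc_id in [1, 3, 4, 7, 11, 13]:
    idx.insert(doc_id)
print(idx.lookup(7), idx.lookup(8))   # True False
```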
32
Jump index elements are stored in blocks
  • Storing pointers with every element is inefficient
  • p entries are grouped together into a block
  • Branch factor B
  • Pointer (i, j) from block b points to the block b' containing the smallest element x with l + j·B^i ≤ x < l + (j+1)·B^i
[Figure: block layout - p entries plus their jump pointers]
33
Jump index evaluation parameters
  • p = number of elements grouped together
  • B = branching factor
  • L = block size
  • p + (B-1)·log_B N ≈ L
  • Evaluation:
  • Index update performance (pointers have to be set)
  • Query performance

34
Update performance levels off at reasonable cache
size
35
Query performance is close to optimal (B-tree)
36
B-tree on WORM
37
A B-tree for an increasing sequence can be created on WORM
[Figure: B-tree built bottom-up as the increasing document IDs 2, 4, 7, 11, 13, 19 arrive]
38
Is this a real threat?
  • Would someone want to delete a record a day after it's created?
  • Intrusion detection logging
  • Once the adversary gains control, he would like to delete the records of his initial attack
  • A record may be regretted moments after creation
  • Email best practice: must be committed before it's delivered