Title: Trustworthy Keyword Search for Regulatory Compliant Record Retention
Slide 1: Trustworthy Keyword Search for Regulatory Compliant Record Retention
Windsor W. Hsu IBM Almaden
Marianne Winslett University of Illinois
Soumyadeb Mitra University of Illinois IBM Almaden
Slide 2: There is a need for trustworthy record keeping
[Figure: the digital information explosion (email, instant messaging, files) and corporate misconduct drive soaring discovery costs and a focus on compliance regulations such as HIPAA]
- IDC forecasts 60B business emails annually by 2006
- Sources: IDC, Network World (2003), Socha/Gelbmann (2004)
Slide 3: What is trustworthy record keeping?
- Establish solid proof of events that have occurred
[Figure: Alice commits a record to a storage device; some time later, despite an adversary, Bob should get back Alice's data]
Slide 4: This leads to a unique threat model
- Commit is trustworthy: the record is created properly
- Query is trustworthy: the record is queried properly
- In between, the adversary has super-user privileges
  - Access to the storage device
  - Access to any keys
- The adversary could be Alice herself
Slide 5: Traditional schemes do not work
- We cannot rely on Alice's signature: the adversary may be Alice herself, with access to her keys
Slide 6: WORM storage helps address the problem
- Write Once, Read Many: record overwrite/delete is blocked, while new records can still be written
- The adversary cannot delete Alice's record
Slide 7: An index is required due to the high volume of records
[Figure: Alice commits a record and the index is updated; Bob later queries from the index; in between, an adversary, or a regretful Alice, may interfere]
Slide 8: In effect, records can be hidden or altered by modifying the index
- Hide record B from the index, or replace B with a forged B'
- The index must also be secured (fossilized)
Slide 9: Most business records are unstructured, searched by inverted index
[Figure: example inverted index]
  Keyword   Posting list (doc IDs)
  Query     1, 3, 11, 17
  Data      3, 9
  Base      3, 19
  Worm      7, 36
  Index     3
- One WORM file for each posting list
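The structure on this slide can be sketched with a minimal in-memory model; the documents and keywords below are taken from the slide's table, while the code itself is an illustrative sketch, not the paper's implementation.

```python
# Minimal in-memory sketch of the slide's inverted index: each keyword maps
# to a posting list of document IDs in increasing order.  On WORM storage,
# each posting list would live in its own append-only file.
from collections import defaultdict

postings = defaultdict(list)      # keyword -> sorted list of doc IDs

def add_document(doc_id, words):
    # Document IDs arrive in increasing order, so appends keep lists sorted.
    for w in sorted(set(words)):
        postings[w].append(doc_id)

# Documents reconstructed from the slide's example index
docs = {1: ["query"], 3: ["query", "data", "base", "index"],
        7: ["worm"], 9: ["data"], 11: ["query"], 17: ["query"],
        19: ["base"], 36: ["worm"]}
for doc_id in sorted(docs):
    add_document(doc_id, docs[doc_id])

print(postings["query"])   # [1, 3, 11, 17]
```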
Slide 10: The index must be updated as new documents arrive
[Figure: the same inverted index, with a new document's ID appended to the posting list of each of its keywords]
- 500 keywords per document means 500 disk seeks
- Roughly 1 second per document
Slide 11: Amortize the cost by updating in batches
[Figure: incoming documents (Doc 79 through Doc 83) accumulate in a buffer before being merged into the posting lists in a single pass]
- 1 seek per keyword in the batch
- A large buffer is needed to benefit infrequent terms
- Over 100,000 documents must be buffered to achieve 2 docs/sec
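The batching scheme can be sketched as follows; the function names and flush policy are illustrative assumptions, not the paper's implementation.

```python
# Sketch of batched index updates: new documents accumulate in a buffer and
# are merged into the posting lists in one pass, so each keyword in the
# batch costs roughly one disk seek instead of one seek per document.
from collections import defaultdict

postings = defaultdict(list)      # on disk: one WORM file per list
buffer = []                       # recent documents awaiting a flush

def add_document(doc_id, words):
    buffer.append((doc_id, words))

def flush():
    batch = defaultdict(list)     # keyword -> new doc IDs in this batch
    for doc_id, words in buffer:
        for w in set(words):
            batch[w].append(doc_id)
    for w, ids in batch.items():  # one append (one seek) per keyword
        postings[w].extend(sorted(ids))
    buffer.clear()

add_document(79, ["query", "data"])
add_document(80, ["query", "worm"])
flush()
print(postings["query"])   # [79, 80]
```

Note that until `flush` runs, buffered documents are invisible to index queries, which is exactly the window the next slide worries about.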
Slide 12: The index is not updated immediately
[Figure: while a record waits in the buffer, the adversary has a window in which to alter or omit it]
- Prevailing practice: email must be committed before it is delivered
Slide 13: Can the storage server cache help?
- Storage servers have huge caches
- Data committed into the cache is effectively on disk
  - The cache is battery backed-up
  - It sits inside the WORM box, so it is trustworthy
Slide 14: Caching works in blocks
[Figure: appending Doc 80 to the posting lists; the frequently updated list ("Query") hits the cached tail block, the infrequently updated one ("Worm") misses]
- Caching does not benefit infrequent terms
Slide 15: Simulation results show caching is not enough
[Figure: simulation results]
Slide 16: Simulation results show caching is not enough
- What if the number of posting lists were at most the number of cache blocks? Then every update would hit the cache.
Slide 17: So, merge posting lists so that the tail blocks fit in cache
[Figure: the posting lists for Query (1, 3, 11), Data (3, 9, 31), and Base (3, 19) are merged into one list; each keyword gets a short encoding (Query = 00, Data = 01, Base = 10) and the merged list stores (encoding, doc ID) pairs: (00,1) (01,3) (10,3) (00,3) (01,9) (00,11) (10,19) (01,31); other keywords, such as Worm and Index, fall into other merged lists]
- Only 1 random I/O per document, for 4K block size
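The merged-list layout can be modeled directly; the encodings and entries below mirror the slide's figure, while the scan-and-filter lookup is an illustrative simplification.

```python
# Sketch of a merged posting list: several keywords share one list, and each
# entry carries a short keyword encoding, so a lookup scans the merged list
# and keeps the entries whose encoding matches.
codes = {"query": "00", "data": "01", "base": "10"}  # keyword encodings

# (encoding, doc ID) pairs appended in document-ID order, as on the slide
merged = [("00", 1), ("01", 3), ("10", 3), ("00", 3),
          ("01", 9), ("00", 11), ("10", 19), ("01", 31)]

def lookup(word):
    c = codes[word]
    return [d for code, d in merged if code == c]

print(lookup("query"))   # [1, 3, 11]
```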
Slide 18: The tradeoff is longer lists to scan during lookup
[Figure: a query is answered by scanning the entire merged posting list]
- Let t_w be the length of the posting list for keyword w, and q_w the number of times w is queried in the workload
- Workload lookup cost before merging: sum over w of t_w * q_w
- After merging into groups A = {A_1, ..., A_n}: sum over groups A of (sum of t_w over w in A) * (sum of q_w over w in A), i.e., (length of A) * (number of times A is searched)
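A small worked example of the cost formulas, with made-up values for t_w and q_w (the slide gives no concrete numbers):

```python
# Worked example of the lookup-cost tradeoff: t[w] is the posting-list
# length of keyword w, and q[w] is how often w is queried.  Merging lists
# saves update seeks, but every query must scan the whole merged list.
t = {"vldb": 100, "2006": 400, "worm": 10}   # illustrative list lengths
q = {"vldb": 50,  "2006": 20,  "worm": 5}    # illustrative query counts

def cost(groups):
    # cost of a partition: sum over groups of (total length) * (total queries)
    return sum(sum(t[w] for w in g) * sum(q[w] for w in g) for g in groups)

separate = [["vldb"], ["2006"], ["worm"]]
merged = [["vldb", "2006"], ["worm"]]
print(cost(separate))  # 100*50 + 400*20 + 10*5 = 13050
print(cost(merged))    # (100+400)*(50+20) + 10*5 = 35050
```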
Slide 19: Which lists to merge? We need a heuristic solution
- Choose A = {A_1, A_2, ..., A_n}, with n <= number of cache blocks
- Minimize sum over groups A of (sum of t_w over w in A) * (sum of q_w over w in A)
- The problem is NP-complete (reduction from minimum-sum-of-squares)
- So, try some merging heuristics on a real-world workload
  - 1 million documents from IBM's intranet
  - 300,000 queries
Slide 20: A few terms contribute most of the query workload cost
[Figure: distribution of per-term cost t_w * q_w across terms]
Slide 21: Different merging heuristics were tried
- Separate lists for high-contributor terms
- Merging heuristics based on q_w and t_w
- Random merging
- Details of the heuristics and their evaluation are in the paper
Slide 22: Additional index support is needed to answer conjunctive queries quickly
[Figure: answering "VLDB and 2006" by intersecting the two posting lists, of lengths m and n]
- Merge join: O(m + n)
- Index join: O(m log n)
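Both join strategies can be sketched over illustrative posting lists (the slide does not give the exact lists):

```python
# Two ways to intersect sorted posting lists for a conjunctive query such
# as "VLDB and 2006": a merge join scans both lists in lockstep, O(m + n);
# an index join binary-searches the longer list once per element of the
# shorter one, O(m log n).
from bisect import bisect_left

def merge_join(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long_list):
    out = []
    for x in short:                       # m probes, log n each
        k = bisect_left(long_list, x)
        if k < len(long_list) and long_list[k] == x:
            out.append(x)
    return out

vldb = [2, 7, 13, 24, 31]        # illustrative posting lists
y2006 = [3, 7, 13, 24, 30, 31]
print(merge_join(vldb, y2006))   # [7, 13, 24, 31]
```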
Slide 23: How to maintain B-trees on WORM
- B-trees require node splits and joins
- A B-tree on a posting list is a special case: document IDs are inserted in increasing order
- So it can be built bottom-up, without splits/joins
- Please refer to our paper
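A minimal sketch of the bottom-up build, assuming a fixed fanout and using the keys from the B-tree figure on the next slide; the separator-key choice is a simplification of a real B-tree layout:

```python
# Sketch of building a B-tree bottom-up over an increasing key sequence:
# because keys only ever arrive in sorted order, each node can be filled
# and written exactly once (WORM-friendly), with no splits or joins.
FANOUT = 4

def build_bottom_up(keys):
    # Leaves: consecutive runs of the sorted keys.
    level = [keys[i:i + FANOUT] for i in range(0, len(keys), FANOUT)]
    tree = [level]
    while len(level) > 1:
        # Each internal entry is the first key of its child node.
        parents = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            parents.append([c[0] for c in children])
        tree.append(parents)
        level = parents
    return tree  # tree[0] = leaf level, tree[-1] = root level

tree = build_bottom_up([2, 4, 7, 11, 13, 19, 23, 29, 31])
print(tree[-1])   # [[2, 13, 31]]
```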
Slide 24: A B-tree index is insecure, even on WORM
[Figure: a B-tree over the keys 4, 7, 11, 13, 19, 23, 29, 31; inserting key 2 reshapes the tree]
- The path to an element depends on elements inserted later, so an adversary can attack it
Slide 25: Our solution is jump indexes
- The path to an element depends only on elements inserted before it
- The jump index is provably trustworthy
- It leverages the fact that document IDs are increasing
- O(log N) lookup, where N = number of documents
- Supports range queries too
- Reasonable performance compared to B-trees for conjunctive queries, in experiments with a real workload
- For details, see our paper
Slide 26: Conclusions
- WORM storage by itself is not enough; we need a trustworthy index too
- Trustworthy inverted indexes can be built efficiently
  - 10-15% slowdown for non-conjunctive queries
  - Within 1.5x of optimal B-tree performance for conjunctive queries
- Other possible uses, e.g., indexing time
Slide 27: Future work
- Ranking attacks: the adversary can add a lot of junk documents to skew rankings
- A secure generic index structure
  - Path to an element is fixed
  - Supports range queries
Slide 28: Questions?
Slide 29: A new index structure is required: the jump index
- The ith jump pointer of element n points to an element n_i with n + 2^i <= n_i < n + 2^(i+1)
Slide 30: Jump index in action
[Figure: an element's jump pointers 0 through 4 being filled as later elements arrive]
Slide 31: The path to an element does not depend on future elements
[Figure: Lookup(7) follows the same jump pointers regardless of elements inserted later]
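The lookup walk can be sketched with an in-memory model. The element IDs are illustrative, and the pointer-setting loop on insert (which scans all existing elements) is a simplification for clarity, not the paper's block-based scheme.

```python
# Sketch of a jump index over increasing document IDs: element n keeps a
# pointer i to the FIRST element inserted in [n + 2**i, n + 2**(i+1)).
# Since IDs arrive in increasing order, "first inserted" is also smallest,
# so every hop moves strictly toward the target: the path to an element
# depends only on elements inserted before it, in O(log N) hops.
pointers = {}   # doc_id -> {i: doc_id of first element in pointer i's range}
elements = []   # insertion order = increasing doc IDs

def insert(x):
    for n in elements:
        i = (x - n).bit_length() - 1   # i such that 2**i <= x - n < 2**(i+1)
        pointers[n].setdefault(i, x)   # only the first arrival claims slot i
    pointers[x] = {}
    elements.append(x)

def lookup(x):
    n = elements[0]
    hops = [n]
    while n != x:
        i = (x - n).bit_length() - 1
        n = pointers[n][i]             # guaranteed to exist and be <= x
        hops.append(n)
    return hops

for d in [1, 3, 4, 7, 9, 12]:          # illustrative document IDs
    insert(d)
print(lookup(7))   # [1, 7]
```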
Slide 32: Jump index elements are stored in blocks
- Storing jump pointers with every element is inefficient, so p entries are grouped together per block
- Branch factor B
- Pointer (i, j) from a block b points to the block b' holding the smallest element x with l + j*B^i <= x < l + (j+1)*B^i
[Figure: a block of p entries with its jump pointers]
Slide 33: Jump index evaluation parameters
- p: number of elements grouped together
- B: branching factor
- L: block size
- These are related: a block of size L must hold the p entries plus the (B-1) * log_B N jump pointers
- Evaluation:
  - Index update performance (pointers have to be set)
  - Query performance
Slide 34: Update performance levels off at a reasonable cache size
[Figure: update performance vs. cache size]
Slide 35: Query performance is close to optimal (B-tree)
[Figure: query performance comparison against a B-tree]
Slide 36: B-tree on WORM
Slide 37: A B-tree for an increasing sequence can be created on WORM
[Figure: a B-tree built bottom-up over the increasing keys 2, 4, 7, 11, 13, 19]
Slide 38: Is this a real threat?
- Would someone want to delete a record a day after it is created?
- Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack
- A record may be regretted moments after creation
- Email best practice: must be committed before it is delivered