Lucene KV-Store
A high-performance key-value store for large datasets
Mark Harwood
Benefits
High-speed reads and writes of key/value pairs sustained over growing volumes of data
Read costs are always 0 or 1 disk seek
Efficient use of memory
Simple file structures with strong durability guarantees
Why “Lucene” KV store?
Uses Lucene’s “Directory” APIs for low-level file access
Based on Lucene’s concepts of segment files, soft deletes, background merges, commit points, etc., BUT with a fundamentally different form of index
I’d like to offer it to the Lucene community as a “contrib” module because they have a track record in optimizing these same concepts (and could potentially make use of it in Lucene?)
Example benchmark results
Note: regular Lucene search indexes follow the same trajectory as the “Common KV Store” when it comes to lookups on a store with millions of keys
KV-Store High-level Design
Map held in RAM (key hash → disk pointer):

  Key hash (int)   Disk pointer (int)
  23434            0
  6545463          10
  874382           22

Disk record layout at each pointer:

  Num keys with hash (VInt)
  Key 1 size (VInt)
  Key 1 (byte[])
  Value 1 size (VInt)
  Value 1 (byte[])
  Key/values 2, 3, 4…

Example disk records:

  1 3 Foo 3 Bar                          (most hashes have only one associated key and value)
  2 5 Hello 5 World 7 Bonjour 8 Le Mon…  (some hashes have key collisions, requiring extra key/value entries in the record)
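As a concrete sketch of the record layout above (illustrative only: the class and method names below are invented, not the actual module’s code), one hash bucket could be encoded with Lucene’s DataOutput, which already provides the VInt encoding used here:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.store.DataOutput;

// Illustrative only: encodes one hash "bucket" (all key/value pairs whose keys share a hash)
// in the layout shown above.
final class BucketWriter {

    static final class Entry {
        final byte[] key;
        final byte[] value;
        Entry(byte[] key, byte[] value) { this.key = key; this.value = value; }
    }

    // Writes: num keys with hash (VInt), then per entry: key size (VInt), key bytes,
    // value size (VInt), value bytes.
    static void writeBucket(DataOutput out, List<Entry> entries) throws IOException {
        out.writeVInt(entries.size());                    // "Num keys with hash"
        for (Entry e : entries) {
            out.writeVInt(e.key.length);                  // "Key size"
            out.writeBytes(e.key, 0, e.key.length);       // "Key (byte[])"
            out.writeVInt(e.value.length);                // "Value size"
            out.writeBytes(e.value, 0, e.value.length);   // "Value (byte[])"
        }
    }
}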
Read logic (pseudo code)
int keyHash = hash(searchKey);
Integer filePointer = ramMap.get(keyHash);
if (filePointer == null)
    return null;                      // hash never stored: zero disk seeks
file.seek(filePointer);               // the single random disk seek
int numKeysWithHash = file.readVInt();
for (int i = 0; i < numKeysWithHash; i++)
{
    storedKey = file.readKeyData();
    if (storedKey.equals(searchKey))
        return file.readValueData();  // found the matching key in this bucket
    file.readValueData();             // skip the value of a non-matching key
}
return null;                          // bucket exists but the key is absent
There is a guaranteed maximum of one random disk seek for any lookup. With a good hashing function, most lookups will only need to go once around this loop.
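A rough Java rendering of this read path, assuming the data file is read through Lucene’s IndexInput and the RAM map is a Map from key hash to file offset (class and method names are illustrative, not the module’s actual API):

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;
import org.apache.lucene.store.IndexInput;

// Illustrative read path: at most one random disk seek per lookup.
final class BucketReader {

    static byte[] lookup(Map<Integer, Long> ramMap, IndexInput in, byte[] searchKey) throws IOException {
        int keyHash = Arrays.hashCode(searchKey);    // stand-in for the store's hash function
        Long filePointer = ramMap.get(keyHash);
        if (filePointer == null) {
            return null;                             // hash never stored: zero disk seeks
        }
        in.seek(filePointer);                        // the single random seek
        int numKeysWithHash = in.readVInt();
        for (int i = 0; i < numKeysWithHash; i++) {
            byte[] storedKey = readBytes(in);
            byte[] storedValue = readBytes(in);
            if (Arrays.equals(storedKey, searchKey)) {
                return storedValue;                  // found the matching key
            }
            // non-matching collision entry: its value has already been read past
        }
        return null;                                 // bucket exists but the key is absent
    }

    private static byte[] readBytes(IndexInput in) throws IOException {
        byte[] data = new byte[in.readVInt()];       // size (VInt) followed by raw bytes
        in.readBytes(data, 0, data.length);
        return data;
    }
}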
Write logic (pseudo code)
int keyHash = hash(newKey);
Integer oldFilePointer = ramMap.get(keyHash);
ramMap.put(keyHash, file.length());   // the new bucket will be appended at the current end of file
if (oldFilePointer == null)
{
    file.append(1);                   // only 1 key with this hash
    file.append(newKey);
    file.append(newValue);
}
else
{
    file.seek(oldFilePointer);
    int numOldKeys = file.readVInt();
    Map tmpMap = file.readNextNKeysAndValues(numOldKeys);
    tmpMap.put(newKey, newValue);     // replaces any earlier value stored under this key
    file.append(tmpMap.size());
    file.appendKeysAndValues(tmpMap);
}
Updates always append to the end of the file, leaving older values unreferenced. In the case of key collisions, previously stored values are copied to the new position at the end of the file along with the new content.
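The same write path sketched in Java, assuming appends go through Lucene’s IndexOutput and old buckets are re-read through an IndexInput (again, names are illustrative rather than the module’s actual API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Illustrative write path: every write appends a fresh bucket; any older copy of the
// bucket is simply left unreferenced until a merge reclaims the space.
final class BucketAppender {

    static void put(Map<Integer, Long> ramMap, IndexInput reader, IndexOutput writer,
                    byte[] newKey, byte[] newValue) throws IOException {
        int keyHash = Arrays.hashCode(newKey);
        Long oldFilePointer = ramMap.get(keyHash);
        ramMap.put(keyHash, writer.getFilePointer());   // the new bucket starts at the end of file
        if (oldFilePointer == null) {
            writer.writeVInt(1);                        // only one key with this hash so far
            writeEntry(writer, newKey, newValue);
        } else {
            // Collision or update: copy the old bucket forward, dropping any stale value for newKey.
            reader.seek(oldFilePointer);
            int numOldKeys = reader.readVInt();
            List<byte[][]> kept = new ArrayList<>();
            for (int i = 0; i < numOldKeys; i++) {
                byte[] k = readBytes(reader);
                byte[] v = readBytes(reader);
                if (!Arrays.equals(k, newKey)) {
                    kept.add(new byte[][] { k, v });
                }
            }
            writer.writeVInt(kept.size() + 1);
            for (byte[][] kv : kept) {
                writeEntry(writer, kv[0], kv[1]);
            }
            writeEntry(writer, newKey, newValue);       // the latest value wins
        }
    }

    private static void writeEntry(IndexOutput out, byte[] key, byte[] value) throws IOException {
        out.writeVInt(key.length);
        out.writeBytes(key, 0, key.length);
        out.writeVInt(value.length);
        out.writeBytes(value, 0, value.length);
    }

    private static byte[] readBytes(IndexInput in) throws IOException {
        byte[] data = new byte[in.readVInt()];
        in.readBytes(data, 0, data.length);
        return data;
    }
}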
Segment generations: writes
[Diagram: one hash → pointer map held in RAM per segment, each paired with its own key and value disk store; segment 0 is the old, read-only generation and segment 1 is the new generation receiving writes]
Writes append to the end of the latest-generation segment until it reaches a set size; it is then made read-only and a new segment is created.
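A minimal sketch of this rollover policy (the Segment interface and the size threshold are assumptions for illustration, not the module’s actual API):

import java.io.IOException;

// Writes always target the newest segment; once it grows past a configured size it is
// frozen and a fresh segment is opened for subsequent writes.
final class SegmentRollover {

    interface Segment {
        long generation();
        long sizeInBytes();
        void append(byte[] key, byte[] value) throws IOException;
        void makeReadOnly() throws IOException;   // existing pointers stay valid; no further appends
    }

    interface SegmentFactory {
        Segment create(long generation) throws IOException;
    }

    private static final long MAX_SEGMENT_BYTES = 256L * 1024 * 1024;  // example "set size"

    private final SegmentFactory factory;
    private Segment current;

    SegmentRollover(SegmentFactory factory, Segment first) {
        this.factory = factory;
        this.current = first;
    }

    void write(byte[] key, byte[] value) throws IOException {
        if (current.sizeInBytes() >= MAX_SEGMENT_BYTES) {
            current.makeReadOnly();                            // older generation becomes immutable
            current = factory.create(current.generation() + 1);
        }
        current.append(key, value);                            // appends go to the latest generation
    }
}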
Segment generations: reads
[Diagram: the same per-segment RAM maps (hash → pointer) and key and value disk stores, old generation 0 and new generation 1]
Read operations search the memory maps in reverse order. The first map found with a hash is expected to have a pointer into its associated file for all the latest keys/values with this hash.
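A sketch of that cross-segment lookup order (SegmentReader is an assumed interface, used only for illustration):

import java.io.IOException;
import java.util.List;

// Newest segment first: the first in-RAM map that knows the hash is authoritative,
// so at most one segment (and one disk seek) is ever consulted.
final class GenerationalReader {

    interface SegmentReader {
        boolean containsHash(int keyHash);                                 // RAM-only check
        byte[] lookup(int keyHash, byte[] searchKey) throws IOException;   // one seek at most
    }

    static byte[] get(List<SegmentReader> oldestToNewest, int keyHash, byte[] searchKey)
            throws IOException {
        for (int i = oldestToNewest.size() - 1; i >= 0; i--) {             // reverse (newest-first) order
            SegmentReader segment = oldestToNewest.get(i);
            if (segment.containsHash(keyHash)) {
                // This segment holds the latest keys/values for the hash.
                return segment.lookup(keyHash, searchKey);
            }
        }
        return null;                                                       // no segment has seen this hash
    }
}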
Segment generations: merges
[Diagram: per-segment RAM maps (hash → pointer) and key and value disk stores, before and after a merge of the older, read-only generations]
A background thread merges read-only segments with many outdated entries into new, more compact versions.
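One way such a compaction could look, sketched for a single segment (the BucketStore interface is an assumption for illustration; the real merge logic may also combine several read-only segments into one):

import java.io.IOException;
import java.util.Map;

// Only buckets still referenced by the segment's in-RAM map are copied into the
// replacement file, so space held by superseded (unreferenced) buckets is reclaimed.
final class SegmentCompactor {

    interface BucketStore {
        byte[] readBucket(long filePointer) throws IOException;   // raw bytes of one bucket
        long appendBucket(byte[] bucket) throws IOException;      // returns the new pointer
    }

    static void compact(Map<Integer, Long> liveMap, BucketStore oldStore, BucketStore newStore)
            throws IOException {
        for (Map.Entry<Integer, Long> entry : liveMap.entrySet()) {
            byte[] bucket = oldStore.readBucket(entry.getValue());
            long newPointer = newStore.appendBucket(bucket);
            entry.setValue(newPointer);                            // repoint the hash at the compact copy
        }
        // Once readers have switched to the new file, the old one can be deleted.
    }
}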
Segment generations: durability
[Diagram: per-segment RAM maps (hash → pointer) and key and value disk stores, with completed segments 0 and 4 alongside the active segment]
Like Lucene, commit operations create a new generation of a “segments” file, the contents of which reflect the committed (i.e. fsync’ed) state of the store:

  Completed Segment IDs: 0, 4
  Active Segment ID: 3
  Active segment committed length: 423423
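A sketch of what such a commit might look like with Lucene’s Directory API (the file name and field layout below are assumptions, not the module’s actual format):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

// Data files are fsync'ed first, then a new generation of the small "segments" file
// records exactly what is durable; the previous generation remains as a fallback.
final class CommitWriter {

    static void commit(Directory dir, long generation, List<Integer> completedSegmentIds,
                       int activeSegmentId, long activeCommittedLength,
                       List<String> dataFileNames) throws IOException {
        dir.sync(dataFileNames);                           // make the key/value data durable first
        String segmentsFile = "segments_" + generation;    // hypothetical naming scheme
        try (IndexOutput out = dir.createOutput(segmentsFile, IOContext.DEFAULT)) {
            out.writeVInt(completedSegmentIds.size());
            for (int id : completedSegmentIds) {
                out.writeVInt(id);                         // e.g. 0, 4
            }
            out.writeVInt(activeSegmentId);                // e.g. 3
            out.writeVLong(activeCommittedLength);         // e.g. 423423
        }
        dir.sync(List.of(segmentsFile));                   // the commit point itself must be durable
    }
}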
Implementation details
JVM needs sufficient RAM for 2 ints for every active key (note: using “modulo N” on the hash can cap RAM at N x 2 ints, at the cost of more key collisions = more disk IO)
Uses Lucene Directory for:
  Abstraction from choice of file system
  Buffered reads/writes
  Support for VInt encoding of numbers
  Rate-limited merge operations
Borrows successful Lucene concepts:
  Multiple segments flushed then made read-only
  “Segments” file used to list committed content (could potentially support multiple commit points)
  Background merges
Uses LGPL “Trove” for maps of primitives
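A small illustration of the RAM-sizing point and the “modulo N” trade-off, assuming Trove 3’s TIntIntHashMap (the helper names are invented for the example):

import gnu.trove.map.hash.TIntIntHashMap;   // Trove 3.x package path assumed
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Two ints per map entry; capping hashes with "modulo N" bounds the map at N entries,
// trading RAM for extra collisions (and therefore extra disk reads per lookup).
final class RamSizingDemo {

    static int bucketHash(byte[] key, int modulo) {
        int h = Arrays.hashCode(key);
        return modulo <= 0 ? h : Math.floorMod(h, modulo);   // modulo <= 0 means "no cap"
    }

    public static void main(String[] args) {
        TIntIntHashMap ramMap = new TIntIntHashMap();        // primitive map: no boxing overhead
        int modulo = 1 << 20;                                // cap at ~1M entries, roughly 8 MB of int payload
        byte[] key = "user:42".getBytes(StandardCharsets.UTF_8);
        ramMap.put(bucketHash(key, modulo), 12345 /* file pointer of this key's bucket */);
        System.out.println(ramMap.containsKey(bucketHash(key, modulo)));   // true
    }
}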