com.sleepycat.je.rep.vlsn
Class VLSNIndex

java.lang.Object
  extended by com.sleepycat.je.rep.vlsn.VLSNIndex

public class VLSNIndex
extends Object

A VLSN (Virtual LSN) is used to identify every log entry shared between members of the replication group. Since a JE log is identified by LSNs, we must have a way to map VLSN->LSNs in order to fetch a replicated log record from the local log, using the VLSN. The VLSNIndex implements those mappings. The VLSNIndex has these responsibilities:

- Generating new VLSNs. Only masters need to generate VLSNs, but any node may have the potential to become a master. The VLSN sequence must ascend over time and across recoveries, so the VLSN id must be preserved much like the database, node and txn ids.

- Maintaining the VLSN range. Although each node needs to receive and store each log entry from the replication stream, over time the part of the stream that is stored can be reduced, either by log cleaning, or by syncups which can truncate the replication stream. A node always holds a contiguous portion of the replication stream. The VLSN range identifies that portion by its start and end VLSNs, as well as key landmarks such as the last sync-able log entry and the last commit log entry. VLSN range information is used by elections and syncup.

- Gatekeeper for waiting for the most recently logged entries. Feeders block upon the VLSNIndex when they are trying to fetch the most recently logged entries. These recent log entries are held in a two level cache within the VLSNIndex. A call to VLSNIndex.waitForVLSN() goes through this sequence:
    1) check the log item stored in the vlsn wait latch, if the call did wait
    2) check the log item cache
  If both fail, the FeederReader will fetch the required log entry from log buffers or disk.

- Providing the LSN mapping for a log record identified by its VLSN. The Feeders and the syncup protocol both need to retrieve log records by VLSN. To do that, we need an LSN mapping. Mappings are added to the VLSNIndex when replicated log entries are written into the local log. Although all mappings are registered, the VLSNIndex does not keep every one, in order to save on disk and in-memory storage. Only a sparse set is kept. When searching for a log entry by VLSN, the caller uses the closest available mapping and then scans the log looking for that entry (see the sketch at the end of this overview).

The VLSNIndex relies on the assumption that VLSN tagged log entries are ordered and contiguous in the log. That is, the LSN for VLSN 1 < the LSN for VLSN 2 < the LSN for VLSN 3, and there is never a gap in the VLSNs. However, at node syncup, the replication stream may need to be truncated when rolling back a non-committed log entry. We can't literally truncate the log files because the JE logs contain intermingled transactional and non-transactional information. Instead, the truncation is done both logically, by amending the VLSNIndex, and physically, by overmarking those entries in the JE logs. Because of that, a physical dump of the log may show some VLSN tagged entries as duplicate and/or out of order, because they're abandoned log entries that are no longer logically part of the replication stream. For example, the log can look like this:

    LSN 100, VLSN 1
    LSN 200, VLSN 2  <- overmarked
    LSN 300, VLSN 3  <- overmarked
    --- syncup, rollback to VLSN 1, restart at VLSN 2 ---
    LSN 400, VLSN 2
    LSN 500, VLSN 3
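The "Providing the LSN mapping" responsibility above boils down to a sparse map plus a forward scan. The following is a minimal sketch, using simplified long-valued VLSNs and LSNs and a hypothetical LogScanner callback in place of JE's FeederReader, of how any VLSN can be located from the closest mapping at or below it:

    import java.util.Map;
    import java.util.TreeMap;

    public class SparseVlsnLookup {

        // Sparse mappings: only some VLSNs have a recorded LSN.
        private final TreeMap<Long, Long> vlsnToLsn = new TreeMap<>();

        void put(long vlsn, long lsn) {
            vlsnToLsn.put(vlsn, lsn);
        }

        // Hypothetical stand-in for a FeederReader-style forward scan of the log.
        interface LogScanner {
            long findVlsnFrom(long startLsn, long targetVlsn);
        }

        // Returns the LSN of the entry tagged with targetVlsn.
        long lookup(long targetVlsn, LogScanner scanLog) {
            // Closest mapping at or below the target; the range start is
            // guaranteed to have a mapping, so a covered VLSN always has a floor.
            Map.Entry<Long, Long> floor = vlsnToLsn.floorEntry(targetVlsn);
            if (floor == null) {
                throw new IllegalArgumentException("VLSN below the index range");
            }
            // Scan forward from the approximate position until the exact VLSN
            // is found; replicated entries are contiguous and ascending.
            return scanLog.findVlsnFrom(floor.getValue(), targetVlsn);
        }
    }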
VLSN->LSN mappings are created under the log write latch, which ensures that all VLSN tagged log entries are ordered in the logical replication stream in the log. However, the mapping is added to the VLSNIndex outside the log write latch, so the VLSNIndex database may have a momentary gap. For example:

    t0 - thread 1 logs entry at VLSN=1, LSN=100, within log write latch
    t1 - thread 2 logs entry at VLSN=2, LSN=150, within log write latch
    t2 - thread 3 logs entry at VLSN=3, LSN=200, within log write latch
    t3 - thread 1 calls VLSNIndex.put(VLSN=1/LSN=100)
    t4 - thread 3 calls VLSNIndex.put(VLSN=3/LSN=200)
    t5 - thread 2 calls VLSNIndex.put(VLSN=2/LSN=150)

At t4, the VLSNIndex contains 1/100 and 3/200, but not 2/150. However, we know that the VLSNIndex always represents a contiguous range of VLSNs, so the fact that 2/150 is not yet present is handled, and is just like the case where the VLSNIndex optimized away the mapping in order to keep the index sparse. We do guarantee that the start and end VLSNs in the range have mappings, in order to always be able to provide an LTE and GTE mapping for all valid VLSNs. Because of that, if a VLSN comes out of order, it does not update the range (see the sketch below).

Persistent storage: the VLSN->LSN mappings in the range are grouped into instances of com.sleepycat.je.util.VLSNBucket. Each bucket knows the first and last VLSN within its mini-range. We observe these invariants:
- buckets are ordered by VLSN in the database and the bucket cache,
- only the last bucket is the target of updates at any time,
- a single bucket corresponds to a single file, but a single file may have multiple buckets covering it.

While it would be nice to also guarantee that there are no gaps between buckets, i.e.:

    bucket(N-1).last == bucket(N).first - 1
    bucket(N).last   == bucket(N+1).first - 1

it is not possible to do so, because the put() call is not serialized; we don't want to add overhead to the log write latch. In order to permit out of order put()s, and to require that only the last bucket is updated, we must permit gaps between buckets.

Buckets are both cached in memory by the VLSNIndex and stored persistently in an internal, non-replicated database. The database holds key/value pairs where

    key  = bucket.first
    data = bucket

Since the first valid VLSN is 1, key = -1 is reserved for storage of the VLSNRange.

Buckets are filled up as new VLSNs arrive (either because they've been generated by write operations on the master, or because they're incoming operations on the replica). They're flushed to disk periodically rather than with every new VLSN, because the update rate would have too much of a performance impact. Since this level of caching is happening, we must be careful to write in-memory buckets to disk at well known points to support recoverability. The flushing must be instigated by a third party activity, such as checkpointing, rather than by the action of adding a new mapping. That's because mappings are registered by the logging system, and although we are not holding the log write latch at that point, it seems inadvisable to recursively generate another logging call on behalf of the flush. Currently the VLSNIndex is flushed to disk at every checkpoint. It can also optionally happen more often, and (TODO) we may want to do so because we've seen cases where checkpoints take a very long time. Perhaps we should flush when we flip to a new log file?

Once written to disk, the buckets are generally not updated. Updates can happen when the range is truncated, such as for syncup rollback, but the system is quiescent at that time. Log cleaning can delete buckets, but will not modify them. The VLSNRange does naturally change often, and that data record does get updated.
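Below is a minimal sketch of the out-of-order put() handling described above, using simplified long-valued types rather than JE's VLSNTracker and VLSNBucket: a mapping that arrives behind the current range end is dropped, just as the sparse index might have dropped it anyway, while the range end is always backed by a mapping.

    import java.util.TreeMap;

    class TrackerSketch {
        private final TreeMap<Long, Long> mappings = new TreeMap<>();
        private long rangeFirst = -1;   // -1 means the range is still empty
        private long rangeLast = -1;

        synchronized void put(long vlsn, long lsn) {
            if (rangeLast != -1 && vlsn <= rangeLast) {
                // Out of order (e.g. VLSN 2 arriving after VLSN 3): the range
                // already covers it, so skip the mapping to keep the index sparse.
                return;
            }
            if (rangeFirst == -1) {
                rangeFirst = vlsn;      // the first mapping defines the range start
            }
            rangeLast = vlsn;           // the range end is always backed by a mapping
            mappings.put(vlsn, lsn);
        }
    }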
Recovery: the VLSN database is restored at recovery time just as all other databases are. However, there may be a portion of the VLSN range that was not flushed to disk. At recovery, we piggyback onto the log scanning that recovery already performs, and re-track any mappings found within the recovery range. Those mappings are merged into those stored on disk, so that the VLSNIndex correctly reflects the entire replication stream at startup. For example, suppose a log has:

    LSN
    100  firstActiveLSN
    200  Checkpoint start
    300  VLSN 78
    400  VLSNIndex flushed here
    500  Checkpoint end
    600  VLSN 79

The VLSNIndex is initially populated with the version of the index found at LSN 400. That doesn't include VLSN 79. A tracking pass is done from checkpoint start -> end of log, which sweeps up VLSN 78 and VLSN 79 into a temporary tracker. That tracker is merged into the VLSNIndex, to update its mappings through VLSN 79. Note that the checkpointed VLSNIndex must encompass all VLSN mappings that are prior to the checkpoint start of that recovery period. This follows the general philosophy that checkpoint flushes all metadata, and recovery reads from checkpoint start onwards to add on any needed extra data.

Retrieving mappings: callers who need to retrieve mappings obtain a VLSNScanner, which acts as a cursor over the VLSNIndex. A VLSNScanner finds and saves the applicable VLSNBucket, and queries the bucket directly as long as it can provide mappings. This reduces the level of contention between multiple readers (feeders) and writers (application threads, or the replay thread). A simplified sketch of this cursor pattern follows the reader/writer notes below.

Synchronization hierarchy: to write a new mapping, you must have the mutex on the VLSNIndex, and then the tracker, which lets you obtain the correct bucket, and then you must have a mutex on the bucket. To read a mapping, you must have the tracker mutex to obtain the right bucket. If you already have the right bucket in hand, you only need the bucket mutex. In truth, buckets which are not the "currentBucket" are not modified again, so a future optimization would allow for reading a mapping on a finished bucket without synchronization.

The VLSNRange is updated as an atomic assignment to a volatile field after taking the mutex on the current bucket. It is read without a mutex, as a volatile field. The hierarchy is:

    VLSNIndex -> VLSNTracker -> VLSNBucket
    VLSNIndex -> VLSNTracker -> VLSNRange
    VLSNIndex -> VLSNIndex.mappingSynchronizer
    VLSNIndex.flushSynchronizer -> VLSNTracker

Removing mappings vs reading mappings - sync on the range.

We also need to consider the fact that callers of the VLSNIndex may be holding other mutexes or IN latches, and that the VLSNIndex methods may do database operations to read or write to the internal VLSN database. That can result in a nested database operation, and we need to be careful to avoid deadlocks. To be safe, we disable critical eviction [#18475] in VLSNBucket.writeDatabase().

Writers
-------
Allocating a new VLSN: bump()
- sync on the log write latch
  Note that since there is no synchronization on the VLSNIndex itself, [allocating a new VLSN, logging its entry] and [flushing the vlsn index to disk] are not atomic. See awaitConsistency().

Adding a mapping: put()
- sync on VLSNIndex
  - sync on VLSNTracker to access the right bucket, and possibly create a new bucket. Atomically modify the VLSNRange.

Flushing mappings to disk: writeToDatabase()
- sync on VLSNIndex.flushSynchronizer -> VLSNTracker

Replica side syncup truncates the VLSNIndex from the end:
- no synchronization needed; the system is quiescent, and we can assume that VLSNs are neither read nor written by other threads.
Log cleaning truncates the VLSNIndex from the beginning:
We assume that the log cleaner is prohibited from deleting files that are being used for current feeding. We can also assume that the end of the log is not being deleted, and that we're not in conflict with put(). We do have to worry about conflicting with backwards scans when executing syncup as a feeder, and with flushing mappings to disk. Shall we disable log file deletion at this point? Steps to take:
First change the VLSNRange:
- sync on VLSNIndex
- atomically modify the VLSNRange to ensure that no readers or writers touch the buckets that will be deleted.
- sync on VLSNTracker to delete any dead buckets. Do that before updating the on-disk database, so that we don't lose any buckets to writeToDatabase().
- without synchronization, scan the database and non-transactionally delete any on-disk buckets that are <= the log cleaned file.

Readers
-------
Active forward feeder checks if a mapping exists, and waits if necessary:
- read the current VLSNRange without a mutex. If not satisfactory:
  - sync on VLSNIndex
  - sync on VLSNIndex.mappingSynchronizer

Active forward feeder reads a mapping:
- first, getBucket(): sync on VLSNTracker to access the right bucket
- if the bucket is in hand: sync on the target bucket to read the bucket
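Below is a simplified sketch of the VLSNScanner cursor pattern mentioned under "Retrieving mappings" above. The types are stand-ins (a TreeMap of per-bucket TreeMaps instead of VLSNBucket objects); the point is that the reader caches the covering bucket and returns to the shared structure only when a request moves past it.

    import java.util.Map;
    import java.util.TreeMap;

    class ScannerSketch {
        // Shared structure: buckets keyed by their first VLSN.
        private final TreeMap<Long, TreeMap<Long, Long>> bucketsByFirstVlsn;

        // Reader-private cache of the bucket currently being consumed.
        private TreeMap<Long, Long> currentBucket;
        private long currentBucketLast = -1;

        ScannerSketch(TreeMap<Long, TreeMap<Long, Long>> buckets) {
            this.bucketsByFirstVlsn = buckets;
        }

        Long getLsn(long vlsn) {
            if (currentBucket == null || vlsn > currentBucketLast) {
                // Only touch the shared structure when the cached bucket runs
                // out (in JE this step would take the tracker mutex).
                Map.Entry<Long, TreeMap<Long, Long>> e = bucketsByFirstVlsn.floorEntry(vlsn);
                if (e == null) {
                    return null;
                }
                currentBucket = e.getValue();
                currentBucketLast = currentBucket.lastKey();
            }
            // Closest mapping at or below the requested VLSN within the bucket.
            Map.Entry<Long, Long> mapping = currentBucket.floorEntry(vlsn);
            return mapping == null ? null : mapping.getValue();
        }
    }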


Nested Class Summary
static class VLSNIndex.BackwardVLSNScanner
          Assumes that VLSNs are scanned backwards.
static class VLSNIndex.ForwardVLSNScanner
          Scans VLSNs in a forward direction, used by feeders.
static class VLSNIndex.VLSNAwaitLatch
          Associates the logItem with the latch, so that it's readily available when the latch is released.
static class VLSNIndex.WaitTimeOutException
           
 
Field Summary
static int AWAIT_CONSISTENCY_MS
           
 
Constructor Summary
VLSNIndex(EnvironmentImpl envImpl, String mappingDbName, NameIdPair nameIdPair, int vlsnStride, int vlsnMaxMappings, int vlsnMaxDistance, RecoveryInfo recoveryInfo)
          The mapping db's name is passed in as a parameter, instead of the more intuitive approach of defining it within the class, in order to facilitate unit testing of the VLSNIndex.
 
Method Summary
 void abnormalClose()
           
 void awaitConsistency()
          Ensure that the in-memory vlsn index encompasses all logged entries before it is flushed to disk.
 VLSN bump()
           
 void close()
           
 void close(boolean doFlush)
           
 void decrement()
           
 Map<VLSN,Long> dumpDb(boolean display)
          For debugging and unit tests
 void flushToDatabase()
          Mappings are flushed to disk at close, and at checkpoints.
 VLSNBucket getLTEBucketFromDatabase(VLSN vlsn)
           
 long getLTEFileNumber(VLSN vlsn)
          Return the nearest file number <= the log file that houses this VLSN.
(package private)  VLSN getPutWaitVLSN()
          For unit test only.
 VLSNRange getRange()
          All range points (first, last, etc) ought to be seen as one consistent group.
 void initAsMaster()
          Initialize before this node begins working as a master.
 boolean isFlushedToDisk()
           
(package private)  void merge(VLSNRecoveryTracker recoveryTracker)
           
 void put(LogItem logItem)
           
 void truncateFromHead(VLSN deleteEnd, long deleteFileNum)
          Remove all information from the VLSNIndex for VLSNs <= deleteEnd.
 void truncateFromTail(VLSN deleteStart, long lastLsn)
          Remove all information from the VLSNIndex for VLSNs >= deleteStart. Used by replica side syncup, when the log is truncated.
 boolean verify(boolean verbose)
           
static void verifyDb(Environment env, PrintStream out, boolean verbose)
          For DbStreamVerify utility.
 LogItem waitForVLSN(VLSN vlsn, int waitTime)
          Wait for the vlsn, or a higher numbered vlsn, to make its appearance in the VLSN index.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

AWAIT_CONSISTENCY_MS

public static final int AWAIT_CONSISTENCY_MS
See Also:
Constant Field Values
Constructor Detail

VLSNIndex

public VLSNIndex(EnvironmentImpl envImpl,
                 String mappingDbName,
                 NameIdPair nameIdPair,
                 int vlsnStride,
                 int vlsnMaxMappings,
                 int vlsnMaxDistance,
                 RecoveryInfo recoveryInfo)
          throws DatabaseException
The mapping db's name is passed in as a parameter, instead of the more intuitive approach of defining it within the class, in order to facilitate unit testing of the VLSNIndex.

Throws:
DatabaseException
Method Detail

initAsMaster

public void initAsMaster()
Initialize before this node begins working as a master. This node may become a Master directly after recovery, or it may transition to the master state after running for some time as a Replica.

Reset the vlsnIndex so the VLSN sequence corresponds to what this node thinks is the next VLSN.


bump

public VLSN bump()

decrement

public void decrement()

put

public void put(LogItem logItem)

waitForVLSN

public LogItem waitForVLSN(VLSN vlsn,
                           int waitTime)
                    throws InterruptedException,
                           VLSNIndex.WaitTimeOutException
Wait for the vlsn, or a higher numbered vlsn, to make its appearance in the VLSN index.

Returns:
the LogItem associated with the vlsn, or null if the entry is now present in the log, but is not available in the LogItemCache.
Throws:
InterruptedException
VLSNIndex.WaitTimeOutException - if the VLSN did not appear within waitTime or the latch was explicitly terminated.
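The following hypothetical sketch, with stand-in interfaces rather than the real VLSNIndex, VLSN and LogItem classes, illustrates the calling pattern this contract implies: wait for the VLSN and, when null comes back (the entry made it into the log but is no longer in the LogItemCache), fall back to reading the entry from the log.

    interface IndexSketch {
        // Stand-in for VLSNIndex.waitForVLSN(VLSN, int); null means the entry
        // is in the log but not cached.
        byte[] waitForVlsn(long vlsn, int waitTimeMs) throws InterruptedException;
    }

    interface LogReaderSketch {
        // Stand-in for a FeederReader-style scan of log buffers or disk.
        byte[] readEntry(long vlsn);
    }

    final class FeederLoopSketch {
        static void feed(IndexSketch index, LogReaderSketch reader, long firstVlsn)
                throws InterruptedException {
            long vlsn = firstVlsn;
            while (true) {
                byte[] entry = index.waitForVlsn(vlsn, 10_000);
                if (entry == null) {
                    // The entry exists in the log but fell out of the cache.
                    entry = reader.readEntry(vlsn);
                }
                // ... send the entry to the replica, then advance one VLSN at a
                // time, matching the sequential waiting pattern feeders follow.
                vlsn++;
            }
        }
    }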

getPutWaitVLSN

VLSN getPutWaitVLSN()
For unit test only.


truncateFromHead

public void truncateFromHead(VLSN deleteEnd,
                             long deleteFileNum)
                      throws DatabaseException
Remove all information from the VLSNIndex for VLSNs <= deleteEnd. Used by log cleaning. To properly coordinate with readers of the VLSNIndex, we need to update the range before updating the buckets. We assume that deleteEnd is always the last VLSN in a file, and because of that, truncations will never split a bucket. A truncation may leave a gap at the head of the VLSN index, though. This could occur if the buckets have a gap, due to out of order VLSNs. For example, it's possible that the index has these buckets:

    bucket A: firstVLSN = 10, lastVLSN = 20
    bucket B: firstVLSN = 22, lastVLSN = 30

If we truncate the index at 20 (deleteEnd == 20), then the resulting start of the range is 21, but the first bucket value is 22. In this case, we need to insert a ghost bucket.

Throws:
DatabaseException
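A minimal sketch of the ordering described above, with long-valued stand-ins for JE's tracker and bucket classes: the new range is published before any buckets are dropped, in-memory buckets are removed before the on-disk ones, and a placeholder "ghost" bucket is inserted when the truncation leaves a gap at the head of the index.

    import java.util.TreeMap;

    class HeadTruncationSketch {
        volatile long rangeFirst;
        // Buckets keyed by first VLSN; the value holds {firstVlsn, lastVlsn}.
        final TreeMap<Long, long[]> buckets = new TreeMap<>();

        synchronized void truncateFromHead(long deleteEnd) {
            // 1. Publish the new range so readers and writers stop touching
            //    the buckets that are about to be deleted.
            rangeFirst = deleteEnd + 1;

            // 2. Drop in-memory buckets covered by the truncation; deleteEnd is
            //    assumed to be the last VLSN in a file, so no bucket is split.
            buckets.headMap(deleteEnd, true).clear();

            // 3. If a gap remains at the head (range starts at 21 but the first
            //    surviving bucket starts at 22), insert a ghost bucket for the gap.
            if (!buckets.isEmpty() && buckets.firstKey() > rangeFirst) {
                buckets.put(rangeFirst, new long[] {rangeFirst, buckets.firstKey() - 1});
            }

            // 4. Finally, delete the corresponding on-disk buckets (not shown),
            //    non-transactionally and without holding the index mutex.
        }
    }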

truncateFromTail

public void truncateFromTail(VLSN deleteStart,
                             long lastLsn)
                      throws DatabaseException
Remove all information from the VLSNIndex for VLSNs >= deleteStart. Used by replica side syncup, when the log is truncated. Assumes that the vlsnIndex is quiescent.

Throws:
DatabaseException

getRange

public VLSNRange getRange()
All range points (first, last, etc) ought to be seen as one consistent group. Because of that, VLSNIndex doesn't offer getLastVLSN, getFirstVLSN type methods, to discourage the possibility of retrieving range points across two different range sets.
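A minimal sketch of this design choice, with the range simplified to a pair of longs: because the range is an immutable object published through a single volatile field, a reader that captures it once sees first and last from the same version, which separate getFirstVLSN()/getLastVLSN()-style getters could not guarantee.

    class RangeHolderSketch {
        // Immutable snapshot of the range endpoints.
        static final class RangeSketch {
            final long first;
            final long last;
            RangeSketch(long first, long last) {
                this.first = first;
                this.last = last;
            }
        }

        private volatile RangeSketch range = new RangeSketch(1, 1);

        RangeSketch getRange() {
            return range;                                    // readers take a whole snapshot
        }

        void advanceLast(long newLast) {
            range = new RangeSketch(range.first, newLast);   // one atomic publish
        }
    }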


getLTEFileNumber

public long getLTEFileNumber(VLSN vlsn)
                      throws DatabaseException
Return the nearest file number <= the log file that houses this VLSN. This method is meant to be efficient and will not incur I/O. If no exact mapping is available, it returns an approximation. The requested VLSN must be within the VLSNIndex range.

Throws:
DatabaseException

getLTEBucketFromDatabase

public VLSNBucket getLTEBucketFromDatabase(VLSN vlsn)
                                    throws DatabaseException
Throws:
DatabaseException

merge

void merge(VLSNRecoveryTracker recoveryTracker)

close

public void close()

abnormalClose

public void abnormalClose()

close

public void close(boolean doFlush)
           throws DatabaseException
Throws:
DatabaseException

flushToDatabase

public void flushToDatabase()
Mappings are flushed to disk at close, and at checkpoints.


dumpDb

public Map<VLSN,Long> dumpDb(boolean display)
For debugging and unit tests

Throws:
DatabaseException

verifyDb

public static void verifyDb(Environment env,
                            PrintStream out,
                            boolean verbose)
                     throws DatabaseException
For DbStreamVerify utility. Verify the on-disk database, disregarding the cached tracker.

Throws:
DatabaseException

verify

public boolean verify(boolean verbose)
               throws DatabaseException
Throws:
DatabaseException

isFlushedToDisk

public boolean isFlushedToDisk()

awaitConsistency

public void awaitConsistency()
Ensure that the in-memory vlsn index encompasses all logged entries before it is flushed to disk. A no-op for non-replicated systems.

The problem is in the interaction of logging and VLSN tracking. Allocating a new VLSN and logging a replicated log entry is done within the log write latch, without any VLSNIndex synchronization. That must be done to keep the log write latch critical section as small as possible, and to avoid any lock hierarchy issues. The VLSNIndex is updated after the log write latch critical section.

The VLSNIndex is flushed to disk by checkpoint, and it is assumed that this persistent version of the index encompasses all VLSN entries prior to checkpoint start. Since the logging of a new VLSN and the flushing of the index are not atomic, it's possible that the checkpointer may start the flush of the vlsnIndex before the last vlsn's mapping is recorded in the index. To obey the requirement that the checkpointed vlsn index encompass all mappings < checkpoint start, check that the vlsn index is up to date before the flush. [#19754]

awaitConsistency() works by using the same waitForVLSN() method used by the Feeders. waitForVLSN asserts that all feeders are waiting on a single vlsn, to assure that no feeders are left in limbo, awaiting a vlsn that has gone by. This contract is valid for the feeders, because they wait for vlsns sequentially, consuming each one by one. However, the checkpointer's awaitConsistency functionality uses the nextVLSNCounter, which can leapfrog ahead arbitrarily, as in this case:

    vlsn range holds 1 -> N-1
    Feeder is present, awaiting vlsn N
    thread A bumps vlsn to N and writes record under log write latch
    thread B bumps vlsn to N+1 and writes record under log write latch
    ckpter awaits consistency, using N+1, while feeders are awaiting N
    thread A puts VLSN N outside log write latch
    thread B puts VLSN N+1 outside log write latch

Because of this, the checkpointer must distinguish between what it is really waiting on (VLSN N+1) and what it can next wait on to fulfill the feeder waiting contract (VLSN N).
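A hypothetical sketch of the checkpointer-side loop this describes, with stand-in names (IndexViewSketch, lastAllocatedVlsn(), rangeLast()) rather than JE's internals: the flush target is the last VLSN handed out by bump(), but the checkpointer only ever waits on the VLSN immediately after the current range end, so the feeders' single-wait contract is never violated.

    interface IndexViewSketch {
        long lastAllocatedVlsn();                  // stand-in for the nextVLSNCounter value
        long rangeLast();                          // stand-in for getRange().getLast()
        void waitForVlsn(long vlsn, int waitMs)    // stand-in for waitForVLSN()
            throws InterruptedException;
    }

    final class AwaitConsistencySketch {
        static final int WAIT_MS = 1000;           // stand-in for AWAIT_CONSISTENCY_MS

        static void awaitConsistency(IndexViewSketch index) throws InterruptedException {
            long target = index.lastAllocatedVlsn();   // VLSN N+1 in the example above
            while (index.rangeLast() < target) {
                // Wait on the next sequential VLSN only, the same one any feeder
                // would be waiting on; repeat until the index has caught up.
                index.waitForVlsn(index.rangeLast() + 1, WAIT_MS);
            }
        }
    }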



Copyright (c) 2004-2010 Oracle. All rights reserved.