com.sleepycat.je.recovery
Class Checkpointer

java.lang.Object
  extended by com.sleepycat.je.utilint.DaemonThread
      extended by com.sleepycat.je.recovery.Checkpointer
All Implemented Interfaces:
EnvConfigObserver, DaemonRunner, ExceptionListenerUser, Runnable

public class Checkpointer
extends DaemonThread
implements EnvConfigObserver

The Checkpointer looks through the tree for internal nodes that must be flushed to the log. Checkpoint flushes must be done in ascending order from the bottom of the tree up.

Checkpoint and IN Logging Rules
-------------------------------
The checkpoint must log, and make accessible via non-provisional ancestors, all INs that are dirty at CkptStart. If we crash and recover from that CkptStart onward, any IN that became dirty (before the crash) after the CkptStart must become dirty again as the result of replaying the action that caused it to originally become dirty.

Therefore, when an IN is dirtied at some point in the checkpoint interval, but is not logged by the checkpoint, the log entry representing the action that dirtied the IN must follow either the CkptStart or the FirstActiveLSN that is recorded in the CkptEnd entry. The FirstActiveLSN is less than or equal to the CkptStart LSN. Recovery will process LNs between the FirstActiveLSN and the end of the log. Other entries are only processed from the CkptStart forward. And provisional entries are not processed.

Example: Non-transactional LN logging. We take two actions: 1) log the LN and then 2) dirty the parent BIN. What if the LN is logged before CkptStart and the BIN is dirtied after CkptStart? How do we avoid breaking the rules? The answer is that we log the LN while holding the latch on the parent BIN, and we don't release the latch until after we dirty the BIN. The construction of the checkpoint dirty map requires latching the BIN. Since the LN was logged before CkptStart, the BIN will be dirtied before the checkpointer latches it during dirty map construction. So the BIN will always be included in the dirty map and logged by the checkpoint.

Example: Abort. We take two actions: 1) log the abort and then 2) undo the changes, which modifies (dirties) the BIN parents of the undone LNs. There is nothing to prevent logging CkptStart in between these two actions, so how do we avoid breaking the rules?
The answer is that we do not unregister the transaction until after the undo phase. So although the BINs may be dirtied by the undo after CkptStart is logged, the FirstActiveLSN will be prior to CkptStart. Therefore, we will process the Abort and replay the action that modifies the BINs.

Exception: Lazy migration. The log cleaner will make an IN dirty without logging an action that makes it dirty. This is an exception to the general rule that actions should be logged when they cause dirtiness. The reasons this is safe are:

 1. The IN contents are not modified, so there is no information lost if the IN is never logged, or is logged provisionally and no ancestor is logged non-provisionally.
 2. If the IN is logged non-provisionally, this will have the side effect of recording the old LSN as being obsolete. However, the general rules for checkpointing and recovery will ensure that the new version is used in the Btree. The new version will either be replayed by recovery or referenced in the active Btree via a non-provisional ancestor.

Checkpoint Algorithm
--------------------
The final checkpointDirtyMap field is used to hold (in addition to the dirty INs) the state of the checkpoint and highest flush levels. Access to this object is synchronized so that eviction and checkpointing can access it concurrently. When a checkpoint is not active, the state is CkptState.NONE and the dirty map is empty. When a checkpoint runs, we do this:

 1. Get set of files from cleaner that can be deleted after this checkpoint.
 2. Set checkpointDirtyMap state to DIRTY_MAP_INCOMPLETE, meaning that dirty map construction is in progress.
 3. Log CkptStart
 4. Construct dirty map, organized by Btree level, from dirty INs in INList. The highest flush levels are calculated during dirty map construction. Set checkpointDirtyMap state to DIRTY_MAP_COMPLETE.
 5. Flush INs in dirty map.
    + First, flush the bottom two levels a sub-tree at a time, where a sub-tree is one IN at level two and all its BIN children. Higher levels (above level two) are logged strictly by level, not using subtrees.
      o If je.checkpointer.highPriority=false, we log one IN at a time, whether or not the IN is logged as part of a subtree, and do a Btree search for the parent of each IN.
      o If je.checkpointer.highPriority=true, for the bottom two levels we log each sub-tree in a single call to the LogManager with the parent IN latched, and we only do one Btree search for each level two IN. Higher levels are logged one IN at a time as with highPriority=false.
    + The Provisional property is set as follows, depending on the level of the IN:
      o level is max flush level: Provisional.NO
      o level is bottom level: Provisional.YES
      o Otherwise (middle levels): Provisional.BEFORE_CKPT_END
 6. Flush VLSNIndex cache to make VLSNIndex recoverable.
 7. Flush UtilizationTracker (write FileSummaryLNs) to persist all tracked obsolete offsets and utilization summary info, to make this info recoverable.
 8. Log CkptEnd
 9. Delete cleaned files from step 1.
 10. Set checkpointDirtyMap state to NONE.

Provisional.BEFORE_CKPT_END
---------------------------
See Provisional.java for a description of the relationship between the checkpoint algorithm above and the BEFORE_CKPT_END property.

Coordination of Eviction and Checkpointing
------------------------------------------
Eviction can proceed concurrently with all phases of a checkpoint, and eviction may take place concurrently in multiple threads. This concurrency is crucial to avoid blocking application threads that perform eviction and to reduce the amount of eviction required in application threads.

Eviction calls Checkpointer.coordinateEvictionWithCheckpoint, which calls DirtyINMap.coordinateEvictionWithCheckpoint, just before logging an IN.
coordinateEvictionWithCheckpoint returns whether the IN should be logged provisionally (Provisional.YES) or non-provisionally (Provisional.NO).

Other coordination necessary depends on the state of the checkpoint:

  + NONE: No additional action.
    o return Provisional.NO
  + DIRTY_MAP_INCOMPLETE: The parent IN is added to the dirty map, exactly as if it were encountered as dirty in the INList during dirty map construction.
    o IN level GTE highest flush level: return Provisional.NO
    o IN level LT highest flush level: return Provisional.YES
  + DIRTY_MAP_COMPLETE:
    o IN is root: return Provisional.NO
    o IN is not root: return Provisional.YES

In general this is designed so that eviction will use the same provisional value that would be used by the checkpoint, as if the checkpoint itself were logging the IN. However, there are several conditions where this is not exactly the case.

 1. Eviction may log an IN with Provisional.YES when the IN was not dirty at the time of dirty map creation, if it became dirty afterwards. In this case, the checkpointer would not have logged the IN at all. This is safe because the actions that made that IN dirty are logged in the recovery period.
 2. Eviction may log an IN with Provisional.YES after the checkpoint has logged it, if it becomes dirty again. In this case the IN is logged twice, which would not have been done by the checkpoint alone. This is safe because the actions that made that IN dirty are logged in the recovery period.
 3. An intermediate level IN (not bottom most and not the highest flush level) will be logged by the checkpoint with Provisional.BEFORE_CKPT_END but will be logged by eviction with Provisional.YES. See below for why this is safe.
 4. Between checkpoint step 8 (log CkptEnd) and 10 (set checkpointDirtyMap state to NONE), eviction may log an IN with Provisional.YES, although a checkpoint is not strictly active during this interval. See below for why this is safe.
It is safe for eviction to log an IN as Provisional.YES for the last two special cases, because this does not cause incorrect recovery behavior. For recovery to work properly, it is only necessary that:

  + Provisional.NO is used for INs at the max flush level during an active checkpoint.
  + Provisional.YES or BEFORE_CKPT_END is used for INs below the max flush level, to avoid replaying an IN during recovery that may depend on a file deleted as the result of the checkpoint.

You may ask why we don't use Provisional.YES for eviction when a checkpoint is not active. There are two reasons, both related to performance:

 1. This would be wasteful when an IN is evicted in between checkpoints, and that portion of the log is processed by recovery later, in the event of a crash. The evicted INs would be ignored by recovery, but the actions that caused them to be dirty would be replayed and the INs would be logged again redundantly.
 2. Logging an IN provisionally will not count the old LSN as obsolete immediately, so cleaner utilization will be inaccurate until a non-provisional parent is logged, typically by the next checkpoint. It is always important to keep the cleaner from stalling and spiking, to keep latency and throughput as level as possible.

Therefore, it is safe to log with Provisional.YES in between checkpoints, but not desirable.

Although we don't do this, it would be safe and optimal to evict with BEFORE_CKPT_END in between checkpoints, because it would be treated by recovery as if it were Provisional.NO. This is because the interval between checkpoints is only processed by recovery if it follows the last CkptEnd, and BEFORE_CKPT_END is treated as Provisional.NO if the IN follows the last CkptEnd.

However, it would not be safe to evict an IN with BEFORE_CKPT_END during a checkpoint, when logging of the IN's ancestors does not occur according to the rules of the checkpoint.
If this were done, then if the checkpoint completes and is used during a subsequent recovery, an obsolete offset for the old version of the IN will mistakenly be recorded. Below are two cases where BEFORE_CKPT_END is used correctly and one showing how it could be used incorrectly.

 1. Correct use of BEFORE_CKPT_END when the checkpoint does not complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    Crash and recover

    Recovery will process BIN-A at 200 (it will be considered non-provisional) because there is no following CkptEnd. It is therefore correct that BIN-A at 050 is obsolete.

 2. Correct use of BEFORE_CKPT_END when the checkpoint does complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    400 IN-B parent of BIN-A, non-provisional
    500 CkptEnd
    Crash and recover

    Recovery will not process BIN-A at 200 (it will be considered provisional) because there is a following CkptEnd, but it will process its parent IN-B at 400, and therefore the BIN-A at 200 will be active in the tree. It is therefore correct that BIN-A at 050 is obsolete.

 3. Incorrect use of BEFORE_CKPT_END when the checkpoint does complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    400 CkptEnd
    Crash and recover

    Recovery will not process BIN-A at 200 (it will be considered provisional) because there is a following CkptEnd, but no parent IN-B is logged, and therefore the IN-B at 060 and BIN-A at 050 will be active in the tree. It is therefore incorrect that BIN-A at 050 is obsolete.

This last case is what caused the LFNF in SR [#19422], when BEFORE_CKPT_END was mistakenly used for logging evicted BINs via CacheMode.EVICT_BIN.
During the checkpoint, we evict BIN-A and log it with BEFORE_CKPT_END, yet neither it nor its parent is part of the checkpoint. After the old version is counted obsolete, we crash and recover. Then the file containing the BIN (BIN-A at 050 above) is cleaned and deleted. During cleaning, it is not migrated because an obsolete offset was previously recorded. The LFNF occurs when trying to access this BIN during a user operation.

CacheMode.EVICT_BIN
-------------------
Unlike in JE 4.0 where EVICT_BIN was first introduced, in JE 4.1 and later we do not use special rules when an IN is evicted. Since concurrent eviction and checkpointing are supported in JE 4.1, the above rules apply to EVICT_BIN as well as all other types of eviction.
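
The three BEFORE_CKPT_END cases above can be replayed with a small model. The following is only a sketch under stated assumptions: the class, enum, and method names are hypothetical, log positions are plain numbers as in the listings above, and every entry is assumed to lie within the range that recovery reads. It is not the JE recovery code.

```java
// Illustrative sketch only: hypothetical names, not the JE recovery code.
final class BeforeCkptEndSketch {

    /** Local stand-in for the three Provisional values named above. */
    enum Provisional { NO, YES, BEFORE_CKPT_END }

    /** Marker meaning that the recovered log contains no CkptEnd. */
    static final long NO_CKPT_END = -1L;

    /**
     * Returns true if recovery would process (replay) an IN entry.
     * lastCkptEnd is the position of the last CkptEnd in the log,
     * or NO_CKPT_END if the checkpoint did not complete.
     * Assumes the entry lies within the range that recovery reads.
     */
    static boolean processedByRecovery(Provisional p,
                                       long position,
                                       long lastCkptEnd) {
        switch (p) {
            case NO:
                return true;
            case YES:
                return false;
            case BEFORE_CKPT_END:
                // Treated as Provisional.NO only when no CkptEnd follows.
                return lastCkptEnd == NO_CKPT_END || position > lastCkptEnd;
            default:
                throw new AssertionError("unknown value: " + p);
        }
    }

    public static void main(String[] args) {
        // Case 1: no CkptEnd, so BIN-A at 200 is processed.
        System.out.println(processedByRecovery(
            Provisional.BEFORE_CKPT_END, 200, NO_CKPT_END));
        // Case 2: CkptEnd at 500. BIN-A at 200 is skipped, but the
        // non-provisional IN-B at 400 is processed and references it.
        System.out.println(processedByRecovery(
            Provisional.BEFORE_CKPT_END, 200, 500));
        System.out.println(processedByRecovery(Provisional.NO, 400, 500));
        // Case 3: CkptEnd at 400 and no IN-B logged. BIN-A at 200 is
        // skipped and nothing else makes it reachable.
        System.out.println(processedByRecovery(
            Provisional.BEFORE_CKPT_END, 200, 400));
    }
}
```

In case 3 the model shows why the obsolete offset is wrong: the only entry that could make the new BIN-A reachable is the one recovery skips.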


Nested Class Summary
static class Checkpointer.CheckpointReference
           
static class Checkpointer.FlushStats
          A struct to hold log flushing stats for checkpoint and database sync.
 
Field Summary
static TestHook beforeFlushHook
           
static TestHook<IN> examineINForCheckpointHook
           
static TestHook maxFlushLevelHook
          For unit testing only.
 
Fields inherited from class com.sleepycat.je.utilint.DaemonThread
logger, name, nWakeupRequests, stifleExceptionChatter
 
Constructor Summary
Checkpointer(EnvironmentImpl envImpl, long waitTime, String name)
           
 
Method Summary
 void clearEnv()
           
 boolean coordinateEvictionWithCheckpoint(IN target, IN parent)
          Coordinates an eviction with an in-progress checkpoint and returns whether provisional logging is needed.
 void doCheckpoint(CheckpointConfig config, boolean flushAll, String invokingSource)
          The real work to do a checkpoint.
 void envConfigUpdate(DbConfigManager cm, EnvironmentMutableConfig ignore)
          Process notifications of mutable property changes.
static long getWakeupPeriod(DbConfigManager configManager)
          Figure out the wakeup period.
 void initIntervals(long lastCheckpointEnd, long lastCheckpointMillis)
          Initializes the checkpoint intervals when no checkpoint is performed while opening the environment.
 StatGroup loadStats(StatsConfig config)
          Load stats.
protected  long nDeadlockRetries()
          Return the number of retries when a deadlock exception occurs.
protected  void onWakeup()
          Called whenever the DaemonThread wakes up from a sleep.
static void setBeforeFlushHook(TestHook hook)
           
 void setCheckpointId(long lastCheckpointId)
          Set checkpoint id -- can only be done after recovery.
static void setMaxFlushLevelHook(TestHook hook)
           
 void syncDatabase(EnvironmentImpl envImpl, DatabaseImpl dbImpl, boolean flushLog)
          Flush a given database to disk.
 void wakeupAfterWrite()
          Wakes up the checkpointer if a checkpoint log interval is configured and the number of bytes written since the last checkpoint exceeds the size of the interval.
 
Methods inherited from class com.sleepycat.je.utilint.DaemonThread
checkErrorListener, createLogger, getExceptionListener, getNWakeupRequests, getThread, isPaused, isRunning, isShutdownRequested, requestShutdown, run, runOrPause, setExceptionListener, shutdown, toString, wakeup
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

maxFlushLevelHook

public static TestHook maxFlushLevelHook
For unit testing only. Called before we flush the max level. This field is static because it is called from the static flushIN method.


beforeFlushHook

public static TestHook beforeFlushHook

examineINForCheckpointHook

public static TestHook<IN> examineINForCheckpointHook
Constructor Detail

Checkpointer

public Checkpointer(EnvironmentImpl envImpl,
                    long waitTime,
                    String name)
Method Detail

envConfigUpdate

public void envConfigUpdate(DbConfigManager cm,
                            EnvironmentMutableConfig ignore)
Process notifications of mutable property changes.

Specified by:
envConfigUpdate in interface EnvConfigObserver

initIntervals

public void initIntervals(long lastCheckpointEnd,
                          long lastCheckpointMillis)
Initializes the checkpoint intervals when no checkpoint is performed while opening the environment.


coordinateEvictionWithCheckpoint

public boolean coordinateEvictionWithCheckpoint(IN target,
                                                IN parent)
Coordinates an eviction with an in-progress checkpoint and returns whether provisional logging is needed.

Returns:
true if the target must be logged provisionally.
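
The per-state rules from the class description ("Coordination of Eviction and Checkpointing") can be condensed into one decision function. This is a minimal sketch, assuming hypothetical names and plain integer levels. The real method takes the target and parent IN, and in the DIRTY_MAP_INCOMPLETE state it also adds the parent to the dirty map, which the sketch leaves out.

```java
// Illustrative sketch only: hypothetical types, not the JE implementation.
final class EvictionCoordinationSketch {

    /** Checkpoint states named in the class description. */
    enum CkptState { NONE, DIRTY_MAP_INCOMPLETE, DIRTY_MAP_COMPLETE }

    /**
     * Returns true if an evicted IN should be logged with Provisional.YES
     * and false if it should be logged with Provisional.NO.
     * Omits the side effect of adding the parent IN to the dirty map
     * while the state is DIRTY_MAP_INCOMPLETE.
     */
    static boolean logProvisionally(CkptState state,
                                    int inLevel,
                                    int highestFlushLevel,
                                    boolean isRoot) {
        switch (state) {
            case NONE:
                // No checkpoint is active.
                return false;
            case DIRTY_MAP_INCOMPLETE:
                // Only INs below the highest flush level are provisional.
                return inLevel < highestFlushLevel;
            case DIRTY_MAP_COMPLETE:
                // Everything except the root is provisional.
                return !isRoot;
            default:
                throw new AssertionError("unknown state: " + state);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            logProvisionally(CkptState.NONE, 1, 3, false));
        System.out.println(
            logProvisionally(CkptState.DIRTY_MAP_INCOMPLETE, 1, 3, false));
        System.out.println(
            logProvisionally(CkptState.DIRTY_MAP_COMPLETE, 3, 3, true));
    }
}
```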

getWakeupPeriod

public static long getWakeupPeriod(DbConfigManager configManager)
                            throws IllegalArgumentException
Figure out the wakeup period. Supplied through this static method because we need to pass the wakeup period to the superclass and need to do the calculation outside the constructor.

Throws:
IllegalArgumentException - via Environment ctor and setMutableConfig.

setCheckpointId

public void setCheckpointId(long lastCheckpointId)
Set checkpoint id -- can only be done after recovery.


loadStats

public StatGroup loadStats(StatsConfig config)
Load stats.


clearEnv

public void clearEnv()

nDeadlockRetries

protected long nDeadlockRetries()
Return the number of retries when a deadlock exception occurs.

Overrides:
nDeadlockRetries in class DaemonThread

onWakeup

protected void onWakeup()
                 throws DatabaseException
Called whenever the DaemonThread wakes up from a sleep.

Specified by:
onWakeup in class DaemonThread
Throws:
DatabaseException

wakeupAfterWrite

public void wakeupAfterWrite()
Wakes up the checkpointer if a checkpoint log interval is configured and the number of bytes written since the last checkpoint exceeds the size of the interval.
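
The condition described above can be sketched as follows. The class, its fields, and the convention that an interval of zero means "not configured" are assumptions made for illustration. The page does not show how Checkpointer tracks these values.

```java
// Illustrative sketch only: hypothetical class, not the JE implementation.
final class WakeupAfterWriteSketch {

    /** Checkpoint log interval in bytes; 0 is assumed to mean "not configured". */
    private final long bytesInterval;

    /** Total bytes written to the log as of the last checkpoint. */
    private long bytesAtLastCheckpoint;

    WakeupAfterWriteSketch(long bytesInterval) {
        this.bytesInterval = bytesInterval;
    }

    /** Returns true if the checkpointer should be woken after a write. */
    boolean shouldWakeup(long totalBytesWritten) {
        if (bytesInterval <= 0) {
            return false; // no log interval configured
        }
        // "Exceeds" is read as strictly greater than the interval.
        return totalBytesWritten - bytesAtLastCheckpoint > bytesInterval;
    }

    /** Records the log size at the end of a checkpoint. */
    void checkpointCompleted(long totalBytesWritten) {
        bytesAtLastCheckpoint = totalBytesWritten;
    }

    public static void main(String[] args) {
        WakeupAfterWriteSketch sketch = new WakeupAfterWriteSketch(1000);
        System.out.println(sketch.shouldWakeup(1000)); // not yet exceeded
        System.out.println(sketch.shouldWakeup(1001)); // exceeded
        sketch.checkpointCompleted(1001);
        System.out.println(sketch.shouldWakeup(1500)); // counter was reset
    }
}
```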


doCheckpoint

public void doCheckpoint(CheckpointConfig config,
                         boolean flushAll,
                         String invokingSource)
                  throws DatabaseException
The real work to do a checkpoint. This may be called by the checkpoint thread when waking up, or it may be invoked programmatically through the API.

Parameters:
flushAll - if true, this checkpoint must flush all the way to the top of the dbtree, instead of stopping at the highest level last modified.
invokingSource - a debug aid, to indicate who invoked this checkpoint (e.g. recovery, the checkpointer daemon, the cleaner, or a programmatic call).
Throws:
DatabaseException
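
Step 5 of the checkpoint algorithm in the class description assigns the Provisional property by level. A minimal sketch of that rule follows, assuming hypothetical names and small integer levels where 1 is the bottom (BIN) level. Giving the max flush level precedence when it coincides with the bottom level is an assumption taken from the order in which the description lists the rules.

```java
// Illustrative sketch only: hypothetical names, not the JE implementation.
final class CheckpointProvisionalSketch {

    /** Local stand-in for the three Provisional values. */
    enum Provisional { NO, YES, BEFORE_CKPT_END }

    /**
     * Chooses the Provisional value for an IN logged by the checkpoint.
     * The max flush level is checked first (assumed precedence).
     */
    static Provisional forLevel(int level, int bottomLevel, int maxFlushLevel) {
        if (level == maxFlushLevel) {
            return Provisional.NO;
        }
        if (level == bottomLevel) {
            return Provisional.YES;
        }
        return Provisional.BEFORE_CKPT_END; // middle levels
    }

    public static void main(String[] args) {
        int bottomLevel = 1;
        int maxFlushLevel = 4;
        // Flush in ascending order, from the bottom of the tree up.
        for (int level = bottomLevel; level <= maxFlushLevel; level++) {
            System.out.println("level " + level + ": "
                + forLevel(level, bottomLevel, maxFlushLevel));
        }
    }
}
```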

syncDatabase

public void syncDatabase(EnvironmentImpl envImpl,
                         DatabaseImpl dbImpl,
                         boolean flushLog)
                  throws DatabaseException
Flush a given database to disk. Like checkpoint, log from the bottom up so that parents properly represent their children.

Throws:
DatabaseException
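
Bottom-up logging means that every child is written before the parent that references it. The sketch below shows one way to derive such an order by grouping dirty nodes by Btree level, in the spirit of the level-organized dirty map from the class description. The Node type and method names are hypothetical, and the sketch does not reflect how syncDatabase is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch only: hypothetical types, not the JE implementation.
final class BottomUpFlushSketch {

    /** Hypothetical stand-in for a dirty internal node. */
    static final class Node {
        final String name;
        final int level; // 1 is assumed to be the bottom (BIN) level

        Node(String name, int level) {
            this.name = name;
            this.level = level;
        }
    }

    /** Returns node names in flush order: lowest level first. */
    static List<String> flushOrder(List<Node> dirtyNodes) {
        // TreeMap iterates its keys (levels) in ascending order.
        TreeMap<Integer, List<String>> byLevel = new TreeMap<>();
        for (Node node : dirtyNodes) {
            byLevel.computeIfAbsent(node.level, k -> new ArrayList<>())
                   .add(node.name);
        }
        List<String> order = new ArrayList<>();
        for (List<String> names : byLevel.values()) {
            order.addAll(names);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Node> dirty = Arrays.asList(
            new Node("root", 3),
            new Node("BIN-A", 1),
            new Node("IN-B", 2),
            new Node("BIN-C", 1));
        System.out.println(flushOrder(dirty));
    }
}
```

Nodes on the same level keep their encounter order, so the result is deterministic for a given input list.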

setMaxFlushLevelHook

public static void setMaxFlushLevelHook(TestHook hook)

setBeforeFlushHook

public static void setBeforeFlushHook(TestHook hook)


Copyright (c) 2004-2010 Oracle. All rights reserved.