For 6.824 we are using Piazza for class communication and Q&A. Over the course of the semester, a number of good questions have been asked that may be of use to others trying to come to grips with Raft. A selection of the questions and answers are given below. These are all adapted from the questions and answers given by 6.824 students and TAs.
This post accompanies the Students’ Guide to Raft. You will probably want to read that first.
Assume we have three servers, and that S0 is elected leader. S0 receives some commands from a client and adds them to its log (say, 1:100, 1:101, and 1:102, on the form term:command). S0 is then immediately partitioned from the rest of the servers before propagating those entries to any clients.
Next, S1 is elected leader right after the partitioning of S0. It gets two commands, 2:103 and 2:104, and replicates them to S2 (and thus also commits them). Immediately after this, S1 is partitioned off, and S0 is re-connected. S2 would now be elected leader, since logs on S0 are not up to date.
S0 would learn the commit index of S2 is 2 from its
AppendEntries
, but it should not commit any command from its log, because its log doesn’t match that of S2. If there is no new command from clients, when should we erase all conflicting log entries on S0?
You’re right that S0 will learn the new leaderCommit
from S2’s
heartbeat. But before any follower updates its commitIndex
to match
the leader’s, what does it need to do? Take another look at the order of
instructions in the receiver implementation for the AppendEntries
RPC
in Figure 2. It’s very specific about when a follower’s conflicting log
entries should be erased.
In figure 8 of the raft paper: (d) seems bad because [2] has been committed, but gets rolled back. What are they saying prevents (d) from happening? What prevents S5 from being elected in step (d)?
[2] can’t be committed in term 4, because Raft never explicitly commits old entries (only implicitly by the Log Matching Property as explained in section 5.4.2). So in step (c), S1 would only be able to commit 2 if it were also able to commit 4, which would then exclude S5 from being an eligible candidate.
When is current term supposed to be incremented? If my Raft instances are left idle (i.e., no client commands), they do not eventually agree on the current term. I’m incrementing the
currentTerm
field whenever I start an election, but I only do it for the candidate (sending out the votes). Is this correct, or should I be incrementingcurrentTerm
for all the servers I send the request to?
You’re right that you should increment currentTerm
when the current
term times out and you start a new election. However, followers also
have to update their terms at some point, or else you’ll end up with
servers agreeing on the same leader, but for different terms. Just
incrementing on all servers that you send the request to won’t work,
because the voting servers might already have different terms, and might
end up incrementing twice if they receive the same two RequestVotes
for the same term, etc.
To figure out when and how you should update currentTerm
(on all
servers), see the rule for all servers in Figure 2 of the Raft paper:
“If RPC request or response contains term T > currentTerm
: set
currentTerm = T
, convert to follower.”
I’m quite confused about the difference between the
RequestVote
RPC arguments and theAppendEntries
RPC arguments.RequestVote
haslastLogIndex/Term
, whileAppendEntries
hasprevLogIndex/Term
. Are these equivalent?I’m trying to mentally think of an example: If I have entries 0,1,2,3,4 in my log, then I’m assuming my
lastLogIndex
would be 4. But wouldprevLogIndex
be the same thing? Does this differ for each follower you send it to? (i.e., do we usenextIndex[]
/matchIndex[]
to help determineprevLogIndex
?)
When a candidate sends a RequestVote
RPC, the lastLogIndex
should be
the index of its last log entry (so 5 in your example).
For AppendEntries
, the prevLogIndex/Term
should refer to the log entry
immediately preceding the first element of the entries[]
field of the
RPC arguments in the leader’s log. Suppose the leader has log entries
0,1,2,3,4,5,6,7,8 and nextIndex[i]
is 6 for some follower i
. The
leader wants to send entries 6,7,8 to that follower. The leader would
copy 6,7,8 to entries[]
, and set prevLogIndex
to 5.
What exactly is meant by “volatile state” in the Raft paper? Is this data lost if the server storing it crashes? If so, why are
commitIndex
andlastApplied
volatile? Shouldn’t they be persistent?
Yes, “volatile” means it is lost if there’s a crash.
commitIndex
is volatile because Raft can figure out a correct value
for it after a reboot using just the persistent state. Once a leader
successfully gets a new log entry committed, it knows everything before
that point is also committed. A follower that crashes and comes back up
will be told about the right commitIndex
whenever the current leader
sends it an AppendEntries
RPC.
lastApplied
starts at zero after a reboot because the Figure 2 design
assumes the service (e.g., a key/value database) doesn’t keep any
persistent state. Thus its state needs to be completely recreated by
replaying all log entries. If the service does keep persistent state, it
is expected to persistently remember how far in the log it has executed,
and to ignore entries before that point. Either way it’s safe to start
with lastApplied = 0
after a reboot.
According to Figure 2, persistent state is saved before responding to RPCs, but a leader never receive RPC from other raft servers when it’s working normally. This means if we only save persistent state when receiving
AppendEntries
orRequestVote
, a leader never gets a chance to store persistent state, which is kind of weird… Or do the RPCs including ones called by clients?
For simplicity, you should save Raft’s persistent state just after any
change to that state. The most important thing is that you save
persistent state before you make it possible for anything else to
observe the new state, i.e., before you send an RPC, reply to an RPC,
return from Start()
, or apply a command to the state machine.
If a server changes persistent state, but then crashes before it gets
the chance to save it, that’s fine – it’s as if the crash happened
before the state was changed. However, if the server changes persistent
state, makes it visible, and then crashes before it saves it, that’s
not fine – forgetting that persistent state may cause it to violate
protocol invariants (for example, it could vote for two different
candidates in the same term if it forgot votedFor
).
Figure 2 says “if an existing entry conflicts with a new one”, delete it and everything that follows. But inside Section 5.3, it suggests we should find the latest log entry where the two logs agree.
What is the difference between Bullet point #2 and #3? Right now, I’m basically checking the value at
prevLogIndex+1
, and seeing if it’s equal to the leader’s term. If it isn’t, I delete it and everything following it. But based on 5.3, should I actually be going through the entire log from the end and checking if they agree?
I think you are just mixing up the roles of the follower and the leader
in this case. The leader is essentially probing the follower’s log to
find the last point where the two agree. This is what the nextIndex
variable is used for. The follower helps the leader do this by a)
rejecting any AppendEntries
RPCs that doesn’t immediately follow a
point where the two agree (this is #2), and b) overwriting any following
entries in its log once #2 is satisfied (this is #3).
To put it simply, #2 makes sure that the entries before the ones
contained in the AppendEntries
RPC from the leader match on the leader
and the follower. #3 ensures that the entries in the follower’s log
following the prefix the leader and follower agree about are the same as
the entries the leader holds in its log.
I’m a little confused; what the difference is between a log entry that is “applied” and one that is “committed”. Are all applied entries committed, but not the other way around? When does a committed entry become applied?
Any log entry that you have applied to the application state machine is “applied”. An entry should never be applied unless it has already been committed. An entry can be committed, but not yet applied. You will likely apply committed entries very soon after they become committed.
Why is it necessary to check if log[N].term == currentTerm? In the “Leaders” section of the “Rules for Servers” part of Figure 2, why is it necessary to only update
matchIndex
iflog[N].term == currentTerm
? What should happen iflog[N].term != currentTerm
?
Raft leaders can’t be sure an entry is actually committed (and will not ever be changed in the future) if it’s not from their current term, which Figure 8 from the paper illustrates. One way to think about it is that a follower only shows their “allegiance” to leader A by replicating a log record from A’s current term. If they haven’t, and a follower has only replicated a log entry from an earlier term, then another candidate B can come along with a conflicting entry in their log (same index but higher term) and “steal” the votes from a majority of such followers: B’s log is more-up-to-date than the followers by virtue of having a higher-termed entry in the last spot, so the followers have to vote for it. Then B, now a leader, overwrites that original log record with their own higher-termed one.
Note that this can’t happen if the follower replicates a log entry from A’s current term. In this case, since A’s current term is the highest at which a command was issued, a candidate’s log can only be more up-to-date than the follower’s if it includes that log entry, so any new candidate this follower votes for will never overwrite it.
In
AppendEntries
, if theprevLogTerm/Index
matches, I get rid of the log after theprevlogIndex
, and just append the entries from the RPC arguments:rf.Log = rf.Log[:args.PrevLogIndex + 1] rf.Log = append(rf.Log, args.Entries ...)
However, what if the
AppendEntries
RPCs are received out-of-order?Say that there are 3 machines, and S1 is leader. All this is in term 1.
S1: [C1] S2: [C1] S3: [C1]
S1 receives requests C2-5, making the logs:
S1: [C1,C2,C3,C4,C5] S2: [C1] S3: [C1]
S1 sends out
AppendEntries
RPCs with:{ prevLogIndex: 1, prevLogTerm: 1, entries: [C2, C3], }
S1 sends out
AppendEntries
RPCs with:{ prevLogIndex: 1, prevLogTerm: 1, entries: [C2, C3, C4, C5], }
This could happen if there’s a pause between leader receiving C3 and C4, and the other servers haven’t responded to the first RPC yet.
S1’s second
AppendEntries
RPC arrives, so we have:S2: [C1,C2,C3,C4,C5] S3: [C1,C2,C3,C4,C5]
They both respond positively.
- S1 commits up to C5, and responds back to client.
- Now, S1’s first
AppendEntries
RPC arrives at S2 and S3. S2 and S3 revert back to[C1,C2,C3]
.- S1 crashes.
- S2 can now become leader and the committed rule is broken!
Why can this not happen in Raft?
This is a very good question. The answer lies in the exact wording of
point #3 in the AppendEntries
RPC definition in Figure 2 in the paper:
“If an existing entry conflicts with a new one (same index but different
terms), delete the existing entry and all that follow it.”
Note that the rule starts with “if an existing entry conflicts”. This
is important exactly for the scenario you outline above. Here’s what
should happen + some intuition: There is a rule in Raft that the
commitIndex
can never be reduced (intuitively because we cannot
un-apply a log entry). We know that if our commitIndex
is ever
changed so that it points to somewhere in the log, all entries before
that point will never change ever again. Without the if in the
aforementioned rule from Figure 2, this rule would be violated, because,
as you point out, it could cause a follower to uncommit entries that it
has already committed.
The if is what saves us. To see why, consider what happens if
prevLogIndex
’s term matches prevLogTerm
(and the leader has an
up-to-date term of course). This means that whatever the leader sent us
must be a prefix of the “true” log (that is, what leaderCommit
applies to). This is true both for the second (“long”) AppendEntries
RPC, and the first (“short”) AppendEntries
RPC. It follows from this
that the entries in the “short” RPC must be a prefix of those in the
“long” RPC. A consequence of this is that we don’t match the if from the
rule — no existing entry conflicts with a new one, and we should
therefore not truncate our log, but simply append to it. Now, we
aren’t invalidating our old commitIndex
, and all is well in the world.