In the previous post we reviewed the essence of SCTP - the user data transfer. Now it's time to look at some failure detection procedures and, finally, at how an SCTP association is closed. Both topics are discussed in this post. Error detection procedures are specified in Section 8, titled 'Fault Management', but in this post I will mainly discuss heartbeating. The other failure detection mechanisms are described only briefly, so if you wish to know more about a specific mechanism, read the corresponding section of the specification (hyperlinks are provided in each section). Association teardown is specified in Section 9 - 'Termination of Association'.
Remember that SCTP has a feature called multi-homing? I mentioned it in the post about association initialization. In a nutshell, it lets each endpoint in the association use more than one IP address. This option makes error handling in SCTP a bit more involved. It has two main aspects - endpoint and path failure detection. If this topic interests you, you can review Section 8 when you finish with this post.
Endpoint Failure Detection
Each endpoint keeps a count of the consecutive retransmissions to its peer. If the count exceeds the 'Association.Max.Retrans' parameter, the endpoint considers its peer unreachable and the association has to be closed. For more information check Section 8.1.
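To make the counting rule concrete, here is a minimal Python sketch. The class and method names are mine, not taken from any real SCTP stack. The default of 10 is the value RFC 4960 suggests for 'Association.Max.Retrans', and I assume the counter is reset whenever the peer acknowledges something (that is what makes the retransmissions "consecutive").

```python
from dataclasses import dataclass

# Suggested default for Association.Max.Retrans in RFC 4960.
ASSOCIATION_MAX_RETRANS = 10


@dataclass
class EndpointErrorCounter:
    """Hypothetical helper: counts consecutive retransmissions to the peer."""

    max_retrans: int = ASSOCIATION_MAX_RETRANS
    errors: int = 0
    closed: bool = False

    def on_retransmission(self) -> None:
        # Every retransmission towards the peer bumps the counter.
        self.errors += 1
        if self.errors > self.max_retrans:
            # Peer is considered unreachable: the association is closed.
            self.closed = True

    def on_ack(self) -> None:
        # Assumption: any acknowledgement from the peer resets the counter.
        self.errors = 0
```

Note that the association is closed only when the counter exceeds the limit, so with a limit of 10 it takes 11 consecutive retransmissions.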
Path Failure Detection
In a multi-homing scenario, each endpoint keeps an error count for each path to its peer. When the error count exceeds the 'Path.Max.Retrans' parameter value, the endpoint should consider the destination address inactive. The SCTP stack should also notify the user about the unavailability of the peer's address. For more details about how errors are detected, implementation notes and configuration tips, check Section 8.2, which covers the path failure detection.
Once the SCTP association is established, the protocol stack is supposed to monitor each idle IP address of its peer. If there is no traffic from/to an address, the SCTP stack should send a HEARTBEAT chunk to it. The receiver should respond with a HEARTBEAT ACK chunk. This operation is performed only while the association is established. Section 8.3 describes the path heartbeat in detail.
You can see a sample HEARTBEAT chunk in figure 1. It is described in Section 3.3.5. Its chunk type is 4 and it has one variable-length parameter - heartbeat information. It contains sender-specific information, usually the time when the chunk is sent. The receiver doesn't need to understand this information - it just sends it back in the HEARTBEAT ACK chunk.
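The per-path bookkeeping could look roughly like the sketch below. Again, `PathMonitor` and its methods are names I made up for illustration. The default of 5 is the value RFC 4960 suggests for 'Path.Max.Retrans', and the `notify` callback stands in for whatever mechanism a real stack uses to inform the user.

```python
from typing import Callable, Dict, Iterable

# Suggested default for Path.Max.Retrans in RFC 4960.
PATH_MAX_RETRANS = 5


class PathMonitor:
    """Hypothetical helper: one error counter per peer address."""

    def __init__(self, addresses: Iterable[str],
                 notify: Callable[[str], None],
                 max_retrans: int = PATH_MAX_RETRANS) -> None:
        self.max_retrans = max_retrans
        self.notify = notify
        self.errors: Dict[str, int] = {addr: 0 for addr in addresses}
        self.active: Dict[str, bool] = {addr: True for addr in self.errors}

    def on_error(self, addr: str) -> None:
        # Called on a retransmission timeout or a missed heartbeat on this path.
        self.errors[addr] += 1
        if self.active[addr] and self.errors[addr] > self.max_retrans:
            self.active[addr] = False
            self.notify(addr)  # tell the user the address is unavailable

    def on_success(self, addr: str) -> None:
        # Assumption: an acknowledgement over this path clears its counter.
        self.errors[addr] = 0
        self.active[addr] = True
```

The user is notified only once per failure, at the moment the address flips from active to inactive.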
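As a rough illustration of "monitor each idle address", the function below picks the peer addresses that have been silent for at least one heartbeat interval. This is a deliberate simplification: the function name and the `last_activity` dictionary are hypothetical, the 30 seconds is the 'HB.interval' value suggested by RFC 4960, and the real timer described in Section 8.3 also takes the path RTO and some jitter into account.

```python
from typing import Dict, List

# Suggested default for HB.interval in RFC 4960, in seconds.
HB_INTERVAL = 30.0


def addresses_needing_heartbeat(last_activity: Dict[str, float],
                                now: float,
                                hb_interval: float = HB_INTERVAL) -> List[str]:
    """Return peer addresses with no traffic for at least hb_interval seconds.

    last_activity maps a peer address to the time (in seconds) of the last
    chunk sent to or received from it.
    """
    return sorted(addr for addr, seen in last_activity.items()
                  if now - seen >= hb_interval)
```

A stack would call something like this periodically and send one HEARTBEAT chunk to every address returned.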
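If you prefer bytes to pictures, here is a small sketch that encodes a HEARTBEAT chunk with Python's `struct` module. The layout follows Section 3.3.5: a 4-byte chunk header followed by one Heartbeat Info parameter (parameter type 1). The function name is mine, and the content of the info field is entirely up to the sender; packing a timestamp as an 8-byte double is just my assumption.

```python
import struct
import time

HEARTBEAT = 4             # chunk type
HEARTBEAT_INFO_PARAM = 1  # parameter type of Heartbeat Info


def build_heartbeat(info: bytes) -> bytes:
    """Encode a HEARTBEAT chunk carrying opaque sender-specific info."""
    # Parameter: type (2 bytes), length (2 bytes, includes this header), value.
    param = struct.pack("!HH", HEARTBEAT_INFO_PARAM, 4 + len(info)) + info
    # Chunk: type (1), flags (1), length (2, includes header, excludes padding).
    chunk = struct.pack("!BBH", HEARTBEAT, 0, 4 + len(param)) + param
    # Chunks are padded with zeros to a multiple of 4 bytes on the wire.
    return chunk + b"\x00" * (-len(chunk) % 4)


def build_timestamp_heartbeat() -> bytes:
    # Assumption: the sender stores the send time as a network-order double.
    return build_heartbeat(struct.pack("!d", time.time()))
```

Because the receiver echoes the info back untouched, the sender can later subtract the echoed timestamp from the current time to get a round-trip measurement.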
HEARTBEAT ACK chunk
A sample HEARTBEAT ACK chunk is shown in figure 2. It is specified in Section 3.3.6. It also has only one variable-length parameter - the heartbeat information received in the HEARTBEAT chunk.
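The receiver's job is simple enough to sketch in a few lines: copy the Heartbeat Info parameter verbatim and change the chunk type to 5 (HEARTBEAT ACK). The function name is hypothetical and the input is assumed to be a single encoded HEARTBEAT chunk, not a whole SCTP packet.

```python
import struct

HEARTBEAT = 4
HEARTBEAT_ACK = 5


def heartbeat_ack_for(heartbeat: bytes) -> bytes:
    """Build the HEARTBEAT ACK that echoes the info of a HEARTBEAT chunk."""
    chunk_type, _flags, length = struct.unpack("!BBH", heartbeat[:4])
    if chunk_type != HEARTBEAT:
        raise ValueError("not a HEARTBEAT chunk")
    if length < 4 or length > len(heartbeat):
        raise ValueError("bad chunk length")
    # Everything after the chunk header is echoed without being interpreted.
    body = heartbeat[4:length]
    ack = struct.pack("!BBH", HEARTBEAT_ACK, 0, length) + body
    return ack + b"\x00" * (-len(ack) % 4)
```

Notice that the receiver never looks inside the parameter, which is exactly why the sender is free to put anything it likes in there.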
The whole trace
The chunks above were extracted from a dummy client-server program, which sends some random data over the wire. You can download the whole trace from here.
An association is terminated when it's no longer needed or when a critical error occurs. There are two ways to achieve this - via shutdown or via abort. The first one is considered graceful termination: any pending data is transmitted and both peers consider the association terminated. The latter is an ungraceful termination and all unsent user data is discarded.
An association is aborted with an ABORT chunk. DATA chunks can't be bundled with it. The receiver should check if the verification tag in the common header matches its own tag. If yes - the chunk is accepted, the association is destroyed and the upper layer is informed. In figure 3 you can see an ABORT chunk. It is described in Section 3.3.7 of the specification. The chunk may carry zero or more error causes, specifying the reason for the abort. The error cause in the ABORT chunk in figure 3 is 0x000c, which means 'User-Initiated Abort'. All possible values are listed in Section 3.3.10.
The ABORT chunk uses one bit of the chunk flags - the T bit. It is the last (8th, least significant) bit. Setting it to zero means that the sender has filled in the verification tag expected by the peer in the common header; setting it to one means that the tag was reflected from a received packet. We already discussed in a nutshell how this value is processed. If you want to learn more about this, check Section 8.5.1.
You can download the trace containing the ABORT chunk from figure 3.
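Here is a hedged sketch of how the error causes can be pulled out of an ABORT chunk. ABORT has chunk type 6, and each error cause is encoded as a 2-byte cause code, a 2-byte length (covering the code and length fields too) and cause-specific data, padded to 4 bytes. The function name is mine and the input is assumed to be one encoded chunk.

```python
import struct
from typing import List, Tuple

ABORT = 6
USER_INITIATED_ABORT = 0x000C


def parse_abort(chunk: bytes) -> Tuple[int, List[Tuple[int, bytes]]]:
    """Return (chunk flags, [(cause code, cause-specific data), ...])."""
    chunk_type, flags, length = struct.unpack("!BBH", chunk[:4])
    if chunk_type != ABORT:
        raise ValueError("not an ABORT chunk")
    causes = []
    offset = 4
    while offset + 4 <= length:
        code, cause_len = struct.unpack("!HH", chunk[offset:offset + 4])
        if cause_len < 4:
            raise ValueError("malformed error cause")
        causes.append((code, chunk[offset + 4:offset + cause_len]))
        # Skip the cause plus its padding to the next 4-byte boundary.
        offset += cause_len + (-cause_len % 4)
    return flags, causes
```

For example, `parse_abort(bytes.fromhex("06000008000c0004"))` describes an ABORT with a single User-Initiated Abort cause and no upper layer reason attached.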
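The acceptance rule from Section 8.5.1 fits in one small function. This is only a sketch with names of my own choosing: `my_vtag` is the tag the receiver expects in incoming packets and `peer_vtag` is the tag it puts in packets it sends to its peer.

```python
T_BIT = 0x01  # least significant bit of the chunk flags


def accept_abort(flags: int, packet_vtag: int,
                 my_vtag: int, peer_vtag: int) -> bool:
    """Decide whether a received ABORT should be accepted."""
    if flags & T_BIT:
        # T bit set: the sender reflected a tag, so it must be the peer's tag.
        return packet_vtag == peer_vtag
    # T bit clear: the sender filled in the tag we expect to receive.
    return packet_vtag == my_vtag
```

If the function returns false, the packet is silently discarded and the association stays as it is.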
Association shutdown is initiated by one of the SCTP users. This is an indication that the association is not required anymore and that it needs to be released gracefully. The procedure is shown in figure 4. We will review the simplified, happy case of the shutdown. We won't cover things like timeouts or unexpected messages. If this is important for you, please check Section 9.2, which covers the association shutdown procedure in detail.
In figure 4 we assume that the association is already established and is currently in the data transfer phase. At this moment, the user on host A wishes to release the association and makes a shutdown request to the SCTP stack. Stack A enters the SHUTDOWN-PENDING state and stops accepting data from the user. The endpoint remains in this state until all pending DATA chunks (accepted before the shutdown initiation) are sent and acknowledged. Then A enters the SHUTDOWN-SENT state and sends a SHUTDOWN chunk to B. This chunk includes the last sequential TSN it has received so far from B. In this state A may still receive DATA chunks from B and it should act accordingly by confirming received TSNs and reporting gaps/duplicates. The purpose is to let B send all the data it accepted from its user before the SHUTDOWN chunk was received.
As soon as B receives the SHUTDOWN chunk, it enters the SHUTDOWN-RECEIVED state and stops accepting data from its user. B also checks the TSN received in the SHUTDOWN chunk and verifies that all its outstanding data has been received by A. In case there are unsent/unreceived DATA chunks, B continues its normal operation until all the data is sent. When the data is transmitted, B sends a SHUTDOWN ACK chunk to A and enters the SHUTDOWN-ACK-SENT state. This is a confirmation for A that the association can be released. At this point A clears all records about the association and sends SHUTDOWN COMPLETE. When B receives the chunk, it also clears the association.
Once again, please note that this is a really simple version of the shutdown procedure. The above description is good enough to understand how association termination works from the user's point of view. However, if you are developing a custom SCTP stack (for example), make sure that you have read and understood Section 9.2 well.
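The happy path described above can be written down as a tiny state table. This is a sketch of the simplified flow only (no timers, no retransmissions, no unexpected chunks). The event names are invented for illustration, and SHUTDOWN-SENT is the name RFC 4960 gives to the state A is in after it has sent the SHUTDOWN chunk.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class State(Enum):
    ESTABLISHED = "ESTABLISHED"
    SHUTDOWN_PENDING = "SHUTDOWN-PENDING"
    SHUTDOWN_SENT = "SHUTDOWN-SENT"
    SHUTDOWN_RECEIVED = "SHUTDOWN-RECEIVED"
    SHUTDOWN_ACK_SENT = "SHUTDOWN-ACK-SENT"
    CLOSED = "CLOSED"


# (current state, event) -> (next state, chunk to send or None)
TRANSITIONS: Dict[Tuple[State, str], Tuple[State, Optional[str]]] = {
    # Initiator (host A)
    (State.ESTABLISHED, "user_shutdown"): (State.SHUTDOWN_PENDING, None),
    (State.SHUTDOWN_PENDING, "data_flushed"): (State.SHUTDOWN_SENT, "SHUTDOWN"),
    (State.SHUTDOWN_SENT, "recv_shutdown_ack"): (State.CLOSED, "SHUTDOWN COMPLETE"),
    # Responder (host B)
    (State.ESTABLISHED, "recv_shutdown"): (State.SHUTDOWN_RECEIVED, None),
    (State.SHUTDOWN_RECEIVED, "data_flushed"): (State.SHUTDOWN_ACK_SENT, "SHUTDOWN ACK"),
    (State.SHUTDOWN_ACK_SENT, "recv_shutdown_complete"): (State.CLOSED, None),
}


def run(events: List[str],
        state: State = State.ESTABLISHED) -> Tuple[State, List[str]]:
    """Feed events to one endpoint; return its final state and chunks sent."""
    sent: List[str] = []
    for event in events:
        state, chunk = TRANSITIONS[(state, event)]
        if chunk is not None:
            sent.append(chunk)
    return state, sent
```

Feeding `["user_shutdown", "data_flushed", "recv_shutdown_ack"]` to `run` replays host A's side of the procedure, and `["recv_shutdown", "data_flushed", "recv_shutdown_complete"]` replays host B's side. Any event not in the table raises `KeyError`, which is my lazy way of saying "not covered by the simplified version".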
Now let's have a look at SHUTDOWN, SHUTDOWN ACK and SHUTDOWN COMPLETE chunks.
There is a sample SHUTDOWN chunk in figure 5. It has only one parameter - Cumulative TSN Ack. As discussed in the previous section, it contains the TSN of the last DATA chunk received without any gaps. The SHUTDOWN chunk is specified in Section 3.3.8.
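Since the chunk has a fixed size, encoding and decoding it is a one-liner each. The sketch below follows the layout from Section 3.3.8 (chunk type 7, length always 8, then a 32-bit Cumulative TSN Ack); the function names are mine.

```python
import struct

SHUTDOWN = 7
SHUTDOWN_LENGTH = 8  # 4-byte chunk header + 4-byte Cumulative TSN Ack


def build_shutdown(cumulative_tsn_ack: int) -> bytes:
    """Encode a SHUTDOWN chunk."""
    return struct.pack("!BBHI", SHUTDOWN, 0, SHUTDOWN_LENGTH, cumulative_tsn_ack)


def parse_shutdown(chunk: bytes) -> int:
    """Return the Cumulative TSN Ack carried by a SHUTDOWN chunk."""
    chunk_type, _flags, length, tsn = struct.unpack("!BBHI", chunk[:8])
    if chunk_type != SHUTDOWN or length != SHUTDOWN_LENGTH:
        raise ValueError("not a valid SHUTDOWN chunk")
    return tsn
```

The receiver compares the returned TSN with what it has sent so far to find out whether the peer is still missing any of its data.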
SHUTDOWN ACK and SHUTDOWN COMPLETE chunks
Neither chunk has any parameters. We already covered what they are used for, so I'll only provide samples here. Figure 6 shows a SHUTDOWN ACK chunk, which is specified in Section 3.3.9. A SHUTDOWN COMPLETE chunk is shown in figure 7 and you can find its specification in Section 3.3.13.
In these four posts we have covered some essential SCTP terminology, how messages are encoded, how associations are created and destroyed and how data transfer works. I hope you have enjoyed reading and have learned something new. There are topics that I've skipped intentionally. Most of them are not useful for the regular SCTP user. One exception is multi-homing. This is one of the killer SCTP features, which I haven't covered, and I recommend that you read about it yourself. Section 6.4 should be a good start, but be warned that multi-homing is scattered throughout the specification. I plan to write a dedicated post (or posts) about it in the future, but for now you are on your own.
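With no parameters, both chunks boil down to a bare 4-byte chunk header. A minimal sketch, with function names of my own: SHUTDOWN ACK has chunk type 8 and SHUTDOWN COMPLETE has chunk type 14, and I leave the chunk flags at zero here, which is the case when the sender fills in the verification tag its peer expects.

```python
import struct

SHUTDOWN_ACK = 8
SHUTDOWN_COMPLETE = 14


def build_shutdown_ack() -> bytes:
    # Type, flags and a length of 4: the header is the whole chunk.
    return struct.pack("!BBH", SHUTDOWN_ACK, 0, 4)


def build_shutdown_complete(flags: int = 0) -> bytes:
    # Assumption: flags default to zero (normal, non-reflected tag).
    return struct.pack("!BBH", SHUTDOWN_COMPLETE, flags, 4)
```

On the wire these are the byte sequences `08 00 00 04` and `0e 00 00 04`, each preceded by the usual SCTP common header.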
Almost all examples in these posts were extracted from a test SCTP session, created with a dummy client-server application. I have provided links to single messages all over the posts, but if you wish to have a look at the whole picture, you can get the trace from this link.