How is TCP & UDP Checksum Calculated?
When you send and receive data over the wire, the data can get corrupted or altered along the way, either accidentally or deliberately by someone with malicious intent. Whatever the cause, the receiver needs a way to easily verify whether the data was altered or corrupted.
Generally, this is what happens. The sender calculates a short checksum value (very small in size) that represents the message being sent. The value is either sent along with the message or delivered through some other channel. Once the data is received, the receiver calculates the checksum as well. If the two values (the sender’s and the receiver’s) match, the data is very likely uncorrupted and unaltered.
Cryptographers have designed hash algorithms for this purpose. Some of the popular ones are MD5 and SHA1. You might have noticed websites that provide MD5 and SHA1 hash values for the files they offer for download. You can calculate the hash value of the file you downloaded, and if it matches the one published on the website, you can be reasonably sure that the file is not corrupted or altered.
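As a rough sketch of that download check, the snippet below uses Python’s standard hashlib module. The function name matches_published_digest and the sample bytes are made up for illustration, and the "published" value is generated locally to stand in for whatever the website lists.

```python
import hashlib

def matches_published_digest(data: bytes, published_hex: str, algorithm: str = "sha1") -> bool:
    """Hash the downloaded bytes and compare with the value the site published."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == published_hex.strip().lower()

# Stand-in for a downloaded file and the hash shown on the download page.
original = b"example file contents"
published = hashlib.sha1(original).hexdigest()

print(matches_published_digest(original, published))                  # True
print(matches_published_digest(b"example file c0ntents", published))  # False
```

Changing a single character of the data produces a completely different hash, which is why the second comparison fails.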
MD5 and SHA1 are primarily used to verify the integrity of files. However, core Internet communication (TCP and UDP data verification) does not use MD5 or SHA1. The method it uses is similar in spirit to a hash algorithm, but not the same, because the need is different in each case.
Collision resistance is the most important property of a cryptographic hash algorithm. A collision is when two different sets of data produce the same hash output. Because the output has a fixed size, collisions must exist in theory, so the goal is that finding one should be practically infeasible. If two different inputs with the same hash are easy to produce, that defeats the purpose, because a matching hash would no longer tell you that the data is the same. MD5 and SHA1 were designed with this goal in mind, which is why they became popular for file integrity verification. Practical collisions have since been demonstrated for both, so they still catch accidental corruption but are no longer recommended where deliberate tampering is a concern.
Another useful property of a hash function is that it is one-way. You can’t practically recover the data if all you have is the hash value. This is why passwords are usually stored in databases in the form of hash values. When a user enters a password, the login program generates the hash value of the password and compares it with the one in the database. If both values match, the user is allowed to log in.
The TCP and UDP checksum is not as concerned with collisions as MD5 and SHA1 are. Speed and efficient detection of accidental errors matter more to the TCP and UDP checksum than collision resistance.
For this reason, the TCP and UDP checksum uses the ones’ complement method.
The ones’ complement of a binary number is the value we get when we change all 0s to 1s and all 1s to 0s. For example, the ones’ complement of 110111001010 is 001000110101.
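The bit flipping can be tried out with a few lines of Python. This is only an illustration on a string of 0s and 1s, and the helper name ones_complement is made up for this example.

```python
def ones_complement(bits: str) -> str:
    """Flip every bit in a string made of '0' and '1' characters."""
    return "".join("1" if bit == "0" else "0" for bit in bits)

print(ones_complement("110111001010"))  # 001000110101
```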
How does the method of ones’ complement work for UDP and TCP checksum Calculation?
Let’s understand this with an example. Imagine we have a UDP datagram or a TCP segment. The first thing we do is slice it into 16 bit pieces. Let’s assume we have three 16 bit words as below.
1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0
0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0
0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 0
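If we add those three 16 bit words using binary addition, we get the 16 bit result below. Strictly speaking this is ones’ complement addition: if a sum overflows 16 bits, the carry is wrapped around and added back to the lowest bit. No carry occurs in this example, so it is simple binary addition.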
1001101001010110 + 0000101110001110 + 0000110111001100 = 1011001110110000
The ones’ complement of our result 1011001110110000 is 0100110001001111 (this is our checksum). So we basically need to send our data (the three 16 bit binary numbers) along with its checksum to the receiver. The main thing to understand is that the receiver will get the data as well as the checksum we calculated.
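The sender’s side of this example can be sketched in Python as below. The function names are made up for illustration. The carry wrap-around inside the loop is not needed for these three words, but it is part of ones’ complement addition in general.

```python
def ones_complement_sum(words):
    """Add 16 bit words, wrapping any carry out of the top bit back into the low end."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

def checksum(words):
    """Ones' complement of the ones' complement sum, kept to 16 bits."""
    return ~ones_complement_sum(words) & 0xFFFF

data = [0b1001101001010110, 0b0000101110001110, 0b0000110111001100]
print(format(ones_complement_sum(data), "016b"))  # 1011001110110000
print(format(checksum(data), "016b"))             # 0100110001001111
```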
The receiver will get the following.
1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 (data)
0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0 (data)
0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 0 (data)
0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 (Checksum)
The receiver simply adds all four of the above values. The data and the checksum are added together. Let’s try adding them.
1001101001010110 + 0000101110001110 + 0000110111001100 + 0100110001001111 = 1111111111111111
If the sum of the 16 bit data and the checksum is 1111111111111111, all bits are 1s and no error was detected. If even one bit is 0, errors were introduced in the data during transit.
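Here is a small Python sketch of the receiver’s check, using the same four values. The name looks_intact is made up for this example. The second call flips a single bit in one data word to show what a transmission error does to the sum.

```python
def ones_complement_sum(words):
    """Add 16 bit words, wrapping any carry out of the top bit back into the low end."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total

def looks_intact(words_with_checksum):
    """True when data plus checksum add up to sixteen 1 bits."""
    return ones_complement_sum(words_with_checksum) == 0xFFFF

received = [
    0b1001101001010110,  # data
    0b0000101110001110,  # data
    0b0000110111001100,  # data
    0b0100110001001111,  # checksum
]
print(looks_intact(received))  # True

damaged = list(received)
damaged[1] ^= 0b0000000000000100  # flip one bit of the second data word
print(looks_intact(damaged))   # False
```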
This serves the purpose. Unlike md5 and sha1, it’s quite simple as well to calculate and verify. Computers are super cool when it comes to binary addition :). This is exactly how TCP and UDP checksum is calculated.
There are a few more things to understand about the TCP and UDP checksum calculation. The checksum is not calculated over the TCP/UDP header and data alone. It also includes several fields from the IP header. This extra data is called a pseudo header.
Let’s now see what this pseudo header is all about. The pseudo header has several fields from the IP header. In other words, the TCP and UDP (transport layer) checksum calculation requires several fields from the network layer. How is that even possible, given that the network layer is beneath the transport layer? We will discuss this shortly.
The pseudo header consists of the following fields, most of them taken from the IP header.
- Source Address (IP)
- Destination Address (IP)
- Reserved 8 bits (always zero)
- Protocol (the IP protocol number, for example 6 for TCP or 17 for UDP)
- TCP/UDP Length (16 bits, the length of the TCP or UDP header plus data)
The whole pseudo header is 12 bytes for IPv4 (32 bit source address + 32 bit destination address + 8 bit reserved + 8 bit protocol + 16 bit TCP/UDP length = 96 bits = 12 bytes). You can clearly see that most of it comes from the IP header (from the network layer), although we are discussing checksum calculation in the transport layer.
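To make the 12 byte layout concrete, here is a minimal Python sketch that packs an IPv4 pseudo header with the standard struct module. The function name build_pseudo_header is made up, and the addresses are taken from the ranges reserved for documentation, so treat all the values as placeholders.

```python
import socket
import struct

def build_pseudo_header(src_ip: str, dst_ip: str, protocol: int, length: int) -> bytes:
    """Pack source, destination, a zero byte, protocol and length in network byte order."""
    return struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),  # 32 bit source address
        socket.inet_aton(dst_ip),  # 32 bit destination address
        0,                         # 8 bit reserved, always zero
        protocol,                  # 8 bit protocol number (6 = TCP, 17 = UDP)
        length,                    # 16 bit TCP/UDP header plus data length
    )

# Placeholder values: a 20 byte TCP segment between two documentation addresses.
pseudo = build_pseudo_header("192.0.2.1", "198.51.100.2", socket.IPPROTO_TCP, 20)
print(len(pseudo))   # 12
print(pseudo.hex())  # c0000201c633640200060014
```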
This is how it works. On the sender side, when the data arrives at the transport layer, the system needs to calculate the checksum. It temporarily constructs the pseudo header. Keep the word temporary in mind. After constructing this pseudo header, it keeps it in a buffer. Then it calculates the checksum by dividing the whole thing (pseudo header, TCP header, TCP data) into 16 bit chunks and adding them up. Finally, it takes the ones’ complement of the sum, as we did earlier.
Once the checksum is calculated, the result goes to the right place, which is the checksum field of the TCP header. Once the checksum is placed inside the real TCP header, the pseudo header that was temporarily created to calculate it is discarded.
But what will be the value of the checksum field in the TCP header while the checksum itself is being calculated? We learned that the checksum calculation uses the TCP header, the data, and the pseudo header. So what is placed in the checksum field of the TCP header during this calculation?
During the calculation, the checksum field is filled with 0s. After the calculation, the real value replaces the 0s.
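The sender’s steps can be put together in a short Python sketch. It assumes IPv4 and a TCP header without options, the function names are made up, and the addresses and ports are placeholders. The checksum field sits at bytes 16 and 17 of the TCP header.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16 bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd last byte, for the calculation only
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry around
    return ~total & 0xFFFF

def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """Temporary 12 byte IPv4 pseudo header. It is never transmitted."""
    return struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                       socket.inet_aton(dst_ip), 0, socket.IPPROTO_TCP, tcp_length)

def fill_tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bytes:
    """Return the segment with its checksum field (bytes 16 and 17) filled in."""
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]  # field is 0 during the sum
    value = internet_checksum(pseudo_header(src_ip, dst_ip, len(zeroed)) + zeroed)
    return zeroed[:16] + struct.pack("!H", value) + zeroed[18:]

# Placeholder SYN segment: port 12345 to port 80, no options, no payload.
header = struct.pack("!HHIIBBHHH", 12345, 80, 0, 0, 5 << 4, 0x02, 8192, 0, 0)
filled = fill_tcp_checksum("192.0.2.1", "198.51.100.2", header)
print(filled[16:18].hex())  # 7322
```

Only the filled segment would be handed down to the IP layer. The pseudo header exists just long enough to take part in the sum.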
Also keep in mind that the pseudo header never leaves the system. It’s discarded, and it is never part of the TCP header, the IP header, or anything else that goes over the wire.
So what happens on the destination side?
Well, the exact same thing happens on the destination side. A temporary pseudo header is constructed and given to the transport layer along with the actual TCP header and data. It is temporarily prepended to the TCP/UDP segment to calculate the checksum.
If the sum of the temporary pseudo header, the TCP header, and the TCP data turns out to be all 1s, the receiving end can conclude that no corruption was detected.
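A receiver-side check might look like the Python sketch below. It assumes IPv4, and the 20 byte segment is a hand-built placeholder (a SYN from port 12345 to port 80) whose checksum field 0x7322 was worked out for the two documentation addresses used here. Since the addresses are part of the pseudo header, the same check also fails when the segment is verified against a different destination address.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """Ones' complement sum of the data taken as 16 bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd last byte, for the calculation only
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry around
    return total

def tcp_segment_is_intact(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """Rebuild the pseudo header locally and check that everything sums to all 1s."""
    pseudo = struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                         socket.inet_aton(dst_ip), 0, socket.IPPROTO_TCP, len(segment))
    return ones_complement_sum(pseudo + segment) == 0xFFFF

# Placeholder segment as received, checksum field (0x7322) included.
segment = bytes.fromhex("3039 0050 00000000 00000000 5002 2000 7322 0000")

print(tcp_segment_is_intact("192.0.2.1", "198.51.100.2", segment))  # True

damaged = segment[:4] + b"\x01" + segment[5:]  # one bit changed in transit
print(tcp_segment_is_intact("192.0.2.1", "198.51.100.2", damaged))  # False

# Same segment, checked by a host with a different address.
print(tcp_segment_is_intact("192.0.2.1", "198.51.100.3", segment))  # False
```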
Why do you need a Pseudo Header?
That’s a legitimate question, because the IP (network layer) header also has its own checksum field. So what’s the point of taking a few fields from there while calculating the TCP/UDP checksum?
If the source address or destination address does not match, the checksum will fail. Note that even if a misrouted packet somehow reaches the transport layer, there is still a secondary verification of the source and destination through the checksum. The main purpose is to give the transport layer its own confirmation that the TCP/UDP data reached the correct destination.
The main thing to understand about the pseudo header is that it never leaves the system, but it still does its job by adding another layer of verification. If you capture network traffic on your system for a few minutes, you can see the TCP/UDP checksum value in tools like Wireshark.
sudo tcpdump -vvv -s 0 -l -n port 80 -w tcp-out.pcap
The above command captures packets to or from port 80 and writes them to a file called tcp-out.pcap. This file can then be opened in tools like Wireshark for further analysis. You should be able to see the checksum under the TCP/UDP section inside Wireshark.
It’s also not always possible to divide the data evenly into 16 bit chunks (for example, the last chunk might have only 8 bits). To solve that, the last chunk is padded on the right with zeros to make it 16 bits during the checksum calculation. The padding is not transmitted.
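The padding step can be sketched in Python like this. The helper name to_16bit_words is made up for the example, and the three input bytes are arbitrary.

```python
import struct

def to_16bit_words(data: bytes):
    """Split data into 16 bit words, padding an odd last byte with a zero byte on the right."""
    if len(data) % 2:
        data += b"\x00"  # used for the calculation only, never sent
    return [word for (word,) in struct.iter_unpack("!H", data)]

print([hex(word) for word in to_16bit_words(b"\x12\x34\x56")])  # ['0x1234', '0x5600']
```

The lone last byte 0x56 becomes the word 0x5600, not 0x0056, because the zeros go after it.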