Handling socket error and Keepalive

Handling TCP socket error and TCP_KEEPALIVE

The 'Net' abounds with simple 20-line examples of TCP client behavior; however, they all assume that the TCP server exists and that crashing on an exception is helpful education for you. Trying to locate realistic examples often pushes you into complex programs that use hundreds of lines of code to do real work.

So here is a simple dummy TCP client application that runs on either a PC or a Digi product:

It creates the socket.
It sets the socket timeout to 5.0 seconds. This means every blocking request (connect, recv) raises an exception after 5.0 seconds unless it succeeds first.
It tests and enables the TCP Keepalive, which by default is OFF on most systems (Windows, Linux and Digi Python). This test is not required; it is here merely to show how it is done.
It tries to open (connect) to a fixed IP address and TCP port 2101, which either succeeds rapidly or fails within 5.0 seconds.
- "except socket.error" traps the error and causes the code to sleep 5 more seconds, then restart at socket creation.
If the socket is open, it waits up to 5 seconds for up to 6 bytes of data, and raises an exception if the socket fails or no data is received. This block has two-level error trapping:
- "except socket.timeout" traps the no-data error and loops up to try receiving again. Note that a robust program design would keep track of how frequently (or how long) this 'no data' condition continues. In many designs, a TCP socket sitting idle for more than a few minutes might be best closed.
- "except socket.error" traps any remaining error, exits the inner "while True:" loop, and restarts at socket creation.
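
The create, set-timeout, connect and retry steps above can be sketched as one small helper. This is only an illustration, not part of the original example: the function name connect_with_retry and its parameters are hypothetical, and it gives up after a fixed number of attempts instead of looping forever.

```python
import socket
import time

def connect_with_retry(host, port, attempts=3, timeout=5.0, delay=5.0):
    """Return a connected socket, or None if every attempt fails.

    Hypothetical helper for illustration only.
    """
    for _ in range(attempts):
        # use a fresh socket for each attempt, as the main example does
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock
        except socket.error:
            # socket.timeout is a subclass of socket.error, so a connect
            # that times out lands here as well as a refused connection
            sock.close()
            time.sleep(delay)
    return None
```

A caller would test the result for None before using the socket, for example sock = connect_with_retry('192.168.196.8', 2101).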

To use this example, set the IP address to that of any Digi TS/DS with TCP Sockets active. Then, by powering the Digi TS/DS up or down, you can cause connect(('x.x.x.x', 2101)) to succeed or fail. Sending simple ASCII data into the serial port of the Digi TS/DS lets you force or skip the "socket.timeout" try-except clause.

TCP Keepalive

So how does your Python code know whether no data means the TCP peer is being quiet or the TCP socket has gone away? The normal answer is that eventually a socket error will cause the "socket.error" clause to execute. However, that could easily be hours (or forever) after the socket fails.

By default sockets in your Digi Python application (as well as Windows or Linux) open with TCP keepalives turned off, and thus it is possible your application will languish for a very long time with a dead socket open. This simple line of Python code will turn TCP Keepalives on:

x = sock.setsockopt( socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
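
As a quick check, the option can be read back with getsockopt() before and after setting it. The sketch below needs no network connection. Note that setsockopt() returns None and raises socket.error on failure, so the value to inspect is the one getsockopt() returns; the exact non-zero value reported for "on" can differ between platforms.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# 0 means keepalive is off (the usual default)
before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)

# turn keepalive on; this call returns None
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# any non-zero value means keepalive is now on
after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)

print("keepalive before: %s, after: %s" % (before, after))
sock.close()
```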

Your Digi Python product has a default TCP idle time of two hours, so even if you turn on TCP Keepalives, don't expect your program to recover in minutes. Different TCP Keepalive settings can be entered in the Digi Web interface at Configuration > Network > Advanced Network Settings > TCP Keep-Alive Settings, or by telnet with the set net and show net commands. If you truly wish to force smaller TCP Keepalive settings at all times, you could use the digicli module to force specific settings. This example sets roughly a 5-minute detection of a failed TCP socket.

status,results = digicli.digicli( "set net idle=240 probe_count=5 probe_interval=10" )
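
The "roughly 5 minutes" figure follows from simple arithmetic, sketched below. It assumes the usual keepalive behavior: the first probe goes out after the idle time, and the socket is declared dead once probe_count probes, spaced probe_interval seconds apart, have all gone unanswered. The exact timing on a given Digi product may differ slightly.

```python
idle = 240            # seconds of silence before the first probe is sent
probe_count = 5       # unanswered probes before the socket is declared dead
probe_interval = 10   # seconds between probes

# assumed worst case: the full idle time plus every probe interval
worst_case = idle + probe_count * probe_interval

print("detection after about %d seconds (%.1f minutes)"
      % (worst_case, worst_case / 60.0))
```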

Do not try to use TCP Keepalive to detect TCP socket failure more quickly than a few minutes.

People who try to set it for 5 seconds (or for milliseconds) invariably cause serious compatibility issues with other products, and invariably fail to be satisfied. If you truly require detecting a TCP socket failure in 1 second or less, which implies your TCP peers normally send data many times per second, then use a socket timeout with the "socket.timeout" exception to detect when no data has been received in your required time-frame. And if you accept that a TCP peer quiet for 1 second is bad, then close the socket manually and attempt recovery directly. Do not use TCP Keepalive for such short-period detection.
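
A minimal sketch of that short-period approach follows. The function name read_until_quiet, its parameters and its return values are hypothetical, chosen only for illustration. It treats a peer that stays silent for quiet_limit seconds as failed, closes the socket itself, and tells the caller why it stopped so that recovery can begin.

```python
import socket

def read_until_quiet(sock, quiet_limit=1.0, bufsize=64):
    """Read until the peer is silent for quiet_limit seconds, closes, or errors.

    Hypothetical helper for illustration only.
    Returns (chunks, reason); reason is 'quiet', 'closed' or 'error'.
    """
    sock.settimeout(quiet_limit)
    chunks = []
    while True:
        try:
            data = sock.recv(bufsize)
        except socket.timeout:
            # nothing arrived within quiet_limit seconds
            reason = 'quiet'
            break
        except socket.error:
            reason = 'error'
            break
        if not data:
            # recv() returning no data means the peer closed the connection
            reason = 'closed'
            break
        chunks.append(data)
    # close manually rather than waiting for keepalive to notice
    sock.close()
    return chunks, reason
```

The socket.timeout clause has to come before the socket.error clause, because socket.timeout is a subclass of socket.error.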

Example code

(Note that this code uses " import socket " and thus all methods and all constants such as SOCK_STREAM must include the " socket. " prefix. Many other example applications use " from socket import * ", which eliminates the need for the " socket. " prefix. Either solution works; just be mindful of this detail if you mix and match sample code from diverse sources.)

import sys
import socket
import traceback
import time

def do_work( forever = True):

    while True:

        # start with a socket at 5-second timeout
        print "Creating the socket"
        sock = socket.socket( socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout( 5.0)

        # check and turn on TCP Keepalive
        x = sock.getsockopt( socket.SOL_SOCKET, socket.SO_KEEPALIVE)
        if x == 0:
            print 'Socket Keepalive off, turning on'
            # setsockopt() returns None; it raises socket.error on failure
            sock.setsockopt( socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            x = sock.getsockopt( socket.SOL_SOCKET, socket.SO_KEEPALIVE)
            print 'getsockopt=', x
        else:
            print 'Socket Keepalive already on'

        try:
            sock.connect(('192.168.196.8',2101))

        except socket.error:
            print 'Socket connect failed! Loop up and try socket again'
            traceback.print_exc()
            time.sleep( 5.0)
            continue

        print 'Socket connect worked!'

        while True:
            try:
                req = sock.recv(6)

            except socket.timeout:
                print 'Socket timeout, loop and try recv() again'
                time.sleep( 5.0)
                # traceback.print_exc()
                continue

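            except socket.error:
                traceback.print_exc()
                print 'Other Socket err, exit and try creating socket again'
                # break from loop
                break

            if not req:
                # recv() returning no data means the peer closed the connection
                print 'Peer closed the socket, exit and try creating socket again'
                break

            print 'received', req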

        # always release the old socket before creating a new one
        try:
            sock.close()
        except socket.error:
            pass

        # loop back up & restart

if __name__ == '__main__':

    do_work( True)