Forum Discussion

Andre_12127
Jun 26, 2012

Dealing with long-lasting outbound TCP connections

I am currently trying to resolve an issue concerning long-lasting TCP sessions.


One of the load-balanced web servers regularly queries a database server that is outside the load-balanced segment. The triggered operation takes quite a long time, since it has to process a large amount of data. During that time the TCP session stays open but idle.


After one hour the session is reset by the BIG-IP, causing the query to fail.



My suspicion is that this reset is caused by the wildcard forwarding virtual server, which routes the outbound traffic back to the rest of the LAN. This server has a timeout of 3600 seconds in its client protocol profile, which would qualify as a cause.


I tried to bypass that by adding a second forwarding virtual server that contains only the one database host, with a longer timeout in a separate client protocol profile, and also by disabling the reset of timed-out TCP sessions, but without success.
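Roughly along these lines (all names and the database IP are placeholders, and tmsh syntax may differ slightly between versions, so this is only a sketch):

```shell
# Custom FastL4 profile with a longer idle timeout than the 3600 s default
# (profile name and timeout value are examples only)
tmsh create ltm profile fastl4 fastl4_long_idle { idle-timeout 7200 }

# Host-specific IP-forwarding virtual server for the database server;
# being more specific than the 0.0.0.0/0 wildcard, it should match first
tmsh create ltm virtual vs_fwd_db destination 10.0.0.50:any ip-forward \
    profiles add { fastl4_long_idle }
```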



I am wondering whether my thinking is completely wrong here, or whether the wildcard virtual server matches before the dedicated forwarding server does; at the moment I don't see a way to reverse that, if there is one at all.


Currently I am thinking that adding an iRule to the wildcard server, which exchanges the client protocol profile when an IP from a certain data group matches, might be the most feasible solution.
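A rough sketch of that idea (the data group name is a placeholder; rather than swapping the whole profile, this uses IP::idle_timeout to raise the timeout for just the matching connections):

```tcl
when CLIENT_ACCEPTED {
    # long_idle_hosts is a hypothetical address-type data group
    # containing the database server(s)
    if { [class match [IP::local_addr] equals long_idle_hosts] } {
        # Override the profile's idle timeout for this connection (seconds)
        IP::idle_timeout 7200
    }
}
```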


Or would there be an easier way around such a problem?



The reason I am asking is that we do not have a real test environment and everything runs over that one production cluster (I know it should be different, but I am already sore from arguing with the other admins), and fiddling with the default route doesn't leave me feeling comfortable.


So if there is an easier solution to this problem, it would be greatly appreciated.







6 Replies

  • Hi Andre,



    The more specific virtual server should be the one matched:



    SOL9038 - The order of precedence for local traffic object listeners




    Can you check the connection table to see which VS is being matched and what idle timeout is set for the connection? The syntax has changed slightly over the versions, but you should be able to check using 'tmsh show sys conn...' See the man page for details on filtering the connection table entries in your version (tmsh help sys conn).
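    For example, something along these lines (the database IP is a placeholder, and the option names vary by version, so check 'tmsh help sys conn' first):

    ```shell
    # Filter the connection table for connections to the database server;
    # the detailed listing should show which VS matched and the idle timeout
    tmsh show sys conn cs-server-addr 10.0.0.50 all-properties
    ```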



  • Thanks for the help Aaron, it's greatly appreciated.



    So basically, the idea of creating a forwarding VS that relates only to that one outbound host, with a different client protocol profile, should do the trick.


    OK, then I will have to investigate further. I will have to talk to the database guys to see whether they can reproduce the query today. They said they would need some time, but hopefully we can arrange it sooner, since I want this off the table. Once we have that query running again, I will check as suggested and post the results.



  • If you are running 10.2.3 or later, logging the reset cause might be helpful.



    sol13223: Configuring the BIG-IP system to log TCP RST packets

  • Unfortunately we are still on 10.2.0, and at the moment it looks like an update might take some time.



    I have talked with the database admins, and they want to give it another go sometime next week. So sit tight, I will be back ;)
  • Is this relevant?



    sol8049: Implementing TCP Keep-Alives for server-client communication using TCP profiles
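    The idea in sol8049, roughly sketched (profile name and interval are examples only; TCP keep-alives apply to standard virtual servers with a TCP profile, not to FastL4 forwarding VSs):

    ```shell
    # Custom TCP profile that sends keep-alive probes on idle connections,
    # so an idle-but-live session is not torn down by the idle timeout
    tmsh create ltm profile tcp tcp_keepalive_60 { keep-alive-interval 60 }
    ```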

  • OK, this thread already smells a little necrotic, but I still wanted to share how things turned out.



    First of all, Hoolio's hint about using the "sh sys conn" command in tmsh was a great help. We could monitor the connections for the specific VS very nicely that way, and it ultimately helped us track down the correct connection.



    Our original focus was to add a new forwarding VS that would match when the load-balanced application servers behind the LTM opened a connection to the database server to fetch their data. We used a client protocol profile (based on the FastL4 profile) with a longer timeout setting than the default profile. We then tracked the matching of the VS in another attempt and saw the longer timeouts. At the same time we also noticed that this connection never even came close to hitting that timeout, since the application and the database server had plenty of packets going back and forth. Nevertheless, the application broke down after exactly 3600 seconds.


    After some more testing we finally found out that it wasn't the server-to-server communication between the two servers mentioned earlier that caused the problem, but the client component installed on a regular PC. The person responsible for the application jumped to some conclusions from the application's error logs and sent us on a wild goose chase.


    Since the application only runs once a year (but is nevertheless very important) and does not need any HA features, we have decided to move it onto a server outside the LTM segments.



    Thanks for your help folks, you contributed a lot to the solution .... even though things turned out quite different from what we originally assumed.