How to Locate the Cause of System Inaccessibility
A client is a piece of computer hardware or software that accesses a service made available by a server, as part of the client-server model of computer networks. When the system is not accessible to its clients, all of its services are interrupted directly. IT teams therefore spend a great deal of time on network troubleshooting, and many factors make that troubleshooting harder. In such situations it is critical to locate the cause of the problem and resolve it effectively. This case explains how to use retrospective network analysis technology to quickly and accurately locate the root cause of a network event.
A telecom operator's system suddenly became inaccessible, and its web pages could not be opened normally. Although the network equipment had been inspected several times, the cause of the problem could not be determined. Before the problem occurred, Colasoft nChronos had been deployed in the server area of the system for real-time network monitoring and full traffic capture, which gave the IT team the data it needed to locate the root cause of the abnormal behavior.
Colasoft nChronos captured all the traffic for retrospective network analysis, which also made it possible to analyze the access logic of the affected business. After monitoring and analyzing the business outlet, extranet x.x.244.46, load balancer (F5) x.x.248.27, servers x.x.16.92-95, and database x.x.16.86, the team found that a large number of sessions had been reset during the network failure, as shown.
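The step of spotting a burst of resets can be sketched in a few lines of Python. This is only an illustration, not how nChronos works internally: the Session record, its field names, and the sample values are hypothetical stand-ins for whatever an exported session list would contain.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Session:
    start_second: int   # seconds since the start of the capture (hypothetical field)
    ended_by_rst: bool  # True if the session was closed by an RST packet


def resets_per_minute(sessions):
    """Return {minute_index: number of sessions that ended with an RST}."""
    counts = Counter()
    for s in sessions:
        if s.ended_by_rst:
            counts[s.start_second // 60] += 1
    return dict(counts)


# Made-up sample data, for illustration only.
sessions = [
    Session(5, False),
    Session(70, True),
    Session(95, True),
    Session(130, True),
]
print(resets_per_minute(sessions))  # {1: 2, 2: 1}
```

A minute whose reset count stands far above the others is a reasonable place to start drilling into individual sessions.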
Analysis of the sessions then showed that the client sent a POST request after completing the three-way handshake with the server, and the server replied with an ACK as normal. However, after waiting a few seconds, or in some cases several hundred seconds, the client still had not received any data, so it sent an RST packet to close the session, as shown.
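The session pattern described here (request sent, ACK received, no data, client RST) can be expressed as a small check. The following is a minimal sketch assuming a simplified, hypothetical Packet record; real capture tools expose richer fields, and the min_wait threshold is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    t: float           # seconds since the session started
    sender: str        # "client" or "server"
    flags: frozenset   # TCP flags, e.g. frozenset({"ACK"})
    payload_len: int   # bytes of application data carried


def stalled_then_reset(packets, min_wait=1.0):
    """True if the client sent a request, the server never returned any
    payload, and the client sent an RST at least min_wait seconds later."""
    request_time = None
    for p in packets:
        if p.sender == "client" and p.payload_len > 0 and request_time is None:
            request_time = p.t          # first request data from the client
        elif p.sender == "server" and p.payload_len > 0 and request_time is not None:
            return False                # the server answered with data
        elif p.sender == "client" and "RST" in p.flags and request_time is not None:
            return p.t - request_time >= min_wait
    return False


# Hypothetical session: handshake, POST, bare ACK, then RST after 30 seconds.
packets = [
    Packet(0.000, "client", frozenset({"SYN"}), 0),
    Packet(0.001, "server", frozenset({"SYN", "ACK"}), 0),
    Packet(0.002, "client", frozenset({"ACK"}), 0),
    Packet(0.003, "client", frozenset({"PSH", "ACK"}), 512),
    Packet(0.004, "server", frozenset({"ACK"}), 0),
    Packet(30.0, "client", frozenset({"RST"}), 0),
]
print(stalled_then_reset(packets))  # True
```

Note that the bare ACK from the server carries no payload, so it does not count as a response; that is exactly why the session looks healthy at the TCP level while the application is stalled.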
Monitoring of database x.x.16.86 showed that the application needs to query data from the database. So is there a problem with the database?
The system database is known to have few connections, most of them long-lived, so the maximum number of connections had been set to 100. The analysis found that within half an hour the number of connections for server x.x.16.94 reached 229 and the number for x.x.16.93 reached 132. Both clearly exceed the upper limit of 100 connections.
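The comparison against the limit is simple arithmetic, sketched below in Python. The limit and the two counts are the figures from this case; the function name is hypothetical, and the sketch follows the case in comparing the half-hour counts directly with the configured maximum.

```python
MAX_CONNECTIONS = 100  # configured upper limit stated in the case

# Connection counts observed within half an hour, as reported in the case.
observed = {"x.x.16.94": 229, "x.x.16.93": 132}


def over_limit(counts, limit):
    """Return {host: excess} for every host whose count exceeds the limit."""
    return {host: n - limit for host, n in counts.items() if n > limit}


print(over_limit(observed, MAX_CONNECTIONS))
# {'x.x.16.94': 129, 'x.x.16.93': 32}
```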
Because of the high number of database connections during that period, the database responded slowly or not at all, which indirectly caused the network problem in which clients could not access or open the application normally, as shown.
A closer analysis of the sessions between client x.x.16.94 and the database showed that after the client sent its request, the database did not return the data in time. The client waited 264 seconds and then sent a FIN packet to disconnect; five minutes later it received the response data and a FIN packet from database x.x.16.86. Connections that linger in this way lead directly to too many connections to database x.x.16.86, so the database cannot respond normally, as shown.
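Why slow closes inflate the connection count can be illustrated with a short calculation of peak concurrency. This is a sketch under stated assumptions: the 264-second hold time comes from the case, while the arrival pattern (one new connection every 10 seconds, 30 connections in total) is a hypothetical value chosen only to show the effect.

```python
def peak_concurrent(intervals):
    """intervals: list of (open_time, close_time) in seconds.
    Returns the largest number of connections open at the same moment."""
    events = []
    for opened, closed in intervals:
        events.append((opened, 1))    # connection opens
        events.append((closed, -1))   # connection closes
    # Tuples sort by time first; at equal times, closes (-1) come before opens (+1).
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def connections(hold_seconds, count=30, interval=10):
    """Hypothetical workload: `count` connections, one opened every `interval` seconds."""
    return [(i * interval, i * interval + hold_seconds) for i in range(count)]


print(peak_concurrent(connections(hold_seconds=1)))    # 1
print(peak_concurrent(connections(hold_seconds=264)))  # 27
```

With the same request rate, the number of simultaneously open connections grows roughly in proportion to how long each one is held, which is how a slow database can push a small, stable connection count past its limit.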
To sum up, because of the database problem, connections could not be closed in time and the number of client connections exceeded the upper limit. The database therefore could not return data normally, which indirectly prevented clients from retrieving data and accessing the application. The connection pool could not support the number of connections needed by the application servers. Database monitoring is shown below.
After locating the cause of the problem, the IT team modified the maximum number of database connections configured for servers x.x.16.92-95, which relieved this bottleneck in the system's fault tolerance and solved the problem.
The system returned to normal once the settings were applied. At the same time, the number of connections from the clients (x.x.16.92-95) returned to normal, as shown.
Business systems are becoming more and more complex, and traditional detection methods often cannot isolate the root cause of a problem from the many related factors. In such cases, retrospective network analysis technology can be used to analyze the business traffic involved in a network failure and accurately find the anomalies in it, so that the root cause is located quickly and the IT team can solve the network problem.