While using a mongodb replica set to serve my app servers, I ran into a problem: the number of TCP connections between 4 app servers and 2 mongo slave nodes kept increasing until the 2 slave nodes stopped serving.
Here are normal logs from one of the slave nodes:
2017-03-12T03:01:41.165+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43775 #1537161 (55 connections now open)
2017-03-12T03:01:41.188+0800 I ACCESS [conn1537161] Successfully authenticated as principal admin on admin
2017-03-12T03:01:41.193+0800 I NETWORK [conn1537161] end connection 127.0.0.1:43775 (54 connections now open)
2017-03-12T03:01:41.938+0800 I NETWORK [conn1537152] end connection 10.162.xxx.xxx:49334 (53 connections now open)
2017-03-12T03:01:41.938+0800 I NETWORK [initandlisten] connection accepted from 10.162.xxx.xxx:49345 #1537162 (54 connections now open)
2017-03-12T03:01:41.958+0800 I ACCESS [conn1537162] Successfully authenticated as principal __system on local
2017-03-12T03:01:42.256+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43776 #1537163 (55 connections now open)
2017-03-12T03:01:42.279+0800 I ACCESS [conn1537163] Successfully authenticated as principal admin on admin
2017-03-12T03:01:42.284+0800 I NETWORK [conn1537163] end connection 127.0.0.1:43776 (54 connections now open)
2017-03-12T03:01:42.721+0800 I ACCESS [conn1401659] Successfully authenticated as principal online on online
2017-03-12T03:01:47.243+0800 I NETWORK [conn1537154] end connection 10.251.xxx.xxx:34844 (53 connections now open)
But there was nothing unusual on the mongo master node. Note that our app servers receive less traffic during this period than in the daytime. Here are the abnormal logs from the same slave node:
2017-03-12T03:05:59.301+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:33377 #1537221 (86 connections now open)
2017-03-12T03:06:01.484+0800 I NETWORK [conn1537215] end connection 120.26.xxx.xxx:48190 (85 connections now open)
2017-03-12T03:06:01.487+0800 I NETWORK [initandlisten] connection accepted from 120.26.xxx.xxx:48203 #1537222 (86 connections now open)
2017-03-12T03:06:01.508+0800 I ACCESS [conn1537222] Successfully authenticated as principal __system on local
2017-03-12T03:06:05.464+0800 I NETWORK [initandlisten] connection accepted from 10.168.xxx.xxx:51008 #1537223 (87 connections now open)
2017-03-12T03:06:11.231+0800 I ACCESS [conn1537213] Successfully authenticated as principal online on online
2017-03-12T03:06:12.289+0800 I NETWORK [conn1537216] end connection 10.162.xxx.xxx:49427 (86 connections now open)
... (about five and a half hours later, connections have grown from ~86 to over 32,000) ...
2017-03-12T08:37:44.764+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:37444 #1571431 (32306 connections now open)
2017-03-12T08:37:44.773+0800 I ACCESS [conn1571430] Successfully authenticated as principal online on online
2017-03-12T08:37:44.786+0800 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
2017-03-12T08:37:44.786+0800 I NETWORK [initandlisten] can't create new thread, closing connection
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:37445 #1571432 (32306 connections now open)
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] can't create new thread, closing connection
2017-03-12T08:37:57.000+0800 I REPL [ReplicationExecutor] Error in heartbeat request to xxx.xxx.xxx.xxx:27017; UnknownError Sending command { replSetHeartbeat: "rs1", pv: 1, v: 5, from: "xxx.xxx.xxx.xxx:27018", fromId: 1, checkEmpty: false } on database admin over network to xxx.xxx.xxx.xxx:27017 received exception boost::thread_resource_error
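As I understand it, the pthread_create failure with errno 11 (EAGAIN) means mongod hit an OS resource limit on threads for its user (mongod uses a thread per incoming connection in this version), which is consistent with the ~32k open connections in the log. One thing that would have made this visible earlier is watching serverStatus.connections. A minimal monitoring sketch with mgo (the URL is a placeholder, and I am assuming connect=direct plus Eventual mode lets commands run against a secondary directly):

package main

import (
	"log"
	"time"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Placeholder URL; point it at the slave node you want to watch.
	session, err := mgo.Dial("mongodb://admin:pass@slave-host:27018/?connect=direct")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	// Allow running commands against a secondary.
	session.SetMode(mgo.Eventual, true)

	var status struct {
		Connections struct {
			Current   int `bson:"current"`
			Available int `bson:"available"`
		} `bson:"connections"`
	}
	for range time.Tick(30 * time.Second) {
		// serverStatus reports the node's open and remaining connection slots.
		if err := session.DB("admin").Run(bson.D{{"serverStatus", 1}}, &status); err != nil {
			log.Printf("serverStatus failed: %v", err)
			continue
		}
		log.Printf("connections: current=%d available=%d",
			status.Connections.Current, status.Connections.Available)
	}
}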
My guess: some bug on the mongodb slave nodes stops read operations, which results in i/o timeouts on the app servers; the app servers then treat the pooled connections to the slave as unusable and try to open new connections. But unfortunately the new connections cannot be established normally either, so new connections keep being opened.
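If that theory holds, one defensive measure I am considering (a sketch under my own assumptions; the limit and timeout values are illustrative, not tuned) is to bound the pool and set explicit timeouts at dial time, so a stalled slave cannot trigger unbounded socket growth. mgo.v2's default pool limit is 4096 sockets per server, which leaves plenty of room for this kind of runaway growth:

package main

import (
	"log"
	"time"

	"gopkg.in/mgo.v2"
)

func dialBounded(url string) (*mgo.Session, error) {
	// Fail fast if the server cannot be reached.
	session, err := mgo.DialWithTimeout(url, 10*time.Second)
	if err != nil {
		return nil, err
	}
	// Cap sockets per server; the mgo.v2 default is 4096.
	session.SetPoolLimit(128)
	// Abort individual operations instead of hanging on a stalled node.
	session.SetSocketTimeout(1 * time.Minute)
	session.SetSyncTimeout(1 * time.Minute)
	return session, nil
}

func main() {
	s, err := dialBounded("mongodb://localhost:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer s.Close()
}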
But each app server uses only two sessions. Specifically, I get the sessions during init() and reuse them for db read/write operations, so I cannot figure out where the leak is. Here is the code where I get the sessions:
func init() {
    ...some code...
    session, err := mgo.Dial(url)
    if err != nil {...handle error...}
    // Strong session: a copy of the dialed session, pinned to the primary.
    strongSession = session.Copy()
    strongSession.SetMode(mgo.Strong, true)
    strongSession.SetSafe(&mgo.Safe{})
    // Eventual session: the dialed session itself, may read from slaves.
    eventualSession = session
    eventualSession.SetMode(mgo.Eventual, true)
    eventualSession.SetSafe(&mgo.Safe{})
    ...some code...
}
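I understand the classic way connections leak with mgo is forgetting Close() after a per-request Copy() or Clone(), which keeps sockets from ever returning to the pool so the server-side count climbs exactly like in the logs. But my code never copies sessions outside init(). For reference, here is a sketch of the balanced pattern I checked against (the User type, getUser function, and collection name are made up for illustration):

// Hypothetical type and handler, for illustration only.
type User struct {
	ID   string `bson:"_id"`
	Name string `bson:"name"`
}

func getUser(id string) (*User, error) {
	// Copy hands this call its own socket from the pool...
	s := eventualSession.Copy()
	// ...and Close returns it. Without the Close, every call would leak
	// a socket and the server-side connection count would keep climbing.
	defer s.Close()

	var u User
	if err := s.DB("online").C("users").FindId(id).One(&u); err != nil {
		return nil, err
	}
	return &u, nil
}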
If the problem is caused by a connection pool leak in the driver, can I assume there are pool management bugs in gopkg.in/mgo.v2?
This problem has occurred twice in my live environment, several days ago and two months ago, both during the same time window, roughly 3:00 to 8:30 in the morning.
I checked whether any scheduled tasks run at this time on either the mongo servers or the app servers, but found nothing.
I have dug through the logs and searched Google and Stack Overflow for several days with keywords like 'mgo connection pool', 'mongo can't create new thread', and 'mongo connection increase', but unfortunately I have no idea how to investigate this problem further.
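The only further idea I have is to turn on mgo's internal logging next time it happens, which should show socket and reconnect activity directly (a sketch; I have not run this in production yet, and the output is verbose):

import (
	"log"
	"os"

	"gopkg.in/mgo.v2"
)

func enableMgoDebug() {
	// Route mgo's internal log (socket handling, cluster syncs,
	// reconnect attempts) to stderr so connection churn becomes visible.
	mgo.SetLogger(log.New(os.Stderr, "[mgo] ", log.LstdFlags))
	mgo.SetDebug(true)
}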
I would really appreciate any advice.
Thanks so much!