具有mgo驱动的Golang应用服务器,并且mongodb副本设置了从属节点:连接太多

While using mongodb replica set to serve my app server, I encountered a problem that TCP connections between 4 app servers and 2 mongo slave nodes continued increase until 2 mongo slave nodes stop service.

Environment

  • operating system: Ubuntu 14.04 64bit, both for mongodb servers and app servers
  • mongodb: v3.0.9, replica set with 1 master, 2 slave and 1 arbiter
  • app: 4 servers cluster, written by golang, with mongodb driver gopkg.in/mgo.v2

Details

1. Normally we have about 55 connections on each slave node.

Here is the normal logs from one of slave nodes:

2017-03-12T03:01:41.165+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43775 #1537161 (55 connections now open)
2017-03-12T03:01:41.188+0800 I ACCESS [conn1537161] Successfully authenticated as principal admin on admin
2017-03-12T03:01:41.193+0800 I NETWORK [conn1537161] end connection 127.0.0.1:43775 (54 connections now open)
2017-03-12T03:01:41.938+0800 I NETWORK [conn1537152] end connection 10.162.xxx.xxx:49334 (53 connections now open)
2017-03-12T03:01:41.938+0800 I NETWORK [initandlisten] connection accepted from 10.162.xxx.xxx:49345 #1537162 (54 connections now open)
2017-03-12T03:01:41.958+0800 I ACCESS [conn1537162] Successfully authenticated as principal __system on local
2017-03-12T03:01:42.256+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43776 #1537163 (55 connections now open)
2017-03-12T03:01:42.279+0800 I ACCESS [conn1537163] Successfully authenticated as principal admin on admin
2017-03-12T03:01:42.284+0800 I NETWORK [conn1537163] end connection 127.0.0.1:43776 (54 connections now open)
2017-03-12T03:01:42.721+0800 I ACCESS [conn1401659] Successfully authenticated as principal online on online
2017-03-12T03:01:47.243+0800 I NETWORK [conn1537154] end connection 10.251.xxx.xxx:34844 (53 connections now open)

2. At around 3:02 in the morning one day, 2 slave nodes' connections began increase, likely 5-15 connections per node per minute.

But nothing unusual with mongo master node). Please note that our app servers during this time period have fewer access than daylight. Abnormal logs from the same slave node above:

2017-03-12T03:05:59.301+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:33377 #1537221 (86 connections now open)
2017-03-12T03:06:01.484+0800 I NETWORK [conn1537215] end connection 120.26.xxx.xxx:48190 (85 connections now open)
2017-03-12T03:06:01.487+0800 I NETWORK [initandlisten] connection accepted from 120.26.xxx.xxx:48203 #1537222 (86 connections now open)
2017-03-12T03:06:01.508+0800 I ACCESS [conn1537222] Successfully authenticated as principal __system on local
2017-03-12T03:06:05.464+0800 I NETWORK [initandlisten] connection accepted from 10.168.xxx.xxx:51008 #1537223 (87 connections now open)
2017-03-12T03:06:11.231+0800 I ACCESS [conn1537213] Successfully authenticated as principal online on online
2017-03-12T03:06:12.289+0800 I NETWORK [conn1537216] end connection 10.162.xxx.xxx:49427 (86 connections now open)

3. Until 8:37 in the morning, slave nodes finally stopped service.

2017-03-12T08:37:44.764+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:37444 #1571431 (32306 connections now open)
2017-03-12T08:37:44.773+0800 I ACCESS [conn1571430] Successfully authenticated as principal online on online
2017-03-12T08:37:44.786+0800 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
2017-03-12T08:37:44.786+0800 I NETWORK [initandlisten] can't create new thread, closing connection
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] connection accepted from 10.171.xxx.xxx:37445 #1571432 (32306 connections now open)
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
2017-03-12T08:37:45.290+0800 I NETWORK [initandlisten] can't create new thread, closing connection

4. Even heartbeat between mongo nodes crashes.

2017-03-12T08:37:57.000+0800 I REPL [ReplicationExecutor] Error in heartbeat request to xxx.xxx.xxx.xxx:27017; UnknownError Sending command { replSetHeartbeat: "rs1", pv: 1, v: 5, from: "xxx.xxx.xxx.xxx:27018", fromId: 1, checkEmpty: false } on database admin over network to xxx.xxx.xxx.xxx:27017 received exception boost::thread_resource_error

5. Finally I restarted the 2 mongo slave nodes, and everything goes all right.

6. To recover app service in time, no more on site status with mongo nodes or app servers. Only two more clues as below.

  • TCP connections on mongo slave nodes increase rapidly(but again, master node is normal).
  • While read operation from app server to mongo slave nodes got i/o timeout(from the beginning 3:02).

My surmise and investigation

- Something wrong with mongodb slave node

Suppose any bug from mongodb slave nodes stops read operation, which result in app servers' i/o timeout; and app servers consider no connection in pool to slave node is unavailable, so trying to open new connection with slave nodes. But unfortunately connections cannot be established normally, so new connections continued being opened.

- App servers' mgo driver connection pool leaks.

But only two sessions are used in each app server. In detail I get the sessions during init() and use them for db read/write operations. I cannot figure out the leak. Here is the code I get the sessions:

        func init() {  
          ...some code... 

          session, err := mgo.Dial(url)  
          if err != nil {...handle error...}  

          strongSession = session.Copy()  
          strongSession.SetMode(mgo.Strong, true)  
          strongSession.SetSafe(&mgo.Safe{})  

          eventualSession = session  
          eventualSession.SetMode(mgo.Eventual, true)  
          eventualSession.SetSafe(&mgo.Safe{})  

          ...some code...  
        }  

If problem is caused by mgo driver's connection pool leak, can I assume some pool management bugs with driver gopkg.in/mgo.v2?

My confusion

This problem occurs two times in my live environment, several days ago and two months ago, with the same time period from around 3:00 to 8:30 in the morning.

I really thought if any tasks at this time on both mongo server and app server, but found nothing.

I have dug the logs, searched google and stackoverflow for several days by keywords like 'mgo connection pool', 'mongo can't create new thread', 'mongo connection increase' and so on, but unfortunately have no idea on how to investigate this problem further.

Really appreciate anyone could give some advise~~

Thanks so much~~