We've been tasked with migrating quite a bit of XML data (1.27 million XML files, each containing one node with properties) into a Neo graph. We're using goroutines with channels to chew through the files concurrently: parse the XML, prepare Cypher queries for inserts, and so on, throttling the number of workers running at any one time.
The issue I'm having is that I'm running into errors like "tcp connection reset by peer" and "panic: Couldn't read expected bytes for message length. Read: 0 Expected: 2.", and I can only imagine this is due to running connections and statements concurrently in our workers. Our throttling caps us at 100 concurrent workers; I wouldn't think that would be a major problem for Neo, but I just can't figure out why it's choking on us.
Are there any architecture recommendations out there for handling a use case like this, where we have to run single Cypher statements from a large number of worker routines (in our case, 100 at a time)?
Currently, we walk a file tree to build up a queue of files to process; after the walk is done, we iterate that queue and fire off a goroutine for each file, using a buffered throttle channel to block the firing of new routines until previous routines have finished. Within each routine I spin up a new connection, prepare a statement, execute it, close the connection, and so on. A simplified sketch of that loop is below.
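This is roughly the pattern (a simplified sketch, not our real code; the connection string, label, and file list are placeholders):

```go
package main

import (
	"log"
	"sync"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

func main() {
	driver := bolt.NewDriver()

	fileQueue := []string{"a.xml", "b.xml"} // stand-in for the real queue built by the tree walk
	throttle := make(chan struct{}, 100)    // at most 100 workers in flight
	var wg sync.WaitGroup

	for _, path := range fileQueue {
		throttle <- struct{}{} // blocks until a worker slot frees up
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			defer func() { <-throttle }()

			// One brand-new connection per file, as described above.
			conn, err := driver.OpenNeo("bolt://neo4j:password@localhost:7687")
			if err != nil {
				log.Printf("%s: %v", path, err)
				return
			}
			defer conn.Close()

			// Parse the XML for this file, then run a single insert (placeholder query).
			_, err = conn.ExecNeo(
				"CREATE (n:Record {id: {id}})",
				map[string]interface{}{"id": path},
			)
			if err != nil {
				log.Printf("%s: %v", path, err)
			}
		}(path)
	}
	wg.Wait()
}
```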
I see this package offers Pipelines, but I'm just not sure how to use them within the processing/queue/channel architecture that we've currently got going:
https://github.com/johnnadratowski/golang-neo4j-bolt-driver
I've also tried using:
But I keep getting "tcp connection reset by peer" errors when trying to connect to Neo concurrently.
It is possible you are using thread-unsafe functionality in neo4j-bolt-driver.
There are 2 versions of drivers provided by neo4j-bolt-driver:

1. `Driver` - a plain driver
2. `DriverPool` - a driver which manages a connection pool

The driver objects themselves are thread-safe, but the `Conn` objects, which represent the underlying connections, are not. You may be using a `Conn` object in a way that it's not meant to be used.
With goroutines, it's best to create `Conn` objects using the `DriverPool` methods. When `Close` is called on such a connection, it doesn't necessarily close the underlying connection; instead it reclaims the connection for reuse by the pool.
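As a minimal sketch of that approach (the connection string, pool size, query, and file list are assumptions for illustration), the pool-based version of the worker loop might look like this:

```go
package main

import (
	"log"
	"sync"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

func main() {
	// Size the pool to match (or cap) the number of concurrent workers.
	pool, err := bolt.NewDriverPool("bolt://neo4j:password@localhost:7687", 100)
	if err != nil {
		log.Fatal(err)
	}

	files := []string{"a.xml", "b.xml"} // stand-in for the real file queue
	throttle := make(chan struct{}, 100)
	var wg sync.WaitGroup

	for _, path := range files {
		throttle <- struct{}{}
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			defer func() { <-throttle }()

			// Each goroutine checks a connection out of the pool...
			conn, err := pool.OpenPool()
			if err != nil {
				log.Printf("%s: %v", path, err)
				return
			}
			// ...and Close returns it to the pool rather than tearing it down.
			defer conn.Close()

			_, err = conn.ExecNeo(
				"CREATE (n:Record {id: {id}})",
				map[string]interface{}{"id": path},
			)
			if err != nil {
				log.Printf("%s: %v", path, err)
			}
		}(path)
	}
	wg.Wait()
}
```

With the pool capped at the same size as the worker throttle, you keep at most 100 live connections to Neo4j instead of constantly opening and tearing down a fresh TCP connection per file, which is the usual cause of "connection reset by peer" under this kind of load.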