Is there a framework for distributed data processing pipelines, or a good way to organize one?

I am designing an application that requires a distributed set of processing workers to asynchronously consume and produce data in a specific flow. For example:

  • Component A fetches pages.
  • Component B analyzes pages from A.
  • Component C stores analyzed bits and pieces from B.

There are obviously more than just three components involved.

Further requirements:

  • Each component needs to be a separate process (or set of processes).
  • Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.

This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.

I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue that the producer uses — which I'm not particularly happy about, but it might be worth the compromise.

Another issue with the AMQP approach is that each component will have to have very similar logic for:

  • Connecting to the queue
  • Handling connection errors
  • Serializing/deserializing data into a common format
  • Running the actual workers (goroutines or forking subprocesses)
  • Dynamic scaling of workers
  • Fault tolerance
  • Node registration
  • Processing metrics
  • Queue throttling
  • Queue prioritization (some workers are less important than others)

…and many other little details that each component will need.

Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.

Does a good framework exist for this kind of stuff?

Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.

Edit: Added some points for clarity.

I guess you are looking for a message queue, like beanstalkd, RabbitMQ, or ØMQ (pronounced zero-MQ). The essence of all of these tools is that they provide push/receive methods for FIFO (or non-FIFO) queues and some even have pub/sub.

So, one component puts data in a queue and another one reads. This approach is very flexible in adding or removing components and in scaling each of them up or down.

Most of these tools already have libraries for Go (ØMQ is very popular among Gophers) and other languages, so the overhead code is minimal. Just import a library and start pushing and receiving messages.

To decrease this overhead further and avoid depending on a particular broker's API, you can write a thin package of your own that wraps one of these message queue systems behind very simple push/receive calls, and use that package in all of your components.

I understand that you want to avoid Hadoop+Java, but instead of spending time developing your own framework, you may want to have a look at Cascading. It provides a layer of abstraction over underlying MapReduce jobs.

Best summarized on Wikipedia: "It [Cascading] follows a 'source-pipe-sink' paradigm, where data is captured from sources, follows reusable 'pipes' that perform data analysis processes, where the results are stored in output files or 'sinks'. Pipes are created independent from the data they will process. Once tied to data sources and sinks, it is called a 'flow'. These flows can be grouped into a 'cascade', and the process scheduler will ensure a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs."

You may also want to have a look at some of their examples: Log Parser, Log Analysis, and TF-IDF (especially this flow diagram).

Because Go has CSP-style channels, it offers a special opportunity to implement a framework for parallelism that is simple, concise, and yet completely general. It should be possible to do rather better than most existing frameworks with rather less code. Java and the JVM have nothing like this.

All it requires is an implementation of channels over configurable TCP transports. This would consist of:

  • a writing channel-end API, including some general specification of the intended server for the reading end
  • a reading channel-end API, including listening port configuration and support for select
  • marshalling/unmarshalling glue to transfer data - probably encoding/gob

A success acceptance test of such a framework should be that a program using channels should be divisible across multiple processors and yet retain the same functional behaviour (even if the performance is different).

There are quite a few existing transport-layer networking projects in Go. Notable are the ZeroMQ (0MQ) bindings (gozmq, zmq2, zmq3).