Every month I receive a CSV file, around 2 GB in size. I import this file into a table in a MySQL database, and the import is almost instant.
Then, using PHP, I query this table, filter the data, and write the relevant rows to several other tables. This takes several days, even though all queries are optimized.
I want to move this processing to Hadoop but don't understand what the starting point should be. I am studying Hadoop and I know this can be done using Sqoop, but I am still confused about where to start in terms of migrating this data to Hadoop.
Use Apache Spark, perhaps in Python, as it is easy to get started with. Spark may be overkill here, but given its speed and scalability there is no harm in putting some extra effort into it.
You might also want to switch to one of the databases that Spark provides direct APIs for (Hive, HBase, etc.). This is optional, though: with a little extra code you can keep writing to MySQL if you don't want to change.
The overall design would be like this:
Systems involved: