Every month I receive a CSV file, around 2 GB in size. I import this file into a table in a MySQL database, and the import is almost instant.
Then, using PHP, I query this table, filter the data, and write the relevant rows to several other tables. This takes several days, even though all queries are optimized.
I want to move this processing to Hadoop but don't understand what the starting point should be. I am studying Hadoop and I know this can be done using Sqoop, but I am still confused about where to start in terms of migrating this data to Hadoop.
Use Apache Spark, perhaps in Python, as it is easy to get started with. Spark may be overkill here, but given its speed and scalability there is no harm in putting some extra effort into it.
You might also want to switch to one of the databases that Spark provides direct APIs for (Hive, HBase, etc.). This is optional, though: with a little extra code you can keep writing to MySQL if you don't want to change.
The overall design would be like this:
Systems involved: