PHP matching a JSON/CSV file to a SQL database with many adjustments (CakePHP)

I want to insert a JSON file (also available as CSV) into a MySQL database using the CakePHP framework. The basics are clear, but the surrounding requirements make it difficult:

  1. The JSON/CSV file is large (approx. 200 MB and up to 200,000 lines).
  2. The file contains several fields. These fields need to be mapped to fields with different names in the MySQL database.
  3. The CSV contains a field named art_number. This field is also present in the MySQL database. The art_number is unique, but it is not the primary key in MySQL. I want to update the MySQL record if the CSV and the database have the same art_number. If not, a new record should be created.
  4. Several fields of the CSV file need to be processed before they are stored. Additional fields also need to be added.
  5. The CSV contains an image_URL. If the record is NEW to the database (unknown art_number), this image should be copied, modified (with Imagick) and stored on the server (see the sketch after this list).
  6. The whole job needs to run on a daily basis.
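
To make item 5 concrete, here is a minimal sketch of the download/modify/store step using the Imagick extension; the function name, target directory, 800px width and JPEG format are placeholders for illustration only:

```php
<?php
// Hypothetical helper for step 5: download the image for a NEW art_number,
// modify it with Imagick, and store it on the server. Paths, the 800px width
// and the JPEG format are assumptions, not part of the actual requirements.
function importArticleImage($imageUrl, $artNumber, $targetDir)
{
    // Download the remote file to a temporary location first
    // (requires allow_url_fopen; curl would work as well).
    $tmp = tempnam(sys_get_temp_dir(), 'img_');
    if (!copy($imageUrl, $tmp)) {
        return false; // download failed, skip this row
    }

    // Modify the image with Imagick, e.g. resize to a fixed width.
    $image = new Imagick($tmp);
    $image->thumbnailImage(800, 0); // 800px wide, height keeps the aspect ratio
    $image->setImageFormat('jpeg');

    // Store it under the art_number so it can be found again later.
    $image->writeImage(rtrim($targetDir, '/') . '/' . $artNumber . '.jpg');
    $image->clear();
    unlink($tmp);

    return true;
}
```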

As you can see, there is a lot going on with some limitations (memory, runtime, etc.), and I am not sure how to approach this from an architecture point of view. For example, should I first insert everything into a separate "import" database table and then run through the steps separately? What is a good way to map the database IDs to the CSV lines? CakePHP can either create a new record or update an existing one if I am able to map the ID based on the art_number. Also, changing and copying up to 200,000 images seems to be a big issue. So how do I break this down into smaller chunks?
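
To illustrate what I mean by mapping the ID based on the art_number, here is a minimal sketch of the "update if it exists, otherwise create" logic in CakePHP 2.x; the Article model and the example field mappings (title, price_eur) are placeholders:

```php
<?php
// Minimal sketch (CakePHP 2.x): update the record if the art_number exists,
// otherwise create a new one. Model name "Article" and the CSV column names
// "title" / "price_eur" are placeholders.
public function upsertRow(array $csvRow)
{
    // Look up the primary key for this art_number, if any.
    $existing = $this->Article->find('first', array(
        'conditions' => array('Article.art_number' => $csvRow['art_number']),
        'fields'     => array('Article.id'),
        'recursive'  => -1,
    ));

    // Map CSV columns to differently named database fields and process them.
    $data = array(
        'art_number' => $csvRow['art_number'],
        'name'       => trim($csvRow['title']),
        'price'      => (float)$csvRow['price_eur'],
    );

    if ($existing) {
        $data['id'] = $existing['Article']['id']; // save() performs an UPDATE
    } else {
        $this->Article->create();                 // save() performs an INSERT
    }
    return $this->Article->save($data);
}
```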

I would appreciate if you could help find the right strategy here. What do I need to consider in terms of memory and speed? Does it make sense to split the process into different jobs? What/how would you do that?

I would appreciate if you could help find the right strategy here. What do I need to consider in terms of memory and speed?

  • Use a shell for the import (see the sketch after this list).
  • Read the data in chunks of X lines or X amount of data to avoid memory problems, then process these chunks. It's a simple loop.
  • If the processing is going to require a LONG time, consider using a job queue like Resque. You can report the progress to the user if needed.
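
A minimal sketch of such a shell with chunked reading, assuming CakePHP 2.x, a semicolon-delimited CSV under TMP and a chunk size of 500 rows (all of these are placeholders):

```php
<?php
// app/Console/Command/ImportShell.php — minimal sketch of a CakePHP 2.x shell
// that streams the CSV instead of loading it into memory. The file path,
// delimiter, chunk size and model name are assumptions.
App::uses('AppShell', 'Console/Command');

class ImportShell extends AppShell
{
    public $uses = array('Article');

    public function main()
    {
        $handle = fopen(TMP . 'import.csv', 'r'); // ~200 MB file, read line by line
        $header = fgetcsv($handle, 0, ';');       // adjust the delimiter as needed
        $chunk  = array();

        while (($line = fgetcsv($handle, 0, ';')) !== false) {
            $chunk[] = array_combine($header, $line);

            if (count($chunk) >= 500) {           // process 500 rows at a time
                $this->processChunk($chunk);
                $chunk = array();
                $this->out('Processed another 500 rows');
            }
        }
        if ($chunk) {
            $this->processChunk($chunk);          // remaining rows
        }
        fclose($handle);
    }

    protected function processChunk(array $rows)
    {
        foreach ($rows as $row) {
            // upsert by art_number, image handling, etc. goes here,
            // or enqueue the row with Resque instead of processing it inline
        }
    }
}
```

Such a shell would be started with `Console/cake import` and can be scheduled with cron for the daily run; inside processChunk() you either process the rows directly or hand them off to the job queue.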

Does it make sense to split the process into different jobs? What/how would you do that?

This depends on your requirements: how long the processing will take and how much your system can process in parallel without hitting 100% CPU usage and effectively slowing down the site. If that happens, move the processing to another machine or limit the CPU usage of that process using the nice command.