On a daily basis, I get a source CSV file that has 250k rows and 40 columns; it's 290MB. I need to filter it because it has more rows and more columns than I need.
For each row that meets the filtering criteria, we want to update it in the destination system one record at a time using its PHP API.
What will be the best approach for everything up until the API call (the reading / filtering / loading) for the fastest performance?
1. Iterating through each row of the file, deciding if it's a row I want, grabbing only the columns I need, and then passing it to the API?
2. Loading ALL records into a temporary MySQL table using LOAD DATA INFILE, then querying the table for the rows and fields I want, and iterating through the result set, passing each record to the API?
Is there a better option?
Thanks!
I need to make an assumption first: the majority of the 250K rows will end up in the database. If only a very small percentage qualifies, then iterating over the file and sending the matching rows in batches is definitely faster.
Different configurations could affect both approaches, but generally speaking, the second approach performs better with less scripting effort.
Approach 1: the worst thing you can do is send each row to the server one at a time: more round trips and more small commits.
What you can improve here is to send rows in batches, maybe a few hundred at a time; you will see a much better result.
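For example, a minimal sketch of the batched variant in PHP (the column indexes, the 'ACTIVE' filter, and the /api/records/batch endpoint are placeholders, since the real file layout and API are not shown):

```php
<?php
// Sketch of Approach 1 with batching: stream the CSV, filter rows, keep only
// the needed columns, and send a few hundred records per API call.
// Column indexes, filter value, endpoint URL, and payload shape are assumptions.

const BATCH_SIZE = 200;
const API_URL    = 'https://destination.example.com/api/records/batch'; // hypothetical endpoint

function sendBatch(array $batch): void
{
    $ch = curl_init(API_URL);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($batch),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
}

$fh    = fopen('source.csv', 'r');
$batch = [];
fgetcsv($fh);                               // skip the header row

while (($row = fgetcsv($fh)) !== false) {
    if ($row[5] !== 'ACTIVE') {             // hypothetical filter column/value
        continue;
    }
    $batch[] = [                            // keep only the columns you need
        'id'     => $row[0],
        'name'   => $row[3],
        'amount' => $row[12],
    ];
    if (count($batch) >= BATCH_SIZE) {
        sendBatch($batch);
        $batch = [];
    }
}
if ($batch) {
    sendBatch($batch);                      // flush the final partial batch
}
fclose($fh);
```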
Approach 2: MyISAM will be faster than InnoDB, because it avoids the overhead and complexity of ACID. If MyISAM is acceptable to you, try it first.
For InnoDB, there is a better Approach 3 (actually a mix of Approach 1 and Approach 2). Because InnoDB doesn't lock the whole table, you can import multiple files concurrently: split the CSV into several smaller files and run LOAD DATA on each of them from your scripts (see the sketch below). Don't add an AUTO_INCREMENT key to the table at first, to avoid the auto-inc lock.
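A rough sketch of that, assuming a staging table tmp_rows already exists and that you start several copies of the loader in parallel (file names, table name, credentials, and chunk size are all placeholders):

```php
<?php
// Sketch of Approach 3: split the big CSV into chunk files, then run one
// worker per chunk so several LOAD DATA statements execute concurrently.
// Table name, credentials, and chunk size are assumptions; the server must
// also allow LOCAL INFILE.

const CHUNK_ROWS = 50000;

// Split source.csv into chunk_000.csv, chunk_001.csv, ... (header dropped).
function splitCsv(string $source): array
{
    $in = fopen($source, 'r');
    fgets($in);                              // discard the header line
    $chunks = [];
    $n = 0;
    $rows = 0;
    $out = null;
    while (($line = fgets($in)) !== false) {
        if ($rows % CHUNK_ROWS === 0) {
            if ($out) {
                fclose($out);
            }
            $name     = sprintf('chunk_%03d.csv', $n++);
            $chunks[] = $name;
            $out      = fopen($name, 'w');
        }
        fwrite($out, $line);
        $rows++;
    }
    if ($out) {
        fclose($out);
    }
    fclose($in);
    return $chunks;
}

// Load one chunk; run several copies of this in parallel, one per chunk.
function loadChunk(string $file): void
{
    $pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);
    // tmp_rows deliberately has no AUTO_INCREMENT column (avoids the auto-inc lock).
    $pdo->exec("LOAD DATA LOCAL INFILE " . $pdo->quote($file) . "
                INTO TABLE tmp_rows
                FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
                LINES TERMINATED BY '\\n'");
}

// Usage: php import.php split              (once, prints the chunk names)
//        php import.php load chunk_000.csv (one invocation per worker)
if ($argv[1] === 'split') {
    echo implode("\n", splitCsv('source.csv')), "\n";
} else {
    loadChunk($argv[2]);
}
```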
Do the column and row filtering entirely in SQL, in two statements (both are sketched below):

1. LOAD DATA, but say @dummy1, @dummy2, etc. for the columns that you don't need to keep. That gets rid of the extra columns. Load into a temp table. (1 SQL statement.)
2. INSERT INTO real_table SELECT ... FROM tmp_table WHERE ... to both filter out unnecessary rows and copy the desired ones into the real table. (1 SQL statement.)

You did not mention any need for step 2.
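A sketch of both statements run from PHP with PDO, with placeholder table names, columns, and WHERE condition (the real schema is not shown in the question):

```php
<?php
// Sketch of the two-statement import: LOAD DATA with @dummy variables drops
// the unwanted columns; INSERT ... SELECT filters the rows.
// Table names, column names, and the WHERE condition are assumptions.

$pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

// Statement 1: keep only the wanted columns. The real file has 40 columns,
// so the column list would have 40 entries, with @dummyN for each one you drop.
$pdo->exec("LOAD DATA LOCAL INFILE 'source.csv'
            INTO TABLE tmp_table
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
            (id, @dummy1, name, @dummy2, @dummy3, amount, status)");

// Statement 2: copy only the wanted rows into the real table.
$pdo->exec("INSERT INTO real_table (id, name, amount)
            SELECT id, name, amount
            FROM tmp_table
            WHERE status = 'ACTIVE'");
```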
Some things you might need: in one project I did INSERT INTO summary SELECT a, b, COUNT(*), SUM(c) FROM tmp GROUP BY a, b; or an INSERT ... ON DUPLICATE KEY UPDATE to deal with rows already existing in summary (one shape of that is sketched below).
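One possible shape of that statement, assuming summary has a unique key on (a, b) and counter/total columns (all names are placeholders):

```php
<?php
// Sketch of the INSERT ... ON DUPLICATE KEY UPDATE variant: re-running the
// import adds to existing summary rows instead of failing on the unique key.
// Assumes summary has a PRIMARY or UNIQUE key on (a, b); names are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'pass');

$pdo->exec("INSERT INTO summary (a, b, cnt, total)
            SELECT a, b, COUNT(*), SUM(c)
            FROM tmp
            GROUP BY a, b
            ON DUPLICATE KEY UPDATE
                cnt   = cnt   + VALUES(cnt),
                total = total + VALUES(total)");
```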
Whether to use MyISAM, InnoDB, or MEMORY for the tmp table -- you would need to benchmark your own case. No index is needed on the tmp table, since each processing step does a full table scan.
So, 0.3GB every 24 hours -- should be no sweat.