I have a database with just over 70 million rows in it. The data was originally parsed and imported from roughly 70,000 XML files. These files are updated every week, so I need to scan through them (via a cron job on Sundays at 2 AM) and update rows that have changed or insert new rows.
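For context, the cron entry is along these lines (the script path here is just a placeholder):

# minute hour day-of-month month day-of-week (0 = Sunday)
0 2 * * 0 /usr/bin/php /path/to/import_timetables.php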
$operatorSQL = "INSERT IGNORE INTO `operator` (`reference`, `national_operator_code`, `operator_code`, `operator_short_name`, `operator_name_on_license`, `trading_name`) VALUES (:reference, :nationalOperatorCode, :operatorCode, :operatorShortName, :operatorNameOnLicense, :tradingName);";
$serviceSQL = "INSERT IGNORE INTO `service` (`service_code`, `private_code`, `date_start`, `date_end`, `mode`, `description`, `origin`, `destination`) VALUES (:serviceCode, :privateCode, :dateStart, :dateEnd, :mode, :description, :origin, :destination);";
$serviceOperatorSQL = "INSERT IGNORE INTO `service_operator` (`service_code`, `operator_reference`) VALUES (:serviceCode, :operatorReference);";
$journeyPatternSQL = "INSERT IGNORE INTO `journey_pattern` (`reference`, `direction`, `destination_display`, `vehicle_type_code`, `vehicle_type_description`) VALUES (:reference, :direction, :destinationDisplay, :vehicleTypeCode, :vehicleTypeDescription);";
$journeyPatternRouteSQL = "INSERT IGNORE INTO `journey_pattern_route` (`journey_pattern_reference`, `route_reference`) VALUES (:reference, :routeReference);";
$journeyPatternSectionLink = "INSERT IGNORE INTO `journey_pattern_section_link` (`journey_pattern_reference`, `journey_pattern_section_reference`) VALUES (:reference, :journeyPatternSectionReference);";
$journeyPatternSectionSQL = "INSERT IGNORE INTO `journey_pattern_section` (`reference`) VALUES (:reference);";
$lineSQL = "INSERT IGNORE INTO `service_line` (`service_code`, `name`) VALUES (:serviceCode, :name);";
$timingLinkSQL = "INSERT IGNORE INTO `journey_pattern_timing_link` (`reference`, `stop_from`, `stop_from_timing`, `stop_from_sequence_number`, `stop_from_activity`, `stop_to`, `stop_to_timing`, `stop_to_sequence`, `stop_to_activity`, `run_time`, `direction`) VALUES (:reference, :stopFrom, :stopFromTiming, :stopFromSequenceNumber, :stopFromActivity, :stopTo, :stopToTiming, :stopToSequenceNumber, :stopToActivity, :runTime, :direction)";
$timingLinkJpsSQL = "INSERT INTO `journey_pattern_timing_link_jps` (`journey_pattern_timing_link`, `journey_pattern_section_reference`) VALUES (:linkReference, :sectionReference);";
$timingLinkRouteLinkRefSQL = "INSERT INTO `journey_pattern_timing_link_rlr` (`journey_pattern_timing_link`, `route_link_reference`) VALUES (:linkReference, :routeLinkReference);";
$routeSQL = "INSERT IGNORE INTO `route` (`reference`, `private_code`, `description`) VALUES (:reference, :privateCode, :description);";
$routeSectionSQL = "INSERT IGNORE INTO `route_section` (`reference`) VALUES (:reference);";
$routeLinkSQL = "INSERT IGNORE INTO `route_link` (`reference`, `stop_from`, `stop_to`, `direction`, `distance`) VALUES (:reference, :stopFrom, :stopTo, :direction, :distance);";
$routeLinkSectionSQL = "INSERT INTO `route_link_section` (`route_link_reference`, `route_section_reference`) VALUES (:routeLinkReference, :routeSectionReference);";
$vehicleJourneySQL = "INSERT IGNORE INTO `vehicle_journey` (`reference`, `private_code`, `departure`) VALUES (:reference, :privateCode, :departure);";
$vehicleJourneyServiceSQL = "INSERT IGNORE INTO `vehicle_journey_service` (`vehicle_journey_reference`, `service_reference`) VALUES (:reference, :serviceRef);";
$vehicleJourneyLineSQL = "INSERT IGNORE INTO `vehicle_journey_line` (`vehicle_journey_reference`, `service_line_reference`) VALUES (:reference, :lineRef);";
$vehicleJourneyJpSQL = "INSERT IGNORE INTO `vehicle_journey_jp` (`vehicle_journey_reference`, `journey_pattern_reference`) VALUES (:reference, :journeyPatternRef);";
Above are all of the SQL queries that are performed. You will notice that the IGNORE clause is used in the INSERT statements; this is just to make sure that if any files contain duplicate data, no errors stop the script, it simply ignores the row and moves on.
However, I don't feel this is the most efficient way of doing it: when I run the script again after the initial import of all the data, it is just as slow as when the original inserts were executed... surely if 99.9% of the rows are unchanged it should skim through them? Any ideas why this is happening?
Query optimisation normally applies to SELECT, UPDATE and DELETE queries. Since you are just inserting data into the table(s), there is no query optimisation to be done; the engine does not have to work out some complicated plan to shove that data into the tables. The speed of an insert is simply a function of your CPU, hard-disk speed, network and I/O bandwidth, among other factors. The data you are inserting is not cached in any sense, so if you run the inserts again they will proceed at the same rate.
First, I would determine whether the performance issue is with the XML parsing or the database. What happens if you run the program but simply leave out the calls to the database -- does it take significantly less time than with the database calls in?
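As a rough way to check this, time the script with and without the database calls; runImport() and its dry-run flag below are hypothetical stand-ins for however your own script is structured:

$start = microtime(true);
runImport(true);   // hypothetical dry-run flag: parse the XML but skip all database calls
printf("Parse only: %.1f s\n", microtime(true) - $start);

$start = microtime(true);
runImport(false);  // parse the XML and perform the inserts as normal
printf("Parse + inserts: %.1f s\n", microtime(true) - $start);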
If it's the database, I would do the following, if reasonably possible given the nature of your data:
Load the key values of the rows already in the database into memory before you start the INSERTs, and only issue an INSERT statement if the cached-memory check indicates you don't have the record already. This will be possible if your key information is not large (integer key values, for instance) and if you're not running on a memory-constricted box. Otherwise, the cached keys may not fit in memory. If this is the case
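A minimal sketch of that key-caching approach, assuming PDO and using the `operator` table from the question as the example (the `$pdo` connection and the surrounding XML parse loop are assumed; the same pattern applies to the other tables):

// Load the keys of the rows that already exist, once, before parsing the XML files.
$existing = [];
foreach ($pdo->query("SELECT `reference` FROM `operator`") as $row) {
    $existing[$row['reference']] = true;
}

$operatorInsert = $pdo->prepare($operatorSQL);

// ... inside the XML parsing loop, for each operator element ...
if (!isset($existing[$reference])) {
    $operatorInsert->execute([
        ':reference'             => $reference,
        ':nationalOperatorCode'  => $nationalOperatorCode,
        ':operatorCode'          => $operatorCode,
        ':operatorShortName'     => $operatorShortName,
        ':operatorNameOnLicense' => $operatorNameOnLicense,
        ':tradingName'           => $tradingName,
    ]);
    $existing[$reference] = true; // so duplicates in later files are skipped too
}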