读取CSV并插入数据库性能

I have a mission to read a csv file line by line and insert them to database.

And the csv file contains about 1.7 million lines.

I use python with sqlalchemy orm(merge function) to do this. But it spend over five hours.

Is it caused by python slow performance or sqlalchemy or sqlalchemy?

or what if i use golang to do it to make a obvious better performance?(but i have no experience on go. Besides, this job need to be scheduled every month)

Hope you guy giving any suggestion, thanks!

Update: database - mysql

For such a mission you don't want to insert data line by line :) Basically, you have 2 ways:

  1. Ensure that sqlalchemy does not run queries one by one. Use BATCH INSERT query (How to do a batch insert in MySQL) instead.
  2. Massage your data in a way you need, then output it into some temporary CSV file and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess you data, just feed the CSV to the database (I assume it's MySQL)

Follow below three steps

  1. Save the CSV file with the name of table what you want to save it to.
  2. Execute below python script to create a table dynamically (Update CSV filename, db parameters)
  3. Execute "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv"

PYTHON CODE:

import numpy as np
import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)
table_name = csv_file.split('.')

query = "CREATE TABLE " + table_name[0] + "( 
" 
for count in np.arange(df.columns.values.size):
    query += df.columns.values[count]
    if df.dtypes[count] == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif df.dtypes[count] == 'object':
        query += "\t\t varchar(64) NOT NULL"
    elif df.dtypes[count] == 'float64':
        query += "\t\t float(10,2) NOT NULL"


    if count == 0:
        query += " PRIMARY KEY"

    if count < df.columns.values.size - 1:
        query += ",
"

query += " );"
#print(query)

database = connect(host='localhost',  # your host
                     user='username', # username
                     passwd='password',     # password
                     db='dbname') #dbname
curs = database.cursor(dictionary=True)
curs.execute(query)
# print(query)