I have a table like this (more columns but these will do):
events
+----------+----------------+--------------------+------------------+------------------+---------+
| event_id | user_ipaddress | network_userid | domain_userid | user_fingerprint | user_id |
+----------+----------------+--------------------+------------------+------------------+---------+
| 1 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 2 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 3 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 4 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
+----------+----------------+--------------------+------------------+------------------+---------+
The table contains around 1M records. I'm trying to update all records to set the user_id
.
I'm using a very simple PHP script for that.
I'm looping over each record with user_id = NULL
and SELECT
from the entire table to find existing user_id
based on user_ipaddress
, network_userid
, domain_userid
and/or user_fingerprint
.
If nothing was found I will generate a unique user_id
and UPDATE
the record.
If a match was found I will UPDATE
the record with the correspondent user_id
.
The query looks like this:
UPDATE events SET user_id = 'abc' WHERE event_id = '1'
The SELECT
part is super fast (~5ms).
The UPDATE
part starts fast (~10ms) but becomes slower (~800ms) after a few hundred updates.
If I wait for around 10-20 minutes it's becomes fast again.
I'm running a PostgreSQL 9.3.3 on AWS RDS (db.m1.medium) with General Purpose SSD storage. I have indexes on all columns combined and individually.
I have played with FILLFACTOR
and currently it's set to 70
. I have tried to run VACUUM FULL events
, but I never know if it finished (waited more than 1h). Also I've tried REINDEX TABLE events
.
I'm the only one using this server.
Here's an EXPLAIN ANALYZE
of the UPDATE
query:
Update on events (cost=0.43..8.45 rows=1 width=7479) (actual time=0.118..0.118 rows=0 loops=1)
-> Index Scan using events_event_id_idx on events (cost=0.43..8.45 rows=1 width=7479) (actual time=0.062..0.065 rows=1 loops=1)
Index Cond: (event_id = '1'::bpchar)
Total runtime: 0.224 ms
Any good ideas on how I can keep the query fast?
I found out that the problem was caused by the filesystem chosen for my RDS instance. I was running with General Purpose Storage (SSD)
. It apparently has some I/O limits. So the solution was to switch storage. Now I'm running the Provisioned IOPS Storage
and the performance improved instantly.
Also a solution could be to stick to the General Purpose Storage (SSD)
and increase storage size, since that would increase I/O limits as well.
Read more: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD
Thanks for all replies. And thanks to @Dan and @ArtemGr for pointing me in that direction.
Over the 10-20 minutes to become fast again, do you get a gradual improvement?
Things I'd check:
I am just guessing, It is because your primary key is char not int. Try to convert your primary key into int and see the result.
Your explain analyze result says Index Cond: (event_id = '1'::bpchar)