Postgres - UPDATE becomes slower over time

I have a table like this (more columns but these will do):

events
+----------+----------------+--------------------+------------------+------------------+---------+  
| event_id | user_ipaddress |   network_userid   |  domain_userid   | user_fingerprint | user_id |  
+----------+----------------+--------------------+------------------+------------------+---------+  
|        1 | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 |       2199066221 |         |  
|        2 | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 |       2199066221 |         |  
|        3 | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 |       2199066221 |         |  
|        4 | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 |       2199066221 |         |  
+----------+----------------+--------------------+------------------+------------------+---------+

The table contains around 1M records. I'm trying to update all records to set the user_id.
I'm using a very simple PHP script for that.

I'm looping over each record with user_id = NULL and running a SELECT against the entire table to find an existing user_id based on user_ipaddress, network_userid, domain_userid and/or user_fingerprint.

If nothing was found, I generate a unique user_id and UPDATE the record.
If a match was found, I UPDATE the record with the corresponding user_id.
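(Side note: row-by-row updates from a script pay a per-statement round-trip and commit cost; the same backfill can often be expressed as one set-based statement. A sketch, assuming for illustration that rows are matched on domain_userid alone — the real matching rules would combine the other columns:)

```sql
-- Sketch: copy an existing user_id onto rows that are missing one,
-- matching on domain_userid. Adjust the join condition to the real
-- matching rules (ip / network_userid / fingerprint combinations).
UPDATE events AS e
SET    user_id = m.user_id
FROM   events AS m
WHERE  e.user_id IS NULL
  AND  m.user_id IS NOT NULL
  AND  m.domain_userid = e.domain_userid;
```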

The query looks like this:

UPDATE events SET user_id = 'abc' WHERE event_id = '1'

The SELECT part is super fast (~5 ms).
The UPDATE part starts fast (~10 ms) but slows down to ~800 ms after a few hundred updates.

If I wait for around 10-20 minutes, it becomes fast again.

I'm running PostgreSQL 9.3.3 on AWS RDS (db.m1.medium) with General Purpose (SSD) storage. I have indexes on these columns, both individually and combined.

I have played with FILLFACTOR, which is currently set to 70. I tried running VACUUM FULL events, but I could never tell whether it finished (I waited more than an hour). I've also tried REINDEX TABLE events.
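(For reference, the commands involved look like this. One caveat: a lowered fillfactor only helps UPDATEs via HOT updates, which require that no indexed column changes — so an index on user_id itself defeats that optimization for this particular UPDATE:)

```sql
-- Plain VACUUM, unlike VACUUM FULL, does not rewrite the table and
-- does not take an exclusive lock, so it can run during normal use:
VACUUM ANALYZE events;

-- FILLFACTOR below 100 leaves free space in each page so an UPDATE
-- can place the new row version on the same page (a HOT update);
-- it only affects pages written after it is set:
ALTER TABLE events SET (fillfactor = 70);
```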

I'm the only one using this server.

Here's an EXPLAIN ANALYZE of the UPDATE query:

Update on events  (cost=0.43..8.45 rows=1 width=7479) (actual time=0.118..0.118 rows=0 loops=1)
  ->  Index Scan using events_event_id_idx on events  (cost=0.43..8.45 rows=1 width=7479) (actual time=0.062..0.065 rows=1 loops=1)
        Index Cond: (event_id = '1'::bpchar)
Total runtime: 0.224 ms

Any good ideas on how I can keep the query fast?

I found out that the problem was caused by the storage type chosen for my RDS instance. I was running on General Purpose (SSD) storage, which has a baseline I/O limit plus a burst allowance; once the burst credits are used up, I/O is throttled. So the solution was to switch storage types. Now I'm running on Provisioned IOPS storage and the performance improved instantly.

Alternatively, you could stick with General Purpose (SSD) storage and increase the storage size, since the baseline I/O limit scales with volume size.
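As a rough worked example of those limits (numbers from the AWS documentation linked below): gp2 volumes earn a baseline of 3 IOPS per GiB and can burst to 3,000 IOPS from a credit bucket holding 5.4 million I/O credits. So, assuming a 100 GiB volume, the baseline is 300 IOPS, and a sustained burst at 3,000 IOPS drains the bucket in

  5,400,000 / (3,000 - 300) = 2,000 s ≈ 33 minutes

after which I/O is throttled to the 300 IOPS baseline until credits accumulate again — consistent with the observed pattern of updates slowing down and recovering after 10-20 minutes of idle time.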

Read more: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD

Thanks for all replies. And thanks to @Dan and @ArtemGr for pointing me in that direction.

Over the 10-20 minutes it takes to become fast again, do you see a gradual improvement?

Things I'd check:

  • are you creating new connections with each update and leaving them open? They would then time out and close some time later.
  • what is the system load (CPU, memory, IO) doing? I did wonder whether the instance might support bursts, but I don't think so.

I am just guessing: it may be because your primary key is char, not int. Try converting your primary key to int and see the result.

Your EXPLAIN ANALYZE output shows Index Cond: (event_id = '1'::bpchar), i.e. the key is being compared as bpchar (blank-padded char), not as an integer.

  1. Integer data types are the best choice for a primary key, since integer values are processed faster than character values. A character primary key has to be compared character by character before processing.
  2. Fetching a record by primary key is faster with an integer key because more index entries fit on a single page, so the total search time decreases; joins are faster too. This applies when the query uses an index seek rather than a scan, and when a single table is involved. In the case of a scan, the narrower key simply means more rows fit on each data page.
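(To act on this suggestion, something like the following should work. This is a sketch: ALTER COLUMN ... TYPE takes an exclusive lock and rewrites the table, and the USING cast assumes every existing event_id value is numeric:)

```sql
-- Check how the planner currently sees the key type:
SELECT pg_typeof(event_id) FROM events LIMIT 1;

-- Convert a char-typed key to integer:
ALTER TABLE events
  ALTER COLUMN event_id TYPE integer USING event_id::integer;
```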

SQL Index - Difference Between char and int