We found that using minimal WAL (wal_level = minimal) together with the CREATE TABLE and COPY commands inside a single transaction causes a 10x to 300x slowdown, even though the PostgreSQL code and documentation say this combination should make the commands faster thanks to an optimization that writes the table directly and bypasses the WAL. The slowdown is amplified by the size of shared_buffers, because the FlushRelationBuffers function sequentially searches the entire shared buffer pool. The comments in FlushRelationBuffers even state that the function is not optimized because it is not used in any "performance-critical code paths". I would say it becomes a performance-critical code path when it slows your system down by 10 to 300 times. Skipping the WAL might be faster overall, but it is slower from the client connection's point of view, because work that normally runs in a background process is moved into the foreground process serving the connection. I show the PostgreSQL 10.6 source code because that is the version I am running; the problem also affects PostgreSQL 12. In this white paper I describe how we went about diagnosing this issue and reviewing the code to find the cause of the slowdown.
Normally we run with archive mode and streaming replication turned on, but we needed a new server that is going to be a standalone server. I disabled WAL archival and streaming replication, following the suggested guidelines in the docs, quoted below (the settings involved are sketched after the quote).
14.4.7. Disable WAL Archival and Streaming Replication
When loading large amounts of data into an installation that uses WAL archiving or streaming replication, it might be faster to take a new base backup after the load has completed than to process a large amount of incremental WAL data. To prevent incremental WAL logging while loading, disable archiving and streaming replication, by setting wal_level to minimal, archive_mode to off, and max_wal_senders to zero. But note that changing these settings requires a server restart.
Aside from avoiding the time for the archiver or WAL sender to process the WAL data, doing this will actually make certain commands faster, because they are designed not to write WAL at all if wal_level is minimal. (They can guarantee crash safety more cheaply by doing an fsync at the end than by writing WAL.) This applies to the following commands:
- CREATE TABLE AS SELECT
- CREATE INDEX (and variants such as ALTER TABLE ADD PRIMARY KEY)
- ALTER TABLE SET TABLESPACE
- CLUSTER
- COPY FROM, when the target table has been created or truncated earlier in the same transaction
https://www.postgresql.org/docs/10/populate.html#POPULATE-PITR
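For reference, one way to apply those three settings is with ALTER SYSTEM (we manage them in postgresql.auto.conf; this is just a sketch of the documented change, and all three parameters only take effect after a server restart):

```sql
-- Sketch: disable WAL archival and streaming replication as described above.
-- All three settings require a full server restart, not just a reload.
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
-- After restarting (e.g. pg_ctl restart), confirm the active value:
SHOW wal_level;
```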
When running our application against the new server, it took approximately 7.5 hours, where the old server would take less than 1 hour. I was quick to think this might be an OS issue, since we had just changed from openSUSE Leap 42.2 to Ubuntu 18.04.3 LTS (Bionic Beaver) due to a slight version change in the hardware RAID controller. When reviewing the PostgreSQL log files, I found that the new server seemed to be taking 10 times longer for the COPY command than the other servers.
Machine | OS | Application run time | Server Age | Hard Drives |
---|---|---|---|---|
Development | openSUSE Leap 42.2 | 45 minutes | 7 years | Spinning disk RAID |
New Server | Ubuntu 18.04.3 LTS (Bionic Beaver) | 7.5 hours | 2 weeks | SSD RAID |
Here is some of the relevant information from our postgresql.auto.conf file.
wal_level = 'minimal'              # default replica
max_wal_senders = 0
#archive_mode = off                # default off
shared_buffers = '50GB'
work_mem = '327MB'
maintenance_work_mem = '10GB'
effective_cache_size = '96GB'
#wal_sync_method = fsync           # default for linux is fsync
random_page_cost = '1.1'
effective_io_concurrency = '256'
For more information about random_page_cost, see https://momjian.us/main/writings/pgsql/hw_selection.pdf (page 17). For more information about effective_io_concurrency, see the same document (page 17), https://news.ycombinator.com/item?id=12517368, and https://portavita.github.io/2019-07-19-PostgreSQL_effective_io_concurren....
Since our application is very large, we needed to create a small test case to reproduce the problem. The programming group created a 50,000 row CSV file (480,752 bytes) and a Groovy script that creates 10 tables and loads 1,000 rows of the CSV file at a time while rotating through the tables, so the test ends up doing 5 COPY commands against each of the 10 tables. Looking at the PostgreSQL log files, I could see that each COPY was taking 70 ms to 120 ms, compared to 0.8 ms to 1.3 ms on our 7 year old test machine, so clearly something was wrong with the new machine. I also tested on 1 year old hardware with SSDs that matches the new standalone server but runs a different OS; there the COPY command took 1.2 ms to 1.5 ms. We later boiled this down to a test SQL file that reproduces the same issue; a sketch of what it does is below.
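Here is a rough sketch of what that repro file does. The real slow-copy-inside-txn-repro.sql loads our CSV data in 1,000 row chunks; the schema, table, column, and file names below are stand-ins, not the actual ones:

```sql
-- Sketch of slow-copy-inside-txn-repro.sql: CREATE TABLE and COPY in one transaction.
SELECT pg_sleep(30);   -- gives us time to find the backend and attach tools (used later in this paper)
BEGIN;
DROP SCHEMA IF EXISTS deleteme CASCADE;
CREATE SCHEMA deleteme;
CREATE TABLE deleteme."table" (foo text);
\timing on
\copy deleteme."table" (foo) FROM '/tmp/chunk_01.csv' WITH (FORMAT csv)
\copy deleteme."table" (foo) FROM '/tmp/chunk_02.csv' WITH (FORMAT csv)
-- ... the remaining 1,000-row chunks, 14 COPY commands in total ...
\timing off
ROLLBACK;
```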
Machine | OS | COPY time | Server Age | Hard Drives |
---|---|---|---|---|
Development | openSUSE Leap 42.2 | 0.8 ms to 1.3 ms | 7 years | Spinning disk RAID |
Production | openSUSE Leap 42.2 | 1.2 ms to 1.5 ms | 1 year | SSD RAID |
New Server | Ubuntu 18.04.3 LTS (Bionic Beaver) | 70 ms to 120 ms | 2 weeks | SSD RAID |
Development (COPY 1000 time) | Production (COPY 1000 time) | New Server (COPY 1000 time) |
---|---|---|
1.365 ms | 1.527 ms | 121.331 ms |
0.893 ms | 1.461 ms | 101.869 ms |
0.892 ms | 1.423 ms | 101.375 ms |
0.996 ms | 1.241 ms | 101.558 ms |
0.958 ms | 1.362 ms | 102.500 ms |
0.899 ms | 1.365 ms | 103.690 ms |
1.019 ms | 1.383 ms | 102.611 ms |
1.031 ms | 1.406 ms | 101.177 ms |
1.116 ms | 1.358 ms | 101.614 ms |
1.041 ms | 1.408 ms | 103.256 ms |
1.073 ms | 1.258 ms | 102.448 ms |
0.911 ms | 1.339 ms | 101.128 ms |
0.939 ms | 1.349 ms | 100.869 ms |
0.968 ms | 1.384 ms | 101.728 ms |
This led me down the path of looking at wal_sync_method, because I assumed that, for ACID compliance, all the data has to be written to the WAL files; I would later learn that this is not always true.
/usr/local/apps/postgres/current/bin/pg_test_fsync -s 3
3 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     44835.635 ops/sec      22 usecs/op
        fdatasync                         42148.497 ops/sec      24 usecs/op
        fsync                              8383.168 ops/sec     119 usecs/op
        fsync_writethrough                            n/a
        open_sync                          8501.906 ops/sec     118 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     22627.033 ops/sec      44 usecs/op
        fdatasync                         36627.596 ops/sec      27 usecs/op
        fsync                              8155.612 ops/sec     123 usecs/op
        fsync_writethrough                            n/a
        open_sync                          4256.142 ops/sec     235 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          8320.870 ops/sec     120 usecs/op
         2 *  8kB open_sync writes         4247.050 ops/sec     235 usecs/op
         4 *  4kB open_sync writes         2149.511 ops/sec     465 usecs/op
         8 *  2kB open_sync writes         1063.186 ops/sec     941 usecs/op
        16 *  1kB open_sync writes          533.648 ops/sec    1874 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        write, fsync, close                7801.971 ops/sec     128 usecs/op
        write, close, fsync                7804.802 ops/sec     128 usecs/op

Non-sync'ed 8kB writes:
        write                            553329.822 ops/sec       2 usecs/op
While the above output shows that open_datasync has the most operations per second, that did not play out when actually running our test script. The timings below are from running our 50,000 row test script with each of the different wal_sync_method settings (the commands used to switch methods between runs are sketched after the table). This leads me to believe that fsync is still better than open_datasync on this hardware, even though open_datasync has more operations per second than fsync in the pg_test_fsync output.
wal_sync_method | Total Test Time |
---|---|
fsync | 0:09.27 |
fdatasync | 0:15.75 |
open_datasync | 0:15.16 |
open_sync | 0:11.73 |
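For completeness, wal_sync_method only needs a configuration reload between runs, not a restart; switching it for each test looked roughly like this (assuming superuser access):

```sql
-- Sketch: change wal_sync_method between test runs and reload the configuration.
ALTER SYSTEM SET wal_sync_method = 'fsync';  -- or 'fdatasync', 'open_datasync', 'open_sync'
SELECT pg_reload_conf();
SHOW wal_sync_method;                        -- confirm the active value before re-running the test
```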
At this point, we decided to look at pg_stat_statements. Since I already have it loaded via shared_preload_libraries, it is easy to add to a database and use.
CREATE EXTENSION pg_stat_statements;
SELECT pg_stat_statements_reset();
Now run the test script and then query the pg_stat_statements afterwards.
SELECT * FROM pg_stat_statements;
Values that are 0 for both the fast and slow tests have been removed from the following tables. They are: local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written, blk_read_time and blk_write_time.
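If you only want the COPY statements and the interesting columns, a narrower query keeps the output manageable. The filter below is an assumption about how the statements appear in pg_stat_statements; adjust it to match your query text:

```sql
-- Sketch: just the COPY statements, with the columns shown in the tables below.
SELECT queryid, calls, total_time, min_time, max_time, mean_time, stddev_time, rows,
       shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written
  FROM pg_stat_statements
 WHERE query ILIKE 'copy %'
 ORDER BY total_time DESC;
```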
queryid | calls | total_time | min_time | max_time | mean_time | stddev_time | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written |
---|---|---|---|---|---|---|---|---|---|---|---|
2862154959 | 5 | 390.252225 | 76.88265 | 79.205529 | 78.050445 | 0.974304313734883 | 5000 | 19803 | 186 | 190 | 74 |
1125121860 | 5 | 394.627177 | 77.423235 | 79.932728 | 78.9254354 | 0.835203713590062 | 5000 | 19803 | 186 | 190 | 74 |
757962155 | 5 | 392.687423 | 76.734166 | 79.275727 | 78.5374846 | 0.930741207527982 | 5000 | 19803 | 186 | 190 | 74 |
4979868 | 5 | 386.722666 | 76.950663 | 77.755313 | 77.3445332 | 0.315288358332753 | 5000 | 19803 | 186 | 190 | 74 |
4108769240 | 5 | 385.043845 | 76.662604 | 77.72198 | 77.008769 | 0.370750516167408 | 5000 | 19803 | 186 | 190 | 74 |
982812922 | 5 | 389.727785 | 77.449501 | 78.762465 | 77.945557 | 0.435613480251932 | 5000 | 19803 | 186 | 190 | 74 |
3468077325 | 5 | 390.009387 | 77.282116 | 79.029509 | 78.0018774 | 0.658153792248161 | 5000 | 19803 | 186 | 190 | 74 |
3466826688 | 5 | 385.310237 | 76.51649 | 77.757097 | 77.0620474 | 0.460984058299894 | 5000 | 19803 | 186 | 190 | 74 |
3943553597 | 5 | 412.839826 | 77.706408 | 97.593635 | 82.5679652 | 7.54129887371314 | 5000 | 19803 | 186 | 190 | 74 |
1256848945 | 5 | 388.654889 | 76.841249 | 79.531977 | 77.7309778 | 1.09063370887762 | 5000 | 19803 | 186 | 190 | 74 |
In testing, I found that if we commit after the CREATE TABLE statements, the copies are fast. You will need to reset pg_stat_statements before running the new test.
SELECT pg_stat_statements_reset();
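The only change to the test is ending the transaction that creates the tables before any COPY runs. Using the hypothetical names from the repro sketch earlier, the fast variant looks roughly like this:

```sql
-- Sketch: commit the CREATE TABLE before the COPY commands start, so COPY no
-- longer targets a table created in its own transaction.
BEGIN;
DROP SCHEMA IF EXISTS deleteme CASCADE;
CREATE SCHEMA deleteme;
CREATE TABLE deleteme."table" (foo text);
COMMIT;                                   -- the important difference

\timing on
\copy deleteme."table" (foo) FROM '/tmp/chunk_01.csv' WITH (FORMAT csv)
\copy deleteme."table" (foo) FROM '/tmp/chunk_02.csv' WITH (FORMAT csv)
-- ... remaining 1,000-row chunks ...
\timing off
```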
Here is the pg_stat_statements for this test.
queryid | calls | total_time | min_time | max_time | mean_time | stddev_time | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written |
---|---|---|---|---|---|---|---|---|---|---|---|
2862154959 | 5 | 39.106633 | 6.855429 | 9.02918 | 7.8213266 | 0.726818419868842 | 5000 | 19967 | 188 | 187 | 0 |
1125121860 | 5 | 39.710882 | 7.167101 | 8.675202 | 7.9421764 | 0.656001857053652 | 5000 | 19968 | 188 | 187 | 0 |
757962155 | 5 | 38.905694 | 6.890894 | 8.868478 | 7.7811388 | 0.677629815512688 | 5000 | 19968 | 188 | 187 | 0 |
4979868 | 5 | 38.482345 | 7.08392 | 8.226157 | 7.696469 | 0.380047955497198 | 5000 | 19967 | 188 | 187 | 0 |
4108769240 | 5 | 39.587987 | 6.878957 | 8.880604 | 7.9175974 | 0.658522980833501 | 5000 | 19967 | 188 | 187 | 0 |
982812922 | 5 | 39.087789 | 7.566033 | 8.191758 | 7.8175578 | 0.237778606090119 | 5000 | 19967 | 188 | 187 | 0 |
3468077325 | 5 | 39.635136 | 7.382916 | 8.945652 | 7.9270272 | 0.542390095733837 | 5000 | 19967 | 188 | 187 | 0 |
3466826688 | 5 | 39.615545 | 6.897731 | 9.071549 | 7.923109 | 0.709376866450831 | 5000 | 19967 | 188 | 187 | 0 |
3943553597 | 5 | 39.76028 | 7.632807 | 8.212925 | 7.952056 | 0.201387453103713 | 5000 | 19969 | 188 | 187 | 0 |
1256848945 | 5 | 39.041628 | 7.036939 | 8.444774 | 7.8083256 | 0.478956203302807 | 5000 | 19967 | 188 | 187 | 0 |
We can see from the two tests that when we commit after the CREATE TABLE, PostgreSQL does not write any shared buffers to disk during the COPY (shared_blks_written is 0), whereas running everything in one transaction forces each COPY to write shared buffers itself (shared_blks_written is 74).
So maybe these writes are causing some type of I/O issue, or something else that we can monitor using PostgreSQL's wait events. We can watch the wait events by looping over the pg_stat_activity view.
while true; do \
    /usr/local/postgres-current/bin/psql -d sandbox -t -A -c \
    "SELECT wait_event_type, wait_event, query FROM pg_stat_activity WHERE query LIKE 'copy %' AND state = 'active'"; \
done;
Now run the test script.
/usr/local/postgres-current/bin/psql -h db0 -d sandbox -f slow-copy-inside-txn-repro.sql
 pg_sleep
----------

(1 row)

BEGIN
psql:slow-copy-inside-txn-repro.sql:9: NOTICE: schema "deleteme" does not exist, skipping
DROP SCHEMA
CREATE SCHEMA
CREATE TABLE
Timing is on.
COPY 1000
Time: 121.331 ms
COPY 1000
Time: 101.869 ms
COPY 1000
Time: 101.375 ms
COPY 1000
Time: 101.558 ms
COPY 1000
Time: 102.500 ms
COPY 1000
Time: 103.690 ms
COPY 1000
Time: 102.611 ms
COPY 1000
Time: 101.177 ms
COPY 1000
Time: 101.614 ms
COPY 1000
Time: 103.256 ms
COPY 1000
Time: 102.448 ms
COPY 1000
Time: 101.128 ms
COPY 1000
Time: 100.869 ms
COPY 1000
Time: 101.728 ms
Timing is off.
ROLLBACK
Once the test script is finished, press Ctrl+C to end the loop.
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
||COPY deleteme.table (foo) FROM stdin;
^C
As you can see from the blank wait events, PostgreSQL is not waiting on anything. This means the backend is CPU bound, i.e. busy running its own code.
If we run this same test against our 7 year old test hardware, we see an amazing difference.
/usr/local/postgres-current/bin/psql -h db0 -d sandbox -f slow-copy-inside-txn-repro.sql
 pg_sleep
----------

(1 row)

BEGIN
psql:slow-copy-inside-txn-repro.sql:9: NOTICE: schema "deleteme" does not exist, skipping
DROP SCHEMA
CREATE SCHEMA
CREATE TABLE
Timing is on.
COPY 1000
Time: 1.365 ms
COPY 1000
Time: 0.893 ms
COPY 1000
Time: 0.892 ms
COPY 1000
Time: 0.996 ms
COPY 1000
Time: 0.958 ms
COPY 1000
Time: 0.899 ms
COPY 1000
Time: 1.019 ms
COPY 1000
Time: 1.031 ms
COPY 1000
Time: 1.116 ms
COPY 1000
Time: 1.041 ms
COPY 1000
Time: 1.073 ms
COPY 1000
Time: 0.911 ms
COPY 1000
Time: 0.939 ms
COPY 1000
Time: 0.968 ms
Timing is off.
ROLLBACK
Our brand new SSD server is 100 times slower than our 7 year old spinning disk server.
Now we want to profile PostgreSQL to find out where it is spending all its time. To do so, we recompile PostgreSQL with options that make it easier to debug and profile.
https://wiki.postgresql.org/wiki/Developer_FAQ#Compile-time
./configure --enable-cassert --enable-debug \
    CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
To profile PostgreSQL and see where it is spending all its time, we need to install perf. For more information on perf, see http://www.brendangregg.com/perf.html
perf top
Now that I know all the time is being spent in FlushRelationBuffers, I would like to know what inside that function is taking up the time. I can do this by profiling the specific backend process running the COPY. To make that process easier to find, add a pg_sleep(30) just after the connection is made; this gives you time to find the connection and attach perf to the process before any important work is performed. We watch pg_stat_activity until we see the sleeping process in the test database and then ask perf to record what it sees happening.
while true; \
do /usr/local/postgres-current/bin/psql -d postgres -t -A -c \
    "SELECT pid FROM pg_stat_activity WHERE query LIKE '%pg_sleep%' AND datname = 'sandbox'"; \
done;

perf record -p PID
After the test script's backend process ends, perf will automatically stop as well. We can then view the perf report.
Samples: 22K of event 'cycles:ppp', Event count (approx.): 24367076801
Overhead Command Shared Object Symbol
90.08% postgres postgres [.] FlushRelationBuffers
1.19% postgres postgres [.] rnode_comparator
0.82% postgres postgres [.] nocache_index_getattr
0.67% postgres postgres [.] DropRelFileNodesAllBuffers
We can now see that FlushRelationBuffers is taking up 90% of the time for our test script. If we want to know what inside that function is taking up the time, we can press the down arrow to highlight the FlushRelationBuffers line and then press a to annotate the source code of that function. Once we see the annotation, we can press k to see the source code line numbers, which helps when referencing the full source code.
FlushRelationBuffers /usr/local/postgres-10.6-debug/bin/postgres
Percent│3200 for (i = 0; i < NBuffers; i++)
0.03 │19e: add $0x1,%r13d
│1a2: cmp %r13d,NBuffers
│ ↓ jle 279
│3204 bufHdr = GetBufferDescriptor(i);
3.07 │ movslq %r13d,%rbx
│ shl $0x6,%rbx
│ add BufferDescriptors,%rbx
│3210 if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
│ mov 0x8(%r12),%eax
0.04 │ cmp %eax,0x8(%rbx)
96.86 │ ↑ jne 19e
│ mov 0x4(%r12),%eax
│ cmp %eax,0x4(%rbx)
│ ↑ jne 19e
│ mov (%r12),%eax
│ cmp %eax,(%rbx)
│ ↑ jne 19e
│3213 ReservePrivateRefCountEntry();
│ → callq ReservePrivateRefCountEntry
At this point, go to https://doxygen.postgresql.org/ and enter FlushRelationBuffers into the search field. Just remember that this searches the master code base and not your specific PostgreSQL version; in my case I am using version 10.6. The master branch has this same issue, just at different line numbers, so it was a fast way to find the file. The file browser on the left-hand side of doxygen shows the directory structure where the file lives. I could then look up the specific version on GitHub at https://github.com/postgres/postgres/tree/REL_10_6 and follow the same directory structure down to the exact same file for my specific version of PostgreSQL.
https://github.com/postgres/postgres/blob/REL_10_6/src/backend/storage/b...
/* ---------------------------------------------------------------------
 *		FlushRelationBuffers
 *
 *		This function writes all dirty pages of a relation out to disk
 *		(or more accurately, out to kernel disk buffers), ensuring that the
 *		kernel has an up-to-date view of the relation.
 *
 *		Generally, the caller should be holding AccessExclusiveLock on the
 *		target relation to ensure that no other backend is busy dirtying
 *		more blocks of the relation; the effects can't be expected to last
 *		after the lock is released.
 *
 *		XXX currently it sequentially searches the buffer pool, should be
 *		changed to more clever ways of searching.  This routine is not
 *		used in any performance-critical code paths, so it's not worth
 *		adding additional overhead to normal paths to make it go faster;
 *		but see also DropRelFileNodeBuffers.
 * --------------------------------------------------------------------
 */
void
FlushRelationBuffers(Relation rel)
{
	for (i = 0; i < NBuffers; i++)
	{
		uint32		buf_state;

		bufHdr = GetBufferDescriptor(i);

		/*
		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
		 * and saves some cycles.
		 */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
			continue;
The comment in the code appears to be wrong, because to me this is a performance-critical code path if it slows PostgreSQL down by 10x or more.
XXX currently it sequentially searches the buffer pool, should be changed to more clever ways of searching. This routine is not used in any performance-critical code paths, so it's not worth adding additional overhead to normal paths to make it go faster; but see also DropRelFileNodeBuffers.
As stated in the comment, the code loops through the shared buffers and checks each buffer / page to see if it belongs to the new file and needs to be flushed to disk. This means the for loop above iterates over NBuffers entries, where NBuffers is shared_buffers / block_size. On our server that is more than 6.5 million buffers.
6,553,600 NBuffers = 50G shared_buffers / 8K block_size
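You can read this number directly from a running server: pg_settings reports shared_buffers in units of 8 kB blocks, so its setting column is effectively NBuffers. This is just a sanity check, not part of the original investigation:

```sql
-- shared_buffers is reported in 8 kB blocks, so "setting" is the NBuffers value
-- that FlushRelationBuffers has to scan on every heap_sync.
SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'block_size');
-- With shared_buffers = '50GB' and an 8 kB block size, setting comes back as 6553600.
```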
Tuning your shared_buffers: the PostgreSQL High Performance book talks about using 25% of RAM with a maximum of 8 GB for shared_buffers, while Robert Haas has a blog post saying that if you can put 100% of your database in RAM, you should do so.
PostgreSQL 9.6 High Performance by Ibrar Ahmed & Gregory Smith, pages 129-131, "Starting size guidelines" and "Platform, version, and workload limitations" (talks about 25% of RAM with a maximum of 8 GB).
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-h...
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-1...
http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html
For shared_buffers, the quick answer is to allocate about 25% of system memory to shared_buffers, as recommended by the official documentation and by the wiki article on Tuning Your PostgreSQL server, but not more than about 8GB on Linux or 512MB on Windows, and sometimes less. However, I've recently become aware of a number of cases which suggest that higher values may perform much better. PostgreSQL uses the operating system's memory for caching, but, as recent performance results have shown, there can be a huge performance benefit to having the entire database in shared_buffers as opposed to merely having it in RAM. Several EnterpriseDB customers or potential customers have reported excellent results from cranking up shared_buffers to very large values, even giving PostgreSQL the lion's share of system memory for its own use. This flies in the face of some conventional wisdom that PostgreSQL doesn't handle large shared_buffers settings well, but enough people have reported good results that it seems worth taking this strategy seriously.
RelFileNodeEquals is really just a macro that does a comparison. When we look it up, we can find its source code.
https://github.com/postgres/postgres/blob/REL_10_6/src/include/storage/r...
/*
 * Note: RelFileNodeEquals and RelFileNodeBackendEquals compare relNode first
 * since that is most likely to be different in two unequal RelFileNodes.  It
 * is probably redundant to compare spcNode if the other fields are found equal,
 * but do it anyway to be sure.  Likewise for checking the backend ID in
 * RelFileNodeBackendEquals.
 */
#define RelFileNodeEquals(node1, node2) \
	((node1).relNode == (node2).relNode && \
	 (node1).dbNode == (node2).dbNode && \
	 (node1).spcNode == (node2).spcNode)
We are actually looking at the negated (!) version of this macro. After applying the negation, the condition that the assembly above is evaluating is effectively:

    ((node1).relNode != (node2).relNode ||
     (node1).dbNode != (node2).dbNode ||
     (node1).spcNode != (node2).spcNode)
If we look up what calls FlushRelationBuffers, we find that it is the heap_sync function. When we look at its code, we can see it calling FlushRelationBuffers.
https://github.com/postgres/postgres/blob/REL_10_6/src/backend/access/he...
/*
 *	heap_sync		- sync a heap, for use when no WAL has been written
 *
 * This forces the heap contents (including TOAST heap if any) down to disk.
 * If we skipped using WAL, and WAL is otherwise needed, we must force the
 * relation down to disk before it's safe to commit the transaction.  This
 * requires writing out any dirty buffers and then doing a forced fsync.
 *
 * Indexes are not touched.  (Currently, index operations associated with
 * the commands that use this are WAL-logged and so do not need fsync.
 * That behavior might change someday, but in any case it's likely that
 * any fsync decisions required would be per-index and hence not appropriate
 * to be done here.)
 */
void
heap_sync(Relation rel)
{
	/* non-WAL-logged tables never need fsync */
	if (!RelationNeedsWAL(rel))
		return;

	/* main heap */
	FlushRelationBuffers(rel);
	/* FlushRelationBuffers will have opened rd_smgr */
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
If we look up what calls heap_sync, we find that it is the CopyFrom function. When we look at its code, we can see where it decides whether it can skip writing WAL.
https://github.com/postgres/postgres/blob/REL_10_6/src/backend/commands/...
	/*----------
	 * Check to see if we can avoid writing WAL
	 *
	 * If archive logging/streaming is not enabled *and* either
	 *	- table was created in same transaction as this COPY
	 *	- data is being written to relfilenode created in this transaction
	 * then we can skip writing WAL. It's safe because if the transaction
	 * doesn't commit, we'll discard the table (or the new relfilenode file).
	 * If it does commit, we'll have done the heap_sync at the bottom of this
	 * routine first.
	 *
	 * As mentioned in comments in utils/rel.h, the in-same-transaction test
	 * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
	 * can be cleared before the end of the transaction. The exact case is
	 * when a relation sets a new relfilenode twice in same transaction, yet
	 * the second one fails in an aborted subtransaction, e.g.
	 *
	 * BEGIN;
	 * TRUNCATE t;
	 * SAVEPOINT save;
	 * TRUNCATE t;
	 * ROLLBACK TO save;
	 * COPY ...
	 *
	 * Also, if the target file is new-in-transaction, we assume that checking
	 * FSM for free space is a waste of time, even if we must use WAL because
	 * of archiving.  This could possibly be wrong, but it's unlikely.
	 *
	 * The comments for heap_insert and RelationGetBufferForTuple specify that
	 * skipping WAL logging is only safe if we ensure that our tuples do not
	 * go into pages containing tuples from any other transactions --- but this
	 * must be the case if we have a new table or new relfilenode, so we need
	 * no additional work to enforce that.
	 *----------
	 */
	/* createSubid is creation check, newRelfilenodeSubid is truncation check */
	if (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
		cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
	{
		hi_options |= HEAP_INSERT_SKIP_FSM;
		if (!XLogIsNeeded())
			hi_options |= HEAP_INSERT_SKIP_WAL;
	}
How does it tell if the WAL is needed? We look up the XLogIsNeeded macro and find that it checks whether the WAL level is greater than or equal to replica.
https://github.com/postgres/postgres/blob/REL_10_6/src/include/access/xl...
/*
 * Is WAL-logging necessary for archival or log-shipping, or can we skip
 * WAL-logging if we fsync() the data before committing instead?
 */
#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
We can then look up WAL_LEVEL_REPLICA to see all the possible values and the numerical order they are in. This way we know which ones are greater than or equal to replica and which are less.
https://github.com/postgres/postgres/blob/REL_10_6/src/backend/access/rm...
/*
 * GUC support
 */
const struct config_enum_entry wal_level_options[] = {
	{"minimal", WAL_LEVEL_MINIMAL, false},
	{"replica", WAL_LEVEL_REPLICA, false},
	{"archive", WAL_LEVEL_REPLICA, true},		/* deprecated */
	{"hot_standby", WAL_LEVEL_REPLICA, true},	/* deprecated */
	{"logical", WAL_LEVEL_LOGICAL, false},
	{NULL, 0, false}
};
We can see at the end of the CopyFrom function that it calls the heap_sync function when it is skipping the WAL files.
https://github.com/postgres/postgres/blob/REL_10_6/src/backend/commands/...
	/*
	 * If we skipped writing WAL, then we need to sync the heap (but not
	 * indexes since those use WAL anyway)
	 */
	if (hi_options & HEAP_INSERT_SKIP_WAL)
		heap_sync(cstate->rel);

	return processed;
}
Now that we know the problem is caused by minimal WAL, I went back to the docs, https://www.postgresql.org/docs/10/populate.html#POPULATE-PITR, and noticed that they say that with wal_level set to minimal and archiving turned off these commands will be faster, but as we have just proven that is not always true.
Aside from avoiding the time for the archiver or WAL sender to process the WAL data, doing this will actually make certain commands faster, because they are designed not to write WAL at all if wal_level is minimal. (They can guarantee crash safety more cheaply by doing an fsync at the end than by writing WAL.) This applies to the following commands:
- CREATE TABLE AS SELECT
- CREATE INDEX (and variants such as ALTER TABLE ADD PRIMARY KEY)
- ALTER TABLE SET TABLESPACE
- CLUSTER
- COPY FROM, when the target table has been created or truncated earlier in the same transaction
Now that we know what we need to change, we can start testing with different configurations. Each column in the table below shows the per-COPY timings for the same test (1,000 rows per COPY) under the listed wal_level and shared_buffers settings.
Development Replica 8GB | Production Replica 50GB | New Server Minimal 50GB | New Server Replica 50GB | New Server Minimal 8GB |
---|---|---|---|---|
1.365 ms | 1.527 ms | 121.331 ms | 0.472 ms | 20.279 ms |
0.893 ms | 1.461 ms | 101.869 ms | 0.353 ms | 16.195 ms |
0.892 ms | 1.423 ms | 101.375 ms | 0.344 ms | 15.576 ms |
0.996 ms | 1.241 ms | 101.558 ms | 0.337 ms | 15.499 ms |
0.958 ms | 1.362 ms | 102.500 ms | 0.340 ms | 16.040 ms |
0.899 ms | 1.365 ms | 103.690 ms | 0.330 ms | 15.916 ms |
1.019 ms | 1.383 ms | 102.611 ms | 0.317 ms | 15.096 ms |
1.031 ms | 1.406 ms | 101.177 ms | 0.347 ms | 15.081 ms |
1.116 ms | 1.358 ms | 101.614 ms | 0.350 ms | 15.063 ms |
1.041 ms | 1.408 ms | 103.256 ms | 0.343 ms | 15.034 ms |
1.073 ms | 1.258 ms | 102.448 ms | 0.331 ms | 15.197 ms |
0.911 ms | 1.339 ms | 101.128 ms | 0.345 ms | 14.960 ms |
0.939 ms | 1.349 ms | 100.869 ms | 0.325 ms | 14.889 ms |
0.968 ms | 1.384 ms | 101.728 ms | 0.317 ms | 16.139 ms |
When testing our app, we could see these differences.
wal_level | Total Time H:MM:SS | Processing Time H:MM:SS | COPY rows | shared_buffers | NBuffers |
---|---|---|---|---|---|
minimal | 7:38:23 | 6:55:21 | 1,000 | 50G | 6,553,600 |
hot_standby / replica | 0:43:54 | 0:15:35 | 1,000 | 50G | 6,553,600 |
minimal | 1:54:06 | 1:16:57 | 1,000 | 8G | 1,048,576 |
minimal | 1:00:42 | 0:31:13 | 10,000 | 8G | 1,048,576 |
hot_standby / replica | 0:55 | 0:15:03 | 10,000 | 50G | 6,553,600 |
The difference between the Total Time and the Processing Time is the amount of time to fetch / download the data from external sources to the local application server.
We can still see that even after lowering shared_buffers to 8 GB, the minimal run takes over twice as long as the replica run, because FlushRelationBuffers has to loop through all 1+ million shared buffer blocks/pages on every COPY. Since we really do want to use minimal WAL, we are revising our code so that it no longer writes in 1,000 row chunks and instead writes each full table in a single COPY command (sketched below). Once this is finished, I will update this white paper with the new, improved timings.
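In rough terms the change looks like this (hypothetical table and file names again; the real change is in our Groovy loader). Because heap_sync runs at the end of every COPY statement, one COPY per table means one full scan of shared buffers per table instead of five:

```sql
-- Before (simplified): several 1,000-row COPYs of the same new table in one
-- transaction, each one paying for a full FlushRelationBuffers scan.
-- \copy deleteme."table" (foo) FROM '/tmp/chunk_01.csv' WITH (FORMAT csv)
-- \copy deleteme."table" (foo) FROM '/tmp/chunk_02.csv' WITH (FORMAT csv)
-- ...

-- After (simplified): load the whole table with a single COPY command.
BEGIN;
CREATE TABLE deleteme."table" (foo text);
\copy deleteme."table" (foo) FROM '/tmp/full_table.csv' WITH (FORMAT csv)
COMMIT;
```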
NOTE: If you wish to see the above links in the latest version of the code, just replace "REL_10_6" in the URL with "master" and remove the #Lxxx at the end of the URL, as the line numbers will not be the same between version 10.6 and the current master. Once the page is displayed, use your web browser's search function to find the function you want to look at in the code.
If you change log_min_duration_statement to 0 to log everything, you can see the issue inside the PostgreSQL log files.
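One way to turn that on (log_min_duration_statement only needs a configuration reload; the log lines below show the resulting SIGHUP and parameter change):

```sql
-- Log the duration of every statement, then reload the configuration.
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();
```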
2020-03-03 15:26:08.222 PST,,,18397,,5e5ed3a2.47dd,3,,2020-03-03 14:01:06 PST,,0,LOG,00000,"received SIGHUP, reloading configuration files",,,,,,,,,""
2020-03-03 15:26:08.222 PST,,,18397,,5e5ed3a2.47dd,4,,2020-03-03 14:01:06 PST,,0,LOG,00000,"parameter ""log_min_duration_statement"" changed to ""0""",,,,,,,,,""
2020-03-03 15:26:21.195 PST,,,25716,"localhost:39260",5e5ee79d.6474,1,"",2020-03-03 15:26:21 PST,,0,LOG,00000,"connection received: host=localhost port=39260",,,,,,,,,""
2020-03-03 15:26:21.201 PST,"postgres","postgres",25716,"localhost:39260",5e5ee79d.6474,2,"authentication",2020-03-03 15:26:21 PST,6/497,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2020-03-03 15:26:21.259 PST,"postgres","postgres",25716,"localhost:39260",5e5ee79d.6474,3,"DROP TABLE",2020-03-03 15:26:21 PST,6/0,0,LOG,00000,"duration: 56.545 ms statement: DROP TABLE IF EXISTS public.test;",,,,,,,,,"psql"
2020-03-03 15:26:21.259 PST,"postgres","postgres",25716,"localhost:39260",5e5ee79d.6474,4,"idle",2020-03-03 15:26:21 PST,,0,LOG,00000,"disconnection: session time: 0:00:00.063 user=postgres database=postgres host=localhost port=39260",,,,,,,,,"psql"
2020-03-03 15:26:21.265 PST,,,25718,"localhost:39266",5e5ee79d.6476,1,"",2020-03-03 15:26:21 PST,,0,LOG,00000,"connection received: host=localhost port=39266",,,,,,,,,""
2020-03-03 15:26:21.270 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,2,"authentication",2020-03-03 15:26:21 PST,6/499,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2020-03-03 15:26:21.271 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,3,"BEGIN",2020-03-03 15:26:21 PST,6/500,0,LOG,00000,"duration: 0.040 ms statement: BEGIN",,,,,,,,,"psql"
2020-03-03 15:26:21.273 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,4,"CREATE TABLE",2020-03-03 15:26:21 PST,6/500,1288,LOG,00000,"duration: 1.831 ms statement: CREATE TABLE public.test ( v text );",,,,,,,,,"psql"
2020-03-03 15:26:21.379 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,5,"COPY",2020-03-03 15:26:21 PST,6/500,1288,LOG,00000,"duration: 105.771 ms statement: COPY public.test (v) FROM stdin;",,,,,,,,,"psql"
2020-03-03 15:26:21.379 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,6,"COMMIT",2020-03-03 15:26:21 PST,6/0,0,LOG,00000,"duration: 0.049 ms statement: COMMIT",,,,,,,,,"psql"
2020-03-03 15:26:21.379 PST,"postgres","postgres",25718,"localhost:39266",5e5ee79d.6476,7,"idle",2020-03-03 15:26:21 PST,,0,LOG,00000,"disconnection: session time: 0:00:00.114 user=postgres database=postgres host=localhost port=39266",,,,,,,,,"psql"
2020-03-03 15:26:21.386 PST,,,25720,"localhost:39272",5e5ee79d.6478,1,"",2020-03-03 15:26:21 PST,,0,LOG,00000,"connection received: host=localhost port=39272",,,,,,,,,""
2020-03-03 15:26:21.391 PST,"postgres","postgres",25720,"localhost:39272",5e5ee79d.6478,2,"authentication",2020-03-03 15:26:21 PST,6/501,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2020-03-03 15:26:21.446 PST,"postgres","postgres",25720,"localhost:39272",5e5ee79d.6478,3,"DROP TABLE",2020-03-03 15:26:21 PST,6/0,0,LOG,00000,"duration: 54.140 ms statement: DROP TABLE IF EXISTS public.test;",,,,,,,,,"psql"
2020-03-03 15:26:21.446 PST,"postgres","postgres",25720,"localhost:39272",5e5ee79d.6478,4,"idle",2020-03-03 15:26:21 PST,,0,LOG,00000,"disconnection: session time: 0:00:00.060 user=postgres database=postgres host=localhost port=39272",,,,,,,,,"psql"
2020-03-03 15:26:21.452 PST,,,25722,"localhost:39278",5e5ee79d.647a,1,"",2020-03-03 15:26:21 PST,,0,LOG,00000,"connection received: host=localhost port=39278",,,,,,,,,""
2020-03-03 15:26:21.457 PST,"postgres","postgres",25722,"localhost:39278",5e5ee79d.647a,2,"authentication",2020-03-03 15:26:21 PST,6/503,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2020-03-03 15:26:21.460 PST,"postgres","postgres",25722,"localhost:39278",5e5ee79d.647a,3,"CREATE TABLE",2020-03-03 15:26:21 PST,6/0,0,LOG,00000,"duration: 2.044 ms statement: CREATE TABLE public.test ( v text );",,,,,,,,,"psql"
2020-03-03 15:26:21.483 PST,"postgres","postgres",25722,"localhost:39278",5e5ee79d.647a,4,"COPY",2020-03-03 15:26:21 PST,6/0,0,LOG,00000,"duration: 22.555 ms statement: COPY public.test (v) FROM stdin;",,,,,,,,,"psql"
2020-03-03 15:26:21.483 PST,"postgres","postgres",25722,"localhost:39278",5e5ee79d.647a,5,"idle",2020-03-03 15:26:21 PST,,0,LOG,00000,"disconnection: session time: 0:00:00.030 user=postgres database=postgres host=localhost port=39278",,,,,,,,,"psql"
As you can see, the first COPY command runs in the same transaction as its CREATE TABLE and takes 105.771 ms, while the second COPY command, whose CREATE TABLE was committed separately, takes 22.555 ms. Another timing number to look at is the session time: 114 ms with the single transaction versus 30 ms without.
I would like to thank the following people for their input and help during the research of this issue:
Tobin Stelling a co-worker at SCHARP / Fred Hutchinson Cancer Research Center
David Costanzo a co-worker at SCHARP / Fred Hutchinson Cancer Research Center
Jeremy Schneider from Amazon AWS