PostgreSQL Benchmark on Datrium – 4.3 Million TPS with 1 GB RAM – and some Frustration!
This blog post is mostly about PostgreSQL performance on Datrium, but I do make direct comparisons with results published by other storage vendors. If you don’t like reading competitive comparisons, please stop here. You have been warned!
Cutting to the chase, I want to share some benchmark numbers from runs in our Solutions lab and show how Datrium DVX compares to other published figures. While some may claim that benchmarks can be gamed (and they can), I tried to stick to a simple formula that anyone can repeat on any platform for comparable results. Of course, the more hardware you throw at the problem, the more performance you will get, but if you fix as many variables as possible, the results should be within a reasonable margin.
I read through a few benchmarks, and the one I felt to be most honest regarding configuration and executed instructions was the EDB Postgres™ Advanced Server Performance on EMC XtremIO paper, so I followed a similar formula.
Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
Intel(R) NVMe DC P3608 2 x 1.6 TB
VMware ESXi, 6.0.0, 3620759
Datrium Hyperdriver Agent – 17.5 GB RAM
Data Node – 1 x F12X2
If you don’t know how Datrium architecture works, I recommend watching this video from Clint Wyckoff. In a Datrium system, data nodes are used for storing durable data, while a copy of the data is stored on the host flash. All read IO is local to the host with intrinsic data locality, while write IO is stored on the host flash and also on the data node(s) using Erasure Coding (N+2 parity). Furthermore, all IO operations are compressed and deduplicated, by default – no check boxes.
1 GB RAM **
CentOS 7, kernel 3.10.0-514.10.2.el7.x86_64 (default install)
** PostgreSQL utilizes all allocated memory and uses shared_buffers to cache as much data as possible. Since we’re aiming to demonstrate storage performance, I limited VM memory to 1 GB to force PostgreSQL to utilize the storage device as much as possible.
shared_buffers = 32MB (default)
fsync = on # turns forced synchronization on or off
synchronous_commit = on # synchronization level; on, off, or local
full_page_writes = on # recover from partial page writes
** These PostgreSQL parameters can be changed to improve performance; however, doing so makes it possible to lose data when a sudden shutdown occurs. Some vendors that perform data integrity checks recommend turning these settings off for better performance. I have chosen NOT to turn them off during this benchmark.
pgbench 9.2 was used to create the database and run the benchmark. pgbench results are shown as TPS or transactions per second. In the XtremIO paper, they executed a read-only and an OLTP-B mixed workload (read/write). I decided to skip the read-only benchmark because it’s useless for production environments. I used the same commands used in the XtremIO white paper to produce the benchmark. The commands are as follows:
Create database instance using psql:
# CREATE DATABASE foobar OWNER postgres TABLESPACE foobar;
Run the pgbench database initialization. The following command loads a pgbench database using a scale factor of 7500, vacuums the resulting data, and then indexes it. It will create a database of approximately 113 GB in size:
# pgbench -i -s 7500 --index-tablespace=foobar --tablespace=foobar foobar
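As a rough sanity check on the size quoted above, the arithmetic can be sketched in a few lines of Python. pgbench creates 100,000 rows in pgbench_accounts per unit of scale factor; the ~113 GB total is the figure from this post, and the per-unit size is simply derived from it (treating 1 GB as 1024 MB is an assumption), not measured independently.

```python
# Rough sizing arithmetic for a pgbench database (a sketch, not a measurement).
ROWS_PER_SCALE_UNIT = 100_000  # pgbench_accounts rows created per scale unit


def accounts_rows(scale_factor: int) -> int:
    """Number of rows `pgbench -i` creates in pgbench_accounts."""
    return scale_factor * ROWS_PER_SCALE_UNIT


def mb_per_scale_unit(total_gb: float, scale_factor: int) -> float:
    """Average on-disk size per scale unit, derived from an observed total.

    Assumes 1 GB = 1024 MB.
    """
    return total_gb * 1024 / scale_factor


scale = 7500        # scale factor used in this post
observed_gb = 113   # approximate database size reported in this post

print(accounts_rows(scale))                              # 750000000
print(round(mb_per_scale_unit(observed_gb, scale), 1))   # roughly 15 MB per unit
```

In other words, a scale factor of 7500 means 750 million account rows, at roughly 15 MB of table and index data per scale unit.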
Run the pgbench read/write workload for 30 minutes using the following command:
# pgbench -s 7500 -c 100 -r -N -T 1800 foobar
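For anyone who wants to script repeated runs, here is a minimal Python sketch that assembles the same command and pulls the TPS figures out of pgbench's summary text. The helper names (`build_pgbench_command`, `parse_tps`) are hypothetical, and the regular expression assumes the `tps = ... (including|excluding connections establishing)` lines printed by pgbench 9.x; later releases word those lines differently.

```python
import re


def build_pgbench_command(db, scale, clients, seconds):
    """Assemble the read/write pgbench invocation from this post as an argv list."""
    return ["pgbench", "-s", str(scale), "-c", str(clients),
            "-r", "-N", "-T", str(seconds), db]


# Assumed summary format (pgbench 9.x):
#   tps = <number> (including connections establishing)
#   tps = <number> (excluding connections establishing)
TPS_LINE = re.compile(
    r"^tps = ([0-9.]+) \((including|excluding) connections establishing\)",
    re.MULTILINE,
)


def parse_tps(output):
    """Return a dict like {'including': float, 'excluding': float}."""
    return {kind: float(value) for value, kind in TPS_LINE.findall(output)}


cmd = build_pgbench_command("foobar", 7500, 100, 1800)
print(" ".join(cmd))  # pgbench -s 7500 -c 100 -r -N -T 1800 foobar
```

To use it, pass `cmd` to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with pgbench installed and feed the resulting `stdout` to `parse_tps`.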
During the benchmark the VM was running at about 70% CPU utilization.
How does Datrium DVX compare?
I have not seen a single vendor benchmark that executed pgbench and reported the real end-to-end application latency. All papers that I have found report array controller latencies – and there’s a big reason for that! There is an enormous difference in latency depending on where it is measured. Application latency, measured by the application, is what matters at the end of the day, so I’m not hiding it.
Latencies shown by XtremIO are not real application latencies, but rather the latency measured at the array controller. Moreover, I found a gotcha in their performance numbers.
|TPS||AVG Read Latency||AVG Write Latency|
|XtremIO||7,642||~0.2||~0.4 (not real)|
Granted, I chose to compare to XtremIO because it’s probably the lowest-latency storage solution for raw performance in a single-host deployment. Also, the white paper does not specify the data protection RAID level used. This makes me wonder if they were actually using RAID 6 (disk striping with double parity). Finally, as with any SAN, the more hosts and VMs you add, the less performance you get for each application.
The XtremIO paper states the following: “We ran the following pgbench command to generate a mixed workload with a 2:1 read/write ratio” (page 16). However, the results table (page 19) shows that Read IOPS is 4.7X higher than Write IOPS – roughly 80R:20W!
Where is the 2:1?
I want to believe that there is a genuine mistake in the report and that the authors were not trying to game the results. Either way, it’s only fair to say that the specified latency numbers are not real or valid.
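To make the mismatch concrete, the two ratios can be converted into read percentages. This is a small Python sketch; the only inputs are the 2:1 mix stated in the paper and the 4.7X read-to-write IOPS ratio from its results table, and the function name is my own.

```python
def read_share(reads_per_write: float) -> float:
    """Fraction of total IOPS that are reads, given the read:write ratio."""
    return reads_per_write / (reads_per_write + 1)


claimed = read_share(2.0)    # the 2:1 mix stated in the paper
observed = read_share(4.7)   # read IOPS 4.7x write IOPS, per the results table

print(f"claimed:  {claimed:.0%} reads")    # claimed:  67% reads
print(f"observed: {observed:.0%} reads")   # observed: 82% reads
```

A 2:1 mix would be about 67R:33W, while the published IOPS work out to about 82R:18W, which is where the rounded 80R:20W figure comes from.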
Table from XtremIO paper
When I ran the same pgbench command on Datrium the results were consistently 70R:30W. We can clearly see that XtremIO handled ~8,000 Write IOPS at peak, while Datrium absorbed 16,523 Write IOPS at peak – more than double the amount. (see below)
This other paper for the VMAX 250F All Flash with 32 SSDs achieved 11,757 TPS in a RAID 5 (3+1) configuration and a VM with 96 GB of memory. The paper does not clarify whether compression was enabled during the tests, but no serious enterprise SAN array promotes RAID 5 for data protection nowadays. Lower data resiliency and memory caching both help in a performance benchmark. Moreover, latencies are also measured at the array controller.
Datrium always uses N+2 parity erasure coding to protect against any two simultaneous drive or block failures while still providing compression and deduplication.
How about HyperConverged?
I would love to compare Datrium to Converged or HyperConverged solutions, but vendors seem hesitant to report their real performance numbers, and when they do, they do not provide enough information for a decent comparison.
I did, however, find Nutanix numbers (here) provided by user jcytam that I used as general guidance. I replicated the pgbench benchmark as closely as I could, using the same VM configuration, 8 vCPU and 24 GB RAM. I also used the same pgbench command as described in the post, and the same pgbench major release. Unfortunately, the Replication Factor (akin to RAID) was not specified.
In Nutanix, warm Read IO comes from SSDs/RAM and Write IOs go to SSDs. That said, this is not an official Nutanix benchmark and should not be seen as official numbers – many factors can influence a benchmark.
Further down in this post I measure Datrium DVX with Samsung PMA SSDs.
I could not find pgbench benchmarks for VMware VSAN, Hyperflex or Simplivity.
The XtremIO comparison above was run without any tuning on my side, but the XtremIO paper does not state that no PostgreSQL, VMware or Linux tuning was applied. So, I decided to do some simple tuning while keeping all the declared configuration the same. That means no change to VM memory or CPU.
Note that I have also run pgbench with lots of memory, CPU cores and higher shared_buffers, and I got to several hundred thousand TPS – however, it means nothing because it doesn’t demonstrate the storage performance capability.
Here is my PostgreSQL and Linux tuning:
fsync = off # turns forced synchronization on or off
synchronous_commit = off # synchronization level; on, off, or local
full_page_writes = off # recover from partial page writes
ext4 mount options set to (defaults,nodiratime,noatime,data=writeback,barrier=0,discard)
I also implemented the changes recommended by PgTune according to my environment.
Let’s look at the new results.
|TPS||AVG Read Latency||AVG Write Latency|
|XtremIO||7,642||~0.2||~0.4 (not real)|
Just to reinforce the point: latencies shown by XtremIO are not real application latencies, but rather the latency measured at the array controller. In the image below, from the Datrium benchmark, disk latency at the vSphere VM level never goes above 4 ms (lower than the ~5.7 ms at the application level), and for the most part is below 3 ms.
If I were to measure latency at the data node, it would be much lower, but it would be meaningless. I suggest that vendors always provide real application latencies – it’s only fair to customers.
Based on the workload generated by pgbench using a single VM and a single host, Datrium DVX tells me that I would be able to add another 29 servers with the equivalent workload and results before I need to add another data node to the pool, totaling 435,120 TPS (30 × 14,504). (see image below)
10 data nodes can be part of a data pool, in which case we would have approximately 4.3 Million TPS.
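The scaling figures are simple multiplication, sketched here in Python. Note that this is a projection that assumes every additional host and data node delivers the same result as the single tuned host measured in this post (14,504 TPS); it is not a measured multi-host run.

```python
# Projected scale-out arithmetic (linear scaling is an assumption).
tps_per_host = 14_504       # tuned single-VM, single-host result from this post
hosts_per_data_node = 30    # the measured host plus 29 more
data_nodes_per_pool = 10    # maximum data nodes in a pool, per this post

tps_per_data_node = tps_per_host * hosts_per_data_node
tps_per_pool = tps_per_data_node * data_nodes_per_pool

print(tps_per_data_node)  # 435120
print(tps_per_pool)       # 4351200
```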
SATA SSD vs NVMe
As NVMe approaches price parity with SATA SSDs, we will start seeing greater adoption of the technology, and Datrium is well positioned to support NVMe – customers have been utilizing NVMe on hosts for over a year.
Since I ran the benchmark on a host with two NVMe SSDs, I decided to run the same workload on another host with two SATA SSDs to understand the difference, because I thought readers would ask about it.
This host has an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz and the SSDs are cheap ($0.5/GB) Samsung PMA. Looking at the results, it’s clear that the lower-grade SSD doesn’t provide the same performance as the NVMe, and we also notice a bump in write latency.
That said, the performance numbers are outstanding for a pgbench running with 1GB RAM on cheap commodity flash with N+2 parity erasure coding, while still providing compression and deduplication. Repeat that 30 times until you get to the data node boundary, and then add more data nodes, up to ten.
|TPS||AVG Read Latency||AVG Write Latency|
|Datrium NVMe (tuned)||14,504||~0.3||~5.7|
|Datrium SATA SSD (tuned)||10,319||~0.3||~8.6|
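To put the two rows side by side, here is a short Python sketch that computes the relative difference. The TPS and latency values are copied from the table above; treating the latency columns as milliseconds is an assumption based on the latency figures quoted elsewhere in this post.

```python
# Values from the table above; latency unit (ms) is assumed.
results = {
    "nvme": {"tps": 14_504, "write_ms": 5.7},
    "sata": {"tps": 10_319, "write_ms": 8.6},
}

# SATA throughput as a fraction of NVMe throughput.
tps_ratio = results["sata"]["tps"] / results["nvme"]["tps"]
# SATA write latency as a multiple of NVMe write latency.
latency_ratio = results["sata"]["write_ms"] / results["nvme"]["write_ms"]

print(f"SATA delivers {tps_ratio:.0%} of the NVMe TPS")        # 71%
print(f"SATA write latency is {latency_ratio:.2f}x the NVMe")  # 1.51x
```

So the SATA host lands at roughly 71% of the NVMe throughput with about 1.5 times the write latency, while read latency is the same in both rows.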
I can’t stress enough that all the performance numbers presented in this blog post were generated with a single Supermicro server with two flash devices and a Datrium data node (F12X2) with 12 SSDs. The list price for a data node and a host license is sub $150K, and it scales to 435,120 TPS based on this same workload.
We have to remember that if we throw memory, host flash and CPU at the problem, and make changes to shared_buffers on PostgreSQL, it is possible to get up to hundreds of thousands of TPS from a single VM with the same pgbench workload. I could have added up to 16 NVMe devices on the host to distribute the load and get more parallelism, but it is too easy (and costly) to solve performance problems by throwing hardware at them.
I didn’t run this benchmark to prove that PostgreSQL does an outstanding job caching and managing data in memory, or that newer Intel processors are faster, but instead to show Datrium’s raw storage performance.
I also know that comparing benchmarks can lead to endless debates, in which case I invite vendors to run this very benchmark and share their numbers. I can provide the source VM, pre-configured, and you just run the benchmark. I also invite vendors to demonstrate their numbers with Erasure Coding (or equivalent data protection) with Deduplication and Compression ENABLED, like Datrium.
To me, the exciting part is to see how well storage systems handle benchmarks when all parts are roughly equal. Datrium is on par with any enterprise-grade Tier 1 storage solution, providing industrial-strength data resiliency, data reduction and scalability. Datrium’s scalability, up to 18 Million IOPS and 256 GB/s Random Write throughput, is unmatched in the industry.
My Rant – Over the last few days, I’ve spent many hours poring over storage benchmarks from various vendors, but honestly, what’s up with benchmarks that do not use production-grade conditions to demonstrate performance numbers? Some papers appear to purposely hide details to prevent other vendors from replicating their benchmark, while others game their numbers to make them look good. As an industry, we need to be better than that!
As a next step, I am planning to run the same benchmark with Red Hat Enterprise Virtualization. I also will run a scale-out pgbench benchmark with VMs on multiple servers – adding up to 2,000 snapshots per VM. pgbench-tools is also an option.
If you would like to see a specific benchmark on Datrium, let me know and we will do everything possible to run it – that is part of my team’s charter at Datrium – and we shall not hide or lie about performance numbers.
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net