Oleg Avdeev || Blog

vPMU support on EC2 and the weird case of z1d instance family

As explained in great detail in this post by Brendan Gregg, Performance Monitoring Counters are an awesome way to measure performance on modern processors. You can get insight into things like branch mispredictions, cache misses and TLB performance. You can also do sampling based on these events in perf (aka Precise Event Based Sampling/PEBS).

arch_perfmon

Not too long ago, it wasn't possible to access this feature on EC2 instances (apparently, for security reasons). However, that situation has changed with the 2016 updates to the AWS Xen hypervisor, and with the rollout of the AWS Nitro hypervisor last year.

As Brendan's blog post says, you can get access to vPMU on dedicated hardware, when using the largest instance type. However, when I ran my little /proc/cpuinfo experiment, I found out that this is not the whole story.

You can easily check for vPMU support by looking for the arch_perfmon flag in /proc/cpuinfo. Here's the exhaustive list of instance types that have it as of today:

i3.metal
c5.9xlarge
c5.18xlarge
m4.16xlarge
m5.12xlarge
m5.24xlarge
r5.12xlarge
r5.24xlarge
f1.16xlarge
h1.16xlarge
i3.16xlarge
p2.16xlarge
p3.16xlarge
r4.16xlarge
x1.32xlarge
c5d.9xlarge
c5d.18xlarge
m5d.12xlarge
m5d.24xlarge
r5d.12xlarge
r5d.24xlarge
x1e.32xlarge
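
For reference, the /proc/cpuinfo check described above can be sketched in a few lines of Python. This is only an illustration, not the exact script used to build the list; the helper names has_flag and vpmu_available are made up for this example.

```python
def has_flag(cpuinfo_text: str, flag: str = "arch_perfmon") -> bool:
    """Return True if any 'flags' line in the given cpuinfo text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only inspect 'flags' lines, and match whole words so that a flag
        # name that merely contains the string is not counted.
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


def vpmu_available(path: str = "/proc/cpuinfo") -> bool:
    """Read cpuinfo from disk (Linux only) and look for arch_perfmon."""
    with open(path) as f:
        return has_flag(f.read())
```

Calling vpmu_available() on a running Linux instance returns True when the flag is present; has_flag is kept separate so it can be tried on saved cpuinfo dumps as well.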

Looking at this list and lstopo(1) output, the rule is pretty clear: you get vPMU on the largest instance types running under “Xen AWS 2017” (using Brendan's terms). On instances running on the Nitro hypervisor, you get vPMU on types that use an entire processor socket/NUMA node and up. For example, m5d.12xlarge is only the second largest in its family, yet it takes an entire socket, so you get vPMU support.
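
One rough way to see that pattern is to group the list by instance family and look at which families contribute more than one size. The sketch below does only that grouping, using the types listed in this post; it does not detect the hypervisor or the NUMA layout, and the names VPMU_TYPES and sizes_by_family are invented for the example.

```python
from collections import defaultdict

# Instance types copied from the list in this post.
VPMU_TYPES = """
i3.metal c5.9xlarge c5.18xlarge m4.16xlarge m5.12xlarge m5.24xlarge
r5.12xlarge r5.24xlarge f1.16xlarge h1.16xlarge i3.16xlarge p2.16xlarge
p3.16xlarge r4.16xlarge x1.32xlarge c5d.9xlarge c5d.18xlarge m5d.12xlarge
m5d.24xlarge r5d.12xlarge r5d.24xlarge x1e.32xlarge
""".split()


def sizes_by_family(types):
    """Map each family (the part before the dot) to its sizes, in list order."""
    families = defaultdict(list)
    for instance_type in types:
        family, size = instance_type.split(".", 1)
        families[family].append(size)
    return dict(families)


# Print one line per family, e.g. the family name followed by its sizes.
for family, sizes in sorted(sizes_by_family(VPMU_TYPES).items()):
    print(f"{family:4} {', '.join(sizes)}")
```

Families that show up with two sizes are the ones where something other than the single largest type also has the flag.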

Curiously, the newest z1d family doesn't have vPMU enabled at all, even though the largest type spans two NUMA nodes and AWS says it runs on Nitro as well. The cores declare the same family/model as the r5s, so it is unlikely that there is some deep architectural reason (like a CPU bug) to disable it. Did it somehow get in the way of hardware design, while Intel and AWS were trying to squeeze out as much per-core performance as possible? I know next to nothing about chip design, but I wouldn't imagine it taking much die space. Maybe Nitro required rushing a new release to run z1d, so they disabled vPMU support there temporarily?

Anyway, at least we can enjoy having vPMU on the 22 types above. By the way, the cheapest way to play with hardware performance counters is c5.9xlarge, which is $0.55/hr as of today on the spot market in the Oregon and N. Virginia regions.