This is the second post in the series “Share your infrastructure”

The hardware

First, a word about virtualization. We are aware of the overhead, and we like raw performance (and hardware). But everything is a trade-off, and one major benefit is availability. About once a year, a host crashes, for various reasons. With vSphere automatically restarting the VMs, we can enjoy our weekends, and even sleep soundly every night without fear of discovering a failed host over our morning coffee. This is a major feature :)

In 2021, we are still using OVH’s private cloud (PCC) for our production, and it’s honestly a great product. A year ago, we migrated from PCC SDDC with a SAN to a vSAN setup, with nice performance improvements. Quick feedback: vSAN is great, scalable, and a game-changer on the IOPS side.

We currently have 9 hosts. Each vSAN host has:

  • 2 x Xeon Gold 6242R (2nd gen Xeon Scalable) running at 3.1 GHz
  • 768 GB of DDR4 ECC
  • 4 x 25 Gb/s Mellanox NICs
  • 4 x PM1725 (1.6 TB) as cache tier, and 20 x PM983 (1.92 TB) as capacity tier, all NVMe

This gives us around 333 TB of raw storage (so 165 TB in “RAID-1”), 6.74 TB of RAM, and over 1 THz of cumulated CPU.
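As a rough sanity check (assuming the published spec of the Xeon Gold 6242R: 20 cores at a 3.1 GHz base clock), the cluster totals can be recomputed in a few lines. The raw disk total comes out slightly above the quoted 333 TB, which is plausibly what vSAN reports after its own overhead:

```python
# Back-of-the-envelope check of the cluster totals above.
# Assumption: a Xeon Gold 6242R has 20 cores at a 3.1 GHz base clock.
hosts = 9
total_ghz = hosts * 2 * 20 * 3.1      # 2 sockets x 20 cores x 3.1 GHz
total_ram_tb = hosts * 768 / 1024     # 768 GB of RAM per host
raw_capacity_tb = hosts * 20 * 1.92   # 20 x 1.92 TB capacity drives per host

print(f"{total_ghz:.0f} GHz, {total_ram_tb:.2f} TB RAM, {raw_capacity_tb:.1f} TB raw")
# -> 1116 GHz, 6.75 TB RAM, 345.6 TB raw
```

So “over 1 THz” and ~6.74 TB of RAM check out directly from the per-host specs.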

Our hosts are spread across 7 racks in the datacenter, and we use the vSAN fault-domain feature to survive single-rack outages (vSAN places data replicas with respect to fault domains).

Here are some initial synthetic HCI benchmarks from when we received the first 3 servers: promising results, far beyond a traditional SAN, but fewer IOPS than real bare metal.

500k IOPS

Yep, you could achieve ~1M IOPS on a single consumer-grade M.2 NVMe disk today, but that’s vSAN, not bare metal.

Last year, before starting the SAN-to-vSAN migration, we ran several initial tests on SQL Server, finding little IPC performance gain over our previous vSphere hosts (Xeon E5-2689 v4). What a surprise when, after our migration, we found an overall IPC gain of around 30%. We mostly attribute it to 2 factors:

  • A consolidation factor improvement: we can now run several VMs without the noisy-neighbor issues we had on previous generations.
  • The full-NVMe vSAN datastore, which reduces latency.


Before / after: 30% gain on SQL CPU after the migration to the new vSphere.

One really astonishing fact: when we were on the SAN datastore, all our important workloads (backups, compression, massive SQL operations) were IO-constrained.

We are now CPU-bound. Neither the datastores nor the network are our bottleneck anymore. In fact, the only IO-bound operation left is the massive backup job on our DFS servers.

For instance, backing up around 2 TB (1,310 folders) of small files:

  • before: 6h 15m / throughput: 65.91 MB/s
  • after: 1h 56m / throughput: 262.07 MB/s
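A quick sketch of the elapsed-time speedup implied by those two runs:

```python
# Elapsed-time speedup of the DFS backup job (figures from the list above).
before_s = 6 * 3600 + 15 * 60   # 6h 15m
after_s = 1 * 3600 + 56 * 60    # 1h 56m
print(f"{before_s / after_s:.1f}x faster")   # -> 3.2x faster
```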

On the SQL side, we now have very low disk latencies (0.2 ms for reads, and 0.32 ms for writes).

We are really happy with this hardware. The only two downsides are:

  • Ansible takes 1 minute to retrieve the vSphere inventory, but that’s out of scope for this article.
  • We have no volume snapshots (like ZFS on a SAN). In the event of a severe issue, we rely on our backups (more on that in the outage section below).

Monitoring stack

We use several tools for our observability. For historical reasons, we use PRTG, with SNMP probes for all our needs. But we are getting close to PRTG’s scalability limits, and the tool isn’t “infrastructure as code”-ready (no API).

As a consequence, we just started migrating to Prometheus (another long journey). Why didn’t we choose Datadog infrastructure monitoring or SignalFx/Splunk for all our infrastructure? Mostly because of the business model: per-host pricing is really expensive (we’d love to have pricing per GB of data/metrics).
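To give an idea of the direction, a minimal Prometheus scrape configuration for Windows hosts could look like the sketch below. The job name and targets are hypothetical placeholders, not our actual inventory; `windows_exporter` listens on port 9182 by default:

```yaml
# prometheus.yml (sketch) - scrape two hypothetical Windows hosts
scrape_configs:
  - job_name: "windows"
    scrape_interval: 30s
    static_configs:
      - targets:
          - "web-01.example.internal:9182"   # windows_exporter default port
          - "sql-01.example.internal:9182"
```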

Another tool we use a lot is Opserver (hi Nick!).

It’s really convenient to have a global view of the whole infrastructure in a few clicks. But now, with more than 300 virtual machines, we have some readability issues :)


Three specific points about our Opserver usage:

  • We keep struggling with Bosun / scollector. Both consume a lot of CPU, and our Bosun setup is far from ideal. We hope to replace them completely with Prometheus exporters once all our production is plugged in.
  • Our developers (and the infra team) use the exceptions tab a lot! We currently log all “equivalent to a 500 status code” errors, and when needed, we also log the body of some web requests.
  • Unfortunately, exception grouping doesn’t work very well, as the URL changes on each request because it contains the tenant name. One of our next tasks is to replace SE.Exceptional with our own custom code.
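One way to fix the grouping, sketched below, is to normalize the URL before using it as a grouping key. This is illustrative only: the tenant's position in the path and the `normalize_url` helper are assumptions for the example, not our actual code.

```python
import re

# Sketch: normalize URLs before using them as an exception grouping key.
# Assumption (hypothetical): the tenant name is the first path segment,
# e.g. /acme-corp/api/v3/leaves/42
def normalize_url(path: str) -> str:
    path = re.sub(r"^/[^/]+", "/{tenant}", path)   # mask the tenant segment
    path = re.sub(r"/\d+(?=/|$)", "/{id}", path)   # mask numeric ids
    return path

print(normalize_url("/acme-corp/api/v3/leaves/42"))
# -> /{tenant}/api/v3/leaves/{id}
```

With this, all tenants' variants of the same endpoint collapse into a single exception group.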

The CPU usage of Opserver’s SQL queries is often really high. This is probably due to bad execution plans, and we haven’t managed to fix it yet.

Last but not least, we use Datadog, a lot! All our logs (from HAProxy, IIS, etc.) are retained for 30 days on their European servers. We also use Datadog metrics, but only for our SQL Server instances, and finally Datadog APM, which is really amazing!

Given its high price tag, we chose to deploy APM on only one of our clusters, plus the web services, with a 100% sampling rate. It’s enough to troubleshoot generic issues, but we miss it a lot when a specific issue happens to a customer on another cluster. This is likely something we will change in the future (deploy everywhere, with a low sampling rate), but once again, the pricing model runs against us.

We still have several pain points around observability: we don’t yet monitor all the .NET metrics, we only log incoming traffic, and our HAProxy-generated correlation IDs are only partially supported (they don’t flow across all the apps). These are works in progress, and fine-tuning takes time.
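For reference, generating the correlation ID on the HAProxy side only takes two directives (`unique-id-format` and `unique-id-header` are standard HAProxy directives; the format string below is the one from the HAProxy documentation, not necessarily ours):

```haproxy
# Sketch: tag every request with a correlation ID before it reaches the apps.
frontend fe_https
    bind :443 ssl crt /etc/haproxy/certs/
    # hex-encoded: client ip:port, frontend ip:port, timestamp, request counter
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt
    unique-id-header X-Request-Id
    default_backend be_apps
```

The hard part is not generating the ID, it is getting every app to read `X-Request-Id` and forward it on its own outgoing calls, which is exactly where our support is still partial.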

Our software stack

To begin with, we love .NET a lot!

And a lot more since .NET Core started appearing at the top of the TechEmpower benchmarks! Our backend is 100% .NET, our frontend SPAs are built with Angular/TypeScript, and our database is SQL Server.

Historically, all our applications were in a single monolith. This monolith still receives a large share of all web requests.

We have chosen not to go full microservices, but instead to split our applications and extract some generic services. The running joke around here is that we “just do macro-services” :)

A lot can be found inside this monolith, like geological strata. It currently runs on .NET 4.7.1, with a mix of EF 6 and some traces of EDMX, and contains mainly 3 generations of APIs. A legacy in both positive and negative ways, like all monoliths.

All these generations behave differently, on the allocation side (GC-sensitive) or on the execution side (some long-lived processes). A “quick win” to reduce the generations’ impact on each other was to split them at runtime.

With HAProxy, we split our traffic across 3 pairs of servers (each pair load-balanced round-robin):

  • V2 serves all the oldest generations
  • LR (long request) serves the web requests that consume CPU for long periods (some exports, imports, etc.)
  • V3 serves the “less old” generation and receives most of the requests.
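In HAProxy terms, the split above boils down to routing rules on the request, with V3 as the default. The config below is a sketch only: the path prefixes and addresses are hypothetical placeholders, not our real rules.

```haproxy
# Sketch of the generation split (paths and addresses are placeholders).
frontend fe_app
    bind :443 ssl crt /etc/haproxy/certs/
    acl is_v2 path_beg /api/v2
    acl is_lr path_beg /exports /imports
    use_backend be_lr if is_lr
    use_backend be_v2 if is_v2
    default_backend be_v3

backend be_v3                 # be_v2 and be_lr are declared the same way
    balance roundrobin
    server v3-a 10.0.0.31:80 check
    server v3-b 10.0.0.32:80 check
```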


In addition to our monolithic architecture, we have a pair of IIS servers for our “out-of-monolith” applications (all in .NET Core).

We started the out-of-monolith migration journey in early 2018, when .NET Core was in version 2.0.

Today, each team (business unit) has several products in .NET Core, and/or products partially migrated out of the monolith. Obviously, all our new products are created outside the monolith from day one.

Why didn’t we just migrate our monolith to .NET Core?

Several reasons: legacy dependencies, plus a “legacy framework” surrounding all our code, make it nearly impossible to simply migrate to our new “state of the art” standards. So, feature by feature, we are completely rewriting the code in our new .NET Core codebase, without any big bang.

One word about our typical platform usage

We have a peak load every Monday morning between 9am and 11am: everyone wants to either create a leave request or declare the time spent on their projects.

This is what our servers experienced last week (each color is a different cluster):


This weekly Monday-morning traffic peak means we have to keep a CPU margin. And for the first Monday of each month specifically, we also add extra resources to prevent CPU saturation.

Some interesting numbers:

  • Around 20% of our traffic goes to our .NET Core (out-of-monolith) IIS
  • We host all our out-of-monolith apps on the same IIS pair (= 2 x 8 CPU)
    • .NET Core CPU peaks at 17% on Monday morning
  • We host all our monolith instances on 3 IIS pairs (= 6 x 16 CPU)
    • .NET 4.7 CPU peaks at 70% on Monday morning
  • If you do the math, 20% of the web requests consume 2.72 vCores on the .NET Core side, while the other 80% consume 67.2 vCores.
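The arithmetic behind that comparison, normalized by traffic share, as a quick sketch:

```python
# Peak-hour CPU per runtime (numbers from the list above).
core_vcores = 2 * 8 * 0.17     # .NET Core pair: 16 vCores at 17% -> 2.72
mono_vcores = 6 * 16 * 0.70    # monolith pairs: 96 vCores at 70% -> 67.2

# Extrapolate linearly to the cost of serving 100% of the traffic on each runtime.
core_per_full_traffic = core_vcores / 0.20   # 13.6 vCores
mono_per_full_traffic = mono_vcores / 0.80   # 84.0 vCores
ratio = mono_per_full_traffic / core_per_full_traffic
print(f"{ratio:.1f}x more CPU per request on .NET 4.7")   # -> 6.2x
```

This assumes CPU scales linearly with traffic, which is a simplification, but the order of magnitude is telling.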

Another way to look at it: per request, .NET Core consumes roughly 6 times less CPU than .NET 4.7.

So the gain from migrating our software to .NET Core is obvious. Once migrated, we’ll probably be able to serve all the web requests on the current .NET Core (out-of-monolith) IIS VM pair, without adding more cores to the VMs.

Performance is the real power of .NET Core.