Share your infra - The 20,000 foot view

Aug 23, 2021 • Raphaël Ducom

This is the first post in the series "Share your infrastructure".

In this series, I’m going to take you on a journey through our tech infrastructure, our challenges, our legacy, and what matters most: our mistakes, and what we learned from them.

A few years ago, we started informal private meetings called “share your infra” with some tech peeps from local organizations in Nantes, to share feedback and learn from others. These sessions were very interesting, so let me introduce our infrastructure on this fresh new tech blog.

OK, but what is Lucca?

We are a French B2B HRIS software vendor, founded 19 years ago (and we have kept our startup mindset). Lucca is now a SaaS solution for leave management, expense reports, dematerialization of HR files, online pay slip distribution, and time tracking.

We currently have more than 4,400 B2B customers, mostly in European countries, but some operate worldwide, which is why we can say our tools are used in more than 100 countries. Our typical customer has between 50 and 200 employees, with some around 2,000-4,000. The average company size is increasing rapidly (we now have companies in the 20k-user range).

Some important facts about Lucca: we have no VCs, we are self-financed, we love making great and useful products, we have a strong transparency culture (even for salaries), and lastly, we love to keep things simple. “We design solutions that provide simple answers to specific needs” :)

Some technical legacy history

Lucca started in the 2000s as a software company focused on leave management, selling software for on-premises deployment. Back in those years, the stack was ASP (yes, classic ASP!) with ODBC drivers (so Oracle, MS SQL Server, even MS Access!). We were not a SaaS: the software was installed on each customer's servers, usually a “swiss-army-knife” server hosting both the database and the web server (and a lot more unknown things).

In 2007, our CEO saw the SaaS model growing and decided to embrace it. So we stopped selling on-premises licenses, announced the end of support for on-premises customers, and started managing our own servers, using dedicated bare-metal instances from the hosting provider 1&1, and later OVH.

We migrated most of our customers to these dedicated servers, replicating the same tenancy model: one database = one customer (1 SQL Server and 1 IIS on the same machine, no redundancy). The scaling model was to run ~100 customers (databases) per server, then add another server once it was full. It was 2008, and we switched to HTTPS.

In 2012, we designed a new piece of software to manage all our instances, named “CloudControl”. This software is still used to create new instances, provision previews based on anonymized customer data, and, more importantly at that time, deploy the master branch every night on all servers. In 2012, deploying was synonymous with a (small) service interruption.

In 2013, there were two major infrastructure changes: we migrated our dedicated servers to virtualized servers (vSphere 5 running on AMD Opteron CPUs), and we added pfSense as a firewall/router, reducing our exposed surface (yes, previously all our servers were internet-facing). We also added a WAF on all our IIS servers, and started monitoring everything with Logmatic, a SaaS log monitoring company (since acquired by Datadog).

This is what automated SQL schema change deployment across dozens of dedicated servers looked like in 2016:

A lot of resources

Then in 2016, we added HAProxy load balancers to handle HTTPS, rate limiting, and custom maintenance pages. We also moved all our static files to Nginx, reducing latency to a few milliseconds (down from ~100 ms under IIS).

At that time, we started having bigger customers, and our scalability model couldn't cope with them. So we started a plan to cluster our applications: one dedicated SQL Server and multiple IIS servers. We had a lot of data cached in memory, and needed to start using Redis as a shared distributed cache. At the beginning, we used the IIS “web garden” feature, giving us 2 app pool instances per app per IIS server, in order to consolidate our Redis usage. Then, in late 2018, we switched to a real cluster model.
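To give an idea of what “shared distributed cache” means in practice, here is a minimal C# sketch of a Redis-backed cache through ASP.NET Core's IDistributedCache. It is not our actual code: the endpoint, key scheme, and class names are made up for illustration.

```csharp
// Minimal sketch of a Redis-backed shared cache via IDistributedCache.
// Registration (in ConfigureServices) would look like:
//   services.AddStackExchangeRedisCache(o => o.Configuration = "redis-1:6379"); // hypothetical endpoint
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public class TenantSettingsCache
{
    private readonly IDistributedCache _cache;

    public TenantSettingsCache(IDistributedCache cache) => _cache = cache;

    // With IIS web gardens (several worker processes per app pool), an in-memory cache is
    // duplicated in every process; Redis gives all processes the same view of cached data.
    public async Task<string> GetOrLoadAsync(string tenant, Func<Task<string>> load)
    {
        var key = $"settings:{tenant}";                 // hypothetical key scheme
        var cached = await _cache.GetStringAsync(key);
        if (cached != null)
            return cached;

        var value = await load();                       // e.g. a database lookup
        await _cache.SetStringAsync(key, value, new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5)
        });
        return value;
    }
}
```

The point is not the cache class itself, but the fact that every app pool instance, on every IIS server of a cluster, reads and writes the same Redis keys.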

A lot of things also happened in 2018: our first .NET Core applications, the first Ansible playbooks, and a major revamp of our CI/CD stack (Jenkins, SonarCloud, PowerShell deployment scripts driven by Jenkins, a big effort on unit tests, etc.).

The cluster: our scalability unit

Currently, our software stack is composed of a legacy monolithic application (.NET Framework 4.7.1) and 22 .NET Core applications (.NET Core 3.1 / .NET 5). All these applications run in each cluster.

A cluster is our actual scalability unit: we can scale just by adding more and more clusters. Nowadays, we add a new cluster every 5 months, and this period will inevitably shorten over time, given our exponential growth. Cluster size is currently limited by SQL Server: we can use a maximum of 24 cores per SQL Server Standard instance, and we try to stay below this hard limit to keep headroom for possible “unexpected CPU usage” incidents.

A cluster is also our availability unit: we try to keep clusters fully isolated from each other to minimize the blast radius of any incident. That’s the main reason for having dedicated SQL Server, Redis, and RabbitMQ instances per cluster. When an infrastructure issue happens, only a subset of our customers is impacted.

A cluster is made of:

  • 1 primary SQL Server instance + 2 replicas continuously restoring transaction logs
  • 6 IIS VMs for our monolithic application (16 cores, 24 GB RAM each)
  • 2 IIS VMs for our .NET Core applications (8 cores, 16 GB RAM each)
  • 2 HAProxy instances, in active/passive mode (keepalived)
  • 3 Redis instances (in Sentinel mode)
  • 3 RabbitMQ instances (Raft)
  • 2 DFS servers (Microsoft Distributed File System) for our customers' files
  • 2 Nginx servers for the static SPA files

Some numbers

Currently, we have 6 clusters in production (plus 1 small cluster in a Swiss datacenter):

  • 1,000 customers per cluster
  • ~600 GB of .mdf/.ldf files per SQL Server instance
  • 2.5 TB of customer files per cluster (many of them very small)
  • 8 million HTTP calls per day per cluster, from 8 am to 7 pm, with peaks at 700-800 requests/sec

We also have a set of 23 web services (.NET Core 3.1 / .NET 5) and 3 legacy web services (.NET 4.7.1) per region, shared by the clusters of that region, which are either zero- or multi-tenant. These web services form a cluster of their own: 2 IIS instances + 1 SQL Server instance (+ replica), with dedicated Redis/RabbitMQ.

In addition, we have some “non-production” zones with multiple clusters:

  • Preview: used to create environments on the fly from any GitHub branch.
  • Formation (“training” in French): lets our customers create a clone of their production instance, for learning or experimentation purposes.
  • Demos: anyone can create a demo instance to try our software.
  • Security / pre-production: a specific cluster for pentesting or specific validations.
  • TestSQL: to test and validate SQL schema migrations against all our customer databases before deploying to production.

Yes, that's a lot of VMs: 320 VMs, to be exact.

A lot of resources

Our tenancy model

Each tenant at Lucca has a dedicated URL, like customer.ilucca.net.

This URL is mapped in DNS (CNAME) to a public IP targeting a specific cluster (NATed by pfSense). To be more specific, it is routed to the HAProxy virtual IP of this cluster.

So all our tenants are individually mapped to a specific cluster, with one exception: the “currently filling” cluster, which receives *.ilucca.net. When we create a new cluster, we persist the previous cluster's DNS entries before redirecting the wildcard to the new cluster.

Our databases are all suffixed with the tenant name, and so are the SQL logins: when a user's web request reaches the IIS server, we extract the tenant from the first part of the URL, and build (or get from cache) a connection string for this web request. All the scoped queries are then executed on that specific tenant database. So while we have strong tenant isolation from a SQL point of view, on the IIS side we have soft tenant isolation.
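Our actual implementation lives in the monolith and the .NET Core apps, but a minimal sketch of this per-request tenant resolution could look like the ASP.NET Core middleware below. It assumes the tenant is the first label of the host name; the server name, naming conventions, and password lookup are placeholders, not our real configuration.

```csharp
// Minimal sketch of per-request tenant resolution.
// Register with app.UseMiddleware<TenantConnectionMiddleware>(); names are illustrative only.
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Data.SqlClient;

public class TenantConnectionMiddleware
{
    private static readonly ConcurrentDictionary<string, string> _connectionStrings =
        new ConcurrentDictionary<string, string>();

    private readonly RequestDelegate _next;

    public TenantConnectionMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // "customer.ilucca.net" -> "customer"
        var tenant = context.Request.Host.Host.Split('.')[0];

        // Build (or get from cache) the connection string scoped to this tenant:
        // database and SQL login are both suffixed with the tenant name.
        var connectionString = _connectionStrings.GetOrAdd(tenant, t =>
            new SqlConnectionStringBuilder
            {
                DataSource = "sql-cluster-1",      // hypothetical SQL Server instance
                InitialCatalog = $"lucca_{t}",     // one database per tenant
                UserID = $"login_{t}",             // one SQL login per tenant
                Password = "********"              // placeholder: fetched from a secret store in practice
            }.ConnectionString);

        context.Items["Tenant"] = tenant;
        context.Items["TenantConnectionString"] = connectionString;

        await _next(context);
    }
}
```

Every data-access call downstream then opens its SqlConnection with that tenant-scoped string, which is what gives us the strong isolation on the SQL side.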

For customer files, we have a dedicated folder for each tenant on the DFS, and a web service in front of it to enforce tenant access policies and permissions.
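As an illustration of that “web service in front of the DFS” idea, here is a minimal sketch of a file endpoint scoped to the requesting tenant's folder. The share path, route, and permission handling are assumptions made for the example, not our actual service.

```csharp
// Minimal sketch of a tenant-scoped file endpoint in front of the DFS share.
// Paths, route and permission checks are illustrative only.
using System.IO;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("files")]
public class TenantFilesController : ControllerBase
{
    private const string DfsRoot = @"\\dfs\customers"; // hypothetical DFS share

    [HttpGet("{*relativePath}")]
    public IActionResult Download(string relativePath)
    {
        // Tenant resolved earlier in the pipeline (see the middleware sketch above).
        var tenant = HttpContext.Items["Tenant"] as string;
        if (tenant == null)
            return Unauthorized();

        // Each tenant has its own folder; resolve the requested path inside it and
        // reject anything that escapes it (e.g. "..\..\othertenant\file.pdf").
        var tenantRoot = Path.Combine(DfsRoot, tenant);
        var fullPath = Path.GetFullPath(Path.Combine(tenantRoot, relativePath));
        if (!fullPath.StartsWith(tenantRoot + Path.DirectorySeparatorChar))
            return Forbid();

        if (!System.IO.File.Exists(fullPath))
            return NotFound();

        // A real implementation would also check the user's permissions on this document.
        return PhysicalFile(fullPath, "application/octet-stream");
    }
}
```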

Tenant management

We have developed a custom internal tool (CloudControl) to manage all our customers' instances. Basically, we have a “tenant template” (an active instance) which consists of one database and several files. Creating a new instance is simple: we take a backup of this database and restore it under a new name, with new credentials. The same goes for the files.
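In SQL Server terms, this boils down to a BACKUP / RESTORE ... WITH MOVE plus a new login. Here is a minimal sketch of that idea; it is not CloudControl's actual code, and the paths, logical file names, and credential handling are simplified placeholders.

```csharp
// Minimal sketch of provisioning a tenant from the "template" database.
// Paths, logical file names and the login handling are placeholders.
using Microsoft.Data.SqlClient;

public static class InstanceProvisioner
{
    // Note: the tenant name comes from internal tooling here; never interpolate
    // untrusted input into SQL like this.
    public static void CreateTenant(string adminConnectionString, string tenant)
    {
        var sql = $@"
            BACKUP DATABASE [lucca_template]
                TO DISK = N'D:\backups\lucca_template.bak' WITH INIT;

            RESTORE DATABASE [lucca_{tenant}]
                FROM DISK = N'D:\backups\lucca_template.bak'
                WITH MOVE 'lucca_template'     TO N'D:\data\lucca_{tenant}.mdf',
                     MOVE 'lucca_template_log' TO N'D:\logs\lucca_{tenant}.ldf';

            CREATE LOGIN [login_{tenant}] WITH PASSWORD = N'<generated-password>';";

        using var connection = new SqlConnection(adminConnectionString);
        connection.Open();

        using var command = new SqlCommand(sql, connection);
        command.CommandTimeout = 600; // a backup + restore can take a little while
        command.ExecuteNonQuery();

        // A real implementation would then map the login to a database user,
        // copy the template's files on the DFS share, and register the new
        // instance in CloudControl.
    }
}
```

The strength of this approach, as the list below notes, is that a new instance is just a restore of a known-good template, which is both fast and robust.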

There are several strengths to this model:

  • Instance creation is really fast and robust.
  • We can create preview or training environments in a few seconds for any customer instance (previews are for internal usage, training environments are for our customers).
  • Developers can download and set up an anonymized instance on their laptop in a few seconds, with a single command line.
  • We can easily move an instance from one cluster to another in order to balance the load in case of specific customer usage patterns.
  • We sometimes have cases where a customer made a big mistake (like mass-deleting expenses) and asks us to restore their data to a previous state.

One of CloudControl's best features is the ability to create (in about a minute) a preview environment (for internal usage) from the artifacts of a specific GitHub branch (or several branches from multiple repositories/apps). This lets our developers and product owners test a new feature, validate a bug fix, or try to reproduce an issue reported by a customer, all in a few minutes.

When I started at Lucca in 2016, CloudControl was mind-blowing: it was the first time I had seen such a simple but effective solution for easy pre-production management. This asset is a fundamental part of Lucca's ability to deploy multiple times per day and quickly resolve customer issues.

Starting a preview

In this screenshot, I pick Git branches from multiple repositories and define a custom environment with an anonymized copy of a specific tenant. I can also connect to training (“formation”) instances or, if I have a support ticket assigned to me, connect directly to production.

About the author

Raphaël Ducom

DevOps Engineer