1. What is System Design
System design is often presented as something to study for interviews, but that is only a small part of what it really is.
In practice, system design is a way of thinking about problems: how parts connect, how decisions scale, how failures happen, and how systems behave as they grow.
The purpose of this site is to explore those ideas in a practical way, so they can be understood not just as interview topics, but as concepts that can be applied to real engineering challenges.
What better way to understand system design than by designing a system from scratch?
To make these ideas concrete, we will walk through a simple example: a synthetic monitoring system.
Synthetic monitoring is a technique used to check the availability and performance of endpoints by periodically sending requests and observing how they respond. Instead of relying on real user traffic, the system proactively simulates it, helping detect failures or slowdowns before they become visible to users.
We will design a basic version of such a system from scratch, focusing not on deep implementation details, but on how different parts of the system connect, how work flows through it, and what needs to be considered as the system begins to scale.
2. The System We Are Building
2.1 What Are We Building
We are building a simple synthetic monitoring system that periodically checks a set of configured endpoints to determine whether they are reachable and responding as expected.
Each endpoint will be triggered at a fixed interval, and the system will record whether the request was successful along with basic information such as response time or failure status.
At a high level, the goal is straightforward: continuously verify that a set of services is available and detect issues early, before they impact real users.
This is not a full featured monitoring platform. We will explore that in more depth in a later case study. For now, the focus is on a minimal system that helps us understand how such systems are structured and how their components work together.
2.2 Basic Requirements
Before thinking about the design, it is important to define what this system is expected to do. This helps set clear boundaries and keeps the problem focused.
At a minimum, the system should:
- store a list of endpoints that need to be monitored along with how frequently they should be checked
- trigger checks for each endpoint at a fixed interval, for example every five minutes, without missing or duplicating executions
- send a request to the endpoint and determine whether it is reachable and responding within an acceptable time
- record the result of each check so that we have a basic history of availability and performance
- handle multiple endpoints running at the same time, since checks will often overlap depending on their schedule
- account for common scenarios such as slow responses, timeouts, and temporary failures
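To make these requirements concrete, here is a minimal sketch of the two records involved. The field names (`interval_seconds`, `timeout_seconds`, and so on) are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EndpointConfig:
    """One monitored endpoint, as stored in the database (illustrative fields)."""
    endpoint_id: str                # stable identifier, e.g. "endpoint-42"
    url: str                        # the address to check
    interval_seconds: int = 300     # how often to run the check; default every 5 minutes
    timeout_seconds: float = 10.0   # how long to wait before declaring the check failed

@dataclass(frozen=True)
class CheckResult:
    """The outcome of a single check, forming the availability history."""
    endpoint_id: str
    success: bool
    response_time_ms: Optional[float]   # None when the request never completed
    error: Optional[str] = None         # e.g. "timeout" or "connection refused"

# Example: an endpoint checked every five minutes with the defaults above
cfg = EndpointConfig("endpoint-0", "https://example.com/health")
```

The `CheckResult` record is what later sections call "the result"; serialized, it is the roughly 500-byte unit that drives the storage estimates.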
3. High Level Design
3.1 Core Components
To keep the system simple, we will focus on four core components:
- Scheduler, which determines when an endpoint should be checked and dispatches work directly to the engine
- Engine, which performs the actual check by sending a request to the endpoint
- Database, which stores the endpoint configuration and the result of each check
- Time series system, which is used to store and visualize monitoring data over time
The time series system is where monitoring data becomes useful, allowing us to observe trends, detect anomalies, and build dashboards using tools such as Grafana or Wavefront.
Each of these components has a clear responsibility, which helps keep the system easier to reason about.
3.2 How They Work Together
The database stores the list of endpoints along with their monitoring frequency.
The scheduler continuously looks for endpoints that are due for execution and dispatches them directly to the engine.
The engine receives that work, sends a request to the target endpoint, measures the response, and determines whether the check succeeded or failed.
Once the check is complete, the result is written back to the database.
In parallel, the engine also forwards monitoring data through a pipeline towards the time series system for aggregation and visualization.
At this stage, the system looks simple because the complexity is not in the components, but in how time and coordination are managed between them.
OPERATIONAL PATH

+----------+     +-----------+     +--------+     +----------+
| Database |---->| Scheduler |---->| Engine |---->| Database |
+----------+     +-----------+     +--------+     +----------+
                                       |
                                       |  ANALYTICS PATH
                                       v
                                  +---------+     +-------------+     +---------+
                                  |  Queue  |---->| Time Series |---->| Grafana |
                                  +---------+     +-------------+     +---------+
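The interaction just described can be sketched as a toy loop, with the database replaced by in-memory structures and the engine by a plain function; all names here are illustrative:

```python
import time

def run_check(endpoint):
    """Engine: perform the check. Stubbed here; a real engine would send an HTTP request."""
    started = time.monotonic()
    # ... send request to endpoint["id"]'s URL, measure the response ...
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"endpoint_id": endpoint["id"], "success": True, "response_time_ms": elapsed_ms}

def scheduler_tick(endpoints, next_run, results, now):
    """Scheduler: dispatch every endpoint that is due, then reschedule it."""
    for ep in endpoints:
        if now >= next_run[ep["id"]]:
            results.append(run_check(ep))              # dispatch directly to the engine
            next_run[ep["id"]] = now + ep["interval"]  # schedule the next execution

endpoints = [{"id": "endpoint-0", "interval": 300}, {"id": "endpoint-1", "interval": 60}]
next_run = {ep["id"]: 0 for ep in endpoints}   # both due immediately
results = []
scheduler_tick(endpoints, next_run, results, now=0)
```

In this toy version the scheduler calls the engine synchronously; in the real system the dispatch would be asynchronous so that one slow check cannot delay every other endpoint behind it.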
3.3 End to End Flow
A single execution would look like this:
- an endpoint is stored in the database with its configuration
- when its scheduled time arrives, the scheduler picks it up
- the scheduler dispatches the work directly to the engine
- the engine performs the check against the endpoint
- the result is stored back in the database
- the result is also forwarded to a pipeline for time series processing
              Step 1      Step 2      Step 3      Step 4      Step 5      Step 6
              Store       Schedule    Dispatch    Execute     Write       Forward
              Config      Check       to Engine   Check       Result      to Queue

Database      [config]----------------------------------------[result]
Scheduler                 [picks up]--[sends]
Engine                                            [checks]----------------[pushes]
Time Series                                                               [receives]
This design is intentionally simple, but even in this form it introduces the key ideas behind system design: separation of responsibilities, flow of work, and coordination between components.
4. Capacity and Estimation
Once the high level design is clear, the next step is to estimate how much work the system is expected to handle. Even for a simple design, this is important because it helps us understand whether the system can support the required load and where bottlenecks may appear.
Capacity planning at this stage is less about precision and more about exposing where assumptions will break first.
4.1 Assumptions
To keep the estimation practical, let us assume the following:
- the system monitors 10,000 endpoints
- each endpoint is checked once every 5 minutes
- each check takes 2 seconds on average
- each result record is approximately 500 bytes
These numbers are only examples, but they give us a reasonable starting point.
4.2 Request Volume
If 10,000 endpoints are checked every 5 minutes, then the total number of checks per minute will be:
10,000 / 5 = 2,000 checks per minute
That means the system needs to handle approximately 2,000 checks per minute, or 33 checks per second.
At this stage, the numbers are small enough to ignore inefficiencies. A single engine instance could likely handle this load. That changes quickly as the number of monitored endpoints grows, and with it, every assumption made here needs to be revisited.
+-------------------------------------+
|          10,000 endpoints           |
|    monitored at fixed intervals     |
+---------------+---------------------+
                |  / 5 min interval
                v
+-------------------------------------+
|         2,000 checks / min          |
|     total scheduling throughput     |
+---------------+---------------------+
                |  / 60 seconds
                v
+-------------------------------------+
|         33 checks / second          |
|    engine must sustain this rate    |
+---------------+---------------------+
                |  x 2 sec avg duration
                v
+-------------------------------------+
|        66 concurrent workers        |
|   minimum engine capacity needed    |
+---------------+---------------------+
                |  x 500 bytes x 60 x 24
                v
+-------------------------------------+
|            1.44 GB / day            |
|      raw result storage growth      |
+-------------------------------------+
4.3 Concurrent Executions
If each check takes 2 seconds on average, and the system is processing around 33 checks per second, then the number of checks running at the same time will be roughly:
33 x 2 = 66 concurrent checks
So, at a minimum, the engine layer should be able to handle around 66 concurrent executions.
In practice, this number should be higher to account for timeouts, retries, and uneven scheduling distribution.
4.4 Storage Growth
If each check produces a result of around 500 bytes, then the amount of data written will be:
- 2,000 results per minute
- 2,000 x 60 = 120,000 results per hour
- 2,000 x 60 x 24 = 2,880,000 results per day
At 500 bytes per result, that is roughly 1.44 GB of data per day.
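These back-of-the-envelope numbers are simple enough to script, which makes it easy to re-check them whenever an assumption changes:

```python
# Capacity estimation from the assumptions in section 4.1.
endpoints = 10_000
interval_min = 5
avg_check_seconds = 2
result_bytes = 500

checks_per_minute = endpoints / interval_min               # 2,000 checks per minute
checks_per_second = checks_per_minute / 60                 # roughly 33 checks per second
concurrent_checks = checks_per_second * avg_check_seconds  # roughly 66 checks in flight
bytes_per_day = checks_per_minute * 60 * 24 * result_bytes
gb_per_day = bytes_per_day / 1_000_000_000                 # 1.44 GB of raw results per day

print(f"{checks_per_minute:.0f}/min, {checks_per_second:.1f}/s, "
      f"~{concurrent_checks:.0f} concurrent, {gb_per_day:.2f} GB/day")
```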
This is where storage decisions start to matter. Keeping raw results forever may not be practical, so retention and aggregation become important as the system grows.
5. Scaling Considerations
So far, the system works well for the assumed load. But as the number of endpoints grows, certain parts of the system will start to struggle. This is where we need to think about how the system scales.
5.1 Scaling the Engine
The engine is responsible for executing checks, and it handles most of the workload.
As the number of endpoints increases, the number of checks per second will increase as well. Since each check is independent, the engine can be scaled horizontally by adding more workers.
To distribute the incoming load more evenly, a load balancer can sit in front of the engine layer and route requests across multiple engine instances.
BEFORE                  AFTER

Scheduler               Scheduler
    |                       |
    v                       v
+---------+        +------------------+
|  Engine |        |  Load Balancer   |
+---------+        +--------+---------+
                            |
               +------------+------------+
               v            v            v
          +---------+  +---------+  +---------+
          | Engine 1|  | Engine 2|  | Engine 3|
          +---------+  +---------+  +---------+
Of all the components in this system, the engine is the most forgiving to scale: stateless, parallel, and easy to replicate. The harder problems are elsewhere.
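Since each check is independent, the same property that makes the engine easy to replicate across instances also makes it easy to parallelize within one. A minimal sketch using a bounded thread pool, with the check itself stubbed out:

```python
from concurrent.futures import ThreadPoolExecutor

def run_check(endpoint_id):
    """Stub check; a real engine would send an HTTP request with a timeout here."""
    return {"endpoint_id": endpoint_id, "success": True}

def execute_batch(endpoint_ids, max_workers=80):
    """Run checks concurrently; 80 workers gives headroom over the ~66 estimated."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_check, endpoint_ids))

results = execute_batch([f"endpoint-{i}" for i in range(200)])
```

Scaling out is then just running more copies of this worker behind the load balancer, since no state is shared between instances.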
5.2 Scheduler Bottlenecks
The scheduler is responsible for identifying which endpoints need to be executed at a given time.
In a simple design, a single scheduler instance may be enough. But as the number of monitored endpoints grows, a single scheduler becomes a bottleneck and the natural instinct is to add more replicas. This introduces a new problem: multiple scheduler replicas reading from the same configuration source will attempt to process the same endpoints, resulting in duplicated executions and wasted compute.
Common approaches to handle this include sharding work manually, scaling the scheduler vertically, or introducing an intermediary between the scheduler and the engine. These can reduce duplication but often add complexity or cost without fully solving the root problem.
A more efficient approach is to use a modulo-based partitioning model. Each scheduler replica is assigned an ordinal index, for example 0, 1, or 2 in a three-replica setup. When deciding which endpoints to process, each replica only handles endpoints where the endpoint ID modulo the total replica count equals its own index. This means every endpoint is deterministically owned by exactly one replica, with no coordination overhead and no duplication.
Endpoint ID     Mod 3 Result     Assigned To
-----------     ------------     -----------------
endpoint-0      0                Scheduler Pod 0
endpoint-1      1                Scheduler Pod 1
endpoint-2      2                Scheduler Pod 2
endpoint-3      0                Scheduler Pod 0
endpoint-4      1                Scheduler Pod 1
endpoint-5      2                Scheduler Pod 2
+--------------------------------------------------+
|                   StatefulSet                    |
|                                                  |
|  +------------+  +------------+  +------------+  |
|  |   Pod 0    |  |   Pod 1    |  |   Pod 2    |  |
|  |  index=0   |  |  index=1   |  |  index=2   |  |
|  |  owns IDs  |  |  owns IDs  |  |  owns IDs  |  |
|  |  0, 3, 6   |  |  1, 4, 7   |  |  2, 5, 8   |  |
|  +------------+  +------------+  +------------+  |
+--------------------------------------------------+
In Kubernetes, this maps naturally to a StatefulSet, which guarantees stable pod identities and ordinal indexes across restarts and scaling events. As load increases, new replicas can be added and the partition boundaries adjust automatically. This approach has proven effective at workloads exceeding two million task executions per hour while maintaining consistent efficiency.
The key idea is that the partitioning logic is entirely deterministic. No central coordinator is needed, and no replica needs to communicate with another to decide what to work on.
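A sketch of that ownership rule, assuming each replica reads its ordinal from its pod name (which a StatefulSet keeps stable) and that non-numeric endpoint IDs are mapped to integers with a stable hash such as CRC32 (Python's built-in `hash` is not stable across processes):

```python
import re
import zlib

def replica_ordinal(pod_name):
    """Extract the ordinal index from a StatefulSet pod name like 'scheduler-2'."""
    return int(re.search(r"-(\d+)$", pod_name).group(1))

def owns(endpoint_id, ordinal, replica_count):
    """Deterministic ownership: each endpoint belongs to exactly one replica."""
    return zlib.crc32(endpoint_id.encode()) % replica_count == ordinal

# Each replica filters the shared configuration down to its own partition.
endpoints = [f"endpoint-{i}" for i in range(9)]
mine = [e for e in endpoints if owns(e, replica_ordinal("scheduler-1"), replica_count=3)]
```

Note that changing the replica count reshuffles ownership, which is harmless here because the work is stateless; a system with sticky per-endpoint state would need consistent hashing instead.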
We will explore this pattern in more depth in a dedicated article on scaling schedulers at large scale.
5.3 Database Load
The database stores endpoint configuration and monitoring results.
As the system grows, this layer can become more read intensive, especially if the scheduler and other services frequently query endpoint configuration and recent monitoring data.
Most systems do not fail because the database cannot store data. They fail because access patterns were never designed for scale. A database that works perfectly at a thousand queries per second can collapse under ten thousand if the queries are unindexed, unbounded, or hitting the wrong layer.
A common way to handle this is to add read replicas so that read traffic can be offloaded from the primary database. For example, Amazon Aurora supports up to 15 replicas in a cluster, which makes it a practical option for scaling read heavy workloads while keeping writes on the primary instance.
At this stage, the database design is no longer just about storage. It becomes a question of how reads and writes are separated, how queries are structured, and whether the access patterns that made sense at small scale still hold as load increases.
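One way this separation shows up in code is a small router that sends writes to the primary and spreads reads across replicas. A sketch with connections stubbed as plain strings; a real system would hold actual database clients, or simply point reads at Aurora's reader endpoint:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary, spread reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_write(self):
        return self.primary

    def for_read(self):
        # Fall back to the primary when no replicas are configured.
        return next(self._replicas) if self._replicas else self.primary

router = ReadWriteRouter(primary="primary-db", replicas=["replica-1", "replica-2"])
```

The design choice here is that the split lives at the application layer, which keeps replica lag visible to the caller: reads that must be fresh can still be sent through `for_write`.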
5.4 Time Series and Data Pipeline
As the system grows, simply storing results in the database is not enough. Monitoring data is most useful when it can be queried efficiently over time and visualized.
This is where a time series system comes in.
The moment data is written continuously, storage stops being the problem. Write patterns become the problem. How often data arrives, whether it arrives in bursts, and how the receiving system handles that pressure all matter more than raw capacity.
Instead of directly pushing results from the engine to the time series system, a queue can be introduced between them. The difference becomes clear under a traffic spike:
WITHOUT QUEUE

  Engine ---- spike! ----> Time Series
                               |
                               v
                          overwhelmed

WITH QUEUE

  Engine -- burst in --> Queue ---- smoothed out ----> Time Series
                         ^^^^^^
                         absorbed
The flow becomes:
- engine generates result
- result is pushed to a queue
- time series system consumes data from the queue
This provides several advantages:
- it decouples the monitoring system from the time series system
- it prevents spikes in monitoring traffic from overwhelming the time series backend
- it allows batching and smoothing of writes
- it improves reliability in case the time series system is temporarily unavailable
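These properties can be seen even with an in-process queue standing in for a real broker such as Kafka or SQS (names and sizes here are illustrative); the consumer drains results in batches, which is what turns a burst of writes into a smooth stream:

```python
import queue

# Bounded: under sustained overload the engine gets backpressure
# instead of the process growing without limit.
results_queue = queue.Queue(maxsize=10_000)

def engine_publish(result):
    """Engine side: hand the result off and move on to the next check."""
    results_queue.put(result)

def consume_batch(max_batch=500):
    """Time series side: drain up to max_batch results for one batched write."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(results_queue.get_nowait())
        except queue.Empty:
            break
    return batch

for i in range(1200):   # simulate a burst of results arriving at once
    engine_publish({"endpoint_id": f"endpoint-{i}", "success": True})
batch = consume_batch()  # the time series backend sees one 500-item write, not 1,200 singles
```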
Tools such as Grafana or Wavefront can then be used to visualize trends, latency patterns, and availability over time.
This separation ensures that the core monitoring system remains stable, while analytics and visualization can scale independently.
6. Conclusion
What we designed here is a simple system, but it highlights the core ideas behind system design.
We started with a clear problem, defined its boundaries, and built a system with a few well defined components. As we moved forward, we saw how load, scale, and data growth begin to influence design decisions.
Even in this basic setup, trade offs start to appear. Some parts are easy to scale, while others require more careful thinking. What seems simple at first quickly becomes more complex as the system grows.
There are also areas we have intentionally left for later.
The scheduler scaling problem introduced in this article will be explored in depth in a future piece, specifically how modulo-based partitioning can be implemented cleanly in a Kubernetes-native environment.
A real synthetic monitoring system also runs checks from multiple geographic locations. Multi-region execution introduces coordination and consistency challenges that deserve their own dedicated treatment: how checks are distributed globally, how results are aggregated across regions, and how failures in one region are handled without affecting others.
The goal here is not to build a complete system, but to develop the ability to reason about these changes and design systems that continue to work as they grow.
Because systems rarely fail when they are small. They fail when growth exposes the assumptions they were built on.