1. What is System Design
System design is often presented as something to study for interviews, but that is only a small part of what it really is.
In practice, system design is a way of thinking about problems: how parts connect, how decisions scale, how failures happen, and how systems behave as they grow.
The purpose of this site is to explore those ideas in a practical way, so they can be understood not just as interview topics, but as concepts that can be applied to real engineering challenges.
What better way to understand system design than by designing a system from scratch?
To make these ideas concrete, we will walk through a simple example: a synthetic monitoring system.
Synthetic monitoring is a technique used to check the availability and performance of endpoints by periodically sending requests and observing how they respond. Instead of relying on real user traffic, the system proactively simulates it, helping detect failures or slowdowns before they become visible to users.
We will design a basic version of such a system from scratch, focusing not on deep implementation details, but on how different parts of the system connect, how work flows through it, and what needs to be considered as the system begins to scale.
2. The System We Are Building
2.1 What Are We Building
We are building a simple synthetic monitoring system that periodically checks a set of configured endpoints to determine whether they are reachable and responding as expected.
Each endpoint will be triggered at a fixed interval, and the system will record whether the request was successful along with basic information such as response time or failure status.
At a high level, the goal is straightforward: continuously verify that a set of services is available and detect issues early, before they impact real users.
This is not a full featured monitoring platform. We will explore that in more depth in a later case study. For now, the focus is on a minimal system that helps us understand how such systems are structured and how their components work together.
2.2 Basic Requirements
Before thinking about the design, it is important to define what this system is expected to do. This helps set clear boundaries and keeps the problem focused.
At a minimum, the system should:
- store a list of endpoints that need to be monitored along with how frequently they should be checked
- trigger checks for each endpoint at a fixed interval, for example every five minutes, without missing or duplicating executions
- send a request to the endpoint and determine whether it is reachable and responding within an acceptable time
- record the result of each check so that we have a basic history of availability and performance
- handle multiple endpoints running at the same time, since checks will often overlap depending on their schedule
- account for common scenarios such as slow responses, timeouts, and temporary failures
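To make these requirements concrete, here is a minimal sketch of the two records involved. The field names (`interval_seconds`, `timeout_seconds`, and so on) are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EndpointConfig:
    """One monitored endpoint, as stored in the database (illustrative fields)."""
    endpoint_id: str                # stable identifier, e.g. "endpoint-42"
    url: str                        # the address to check
    interval_seconds: int = 300     # how often to run the check; default every 5 minutes
    timeout_seconds: float = 10.0   # how long to wait before declaring the check failed

@dataclass(frozen=True)
class CheckResult:
    """The outcome of a single check, forming the availability history."""
    endpoint_id: str
    success: bool
    response_time_ms: Optional[float]   # None when the request never completed
    error: Optional[str] = None         # e.g. "timeout" or "connection refused"

# Example: an endpoint checked every five minutes with the defaults above
cfg = EndpointConfig("endpoint-0", "https://example.com/health")
```

The `CheckResult` record is what later sections call "the result"; serialized, it is the roughly 500-byte unit that drives the storage estimates.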
3. High Level Design
3.1 Core Components
To keep the system simple, we will focus on four core components:
- Scheduler, which determines when an endpoint should be checked and dispatches work directly to the engine
- Engine, which performs the actual check by sending a request to the endpoint
- Database, which stores the endpoint configuration and the result of each check
- Time series system, which is used to store and visualize monitoring data over time
The time series system is where monitoring data becomes useful, allowing us to observe trends, detect anomalies, and build dashboards using tools such as Grafana or Wavefront.
Each of these components has a clear responsibility, which helps keep the system easier to reason about.
3.2 How They Work Together
The database stores the list of endpoints along with their monitoring frequency.
The scheduler continuously looks for endpoints that are due for execution and dispatches them directly to the engine.
The engine receives that work, sends a request to the target endpoint, measures the response, and determines whether the check succeeded or failed.
Once the check is complete, the result is written back to the database.
In parallel, the engine also forwards monitoring data through a pipeline towards the time series system for aggregation and visualization.
At this stage, the system looks simple because the complexity is not in the components, but in how time and coordination are managed between them.
OPERATIONAL PATH

+----------+     +-----------+     +--------+     +----------+
| Database |---->| Scheduler |---->| Engine |---->| Database |
+----------+     +-----------+     +--------+     +----------+
                                       |
                                       |  ANALYTICS PATH
                                       v
                                  +---------+     +-------------+     +---------+
                                  |  Queue  |---->| Time Series |---->| Grafana |
                                  +---------+     +-------------+     +---------+
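The interaction just described can be sketched as a toy loop, with the database replaced by in-memory structures and the engine by a plain function; all names here are illustrative:

```python
import time

def run_check(endpoint):
    """Engine: perform the check. Stubbed here; a real engine would send an HTTP request."""
    started = time.monotonic()
    # ... send request to endpoint["id"]'s URL, measure the response ...
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"endpoint_id": endpoint["id"], "success": True, "response_time_ms": elapsed_ms}

def scheduler_tick(endpoints, next_run, results, now):
    """Scheduler: dispatch every endpoint that is due, then reschedule it."""
    for ep in endpoints:
        if now >= next_run[ep["id"]]:
            results.append(run_check(ep))              # dispatch directly to the engine
            next_run[ep["id"]] = now + ep["interval"]  # schedule the next execution

endpoints = [{"id": "endpoint-0", "interval": 300}, {"id": "endpoint-1", "interval": 60}]
next_run = {ep["id"]: 0 for ep in endpoints}   # both due immediately
results = []
scheduler_tick(endpoints, next_run, results, now=0)
```

In this toy version the scheduler calls the engine synchronously; in the real system the dispatch would be asynchronous so that one slow check cannot delay every other endpoint behind it.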
3.3 End to End Flow
A single execution would look like this:
- an endpoint is stored in the database with its configuration
- when its scheduled time arrives, the scheduler picks it up
- the scheduler dispatches the work directly to the engine
- the engine performs the check against the endpoint
- the result is stored back in the database
- the result is also forwarded to a pipeline for time series processing
              Step 1      Step 2      Step 3      Step 4      Step 5      Step 6
              Store       Schedule    Dispatch    Execute     Write       Forward
              Config      Check       to Engine   Check       Result      to Queue

Database      [config]----------------------------------------[result]
Scheduler                 [picks up]--[sends]
Engine                                            [checks]----------------[pushes]
Time Series                                                               [receives]
This design is intentionally simple, but even in this form it introduces the key ideas behind system design: separation of responsibilities, flow of work, and coordination between components.
4. Capacity and Estimation
Once the high level design is clear, the next step is to estimate how much work the system is expected to handle. Even for a simple design, this is important because it helps us understand whether the system can support the required load and where bottlenecks may appear.
Capacity planning at this stage is less about precision and more about exposing where assumptions will break first.
4.1 Assumptions
To keep the estimation practical, let us assume the following:
- the system monitors 10,000 endpoints
- each endpoint is checked once every 5 minutes
- each check takes 2 seconds on average
- each result record is approximately 500 bytes
These numbers are only examples, but they give us a reasonable starting point.
4.2 Request Volume
If 10,000 endpoints are checked every 5 minutes, then the total number of checks per minute will be:
10,000 / 5 = 2,000 checks per minute
That means the system needs to handle approximately 2,000 checks per minute, or 33 checks per second.
At this stage, the numbers are small enough to ignore inefficiencies. A single engine instance could likely handle this load. That changes quickly as the number of monitored endpoints grows, and with it, every assumption made here needs to be revisited.
+-------------------------------------+
|          10,000 endpoints           |
|    monitored at fixed intervals     |
+---------------+---------------------+
                |  / 5 min interval
                v
+-------------------------------------+
|         2,000 checks / min          |
|     total scheduling throughput     |
+---------------+---------------------+
                |  / 60 seconds
                v
+-------------------------------------+
|         33 checks / second          |
|    engine must sustain this rate    |
+---------------+---------------------+
                |  x 2 sec avg duration
                v
+-------------------------------------+
|        66 concurrent workers        |
|   minimum engine capacity needed    |
+---------------+---------------------+
                |  x 500 bytes x 60 x 24
                v
+-------------------------------------+
|            1.44 GB / day            |
|      raw result storage growth      |
+-------------------------------------+
4.3 Concurrent Executions
If each check takes 2 seconds on average, and the system is processing around 33 checks per second, then the number of checks running at the same time will be roughly:
33 x 2 = 66 concurrent checks
So, at a minimum, the engine layer should be able to handle around 66 concurrent executions.
In practice, this number should be higher to account for timeouts, retries, and uneven scheduling distribution.
4.4 Storage Growth
If each check produces a result of around 500 bytes, then the amount of data written will be:
- 2,000 results per minute
- 2,000 x 60 = 120,000 results per hour
- 2,000 x 60 x 24 = 2,880,000 results per day
At 500 bytes per result, that is roughly 1.44 GB of data per day.
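These back-of-the-envelope numbers are simple enough to script, which makes it easy to re-check them whenever an assumption changes:

```python
# Capacity estimation from the assumptions in section 4.1.
endpoints = 10_000
interval_min = 5
avg_check_seconds = 2
result_bytes = 500

checks_per_minute = endpoints / interval_min               # 2,000 checks per minute
checks_per_second = checks_per_minute / 60                 # roughly 33 checks per second
concurrent_checks = checks_per_second * avg_check_seconds  # roughly 66 checks in flight
bytes_per_day = checks_per_minute * 60 * 24 * result_bytes
gb_per_day = bytes_per_day / 1_000_000_000                 # 1.44 GB of raw results per day

print(f"{checks_per_minute:.0f}/min, {checks_per_second:.1f}/s, "
      f"~{concurrent_checks:.0f} concurrent, {gb_per_day:.2f} GB/day")
```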
This is where storage decisions start to matter. Keeping raw results forever may not be practical, so retention and aggregation become important as the system grows.
5. Scaling Considerations
So far, the system works well for the assumed load. But as the number of endpoints grows, certain parts of the system will start to struggle. This is where we need to think about how the system scales.
5.1 Scaling the Engine
The engine is responsible for executing checks, and it handles most of the workload.
As the number of endpoints increases, the number of checks per second will increase as well. Since each check is independent, the engine can be scaled horizontally by adding more workers.
To distribute the incoming load more evenly, a load balancer can sit in front of the engine layer and route requests across multiple engine instances.
BEFORE                  AFTER

Scheduler               Scheduler
    |                       |
    v                       v
+---------+        +------------------+
|  Engine |        |  Load Balancer   |
+---------+        +--------+---------+
                            |
               +------------+------------+
               v            v            v
          +---------+  +---------+  +---------+
          | Engine 1|  | Engine 2|  | Engine 3|
          +---------+  +---------+  +---------+
Of all the components in this system, the engine is the most forgiving to scale: stateless, parallel, and easy to replicate. The harder problems are elsewhere.
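Since each check is independent, the same property that makes the engine easy to replicate across instances also makes it easy to parallelize within one. A minimal sketch using a bounded thread pool, with the check itself stubbed out:

```python
from concurrent.futures import ThreadPoolExecutor

def run_check(endpoint_id):
    """Stub check; a real engine would send an HTTP request with a timeout here."""
    return {"endpoint_id": endpoint_id, "success": True}

def execute_batch(endpoint_ids, max_workers=80):
    """Run checks concurrently; 80 workers gives headroom over the ~66 estimated."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_check, endpoint_ids))

results = execute_batch([f"endpoint-{i}" for i in range(200)])
```

Scaling out is then just running more copies of this worker behind the load balancer, since no state is shared between instances.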
5.2 Scheduler Bottlenecks
The scheduler is responsible for identifying which endpoints need to be executed at a given time.
In a simple design, a single scheduler instance may be enough. But as the number of monitored endpoints grows, a single scheduler becomes a bottleneck and the natural instinct is to add more replicas. This introduces a new problem: multiple scheduler replicas reading from the same configuration source will attempt to process the same endpoints, resulting in duplicated executions and wasted compute.
Common approaches to handle this include sharding work manually, scaling the scheduler vertically, or introducing an intermediary between the scheduler and the engine. These can reduce duplication but often add complexity or cost without fully solving the root problem.
A more efficient approach is to use a modulo-based partitioning model. Each scheduler replica is assigned an ordinal index, for example 0, 1, or 2 in a three-replica setup. When deciding which endpoints to process, each replica only handles endpoints where the endpoint ID modulo the total replica count equals its own index. This means every endpoint is deterministically owned by exactly one replica, with no coordination overhead and no duplication.
Endpoint ID     Mod 3 Result     Assigned To
-----------     ------------     -----------------
endpoint-0      0                Scheduler Pod 0
endpoint-1      1                Scheduler Pod 1
endpoint-2      2                Scheduler Pod 2
endpoint-3      0                Scheduler Pod 0
endpoint-4      1                Scheduler Pod 1
endpoint-5      2                Scheduler Pod 2
+--------------------------------------------------+
|                   StatefulSet                    |
|                                                  |
|  +------------+  +------------+  +------------+  |
|  |   Pod 0    |  |   Pod 1    |  |   Pod 2    |  |
|  |  index=0   |  |  index=1   |  |  index=2   |  |
|  |  owns IDs  |  |  owns IDs  |  |  owns IDs  |  |
|  |  0, 3, 6   |  |  1, 4, 7   |  |  2, 5, 8   |  |
|  +------------+  +------------+  +------------+  |
+--------------------------------------------------+
In Kubernetes, this maps naturally to a StatefulSet, which guarantees stable pod identities and ordinal indexes across restarts and scaling events. As load increases, new replicas can be added and the partition boundaries adjust automatically. This approach has proven effective at workloads exceeding two million task executions per hour while maintaining consistent efficiency.
The key idea is that the partitioning logic is entirely deterministic. No central coordinator is needed, and no replica needs to communicate with another to decide what to work on.
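A sketch of that ownership rule, assuming each replica reads its ordinal from its pod name (which a StatefulSet keeps stable) and that non-numeric endpoint IDs are mapped to integers with a stable hash such as CRC32 (Python's built-in `hash` is not stable across processes):

```python
import re
import zlib

def replica_ordinal(pod_name):
    """Extract the ordinal index from a StatefulSet pod name like 'scheduler-2'."""
    return int(re.search(r"-(\d+)$", pod_name).group(1))

def owns(endpoint_id, ordinal, replica_count):
    """Deterministic ownership: each endpoint belongs to exactly one replica."""
    return zlib.crc32(endpoint_id.encode()) % replica_count == ordinal

# Each replica filters the shared configuration down to its own partition.
endpoints = [f"endpoint-{i}" for i in range(9)]
mine = [e for e in endpoints if owns(e, replica_ordinal("scheduler-1"), replica_count=3)]
```

Note that changing the replica count reshuffles ownership, which is harmless here because the work is stateless; a system with sticky per-endpoint state would need consistent hashing instead.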
We will explore this pattern in more depth in a dedicated article on scaling schedulers at large scale.
5.3 Database Load
The database stores endpoint configuration and monitoring results.
As the system grows, this layer can become more read intensive, especially if the scheduler and other services frequently query endpoint configuration and recent monitoring data.
Most systems do not fail because the database cannot store data. They fail because access patterns were never designed for scale. A database that works perfectly at a thousand queries per second can collapse under ten thousand if the queries are unindexed, unbounded, or hitting the wrong layer.
A common way to handle this is to add read replicas so that read traffic can be offloaded from the primary database. For example, Amazon Aurora supports up to 15 replicas in a cluster, which makes it a practical option for scaling read heavy workloads while keeping writes on the primary instance.
At this stage, the database design is no longer just about storage. It becomes a question of how reads and writes are separated, how queries are structured, and whether the access patterns that made sense at small scale still hold as load increases.
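One way this separation shows up in code is a small router that sends writes to the primary and spreads reads across replicas. A sketch with connections stubbed as plain strings; a real system would hold actual database clients, or simply point reads at Aurora's reader endpoint:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary, spread reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_write(self):
        return self.primary

    def for_read(self):
        # Fall back to the primary when no replicas are configured.
        return next(self._replicas) if self._replicas else self.primary

router = ReadWriteRouter(primary="primary-db", replicas=["replica-1", "replica-2"])
```

The design choice here is that the split lives at the application layer, which keeps replica lag visible to the caller: reads that must be fresh can still be sent through `for_write`.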
5.4 Time Series and Data Pipeline
As the system grows, simply storing results in the database is not enough. Monitoring data is most useful when it can be queried efficiently over time and visualized.
This is where a time series system comes in.
The moment data is written continuously, storage stops being the problem. Write patterns become the problem. How often data arrives, whether it arrives in bursts, and how the receiving system handles that pressure all matter more than raw capacity.
Instead of directly pushing results from the engine to the time series system, a queue can be introduced between them. The difference becomes clear under a traffic spike:
WITHOUT QUEUE

  Engine ---- spike! ----> Time Series
                               |
                               v
                          overwhelmed

WITH QUEUE

  Engine -- burst in --> Queue ---- smoothed out ----> Time Series
                         ^^^^^^
                         absorbed
The flow becomes:
- engine generates result
- result is pushed to a queue
- time series system consumes data from the queue
This provides several advantages:
- it decouples the monitoring system from the time series system
- it prevents spikes in monitoring traffic from overwhelming the time series backend
- it allows batching and smoothing of writes
- it improves reliability in case the time series system is temporarily unavailable
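These properties can be seen even with an in-process queue standing in for a real broker such as Kafka or SQS (names and sizes here are illustrative); the consumer drains results in batches, which is what turns a burst of writes into a smooth stream:

```python
import queue

# Bounded: under sustained overload the engine gets backpressure
# instead of the process growing without limit.
results_queue = queue.Queue(maxsize=10_000)

def engine_publish(result):
    """Engine side: hand the result off and move on to the next check."""
    results_queue.put(result)

def consume_batch(max_batch=500):
    """Time series side: drain up to max_batch results for one batched write."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(results_queue.get_nowait())
        except queue.Empty:
            break
    return batch

for i in range(1200):   # simulate a burst of results arriving at once
    engine_publish({"endpoint_id": f"endpoint-{i}", "success": True})
batch = consume_batch()  # the time series backend sees one 500-item write, not 1,200 singles
```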
Tools such as Grafana or Wavefront can then be used to visualize trends, latency patterns, and availability over time.
This separation ensures that the core monitoring system remains stable, while analytics and visualization can scale independently.
6. Conclusion
What we designed here is a simple system, but it highlights the core ideas behind system design.
We started with a clear problem, defined its boundaries, and built a system with a few well defined components. As we moved forward, we saw how load, scale, and data growth begin to influence design decisions.
Even in this basic setup, trade offs start to appear. Some parts are easy to scale, while others require more careful thinking. What seems simple at first quickly becomes more complex as the system grows.
There are also areas we have intentionally left for later.
The scheduler scaling problem introduced in this article will be explored in depth in a future piece, specifically how modulo-based partitioning can be implemented cleanly in a Kubernetes-native environment.
A real synthetic monitoring system also runs checks from multiple geographic locations. Multi-region execution introduces coordination and consistency challenges that deserve their own dedicated treatment: how checks are distributed globally, how results are aggregated across regions, and how failures in one region are handled without affecting others.
The goal here is not to build a complete system, but to develop the ability to reason about these changes and design systems that continue to work as they grow.
Because systems rarely fail when they are small. They fail when growth exposes the assumptions they were built on.