Designing Systems: How to Approach a System Design
This is the first part of a four-part Designing Systems series. The series is aimed at senior, staff, and lead engineers who design systems as part of their actual job - not at people prepping for interviews, and not at people deciding which architecture paradigm to adopt. Both of those are useful, but they are different conversations.
The four parts of the series:
- Part 1: How to Approach a System Design: Framing the problem, separating requirements from constraints from tradeoffs, the hidden assumptions, sketching before deciding, and the design as a communication artifact.
- Part 2: Core Elements: Data, state, control, failure, and observability - the five things every system design has to make decisions about.
- Part 3: Structuring for Change: Boundaries, contracts, evolution, and ownership. How a system survives the changes that are coming after the first version ships.
- Part 4: Anti-Patterns and What Experience Teaches: The failure modes that show up repeatedly, the positive case for what a good design feels like, and a few habits that compound over a career.
Most bad system designs are not bad because the engineer chose the wrong technology. They are bad because the engineer started drawing boxes before they understood the problem. Part 1 is about how to slow down at the front of the design and ask the questions that determine whether the rest of the work is useful.
Framing Before Designing
The most common mistake in a system design is starting too early. The boxes and arrows are seductive. They feel like progress. They are, in fact, the easiest part of the work, and they are useless until the framing is done.
Framing means three things: knowing what the system is supposed to do, knowing what it has to be good at, and knowing what is fixed versus what you actually get to decide. Most experienced engineers can run this loop in their head for a small change. For a non-trivial system, the loop has to be explicit, or it gets skipped.
Functional and Non-Functional Requirements
The functional requirements are what the system has to do. Customers can place an order. Customers can request a refund. Messages are delivered in order. These are usually the easy part of the conversation, because the product or business stakeholder has been thinking about them already.
The non-functional requirements (NFRs) are how the system has to behave. Availability. Latency. Throughput. Durability. Consistency. Compliance. Cost. Maintainability. NFRs are where most system designs succeed or fail, and they are also the part that most design conversations gloss over.
A useful habit is to write down every NFR with a target number and a stakeholder.
| NFR | Useful framing | Common failure |
|---|---|---|
| Availability | "99.95% monthly availability, region-failover within 5 minutes" | "highly available" with no target |
| Latency | "p95 read under 100ms, p99 write under 500ms" | average latency only, no tail behaviour |
| Consistency | "read-after-write within the same session, eventual otherwise" | "strong consistency" without scoping it |
| Throughput | "10k writes/sec sustained, 30k peak, 5x growth in 18 months" | current load only, no growth projection |
| Durability | "no data loss beyond a 5-second window in any single failure" | "no data loss" as an absolute |
| Cost | "infra below $X/month at projected steady-state load" | cost left implicit, surfaces in review later |
| Compliance | "PII at rest encrypted, audit log retained 7 years" | listed once, never traced into design choices |
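Target numbers become more useful once they are converted into budgets. The sketch below is a minimal illustration, assuming a 30-day month and steady compound growth; the function names are hypothetical and not part of any library. It turns the availability and throughput targets from the table into a downtime budget and a monthly growth rate.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def monthly_growth_rate(factor: float, months: int) -> float:
    """Compound monthly rate that produces `factor` growth over `months`."""
    return factor ** (1 / months) - 1


# Assumption: a 30-day month and smooth compound growth.
budget = downtime_budget_minutes(99.95)
growth = monthly_growth_rate(5, 18)

print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
print(f"5x in 18 months is about {growth:.1%} compound growth per month")
```

Under these assumptions the monthly budget is about 21.6 minutes, so a single 5-minute region failover spends roughly a quarter of it. That is the kind of interaction between two targets that only shows up once both are written as numbers.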
[Figure: Illustrative NFR priorities by domain (relative, not absolute)]
The same NFR menu produces different priority profiles across domains. A banking system and a streaming system are not arguing about the same tradeoffs. Naming the domain up front saves a lot of design conversation later.
Two follow-on habits matter as much as writing the NFRs down.
First, surface conflicts early. Strong consistency reduces availability under network partitions. Lower latency targets push you toward denormalisation, which raises maintenance cost. Tight cost targets push you away from regional redundancy, which lowers availability. NFRs that look reasonable individually often conflict when you put them next to each other. Better to find that out before the design than after.
Second, tie each NFR to a stakeholder. Orphan NFRs are the first to be dropped under pressure. If the cost ceiling is the CFO's, that conversation looks different than if it is a casual estimate. If the compliance requirement is from legal, it stops being negotiable. The stakeholder is part of the requirement.
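One way to keep both habits honest is to record each NFR as structured data and check it mechanically. The following is a rough sketch, not a prescribed format: the `NFR` class, the `review_gaps` helper, and the stakeholder names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NFR:
    name: str
    target: Optional[str]       # measurable target, e.g. "p95 read under 100ms"
    stakeholder: Optional[str]  # who owns the requirement (hypothetical names below)


def review_gaps(nfrs: list[NFR]) -> dict[str, list[str]]:
    """Return the names of NFRs that lack a target or a stakeholder."""
    return {
        "no_target": [n.name for n in nfrs if not n.target],
        "orphan": [n.name for n in nfrs if not n.stakeholder],
    }


nfrs = [
    NFR("Availability", "99.95% monthly, region failover within 5 minutes", "Head of Platform"),
    NFR("Latency", "p95 read under 100ms, p99 write under 500ms", None),
    NFR("Durability", None, "Legal"),
]

gaps = review_gaps(nfrs)
print(gaps)
```

Here the latency NFR is an orphan and the durability NFR has an owner but no number, which are the two gaps worth closing before the design starts.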
Constraints, Tradeoffs, and Decisions
These three words often get used interchangeably in design conversations. They are not the same thing, and conflating them is one of the most common ways designs go wrong.
A constraint is fixed. You did not get to pick it. Maybe a regulator requires it, maybe the budget forces it, maybe the team's existing language stack rules out the alternatives. A constraint is part of the problem, not part of the solution.
A tradeoff is a choice between two valid options where you give something up to get something else. CP vs AP, lower latency vs higher throughput, faster delivery vs higher confidence. Tradeoffs do not have right answers - they have answers that match a context.
A decision is the resolution of a tradeoff in the current context, with the reasoning attached. Decisions can be revisited if the context changes. That is the whole point of treating them as decisions rather than facts.
The reason this matters: in most design discussions, people treat their decisions as constraints, and other people's constraints as decisions they could revisit. That asymmetry burns a lot of design time. A useful discipline is to label each item in the design explicitly:
| Item | Type | Notes |
|---|---|---|
| Must run on existing Kafka cluster | Constraint | Org-level standardisation, not negotiable |
| Strong consistency for balance reads | Decision | Justified by audit and dispute requirements |
| Single-region writes, multi-region reads | Decision | Accepts lower write availability (one write region) in exchange for low read latency |
| Postgres for primary store | Decision | Could revisit if load profile changes materially |
| PII encryption at rest | Constraint | Legal/compliance, not negotiable |
Once you label the items, the design conversation gets faster. You stop arguing about things that are not actually open, and you start arguing about the things that are.
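The labelling can also live next to the design as data, so that "what is actually open?" is a query rather than a debate. This is a minimal sketch using the rows from the table above; the `Kind` enum, the `DesignItem` class, and the helper function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CONSTRAINT = "constraint"  # fixed, part of the problem
    DECISION = "decision"      # a resolved tradeoff, revisable if context changes


@dataclass(frozen=True)
class DesignItem:
    statement: str
    kind: Kind
    notes: str


ITEMS = [
    DesignItem("Must run on existing Kafka cluster", Kind.CONSTRAINT,
               "Org-level standardisation, not negotiable"),
    DesignItem("Strong consistency for balance reads", Kind.DECISION,
               "Justified by audit and dispute requirements"),
    DesignItem("Postgres for primary store", Kind.DECISION,
               "Could revisit if load profile changes materially"),
    DesignItem("PII encryption at rest", Kind.CONSTRAINT,
               "Legal/compliance, not negotiable"),
]


def open_for_discussion(items: list[DesignItem]) -> list[DesignItem]:
    """Only decisions are open; constraints are part of the problem."""
    return [item for item in items if item.kind is Kind.DECISION]


for item in open_for_discussion(ITEMS):
    print(f"{item.statement}: {item.notes}")
```

Keeping the notes on each decision matters as much as the label, because the notes are what tell a later reader when the decision should be reopened.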
The Hidden Assumptions
Every design rests on assumptions that the designer is not consciously aware of. Surfacing them is one of the highest-leverage things a senior engineer does in a review.
Useful questions to run against any design:
- What load are we designing for? Today's load, projected load in 18 months, or peak hypothetical load. The answer changes the topology.
- What failure are we designing for? Single-node failure, single-region failure, correlated failure across providers, or "no significant failure expected." The answer changes the redundancy strategy.
- What latency budget are we designing for? End-to-end user latency, service-level latency, or one specific hot path. The answer changes where you spend optimisation effort.
- What change rate are we designing for? A system that ships once a quarter and a system that ships ten times a day need very different boundaries.
- What team are we designing for? A system maintained by ten people, or by two engineers sharing an on-call rotation after the original team moves on. The answer changes how much complexity you should permit.
If you cannot answer these questions out loud, the design is resting on assumptions. The work of framing is to make them explicit.
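The five questions above can be treated as a checklist that a design document either answers or does not. The sketch below is one hypothetical way to do that; the question keys and the `unstated_assumptions` helper are illustrative, not an established convention.

```python
from typing import Optional

# The five framing questions from the list above, as short keys.
QUESTIONS = ("load", "failure", "latency budget", "change rate", "team")


def unstated_assumptions(answers: dict[str, Optional[str]]) -> list[str]:
    """Return the questions with no written answer (missing, None, or blank)."""
    return [q for q in QUESTIONS if not (answers.get(q) or "").strip()]


# Hypothetical answers taken from a draft design document.
answers = {
    "load": "projected load in 18 months",
    "failure": "single-region failure",
    "latency budget": "",
    "team": "two on-call engineers",
}

missing = unstated_assumptions(answers)
print("Still implicit:", missing)
```

Any question the check reports is one the design is currently answering by default rather than by choice.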
Sketching Before Deciding
Once the framing is done, the instinct is to start drawing the design. That instinct should be resisted for one more round.
A useful habit is to sketch two or three quite different designs at low fidelity before picking one to develop. The sketches do not have to be polished. They have to be cheap enough that you can throw two of them away.
This works for a few reasons:
- It forces you to articulate why a particular approach is the right one, by comparison to alternatives that are concretely on the table.
- It exposes the decisions that are common to all three sketches, which are usually the real decisions, and the ones that are different, which are the tradeoff space.
- It gives reviewers something to push back on. A single design invites "this looks fine"; three designs invite "I would pick the second one because of X." The second response is more useful.
The cost of producing three sketches is hours. The cost of building the wrong design is months. The ratio is one of the better deals in engineering.
A common failure mode here is the anchoring sketch: drawing one design, then producing two "alternatives" that are obviously worse, to justify the first. Reviewers see through this immediately. If the three alternatives are not genuinely competitive in your own head, you have not actually done the sketching exercise.
The Design as a Communication Artifact
A system design is not just an engineering output. It is a communication artifact. The diagram and the document are how the design lives in the heads of everyone who has to build, operate, review, and eventually replace the system.
A few principles follow from this:
- The design has more readers than authors. A senior engineer writing a design is producing a document that will be read by team members today, on-call engineers in six months, a new hire in a year, and a future architect deciding whether to extend or replace the system. The writing should be aimed at all of them.
- The diagram and the prose have to agree. A diagram that shows three services and a doc that talks about four is a sign that the design is not yet stable in the author's head. Surfacing the inconsistency early is cheaper than discovering it during build.
- Decisions need their reasoning attached. Not "we use Postgres" but "we use Postgres because the access pattern is mixed OLTP/analytical, the team has operational experience, and the data volumes fit in a single instance for the next 18 months." The reasoning is what makes the decision revisable in the future.
- Invite the team in at the right time. Too early, and the conversation gets pulled into implementation details before the framing is settled. Too late, and the team feels presented with a fait accompli. A good middle is to share the framing and the sketches first, so the team joins the decision conversation rather than being handed the result.
The design is not finished when the diagram is drawn. It is finished when the people who have to live with it understand it, agree with it (or have registered their disagreement), and could explain it back to a third party without the original author in the room.
Conclusion
The work of approaching a system design is the work of slowing down before the boxes start. Framing comes before designing: knowing what the system has to do, what it has to be good at, what is fixed, and what is open. Sketching comes before deciding: two or three cheap alternatives are worth more than one polished proposal. And the design itself is a communication artifact, not just an engineering output - it succeeds when the people who have to live with it can carry it forward without the original author in the room.
Part 2 picks up where Part 1 ends. Once the framing is done and the sketch is chosen, the design has to make concrete decisions about five things: data, state, control, failure, and observability. Each of those is a hinge point where good designs and bad designs diverge.