Challenges in Building PaaS on AWS

I have been involved in building SaaS products as well as some large-scale platforms. Building a Platform-as-a-Service (PaaS) on AWS is a fundamentally different challenge from building a product for end users. You are not just shipping features — you are building a system that other teams, other products, or even external developers will build on top of. The stakes are higher, the blast radius is wider, and the decisions you make early tend to calcify quickly.

In this post, I want to walk through both the non-technical and technical challenges I have observed, and share what I have learned along the way.


Non-Technical Challenges

The engineering industry tends to jump straight to the technical side of platform problems. But in my experience, the organizational and people challenges are often what sink platform initiatives before the distributed systems problems even get a chance to.

Hiring Engineers Who Think in Platforms

The first and most persistent challenge is hiring. Platform engineering requires a specific mindset that is different from product feature development.

A product engineer optimizes for user-facing outcomes: ship the feature, measure the funnel, iterate on feedback. A platform engineer optimizes for reliability, extensibility, and the developer experience of internal consumers. They think about API contracts, backward compatibility, blast radius, and the cost of a breaking change — not just whether the feature works today.

Finding engineers who have this instinct, or who can develop it quickly, is hard. Most engineers have spent their careers in feature teams. They are great at solving defined problems. Platform work requires tolerating ambiguity, thinking in primitives rather than features, and caring deeply about internal customers who may not always articulate their needs clearly.

When hiring for a platform team, I look for engineers who have dealt with systems that have multiple consumers, who have written APIs that others depend on, and who have felt the pain of a breaking change ripple through an organization. That lived experience shapes how they make decisions.

Product Managers Who Understand the Platform Play

Product management for a platform is a genuinely different discipline. Most PM frameworks — jobs-to-be-done, user story mapping, OKRs tied to activation and retention — were designed for products with external end users. They map poorly to an internal platform where your “users” are other engineering teams and your “product” is a set of APIs, guardrails, and abstractions.

A good platform PM needs to think in terms of primitives: what is the smallest, most composable unit of capability we can expose? They need to understand API design, versioning strategy, and the concept of internal customers with their own roadmaps and constraints. They need to balance standardization (which makes the platform defensible and maintainable) against flexibility (which makes it usable for teams with diverse needs).

The PMs who struggle in this role are the ones who want to ship visible features to visible users. Platform value is often invisible — it shows up as fewer incidents, faster onboarding, lower cognitive load for product teams. Making that legible to leadership is itself a skill.

Team Structure: Centralized vs. Federated

How you structure the platform team shapes nearly everything else. There is no universally correct answer, but the decision has real consequences.

A centralized platform team owns the platform end-to-end. They set standards, build tooling, and are the single point of contact for platform capabilities. This gives you consistency and clear ownership, but it creates a bottleneck. Every product team that needs something from the platform has to wait in the platform team’s queue. The platform team becomes a tax on velocity.

A federated model distributes platform ownership across product teams, with a thin central team setting standards and owning the most critical shared infrastructure. This scales better but risks drift — different teams start solving the same problems differently, and the “platform” becomes a collection of loosely connected components rather than a coherent system.

Most mature organizations end up somewhere in between: a small, high-leverage central platform team that owns the hardest and most critical pieces, combined with embedded platform engineers in product teams who act as a bridge. The central team focuses on the control plane, security primitives, and landing zone. The embedded engineers handle the day-to-day integration and act as a feedback channel.

Getting the team topology wrong is expensive to fix. Conway’s Law is real — the architecture of your platform will eventually mirror the communication structure of the organization that built it.

Business Model Shapes Everything

Whether your platform serves B2C, B2B, or B2B2C consumers has profound implications for its design — and this is often underappreciated.

A B2C platform faces enormous scale at the data plane but often has relatively homogeneous consumers. You can afford strong opinions because your consumers have limited ability to negotiate special treatment. Cost efficiency per unit and global availability are typically the dominant constraints.

A B2B platform serves enterprise customers who have compliance requirements, SLA expectations, data residency needs, and integration constraints that vary widely. Multi-tenancy gets complicated fast. You may need dedicated infrastructure for certain customers. Audit logs, RBAC, and contractual SLOs become first-class concerns.

A B2B2C platform is the hardest. You are building for businesses that are in turn building for consumers. You need to think about isolation between your B2B customers and within each of their consumer bases. A single noisy tenant cannot be allowed to degrade the experience for all consumers across all other tenants.

Establishing this business context early, and revisiting it as the business evolves, is essential. The platform decisions that are right for B2B are often wrong for B2B2C, and retrofitting is painful.


Technical Challenges

Assuming the organizational foundation is in place, the technical challenges of building a PaaS on AWS are substantial. Let me go through the ones I have found most consequential.

Authentication and Authorization at Platform Scale

Auth is deceptively hard in a multi-tenant platform. In a single-product system, you authenticate a user and authorize them to perform actions. In a platform, you have multiple layers of identity: the platform itself, the tenants on the platform, the users within each tenant, and potentially service-to-service calls across all of these.

AWS gives you good building blocks — IAM, Cognito, AWS STS for cross-account access — but assembling them into a coherent multi-tenant auth model requires careful design. Some questions that come up quickly:

  • How do you model tenant isolation in IAM? Do you use separate AWS accounts per tenant, separate IAM roles, or resource-level policies?
  • How do you handle service-to-service authentication within the platform without creating overly permissive roles?
  • How do you propagate the identity of the calling user through a chain of internal services so that audit logs are meaningful?

The answers depend on your isolation model. If you are using separate AWS accounts per tenant (which I generally recommend for strong isolation), you get a lot of security primitives for free but introduce complexity in cross-account orchestration. If you are sharing accounts with resource-level isolation, you need to be far more careful about policy design and trust boundaries.
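For the account-per-tenant model, one way to keep audit logs meaningful is to carry the original caller's identity into the assumed role via session tags. The sketch below is pure Python and only builds the parameter dict; the role name, external-id scheme, and tag keys are hypothetical, and in a real system this dict would be passed to an STS AssumeRole call (e.g. via boto3).

```python
# Sketch: building per-tenant STS AssumeRole parameters under an
# account-per-tenant isolation model. The role name "platform-access",
# the external-id scheme, and the tag keys are illustrative assumptions.

def assume_role_params(tenant_id: str, account_id: str, caller: str) -> dict:
    """Parameters for assuming the platform role in a tenant's account.

    Session tags carry the original caller's identity so that downstream
    CloudTrail entries stay attributable to a real user, not just to the
    platform's service role.
    """
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/platform-access",
        "RoleSessionName": f"platform-{tenant_id}-{caller}"[:64],  # STS length limit
        "ExternalId": f"tenant-{tenant_id}",  # guards against confused-deputy calls
        "Tags": [
            {"Key": "tenant_id", "Value": tenant_id},
            {"Key": "caller_identity", "Value": caller},
        ],
        "DurationSeconds": 900,  # short-lived credentials for service-to-service calls
    }

params = assume_role_params("acme", "123456789012", "alice")
```

Keeping the session name derived from both the tenant and the caller means a single CloudTrail query can answer "who did this, on behalf of which tenant".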

Getting auth right early matters because it underpins every other security control. Retrofitting a robust auth model onto a platform that was built without one is one of the most painful refactors you will ever attempt.

Control Plane vs. Data Plane

The most important architectural separation in any PaaS is between the control plane and the data plane.

The data plane is where your tenants’ workloads actually run. This is the hot path — the code that processes real traffic, serves real requests, and must be available and performant. Its primary constraint is throughput and latency.

The control plane is the management layer. It handles provisioning, configuration, tenant onboarding, metering, and billing. It orchestrates the data plane but does not sit in the critical path of data plane requests. Its primary constraint is consistency and correctness.

This separation matters enormously for operational reasons. Control plane outages should not affect the data plane. If your provisioning system goes down, existing tenant workloads should continue to run without interruption. The data plane should be designed to be self-sufficient: it reads its configuration at startup (or caches it) and does not depend on the control plane being available to serve requests.

A common mistake is to build a platform where the data plane calls the control plane on every request — to check authorization, fetch tenant configuration, or validate quotas. This couples the availability of the data plane to the control plane and creates a latency dependency that is hard to eliminate later. Push configuration to the data plane at provisioning time, cache aggressively, and design the control plane for eventual consistency with the data plane.
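The "cache aggressively, serve stale on failure" pattern can be sketched in a few lines. This is a minimal, single-node illustration, assuming a hypothetical `fetch` callable standing in for the control-plane API; a production data plane would add metrics, background refresh, and persistence across restarts.

```python
import time

class TenantConfigCache:
    """Data-plane configuration cache: refresh from the control plane on a
    TTL, but keep serving the last known config if the control plane is
    unavailable. `fetch` is a hypothetical callable standing in for a
    control-plane API call."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # tenant_id -> (config, fetched_at)

    def get(self, tenant_id):
        now = self._clock()
        entry = self._cache.get(tenant_id)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh enough, no control-plane call
        try:
            config = self._fetch(tenant_id)
            self._cache[tenant_id] = (config, now)
            return config
        except Exception:
            if entry is not None:
                return entry[0]  # control plane down: serve stale config
            raise  # no cached copy to fall back to
```

The key property is that a control-plane outage only freezes configuration; it never takes requests down with it.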

Cell-Based Architecture

At sufficient scale, a single global deployment of your platform becomes a liability. A bad deployment, a runaway tenant, or a cascading failure can affect all of your customers simultaneously. Cell-based architecture is the answer to this.

The idea is to partition your platform into independently deployable and operable units called cells. Each cell serves a subset of your tenant population and is isolated from other cells at the infrastructure level. A failure in one cell — whether from a bad deployment or a noisy tenant — is contained to that cell. The rest of your customers are unaffected.

On AWS, cells often map to AWS accounts or to isolated VPCs within accounts. The cell boundary is where you enforce the blast radius. Tenants are assigned to cells at onboarding time, and the assignment is typically sticky (though you may want a mechanism for migrating tenants between cells for load balancing or regulatory reasons).

Cell-based architecture introduces its own complexity: you need a routing layer that maps incoming requests to the correct cell, you need deployment tooling that can roll out changes across cells in a controlled sequence, and you need observability that can aggregate metrics across cells while still letting you drill down into individual cell behavior. But the operational resilience it buys is worth the investment.
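The routing layer's core contract — sticky tenant-to-cell assignment with explicit migration — can be sketched as follows. This is an in-memory illustration; the `assignments` dict stands in for a durable store (a DynamoDB table, say), and the hash-based initial placement is one of several reasonable strategies.

```python
import hashlib

class CellRouter:
    """Routing-layer sketch: map each tenant to a cell. Assignments are
    sticky (persisted here in a plain dict standing in for a durable
    store), so adding cells or rebalancing never silently moves an
    existing tenant."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.assignments = {}  # tenant_id -> cell name

    def cell_for(self, tenant_id: str) -> str:
        if tenant_id in self.assignments:
            return self.assignments[tenant_id]  # sticky assignment
        # Initial placement: a stable hash over the tenant id, so every
        # router replica computes the same cell for a new tenant.
        digest = hashlib.sha256(tenant_id.encode()).digest()
        cell = self.cells[int.from_bytes(digest[:8], "big") % len(self.cells)]
        self.assignments[tenant_id] = cell
        return cell

    def migrate(self, tenant_id: str, target_cell: str):
        """Explicit migration, e.g. for load balancing or data residency."""
        self.assignments[tenant_id] = target_cell
```

Because the assignment is recorded rather than recomputed, growing from four cells to five only affects tenants onboarded afterwards.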

Security Controls in a Multi-Tenant Environment

Security in a PaaS is not just about protecting the platform from external threats — it is about protecting tenants from each other. Tenant isolation is a security requirement, not just an operational nicety.

The layers of security controls I find most important:

Network isolation: Each tenant’s workloads should run in isolated network segments. Use separate VPCs, security groups, and NACLs to ensure that a compromised tenant cannot reach another tenant’s network. AWS PrivateLink is useful for exposing platform services to tenant VPCs without routing traffic through the public internet.

Compute isolation: Depending on your threat model, you may need stronger isolation than network-level controls provide. AWS Nitro Enclaves can isolate sensitive data processing even from the parent instance, and Dedicated Hosts or dedicated instances provide physical separation for tenants that require it.

Data isolation: Tenant data should be encrypted at rest with tenant-specific keys managed through AWS KMS. This allows you to revoke access to a specific tenant’s data without affecting others and simplifies compliance with data deletion requests.
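The per-tenant-key property is what makes single-tenant revocation possible. The toy registry below illustrates the shape of that contract with plain byte strings standing in for KMS keys; in production, each tenant's key would be a KMS CMK and "revoke" would disable the key or its grants.

```python
import secrets

class TenantKeyRegistry:
    """Sketch of per-tenant key management. Plain byte strings stand in
    for KMS key material; in a real platform each tenant key is a KMS
    CMK and revocation disables the key or its grants."""

    def __init__(self):
        self._keys = {}      # tenant_id -> key bytes
        self._revoked = set()

    def create_key(self, tenant_id: str):
        self._keys[tenant_id] = secrets.token_bytes(32)

    def revoke(self, tenant_id: str):
        # Revoking one tenant's key renders only that tenant's data
        # unreadable; every other tenant is unaffected.
        self._revoked.add(tenant_id)

    def key_for(self, tenant_id: str) -> bytes:
        if tenant_id in self._revoked:
            raise PermissionError(f"key for tenant {tenant_id} is revoked")
        return self._keys[tenant_id]
```

The same structure is what makes crypto-shredding work for deletion requests: destroying the key is equivalent to destroying the data.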

Audit logging: Every action in the control plane should be logged to CloudTrail and streamed to an immutable audit log. This is not just good practice — it is a contractual requirement for most enterprise B2B customers and a regulatory requirement in many jurisdictions.

Blast radius controls: Use SCPs (Service Control Policies) in AWS Organizations to prevent tenant accounts from taking actions that could affect the broader platform — disabling CloudTrail, modifying security group rules that the platform manages, or launching resources outside approved regions.
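To make the SCP idea concrete, here is a minimal policy document covering two of the controls above, expressed as a Python dict that would be serialized and attached via AWS Organizations. The region list is illustrative, and the `NotAction` carve-out for global services is a common pattern, not the only valid one.

```python
import json

# Sketch of a Service Control Policy denying CloudTrail tampering and
# activity outside approved regions. The region list is an illustrative
# assumption; a real policy would also exempt platform break-glass roles.
SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyCloudTrailTampering",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            # Global services (IAM, STS, Organizations) are exempted,
            # since their calls are not bound to an approved region.
            "NotAction": ["iam:*", "organizations:*", "sts:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
                }
            },
        },
    ],
}

policy_document = json.dumps(SCP)  # the content you would attach in Organizations
```

Because SCPs apply at the organization level, these denials hold even against a principal with administrator access inside the tenant account.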

Services in the Control Plane

Deciding what belongs in the control plane and what belongs in the data plane is an ongoing design challenge. Some services clearly belong in the control plane:

Tenant management: Onboarding, offboarding, configuration management, and tenant metadata live here. This is where you model your tenants as first-class entities.

Metering and billing: Usage data needs to be collected, aggregated, and fed into billing. Metering is a control plane concern because it needs to see across all tenant activity. Getting metering right is harder than it looks — you need to handle late-arriving data, deal with partial failures in data collection, and ensure that the metering model maps cleanly to your billing model.

Feature flags: The ability to roll out features to a subset of tenants, or to specific tiers of your platform, is a control plane capability. AWS AppConfig is a reasonable starting point; more sophisticated platforms build their own.

Quota management: Enforcing per-tenant quotas (API rate limits, storage limits, compute limits) requires a coordination layer that is aware of all tenant activity. This is a hard distributed systems problem — more on this below.

Health and observability: Aggregate health of all cells, tenant-level SLA tracking, and anomaly detection live in the control plane. This is where you get the global view.
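To ground the quota discussion, here is a single-node token-bucket limiter keyed by tenant. It is deliberately the easy version: a real platform has to distribute these counters (in DynamoDB or ElastiCache, for example), which is exactly where the coordination problems discussed later in this post come from.

```python
import time

class TenantRateLimiter:
    """Per-tenant token-bucket quota enforcement (single-node sketch).
    Each tenant gets its own bucket, so one tenant exhausting its quota
    never affects another's."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self._clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

The injectable clock is not just a testing convenience; in distributed variants, deciding whose clock refills the bucket is part of the design problem.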

Landing Zone

Before you can build any of the above, you need a well-structured AWS foundation. This is what AWS calls a landing zone — a governed, multi-account AWS environment with baseline security controls, networking, and observability in place.

AWS Control Tower is the managed way to set this up. It gives you a multi-account structure based on AWS Organizations, with guardrails (implemented as SCPs and AWS Config rules) that enforce baseline policies across all accounts. It handles account vending — the automated provisioning of new AWS accounts for new cells or new tenants — and provides a central audit account for CloudTrail and AWS Config data.

The alternative is to build your own landing zone automation, which gives you more control but requires significant investment. Tools like Terraform, the AWS CDK, and open-source projects like Landing Zone Accelerator are useful here.

Getting the landing zone right before you start building the platform is important. Adding guardrails and account structure to an existing unstructured AWS environment is extremely painful. The blast radius of a poorly governed AWS organization is large.


Distributed Systems Problems at Platform Scale

Even with the right team, the right architecture, and a solid AWS foundation, you will eventually run into the fundamental hard problems of distributed systems. These are not AWS-specific — they are inherent to any system that runs across multiple machines. But at platform scale, they become unavoidable.

Split Brain

A split brain occurs when a network partition causes different parts of your system to have inconsistent views of shared state. Each partition continues to operate independently, potentially making conflicting decisions, and when the partition heals, reconciling the divergent state is non-trivial.

This shows up in practice in places like distributed databases, leader-based replication systems, and any service that caches state and needs to stay synchronized with a source of truth. AWS services like DynamoDB (with its eventually consistent reads by default) and Aurora (with its single-writer primary) have well-documented behaviors around partitions — but understanding those behaviors and designing your application to handle them correctly is on you.

The operational implication: you need to decide, explicitly, whether your platform prioritizes availability or consistency in the face of a partition (the CAP theorem is not just a thought experiment). Making that decision consciously, per service, and encoding it into your runbooks and incident response playbooks is part of building a production-grade platform.

Byzantine Faults

A crash fault is when a node stops responding. A Byzantine fault is when a node responds, but with incorrect or inconsistent data. This is a much harder class of failure to handle.

In a platform context, Byzantine faults are most relevant in quota management, distributed rate limiting, and any system where multiple nodes need to agree on a shared value. If a node reports stale or incorrect quota consumption data, other nodes may allow requests that should have been throttled, leading to quota violations that are hard to detect and harder to attribute.

Byzantine fault-tolerant algorithms exist (PBFT, for example) but are expensive in terms of coordination overhead. In practice, most platforms handle this through redundant data collection, reconciliation jobs that detect anomalies, and circuit breakers that fail closed when data looks inconsistent.

Thundering Herds

A thundering herd occurs when a large number of consumers simultaneously attempt to perform the same action — typically after a period of unavailability. When a service recovers from an outage, all the clients that were retrying simultaneously reconnect. When a cache expires, all the readers that were waiting for the cache miss simultaneously hit the origin. The resulting load spike can push the recovering service back into failure.

At platform scale, thundering herds are a persistent operational risk. Your platform serves many tenants, and if an event triggers synchronized behavior across tenants — a planned maintenance window, a brief network interruption, a certificate rotation — the reconnection storm can be severe.

Mitigations include: exponential backoff with jitter in all clients (the AWS SDKs do this by default for most services), cache warming before bringing a service back online, load shedding and admission control at the recovering service so it can serve some requests rather than collapsing under all of them, and probabilistic early cache refresh (refreshing an entry slightly before it expires rather than letting every reader pile onto the miss).
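The first mitigation — backoff with jitter — fits in one function. This is the "full jitter" variant, where the sleep is drawn uniformly from zero up to the exponential ceiling, so recovering clients spread out instead of retrying in lockstep; the base and cap values here are arbitrary illustrations.

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 20.0) -> float:
    """Full-jitter exponential backoff: return a sleep duration drawn
    uniformly from [0, min(cap, base * 2**attempt)]. The randomness is
    the point -- it desynchronizes clients that failed at the same
    moment, which is what prevents the reconnection storm."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Note that plain exponential backoff without jitter does not solve the herd problem: every client that failed together would also retry together, just at longer intervals.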

Quorum Formation and Leader Election

Many stateful platform services require a notion of leadership: a single node that is authoritative for writes, coordinates distributed transactions, or manages shared state. Electing that leader, detecting when it has failed, and electing a new one without introducing split brain is the leader election problem.

Apache ZooKeeper (which AWS runs on your behalf as part of Amazon MSK), DynamoDB with conditional writes, and Amazon ElastiCache for Redis with automatic failover all provide mechanisms for distributed coordination. But using them correctly — handling the edge cases around network partitions, clock skew, and the period of uncertainty when a leader is suspected to have failed but has not yet been replaced — requires careful design.

The practical advice: use managed coordination services where possible (the Kubernetes lease primitives on EKS, DynamoDB conditional writes for lightweight locking, Amazon ElastiCache for distributed mutexes) rather than building your own. The failure modes in custom leader election code are subtle, and the consequences of getting it wrong — multiple simultaneous leaders, a stale leader continuing to process writes after being replaced — are severe.
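To illustrate the lease-plus-fencing-token pattern that the DynamoDB approach relies on, here is an in-memory stand-in for the conditional write: acquisition succeeds only if no live lease exists, and each leadership change bumps a monotonic fencing token that downstream systems can use to reject writes from a stale ex-leader. This simulates the semantics only; a real implementation would use an actual conditional write on a shared table.

```python
class LeaseStore:
    """In-memory stand-in for a DynamoDB-style conditional write used for
    leader election: a node acquires the leadership lease only if it is
    free or expired, and each new leadership epoch gets a larger fencing
    token so writes from a deposed leader can be rejected downstream."""

    def __init__(self, ttl: float, clock):
        self.ttl = ttl
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0
        self.fencing_token = 0

    def try_acquire(self, node_id: str):
        """Return the fencing token if node_id holds or wins the lease,
        else None. Models the atomic compare-and-set a real store provides."""
        now = self.clock()
        if self.holder == node_id and now < self.expires_at:
            self.expires_at = now + self.ttl   # heartbeat: renew own lease
            return self.fencing_token
        if self.holder is None or now >= self.expires_at:
            self.holder = node_id              # lease free or expired: take it
            self.expires_at = now + self.ttl
            self.fencing_token += 1            # new leadership epoch
            return self.fencing_token
        return None                            # someone else is the live leader
```

The fencing token is what closes the "stale leader keeps writing" gap: storage that remembers the highest token it has seen can refuse anything older, even if the old leader has not yet noticed it lost the lease.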


Putting It Together

These challenges do not exist in isolation. The distributed systems problems are harder to solve without the right team. The right team is harder to build without the right hiring strategy. The right hiring strategy depends on having a clear business model to articulate what the platform needs to be. And the technical architecture — control plane, data plane, cell structure, security model — needs to be designed with the business model in mind from the start.

Building a platform is a long-term bet. The decisions you make in the first six months — on team structure, on AWS account strategy, on control plane architecture — will constrain and enable what is possible for years afterward. Getting those foundations right, even if it means moving slower initially, is almost always the right trade-off.

The platforms that succeed are the ones that treat reliability, security, and internal developer experience as first-class product requirements — not as afterthoughts to be addressed once the “real” work is done. The “real” work, at platform scale, is the foundation.