Lessons Learned Migrating to Microservices at Scale
Sarah Chen
Head of Engineering
Why We Decided to Break Up the Monolith
Three years ago, the entire Primates platform ran as a single monolithic Ruby on Rails application backed by a PostgreSQL database. This architecture served us well during our early growth phase—it was simple to develop, test, and deploy, and it allowed our small engineering team to move fast without the overhead of managing distributed systems. But as our customer base grew from fifty to over a thousand organizations, and our team expanded from ten to sixty engineers, the monolith became increasingly painful to work with.
Deployment times ballooned from five minutes to over forty minutes as the codebase grew. A bug in one feature could take down the entire platform because everything ran in the same process. Teams working on different features frequently stepped on each other's toes, creating merge conflicts and requiring complex coordination. Database performance degraded as the single PostgreSQL instance struggled to handle the diverse query patterns of our growing feature set. The monolith had become a bottleneck—not just technically, but organizationally.
We did not make the decision to migrate to microservices lightly. We were well aware of the distributed systems tax—the operational complexity, network reliability challenges, data consistency problems, and debugging difficulties that come with breaking a monolithic application into distributed components. But after careful analysis, we concluded that the organizational scalability benefits of microservices outweighed the technical complexity costs for our specific situation. We had the team size, the domain complexity, and the scale requirements that made microservices the right architectural choice.
The Strangler Fig Migration Pattern
Rather than attempting a big-bang rewrite—which we had seen fail at other companies—we adopted the Strangler Fig migration pattern. Named after the strangler fig tree that gradually grows around and replaces its host tree, this pattern involves incrementally extracting functionality from the monolith into new microservices, routing traffic to the new services, and eventually decommissioning the corresponding code in the monolith. The beauty of this approach is that the monolith continues to serve production traffic throughout the migration, reducing risk and allowing the team to learn and adapt as they go.
Our migration proceeded through three phases over approximately twenty-four months. In the first phase, we identified and extracted the most independent and well-bounded domains from the monolith. The event ingestion system was our first extraction target because it had clear boundaries, high throughput requirements that would benefit from independent scaling, and minimal dependencies on other parts of the system. We built the new ingestion service in Go, chose Apache Kafka as the communication backbone, and used an API gateway to gradually shift traffic from the monolith's ingestion endpoints to the new service.
Here is a simplified example of how we implemented the traffic routing in our API gateway during the migration:
```nginx
server {
    listen 443 ssl;
    server_name api.primates.dev;

    # Migrated endpoints route to new microservices
    location /v2/events {
        proxy_pass http://event-ingestion-service:8080;
        proxy_set_header X-Request-ID $request_id;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        # Circuit breaker: fall back to monolith if new service is down
        error_page 502 503 504 = @fallback_monolith;
    }

    # Not-yet-migrated endpoints continue to the monolith
    location / {
        proxy_pass http://monolith-rails:3000;
        proxy_set_header X-Request-ID $request_id;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location @fallback_monolith {
        proxy_pass http://monolith-rails:3000;
        add_header X-Fallback "true";
    }
}
```
Service Communication Patterns
One of the most critical decisions in a microservices architecture is how services communicate with each other. We use a combination of synchronous and asynchronous communication patterns, chosen based on the specific requirements of each interaction. For request-response interactions where the caller needs an immediate response, we use gRPC with Protocol Buffers for internal service-to-service communication. gRPC provides strong typing, efficient binary serialization, and built-in support for streaming, which have proven valuable for our use cases.
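To make the request-response side concrete, here is a minimal sketch of what a Protocol Buffers service definition for an ingestion API might look like. The service and message names are illustrative, not the actual Primates API:

```protobuf
syntax = "proto3";

package primates.events.v1;

// Hypothetical service definition for illustration only.
service EventIngestion {
  // Unary request-response for a single event.
  rpc PublishEvent(PublishEventRequest) returns (PublishEventResponse);
  // Client streaming for high-throughput batch ingestion.
  rpc PublishEventStream(stream PublishEventRequest) returns (PublishEventResponse);
}

message PublishEventRequest {
  string event_id = 1;            // client-generated ID, enables deduplication
  string event_type = 2;          // e.g. "user.created"
  bytes payload = 3;              // serialized event body
  int64 occurred_at_unix_ms = 4;  // event time, not ingestion time
}

message PublishEventResponse {
  bool accepted = 1;
}
```

Because the schema is the contract, both sides get compile-time type checking, and the streaming RPC shows how gRPC's built-in streaming support maps naturally onto high-throughput ingestion.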
For interactions where eventual consistency is acceptable and decoupling is more important than immediacy, we use event-driven communication through Kafka. Services publish domain events to Kafka topics, and interested services consume these events asynchronously. This pattern has several advantages:
- Temporal decoupling: the producing service does not need the consuming service to be available at the time the event is published. Events are durably stored in Kafka and will be processed when the consumer is ready.
- Spatial decoupling: the producing service does not need to know which services consume its events. New consumers can subscribe to existing event streams without any changes to the producer.
- Replay capability: Kafka's log-based storage allows consumers to replay historical events, which is invaluable for rebuilding state, debugging issues, and onboarding new services.
However, event-driven communication introduces complexity around eventual consistency, event ordering, and idempotency. We learned these lessons the hard way during the migration, encountering bugs caused by out-of-order event processing, duplicate event delivery, and race conditions between services that relied on eventually consistent data. To address these challenges, we established a set of event-driven architecture standards that all teams must follow.
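One of those standards is that every consumer must be idempotent, because Kafka delivers at-least-once. A minimal sketch of the idea in Go, with an in-memory dedupe store standing in for what would be a durable one (e.g. a unique-key insert in the service's own database, committed together with the side effect):

```go
package main

import (
	"fmt"
	"sync"
)

// Event carries a producer-assigned ID so redelivery can be detected.
type Event struct {
	ID      string
	Payload string
}

// IdempotentHandler wraps a business handler and skips events whose ID
// has already been processed. This sketch keeps seen IDs in memory; a
// production version would persist them, and would also unmark the ID
// if the wrapped handler fails so the event can be retried.
type IdempotentHandler struct {
	mu   sync.Mutex
	seen map[string]bool
	next func(Event) error
}

func NewIdempotentHandler(next func(Event) error) *IdempotentHandler {
	return &IdempotentHandler{seen: make(map[string]bool), next: next}
}

// Handle reports whether the event was processed; duplicates return false.
func (h *IdempotentHandler) Handle(e Event) (bool, error) {
	h.mu.Lock()
	if h.seen[e.ID] {
		h.mu.Unlock()
		return false, nil // duplicate delivery: safe to ack and skip
	}
	h.seen[e.ID] = true
	h.mu.Unlock()
	return true, h.next(e)
}

func main() {
	count := 0
	h := NewIdempotentHandler(func(e Event) error { count++; return nil })
	h.Handle(Event{ID: "evt-1"})
	h.Handle(Event{ID: "evt-1"}) // redelivered duplicate: ignored
	fmt.Println(count)           // prints 1
}
```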
"The biggest mistake teams make when adopting microservices is underestimating the operational complexity. You are not just changing your architecture—you are changing how your entire organization builds, deploys, and operates software." — Sam Newman, author of Building Microservices
Data Management Strategies
Managing data across microservices is arguably the most challenging aspect of the architecture. The monolith's single shared database made it easy to maintain data consistency through transactions and joins, but in a microservices world, each service owns its data and cannot directly access other services' databases. This constraint is fundamental to achieving the loose coupling that makes microservices valuable, but it requires new approaches to data management.
We adopted the database-per-service pattern, where each microservice has its own dedicated database instance chosen for the specific data access patterns of that service. Our event ingestion service uses Apache Cassandra for high-write-throughput time-series data. Our user management service uses PostgreSQL for relational data with strong consistency requirements. Our search service uses Elasticsearch for full-text search capabilities. Our analytics service uses ClickHouse for OLAP-style queries on large datasets. The following table summarizes our database choices by service:
| Service | Database | Rationale | Data Volume |
|---|---|---|---|
| Event Ingestion | Apache Cassandra | High write throughput, time-series data | 4B events/day |
| User Management | PostgreSQL | Relational data, ACID transactions | 2M users |
| Search | Elasticsearch | Full-text search, faceted queries | 500M documents |
| Analytics | ClickHouse | Columnar storage, OLAP queries | 50TB |
| Notifications | Redis | Low-latency reads, pub/sub | 10M records |
Cross-service data consistency is maintained through the Saga pattern for distributed transactions and event sourcing for maintaining materialized views of data from other services. The Saga pattern replaces traditional ACID transactions with a sequence of local transactions coordinated by either choreography (event-driven) or orchestration (centralized coordinator). We use choreography-based sagas for simple workflows and orchestration-based sagas for complex multi-step processes where visibility and error handling are critical.
Observability and Debugging
Debugging issues in a distributed system is fundamentally different from debugging a monolithic application. A single user request might traverse ten or more services, making it essential to have comprehensive observability tooling that can trace requests across service boundaries, correlate logs from multiple services, and visualize the relationships between services in real time. We invested heavily in three pillars of observability:
- Distributed tracing using OpenTelemetry and Jaeger, which allows us to trace individual requests as they flow through multiple services, identify latency bottlenecks, and understand the dependency graph of our service interactions.
- Structured logging with correlation IDs that link log entries from different services for the same user request. All services emit structured JSON logs that include the trace ID, span ID, service name, and request metadata, making it straightforward to reconstruct the full processing path of any request.
- Metrics and alerting using Prometheus and Grafana, with service-level objectives defined for every service covering availability, latency, and error rate. We use SLO-based alerting to focus on-call engineers on issues that affect user experience rather than low-level infrastructure metrics.
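As a minimal illustration of the second pillar, here is how a service might emit structured JSON logs with Go's standard `log/slog` package. The field names (`trace_id`, `span_id`, `service`) are our assumed schema, and in a real handler the IDs would come from the incoming OpenTelemetry context rather than literals:

```go
package main

import (
	"log/slog"
	"os"
)

// newServiceLogger returns a JSON logger that stamps every entry with
// the service name, so logs from different services can be filtered
// and then joined on trace_id.
func newServiceLogger(service string) *slog.Logger {
	h := slog.NewJSONHandler(os.Stdout, nil)
	return slog.New(h).With("service", service)
}

func main() {
	logger := newServiceLogger("event-ingestion")
	// Hypothetical IDs; in production these are propagated from the
	// incoming request's trace context.
	logger.Info("event accepted",
		"trace_id", "4bf92f3577b34da6a3ce929d0e0e4736",
		"span_id", "00f067aa0ba902b7",
		"event_id", "evt-123",
	)
}
```

With every service emitting the same fields, reconstructing the full processing path of a request is a single query for its `trace_id` in the log store.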
The migration to microservices was one of the most challenging engineering projects we have undertaken, but the results have been transformative. Deployment frequency has increased from weekly releases to dozens of deployments per day. Individual service deployment times are under three minutes. Teams can develop, test, and deploy their services independently. And our platform can scale individual components based on their specific load patterns rather than scaling the entire application uniformly. The lessons we learned along the way—about migration patterns, communication strategies, data management, and observability—are applicable to any organization contemplating a similar journey.
About the Author
Sarah Chen
Head of Engineering
Sarah Chen is the Head of Engineering at Primates, where she leads the platform infrastructure and distributed systems teams. With over fifteen years of experience building large-scale systems at companies including Google and Stripe, Sarah specializes in designing fault-tolerant architectures that handle billions of requests daily. She holds a Ph.D. in Computer Science from MIT and is a frequent speaker at distributed systems conferences worldwide.