Building Scalable Data Pipelines with Apache Kafka and Flink
Sarah Chen
Head of Engineering
Introduction to Modern Data Pipelines
In today's data-driven landscape, the ability to process and analyze massive streams of information in real time is no longer a luxury—it is a fundamental requirement. Organizations across every industry generate terabytes of event data every day, from user clickstreams and IoT sensor readings to financial transactions and application logs. Building pipelines that can reliably ingest, transform, and deliver this data at scale is one of the most critical engineering challenges modern teams face.
At Primates, we process over four billion events per day across our analytics platform. Over the past three years, we have iterated through multiple pipeline architectures, learning hard lessons about scalability, fault tolerance, and operational complexity along the way. In this article, I want to share the architecture patterns and engineering practices that have allowed us to build pipelines that maintain sub-second latency even under peak load conditions, while keeping our infrastructure costs predictable and manageable.
The core technologies driving our pipeline infrastructure are Apache Kafka for event streaming and Apache Flink for stream processing. Together, these tools provide a powerful foundation for building data pipelines that are both horizontally scalable and fault-tolerant. But choosing the right tools is only part of the equation—how you design your pipeline topology, manage state, handle backpressure, and monitor system health are equally important factors that determine whether your pipeline thrives in production or buckles under pressure.
Architecture Overview
Our pipeline architecture follows a layered approach that separates concerns cleanly and allows each component to scale independently. The ingestion layer handles raw event collection from hundreds of sources, normalizing data formats and applying basic validation before publishing to Kafka topics. The processing layer, built on Flink, consumes from these topics and applies complex transformations, enrichments, aggregations, and windowed computations. Finally, the delivery layer routes processed data to various sinks including data warehouses, real-time dashboards, and downstream microservices.
One of the key architectural decisions we made early on was to adopt an event-driven, schema-first approach. Every event type in our system is defined using Apache Avro schemas, which are stored in a centralized Schema Registry. This gives us strong typing guarantees across the entire pipeline, enables schema evolution without breaking downstream consumers, and provides automatic serialization and deserialization. The schema registry also serves as living documentation of our event contracts, making it straightforward for teams to discover and understand the data flowing through the system.
Here is a simplified view of how our ingestion service publishes events to Kafka:
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class EventPublisher {

    private static final Logger logger = LoggerFactory.getLogger(EventPublisher.class);

    private final KafkaProducer<String, GenericRecord> producer;
    private final String topicName;

    public EventPublisher(String bootstrapServers, String schemaRegistryUrl, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", schemaRegistryUrl);
        // Durability: wait for all in-sync replicas before acknowledging.
        props.put("acks", "all");
        props.put("retries", 3);
        // A small batching window trades a few milliseconds of latency for throughput.
        props.put("linger.ms", 5);
        props.put("batch.size", 32768);
        this.producer = new KafkaProducer<>(props);
        this.topicName = topic;
    }

    public void publishEvent(String key, GenericRecord event) {
        ProducerRecord<String, GenericRecord> record =
            new ProducerRecord<>(topicName, key, event);
        // Asynchronous send; the callback runs on the producer's I/O thread.
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // Log the full stack trace, not just the message.
                logger.error("Failed to publish event", exception);
                metrics.incrementCounter("publish.failures");
            } else {
                metrics.incrementCounter("publish.successes");
            }
        });
        // `metrics` refers to the application's metrics client; its definition
        // is omitted from this simplified view.
    }
}
Stream Processing with Flink
Apache Flink serves as the backbone of our stream processing layer. We chose Flink over alternatives like Spark Structured Streaming and Kafka Streams for several reasons: its true event-time processing capabilities, exactly-once state consistency guarantees, flexible windowing semantics, and superior performance under high-throughput scenarios. Flink's ability to manage large amounts of keyed state efficiently using RocksDB as a state backend has been particularly valuable for our use cases, which involve maintaining session windows and computing rolling aggregates across millions of unique keys.
Our Flink applications are organized as a directed acyclic graph (DAG) of operators, where each operator performs a specific transformation. Common operator patterns in our pipelines include:
- Filtering operators that discard invalid or irrelevant events based on schema validation rules and business logic predicates. These operators typically reduce the event volume by fifteen to twenty percent, significantly reducing the load on downstream processing stages.
- Enrichment operators that join streaming events with reference data from external sources such as user profile databases and product catalogs. We use Flink's async I/O capabilities to perform these lookups without blocking the main processing thread, maintaining high throughput even when external services have variable latency.
- Aggregation operators that compute windowed metrics like counts, sums, averages, and percentiles over configurable time windows. We use a combination of tumbling windows for fixed-interval reporting and session windows for user behavior analysis, with late event handling configured to accommodate clock skew of up to five minutes.
- Routing operators that direct processed events to appropriate output topics based on content-based routing rules. This allows us to maintain a single processing pipeline while delivering data to multiple downstream consumers with different requirements.
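To make the aggregation pattern above concrete, here is a minimal, framework-free sketch of tumbling-window counting. The class name and window size are illustrative; in our production pipelines this logic is expressed with Flink's window API, which also handles watermarks, late events, and fault-tolerant state.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal tumbling-window counter: buckets event timestamps into fixed-size,
// non-overlapping windows and counts events per window. Illustrative only;
// Flink's window operators provide this plus watermarking and state handling.
public class TumblingWindowCounter {

    private final long windowSizeMillis;
    private final Map<Long, Long> countsByWindowStart = new TreeMap<>();

    public TumblingWindowCounter(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    // Assign the event to the window containing its timestamp and bump the count.
    public void add(long eventTimestampMillis) {
        long windowStart = eventTimestampMillis - (eventTimestampMillis % windowSizeMillis);
        countsByWindowStart.merge(windowStart, 1L, Long::sum);
    }

    public long countFor(long windowStart) {
        return countsByWindowStart.getOrDefault(windowStart, 0L);
    }
}
```

The key property a tumbling window guarantees is that every event lands in exactly one window, which is what makes the resulting counts additive across windows.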
One of the most complex aspects of building reliable stream processing applications is managing state correctly. Flink provides checkpointing and savepoint mechanisms that allow the system to recover from failures without data loss, but you need to configure these carefully based on your throughput and latency requirements. We run incremental checkpoints every thirty seconds using the RocksDB state backend, which gives us a good balance between recovery time and checkpoint overhead.
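As a sketch, the checkpointing setup described above looks roughly like the following fragment. Exact class names vary by Flink version (`EmbeddedRocksDBStateBackend` is the 1.13+ API), and the minimum-pause value here is illustrative rather than our production setting.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// RocksDB state backend with incremental checkpoints enabled (the `true` flag),
// so each checkpoint uploads only the state changed since the last one.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

// Checkpoint every 30 seconds with exactly-once semantics.
env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

// Leave headroom between checkpoints so normal processing is not starved.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
```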
Handling Backpressure
Backpressure is an inevitable reality in any high-throughput data pipeline. When a downstream operator cannot keep up with the rate of incoming data, the system must have a strategy for dealing with the excess. Flink handles backpressure naturally through its credit-based flow control mechanism, which propagates backpressure signals upstream through the operator graph. However, relying solely on this built-in mechanism can lead to cascading slowdowns that affect the entire pipeline.
We have implemented several additional strategies to manage backpressure gracefully:
- We monitor Flink's backpressure metrics continuously using Prometheus and Grafana, with alerts configured to fire when any operator sustains backpressure above fifty percent for more than two minutes.
- We use Kafka's built-in buffering as a shock absorber between pipeline stages. By sizing our Kafka topics with sufficient partition counts and retention periods, we can absorb traffic spikes without data loss even if processing temporarily falls behind.
- We implement adaptive rate limiting at the ingestion layer, which reduces the publishing rate when downstream processing latency exceeds configurable thresholds. This prevents the pipeline from being overwhelmed during traffic spikes while ensuring that no data is dropped.
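The adaptive rate-limiting idea can be sketched as a token bucket whose refill rate drops when downstream latency breaches a threshold. The class, rates, and thresholds below are hypothetical illustrations, not our production implementation, which reacts to a richer set of latency signals.

```java
// Minimal token-bucket limiter: the refill rate is lowered when downstream
// p99 latency breaches a threshold and restored when it recovers.
// All names and values are illustrative.
public class AdaptiveRateLimiter {

    private final double normalRatePerSec;
    private final double reducedRatePerSec;
    private double tokens;
    private double currentRatePerSec;

    public AdaptiveRateLimiter(double normalRatePerSec, double reducedRatePerSec) {
        this.normalRatePerSec = normalRatePerSec;
        this.reducedRatePerSec = reducedRatePerSec;
        this.currentRatePerSec = normalRatePerSec;
        this.tokens = normalRatePerSec; // start with one second of burst budget
    }

    // Called periodically with the observed downstream p99 latency.
    public void onLatencySample(double p99Millis, double thresholdMillis) {
        currentRatePerSec = (p99Millis > thresholdMillis) ? reducedRatePerSec : normalRatePerSec;
    }

    // Refill proportionally to elapsed time, capped at one second of budget.
    public void refill(double elapsedSeconds) {
        tokens = Math.min(currentRatePerSec, tokens + elapsedSeconds * currentRatePerSec);
    }

    // Returns true if the caller may publish now; otherwise the event should
    // be buffered (never dropped) and retried later.
    public boolean tryAcquire() {
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Note that throttled events are buffered rather than dropped; the limiter only shifts load into Kafka's retention buffer, preserving the no-data-loss guarantee.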
"The most resilient pipelines are those designed to degrade gracefully under load rather than fail catastrophically. Backpressure management is not about preventing slowdowns—it is about ensuring that slowdowns in one part of the system do not cascade into failures across the entire pipeline." — Jay Kreps, co-creator of Apache Kafka
Performance Tuning and Optimization
Achieving optimal performance from a Kafka-Flink pipeline requires careful tuning across multiple dimensions. After extensive benchmarking and production profiling, we have identified several areas where configuration changes have a dramatic impact on throughput and latency. The following table summarizes the most impactful parameters:
| Parameter | Default Value | Optimized Value | Impact |
|---|---|---|---|
| Kafka batch.size | 16384 | 65536 | 35% throughput increase |
| Kafka linger.ms | 0 | 10 | 28% throughput increase |
| Flink parallelism | 1 | 64 | Linear scaling with partitions |
| Flink network buffers | 2048 | 8192 | Reduced backpressure by 40% |
| RocksDB block cache | 8 MB | 256 MB | 60% reduction in state access latency |
| Checkpoint interval | 60s | 30s | Faster recovery, minimal overhead |
Beyond configuration tuning, we have also invested heavily in optimizing our serialization and deserialization paths. Switching from JSON to Avro for our internal event formats reduced serialization overhead by approximately sixty percent and cut our network bandwidth consumption nearly in half. We also implemented custom Flink serializers for our most frequently accessed state objects, which eliminated the overhead of Kryo serialization and reduced checkpoint sizes by roughly forty percent.
Another critical optimization was implementing proper partitioning strategies for our Kafka topics. We use a combination of hash-based partitioning for events that require ordering guarantees and round-robin partitioning for events where ordering is not important. This ensures that related events are processed by the same Flink subtask while maintaining even load distribution across partitions. The partitioning key selection is crucial—choosing keys with high cardinality and even distribution prevents hot partitions that can create bottlenecks in the processing layer.
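The two partitioning strategies can be sketched as follows. Kafka's default partitioner actually hashes the serialized key bytes with murmur2; this sketch substitutes `String.hashCode` purely to stay self-contained, but the property it demonstrates is the same: hash partitioning is sticky per key, round-robin spreads load evenly.

```java
// Sketch of key-based versus round-robin partition assignment.
// Kafka's default partitioner uses murmur2 over the serialized key bytes;
// String.hashCode is used here only to keep the example self-contained.
public class PartitionAssigner {

    private final int numPartitions;
    private int roundRobinCounter = 0;

    public PartitionAssigner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Hash partitioning: the same key always maps to the same partition,
    // which preserves per-key ordering and keeps related events on one subtask.
    public int forKey(String key) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Round-robin: even spread across partitions when ordering does not matter.
    public int nextRoundRobin() {
        return (roundRobinCounter++) % numPartitions;
    }
}
```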
Monitoring and Observability
Operating data pipelines at scale requires comprehensive observability. We have built a monitoring stack that provides visibility into every layer of the pipeline, from Kafka broker metrics to Flink operator-level statistics. Our monitoring setup includes real-time dashboards showing end-to-end latency percentiles, throughput rates, consumer lag, checkpoint durations, and error rates. We use a combination of Prometheus for metrics collection, Grafana for visualization, and PagerDuty for alerting, with runbooks linked directly from alert notifications to speed up incident response.
One monitoring pattern that has proven especially valuable is tracking event processing latency as a histogram rather than a simple average. This allows us to detect latency degradation in the tail of the distribution—the p99 and p999 latencies—which often indicate emerging problems long before they become visible in average latency metrics. We alert on p99 latency exceeding our SLO thresholds, which gives our on-call engineers time to investigate and address issues before they impact end users.
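For illustration, here is how a tail percentile is extracted from raw latency samples using the nearest-rank method. This is a didactic sketch; monitoring systems compute percentiles from streaming histograms (HDR histograms or similar) rather than by sorting raw samples.

```java
import java.util.Arrays;

// Nearest-rank percentile over raw latency samples. Illustrative only;
// production monitoring uses streaming histograms, not raw-sample sorts.
public final class LatencyPercentiles {

    private LatencyPercentiles() {}

    // p is a fraction in (0, 1], e.g. 0.99 for the p99 latency.
    public static double percentile(double[] samplesMillis, double p) {
        double[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Nearest-rank: the smallest sample covering fraction p of the data.
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```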
Looking Ahead
Building scalable data pipelines is a continuous journey rather than a destination. The architectures and practices described in this article represent our current state of the art, but we are constantly evaluating new technologies and approaches. We are particularly excited about emerging developments in unified batch-stream processing, serverless stream processing engines, and AI-driven pipeline optimization. As data volumes continue to grow, the engineering challenges will only become more interesting.
About the Author
Sarah Chen is the Head of Engineering at Primates, where she leads the platform infrastructure and distributed systems teams. With over fifteen years of experience building large-scale systems at companies including Google and Stripe, Sarah specializes in designing fault-tolerant architectures that handle billions of requests daily. She holds a Ph.D. in Computer Science from MIT and is a frequent speaker at distributed systems conferences worldwide.