Uber Implements Disaster Recovery for Multi-Region Kafka

In a recent blog post, Uber engineers highlight how they use a replication system developed in-house to implement disaster recovery at scale with a multi-region Kafka deployment.

Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. To use Kafka, the engineers had to provide business resilience and continuity in the face of natural and human-made disasters. They built uReplicator, Uber's open-source solution for replicating Kafka data. Uber based uReplicator on Kafka's MirrorMaker, with improvements in reliability, a zero-data-loss guarantee, and ease of operation.

Uber engineers Yupeng Fu and Mingmin Chen summarize their insights:


A key insight from the practices is that offering reliable and multi-regional available infrastructure services like Kafka can greatly simplify the development of the business continuity plan for the applications. The application can store its state in the infrastructure layer and thus become stateless, leaving the complexity of state management, like synchronization and replication across regions, to the infrastructure services.


Using uReplicator, Uber engineers built the following Kafka topology for disaster recovery:

Source: https://eng.uber.com/kafka/

Each producer writes data into a local, regional Kafka cluster. This approach is the most performant option. If the local Kafka cluster fails, the producer fails over to another regional Kafka cluster. Then, uReplicator replicates the regional clusters to aggregate Kafka clusters available in all regions. Each aggregate cluster contains data aggregated from all regions.
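The write path described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: `Cluster`, `produce`, and `replicate` are hypothetical names invented for illustration, plain Python lists stand in for Kafka logs, and nothing here reflects uReplicator's actual API or internals.

```python
from dataclasses import dataclass, field


class ClusterUnavailable(Exception):
    """Raised when a simulated cluster cannot accept writes."""


@dataclass
class Cluster:
    """Hypothetical stand-in for a Kafka cluster: a name, a health flag, a log."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def append(self, message):
        if not self.healthy:
            raise ClusterUnavailable(self.name)
        self.log.append(message)


def produce(message, local, fallbacks):
    """Write to the local regional cluster, failing over to another region."""
    for cluster in [local, *fallbacks]:
        try:
            cluster.append(message)
            return cluster.name
        except ClusterUnavailable:
            continue
    raise ClusterUnavailable("no regional cluster accepted the message")


def replicate(regionals, aggregates, copied):
    """Copy each not-yet-copied regional message into every aggregate cluster."""
    for regional in regionals:
        if not regional.healthy:
            continue  # an unreachable cluster cannot be read from
        start = copied.get(regional.name, 0)
        for message in regional.log[start:]:
            for aggregate in aggregates:
                aggregate.append(message)
        copied[regional.name] = len(regional.log)


regional_a, regional_b = Cluster("regional-a"), Cluster("regional-b")
aggregate_a, aggregate_b = Cluster("aggregate-a"), Cluster("aggregate-b")
copied = {}

produce("trip-1", regional_a, [regional_b])
replicate([regional_a, regional_b], [aggregate_a, aggregate_b], copied)

regional_a.healthy = False  # simulate a regional cluster outage
used = produce("trip-2", regional_a, [regional_b])
replicate([regional_a, regional_b], [aggregate_a, aggregate_b], copied)
```

In this toy run the second message lands in the other region's cluster, and both aggregate clusters still end up with the same two messages.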

Data consumption is always done from the aggregate Kafka cluster in each region using one of two consumer topologies: active-active or active-passive. Active-active is preferred when performance and faster time to recovery are more important. Active-passive, on the other hand, is preferred when consistency is more important.

In active-active mode, consumers in all regions read the data from the aggregate clusters. However, only a chosen primary service updates the processed results in an active-active database available in all regions. The figure below demonstrates this concept with a Flink job that calculates Uber's surge pricing data.

Source: https://eng.uber.com/kafka/
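As a rough illustration of the active-active idea, the sketch below runs the same computation in two regions while letting only the designated primary write its result. The names `RegionJob` and `publish`, the event shape, and the request-count aggregate are all assumptions made for illustration; the real surge pricing logic and the Flink job are not shown in the source.

```python
from dataclasses import dataclass


@dataclass
class RegionJob:
    """Hypothetical per-region job that keeps its own copy of the computed state."""
    region: str
    total_requests: int = 0

    def consume(self, events):
        # Every region processes the full aggregated stream.
        for event in events:
            self.total_requests += event["requests"]


def publish(jobs, primary_region, database):
    """Only the job in the primary region writes to the shared database."""
    for job in jobs:
        if job.region == primary_region:
            database["demand"] = {
                "written_by": job.region,
                "total_requests": job.total_requests,
            }


aggregated_events = [{"requests": 2}, {"requests": 3}]
jobs = [RegionJob("region-a"), RegionJob("region-b")]
for job in jobs:
    job.consume(aggregated_events)

database = {}  # stands in for the active-active database
publish(jobs, "region-a", database)
before_failover = dict(database["demand"])

# Failover: the other region already holds the same state, so it can
# take over writing without replaying the stream.
publish(jobs, "region-b", database)
after_failover = dict(database["demand"])
```

Because the standby region has already computed the same state, switching the primary only changes who writes, which is one way to read the faster recovery time attributed to this mode.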

In active-passive mode, only one consumer is allowed to consume from the aggregate clusters in one of the regions at a time. Kafka replicates the consumption offset to other regions. Upon failure of the primary region, the consumer fails over to another region and resumes its consumption. Uber needed to handle a caveat in this mode, where “messages from the aggregate clusters may become out of order after aggregating from the regional clusters.” Uber introduced an offset manager whose mission is to resolve these discrepancies and achieve zero data loss in the face of region failover, depicted in the following diagram.

Source: https://eng.uber.com/kafka/
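To make the out-of-order caveat concrete, the sketch below shows one conservative way a consumer could pick a resume offset after failover. It is not Uber's offset manager: it assumes periodic checkpoints of the form (offset in the primary aggregate cluster, offset in the standby aggregate cluster), where every message before the standby offset is known to appear before the primary offset. The function name `resume_offset` and the toy logs are hypothetical.

```python
from bisect import bisect_right


def resume_offset(checkpoints, committed):
    """Return the standby offset of the latest checkpoint at or before `committed`.

    `checkpoints` is a list of (primary_offset, standby_offset) pairs sorted
    by primary_offset. Falling back to an earlier checkpoint may re-read some
    messages but never skips one.
    """
    primary_offsets = [primary for primary, _ in checkpoints]
    index = bisect_right(primary_offsets, committed) - 1
    if index < 0:
        return 0  # no usable checkpoint: start from the beginning
    return checkpoints[index][1]


# The same six messages arrive in a different order in each aggregate cluster.
primary_log = ["m1", "m2", "m3", "m4", "m5", "m6"]
standby_log = ["m2", "m1", "m3", "m5", "m4", "m6"]
checkpoints = [(0, 0), (2, 2), (5, 5)]

committed = 4  # the consumer had processed m1..m4 before the region failed
consumed = set(primary_log[:committed])

# Naively reusing the committed offset in the standby cluster skips m5.
naive_missed = set(primary_log) - consumed - set(standby_log[committed:])

# Resuming from the mapped checkpoint re-reads m3 and m4 but loses nothing.
resume = resume_offset(checkpoints, committed)
consumed_after_failover = consumed | set(standby_log[resume:])
```

The trade-off in this sketch is duplicates instead of gaps: the consumer may see a few messages twice after failover, so downstream processing would need to tolerate reprocessing.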