How Uber Uses Apache Kafka and Presto for Data Analysis

How Uber Harnesses the Power of Presto and Apache Kafka for Data Analysis

At Uber, data is at the heart of everything we do. From optimizing routes for drivers to predicting rider demand, our ability to process and analyze vast amounts of data in real-time is crucial. In this blog, we’ll explore how Uber integrates Presto and Apache Kafka to handle real-time data analytics efficiently.

Understanding Presto and Kafka

What is Presto? Presto is an open-source distributed SQL query engine that allows us to run interactive queries on large datasets. It’s incredibly versatile, capable of querying data stored in a variety of sources such as databases and data lakes.

What is Apache Kafka? Apache Kafka is a distributed streaming platform that lets us publish, subscribe to, store, and process streams of records in real-time. It acts as a backbone for data streaming at Uber, handling trillions of messages daily.
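Kafka's core abstraction is an append-only, partitioned log that consumers read by offset. As a purely illustrative toy model (no Kafka client involved, all names invented here), the publish/read-by-offset idea looks roughly like:

```python
# Toy model of Kafka's log abstraction: a topic holds append-only
# partitions, and readers address records by (partition, offset).
class ToyTopic:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Records with the same key land in the same partition,
        # which is how Kafka preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        return self.partitions[partition][offset]

orders = ToyTopic()
part, off = orders.publish("order-1", {"status": "created"})
print(orders.read(part, off))
```

This partition/offset addressing is what later makes bounded, point-in-time lookups from Presto possible.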

The Challenge: Real-Time Data Analysis

As Uber scaled, data teams needed a way to perform quick, on-demand analysis on real-time data streams from Kafka. Traditional methods either required complex setups or couldn’t deliver the low-latency performance needed for tasks like troubleshooting and ad-hoc data exploration.

Exploring Solutions

Uber evaluated several options to address its real-time data analysis needs:

  • Streaming Processing Engines: Tools like Apache Flink continuously process streams but aren’t ideal for quick, point-in-time lookups.
  • Real-time OLAP Datastores: Solutions like Apache Pinot provide low-latency queries but require significant setup and resources.

Ultimately, Uber found that integrating Presto with Kafka offered the best balance of ease-of-use, flexibility, and performance.

The Solution: Integrating Presto and Kafka

By integrating Presto with Kafka, Uber enabled real-time SQL queries directly on Kafka streams. This approach had several benefits:

  • Immediate Availability: Kafka topics can be queried as soon as they are created, with no extra setup required.
  • Powerful Query Capabilities: Presto’s ability to join data from multiple sources allows for comprehensive insights.

However, the integration faced several challenges, such as dynamic topic discovery, query restrictions, and quota control, which we overcame through targeted improvements.

Overcoming Challenges

Integrating these systems required addressing several technical challenges:

  • Dynamic Topic Discovery: we implemented a system for on-demand discovery of Kafka topics and schemas.
  • Query Restrictions: to maintain performance, we enforced limits on how much data each query could pull from Kafka.
  • Quota Control: we used Kafka’s broker quotas to manage data consumption, preventing system overloads.

Dynamic Topic and Schema Discovery

We made changes to enable on-demand cluster, topic, and schema discovery. Kafka topic metadata and data schemas are fetched through KafkaMetadata at runtime. We extracted the TableDescriptionSupplier interface to supply this metadata and extended it to read from our in-house Kafka cluster management service and schema registry at runtime. This allows a single Kafka connector to support multiple Kafka clusters, with a cache layer to reduce the number of requests hitting the Kafka clusters.
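The actual implementation extends Presto's Java TableDescriptionSupplier interface; as a hedged, language-neutral sketch in Python (every name below is illustrative, not Uber's code), the on-demand lookup with a TTL cache layer might look like:

```python
import time

class CachedSchemaSupplier:
    """Fetch topic schemas on demand, caching results with a TTL so
    repeated Presto queries don't hammer the schema registry."""

    def __init__(self, fetch_fn, ttl_seconds=300):
        self.fetch_fn = fetch_fn      # e.g. a call to the schema registry
        self.ttl = ttl_seconds
        self.cache = {}               # (cluster, topic) -> (expiry, schema)

    def get_schema(self, cluster, topic):
        key = (cluster, topic)
        entry = self.cache.get(key)
        if entry and entry[0] > time.time():
            return entry[1]           # cache hit: no registry round-trip
        schema = self.fetch_fn(cluster, topic)
        self.cache[key] = (time.time() + self.ttl, schema)
        return schema

# Usage: count round-trips to a fake registry.
calls = []
def fake_registry(cluster, topic):
    calls.append((cluster, topic))
    return {"fields": ["uuid", "_timestamp", "_partition_offset"]}

supplier = CachedSchemaSupplier(fake_registry)
supplier.get_schema("cluster-a", "order")
supplier.get_schema("cluster-a", "order")  # served from the cache
print(len(calls))  # the registry was hit only once
```

Keying the cache on (cluster, topic) is what lets one connector serve multiple Kafka clusters without a metadata request per query.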

Query Filter

To improve reliability, we added filter enforcement checks that require either _timestamp or _partition_offset in the filter constraints of Presto queries against Kafka. Queries without these filters are rejected, preventing large scans that could degrade system performance.
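As a simplified sketch of that rule (the real check runs inside the Presto Kafka connector on parsed filter constraints, not on raw SQL text), a query is admitted only if its filters constrain _timestamp or _partition_offset:

```python
REQUIRED_FILTER_COLUMNS = {"_timestamp", "_partition_offset"}

def admit_kafka_query(filter_columns):
    """Reject Kafka-backed queries whose filters touch neither
    _timestamp nor _partition_offset, to avoid full-topic scans."""
    if REQUIRED_FILTER_COLUMNS.isdisjoint(filter_columns):
        raise ValueError(
            "Kafka queries must filter on _timestamp or _partition_offset")
    return True

# Admitted: bounded by a time-range filter.
admit_kafka_query({"uuid", "_timestamp"})

# Rejected: would scan the whole topic.
try:
    admit_kafka_query({"uuid"})
except ValueError as e:
    print(e)
```

Bounding every query by time or offset keeps the worst-case read proportional to the filtered range rather than the full topic retention window.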

Quota Control on Kafka Cluster

Kafka is critical infrastructure at Uber, supporting many real-time use cases, so we needed to prevent cluster degradation by limiting Presto’s consumption throughput. We achieved this by specifying a static Kafka consumer client ID for all Presto workers, placing them in the same quota pool, which ensures stable consumption rates and prevents overloads.
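Kafka broker quotas are keyed by client ID, so giving every Presto worker the same ID makes them share one quota pool. A hedged sketch of the idea (property names follow standard Kafka consumer configuration; the quota itself, e.g. a consumer_byte_rate for this client ID, would be set on the brokers, and the broker address and ID values here are invented):

```python
# Every Presto worker uses the same static client.id, so Kafka's
# broker-side quota for that client ID applies to the workers'
# combined throughput rather than to each worker individually.
PRESTO_KAFKA_CLIENT_ID = "presto-kafka-connector"

def consumer_config(worker_name):
    return {
        "bootstrap.servers": "kafka-broker:9092",  # illustrative address
        "client.id": PRESTO_KAFKA_CLIENT_ID,       # static, shared by all workers
        "group.id": f"presto-{worker_name}",       # groups may differ; quotas key on client.id
    }

configs = [consumer_config(f"worker-{i}") for i in range(3)]
ids = {c["client.id"] for c in configs}
print(ids)  # a single shared client ID -> one shared quota pool
```

With a per-worker client ID instead, adding workers would multiply the total consumption the cluster has to absorb; the static ID caps Presto's aggregate footprint regardless of fleet size.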

The Impact

The integration of Presto and Kafka has significantly improved our ability to perform ad-hoc data analysis. Engineers can now write simple SQL queries to retrieve real-time data in seconds, greatly enhancing productivity and decision-making. Before this integration, data lookups could take tens of minutes; now, they can be completed in just a few seconds.

For example, checking if a specific order message is missing from a Kafka stream can be done with a simple SQL query:

SELECT * FROM kafka.cluster.order WHERE uuid = '0e43a4-5213-11ec';

This query returns the result in seconds, allowing for quick troubleshooting and data exploration.

Conclusion

Integrating Presto with Kafka has transformed how Uber handles real-time data analysis. By enabling quick, on-demand queries over real-time data streams, we’ve empowered our teams to make faster, data-driven decisions. Moving forward, we plan to share our improvements with the open-source community to help others benefit from our work.

Read also “How Uber Uses Kafka In Its Dynamic Pricing Model”.
