January 3, 2020


Geospatial big data analysis at Uber




Talk Title	Geospatial big data analysis at Uber
Speakers	Zhenxiao Luo (Twitter), Wei Yan (Uber)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 26-28, 2017

Uber’s geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto. Zhenxiao and Wei start with an overview of Uber’s big data infrastructure before explaining how Uber models geospatial data and outlining its data ingestion pipeline. They then discuss geospatial query performance improvement techniques and experiences, focusing on geospatial data processing in big data systems, including Hadoop and Presto. Zhenxiao and Wei conclude by sharing Uber’s use cases and roadmap.

Tags: roadmap, reliability, hadoop, infrastructure, uber, big data, use case, performance, pipeline


The columnar roadmap: Apache Parquet and Apache Arrow


December 29, 2019

Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.

Presto: Distributed SQL on anything (sponsored by Teradata)


November 3, 2019

Teradata joined the Presto community in 2015 and is now a leading contributor to this open source SQL engine, originally created by Facebook. Join Kamil Bajda-Pawlikowski to learn about Presto, Teradata's recent enhancements in query performance, security integrations, and ANSI SQL coverage, and its roadmap for 2017 and beyond.

Key big data architectural considerations for deploying in the cloud and on-premises (sponsored by NetApp)


January 1, 2020

When analytics applications become business critical, balancing cost with SLAs for performance, backup, dev, test, and recovery is difficult. Karthikeyan Nagalingam discusses big data architectural challenges and how to address them and explains how to create a cost-optimized solution for the rapid deployment of business-critical applications that meet corporate SLAs today and into the future.

Mini

December 16, 2019

The Network Automation & Orchestration Summit Mini-Summit schedule is below. 10:30am - 11:00am ONAP Welcome - Message from the Chairman of the Board, Chris Rice, AT&T; President of the Board, Yachen …

PinTrace: A distributed tracing pipeline


December 14, 2019

Distributed tracing is an emerging field of monitoring distributed systems. Suman Karumuri shares the challenges of building and deploying distributed tracing at scale using PinTrace, one of the largest distributed tracing pipelines. Drawing on real-world examples, Suman explains how traces can be used to understand, debug, and optimize your production workflows.

Big data computations: Comparing Apache HAWQ, Druid, and GPU databases


December 5, 2019

The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, one of the open source GPU databases. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes.