January 26, 2020


Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform handles around 1 TB per day of application logs generated by the many services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices and lessons learned on topics such as schema evolution and network architecture.

Talk Title Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform
Speakers Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018
URL Talk Page
Slides Talk Slides

Recruit Group is one of the largest web service providers in Japan. It operates many services covering diverse business fields, including travel and restaurant reservations, human resource services, and POS systems. Analyzing the application logs collected from these services enables the company to provide more insightful services to individuals and corporate customers. Rough estimates put the log volume at around 1 TB per day, and the number of servers and instances to collect logs from is expected to exceed 1,000 in the future. Recruit Group had to design a platform that could handle these ever-changing requirements, so it began a project to collect and analyze all the application logs generated by these services efficiently and easily.

The first step was to develop a platform that receives extensive logs from upstream applications and transfers them to downstream ones efficiently and effectively. This platform is based on the data hub architecture and uses Apache Kafka for high performance and scalability. The Kafka cluster runs on Google Compute Engine, alongside managed services in Google Cloud Platform such as BigQuery and Pub/Sub for analysis.

Recruit Group faced quite a few technical problems while developing this platform. Kenji Hayashida and Toru Sasaki share some of these critical problems and explain how the company solved them. Along the way, you'll explore the platform and get lessons learned and best practices drawn from this experience. Topics include schema evolution and network architecture.
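To make the data-hub pattern concrete: in such an architecture, each upstream service publishes its logs to a central Kafka topic rather than shipping them point-to-point to every consumer. The sketch below is not from the talk; it is a minimal Java producer under assumed names (the `app-logs` topic, the `kafka-broker:9092` address, and the service key are all hypothetical placeholders).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LogProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker address; a production cluster would list several.
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all trades latency for durability, a common choice for a log hub.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by service name keeps each service's logs ordered
            // within a partition while spreading load across the cluster.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "app-logs",            // hypothetical hub topic
                    "travel-reservation",  // hypothetical service key
                    "{\"level\":\"INFO\",\"msg\":\"booking created\"}");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("wrote to %s-%d@%d%n",
                    meta.topic(), meta.partition(), meta.offset());
        }
    }
}
```

Downstream, consumers (or connectors feeding BigQuery and Pub/Sub) read from the same topic independently, which is what lets one hub decouple 1,000+ producers from many analytical sinks.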
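Schema evolution, one of the topics named above, is commonly managed in Kafka pipelines with Avro. The talk does not publish its schemas, so the sketch below uses a hypothetical `AppLog` record to show the standard backward-compatible change: adding a nullable field with a default, verified with Avro's built-in compatibility checker.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    // Version 1 of a hypothetical log schema.
    static final String V1 = "{"
        + "\"type\":\"record\",\"name\":\"AppLog\",\"fields\":["
        + "{\"name\":\"service\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"}]}";

    // Version 2 adds an optional field with a default,
    // a backward-compatible change.
    static final String V2 = "{"
        + "\"type\":\"record\",\"name\":\"AppLog\",\"fields\":["
        + "{\"name\":\"service\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"},"
        + "{\"name\":\"userId\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}";

    public static void main(String[] args) {
        Schema writer = new Schema.Parser().parse(V1); // old producers
        Schema reader = new Schema.Parser().parse(V2); // upgraded consumers
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        // Prints COMPATIBLE: v2 readers can decode v1 records because
        // the new field falls back to its default value.
        System.out.println(result.getType());
    }
}
```

Running such a check before deploying a new schema version lets consumers be upgraded independently of the hundreds of producing services, which matters at this scale.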
