January 26, 2020


Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform handles around 1 TB per day of application logs generated by the many services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices and lessons learned on topics such as schema evolution and network architecture.

Talk Title Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform
Speakers Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018
URL Talk Page
Slides Talk Slides

Recruit Group is one of the largest web service providers in Japan. It operates many services covering diverse business fields, including travel and restaurant reservations, human resource services, and POS systems. Analyzing the application logs collected from these services enables the company to provide more insightful services to individuals and corporate customers. Rough estimates put the log volume at around 1 TB per day, and the number of servers and instances to collect logs from is expected to exceed 1,000 in the future. Recruit Group had to design a platform that could handle these ever-changing requirements, so it began a project to collect and analyze all the application logs generated by these services efficiently and easily.

The first step was to develop a platform that receives extensive logs from upstream applications and transfers them to downstream ones efficiently and effectively. This platform is based on the data hub architecture and uses Apache Kafka for high performance and scalability. The Kafka cluster runs on Google Compute Engine, alongside managed services in Google Cloud Platform such as BigQuery and Pub/Sub for analysis.

Recruit Group faced quite a few technical problems while developing this platform. Kenji Hayashida and Toru Sasaki share some of these critical problems and explain how the company solved them. Along the way, you'll explore the platform and get lessons learned and best practices drawn from this experience. Topics include schema evolution and network architecture.
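To make the data-hub pattern concrete: in such an architecture, each upstream service publishes its logs to a central Kafka topic rather than shipping them point-to-point to every consumer. The sketch below is not from the talk; it is a minimal Java producer under assumed names (the `app-logs` topic, the `kafka-broker:9092` address, and the service key are all hypothetical placeholders).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LogProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker address; a production cluster would list several.
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all trades latency for durability, a common choice for a log hub.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by service name keeps each service's logs ordered
            // within a partition while spreading load across the cluster.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "app-logs",            // hypothetical hub topic
                    "travel-reservation",  // hypothetical service key
                    "{\"level\":\"INFO\",\"msg\":\"booking created\"}");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("wrote to %s-%d@%d%n",
                    meta.topic(), meta.partition(), meta.offset());
        }
    }
}
```

Downstream, consumers (or connectors feeding BigQuery and Pub/Sub) read from the same topic independently, which is what lets one hub decouple 1,000+ producers from many analytical sinks.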
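Schema evolution, one of the topics named above, is commonly managed in Kafka pipelines with Avro. The talk does not publish its schemas, so the sketch below uses a hypothetical `AppLog` record to show the standard backward-compatible change: adding a nullable field with a default, verified with Avro's built-in compatibility checker.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    // Version 1 of a hypothetical log schema.
    static final String V1 = "{"
        + "\"type\":\"record\",\"name\":\"AppLog\",\"fields\":["
        + "{\"name\":\"service\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"}]}";

    // Version 2 adds an optional field with a default,
    // a backward-compatible change.
    static final String V2 = "{"
        + "\"type\":\"record\",\"name\":\"AppLog\",\"fields\":["
        + "{\"name\":\"service\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"},"
        + "{\"name\":\"userId\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}";

    public static void main(String[] args) {
        Schema writer = new Schema.Parser().parse(V1); // old producers
        Schema reader = new Schema.Parser().parse(V2); // upgraded consumers
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        // Prints COMPATIBLE: v2 readers can decode v1 records because
        // the new field falls back to its default value.
        System.out.println(result.getType());
    }
}
```

Running such a check before deploying a new schema version lets consumers be upgraded independently of the hundreds of producing services, which matters at this scale.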
