November 18, 2019

Federated Prometheus Monitoring at Scale


Talk Title	Federated Prometheus Monitoring at Scale
Speakers	LungChih Tung (Software Developer II, Oath Inc), Nandhakumar Venkatachalam (Princi Production Engineer, Oath Inc)
Conference	KubeCon + CloudNativeCon Europe
Location	Copenhagen, Denmark
Date	Apr 30-May 4, 2018
URL	Talk Page
Slides	Talk Slides

In Media Build and Products under Oath, we run 12 production Kubernetes clusters across our data centers, totaling ~1200 machines with multi-tenant deployments. We monitor the clusters with Prometheus: each cluster runs its own Prometheus instance, and a single federated instance with persistent storage aggregates across all of them. In total there are ~17 million time series (max 5 million per instance), with a sample ingestion rate of 300K (max 80K per instance). On the federated instance we have built dashboards for the Controller, Scheduler, API server, DNS, Kubelet, and Etcd, plus utilization overall and per tenant namespace, deployment, and container, which together give high visibility. We use Alertmanager, which provides powerful alerting capabilities, to alert the on-call on cluster status, node availability, scrape status, fd usage, etc. We would like to share our experience of monitoring multiple Kubernetes clusters in a multi-tenant environment.
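To make the federation layout above more concrete, here is a minimal Python sketch of how a federated instance could address each per-cluster Prometheus through the standard `/federate` endpoint with `match[]` selectors. The cluster names, hostnames, and selectors are hypothetical placeholders, not details from the talk, and the speakers' actual setup may differ.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that requests the series matching each selector."""
    # Prometheus accepts one match[] parameter per series selector.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-cluster Prometheus endpoints (assumed for illustration).
CLUSTERS = {
    "cluster-a": "http://prometheus.cluster-a.example:9090",
    "cluster-b": "http://prometheus.cluster-b.example:9090/",
}

# Hypothetical selectors: kubelet metrics plus pre-aggregated recording rules.
SELECTORS = ['{job="kubelet"}', '{__name__=~"job:.*"}']

urls = {name: federate_url(base, SELECTORS) for name, base in CLUSTERS.items()}

for name, url in urls.items():
    print(name, url)
```

Pulling only selected or pre-aggregated series at the federated level is one common way to keep the top-level series count well below the sum of the per-cluster counts.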

container dns api cluster dashboard prometheus data center monitoring kubernetes

Introducing Amazon EKS

November 18, 2019

Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a new managed service for running Kubernetes on AWS. This session will provide an overview of Amazon EKS, why we built it, and how it wo …

Declarative Multi-Cluster Monitoring with Prometheus

November 17, 2019

Loodse and CoreOS run many different setups across multiple heterogeneous public and private clouds. In this context, they faced some challenges with deploying the monitoring infrastructure. Matthias …

Everything you Need to Know about Using GPUs with Kubernetes

November 17, 2019

This talk will start by describing the need for making Kubernetes aware of resources like GPUs. Then it will briefly go into the history of GPU support in Kubernetes and the various backwards incompat …

Kubernetes Runs Anywhere, but Does your Data?

November 17, 2019

Kubernetes is now the de facto container orchestrator and it is able to run basically anywhere, from cloud providers to bare metal clusters. With this ubiquitous ability to run your applications comes …

Automating GPU Infrastructure for Kubernetes

November 16, 2019

Kubernetes has seen broad interest from the machine learning community and many users are bringing GPUs to their clusters. However, compiling, installing, and updating the NVIDIA kernel modules needed …

Building a Fault-Tolerant Custom Resources Controller on Kubernetes

November 16, 2019

CRDs (custom resource definitions) are widely used to extend the behavior of Kubernetes. Just as all other Kubernetes resources have controllers, so do CRDs. It is important that the custom resources are mana …