Project author: keikoproj

Project description:
Provides deep monitoring and self-healing of Kubernetes clusters
Primary language: Go
Project URL: git://github.com/keikoproj/active-monitor.git
Created: 2019-08-09T19:32:39Z
Project community: https://github.com/keikoproj/active-monitor

License: Apache License 2.0


Active-Monitor

Motivation

Active-Monitor is a Kubernetes custom resource controller which enables deep cluster monitoring and self-healing using Argo workflows.

While it is not too difficult to know that all entities in a cluster are running individually, it is often quite challenging to know that they can all coordinate with each other as required for successful cluster operation (network connectivity, volume access, etc).

Overview

Active-Monitor creates a new health namespace when it is installed in the cluster. Users can then create HealthCheck objects and submit them to the Kubernetes API server. A HealthCheck / Remedy is essentially an instrumented wrapper around an Argo workflow.

The HealthCheck workflow is run periodically, as defined by repeatAfterSec or a schedule: cron property in its spec, and watched by the Active-Monitor controller.

Active-Monitor sets the status of the HealthCheck CR to indicate whether the monitoring check succeeded or failed. If the monitoring check failed, the Remedy workflow is executed to fix the issue, and the status of the Remedy is updated in the CR. External systems can query these CRs and take appropriate action if the checks failed.

The RemedyRunsLimit parameter configures how many times a Remedy should be run. If the Remedy action fails for any reason, no further retries are attempted. The parameter is optional; if it is not set, the Remedy workflow is triggered whenever the HealthCheck workflow fails.

The RemedyResetInterval parameter resets the Remedy after the reset interval has elapsed, so that the Remedy workflow can be retried if the monitor workflow fails again. If the Remedy has reached RemedyRunsLimit, it is also reset when the HealthCheck passes in any subsequent run before RemedyResetInterval elapses.
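The interaction between RemedyRunsLimit and RemedyResetInterval described above can be sketched as a small decision function. This is one reading of the text, not the controller's actual code: the remedyState type, its fields, and the helper names are hypothetical, the interval is assumed to be in seconds and measured from the last reset, and the rule about a failed Remedy action stopping further retries is not modelled.

```go
package main

import (
	"fmt"
	"time"
)

// remedyState is a hypothetical bookkeeping type for this sketch; it is not
// taken from the Active-Monitor code base.
type remedyState struct {
	RunsLimit        int       // RemedyRunsLimit; 0 is treated as "not set"
	ResetIntervalSec int       // RemedyResetInterval in seconds (assumed unit); 0 means "not set"
	runs             int       // remedy runs since the last reset
	lastReset        time.Time // when the run counter was last reset
}

// shouldRunRemedy reports whether the Remedy workflow should be triggered
// after a HealthCheck run that finished at now with the given result.
func (s *remedyState) shouldRunRemedy(passed bool, now time.Time) bool {
	if passed {
		// A passing HealthCheck resets the remedy counter.
		s.runs, s.lastReset = 0, now
		return false
	}
	if s.RunsLimit == 0 {
		// No limit configured: every failed HealthCheck triggers the remedy.
		return true
	}
	interval := time.Duration(s.ResetIntervalSec) * time.Second
	if interval > 0 && now.Sub(s.lastReset) >= interval {
		// Assumption: the reset interval is measured from the last reset.
		s.runs, s.lastReset = 0, now
	}
	if s.runs >= s.RunsLimit {
		return false
	}
	s.runs++
	return true
}

// demoStart and minutesAfterStart only make the demo timeline readable.
var demoStart = time.Date(2019, 8, 9, 0, 0, 0, 0, time.UTC)

func minutesAfterStart(n int) time.Time {
	return demoStart.Add(time.Duration(n) * time.Minute)
}

func main() {
	s := &remedyState{RunsLimit: 2, ResetIntervalSec: 600, lastReset: demoStart}
	for _, minute := range []int{0, 1, 2, 11} {
		run := s.shouldRunRemedy(false, minutesAfterStart(minute))
		fmt.Printf("failure at +%dm: run remedy = %v\n", minute, run)
	}
}
```

In this sketch the third consecutive failure is skipped because the limit of two runs has been reached, and the failure at +11 minutes runs the remedy again because the ten-minute reset interval has passed.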

Typical examples of such workflows include tests for basic Kubernetes object creation/deletion, checks on cluster-wide services such as policy engines, authentication and authorization checks, etc.

The kinds of HealthChecks one could run with Active-Monitor include:

  • verify namespace and deployment creation
  • verify AWS resources are using < 80% of their instance limits
  • verify kube-dns by running DNS lookups on the network
  • verify kube-dns by running DNS lookups on localhost
  • verify KIAM agent by running aws sts get-caller-identity on all available nodes
  • verify whether a pod has reached its maximum thread count
  • verify whether a storage volume for a pod (e.g. prometheus) has reached its capacity
  • verify whether critical pods (e.g. calico, kube-dns/core-dns) are in a failed or CrashLoopBackOff state

HealthChecks can be run at either Cluster or Namespace level, in any namespace, provided that the namespace already exists.
The level property in the HealthCheck spec defines the scope at which the check runs; it can be either Namespace or Cluster.

When level is set to Namespace, Active-Monitor creates a ServiceAccount in the namespace defined in the workflow spec. It also creates a Role and RoleBinding with namespace-level permissions so that the HealthChecks in that namespace can be performed.

When level is set to Cluster, Active-Monitor creates a ServiceAccount in the namespace defined in the workflow spec. It also creates a ClusterRole and ClusterRoleBinding with cluster-level permissions so that HealthChecks at cluster scope can be performed.
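The mapping from level to RBAC objects described above can be summarised in a few lines of Go. This is a minimal sketch rather than the controller's implementation: rbacKinds and rbacKindsForLevel are hypothetical names, matching the level case-insensitively is a choice made here, and the behaviour for an omitted level is not stated in this document, so the sketch simply rejects it.

```go
package main

import (
	"fmt"
	"strings"
)

// rbacKinds names the Kubernetes RBAC object kinds created for a HealthCheck.
// The type is hypothetical and exists only for this sketch.
type rbacKinds struct {
	Role    string
	Binding string
}

// rbacKindsForLevel returns the RBAC kinds that correspond to a HealthCheck
// level, following the description in the text above.
func rbacKindsForLevel(level string) (rbacKinds, error) {
	switch strings.ToLower(level) {
	case "namespace":
		return rbacKinds{Role: "Role", Binding: "RoleBinding"}, nil
	case "cluster":
		return rbacKinds{Role: "ClusterRole", Binding: "ClusterRoleBinding"}, nil
	default:
		// The default for an empty level is not covered here, so it is an error.
		return rbacKinds{}, fmt.Errorf("unsupported level %q", level)
	}
}

func main() {
	for _, level := range []string{"namespace", "cluster"} {
		kinds, err := rbacKindsForLevel(level)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("level=%s -> ServiceAccount + %s + %s\n", level, kinds.Role, kinds.Binding)
	}
}
```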

Dependencies

Installation Guide

    # step 0: ensure that all dependencies listed above are installed or present
    # step 1: install argo workflow controller
    kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-argo.yaml
    # step 2: install active-monitor CRD and start controller
    kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/config/crd/bases/activemonitor.keikoproj.io_healthchecks.yaml
    kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-active-monitor.yaml

Alternate Install - using locally cloned code

    # step 0: ensure that all dependencies listed above are installed or present
    # step 1: install argo workflow-controller
    kubectl apply -f deploy/deploy-argo.yaml
    # step 2: install active-monitor controller
    make install
    kubectl apply -f deploy/deploy-active-monitor.yaml
    # step 3: run the controller via Makefile target
    make run

Usage and Examples

Create a new healthcheck:

Example 1:

Create a new healthcheck in the health namespace, with cluster-level bindings to the specified ServiceAccount:

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello.yaml

OR with local source code:

kubectl create -f examples/inlineHello.yaml

Then, list all healthchecks:

kubectl get healthcheck -n health OR kubectl get hc -n health

    NAME                 LATEST STATUS   SUCCESS CNT   FAIL CNT   AGE
    inline-hello-7nmzk   Succeeded       7             0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-7nmzk -n health

    ...
    Status:
      Failed Count:              0
      Finished At:               2019-08-09T22:50:57Z
      Last Successful Workflow:  inline-hello-4mwxf
      Status:                    Succeeded
      Success Count:             13
    Events:                      <none>

Example 2:

Create a new healthcheck in a specified namespace, with namespace-level bindings to the specified ServiceAccount:

kubectl create ns test

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello_ns.yaml

OR with local source code:

kubectl create -f examples/inlineHello_ns.yaml

Then, list all healthchecks:

kubectl get healthcheck -n test OR kubectl get hc -n test

    NAME                 LATEST STATUS   SUCCESS CNT   FAIL CNT   AGE
    inline-hello-zz5vm   Succeeded       7             0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-zz5vm -n test

    ...
    Status:
      Failed Count:              0
      Finished At:               2019-08-09T22:50:57Z
      Last Successful Workflow:  inline-hello-4mwxf
      Status:                    Succeeded
      Success Count:             13
    Events:                      <none>

argo list -n test

    NAME                 STATUS      AGE   DURATION   PRIORITY
    inline-hello-88rh2   Succeeded   29s   7s         0
    inline-hello-xpsf5   Succeeded   1m    8s         0
    inline-hello-z8llk   Succeeded   2m    7s         0

Generated Resources

  • activemonitor.keikoproj.io/v1alpha1/HealthCheck
  • argoproj.io/v1alpha1/Workflow

Sample HealthCheck CR:

    apiVersion: activemonitor.keikoproj.io/v1alpha1
    kind: HealthCheck
    metadata:
      generateName: dns-healthcheck-
      namespace: health
    spec:
      repeatAfterSec: 60
      description: "Monitor pod dns connections"
      workflow:
        generateName: dns-workflow-
        resource:
          namespace: health
          serviceAccount: activemonitor-controller-sa
          source:
            inline: |
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              spec:
                ttlSecondsAfterFinished: 60
                entrypoint: start
                templates:
                - name: start
                  retryStrategy:
                    limit: 3
                  container:
                    image: tutum/dnsutils
                    command: [sh, -c]
                    args: ["nslookup www.google.com"]
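For tooling that generates HealthChecks programmatically, the sample above can also be produced from Go and handed to kubectl as JSON. The sketch below is an illustration only: the struct types and the newDNSHealthCheck and renderManifest helpers are hypothetical and mirror just the fields visible in the sample, not the project's real API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below mirror only the fields shown in the sample CR; they are
// hypothetical and are not the project's generated API types.
type healthCheck struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   metadata `json:"metadata"`
	Spec       spec     `json:"spec"`
}

type metadata struct {
	GenerateName string `json:"generateName"`
	Namespace    string `json:"namespace"`
}

type spec struct {
	RepeatAfterSec int      `json:"repeatAfterSec"`
	Description    string   `json:"description,omitempty"`
	Workflow       workflow `json:"workflow"`
}

type workflow struct {
	GenerateName string   `json:"generateName"`
	Resource     resource `json:"resource"`
}

type resource struct {
	Namespace      string `json:"namespace"`
	ServiceAccount string `json:"serviceAccount"`
	Source         source `json:"source"`
}

type source struct {
	Inline string `json:"inline"`
}

// dnsWorkflow is the inline Argo workflow from the sample, kept as YAML text.
const dnsWorkflow = `apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  ttlSecondsAfterFinished: 60
  entrypoint: start
  templates:
  - name: start
    retryStrategy:
      limit: 3
    container:
      image: tutum/dnsutils
      command: [sh, -c]
      args: ["nslookup www.google.com"]
`

// newDNSHealthCheck builds the DNS HealthCheck from the sample with a
// caller-chosen repeat interval in seconds.
func newDNSHealthCheck(repeatAfterSec int) healthCheck {
	return healthCheck{
		APIVersion: "activemonitor.keikoproj.io/v1alpha1",
		Kind:       "HealthCheck",
		Metadata:   metadata{GenerateName: "dns-healthcheck-", Namespace: "health"},
		Spec: spec{
			RepeatAfterSec: repeatAfterSec,
			Description:    "Monitor pod dns connections",
			Workflow: workflow{
				GenerateName: "dns-workflow-",
				Resource: resource{
					Namespace:      "health",
					ServiceAccount: "activemonitor-controller-sa",
					Source:         source{Inline: dnsWorkflow},
				},
			},
		},
	}
}

// renderManifest serialises the HealthCheck as indented JSON.
func renderManifest(hc healthCheck) (string, error) {
	out, err := json.MarshalIndent(hc, "", "  ")
	if err != nil {
		return "", fmt.Errorf("encoding HealthCheck: %w", err)
	}
	return string(out), nil
}

func main() {
	manifest, err := renderManifest(newDNSHealthCheck(60))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(manifest)
}
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be piped into kubectl create -f - to submit the HealthCheck.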

Sample HealthCheck CR with a Remedy workflow:

    apiVersion: activemonitor.keikoproj.io/v1alpha1
    kind: HealthCheck
    metadata:
      generateName: fail-healthcheck-
      namespace: health
    spec:
      repeatAfterSec: 60 # duration in seconds
      level: cluster
      workflow:
        generateName: fail-workflow-
        resource:
          namespace: health # workflow will be submitted in this ns
          serviceAccount: activemonitor-healthcheck-sa # workflow will be submitted using this
          source:
            inline: |
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                labels:
                  workflows.argoproj.io/controller-instanceid: activemonitor-workflows
              spec:
                ttlSecondsAfterFinished: 60
                entrypoint: start
                templates:
                - name: start
                  retryStrategy:
                    limit: 1
                  container:
                    image: ravihari/ctrmemory:v2
                    command: ["python"]
                    args: ["promanalysis.py", "http://prometheus.system.svc.cluster.local:9090", "health", "memory-demo", "memory-demo-ctr", "95"]
      remedyworkflow:
        generateName: remedy-test-
        resource:
          namespace: health # workflow will be submitted in this ns
          serviceAccount: activemonitor-remedy-sa # workflow will be submitted using this acct
          source:
            inline: |
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              spec:
                ttlSecondsAfterFinished: 60
                entrypoint: kubectl
                templates:
                  -
                    container:
                      args: ["kubectl delete po/memory-demo"]
                      command: ["/bin/bash", "-c"]
                      image: "ravihari/kubectl:v1"
                    name: kubectl

Active-Monitor Architecture

Access Workflows on Argo UI

    kubectl -n health port-forward deployment/argo-ui 8001:8001

Then visit: http://127.0.0.1:8001

Prometheus Metrics

The Active-Monitor controller also exports metrics in Prometheus format, which can be used for notifications and alerting.

Prometheus metrics are available on :8080/metrics

    kubectl -n health port-forward deployment/activemonitor-controller 8080:8080

Then visit: http://localhost:8080/metrics

By default, Active-Monitor exports the following Prometheus metrics:

  • healthcheck_success_count - The total number of successful healthcheck resources
  • healthcheck_error_count - The total number of erred healthcheck resources
  • healthcheck_runtime_seconds - Time taken for the healthcheck’s workflow to complete
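An external system that scrapes the /metrics endpoint directly could pick out the healthcheck_ series with a few lines of Go. The sketch below parses Prometheus text exposition format using only the standard library. parseSamples is a hypothetical helper, the healthcheck_name label in the sample input is an assumption (the labels the controller actually attaches are not listed here), and the parser assumes sample lines carry no trailing timestamp.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples returns every sample whose series name starts with prefix,
// keyed by the full series (metric name plus labels). It is a simplified
// reader of the Prometheus text format: comment lines are skipped and the
// value is taken to be the last space-separated field.
func parseSamples(text, prefix string) (map[string]float64, error) {
	samples := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, prefix) {
			continue
		}
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			continue // no value on this line
		}
		value, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("parsing value in %q: %w", line, err)
		}
		samples[line[:idx]] = value
	}
	return samples, scanner.Err()
}

func main() {
	// Hypothetical scrape output; the label name is an assumption.
	scrape := `# TYPE healthcheck_success_count counter
healthcheck_success_count{healthcheck_name="inline-hello"} 7
healthcheck_error_count{healthcheck_name="inline-hello"} 0
go_goroutines 42
`
	samples, err := parseSamples(scrape, "healthcheck_")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for series, value := range samples {
		fmt.Printf("%s = %g\n", series, value)
	}
}
```

In a real deployment the scrape text would come from an HTTP GET against the port-forwarded endpoint; a Prometheus server with alerting rules is the more usual consumer.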

Active-Monitor also supports custom metrics. For this to work, your workflow should export a global parameter. The parameter will be programmatically available in the completed workflow object under: workflow.status.outputs.parameters.

The global output parameters should look like below:

    "{\"metrics\":
      [
        {\"name\": \"custom_total\", \"value\": 123, \"metrictype\": \"gauge\", \"help\": \"custom total\"},
        {\"name\": \"custom_metric\", \"value\": 12.3, \"metrictype\": \"gauge\", \"help\": \"custom metric\"}
      ]
    }"
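To make the shape of this payload concrete, the following Go sketch decodes the unescaped form of the parameter value into typed structs. The JSON keys come from the example above; the customMetric and customMetrics types and the parseCustomMetrics helper are hypothetical names chosen for illustration and are not taken from the controller source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// customMetric mirrors one entry of the "metrics" array shown above.
type customMetric struct {
	Name       string  `json:"name"`
	Value      float64 `json:"value"`
	MetricType string  `json:"metrictype"`
	Help       string  `json:"help"`
}

// customMetrics is the top-level object carried in the global output parameter.
type customMetrics struct {
	Metrics []customMetric `json:"metrics"`
}

// parseCustomMetrics decodes the unescaped JSON value of the output parameter.
func parseCustomMetrics(param string) ([]customMetric, error) {
	var payload customMetrics
	if err := json.Unmarshal([]byte(param), &payload); err != nil {
		return nil, fmt.Errorf("decoding custom metrics: %w", err)
	}
	return payload.Metrics, nil
}

func main() {
	param := `{"metrics": [
		{"name": "custom_total", "value": 123, "metrictype": "gauge", "help": "custom total"},
		{"name": "custom_metric", "value": 12.3, "metrictype": "gauge", "help": "custom metric"}
	]}`
	metrics, err := parseCustomMetrics(param)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range metrics {
		fmt.Printf("%s (%s) = %g: %s\n", m.Name, m.MetricType, m.Value, m.Help)
	}
}
```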

❤ Contributing ❤

Please see CONTRIBUTING.md.

To add a new example of a healthcheck and/or workflow:

Release Process

Please see RELEASE.

License

The Apache 2 license is used in this project. Details can be found in the LICENSE file.

Other Keiko Projects

  • Instance Manager
  • Kube Forensics
  • Addon Manager
  • Upgrade Manager
  • Minion Manager
  • Governor