Metric pattern analysis for troubleshooting

When I troubleshoot site issues, I need to check many metrics, such as CPU, memory, and application metrics. In general, I want to find out the following automatically, without a person checking every metric one by one:

  1. How many metrics have spikes during a given time window.
  2. Whether metric X follows the same pattern as metric Y.
  3. Whether metric X shows any periodicity.

For items 1 and 2, I think I can get there by calculating some kind of change rate. For item 3, I have no idea so far.
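The change-rate idea for items 1 and 2 could be sketched roughly as follows. This is a minimal illustration only, not a vetted method: the function names, the 50% spike threshold, and the 0.8 correlation cutoff are all assumptions chosen for the example, and Pearson correlation of change rates is just one possible way to define "same pattern".

```python
from math import sqrt

def change_rates(series):
    """Relative change between consecutive samples (0.0 when the previous sample is 0)."""
    return [
        (cur - prev) / abs(prev) if prev != 0 else 0.0
        for prev, cur in zip(series, series[1:])
    ]

def has_spike(series, threshold=0.5):
    """Item 1: True if any sample moves by more than `threshold` (assumed: 50%) in one step."""
    return any(abs(rate) > threshold for rate in change_rates(series))

def spiking_metrics(metrics, threshold=0.5):
    """Names of all metrics in a {name: samples} dict that contain a spike."""
    return [name for name, series in metrics.items() if has_spike(series, threshold)]

def pearson(xs, ys):
    """Pearson correlation of two sequences, truncated to the shorter length."""
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no pattern to compare
    return cov / sqrt(var_x * var_y)

def same_pattern(xs, ys, min_corr=0.8):
    """Item 2: True if the change rates of both metrics are strongly correlated."""
    return pearson(change_rates(xs), change_rates(ys)) >= min_corr

# Made-up sample data, restricted to the window between "time 1" and "time 2".
metrics = {
    "cpu":     [10, 10, 11, 10, 50, 52, 11, 10],
    "latency": [100, 101, 100, 102, 480, 500, 105, 100],
    "memory":  [40, 41, 40, 41, 40, 41, 40, 41],
}
print(spiking_metrics(metrics))
print(same_pattern(metrics["cpu"], metrics["latency"]))
print(same_pattern(metrics["cpu"], metrics["memory"]))
```

A fixed relative threshold is easy to reason about but is sensitive to metrics that hover near zero, so a real implementation would probably need per-metric tuning or a more robust statistic.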

My questions are:

  1. Is there an existing library I can use for this? Go, Java, or Python would all be fine.
  2. Do you have any suggestions for requirement 3?
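For requirement 3, one commonly used technique is autocorrelation: compare the series with a copy of itself shifted by some lag, and look for a lag where the two line up well. The sketch below is only an illustration of that idea, not something taken from the question; the function names and the 0.5 cutoff are assumptions, and it expects evenly spaced samples.

```python
def autocorrelation(series, lag):
    """Correlation of a series with itself shifted by `lag` samples."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        return 0.0  # a constant series has no periodicity
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def dominant_period(series, min_corr=0.5):
    """Lag (in samples) with the strongest autocorrelation above `min_corr`, else None."""
    best_lag, best_corr = None, min_corr
    for lag in range(2, len(series) // 2 + 1):
        corr = autocorrelation(series, lag)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Made-up sample: a pattern that repeats every 4 samples.
periodic = [0, 1, 0, -1] * 10
print(dominant_period(periodic))
print(dominant_period([5] * 40))
```

Note that a series with a strong trend also shows high autocorrelation at small lags, so it is usually worth removing the trend (for example by differencing) before applying a check like this.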

=====

More background here:

I already have Prometheus (a monitoring system) set up, but my problem is that I want to analyze these metrics automatically. For example: User input: here are 1000 time series, I had an issue between time 1 and time 2, and I see that metric X spiked during that time. Program output: items 1, 2, and 3 above.
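To feed such a program, the samples for the window between time 1 and time 2 could be pulled from the Prometheus HTTP API via its `/api/v1/query_range` endpoint. The sketch below is an assumption about how that might look: the helper names, the base URL, and the way label sets are turned into dictionary keys are all hypothetical, and only the endpoint, its `query`/`start`/`end`/`step` parameters, and the matrix response shape come from Prometheus itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def query_range_url(base_url, query, start, end, step):
    """Build a query_range URL; start/end are Unix timestamps, step is e.g. '15s'."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"

def parse_matrix(payload):
    """Turn a query_range response into {label string: [float samples]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        # Hypothetical key format: sorted "label=value" pairs joined by commas.
        key = ",".join(f"{k}={v}" for k, v in sorted(item["metric"].items()))
        # Each sample is [timestamp, "value as string"].
        series[key] = [float(value) for _ts, value in item["values"]]
    return series

def fetch_series(base_url, query, start, end, step="15s"):
    """Fetch and parse one range query (requires a reachable Prometheus server)."""
    with urlopen(query_range_url(base_url, query, start, end, step)) as resp:
        return parse_matrix(json.load(resp))

# Parsing a hand-written sample response, so no server is needed here.
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "up", "job": "node"},
                "values": [[1700000000, "1"], [1700000015, "0"]],
            }
        ],
    },
}
print(parse_matrix(sample))
```

The resulting dictionary of sample lists is the kind of input the analysis for items 1 to 3 would then work on.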

I am just having trouble implementing the program.

I think you need a monitoring and analytics service such as:

DataDog: https://www.datadoghq.com/

Librato: https://www.librato.com/

etc...

Or a self-hosted infrastructure running Graphite (https://github.com/hopsoft/docker-graphite-statsd) or similar tools.