docker+prom+grafana+alertmanager

Date: 2020-12-28 10:42:06

Docker basics

docker run -it --name centos -v $HOME:/tmp -p 8080:8080 centos

docker inspect container   # show the configuration and start command of a running container

docker container prune    # remove exited containers
docker ps -a --no-trunc    # show the full start command of every container

Recovering a Dockerfile from an image

Method 1:
docker history --format '{{.CreatedBy}}' --no-trunc=true 0e0218889c33 | sed "s?/bin/sh\ -c\ \#(nop)\ ??g" | sed "s?/bin/sh\ -c?RUN?g" | tac
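The pipeline in Method 1 can be hard to read, so here is a minimal Python sketch of the same three transformations (strip the `#(nop)` prefix, turn shell steps into `RUN`, reverse the order). The function name `history_to_dockerfile` and the sample lines are made up for illustration; real `docker history` output may need extra cleanup.

```python
def history_to_dockerfile(created_by_lines):
    """Turn `docker history` CreatedBy lines (newest first) into Dockerfile-like lines."""
    result = []
    for line in created_by_lines:
        # Metadata steps are recorded as "/bin/sh -c #(nop) <INSTRUCTION>".
        line = line.replace("/bin/sh -c #(nop) ", "")
        # Anything else that went through the shell was a RUN step.
        line = line.replace("/bin/sh -c", "RUN")
        result.append(line.strip())
    # docker history lists the newest layer first, so reverse it (the `tac` step).
    return list(reversed(result))


# Hypothetical sample input, newest layer first.
sample = [
    '/bin/sh -c #(nop)  CMD ["bash"]',
    "/bin/sh -c apt-get update",
    "/bin/sh -c #(nop) ADD file:abc in / ",
]
print("\n".join(history_to_dockerfile(sample)))
```

The result is only an approximation of the original Dockerfile: build context files and multi-stage details cannot be recovered this way.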

Method 2:
apt-get install npm    # package manager for front-end (Node.js) packages
npm install npx
npx dockerfile-from-image node:8 > dockerfile   # reconstruct the Dockerfile from the image

Changing the start command of a container

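# Use the host network, name the container prometheus, and run it in the background with -d
# (with --net=host the -p mapping is ignored; the container binds host port 9090 directly)
docker run -d -p 9090:9090 --name prometheus --net=host prom/prometheus

# Copy the config file out of the container into /root
docker cp prometheus:/etc/prometheus/prometheus.yml /root/

# After editing it, remove the old container and start a new one with the file mounted
docker run -d -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml   -p 9090:9090 --name prometheus --net=host prom/prometheus

docker logs <container-id>   # view the container's logs

docker search java    # search Docker Hub for java images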

Container ports

9100 node-exporter
9090 prometheus
3000 grafana

Startup

Start node-exporter

docker run -d --name=node-exporter -p 9100:9100 prom/node-exporter

Start Grafana

Default login: admin / admin; configuration directory: /etc/grafana
docker run -d --name=grafana -p 3000:3000 grafana/grafana

Start Prometheus


# Copy the config file out of the container into /root
docker cp prometheus:/etc/prometheus/prometheus.yml /root/

# After editing it, mount it back in
docker run -d -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-p 9090:9090 --name prometheus --net=host prom/prometheus


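# Starting containers automatically
- Add the following flag to docker run so that the container is restarted automatically whenever the Docker service restarts:

docker run --restart=always
If the container is already running, use:
docker update --restart=always <CONTAINER ID>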


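# Configuration files
## Prometheus configuration
Configuration reference
```yaml

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # Labels added to any time series or alert when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

# rule_files specifies a list of globs.
# Rules and alerts are read from all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
```
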
Actual configuration

root@ubuntu:~# cat prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

  - job_name: prometheus
    static_configs:
    - targets: ['192.168.191.128:9090']
      labels:
        instance: prometheus
  - job_name: 'consul'   # targets discovered through Consul
    consul_sd_configs:
      - server: '192.168.191.128:8500'
        services: [ ]
  - job_name: node-exporter
    static_configs:
    - targets: ['192.168.191.128:9100']
      labels:
        instance: node-exporter

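Dynamic configuration with file_sd_configs:

Targets are added or removed by editing /usr/local/prometheus/*.json; Prometheus picks up the changes dynamically, without a restart.


Append the following to the end of prometheus.yml:
  - job_name: 'node-discovery'   # name of the discovery job
    file_sd_configs:             # service discovery mechanism to use
      - files:
        - /usr/local/prometheus/*.json   # files to match

Add a file like the following under /usr/local/prometheus/ (the directory must be mounted into the container at startup):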
[
{
"targets": [ "10.10.2.99:9100"],
"labels": {
"job": "linux-bj",
"idc": "bj-jiuxianqiao"
}
},
{
"targets": [ "10.10.2.62:9100","10.10.1.35:9100"],
"labels": {
"job": "linux-gx",
"idc": "gz-daxuecheng"
}
}
]
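Since the targets file is plain JSON, it can be generated rather than edited by hand. Below is a minimal sketch using only the standard library; the helper name `build_file_sd` is hypothetical, and the groups simply mirror the example above.

```python
import json


def build_file_sd(groups):
    """groups: list of (targets, labels) pairs -> JSON text in file_sd format."""
    entries = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(entries, indent=2)


groups = [
    (["10.10.2.99:9100"], {"job": "linux-bj", "idc": "bj-jiuxianqiao"}),
    (["10.10.2.62:9100", "10.10.1.35:9100"], {"job": "linux-gx", "idc": "gz-daxuecheng"}),
]
text = build_file_sd(groups)
print(text)
```

When writing the output to disk, it is probably safest to write to a temporary file and rename it into place, so Prometheus never reads a half-written file.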


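## rule_files configuration
- Alerting rules are configured by listing the rule files (paths or globs) in prometheus.yml.
- Prometheus evaluates these rules and pushes the resulting alerts to Alertmanager.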

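alertmanager_config configuration

Specifies the Alertmanager instances that the Prometheus server sends alerts to,
and provides parameters that configure how to communicate with them.
Alertmanagers can be listed statically or found through service discovery.

relabel_configs allow selecting Alertmanagers from the discovered entities and provide advanced modifications to the API path used, which is exposed through the __alerts_path__ label.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:  # statically listed Alertmanagers; service discovery can be used instead
    - targets:
      # - alertmanager:9093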

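remote_write

Specifies the write API URL of the remote storage backend.

remote_read

Specifies the read API URL of the remote storage backend.

relabel_config

Relabeling is a powerful tool for dynamically rewriting the label set of a target before it is scraped. Each scrape configuration can contain multiple relabeling steps. They are applied to the label set of each target in the order in which they appear in the configuration file. The resulting labels can be inspected in the Prometheus web console.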
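To make the relabeling idea concrete, here is a simplified Python model of a single relabel step. It only covers the `replace`, `keep` and `drop` actions, and the function name `relabel` and its argument names are assumptions chosen for this sketch, not Prometheus code. What it does follow is the documented behavior that source label values are joined with a separator (`;` by default) and that the regex must match the whole joined value.

```python
import re


def relabel(labels, source_labels, regex, target_label=None,
            replacement="$1", separator=";", action="replace"):
    """Apply one simplified relabel step; returns new labels, or None if the target is dropped."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    if action == "keep":
        return dict(labels) if match else None
    if action == "drop":
        return None if match else dict(labels)
    # action == "replace": labels stay unchanged when the regex does not match
    if not match:
        return dict(labels)
    # Translate $1-style references into Python's \g<1> form before expanding.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


target = {"__address__": "10.10.2.62:9100", "job": "node-exporter"}
with_host = relabel(target, ["__address__"], r"(.+):\d+", "host")
print(with_host)
```

Chaining several steps is then just a loop that feeds each result into the next call and stops as soon as a step returns `None`.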

Usage

Using Prometheus

node_cpu_seconds_total{cpu="0"}

Using Grafana

Grafana dashboards address

Prometheus exporters

process-exporter: process monitoring

docker search process-exporter   # find the highest-ranked image
docker pull opvizorpa/process-exporter   # download it
docker inspect 9ec6749205fc      # inspect the image's startup configuration: port 9256
apt-get install nginx         # install nginx as a test process

docker run -d --rm -p 9256:9256 --privileged -v /proc:/host/proc -v `pwd`:/config ncabatoff/process-exporter --procfs /host/proc -config.path /config/filename.yml

# Local installation
[root@host-10-10-2-62 ~]# wget https://github.com/ncabatoff/process-exporter/releases/download/v0.5.0/process-exporter-0.5.0.linux-amd64.tar.gz
[root@host-10-10-2-62 ~]# tar -xf process-exporter-0.5.0.linux-amd64.tar.gz -C /usr/local/

# Start on boot
[root@host-10-10-2-62 process-exporter-0.5.0.linux-amd64]# cat /etc/systemd/system/process-exporter.service 
[Unit]
Description=process exporter
Documentation=process exporter

[Service]
ExecStart=/usr/local/process-exporter-0.5.0.linux-amd64/process-exporter -config.path /usr/local/process-exporter-0.5.0.linux-amd64/process-name.yaml 

[Install]
WantedBy=multi-user.target

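# Template variables available in the config file for naming process groups:
{{.Comm}} contains the basename of the original executable, i.e. the 2nd field in /proc/<pid>/stat
{{.ExeBase}} contains the basename of the executable
{{.ExeFull}} contains the fully qualified path of the executable
{{.Matches}} is a map containing all the matches produced by applying the cmdline regular expressions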

# Config file for monitoring nginx
root@ubuntu:~# cat process-name.yaml 
process_names:
  - name: "{{.Matches}}"
    cmdline:
    - 'nginx'

# Monitor all processes
[root@host-10-10-2-62 process-exporter-0.5.0.linux-amd64]# cat process-name.yaml 
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'

# Add the scrape job on the Prometheus server
  - job_name: 'process'
    static_configs: 
      - targets: ['10.10.2.62:9256']

# Process queries
Total number of processes: sum(namedprocess_namegroup_states)

Number of zombie processes: sum(namedprocess_namegroup_states{state="Zombie"})

pushgateway

Configuration

Pushgateway listens on port 9091

docker pull prom/pushgateway

docker run -d --name=pushgateway -p 9091:9091 prom/pushgateway

Open http://<ip>:9091 in a browser

Add the target to prometheus.yml:
  - job_name: pushgateway
    static_configs:
      - targets: ['192.168.191.128:9091']
        labels:
          instance: pushgateway

Push URL format: http://<ip>:9091/metrics/job/<JOBNAME>{/<LABEL_NAME>/<LABEL_VALUE>}
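The URL pattern is easy to get wrong by hand, so here is a small sketch that assembles it from a job name and optional grouping labels. The helper name `pushgateway_url` is hypothetical. Note one known limitation: label values containing `/` need the Pushgateway's special base64 form, which this sketch does not handle.

```python
from urllib.parse import quote


def pushgateway_url(base, job, grouping=None):
    """Build <base>/metrics/job/<JOBNAME>{/<LABEL_NAME>/<LABEL_VALUE>}."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in (grouping or {}).items():
        # Each grouping label adds a name segment followed by a value segment.
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


url = pushgateway_url(
    "http://192.168.191.128:9091", "some_job", {"instance": "some_instance"}
)
print(url)
```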

Pushing test data

echo "some_metric 3.14" | curl --data-binary @- http://192.168.191.128:9091/metrics/job/some_job

# Push several metrics at once
cat <<EOF | curl --data-binary @- http://192.168.2.14:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
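The body sent in the multi-metric push is plain text in the Prometheus exposition format. As an illustration, the following sketch renders such a body from a list of dictionaries; the function name `render_metrics` and the dictionary keys are assumptions of this example. It does not escape special characters in label values, and the body ends with a newline because the Pushgateway expects the last line to be terminated.

```python
def render_metrics(metrics):
    """metrics: list of dicts with name, type, value and optional help/labels."""
    lines = []
    for m in metrics:
        if "help" in m:
            lines.append(f"# HELP {m['name']} {m['help']}")
        lines.append(f"# TYPE {m['name']} {m['type']}")
        labels = m.get("labels") or {}
        label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{m['name']}{selector} {m['value']}")
    # The final line must be newline-terminated.
    return "\n".join(lines) + "\n"


body = render_metrics([
    {"name": "some_metric", "type": "counter", "value": 42, "labels": {"label": "val1"}},
    {"name": "another_metric", "type": "gauge", "help": "Just an example.", "value": 2398.283},
])
print(body, end="")
```

The resulting string could be sent with any HTTP client as the request body of a POST or PUT to the push URL.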

As the examples show, data in Pushgateway is usually grouped by job and instance, so both are normally included in the push URL.

Count the current TCP connections and push the count to Pushgateway

[root@host-10-10-2-109 ~]# cat count_netstat_esat_connections.sh 
#!/bin/bash
instance_name=`hostname -f | cut -d'.' -f1`  # short host name, used as the instance label
label="count_netstat_established_connections"  # metric name
count_netstat_established_connections=`netstat -an | grep -i ESTABLISHED | wc -l`  # command that collects the value
echo "$label: $count_netstat_established_connections"
echo "$label  $count_netstat_established_connections" | curl --data-binary @- http://10.10.2.109:9091/metrics/job/pushgateway_test/instance/$instance_name

In the pushed line, $label is the metric name and is followed by its value. In the URL, the segment after job is the job name (pushgateway_test), and $instance_name is the host name.

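Look closely and something seems off: the metric that was just pushed carries exported_job="pushgateway_test", while job shows job="pushgateway". The cause is the honor_labels setting in the Prometheus scrape configuration, which defaults to false: when a scraped label conflicts with a label attached by Prometheus, the scraped label is renamed with an exported_ prefix. Setting honor_labels: true keeps the pushed labels as they are: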

  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['10.10.2.109:9091']
        labels:
          instance: pushgateway
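The effect of `honor_labels` can be modeled in a few lines. This is a simplified sketch of the conflict handling, not Prometheus source code; the function name `merge_labels` and the sample label values are illustrative.

```python
def merge_labels(scraped, target, honor_labels=False):
    """Combine labels exposed by the scraped endpoint with labels attached by the scrape config."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            merged[name] = value
        elif not honor_labels:
            # Conflict: keep the scraped value under an exported_ prefix
            # and let the label from the scrape config win.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # With honor_labels=True the scraped value is kept untouched.
    return merged


pushed = {"job": "pushgateway_test", "instance": "host-10-10-2-109"}
attached = {"job": "pushgateway", "instance": "pushgateway"}
print(merge_labels(pushed, attached, honor_labels=False))
print(merge_labels(pushed, attached, honor_labels=True))
```

The first call reproduces the `exported_job` situation, while the second keeps the job and instance that were pushed.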

Pushing data to Pushgateway from Python:

1. Install the prometheus_client module:
apt install python-pip
pip install prometheus_client

2. A simple example:

[root@host-10-10-2-109 ~]# cat client.py 
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime', 'Last time a batch job successfully finished', registry=registry)
g.set_to_current_time()
push_to_gateway('localhost:9091', job='batchA', registry=registry)
# Run the script
[root@host-10-10-2-109 ~]# python client.py

3. Check the result by querying job_last_success_unixtime in Prometheus.

4. A small variation that pushes ping data:

[root@host-10-10-2-109 ~]# cat client_ping.py 
#!/usr/bin/python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
g = Gauge('ping', 'pingtime', ['dst_ip', 'city'], registry=registry)  # Gauge(metric_name, help, label_names, registry=registry)
g.labels('192.168.1.10', 'shenzhen').set(42.2)    # choose the label values, then set the sample value
g.labels('192.168.1.11', 'shenzhen').set(41.2)
g.labels('192.168.1.12', 'shenzhen').set(32.1)
push_to_gateway('localhost:9091', job='ping_status', registry=registry)
[root@host-10-10-2-109 ~]# python client_ping.py

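PromQL

Range vectors

A range vector selector takes a duration in square brackets. Supported time units:

s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years

prometheus_http_requests_total{handler="/api/v1/query"}[5m]

Time offsets:

prometheus_http_requests_total{} offset 5m   # instant vector as of 5 minutes ago
prometheus_http_requests_total{}[1d] offset 1d   # range vector with yesterday's samples (one day, shifted back one day)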
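As a quick sanity check on the units, this sketch converts a single-unit duration such as `5m` or `1d` into seconds. The helper name `duration_seconds` is made up, and compound durations (for example `1h30m`) are deliberately not handled. Prometheus treats a week as 7 days and a year as 365 days.

```python
import re

UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}


def duration_seconds(text):
    """Convert a single-unit duration like '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(duration_seconds("5m"), duration_seconds("1d"))
```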

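Scalar floats

Scalar float values can be written directly in the form [-](digits)[.(digits)].

Example: 20
A scalar is a single number with no time series attached. Note that an expression such as count(http_requests_total) still returns an instant vector. The built-in scalar() function converts an instant vector that contains a single element into a scalar.

Strings

"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`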

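PromQL operators

Besides querying and filtering time series, PromQL supports a rich set of operators for processing them further: arithmetic operators, logical (set) operators, comparison operators, and so on.

Arithmetic operators

PromQL supports the following arithmetic operators:

  + (addition)
  - (subtraction)
  * (multiplication)
  / (division)
  % (modulo)
  ^ (power)

For example, a host's total memory is reported in bytes. To convert it to GiB:

node_memory_MemTotal_bytes / 1024 / 1024 / 1024

The result is an instant vector. Arithmetic also works between two instant vectors, for example:

node_disk_written_bytes_total + node_disk_read_bytes_total   # bytes written plus bytes read, per disk

The operation is applied between series whose labels match, so the result is computed separately for each disk, for example vda and vdb.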
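The label matching behind vector arithmetic can be pictured as a dictionary join. In this sketch an instant vector is modeled as a mapping from a label set (without the metric name) to a value; the helpers `key` and `vector_add` and the sample numbers are invented for illustration, and only default one-to-one matching is covered.

```python
def key(**labels):
    """Hashable label set used to identify a series."""
    return frozenset(labels.items())


def vector_add(left, right):
    """Add two instant vectors; series without a partner on the other side are dropped."""
    return {
        labels: value + right[labels]
        for labels, value in left.items()
        if labels in right
    }


written = {key(device="vda"): 100.0, key(device="vdb"): 40.0}
read = {key(device="vda"): 50.0, key(device="vdc"): 7.0}
total = vector_add(written, read)
print(total)
```

Only `vda` appears in the result, because `vdb` and `vdc` each exist on just one side.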

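Comparison (boolean) operators

Label matchers select time series by their dimensions. Comparison operators instead filter time series by the value of their samples, which is why they appear so often in alerting rules.

Prometheus supports the following comparison operators:


  == (equal)
  != (not equal)
  > (greater than)
  < (less than)
  >= (greater than or equal)
  <= (less than or equal)

By default these operators act as filters: only the series for which the comparison is true are returned. With the bool modifier, every series is returned with the value 1 or 0 instead.

For example, arithmetic operators make it easy to compute the current memory usage ratio of every host:

(node_memory_bytes_total - node_memory_free_bytes_total) / node_memory_bytes_total

In an alerting rule we want only the hosts whose memory usage exceeds 95%, which a comparison expresses directly:

(node_memory_bytes_total - node_memory_free_bytes_total) / node_memory_bytes_total > 0.95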
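The difference between filtering and the `bool` modifier is easiest to see side by side. In this sketch a vector is simplified to a mapping from instance name to value; the function name `greater_than` and the sample usage ratios are hypothetical.

```python
def greater_than(vector, threshold, bool_modifier=False):
    """Model of `vector > threshold` and `vector > bool threshold`."""
    if bool_modifier:
        # Every series is kept; the value becomes 1 or 0.
        return {k: (1.0 if v > threshold else 0.0) for k, v in vector.items()}
    # Default behavior: drop series for which the comparison is false.
    return {k: v for k, v in vector.items() if v > threshold}


usage = {"host-a": 0.97, "host-b": 0.42}
print(greater_than(usage, 0.95))
print(greater_than(usage, 0.95, bool_modifier=True))
```

Alerting rules normally rely on the filtering form, since an empty result simply means nothing is firing.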

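Set operators

An instant vector expression yields a set of time series, called an instant vector. Set operators perform set operations between two instant vectors. Prometheus currently supports the following set operators:

and (intersection)
or (union)
unless (complement)

vector1 and vector2 produces a new vector made up of the elements of vector1 that have an exactly matching element in vector2.

vector1 or vector2 produces a new vector that contains all samples of vector1, plus the samples of vector2 that have no match in vector1.

vector1 unless vector2 produces a new vector made up of the elements of vector1 that have no match in vector2.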
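The three set operators map naturally onto dictionary operations. The following sketch models a vector as a mapping from a series key to a value, with matching reduced to key equality; the function names and sample data are illustrative only.

```python
def vec_and(v1, v2):
    """Elements of v1 that have a match in v2."""
    return {k: v for k, v in v1.items() if k in v2}


def vec_or(v1, v2):
    """All of v1, plus the elements of v2 that have no match in v1."""
    result = dict(v1)
    for k, v in v2.items():
        result.setdefault(k, v)
    return result


def vec_unless(v1, v2):
    """Elements of v1 that have no match in v2."""
    return {k: v for k, v in v1.items() if k not in v2}


v1 = {"a": 1.0, "b": 2.0}
v2 = {"b": 20.0, "c": 30.0}
print(vec_and(v1, v2), vec_or(v1, v2), vec_unless(v1, v2))
```

Note that `and` and `or` take values from the left-hand side whenever both sides contain the same series.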

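Operator precedence

In Prometheus, binary operators have the following precedence, from highest to lowest:

  ^
  *, /, %
  +, -
  ==, !=, <=, <, >=, >
  and, unless
  or

Operators with the same precedence are left-associative. For example, 2 * 3 % 2 is equivalent to (2 * 3) % 2. The exception is ^, which is right-associative: 2 ^ 3 ^ 2 is equivalent to 2 ^ (3 ^ 2).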
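Python happens to use the same associativity for these particular operators (with `**` in place of `^`), so the two examples can be checked with ordinary arithmetic. This is only an analogy for building intuition, not a PromQL evaluator.

```python
# * and % share one precedence level and group from the left.
left_assoc = 2 * 3 % 2
explicit_left = (2 * 3) % 2
other_grouping = 2 * (3 % 2)

# Exponentiation groups from the right.
power = 2 ** 3 ** 2
explicit_right = 2 ** (3 ** 2)
left_power = (2 ** 3) ** 2

print(left_assoc, other_grouping, power, left_power)
```

The differing results of the alternative groupings show why associativity matters whenever operators of equal precedence are chained.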

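Aggregation operators

Prometheus also provides the following built-in aggregation operators. They operate on instant vectors and aggregate the samples returned by an instant vector expression into a new set of time series.

  sum (sum)
  min (minimum)
  max (maximum)
  avg (average)
  stddev (standard deviation)
  stdvar (variance)
  count (number of elements)
  count_values (number of elements with the same value)
  bottomk (smallest k elements by sample value)
  topk (largest k elements by sample value)
  quantile (quantile over the values)

The syntax of an aggregation is:

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

Only count_values, quantile, topk and bottomk take a parameter.

without removes the listed labels from the result and keeps all others. by does the opposite: only the listed labels are kept and the rest are dropped. Together they let you aggregate along the label dimensions of the samples.

sum(http_requests_total) without (instance)
is equivalent to the following, assuming the series carry only the labels code, handler, job, method and instance:
sum(http_requests_total) by (code,handler,job,method)

To compute the total number of HTTP requests for the whole application, simply use:

sum(http_requests_total)
The average value:
avg(http_requests_total)
The 3 largest values:
topk(3, http_requests_total)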
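The relationship between `by` and `without` can be sketched as two ways of computing a grouping key. In this model each sample is a `(labels, value)` pair; the function names `sum_by` and `sum_without` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum values, grouping on only the listed labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        group = frozenset((n, v) for n, v in labels.items() if n in by)
        totals[group] += value
    return dict(totals)


def sum_without(samples, without):
    """Sum values, grouping on every label except the listed ones."""
    totals = defaultdict(float)
    for labels, value in samples:
        group = frozenset((n, v) for n, v in labels.items() if n not in without)
        totals[group] += value
    return dict(totals)


samples = [
    ({"instance": "a", "code": "200", "job": "api"}, 10.0),
    ({"instance": "b", "code": "200", "job": "api"}, 5.0),
    ({"instance": "a", "code": "500", "job": "api"}, 1.0),
]
print(sum_without(samples, ["instance"]))
print(sum_by(samples, ["code", "job"]))
```

Both calls print the same groups, because `instance` is the only label left out of the `by` list.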

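Common functions

Prometheus provides many functions for the different data types. A useful rule of thumb: when working with a counter, wrap it in rate() or increase() before doing anything else. Some commonly used functions are described below.

increase():

Used with counters. It takes the first and last samples in the range vector and returns the increase between them. Dividing the result by the window length in seconds gives the average per-second rate over that window:

increase(node_cpu_seconds_total[2m]) / 120   # average per-second CPU time used over the last two minutes

rate():

Used with counters. It returns the average per-second increase of the counter over the time window.

rate(node_cpu_seconds_total[2m])   # average per-second rate of increase over the window, computed directly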
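To show how the two functions relate, here is a simplified model that works on a list of `(timestamp, value)` samples. It handles counter resets but, unlike Prometheus, it does not extrapolate to the edges of the time window, so real query results will differ slightly. The function names mirror the PromQL ones only for readability.

```python
def increase(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count the new value from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span


steady = [(0, 100.0), (60, 130.0), (120, 160.0)]
with_reset = [(0, 100.0), (60, 130.0), (120, 20.0)]
print(increase(steady), rate(steady), increase(with_reset))
```

In this model `rate` is just `increase` divided by the covered time span, which matches the `increase(...) / 120` idiom for a two-minute window.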

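sum():

In practice most CPUs have multiple cores, and node_cpu_seconds_total reports each core separately. We usually care about the CPU as a whole rather than about individual cores. sum() adds the series up into a single value, but it adds up the data of all machines too, so by (instance) or by (cluster_name) is needed to get the CPU data of a single server or a group of servers. The formula above then becomes:

sum( increase(node_cpu_seconds_total[1m]) )   # compute each series' increase first, then add them up

topk(): picks the N highest values out of a large amount of data, where N is configurable. For example, when monitoring 320 CPUs across 100 servers, it shows the ones with the highest current load, which is useful for alerting:

topk(3, http_requests_total)   # the 3 largest values

predict_linear(): estimates how fast a curve is changing, which gives it some predictive power. For example, if the free disk space has dropped sharply during the last hour, the disk may soon be full. This function uses the data of the last hour to predict the state a few hours ahead, so that an alert can fire in advance:

predict_linear( node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600 ) < 0   # fires if the predicted free space 4 hours from now is below zero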
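The prediction is based on a simple linear regression over the samples in the window. The sketch below fits a least-squares line and evaluates it a given number of seconds after the last sample; evaluating relative to the last sample rather than the query time is a simplification of this example, and the sample values are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, evaluated after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical free-space readings (in MB) taken every 10 minutes.
free_space = [(0, 1000.0), (600, 850.0), (1200, 700.0), (1800, 550.0)]
predicted = predict_linear(free_space, 4 * 3600)
print(predicted)
```

A negative prediction here corresponds to the `< 0` condition in the alert expression: at the current rate of decline, the disk would run out of space within the look-ahead period.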

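Breaking down the CPU utilization expression

1. Find the metric first. To look at CPU usage, the metric to use is node_cpu_seconds_total.

2. Starting from that metric, select the idle CPU time and the total CPU time, using {} to filter:

node_cpu_seconds_total{mode='idle'}   # idle CPU time
node_cpu_seconds_total   # no matcher means all modes

3. Use increase() to get the growth over 1 minute. At this point there is one value per CPU core:

increase(node_cpu_seconds_total{mode='idle'}[1m])

4. Use sum() to add the per-core values up into a single number:

sum( increase(node_cpu_seconds_total{mode='idle'}[1m]) )

5. sum() adds up the cores, but it also adds up all servers, so every server ends up sharing one combined value. One more step is needed: the by (instance) clause, which splits the summed value by the given label; instance identifies the machine. Without by (instance), the desired instance would have to be named explicitly inside {}.

sum( increase(node_cpu_seconds_total{mode='idle'}[1m]) ) by (instance)   # idle CPU time accumulated over one minute, per instance

6. Fraction of CPU time spent idle:

sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total[1m])) by(instance)

7. CPU utilization (the subtraction is parenthesized so that it happens before the multiplication by 100):

(1 - (sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total[1m])) by(instance))) * 100

The computed value can occasionally be negative, and many Grafana templates behave the same way: on multi-core, lightly loaded hosts, small differences between the values are amplified, which can push the result below zero.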
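A worked example with concrete numbers may help. Assume a hypothetical 4-core host observed for 60 seconds: all modes together accumulate 240 CPU seconds, of which 180 are idle. The function name `cpu_utilization` and the instance name are made up for this sketch.

```python
def cpu_utilization(idle_increase, total_increase):
    """Both arguments map instance -> summed increase in CPU seconds over the same window."""
    return {
        instance: (1 - idle_increase[instance] / total) * 100
        for instance, total in total_increase.items()
        if instance in idle_increase and total > 0
    }


idle = {"host-a": 180.0}    # 4 cores * 60 s * 75% idle
total = {"host-a": 240.0}   # 4 cores * 60 s
utilization = cpu_utilization(idle, total)
print(utilization)

# Operator precedence matters: without the parentheses the
# multiplication by 100 is applied to the idle ratio first.
unparenthesized = 1 - idle["host-a"] / total["host-a"] * 100
print(unparenthesized)
```

The second value is far below zero, which shows how a missing pair of parentheses alone can turn a utilization figure negative.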

Some commonly used expressions

1. CPU utilization:

100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

2. Memory utilization (free, cached and buffer memory count as available):

100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100

3. Disk utilization:

100 - (node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs|rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs|rootfs"} * 100)

4. Percentage of CPU time spent in iowait, per host:
avg(irate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100

5. 1-minute load average:
sum by (instance) (node_load1)

6. Network receive traffic:
avg(irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) by (environment,instance,device)


Original article: https://blog.51cto.com/14223698/2574538
