Using the Node Exporter textfile collector for custom metrics

As an example, suppose we want to collect a server's CPU temperature, whose current value can be read from /sys/class/thermal/thermal_zone0/temp.

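Core concepts:

* Write custom metric data (for example temperature, application state, or business metrics) to a text file.
* Node Exporter reads these files each time it is scraped and exposes their contents alongside its built-in metrics, so the files must already be in a format Prometheus understands.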

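How it works:

* You write metrics to .prom files in a designated directory (for example /home/application/node_exporter/textfile_collector).
* Node Exporter's built-in textfile collector scans that directory on every scrape, parses the .prom files, and exposes the metrics they contain.
* Prometheus scrapes these metrics from Node Exporter's HTTP endpoint (port 9100 by default).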

Format requirements for .prom files:

Files must follow the Prometheus text-based exposition format.
Example:

# HELP temp_celsius CPU temperature in degrees Celsius
# TYPE temp_celsius gauge
temp_celsius{core="0"} 45.0
temp_celsius{core="1"} 46.5
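
The format above can be produced by a couple of small shell helpers. This is only a sketch: emit_header and emit_sample are hypothetical names, not part of Node Exporter. The header is kept separate from the samples because HELP and TYPE should appear only once per metric name, even when several labelled samples follow.

```shell
#!/bin/bash
# Hypothetical helpers for writing the Prometheus text exposition format.

# emit_header NAME TYPE HELP_TEXT: print the HELP and TYPE lines once per metric.
emit_header() {
  printf '# HELP %s %s\n' "$1" "$3"
  printf '# TYPE %s %s\n' "$1" "$2"
}

# emit_sample NAME LABELS VALUE: print one sample; LABELS may be empty.
emit_sample() {
  if [ -n "$2" ]; then
    printf '%s{%s} %s\n' "$1" "$2" "$3"
  else
    printf '%s %s\n' "$1" "$3"
  fi
}

# Reproduce the example above.
emit_header temp_celsius gauge "CPU temperature in degrees Celsius"
emit_sample temp_celsius 'core="0"' 45.0
emit_sample temp_celsius 'core="1"' 46.5
```

Redirecting the output of these calls into a file ending in .prom gives the textfile collector something it can parse.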

Configure the textfile collector

 mkdir -p /home/application/node_exporter/textfile_collector

Write the temperature collection script: vim /home/application/node_exporter/read_cpu_temp.sh

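#!/bin/bash
set -euo pipefail

# Path of the sysfs temperature file
TEMP_FILE="/sys/class/thermal/thermal_zone0/temp"
# Path of the output file read by the textfile collector
OUTPUT_FILE="/home/application/node_exporter/textfile_collector/cpu_temp.prom"
# Temporary file in the same directory; it does not end in .prom, so the collector ignores it
TMP_FILE="${OUTPUT_FILE}.$$"

# Read the raw value (unit: thousandths of a degree Celsius)
RAW_TEMP=$(cat "$TEMP_FILE")
# Convert to degrees Celsius (one decimal place); requires bc
CPU_TEMP=$(echo "scale=1; $RAW_TEMP / 1000" | bc)

# Write the metric in Prometheus format, then rename it into place so that
# Node Exporter never reads a half-written file
cat <<EOF > "$TMP_FILE"
# HELP cpu_temperature CPU temperature in degrees Celsius
# TYPE cpu_temperature gauge
cpu_temperature{unit="celsius"} ${CPU_TEMP}
EOF
mv "$TMP_FILE" "$OUTPUT_FILE"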
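
The script above reads only thermal_zone0, and which zone corresponds to the CPU depends on the hardware (each zone directory has a type file naming its sensor). As a rough sketch, assuming you want one sample per zone, the function below walks every thermal_zone* directory. The function name collect_thermal_zones and the zone label are assumptions made for illustration; the conversion uses awk so bc is not needed.

```shell
#!/bin/bash
# Sketch: emit one cpu_temperature sample per thermal zone.
# The base directory is a parameter (default /sys/class/thermal) so the
# function can be tried against a fake directory tree.
collect_thermal_zones() {
  local base="${1:-/sys/class/thermal}" zone raw
  echo '# HELP cpu_temperature CPU temperature in degrees Celsius'
  echo '# TYPE cpu_temperature gauge'
  for zone in "$base"/thermal_zone*; do
    [ -r "$zone/temp" ] || continue   # skip zones without a readable value
    raw=$(cat "$zone/temp")
    # sysfs reports thousandths of a degree Celsius; print one decimal place
    awk -v z="${zone##*/}" -v r="$raw" \
      'BEGIN { printf "cpu_temperature{unit=\"celsius\",zone=\"%s\"} %.1f\n", z, r / 1000 }'
  done
}
```

Calling collect_thermal_zones with no argument and redirecting its output to a temporary file, followed by mv into the textfile directory, would replace the single-zone heredoc.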

Grant execute permission

chmod +x /home/application/node_exporter/read_cpu_temp.sh

Configure the scheduled job

crontab -e
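# Add the following two entries. cron cannot schedule below one minute, so the second entry sleeps 30 seconds to give an update every 30 seconds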
* * * * * /home/application/node_exporter/read_cpu_temp.sh
* * * * * sleep 30; /home/application/node_exporter/read_cpu_temp.sh
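
To confirm that the cron entries are actually refreshing the file, you can check how old it is. The helper below is a minimal sketch: file_age_seconds is a hypothetical name, and it assumes GNU stat (the -c %Y option), which is what most Linux distributions ship.

```shell
#!/bin/bash
# Sketch: print how many seconds ago a file was last modified.
# Assumes GNU coreutils stat; returns non-zero if the file cannot be read.
file_age_seconds() {
  local now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$1") || return 1
  echo $(( now - mtime ))
}

# Example (path taken from the script above):
# file_age_seconds /home/application/node_exporter/textfile_collector/cpu_temp.prom
```

With both cron entries in place, the reported age should stay at roughly 30 seconds or less; a steadily growing value suggests the job is not running.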

Configure the node_exporter systemd service file (the textfile directory must be specified)

cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/home/application/node_exporter/node_exporter --collector.systemd --web.config.file=/home/application/node_exporter/web.config --collector.textfile --collector.textfile.directory=/home/application/node_exporter/textfile_collector
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
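
After the unit file is written, systemd has to reload it and the service has to be (re)started before the textfile metrics appear. The sketch below shows one way to check the result; has_metric is a hypothetical helper, and the plain-HTTP curl URL is an assumption, since the web.config file referenced in the unit may enable TLS or basic authentication.

```shell
#!/bin/bash
# Sketch: succeed if exposition text on stdin contains a sample for metric $1.
# Comment lines (# HELP / # TYPE) do not count as samples.
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Typical sequence on the server (run as root; adjust the URL to your web.config):
#   systemctl daemon-reload
#   systemctl enable --now node_exporter
#   curl -s http://localhost:9100/metrics | has_metric cpu_temperature && echo "metric exposed"
```

If the check fails, confirm that the .prom file exists in the configured directory and that the service was restarted after the flags changed.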

Configure Prometheus to scrape the data (add the textfile collector)

global:
  scrape_interval: 10s

scrape_configs:
# nj-test environment machine-18
  - job_name: 'nj-test-linux'
    params:
      collect[]:
        - textfile  # add the textfile collector
        - cpu
        - meminfo
        - diskstats
        - netdev
        - netstat
        - filefd
        - filesystem
        - xfs
        - systemd
        - uname
        - time
        - os
        - stat
        - loadavg
        - sockstat
        - netclass
    static_configs:
    - targets: ['172.16.10.10:9100']
      labels:
        srebro_project_name: "nj-test"
        nodename: "ops-team-edge-server-temperature-alert-172.16.10.10"

Configure the alerting rule

groups:
- name: temperature
  rules:
  - alert: HighCPUTemperature
    expr: cpu_temperature > 40  # alert when the temperature exceeds 40°C
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CPU temperature too high ({{ $value }}°C)"
      description: "CPU temperature on server {{ $labels.instance }} has been above 40°C for 5 minutes"