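Observability with Alloy + Prometheus + Loki + Tempo + Alertmanager + Grafana
I. The State of Cloud-Native Observability
As cloud-native architectures mature, the large-scale adoption of microservices, containers, and Serverless poses unprecedented challenges for system observability. Mainstream observability stacks today generally follow a "three pillars" architecture:
- Metrics (Prometheus)
- Logs (Loki/ELK)
- Distributed tracing (SkyWalking/Jaeger/PinPoint)
Because each component ships its own agent, real deployments suffer from "agent sprawl". According to the CNCF 2023 annual survey, 78% of responding companies run three or more monitoring agents side by side in production, and 35% of those face serious resource contention.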
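II. Why OneAgent Matters
Adopting a OneAgent design matters for a technical overhaul in three ways. First, it simplifies the monitoring architecture and reduces the number of components to manage and maintain. Second, a single data collection entry point makes it easier to process and analyze data from different sources, which speeds up troubleshooting and improves system stability. Third, from a cost perspective, fewer agents mean lower system overhead and better overall efficiency.
III. Multi-Agent vs. OneAgent Architecture
- Multi-Agent:
  - Complex to manage: every component has its own agent, which adds to the operations burden.
  - Wasted resources: several agents running at once consume more system resources.
  - Data silos: separate agents can prevent data from being combined effectively, which hurts analysis.
- OneAgent:
  - Simpler management: only one agent's state and configuration need attention.
  - Better resource usage: fewer agent instances run on the system, so less compute is needed.
  - Data consistency: a unified data format and storage approach simplifies later processing and analysis.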
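IV. A OneAgent Solution Based on Grafana Alloy
1. Introduction
Alloy is the telemetry collection tool that Grafana Labs now promotes as its primary collector. Grafana Alloy is an open-source OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles. It provides native pipelines and collection capabilities for OTel, Prometheus, Pyroscope, Loki, and other services. Within the Grafana ecosystem, Alloy replaces the two earlier collectors, Promtail and Grafana Agent. It offers powerful and flexible component-based configuration, handles more data formats, more collection sources, and more cross-platform data transformations than the original Agent, and supports custom components, data filtering, pipeline forwarding, local configuration, and remote configuration. According to the official documentation, Alloy is also compatible with a wider range of platforms and protocols, more secure, and easier to configure and debug.
2. Deployment
Pull the Alloy image grafana/alloy:v1.6.1.
Deploy in a Kubernetes environment: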
kind: ConfigMap
apiVersion: v1
metadata:
  name: alloy
  namespace: alloy
  labels:
    app.kubernetes.io/component: config
    app.kubernetes.io/instance: alloy
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: alloy
    app.kubernetes.io/part-of: alloy
    app.kubernetes.io/version: v1.6.1
    helm.sh/chart: alloy-0.11.0
  annotations:
    meta.helm.sh/release-name: alloy
    meta.helm.sh/release-namespace: alloy
data:
  config.alloy: |
    logging {
      level    = "debug"
      format   = "logfmt"
      write_to = [loki.write.grafana_loki.receiver]
    }

    livedebugging {
      enabled = true
    }

    // Write to Loki
    loki.write "grafana_loki" {
      endpoint {
        url = "http://10.244.10.219:32621/loki/api/v1/push"
      }
    }

    loki.relabel "add_static_label" {
      rule {
        source_labels = ["filename"]
        regex         = `(.*\/)([^/]+)\.log$`
        replacement   = "$2"
        target_label  = "dezhu"
      }
      forward_to = [loki.write.grafana_loki.receiver]
    }

    // Test: format the aizj-backend logs
    local.file_match "aizj_backend_log" {
      path_targets = [{
        __path__ = "/var/log/aizj-backend.log",
        job      = "aizj",
      }]
    }

    loki.source.file "aizj_backend_log" {
      targets    = local.file_match.aizj_backend_log.targets
      forward_to = [loki.process.aizj_backend_log_parser.receiver]
    }

    loki.process "aizj_backend_log_parser" {
      stage.match {
        selector = "{job=\"aizj\"}"
        stage.regex {
          expression = "^(?s)(?P<time>\\S+?) (?P<stream>stdout|stderr) (?P<flags>\\S+?) (?P<content>.*)$"
        }

        stage.template {
          source   = "json_payload"
          template = "{\"timestamp\": \"{{ .time }}\", \"stream\": \"{{ .stream }}\", \"flags\": \"{{ .flags }}\", \"content\": \"{{ .content }}\"}"
        }

        stage.output {
          source = "json_payload"
        }
      }
      forward_to = [loki.relabel.add_static_label.receiver]
    }

    loki.echo "expression" { }
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: alloy
  namespace: alloy
  labels:
    app: alloy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: alloy
      annotations:
        app: alloy
    spec:
      volumes:
        - name: volume-vt1wvp
          configMap:
            name: alloy
            defaultMode: 420
        - name: log
          hostPath:
            path: /var/log
            type: ''
      containers:
        - name: alloy
          image: 'grafana/alloy:v1.6.1'
          args:
            - run
            - /etc/alloy/config.alloy
            - '--storage.path=/tmp/alloy'
            - '--server.http.listen-addr=0.0.0.0:12345'
            - '--server.http.ui-path-prefix=/'
            - '--stability.level=generally-available'
          ports:
            - name: http-metrics
              containerPort: 12345
              protocol: TCP
          env:
            - name: ALLOY_DEPLOY_MODE
              value: helm
          resources: {}
          volumeMounts:
            - name: volume-vt1wvp
              readOnly: true
              mountPath: /etc/alloy
            - name: log
              mountPath: /var/log
          livenessProbe:
            httpGet:
              path: /-/ready
              port: 12345
              scheme: HTTP
            initialDelaySeconds: 45
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/hostname: app-06
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
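After installation, open http://ip:port/graph to see the Alloy management UI (the default port is 12345).
3. Deployment Modes
In Kubernetes, logs are usually collected in one of two ways: Sidecar or DaemonSet. DaemonSet is generally recommended for small and medium clusters. Sidecar is recommended for very large clusters (serving multiple business teams, each with explicit custom log collection requirements, where the number of collection configurations exceeds 500).
- With DaemonSet, each node runs a single log agent that collects all logs on that node. DaemonSet uses far fewer resources, but its scalability and tenant isolation are limited, so it suits clusters with a single function or only a few services.
- With Sidecar, each Pod gets its own log agent, which collects logs for just one application. Sidecar uses more resources but offers more flexibility and stronger multi-tenant isolation. It is recommended for large Kubernetes clusters or clusters that serve multiple business teams as a PaaS platform.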
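| | DaemonSet | Sidecar |
|---|---|---|
| Log types collected | Standard output plus some files | Files |
| Deployment and operations effort | Moderate; the DaemonSet must be maintained | High; every Pod that needs log collection must run a sidecar container |
| Log classification and storage | Moderate; logs can be mapped by container, path, and so on | Each Pod can be configured separately; highly flexible |
| Multi-tenant isolation | Moderate; isolation only between configurations | Strong; isolated by container, with resources allocated separately |
| Supported cluster size | Depends on the number of collection configurations (every node's agent must load and run all of them, so there is an upper limit; more than 500 configurations is generally not recommended) | Unlimited |
| Resource usage | Low; one container per node | High; one container per Pod |
| Query convenience | Fairly high; custom queries and statistics are possible | High; can be tailored to each service |
| Customizability | Low | High; each Pod is configured separately |
| Coupling | Low; the agent can be upgraded independently | Moderate; by default, upgrading the collection agent also restarts the application that the sidecar belongs to (some extensions support hot-upgrading the sidecar) |
| Suitable scenarios | Clusters with clearly classified logs and a single function | Large, mixed, or PaaS clusters |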
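4. Grafana Alloy Usage Examples
4.1 Collecting logs and sending them to Loki
4.1.1 First component: log files
local.file_match "aizj_backend_log" {
  path_targets = [{
    __path__ = "/var/log/aizj-backend.log",
    job      = "aizj",
  }]
  sync_period = "5s"
}
This configuration creates a local.file_match component named aizj_backend_log, which does the following:
- It tells Alloy which files to read source data from.
- It attaches a "job" label to the collected data.
- It checks for new files every 5 seconds.
4.1.2 Second component: scraping
loki.source.file "aizj_backend_log" {
  targets    = local.file_match.aizj_backend_log.targets
  forward_to = [loki.process.aizj_backend_log_parser.receiver]
}
This configuration creates a [<u>loki.source.file</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/loki/loki.source.file/) component named aizj_backend_log, which does the following:
- It uses the targets exported by the local.file_match component aizj_backend_log as its source.
- It forwards the scraped logs to the receiver of another component named aizj_backend_log_parser.
- The component also accepts additional arguments, such as tail_from_end, to start tailing a log file from its end so that the entire log history is not ingested. This example does not set them.
4.1.3 Third component: formatting and filtering logs
loki.process "aizj_backend_log_parser" {
  stage.match {
    selector = "{job=\"aizj\"}"
    stage.regex {
      expression = "^(?s)(?P<time>\\S+?) (?P<stream>stdout|stderr) (?P<flags>\\S+?) (?P<content>.*)$"
    }
    stage.template {
      source   = "json_payload"
      template = "{\"timestamp\": \"{{ .time }}\", \"stream\": \"{{ .stream }}\", \"flags\": \"{{ .flags }}\", \"content\": \"{{ .content }}\"}"
    }
    stage.drop {
      source              = "json_payload"
      expression          = "^warning.*\n?"
      drop_counter_reason = "noisy"
    }
    stage.output {
      source = "json_payload"
    }
  }
  forward_to = [loki.write.grafana_loki.receiver]
}
The loki.process component lets you transform, filter, parse, and enrich log data. Inside it you define one or more processing stages that specify how log entries are handled before they are stored or forwarded.
This configuration creates a [<u>loki.process</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/loki/loki.process/) component named aizj_backend_log_parser, which does the following:
- It receives the scraped log entries from the aizj_backend_log component.
- It uses stage.match to select the log entries to process.
- It uses stage.regex to define a regular expression and split each log line into named fields.
- It uses stage.template to assemble those fields into JSON and store the result under the name json_payload.
- It uses a stage.drop block to define what to remove from the scraped logs: source names the value to inspect, expression is the regular expression that selects entries to drop, and the optional drop_counter_reason records why they were dropped. As written, the expression is anchored with ^warning while json_payload begins with `{`, so the source would probably need to point at the raw content for the drop to take effect.
- It uses stage.output to make the JSON-formatted data the output of this component.
- It forwards the processed logs to the receiver of another component named grafana_loki.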
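The stage sequence above (regex, template, drop, output) can be approximated outside Alloy to check the regular expression against sample lines. The following Python sketch is only an illustration under stated assumptions: the sample line format is assumed, `re.DOTALL` stands in for the inline `(?s)` flag, `json.dumps` replaces the plain string template (so special characters are escaped, unlike in the original template), lines that do not match are passed through unchanged as a simplification, and the drop check is applied to the extracted `content` field, which is an assumption about the intent of the drop stage.

```python
import json
import re

# Same named groups as the stage.regex expression; re.DOTALL replaces the inline (?s).
LINE_RE = re.compile(
    r"^(?P<time>\S+?) (?P<stream>stdout|stderr) (?P<flags>\S+?) (?P<content>.*)$",
    re.DOTALL,
)
# Same expression as the stage.drop block.
DROP_RE = re.compile(r"^warning.*\n?")


def process(line):
    """Return the rewritten log line, or None when the entry is dropped."""
    match = LINE_RE.match(line)
    if match is None:
        # Simplification: unmatched lines are passed through unchanged.
        return line
    fields = match.groupdict()
    # Assumption: the drop check targets the raw content, not the JSON payload.
    if DROP_RE.match(fields["content"]):
        return None
    payload = {
        "timestamp": fields["time"],
        "stream": fields["stream"],
        "flags": fields["flags"],
        "content": fields["content"],
    }
    return json.dumps(payload)


kept = process("2024-08-15T14:22:33Z stdout F hello")
dropped = process("2024-08-15T14:22:33Z stderr F warning: disk almost full")
```

Here `kept` is a JSON string with the four fields, and `dropped` is `None` because its content starts with "warning".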
4.1.4 Fourth component: writing logs to Loki
loki.write "grafana_loki" {
  endpoint {
    url = "http://xxxxx:32621/loki/api/v1/push"
  }
}
The last component is a [<u>loki.write</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/loki/loki.write/) component named grafana_loki that points to http://xxxxx:32621/loki/api/v1/push.
With this configuration, Alloy connects directly to the running Loki instance.
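To make the push endpoint less abstract, the sketch below builds the JSON form of a request body for `/loki/api/v1/push`. It is an illustration only: Alloy builds and sends these requests itself, and the helper name `build_push_payload` and the sample label and line are hypothetical. Nothing is sent over the network here.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push body: one stream with its labels and [timestamp, line] pairs."""
    # Loki expects the timestamp as a string of Unix nanoseconds.
    ts = str(time.time_ns() if now_ns is None else now_ns)
    return {
        "streams": [
            {
                "stream": dict(labels),
                "values": [[ts, line] for line in lines],
            }
        ]
    }


payload = build_push_payload({"job": "aizj"}, ["hello"], now_ns=1700000000000000000)
body = json.dumps(payload)  # would be POSTed with Content-Type: application/json
```

Each distinct label set forms its own stream, which is why the `job` label attached in local.file_match ends up as a queryable stream selector in Loki.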
4.2 Collecting Kubernetes logs and sending them to Loki
4.2.1 System logs
To collect system logs, use the following components:
- [<u>local.file_match</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/local/local.file_match/): discovers files on the local file system.
- [<u>loki.source.file</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/loki/loki.source.file/): reads log entries from files.
- [<u>loki.write</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/loki/loki.write/): sends logs to a Loki endpoint. It should already be configured as described in the section on writing logs to Loki.
4.2.2 Pod logs
discovery.kubernetes "pods" {
  role = "pod"
}
discovery.relabel "pod_labels" {
  targets = discovery.kubernetes.pods.targets
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    action        = "replace"
    target_label  = "container_name"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    action        = "replace"
    target_label  = "namespace"
  }
}
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_labels.output
  forward_to = [loki.process.filter_logs.receiver]
}
loki.process "filter_logs" {
  stage.drop {
    source              = ""
    expression          = ".*Connection closed by authenticating user root"
    drop_counter_reason = "noisy"
  }
  forward_to = [loki.write.grafana_loki.receiver]
}
loki.write "grafana_loki" {
  endpoint {
    url = "http://10.244.10.219:32621/loki/api/v1/push"
  }
}
- [<u>discovery.kubernetes</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/discovery/discovery.kubernetes/): discovers Pods in Kubernetes and lists them for other components to use. Valid values for role are pod, node, service, endpoints, endpointslice, and ingress.
- [<u>discovery.relabel</u>](https://grafana.org.cn/docs/alloy/latest/reference/components/discovery/discovery.relabel/): relabels Pod metadata, that is, defines custom labels.
Appendix: Quickly Onboarding a Java Application
Without changing the Java program, a small amount of configuration is enough to connect it to Tempo and correlate its traces with the logging system. (Adopting the OTel logging and metrics approach as well would require the Java side to add the corresponding SDK and change some code, which needs to be agreed with the development team.)
1. Add the javaagent to the shell command that starts the Java program
java -javaagent:/path/to/opentelemetry-javaagent-all.jar \
-Dotel.traces.exporter=otlp \
-Dotel.exporter.otlp.endpoint=http://alloy.monitoring.svc:4318 \
-Dotel.resource.attributes=service.name=my-java-app \
-jar /path/to/your/application.jar
2. Modify logback.xml (example)
Only the pattern part needs to change, and its content is up to you. The part added in this example is [trace_id=%X{trace_id}, span_id=%X{span_id}].
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - [trace_id=%X{trace_id}, span_id=%X{span_id}] %msg%n</pattern>
</encoder>
</appender>
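Once the pattern above is in place, every log line carries the IDs needed to jump from a log entry to its trace in Tempo. As a rough illustration of that correlation step, the Python sketch below pulls the two IDs out of a line written with this pattern. The sample line and the helper name `extract_trace_context` are hypothetical; when no span is active, `%X{...}` renders as an empty string, which the sketch reports as `None`.

```python
import re

# Matches the "[trace_id=..., span_id=...]" fragment produced by the logback pattern.
TRACE_RE = re.compile(r"\[trace_id=(?P<trace_id>[0-9a-f]*), span_id=(?P<span_id>[0-9a-f]*)\]")


def extract_trace_context(line):
    """Return (trace_id, span_id) from a log line, or None if no trace is attached."""
    match = TRACE_RE.search(line)
    if match is None or not match.group("trace_id"):
        return None
    return match.group("trace_id"), match.group("span_id")


sample = (
    "2024-08-15 14:22:33 [main] INFO  com.example.Demo - "
    "[trace_id=4bf92f3577b34da6a3ce929d0e0e4736, span_id=00f067aa0ba902b7] order created"
)
context = extract_trace_context(sample)
```

The same regular expression could serve as a starting point for a derived field on the Loki data source in Grafana, which is one common way to link log lines to traces.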
Appendix: Configuration Files for Each Component in Binary Mode
prometheus configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['prom.51ima.lo:9093']
rule_files:
  - "/home/finance/App/prometheus/rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['prom.51ima.lo:9093']
  - job_name: 'loki'
    static_configs:
      - targets: ['loki.51ima.lo:3100']
  - job_name: 'grafana'
    static_configs:
      - targets: ['loki.51ima.lo:3000']
  - job_name: nodes
    file_sd_configs:
      - files:
          - "/home/finance/App/prometheus/node/node*.yml"
        refresh_interval: 60s
  - job_name: node_process
    file_sd_configs:
      - files:
          - "/home/finance/App/prometheus/node/process*.yml"
        refresh_interval: 60s
  - job_name: middleware
    file_sd_configs:
      - files:
          - "/home/finance/App/prometheus/conf.d/*.yml"
        refresh_interval: 60s
  - job_name: 'tcp_port'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - "/home/finance/App/target/blackbox-exporter-tcp.yml"
        refresh_interval: 10s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <blackbox_exporter_ip>:9115
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - "/home/finance/App/target/blackbox-exporter-http.yml"
        refresh_interval: 10s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <blackbox_exporter_ip>:9115
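The three relabel_configs rules in the tcp_port and http_status jobs redirect each scrape to the blackbox exporter while keeping the probed address visible as the instance label. The Python sketch below traces that label flow for one target. It is a simplified illustration, not Prometheus's relabeling engine: each rule is reduced to a plain copy or assignment, and the target address and exporter address are made-up values.

```python
def apply_blackbox_relabel(labels, exporter_addr):
    """Apply the three relabel rules in order to one target's label set."""
    out = dict(labels)
    # Rule 1: copy the discovered address into the ?target= query parameter.
    out["__param_target"] = out["__address__"]
    # Rule 2: expose the probed address as the instance label.
    out["instance"] = out["__param_target"]
    # Rule 3: send the actual scrape to the blackbox exporter instead.
    out["__address__"] = exporter_addr
    return out


# Hypothetical target and exporter address, for illustration only.
result = apply_blackbox_relabel({"__address__": "10.0.0.5:22"}, "127.0.0.1:9115")
```

After the rules run, Prometheus scrapes `127.0.0.1:9115/probe?module=tcp_connect&target=10.0.0.5:22` in this example, and the resulting series carry `instance="10.0.0.5:22"`.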
prometheus.service
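- web.enable-admin-api: enables the admin API endpoints
- web.enable-lifecycle: enables shutdown and reload via HTTP requests
- web.enable-remote-write-receiver: enables the remote-write receiver, so that third-party components can push data to Prometheus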
[Unit]
Description=Prometheus Monitoring System
Wants=network-online.target
After=network-online.target
[Service]
User=finance
Group=finance
ExecStart=/home/finance/App/prometheus/prometheus \
--config.file=/home/finance/App/prometheus/prometheus.yml \
--web.enable-admin-api --web.enable-lifecycle --web.enable-remote-write-receiver \
--storage.tsdb.path=/home/finance/Data/prometheus --storage.tsdb.retention.time=90d \
--web.listen-address=:9090
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
alertmanager configuration
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert.webhook'
receivers:
  - name: 'alert.webhook'
    webhook_configs:
      - url: '<alert channel webhook URL>'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'P1'
    target_match:
      severity: 'P2'
    equal: ['alertname', 'instance']
  - source_match:
      severity: 'P0'
    target_match:
      severity: 'P1'
    equal: ['alertname', 'instance']
alertmanager.service
[Unit]
Description=AlertManager for Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=finance
Group=finance
Type=simple
ExecStart=/home/finance/App/alertmanager/alertmanager --config.file=/home/finance/App/alertmanager/alertmanager.yml \
--storage.path=/home/finance/Data/alertmanager --web.listen-address=:9093
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
blackbox_exporter configuration
blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
blackbox_exporter.service
[Unit]
Description=Prometheus blackbox exporter
Wants=network-online.target
After=network-online.target
[Service]
User=finance
Group=finance
Type=simple
ExecStart=/home/finance/App/exporter/blackbox_exporter --config.file=/home/finance/App/exporter/conf/blackbox.yml
KillMode=process
Restart=always
[Install]
WantedBy=multi-user.target
loki configuration
config.yml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /home/finance/Data/loki
  storage:
    filesystem:
      chunks_directory: /home/finance/Data/loki/chunks
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 168h
  ingestion_rate_strategy: local
  ingestion_rate_mb: 15
  ingestion_burst_size_mb: 20
  shard_streams:
    enabled: true
    desired_rate: 10485760
chunk_store_config:
  max_look_back_period: 168h
compactor:
  retention_enabled: true
  deletion_mode: filter-and-delete
  compaction_interval: 10m
  retention_delete_delay: 5m
  delete_request_cancel_period: 10m
  retention_delete_worker_count: 150
ruler:
  alertmanager_url: http://prom.51ima.lo:9093
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      store: inmemory
  rule_path: /home/finance/Data/loki/tmp_rules
  storage:
    type: local
    local:
      directory: /home/finance/Data/loki/rules
  flush_period: 1m
analytics:
  reporting_enabled: false
loki.service
[Unit]
Description=Loki service
After=network.target
[Service]
Type=simple
User=finance
Group=finance
ExecStart=/home/finance/App/loki/loki -config.file /home/finance/App/loki/config.yml
TimeoutSec = 120
Restart = on-failure
RestartSec = 2
[Install]
WantedBy=multi-user.target
tempo configuration
config.yml
server:
  http_listen_port: 3200
  grpc_listen_port: 9095
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
  override_ring_key: memberlist
storage:
  trace:
    backend: local
    local:
      path: /home/finance/Data/tempo
    pool:
      max_workers: 12 # adjust to the number of CPU cores
    wal:
      path: /home/finance/Data/tempo/wal
query_frontend:
  search:
    concurrent_jobs: 200
    max_duration: 720h
compactor:
  compaction:
    block_retention: 336h
    compaction_window: 2h
ingester:
  max_block_bytes: 209715200
  max_block_duration: 30m # maximum time a block stays open before it is cut
  trace_idle_period: 30s # how long a trace is kept in memory without new data
# Metrics generation settings for observability data
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: linux-microservices
overrides:
  defaults:
    ingestion:
      rate_limit_bytes: 50000000
    metrics_generator:
      processors: [service-graphs, span-metrics]
tempo.service
[Unit]
Description=tempo service
After=network.target
[Service]
Type=simple
User=finance
Group=finance
ExecStart=/home/finance/tempo/tempo --config.file /home/finance/App/tempo/config.yml
TimeoutSec = 120
Restart = on-failure
RestartSec = 2
[Install]
WantedBy=multi-user.target
grafana configuration
grafana.ini
app_mode = production
instance_name = ${HOSTNAME}
force_migration = false
[paths]
data = /home/finance/Data/grafana/data
temp_data_lifetime = 24h
logs = /home/finance/Logs/grafana
plugins = /home/finance/Data/grafana/plugins
provisioning = /home/finance/Data/grafana/provisioning
[server]
protocol = http
enable_gzip = true
# If Grafana is served from a sub-path, enable the following settings
# domain= <domain name>
# root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/
# serve_from_sub_path = true
[database]
type = sqlite3
name = grafana
path = grafana.db
[log]
mode = file
level = info
[log.file]
format = json
log_rotate = true
max_lines = 1000000
# Max size shift of single file, default is 28 means 1 << 28, 256MB
max_size_shift = 28
daily_rotate = true
max_days = 7
[analytics]
reporting_enabled = false
check_for_updates = false
check_for_plugin_updates = false
[security]
admin_user = <admin user name>
admin_password = <admin user password>
[unified_alerting]
enabled = false
[users]
viewers_can_edit = true
grafana.service
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
[Service]
PermissionsStartOnly=True
User=finance
Group=finance
Type=notify
Restart=on-failure
WorkingDirectory=/home/finance/App/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/home/finance/App/grafana/bin/grafana server \
--config=/home/finance/App/grafana/conf/grafana.ini \
--pidfile=/var/run/grafana/grafana-server.pid
LimitNOFILE=10000
TimeoutStopSec=20
SystemCallArchitectures=native
UMask=0027
[Install]
WantedBy=multi-user.target
alloy configuration
config.alloy
logging {
level = "debug"
format = "json"
write_to = [loki.write.grafana_loki.receiver]
}
livedebugging {
enabled = true
}
// Define the OTel receiver for the observability data sent by the javaagent
otelcol.receiver.otlp "alloy" {
grpc {
endpoint = "0.0.0.0:4317"
}
http {
endpoint = "0.0.0.0:4318"
}
output {
//metrics = [otelcol.processor.batch.batch_processor.input]
//logs = [otelcol.processor.batch.batch_processor.input]
traces = [otelcol.processor.batch.batch_processor.input]
}
}
otelcol.processor.batch "batch_processor" {
output {
//metrics = [otelcol.exporter.prometheus.metrics.input]
//logs = [otelcol.exporter.loki.logs.input]
traces = [otelcol.exporter.otlp.trace.input]
}
}
otelcol.exporter.otlp "trace" {
client {
endpoint = "http://trace.51ima.lo:4317"
}
}
//otelcol.exporter.prometheus "metrics" {
// forward_to = [prometheus.remote_write.metrics.receiver]
//}
//otelcol.exporter.loki "logs" {
// forward_to = [loki.process.log_parser.receiver]
//}
// Add a static label
loki.relabel "add_static_label" {
rule {
source_labels = ["filename"]
regex = `(.*\/)([^/]+)\.log$`
replacement = "$2"
target_label = "app"
}
forward_to = [loki.write.grafana_loki.receiver]
}
// ----------------------------------------file_match
local.file_match "aizj_backend" {
path_targets = [{
__path__ = "/var/log/alloy/aizj-backend.log",
aizj = "aizj_backend",
}]
}
local.file_match "aizj_aiproxy" {
path_targets = [{
__path__ = "/var/log/alloy/aizj-aiproxy.log",
aizj = "aizj_aiproxy",
}]
}
local.file_match "static" {
path_targets = [{
__path__ = "/var/log/alloy/dezhu-static.log",
dezhu = "static",
}]
}
// ----------------------------------------source.file
loki.source.file "aizj_backend" {
targets = local.file_match.aizj_backend.targets
forward_to = [loki.process.aizj_backend.receiver]
}
loki.source.file "aizj_aiproxy" {
targets = local.file_match.aizj_aiproxy.targets
forward_to = [loki.process.aizj_backend.receiver]
}
loki.source.file "static" {
targets = local.file_match.static.targets
forward_to = [loki.write.grafana_loki.receiver]
}
// ----------------------------------------loki.process
loki.process "nlp_log_parser" {
stage.match {
selector = "{nlp=\"nlp\"}"
stage.drop {
expression = "^\\s*$"
}
stage.regex {
expression = "^(?P<log_time>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+(?P<level>\\w+)\\s+\\[(?P<thread>[^\\]]+)\\]\\s+(?P<class>\\S+)\\s+--\\s+(?P<message>.*)$"
}
stage.template {
source = "nlp_json"
template = "{\"log_time\": \"{{ .log_time }}\", \"level\": \"{{ .level }}\", \"thread\": \"{{ .thread }}\", \"class\": \"{{ .class }}\", \"message\": \"{{ .message }}\"}"
}
stage.output {
source = "nlp_json"
}
}
forward_to = [loki.write.grafana_loki.receiver]
}
loki.process "aizj_backend" {
stage.match {
selector = "{aizj=~\"aizj_backend|aizj_analyze_job|aizj_dap|aizj_aiproxy|aizj_gateway|aizj_admin_backend\"}"
stage.regex {
expression = "^(?P<log_time>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+(?P<level>\\w+)\\s+\\d+\\s+---\\s+\\[(?P<thread>[^\\]]+)\\]\\s+(?P<class>\\S+)\\s+:\\s+(?P<message>.*)$"
}
stage.template {
source = "aizj_json"
template = "{\"log_time\": \"{{ .log_time }}\", \"level\": \"{{ .level }}\", \"thread\": \"{{ .thread }}\", \"class\": \"{{ .class }}\", \"message\": \"{{ .message }}\"}"
}
stage.output {
source = "aizj_json"
}
}
forward_to = [loki.write.grafana_loki.receiver]
}
// Write to Loki
loki.write "grafana_loki" {
endpoint {
url = "http://10.244.10.219:32621/loki/api/v1/push"
}
}
// Collect metrics from the Alloy pods
discovery.kubernetes "alloy_pods" {
role = "pod"
}
prometheus.scrape "alloy_pods" {
targets = discovery.kubernetes.alloy_pods.targets
forward_to = [prometheus.remote_write.metrics.receiver]
job_name = "alloy_otel"
}
prometheus.exporter.unix "local_system" { }
prometheus.scrape "scrape_metrics" {
targets = prometheus.exporter.unix.local_system.targets
forward_to = [prometheus.relabel.filter_metrics.receiver]
scrape_interval = "10s"
}
prometheus.relabel "filter_metrics" {
rule {
source_labels = ["__address__"]
target_label = "instance"
}
forward_to = [prometheus.remote_write.metrics.receiver]
}
prometheus.exporter.unix "process" {
enable_collectors = ["processes"]
}
prometheus.scrape "scrape_metrics_process" {
targets = prometheus.exporter.unix.process.targets
scrape_interval = "30s"
body_size_limit = "16MB"
forward_to = [prometheus.relabel.filter_metrics2.receiver]
}
prometheus.relabel "filter_metrics2" {
rule {
target_label = "instance"
replacement = env("HOST_IP")
}
rule {
target_label = "job"
replacement = "node_exporter"
}
forward_to = [prometheus.remote_write.metrics.receiver]
}
prometheus.remote_write "metrics" {
endpoint {
url = "http://10.244.10.219:31355/api/v1/write"
}
}
loki.echo "expression"{ }
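The loki.relabel rule named add_static_label in this configuration derives a label value from the filename label. The Python sketch below shows what that regular expression captures for the log paths used here. It is an approximation: Prometheus-style relabeling anchors the expression on both ends, which `fullmatch` imitates, and the helper name `app_label` is hypothetical.

```python
import re

# Same expression as the relabel rule; group 2 is the file name without ".log".
RULE_RE = re.compile(r"(.*\/)([^/]+)\.log$")


def app_label(filename):
    """Return the value the rule would write to the target label, or None if it does not match."""
    match = RULE_RE.fullmatch(filename)
    return match.group(2) if match else None


label = app_label("/var/log/alloy/aizj-backend.log")
```

For the path above, `label` is `aizj-backend`; a name without a directory part, such as `app.log`, does not match because the first group requires a slash.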
alloy.service
[Unit]
Description= Vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines
Documentation=https://grafana.com/docs/alloy
Wants=network-online.target
After=network-online.target
[Service]
Restart=always
User=root
Environment=HOSTNAME=%H
Environment=HOST_IP=10.242.12.82
WorkingDirectory=/home/finance/App/alloy
ExecStart=/home/finance/App/alloy/alloy run --storage.path=/home/finance/Data/alloy /home/finance/App/alloy/config.alloy
ExecReload=/usr/bin/env kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
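Java application configuration
Configure the javaagent when starting the Java application so that it automatically reports observability data to Alloy.
Starting outside Kubernetes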
java -javaagent:/path/to/opentelemetry-javaagent-all.jar \
-Dotel.traces.exporter=otlp \
-Dotel.metrics.exporter=otlp \
-Dotel.logs.exporter=otlp \
-Dotel.exporter.otlp.endpoint=http://alloy.51ima.lo:4318 \
-Dotel.resource.attributes=service.name=my-java-app \
-jar /path/to/your/application.jar
Starting in Kubernetes
Using the backend startup as an example:
---
# Source: base/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: basic
  labels:
    app.kubernetes.io/name: basic
    app.kubernetes.io/instance: basic
spec:
  sessionAffinity: None
  type: ClusterIP
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app.kubernetes.io/name: basic
    app.kubernetes.io/instance: basic
---
# Source: base/templates/deployment-statefulset.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: basic
  labels:
    app.kubernetes.io/name: basic
    app.kubernetes.io/instance: basic
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: basic
      app.kubernetes.io/instance: basic
  template:
    metadata:
      labels:
        app.kubernetes.io/name: basic
        app.kubernetes.io/instance: basic
    spec:
      shareProcessNamespace: false
      imagePullSecrets:
        - name: harbor-secret
      serviceAccountName: default
      securityContext:
        {}
      containers:
        - name: basic
          securityContext:
            {}
          image: "base:cbbde210_20240815142233434"
          imagePullPolicy: Always
          command:
            - sh
            - -c
            # The -Dotel.* options are placed before the jar path so that the JVM reads them as options.
            # Auto-instrumentation also needs a -javaagent:<path> option, as in the non-K8S example;
            # it is not part of the original command.
            - >-
              java -jar -server -Xms2048m -Xmx2048m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m
              -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection
              -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=90 -XX:+HeapDumpOnOutOfMemoryError
              -XX:HeapDumpPath=/home/finance/appshell -Djava.awt.headless=true -Dsun.net.client.defaultConnectTimeout=60000
              -Dsun.net.client.defaultReadTimeout=60000 -Djmagick.systemclassloader=no -Dnetworkaddress.cache.ttl=300
              -Dsun.net.inetaddr.ttl=300 -Dhealth.vault.enabled=false -Djava.security.egd=file:/dev/./urandom
              -Dspring.profiles.active=${SPRING_PROFILES_ACTIVE} -Dtomcat.port=8080 -Djava.io.tmpdir=/home/finance/App/servers
              -Dspring.cloud.nacos.discovery.namespace=${nacos_namespace} -Dspring.cloud.nacos.config.namespace=${nacos_namespace}
              -Dotel.metrics.exporter=otlp
              -Dotel.exporter.otlp.endpoint=http://alloy.51ima.lo:4318
              -Dotel.resource.attributes=service.name=smartxmabasic
              -Dlogging.config=file:/home/finance/Conf/logback.xml /home/finance/App/smartxmabase.jar
          args:
            []
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          envFrom:
            - configMapRef:
                name: smartxmabase-nacos
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: MY_NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: JAVA_OPTS
              value: ""
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 100
            periodSeconds: 30
            timeoutSeconds:
            successThreshold: 1
            failureThreshold:
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds:
            successThreshold: 1
            failureThreshold:
          resources:
            limits:
              cpu: 1500m
              memory: 3Gi
            requests:
              cpu: 200m
              memory: 2048Mi
          volumeMounts:
            - mountPath: /etc/localtime
              name: timezone
            - name: smartxmabase-logback
              mountPath: /home/finance/Conf
      volumes:
        - hostPath:
            path: /etc/localtime
          name: timezone
        - name: smartxmabase-logback
          configMap:
            name: smartxmabase-logback
        - name: data-storage
          emptyDir: {}
      nodeSelector:
        nodeType: dz
Disclaimer: this article comes from 运维开发故事 and represents only the author's views. Link: https://eyangzhen.com/8547.html