Introduction: Why an ELK Log Analysis Platform?
In modern enterprise IT architectures, log data is the "black box" of a running system. Whether you are troubleshooting failures, performing security audits, or optimizing performance, logs play a critical role. However, with the spread of microservice architectures and growing business scale, traditional log management faces serious challenges:
- Scattered logs: services run on many servers, so logs are spread everywhere
- Diverse formats: different systems produce logs in different formats
- Hard to search: locating a problem in a sea of logs is like finding a needle in a haystack
- Poor timeliness: anomalies cannot be detected and handled promptly
The ELK Stack (Elasticsearch + Logstash + Kibana) is the industry's leading open-source log analysis solution and addresses all of these problems. It provides a complete pipeline from log collection and processing to visualization, and has become the de facto standard for enterprise log analysis.
I. ELK Core Components in Depth
1.1 Elasticsearch: The Distributed Search Engine
Elasticsearch is the storage and retrieval core of ELK. Built on Lucene, it is distributed, highly available, and exposed through a RESTful API.
Core concepts (a short example follows):
- Index: analogous to a database; a collection of documents
- Type: deprecated since 6.x and removed in 8.x; use a separate index per document type instead
- Document: a JSON-formatted unit of data
- Shard: a partition of an index, enabling horizontal scaling
- Replica: a copy of a shard, providing high availability and additional read throughput
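To make these concrete, a minimal example (index name and field values are illustrative):
# Create an index with 3 primary shards and 1 replica per shard
PUT /app-logs-000001
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
# Index a JSON document into it
POST /app-logs-000001/_doc
{
  "@timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "service started"
}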
Key settings for an enterprise deployment:
# elasticsearch.yml — key settings
cluster.name: production-elk-cluster
node.name: ${HOSTNAME}
network.host: 0.0.0.0
http.port: 9200
discovery.seed_hosts: ["es-node1", "es-node2", "es-node3"]
cluster.initial_master_nodes: ["es-node1", "es-node2", "es-node3"]
# JVM heap (set in config/jvm.options, not elasticsearch.yml; size it to the host, at most ~50% of RAM and below the compressed-oops threshold of roughly 30 GB)
-Xms16g
-Xmx16g
# Data and log paths
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
Performance tuning parameters:
# Indexing buffer size
indices.memory.index_buffer_size: 10%
# Refresh interval (an index-level setting, normally applied via an index template; trades freshness for indexing throughput)
index.refresh_interval: 30s
# Thread pool configuration (adjust only after observing rejections)
thread_pool:
write:
size: 16
queue_size: 10000
search:
size: 24
queue_size: 1000
1.2 Logstash: The Data Processing Pipeline
Logstash is a powerful data collection and processing engine with a large set of input/output plugins and rich built-in filters.
Typical configuration structure:
# logstash.conf
input {
# Multiple input sources
file {
path => ["/var/log/nginx/access.log"]
type => "nginx_access"
start_position => "beginning"
codec => "json"
}
beats {
port => 5044
type => "beats"
}
tcp {
port => 5000
codec => "json_lines"
}
}
filter {
# Timestamp parsing
date {
match => ["timestamp", "yyyy-MM-dd HH:mm:ss"]
target => "@timestamp"
}
# GeoIP lookup of the client IP
geoip {
source => "client_ip"
}
# User-Agent parsing
useragent {
source => "user_agent"
target => "user_agent"
}
# Conditional handling
if [type] == "nginx_access" {
mutate {
convert => {
"status" => "integer"
"response_time" => "float"
}
}
}
# Flag error-level events
if [level] == "ERROR" or [level] == "FATAL" {
mutate {
add_field => {
"is_exception" => true
}
}
}
}
output {
# Send to Elasticsearch
elasticsearch {
hosts => ["http://es-node1:9200", "http://es-node2:9200"]
index => "logs-%{+YYYY.MM.dd}"
user => "elastic"
password => "${ES_PASSWORD}"
# Batching is controlled by pipeline.batch.size / pipeline.batch.delay in logstash.yml;
# the old flush_size / idle_flush_time options have been removed from this plugin
# Retry policy
retry_max_interval => 30
retry_initial_interval => 2
}
# Debug output (disable in production)
stdout {
codec => rubydebug
}
}
Performance tuning tips:
- Batch processing: tune pipeline.batch.size and pipeline.batch.delay to balance throughput against latency (see the sketch below)
- Parallel pipelines: for high-throughput scenarios, run multiple pipelines or Logstash instances
- Filter optimization: avoid complex regular expressions and prefer built-in plugins such as dissect
- Memory management: size the JVM heap sensibly to avoid frequent GC
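A minimal sketch of where these knobs live (values are illustrative and should be tuned against your own throughput tests):
# logstash.yml — worker and batch settings
pipeline.workers: 8
pipeline.batch.size: 250
pipeline.batch.delay: 50
# jvm.options — heap sizing to avoid frequent GC
-Xms4g
-Xmx4g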
1.3 Kibana: Visualization and Exploration
Kibana provides powerful data visualization and a user-friendly query interface.
Core features (example queries follow the list):
- Discover: interactive log exploration
- Visualize: chart creation
- Dashboard: combining visualizations into dashboards
- Dev Tools: running REST API calls
- Alerting: monitoring and alerts
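For example, typical Discover queries written in KQL (field names assume the unified log schema described in section 4.1):
# KQL examples for the Discover search bar
level: "ERROR" and service: "order-service"
response_time_ms > 1000 and not host: test-*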
Enterprise configuration:
# kibana.yml
server.port: 5601
server.host: "0.0.0.0"
server.name: "kibana-prod"
elasticsearch.hosts: ["http://es-node1:9200", "http://es-node2:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "${KB_PASSWORD}"
# Performance tuning
elasticsearch.requestTimeout: 300000
elasticsearch.shardTimeout: 30000
elasticsearch.pingTimeout: 1500
# Session timeout (Kibana 8.x syntax)
xpack.security.session.idleTimeout: "24h"
# Monitoring
monitoring.ui.container.elasticsearch.enabled: true
II. From Zero to Production: Deploying the ELK Platform
2.1 Environment Planning and Preparation
Recommended hardware:
| Role | CPU | Memory | Disk | Count |
|---|---|---|---|---|
| Elasticsearch | 16+ cores | 64 GB+ | SSD (RAID 0) | 3+ |
| Logstash | 8+ cores | 16 GB+ | standard disk | 2+ |
| Kibana | 4+ cores | 8 GB+ | standard disk | 1+ |
| Filebeat | 2+ cores | 4 GB+ | - | as needed |
Network planning:
- Elasticsearch intra-cluster traffic: 10 GbE
- Logstash to Elasticsearch: 10 GbE
- Filebeat to Logstash: 1 GbE
- Kibana access: 1 GbE
2.2 Quick Deployment with Docker Compose
For development and test environments, Docker Compose is a fast way to bring the stack up:
# docker-compose.yml
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
container_name: es01
environment:
- node.name=es01
- cluster.name=elk-cluster
- discovery.seed_hosts=es02,es03
- cluster.initial_master_nodes=es01,es02,es03
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms4g -Xmx4g"
- xpack.security.enabled=false   # dev/test only; enable security with TLS certificates in production (see section 3.4.2)
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- esdata01:/usr/share/elasticsearch/data
ports:
- 9200:9200
networks:
- elk
elasticsearch02:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
container_name: es02
environment:
- node.name=es02
- cluster.name=elk-cluster
- discovery.seed_hosts=es01,es03
- cluster.initial_master_nodes=es01,es02,es03
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms4g -Xmx4g"
- xpack.security.enabled=false   # dev/test only; enable security with TLS certificates in production (see section 3.4.2)
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- esdata02:/usr/share/elasticsearch/data
networks:
- elk
elasticsearch03:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
container_name: es03
environment:
- node.name=es03
- cluster.name=elk-cluster
- discovery.seed_hosts=es01,es02
- cluster.initial_master_nodes=es01,es02,es03
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms4g -Xmx4g"
- xpack.security.enabled=false   # dev/test only; enable security with TLS certificates in production (see section 3.4.2)
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- esdata03:/usr/share/elasticsearch/data
networks:
- elk
logstash:
image: docker.elastic.co/logstash/logstash:8.11.0
container_name: logstash
volumes:
- ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf
- ./logstash/patterns:/usr/share/logstash/patterns
environment:
- "LS_JAVA_OPTS=-Xms2g -Xmx2g"
ports:
- 5044:5044
- 5000:5000
depends_on:
- elasticsearch
networks:
- elk
kibana:
image: docker.elastic.co/kibana/kibana:8.11.0
container_name: kibana
environment:
- SERVERNAME=kibana
- ELASTICSEARCH_HOSTS=http://es01:9200
# Security is disabled in this dev/test compose, so no credentials are required.
# Set ELASTICSEARCH_USERNAME / ELASTICSEARCH_PASSWORD for the kibana_system user once security is enabled.
ports:
- 5601:5601
depends_on:
- elasticsearch
networks:
- elk
filebeat:
image: docker.elastic.co/beats/filebeat:8.11.0
container_name: filebeat
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on:
- logstash
networks:
- elk
volumes:
esdata01:
driver: local
esdata02:
driver: local
esdata03:
driver: local
networks:
elk:
driver: bridge
Filebeat configuration:
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/nginx/*.log
fields:
log_type: nginx
fields_under_root: true
multiline.pattern: '^\['
multiline.negate: true
multiline.match: after
- type: container
  paths:
    - /var/lib/docker/containers/*/*.log
  processors:
    - add_docker_metadata: ~
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
output.logstash:
hosts: ["logstash:5044"]
# Performance tuning
queue.mem:
events: 4096
flush.min_events: 512
flush.timeout: 5s
logging.level: info
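Before starting the shipper, configuration and connectivity can be verified with Filebeat's built-in test commands:
# Validate configuration syntax and test the Logstash output connection
filebeat test config -c /usr/share/filebeat/filebeat.yml
filebeat test output -c /usr/share/filebeat/filebeat.yml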
2.3 Production Deployment: Kubernetes
For large-scale production environments, Kubernetes is the recommended option:
# elasticsearch-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: es-cluster
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
resources:
requests:
memory: "32Gi"
cpu: "4"
limits:
memory: "32Gi"
cpu: "8"
env:
- name: node.name
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: cluster.name
value: "elk-cluster"
- name: discovery.seed_hosts
value: "es-cluster-0.elasticsearch,es-cluster-1.elasticsearch,es-cluster-2.elasticsearch"
- name: cluster.initial_master_nodes
value: "es-cluster-0,es-cluster-1,es-cluster-2"
- name: bootstrap.memory_lock
value: "true"
- name: ES_JAVA_OPTS
value: "-Xms16g -Xmx16g"
- name: xpack.security.enabled
value: "true"
ports:
- containerPort: 9200
name: http
- containerPort: 9300
name: transport
volumeMounts:
- name: data
mountPath: /usr/share/elasticsearch/data
securityContext:
  runAsUser: 1000
  capabilities:
    add:
      - IPC_LOCK
# Note: fsGroup: 1000 belongs in the pod-level spec.securityContext, not on the container
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- elasticsearch
topologyKey: kubernetes.io/hostname
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "ssd"
resources:
requests:
storage: 1Ti
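The StatefulSet references serviceName: elasticsearch, so a headless Service is required for the es-cluster-N.elasticsearch DNS names used in discovery.seed_hosts; a minimal sketch:
# elasticsearch-service.yaml — headless Service for StatefulSet pod DNS
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
spec:
  clusterIP: None
  selector:
    app: elasticsearch
  ports:
    - name: http
      port: 9200
    - name: transport
      port: 9300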
III. Enterprise-Grade Optimization Strategies
3.1 Elasticsearch Performance Optimization
3.1.1 Index Design Optimization
Index lifecycle management (ILM):
# Create an ILM policy
PUT _ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1d",
"max_size": "50GB"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"forcemerge": {
"max_num_segments": 1
},
"shrink": {
"number_of_shards": 1
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"set_priority": {
"priority": 0
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
# Create an index template
PUT _index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs",
"index.codec": "best_compression",
"index.mapping.total_fields.limit": 10000,
"index.refresh_interval": "30s"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"level": {
"type": "keyword"
},
"message": {
"type": "text",
"analyzer": "standard"
},
"service": {
"type": "keyword"
},
"host": {
"type": "keyword"
},
"response_time": {
"type": "float"
},
"client_ip": {
"type": "ip"
}
}
}
}
}
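Because the template sets index.lifecycle.rollover_alias, the first backing index has to be bootstrapped with the write alias before ILM can roll it over (index name follows the usual rollover convention):
# Bootstrap the initial write index for the "logs" alias
PUT /logs-000001
{
  "aliases": {
    "logs": {
      "is_write_index": true
    }
  }
}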
3.1.2 Query Optimization
Avoiding slow queries:
# Before: an unbounded full-text query across every index
GET /logs-*/_search
{
"query": {
"match": {
"message": "error"
}
}
}
# After: filter context with a time range, explicit size, and sort
GET /logs-*/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1h"
}
}
},
{
"term": {
"level": "ERROR"
}
}
]
}
},
"size": 100,
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
]
}
Using search templates:
# Create the template
POST _scripts/log_search
{
"script": {
"lang": "mustache",
"source": """
{
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "{{start_time}}",
"lte": "{{end_time}}"
}
}
},
{
"terms": {
"level": {{#toJson}}levels{{/toJson}}
}
}
]
}
},
"size": {{size}}
}
"""
}
}
# Invoke the template
GET /logs-*/_search/template
{
"id": "log_search",
"params": {
"start_time": "now-1h",
"end_time": "now",
"levels": ["ERROR", "FATAL"],
"size": 100
}
}
3.1.3 Cluster Tuning
JVM tuning:
# Inspect GC behaviour
GET /_nodes/stats/jvm
# Common JVM options (config/jvm.options)
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
# UnlockExperimentalVMOptions / UseCGroupMemoryLimitForHeap were removed from modern JDKs;
# the bundled JDK honours container memory limits automatically (UseContainerSupport)
Thread pool monitoring:
# Monitor the write thread pool
GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected
# Monitor the search thread pool
GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected
3.2 Logstash Performance Optimization
3.2.1 Pipeline Optimization
Multi-pipeline configuration:
# pipelines.yml
- pipeline.id: nginx
path.config: "/etc/logstash/conf.d/nginx.conf"
pipeline.workers: 4
pipeline.batch.size: 125
pipeline.batch.delay: 50
- pipeline.id: application
path.config: "/etc/logstash/conf.d/application.conf"
pipeline.workers: 8
pipeline.batch.size: 200
pipeline.batch.delay: 10
Batch and retry tuning:
# Tune the Elasticsearch output (batch size itself is set via pipeline.batch.size in logstash.yml)
output {
  elasticsearch {
    hosts => ["http://es:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    # Retry policy
    retry_max_interval => 30
    retry_initial_interval => 2
    # Connection pool
    pool_max => 100
    pool_max_per_route => 10
  }
}
3.2.2 Filter Optimization
Avoid complex regular expressions:
# Before: one heavyweight grok pattern
filter {
grok {
match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:class} - %{GREEDYDATA:message}" }
}
}
# After: split the work across cheaper filters
filter {
  # First, split the line on its fixed delimiters with dissect
dissect {
mapping => {
"message" => "%{timestamp} %{level} [%{thread}] %{class} - %{message}"
}
}
# Then parse the timestamp
date {
match => ["timestamp", "yyyy-MM-dd HH:mm:ss,SSS"]
}
}
Use conditionals to limit processing:
filter {
  # Only run the heavy filters for the logs that need them
  if [type] == "nginx_access" {
    # Heavy processing
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
geoip { source => "clientip" }
} else {
# Other logs get light processing or pass straight through
mutate {
add_field => { "processed" => "true" }
}
}
}
3.3 Filebeat Optimization
3.3.1 Collection Tuning
Multiline log handling:
# Java stack traces
filebeat.inputs:
- type: log
paths:
- /var/log/app/*.log
multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
multiline.negate: true
multiline.match: after
multiline.max_lines: 500
multiline.timeout: 5s
# Nginx error log
- type: log
paths:
- /var/log/nginx/error.log
multiline.pattern: '^\d{4}/\d{2}/\d{2}'
multiline.negate: true
multiline.match: after
Memory queue tuning:
queue.mem:
events: 8192
flush.min_events: 1024
flush.timeout: 5s
output.logstash:
hosts: ["logstash:5044"]
loadbalance: true
bulk_max_size: 2048
slow_start: true
3.3.2 Resource Limits
Cgroup limits:
# docker-compose.yml
filebeat:
image: docker.elastic.co/beats/filebeat:8.11.0
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '0.5'
memory: 512M
3.4 Kibana Optimization
3.4.1 Performance Tuning
Advanced settings:
# kibana.yml
# Increase timeouts
elasticsearch.requestTimeout: 300000
elasticsearch.shardTimeout: 30000
elasticsearch.pingTimeout: 1500
# Node.js heap size: set via the NODE_OPTIONS environment variable or the config/node.options file, not in kibana.yml
# NODE_OPTIONS="--max-old-space-size=4096"
# Search tuning
data.search.timeout: 300000
data.search.maxKeepAlive: 600000
# UI defaults
uiSettings.overrides:
query:queryString:options: "{\"analyze_wildcard\":true}"
search:includeFrozen: false
3.4.2 Security Hardening
Enable X-Pack security:
# Set built-in user passwords (the 8.x tool; the legacy elasticsearch-setup-passwords command is deprecated)
bin/elasticsearch-reset-password -u elastic
# Define a role
POST /_security/role/log_viewer
{
"cluster": ["monitor"],
"indices": [
{
"names": ["logs-*"],
"privileges": ["read", "view_index_metadata"]
}
]
}
POST /_security/user/log_user
{
"password": "strong_password",
"roles": ["log_viewer"],
"full_name": "Log Viewer"
}
Configure SSL/TLS:
# elasticsearch.yml
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.key: /path/to/node.key
xpack.security.http.ssl.certificate: /path/to/node.crt
xpack.security.http.ssl.certificate_authorities: /path/to/ca.crt
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: /path/to/node.key
xpack.security.transport.ssl.certificate: /path/to/node.crt
xpack.security.transport.ssl.certificate_authorities: /path/to/ca.crt
IV. Enterprise Best Practices
4.1 Log Format Design
A unified log format:
{
"@timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"service": "order-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"host": "order-service-7d8f9c6b4f-2xk9p",
"pid": 12345,
"thread": "http-nio-8080-exec-1",
"class": "com.example.order.OrderController",
"message": "Order created successfully",
"context": {
"order_id": "ORD-2024-12345",
"user_id": "USR-67890",
"amount": 99.99
},
"response_time_ms": 125
}
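One way for a Java service to emit exactly this shape is the logstash-logback-encoder library; a minimal Logback sketch, assuming that dependency is on the classpath (file paths and the customFields value are illustrative):
<!-- logback-spring.xml: one JSON object per line, matching the schema above -->
<configuration>
  <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/app/order-service.json</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>/var/log/app/order-service.%d{yyyy-MM-dd}.json</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <customFields>{"service":"order-service"}</customFields>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON_FILE"/>
  </root>
</configuration>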
Log level conventions:
- TRACE: the most detailed tracing information, for debugging
- DEBUG: debug information, used in development environments
- INFO: normal operational information; the default level in production
- WARN: warnings that deserve attention but do not affect operation
- ERROR: errors that affect part of the functionality
- FATAL: fatal errors; the system cannot continue running
4.2 Monitoring and Alerting
Kibana Alerting configuration:
# Create an error-rate alert rule
POST /api/alerting/rule
{
"name": "High Error Rate Alert",
"rule_type_id": ".es-query",
"params": {
"index": ["logs-*"],
"time_field": "@timestamp",
"es_query": {
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-5m"
}
}
},
{
"terms": {
"level": ["ERROR", "FATAL"]
}
}
]
}
}
},
"threshold": 100,
"time_window_size": 5,
"time_window_unit": "m",
"agg_type": "count"
},
"schedule": {
"interval": {
"value": 1,
"unit": "m"
}
},
"actions": [
{
"group": "threshold met",
"id": "email-action",
"params": {
"to": ["ops-team@company.com"],
"subject": "High Error Rate Detected",
"message": "Error count: {{context.value}} in last 5 minutes"
}
},
{
"group": "threshold met",
"id": "slack-action",
"params": {
"message": "🚨 High error rate detected: {{context.value}} errors in 5 minutes"
}
}
]
}
Monitoring ELK with Prometheus:
# prometheus.yml
# Note: Elasticsearch, Logstash, and Filebeat expose their stats as JSON APIs, not in Prometheus
# exposition format, so each target below assumes a dedicated exporter sits in front of the component
# (for example elasticsearch_exporter; exporter hostnames and ports are illustrative).
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es-exporter:9114']   # elasticsearch_exporter (default port 9114)
    scrape_interval: 15s
  - job_name: 'logstash'
    static_configs:
      - targets: ['logstash-exporter:9198']   # community Logstash exporter; port depends on the exporter used
    scrape_interval: 15s
  - job_name: 'filebeat'
    static_configs:
      - targets: ['beat-exporter:9479']   # community Beats exporter; port depends on the exporter used
    scrape_interval: 15s
4.3 Data Retention and Archiving
Hot-warm-cold architecture:
# Assign data tiers per node in elasticsearch.yml (cluster-wide allocation settings are not the right tool for this)
# Hot nodes:
node.roles: [ "master", "data_hot", "data_content" ]
# Cold nodes:
node.roles: [ "data_cold" ]
# ILM moves indices between tiers automatically; an index can also be steered manually:
PUT /logs-2024.01.15/_settings
{
  "index.routing.allocation.include._tier_preference": "data_cold"
}
# Roll over manually with the Rollover API
POST /logs-write/_rollover
{
"conditions": {
"max_age": "1d",
"max_size": "50gb"
}
}
Snapshot backups:
# Register a snapshot repository
PUT /_snapshot/backup_repo
{
"type": "fs",
"settings": {
"location": "/backup/elasticsearch",
"compress": true,
"chunk_size": "100m"
}
}
# Create a snapshot
PUT /_snapshot/backup_repo/snapshot_20240115
{
"indices": "logs-2024.01.15",
"ignore_unavailable": true,
"include_global_state": false
}
# Scheduled backup script
#!/bin/bash
SNAPSHOT_NAME="snapshot_$(date +%Y%m%d_%H%M%S)"
curl -X PUT "http://es-node1:9200/_snapshot/backup_repo/${SNAPSHOT_NAME}"
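Alternatively, Elasticsearch's snapshot lifecycle management (SLM) can schedule and prune snapshots without external cron jobs (schedule and retention values are illustrative):
# SLM policy: daily snapshot at 02:30, retained for 30 days
PUT /_slm/policy/daily-logs-snapshot
{
  "schedule": "0 30 2 * * ?",
  "name": "<logs-snap-{now/d}>",
  "repository": "backup_repo",
  "config": {
    "indices": ["logs-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}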
4.4 Security and Compliance
Audit logging:
# elasticsearch.yml
xpack.security.audit.enabled: true
# In 8.x audit events are written to the logfile output (<clustername>_audit.json); the index output no longer exists.
# Restrict what gets recorded, for example:
xpack.security.audit.logfile.events.include: ["access_denied", "authentication_failed", "connection_denied"]
Data masking:
# Masking in a Logstash filter
filter {
if [type] == "payment" {
mutate {
gsub => [
"credit_card", "\d{4}-\d{4}-\d{4}-\d{4}", "XXXX-XXXX-XXXX-XXXX"
]
}
}
# Hash-based pseudonymization with the fingerprint filter
fingerprint {
  source => ["user_id", "ip"]
  target => "user_hash"
  method => "SHA256"
  concatenate_sources => true
}
}
V. Common Problems and Solutions
5.1 Performance Problems
Problem 1: Slow Elasticsearch indexing
# Diagnostics
GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected
# Remedies (an example follows)
# 1. Increase the bulk/batch size on the ingest side
# 2. Add Logstash pipeline workers or instances
# 3. Relax index settings such as refresh_interval
# 4. Check disk I/O performance
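For example, the refresh interval and replica count can be relaxed during heavy ingestion and restored afterwards (apply to a narrower index pattern in practice):
# Trade freshness and redundancy for write throughput, then revert after the bulk load
PUT /logs-*/_settings
{
  "index": {
    "refresh_interval": "60s",
    "number_of_replicas": 0
  }
}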
Problem 2: Query timeouts
# Optimize the query: bound it with a time filter and aggregate instead of returning hits
GET /logs-*/_search
{
"timeout": "30s",
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1h"
}
}
}
]
}
},
"size": 0,
"aggs": {
"error_count": {
"filter": {
"term": {
"level": "ERROR"
}
}
}
}
}
5.2 Data Loss
Filebeat backpressure and file handling:
# Filebeat applies backpressure automatically when outputs slow down; these input options tune read back-off and file handling
filebeat.inputs:
- type: log
paths:
- /var/log/*.log
backoff: 1s
max_backoff: 60s
backoff_factor: 2
close_inactive: 5m
clean_removed: true
Logstash persistent queues:
# Enable the persistent queue (these are logstash.yml settings, not a pipeline block)
queue.type: persisted
queue.max_bytes: 10gb
queue.page_capacity: 250mb
queue.checkpoint.writes: 1024
queue.checkpoint.interval: 1000
5.3 Cluster Health Problems
Preventing split brain:
# The discovery.zen.* settings (including minimum_master_nodes) were removed in Elasticsearch 7.x;
# quorum is now managed automatically. What matters is to:
# 1. Run at least three master-eligible nodes
# 2. List them in cluster.initial_master_nodes only for the very first cluster bootstrap
# elasticsearch.yml
cluster.initial_master_nodes: ["es-node1", "es-node2", "es-node3"]
Cluster status monitoring:
# Check cluster health
GET /_cluster/health
# Check node status
GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,load_5m
# Check shard allocation
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
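When shards remain unassigned, the allocation explain API reports the exact reason for a given shard:
# Explain why a specific shard cannot be allocated
GET /_cluster/allocation/explain
{
  "index": "logs-2024.01.15",
  "shard": 0,
  "primary": true
}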
VI. Extensions and Integrations
6.1 APM Integration
Deploy APM Server:
# apm-server.yml
apm-server:
host: "0.0.0.0:8200"
rum:
enabled: true
output.elasticsearch:
hosts: ["es-node1:9200", "es-node2:9200"]
username: "apm_server"
password: "${APM_PASSWORD}"
Application integration (a sketch using the Java agent's self-attach API; it requires the apm-agent-attach dependency, and the configuration keys follow the agent's documented option names):
// Java application: attach the Elastic APM agent programmatically at startup
import co.elastic.apm.attach.ElasticApmAttacher;
import java.util.Map;

public class ApmBootstrap {
    public static void init() {
        ElasticApmAttacher.attach(Map.of(
            "service_name", "order-service",
            "server_url", "http://apm-server:8200",
            "secret_token", System.getenv("APM_SECRET"),
            "capture_body", "all",
            "capture_headers", "true"
        ));
    }
}
6.2 Metrics Integration
Metricbeat configuration:
# metricbeat.yml
metricbeat.modules:
- module: system
metricsets:
- cpu
- memory
- diskio
- filesystem
- network
period: 10s
- module: elasticsearch
metricsets:
- node
- node_stats
period: 10s
hosts: ["http://es-node1:9200"]
output.elasticsearch:
hosts: ["http://es-node1:9200"]
index: "metricbeat-%{+YYYY.MM.dd}"
6.3 Integration with Security Tooling
SIEM integration:
# Security event processing in Logstash
filter {
if [type] == "security" {
# Parse the security log
grok {
match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:action} %{IP:src_ip} %{WORD:result}" }
}
# Threat-intelligence enrichment
if [src_ip] {
translate {
  source => "src_ip"
  target => "threat_level"
  dictionary_path => "/etc/logstash/threat_intel.yml"
  fallback => "unknown"
}
}
# Risk scoring
mutate {
add_field => { "risk_score" => 0 }
}
if [threat_level] == "high" {
mutate {
update => { "risk_score" => 100 }
}
}
}
}
VII. Summary and Outlook
Building an ELK log analysis platform is a systems-engineering effort that spans architecture design, performance optimization, and security compliance. A successful ELK platform should have the following characteristics:
- Highly available: clustered deployment with no single point of failure
- High performance: able to ingest massive log volumes with fast query response
- Easy to operate: automated operations with solid monitoring and alerting
- Secure and compliant: data encryption, access control, and audit trails
- Cost-optimized: sensible data retention policies and hot-cold separation
As cloud-native technology evolves, ELK keeps evolving with it. Elastic Cloud, serverless architectures, and AI-assisted analysis will further simplify how log analysis platforms are built and used.
Remember that building an ELK platform is not a one-off project; it requires continuous iteration and tuning. Start small, expand gradually, and adjust the architecture and configuration to match real business needs.
Appendix: Quick Command Reference
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# List indices
curl -X GET "localhost:9200/_cat/indices?v"
# Shard allocation
curl -X GET "localhost:9200/_cat/shards?v"
# Thread pool status
curl -X GET "localhost:9200/_cat/thread_pool?v&h=node_name,active,queue,rejected"
# JVM memory
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"
# Force merge
curl -X POST "localhost:9200/logs-2024.01.15/_forcemerge?max_num_segments=1"
# Delete old indices
curl -X DELETE "localhost:9200/logs-2023.*"
# List ILM policies
curl -X GET "localhost:9200/_ilm/policy"
# List snapshots
curl -X GET "localhost:9200/_snapshot/backup_repo/_all"
