elasticsearch v6.3.0 monitor

Overview

上周四（0614）es推出了v6.3.0，期待已久的SQL和Rollups已经可以被使用，x-pack功能更加完善。想在此基础上，看看630的监控是怎么样。

监控

es最重要的目标是线上的稳定高性能，而稳定的定义就离不开监控。监控方式有多种，

通过curl命令定时抽取es的各项重点stats，然后输出到grafana
结合kibana，通过es官方提供的x-pack monitor/Marvel
瞬时状态监控Cerebro

x-pack monitor

监控组织图

x-pack monitor arch.

x-pack collectors

流程大概是：

安装带x-pack plugin的es和kibana
配置product-es-cluster的x-pack data collector（采集哪些指标），exporter（将采集指标写到哪里），duration（默认7d），采集频率xpack.monitoring.collection.interval（也可以在kibana设置，默认10s，越频繁，monitoring index越大）
product-es-cluster每隔interval将collector数据写入到exporter（即monitoring-es-cluster）
kibana每隔interval（这个interval与exporter的interval无关）从monitoring-es-cluster读取collector数据，然后展示在UI上
kibana会返回一条确认信息到collector（当连上master的时候）

安装

在6.3之前，es和kibana都需要额外安装x-pack plugin；而6.3已经集成了x-pack，直接配置一下即可使用。

安装es v630

elasticsearch.yml

安装kibana vv630

kibana.yml

启动

起来之后，看看2个node时候连上，再看看kibana是否连上。 es master log

kibana log

配置

直接使用x-pack提供的monitor，在kibana上查看监控数据的趋势。或者使用动态setting来更新配置。

CMD

HOST=127.0.0.1:9200

curl -s -X GET "$HOST/_cluster/health?pretty"
curl -s -XGET "$HOST/_cat/indices?v"
curl -s -XGET "$HOST/_cat/nodes?v"

# setting check
curl -s -X GET "$HOST/_cluster/settings?pretty"

# dynamic setting
curl -s -X PUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent" : {
    "xpack.monitoring.collection.enabled" : "true",
    "xpack.monitoring.collection.interval" : "30s"
  }
}'

curl -s -X PUT "$HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "xpack.monitoring.collection.interval" : null
  }
}'

# create mapping
curl -s -X PUT "$HOST/person" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 0,
        "refresh_interval": "10s",

        "translog": {
            "flush_threshold_size": "10mb",
            "sync_interval": "10s",
            "durability": "async"
        },

        "merge": {
            "policy": {
                "segments_per_tier": "25",
                "floor_segment": "100mb"
            }
        }
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "id" : { "type" : "integer" },
                "name" : { "type" : "text" },
                "address" : { "type" : "text" }
            }
        }
    }
}
'

# generate docs
for i in {1..500000}
do
    curl -s -X POST "$HOST/person/_doc?pretty" -H 'Content-Type: application/json' -d'{  "id": "4", "name": "chenfh5", "address": "Beijing Chaoyang"}' > /dev/null
done

# check docs
curl -s -X GET "$HOST/.monitoring-es-6-2018.06.19/_search?pretty&size=1000" -H 'Content-Type: application/json' -d'
{
    "sort" : [
        { "timestamp" : {"order" : "desc"}}
    ]
}' | grep 'timestamp' | head -n 100

Others

Cerebro现在也从以前的kopf plugin形式脱离出来，做成直接通过http的形式来crul es stat以便渲染UI。所以直接配置es master的http端口，即可使用Cerebro来展示es集群的瞬时状态。

自定义监控grafana实例

安装grafana

grafana version

为grafana添加(时序)数据库，用以保存历史数据作对比

添加data source

编写定时脚本收集相关监控指标数据，并上报到grafana在步骤2所添加的db里如elasticsearch2elastic这个脚本，里面的， ```
希望监控的es集群

ElasticSearch Cluster to Monitor

elasticServer = os.environ.get(‘ES_METRICS_CLUSTER_URL’, ‘http://srcServer:9200’)

上报监控数据的目的地，即grafana的data source

ElasticSearch Cluster to Send Metrics

elasticMonitoringCluster = os.environ.get(‘ES_METRICS_MONITORING_CLUSTER_URL’, ‘http://destServer:9200’)

4. 查看grafana的data source是否有上报数据

![check if the monitor data have been uoloaded?](https://upload-images.jianshu.io/upload_images/2189341-c6a71b1cf9d45c39.png)

5. 配置dashboard
根据上报的数据，配置自己喜欢的监控仪表盘（New dashboard）。也可以使用社区已经建好的[dashboard](https://grafana.com/dashboards/878)。
这里采用了在社区版基础上修改为自己喜欢的。
先import社区版的到grafana里。`Home -> Import dashboard`
![因为网络问题，没有采用id/url的方式，而采用了离线json安装方式，具体采用了哪些指标，指标应用了哪些函数，都可以在json文件上看到，当然也可以在导入之后直接在grafana里面修改](https://upload-images.jianshu.io/upload_images/2189341-716682501e33f4a7.png)

6. 展示

![snapshot](https://upload-images.jianshu.io/upload_images/2189341-a0f86bddfdb387a4.png)

7. 自定义
通过定时脚本上传自定义数据，然后在metric里面应用各类函数得到最后想要的监控指标，然后选择合适图表来展示。如，每个RT的总数据量，每个RT的每日数据量。

![](https://upload-images.jianshu.io/upload_images/2189341-eaed0657ec54acfb.png)

8. 告警
利用grafana的alert功能

----
# 坑
1. grafana设置多集群
![在setting里面设置分集群](https://upload-images.jianshu.io/upload_images/2189341-d3bec16f262f3061.png)

![具体在tab里面引用重命名](https://upload-images.jianshu.io/upload_images/2189341-f0e0c01436fb305d.png)

另外，由于cluster_name这个在默认写的时候是text，导致了使用`query_string`查的时候这个cluster_name形同虚设，那么就会导致每个集群查到的数据都是一样的，取max之后也就都一样了。

![540](https://upload-images.jianshu.io/upload_images/2189341-bb8a7581c7fa1bff.png)

![630](https://upload-images.jianshu.io/upload_images/2189341-842a547be15698ee.png)

![cluster_name的mapping](https://upload-images.jianshu.io/upload_images/2189341-43cbb18eb54f0c82.png)

正确的打开方式，根据grafana的查询语句，来改造自己的mapping，即把**cluster_name**改为keyword。

![540_correct](https://upload-images.jianshu.io/upload_images/2189341-472210893fbc6634.png)

![630_correct](https://upload-images.jianshu.io/upload_images/2189341-73e1724744fa0f6c.png)

![cluster_name的mapping_correct](https://upload-images.jianshu.io/upload_images/2189341-551ba84854557817.png)

2. grafana es template variables
今天再来说说上面遇到的variable配置了query之后没有结果的问题。

先说结果，

not work

{“find”: “terms”, “fields”: “cluster_name”}

work

{“find”: “terms”, “field”: “cluster_name”, “query”: “”} ```

$error {"type":"illegal_state_exception","reason":"value source config is invalid; must have either a field context or a script or marked as unwrapped"}$

have resp

从上面可以看到如果只使用field，而没有query的话，grafana翻译出来的esDsl是错误的，即不带cluster_name这个field。而如果带上query就会有，而在带上query之后，再回退到不带query的方式，有时候可以，有时候不可以。而根据grafana提供的文档，如果对field没有特殊query需求，是可以不带query的。

grafana doc