1. Elasticsearch介绍

Elasticsearch是一个文档型的NoSQL数据库, 主要用于:

  • 实时全文搜索

  • 日志管理

  • 数据分析

  • 应用监控

2. 安装

2.1. 初始化数据目录

sudo mkdir -p ~/volumes/es/data0{1..3}/
sudo chmod 777 ~/volumes/es/ -R

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

2.2. docker方式启动

docker-compose.yml
version: '3'
services:
  es01:
    image: elasticsearch:7.4.2
    container_name: es01
    environment:
      - node.name=es01
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data01:/usr/share/elasticsearch/data
    ports:
      - 19200:9200
    networks:
      - esnet
  es02:
    image: elasticsearch:7.4.2
    container_name: es02
    environment:
      - node.name=es02
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data02:/usr/share/elasticsearch/data
    networks:
      - esnet
  es03:
    image: elasticsearch:7.4.2
    container_name: es03
    environment:
      - node.name=es03
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
      - ELASTIC_PASSWORD=sonic333
      - xpack.security.enabled=true
      - TZ=Asia/Shanghai
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ~/volumes/es/data03:/usr/share/elasticsearch/data
    networks:
      - esnet
  kibana:
    container_name: kibana
    image: kibana:7.4.2
    environment:
      - xpack.security.enabled=true
      - ELASTICSEARCH_HOSTS=http://es01:9200
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=sonic333
      - XPACK_GRAPH_ENABLED=true
      - TIMELION_ENABLED=true
      - XPACK_MONITORING_COLLECTION_ENABLED=true
      - SERVER_HOST=0.0.0.0
      - LOGGING_TIMEZONE=Asia/Shanghai
      - TZ=Asia/Shanghai
    ports:
      - 5601:5601
    networks:
      - esnet
networks:
  esnet:
启动
docker-compose up -d

3. 基本概念

3.1. Index

表示一类文档的集合.
比如用户索引, 商品索引等.
索引由一个全小写的名称标识, 对文档的CRUD操作均需指定索引名称.
每一个索引都有自己的Mapping, 定义了该索引下文档的字段名和字段类型.

Table 1. RDBMS vs Elasticsearch
RDBMS Elasticsearch

Table

Index

Row

Document

Column

Field

Schema

Mapping

SQL

DSL

3.2. Document

文档以JSON形式存在, 一个文档表示一条数据.
JSON中每个字段都有对应的字段类型 (boolean/number/string/date/binary/range等), 字段类型可以自己指定或者Elasticsearch自动推断.
每个文档都有一个ID, 可以自己指定, 或者Elasticsearch自动生成.

每个文档都有一些元数据字段:

  • _index: 文档所属的索引名

  • _id: 文档id

  • _source: 文档原始JSON数据

  • _version: 文档版本号

  • _score: 文档搜索时的评分

3.3. Term

多个单词形成 OR 查询, 不要求顺序.
比如 "Mac OS" 会查询某个字段存在 MacOS 的文档.

3.4. Phrase

多个单词形成短语, 作为一个整体查询.
比如 "Mac OS" 会查询某个字段存在 Mac OS 字符串的文档.

3.5. 倒排索引

倒排索引由两部分组成:

  • 单词词典: 记录所有单词, 单词到倒排列表的关系.

  • 倒排列表: 记录了单词所处文档的信息:

    • 文档id

    • 词频 (TF): 记录单词在文档中出现的次数, 用于相关性评分.

    • 位置 (Position): 记录单词在文档中分词的位置, 用于语句搜索.

    • 偏移 (Offset): 记录单词的开始和结束位置, 用于高亮显示.

3.6. 倒排索引例子

Table 2. 源文档
文档id 文档内容

1

Mastering Elasticsearch

2

Elasticsearch Server

3

Elasticsearch Essentials

Table 3. Elasticsearch 倒排列表
文档id TF Position Offset

1

1

1

<10,23>

2

1

0

<0,13>

3

1

0

<0,13>

3.7. Analysis

文本分析: 把全文本转换为一系列单词的过程, 也叫 分词 .
Elasticsearch通过 Analyser 实现分词.

Analyser由三部分组成:

  • Character filter: 处理原始文本, 比如去除html标签, unicode字符转换等.

  • Tokenizer: 将文本切分为多个单词.

  • Token filter: 处理切分后的单词, 比如转小写.

自定义Analyser
POST /_analyze
{
  "tokenizer": "standard",
  "char_filter": [{
    "type": "mapping",
    "mappings":["- => _"]
  }],
  "text": ["123-345"]
}

3.8. 集群

3.8.1. 节点

每个Elasticsearch实例是一个节点, 节点分为:

  • master节点: 可以被选举为master的节点, 执行维护集群状态, 创建/删除索引等操作

node.master= true
node.data= false
node.ingest= false
cluster.remote.connect= false

可以设置 node.voting_only: true 表示该节点只参与master选举, 但不会成为master.

  • data节点: 处理数据相关操作,比如 CRUD, 全文搜索, 聚合查询等

node.master= false
node.data= true
node.ingest= false
cluster.remote.connect= false
  • ingest节点: 负责执行ingest pipeline的节点

node.master= false
node.data= false
node.ingest= true
cluster.remote.connect= false
  • coordinating节点: 将请求路由分发到其他节点.

node.master= false
node.data= false
node.ingest= false
cluster.remote.connect= false
  • ml节点: 负责执行机器学习Job/处理机器学习请求的节点

node.ml= true
xpack.ml.enabled= true

4. 索引操作API

4.1. 创建索引

PUT /page_view_info
{
    "settings": {
        "number_of_shards": 2
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "long"
            },
            "companyId": {
                "type": "integer"
            },
            "landingPageId": {
                "type": "integer"
            },
            "formId": {
                "type": "integer"
            },
            "ip": {
                "type": "keyword"
            },
            "province": {
                "type": "keyword"
            },
            "city": {
                "type": "keyword"
            },
            "location": {
                "type": "keyword"
            },
            "ua": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "device": {
                "type": "keyword"
            },
            "os": {
                "type": "keyword"
            },
            "browser": {
                "type": "keyword"
            },
            "uid": {
                "type": "keyword"
            },
            "sid": {
                "type": "keyword"
            },
            "pid": {
                "type": "keyword"
            },
            "referrer": {
                "type": "keyword"
            },
            "origin": {
                "type": "keyword"
            },
            "osType": {
                "type": "keyword"
            },
            "browserType": {
                "type": "keyword"
            },
            "networkType": {
                "type": "keyword"
            },
            "medium": {
                "type": "keyword"
            },
            "channel": {
                "type": "keyword"
            },
            "priceRange": {
                "type": "keyword"
            },
            "screenSize": {
                "type": "keyword"
            },
            "screenDpi": {
                "type": "keyword"
            },
            "brand": {
                "type": "keyword"
            },
            "longitude": {
                "type": "keyword"
            },
            "latitude": {
                "type": "keyword"
            },
            "url": {
                "type": "keyword"
            },
            "lengthOfStay": {
                "type": "double"
            },
            "accessDepth": {
                "type": "double"
            },
            "fill": {
                "type": "byte"
            },
            "submitDataType": {
                "type": "byte"
            },
            "submitDataId": {
                "type": "integer"
            },
            "content": {
                "type": "keyword"
            },
            "fillContent": {
                "type": "keyword"
            },
            "fillAt": {
                "type": "date"
            },
            "year": {
                "type": "short"
            },
            "month": {
                "type": "byte"
            },
            "day": {
                "type": "byte"
            },
            "hour": {
                "type": "byte"
            },
            "minute": {
                "type": "byte"
            },
            "createdAt": {
                "type": "date"
            },
            "updatedAt": {
                "type": "date"
            },
            "auditable": {
                "type": "boolean"
            },
            "clickId": {
                "type": "keyword"
            },
            "telgetIsAutoTel": {
                "type": "boolean"
            },
            "telgetType": {
                "type": "byte"
            },
            "submitDataExt": {
                "type": "object"
            }
        }
    }
}

4.2. 查看索引详情

GET /page_view_info

4.3. 查看所有索引

GET /_cat/indices

4.4. 删除索引

DELETE /page_view_info

4.5. 创建索引模板

PUT /_template/page_view_info_template
{
    "index_patterns": "page_view_info_*",
    "aliases": {
        "page_view_info": {}
    },
    "settings": {
        "number_of_shards": 2
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "long"
            },
            "companyId": {
                "type": "integer"
            },
            "landingPageId": {
                "type": "integer"
            },
            "formId": {
                "type": "integer"
            },
            "ip": {
                "type": "keyword"
            },
            "province": {
                "type": "keyword"
            },
            "city": {
                "type": "keyword"
            },
            "location": {
                "type": "keyword"
            },
            "ua": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "device": {
                "type": "keyword"
            },
            "os": {
                "type": "keyword"
            },
            "browser": {
                "type": "keyword"
            },
            "uid": {
                "type": "keyword"
            },
            "sid": {
                "type": "keyword"
            },
            "pid": {
                "type": "keyword"
            },
            "referrer": {
                "type": "keyword"
            },
            "origin": {
                "type": "keyword"
            },
            "osType": {
                "type": "keyword"
            },
            "browserType": {
                "type": "keyword"
            },
            "networkType": {
                "type": "keyword"
            },
            "medium": {
                "type": "keyword"
            },
            "channel": {
                "type": "keyword"
            },
            "priceRange": {
                "type": "keyword"
            },
            "screenSize": {
                "type": "keyword"
            },
            "screenDpi": {
                "type": "keyword"
            },
            "brand": {
                "type": "keyword"
            },
            "longitude": {
                "type": "keyword"
            },
            "latitude": {
                "type": "keyword"
            },
            "url": {
                "type": "keyword"
            },
            "lengthOfStay": {
                "type": "double"
            },
            "accessDepth": {
                "type": "double"
            },
            "fill": {
                "type": "byte"
            },
            "submitDataType": {
                "type": "byte"
            },
            "submitDataId": {
                "type": "integer"
            },
            "content": {
                "type": "keyword"
            },
            "fillContent": {
                "type": "keyword"
            },
            "fillAt": {
                "type": "date"
            },
            "year": {
                "type": "short"
            },
            "month": {
                "type": "byte"
            },
            "day": {
                "type": "byte"
            },
            "hour": {
                "type": "byte"
            },
            "minute": {
                "type": "byte"
            },
            "createdAt": {
                "type": "date"
            },
            "updatedAt": {
                "type": "date"
            },
            "auditable": {
                "type": "boolean"
            },
            "clickId": {
                "type": "keyword"
            },
            "telgetIsAutoTel": {
                "type": "boolean"
            },
            "telgetType": {
                "type": "byte"
            },
            "submitDataExt": {
                "type": "object"
            }
        }
    }
}

4.6. 查看索引模板

GET /_template/page_view_info_template

4.7. 查看所有索引模板

GET /_cat/templates

4.8. 删除索引模板

DELETE /_template/page_view_info_template

5. 文档操作API

5.1. 查询文档

GET /users/_search

5.2. 创建文档

  • POST请求, Elasticsearch会自动生成id

POST /users/_doc
{
  "username": "Joan",
  "age": 11
}
  • PUT请求, 自己指定id

PUT /users/_create/1
{
  "age": 33
}

第二次请求会报错.

5.3. 索引文档

PUT /users/_doc/1
{
  "username": "Alex",
  "age": 33
}

第二次执行后, 会删除原有文档再根据请求体重新创建文档, _version 字段加一.

5.4. 获取文档

GET /users/_doc/1

5.5. 更新文档

POST /users/_update/1
{
  "doc": {
  "age":111,
  "username": "Bob"
  }
}

5.6. 删除文档

DELETE /users/_doc/1

5.7. 批量更新/删除/索引

bulk API对索引进行不同的操作.
批量处理过程中单条失败不会影响其他操作, 返回结果也返回了每条操作执行的结果.

POST /_bulk
{"create":{"_index":"users","_id":11}}
{"username":"Fatman"}
{"update":{"_index":"users","_id":11}}
{"doc":{"username":"Hollis"}}
{"index":{"_index":"users","_id":11}}
{"username":"Fatman"}
{"delete":{"_index":"users","_id":11}}

5.8. 批量获取

GET /_mget
{
  "docs": [
    {
      "_index": "users",
      "_id": 1
    },
    {
      "_index": "users",
      "_id": 2
    }
  ]
}

6. URL查询

GET /<index>/_search

6.1. 参数

  • q: 查询条件

  • df: 查询字段, 如果为空则查询所有字段

  • sort: 排序, 格式为 <field>:[asc|desc]

# 查询存在一个字段包含"Mac"的文档
GET /page_view_info/_search?q=Mac

# 查询ua字段包含"Mac"的文档
GET /page_view_info/_search?q=ua:Mac
GET /page_view_info/_search?df=ua&q=Mac

# 查询ua包含"Mac"或者其他字段包含"Firefox"的文档
GET /page_view_info/_search?q=ua:Mac Firefox

# 查询ua包含"Mac"或者"Firefox"的文档
GET /page_view_info/_search?q=ua:(Mac Firefox)

# 查询ua包含"Mac Firefox"的文档
GET /page_view_info/_search?q=ua:"Mac Firefox"

6.2. 查询条件操作符

6.2.1. 布尔操作符

  • AND OR NOT 必须大写

  • + -

# 查询ua包含"Mac"并且包含"Firefox"的文档
GET /page_view_info/_search?q=ua:(Mac AND Firefox)
GET /page_view_info/_search?q=ua:(+Mac AND +Firefox)

# 查询ua包含"Mac"并且但不包含"OS"的文档
GET /page_view_info/_search?q=ua:(Mac NOT OS)
GET /page_view_info/_search?q=ua:(+Mac -OS)

6.2.2. 范围查询操作符

# 查询year大于2018的文档
GET /page_view_info/_search?q=year:>2018

# 查询ua字段存在以"OP"开头的term的文档
GET /page_view_info/_search?q=ua:OP*

7. Query DSL

DSL查询示例
GET /page_view_info/_search
{
  "sort": {"id": "desc"},
  "_source": ["id", "pid", "createdAt"],
  "from": 0,
  "size": 20,
  "query": {}
}

7.1. query_string

# 查询ua或os字段包含"Mac"的文档
GET /page_view_info/_search
{
  "query": {
    "query_string": {
      "query": "Mac",
      "fields": ["ua", "os"]
    }
  }
}

7.2. match

# 查询ua包含"Mac"并且包含"Firefox"的文档
GET /page_view_info/_search
{
  "query": {
    "match": {
      "ua": {
        "query": "Mac Firefox",
        "operator": "and"
      }
    }
  }
}

7.3. match_phrase

# 查询ua包含"Mac Firefox"的文档
GET /page_view_info/_search
{
  "query": {
    "match_phrase": {
      "ua": {
        "query": "Mac Firefox"
      }
    }
  }
}

7.4. bool

# 条件查询
GET /page_view_info/_search
{
  "size": 1,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "companyId": {
              "value": 759
            }
          }
        },
        {
          "term": {
            "auditable": {
              "value": true
            }
          }
        },
        {
          "range": {
            "createdAt": {
              "from": "2020-01-06T00:00:00.000Z",
              "to": "2020-01-06T23:59:59.000Z",
              "include_lower": true,
              "include_upper": true
            }
          }
        },
        {
          "script": {
            "script": {
              "source": "String city = doc['city'].value; return city !=null && city !='' && city !='Unknown'",
              "lang": "painless"
            }
          }
        }
      ]
    }
  }
}

8. 聚合查询

Elasticsearch聚合查询分为4种:

  • Metric: 对文档字段进行数学运算或统计分析.

  • Bucket: 将文档按照条件分组.

  • Pipeline: 对聚合结果进行二次聚合.

  • Matrix: 对多个字段操作, 结果作为矩阵形式.

8.1. Metric

8.1.1. avg

平均值

计算填单率
GET /page_view_info/_search?size=0
{
    "aggs": {
      "fillRate": {
        "avg": {
          "field": "fill"
        }
      }
    }
}

8.1.2. weighted_avg

加权平均值: \$(sum(weight * value)) / (sum(weight))\$

GET /page_view_info/_search?size=0
{
  "size": 0,
  "aggs": {
    "weightAvgST": {
      "weighted_avg": {
        "value": {
          "field": "lengthOfStay"
        },
        "weight": {
          "field": "fill"
        }
      }
    }
  }
}

8.1.3. cardinality

distinct count

获取uv
GET /page_view_info/_search?size=0
{
    "aggs": {
      "uv": {
        "cardinality": {
          "field": "uid"
        }
      }
    }
}

8.1.4. stats

统计信息, 包括平均值, 最大值, 最小值, 总和, 次数计数.

统计停留时长
GET /page_view_info/_search?size=0
{
    "aggs": {
      "statsST": {
        "stats": {
        "field": "lengthOfStay"
        }
      }
    }
}

8.1.5. extended_stats

详细统计, 包括平均值, 最大值, 最小值, 标准差, 方差, 平方和等维度.

统计停留时长
GET /page_view_info/_search?size=0
{
    "aggs": {
      "extendedStatsST": {
        "extended_stats": {
        "field": "lengthOfStay"
        }
      }
    }
}

8.1.6. max

最大值

获取最大停留时长
GET /page_view_info/_search?size=0
{
    "aggs": {
      "maxST": {
        "max": {
          "field": "lengthOfStay"
        }
      }
    }
}

8.1.7. min

最小值

获取最小停留时长
GET /page_view_info/_search?size=0
{
    "aggs": {
      "minST": {
        "min": {
          "field": "lengthOfStay"
        }
      }
    }
}

8.1.8. sum

求和

统计填单数
GET /page_view_info/_search?size=0
{
    "aggs": {
      "fillAmount": {
        "sum": {
          "field": "fill"
        }
      }
    }
}

8.1.9. value_count

计数

统计pv
GET /page_view_info/_search?size=0
{
    "aggs": {
      "pv": {
        "value_count": {
          "field": "id"
        }
      }
    }
}

8.1.10. percentiles

百分位

查看 lengthOfStay
GET /page_view_info/_search?size=0
{
    "aggs": {
      "stPercentiles": {
        "percentiles": {
          "field": "lengthOfStay"
        }
      }
    }
}

8.1.11. percentile_ranks

数值所处的百分位

.查看 lengthOfStay 小于2秒和120秒的百分比
GET /page_view_info/_search?size=0
{
    "aggs": {
      "stPercentileRanks": {
        "percentile_ranks": {
          "field": "lengthOfStay",
          "values": [2, 120]
        }
      }
    }
}

8.1.12. top_hits

分组后排序/取前几条记录

.查看每个落地页最近10条PV
GET /page_view_info/_search?size=0
{
  "size": 0,
  "aggs": {
    "company": {
      "terms": {
        "field": "landingPageId"
      },
      "aggs": {
        "topPV": {
          "top_hits": {
            "_source": ["pid"],
            "sort": [{
              "createdAt": {
                "order": "desc"
              }
            }],
            "size": 10
          }
        }
      }
    }
  }
}

8.2. Bucket

8.2.1. term

将文档按照指定field分组

GET /page_view_info/_search?size=0
{
  "size": 0,
  "aggs": {
    "company": {
      "terms": {
        "field": "landingPageId"
      }
    }
  }
}

8.2.2. range

将文档按照范围分组

# lengthOfStay按[2,2-60,60-120,120]分组
GET /page_view_info/_search?size=0
{
  "size":0,
  "aggs": {
    "st": {
      "range": {
        "field": "lengthOfStay",
        "ranges": [
          {
            "to": 2
          },
          {
            "from": 2,
            "to": 60
          },
          {
            "from": 60,
            "to": 120
          },
          {
            "from": 120
          }
        ]
      }
    }
  }
}

8.2.3. histogram

将文档按照一定的间隔大小分组

# accessDepth直方图
GET /page_view_info/_search?size=0
{
  "size":0,
  "aggs": {
    "st": {
      "histogram": {
        "field": "accessDepth",
        "interval": 10
      }
    }
  }
}

8.2.4. filter

为某一聚合查询添加过滤条件

GET /sales/_search?size=0
{
    "aggs" : {
        "t_shirts" : {
            "filter" : { "term": { "type": "t-shirt" } },
            "aggs" : {
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        }
    }
}

8.2.5. filters

根据条件分组

#分别查看公司759和790的PV数量
GET /page_view_info/_search?size=0
{
  "aggs": {
    "landingPages": {
      "filters": {
        "filters": {
          "c759": {
            "bool": {
              "filter": {
                "term": {
                  "companyId": {
                    "value": 759
                  }
                }
              }
            }
          },
          "c790": {
            "bool": {
              "filter": {
                "term": {
                  "companyId": {
                    "value": 790
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

8.2.6. global

使该聚合查询忽略query查询条件

#统计759和所有公司的pv
GET /page_view_info/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "companyId": {
              "value": 759
            }
          }
        },
        {
          "range": {
            "createdAt": {
              "gte": "now/d-60d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "pv": {
      "value_count": {
        "field": "pid"
      }
    },
    "all_pv": {
      "global": {},
      "aggs": {
        "all_pv": {
          "value_count": {
            "field": "pid"
          }
        }
      }
    }
  }
}

8.2.7. missing

统计没有指定字段的文档数量

# 统计没有clickId的文档数量
GET /page_view_info/_search?size=0
{
  "aggs": {
    "missPV": {
      "missing": {
        "field": "clickId"
      }
    }
  }
}

8.3. Pipeline

8.3.1. max_bucket

# 找到pv最多的companyId
GET /page_view_info/_search?size=0
{
  "size": 0,
  "aggs": {
    "company": {
      "terms": {
        "field": "companyId"
      },
      "aggs": {
        "pv": {
          "cardinality": {
            "field": "id"
          }
        }
      }
    },
    "maxPvCompany": {
      "max_bucket": {
        "buckets_path": "company>pv"
      }
    }
  }
}