elasticsearch training notes

2017-11-14

Notes taken during elasticsearch training.

Components

Clustering

Give the cluster a sensible name; it defaults to “elasticsearch”

Nodes

each node has a name; set it to something that makes sense, eg node1, node2, otherwise it uses the first 7 characters of its UUID

Documents

Index

Shard

Indexes are partitioned into shards and distributed across multiple nodes

Each shard is a standalone lucene index

Default number of primary shards for an index is 5

Index details

Dynamic index creation

Normally good idea to turn off

set action.auto_create_index to false to disable
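eg as a dynamic cluster setting (it can also go in the yml file):

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "false"
    }
}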

Indexing with a specified ID

leave off the ID to have elasticsearch generate one
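eg (index and type names are made up; this is the pre-7 syntax with a type in the path):

PUT my_tweets/tweet/1
{ "comment": "hello" }

POST my_tweets/tweet
{ "comment": "hello" }

the first assigns ID 1; the second lets elasticsearch generate the ID.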

CRUD

update a document by sending a partial doc to the update API, eg POST my_tweets/tweet/1/_update (index/type/id are made-up examples) with body:

{
  "doc": {
    "comment" : "updated comment"
  }
}

Retrieve

Use GET to retrieve

Bulk API

when using curl the bulk body needs an extra newline on the end; the console adds it for you
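eg with curl (index/type names made up); note the body file must end with a newline:

curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk' --data-binary @requests.ndjson

where requests.ndjson contains one action line then one source line per document:

{ "index": { "_index": "my_tweets", "_type": "tweet" } }
{ "comment": "hello" }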

Configuring

try to move startup options into the yml config file

Index settings

shards

primary shards default to 5 (replicas to 1); eg on a 2 node cluster:

(or maybe not!??) not sure if this works like this:

node 1: shards 1,2,3
node 2: shards 4,5

if you lose node 1, you will still have at least 2 shards.

read only and read write on index

make an index read only:

PUT my_tweets/_settings
{
    "index.blocks.write": true
}

(set it back to false to allow writes again)

node settings

in the yml config file you can specify different paths, eg path.data and path.logs

cluster settings

can turn on logging for the cluster; it’s an instant (dynamic) change, but can produce a lot of data

persistent or transient (survives a restart or not)

be careful not to set the persistent minimum master nodes setting above the real node count

eg if you set it way too high (eg 200) then the cluster will never form until it reaches 200 nodes. it can’t be reduced because the cluster can’t actually start, and you basically have to throw away your data to start over.

precedence

  1. transient
  2. persistent
  3. cli
  4. yml

some settings can only be set in the yml file

eg name of cluster, number of shards

example of dynamic settings

cluster and node name

cluster name can be set on the cli, but don’t: use the yml file

node name defaults to part of its UUID under the hood, but that’s crazy town, don’t do it

http vs transport

http for rest api, port 9200

transport for internal, port 9300

http.host is localhost by default

transport.host is localhost by default

hardcoding network addresses can be avoided with special values, eg _site_ and _global_; see the docs

Development vs. Production Mode

transport.bind_host

Explicitly Enforcing Bootstrap Checks

DO NOT disable bootstrap checks, you will regret it; you can explicitly enforce them with:

-Des.enforce.bootstrap.checks=true

JVM Configuration

do not set the jvm heap above 32g; there is no value in it (you lose compressed object pointers) and it’s a waste of resources.

best node has 64g mem, 32g for elasticsearch, 32g for OS

Set Xms and Xmx to the same size, and typically to no more than 50% of your physical RAM

sometimes it’s worth going smaller to speed up cluster

JVM Heap Size

the default of 2g is too small; 8g is good; do not exceed 30g (to stay under the 32g compressed-oops cutoff)

see https://www.elastic.co/blog/a-heap-of-trouble

node roles

|--------------------------------|
| Master Master Master           |
|                                |
| Data Data Data Data Data       |
|------------------|----|--------|
                   |    |
Ingest Ingest -> Coord Coord

master nodes

minimum 3 master nodes

set minimum master nodes for quorum to (n/2) + 1
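a quick sketch of the quorum arithmetic (function name is mine, not from the training material):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    # majority of master-eligible nodes: floor(n/2) + 1 avoids split brain
    return master_eligible // 2 + 1

# 3 master-eligible nodes need a quorum of 2
print(minimum_master_nodes(3))
```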

Dedicated Master and Data Nodes

do not send client requests to dedicated master nodes

dedicated master nodes do not need to be big and beefy, storage etc

Configuring Dedicated Nodes

dedicated master:

node.master: true
node.data: false
node.ingest: false

dedicated data:

node.master: false
node.data: true
node.ingest: false

coordinating only:

node.master: false
node.data: false
node.ingest: false

leave other nodes available for queries, in case the coord node goes down.

dedicated ingest:

node.master: false
node.data: false
node.ingest: true

machine learning

give range, learn normal, alert when outside of expected

tribe node

lets you query across multiple clusters, but doesn’t scale well

deprecated, but see cross cluster search

shards

default number of primary shards for an index is 5

it’s fixed once created and can’t be changed without a reindex

dynamic indexes

disable or enable, or you can whitelist patterns

document routing

used to force documents into a specific shard, useful for parent/child relationships

deleting an index

can delete indexes with wildcards, but this can be disabled:

PUT _cluster/settings
{
    "persistent": {
        "action.destructive_requires_name": true
    }
}

alias for indexes

an alias is an alternative name for one or more indexes

see page 174 and 175 of training pdf

useful when GET from lots of indexes

eg GET trx-20171112,trx-20171113/_search

or instead GET month/_search

useful to decouple index name from code, eg you want to move an old index to a new index with more shards.
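a sketch of creating that alias (index names taken from the example above; "month" is the alias):

POST _aliases
{
    "actions": [
        { "add": { "index": "trx-20171112", "alias": "month" } },
        { "add": { "index": "trx-20171113", "alias": "month" } }
    ]
}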

index templates

index templates are useful when indexes are built every eg day, and you want it diff from default settings for cluster.

saves you setting every time you create the index
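a sketch (template and pattern names made up; "index_patterns" is the 6.x key, older versions used "template"):

PUT _template/daily_trx
{
    "index_patterns": ["trx-*"],
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    }
}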

chapter 6

exact values vs full text

stop words are removed as part of the full text analysis, such as “to”, “an”, “a”

desc: "An apple a day"

desc.keyword: "An apple a day"

desc (after analysis): ["apple", "day"]

mappings

field data types: text, keyword, date. integer etc

mappings give the json fields value types after ingestion, eg integer etc

for values like err_1623 and err_1802 use keyword; if you use “text” they will get chopped up, eg into “err” and “1623”

(aside: nested docs are reindexed together when something changes anywhere down the tree; parent/child docs are independent)

define mappings

elasticsearch will guess but can get it wrong

do it when creating new index

can you change a mapping? No, you need to reindex
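a sketch of defining mappings at index creation (names made up; pre-7 syntax with a type):

PUT my_tweets
{
    "mappings": {
        "tweet": {
            "properties": {
                "comment": { "type": "text" },
                "err_code": { "type": "keyword" },
                "posted": { "type": "date" }
            }
        }
    }
}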

Searching

by default it only returns top 10, need to set the size to get more
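eg (index name made up):

GET my_tweets/_search
{
    "size": 20,
    "query": { "match_all": {} }
}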

segments

Live in shards; created every 1 second (index.refresh_interval), or when the indexing buffer is full

can change index.refresh_interval eg 5 seconds
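eg (index name made up):

PUT my_tweets/_settings
{
    "index.refresh_interval": "5s"
}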

each segment is a file on disk under data dir

be careful about inodes

transaction log

can be replayed after a sudden outage; the replay recovers writes that were still in memory and not yet in segments on disk

flush

can force a flush, and a flush with sync; you should not need to, but maybe before backups are taken?

merging segments

mostly automatic, logs you want to use “merging with index only”

Can force merge api, again useful for backups or moving to cold storage

only use on old data that wont be written to again
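eg force merging an old index down to one segment (index name made up):

POST logs-2017.10/_forcemerge?max_num_segments=1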

(note: check out a tool called “curator”: snapshot, restore, close, forcemerge)

chapter 7

reindex

do I need to create the new index before? No, but if there are issues with the original they will be copied as well.

reindex is powerful for building specific indexes from a master index, eg “just items with item == disney” etc
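a sketch of such a filtered reindex (index and field names made up):

POST _reindex
{
    "source": {
        "index": "master",
        "query": { "term": { "item": "disney" } }
    },
    "dest": { "index": "disney-items" }
}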

multiple sources

eg. combine multiple indexes for each day of the month into one monthly index.

can also “rollover” or “shrink”, may also be useful for combining logs etc

reindex from remote cluster

move data from eg dev to uat

closing index

reduces load on cluster, no ram or cpu, disk only

index is then not available for operations, search etc

replicas are removed, so do a snapshot or you will lose data if you lose a node

Every primary shard is kept though, so make sure the primaries aren’t all on the same node or you will lose them.

delayed shard allocation

useful if the network dips for a few seconds and you don’t want replicas etc recreated

primary shards are always allocated straight away; the delay applies to replicas only

default is 1 min

useful for upgrading nodes, eg it will be back in 5 mins after reboot.
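eg bumping the delay to 5 minutes before a reboot (index name made up):

PUT my_tweets/_settings
{
    "index.unassigned.node_left.delayed_timeout": "5m"
}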

index priority

index.priority can be changed to make certain indexes be recovered before others
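eg (index name made up; higher values are recovered first):

PUT my_tweets/_settings
{
    "index.priority": 10
}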

total shards per node

hot warm data nodes

hot nodes: powerful servers, useful for ingesting data, bulk ingest etc

warm nodes: less powerful servers in the cluster, useful for queries, spreading out load

can also be useful for providing better searches when customer pays more money

see shard filtering

shard filtering

node.attr does not include the node name, so if you want to exclude one specific node you have to set an attribute on it.
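a sketch: tag nodes in the yml, then pin an index to the hot tier (attribute and index names made up):

node.attr.box_type: hot

PUT my_tweets/_settings
{
    "index.routing.allocation.require.box_type": "hot"
}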

chapter 8 capacity planning

can always use reindex from a remote cluster to move data into an archive cluster

shard over-allocation: having more shards than nodes

makes it easier to grow the cluster onto new nodes

a little over allocation is good, but too much is bad

capacity planning

depends on your use case

define sla before starting

number of primary shards

index in parallel until you hit error code 429 (over capacity)

scaling with replicas

capacity planning

fixed sized data

data grows slowly but lots of searches

time based data

lots of data but not huge searches, logs etc

stuff with timestamps etc

searching usually involves a time stamp

search for recent events

time frame

time based data is best organised using time based indices

see page 359 and 360

set up aliases for time based searches across multiple date spans

(note: an alias to a single index can be written and read; an alias to multiple indices can only be read)

VERY good tool to test cluster performance and benchmark is Rally / esrally (open source)

http://demo.elastic.co

xpack monitoring is free (xpack basic)

chapter 9 cluster management

shard allocation awareness

cluster.routing.allocation.awareness

forced awareness

You can configure forced awareness to avoid overwhelming a zone

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "my_rack_id",
        "cluster.routing.allocation.awareness.force.my_rack_id.values": "rack1,rack2"
    }
}

installing plugins

xpack is a plugin from elastic

page 381

cluster backup

snapshot and restore

repository

on every node

repository types, see page 389

taking a snapshot

Try not to take snapshots of indexes that are currently being written to, eg today’s

not encrypted, but not readable outside elasticsearch

restoring from a snapshot

renaming indices: you can rename when you restore so you don’t overwrite data

restore to a different cluster: can also be used to move data from another cluster

incremental snapshots

snapshots are segment level, as are increments

Chapter 10 Monitoring

_cluster/pending_tasks: check cluster tasks if the cluster is sluggish; it will show what’s currently queued

Monitoring (xpack)

if indexing goes up and not coming down, good idea to scale out

index latency should be less than 1000ms ideally, but depends

Stats API

Cluster, Node, Indices stats; json

Tasks monitoring

check pending tasks first

GET _cluster/pending_tasks

Cat API

GET _cat/nodes

xpack monitoring

best practice to set up monitoring on its own separate dedicated cluster

search latency, send alert if taking too long

The Search Slow Log is off by default, you have to turn it on. see page 443 for more info

check thread_pool for rejections and long running queries on nodes; rejections show the cluster is having issues.

logstash

Chapter 11 upgrading cluster

minor upgrade: rolling upgrade

major upgrade: needs cluster stop, upgrade, start

Rolling upgrade

Full cluster upgrade

use kafka as a message broker to hold data

do a sync flush before a node restart; it will commit everything to disk
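the synced flush before the restart:

POST _flush/synced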

Chapter 12 Production checklist

security

out of the box there is no authentication, authorization or encryption

Bootstrap checks

disable swapping; also means Xms and Xmx heap sizes should be the same in the JVM??? see page 492

Linux checks

see page 494 for details

Best practices

avoid running over WAN Links

minimise hops between nodes

don’t use LVM, raid etc, just a straight FS on the disk; lose the raid and you lose all disks, better to lose individual disks and the node will be okay. see page 499

Todo