2017-11-14
Notes taken during elasticsearch training.
Give the cluster a sensible name; it defaults to “elasticsearch”
each node has a name, but set it to something that makes sense, eg node1, node2, otherwise it uses the first 7 characters of the node UUID
Indexes are partitioned into shards and distributed across multiple nodes
Each shard is a standalone lucene index
Default for an index is 5 primary shards
Normally a good idea to turn off automatic index creation:
set action.auto_create_index to false to disable it
leave off the ID to have elasticsearch generate one
my_index/doc/4
my_index/doc/4/_create
my_index/doc/4
my_index/doc/4/_update
{
"doc": {
"comment" : "updated comment"
}
}
my_index/doc/4
Use GET to retrieve
POST _bulk
when using curl the body needs an extra newline on the end, the console doesn’t need it
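A minimal Python sketch of the newline rule above: every action and every document sits on its own line, and the whole body ends with a newline. The index name, type and documents are made up for illustration.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Return an NDJSON _bulk body: one action line, then one source line, per document."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: tells Elasticsearch what to do with the line that follows.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Source line: the document itself, serialised onto a single line.
        lines.append(json.dumps(source))
    # The trailing newline is the "extra new line" that curl users must send.
    return "\n".join(lines) + "\n"

body = build_bulk_body("my_index", "doc", {"4": {"comment": "first"}, "5": {"comment": "second"}})
print(body, end="")
```

If the body is read from a file, curl needs --data-binary rather than -d, because -d does not preserve newlines.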
try to move start up options to yml config file
number of shards defaults to 5 (not 3, see above); on a 2 node cluster they might be allocated like this:
(or maybe not!??) not sure if this works like this:
node 1: shard 1,2,3
node 2: shard 4,5
if you lose node 1, then you will still have at least 2 shards.
make an index read only (block writes):
PUT my_tweets/_settings {
"index.blocks.write": true
}
in yml config file can specify diff path for
can turn on logging for the cluster, instant change, can produce a lot of data
persistent or transient (survives a full cluster restart or not)
be careful not to set a persistent node-count setting for the cluster above the real number of nodes
eg if you set it way too high (eg 200) then the cluster will never start until it gets to 200 nodes. it can’t be reduced because the cluster can’t actually start, and you basically have to throw away your data to start again.
some settings can only be set in the yml file
eg name of cluster, number of shards
cluster name can be set on the command line, but don’t, use the yml file
node name can be the same as its UUID under the hood, but it’s crazy town and don’t do it
http for rest api, port 9200
transport for internal, port 9300
http.host is localhost by default
transport.host is localhost by default
network addresses don’t have to be hardcoded, eg use the special values _site_ or _global_, see the docs
transport.bind_host
DO NOT set this to false, you will regret it:
-Des.enforce.bootstrap.checks=true
do not set the jvm heap above 32g, there is no benefit and it’s a waste of resources.
best node has 64g mem, 32g for elasticsearch, 32g for OS
Set Xms and Xmx to the same size, and typically to no more than 50% of your physical RAM
sometimes it’s worth going smaller to speed up cluster
default is 2g which is too small, 8g is good, do not exceed 30g (to stay under the 32g limit above)
see https://www.elastic.co/blog/a-heap-of-trouble
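The heap rules above can be sketched as a small Python helper. It is only an illustration of the arithmetic in these notes (half of RAM, same Xms and Xmx, capped at 30g); the function names are made up.

```python
def recommended_heap_gb(physical_ram_gb, cap_gb=30):
    """Half of physical RAM, capped at 30g to stay under the 32g limit from the notes."""
    return min(physical_ram_gb // 2, cap_gb)

def jvm_heap_options(physical_ram_gb):
    """Return Xms and Xmx flags with identical sizes."""
    heap = recommended_heap_gb(physical_ram_gb)
    return ["-Xms{}g".format(heap), "-Xmx{}g".format(heap)]

for ram in (16, 64, 128):
    print(ram, "GB RAM ->", " ".join(jvm_heap_options(ram)))
```

The remaining RAM is left to the OS, which matches the "32g for elasticsearch, 32g for OS" split on a 64g node.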
|--------------------------------|
|     Master  Master  Master     |
|                                |
|  Data  Data  Data  Data  Data  |
|------------------|----|--------|
                   |    |
  Ingest Ingest -> Coord Coord
minimum 3 master nodes
set the minimum number of master-eligible nodes (discovery.zen.minimum_master_nodes) to a quorum: (n/2) + 1
do not send client requests to dedicated master nodes
dedicated master nodes do not need to be big and beefy, storage etc
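A quick Python check of the (n/2) + 1 rule above, using integer division. It also shows why 3 master nodes is the minimum: with 2 nodes the quorum is 2, so losing either one stops the cluster from electing a master.

```python
def master_quorum(master_eligible_nodes):
    """(n / 2) + 1, with the division rounded down."""
    return master_eligible_nodes // 2 + 1

for n in (2, 3, 4, 5):
    quorum = master_quorum(n)
    # Nodes that can be lost while a master can still be elected.
    print(n, "master-eligible nodes -> quorum", quorum, "-> can lose", n - quorum)
```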
dedicated master node:
node.master: true
node.data: false
node.ingest: false
data node:
node.master: false
node.data: true
node.ingest: false
coordinating only node:
node.master: false
node.data: false
node.ingest: false
leave other nodes for queries, in case the coord node goes down.
ingest node:
node.master: false
node.data: false
node.ingest: true
give range, learn normal, alert when outside of expected
lets you query across multiple clusters, but doesn’t scale well
deprecated, but see cross cluster search
default number of primary shards for an index is 5
it’s fixed once created and can’t be changed without a reindex
disable or enable, or you can whitelist patterns
used to force into a diff shard, useful for parent/child relationships
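A toy Python illustration of the routing note above. Elasticsearch documents the rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The zlib.crc32 call here is only a stand-in for the real hash function, and the routing value is made up.

```python
import zlib

def pick_shard(routing_value, number_of_primary_shards):
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# The same routing value always lands on the same shard, which is why
# a parent and its children are indexed with the same routing value.
print(pick_shard("user42", 5), pick_shard("user42", 5))

# Changing the shard count changes the modulus, so existing documents
# could no longer be found: hence the shard count is fixed at creation.
print(pick_shard("user42", 5), pick_shard("user42", 6))
```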
can delete indexes with wildcards, but this can be disabled
PUT _cluster/settings
{
"persistent": {
"action.destructive_requires_name" : true
}
}
an alias is used as another name for one or more indexes
see page 174 and 175 of training pdf
useful when GET from lots of indexes
eg GET trx-20171112,trx-20171113/_search
or instead GET month/_search
useful to decouple index name from code, eg you want to move an old index to a new index with more shards.
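The alias uses above can be sketched in Python as bodies for POST _aliases. The helper name and the trx_v1 / trx_v2 index names are made up; the add and remove actions are the standard alias actions.

```python
import json

def alias_actions(alias, add=(), remove=()):
    """Build a POST _aliases body; all actions in one request are applied atomically."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove]
    actions += [{"add": {"index": name, "alias": alias}} for name in add]
    return {"actions": actions}

# One alias over several daily indexes, so GET month/_search covers them all.
month = alias_actions("month", add=["trx-20171112", "trx-20171113"])

# Moving code from an old index to a new one with more shards:
# the application keeps using "trx" while the alias is swapped.
swap = alias_actions("trx", add=["trx_v2"], remove=["trx_v1"])

print(json.dumps(month, indent=2))
print(json.dumps(swap, indent=2))
```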
index templates are useful when indexes are built regularly, eg every day, and you want them to differ from the default settings for the cluster.
saves you setting every time you create the index
stop words are removed as part of the full text analysis, such as “to”, “an”, “a”
desc: "An apple a day"
desc.keyword: "An apple a day"
desc: ["apple", "day"]
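A toy Python version of the example above, to show the difference between a text field and a keyword field. This is not the real Lucene analysis chain: the stop list only holds the words from these notes, and it assumes an analyzer with stop word removal configured.

```python
import re

# Only the examples from the notes, not a real stop word list.
STOP_WORDS = {"a", "an", "to"}

def analyze_text(value):
    """Lowercase, split into words, drop stop words (roughly what a text field does)."""
    tokens = re.findall(r"\w+", value.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def analyze_keyword(value):
    """A keyword field keeps the whole value as a single, unchanged term."""
    return [value]

print(analyze_text("An apple a day"))
print(analyze_keyword("An apple a day"))
```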
field data types: text, keyword, date, integer etc
mappings are applied to the json at ingestion and give the fields value types, eg integer etc
err_1623
and err_1802
use a keyword for this, if you use “text” it will get chopped up: eg “err” and “1623”
(aside: nested means update all items if something removed down tree, parent child doesnt)
elasticsearch will guess but can get it wrong
do it when creating new index
can you change a mapping? No, needs to reindex
by default it only returns top 10, need to set the size to get more
Segments live in shards; a new one is created every 1 second (index.refresh_interval), or when the buffer is full
can change index.refresh_interval, eg to 5 seconds
each segment is a file on disk under data dir
be careful about inodes
can replay if sudden outage, should replay to write segments from mem to disk
can force flush and flush with sync, should not need to do, but maybe before backups taken?
mostly automatic, logs you want to use “merging with index only”
Can force merge api, again useful for backups or moving to cold storage
only use on old data that wont be written to again
(note: check out a tool called “curator”: snapshot, restore, close, forcemerge)
do I need to create new index before? No, but if issues with orig then they will be copied as well.
external index: powerful for building specific indexes from a master index, eg “just with item == disney” etc
eg. combine multiple indexes for each day of the month into one month.
can also “rollover” or “shrink”, may also be useful for combining logs etc
move data from eg dev to uat
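A Python sketch of the reindex bodies described above, for POST _reindex. The index names and the item field are made up for illustration; the source, query and dest keys are the standard reindex request fields.

```python
import json

def reindex_body(source_index, dest_index, field=None, value=None):
    """Build a POST _reindex body, optionally copying only documents matching a term."""
    source = {"index": source_index}
    if field is not None:
        # Only documents matching this term query are copied.
        source["query"] = {"term": {field: value}}
    return {"source": source, "dest": {"index": dest_index}}

# "just with item == disney" from a master index
disney = reindex_body("master_index", "disney_items", field="item", value="disney")

# combine daily indexes into one monthly index (source.index may be a list)
monthly = reindex_body(["trx-20171112", "trx-20171113"], "trx-201711")

print(json.dumps(disney, indent=2))
print(json.dumps(monthly, indent=2))
```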
reduces load on cluster, no ram or cpu, disk only
index is then not available for operations, search etc
removes replications, do a snapshot or you will lose data if you lose a node
Will keep every primary shard though, so make sure shards aren’t on the same node or you will lose them.
useful if the network dips for a few seconds and you don’t want replicas etc created
primary shards will always happen, it’s for replication only
default is 1 min
useful for upgrading nodes, eg it will be back in 5 mins after reboot.
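The delay above is controlled by the index.unassigned.node_left.delayed_timeout index setting. A small Python sketch of a settings body (for example for PUT _all/_settings) that raises it to 5 minutes before a planned reboot; the helper name is made up.

```python
import json

def delayed_allocation_settings(timeout="5m"):
    """Body that delays replica reallocation after a node leaves the cluster."""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}

print(json.dumps(delayed_allocation_settings("5m"), indent=2))
```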
index.priority can be changed to make certain indexes be recovered before others
hot nodes are powerful servers, useful for ingesting data, bulk ingesting etc
warm nodes are less powerful servers in the cluster, useful for queries, spread out load
can also be useful for providing better searches when customer pays more money
see shard filtering
node.attr, does not contain node name, so if you want to exclude one specific node you have to set it.
can always use reindex from cluster to move into an archive cluster
shard over allocation: number of shards is more than the number of nodes
makes it easier to grow the cluster onto new nodes
a little over allocation is good, but too much is bad
depends on your use case
define sla before starting
index in parallel until error code 429 - over capacity
data grows slowly but lots of searches
lots of data but not huge searches, logs etc
stuff with timestamps etc
searching usually involves a time stamp
search for recent events
time based data is best organised using time based indices
see page 359 and 360
set up aliases for time based searches across multiple date spans
(note: alias to a single index can be written and read, alias to multiple indices can only be read)
VERY good tool to test cluster performance and benchmark is esrally (Rally)
(open source)
xpack monitoring is free (xpack basic)
cluster.routing.allocation.awareness
You can configure forced awareness to avoid overwhelming a zone
PUT _cluster/settings
{
"persistent": {
"cluster": {
"routing": {
"allocation.awareness.attributes": "my_rack_id",
"allocation.awareness.force.my_rack_id.values": "rack1,rack2"
}
}
}
}
xpack is a plugin from elastic
page 381
snapshot and restore
on every node
repository types, see page 389
Try not to take snapshots on indexes that are currently being written to eg today
not encrypted, but not readable outside elasticsearch
renaming indices: you can rename when you restore so you don’t overwrite data
restore to a diff cluster: also used to move data from another cluster
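A Python sketch of a restore body that renames indices on the way in, for POST _snapshot/<repository>/<snapshot>/_restore. The index pattern and the restored- prefix are made up; rename_pattern and rename_replacement are the restore options used for renaming.

```python
import json

def restore_body(indices, rename_pattern, rename_replacement):
    """Restore only the given indices, renaming them so live data is not overwritten."""
    return {
        "indices": indices,
        # Regular expression matched against each restored index name.
        "rename_pattern": rename_pattern,
        # $1 refers to the first capture group of rename_pattern.
        "rename_replacement": rename_replacement,
    }

body = restore_body("trx-*", "trx-(.+)", "restored-trx-$1")
print(json.dumps(body, indent=2))
```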
snapshots are segment level, as are increments
_cluster/pending_tasks
check cluster tasks if cluster is sluggish, will show what’s currently running
if indexing goes up and not coming down, good idea to scale out
index latency should be less than 1000ms ideally, but depends
Cluster, Node, Indices stats; json
check pending tasks first
GET _cluster/pending_tasks
GET _cat/nodes
best practice to set up monitoring on its own separate dedicated cluster
search latency, send alert if taking too long
The Search Slow Log is off by default, have to turn it on. see page 443 for more info
check thread_pool
for rejected and long running queries on nodes, shows the cluster is having issues.
minor upgrade rolling upgrade
major upgrade needs cluster stop, upgrade, start
use kafka as a message broker to hold data
do a synced flush before a node restart, it will commit everything to disk
out of box no authentication, authorization or encryption
disabling swapping means Xms and Xmx sizes are the same in the JVM??? see page 492
see page 494 for details
avoid running over WAN Links
minimise hops between nodes
don’t use LVM raid etc, just straight FS on the disk; lose raid lose all disks, better to lose individual disks and node will be okay. see page 499