“No cluster leader” after restarting Consul

I already described this problem in my previous article, but back then I didn't fully understand what was going on. Since then I have dug a bit deeper and can offer a better solution.

Here is the problem: after minikube restarts, Consul cannot elect a leader and therefore stays in a non-working state. In the previous article I described deleting the persistent volumes. That does help, of course, but it is not acceptable in most real situations, because we want to keep the data that was stored in Consul.

The correct solution is described in the official documentation, in the Outage Recovery guide.

I won't go into too much detail. I'll just describe what actually has to be done to bring Consul back up after a restart.

First, we need to prepare a “peers.json” file that looks roughly like this:

[
  "172.17.0.15:8300",
  "172.17.0.17:8300",
  "172.17.0.25:8300"
]

That was the file for Raft protocol version 2, as the documentation describes it. This is the format that worked for me. As you can see, the file lists the addresses of the Consul nodes together with their ports. If you are running Raft protocol version 3 or later, the file format becomes:

[
  {
    "id": "5d86e16f-e232-4cf7-80f2-39ad22629e77",
    "address": "172.17.0.15:8300",
    "non_voter": false
  },
  {
    "id": "f64e19c1-1404-4522-a153-c59caa3e395b",
    "address": "172.17.0.17:8300",
    "non_voter": false
  },
  {
    "id": "ab448371-2ec3-4fc5-a497-77dc580af316",
    "address": "172.17.0.25:8300",
    "non_voter": false
  }
]
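Writing the version 3 format by hand is error-prone, so here is a minimal sketch that prints it from "node-id=address" pairs. The helper name make_peers_v3 and the "id=address" argument convention are my own assumptions for illustration, not something Consul provides.

```shell
#!/usr/bin/env bash
# Sketch only: make_peers_v3 is a hypothetical helper, not part of Consul.
# Each argument has the form "<node-id>=<ip>:<port>".
make_peers_v3() {
  local sep="" pair
  printf '[\n'
  for pair in "$@"; do
    # Everything before the first "=" is the node ID, the rest is the address.
    printf '%s  {\n    "id": "%s",\n    "address": "%s",\n    "non_voter": false\n  }' \
      "$sep" "${pair%%=*}" "${pair#*=}"
    sep=$',\n'
  done
  printf '\n]\n'
}

# Usage (writes the file into the current directory):
#   make_peers_v3 "5d86e16f-e232-4cf7-80f2-39ad22629e77=172.17.0.15:8300" > peers.json
```

Every entry is written with "non_voter": false, which matches the example above.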

Once again, the first format is the one that worked for me. I also had only a single Consul node, so my file looked like this:

[
  "172.17.0.17:8300"
]

You can find the addresses of the Consul nodes with this command:

> kubectl describe pod --namespace mynamespace consul-consul-0 | grep -A1 -B1 "IP:"

Status:		Running
IP:		172.17.0.17
Created By:	StatefulSet/consul-consul
--
      STATEFULSET_NAME:		consul-consul
      POD_IP:			 (v1:status.podIP)
      STATEFULSET_NAMESPACE:	mynamespace (v1:metadata.namespace)

Replace consul-consul-0 with the name of your Consul pod. In our case the IP is 172.17.0.17.
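The two steps so far (find the pod IP, write a version 2 peers.json) can be sketched as a pair of small shell functions. The names extract_pod_ip and make_peers_v2 are hypothetical helpers of mine. Port 8300 is taken from the examples above.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of kubectl or Consul.

# Read `kubectl describe pod` output on stdin and print the first "IP:" value.
extract_pod_ip() {
  awk '$1 == "IP:" { print $2; exit }'
}

# Print a Raft protocol version 2 peers.json for the given IPs (port 8300).
make_peers_v2() {
  local sep="" ip
  printf '[\n'
  for ip in "$@"; do
    printf '%s  "%s:8300"' "$sep" "$ip"
    sep=$',\n'
  done
  printf '\n]\n'
}

# Usage:
#   ip="$(kubectl describe pod --namespace mynamespace consul-consul-0 | extract_pod_ip)"
#   make_peers_v2 "$ip" > peers.json
```

The awk filter matches only lines whose first field is exactly "IP:", so the POD_IP line in the environment section is ignored.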

Now we need to put this file into the raft subdirectory of the directory specified by -data-dir. In my case the file had to be placed at “/var/lib/consul/raft/peers.json”. You can verify the path this way: the directory where you are about to put “peers.json” should already contain a “peers.info” file. My Consul was deployed inside Kubernetes, so I checked with this command:

> kubectl exec --namespace mynamespace consul-consul-0 -- cat /var/lib/consul/raft/peers.info

As of Consul 0.7.0, the peers.json file is only used for recovery
after an outage. The format of this file depends on what the server has
configured for its Raft protocol version. Please see the agent configuration
page at https://www.consul.io/docs/agent/options.html#_raft_protocol for more
details about this parameter.

For Raft protocol version 2 and earlier, this should be formatted as a JSON
array containing the address and port of each Consul server in the cluster, like
this:

[
  "10.1.0.1:8300",
  "10.1.0.2:8300",
  "10.1.0.3:8300"
]

For Raft protocol version 3 and later, this should be formatted as a JSON
array containing the node ID, address:port, and suffrage information of each
Consul server in the cluster, like this:

[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]

The "id" field is the node ID of the server. This can be found in the logs when
the server starts up, or in the "node-id" file inside the server's data
directory.

The "address" field is the address and port of the server.

The "non_voter" field controls whether the server is a non-voter, which is used
in some advanced Autopilot configurations, please see
https://www.consul.io/docs/guides/autopilot.html for more information. If
"non_voter" is omitted it will default to false, which is typical for most
clusters.

Under normal operation, the peers.json file will not be present.

When Consul starts for the first time, it will create this peers.info file and
delete any existing peers.json file so that recovery doesn't occur on the first
startup.

Once this peers.info file is present, any peers.json file will be ingested at
startup, and will set the Raft peer configuration manually to recover from an
outage. It's crucial that all servers in the cluster are shut down before
creating the peers.json file, and that all servers receive the same
configuration. Once the peers.json file is successfully ingested and applied, it
will be deleted.

Please see https://www.consul.io/docs/guides/outage.html for more information.
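If you want the same check as an exit status rather than reading the file by eye, here is a minimal sketch. The function name raft_dir_ok and the KUBECTL override variable are my own assumptions (KUBECTL is not a kubectl feature, it only makes the function easy to dry-run). It also assumes the container image ships a `test` binary.

```shell
#!/usr/bin/env bash
# Assumed override hook so the function can be dry-run; defaults to real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# raft_dir_ok NAMESPACE POD [RAFT_DIR]
# Succeeds if RAFT_DIR inside the pod already contains a peers.info file.
raft_dir_ok() {
  local ns="$1" pod="$2" dir="${3:-/var/lib/consul/raft}"
  "$KUBECTL" exec --namespace "$ns" "$pod" -- test -f "$dir/peers.info"
}

# Usage:
#   raft_dir_ok mynamespace consul-consul-0 && echo "path looks right"
```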

Copy the file into every Consul pod at the path given above (in my case there was only one pod):

kubectl cp peers.json mynamespace/consul-consul-0:/var/lib/consul/raft/peers.json
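With several Consul servers the same file has to reach every pod. A minimal sketch of that loop follows. The function name copy_peers and the KUBECTL override variable are assumptions of mine, and the pod names in the usage line beyond consul-consul-0 are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed override hook so the function can be dry-run; defaults to real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# copy_peers NAMESPACE FILE POD...
# Copy FILE to the raft directory of every listed pod, stopping at the first failure.
copy_peers() {
  local ns="$1" file="$2" pod
  shift 2
  for pod in "$@"; do
    "$KUBECTL" cp "$file" "$ns/$pod:/var/lib/consul/raft/peers.json" || return 1
  done
}

# Usage:
#   copy_peers mynamespace peers.json consul-consul-0 consul-consul-1 consul-consul-2
```

Setting KUBECTL=echo prints the commands instead of running them, which is a cheap way to review them first.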

Restart the pods by deleting them and waiting for the StatefulSet to recreate them:

kubectl delete pod --namespace mynamespace consul-consul-0

Check the logs:

kubectl logs --namespace mynamespace consul-consul-0

The logs should contain lines like these:

2017/10/25 06:43:21 [INFO] consul: cluster leadership acquired
2017/10/25 06:43:21 [INFO] consul: New leader elected: consul-consul-0
2017/10/25 06:43:21 [INFO] consul: member 'consul-consul-0' joined, marking health alive
2017/10/25 06:43:22 [INFO] agent: Synced service 'consul'

These lines mean that a leader has been elected successfully. Consul is operational.
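Instead of rereading the logs by hand, the check can be scripted. This is a minimal sketch that polls the pod logs for the "New leader elected" line shown above. The function names, the retry defaults and the KUBECTL override variable are all my own assumptions.

```shell
#!/usr/bin/env bash
# Assumed override hook so the function can be dry-run; defaults to real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Read Consul log text on stdin; succeed if it reports an elected leader.
leader_elected() {
  grep -q 'New leader elected'
}

# wait_for_leader NAMESPACE POD [TRIES] [DELAY_SECONDS]
# Poll the pod logs until a leader shows up or the tries run out.
wait_for_leader() {
  local ns="$1" pod="$2" tries="${3:-30}" delay="${4:-2}" i
  for ((i = 0; i < tries; i++)); do
    if "$KUBECTL" logs --namespace "$ns" "$pod" 2>/dev/null | leader_elected; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_leader mynamespace consul-consul-0 || echo "still no leader"
```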

This procedure has to be repeated after every restart of Consul. Apparently the assumption is that Consul runs all the time and only stops during sudden power outages. Still, I don't understand why the recovery procedure had to be so complicated, and why Consul cannot recover and elect a leader without our intervention. Most likely this is because there is something more fundamental that I still don't understand.

“No cluster leader” after restarting Consul: 2 comments

cry:

09.12.2019 at 16:57

Hi! Could you tell me, should peers.json be edited on the host or inside the pod itself?
Will it still be valid after the pod is deleted?

cry:

10.12.2019 at 09:02

Hmm, after these manipulations the cluster got sick 🙁
