Crash recovery guide for Home Assistant

Recently, my Home Assistant server crashed and I lost a lot of things. The experience wasn't pleasant, and I made many attempts to limit the effects of the crash.

Although I had created backups in Home Assistant, I wasn't able to use them (see below for why).

So in the end, I've written this guide for anybody in the same situation as I was, with all the tricks I used to repair things and save time doing so.

The disaster

When I installed Home Assistant (HA from now on), I created a backup in the dedicated section of the settings and downloaded it to my computer's drive. But this happened years ago, and while HA kept being updated, I never thought of recreating the backup and downloading it again.

The HDD in my server crashed (it started to emit beep, beep, beep which, for a mechanical drive without a speaker, is not a feature). All the other drives in the server were perfectly fine but, unluckily, this one was the OS's main drive.

An HA installation in supervised mode forces us to install on the OS drive anyway. So HA was gone along with the OS.

Assessing the damage, I realized I had lost all the HA configuration and automations, but also all the YAML files of my ESPHome devices, the whole Zigbee device tree, all the MQTT (LoRaWAN) bridges, all the REST sensors, and so on.

Recovery plan

The first thing to do when this happens is to dig for backups. Well, the backup I had was so old it didn't contain anything of interest (at the time, HA didn't have the API encryption key, and the backup refused to restore anyway). After that, I realized that a new HA installation couldn't talk to the numerous ESPHome devices in my home because I was now missing the API encryption key and the OTA password needed to flash them.

I dug through my computer's and server's hard drives for anything that could help me in my herculean effort to put everything back in order, while the WAF (wife acceptance factor) was dipping.

I found a firmware binary file (.bin) for an ESP32 device I hadn't connected yet. I found the keys I had used for my MQTT bridges. I found the posts / archives I had made when I reverse engineered some of the devices in my home, so I could extract the pinouts of those devices again.

All in all, I didn't have much.

Ok, so let's list what we need:

  1. Reinstalling the OS and HA.
  2. Reinstalling the HTTPS and Authentication proxy.
  3. Recreating all the basic configuration in HA
  4. Reinstalling all the add-ons
  5. Reinstalling ESPHome and all the devices
  6. Reinstalling all the REST devices
  7. Reinstalling all the MQTT devices (and bridges)
  8. Recreating all automations
  9. Making sure no human work is required for automatic backups

Execution plan

1. OS reinstallation

Reinstalling Debian is very easy. Just follow the numerous tutorials online if you don't feel confident. This time, I replaced my mechanical drive with 2 SSD drives (with SLC memory, so they should last longer) in a RAID1 configuration. The only hiccup I had was that in the past, I mapped my root filesystem directly on the /dev/md0 device, but with the recent Debian installer, you must create an LVM (logical volume manager) volume on the RAID device to be allowed to install the root filesystem on it.

Typically, you need to create 2 partitions on each of the 2 SSD drives: one 100MB partition for EFI boot and the rest for the data.

The hard part that's not explained in the tutorials is how to mirror the boot partition on both drives so that if one fails, the other can boot the machine while you're buying and replacing the bad drive.

Typically, you'll copy the boot partition as-is first, like this:

$ sudo dd if=/dev/sdX1 of=/dev/sdY1 

Replace /dev/sdX1 with the source (the currently mounted boot partition, check the output of mount | grep "/boot"). /dev/sdY1 is the boot partition of the second drive (you'll find the Y part by looking at cat /proc/mdstat, which lists the partitions used in your RAID array).
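
If you'd rather not read /proc/mdstat by eye, a small helper can list the member partitions for you. This is only a sketch: the function name list_raid_members is made up for this guide, and it assumes the usual mdstat layout where each member shows up as name[N] on the line describing the array.

```shell
# List the member partitions of every md array found in an mdstat-style file.
# Usage: list_raid_members [file]   (defaults to /proc/mdstat)
list_raid_members() {
    mdstat_file="${1:-/proc/mdstat}"
    # Array lines look like: "md0 : active raid1 sdb2[1] sda2[0]"
    awk '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\]/) {
                dev = $i
                sub(/\[.*/, "", dev)   # drop the "[N]" suffix and any "(F)" flag
                print "/dev/" dev
            }
        }
    }' "$mdstat_file"
}
```

Calling list_raid_members without argument prints one /dev/... path per line. The drive that is not your currently mounted boot drive is the Y you are looking for.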

Then, you need to add an EFI boot entry for the second drive (the existing entry is only able to boot from the first drive).

You'll do this (note that --disk takes the whole drive, /dev/sdY, while --part selects the partition number on it):

$ sudo efibootmgr --create --disk /dev/sdY --part 1 --label "debian" --loader "\EFI\debian\shimx64.efi"

You can check if it worked by looking at the output of:

$ sudo efibootmgr -v
[...]
Boot0000* debian        HD(1,GPT,7bfe396f-e494-4a66-ac73-8dfbc9fc9ac6,0x800,0x2f000)/File(\EFI\debian\shimx64.efi)
[...]
Boot0006* debian        HD(1,GPT,34ffe36e-b784-4a66-ac73-cdiuehieueofi,0x800,0x2f000)/File(\EFI\debian\shimx64.efi)

It should list 2 debian entries with different UUIDs (they are the same as the PARTUUID values returned by the blkid command for the EFI partitions of your SSD drives).
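
The same check can be scripted. The helper below is a sketch under a few assumptions: the name count_debian_boot_guids is invented here, the boot entries are labelled "debian" as in this guide, and your grep supports the -o option (GNU and BSD grep do).

```shell
# Read `efibootmgr -v` output on stdin and print how many distinct
# partition GUIDs are referenced by the boot entries labelled "debian".
count_debian_boot_guids() {
    grep 'debian' \
        | grep -o 'GPT,[0-9A-Fa-f-]*' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

Run it as sudo efibootmgr -v | count_debian_boot_guids. With a mirrored boot set up as described above, it should print 2.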

For the HA installation, simply follow the official guide for installing Home Assistant Supervised on Debian.

2. Reinstalling HTTPS and Authentication

On my setup, the reverse HTTPS proxy is done on another server so I had nothing special to perform except setting up the right ports on the right IP address.

For the Authentication proxy, I simply followed my own tutorial here. I can confirm it works out of the box!

3. Configuring HA

Configuring Home Assistant was easier than the first time I did it. I don't know if it's because I'm used to it or because they have simplified it in the years since. The only hiccup I had was defining the home's location. Whatever I did, it didn't stick, so the map would always center on Amsterdam.

Finally, I've set the home's location via the configuration.yaml file by adding:

homeassistant:
   name: Home
   latitude: x.x
   longitude: x.x
   elevation: y
   unit_system: metric

I restarted and... it failed again. So I removed the lines and checked the log. The issue was that I had an automatically discovered device that was reporting its location attributes (latitude and longitude) as unknown. When this happens, HA gets confused and, instead of ignoring the (failing) device, centers the map on Amsterdam. I (temporarily) removed the device and everything was in order.

4. Installing addons

Starting from the base installation, I added HACS and 2 additional add-ons (RemoteBackup and ZHA, see later for how to configure them). I installed Mosquitto for the MQTT broker and Advanced SSH/Terminal to be able to modify files for those add-ons without needing a (useless) Samba share and to be able to enter the Docker containers. I've installed Piper and Whisper so I can also use the voice interface. There's still a bug with Piper's voice selection configuration that I couldn't sort out.

In Mosquitto, I had to recreate the bridges to the other MQTT broker for my LoRaWAN devices. I'm using The Things Network for my LoRaWAN devices so I had to add

connection water_sensor
address eu1.cloud.thethings.network:8883
bridge_protocol_version mqttv311
remote_username YYY@ttn
remote_password XXXXX
start_type automatic
notifications false
try_private false
bridge_cafile /share/mosquitto/ca.pem
bridge_insecure false
topic # in 0
cleansession true

in /share/mosquitto/mqtt_bridges.conf

This also requires downloading the TLS CA certificate file from here if you intend to connect over TLS (you should).

As for ZHA, the hiccup was finding out how to tell the integration that my server wasn't connected to the Zigbee gateway via a serial port, but through a remote socket. In fact, one must use the option called "Enter port manually" and then enter the port as socket://IP.ADD.RE.SS:PORT (instead of the usual /dev/ttyXXX). Then I had to go to each Zigbee device and retrigger the association procedure.

5. Reinstalling ESPHome

ESPHome is a blessing and a curse at the same time. Adding an ESP or BK or RP device from scratch to your network is so simple, but you can't import an existing device if you don't have its API encryption key (which, conveniently, is saved on the same drive as ESPHome / HA). If you haven't saved the keys, consider a reinstallation as a "rewrite everything from scratch" exercise.

I couldn't find the courage to rework all my devices from scratch, so I tried to cheat as much as I could. First, I always enable the web server and OTA on any device I make (you should do both), so I could contact each device on my IP network map. Also, I had a single firmware BIN file for an ESP32 device that I had saved once while working on it.

Please note that the default process in ESPHome when you add a device is to create a random API encryption key per device. If you have this, you're doomed because you can't extract them from the firmware. Fortunately, I always use a substitution for the OTA and API keys, so they were the same for all my devices.

5.1 Recovering the keys

I dug into the ESPHome source code to figure out how the encryption keys are stored in the devices, hoping I could extract them and avoid the blank page syndrome. I found that the OTA key is stored as a plain string in the firmware. The API key, however, is hashed from what you give in the YAML file, even if it looks like a base64-encoded random value. So even if you could extract it, it would be useless since you can't reverse a SHA-256 hash.

Here's how to extract the OTA key from a binary firmware (for what it's worth):

$ hexdump -C your_firmware_file.bin | less
# Then search for your Wi-Fi password, via '/' followed by the password
# The string that follows it in the ASCII part on the right is the OTA password. For example, you'll see this:
0001b6f0  31 36 31 00 de ad be ef  de ad be ef 20 00 de ad  |161.SSIDHere.Pa|
0001b700  be ef de ad be ef 00  30 30 60 60 60 60 30 30  |ssword.100000000|
0001b710  30 60 60 60 30 60 30 30  60 30 30 30 60 60 30 60  |1234567890abcd|
0001b720  30 60 60 60 30 60 30 30  00 30 30 30 60 60 30 60  |12345678.somet|
# In this example, the part that reads 100000000123456... up to the next NUL byte (shown as '.') is your OTA password
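
Scrolling through a hex dump gets old quickly, so the same lookup can be done with a few standard tools. This is a sketch, not something from ESPHome: the function name string_after_password is invented here, and it assumes (as in the dump above) that the Wi-Fi password sits in the firmware as a NUL-terminated string and that the OTA password is the next non-empty string after it.

```shell
# Print the first non-empty NUL-terminated string that follows the given
# Wi-Fi password inside a firmware image.
# Usage: string_after_password firmware.bin 'wifi password'
string_after_password() {
    firmware="$1"
    wifi_password="$2"
    # Turn NUL terminators into newlines so that every C string becomes a line,
    # then print the line that comes right after the one equal to the password.
    tr '\000' '\n' < "$firmware" \
        | PW="$wifi_password" LC_ALL=C awk '
            found && length($0) > 0 { print; exit }
            ($0 "") == (ENVIRON["PW"] "") { found = 1 }'
}
```

For example, string_after_password your_firmware_file.bin 'MyWifiPassword' would print the candidate OTA password. If it prints nothing, the strings are probably laid out differently in your firmware and you are back to hexdump.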

What is the use of the OTA key? You'll see next.

5.2 Recovering the ESPHome configuration files

ESPHome doesn't save the original YAML file in the built firmware. This is a shame since there is usually plenty of space for it in the flash. Fortunately, one cool guy named Dentra created a component for that, so you should now modify your devices to include it. Simply add this to the top of every YAML file so my nightmare doesn't become yours:

external_components:
  - source: github://dentra/esphome-components
    components: [backup]

backup:
  auth:
    username: "admin"
    password: ${wifi_password}

That way, if shit happens, you'll be able to connect to your device IP address via the URL: http://yourdeviceip/config.yaml and get back the complete YAML for the configuration saving you from all the trouble.

Sadly, I didn't have this before, so I was doomed to recreate all the files.

I had some documentation for many of them (since I've built many of them), but it was a PITA to reopen all files and rework a valid configuration for all of them.

So, again, I cheated. I observed that the web server, if you connect very shortly after boot, displays some information in the logs. In ESPHome, there's no log buffer: if there's no client to receive the logs, they are lost. And since the setup log (which contains the information about which GPIO / sensor / component is set up) happens almost simultaneously with the Wi-Fi connection, it's very hard to catch.

So instead of unplugging/replugging with my laptop and refreshing, I wrote a script that hammers the device with HTTP requests so that it gets a reply as early as possible after the next boot. It somehow worked (except for my Tuya LibreTiny devices, which I had to write from scratch). I could get the list of components and their configuration, which made the rewriting easier.
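
My original script is gone, but the idea can be sketched like this. Treat it as an illustration under assumptions: the function name hammer_device and the FETCH variable are made up for this guide, the short pause relies on a sleep that accepts fractions (GNU and BSD do), and the /events path used in the usage example is my assumption of where the ESPHome web server streams its logs.

```shell
# Command used to fetch a URL. The default keeps the connection open and
# prints whatever the device sends; overriding FETCH is handy for dry runs.
FETCH="${FETCH:-curl --silent --no-buffer --connect-timeout 1}"

# Retry the request until one attempt completes successfully.
# Usage: hammer_device URL [max_tries] [pause_seconds]
hammer_device() {
    url="$1"
    max_tries="${2:-600}"
    pause="${3:-0.1}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # The fetch output goes straight to stdout so nothing is lost.
        if $FETCH "$url"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$pause"
    done
    return 1
}
```

Start hammer_device "http://DEVICE_IP/events" | tee boot-log.txt on the laptop, then power-cycle the device. Every failed or dropped connection is retried right away, so the first lines the device emits after joining the network end up in boot-log.txt. Stop it with Ctrl-C once you have what you need.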

Once I had rewritten the configuration for a device, I created a new device in ESPHome, deleted the default YAML it produced and replaced it with my configuration. I clicked install and chose to download a binary file. I then opened the web page of the device and used the OTA field to upload the firmware.

It worked for some devices, so I could reinstall from ESPHome directly to have them adopted in ESPHome.

However, my ESP8266 devices couldn't deal with the firmware file, since it was too big to fit in RAM. So I made a dumb device with nothing except the webserver + OTA in it (not even logs). That firmware was less than 430kB and I was able to flash it via the OTA field on the web page. Once flashed, it had the new OTA password, and I could import it again from ESPHome and flash it with the complete configuration.

The ESP32 devices using the esp-idf framework didn't have the OTA field on the web page, so I was stuck. It's a limitation of ESPHome.

The alternative was to remove the devices from every part of my house to plug them into the serial adapter (since they are connected to mains voltage, that would have been a PITA). And since they were the ones with the biggest feature set, I really didn't want to start from scratch.

I finally found a workaround, since I had extracted the OTA password from a previous firmware and I knew it was shared between all my devices.

I somehow rebuilt the YAML, but instead of using the new OTA password from the secrets file, I used the previous one (written as a plain string in the configuration), and added this under the esphome key:

esphome:
  on_boot:
    priority: -100
    then:
      - lambda: |-
          id(ota).set_auth_password("New password"); 

ota:
  - platform: esphome
    id: ota
    password: "1000000000123456..." # See above

The part in lambda doesn't accept substitutions, so you'll need to copy the new OTA password from the secret file here.

With this prepared, I was able to flash the new firmware wirelessly (via OTA). The device booted and updated its OTA password to the new one. I then removed this code (now useless, since the device is accessible by ESPHome via the new API key).

All in all, all my devices were set up and working again.

6. Recovering REST devices

HA now recommends using a separate YAML file for each category, with rest: !include rest.yaml in the main configuration.yaml file. My REST servers have varying interfaces, so I had to rewrite the parsing code for each of them.

I'm posting some example code here, so it can serve as a reference for how to deal with it:

- resource: "http://device_ip/json"
  scan_interval: 5
  sensor:
    - name: "Home current"
      unique_id: current-home
      value_template: "{{ (value|from_json)['current'][0] | float }}"
      device_class: current
      state_class: measurement
      unit_of_measurement: A

7. Recovering MQTT devices

As with REST, MQTT now lives in its own file too, and it contains something like this (these were the hardest ones to get right):

# Water counter
binary_sensor:
   - state_topic: "v3/application_id@ttn/devices/eui-XXXXXX/up"
     name: Water Leak Alarm
     unique_id: water-leak-alarm
     value_template: "{{ 'ON' if value_json.uplink_message.decoded_payload.waterLeakAlarm == true else 'OFF' }}"

sensor:
   - state_topic: "v3/application_id@ttn/devices/eui-XXXXXX/up"
     name: Luminosity
     device_class: illuminance
     unique_id: outside-lux
     value_template: '{{ "%.01f" | format(value_json.uplink_message.decoded_payload.outsideLux) }}'
     state_class: measurement
     unit_of_measurement: "lx"

# SenseCAP Tracker
   - state_topic: "v3/app-id@ttn/devices/name/up"
     name: Tracker battery
     unique_id: tracker-battery
     value_template: >
         {% if value_json.uplink_message.decoded_payload is defined %}
           {{ value_json.uplink_message.decoded_payload.messages[0] | selectattr("type", "eq", "Battery") 
              | map(attribute="measurementValue") | first | default(0) | float 
           }}
         {% else %}
           {{ states('sensor.tracker_battery') }}
         {% endif %}
     state_class: measurement
     unit_of_measurement: "%"    

8. Automations

Recreating the automations was long and tedious. I had to find again all the pages in the HA forum that covered the difficulties I had when I first wrote them. The new version of HA is more GUI oriented, but I don't find it very intuitive. Things are sorted in categories, but it's hard to find out in which one your entity will appear. In the end, I often switched from the UI to YAML and back to figure out where some entity was sorted. You write the entity name in the YAML editor, and when you switch back to the UI, the category and icon are visible.

I find the automation system quite simple to debug. Usually an automation works once written, and it's very rare that you need the debugger to solve an issue.

On the other hand, templates in sensors are hard to write correctly and almost impossible to debug. HA sometimes displays an error in its logs, but never the reason for the error (the input text that was fed to the templating system). The developer tools are kind of useless here, since the section for testing templates will always work on the input you expect, not on the one that actually failed.

9. Automatic backup

Ok, let's not make the same mistake twice. This time, I want to make sure the system backs itself up automatically and that the backups actually work.

I've added the RemoteBackup add-on. On the other server, the one receiving the backups, I created a no-login account (via sudo adduser --disabled-login haback). I also added a rule to the SSH daemon so that this user can only transfer files (SFTP) to its own space. You can add this via:

# cat > /etc/ssh/sshd_config.d/01-haback.conf << EOF
Match User haback
  ForceCommand internal-sftp
EOF

Create an SSH key for the HA server (ssh-keygen -t ed25519) and install it on the HA server (follow the instructions from RemoteBackup's site to find out where to store it). The public part of the key should be copied to your backup server's /home/haback/.ssh/authorized_keys, and that file should be set to chmod 0600 too. Run a backup to ensure it's working (a new tar file should appear in the /home/haback folder).
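
On the backup server, installing the public key can be scripted as below. It's a sketch under assumptions: the function name install_backup_key is invented for this guide and the home directory is passed as an argument. When you run it as root, you still have to chown the .ssh folder to haback afterwards.

```shell
# Append a public key to <home>/.ssh/authorized_keys with the permissions
# that sshd expects. Running it twice does not duplicate the key.
# Usage: install_backup_key key.pub /home/haback
install_backup_key() {
    pubkey_file="$1"
    home_dir="$2"
    auth_file="$home_dir/.ssh/authorized_keys"
    mkdir -p "$home_dir/.ssh"
    chmod 0700 "$home_dir/.ssh"
    touch "$auth_file"
    # Append the key only if that exact line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 0600 "$auth_file"
}
```

Usage would look like install_backup_key /tmp/ha_key.pub /home/haback, where /tmp/ha_key.pub is whatever path you copied the public key to.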

Don't forget to back up the Zigbee network too (you can do this every time you add a device, since it doesn't change otherwise).
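
Coming back to the tar files landing in /home/haback: my old backup was useless because nobody noticed it had gone stale, so it's worth checking from time to time that fresh ones keep arriving. Here's a minimal sketch, assuming the backups are plain .tar files in a single folder. The function name backup_is_fresh and the 2-day default are arbitrary choices of mine.

```shell
# Succeed if the folder holds at least one .tar file modified less than
# max_age_days days ago, fail otherwise.
# Usage: backup_is_fresh /home/haback [max_age_days]
backup_is_fresh() {
    backup_dir="$1"
    max_age_days="${2:-2}"
    # find prints the recent .tar files; grep -q . succeeds as soon as
    # at least one path is printed.
    find "$backup_dir" -type f -name '*.tar' -mtime "-$max_age_days" | grep -q .
}
```

Called from a daily cron job on the backup server, something like backup_is_fresh /home/haback 2 || echo "No recent HA backup" gives you a cheap alarm. Hook the failure branch to whatever notification you actually read.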

Et voilà!
