State of the Instance (September 2021)
Boy oh boy.
Database woes
So moving the database to block storage didn't help as much as I expected. In fact, performance degraded very quickly: the server got painfully slow, and there were timeouts and 500 errors all over the place. I'm sure you remember.
The obvious solution was to double the server plan - more on that later - and move the database back to the larger SSD. Which worked for a while.
Except I fucked that up.
The crash
6th of September
I finally got around to upgrading the server plan. As always with potentially dangerous operations, I made backups of the database and, just in case, a snapshot of the entire instance. After the upgrade finished, I stopped Pleroma and copied the entire database folder to the SSD. Performance instantly skyrocketed and everything seemed fine.
What I didn't realize was that I had forgotten to stop the database process as well before copying the folder. Because of that, the database slowly started to become inconsistent... (More technical explanation at the bottom of the page.)
8th of September
The inconsistencies eventually started to manifest, mainly as very high response times and errors creeping up again. Not knowing the cause, I rebooted the server and... the database process refused to start. It didn't even give a proper error message - it just hit some internal error and exited right after starting.
Time to panic!
The funniest part was that to make a proper backup, or restore one, you also need a running database process. After a while, it dawned on me what the problem was, but by then it was too late to fix anything.
I launched a restore of the snapshot and went drinking because what the hell.
The recovery
9th of September
When I woke up around noon, the restore was complete. Except I had missed the fact that there still wasn't a functional database to use - in the snapshot, the database still pointed to the folder on the block storage, and that one was inconsistent too, for technical reasons. (Bottom of the page.) So the database process was still as dead as before.
Well, it was time to nuke the database folder and restore the database backup from scratch. I was aware this would take a long time, but I underestimated just how long.
10th of September
The restore was still running.
11th of September
Still running...
12th of September
In the evening, the restore was finally over. I checked everything was okay this time and launched the instance.
For about four hours, the performance was terrible - presumably the instance was catching up on remote posts. After that, though, everything was blazing fast, as it should be.
The loss
So a consequence of these events, apart from the long outage, is that roughly two days of posts made between the 6th and the 8th of September are lost.
I'm deeply sorry for that.
Scaling up
So yeah, I've doubled the instance plan. This means:
- double the CPU cores
- double the RAM
- double the SSD space
- double the monthly cost, but it's manageable so far
It seems we've already surpassed pre-August levels of performance! This should be more than enough for the months to come.
New emoji
I've added a few autism zodiac signs on request. Then, for completeness, I've added some more.
Technical details on the crash
So about copying the folder...
Linux processes can keep files open for long periods of time. Since I didn't kill the Postgres process, it kept a few files open while I copied the folder. The folder name actually stayed the same - the normal location is /var/lib/pgsql, which I kept. However, when moving it to the block storage a month ago, I had made this folder a symbolic link to a different folder on the block storage partition.
So this time I first removed the symbolic link and then copied the actual folder back to the SSD. From then on, newly opened files were picked up from the new SSD copy, while the file descriptors Postgres already held still pointed to the old copy on the block storage. This effectively corrupted both copies at the same time.
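Here's a minimal sketch of that failure mode on Linux, using a plain text file instead of actual Postgres data files (the path and contents are made up for illustration): an already-open file descriptor keeps pointing at the old file even after the path has been swapped for a fresh copy.

```python
import os

# Hypothetical stand-in for one of the database files.
path = "datafile.txt"

with open(path, "w") as f:
    f.write("old copy (block storage)\n")

# A long-running process (Postgres in my case) opens the file and
# holds on to the descriptor.
held = open(path, "a")

# The path is then swapped out from under it - analogous to removing
# the symlink and copying the folder back to the SSD: the name now
# refers to a brand-new file.
os.remove(path)
with open(path, "w") as f:
    f.write("new copy (SSD)\n")

# Writes through the held descriptor land in the old, now-unlinked
# file, not in the new one - the two copies silently drift apart.
held.write("write that only the old copy sees\n")
held.flush()

with open(path) as f:
    print(f.read())   # prints only "new copy (SSD)"
held.close()
```

Same name, two different files, and each side only sees half of the writes - which is exactly how both copies ended up inconsistent.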
Would merging the two copies together and keeping the more recent file instances work? Probably, but I wouldn't trust such a homunculus. Plus I only thought of this while the restore was already running.
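For what it's worth, the mistake would have been easy to catch before copying by checking whether anything still had files open under the data directory - lsof can do this, or, as a rough sketch of my own assuming Linux's /proc layout (and run as root, since other users' file descriptors aren't readable otherwise):

```python
import os

def open_files_under(path):
    """Yield (pid, file) pairs for open files under `path`, via /proc."""
    path = os.path.realpath(path)
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue  # not ours to inspect, or the process already exited
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == path or target.startswith(path + os.sep):
                yield int(pid), target

# An empty result before copying would have meant it was actually safe.
for pid, f in open_files_under("/var/lib/pgsql"):
    print(pid, f)
```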
(I know, all of this is mostly about September events. I planned to do the DB move right around the 1st and then quickly write this up, not expecting things to go south. Next month I'll try to post on the 1st like I'm supposed to.)