Operating system update
Starting last Saturday, we performed an operating system update on the Stendhal server, going from Ubuntu 14.04 via 16.04 to 18.04. The operating system update itself went very smoothly.
With the update of the operating system, a number of software packages got new versions, for example Java, PHP and MySQL.
Website and Wikis
PHP is the programming language used by the website and Wiki. We used version 5.0 in the past and the update brought version 7.2. PHP 7.2 has significant incompatible changes, so we prepared the update for months: We rewrote the entire database access code for the Stendhal website from mysql_* to PDO. We also updated MediaWiki, the software that powers the Stendhal Wiki and LibreGameWiki, on a local computer first, to fix a number of extension issues.
Database
Most of the update went smoothly with only a couple of hours of downtime. The MySQL update, however, created a bit of a headache: It had to rewrite all of its data files. And with a database of almost 200 GB of structured data, that took quite a while.
A database table of 1 GB without indexes took about 20 minutes. With indexes, it took about 60 minutes. Unfortunately, this process is not linear: A 6 GB table with indexes took significantly longer than 6 * 60 minutes. The largest tables are 80 GB and 60 GB.
This in itself would not have been a problem. All the tables that the Stendhal server needs are in the 2 GB range, or lower. So we updated these tables first and the Stendhal server went up again on Sunday morning.
Sunday morning: Stendhal is up again and there is no lag
In the background, more tables were still being converted. The MySQL database server was very busy, but there was virtually no lag in Stendhal. How did this work?
You might not know it, but Stendhal is in fact a turn-based game. It's just that the turns are very short, just 300 milliseconds. Each turn, the server processes commands sent by the clients, creature logic, NPC logic and many other things.
If the server is not able to handle everything within 300 ms, the game will lag. In the past, a significant slice of those 300 milliseconds was used up waiting for database queries. So a couple of years ago we rewrote the database access code to be asynchronous. The Stendhal server no longer waits for a database query to return. Instead, it continues to move the world forward regardless. It checks for pending database results in the next turn, or the turn after that, whenever the database is ready.
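The idea can be illustrated with a minimal Java sketch. This is not the actual Stendhal code: the class name AsyncDatabaseQueue and its methods are hypothetical, and error handling for failed queries is left out. Queries are handed to a separate database thread, and the game thread only collects results that are already finished.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch: queries run on a worker thread, the game thread only polls. */
class AsyncDatabaseQueue {

    /** A query in flight, together with the code that consumes its result. */
    private static final class Pending<T> {
        private final CompletableFuture<T> future;
        private final Consumer<T> onResult;

        Pending(CompletableFuture<T> future, Consumer<T> onResult) {
            this.future = future;
            this.onResult = onResult;
        }

        /** Delivers the result if it is available; never waits for it. */
        boolean deliverIfDone() {
            if (!future.isDone()) {
                return false;
            }
            // join() returns immediately here because the future is done.
            // A failed query would throw; handling that is omitted in this sketch.
            onResult.accept(future.join());
            return true;
        }
    }

    // Single daemon worker thread standing in for the database connection.
    private final ExecutorService dbThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "db-worker");
        t.setDaemon(true);
        return t;
    });

    // Only touched by the game thread.
    private final List<Pending<?>> pending = new ArrayList<>();

    /** Game thread: hands the query to the database thread and returns at once. */
    <T> void submit(Supplier<T> query, Consumer<T> onResult) {
        pending.add(new Pending<T>(CompletableFuture.supplyAsync(query, dbThread), onResult));
    }

    /** Game thread, once per turn: delivers finished results without blocking. */
    int deliverFinished() {
        int delivered = 0;
        for (Iterator<Pending<?>> it = pending.iterator(); it.hasNext();) {
            if (it.next().deliverIfDone()) {
                it.remove();
                delivered++;
            }
        }
        return delivered;
    }

    /** Number of queries that have been submitted but not yet delivered. */
    int pendingCount() {
        return pending.size();
    }

    void shutdown() {
        dbThread.shutdown();
    }
}
```

In such a design, the turn loop would call deliverFinished() once per turn. A slow database then only delays when results arrive, not how long a turn takes.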
Sunday evening: Logins take forever
While this approach works fine for everything that happens in the world, there is one operation that absolutely requires answers from the database: Logins.
On login, the Stendhal server has to do quite a number of database queries: It has to check whether the account is banned, whether the account or network is blocked, and whether the password is correct. It also has to load a list of characters.
As mentioned above, the world of Stendhal moves forward without issues. But the player who is trying to log in has to wait until the data is available.
On Sunday evening, the queue of waiting database queries got so large that logins took several minutes. The client appeared frozen, but just waiting out the login process would eventually lead to being able to play.
The situation was far from ideal, but more or less acceptable.
Monday: Things getting worse
While MySQL was working on larger tables in the background, the queue of pending database operations grew and grew.
To give some numbers: The database query to record a rat being killed usually takes less than 10 milliseconds. Now, however, the background process was using up all system resources, and this query took 8 seconds (8000 milliseconds) instead.
While in game, the client sends regular "I am alive" messages to the server. If the server does not receive such a message from the client for more than a couple of minutes, it will disconnect the client.
On Monday, we ran into the following issue: The login process took so long that the timeout limit was reached. Because the client only starts sending "I am alive" messages after the login process has completed, the server disconnected clients that were still waiting for a login confirmation.
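The failure mode can be reproduced with a small Java sketch. It is an illustration only, not the server's real code: the class KeepAliveTracker, its method names and the two-minute timeout used in the usage note are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a server-side "I am alive" timeout check. */
class KeepAliveTracker {
    private final long timeoutMillis;
    // Time of the last message received from each client.
    private final Map<String, Long> lastMessageAt = new HashMap<>();

    KeepAliveTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records any message from the client, including the initial login request. */
    void messageReceived(String client, long nowMillis) {
        lastMessageAt.put(client, nowMillis);
    }

    /** True if a known client has been silent for longer than the timeout. */
    boolean shouldDisconnect(String client, long nowMillis) {
        Long last = lastMessageAt.get(client);
        return last != null && nowMillis - last > timeoutMillis;
    }
}
```

With an assumed timeout of 120000 milliseconds, a client that sent its login request at time 0 and is still waiting for the confirmation at 180000 has been silent for three minutes, so shouldDisconnect returns true even though the client is behaving correctly.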
Post mortem
The update of the operating system went smoothly. The update of the PHP version went smoothly, too, because we had prepared well for it.
The update of the MySQL server did not go as planned: Converting tables in the background used significantly more system resources than we expected. This resulted in the Stendhal server being mostly unavailable for almost another two days. We still have three database tables that need converting at some point in the future.
Please keep in mind that Stendhal is a spare-time project. If we were a company that owned a second server, we would have done this: Use xtrabackup to create a point-in-time consistent backup of the database server. Then use this backup to set up replication on a second server. Pause replication, do the MySQL update on that server, and resume replication to catch up on all the changes that happened in the meantime.
In the future, the client should handle pending login requests better, instead of freezing. In theory, the Stendhal server knows the size of the database queue, so it could tell the client the estimated remaining time.
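One possible way to derive such an estimate is sketched below in Java. Nothing like this exists in Stendhal as far as this post is concerned: the class QueueWaitEstimator, its smoothing parameter and the assumption that queued queries are processed one after another are all hypothetical.

```java
/** Hypothetical sketch: estimates login wait time from the database queue size. */
class QueueWaitEstimator {
    // Weight of the newest sample in the moving average, between 0 and 1.
    private final double smoothing;
    private double averageQueryMillis;
    private boolean hasSample;

    QueueWaitEstimator(double smoothing) {
        this.smoothing = smoothing;
    }

    /** Feeds the measured duration of a finished query into the moving average. */
    void recordQueryDuration(long millis) {
        if (hasSample) {
            averageQueryMillis = smoothing * millis + (1 - smoothing) * averageQueryMillis;
        } else {
            averageQueryMillis = millis;
            hasSample = true;
        }
    }

    /** Assumes queries are processed strictly one after another. */
    long estimateWaitMillis(int queuedQueries) {
        return Math.round(queuedQueries * averageQueryMillis);
    }
}
```

For example, with the 8 second query time observed on Monday and an assumed queue of 30 queries, the estimate is 240000 milliseconds, or four minutes. The client could show this instead of appearing frozen.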
Anyway, after doing this operating system update, we are good to go for another 4 years.
We are sorry for any inconvenience caused and hope that you still enjoy playing Stendhal and might consider contributing.