Moving the Stendhal server to a new host

Arianne and its subprojects Marauroa and Stendhal are open source. Development mostly takes place on the SourceForge hosting platform, but we run one instance of Stendhal ourselves as the "official" game server. This article describes how we moved that server to new hardware with only two relatively short windows of downtime.

Why move to a new server?

For about the last three years we have rented a server for the official Stendhal game server at a hosting company. That server had 2 CPU cores and 2 GB of RAM. In the beginning only Webflag and Stendhal ran on it, but soon we put the wiki up there as well, and last year we moved the Stendhal test server onto the same hardware. With more and more services added and the content growing, we outgrew that hardware. The new server has 8 GB of RAM and 8 CPU cores.

Setting up the new server

Moving to a new server requires quite a few steps, most of which could be done in advance:

  • install software (e.g. MySQL and PostgreSQL, Java and PHP, the Stendhal server, the Apache web server and Tomcat, and many command line tools)
  • copy configurations from the old server where possible (we are moving from an old Debian operating system to a recent Ubuntu OS, so some configurations differ)
  • configure users and services (e.g. create accounts for the server admins, configure the email server, etc.)
  • set up scripts and cron jobs
  • transfer 52 GB of game data in the database and about 20 GB of files (see below for details on the database)
  • test

Database

The game data is the data stored in the database: stored characters which get loaded on login; account and character information; logging tables for logins, game events, items and kills; stored postman messages; achievements and much more. When we migrate, the copy on the new server needs to be exactly what was on the old server at the last moment any player played; otherwise players would lose progress.

52 GB may sound harmless nowadays. But keep in mind that this is not a collection of huge files, but structured data. So you cannot simply copy the files from one computer to another.

We could have shut down the game, copied the data over, imported it on the new server, and then started the game again over there. But that would have meant about two weeks of downtime during which the server would be completely unavailable. It would have been by far the easiest option! But we have a bit of database administration experience between us, and we found a solution which minimised downtime while also allowing us time to check data integrity on the new server. To achieve this data transfer with minimal downtime, we used replication and advanced backup techniques based on open source software. Heavy background processes and network activity would normally cause lag in a turn-based application which must wait for database operations to complete before starting the next turn. However, during most of this processing we did not need to take the game server offline, because the game engine has asynchronous database access.


Replication

Replication is a MySQL feature where 'slave' servers maintain copies of the data from a 'master' in real time. Replication starts with an identical copy of the existing data from the master. From then on, the slave mirrors every change made on the master by executing every statement the master executes, which means the slave can stay only moments behind the master. Master and slave don't have to be physically connected; an internet connection is enough, and one master can have many slaves. The extra load on the master is tiny: the slaves just find out which statements have been run and run them too.
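
To give a rough idea of what setting up a slave looks like, here is a minimal sketch using JDBC from Java (the language Stendhal and Marauroa are written in). The host names, credentials and binary log coordinates are placeholders, not the values used on our servers; the statements themselves are standard MySQL replication commands.

  // ReplicationSetupSketch.java -- minimal sketch of pointing a MySQL slave
  // at a master over JDBC. Host names, credentials and binlog coordinates
  // are placeholders.
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class ReplicationSetupSketch {
      public static void main(String[] args) throws Exception {
          // connect to the *new* server, which will act as the slave
          try (Connection slave = DriverManager.getConnection(
                  "jdbc:mysql://new-server.example.org/", "root", "secret");
               Statement st = slave.createStatement()) {

              // tell the slave where the master is and at which point in the
              // master's binary log the initial backup was taken
              st.execute("CHANGE MASTER TO"
                      + " MASTER_HOST='old-server.example.org',"
                      + " MASTER_USER='repl',"
                      + " MASTER_PASSWORD='secret',"
                      + " MASTER_LOG_FILE='mysql-bin.000123',"
                      + " MASTER_LOG_POS=4");

              // start mirroring every statement the master executes
              st.execute("START SLAVE");
          }
      }
  }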

Lots of websites use replication, for example Wikipedia. When you read data from a huge website like Wikipedia, which gets lots of 'reads', you may be reading from one of many practically identical slaves. When you edit a Wikipedia page, it's the master you edit, and the slaves will be told about your changes.

We used replication in a different way. The new server got a copy of the data to start replicating from (more about that below), and from then on, up until the final migration, the new server mirrored every change that happened on the old server. When we were ready to migrate we could simply switch over. Until then, replication could also be paused and resumed at a later time. We used this to refine the indices on the gameEvents and itemlog tables to speed up certain kinds of queries in the future.
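
Pausing and resuming replication on the slave is just two statements, which leaves a window for schema work such as adding indices. The sketch below assumes an open JDBC connection to the new server; the column names in the indices are illustrative assumptions, not the real schema.

  import java.sql.Connection;
  import java.sql.Statement;

  class IndexRefinementSketch {
      /** Pause replication on the new server, refine indices, then resume. */
      static void refineIndices(Connection slave) throws Exception {
          try (Statement st = slave.createStatement()) {
              st.execute("STOP SLAVE");   // stop mirroring the master for a while
              // column names are illustrative assumptions, not the real schema
              st.execute("ALTER TABLE gameEvents ADD INDEX i_source_event (source, event)");
              st.execute("ALTER TABLE itemlog ADD INDEX i_itemid (itemid)");
              st.execute("START SLAVE");  // resume and catch up with the master
          }
      }
  }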

So, as replication starts with a backup, how did we manage to back up 52 GB of data without taking the server down for a long time?

Backup

Backing up database tables and getting a consistent copy means that the data must not be modified while the backup is in progress. So the tables have to be protected during the backup. Protecting the tables usually means locking them and preventing writes, but the game needs to write data all the time (e.g. for logging of events) and read data (e.g. loading a character from the database, or retrieving stored postman messages).

Nightly backups

We do a small backup every night which only contains player progress and the wiki, so that in case something bad happens your data will not be lost. But for this server move we wanted to include all the statistics data, which is about 52 GB instead of just the roughly 100 MB of progress data.

As pointed out above, we need a consistent backup for the server move. What does this mean? For example, moving an item works like this: start a transaction, copy the item to the new location, delete the item from the old location, commit the transaction. A transaction is all or nothing: either it is carried out completely or it is rolled back as if it had never started. For the nightly emergency backup we accept the risk that this rule is violated and a very small number of items are duplicated; after all, these backups are for emergencies only.
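
As an illustration of such a transaction, here is a small sketch in Java/JDBC of an item move. Table and column names are simplified for the example and do not claim to match the real Stendhal schema.

  import java.sql.Connection;
  import java.sql.PreparedStatement;

  class ItemMoveSketch {
      /**
       * Moves an item from one owner to another inside a single transaction:
       * either both statements take effect or neither does.
       */
      static void moveItem(Connection db, int itemId, int fromOwner, int toOwner) throws Exception {
          db.setAutoCommit(false);              // start a transaction
          try (PreparedStatement copy = db.prepareStatement(
                   "INSERT INTO items (itemid, owner) VALUES (?, ?)");
               PreparedStatement remove = db.prepareStatement(
                   "DELETE FROM items WHERE itemid = ? AND owner = ?")) {
              copy.setInt(1, itemId);
              copy.setInt(2, toOwner);
              copy.executeUpdate();             // copy the item to the new location

              remove.setInt(1, itemId);
              remove.setInt(2, fromOwner);
              remove.executeUpdate();           // delete it from the old location

              db.commit();                      // make both changes visible at once
          } catch (Exception e) {
              db.rollback();                    // as if the move never started
              throw e;
          } finally {
              db.setAutoCommit(true);
          }
      }
  }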

Using xtrabackup for innodb

So what did we do? Instead of the standard MySQL backup tool, mysqldump, we used an open source backup tool by Percona called xtrabackup. On InnoDB tables it can perform backups really fast, way faster than the normal methods, and without needing to lock tables. It starts a background thread that records all changes made since the start of the backup, similar to the way replication works. After the backup is completed, those logs can be applied to make the backup consistent as of the point in time the backup finished.

Another very nice feature of xtrabackup is that you can copy the result directly to the data folder of the new MySQL server without having to do an import.
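
A rough sketch of that workflow, written as a small Java helper that shells out to the tool: first take the hot backup, then "prepare" it so the recorded change log is applied. The paths are placeholders, and the exact flags can differ between xtrabackup versions.

  class XtrabackupSketch {
      /**
       * Take a hot backup and then "prepare" it, i.e. apply the change log that
       * xtrabackup recorded while the backup was running. Paths are placeholders
       * and flags vary between xtrabackup versions.
       */
      static void backupAndPrepare() throws Exception {
          run("xtrabackup", "--backup", "--target-dir=/backup/stendhal");
          run("xtrabackup", "--prepare", "--target-dir=/backup/stendhal");
          // the prepared directory can then be copied straight into the MySQL
          // data directory on the new server -- no import step needed
      }

      private static void run(String... cmd) throws Exception {
          Process p = new ProcessBuilder(cmd).inheritIO().start();
          if (p.waitFor() != 0) {
              throw new RuntimeException("command failed: " + String.join(" ", cmd));
          }
      }
  }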

A trick to work with MyISAM

So far, so good. Unfortunately not all of our tables use the InnoDB storage engine. To be precise: two archived gameEvents tables and the live itemlog table still used MyISAM. We converted all other tables to InnoDB some time ago, but back then we had to skip these tables because converting them would have taken several days.

We did some magic for them. For the archived logging tables we could use a normal backup: locking them doesn't matter because the game no longer writes to them anyway. Then we dropped them on the main server so they would not get copied again with the rest of the data.

The only live game table which was still MyISAM was the itemlog. It's huge! Most of the time the data doesn't need to be read from it at all; it's just being written to, logging item transfers, merges and movements. So, just for a short while, we renamed the itemlog to something else (itemlog_2011_06_01) and created a new, empty table called itemlog instead. Renaming tables is really fast, even for huge tables. We did a dummy insert into the empty itemlog table to ensure that it started logging with an id matching the last id of the previous table. The game was happy because it had a table to write to, and we could safely lock the huge old MyISAM table.
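
A sketch of that swap in Java/JDBC is shown below. It follows the steps described above; the dummy insert is simplified (in reality the other columns need values or defaults), and the brief gap between the rename and the creation of the new table is glossed over.

  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.Statement;

  class ItemlogSwapSketch {
      /**
       * Swaps the live MyISAM itemlog for a fresh, empty table so the old one
       * can be locked and dumped safely. The dummy row keeps the id sequence
       * of the new table in line with the old one.
       */
      static void swapItemlog(Connection db) throws Exception {
          try (Statement st = db.createStatement()) {
              long lastId = 0;
              try (ResultSet rs = st.executeQuery("SELECT MAX(id) FROM itemlog")) {
                  if (rs.next()) {
                      lastId = rs.getLong(1);
                  }
              }
              // renaming is nearly instant, even for a huge table
              st.execute("RENAME TABLE itemlog TO itemlog_2011_06_01");
              st.execute("CREATE TABLE itemlog LIKE itemlog_2011_06_01");
              // dummy insert so new rows continue the old id sequence
              st.execute("INSERT INTO itemlog (id) VALUES (" + lastId + ")");
          }
      }
  }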

Importing the dumped MyISAM tables on the new server took about 10 days because they are so huge. On the positive side, those tables are now InnoDB, too, and we used the import to refine the search indices to speed up future queries. We completed this phase by stitching the itemlog tables back together: we briefly paused replication, copied all rows from the new, small itemlog table into itemlog_2011_06_01, dropped the itemlog table and renamed itemlog_2011_06_01 back to itemlog.
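
Sketched in the same style, the stitching step could look like this on the new server, which was still a replication slave at that point. INSERT IGNORE is used here so the dummy row, whose id already exists in the archive table, is skipped; the real procedure may have handled this detail differently.

  import java.sql.Connection;
  import java.sql.Statement;

  class ItemlogStitchSketch {
      /** Fold the small live itemlog back into the imported archive and restore the name. */
      static void stitchItemlog(Connection newServer) throws Exception {
          try (Statement st = newServer.createStatement()) {
              st.execute("STOP SLAVE");                                // briefly pause replication
              st.execute("INSERT IGNORE INTO itemlog_2011_06_01 SELECT * FROM itemlog");
              st.execute("DROP TABLE itemlog");
              st.execute("RENAME TABLE itemlog_2011_06_01 TO itemlog");
              st.execute("START SLAVE");                               // resume mirroring the old server
          }
      }
  }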

Asynchronous database access

Why did the game not lag like crazy while we did heavy stuff with MySQL, using up the complete input/output bandwidth? Because some time ago we implemented asynchronous database access, which was originally introduced to help solve server lag.

Although Stendhal feels like a real-time game for chatting, walking, combat etc., it is internally turn-based with very short turns (300 ms). For example, when you say something to an NPC, they calculate their answer and reply in the next turn.

Server-side lag happens when there are too many calculations to complete within the normal length of a turn: the game doesn't move on until they are done. In the past, the main cause of server lag was waiting for data to be written to or retrieved from persistent storage, the database. If queries took too long or the database was locked by something else, the game server had to wait. So, to eliminate this major cause of lag, we implemented Asynchronous Database Access in Stendhal's game engine, Marauroa.

Most of the database access is for saving progress and logging events. These operations are easy to implement asynchronously because they are fire and forget. The difficult cases are the ones that require a response from the database. Instead of waiting for the answer, the server continues to process events and moves the world forward. In the next turn it checks whether there are database results ready to process.
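
The following is a minimal sketch of this pattern, not Marauroa's actual API: writes are handed to a background worker and forgotten, while reads return futures that the game loop polls once per turn.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  /** Sketch of asynchronous database access in a turn-based loop (not Marauroa's real API). */
  class AsyncDatabaseSketch {
      private final ExecutorService dbWorker = Executors.newSingleThreadExecutor();

      /** Fire-and-forget: log an event without blocking the current turn. */
      void logGameEvent(String source, String event) {
          dbWorker.submit(() -> {
              // INSERT INTO gameEvents ... (omitted)
          });
      }

      /** A read request returns a Future instead of blocking. */
      Future<String> loadCharacter(String name) {
          return dbWorker.submit(() -> {
              // SELECT the stored character object ... (omitted)
              return "serialized character of " + name;
          });
      }

      /** Called once per 300 ms turn: only consume results that are already there. */
      void processTurn(Iterable<Future<String>> pendingLoads) throws Exception {
          for (Future<String> load : pendingLoads) {
              if (load.isDone()) {
                  String character = load.get();   // ready: hand it to the game logic
              }
              // not ready yet? check again next turn -- the world keeps moving
          }
      }
  }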

On the server side this worked very well. But on login the client gives up after 10 seconds without a login confirmation from the server, and unfortunately logins are the most database-intensive operations we have. As a result we opted to take the server offline for about an hour during a very database-intensive phase of the preparations. Starting with the next version the client will be smarter: it will apply the 10 second timeout only to the initial communication with the server. Once it knows that there is a Stendhal server responding, it will wait longer for a login confirmation.
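
Conceptually, the smarter client behaviour boils down to two different timeouts, roughly like the sketch below. The concrete values and the use of a plain TCP socket are illustrative assumptions, not the actual Stendhal client code.

  import java.net.InetSocketAddress;
  import java.net.Socket;

  /** Sketch of the two-stage timeout; values and plain TCP are illustrative. */
  class LoginTimeoutSketch {
      static Socket connectForLogin(String host, int port) throws Exception {
          Socket socket = new Socket();
          // give up quickly if no Stendhal server answers at all
          socket.connect(new InetSocketAddress(host, port), 10000);
          // once a server is clearly there, allow the login confirmation to
          // take much longer, e.g. while the database works through queued queries
          socket.setSoTimeout(120000);
          return socket;
      }
  }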

To summarize: The asynchronous database access allows the Stendhal world to move forward without lag even if the database needs several minutes to answer queries.

DNS records

The domain name system is responsible for resolving domain names such as stendhalgame.org to IP addresses, which computers use to address data packets. While the final phase of the migration itself was done within an hour, DNS turned out to be an issue.

Internet providers cache DNS answers for performance reasons. So even after we changed the DNS entry of stendhalgame.org to point to the IP address of the new server, many players still got the old IP address as the answer. So although it took us less than an hour to get the new server live, some people would have had to wait for up to five hours.

As a workaround for this issue, we set up a temporary new DNS record called new.stendhalgame.org. As this entry was new, there were no old cache entries for it, so it returned the IP address of the new server immediately and players could connect to the game.

If the admins have control of the DNS server, they can decrease the cache time beforehand. But we don't run our own DNS server (we are not a company with a huge number of servers, just a spare-time project), so we depend on the DNS server of our provider, and that one does not offer an option to decrease the cache time.

By the next day most ISP DNS servers were already updated. But some ISPs are slow, so we set up an rinetd daemon on the old server to forward requests to the web server port and game server port to the new server. This has two negative side effects: it introduces additional network lag of about 20 ms, and the account login history does not show the real IP address. Given that only a very small number of people are affected by this issue, this is acceptable, and we are able to update the account login history with the real IP addresses from the rinetd log.
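
rinetd itself is driven by a small configuration file, but the idea is simple enough to sketch in Java: accept connections on the old machine and pipe the bytes to the same port on the new one. The host name and port below are placeholders.

  import java.io.InputStream;
  import java.io.OutputStream;
  import java.net.ServerSocket;
  import java.net.Socket;

  /** Minimal sketch of what the rinetd setup does; host and port are placeholders. */
  class PortForwardSketch {
      public static void main(String[] args) throws Exception {
          try (ServerSocket listener = new ServerSocket(32160)) {       // game server port (placeholder)
              while (true) {
                  Socket client = listener.accept();                    // player connects to the old server
                  Socket target = new Socket("new-server.example.org", 32160);
                  pipe(client.getInputStream(), target.getOutputStream());
                  pipe(target.getInputStream(), client.getOutputStream());
              }
          }
      }

      /** Copy bytes from in to out on a background thread until the connection closes. */
      private static void pipe(InputStream in, OutputStream out) {
          new Thread(() -> {
              byte[] buffer = new byte[4096];
              int n;
              try {
                  while ((n = in.read(buffer)) != -1) {
                      out.write(buffer, 0, n);
                      out.flush();
                  }
              } catch (Exception ignored) {
                  // one side closed the connection
              }
          }).start();
      }
  }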

To summarize: the move would have been a lot easier if we had been able to decrease the cache time of the DNS entries. But we found good workarounds to live with the situation.

In-game handling

Replication was up and running on the new server, with data being reliably transferred live, for some days before we did the actual move. However, game testing on the new server was not possible, as this would have written data and risked the replication process. In order to test, we needed a period of time during which no new data was being written on the old server. We could have shut the game server down to achieve this freeze, but we found a solution where players could still log in, see their friends, and get information from admins. This required a safe world where changes could not be made and therefore a freeze didn't matter. A special zone was created and all players were teleported there. Once inside, they could chat and walk around the zone, but not lose items, trade, fight or die. It was a weird and wonderful zone which suited the totally new situation that players found themselves in.

[Image: Servermove20110612-1.jpg]

An NPC called Megan was available to explain the situation, and hendrikus and superkym answered questions while also doing the final server migration behind the scenes. Games like dancing and posing for screenshots provided some entertainment, too.

[Image: Servermove20110612-2.jpg]

Normal play was halted for about one hour while the new server was tested and the website configured, but the special safe zone meant that the 'downtime' was a fun experience and not simply a dead server.

Conclusion

We managed to do the migration with only two one-hour windows of downtime because we did a lot of preparation in advance. The creative idea of a safe zone to play in meant players could still contact their online friends and have some fun. A simple approach would instead have meant about 12 days with a totally offline server.