Complex game server problems that lead to overtime: why do they happen?

  • Subject: How to develop a more stable and faster game server
  • Lecturer: Bae Hyun-jik – Nettention / CTO
  • Presentation area: development, server
  • Lecture time: 2021.11.18 (Thu) 15:00 ~ 15:50
  • Lecture Summary: Server work becomes the bottleneck during development. The server may fail once the service is live. Server-related problems such as item payment errors may occur. And server developers make excuses. How should we understand this vicious cycle, and how can we break it? Let’s find out.

    ■ What is a typical game server architecture? – Complex server structures that vary widely even within the same genre

    ▲ Server structure of MO games divided into in-game and out-game

    CTO Bae Hyun-jik opened the lecture with the general structure of game server architecture. He said that even within the same genre, the server structure can differ depending on the content of the game, and he used diagrams to introduce the structures that go with different play styles, such as MO and MMO.

    In MO games, where players first create and enter a room to play, the server is divided into in-game and out-game parts. When the out-game part handles matchmaking or chat, a client request that is delayed by about 1 second causes no real problem, but in-game, a delay of even 0.05 seconds noticeably degrades the player’s experience.

    MMO games, in which thousands of players play together in one world, are divided into multiple zone servers. Each zone server covers several regions, and a zone server that responds to a request even 0.1 second late affects the play experience. He explained that, unlike the simplified diagram, the actual server structure is more complex because it also includes things such as in-app payment, login integration, BI, and operations.

    ▲ Server structure of MMO game divided into multiple zone servers

    ■ Types of server-related problems after the service opens – From server crashes to non-response, lag, malfunction and hacking…

    CTO Bae Hyun-jik went on to introduce server-related problems. Developers run into many kinds of them once a service opens, from server crashes to non-response, lag, malfunction, and hacking.

    First, a server crash means, as the name suggests, that ‘the server dies’. It is caused by memory corruption in the server program: the program is forcibly terminated and every connected user is disconnected. This symptom is relatively easy for server administrators to recognize, and they can take action as soon as it occurs.

    The second symptom is ‘server not responding’. The server itself is not dead, but processing falls behind because clients send too many requests. If this is temporary, it is not a big problem, but if the delay keeps accumulating, connected users feel as if the server has died. Because it is hard to notice early on that the server is heading toward a non-responsive state, this is also harder to solve than a crash. Non-response problems often surface after the crash problems have been dealt with.

    The third symptom is ‘server lag’. It is similar to non-response, but it occurs when there is a kind of ‘cycle’ in which the rate of client requests rises above and then falls below the rate at which the server can do the work. If the unprocessed work keeps accumulating, the result is ‘server not responding’ as described above; if it does not accumulate, the result is ‘server lag’.
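    The difference between the two symptoms can be illustrated with a toy backlog model. This is only a sketch under simple assumptions: time is split into ticks, the server finishes a fixed number of requests per tick, and the request counts are made-up numbers, not figures from the lecture.

```python
def simulate_backlog(arrivals, capacity_per_tick):
    """Return the number of unprocessed requests after each tick.

    arrivals: requests arriving in each tick (hypothetical data).
    capacity_per_tick: requests the server can finish in one tick.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        # Work that cannot be finished this tick carries over to the next.
        backlog = max(0, backlog + arriving - capacity_per_tick)
        history.append(backlog)
    return history


# Load swings above and below capacity: the backlog builds up and drains again (lag).
cyclic = simulate_backlog([15, 15, 5, 5, 15, 15, 5, 5], capacity_per_tick=10)

# Load stays above capacity: the backlog only grows (not responding).
sustained = simulate_backlog([15] * 8, capacity_per_tick=10)

print(cyclic)     # [5, 10, 5, 0, 5, 10, 5, 0]
print(sustained)  # [5, 10, 15, 20, 25, 30, 35, 40]
```

    In the first run the queue returns to zero, so players see intermittent delays; in the second it never drains, which is the pattern that eventually looks like a dead server.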

    Unlike the previous two symptoms, server lag is often caused by a network problem or by where the server is located rather than by a problem inside the server. He explained that it is less urgent than a crash or non-response, but if it is left alone, user dissatisfaction may grow.

    The fourth symptom is ‘server malfunction’. This occurs when the server computes a wrong result, such as one side of a player-to-player item exchange going missing, or experience points not accumulating even after monsters are defeated. If the cause is a simple coding mistake, it can be fixed easily, but if the server structure is designed in a complex way, it can be a difficult problem.
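    As an illustration of the item exchange case, the sketch below validates both sides of a trade before changing anything, so a rejected trade cannot leave one half applied. It is a single-process toy with hypothetical names (`exchange_items`, `TradeError`); a real server would also need database transactions and locking, which are not shown here.

```python
class TradeError(Exception):
    """Raised when an exchange cannot be completed."""


def exchange_items(inventories, player_a, item_a, player_b, item_b):
    """Swap item_a (owned by player_a) with item_b (owned by player_b).

    inventories maps a player name to a set of item names. All checks run
    before any mutation, so a failed trade leaves both inventories untouched.
    """
    if item_a not in inventories[player_a]:
        raise TradeError(f"{player_a} does not own {item_a}")
    if item_b not in inventories[player_b]:
        raise TradeError(f"{player_b} does not own {item_b}")

    # Both halves of the exchange are applied together, after validation.
    inventories[player_a].remove(item_a)
    inventories[player_b].remove(item_b)
    inventories[player_a].add(item_b)
    inventories[player_b].add(item_a)


inventories = {"alice": {"sword"}, "bob": {"shield"}}
exchange_items(inventories, "alice", "sword", "bob", "shield")
print(inventories)
```

    The design choice is simply to separate checking from applying; the more places a trade’s state is spread across, the harder that separation becomes, which is one reason a complex structure makes this kind of bug harder to fix.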

    He explained that a server design should be as simple as possible, but that designs often become complex because developers worry about ‘Maserati problems’. A ‘Maserati problem’ means agonizing over which Maserati model and color to buy without having the money to buy one; in other words, worrying about things that do not need to be worried about at this stage.

    The last one is ‘hacking’. No matter how flawless a game is, it can be hacked, because there are malicious people everywhere and they never stop trying. He explained that server developers should keep ‘hackability’ in mind in every design, and draw a clear line on how far they are willing to compromise between cost and quality.

    ■ Why do server problems cause problems with development schedules?

    So why do these server problems disrupt the development schedule? CTO Bae Hyun-jik said that the first hurdle is that ‘we can’t do enough testing’. In-house testing is the primary way to check whether the game server works well, but it never has enough simultaneous users, so it is hard to properly check problems that only appear when many users connect, such as server overload or certain bugs.

    As a countermeasure, a ‘bot test’ using automated macro bots is sometimes conducted. This can filter out quite a few bugs and load issues in advance, but it does not cover every behavior of unpredictable human players. The QA team exists to fill that remaining gap: it builds and runs bot test scenarios that draw on human imagination.
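    A bot test of this kind might be sketched as follows. Everything here is an assumption for illustration: the action names, the `fake_server` handler with its deliberate bug, and the sequential loop. A real bot test would connect over the network and run many bots concurrently.

```python
import random


def run_bot_test(handler, bot_count, actions_per_bot, seed=0):
    """Drive `handler` with randomized bot actions and collect failures.

    handler(bot_id, action) stands in for a request to the game server;
    any exception it raises is recorded instead of stopping the run.
    """
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    actions = ["move", "chat", "trade"]  # hypothetical action set
    failures = []
    for bot_id in range(bot_count):
        for _ in range(actions_per_bot):
            action = rng.choice(actions)
            try:
                handler(bot_id, action)
            except Exception as exc:
                failures.append((bot_id, action, repr(exc)))
    return failures


def fake_server(bot_id, action):
    """Toy handler with a deliberate bug: trades fail for every 7th bot."""
    if action == "trade" and bot_id % 7 == 0:
        raise RuntimeError("trade failed")


failures = run_bot_test(fake_server, bot_count=50, actions_per_bot=20)
print(len(failures), "failures found")
```

    Random actions only reach the bugs that the action set can express, which is why scenarios written by people still matter.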

    The second reason is that, unlike a client bug, a server bug has a large impact. A problem in the client program does not affect other players, but a problem in the server program does. Because a problem triggered by a single user can affect some or all of the other players at the same time, the server must be developed and tested more carefully.

    When a bug is found, its cause has to be identified. On the client this can be checked immediately, and the cause is easy to trace by attaching a debugger. The debugger shows things that are not on the screen, including the detailed data and state of the program behind it. Fixing the program and forcibly restarting it causes no problem either. None of this is true on the server.

    Server developers have to build their own monitoring tools or use existing ones, and these do not show the server’s internal data in as much detail as a debugger. Server developers might also think of attaching a debugger as they would to a client, but doing so stops the server. That means a debugger can only be attached during in-house development, not during live service. Unfortunately, he explained, a debugger is more likely to be needed after the actual release than during in-house testing.

    As a result, server developers choose the ‘next best solution’ instead of a high-quality tool such as a debugger: they record everything the server does, down to the smallest action, and look through the logs. Since it is not possible to read every log, they extract the logs related to the malfunction and estimate the cause from them. It is not as reliable as a tool like a debugger, but it is probably the best option available.
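    Extracting the logs related to one malfunction could look like the sketch below. The log format is an assumption made for this example: one JSON object per line with hypothetical fields `ts`, `player`, and `event`. Real server logs will differ.

```python
import json


def extract_related_logs(lines, player_id):
    """Return parsed log records that belong to one player, oldest first.

    Assumes each line is a JSON object with the hypothetical fields
    "ts", "player" and "event"; lines that fail to parse are skipped.
    """
    related = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore corrupted or non-JSON lines
        if isinstance(record, dict) and record.get("player") == player_id:
            related.append(record)
    return sorted(related, key=lambda r: r.get("ts", 0))


sample_log = [
    '{"ts": 3, "player": "p1", "event": "trade_commit"}',
    '{"ts": 1, "player": "p1", "event": "trade_request"}',
    '{"ts": 2, "player": "p2", "event": "login"}',
    'corrupted line',
]
events = [r["event"] for r in extract_related_logs(sample_log, "p1")]
print(events)  # ['trade_request', 'trade_commit']
```

    The filtered records only show what the code chose to log, so the cause still has to be inferred, which is the limitation described above.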

    CTO Bae Hyun-jik said that almost all commercial game services now run on scale-out server clusters, and that a scale-out design is much more complex than a single server. Because of that complexity, it is difficult to trace the cause of a bug when one occurs.

    Ironically, a design that pursues horizontal expansion can itself end up overloading the servers. In that situation each individual server machine is doing very little, yet gameplay processing across the cluster is delayed in a chain, like falling dominoes. Adding more servers does little to solve the problem. As a result, a scale-out server design has to find a compromise between simplicity and scale-out flexibility.

    ■ So how can these server problems be solved? – ‘Proudnet 2’

    So how can this series of server problems be solved? CTO Bae Hyun-jik explained that the rest of the team should keep discussing with the server team whether the server design contains unnecessary elements, and help the server team notice and simplify them. Insufficient information about the game system and content can lead to a server design that is overcomplicated for the sake of horizontal expansion. To prevent this, it is good to tell the server team as much as possible about the constraints on each piece of content. He emphasized that any element of a server design that cannot be given a ‘clear reason’ can be dangerous, so the future becomes comfortable only when unexplained complexity is removed and simplicity is pursued.

    Even with these efforts, a horizontally scaled server design with a complex structure is sometimes unavoidable. In that case the server team has to prepare for it: the team needs the skills to track down problems quickly when they arise, and powerful tools to do so. CTO Bae Hyun-jik introduced ‘Proudnet 2’ as one such powerful new tool.

    Proudnet 2 is a server engine that helps quickly find the cause of the various server problems game developers face while developing servers, and quickly fix the cause once it is found. Proudnet 2’s ‘Debug Trace’ function works like attaching a debugger, but without stopping the server. When the server performs a specific task, it can also log the detailed state of the data inside the server at that moment.

    Proudnet 2 also includes a tool called ‘Server Memory Viewer’, which shows the details of the memory in the game server. The server is not paused or slowed down for this, so the viewer can be used on a live service.

    In addition, it includes a ‘performance analysis function’ that shows where on the server a throughput bottleneck is occurring, a ‘crash report function’ that records the crash point at debugger-level detail so the cause of a server crash is easy to find, and a ‘hot reloading’ function that helps apply a patch immediately after the cause of a problem has been found.

    CTO Bae Hyun-jik explained that with Proudnet 2, the cause of a bug can be found and fixed faster than by relying on logs or rolling updates, even on servers designed for horizontal expansion. He concluded the presentation by expressing his hope that, just as Nettention’s Proudnet 1 solved various difficulties in developing the network and the lower layers of the server, ‘Proudnet 2’ will solve the difficulties of developing complex server content.

    Nettention’s ‘Proudnet 2’ is scheduled to be officially released in early 2022. Applications for the beta version are currently open through the Proudnet website, and anyone can use it for free until the official release date.