[Dev Log] FFI Optimization Part 1: The Nightmare of Diablo 2 and the Coroutine Massacre

Author: Logan Kim

The term 'Optimization' is all too easily misused in the field. Countless game developers treat Unity's Coroutine as the magic of asynchronous multithreading, making the fatal mistake of passing off heavy mathematical operations to the main thread in 'installments.' It might look like it's working on the surface, but down at the system level, a single CPU core is coughing up blood and dying.

This post is a benchmark report designed to shatter the illusions of juniors trapped in romanticized coding. Through a test in which 10,000 objects run heavy floating-point operations, with rendering noise stripped out, we will dissect how coroutines massacre a game's framerate. I will then show how Data-Oriented Design (DOD) and C++ native code called through FFI can salvage the system.

1. The Chilling Memory of the Pentium PC: The Secret Cow Level and the 'Tab' Key

For the generation that played Diablo 2 on a Pentium PC in the early 2000s, there is one unforgettable horror: the moment you reflexively hit the 'Tab' key to open the minimap in the middle of the Secret Cow Level, swarming with hundreds of Hell Bovines.

The screen would instantly freeze, the hard drive would scream, and if you were unlucky, your character was already lying dead on the cold ground. We simply called it "lag" back then, but from an architect's perspective, this was clearly a 'disaster of state synchronization and memory bottleneck.' The agonizing computation required to load the coordinates and states of hundreds of objects and map them to the UI had completely strangled the main thread.

A single CPU core is structurally vulnerable to this kind of calculation. Coordinates in a game are rarely exact integers, and calculating them requires heavy floating-point arithmetic, the same kind of work often used to benchmark CPUs. Modern 3D game engines like Unity perform these floating-point operations on a scale that makes the Pentium era look like child's play.

Decades have passed since the release of Diablo 2, and hardware has evolved exponentially. Unfortunately, the architectures built by modern junior developers handling game engines are still repeating the disasters of the Pentium era. At the center of this tragedy lies Unity's sweet poison: the Coroutine.

2. Coroutines Are Not Magic: The Trap of Installment Payments

Many novice developers mistake coroutines for asynchronous multithreading. When faced with heavy operations, they mindlessly shove them into an IEnumerator and throw a yield return.

Let's be brutally honest: Coroutines are not magic. They are merely 'installment payments' that slice up the main thread's computation time and defer it to the next frame. When you have 10 or 100 objects, these installments go unnoticed. But the moment 10,000 objects each hold their own coroutine and present a lump-sum bill to the main thread, bankruptcy (frame drops) is guaranteed.
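To make the installment metaphor concrete, here is a minimal C++ sketch. It models the behavior described above; it is not Unity's actual scheduler. The names InstallmentTask, FrameStats, and runFrames are hypothetical, invented for this illustration. Each task does a slice of math and then "yields", and a frame loop resumes every live task once per frame on the calling thread.

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for one coroutine: it does up to `stepsPerFrame`
// units of math per resume, then "yields" until the next frame.
struct InstallmentTask {
    int remainingSteps;  // work still owed
    float value;         // running result

    // Runs one installment; returns true while more work remains.
    bool resume(int stepsPerFrame) {
        for (int i = 0; i < stepsPerFrame && remainingSteps > 0; ++i, --remainingSteps) {
            value += std::sin(static_cast<float>(remainingSteps)) * 0.001f;
        }
        return remainingSteps > 0;
    }
};

struct FrameStats {
    int frames = 0;                     // frames needed to finish every task
    long long stepsExecuted = 0;        // total units of work performed
    std::set<std::thread::id> threads;  // every thread that ran task code
};

// Simulated frame loop: each frame, every live task is resumed once,
// one after another, on the calling ("main") thread.
FrameStats runFrames(std::vector<InstallmentTask>& tasks, int stepsPerFrame) {
    assert(stepsPerFrame > 0);
    FrameStats stats;
    bool anyAlive = true;
    while (anyAlive) {
        anyAlive = false;
        for (auto& task : tasks) {
            if (task.remainingSteps <= 0) continue;
            const int before = task.remainingSteps;
            const bool more = task.resume(stepsPerFrame);
            stats.stepsExecuted += before - task.remainingSteps;
            stats.threads.insert(std::this_thread::get_id());
            anyAlive = anyAlive || more;
        }
        ++stats.frames;
    }
    return stats;
}
```

Calling runFrames on 1,000 tasks of 100 steps each, with 10 steps per frame, finishes in 10 frames, and stats.threads contains exactly one thread id. Slicing changes when the bill arrives, not who pays it: the per-frame cost on that one thread still grows linearly with the number of live tasks.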

To prove this, I built a pure CPU logical operation benchmark environment, completely stripping away the rendering load (Draw Calls).

3. Benchmark Design: The Floating-Point Massacre of 10,000 Objects

I stripped away all of Unity's rendering shells to create a 'Headless Benchmark' that performs only memory data manipulation and math operations.

  • Test Environment: Spawning dummy logical objects in batches of 1,000, up to a total of 10,000.

  • Computational Load: Every tick, each object repeats heavy floating-point operations (Sin, Cos, Lerp) thousands of times to compute its destination coordinates. The load is meant to be roughly comparable in cost to an A* pathfinding calculation.

  • Visual Verification: Randomly sampling 100 completed trajectories and visualizing them with red/green lines (Debug.DrawLine) in the Scene view.
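The computational load in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the benchmark's actual source: the Agents layout, the computeDestinations function, the update formula, and the `iterations` knob are all hypothetical. It uses a struct-of-arrays layout in the spirit of Data-Oriented Design, so each field lives in one contiguous array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unclamped linear interpolation: a + (b - a) * t.
inline float lerp(float a, float b, float t) { return a + (b - a) * t; }

// Struct-of-arrays layout (hypothetical): one contiguous array per field.
struct Agents {
    std::vector<float> x, y;          // starting positions
    std::vector<float> destX, destY;  // computed destinations

    explicit Agents(std::size_t n) : x(n), y(n), destX(n, 0.0f), destY(n, 0.0f) {
        for (std::size_t i = 0; i < n; ++i) {
            x[i] = static_cast<float>(i);
            y[i] = static_cast<float>(i) * 0.5f;
        }
    }
    std::size_t size() const { return x.size(); }
};

// Computes destinations for agents in [begin, end).
// `iterations` stands in for the "thousands of times" inner loop.
void computeDestinations(Agents& a, std::size_t begin, std::size_t end, int iterations) {
    assert(begin <= end && end <= a.size());
    for (std::size_t i = begin; i < end; ++i) {
        float px = a.x[i];
        float py = a.y[i];
        for (int k = 0; k < iterations; ++k) {
            const float angle = 0.01f * static_cast<float>(k) + px;
            px = lerp(px, px + std::cos(angle), 0.1f);  // step 10% toward a trig-derived target
            py = lerp(py, py + std::sin(angle), 0.1f);
        }
        a.destX[i] = px;
        a.destY[i] = py;
    }
}
```

Taking a [begin, end) range instead of the whole array is a deliberate choice in this sketch: the same kernel can be run over the full set on one thread, or handed out in disjoint slices to several workers without any locking.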

Architect's Note: If rendering loads are mixed in, you cannot accurately measure the pure mathematical computation bottleneck of the CPU. True system profiling must be conducted in an environment where all external noise is strictly controlled.

4. Benchmark Execution: Two Starkly Contrasting Worlds

I implemented a test UI in a default Unity environment to compare two scenarios. By clicking the '+1000 Coroutine Jobs' or '+1000 FFI Jobs' buttons, I incrementally poured the exact same computational tasks into the system.

[Case 1: The Fate of Mindless Coroutines]

After adding a mere 2,000 coroutine tasks, the framerate plummets and the screen stutters so badly that normal gameplay is utterly impossible.

[Case 2: C++ Native Multithreading (FFI)]

Even after adding 1,000 tasks, there isn't a single micro-stutter. The video is edited for length, but stacking the workload up to 5,000 tasks results in only a slight framerate dip, and the scene stays smooth.

As these results show, the exact same objective can yield drastically different outcomes depending on how the system is architected. Hardware has undeniably become monstrously powerful, but if the implementation remains stuck in the Pentium II era, whether from writing code thoughtlessly or from blindly trusting an AI to do it, you will almost certainly be handed the hellscape of the first case (Coroutines).

This kind of development is essentially strangling a single CPU core to death while it handles the main thread. Meanwhile, the rest of the CPU cores do absolutely nothing, merely pretending to be busy by enthusiastically cheering on the dying core. This creates a deceptive illusion where the overall CPU usage in the Task Manager appears bizarrely high.

Ultimately, the consequence of this horrific overhead is a game branded by players as having 'garbage optimization.' When run on a portable device like the Steam Deck (UMPC), it will spin the cooling fans like a jet engine taking off and completely melt the battery.

5. Next Episode Preview: Stupid C++ Integration and the Marshalling Tax

We have reached the conclusion that these garbage operations must be ripped out of the coroutines and thrown to the peacefully idling CPU cores.
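As a rough sketch of what "throwing the work to the idle cores" can look like on the native side, the block below splits an array into contiguous chunks and gives each chunk to its own std::thread. Everything here is an assumption made for illustration: the exported name compute_destinations, its parameter list, and the simulate kernel are hypothetical and are not the plugin used in the benchmark. The extern "C" linkage is there only so that a managed caller could bind to the symbol by name; the caller owns both buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Per-element work: a hypothetical stand-in for the trig-heavy update.
static float simulate(float start, int iterations) {
    float p = start;
    for (int k = 0; k < iterations; ++k) {
        p += 0.1f * std::sin(0.01f * static_cast<float>(k) + p);
    }
    return p;
}

// Hypothetical exported entry point. Reads `count` floats from `in`, writes
// `count` floats to `out`, and blocks until every worker has finished.
extern "C" void compute_destinations(const float* in, float* out, int count,
                                     int iterations, int workerCount) {
    assert(count >= 0 && workerCount > 0);
    const std::size_t n = static_cast<std::size_t>(count);
    // Never use more workers than elements (but at least one).
    const std::size_t workers =
        std::min<std::size_t>(workerCount, std::max<std::size_t>(n, 1));
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        // Each worker writes only to its own disjoint slice, so no locks are needed.
        pool.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = simulate(in[i], iterations);
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

Two simplifications are worth flagging. Spawning threads on every call keeps the sketch short, but a long-lived worker pool would avoid paying thread creation costs each tick. And because this version joins before returning, the caller still waits for the result; a real integration would more likely dispatch the job and collect the output on a later frame.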

So, if we simply create a C++ plugin and hand over the computations, does everything magically resolve itself? Unfortunately, a sloppy C++ FFI (Foreign Function Interface) integration invites a second terror that is even worse than coroutines: the 'Garbage Collector (GC) Spike.'

In the next post, I will dissect the junior anti-pattern where the main thread is shattered all over again by memory Marshalling costs, despite introducing C++ computations.