I play World of Warcraft. Being a Linux user requires me to run this in Wine, a Windows API translation layer. While Wine is continuously improving, its performance in Direct3D games still leaves much to be desired.
Having very little familiarity with Direct3D or the Wine codebase, I decided to spend a weekend diving straight into the deep end to make things better.
We have a bit of a head start on where to look: since the game runs well on Windows, the bottleneck is likely to be CPU-side or synchronization related (either in NVIDIA’s GL driver, or in Wine). Tools such as nvidia-smi show that GPU utilization is fairly low (30-40%), which supports that suspicion.
But before touching a single line of code, we need visibility. What problem are we trying to solve? Where is the slowness coming from?
One of my favourite tools for answering these questions is perf, a performance counter based profiler for Linux. perf shows how CPU time is distributed and, importantly, which functions are the hottest.
That’s all we need to get started here. I opened WoW and traveled to an area of the game that performed much worse under Wine, running at ~14fps versus ~40fps on Windows. perf sheds light on where our time was spent:
    
    Let’s analyze some of the top offenders, and figure out at a high level how to make things better.
First, wined3d_cs_run leads us head first into how Wine’s Direct3D implementation works internally. wined3d is the library responsible for translating Direct3D calls into OpenGL calls. Newer versions of Wine use a command stream to execute the OpenGL calls after translation from their D3D equivalents. Think of this as a separate thread that executes a queue of draw commands and state updates from other threads. This not only sidesteps OpenGL’s nightmarish multithreading story, but also lets us parallelize more effectively and ensures that the resulting GL calls are executed in some serialized order.
Right, back to wined3d_cs_run. This is the core of the command stream: it’s just a function that busy-waits on a queue for commands from other threads. Some brief analysis of the source code indicates that it does no real work of interest, other than invoking op handlers for the various commands. But at least we know about command streams now!
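To make the command stream idea concrete, here is a minimal single-threaded sketch in C. It is not wined3d’s code: the opcodes, struct cs, cs_submit and cs_drain are hypothetical names chosen for illustration, and the real wined3d_cs_run lives on its own thread and keeps waiting for work instead of returning when the queue is empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; wined3d's real command set is far larger. */
enum cs_op { CS_OP_DRAW, CS_OP_SET_STATE, CS_OP_COUNT };

struct cs_packet {
    enum cs_op op;
    int arg;
};

#define CS_QUEUE_SIZE 64

struct cs {
    struct cs_packet queue[CS_QUEUE_SIZE];
    size_t head;  /* next packet to execute (consumer side) */
    size_t tail;  /* next free slot (producer side) */
    int draws;    /* stand-ins for the GL work a handler would do */
    int state;
};

typedef void (*cs_handler)(struct cs *cs, const struct cs_packet *p);

void cs_exec_draw(struct cs *cs, const struct cs_packet *p)
{
    (void)p;
    cs->draws++;
}

void cs_exec_set_state(struct cs *cs, const struct cs_packet *p)
{
    cs->state = p->arg;
}

/* One handler per opcode, indexed by the opcode itself. */
static const cs_handler cs_handlers[CS_OP_COUNT] = {
    [CS_OP_DRAW] = cs_exec_draw,
    [CS_OP_SET_STATE] = cs_exec_set_state,
};

void cs_init(struct cs *cs)
{
    cs->head = cs->tail = 0;
    cs->draws = 0;
    cs->state = 0;
}

/* Producer side: called by the thread issuing D3D calls. Returns 0 if full. */
int cs_submit(struct cs *cs, enum cs_op op, int arg)
{
    if (cs->tail - cs->head == CS_QUEUE_SIZE)
        return 0;
    cs->queue[cs->tail % CS_QUEUE_SIZE] = (struct cs_packet){ op, arg };
    cs->tail++;
    return 1;
}

/* Consumer side: run every queued packet in submission order.
 * Returns how many packets were executed. */
size_t cs_drain(struct cs *cs)
{
    size_t executed = 0;
    while (cs->head != cs->tail) {
        const struct cs_packet *p = &cs->queue[cs->head % CS_QUEUE_SIZE];
        assert(p->op < CS_OP_COUNT);
        cs_handlers[p->op](cs, p);
        cs->head++;
        executed++;
    }
    return executed;
}
```

Submitting a state change followed by two draws and then calling cs_drain runs all three handlers in order. That serialized ordering is the property the command stream exists to provide.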
wined3d_resource_map is where
    things get interesting. Fundamentally, it’s a function that maps a
    slice of GPU memory into the host’s address space, typically for
    streaming geometry data or texture uploads. Given a “resource”
    (which is typically just a handle into some kind of GPU memory), it
    does the following:
glMapBufferRange.
Intuitively, this makes a lot of sense. We need to wait for the command stream to finish before this function can return; otherwise, where would we get our pointer from? We can’t execute any OpenGL commands to map the resource off the command thread, so we’re forced to wait.
I needed to learn more about how WoW uses its buffer maps to figure out what we could do here.
Enter apitrace. By using apitrace to wrap the execution of WoW under Wine, I could intercept the D3D9 calls it was making before they reached wined3d.

Using this data, I was able to construct a simple model of how WoW renders dynamic geometry:
D3DLOCK_NOOVERWRITE flag, which promises not to overwrite any data involved in an in-flight draw call.
D3DLOCK_DISCARD flag, which invalidates the buffer’s contents.
memcpy.
DrawIndexedPrimitive on the new segment of vertex data in the buffer.
This technique works very well for ensuring that we should never have to wait for the GPU. In fact, Microsoft recommends it.
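The append-then-discard pattern can be sketched in plain C without any Direct3D headers. Everything here is an assumption made for illustration: the enum values stand in for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD, vb_append is a hypothetical helper, and the rule "append while there is room, discard when full" is my reading of the pattern, not something taken from WoW’s code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD (illustrative only). */
enum lock_flag { LOCK_NOOVERWRITE, LOCK_DISCARD };

struct dynamic_vb {
    unsigned char *storage;  /* current backing memory */
    size_t capacity;
    size_t cursor;           /* first byte not yet handed out */
};

struct vb_range {
    size_t offset;           /* where the new vertex data landed */
    enum lock_flag flag;     /* which kind of lock was needed */
};

/* Copy `size` bytes of vertex data into the buffer and report where it went. */
struct vb_range vb_append(struct dynamic_vb *vb, const void *data, size_t size)
{
    struct vb_range r;

    assert(size <= vb->capacity);
    if (vb->cursor + size <= vb->capacity) {
        /* Room left: write past everything in-flight draws may still read. */
        r.flag = LOCK_NOOVERWRITE;
    } else {
        /* Full: discard the old contents and start again from the front.
         * A real driver would hand back fresh memory at this point; this
         * sketch simply reuses the same storage. */
        r.flag = LOCK_DISCARD;
        vb->cursor = 0;
    }
    r.offset = vb->cursor;
    memcpy(vb->storage + r.offset, data, size);
    vb->cursor += size;
    return r;  /* the caller would now issue a draw starting at r.offset */
}
```

With an 8-byte buffer, appending 5 bytes and then 3 bytes uses the no-overwrite path twice, and the next append wraps to offset 0 with a discard.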
But the story with Wine is less than ideal. In practice, this is what happens:
    
This is a textbook example of a pipeline stall. By waiting for the command stream thread and the mapping thread to synchronize, we’re wasting time that could instead be spent dispatching more draw calls to the GPU. If the command stream thread is busy, the D3D thread could be waiting a nontrivial amount of time for a response! On top of that, glMapBufferRange (the actual OpenGL call used to map a buffer) is itself slow: even with synchronization explicitly disabled, mapping the buffers takes a long time.
We could solve our problem handily by not having to wait for the CS thread. The question is: how?
Suppose we had access to a large, persistently mapped buffer in the host address space. We never had to unmap it to make a draw call, and writes to it were coherently visible to the GPU without any GL calls.
If D3DLOCK_NOOVERWRITE was provided, then we can return the address of the last persistent mapping for that buffer.
D3DLOCK_NOOVERWRITE fundamentally lets us ignore synchronization.
If D3DLOCK_DISCARD was provided, then we can remap the buffer to an unused section of persistently mapped GPU memory.
    Enter the holy grail: ARB_buffer_storage. This lets
    us allocate an immutable section of GPU memory, and allow persistent
    (always mapped) and coherent (write-through) maps of it. We’re
    effectively replacing the role of the driver here, which would
    handle DISCARD (INVALIDATE in GL) and
    NOOVERWRITE (UNSYNCHRONIZED in GL) buffer
    maps itself.
This is an AZDO (approaching zero driver overhead) style GL extension. If you’re interested, check out this article by NVIDIA.
wine-pba (short for persistent buffer allocator) is
    a set of patches I’ve written that leverages
    ARB_buffer_storage to implement a GL-free GPU heap
    allocator, vastly improving the speed of buffer maps.
At device initialization, a very large OpenGL buffer is allocated. This buffer is governed by a simple re-entrant heap allocator that allows both the command stream thread and D3D thread to make allocations and recycle them independently from each other.
When a D3DLOCK_DISCARD map is made, the D3D thread
    immediately asks the allocator for a new slice of GPU memory and
    returns it. The command stream thread is sent an asynchronous
    message informing it of the discard, with information on the new
    buffer location so that future draw commands on the command stream
    thread are aware. The command stream thread returns its old buffer
    to the heap allocator when this happens, with a fence to ensure that
    the buffer isn’t reused until it is no longer being used by the
    GPU.
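The recycling rule (a discarded slice may not be handed out again until the GPU is done with it) can be sketched as follows. This is a simplification and not wine-pba’s allocator: it assumes fixed-size slices instead of a real heap, it is not thread-safe, and it models a fence as a plain counter where real GL code would use a sync object (glFenceSync / glClientWaitSync). All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLICE_COUNT 4
#define SLICE_SIZE  256

struct slice {
    int in_use;      /* currently owned by a buffer */
    uint64_t fence;  /* the GPU must reach this point before reuse */
};

struct pba_heap {
    unsigned char memory[SLICE_COUNT * SLICE_SIZE]; /* stand-in for the mapped GL buffer */
    struct slice slices[SLICE_COUNT];
    uint64_t gpu_submitted;  /* bumped for every draw handed to the GPU */
    uint64_t gpu_completed;  /* how far the GPU has actually got */
};

void heap_init(struct pba_heap *h)
{
    memset(h, 0, sizeof *h);
}

/* Hand out a slice that is neither owned nor still readable by the GPU. */
void *heap_alloc(struct pba_heap *h)
{
    for (size_t i = 0; i < SLICE_COUNT; i++) {
        struct slice *s = &h->slices[i];
        if (!s->in_use && s->fence <= h->gpu_completed) {
            s->in_use = 1;
            return h->memory + i * SLICE_SIZE;
        }
    }
    return NULL;  /* nothing free yet: the caller has to wait */
}

/* Return a slice. It stays off-limits until the GPU passes the fence
 * recorded here, since draws already submitted may still read from it. */
void heap_free(struct pba_heap *h, void *ptr)
{
    size_t i = (size_t)((unsigned char *)ptr - h->memory) / SLICE_SIZE;

    assert(i < SLICE_COUNT && h->slices[i].in_use);
    h->slices[i].in_use = 0;
    h->slices[i].fence = h->gpu_submitted;
}
```

Freeing a slice while a draw is outstanding makes heap_alloc skip it and return the next one; once gpu_completed catches up with the recorded fence, the original slice becomes available again.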
When a D3DLOCK_NOOVERWRITE map is made, we can just
    return the buffer’s mapped base address plus the offset desired.
    Sweet!
Otherwise, we fall back to the old synchronous path, except that this time it only waits on a fence and requires no call to glMapBufferRange.
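The three map paths boil down to a small dispatch on the lock flags. The sketch below is only a model of that decision, with hypothetical names throughout (pba_map, the MAP_* flags, the PATH_* results). The fresh slice is passed in by the caller, where the real patches would ask the heap allocator for one, and the fence wait itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits, not the real D3DLOCK_* values. */
enum map_flag { MAP_NOOVERWRITE = 1, MAP_DISCARD = 2 };

enum map_path {
    PATH_WAIT_FENCE,  /* plain map: wait for the GPU, then write */
    PATH_DIRECT,      /* NOOVERWRITE: return base + offset right away */
    PATH_RENAME       /* DISCARD: switch the buffer to fresh memory */
};

struct pba_buffer {
    unsigned char *base;  /* address inside the persistent mapping */
    size_t size;
};

struct map_result {
    void *ptr;
    enum map_path path;
};

struct map_result pba_map(struct pba_buffer *buf, size_t offset,
                          unsigned flags, unsigned char *fresh_slice)
{
    struct map_result r;

    assert(offset < buf->size);
    if (flags & MAP_DISCARD) {
        /* Point the buffer at fresh memory. The old slice would be
         * fenced and returned to the heap by the command stream thread. */
        buf->base = fresh_slice;
        r.path = PATH_RENAME;
    } else if (flags & MAP_NOOVERWRITE) {
        r.path = PATH_DIRECT;
    } else {
        r.path = PATH_WAIT_FENCE;  /* caller must wait before writing */
    }
    r.ptr = buf->base + offset;
    return r;
}
```

Note that none of the three paths issues a GL call to obtain the pointer; only the last one has to wait at all.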
So, what does this look like?
    
Unfortunately, WoW does not have great public-facing benchmarking functionality. I settled for the console command /timetest 1, which measures the average FPS while taking a flight path. No ground NPCs or players are loaded in flight, which reduces variation between test runs. Additionally, I eyeballed the average idle FPS in various common in-game locations.
These benchmarks were performed on patch 7.3.5 running with 4x SSAA at a resolution of 2560x1440. The CPU is an i5-3570k, and the GPU is a GTX 1070. The graphics preset “7” was chosen (as recommended by the game).


I’m fairly satisfied with the results demonstrated by this benchmark. I hope to update this post with additional frame timing data from a GL intercept tool.
You can find an early prototype of wine-pba at github.com/acomminos/wine-pba. This is far from production quality, and makes several (erroneous) assumptions regarding implementation capabilities. I hope to mainline this once the patchset becomes more mature.
I hope you found this post valuable. I look forward to digging deeper into how to use AZDO techniques to improve wine performance, particularly for uniform updates and texture uploads.