Development Update: The Third 386 Core Iteration

2nd Jun 2026 daveemu cpu386 asmjit performance retroide

It has been a while since the last public update, so this one is partly about DaveEmu and partly about what has been happening around it.

The short version is that the 386 CPU core is now entering its third serious implementation iteration. That sounds a little dramatic, but it is the most honest way to describe the current state of the project.

DaveEmu has always been pulled between two goals. On one side there is compatibility: DOS, Windows 3.x, BIOS behavior, weird old software, and all the small architectural details that only show up after hours of debugging. On the other side there is speed. A 386 emulator that is correct but crawls is useful for research, but it is not the emulator I want to ship as a comfortable public release.

The hard part is that a CPU emulator cannot optimize only "around" the CPU. Video, audio, disk and UI all matter, but the CPU core is still the place where most long-term performance decisions become visible. If the execution model is wrong, everything above it inherits that cost.

The first 386 core was almost entirely interpreted. It worked, and over time it became functionally quite broad. I would describe it as roughly 95% of the way toward what I needed for the DOS and Windows 3.x workloads I care about. But it was slow. Painfully slow. In many scenarios it behaved like a machine running at about 5 MHz, and sometimes even slower. It was useful for correctness work, compatibility work and debugging, but not good enough for the kind of experience I want DaveEmu to provide.

The second approach was the first AsmJit-based 386 engine. This was a hybrid model: some execution stayed in the old path, while selected parts were moved to generated native code. A lot of work went into that version, and percentage-wise it looked promising. But the architecture had a fundamental problem. The engine kept entering and leaving JIT code, and that split personality limited the performance gains. Worse, as more code paths were admitted into the JIT side, stability problems started to appear.

That was not a useless detour. It taught me a lot. It proved which parts of the old design were solid, which ones were too expensive, and which assumptions did not survive contact with real Windows workloads. It also made the performance problem more concrete. The issue was not simply that one or two hot instructions were missing a fast path. The deeper issue was that the old engine had been designed as an interpreter first, and the JIT was being attached to it later.

That kind of hybrid can work, but it has a cost. Every transition needs state to be synchronized. Every fallback has to preserve exactly the same CPU view as the generated code. Every partial fast path needs careful rules about when it is allowed to run. The more complete the JIT side becomes, the more the old and new worlds have to agree on details. If they do not, the result is either slower than expected or fragile.

In my case it became both.
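To make the synchronization cost concrete, here is a toy model in Python. It is not DaveEmu code. It assumes that "JIT" code keeps guest registers in a cached copy (standing in for host registers), while the fallback path only sees the canonical guest state. Under that assumption, every crossing between the two paths must load or flush the cache. The names `GuestState`, `apply_op` and `run_hybrid` are hypothetical.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF


@dataclass
class GuestState:
    # Canonical guest CPU view that the fallback path reads and writes.
    regs: dict = field(default_factory=lambda: {"eax": 0, "ebx": 0})


def apply_op(regs: dict, op: str, reg: str, value: int) -> None:
    # Same semantics on both paths; only where the registers live differs.
    if op == "add":
        regs[reg] = (regs[reg] + value) & MASK32
    elif op == "sub":
        regs[reg] = (regs[reg] - value) & MASK32
    else:
        raise ValueError(f"unknown op {op!r}")


def run_hybrid(ops, jit_supported: set, state: GuestState) -> int:
    """Execute ops and return how many state synchronizations were needed."""
    cache = None  # None: cache not loaded, we are on the fallback path
    syncs = 0
    for op, reg, value in ops:
        if op in jit_supported:
            if cache is None:  # entering "JIT" code: load guest state
                cache = dict(state.regs)
                syncs += 1
            apply_op(cache, op, reg, value)
        else:
            if cache is not None:  # leaving "JIT" code: flush the cache
                state.regs.update(cache)
                cache = None
                syncs += 1
            apply_op(state.regs, op, reg, value)
    if cache is not None:  # final flush so the guest state is authoritative
        state.regs.update(cache)
        syncs += 1
    return syncs
```

Consider an alternating add/sub sequence where only `add` is handled by the "JIT". Every instruction boundary becomes a transition, so four instructions cost four synchronizations. A JIT-only design removes those boundaries instead of trying to make them cheaper.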

That is why I decided to stop treating the hybrid engine as the final direction. The new 386 work is being built from scratch as a JIT-first, JIT-only core. There is no hidden interpreter fallback in the new design. The goal is not to quickly mark as many opcodes as possible as "supported". The goal is to build a core where the generated code owns execution, guest CPU state is explicit, and the Intel 80386 documentation is treated as the source of truth for semantics.

This is slower at the beginning, because every instruction has to earn its way in through real execution behavior, not just through a decoder table. But I think it is the right tradeoff. The previous attempts taught me that correctness and performance cannot be separated cleanly in a CPU emulator. If the execution model is wrong, you eventually pay for it either in speed, stability, or both.

The new work therefore starts from a stricter foundation. Unsupported behavior should fail loudly instead of silently falling back to a different engine. Guest state has to be the real source of truth. Generated code has to execute real CPU behavior, not just look plausible in a disassembler. That means progress is more deliberate, but it also means each step should be easier to trust.
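Here is a minimal sketch of the "fail loudly" rule. It is again a Python toy rather than real code generation. Each decoded instruction is translated into a closure, standing in for emitted native code, that operates only on an explicit `GuestState`. An instruction without a translation raises an error at translation time instead of being handed to an interpreter. The opcode tuples, `translate_block` and `UnsupportedInstruction` are hypothetical. Only ZF is modeled for `add`; a real 386 ADD also updates CF, OF, SF, AF and PF.

```python
from dataclasses import dataclass, field
from typing import Callable

MASK32 = 0xFFFFFFFF


@dataclass
class GuestState:
    # Explicit guest CPU state: the single source of truth for generated code.
    regs: dict = field(default_factory=lambda: dict.fromkeys(("eax", "ecx", "edx", "ebx"), 0))
    zf: bool = False


class UnsupportedInstruction(Exception):
    """Raised during translation; there is no interpreter to fall back to."""


def emit(op) -> Callable[[GuestState], None]:
    name, *args = op
    if name == "mov_imm":
        reg, imm = args

        def mov_imm(s: GuestState) -> None:
            s.regs[reg] = imm & MASK32  # MOV does not touch flags
        return mov_imm
    if name == "add":
        dst, src = args

        def add(s: GuestState) -> None:
            result = (s.regs[dst] + s.regs[src]) & MASK32
            s.regs[dst] = result
            s.zf = result == 0  # toy: only ZF modeled; real ADD sets more flags
        return add
    raise UnsupportedInstruction(f"no translation for {name!r}")


def translate_block(ops) -> Callable[[GuestState], None]:
    # Translate everything first, so a gap is reported before any guest state changes.
    steps = [emit(op) for op in ops]

    def block(s: GuestState) -> None:
        for step in steps:
            step(s)
    return block
```

Translating the whole block before running it means a missing instruction is reported before anything executes. That keeps failures easy to attribute instead of surfacing later as corrupted guest state.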

From the outside this may look like taking a step backward, because an older core already runs much more software today. Internally it is the opposite. This is an attempt to stop carrying the performance ceiling of the old architecture forward forever. The interpreted core remains important history and a useful reference point, but the future 386 path has to be designed as a recompiler from the beginning.

There is also a personal reason why progress slowed down recently: I started working at AMD as a Senior Performance Engineer. To be honest, it pulled me in hard. The work is very close to the kind of low-level performance thinking I enjoy, and it has taken a lot of my attention. DaveEmu did not stop, but the pace changed for a while. I plan to return to it more broadly soon, especially now that the direction for the new 386 core is much clearer.

This is also why I have been thinking more carefully about what "performance" means in DaveEmu. It is not just a number printed at the end of a benchmark. For an emulator, performance is a chain of design decisions: how state is represented, how often code exits, how much work is done on hot paths, how diagnostics are kept out of release builds, how memory and I/O helpers are called, and how much of the guest machine can run without unnecessary control returning to the host.

Working professionally on performance has made those tradeoffs even more interesting to me, not less.

In the meantime I also worked on a few other projects.

One of them is an adaptive file compressor. It is a separate experiment, but it fits the same pattern: I am interested in systems where performance depends on choosing the right strategy for the actual data, not just applying one fixed algorithm everywhere.
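The compressor's internals are not described here, so the following illustrates the general idea only and is not its algorithm. This Python sketch tries several standard-library codecs (`zlib`, `bz2`, `lzma`, plus storing the data raw) on the actual input. It keeps whichever output is smallest, prefixed by a one-byte tag. Brute-force selection is the simplest possible strategy. A practical adaptive compressor would more likely choose per block and use cheaper heuristics.

```python
import bz2
import lzma
import zlib

# Tag byte -> (encoder, decoder). Tag 0 stores the data unchanged.
CODECS = {
    0: (lambda d: d, lambda d: d),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}


def compress_adaptive(data: bytes) -> bytes:
    # Try every strategy on this input and keep the smallest result.
    tag, payload = min(
        ((tag, enc(data)) for tag, (enc, _dec) in CODECS.items()),
        key=lambda item: len(item[1]),
    )
    return bytes([tag]) + payload


def decompress_adaptive(blob: bytes) -> bytes:
    tag, payload = blob[0], blob[1:]
    return CODECS[tag][1](payload)
```

On repetitive input one of the real codecs wins. On random bytes every codec adds overhead, so the raw strategy is chosen and the cost is a single tag byte.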

The other larger side project is RetroIDE.

RetroIDE is not just "SSH for old Windows machines". The idea is different: keep the comfortable editing, project structure and UI on a modern host, while a small agent on a real or virtual legacy Windows target performs the work that has to happen there.

The target range is old Windows systems such as Windows 95, 98, NT4, 2000 and XP. The host side is a modern desktop application. The remote side is designed to be small and conservative, with code that can realistically run on those old systems. Instead of exposing a raw shell or raw debugger to the network, RetroIDE uses a versioned binary protocol, capability negotiation and authenticated encrypted sessions. The planned workflow is local editing, remote build/run, stdout and stderr streaming, file synchronization, logs, process status, screenshot capture and eventually debugger proxying.

That distinction is important. SSH is a remote command line. It is excellent when the target system supports it and when the workflow is naturally terminal based. Old Windows development is not like that. A lot of the interesting target systems do not have a modern SSH environment. Many workflows depend on old toolchains, old debuggers, GUI applications, legacy Win32 behavior and awkward file layouts. You can remote into the desktop, but then you are back to using an old desktop for everything. You can copy files manually, but then the workflow becomes slow and error-prone.

RetroIDE is meant to sit in the middle. The modern machine keeps the pleasant parts: editor, project model, high resolution UI, local indexing, modern build configuration and clean logs. The old target keeps the parts that must be real: Win32 behavior, compiler runtime, DLL loading, process execution, debugger behavior and OS compatibility. The protocol between them is not an afterthought. It is the core of the project.
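The post does not document RetroIDE's wire format. The following hypothetical Python sketch shows what a "versioned binary protocol" with "capability negotiation" can mean in practice. It uses a fixed little-endian header carrying a magic value, protocol version, message type and payload length. It also uses a capability bitmask, and the session enables only the features both sides advertise. `MAGIC`, the header layout and the `CAP_*` bits are invented for illustration. Authentication and encryption are deliberately left out.

```python
import struct

# Hypothetical header: magic, protocol version, message type, payload length.
HEADER = struct.Struct("<4sHHI")  # little-endian, fixed 12-byte layout
MAGIC = b"RIDE"                    # hypothetical
PROTOCOL_VERSION = 1

# Hypothetical capability bits advertised during the handshake.
CAP_FILE_SYNC = 1 << 0
CAP_PROCESS = 1 << 1
CAP_SCREENSHOT = 1 << 2
CAP_DEBUGGER = 1 << 3


def pack_message(msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(MAGIC, PROTOCOL_VERSION, msg_type, len(payload)) + payload


def unpack_message(data: bytes):
    magic, version, msg_type, length = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != PROTOCOL_VERSION:
        raise ValueError(f"unsupported protocol version {version}")
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return msg_type, payload


def negotiate(host_caps: int, target_caps: int) -> int:
    # The session may only use features that both sides support.
    return host_caps & target_caps
```

A fixed, explicitly sized layout is easy to parse from small C code on old targets. Rejecting unknown versions outright is the conservative choice: a mismatched agent fails visibly instead of misreading fields.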

The remote side is deliberately split into two parts: a small bootstrap component and a larger working server. The small part should be boring and reliable: start the server, update it, recover it, report status and keep enough control to avoid losing the target after a bad update. The larger server can do the heavier work, such as file sync, process execution, logs, screenshots and debugger integration. That separation exists because old systems are fragile enough already; the recovery path should not be the riskiest part of the tool.
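One way to keep the recovery path boring is to have the bootstrap component track only two facts: which server build is active, and which one last passed a health check. This Python sketch is hypothetical (`Bootstrap`, `apply_update` and the health-check callback are not RetroIDE names). It shows a failed update rolling back to the last known-good build instead of leaving the target unreachable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Bootstrap:
    # Hypothetical minimal supervisor state: no file sync, no process control,
    # just enough to start, update, recover and report on the server.
    active: str
    last_good: str

    def apply_update(self, new_build: str, health_check: Callable[[str], bool]) -> bool:
        """Switch to new_build; roll back to last_good if it fails its check."""
        self.active = new_build
        if health_check(new_build):
            self.last_good = new_build
            return True
        self.active = self.last_good  # bad update: recover instead of losing the target
        return False

    def status(self) -> dict:
        return {"active": self.active, "last_good": self.last_good}
```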

The security model is also part of the design, not decoration. RetroIDE is not supposed to expose an unauthenticated "run whatever you want" service on a LAN. The protocol is versioned, commands are scoped, capabilities are negotiated, and the production direction is authenticated encrypted transport. That may sound heavy for a retro development tool, but the moment a program can upload files, start processes and read logs on another machine, it needs a serious boundary.

That matters because developing for old Windows is awkward in a very specific way. RDP or VNC gives you a desktop, and SSH gives you a shell where available, but neither gives you a modern IDE workflow while still executing inside the real legacy environment. RetroIDE is meant to fill that gap: modern local tools, old target behavior, and a controlled protocol between them.

There is a visual side to RetroIDE too. I want it to feel like a native desktop tool with a retro identity, not a browser app pretending to be an IDE. But the visual style is not the foundation. The foundation is the workflow: edit locally, run remotely, get reliable output, keep logs, recover when the remote side breaks, and eventually debug without exposing a raw debugger directly to the network.
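The security model above says commands are scoped and capabilities are negotiated. Here is a hypothetical Python sketch of such a check on the target side. A command is refused unless its capability was negotiated for the session and its path stays inside the project root. The sketch uses the standard `ntpath` module so that Windows path rules apply on any host. `COMMAND_CAPS`, `authorize` and `Denied` are invented names. A real agent would also need to handle cases this toy ignores, such as 8.3 short names.

```python
import ntpath

CAP_FILE_SYNC = 1 << 0
CAP_PROCESS = 1 << 1

# Hypothetical table: which negotiated capability each command requires.
COMMAND_CAPS = {"put_file": CAP_FILE_SYNC, "get_file": CAP_FILE_SYNC, "run": CAP_PROCESS}


class Denied(Exception):
    pass


def authorize(command: str, rel_path: str, session_caps: int, project_root: str) -> str:
    needed = COMMAND_CAPS.get(command)
    if needed is None:
        raise Denied(f"unknown command {command!r}")
    if not session_caps & needed:
        raise Denied(f"{command!r} was not negotiated for this session")
    # Reject drive-qualified or rooted paths outright.
    if ntpath.splitdrive(rel_path)[0] or rel_path.startswith(("\\", "/")):
        raise Denied("absolute paths are not allowed")
    root = ntpath.normcase(ntpath.normpath(project_root))
    target = ntpath.normcase(ntpath.normpath(ntpath.join(project_root, rel_path)))
    # After normalization, ".." segments must not lead outside the root.
    if ntpath.commonpath([root, target]) != root:
        raise Denied("path escapes the project root")
    return target
```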

So the current state is this:

DaveEmu's original interpreted 386 core works broadly, but is too slow.
The first hybrid AsmJit attempt proved useful, but not clean enough as a final architecture.
The new 386 core is being rebuilt from scratch around generated execution rather than interpreter fallback.
RetroIDE has become a second major retro-systems project, focused on making real development on old Windows targets less painful.

For DaveEmu, the next work is not glamorous in the screenshot sense. It is CPU core work: decoding, executing, checking edge cases, matching real 80386 semantics and slowly increasing the amount of DOS and Windows startup code that can run through the new engine. That is exactly the kind of work that makes a later public release feel solid instead of lucky.

For RetroIDE, the next useful direction is a practical vertical slice: a modern host connects to an old Windows target, authenticates, starts the remote server, syncs files, runs a process, streams output and retrieves logs. Once that works reliably, the editor, debugger and screenshot features have something real to attach to.
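The ordering in that vertical slice can be enforced explicitly, so that no step runs before the session has reached the stage it requires. This is a hypothetical Python sketch: `Stage`, `STEPS` and `Session` are illustrative names, and output streaming is left out.

```python
from enum import Enum, auto


class Stage(Enum):
    CONNECTED = auto()
    AUTHENTICATED = auto()
    SERVER_RUNNING = auto()


# Hypothetical: stage each step requires, and stage it moves to (None = unchanged).
STEPS = {
    "authenticate": (Stage.CONNECTED, Stage.AUTHENTICATED),
    "start_server": (Stage.AUTHENTICATED, Stage.SERVER_RUNNING),
    "sync_files": (Stage.SERVER_RUNNING, None),
    "run_process": (Stage.SERVER_RUNNING, None),
    "fetch_logs": (Stage.SERVER_RUNNING, None),
}


class Session:
    def __init__(self) -> None:
        self.stage = Stage.CONNECTED
        self.history = []

    def step(self, name: str) -> None:
        required, next_stage = STEPS[name]
        if self.stage is not required:
            raise RuntimeError(f"{name} requires {required.name}, session is {self.stage.name}")
        if next_stage is not None:
            self.stage = next_stage
        self.history.append(name)
```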

The next DaveEmu milestone is to keep expanding the new 386 core carefully until it can take over the early DOS and Windows paths that the old core already knows how to run, but without inheriting the old performance ceiling.

That is the direction now: fewer shortcuts, a cleaner execution model, and a 386 core that is designed for speed from the beginning instead of trying to bolt speed onto an interpreter after the fact.

It is a longer route, but I think it is the route that gives DaveEmu the best chance of becoming the emulator I originally wanted to build: compatible enough to run the software that matters, fast enough to enjoy using, and clean enough inside that future work does not feel like fighting the previous architecture.