In the previous blog we looked at abusing callbacks inside of ntdll for the purpose of control flow obfuscation. While interesting, these techniques leave a visible trace inside the program’s disassembly. In this blog I would like to share a second technique that eliminates that trace. If executed correctly, you should be able to completely hide control flow and data loads.
To aid in explaining the technique I’ve made a benign proof of concept executable that poses as a simple calculator that, when given the right parameters, executes a shell command. The source for this executable can be found here. If you have access to a disassembler I invite you to load the executable into it and try to explain the behavior shown next!
Let’s take a look at the proof of concept “calculator” executable in IDA and see if we can spot the control flow leading to the execution of a shell command:
The main function takes 3 arguments (2 values and an operator), checks if they’re in the 16-bit range and passes them on to a function called calc:
The calc function. This function takes the values and the operator, performs the calculation and returns the result:
For completeness’ sake, the exception handler for calc:
This is every possible code path reachable from main that isn’t a call to an API, or so it seems.. Executing this program normally will give us the expected results:
But if we trigger a divide by 0 exception, it executes cmd.exe with the /c whoami argument:
How did the binary manage to execute a shell command with no visible paths leading to such functionality? Let’s dive in!
Understanding how this technique works requires a decent grasp of how exception handling is done in usermode on Windows. For this we take a quick look at RtlDispatchException and KiUserExceptionDispatcher, both present inside of ntdll, and the concept of unwind codes.
When certain types of exceptions are triggered in a program the kernel interrupts the thread’s execution and forcefully makes it execute KiUserExceptionDispatcher (much like the APC dispatcher covered in the previous blog). Once control is handed back from kernel to usermode the thread will be at KiUserExceptionDispatcher with a CONTEXT and an EXCEPTION_RECORD on the stack. This function will then promptly call RtlDispatchException.
The RtlDispatchException function is where the core logic for exception handling is implemented (both vectored exception handling and the Microsoft compiler’s try-except C language extension). The first thing it does is check whether a vectored exception handler is registered. We used this functionality in the previous blog to hijack a context, but this time around we skip it. Next, it moves on to the slightly more complex try-except handling code.
To start, the try-except handling logic makes an allocation on the stack the size of a CONTEXT struct. The allocation is used to create a copy of the thread’s context at the time the exception was triggered. This context copy will be used during the stack unwinding process, more on this later. Next, the original context is used to find the faulting function: the CONTEXT->Rip value is passed to RtlLookupFunctionEntry. The exception handling logic uses this function to find the RUNTIME_FUNCTION entry that holds exception handling details for the function in which the exception was triggered. Assuming the exception was triggered inside of a function that was known to the compiler and has a corresponding RUNTIME_FUNCTION registered inside of the PE header, we will obtain our runtime function info and move on to the next step.
The next step is taking the RUNTIME_FUNCTION and a few other contextual parameters and passing them to RtlVirtualUnwind. Combining the info from the RUNTIME_FUNCTION, the CONTEXT, and the instruction pointer, it will attempt to unwind the stack from the point of the exception. The unwinding takes place on the copy of the CONTEXT. As the last step in the unwinding process (under most circumstances) the unwinder will check if the RUNTIME_FUNCTION has an exception handler registered and return a pointer to it.
The try-except handling code then takes that returned exception handler pointer and executes it using RtlpExecuteHandlerForException. This function takes a pointer to the original CONTEXT (not the copy, again, this becomes important later) and the EXCEPTION_RECORD and passes them to the user registered exception handler. The user registered exception handler can then return one of these three values [EXCEPTION_CONTINUE_SEARCH, EXCEPTION_CONTINUE_EXECUTION, EXCEPTION_EXECUTE_HANDLER], which determines whether the exception handling is considered finished or whether another loop is going to be performed.
With a little bit of background on the control flow that we’re trying to achieve out of the way, let us have a closer look at the function that enables all of this: RtlVirtualUnwind. Looking at the Windows Research Kernel source we can get a quick idea of the arguments the function takes:
As the description on GitHub might suggest, this function is used for unwinding the stack from the point of the exception. The goal of this is to find return pointers stored on the stack, which in turn allow the dispatcher to find the nearest registered exception handler. Given an instruction pointer (taken from the CONTEXT) it will do roughly one of 3 things:
- Check if the instruction pointer is located in the epilogue of a function; if it thinks it’s in an epilogue it will emulate all the instructions up to the return and fetch the return pointer off the stack.
- Check if the instruction pointer is located in the prologue; no emulation this time, instead it uses so-called unwind codes to revert the effects the prologue had on the stack to try and find the stored return pointer.
- If we are neither in the prologue nor the epilogue, but in the middle of a function, look up the exception handler for the current function. If none is present return NULL.
For this technique we are mostly interested in the 2nd path.
Each function (with some exceptions like thunk and leaf functions) will have a RUNTIME_FUNCTION entry created for it by the compiler. The compiler inserts these entries into an array that can be reached from the PE header. One of the primary uses for these structs is keeping track of exception and unwind info for each significant function. An entry looks like this:
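As a rough sketch, the entry can be modelled in portable C as three 32-bit RVAs, which is the layout winnt.h documents for RUNTIME_FUNCTION. The lookup_function_entry helper is a hypothetical stand-in for RtlLookupFunctionEntry: it assumes a single table sorted by BeginAddress and ignores dynamically registered tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable mirror of RUNTIME_FUNCTION: three RVAs relative to the image base. */
typedef struct {
    uint32_t BeginAddress;       /* RVA of the first byte of the function */
    uint32_t EndAddress;         /* RVA one past the last byte of the function */
    uint32_t UnwindInfoAddress;  /* RVA of the UNWIND_INFO struct */
} RuntimeFunctionModel;

/* Hypothetical stand-in for RtlLookupFunctionEntry: binary search over a
   table sorted by BeginAddress. Returns NULL when no entry covers the
   address, e.g. for a leaf function that has no unwind data. */
static const RuntimeFunctionModel *lookup_function_entry(
        const RuntimeFunctionModel *table, size_t count,
        uint64_t image_base, uint64_t rip)
{
    assert(table != NULL || count == 0);
    if (rip < image_base)
        return NULL;

    uint64_t rva = rip - image_base;
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].BeginAddress)
            hi = mid;                 /* address lies before this function */
        else if (rva >= table[mid].EndAddress)
            lo = mid + 1;             /* address lies after this function */
        else
            return &table[mid];       /* BeginAddress <= rva < EndAddress */
    }
    return NULL;
}
```

The important property for the rest of this post is that the lookup is driven purely by an instruction pointer value: whoever controls that value decides which entry comes back.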
UnwindInfoAddress points to an UNWIND_INFO struct containing the unwind information:
When RtlVirtualUnwind is invoked, the RUNTIME_FUNCTION entry (obtained from a call to RtlLookupFunctionEntry) is passed to it as an argument. Then, if it is determined that the exception occurred inside of the prologue, the unwinder will get a pointer to the UNWIND_INFO struct through the UnwindInfoAddress member. From this struct the primary members that are used during unwinding are CountOfCodes and UnwindCode, where CountOfCodes gives the size of the UnwindCode array.
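To make those members concrete, here is a small decoder for the first bytes of an UNWIND_INFO blob. The byte layout follows the UNWIND_INFO and UNWIND_CODE description in Microsoft’s x64 exception handling documentation; the struct and function names are made up for this sketch, and it only looks at the fixed header and the 2-byte code slots.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the 4-byte UNWIND_INFO header. */
typedef struct {
    unsigned version;         /* low 3 bits of byte 0 */
    unsigned flags;           /* high 5 bits of byte 0 (1 = UNW_FLAG_EHANDLER) */
    unsigned size_of_prolog;  /* byte 1 */
    unsigned count_of_codes;  /* byte 2: number of 2-byte slots that follow */
    unsigned frame_register;  /* low nibble of byte 3 */
    unsigned frame_offset;    /* high nibble of byte 3 */
} UnwindInfoHeader;

/* Decoded view of one 2-byte UNWIND_CODE slot. */
typedef struct {
    unsigned code_offset;     /* prologue offset just past the instruction */
    unsigned unwind_op;       /* low nibble of byte 1, e.g. UWOP_PUSH_NONVOL */
    unsigned op_info;         /* high nibble of byte 1, meaning depends on op */
} UnwindCodeFields;

static UnwindInfoHeader parse_unwind_info(const uint8_t *raw)
{
    UnwindInfoHeader h;
    h.version        = raw[0] & 0x07u;
    h.flags          = raw[0] >> 3;
    h.size_of_prolog = raw[1];
    h.count_of_codes = raw[2];
    h.frame_register = raw[3] & 0x0Fu;
    h.frame_offset   = raw[3] >> 4;
    return h;
}

/* Returns slot `index` of the UnwindCode array that follows the header. */
static UnwindCodeFields unwind_code_at(const uint8_t *raw, unsigned index)
{
    assert(index < raw[2]);   /* raw[2] is CountOfCodes */
    const uint8_t *slot = raw + 4 + 2u * index;
    UnwindCodeFields c;
    c.code_offset = slot[0];
    c.unwind_op   = slot[1] & 0x0Fu;
    c.op_info     = slot[1] >> 4;
    return c;
}
```

For instance, the bytes 09 05 02 00 05 32 01 70 would decode as version 1 with the exception handler flag set, a 5 byte prologue and two codes: an UWOP_ALLOC_SMALL of 0x20 bytes recorded at offset 5, followed by an UWOP_PUSH_NONVOL of RDI recorded at offset 1.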
These unwind codes are essentially a simplified representation of the instructions contained in the prologue that had an effect on the stack. For example:
The unwinder will parse these unwind codes, in reverse order, and “undo” the effects the prologue had on the stack. This is to say that if a push rdi instruction was generated by the compiler for your C code, a UWOP_PUSH_NONVOL unwind code would’ve been inserted for it, and during unwinding it will revert that push by doing a “pop” and registering its effects inside of the context copy. Under normal circumstances it would parse the full array and perfectly undo the effects the instructions had on the stack and arrive at a stored return pointer value that can then be used for further unwinding.
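The undo step can be sketched with a toy interpreter. This is a simplified model and not ntdll’s implementation: it only knows the two operations discussed in this post, treats the stack as an array of 8-byte slots with sp as an index standing in for Rsp, and assumes the exception happened after the whole prologue ran. The codes array is stored the way the documentation describes it on disk, with the last prologue instruction first, so walking it front to back undoes the prologue in reverse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_SMALL = 2 };
enum { REG_RAX = 0, REG_RDI = 7, REG_COUNT = 16 };

typedef struct { uint8_t op, info; } Code;

typedef struct {
    uint64_t reg[REG_COUNT];  /* general-purpose registers of the copy */
    size_t   sp;              /* index into the stack array, models Rsp */
    uint64_t rip;
} ContextCopy;

/* Applies the unwind codes to the context copy, then loads the return
   address from the slot sp ends up pointing at. */
static void unwind_prologue(ContextCopy *ctx,
                            const uint64_t *stack, size_t slots,
                            const Code *codes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        switch (codes[i].op) {
        case UWOP_PUSH_NONVOL:              /* undo "push reg" with a pop */
            assert(codes[i].info < REG_COUNT && ctx->sp < slots);
            ctx->reg[codes[i].info] = stack[ctx->sp];
            ctx->sp += 1;
            break;
        case UWOP_ALLOC_SMALL:              /* undo "sub rsp, info*8+8" */
            ctx->sp += (size_t)codes[i].info + 1;
            break;
        default:
            assert(!"operation not modelled in this sketch");
            break;
        }
    }
    assert(ctx->sp < slots);
    ctx->rip = stack[ctx->sp];              /* the stored return pointer */
    ctx->sp += 1;
}
```

With a prologue of push rdi followed by sub rsp, 20h, the stack at the time of the exception holds four local slots, the saved rdi and then the return pointer, and the interpreter recovers both the old rdi value and the caller’s address.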
This is of course not how we will be using it today.
Setting the Stage
Unwind codes offer a very convenient way of covertly influencing the thread’s state through its context. However, their power is limited. Unwind codes can alter the general-purpose registers (RAX, RBX, RCX, etc), RSP, RBP and RIP inside of the context copy. There also isn’t any immediately obvious way of causing the execution of any code. We can’t even use the context hijacking technique described in the previous blog, because we can’t alter the actual context, only a copy. In addition to this it is the Microsoft compiler that generates these codes for you from your C source code and there’s no way to intercept it during compilation (unless you want to have some fun with RtlAddFunctionTable of course).
So you’re saying there’s no way to work with these codes ourselves? Well.. there is one slightly obscure way..
Unwind Code Abuse
The MASM assembler supports some meta keywords for inserting these unwind codes. An example of a simple assembly function with these meta keywords added:
The important keywords to take note of in the above example are (all these effects take place in the copy of the context):
- .pushreg reg: this inserts the UWOP_PUSH_NONVOL unwind code into the array. During unwinding this has the effect of popping the current value at RSP into the given register.
- .allocstack XXh: this inserts the UWOP_ALLOC_SMALL unwind code into the array. During unwinding this has the effect of adding XXh to the RSP.
- .endprolog: this signals the end of the prologue declarations, according to MSDN. Funnily enough this appears to have no real effect on the unwind codes for the prologue, as you can put unwind codes after it and they will still get inserted.
- FRAME:<func>: this, in combination with .endprolog, is what triggers the creation of a RUNTIME_FUNCTION entry for the function. <func> is the name of the function that you want to assign as the exception handler for that function.
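To show what the first two keywords turn into, the sketch below encodes them as 2-byte UNWIND_CODE slots following the documented x64 layout (operation in the low nibble, operation info in the high nibble of the second byte). The function names are hypothetical, and the small allocation form is assumed: per the documentation it only covers sizes from 8 to 128 bytes, larger allocations use UWOP_ALLOC_LARGE.

```c
#include <assert.h>
#include <stdint.h>

enum { UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_SMALL = 2 };

/* Register numbers as used in the OpInfo nibble. */
enum { REG_RAX = 0, REG_RCX = 1, REG_RDX = 2, REG_RBX = 3,
       REG_RSP = 4, REG_RBP = 5, REG_RSI = 6, REG_RDI = 7 };

typedef struct {
    uint8_t CodeOffset;  /* prologue offset just past the instruction */
    uint8_t OpAndInfo;   /* low nibble: UnwindOp, high nibble: OpInfo */
} UnwindCodeModel;

/* Slot a `.pushreg reg` directive at prologue offset `offset` maps to. */
static UnwindCodeModel encode_pushreg(uint8_t offset, unsigned reg)
{
    assert(reg < 16);
    UnwindCodeModel c = { offset, (uint8_t)((reg << 4) | UWOP_PUSH_NONVOL) };
    return c;
}

/* Slot a `.allocstack bytes` directive maps to when the size fits the
   small form: OpInfo stores (bytes - 8) / 8. */
static UnwindCodeModel encode_allocstack_small(uint8_t offset, unsigned bytes)
{
    assert(bytes >= 8 && bytes <= 128 && bytes % 8 == 0);
    unsigned info = (bytes - 8) / 8;
    UnwindCodeModel c = { offset, (uint8_t)((info << 4) | UWOP_ALLOC_SMALL) };
    return c;
}
```

Note that nothing in this encoding ties a slot to the instruction that actually sits at that offset, which is the gap the rest of this post leans on.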
So what happens if we just insert some random keywords that are unrelated to the actual instructions like this:
That works! The unwinding mechanism takes the generated unwind codes as truth and updates the context as if the corresponding instructions had really been executed on the stack. After it is done parsing these spurious unwind codes it assumes CONTEXT->Rsp is now pointing to the stack location containing a saved return pointer. Lastly the CONTEXT->Rip member is updated with the value read from the top of the stack. The copied context is now nicely corrupted, but nothing really happens after that. How can we abuse this corruption to get something to execute?
Directing Control Flow
Now that we understand how to use, and misuse, the unwind codes, let’s look at how we can use this to alter the control flow. For this I have created a minimal pseudo-example (containing only the important bits) consisting of a C source file and an assembly file:
If we were to execute this program the following would happen:
- The C code calls examplefunction with a pointer to our decoy function
- The example function takes the pointer to decoy that is stored in rcx and puts it on the stack
- It then pushes 2 more values onto the stack
- Eventually it triggers a divide by 0 exception
Upon triggering the exception our stack is as follows:
Next, the kernel drops our thread off at KiUserExceptionDispatcher and subsequently RtlDispatchException. The exception dispatcher makes a copy of our context, finds our RUNTIME_FUNCTION entry and calls RtlVirtualUnwind. This is where the unwind codes come into play, let’s take a look at the resulting steps:
- The unwinder locates the last unwind code, .pushreg rax; it pops the value located at CONTEXT->Rsp into CONTEXT->Rax and adds 8 to the rsp value in the context
- The unwinder proceeds to the next unwind code and finds another .pushreg rax; it performs the same steps as above
- The unwinder finds no more unwind codes, as we only inserted 2, and thinks it is done unwinding the prologue. Having unwound the stack it should now find the stored return pointer at the location CONTEXT->Rsp points to and loads it into CONTEXT->Rip. However, rather than a return pointer, it loads in the pointer to decoy
- Lastly, it grabs the registered exception handler for examplefunction, in this case the handler() function from our assembly file (we registered it using FRAME:handler), and returns it from RtlVirtualUnwind.
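The steps above can be replayed on a toy stack. Everything in this sketch is illustrative: the addresses are invented, only the three context members that matter here are modelled, and the stack holds the two extra pushed values, then the decoy pointer, then the real return pointer, as described earlier.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BASE 0x7000u   /* hypothetical address of stack[0] */

typedef struct { uint64_t rax, rsp, rip; } ContextCopy;

/* Reads the 8-byte slot at `addr` from the modelled stack. */
static uint64_t read_qword(const uint64_t *stack, unsigned slots, uint64_t addr)
{
    assert(addr >= STACK_BASE && (addr - STACK_BASE) % 8 == 0);
    assert((addr - STACK_BASE) / 8 < slots);
    return stack[(addr - STACK_BASE) / 8];
}

/* Undo of one `.pushreg rax`: load the value at Rsp into Rax, Rsp += 8. */
static void undo_pushreg_rax(ContextCopy *c, const uint64_t *stack, unsigned slots)
{
    c->rax = read_qword(stack, slots, c->rsp);
    c->rsp += 8;
}

/* Final unwind step: whatever Rsp points at is taken as the return pointer. */
static void load_return_pointer(ContextCopy *c, const uint64_t *stack, unsigned slots)
{
    c->rip = read_qword(stack, slots, c->rsp);
    c->rsp += 8;
}

/* The two spurious unwind codes of examplefunction, then the Rip load. */
static void unwind_examplefunction(ContextCopy *c, const uint64_t *stack, unsigned slots)
{
    undo_pushreg_rax(c, stack, slots);    /* consumes the last pushed value */
    undo_pushreg_rax(c, stack, slots);    /* consumes the value before it */
    load_return_pointer(c, stack, slots); /* lands on the decoy pointer */
}
```

Because the function pushed three values but only declared two of them, the unwinder stops one slot short of the real return pointer and picks up the decoy pointer instead.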
Now we have a copy of a context with its RIP member set to the decoy() function, but no execution… yet. The next step the dispatcher takes is executing the exception handler that got returned from the unwinder, handler in our case. For clarity the exception handler is empty and only returns 2 (EXCEPTION_CONTINUE_SEARCH). This will trigger the dispatcher to do another loop starting at RtlLookupFunctionEntry. Again, this function takes our context copy and checks if CONTEXT->Rip is within the bounds of any registered function and returns its RUNTIME_FUNCTION entry. As we managed to control CONTEXT->Rip using the unwind codes in the previous loop, we get to decide which entry is returned here. This RUNTIME_FUNCTION is then again passed on to RtlVirtualUnwind and used to perform the previously explained logic. You can see where this is going… (or maybe not as all of this is a confusing mess!)
Having a way to covertly alter context info and control which RUNTIME_FUNCTION is returned by RtlLookupFunctionEntry, the final step will be converting that into code execution. For this we take the assembly file from our previous example and move our decoy() function into it. We then create an exception handler for this function using the FRAME keyword and have it point to the function we actually want to execute.
Now, when the exception dispatcher starts handling our divide by 0 exception it will:
- Find the function entry for examplefunction
- Parse our malicious unwind codes, putting a pointer to decoy into CONTEXT->Rip
- Execute the exception handler handler
- handler returns 2, causing another loop of the dispatcher
- Now it finds the function entry for the decoy function
- Unwind nothing as it has no unwind codes
- Execute registered exception handler
- Execute our actual payload
- Return 0 to return from the exception dispatcher
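The whole sequence can be condensed into a toy model of the dispatcher loop. This is a heavily simplified sketch, not the real RtlDispatchException: the unwind codes of each function are replaced by a callback that edits the context copy, the addresses are invented, and payload is a hypothetical name for the handler registered on decoy. The model also skips the return pointer load a real unwind of decoy would still perform, since the loop ends before it would matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return values used by the handlers in the text. */
enum { RETURN_FROM_DISPATCHER = 0, LOOP_AGAIN = 2 };

typedef struct { uint64_t rip; } ContextCopy;

typedef struct {
    uint64_t begin, end;              /* address range of the function */
    void (*unwind)(ContextCopy *);    /* stands in for its unwind codes */
    int  (*handler)(void);            /* handler registered via FRAME:<func> */
} FunctionEntry;

#define EXAMPLE_BEGIN 0x1000u         /* hypothetical address of examplefunction */
#define DECOY_BEGIN   0x2000u         /* hypothetical address of decoy */

static char   call_log[8];            /* order in which handlers ran */
static size_t call_count;

static void record(char tag)
{
    assert(call_count < sizeof call_log);
    call_log[call_count++] = tag;
}

static int handler(void) { record('h'); return LOOP_AGAIN; }
static int payload(void) { record('p'); return RETURN_FROM_DISPATCHER; }

static void unwind_example(ContextCopy *c) { c->rip = DECOY_BEGIN; }
static void unwind_nothing(ContextCopy *c) { (void)c; }

static const FunctionEntry *lookup(const FunctionEntry *t, size_t n, uint64_t rip)
{
    for (size_t i = 0; i < n; i++)
        if (rip >= t[i].begin && rip < t[i].end)
            return &t[i];
    return NULL;
}

/* Returns the number of loops it took before a handler ended the dispatch,
   or -1 if no entry was found or the loop bound was hit. */
static int dispatch(const FunctionEntry *table, size_t n, uint64_t faulting_rip)
{
    ContextCopy copy = { faulting_rip };
    for (int pass = 1; pass <= 8; pass++) {
        const FunctionEntry *entry = lookup(table, n, copy.rip);
        if (entry == NULL)
            return -1;
        entry->unwind(&copy);         /* may redirect copy.rip */
        if (entry->handler != NULL &&
            entry->handler() == RETURN_FROM_DISPATCHER)
            return pass;
    }
    return -1;
}
```

Feeding it a table with examplefunction (unwind_example, handler) and decoy (unwind_nothing, payload) and a faulting address inside examplefunction runs handler on the first loop and payload on the second, without any call to payload appearing in the code that raised the exception.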
Through a few simple steps we went from triggering an exception to NTDLL executing code for us using “meta opcodes” that don’t show up in the disassembly. While theoretically simple to implement, there are a lot of minor snags in the process of developing a payload that we ignored in this example. Up next, let’s look at an actual working proof of concept that successfully hides a call to CreateProcessA (with cmd.exe /c whoami as its argument) from the disassembler.
The following PoC (and how to compile it) can also be found on my GitHub.
This proof of concept masquerades as a simple calculator command line application; this will be the only code that appears to be part of the execution control flow of the binary. Using the input “10 / 0” as a trigger, the code enters the exception dispatching process explained above. In here it will pop a pointer that was stored in an earlier stackframe into RSP and pivot the stack to a pre-constructed “stack” in the data section. Using this pre-constructed stack it will abuse unwind codes to hide the data loads of a pointer to our shell command and a pointer to CreateProcessA, and eventually set CONTEXT->Rip to the value at the current stack location (the offset to our decoy function).
Remember, none of these loads are at all visible in the disassembly as they happen deep inside of the unwinding logic inside of ntdll.
With the context copy now containing our shell command pointer in rdx, the pointer to CreateProcessA inside of r15 and the instruction pointer pointing to decoy, everything is set up to continue to our payload execution function. As explained before, first the normal exception handler for the erroring function is executed (this one is fully visible in disassemblers such as IDA). The handler exception handler is a benign exception handler that merely fixes the exception by changing the divide by 0 to a divide by 1. We return 2 from this exception handler to perform another loop of the dispatcher. The dispatcher now thinks decoy is the erroring function and loads its RUNTIME_FUNCTION with exception handler dispatcher. Through the process explained above eventually dispatcher is executed, unbeknownst to anyone reading the disassembly, which executes our shell command using the registers in the context copy.
The proof of concept is still a very simple example, only proving the possibility of hiding control flow. For more advanced examples you could be looking at executing multiple functions in a row by looping the dispatcher, abusing uninitialised variables that get skipped by RtlInitializeExtendedContext2, obfuscating only sensitive data loads, etc.
Hopefully, despite being painfully complex, this was an interesting read and inspires others to do more research into advanced control flow obfuscation techniques.
To be continued…