Exception Hijacking

August 4, 2022 4844 words 23 minutes

In the previous blog we looked at abusing callbacks inside of ntdll for the purpose of control flow obfuscation. While interesting, these techniques leave a visible trace in the disassembly. In this blog I would like to share a second technique that completely eliminates that trace. If executed correctly, it should let you completely hide control flow and data loads.

Intro

To aid in explaining the technique I’ve made a benign proof of concept executable that poses as a simple calculator that, when given the right parameters, executes a shell command. The source for this executable can be found here. If you have access to a disassembler I invite you to load the executable into it and try to explain the behavior shown next!

Let’s take a look at the proof of concept “calculator” executable in IDA and see if we can spot the control flow leading to the execution of a shell command:

The main function that takes 3 arguments (2 values and an operator), checks if they’re in the 16-bit range and passes them on to a function called calc:

The calc function. This function takes the values and the operator, performs the calculation and returns the result:

For completeness’ sake, the exception handler for calc:

This is every possible code path reachable from main that isn’t a call to an API, or so it seems... Executing this program normally will give us the expected results:

But if we trigger a divide by 0 exception, it executes cmd.exe with the whoami command!?

How did the binary manage to execute a shell command with no visible paths leading to such functionality? Let’s dive in!

Foundational Knowledge

Understanding how this technique works requires a decent grasp of how exception handling is done in usermode on Windows. For this we take a quick look at RtlDispatchException and KiUserExceptionDispatcher, both present inside of ntdll, and at the concept of unwind codes.

Exception Dispatcher

When certain types of exceptions are triggered in a program the kernel interrupts the thread’s execution and forcefully makes it execute KiUserExceptionDispatcher (much like the APC dispatcher covered in the previous blog). Once control is handed back from kernel to usermode the thread will be at KiUserExceptionDispatcher with a CONTEXT and an EXCEPTION_RECORD on the stack. This function will then promptly call RtlDispatchException.

The RtlDispatchException function is where the core logic for SEH is implemented (both vectored exception handling and the Microsoft compiler’s try-except C language extension). The first thing it does is check whether there’s a vectored exception handler registered. We used this functionality in the previous blog to hijack a context, but this time around we skip it. Next, it moves on to the slightly more complex try-except handling code.

To start, the try-except handling logic makes an allocation on the stack the size of a CONTEXT struct. The allocation is used to create a copy of the thread’s context at the time the exception was triggered. This context copy will be used during the stack unwinding process; more on this later. Next, the original context is used to find the faulting function: the CONTEXT->Rip value is passed to RtlLookupFunctionEntry, which returns the RUNTIME_FUNCTION entry that holds the exception handling details for the function in which the exception was triggered. Assuming the exception was triggered inside of a function that was known to the compiler and has a corresponding RUNTIME_FUNCTION registered inside of the PE header, we will obtain our runtime function info and move on to the next step.

The next step is taking the RUNTIME_FUNCTION and a few other contextual parameters and passing them to RtlVirtualUnwind. Combining the info from the RUNTIME_FUNCTION, the CONTEXT, and the instruction pointer it will attempt to unwind the stack from the point of the exception. The unwinding takes place on the copy of the CONTEXT. As the last step in the unwinding process (under most circumstances) the unwinder will check if the RUNTIME_FUNCTION has an exception handler registered and return a pointer to it.

The try-except handling code then takes that returned exception handler pointer and executes it using RtlpExecuteHandlerForException. This function takes a pointer to the original CONTEXT (not the copy; again, this becomes important later) and the EXCEPTION_RECORD and passes them to the user registered exception handler. The user registered exception handler then returns an EXCEPTION_DISPOSITION value (ExceptionContinueExecution, ExceptionContinueSearch, ExceptionNestedException or ExceptionCollidedUnwind), which determines whether the exception handling is considered finished or another loop is going to be performed. The similarly named EXCEPTION_CONTINUE_SEARCH, EXCEPTION_CONTINUE_EXECUTION and EXCEPTION_EXECUTE_HANDLER constants are the values a try-except filter expression evaluates to.

RtlVirtualUnwind

With a little bit of background on the control flow that we’re trying to achieve out of the way, let us have a closer look at the function that enables all of this: RtlVirtualUnwind. Looking at the Windows Research Kernel source we can get a quick idea of the arguments the function takes:

PEXCEPTION_ROUTINE
RtlVirtualUnwind (
    IN ULONG HandlerType,
    IN ULONG64 ImageBase,
    IN ULONG64 ControlPc,
    IN PRUNTIME_FUNCTION FunctionEntry,
    IN OUT PCONTEXT ContextRecord,
    OUT PVOID *HandlerData,
    OUT PULONG64 EstablisherFrame,
    IN OUT PKNONVOLATILE_CONTEXT_POINTERS ContextPointers OPTIONAL
    )

As the description on GitHub might suggest, this function is used for unwinding the stack from the point of the exception. The goal of this is to find return pointers stored on the stack, which in turn allow the dispatcher to find the nearest registered exception handler. Given an instruction pointer (taken from the CONTEXT) it will do roughly one of 3 things:

Check if the instruction pointer is located in the epilogue of a function. If it thinks it’s in an epilogue it will emulate all the instructions up to the return and fetch the return pointer off the stack.
Check if the instruction pointer is located in the prologue. No emulation this time; instead it uses so-called unwind codes to revert the effects the prologue had on the stack and find the stored return pointer.
If it is in neither the prologue nor the epilogue, but in the middle of a function, look up the exception handler for the current function. If none is present, return NULL.

For this technique we are mostly interested in the 2nd path.

Unwind codes

Each function (with some exceptions like thunk and leaf functions) will have a RUNTIME_FUNCTION entry created for it by the compiler. The compiler inserts these RUNTIME_FUNCTION entries into an array that can be reached from the PE header. One of the primary uses for these structs is keeping track of exception and unwind info for each significant function. An entry looks like this:

typedef struct _IMAGE_RUNTIME_FUNCTION_ENTRY {
  DWORD BeginAddress;
  DWORD EndAddress;
  union {
    DWORD UnwindInfoAddress;
    DWORD UnwindData;
  } DUMMYUNIONNAME;
} RUNTIME_FUNCTION, *PRUNTIME_FUNCTION, _IMAGE_RUNTIME_FUNCTION_ENTRY, *_PIMAGE_RUNTIME_FUNCTION_ENTRY;
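
To make the lookup step a bit more concrete, here is a minimal sketch of how an entry could be found for a given address. It is not the ntdll implementation: the toy_ names and the sample table are made up for illustration, and it assumes the table is sorted by BeginAddress so a binary search is possible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for RUNTIME_FUNCTION: all fields are RVAs (hypothetical type). */
typedef struct {
    uint32_t begin_address;   /* first byte of the function          */
    uint32_t end_address;     /* one past the last byte              */
    uint32_t unwind_info;     /* RVA of the function's UNWIND_INFO   */
} toy_runtime_function;

/* Binary search over a table sorted by begin_address.
   Returns NULL when no entry covers the RVA, which is what happens
   for functions the compiler emitted no unwind data for. */
const toy_runtime_function *
toy_lookup_function_entry(const toy_runtime_function *table, size_t count,
                          uint32_t rva)
{
    size_t lo = 0, hi = count;
    assert(table != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].begin_address)
            hi = mid;
        else if (rva >= table[mid].end_address)
            lo = mid + 1;
        else
            return &table[mid];
    }
    return NULL;
}

/* Made-up sample table with a gap between the second and third function. */
const toy_runtime_function toy_table[] = {
    { 0x1000, 0x1040, 0x3000 },
    { 0x1040, 0x10a0, 0x300c },
    { 0x1100, 0x1180, 0x3018 },
};
```

The point to take away is that the lookup is driven purely by an address: whoever controls the address that goes in controls which entry, and therefore which handler, comes out.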

The UnwindInfoAddress points to an UNWIND_INFO struct containing the unwind information:

typedef struct _UNWIND_INFO {
  UBYTE Version : 3;
  UBYTE Flags : 5;
  UBYTE SizeOfProlog;
  UBYTE CountOfCodes;
  UBYTE FrameRegister : 4;
  UBYTE FrameOffset : 4;
  UNWIND_CODE UnwindCode[1]; 
  UNWIND_CODE MoreUnwindCode[((CountOfCodes + 1) & ~1) - 1];
  union {
    OPTIONAL ULONG ExceptionHandler;
    OPTIONAL ULONG FunctionEntry;
  };
  OPTIONAL ULONG ExceptionData[];
} UNWIND_INFO, *PUNWIND_INFO;

When RtlVirtualUnwind is invoked, the RUNTIME_FUNCTION entry (obtained from a call to RtlLookupFunctionEntry) is passed to it as an argument. Then, if it is determined that the exception occurred inside of the prologue, the unwinder will get a pointer to the UNWIND_INFO struct through the UnwindInfoAddress member. From this struct the primary members used during unwinding are CountOfCodes and UnwindCode, where CountOfCodes gives the size of the UnwindCode array.
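
As a rough illustration of what one element of that array holds, the sketch below splits a 2-byte unwind code slot into its fields, following the documented x64 layout (a code offset byte, then a 4-bit operation and 4-bit operation info). The toy_ names and the hand-assembled sample bytes are my own assumptions, not taken from a real binary.

```c
#include <assert.h>
#include <stdint.h>

/* Operation numbers from the documented x64 unwind format. */
enum { UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_LARGE = 1, UWOP_ALLOC_SMALL = 2 };

typedef struct {
    uint8_t code_offset;  /* offset just past the instruction in the prologue */
    uint8_t unwind_op;    /* low 4 bits of the second byte                    */
    uint8_t op_info;      /* high 4 bits of the second byte                   */
} toy_unwind_code;

/* Split one 2-byte slot into its three fields. */
toy_unwind_code toy_decode_slot(const uint8_t slot[2])
{
    toy_unwind_code c;
    c.code_offset = slot[0];
    c.unwind_op   = (uint8_t)(slot[1] & 0x0f);
    c.op_info     = (uint8_t)(slot[1] >> 4);
    return c;
}

/* UWOP_ALLOC_SMALL stores (size / 8) - 1 in op_info. */
unsigned toy_alloc_small_size(toy_unwind_code c)
{
    assert(c.unwind_op == UWOP_ALLOC_SMALL);
    return (unsigned)c.op_info * 8u + 8u;
}

/* Hand-assembled slots for "push rdi; sub rsp, 58h" (1 + 4 code bytes),
   stored last instruction first. */
const uint8_t toy_slots[2][2] = {
    { 0x05, 0xa2 },  /* offset 5: UWOP_ALLOC_SMALL, info 0xA -> 0x58 bytes */
    { 0x01, 0x70 },  /* offset 1: UWOP_PUSH_NONVOL, info 7   -> rdi        */
};
```

Decoding toy_slots[0] gives an allocation of 0x58 bytes, and toy_slots[1] gives a push of register number 7, which is rdi in the unwind format's register numbering.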

These unwind codes are essentially a simplified representation of the instructions contained in the prologue that had an effect on the stack. For example:

INSTRUCTION	: CORRESPONDING UNWIND CODE

sub rsp, 58h	: UWOP_ALLOC_SMALL
push rdi	: UWOP_PUSH_NONVOL 

The unwinder will parse these unwind codes, in reverse order, and “undo” the effects the prologue had on the stack. That is to say, if a push rdi instruction was generated by the compiler for your C code, a UWOP_PUSH_NONVOL unwind code would have been inserted for it, and during unwinding the unwinder reverts that push by doing a “pop” and registering its effects inside of the context copy. Under normal circumstances it would parse the full array, perfectly undo the effects the instructions had on the stack and arrive at a stored return pointer value that can then be used for further unwinding.
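
The “undo” can be modelled in a few lines. This is a simplified sketch under my own assumptions, not the real unwinder: the stack is an array of 8-byte slots, the context is a small hypothetical struct, and only the two operations mentioned so far are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_SMALL = 2 };
enum { REG_RSI = 6, REG_RDI = 7, REG_COUNT = 16 };

typedef struct { uint8_t op; uint8_t info; } toy_code;

typedef struct {
    uint64_t reg[REG_COUNT];
    size_t   rsp;   /* index into the toy stack, in 8-byte slots */
    uint64_t rip;
} toy_context;

/* Undo a prologue on a COPY of the context. The codes are stored last
   instruction first, so walking the array front to back undoes the
   prologue in reverse. */
void toy_unwind_prologue(toy_context *ctx, const uint64_t *stack, size_t slots,
                         const toy_code *codes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (codes[i].op == UWOP_PUSH_NONVOL) {
            assert(ctx->rsp < slots);
            ctx->reg[codes[i].info] = stack[ctx->rsp];   /* the "pop" */
            ctx->rsp += 1;
        } else if (codes[i].op == UWOP_ALLOC_SMALL) {
            ctx->rsp += (size_t)codes[i].info + 1;       /* info * 8 + 8 bytes */
        }
    }
    assert(ctx->rsp < slots);
    ctx->rip = stack[ctx->rsp];   /* whatever sits here is taken as the return address */
    ctx->rsp += 1;
}

/* Stack after "push rdi; push rsi; sub rsp, 10h", lowest address first:
   two local slots, saved rsi, saved rdi, return address (made-up values). */
const uint64_t toy_stack[] = { 0, 0, 0x5151, 0xd1d1, 0x401234 };
const toy_code toy_codes[] = {
    { UWOP_ALLOC_SMALL, 1 },        /* sub rsp, 10h */
    { UWOP_PUSH_NONVOL, REG_RSI },  /* push rsi     */
    { UWOP_PUSH_NONVOL, REG_RDI },  /* push rdi     */
};

/* Convenience wrapper so the result can be inspected as a value. */
toy_context toy_unwind_demo(void)
{
    toy_context ctx = { {0}, 0, 0 };
    toy_unwind_prologue(&ctx, toy_stack, 5, toy_codes, 3);
    return ctx;
}
```

With codes that honestly describe the prologue, toy_unwind_demo() ends with rip set to the stored return address and rsi/rdi restored to their saved values.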

This is of course not how we will be using it today.

Setting the Stage

Unwind codes offer a very convenient way of covertly influencing the thread’s state through its context. However, their power is limited. Unwind codes can alter the general-purpose registers (RAX, RBX, RCX, etc.), RSP, RBP and RIP inside of the context copy. There also isn’t any immediately obvious way of causing the execution of any code. We can’t even use the context hijacking technique described in the previous blog, because we can’t alter the actual context, only a copy. In addition to this, it is the Microsoft compiler that generates these codes for you from your C source code, and there’s no way to intercept it during compilation (unless you want to have some fun with RtlAddFunctionTable, of course).

So you’re saying there’s no way to work with these codes ourselves? Well... there is one slightly obscure way.

Unwind Code Abuse

The MASM assembler supports some meta keywords for inserting these unwind codes. An example of a simple assembly function with these meta keywords added:

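align(16)
exception_handler PROC
	ret
exception_handler ENDP

align(16)
my_function PROC FRAME:exception_handler
	push rdi
	.pushreg rdi        ; records UWOP_PUSH_NONVOL for rdi

	push rsi
	.pushreg rsi        ; records UWOP_PUSH_NONVOL for rsi

	sub rsp, 58h
	.allocstack 58h     ; records UWOP_ALLOC_SMALL

	.endprolog

	add rsp, 58h
	pop rsi
	pop rdi
	ret
my_function ENDP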

The important keywords to take note of in the above example are (all these effects take place in the copy of the context):

.pushreg reg: this inserts the UWOP_PUSH_NONVOL unwind code into the array. During unwinding this has the effect of popping the current value at RSP into reg
.allocstack XXh: this inserts the UWOP_ALLOC_SMALL unwind code into the array. During unwinding this has the effect of adding XXh to RSP
.endprolog: this signals the end of the prologue declarations, according to MSDN. Funnily enough this appears to have no real effect on the unwind codes for the prologue, as you can put unwind codes after it and they will still get inserted.
FRAME:<func>: this, in combination with .endprolog, is what triggers the creation of a RUNTIME_FUNCTION entry for the function. The <func> is the name of the function that you want to assign as the exception handler for that function

So what happens if we just insert some random keywords that are unrelated to the actual instructions like this:

push rbp
mov rbp, rsp
push rax

.pushreg rbx
.pushreg rcx
.pushreg rdx
.allocstack 200h
.pushreg rdi
.pushreg r15

.endprolog

pop rax
leave
ret

That works! The unwinding mechanism takes the generated unwind codes as truth and updates the context to reflect the effect it believes they had on the real stack. After it is done parsing these spurious unwind codes it assumes CONTEXT->Rsp is now pointing to the stack location containing a saved return pointer. Lastly, the CONTEXT->Rip member is updated with the value read from the top of the stack. The copied context is now nicely corrupted, but nothing really happens after that. How can we abuse this corruption to get something to execute?

Directing Control Flow

Now that we understand how to use, and misuse, the unwind codes, let’s look at how we can use them to alter the control flow. For this I have created a very minimal pseudo example (containing only the important bits) consisting of a C source file and an assembly file:

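#include <stdio.h>

/* implemented in the assembly file below */
extern void example(void (*fn)(void));

void
decoy(void)
{
	printf("AV's got nothing on this!");
}

void
main()
{
	printf("Passing the decoy function to our assembly example function");
	example(decoy);
}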

handler PROC
	;return 2 from our exception handler
	mov rax, 2
	ret

example PROC FRAME:handler

	;put our "decoy" function pointer on the stack
	push rcx
	
	;put some more values on the stack
	xor rax, rax
	push rax
	push rax
	
	;insert our unwind codes
	.pushreg rax
	.pushreg rax
	.endprolog
	
	;trigger an exception (divide 0x30 by 0)
	mov rax, 30h
	mov rcx, 0h
	idiv rcx
	
	;epilogue (not relevant)
	pop rax
	pop rax
	pop rcx
	ret

If we were to execute this program the following will happen:

main calls our example function with a pointer to our decoy function
The example function takes the pointer to decoy that is stored in rcx and puts it on the stack
It then pushes 2 more values onto the stack
Eventually it triggers a divide by 0 exception

Upon triggering the exception our stack is as follows:

Next, the kernel drops our thread off at KiUserExceptionDispatcher and subsequently RtlDispatchException. The exception dispatcher makes a copy of our context, finds our RUNTIME_FUNCTION entry and calls RtlVirtualUnwind. This is where the unwind codes come into play. Let’s take a look at the resulting steps:

The unwinder locates the last unwind code, .pushreg rax; it pops the value located at CONTEXT->Rsp into CONTEXT->Rax and adds 8 to the rsp value in the context
The unwinder proceeds to the next unwind code and finds another .pushreg rax; it performs the same steps as above
The unwinder finds no more unwind codes, as we only inserted 2, and thinks it is done unwinding the prologue. Having unwound the stack, it expects to find the stored return pointer at the location CONTEXT->Rsp points to and loads it into CONTEXT->Rip. However, rather than a return pointer, it loads the pointer to decoy.
Lastly, it grabs the registered exception handler for the example function, in this case the handler() function from our assembly file (we registered it using FRAME:handler), and returns it from RtlVirtualUnwind.
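
The steps above boil down to an off-by-one between what the prologue really pushed and what the unwind codes claim. A small sketch, with made-up addresses standing in for decoy and for the return address into main (the TOY_ and toy_ names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DECOY   0x140001000ull  /* hypothetical address of decoy       */
#define TOY_RETURN  0x140002345ull  /* hypothetical return address in main */

/* Stack at the moment of the idiv, lowest address first:
   the two zeroed rax pushes, the pushed rcx (pointer to decoy),
   and the return address. */
const uint64_t toy_example_stack[4] = { 0, 0, TOY_DECOY, TOY_RETURN };

/* The unwinder pops one slot per UWOP_PUSH_NONVOL code and then treats
   the next slot as the return address. */
uint64_t toy_rip_after_unwind(const uint64_t *stack, unsigned slots,
                              unsigned pushreg_codes)
{
    assert(pushreg_codes < slots);
    return stack[pushreg_codes];
}
```

With three codes (one per real push) the unwinder would land on the genuine return address; with the two codes we actually emitted it lands one slot early, on the pointer to decoy.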

Now we have a copy of a context with its RIP member set to the decoy() function, but no execution… yet. The next step the dispatcher takes is executing the exception handler that got returned from the unwinder, handler in our case. For clarity the exception handler is empty and only returns 2. (In the EXCEPTION_DISPOSITION enum of the Windows headers, 2 is ExceptionNestedException rather than ExceptionContinueSearch, which is 1; judging by the Windows Research Kernel source, both make the dispatcher continue its search.) This will trigger the dispatcher to do another loop starting at RtlLookupFunctionEntry. Again, this function takes our context copy, checks if CONTEXT->Rip is within the bounds of any registered function and returns its RUNTIME_FUNCTION entry. As we managed to control CONTEXT->Rip using the unwind codes in the previous loop, we get to decide which entry is returned here. This RUNTIME_FUNCTION is then again passed on to RtlVirtualUnwind and used to perform the previously explained logic. You can see where this is going… (or maybe not, as all of this is a confusing mess!)

Code Execution

Having a way to covertly alter context info and control which RUNTIME_FUNCTION is returned by RtlLookupFunctionEntry, the final step will be converting that into code execution. For this we take the assembly file from our previous example and move our decoy() function into it. We then create an exception handler for this function using the FRAME keyword and have it point to the function we actually want to execute (my_execution):

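#include <stdio.h>
/* both implemented in the assembly file below */
extern void decoy(void);
extern void example(void (*fn)(void));

void
main()
{
	printf("Passing the decoy function to our assembly example function");
	example(decoy);
}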

my_execution PROC
	;perform the logic you want to hide in here
	ShellExec(...)
	mov rax, 0
	ret

decoy PROC FRAME:my_execution
	;you can put anything in here, this function is merely a decoy and will never be executed
	;it is also possible to do more context editing using the unwind code keywords in here (we won't for simplicity's sake)
	push rbp
	mov rbp, rsp
	leave
	ret

handler PROC
	;return 2 from our exception handler
	mov rax, 2
	ret

example PROC FRAME:handler

	;put our "decoy" function pointer on the stack
	push rcx
	
	;put some more values on the stack
	xor rax, rax
	push rax
	push rax
	
	;insert our unwind codes
	.pushreg rax
	.pushreg rax
	.endprolog
	
	;trigger an exception (divide 0x30 by 0)
	mov rax, 30h
	mov rcx, 0h
	idiv rcx
	
	;epilogue (not relevant)
	pop rax
	pop rax
	pop rcx
	ret

Now, when the exception dispatcher starts handling our divide by 0 exception it will:

Find the function entry for example
Parse our malicious unwind codes, putting a pointer to decoy in CONTEXT->Rip
Execute the exception handler handler for our example function
handler returns 2, causing another loop of the dispatcher
Now it finds the function entry for the decoy function
Unwind nothing as it has no unwind codes
Execute registered exception handler my_execution for our decoy function
Execute our actual payload
Return 0 to return from the exception dispatcher
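
The list above can be condensed into a toy model of the dispatcher loop. This is only a sketch of the control flow as described in this post, with heavy simplifications: addresses are small made-up numbers, the whole effect of RtlVirtualUnwind is reduced to "Rip becomes this value", and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_CONTINUE_EXECUTION = 0, TOY_KEEP_SEARCHING = 2 };

typedef int (*toy_handler)(void);

typedef struct {
    unsigned    begin, end;        /* address range of the function       */
    unsigned    rip_after_unwind;  /* what the unwind codes leave in Rip  */
    toy_handler handler;           /* handler named by FRAME:             */
} toy_entry;

char toy_trace[64];                /* records which handlers ran, in order */

int toy_handler_fn(void)      { strcat(toy_trace, "handler;");      return TOY_KEEP_SEARCHING; }
int toy_my_execution_fn(void) { strcat(toy_trace, "my_execution;"); return TOY_CONTINUE_EXECUTION; }

/* example lives at 0x100..0x140 and its unwind codes leave Rip inside
   decoy (0x200..0x210), whose registered handler is my_execution. */
const toy_entry toy_functions[] = {
    { 0x100, 0x140, 0x200, toy_handler_fn },
    { 0x200, 0x210, 0x999, toy_my_execution_fn },
};

/* Look up the entry for Rip, apply the unwind, call the handler, repeat.
   Returns the number of handlers that were executed. */
int toy_dispatch(unsigned faulting_rip)
{
    unsigned rip = faulting_rip;
    int calls = 0;
    toy_trace[0] = '\0';
    for (;;) {
        const toy_entry *e = NULL;
        for (size_t i = 0; i < sizeof toy_functions / sizeof toy_functions[0]; i++)
            if (rip >= toy_functions[i].begin && rip < toy_functions[i].end)
                e = &toy_functions[i];
        if (e == NULL)
            return calls;            /* simplification: no entry, give up */
        rip = e->rip_after_unwind;   /* effect of the unwind on the context copy */
        calls++;
        if (e->handler() == TOY_CONTINUE_EXECUTION)
            return calls;
    }
}
```

Calling toy_dispatch with an address inside example runs two handlers in the order the list describes: first handler, then my_execution, even though nothing in the "code" of example or decoy refers to my_execution.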

Through a few simple steps we went from triggering an exception to NTDLL executing code for us using “meta opcodes” that don’t show up in the disassembly. While theoretically simple to implement, there are a lot of minor snags in the process of developing a payload that we ignored in this example. Up next, let’s look at an actual working proof of concept that successfully hides a call to CreateProcessA (with cmd.exe /c whoami as its argument) from the disassembler.

PoC Payload

The following PoC (and how to compile it) can also be found on my GitHub.

This proof of concept masquerades as a simple calculator command line application; this will be the only code that appears to be part of the execution control flow of the binary. Using the input “10 / 0” as a trigger, the code enters the exception dispatching process explained above. In there it will pop a pointer that was stored in an earlier stack frame into RSP, pivoting the stack to a pre-constructed “stack” in the data section. Using this pre-constructed stack it will abuse unwind codes to hide the data loads of a pointer to our shell command and a pointer to CreateProcessA, and eventually set CONTEXT->Rip to the value at the current stack location (the offset to our decoy function).
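
The pivot is easier to follow on a toy memory model. The sketch below replays the six .pushreg codes of calc in the order the unwinder processes them. It assumes a pop into RSP loads the value first and then advances it by one slot, which is how the Windows Research Kernel source handles UWOP_PUSH_NONVOL; the memory layout, the TOY_ placeholder values and the toy_ names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { R_RAX = 0, R_RDX = 2, R_RSP = 4, R_R15 = 15 };

/* Word-addressed toy memory: slots 0..2 play the stack, 10..14 the .DATA values. */
enum { STACK_TOP = 0, FIRST_VAL = 10 };
#define TOY_CMD_PTR   0xc0ddull    /* stands in for OFFSET cmd            */
#define TOY_API_PTR   0xa91ull     /* stands in for OFFSET CreateProcessA */
#define TOY_DECOY_PTR 0xdec0ull    /* stands in for OFFSET decoy          */

const uint64_t toy_mem[16] = {
    [0]  = 0x1111,         /* saved rbp                        */
    [1]  = 0x2222,         /* return address into main         */
    [2]  = FIRST_VAL,      /* home slot: pointer to first_val  */
    [10] = 10,             /* first_val                        */
    [11] = 0,              /* second_val                       */
    [12] = TOY_CMD_PTR,    /* tmp_cmd                          */
    [13] = TOY_API_PTR,    /* tmp_ptr                          */
    [14] = TOY_DECOY_PTR,  /* tmp_var                          */
};

typedef struct { uint64_t reg[16]; uint64_t rip; } toy_ctx;

/* Pop the slot at Rsp into a register, then advance Rsp by one slot.
   When the target is Rsp itself the load happens first, so Rsp ends up
   one slot past the loaded pointer. */
static void toy_pop(toy_ctx *c, int r)
{
    assert(c->reg[R_RSP] < 16);
    c->reg[r] = toy_mem[c->reg[R_RSP]];
    c->reg[R_RSP] += 1;
}

toy_ctx toy_run_poc_codes(void)
{
    /* Processing order: the reverse of the directive order in calc. */
    static const int order[] = { R_RAX, R_RAX, R_RSP, R_RAX, R_RDX, R_R15 };
    toy_ctx c = { {0}, 0 };
    c.reg[R_RSP] = STACK_TOP;
    for (unsigned i = 0; i < sizeof order / sizeof order[0]; i++)
        toy_pop(&c, order[i]);
    assert(c.reg[R_RSP] < 16);
    c.rip = toy_mem[c.reg[R_RSP]];   /* taken as the return address */
    return c;
}
```

After the run, the toy context holds the command pointer in rdx, the API pointer in r15 and the decoy pointer in rip, all loaded by nothing more than a sequence of pops.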

Remember, none of these loads are at all visible in the disassembly as they happen deep inside of the unwinding logic inside of ntdll.

With the context copy now containing our shell command pointer in rdx, the pointer to CreateProcessA inside of r15 and the instruction pointer pointing to decoy, everything is set up to continue to our payload execution function. As explained before, first the normal exception handler for the erroring function is executed (this one is fully visible in disassemblers such as IDA). This handler function is a benign exception handler that merely fixes the exception by changing the divide by 0 to a divide by 1. We return 2 from this exception handler to perform another loop of the dispatcher. The dispatcher now thinks decoy is the erroring function and loads its RUNTIME_FUNCTION, whose exception handler is dispatcher. Through the process explained above dispatcher is eventually executed, unbeknownst to anyone reading the disassembly, and it executes our shell command using the registers in the context copy.
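
The assembly listing further down reads these registers through raw offsets (088h, 0F0h, 0C8h, 098h). As a sanity check, the leading part of the x64 CONTEXT layout can be re-declared and the offsets computed. The struct below is my own re-declaration (field names follow winnt.h) so it compiles without windows.h; toy_reg_offset is a hypothetical helper that uses the unwind format's register numbering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading part of the x64 CONTEXT layout, up to and including Rip. */
typedef struct {
    uint64_t P1Home, P2Home, P3Home, P4Home, P5Home, P6Home;
    uint32_t ContextFlags, MxCsr;
    uint16_t SegCs, SegDs, SegEs, SegFs, SegGs, SegSs;
    uint32_t EFlags;
    uint64_t Dr0, Dr1, Dr2, Dr3, Dr6, Dr7;
    uint64_t Rax, Rcx, Rdx, Rbx, Rsp, Rbp, Rsi, Rdi;
    uint64_t R8, R9, R10, R11, R12, R13, R14, R15;
    uint64_t Rip;
} toy_context_head;

/* Offset of integer register n, where 0 = Rax, 1 = Rcx, 2 = Rdx, 3 = Rbx,
   4 = Rsp ... 15 = R15. The integer registers are laid out back to back
   starting at Rax, which is why an unwind code's register number can be
   used as an array index. */
size_t toy_reg_offset(unsigned n)
{
    assert(n < 16);
    return offsetof(toy_context_head, Rax) + 8u * n;
}
```

So 088h is Rdx, 0F0h is R15, 0C8h is R10 (the divisor in calc) and 098h is Rsp, matching the comments in the listing.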

/*
// Proof of concept code for abusing the exception handling logic
// inside of ntdll to hide code and data flow.
//
// This code and research was done by:              lldre
//
//
// DISCLAIMER: This proof of concept is for educational purposes only, I am
//             not liable for any misuse or abuse resulting from it.
//
*/

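#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi, exit */

/* implemented in the assembly file */
extern int calc(__int64 *values, char *op);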


void
main(int argc, char** argv)
{

/*
    Get the variable that will be used to store the first
    value in our calculation. We will take the address of this
    value and use it to access the second value as well. Eventually
    we pass the address to our calc function.

    NOTE: This is where we provide calc with an address inside the data
          section that our exception handler can later use to reference
          other members in the data section
*/ 
    extern __int64 first_val;
    int result;

    if (argc != 4)
    {
        printf("Usage: <exe> [16-bit number] [+|-|/] [16-bit number]\n");
        exit(-1);
    }

/*
    Get our calculator values from the cmdline args
*/
    (&first_val)[0] = atoi(argv[1]);
    (&first_val)[1] = atoi(argv[3]);

/*
    Check if our values fit inside of a word, so we don't
    run into integer overflow issues inside of our calc code.
*/
    if ( (&first_val)[0] > 0xFFFF || (&first_val)[1] > 0xFFFF)
    {
        printf("Error: values too big\n");
        exit(-1);
    }

    result = calc(&first_val, argv[2]);
    printf("result: %d\n", result);
    
    return;
}

;
; Proof of concept code for abusing the exception handling logic
; inside of ntdll to hide code and data flow.
;
; This code and research was done by:              lldre
;
;
; DISCLAIMER: This proof of concept is for educational purposes only, I will
;             not be liable for any misuse or abuse resulting from it
;
;


PUBLIC first_val

EXTERN CreateProcessA :PROC


.DATA
; "cmd.exe /c echo. & whoami" (xor'd with 0x88)
cmd DB 0EBh, 0E5h, 0ECh, 0A6h, 0EDh, 0F0h, 0EDh, 0A8h, 0A7h, 0EBh, 0A8h, 0EDh, 0EBh, 0E0h, 0E7h, 0A6h, 0A8h, 0AEh, 0A8h, 0FFh, 0E0h, 0E7h, 0E9h, 0E5h, 0E1h, 088h
align(8)
usage DB "Usage: <exe> [16-bit number] [+|-|/] [16-bit number]\n", 0
align(8)
first_val  QWORD 0
second_val QWORD 0
tmp_cmd QWORD (OFFSET cmd)
tmp_ptr  QWORD (OFFSET CreateProcessA)

; After the first RtlVirtualUnwind, this will be
; the value that it thinks is the stored return pointer
; after unwinding the prolog. As such it will take this
; pointer, check if it has a valid RUNTIME_FUNCTION associated
; with it and execute its exception handler, which is `dispatcher`.
tmp_var  QWORD (OFFSET decoy)


.CODE

; The "Exception Handler" that is linked to our `decoy` function.
; This gets executed after the `handler` exception handler and 
; shouldn't be linked to anything in the binary.
align(16)
dispatcher PROC
    push rbp
    mov rbp, rsp
    push rdi

    ; Create enough stack space to store 2 structs
    ; and push the arguments to CreateProcessA
    sub rsp, (068h + 020h + 020h + (6 * 8))
    
    ; zero initialise lpStartupInfo
    lea rdi, [rbp - 068h]
    mov rcx, 0Dh
    xor rax, rax
    rep stosq

    ; initialise STARTUPINFO->cb member
    mov eax, 068h
    lea rdi, [rbp - 068h]
    mov [rdi], eax

    ; zero initialise lpProcessInformation
    lea rdi, [rbp - 088h]
    mov rcx, 4
    xor rax, rax
    rep stosq

    ; Load CONTEXT->Rdx containing ptr to our
    ; encrypted cmd. This was obtained with
    ; the `.pushreg rdx` inside of calc()
    mov rdx, [r14 + 088h]
    xor rcx, rcx

dcrypt:
    ; We stored our cmdline "encrypted" in the data
    ; segment, so we need to decrypt it
    mov al, [rdx + rcx]
    xor al, 088h
    mov [rdx + rcx], al
    inc rcx

    test al, al
    jne dcrypt

    ; More null arguments
    mov rcx, 0  
    mov r8, 0
    mov r9, 0

    ; lpStartupInfo
    lea rax, [rbp - 068h]
    mov [rbp - 0A0h], rax

    ; lpProcessInformation
    lea rax, [rbp - 088h]
    mov [rbp - 098h], rax

    ; Push null arguments to CreateProcessA
    xor rax, rax
    mov [rbp - 0C0h], rax
    mov [rbp - 0B8h], rax
    mov [rbp - 0B0h], rax
    mov [rbp - 0A8h], rax

    ; Call CreateProcessA from the r15 register that
    ; was used to obtain the pointer from the data section
    ; using the meta codes earlier
    mov rax, [r14 + 0F0h]
    call rax


    ; Epilog
    add rsp, (068h + 020h + 020h + (6 * 8))
    pop rdi

    ; Store return value of CreateProcessA
    ; for future use (not used in this example)
    mov [r14 + 078h], rax

    ; return code required to return from
    ; exception handler to normal code  
    mov rax, 00h              
    pop rbp
    ret

dispatcher ENDP


align(16)
decoy PROC FRAME:dispatcher
; We can't use a function containing just a single ret. The unwinder determines
; whether it is in an epilogue, and if it is it won't execute the exception handler.
; That's why we add some fodder instructions, so the unwinder will attempt
; to find and execute the exception handler
    .endprolog

    push rax
    pop rax
    mov rcx, r8
    add r8, 012h
    ret

decoy ENDP


; Our normal exception handler for handling the divide by 0
; In this case we statically edit "10 / 0" to "10 / 1", so
; it won't trigger an exception again and continue execution.
align(16)
handler PROC
    ; Fix the divide by 0 in "10 / 0"
    mov rax, [r8+0C8h]
    test rax, rax
    jne LABEL1
    mov rax, 01h
    mov [r8+0C8h], rax
LABEL1:
    ; Copy RSP from the original CONTEXT to the CONTEXT copy
    mov rax, [r8+098h]
    mov [r14+098h], rax

    ; Return 2 to force another loop of the exception
    ; handling logic
    mov rax, 2
    ret

handler ENDP


; The simple calculator code used as a front for hiding
; our exception shenanigans. 
align(16)
calc PROC FRAME:handler

    push rbp
    mov rbp, rsp

    ; Store the pointer to `first_val` on the stack, so we
    ; can retrieve it later
    mov [rbp + 10h], rcx
    mov [rbp + 18h], rdx


; ################################################################################
; ################################################################################
; This is where the magic happens. These are the "meta opcodes"
; that are invisible to the disassembler and static analysis engines.
; Keep in mind the following meta codes are executed in reverse order.
; ################################################################################

    ; load ptr to CreateProcessA
    .pushreg r15

    ; load ptr to our cmd string stored inside of 
    ; the `tmp_cmd` value
    .pushreg rdx

    ; skip `second_val`
    .pushreg rax

    ; Pivot RSP to our .DATA section using the
    ; stored pointer to `first_val`
    .pushreg rsp

    ; pop return pointer off stack
    .pushreg rax

    ; pop rbp off stack
    .pushreg rax
    .endprolog


; ################################################################################
; ################################################################################
; ################################################################################


    ; Execute the simple calculator logic
    mov rbx, [rcx]
    mov r10, [rcx + 08h]

    mov al, [rdx]

    cmp al, 02Bh
    je l_add
    cmp al, 02Dh
    je l_sub
    cmp al, 02Fh
    je l_div
    mov eax, 0
    jmp l_end

l_add:
    mov rax, rbx
    add rax, r10
    jmp l_end

l_sub:
    mov rax, rbx
    sub rax, r10
    jmp l_end

l_div:
    mov rax, rbx
    xor rdx, rdx
    idiv r10

l_end:

    pop rbp
    ret

calc ENDP

END
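
For readers who want to check the cmd bytes without running anything, the dcrypt loop is easy to mirror in portable C. This is a sketch of the same XOR with 0x88, working on a caller-supplied output buffer instead of decoding in place; the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The `cmd` bytes from the .DATA section of the listing above. */
const unsigned char toy_cmd_enc[26] = {
    0xEB, 0xE5, 0xEC, 0xA6, 0xED, 0xF0, 0xED, 0xA8, 0xA7, 0xEB, 0xA8, 0xED, 0xEB,
    0xE0, 0xE7, 0xA6, 0xA8, 0xAE, 0xA8, 0xFF, 0xE0, 0xE7, 0xE9, 0xE5, 0xE1, 0x88
};

/* XOR each byte with 0x88 and stop after the byte that decodes to the
   terminating zero, like the dcrypt loop does. Returns the length of the
   decoded string (without the terminator). */
size_t toy_xor_decode(const unsigned char *in, size_t len, char *out)
{
    size_t i;
    for (i = 0; i < len; i++) {
        out[i] = (char)(in[i] ^ 0x88);
        if (out[i] == '\0')
            return i;
    }
    assert(!"input was not terminated");
    return len;
}
```

Decoding toy_cmd_enc into a 26-byte buffer yields the 25-character string from the comment in the listing, "cmd.exe /c echo. & whoami".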

Final remarks

The proof of concept is still a very simple example, only proving the possibility of hiding control flow. For more advanced examples you could be looking at executing multiple functions in a row by looping the dispatcher, abusing uninitialised variables that get skipped by RtlInitializeExtendedContext2, obfuscating only sensitive data loads, etc.

Hopefully, despite being painfully complex, it was an interesting read and inspires others to do more research into advanced control flow obfuscation techniques.

To be continued…