So I'm just spitballing this (not even sure I'll have time), but the basic design would be like this.
1) If we want to reconstruct a worst-case 20kHz sine wave, the period of this signal is 50us. With a 32MHz main clock, one clock tick is about 31ns, so that's 1600 clock ticks per period of the sine wave.
2) If we only compute one sine "lobe" (a half cycle, 0 to pi radians), we have to restart the computation at twice the frequency of the sine wave. So this would happen at a 40kHz rate, or once every 25us, or every 800 main clock cycles. We get the "other" sine lobe by taking the first and just multiplying everything by -1.
3) We have to step the DAC through the sine wave. If we split the lobe into something like 64 steps (this essentially sets the sampling rate, 128 samples per period), this only leaves 800/64 = 12.5, so about 12 clocks per step to calculate the "next" sine value.
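The budget above is easy to sanity-check in code. This is just a rough sketch of the arithmetic; the function name `clocks_per_step` and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available per DAC step, assuming one half cycle ("lobe")
   of the sine is split into steps_per_lobe equal steps.
   Hypothetical helper, only meant to mirror the numbers in the post. */
uint32_t clocks_per_step(uint32_t f_clk_hz, uint32_t f_sine_hz,
                         uint32_t steps_per_lobe)
{
    assert(f_sine_hz > 0 && steps_per_lobe > 0);
    uint32_t clocks_per_period = f_clk_hz / f_sine_hz;  /* 32MHz / 20kHz = 1600 */
    uint32_t clocks_per_lobe = clocks_per_period / 2;   /* half cycle: 800 */
    return clocks_per_lobe / steps_per_lobe;            /* integer division rounds down */
}
```

With 32000000, 20000 and 64 as inputs the integer division drops the fractional half clock, which is where the "about 12" figure comes from.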
So I can see this would be quite a challenge to compute on the fly.
A load/store from a pre-calculated sine wave table, on the other hand, fits easily in 12 clock cycles, even with vanilla C.
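Something like the following is what I have in mind for the table lookup, assuming the table holds just the positive lobe and the negative lobe is the same data negated. The name `sine_sample` and the `int16_t` sample type are assumptions for the sketch, and the actual DAC write is left out since it depends on the part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the signed sample for a phase index in [0, 2*lobe_len).
   lobe[] holds one precomputed positive half cycle; the second
   half cycle reuses the same entries with the sign flipped. */
int16_t sine_sample(const int16_t *lobe, size_t lobe_len, size_t phase)
{
    assert(lobe != NULL && lobe_len > 0 && phase < 2 * lobe_len);
    if (phase < lobe_len) {
        return lobe[phase];                    /* first lobe: straight lookup */
    }
    return (int16_t)-lobe[phase - lobe_len];   /* second lobe: negate */
}
```

Per step that is one compare, one load and possibly one negate, which should sit well inside the 12 clock budget on most cores, though I haven't counted cycles on real hardware.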
Perhaps a hybrid approach is better: take the step size as an input and run the sine wave table calculation at run time, before the output signal is started. Adjusting the frequency is then simply a matter of changing the rate at which we step through the table, which is a single PWM divider register.
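A minimal sketch of that hybrid idea, under some assumptions of mine: the table is filled once at startup using Bhaskara I's integer approximation of sine (swapped in here only to avoid floating point; a plain `sin()` from libm would do the same job more accurately), and the frequency is set by a divider value computed from the clock. `build_lobe_table` and `step_divider` are hypothetical names, and writing the divider to an actual PWM/timer register is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill lobe[0..n-1] with one positive half cycle, scaled to amplitude.
   Uses Bhaskara I's approximation: sin(pi*i/n) ~= 16p / (5n^2 - 4p),
   with p = i*(n - i). Integer only; worst-case error is a bit under 0.2%. */
void build_lobe_table(int16_t *lobe, size_t n, int16_t amplitude)
{
    assert(lobe != NULL && n > 0 && n <= 1024 && amplitude >= 0);
    for (size_t i = 0; i < n; i++) {
        int64_t p = (int64_t)i * (int64_t)(n - i);
        int64_t num = 16 * p * amplitude;
        int64_t den = 5 * (int64_t)n * (int64_t)n - 4 * p;
        lobe[i] = (int16_t)((num + den / 2) / den);   /* round to nearest */
    }
}

/* Main clock ticks between table steps for a target sine frequency.
   This is the value that would go into the (hypothetical) divider register. */
uint32_t step_divider(uint32_t f_clk_hz, uint32_t f_sine_hz,
                      uint32_t steps_per_lobe)
{
    assert(f_sine_hz > 0 && steps_per_lobe > 0);
    return f_clk_hz / (2u * f_sine_hz * steps_per_lobe);
}
```

One thing this makes obvious: the divider is an integer, so at 32MHz with 64 steps per lobe a request for 20kHz gives a divider of 12 and an actual output of roughly 20.8kHz. Lower frequencies have finer resolution since the divider is larger.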