Use real data for the overlaps at the start / end of every iblock iteration#4
David-McKenna wants to merge 1 commit into cbassa:master
This is achieved by initially processing less data on the first iteration to fill up the buffer, then on future iterations re-using data for overlaps. The general layout of a given block is <overlap_0><data0..N><overlap_1>, followed by the next iteration <dataN><overlap_1><data0...N><overlap_2>, etc.
Given the activity on #1: I don't 100% remember whether I fixed it or not. In the back of my mind, I think there was an indexing error in this MR that I never fixed in that branch after fixing it in my main one. As it happens, I'm tweaking the IE613 version of cdmt at the moment, so I have it all pulled down. I'll run a diff to see if I can spot the error, and submit an updated MR with the fix in place if that was the case. Not sure if it'll be tonight, but I can probably do it tomorrow evening AU time.
No worries. I'm trying to make some changes to the code to write out 32-bit floats and to be able to select a time range (your #2), but even running it with the latest CUDA version only yields zeros, so I was hopeful your first MR might fix it. It doesn't.
That functionality is in https://github.com/David-McKenna/cdmt/blob/master/cdmt_udp.cu
Hey Cees,
I decided to port this one back while I was doing the cuFFT one. It re-uses data to pad the FFT input with real samples where zeros were previously used.
To make this work, fewer samples are processed on the first iteration so that the end of the first buffer can be filled; after that, it just recycles the data already in the buffer from the current iteration to prepare cp1p/cp2p for the next iteration.
So the overall data structure looks like this:
t=N: overlap | processed data | overlap
t=0: <overlap_0 = 0> | <noverlap = reflected data><data_0> | <overlap_1 = data>
t=1: <data_0 overlap> | <overlap_1><data_1> | <overlap_2>
t=2: <data_1 overlap> | <overlap_2>
... etc.
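As a rough illustration of the recycling step in the layout above, here is a host-side C sketch (not the actual kernel). It assumes a flat float buffer of nbuf samples per block; `carry_overlap` and `fresh_samples` are made-up names for this example and do not exist in cdmt:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of cdmt): move the final 2 * noverlap
 * samples of a block buffer to its start, so the next iteration begins
 * with <tail of previous data><previous trailing overlap>. */
void carry_overlap(float *buf, size_t nbuf, size_t noverlap)
{
    size_t ncarry = 2 * noverlap;
    assert(buf != NULL && nbuf >= ncarry);
    /* memmove is used because source and destination regions
     * overlap whenever nbuf < 2 * ncarry. */
    memmove(buf, buf + (nbuf - ncarry), ncarry * sizeof(float));
}

/* Number of new samples that must be read to refill the rest of the
 * buffer after the carried-over samples are in place (assumed layout). */
size_t fresh_samples(size_t nbuf, size_t noverlap)
{
    assert(nbuf >= 2 * noverlap);
    return nbuf - 2 * noverlap;
}
```

With a 10-sample buffer and noverlap = 2, the last four samples end up at the front and six new samples are needed to fill the remainder.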
I'm making a note of this here, as it took me a couple of tries to get the indexing right: on the first iteration, I discard / offset the output by 2 * noverlap samples, as we effectively lose noverlap samples at each end of the data.
At the start, we offset the starting point in the array because there is insufficient data, losing noverlap samples; at the end, the overlap itself costs another noverlap samples.
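To make that bookkeeping concrete, a small C sketch of the sample accounting follows. It assumes every block nominally yields nsamp output samples and that only the first block is shortened; the function names are hypothetical and the real code may organise this differently:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting (an assumption, not cdmt's actual code):
 * the first block loses noverlap samples at each end, later blocks
 * produce the full nsamp samples. */
size_t block_output_samples(size_t nsamp, size_t noverlap, int iblock)
{
    assert(nsamp >= 2 * noverlap);
    return (iblock == 0) ? nsamp - 2 * noverlap : nsamp;
}

/* Total number of output samples written after nblocks iterations. */
size_t total_output_samples(size_t nsamp, size_t noverlap, int nblocks)
{
    size_t total = 0;
    for (int iblock = 0; iblock < nblocks; iblock++)
        total += block_output_samples(nsamp, noverlap, iblock);
    return total;
}
```

For example, with nsamp = 1000 and noverlap = 100, the first block contributes 800 samples and each later block contributes 1000.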
Overall the implementation is stable judging by my outputs, but I suspect it could be made more efficient by tweaking the block/grid sizes for padd_next_iteration (which only needs to iterate over the first 2 * noverlap samples) and the new unpack_and_padd (which can skip the first 2 * noverlap samples). With my layout it's hard to judge what kind of performance effect this would have on your setup, though (I reduce nforward from 100 to 8 and increase nsub to 488).
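A sketch of the kind of sizing I have in mind, written as plain host-side C. The helper names and the one-dimensional layout are assumptions for illustration only; the real kernels use their own grid dimensions:

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division: thread blocks needed to cover nitems elements
 * when each block runs nthreads threads. */
size_t blocks_needed(size_t nitems, size_t nthreads)
{
    assert(nthreads > 0);
    return (nitems + nthreads - 1) / nthreads;
}

/* Hypothetical sizing for a kernel that only copies the first
 * 2 * noverlap samples (the padd_next_iteration case). */
size_t overlap_copy_blocks(size_t noverlap, size_t nthreads)
{
    return blocks_needed(2 * noverlap, nthreads);
}

/* Hypothetical sizing for a kernel that skips the first
 * 2 * noverlap samples (the unpack_and_padd case). */
size_t unpack_blocks(size_t nsamp, size_t noverlap, size_t nthreads)
{
    assert(nsamp >= 2 * noverlap);
    return blocks_needed(nsamp - 2 * noverlap, nthreads);
}
```

With noverlap = 1024 and 256 threads per block, the overlap copy would need 8 blocks along the sample axis instead of a grid sized for the whole buffer.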
Cheers