recalibrate timestamps when drift is detected #818
Conversation
Because of device clock drift, we can end up with incoherent submit/start/end timestamps. When such a case is detected, recalibrate the start/end time using the device timestamps. For the start time, consider that the end time is correct and place the start time before the end time by the same offset we have observed in the device domain; make sure that submit still happens before the start time. For the end time, we can only rely on the submit time.
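The rules described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Timestamps` struct and `recalibrate` function are hypothetical names, and the real implementation works on translated host-domain values inside the profiler.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the recalibration rules described above.
struct Timestamps {
    uint64_t submit_time;     // host domain
    uint64_t start_time;      // host domain (translated from device)
    uint64_t end_time;        // host domain (translated from device)
    uint64_t start_dev_time;  // device domain
    uint64_t end_dev_time;    // device domain
};

void recalibrate(Timestamps& t) {
    // Elapsed time in the device clock domain: the one quantity we trust.
    const uint64_t dev_elapsed = t.end_dev_time - t.start_dev_time;
    if (t.end_time < t.submit_time) {
        // End time drifted to before submit: only submit_time is reliable,
        // so anchor end_time on it using the device-domain elapsed time,
        // and start_time can only be set equal to submit_time.
        t.end_time = t.submit_time + dev_elapsed;
        t.start_time = t.submit_time;
    } else if (t.start_time < t.submit_time) {
        // Start time drifted to before submit: treat end_time as correct,
        // back off by the device-domain elapsed time, then clamp so that
        // submit still happens before start.
        t.start_time = std::max(t.end_time - dev_elapsed, t.submit_time);
    }
}
```

After either branch the invariant `submit_time <= start_time <= end_time` holds, which is the coherence goal of the PR.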
We might want to put that behind an option?
| "due to a drift in the device timer. Recalibrating the host " | ||
| "end time using the device timestamps.", | ||
| end_time, submit_time); | ||
| return submit_time + (end_dev_time - start_dev_time); |
This doesn't include the time between submit_time and start_dev_time. If the GPU is backed up due to a lot of queued GPU work, I think that could be a non-negligible amount of time.
We re-calibrate end_time when it is below submit_time. In that case we have no way to know how much time elapsed between submit_time and start_time. The only thing available to us is the runtime in the device domain, so we use that to place end_time not too far in the future (compared to where the translation would have put it), and then we have no choice but to set start_time equal to submit_time.
```cpp
    start_time, submit_time, end_time);
start_time = end_time - (end_dev_time - start_dev_time);
// Make sure submit time happens before start time.
return std::max(start_time, submit_time);
```
I could be wrong but I'm not sure that start_time is actually being updated above since start_time and end_time would probably be based on the same m_sync_host and m_sync_dev.
If start_time is not being updated, then we're fixing the problem by setting start_time to submit_time. If start_time is getting updated, at least some of the time we're still doing that. Needing to do it this way makes this approach seem a bit like hiding the problem rather than fixing it.
> Needing to do it this way makes this approach seem a bit like hiding the problem rather than fixing it.
The goal here is not to find a way to get the correct timestamp. The goal is to expose coherent timestamps even when the device clock has drifted.
When the clock drifts, the information this PR preserves is the elapsed time in the device clock domain, for as long as we can, on the assumption that end_time is about right.
We could try to do much better. I tried different approaches keeping a queue of (device, host) timestamp pairs, but it always kept a little bit of flakiness. I think it could work if we had a side thread taking regular data points to ensure we are not missing any drift. But even then, we need to consider how many timestamps to keep; we cannot store them forever, so you can still hit a use case where no good data point is left in the queue, leading to inaccuracy.
I think what's important as far as profiling is concerned is to do the minimum when a drift is detected to return something coherent (submit < start < end).
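For context, the queue-of-sync-points approach mentioned above could look roughly like this. This is a sketch of the idea only, under assumed names (`SyncPoint`, `SyncHistory`); the experiments described are not part of this PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical sketch of the rejected approach: keep a bounded history of
// (device, host) sync points and translate device timestamps using the
// latest sync point at or before them.
struct SyncPoint {
    uint64_t dev;   // device-domain timestamp
    uint64_t host;  // host-domain timestamp
};

class SyncHistory {
public:
    explicit SyncHistory(std::size_t capacity) : m_capacity(capacity) {}

    void record(uint64_t dev, uint64_t host) {
        // Bounded storage: old points get dropped, which is exactly why a
        // translation request can end up with no good data point left.
        if (m_points.size() == m_capacity)
            m_points.pop_front();
        m_points.push_back({dev, host});
    }

    // Translate a device timestamp using the latest sync point at or before
    // it. Assumes at least one point was recorded and dev_time is not older
    // than the oldest retained point; otherwise the result is extrapolated
    // and may be inaccurate.
    uint64_t to_host(uint64_t dev_time) const {
        const SyncPoint* best = &m_points.front();
        for (const SyncPoint& p : m_points)
            if (p.dev <= dev_time) best = &p;
        return best->host + (dev_time - best->dev);
    }

private:
    std::size_t m_capacity;
    std::deque<SyncPoint> m_points;
};
```

The flakiness comes from choosing the capacity and sampling cadence: too few or too infrequent sync points and a query lands far from any recorded pair, reintroducing the drift error.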
I thought of another idea for recalibrating timestamps. We could add another sync timestamp (let's call it …). The idea would be that all the timestamps belonging to an event would key off of a single …. Don't have to do it but thought I'd share.