Skip to content

recalibrate timestamps when drift is dectected#818

Open
rjodinchr wants to merge 1 commit intokpet:mainfrom
rjodinchr:pr/timestamps
Open

recalibrate timestamps when drift is dectected#818
rjodinchr wants to merge 1 commit intokpet:mainfrom
rjodinchr:pr/timestamps

Conversation

@rjodinchr
Copy link
Copy Markdown
Contributor

Because of device clock drift, we can end up with uncoherent submit/start/end timestamps.
When such case is dectected, recalibrate start/end time using the device timestamps.

For start time, consider that end time is correct and put start time before end time of the same offset we have seen on the device domain. Make sure that submit is still happening before start time.

For end time, we can only rely on submit time.

Because of device clock drift, we can end up with uncoherent
submit/start/end timestamps.
When such case is dectected, recalibrate start/end time using the
device timestamps.

For start time, consider that end time is correct and put start time
before end time of the same offset we have seen on the device domain.
Make sure that submit is still happening before start time.

For end time, we can only rely on submit time.
@rjodinchr
Copy link
Copy Markdown
Contributor Author

We might want to put that behind an option?

"due to a drift in the device timer. Recalibrating the host "
"end time using the device timestamps.",
end_time, submit_time);
return submit_time + (end_dev_time - start_dev_time);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't include the time in between submit_time and start_dev_time. If the GPU is backed up due to a lot of GPU work, I think that could be a non-neglible amount of time

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We re-calibrate end _time when it is below submit_time. In that case we have no way to know how long time elapsed between the submit_time and start_time. The only thing available to us is the runtime on the device domain. So we are just using that to put end_time no too far in the future (compared to where we wanted to put it with the translation), and then we have no other choice than putting start_time equal to submit_time.

start_time, submit_time, end_time);
start_time = end_time - (end_dev_time - start_dev_time);
// Make sure submit time happens before start time.
return std::max(start_time, submit_time);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong but I'm not sure that start_time is actually being updated above since start_time and end_time would probably be based on the same m_sync_host and m_sync_dev.

If start_time is not being updated, then we're fixing the problem by setting start_time to submit_time. If start_time is getting updating, at least some of the time we're still doing that. Needing to do it this way makes this approach seem a bit like hiding the problem rather than fixing it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needing to do it this way makes this approach seem a bit like hiding the problem rather than fixing it.

The goal here is not to find a way to get the correct timestamp. The goal is to expose coherent timestamps even when the device clock has drifted.
When the clock drifts the information that PR is keeping is the elapsed time in the clock domain, as long as we can, considering that the end_time is about right.

We could try to do much better, I tried different approaches keeping a queue of a pair of timestamp (device&host), but it has proven to always keep a little bit of flakiness. I think it could work if we had a side thread taking regular data points to ensure we are not missing any drift. But even like that, we need to consider how many timestamps you want to keep, we cannot just store forever, so you can still have a usecase where you do not have any good data points in the queue leading to inaccuracy.

I think what's important as far as profiling is concerned is to do the minimum when a drift is detected to return something coherent (submit < start < end).

@renatopereyra
Copy link
Copy Markdown

I thought of another idea for recalibrating timestamps.

We could add another sync timestamps (let's call it m_sync_clvk for now) to align with m_sync_dev and m_sync_host. We could set it by averaging the result of sample_clock before and after calling vkGetCalibratedTimestampsEXT. The idea would be to use this m_sync_clvk to convert between the sample_clock domain and the domain of the host timestamp returned by the Vulkan driver. Empirically, sample_clock differed from the host timestamp returned by vkGetCalibratedTimestampsEXT by more than 1 ms in some of my testing.

The idea would be that all the timestamps belonging to an event would key off of a single vkGetCalibratedTimestampsEXT sync and they would be recalibrated based on that sync point (either dev->host or clvk->host). The m_sync_* could be stored within the individual cvk_event as an add on.

Don't have to do it but thought I'd share.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants