demux/packet_pool: add fast quit mode #12566
kasper93 wants to merge 1 commit into mpv-player:master
Conversation
I'm not a big fan of doing this, this isn't really a problem outside of stupid settings like having a bigger demuxer cache than physical memory available. This will also ignore all leaks in demuxer.c, and in general just doesn't feel like the right thing to do. mpv closes fast enough when it's not swapped already, even with 12GiB demuxer cache on my machine.
You'd have to differentiate between libmpv and the "normal player" at the least.
It seems to be platform related; deallocating the cache on Windows is significantly slower than on Linux, even 8GB takes about 10 seconds on my machine.
Allocating hundreds of thousands of small packets is bad enough, and iterating over a linked list just to deallocate them for nothing is just unneeded. Not doing that is just the pragmatic thing to do. Also the overhead is clearly visible in a profile, whether it is Linux or Windows, while on Windows the overhead of the actual memory free is quite huge in comparison. It doesn't matter, because even when not deallocating, traversing the whole list is significant. And the swapped memory is only one factor that exposes the completely unneeded clearing that we do. Whether it is 150MB, 1GB or 100GB, touching all this memory on exit is just unneeded overhead.
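A minimal sketch (illustrative only, not mpv's actual code) of the teardown cost described above: every cached packet sits behind a pointer that must be chased and freed individually, while a fast-quit path simply abandons the list and lets the kernel reclaim the pages at process exit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

// Stand-in for a small cached demuxer packet.
struct node {
    char payload[64];
    struct node *next;
};

// Build a singly linked list of n nodes (prepending; order is irrelevant here).
static void build_list(struct node **head, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct node *p = calloc(1, sizeof(*p));
        p->next = *head;
        *head = p;
    }
}

// Tear down the list; returns how many nodes were actually visited.
// With fast_quit the list is abandoned: zero pointer hops, zero free()
// calls, and no swapped-out pages are touched.
static size_t destroy_list(struct node *head, bool fast_quit)
{
    if (fast_quit)
        return 0;
    size_t visited = 0;
    while (head) {
        struct node *next = head->next;
        free(head);
        head = next;
        visited++;
    }
    return visited;
}
```

With hundreds of thousands of entries, the non-fast path performs that many pointer dereferences and allocator calls; the fast path is O(1).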
I expected libmpv doesn't use
Force-pushed from a78808e to 1ac142c
Added a flag now that is set only from the player, and can possibly be set by libmpv users if they want it.
This also breaks
What about making this user-configurable for now and not default?
Not directly related to this, but is mpv's leak report even useful at all anymore? I just use ASan since it's just plain better.
No. Internal implementation details should not be exposed to users.
I know, I was hoping no one is using this anymore, oh well.
I'll probably remove it in a PR later. It was probably not commonly used before and is definitely pretty obsolete now.
Ideally this should be done everywhere, but this is a good start. Are there any other low-hanging fruits? (I.e. where cleanup is noticeably slow but isn't needed.)
This is a change that's strictly better. There's no reason (that I'm aware of) why anyone would want to disable it.
Hopping lots of pointers to do basically nothing worthwhile is wasteful on any platform.
Another way of looking at it is that it's
Yes, that's why I now added
There is no way any other part of the player is slow on cleanup. And this particular section is only slow in extreme cases where you have a massive cache. And even then, with massive caches, it seems not really reliably reproducible (at least with my naive attempts there were no hangups). Considering that mpv is also a library, we should still be cleaning up properly basically all the time. As for this specific change, I kind of personally don't like it because it feels like a sledgehammer for an edge case. But I dunno how everyone else feels.
I was personally triggered when we started suggesting how to set up a keybind to SIGKILL mpv... because we cannot close in reasonable time ourselves. I'm not saying this is an elegant solution, but it is a solution that works and, as far as I can tell, will not have any side effects.
Force-pushed from 7f68997 to 586ebd2
We did it for a user who explicitly asked about it, my recommendation to that user would always be to use a sane demuxer cache value or buy more RAM.
I initially suggested this in here but forgot about it: instead of an LSan-specific suppression, you can just avoid setting … This is arguably better since it will always exercise the cleanup code under debug builds since...
... but still avoid doing unnecessary work on release builds. (You can probably avoid breaking
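A sketch of that idea (the function name is illustrative, not mpv's actual code): tie the fast-quit decision to the build type, so sanitizer/debug builds keep exercising the full cleanup path while release builds skip it.

```c
#include <stdbool.h>

// Decide fast-quit from the build type: release builds (NDEBUG defined)
// skip teardown and let the OS reclaim memory; debug builds free
// everything so LSan/valgrind see no reachable allocations.
// (want_fast_quit is a hypothetical name for illustration.)
static bool want_fast_quit(void)
{
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}
```

This keeps the cleanup code covered wherever leak checkers actually run, with zero runtime cost in release builds.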
Mostly bikeshedding but I feel quite strongly about the following comment, so I'll leave a response:
This might be appropriate if the software actually needs/uses additional memory to do useful things, but in this case it's doing what the OS would've (more efficiently) done anyway, except significantly slower and wasting more resources (reading from disk if the memory was swapped, evicting other useful stuff from shared L2 & L3 cache, etc.).
This comment was marked as off-topic.
Proposed fix: instead of "shutdown_dealloc", change to "shutdown_fast", so that the "default value" (i.e.
Will merge this "memory leak" if there are no objections.
I am still not a fan of this approach.
IMO this should be an optional argument of the … The argument approach also allows more aggressive fast quit modes, for example by calling
Are you debugging memory leaks without symbols? Honestly, I called this a memory leak as a joke, because it's not a leak. It's reachable memory that is purposefully not cleaned up, because there is no technical reason to do so.
No, we want to save watch-later state and close scripts gracefully. Allow threads to join, where dependent threads may want to do something like saving cache or anything. The OS cleans up after us regardless; you don't have to call
As I understand it, the goal is just to make the perceived mpv exit faster; then, as I suggested before, it'd be better to destroy the VO window first thing. I understand this could be a big change, but if we want to do it then that'd be the best way. As an aside, in at least one of the issues, one potential cause of slow exit could be that we do
Those are all separate improvements that can be done. This PR focuses on the significant delay caused by traversing a linked list of hundreds of thousands of packets. I do not claim to fix other issues. We can block this PR forever on an "I don't like it" basis, but do we have any technical or other issues with this approach? Except maybe debugging mpv without symbols... with leak checkers that don't recognize reachable memory... which is no issue really.
Requiring exceptions is already bad enough, especially when mpv is the only "bad kid on the block". Because currently mpv cleans up memory completely, automated memory leak tests can check mpv at strict settings without needing to add exceptions. For example, valgrind is strict by default, and it will only report no leak if all memory is freed, even reachable memory. As you mentioned, the benefit of this PR is situational, for users with large demux cache settings. By not making this behavior the default and making it opt-in, users with such cache settings can opt into the fast quitting behavior. Can you demonstrate any significant measurable benefit for users with default cache settings and with enough memory to not swap significantly? If not, then this should not be enabled by default.
If the argument is that defaults should be set based on the majority of users (a valid argument), then...
...this is inconsistent. Memory debugging is NOT a common user scenario at all.
Let's assume it only matters for a small portion of the users; but are the rest of them being harmed by this change? No. So there's no downside for any user whatsoever if this behavior is default, while having upside for some. The only downside would be for devs trying to debug memory, which again is not the common use case, so it shouldn't dictate the default according to your own argument.
No, this was never the implication. Cleaning up all memory is good programming practice and hygiene, and not cleaning it up needs a good reason. You present it as if not cleaning up is as good as cleaning up in the absence of performance issues, which is not the case. The rest of your argument builds on this assumption, so it is moot.
This again just boils down to "I don't like it". It's not backed up by a technical reason. In fact, there's a technical reason NOT to do needless cleanup ourselves when the OS will do it anyway, and much faster, since it doesn't have to hop individual pointers.
If pointer chasing is the bottleneck in clearing the demux pool, then we could consider a different data structure.
This is simply not true. Even in the current situation with talloc, the practice is already lenient enough that without carefully manually freeing memory at the right place, unbounded memory growth can occur even though it is not technically leaked. The PR would also not help any libmpv clients at all, since it is disabled there, and a library cannot afford leaking. So it is more like a partial workaround.
Agreed. If they are allocated into a few large chunks of contiguous memory, freeing will not be a bottleneck. This is a better solution than this PR.
Holding heavy resources that are not needed is obviously bad. But no one is advocating for that, and I fail to see how that's relevant to this PR.
It's been a while since I've looked at this, but I think clients can set
Again, it's been a while so I might remember wrong, but using a better allocation pattern/data structure was suggested; we don't control the data structures/allocation functions though, ffmpeg does. And every packet needs to be allocated and freed individually.
Managing and keeping track of object lifetimes is good practice, and we do that. "Good reason" is arguable here.
Such as? If that were a possibility I would do it. See, we only need the linked list to track allocated packets; currently we pool them, access from the front, and append to the end. At no point is there a need to iterate or access other packets, except when we are clearing them at the end, which I argue is pointless work. The only way to work around that would be to clear all packets at once, which we cannot do: whatever structure you use to store pointers to AVPackets, you still need to iterate over this million-entry storage and call free individually. And sure, some packets may be backed by more resources that we can avoid clearing manually.
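For illustration, "clearing all packets at once" is exactly what an arena/bump allocator would buy, if the allocations were under our control. This is a purely hypothetical sketch, since ffmpeg's packet API doesn't accept external allocators:

```c
#include <stddef.h>
#include <stdlib.h>

// Hypothetical bump/arena allocator: objects are carved out of one big
// block, so teardown is a single free() instead of one per object.
struct arena {
    char *base;
    size_t used, cap;
};

static struct arena arena_new(size_t cap)
{
    struct arena a = { malloc(cap), 0, cap };
    return a;
}

// O(1) allocation with no per-object bookkeeping to unwind later.
static void *arena_alloc(struct arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15; // keep 16-byte alignment
    if (!a->base || a->used + n > a->cap)
        return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

// Releases every object at once; no iteration over individual packets.
static void arena_destroy(struct arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

The catch, as the thread notes, is that AVPackets and their child allocations come from ffmpeg's own allocator, so nothing like this can be applied from mpv's side.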
I've reproduced the issue on Linux primarily. The fact that mpv's biggest user base is on Windows causes a bias towards reporting issues on Windows. Also, Linux is likely faster, but not immune to the issue, especially if memory is swapped.
It doesn't harm them either. If any user of libmpv wants such a feature, they can request or add an API to set this flag.
Exactly, this is a recurring pattern. There is "a change" that improves A and doesn't degrade anything else, but people come to review it on an "I don't like A" basis and try to move the bar to some imaginary scenarios that effectively stall "a change". I don't claim this PR fixes all mpv slowdowns, I don't claim it fixes libmpv. I fix one specific scenario, which, to the best of my knowledge, doesn't have negative effects. I agree that debugging/leak checking is affected, but it is trivial to adjust this as needed.
How would you achieve that? Might I remind you: AVPackets themselves also have child allocations, so you still have to clear them all individually.
There is no public API currently, which is out of scope of this PR, first we have to agree on the change itself.
Correct.
It's definitely not just Windows, I can confirm it by regularly using 8-32 GiB cache sizes for scrubbing through large videos without wanting to be bottlenecked by HDD performance.
There are definitely other improvements which could be done, such as:
Generally I'm a fan of the work done here, even if people are afraid of using some of the capabilities of the OS. But even as-is, this is useful work for use cases with large caches, and I also believe that shortcomings for the "I don't like it" "use cases" weren't actually proven to even exist.
@voidpointertonull already said it but: something like

```c
struct chunk {
    T content[512]; // T = AVPacket*
    struct chunk *next;
};
```
The reason I said this is because I remember the Windows CRT malloc known to be slow.
Really I don't see why ffmpeg couldn't add a function like:

```c
/// Initializes an AVPacket in-place in the buffer given by the caller.
/// @param buf buffer (must be aligned to XXX)
/// @param buflen number of available bytes in buffer
/// @param consumed number of bytes consumed by the AVPacket will be written here
/// @return 0 if success: the AVPacket will live at the beginning of the buffer
int av_packet_alloc_inplace(char *buf, size_t buflen, size_t *consumed);
```
We don't control where the allocation is placed, and it may be scattered all over the address space.
I don't mind this. But you still need to needlessly free on all the packets (and their nested allocations) individually.
This might be faster than doing it single-threaded, but I'm going to bet that it won't be faster than the OS. So we'd be doing more work, using more resources, just to get worse results than what doing no work would give.
Agreed. There's no need for us to step in and do a shoddy job when the OS would do a much better job anyways. And I just don't buy the argument that it's "best practice" to waste resources doing absolutely nothing worthwhile.
This is interesting. If we can place all the packets in one (or a few) large array(s) and call free only a couple of times on the few arrays rather than on individual packets, then that's not so bad. But @kasper93 said earlier that packets also have nested allocations, so you'd need to iterate and free individually anyways.
They can and do in practice. If we want to fix this in a different way we need to know what of...
...is the bottleneck and how to best address it. |
Something like that, although packets don't seem to be particularly fat, so I thought of something more like this: … This would reduce the pressure on the memory allocator and cut down on indirection; there's enough of that inside demux_packet anyway. Oh, just noticed … Using boundary indices is not necessarily the best if packets are often removed from the middle, but if that's a rare occasion, then such an operation could split the chunk into 2 chunks; they don't have to be full. Still, I'm not joining the bikeshedding committee, these improvements can be considered elsewhere.
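A rough sketch of that chunked layout with boundary indices (illustrative only; `void *` stands in for demux_packet pointers, and nothing here is being proposed for this PR):

```c
#include <stdlib.h>

#define CHUNK_SIZE 4 // tiny for illustration; would be much larger in practice

// Packets live in fixed-size arrays; the live range is slots[head..tail).
struct chunk {
    void *slots[CHUNK_SIZE];
    int head, tail;
    struct chunk *next;
};

struct pool {
    struct chunk *first, *last;
};

// Append at the back: O(1), allocating a new chunk only when full.
static void pool_push(struct pool *p, void *item)
{
    if (!p->last || p->last->tail == CHUNK_SIZE) {
        struct chunk *c = calloc(1, sizeof(*c));
        if (p->last)
            p->last->next = c;
        else
            p->first = c;
        p->last = c;
    }
    p->last->slots[p->last->tail++] = item;
}

// Pop from the front: O(1), freeing a whole chunk once it drains, so
// pool bookkeeping scales with chunks rather than individual packets.
static void *pool_pop(struct pool *p)
{
    struct chunk *c = p->first;
    if (!c || c->head == c->tail)
        return NULL;
    void *item = c->slots[c->head++];
    if (c->head == c->tail) {
        p->first = c->next;
        if (!p->first)
            p->last = NULL;
        free(c);
    }
    return item;
}
```

Note this only amortizes the chunk bookkeeping; as pointed out elsewhere in the thread, the AVPackets referenced from the slots would still need to be freed individually.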
Definitely didn't go deep enough to confidently say it is easy to apply in this case, I just know that the performance gains are typically quite neat, so could be worth a try somewhere.
Technically it can be faster, at least I'm guessing that process cleanup is single-threaded in the kernel, and I've seen some kernel work on page table lock improvements, but the issue is mostly with the more resource usage problem you mentioned. I don't think it's worth it in this case, just mentioned as one potential alternative as this PR seems to be quite stuck, and I've been just looking forward to any kind of improvement.
Apparently this intentionally doesn't exist, as AVPacket* is desired to be an opaque pointer: "sizeof(AVPacket) being a part of the public ABI is deprecated". I believe this is rather silly, but then this is yet another reason why the current PR is fine as-is; chasing perfection would just leave the known-good improvement wasting away here for more years.
A new field, MPContext::quit_fast, has been introduced. Setting this field to true bypasses the time-consuming packet deallocation process. This is particularly useful when exiting the process, as there's no need to iterate over and free all packets. The cache, stored as a linked list with potentially millions of small packets, requires significant time to iterate through and deallocate, especially if the memory is swapped out. This change significantly speeds up player closure, especially when a large demuxer cache is used.

Fixes: mpv-player#12294
Fixes: mpv-player#12563
All of the above. Many small allocations invite fragmentation, which was already an issue on glibc (#12076), but that's not what we are fixing here anyway. Now, any access to … Even if we patch ffmpeg to allow allocating … To summarize, I don't see the solution to those problems in changing the allocation scheme, when we do not even need to access those packets in most cases. The only reliable solution would be a custom memory allocator that groups all those allocations into bigger chunks, to allow effectively page-faulting this memory in when needed. Though ffmpeg doesn't allow any external allocators, and like I said, there is no reason to complicate this.
Pushed.
To be clear, it's obvious that "not freeing" will always be the most efficient way to quit an mpv process. Allowing custom allocators for packets and their data on the ffmpeg side is still a worthwhile goal IMO, to reduce fragmentation at runtime/in general. My personal opinion is not that we shouldn't merge this PR, but it would be nice if we could get ffmpeg to consider the alternatives.
Packets are not cleared during runtime already, they are returned to the packet pool where they wait to be reused.
It's still not clear to me what the alternatives are. Especially on the ffmpeg side, any changes in this area would require huge refactoring, and it's not even clear to me how we could make this work differently in a way that would resolve the issue. The only feasible change would be to implement a custom allocator and use it throughout ffmpeg; this would allow us to manage the memory better, but that is a big project in itself.
The cache is stored as a linked list, containing possibly hundreds of thousands of small cached packets. Iterating over all of them and deallocating takes a significant amount of time, especially if this memory gets swapped out.
When exiting the player, there is no need to clear all of this, let the kernel reclaim the memory after the program exits.
This change significantly speeds up the closing time when a large demuxer cache is used.
Fixes: #12294, #12563