8312182: THPs cause huge RSS due to thread start timing issue#3223
8312182: THPs cause huge RSS due to thread start timing issue#3223tstuefe wants to merge 11 commits into
Conversation
…uld fail on machine with large number of cores Reviewed-by: shade, stuefe (cherry picked from commit 733250288325bc663afc0376342d4c5a7a471cbd)
|
👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@tstuefe The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated. |
|
@tstuefe |
|
@tstuefe |
|
@tstuefe Thank you for working on the patch. Here are the details of the reproduction I observed on RHEL 9: OS: RHEL 9.6 khugepaged settings: Reproducer program: Test.java prints When I ran Test.java in the above environment, I got the following result: After changing the THP configuration from On my RHEL 9 environment, I observed a significant difference in RSS depending on the THP configuration. |
Webrevs
|
|
MacOS GHA errors obviously unrelated. |
...
Hmm, it could be that the kernel version delta between 9.6 and 9.8 (what I use) could be the explanation for why I did not observe the issue on RHEL 9.8. In any case, it makes sense to fix this in JDK 11 and 8. |
|
If you change the PR title to |
gnu-andrew
left a comment
There was a problem hiding this comment.
This mostly looks good and I much prefer this to introducing a bunch of additional infrastructure just for this change.
Just a few questions:
- The 17u version makes the
stack_size += os::vm_page_size();added by 8303215 conditional on the mitigation being on but I don't see that in this patch. Is that deliberate? - The 17u version turns the mitigation off when we are not in always mode using
FLAG_SET_ERGO(DisableTHPStackMitigation, true);while it's a simple assignment (THPStackMitigation = false`) in this version. Again, I just wondered why the change.
A few of the switch renaming changes (8312585) crept into the first commit but this was minor enough to not cause any issue when reviewing and it'll all be squashed together for the final commit anyway. Similarly, I wondered where the test changes were until I saw them in the third commit :)
|
/issue 8312585 |
|
This backport pull request has now been updated with issues from the original commit. |
|
@tstuefe |
|
Thanks @gnu-andrew for the review. Answers inline.
It was deliberate, and I reconsidered. It does not matter much either way. This was just to test the effectiveness of THP stack mitigation by switching it off completely, including the intra-stack-related fixes done previously by 8303215. Since it does no harm, I took over the condition and the newer comment (mostly to not confuse future code readers who assume the patch has been integrated). I kept the old logic inside the condition, though, with an added comment for JDK 11 and 8 explaining the difference.
My bad, I thought we don't have FLAG_SET_ERGO in jdk 11. I fixed it.
Sorry :-) I tested the new changes in RHEL 8 and 9, patch is still good (with me still being unable to reproduce much of the problem on RHEL 9, but I trust @tabata-d here, he wrote the bug also occurs there.) The MacOS / Windows GHA errors are annoying but unrelated since this patch only changes linux-specific code. |
|
@tstuefe Thank you for creating the patch. |
tabata-d
left a comment
There was a problem hiding this comment.
Using a build with this fix applied, I measured RSS in the same way in #3223 (comment). I confirmed that I do not see any abnormal RSS increasing on my REHL 9.6 environment.
The fix looks good to me overrall. I've left a few minor comments for your consideration.
| size_t guard_size = os::Linux::default_guard_size(thr_type); | ||
| // Configure glibc guard page. Must happen before calling |
There was a problem hiding this comment.
A line break is needed here to align with upstream.
There was a problem hiding this comment.
Sure, but lets only fix things the patch touches.
There was a problem hiding this comment.
For backports, I believe that even a single space or line break should be aligned with upstream as much as possible. Is there any reason not to do so here?
There was a problem hiding this comment.
Because this suggested linebreak was not part of the original patch and was not part of any hunks touched by it. When backporting, we try to avoid changes to code parts unrelated to the original patch.
If that is necessary/a useful goal, then a separate RFE should be done that does that (as in: "8u: align whitespace usage to upstream". And then, you'd have to trickle such a patch down the JDK versions first.
There was a problem hiding this comment.
I might be missing something, but it looks like the linebreak is included in the upstream patch.
openjdk/jdk@84b325b#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR918
Could you please confirm whether this understanding is correct?
There was a problem hiding this comment.
Hmm, you are right; I missed that. Okay then, I'll add the line break.
There was a problem hiding this comment.
Yes, I did spot that too but it didn't seem worth mentioning as HotSpot code is already pretty divergent between releases anyway, so it's unlikely patches apply cleanly. It actually worries me a little if they do. Had it been a JDK file that changes little, it would have been more of an issue.
Still, good to see it fixed.
Co-authored-by: Daishi Tabata <tabata.daishi@fujitsu.com>
Co-authored-by: Daishi Tabata <tabata.daishi@fujitsu.com>
|
Thanks for testing and review, @tabata-d . I applied most of your suggestions. |
I agree it doesn't make a big difference as it is code that is already in 11u. Still, nice to now have it in sync with later versions.
Yeah, my first thought was that it must not be in 11 and then I found it was :) Thanks for fixing.
No harm done. I just thought it worth mentioning for future backports that may be more complex.
Great. Changes look good to me so I think this is ready to go in.
I agree. Windows build should already be fixed by Severin's recent backport, so a merge would likely fix that one, but isn't needed with it only touching Linux code. |
gnu-andrew
left a comment
There was a problem hiding this comment.
Current version looks good to me. I'll let @jerboaa handle approval as a final thumbs up.
|
|
tabata-d
left a comment
There was a problem hiding this comment.
Thank you for addressing even my minor comments. Looks good to me.
I'm not a reviewer, so this doesn't have any procedual significance, but I will approve current changes anyway.
Thank you! We always take insightful reviews from non-reviewers into account. |
|
Please merge master to get rid of the Windows build failures in GHA. Also this Linux x64 failure seems related: https://github.com/tstuefe/jdk11u-dev/actions/runs/27748064701/job/82094979103#step:11:3014 Please fix. Thanks! |
This is a regression caused by one of the last minute fixes; there is no Defining |
|
Fixed and re-tested. Sorry for the trouble. Always be wary of last-minute changes :-/ |
gnu-andrew
left a comment
There was a problem hiding this comment.
I still think it would be better to add in a file variable to avoid the filename duplication and long lines as I suggested.
Oh, I missed your suggestion. Sure, will do it tomorrow. |
Thanks. Just bugs me a bit to see such a long path repeated. :) |
|
Please merge master branch. It gets rid of the Windows build failures in GHA. Thanks! |
|
I think this is good now. GHA checks were clean apart from an unrelated issue on MacOS/x86 which cannot have been caused by this PR. I will resync master periodically until there is time to review and approve it. @gnu-andrew @jerboaa if you want to wait until the July CPU is done this is okay, too. I would like to see this fixed soon, but OTOH the bug is decades old now, so it also could wait until the October release. |
|
@tstuefe Was closing the PR intentional? |
I thought we already had reviewed it? Other than the
July is already in rampdown so any changes to 11u-dev now would be for October anyway. |
Oops. The heat gets to my head. Definitely not. |
This is a streamlined version of the fix for JDK 11 (and eventually, 8).
Thanks to @tabata-d for doing preparatory analysis and proposing a first version of backport fixes; we ended up deciding against those in favor of a simpler, streamlined solution for JDK 11 and (later) JDK 8, but his analysis was valuable and made the problem apparent to us.
The problem occurs when multiple Java threads are created, and
khugepagedis scheduled between thread starts and Java stack guard page creation. Kernel minimizes VMA fragmentation, so there is a high chance that the stacks for these new threads are placed adjacent to each other. No guard pages have been established yet, so all these thread stacks look like one big memory area just ripe for THP coalescing. Of course, those THPs shatter a moment later when the Java guards are placed, but by that time the damage is done: now all pages in the new thread stacks are resident, and we have burned CPU time to make it so.The problem is clearly that khugepaged is too aggressive.
Newer kernels have toned down khugepaged's aggressiveness: it is now a lot less likely to coalesce THPs out of sparsely populated areas like thread stacks. So the problem is non-existent in newer kernels (>=6).
It still very much exists in RHEL 7 and 8. Allegedly also in RHEL 9 (see openjdk/jdk8u-dev#746 (comment)), though I was not able to reproduce it there.
Some numbers:
RHEL 8 x64: 5000 idle threads, time between thread creation and Java guard placement artificially increased with a delay of 100ms:
Fixing this in older JVMs is very important, since those are typically deployed on older Linux versions that are much more vulnerable to this problem.
The fix is very simple: if we feel this could be a problem, we enable glibc guard pages. On older glibc variants, this narrows the "dangerous" time window significantly; on newer glibc variants it avoids the problem completely (glibc got smarter, too).
The most vulnerable architectures are those with small THP sizes:
Note, though, that for s390, IBM/Red Hat strongly recommends switching off THPs, as they do not play well with Z's internal memory manager.
The following architectures are less affected because the THP size is large. Bloated RSS is still possible, but one needs a fair bit of bad luck, or an application with exceptionally large stack sizes:
The fix is unclean. In newer releases, the fix was implemented on top of a number of code improvements. I avoided pulling all of that stuff into JDK 11 and instead rewrote parts of the patch.
Notable changes to the upstream version:
// In addition to the glibc guard page ...) - that part deals with a rare corner case (e.g. 1GB large page size) and is not important enough to downport.HugePageshelper Java framework into JDK 11.I also integrated two small follow-ups into this patch:
(https://bugs.openjdk.org/browse/JDK-8312585)
(https://bugs.openjdk.org/browse/JDK-8314139)
since these would be needed anyway.
Testing:
Tested on RHEL 8 and 9. With the fix, on RHEL 8, RSS in my example program goes down from 6.54 GB to 0.36GB.
I tested the new regression test on RHEL 8 and RHEL 9.
GHAs are fine apart from an unrelated infra problem on mac aarch64.
Hotspot tier1 and 2 tests on Fyre are running.
Risk:
With the patch, we now allocate glibc guards on the stack. So there is a theoretical increase of virtual size (VSS) by one system page (4K or 16K or 64K depending on platform) per thread. But no RSS increase: guard pages are not resident, so they don't count toward RSS.
In practice, I found the effect to be too small to measure on 4K-paged x64, even with 5000 threads (which would cause a 20KB increase in VSS). It went lost in normal random VSS fluctuations.
Theoretically, on larger-paged architectures like ppc64, the increase could cause the JVM to run against an
ulimit -vlimit set by an administrator. I just mentioned the risk for completeness sake. It is extremely unlikely given that java VSS is gargantuan and somewhat unpredictable, so those few KB more are lost in the random noise./issue 8312182
/issue 8312585
/issue 8314139
Progress
Issues
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk11u-dev.git pull/3223/head:pull/3223$ git checkout pull/3223Update a local copy of the PR:
$ git checkout pull/3223$ git pull https://git.openjdk.org/jdk11u-dev.git pull/3223/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 3223View PR using the GUI difftool:
$ git pr show -t 3223Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk11u-dev/pull/3223.diff
Using Webrev
Link to Webrev Comment