Optimize `IntRange::from_pat`, then shrink `ParamEnv` by ecstatic-morse · Pull Request #77257 · rust-lang/rust

ecstatic-morse · 2020-09-27T04:23:39Z

Resolves #77058.

Looking at the output of perf report for #76244, the hot instructions seemed to be around the call to pat_constructor in IntRange::from_pat. I carried out an obvious optimization, but it actually made the instruction count higher (see #77075). However, it seems to have mitigated whatever was causing the pipeline stalls, so when combined with #76244, it's a net win.

As you can see below, the regression in #76244 seems to have originated from something measured by stalled-cycles-backend. I'll try to collect some finer-grained stats to see if I can isolate it. I wish I had a better idea of what was going on here. I'd like to prevent the regression from reappearing in the future due to small changes in unrelated code.

Current `master`:

 Performance counter stats for 'cargo +baseline-stage1 check':

          2,275.67 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            49,826      page-faults:u             #    0.022 M/sec                  
     5,117,221,678      cycles:u                  #    2.249 GHz                    
       299,655,943      stalled-cycles-frontend:u #    5.86% frontend cycles idle   
     2,284,213,395      stalled-cycles-backend:u  #   44.64% backend cycles idle    
     8,051,871,959      instructions:u            #    1.57  insn per cycle         
                                                  #    0.28  stalled cycles per insn
     1,359,589,402      branches:u                #  597.447 M/sec                  
         7,359,347      branch-misses:u           #    0.54% of all branches        

       2.281030026 seconds time elapsed

       2.108197000 seconds user
       0.164183000 seconds sys

Shrink `ParamEnv` without changing `IntRange::from_pat`:

 Performance counter stats for 'cargo +perf-stage1 check':

          2,751.79 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            50,103      page-faults:u             #    0.018 M/sec                  
     6,260,590,019      cycles:u                  #    2.275 GHz                    
       317,355,920      stalled-cycles-frontend:u #    5.07% frontend cycles idle   
     3,397,743,582      stalled-cycles-backend:u  #   54.27% backend cycles idle    
     8,276,224,367      instructions:u            #    1.32  insn per cycle         
                                                  #    0.41  stalled cycles per insn
     1,370,453,386      branches:u                #  498.023 M/sec                  
         7,281,031      branch-misses:u           #    0.53% of all branches        

       2.763265838 seconds time elapsed

       2.544578000 seconds user
       0.204548000 seconds sys

Shrink `ParamEnv` and change `IntRange::from_pat`:

 Performance counter stats for 'cargo +perf-stage1 check':

          2,295.57 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            49,959      page-faults:u             #    0.022 M/sec                  
     5,151,407,066      cycles:u                  #    2.244 GHz                    
       324,517,829      stalled-cycles-frontend:u #    6.30% frontend cycles idle   
     2,301,671,001      stalled-cycles-backend:u  #   44.68% backend cycles idle    
     8,130,868,329      instructions:u            #    1.58  insn per cycle         
                                                  #    0.28  stalled cycles per insn
     1,356,618,512      branches:u                #  590.972 M/sec                  
         7,323,800      branch-misses:u           #    0.54% of all branches        

       2.304509653 seconds time elapsed

       2.128090000 seconds user
       0.163909000 seconds sys

Previously, this method called the more general `pat_constructor` function, which can return other pattern variants besides `IntRange`. Then it throws away any non-`IntRange` variants. Specialize it so work is only done when it could result in an `IntRange`.

…uct." This reverts commit ab83d37.

ecstatic-morse · 2020-09-27T04:24:04Z

@bors try
@rust-timer queue

rust-timer · 2020-09-27T04:24:06Z

Awaiting bors try build completion

bors · 2020-09-27T04:24:15Z

⌛ Trying commit c4d8089 with merge baa01e8429de2c041f0095c01d4e8ca99c0b11a7...

bors · 2020-09-27T05:06:05Z

☀️ Try build successful - checks-actions, checks-azure
Build commit: baa01e8429de2c041f0095c01d4e8ca99c0b11a7 (baa01e8429de2c041f0095c01d4e8ca99c0b11a7)

rust-timer · 2020-09-27T05:06:07Z

Queued baa01e8429de2c041f0095c01d4e8ca99c0b11a7 with parent 1ec980d, future comparison URL.

rust-timer · 2020-09-27T19:37:37Z

Finished benchmarking try commit (baa01e8429de2c041f0095c01d4e8ca99c0b11a7): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

ecstatic-morse · 2020-09-27T20:20:38Z

(moved to summary)

jackh726

LGTM :) Super weird that this was the cause of the slowdown. (A large one at that!)

vandenheuvel · 2020-09-28T09:01:21Z

Very interesting. Please let me know if I'm understanding your comment correctly: combining these two issues results in this PR having a net gain, but this doesn't address the issue underlying #76913? So this would not close #77058?

ecstatic-morse · 2020-09-28T17:29:17Z

I'm not quite sure what you mean @vandenheuvel. I think I've gone as far as I care to. If you'd like to investigate further, feel free to keep #77058 open. Perhaps you can do even better.

Here's the "slow" assembly (ParamEnv shrunk with no other changes) in rustc_mir_build::thir::pattern::_match::PatStack::specialize_constructor() annotated via perf with the sum of StoreQueueRsrcStall and LoadQueueRsrcStall events (I'm on a Zen 2 Ryzen processor).

Excerpt from IntRange::from_pat (inlined)

  0.00        → callq  rustc_mir_build::thir::pattern::_match::pat_constructor
  0.00          mov    0xb0(%rsp),%cl
  0.02          mov    0xc0(%rsp),%r9
  0.20          mov    0xb8(%rsp),%rbp
  0.01          mov    0xd0(%rsp),%r13
  0.01          mov    0xc8(%rsp),%r8
  0.01          mov    0xd8(%rsp),%al
  0.01          mov    0xd9(%rsp),%edx
 32.37          mov    %edx,0x40(%rsp)
  0.00          mov    0xdc(%rsp),%edx
  0.00          mov    %edx,0x43(%rsp)
                cmp    $0x7,%cl 
              ↓ jne    5a1

Excerpt from (I think) IntRange::intersection (inlined)

         5a1:   mov    0x40(%rsp),%edx
 20.80          mov    0x43(%rsp),%esi
  0.03          mov    %esi,0x183(%rsp)
                mov    %edx,0x180(%rsp)
                mov    $0x1,%dl 
                cmp    $0x3,%cl 
              ↓ jne    5d9      
                mov    0x180(%rsp),%ecx
  0.01          mov    0x183(%rsp),%edx
 23.64          mov    %edx,0x23(%rsp)
  0.12          mov    %ecx,0x20(%rsp)
                cmp    $0x2,%al 
                sete   %dl      
  0.07   5d9:   mov    0x20(%rsp),%eax
  0.01          mov    0x23(%rsp),%ecx
 18.20          mov    %ecx,0xb3(%rsp)
  0.01          mov    %eax,0xb0(%rsp)
                test   %dl,%dl  
  0.00        ↓ jne    1546     
                mov    %r14,0x18(%rsp)
  0.00          mov    0x218(%rsp),%rax
  0.00          mov    (%rax),%rdi
  0.01          mov    0x30(%rbx),%rax
  0.01          mov    0x8(%rbx),%r14
  0.14          mov    0x10(%rbx),%rsi
                mov    0x20(%rbx),%rdx
                mov    0x18(%rbx),%r10
                movzbl (%rax),%ecx
                cmp    $0x2,%rcx
              ↓ je     629

We're repeatedly copying two words into local stack variables, then branching, then reading them again. This sequence has a lot of data hazards, which is clearly slowing down the pipeline significantly. I can't say why LLVM chose to generate code this way, or what causes the data hazards to disappear with my changes to IntRange::from_pat. You'd need someone with more LLVM expertise than I have. Perhaps building with debug-info-level=2 and stepping through it with a debugger would be helpful?

simonvandel · 2020-09-28T18:20:30Z

I think the tag was meant for @vandenheuvel

ecstatic-morse · 2020-09-28T18:22:20Z

I think the tag was meant for @vandenheuvel

Sorry!

Mark-Simulacrum · 2020-09-28T20:12:16Z

r=me unless you're waiting on something further, not quite sure.

ecstatic-morse · 2020-09-28T20:52:25Z

@bors r=Mark-Simulacrum

bors · 2020-09-28T20:52:26Z

📌 Commit c4d8089 has been approved by Mark-Simulacrum

bors · 2020-09-29T02:29:20Z

⌛ Testing commit c4d8089 with merge 48cab67...

bors · 2020-09-29T04:54:32Z

☀️ Test successful - checks-actions, checks-azure
Approved by: Mark-Simulacrum
Pushing 48cab67 to master...

vandenheuvel · 2020-09-30T19:45:26Z

@ecstatic-morse I don't have the expertise for further investigation either, unfortunately.

And why I asked whether #77058 would remain open: it seems that the root cause has not yet been found.

I wish I had a better idea of what was going on here. I'd like to prevent the regression from reappearing in the future due to small changes in unrelated code.

So to document that, it seems useful to keep #77058 open, perhaps after renaming it and adding another comment.

ecstatic-morse · 2020-09-30T22:27:47Z

I don't think there's anything actionable left in #77058. If you disagree, you're free to reopen it (you should have the requisite permissions).

ecstatic-morse · 2020-10-06T02:16:16Z

Final perf results are in. Looks to be an improvement everywhere except unicode-normaliaztion and the match stress test, as expected.

ecstatic-morse added 2 commits September 26, 2020 20:00

Speed up IntRange::from_pat

bb6c249

Previously, this method called the more general `pat_constructor` function, which can return other pattern variants besides `IntRange`. Then it throws away any non-`IntRange` variants. Specialize it so work is only done when it could result in an `IntRange`.

Revert "Add an unused field of type Option<DefId> to ParamEnv str…

c4d8089

…uct." This reverts commit ab83d37.

rust-highfive assigned Mark-Simulacrum Sep 27, 2020

jackh726 approved these changes Sep 27, 2020

View reviewed changes

Mark-Simulacrum added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 28, 2020

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 28, 2020

bors added the merged-by-bors This PR was explicitly merged by bors. label Sep 29, 2020

bors merged commit 48cab67 into rust-lang:master Sep 29, 2020

rustbot added this to the 1.48.0 milestone Sep 29, 2020

ecstatic-morse deleted the optimize-int-range-from-pat branch October 6, 2020 01:42

Uh oh!

Conversation

ecstatic-morse commented Sep 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ecstatic-morse commented Sep 27, 2020

Uh oh!

rust-timer commented Sep 27, 2020

Uh oh!

bors commented Sep 27, 2020

Uh oh!

bors commented Sep 27, 2020

Uh oh!

rust-timer commented Sep 27, 2020

Uh oh!

rust-timer commented Sep 27, 2020

Uh oh!

ecstatic-morse commented Sep 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jackh726 left a comment

Choose a reason for hiding this comment

Uh oh!

vandenheuvel commented Sep 28, 2020

Uh oh!

ecstatic-morse commented Sep 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonvandel commented Sep 28, 2020

Uh oh!

ecstatic-morse commented Sep 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mark-Simulacrum commented Sep 28, 2020

Uh oh!

ecstatic-morse commented Sep 28, 2020

Uh oh!

bors commented Sep 28, 2020

Uh oh!

bors commented Sep 29, 2020

Uh oh!

bors commented Sep 29, 2020

Uh oh!

vandenheuvel commented Sep 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ecstatic-morse commented Sep 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ecstatic-morse commented Oct 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ecstatic-morse commented Sep 27, 2020 •

edited

Loading

ecstatic-morse commented Sep 27, 2020 •

edited

Loading

ecstatic-morse commented Sep 28, 2020 •

edited

Loading

ecstatic-morse commented Sep 28, 2020 •

edited

Loading

vandenheuvel commented Sep 30, 2020 •

edited

Loading

ecstatic-morse commented Sep 30, 2020 •

edited

Loading