Add null checks for iface before accessing index in mpnh_compare_node#117
Add null checks for iface before accessing index in mpnh_compare_node#117matteyeux wants to merge 2 commits intoprojectcalico:feature-ipinipfrom
Conversation
|
/sem-approve |
|
Looks like we also need to update the Semaphore machine type. I'll do that in a separate PR. |
|
@matteyeux Please merge master in order to pick up CI fixes from #118 |
|
@nelljerram master seems out of date, do you mean feature-ipinip ? It seems to be the main branch |
|
I'm so sorry. Yes, I meant |
|
/sem-approve |
|
Thanks, CI is green |
|
@matteyeux I'm sorry to ask another question - but do you know if there is something unusual about your clusters that hits this problem? I've just checked the upstream BIRD code - even the latest 3.2.0 release - and it does not appear to contain a fix like this, which feels very surprising given how widely used BIRD is. Have you also proposed this fix to upstream? It's possible that in the nearish future we might revert to using upstream BIRD instead of our own fork, because I believe we now have different solutions for the scenarios (mainly IP-in-IP) that prompted us to fork originally. Hence it would make sense to propose this fix upstream as well, to eliminate the possibility of your own clusters regressing when we switch to using upstream. |
|
FTR, in more recent BIRD code |
|
I do not know what is wrong with my cluster to have such segfaults. I have some nodes that have these segfault and some others don't (some are k3s workers some other are k3s masters), I also tested on virtual machine (in the same network as the other nodes) and same behaviour. Disabling IPv6 on my nodes fixed the segfault, but I wanted to also address the segfault. Thanks for pointing to the new name of the function, I will propose a patch to the upstream BIRD project. |
|
Many thanks @matteyeux . Given that we're expecting to retire this fork in the next year, we've decided to transition to an "upstream first" policy in advance of that - so please do let us know when this fix has been accepted upstream and then we'll pick it here too. |
|
Hello, Maria from BIRD here, we don't have this fix basically because it makes no sense to us to have a nexthop without an interface. If such a thing happens, the nexthop is already semantically invalid. Or I'm wrong, and then please tell me how such a nexthop is expected to be handled by Netlink. |
|
Thanks @marenamat for your take on this. I agree that it's important first to understand why this occurs, before applying an apparent fix. @matteyeux If this segfault is easily reproducible in your setup, I think you will need to add more logging to the code to make progress on understanding that. |
|
Hello, My first intend was to avoid the NULL-deref, as I am not sure why/how this happens. I'll continue to investigate to get the full picture of this. |
|
Your original patch may help you show the offending route table contents, anyway. I would suggest checking the code which generates the nexthop for that offending route. There may be something like wrongly marked route destination, or an omission in some code branch where the nexthop would be otherwise set. It may also be a deeper problem of some invariant violation, causing this one. I have not studied your data structures, and thus I can't point at the exact location. |
Description
This PR fixes #116, which is a segfault in
mpnh_compare_nodeby validating that x->iface and y->iface are not null before accessing their index field.Null interfaces are treated as greater than non-null ones, consistent with the existing null handling pattern for x and y nodes.
Tested on my machine that was crashing. After replacing the binary in the running pod by the patched one from my branch, there is no more segfault (it was crashing every second in my pod)