This repository was archived by the owner on Sep 26, 2018. It is now read-only.

Add Sidecar maxconn support - DO NOT MERGE#7

Closed
mihaitodor wants to merge 3 commits into master from add-sidecar-maxconn-support

Conversation


@mihaitodor mihaitodor commented Apr 21, 2017

!!! DO NOT MERGE THIS !!! - I'll close it once we agree on the requirements for the future PRs that we want to extract from it.

Here's my attempt at adding maxconn support for the Sidecar provider... I'm opening a separate PR on top of this one to add the UI support.

Notes:

  • I renamed the parameter for the sidecar.toml file (to use the default). It is called "filename" now (just like for all the other providers), so we'll need to update the Ansible scripts in erector
  • Test_SidecarWatcher wasn't exercising the watcher functionality properly because httpmock doesn't work out of the box for a custom http.Client. We could use httpmock.ActivateNonDefault, but it only supports a single custom http.Client, so it's easier to overwrite them with http.DefaultClient in the test code.

TODO:

  • Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts.
  • Is there any point to add unreachable hosts?
  • Do we really want to kill Traefik if we can't connect to Sidecar when it starts up? No other provider does that
  • Convey panics inside a goroutine when using c.So(err, ShouldBeNil), if err != nil. It might be a bug in Convey, but I'm not sure yet. This is annoying because we shouldn't get a panic when tests fail.

@mihaitodor (Author)

Note that CancelRequest has been deprecated and the new context functionality should be used instead.

mihaitodor force-pushed the add-sidecar-maxconn-support branch from 4ff19fa to 9f6537a on April 21, 2017 at 17:08

bparli commented Apr 24, 2017

@mihaitodor A lot of this you're changing is actually working as engineered. I'm not comfortable refactoring all this at once. We should prioritize and just do the necessary ones in separate, distinct PRs.

Re: some of your TODOs:

  • "Is there any point to add unreachable hosts?" => I'm not sure what you mean here. We're relying on sidecar for service health.

  • "Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts." => good point, we should address this in a separate PR

  • "I renamed (used the default) parameter for the sidecar.toml file. It is called "filename" now (just like for all the other providers), so we'll need to update the ansible scripts in erector" => this is actually less descriptive/helpful since this is our frontend configuration in its entirety. Comparing to the other providers doesn't really make sense since this dynamic is unique to the Sidecar provider.

  • "Do we really want to kill Traefik if we can't connect to Sidecar when it starts up? No other provider does that" => I will again argue we do. Having sidecar up and working is a hard dependency. Again, the other providers are not a good comparison since using service discovery (Sidecar) as a provider is a dynamic unique to us.

}
for {
// Wrap code in an anonymous function because defer has function scope
func() {
Probably a personal preference, but I don't like these large anonymous functions if it can be avoided. What was the motivation to refactor this function?


@mihaitodor mihaitodor Apr 25, 2017


I was on the fence with this. I'll turn it into a separate method.

What was the motivation to refactor this function?

Mostly cosmetic and trying to make it work using the standard Go patterns that I am used to. Also, cleaning up the call to the deprecated Transport.CancelRequest function.

@mihaitodor

A lot of this you're changing is actually working as engineered. I'm not comfortable refactoring all this at once. We should prioritize and just do the necessary ones in separate, distinct PRs.

OK, I was a bit concerned about that, but it took me a good amount of time to get my head around the existing code, and I decided to rewrite parts of it in a way that makes more sense to me. Part of this PR contains cosmetic changes which I believe make the code more readable and cleaner, with a view to submitting it as a PR to the upstream project, which I am still hoping we will do soon.

If necessary, now that I think I understand the code, I can try to create incremental patches, but that will require more work. I'd like to first know if the overall approach that I have taken here is OK before putting more effort into it.

"Is there any point to add unreachable hosts?" => I'm not sure what you mean here. We're relying on sidecar for service health.

I mean adding servers to the backend for hosts received via Sidecar for which we cannot resolve the IP address. See here: https://github.com/Nitro/traefik/pull/7/files#diff-fac1a7b56acf416d184a3da0990e8ef4R167. I am thinking of adding servers only when the IPs are resolvable.

"Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts." => good point, we should address this in a separate PR

Agreed.

this is actually less descriptive/helpful since this is our frontend configuration in its entirety. Comparing to the other providers doesn't really make sense since this dynamic is unique to the Sidecar provider.

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

Having sidecar up and working is a hard dependency. Again, the other providers are not a good comparison since using service discovery (Sidecar) as a provider is a dynamic unique to us.

Given that we're running this under s6, I can't understand why it's better to have it die over and over instead of letting it spit out errors every few seconds, which is what will happen if Sidecar goes down after it starts. The current design of sidecarWatcher() prevents it from loading sidecar.toml if it can't reach Sidecar initially, so that should suffice.


bparli commented Apr 25, 2017

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

You're actually changing the design here. The point was to keep as much out of the filesystem config as possible. I also question whether we need that level of granularity for the max connections. Let's add it if we get to a point where we need to start tuning max connections for individual services. Given we're using round robin, we should be scaling well before we hit that point anyway. If the main objective is to get your connection counter code in, we can make it a global setting (even expose it in the traefik.toml if we want) and we would need to touch a lot fewer places.

Given that we're running this under s6, I can't understand why it's better to have it die over and over instead of letting it spit out errors every few seconds, which is what will happen if Sidecar goes down after it starts. The current design of sidecarWatcher() prevents it from loading sidecar.toml if it can't reach Sidecar initially, so that should suffice.

The only objective is to make sure traefik doesn't come up unusable. The reason being, we're using keepalived to monitor the process on each host, and trigger a failover of the EIP if the process dies. If the traefik process starts up but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host. Open to other solutions, but that's the problem scenario I was trying to address.


mihaitodor commented Apr 28, 2017

@bparli OK, let's try making incremental changes. I started with #8 and I'll make a plan for further cleaning the code in order to have it ready for a PR to Traefik, if you agree with that.

Regarding your replies:

You're actually changing the design here. The point was to keep as much out of the filesystem config as possible. I also question whether we need that level of granularity for the max connections.

Yes. From my discussion with @relistan, we do want to be able to configure connection limits independently for each service (and maybe disable them in some cases, if the locks that the MaxConn implementation has end up being too slow for some services).

I think we should make it resemble the file backend in this sense.

Given we're using round robin, we should be scaling well before we hit that point anyway.

Not sure I understand this. MaxConn applies on each entire backend, not per backend server.

If the main objective is to get your connection counter code in, we can make it a global setting (even expose it in the traefik.toml if we want) and would need to touch a lot fewer places.

Done in #8.

If the traefik process starts up, but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host.

In this updated code, the frontends (actually, the TOML config) do not get loaded until the Sidecar watcher endpoint can be reached. See here. I am considering writing a test for this. If we route traffic to Traefik before its config is loaded, well, then maybe we need to brainstorm a better healthcheck for it. Here's how the authors of Traefik do it: https://github.com/containous/traefik/blob/master/integration/healthcheck_test.go#L45


for _, serv := range services {
if serv.IsAlive() {
ipAddr, err := net.LookupIP(serv.Hostname)


This should use the IP address for the Port if there is one and only fall back to lookups when there isn't (now that we have this available).


I guess that ^^ was discussed here earlier. Probably best in a separate PR.

defaultSidecar.Watch = true
defaultSidecar.Endpoint = "http://127.0.0.1:7777"
defaultSidecar.Frontend = "sidecar.toml"
defaultSidecar.Filename = "sidecar.toml"

This change makes sense to me given the nature of the PR.

}

// Create backends from Sidecar state data
for serviceName, services := range sidecarStates {


Here we could just pass the whole state in and call EachService on it rather than first generating the map and then passing it here and then ranging over it. I know this is different from the original code, too. But if we were going to redo it, that's what I'd recommend over this.

log.Debugf("Using %s Sidecar connection refresh interval", provider.RefreshConn)
provider.recycleConn(client, tr)
return nil
for {


How will this for loop be exited?


@mihaitodor mihaitodor Apr 28, 2017


@relistan Not sure what should make it exit. My understanding of the design is that it should keep running as long as Traefik is in Watch mode. Maybe have a way to stop it if the config changes to turn the Watch mode off?

PS: it's the same as in the original implementation: https://github.com/Nitro/traefik/blob/master/provider/sidecar.go#L141

return nil
for {
// Call a separate function because the defer statement has function scope
provider.sidecarWatcher()


Probably an anonymous function here would be better since this method does nothing on its own and all the work is now in sidecarWatcher

@mihaitodor (Author)


Initially, I had it like that, but @bparli argued that it's less readable.


I'd originally reused pointer variables. I still don't see a problem with doing that and it seems more efficient than the overhead of repeatedly calling a function, even in Go. I guess I'm still not seeing how this particular refactoring is an improvement.

@mihaitodor (Author)


I have read the original code multiple times now and I still don't feel confident enough to explain to somebody else why it's 100% correct. It may very well be, but I don't understand it fully and I'm probably missing some context. The fact that it is working in production is encouraging so I don't mind leaving it the way you designed it. In this sense, I have created #8 for moving forward with the MaxConn additions.

My objection to the way it was done is clarity and an explanation of why it works the way it does that we can relate to some existing documentation. The fact that Close() is never called on that HTTP response body deviates from a well-established Go pattern and the whole pointer reuse technique makes me scratch my head, because the variable will get overwritten during the subsequent iteration of the for loop.

@relistan

I made some comments in line. Some replies below:

@mihaitodor wrote:

OK, I was a bit concerned about that, but it took me a good amount of time to get my head around the existing code, and I decided to rewrite parts of it in a way that makes more sense to me. Part of this PR contains cosmetic changes which I believe make the code more readable and cleaner, with a view to submitting it as a PR to the upstream project, which I am still hoping we will do soon.

It looks like a lot of rewriting of code that I already found pretty easy to read. This is a judgment call on when to refactor and when not to. I'd personally not restructure all of this.

@mihaitodor wrote:

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

Why would we be putting backends in the file? I don't see why we would want that in this provider. We talked about adding a connection limit setting in that file and then applying it to the backend when loaded. But putting backends in that file is explicitly not a goal unless I somehow miss the point here.

@bparli wrote:

The only objective is to make sure traefik doesn't come up unusable. The reason being, we're using keepalived to monitor the process on each host, and trigger a failover of the EIP if the process dies. If the traefik process starts up but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host. Open to other solutions, but that's the problem scenario I was trying to address.

@bparli is spot on here. We also need Traefik not to start on deploy (thereby canceling the deploy to other nodes) if it comes up without being able to talk to Sidecar. Since we do a rolling deploy, this is how we keep it from taking out all the LBs at once.

@mihaitodor

@relistan Answers below:

I'd personally not restructure all of this.

OK, that's not a problem. Initially, I had trouble following / understanding the code, so I rewrote it as I went along. I'll try to get some of the changes in other PRs once we agree on the overall goals.

Why would we be putting backends in the file? I don't see why we would want that in this provider.

Because otherwise it seems there are only 2 options: we can add a global setting, like here or we'd have to modify this struct, which is used by all the other providers. Is there any other way?

But putting backends in that file is explicitly not a goal unless I somehow miss the point here.

This might give better flexibility for other people, like being able to specify a HealthCheck

Since we do a rolling deploy, this is how we keep it from taking out all the LBs at once.

OK, I didn't have enough context here. Is there any plan to use Mesos for deploying Traefik in the future? If using a health check is not an option, then, indeed, we have to let it die.

@mihaitodor

@relistan On second thought, I could embed the Frontend struct into another struct and add a MaxConn field alongside it. Not sure if that's supported, but I can try if you think we really don't want people to add other options to the backends.

mihaitodor added a commit that referenced this pull request May 9, 2017
I decided to not add a dependency on go-director, since Traefik
already has similar functionality builtin (safe/routine.go).

Discussion:
#7 (comment)
#8 (comment)
mihaitodor added a commit that referenced this pull request May 9, 2017
- This commit enables adding a MaxConn field on the frontends defined
for Sidecar via the TOML specified in the Filename field.

- Design discussed here: #7 (comment)
and here: #8 (comment)
@mihaitodor mihaitodor changed the title Add Sidecar maxconn support Add Sidecar maxconn support - DO NOT MERGE May 12, 2017
@mihaitodor

Closing this since it has been superseded by other PRs.

