Add Sidecar maxconn support - DO NOT MERGE #7
Conversation
provider/sidecar.go
Note that CancelRequest has been deprecated and the new context functionality should be used instead.
@mihaitodor A lot of this you're changing is actually working as engineered. I'm not comfortable refactoring all this at once. We should prioritize and just do the necessary ones in separate, distinct PRs. Re: some of your TODOs:
provider/sidecar.go
```go
	}
	for {
		// Wrap code in an anonymous function because defer has function scope
		func() {
```
Probably a personal preference, but I don't like these large anonymous functions if it can be avoided. What was the motivation to refactor this function?
I was on the fence with this. I'll turn it into a separate method.
What was the motivation to refactor this function?
Mostly cosmetic and trying to make it work using the standard Go patterns that I am used to. Also, cleaning up the call to the deprecated Transport.CancelRequest function.
OK, I was a bit concerned about that, but it took me a good amount of time to get my head around the existing code, so I decided to rewrite parts of it in a way that makes more sense to me. Part of this PR contains cosmetic changes which I believe make the code more readable and cleaner, with a view to submitting it as a PR to the upstream project, which I'm still hoping we will do soon. If necessary, now that I think I understand the code, I can try to create incremental patches, but that will require more work. I'd like to first know if the overall approach I've taken here is OK before putting more effort into it.
I mean adding servers to the backend for hosts received via Sidecar for which we cannot resolve the IP address. See here: https://github.com/Nitro/traefik/pull/7/files#diff-fac1a7b56acf416d184a3da0990e8ef4R167 I am thinking to have servers added only when the IPs are resolvable.
Agreed.
I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.
Given that we're running this under s6, I can't understand why it's better to have it die over and over instead of letting it spit out errors every few seconds, which is what will happen if Sidecar goes down after it starts. The current design of
You're actually changing the design here. The point was to keep as much out of the filesystem config as possible. I also question whether we need that level of granularity for the max connections. Let's add it if we get to a point where we need to start tuning max connections for individual services. Given we're using round robin, we should be scaling well before we hit that point anyway. If the main objective is to get your connection counter code in, we can make it a global setting (even expose it in the traefik.toml if we want) and would need to touch a lot fewer places.
The only objective is to make sure traefik doesn't come up unusable. The reason being, we're using keepalived to monitor the process on each host, and trigger a failover of the EIP if the process dies. If the traefik process starts up, but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host. Open to other solutions, but that's the problem scenario I was trying to address.
@bparli OK, let's try making incremental changes. I started with #8 and I'll make a plan for further cleaning the code in order to have it ready for a PR to Traefik, if you agree with that. Regarding your replies:
Yes. From my discussion with @relistan, we do want to be able to configure connection limits independently for each service (and maybe disable them in some cases, if the locks that the MaxConn implementation has end up being too slow for some services). I think we should make it resemble the file backend in this sense.
Not sure I understand this. MaxConn applies on each entire backend, not per backend server.
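For reference, Traefik v1's file backend attaches the connection limit to the backend as a whole, which is the granularity being discussed. A hypothetical `sidecar.toml` fragment in that style (names and values are illustrative, not the final format for this provider):

```toml
[backends]
  [backends.web]
    [backends.web.maxconn]
      amount = 10
      extractorfunc = "request.host"

[frontends]
  [frontends.web]
    backend = "web"
    passHostHeader = true
    [frontends.web.routes.main]
      rule = "Host:web.example.com"
```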
Done in #8.
In this updated code, the frontends (actually, the TOML config) don't get loaded until the Sidecar watcher endpoint can be reached. See here. I am considering writing a test for this. If we route traffic to Traefik before its config is loaded, well, then maybe we need to brainstorm a better healthcheck for it. Here's how the authors of Traefik do it: https://github.com/containous/traefik/blob/master/integration/healthcheck_test.go#L45
```go
	for _, serv := range services {
		if serv.IsAlive() {
			ipAddr, err := net.LookupIP(serv.Hostname)
```
This should use the IP address for the Port if there is one and only fall back to lookups when there isn't (now that we have this available).
I guess that ^^ was discussed here earlier. Probably best in a separate PR.
```go
	defaultSidecar.Watch = true
	defaultSidecar.Endpoint = "http://127.0.0.1:7777"
	defaultSidecar.Frontend = "sidecar.toml"
	defaultSidecar.Filename = "sidecar.toml"
```
This change makes sense to me given the nature of the PR.
```go
	// Create backends from Sidecar state data
	for serviceName, services := range sidecarStates {
```
Here we could just pass the whole state in and call EachService on it rather than first generating the map and then passing it here and then ranging over it. I know this is different from the original code, too. But if we were going to redo it, that's what I'd recommend over this.
```go
	log.Debugf("Using %s Sidecar connection refresh interval", provider.RefreshConn)
	provider.recycleConn(client, tr)
	return nil
	for {
```
@relistan Not sure what should make it exit. My understanding of the design is that it should keep running as long as Traefik is in Watch mode. Maybe have a way to stop it if the config changes to turn the Watch mode off?
PS: it's the same as in the original implementation: https://github.com/Nitro/traefik/blob/master/provider/sidecar.go#L141
```go
	return nil
	for {
		// Call a separate function because the defer statement has function scope
		provider.sidecarWatcher()
```
Probably an anonymous function here would be better since this method does nothing on its own and all the work is now in sidecarWatcher
Initially, I had it like that, but @bparli argued that it's less readable.
I'd originally reused pointer variables. I still don't see a problem with doing that and it seems more efficient than the overhead of repeatedly calling a function, even in Go. I guess I'm still not seeing how this particular refactoring is an improvement.
I have read the original code multiple times now and I still don't feel confident enough to explain to somebody else why it's 100% correct. It may very well be, but I don't understand it fully and I'm probably missing some context. The fact that it is working in production is encouraging so I don't mind leaving it the way you designed it. In this sense, I have created #8 for moving forward with the MaxConn additions.
My objection to the way it was done is clarity and an explanation of why it works the way it does that we can relate to some existing documentation. The fact that Close() is never called on that HTTP response body deviates from a well-established Go pattern and the whole pointer reuse technique makes me scratch my head, because the variable will get overwritten during the subsequent iteration of the for loop.
I made some comments in line. Some replies below: @mihaitodor wrote:
It looks like a lot of rewriting of code that I already found pretty easy to read. This is a judgment call on when to refactor and when not to. I'd personally not restructure all of this. @mihaitodor wrote:
Why would we be putting backends in the file? I don't see why we would want that in this provider. We talked about adding a connection limit setting in that file and then applying it to the backend when loaded. But putting backends in that file is explicitly not a goal unless I somehow miss the point here. @bparli wrote:
@bparli is spot on here. We also need Traefik not to start on deploy (thereby canceling the deploy to other nodes) if it comes up without being able to talk to Sidecar. Since we do a rolling deploy, this is how we keep it from taking out all the LBs at once.
@relistan Answers below:
OK, that's not a problem. Initially, I had trouble following / understanding the code, so I rewrote it as I went along. I'll try to get some of the changes in other PRs once we agree on the overall goals.
Because otherwise it seems there are only 2 options: we can add a global setting, like here or we'd have to modify this struct, which is used by all the other providers. Is there any other way?
This might give better flexibility for other people, like being able to specify a HealthCheck
OK, I didn't have enough context here. Is there any plan to use Mesos for deploying Traefik in the future? If using a health check is not an option, then, indeed, we have to let it die.
I decided to not add a dependency on go-director, since Traefik already has similar functionality builtin (safe/routine.go). Discussion: #7 (comment) #8 (comment)
- This commit enables adding a MaxConn field on the frontends defined for Sidecar via the TOML specified in the Filename field.
- Design discussed here: #7 (comment) and here: #8 (comment)
Closing this since it has been superseded by other PRs.
!!! DO NOT MERGE THIS !!! - I'll close it once we agree on the requirements for the future PRs that we want to extract from it.
Here's my attempt at adding maxconn support for the Sidecar provider... I'm opening a separate PR on top of this one to add the UI support.
Notes:
`httpmock` doesn't work out of the box for a custom `http.Client`. We could use `httpmock.ActivateNonDefault`, but it only supports a single custom `http.Client`, so it's easier to overwrite them with `http.DefaultClient` in the test code.

TODO:
Convey panics on `c.So(err, ShouldBeNil)` if `err != nil`. It might be a bug in Convey, but I'm not sure yet. This is annoying because we shouldn't get a panic when tests fail.