This repository was archived by the owner on Sep 26, 2018. It is now read-only.

Add Sidecar maxconn support - DO NOT MERGE#7

Closed
mihaitodor wants to merge 3 commits into master from add-sidecar-maxconn-support

Conversation


@mihaitodor mihaitodor commented Apr 21, 2017

!!! DO NOT MERGE THIS !!! - I'll close it once we agree on the requirements for the future PRs that we want to extract from it.

Here's my attempt at adding maxconn support for the Sidecar provider... I'm opening a separate PR on top of this one to add the UI support.

Notes:

  • I renamed the parameter for the sidecar.toml file (to use the default). It is called "filename" now (just like for all the other providers), so we'll need to update the Ansible scripts in erector
  • Test_SidecarWatcher wasn't exercising the watcher functionality properly because httpmock doesn't work out of the box for a custom http.Client. We could use httpmock.ActivateNonDefault, but it only supports a single custom http.Client, so it's easier to overwrite them with http.DefaultClient in the test code.

TODO:

  • Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts.
  • Is there any point to add unreachable hosts?
  • Do we really want to kill Traefik if we can't connect to Sidecar when it starts up? No other provider does that
  • Convey panics inside a goroutine when using c.So(err, ShouldBeNil), if err != nil. It might be a bug in Convey, but I'm not sure yet. This is annoying because we shouldn't get a panic when tests fail.

@mihaitodor (Author)

Note that CancelRequest has been deprecated and the new context functionality should be used instead.

mihaitodor force-pushed the add-sidecar-maxconn-support branch from 4ff19fa to 9f6537a on April 21, 2017 at 17:08

bparli commented Apr 24, 2017

@mihaitodor A lot of this you're changing is actually working as engineered. I'm not comfortable refactoring all this at once. We should prioritize and just do the necessary ones in separate, distinct PRs.

Re: some of your TODOs:

  • "Is there any point to add unreachable hosts?" => I'm not sure what you mean here. We're relying on sidecar for service health.

  • "Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts." => good point, we should address this in a separate PR

  • "I renamed (used the default) parameter for the sidecar.toml file. It is called "filename" now (just like for all the other providers), so we'll need to update the ansible scripts in erector" => this is actually less descriptive/helpful since this is our frontend configuration in its entirety. Comparing to the other providers doesn't really make sense since this dynamic is unique to the Sidecar provider.

  • "Do we really want to kill Traefik if we can't connect to Sidecar when it starts up? No other provider does that" => I will again argue we do. Having sidecar up and working is a hard dependency. Again, the other providers are not a good comparison since using service discovery (Sidecar) as a provider is a dynamic unique to us.

}
for {
// Wrap code in an anonymous function because defer has function scope
func() {
Probably a personal preference, but I don't like these large anonymous functions if it can be avoided. What was the motivation to refactor this function?


@mihaitodor mihaitodor Apr 25, 2017


I was on the fence with this. I'll turn it into a separate method.

What was the motivation to refactor this function?

Mostly cosmetic and trying to make it work using the standard Go patterns that I am used to. Also, cleaning up the call to the deprecated Transport.CancelRequest function.

@mihaitodor

A lot of this you're changing is actually working as engineered. I'm not comfortable refactoring all this at once. We should prioritize and just do the necessary ones in separate, distinct PRs.

OK, I was a bit concerned about that, but it took me a good amount of time to get my head around the existing code, and I decided to rewrite parts of it in a way that makes more sense to me. Part of this PR contains cosmetic changes which I believe make the code more readable and cleaner, with a view to submitting it as a PR to the upstream project, which I am still hoping we will do soon.

If necessary, now that I think I understand the code, I can try to create incremental patches, but that will require more work. I'd like to first know if the overall approach that I have taken here is OK before putting more effort into it.

"Is there any point to add unreachable hosts?" => I'm not sure what you mean here. We're relying on sidecar for service health.

I mean adding servers to the backend for hosts received via Sidecar for which we cannot resolve the IP address. See here: https://github.com/Nitro/traefik/pull/7/files#diff-fac1a7b56acf416d184a3da0990e8ef4R167. I am thinking of adding servers only when the IPs are resolvable.

"Can we fetch the host IP from Sidecar now? If yes, I can remove the part which resolves IPs for hosts." => good point, we should address this in a separate PR

Agreed.

this is actually less descriptive/helpful since this is our frontend configuration in its entirety. Comparing to the other providers doesn't really make sense since this dynamic is unique to the Sidecar provider.

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

Having sidecar up and working is a hard dependency. Again, the other providers are not a good comparison since using service discovery (Sidecar) as a provider is a dynamic unique to us.

Given that we're running this under s6, I can't understand why it's better to have it die over and over instead of letting it spit out errors every few seconds, which is what will happen if Sidecar goes down after it starts. The current design of sidecarWatcher() prevents it from loading sidecar.toml if it can't reach Sidecar initially, so that should suffice.


bparli commented Apr 25, 2017

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

You're actually changing the design here. The point was to keep as much out of the filesystem config as possible. I also question whether we need that level of granularity for the max connections. Let's add it if we get to a point where we need to start tuning max connections for individual services. Given we're using round robin, we should be scaling well before we hit that point anyway. If the main objective is to get your connection counter code in, we can make it a global setting (even expose it in the traefik.toml if we want) and we would need to touch a lot fewer places.

Given that we're running this under s6, I can't understand why it's better to have it die over and over instead of letting it spit out errors every few seconds, which is what will happen if Sidecar goes down after it starts. The current design of sidecarWatcher() prevents it from loading sidecar.toml if it can't reach Sidecar initially, so that should suffice.

The only objective is to make sure traefik doesn't come up unusable. The reason being, we're using keepalived to monitor the process on each host, and trigger a failover of the EIP if the process dies. If the traefik process starts up but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host. Open to other solutions, but that's the problem scenario I was trying to address.


mihaitodor commented Apr 28, 2017

@bparli OK, let's try making incremental changes. I started with #8 and I'll make a plan for further cleaning the code in order to have it ready for a PR to Traefik, if you agree with that.

Regarding your replies:

You're actually changing the design here. The point was to keep as much out of the filesystem config as possible. I also question whether we need that level of granularity for the max connections.

Yes. From my discussion with @relistan, we do want to be able to configure connection limits independently for each service (and maybe disable them in some cases, if the locks that the MaxConn implementation has end up being too slow for some services).

I think we should make it resemble the file backend in this sense.

Given we're using round robin, we should be scaling well before we hit that point anyway.

Not sure I understand this. MaxConn applies on each entire backend, not per backend server.

If the main objective is to get your connection counter code in, we can make it a global setting (even expose it in the traefik.toml if we want) and would need to touch a lot fewer places.

Done in #8.

If the traefik process starts up, but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host.

In this updated code, the frontends (actually, the TOML config) do not get loaded until the Sidecar watcher endpoint can be reached. See here. I am considering writing a test for this. If we route traffic to Traefik before its config is loaded, well, then maybe we need to brainstorm a better healthcheck for it. Here's how the authors of Traefik do it: https://github.com/containous/traefik/blob/master/integration/healthcheck_test.go#L45


for _, serv := range services {
if serv.IsAlive() {
ipAddr, err := net.LookupIP(serv.Hostname)


This should use the IP address for the Port if there is one and only fall back to lookups when there isn't (now that we have this available).


I guess that ^^ was discussed here earlier. Probably best in a separate PR.

defaultSidecar.Watch = true
defaultSidecar.Endpoint = "http://127.0.0.1:7777"
defaultSidecar.Frontend = "sidecar.toml"
defaultSidecar.Filename = "sidecar.toml"

This change makes sense to me given the nature of the PR.

}

// Create backends from Sidecar state data
for serviceName, services := range sidecarStates {


Here we could just pass the whole state in and call EachService on it rather than first generating the map and then passing it here and then ranging over it. I know this is different from the original code, too. But if we were going to redo it, that's what I'd recommend over this.

log.Debugf("Using %s Sidecar connection refresh interval", provider.RefreshConn)
provider.recycleConn(client, tr)
return nil
for {


How will this for loop be exited?


@mihaitodor mihaitodor Apr 28, 2017


@relistan Not sure what should make it exit. My understanding of the design is that it should keep running as long as Traefik is in Watch mode. Maybe have a way to stop it if the config changes to turn the Watch mode off?

PS: it's the same as in the original implementation: https://github.com/Nitro/traefik/blob/master/provider/sidecar.go#L141

return nil
for {
// Call a separate function because the defer statement has function scope
provider.sidecarWatcher()


Probably an anonymous function here would be better since this method does nothing on its own and all the work is now in sidecarWatcher

@mihaitodor (Author)


Initially, I had it like that, but @bparli argued that it's less readable.


I'd originally reused pointer variables. I still don't see a problem with doing that and it seems more efficient than the overhead of repeatedly calling a function, even in Go. I guess I'm still not seeing how this particular refactoring is an improvement.

@mihaitodor (Author)


I have read the original code multiple times now and I still don't feel confident enough to explain to somebody else why it's 100% correct. It may very well be, but I don't understand it fully and I'm probably missing some context. The fact that it is working in production is encouraging so I don't mind leaving it the way you designed it. In this sense, I have created #8 for moving forward with the MaxConn additions.

My objection to the way it was done is clarity and an explanation of why it works the way it does that we can relate to some existing documentation. The fact that Close() is never called on that HTTP response body deviates from a well-established Go pattern and the whole pointer reuse technique makes me scratch my head, because the variable will get overwritten during the subsequent iteration of the for loop.

@relistan

I made some comments in line. Some replies below:

@mihaitodor wrote:

OK, I was a bit concerned about that, but it took me a good amount of time to get my head around the existing code, and I decided to rewrite parts of it in a way that makes more sense to me. Part of this PR contains cosmetic changes which I believe make the code more readable and cleaner, with a view to submitting it as a PR to the upstream project, which I am still hoping we will do soon.

It looks like a lot of rewriting of code that I already found pretty easy to read. This is a judgment call on when to refactor and when not to. I'd personally not restructure all of this.

@mihaitodor wrote:

I disagree. This PR actually changes the format of the sidecar.toml file, as discussed. Please see the modified test file here. Basically, we are now able to add backends in it as well and when Sidecar sends an update for an existing backend, the updates get merged. Otherwise, if the backend doesn't exist, it gets created. So, I think we should preserve the naming convention and think of a way to document it better.

Why would we be putting backends in the file? I don't see why we would want that in this provider. We talked about adding a connection limit setting in that file and then applying it to the backend when loaded. But putting backends in that file is explicitly not a goal unless I somehow miss the point here.

@bparli wrote:

The only objective is to make sure traefik doesn't come up unusable. The reason being, we're using keepalived to monitor the process on each host, and trigger a failover of the EIP if the process dies. If the traefik process starts up but errors out over and over, we could easily find ourselves in a situation where we're routing live traffic to a bad host. Open to other solutions, but that's the problem scenario I was trying to address.

@bparli is spot on here. We also need Traefik not to start on deploy (thereby canceling the deploy to other nodes) if it comes up without being able to talk to Sidecar. Since we do a rolling deploy, this is how we keep it from taking out all the LBs at once.

@mihaitodor

@relistan Answers below:

I'd personally not restructure all of this.

OK, that's not a problem. Initially, I had trouble following / understanding the code, so I rewrote it as I went along. I'll try to get some of the changes in other PRs once we agree on the overall goals.

Why would we be putting backends in the file? I don't see why we would want that in this provider.

Because otherwise it seems there are only 2 options: we can add a global setting, like here or we'd have to modify this struct, which is used by all the other providers. Is there any other way?

But putting backends in that file is explicitly not a goal unless I somehow miss the point here.

This might give better flexibility for other people, like being able to specify a HealthCheck

Since we do a rolling deploy, this is how we keep it from taking out all the LBs at once.

OK, I didn't have enough context here. Is there any plan to use Mesos for deploying Traefik in the future? If using a health check is not an option, then, indeed, we have to let it die.

@mihaitodor

@relistan On second thought, I could embed the Frontend struct into another struct and add a MaxConn field alongside it. Not sure if that's supported, but I can try if you think we really don't want people to add other options to the backends.

mihaitodor added a commit that referenced this pull request May 9, 2017
I decided to not add a dependency on go-director, since Traefik
already has similar functionality builtin (safe/routine.go).

Discussion:
#7 (comment)
#8 (comment)
mihaitodor added a commit that referenced this pull request May 9, 2017
- This commit enables adding a MaxConn field on the frontends defined
for Sidecar via the TOML specified in the Filename field.

- Design discussed here: #7 (comment)
and here: #8 (comment)
@mihaitodor mihaitodor changed the title Add Sidecar maxconn support Add Sidecar maxconn support - DO NOT MERGE May 12, 2017
@mihaitodor

Closing this since it has been superseded by other PRs.

