Use Envoy Aggregated Discovery Service via the gRPC API#52
Use Envoy Aggregated Discovery Service via the gRPC API#52relistan merged 7 commits intoNinesStack:masterfrom
Conversation
f1561b8 to
06304c3
Compare
d3855ca to
4ef0166
Compare
|
Can't see complete implementation because of missing file(s) - in the meantime, I've read through all the comments between Mihai and @relistan - everything makes sense. I'll familiarize myself with all the docs that @mihaitodor linked in the PR (thank you btw, very helpful!). |
|
@dselans Thanks for having a look! Actually, there are no missing files. It's just GH being dumb. Sometimes it doesn't expand some files like envoy/server.go. |
6fdb9e4 to
cfff433
Compare
| d.state.Lock() | ||
| defer d.state.Unlock() | ||
| d.state.ExpireServer(node.Name) |
There was a problem hiding this comment.
Since this is the only place where ExpireServer is called, I moved the locks to the body of the function, to make it easier to write one of my tests. It needs to lock the state anyway.
| @@ -0,0 +1,307 @@ | |||
| package adapter | |||
There was a problem hiding this comment.
There is still some duplication in the functions here and what remained in sidecarhttp/envoy_api.go, but I think deduplicating it will be ugly and I'm thinking that it's probably safe to delete that REST API soon, given that Envoy has a 1-2 year API lifecycle and the V2 API that I just implemented will be deprecated at the end of this year.
I wanted to switch directly to V3, but the go-control-plane doesn't seem to have made the necessary updates yet. It should be a much easier transition, though, since V3 is mostly similar to V2.
651d727 to
a158d0f
Compare
envoy/adapter/adapter.go
Outdated
| state.RLock() | ||
| defer state.RUnlock() | ||
|
|
||
| svcs := state.ByService() |
There was a problem hiding this comment.
I think we can and should use state.EachService() which takes a closure to operate on the state items. This save an allocation of a big blob of memory—which will happen pretty often—and it saves an unnecessary sort as well.
A good way to do that would be to build your own hash as you go (it will be much smaller) and then map it into a list at the very end.
listenerMap := make(map[string]cache.Resource)
state.EachService(func(hostname *string, id *string, svc *service.Service) {
// inspect things, put them in the hash as needed, overwriting without checking
})
listeners := make([]cache.Resource, len(listenersMap))
// insert items in a numbered way to avoid using append and the associate re-allocs
This is not blocking feedback, but it's a better implementation I think.
There was a problem hiding this comment.
Hehe, that's what I get for copy/pasting code 😅This is a great suggestion! Now this API should be much faster than the old REST-based one. Thanks for that!
relistan
left a comment
There was a problem hiding this comment.
Some comments about alternate implementation of part of this, and a couple little cleanup thing, but in general it looks very good!
ec06e17 to
12bbdaa
Compare
Using two distinct transactions for sending the updated resources to Envoy ensures that we don't get an error back if one of the listeners contains a cluster that Envoy hasn't received yet.
instead of using the listener mechanism
Use state.EachService() instead of state.ByService() when populating the clusters and listeners for the Envoy gRPC API. This is much more efficient, since state.ByService() does a lot of unnecessary manipulation, including sorting. Also merge EnvoyClustersFromState and EnvoyListenersFromState into one function to avoid locking the state several times.
Also remove deprecated sudo directive
Services from expired servers need to be tombstoned before updating the state and informing listeners about them.
12bbdaa to
9afd79a
Compare
|
@relistan Thanks a lot for the feedback! I decided to switch to using a simple TimedLooper instead of hooking into the Sidecar Listener functionality because this prevents the spurious updates issue and it ended up being much simpler overall, given your suggestion to check if Using your other suggestion of calling After adding a few more tests, I noticed that it makes sense to tombstone services before sending updates in Finally, I noticed that we were using the wrong zero value for |
catalog/services_state.go
Outdated
| Services: make(map[string]*service.Service, 5), | ||
| LastUpdated: time.Unix(0, 0), | ||
| LastChanged: time.Unix(0, 0), | ||
| LastUpdated: time.Time{}, |
There was a problem hiding this comment.
I'd rather not make this change. I definitely have outside code that uses this in a few projects.
There was a problem hiding this comment.
If we can back this out, I'll merge it!
There was a problem hiding this comment.
Done and created #54 to keep a record of this issue in case we want to fix it in the future.
9afd79a to
50c2fb7
Compare
|
@mihaitodor sorry I approved and thought you'd merge. I'll merge it now. |
|
I don't have permissions to merge. Thanks! :) |
Hey @relistan, this is the initial implementation of the new Envoy gRPC API that we discussed about. I also switched to go mod, which probably messes-up #43 (in case you want to resurrect it), sorry about that.
For now, it has no tests and I duplicated a bunch of code from the REST API. Please have a look at the changes and let me know if the design makes sense and if you'd like anything changed / done differently. Happy to do so before I proceed to deduplicate code and write some tests. I also need to update the Readme / docs. There are quite a few
TODOitems that I left in the code which would be great if you can review and leave some feedback.If you have some time, I'd appreciate it if you can test out these changes in a dev cluster and let me know if you see any issues.
For some background on the implementation, please see this blog post, the v2 API Overview and the xDS REST and gRPC protocol details (especially the part about the Aggregated Discovery Service).
Note that the current implementation, while it is eventually consistent (I hope!), can lead to transient errors because old listeners / clusters might be removed before new ones are added when Envoy receives an update. See the Eventual consistency considerations and On eventually consistent service discovery topics in the docs. They offer an incremental xDS which can be used to address such issues, but it's not yet implemented in the go control plane. Somebody is currently working on it, though (see here). More details here, here and here.
Also, for a large cluster, this implementation is inefficient, because, for each state change, we recreate the listener and cluster definitions and send them to Envoy, so I think using the incremental xDS in the future makes sense.
To validate this implementation, I have created the following setup on my Macbook using a custom Envoy config (see below) which has ADS enabled via gRPC:
envoy.yaml:Note that the
http2_protocol_optionsstuff seems to be needed for the gRPC protocol and192.168.65.2is the host IP address because I ran this using Docker for Mac.Given the setup above, I have performed the following
curltests:I have also tested the above in the browser, which Envoy detects and seems to always forward the
localhost:10080request to the same backend which responded to the initial request, or it might be due to some browser caching, not sure.I tested that clusters are created and removed as expected when backend endpoints go down or come back up.
I made sure that Sidecar can be killed safely and Envoy will continue to happily proxy requests to the last known backends.
Finally, I wanted to compare this implementation to the old Envoy REST API, so I stopped the new Envoy and ran the old one like so:
If you're using Docker for Mac, you'll have to hack the config in the container to point to the host IP, so I added the following to the s6 run script to make my life easier:
Then I fetched the Sidecar
services.jsonand did a config dump in both the old Envoy and the new one, which I am attaching here as reference: json_files.zip. As far as I can tell, everything looks OK in there.