Fix unexpected EOF on log upload#1978
Conversation
FileReader::Rewind() refused to rewind once the stream had reached EOF because the eofbit/failbit left by the previous read made the seek fail. Clear the stream state and seek to the configured start offset instead, so a reader created with a non-zero offset can be re-read from its start after reaching EOF. Ticket: None Changelog: None Signed-off-by: Aksel Lunde Aase <aksel.aase@disruptive-technologies.com>
Calling SanitizeLogs() as part of Rewind would invalidate the previous Content-Length calculation if the log has grown in the meantime. This would cause the server to reject the log upload with HTTP 400: unexpected EOF. The log file can grow between computing the Content-Length and streaming the body, because the deployment keeps logging to it while the upload is in flight. Ticket: None Changelog: Fix deployment log uploads being rejected in some cases. Signed-off-by: Aksel Lunde Aase <aksel.aase@disruptive-technologies.com>
|
Merging these commits will result in the following changelog entries: Changelogsmender (aksel.aase/log-upload-unexpected-eof)New changes in mender since 5.1.x: Bug Fixes
|
There was a problem hiding this comment.
First of all -- thanks for the potential improvement!
I think this is a good change. But it made me think about this once again (last time was when I worked on the logs sanitation and later trimming). What is the point of recording deployment logs at the point where it's no longer possible to upload them? With this change any extra logs appended to the deployment logs file after starting the upload wont' be uploaded, right? And without it, they would be attempted to be uploaded with an unavoidable failure. So what is the point in recording them in the first place? I think we need to either:
- call
FinishDeploymentLogging()earlier to make sure deployment logs are finished/closed before being uploaded, or - make sure the total size is recalculated (and re-sent?) on a rewind.
|
At least for us, continuing to log to the deployment log is valuable in at least two cases:
So its the "won't be uploaded" part that's inaccurate. There's still a chance they'll be uploaded (later retries), or at least accessed (through mender-connect or other means). |
I see. Then I think the focus should be on how to make sure they are fully uploaded at the first possible attempt. To me that means we should make the Rewind() work with the original intention and fix the size mismatch. |
|
I'm not sure I get what you mean. Isn't this PR already doing that? By fixing the size mismatch we either have to make the request body fit the declared content length (this PR), or make the content length fit the request body after its re-sanitized (when headers are already sent). The latter seems much more difficult, and I don't understand why you'd want to sanitize twice per upload anyway (once for calculating content length, then again when streaming the body). It seems much more robust to just sanitize once when preparing the request, then use Rewind to seek to the correct offset in the sanitized file (without sanitizing again). Please let me know if I'm missing something here. |
Absolutely. But isn't it the only way to make sure full deployment logs are uploaded? The way I understand it is that the Rewind() only happens in case of specific failures. Failures that should append messages to the deployment logs. So then we want these messages to be part of the deployment logs uploaded to the server. Which means that we need to re-sanitize the logs and sent a new Content-length header (perhaps by doing a completely new HTTP request if that's the only way). Without that (and I believe that's what this PR is doing), incomplete deployment logs would be uploaded, masking the failure(s) from failed attempts. With complete deployment logs being available on the device. So this is an improvement for sure, I'm just trying to see (and suggest) if we could go one step further and implement the above behavior with complete deployment logs being uploaded. |
I think this is the misunderstanding. error::Error DeploymentClient::PushLogs(
const string &deployment_id,
const string &log_file_path,
api::Client &client,
LogsAPIResponseHandler api_handler) {
auto logs_reader = make_shared<JsonLogMessagesReader>(log_file_path);
auto err = logs_reader->SanitizeLogs(); // <-- (1) First call to SanitizeLogs
if (err != error::NoError) {
return err;
}
auto req = make_shared<api::APIRequest>();
req->SetPath(http::JoinUrl(deployments_uri_prefix, deployment_id, logs_uri_suffix));
req->SetMethod(http::Method::PUT);
req->SetHeader("Content-Type", "application/json");
req->SetHeader("Content-Length", to_string(logs_reader->TotalDataSize())); // <-- (2) Header length calculation
req->SetHeader("Accept", "application/json");
req->SetBodyGenerator([logs_reader]() {
logs_reader->Rewind(); // <-- (3) Unconditional call to Rewind and second call to SanitizeLogs
// If logs have been written to after (1) and before (3), which our patches do,
// this will result in a mismatch of body size and content length.
// As documented by SetBodyGenerator:
// "Make sure that the Content-Length set in headers matches the length of the body."
// This means that we must not re-sanitize the log at this point, as that risks breaking that contract.
// Therefore, remove SanitizeLogs from Rewind, and turn it into a pure seek to the proper offset.
return logs_reader;
});
Edit: Note that we do not care that the log statements we do between (1) and (3) do not get included in the initial upload. They are only useful in case of failures, and in that case the upload will be retried later, including the relevant log entries. |
|
No worries! Feel free to make edits to the branch if you had any specific changes in mind. I'll admit I'm a bit uneasy juggling all these offsets around to ensure we seek to the right place, and I believe you might have a better mental map of how that works. As far as I can tell the first bullet point is satisfied as well. |
|
Hehe, understandable! I'll get (back) to this next week. |

Upon upgrading to Mender 5.1.0, we noticed that deployment log uploads would fail with a HTTP 400 "unexpected EOF" errors. Turns out this was an interaction with a patch we applied to the client and a latent bug in the upload retry code. Our patch simply logged a "Connected to ..." line with the resolved address of the remote. This however, caused the deployment log to grow in between the calculation of Content-Length and the actual log body streaming, causing a mismatch between declared and actual length. This caused server-side truncation of the uploaded JSON body, and thus JSON parsing failed with "unexpected EOF".
The fix is to avoid re-sanitizing the log file between content length calculation and body upload, such that the uploaded contents match the declared content length. If the upload fails, the client FSM logic will retry the whole task later (including re-sanitizing and calculating Content-Length).
Even though our patch exposed this issue, I think it's reachable on stock Mender as well by enabling debug logging, as there are some debug log statements in the same window IIRC.
Note that I initially developed this fix against 5.1.0, and have successfully tested it on our devices. When I rebased it on 5.1.x for upstreaming there were some conflicts with the large-log-file truncation logic (which necessitates the first commit). I have not tested these changes on top of 5.1.x in the field.