Skip to content

Fix unexpected EOF on log upload#1978

Open
akselaase wants to merge 2 commits into
mendersoftware:5.1.xfrom
akselaase:aksel.aase/log-upload-unexpected-eof
Open

Fix unexpected EOF on log upload#1978
akselaase wants to merge 2 commits into
mendersoftware:5.1.xfrom
akselaase:aksel.aase/log-upload-unexpected-eof

Conversation

@akselaase

Copy link
Copy Markdown

Upon upgrading to Mender 5.1.0, we noticed that deployment log uploads would fail with a HTTP 400 "unexpected EOF" errors. Turns out this was an interaction with a patch we applied to the client and a latent bug in the upload retry code. Our patch simply logged a "Connected to ..." line with the resolved address of the remote. This however, caused the deployment log to grow in between the calculation of Content-Length and the actual log body streaming, causing a mismatch between declared and actual length. This caused server-side truncation of the uploaded JSON body, and thus JSON parsing failed with "unexpected EOF".

The fix is to avoid re-sanitizing the log file between content length calculation and body upload, such that the uploaded contents match the declared content length. If the upload fails, the client FSM logic will retry the whole task later (including re-sanitizing and calculating Content-Length).

Even though our patch exposed this issue, I think it's reachable on stock Mender as well by enabling debug logging, as there are some debug log statements in the same window IIRC.

Note that I initially developed this fix against 5.1.0, and have successfully tested it on our devices. When I rebased it on 5.1.x for upstreaming there were some conflicts with the large-log-file truncation logic (which necessitates the first commit). I have not tested these changes on top of 5.1.x in the field.

FileReader::Rewind() refused to rewind once the stream had reached EOF
because the eofbit/failbit left by the previous read made the seek fail.

Clear the stream state and seek to the configured start offset instead,
so a reader created with a non-zero offset can be re-read from its start
after reaching EOF.

Ticket: None
Changelog: None
Signed-off-by: Aksel Lunde Aase <aksel.aase@disruptive-technologies.com>
Calling SanitizeLogs() as part of Rewind would invalidate the previous
Content-Length calculation if the log has grown in the meantime. This
would cause the server to reject the log upload with HTTP 400:
unexpected EOF.

The log file can grow between computing the Content-Length and streaming
the body, because the deployment keeps logging to it while the upload is
in flight.

Ticket: None
Changelog: Fix deployment log uploads being rejected in some cases.
Signed-off-by: Aksel Lunde Aase <aksel.aase@disruptive-technologies.com>
@mender-test-bot

Copy link
Copy Markdown

Merging these commits will result in the following changelog entries:

Changelogs

mender (aksel.aase/log-upload-unexpected-eof)

New changes in mender since 5.1.x:

Bug Fixes
  • Fix deployment log uploads being rejected in some cases.

@vpodzime vpodzime left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all -- thanks for the potential improvement!

I think this is a good change. But it made me think about this once again (last time was when I worked on the logs sanitation and later trimming). What is the point of recording deployment logs at the point where it's no longer possible to upload them? With this change any extra logs appended to the deployment logs file after starting the upload wont' be uploaded, right? And without it, they would be attempted to be uploaded with an unavoidable failure. So what is the point in recording them in the first place? I think we need to either:

  • call FinishDeploymentLogging() earlier to make sure deployment logs are finished/closed before being uploaded, or
  • make sure the total size is recalculated (and re-sent?) on a rewind.

@akselaase

Copy link
Copy Markdown
Author

At least for us, continuing to log to the deployment log is valuable in at least two cases:

  • The deployment log upload fails temporarily (connectivity issues etc.,). The failures will be logged and uploaded on the later retries (re-sanitized).
  • The upload fails for some more permanent reason. They are still stored on the device, and we can later use mender-connect to inspect them on the device.

So its the "won't be uploaded" part that's inaccurate. There's still a chance they'll be uploaded (later retries), or at least accessed (through mender-connect or other means).

@vpodzime

Copy link
Copy Markdown
Contributor

So its the "won't be uploaded" part that's inaccurate.

I see. Then I think the focus should be on how to make sure they are fully uploaded at the first possible attempt. To me that means we should make the Rewind() work with the original intention and fix the size mismatch.

@akselaase

Copy link
Copy Markdown
Author

I'm not sure I get what you mean. Isn't this PR already doing that?

By fixing the size mismatch we either have to make the request body fit the declared content length (this PR), or make the content length fit the request body after its re-sanitized (when headers are already sent). The latter seems much more difficult, and I don't understand why you'd want to sanitize twice per upload anyway (once for calculating content length, then again when streaming the body).

It seems much more robust to just sanitize once when preparing the request, then use Rewind to seek to the correct offset in the sanitized file (without sanitizing again).

Please let me know if I'm missing something here.

@vpodzime

Copy link
Copy Markdown
Contributor

The latter seems much more difficult

Absolutely. But isn't it the only way to make sure full deployment logs are uploaded? The way I understand it is that the Rewind() only happens in case of specific failures. Failures that should append messages to the deployment logs. So then we want these messages to be part of the deployment logs uploaded to the server. Which means that we need to re-sanitize the logs and sent a new Content-length header (perhaps by doing a completely new HTTP request if that's the only way).

Without that (and I believe that's what this PR is doing), incomplete deployment logs would be uploaded, masking the failure(s) from failed attempts. With complete deployment logs being available on the device. So this is an improvement for sure, I'm just trying to see (and suggest) if we could go one step further and implement the above behavior with complete deployment logs being uploaded.

@akselaase

akselaase commented Jun 25, 2026

Copy link
Copy Markdown
Author

The way I understand it is that the Rewind() only happens in case of specific failures. Failures that should append messages to the deployment logs.

I think this is the misunderstanding. Rewind() happens unconditionally on each upload attempt, without any failure triggering it. I agree that any failing attempt should result in a new attempt including the failure logs, but that's not the worry here. Here's an excerpt of PushLogs on v5.1.0:

error::Error DeploymentClient::PushLogs(
	const string &deployment_id,
	const string &log_file_path,
	api::Client &client,
	LogsAPIResponseHandler api_handler) {
	auto logs_reader = make_shared<JsonLogMessagesReader>(log_file_path);
	auto err = logs_reader->SanitizeLogs(); // <-- (1) First call to SanitizeLogs
	if (err != error::NoError) {
		return err;
	}

	auto req = make_shared<api::APIRequest>();
	req->SetPath(http::JoinUrl(deployments_uri_prefix, deployment_id, logs_uri_suffix));
	req->SetMethod(http::Method::PUT);
	req->SetHeader("Content-Type", "application/json");
	req->SetHeader("Content-Length", to_string(logs_reader->TotalDataSize())); // <-- (2) Header length calculation
	req->SetHeader("Accept", "application/json");
	req->SetBodyGenerator([logs_reader]() {
		logs_reader->Rewind(); // <-- (3) Unconditional call to Rewind and second call to SanitizeLogs
                // If logs have been written to after (1) and before (3), which our patches do,
                // this will result in a mismatch of body size and content length.
                // As documented by SetBodyGenerator:
                //    "Make sure that the Content-Length set in headers matches the length of the body."
                // This means that we must not re-sanitize the log at this point, as that risks breaking that contract.
                // Therefore, remove SanitizeLogs from Rewind, and turn it into a pure seek to the proper offset.
		return logs_reader;
	});

Rewind is simply used as part of ordinary body streaming. Its wrong for it to re-sanitize logs at this point, because that is done just before. If this upload fails for any reason, the state machinery will schedule another attempt at PushLogs later, which will re-do the sanitation as desired.

Edit:

Note that we do not care that the log statements we do between (1) and (3) do not get included in the initial upload. They are only useful in case of failures, and in that case the upload will be retried later, including the relevant log entries.

@vpodzime

Copy link
Copy Markdown
Contributor

I think this is the misunderstanding. Rewind() happens unconditionally on each upload attempt, without any failure triggering it.

🎯 ❗ This is it! And it's stupid! (I can say so because I wrote that code myself)
image

The Rewind() is required there because the body generator can be called multiple times by the HTTP code and it is expected to generate the same content repeatedly. And there are two issues:

  • it should do nothing on the first run of the generator (because the reader is properly initialized)
  • it should always generate the same content

Your change fixes the latter and makes the former issue harmless. But we should fix the former issue properly as well.

Thanks for your patience and keeping with me! ❤️

@akselaase

Copy link
Copy Markdown
Author

No worries! Feel free to make edits to the branch if you had any specific changes in mind. I'll admit I'm a bit uneasy juggling all these offsets around to ensure we seek to the right place, and I believe you might have a better mental map of how that works. As far as I can tell the first bullet point is satisfied as well.

@vpodzime

Copy link
Copy Markdown
Contributor

Hehe, understandable! I'll get (back) to this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants