UriModifyingContentModifier.modifyContent decodes with the request charset but re-encodes with the platform default #1039

@config25

Description

Summary

In UriModifyingContentModifier.modifyContent(byte[], MediaType), the incoming bytes are decoded using the charset declared in the Content-Type header, but the modified result is written back with the no-argument String.getBytes(), which falls back to the JVM's default charset (file.encoding). When those two charsets differ, non-ASCII characters in the body end up corrupted in the recorded output.

The recently merged fix in #1034 (for #1033) addressed a very similar decode/encode mismatch in MockMvcRequestConverter, so I wanted to flag that the same shape of issue may also be present here.

Affected code

spring-restdocs-core/src/main/java/org/springframework/restdocs/operation/preprocess/UriModifyingOperationPreprocessor.java, lines 194–205:

@Override
public byte[] modifyContent(byte[] content, @Nullable MediaType contentType) {
    String input;
    if (contentType != null && contentType.getCharset() != null) {
        input = new String(content, contentType.getCharset());   // decode: uses declared charset
    }
    else {
        input = new String(content);
    }

    return modify(input).getBytes();   // encode: uses JVM default charset
}

A similar content modifier in the same package, PatternReplacingContentModifier, happens to handle this by using a single charset for both directions, which may be useful as a reference:

Charset charset = (contentType != null && contentType.getCharset() != null) ? contentType.getCharset()
        : this.fallbackCharset;
String original = new String(content, charset);
...
return builder.toString().getBytes(charset);

Scenario — non-ASCII content with a non-default charset

The issue shows up when a request or response declares a Content-Type charset that differs from the JVM's file.encoding and the body contains non-ASCII characters. Two cases that come to mind:

  • A legacy service serving text/...; charset=ISO-8859-1, documented from a JVM running with file.encoding=UTF-8 (the default since JEP 400 / Java 18).
  • A service serving application/json; charset=UTF-8, documented from a JVM whose default differs (e.g. file.encoding=MS949 on a Korean Windows machine, or Cp1252 on a Western Windows machine).
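
If it helps while triaging, the default that the no-argument String.getBytes() falls back to can be checked directly on the documenting JVM (plain Java, nothing REST Docs specific):

// The charset used by String.getBytes() when no argument is given:
System.out.println(java.nio.charset.Charset.defaultCharset()); // e.g. UTF-8 on Java 18+ (JEP 400)
System.out.println(System.getProperty("file.encoding"));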

Here is a small test that reproduces the request side using ISO-8859-1, which makes the asymmetry observable regardless of the host platform:

@Test
void requestContentWithNonAsciiCharactersIsPreservedWhenCharsetIsIso88591() {
    this.preprocessor.scheme("https");
    String original = "café http://localhost:12345 done";
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.parseMediaType("text/plain;charset=ISO-8859-1"));
    OperationRequest request = this.requestFactory.create(URI.create("http://localhost"), HttpMethod.GET,
            original.getBytes(StandardCharsets.ISO_8859_1), headers,
            Collections.<OperationRequestPart>emptyList());
    OperationRequest processed = this.preprocessor.preprocess(request);
    String result = new String(processed.getContent(), StandardCharsets.ISO_8859_1);
    assertThat(result).isEqualTo("café https://localhost:12345 done");
}

I tried this locally against main (commit 631b6c22). The Gradle test JVM runs with -Dfile.encoding=UTF-8, and the assertion fails as follows:

expected: "café https://localhost:12345 done"
 but was: "café https://localhost:12345 done"

For what it's worth, the mechanics seem to line up with the code:

  1. The byte 0xE9 is decoded as ISO-8859-1, and the Java String ends up with the character é as expected.
  2. modify(...) returns the modified String with é intact.
  3. .getBytes() (no charset) then encodes é using the JVM default — UTF-8 in this environment — producing the two bytes 0xC3 0xA9.
  4. Reading those bytes back as ISO-8859-1 yields Ã© instead of é, so the body emitted by the preprocessor no longer matches the encoding declared in the Content-Type header (the standalone sketch below walks through the same round trip).
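
To sanity-check these steps outside REST Docs, here is a standalone sketch of the same round trip in plain Java. UTF-8 is pinned explicitly here so the result does not depend on the host's file.encoding:

import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    public static void main(String[] args) {
        byte[] body = { (byte) 0xE9 };                                   // 'é' in ISO-8859-1

        // Step 1: decode with the declared charset.
        String decoded = new String(body, StandardCharsets.ISO_8859_1);
        System.out.println(decoded);                                     // é

        // Step 3: encode with UTF-8, standing in for the JVM default.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", reencoded[0], reencoded[1]);    // C3 A9

        // Step 4: read those bytes back with the declared charset.
        System.out.println(new String(reencoded, StandardCharsets.ISO_8859_1)); // Ã©
    }

}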

Since UriModifyingContentModifier is used for both requests and responses, the same thing seems to happen on the response side too.
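
For completeness, here is a sketch of the response-side counterpart of the test above. It assumes an OperationResponseFactory field named responseFactory, mirroring the requestFactory fixture in the request-side test (the field name is illustrative):

@Test
void responseContentWithNonAsciiCharactersIsPreservedWhenCharsetIsIso88591() {
    this.preprocessor.scheme("https");
    String original = "café http://localhost:12345 done";
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.parseMediaType("text/plain;charset=ISO-8859-1"));
    // responseFactory is assumed to be an OperationResponseFactory, analogous
    // to the requestFactory used in the request-side test.
    OperationResponse response = this.responseFactory.create(HttpStatus.OK, headers,
            original.getBytes(StandardCharsets.ISO_8859_1));
    OperationResponse processed = this.preprocessor.preprocess(response);
    String result = new String(processed.getContent(), StandardCharsets.ISO_8859_1);
    assertThat(result).isEqualTo("café https://localhost:12345 done");
}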

Possible fix

One option would be to use the same charset for both decoding and encoding, so that the bytes that come out continue to match the encoding declared in the Content-Type header. For example:

@Override
public byte[] modifyContent(byte[] content, @Nullable MediaType contentType) {
    Charset charset = (contentType != null && contentType.getCharset() != null) ? contentType.getCharset()
            : Charset.defaultCharset();
    return modify(new String(content, charset)).getBytes(charset);
}

I'm not sure what the preferred fallback would be when no charset is declared — Charset.defaultCharset() would preserve the current behaviour in that branch, but a fixed StandardCharsets.UTF_8 might be more in line with
how REST Docs handles encoding elsewhere. Happy to go with whichever you'd prefer.
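
For reference, the UTF-8 variant would differ only in the fallback expression (illustrative; whichever default is preferred):

Charset charset = (contentType != null && contentType.getCharset() != null) ? contentType.getCharset()
        : StandardCharsets.UTF_8;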

Environment

  • Spring REST Docs: main branch, commit 631b6c22
  • Java 17, Gradle test JVM running with -Dfile.encoding=UTF-8
  • Reproduced locally via two new tests in UriModifyingOperationPreprocessorTests covering the request and response paths; both fail against the current code at UriModifyingOperationPreprocessor.java:204.

This may well be the same root cause as #1033, just in a different class. If the analysis looks reasonable, I'd be glad to put together a PR with a fix and regression tests for both the request and response paths — but happy to wait for your thoughts first.

