A small, dependency-light Java library to sanitize and normalize HTML document bodies.
It extracts the document body, sanitizes tags and attributes according to configurable rules,
converts inline style attributes into a deduplicated CSS block, and reassembles the cleaned HTML.
Designed for server-side use in Java web pipelines and batch processors.
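To make the first step of that pipeline concrete, here is a minimal sketch of isolating the body content. It is not the library's implementation: the library ships its own tokenizer-based parser, whereas this example swaps in a regular expression to stay short. The names `BodyExtractionSketch` and `extractBody` are hypothetical, and returning an empty string for null input is an assumption made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the library's API.
final class BodyExtractionSketch {

    // Matches <body ...> ... </body>, case-insensitively and across line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the inner content of the first <body> element,
    // or the input unchanged when no <body> element is present.
    // Assumption: null or empty input yields an empty string.
    static String extractBody(String html) {
        if (html == null || html.isEmpty()) {
            return "";
        }
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : html;
    }
}
```

A regex like this breaks on attribute values that contain `>`, which is one reason a real tokenizer is the more robust choice.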
- 🧩 Extracts and isolates content inside the `<body>` element.
- ⚙️ Configurable sanitization of allowed/disallowed tags and attributes.
- 🔁 Converts inline `style` attributes into a single generated CSS block with deduplication.
- 🛡️ Preserves token order and textual content while removing disallowed constructs.
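The style deduplication mentioned above can be pictured with a small sketch. It shows one possible approach, in which identical inline declarations share a single generated class name; whether the library uses generated classes, and how it names them, is not stated here. `StyleDedupSketch`, `classFor`, `toCss`, and the `s1`, `s2` naming scheme are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are not taken from the library.
final class StyleDedupSketch {

    // Normalized declaration text -> generated class name, in first-seen order.
    private final Map<String, String> classByStyle = new LinkedHashMap<>();

    // Returns the class name for an inline style value, reusing the
    // existing name when the same declarations were seen before.
    String classFor(String inlineStyle) {
        String key = normalize(inlineStyle);
        String name = classByStyle.get(key);
        if (name == null) {
            name = "s" + (classByStyle.size() + 1);
            classByStyle.put(key, name);
        }
        return name;
    }

    // Renders one CSS rule per distinct style, in first-seen order.
    String toCss() {
        StringBuilder css = new StringBuilder();
        for (Map.Entry<String, String> e : classByStyle.entrySet()) {
            css.append('.').append(e.getValue())
               .append(" { ").append(e.getKey()).append("; }\n");
        }
        return css.toString();
    }

    // Collapses whitespace and drops a trailing semicolon so that
    // "color: red;" and "color:red" map to the same key.
    private static String normalize(String style) {
        String s = style.trim().replaceAll("\\s+", " ");
        s = s.replaceAll("\\s*:\\s*", ":").replaceAll("\\s*;\\s*", ";");
        return s.endsWith(";") ? s.substring(0, s.length() - 1) : s;
    }
}
```

Calling `classFor` once per element and `toCss` once at the end yields a single CSS block with no repeated rules. The normalization is deliberately naive and does not reorder declarations, so `a:1;b:2` and `b:2;a:1` would still count as two distinct styles.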
src/main/java/cz/jpmad/htmlbodycleaner/
├── cli
│   └── Main.java                 # Command-line interface for processing HTML files.
├── config
│   └── SanitizationConfig.java   # Configuration class for sanitization rules.
├── parser
│   ├── HtmlToken.java            # Represents a token in the HTML document.
│   ├── HtmlTokenizer.java        # Tokenizer that converts HTML into a stream of tokens.
│   └── HtmlWriter.java           # Utility to write tokens back into HTML format.
├── sanitizer
│   ├── BodyExtractor.java        # Extracts content from the <body> element.
│   ├── StyleExtractor.java       # Extracts and deduplicates inline styles into a CSS block.
│   └── TagSanitizer.java         # Sanitizes tags and attributes based on configuration.
└── HtmlBodyCleaner.java          # Main class that orchestrates the cleaning process.
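The layout above suggests a tokenize, sanitize, write flow. The sketch below illustrates that idea in a few lines, under loud assumptions: it does not use `HtmlTokenizer`, `TagSanitizer`, or `HtmlWriter` (their signatures are not documented here), and `TagAllowlistSketch` and `sanitize` are hypothetical names. It splits the input into tag and text tokens with a regex, keeps text and allowed tags in their original order, and drops everything else.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a tokenize -> sanitize -> write flow;
// none of these names come from the library.
final class TagAllowlistSketch {

    // A token is either a tag (<...>) or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    // Captures the element name of an opening or closing tag.
    private static final Pattern TAG_NAME =
            Pattern.compile("^</?\\s*([a-zA-Z][a-zA-Z0-9]*)");

    static String sanitize(String html, Set<String> allowedTags) {
        StringBuilder out = new StringBuilder();
        Matcher tokens = TOKEN.matcher(html);
        while (tokens.find()) {
            String token = tokens.group();
            if (!token.startsWith("<")) {
                out.append(token); // text tokens are written back in order
                continue;
            }
            Matcher name = TAG_NAME.matcher(token);
            if (name.find()
                    && allowedTags.contains(name.group(1).toLowerCase(Locale.ROOT))) {
                out.append(token); // allowed tag is written back unchanged
            }
            // Anything else (disallowed tags, comments, doctype) is dropped.
        }
        return out.toString();
    }
}
```

For example, `TagAllowlistSketch.sanitize("<div><script>x</script><b>Hi</b></div>", Set.of("div", "b"))` returns `<div>x<b>Hi</b></div>`. The text inside the removed `<script>` element survives in this sketch; how the library treats the contents of disallowed elements is not specified here. Malformed input, such as a stray `<` without a closing `>`, is not handled.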
This library is published to GitHub Packages.
To use it in your project, you need to add the GitHub Packages repository and the dependency.
<repositories>
    <repository>
        <id>github</id>
        <url>https://maven.pkg.github.com/petrsafrata/HtmlBodyCleaner</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>cz.jpmad</groupId>
        <artifactId>html-body-cleaner</artifactId>
        <version>1.0.3</version>
    </dependency>
</dependencies>

repositories {
    maven {
        url = uri("https://maven.pkg.github.com/petrsafrata/HtmlBodyCleaner")
    }
}
dependencies {
    implementation("cz.jpmad:html-body-cleaner:1.0.3")
}
Simple example using the default configuration:
import cz.jpmad.htmlbodycleaner.HtmlBodyCleaner;

public class Example {
    public static void main(String[] args) {
        final String rawHtml = "<html><body><div style=\"color:red\">Hello</div></body></html>";
        final HtmlBodyCleaner cleaner = new HtmlBodyCleaner(); // or new HtmlBodyCleaner(config)
        final String cleaned = cleaner.clean(rawHtml);
        System.out.println(cleaned);
    }
}

Usage with CLI interface:
java -jar html-body-cleaner-1.0.3.jar input.html output.html

The library includes full unit tests covering:
- Allowed/disallowed tag handling
- Inline style extraction and CSS deduplication
- Edge cases (null/empty input, malformed HTML, nested disallowed tags)
You can run tests with:
mvn test

Contributions are welcome! Feel free to open issues or submit pull requests to improve the API, add features (e.g. CLI options, additional sanitization strategies, style extraction improvements), fix bugs, or expand the documentation. Please read the CONTRIBUTING.md file for details.
This project is open source and released under the Apache License 2.0. You are free to use, modify, and distribute it, including for commercial purposes, under the terms of that license. See the LICENSE file for full details.
Apache-2.0 – Copyright (c) 2025 Petr Šafrata