Conversation
- Scans build directory for files that are never referenced anywhere - Creates base filename mappings and searches for references - Excludes HTML files, robots.txt, favicon.ico, and sitemap.xml by default - Provides detailed reporting of potentially unused assets - Includes useful test script and warning about manual verification Features: - Skips binary files (images, fonts, etc.) to avoid parsing errors - Case-sensitive filename matching - Provides relative paths in the build directory - Exits with error code when unused assets are found Note: This is a naive checker that may have false positives/negatives
|
this replaces #152 |
There was a problem hiding this comment.
Pull request overview
This PR introduces an automated unused assets detection tool to help maintain a clean build directory by identifying orphaned files that are not referenced anywhere in the codebase.
Key Changes:
- Adds a new test script (
test/unused-assets.mjs) that scans the build directory for potentially unreferenced assets - Integrates the unused assets check into the main test suite via package.json
- Implements a naive string-matching approach that searches for base filenames across all build files
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/unused-assets.mjs | New script that identifies potentially unused assets by searching for base filename references across all build files, with configurable exclusion patterns and detailed reporting |
| package.json | Adds test:unused-assets script and integrates it into the main test command chain |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| function main() { | ||
| console.log("🔍 Scanning for unused assets in the build directory...\n"); | ||
|
|
||
| // Get all files | ||
| const allFiles = getAllBuildFiles(); | ||
| console.log(`ℹ️ Found ${allFiles.length} total files in build directory`); |
There was a problem hiding this comment.
The main() function doesn't check if the BUILD_DIR exists before scanning. If the build directory doesn't exist (e.g., before running yarn build), the script will fail ungracefully. Consider adding a check and providing a helpful error message:
if (!fs.existsSync(BUILD_DIR)) {
console.error(`❌ Error: Build directory not found at ${BUILD_DIR}`);
console.error('ℹ️ Please run "yarn build" first.');
process.exit(1);
}|
|
||
| // Check each remaining base name to see if it appears in this file | ||
| for (const baseName of unreferenced.keys()) { | ||
| if (content.includes(baseName)) { |
There was a problem hiding this comment.
Using content.includes(baseName) for searching could produce false positives if the base name is a substring of other content (e.g., searching for "app.js" would match "webapp.js" or "my-app.js-backup"). While this is acceptable for the stated "naive" approach and the script already documents this as a limitation, consider documenting this specific case in the limitations section at the top of the file.
| * Limitations (under reporting): | ||
| * 1. If the basename of an asset is mentioned anywhere (even if not actually used), it will NOT be reported | ||
| * | ||
| * Limitations (over reporting): | ||
| * 1. If another website or external source references the asset, it will be reported as unused | ||
| * 2. If the asset is referenced in a way that does not include the base filename (e.g., dynamically constructed paths), it will be reported as unused | ||
| * 3. If the asset is referenced with URL encoding or special characters, it may be reported as unused | ||
| * |
There was a problem hiding this comment.
The documentation states "Limitations (under reporting)" and "Limitations (over reporting)" but these terms may be confusing. Consider using clearer terminology:
- "False negatives" instead of "under reporting" (assets that ARE unused but NOT reported)
- "False positives" instead of "over reporting" (assets that ARE reported but NOT actually unused)
This would make it clearer what each limitation means.
| "build": "bundle exec jekyll build", | ||
| "generate-sitemap": "node scripts/generate-sitemap.mjs", | ||
| "test": "sh -c 'yarn test:html-validate \"$@\" && yarn test:dirty-file-paths-checker \"$@\"' --", | ||
| "test": "sh -c 'yarn test:html-validate \"$@\" && yarn test:dirty-file-paths-checker \"$@\" && yarn test:unused-assets' --", |
There was a problem hiding this comment.
[nitpick] The test:unused-assets script is not following the same pattern as other test scripts. While test:html-validate and test:dirty-file-paths-checker accept command-line arguments through "$@", this script doesn't pass arguments. For consistency, consider using:
"test": "sh -c 'yarn test:html-validate \"$@\" && yarn test:dirty-file-paths-checker \"$@\" && yarn test:unused-assets \"$@\"' --",
This would maintain consistency even if the unused-assets script doesn't currently use the arguments.
| "test": "sh -c 'yarn test:html-validate \"$@\" && yarn test:dirty-file-paths-checker \"$@\" && yarn test:unused-assets' --", | |
| "test": "sh -c 'yarn test:html-validate \"$@\" && yarn test:dirty-file-paths-checker \"$@\" && yarn test:unused-assets \"$@\"' --", |
There was a problem hiding this comment.
That is intentional because you must always consider the whole site to figure out which assets are not being used
|
|
||
| for (const file of allFiles) { | ||
| // Skip binary files that we can't meaningfully search | ||
| if (/\.(png|jpg|jpeg|gif|ico|woff|woff2|ttf|eot|otf)$/i.test(file)) { |
There was a problem hiding this comment.
The binary file exclusion pattern is missing several common binary file types that could cause the script to fail or behave unexpectedly when attempting to read them as UTF-8. Consider adding extensions like .webp, .svg, .pdf, .zip, .mp4, .webm, .mp3, .wav, etc.
Also note that .svg files are actually text-based XML and could be searched, so they might want to be excluded from the binary list while other formats should be added.
| if (/\.(png|jpg|jpeg|gif|ico|woff|woff2|ttf|eot|otf)$/i.test(file)) { | |
| if (/\.(png|jpg|jpeg|gif|ico|webp|pdf|zip|mp4|webm|mp3|wav|woff|woff2|ttf|eot|otf)$/i.test(file)) { |
| // Check each remaining base name to see if it appears in this file | ||
| for (const baseName of unreferenced.keys()) { | ||
| if (content.includes(baseName)) { | ||
| unreferenced.delete(baseName); | ||
|
|
||
| // Early exit if we've found all files | ||
| if (unreferenced.size === 0) { | ||
| return unreferenced; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Iterating over all remaining base names for every file could be inefficient when there are many assets and files. The current implementation has O(n*m) complexity where n is the number of files and m is the number of unreferenced assets. Consider optimizing by:
- Building a single search pattern with all base names (e.g., using a regex with alternation)
- Or at minimum, converting the
unreferenced.keys()to an array once before the outer loop to avoid repeated iterator creation
| // Check each remaining base name to see if it appears in this file | |
| for (const baseName of unreferenced.keys()) { | |
| if (content.includes(baseName)) { | |
| unreferenced.delete(baseName); | |
| // Early exit if we've found all files | |
| if (unreferenced.size === 0) { | |
| return unreferenced; | |
| } | |
| } | |
| } | |
| // Build a regex that matches any of the remaining base names | |
| const baseNames = Array.from(unreferenced.keys()); | |
| if (baseNames.length === 0) { | |
| return unreferenced; | |
| } | |
| // Escape regex special characters in base names | |
| const escapedBaseNames = baseNames.map(name => name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')); | |
| const pattern = new RegExp(escapedBaseNames.join('|'), 'g'); | |
| const matches = new Set(); | |
| let match; | |
| while ((match = pattern.exec(content)) !== null) { | |
| matches.add(match[0]); | |
| } | |
| for (const found of matches) { | |
| unreferenced.delete(found); | |
| } | |
| // Early exit if we've found all files | |
| if (unreferenced.size === 0) { | |
| return unreferenced; | |
| } |
|
|
||
| // Configurable: Regular expressions to exclude files from being checked as potential unused assets | ||
| // By default, we exclude HTML files since they are the primary content, not assets | ||
| // If there are assets in this site which are not referenced by filename |
There was a problem hiding this comment.
There's an incomplete comment that ends abruptly: "If there are assets in this site which are not referenced by filename". This should be completed or removed to maintain code clarity.
| // If there are assets in this site which are not referenced by filename |
- Updated canonical link URLs in experiment template source files - Removed trailing slashes from canonical links - Tests now pass successfully
Introduces a new test script to identify potentially unused assets in the build directory.
The script scans all files, creates a mapping of base filenames, and searches for references. Assets whose base filenames are never found are reported as potentially unused.
Features:
Run with: yarn test:unused-assets
Note: This is a naive checker. Always verify manually before deleting files.