Dedupe is a simple tool to deduplicate lines in text files. It processes files and directories, removing duplicate lines and optionally sorting the content.
When I worked with large text files, I often found myself needing to remove duplicate lines. I tried several existing tools, but they either lacked features I needed or were too slow for my use case. So I decided to create my own tool that would be fast and have the features I wanted.
- Deduplication of lines in text files
- Sorting of output file content
- Processing of directories (even recursively)
- Configurable memory usage
- Concurrent processing with multiple workers
- Progress indication and verbose logging
```
Usage of dedupe:
  -crossfile
        Enable cross-file deduplication mode (requires already deduplicated and sorted individual files)
  -delete
        Delete original file after successful deduplication
  -input string
        Input file or directory to process
  -long-threshold int
        Long Line Threshold (lines longer than this will be skipped if SkipLongLines is enabled) (default 2000)
  -max-memory uint
        Maximum total memory usage in Megabytes (default 8192)
  -nologo
        Disable printing the logo
  -output string
        Output directory for deduplicated files (if not overwriting originals)
  -overwrite
        Overwrite original files with deduplicated versions
  -recursive
        Recursively process directories
  -short-threshold int
        Short Line Threshold (lines shorter than or equal to this will be skipped if SkipShortLines is enabled) (default 6)
  -skip-long
        Skip very long lines exceeding Long Line Threshold
  -skip-short
        Skip very short lines below (or equal to) Short Line Threshold
  -sort
        Sort output file content alphabetically (default true)
  -strip
        Strip whitespace from lines before processing
  -verbose
        Enable verbose logging
  -workers int
        Number of concurrent workers for processing (default 6)
```
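The `-crossfile` flag requires inputs that are already deduplicated and sorted, and the reason is easy to see: sorted inputs can be merged by comparing only their current heads, so duplicates across files are detected without holding every line in memory. The Go sketch below shows that merge step for two in-memory line lists; it is an assumed illustration of the precondition, not dedupe's actual code:

```go
package main

import "fmt"

// mergeUnique merges two already-sorted, already-deduplicated line
// slices into one sorted, duplicate-free result. Because the inputs
// are sorted, only the current head of each needs comparing, so
// cross-file deduplication can stream through files of any size.
func mergeUnique(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // same line appears in both files: emit it once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	merged := mergeUnique([]string{"alpha", "beta"}, []string{"beta", "gamma"})
	fmt.Println(merged) // [alpha beta gamma]
}
```

If an input file were unsorted, a duplicate could appear after the merge had already passed its position, which is why the mode insists on pre-sorted files.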
I don't plan to add many more features to this tool as it already serves my needs.