Dedupe is a simple tool to deduplicate lines in text files. It processes files and directories, removing duplicate lines and optionally sorting the content.
When I worked with large text files, I often found myself needing to remove duplicate lines. I tried several existing tools, but they either lacked features I needed or were too slow for my use case. So I decided to create my own tool that would be fast and have the features I wanted.
- Deduplication of lines in text files
- Sorting of output file content
- Processing of directories (even recursively)
- Configurable memory usage
- Concurrent processing with multiple workers
- Progress indication and verbose logging
```
Usage of dedupe:
  -crossfile
        Enable cross-file deduplication mode (requires already deduplicated and sorted individual files)
  -delete
        Delete original file after successful deduplication
  -input string
        Input file or directory to process
  -long-threshold int
        Long Line Threshold (lines longer than this will be skipped if SkipLongLines is enabled) (default 2000)
  -max-memory uint
        Maximum total memory usage in Megabytes (default 8192)
  -nologo
        Disable printing the logo
  -output string
        Output directory for deduplicated files (if not overwriting originals)
  -overwrite
        Overwrite original files with deduplicated versions
  -recursive
        Recursively process directories
  -short-threshold int
        Short Line Threshold (lines shorter than or equal to this will be skipped if SkipShortLines is enabled) (default 6)
  -skip-long
        Skip very long lines exceeding Long Line Threshold
  -skip-short
        Skip very short lines below (or equal to) Short Line Threshold
  -sort
        Sort output file content alphabetically (default true)
  -strip
        Strip whitespace from lines before processing
  -verbose
        Enable verbose logging
  -workers int
        Number of concurrent workers for processing (default 6)
```
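The `-crossfile` flag requires inputs that are already deduplicated and sorted, and the reason is easy to see: sorted inputs can be merged by comparing only their current heads, so duplicates across files are detected without holding every line in memory. The Go sketch below shows that merge step for two in-memory line lists; it is an assumed illustration of the precondition, not dedupe's actual code:

```go
package main

import "fmt"

// mergeUnique merges two already-sorted, already-deduplicated line
// slices into one sorted, duplicate-free result. Because the inputs
// are sorted, only the current head of each needs comparing, so
// cross-file deduplication can stream through files of any size.
func mergeUnique(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // same line appears in both files: emit it once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	merged := mergeUnique([]string{"alpha", "beta"}, []string{"beta", "gamma"})
	fmt.Println(merged) // [alpha beta gamma]
}
```

If an input file were unsorted, a duplicate could appear after the merge had already passed its position, which is why the mode insists on pre-sorted files.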
I don't plan to add many more features to this tool as it already serves my needs.