tomventa/dedupe

Dedupe

Dedupe is a simple tool to deduplicate lines in text files. It processes files and directories, removing duplicate lines and optionally sorting the content.

Why

When working with large text files, I often needed to remove duplicate lines. I tried several existing tools, but they either lacked features I needed or were too slow for my use case, so I wrote my own tool that is fast and has the features I wanted.

Features

  • Deduplication of lines in text files
  • Sorting of output file content
  • Processing of directories (even recursively)
  • Configurable memory usage
  • Concurrent processing with multiple workers
  • Progress indication and verbose logging

Help

Usage of dedupe:
  -crossfile
    	Enable cross-file deduplication mode (requires already deduplicated and sorted individual files)
  -delete
    	Delete original file after successful deduplication
  -input string
    	Input file or directory to process
  -long-threshold int
    	Long Line Threshold (lines longer than this will be skipped if SkipLongLines is enabled) (default 2000)
  -max-memory uint
    	Maximum total memory usage in Megabytes (default 8192)
  -nologo
    	Disable printing the logo
  -output string
    	Output directory for deduplicated files (if not overwriting originals)
  -overwrite
    	Overwrite original files with deduplicated versions
  -recursive
    	Recursively process directories
  -short-threshold int
    	Short Line Threshold (lines shorter than or equal to this will be skipped if SkipShortLines is enabled) (default 6)
  -skip-long
    	Skip very long lines exceeding Long Line Threshold
  -skip-short
    	Skip very short lines below (or equal to) Short Line Threshold
  -sort
    	Sort output file content alphabetically (default true)
  -strip
    	Strip whitespace from lines before processing
  -verbose
    	Enable verbose logging
  -workers int
    	Number of concurrent workers for processing (default 6)

Next Steps

I don't plan to add many more features to this tool as it already serves my needs.
