Fallen-Breath/pyfastcdc

PyFastCDC

A high-performance FastCDC 2020 implementation written in Python + Cython

Supports Python 3.6+. Provides prebuilt wheels for Python 3.8+

Its core algorithm is a direct port of the v2020 module from nlfiedler/fastcdc-rs, so the output of PyFastCDC exactly matches the output of nlfiedler/fastcdc-rs

Installation

PyFastCDC is available on PyPI, with prebuilt wheels for many common platforms thanks to cibuildwheel

To install, you can use pip or any other Python package manager you prefer:

pip install pyfastcdc

For platforms without prebuilt wheels, a suitable build environment capable of compiling Python extension modules is required. For example, on Debian, you might need to install gcc and python3-dev via apt

If the Cython extension fails to compile, the installation falls back to a pure-Python implementation, which is significantly slower (about 0.5% of the Cython in-memory chunking speed, or less)

I only want to use the Cython implementation, not the slow pure-Python one

Set the environment variable PYFASTCDC_REQUIRE_CYTHON=true or PYFASTCDC_REQUIRE_CYTHON=1 when running the pip install command. This disables the pure-Python fallback, so the installation fails hard if the extension cannot be compiled. A successful installation is therefore guaranteed to include a working Cython extension

Example bash command using pip:

$ PYFASTCDC_REQUIRE_CYTHON=true pip install pyfastcdc
... some pip output ...
Building wheels for collected packages: pyfastcdc
  Building wheel for pyfastcdc (pyproject.toml) ... error
  ... some pip output ...
  ###########################################################################################################
  Failed to compile pyfastcdc Cython extension, fail hard since PYFASTCDC_REQUIRE_CYTHON is set to true
  Unset PYFASTCDC_REQUIRE_CYTHON to allow pure-Python fallback if that's acceptable for your use case.
  <class 'distutils.compilers.C.errors.CompileError'> command 'gcc' failed: No such file or directory
  ###########################################################################################################
  error: command 'gcc' failed: No such file or directory
  ----------------------------------------
  ERROR: Failed building wheel for pyfastcdc
  ... some pip output ...

Usage

The usage of PyFastCDC is simple:

  1. Construct a FastCDC instance with the desired parameters
  2. Call one of the FastCDC.cut_xxx() methods to chunk your data
    • Call cut_buf() to chunk in-memory data buffers
    • Call cut_file() to chunk a regular file using mmap
    • Call cut_stream() to chunk a custom file-like streaming object
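The choice between the three methods can be sketched as a small dispatch helper. This is only an illustration: `cut_any` is a hypothetical name that is not part of PyFastCDC, and it assumes that each `cut_xxx()` method accepts its input as a single positional argument, as `cut_file('archive.tar')` does in the example in this section. Check the docstrings for the actual signatures.

```python
import io
import os
from typing import Any


def cut_any(chunker: Any, source: Any) -> Any:
    """Hypothetical helper: pick the cut_xxx() method matching the input kind.

    `chunker` is expected to be a FastCDC instance (or any object exposing
    cut_buf / cut_file / cut_stream). The single-argument call style is an
    assumption based on the cut_file() example in this README.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        # In-memory data buffer
        return chunker.cut_buf(source)
    if isinstance(source, (str, os.PathLike)):
        # Path to a regular file, chunked using mmap
        return chunker.cut_file(source)
    if hasattr(source, 'read'):
        # Custom file-like streaming object
        return chunker.cut_stream(source)
    raise TypeError('unsupported input type: {}'.format(type(source).__name__))
```

With a real instance this would be used as `cut_any(FastCDC(16384), b'...')`, assuming the signatures match the assumption above.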

Example:

import hashlib
from pyfastcdc import FastCDC

for chunk in FastCDC(16384).cut_file('archive.tar'):
	print(chunk.offset, chunk.length, hashlib.sha256(chunk.data).hexdigest())
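A common follow-up to the example above is to use the chunk hashes for deduplication. The sketch below is a hypothetical helper, not part of PyFastCDC. It relies only on the `length` and `data` attributes used in the example above, so it works with any iterable of such objects. `FakeChunk` is a stand-in so the snippet runs without the library installed.

```python
import hashlib
from collections import namedtuple
from typing import Any, Iterable, Tuple


def dedup_stats(chunks: Iterable[Any]) -> Tuple[int, int]:
    """Return (total_bytes, unique_bytes) for an iterable of chunks.

    Two chunks count as duplicates when their SHA-256 digests are equal.
    """
    seen = set()
    total_bytes = 0
    unique_bytes = 0
    for chunk in chunks:
        total_bytes += chunk.length
        digest = hashlib.sha256(chunk.data).digest()
        if digest not in seen:
            seen.add(digest)
            unique_bytes += chunk.length
    return total_bytes, unique_bytes


# Stand-in for the chunk objects, exposing the attributes used above
FakeChunk = namedtuple('FakeChunk', ['offset', 'length', 'data'])

demo_chunks = [
    FakeChunk(0, 3, b'abc'),
    FakeChunk(3, 3, b'xyz'),
    FakeChunk(6, 3, b'abc'),  # same content as the first chunk
]
total, unique = dedup_stats(demo_chunks)
```

With the library installed, the same function could be fed `FastCDC(16384).cut_file('archive.tar')` directly.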

See the docstrings of the objects exported by the pyfastcdc module for more API details

Please only import members from pyfastcdc in your application code and avoid importing inner modules (e.g. pyfastcdc.common) directly. Only public APIs inside the pyfastcdc module are guaranteed to be stable across releases

from pyfastcdc import NormalizedChunking         # GOOD
from pyfastcdc.common import NormalizedChunking  # BAD, no API stability guarantee

Performance

With the help of Cython, PyFastCDC can achieve near-native performance when chunking inputs

[Figure: benchmark chart]

Each test was run 10 times and the results were averaged. The maximum in-memory chunking speed reached about 4.8 GB/s
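The averaging approach can be sketched with a small timing harness. This is a hypothetical illustration and not the project's scripts/benchmark.py: `average_throughput` and `fixed_size_chunks` are made-up names, and the fixed-size splitter is only a stand-in so the snippet runs without PyFastCDC installed.

```python
import time
from typing import Callable, Iterable, Iterator


def average_throughput(chunk_fn: Callable[[bytes], Iterable], data: bytes, runs: int = 10) -> float:
    """Run chunk_fn over data `runs` times and return the average bytes per second."""
    if runs <= 0:
        raise ValueError('runs must be positive')
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        for _chunk in chunk_fn(data):
            pass  # consume the iterator so every chunk is actually produced
        total_seconds += time.perf_counter() - start
    if total_seconds <= 0:
        return float('inf')  # timer resolution too coarse for this input
    return len(data) * runs / total_seconds


def fixed_size_chunks(data: bytes, size: int = 4096) -> Iterator[bytes]:
    """Stand-in chunker: fixed-size slices, NOT content-defined chunking."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


throughput = average_throughput(fixed_size_chunks, b'\x00' * (1 << 20), runs=10)
```

To time the real chunker, `chunk_fn` could be something like `FastCDC(16384).cut_buf`, assuming it accepts the buffer as its only argument.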

Benchmark details

FastCDC parameters:

  • avg_size: Independent variable
  • min_size: avg_size / 4 (default)
  • max_size: avg_size * 4 (default)
  • normalized_chunking: 1 (default)
  • seed: 0 (default)
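As a worked example of the default size bounds listed above, the following sketch restates the two formulas in code. `default_size_bounds` is a hypothetical helper, not a PyFastCDC function, and the use of integer division is an assumption.

```python
from typing import Tuple


def default_size_bounds(avg_size: int) -> Tuple[int, int]:
    """Return (min_size, max_size) following the defaults listed above.

    min_size = avg_size / 4 (integer division assumed), max_size = avg_size * 4.
    """
    return avg_size // 4, avg_size * 4


min_size, max_size = default_size_bounds(16384)  # (4096, 65536)
```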

Test environment:

  • PyFastCDC 0.2.0b1, precompiled wheel from Test PyPI, Cython 3.2.4
  • Python 3.11.14 using docker image python:3.11
  • Ryzen 7 6800H @ 4.55GHz, NVMe SSD, Debian 13.2

Test files:

Test command:

cd scripts
python benchmark.py --test-files rand_10G.bin AlmaLinux-10.1-x86_64-dvd.iso llvmorg-21.1.8.tar

Difference from iscc/fastcdc-py

This project is inspired by iscc/fastcdc-py, but differs in the following ways:

  1. Based on nlfiedler/fastcdc-rs, using its FastCDC 2020 implementation aligned with the original paper, rather than the simplified ronomon implementation
  2. Supports multiple types of input, including in-memory data buffers, regular files via mmap, and custom streaming input
  3. Does not include any CLI tool. It provides only the core FastCDC functionality

License

MIT
