
Add support for generic creation of sparse entries#419

Open
jebradbury39 wants to merge 9 commits into composefs:main from jebradbury39:main

@jebradbury39
jebradbury39 commented Nov 7, 2025

Problem

Currently, the safest way to create a sparse entry in the Builder is via append_file, which relies on having the data present on the filesystem (at best, you can use tmpfs). However, this has some limitations:

  • You are stuck with whatever your filesystem supports (w.r.t. size/offsets), when in reality your tar archive might be getting streamed to another destination which CAN support those size/offsets.
  • You're required to hit the disk for these types of data (e.g. you have some data in memory, which you then need to write to a file, call append_file, and then remove that file).

Solution

Added a new function to the Builder called append_sparse_data, which takes a data reader that implements a new SeekSparse trait. Internally, files on supported Unix systems implement this trait (to avoid code duplication), but the larger point is that anyone using the tar crate can now implement SeekSparse on their own data types.
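To make the shape of the idea concrete, here is a minimal sketch of a trait in the SEEK_DATA / SEEK_HOLE style, with an in-memory implementation. The names SparseSeek, MemExtents, and data_segments are hypothetical and are not the signatures from this PR; the Read half of the real trait is left out to keep the example short.

```rust
/// Hypothetical trait, loosely modelled on SEEK_DATA / SEEK_HOLE.
/// This is NOT the SeekSparse trait from the PR, only an illustration.
pub trait SparseSeek {
    /// First offset >= `from` that holds data, or None if only holes remain.
    fn seek_data(&self, from: u64) -> Option<u64>;
    /// First offset >= `from` that lies in a hole (end of data counts as a hole).
    fn seek_hole(&self, from: u64) -> u64;
}

/// In-memory sparse source: a list of (offset, bytes) extents.
pub struct MemExtents {
    extents: Vec<(u64, Vec<u8>)>,
}

impl MemExtents {
    /// Drops empty extents and sorts by offset. Extents are assumed not to overlap.
    pub fn new(mut extents: Vec<(u64, Vec<u8>)>) -> Self {
        extents.retain(|(_, data)| !data.is_empty());
        extents.sort_by_key(|(off, _)| *off);
        MemExtents { extents }
    }
}

impl SparseSeek for MemExtents {
    fn seek_data(&self, from: u64) -> Option<u64> {
        for (off, data) in &self.extents {
            let end = off + data.len() as u64;
            if from < end {
                return Some(from.max(*off));
            }
        }
        None
    }

    fn seek_hole(&self, from: u64) -> u64 {
        let mut pos = from;
        for (off, data) in &self.extents {
            let end = off + data.len() as u64;
            // Skip over any extent that contains the current position;
            // touching extents are walked through in sequence.
            if pos >= *off && pos < end {
                pos = end;
            }
        }
        pos
    }
}

/// Walks the source and returns its data runs as (offset, length) pairs.
pub fn data_segments<S: SparseSeek>(src: &S) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(start) = src.seek_data(pos) {
        let end = src.seek_hole(start);
        out.push((start, end - start));
        pos = end;
    }
    out
}
```

A caller would build a MemExtents from whatever extents it has in memory and pass it to data_segments to obtain the (offset, length) list that the sparse map needs.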

Note that tar requires blocks to be 512-byte aligned. With files you get this for free, because data and holes are block-aligned by the filesystem, and the filesystem block size is generally some multiple of 512. With SeekSparse data there are no restrictions on data alignment, which means that some holes may be encoded as zeros (or whatever "empty" byte the SeekSparse Read impl returns when a caller reads from a hole). This adds some complexity to the find_sparse_entries_seek function. I think this is worthwhile, because it frees the caller of append_sparse_data from having to care about what block alignment tar uses (which is an internal implementation detail of tar).
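The alignment step can be sketched roughly as follows. This is an illustration of the idea only, not the code of find_sparse_entries_seek; the function name align_segments and the (offset, length) tuple representation are assumptions. Each data run is widened to 512-byte boundaries (so the slack becomes hole bytes that are stored explicitly), and runs that touch or overlap after widening are merged.

```rust
/// tar block size in bytes.
const BLOCK: u64 = 512;

/// Hypothetical helper: widen sorted, non-overlapping (offset, length) data runs
/// to 512-byte boundaries and merge runs that touch or overlap afterwards.
/// Runs are assumed to lie within `logical_len`; the last run is clamped to it.
pub fn align_segments(segments: &[(u64, u64)], logical_len: u64) -> Vec<(u64, u64)> {
    let mut out: Vec<(u64, u64)> = Vec::new();
    for &(off, len) in segments {
        if len == 0 {
            continue;
        }
        // Round the start down and the end up to a block boundary.
        let start = off / BLOCK * BLOCK;
        let end = ((off + len + BLOCK - 1) / BLOCK * BLOCK).min(logical_len);
        if let Some(last) = out.last_mut() {
            let last_end = last.0 + last.1;
            if start <= last_end {
                // The widened run touches the previous one: extend it instead.
                last.1 = end.max(last_end) - last.0;
                continue;
            }
        }
        out.push((start, end - start));
    }
    out
}
```

For example, the runs (10, 20) and (500, 100) both widen into the first two blocks and collapse into a single run (0, 1024); the bytes between them are hole bytes that end up in the archive as zeros.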

@xzfc
Collaborator

xzfc commented Nov 23, 2025

The new trait adds a lot of complexity. Can your use case be solved by simpler means, e.g. passing a slice of SparseEntry alongside the &[u8] data?

@jebradbury39
Author

One of the objectives here is to avoid holding all the data in memory, which is why I didn't want to rely on passing a byte slice. By using a read-based trait, the implementation of the trait can choose to limit the memory that gets loaded.

@xzfc
Collaborator

xzfc commented Nov 23, 2025

One of my concerns is that this new trait mirrors the Linux-style SEEK_DATA / SEEK_HOLE semantics (as stated in the doc comment). I'm not sure whether other sparse APIs (e.g. Linux's older FIEMAP, Windows' FSCTL_QUERY_ALLOCATED_RANGES) would fit into this API well if we decide to support them.

Another concern is that the new abstraction feels too complex and indirect to me, compared to the simplicity of the underlying format (which encodes a list of (offset, length) pairs).


I agree with the problem statement in your PR description. But I'd like to avoid adding an overly generalized API, to keep the maintenance burden down.

I'm thinking of the following approaches as alternatives:

  • A method that accepts sparse_entries: &[SparseEntry] and dense_data: &[u8].
  • A method that accepts sparse_entries: &[SparseEntry] and returns an EntryWriter to write dense data.

Both of these would require allocation for the sparse entries (and perhaps for the dense data), making them unsuitable for some use cases. But for such cases, a custom tar writer might be a better fit than the general-purpose tar-rs Builder.

Could you please list your use cases for this feature? This would help us prioritize what we want to support.
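As a rough sketch of what the second approach might look like from the caller's side: the sparse map is supplied up front, and the returned writer only accepts the dense bytes. Segment and DenseWriter below are hypothetical stand-ins, not tar-rs types, and header emission is omitted entirely. The sketch only requires Write on the underlying sink, never Seek.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an (offset, length) sparse map entry.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Segment {
    pub offset: u64,
    pub length: u64,
}

/// Hypothetical writer that accepts exactly as many dense bytes as the
/// sparse map declares, then pads the entry to a 512-byte boundary.
pub struct DenseWriter<W: Write> {
    inner: W,
    remaining: u64,
    written: u64,
}

impl<W: Write> DenseWriter<W> {
    pub fn new(inner: W, segments: &[Segment]) -> Self {
        let total: u64 = segments.iter().map(|s| s.length).sum();
        DenseWriter { inner, remaining: total, written: 0 }
    }

    /// Checks that all declared bytes arrived, writes padding, returns the sink.
    pub fn finish(mut self) -> io::Result<W> {
        if self.remaining != 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "dense data is shorter than the sparse map declares",
            ));
        }
        let pad = (512 - self.written % 512) % 512;
        self.inner.write_all(&vec![0u8; pad as usize])?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for DenseWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() as u64 > self.remaining {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "more dense data than the sparse map declares",
            ));
        }
        let n = self.inner.write(buf)?;
        self.remaining -= n as u64;
        self.written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The sparse entries still have to be known (and allocated) before the first dense byte is written, which is the trade-off mentioned above; the dense data itself can be streamed in small chunks.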

@jebradbury39
Author

The second option might work, as long as the EntryWriter works with a writer that has only Write and not Seek.

My main use case is reading parts of procfs mem for various pids (which is essentially sparse data) and streaming this as tar sparse entries (one mem entry per pid) to stdout (hence the requirement that the writer must not require Seek). There's a twist: I may not be reading procfs directly, since the data may be streamed in from another process via a thin protocol over TCP, one resident page at a time. Stdout may then be piped to other processes. I need to keep memory usage relatively low during this (some of the pids have multi-GB resident mem).
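To illustrate how page-at-a-time input relates to the (offset, length) sparse map, here is a small sketch that coalesces a sorted list of resident page offsets into contiguous runs. The 4096-byte page size and the function name coalesce_pages are assumptions made for the example; only the offsets need to be held in memory for this step, not the page contents.

```rust
/// Assumed page size for the example; real systems may differ.
pub const PAGE_SIZE: u64 = 4096;

/// Hypothetical helper: turn sorted, page-aligned offsets of resident pages
/// into (offset, length) runs, merging pages that are directly adjacent.
pub fn coalesce_pages(page_offsets: &[u64]) -> Vec<(u64, u64)> {
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for &off in page_offsets {
        if let Some(last) = runs.last_mut() {
            if last.0 + last.1 == off {
                // This page starts where the previous run ends: extend the run.
                last.1 += PAGE_SIZE;
                continue;
            }
        }
        runs.push((off, PAGE_SIZE));
    }
    runs
}
```

Because 4096 is a multiple of 512, runs built this way are already aligned to tar's block size.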

I did initially try to just create a custom tar writer, but found that tar-rs did not expose enough information to make this practical (exposing certain APIs to enable custom tar writing might be another solution), and creating a tar-rs alternative felt like the wrong way to go. I did notice that tar-rs already supports files on disk, and I was considering creating a temp sparse file, but that felt like an odd/inefficient solution, plus some filesystems may impose limitations on the logical size of a file.
