LakeGen


🌊 LakeGen is a Python library for quickly generating data for testing lakehouse architectures, benchmarking, and demos.

Data Generation

  • TPC-H data generation is provided via the [tpchgen-rs](https://github.com/clflushopt/tpchgen-rs) project, which is currently about 10x+ faster than the next-closest method of generating TPC-H datasets. A TPC-DS version of that project is under development.

    The following are generation runtimes on a 64-vCore VM writing to OneLake. Scale factors below 1000 can easily be generated on a 2-vCore machine.

    Scale Factor | Duration (hh:mm:ss)
    ------------ | -------------------
    1            | 00:00:04
    10           | 00:00:09
    100          | 00:01:09
    1000         | 00:07:15
    10000        | 01:10:03 (multithreading disabled)
  • TPC-DS data generation is provided via the DuckDB TPC-DS extension. The LakeGen wrapper around DuckDB adds support for writing out Parquet files at a target row-group size, since the files DuckDB generates natively are atypically small (e.g. 10 MB) and are only suitable for ultra-small-scale scenarios. LakeGen targets 128 MB row groups by default; this can be configured via the target_row_group_size_mb parameter of both the TPC-H and TPC-DS DataGenerator classes.

  • ClickBench data is downloaded directly from the ClickHouse host site.

  • Shipments is a generator that produces high-volume shipment and scan events for streaming use cases. It currently writes JSON files to your storage of choice; message brokers like EventHub and Kafka will be supported soon.
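The idea behind the target_row_group_size_mb setting above can be sketched as a simple sizing estimate: divide the target row-group size by an average encoded row size to pick a row count per row group. This is an illustrative sketch only; estimate_rows_per_row_group is a hypothetical helper, not part of the LakeGen API.

```python
def estimate_rows_per_row_group(avg_row_bytes: int, target_row_group_size_mb: int = 128) -> int:
    """Estimate how many rows fit in a row group of the target size.

    A wrapper like LakeGen's can feed such an estimate to the Parquet
    writer so output files are not atypically small (e.g. 10 MB).
    """
    target_bytes = target_row_group_size_mb * 1024 * 1024
    return max(1, target_bytes // avg_row_bytes)

# e.g. with ~256-byte rows, a 128 MB target yields 524,288 rows per row group
print(estimate_rows_per_row_group(256))  # 524288
```

In practice the average row size varies per table, which is why a size-based target (rather than a fixed row count) keeps output files consistent across tables.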

TPC-H Data Generation

from lakegen import TPCHGen

datagen = TPCHGen(
    scale_factor=1,
    target_folder_uri='/lakehouse/default/Files/tpch_sf1'
)
datagen.run()

TPC-DS Data Generation

from lakegen import TPCDSGen

datagen = TPCDSGen(
    scale_factor=1,
    target_folder_uri='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()

Notes:

  • TPC-DS data up to SF1000 can be generated on a 32-vCore machine.
  • TPC-H datasets are generated extremely fast (e.g. SF1000 in under 10 minutes on a 64-vCore machine).
  • The ClickBench dataset (available in only one size) downloads in about 1 minute as partitioned files, or about 6 minutes as a single file.

McMillan Industrial Group - Kafka/EventHub Support

The McMillan Industrial Group generator now supports writing data to Kafka/EventHub endpoints in addition to JSON and Parquet files. Configure output types per table using the output_type_map parameter.

from lakegen.generators.mcmillan_industrial_group import McMillanDataGen

# Kafka connection string (EventHub format)
connection_string = "Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=key;SharedAccessKey=value;EntityPath=topic"

# Configure mixed outputs: some tables to Kafka, others to files
output_type_map = {
    "shipment": "kafka",
    "order": "kafka",
    "customer": "json",
    "item": "parquet",
}

gen = McMillanDataGen(
    target_folder_uri="./output",  # For file outputs
    kafka_connection_string=connection_string,  # For Kafka outputs
    output_type_map=output_type_map,
    concurrent_threads=2,
    max_events_per_second=100
)

gen.start()

Each table writes to a Kafka topic named after the table (e.g., shipment → topic shipment). Messages contain batches of records in JSON format with a metadata wrapper.

Important Notes:

  • EventHub Compatibility: The Kafka writer auto-detects the API version for proper EventHub compatibility. This prevents message corruption when reading from EventHub.
  • Single Topic: When using EventHub, all messages are sent to the EntityPath topic from the connection string. The table name is preserved in the message metadata (recordType field).
  • Message Size Limits: Large batches are automatically split into 100-record chunks to respect EventHub's 1MB message size limit.
  • Batch Metadata: Each message includes batchIndex, batchSize, and totalRecords fields for reassembly downstream.
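A downstream consumer can use those metadata fields to reassemble a split batch. The sketch below is illustrative, not the library's actual writer: the field names recordType, batchIndex, batchSize, and totalRecords come from the notes above, but the overall wrapper shape (including the records key) is an assumption.

```python
import json

CHUNK_SIZE = 100  # the writer is described as splitting batches into 100-record chunks

def split_batch(records, record_type):
    """Split a batch into chunked messages, each with a metadata wrapper."""
    chunks = [records[i:i + CHUNK_SIZE] for i in range(0, len(records), CHUNK_SIZE)]
    return [
        json.dumps({
            "recordType": record_type,     # table name preserved per message
            "batchIndex": i,               # position of this chunk in the batch
            "batchSize": len(chunk),       # number of records in this chunk
            "totalRecords": len(records),  # number of records in the full batch
            "records": chunk,              # assumed key for the payload itself
        })
        for i, chunk in enumerate(chunks)
    ]

def reassemble(messages):
    """Reorder chunks by batchIndex and concatenate their records."""
    parsed = sorted((json.loads(m) for m in messages), key=lambda m: m["batchIndex"])
    records = [r for m in parsed for r in m["records"]]
    assert len(records) == parsed[0]["totalRecords"], "missing chunks"
    return records

batch = [{"id": i} for i in range(250)]
msgs = split_batch(batch, "shipment")
print(len(msgs))                   # 3 chunks (100 + 100 + 50)
print(reassemble(msgs) == batch)   # True, even if msgs arrive out of order
```

Because all EventHub messages land on a single EntityPath topic, the recordType field is what lets a consumer route chunks back to the correct table before reassembly.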
