LakeGen


🌊 LakeGen is a Python library for quickly generating data for testing lakehouse architectures, benchmarking, and demos.

Data Generation

  • TPC-H data generation is provided via the [tpchgen-rs](https://github.com/clflushopt/tpchgen-rs) project, which is currently about 10x+ faster than the next-closest method of generating TPC-H datasets. A TPC-DS version of that project is under development.

    The following are generation runtimes on a 64-vCore VM writing to OneLake. Scale factors below 1000 can easily be generated on a 2-vCore machine.

    Scale Factor | Duration (hh:mm:ss)
    ------------ | -------------------
    1            | 00:00:04
    10           | 00:00:09
    100          | 00:01:09
    1000         | 00:07:15
    10000        | 01:10:03 (multithreading disabled)
  • TPC-DS data generation is provided via the DuckDB TPC-DS extension. The LakeGen wrapper around DuckDB adds support for writing out Parquet files at a target row-group size, since the files DuckDB generates natively are atypically small (e.g. 10 MB) and are only suitable for ultra-small-scale scenarios. LakeGen targets 128 MB row groups by default; this can be configured via the target_row_group_size_mb parameter of both the TPC-H and TPC-DS DataGenerator classes.

  • ClickBench data is downloaded directly from the ClickHouse host site.

  • Shipments is a generator that produces high-volume shipment and scan events for streaming use cases. It currently writes JSON files to your storage of choice; message brokers like EventHub and Kafka will be supported soon.
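The idea behind the target_row_group_size_mb setting above can be sketched as a simple sizing estimate: divide the target row-group size by an average encoded row size to pick a row count per row group. This is an illustrative sketch only; estimate_rows_per_row_group is a hypothetical helper, not part of the LakeGen API.

```python
def estimate_rows_per_row_group(avg_row_bytes: int, target_row_group_size_mb: int = 128) -> int:
    """Estimate how many rows fit in a row group of the target size.

    A wrapper like LakeGen's can feed such an estimate to the Parquet
    writer so output files are not atypically small (e.g. 10 MB).
    """
    target_bytes = target_row_group_size_mb * 1024 * 1024
    return max(1, target_bytes // avg_row_bytes)

# e.g. with ~256-byte rows, a 128 MB target yields 524,288 rows per row group
print(estimate_rows_per_row_group(256))  # 524288
```

In practice the average row size varies per table, which is why a size-based target (rather than a fixed row count) keeps output files consistent across tables.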

TPC-H Data Generation

from lakegen import TPCHGen

datagen = TPCHGen(
    scale_factor=1,
    target_folder_uri='/lakehouse/default/Files/tpch_sf1'
)
datagen.run()

TPC-DS Data Generation

from lakegen import TPCDSGen

datagen = TPCDSGen(
    scale_factor=1,
    target_folder_uri='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()

Notes:

  • TPC-DS data up to SF1000 can be generated on a 32-vCore machine.
  • TPC-H datasets are generated extremely fast (e.g. SF1000 in under 10 minutes on a 64-vCore machine).
  • The ClickBench dataset (available in only one size) downloads in about 1 minute as partitioned files, or about 6 minutes as a single file.

McMillan Industrial Group - Kafka/EventHub Support

The McMillan Industrial Group generator now supports writing data to Kafka/EventHub endpoints in addition to JSON and Parquet files. Configure output types per table using the output_type_map parameter.

from lakegen.generators.mcmillan_industrial_group import McMillanDataGen

# Kafka connection string (EventHub format)
connection_string = "Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=key;SharedAccessKey=value;EntityPath=topic"

# Configure mixed outputs: some tables to Kafka, others to files
output_type_map = {
    "shipment": "kafka",
    "order": "kafka",
    "customer": "json",
    "item": "parquet",
}

gen = McMillanDataGen(
    target_folder_uri="./output",  # For file outputs
    kafka_connection_string=connection_string,  # For Kafka outputs
    output_type_map=output_type_map,
    concurrent_threads=2,
    max_events_per_second=100
)

gen.start()

Each table writes to a Kafka topic named after the table (e.g., shipment → topic shipment). Messages contain batches of records in JSON format with a metadata wrapper.

Important Notes:

  • EventHub Compatibility: The Kafka writer auto-detects the API version for proper EventHub compatibility. This prevents message corruption when reading from EventHub.
  • Single Topic: When using EventHub, all messages are sent to the EntityPath topic from the connection string. The table name is preserved in the message metadata (recordType field).
  • Message Size Limits: Large batches are automatically split into 100-record chunks to respect EventHub's 1MB message size limit.
  • Batch Metadata: Each message includes batchIndex, batchSize, and totalRecords fields for reassembly downstream.
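A downstream consumer can use those metadata fields to reassemble a split batch. The sketch below is illustrative, not the library's actual writer: the field names recordType, batchIndex, batchSize, and totalRecords come from the notes above, but the overall wrapper shape (including the records key) is an assumption.

```python
import json

CHUNK_SIZE = 100  # the writer is described as splitting batches into 100-record chunks

def split_batch(records, record_type):
    """Split a batch into chunked messages, each with a metadata wrapper."""
    chunks = [records[i:i + CHUNK_SIZE] for i in range(0, len(records), CHUNK_SIZE)]
    return [
        json.dumps({
            "recordType": record_type,     # table name preserved per message
            "batchIndex": i,               # position of this chunk in the batch
            "batchSize": len(chunk),       # number of records in this chunk
            "totalRecords": len(records),  # number of records in the full batch
            "records": chunk,              # assumed key for the payload itself
        })
        for i, chunk in enumerate(chunks)
    ]

def reassemble(messages):
    """Reorder chunks by batchIndex and concatenate their records."""
    parsed = sorted((json.loads(m) for m in messages), key=lambda m: m["batchIndex"])
    records = [r for m in parsed for r in m["records"]]
    assert len(records) == parsed[0]["totalRecords"], "missing chunks"
    return records

batch = [{"id": i} for i in range(250)]
msgs = split_batch(batch, "shipment")
print(len(msgs))                   # 3 chunks (100 + 100 + 50)
print(reassemble(msgs) == batch)   # True, even if msgs arrive out of order
```

Because all EventHub messages land on a single EntityPath topic, the recordType field is what lets a consumer route chunks back to the correct table before reassembly.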
