E2E Benchmarking

General

NebulaStream supports end-to-end (e2e) benchmarking without writing any additional code. To enable the e2e benchmarking framework, build NebulaStream with the CMake flag -DNES_BUILD_BENCHMARKS=1. A benchmark is configured entirely in a YAML file. During a benchmark run, NebulaStream automatically tracks throughput and latency and writes the values as comma-separated values to a user-definable file.

A typical workflow consists of the following steps:

Build NebulaStream with -DNES_BUILD_BENCHMARKS=1
Create a yaml configuration file
Run e2e-benchmark-main with the path to the created yaml file and the path to a log file as command-line arguments, e.g. ./e2e-benchmark-main --configPath=/path/to/file --logPath=logger.log
Analyze measurements and plot figures from the csv file

Configuration

As NebulaStream is still under development, this is just a snapshot of the current configuration options. New options might get added or existing options removed. At the beginning of each benchmark, all created runs are written to the logger.

The configuration file is divided into two parts:

Configurations that change per run
Configurations that stay constant over the whole benchmark

# ~~~ Configurations for single run ~~~
numberOfWorkerThreads: 1, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000
numberOfPreAllocatedBuffer: 1000


# Benchmark parameter for the entire run
logLevel: LOG_INFO
experimentMeasureIntervalInSeconds: 1
startupSleepIntervalInSeconds: 1
numberOfMeasurementsToCollect: 3

logicalSources:
  - name: input1
    type: Default
  - name: input2
    type: Zipfian
    numberOfPhysicalSources: 3, 2, 1
    alpha: 0.99
    minValue: 0
    maxValue: 1000

dataProvider:
  name: External
  ingestionRateCount: 1000
  ingestionRateDistribution:
    type: Sinus
    numberOfPeriods: 64
    ingestionRateInBuffers: 50000
    
inputType: MemoryMode
dataProviderMode: ZeroCopy
outputFile: FilterOneSource.csv
benchmarkName: FilterOneSource
query: 'Query::from("input1").filter(Attribute("value") < 100).sink(NullOutputSinkDescriptor::create());'

As a design choice, we do not compute the cross-product of all options that change per run. Instead, a list is created for each option, and all lists are iterated simultaneously. If one option has fewer values than another, its last value is repeated. In the example, three runs are created because bufferSizeInBytes has the most values (three). numberOfWorkerThreads, numberOfBuffersToProduce and numberOfPreAllocatedBuffer are padded to a size of three. Thus, the three runs use the options displayed below.

numberOfWorkerThreads: 1, 2, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000, 10000000, 10000000
numberOfPreAllocatedBuffer: 1000, 1000, 1000
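The padding rule can be illustrated with a short Python sketch. This is not NebulaStream code; the function name expand_runs and the dictionary layout are hypothetical, and the input values are the per-run options from the example configuration above.

```python
def expand_runs(options):
    """Pad every option list to the longest length by repeating its last
    value, then combine the lists position by position into one dict per run.
    Assumes every option has at least one value."""
    num_runs = max(len(values) for values in options.values())
    padded = {
        name: values + [values[-1]] * (num_runs - len(values))
        for name, values in options.items()
    }
    return [
        {name: padded[name][i] for name in padded}
        for i in range(num_runs)
    ]


# Per-run options taken from the example configuration
per_run_options = {
    "numberOfWorkerThreads": [1, 2],
    "bufferSizeInBytes": [512, 1024, 2048],
    "numberOfBuffersToProduce": [10000000],
    "numberOfPreAllocatedBuffer": [1000],
}

runs = expand_runs(per_run_options)
for run in runs:
    print(run)
```

The number of runs equals the length of the longest list, not the product of all list lengths, so adding a value to a shorter list does not create additional runs.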

Configuration Options

The following are currently all options for each run.

Config options for single run	Description
numberOfWorkerThreads	The number of worker threads.
bufferSizeInBytes	The size (in Bytes) of each buffer.
numberOfBuffersToProduce	Total number of buffers to generate. Should be larger than the number of buffers processed by the system in each run.
numberOfPreAllocatedBuffer	Number of buffers to pre-allocate. If it is smaller than numberOfBuffersToProduce, previously allocated buffers are processed again, starting with the first pre-allocated buffer.
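The reuse of pre-allocated buffers amounts to a wrap-around over the pre-allocated pool. The following Python sketch only illustrates that behavior; the function name buffer_index is hypothetical and is not part of NebulaStream.

```python
def buffer_index(buffers_produced_so_far, number_of_pre_allocated_buffers):
    """Return the position of the pre-allocated buffer to hand out next.
    Once the pool is exhausted, counting restarts at the first buffer."""
    return buffers_produced_so_far % number_of_pre_allocated_buffers


# With 1000 pre-allocated buffers, the buffer produced after the first 1000
# is again the first pre-allocated buffer (index 0).
print(buffer_index(999, 1000))
print(buffer_index(1000, 1000))
```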

The following are currently all options for all runs.

Config options over all runs	Description
logLevel	Log level during all runs. One of the NebulaStream log levels.
experimentMeasureIntervalInSeconds	Time between measurement points; minimum is one second.
startupSleepIntervalInSeconds	Time between registering the query and starting measuring.
numberOfMeasurementsToCollect	No. samples for each run
logicalSources: type	Specifies the data generators. Types are: Default, Uniform or Zipfian
numberOfPhysicalSources	No. sources for all runs
dataProvider: name	Either Internal or External
ingestionRateCount	No. potentially different ingestion rates
ingestionRateDistribution: type	Specifies the ingestion rate distribution. Types are: Uniform, Sinus, Cosinus, Custom
ingestionRateInBuffers	The maximum ingestion rate per 10 milliseconds in buffers. Only applicable for Uniform, Sinus and Cosinus distribution.
numberOfPeriods	The number of periods for the Sinus or Cosinus distribution within the interval from 0 to ingestionRateCount.
values	A list of comma-separated ingestion rates. Only applicable for the Custom distribution. This is a child element of ingestionRateDistribution.
inputType	For now, only MemoryMode is supported
dataProviderMode	Either MemCopy or ZeroCopy
outputFile	Measurements will be written to the specified file as comma-separated values
benchmarkName	Name of this benchmark. Currently, it is used as the name for the logger file
query	Query to be measured

NebulaStream supports the following data generators for e2e benchmarks:

Data Generators

Default: Creates uniformly distributed data in the range of [0, 999]
Uniform: Creates uniformly distributed data in a given range. Requires the following attributes:
- minValue
- maxValue
Zipfian: Creates Zipfian-distributed data in a given range. Requires the following attributes:
- alpha
- minValue
- maxValue
YSB: Creates data corresponding to the Yahoo Streaming Benchmark (YSB)
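To give an intuition for what the Zipfian attributes control, here is a minimal Python sketch of Zipfian sampling over a bounded integer range. It is not NebulaStream's generator: the function name zipfian_values, the seed parameter, the inclusive upper bound, and the choice that minValue is the most frequent value are all assumptions made for illustration.

```python
import random


def zipfian_values(alpha, min_value, max_value, count, seed=None):
    """Draw `count` integers from [min_value, max_value] (inclusive, assumed).
    The value at rank k (k = 1 for min_value) gets weight 1 / k**alpha, so
    small values dominate and larger alpha makes the skew stronger."""
    rng = random.Random(seed)
    values = list(range(min_value, max_value + 1))
    weights = [1.0 / (rank ** alpha) for rank in range(1, len(values) + 1)]
    return rng.choices(values, weights=weights, k=count)


# Attribute values taken from the example configuration
samples = zipfian_values(alpha=0.99, min_value=0, max_value=1000,
                         count=10000, seed=42)
print(samples[:10])
```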

NebulaStream supports the following data providers for e2e benchmarks:

Data Providers

Internal: Ingests the pre-allocated buffers as fast as possible into the system
External: Ingests the pre-allocated buffers into the system according to the given parameters. The ingestion rate distribution types are:
- Uniform: Ingests ingestionRateInBuffers steadily each second
- Sinus, Cosinus: A sine/cosine distribution with a period length of ingestionRateCount divided by numberOfPeriods. Ingests a maximum of ingestionRateInBuffers per second
- Custom: Allows for a custom number of buffers to be ingested each second
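The distribution types can be pictured as a list of ingestionRateCount rates that the external provider steps through. The Python sketch below is an illustration under assumptions, not the NebulaStream implementation: the function name ingestion_rates is hypothetical, and the scaling of the sine/cosine wave to the range from 0 to ingestionRateInBuffers is assumed, since this page only states the maximum.

```python
import math


def ingestion_rates(distribution, ingestion_rate_count,
                    ingestion_rate_in_buffers=0, number_of_periods=1,
                    custom_values=None):
    """Return one ingestion rate (in buffers) per slot."""
    if distribution == "Uniform":
        # The same rate in every slot
        return [ingestion_rate_in_buffers] * ingestion_rate_count
    if distribution in ("Sinus", "Cosinus"):
        wave = math.sin if distribution == "Sinus" else math.cos
        rates = []
        for i in range(ingestion_rate_count):
            # number_of_periods full waves fit into ingestion_rate_count slots
            phase = 2 * math.pi * number_of_periods * i / ingestion_rate_count
            # Assumed scaling: shift the wave from [-1, 1] to [0, 1]
            rates.append(round(0.5 * (1 + wave(phase)) * ingestion_rate_in_buffers))
        return rates
    if distribution == "Custom":
        # Rates are given explicitly by the user
        return list(custom_values)
    raise ValueError(f"unknown distribution: {distribution}")


# Values taken from the example configuration
sinus = ingestion_rates("Sinus", ingestion_rate_count=1000,
                        ingestion_rate_in_buffers=50000, number_of_periods=64)
print(min(sinus), max(sinus))
```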
