E2E Benchmarking

General

NebulaStream supports end-to-end benchmarking directly, without the need to write any additional code. To enable the e2e benchmarking framework, build NebulaStream with the CMake flag -DNES_BUILD_BENCHMARKS=1. Benchmarks are configured entirely in YAML files. During a benchmark run, NebulaStream automatically tracks throughput and latency and writes the values to a user-definable file as comma-separated values.

A typical workflow consists of the following steps:

  1. Build NebulaStream with -DNES_BUILD_BENCHMARKS=1
  2. Create a YAML configuration file (a minimal sketch follows below)
  3. Run the e2e-benchmark-runner with the created YAML file as a command line argument, e.g. ./e2e-benchmark-runner --configPath=/path/to/file
  4. Analyze the measurements and plot figures from the CSV file
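
For step 2, a minimal configuration could look like the following sketch. The option names are taken from the Configuration section below; the values are placeholders.

numberOfWorkerThreads: 1
bufferSizeInBytes: 1024
numberOfBuffersToProduce: 1000000
outputFile: MyBenchmark.csv
benchmarkName: MyBenchmark
query: 'Query::from("input1").filter(Attribute("value") < 100).sink(NullOutputSinkDescriptor::create());'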

Configuration

As NebulaStream is still under development, this is only a snapshot of the current configuration options: new options may be added, and existing options may be removed. At the beginning of each benchmark, all created runs are written to the logger.

The configuration file is divided into two parts:

  1. Configurations that change per run
  2. Configurations that stay constant over the whole benchmark

The example below shows both parts:

# ~~~ Configurations for single run ~~~
numberOfWorkerThreads: 1, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000


# Benchmark parameters for the entire run
logLevel: LOG_INFO
experimentMeasureIntervalInSeconds: 1
startupSleepIntervalInSeconds: 1
numberOfMeasurementsToCollect: 3

numberOfSources: 1
inputType: MemoryMode
dataProviderMode: ZeroCopy
outputFile: FilterOneSource.csv
benchmarkName: FilterOneSource
query: 'Query::from("input1").filter(Attribute("value") < 100).sink(NullOutputSinkDescriptor::create());'

As a design choice, we opted not to compute the cross product of all per-run options. Instead, a list is created for each option, and we iterate through all lists simultaneously. If one option has fewer values than another, its last value is duplicated until all lists have equal length. In the example above, three runs are created because bufferSizeInBytes has the most values, namely three; numberOfWorkerThreads and numberOfBuffersToProduce are padded to a length of three. Thus, we run the configuration shown below.

numberOfWorkerThreads: 1, 2, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000, 10000000, 10000000

Configuration Options

The following table lists all options currently available for each run.

Config option (single run)     Description
numberOfWorkerThreads          The number of worker threads.
bufferSizeInBytes              The size of each buffer, in bytes.
numberOfBuffersToProduce       The total number of buffers to generate; should exceed the number of buffers the system processes during a run.

The following table lists all options that currently apply to all runs.

Config option (all runs)              Description
logLevel                              The log level during all runs; one of the NebulaStream log levels.
experimentMeasureIntervalInSeconds    Time between measurement points; the minimum is one second.
startupSleepIntervalInSeconds         Time between registering the query and starting to measure.
numberOfMeasurementsToCollect         The number of samples to collect for each run.
numberOfSources                       The number of sources for all runs.
dataGenerators                        Specifies the data generators; the types are Default, Uniform, and Zipfian.
inputType                             For now, only MemoryMode is supported.
dataProviderMode                      Either MemCpy or ZeroCopy.
outputFile                            Measurements are written to the specified file as comma-separated values.
benchmarkName                         The name of this benchmark; currently also used as the name of the logger file.
query                                 The query to be measured.

Data Generators

NebulaStream supports the following data generators for e2e benchmarks:

  • Default: Creates uniformly distributed data in the range [0, 999].
  • Uniform: Creates uniformly distributed data in a given range. Requires the following attributes:
    • minValue
    • maxValue
  • Zipfian: Creates Zipfian-distributed data in a given range. Requires the following attributes:
    • alpha
    • minValue
    • maxValue
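
The exact YAML shape for the generator attributes is not shown in this document. As a sketch, assuming the generator type is selected via the dataGenerators option and its attributes are given as flat keys in the same style as the configuration example above, a Uniform generator might be configured as follows:

# Sketch only: the flat placement of minValue and maxValue is an assumption.
dataGenerators: Uniform
minValue: 0
maxValue: 999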