E2E Benchmarking

General

NebulaStream supports end-to-end (e2e) benchmarking without writing any additional code. To enable the e2e benchmarking framework, build NebulaStream with the CMake flag -DNES_BUILD_BENCHMARKS=1. Benchmarks are configured entirely in YAML files. During a benchmark run, NebulaStream automatically tracks throughput and latency and writes the values as comma-separated values to a user-definable file.

A typical workflow consists of the following steps:

  1. Build NebulaStream with -DNES_BUILD_BENCHMARKS=1
  2. Create a YAML configuration file
  3. Run e2e-benchmark-main with the path to the created YAML file and the path to a log file as command-line arguments, e.g. ./e2e-benchmark-main --configPath=/path/to/file --logPath=logger.log
  4. Analyze the measurements and plot figures from the CSV file
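
As a starting point for step 4, the following minimal Python sketch averages every numeric column of a measurement file. The exact column names written by the benchmark are not listed here, so the columns in the sample data (benchmark_name, throughput, latency) are purely hypothetical placeholders; the function itself does not depend on any particular column name.

```python
import csv
import io
from statistics import mean


def column_means(csv_file):
    """Return the mean of every column whose cells parse as numbers."""
    columns = {}
    for row in csv.DictReader(csv_file):
        for name, cell in row.items():
            try:
                value = float(cell)
            except (TypeError, ValueError):
                continue  # skip non-numeric cells such as names
            columns.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in columns.items()}


# Hypothetical sample data; real files come from the configured outputFile.
sample = io.StringIO(
    "benchmark_name,throughput,latency\n"
    "FilterOneSource,100.0,2.0\n"
    "FilterOneSource,300.0,4.0\n"
)
means = column_means(sample)
print(means)
```

For a real run, open the file named by outputFile (for example with open("FilterOneSource.csv", newline="")) and pass the file object to column_means.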

Configuration

As NebulaStream is still under development, this is only a snapshot of the current configuration options. New options may be added and existing options removed. At the beginning of each benchmark, all created runs are written to the log.

The configuration file is divided into two parts:

  1. Configurations that change per run
  2. Configurations that stay constant over the whole benchmark

# ~~~ Configurations for single run ~~~
numberOfWorkerThreads: 1, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000
numberOfPreAllocatedBuffer: 1000


# Benchmark parameter for the entire run
logLevel: LOG_INFO
experimentMeasureIntervalInSeconds: 1
startupSleepIntervalInSeconds: 1
numberOfMeasurementsToCollect: 3

logicalSources:
  - name: input1
    type: Default
  - name: input2
    type: Zipfian
    numberOfPhysicalSources: 3, 2, 1
    alpha: 0.99
    minValue: 0
    maxValue: 1000

dataProvider:
  name: External
  ingestionRateCount: 1000
  ingestionRateDistribution:
    type: Sinus
    numberOfPeriods: 64
    ingestionRateInBuffers: 50000
    
inputType: MemoryMode
dataProviderMode: ZeroCopy
outputFile: FilterOneSource.csv
benchmarkName: FilterOneSource
query: 'Query::from("input1").filter(Attribute("value") < 100).sink(NullOutputSinkDescriptor::create());'

As a design choice, we do not compute the cross-product of all options that change per run. Instead, a list is created for each option, and all lists are iterated simultaneously. If one option has fewer values than another, its last value is repeated. In the example, three runs are created because bufferSizeInBytes has the most values, namely three. numberOfWorkerThreads, numberOfBuffersToProduce and numberOfPreAllocatedBuffer are padded to a length of three. Thus, the runs use the options displayed below.

numberOfWorkerThreads: 1, 2, 2
bufferSizeInBytes: 512, 1024, 2048
numberOfBuffersToProduce: 10000000, 10000000, 10000000
numberOfPreAllocatedBuffer: 1000, 1000, 1000
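
The padding rule can be illustrated with a short Python sketch. It is not NebulaStream's implementation; the function name expand_runs is made up for illustration, and the input values are taken from the per-run options of the example configuration.

```python
def expand_runs(options):
    """Pad each option list with its last value, then build one dict per run."""
    num_runs = max(len(values) for values in options.values())
    padded = {
        name: values + [values[-1]] * (num_runs - len(values))
        for name, values in options.items()
    }
    # Iterate through all padded lists simultaneously (no cross-product).
    return [
        {name: padded[name][i] for name in padded}
        for i in range(num_runs)
    ]


per_run_options = {
    "numberOfWorkerThreads": [1, 2],
    "bufferSizeInBytes": [512, 1024, 2048],
    "numberOfBuffersToProduce": [10000000],
    "numberOfPreAllocatedBuffer": [1000],
}

runs = expand_runs(per_run_options)
for run in runs:
    print(run)
```

A cross-product of the same lists would yield 2 * 3 * 1 * 1 = 6 runs, whereas the simultaneous iteration yields only as many runs as the longest list has values.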

Configuration Options

The following options can currently be set for each run.

| Config options for single run | Description |
|---|---|
| numberOfWorkerThreads | The number of worker threads. |
| bufferSizeInBytes | The size (in bytes) of each buffer. |
| numberOfBuffersToProduce | Total number of buffers to generate. Should be larger than the number of buffers the system processes in each run. |
| numberOfPreAllocatedBuffer | Number of buffers to pre-allocate. If it is smaller than numberOfBuffersToProduce, the pre-allocated buffers are processed again, starting with the first pre-allocated buffer. |
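
The reuse of pre-allocated buffers described for numberOfPreAllocatedBuffer amounts to a wrap-around over the buffer indices. The following Python sketch shows that reading of the description with deliberately tiny numbers; the function name is hypothetical and this is not NebulaStream's code.

```python
def buffer_index_sequence(number_of_buffers_to_produce, number_of_pre_allocated_buffer):
    """Yield the index of the pre-allocated buffer used for each produced buffer."""
    for produced in range(number_of_buffers_to_produce):
        # Once all pre-allocated buffers were used, start again with the first.
        yield produced % number_of_pre_allocated_buffer


indices = list(buffer_index_sequence(7, 3))
print(indices)  # [0, 1, 2, 0, 1, 2, 0]
```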

The following options can currently be set for all runs.

| Config options over all runs | Description |
|---|---|
| logLevel | Log level during all runs. One of the NebulaStream log levels. |
| experimentMeasureIntervalInSeconds | Time between measurement points; the minimum is one second. |
| startupSleepIntervalInSeconds | Time between registering the query and starting to measure. |
| numberOfMeasurementsToCollect | Number of samples for each run. |
| logicalSources: type | Specifies the data generator. Types are: Default, Uniform or Zipfian. |
| numberOfPhysicalSources | Number of sources for all runs. |
| dataProvider: name | Either Internal or External. |
| ingestionRateCount | Number of potentially different ingestion rates. |
| ingestionRateDistribution: type | Specifies the ingestion rate distribution. Types are: Uniform, Sinus, Cosinus, Custom. |
| ingestionRateInBuffers | The maximum ingestion rate per 10 milliseconds in buffers. Only applicable for the Uniform, Sinus and Cosinus distributions. |
| numberOfPeriods | The number of periods for the Sinus or Cosinus distribution within the interval from 0 to ingestionRateCount. |
| values | A list of comma-separated ingestion rates. Only applicable for the Custom distribution. This is a child element of ingestionRateDistribution. |
| inputType | For now, only MemoryMode is supported. |
| dataProviderMode | Either MemCopy or ZeroCopy. |
| outputFile | Measurements are written to the specified file as comma-separated values. |
| benchmarkName | Name of this benchmark. Currently, it is used as the name for the logger file. |
| query | Query to be measured. |

Data Generators

NebulaStream supports the following data generators for e2e benchmarks:

  • Default: Creates uniformly distributed data in the range of [0, 999]
  • Uniform: Creates uniformly distributed data in a given range. Requires the following attributes:
    • minValue
    • maxValue
  • Zipfian: Creates Zipfian-distributed data in a given range. Requires the following attributes:
    • alpha
    • minValue
    • maxValue
  • YSB: Creates data corresponding to the Yahoo Streaming Benchmark (YSB)
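
To give a feeling for what the Zipfian attributes do, here is a small Python sketch that draws Zipf-like values using only the standard library. It is an illustration, not NebulaStream's generator: the function name, the inclusive treatment of maxValue, and the choice to make minValue the most frequent value are all assumptions.

```python
import random


def zipfian_values(alpha, min_value, max_value, count, seed=42):
    """Draw `count` values from [min_value, max_value] with Zipf-like skew."""
    rng = random.Random(seed)
    population = range(min_value, max_value + 1)  # assumption: maxValue inclusive
    # The value at rank r (1-based) gets a weight proportional to 1 / r^alpha,
    # so small values (assumption: rank 1 is min_value) dominate the sample.
    weights = [1.0 / (rank ** alpha) for rank in range(1, len(population) + 1)]
    return rng.choices(population, weights=weights, k=count)


sample = zipfian_values(alpha=0.99, min_value=0, max_value=1000, count=10000)
print(sample[:10])
print("share of the smallest value:", sample.count(0) / len(sample))
```

A larger alpha concentrates more of the generated values on the first few ranks, while an alpha close to 0 approaches a uniform distribution.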

Data Providers

NebulaStream supports the following data providers for e2e benchmarks:

  • Internal: Ingests the pre-allocated buffers as fast as possible into the system
  • External: Ingests the pre-allocated buffers into the system according to the given parameters. The available ingestion rate distributions are:
    • Uniform: Ingests ingestionRateInBuffers steadily each second
    • Sinus, Cosinus: A sine/cosine distribution with a period length of ingestionRateCount divided by numberOfPeriods. Ingests a maximum of ingestionRateInBuffers per second
    • Custom: Allows for a custom number of buffers to be ingested each second
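
The following Python sketch shows one way the ingestion rate distributions could be turned into a list of ingestionRateCount rates. It is a sketch under stated assumptions, not NebulaStream's implementation: the function name is hypothetical, and the sine/cosine wave is assumed to be shifted and scaled from [-1, 1] to [0, ingestionRateInBuffers] so that rates never become negative.

```python
import math


def ingestion_rates(kind, ingestion_rate_count, ingestion_rate_in_buffers=0,
                    number_of_periods=1, values=None):
    """Return one ingestion rate (in buffers) per slot for the given distribution."""
    if kind == "Uniform":
        return [ingestion_rate_in_buffers] * ingestion_rate_count
    if kind == "Custom":
        return list(values or [])
    if kind in ("Sinus", "Cosinus"):
        wave = math.sin if kind == "Sinus" else math.cos
        rates = []
        for slot in range(ingestion_rate_count):
            # number_of_periods full waves fit into [0, ingestion_rate_count).
            angle = 2.0 * math.pi * number_of_periods * slot / ingestion_rate_count
            # Assumption: shift and scale the wave from [-1, 1] to [0, max rate].
            rates.append(round(ingestion_rate_in_buffers * 0.5 * (1.0 + wave(angle))))
        return rates
    raise ValueError(f"unknown ingestion rate distribution: {kind}")


# Values from the example configuration: 1000 rates, 64 periods, at most 50000 buffers.
sinus = ingestion_rates("Sinus", 1000, 50000, number_of_periods=64)
print(len(sinus), min(sinus), max(sinus))
```

With the example values, one period spans 1000 / 64, i.e. roughly 15.6 consecutive rates.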