How to benchmark NebulaStream?
With NebulaStream, we currently provide nesBench, a helper for single-node benchmarks.
Compile NES for Benchmarking
There are a few vital things to consider when building NebulaStream to achieve optimal performance:
- Set the CMake build type to Release.
- Enable the CMake cache variable NES_BUILD_NATIVE to allow for processor-specific tuning such as enabling AVX.
- Consider disabling logging altogether, i.e. set NES_LOG_LEVEL to LOG_NONE.
If you have a NUMA system, also consider binding to a specific node.
Run Your First Benchmark
To run a benchmark, create a benchmark configuration file (bench.yml in this example), such as:
logLevel: LOG_NONE
logicalSources:
  - name: ysb
    type: YSB
numberOfMeasurementsToCollect: 5
query: >
  Query::from("ysb")
  .filter(Attribute("event_type") < 1)
  .window(SlidingWindow::of(IngestionTime(), Seconds(5), Seconds(1)))
  .byKey(Attribute("campaign_id"))
  .apply(Sum(Attribute("user_id")))
  .sink(FileSinkDescriptor::create("out", "CSV_FORMAT", "OVERWRITE"));
# A bit of tuning
numberOfWorkerThreads: 8
numberOfBuffersInGlobalBufferManager: 10240
numberOfBuffersInSourceLocalBufferPool: 1024
bufferSizeInBytes: 131072
Then, run nesBench from the docker image. We mount the current directory since nesBench will produce output in its working directory, which is /bench.
$ docker run --rm -v .:/bench nebulastream/nes-executable-image nesBench bench.yml
After the run, we obtain a CSV file with measurements, from which we could generate graphs or run further analyses. For now, we just select a few columns and look at the first few rows:
$ awk -F, < bench.csv '{ print $13, $17, $20 }' | column -t | head -n 5
tuplesPerSecond  numberOfWorkerOfThreads  bufferSizeInBytes
344656           8                        131072
344560           8                        131072
344080           8                        131072
411264           8                        131072
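Positional column numbers such as $13 break as soon as the CSV layout changes. As a minimal sketch, the same selection can be done by header name in Python. The helper select_columns is hypothetical and not part of nesBench; the in-memory sample stands in for bench.csv and contains only the three columns and two of the rows shown above. The sketch assumes the first line of the file is a header with exactly these names.

```python
import csv
import io


def select_columns(csv_file, columns):
    """Pick the named columns from every data row of an open CSV file."""
    # skipinitialspace tolerates a blank after each comma, in case the file has one.
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    return [[row[name] for name in columns] for row in reader]


# Stand-in for open("bench.csv", newline=""), reduced to the columns shown above.
sample = io.StringIO(
    "tuplesPerSecond,numberOfWorkerOfThreads,bufferSizeInBytes\n"
    "344656,8,131072\n"
    "344560,8,131072\n"
)

rows = select_columns(sample, ["tuplesPerSecond", "bufferSizeInBytes"])
for values in rows:
    print(*values)
```

For a real run, replace the sample with the opened bench.csv file; a misspelled column name then fails with a KeyError instead of silently printing the wrong field.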
Pre-defined Benchmarks
A handful of benchmarks are defined in the NebulaStream repository, most notably some queries of the Yahoo Streaming Benchmark and the Nexmark Benchmark.
How to define your own Benchmark
With most config options, you can supply a comma-separated list of values. nesBench will execute as many runs as there are items in the longest list. Shorter lists are padded with their last value.
So given the configuration snippet:
# specify source & query
numberOfWorkerThreads: 1, 1, 2
bufferSizeInBytes: 131072, 262144, 131072, 262144
nesBench will execute four runs:
numberOfWorkerThreads | bufferSizeInBytes |
---|---|
1 | 131072 |
1 | 262144 |
2 | 131072 |
2 | 262144 |
Note that the last value of numberOfWorkerThreads (the 2 in the fourth run) is added by the padding.
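The padding rule can be sketched in a few lines of Python. This is an illustration of the behavior described above, not nesBench's actual implementation; the function name expand_runs and the dictionary-of-strings input are assumptions made for the example.

```python
def expand_runs(options):
    """Expand comma-separated option lists into one dict per run.

    Shorter lists are padded by repeating their last value.
    """
    lists = {
        name: [item.strip() for item in str(raw).split(",")]
        for name, raw in options.items()
    }
    # The longest list determines how many runs are executed.
    run_count = max(len(values) for values in lists.values())
    return [
        {name: values[min(i, len(values) - 1)] for name, values in lists.items()}
        for i in range(run_count)
    ]


runs = expand_runs({
    "numberOfWorkerThreads": "1, 1, 2",
    "bufferSizeInBytes": "131072, 262144, 131072, 262144",
})
for run in runs:
    print(run["numberOfWorkerThreads"], run["bufferSizeInBytes"])
```

Indexing with min(i, len(values) - 1) is what repeats the last value: once a list runs out, every further run reuses its final entry.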
Overview of configuration options
These are general knobs:
- logLevel
- numberOfWorkerThreads
- bufferSizeInBytes
- numberOfBuffersInGlobalBufferManager
- numberOfBuffersInSourceLocalBufferPool: Sources have their own local pool of buffers, separate from the global buffer pool.
The following options control different features of NES:
- nautilusBackend: The backend of the Nautilus compilation framework
- queryCompilerDumpMode: If and how to dump the intermediate representations of the Nautilus compilation process
- windowingStrategy: The kind of windowing optimization to apply (Slicing or Bucketing)
- pipeliningStrategy: Whether to enable operator fusion
- streamJoinStrategy: The kind of join to use
- useCompilationCache: If compilation results should be cached between executions
Interesting CSV output fields
The following three are ever-increasing statistics counters:
- processedTuples
- processedTasks
- processedBuffers
nesBench also calculates aggregates, namely tuplesPerSecond, tasksPerSecond, and bufferPerSecond.
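When a configuration produces several runs, it can help to condense the per-measurement tuplesPerSecond values into one number per configuration. The following is a minimal sketch, assuming the CSV has a header row with the column names shown in the earlier output; the function mean_throughput is hypothetical, and the in-memory sample reuses the four rows from that output in place of the real bench.csv.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_throughput(csv_file, group_by):
    """Average tuplesPerSecond over all rows that share the group_by columns."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        key = tuple(row[name] for name in group_by)
        groups[key].append(float(row["tuplesPerSecond"]))
    return {key: mean(values) for key, values in groups.items()}


# Stand-in for open("bench.csv", newline=""), reduced to three columns.
sample = io.StringIO(
    "tuplesPerSecond,numberOfWorkerOfThreads,bufferSizeInBytes\n"
    "344656,8,131072\n"
    "344560,8,131072\n"
    "344080,8,131072\n"
    "411264,8,131072\n"
)

result = mean_throughput(sample, ["numberOfWorkerOfThreads", "bufferSizeInBytes"])
for key, value in result.items():
    print(key, value)
```

The mean is only one possible summary; statistics.median from the same module is a drop-in alternative when single measurements deviate strongly, as the last row of the sample does.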