Architecture

NebulaStream Overview

General

NebulaStream is build in a server/worker setup. The user has to start one NebulaStream Coordinator and at least one NebulaStream Worker to process data. In the following, we describe the architecture of NebulaStream in detail.

NebulaStream Coordinator

The NebulaStream Coordinator is the central component and is responsible for processing user requests, scheduling queries, and managing the life-cycle of query execution. The interaction with the NebulaStream Coordinator happens via the Rest Interface. In general, the user can interact in the following ways:

  • Submit a query
  • Remove a query
  • Check the status of a query
  • List currently running queries
  • Register a new logical source
  • Fetch information about an existing logical stream
  • List all physical sources for a given logical source In the current version of NebulaStream, the coordinator is the single-point-of-failure for the entire cluster. This means if the Coordinator is shut-down for some reason, all running queries will be terminated.

In the following, we describe the individual components within the NebulaStream Coordinator.

Query Submission

Query submission is handled within the NebulaStream Coordinator via a Query Request Processor. This component enqueues the query request and handles the processing for its entire life-cycle. In NebulaStream, we handle two different query plans centrally:

  1. The Global Query Plan consolidates all logical queries that were submitted into one consistent query plan.
  2. The Global Execution Plan represents the current physical execution plan that the system is currently executing.

We refer to Query Submission for an detailed overview of this process.

Maintenance

The NebulaStream Coordinator maintains its status in three maintenance structures:

  1. The Query Catalog maintains all queries as well as their status, e.g., started/stopped, etc.
  2. The Topology Manager maintains an overview of the underlying Topology, e.g, the nodes, the connections, and their resources.
  3. The Source Catalog maintains an overview of all available logical source and physical source in the system.

Communication

The communication with the NebulaStream Worker happens using two servers:

  1. The RPC Server handles all control messages to/from the workers, e.g., deploy query, start/stop query, etc.
  2. The ZMQ Server handles all data transfers to/from the workers, e.g., sending query results.

Monitoring

The NebulaStream Coordinator uses two background services to monitor its environment:

  1. The Monitoring Service allows the user and the system to define individual monitoring requests which update the status of the nodes in the topology. One example is the continuous reporting of the resource utilization of nodes.
  2. The Health Check Service is a basic background service that sends heart-beats from/to workers and if a node disconnect, perform clearance tasks.

NebulaStream Worker

Within a NebulaStream cluster, there can be several workers. These workers are either directly connected to the coordinator or transitively connected to the coordinator via another worker. The job of a worker is to locally manage the lifecycle of a query provided to it by the coordinator. To this end, the NebulaStream Coordinator is providing a sub-query derived from a larger query to the worker.

A worker can also contain a data source, i.e., it is responsible for providing the stream of data that the query is interested in. Therefore, a worker with a data source contains all the information about the schema, the source type, the connection configuration, etc., that are necessary for identifying and consuming a data source. These configurations are supplied while starting a worker node as runtime arguments.

Communication and Monitoring

The NebulaStream Worker has the counterpart from the NebulaStream Coordinator for communicating and monitoring.

Processing

To process a query, the NebulaStream Worker contains a Node Engine, which by itself contains the following components:

  1. The Query Manager manages the local life-cycle of a query, e.g., start/stop, deploy/undeploy.
  2. The Compiler generates efficient code for each query and maintains the executable binaries.
  3. The State Backend maintains the state for each query.
  4. The Buffer Manager manages the memory footprint of the engine and provides buffers to different internal components.
  5. The Thread Pool manages thread to query assignment in a highly dynamic fashion.
  6. The Network Manager maintains all incoming and outgoing connections of the NebulaStream Worker.

Data Ingestion

The data ingestion in NebulaStream happens via dedicated data sources wrappers that manage the life-cycle of a data source. Currently, NebulaStream supports the following source types:

  • File: reading data for example from a binary file
  • CSV: reading data from a CSV file
  • ZMQ/MQTT: reading data from an external broker using ZMQ or MQTT
  • OPC: reading data following the OPC industry standard
  • Kafka: reading data from a Kafka broker
  • Network: reading data sent from another NebulaStream Worker