General Concepts

A NebulaStream cluster consists of a coordinator and one or more workers. Together, these make up the topology of the NebulaStream cluster. The NebulaStream cluster processes data from multiple streams which are defined using logical sources and physical sources. The coordinator provides information about the state of the NebulaStream cluster in catalogs, e.g., the query catalog and the source catalog.

Coordinator

The coordinator is the central component in NebulaStream which is responsible for processing user requests, scheduling queries, and managing the life cycle of the running queries. The coordinator allows users to interact via a REST API. The user has the option to:

  • Submit a query.
  • Remove a query.
  • Check status of a query.
  • List currently running queries.
  • Register a new logical source.
  • Fetch information about an existing logical stream.
  • List all physical sources for a given logical source, etc.

💡 To learn more about how to configure the coordinator, please refer to configurations.

⚠️ The coordinator is currently the single point of failure for the entire cluster.

Worker

A worker manages the life cycle of multiple subqueries running on it. It executes one or more subqueries derived from a larger query that the coordinator provides. Within a NebulaStream cluster, there can be several workers. These are either directly connected to the coordinator or indirectly through another worker.

A worker can also contribute to a data source, i.e., it takes part in providing data that a query is interested in. The workers providing a data source requested by a query are then considered when scheduling the query for execution.

💡 To learn more about how to configure a worker, e.g., how to configure data sources, please refer to configurations.

Topology

The topology represents the interconnection between workers and the coordinator within a NebulaStream cluster. As NebulaStream is designed to support efficient IoT data management, the framework allows running workers on geo-distributed locations which are connected via gateway nodes to the coordinator running at the cloud data center. Workers can potentially have different resources, and can interconnect using networks with varying capacity. The topology stores this compute and network resource information. As new queries are deployed or a running query is removed this resource information is updated by the topology. Similarly, when a new worker joins a NebulaStream cluster, the topology updates itself to reflect the new worker.

Defining Data Sources

Logical Source

A collection of physical data sources sharing common properties are represented by logical source in NebulaStream. For example, in a smart city, the collection of streetlamps capable of producing usage data are represented by a logical source. As there could be thousands of physical streetlamps within a city, NebulaStream uses the concept of logical sources to group a collection of data sources together. All data sources within the collection share common physical characteristics and the same schema.

A logical source contains the source name and the schema information. All physical sources mapped to a logical source produce data in the same schema. Otherwise, a query will fail during its execution.

💡 To learn more about how to configure a logical source, please refer to configurations.

Physical Source

A physical source represents the data produced by a sensor, e.g., an individual streetlamp. In the previous example, the logical source streetlamps can contain a physical source streetlamp-1 at some physical location within the city. A logical source can have multiple physical sources.

NebulaStream manages physical sources centrally at the coordinator in a catalog. When a worker with a data source starts, it passes the information about the name of a logical source, the worker information, and other configuration parameters to the coordinator for registration.

Currently, NebulaStream supports reading data in JSON and CSV format. Additionally, a physical source can be configured to consume data using Files, OPC, MQTT, and Kafka.

💡 To learn more about host to configure a physical source, please refer to configurations.

⚠️ If the logical source used for the registration does not exist at the coordinator, the registration of the physical source will fail.

Catalogs

NebulaStream stores the current state of the system in two different catalogs.

Query Catalog

The query catalog stores information regarding all queries managed by the coordinator. The catalog stores all additional metadata information about a query request such as the placement strategy, the fault tolerance guarantee, etc. As a query goes through different phases (registered, scheduling, running, stopped, or failed), the catalog stores the current status of the query.

Source Catalog

The source catalog contains information about all sources available in NebulaStream for processing, and the mapping between physical and logical sources. Additionally, the catalog stores the mapping between workers and physical sources. The query placement phase utilizes the source catalog when scheduling queries.