What it is

The BigData Analysis GE is intended to deploy means for analysing both batch and stream data, in order to ultimately obtain insights from that data.

The batch part has been developed largely through the adoption and/or in-house creation of the following tools:

The batch part is enriched with the following tools, which make up the so-called Cosmos Ecosystem:

The streaming part likewise combines adopted and in-house created components:

Why get it

Cosmos and Sinfonier are mainly addressed to service providers aiming to expose BigData Analysis GE-like services. For those providers, data analysis is not a goal in itself; the goal is to provide the means for others to perform such analysis. This especially applies to providers running an OpenStack Sahara installation.

If you are a data scientist willing to get insights from certain data, or a software engineer in charge of productizing an application based on a data scientist's prior analysis, then please visit the User and Programmer Guide, and/or go directly to the FIWARE Lab Global Instances of Cosmos and/or Sinfonier; there you will find an already deployed infrastructure ready to be used through the different APIs.

If you do not rely on the FIWARE Lab Global Instances of Cosmos and/or Sinfonier but still want to use Hadoop and/or Storm, do not install the full GE stack; that would be like installing a complete cloud just to create a single virtual machine. Instead, simply install a private instance of Hadoop and/or Storm!

Available for:


Use Cases



The BigData Analysis GE is intended to deploy means for analysing both batch and/or stream data, in order to ultimately obtain insights from that data, revealing information that was previously hidden. Batch data is stored in advance, and latency is not extremely important when processing it. Stream data is received and processed in near real time, and the results are expected to be ready immediately, basically because this kind of data is not stored at all and the insights must be obtained on the fly. These two different aspects of the GE lead to the definition of two main blocks: batch processing and stream processing.

Additional blocks may complement the above ones, such as a block in charge of the historic persistence of context data, or a short-term historic block offering a querying API for historical time series.

Please observe that using all these blocks together is not mandatory; a GE implementation can be based on one, two, three or all of the blocks.

Batch processing block

The batch processing block aims to be a Big Data as a Service platform on top of OpenStack infrastructure. This platform will mainly be in charge of providing Apache Hadoop clusters on demand, but it is not restricted to that: other cluster-based analysis solutions, such as Apache Spark, may be considered as candidate software to be installed within the cluster.
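As an illustration only, the following Python sketch requests a Hadoop cluster on demand through an OpenStack Sahara-style REST endpoint. The URL, project ID, authentication token and the template and image identifiers are placeholders, not the GE's documented values.

    import requests

    # All of the following values are placeholders for illustration purposes.
    SAHARA_URL = "http://sahara.example.org:8386/v1.1/my-project-id"
    TOKEN = "a-keystone-token-obtained-beforehand"

    cluster_request = {
        "cluster": {
            "name": "analysis-cluster",
            "plugin_name": "vanilla",      # plain Apache Hadoop plugin
            "hadoop_version": "2.7.1",
            "cluster_template_id": "11111111-2222-3333-4444-555555555555",
            "default_image_id": "66666666-7777-8888-9999-000000000000",
        }
    }

    # Sahara provisions the cluster asynchronously; the response carries its id.
    resp = requests.post(f"{SAHARA_URL}/clusters",
                         json=cluster_request,
                         headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    print("Cluster being provisioned:", resp.json()["cluster"]["id"])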

This part of the GE clearly differentiates between computing and storage services:

Using one of the services (computing or storage) does not imply using the other. This way, data storage is independent of the data lifecycle associated with its processing.
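As a sketch of that independence, the well-known WebHDFS REST API allows data to be written to and read from the storage service with plain HTTP calls, no compute cluster involved; the host, user and paths below are placeholders.

    import requests

    # Placeholders: point these at your own WebHDFS-enabled storage endpoint.
    WEBHDFS = "http://storage.example.org:50070/webhdfs/v1"
    USER = "myuser"
    path = f"/user/{USER}/input/data.csv"

    # Step 1: ask the NameNode where to write; it answers with a 307 redirect.
    r = requests.put(f"{WEBHDFS}{path}",
                     params={"op": "CREATE", "user.name": USER,
                             "overwrite": "true"},
                     allow_redirects=False)
    datanode_url = r.headers["Location"]

    # Step 2: upload the actual content to the returned DataNode URL.
    requests.put(datanode_url, data=b"sensor,value\ns1,23.5\n").raise_for_status()

    # Reading back uses op=OPEN (redirects are followed automatically here).
    print(requests.get(f"{WEBHDFS}{path}",
                       params={"op": "OPEN", "user.name": USER}).text)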

In addition, this block will enrich the Hadoop ecosystem with specific tools allowing for the integration of Big Data with other GEs, such as the Publish/Subscribe Context Broker or the Open Data Portal.

Stream processing block

The stream processing component represents a significant change with respect to current solutions in the area of real-time information processing. This is achieved by means of:

There are several solutions on the market which allow high-capacity real-time data processing, e.g. Apache Storm, but using these tools requires fairly advanced technical knowledge and costly installations with dedicated machines in order to obtain good yields. One of the purposes of the stream processing component, since its inception, has been to eliminate this initial complexity, which in many cases causes people with less technical knowledge, yet high capabilities and an interest in creating intelligence, to stop using tools as powerful as Apache Storm.

To overcome this obstacle, the stream processing component implements a new layer of abstraction on top of the Apache Storm technology, simplifying as much as possible the creation of processing elements and modules, and the combination of those modules into topologies in a simple and rapid way.
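The following Python sketch is purely conceptual and does not reproduce the component's actual API; it only conveys the abstraction: self-contained modules with a single processing hook, chained into a topology without writing native Storm spouts and bolts.

    class Module:
        """A self-contained processing element; returning None drops the tuple."""
        def process(self, item):
            raise NotImplementedError

    class LanguageFilter(Module):
        """Keep only tuples written in the configured language."""
        def __init__(self, lang):
            self.lang = lang
        def process(self, item):
            return item if item.get("lang") == self.lang else None

    class Normalizer(Module):
        """Trivial enrichment step: normalise the text field."""
        def process(self, item):
            item["text"] = item["text"].lower().strip()
            return item

    def run_topology(source, modules):
        """Push every tuple through the module chain, Storm-style."""
        for item in source:
            for module in modules:
                item = module.process(item)
                if item is None:          # some module filtered the tuple out
                    break
            else:
                print("emitted:", item)

    run_topology(
        [{"lang": "en", "text": " Hello "}, {"lang": "es", "text": "Hola"}],
        [LanguageFilter("en"), Normalizer()],
    )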

Historic persistence block

From a FIWARE perspective, one of the main sources of data both for the batch and stream processing blocks mentioned above is the Publish/Subscribe Context Broker GE. Since the data managed by the Context Broker is a snapshot of the current context, past context data is lost unless it is persisted as an historic somewhere else; in this case, HDFS.

This block will use the subscription mechanism of the Context Broker in order to receive a flow of context-related notifications that will be stored in HDFS using some standard file format, e.g. CSV, JSON or Parquet.
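A minimal sketch, assuming an NGSIv2 Context Broker: a subscription pushes context changes to a small HTTP receiver, which appends each notification as one JSON line (in a real deployment the write would target HDFS, for instance via WebHDFS as sketched above). Hosts, entity types and the receiver URL are placeholders.

    import json
    import requests
    from http.server import BaseHTTPRequestHandler, HTTPServer

    ORION = "http://orion.example.org:1026"  # placeholder Context Broker

    # Subscribe to every entity of type Room; notifications go to our receiver.
    subscription = {
        "description": "Persist Room updates as an historic",
        "subject": {"entities": [{"idPattern": ".*", "type": "Room"}]},
        "notification": {"http": {"url": "http://receiver.example.org:5050/notify"}},
    }
    requests.post(f"{ORION}/v2/subscriptions", json=subscription).raise_for_status()

    class NotifyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers["Content-Length"]))
            # One JSON line per notification; HDFS or NoSQL in production.
            with open("room_history.jsonl", "a") as f:
                f.write(json.dumps(json.loads(body)["data"]) + "\n")
            self.send_response(200)
            self.end_headers()

    HTTPServer(("", 5050), NotifyHandler).serve_forever()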

Although the main purpose of this block is building historics in HDFS, other storages can be used as well, e.g. a NoSQL backend. In that case, other storing strategies can be adopted, for instance, time series.

Short-term historic querying block

As previously said, the architecture of the Big Data GE may be enriched with a block in charge of persisting historic context data, mainly in HDFS, but also in other storages such as a NoSQL backend handling time series. Because the characteristics of these storages differ from those of HDFS, i.e. less capacity but lower access times, a specific block may be defined in charge of retrieving historical raw and aggregated time series about the evolution over time of the context data registered by a Publish/Subscribe Context Broker.
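As an illustration, the following sketch queries a short-term historic API in the style of STH Comet for both raw samples and aggregated series; the host, service headers and entity names are placeholders, so check the block's own documentation for the exact paths and parameters.

    import requests

    STH = "http://sth.example.org:8666/STH/v1"  # placeholder endpoint
    HEADERS = {"fiware-service": "myservice", "fiware-servicepath": "/rooms"}

    base = f"{STH}/contextEntities/type/Room/id/Room1/attributes/temperature"

    # Raw samples: the last 10 recorded values of the attribute.
    raw = requests.get(base, params={"lastN": 10}, headers=HEADERS).json()

    # Aggregated series: daily maxima over a given time window.
    aggr = requests.get(base, params={
        "aggrMethod": "max",
        "aggrPeriod": "day",
        "dateFrom": "2015-01-01T00:00:00.000Z",
        "dateTo": "2015-01-31T23:59:59.999Z",
    }, headers=HEADERS).json()

    print(raw, aggr)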

FIWARE Webpage

BigData Analysis - Cosmos


BigData Analysis - Cosmos Documentation


BigData Analysis - Cosmos Download

FIWARE Academy

BigData Analysis - Cosmos Courses

