
What is DataSQRL?

DataSQRL is an open-source compiler and build tool for implementing data products as data pipelines. A data product processes, transforms, or analyzes data from one or multiple sources (user input, databases, data streams, API calls, file storage, etc.) and exposes the result as raw data, in a database, or through an API.
DataSQRL eliminates most of the laborious glue code required to implement data pipelines and stitch together the underlying technologies.

Building a data product with DataSQRL takes 3 steps:

  1. Implement SQL script: You combine, transform, and analyze the input data using SQL.
  2. Expose Data (optional): You define how to expose the transformed data in the API or database.
  3. Compile Data Pipeline: DataSQRL compiles the SQL script and output specification into a fully integrated data pipeline. The compiled data pipeline ingests raw data, processes it according to the transformations and analyses defined in the SQL script, and serves the resulting data through the specified API or database.
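As a minimal sketch of step 1, a SQRL script is plain SQL plus imports and table definitions. The package, table, and column names below are hypothetical, assuming an `Orders` source with `customerid` and `total` columns:

```sql
-- Hypothetical source package; IMPORT resolves it to a data source connector
IMPORT mypackage.Orders;

-- ':=' defines a derived table in SQRL; the body is standard SQL
SpendingByCustomer := SELECT customerid, SUM(total) AS total_spend
                      FROM Orders
                      GROUP BY customerid;
```

Tables defined this way are what step 2 exposes through the generated API or database.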

In a nutshell, DataSQRL is an abstraction layer that takes care of the nitty-gritties of building efficient data pipelines and gives developers an easy-to-use tool to build data products.

Follow the quickstart tutorial to build a data product in a few minutes and see how DataSQRL works in practice.

How DataSQRL Works

[Diagram: compiled DataSQRL data pipeline]

DataSQRL compiles the SQL script and output specification into a data pipeline that uses data technologies like Apache Kafka, Apache Flink, or Postgres.

DataSQRL has a pluggable engine architecture which allows it to support various stream processors, databases, data warehouses, data streams, and API servers. Feel free to contribute your favorite data technology as a DataSQRL engine to the open-source project, wink wink.

DataSQRL can generate data pipelines with multiple topologies. Take a look at the types of data products that DataSQRL can build. You can further customize those pipeline topologies in the DataSQRL package configuration which defines the data technologies at each stage of the resulting data pipeline.
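As an illustrative sketch of a package configuration (the exact schema is defined in the DataSQRL documentation; the keys and engine names below are assumptions), each pipeline stage is mapped to a concrete engine:

```json
{
  "engines": {
    "log":      { "type": "kafka" },
    "stream":   { "type": "flink" },
    "database": { "type": "postgres" },
    "server":   { "type": "vertx" }
  }
}
```

Swapping an engine here changes the technology used at that stage without changing the SQL script.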

DataSQRL compiles executables for each engine in the pipeline which can be deployed on the data technologies and cloud services you already use. In addition, DataSQRL provides development tooling that makes it easy to run and test data pipelines locally to speed up the development cycle.

What DataSQRL Does

Okay, you get the idea of a compiler that produces integrated data pipelines. But what exactly does DataSQRL do for you? Glad you asked.

[Diagram: DataSQRL compilation]

To produce fully integrated data pipelines, the DataSQRL compiler:

  • resolves data imports to data source connectors and generates input schemas for the stream ingestion,
  • synchronizes data schemas and data management across all engines in the data pipeline,
  • aligns timestamps and watermarks across the engines,
  • orchestrates optimal data flow between engines,
  • translates the SQL script to the respective engine for execution,
  • and generates an API server that implements the given API specification.

To produce high-performance data pipelines that respond to new input data in real time and provide low-latency, high-throughput APIs to many concurrent users, DataSQRL optimizes the compiled data pipeline by:

  • partitioning the data flow and co-locating data where possible.
  • pruning the execution graph and consolidating repetitive computations.
  • determining when to pre-compute data transformations in the streaming engine (reducing response latency) versus computing result sets at request time in the database or server (avoiding data staleness and a combinatorial explosion of pre-computed results).
  • determining the optimal set of index structures to install in the database.

In other words, DataSQRL can save you a lot of time and lets you focus on what matters: implementing the logic and API of your data product.

Learn More