Querying and Designing the API
When we run our seedshop.sqrl
script, DataSQRL compiles a data pipeline that exposes an API to access the resulting data. We queried the API via GraphiQL in the browser by opening http://localhost:8888//graphiql/
. Let's look at those queries in more detail.
We will be accessing the generated GraphQL API. If you are new to the GraphQL API standard, take a quick look at the documentation for querying GraphQL. We are working on REST support.
Querying the API
In the first chapter we retrieved the purchase history and spending analysis of the user with id=10
by running the following query.
{
Users (id: 10) {
purchases {
id
totals {
price
saving
}
}
spending {
week
spend
saved
}
}}
At the root of this query, we are querying the Users
table for rows that match the given id. We then navigate along the purchases
and spending
relationships to query for the related rows from the Orders
and nested Users.spending
tables, respectively.
Defining relationships explicitly in SQRL allows us to query related data easily in the API. We can also use filters in relationships to select the related rows we want to retrieve.
{
Products (category: "acorns") {
name
weight_in_gram
category
volume_10day(country: "US") {
quantity
spend
weight
}
}}
This query retrieves all products from the "acorns" category. Acorns are a favorite among squirrels. We then navigate to the nested volume_10day
relationship to the nested table that aggregates orders of that product over the last 10 days and groups them by country. We filter those rows to only retrieve the aggregates for the US.
Relationships allow us to construct complex queries which return all the data we need in a single request. We don't have to stitch our desired result set together by querying multiple tables. That saves you a ton of time and is also a lot faster.
Customizing the API Specification
By default, DataSQRL generates an API specification that exposes query endpoints for all tables defined in the SQRL script and makes all fields in those tables accessible, including relationships to navigate to related tables. In addition, DataSQRL generates field filters for all queries and relationships that give the user of the API the option to filter out rows.
The easiest way to customize the GraphQL API is to start with the default GraphQL schema generated by DataSQRL with the command:
docker run --rm -v $PWD:/build datasqrl/cmd compile seedshop.sqrl -a graphql
First, let's take a look at the API specification that DataSQRL generated in the file schema.graphqls
.
type NumOrders {
count: Int!
}
type Products {
id: Int!
name: String!
sizing: String!
weight_in_gram: Int!
type: String!
category: String!
usda_id: Int!
updated: String!
ordered_items(productid: Int, quantity: Int, unit_price: Float, discount: Float, total: Float): [items!]
volume_10day(country: String, quantity: Int, spend: Float, weight: Int): [volume_10day!]
}
type Query {
NumOrders(count: Int): [NumOrders!]
Orders(id: Int, customerid: Int, time: String): [orders!]
Products(id: Int, name: String, sizing: String, weight_in_gram: Int, type: String, category: String, usda_id: Int, updated: String): [Products!]
Users(id: Int, first_name: String, last_name: String, email: String, ip_address: String, country: String, changed_on: Int, timestamp: String): [Users!]
}
type Users {
id: Int!
first_name: String!
last_name: String!
email: String!
ip_address: String
country: String
changed_on: Int!
timestamp: String!
purchases(id: Int, customerid: Int, time: String): [orders!]
spending(week: String, spend: Float, saved: Float): [spending!]
past_purchases(productid: Int, num_orders: Int, total_quantity: Int): [past_purchases!]
}
.... [truncated]
The GraphQL schema has one type for each table we defined in the SQRL script. The types have a field for each column in the associated table, including relationship columns. Tables map to types and columns map to fields based on name.
The Query
type contains one query endpoint for each (non-nested) table. The queries and relationships have one argument for each field in the queried or related table. When those arguments are provided by a user querying the API, they translate to a filter on the returned rows from the underlying table. In our query examples above, we filtered Users
by id, Products
by category, and the volume_10day
relationship traversal by country.
If we specify multiple arguments, only those rows are returned that match all filter conditions.
However, most of the filters we don't need in the API, so we are going to remove them and trim down the API. We are also going to limit the query endpoints and remove some fields we don't want to expose in the API.
Rename the schema.graphqls
file to seedshop.graphqls
and change it to the following.
type Products {
id: Int!
name: String!
sizing: String!
category: String!
volume_10day(country: String): [volume_10day!]
}
type Query {
Products(id: Int, name: String, category: String): [Products!]
Users(id: Int!): Users
}
type Users {
id: Int!
first_name: String!
last_name: String!
country: String
purchases: [orders!]
spending: [spending!]
past_purchases(productid: Int): [past_purchases!]
}
type items {
quantity: Int!
unit_price: Float!
discount: Float!
total: Float!
product: Products!
}
type orders {
id: Int!
time: String!
items: [items!]
totals: totals
}
type past_purchases {
productid: Int!
num_orders: Int!
total_quantity: Int!
}
type spending {
week: String!
spend: Float!
saved: Float!
}
type totals {
price: Float!
saving: Float!
}
type volume_10day {
country: String
quantity: Int!
spend: Float!
weight: Int!
}
Note, that we removed the entire NumOrders
table because we don't need it for now.
We changed the Users
table query endpoint to Users(id: Int!): Users
to make a user id required and return only a single user (which may be null
if it doesn't exist) instead of a list of users [Users!]
.
To instruct the DataSQRL compiler to use our custom API specification, we add it as a second argument to the command.
docker run --rm -v $PWD:/build datasqrl/cmd compile seedshop.sqrl seedshop.graphqls --mnt $PWD
To launch the updated pipeline, execute:
(cd build/deploy; docker compose up)
If refresh GraphiQL in the browser, you will see your custom API.
Another neat benefit of customizing and trimming down the API specification is that it allows DataSQRL to generate more efficient data pipelines. DataSQRL automatically removes computations that aren't visible in the API and selects optimal index structures for the database based on the filters that are available in the API.
Pagination
Our current API always returns all filtered results for queries or when navigating relationships. In some cases, those result sets can be very large, and we don't want to transfer huge result sets through the API. Instead, we want to allow consumers of our API to page through the results.
We are going to add limit+offset based pagination to our API. It only requires adding the limit: Int, offset: Int
arguments to queries and relationship fields.
type Products {
id: Int!
name: String!
sizing: String!
category: String!
volume_10day(country: String, limit: Int = 20, offset: Int = 0): [volume_10day!]
}
type Query {
Products(id: Int, name: String, category: String, limit: Int!, offset: Int = 0): [Products!]
Users(id: Int!): Users
}
type Users {
id: Int!
first_name: String!
last_name: String!
country: String
purchases(limit: Int = 20, offset: Int = 0): [orders!]
spending: [spending!]
past_purchases(productid: Int): [past_purchases!]
}
.... [truncated]
The limit
argument limits the size of the returned result set to the given number. The offset
argument moves to the given position in the result set before starting to retrieve records.
Both arguments can be made required and given a default value as shown in the example above.
Update the seedshop.graphqls
schema with limit+offset pagination as shown above, save the file, and re-run the script. You can now execute the following query.
{
Products (category: "acorns", limit: 2, offset: 2) {
name
category
volume_10day(limit: 2) {
quantity
spend
weight
}
}}
This query limits the number of returned products to 5 starting after position 2 in the complete result set. When you navigate through a relationship, you can filter records and use limit
and offset
in the same way you would when querying a table at the top level to specify which related records you want to be returned, in what order, and how many of them.
Note, that these arguments are applied locally for each record that is returned. In the query above, volume_10day(limit: 2)
means that we are asking for up to 2 results for each product and not 2 total for the entire request.
As we navigate through relationships, we need to keep in mind that result set cardinalities multiply and choose small enough page sizes to avoid huge responses from the server.
Mutations and Inserting Data
Next, we are going to collect product visits through the API in order to improve our recommendation engine with recent user behavior. We are going to capture when a user visits a product page and aggregate those product visits to determine which products a user is interested in.
First, we are going to add a mutation to our GraphQL API schema to capture the product visit event. Add the following to the end of the seedshop.graphqls
file:
type Mutation {
ProductVisit(event: VisitEvent!): CreatedProductVisit
}
input VisitEvent {
userid: Int!
productid: Int!
}
type CreatedProductVisit {
_source_time: String!
productid: Int!
userid: Int!
}
We created a mutation (i.e. an API endpoint that accepts data) called ProductVisit
that accepts an input of type VisitEvent
and returns the type CreatedProductVisit
.
When you create mutations, the input type can have arbitrary fields to represent the data you want to capture. The fields of the mutation return type must be a subset of those fields plus the special field _source_time
which returns the time when the event was created on the server.
The use the product visits in our SQRL script, we import it like the other data before. Add the following line to the imports in the seedshop.sqrl
.
IMPORT seedshop.ProductVisit;
To import the data from a mutation, you use the name of the GraphQL schema file as the package name (i.e. seedshop
) and the mutation name as the table name (i.e. ProductVisit
).
Now, we can aggregate the product visits over the last 90 days for each user to determine what products they might like:
Users.product_visits := SELECT productid, count(1) as visits
FROM @ JOIN ProductVisit v ON @.id = v.userid
WHERE v._source_time > now() - INTERVAL 90 DAY
GROUP BY productid ORDER BY visits DESC;
Add the table definition after Users.past_purchases
and save the SQRL script.
To retrieve product_visits
through the API, we add the corresponding type and relationship to the GraphQL schema:
type User {
[existing fields...]
product_visits: [product_visits!]
}
type product_visits {
productid: Int!
visits: Int!
}
And there we have our custom, polished data API in GraphQL that allows us to query and add data. Check out the final GraphQL schema and corresponding SQRL script.
Next Steps
Wonderful, you have completed the 3 essential steps of building a data pipeline with DataSQRL:
- Writing SQRL scripts
- Connecting data sources
- Designing and querying the data API
Now you can go off, build amazing data products, and tell us about it.
If you are eager to continue learning, we have two more optional chapters in this tutorial:
- Advanced Topics covers additional features of DataSQRL.
- Deployment shows you how to deploy the data pipeline compiled by DataSQRL.
If you want to learn more about querying the data API from your application or favorite programming language, the reference documentation has an overview. It also covers API design in more detail.