Examples

The goal of this section is to provide some examples of real Burroughs use cases and demonstrate the different types of queries Burroughs can perform. The queries all rely on The Complete Journey 2.0 from 84.51. The data encode a large number of retail transactions and can be produced using the command .producer transactions_producer start from the CLI or by navigating to the producers tab and starting the transactions_producer from the GUI interface.

The schema for this data is as follows:

Field Name Type Description
BasketNum Integer A number that identifies a single transaction, potentially with multiple products
Date BigInt The date of the transaction
ProductNum String Identifies the product purchased
Spend Double The amount spent
Units Double The quantity purchased for that product
StoreR String Identifies which location it was purchased at. One of SOUTH, EAST, WEST, or CENTRAL

For testing, Burroughs also provides a second, synthetic dataset called customers which can be joined with the transactions data. The customers data can be produced in a similar fashion using customers_producer and has the following schema:

Field Name Type Description
BasketNum Integer The same as the basketnum in transactions, for each record in the transactions data set there is a record in customers with a matching basketnum
CustId Integer The id of the customer who made the purchase

Example 1: Basic Aggregation

This is an example of a simple query that performs some aggregations and filters the results both before and after the grouping.

Query

select
    basketnum,
    count(*) as ProductCount,
    sum(spend) as TotalSpend,
    avg(units) as AverageUnits
from transactions
where storer = 'CENTRAL'
group by basketnum
having count(*) > 2

Output

Example 2: Basic Join

This query uses a join to compute the total amount spent by each customer.

Query

select
    custid,
    sum(spend) as TotalSpend
from transactions t
inner join customers c
    on t.basketnum = c.basketnum
group by custid

Output

Example 3: Subquery

Like in a relational database, the output from one query can be used as the input in another one. The following query investigates differences in quantity purchased per transaction for each productnum between the CENTRAL and SOUTH stores.

Query

select
    central_products.productnum,
    avg(central_products.units - south_products.units) as AverageDiff
from 
(
    select productnum, units 
    from transactions
    where storer = 'CENTRAL'
) as central_products
inner join
(
    select productnum, units
    from transactions
    where storer = 'SOUTH'
) as south_products
on central_products.productnum = south_products.productnum
group by 1;

Output

Example 4: Common Table Expression

Instead of placing subqueries inline, they can also be defined ahead of time using common table expressions. For instance, the previous example could be rewritten as the following query:

Query

with central_products as (
    select productnum, units 
    from transactions
    where storer = 'CENTRAL'
), south_products as (
    select productnum, units
    from transactions
    where storer = 'SOUTH'
)
select
    central_products.productnum,
    avg(central_products.units - south_products.units) as AverageDiff
from central_products
inner join south_products
    on central_products.productnum = south_products.productnum
group by 1;

Output

The output is the same as Example 3

Example 5: Self-join

Burroughs supports selj-joins, something that is not inherently supported by KsqlDB. The below example uses a self-join to find the frequency with with certain combinations of items appear in the same basket.

Query

with pairs as ( 
    select 
        it1.productnum as source_item, 
        it2.productnum as target_item 
    from transactions it1 
    inner join transactions it2 
        on it1.basketnum = it2.basketnum 
        and it1.productnum < it2.productnum 
    where 
        it1.units > 0 
        and it2.units > 0 
) 
select 
    source_item, 
    target_item, 
count(1) as frequency 
from pairs 
group by 1,2 
having count(1) > 2;

Output

Example 6: Group_Concat

group_concat, as described in the previous section, is an aggregate function that appends all of the values for a given field in each group together into one long string. The user can specify the separator to use between values as well as whether to include duplicates. The below query gets a list of all of the locations a customer has shopped at, in addition to other summary statistics.

Query

select
    custid as Customer,
    avg(units) as AverageUnits,
    sum(spend) as TotalSpend,
    count(*) as TransactionCount,
    group_concat(distinct storer, '&') as Regions
from transactions t
left join customers c
    on t.basketnum = c.basketnum
group by 1;

Output

Example 7: Date Manipulation

Burroughs, like its dependencies Avro and KsqlDB, does not support any kind of date data type. Instead, dates must be stored as integers representing the number of days since the epoch. Such dates are not human readable, so Burroughs provides the ability to turn those into normal date strings using the cast function as shown in the following query which counts the total number of transactions that occur on each day.

Query

select 
    cast("Date" as Date), 
    count(*) as TotalTransactions
from transactions
group by 1;

Note: if date is a column name, it must be surrounded by double quotes because date is a keyword in SQL

Output

You can also use the cast function to transform a literal date to the proper integer representation for comparison, like in the following query, which finds largest purchase in terms of quantity for each date after June 1, 2016.

Query

select 
    max(units) as BiggestPurchase,
    cast("Date" as Date) from transactions
where "Date" > CAST('2016-06-01' as date)
group by 2; 

Output

Example 8: Multiple Group By Clauses

The following query is an example of performing aggregations in multiple stages using common table expressions. Note that it would be cleaner to group by storer in addition to basketnum in the inner query, but this is not possible because Burroughs does not allow multiple fields in the group by of a CTE. Instead, the function earliest_by_offset is used to include the region, assuming that each basket number belongs to only one region.

Query

with raw as (
    select
        earliest_by_offset(store_r) as region,
        basketnum,
        count(1) as num_products,
        sum(units) as total_quantity
    from transactions
    group by 2
)
select
    region,
    avg(num_products) as avg_products,
    avg(total_quantity) as avg_quantity
from raw
group by 1;

Output

Example 9: Windowing