esProc SPL’s Grouping Operations: The Most Powerful in History, Bar None

Grouping is a common structured data calculation, with corresponding statements and functions available in both SQL and Python. However, these languages are far less powerful than the grouping operations offered by esProc SPL.

Ordinary grouping often involves aggregation. For example, to calculate the number of employees for each department, the SQL code would be:

select DEPT,count(1) emp_count from employee group by DEPT

Of course, SPL also supports such basic grouping and aggregation operations:

T("employee.csv").groups(DEPT;count(1):emp_count)

It’s similar to SQL.

In addition, SPL offers richer grouping operations. Below are two scenarios:

1. Grouped subset

We are sometimes interested not only in grouped aggregate values, but also in the detailed data contained in each group – i.e., the grouped subset.

For instance, to identify employees from departments with over 10 members, a straightforward approach is to divide all employees into sub-sets by department, forming a set of sets. Then, filter the set to select the grouped subsets with a length greater than 10. Finally, union these subsets to get the desired result.

This means you cannot perform forced aggregation after grouping; you need to keep the grouped subsets.

SPL supports grouped subset:

T("employee.csv").group(DEPT).select(~.len()>10).conj()

The group function divides the data into grouped subsets by department, the select function filters the subsets with a length exceeding 10, and the conj function unions the filtered subsets.

SQL’s grouping is always bound to aggregation and lacks the concept of grouped subset. So, the easiest way to solve this problem in SQL is to traverse the data twice:

select *
from employee
where DEPT in
    (select DEPT
     from employee
     group by DEPT
     having count(1)>10)
order by DEPT

First, use a subquery to group and identify departments with more than 10 employees. Then, filter the employee data for those departments again in the outer query.

Python also doesn’t essentially support grouped subset objects. groupby function doesn’t return a set of grouped subsets. However, since this problem only requires one filtering step, it can be solved relatively concisely using lambda syntax:

import pandas as pd
employee = pd.read_csv('employee.csv')
groups = employee.groupby('DEPT').filter(lambda x: len(x) > 10)
result = groups.sort_values(by='DEPT')

It appears to follow a computation logic similar to SPL, essentially implementing the concept of grouped subsets. Here, the ‘x’ in lambda serves the same purpose as the ‘~’ symbol in SPL. The difference is that filter function doesn’t return a set of sets, but automatically flattens them into a single-layer set, eliminating the need for conj to union in SPL.

Thus, the filter result is no longer a set of grouped subsets and cannot be used in subsequent grouped subset operations.

Let’s explore more operations on the grouped subsets. After identifying departments with more than 10 employees, sort these departments by their average employee salary in descending order, while also sort employees within each department by their individual salaries in descending order.

This requires performing sequential operations on the set of grouped subsets: first filtering, then sorting, followed by sorting each individual subset.

SPL can naturally continue this process, starting from the code above:

=T("employee.csv").group(DEPT).select(~.len()>10).sort(-~.avg(SALARY)).(~.sort(-SALARY)).conj()

Because SQL lacks the concept of grouped subsets, it still requires traversing the data twice and using a join.

select * from employee
join (
    select DEPT,avg(SALARY) as avg_salary,
    from employee
    group by DEPT
    having count(1) > 10
) dept_stats on employee.DEPT = dept_stats.DEPT
order by 
    dept_stats.avg_salary desc,
SALARY desc

The subquery calculates the average salary, requiring both the inner and outer queries to be rewritten using JOIN operations, making the code somewhat convoluted.

Python’s grouped subset mechanism falls short in this scenario, as any operation after grouping will flatten the results into a single-layer table. To perform further computations on the grouped subsets, the data must be regrouped to regenerate grouped subsets:

import pandas as pd
employee = pd.read_csv('employee.csv')
groups = employee.groupby('DEPT').filter(lambda x: len(x) > 10)
dept_avg_salary = groups.groupby('DEPT')['SALARY'].mean().reset_index(name='avg_salary')
merged_df = pd.merge(groups, dept_avg_salary, on='DEPT') 
sorted_df = merged_df.sort_values(by=['avg_salary', 'SALARY'], ascending=[False, False])
final_result = sorted_df[employee.columns]

Since the filter operation in the first step returns a single-layer table, regrouping is required to compute the aggregate values. The subsequent code will also need to borrow ideas from the preceding SQL approach, using merge or join operations to bind the aggregate values to achieve the sorting objective.

Python lacks essential grouped subset mechanism. While leveraging lambda syntax can achieve the effect of grouped subsets in a single step, enabling more complex aggregation computations—a significant advantage over SQL—it still falls far short when compared to SPL.

To address this problem in Python, if the requirement is to group only once, you could apply the entire SQL strategy mentioned earlier. However, this would still require traversing the data twice and results in more verbose code. We therefore omit the implementation details here.

Computations involving grouped subsets are extremely common, as evidenced by the numerous related questions easily found on prominent Q&A platforms, such as:

How do I perform recursive search in Oracle

Adding tuple rows to each subtuple group in SQL

Column comparison between rows of the same group used in Partition By Clause

…

2. Ordered grouping

Grouped subsets are often involved with ordered grouping operations.

For example, to find the longest consecutive days of price increase for a stock, besides the direct traversal method, you can also utilize a grouping approach: First, sort the stock’s closing prices by date and traverse the data. When the price increases on a given day, group that day with the previous day; when the price decreases, start a new group. After completing the traversal, consecutive days of price increases will be grouped together. Then, you simply need to identify which group has the most members.

However, this grouping is not conventional equivalence grouping, but an ordered grouping based on conditions during traversal.

SPL supports this kind of ordered grouping based on changing conditions.

stock.sort(Date).group@i(Price<=Price[-1]).max(~.len())

By appending the @i option to the group function, a new group will be created whenever the expression Price<=Price[-1] evaluates to true, corresponding to scenarios where the stock price has not risen. SPL supports cross-row referencing. Specifically, Price[-1] represents the closing price of the preceding row.

SQL lacks this feature, making it much harder.

SELECT MAX(ContinuousDays)
FROM(
    SELECT COUNT(*) AS ContinuousDays
    FROM (
        SELECT SUM(RisingFlag) OVER (ORDER BY Date) AS NoRisingDays
        FROM(
            SELECT Date, 
                    CASE WHEN Price > LAG(Price) OVER (ORDER BY Date) THEN 0
                    ELSE 1 END AS RisingFlag
            FROM stock
	     )
         )
    )
GROUP BY NoRisingDays

Due to the lack of ordered grouping support, SQL has to convert it to common equivalence grouping through “accumulating non-increasing days”, resulting in a highly convoluted process. Determining increases and calculating cumulative values both require window functions in conjunction with subqueries, resulting in a situation with two window functions nested four layers deep, making the code difficult to write and understand.

Python also lacks direct ordered grouping support. Like SQL, it requires deriving a cumulative value column and then converting it to equivalence grouping, following similarly convoluted logic. However, Python's stronger adjacent element computation capabilities make the code more concise than SQL:

stock = pd.read_csv('stock.csv')
stock.sort_values(by=['Date'], inplace=True)
stock['NoRisingDays'] = stock['Price'].diff() > 0
grouped=stock.groupby((~stock['NoRisingDays']).cumsum())['NoRisingDays'].cumsum().where(stock['NoRisingDays'], 0)
max_increase_days = grouped.max()

There are numerous practical problems involving ordered grouping, such as:

List of all the last people to enter the elevator

SQL query to track production operations outcome progression with conditional nulls

Oracle SQL data migration row to column based in month

…

Grouped subsets and ordered grouping can also be nested.

For instance, if we want to calculate the longest consecutive days of price increase for each stock, we can first group by stock code to keep grouped subsets of each stock. Then, within each subset, we can apply the aforementioned ordered grouping method to compute the longest consecutive days of price increase.

SPL supports both true grouped subset concept and ordered grouping, enabling effortless implementation of nested computations.

stock.sort(Date).group(Code;~.group@i(Price<=Price[-1]).max(~.len()):max_increase_days)

You simply need to combine the two methods described previously.

SQL, on the other hand, requires adding the stock code as a partition field within the window function:

SELECT Code, MAX(ContinuousDays)
FROM(
    SELECT Code, COUNT(*) AS ContinuousDays
    FROM(
        SELECT Code, Date, SUM(RisingFlag) OVER (PARTITION BY Code ORDER BY Code, Date) AS NoRisingDays
        FROM(
            SELECT Code, Date, 
                    CASE WHEN Price > LAG(Price) OVER (PARTITION BY Code ORDER BY Code, Date)  THEN 0
                    ELSE 1 END AS RisingFlag
            FROM stock
            )
        )
    GROUP BY Code, NoRisingDays
    )
GROUP BY Code

Once you understand the previous code, making such modifications isn’t particularly difficult, but the code becomes even more cumbersome.

Python neither supports continuous operations on grouped subsets nor ordered grouping. The only approach is to convert ordered grouping to equivalence grouping using the previously mentioned method, combined with repeated regrouping:

import pandas as pd
stock = pd.read_csv('stock.csv')
stock.sort_values(by=['Code', 'Date'], inplace=True)
stock['NoRisingDays']=stock.groupby('Code')['Price'].diff().fillna(0).le(0).astype(int).cumsum()
grouped=stock.groupby(['Code','NoRisingDays']).size().reset_index(name='ContinuousDays')
max_increase_days = grouped.groupby('Code')['ContinuousDays'].max()
max_rise_df = max_increase_days.reset_index(name='max_increase_days')

The code is convoluted and cumbersome.

There are also numerous practical problems involving nested ordered grouping within grouped subsets:

Postgresql - How to calculate swipe in and swipe out time

SQL formatting to "user friendly date"

Identify groups of sequential records

…

SPL offers additional industry-unique grouping methods such as alignment grouping, enumeration grouping, and sequence-based grouping, far surpassing the capabilities of SQL and Python, truly earning its reputation as the most powerful in history.

esProc SPL is open-source and can be available here: Open-Source Address.

Say something nice to AdamHanson…

1

This comparison is wonderful. SPL has obvious advantages in grouped subsets and ordered grouping. However, SQL is irreplaceable in complex transaction processing and close integration with databases. Python is highly flexible for quickly building data processing flows. So, if real - time performance is extremely crucial in a project, which language do you think is more suitable?

mars_hughes

·
a year ago
·
1

This is quite exciting. SPL can solve the pain point in SQL implementations requiring nested subqueries.

BrettHoffman

·
a year ago
·
1

Groups are really strong
A second order arithmetic grouping was previously done (I don't know if this statement is correct), which is to group a sequence so that the first group subset contains 1 element, the second group itself contains 2 elements, and so on, the k-1 group subset has k-1 elements, and the rest goes into the K-1 group subset. In the past, it was necessary to calculate the general term formula by mathematics, such as the first and second writing methods below, and later wrote the third one, which was not so laborious.

swifya

·
a year ago
·