esProc SPL would save the lives of data scientists

What should data scientists use for analysis and calculation?

SQL?
Python?
esProc SPL!

SQL looks simple, but is actually hard

e.g. Simple query
select id, name from T where id = 1
select area,sum(amount) from T group by area
e.g. E-commerce funnel analysis
WITH e1 AS (
    SELECT uid,1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime>=end_date-14 AND etime < end_date AND etype='etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid,1 AS step2, MIN(e1.t1) as t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime>=end_date-14 AND e2.etime < end_date AND e2.etime>t1 AND e2.etime < t1+7 AND etype='etype2'
    GROUP BY uid),
e3 as (
    SELECT uid,1 AS step3, MIN(e2.t1) as t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime>=end_date-14 AND e3.etime < end_date AND e3.etime>t2 AND e3.etime < t1+7 AND etype='etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
e.g. A player who scores three times within one minute
e.g. Number of users active for three consecutive days within every 7-day period
e.g. Next-day retention rate for new users
e.g. Price increase on the day when the stock price is higher than the prices 5 days before and 5 days after
e.g. ...

Debugging is cumbersome

SQL lacks common debugging facilities such as setting breakpoints and stepping through code. A query nested in multiple layers has to be split apart and executed layer by layer, which is very troublesome.

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

Unable to develop high-performance algorithms; speed depends on the database optimizer

e.g. Retrieve the top 10 records from 100 million rows of data
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC

Optimizer works: the query contains ORDER BY, which implies a full sort of 100 million rows and would be very slow. Fortunately, most databases recognize this pattern and optimize it automatically, avoiding the big sort and keeping performance acceptable.
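The sort-avoiding idea behind that optimization can be sketched in Python: while scanning, a bounded heap keeps only the 10 best candidates seen so far, so the full data set is never sorted (the amounts list below is hypothetical stand-in data, kept small so the sketch runs quickly).

```python
import heapq
import random

# Hypothetical stand-in for the Amount column of 100 million orders.
random.seed(42)
amounts = [random.randrange(1_000_000) for _ in range(100_000)]

# One pass with a 10-element heap: O(n log 10) work
# instead of the O(n log n) a full sort would cost.
top10 = heapq.nlargest(10, amounts)

# Same answer as sorting everything, without the big sort.
assert top10 == sorted(amounts, reverse=True)[:10]
```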

e.g. Finding the top 10 in each group
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders )
WHERE rn <= 10

Optimizer fails: the complexity is not much higher, yet this form overwhelms most optimizers, which fall back to a literal big sort, and performance drops sharply.

The closed nature of SQL requires importing external data into the database before it can be used, which involves many tedious chores.

6 Major Advantages: esProc SPL saves the lives of data scientists!

1. More Concise Code

e.g. Longest consecutive rising days of each stock

SPL is simpler, eliminating any loop statements.

SQL

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

SPL

A
1StockRecords.txt
2=T(A1).sort(DT)
3=A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days)

Python

import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# A new "label" starts at every non-rising day, so rows sharing a label
# form one consecutive rising run plus its starting day.
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
e.g. E-commerce funnel analysis

SPL is simpler and aligns better with natural thinking. The same code handles a funnel with any number of steps, making it both simpler and more versatile than SQL.

SQL

WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date-14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date-14 AND e2.etime < end_date AND e2.etime > t1 AND e2.etime < t1+7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date-14 AND e3.etime < end_date AND e3.etime > t2 AND e3.etime < t1+7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid

SPL

A
1=["etype1","etype2","etype3"]
2=file("event.ctx").open()
3=A2.cursor(id,etime,etype;etime>=end_date-14 && etime< end_date && A1.contain(etype))
4=A3.group(uid)
5=A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first)
6=A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime< t1+7).etime, null))))
7=A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3)
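For comparison, the same any-step funnel logic can be sketched in plain Python. This is a simplified model: event times are plain numbers, field names are hypothetical, and the window of 7 units mirrors the t1+7 condition in the queries above.

```python
from collections import defaultdict

def funnel_counts(events, steps, window=7):
    """events: (uid, etime, etype) tuples; steps: ordered funnel event types.
    Returns how many users reach each step: step 1 sets t1, and every later
    step must happen after the previous step and before t1 + window."""
    by_user = defaultdict(list)
    for uid, etime, etype in events:
        by_user[uid].append((etime, etype))
    counts = [0] * len(steps)
    for evs in by_user.values():
        evs.sort()                      # process each user's events in time order
        t1 = prev = None
        for i, step in enumerate(steps):
            found = None
            for etime, etype in evs:
                if etype == step and (i == 0 or (prev < etime < t1 + window)):
                    found = etime
                    break
            if found is None:
                break                   # funnel broken: later steps unreachable
            if i == 0:
                t1 = found
            prev = found
            counts[i] += 1
    return counts
```

Because the step list is just a parameter, the same function handles a funnel of any length, which is the versatility point made above.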

2. Strong Interactivity: Suitable for Multi-Step Exploratory Analysis and Convenient Debugging

3. Built-in Big Data and Parallel Processing Capabilities

Simple Big Data Computing

In-memory computing
A
1smallData.txt
2=file(A1).import@t()
3=A2.groups(state;sum(amount):amount)
External storage computing
A
1bigData.txt
2=file(A1).cursor@t()
3=A2.groups(state;sum(amount):amount)

SPL offers a cursor-based access method for big data that exceeds memory capacity. Importantly, the code for in-memory computing and external storage computing remains almost identical, requiring no additional workload.
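A rough Python analogue of this cursor mechanism uses pandas' chunksize to stream the file. The in-memory string below is a stand-in for bigData.txt, and the two-row chunk size is only for illustration; a real file path and a large chunk size work the same way.

```python
import io
import pandas as pd

# Stand-in for bigData.txt; a real file path works the same way.
raw = "state\tamount\nCA\t10\nNY\t5\nCA\t7\nTX\t3\nNY\t2\n"

# Cursor-style aggregation: read in chunks, aggregate each chunk, then merge
# the partial results -- memory stays bounded by the chunk size, not file size.
partials = [
    chunk.groupby("state")["amount"].sum()
    for chunk in pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2)
]
totals = pd.concat(partials).groupby(level=0).sum()
```

totals ends up with CA=17, NY=7, TX=3, the same result a whole-file group-and-sum would return.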

Simple parallel computing

Serial
A
1bigData.txt
2=file(A1).cursor@t()
3=A2.groups(state;sum(amount):amount)
Parallel
A
1bigData.txt
2=file(A1).cursor@tm()
3=A2.groups(state;sum(amount):amount)

SPL supports thread-based parallel computing, allowing full utilization of multiple CPU cores. Enabling it is very simple: just add the @m option.

4. High-Performance Algorithms

SPL facilitates writing code with low computational complexity and achieves faster execution. For instance, in the previously mentioned topN problem, SPL treats topN as aggregate computation, avoiding big sorting in the calculation logic and achieving significant speed improvements. This capability is inherently supported and doesn't rely on optimizer assistance.

Orders.groups(;top(10;-Amount))

Moreover, the code for top N within each group and across the entire set is similar—both are straightforward and run quickly.

Orders.groups(Area;top(10;-Amount))
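The aggregate-style top-N idea can be sketched in Python as well: one pass over hypothetical (area, amount) rows with a small bounded heap per group, so neither the whole set nor any single group is ever fully sorted.

```python
import heapq
from collections import defaultdict

def top_n_per_group(rows, n=10):
    """rows: (area, amount) pairs. Keeps an n-element min-heap per area, so
    grouped top-N costs one pass plus O(log n) per row -- no global sort."""
    heaps = defaultdict(list)
    for area, amount in rows:
        heap = heaps[area]
        if len(heap) < n:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            heapq.heapreplace(heap, amount)   # evict the smallest kept value
    return {area: sorted(heap, reverse=True) for area, heap in heaps.items()}
```

Dropping the grouping key gives the whole-set variant, just as the two SPL lines differ only in the groups() argument.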

SPL also offers a wide range of high-performance algorithms for search, traversal, join, and cluster computing, many of which are SPL's original inventions.
These algorithms dramatically boost computational performance.

Search

  • Binary search
  • Sequence number positioning
  • Index search
  • Batch search

Traversal

  • Cursor filtering
  • Multi-purpose traversal
  • Multi-cursor
  • New understanding of aggregation
  • Ordered grouping
  • Program cursor
  • Column-wise computing

Association

  • Foreign key pre-association
  • Foreign key numberization
  • Alignment sequence
  • Big dimension table search
  • One-side partition
  • Order-based merge
  • Associated positioning

Cluster

  • Multi-zone composite table
  • Duplicated dimension table
  • Segmented dimension table
  • Preemptive Load Balancing

5. Openness: Directly Compute on Diverse Data Sources

6. Pure Java Enables Direct Integration of Explored Results into Enterprise Applications

Python surpasses SQL, but is still difficult to write in complex situations

e.g. Longest consecutive rising days of each stock
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# A new "label" starts at every non-rising day, so rows sharing a label
# form one consecutive rising run plus its starting day.
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

Debugging still relies on print statements

No big data support; only pseudo-parallelism

Non-Java systems often require redevelopment when integrating explored results into enterprise applications