# esProc SPL: Something that would save the lives of data scientists

## What should data scientists use for analysis and calculation?

### Looks simple, actually hard

e.g. Simple queries

```
select id, name from T where id = 1
select area, sum(amount) from T group by area
```
e.g. E-commerce funnel analysis

```
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date AND e2.etime > t1 AND e2.etime < t1 + 7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date AND e3.etime > t2 AND e3.etime < t1 + 7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
```
e.g. A player who scores three times within one minute
e.g. Number of users active for three consecutive days within every 7-day period
e.g. Next-day retention rate for new users
e.g. Price increase on the day when the stock price is higher than both 5 days before and 5 days after
e.g. ...

### Debugging is cumbersome

SQL lacks common debugging methods such as setting breakpoints and stepping through code. When a query is nested in multiple layers, it has to be split apart and executed layer by layer, which is very troublesome.

```
SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
                ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE
```
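
By contrast, tools with stepwise execution let you inspect every intermediate result as it is produced. A rough pandas sketch of the same calculation (assuming a hypothetical stock.csv with CODE, DT, and CL columns) shows what layer-by-layer inspection looks like; each print is effectively a breakpoint:

```
import pandas as pd

# hypothetical input: one row per stock per day, with CODE, DT and CL (closing price)
stock = pd.read_csv("stock.csv", parse_dates=["DT"]).sort_values(["CODE", "DT"])

# layer 1: flag the days on which the close did not rise, then inspect the result
stock["updown_flag"] = (stock.groupby("CODE")["CL"].diff() <= 0).astype(int)
print(stock.head())   # breakpoint-style check of the first layer

# layer 2: a running sum of the flags numbers each consecutive rising run
stock["no_up_days"] = stock.groupby("CODE")["updown_flag"].cumsum()
print(stock.head())   # check the second layer before building on it

# layers 3-4: count the length of each run, then take the longest per stock
longest = (stock.groupby(["CODE", "no_up_days"]).size() - 1).groupby(level="CODE").max()
print(longest)
```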

### Unable to Develop High-Performance Algorithms, Relying on Database Optimizers for Speed

e.g. Retrieve the top 10 records from 100 million rows of data

```
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
```

Optimizer works: the query uses ORDER BY, which implies a full sort of 100 million rows and would be very slow. Fortunately, the database can optimize this pattern automatically, avoiding the big sort and keeping performance acceptable.
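
The optimization can be stated concretely: treat top-N as an aggregation that maintains a small heap during a single scan, so the full sort never happens. A minimal Python sketch of the idea, using randomly generated amounts as stand-in data:

```
import heapq
import random

# stand-in data: a large list of order amounts (scaled down for a quick run)
amounts = [random.uniform(0, 10_000) for _ in range(1_000_000)]

# one pass, keeping only the 10 largest values seen so far in a small heap:
# O(n log 10) instead of the O(n log n) full sort implied by ORDER BY
top10 = heapq.nlargest(10, amounts)
print(top10)
```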

e.g. Finding the top 10 in each group

```
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders )
WHERE rn <= 10
```

Optimizer fails: the complexity barely increases, yet this pattern defeats most optimizers; they fall back to a literal big sort, and performance drops sharply.

## 6 Major Advantages: esProc SPL saves the lives of data scientists!

### 1. More Concise Code

e.g. Longest consecutive rising days of each stock

SPL is simpler, requiring no loop statements at all.

#### SQL

```
SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
                ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE
```

#### SPL

|   | A |
|---|---|
| 1 | StockRecords.txt |
| 2 | =T(A1).sort(DT) |
| 3 | =A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days) |

#### Python

```
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")   # the source text is assumed tab-separated
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# a new label starts whenever the close fails to rise, numbering each rising run
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
```
e.g. E-commerce funnel analysis

SPL is simpler and aligns more closely with natural thinking. The code below handles a funnel with any number of steps, making it both simpler and more versatile than the SQL.

#### SQL

```
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date AND e2.etime > t1 AND e2.etime < t1 + 7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date AND e3.etime > t2 AND e3.etime < t1 + 7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
```

#### SPL

|   | A |
|---|---|
| 1 | =["etype1","etype2","etype3"] |
| 2 | =file("event.ctx").open() |
| 3 | =A2.cursor(id,etime,etype;etime>=end_date-14 && etime<end_date && A1.contain(etype)) |
| 4 | =A3.group(uid) |
| 5 | =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first) |
| 6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime,null)))) |
| 7 | =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3) |
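
For comparison, here is a rough Python sketch of the same any-step logic, assuming a hypothetical events.csv with uid, etime, and etype columns, and using the latest event time as a stand-in for the end_date parameter; only the steps list changes for a different funnel:

```
import pandas as pd

steps = ["etype1", "etype2", "etype3"]   # a funnel of any length

# hypothetical source: uid, etime, etype columns; etime parsed as timestamps
events = pd.read_csv("events.csv", parse_dates=["etime"])
end_date = events["etime"].max()         # stand-in for the end_date parameter

window = events[(events["etime"] >= end_date - pd.Timedelta(days=14)) &
                (events["etime"] < end_date) &
                events["etype"].isin(steps)]

counts = [0] * len(steps)
for _, g in window.sort_values("etime").groupby("uid"):
    t1 = t = None                        # first-step time and latest-step time
    for i, step in enumerate(steps):
        if i == 0:
            hit = g[g["etype"] == step]
        else:
            hit = g[(g["etype"] == step) & (g["etime"] > t) &
                    (g["etime"] < t1 + pd.Timedelta(days=7))]
        if hit.empty:
            break
        t = hit["etime"].iloc[0]
        if i == 0:
            t1 = t
        counts[i] += 1

print({f"step{i + 1}": n for i, n in enumerate(counts)})
```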

### 3. Built-in big data and parallel processing capabilities

#### Simple Big Data Computing

##### In-memory computing
|   | A |
|---|---|
| 1 | smallData.txt |
| 2 | =file(A1).import@t() |
| 3 | =A2.groups(state;sum(amount):amount) |
##### External storage computing
|   | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@t() |
| 3 | =A2.groups(state;sum(amount):amount) |

SPL offers a cursor-based access method for big data that exceeds memory capacity. Importantly, the code for in-memory computing and external storage computing remains almost identical, requiring no additional workload.
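
The cursor idea corresponds roughly to chunked reading in other tools. A pandas sketch of the same external-storage aggregation, assuming bigData.txt is tab-separated with state and amount columns, streams the file and merges partial results:

```
import pandas as pd

# cursor-style access: stream the file in 1,000,000-row chunks instead of loading it all
totals = {}
for chunk in pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000):
    for state, amount in chunk.groupby("state")["amount"].sum().items():
        totals[state] = totals.get(state, 0) + amount

result = pd.Series(totals, name="amount").sort_index()
print(result)
```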

#### Simple parallel computing

##### Serial
|   | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@t() |
| 3 | =A2.groups(state;sum(amount):amount) |
##### Parallel
|   | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@tm() |
| 3 | =A2.groups(state;sum(amount):amount) |

SPL supports thread-based parallel computing, allowing full utilization of multiple CPU cores. Enabling it is very simple: just add the @m option to the cursor.
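
For a sense of what @m hides, here is a rough Python sketch that spreads the same aggregation across worker processes (using the same assumed tab-separated bigData.txt; reading stays serial here, only the per-chunk aggregation runs in parallel). SPL reduces this plumbing to a one-letter option:

```
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def agg(chunk):
    # partial aggregation for one chunk: sum amount by state
    return chunk.groupby("state")["amount"].sum()

if __name__ == "__main__":
    chunks = pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000)
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(agg, chunks))
    # merge the per-chunk partial results into the final aggregation
    result = pd.concat(partials).groupby(level=0).sum()
    print(result)
```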

### 4. High-performance algorithms

SPL facilitates writing code with low computational complexity and achieves faster execution. For instance, in the previously mentioned topN problem, SPL treats topN as aggregate computation, avoiding big sorting in the calculation logic and achieving significant speed improvements. This capability is inherently supported and doesn't rely on optimizer assistance.

```
Orders.groups(;top(10;-Amount))
```

Moreover, the code for top N within each group looks almost the same as the code for the entire set; both are straightforward and run quickly.

```
Orders.groups(Area;top(10;-Amount))
```
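
The grouped variant admits the same aggregate-style treatment: one small heap per group, a single pass over the data, and still no big sort. A minimal Python sketch over randomly generated (area, amount) pairs:

```
import heapq
import random
from collections import defaultdict

# stand-in data: (area, amount) pairs
orders = [(random.choice("ABCD"), random.uniform(0, 10_000)) for _ in range(1_000_000)]

# one pass over the data, one 10-element heap per area, no big sort anywhere
heaps = defaultdict(list)
for area, amount in orders:
    h = heaps[area]
    if len(h) < 10:
        heapq.heappush(h, amount)
    elif amount > h[0]:
        heapq.heapreplace(h, amount)   # evict the smallest of the current top 10

top10_by_area = {area: sorted(h, reverse=True) for area, h in heaps.items()}
print({area: vals[:3] for area, vals in top10_by_area.items()})   # peek at top 3 per area
```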

SPL also offers a wide range of high-performance algorithms covering search, traversal, association, and cluster computing, many of which are SPL's original inventions. These algorithms improve computational performance dramatically.

#### Search

• Binary search
• Sequence number positioning
• Index search
• Batch search

#### Traversal

• Cursor filtering
• Multi-purpose traversal
• Multi-cursor
• Aggregate understanding
• Ordered grouping
• Program cursor
• Column-wise computing

#### Association

• Foreign key pre-association
• Foreign key numberization
• Alignment sequence
• Big dimension table search
• One-side partition
• Order-based merge
• Associated positioning

#### Cluster

• Multi-zone composite table
• Duplicated dimension table
• Segmented dimension table

### Python surpasses SQL, but is still difficult to write in complex situations

e.g. Longest consecutive rising days of each stock

```
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")   # the source text is assumed tab-separated
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# a new label starts whenever the close fails to rise, numbering each rising run
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
```