select id, name from T where id = 1
select area, sum(amount) from T group by area
```sql
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date
      AND e2.etime > t1 AND e2.etime < t1 + 7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date
      AND e3.etime > t2 AND e3.etime < t1 + 7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
```
SQL lacks common debugging methods such as setting breakpoints and stepping through code; when a query is nested in multiple layers, it has to be split apart and executed layer by layer, which is very troublesome.
Optimizer works: the task uses ORDER BY, which implies a big sort and, executed literally, would be very slow. However, the database can optimize the execution automatically, avoiding the big sort and delivering good performance.
Optimizer fails: a slightly more complex variant (such as top-N within each group) doesn't increase the complexity significantly, yet it often overwhelms most optimizers, which fall back to a literal big sort, and performance declines sharply.
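To make the contrast concrete, here is a small Python sketch (a toy illustration, not any database's actual internals) of the difference between executing the sort literally and what a working optimizer effectively does: treating top-N as an aggregation that never performs the big sort.

```python
import heapq
import random

amounts = [random.random() for _ in range(1_000_000)]

# Literal plan for "ORDER BY amount DESC LIMIT 10": a full O(n log n) sort
top10_sorted = sorted(amounts, reverse=True)[:10]

# Optimized plan: treat top-N as an aggregation and keep only a 10-element
# heap while scanning, never sorting the full data set
top10_heap = heapq.nlargest(10, amounts)

assert top10_sorted == top10_heap
```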
SPL is simpler, requiring no loop statements at all.
```sql
SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)
                     THEN 0 ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE
```
| | A |
|---|---|
| 1 | StockRecords.txt |
| 2 | =T(A1).sort(DT) |
| 3 | =A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days) |
```python
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# A new label starts whenever the closing price fails to rise,
# so each label marks one consecutive rising run within a stock
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    # The day that broke the streak is counted in each run, hence the -1
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
```
SPL is simpler and aligns better with natural thinking. The SPL code below (shown after the SQL for comparison) can handle a funnel with any number of steps, offering greater simplicity and versatility than SQL.
```sql
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date
      AND e2.etime > t1 AND e2.etime < t1 + 7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date
      AND e3.etime > t2 AND e3.etime < t1 + 7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
```
| | A |
|---|---|
| 1 | =["etype1","etype2","etype3"] |
| 2 | =file("event.ctx").open() |
| 3 | =A2.cursor(uid,etime,etype;etime>=end_date-14 && etime<end_date && A1.contain(etype)) |
| 4 | =A3.group(uid) |
| 5 | =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first) |
| 6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime,null)))) |
| 7 | =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3) |
| | A |
|---|---|
| 1 | smallData.txt |
| 2 | =file(A1).import@t() |
| 3 | =A2.groups(state;sum(amount):amount) |
| | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@t() |
| 3 | =A2.groups(state;sum(amount):amount) |
SPL offers a cursor-based access method for big data that exceeds memory capacity. Importantly, the code for in-memory computing and for external-storage computing remains almost identical, requiring no additional development effort.
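For a feel of what cursor-style processing means, here is a rough Python analogue (an illustration of the concept, not SPL's implementation), assuming bigData.txt is a tab-separated file with a header row, as the @t option suggests:

```python
import pandas as pd

# Stream the file in fixed-size chunks instead of loading it whole,
# accumulating a partial grouped sum per chunk
partials = []
for chunk in pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000):
    partials.append(chunk.groupby("state")["amount"].sum())

# Combine the per-chunk partial results into the final aggregate
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```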
| | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@t() |
| 3 | =A2.groups(state;sum(amount):amount) |
| | A |
|---|---|
| 1 | bigData.txt |
| 2 | =file(A1).cursor@tm() |
| 3 | =A2.groups(state;sum(amount):amount) |
SPL supports thread-based parallel computing, allowing full utilization of multiple CPU cores. Enabling it is very simple: just add the @m option, as in A2 above.
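For comparison, here is roughly what the same parallel grouped aggregation takes when written by hand in Python (a sketch under the same file-format assumptions as above; SPL's @m hides all of this plumbing behind a single character):

```python
import pandas as pd
from multiprocessing import Pool

# Each worker computes a partial grouped sum over one chunk of the file
def partial_sum(chunk: pd.DataFrame) -> pd.Series:
    return chunk.groupby("state")["amount"].sum()

if __name__ == "__main__":
    # Chunks are materialized up front here for brevity; a production
    # version would stream them to the workers instead
    chunks = list(pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000))
    with Pool() as pool:
        partials = pool.map(partial_sum, chunks)
    # Merge the partial results into the final grouped sum
    result = pd.concat(partials).groupby(level=0).sum()
    print(result)
```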
SPL makes it easy to write code with low computational complexity that executes faster. For instance, in the previously mentioned topN problem, SPL treats topN as an aggregate operation, so the calculation logic avoids the big sort entirely and achieves significant speed improvement. This capability is built into the language and doesn't rely on optimizer assistance.
Moreover, the code for top N within each group is nearly the same as for top N across the entire set; both are straightforward and run quickly, as the sketch below illustrates.
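Below is a minimal Python sketch of this idea (sample data and names invented for illustration): each group maintains only an N-element min-heap while scanning, so no big sort ever occurs, and the whole-set case is just the grouped case with a single group.

```python
import heapq
from collections import defaultdict

# Keep at most n values in a min-heap; the heap root is the smallest of the
# current top n, so any larger incoming value replaces it
def push_top_n(heap, value, n):
    if len(heap) < n:
        heapq.heappush(heap, value)
    elif value > heap[0]:
        heapq.heapreplace(heap, value)

N = 2
rows = [("NY", 120.0), ("CA", 95.5), ("NY", 80.2), ("CA", 130.1), ("NY", 99.9)]

overall = []                 # top N across the entire set
groups = defaultdict(list)   # top N within each group
for state, amount in rows:
    push_top_n(overall, amount, N)          # whole-set top N
    push_top_n(groups[state], amount, N)    # in-group top N: same aggregation

print(sorted(overall, reverse=True))   # [130.1, 120.0]
print({s: sorted(h, reverse=True) for s, h in groups.items()})
```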
SPL also offers a wide range of high-performance algorithms for search, traversal, join, and clustering operations, many of which are unique SPL innovations. These algorithms can improve computational performance dramatically.