esProc SPL

A must-have tool for data scientists
Low code / Strong interactivity / Big data / Super portable / Easy integration

Table of Contents

  1. Common tools and their problems
  2. A must-have tool for data scientists
  3. Technical Characteristics
  4. More code comparison examples
  5. FAQ
  6. Summary

01 Common tools and their problems

Exploring data using SQL looks simple

select id, name from T where id = 1
select area,sum(amount) from T group by area
Almost all data scientists can use SQL for exploration and analysis. SQL looks simple and offers a degree of interactivity, so using it for data exploration and analysis seems like a good choice.

Facing complex problems, the difficulty of writing SQL increases sharply

Longest consecutive rising days of a stock

select max(consecutive_day)  from (
    select count(*) consecutive_day  from (
        select sum(rise_or_fall) over(order by trade_date) day_no_gain  from (
            select trade_date, 
               case when close_price > lag(close_price) over( order by trade_date) 
                then 0 else 1 end rise_or_fall from stock_price
        )
     ) group by day_no_gain
)				

Stock price increase for the top 3 days

with ranked_prices as (
	select closing,lag(closing) over (order by closing) as previous_closing,
			row_number() over (order by closing desc) as rank
	from share62)
select
	closing as closing_price,closing / previous_closing - 1 as growth_rate
from ranked_prices
where rank <= 3 order by rank

E-commerce funnel analysis

with e1 as
(select uid, 1 as step1, min(etime) as t1
	from event
where etime >= to_date('2021-01-10')
	and etime < to_date('2021-01-25')
	and eventtype = 'eventtype1'
	and …
group by 1),
e2 as
(select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
	from event as e2
inner join e1
	on e2.uid = e1.uid
where e2.etime >= to_date('2021-01-10')
	and e2.etime < to_date('2021-01-25')
	and e2.etime > t1
	and e2.etime < t1 + 7
	and eventtype = 'eventtype2'
	and …
group by 1),
e3 as
(select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
	from event as e3
inner join e2
	on e3.uid = e2.uid
where e3.etime >= to_date('2021-01-10')
	and e3.etime < to_date('2021-01-25')
	and e3.etime > t2
	and e3.etime < t1 + 7
	and eventtype = 'eventtype3'
	and …
group by 1)
select sum(step1) as step1, sum(step2) as step2, sum(step3) as step3
from e1
left join e2
	on e1.uid = e2.uid
left join e3
	on e2.uid = e3.uid
A funnel with more steps requires even more subqueries

Facing complex problems, the difficulty of writing SQL increases sharply

  • Identify the top n customers who account for 50% and 80% of total sales from sales data for precision marketing
  • Analyze which stores are most popular, which time periods are busiest, and which dishes are most popular based on data such as the number of customers, consumption amount, consumption time, and consumption location of chain catering stores
  • Calculate the sales volume, sales revenue, average price, sales region, etc. for each model based on car sales data, and analyze which models are most popular and which models need to be adjusted in price or improved in design
  • Based on the purchasing history and preference data of e-commerce customers, divide them into different groups with strong purchasing power, preference for women's clothing, preference for men's clothing, etc., in order to facilitate targeted promotional activities
  • Based on game login data analysis, list the user's first login records, the user's last login interval, and calculate the user's login attempts within the three days prior to their last login
These complex requirements typically involve multi-step processes and order-related operations. Solving them in SQL requires very convoluted workarounds, resulting in hundreds of lines of code with multiple nested layers.

The problems of SQL

1. Debugging is troublesome

  • SQL does not provide breakpoints or step-through execution, the common debugging methods in most languages. A query can only be disassembled and debugged piece by piece.
  • SQL statements are often very long and have multiple layers of nesting, making them very difficult to disassemble.

2. The closedness of SQL

  • External data (such as text, Excel, or web-crawled data) must be imported into the database before it can be calculated.
  • Databases place strong constraints on data, so the importing and cleaning process is often cumbersome and time-consuming.

3. Poor portability

  • Data scientists often perform calculations on the desktop, while SQL requires a database, making it far less portable.

4. SQL performance is low

  • The performance of SQL mainly relies on the database optimizer, which can only handle simple SQL.
    select top 10 x from T order by x desc
    When selecting the top 10 from 100 million rows, the optimizer will not literally sort all the data, but will choose a more efficient method.
  • When the requirements become more complex, the optimizer easily gets lost.
    select * from ( 
        select *, row_number() over (partition by  Area 
                           order by Amount desc) rn from Orders ) 
    Suppose the requirement changes to finding the top 10 within each group. The added complexity is small, but it already defeats most optimizers, which fall back to a literal big sort, and performance declines sharply.

Switch to Python? Still difficult to write in complex situations

Stock price increase for the top 3 days

# stock_62: DataFrame of daily prices with a CLOSING column, in date order
sort_pos=(-stock_62["CLOSING"]).argsort()
max3pos=sort_pos.iloc[:3]
stock_62s=stock_62.shift(1)
max3CL=stock_62["CLOSING"].iloc[max3pos]
max3CLs=stock_62s["CLOSING"].iloc[max3pos]
max3_rate=max3CL/max3CLs-1
print(max3_rate)
Python cannot directly reference adjacent rows and requires shift alignment, which is somewhat cumbersome.

Longest consecutive rising days of a stock

# stock_prices: list of daily closing prices, in date order
max_days = 0
current_days = 1
for i in range(1, len(stock_prices)):   
    if stock_prices[i] > stock_prices[i - 1]:        
        current_days += 1   
    else:        
        max_days = max(max_days, current_days)        
        current_days = 1
max_days = max(max_days, current_days)
This more complex task can only be hard-coded with a “for” loop.

The problems of Python

1. Cumbersome debugging

  • Checking intermediate values still relies mainly on printing, and the print statements must be deleted after debugging, which is very troublesome.

2. Hard to write code for big data

  • The Pandas library can directly perform in-memory calculations such as sorting, filtering, grouping, and aggregation. However, these functions cannot directly handle data that exceeds memory; such data must be segmented by hand-written code, which makes programs much more complex.
  • Sometimes specialized algorithms such as hash grouping are needed for performance, making the code even more difficult to write.

3. Python only has pseudo-parallelism

  • Parallelism in Python is essentially fake: the GIL prevents threads from running on multiple CPUs at once. Utilizing multiple cores requires complex multi-process code, which is beyond the capabilities of most data scientists.

4. Python lacks its own storage solution

  • Python lacks its own storage solution; relying on text files is slow and ambiguous (data types are easily lost).
  • Using a database instead sacrifices portability.

5. Python code is difficult to integrate into Java applications

  • When Python works with a Java application, it usually has to run in a separate process, resulting in poor performance and stability.

6. The versions are chaotic and compatibility is poor

  • Python versions are chaotic: some major versions are completely incompatible, and there are smaller troubles between minor versions.
  • Many third-party packages are developed independently and rarely consider compatibility with other packages. Data scientists find it difficult to sort out the compatibility between packages, and between packages and Python versions, and are often at a loss.

02 A must-have tool for data scientists

esProc SPL is a must-have tool for data scientists

  • esProc is a tool specifically designed for structured data processing
  • SPL is the programming language of esProc, which is concise, easy to understand, and easy to debug
  • SPL has support for big data and high performance
  • It fundamentally solves many of the drawbacks of SQL and Python
SPL: Structured Process Language

The code is concise

Stock price increase for the top 3 days

SQL
with ranked_prices as (
    select closing,lag(closing) over (order by closing) as previous_closing,
                row_number() over (order by closing desc) as rank
    from share62)
select
    closing as closing_price,closing / previous_closing - 1 as growth_rate
from ranked_prices
where rank <= 3 order by rank
Python
sort_pos=(-stock_62["CLOSING"]).argsort()
max3pos=sort_pos.iloc[:3]
stock_62s=stock_62.shift(1)
max3CL=stock_62["CLOSING"].iloc[max3pos]
max3CLs=stock_62s["CLOSING"].iloc[max3pos]
max3_rate=max3CL/max3CLs-1
print(max3_rate)
SPL
A
1 =share62.psort@z(CLOSING)
2 =A1.m(:3)
3 =share62.calc(A2,if(#==1,null,CLOSING/CLOSING[-1]-1))
SPL provides more professional ordered calculations, and the code is easy to write: psort@z returns the positions of records sorted in descending order, and CLOSING[-1] references the previous row directly.

Longest consecutive rising days of a stock

SQL
select max(consecutive_day)  from (
    select count(*) consecutive_day  from (
        select sum(rise_or_fall) over(order by trade_date) day_no_gain  from (
            select trade_date, case when close_price > 
                       lag(close_price)  over(order by trade_date) 
                       then 0 else 1 end rise_or_fall from stock_price
        )
     ) group by day_no_gain
)
Python
max_days = 0
current_days = 1
for i in range(1, len(stock_prices)):   
    if stock_prices[i] > stock_prices[i - 1]:        
        current_days += 1   
    else:        
        max_days = max(max_days, current_days)        
        current_days = 1
max_days = max(max_days, current_days)
SPL
A
1 =stock.sort(tradeDate)
2 =0
3 =A1.max(A2=if(closePrice>closePrice[-1],A2+1,0))
SPL code is simpler and does not require an explicit “for” loop: A3 keeps a running count of rising days in A2 while taking the maximum.
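
For reference, SPL's ordered grouping gives an alternative formulation (a sketch on the same stock table; group@i is assumed to start a new group whenever the condition holds, so each group is one run of rising days):

A
1 =stock.sort(tradeDate)
2 =A1.group@i(closePrice<=closePrice[-1]).max(~.len()-1)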

E-commerce funnel analysis

SQL
with e1 as
 (select uid, 1 as step1, min(etime) as t1
    from event
   where etime >= to_date('2021-01-10')
     and etime < to_date('2021-01-25')
     and eventtype = 'eventtype1'
     and …
   group by 1),
e2 as
 (select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from event as e2
   inner join e1
      on e2.uid = e1.uid
   where e2.etime >= to_date('2021-01-10')
     and e2.etime < to_date('2021-01-25')
     and e2.etime > t1
     and e2.etime < t1 + 7
     and eventtype = 'eventtype2'
     and …
   group by 1),
   e3 as
 (select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from event as e3
   inner join e2
      on e3.uid = e2.uid
   where e3.etime >= to_date('2021-01-10')
     and e3.etime < to_date('2021-01-25')
     and e3.etime > t2
     and e3.etime < t1 + 7
     and eventtype = 'eventtype3'
     and …
   group by 1)
select sum(step1) as step1, sum(step2) as step2, sum(step3) as step3
  from e1
  left join e2
    on e1.uid = e2.uid
  left join e3
    on e2.uid = e3.uid
SPL
A
1 =["etype1","etype2","etype3"]
2 =file("event.ctx").open()
3 =A2.cursor(id,etime,etype;etime>=date("2021-01-10") && etime < date("2021-01-25") && A1.contain(etype) && …)
4 =A3.group(uid).(~.sort(etime))
5 =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first)
6 =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime < t1+7).etime, null))))
7 =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)
SPL is also more concise and in line with natural thinking, and this code can handle a funnel with any number of steps.

Strong interactivity and convenient debugging

The debugging convenience of SPL is significantly better than SQL's, and it is also much more convenient than Python's print-style debugging.

Self-owned format files

A
1 =d:\normalData.btx
2 =file(A1).import@b()
3 =A2.groups(state;count(custid):count)
btx
A
1 =d:\bigData.ctx
2 =file(A1).open().cursor@m()
3 =A2.groups@o(cust;sum(amount):amount)
ctx
SPL provides its own high-performance file formats, which are convenient and unambiguous, making desktop computing much easier.
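
For example, an intermediate result can be dumped into a btx file and reloaded in a later session (a minimal sketch; the file names are hypothetical):

A
1 =file("d:/normalData.btx").import@b()
2 =A1.groups(state;count(custid):count)
3 =file("d:/stateCount.btx").export@b(A2)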

Simple Big Data Computing

In-memory computing
A
1 =d:\smallData.txt
2 =file(A1).import@t()
3 =A2.groups(state;sum(amount):amount)
Cursor computing
A
1 =d:\bigData.txt
2 =file(A1).cursor@t()
3 =A2.groups(state;sum(amount):amount)
SPL provides elaborately designed external-storage cursor computing, so the code for in-memory and out-of-memory calculations is almost identical.
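
Cursor operations chain just like in-memory ones; for instance, a filter can be applied before the aggregation (a sketch; the amount threshold is made up):

A
1 =file("d:/bigData.txt").cursor@t().select(amount>=1000)
2 =A1.groups(state;sum(amount):amount)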

Simple parallel computing

Serial
A
1 =d:\bigData.txt
2 =file(A1).cursor@t()
3 =A2.groups(state;sum(amount):amount)
Parallel
A
1 =d:\bigData.txt
2 =file(A1).cursor@tm()
3 =A2.groups(state;sum(amount):amount)
SPL supports thread-based parallel computing that can fully utilize multiple CPU cores; enabling it only requires adding the @m option.

Higher computational performance

SQL
select top 10 x from T order by x desc
select * from ( 
    select *, row_number() over (partition by  Area 
                       order by Amount desc) rn from Orders ) 
SPL
Orders.groups(;top(10;-Amount))
Orders.groups(Area;top(10;-Amount))
In the TopN problem mentioned earlier, SPL understands TopN as an aggregation, so the computation does not require a big sort and is much faster.
Moreover, TopN within each group and TopN over the entire set are written the same way.
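
When the full records are wanted rather than only the Amount values, a grouped TopN can be written in a similar spirit (a sketch, assuming the same Orders table; here top is used to keep whole records):

A
1 =Orders.group(Area).conj(~.top(10;-Amount))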

Part of High Performance Computing Mechanism Provided in SPL

Traversal technique
  • Delayed cursor
  • Aggregate Understanding
  • Ordered cursor
  • Multi-purpose traversal
  • Prefilter traversal
Highly efficient Joins
  • Foreign key as pointer
  • Numbering of foreign keys
  • Order-based merge
  • Attached table
  • Unilateral HASH Join
High performance storage
  • Orderly Compressed Storage
  • Free column storage
  • Hierarchical Numbering positioning
  • Index and Caching
  • Double increment segmentation
Cluster computing
  • Preemptive Load Balancing
  • Multi-zone composite table
  • Cluster dimension table
  • Memory spare tire fault tolerance
  • External storage redundancy fault tolerance
Many of these algorithms and storage schemes are original inventions of SPL!

Open computing

SPL is naturally open and can compute directly on various data sources.
No data importing is needed, which saves data-cleaning time.
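
For example, an Excel file can be computed on directly, with no import step (a minimal sketch; the file name and columns are hypothetical):

A
1 =file("d:/sales.xlsx").xlsimport@t()
2 =A1.groups(area;sum(amount):amount)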

Good enterprise attributes

  • After data scientists finish exploration and analysis, the results are likely to be integrated into enterprise applications.
  • SPL provides friendly interfaces and standardized version control
  • Integration is seamless, and migration and upgrades are worry-free.

03 Technical Characteristics

Operating environment

  • JVM of JDK 1.8 or above
  • Any operating system, including VM and Containers
  • The full installation takes less than 600MB, and the core package is less than 15MB; it can even run smoothly on Android
  • Resource consumption is far less than that of databases

esProc IDE

Rich computations

All computations are implemented within esProc; they are not, and cannot be, translated into SQL

Data sources supported

Diversified data sources and mixed computing support

esProc has independent and complete computing power. It does not depend on its data sources, and it can read any of them and compute across them in a mixed way.
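
A minimal sketch of such a mixed computation, joining a database table with a local CSV (the data source name mysql, the file name, and the field names are all assumptions):

A
1 =connect("mysql").query@x("select id, amount from orders")
2 =file("d:/customers.csv").import@tc()
3 =join(A1:o,id;A2:c,id)
4 =A3.new(o.id:id,c.name:name,o.amount:amount)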

High performance file formats

BIN FILE

Binary file, simple format, no need to define data structure

Application scenarios
  • Intermediate storage for data dumping
  • Temporary storage
  • Storage of small data in calculations (such as dimension tables)

COMPOSITE TABLE

Mixed row-wise and columnar storage, supports indexing, requires a pre-defined structure (see the sketch after the scenarios)

Application scenarios
  • Big data storage
  • High performance computing
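
A sketch of the typical flow for building a composite table from a text file (the create/append calls follow the esProc documentation; file and field names are hypothetical, and # marks the ordering dimension):

A
1 =file("d:/orders.ctx").create(#id,customer,amount)
2 =file("d:/orders.txt").cursor@t()
3 =A1.append(A2)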

Easy to integrate: Embedded mode

No server required: esProc JDBC has independent computing power and can be embedded in an application to perform computations
A truly unique feature

Easy to integrate: Server mode

In server mode, esProc is deployed independently, supports clustering, and provides load balancing and fault-tolerance mechanisms

04 More code comparison examples

Ordered grouping

Identify runs of consecutive numbers in a sequence

SQL
There is no ready-made method to implement this task 
Python
import pandas as pd

lst = [7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15]
diffs = [j - i for i, j in enumerate(lst)]   # value minus position: constant within a consecutive run
groups = pd.Series(diffs).diff().ne(0).cumsum()
g_list = pd.Series(lst).groupby(groups).agg(list)
result = g_list[g_list.apply(lambda x: len(x) > 1)].tolist()
SPL
A
1 [7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15]
2 =A1.group@o(~-#).select(~.len()>1)
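
A related idiom: since group@o only compares adjacent members, runs of equal values can be grouped just as directly (a small sketch):

A
1 [1,1,2,2,2,3,1]
2 =A1.group@o(~)

This yields the groups [1,1], [2,2,2], [3] and [1].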

Non equivalence grouping

Number of male employees in designated departments

SQL
SELECT d.DEPT, COUNT(GENDER) AS MaleCount
FROM (
     SELECT COLUMN_VALUE AS DEPT 
     FROM TABLE(SYS.ODCIVARCHAR2LIST(
        'Administration', 'HR', 'Marketing', 'Sales'))
) d
LEFT JOIN EMPLOYEE e
    ON d.DEPT = e.DEPT AND e.GENDER = 'M'
GROUP BY d.DEPT;
Python
There is no specialized method; it has to be hard-coded
SPL
A
1 =file("EMPLOYEE.csv").import@tc().select(GENDER=="M")
2 [Administration, HR, Marketing, Sales]
3 =A1.align@a(A2,DEPT).new(A2(#):DEPT,~.len():NUM)

Adjacent references

The largest monthly growth in a year

SQL
WITH sales AS (
    SELECT COLUMN_VALUE AS sales_amount, ROWNUM AS month_number
    FROM TABLE(SYS.ODCINUMBERLIST(
          123,345,321,345,546,542,874,234,543,983,434,897))
),
lagged_sales AS (
    SELECT sales_amount, month_number, 
          LAG(sales_amount) OVER (ORDER BY NULL) AS prev_sales_amount
    FROM sales
)
SELECT MAX(sales_amount - prev_sales_amount) AS max_monthly_growth
FROM lagged_sales
Python
import pandas as pd

sales = [123,345,321,345,546,542,874,234,543,983,434,897]
month = [i for i in range(1,13)]
df = pd.DataFrame({'month':month,'sales':sales})
# note the parentheses: take the max of the month-to-month differences
max_growth = (df['sales'] - df['sales'].shift(1)).max()
SPL
A
1 [123,345,321,345,546,542,874,234,543,983,434,897]
2 =A1.(if(#>1,~-~[-1],0)).max()
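
The same adjacent-reference syntax extends to ranges; for example, a centered three-point moving average (a sketch; ~[-1:1] is assumed to be the slice from the previous member to the next one):

A
1 [123,345,321,345,546,542,874,234,543,983,434,897]
2 =A1.(~[-1:1].avg())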

Positioning calculation

The position of the highest paying employee in the table

SQL
SELECT rn
FROM (
    SELECT EMPLOYEE.*, ROWNUM AS rn
    FROM EMPLOYEE
) 
WHERE SALARY =
    (SELECT MAX(SALARY) FROM EMPLOYEE)
Python
import pandas as pd
import numpy as np

df = pd.read_csv('../EMPLOYEE.csv')
highest_salary_employee_index_all = \
    np.where(df['SALARY'] == df['SALARY'].max())[0]
SPL
A
1 =file("EMPLOYEE.csv").import@tc()
2 =A1.pmax@a(SALARY)

The daily increase when the stock price is higher than both the previous 4 days and the next 4 days

SQL
SELECT closing/closing_pre-1 AS raise
FROM(
    SELECT dt,closing,ROWNUM AS rn,
        MAX(closing) OVER (ORDER BY dt ROWS BETWEEN 4
                PRECEDING AND 1 PRECEDING) AS max_pre,
        MAX(closing) OVER (ORDER BY dt ROWS BETWEEN 1 
                FOLLOWING AND 4 FOLLOWING) AS max_suf,
                LAG(closing,1) OVER (ORDER BY dt) AS closing_pre
    FROM stock
)
WHERE rn>4 AND rn<=(SELECT count(*) FROM stock)-4
    AND CLOSING>max_pre AND CLOSING>max_suf
Python
import pandas as pd
stock_price_df = pd.read_csv('../STOCK.csv')
price_increase_list = []
for i in range(4, len(stock_price_df)-4):
    # higher than each of the previous 4 days and the next 4 days
    if stock_price_df['CLOSING'][i] > max(stock_price_df['CLOSING'][i-4:i]) and \
            stock_price_df['CLOSING'][i] > max(stock_price_df['CLOSING'][i+1:i+5]):
        price_increase = stock_price_df['CLOSING'][i] / \ 
                stock_price_df['CLOSING'][i-1]-1
        price_increase_list.append(price_increase)
result = price_increase_list
SPL
A
1 =file("STOCK.csv").import@tc()
2 =A1.pselect@a(#>4 && #<=A1.len()-4 && CLOSING>max(CLOSING[-4:-1]) && CLOSING>max(CLOSING[1:4]))
3 =A1.calc(A2,CLOSING/CLOSING[-1]-1)

The next day retention rate of new users per day

SQL
WITH first_login AS (
    SELECT userid, MIN(TRUNC(ts)) AS first_login_date
    FROM login_data    GROUP BY userid ),
next_day_login AS (
    SELECT DISTINCT(fl.userid), fl.first_login_date, TRUNC(ld.ts) AS 
            next_day_login_date
    FROM first_login fl  LEFT JOIN login_data ld ON fl.userid = ld.userid
    WHERE TRUNC(ld.ts) = fl.first_login_date + 1),
day_new_users AS(
    SELECT first_login_date,COUNT(*) AS new_user_num
    FROM first_login GROUP BY first_login_date),
next_new_users AS(
    SELECT next_day_login_date, COUNT(*) AS next_user_num
     FROM next_day_login GROUP BY next_day_login_date),
all_date AS(
    SELECT DISTINCT(TRUNC(ts)) AS login_date
    FROM login_data)
SELECT all_date.login_date+1 AS dt, dn.new_user_num, nn.next_user_num,
    (CASE WHEN nn.next_day_login_date IS NULL THEN 0 
     ELSE nn.next_user_num END)/dn.new_user_num AS ret_rate
FROM all_date
JOIN day_new_users dn
    ON all_date.login_date=dn.first_login_date
LEFT JOIN next_new_users nn
    ON dn.first_login_date+1=nn.next_day_login_date
ORDER BY all_date.login_date
Python
import pandas as pd
df = pd.read_csv("../login_data.csv")
df["ts"] = pd.to_datetime(df["ts"]).dt.date
gp = df.groupby('userid')
row = []
for uid, g in gp:
    fst_dt = g.iloc[0].ts                     # first login date of this user
    sec_dt = fst_dt + pd.Timedelta(days=1)    # the following day
    all_dt = g.ts.values
    whether_sec_login = sec_dt in all_dt
    row.append([uid, fst_dt, sec_dt, whether_sec_login])
user_whether_ret_df = pd.DataFrame(row, columns=['userid','fst_dt','sec_dt','whether_sec_login'])
result = user_whether_ret_df.groupby('sec_dt').apply(lambda x: x['whether_sec_login'].sum()/len(x))
SPL
A
1 =file("login_data.csv").import@tc()
2 =A1.group(userid;fst=date(ts):fst_login,~.(date(ts)).pos(fst+1)>0:wether_sec_login)
3 =A2.groups(fst_login+1:dt;count(wether_sec_login)/count(1):ret_rate)

05 FAQ

Is esProc based on open source or database technology?

esProc is built on a brand-new computing model. There was no open-source technology that could be referenced, and all of its code is self-developed.

SPL has a very different theoretical basis from SQL and cannot be built on any database technology. It is impossible to implement SPL's computations using SQL.


Can esProc connect to the existing database or other data sources to work?

Yes, of course! esProc supports almost all common data sources in the industry and can work directly through the interfaces and syntax of the data sources themselves, without mapping them to relational tables.

However, esProc cannot guarantee high performance in this situation, because database I/O is inefficient and a database can hardly provide the storage schemes needed by low-complexity algorithms. For high performance, it is recommended to store data in esProc's own file formats.
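
A sketch of such a one-time dump from a database into a btx file, after which all computations run on the file (the data source name orcl, the table, and the field names are assumptions):

A
1 =connect("orcl").cursor@x("select * from orders")
2 =file("d:/orders.btx").export@b(A1)
3 =file("d:/orders.btx").import@b().groups(state;sum(amount):amount)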


Where does esProc store data?

esProc does not own the data and, in principle, is not responsible for data storage; any accessible data can be computed.

In particular, esProc has excellent support for files. Data files can be stored in any file system, including NFS and object storage on the cloud, which naturally implements the separation between computation and storage.


What is esProc's ability in machine learning?

SPL supports common regression-fitting algorithms as well as classification and clustering algorithms such as SVM and K-means, and provides basic vector and matrix operation capabilities.

SPL lacks complex machine learning models such as XGBoost, so in this regard it falls somewhat short of Python.

If complex machine learning and deep learning models are needed, SPL can call YModel to build them and perform prediction. YModel is automatic modeling software that can preprocess data and build models with one click.


Is SPL difficult to learn?

The SPL syntax is very easy: those with a Java/SQL foundation can get started in just a few hours and become proficient in a few weeks.

SPL is specifically designed for low code and high performance. It has more universal data objects and operations than SQL and Python, and is more in line with natural human thinking. Problems are solved directly, without convoluted workarounds, so SPL code is usually more concise than SQL and Python code.

To develop high-performance code with SPL, more algorithm knowledge must be learned. SPL provides a high-performance function library that encapsulates these algorithms; once learned, many high-performance tasks become routine.

06 Summary

Summary of advantages of esProc

5 advantages

Low code

Concise code that conforms to natural thinking
High development efficiency

Strong interactivity

Professional IDE is perfect for exploration and analysis
Convenient to debug

Super Portable

Lightweight and portable for desktop computing
Easy data movement

Big data

Built-in big data and parallel computing syntax
High-performance algorithms

Easy integration

Integrates directly into enterprise applications
Rich data sources