esProc SPL

Data computing
Low code / High performance / Lightweight framework / All Scenarios

Table of Contents

  1. What is esProc?
  2. Case Brief
  3. Why esProc SPL
  4. Technical Characteristics
  5. Application scenarios
  6. FAQ
  7. Summary

01 What is esProc?

What is esProc SPL?

  • Data computing and processing language
  • Runs as an analytical database or middleware
  • Computing and processing of structured and semi-structured data
  • Offline batch jobs, online queries
  • Neither a SQL system nor a NoSQL technology
  • Self-created SPL syntax, more concise and efficient
SPL: Structured Process Language

What pain points does esProc SPL solve?

For data computing scenarios: offline batch jobs and online queries/reports

  • Slow batch jobs cannot finish within their time window, and the strain is worst on critical summary dates
  • Business users get angry when forced to wait minutes for a query or report; pre-calculation needs are hard to predict, so users are still not satisfied
  • With more concurrent queries or longer query time spans, the database crashes
  • Distributed clusters everywhere consume huge hardware resources and are complex to operate
  • Database storage space is expensive and requires continuous expansion

What are the counterpart technologies of esProc SPL?

Databases that use SQL syntax and are applied to OLAP scenarios

  • Common database: MySQL, PostgreSQL, Oracle, DB2, …
  • Data warehouse on Hadoop: Hive, Spark SQL, …
  • Distributed data warehouse/MPP: …
  • Cloud data warehouse: Snowflake, …
  • All-in-one database machine: Exadata, …

Other technologies for structured data analysis and statistics

  • Python, Spark/Scala, Java, …

esProc SPL

  • Low code
  • High performance
  • Lightweight framework
  • All Scenarios

The value that esProc SPL brings to users

Performance improvement
by N times

  • Finish batch jobs ahead of schedule, leaving time for subsequent tasks
  • Query reports in seconds, improving user experience
  • Pre-computation becomes history, changing business models

Cost reduction
by N times

  • A single machine matches a cluster, with lower hardware resource consumption and O&M costs
  • No need for professional database storage; high performance can be achieved on common file systems and low-cost cloud storage

02 Case Brief

Case: National Astronomical Observatory – Star aggregation

  • 11 photos, 5,000,000 objects/photo
  • Celestial bodies that are close in astronomical distance (calculated with trigonometric functions) are regarded as the same object
  • Complexity: 5,000,000 * 5,000,000 * 10 = 250 trillion comparisons
  • Original solutions, 500,000 celestial bodies test
    • Python: 200 lines, single thread, 6.5 days
    • SQL: 100-CPU cluster, 3.8 hours
  • esProc SPL
    • 500,000 celestial bodies test: 2.5 minutes
    • 5 million celestial bodies: 3 hours
    • Code: 50 lines
Speed increased 2000 times

Case: Batch job of loan agreements of a bank

  • SQL: 48 steps, 3300 lines
  • Historical data: 110 million rows; daily increase: 1.37 million rows
  • Complex multi-table join
  • Original solution: AIX + DB2
    • Calculation time: 115 minutes
  • esProc SPL
    • Calculation time: 10 minutes
    • Code: 500 lines
Speed increased 10+ times

Case: E-commerce funnel conversion rate analysis

  • Large number of e-commerce users and a huge amount of data
  • Counting unique users according to the sequence of events is complicated
  • Original solution: Snowflake, Medium 4-node cluster
    • 3-step funnel: did not finish (DNF) in 3 minutes
    • Complex SQL code; analyzing more steps requires more subqueries
  • esProc SPL
    • One node, 10 seconds
    • Shorter code, and the same code handles funnels with more steps
Optimized from DNF to 10 seconds

Case: Calculation of the number of unique loan clients of a bank

  • Too many labels; hundreds of labels can be arbitrarily combined in a query
  • Association, filtering and aggregation over a large table of 20 million rows and even larger detail tables
  • Each page involves the calculation of nearly 200 indexes, so 10 concurrent requests trigger the concurrent calculation of about 2000 indexes
  • Original solution: Oracle
    • Unable to calculate in real time; query requirements have to be submitted in advance, and the calculation is carried out one day earlier
  • esProc SPL
    • 10 concurrent requests, 2000 indexes in total, less than 3 seconds
    • No need to prepare in advance; select any label combination and get query results in real time
Pre-calculation turned into real-time calculation

Case: Front-end database in the BI system of a bank

The central data warehouse handles all data tasks of the whole bank; it is overburdened and can only give the BI system a concurrency of 5
Even for just a small amount of high-frequency data, DB2 is not capable of real-time queries; it also cannot route queries, so users must select the data source themselves

Concurrency: 5 ➔ 100

esProc stores the small amount of high-frequency data, while the large volume of low-frequency data stays in the data warehouse to avoid duplicate construction
esProc takes over most of the high-frequency computing tasks, and the few low-frequency tasks are automatically routed to the central data warehouse

03 Why esProc SPL

Why SQL is difficult to write: What is the maximum number of consecutive days a stock has been rising?

SELECT MAX(ContinuousDays)
FROM (SELECT COUNT(*) ContinuousDays
    FROM (SELECT SUM(UpDownTag) OVER (ORDER BY TradeDate) NoRisingDays
        FROM (SELECT TradeDate,
                CASE WHEN Price > LAG(Price) OVER (ORDER BY TradeDate)
                    THEN 0 ELSE 1 END UpDownTag
            FROM Stock))
    GROUP BY NoRisingDays)

SQL has insufficient support for ordered operations and provides no direct ordered grouping; instead, four layers of nesting have to be used in a roundabout way.

Such statements are not only difficult to write, but also difficult to understand.

In the face of complex business logic, the complexity of SQL increases sharply, making it hard to write and to understand.

This is not an unusual requirement. Logic like this appears everywhere in real-world SQL that runs to thousands of lines, and it severely reduces the efficiency of development and maintenance.

Why can't SQL run fast: Get the top 10 from 100 million rows of data

SELECT TOP 10 * FROM Orders ORDER BY Amount DESC


This query uses ORDER BY. Executed strictly according to this logic, it sorts the full data set, and the performance will be poor.

We know that there is a way to perform this operation without a full sort, but SQL cannot describe it. We can only rely on the optimization engine of the database.

In simple cases (such as this statement), many databases can make the optimization, but in more complex situations the database optimization engine is at a loss.

The following example gets the top N from each group. SQL cannot describe this directly; it can only be written in a roundabout way, as a subquery with a window function.

Faced with this roundabout approach, the database optimization engine cannot optimize and can only perform the sort.

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders )
WHERE rn<=10

The SPL solution

A1  =Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len())

The computing logic of this SPL code is the same as that of the previous SQL, but SPL provides an ordered grouping operation, which makes the code intuitive and concise.
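
For readers more used to general-purpose languages, the ordered grouping idea can be sketched in Python. This is only an illustration of the logic, not SPL's implementation: the function name `longest_rising_run` and the sample prices are made up for the example. Mirroring `group@i(Price<Price[-1])`, a new group starts whenever the price drops below the previous day's price, and the answer is the size of the largest group.

```python
def longest_rising_run(prices):
    """Size of the largest group of consecutive days, where a new group
    starts whenever the price drops below the previous day's price.
    `prices` must already be ordered by trade date."""
    longest = 0
    current = 0
    prev = None
    for price in prices:
        if prev is not None and price < prev:
            current = 0          # price fell: close the group, start a new one
        current += 1
        longest = max(longest, current)
        prev = price
    return longest


# Made-up sample data, ordered by trade date.
sample_prices = [10, 11, 12, 9, 10, 11, 12, 13, 8]
print(longest_rising_run(sample_prices))
```

The whole calculation is a single ordered pass over the data, which is what the nested SQL has to emulate with window functions and an extra GROUP BY.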

A2  =A1.groups(;top(10;-Amount))        Top 10 orders
A3  =A1.groups(Area;top(10;-Amount))    Top 10 orders of each area

SPL regards TopN as an aggregation operation that returns a set, which avoids full sorting; the syntax is similar for the whole set and for groups, and no roundabout approach is needed.
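
The idea of treating TopN as an aggregation can be illustrated with a small Python sketch. It is an assumption-based illustration, not how SPL is implemented: the helper names `top_n` and `top_n_per_group` and the sample orders are made up. Each helper keeps at most n candidates in a small heap while traversing the data once, so the full data set is never sorted.

```python
import heapq
from collections import defaultdict


def top_n(rows, n, key):
    """Top n rows by key using a bounded heap instead of a full sort."""
    return heapq.nlargest(n, rows, key=key)


def top_n_per_group(rows, n, group_key, key):
    """Same idea per group: one small min-heap per group, one pass over rows."""
    heaps = defaultdict(list)
    for seq, row in enumerate(rows):
        heap = heaps[group_key(row)]
        item = (key(row), seq, row)   # seq breaks ties, so rows are never compared
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)   # replace the current smallest candidate
    return {g: [row for _, _, row in sorted(h, reverse=True)]
            for g, h in heaps.items()}


# Made-up sample data.
orders = [
    {"Area": "East", "Amount": 50},
    {"Area": "West", "Amount": 70},
    {"Area": "East", "Amount": 90},
    {"Area": "West", "Amount": 20},
    {"Area": "East", "Amount": 10},
]


def amount(order):
    return order["Amount"]


print(top_n(orders, 2, amount))
print(top_n_per_group(orders, 2, lambda o: o["Area"], amount))
```

The grouped version differs from the whole-set version only in keeping one heap per group, which matches the point that the two cases need no fundamentally different approach.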

Funnel analysis of an E-commerce company

WITH e1 AS (
    SELECT uid,1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime>=end_date-14 AND etime < end_date  AND etype='etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid,1 AS step2, MIN(e1.t1) as t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime>=end_date-14 AND e2.etime < end_date AND e2.etime>t1
            AND e2.etime < t1+7 AND etype='etype2'
    GROUP BY uid),
e3 as (
    SELECT uid,1 AS step3, MIN(e2.t1) as t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime>=end_date-14 AND e3.etime < end_date AND e3.etime>t2
            AND e3.etime < t1+7 AND etype='etype3'
    GROUP BY 1)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid

SQL lacks order-related calculations and is not completely set-oriented. It has to detour through multiple subqueries and repeated JOINs, which makes the code difficult to write and understand and very slow to run.

Due to space limitations, only a three-step funnel is shown here; each additional step requires additional subqueries.

A3  =A2.cursor(id,etime,etype;etime>=end_date-14 && etime < end_date && A1.contain(etype) )
A6  =A5.(A1.(t=if(#==1,t1=first.etime,if(t, && etime>t && etime < t1+7).etime, null))))

SPL provides order-related calculations and is more thoroughly set-oriented. Code is written directly according to natural thinking, which is simple and efficient.

This code can handle funnels with any number of steps, as long as the parameters are changed.
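
The per-user logic behind such a funnel can be illustrated with a short Python sketch. It is not a translation of the SPL code above, only an illustration under stated assumptions: the function `funnel_counts`, the tuple layout `(uid, etime, etype)`, the numeric day values and the sample events are all made up, and the 14-day date filter is left out. Each user's events are put in time order; step 1 is the first event of the first type, and every later step must occur after the previous step and within `window` days of step 1.

```python
from collections import defaultdict


def funnel_counts(events, steps, window):
    """Count users reaching each funnel step.

    events: iterable of (uid, etime, etype); etime is a number (e.g. a day index).
    steps:  event types in funnel order; any number of steps is allowed.
    window: a step counts only if it happens before (time of step 1) + window.
    """
    by_user = defaultdict(list)
    for uid, etime, etype in events:
        if etype in steps:
            by_user[uid].append((etime, etype))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()                  # ordered calculation: by event time
        reached = 0
        t_first = t_prev = None
        for etime, etype in user_events:
            if reached == len(steps):
                break
            if etype != steps[reached]:
                continue
            if reached == 0:
                t_first = t_prev = etime    # earliest event of the first step
                reached = 1
            elif t_prev < etime < t_first + window:
                t_prev = etime
                reached += 1
        for i in range(reached):
            counts[i] += 1
    return counts


# Made-up sample events: (uid, day, event type).
sample_events = [
    ("u1", 1, "view"), ("u1", 2, "cart"), ("u1", 3, "pay"),
    ("u2", 1, "view"), ("u2", 9, "cart"),
    ("u3", 2, "cart"), ("u3", 3, "view"), ("u3", 4, "pay"),
    ("u4", 1, "view"), ("u4", 2, "cart"),
]
print(funnel_counts(sample_events, ["view", "cart", "pay"], window=7))
```

Adding a fourth step only means passing a longer `steps` list; the code itself does not change.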

Common scenarios where SPL beats SQL

1. DISTINCT/COUNT(DISTINCT), ordered calculations: funnel analysis

2. Association calculation and multi-index calculation on big data: user profile, collision analysis

3. Multi-step batch processing of large amounts of data

In real business, complex SQL statements (and stored procedures) often run to hundreds or thousands of lines, and many roundabout approaches have to be used to implement the calculation. The code becomes complex and the performance poor.

SPL High Performance Computing Concept

Hardware ?

Software can't make hardware run faster, no software can!


But we can design an efficient, low-complexity algorithm; the calculation runs faster when there is less to calculate.


It's not enough to come up with a good algorithm; it also has to be implemented.


Traditional databases are limited by their theoretical system, which makes it impossible to implement a good algorithm.
Q: What can we do?
A: Look forward!
Q: Oh, so it is like this.
A: Yes, there is no magic to it.
Q: Then find a programmer to do it.
A: Not so easy!
Q: So all you can do is stare in despair?
A: Hey hey, that's how it works most of the time.
So: High performance computing = Algorithm design + Algorithm implementation
Algorithm implementation is becoming the bottleneck of high performance computing
SQL, NoSQL, NewSQL and Hadoop all restrict the implementation of algorithms

Why is SPL more advanced?

Analogy: Calculate 1+2+3+…+100 = ?

Ordinary people do it like this

  • 1+2=3
  • 3+3=6
  • 6+4=10
  • 10+5=15
  • 15+6=21

Gauss did it like this

  • 1+100=101
  • 2+99=101
  • 3+98=101
  • A total of fifty 101s
  • 50*101 = 5050
SQL is like an arithmetic system with only addition: the code is lengthy and the calculation is inefficient.
SPL is equivalent to the invention of multiplication: it simplifies writing and improves performance.

The difficulties of SQL stem from relational algebra, and theoretical problems cannot be solved by engineering methods. Despite years of improvement, SQL still struggles to meet complex requirements.

SPL is based on a completely different theoretical system: the discrete dataset. SPL provides richer data types and basic operations, and has more powerful expression capabilities.

Some of the high-performance computing mechanisms provided in SPL

Traversal technique

  • Delayed cursor
  • Aggregate understanding
  • Ordered cursor
  • Multi-purpose traversal
  • Prefilter traversal

Highly efficient joins

  • Foreign key as pointer
  • Numbering of foreign keys
  • Order-based merge
  • Attached table
  • Unilateral HASH join

High performance storage

  • Orderly compressed storage
  • Free column storage
  • Hierarchical numbering positioning
  • Index and caching
  • Double increment segmentation

Cluster computing

  • Preemptive load balancing
  • Multi-zone composite table
  • Cluster dimension table
  • Memory spare-tire fault tolerance
  • External storage redundancy fault tolerance

Many of the algorithms and storage schemes here are original inventions of SPL!

Why does Java not work well?

Java is too low-level: it lacks the necessary data types and computing libraries, which makes such code difficult or even impossible for application programmers to write.

// Orders is assumed to be a Stream of order records with OrderDate, SellerId and Amount fields
Calendar cal = Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c = Orders.collect(Collectors.groupingBy(
        r -> {
            cal.setTime(r.OrderDate);  // assumed date field; without it the current year would be used
            return cal.get(Calendar.YEAR) + "_" + r.SellerId;
        },
        Collectors.summarizingDouble(r -> {
            return r.Amount;
        })
));
for (Object sellerid : c.keySet()) {
    DoubleSummaryStatistics r = c.get(sellerid);
    String year_sellerid[] = ((String) sellerid).split("_");
    System.out.println("group is (year):" + year_sellerid[0] + "\t (sellerid):" + year_sellerid[1] + "\t sum is:" + r.getSum());
}

High performance algorithms are difficult to implement

  • Programmers are forced to use low-performance algorithms that are easy to write, which often fail to beat SQL

No universal high-performance storage

  • Only databases or text files can be used, both with low performance
  • Implementing high-performance storage on your own is difficult

Why does Python not work well?

Python's DataFrame is not good at structured data computation in complex situations.

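import pandas as pd
import datetime
import numpy as np
import math
def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()   # oldest employee: earliest birthday
    min_age = g['BIRTHDAY'].idxmax()   # youngest employee: latest birthday
    diff = g.loc[max_age]['SALARY'] - g.loc[min_age]['SALARY']
    return diff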

Computational completeness still has shortcomings

  • Calculations such as adjacent reference, ordered grouping, positioning calculation and non-equivalence grouping are relatively cumbersome

Poor syntax consistency

  • Some similar calculations may use different functions

Poor big data capabilities

  • No external-storage cursor mechanism or parallel computing capability

No universal high-performance storage

  • No efficient storage, just like Java

04 Technical Characteristics

Operating environment

  • JVM of JDK 1.8 or later
  • Any operating system, including VMs and containers
  • The full installation takes less than 600 MB and the core package less than 15 MB; it runs smoothly on Android
  • Resource consumption is far less than that of databases

Composition and Concept

Embedded mode

No server required; the esProc JDBC driver has independent computing power and can be embedded in an application to compute
A highly distinctive feature

Server mode

Independent deployment with cluster support, providing load balancing and fault tolerance mechanisms

Local and remote development

esProc IDE

Separation of storage and computation

All data sources are logically equivalent. esProc does not own data (it only computes), which naturally separates storage from computation

For data, there is no concept of “inside” or “outside” esProc, and no action of “importing into” or “exporting out of” esProc

Data sources supported

Diversified data sources and mixed computing support

esProc has independent and complete computing power that does not depend on data sources; it can read any data source and perform mixed calculations across them.

Rich computations

All computation is implemented within esProc; it is not, and cannot be, translated into SQL

Special Support for RDB

esProc file storage

File storage, efficient use, fully guaranteeing computing performance

High performance file formats


Binary file, simple format, no need to define data structure

Application scenarios
  • Mediation for data dumping
  • Temporary storage
  • Storage of small data in calculations (such as dimension tables)


Mixed row-wise and columnar storage, supports indexing, requires pre-defined structure

Application scenarios
  • Big data storage
  • High performance computing

Data storage and exchange

  • All accessible data can take part in computation directly.
  • Converting data to esProc format files (which can be done in SPL) gives high-performance access and calculation.

Data security and reliability

Elastic computing

05 Application scenarios

OLAP database/Memory database

  • High performance algorithms overcome SQL defects and avoid wide tables
  • Comprehensive in-memory and external-memory computing technology, covering in-memory database functions
  • Open system access to production data sources to achieve real time whole-data query
  • Complex reports/procedural calculations, simple technology stack

Front-end database/ODS layer

  • High performance lightweight front-end database
  • Programmable computation routing simulates whole-data calculation
  • Comprehensive and complex computing, avoiding programming in application

Batch job database /ETL

  • No data import and export costs, direct access to data sources
  • High performance algorithms overcome SQL defects
  • Parallel cursor to implement complex procedural computation

Data warehouse

  • Physical/logical all-in-one
    • Logical data sources can be accessed directly without mapping
    • Optimized physical storage gets high performance
  • File storage, very flexible, natural separation of storage and calculation
    • No need to load into database, calculate directly
  • Procedural computing implements stored procedure functionality, fully replacing new databases
  • Lightweight, single machine to match a cluster, replacing heavy MPP and Hadoop

HTAP solution

  • Continue the existing TP solution, retain the advantages of the original data source, and reduce migration risks
  • Allows careful organization of historical data and implements high-performance AP through mixed cold and hot calculation
  • Easy implementation of real-time queries

Real time stream computing

  • Integrated stream batch computing system
  • Bidirectional stream data interface: active acquisition, passive reception
  • Diversified data source support
  • Lightweight framework, no need for complex streaming computing frameworks
  • Powerful ordered computing, particularly suitable for streaming data

Hot computing of cold data

  • Loading historical cold data into the database takes up space and leads to complex operation and maintenance and high costs, while the data is rarely used. However, without loading, it cannot be calculated
  • Temporary loading is inefficient, and the loading time far exceeds the calculation time
  • SPL can calculate directly on files with no need to load them into a database; the data is no longer "cold", and hot computation is available at low storage cost

Lightweight layered time series database

  • Hot-warm-cold multi-layer storage schema, solving the contradiction between high-frequency writing and big data query
  • Ordered computing support is particularly suitable for time series data computation, overcoming the drawbacks of SQL
  • Provide mathematical libraries such as vectors, matrices, fitting, and modeling to simplify the technical stack

Microservice platform

  • SPL agile development far surpasses Java
  • Lightweight decoupled elastic computing without Docker/VM
  • Interpreted execution, natural hot swap

Lakehouse/ Cloud Computing Center

  • The raw data can be calculated after entering the lake without any loss
  • After progressive data organization, high performance is achieved
  • A single SPL technology stack; no need for a complex SQL + Python + … programming system
  • Object file storage, natural storage and computation separation, utilizing mature storage technologies
  • High performance elastic computing, reducing resource consumption

Report query data preparation layer

  • SPL agile computing improves development efficiency and avoids complex SQL and stored procedures
  • Computing capability not relying on database, code can be migrated across databases
  • Interpreted execution, natural hot swap, decoupling of report module and application
  • Handle endless report development needs at low cost

Java data logic/microservice implementation

  • Pure Java, packaged within the main application to enjoy the advantages of Java's mature framework
  • Agile development, fully replacing Stream/Kotlin/ORM
  • Computing capability not relying on database, code can be migrated across databases
  • Interpreted execution, natural hot swap, low coupling

Replace stored procedures

  • Pure Java, packaged within the main application to enjoy the advantages of Java's mature framework and overcome the drawbacks of stored procedures
  • Powerful process computing, easy to debug, and higher development efficiency
  • Computes outside the database without relying on it; naturally migratable
  • Does not require the privilege to compile stored procedures and avoids coupling between applications, improving security and reliability

Reduce database load/eliminate intermediate tables

  • Remove non critical intermediate data from the database and store them in files, reducing the burden of database storage
  • Tree-like directories are easier to manage and reduce coupling between applications
  • Moving computing tasks out of the database reduces its computational burden
  • File access, higher performance, significantly improving computational performance

Multi data source mixed computing/real-time whole data statistics

  • Rich data source support: RDB, NoSQL, File, HTTP…; JSON and other multi-layer data
  • Direct computing, no need to load into database, real-time computing
  • Calculation across heterogeneous databases; mixed calculation over production and analysis databases implements real-time statistics
  • Computing power not relying on the data source, naturally migratable
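
As a rough illustration of what mixed computing over multiple sources means, the following Python sketch joins a relational table with a CSV file in application code, without loading the file into the database first. It is only an analogy under stated assumptions and says nothing about how esProc does this internally: an in-memory SQLite database stands in for the RDB, a string buffer stands in for the file, and the names `customers`, `orders_csv` and `amount_by_region` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database (in-memory SQLite stands in for it).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "East"), (2, "West"), (3, "East")],
)

# Hypothetical source 2: a CSV file (a string buffer stands in for the file).
orders_csv = io.StringIO("customer_id,amount\n1,100\n2,50\n1,25\n3,10\n")


def amount_by_region(db, orders_file):
    """Join the file rows to the database dimension table and sum by region."""
    region_of = dict(db.execute("SELECT id, region FROM customers"))
    totals = {}
    for row in csv.DictReader(orders_file):     # stream the file row by row
        region = region_of.get(int(row["customer_id"]))
        if region is not None:
            totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


result = amount_by_region(conn, orders_csv)
print(result)
```

The computation lives outside both sources, so either one could be swapped for another source without touching the other.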

Embedded/edge computing Engine

  • Small footprint and fully embeddable; can be used in edge computing scenarios
  • Comprehensive computing capabilities, including mathematical libraries, do not require other components for most tasks
  • Simple file storage, no need for a database
  • Can connect to remote large data sources and storage devices

Data cleaning/preparation outside of applications

  • Rich and consistent data source support, convenient access to various data sources without SQL
  • Powerful language ability, more concise in describing complex operations than Python
  • Parallel computing provides much greater convenience and speed in processing big data than Python
  • Strong integration, can be transferred to inside application computing if necessary

Exploration and analysis for data scientists

  • Powerful language ability, more concise in describing complex operations than SQL and Python
  • Stronger interactivity than SQL and Python, making debugging more convenient
  • File storage, independent and portable data, can analyze data at hand without the need for a database
  • Parallel computing, processing large amounts of data with much greater convenience and speed than Python

Full stack data technology

For structured data calculation, just use SPL!


06 FAQ

Is esProc based on open source or database technology?

esProc is based on a brand-new computing model with no open source technology to draw on; everything from theory to code is independent innovation.

Because SPL is built on this new theory, it cannot use SQL to achieve high performance; SQL cannot describe most low-complexity algorithms.

Where can esProc be deployed?

esProc is implemented in pure Java.

esProc runs smoothly on any OS with a JVM, including VMs, cloud servers and even containers.

How do applications invoke esProc?

esProc provides a standard JDBC driver for Java applications.

esProc can be integrated in a Java application seamlessly.

esProc can be invoked by non-Java applications via HTTP/RESTful.

Can esProc be integrated in other frameworks?

As a Java product with good integration, it can be seamlessly integrated into various Java frameworks and application servers, and its logical status is equivalent to self-written Java code.

For computational frameworks (such as Spark), esProc can be integrated seamlessly, but doing so has little practical value, since esProc can replace Spark for computation.

In particular, esProc has its own stream computing abilities and does not need to be integrated into stream computing frameworks (such as Flink); using it on its own typically gives better functionality and performance.

Can esProc run based on the existing database or other datasources?

Yes, of course! esProc supports almost all of the common data sources in the industry and can work directly through the interfaces and syntax of the data sources themselves, without the need to map the data sources to relational data tables.

However, esProc cannot guarantee high performance in this situation, because database I/O is inefficient and databases can hardly provide the storage schema that low-complexity algorithms require. For high performance, storing data in esProc's own file formats is recommended.

Where does esProc store data?

esProc does not own the data and, in principle, is not responsible for data storage; any accessible data can be calculated.

In particular, esProc has excellent support for files. Data files can be stored in any file system, including NFS and object storage on the cloud, which naturally separates computation from storage.

How is the high reliability of esProc ensured?

When embedded in applications, reliability is guaranteed by the application

When used independently, load balancing and fault tolerance mechanisms are provided, but a single task may still fail, so this mode is only suitable for small-scale clusters

esProc does not provide automatic recovery after a failure

The elastic computing mechanism of the cloud version avoids currently failed nodes when allocating VMs, achieving high availability to a certain extent

How does esProc extend its functionality?

The provided interface can be used to invoke static functions written in Java to extend functionality

esProc also opens an interface for custom functions, which can be used in SPL after registration

What are the weaknesses of esProc?

Compared with RDB:

esProc has no metadata, so most computations start from accessing the data source, which is a little tedious for very simple operations.

Compared with Hadoop/MPP:

The cluster function of esProc has had few chances to be exercised in practice.

In past projects, esProc has reduced many clusters to a single machine without sacrificing performance.

Compared with Python:

SPL is developing its AI functions, but they are still nowhere near Python's.

How is SPL compatible with SQL?

SPL is not a computing engine of the SQL system. It currently supports only simple SQL on small data volumes and does not guarantee performance, so esProc can be regarded as not supporting SQL, and it is of course not compatible with any SQL or stored procedures.

In the future, a dual engine supporting SQL will be developed, but it will still be difficult to ensure high performance on big data; the aim is only to make existing SQL code easy to migrate.

Is there a tool to convert SQL to SPL automatically?

Not yet.

The information in a SQL statement is insufficient to optimize its performance. Frankly, we are not as experienced as RDB vendors at guessing the goal of a SQL statement, so converting SQL to SPL directly will usually result in slower execution.

How difficult is it to learn SPL?

SPL is dedicated to low code and high performance.

SPL syntax is easy; those with Java/SQL knowledge can get started in just a few hours and become proficient within a few weeks.

“Difficult”: high-performance algorithms are somewhat difficult and require learning more about algorithms.

“Not difficult”: once learned, many high-performance tasks become “routine”.

How to launch a performance optimization process?

The first 1-2 scenarios will be implemented by Scudata engineers in collaboration with users.

Most programmers are used to the SQL way of thinking and are not familiar with high-performance methods. They need hands-on training in one or two scenarios to understand them.

Performance optimization routines can be learned through experience. Algorithm design and implementation are not so difficult.

Give a man a fish and you feed him for a day. Teach him how to fish and you feed him for a lifetime!


07 Summary

Summary of advantages of esProc

5 advantages

High performance

Big data is processed an order of magnitude faster than with traditional solutions

Efficient development

Procedural syntax, in line with natural thinking
Rich class libraries

Flexible and open

Multi-source mixed computation
Can run independently, or embedded into applications

Save resources

Single machine can match cluster, reducing hardware expense

Sharp cost reduction

Development, hardware, O&M costs reduced by X times