For the data computing scenarios : Offline Batch Job , Online Query/Report
Databases that use SQL syntax and are applied to OLAP scenarios
Other technologies for structured data analysis and statistics
Center data warehouse undertakes all data task of whole bank, which is overburdened and can only assign 5 concurrencies to BI system
Only for a small amount of high-frequency data, DB2 is not capable for real-time query, and also unable to achieve data routing, users must select the data source
esProc stores a small amount of high-frequency data, and large low-frequency data is still stored in the data warehouse to avoid repeated construction
esProc takes over the most high frequency computing tasks, and a few low frequency tasks are automatically routed to the center data warehouse
Vertica does not support stored procedures; To prepare data, complex nested SQL statements have to be written, and Java hardcoding is often required.
When mixed computing with MySQL, MySQL data has to be loaded into Vertica first, which is tedious, not real-time, and the database is bloated.
The best use for us is to pass parameters to the Vertica database.
Each cell becomes a data array that are easy to use, compare and manipulate. It is very logical and you have made it user friendly.SELECT MAX(ContinuousDays) FROM (SELECT COUNT(*) ContinuousDays FROM (SELECT SUM(UpDownTag) OVER ( ORDER BY TradeDate) NoRisingDays FROM (SELECT TradeDate, CASE WHEN Price>LAG(price) OVER ( ORDER BY TradeDate) THEN 0 ELSE 1 END UpDownTag FROM Stock ) ) GROUP BY NoRisingDays )
SQL doesn’t support ordered operation sufficiently and doesn’t provide orderly grouping directly; Instead, four layers of nesting has to be used in a roundabout way.
Such statements are not only difficult to write, but also difficult to understand.
In the face of complex business logic, the complexity of SQL will increase sharply, which is difficult to understand and write.
It isn’t an unusual requirement, and it appears everywhere in thousands of lines of SQL in reality, which reduces the efficiency of development and maintenance severely.
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
This query uses ORDER BY. If it is executed strictly according to this logic, it means sorting the full amount of data, and the performance will be poor.
We know that there is a way to perform this operation without full sorting, but SQL cannot describe it. We can only rely on the optimization engine of the database.
In simple cases (such as this statement), many databases can make the optimization, but if the situation is more complex, the database optimization engine will faint
In the following example, get the TopN from each group, SQL cannot describe it directly, and can only write it as a subquery using window function in a roundabout approach.
In the face of this roundabout approach, the database optimization engine cannot do the optimization and can only perform sorting.
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn FROM Orders ) WHERE rn<=10
A | |
1 | =Stock.sort(TradeDate).group@i(Price< Price[-1]).max(~.len()) |
The computing logic of this SPL is the same as that of the previous SQL, but SPL provides orderly grouping operation, which is intuitive and concise.
A | ||
1 | =file(“Orders.ctx”).open().cursor() | |
2 | =A1.groups(;top(10;-Amount)) | Top 10 orders |
3 | =A1.groups(Area;top(10;-Amount)) | Top 10 orders of each area |
SPL regards TopN as the aggregation operation of returning a set, avoiding full sorting; The syntax is similar in case of whole set or groups, and there is no need to use the roundabout approach.
The difficulties of SQL stem from relational algebra, and theoretical problems cannot be solved by engineering methods. Despite years of improvement, it is still difficult to meet complex requirements.
SPL is based on a completely different theoretical system: discrete dataset. SPL provides more abundant data types and basic operations, and has more powerful expression capabilities.
with e1 as ( select uid,1 as step1,min(etime) as t1 from event where etime>= to_date('2021-01-10') and etime < to_date('2021-01-25') and eventtype='eventtype1' and … group by 1), e2 as ( select uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2 from event as e2 inner join e1 on e2.uid = e1.uid where e2.etime>= to_date('2021-01-10') and e2.etime < to_date('2021-01-25') and e2.etime > t1 and e2.etime < t1 + 7 and eventtype='eventtype2' and … group by 1), e3 as ( select uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3 from event as e3 inner join e2 on e3.uid = e2.uid where e3.etime>= to_date('2021-01-10') and e3.etime < to_date('2021-01-25') and e3.etime > t2 and e3.etime < t1 + 7 and eventtype='eventtype3' and … group by 1) select sum(step1) as step1, sum(step2) as step2, sum(step3) as step3 from e1 left join e2 on e1.uid = e2.uid left join e3 on e2.uid = e3.uid
SQL lacks order-related calculations and is not completely set-oriented. It needs to detour into multiple subqueries and repeatedly JOIN. It is difficult to write and understand, and the operation performance is very low.
Due to space limitation, only a three-step funnel is listed here, and subqueries need to be added when adding more steps.
A | |
1 | =["etype1","etype2","etype3"] |
2 | =file("event.ctx").open() |
3 | =A2.cursor(id,etime,etype;etime>=date("2021-01-10") && etime < date("2021-01-25") && A1.contain(etype) && …) |
4 | =A3.group(uid).(~.sort(etime)) |
5 | =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime < t1+7).etime, null)))) |
7 | =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) |
SPL provides order-related calculations and is more thoroughly set-oriented. Code is written directly according to natural thinking, which is simple and efficient.
This code can handle funnels with any number of steps, as long as the parameters are changed.
SPL is especially suitable for complex process operations.
Designed specifically for structured data tables
esProc, developed in JAVA, can run independently or be seamlessly integrated into applications.
Multiple data sources are directly used for mixed computations.
Private data storage format: bin files, composite tables
Support to store data by business classification in tree directory
Data routing enables low-cost high-performance full data analysis
Support low-risk, high-performance and strong real-time HTAP with open multi-source hybrid computing capability
esProc is based on a brand-new computing model, no open source technology can be cited, and all independent innovation from theory to code.
SPL is based on innovation theory that can no longer use SQL to achieve high performance, and SQL can not describe most low complexity algorithms.
esProc is implemented in pure Java.
esProc can run smoothly under any OS equipped with JVM, including VM, cloud server and even container.
esProc provides a standard JDBC driver for Java applications.
esProc can be integrated in a Java application seamlessly.
esProc can be invoked by a non-Java application via HTTP/RESTFul
Yes, Of course!
However, esProc can not guarantee high performance in this situation due to the inefficient I/O of database, and database can hardly provide storage schema which is necessary for low complexity algorithm.
esProc stores data in files of private format to guarantee performance.
Any file system installed on OS is available, including NFS.
esProc can easily implement separation between storage and computation to scale out
Comparing with RDB:
The metadata is immature in esProc, most of computation will begin from accessing data source, it will be a little tedious for simple operations.
Comparing with Hadoop/MPP:
The cluster function of esProc has not many chances to be well-trained.
esProc has reduced many clusters into a single machine without sacrificing performance in history.
Comparing with Python:
SPL is developing its AI functions, but now is still not even close to Python.
SPL is dedicated to performance optimization and provides a dedicated syntax
Learning SPL is not difficult. It can be mastered in hours and skilled in weeks. What is difficult is to design optimization algorithms!
Not yet.
The information in SQL is insufficient. Frankly, we are not a veteran like RDB vendor for optimizing SQL, so converting SQL to SPL directly will lead to slower performance.
The first 2-3 scenarios will be implemented by Scudata engineer in collaboration with users.
Most programmers are used to the way of thinking in SQL and are not familiar with high performance algorithms. They need to be trained to understand in one or two scenarios.
Dozens of performance optimization routines will be experienced and learned. Algorithmic design and implementation are not so difficult.
Give a man a fish and you feed him for a day. Teach him how to fish and you feed him for a lifetime!
The processing speed of big data is 1 order of magnitude higher than that of traditional solutions
Procedural syntax, in line with natural thinking
Rich class libraries
Multi-source mixed computation
Can run independently, or embedded into applications
Single machine can match cluster, reducing hardware expense
Environment-friendly
Development, hardware, O&M costs reduced by X times