For the data computing scenarios: Offline Batch Job, Online Query/Report
Databases that use SQL syntax and are applied to OLAP scenarios
Other technologies for structured data analysis and statistics
The central data warehouse undertakes all data tasks of the whole bank; it is overburdened and can only allocate 5 concurrent connections to the BI system
DB2 can serve only a small amount of high-frequency data, is not capable of real-time query, and cannot do data routing either, so users must select the data source themselves
QDBase stores the small amount of high-frequency data, while the large volume of low-frequency data stays in the data warehouse to avoid redundant construction
QDBase takes over most high-frequency computing tasks, while the few low-frequency tasks are automatically routed to the central data warehouse
Vertica does not support stored procedures; to prepare data, complex nested SQL statements have to be written, and Java hardcoding is often required.
For mixed computing with MySQL, the MySQL data has to be loaded into Vertica first, which is tedious, not real-time, and bloats the database.
The best use for us is to pass parameters to the Vertica database.
Each cell becomes a data array that is easy to use, compare and manipulate. It is very logical and you have made it user friendly.

SELECT MAX(ContinuousDays) FROM (
    SELECT COUNT(*) ContinuousDays FROM (
        SELECT SUM(UpDownTag) OVER (ORDER BY TradeDate) NoRisingDays FROM (
            SELECT TradeDate,
                CASE WHEN Price > LAG(Price) OVER (ORDER BY TradeDate) THEN 0
                     ELSE 1 END UpDownTag
            FROM Stock
        )
    ) GROUP BY NoRisingDays
)
SQL doesn’t support ordered operations sufficiently and doesn’t provide ordered grouping directly; instead, four layers of nesting have to be used in a roundabout way.
Such statements are not only difficult to write, but also difficult to understand.
In the face of complex business logic, the complexity of SQL increases sharply, making it difficult to write and understand.
This is not an unusual requirement; in practice it appears everywhere across thousands of lines of SQL, severely reducing development and maintenance efficiency.
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
This query uses ORDER BY. Executed strictly according to this logic, it sorts the full data set, and performance will be poor.
We know that there is a way to perform this operation without full sorting, but SQL cannot describe it. We can only rely on the optimization engine of the database.
In simple cases (such as this statement), many databases can make the optimization, but in more complex situations the database optimizer gets lost.
The following example gets the TopN of each group. SQL cannot describe it directly; it can only be written as a subquery using a window function, a roundabout approach.
Faced with this roundabout form, the database optimizer cannot apply the optimization and can only perform a sort.
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders
) WHERE rn <= 10
A | |
1 | =Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len()) |
The computing logic of this SPL is the same as that of the previous SQL, but SPL provides an ordered grouping operation, which is intuitive and concise.
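For readers more comfortable with procedural code, the same ordered-grouping idea can be sketched in plain Python (a minimal illustration, not QDBase code; the price list is invented): walk the dates in order, open a new group whenever the price fails to rise, and take the longest group.

```python
from itertools import groupby

def longest_rising_streak(prices):
    """Longest run of consecutive days on which the price keeps rising.

    Mirrors SPL's group@i: assign a group id that increments whenever
    the price does not rise, then measure the largest group.
    """
    group_id = 0
    tags = []
    prev = None
    for p in prices:
        if prev is not None and p <= prev:
            group_id += 1  # price fell or stayed flat: start a new group
        tags.append(group_id)
        prev = p
    return max(len(list(run)) for _, run in groupby(tags)) if tags else 0

print(longest_rising_streak([10, 11, 12, 11, 13, 14, 15, 16]))  # prints 5
```

The grouping step is explicit here, which is exactly what SQL lacks: the four nested layers in the SQL version exist only to simulate this one pass.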
A | ||
1 | =file("Orders.ctx").open().cursor() | |
2 | =A1.groups(;top(10;-Amount)) | Top 10 orders |
3 | =A1.groups(Area;top(10;-Amount)) | Top 10 orders of each area |
SPL regards TopN as an aggregation operation that returns a set, avoiding full sorting; the syntax is similar for the whole set and for each group, and no roundabout approach is needed.
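The "aggregation that returns a set" idea can be imitated outside SPL as well. Below is a minimal Python sketch (illustrative only; the record layout and field accessors are assumptions) that keeps a bounded heap of size N per group, so each record costs O(log N) instead of a full O(N log N) sort:

```python
import heapq
from collections import defaultdict
from itertools import count

def top_n_by_group(records, n, group_key, order_key):
    """TopN inside each group in one pass, without sorting all records."""
    tie = count()  # tie-breaker so the heap never compares raw records
    heaps = defaultdict(list)
    for r in records:
        h = heaps[group_key(r)]
        heapq.heappush(h, (order_key(r), next(tie), r))
        if len(h) > n:
            heapq.heappop(h)  # evict the smallest current candidate
    return {k: [r for _, _, r in sorted(h, reverse=True)]
            for k, h in heaps.items()}

# invented sample data: (area, amount) pairs
orders = [("East", 5), ("East", 9), ("East", 7), ("West", 3), ("West", 8)]
top2 = top_n_by_group(orders, 2, lambda r: r[0], lambda r: r[1])
```

Dropping the grouping key (treating all records as one group) gives whole-set TopN with the same code, which is the syntactic symmetry the SPL example shows.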
The difficulties of SQL stem from relational algebra, and theoretical problems cannot be solved by engineering methods. Despite years of improvement, it is still difficult to meet complex requirements.
SPL is based on a completely different theoretical system, the discrete dataset. SPL provides richer data types and basic operations, and has more powerful expressive capability.
with e1 as (
    select uid, 1 as step1, min(etime) as t1
    from event
    where etime >= to_date('2021-01-10') and etime < to_date('2021-01-25')
        and eventtype = 'eventtype1' and …
    group by 1),
e2 as (
    select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from event as e2
    inner join e1 on e2.uid = e1.uid
    where e2.etime >= to_date('2021-01-10') and e2.etime < to_date('2021-01-25')
        and e2.etime > t1 and e2.etime < t1 + 7
        and eventtype = 'eventtype2' and …
    group by 1),
e3 as (
    select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from event as e3
    inner join e2 on e3.uid = e2.uid
    where e3.etime >= to_date('2021-01-10') and e3.etime < to_date('2021-01-25')
        and e3.etime > t2 and e3.etime < t1 + 7
        and eventtype = 'eventtype3' and …
    group by 1)
select sum(step1) as step1, sum(step2) as step2, sum(step3) as step3
from e1
left join e2 on e1.uid = e2.uid
left join e3 on e2.uid = e3.uid
SQL lacks order-related calculations and is not thoroughly set-oriented. It has to detour into multiple subqueries and repeated JOINs, which are difficult to write and understand, and the performance is very low.
Due to space limitations, only a three-step funnel is listed here; each additional step requires another subquery.
A | |
1 | =["etype1","etype2","etype3"] |
2 | =file("event.ctx").open() |
3 | =A2.cursor(uid,etime,etype;etime>=date("2021-01-10") && etime < date("2021-01-25") && A1.contain(etype) && …)
4 | =A3.group(uid).(~.sort(etime)) |
5 | =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime < t1+7).etime, null)))) |
7 | =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) |
SPL provides order-related calculations and is more thoroughly set-oriented. Code is written directly according to natural thinking, which is simple and efficient.
This code can handle funnels with any number of steps, as long as the parameters are changed.
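As a cross-check of the funnel logic (not QDBase code), the same computation can be written in procedural Python; the event tuples below are invented for illustration. Per user: sort the events, find the first step-1 event, then look for each subsequent step within 7 days of step 1:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def funnel_counts(events, steps, window_days=7):
    """Count users reaching each funnel step, for any number of steps.

    events: (uid, etime, etype) tuples; the time window is anchored at
    each user's first step-1 event, matching the SPL version's logic.
    """
    by_user = defaultdict(list)
    for uid, etime, etype in events:
        by_user[uid].append((etime, etype))
    counts = [0] * len(steps)
    for evs in by_user.values():
        evs.sort()  # the order-related part: events must be time-ordered
        t1 = next((t for t, e in evs if e == steps[0]), None)
        if t1 is None:
            continue
        counts[0] += 1
        prev_t, deadline = t1, t1 + timedelta(days=window_days)
        for i, step in enumerate(steps[1:], start=1):
            nxt = next((t for t, e in evs
                        if e == step and prev_t < t < deadline), None)
            if nxt is None:
                break
            counts[i] += 1
            prev_t = nxt
    return counts

events = [
    (1, datetime(2021, 1, 11), "etype1"),
    (1, datetime(2021, 1, 12), "etype2"),
    (1, datetime(2021, 1, 13), "etype3"),
    (2, datetime(2021, 1, 11), "etype1"),
    (2, datetime(2021, 1, 20), "etype2"),  # outside the 7-day window
    (3, datetime(2021, 1, 11), "etype1"),
    (3, datetime(2021, 1, 12), "etype2"),
]
result = funnel_counts(events, ["etype1", "etype2", "etype3"])
```

Like the SPL version, only the `steps` list changes to add funnel steps; the SQL version, by contrast, needs one more subquery and JOIN per step.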
Java is too low-level, lacking the necessary data types and computing libraries, which makes such computations difficult or even impossible for application programmers to code.
Calendar cal = Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c = Orders.collect(
    Collectors.groupingBy(
        r -> {
            cal.setTime(r.OrderDate);
            return cal.get(Calendar.YEAR) + "_" + r.SellerId;
        },
        Collectors.summarizingDouble(r -> r.Amount)
    )
);
for (Object sellerid : c.keySet()) {
    DoubleSummaryStatistics r = c.get(sellerid);
    String[] year_sellerid = ((String) sellerid).split("_");
    System.out.println("group is (year):" + year_sellerid[0]
        + "\t (sellerid):" + year_sellerid[1] + "\t sum is:" + r.getSum());
}
High performance algorithms are difficult to implement
No universal high-performance storage
Python's DataFrame is not good at processing structured data computation in complex situations.
import pandas as pd

# emp is a DataFrame loaded elsewhere
def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()
    min_age = g['BIRTHDAY'].idxmax()
    diff = g.loc[max_age]['SALARY'] - g.loc[min_age]['SALARY']
    return diff

emp['BIRTHDAY'] = pd.to_datetime(emp['BIRTHDAY'])
salary_diff = emp.groupby('DEPT').apply(salary_diff)
print(salary_diff)
There are still shortcomings in terms of computational completeness
Poor syntax consistency
Poor big data capabilities
No universal high-performance storage
SPL is especially suitable for complex process operations.
Designed specifically for structured data tables
QDBase, developed in Java, can run independently or be seamlessly integrated into applications.
Multiple data sources are directly used for mixed computations.
Private data storage format: bin files, composite tables
Supports storing data by business classification in a tree directory
Data routing enables low-cost high-performance full data analysis
Supports low-risk, high-performance, strongly real-time HTAP with open multi-source hybrid computing capability
QDBase is based on a brand-new computing model; no open-source technology could be borrowed, and everything from theory to code is independent innovation.
SPL is based on an innovative theory: since SQL cannot describe most low-complexity algorithms, high performance can no longer be achieved with SQL.
QDBase is implemented in pure Java.
QDBase can run smoothly under any OS equipped with JVM, including VM, cloud server and even container.
QDBase provides a standard JDBC driver for Java applications.
QDBase can be integrated in a Java application seamlessly.
QDBase can be invoked by a non-Java application via HTTP/RESTful
As a Java product with good integration, it can be seamlessly embedded in various Java frameworks and application servers; its logical status is equivalent to hand-written Java code
For computational frameworks (such as Spark), QDBase can also be integrated seamlessly, but doing so has little practical significance, since QDBase can replace Spark for the computation
Specifically, QDBase has its own streaming computing abilities and does not need to be embedded in streaming computing frameworks (such as Flink); used on its own, it typically delivers better functionality and performance
Yes, of course!
However, QDBase cannot guarantee high performance in this situation, because database I/O is inefficient and a database can hardly provide the storage schema necessary for low-complexity algorithms.
QDBase stores data in files of self-designed format to guarantee performance.
Any file system installed on OS is available, including NFS.
QDBase can easily implement separation between storage and computation to scale out
When embedded in applications, reliability is guaranteed by the application
When used independently, load balancing and fault tolerance mechanisms are provided, but a single task may still fail; this is only suitable for small-scale clusters
Does not provide automatic recovery function after failure
The elastic computing mechanism of the cloud version avoids the current failed nodes when allocating VMs, achieving high availability to a certain extent
The provided interface can be used to invoke static functions written in Java to extend functionality
QDBase also opens an interface for custom functions, which can be used in SPL after registration
Comparing with RDB:
Metadata support in QDBase is immature; most computations begin by accessing the data source, which is a little tedious for simple operations.
Comparing with Hadoop/MPP:
The cluster function of QDBase has not had many chances to be battle-tested.
Historically, however, QDBase has reduced many clusters to a single machine without sacrificing performance.
Comparing with Python:
SPL is developing its AI functions, but for now it is still not close to Python.
SPL is not a computing engine of the SQL family; it currently supports only simple SQL on small data volumes and does not guarantee performance. It can be considered that QDBase does not support SQL, and of course it is not compatible with any SQL stored procedures.
In the future, a dual engine supporting SQL will be developed, mainly to make existing SQL code easy to migrate; even then, high performance on big data will still be hard to guarantee.
Not yet.
The information in a SQL statement is insufficient for optimizing its performance. Frankly, we are not veterans like the RDB vendors at guessing the goal of a SQL statement, so converting SQL to SPL directly would usually lead to slower speed.
SPL is dedicated to low code and high performance.
SPL syntax is easy: those with Java/SQL knowledge can get started in just a few hours and become proficient within a few weeks.
“Difficult”, high-performance algorithms are a bit difficult and require learning more algorithm knowledge;
“Not difficult”, once learned, many high-performance tasks become “routine”.
The first 1-2 scenarios will be implemented by Scudata engineers in collaboration with users.
Most programmers are used to the SQL way of thinking and are unfamiliar with high-performance solutions beyond SQL. They need to be trained through one or two scenarios.
Once the performance optimization routines have been experienced and learned, algorithm design and implementation are not so difficult.
Give a man a fish and you feed him for a day. Teach him how to fish and you feed him for a lifetime!
The processing speed on big data is an order of magnitude higher than that of traditional solutions
Procedural syntax, in line with natural thinking
Rich class libraries
Multi-source mixed computation
Can run independently, or embedded into applications
Single machine can match cluster, reducing hardware expense
Environment-friendly
Development, hardware, O&M costs reduced by X times