esProc SPL

Data analysis engine
Low code / High performance / Lightweight / Versatility

Table of Contents

  1. What is esProc?
  2. Case Brief
  3. Why esProc Works Better
  4. Technical Characteristics
  5. More Solutions
  6. FAQ
  7. Summary

01 What is esProc?

What is esProc SPL?

  • A data computing and processing engine that runs as an analytical database or as middleware
  • Computes and processes structured and semi-structured data
  • Handles offline batch jobs and online queries
  • Neither a SQL system nor a NoSQL technology
  • Self-created SPL syntax that is more concise and efficient
SPL: Structured Process Language

What pain points does esProc SPL solve?

For data computing scenarios: offline batch jobs and online queries/reports

Slow
  • Slow batch jobs cannot finish within the given time window, and the pressure is especially high on critical summary dates
  • Business users become frustrated after waiting minutes for a query or report
  • With more concurrent users and longer query time spans, the database crashes
Difficult
  • N-layer nested SQL or stored procedures of dozens of KBs confuse even their own author three months later
  • Dozens of data sources such as RDB, NoSQL, files, JSON and the Web, with a strong need for cross-source mixed computation
  • Hot and cold data are split into different databases, making real-time queries over the whole dataset hard
Complex
  • Over-reliance on stored procedures makes the application hard to migrate and the framework hard to adjust
  • Too many intermediate tables in the database exhaust storage and resources, yet no one dares to delete them
  • Report demands in an enterprise are endless; how can personnel costs be reduced?

What are the counterpart technologies of esProc SPL?

Databases that use SQL syntax and are applied to OLAP scenarios

  • Common database: MySQL, PostgreSQL, Oracle, DB2, …
  • Data warehouse on Hadoop: Hive, Spark SQL, …
  • Distributed data warehouse/MPP: …
  • Cloud data warehouse: Snowflake, …
  • All-in-one database machine: ExaData, …

Other technologies for structured data analysis and statistics

  • Python, Scala, Java, Kotlin, …

esProc SPL

  • Low code
  • High performance
  • Lightweight
  • Versatility

What does esProc SPL bring beyond SQL?

SQL vs SPL

Development cost
  • SQL: Insufficient descriptive ability; complex logic has to be written in a roundabout way; lengthy nested code is difficult to write and debug
  • SPL: Strong descriptive ability; complex logic is implemented following natural thinking; simplified stepwise code is easy to write and debug

Hardware cost
  • SQL: Huge computing loads consume resources
  • SPL: Low-complexity algorithms reduce resource consumption

O&M cost
  • SQL: Heavy, closed computing ability leads to a bloated framework; imperfect capabilities force a complex technology stack
  • SPL: Integrated, open computing ability forms a lightweight framework; versatility enables implementing most tasks on its own

esProc SPL: reduces development, hardware, and O&M costs by X times

What are the advantages of esProc SPL compared to Java?

Java vs SPL

Development cost
  • Java: Lacks the necessary computing class libraries, so everything must be developed from scratch; computational code is complex and hard to develop
  • SPL: Rich computing class libraries with numerous built-in high-performance algorithms; agile development

Hardware cost
  • Java: High-performance algorithms are difficult and cumbersome to implement
  • SPL: Low-complexity algorithms reduce resource consumption

O&M cost
  • Java: No hot swapping, high coupling; weak capabilities often require the help of databases
  • SPL: Hot swapping, low coupling; independently implements the vast majority of tasks

esProc SPL: reduces development, hardware, and O&M costs by X times

What are the advantages of esProc SPL compared to Python?

Python vs SPL

Development cost
  • Python: Rich class libraries but weak big data capabilities; general computations are easy, complex computations are difficult
  • SPL: Rich class libraries with built-in big data capabilities; agile syntax whose advantages grow with the complexity of the computation

Hardware cost
  • Python: Weak big data ability, high hardware consumption
  • SPL: Built-in high-performance algorithms, low hardware requirements

O&M cost
  • Python: Chaotic versions, poor integration; weak capabilities often require the help of databases
  • SPL: Consistent versions, integrable, lightweight framework; independently implements the vast majority of tasks

esProc SPL: reduces development, hardware, and O&M costs by X times

02 Case Brief

Case: National Astronomical Observatory – star aggregation

TASK
  • 11 photos, 5,000,000 objects per photo
  • Celestial bodies at close astronomical distance (trigonometric calculation) are regarded as the same object
  • Complexity: 5,000,000 × 5,000,000 × 10 = 250 trillion comparisons

Python / SQL (test with 500,000 celestial bodies)
  • Python: 200 lines, single thread, 6.5 days
  • SQL: 100-CPU cluster, 3.8 hours

SPL
  • 500,000 celestial bodies: 2.5 minutes
  • 5 million celestial bodies: 3 hours
  • Code: 50 lines

Speed increase: 2,000 times

Case: Batch job on insurance policies at an auto insurance company

TASK
  • Insurance policy table: 35 million rows; details table: 123 million rows
  • There are various ways of association, which need to be handled separately

Informix
  • Historical policy matching: 6,672 seconds
  • Code: 1,800 lines

SPL
  • Historical policy matching: 1,020 seconds
  • Code: 500 lines

Speed increase: 6.5 times

Case: Batch job on loan agreements at a bank

TASK
  • SQL: 48 steps, 3,300 lines
  • Historical data: 110 million rows; daily increase: 1.37 million rows
  • Complex multi-table joins

AIX + DB2
  • Calculation time: 1.5 hours

SPL
  • Calculation time: 10 minutes
  • Code: 500 lines

Speed increase: 8.5 times

Case: Mobile banking – highly concurrent account queries

TASK
  • Huge number of users and heavy concurrent access
  • Branch information changes frequently and needs to be associated in time
  • Commercial data warehouses on Hadoop cannot meet the high-concurrency requirement

Elasticsearch
  • A 6-server Elasticsearch cluster can cope with the concurrency but cannot associate in real time; data updates take a long time, during which the service must be stopped

SPL
  • A single machine copes with the same concurrency as the ES cluster
  • Real-time association, zero waiting time for branch information updates

1 server vs 6 servers

Case: Counting unique loan clients at a bank

TASK
  • Too many labels; hundreds of labels can be combined arbitrarily in a query
  • Association, filtering and aggregation over a 20-million-row table and even larger detail tables
  • Each page involves nearly 200 indicators; 10 concurrent users trigger the concurrent calculation of more than 2,000 indicators

Oracle
  • Unable to calculate in real time; query requirements have to be submitted in advance and calculated one day earlier

SPL
  • 10 concurrent users, 2,000 indicators in total, in less than 3 seconds
  • No advance preparation; any label combination can be selected on the spot and results are returned in real time

From pre-calculation to real-time calculation

Case: Intersection of customer groups in bank user profiles

TASK
  • Huge amount of data: hundreds of millions of customers, thousands of customer groups (a many-to-many relationship), dozens of dimensions
  • Arbitrarily selected customer groups are intersected; the intermediate result set is huge and cannot be pre-calculated
  • More than 10 concurrent requests

Well-known OLAP server on Hadoop
  • 100-CPU cluster, single task: 2 minutes

SPL
  • 12 CPUs, single task: 4 seconds
  • 10 concurrent tasks completed within 10 seconds

Speed increase: 250 times

Case: Front-end database in the BI system of a bank

Problems
  • The central data warehouse undertakes all data tasks of the whole bank; it is overburdened and can allocate only 5 concurrent connections to the BI system
  • Even for the small amount of high-frequency data, DB2 cannot serve real-time queries, nor can it route queries, so users must select the data source themselves

Solution
  • esProc stores the small amount of high-frequency data; the large volume of low-frequency data stays in the data warehouse, avoiding redundant construction
  • esProc takes over most of the high-frequency computing tasks, and the few low-frequency tasks are automatically routed to the central data warehouse

Concurrency: 5 ➔ 100

Case: Outside-database stored procedures at an insurance company

Problems
  • Vertica does not support stored procedures; to prepare data, complex nested SQL has to be written and Java hardcoding is often required
  • For mixed computation with MySQL, MySQL data has to be loaded into Vertica first, which is tedious, not real-time, and bloats the database

User comments

The best use for us is to pass parameters to the Vertica database.

Each cell becomes a data array that are easy to use, compare and manipulate. It is very logical and you have made it user friendly.

esProc can not only implement stored-procedure-like computations on Vertica, but also calculate different sources directly.

03 Why esProc Works Better

Why SQL is difficult to write: what is the maximum number of consecutive days a stock has been rising?

SELECT MAX(ContinuousDays)
FROM (SELECT COUNT(*) ContinuousDays
      FROM (SELECT SUM(UpDownTag) OVER (ORDER BY TradeDate) NoRisingDays
            FROM (SELECT TradeDate,
                         CASE WHEN Price > LAG(Price) OVER (ORDER BY TradeDate)
                              THEN 0 ELSE 1 END UpDownTag
                  FROM Stock))
      GROUP BY NoRisingDays)

SQL doesn't support ordered operations sufficiently and doesn't provide ordered grouping directly; instead, four layers of nesting have to be used in a roundabout way.

Such statements are not only difficult to write, but also difficult to understand.

Faced with complex business logic, SQL's complexity increases sharply, making it hard to understand and write.

This is not an unusual requirement; such patterns appear everywhere in the thousands of lines of SQL found in real projects, severely reducing development and maintenance efficiency.

Why can't SQL run fast: get the top 10 from 100 million rows of data

SELECT TOP 10 * FROM Orders ORDER BY Amount DESC

This query uses ORDER BY. If executed strictly according to that logic, the entire dataset must be sorted, and performance will be poor.

We know there is a way to perform this operation without a full sort, but SQL cannot describe it; we can only rely on the database's optimization engine.

In simple cases (such as this statement), many databases can make the optimization, but when the situation gets more complex, the optimizer gives up.
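As an illustration only, here is a minimal Java sketch of the kind of algorithm meant here (not esProc's actual implementation): a single pass over the data that keeps a small bounded heap, so the 100 million rows are never fully sorted. The Order record and its fields are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class Top10Sketch {
    record Order(long id, double amount) {}          // hypothetical row type

    // One pass over the data; only a 10-element min-heap is held in memory,
    // so the full dataset is never sorted.
    static List<Order> top10(Iterable<Order> orders) {
        PriorityQueue<Order> heap =
                new PriorityQueue<>(10, (a, b) -> Double.compare(a.amount(), b.amount()));
        for (Order o : orders) {
            if (heap.size() < 10) {
                heap.offer(o);
            } else if (o.amount() > heap.peek().amount()) {
                heap.poll();                         // drop the smallest of the kept 10
                heap.offer(o);
            }
        }
        return new ArrayList<>(heap);                // the top 10, in heap order
    }
}

The same bounded structure can be kept per group (for example, one small heap per Area in a hash map), which is exactly the grouped TopN case discussed next.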

In the following example, getting the TopN from each group, SQL cannot describe the calculation directly and has to resort to a subquery with a window function.

Faced with this roundabout formulation, the database optimizer cannot apply the optimization and has to perform the sort.

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn 
    FROM Orders ) 
WHERE rn<=10

The SPL solution

A
1=Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len())

The computing logic of this SPL script is the same as that of the previous SQL, but SPL provides ordered grouping directly, which makes the code intuitive and concise.

A
1=file("Orders.ctx").open().cursor()
2=A1.groups(;top(10;-Amount))          Top 10 orders
3=A1.groups(Area;top(10;-Amount))      Top 10 orders of each area

SPL treats TopN as an aggregation that returns a set, avoiding a full sort; the syntax is almost the same for the whole set and for each group, so no roundabout approach is needed.

Why is SPL more advanced?

Analogy: calculate 1 + 2 + 3 + … + 100 = ?

Ordinary people do it like this

  • 1+2=3
  • 3+3=6
  • 6+4=10
  • 10+5=15
  • 15+6=21

Gauss does it like this

  • 1+100=101
  • 2+99=101
  • 3+98=101
  • A total of fifty 101
  • 50*101= 5050
SQL is like an arithmetic system that only has addition: the code is lengthy and the calculation inefficient.
SPL is equivalent to inventing multiplication: it simplifies the code and improves performance.
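Written as a formula, Gauss's pairing trick is simply the closed form of the arithmetic series:

\[ 1 + 2 + \cdots + 100 = \sum_{k=1}^{100} k = \frac{100 \times 101}{2} = 5050, \qquad \text{and in general} \qquad \sum_{k=1}^{n} k = \frac{n(n+1)}{2}. \]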

The difficulties of SQL stem from relational algebra, and theoretical problems cannot be solved by engineering methods. Despite years of improvement, it is still difficult to meet complex requirements.

SPL is based on a completely different theoretical system, the discrete dataset. It provides richer data types and basic operations and has more powerful expressive capabilities.

SPL High Performance Computing Concept

Hardware?

Software can't make hardware run faster, no software can!

Algorithm

But we can design efficient, low-complexity algorithms; with less computation, the task runs faster.

Develop

Coming up with a good algorithm is not enough; it must also be developed.

Database

Traditional databases are limited by their theoretical system, making it impossible to implement such algorithms.
Q: What can we do?
A: Look forward!
Q: Oh, so that's how it is.
A: Yes, it's not magic.
Q: Then just find a programmer to do it.
A: Not so easy!
Q: So all we can do is stare in despair?
A: Well, that's how it goes most of the time.
So, high performance computing = algorithm design + algorithm implementation, and implementation has become the bottleneck of high-performance computing.
SQL, NoSQL, NewSQL and Hadoop all restrict the implementation of algorithms.

Common scenarios to beat SQL

1. Complex ordered computing: funnel analysis of user behavior conversion

2. Multi-step big data batch jobs

3. Multi-indicator calculations on big data, with repeated use and multiple associations

In real business, complex SQL (and stored procedures) often runs to hundreds or thousands of lines, and a large number of roundabout approaches have to be used; the code becomes complex and the performance low.

Funnel analysis of an E-commerce company

with e1 as (
    select uid,1 as step1,min(etime) as t1
    from event
    where etime>= to_date('2021-01-10') and etime < to_date('2021-01-25')
         and eventtype='eventtype1' and …
    group by 1),
e2 as (
    select uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
    from event as e2
    inner join e1 on e2.uid = e1.uid
    where e2.etime>= to_date('2021-01-10') and e2.etime < to_date('2021-01-25') 
         and e2.etime > t1 and e2.etime < t1 + 7
         and eventtype='eventtype2' and …
    group by 1),
e3 as (
    select uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
    from event as e3
    inner join e2 on e3.uid = e2.uid
    where e3.etime>= to_date('2021-01-10') and e3.etime < to_date('2021-01-25')
         and e3.etime > t2 and e3.etime < t1 + 7
         and eventtype='eventtype3' and …
    group by 1)
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from e1
    left join e2 on e1.uid = e2.uid
    left join e3 on e2.uid = e3.uid

SQL lacks order-related calculations and is not fully set-oriented; it has to detour through multiple subqueries and repeated JOINs. The statement is difficult to write and understand, and runs slowly.

Due to space limitations, only a three-step funnel is shown here; each additional step requires another subquery.


A
1=["etype1","etype2","etype3"]
2=file("event.ctx").open()
3=A2.cursor(uid,etime,etype;etime>=date("2021-01-10") && etime < date("2021-01-25") && A1.contain(etype) && …)
4=A3.group(uid).(~.sort(etime))
5=A4.new(~.select@1(etype==A1(1)):first,~:all).select(first)
6=A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime < t1+7).etime, null))))
7=A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)

SPL provides order-related calculations and is more thoroughly set-oriented. Code is written directly according to natural thinking, which is simple and efficient.

This code can handle funnels with any number of steps, as long as the parameters are changed.

Part of the high-performance computing mechanisms provided by SPL

Traversal techniques
  • Delayed cursor
  • Aggregate understanding
  • Ordered cursor
  • Multi-purpose traversal
  • Pre-filter traversal

Highly efficient joins
  • Foreign key as pointer
  • Numbering of foreign keys
  • Order-based merge
  • Attached table
  • Unilateral hash join

High-performance storage
  • Ordered compressed storage
  • Free column storage
  • Hierarchical numbering positioning
  • Index and caching
  • Double increment segmentation

Cluster computing
  • Preemptive load balancing
  • Multi-zone composite table
  • Cluster dimension table
  • In-memory spare-tire fault tolerance
  • External storage redundancy fault tolerance

Many of these algorithms and storage schemes are original inventions of SPL!

Why can't Java work well?

Java is too low-level; it lacks the necessary data types and computing libraries, making such computations difficult, and sometimes practically impossible, for application programmers to code. For example, the following snippet merely groups orders by year and seller and sums the amounts (Orders is a stream of order records):

Calendar cal = Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c = Orders.collect(Collectors.groupingBy(
                r -> {
                    cal.setTime(r.OrderDate);
                    return cal.get(Calendar.YEAR) + "_" + r.SellerId;
                },
                Collectors.summarizingDouble(r -> {
                    return r.Amount;
                })
        )
);
for(Object sellerid:c.keySet()){
    DoubleSummaryStatistics r =c.get(sellerid);
    String year_sellerid[]=((String)sellerid).split("_");
    System.out.println("group is (year):"+year_sellerid[0]+"\t (sellerid):"+year_sellerid[1]+"\t sum is:"+r.getSum());
}

High-performance algorithms are difficult to implement

  • Programmers are forced to use easy-to-write but low-performance algorithms, which often fail to beat SQL

No universal high-performance storage

  • Only databases or text files are available, and both perform poorly
  • Implementing high-performance storage from scratch is very difficult

Why can't Python work well?

Python's DataFrame is not good at structured data computation in complex situations. For example, the following pandas code computes, for each department, the salary difference between its oldest and youngest employees:

import pandas as pd
import datetime
import numpy as np
import math

def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()
    min_age = g['BIRTHDAY'].idxmax()
    diff = g.loc[max_age]['SALARY'] - g.loc[min_age]['SALARY']
    return diff

emp['BIRTHDAY'] = pd.to_datetime(emp['BIRTHDAY'])
salary_diff = emp.groupby('DEPT').apply(salary_diff)
print(salary_diff)

There are still shortcomings in terms of computational completeness

  • Relatively cumbersome for calculations such as adjacent references, ordered grouping, positional calculations and non-equivalence grouping

Poor syntax consistency

  • Some similar calculations may use different functions

Poor big data capabilities

  • No external storage cursor mechanism or parallelism ability

No universal high-performance storage

  • No efficient storage, just like Java

04 Technical Characteristics

Operating environment

  • A JVM of JDK 1.8 or above
  • Any operating system, including VMs and containers
  • Total footprint is under 1 GB; the core package (excluding third-party packages) is under 15 MB and can even run on Android
  • Resource consumption is lower than that of databases; adding memory, CPU or faster disks improves performance

esProc Technical Architecture


Simple and easy-to-use development environment

Specially designed syntax system

SPL is especially suitable for complex process operations.


Rich class libraries

Designed specifically for structured data tables

Excellent integration

esProc, developed in Java, can run independently or be seamlessly integrated into applications.


Support diverse data sources

Multiple data sources are directly used for mixed computations.

Data is computed directly, without being loaded into a database, leveraging the strengths of the original sources and enabling real-time computation

Data types that can be processed

  • Structured text: txt/csv
  • Plain text, string analysis
  • Data in Excel files
  • Multilayer structured text: JSON, XML
  • Structured data: relational database
  • Multilayer structured data: bson
  • KV type data: NoSQL
Special support for multi-layer JSON/XML data; works well with MongoDB and Kafka and exchanges data over HTTP/RESTful
* Not good at audio and video, nor at Excel formatting

Flexible and efficient file storage

High performance

Private data storage format: bin files, composite tables

File system

Data can be stored by business classification in a tree of directories

BIN FILE

  • Double increment segmentation supports parallelism with an arbitrary number of threads
  • Proprietary, efficient compression technique (less space, less CPU usage, more secure)
  • Generic storage that allows set-type data

COMPOSITE TABLE

  • Mixed row-wise and columnar storage
  • Ordered storage improves the compression ratio and positioning performance
  • Efficient intelligent indexes
  • Double increment segmentation supports parallelism with an arbitrary number of threads
  • Main table and sub-tables are integrated to reduce storage and association cost
  • Numbered keys enable efficient positional joins
Direct file storage, without a database, is more efficient, more flexible and cheaper.

Data storage and exchange

  • esProc does not own the data; all data sources have the same logical status, and any data that can be accessed is esProc data
  • There is no concept of "inside" or "outside" the database, and no action of "importing into" or "exporting out of" the database; there is only conversion, which can be implemented in SPL

Data security and reliability

05 More Solutions

Implementation of data-driven microservices

Problems

  • Mainstream frameworks such as microservices require data processing on the application side.
  • A database is difficult to embed in a front-end application, so hardcoding is often the only choice.
  • Java/ORM lacks a sufficient structured computing class library, making data processing hard to develop, and hot swapping cannot be achieved.

Solution

  • SPL replaces Java/ORM to implement data computing in (micro-service) applications.
  • Rich class library and agile syntax simplify the development.
  • The system is open and can process data of any source in real time.
  • SPL is executed by interpretation, naturally supporting hot swapping.
  • Efficient algorithms and parallel mechanism ensure computing performance.

Replace stored procedures

Problems

  • Stored procedures are hard to edit and debug, and lack portability.
  • Compiling stored procedures requires high privileges, which weakens security.
  • The shared use of a stored procedure by multiple applications will cause tight coupling between applications.

Solution

  • SPL is intuitively suitable for complex multi-step data computation.
  • SPL scripts are naturally migratable.
  • The script only requires the read privilege of the database and will not cause database security problems.
  • Scripts of different applications are stored in different directories, which will not cause coupling between applications.

Eliminate intermediate tables from databases

Problems

  • For query efficiency or simplified development, a large number of intermediate tables are generated in the database.
  • The intermediate tables take up large space, causing the database to be excessively redundant and bloated.
  • The use of the same intermediate table by different applications will cause tight coupling, and it is difficult to manage the intermediate tables (hard to delete).

Solution

  • The reason for storing intermediate tables in the database is to use the database's computing ability for subsequent computations; with SPL, those computations can be performed directly on file storage.
  • External intermediate tables (files) are easier to manage, and using different directories for storage will not cause coupling problems between applications.
  • External intermediate tables can fully reduce the load on the database, even without the need to deploy the database.

Handle endless report development requirements

Problems

  • Reporting tools/BI tools can only solve the problems in the report presentation stage and can do nothing about data preparation.
  • Data preparation implemented in SQL, stored procedures or Java hardcoding is difficult to develop and maintain, and the cost is high.
  • The report development needs are objectively endless, and data preparation is the main factor leading to high development costs.

Solution

  • Add a computing layer between report presentation and data source to solve the data preparation problems.
  • SPL simplifies the data preparation of reports, makes up for the lack of computing ability of reporting tools, and comprehensively improves the efficiency of report development.
  • Both report presentation and data preparation can quickly respond to handle endless report development needs at low cost.

Programmable data routing to implement front-end calculation

The central data warehouse needs front-end computation to share its heavy workload

  • Only high-frequency data is moved to the front end, which takes over most calculation requests
  • Programmable data routing automatically chooses between front-end data and the data warehouse and merges the results, so applications access the full data transparently

Data routing enables low-cost high-performance full data analysis

Dilemmas of implementing front-end computation with a traditional database

  • The application may access any of the data, and moving the full data to the front end means redundant construction and high cost
  • Without data routing, moving only part of the data to the front end means applications cannot transparently access the full data, degrading the experience

Mixed computation to implement real-time HTAP

Mixed calculation of hot and cold data to implement T+0 real-time analysis

  • Well-organized historical cold data
  • Real-time reading of hot transaction data

The production system needs no modification

  • Make full use of the advantages of the original multiple data sources

Open, multi-source mixed computing capability supports low-risk, high-performance and strongly real-time HTAP

HTAP databases struggle to meet HTAP requirements

  • They require replacing the production database, which is high-risk
  • SQL's computing power is insufficient, historical data cannot be well prepared, and performance is low
  • Their computing capability is closed; diverse external data sources require complex ETL processes, hurting real-time ability

Perform computation on files to implement Lakehouse

  • Computation on open-format file data
    • txt/csv/xls/json/xml
  • High-performance computation and storage in a private file format
  • Data can enter at will and be sorted out gradually: from Lake to House
  • Rich data source interfaces for direct, real-time computation
  • An RDB can only serve as the House, not the Lake
  • Strong constraints mean non-compliant data cannot enter, and complex ETL processes are inefficient
  • Closed: external data cannot be computed, let alone mixed in real-time computation

06 FAQ

Is esProc based on open source or database technology?

esProc is based on a brand-new computing model; no open-source technology could be reused, and everything from theory to code is an independent innovation.

SPL is built on this innovative theory because SQL can no longer deliver high performance: SQL cannot describe most low-complexity algorithms.


Where can esProc be deployed?

esProc is implemented in pure Java.

esProc can run smoothly on any OS equipped with a JVM, including VMs, cloud servers and even containers.


How do applications invoke esProc?

esProc provides a standard JDBC driver for Java applications.

esProc can be integrated into a Java application seamlessly.

esProc can be invoked by non-Java applications via HTTP/RESTful.
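Below is a minimal sketch of such a JDBC call. The driver class com.esproc.jdbc.InternalDriver and the URL jdbc:esproc:local:// follow esProc's documentation but should be treated as assumptions here, and the SPL script name and parameter are purely hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CallSplScript {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as given in esProc documentation (assumed here)
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://")) {
            // "getTopOrders" is a hypothetical SPL script taking one parameter
            try (PreparedStatement st = con.prepareCall("call getTopOrders(?)")) {
                st.setString(1, "East");
                try (ResultSet rs = st.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                    }
                }
            }
        }
    }
}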


Can esProc be integrated in other frameworks?

As a Java product with good integration, esProc can be seamlessly embedded into various Java frameworks and application servers; its logical status is equivalent to self-written Java code.

For computing frameworks such as Spark, integration is technically possible but has little practical value, because esProc can replace Spark for the computation itself.

In particular, esProc has its own streaming computing ability and does not need to be embedded in streaming frameworks such as Flink; used on its own, it typically delivers better functionality and performance.


Can esProc run on top of an existing database?

Yes, of course!

However, esProc cannot guarantee high performance in this situation: database I/O is inefficient, and a database can hardly provide the storage scheme that low-complexity algorithms require.


Where does esProc store data?

esProc stores data in files of self-designed format to guarantee performance.

Any file system installed on the OS can be used, including NFS.

esProc can easily separate storage from computation in order to scale out.


How does esProc ensure high reliability?

When embedded in an application, reliability is guaranteed by the application.

When used independently, load balancing and fault tolerance mechanisms are provided, but a single task may still fail; this is only suitable for small-scale clusters.

Automatic recovery after a failure is not provided.

The elastic computing mechanism of the cloud version avoids currently failed nodes when allocating VMs, achieving a certain degree of high availability.


How does esProc extend its functionality?

An interface is provided for invoking static functions written in Java, which extends the functionality.

esProc also opens an interface for custom functions, which can be used in SPL after registration.
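For example, a plain static Java method like the sketch below could serve as such an extension; the package name is hypothetical, and how it is registered and called from SPL depends on the interfaces mentioned above and is not shown here.

package com.example.ext;   // hypothetical package

public class GeoUtil {
    // A plain static function that SPL could call through the extension
    // interface described above: great-circle distance in kilometers.
    public static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double r = 6371.0;                       // mean Earth radius in km
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }
}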


What are the weaknesses of esProc?

Compared with an RDB:

Metadata support in esProc is immature; most computations start from accessing the data source, which is a little tedious for simple operations.

Compared with Hadoop/MPP:

esProc's cluster functionality has not yet had many chances to be exercised in production.

Historically, esProc has often reduced a cluster to a single machine without sacrificing performance.

Compared with Python:

SPL is developing its AI functions, but for now they are still not even close to Python's.


How is SPL compatible with SQL?

SPL is not a SQL-family computing engine. It currently supports only simple SQL on small data volumes and does not guarantee performance, so it is fair to say that esProc does not support SQL, and of course it is not compatible with any SQL stored procedures.

In the future, a dual engine supporting SQL may be developed, but it will still be hard to achieve high performance on big data; the goal is merely to make existing SQL code easy to migrate.


Is there a tool to convert SQL to SPL automatically?

Not yet.

The information in a SQL statement is insufficient for optimizing its performance. Frankly, we are not as practiced as the RDB vendors at guessing the goal of a SQL statement, so converting SQL to SPL directly would usually result in slower code.


How difficult is it to learn SPL?

SPL is dedicated to low code and high performance.

SPL syntax is easy; anyone with Java/SQL knowledge can get started in just a few hours and become proficient within a few weeks.


"Difficult": high-performance algorithms are somewhat difficult and require learning more algorithmic knowledge.

"Not difficult": once learned, many high-performance tasks become routine.


How is a performance optimization project launched?

The first one or two scenarios are implemented by Scudata engineers in collaboration with the users.

Most programmers are used to the SQL way of thinking and are not familiar with high-performance solutions; one or two scenarios of hands-on training are needed for them to get the idea.

Performance optimization routines are learned through this experience; algorithm design and implementation are not that difficult.

Give a man a fish and you feed him for a day. Teach him how to fish and you feed him for a lifetime!

07 Summary

Summary of advantages of esProc

5 advantages

High performance

Big data processing is an order of magnitude faster than with traditional solutions

Efficient development

Procedural syntax, in line with natural thinking
Rich class libraries

Flexible and open

Multi-source mixed computation
Can run independently or be embedded into applications

Save resources

A single machine can match a cluster, reducing hardware expense
Environmentally friendly

Sharp cost reduction

Development, hardware, O&M costs reduced by X times