Open-source SPL

The Professional Data Processing Language

01What is SPL?

SPL(Structured Process Language)A programming language specially for structured data computing
  • 1、Beyond the computing power of the database
  • 2、Open computing system
  • 3、Agile syntax

Universality of structured data computing

  • Structured data is everywhere.
  • Structured data computing has always been the mainstream of data processing.
Database
NoSQL
Daily data
Industrial data
...
Structured data is the top priority of data processing!

SPL Competitive Technologies

There are three main types of programming languages for processing structured data at present:

SQL the mainstream programming language of relational database
Java/C#/C++ more basic high-level language
Python emerging data processing and artificial intelligence language
  • SQL vs SPL
  • JAVA vs SPL
  • Python vs SPL
SQL is based on relational algebra theory 50 years ago SPL is based on creative discrete dataset theory
Lack of discreteness and incomplete set-orientation Full combination of discreteness and set-orientation
Barely support ordered calculation Supper ordered calculation
Do not advocate step-by-step calculation Advocate step-by-step calculation
➥The code for complex operations is lengthy and difficult to write ➥Good at complex operations
Difficult to implement high-performance algorithms High performance basic algorithms and storage mechanisms
➥Unable to take full advantage of hardware capabilities ➥Take full advantage of hardware capabilities
Closed, only the data in the database can be calculated Open, can calculate any data source

02SPL application scenarios

Data processing and computing are everywhere, but there is no good technology to solve these problems.
There are many SPL application scenarios, and they are mainly in the following six aspects:

Got SQL

SQL has certain computing power, but it is not available in many scenarios, so you will have to hard code in Java. SPL provides lightweight computing power independent of database and can process data in any scenario:

  • Structured text (txt/csv) calculation
  • Excel calculation
  • Perform SQL on files
  • Multi-layer json calculation
  • Multi-layer xml calculation
  • Java computing class library, surpass Stream/Kotlin/Scala
  • Replace ORM to implement business logic
  • SQL-like calculation on Mongodb, association calculation
  • Post calculation of WebService/Restful
  • Post calculation of Salesforce, Post calculation of SAP
  • Post calculation of various data sources: HBase,Cassandra,Redis,ElasticSearch,Kafka,…

Calculation on files

You can calculate files through SPL native syntax, and support SQL file query, which is simple and convenient.

  • Two file calculation methods are provided
  • If familiar with SQL, you can start at zero cost

MongoDB calculation

SPL can enhance mongoDB's computing power and simplify the computing process.

  • Make mongoDB equal to or stronger than the computing power of RDB
  • After enhancing the computing power, give full play to the original advantages of mongoDB

Diverse data source calculation

SPL provides the computing power independent of the database to complete the post calculation of various data sources.

Beyond SQL

SQL is difficult to deal with complex sets and ordered operations, and it is often read out and calculated in Java. SPL has complete set capability, especially supports ordered and step-by-step calculation, which can simplify these operations:

  • Ordered set
  • Position reference
  • Grouping subsets
  • Non-equivalence grouping
  • Multi-level association operation
  • Static and dynamic pivot
  • Recursion and iteration
  • Step-by-step and loop operation
  • Text and date time operation

SQL vs SPL

Set-orientation
SQL is incompletely set-oriented
  • Cannot describe a set of sets
  • Field value can no longer be a set

SPL provides richer
  • Set operation Library
  • Support lambda syntax
  • Support dynamic data structure
Support lambda syntax
SQL lacks discreteness
  • No natural sequence number, it is troublesome to refer to the specified set member
  • Only have a table with single record, do not have separate record

SPL provides richer
  • Set members can exist outside the set
  • Convenient for separate reference
Completely set-oriented
SPL set-orientation supported by discreteness
  • Complete set-orientation requires the support of discreteness
  • Allow separate members to form a new set
  • Ordered calculation is a combination of set-orientation and discreteness

SQL vs SPL: example

Calculation task:count the longest consecutively rising trading days for a stock.
select max(continuousDays)-1
from (select count(*) continuousDays
	from (select sum(changeSign) over(order by tradeDate) unRiseDays
		from (select tradeDate,
			case when closePrice>lag(closePrice) over(order by tradeDate)
			then 0 else 1 end changeSign
		from stock) )
	group by unRiseDays)
SQL solution

SQL completes the task by nesting three layers and using window functions;

Have you got it?


A
1 =stock.sort(tradeDate)
2 =0
3 =A1.max(A2=if(closePrice>closePrice[-1],A2+1,0))
SPL solution

In fact, this calculation is very simple. According to natural thinking: first sort by trading date (line 1), then compare the closing price of the day with the previous day, if higher, +1, otherwise clear it, and finally find a maximum value (line 3).

Example:aggregation that returns a set

Calculation task: list the last login interval for each user.
WITH TT AS
	(SELECT RANK() OVER(PARTITION BY uid ORDER BY logtime DESC) rk, T.*  FROM t_loginT)
SELECT uid,(SELECT TT.logtime FROM TT where TT.uid=TTT.uid and TT.rk=1)
		-(SELET TT.logtim FROM TT WHERE TT.uid=TTT.uid and TT.rk=2) interval
FROM t_loginTTTT GROUP BY uid

The return value of an aggregate function is not necessarily a single value, but can also be a set.

With complete set-orientation, it is easy to implement aggregation for grouped subsets that returns a set.

A
1 =t_login.groups(uid;top(2,-logtime)) The last two login records
2 =A1.new(uid,#2(1).logtime-#2(2).logtime:interval) Calculate interval

Example:cross-row reference

Calculation task:calculate the moving average of monthly sales (one month before and one month after).
WITH B AS
	(SELECT LAG(amount) OVER (ORDER BY 月份) f1, LEAD(amount) OVER (ORDER BY smonth) f2,   A.* FROM orders A)
SELECT smonth,amount,
	(NVL(f1,0)+NVL(f2,0)+amount)/(DECODE(f1,NULLl,0,1)+DECODE(f2,NULL,0,1)+1) moving_average
FROM B

The window function has only a simple cross-row reference, if a set is involved, members need to be used to spell.

An ordered set can provide cross-row set reference.

A
1 =orders.sort(smonth).derive(amount[-1,1].avg()):moving_average)

Example:ordered grouping

Calculation task:count the longest consecutively rising days for a stock.
SELECT max(consecutive_days)-1 FROM 
	(SELECT count(*) consecutive_days FROM 
		(SELECT SUM(updown_flag) OVER ( ORDER BY trade_date) nonrising_days FROM 
		( SELECT trade_date,
			CASE WHEN  close_price>LAG(close_price) OVER( ORDER BY trade_date THEN 0 ELSE 1 END updown_flag
		FROM stock ))
	GROUP BY nonrising_days )

Another order-related grouping, which generates a new group when the condition is true.

A
1 =stock.sort(trade_date).group@i(close_price<close_price [-1]).max(~.len())

Cooperate DB

The computing power of the database is closed and cannot process data outside the database. It is often necessary to perform ETL to import data into the same database before processing. SPL provides open and simple computing power, which can directly read multiple databases, realize mixed data calculation, and assist the database to do better calculation.

  • Fetch data in parallel to accelerate JDBC
  • SQL migration among different types of databases
  • Cross database operations
  • T+0 statistics and query
  • Replace stored procedure operation, improve code portability and reduce coupling
  • Avoid making ETL into ELT or even LET
  • Mixed calculation of multiple data sources
  • Reduce intermediate tables in the database
  • Report data source development, support hot switching, multiple data sources and improve development efficiency
  • Implement microservices, occupy less resources and support hot switching

Assist RDB

SPL assists RDB calculation and improves RDB capability.

  • SQL migration transforms standard SQL into various database "dialects“
  • Fetching data in parallel improves performance
  • Cross database / heterogeneous database calculation further enhances RDB capabilities

Replace stored procedure

Purpose of stored procedure
  • Data preparation
  • Presentation preparation
Problems of stored procedure
  • Cause intra application and inter application coupling
  • Low safety
  • Poor portability

Database decoupling

  • Put stored procedures and intermediate tables in the application
  • The database only undertakes storage and a small amount of (general) calculation
  • Application and database are decoupled, easy to maintain and expand

Cross database cluster

Multi database cluster computing is realized with the cross database and parallel ability of SPL.


T+0 query

T+0 problems
  • Transaction consistency requirements
  • If history data and current data are in the same database, the amount of data is too huge
  • If history data and current data are in different databases, cross database calculation is difficult
Parallel cross-database computing to implement T+0
  • Historical data can also be saved as files

ETL

Traditional ETL often starts with L and then ET, which is time-consuming and laborious; The real ETL process can be realized through SPL.

ETL complex calculation
  • Processing outside the database to reduce the burden on the database
  • Reduce IO and shorten time window
  • Simple and flexible implementation
  • Allow multi-source mixed processing

Assist Microservice

The data processing of microservices often depends on Java hard coding, and it is difficult to implement complex computing. The combination of SPL and microservice framework to implement data processing is more concise and efficient than other development languages.

Surpass DB

SQL is difficult to implement high-performance algorithms. The performance of big data operations can only rely on the optimization engine of the database, but it is often unreliable in complex situations.
SPL provides a large number of basic high-performance algorithms (many of which are pioneered in the industry) and efficient storage formats. Under the same hardware environment, it can obtain much better computing performance than the database, and can comprehensively replace the big data platform and data warehouse.

  • In-memory search:binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning
  • Dataset in external storage:parallel computing of text file, binary storage, double increment segmentation, columnar storage composite table, ordered storage and update
  • Search in external storage:binary search, hash index, sorting index, row-based storage and valued index, index preloading, batch search and set search, multi index merging, full-text searching
  • Traversing technique:post filter of cursor, multi-purpose traversal, parallel traversing and multi cursors, aggregation extension, ordered traversing, program cursor, partially ordered grouping and sorting, sequence number grouping and controllable segmentation
  • Association technique: foreign key addressing, foreign key serialization, index reuse, alignment sequence, large dimension table search, unilateral splitting, orderly merging, association positioning, schedule
  • Multidimensional analysis:pre summary and time period pre summary, alignment sequence, tag bit dimension
  • Distributed:free computing and data distribution, cluster multi-zone composite table, cluster dimension table, redundant fault tolerance, spare tire fault tolerance, Fork-Reduce, multi job load balancing

Decisive factors of computing performance

  • Computing efficiency depends on both hardware and software.
  • Software performance is algorithm efficiency.
  • Algorithm efficiency is determined by algorithm design and algorithm implementation.
  • It is futile to only think of a good algorithm but unable to implement it.
  • The lack of high-performance programming language will limit the implementation of good algorithms.

No gun can be made without gunpowder.


Structured data is the focus

  • At present, data calculation is still based on structured data generated by business system.
  • The industry mainly relies on large memory and large cluster to improve the performance of structured data computing.
  • The essence of large memory and large cluster is to improve the hardware capability vertically or horizontally.
  • The software core still uses SQL based relational algebra system.
  • SQL is too rough to implement many high-performance algorithms.

SQL is difficult to achieve high-performance structured data computing.

Why can't the database run fast?

  • The reason why SQL is difficult to implement efficient algorithms is its theoretical system (relational algebra).
  • The defects in theory are difficult to be made up by engineering.
【Example】 What will SQL do to get the top 10 from 100 million rows of data?
  • The principle of SQL statement is to sort all data, and then get the top 10, which is very inefficient.
  • There is a way to realize this operation without full sorting, but it can not be expressed in SQL.
  • You can only expect the database engine to optimize automatically, but the database does not know how to optimize in complex situations.

It is futile to only think of a good algorithm but unable to implement it.

The essence of high performance is the development efficiency. It should be easy to think of and code.

SPL high performance algorithms and storage schemes

In-memory search
  • Binary search
  • Sequence number positioning
  • Position index
  • Hash index
  • Multi-layer sequence number positioning
Dataset in external storage
  • Segmentation of text files
  • Bin file and double increment segmentation
  • Data types
  • Composite table and columnar storage
  • Order and supplementary file
  • Data update and multi-zone composite table
Search in external storage
  • Binary search
  • Hash index
  • Sorting index
  • Row-based storage and valued index
  • Index preloading
  • Batch search
  • Set search
  • Multi index merging
  • ull-text searching
Traversing technique
  • Post filter of cursor
  • Multi-purpose
  • Parallel traversing
  • Parallel loading of database
  • Multi cursors
  • Group aggregation
  • Understanding aggregation
  • Redundant grouping key
Ordered traversing
  • Ordered group aggregation
  • Ordered grouping subsets
  • Program cursor
  • Partially ordered grouping
  • Sequence number grouping and controllable segmentation
  • Index sorting
Foreign key association
  • Foreign key addressing
  • Temporary addressing
  • foreign key serialization
  • Inner join syntax
  • index reuse
  • alignment sequence
  • large dimension table search
  • unilateral splitting
Merge and join
  • Ordered merging
  • Merge by segment
  • Association positioning
  • schedule
Multidimensional analysis
  • Partial pre-summary
  • Time period pre-summary
  • Redundant sorting
  • alignment sequence
  • tag bit dimension
Cluster
  • free computing and data distribution
  • cluster multi-zone composite table
  • cluster dimension table
  • Dimension table segmentation
  • redundant fault tolerance
  • spare tire fault tolerance
  • multi job load balancing

Many algorithms here are original inventions of SPL!

Examples of high performance algorithms

Examples of high performance algorithms
  A  
1 =file("data.ctx").create().cursor()  
2 =A1.groups(;top(10,amount)) Top 10 orders
3 =A1.groups(area;top(10,amount)) Top 10 orders of each area

High complexity sorting is transformed into low complexity aggregation.

Traversal reuse
  A  
1 =file("order.ctx").create().cursor() Traversal preparation
2 =channel(A1).groups(product;count(1):N) Configure traversal reuse
3 =A1.groups(area;sum(amount):amount) Traversing,and get grouping result
4 =A2.result() Get result of traversal reuse

Multiple result sets can be returned in one traversal.

SPL performance

Hardware configuration:Intel3014 1.7G/12 core/64G memory , ARM/16 core/32G memory

In the same hardware environment, SPL performance is much better than Oracle; Even SPL using low-end chips can surpass Oracle with high-end chips.

【Application case】 Mobile Banking: optimization of large concurrent associated query of current details

Concurrent access is huge

Mobile banking/ online banking, hundreds of thousands/ millions of concurrent details query

ES does not support association

The association between details and organization dimension tables cannot be realized, so wide table redundancy is required

Updating dimension tables takes a long time

For dimension table data adjustment, the wide table data should be fully updated

Use the customer data characteristics to store orderly according to the customer ID; In–memory dimension table; Easy horizontal expansion and low cost.

【 Application case 】 User portrait: intersection of customer groups, acceleration of real-time calculation

Huge amount of data

Hundreds of millions of customers have many to many relationships with thousands of customer groups, with dozens of dimensions

Unable to pre calculate

Thousands of customer groups, too many permutations and combinations

Real time computing is difficult

The database calculates the intersection in real time and filters the dimensions, and cannot respond in seconds.

The database calculates the intersection in real time and filters the dimensions, and cannot respond in seconds.

For Excel

Three ways of analyzing and processing Excel:

BI Tool
  • Only regular calculation can be implemented
  • Unable to complete data preparation
Data Tool
  • Only modelized calculations can be processed
  • Unable to complete flexible and complex calculations
Programming
  • Can handle all situations

More than 80% of the data processing and calculation requirements cannot be completed with visual tools.

【Conclusion】Programming is the most effective way to deal with structured data.But programming languages such as Python are not easy to use.

SPL provides Excel-oriented set operations

Cell value and summary value calculation

  • Calculate using adjacent rows and intervals
  • Accumulation that may terminates early
  • Use adjacent rows to calculate when the same type of data is continuous
  • Use adjacent rows to calculate when the same type of data is discontinuous
  • Use the summary information of the same type of data
  • …….

Set operation and subordinate judgment

  • Intersection, union and difference of simple members
  • Intersection, union and difference of row-style data
  • Intersection, union and difference of uncertain number of sets
  • Set equality and subordinate judgment
  • Order-unrelated set equality and subordinate judgment
  • ……

Duplication judgment, count and deduplication

  • Judge whether there is duplicate data
  • Count the number of repetitions
  • Count the number of repetitions of uncertain number of columns
  • Deduplication of row-style data
  • Deduplication of simple data
  • …….

Sorting and ranking

  • Align in the specified order
  • Align in the specified order with duplicate values
  • Concatenate members with the same ranking
  • Sort within the same category
  • Ranking under category
  • ……

Special grouping and aggregate methods

  • Group every N members
  • Use adjacent data as grouping condition
  • Group when empty or non empty row is encountered
  • Group by interval of data values
  • Concatenate the data in the category into text
  • ……

Association and comparison

  • Associated table reference
  • Interval association
  • Two-dimension association table
  • Use interval range to query association table
  • Associate multiple rows of data
  • ……

Row-column transpose

  • Row to column of fixed column
  • Interchange between row-style table and crosstab
  • Row/column conversion of high level categories
  • Concatenate data in a category into a column horizontally
  • When data in a category is concatenated into a column, it needs to be classified or sorted again
  • ……

Expansion and supplement

  • Generate continuous interval
  • Expand one row to multiple rows based on values
  • Split text and expand to multiple rows
  • Fill in the missing part in the continuous values
  • Fill in several empty rows every N rows
  • ……

Examples of SPL assisting Excel

The combination of SPL and Excel can enhance the calculation ability of Excel and reduce the difficulty of calculation implementation.

Use clipboard

Through SPL's Excel plug-in, you can use SPL functions in Excel to calculate, and you can also call SPL scripts in VBA.

Load SPL plugin in Excel
Use SPL functions in Excel to calculate
Call SPL scripts in VBA to calculate

For Industry

There are a large number of time series data in industrial scenarios, and databases often only provide SQL. The ordered calculation capability of SQL is very weak, resulting in that it can only be used for data retrieval and cannot assist in calculation.
Many basic mathematical operations are often involved in industrial scenarios. SQL lacks these functions and the data can only be read out to process.
SPL can well support ordered calculation, and provides rich mathematical functions, such as matrix and fitting, and can more conveniently meet the calculation requirements of industrial scenes.

  • Time series cursor: aggregation by granularity, translation, adjacence reference, association and merging
  • Historical data compression and solidification, transparent reference
  • Vector and matrix operations
  • Various linear fitting: least squares, partial least squares, Lasso, ridge …

Industrial algorithms often need repeated experiments. SPL development efficiency is very high, and you can try
more within the same time period:

  • Instrument anomaly discovery algorithm
  • Abnormal measurement sample locating
  • Curve lifting and oscillation pattern recognition
  • Constrained linear fitting
  • Pipeline transmission scheduling algorithm

Application example: constrained linear fitting

Background: a refinery hopes to learn relatively accurate product yield from historical data and use it to formulate raw material processing and product output plans for the next day or in the future.

Objective: a set of coefficients (called yield) is fitted by using the historical production data in the way of linear fitting under constraints, so as to minimize the error between the calculated product yield and the actual yield.

Using SPL, the algorithm can be implemented quickly and accurately, so as to predict the output.

Application example: feature screening curves

The trend of time series reflects the actual production situation to a certain extent, which is called working condition. If the curve trend is stable, it indicates normal, and rapid rise or rapid decline indicates that abnormal conditions may occur. Corresponding the curve trend to the working condition is helpful to analyze the problems existing in production activities and improve production efficiency.

Using SPL, the algorithm can be realized quickly and accurately, so as to accurately identify various characteristic curves.

Application example: anomaly discovery

There are usually dozens of units in the refinery, and the number of instruments in each unit is hundreds or thousands. It is impossible to find abnormalities by manual observation, and corresponding algorithms are needed to identify them.

Using SPL, the algorithm can be implemented quickly and accurately, so as to accurately identify all kinds of outliers.

03SPL technology introduction

  • 1. Convenient development environment
  • 2. Specially designed syntax system
  • 3. Rich class library
  • 4. Diverse data sources
  • 5. External data interface
  • 6. Integration
  • 7. Hot switching
  • 8. Multithread parallel computing
  • 9. High performance storage
  • 10. Distributed computing

Convenient development environment


Specially designed syntax system

SPL is especially suitable for complex process operations.

Rich class library

Intended for structured data processing


Diverse data sources

Multiple data sources are directly used for mixed calculation, and there is no need to unify the data (ETL) before calculation.

External data interface

  • Commercial RDBMS:Oracle, MS SQL Server, DB2, Informix
  • Open Source RDBMS:MySQL, PostgreSQL
  • Open Source NOSQL:MongoD, Redis, Cassandra, ElasticSearch
  • Hadoop family:HDFS, HIVE, HBase
  • Application software:SAP ECC, BW
  • Files:Excel, Json, XML, TXT
  • Others:Http Restful, Web Services, OLAP4j, ...
Built in interface, ready to use

Integration

SPL is developed in Java and provides a standard application interface, and can be seamlessly integrated into applications.


Hot switching

SPL interpreted execution supports hot switching.


Multithread parallel computing

Easily implement multithreaded computing for a single task.

High performance storage

High performance data storage

Private data storage format : Bin file, composite table

File system storage

Support the storage of data by business classification in the form of tree directory

Bin file

  • Double increment segmentation supports any number of parallel computing
  • Privatehigh-efficiency compression coding (reducing space; less CPU consumption; security)
  • Generic storage, allow set data

Composite table

  • Row-based storage and columnar storage
  • Orderly storage improves compression rate and positioning performance
  • Efficient intelligent index
  • Double increment segmentation supports any number of parallel computing
  • Integration of primary and sub tables to reduce storage and association
  • Using serial byte to achieve efficient positioning and association

Distributed computing

Data fault tolerance and computing fault tolerance

Provide two data fault-tolerant mechanisms: external storage redundancy fault-tolerance and memory spare-tire fault-tolerance

Support computing fault tolerance. When a node fails, it automatically migrates the computing tasks of the node to other nodes to continue to complete.

Controllable data distribution

Users can flexibly customize data distribution and redundancy scheme according to the characteristics of data and computing task, so as to effectively reduce the amount of data transmission between nodes and obtain higher performance.

No central architecture to avoid single point of failure

The cluster does not have a permanent central master node. Programmers use code to control the nodes involved in computing.

Load balancing capability

Whether to allocate tasks is determined according to the idle degree (number of threads) of each node to achieve an effective balance between burden and resources.