01Query files directly in SQL

Want to make ad-hoc queries on txt/csv/xls/xlsx in SQL, but traditional practice involves complicated preparations, including installing database, creating table and loading data, and others.

By default support files of regular formats

  • Comma-separated csv files where the first row contains field names
  • Tab-separated txt files where the first row contains field names
  • xls/xlsx files where the first row contains column names
  • Support of files of irregular formats
Directly write SQL query on a file as if it is a database table.

SPL WIN supports SQL92 standards,including:

  • where, order by, distinct, group by…having…
  • sum, count, max, min, avg
  • inner join, Left join, full join
  • union, union all, intersect, minus
  • from clause, with clause, nested query, case when, NULLIF, COUNTIF, COALESCE
  • as, between, like, Top-N, limit n offset m
  • select …… into 'new file name'
  • A wealth of string functions, mathematical functions and date functions

02Mixed computations on files

To join a csv file and an xls file, and group and summarize the join result, the traditional practice is creating a table in the database for every file. The process is complicated.
  • Mixed computations also include set operations, subqueries, in functions and others
  • Mixed computations can be performed on any file formats
  • Directory name can be omitted if there is a working directory
Perform the join directly, the process is simple, convenient and efficient.

03Merge and split multiple files

There are xls files named after dates in a certain directory. Each file stores sale records of a date (below are two of the files). We want to find all xls files of a certain month, and group and summarize them. The traditional practice is to merge target files manually and load data to the database. This involves extremely heavy workload.
Write SQL directly – it supports filename containing wildcard characters – and automatically merge multiple files.
In a work zone, double click a file to open it (input), and the computing result can be directly exported as a single file (output).
Split and store data in Orders table into a new xls file by customer column, where each sheet stores orders of one customer. The sheet name is customer name.
SPL native syntax supports file splitting, while SQL does not support it.
T("Orders.xlsx").group(Client).(file("Ordersm.xlsx").xlsexport@ta(~;~.Client))

More Excel file/sheet merge/split scenarios:

  • Merge files under multilevel subdirectories;
  • Split a file into multiple sheets or files according to a fixed number of rows;
  • Combine a specified column with the other columns and write combinations to sheets or files respectively;
  • Summarize values of cells at same position in multiple card-style sheets or multiple card-style tables in one sheet, and generate a summary card;
  • Arrange multiple card-style tables in a sheet into a standard two-dimensional table;
  • Merge multiple card-style sheets into a standard two-dimensional table, where each row corresponds to one card;
  • Split a row-wise file into multiple card-style sheets, where each card corresponds to one row.

04Query files of irregular formats

When a txt file does not contain column names in the first row:
26	TAS		1	2142.4	2009-08-05
33	DSGC	1	613.2	2009-08-14
84	GC		1	88.5		2009-10-16
133	HU		1	1419.8	2010-12-12	
file("d:/noColName.txt").import() in {} is SPL native code that has outstanding parsing ability
Parse files of irregular formats using SPL native syntax, and access column names according to ordinal numbers.
With a txt file separated by double vertical lines, traditional practice is parsing it through programming because it does not support a multi-separator.
OrderID||Client||SellerId||Amount||OrderDate
26||TAS||1||2142.4||2009-08-05
33||DSGC||1||613.2||2009-08-14
84||GC||1||88.5||2009-10-16
import@t(;,"||") means the separator is double vertical lines
Support a multi-separator and can use its native syntax to parse files of irregular formats.
To skip the first two useless rows in a xls file, the traditional practice cannot do that without programming.
xlsimport@t(;,3) reads the file from the 3rd row when parsing it
Skip useless empty rows and use native syntax to parse files of irregular formats.
Use interactive computation to parse files of more complex formats step by step.

More scenarios

  • Fields are separated by a specified column width;
  • The file has Useless empty rows at random positions;
  • The valid data in file exists from row N to row M;
  • There are whitespace at both end of the string field in a file;
  • There are useless double or single quotation marks, or parentheses before and after a field in a csv file;
  • Handle escaping for a field in a csv file according to the Excel rule (as handling quotation marks);
  • The date format in a csv file is like "01-01-2012";
  • A txt file uses semicolon instead of carriage return to separate records;
  • There are certain invalid columns in an xls file;
  • Need password to open an xls file;
  • Read sheets in an xls file according to names or ordinal numbers.

05Interactive computation

In a txt file, every three rows consist of one record and we want to group and summarize it. The traditional practice needs programming to parse the file and the code is hard to write.
33
DSGC	2	613.2
2009-08-14
84
GC1	1	88.5
2009-10-16
133
HU	1	1419.8
2010-12-12
Click to name or rename the result set
Perform the interactive computation – organize data using SPL native syntax, name the computing result, and then write SQL query.

Continue to perform the interactive computation.
Continue to filter
Continue to join
Push forward to achieve multi-step, complex computation.
To find the first n customers whose sale amount takes up at least half of the total. It is hard to accomplish this through the traditional practice.
1. Sort customers by sale amount
2. Add the cumulation column Field,and enter formula ==cum(Amt) in its first cell
The interactive computation can considerably simplify the process.
We can also use SPL native code to generate the computed field. The code is ClientAmt.derive(cum(Amt):Field).
Continue the interactive computation to find the first n customers whose sale amount takes up at least half of the total.
3.Calculate half of the total
4.Get position of record containing half of the total - n
5.Get records before the position, which are our target
To find the first n customers whose sale amount takes up at least half of the total. The following steps are optional.
6.Get target records separately
7.Export them to an xls file
8.Save the xls file permanently or share it

More scenarios

  • Count the longest continuously rising days for a stock;
  • Find employees in each department whose age is below the average of their department;
  • Find US employees whose managers are Indians;
  • Parse log files;
  • When each cell in a zone of an Excel sheet contains value in the form of key-value;
  • Split each value of certain field into N rows according to the separator and generate a standard two-dimensional table;
  • Combine multiple rows into one according to the separator and generate a standard two-dimensional table;
  • Perform transposition and generate a standard two-dimensional table;
  • When certain rows do not have values in the second column and their other columns need to shift right one cell;
  • The number of rows consisting of one record varies and each record begins with same mark.