2. Tutorial

Part 1 was an overview; this chapter is a step-by-step tutorial that shows how to connect to Calcite and, using a simple adapter, present a directory of CSV files as tables with schema information and a full SQL interface.
calcite-example-csv is a fully functional adapter for Calcite that reads text files in CSV (comma-separated values) format. Remarkably, a few hundred lines of Java code are enough to provide full SQL query capability.
The CSV adapter also serves as a reference template for adapters for other data formats. Although the amount of code is small, it covers several important concepts:
1) user-defined schemas, via the SchemaFactory and Schema interfaces
2) declaring schemas in a JSON model file
3) declaring views in a JSON model file
4) user-defined tables, via the Table interface
5) defining the record type of a table
6) a simple implementation of Table, using the ScannableTable interface, that enumerates all rows directly
7) a more advanced implementation, using FilterableTable, that filters out rows according to simple predicates
8) an advanced implementation of Table, using TranslatableTable, that translates into relational operators using planner rules
2.1 Download and build
Version dependencies: Java (1.7 or higher; 1.8 preferred), git and maven (3.2.1 or later).
$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv

2.2 Query test
You can connect to Calcite through the project's built-in sqlline script:
$ ./sqlline
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
(if using the Windows operating system, the command is sqlline.bat)
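The same connection can also be opened programmatically through Calcite's JDBC driver. The following is a minimal sketch, not part of the original tutorial code; the class name and the printed column are chosen here for illustration, and it assumes calcite-core and calcite-example-csv are on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvAdapterDemo {
  public static void main(String[] args) throws Exception {
    // The "model" property in the JDBC URL points Calcite at the JSON model file,
    // just like the sqlline !connect command above.
    try (Connection connection = DriverManager.getConnection(
            "jdbc:calcite:model=target/test-classes/model.json", "admin", "admin");
         Statement statement = connection.createStatement();
         ResultSet resultSet = statement.executeQuery("SELECT name FROM emps")) {
      while (resultSet.next()) {
        System.out.println(resultSet.getString(1));
      }
    }
  }
}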
Execute a metadata query
sqlline> !tables
+------------+--------------+-------------+---------------+----------+-------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE    | REMARKS  | TYPE  |
+------------+--------------+-------------+---------------+----------+-------+
| null       | SALES        | DEPTS       | TABLE         | null     | null  |
| null       | SALES        | EMPS        | TABLE         | null     | null  |
| null       | SALES        | HOBBIES     | TABLE         | null     | null  |
| null       | metadata     | COLUMNS     | SYSTEM_TABLE  | null     | null  |
| null       | metadata     | TABLES      | SYSTEM_TABLE  | null     | null  |
+------------+--------------+-------------+---------------+----------+-------+
JDBC hint: the !tables command in sqlline is equivalent to calling DatabaseMetaData.getTables(). There are other commands to query JDBC metadata, such as !columns and !describe.
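For reference, here is a minimal sketch (not from the tutorial) of the equivalent JDBC call, assuming a connection obtained as in the earlier example:

// List every table Calcite exposes; roughly what sqlline's !tables does.
DatabaseMetaData metaData = connection.getMetaData();
try (ResultSet tables = metaData.getTables(null, null, null, null)) {
  while (tables.next()) {
    System.out.println(tables.getString("TABLE_SCHEM") + "."
        + tables.getString("TABLE_NAME")
        + " (" + tables.getString("TABLE_TYPE") + ")");
  }
}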
As the results show, there are five tables in the system: the EMPS, DEPTS and HOBBIES tables in the SALES schema, and the COLUMNS and TABLES tables in the metadata schema. The system tables are always present in Calcite, but the other tables are provided by the specific implementation of the schema; in this case, the EMPS and DEPTS tables are based on the EMPS.csv and DEPTS.csv files in the target/test-classes/sales directory.
By executing some queries on these tables, we can verify that Calcite provides a full implementation of SQL.
First, scan a table:
sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+------+---+
| EMPNO  |  NAME  | DEPTNO  | GENDER  |      CITY      | EMPID  | AGE  | S |
+--------+--------+---------+---------+----------------+--------+------+---+
| 100    | Fred   | 10      |         |                | 30     | 25   | t |
| 110    | Eric   | 20      | M       | San Francisco  | 3      | 80   | n |
| 110    | John   | 40      | M       | Vancouver      | 2      | null | f |
| 120    | Wilma  | 20      | F       |                | 1      | 5    | n |
| 130    | Alice  | 40      | F       | Vancouver      | 2      | null | f |
+--------+--------+---------+---------+----------------+--------+------+---+
JOIN and GROUP BY are also supported:
sqlline> SELECT d.name, COUNT(*)
. . . . > FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . . > GROUP BY d.name;
+------------+---------+
|    NAME    | EXPR$1  |
+------------+---------+
| Sales      | 1       |
| Marketing  | 2       |
+------------+---------+
Finally, the VALUES operator generates a single row, which is a convenient way to test expressions and built-in SQL functions:
sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0  |
+---------+
| 13      |
+---------+
Calcite has many other SQL features; there is no space to show them all here, and you can write more queries to try them out.
2.3 Schema Discovery
Now, how did Calcite find these tables? Remember, core Calcite knows nothing about CSV files. (As a "database without a storage layer", Calcite doesn't know about any file formats.) Calcite knows about these tables because we told it to run the code in the calcite-example-csv project.
The chain of events at run time goes like this. First, we define a schema in a model file, handled by a schema factory class. The schema factory then creates a schema, and the schema creates several tables, each of which knows how to get data by scanning a CSV file. Finally, after Calcite has parsed the query and planned it to use those tables, Calcite asks the tables to read the data as the query is executed. Next, let's look at these steps in more detail.
In the JDBC connection string, we give the path to a model in JSON format. The model is defined as follows:
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}
The model defines a single schema called SALES. The schema is powered by a plugin class, org.apache.calcite.adapter.csv.CsvSchemaFactory, which is part of the calcite-example-csv project and implements the Calcite SchemaFactory interface. Its create method instantiates a schema, passing in the directory argument from the model file.
public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory");
  String flavorName = (String) operand.get("flavor");
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(new File(directory), flavor);
}
Driven by the model, the schema factory instantiates a single schema called SALES. The schema is an instance of org.apache.calcite.adapter.csv.CsvSchema, which implements the Calcite Schema interface.
A schema's job is to produce a list of tables. (It can also list sub-schemas and table functions, but these are advanced features and calcite-example-csv does not support them.) The tables implement Calcite's Table interface. CsvSchema produces tables that are instances of CsvTable and its subclasses.
Here is the relevant code from CsvSchema, which overrides the getTableMap() method in the AbstractSchema base class.
protected Map<String, Table> getTableMap() {
  // Look for files in the directory ending in ".csv", ".csv.gz", ".json", ".json.gz".
  File[] files = directoryFile.listFiles(
      new FilenameFilter() {
        public boolean accept(File dir, String name) {
          final String nameSansGz = trim(name, ".gz");
          return nameSansGz.endsWith(".csv")
              || nameSansGz.endsWith(".json");
        }
      });
  if (files == null) {
    System.out.println("directory " + directoryFile + " not found");
    files = new File[0];
  }
  // Build a map from table name to table; each file becomes a table.
  final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
  for (File file : files) {
    String tableName = trim(file.getName(), ".gz");
    final String tableNameSansJson = trimOrNull(tableName, ".json");
    if (tableNameSansJson != null) {
      JsonTable table = new JsonTable(file);
      builder.put(tableNameSansJson, table);
      continue;
    }
    tableName = trim(tableName, ".csv");
    final Table table = createTable(file);
    builder.put(tableName, table);
  }
  return builder.build();
}

/** Creates different sub-type of table based on the "flavor" attribute. */
private Table createTable(File file) {
  switch (flavor) {
  case TRANSLATABLE:
    return new CsvTranslatableTable(file, null);
  case SCANNABLE:
    return new CsvScannableTable(file, null);
  case FILTERABLE:
    return new CsvFilterableTable(file, null);
  default:
    throw new AssertionError("Unknown flavor " + flavor);
  }
}
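The trim and trimOrNull helpers used above are not shown in this excerpt. A minimal sketch of equivalent helpers, written here for illustration (the real project defines its own versions), could look like this:

/** Returns the string with the given suffix removed, or the original string
 * if it does not end with that suffix. (Illustrative helper.) */
private static String trim(String s, String suffix) {
  String trimmed = trimOrNull(s, suffix);
  return trimmed != null ? trimmed : s;
}

/** Returns the string with the given suffix removed, or null if it does not
 * end with that suffix. (Illustrative helper.) */
private static String trimOrNull(String s, String suffix) {
  return s.endsWith(suffix)
      ? s.substring(0, s.length() - suffix.length())
      : null;
}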
The schema scans the specified directory and finds all files whose names end in ".csv". In this case, the directory is target/test-classes/sales and contains the files EMPS.csv and DEPTS.csv, which become the tables EMPS and DEPTS.
2.4 Tables and views in schemas

A schema defined in a model can also contain views. For example, the following model defines a view called FEMALE_EMPS in the SALES schema:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}
The line type: 'view' defines FEMALE_EMPS as a view, rather than a regular table or a custom table. Note that single quotes within the SQL string are escaped with a backslash. JSON doesn't make it easy to write long strings, so Calcite supports an alternative syntax: if a view has a long SQL statement, you can supply it as a list of lines rather than a single string.
{
  name: 'FEMALE_EMPS',
  type: 'view',
  sql: [
    'SELECT * FROM emps',
    'WHERE gender = \'F\''
  ]
}
Once a view has been defined, you can query it just as if it were a table:
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-view.json admin admin
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d ON e.deptno = d.deptno;
+--------+------------+
|  NAME  |    NAME    |
+--------+------------+
| Wilma  | Marketing  |
+--------+------------+

2.5 Custom Tables
Custom tables are tables driven by user-defined code. They don't require a custom schema.
For specific examples, please refer to model-with-custom-table.json
{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: 'scannable'
          }
        }
      ]
    }
  ]
}
We can query the custom table in the usual way:
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO  |  NAME  |
+--------+--------+
| 100    | Fred   |
| 110    | Eric   |
| 110    | John   |
| 120    | Wilma  |
| 130    | Alice  |
+--------+--------+
The schema above is in the usual format, and it contains a custom table powered by org.apache.calcite.adapter.csv.CsvTableFactory, which implements the Calcite TableFactory interface. Its create method instantiates a CsvScannableTable, passing in the file argument from the model file.
public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  String fileName = (String) map.get("file");
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}
Implementing a custom table is often a simpler alternative to implementing a custom schema. Both approaches end up creating similar implementations of the Table interface, but with a custom table you don't need to implement metadata discovery: CsvTableFactory creates a CsvScannableTable just as CsvSchema does, but the table implementation does not scan the file system for .csv files.
Custom tables require more work from the model author (who has to explicitly specify each table and the file it maps to), but also give the author more control (for example, providing different parameters for each table).
2.6 Model comments
Within a model you can add comments using /* ... */ and // syntax:
{
  version: '1.0',
  /* Multi-line
     comment. */
  defaultSchema: 'CUSTOM_TABLE',
  // Single-line comment.
  schemas: [
    ..
  ]
}
Comments are not standard JSON, but they are harmless.
2.7 Optimizing queries using planner rules
The table implementations and queries we've seen so far are fine because the tables don't contain much data. But if a custom table has, say, a hundred columns and a million rows, you don't want every query to retrieve the full data set. You want Calcite to negotiate with the adapter and find a more efficient way of accessing the data.
This negotiation is a simple form of query optimization. Calcite supports query optimization by adding planner rules. Planner rules operate by looking for patterns in the query parse tree (for example, a projection on top of a certain kind of table), and the rules are extensible, just as schemas and tables are. So, if you have a data set that you want to access via SQL, you first define a custom table or schema, and then define some rules that make the access efficient.
To see this in action, let's use a planner rule to access a subset of columns from a CSV file. We can run the same query against two similar schemas:
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   EnumerableTableScan(table=[[SALES, EMPS]])        |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   CsvTableScan(table=[[SALES, EMPS]])               |
+-----------------------------------------------------+
The two plans differ in their scan operator: EnumerableTableScan versus CsvTableScan.
What causes the difference between the two plans? Let's follow the trail of evidence. The smart.json model file contains one extra line:
flavor: "translatable"
This causes a CsvSchema to be created with flavor = TRANSLATABLE, and its createTable method creates instances of CsvTranslatableTable rather than CsvScannableTable.
CsvTranslatableTable implements the TranslatableTable.toRel() method to create a CsvTableScan. Table scans are the leaves of a query operator tree; the default implementation is EnumerableTableScan, but we construct a distinctive subtype so that the rule can fire.
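To make that concrete, here is a rough sketch of what such a toRel implementation can look like, written against the interfaces named above purely for illustration (the actual calcite-example-csv source may differ in details). It simply requests every field and builds the custom scan node:

@Override public RelNode toRel(RelOptTable.ToRelContext context,
    RelOptTable relOptTable) {
  // Scan every field by default; CsvProjectTableScanRule (below) may later
  // replace this with a scan over a subset of the fields.
  final int fieldCount = relOptTable.getRowType().getFieldCount();
  final int[] fields = new int[fieldCount];
  for (int i = 0; i < fieldCount; i++) {
    fields[i] = i;
  }
  return new CsvTableScan(context.getCluster(), relOptTable, this, fields);
}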
Here is the rule in its entirety:
public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}
The constructor declares a relational expression matching pattern that enables the rule to take effect.
The onMatch method generates a new relational expression and calls RelOptRuleCall.transformTo () to indicate that the rule has been triggered successfully.
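A rule only fires if the planner knows about it. Below is a minimal sketch of one way to register it, assuming the custom scan node overrides the standard RelNode.register hook; this is shown for illustration, and the real project may wire it up differently:

// In CsvTableScan (or wherever the adapter hooks into the planner):
@Override public void register(RelOptPlanner planner) {
  // Tell the planner about the project-onto-CsvTableScan rewrite.
  planner.addRule(CsvProjectTableScanRule.INSTANCE);
}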
2.8 The query optimization process
There are many things that could be said about how clever Calcite's query planner is, but we won't say them here. The cleverness is there to take the burden off the writers of planner rules.
First, Calcite does not execute rules in a prescribed order. The query optimization process explores many branches of a branching tree, just as a chess-playing program examines many possible sequences of moves. If rules A and B both match a given section of the query operator tree, Calcite can apply both.
Second, Calcite uses cost when choosing between plans, but the cost model doesn't prevent rules from firing even if they seem to be more expensive in the short term.
Many optimizers have a linear optimization scheme. Faced with a choice between rule A and rule B, as above, such an optimizer has to decide immediately. It might have a policy such as "apply rule A to the whole tree, then apply rule B to the whole tree", or apply a cost-based policy and execute whichever rule produces the cheaper result.
Calcite doesn't require such compromises. This makes it simple to combine various sets of rules. If you want to combine rules that recognize materialized views with rules that read from CSV and JDBC source systems, you just give Calcite the set of all rules and let it get to work.
Calcite does use a cost model: the cost model determines which plan is ultimately used, and sometimes it prunes the search tree to prevent the search space from exploding, but it never forces you to choose between rule A and rule B. This is important, because it avoids falling into local optima in the search space that are not truly optimal.
Also, the cost model is pluggable (extensible), as are the table and query-operator statistics it is based on. We will cover this in more detail later.
2.9 JDBC Adapter
The JDBC adapter maps a schema in a JDBC data source into a Calcite schema.
For example, the following schema is read from a MySQL "foodmart" database.
{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
(The FoodMart database will be familiar to those of you who have used the Mondrian OLAP engine, because it is Mondrian's main test data set. To load the data set, follow Mondrian's installation instructions.)
Current limitations: the JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite. Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should be pushed down to that database. If the tables come from multiple JDBC sources, or a mixture of JDBC and non-JDBC sources, Calcite uses the most efficient distributed query approach that it can.
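As an illustration only (not part of the tutorial), a query against such a model can be issued through the same Calcite JDBC driver as before. The model file name and the foodmart table used here are assumptions; the sketch requires a running MySQL foodmart database and the MySQL JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcAdapterDemo {
  public static void main(String[] args) throws Exception {
    // "mysql-foodmart-model.json" is a hypothetical file containing the model above.
    try (Connection connection = DriverManager.getConnection(
            "jdbc:calcite:model=mysql-foodmart-model.json");
         Statement statement = connection.createStatement();
         // The table scan is pushed down to MySQL; the aggregation currently
         // runs inside Calcite, as described above.
         ResultSet resultSet = statement.executeQuery(
             "SELECT COUNT(*) FROM \"sales_fact_1997\"")) {
      while (resultSet.next()) {
        System.out.println(resultSet.getLong(1));
      }
    }
  }
}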
2.10 The clone JDBC adapter
The clone JDBC adapter creates a hybrid database. The data comes from a JDBC database, but it is read into in-memory tables the first time each table is accessed. Calcite evaluates queries against those in-memory tables, effectively a cache of the database.
For example, the following model reads the "foodmart" table from the MySQL database:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
Another technique is to build a clone schema on top of an existing schema. Use the source property to reference a schema defined earlier in the model, like this:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}
You can use this approach to create a clone schema of any kind of schema, not just JDBC.
The cloning adapter isn't the be-all and end-all. We plan to develop more sophisticated caching strategies and more complete and efficient implementations of in-memory tables, but for now the clone JDBC adapter shows what is possible and allows us to try out initial implementations.
2.11 More topics
There are many other ways to extend Calcite not yet described in this tutorial. The adapter specification describes the APIs involved.