Databaseless Queries: Using Calcite as Research Testbed for Hybrid Cloud Data Integration
Companies around the world have to increasingly lift their data integration processes from being centered on single tool invocations towards more global, automatically cost-optimizing and elastically scaling data provisioning infrastructure. From a research perspective, the path to take towards that goal remains an open question. In this video presentation of a CLOUDSTARS secondment, we take a first step towards investigating the suitability of Apache Calcite to understand more about the potential compute cost of pipelines, operators and user-defined functions. Our specific contribution for research/education and software developer communities is a ‘databaseless’ Calcite Client that supports DDL, DML and DQL commands along with UDFs and query plan persistence for future analysis.
Why the focus on Calcite? Apache Calcite is an open source dynamic data management framework. This may sound a bit abstract. In more concrete terms, it is a framework to parse and optimise queries in all flavours of SQL, and to bind such queries to existing data sources. It is not a database management system in itself, but a construction set to build such systems and integrate existing systems. Following the ‘databaseless’ idea, by setting up transient in-memory tables (supported on the parser level) Calcite can be used as pseudo-DBMS for self-contained research and prototyping purposes. This is what we do in our work. Similar to an embedded and in-memory SQlite (using the default :memory: pseudo-connection), our client does not require any server operation or authentication or anything else that might get into the way of running experiments; yet it is able to represent tables, views and functions, and perform instructions directly on the Calcite API level or (pending the resolution of bug CALCITE-5791) through standard JDBC.
For those new to the framework, the sheer amount of classes and settings can be overwhelming. There is good intro-level documentation available on the website, but going deeper anywhere quickly leads to a mashup of random finds on the Internet, crawling through the API documentation, and eventually grepping through the source code. With the aim of establishing a baseline for our research needs, but also with sufficient documentation to be useful in education and training, we therefore provide the ‘databaseless’ Calcite Client application as open source tool that can be used to demonstrate the framework and as starting point for any possible extensions and investigations.
The following video provides a comparative Calcite Client session, with tracing and debugging enabled on the right side and the standard user perspective without tracing on the left side.” This is only a short 3 minutes video to give the idea. A slightly longer video could be produced, based on the final Calcite client codebase.