LingoDB is a cutting-edge data processing system that leverages compiler technology to achieve unprecedented flexibility and extensibility without sacrificing performance. It supports a wide range of data-processing workflows beyond relational SQL queries, thanks to declarative sub-operators. Furthermore, LingoDB can perform cross-domain optimization by interleaving optimization passes of different domains, and its flexibility enables sustainable support for heterogeneous hardware.
LingoDB builds heavily on the MLIR compiler framework to compile queries to efficient machine code with low compilation latency.
LingoDB uses multiple layers of intermediate representation. Because individual layers can be exchanged, this approach offers high flexibility.
LingoDB's custom dialects are designed for combination with any other MLIR dialects. Thus, LingoDB can be extended to other data processing domains through corresponding MLIR dialects.
LingoDB implements state-of-the-art query optimizations as compiler passes, which allows for composing custom optimization pipelines, e.g., for cross-domain optimization.
By using Apache Arrow as its in-memory storage format, LingoDB can interact automatically with many different systems.
LingoDB can run complex analytical SQL queries and thus supports all queries of benchmarks like SSB, TPC-H, TPC-DS, and JOB.
Through its flexible design, LingoDB facilitates fundamental research regarding query engine architectures.
By using a layered design with sub-operators and building on MLIR, LingoDB is an ideal research tool for investigating heterogeneous hardware for data processing.
LingoDB's design allows for representing both SQL queries and other domains, which simplifies research on cross-domain execution and optimization.
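The layered design and the composable optimization passes described above can be illustrated with a toy sketch. The Python below is not LingoDB code and does not use MLIR; the `Op` type, the layer names, the operator names, and the helpers `merge_adjacent_filters`, `lower_to_subops`, and `run_pipeline` are all hypothetical, chosen only to show how an optimization pass on a higher layer and a lowering pass to a lower layer can be chained into one pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical toy IR: each op records the layer it belongs to.
@dataclass(frozen=True)
class Op:
    layer: str   # illustrative layer names: "relational" or "subop"
    name: str
    arg: str = ""


Plan = Tuple[Op, ...]
Pass = Callable[[Plan], Plan]


def is_relational_filter(op: Op) -> bool:
    return op.layer == "relational" and op.name == "filter"


def merge_adjacent_filters(plan: Plan) -> Plan:
    """Optimization pass on the relational layer: fuse consecutive filters."""
    out: List[Op] = []
    for op in plan:
        if out and is_relational_filter(op) and is_relational_filter(out[-1]):
            prev = out.pop()
            out.append(Op("relational", "filter", f"({prev.arg}) AND ({op.arg})"))
        else:
            out.append(op)
    return tuple(out)


def lower_to_subops(plan: Plan) -> Plan:
    """Lowering pass: rewrite each relational op into lower-level ops."""
    lowered: List[Op] = []
    for op in plan:
        if op.layer != "relational":
            lowered.append(op)  # ops from other layers pass through untouched
        elif op.name == "scan":
            lowered.append(Op("subop", "scan_refs", op.arg))
            lowered.append(Op("subop", "gather", op.arg))
        elif op.name == "filter":
            lowered.append(Op("subop", "map", op.arg))
            lowered.append(Op("subop", "filter", op.arg))
        else:
            raise ValueError(f"no lowering defined for {op.name!r}")
    return tuple(lowered)


def run_pipeline(plan: Plan, passes: List[Pass]) -> Plan:
    """Apply the passes in order; the list itself is the pipeline definition."""
    for apply_pass in passes:
        plan = apply_pass(plan)
    return plan


query: Plan = (
    Op("relational", "scan", "lineitem"),
    Op("relational", "filter", "l_quantity < 24"),
    Op("relational", "filter", "l_discount > 0.05"),
)
result = run_pipeline(query, [merge_adjacent_filters, lower_to_subops])
for op in result:
    print(op.layer, op.name, op.arg)
```

Because the pipeline is just an ordered list of passes, a pass belonging to another domain could be placed between the two shown here, which is the idea behind interleaving passes for cross-domain optimization in this simplified picture.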
2023
VLDB 2023 | Michael Jungmair and Jana Giceva | August 28, 2023
Abstract
Data processing systems face the challenge of supporting increasingly diverse workloads efficiently. At the same time, they are already bloated with internal complexity, and it is not clear how new hardware can be supported sustainably. In this paper, we aim to resolve these issues by proposing a unified abstraction layer based on declarative sub-operators in addition to relational operators. By exposing this layer to users, they can express their non-relational workloads declaratively with sub-operators. Furthermore, the proposed sub-operators decouple the semantic implementation of operators from the efficient imperative implementation, reducing the implementation complexity for relational operators. Finally, through fine-grained automatic optimizations, the declarative sub-operators allow for automatic morsel-driven parallelism. We demonstrate the benefits not only by providing a specific set of sub-operators but also implementing them in a compiling query engine. With thorough evaluation and analysis, we show that we can support a richer set of workloads while retaining the development complexity low and being competitive in performance even with specialized systems.
2022
VLDB 2022 | Michael Jungmair, André Kohn, and Jana Giceva | September 5, 2022
Abstract
Since its invention, data-centric code generation has been adopted for query compilation by various database systems in academia and industry. These database systems are fast but maximize performance at the expense of developer friendliness, flexibility, and extensibility. Recent advances in the field of compiler construction identified similar issues for domain-specific compilers and introduced a solution with MLIR, a generic infrastructure for domain-specific dialects. We propose a layered query compilation stack based on MLIR with open intermediate representations that can be combined at each layer. We further propose moving query optimization into the query compiler to benefit from the existing optimization infrastructure and make cross-domain optimization viable. With LingoDB, we demonstrate that the used approach significantly decreases the implementation effort and is highly flexible and extensible. At the same time, LingoDB achieves high performance and low compilation latencies.
Student | Topic | Advisor(s) | Type |
---|---|---|---|
Robert Imschweiler | Transforming Data Frame Operations from Python to MLIR | Engelke, Jungmair | B.Sc. Thesis |
Florian Drescher | A template-based code generation backend for MLIR | Engelke | Guided Research |
Raoul Zebisch | Sub-Operator Placement on GPUs for accelerating analytical queries | Jungmair | M.Sc. Thesis |
Pascal Ginter | C-Backend, Index-Nested Loop Joins, Query Plan Visualization | Jungmair | Research Assistant |