




























































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
The Query Intermediate Representation (QIR) used by BOLDR to optimize queries and translate them into database languages. It covers the syntax and semantics of QIR, optimization techniques, and the QIR type system. The QIR type system includes specific type systems for databases and a generic type system for QIR expressions targeting multiple databases.
Typology: Lecture notes
1 / 237
This page cannot be seen from the preview
Don't miss anything!
Thèse de doctorat de l'Université Paris-Saclay préparée à Université Paris-Sud
Ecole doctorale n◦580 Sciences et Technologies de l'Information et de la Communication (STIC) Spécialité de doctorat: Informatique
Thèse présentée et soutenue à Orsay, le 13/09/2019, par
Composition du Jury :
James Cheney Reader, Université d'Édimbourg Rapporteur Emmanuel Chailloux Professeur, Sorbonne Université (UMR 7606) Rapporteur Alan Schmitt Directeur de recherche, INRIA Rennes (UMR 6074) Examinateur Jérôme Siméon Chief Scientist, Clause Inc. Examinateur Sarah Cohen Boulakia Professeur, Université Paris-Sud (UMR 8623) Présidente du jury Véronique Benzaken Professeur, Université Paris-Sud (UMR 8623) Directeur de thèse Kim Nguyen Maitre de Conférences, Université Paris-Sud (UMR
Co-encadrant de thèse Giuseppe Castagna Directeur de Recherche, CNRS et Université de Paris (UMR 8243) Invité
ii
j'ai renforcé mon OCaml en enseignant notamment comment réaliser un jeu vidéo avec des match, et qui m'a appris le model checking. Merci à Guillaume Melquiond de m'avoir appris à faire du C++ correctement (sans *), et à Thibaut Balabonski grâce à qui j'ai dépoussiéré mes connaissances en compilation, tout cela me sera utile très prochainement. Merci à Jean-Christophe Filliâtre pour m'avoir partagé ses connaissances sur OCaml, et pour m'avoir prêté son exemplaire de Sixty Million Frenchmen Can't Be Wrong. Merci à Frédéric Voisin pour avoir trouvé le code d'accès à la salle de reprographie pour un thésard en panique et pour les blagues du vendredi (du lundi au vendredi) au coin café. Merci à Sylvie Boldo pour m'avoir fait conance pour la review d'un article au JFLA, pour son aide précieuse avec Véronique sur le recrutement académique, et sur le fonctionnement de la machine magique à faire des posters. Merci à Hai et Stefania pour m'avoir montré à l'avance dans quel état je serai en dernière année, à Robin pour avoir écouté mes tirades inintéressantes sur le speedrun de Mario 64, à Albin pour le vin à l'orange, à Bruno pour la promenade en vélo, à Alexandrina pour m'avoir fait rire quand j'en avais bien besoin et pour son amitié. Ah, et merci à Mattias pour les pauses, m'avoir presque donné envie de jouer à un MOBA, et pour m'avoir appris la quantité de crème pour des pâtes à la carbonara. Merci à mes vieux amis, Pierre-Alain, Pierre-Alain, Frédéric, Romain qui m'ont soutenu malgré la distance, et Bertrand, Solenne, Thibaut et Pierre-Jean pour nos beuveries et parties de Borderlands. Un grand merci à ma famille, tout particulièrement mes parents, ma grand- mère, mon frérot, et ma petite s÷ur. Sans votre amour et votre soutien, je ne serais jamais allé aussi loin. Enn, merci à toi Jenny. Merci pour ton soutien, tes corrections sur mon anglais beaucoup trop français, et pour ces merveilleuses années que nous avons passées ensemble.
iv
v
Abstract
Several classes of solutions allow programming languages to express queries: spe- cic APIs such as JDBC, Object-Relational Mappings (ORMs) such as Hibernate, and language-integrated query frameworks such as Microsoft's LINQ. However, most of these solutions do not allow for ecient cross-databases queries, and none allow the use of complex application logic from the programming language in queries.
This thesis studies the design of a new language-integrated query framework called BOLDR that allows the evaluation in databases of queries written in general-purpose programming languages containing application logic, and tar- geting several databases following dierent data models. In this framework, ap- plication queries are translated to an intermediate representation. Then, they are typed with a type system extensible by databases in order to detect which database language each subexpression should be translated to. This type system also allows us to detect a class of errors before execution. Next, they are rewritten in order to avoid query avalanches and make the most out of database optimiza- tions. Finally, queries are sent for evaluation to the corresponding databases and the results are converted back to the application. Our experiments show that the techniques we implemented are applicable to real-world database applica- tions, successfully handling a variety of language-integrated queries with good performances.
Résumé
Plusieurs classes de solutions permettent d'exprimer des requêtes dans des langages de programmation: les interfaces spéciques telles que JDBC, les map- pings objet-relationnel ou object-relational mapping en anglais (ORMs) comme Hibernate, et les frameworks de requêtes intégrées au langage comme le frame- work LINQ de Microsoft. Cependant, la plupart de ces solutions ne permettent
vii
pas d'écrire des requêtes visant plusieurs bases de données en même temps, et au- cune ne permet l'utilisation de logique d'application complexe dans des requêtes aux bases de données. Cette thèse présente un nouveau framework de requêtes intégrées au langage nommé BOLDR qui permet d'écrire des requêtes dans des langages de program- mation généralistes et qui contiennent de la logique d'application, et de les évaluer dans des bases de données hétérogènes. Dans ce framework, les requêtes d'une application sont traduites vers une représentation intermédiaire de requêtes. Puis, elles sont typées en utilisant un système de type extensible par les bases de don- nées pour détecter dans quel langage de données chaque sous-expression doit être traduite. Cette phase de typage permet également de détecter certaines erreurs avant l'exécution. Ensuite, les requêtes sont réécrites pour éviter le phénomène "d'avalanche de requêtes" et pour proter au maximum des capacités d'optimisa- tion des bases de données. Enn, les requêtes sont envoyées aux bases de données ciblées pour évaluation et les résultats obtenus sont convertis dans le langage de programmation de l'application. Nos expériences montrent que les techniques implémentées dans ce framework sont applicables pour de véritables applications centrées données, et permettent de gérer ecacement un vaste champ de requêtes intégrées à des langages de programmation généralistes.
Note : An d'en assurer une plus large diusion, et en accord avec l'école docto- rale STIC Paris-Saclay, cette thèse est rédigée en anglais. Pour un résumé étendu des travaux rédigé en français, voir l'Annexe A page 163.
viii
List of symbols
≡ Syntactical equivalence.................... 23 = Equality................................... 23 ∅ Empty set................................. 23 ⊆ Subset inclusion........................... 23 ∪, ∩, \ Union, intersection and dierence of sets... 23 dom(f ) Domain of a function f.................... 23 img(f ) Image of a function f...................... 23 i..j Set of integers {i, i + 1,... , j}.............. 23 _ Placeholder................................ 23 Γ Typing environment....................... 24 γ Evaluation environment.................... 24 σ Type substitution.......................... 24 σ 1 ◦ σ 2 Composition of type substitutions.......... 24 (EH, IH, VH, (^) →H) Host language............................. 24 (ED, VD, (^) →D) Database language......................... 25 (A, c) Inference rule.............................. 26 q QIR expression............................ 28 C(q) Children of a QIR expression.............. 32 T(q) Databases targeted by a QIR expression... 32 → Basic QIR semantics....................... 32 (H
VAL) Host language driver....................... 36 (
VAL) Database driver............................ 36 VQIR QIR values................................ 32 Extended QIR semantics................... 37 MEM MEM database language................... 40 ↪→ Normalization relation..................... 45 M (^) D(q) Measure of a QIR expression by a database 47 M (q) Generic measure of a QIR expression....... 49 Reds (q) Possible reductions of a QIR expression.... 51 B Basic QIR type............................ 56 T QIR type.................................. 56 dom(R) Domain of a QIR record type.............. 56
xiii
e-commerce application
Key-value database Document database RDBMS
User session Shopping cart
User accounts Logs
Inventory Financial data
Figure 1.1 Example of an application using dierent types of databases
of the application. Figure 1.1 shows an example of such application.
This thesis is a study aiming to create a solution that allows application developers to write safe and ecient queries targeting databases without forcing them to become experts in the data models and query languages of their target databases.
In this introduction, we rst give an overview of database query languages, programming languages used in the development of applications, and the existing solutions to interface these two worlds and the problems encountered in the pro- cess. Then we describe a new solution in the form of a new language-integrated query framework called BOLDR.
SQL (Structured Query Language) is the most popular query language. It is a domain-specic language based on relational algebra.
id name salary teamid 1 Lily Pond 5200 2 2 Daniel Rogers 4700 1 3 Olivia Sinclair 6000 1 (a) Table Employee
teamid teamname bonus 1 R&D 500 2 Sales 600 (b) Table Team
Figure 1.2 An example of data organized as tables
Relational algebra, rst designed by Edgar F. Codd [Cod70], denes operations on data represented as a set of n-tuples where every element of the tuple corresponds to an attribute denoted by a name. Relational databases call these constructions tables, composed of lines and columns. Figure 1.2 gives an example of tables.
Relational algebra is the basis of most database query languages [AHV95]. The most common operations of relational algebra are the projection, which re- stricts the tuples to a set of attributes; the selection (or lter), which keeps only the tuples that satisfy a condition; and the join, which returns the set of all com- binations of tuples from two tables that are equal on their common attributes. Figure 1.3 shows examples of applications of those operations on tables. Fig- ure 1.3a shows the projection of table Employee on attributes name and salary, Figure 1.3b shows the selection in the table Employee of the tuples for which the value of the attribute salary is greater than 5000, and Figure 1.3c shows the result of the join between Employee and Team.
name salary Lily Pond 5200 Daniel Rogers 4700 Olivia Sinclair 6000 (a) Projection on Employee
id name salary teamid 1 Lily Pond 5200 2 3 Olivia Sinclair 6000 1 (b) Selection on Employee
id name salary teamname bonus 1 Lily Pond 5200 Sales 600 2 Daniel Rogers 4700 Sales 600 3 Olivia Sinclair 6000 R&D 500 (c) Join between Employee and T eam
Figure 1.3 Example of applications of relational algebra
which is then bound to the name t using an alias, and this name is then used in the WHERE clause to refer to the table. The simple syntax of SQL is one of the reasons why it is so popular, and the most commonly used query language. Most databases support SQL, even those that do not have a data model directly suited for relational algebra. Therefore, SQL is an unavoidable database language to study for solutions aiming to allow programmers to send queries to databases.
The majority of data-driven applications are written in imperative programming languages. Python is used in particular for Web applications and for machine learning. It is a very popular programming language because of its simple syn- tax and numerous eld-specic libraries, such as for machine learning, general algorithms, and statistics. JavaScript is widely used for Web applications. R is a language natively designed for statistical applications and data analysis. Java is a widely used general-purpose programming language with numerous libraries for Web development, machine learning, text processing, and more. Contrary to declarative programming in languages like SQL, imperative pro- gramming involves describing step-by-step the control ow of a program, which requires programmers in these languages to describe how to get to the desired result. For instance, ltering a table in an imperative language would typically be written as such in Python:
filteredTable = [] for employee in employees: if (employee['salary'] > 5000): filteredTable.append(employee)
However, modern programming languages have made an eort to support some aspects of functional programming, making data-oriented applications less tech- nically detailed. For instance, we can write the example above in Python using list comprehensions [Kuh11]:
[employee for employee in employees if employee['salary'] > 5000] Application programs can use imperative and functional features, and usu- ally contain a mix of both. Even so, most application languages are originally imperative, and in particular are more ecient at evaluating imperative code. Additionally, most application languages (Python, R, Ruby, JavaScript,... ) are dynamically typed, meaning that the type-safety of a program is checked during its execution. For instance, a program such as
(function (x) { return x; })(2, 3)
which applies the identity function to two arguments would be recognized as an error during its execution (or should, but JavaScript just ignores the second argument in this case... ).
As stated earlier, most applications are written in general-purpose programming languages. These languages do not have native ways to query databases, and the vast majority of them are imperative languages, so their syntax is very dierent from those of query languages. Various solutions have been designed to enable programmers to send queries to databases from their programming languages. In this section, we take a look at some existing solutions, and discuss their pros and cons.
Java Database Connectivity (JDBC) [Cor16] is an application programming in- terface (API) which provides data access from the Java programming language to data sources, including databases. Example 1.1 is an example of use of JDBC in a Java program to retrieve data from a database.
Example 1.1. final Connection conn = ... final Statement stmt = conn.createStatement(); final String query = "SELECT id, name, salary FROM employee WHERE salary > 2500"; ResultSet rs = stmt.executeQuery(query); while (rs.next()) { System.out.println(rs.getInt("ID") + " "
In this example, the program queries the identier, name, and salary of em- ployees which salary is greater than 2500 from a table employee stored in a database. In JDBC, the user must rst create a Connection object using the correct credentials to access the targeted database, then create a Statement object from the connection object to send a query. The query itself is a string in the query language of the database (SQL in the example). The results are represented