GraphCoQL-TeX/introduction.tex at master · TDiazT/GraphCoQL-TeX · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
%!TEX root = ./main.tex
\section{Introduction}

\gql is an increasingly popular language to define interfaces and queries to data services. Originally developed internally by Facebook as an alternative to RESTful Web Services~\cite{restful}, \gql was made public in 2015, along with a reference implementation\footnote{https://github.com/graphql/graphql-js} and a specification---both of which have naturally evolved~\cite{gqlspec}. %\et{cite the version you used}.
Since early 2019, as a result of its successful adoption by major players in industry,
\gql is driven by an independent foundation\footnote{https://foundation.graphql.org/}. The key novelty compared to traditional REST-based services is that tailored queries can be formulated directly by clients, allowing a very precise selection of which data ought to be sent back as response. This supports a much more flexible and efficient interaction model for clients of services, who do not need to gather results of multiple queries on their own, possibly wasting bandwidth with unnecessary data.
% many REST requests can be replaced by a single \gql query; additionally, and
% that

% follow a ``what you ask is what you get'' spirit. This means that, in contrast with REST-based services, one can be very precise with the data requested and the response will look very similar to the query.

The official \gql specification, called \spec hereafter,
covers the definition of interfaces, the query language, and the validation processes, among other aspects. The specification undergoes regular revisions by an open working group, which discusses extensions and improvements, as well as addressing specification ambiguities. Indeed, as often happens, the \spec is an informal specification written in natural language.
% and does not include a rigorous formalization of \et{vague} its inner mechanics and limitations.
% related issues and improvements.
% These include extending the language to support new features or fix possible ambiguities present in the document. This is because the document is written in natural language, i.e. plain English,
% There is also a project to define CATs\footnote{Compatibility Acceptance Tests} for the different languages and frameworks that implement \gql\td{Not sure if this should go or where. It may serve as a link to "why of \gcoql"}.
Considering the actual vibrancy of the \gql community, sustained by several implementations in a variety of programming languages and underlying technologies, having a formal specification ought to bring some welcome clarity for all actors.

Recently, Hartig and Pérez~\cite{gqlph} proposed the first (and so far only) formalization of \gql, called \HP hereafter.
\HP is a formalization ``on paper'' that was used to prove complexity boundaries for \gql queries. Having a mechanized formalization would present many additional benefits, such as potentially providing a faithful reference implementation, and serving as a solid basis to prove formal results about the \gql semantics.

For instance, the complexity results of Hartig and Pérez rely on two techniques: {\em a)} transforming queries to equivalent queries in some  normal form and, {\em b)} interpreting queries in a simplified but equivalent definition of the semantics. However, Hartig and Pérez do not provide an algorithmic definition of query normalization, let alone proving it correct and semantics preserving;
% prove that the query transformation (called ) indeed produces queries in such a normal form, and that their semantics is preserved;
nor do they prove that the simplified semantics is equivalent to the original one when applied to normalized queries.

% The complexity results are based on two major premises.
% The first one is that ``\textit{for every query $\varphi$ that conforms to a schema $\mathcal{S}$, there exists a {\normalfont non-redundant} query $\varphi$' in {\normalfont ground-typed normal form} such that $\varphi \equiv \varphi$'}''. The second one is that for queries that are \textit{non-redundant} and in \textit{ground-typed normal form}, it is possible to define a simplified version of the semantics which is equivalent to the original.

% For the former, they propose a set of equivalence rules to transform queries but they do not actually prove that their application yield a query in this particular form or that they preserve the query semantics. The latter is also exploited, without providing any correctness proof. Since both are fundamental for their complexity results, we believe they must be rigorously addressed.

% \paragraph{\gcoql.}
This work presents {\bf the first mechanized formalization of \gql}, carried out in the \coq proof assistant~\cite{Coq}, called \gcoql (|græf$\cdot$co$\cdot$k{\pmschwa}l|).
\gcoql currently includes the definition of the Schema DSL, query definition, schema and query validation, and the semantics of queries, defined over a graph data model as in \HP (\S\ref{sec:form}).

As well as precisely capturing the semantics of \gql, \gcoql makes it possible to specify and prove the correctness of query transformations, as well as other extensions and optimizations made to the language and its algorithms. We illustrate this by studying \HP's notion of query normalization (\S\ref{sec:norm}). Specifically, we define an algorithmic presentation of normalization, which we then prove correct and semantics preserving. We also formalize and prove the equivalence between the original query evaluation semantics and the simplified semantics used by \HP for normalized queries. In the process, we address some imprecisions and minor issues that \coq led us to uncover.

%
We discuss the limitations and validation of \gcoql, including its trustworthiness, in \S\ref{sec:discussion}.
%
We hope that \gcoql can serve as a starting point for a formal specification of \gql from which reference implementations can be extracted. Although we have not yet experimented with extraction, \gcoql facilitates this vision by relying on boolean reflection as much as possible.

% ; this should eventually make it possible to extract reference implementations of the different components developed in \gcoql, such as the query evaluator or the normalization function.
 % and extracting it to be its official reference implementation.

%We will refer to it as \HP throughout the paper. They define the semantics of \gql by using a graph as the underlying data model over which queries are evaluated. \td{rewrite} define the normalization transformation, which results in \textit{non-redundant} queries in \textit{ground-typed normal form}.
% This normalization process is essential for proving complexity boundaries for \gql queries, because it allows simplifying the semantics and.

% On another note, we believe that \gql is still a very young and active technology which could greatly benefit by having its specification mechanically verified from its early stages. It has a very active and growing community, with many different implementations in different programming languages and technologies, and more importantly, with many open questions and issues. It currently has a reference implementation, written in Javascript, that could be improved by introducing a formally and mechanically verified one. %We refer to the reference implementation as \textit{\gqlJs} throughout the document.

 % Given the previous factors, we develop a Coq formalization of \gql, called
 % \gcoql (pronounced ``græf$\cdot$co$\cdot$k{\pmschwa}l'').
 % We believe that \gcoql can serve as a starting point towards fully formalizing \gql and extracting it to be its official reference implementation.


  % current status of \gcoql is less firmly
  % following the examples of JSCert~\cite{jscert} and CoqR~\cite{coqr}.
 % This provides a component of trustworthiness given by an ,
 % We also test our implementation with examples from the \spec but a more thorough comparison should be made against \textit{\gqlJs} and a bigger test suite.


% \et{there is some contradiction between the eyeball correspondence and the "mixed approach"}
% Like \HP, \gcoql  \et{what is the model in \spec?}
% With respect to the semantics of \gql, we follow a mixed approach between the \spec and \HP. The semantics are defined in a graph setting, as is in \HP, .
% One of the biggest difference between both approaches (besides the graph model) is that the
% \et{what is this processing about? and is the mixed approach you took here?}
% \spec performs a processing of queries during the evaluation, while \HP performs a post-processing of the responses generated. We took the mixed approach, which brings out some benefits as well as some limitations, which we discuss further in a following section.


%\td{Rewrite} When it comes to the underlying data model, we follow \HP and define our semantics in a graph setting. We are also interested in defining the properties and transformation rules defined by \HP. These definitions and their proofs of correctness are fundamental in the posterior results they obtain. This served as a particularly interesting first case study for our system, to establish that we can actually reason about \gql and that theirs results were based off correct assumptions. This allows us to, hopefully, anticipate that other transformations may also be defined and proven correct in \gcoql.


%We were first motivated to use it to define the data model and try to narrow our scope to finite types, as was used by (Veronique, Ev, Emilio, Dumbrava). In the end, we did not use any of it but the computational aspect of SSReflect was kept, as it facilitated developing the proofs. This same element is what makes us believe that extraction should not be hard.


We briefly introduce \gql in \S\ref{sec:bg}.
We discuss related work in \S\ref{sec:related} and conclude in \S\ref{sec:conclusion}. The fundamental challenge of the \coq formalization of \gql in comparison to similar formalization efforts (\eg~\cite{qcert, vesqlsemantics, certifdatalog}) resides in the non-structural nature of most definitions related to \gql queries. In particular, as we will see, both query semantics and query normalization are defined by well-founded induction on a notion of query size (\S~\ref{subsec:query}).


\gcoql and the results presented in this paper have been developed in \coq v.8.9.1 and are available online on GitHub\footnote{https://github.com/imfd/GraphCoQL}.
% We often refer directly to \coq files in the paper \et{for space, we might in fact *not* do this and just defer to a nice README file with the artifact}.
\gcoql extensively uses the \plstyle{Mathematical} \plstyle{Components}~\cite{mathcomp}, \ssreflect~\cite{ssreflect} and \equations~\cite{equations} libraries.
This work is based on the latest \gql specifications, dated June 2018~\cite{gqlspec}.
% \et{I think we can skip this paragraph here, not important enough for being in the intro} Finally, regarding the development itself, we use SSReflect \et{cite} intensively, relying on boolean reflection as much as possible. Also, the use of the  to define non-structural recursive functions is essential for our definitions. Other libraries, such as \textit{Function} and \textit{Program} did not provide sufficient tools to handle rewriting and inductive reasoning about our definitions, which \textit{Equations} incredibly facilitates. \coql{} is not currently extracted to other languages but we believe that it should not be a difficult task, given the design decisions considered.

% \et{might skip for space (is it really interesting anyway?}
% \td{Nah, I personally don't care at all how big/small the project is}
% \et{Put brief data about the code here}
% \td{This includes documentation, imports, etc.}
% The base definitions and proofs consist of approximately $3600$ and $1900$ lines of code, respectively.
% Files containing validation of \gcoql amount to approximately $3000$ lines of code.
% Finally, the normalization case study consists of approximately $350$ lines of definitions and $700$ lines of proofs.

% \td{include note on code as anonymous supplementary material}


% \subsubsection*{Structure of this paper}

% We first begin by gently and briefly introducing \gql in Section \ref{sec:bg}, which we do by means of an example. Then, in Section \ref{sec:form}, we describe the basic building blocks of our Coq formalization. This includes the definition of a \gql schema, the graph data model, queries and their semantics. Section \ref{sec:norm} describes the normalization process and proofs of its correctness and preservation of semantics. We finalize that section with the definition of the simplified semantics, as described in \HP, and a proof of equivalence between the semantics defined in Section \ref{sec:form} and the simplified one. In Section \ref{sec:valid}, we describe some of the work we did to validate our implementation and finally Section \ref{sec:related} and \ref{sec:future} we discuss related and future work.