For meetings with colleagues, chatting, and discussions, SIGMOD provides a virtual conference site in Gather.
Monday 8:00 AM – 10:00 AM  Introduction  
PODS Opening + Keynote  Session Chairs: Dan Suciu, Yufei Tao  
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data  
Martin Grohe  Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view. Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.  https://dl.acm.org/doi/abs/10.1145/3375395.3387641 
Monday 10:30 AM – 11:15 AM  
PODS Test of Time + Gems  Session Chair: Yufei Tao  
Optimizing Linear Counting Queries under Differential Privacy  
Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor  
Coping with Incomplete Data: Recent Advances  
Leonid Libkin  Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently?
The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers. 
https://dl.acm.org/doi/abs/10.1145/3375395.3387970 
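As background for the three-valued logic discussed above, here is a minimal sketch of SQL-style three-valued connectives, with UNKNOWN arising from comparisons against NULL. The encoding (1/0/None) and the function names are illustrative choices, not drawn from the paper.

```python
# A minimal sketch of SQL's three-valued logic (TRUE/FALSE/UNKNOWN).
TRUE, FALSE, UNKNOWN = 1, 0, None

def tv_and(a, b):
    # FALSE dominates; otherwise UNKNOWN propagates.
    if a == FALSE or b == FALSE:
        return FALSE
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return TRUE

def tv_or(a, b):
    # TRUE dominates; otherwise UNKNOWN propagates.
    if a == TRUE or b == TRUE:
        return TRUE
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return FALSE

def tv_not(a):
    if a is UNKNOWN:
        return UNKNOWN
    return TRUE if a == FALSE else FALSE

# Any comparison involving NULL evaluates to UNKNOWN; a WHERE clause
# keeps a row only if its condition evaluates to TRUE (not UNKNOWN).
def compare_eq(x, y):
    if x is None or y is None:
        return UNKNOWN
    return TRUE if x == y else FALSE
```

The asymmetry the survey probes is visible here: a row with a NULL is neither kept by a condition nor by its negation, which is exactly why SQL evaluation and certain answers can disagree.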
Probabilistic Databases for All  
Dan Suciu  In probabilistic databases the data is uncertain and is modeled by a probability distribution. The central problem in probabilistic databases is query evaluation, which requires performing not only traditional data processing such as joins, projections, and unions, but also probabilistic inference in order to compute the probability of each item in the answer. At their core, probabilistic databases are a proposal to integrate logic with probability theory. This paper accompanies a talk given as part of the Gems of PODS series, and describes several results in probabilistic databases, explaining their significance in the broader context of model counting, probabilistic inference, and Statistical Relational Models.  https://dl.acm.org/doi/abs/10.1145/3375395.3389129 
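The possible-worlds semantics underlying the talk can be made concrete with a brute-force sketch over a tuple-independent database: each tuple is present independently with its own probability, and the query probability sums the weights of the worlds where the Boolean query holds. This is exponential in the number of tuples and purely illustrative; the field is precisely about when this blow-up can be avoided. Relation names and the query are toy choices.

```python
from itertools import chain, combinations

# Brute-force query probability over a tuple-independent database.
def query_probability(tuples_with_probs, boolean_query):
    items = list(tuples_with_probs.items())
    n = len(items)
    total = 0.0
    # enumerate all 2^n possible worlds
    for world in chain.from_iterable(combinations(range(n), r) for r in range(n + 1)):
        present = {items[i][0] for i in world}
        weight = 1.0
        for i, (t, p) in enumerate(items):
            weight *= p if i in world else (1 - p)
        if boolean_query(present):
            total += weight
    return total

# Example: R = {a, b}, each present with probability 0.5.
db = {("R", "a"): 0.5, ("R", "b"): 0.5}
# Query: "is R non-empty?"  P = 1 - 0.5 * 0.5 = 0.75
prob = query_probability(db, lambda world: len(world) > 0)
```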
Monday 11:15 AM – 12:00 PM  Streams – Robustness and Determinism  
PODS Session 1  Session Chair: Andrew McGregor  
The Adversarial Robustness of Sampling  
Omri Ben-Eliezer and Eylon Yogev  Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe U to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample “very unrepresentative” of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size Ω(d/ε²) is an ε-approximation of the full data with good probability, where d is the VC-dimension of the underlying set system (U, R). Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of 1) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is “theoretical only”, requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term d in the sample size with the cardinality term log |R|. 
That is, the Bernoulli and reservoir sampling algorithms with sample size Ω(log |R|/ε²) output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.  https://dl.acm.org/doi/abs/10.1145/3375395.3387643 
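For reference, the classical reservoir sampling primitive the paper analyzes (Vitter's Algorithm R) looks as follows in its static, non-adversarial form; per the abstract, the robust variants keep the algorithm and only enlarge the sample size. This sketch is the textbook algorithm, not code from the paper.

```python
import random

# Classical reservoir sampling (Algorithm R): one pass, O(k) space,
# and each stream element ends up in the sample with probability k/n.
def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over positions 0..i
            if j < k:
                reservoir[j] = x         # replace a slot with prob. k/(i+1)
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

The adversarial setting above corresponds to the stream elements being chosen one at a time by an opponent who can inspect `reservoir` before choosing the next element.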
A Framework for Adversarially Robust Streaming Algorithms  
Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff and Eylon Yogev  We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. In this work, we show that the answer is positive for various important streaming problems in the insertion-only model, including distinct elements and, more generally, F_p-estimation, F_p-heavy hitters, entropy estimation, and others. For all of these problems, we develop adversarially robust (1+ε)-approximation algorithms whose required space matches that of the best known non-robust algorithms up to a poly(log n, 1/ε) multiplicative factor (and in some cases even up to a constant factor). Towards this end, we develop several generic tools allowing one to efficiently transform a non-robust streaming algorithm into a robust one in various scenarios.  https://dl.acm.org/doi/abs/10.1145/3375395.3387658 
Tight Lower Bound for ComparisonBased Quantile Summaries  
Graham Cormode and Pavel Veselý  Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles of a stream of items, up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream and then, given any quantile query 0 ≤ φ ≤ 1, returns an item from the stream, which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna, stores at most O(1/ε ⋅ log(εN)) items, where N is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε) ⋅ o(log N), for any function f that does not depend on N. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1±ε) ⋅ φ, and for other related computational tasks.  https://dl.acm.org/doi/abs/10.1145/3375395.3387650 
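To make the ε-approximate guarantee concrete: a naive, non-streaming summary that keeps roughly every εN-th element of the sorted input already achieves rank error at most εN. The Greenwald-Khanna structure discussed above obtains a comparable guarantee in a single pass over a stream, with bookkeeping omitted here. All names below are illustrative.

```python
# Naive ε-approximate quantile summary: keep every ~εN-th element of the
# *sorted* data together with its rank, then answer a φ-quantile query
# from the kept items. Rank error is at most εN.
def build_summary(items, eps):
    data = sorted(items)
    n = len(data)
    step = max(1, int(eps * n))
    kept = [(data[i], i) for i in range(0, n, step)] + [(data[-1], n - 1)]
    return kept, n

def quantile_query(summary, n, phi):
    target = phi * (n - 1)
    # return the stored item whose rank is closest to the target rank
    return min(summary, key=lambda pair: abs(pair[1] - target))[0]

summary, n = build_summary(range(100), 0.1)
median = quantile_query(summary, n, 0.5)   # rank error at most 0.1 * 100
```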
Monday 1:30 PM – 2:15 PM  Containment and Rewritability  
PODS Session 2  Session Chair: Pablo Barceló  
Bag Query Containment and Information Theory  
Mahmoud Abo Khamis, Phokion Kolaitis, Hung Ngo and Dan Suciu  The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and conjunctive query containment under bag semantics. These connections are established using information inequalities, which are considered to be the laws of information theory. Our first main result asserts that deciding the validity of a generalization of information inequalities is many-one equivalent to the restricted case of conjunctive query containment in which the containing query is acyclic; thus, either both these problems are decidable or both are undecidable. Our second main result identifies a new decidable case of the conjunctive query containment problem under bag semantics. Specifically, we give an exponential-time algorithm for conjunctive query containment under bag semantics, provided the containing query is chordal and admits a simple junction tree.  https://dl.acm.org/doi/abs/10.1145/3375395.3387645 
First-Order Rewritability in Consistent Query Answering with Respect to Multiple Keys  
Paraschos Koutris and Jef Wijsen  We study consistent query answering with respect to key dependencies. Given a (possibly inconsistent) database instance and a set of key dependencies, a repair is an inclusion-maximal subinstance that satisfies all key dependencies. Consistent query answering for a Boolean query is the following problem: given a database instance as input, is the query true in every repair? In [Koutris and Wijsen, ICDT 2019], it was shown that for every self-join-free Boolean conjunctive query and set of key dependencies containing exactly one key dependency per relation name (also called the primary key), this problem is in FO, L-complete, or coNP-complete, and it is decidable which of the three cases applies. In this paper, we consider the more general case where a relation name can be associated with more than one key dependency. It is shown that in this more general setting, it remains decidable whether or not the above problem is in FO, for self-join-free Boolean conjunctive queries. Moreover, it is possible to effectively construct a first-order query that solves the problem whenever such a query exists.  https://dl.acm.org/doi/abs/10.1145/3375395.3387654 
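The repair semantics described above can be illustrated by brute force for the single-primary-key case: a repair keeps exactly one tuple per key group, and a Boolean query is certain iff it holds in every repair. The relation and queries below are toy choices, not from the paper.

```python
from itertools import product
from collections import defaultdict

# Enumerate all repairs of a table under one primary key: pick exactly
# one tuple from each group of key-equal tuples.
def repairs(table, key_of):
    groups = defaultdict(list)
    for t in table:
        groups[key_of(t)].append(t)
    for choice in product(*groups.values()):
        yield set(choice)

# A Boolean query is a certain answer iff it is true in every repair.
def certain(table, key_of, boolean_query):
    return all(boolean_query(r) for r in repairs(table, key_of))

# R(key, value) with a key violation on key 1:
R = {(1, "a"), (1, "b"), (2, "a")}
key = lambda t: t[0]
q_a = lambda r: any(t[1] == "a" for t in r)   # certain: (2,"a") survives every repair
q_b = lambda r: any(t[1] == "b" for t in r)   # not certain: some repair drops (1,"b")
```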
On Monotonic Determinacy and Rewritability for Recursive Queries and Views  
Michael Benedikt, Stanislav Kikot, Piotr Ostropolski-Nalewaja and Miguel Romero  A query Q is monotonically determined over a set of views if Q can be expressed as a monotonic function of the view image. In the case of relational algebra views and queries, monotonic determinacy coincides with rewritability as a union of conjunctive queries, and it is decidable in important special cases, such as for CQ views and queries. We investigate the situation for views and queries in the recursive query language Datalog. We give both positive and negative results about the ability to decide monotonic determinacy, and also about the coincidence of monotonic determinacy with Datalog rewritability.  https://dl.acm.org/doi/abs/10.1145/3375395.3387661 
Monday 2:15 PM – 3:00 PM  Probabilistic and Incomplete Databases  
PODS Session 3  Session Chair: Dan Suciu  
Solving a Special Case of the Intensional vs. Extensional Conjecture in Probabilistic Databases  
Mikael Monet  We consider the problem of exact probabilistic inference for Unions of Conjunctive Queries (UCQs) on tuple-independent databases. For this problem, two approaches currently coexist. In the extensional method, query evaluation is performed by exploiting the structure of the query, and relies heavily on the use of the inclusion–exclusion principle. In the intensional method, one first builds a representation of the lineage of the query in a tractable formalism of knowledge compilation. The chosen formalism should then ensure that the probability can be efficiently computed using simple disjointness and independence assumptions, without the need to perform inclusion–exclusion. The extensional approach has long been thought to be strictly more powerful than the intensional approach, the reason being that for some queries, the use of inclusion–exclusion seemed unavoidable. In this paper we introduce a new technique to construct lineage representations as deterministic decomposable circuits in polynomial time. We prove that this technique applies to a class of UCQs that had been conjectured to separate the complexity of the two approaches. In essence, we show that relying on the inclusion–exclusion formula can be avoided by using negation. This result brings back hope to prove that the intensional approach can handle all tractable UCQs.  https://dl.acm.org/doi/abs/10.1145/3375395.3387642 
Counting Problems over Incomplete Databases  
Marcelo Arenas, Pablo Barceló, and Mikael Monet  We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query q, we consider the following two problems: given as input an incomplete database D, (a) return the number of completions of D that satisfy q; or (b) return the number of valuations of the nulls of D yielding a completion that satisfies q. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when q is a self-join-free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in D (so-called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations (for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption). Moreover, we find that both (1) and (2) reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial randomized approximation scheme, in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes.  https://dl.acm.org/doi/abs/10.1145/3375395.3387656 
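Problem (b) above, counting valuations, can be stated as a brute-force enumeration: assign each null a value from its finite domain and count the assignments whose completion satisfies the query. The table, domain, and query below are illustrative toy choices.

```python
from itertools import product

# Brute-force counting of valuations of nulls over a finite domain.
def count_valuations(table, nulls, domain, boolean_query):
    count = 0
    for values in product(domain, repeat=len(nulls)):
        assignment = dict(zip(nulls, values))
        # replace each null marker by its assigned value
        completion = {tuple(assignment.get(v, v) for v in t) for t in table}
        count += boolean_query(completion)
    return count

# R with two distinct nulls "x?" and "y?" over domain {1, 2}:
table = {("a", "x?"), ("b", "y?")}
nulls = ["x?", "y?"]
# Query: some tuple has second attribute 1.
q = lambda r: any(t[1] == 1 for t in r)
n = count_valuations(table, nulls, [1, 2], q)
# 3 of the 4 valuations yield a completion satisfying q
```

Counting completions differs because distinct valuations can produce the same completed database, which is one source of the extra hardness the abstract describes.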
Queries with Arithmetic on Incomplete Databases  
Marco Console, Matthias Hofer and Leonid Libkin  The standard notion of query answering over incomplete databases is that of certain answers, guaranteeing correctness regardless of how incomplete data is interpreted. In the majority of real-life databases, relations have numerical columns and queries use arithmetic and comparisons. Even though the notion of certain answers still applies, we explain that it becomes much more problematic in situations when missing data occurs in numerical columns. We propose a new general framework that allows us to assign a measure of certainty to query answers. We test it in the agnostic scenario where we do not have prior information about values of numerical attributes, similarly to the predominant approach in handling incomplete data which assumes that each null can be interpreted as an arbitrary value of the domain. The key technical challenge is the lack of a uniform distribution over the entire domain of numerical attributes, such as the real numbers. We overcome this by associating the measure of certainty with the asymptotic behavior of volumes of some subsets of Euclidean space. We show that this measure is well-defined, and describe approaches to computing and approximating it. While it can be computationally hard, or result in an irrational number, even for simple constraints, we produce polynomial-time randomized approximation schemes with multiplicative guarantees for conjunctive queries, and with additive guarantees for arbitrary first-order queries. We also describe a set of experimental results to confirm the feasibility of this approach.  https://dl.acm.org/doi/abs/10.1145/3375395.3387666 
Monday 3:30 PM – 5:00 PM  Data Structures  
PODS Session 4  Session Chair: Ke Yi  
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions  
Martin Aumüller, Rasmus Pagh and Francesco Silvestri  Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r > 0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality-sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability of being selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality-sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.  https://dl.acm.org/doi/abs/10.1145/3375395.3387648 
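The fairness target stated above is met trivially by a linear scan: collect every point within distance r of the query and return one uniformly at random. The paper's contribution is achieving the same guarantee with sublinear LSH-based structures; the baseline below only illustrates the guarantee itself, with illustrative names and data.

```python
import random

# Euclidean distance between two points given as tuples.
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Fair r-NN by exhaustive scan: every point within distance r of the
# query has exactly the same probability 1/|neighbors| of being returned.
def fair_r_near_neighbor(points, q, r, rng):
    neighbors = [p for p in points if dist(p, q) <= r]
    return rng.choice(neighbors) if neighbors else None

pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
rng = random.Random(0)
ans = fair_r_near_neighbor(pts, (0.0, 0.0), 2.0, rng)
```

By contrast, a plain LSH query tends to return points that collide with the query in many hash buckets, skewing the distribution toward the closest points, which is exactly the unfairness the paper measures.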
On the I/O Complexity of the k-Nearest Neighbors Problem  
Mayank Goswami, Riko Jacob and Rasmus Pagh  We consider static, external memory indexes for exact and approximate versions of the k-nearest neighbor (k-NN) problem, and show new lower bounds under a standard indivisibility assumption: Polynomial-space indexing schemes for high-dimensional k-NN in Hamming space cannot take advantage of block transfers: Ω̃(k) block reads are needed to answer a query. For the ℓ∞ metric the lower bound holds even if we allow c-approximate nearest neighbors to be returned, for c ∈ (1, 3). The restriction to c < 3 is necessary: For every metric there exists an indexing scheme in the indexability model of Hellerstein et al. using space O(kn), where n is the number of points, that can retrieve k 3-approximate nearest neighbors using the optimal ⌈k/B⌉ I/Os, where B is the block size. For specific metrics, data structures with better approximation factors are possible. For k-NN in Hamming space and every approximation factor c > 1 there exists a polynomial-space data structure that returns k c-approximate nearest neighbors in ⌈k/B⌉ I/Os. To show these lower bounds we develop two new techniques: First, to handle the fact that approximation algorithms have more freedom in deciding which result set to return, we develop a relaxed version of the λ-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in d ≥ n dimensions. To extend the lower bounds down to d = O(k log(n/k)) dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest.  https://dl.acm.org/doi/abs/10.1145/3375395.3387649 
Efficient Indexes for Diverse Top-k Range Queries  
Pankaj Agarwal, Stavros Sintos and Alex Steiger  Let P be a set of n (non-negatively) weighted points in R^d. We consider the problem of computing a subset of (at most) k diverse and high-valued points of P that lie inside a query range, a problem relevant to many areas such as search engines, recommendation systems, and online stores. The diversity and value of a set of points are measured as functions (say, average or minimum) of their pairwise distances and weights, respectively. We study both bicriteria and constrained optimization problems. In the former, we wish to return a set of k points that maximize a weighted sum of their value and diversity measures, and in the latter, we wish to return a set of at most k points that maximize their value and satisfy a diversity constraint. We obtain three main types of results in this paper: (1) near-linear time (0.5-ε)-approximation algorithms for the bicriteria optimization problem in the offline setting; (2) near-linear size indexes for the bicriteria optimization problem that, for a query rectangle, return a (0.5-ε)-approximate solution in time O(k polylog(n)); the indexes can be constructed in O(n polylog(n)) time; (3) near-linear size indexes for answering constrained optimization range queries: for a query rectangle, a 0.5^O(d)-approximate solution can be computed in O(k polylog(n)) time, and if we allow some of the returned points to lie at most ε outside of the query rectangle then a (1-ε)-approximate solution can be computed in O(k polylog(n)) time. The indexes are constructed in O(n polylog(n)) and n^O(1/ε^d) time, respectively.  https://dl.acm.org/doi/abs/10.1145/3375395.3387667 
Tuesday 10:30 AM – 12:00 PM  Three Modern Roles for Logic in AI  
Tutorial 1  Session Chair: Yufei Tao  
Three Modern Roles for Logic in AI  
Adnan Darwiche  We consider three modern roles for logic in artificial intelligence, which are based on the theory of tractable Boolean circuits: (1) logic as a basis for computation, (2) logic for learning from a combination of data and knowledge, and (3) logic for reasoning about the behavior of machine learning systems.  https://dl.acm.org/doi/abs/10.1145/3375395.3389131 
Tuesday 1:30 PM – 2:00 PM  Dependencies  
PODS Session 5  Session Chair: Michael Benedikt  
All-Instances Restricted Chase Termination  
Tomasz Gogacz, Jerzy Marcinkowski and Andreas Pieris  The chase procedure is a fundamental algorithmic tool in database theory with a variety of applications. A key problem concerning the chase procedure is all-instances termination: for a given set of tuple-generating dependencies (TGDs), is it the case that the chase terminates for every input database? In view of the fact that this problem is undecidable, it is natural to ask whether known well-behaved classes of TGDs ensure decidability. We consider here the main paradigms that led to robust TGD-based formalisms, that is, guardedness and stickiness. Although all-instances termination is well-understood for the oblivious chase, the more subtle case of the restricted (a.k.a. the standard) chase is rather unexplored. We show that all-instances restricted chase termination for guarded/sticky single-head TGDs is decidable in elementary time.  https://dl.acm.org/doi/abs/10.1145/3375395.3387644 
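The difference between the oblivious and the restricted chase mentioned above can be illustrated on a single-head TGD R(x, y) → ∃z S(y, z): the oblivious chase fires on every R-tuple, while the restricted chase fires only when the head is not already satisfied. The relations, the trigger order, and the null-naming scheme below are toy choices.

```python
from itertools import count

fresh = count()  # generator of fresh labeled nulls n0, n1, ...

# One round of chase steps for the TGD R(x, y) -> exists z . S(y, z).
def chase(R, S, restricted):
    S = set(S)
    for (x, y) in sorted(R):
        has_witness = any(a == y for (a, b) in S)
        if restricted and has_witness:
            continue  # head already satisfied: the restricted chase skips
        S.add((y, f"n{next(fresh)}"))  # oblivious chase always fires
    return S

R = {(1, 2), (3, 2)}
S_obl = chase(R, {(2, "c")}, restricted=False)  # fires twice, adds two nulls
S_res = chase(R, {(2, "c")}, restricted=True)   # never fires: (2, "c") is a witness
```

On inputs like this, the oblivious chase keeps inventing nulls where the restricted chase stops immediately, which is why their termination behavior, and the decidability questions above, diverge.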
The Limits of Efficiency for Open- and Closed-World Query Evaluation Under Guarded TGDs  
Pablo Barceló, Victor Dalmau, Cristina Feier, Carsten Lutz and Andreas Pieris  Ontology-mediated querying and querying in the presence of constraints are two key database problems where tuple-generating dependencies (TGDs) play a central role. In ontology-mediated querying, TGDs can formalize the ontology and thus derive additional facts from the given data, while in querying in the presence of constraints, they restrict the set of admissible databases. In this work, we study the limits of efficient query evaluation in the context of the above two problems, focusing on guarded and frontier-guarded TGDs and on UCQs as the actual queries. We show that a class of ontology-mediated queries (OMQs) based on guarded TGDs can be evaluated in FPT iff the OMQs in the class are equivalent to OMQs in which the actual query has bounded treewidth, up to some reasonable assumptions. For querying in the presence of constraints, we consider classes of constraint-query specifications (CQSs) that bundle a set of constraints with an actual query. We show a dichotomy result for CQSs based on guarded TGDs that parallels the one for OMQs except that, additionally, FPT coincides with PTime combined complexity. The proof is based on a novel connection between OMQ and CQS evaluation. Using a direct proof, we also show a similar dichotomy result, again up to some reasonable assumptions, for CQSs based on frontier-guarded TGDs with a bounded number of atoms in TGD heads. Our results on CQSs can be viewed as extensions of Grohe’s well-known characterization of the tractable classes of CQs (without constraints). Like Grohe’s characterization, all the above results assume that the arity of relation symbols is bounded by a constant. We also study the associated meta problems, i.e., whether a given OMQ or CQS is equivalent to one in which the actual query has bounded treewidth.  https://dl.acm.org/doi/abs/10.1145/3375395.3387653 
Tuesday 2:00 PM – 2:30 PM  Resilience and Relevance  
PODS Session 6  Session Chair: Andreas Pieris  
New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins  
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman and Alexandra Meliou  The resilience of a Boolean query on a database is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for conjunctive queries with self-joins, and, more specifically, we present a dichotomy result for the class of single-self-join binary queries with exactly two repeated relations occurring in the query. Unlike in the self-join-free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Although restricted, our results provide important insights into the problem of self-joins that we hope can help solve the general case of all conjunctive queries with self-joins in the future.  https://dl.acm.org/doi/abs/10.1145/3375395.3387647 
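The resilience measure defined above can be computed by brute force on small instances: search for the smallest set of tuples whose deletion falsifies the query. The database and the query (a toy join q :- R(x), S(x)) are illustrative; the paper's subject is when this optimization is in P versus NP-hard.

```python
from itertools import combinations

# Resilience by brute force: minimum number of tuples to delete so that
# the Boolean query becomes false.
def resilience(tuples, boolean_query):
    tuples = list(tuples)
    if not boolean_query(set(tuples)):
        return 0
    for k in range(1, len(tuples) + 1):
        for removed in combinations(tuples, k):
            if not boolean_query(set(tuples) - set(removed)):
                return k
    return len(tuples)

# Toy query q :- R(x), S(x) over R = {1, 2}, S = {2, 3}:
db = {("R", 1), ("R", 2), ("S", 2), ("S", 3)}
q = lambda d: any(("R", v) in d and ("S", v) in d for v in {1, 2, 3})
res = resilience(db, q)
# deleting the single tuple R(2) (or S(2)) falsifies the join, so res == 1
```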
The Impact of Negation on the Complexity of the Shapley Value in Conjunctive Queries  
Alon Reshef, Benny Kimelfeld and Ester Livshits  The Shapley value is a conventional and well-studied function for determining the contribution of a player to the coalition in a cooperative game. Among its applications in a plethora of domains, it has recently been proposed to use the Shapley value for quantifying the contribution of a tuple to the result of a database query. In particular, we have a thorough understanding of the tractability frontier for the class of Conjunctive Queries (CQs) and aggregate functions over CQs. It has also been established that a tractable (randomized) multiplicative approximation exists for every union of CQs. Nevertheless, all of these results are based on the monotonicity of CQs. In this work, we investigate the implication of negation on the complexity of Shapley computation, in both the exact and approximate senses. We generalize a known dichotomy to account for negated atoms. We also show that negation fundamentally changes the complexity of approximation. We do so by drawing a connection to the problem of deciding whether a tuple is “relevant” to a query, and by analyzing its complexity.  https://dl.acm.org/doi/abs/10.1145/3375395.3387664 
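The Shapley value of a tuple, as used above, is its average marginal contribution to making the Boolean query true, averaged over all orders in which tuples are added. A brute-force sketch over a toy join (names and data are illustrative, not from the paper):

```python
from itertools import permutations

# Shapley value of each tuple for a Boolean query, by brute force over
# all n! insertion orders (the query plays the role of the game's
# characteristic function, with values 0/1).
def shapley(tuples, boolean_query):
    tuples = list(tuples)
    values = {t: 0.0 for t in tuples}
    perms = list(permutations(tuples))
    for order in perms:
        present = set()
        prev = boolean_query(present)
        for t in order:
            present.add(t)
            cur = boolean_query(present)
            values[t] += (cur - prev) / len(perms)  # marginal contribution
            prev = cur
    return values

# Toy query q :- R(x), S(x) over R = {a}, S = {a}:
db = [("R", "a"), ("S", "a")]
q = lambda s: ("R", "a") in s and ("S", "a") in s
vals = shapley(db, q)
# by symmetry each of the two tuples gets Shapley value 0.5
```

For a monotone CQ every marginal contribution is non-negative; with negated atoms contributions can be negative, which is one way negation complicates the picture studied above.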
Tuesday 2:30 PM – 3:00 PM  Database-driven Workflows and Transactions  
PODS Session 7  Session Chair: Jan Van den Bussche  
Projection Views of Register Automata  
Luc Segoufin and Victor Vianu  Register automata have been used as a convenient model for specifying and verifying database-driven systems. An important problem in such systems is to provide views that hide or restructure certain information about the data or process, extending classical notions of database views. In this paper we carry out a formal investigation of views of register automata by considering simple views that project away some of the registers. We show that classical register automata are not able to describe such projections and introduce more powerful register automata that are able to do so. We also show useful properties of these automata such as closure under projection and decidability of verifying temporal properties of their runs.  https://dl.acm.org/doi/abs/10.1145/3375395.3387651 
Deciding Robustness for Lower SQL Isolation Levels  
Bas Ketsman, Christoph Koch, Frank Neven and Brecht Vandevoort  While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this paper, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this paper is that we characterize robustness against both isolation levels in terms of the absence of counterexample schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work is that the properties of cycles obtained in this paper have to take the relative ordering of operations within transactions into account, as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this paper is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.  https://dl.acm.org/doi/abs/10.1145/3375395.3387655 
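Serializability of a single interleaving, the property that robustness requires of every allowed interleaving, is classically checked via the conflict graph: add an edge T1 → T2 whenever an operation of T1 conflicts with a later operation of T2 on the same item (two operations conflict when at least one is a write), and test for acyclicity. The sketch below checks one given schedule only; robustness, as defined above, quantifies over all interleavings the isolation level permits. Schedule encoding and names are illustrative.

```python
from collections import defaultdict

# Conflict-serializability of one schedule, given as a list of
# (transaction, op, item) with op in {"r", "w"}.
def conflict_serializable(schedule):
    edges = defaultdict(set)
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            # conflicting operations of different transactions on the same item
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges[t1].add(t2)
    # the schedule is conflict-serializable iff the conflict graph is acyclic
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def has_cycle(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[t] == WHITE and has_cycle(t) for t, _, _ in schedule)

# w1(x) r2(x) w2(y) r1(y): T1 -> T2 (on x) and T2 -> T1 (on y), a cycle.
s = [("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "y"), ("T1", "r", "y")]
ok = conflict_serializable(s)
```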
Wednesday 10:30 AM – 12:00 PM  Fine-Grained Complexity Analysis of Queries: from Decision to Counting and Enumeration 
Tutorial 2  Session Chair: Yufei Tao  
Fine-Grained Complexity Analysis of Queries: from Decision to Counting and Enumeration  
Arnaud Durand  This paper is devoted to a complexity study of various tasks related to query answering, such as deciding whether a Boolean query is true, counting the size of the answer set, or enumerating the results. It is a survey of some of the many tools, from complexity measures through algorithmic methods to conditional lower bounds, that have been developed in this domain over the last few years.  https://dl.acm.org/doi/abs/10.1145/3375395.3389130 
Wednesday 1:30 PM – 2:00 PM  Languages  
PODS Session 8  Session Chair: Benny Kimelfeld  
Generative Datalog with Continuous Distributions  
Martin Grohe, Benjamin Lucien Kaminski, Joost-Pieter Katoen and Peter Lindner  Arguing for the need to combine declarative and probabilistic programming, Bárány et al. (TODS 2017) recently introduced a probabilistic extension of Datalog as a “purely declarative probabilistic programming language.” We revisit this language and propose a more foundational approach towards defining its semantics. It is based on standard notions from probability theory known as stochastic kernels and Markov processes. This allows us to extend the semantics to continuous probability distributions, thereby settling an open problem posed by Bárány et al. We show that our semantics is fairly robust, allowing both parallel execution and arbitrary chase orders when evaluating a program. We cast our semantics in the framework of infinite probabilistic databases (Grohe and Lindner, ICDT 2020), and we show that the semantics remains meaningful even when the input of a probabilistic Datalog program is an arbitrary probabilistic database.  https://dl.acm.org/doi/abs/10.1145/3375395.3387659 
Conjunctive Regular Path Queries with String Variables  
Markus L. Schmid  We introduce the class CXRPQ of conjunctive xregex path queries, which are obtained from conjunctive regular path queries (CRPQs) by adding string variables (also called backreferences) as found in practical implementations of regular expressions. CXRPQs can be considered user-friendly, since they combine two concepts that are well-established in practice: pattern-based graph queries and regular expressions with backreferences. Due to the string variables, CXRPQs can express inter-path dependencies, which are not expressible by CRPQs. The evaluation complexity of CXRPQs, if not further restricted, is PSpace-hard in data complexity. We identify three natural fragments with more acceptable evaluation complexity: their data complexity is in NL, while their combined complexity varies between ExpSpace, PSpace and NP. In terms of expressive power, we compare the CXRPQ fragments with CRPQs and unions of CRPQs, and with extended conjunctive regular path queries (ECRPQs) and unions of ECRPQs.  https://dl.acm.org/doi/abs/10.1145/3375395.3387663 
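The string variables the paper lifts into path queries are the backreferences familiar from practical regex engines. A small Python illustration (the path syntax below is made up for the example) shows the kind of repeated-label constraint that plain regular expressions cannot express:

```python
import re

# Backreference demo: (?P<lab>...) binds a string variable "lab", and
# (?P=lab) forces a later part of the match to repeat it verbatim --
# an inter-part dependency beyond ordinary regular expressions.
path = "a -knows-> b -knows-> c"
pattern = r"\w+ -(?P<lab>\w+)-> \w+ -(?P=lab)-> \w+"

m = re.fullmatch(pattern, path)
print(m.group("lab") if m else "no match")  # prints "knows"
```

Changing the second edge label (e.g. "a -knows-> b -likes-> c") makes the match fail, since the backreference demands the same label on both hops.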
Wednesday 2:00 PM – 3:00 PM  Query Enumeration and Aggregation  
PODS Session 9  Session Chair: Paris Koutris  
Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries  
Ahmet Kara, Milos Nikolic, Dan Olteanu and Haozhe Zhang  We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. The main result of this work defines the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By conveniently choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries. For a restricted class of hierarchical queries, our approach can achieve worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.  https://dl.acm.org/doi/abs/10.1145/3375395.3387646 
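The heavy/light strategy can be sketched on a toy two-way join Q(a, b, c) = R(a, b) ⋈ S(b, c): precompute output tuples for light join values, but only index the inputs for heavy ones and enumerate those matches on the fly. This is an illustration of the general strategy, not the paper's algorithm; the threshold value here is arbitrary:

```python
from collections import defaultdict

THRESHOLD = 2  # light/heavy degree cutoff (arbitrary for this demo)

def preprocess(R, S):
    # R: list of (a, b) tuples, S: list of (b, c) tuples.
    r_by_b, s_by_b = defaultdict(list), defaultdict(list)
    for a, b in R:
        r_by_b[b].append(a)
    for b, c in S:
        s_by_b[b].append(c)
    light, heavy = [], {}
    for b, as_ in r_by_b.items():
        if len(as_) <= THRESHOLD:
            # light b-value: materialize its output tuples up front
            light.extend((a, b, c) for a in as_ for c in s_by_b[b])
        else:
            # heavy b-value: keep only the input partitions
            heavy[b] = (as_, s_by_b[b])
    return light, heavy

def enumerate_answers(light, heavy):
    yield from light                      # precomputed part
    for b, (as_, cs) in heavy.items():    # enumerated on the fly
        for a in as_:
            for c in cs:
                yield (a, b, c)
```

Raising the threshold shifts work from enumeration to preprocessing and vice versa, which is exactly the trade-off the paper quantifies.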
Answering (Unions of) Conjunctive Queries using Random Access and RandomOrder Enumeration  
Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld and Nicole Schweikardt  As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this paper, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a “random access”: polylogarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation. We also identify a subclass of UCQs for which we can provide random access with polylogarithmic access time. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.  https://dl.acm.org/doi/abs/10.1145/3375395.3387662 
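The link between the two harder yardsticks can be sketched generically: given random access to the answers (here simplified to list indexing), a lazy Fisher–Yates shuffle yields random-order enumeration with constant extra work per emitted answer. This is a generic reduction for illustration, not the paper's UCQ construction:

```python
import random

def random_order_enumeration(access, n, rng=random):
    # Lazy Fisher-Yates: `pos` is a sparse view of a partially
    # shuffled index array; untouched slots implicitly hold their
    # own index. Every prefix of the output is a uniform sample
    # of answers without replacement.
    pos = {}
    for i in range(n):
        j = rng.randrange(i, n)   # choose the next index uniformly
        pos[i], pos[j] = pos.get(j, j), pos.get(i, i)
        yield access(pos[i])      # one random-access call per answer

answers = ["t1", "t2", "t3", "t4"]
print(list(random_order_enumeration(answers.__getitem__, len(answers))))
```

Each output costs one random-access call plus O(1) bookkeeping, so polylogarithmic-time random access immediately gives polylogarithmic-delay random-order enumeration (the converse separation is what the paper proves for UCQs).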
Parallel Algorithms for Sparse Matrix Multiplication and Join-Aggregate Queries  
Xiao Hu and Ke Yi  In this paper, we design massively parallel algorithms for sparse matrix multiplication, as well as more general join-aggregate queries, where the join hypergraph is a tree with arbitrary output attributes. For each case, we obtain asymptotic improvement over existing algorithms. In particular, our matrix multiplication algorithm is shown to be optimal in the semiring model.  https://dl.acm.org/doi/abs/10.1145/3375395.3387657 
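The semiring model referenced here means the algorithm may only combine entries with the semiring's "plus" and "times". A minimal sequential sketch of sparse matrix multiplication parameterized by a semiring (purely illustrative; the paper's contribution is the massively parallel version):

```python
from collections import defaultdict

def semiring_matmul(A, B, plus, times, zero):
    # A, B: sparse matrices as dicts {(i, j): value}; only nonzeros
    # are stored, and only index pairs that can contribute are touched.
    b_rows = defaultdict(list)
    for (k, j), v in B.items():
        b_rows[k].append((j, v))
    C = {}
    for (i, k), a in A.items():
        for j, b in b_rows[k]:
            C[(i, j)] = plus(C.get((i, j), zero), times(a, b))
    return C

# The (+, *) semiring recovers the ordinary matrix product; swapping
# in (min, +) instead gives one step of shortest-path relaxation.
A = {(0, 0): 1, (0, 1): 2}
B = {(0, 0): 3, (1, 0): 4}
print(semiring_matmul(A, B, plus=lambda x, y: x + y,
                      times=lambda x, y: x * y, zero=0))
# C[(0, 0)] = 1*3 + 2*4 = 11
```

Viewing a two-matrix product as the join-aggregate query C(i, j) = SUM_k A(i, k) * B(k, j) is what connects the two halves of the paper's title.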
Aggregate Queries over Sparse Databases  
Szymon Torunczyk  We propose an algebraic framework for studying efficient algorithms for query evaluation, aggregation, enumeration, and maintenance under updates, on sparse databases. Our framework allows us to treat those problems in a unified way, by considering various semirings, depending on the considered problem. As a concrete application, we propose a powerful query language extending first-order logic by aggregation in multiple semirings. We obtain an optimal algorithm for computing the answers of such queries on sparse databases. More precisely, given a database from a fixed class with bounded expansion, the algorithm computes in linear time a data structure which allows us to enumerate the set of answers to the query, with constant delay between two outputs.  https://dl.acm.org/doi/abs/10.1145/3375395.3387660 
Wednesday 3:00 PM – 3:30 PM  Streams – Cycle Counting  
PODS Session 10  Session Chair: Yufei Tao  
Triangle and Four Cycle Counting in the Data Stream Model  
Andrew McGregor and Sofya Vorotnikova  The problem of estimating the number of cycles in a graph is one of the most widely studied graph problems in the data stream model. Three relevant variants of the data stream model include: the arbitrary order model, in which the stream consists of the edges of the graph in arbitrary order; the random order model, in which the edges are randomly permuted; and the adjacency list order model, in which all edges incident to the same vertex appear consecutively. In this paper, we focus on the problem of triangle and four-cycle counting in these models. We improve over the state-of-the-art results as follows, where n is the number of vertices, m is the number of edges and T is the number of triangles/four-cycles in the graph (i.e., the quantity being estimated): Random Order Model: We present a single-pass algorithm that (1+ε)-approximates the number of triangles using Õ(ε^-2 m/√T) space and prove that this is optimal in the range T ≤ √m. The best previous result, a (3+ε)-approximation using Õ(ε^-4.5 m/√T) space, was presented by Cormode and Jowhari (Theor. Comput. Sci. 2017). Adjacency List Model: We present an algorithm that returns a (1+ε)-approximation of the number of 4-cycles using two passes and Õ(ε^-4 m/√T) space. The best previous result, a constant approximation using Õ(m/T^(3/8)) space, was presented by Kallaugher et al. (PODS 2019). We also show that (1+ε)-approximation in a single pass is possible in a) polylog(n) space if T = Ω(n^2) and b) Õ(n) space if T = Ω(n). Arbitrary Order Model: We present a three-pass algorithm that (1+ε)-approximates the number of 4-cycles using Õ(ε^-2 m/T^(1/4)) space and a one-pass algorithm that uses Õ(ε^-2 n) space when T = Ω(n^2). The best existing result, a (1+ε)-approximation using Õ(ε^-2 m^2/T) space, was presented by Bera and Chakrabarti (STACS 2017). We also show a multi-pass lower bound and another algorithm for distinguishing graphs with no four-cycles from graphs with many four-cycles.  https://dl.acm.org/doi/abs/10.1145/3375395.3387652 
How the Degeneracy Helps for Triangle Counting in Graph Streams  
Suman K. Bera and C. Seshadhri  We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of m edges, our aim is to compute a (1+ε)-approximation to the triangle count T, using a small-space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially Θ(min(m^(3/2)/T, m/√T)) (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant-pass, arbitrary-order) streaming algorithm that can circumvent this lower bound for low-degeneracy graphs. The degeneracy, K, is a nuanced measure of density, and the class of constant-degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity Õ(mK/T). For constant-degeneracy graphs, this bound is Õ(m/T), which is significantly smaller than both m^(3/2)/T and m/√T. We complement our algorithmic result with a nearly matching lower bound of Ω(mK/T).  https://dl.acm.org/doi/abs/10.1145/3375395.3387665 
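The degeneracy ordering that makes such bounds possible is easy to compute offline: repeatedly remove a minimum-degree vertex. Orienting each edge toward the later vertex then bounds every out-degree by K, so triangles can be counted by intersecting out-neighborhoods in O(mK) work. This is the classical offline routine for intuition only, not the paper's small-space streaming algorithm:

```python
import heapq

def degeneracy_order(adj):
    # adj: {vertex: set of neighbors}. Repeatedly pop a vertex of
    # currently minimum degree; stale heap entries are skipped.
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                      # outdated entry
        removed.add(v)
        order.append(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

def count_triangles(adj):
    # Orient each edge toward the vertex later in the degeneracy
    # order; each triangle is then counted exactly once, at the
    # directed edge between its two earliest vertices.
    rank = {v: i for i, v in enumerate(degeneracy_order(adj))}
    out = {v: {u for u in ns if rank[u] > rank[v]} for v, ns in adj.items()}
    return sum(len(out[v] & out[u]) for v in adj for u in out[v])

K4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(count_triangles(K4))  # K4 contains 4 triangles
```

Every out-neighborhood has size at most K by construction, which is the structural fact the paper's Õ(mK/T) streaming bound exploits.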