SIGMOD provides a virtual conference site in Gather for meeting colleagues, chatting, and holding discussions.

Monday 8:00 AM – 10:00 AM Introduction
PODS Opening + Keynote Session Chairs: Dan Suciu, Yufei Tao
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data
Martin Grohe Vector representations of graphs and relational structures, whether hand-crafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view. Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research. https://dl.acm.org/doi/abs/10.1145/3375395.3387641
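One family of graph invariants that this line of work connects to vector embeddings is Weisfeiler-Leman color refinement. A minimal sketch of 1-dimensional WL refinement follows; the function name and adjacency-list encoding are illustrative choices of mine, not notation from the paper:

```python
def wl_refine(adj, rounds=2):
    # 1-WL color refinement: each vertex's new color is its old color
    # together with the sorted multiset of its neighbors' colors.
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                  for v in adj}
    return colors
```

On a triangle every vertex ends up with the same color, while on a three-vertex path the middle vertex is distinguished from the endpoints; this is the kind of structural information a vector embedding of a graph can expose to downstream learning.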
Monday 10:30 AM – 11:15 AM
PODS Test of Time + Gems Session Chair: Yufei Tao
Optimizing Linear Counting Queries under Differential Privacy
Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor  
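This paper concerns optimizing entire workloads of linear counting queries under differential privacy. As background, the basic Laplace mechanism for a single counting query can be sketched as follows; this is a generic textbook illustration, not the paper's workload-optimization technique, and the function names are mine:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from the Laplace distribution with the given scale.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon, rng):
    # A counting query has sensitivity 1 (adding or removing one record
    # changes the count by at most 1), so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```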
Coping with Incomplete Data: Recent Advances
Leonid Libkin Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently?

The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers.
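The three-valued logic in question is SQL's Kleene logic over {true, false, unknown}. A minimal sketch, using Python's None for unknown (function names are mine):

```python
def and3(a, b):
    if a is False or b is False:
        return False
    return None if a is None or b is None else True

def or3(a, b):
    if a is True or b is True:
        return True
    return None if a is None or b is None else False

def not3(a):
    return None if a is None else not a

def gt3(x, y):
    # Any comparison involving NULL evaluates to unknown.
    return None if x is None or y is None else x > y
```

SQL returns a row only when the WHERE clause evaluates to true, so on a NULL salary the tautology salary > 50000 OR NOT (salary > 50000) evaluates to unknown and the row is dropped, even though the condition holds under every interpretation of the null. This is exactly the kind of gap between certain answers and SQL evaluation that the survey examines.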

Probabilistic Databases for All
Dan Suciu In probabilistic databases the data is uncertain and is modeled by a probability distribution. The central problem in probabilistic databases is query evaluation, which requires performing not only traditional data processing such as joins, projections, unions, but also probabilistic inference in order to compute the probability of each item in the answer. At their core, probabilistic databases are a proposal to integrate logic with probability theory. This paper accompanies a talk given as part of the Gems of PODS series, and describes several results in probabilistic databases, explaining their significance in the broader context of model counting, probabilistic inference, and Statistical Relational Models. https://dl.acm.org/doi/abs/10.1145/3375395.3389129
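In a tuple-independent database, the probability of a Boolean query is by definition a sum over possible worlds. A brute-force sketch (exponential in the number of tuples, so illustration only; the encoding is mine, and the whole point of the area is to avoid this enumeration when possible):

```python
from itertools import product

def query_probability(tuples, probs, query):
    # tuples: list of facts; probs[i] = marginal probability of tuples[i].
    # query: function from a set of present facts to bool (a Boolean query).
    # Enumerates all 2^n possible worlds of the tuple-independent database.
    total = 0.0
    for world in product([0, 1], repeat=len(tuples)):
        p = 1.0
        present = set()
        for i, bit in enumerate(world):
            p *= probs[i] if bit else (1 - probs[i])
            if bit:
                present.add(tuples[i])
        if query(present):
            total += p
    return total
```

For example, with two independent facts each of probability 0.5, the query "at least one fact is present" has probability 0.75.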
Monday 11:15 AM – 12:00 PM Streams – Robustness and Determinism
PODS Session 1 Session Chair: Andrew McGregor
The Adversarial Robustness of Sampling
Omri Ben-Eliezer and Eylon Yogev Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe U to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample “very unrepresentative” of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size Ω(d/ε²) is an ε-approximation of the full data with good probability, where d is the VC-dimension of the underlying set system (U, R). Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of 1) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is “theoretical only”, requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term d in the sample size with the cardinality term log |R|.
That is, the Bernoulli and reservoir sampling algorithms with sample size Ω(log |R|/ε²) output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack. https://dl.acm.org/doi/abs/10.1145/3375395.3387643
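The reservoir sampling algorithm analyzed above can be sketched as follows; in the adversarial setting of the paper, the next stream element may depend on the current contents of `reservoir`. This is the standard textbook implementation, included only for illustration:

```python
import random

def reservoir_sample(stream, k, rng):
    # Classic reservoir sampling: after processing i items, each item
    # seen so far is in the reservoir with probability k/i.
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)
        else:
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = x
    return reservoir
```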
A Framework for Adversarially Robust Streaming Algorithms
Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff and Eylon Yogev We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. In this work, we show that the answer is positive for various important streaming problems in the insertion-only model, including distinct elements and more generally F_p-estimation, F_p-heavy hitters, entropy estimation, and others. For all of these problems, we develop adversarially robust (1+ε)-approximation algorithms whose required space matches that of the best known non-robust algorithms up to a poly(log n, 1/ε) multiplicative factor (and in some cases even up to a constant factor). Towards this end, we develop several generic tools allowing one to efficiently transform a non-robust streaming algorithm into a robust one in various scenarios. https://dl.acm.org/doi/abs/10.1145/3375395.3387658
Tight Lower Bound for Comparison-Based Quantile Summaries
Graham Cormode and Pavel Vesely Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles of a stream of items, up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream and then, given any quantile query 0 ≤ φ ≤ 1, returns an item from the stream, which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna [6], stores at most O(1/ε · log(εN)) items, where N is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε)⋅ o(log N), for any function f that does not depend on N. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1±ε)·φ, and for other related computational tasks. https://dl.acm.org/doi/abs/10.1145/3375395.3387650
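For intuition about the ε-approximate guarantee, here is a minimal offline (non-streaming) comparison-based summary: keep roughly every ⌈εN⌉-th item of the sorted input. The streaming Greenwald-Khanna structure that the lower bound matches is far more intricate; this sketch, with names of my own choosing, only illustrates what an ε-approximate quantile answer means:

```python
def build_summary(items, eps):
    # Keep about 1/eps evenly spaced items of the sorted input, so every
    # rank is represented up to an additive error of eps * N.
    s = sorted(items)
    step = max(1, int(eps * len(s)))
    summary = s[::step]
    if summary[-1] != s[-1]:
        summary.append(s[-1])
    return summary, len(s)

def quantile_query(summary, n, phi, eps):
    # Return a stored item whose rank is phi * n up to an error of eps * n.
    step = max(1, int(eps * n))
    idx = min(int(phi * n) // step, len(summary) - 1)
    return summary[idx]
```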
Monday 1:30 PM – 2:15 PM Containment and Rewritability
PODS Session 2 Session Chair: Pablo Barcelo
Bag Query Containment and Information Theory
Mahmoud Abo Khamis, Phokion Kolaitis, Hung Ngo and Dan Suciu The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the conjunctive query containment under bag semantics. These connections are established using information inequalities, which are considered to be the laws of information theory. Our first main result asserts that deciding the validity of a generalization of information inequalities is many-one equivalent to the restricted case of conjunctive query containment in which the containing query is acyclic; thus, either both these problems are decidable or both are undecidable. Our second main result identifies a new decidable case of the conjunctive query containment problem under bag semantics. Specifically, we give an exponential time algorithm for conjunctive query containment under bag semantics, provided the containing query is chordal and admits a simple junction tree. https://dl.acm.org/doi/abs/10.1145/3375395.3387645
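To see why set and bag semantics diverge, one can evaluate a conjunctive query while tracking multiplicities. The following brute-force evaluator uses my own encoding of atoms and databases, purely for illustration:

```python
from collections import Counter

def eval_bag(query_atoms, head_vars, db):
    # Evaluate a conjunctive query under bag semantics by enumerating all
    # joint valuations; each match contributes a multiplicity equal to the
    # product of the matched tuples' multiplicities.
    # query_atoms: list of (relation, variable tuple); db: relation -> Counter.
    results = Counter()
    def rec(i, env, mult):
        if i == len(query_atoms):
            results[tuple(env[v] for v in head_vars)] += mult
            return
        rel, vars_ = query_atoms[i]
        for tup, m in db[rel].items():
            new_env, ok = dict(env), True
            for v, c in zip(vars_, tup):
                if new_env.setdefault(v, c) != c:
                    ok = False
                    break
            if ok:
                rec(i + 1, new_env, mult * m)
    rec(0, {}, 1)
    return results
```

With R containing (a,b) with multiplicity 2, the query Q1(x) :- R(x,y) returns a twice, while Q2(x) :- R(x,y), R(x,z) returns a four times: the two queries are equivalent under set semantics, yet their bag-semantics answers differ, which is what makes containment under bag semantics subtle.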
First-Order Rewritability in Consistent Query Answering with Respect to Multiple Keys
Paraschos Koutris and Jef Wijsen We study consistent query answering with respect to key dependencies. Given a (possibly inconsistent) database instance and a set of key dependencies, a repair is an inclusion-maximal subinstance that satisfies all key dependencies. Consistent query answering for a Boolean query is the following problem: given a database instance as input, is the query true in every repair? In [Koutris and Wijsen, ICDT 2019], it was shown that for every self-join-free Boolean conjunctive query and set of key dependencies containing exactly one key dependency per relation name (also called the primary key), this problem is in FO, L-complete, or coNP-complete, and it is decidable which of the three cases applies. In this paper, we consider the more general case where a relation name can be associated with more than one key dependency. It is shown that in this more general setting, it remains decidable whether or not the above problem is in FO, for self-join-free Boolean conjunctive queries. Moreover, it is possible to effectively construct a first-order query that solves the problem whenever such a query exists. https://dl.acm.org/doi/abs/10.1145/3375395.3387654
On Monotonic Determinacy and Rewritability for Recursive Queries and Views
Michael Benedikt, Stanislav Kikot, Piotr Ostropolski-Nalewaja and Miguel Romero A query Q is monotonically determined over a set of views if Q can be expressed as a monotonic function of the view image. In the case of relational algebra views and queries, monotonic determinacy coincides with rewritability as a union of conjunctive queries, and it is decidable in important special cases, such as for CQ views and queries. We investigate the situation for views and queries in the recursive query language Datalog. We give both positive and negative results about the ability to decide monotonic determinacy, and also about the co-incidence of monotonic determinacy with Datalog rewritability. https://dl.acm.org/doi/abs/10.1145/3375395.3387661
Monday 2:15 PM – 3:00 PM Probabilistic and Incomplete Databases
PODS Session 3 Session Chair: Dan Suciu
Solving a Special Case of the Intensional vs. Extensional Conjecture in Probabilistic Databases
Mikael Monet We consider the problem of exact probabilistic inference for Union of Conjunctive Queries (UCQs) on tuple-independent databases. For this problem, two approaches currently coexist. In the extensional method, query evaluation is performed by exploiting the structure of the query, and relies heavily on the use of the inclusion–exclusion principle. In the intensional method, one first builds a representation of the lineage of the query in a tractable formalism of knowledge compilation. The chosen formalism should then ensure that the probability can be efficiently computed using simple disjointness and independence assumptions, without the need of performing inclusion–exclusion. The extensional approach has long been thought to be strictly more powerful than the intensional approach, the reason being that for some queries, the use of inclusion–exclusion seemed unavoidable. In this paper we introduce a new technique to construct lineage representations as deterministic decomposable circuits in polynomial time. We prove that this technique applies to a class of UCQs that had been conjectured to separate the complexity of the two approaches. In essence, we show that relying on the inclusion–exclusion formula can be avoided by using negation. This result brings back hope to prove that the intensional approach can handle all tractable UCQs. https://dl.acm.org/doi/abs/10.1145/3375395.3387642
Counting Problems over Incomplete Databases
Marcelo Arenas, Pablo Barceló, and Mikael Monet We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query q, we consider the following two problems: given as input an incomplete database D, (a) return the number of completions of D that satisfy q; or (b) return the number of valuations of the nulls of D yielding a completion that satisfies q. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when q is a self-join-free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in D (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations (for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption). Moreover, we find that both (1) and (2) reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial randomized approximation scheme, in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes. https://dl.acm.org/doi/abs/10.1145/3375395.3387656
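Counting valuations can be phrased directly as iteration over the finite domains of the nulls. A brute-force sketch, with an encoding of nulls and tables of my own choosing (the paper's results concern when this exponential enumeration can be avoided):

```python
from itertools import product

def count_valuations(table, null_domains, query):
    # table: list of tuples whose entries may be null names ('_1', '_2', ...).
    # null_domains: null name -> finite list of candidate constants.
    # Counts the valuations of the nulls whose completion satisfies query.
    nulls = sorted(null_domains)
    count = 0
    for values in product(*(null_domains[n] for n in nulls)):
        env = dict(zip(nulls, values))
        completion = [tuple(env.get(c, c) for c in t) for t in table]
        if query(completion):
            count += 1
    return count
```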
Queries with Arithmetic on Incomplete Databases
Marco Console, Matthias Hofer and Leonid Libkin The standard notion of query answering over incomplete databases is that of certain answers, guaranteeing correctness regardless of how incomplete data is interpreted. In the majority of real-life databases, relations have numerical columns and queries use arithmetic and comparisons. Even though the notion of certain answers still applies, we explain that it becomes much more problematic in situations when missing data occurs in numerical columns. We propose a new general framework that allows us to assign a measure of certainty to query answers. We test it in the agnostic scenario where we do not have prior information about values of numerical attributes, similarly to the predominant approach in handling incomplete data which assumes that each null can be interpreted as an arbitrary value of the domain. The key technical challenge is the lack of a uniform distribution over the entire domain of numerical attributes, such as real numbers. We overcome this by associating the measure of certainty with the asymptotic behavior of volumes of some subsets of the Euclidean space. We show that this measure is well-defined, and describe approaches to computing and approximating it. While it can be computationally hard, or result in an irrational number, even for simple constraints, we produce polynomial-time randomized approximation schemes with multiplicative guarantees for conjunctive queries, and with additive guarantees for arbitrary first-order queries. We also describe a set of experimental results to confirm the feasibility of this approach. https://dl.acm.org/doi/abs/10.1145/3375395.3387666
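The asymptotic-volume idea can be illustrated with a naive Monte Carlo estimator: measure the fraction of a growing box on which a condition holds. This is my own toy illustration of the underlying notion, not the paper's randomized approximation scheme:

```python
import random

def support_fraction(condition, dim, box, samples, rng):
    # Monte Carlo estimate of the fraction of the cube [-box, box]^dim on
    # which `condition` holds; the certainty measure discussed above is
    # the limit of such fractions as the box grows (asymptotic volume).
    hits = sum(condition([rng.uniform(-box, box) for _ in range(dim)])
               for _ in range(samples))
    return hits / samples
```

For example, the condition x > 0 covers about half of every box, so its asymptotic certainty is 1/2 no matter how large the box gets.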
Monday 3:30 PM – 5:00 PM Data Structures
PODS Session 4 Session Chair: Ke Yi
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions
Martin Aumuller, Rasmus Pagh and Francesco Silvestri Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem. https://dl.acm.org/doi/abs/10.1145/3375395.3387648
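The fairness goal itself is easy to state via a brute-force linear-scan baseline, which the paper's LSH-based data structures emulate sublinearly. A sketch (using Python 3.8+'s math.dist; the function name is mine):

```python
import math
import random

def fair_r_nn(points, q, r, rng):
    # Exhaustive fair r-near-neighbor query: every point within distance r
    # of q is returned with equal probability (None if no such point).
    neighbors = [p for p in points if math.dist(p, q) <= r]
    return rng.choice(neighbors) if neighbors else None
```

The challenge the paper addresses is achieving this uniform-output guarantee without scanning all of S, since plain LSH biases the returned point towards those that hash near q most often.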
On the I/O Complexity of the k-Nearest Neighbors Problem
Mayank Goswami, Riko Jacob and Rasmus Pagh We consider static, external memory indexes for exact and approximate versions of the k-nearest neighbor (k-NN) problem, and show new lower bounds under a standard indivisibility assumption: Polynomial space indexing schemes for high-dimensional k-NN in Hamming space cannot take advantage of block transfers: Ω(k) block reads are needed to answer a query. For the l∞ metric the lower bound holds even if we allow c-approximate nearest neighbors to be returned, for c ∈ (1, 3). The restriction to c < 3 is necessary: For every metric there exists an indexing scheme in the indexability model of Hellerstein et al. using space O(kn), where n is the number of points, that can retrieve k 3-approximate nearest neighbors using optimal ⌈k/B⌉ I/Os, where B is the block size. For specific metrics, data structures with better approximation factors are possible. For k-NN in Hamming space and every approximation factor c>1 there exists a polynomial space data structure that returns k c-approximate nearest neighbors in ⌈k/B⌉ I/Os. To show these lower bounds we develop two new techniques: First, to handle that approximation algorithms have more freedom in deciding which result set to return we develop a relaxed version of the λ-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in d ≥ n dimensions. To extend the lower bounds down to d = O(k log(n/k)) dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest. https://dl.acm.org/doi/abs/10.1145/3375395.3387649
Efficient Indexes for Diverse Top-k Range Queries
Pankaj Agarwal, Stavros Sintos and Alex Steiger Let P be a set of n (non-negatively) weighted points in Rd. We consider the problem of computing a subset of (at most) k diverse and high-valued points of P that lie inside a query range, a problem relevant to many areas such as search engines, recommendation systems, and online stores. The diversity and value of a set of points are measured as functions (say average or minimum) of their pairwise distances and weights, respectively. We study both bicriteria and constrained optimization problems. In the former, we wish to return a set of k points that maximize a weighted sum of their value and diversity measures, and in the latter, we wish to return a set of at most k points that maximize their value and satisfy a diversity constraint. We obtain three main types of results in this paper: (i) near-linear time (0.5-ε)-approximation algorithms for the bicriteria optimization problem in the offline setting; (ii) near-linear size indexes for the bicriteria optimization problem that, for a query rectangle, return a (0.5-ε)-approximate solution in time O(k polylog(n)); these indexes can be constructed in O(n polylog(n)) time; (iii) near-linear size indexes for answering constrained optimization range queries, where for a query rectangle a 0.5^O(d)-approximate solution can be computed in O(k polylog(n)) time, and if we allow some of the returned points to lie at most ε outside of the query rectangle then a (1-ε)-approximate solution can be computed in O(k polylog(n)) time; these indexes are constructed in O(n polylog(n)) and n^O(1/ε^d) time, respectively. https://dl.acm.org/doi/abs/10.1145/3375395.3387667
Tuesday 10:30 AM – 12:00 PM Three Modern Roles for Logic in AI
Tutorial 1 Session Chair: Yufei Tao
Three Modern Roles for Logic in AI
Adnan Darwiche We consider three modern roles for logic in artificial intelligence, which are based on the theory of tractable Boolean circuits: (1) logic as a basis for computation, (2) logic for learning from a combination of data and knowledge, and (3) logic for reasoning about the behavior of machine learning systems. https://dl.acm.org/doi/abs/10.1145/3375395.3389131
Tuesday 1:30 PM – 2:00 PM Dependencies
PODS Session 5 Session Chair: Michael Benedikt
All-Instances Restricted Chase Termination
Tomasz Gogacz, Jerzy Marcinkowski and Andreas Pieris The chase procedure is a fundamental algorithmic tool in database theory with a variety of applications. A key problem concerning the chase procedure is all-instances termination: for a given set of tuple-generating dependencies (TGDs), is it the case that the chase terminates for every input database? In view of the fact that this problem is undecidable, it is natural to ask whether known well-behaved classes of TGDs ensure decidability. We consider here the main paradigms that led to robust TGD-based formalisms, that is, guardedness and stickiness. Although all-instances termination is well-understood for the oblivious chase, the more subtle case of the restricted (a.k.a. the standard) chase is rather unexplored. We show that all-instances restricted chase termination for guarded/sticky single-head TGDs is decidable in elementary time. https://dl.acm.org/doi/abs/10.1145/3375395.3387644
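A minimal sketch of the restricted chase for single-head TGDs follows. The encoding is entirely my own (variables are single uppercase letters, facts are (relation, constants) pairs) and is meant only to make the firing condition concrete: a trigger fires only when its head is not already satisfied:

```python
import itertools

def matches(atoms, facts, env=None):
    # Yield all homomorphisms from `atoms` into `facts` extending `env`.
    env = env or {}
    if not atoms:
        yield env
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for frel, consts in facts:
        if frel != rel or len(consts) != len(terms):
            continue
        new, ok = dict(env), True
        for t, c in zip(terms, consts):
            if t.isupper():                     # variable
                if new.setdefault(t, c) != c:
                    ok = False
                    break
            elif t != c:                        # constant mismatch
                ok = False
                break
        if ok:
            yield from matches(rest, facts, new)

def restricted_chase(facts, tgds, max_steps=50):
    # tgds: list of (body, head_atom); head variables absent from the body
    # are existential. The restricted chase fires a trigger only if no
    # extension of the match already satisfies the head.
    facts = set(facts)
    fresh = itertools.count()
    for _ in range(max_steps):
        fired = False
        for body, (hrel, hterms) in tgds:
            for env in list(matches(body, facts)):
                frontier = {v: env[v] for v in hterms
                            if v.isupper() and v in env}
                if next(matches([(hrel, hterms)], facts, frontier),
                        None) is not None:
                    continue                    # head satisfied: do not fire
                for t in hterms:
                    if t.isupper() and t not in env:
                        env[t] = f"n{next(fresh)}"   # fresh labeled null
                facts.add((hrel, tuple(env.get(t, t) for t in hterms)))
                fired = True
        if not fired:
            break                               # chase terminated
    return facts
```

On facts {R(a,b), R(b,a)} with the TGD R(X,Y) → ∃Z R(Y,Z), no trigger fires and the chase terminates immediately, whereas on {R(a,b)} alone the same TGD keeps generating fresh nulls, the non-termination phenomenon the paper studies.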
The Limits of Efficiency for Open- and Closed-World Query Evaluation Under Guarded TGDs
Pablo Barcelo, Victor Dalmau, Cristina Feier, Carsten Lutz and Andreas Pieris Ontology-mediated querying and querying in the presence of constraints are two key database problems where tuple-generating dependencies (TGDs) play a central role. In ontology-mediated querying, TGDs can formalize the ontology and thus derive additional facts from the given data, while in querying in the presence of constraints, they restrict the set of admissible databases. In this work, we study the limits of efficient query evaluation in the context of the above two problems, focusing on guarded and frontier-guarded TGDs and on UCQs as the actual queries. We show that a class of ontology-mediated queries (OMQs) based on guarded TGDs can be evaluated in FPT iff the OMQs in the class are equivalent to OMQs in which the actual query has bounded treewidth, up to some reasonable assumptions. For querying in the presence of constraints, we consider classes of constraint-query specifications (CQSs) that bundle a set of constraints with an actual query. We show a dichotomy result for CQSs based on guarded TGDs that parallels the one for OMQs except that, additionally, FPT coincides with PTime combined complexity. The proof is based on a novel connection between OMQ and CQS evaluation. Using a direct proof, we also show a similar dichotomy result, again up to some reasonable assumptions, for CQSs based on frontier-guarded TGDs with a bounded number of atoms in TGD heads. Our results on CQSs can be viewed as extensions of Grohe’s well-known characterization of the tractable classes of CQs (without constraints). Like Grohe’s characterization, all the above results assume that the arity of relation symbols is bounded by a constant. We also study the associated meta problems, i.e., whether a given OMQ or CQS is equivalent to one in which the actual query has bounded treewidth. https://dl.acm.org/doi/abs/10.1145/3375395.3387653
Tuesday 2:00 PM – 2:30 PM Resilience and Relevance
PODS Session 6 Session Chair: Andreas Pieris
New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman and Alexandra Meliou The resilience of a Boolean query on a database is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for conjunctive queries with self-joins, and, more specifically, we present a dichotomy result for the class of single-self-join binary queries with exactly two repeated relations occurring in the query. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Although restricted, our results provide important insights into the problem of self-joins that we hope can help solve the general case of all conjunctive queries with self-joins in the future. https://dl.acm.org/doi/abs/10.1145/3375395.3387647
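By definition, resilience can be computed by brute force over subsets of tuples; the paper's contribution is pinning down exactly when this can be done in polynomial time. A sketch (exponential time, for illustration; names are mine):

```python
from itertools import combinations

def resilience(db, query):
    # db: list of tuples; query: function(list of tuples) -> bool.
    # Returns the minimum number of tuples to delete to make query false.
    if not query(db):
        return 0
    for k in range(1, len(db) + 1):
        for removed in combinations(range(len(db)), k):
            gone = set(removed)
            rest = [t for i, t in enumerate(db) if i not in gone]
            if not query(rest):
                return k
    return len(db)  # query true even on the empty database
```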
The Impact of Negation on the Complexity of the Shapley Value in Conjunctive Queries
Alon Reshef, Benny Kimelfeld and Ester Livshits The Shapley value is a conventional and well-studied function for determining the contribution of a player to the coalition in a cooperative game. Among its applications in a plethora of domains, it has recently been proposed to use the Shapley value for quantifying the contribution of a tuple to the result of a database query. In particular, we have a thorough understanding of the tractability frontier for the class of Conjunctive Queries (CQs) and aggregate functions over CQs. It has also been established that a tractable (randomized) multiplicative approximation exists for every union of CQs. Nevertheless, all of these results are based on the monotonicity of CQs. In this work, we investigate the implication of negation on the complexity of Shapley computation, in both the exact and approximate senses. We generalize a known dichotomy to account for negated atoms. We also show that negation fundamentally changes the complexity of approximation. We do so by drawing a connection to the problem of deciding whether a tuple is “relevant” to a query, and by analyzing its complexity. https://dl.acm.org/doi/abs/10.1145/3375395.3387664
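The Shapley value of a tuple can be computed directly from its definition as the average marginal contribution over all insertion orders of the endogenous tuples; the complexity results above concern when this exponential computation can be avoided. A definitional sketch (names are mine):

```python
import math
from itertools import permutations

def shapley_values(endogenous, exogenous, query):
    # Shapley value of each endogenous tuple for a Boolean query:
    # average marginal contribution over all orders of insertion.
    # Exponential time by definition (illustration only).
    vals = {t: 0.0 for t in endogenous}
    for order in permutations(endogenous):
        present = list(exogenous)
        prev = 1 if query(present) else 0
        for t in order:
            present.append(t)
            cur = 1 if query(present) else 0
            vals[t] += cur - prev
            prev = cur
    n_orders = math.factorial(len(endogenous))
    return {t: v / n_orders for t, v in vals.items()}
```

For instance, if the query is true exactly when tuple a is present, then a receives the full credit of 1 and every other tuple receives 0.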
Tuesday 2:30 PM – 3:00 PM Database-driven Workflows and Transactions
PODS Session 7 Session Chair: Jan Van den Bussche
Projection Views of Register Automata
Luc Segoufin and Victor Vianu Register automata have been used as a convenient model for specifying and verifying database driven systems. An important problem in such systems is to provide views that hide or restructure certain information about the data or process, extending classical notions of database views. In this paper we carry out a formal investigation of views of register automata by considering simple views that project away some of the registers. We show that classical register automata are not able to describe such projections and introduce more powerful register automata that are able to do so. We also show useful properties of these automata such as closure under projection and decidability of verifying temporal properties of their runs. https://dl.acm.org/doi/abs/10.1145/3375395.3387651
Deciding Robustness for Lower SQL Isolation Levels
Bas Ketsman, Christoph Koch, Frank Neven and Brecht Vandevoort While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete, who obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this paper, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this paper is that we characterize robustness against both isolation levels in terms of the absence of counter example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference from Fekete’s work is that the properties of cycles obtained in this paper have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this paper is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness. https://dl.acm.org/doi/abs/10.1145/3375395.3387655
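Robustness asks whether every interleaving allowed by the isolation level is serializable. The classical building block is the conflict graph and its acyclicity test, sketched below; this is textbook conflict-serializability, not the paper's split-schedule characterization:

```python
def conflict_serializable(schedule):
    # schedule: list of (txn, op, item) with op in {'r', 'w'}.
    # A schedule is conflict-serializable iff its conflict graph (an edge
    # t1 -> t2 for each earlier conflicting operation pair) is acyclic.
    edges = {}
    for i, (t1, o1, x1) in enumerate(schedule):
        for t2, o2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and (o1 == 'w' or o2 == 'w'):
                edges.setdefault(t1, set()).add(t2)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        # Depth-first search; a back edge to a GRAY node means a cycle.
        color[u] = GRAY
        for v in edges.get(u, ()):
            if color.get(v, WHITE) == GRAY:
                return False
            if color.get(v, WHITE) == WHITE and not dfs(v):
                return False
        color[u] = BLACK
        return True
    return all(dfs(u) for u in list(edges) if color.get(u, WHITE) == WHITE)
```

For example, the interleaving r1(x) w2(x) w1(x) creates the cycle T1 → T2 → T1 and is not serializable, while w1(x) r2(x) is.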
Wednesday 10:30 AM – 12:00 PM Fine-Grained Complexity Analysis of Queries: from Decision to Counting and Enumeration
Tutorial 2 Session Chair: Yufei Tao
Fine-Grained Complexity Analysis of Queries: from Decision to Counting and Enumeration
Arnaud Durand This paper is devoted to a complexity study of various tasks related to query answering such as deciding if a Boolean query is true or not, counting the size of the answer set or enumerating the results. It is a survey of some of the many tools, from complexity measures through algorithmic methods to conditional lower bounds, that have been designed in this domain over the last years. https://dl.acm.org/doi/abs/10.1145/3375395.3389130
Wednesday 1:30 PM – 2:00 PM Languages
PODS Session 8 Session Chair: Benny Kimelfeld
Generative Datalog with Continuous Distributions
Martin Grohe, Benjamin Lucien Kaminski, Joost-Pieter Katoen and Peter Lindner Arguing for the need to combine declarative and probabilistic programming, Bárány et al. (TODS 2017) recently introduced a probabilistic extension of Datalog as a “purely declarative probabilistic programming language.” We revisit this language and propose a more foundational approach towards defining its semantics. It is based on standard notions from probability theory known as stochastic kernels and Markov processes. This allows us to extend the semantics to continuous probability distributions, thereby settling an open problem posed by Bárány et al. We show that our semantics is fairly robust, allowing both parallel execution and arbitrary chase orders when evaluating a program. We cast our semantics in the framework of infinite probabilistic databases (Grohe and Lindner, ICDT 2020), and we show that the semantics remains meaningful even when the input of a probabilistic Datalog program is an arbitrary probabilistic database. https://dl.acm.org/doi/abs/10.1145/3375395.3387659
Conjunctive Regular Path Queries with String Variables
Markus L. Schmid We introduce the class CXRPQ of conjunctive xregex path queries, which are obtained from conjunctive regular path queries (CRPQs) by adding string variables (also called backreferences) as found in practical implementations of regular expressions. CXRPQs can be considered user-friendly, since they combine two concepts that are well-established in practice: pattern-based graph queries and regular expressions with backreferences. Due to the string variables, CXRPQs can express inter-path dependencies, which are not expressible by CRPQs. The evaluation complexity of CXRPQs, if not further restricted, is PSpace-hard in data complexity. We identify three natural fragments with more acceptable evaluation complexity: their data complexity is in NL, while their combined complexity varies between ExpSpace, PSpace and NP. In terms of expressive power, we compare the CXRPQ-fragments with CRPQs and unions of CRPQs, and with extended conjunctive regular path queries (ECRPQs) and unions of ECRPQs. https://dl.acm.org/doi/abs/10.1145/3375395.3387663
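The backreference mechanism that CXRPQs borrow from practical regular expressions can be seen in a one-line example (illustrative only; the pattern below is ours): `\1` must repeat the exact string captured by group 1, so the pattern matches a^n b a^n for the same n, a language no classical regular expression can define.

```python
import re

# Backreference \1 forces both runs of a's to be identical strings.
pattern = re.compile(r'^(a+)b\1$')

balanced = bool(pattern.match('aaabaaa'))    # same run of a's: matches
unbalanced = bool(pattern.match('aaabaa'))   # runs differ: no match
```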
Wednesday 2:00 PM – 3:00 PM Query Enumeration and Aggregation
PODS Session 9 Session Chair: Paris Koutris
Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries
Ahmet Kara, Milos Nikolic, Dan Olteanu and Haozhe Zhang We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. The main result of this work defines the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By conveniently choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries. For a restricted class of hierarchical queries, our approach can achieve worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture. https://dl.acm.org/doi/abs/10.1145/3375395.3387646
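The heavy/light split at the heart of the approach can be sketched in a few lines (an illustration of the general technique, not the paper's data structure; the toy relation and threshold are ours): values whose degree exceeds a threshold are classified heavy and handled lazily, the rest are materialized eagerly.

```python
from collections import Counter

# Partition the values of a column into heavy (degree above the threshold)
# and light (degree at most the threshold). The paper tunes the threshold
# to trade preprocessing time, update time, and enumeration delay.

def partition_by_degree(column, threshold):
    degree = Counter(column)
    heavy = {v for v, d in degree.items() if d > threshold}
    light = {v for v, d in degree.items() if d <= threshold}
    return heavy, light

R = ['a', 'a', 'a', 'b', 'c']
heavy, light = partition_by_degree(R, threshold=2)
# 'a' occurs 3 times (> 2), so it is heavy; 'b' and 'c' are light.
```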
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration
Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld and Nicole Schweikardt As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this paper, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a “random access”: polylogarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation. We also identify a subclass of UCQs for which we can provide random access with polylogarithmic access time. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives. https://dl.acm.org/doi/abs/10.1145/3375395.3387662
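The relationship between the random-access and random-order tasks can be illustrated with a toy sketch (not the paper's polylogarithmic-time structures; here the "result" is simply a list): given random access answer(i) into a result of size n, shuffling the index set yields enumeration in uniformly random order.

```python
import random

# Random-order enumeration from random access: Fisher-Yates shuffle of the
# indices gives a uniformly random permutation, then each answer is fetched
# by position exactly once.

def random_order_enumerate(answer, n, rng=random):
    order = list(range(n))
    rng.shuffle(order)
    for i in order:
        yield answer(i)

result = ['t0', 't1', 't2', 't3']
tuples = list(random_order_enumerate(result.__getitem__, len(result)))
# Every answer appears exactly once, in a uniformly random order.
```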
Parallel Algorithms for Sparse Matrix Multiplication and Join-Aggregate Queries
Xiao Hu and Ke Yi In this paper, we design massively parallel algorithms for sparse matrix multiplication, as well as more general join-aggregate queries, where the join hypergraph is a tree with arbitrary output attributes. For each case, we obtain asymptotic improvement over existing algorithms. In particular, our matrix multiplication algorithm is shown to be optimal in the semiring model. https://dl.acm.org/doi/abs/10.1145/3375395.3387657
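A brief sketch of what computing "in the semiring model" means (a generic textbook formulation, not the paper's parallel algorithm): the only operations applied to matrix entries are the semiring's addition and multiplication, made explicit here so that, e.g., (min, +) can be swapped in for the tropical semiring.

```python
# Matrix multiplication parameterized by a semiring (add, mul, zero).

def semiring_matmul(A, B, add, mul, zero):
    n, k, m = len(A), len(B), len(B[0])
    C = [[zero] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = add(acc, mul(A[i][l], B[l][j]))
            C[i][j] = acc
    return C

A = [[1, 2], [0, 1]]
B = [[1, 0], [1, 1]]
C = semiring_matmul(A, B, add=lambda x, y: x + y,
                    mul=lambda x, y: x * y, zero=0)
# The ordinary (+, *) semiring recovers the standard matrix product.
```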
Aggregate Queries over Sparse Databases
Szymon Torunczyk We propose an algebraic framework for studying efficient algorithms for query evaluation, aggregation, enumeration, and maintenance under updates, on sparse databases. Our framework allows us to treat those problems in a unified way, by considering various semirings, depending on the problem at hand. As a concrete application, we propose a powerful query language extending first-order logic by aggregation in multiple semirings. We obtain an optimal algorithm for computing the answers of such queries on sparse databases. More precisely, given a database from a fixed class with bounded expansion, the algorithm computes in linear time a data structure which allows us to enumerate the set of answers to the query, with constant delay between two outputs. https://dl.acm.org/doi/abs/10.1145/3375395.3387660
Wednesday 3:00 PM – 3:30 PM Streams – Cycle Counting
PODS Session 10 Session Chair: Yufei Tao
Triangle and Four Cycle Counting in the Data Stream Model
Andrew McGregor and Sofya Vorotnikova The problem of estimating the number of cycles in a graph is one of the most widely studied graph problems in the data stream model. Three relevant variants of the data stream model include: the arbitrary order model, in which the stream consists of the edges of the graph in arbitrary order; the random order model, in which the edges are randomly permuted; and the adjacency list order model, in which all edges incident to the same vertex appear consecutively. In this paper, we focus on the problem of triangle and four-cycle counting in these models. We improve over the state-of-the-art results as follows, where n is the number of vertices, m is the number of edges and T is the number of triangles/four-cycles in the graph (i.e., the quantity being estimated): Random Order Model: We present a single-pass algorithm that (1+ε)-approximates the number of triangles using Õ(ε^(-2) m/√T) space and prove that this is optimal in the range T ≤ √m. The best previous result, a (3+ε)-approximation using Õ(ε^(-4.5) m/√T) space, was presented by Cormode and Jowhari (Theor. Comput. Sci. 2017). Adjacency List Model: We present an algorithm that returns a (1+ε)-approximation of the number of 4-cycles using two passes and Õ(ε^(-4) m/√T) space. The best previous result, a constant approximation using Õ(m/T^(3/8)) space, was presented by Kallaugher et al. (PODS 2019). We also show that (1+ε)-approximation in a single pass is possible in a) polylog(n) space if T = Ω(n^2) and b) Õ(n) space if T = Ω(n). Arbitrary Order Model: We present a three-pass algorithm that (1+ε)-approximates the number of 4-cycles using Õ(ε^(-2) m/T^(1/4)) space and a one-pass algorithm that uses Õ(ε^(-2) n) space when T = Ω(n^2). The best existing result, a (1+ε)-approximation using Õ(ε^(-2) m^2/T) space, was presented by Bera and Chakrabarti (STACS 2017). We also show a multi-pass lower bound and another algorithm for distinguishing graphs with no 4-cycles from graphs with many 4-cycles. https://dl.acm.org/doi/abs/10.1145/3375395.3387652
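For reference, the quantity T that these streaming algorithms approximate in sublinear space is the exact triangle count, computable offline as follows (a toy exact counter, not any of the streaming algorithms above):

```python
import itertools

# Exact triangle count: every common neighbor of an edge's endpoints
# closes a triangle; each triangle is then counted once per edge.

def triangle_count(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        count += len(adj[u] & adj[v])
    return count // 3  # each triangle was counted for all 3 of its edges

K4 = list(itertools.combinations(range(4), 2))  # complete graph on 4 vertices
# K4 contains C(4, 3) = 4 triangles.
```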
How the Degeneracy Helps for Triangle Counting in Graph Streams
Suman K. Bera and C. Seshadhri We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of m edges, our aim is to compute a (1±ε)-approximation to the triangle count T, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially Θ(min(m^(3/2)/T, m/√T)) (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for low degeneracy graphs. The degeneracy, K, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity Õ(mK/T). For constant degeneracy graphs, this bound is Õ(m/T), which is significantly smaller than both m^(3/2)/T and m/√T. We complement our algorithmic result with a nearly matching lower bound of Ω(mK/T). https://dl.acm.org/doi/abs/10.1145/3375395.3387665
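The degeneracy K admits a simple offline characterization (standard textbook peeling, not the streaming algorithm of the paper): repeatedly remove a minimum-degree vertex; K is the largest minimum degree encountered. A triangle has K = 2, a path K = 1.

```python
from collections import defaultdict

# Compute graph degeneracy by peeling minimum-degree vertices.

def degeneracy(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    k = 0
    while adj:
        u = min(adj, key=lambda x: len(adj[x]))  # peel a min-degree vertex
        k = max(k, len(adj[u]))
        for w in adj.pop(u):
            adj[w].discard(u)
    return k

tri_k = degeneracy([(0, 1), (1, 2), (0, 2)])   # triangle
path_k = degeneracy([(0, 1), (1, 2), (2, 3)])  # path on 4 vertices
```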