Research

For meeting with colleagues, chatting, and discussions, SIGMOD provides a virtual conference site in Gather.

SIGMOD Research Sessions for Tuesday
Research 5: Data Provenance • Tuesday 1:30 PM – 3:00 PM
mod0128 Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries
Pierre Bourhis (CNRS, UMR 9189 – CRIStAL); Daniel Deutch (Tel Aviv University); Yuval Moskovitch (Tel Aviv University) The algebraic approach for provenance tracking, originating in the semiring model of Green et al., has proven useful as an abstract way of handling metadata. Commutative semirings were shown to be the “correct” algebraic structure for unions of conjunctive queries, in the sense that their use allows provenance to be invariant under certain expected query equivalence axioms. In this paper we present the first (to our knowledge) algebraic provenance model, for a fragment of update queries, that is invariant under set equivalence. The fragment that we focus on is that of hyperplane queries, previously studied in multiple lines of work. Our algebraic provenance structure and corresponding provenance-aware semantics are based on the sound and complete axiomatization of Karabeg and Vianu. We demonstrate that our construction can guide the design of concrete provenance model instances for different applications. We further study the efficient generation and storage of provenance for hyperplane update queries. We show that a naive algorithm can lead to an exponentially large provenance expression, but remedy this by presenting a normal form which we show may be efficiently computed alongside query evaluation. We experimentally study the performance of our solution and demonstrate its scalability and usefulness, and in particular the effectiveness of our normal form representation. https://doi.org/10.1145/3318464.3380578
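The semiring model referenced in this abstract can be illustrated with a minimal sketch (the relations, annotation variables, and query below are hypothetical, and this shows only the classical query-provenance setting of Green et al., not the paper's update-query model): each input tuple carries a symbolic annotation, joins multiply annotations, and projection/union adds them, yielding a provenance polynomial per output tuple.

```python
from collections import defaultdict

# Semiring provenance for Q(x, z) :- R(x, y), S(y, z): join = product, projection = sum.
R = [(("a", "b"), "r1"), (("a", "c"), "r2")]   # R(x, y) tuples with annotations
S = [(("b", "d"), "s1"), (("c", "d"), "s2")]   # S(y, z) tuples with annotations

provenance = defaultdict(list)
for (x, y), a in R:
    for (y2, z), b in S:
        if y == y2:
            provenance[(x, z)].append(f"{a}*{b}")   # each joining pair contributes a monomial

for out_tuple, monomials in provenance.items():
    print(out_tuple, "->", " + ".join(monomials))    # ('a', 'd') -> r1*s1 + r2*s2
```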
mod0290 Causality-Guided Adaptive Interventional Debugging
Anna Fariha (University of Massachusetts Amherst); Suman Nath (Microsoft Research); Alexandra Meliou (University of Massachusetts Amherst) Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application’s intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them. https://doi.org/10.1145/3318464.3389694
mod0071 PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
Yinjun Wu (University of Pennsylvania); Val Tannen (University of Pennsylvania); Susan B. Davidson (University of Pennsylvania) The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the machine learning model has to be retrained after removing problematic training samples during cleaning, or after selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate it experimentally. Experimental results show that up to two orders of magnitude speed-ups can be achieved by PrIU-opt compared to simply retraining the model from scratch, yet obtaining highly similar models. https://doi.org/10.1145/3318464.3380571
mod0617 BugDoc: Algorithms to Debug Computational Processes
Raoni Lourenço (New York University); Juliana Freire (New York University); Dennis Shasha (New York University) Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software is available for use, reproducibility, and enhancement. https://doi.org/10.1145/3318464.3389763
mod0609 Computing Local Sensitivities of Counting Queries with Joins
Yuchao Tao (Duke University); Xi He (University of Waterloo); Ashwin Machanavajjhala (Duke University); Sudeepa Roy (Duke University) Local sensitivity of a query Q given a database instance D, i.e., how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and differential privacy. However, it is NP-hard to find local sensitivity of a conjunctive query in terms of the size of the query, even for the class of acyclic queries. Although the complexity is polynomial when the query size is fixed, the naive algorithms are not efficient for large databases and queries involving multiple joins. In this paper, we present a novel approach to compute local sensitivity of counting queries involving join operations by tracking and summarizing tuple sensitivities. We give algorithms for the sensitivity problem for full acyclic join queries using join trees that run in polynomial time in both the size of the database and the query for an interesting sub-class of queries, which we call ‘doubly acyclic queries’ and which includes path queries, and in polynomial time in combined complexity when the maximum degree in the join tree is bounded. Our algorithms can be extended to certain non-acyclic queries using generalized hypertree decompositions. We evaluate our approach and show applications of our algorithms to obtain better results for differential privacy by orders of magnitude. https://doi.org/10.1145/3318464.3389762
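For readers unfamiliar with the notion, a brute-force computation of local sensitivity for a toy counting join query is sketched below (the schema and candidate-tuple domain are hypothetical, and only changes to R are considered); the paper's algorithms exist precisely to avoid this kind of per-tuple re-evaluation on large databases.

```python
from itertools import product

R = {("a", "b"), ("a", "c")}     # R(x, y)
S = {("b", "d"), ("c", "d")}     # S(y, z)

def q(r, s):
    """Counting query Q(D) = |R join S| on the shared attribute y."""
    return sum(1 for (x, y), (y2, z) in product(r, s) if y == y2)

def local_sensitivity_r(r, s, domain_r):
    """Max change of Q when one tuple of R is deleted or inserted (brute force)."""
    base = q(r, s)
    deltas = [abs(q(r - {t}, s) - base) for t in r]
    deltas += [abs(q(r | {t}, s) - base) for t in domain_r - r]
    return max(deltas)

domain_r = {(x, y) for x in "ab" for y in "bcd"}   # hypothetical active domain for R
print(local_sensitivity_r(R, S, domain_r))          # 1 on this toy instance
```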
Research 2: Serverless and Cloud Data Management • Tuesday 1:30 PM – 3:00 PM
mod0369 Transactional Causal Consistency for Serverless Computing
Chenggang Wu (University of California, Berkeley); Vikram Sreekanti (University of California, Berkeley); Joseph M. Hellerstein (University of California, Berkeley) We consider the setting of serverless Function-as-a-Service (FaaS) platforms, where storage services are disaggregated from the machines that support function execution. FaaS applications consist of compositions of functions, each of which may run on a separate machine and access remote storage. The challenge we address is improving I/O latency in this setting while also providing application-wide consistency. Previous work has explored providing causal consistency for individual I/Os by carefully managing the versions stored in a client-side data cache. In our setting, a single application may execute multiple functions across different nodes, and therefore issue interrelated I/Os to multiple distinct caches. This raises the challenge of Multisite Transactional Causal Consistency (MTCC): the ability to provide causal consistency for all I/Os within a given transaction even if it runs across multiple physical sites. We present protocols for MTCC implemented in a system called HYDROCACHE. Our evaluation demonstrates orders-of-magnitude performance improvements due to caching, while also protecting against consistency anomalies that otherwise arise frequently. https://doi.org/10.1145/3318464.3389710
mod0161 Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings
Tarique Siddiqui (Microsoft & University of Illinois at Urbana-Champaign); Alekh Jindal (Microsoft); Shi Qiao (Microsoft); Hiren Patel (Microsoft); Wangchao Le (Microsoft) Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, is akin to better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer. To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., the number of containers, during query planning. And third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate, and 20X more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage. https://doi.org/10.1145/3318464.3380584
mod0598 Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure
Ingo Müller (ETH Zürich); Renato Marroquín (ETH Zürich); Gustavo Alonso (ETH Zürich) Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions. https://doi.org/10.1145/3318464.3389758
mod0230s Starling: A Scalable Query Engine on Cloud Functions
Matthew Perron (Massachusetts Institute of Technology); Raul Castro Fernandez (University of Chicago); David DeWitt (Massachusetts Institute of Technology); Samuel Madden (Massachusetts Institute of Technology) Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when underutilized. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets. https://doi.org/10.1145/3318464.3380609
mod0339 Learning a Partitioning Advisor for Cloud Databases
Benjamin Hilprecht (TU Darmstadt); Carsten Binnig (TU Darmstadt); Uwe Röhm (The University of Sydney) Cloud vendors provide ready-to-use distributed DBMS solutions as a service. While the provisioning of a DBMS is usually fully automated, customers typically still have to make important design decisions which were traditionally made by the database administrator such as finding an optimal partitioning scheme for a given database schema and workload. In this paper, we introduce a new learned partitioning advisor based on Deep Reinforcement Learning (DRL) for OLAP-style workloads. The main idea is that a DRL agent learns the cost tradeoffs of different partitioning schemes and can thus automate the partitioning decision. In the evaluation, we show that our advisor is able to find non-trivial partitionings for a wide range of workloads and outperforms more classical approaches for automated partitioning design. https://doi.org/10.1145/3318464.3389704
Research 7: Security, Privacy, and Blockchain • Tuesday 1:30 PM – 3:00 PM
mod0565s Querying Shared Data with Security Heterogeneity
Yang Cao (University of Edinburgh); Wenfei Fan (University of Edinburgh, Shenzhen University, & Beihang University); Yanghao Wang (University of Edinburgh); Ke Yi (Hong Kong University of Science and Technology) There has been increasing need for secure data sharing. In practice a group of data owners often adopt a heterogeneous security scheme under which each pair of parties decides their own protocol to share data with diverse levels of trust. The scheme also keeps track of how the data is used. This paper studies distributed SQL query answering in the heterogeneous security setting. We define query plans by incorporating toll functions determined by data sharing agreements and reflected in the use of various security facilities. We formalize query answering as a bi-criteria optimization problem, to minimize both data sharing toll and parallel query evaluation cost. We show that this problem is PSPACE-hard for SQL and Σ₃ᵖ-hard for SPC, and it is in NEXPTIME. Despite the hardness, we develop a set of approximate algorithms to generate distributed query plans that minimize data sharing toll and reduce parallel evaluation cost. Using real-life and synthetic data, we empirically verify the effectiveness, scalability and efficiency of our algorithms. https://doi.org/10.1145/3318464.3389784
mod0058 SAGMA: Secure Aggregation Grouped by Multiple Attributes
Timon Hackenjos (FZI Research Center for Information Technology); Florian Hahn (University of Twente); Florian Kerschbaum (University of Waterloo) Encryption can protect data in outsourced databases — in the cloud — while still enabling query processing over the encrypted data. However, the processing leaks information about the data specific to the type of query, e.g., aggregation queries. Aggregation over user-defined groups using SQL’s GROUP BY clause is extensively used in data analytics, e.g., to calculate the total number of visitors each month or the average salary in each department. The information leaked, e.g., the access pattern to a group, may reveal the group’s frequency enabling simple, yet detrimental leakage-abuse attacks. In this work we present SAGMA — an encryption scheme for performing secure aggregation grouped by multiple attributes. The querier can choose any combination of one or multiple attributes in the GROUP BY clause among the set of all grouping attributes. The encryption scheme only stores semantically secure ciphertexts at the cloud and query processing hides the access pattern, i.e., the frequency of each group. We implemented our scheme and our evaluation results underpin its practical feasibility. https://doi.org/10.1145/3318464.3380569
mod0234 Cryptε: Crypto-Assisted Differential Privacy on Untrusted Servers
Amrita Roy Chowdhury (University of Wisconsin-Madison); Chenghong Wang (Duke University); Xi He (University of Waterloo); Ashwin Machanavajjhala (Duke University); Somesh Jha (University of Wisconsin-Madison) Differential privacy (DP) is currently the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the “central” or “local” model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at the cost of utility and algorithmic expressibility as compared to the central model. In this work, we propose Cryptε, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model and (2), like the local model, requires no trusted data collector. Cryptε achieves the “best of both worlds” by employing two non-colluding untrusted servers that run DP programs on encrypted data from the data owners. In theory, straightforward implementations of DP programs using off-the-shelf secure multi-party computation tools can achieve the above goal. However, in practice, they are beset with many challenges like poor performance and tricky security proofs. To this end, Cryptε allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP (for computationally bounded adversaries) for all Cryptε programs. Cryptε supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, we propose performance optimizations leveraging the fact that the output is noisy. We demonstrate Cryptε’s practical feasibility with extensive empirical evaluations on real world datasets. https://doi.org/10.1145/3318464.3380596
mod0326 Estimating Numerical Distributions under Local Differential Privacy
Zitao Li (Purdue University); Tianhao Wang (Purdue University); Milan Lopuhaä-Zwakenberg (Eindhoven University of Technology); Ninghui Li (Purdue University); Boris Škorić (Eindhoven University of Technology) When collecting information, local differential privacy (LDP) relieves users’ concern about privacy leakage, as each user’s private information is randomized before being sent to the aggregator. We study the problem of recovering the distribution over a numerical domain while satisfying LDP. While one can discretize a numerical domain and then apply the protocols developed for categorical domains, we show that taking advantage of the numerical nature of the domain results in a better trade-off between privacy and utility. We introduce a new reporting mechanism, called the square wave (SW) mechanism, which exploits the numerical nature in reporting. We also develop an Expectation Maximization with Smoothing (EMS) algorithm, which is applied to aggregated histograms from the SW mechanism to estimate the original distributions. Extensive experiments demonstrate that our proposed approach, SW with EMS, consistently outperforms other methods in a variety of utility metrics. https://doi.org/10.1145/3318464.3389700
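For context, the simplest ε-LDP report for a bounded numerical value just adds Laplace noise scaled to the domain width before the value leaves the user's device, as in the sketch below; this is a generic baseline, not the SW mechanism or the EMS post-processing proposed in the paper.

```python
import numpy as np

def laplace_ldp_report(value, lo, hi, eps, rng):
    """Naive eps-LDP report for a value in [lo, hi]: the worst-case change of the
    input is (hi - lo), so Laplace noise with scale (hi - lo) / eps suffices."""
    return value + rng.laplace(scale=(hi - lo) / eps)

rng = np.random.default_rng(0)
true_values = rng.uniform(0.0, 1.0, size=10_000)          # one value per user
reports = np.array([laplace_ldp_report(v, 0.0, 1.0, 1.0, rng) for v in true_values])
print(true_values.mean(), reports.mean())                  # aggregate stays close; individuals are noisy
```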
mod0227 FalconDB: Blockchain-based Collaborative Database
Yanqing Peng (University of Utah); Min Du (University of California, Berkeley); Feifei Li (University of Utah); Raymond Cheng (University of California, Berkeley); Dawn Song (University of California, Berkeley) Nowadays, an emerging class of applications is based on collaboration over a shared database among different entities. However, existing solutions for shared databases may require trust in other parties, have high hardware demands that are unaffordable for individual users, or have relatively low performance. In other words, there is a trilemma among security, compatibility and efficiency. In this paper, we present FalconDB, which enables different parties with limited hardware resources to efficiently and securely collaborate on a database. FalconDB adopts database servers with verification interfaces accessible to clients and stores the digests for query/update authentications on a blockchain. Using blockchain as a consensus platform and a distributed ledger, FalconDB is able to work without any mutual trust among the parties. Meanwhile, FalconDB requires only minimal storage cost on each client, and provides anywhere-available, real-time and concurrent access to the database. As a result, FalconDB overcomes the disadvantages of previous solutions, and enables individual users to participate in the collaboration with high efficiency, low storage cost and blockchain-level security guarantees. https://doi.org/10.1145/3318464.3380594
Research 8: Graph Query Processing • Tuesday 1:30 PM – 3:00 PM
mod0482s Exact Single-Source SimRank Computation on Large Graphs
Hanzhi Wang (Renmin University of China); Zhewei Wei (Renmin University of China); Ye Yuan (Beijing Institute of Technology); Xiaoyong Du (Renmin University of China); Ji-Rong Wen (Renmin University of China) SimRank is a popular measurement for evaluating the node-to-node similarities based on the graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, Power Method, is computationally infeasible on graphs with more than 10^6 nodes. Consequently, no existing work has evaluated the actual trade-offs between query time and accuracy on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs. With high probability, this algorithm produces ground truths with a rigorous theoretical guarantee. We conduct extensive experiments on real-world datasets to demonstrate the efficiency of ExactSim. The results show that ExactSim provides the ground truth for any single-source SimRank query with a precision up to 7 decimal places within a reasonable query time. https://doi.org/10.1145/3318464.3389781
mod0495 Distributed Processing of k Shortest Path Queries over Dynamic Road Networks
Ziqiang Yu (Yantai University); Xiaohui Yu (York University); Nick Koudas (University of Toronto); Yang Liu (Wilfrid Laurier University); Yifan Li (York University & Key Laboratory of Urban Land Resources Monitoring and Simulation, MNR); Yueting Chen (York University); Dingyu Yang (Alibaba Group) The problem of identifying the k-shortest paths (KSPs for short) in a dynamic road network is essential to many location-based services. Road networks are dynamic in the sense that the weights of the edges in the corresponding graph constantly change over time, representing evolving traffic conditions. Very often such services have to process numerous KSP queries over large road networks at the same time, thus there is a pressing need to identify distributed solutions for this problem. However, most existing approaches are designed to identify KSPs on a static graph in a sequential manner (i.e., the (i+1)-th shortest path is generated based on the i-th shortest path), restricting their scalability and applicability in a distributed setting. We therefore propose KSP-DG, a distributed algorithm for identifying k-shortest paths in a dynamic graph. It is based on partitioning the entire graph into smaller subgraphs, and reduces the problem of determining KSPs into the computation of partial KSPs in relevant subgraphs, which can execute in parallel on a cluster of servers. A distributed two-level index called DTLP is developed to facilitate the efficient identification of relevant subgraphs. A salient feature of DTLP is that it indexes a set of virtual paths that are insensitive to varying traffic conditions, leading to very low maintenance cost in dynamic road networks. This is the first treatment of the problem of processing KSP queries over dynamic road networks. Extensive experiments conducted on real road networks confirm the superiority of our proposal over baseline methods. https://doi.org/10.1145/3318464.3389735
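The sequential baseline mentioned above (each next-shortest path derived from the previous one) corresponds to Yen-style enumeration; a single-machine sketch on a toy weighted graph, using networkx, is shown below with made-up edge weights standing in for travel times.

```python
import itertools
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("a", "b", 1.0), ("b", "d", 2.0),
    ("a", "c", 2.0), ("c", "d", 1.5),
    ("b", "c", 0.5),
])

# Sequential k-shortest-paths enumeration: path i+1 depends on path i, which is
# exactly the dependency that makes naive distribution across servers hard.
k = 3
for path in itertools.islice(nx.shortest_simple_paths(G, "a", "d", weight="weight"), k):
    print(nx.path_weight(G, path, weight="weight"), path)
```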
mod0048 On the Optimization of Recursive Relational Queries: Application to Graph Queries
Louis Jachiet (LTCI, Télécom Paris); Pierre Genevès (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG); Nils Gesbert (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG); Nabil Layaida (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG) Graph databases have received a lot of attention as they are particularly useful in many applications such as social networks, life sciences and the semantic web. Various languages have emerged to query graph databases, many of which embed forms of recursion, which prove essential for navigating graphs. The relational model has benefited from a huge body of research in the last half century, and that is why many graph databases rely on techniques of relational query engines. Since its introduction, the relational model has seen various attempts to extend it with recursion, and it is now possible to use recursion in several SQL or Datalog based database systems. The optimization of recursive queries remains, however, a challenge. We propose mu-RA, a variation of the Relational Algebra equipped with a fixpoint operator for expressing recursive relational queries. mu-RA can notably express unions of conjunctive regular path queries. Leveraging the fact that this fixpoint operator makes recursive terms more amenable to algebraic transformations, we propose new rewrite rules. These rules make it possible to generate new query execution plans that cannot be obtained with previous approaches. We present the syntax and semantics of mu-RA, and the rewriting rules that we specifically devised to tackle the optimization of recursive queries. We report on practical experiments that show that the newly generated plans can provide significant performance improvements for evaluating recursive queries over graphs. https://doi.org/10.1145/3318464.3380567
mod0214 Pensieve: Skewness-Aware Version Switching for Efficient Graph Processing
Tangwei Ying (Huazhong University of Science and Technology); Hanhua Chen (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology) Multi-version graph processing has recently attracted much research effort. Existing multi-version graph storage designs use either copy-based schemes or delta-based schemes. A copy-based scheme stores every version separately and may lead to expensive space cost due to high storage redundancy. On the contrary, a delta-based scheme only stores incremental deltas between different versions and relies on delta computation for version switching. In this work, we observe: 1) high-degree vertices incur much more significant storage overheads as graph versions evolve compared to low-degree vertices; 2) the skewed access frequency among graph versions greatly influences the system performance for version reproduction. Based on these observations, we propose Pensieve, a skewness-aware multi-version graph processing system. Two factors contribute to the efficiency of Pensieve. First, Pensieve leverages a differentiated graph storage strategy that stores low-degree vertices using a copy-based scheme and high-degree ones using a delta-based scheme. Such a design achieves a good trade-off between storage cost and version switching time for multi-version graph processing. Second, the Pensieve graph storage exploits the time locality of graph version access and designs a novel last-root version switching scheme, which significantly improves the access efficiency for recent versions. We implement Pensieve on top of Ligra, and conduct comprehensive experiments to evaluate the performance of this design using large-scale datasets collected from real-world systems. The results show that Pensieve substantially outperforms state-of-the-art designs in terms of memory consumption and version switching time. https://doi.org/10.1145/3318464.3380590
mod0166 Extending Graph Patterns with Conditions
Grace Fan (Brown University); Wenfei Fan (University of Edinburgh, Beihang University & Shenzhen University); Yuanhao Li (University of Edinburgh & Shenzhen University); Ping Lu (Beihang University); Chao Tian (Alibaba Group); Jingren Zhou (Alibaba Group) We propose an extension of graph patterns, referred to as conditional graph patterns and denoted as CGPs. In a CGP, one can specify a simple condition on each edge such that the edge exists if and only if the condition is satisfied. We show that CGPs allow us to catch missing links, increase the expressivity of graph functional dependencies, and provide a succinct representation of graph patterns. We settle the complexity of their consistency, matching, incremental matching and containment problems: linear time, NP-complete, NP-complete and Π₂ᵖ-complete, respectively. These tell us that despite the increased expressive power of CGPs, the matching and incremental matching problems for CGPs are no harder than their counterparts for conventional patterns. We develop algorithms for matching and incremental matching of CGPs, and for (incremental) multi-CGP matching and optimization. Using real-life and synthetic graphs, we empirically verify the efficiency and effectiveness of our algorithms. https://doi.org/10.1145/3318464.3380585
SIGMOD Research Sessions for Wednesday
Research 13: Data Matching • Wednesday 10:30 AM – 12:00 PM
mod0235 A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
Venkata Vamsikrishna Meduri (Arizona State University); Lucian Popa (IBM Research, Almaden); Prithviraj Sen (IBM Research, Almaden); Mohamed Sarwat (Arizona State University) Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) upon them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to what active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, #labels, and example selection latencies. Our most surprising result finds that active learning with fewer labels can learn a classifier of comparable quality as supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model. https://doi.org/10.1145/3318464.3380597
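The kind of active-learning loop such a benchmark framework exercises can be sketched generically as below (uncertainty sampling over pre-computed pair-similarity features with a logistic-regression learner; all names are placeholders rather than the framework's actual API, and the seed sample is assumed to contain both matches and non-matches).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_em(X, oracle, n_seed=20, n_rounds=10, batch=10, seed=0):
    """X: similarity feature vectors of candidate record pairs.
    oracle(i): returns the true match/non-match label of pair i (e.g., a human labeler)."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=n_seed, replace=False))
    y = {i: oracle(i) for i in labeled}
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        clf.fit(X[labeled], [y[i] for i in labeled])
        proba = clf.predict_proba(X)[:, 1]
        # Query the oracle on the pairs the current model is least certain about.
        uncertain = [i for i in np.argsort(np.abs(proba - 0.5)) if i not in y]
        for i in uncertain[:batch]:
            y[i] = oracle(i)
            labeled.append(i)
    return clf
```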
mod0530 ZeroER: Entity Resolution using Zero Labeled Examples
Renzhi Wu (Georgia Institute of Technology); Sanya Chaba (Georgia Institute of Technology); Saurabh Sawlani (Georgia Institute of Technology); Xu Chu (Georgia Institute of Technology); Saravanan Thirumuruganathan (QCRI, HBKU) Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve state-of-the-art results, they require a large number of labeled examples that are expensive to obtain and oftentimes infeasible. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet can achieve performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach dubbed ZeroER. Our approach is based on a simple observation — the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves comparable performance to supervised approaches. https://doi.org/10.1145/3318464.3389743
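The core observation (match and non-match similarity vectors come from different distributions) can be illustrated with an off-the-shelf two-component Gaussian mixture over pair-similarity features, as in the sketch below on synthetic data; ZeroER itself adds the adaptive regularization and transitivity handling described in the abstract.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic similarity vectors: non-matching pairs cluster low, matching pairs high.
unmatches = rng.normal(loc=0.2, scale=0.1, size=(950, 3))
matches = rng.normal(loc=0.8, scale=0.1, size=(50, 3))
X = np.vstack([unmatches, matches])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # no labels used
match_component = np.argmax(gmm.means_.mean(axis=1))           # the high-similarity component
predicted_matches = (gmm.predict(X) == match_component).sum()
print(f"predicted matches: {predicted_matches} (ground truth: 50)")
```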
mod0078 Towards Interpretable and Learnable Risk Analysis for Entity Resolution
Zhaoqiang Chen (Northwestern Polytechnical University); Qun Chen (Northwestern Polytechnical University); Boyi Hou (Northwestern Polytechnical University); Zhanhuai Li (Northwestern Polytechnical University); Guoliang Li (Tsinghua University) Machine-learning-based entity resolution has been widely studied. However, some entity pairs may be mislabeled by machine learning models, and existing studies do not address the risk analysis problem — predicting and interpreting which entity pairs are mislabeled. In this paper, we propose an interpretable and learnable framework for risk analysis, which aims to rank the labeled pairs based on their risks of being mislabeled. We first describe how to automatically generate interpretable risk features, and then present a learnable risk model and its training technique. Finally, we empirically evaluate the performance of the proposed approach on real data. Our extensive experiments show that the learned risk model can identify the mislabeled pairs with considerably higher accuracy than the existing alternatives. https://doi.org/10.1145/3318464.3380572
mod0608 SLIM: Scalable Linkage of Mobility Data
Fuat Basık (Amazon Web Services); Hakan Ferhatosmanoğlu (University of Warwick); Buğra Gedik (Bilkent University) We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications such as linking user identities for security, understanding privacy limitations of location-based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility-based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. In the experimental evaluation, SLIM outperforms the two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings two to four orders of magnitude speedup. https://doi.org/10.1145/3318464.3389761
mod0067 Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach
Yaoshu Wang (Shenzhen University); Chuan Xiao (Osaka University & Nagoya University); Jianbin Qin (Shenzhen University); Xin Cao (The University of New South Wales); Yifang Sun (The University of New South Wales); Wei Wang (The University of New South Wales); Makoto Onizuka (Osaka University) In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficiently is essential to many data management applications, especially for query optimization. Moreover, in some applications the estimated cardinality is supposed to be consistent and interpretable. Hence a monotonic estimation w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms original data and threshold to a Hamming space, in which a deep learning-based regression model is utilized to exploit the incremental property of cardinality w.r.t. the threshold for both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and the efficiency of our method through experiments, and show how it improves the performance of a query optimizer. https://doi.org/10.1145/3318464.3380570
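One standard way to force a learned estimate to be monotone in the query threshold is to have the model emit non-negative increments per threshold step and accumulate them, as in the hedged sketch below; this illustrates only the monotonicity constraint, not the paper's Hamming-space feature extraction or its regression architecture.

```python
import numpy as np

def monotone_cardinalities(raw_outputs):
    """raw_outputs: one model output per increasing threshold value.
    A softplus keeps each increment non-negative, so the cumulative sum is a
    cardinality estimate that never decreases as the threshold grows."""
    increments = np.log1p(np.exp(np.asarray(raw_outputs, dtype=float)))
    return np.cumsum(increments)

print(monotone_cardinalities([2.0, -1.0, 0.5, -3.0]))   # non-decreasing by construction
```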
Research 18: Main Memory Databases and Modern Hardware • Wednesday 10:30 AM – 12:00 PM
mod0159 Order-Preserving Key Compression for In-Memory Search Trees
Huanchen Zhang (Carnegie Mellon University); Xiaoxuan Liu (Carnegie Mellon University); David G. Andersen (Carnegie Mellon University); Michael Kaminsky (BrdgAI); Kimberly Keeton (Hewlett Packard Labs); Andrew Pavlo (Carnegie Mellon University) We present the High-speed Order-Preserving Encoder (HOPE) for in-memory search trees. HOPE is a fast dictionary-based compressor that encodes arbitrary keys while preserving their order. HOPE’s approach is to identify common key patterns at a fine granularity and exploit the entropy to achieve high compression rates with a small dictionary. We first develop a theoretical model to reason about order-preserving dictionary designs. We then select six representative compression schemes using this model and implement them in HOPE. These schemes make different trade-offs between compression rate and encoding speed. We evaluate HOPE on five data structures used in databases: SuRF, ART, HOT, B+tree, and Prefix B+tree. Our experiments show that using HOPE allows the search trees to achieve lower query latency (up to 40% lower) and better memory efficiency (up to 30% smaller) simultaneously for most string key workloads. https://doi.org/10.1145/3318464.3380583
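A minimal illustration of why order-preserving encoding is possible at all (much simpler than HOPE's variable-length pattern dictionaries): if every symbol is mapped to a fixed-width code assigned in alphabet order, byte-wise comparison of encoded keys yields the same order as comparing the original strings.

```python
# Fixed-width, rank-ordered symbol codes preserve the lexicographic order of keys.
alphabet = sorted(set("abcdefgh"))
code = {ch: bytes([i]) for i, ch in enumerate(alphabet)}

def encode(key: str) -> bytes:
    return b"".join(code[ch] for ch in key)

keys = ["badge", "bad", "face", "cab"]
assert sorted(keys) == sorted(keys, key=encode)        # same order before and after encoding
print([encode(k).hex() for k in sorted(keys)])
```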
mod0229 A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics
Anil Shanbhag (Massachusetts Institute of Technology); Samuel Madden (Massachusetts Institute of Technology); Xiangyao Yu (University of Wisconsin-Madison) There has been a significant amount of excitement and recent work on GPU-based database systems. Previous work has claimed that these systems can perform orders of magnitude better than CPU-based database systems on analytical workloads such as those found in decision support and business intelligence applications. A hardware expert would view these claims with suspicion. Given the general notion that database operators are memory-bandwidth bound, one would expect the maximum gain to be roughly equal to the ratio of the memory bandwidth of the GPU to that of the CPU. In this paper, we adopt a model-based approach to understand when and why the performance gains of running queries on GPUs vs. on CPUs vary from the bandwidth ratio (which is roughly 16× on modern hardware). We propose Crystal, a library of parallel routines that can be combined together to run full SQL queries on a GPU with minimal materialization overhead. We implement individual query operators to show that while the speedups for selection, projection, and sorts are near the bandwidth ratio, joins achieve less speedup due to differences in hardware capabilities. Interestingly, we show on a popular analytical workload that the full query performance gain from running on a GPU exceeds the bandwidth ratio despite individual operators having speedups less than the bandwidth ratio, as a result of limitations of vectorizing chained operators on CPUs, resulting in a 25× speedup for GPUs over CPUs on the benchmark. https://doi.org/10.1145/3318464.3380595
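The bandwidth-ratio argument is back-of-the-envelope arithmetic: for a memory-bound scan, the expected speedup is roughly the ratio of GPU to CPU memory bandwidth. With purely illustrative (not measured) figures chosen to match the roughly 16× ratio quoted above:

```python
# Hypothetical bandwidth figures for illustration only.
gpu_bandwidth_gb_s = 880.0    # discrete-GPU HBM bandwidth (assumed)
cpu_bandwidth_gb_s = 55.0     # single-socket CPU DRAM bandwidth (assumed)

scan_gb = 32.0                # column data a memory-bound scan must read, in GB
cpu_time = scan_gb / cpu_bandwidth_gb_s
gpu_time = scan_gb / gpu_bandwidth_gb_s
print(f"expected speedup ~ {cpu_time / gpu_time:.0f}x")   # ~16x for these numbers
```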
mod0341 Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects
Clemens Lutz (DFKI GmbH); Sebastian Breß (TU Berlin); Steffen Zeuch (DFKI GmbH); Tilmann Rabl (HPI, University of Potsdam); Volker Markl (DFKI GmbH, TU Berlin) GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speed-ups of up to 18x over PCI-e 3.0 and up to 7.3x over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing. https://doi.org/10.1145/3318464.3389705
mod0451 Robust Performance of Main Memory Data Structures by Configuration
Tiemo Bang (Technical University of Darmstadt & SAP SE); Ismail Oukid (Snowflake Inc.); Norman May (SAP SE); Ilia Petrov (Reutlingen University); Carsten Binnig (Technical University of Darmstadt) In this paper, we present a new approach for achieving robust performance of data structures, making it easier to reuse the same design not only for different hardware generations but also for different workloads. To achieve robust performance, the main idea is to strictly separate the data structure design from the actual strategies to execute access operations, and to adjust the actual execution strategies by means of so-called configurations instead of hard-wiring the execution strategy into the data structure. In our evaluation we demonstrate the benefits of this configuration approach for individual data structures as well as complex OLTP workloads. https://doi.org/10.1145/3318464.3389725
mod0215 Black or White: How to Develop an AutoTuner for Memory-based Analytics
Mayuresh Kunjir (Duke University); Shivnath Babu (Unravel Data Systems) There is a lot of interest today in building autonomous (or, self-driving) data processing systems. An emerging school of thought is to leverage AI-driven “black box” algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. We show that an empirically-driven “white-box” algorithm, called RelM, that we have developed provides a close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven “black box” algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG). The main reason for RelM’s superior performance is that the memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by resource managers like Kubernetes and YARN, (ii) at the container level among the OS, pods, and processes such as the Java Virtual Machine (JVM), (iii) at the application level for caching, aggregation, data shuffles, and application data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation. RelM understands these interactions and uses them in building an analytical solution to autotune the memory management knobs. In another contribution, called Guided-BO (GBO), we use RelM’s analytical models to speed up BO. Through an evaluation based on Apache Spark, we showcase that RelM’s recommendations are significantly better than what commonly-used Spark deployments provide, and are close to the ones obtained by brute-force exploration; while GBO provides optimality guarantees for a higher, but still significantly lower, cost overhead compared to the state-of-the-art AI-driven policies. https://doi.org/10.1145/3318464.3380591
Research 15: Machine Learning for Cleaning, Integration, and Search • Wednesday 10:30 AM – 12:00 PM
mod0077s Learning to Validate the Predictions of Black Box Classifiers on Unseen Data
Sebastian Schelter (New York University); Tammo Rukat (Amazon Research); Felix Biessmann (Beuth University Berlin) Machine Learning (ML) models are difficult to maintain in production settings. In particular, deviations of the unseen serving data (for which we want to compute predictions) from the source data (on which the model was trained) pose a central challenge, especially when model training and prediction are outsourced via cloud services. Errors or shifts in the serving data can affect the predictive quality of a model, but are hard to detect for engineers operating ML deployments. We propose a simple approach to automate the validation of deployed ML models by estimating the model’s predictive performance on unseen, unlabeled serving data. In contrast to existing work, we do not require explicit distributional assumptions on the dataset shift between the source and serving data. Instead, we rely on a programmatic specification of typical cases of dataset shift and data errors. We use this information to learn a performance predictor for a pretrained black box model that automatically raises alarms when it detects performance drops on unseen serving data. We experimentally evaluate our approach on various datasets, models and error types. We find that it reliably predicts the performance of black box models in the majority of cases, and outperforms several baselines even in the presence of unspecified data errors. https://doi.org/10.1145/3318464.3380604
mod0358 Learning Over Dirty Data Without Cleaning
Jose Picado (Oregon State University); John Davis (Oregon State University); Arash Termehchy (Oregon State University); Ga Young Lee (Oregon State University) Real-world datasets are dirty and contain many errors, such as violations of integrity constraints and entity duplicates. Learning over dirty databases may result in inaccurate models. Data scientists spend most of their time on preparing and repairing data errors to create clean databases for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose Dirty Learn, DLearn, a novel learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean versions of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently. https://doi.org/10.1145/3318464.3389708
mod0308 Complaint-driven Training Data Debugging for Query 2.0
Weiyuan Wu (Simon Fraser University); Lampros Flokas (Columbia University); Eugene Wu (Columbia University); Jiannan Wang (Simon Fraser University) As the need for machine learning (ML) increases rapidly across all industry sectors, there is a significant interest among commercial database providers to support “Query 2.0”, which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query’s intermediate or final output, and aims to return a minimum set of training examples so that if they were removed, the complaints would be resolved. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions which both require linear retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively. https://doi.org/10.1145/3318464.3389696
mod0529 Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks
Riccardo Cappuzzo (EURECOM); Paolo Papotti (EURECOM); Saravanan Thirumuruganathan (QCRI, HBKU) Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise’s relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively “describe” the similarity across elements (tokens, attributes, rows) in the two datasets. The embeddings are learned based on such sentences. Third, we propose effective optimization to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution both in supervised and unsupervised settings. https://doi.org/10.1145/3318464.3389742
mod0581 Minimization of Classifier Construction Cost for Search Queries
Shay Gershtein (Tel Aviv University); Tova Milo (Tel Aviv University); Gefen Morami (Tel Aviv University); Slava Novgorodov (eBay Research) Search over massive sets of items is the cornerstone of many modern applications. Users express a set of properties and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether an item satisfies the search criteria is not explicitly recorded in the repository. Instead, it may be general knowledge or “hidden” in a picture/description, leading to incomplete search results. To overcome this problem, companies build dedicated classifiers that determine which items satisfy the given criteria. However, building classifiers requires volumes of high-quality labeled training data. Since the costs of training classifiers for different subsets of properties can vastly differ, the choice of which classifiers to train has great monetary significance. The goal of our research is to devise effective algorithms to choose which classifiers one should train to address a given query load while minimizing the cost. Previous work considered a simplified model with uniform classifier costs, and queries with two properties. We remove these restrictions in our model. We prove NP-hard inapproximability bounds and devise several algorithms with approximation guarantees. Moreover, we identify a common special case for which we provide an exact algorithm. Our experiments, performed over real-life datasets, demonstrate the effectiveness and efficiency of our algorithms. https://doi.org/10.1145/3318464.3389755
Research 20: Graph Data Management and Analysis • Wednesday 10:30 AM – 12:00 PM
mod0536 Application Driven Graph Partitioning
Wenfei Fan (University of Edinburgh, Beihang University & Shenzhen University); Ruochun Jin (University of Edinburgh); Muyang Liu (University of Edinburgh); Ping Lu (BDBC, Beihang University); Xiaojian Luo (Alibaba Group); Ruiqi Xu (University of Edinburgh); Qiang Yin (Alibaba Group); Wenyuan Yu (Alibaba Group); Jingren Zhou (Alibaba Group) Graph partitioning is crucial to parallel computations on large graphs. The choice of partitioning strategies has strong impact on not only the performance of graph algorithms, but also the design of the algorithms. For an algorithm of our interest, what partitioning strategy fits it the best and improves its parallel execution? Is it possible to develop graph algorithms with partition transparency, such that the algorithms work under different partitions without changes? This paper aims to answer these questions. We propose an application-driven hybrid partitioning strategy that, given a graph algorithm A, learns a cost model for A as polynomial regression. We develop partitioners that given the learned cost model, refine an edge-cut or vertex-cut partition to a hybrid partition and reduce the parallel cost of A. Moreover, we identify a general condition under which graph-centric algorithms are partition transparent. We show that a number of graph algorithms can be made partition transparent. Using real-life and synthetic graphs, we experimentally verify that our partitioning strategy improves the performance of a variety of graph computations, up to 22.5 times. https://doi.org/10.1145/3318464.3389745
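The cost-model fitting step described above (polynomial regression from partition statistics to the measured parallel cost of algorithm A) can be sketched as follows; the feature names and numbers are hypothetical stand-ins for whatever statistics the partitioner actually extracts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical training data: per-partitioning features -> measured runtime of algorithm A.
# Example features: [largest fragment edge count, replicated vertices, cut edges].
X = np.array([[1.2e6, 3.0e4, 8.0e4],
              [0.9e6, 6.0e4, 5.0e4],
              [1.5e6, 2.0e4, 9.5e4],
              [0.7e6, 7.5e4, 4.0e4]])
y = np.array([12.1, 9.8, 14.6, 8.9])   # seconds, one measurement per candidate partitioning

cost_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
cost_model.fit(X, y)
print(cost_model.predict(np.array([[1.0e6, 5.0e4, 6.0e4]])))   # predicted cost of a new partitioning
```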
mod0537 Progressive Top-K Nearest Neighbors Search in Large Road Networks
Dian Ouyang (University of Sydney); Dong Wen (University of Technology Sydney); Lu Qin (University of Technology Sydney); Lijun Chang (University of Sydney); Ying Zhang (University of Technology Sydney); Xuemin Lin (University of New South Wales) Computing top-k nearest neighbors (kNN) is a fundamental problem in road networks. Existing solutions either need a complicated parameter configuration in index construction or incur high costs when scanning an unbounded number of vertices in query processing. In this paper, we propose a novel parameter-free index-based solution for the kNN query based on the concept of tree decomposition in large road networks. Based on our index structure, we propose an efficient and progressive algorithm that returns each result in a bounded delay. We also optimize the index structure, which improves the efficiency of both index construction and index maintenance in large road networks. We conduct extensive experiments to show the efficiency of our proposed algorithms and the effectiveness of our optimization techniques in real-world road networks from ten regions. https://doi.org/10.1145/3318464.3389746
mod0024 Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs
Yingxia Shao (Beijing University of Posts and Telecommunications); Shiyue Huang (Peking University); Xupeng Miao (Peking University); Bin Cui (Peking University); Lei Chen (Hong Kong University of Science and Technology) Second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measures, and community detection, use it to capture higher-order patterns in the graph, thus improving the model accuracy. However, the memory explosion problem of this technique hinders it from analyzing large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., alias method) of the second-order random walk may take up 1796TB of memory. Such high memory overhead comes from the memory-unaware strategies for node sampling across the graph. In this paper, to clearly study the efficiency of various node sampling methods in the context of second-order random walk, we design a cost model, and then propose a new node sampling method following the acceptance-rejection paradigm to achieve a better balance between memory and time cost. Further, to guarantee the efficiency of the second-order random walk within arbitrary memory budgets, we propose a memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign a desirable node sampling method for each node in the graph within a memory budget while minimizing the time cost. Finally, we provide general programming interfaces for users to benefit from the memory-aware framework easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency by reducing 90% of the memory cost. https://doi.org/10.1145/3318464.3380562
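As a rough illustration of the acceptance-rejection idea for second-order sampling, the sketch below implements a node2vec-style walk whose return and in-out parameters p and q are assumptions for the example, not the paper's cost-model framework: a neighbor is proposed uniformly and accepted with probability proportional to its second-order weight, avoiding the per-node alias tables that cause the memory blow-up.

    import random

    def second_order_step(graph, prev, cur, p=1.0, q=2.0):
        """One node2vec-style second-order step from `cur`, having arrived from
        `prev`, using acceptance-rejection instead of a precomputed alias table.
        graph: {vertex: set(neighbors)}, undirected and unweighted for simplicity."""
        neighbors = list(graph[cur])
        prev_nbrs = graph[prev]
        upper = max(1.0 / p, 1.0, 1.0 / q)       # bound on the unnormalized weight
        while True:
            x = random.choice(neighbors)          # uniform proposal
            if x == prev:
                w = 1.0 / p                       # return to the previous node
            elif x in prev_nbrs:
                w = 1.0                           # stay at distance 1 from prev
            else:
                w = 1.0 / q                       # move outward
            if random.random() < w / upper:       # accept with probability w / upper
                return x

    def second_order_walk(graph, start, length, p=1.0, q=2.0):
        walk = [start, random.choice(list(graph[start]))]  # first step is first-order
        while len(walk) < length:
            walk.append(second_order_step(graph, walk[-2], walk[-1], p, q))
        return walk

    g = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
    print(second_order_walk(g, start=0, length=8, p=0.5, q=2.0))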
mod0502 Hub Labeling for Shortest Path Counting
Yikai Zhang (Chinese University of Hong Kong); Jeffrey Xu Yu (Chinese University of Hong Kong) The notion of shortest path is fundamental in graph analytics. While many works have been devoted to devising efficient distance oracles to compute the shortest distance between any vertices s and t, we study the problem of efficiently counting the number of shortest paths between s and t in light of its applications in tasks such as betweenness-related analysis. Specifically, we propose a hub labeling scheme based on hub pushing and discuss several graph reduction techniques to reduce the index size. Furthermore, we prove several theoretical results on the performance of the scheme for some special graph classes. Our empirical study verifies the efficiency and effectiveness of the algorithms. In particular, a query evaluation takes only hundreds of microseconds on average for graphs with up to hundreds of millions of edges. We report our findings in this paper. https://doi.org/10.1145/3318464.3389737
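The query side of such a hub labeling can be pictured in a few lines: each endpoint stores (hub, distance, count) entries, the s-t distance is the minimum over common hubs, and the counts of the hubs attaining that minimum are combined. This sketch assumes labels in which every shortest path is accounted for by exactly one common hub; constructing such labels (hub pushing, graph reduction) is the paper's actual contribution and is not shown, and the toy labels are invented.

    def query(label_s, label_t):
        """Query over hub labels for shortest-path counting.
        label_s / label_t map hub -> (distance, count) for the two endpoints."""
        best = float("inf")
        total = 0
        for hub, (d_s, c_s) in label_s.items():
            if hub in label_t:
                d_t, c_t = label_t[hub]
                d = d_s + d_t
                if d < best:
                    best, total = d, c_s * c_t
                elif d == best:
                    total += c_s * c_t
        return best, total   # (shortest distance, number of shortest paths)

    # toy labels: both endpoints reach hubs h1 and h2
    label_s = {"h1": (2, 1), "h2": (3, 2)}
    label_t = {"h1": (4, 3), "h2": (3, 1)}
    print(query(label_s, label_t))   # (6, 5): distance 6 via either hub, 3 + 2 paths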
mod0472s CHASSIS: Conformity Meets Online Information Diffusion
Hui Li (Nanyang Technological University); Hui Li (Xidian University); Sourav S. Bhowmick (Nanyang Technological University) Online information diffusion generates huge volumes of social activities (e.g., tweets, retweets, posts, comments, likes) among individuals. Existing information diffusion modeling techniques are oblivious to conformity of individuals during the diffusion process, a fundamental human trait according to social psychology theories. Intuitively, conformity captures the extent to which an individual complies with social norms or expectations. In this paper, we present a novel framework called CHASSIS to characterize online information diffusion by bridging classical information diffusion models with conformity from social psychology. To this end, we first extend the “Hawkes Process”, a well-known statistical technique utilized to model information diffusion, to quantitatively capture two flavors of conformity, informational conformity and normative conformity, hidden in activity sequences. Next, we present a novel semi-parametric inference approach to learn the proposed model. Experimental study with real-world datasets demonstrates the superiority of CHASSIS to state-of-the-art conformity-unaware information diffusion models. https://doi.org/10.1145/3318464.3389780
Research 17: Data Exploration and Preparation • Wednesday 1:30 PM – 3:00 PM
mod0408s Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning
Ori Bar El (Tel Aviv University); Tova Milo (Tel Aviv University); Amit Somech (Tel Aviv University) Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks — illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights. https://doi.org/10.1145/3318464.3389779
mod0506 Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
Cong Yan (University of Washington); Yeye He (Microsoft Research) Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.) holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to “auto-suggest” contextualized data preparation steps, by “learning” from how data scientists would manipulate data, which are documented by data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially “logging” how data scientists interact with diverse tables, and using the resulting logs as a proxy of “ground truth”, we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the “collective wisdom” of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy. https://doi.org/10.1145/3318464.3389738
mod0103 IDEBench: A Benchmark for Interactive Data Exploration
Philipp Eichmann (Brown University); Emanuel Zgraggen (MIT); Carsten Binnig (TU Darmstadt); Tim Kraska (MIT) In recent years, many query processing techniques have been developed to better support interactive data exploration (IDE) of large structured datasets. To evaluate and compare database engines in terms of how well they support such workloads, experimenters have mostly used self-designed evaluation procedures rather than established benchmarks. In this paper we argue that this is due to the fact that the workloads and metrics of popular analytical benchmarks such as TPC-H or TPC-DS were designed for traditional performance reporting scenarios, and do not capture distinctive IDE characteristics. Guided by the findings of several user studies we present a new benchmark called IDEBench, designed to evaluate database engines based on common IDE workflows and metrics that matter to the end-user. We demonstrate the applicability of IDEBench through a number of experiments with five different database engines, and present and discuss our findings. https://doi.org/10.1145/3318464.3380574
mod0475 Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data
Leilani Battle (University of Maryland); Philipp Eichmann (Brown University); Marco Angelini (University of Rome “La Sapienza”); Tiziana Catarci (University of Rome “La Sapienza”); Giuseppe Santucci (University of Rome “La Sapienza”); Yukun Zheng (University of Maryland); Carsten Binnig (Technical University of Darmstadt); Jean-Daniel Fekete (Inria, Univ. Paris-Saclay, CNRS); Dominik Moritz (University of Washington) In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads. While there exist proposals for evaluating database systems on interactive data exploration workloads, none rely on real user traces for database benchmarking. To this end, our long term goal is to collect user traces that represent workloads with different exploration characteristics. In this paper, we present an initial benchmark that focuses on “crossfilter”-style applications, which are a popular interaction type for data exploration and a particularly demanding scenario for testing database system performance. We make our benchmark materials, including input datasets, interaction sequences, corresponding SQL queries, and analysis code, freely available as a community resource, to foster further research in this area: https://osf.io/9xerb/?view_only=81de1a3f99d04529b6b173a3bd5b4d23. https://doi.org/10.1145/3318464.3389732
mod0496s Benchmarking Spreadsheet Systems
Sajjadur Rahman (University of Illinois at Urbana-Champaign); Kelly Mack (University of Washington); Mangesh Bendre (VISA Research); Ruilin Zhang (University of Southern California); Karrie Karahalios (University of Illinois at Urbana-Champaign); Aditya Parameswaran (University of California, Berkeley) Spreadsheet systems are used for storing and analyzing data across domains by programmers and non-programmers alike. While spreadsheet systems have continued to support increasingly large datasets, they are prone to hanging and freezing while performing computations even on much smaller ones. We present a benchmarking study that evaluates and compares the performance of three popular systems, Microsoft Excel, LibreOffice Calc, and Google Sheets, on a range of canonical spreadsheet computation operations. We find that spreadsheet systems lack interactivity for several operations, on datasets well below their advertised scalability limits. We further evaluate whether spreadsheet systems adopt database optimization techniques such as indexing, intelligent data layout, and incremental and shared computation, to efficiently execute computation operations. We outline several ways future spreadsheet systems can be redesigned to offer interactive response times on large datasets. https://doi.org/10.1145/3318464.3389782
Research 14: Query Optimization and Execution • Wednesday 1:30 PM – 3:00 PM
mod0121s Fast Join Project Query Evaluation using Matrix Multiplication
Shaleen Deep (University of Wisconsin-Madison); Xiao Hu (Duke University); Paraschos Koutris (University of Wisconsin-Madison) In the last few years, much effort has been devoted to developing join algorithms to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in developing efficient algorithms that achieve worst-case optimal runtime for full join queries, i.e., joins without projections. However, not much is known about join evaluation with projections beyond some simple techniques of pushing down the projection operator in the query execution plan. Such queries have a large number of applications in entity matching, graph analytics and searching over compressed graphs. In this paper, we study how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication. Crucially, our algorithms are parameterized by the output size of the final result, allowing for choosing the best execution strategy. We implement our algorithms as a subroutine and compare the performance with state-of-the-art techniques to show they can be improved upon by as much as 50x. More importantly, our experiments indicate that matrix multiplication is a useful operation that can help speed up join processing owing to highly optimized open source libraries that are also highly parallelizable. https://doi.org/10.1145/3318464.3380607
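The folklore connection the paper builds on is easy to see on a toy instance: projecting the join R(a,b) ⋈ S(b,c) onto (a,c) amounts to multiplying the 0/1 matrices of R and S. The sketch below (plain NumPy, not the paper's output-size-parameterized algorithms) makes this concrete; the domain sizes and tuples are arbitrary assumptions.

    import numpy as np

    # R(a, b) and S(b, c) as relations over small integer domains
    R = [(0, 1), (0, 2), (1, 2)]
    S = [(1, 0), (2, 0), (2, 3)]
    n_a, n_b, n_c = 2, 3, 4

    A = np.zeros((n_a, n_b), dtype=np.int64)
    B = np.zeros((n_b, n_c), dtype=np.int64)
    for a, b in R:
        A[a, b] = 1
    for b, c in S:
        B[b, c] = 1

    # C[a, c] counts the join witnesses b; a nonzero entry means (a, c)
    # appears in the projection of the join onto (a, c).
    C = A @ B
    result = {(int(a), int(c)) for a, c in zip(*np.nonzero(C))}
    print(sorted(result))   # [(0, 0), (0, 3), (1, 0), (1, 3)]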
mod0192 Maintaining Acyclic Foreign-Key Joins under Updates
Qichen Wang (Hong Kong University of Science and Technology); Ke Yi (Hong Kong University of Science and Technology) A large number of analytical queries (e.g., all the 22 queries in the TPC-H benchmark) are based on acyclic foreign-key joins. In this paper, we study the problem of incrementally maintaining the query results of these joins under updates, i.e., insertion and deletion of tuples to any of the relations. Prior work has shown that this problem is inherently hard, requiring at least Ω(|db|^(1/2-ε)) time per update, where |db| is the size of the database, and ε > 0 can be any small constant. However, this negative result holds only on adversarially constructed update sequences; on the other hand, most real-world update sequences are “nice”, nowhere near these worst-case scenarios. We introduce a measure λ, which we call the enclosureness of the update sequence, to more precisely characterize its intrinsic difficulty. We present an algorithm to maintain the query results of any acyclic foreign-key join in O(λ) time amortized, on any update sequence whose enclosureness is λ. This is complemented with a lower bound of Ω(λ^(1-ε)), showing that our algorithm is essentially optimal with respect to λ. Next, using this algorithm as the core component, we show how all the 22 queries in the TPC-H benchmark can be supported in Õ(λ) time. Finally, based on the algorithms developed, we built a continuous query processing system on top of Flink, and experimental results show that our system outperforms previous ones significantly. https://doi.org/10.1145/3318464.3380586
mod0591 Thrifty Query Execution via Incrementability
Dixin Tang (University of Chicago); Zechao Shang (University of Chicago); Aaron J. Elmore (University of Chicago); Sanjay Krishnan (University of Chicago); Michael J. Franklin (University of Chicago) Many applications schedule queries before all data is ready. To return fast query results, database systems can eagerly process existing data and incrementally incorporate new data into prior intermediate results, which often relies on incremental view maintenance (IVM) techniques. However, incrementally maintaining a query result can increase the total amount of work mainly as some early work is not useful for computing the final query result. In this paper, we propose a new metric incrementability to quantify the cost-effectiveness of IVM to decide how eagerly or lazily databases should incrementally execute a query. We further observe that different parts of a query have different levels of incrementability and the query execution should have a decomposed control flow based on the difference. Therefore, to address these needs, we propose a new query processing method Incrementability-aware Query Processing (InQP). We build a prototype InQP system based on Spark and show that InQP significantly reduces resource consumption with a similar latency compared with incrementability-oblivious approaches. https://doi.org/10.1145/3318464.3389756
mod0635 A Method for Optimizing Opaque Filter Queries
Wenjia He (University of Michigan, Ann Arbor); Michael R. Anderson (University of Michigan, Ann Arbor); Maxwell Strome (University of Michigan, Ann Arbor); Michael Cafarella (University of Michigan, Ann Arbor) An important class of database queries in machine learning and data science workloads is the opaque filter query: a query with a selection predicate that is implemented with a UDF, with semantics that are unknown to the query optimizer. Some typical examples would include a CNN-style trained image classifier, or a textual sentiment classifier. Because the optimizer does not know the predicate’s semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical “query-independent” index of the database contents, which groups together similar objects. At query-time, the method builds a map of how much each group satisfies the predicate, while also exploiting the map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on both image and text data, on more than 100 distinct opaque predicates. We show voodoo indexing can yield up to an 88% improvement over standard scan behavior, and a 79% improvement over the previous best method adapted from research literature. https://doi.org/10.1145/3318464.3389766
mod0353 Functional-Style SQL UDFs With a Capital ‘F’
Christian Duta (University of Tübingen); Torsten Grust (University of Tübingen) We advocate to express complex in-database computation using a functional style in which SQL UDFs use plain self-invocation to recurse. The resulting UDFs are concise and readable, but their run time performance on contemporary RDBMSs is sobering. This paper describes how to compile such functional-style UDFs into SQL:1999 recursive common table expressions. We build on function call graphs to build the compiler’s core and to realize a series of optimizations (reference counting, memoization, exploitation of linear and tail recursion). The compiled UDFs evaluate efficiently, challenging the performance of manually tweaked (but often convoluted) SQL code. SQL UDFs can indeed be functional and fast. https://doi.org/10.1145/3318464.3389707
Research 3: Machine Learning for Databases I • Wednesday 1:30 PM – 3:00 PM
mod0104 DB4ML – An In-Memory Database Kernel with Machine Learning Support
Matthias Jasny (TU Darmstadt); Tobias Ziegler (TU Darmstadt); Tim Kraska (MIT); Uwe Roehm (The University of Sydney); Carsten Binnig (TU Darmstadt) In this paper, we revisit the question of how ML algorithms can be best integrated into existing DBMSs to not only avoid expensive data copies to external ML tools but also to comply with regulatory requirements. The key observation is that database transactions already provide an execution model that allows DBMSs to efficiently mimic the execution model of modern parallel ML algorithms. As a main contribution, this paper presents DB4ML, an in-memory database kernel that allows applications to implement user-defined ML algorithms and efficiently run them inside a DBMS. Thereby, the ML algorithms are implemented using a programming model based on the idea of so-called iterative transactions. Our experimental evaluation shows that DB4ML can support user-defined ML algorithms inside a DBMS with the efficiency of modern specialized ML engines. In contrast to DB4ML, these engines not only need to transfer data out of the DBMS but also hardcode the ML algorithms and thus are not extensible. https://doi.org/10.1145/3318464.3380575
mod0658 Active Learning for ML Enhanced Database Systems
Lin Ma (Carnegie Mellon University); Bailu Ding (Microsoft Research); Sudipto Das (Amazon Web Services); Adith Swaminathan (Microsoft Research) Recent research has shown promising results by using machine learning (ML) techniques to improve the performance of database systems, e.g., in query optimization or index recommendation. However, in many production deployments, the ML models’ performance degrades significantly when the test data diverges from the data used to train these models. In this paper, we address this performance degradation by using B-instances to collect additional data during deployment. We propose an active data collection platform, ADCP, that employs active learning (AL) to gather relevant data cost-effectively. We develop a novel AL technique, Holistic Active Learner (HAL), that robustly combines multiple noisy signals for data gathering in the context of database applications. HAL applies to various ML tasks, budget sizes, cost types, and budgeting interfaces for database applications. We evaluate ADCP on both industry-standard benchmarks and real customer workloads. Our evaluation shows that, compared with other baselines, our technique improves ML models’ prediction performance by up to 2x with the same cost budget. In particular, on production workloads, our technique reduces the prediction error of ML models by 75% using about 100 additionally collected queries. https://doi.org/10.1145/3318464.3389768
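For context, a minimal uncertainty-sampling loop of the kind such platforms build on might look as follows. This is a generic active-learning sketch with simulated labels, an arbitrary budget, and scikit-learn's logistic regression; it is not the HAL technique or the ADCP interfaces described in the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # simulated pool of query feature vectors and their (initially unknown) labels
    X_pool = rng.normal(size=(500, 5))
    y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

    # small seed set with both classes represented
    labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
    budget, batch = 50, 10
    model = LogisticRegression()

    while len(labeled) < 10 + budget:
        model.fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        proba = model.predict_proba(X_pool[unlabeled])[:, 1]
        uncertainty = -np.abs(proba - 0.5)          # closest to 0.5 = most uncertain
        pick = unlabeled[np.argsort(uncertainty)[-batch:]]
        labeled.extend(pick.tolist())               # "collect" labels for these queries
    print("accuracy:", model.score(X_pool, y_pool))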
mod0674 Qd-tree: Learning Data Layouts for Big Data Analytics
Zongheng Yang (University of California, Berkeley); Badrish Chandramouli (Microsoft Research); Chi Wang (Microsoft Research); Johannes Gehrke (Microsoft); Yinan Li (Microsoft Research); Umar Farooq Minhas (Microsoft Research); Per-Åke Larson (Microsoft Research); Donald Kossmann (Microsoft Research); Rajeev Acharya (Microsoft) Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today’s systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks. https://doi.org/10.1145/3318464.3389770
mod0266 Facilitating SQL Query Composition and Analysis
Zainab Zolaktaf (University of British Columbia); Mostafa Milani (University of British Columbia); Rachel Pottinger (University of British Columbia) Formulating efficient SQL queries requires several cycles of tuning and execution. We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We achieve this by predicting properties such as the query answer size, its run-time, and error class. Unlike existing approaches, our approach does not rely on any statistics from the database instance or query execution plans. Our approach is based on using data-driven machine learning techniques that rely on large query workloads to model SQL queries and their properties. Empirical results show that the neural network models are more accurate in predicting several query properties. https://doi.org/10.1145/3318464.3380602
mod0461 MONSOON: Multi-Step Optimization and Execution of Queries with Partially Obscured Predicates
Sourav Sikdar (Rice University); Chris Jermaine (Rice University) User-defined functions (UDFs) in modern SQL database systems and Big Data processing systems such as Spark, which offer API bindings in high-level languages such as Python or Scala, make automatic optimization challenging. The foundation of modern database query optimization is the collection of statistics describing the data to be processed, but when a database or Big Data computation is partially obscured by UDFs, good statistics are often unavailable. In this paper, we describe a query optimizer called the Monsoon optimizer. In the presence of UDFs, the Monsoon optimizer may choose to collect statistics on the UDFs, and then run the computation. Or, it may optimize and execute part of the plan, collecting statistics on the result of the partial plan, followed by a re-optimization step, with the process repeated as needed. Monsoon decides how to interleave execution and statistics collection in a principled fashion by formalizing the problem as a Markov decision process. https://doi.org/10.1145/3318464.3389728
Research 4: Uncertain, Probabilistic, and Approximate Data • Wednesday 1:30 PM – 3:00 PM
mod0603 Causal Relational Learning
Babak Salimi (University of Washington); Harsh Parikh (Duke University); Moe Kayali (University of Washington); Lise Getoor (University of California, Santa Cruz); Sudeepa Roy (Duke University); Dan Suciu (University of Washington) Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CARL for capturing causal background knowledge and assumptions, and specifying causal queries using simple Datalog-like rules. CARL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CARL in social sciences and healthcare. https://doi.org/10.1145/3318464.3389759
mod0088s Sample Debiasing in the Themis Open World Database System
Laurel Orr (University of Washington); Magda Balazinska (University of Washington); Dan Suciu (University of Washington) Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage a priori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that Themis achieves higher query accuracy than the default AQP approach, an alternative sample reweighting technique, and a variety of Bayesian network models while maintaining interactive query response times. We also show that Themis is robust to differences in the support between the sample and population, a key use case when using social media samples. https://doi.org/10.1145/3318464.3380606
mod0632 Stochastic Package Queries in Probabilistic Databases
Matteo Brucato (University of Massachusetts Amherst); Nishant Yadav (University of Massachusetts Amherst); Azza Abouzied (New York University Abu Dhabi); Peter J. Haas (University of Massachusetts Amherst); Alexandra Meliou (University of Massachusetts Amherst) We provide methods for in-database support of decision making under uncertainty. Many important decision problems correspond to selecting a “package” (bag of tuples in a relational database) that jointly satisfy a set of constraints while minimizing some overall “cost” function; in most real-world problems, the data is uncertain. We provide methods for specifying, via a SQL extension, and processing stochastic package queries (SPQs), in order to solve optimization problems over uncertain data, right where the data resides. Prior work in stochastic programming uses Monte Carlo methods where the original stochastic optimization problem is approximated by a large deterministic optimization problem that incorporates many “scenarios”, i.e., sample realizations of the uncertain data values. For large database tables, however, a huge number of scenarios is required, leading to poor performance and, often, failure of the solver software. We therefore provide a novel summary-based algorithm that, instead of trying to solve a large deterministic problem, seamlessly approximates it via a sequence of smaller problems defined over carefully crafted “summaries” of the scenarios that accelerate convergence to a feasible and near-optimal solution. Experimental results on our prototype system show that our algorithm can be orders of magnitude faster than prior methods at finding feasible and high-quality packages. https://doi.org/10.1145/3318464.3389765
mod0587s Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints
Xi Liang (University of Chicago); Zechao Shang (University of Chicago); Sanjay Krishnan (University of Chicago); Aaron J. Elmore (University of Chicago); Michael J. Franklin (University of Chicago) Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries in these conditions resulting in hard error bounds with testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints, even if they are overlapping, conflicting, or unsatisfiable, into such bounds. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high error rate that is often unpredictable. In contrast, our framework offers hard bounds that are guaranteed to hold if the constraints are not violated. In spite of these hard bounds, we show competitive accuracy to statistical baselines. https://doi.org/10.1145/3318464.3389785
mod0095 Mining Approximate Acyclic Schemes from Relations
Batya Kenig (University of Washington); Pranay Mundra (University of Washington); Guna Prasad (University of Washington); Babak Salimi (University of Washington); Dan Suciu (University of Washington) Acyclic schemes have numerous applications in databases and in machine learning, such as improved design, more efficient storage, and increased performance for queries and machine learning algorithms. Multivalued dependencies (MVDs) are the building blocks of acyclic schemes. The discovery from data of both MVDs and acyclic schemes is more challenging than other forms of data dependencies, such as Functional Dependencies, because these dependencies do not hold on subsets of data, and because they are very sensitive to noise in the data; for example a single wrong or missing tuple may invalidate the schema. In this paper we present Maimon, a system for discovering approximate acyclic schemes and MVDs from data. We give a principled definition of approximation, by using notions from information theory, then describe the two components of Maimon: mining for approximate MVDs, then reconstructing acyclic schemes from approximate MVDs. We conduct an experimental evaluation of Maimon on 20 real-world datasets, and show that it can scale up to 1M rows, and up to 30 columns. https://doi.org/10.1145/3318464.3380573
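The information-theoretic flavor of approximation can be illustrated for a single multivalued dependency: the empirical conditional mutual information I(Y; Z | X) is zero exactly when X ->> Y | Z holds on the data, and small positive values indicate an approximate MVD. The sketch below shows only this scoring step on a made-up toy table; Maimon's discovery and schema-reconstruction algorithms are not reflected.

    from collections import Counter
    from math import log2

    def H(rows, cols):
        """Empirical entropy (in bits) of the projection of `rows` onto `cols`.
        rows: list of dicts; cols: tuple of column names."""
        n = len(rows)
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    def mvd_score(rows, X, Y, Z):
        """Conditional mutual information I(Y; Z | X): 0 iff the MVD X ->> Y | Z
        holds exactly on this data; small positive values mean it approximately holds."""
        return H(rows, X + Y) + H(rows, X + Z) - H(rows, X + Y + Z) - H(rows, X)

    rows = [
        {"course": "db", "teacher": "ann", "room": "r1"},
        {"course": "db", "teacher": "ann", "room": "r2"},
        {"course": "db", "teacher": "bob", "room": "r1"},
        {"course": "db", "teacher": "bob", "room": "r2"},
        {"course": "ml", "teacher": "eve", "room": "r3"},
    ]
    print(mvd_score(rows, ("course",), ("teacher",), ("room",)))  # ~0: the MVD holds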
Research 9: Data Cleaning • Wednesday 4:30 PM – 6:00 PM
mod0179s Cleaning Denial Constraint Violations through Relaxation
Stella Giannakopoulou (EPFL); Manos Karpathiotakis (Facebook); Anastasia Ailamaki (EPFL) Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning the data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of denial constraint violations on-demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query-workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads. https://doi.org/10.1145/3318464.3389775
mod0444 On Multiple Semantics for Declarative Database Repairs
Amir Gilad (Tel Aviv University); Daniel Deutch (Tel Aviv University); Sudeepa Roy (Duke University) We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta rules are highly expressive and allow specifying complex, cross-relation repair logic associated with Denial Constraints and Causal Rules, and can capture Database Triggers of interest. We show that there are no one-size-fits-all semantics for repairs in this inclusive setting, and we consequently introduce multiple alternative semantics, presenting the case for using each of them. We then study the relationships between the semantics in terms of their output and the complexity of computation. Our results formally establish the tradeoff between the permissiveness of the semantics and its computational complexity. We demonstrate the usefulness of the framework in capturing multiple data repair scenarios for an academic search database and the TPC-H databases, showing how using different semantics affects the repair in terms of size and runtime, and examining the relationships between the repairs. We also compare our approach with SQL triggers and a state-of-the-art data repair system. https://doi.org/10.1145/3318464.3389721
mod0654s Discovery Algorithms for Embedded Functional Dependencies
Ziheng Wei (The University of Auckland); Sven Hartmann (Clausthal University of Technology); Sebastian Link (The University of Auckland) Embedded functional dependencies (eFDs) advance data management applications by combining data completeness and integrity requirements. We show that the discovery problem of eFDs is NP-complete, W[2]-complete in the output, and has a minimum solution space that is larger than the maximum solution space for functional dependencies. Nevertheless, we use novel data structures and search strategies to develop row-efficient, column-efficient, and hybrid algorithms for eFD discovery. Our experiments demonstrate that the algorithms scale well in terms of their design targets, and that ranking the eFDs by the number of redundant data values they cause can provide useful guidance in identifying meaningful eFDs for applications. Finally, we demonstrate the benefits of introducing completeness requirements and ranking by the number of redundant data values for approximate and genuine functional dependencies. https://doi.org/10.1145/3318464.3389786
mod0053 SCODED: Statistical Constraint Oriented Data Error Detection
Jing Nathan Yan (Cornell University); Oliver Schulte (Simon Fraser University); MoHan Zhang (Simon Fraser University); Jiannan Wang (Simon Fraser University); Reynold Cheng (The University of Hong Kong) Statistical Constraints (SCs) play an important role in statistical modeling and analysis. This paper brings the concept to data cleaning and studies how to leverage SCs for error detection. SCs provide a novel approach that has various application scenarios and works harmoniously with downstream statistical modeling. Entailment relationships between SCs and integrity constraints provide analytical insight into SCs. We develop SCODED, an SC-Oriented Data Error Detection system, comprising two key components: (1) SC Violation Detection: checks whether an SC is violated on a given dataset, and (2) Error Drill Down: identifies the top-k records that contribute most to the violation of an SC. Experiments on synthetic and real-world data show that SCs are effective in detecting data errors that violate them, compared to state-of-the-art approaches. https://doi.org/10.1145/3318464.3380568
mod0552 A Statistical Perspective on Discovering Functional Dependencies in Noisy Data
Yunjia Zhang (University of Wisconsin-Madison); Zhihan Guo (University of Wisconsin-Madison); Theodoros Rekatsinas (University of Wisconsin-Madison) We study the problem of discovering functional dependencies (FD) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that FDX can recover true functional dependencies across a diverse array of real-world and synthetic data sets, even in the presence of noisy or missing data. We find that FDX scales to large data instances with millions of tuples and hundreds of attributes while it yields an average F1 improvement of 2x against state-of-the-art FD discovery methods. https://doi.org/10.1145/3318464.3389749
Research 6: Transaction Processing and Query Optimization • Wednesday 4:30 PM – 6:00 PM
mod0376 Long-lived Transactions Made Less Harmful
Jongbin Kim (Hanyang University); Hyunsoo Cho (Hanyang University); Kihwang Kim (Hanyang University); Jaeseon Yu (Hanyang University); Sooyong Kang (Hanyang University); Hyungsoo Jung (Hanyang University) Many systems use snapshot isolation, or something similar, as defaults, and multi-version concurrency control (MVCC) remains essential to offering such point-in-time consistency. One major issue in MVCC is the timely removal of unnecessary versions of data items, especially in the presence of long-lived transactions (LLTs). We have observed that the latest versions of MySQL and PostgreSQL are still vulnerable to LLTs. Our analysis of existing proposals suggests that new solutions to this matter must provide rigorous rules for completely identifying unnecessary versions, and elaborate designs for version cleaning lest old versions required for LLTs should suspend garbage collection. In this paper, we formalize such rules into our version pruning theorem and version classification, all of which form the theoretical foundations for our new version management system, vDriver, that bases its record versioning on a new principle: Single In-row Remaining Off-row (SIRO) versioning. We implemented a prototype of vDriver and integrated it with MySQL-8.0 and PostgreSQL-12.0. The experimental evaluation demonstrated that the engines with vDriver continue to perform the reclamation of dead versions in the face of LLTs while retaining transaction throughput with reduced space consumption. https://doi.org/10.1145/3318464.3389714
mod0449 Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks
Erfan Zamanian (Brown University); Julian Shun (Massachusetts Institute of Technology); Carsten Binnig (TU Darmstadt); Tim Kraska (Massachusetts Institute of Technology) Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact, recent work has shown that distributed databases can scale even when the majority of transactions are cross-partition. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions. Finally, we evaluate Chiller using various workloads, and show that our partitioning and execution strategy outperforms traditional partitioning techniques which try to avoid distributed transactions, by up to a factor of 2. https://doi.org/10.1145/3318464.3389724
mod0618 Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning
Guna Prasaad (University of Washington); Alvin Cheung (University of California, Berkeley); Dan Suciu (University of Washington) Research on transaction processing has made significant progress towards improving performance of main memory multicore OLTP systems under low contention. However, these systems struggle on workloads with lots of conflicts. Partitioned databases (and variants) perform well on high contention workloads that are statically partitionable, but time-varying workloads often make them impractical. Towards addressing this, we propose Strife, a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control. Strife executes transactions in batches, where each batch is partitioned into disjoint clusters without any cross-cluster conflicts and a small set of residuals. The clusters are then executed in parallel with no concurrency control, followed by residuals separately executed with concurrency control. Strife uses a fast dynamic clustering algorithm that exploits a combination of random sampling and a concurrent union-find data structure to partition the batch online, before executing it. Strife outperforms lock-based and optimistic protocols by up to 2x on high contention workloads. While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2x better when such static partitioning is not possible and adapts to dynamically varying workloads. https://doi.org/10.1145/3318464.3389764
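The clustering step can be pictured with a small union-find sketch: transactions whose read/write sets share records end up in the same connected component and can be run serially without concurrency control, while distinct components run in parallel. This omits Strife's random sampling, residual handling, and concurrent union-find implementation; the batch below is a made-up example.

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:
                self.parent[ra] = rb

    def cluster_batch(batch):
        """batch: list of (txn_id, set_of_record_keys). Returns groups of
        transactions whose read/write sets are (transitively) connected."""
        uf = UnionFind()
        for _, keys in batch:
            keys = list(keys)
            for k in keys[1:]:
                uf.union(keys[0], k)
        clusters = {}
        for txn, keys in batch:
            root = uf.find(next(iter(keys)))
            clusters.setdefault(root, []).append(txn)
        return list(clusters.values())

    batch = [("t1", {"x", "y"}), ("t2", {"y"}), ("t3", {"z"})]
    print(cluster_batch(batch))   # [['t1', 't2'], ['t3']]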
mod0286 A Transactional Perspective on Execute-order-validate Blockchains
Pingcheng Ruan (National University of Singapore); Dumitrel Loghin (National University of Singapore); Quang-Trung Ta (National University of Singapore); Meihui Zhang (Beijing Institute of Technology); Gang Chen (Zhejiang University); Beng Chin Ooi (National University of Singapore) Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms to general transactional systems. A new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions. However, this architecture might render many invalid transactions when serializing them. This problem is further exacerbated as the block formation rate is inherently limited due to other factors besides data processing, such as cryptography and consensus. Inspired by optimistic concurrency control in modern databases, we propose a novel method to enhance the execute-order-validate architecture, by reordering transactions to reduce the abort rate. In contrast to existing blockchains that adopt databases’ preventive approaches, which might over-abort serializable transactions, our method is theoretically more fine-grained: unserializable transactions are aborted before reordering and the rest are guaranteed to be serializable. We implement our method in two blockchains respectively, FabricSharp on top of Hyperledger Fabric, and FastFabricSharp on top of FastFabric. We compare the performance of FabricSharp with vanilla Fabric and three related systems, two of which are respectively implemented with one standard and one state-of-the-art concurrency control techniques from databases. The results demonstrate that FabricSharp achieves 25% higher throughput compared to the other systems in nearly all experimental scenarios. Moreover, FastFabricSharp’s improvement over FastFabric is up to 66%. https://doi.org/10.1145/3318464.3389693
mod0497 Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates
Surabhi Gupta (Microsoft Research India); Sanket Purandare (Harvard University); Karthik Ramachandra (Microsoft Research India) Loops that iterate over SQL query results are quite common, both in application programs that run outside the DBMS, as well as User Defined Functions (UDFs) and stored procedures that run within the DBMS. It can be argued that set-oriented operations are more efficient and should be preferred over iteration; but from real world use cases, it is clear that loops over query results are inevitable in many situations, and are preferred by many users. Such loops, known as cursor loops, come with huge trade-offs and overheads w.r.t. performance, resource consumption and concurrency. We present Aggify, a technique for optimizing loops over query results that overcomes these overheads. It achieves this by automatically generating custom aggregates that are equivalent in semantics to the loop. Thereby, Aggify completely eliminates the loop by rewriting the query to use this generated aggregate. This technique has several advantages such as: (i) pipelining of entire cursor loop operations instead of materialization, (ii) pushing down loop computation from the application layer into the DBMS, closer to the data, (iii) leveraging existing work on optimization of aggregate functions, resulting in efficient query plans. We describe the technique underlying Aggify, and present our experimental evaluation over benchmarks as well as real workloads that demonstrate the significant benefits of this technique. https://doi.org/10.1145/3318464.3389736
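The rewrite itself is easy to demonstrate with SQLite's Python driver: the body of a cursor loop becomes the step function of a user-defined aggregate, so the whole computation runs inside one query. This is only an illustration of the idea; Aggify performs such rewrites automatically and targets server-side UDFs and application code, and the capped-sum logic and table below are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders(id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 40.0), (2, 70.0), (3, 25.0)])

    # Cursor-loop version: fetch rows one by one and accumulate in the application.
    total = 0.0
    for (amount,) in conn.execute("SELECT amount FROM orders ORDER BY id"):
        total += min(amount, 50.0)          # cap each order at 50 before summing

    # Aggregate version: the same loop body becomes step(); the query engine
    # drives the iteration and can pipeline it instead of materializing a cursor.
    class CappedSum:
        def __init__(self):
            self.total = 0.0
        def step(self, amount):
            self.total += min(amount, 50.0)
        def finalize(self):
            return self.total

    conn.create_aggregate("capped_sum", 1, CappedSum)
    (agg_total,) = conn.execute("SELECT capped_sum(amount) FROM orders").fetchone()
    assert abs(total - agg_total) < 1e-9
    print(total, agg_total)   # 115.0 115.0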
Research 19: Machine Learning Systems and Applications • Wednesday 4:30 PM – 6:00 PM
mod0368 Vista: Optimized System for Declarative Feature Transfer from Deep CNNs at Scale
Supun Nakandala (University of California, San Diego); Arun Kumar (University of California, San Diego) Scalable systems for machine learning (ML) are largely siloed into dataflow systems for structured data and deep learning systems for unstructured data. This gap has left workloads that jointly analyze both forms of data with poor systems support, leading to both low system efficiency and grunt work for users. We bridge this gap for an important class of such workloads: feature transfer from deep convolutional neural networks (CNNs) for analyzing images along with structured data. Executing feature transfer on scalable dataflow and deep learning systems today faces two key systems issues: inefficiency due to redundant computations and crash-proneness due to mismanaged memory. We present Vista, a new data system that resolves these issues by elevating this workload to a declarative level on top of dataflow and deep learning systems. Vista automatically optimizes the configuration and execution of this workload to reduce both computational redundancy and the potential for workload crashes. Experiments on real datasets show that apart from making feature transfer easier, Vista avoids workload crashes and reduces runtimes by 58% to 92% compared to baselines. https://doi.org/10.1145/3318464.3389709
mod0379 Optimizing Machine Learning Workloads in Collaborative Environments
Behrouz Derakhshan (DFKI GmbH); Alireza Rezaei Mahdiraji (DFKI GmbH); Ziawasch Abedjan (TU Berlin); Tilmann Rabl (Hasso Plattner Institute & University of Potsdam); Volker Markl (DFKI GmbH & TU Berlin) Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve the result. This introduces many redundant data processing and model training operations. Reusing the data generated by the redundant operations leads to the more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the result of previously executed operations. In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) to store the artifacts, i.e., raw and intermediate data or ML models, as vertices and operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the overall storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm only incurs a negligible overhead and scales for the high number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run-time by one order of magnitude for repeated execution of the workloads and 50% for the execution of modified workloads in collaborative environments. https://doi.org/10.1145/3318464.3389715
mod0225 GOGGLES: Automatic Image Labeling with Affinity Coding
Nilaksh Das (Georgia Institute of Technology); Sanya Chaba (Georgia Institute of Technology); Renzhi Wu (Georgia Institute of Technology); Sakshi Gandhi (Georgia Institute of Technology); Duen Horng Chau (Georgia Institute of Technology); Xu Chu (Georgia Institute of Technology) Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound. https://doi.org/10.1145/3318464.3380592
mod0487 DeepSqueeze: Deep Semantic Compression for Tabular Data
Amir Ilkhechi (Brown University); Andrew Crotty (Brown University); Alex Galakatos (Brown University); Yicong Mao (Brown University); Grace Fan (Brown University); Xiran Shi (Brown University); Ugur Cetintemel (Brown University) With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time. We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives. https://doi.org/10.1145/3318464.3389734
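A bare-bones version of the autoencoder idea can be sketched in PyTorch: learn a low-dimensional code per tuple, store the codes, and keep explicit corrections only for cells whose reconstruction error would exceed a user-specified bound. The layer sizes, tolerance, and synthetic correlated table are arbitrary assumptions; DeepSqueeze's quantization, categorical handling, and columnar-format integration are not shown.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # synthetic table: three numeric columns driven by one hidden factor
    n, d, code_dim = 1000, 3, 1
    t = torch.rand(n, 1)
    X = torch.cat([t,
                   2 * t + 0.01 * torch.randn(n, 1),
                   t ** 2 + 0.01 * torch.randn(n, 1)], dim=1)

    encoder = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, code_dim))
    decoder = nn.Sequential(nn.Linear(code_dim, 8), nn.ReLU(), nn.Linear(8, d))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

    for _ in range(500):                        # small training loop on the whole table
        opt.zero_grad()
        loss = ((decoder(encoder(X)) - X) ** 2).mean()
        loss.backward()
        opt.step()

    with torch.no_grad():
        codes = encoder(X)                      # store one code per tuple instead of d values
        err = (decoder(codes) - X).abs()
        tol = 0.05                              # user-specified per-cell error bound
        corrections = torch.nonzero(err > tol)  # keep explicit values only for these cells
    print(codes.shape, int(corrections.shape[0]), "corrected cells out of", n * d)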
mod0437 TRACER: A Framework for Facilitating Accurate and Interpretable Analytics for High Stakes Applications
Kaiping Zheng (National University of Singapore); Shaofeng Cai (National University of Singapore); Horng Ruey Chua (National University Health System); Wei Wang (National University of Singapore); Kee Yuan Ngiam (National University Health System); Beng Chin Ooi (National University of Singapore) In high stakes applications such as healthcare and finance analytics, the interpretability of predictive models is required and necessary for domain practitioners to trust the predictions. Traditional machine learning models, e.g., logistic regression (LR), are easy to interpret in nature. However, many of these models aggregate time-series data without considering the temporal correlations and variations. Therefore, their performance cannot match up to recurrent neural network (RNN) based models, which are nonetheless difficult to interpret. In this paper, we propose a general framework TRACER to facilitate accurate and interpretable predictions, with a novel model TITV devised for healthcare analytics and other high stakes applications such as financial investment and risk management. Different from LR and other existing RNN-based models, TITV is designed to capture both the time-invariant and the time-variant feature importance using a feature-wise transformation subnetwork and a self-attention subnetwork, for the feature influence shared over the entire time series and the time-related importance respectively. Healthcare analytics is adopted as a driving use case, and we note that the proposed TRACER is also applicable to other domains, e.g., fintech. We evaluate the accuracy of TRACER extensively in two real-world hospital datasets, and our doctors/clinicians further validate the interpretability of TRACER at both the patient level and the feature level. Besides, TRACER is also validated in a critical financial application. The experimental results confirm that TRACER facilitates both accurate and interpretable analytics for high stakes applications. https://doi.org/10.1145/3318464.3389720
Research 25: Social Network Analysis • Wednesday 4:30 PM – 6:00 PM
mod0026 The Solution Distribution of Influence Maximization: A High-level Experimental Study on Three Algorithmic Approaches
Naoto Ohsaka (NEC Corporation) Influence maximization is among the most fundamental algorithmic problems in social influence analysis. Over the last decade, a great effort has been devoted to developing efficient algorithms for influence maximization, so that identifying the “best” algorithm has become a demanding task. In SIGMOD’17, Arora, Galhotra, and Ranu reported benchmark results on eleven existing algorithms and demonstrated that there is no single state-of-the-art offering the best trade-off between computational efficiency and solution quality. In this paper, we report a high-level experimental study on three well-established algorithmic approaches for influence maximization, referred to as Oneshot, Snapshot, and Reverse Influence Sampling (RIS). Different from Arora et al., our experimental methodology is designed so that we examine the distribution of random solutions, characterize the relation between the sample number and the actual solution quality, and avoid implementation dependencies. Our main findings are as follows: 1. For a sufficiently large sample number, we obtain a unique solution regardless of algorithms. 2. The average solution quality of Oneshot, Snapshot, and RIS improves at the same rate up to scaling of sample number. 3. Oneshot requires more samples than Snapshot, and Snapshot requires fewer but larger samples than RIS. We discuss the time efficiency when conditioning Oneshot, Snapshot, and RIS to be of identical accuracy. Our conclusion is that Oneshot is suitable only if the size of available memory is limited, and RIS is more efficient than Snapshot for large networks; Snapshot is preferable for small, low-probability networks. https://doi.org/10.1145/3318464.3380564
mod0522 Influence Maximization Revisited: Efficient Reverse Reachable Set Generation with Bound Tightened
Qintian Guo (The Chinese University of Hong Kong); Sibo Wang (The Chinese University of Hong Kong); Zhewei Wei (Renmin University of China); Ming Chen (Renmin University of China) Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log(n)/ε²) expected time while returning a (1-1/e-ε) approximate solution with at least 1-1/n probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms. In this paper, we present a study on this key phase and propose an efficient random RR set generation algorithm under the IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under the IC model can be improved to O(k·n log(n)/ε²), when for any node v, the total weight of its incoming edges is no larger than a constant. Moreover, existing approximate IM algorithms suffer from scalability issues in high influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than the state of the art, as shown in our experiments. https://doi.org/10.1145/3318464.3389740
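For readers unfamiliar with the RR-set machinery this paper accelerates, the sketch below shows the textbook baseline: a random RR set under the IC model is a reverse BFS in which each incoming edge is kept with its propagation probability, and seeds are then chosen greedily to cover as many RR sets as possible. The toy graph and parameters are illustrative; this is the standard RIS baseline, not the authors' optimized generator.
```python
import random
from collections import deque

def random_rr_set(in_edges, n):
    """One random RR set under the IC model.
    in_edges[v] is a list of (u, p) pairs: edge u -> v with probability p."""
    root = random.randrange(n)
    rr, frontier = {root}, deque([root])
    while frontier:
        v = frontier.popleft()
        for u, p in in_edges[v]:          # traverse edges in reverse
            if u not in rr and random.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def ris_select(in_edges, n, k, num_sets=10000):
    """Greedy seed selection: pick k nodes covering the most RR sets."""
    rr_sets = [random_rr_set(in_edges, n) for _ in range(num_sets)]
    seeds = set()
    for _ in range(k):
        counts = {}
        for s in rr_sets:
            if not (s & seeds):           # set not yet covered by chosen seeds
                for v in s:
                    counts[v] = counts.get(v, 0) + 1
        if not counts:
            break
        seeds.add(max(counts, key=counts.get))
    return seeds

# Toy 4-node graph: in_edges[v] lists (u, p) meaning edge u -> v with probability p.
in_edges = [[], [(0, 0.5)], [(0, 0.5), (1, 0.3)], [(2, 0.8)]]
print(ris_select(in_edges, n=4, k=2, num_sets=2000))
```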
mod0194 Truss-based Community Search over Large Directed Graphs
Qing Liu (Hong Kong Baptist University); Minjun Zhao (Zhejiang University); Xin Huang (Hong Kong Baptist University); Jianliang Xu (Hong Kong Baptist University); Yunjun Gao (Zhejiang University) Community search enables personalized community discovery and has wide applications in large real-world graphs. While community search has been extensively studied for undirected graphs, the problem for directed graphs has received attention only recently. However, existing studies suffer from several drawbacks, e.g., the vertices with varied in-degrees and out-degrees cannot be included in a community at the same time. To address the limitations, in this paper, we systematically study the problem of community search over large directed graphs. We start by presenting a novel community model, called D-truss, based on two distinct types of directed triangles, i.e., flow triangle and cycle triangle. The D-truss model brings nice structural and computational properties and has many advantages in comparison with the existing models. With this new model, we then formulate the D-truss community search problem, which is proved to be NP-hard. In view of its hardness, we propose two efficient 2-approximation algorithms, named Global and Local, that run in polynomial time yet with quality guarantee. To further improve the efficiency of the algorithms, we devise an indexing method based on D-truss decomposition. Consequently, the D-truss community search can be solved upon the D-truss index without time-consuming accesses to the original graph. Experimental studies on real-world graphs with ground-truth communities validate the quality of the solutions we obtain and the efficiency of the proposed algorithms. https://doi.org/10.1145/3318464.3380587
mod0021s Densely Connected User Community and Location Cluster Search in Location-Based Social Networks
Junghoon Kim (Nanyang Technological University); Tao Guo (Google); Kaiyu Feng (Nanyang Technological University); Gao Cong (Nanyang Technological University); Arijit Khan (Nanyang Technological University); Farhana Choudhury (University of Melbourne) Searching for a community based on query nodes in a graph is a fundamental problem and has been extensively investigated. Most of the existing approaches focus on finding a community in a social network, and very few studies consider location-based social networks where users can check in at locations. In this paper we propose the GeoSocial Community Search problem (GCS) which aims to find a social community and a cluster of spatial locations that are densely connected in a location-based social network simultaneously. The GCS can be useful for marketing and user/location recommendation. To the best of our knowledge, this is the first work to find a social community and a cluster of spatial locations that are densely connected from location-based social networks. We prove that the problem is NP-hard, and is not in APX, unless P = NP. To solve this problem, we propose three algorithms: a core-based basic algorithm, a top-down greedy removing algorithm, and an expansion algorithm. Finally, we report extensive experimental studies that offer insights into the efficiency and effectiveness of the proposed solutions. https://doi.org/10.1145/3318464.3380603
mod0531 Global Reinforcement of Social Networks: The Anchored Coreness Problem
Qingyuan Linghu (University of New South Wales); Fan Zhang (Guangzhou University); Xuemin Lin (University of New South Wales); Wenjie Zhang (University of New South Wales); Ying Zhang (University of Technology Sydney) The stability of a social network has been widely studied as an important indicator for both the network holders and the participants. Existing works on reinforcing networks focus on a local view, e.g., the anchored k-core problem aims to enlarge the size of the k-core with a fixed input k. Nevertheless, it is more promising to reinforce a social network in a global manner: considering the engagement of every user (vertex) in the network. Since the coreness of a user has been validated as the “best practice” for capturing user engagement, we propose and study the anchored coreness problem in this paper: anchoring a small number of vertices to maximize the coreness gain (the total increment of coreness) of all the vertices in the network. We prove the problem is NP-hard and show it is more challenging than the existing local-view problems. An efficient heuristic algorithm is proposed with novel techniques on pruning search space and reusing the intermediate results. Extensive experiments on real-life data demonstrate that our model is effective for reinforcing social networks and our algorithm is efficient. https://doi.org/10.1145/3318464.3389744
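To make the engagement measure used above concrete, the sketch below computes each vertex's coreness with the standard min-degree peeling algorithm; the anchored coreness problem then asks which few vertices to "anchor" (treat as never leaving) so that the total coreness of the network rises the most. The example graph is illustrative, and the anchoring heuristic from the paper is not reproduced here.
```python
def coreness(adj):
    """Peeling-based core decomposition: repeatedly remove the vertex of
    minimum remaining degree; the running maximum of removal degrees is the
    coreness of the removed vertex. adj: dict vertex -> set of neighbours."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=deg.get)
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return core

graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(coreness(graph))   # the triangle {0, 1, 2} has coreness 2, vertex 3 has 1
```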
SIGMOD Research Sessions for Thursday
Research 22: Data Lakes, Web, and Knowledge Graph • Thursday 10:30 AM – 12:00 PM
mod0079s Organizing Data Lakes for Navigation
Fatemeh Nargesian (University of Rochester); Ken Q. Pu (University of Ontario Institute of Technology); Erkang Zhu (Microsoft Research); Bahar Ghadiri Bashardoost (University of Toronto); Renée J. Miller (Northeastern University) We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We propose the data lake organization problem as the problem of finding an organization that allows a user to most effectively navigate a data lake. We present a new probabilistic model of how users interact with an organization and propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both a real data lake containing data from open data portals and on a benchmark that contains rich metadata emulating the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users find relevant tables that cannot be found by keyword search. https://doi.org/10.1145/3318464.3380605
mod0453 Finding Related Tables in Data Lakes for Interactive Data Science
Yi Zhang (University of Pennsylvania); Zachary G. Ives (University of Pennsylvania) Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts. https://doi.org/10.1145/3318464.3389726
mod0140s Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference
Mohammad Raza (Microsoft Corporation); Sumit Gulwani (Microsoft Corporation) Automatic synthesis of web data extraction programs has been explored in a variety of settings, but in practice there remain various robustness and usability challenges. In this work we present a novel program synthesis approach which combines the benefits of deductive and enumerative synthesis strategies, yielding a semi-supervised technique with which concise programs expressible in standard languages can be synthesized from very few examples. We demonstrate improvement over existing techniques in terms of overall accuracy, number of examples required, and program complexity. Our method has been deployed as a web extraction feature in the mass market Microsoft Power BI product. https://doi.org/10.1145/3318464.3380608
mod0306 SPARQL Rewriting: Towards Desired Results
Xun Jian (The Hong Kong University of Science and Technology); Yue Wang (Shenzhen Institute of Computing Sciences, Shenzhen University); Xiayu Lei (The Hong Kong University of Science and Technology); Libin Zheng (The Hong Kong University of Science and Technology); Lei Chen (The Hong Kong University of Science and Technology) Recent years witnessed the emergence of various applications on knowledge graphs, which are often represented as RDF graphs. However, due to the lack of data schema and the complexity of the SPARQL language, there is usually a gap between the user’s real desire and the actual meaning of a SPARQL query, especially when the query itself is complicated. In this paper, we try to narrow this gap by modifying a given query with a set of modifiers, so that its result approaches a user-provided example set. Specifically, we model this problem as two individual sub-problems, query-restricting and query-relaxing, both of which are shown to be NP-hard. We further prove that unless P=NP, query-restricting has no polynomial-time approximation scheme (PTAS), and query-relaxing has no polynomial-time constant-factor approximation algorithm. Despite their hardness, we propose a (1-1/e)-approximation method for query-restricting and two heuristics for query-relaxing. Extensive experiments have been conducted on real-world knowledge graphs to evaluate the effectiveness and efficiency of our proposed solutions. https://doi.org/10.1145/3318464.3389695
mod0238 Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study
Farahnaz Akrami (University of Texas at Arlington); Mohammed Samiul Saeef (University of Texas at Arlington); Qingheng Zhang (Nanjing University); Wei Hu (Nanjing University); Chengkai Li (University of Texas at Arlington) In the active research area of employing embedding models for knowledge graph completion, particularly for the task of link prediction, most prior studies used two benchmark datasets FB15k and WN18 in evaluating such models. Most triples in these and other datasets in such studies belong to reverse and duplicate relations which exhibit high data redundancy due to semantic duplication, correlation or data incompleteness. This is a case of excessive data leakage—a model is trained using features that otherwise would not be available when the model needs to be applied for real prediction. There are also Cartesian product relations for which every triple formed by the Cartesian product of applicable subjects and objects is a true fact. Link prediction on the aforementioned relations is easy and can be achieved with even better accuracy using straightforward rules instead of sophisticated embedding models. A more fundamental defect of these models is that the link prediction scenario, given such data, is non-existent in the real world. This paper is the first systematic study with the main objective of assessing the true effectiveness of embedding models when the unrealistic triples are removed. Our experiment results show these models are much less accurate than previously perceived. Their poor accuracy renders link prediction a task without a truly effective automated solution. Hence, we call for re-investigation of possible effective approaches. https://doi.org/10.1145/3318464.3380599
Research 10: Storage and Indexing • Thursday 10:30 AM – 12:00 PM
mod0407 Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines
Michael Haubenschild (Tableau Software); Caetano Sauer (Tableau Software); Thomas Neumann (Technische Universität München); Viktor Leis (Friedrich-Schiller-Universität Jena) For decades, ARIES has been the standard for logging and recovery in database systems. ARIES offers important features like support for arbitrary workloads, fuzzy checkpoints, and transparent index recovery. Nevertheless, many modern in-memory database systems use more lightweight approaches that have less overhead and better multi-core scalability but only work well for the in-memory setting. Recently, a new class of high-performance storage engines has emerged, which exploit fast SSDs to achieve performance close to pure in-memory systems but also allow out-of-memory workloads. For these systems, ARIES is too slow whereas in-memory logging proposals are not applicable. In this work, we propose a new logging and recovery design that supports incremental and fuzzy checkpointing, index recovery, out-of-memory workloads, and low-latency transaction commits. Our continuous checkpointing algorithm guarantees bounded recovery time. Using per-thread logging with minimal synchronization, our implementation achieves near-linear scalability on multi-core CPUs. We implemented and evaluated these techniques in our LeanStore storage engine. For working sets that fit in main memory, we achieve performance close to that of an in-memory approach, even with logging, checkpointing, and dirty page writing enabled. For the out-of-memory scenario, we outperform a state-of-the-art ARIES implementation by a factor of two. https://doi.org/10.1145/3318464.3389716
mod0594 Lethe: A Tunable Delete-Aware LSM Engine
Subhadeep Sarkar (Boston University); Tarikul Islam Papon (Boston University); Dimitris Staratzis (Boston University); Manos Athanassoulis (Boston University) Data-intensive applications fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as a second-class citizen. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems that makes storage a precious resource. To address these challenges, in this paper, we build a new key-value storage engine, Lethe, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17-1.4x) and lower space amplification (2.1-9.8x), with a modest increase in write amplification (between 4% and 25%). In addition, Lethe supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full-tree merge. https://doi.org/10.1145/3318464.3389757
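The out-of-place delete that motivates Lethe can be illustrated with a toy two-level LSM store: a delete only writes a tombstone, so the old value disappears physically only when a later compaction sees the tombstone. The class below is a simplified stand-in; Lethe's delete-aware compaction policies and secondary-delete-key layout are not reproduced.
```python
TOMBSTONE = object()   # sentinel marking a deleted key

class TinyLSM:
    """Two-level toy LSM: an in-memory memtable and one sorted on-'disk' run."""
    def __init__(self):
        self.memtable = {}
        self.run = {}                        # stand-in for the persisted run

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        self.memtable[key] = TOMBSTONE       # out-of-place delete: just a tombstone

    def get(self, key):
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        return self.run.get(key)

    def compact(self):
        """Merge memtable into the run; only here does a tombstone actually
        persist the deletion and make shadowed versions disappear."""
        for key, v in self.memtable.items():
            if v is TOMBSTONE:
                self.run.pop(key, None)
            else:
                self.run[key] = v
        self.memtable.clear()

db = TinyLSM()
db.put("k", 1); db.compact()
db.delete("k")                  # invisible to reads immediately...
print(db.get("k"))              # None
db.compact()                    # ...but only physically removed at compaction
print(db.run)                   # {}
```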
mod0025 BinDex: A Two-Layered Index for Fast and Robust Scans
Linwei Li (Fudan University); Kai Zhang (Fudan University); Jiading Guo (Fudan University); Wen He (Fudan University); Zhenying He (Fudan University); Yinan Jing (Fudan University); Weili Han (Fudan University); X. Sean Wang (Fudan University) In modern analytical database systems, the performance of the data scan operation is of key importance to the performance of query execution. Existing approaches may be categorized into index scan and sequential scan. However, both approaches have inherent inefficiencies. Indeed, sequential scan may need to access a large amount of unneeded data, especially for queries with low selectivity. Instead, index scan may involve a large number of expensive random memory accesses when the query selectivity is high. Moreover, with the growing complexities in database query workloads, it has become hard to predict which approach is better for a particular query. In order to obtain fast and robust scans under all selectivities, this paper proposes BinDex, a two-layered index structure based on binned bitmaps that can be used to significantly accelerate the scan operations for in-memory column stores. The first layer of BinDex consists of a set of binned bitmaps which filter out most unneeded values in a column. The second layer provides some auxiliary information to correct the bits that have incorrect values. By varying the number of bit vectors in the first layer, BinDex can make a tradeoff between memory space and performance. Experimental results show that BinDex outperforms the state-of-the-art approaches with less memory than a B+-tree would use. And by enlarging the memory space, BinDex can achieve up to 2.9 times higher performance, eliminating the need for making a choice between sequential or index scans. https://doi.org/10.1145/3318464.3380563
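A simplified sketch of the two-layer idea follows: a first layer of per-bin row-id sets (standing in for BinDex's binned bitmaps) answers most of a predicate with precomputed structures, and a second step corrects the answer by touching actual values only in the single boundary bin. Bin sizing, the set representation, and the correction mechanism here are illustrative, not BinDex's actual layout.
```python
import bisect

class BinnedIndex:
    """Layer 1: per-bin row-id sets (playing the role of binned bitmaps).
    Layer 2: a scan of the single boundary bin that corrects the answer."""
    def __init__(self, column, num_bins=8):
        self.column = column
        sorted_vals = sorted(set(column))
        step = max(1, len(sorted_vals) // num_bins)
        self.bounds = sorted_vals[step::step]            # upper bounds of bins
        self.bins = [set() for _ in range(len(self.bounds) + 1)]
        for rid, v in enumerate(column):
            self.bins[bisect.bisect_left(self.bounds, v)].add(rid)

    def leq(self, x):
        """Row ids with column value <= x."""
        b = bisect.bisect_left(self.bounds, x)
        result = set().union(*self.bins[:b]) if b else set()   # layer 1: whole bins
        result |= {rid for rid in self.bins[b]                 # layer 2: correct
                   if self.column[rid] <= x}                   # the boundary bin
        return result

col = [7, 1, 9, 4, 4, 8, 2, 6, 3, 5]
idx = BinnedIndex(col, num_bins=4)
assert idx.leq(5) == {rid for rid, v in enumerate(col) if v <= 5}
```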
mod0038s Analysis of Indexing Structures for Immutable Data
Cong Yue (National University of Singapore); Zhongle Xie (National University of Singapore); Meihui Zhang (Beijing Institute of Technology); Gang Chen (Zhejiang University); Beng Chin Ooi (National University of Singapore); Sheng Wang (Alibaba Group); Xiaokui Xiao (National University of Singapore) In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. To provide efficient support for lookup and merge operations, three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree), have been proposed. Although these structures have been adopted in real applications, there is no systematic evaluation of their pros and cons in the literature, making it difficult for practitioners to choose the right index structure for their applications. To alleviate the above problem, we present a comprehensive analysis of the existing index structures for immutable data, and evaluate both their asymptotic and empirical performance. Specifically, we show that MPT, MBT, and POS-Tree are all instances of a recently proposed framework, dubbed Structurally Invariant and Reusable Indexes (SIRI). We propose to evaluate the SIRI instances on their index performance and deduplication capability. We establish the worst-case guarantees of each index, and experimentally evaluate all indexes in a wide variety of settings. Based on our theoretical and empirical analysis, we conclude that POS-Tree is a favorable choice for indexing immutable data. https://doi.org/10.1145/3318464.3389773
mod0198 Tree-Encoded Bitmaps
Harald Lang (Technical University of Munich); Alexander Beischl (Technical University of Munich); Viktor Leis (Friedrich Schiller University Jena); Peter Boncz (Centrum Wiskunde & Informatica); Thomas Neumann (Technical University of Munich); Alfons Kemper (Technical University of Munich) We propose a novel method to represent compressed bitmaps. Similarly to existing bitmap compression schemes, we exploit the compression potential of bitmaps populated with consecutive identical bits, i.e., 0-runs and 1-runs. But in contrast to prior work, our approach employs a binary tree structure to represent runs of various lengths. Leaf nodes in the upper tree levels thereby represent longer runs, and vice versa. The tree-based representation results in high compression ratios and enables efficient random access, which in turn allows for the fast intersection of bitmaps. Our experimental analysis with randomly generated bitmaps shows that our approach significantly improves over state-of-the-art compression techniques when bitmaps are dense and/or only barely clustered. Further, we evaluate our approach with real-world data sets, showing that our tree-encoded bitmaps can save up to one third of the space over existing techniques. https://doi.org/10.1145/3318464.3380588
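The underlying idea of representing runs as tree leaves can be sketched in a few lines: build a binary tree over the bitmap, collapse any subtree whose bits are all identical into a single leaf (so long runs become leaves near the root), and answer point lookups by descending until a leaf covers the position. The actual tree-encoded bitmap packs this tree into succinct bit sequences rather than Python objects; the sketch below only conveys the structure.
```python
def build(bits, lo, hi):
    """Tree over bits[lo:hi] (hi - lo is a power of two): a run of identical
    bits collapses into a leaf ('leaf', value); otherwise split in half."""
    first = bits[lo]
    if all(b == first for b in bits[lo:hi]):
        return ('leaf', first)
    mid = (lo + hi) // 2
    return ('inner', build(bits, lo, mid), build(bits, mid, hi))

def test_bit(node, lo, hi, pos):
    """Point lookup: descend until a (possibly collapsed) leaf covers pos."""
    if node[0] == 'leaf':
        return node[1]
    mid = (lo + hi) // 2
    if pos < mid:
        return test_bit(node[1], lo, mid, pos)
    return test_bit(node[2], mid, hi, pos)

bitmap = [0, 0, 0, 0, 1, 1, 1, 0]           # length padded to a power of two
root = build(bitmap, 0, len(bitmap))
assert [test_bit(root, 0, len(bitmap), i) for i in range(8)] == bitmap
```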
Research 28: Stream Processing • Thursday 10:30 AM – 12:00 PM
mod0375 Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems
Ahmed S. Abdelhamid (Purdue University); Ahmed R. Mahmood (Purdue University); Anas Daghistani (Purdue University); Walid G. Aref (Purdue University) Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application’s service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency. https://doi.org/10.1145/3318464.3389713
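The load-aware flavor of partitioning described above can be conveyed with a classic greedy rule: place the heaviest keys first, each onto the currently lightest partition, so skewed keys do not pile onto one reducer. The sketch below is only that generic greedy idea under assumed frequencies; Prompt's frequency-aware buffering, online key-based sorting, and cost model are not shown.
```python
import heapq
from collections import Counter

def greedy_partition(tuples, num_partitions, key=lambda t: t[0]):
    """Assign keys to partitions, heaviest key first, always onto the
    currently least-loaded partition (a classic LPT-style greedy)."""
    freq = Counter(key(t) for t in tuples)              # run-time statistics
    heap = [(0, p) for p in range(num_partitions)]      # (load, partition id)
    heapq.heapify(heap)
    assignment = {}
    for k, count in freq.most_common():                 # heaviest keys first
        load, p = heapq.heappop(heap)
        assignment[k] = p
        heapq.heappush(heap, (load + count, p))
    return assignment

stream = [("a", 1)] * 50 + [("b", 1)] * 30 + [("c", 1)] * 10 + [("d", 1)] * 10
print(greedy_partition(stream, 2))   # {'a': 0, 'b': 1, 'c': 1, 'd': 1}
```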
mod0447 Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines
Bonaventura Del Monte (Technische Universität Berlin & DFKI GmbH); Steffen Zeuch (Technische Universität Berlin & DFKI GmbH); Tilmann Rabl (Hasso Plattner Institute, University of Potsdam); Volker Markl (Technische Universität Berlin & DFKI GmbH) Scale-out stream processing engines (SPEs) are powering large big data applications on high velocity data streams. Industrial setups require SPEs to sustain outages, varying data rates, and low-latency processing. SPEs need to transparently reconfigure stateful queries during runtime. However, state-of-the-art SPEs are not ready yet to handle on-the-fly reconfigurations of queries with terabytes of state due to three problems. These are network overhead for state migration, consistency, and overhead on data processing. In this paper, we propose Rhino, a library for efficient reconfigurations of running queries in the presence of very large distributed state. Rhino provides a handover protocol and a state migration protocol to consistently and efficiently migrate stream processing among servers. Overall, our evaluation shows that Rhino scales with state sizes of up to TBs, reconfigures a running query 15 times faster than the state-of-the-art, and reduces latency by three orders of magnitude upon a reconfiguration. https://doi.org/10.1145/3318464.3389723
mod0513 Grizzly: Efficient Stream Processing Through Adaptive Query Compilation
Philipp M. Grulich (Technische Universität Berlin); Sebastian Breß (Technische Universität Berlin); Steffen Zeuch (Technische Universität Berlin & DFKI GmbH); Jonas Traub (Technische Universität Berlin); Janis von Bleichert (Technische Universität Berlin); Zongxiong Chen (DFKI GmbH); Tilmann Rabl (HPI, University of Potsdam); Volker Markl (Technische Universität Berlin & DFKI GmbH) Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. They follow an interpretation-based processing model and do not perform runtime optimizations. This limits the utilization of modern hardware and neglects changing data characteristics at runtime. In this paper, we present Grizzly, a novel adaptive query compilation-based SPE, to enable highly efficient query execution. We extend query compilation and task-based parallelization for the unique requirements of stream processing and apply adaptive compilation to enable runtime re-optimizations. The combination of light-weight statistic gathering with just-in-time compilation enables Grizzly to adjust to changing data characteristics dynamically at runtime. Our experiments show that Grizzly outperforms state-of-the-art SPEs by up to an order of magnitude in throughput. https://doi.org/10.1145/3318464.3389739
mod0574 LightSaber: Efficient Window Aggregation on Multi-core Processors
Georgios Theodorakis (Imperial College London); Alexandros Koliousis (Graphcore Research); Peter Pietzuch (Imperial College London); Holger Pirk (Imperial College London) Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems—on a 16-core server, it processes 470 million records/s with 132 μs average latency. https://doi.org/10.1145/3318464.3389753
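The incremental side of the trade-off above can be illustrated with pane-based sharing: pre-aggregate the stream into non-overlapping panes and compose each sliding-window result from pane partials, so overlapping windows reuse work. This is a minimal sketch of that general idea for sums only; LightSaber's parallel aggregation tree and generalized aggregation graph go well beyond it.
```python
from math import gcd

def sliding_sums(values, window, slide):
    """Sum over each sliding window, sharing work across overlapping windows
    by first aggregating non-overlapping panes of size gcd(window, slide)."""
    pane = gcd(window, slide)
    panes = [sum(values[i:i + pane]) for i in range(0, len(values), pane)]
    per_window = window // pane
    results = []
    for start in range(0, len(panes) - per_window + 1, slide // pane):
        results.append(sum(panes[start:start + per_window]))   # combine partials
    return results

vals = list(range(12))                    # 0..11
print(sliding_sums(vals, window=4, slide=2))
# windows [0..3], [2..5], [4..7], [6..9], [8..11] -> [6, 14, 22, 30, 38]
```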
mod0107 Parallel Index-based Stream Join on a Multicore CPU
Amirhesam Shahvarani (Technische Universität München); Hans-Arno Jacobsen (Technische Universität München) Indexing sliding window content to enhance the performance of streaming queries can be greatly improved by utilizing the computational capabilities of a multicore processor. Conventional indexing data structures optimized for frequent search queries on a prestored dataset do not meet the demands of indexing highly dynamic data as in streaming environments. In this paper, we introduce an index data structure, called the partitioned in-memory merge tree, to address the challenges that arise when indexing highly dynamic data, which are common in streaming settings. Utilizing the specific pattern of streaming data and the distribution of queries, we propose a low-cost and effective concurrency control mechanism to meet the demands of high-rate update queries. To complement the index, we design an algorithm to realize a parallel index-based stream join that exploits the computational power of multicore processors. Our experiments using an octa-core processor show that our parallel stream join achieves up to 5.5 times higher throughput than a single-threaded approach. https://doi.org/10.1145/3318464.3380576
Research 11: Machine Learning for Databases II • Thursday 10:30 AM – 12:00 PM
mod0370 ALEX: An Updatable Adaptive Learned Index
Jialin Ding (Massachusetts Institute of Technology); Umar Farooq Minhas (Microsoft Research); Jia Yu (Arizona State University & Microsoft Research); Chi Wang (Microsoft Research); Jaeyoung Do (Microsoft Research); Yinan Li (Microsoft Research); Hantian Zhang (Georgia Institute of Technology & Microsoft Research); Badrish Chandramouli (Microsoft Research); Johannes Gehrke (Microsoft); Donald Kossmann (Microsoft Research); David Lomet (Microsoft Research); Tim Kraska (Massachusetts Institute of Technology) Recent work on “learned indexes” has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as “models” that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+ tree by a factor of up to three in search time and by an order of magnitude in memory footprint. However, it is limited to static, read-only workloads. In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. ALEX effectively combines the core insights from learned indexes with proven storage and indexing techniques to achieve high performance and low memory footprint. On read-only workloads, ALEX beats the learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size. Across the spectrum of read-write workloads, ALEX beats B+ trees by up to 4.1X while never performing worse, with up to 2000X smaller index size. We believe ALEX presents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates. https://doi.org/10.1145/3318464.3389711
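The learned-index idea that ALEX builds on can be sketched compactly: fit a model that maps keys to positions, predict, then search only a small window around the prediction whose size is bounded by the model's worst-case error. The class below is a toy single-model version over a sorted array; ALEX's gapped arrays, adaptive node splitting, and insert/update handling are not reproduced.
```python
import bisect

class TinyLearnedIndex:
    """Single linear model over a sorted array plus local search around the
    predicted position (maximum prediction error recorded at build time)."""
    def __init__(self, keys):
        self.keys = keys                       # assumed sorted
        n = len(keys)
        mean_k = sum(keys) / n                 # least-squares fit: pos ~ a*key + b
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys) or 1.0
        self.a = cov / var
        self.b = mean_p - self.a * mean_k
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo, hi = max(0, p - self.err), min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)   # search only the window
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = [2 * i for i in range(1000)]          # perfectly linear data
idx = TinyLearnedIndex(keys)
print(idx.lookup(842), idx.lookup(843))      # 421 None
```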
mod0134 Learning Multi-Dimensional Indexes
Vikram Nathan (Massachusetts Institute of Technology); Jialin Ding (Massachusetts Institute of Technology); Mohammad Alizadeh (Massachusetts Institute of Technology); Tim Kraska (Massachusetts Institute of Technology) Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system. https://doi.org/10.1145/3318464.3380579
mod0573 The Case for a Learned Sorting Algorithm
Ani Kristo (Brown University); Kapil Vaidya (Massachusetts Institute of Technology); Uğur Çetintemel (Brown University); Sanchit Misra (Intel Labs); Tim Kraska (Massachusetts Institute of Technology) Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python. https://doi.org/10.1145/3318464.3389752
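The distribution-sort recipe described above can be sketched as: approximate the empirical CDF (here with a sorted sample rather than a fitted model), place each key into a bucket by its predicted rank, and run insertion sort over the nearly-sorted concatenation. The sampling-based "model" and the bucket counts are illustrative assumptions, not the authors' model or collision handling.
```python
import bisect
import random

def insertion_sort(a):
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def learned_sort(keys, num_buckets=256, sample_size=1000):
    """Distribution sort guided by an approximate empirical CDF.
    The 'model' here is just a sorted sample used for rank prediction."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        cdf = bisect.bisect_right(sample, k) / len(sample)   # approx CDF(k)
        buckets[min(int(cdf * num_buckets), num_buckets - 1)].append(k)
    nearly_sorted = [k for b in buckets for k in b]          # ordered across buckets
    return insertion_sort(nearly_sorted)                     # cheap final pass

data = [random.gauss(0, 1) for _ in range(10000)]
assert learned_sort(data) == sorted(data)
```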
mod0459 QuickSel: Quick Selectivity Learning with Mixture Models
Yongjoo Park (University of Illinois at Urbana-Champaign); Shucheng Zhong (University of Michigan – Ann Arbor); Barzan Mozafari (University of Michigan – Ann Arbor) Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today’s databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice—i.e., require an exponential number of buckets—or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly more accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x–179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%–91.8% more accurate than periodically-updated histograms and samples, respectively. https://doi.org/10.1145/3318464.3389727
mod0527 Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
Shohedul Hasan (University of Texas at Arlington); Saravanan Thirumuruganathan (QCRI, HBKU); Jees Augustine (University of Texas at Arlington); Nick Koudas (University of Toronto); Gautam Das (University of Texas at Arlington) Selectivity estimation – the problem of estimating the result size of queries – is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches often struggle to provide accurate results for multi-attribute queries involving a large number of predicates and with low selectivity. In this paper, we propose two complementary approaches that are effective for this scenario. Our first approach models selectivity estimation as a density estimation problem where one seeks to estimate the joint probability distribution from a finite number of samples. We leverage techniques from neural density estimation to build an accurate selectivity estimator. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions such that they satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We describe how to extend our algorithms for range queries. We also introduce and address a number of practical challenges arising when adapting deep learning for relational data. These include query/data featurization, incorporating query workload information in a deep learning framework and the dynamic scenario where both data and workload queries could be updated. Our extensive experiments, with a special emphasis on queries with a large number of predicates and/or small result sizes, demonstrate that our proposed techniques provide fast and accurate selectivity estimates with minimal space overhead. https://doi.org/10.1145/3318464.3389741
Research 26: Usability and Natural Language User Interfaces • Thursday 1:30 PM – 3:00 PM
mod0637 QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster
Aristotelis Leventidis (Northeastern University); Jiahui Zhang (Northeastern University); Cody Dunne (Northeastern University); Wolfgang Gatterbauer (Northeastern University); H.V. Jagadish (University of Michigan); Mirek Riedewald (Northeastern University) Understanding the meaning of existing SQL queries is critical for code maintenance and reuse. Yet SQL can be hard to read, even for expert users or the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in automatically-generated visual diagrams that can help users understand the meaning of queries faster and more accurately than SQL text alone. We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture the meaning of deeply nested queries. Our diagrams build upon a rich history of diagrammatic reasoning systems in logic and were designed using a large body of human-computer interaction best practices: they are minimal in that no visual element is superfluous; they are unambiguous in that no two queries with different semantics map to the same visualization; and they extend previously existing visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation involving 42 users on Amazon Mechanical Turk shows that with only a 2–3 minute static tutorial, participants could interpret queries meaningfully faster with our diagrams than when reading SQL alone. Moreover, we have evidence that our visual diagrams result in participants making fewer errors than with SQL. We believe that more regular exposure to diagrammatic representations of SQL can give rise to a pattern-based and thus more intuitive use and re-use of SQL. A full version of this paper with all appendices and supplemental material for the experimental study (stimuli, raw data, and analysis code) is available at https://osf.io/btszh. https://doi.org/10.1145/3318464.3389767
mod0295s Duoquest: A Dual-Specification System for Expressive SQL Queries
Christopher Baik (University of Michigan – Ann Arbor); Zhongjun Jin (University of Michigan – Ann Arbor); Michael Cafarella (University of Michigan – Ann Arbor); H. V. Jagadish (University of Michigan – Ann Arbor) Querying a relational database is difficult because it requires users to be familiar with both the SQL language and the schema. However, many users possess enough domain expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of these alternatives face certain pitfalls: natural language queries (NLQs) are often ambiguous, even for human interpreters, while current PBE approaches limit functionality to be tractable. Consequently, we propose dual-specification query synthesis, which consumes both a NLQ and an optional PBE-like table sketch query that enables users to express varied levels of domain knowledge. We introduce the novel dual-specification Duoquest system, which leverages guided partial query enumeration to efficiently explore the space of possible queries. We present results from user studies in which Duoquest demonstrates a 62.5% absolute increase in query construction accuracy over a state-of-the-art NLI and comparable accuracy to a PBE system on a limited workload supported by the PBE system. In a simulation study on the Spider benchmark, Duoquest demonstrates a >2x increase in top-1 accuracy over both NLI and PBE. https://doi.org/10.1145/3318464.3389776
mod0580 SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns
Prashanth Dintyala (Georgia Institute of Technology); Arpit Narechania (Georgia Institute of Technology); Joy Arulraj (Georgia Institute of Technology) The emergence of database-as-a-service platforms has made deploying database applications easier than before. Now, developers can quickly create scalable applications. However, designing performant, maintainable, and accurate applications is challenging. Developers may unknowingly introduce anti-patterns in the application’s SQL statements. These anti-patterns are design decisions that are intended to solve a problem but often lead to other problems by violating fundamental design principles. In this paper, we present SQLCheck, a holistic toolchain for automatically finding and fixing anti-patterns in database applications. We introduce techniques for automatically (1) detecting anti-patterns with high precision and recall, (2) ranking the anti-patterns based on their impact on performance, maintainability, and accuracy of applications, and (3) suggesting alternative queries and changes to the database design to fix these anti-patterns. We demonstrate the prevalence of these anti-patterns in a large collection of queries and databases collected from open-source repositories. We introduce an anti-pattern detection algorithm that augments query analysis with data analysis. We present a ranking model for characterizing the impact of frequently occurring anti-patterns. We discuss how SQLCheck suggests fixes for high-impact anti-patterns using rule-based query refactoring techniques. Our experiments demonstrate that SQLCheck enables developers to create more performant, maintainable, and accurate applications. https://doi.org/10.1145/3318464.3389754
mod0211 DBPal: A Fully Pluggable NL2SQL Training Pipeline
Nathaniel Weir (Johns Hopkins University); Prasetya Utama (Technische Universität Darmstadt); Alex Galakatos (Brown University); Andrew Crotty (Brown University); Amir Ilkhechi (Brown University); Shekar Ramaswamy (Brown University); Rohin Bhushan (Brown University); Nadja Geisler (Technische Universität Darmstadt); Benjamin Hättasch (Technische Universität Darmstadt); Steffen Eger (Technische Universität Darmstadt); Ugur Cetintemel (Brown University); Carsten Binnig (Technische Universität Darmstadt) Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models. https://doi.org/10.1145/3318464.3380589
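One common way such pipelines specialize training data to a target database is template instantiation: pair natural-language templates with SQL templates and fill both from the schema. The schema, templates, and function below are hypothetical illustrations of that general idea only; DBPal's actual pipeline additionally augments for linguistic variation, which is omitted here.
```python
import itertools

schema = {                      # hypothetical target database schema
    "patients": ["name", "age", "city"],
}

templates = [                   # (natural-language template, SQL template)
    ("show the {col} of all {table}", "SELECT {col} FROM {table}"),
    ("how many {table} are there", "SELECT COUNT(*) FROM {table}"),
    ("list {table} where {col} equals {{value}}",
     "SELECT * FROM {table} WHERE {col} = {{value}}"),
]

def generate_pairs(schema, templates):
    """Instantiate every (NL, SQL) template with every table/column."""
    pairs = []
    for (nl_t, sql_t), (table, cols) in itertools.product(templates, schema.items()):
        slots = cols if "{col}" in sql_t else [None]     # skip unused column slot
        for col in slots:
            pairs.append((nl_t.format(col=col, table=table),
                          sql_t.format(col=col, table=table)))
    return pairs

for nl, sql in generate_pairs(schema, templates)[:4]:
    print(f"{nl!r:45} -> {sql}")
```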
mod0357s SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data
Vraj Shah (University of California, San Diego); Side Li (University of California, San Diego); Arun Kumar (University of California, San Diego); Lawrence Saul (University of California, San Diego) Speech-driven querying is becoming popular in new device environments such as smartphones, tablets, and even conversational assistants. However, such querying is largely restricted to natural language. Typed SQL remains the gold standard for sophisticated structured querying although it is painful in many environments, which restricts when and how users consume their data. In this work, we propose to bridge this gap by designing a speech-driven querying system and interface for structured data we call SpeakQL. We support a practically useful subset of regular SQL and allow users to query in any domain with novel touch/speech based human-in-the-loop correction mechanisms. Automatic speech recognition (ASR) introduces myriad forms of errors in transcriptions, presenting us with a technical challenge. We exploit our observations of SQL’s properties, its grammar, and the queried database to build a modular architecture. We present the first dataset of spoken SQL queries and a generic approach to generate them for any arbitrary schema. Our experiments show that SpeakQL can automatically correct a large fraction of errors in ASR transcriptions. User studies show that SpeakQL can help users specify SQL queries significantly faster, with an average speedup of 2.7x and up to 6.7x compared to typing on a tablet device. SpeakQL also reduces the user effort in specifying queries by an average factor of 10x and up to 60x compared to raw typing effort. https://doi.org/10.1145/3318464.3389777
Research 23: OLAP, Data Warehouses, and Key-Value Stores • Thursday 1:30 PM – 3:00 PM
mod0663 Bitvector-aware Query Optimization for Decision Support Queries
Bailu Ding (Microsoft Research); Surajit Chaudhuri (Microsoft Research); Vivek Narasayya (Microsoft Research) Bitvector filtering is an important query processing technique that can significantly reduce the cost of execution, especially for complex decision support queries with multiple joins. Despite its wide application, however, its implication to query optimization is not well understood. In this work, we study how bitvector filters impact query optimization. We show that incorporating bitvector filters into query optimization straightforwardly can increase the plan space complexity by an exponential factor in the number of relations in the query. We analyze the plans with bitvector filters for star and snowflake queries in the plan space of right deep trees without cross products. Surprisingly, with some simplifying assumptions, we prove that the plan of minimal cost with bitvector filters can be found among a linear number of plans in the number of relations in the query. This greatly reduces the plan space complexity for such queries from exponential to linear. Motivated by our analysis, we propose an algorithm that accounts for the impact of bitvector filters in query optimization. Our algorithm optimizes the join order for an arbitrary decision support query by choosing from a linear number of candidate plans in the number of relations in the query. We implement our algorithm in a commercial DBMS, DBMS-X, as a transformation rule. Our evaluation on both industry standard benchmarks and a customer workload shows that, compared with DBMS-X, our technique reduces the total CPU execution time by 22%-64% for the workloads, with up to two orders of magnitude reduction in CPU execution time for individual queries. https://doi.org/10.1145/3318464.3389769
mod0414 Efficient Join Synopsis Maintenance for Data Warehouse
Zhuoyue Zhao (University of Utah); Feifei Li (University of Utah); Yuxi Liu (University of Utah) Various sources such as daily business operations and sensors from different IoT applications constantly generate large volumes of data, which are often loaded into a data warehouse system for complex analysis. Such analysis, however, can be extremely costly if the query involves joins, especially many-to-many joins over multiple large tables. A join synopsis, i.e., a small uniform random sample over the join result, often suffices as a representative alternative to the full join result for many applications such as histogram construction, model training, etc. Towards that end, we propose a novel algorithm SJoin that can maintain a join synopsis over a pre-specified general θ-join query in a dynamic database with continuous inflows of updates. Central to SJoin is maintaining a weighted join graph index, which assists in efficiently replacing join results in the synopsis upon update. We conduct extensive experiments using TPC-DS and simulated road sensor data over several complex join queries, and they demonstrate the clear advantage of SJoin over the best available baseline. https://doi.org/10.1145/3318464.3389717
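At its core, a join synopsis is a fixed-size uniform sample maintained under a stream of arrivals. The classic reservoir-sampling routine below conveys only that maintenance idea for an insert-only stream; SJoin's contribution is doing this over θ-join results under updates via a weighted join graph index, which is not reproduced here.
```python
import random

def reservoir(stream, k):
    """Maintain a uniform random sample of size k over an insert-only stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randrange(i + 1)     # item kept with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir(range(100_000), 5))
```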
mod0557s Adaptive HTAP through Elastic Resource Scheduling
Aunn Raza (Ecole Polytechnique Fédérale de Lausanne); Periklis Chrysogelos (Ecole Polytechnique Fédérale de Lausanne); Angelos Christos Anadiotis (Ecole Polytechnique); Anastasia Ailamaki (Ecole Polytechnique Fédérale de Lausanne) Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost for either OLTP or OLAP. The data freshness and the performance requirements of both engines, however, may vary with the workload. We approach HTAP as a scheduling problem, addressed at runtime through elastic resource management. We model an HTAP system as a set of three individual engines: an OLTP, an OLAP and a Resource and Data Exchange (RDE) engine. We devise a scheduling algorithm which traverses the HTAP design spectrum through elastic resource management, to meet the workload data freshness requirements. We propose an in-memory system design which is non-intrusive to the current state-of-the-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach. Our evaluation shows that the performance benefit of our system for OLAP queries increases over time, reaching up to 50% compared to static schedules for 100 query sequences, while maintaining a small, and controlled, drop in the OLTP throughput. https://doi.org/10.1145/3318464.3389783
mod0032 SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries
Yoon-Min Nam (Daegu Gyeongbuk Institute of Science and Technology); Donghyoung Han (Daegu Gyeongbuk Institute of Science and Technology); Min-Soo Kim (Korea Advanced Institute of Science and Technology) The concept of OLAP query processing is now being widely adopted in various applications. The number of complex queries containing the joins between non-unique keys (called FK-FK joins) increases in those applications. However, the existing in-memory OLAP systems tend not to handle such complex queries efficiently since they generate a large amount of intermediate results or incur a huge amount of probe cost. In this paper, we propose an effective query planning method for complex OLAP queries. It generates a query plan containing n-ary join operators based on a cost model. The plan does not generate intermediate results for processing FK-FK joins and significantly reduces the probe cost. We also propose an efficient processing method for n-ary join operators. We implement the prototype system SPRINTER by integrating our proposed methods into an open-source in-memory OLAP system. Through experiments using the TPC-DS benchmark, we have shown that SPRINTER outperforms the state-of-the-art OLAP systems for complex queries. https://doi.org/10.1145/3318464.3380565
mod0471 Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores
Siqiang Luo (Harvard University); Subarna Chatterjee (Harvard University); Rafael Ketsetsidis (Harvard University); Niv Dayan (Harvard University); Wilson Qin (Harvard University); Stratos Idreos (Harvard University) We introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree based key-value stores. The core intuition is that we can sacrifice filter probe time because it is not visible in end-to-end key-value store performance, which in turn allows us to significantly reduce the filter false positive rate for every level of the tree. Rosetta indexes all binary prefixes of a key using a hierarchically arranged set of Bloom filters. It then converts each range query into multiple probes, one for each non-overlapping binary prefix. Rosetta has the ability to track workload patterns and adopt a beneficial tuning for each individual LSM-tree run by adjusting the number of Bloom filters it uses and how memory is spread among them to optimize the FPR/CPU cost balance. We show how to integrate Rosetta in a full system, RocksDB, and we demonstrate that it brings as much as a 40x improvement compared to default RocksDB and 2-5x improvement compared to state-of-the-art range filters in a variety of workloads and across different levels of the memory hierarchy (memory, SSD, hard disk). We also show that, unlike state-of-the-art filters, Rosetta brings a net benefit in RocksDB’s overall performance, i.e., it improves range queries without losing any performance for point queries. https://doi.org/10.1145/3318464.3389731
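The prefix idea behind the filter can be sketched compactly: index every binary prefix of each key, one structure per prefix length, and answer a range probe by decomposing the range into maximal dyadic intervals, each of which maps to a single prefix probe. In the sketch below, ordinary Python sets stand in for the per-level Bloom filters, so there are no false positives; Rosetta's memory allocation across levels and its FPR/CPU tuning are not modeled.
```python
class PrefixRangeFilter:
    """Range-emptiness filter over integer keys in [0, 2**bits).
    Per-length prefix sets stand in for the hierarchy of Bloom filters."""
    def __init__(self, keys, bits=16):
        self.bits = bits
        self.levels = [set() for _ in range(bits + 1)]   # levels[L]: L-bit prefixes
        for k in keys:
            for L in range(bits + 1):
                self.levels[L].add(k >> (bits - L))

    def may_contain_range(self, lo, hi):
        """True if some key might lie in [lo, hi] (never a false negative)."""
        while lo <= hi:
            # largest aligned dyadic block starting at lo that fits in [lo, hi]
            size = lo & -lo or 1 << self.bits
            while size > hi - lo + 1:
                size //= 2
            length = self.bits - size.bit_length() + 1   # prefix length of block
            if (lo >> (self.bits - length)) in self.levels[length]:
                return True
            lo += size
        return False

f = PrefixRangeFilter([3, 97, 1200], bits=16)
print(f.may_contain_range(90, 110))    # True  (97 is inside)
print(f.may_contain_range(200, 900))   # False (no key in range)
```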
Research 16: Graph and Stream Processing • Thursday 1:30 PM – 3:00 PM
mod0546 Scaling Up Distance Labeling on Graphs with Core-Periphery Properties
Wentao Li (CAI, FEIT, University of Technology Sydney); Miao Qiao (University of Auckland); Lu Qin (CAI, FEIT, University of Technology Sydney); Ying Zhang (CAI, FEIT, University of Technology Sydney); Lijun Chang (University of Sydney); Xuemin Lin (University of New South Wales) In indexing a graph for distance queries, distance labeling is a common practice; in particular, 2-hop labeling which guarantees the exactness of the query results is widely adopted. When it comes to a massive real graph with a relatively large treewidth such as social networks and web graphs, however, 2-hop labeling can hardly be constructed due to the oversized index. This paper discloses the theoretical relationships between the graph treewidth and 2-hop labeling’s index size and query time. To scale up distance labeling, this paper proposes Core-Tree (CT) Index to facilitate a critical and effective trade-off between the index size and query time. The reduced index size enables CT-Index to handle massive graphs that no existing approaches can process while the cost in the query time is negligible: the query time is below 0.4 milliseconds on all tested graphs including one graph with 5.5 billion edges. https://doi.org/10.1145/3318464.3389748
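For readers unfamiliar with 2-hop labeling, the sketch below shows the standard idea the paper builds on (pruned landmark labeling; a hedged illustration, not the CT-Index itself): every vertex stores distances to a set of hubs, and a query takes the minimum of d(u, h) + d(h, v) over common hubs h.

```python
# A hedged sketch of 2-hop (hub) labeling via pruned BFS (the classic pruned
# landmark labeling idea), shown only to illustrate what such an index answers;
# it is not the CT-Index construction from the paper.
from collections import deque

def build_2hop_labels(adj, order):
    labels = {v: {} for v in adj}              # labels[v] = {hub: distance}

    def query(u, v):
        # distance = min over common hubs h of d(u, h) + d(h, v)
        best = float("inf")
        for hub, d in labels[u].items():
            if hub in labels[v]:
                best = min(best, d + labels[v][hub])
        return best

    for hub in order:                          # run a pruned BFS from each hub
        dist = {hub: 0}
        queue = deque([hub])
        while queue:
            u = queue.popleft()
            if query(hub, u) <= dist[u]:       # already covered by earlier hubs
                continue
            labels[u][hub] = dist[u]
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return labels, query

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
order = sorted(adj, key=lambda v: -len(adj[v]))   # high-degree vertices first
labels, query = build_2hop_labels(adj, order)
print(query(0, 4))   # 3, e.g. via the path 0-1-3-4
```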
mod0346 Reliable Data Distillation on Graph Convolutional Network
Wentao Zhang (Peking University & National Engineering Laboratory for Big Data Analysis and Applications); Xupeng Miao (Peking University); Yingxia Shao (Beijing University of Posts and Telecommunications, BUPT); Jiawei Jiang (ETH Zurich); Lei Chen (Hong Kong University of Science and Technology); Olivier Ruas (Peking University); Bin Cui (Peking University & National Engineering Laboratory for Big Data Analysis and Applications) Graph Convolutional Network (GCN) is a widely used method for learning from graph-based data. However, it fails to use the unlabeled data to its full potential, thereby hindering its ability. Given some pseudo labels of the unlabeled data, the GCN can benefit from this extra supervision. Based on Knowledge Distillation and Ensemble Learning, lots of methods use a teacher-student architecture to make better use of the unlabeled data and then make a better prediction. However, these methods introduce unnecessary training costs and a high bias in the student model if the teacher’s predictions are unreliable. Besides, the final ensemble gains are limited due to limited diversity in the combined models. Therefore, we propose Reliable Data Distillation, a reliable data-driven semi-supervised GCN training method. By defining the node reliability and edge reliability in a graph, we can make better use of high-quality data and improve the graph representation learning. Furthermore, considering the data reliability and data importance, we propose a new ensemble learning method for GCN and a novel Self-Boosting SSL Framework to combine the above optimizations. Finally, our extensive evaluation of Reliable Data Distillation on real-world datasets shows that our approach outperforms the state-of-the-art methods on semi-supervised node classification tasks. https://doi.org/10.1145/3318464.3389706
mod0481 Regular Path Query Evaluation on Streaming Graphs
Anil Pacaci (University of Waterloo); Angela Bonifati (Lyon 1 University); M. Tamer Özsu (University of Waterloo) We study persistent query evaluation over streaming graphs, which is becoming increasingly important. We focus on navigational queries that determine if there exists a path between two entities that satisfies a user-specified constraint.
We adopt the Regular Path Query (RPQ) model that specifies navigational patterns with labeled constraints. We propose deterministic algorithms to efficiently evaluate persistent RPQs under both arbitrary and simple path semantics in a uniform manner. Experimental analysis on real and synthetic streaming graphs shows that the proposed algorithms can process up to tens of thousands of edges per second and efficiently answer RPQs that are commonly used in real-world workloads.
https://doi.org/10.1145/3318464.3389733
mod0237 Timely Reporting of Heavy Hitters using External Memory
Prashant Pandey (Carnegie Mellon University); Shikha Singh (Wellesley College); Michael A. Bender (Stony Brook University); Jonathan W. Berry (Sandia National Laboratories); Martín Farach-Colton (Rutgers University); Rob Johnson (VMware Research); Thomas M. Kroeger (Sandia National Laboratories); Cynthia A. Phillips (Sandia National Laboratories) Given an input stream $S$ of size $N$, a $\phi$-heavy hitter is an item that occurs at least $\phi N$ times in $S$. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its $T = \phi N$-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space ($\Omega(N)$ words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., $\approx$ 100K observations per second. https://doi.org/10.1145/3318464.3380598
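As a point of reference, the reporting semantics of the TED problem are easy to state in code. The sketch below is an in-memory illustration only (it uses the exact, $\Omega(N)$-word counters the paper deliberately moves to external memory) and reports each item at the exact moment of its T = φN-th occurrence.

```python
# In-memory illustration of the TED reporting semantics only (exact counters,
# hence Omega(N) words of state); the paper's contribution is sustaining this
# on large, high-speed streams using external memory.
from collections import defaultdict

def timely_heavy_hitters(stream, phi, n):
    """Yield (position, item) the moment an item reaches its T = phi*n-th occurrence."""
    threshold = int(phi * n)
    counts = defaultdict(int)
    for pos, item in enumerate(stream):
        counts[item] += 1
        if counts[item] == threshold:      # reported exactly once, with no delay
            yield pos, item

stream = ["a", "b", "a", "c", "a", "b", "a"]
print(list(timely_heavy_hitters(stream, phi=0.5, n=len(stream))))  # [(4, 'a')]
```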
mod0125 Factorized Graph Representations for Semi-Supervised Learning from Sparse Data
Krishna Kumar P. (IIT Madras); Paul Langton (Northeastern University); Wolfgang Gatterbauer (Northeastern University) Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that iteratively pass messages along edges, starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix, which must thus be provided by either domain experts or heuristics. We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph. This method, which we call distant compatibility estimation, works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the underlying idea of leveraging algebraic properties of an algorithm’s update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics. https://doi.org/10.1145/3318464.3380577
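The propagation step the paper accelerates can be illustrated with a small linearized update, X ← X0 + eps·A·X·H, where H is the (centered) class-compatibility matrix. In the sketch below H is assumed given; estimating it from a sparsely labeled graph is precisely the paper's contribution. The graph, compatibilities, and step size are made-up toy values.

```python
# Toy linearized label propagation with an explicit class-compatibility matrix H.
# Estimating H from a sparsely labeled graph is the paper's contribution; here H,
# the graph, and the step size are made-up illustrative values.
import numpy as np

def propagate(A, seed_labels, H, n_classes, iters=20, eps=0.1):
    n = A.shape[0]
    X0 = np.zeros((n, n_classes))
    for node, c in seed_labels.items():
        X0[node, c] = 1.0
    Hc = H - H.mean()                      # centered compatibilities
    X = X0.copy()
    for _ in range(iters):
        X = X0 + eps * (A @ X @ Hc)        # neighbors vote through H
    return X.argmax(axis=1)

# Heterophilous toy graph: class 0 nodes link mostly to class 1 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = np.array([[0.1, 0.9],
              [0.9, 0.1]])
print(propagate(A, {0: 0}, H, n_classes=2))   # [0 1 1 0]
```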
Research 24: Spatial, Temporal, and Multimedia Data I • Thursday 1:30 PM – 3:00 PM
mod0135 RID: Deduplicating Snapshot Computations
Nikos Tsikoudis (Brandeis University); Liuba Shrira (Brandeis University) One can audit SQL applications by running SQL programs over sequences of persistent snapshots, but care is needed to avoid wasteful duplicate computation.
This paper describes the design, implementation, and performance of RID, the first language-independent optimization framework that eliminates duplicate computations in SQL programs running over low-level snapshots by exploiting snapshot metadata efficiently.
https://doi.org/10.1145/3318464.3380580
mod0328 Architecting a Query Compiler for Spatial Workloads
Ruby Y. Tahboub (Purdue University); Tiark Rompf (Purdue University) Modern location-based applications rely extensively on the efficient processing of spatial data and queries. Spatial query engines are commonly engineered as an extension to a relational database or a cluster-computing framework. A large part of the spatial processing runtime is spent on evaluating spatial predicates and traversing spatial indexing structures. Typical high-level implementations of these spatial structures incur significant interpretive overhead, which increases latency and lowers throughput. A promising idea to improve the performance of spatial workloads is to leverage native code generation techniques that have become popular in relational query engines. However, architecting a spatial query compiler is challenging since spatial processing has fundamentally different execution characteristics from relational workloads in terms of data dimensionality, indexing structures, and predicate evaluation. In this paper, we discuss the underlying reasons why standard query compilation techniques are not fully effective when applied to spatial workloads, and we demonstrate how a particular style of query compilation based on techniques borrowed from partial evaluation and generative programming manages to avoid most of these difficulties by extending the scope of custom code generation into the data structures layer. We extend the LB2 main-memory query compiler, a relational engine developed in this style, with spatial data types, predicates, indexing structures, and operators. We show that the spatial extension matches the performance of specialized library code and outperforms relational and map-reduce extensions. https://doi.org/10.1145/3318464.3389701
mod0338 LISA: A Learned Index Structure for Spatial Data
Pengfei Li (Zhejiang University); Hua Lu (Roskilde University); Qian Zheng (Nanyang Technological University); Long Yang (Zhejiang University); Gang Pan (Zhejiang University) In spatial query processing, the popular index R-tree may incur large storage consumption and high IO cost. Inspired by the recent learned index [17] that replaces B-tree with machine learning models, we study an analogy problem for spatial data. We propose a novel Learned Index structure for Spatial dAta (LISA for short). Its core idea is to use machine learning models, through several steps, to generate searchable data layout in disk pages for an arbitrary spatial dataset. In particular, LISA consists of a mapping function that maps spatial keys (points) into 1-dimensional mapped values, a learned shard prediction function that partitions the mapped space into shards, and a series of local models that organize shards into pages. Based on LISA, a range query algorithm is designed, followed by a lattice regression model that enables us to convert a KNN query to range queries. Algorithms are also designed for LISA to handle data updates. Extensive experiments demonstrate that LISA clearly outperforms R-tree and other alternatives in terms of storage consumption and IO cost for queries. Moreover, LISA can handle data insertions and deletions efficiently. https://doi.org/10.1145/3318464.3389703
mod0689 Effective Travel Time Estimation: When Historical Trajectories over Road Networks Matter
Haitao Yuan (Tsinghua University); Guoliang Li (Tsinghua University); Zhifeng Bao (RMIT University); Ling Feng (Tsinghua University) In this paper, we study the problem of origin-destination (OD) travel time estimation where the OD input consists of an OD pair and a departure time. We propose a novel neural network based prediction model that fully exploits an important fact neglected by the literature — for a past OD trip its travel time is usually affiliated with the trajectory it travels along, whereas it does not exist during prediction. At the training phase, our goal is to design novel representations for the OD input and its affiliated trajectory, such that they are close to each other in the latent space. First, we match the OD pairs and their affiliated (historical) trajectories to road networks, and utilize road segment embeddings to represent their spatial properties. Later, we match the timestamps associated with trajectories to time slots and utilize time slot embeddings to represent the temporal properties. Next, we build a temporal graph to capture the weekly and daily periodicity of time slot embeddings. Last, we design an effective encoding to represent the spatial and temporal properties of trajectories. To bind each OD input to its affiliated trajectory, we also encode the OD input into a hidden representation, and make the hidden representation close to the spatio-temporal representation of the trajectory. At the prediction phase, we only use the OD input, get the hidden representation of the OD input, and use it to generate the travel time. Extensive experiments on real datasets show that our method achieves high effectiveness and outperforms existing methods. https://doi.org/10.1145/3318464.3389771
Research 1: Crowdsourcing and Visualization • Thursday 3:30 PM – 5:00 PM
mod0436 Recommending Deployment Strategies for Collaborative Tasks
Dong Wei (New Jersey Institute of Technology); Senjuti Basu Roy (New Jersey Institute of Technology); Sihem Amer-Yahia (CNRS, Univ. Grenoble Alpes) Our work contributes to aiding requesters in deploying collaborative tasks in crowdsourcing. We initiate the study of recommending deployment strategies for collaborative tasks to requesters that are consistent with deployment parameters they desire: a lower-bound on the quality of the crowd contribution, an upper-bound on the latency of task completion, and an upper-bound on the cost incurred by paying workers. A deployment strategy is a choice of value for three dimensions: Structure (whether to solicit the workforce sequentially or simultaneously), Organization (to organize it collaboratively or independently), and Style (to rely solely on the crowd or to combine it with machine algorithms). We propose StratRec, an optimization-driven middle layer that recommends deployment strategies and alternative deployment parameters to requesters by accounting for worker availability. Our solutions are grounded in discrete optimization and computational geometry techniques that produce results with theoretical guarantees. We present extensive experiments on Amazon Mechanical Turk, and conduct synthetic experiments to validate the qualitative and scalability aspects of StratRec. https://doi.org/10.1145/3318464.3389719
mod0692 Human-in-the-loop Outlier Detection
Chengliang Chai (Tsinghua University); Lei Cao (CSAIL, MIT); Guoliang Li (Tsinghua University); Jian Li (Tsinghua University); Yuyu Luo (Tsinghua University); Samuel Madden (CSAIL, MIT) Outlier detection is critical to a large number of applications from finance fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, such outliers detected based on statistical rarity do not necessarily correspond to the true outliers of interest to applications. In this work, we propose a human-in-the-loop outlier detection approach HOD that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions such that humans can easily understand the questions even if humans know nothing about the outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (aka, inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called contexts) to help humans understand the context in which the outliers occur. This ensures humans are able to easily identify the true outliers from the outlier candidates produced by the machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to be able to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms the state-of-the-art methods on both human efforts and the quality of the discovered outliers. https://doi.org/10.1145/3318464.3389772
mod0020 QUAD: Quadratic-Bound-based Kernel Density Visualization
Tsz Nam Chan (The University of Hong Kong); Reynold Cheng (The University of Hong Kong); Man Lung Yiu (The Hong Kong Polytechnic University) Kernel density visualization, or KDV, is used to view and understand data points in various domains, including traffic or crime hotspot detection, ecological modeling, chemical geology, and physical modeling. Existing solutions, which are based on computing kernel density (KDE) functions, are computationally expensive. Our goal is to improve the performance of KDV, in order to support large datasets (e.g., one million points) and high screen resolutions (e.g., $1280\times 960$ pixels). We examine two widely-used variants of KDV, namely approximate kernel density visualization (EKDV) and thresholded kernel density visualization (TKDV). For these two operations, we develop a fast solution, called QUAD, by deriving quadratic bounds of KDE functions for different types of kernel functions, including Gaussian, triangular, etc. We further adopt a progressive visualization framework for KDV, in order to stream partial visualization results to users continuously. Extensive experiment results show that our new KDV techniques can provide at least one-order-of-magnitude speedup over existing methods, without degrading visualization quality. We further show that QUAD can produce reasonable visualization results in real time (0.5 sec) by combining the progressive visualization framework in a single-machine setting without using GPU and parallel computation. https://doi.org/10.1145/3318464.3380561
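The bounding idea behind QUAD can be illustrated with a deliberately crude stand-in: thresholded KDV only needs to decide whether a pixel's density reaches a threshold τ, so cheap lower and upper bounds often settle the answer without finishing the exact kernel sum. The sketch below uses only the trivial bound 0 ≤ kernel ≤ 1 per point (QUAD derives much tighter quadratic bounds of the kernel functions themselves); all parameter values are arbitrary.

```python
# Deliberately crude stand-in for QUAD's bound-based pruning in thresholded KDV:
# each point contributes at most `weight`, so a running sum plus the maximum
# possible remaining contribution can settle "hot or cold" early. QUAD instead
# derives tight quadratic bounds of the kernel itself. Parameters are arbitrary.
import numpy as np

def thresholded_kdv(points, grid, bandwidth, weight, tau):
    out = np.zeros(len(grid), dtype=bool)
    n = len(points)
    for i, q in enumerate(grid):
        total = 0.0
        for j, p in enumerate(points):
            d2 = np.sum((q - p) ** 2)
            total += weight * np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel
            remaining = (n - j - 1) * weight
            if total >= tau:              # already hot: stop summing
                out[i] = True
                break
            if total + remaining < tau:   # can never reach tau: definitely cold
                break
    return out

points = np.array([[0.0, 0.0], [0.1, 0.1]])
grid = np.array([[0.0, 0.0], [5.0, 5.0]])
print(thresholded_kdv(points, grid, bandwidth=1.0, weight=1.0, tau=1.5))  # [ True False]
```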
mod0446 ShapeSearch: A Flexible and Efficient System for Shape-based Exploration of Trendlines
Tarique Siddiqui (University of Illinois, Urbana Champaign (UIUC)); Paul Luh (University of Illinois, Urbana Champaign (UIUC)); Zesheng Wang (University of Illinois, Urbana Champaign (UIUC)); Karrie Karahalios (University of Illinois, Urbana Champaign (UIUC)); Aditya Parameswaran (UC Berkeley) Identifying trendline visualizations with desired patterns is a common task during data exploration. Existing visual analytics tools offer limited flexibility, expressiveness, and scalability for such tasks, especially when the pattern of interest is under-specified and approximate. We propose ShapeSearch, an efficient and flexible pattern-searching tool that enables the search for desired patterns via multiple mechanisms: sketch, natural-language, and visual regular expressions. We develop a novel shape querying algebra, with a minimal set of primitives and operators that can express a wide variety of shape search queries, and design a natural-language and regex-based parser to translate user queries to the algebraic representation. To execute these queries within interactive response times, ShapeSearch uses a fast shape algebra execution engine with query-aware optimizations, and perceptually-aware scoring methodologies. We present a thorough evaluation of the system, including a user study, a case study involving genomics data analysis, as well as performance experiments, comparing against state-of-the-art trendline shape matching approaches—that together demonstrate the usability and scalability of ShapeSearch. https://doi.org/10.1145/3318464.3389722
mod0469 Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using Materialization
Liming Dong (Tsinghua University); Qiushi Bai (University of California Irvine); Taewoo Kim (University of California Irvine); Taiji Chen (University of California Irvine); Weidong Liu (Tsinghua University); Chen Li (University of California Irvine) We study the problem of efficient spatial visualization on a large data set stored in a database using SQL queries with ad-hoc range conditions on numerical attributes, for example, a spatial scatterplot of taxi pickup events in New York between 1/1/2015 and 3/10/2015. We present a novel middleware-based technique called Marviq. It divides the selection-attribute domain into intervals, and precomputes and stores a visualization for each interval. These results are called MVS and stored as tables in the database. We can compute an exact visualization for a request by accessing MVS and retrieving additional records from the base table. To further reduce the latter time, we present algorithms for using MVS to compute an approximate visualization that satisfies a user-specified similarity threshold. We show a family of functions with certain properties that can use this technique. We present an improvement by dividing the MVS intervals into smaller intervals and materializing low-resolution visualization for these intervals. We report the results of an extensive evaluation of Marviq, including a user study, and show its high performance in both space and time. https://doi.org/10.1145/3318464.3389730
Research 27: Distributed and Parallel Processing • Thursday 3:30 PM – 5:00 PM
mod0569 Near-Optimal Distributed Band-Joins through Recursive Partitioning
Rundong Li (Google); Wolfgang Gatterbauer (Northeastern University); Mirek Riedewald (Northeastern University) We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work. https://doi.org/10.1145/3318464.3389750
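For context, a band-join pairs tuples whose join attributes differ by at most a band width eps. The single-machine sketch below shows the operator itself (sort one side, probe a window per tuple); the paper's contribution, not shown here, is choosing how to partition the join-attribute space across workers so that load stays balanced while tuples near partition boundaries are duplicated as little as possible.

```python
# Single-machine sketch of the band-join operator itself: all pairs (r, s) with
# |r - s| <= eps, by sorting S and probing a window per R-tuple. The partitioning
# across workers that the paper optimizes is not shown.
from bisect import bisect_left, bisect_right

def band_join(r_values, s_values, eps):
    s_sorted = sorted(s_values)
    out = []
    for r in r_values:
        lo = bisect_left(s_sorted, r - eps)
        hi = bisect_right(s_sorted, r + eps)
        out.extend((r, s) for s in s_sorted[lo:hi])
    return out

print(band_join([1.0, 5.0], [0.5, 1.4, 4.0, 9.0], eps=1.0))
# [(1.0, 0.5), (1.0, 1.4), (5.0, 4.0)]
```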
mod0226 ChronoCache: Predictive and Adaptive Mid-Tier Query Result Caching
Bradley Glasbergen (University of Waterloo); Kyle Langendoen (University of Waterloo); Michael Abebe (University of Waterloo); Khuzaima Daudjee (University of Waterloo) The performance of data-driven, web-scale client applications is sensitive to access latency. To address this concern, enterprises strive to cache data on edge nodes that are closer to users, thereby avoiding expensive round-trips to remote data centers. However, these geo-distributed approaches are limited to caching static data. In this paper we present ChronoCache, a mid-tier caching system that exploits the presence of geo-distributed edge nodes to cache database query results closer to users. ChronoCache transparently learns and leverages client application access patterns to predictively combine query requests and cache their results ahead of time, thereby reducing costly round-trips to the remote database. We show that ChronoCache reduces query response times by up to 2/3 over prior approaches on multiple representative benchmark workloads. https://doi.org/10.1145/3318464.3380593
mod0315 Cheetah: Accelerating Database Queries with Switch Pruning
Muhammad Tirmazi (Harvard University); Ran Ben Basat (Harvard University); Jiaqi Gao (Harvard University); Minlan Yu (Harvard University) Modern database systems are growing increasingly distributed and struggle to reduce query completion time with a large volume of data. In this paper, we leverage programmable switches in the network to partially offload query computation to the switch. While switches provide high performance, they have resource and programming constraints that make implementing diverse queries difficult. To fit in these constraints, we introduce the concept of data pruning — filtering out entries that are guaranteed not to affect output. The database system then runs the same query but on the pruned data, which significantly reduces processing time. We propose pruning algorithms for a variety of queries. We implement our system, Cheetah, on a Barefoot Tofino switch and Spark. Our evaluation on multiple workloads shows a 40–200% improvement in the query completion time compared to Spark. https://doi.org/10.1145/3318464.3389698
mod0462 External Merge Sort for Top-K Queries
Yannis Chronis (University of Wisconsin-Madison); Thanh Do (Google Inc); Goetz Graefe (Google Inc); Keith Peters (Google Inc) Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11 times faster. https://doi.org/10.1145/3318464.3389729
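The pruning idea can be sketched independently of F1 Query. In the hedged toy version below (ascending top-k, single machine, everything kept in one in-memory buffer rather than spilled runs), a histogram of the keys seen so far yields a conservative cutoff that at least k keys fall below, so any later key above the cutoff can be dropped before it would ever be buffered or written to secondary storage.

```python
# Hedged toy version of cutoff-based pruning for an ascending top-k: a histogram
# of the buffered keys gives a value that at least k keys fall below, so larger
# keys can be dropped immediately. Assumed simplification: the surviving keys fit
# in one in-memory buffer, whereas the paper's operator spills and merges runs.
import random
import numpy as np

def topk_ascending(stream, k, buffer_limit=10_000, n_bins=64):
    kept, cutoff = [], float("inf")
    for x in stream:
        if x > cutoff:
            continue                                   # pruned before buffering
        kept.append(x)
        if len(kept) > buffer_limit:
            counts, edges = np.histogram(kept, bins=n_bins)
            cum = np.cumsum(counts)
            cutoff = edges[1:][cum >= k][0]            # smallest edge covering k keys
            kept = [v for v in kept if v <= cutoff]
    return sorted(kept)[:k]

data = [random.random() for _ in range(100_000)]
assert topk_ascending(data, k=100) == sorted(data)[:100]
```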
mod0374 Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing
Qiange Wang (Northeastern University); Yanfeng Zhang (Northeastern University); Hao Wang (Ohio State University); Liang Geng (Northeastern University); Rubao Lee (Ohio State University); Xiaodong Zhang (Ohio State University); Ge Yu (Northeastern University) In database and large-scale data analytics, recursive aggregate processing plays an important role, which is generally implemented under a framework of incremental computing and executed synchronously and/or asynchronously. We identify three barriers in existing recursive aggregate data processing. First, the processing scope is largely limited to monotonic programs. Second, checking on conditions for monotonicity and correctness for async processing is sophisticated and manually done. Third, execution engines may be suboptimal due to separation of sync and async execution. In this paper, we lay an analytical foundation for conditions to check if a recursive aggregate program that is monotonic or even non-monotonic can be executed incrementally and asynchronously with its correct result. We design and implement a condition verification tool that can automatically check if a given program satisfies the conditions. We further propose a unified sync-async engine to execute these programs for high performance. To integrate all these effective methods together, we have developed a distributed Datalog system, called PowerLog. Our evaluation shows that PowerLog can outperform three representative Datalog systems on both monotonic and non-monotonic recursive programs. https://doi.org/10.1145/3318464.3389712
Research 12: Graph Matching and Discovery • Thursday 3:30 PM – 5:00 PM
mod0314 Efficient Algorithms for Densest Subgraph Discovery on Large Directed Graphs
Chenhao Ma (The University of Hong Kong); Yixiang Fang (University of New South Wales); Reynold Cheng (The University of Hong Kong); Laks V.S. Lakshmanan (The University of British Columbia); Wenjie Zhang (University of New South Wales); Xuemin Lin (University of New South Wales) Given a directed graph $G$, the directed densest subgraph (DDS) problem refers to the finding of a subgraph from $G$, whose density is the highest among all the subgraphs of $G$. The DDS problem is fundamental to a wide range of applications, such as fraud detection, community mining, and graph compression. However, existing DDS solutions suffer from efficiency and scalability problems: on a three-thousand-edge graph, it takes three days for one of the best exact algorithms to complete. In this paper, we develop an efficient and scalable DDS solution. We introduce the notion of [$x$, $y$]-core, which is a dense subgraph for $G$, and show that the densest subgraph can be accurately located through the [$x$, $y$]-core with theoretical guarantees. Based on the [$x$, $y$]-core, we develop exact and approximation algorithms. We have performed an extensive evaluation of our approaches on eight real large datasets. The results show that our proposed solutions are up to six orders of magnitude faster than the state-of-the-art. https://doi.org/10.1145/3318464.3389697
mod0325 GPU-Accelerated Subgraph Enumeration on Partitioned Graphs
Wentian Guo (National University of Singapore); Yuchen Li (Singapore Management University); Mo Sha (National University of Singapore); Bingsheng He (National University of Singapore); Xiaokui Xiao (National University of Singapore); Kian-Lee Tan (National University of Singapore) Subgraph enumeration is important for many applications such as network motif discovery and community detection. Recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration, but they can only handle graphs that fit into the GPU memory. In this paper, we propose a new approach for GPU-accelerated subgraph enumeration that can efficiently scale to large graphs beyond the GPU memory. Our approach divides the graph into partitions, each of which fits into the GPU memory. The GPU processes one partition at a time and searches the matched subgraphs of a given pattern (i.e., instances) within the partition as in the small graph. The key challenge is on enumerating the instances across different partitions, because this search would enumerate considerably redundant subgraphs and cause the expensive data transfer cost via the PCI-e bus. Therefore, we propose a novel shared execution approach to eliminate the redundant subgraph searches and correctly generate all the instances across different partitions. The experimental evaluation shows that our approach can scale to large graphs and achieve significantly better performance than the existing single-machine solutions. https://doi.org/10.1145/3318464.3389699
mod0151 In-Memory Subgraph Matching: An In-depth Study
Shixuan Sun (Hong Kong University of Science and Technology); Qiong Luo (Hong Kong University of Science and Technology) We study the performance of eight representative in-memory subgraph matching algorithms.
Specifically, we put QuickSI, GraphQL, CFL, CECI, DP-iso, RI and VF2++ in a common framework to compare them on the following four aspects: (1) method of filtering candidate vertices in the data graph; (2) method of ordering query vertices; (3) method of enumerating partial results; and (4) other optimization techniques. Then, we compare the overall performance of these algorithms with Glasgow, an algorithm based on constraint programming. Through experiments, we find that (1) the filtering method of GraphQL is competitive to that of the latest algorithms CFL, CECI and DP-iso in terms of pruning power; (2) the ordering methods in GraphQL and RI are usually the most effective; (3) the set intersection based local candidate computation in CECI and DP-iso performs the best in the enumeration; and (4) the failing sets pruning in DP-iso can significantly improve the performance when queries become large. Our source code is publicly available at https://github.com/RapidsAtHKUST/SubgraphMatching.
https://doi.org/10.1145/3318464.3380581
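The comparison axes above (filtering, ordering, enumeration) map onto a generic backtracking skeleton, sketched below. This is an illustrative baseline only, with a label-and-degree filter, a static smallest-candidate-set-first order, and naive extension; it is not any of the surveyed algorithms.

```python
# Generic backtracking skeleton showing the three compared aspects: candidate
# filtering, query-vertex ordering, and enumeration. This baseline (label-and-
# degree filter, smallest-candidate-set-first order, naive extension) is an
# illustration only, not any of the surveyed algorithms.
def subgraph_matches(q_adj, q_label, d_adj, d_label):
    # (1) filtering: candidates must match the label and have enough neighbors
    cand = {u: [v for v in d_adj
                if d_label[v] == q_label[u] and len(d_adj[v]) >= len(q_adj[u])]
            for u in q_adj}
    # (2) ordering: most selective query vertex first
    order = sorted(q_adj, key=lambda u: len(cand[u]))
    # (3) enumeration: extend a partial embedding one query vertex at a time
    def extend(i, emb):
        if i == len(order):
            yield dict(emb)
            return
        u = order[i]
        for v in cand[u]:
            if v in emb.values():
                continue
            # every mapped neighbor of u must be adjacent to v in the data graph
            if all(emb[w] in d_adj[v] for w in q_adj[u] if w in emb):
                emb[u] = v
                yield from extend(i + 1, emb)
                del emb[u]
    yield from extend(0, {})

# Triangle query (one 'A' vertex, two 'B' vertices) on a small labeled graph.
q_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
q_label = {0: "A", 1: "B", 2: "B"}
d_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
d_label = {0: "A", 1: "B", 2: "B", 3: "B"}
print(len(list(subgraph_matches(q_adj, q_label, d_adj, d_label))))  # 4 embeddings
```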
mod0332 G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching
Yeonsu Park (POSTECH); Seongyun Ko (POSTECH); Sourav S. Bhowmick (NTU); Kyoungmin Kim (POSTECH); Kijae Hong (POSTECH); Wook-Shin Han (POSTECH) Despite the crucial role of cardinality estimation in query optimization, there has been no systematic and in-depth study of the existing cardinality estimation techniques for subgraph matching queries. In this paper, for the first time, we present a comprehensive study of the existing cardinality estimation techniques for subgraph matching queries, scaling far beyond the original experiments. We first introduce a novel framework called G-CARE that enables us to realize all existing techniques on top of it and that provides insights on their performance. By using G-CARE, we then reimplement representative cardinality estimation techniques for graph databases as well as relational databases. We next evaluate these techniques w.r.t. accuracy on RDF and non-RDF graphs from different domains with subgraph matching queries of various topologies so far considered. Surprisingly, our results reveal that all existing techniques have serious problems in accuracy for various scenarios and datasets. Intriguingly, a simple sampling method based on an online aggregation technique designed for relational data, consistently outperforms all existing techniques. https://doi.org/10.1145/3318464.3389702
mod0033 Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees
Tahsin Reza (University of British Columbia); Matei Ripeanu (University of British Columbia); Geoffrey Sanders (Lawrence Livermore National Laboratory); Roger Pearce (Lawrence Livermore National Laboratory) There are multiple situations where supporting approximation in graph pattern matching tasks is highly desirable: (i) the data acquisition process can be noisy; (ii) a user may only have an imprecise idea of the search query; and (iii) approximation can be used for high volume vertex labeling when extracting machine learning features from graph data. We present a new algorithmic pipeline for approximate matching that combines edit-distance based matching with systematic graph pruning. We formalize the problem as identifying all exact matches for up to k edit-distance subgraphs of a user-supplied template. We design a solution which exploits unique optimization opportunities within the design space, not explored previously. Our solution is (i) highly scalable, (ii) supports arbitrary patterns and edit-distance, (iii) offers 100% precision and 100% recall guarantees, and (iv) supports a set of popular data analysis scenarios. We demonstrate its advantages through an implementation that offers good strong and weak scaling on massive real-world (257 billion edges) and synthetic (1.1 trillion edges) labeled graphs, respectively, and when operating on a massive cluster (256 nodes/9,216 cores), orders of magnitude larger than previously used for similar problems. Empirical comparison with the state-of-the-art highlights the advantages of our solution when handling massive graphs and complex patterns. https://doi.org/10.1145/3318464.3380566
Research 29: Data Mining and Similarity Search • Thursday 3:30 PM – 5:00 PM
mod0246 Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination
Conglong Li (Carnegie Mellon University); Minjia Zhang (Microsoft AI and Research); David G. Andersen (Carnegie Mellon University); Yuxiong He (Microsoft AI and Research) In applications ranging from image search to recommendation systems, the problem of identifying a set of “similar” real-valued vectors to a query vector plays a critical role. However, retrieving these vectors and computing the corresponding similarity scores from a large database is computationally challenging. Approximate nearest neighbor (ANN) search relaxes the guarantee of exactness for efficiency by vector compression and/or by only searching a subset of database vectors for each query. Searching a larger subset increases both accuracy and latency. State-of-the-art ANN approaches use fixed configurations that apply the same termination condition (the size of subset to search) for all queries, which leads to undesirably high latency when trying to achieve the last few percents of accuracy. We find that due to the index structures and the vector distributions, the number of database vectors that must be searched to find the ground-truth nearest neighbor varies widely among queries. Critically, we further identify that the intermediate search result after a certain amount of search is an important runtime feature that indicates how much more search should be performed. To achieve a better tradeoff between latency and accuracy, we propose a novel approach that adaptively determines search termination conditions for individual queries. To do so, we build and train gradient boosting decision tree models to learn and predict when to stop searching for a certain query. These models enable us to achieve the same accuracy with less total amount of search compared to the fixed configurations. We apply the learned adaptive early termination to state-of-the-art ANN approaches, and evaluate the end-to-end performance on three million- to billion-scale datasets. Compared with fixed configurations, our approach consistently reduces the average end-to-end latency by up to 7.1x under the same high accuracy targets. Our approach is open source at github.com/efficient/faiss-learned-termination. https://doi.org/10.1145/3318464.3380600
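A hedged sketch of the control flow is shown below: cluster lists are scanned in order, and after a small fixed number of probes a per-query predictor, fed the intermediate best-so-far distance, decides how much further to search. The predictor here is a caller-supplied function standing in for the paper's gradient-boosted decision trees; all names and parameters are illustrative.

```python
# Sketch of adaptive early termination: scan cluster lists in order, then after a
# small fixed number of probes let a per-query predictor (a caller-supplied
# function standing in for the paper's gradient-boosted trees, fed the
# intermediate best-so-far distance) decide how many lists to search in total.
def adaptive_search(cluster_lists, query, dist, min_probe, predict_budget):
    best, best_dist = None, float("inf")
    budget = len(cluster_lists)                  # default: search everything
    for probed, lst in enumerate(cluster_lists, start=1):
        for item_id, vec in lst:
            d = dist(query, vec)
            if d < best_dist:
                best, best_dist = item_id, d
        if probed == min_probe:
            budget = max(min_probe, predict_budget(query, best_dist, probed))
        if probed >= budget:
            break
    return best, best_dist

# Purely hypothetical stand-in predictor: probe further only if the best-so-far
# distance after the initial probes is still large.
predict_budget = lambda q, best_d, probed: 2 if best_d < 0.1 else 8
```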
mod0153 Theoretically-Efficient and Practical Parallel DBSCAN
Yiqiu Wang (Massachusetts Institute of Technology); Yan Gu (University of California, Riverside); Julian Shun (Massachusetts Institute of Technology) The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with two-way hyper-threading show that our implementations outperform existing parallel implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms. https://doi.org/10.1145/3318464.3380582
mod0539 A Relational Matrix Algebra and its Implementation in a Column Store
Oksana Dolmatova (University of Zürich); Nikolaus Augsten (University of Salzburg); Michael H. Böhlen (University of Zürich) Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads. https://doi.org/10.1145/3318464.3389747
mod0367s Locality-Sensitive Hashing Scheme based on Longest Circular Co-Substring
Yifan Lei (National University of Singapore); Qiang Huang (National University of Singapore); Mohan Kankanhalli (National University of Singapore); Anthony K. H. Tung (National University of Singapore) Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of LCCS search framework is that close data objects will have a longer LCCS than the far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes. https://doi.org/10.1145/3318464.3389778
mod0251 Continuously Adaptive Similarity Search
Huayi Zhang (Worcester Polytechnic Institute); Lei Cao (Massachusetts Institute of Technology); Yizhou Yan (Worcester Polytechnic Institute); Samuel Madden (Massachusetts Institute of Technology); Elke A. Rundensteiner (Worcester Polytechnic Institute) Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original distance metric is equally effective at supporting similarity search using an updated distance metric as long as the transform matrix learned for the new distance metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaption of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving up to 3 orders of magnitude speedup over state-of-the-art techniques. https://doi.org/10.1145/3318464.3380601
Research 21: Spatial, Temporal, and Multimedia Data II • Thursday 3:30 PM – 5:00 PM
mod0431 Architecture-Intact Oracle for Fastest Path and Time Queries on Dynamic Spatial Networks
Victor Junqiu Wei (Noah’s Ark Lab, Huawei Technologies); Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology); Cheng Long (Nanyang Technological University) Given two vertices of interest (POIs) $s$ and $t$ on a spatial network, a distance (path) query returns the shortest network distance (shortest path) from $s$ to $t$. This query has a variety of applications in practice and is a fundamental operation for many database and data mining algorithms. In this paper, we propose an efficient distance and path oracle on dynamic road networks using the randomization technique. Our oracle has good performance in practice and, remarkably, also has a favorable theoretical bound. Specifically, it has $O(n\log^2 n)$ (resp. $O(n\log^2{n})$) preprocessing time (resp. space) and $O(\log^4{n}\log\log n )$ (resp. $O(\log^4 {n}\log\log n +l)$) distance query time (resp. shortest path query time) as well as $O(\log^3{n})$ update time with high probability (w.h.p.), where $n$ is the number of vertices in the spatial network and $l$ is the number of edges on the shortest path. Our experiments show that the existing oracles suffer from a huge updating time that renders them impractical, whereas our oracle enjoys a negligible updating time while having query time and indexing cost comparable to the best existing oracle. https://doi.org/10.1145/3318464.3389718
mod0572 Data Series Progressive Similarity Search with Probabilistic Quality Guarantees
Anna Gogolou (Université Paris-Saclay, Inria, CNRS, LRI & LIPADE, University of Paris); Theophanis Tsandilas (Université Paris-Saclay, Inria, CNRS, LRI); Karima Echihabi (IRDA, Rabat IT Center, ENSIAS, Mohammed V University); Anastasia Bezerianos (Université Paris-Saclay, CNRS, Inria, LRI); Themis Palpanas (LIPADE, University of Paris & French University Institute (IUF)) Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results, before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We provide both initial and progressive estimates of the final answer that improve during the similarity search, as well as suitable stopping criteria for the progressive queries. Experiments with synthetic and diverse real datasets demonstrate that our prediction methods constitute the first practical solution to the problem, significantly outperforming competing approaches. https://doi.org/10.1145/3318464.3389751
mod0162s A GPU-friendly Geometric Data Model and Algebra for Spatial Queries
Harish Doraiswamy (New York University); Juliana Freire (New York University) The availability of low cost sensors has led to an unprecedented growth in the volume of spatial data. Unfortunately, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. While Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries, existing solutions have two important drawbacks: they are often tightly coupled to the specific query types they target, making it hard to adapt them for other queries; and since their design is based on CPU-based approaches, it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects. We demonstrate the expressiveness of the proposed algebra and present a proof-of-concept prototype that supports a subset of the operators, which shows that it is orders of magnitude faster than a CPU-based implementation and outperforms custom GPU-based approaches. https://doi.org/10.1145/3318464.3389774
mod0605 Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures
John Paparrizos (University of Chicago); Chunwei Liu (University of Chicago); Aaron J. Elmore (University of Chicago); Michael J. Franklin (University of Chicago) Distance measures are core building blocks in time-series analysis and the subject of active research for decades. Unfortunately, the most detailed experimental study in this area is outdated (over a decade old) and, naturally, does not reflect recent progress. Importantly, this study (i) omitted multiple distance measures, including a classic measure in the time-series literature; (ii) considered only a single time-series normalization method; and (iii) reported only raw classification error rates without statistically validating the findings, resulting in or fueling four misconceptions in the time-series literature. Motivated by the aforementioned drawbacks and our curiosity to shed some light on these misconceptions, we comprehensively evaluate 71 time-series distance measures. Specifically, our study includes (i) 8 normalization methods; (ii) 52 lock-step measures; (iii) 4 sliding measures; (iv) 7 elastic measures; (v) 4 kernel functions; and (vi) 4 embedding measures. We extensively evaluate these measures across 128 time-series datasets using rigorous statistical analysis. Our findings debunk four long-standing misconceptions that significantly alter the landscape of what is known about existing distance measures. With the new foundations in place, we discuss open challenges and promising directions. https://doi.org/10.1145/3318464.3389760
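As a minimal illustration of why the choice of normalization matters as much as the distance measure itself, the sketch below compares raw and z-normalized Euclidean distance (the simplest lock-step measure) on two series that share a shape but differ in scale.

```python
# Minimal illustration of why normalization matters as much as the measure:
# raw Euclidean distance (the simplest lock-step measure) versus the same
# measure after z-normalization, on two series with the same shape but
# different scale.
import numpy as np

def znorm(x):
    x = np.asarray(x, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else x - x.mean()

def euclidean(x, y):
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

a = [1, 2, 3, 2, 1]
b = [10, 20, 30, 20, 10]                  # same shape, different scale
print(euclidean(a, b))                    # large: raw magnitudes dominate
print(euclidean(znorm(a), znorm(b)))      # ~0: identical after z-normalization
```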
mod0277 MIRIS: Fast Object Track Queries in Video
Favyen Bastani (Massachusetts Institute of Technology); Songtao He (Massachusetts Institute of Technology); Arjun Balasingam (Massachusetts Institute of Technology); Karthik Gopalakrishnan (Massachusetts Institute of Technology); Mohammad Alizadeh (Massachusetts Institute of Technology); Hari Balakrishnan (Massachusetts Institute of Technology); Michael Cafarella (Massachusetts Institute of Technology); Tim Kraska (Massachusetts Institute of Technology); Sam Madden (Massachusetts Institute of Technology) Video databases that enable queries with object-track predicates are useful in many applications. Such queries include selecting objects that move from one region of the camera frame to another (e.g., finding cars that turn right through a junction) and selecting objects with certain speeds (e.g., finding animals that stop to drink water from a lake). Processing such predicates efficiently is challenging because they involve the movement of an object over several video frames. We propose a novel query-driven tracking approach that integrates query processing with object tracking to efficiently process object track queries and address the computational complexity of object detection methods. By processing video at low framerates when possible, but increasing the framerate when needed to ensure high-accuracy on a query, our approach substantially speeds up query execution. We have implemented query-driven tracking in MIRIS, a video query processor, and compare MIRIS against four baselines on a diverse dataset consisting of five sources of video and nine distinct queries. We find that, at the same accuracy, MIRIS accelerates video query processing by 9x on average over the IOU tracker, an overlap-based tracking-by-detection method used in existing video database systems. https://doi.org/10.1145/3318464.3389692