Tutorials

For meetings with colleagues, chats, and discussions, SIGMOD provides a virtual conference site in Gather.

Sunday 8:00 AM – 10:30 AM (30-minute break), continues 11:00 AM – 11:30 PM
Tutorial Session 1 Crowdsourcing Practice for Efficient Data Labeling: Aggregation, Incremental Relabeling, and Pricing
Alexey Drutsa (Yandex); Dmitry Ustalov (Yandex); Evfrosiniya Zerminova (Yandex); Valentina Fedorova (Yandex); Olga Megorskaya (Yandex); Daria Baidakova (Yandex)
In this tutorial, we present a portion of unique industry experience in efficient data labeling via crowdsourcing shared by both leading researchers and engineers from Yandex. We will give an introduction to data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practice session, where participants will choose one of the real label collection tasks, experiment with selecting settings for the labeling process, and launch their label collection project on one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session. While the crowd performers are annotating the project set up by the attendees, we will present the major theoretical results in efficient aggregation, incremental relabeling, and dynamic pricing. We will also discuss their strengths and weaknesses, as well as their applicability to real-world tasks, summarizing our five-year-long research and industrial expertise in crowdsourcing. Finally, participants will receive feedback on their projects and practical advice on how to make them more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data and do it efficiently.
https://doi.org/10.1145/3318464.3383127
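For readers unfamiliar with the aggregation problem mentioned above, here is a minimal sketch of plain majority voting over redundant worker labels, with invented task, worker, and label names; the tutorial itself covers considerably more sophisticated aggregation, relabeling, and pricing methods.

```python
from collections import Counter, defaultdict

def aggregate_majority_vote(worker_labels):
    """Aggregate redundant crowd labels per task by majority vote.

    worker_labels: iterable of (task_id, worker_id, label) triples.
    Returns a dict task_id -> (winning_label, agreement), where agreement
    is the fraction of workers that voted for the winning label.
    """
    votes = defaultdict(Counter)
    for task_id, _worker_id, label in worker_labels:
        votes[task_id][label] += 1

    result = {}
    for task_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        result[task_id] = (label, count / sum(counter.values()))
    return result

# Hypothetical example: three workers label two images as "cat" or "dog".
labels = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
print(aggregate_majority_vote(labels))
# {'img1': ('cat', 0.666...), 'img2': ('dog', 1.0)}
```

Tasks whose winning label has low agreement are natural candidates for incremental relabeling, i.e., for collecting additional labels from more workers.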
Sunday 1:30 PM – 3:00 PM
Tutorial Session 2 Optimal Join Algorithms Meet Top-k
Nikolaos Tziavelis (Northeastern University); Wolfgang Gatterbauer (Northeastern University); Mirek Riedewald (Northeastern University)
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.
https://doi.org/10.1145/3318464.3383132
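As a concrete reference point for the Threshold Algorithm mentioned above, the sketch below implements its basic form for top-k over m score-sorted lists with simulated random access, using summation as the (assumed) monotone aggregate; the list contents and names are invented.

```python
import heapq

def threshold_algorithm(lists, k):
    """Basic Threshold Algorithm (sketch).

    lists: m lists of (object_id, score) pairs, each sorted by score
           descending (sorted access); random access is simulated with
           per-list dicts. The aggregate score is the sum of per-list
           scores (objects missing from a list contribute 0).
    Returns the k objects with the highest aggregate score.
    """
    m = len(lists)
    random_access = [dict(lst) for lst in lists]
    seen = set()
    top_k = []  # min-heap of (aggregate_score, object_id)

    for depth in range(max(len(lst) for lst in lists)):
        last_seen = []
        for i in range(m):
            if depth >= len(lists[i]):
                last_seen.append(0.0)
                continue
            obj, score = lists[i][depth]
            last_seen.append(score)
            if obj not in seen:
                seen.add(obj)
                total = sum(ra.get(obj, 0.0) for ra in random_access)
                heapq.heappush(top_k, (total, obj))
                if len(top_k) > k:
                    heapq.heappop(top_k)
        threshold = sum(last_seen)
        # Stop once the current k-th best score can no longer be beaten
        # by any object not yet seen under sorted access.
        if len(top_k) == k and top_k[0][0] >= threshold:
            break
    return sorted(top_k, reverse=True)

# Hypothetical example: two ranked lists, top-2 by summed score.
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.9), ("a", 0.5), ("c", 0.4)]
print(threshold_algorithm([l1, l2], k=2))  # [(1.7, 'b'), (1.4, 'a')]
```

Note that this cost model counts only sorted and random accesses; it says nothing about the large intermediate results that arise when the ranked objects are themselves join results, which is exactly the gap the tutorial targets.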
Tuesday 1:30 PM – 3:00 PM
Tutorial Session 3 State of the Art and Open Challenges in Natural Language Interfaces to Data
Fatma Ozcan (IBM Research – Almaden); Abdul Quamar (IBM Research – Almaden); Jaydeep Sen (IBM Research – India); Chuan Lei (IBM Research – Almaden); Vasilis Efthymiou (IBM Research – Almaden)
Recent advances in natural language understanding and processing have resulted in renewed interest in natural language interfaces to data, which provide an easy mechanism for non-technical users to access and query the data. While early systems only allowed simple selection queries over a single table, some recent work supports complex BI queries with many joins and aggregations, and even nested queries. There are various approaches in the literature for interpreting a user’s natural language query. Rule-based systems try to identify the entities in the query and understand the intended relationships between those entities. Recent years have seen the emergence and popularity of neural-network-based approaches, which try to interpret the query holistically by learning patterns. In this tutorial, we will review these natural language interface solutions in terms of their interpretation approach, as well as the complexity of the queries they can generate. We will also discuss open research challenges.
https://doi.org/10.1145/3318464.3383128
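As a toy illustration of the rule-based style of interpretation described above (detecting schema entities in the question and emitting a selection query), consider the sketch below; the table, columns, and phrase patterns are invented and far simpler than what the surveyed systems support.

```python
import re

# Hypothetical schema and vocabulary for a single table.
TABLE = "employees"
COLUMNS = {"name", "salary", "department", "hire_date"}
FILTER_PATTERNS = [
    # "... in the <x> department ..."  ->  department = '<x>'
    (re.compile(r"in the (\w+) department"), "department = '{0}'"),
    # "... more than <n> ..."          ->  salary > n
    (re.compile(r"more than (\d+)"), "salary > {0}"),
]

def interpret(question):
    """Tiny rule-based interpreter: pattern-matched phrases become the
    WHERE clause, remaining column mentions become the SELECT list."""
    q = question.lower()
    where = []
    for pattern, template in FILTER_PATTERNS:
        match = pattern.search(q)
        if match:
            where.append(template.format(*match.groups()))
            q = q.replace(match.group(0), "")  # don't reuse filter text
    select = [c for c in COLUMNS if c in q] or ["*"]
    sql = f"SELECT {', '.join(sorted(select))} FROM {TABLE}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

print(interpret("Show the name and salary of people in the sales department earning more than 50000"))
# SELECT name, salary FROM employees WHERE department = 'sales' AND salary > 50000
```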
Wednesday 10:30 AM – 3:30 PM (break from 12:30 PM – 1:30 PM for the Business Lunch)
Tutorial Session 4 Beyond Analytics: The Evolution of Stream Processing Systems
Paris Carbone (RISE – Research Institutes of Sweden); Marios Fragkoulis (Delft University of Technology); Vasiliki Kalavri (Boston University); Asterios Katsifodimos (Delft University of Technology)
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. The goal of this tutorial is threefold. First, we aim to review and highlight noteworthy past research findings, which were largely ignored until very recently. Second, we intend to underline the differences between early (’00–’10) and modern (’11–’18) streaming systems, and how those systems have evolved through the years. Most importantly, we wish to turn the attention of the database community to recent trends: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins. Instead, modern streaming systems are increasingly being used to deploy general event-driven applications in a scalable fashion, challenging the design decisions, architecture, and intended use of existing stream processing systems.
https://doi.org/10.1145/3318464.3383131
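For readers new to the classic workloads mentioned above, the sketch below computes per-key counts over tumbling (fixed, non-overlapping) windows of a timestamped stream; it is a bare-bones, in-order-only illustration and does not reflect how any particular streaming engine implements windowing.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group timestamped events into fixed, non-overlapping windows and
    count events per (window, key).

    events: iterable of (timestamp, key) pairs, timestamps in seconds,
            assumed to arrive in timestamp order (no out-of-order handling).
    window_size: window length in seconds.
    Yields (window_start, key, count) once a window is complete.
    """
    current_window = None
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        if current_window is not None and window_start != current_window:
            for k, c in sorted(counts.items()):
                yield current_window, k, c
            counts.clear()
        current_window = window_start
        counts[key] += 1
    for k, c in sorted(counts.items()):
        yield current_window, k, c

# Hypothetical click stream: (timestamp_seconds, page) pairs, 10-second windows.
clicks = [(1, "home"), (4, "home"), (8, "cart"), (12, "home"), (19, "cart")]
for window, page, count in tumbling_window_counts(clicks, window_size=10):
    print(window, page, count)
# 0 cart 1 / 0 home 2 / 10 cart 1 / 10 home 1
```

The event-driven applications the tutorial highlights go well beyond such aggregates, which is precisely why they stress existing engine designs.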
Wednesday 4:30 PM – 6:00 PM
Tutorial Session 5 Key-Value Storage Engines
Stratos Idreos (Harvard University); Mark Callaghan (MongoDB)
Key-value stores are everywhere. They power a diverse set of data-driven applications across both industry and science. Key-value stores are used as stand-alone NoSQL systems, but they are also used as part of more complex pipelines and systems such as machine learning and relational systems. In this tutorial, we survey state-of-the-art approaches to designing the core storage engine of a key-value store system. We focus on several critical components of the engine, starting with the core data structures used to lay out data across the memory hierarchy. We also discuss design issues related to caching, timestamps, concurrency control, updates, shifting workloads, as well as mixed workloads with both analytical and transactional characteristics. We cover designs that are read-optimized, write-optimized, as well as hybrids. We draw examples from several state-of-the-art systems, but we also put everything together in a general framework which allows us to model storage engine designs under a single unified model and reason about the expected behavior of diverse designs. In addition, we show that, given the vast number of possible storage engine designs and their complexity, there is a need to be able to describe and communicate design decisions in a high-level descriptive language, and we present a first version of such a language. We then use that framework to present several open challenges in the field, especially in terms of supporting increasingly more diverse and dynamic applications in the era of data science and AI, including neural networks, graphs, and data versioning.
https://doi.org/10.1145/3318464.3383133
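As one illustrative design point among those the tutorial surveys, the sketch below is a toy write-optimized, LSM-flavored engine: writes go to an in-memory memtable that is flushed into immutable sorted runs, and reads check the memtable first and then the runs from newest to oldest. Real engines add write-ahead logging, compaction, Bloom filters, and concurrency control, all omitted here.

```python
import bisect

class TinyLSM:
    """Toy LSM-style key-value engine: an in-memory memtable plus
    immutable sorted runs, searched newest-first on reads."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []  # list of sorted (key, value) lists, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search immutable runs from newest to oldest.
        for run in reversed(self.runs):
            value = self._search_run(run, key)
            if value is not None:
                return value
        return None

    def _flush(self):
        # Freeze the memtable into an immutable sorted run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    @staticmethod
    def _search_run(run, key):
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return run[i][1]
        return None

store = TinyLSM(memtable_limit=2)
store.put("a", 1); store.put("b", 2)  # second put triggers a flush
store.put("a", 3)                     # newer value shadows the flushed one
print(store.get("a"), store.get("b"), store.get("z"))  # 3 2 None
```

Even this toy exposes the read/write trade-off the tutorial dwells on: more (or larger) runs make writes cheap but reads progressively more expensive, which is what compaction policies and filters are there to balance.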
Thursday 10:30 AM – 12:00 PM
Tutorial Session 6 Automating Exploratory Data Analysis via Machine Learning: An Overview
Tova Milo (Tel Aviv University); Amit Somech (Tel Aviv University)
Exploratory Data Analysis (EDA) is an important initial step in any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g., filter, aggregation, and visualization). Since EDA has long been known to be a difficult task requiring profound analytical skills, experience, and domain knowledge, a plethora of systems has been devised over the last decade to facilitate EDA. In particular, advancements in machine learning research have created exciting opportunities not only to better facilitate EDA, but to fully automate the process. In this tutorial, we review recent lines of work for automating EDA, starting from recommender systems for suggesting a single exploratory action, going through kNN-based classifiers and active-learning methods for predicting users’ interestingness preferences, and finally arriving at fully automated EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion of the main challenges and open questions to be dealt with in order to ultimately reduce the manual effort required for EDA.
https://doi.org/10.1145/3318464.3383126
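To make the kNN idea mentioned above concrete, the sketch below scores candidate analysis actions by the labels of their nearest neighbors among previously rated actions; the feature encoding and data are invented for illustration and do not correspond to any specific system covered in the tutorial.

```python
import math

def knn_interestingness(candidates, labeled, k=3):
    """Score each candidate analysis action by the mean 'interesting' label
    of its k nearest labeled examples (Euclidean distance in feature space).

    candidates: list of feature vectors for candidate actions.
    labeled: list of (feature_vector, interesting) pairs, interesting in {0, 1}.
    Returns a list of scores aligned with `candidates`.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    scores = []
    for cand in candidates:
        neighbors = sorted(labeled, key=lambda ex: dist(cand, ex[0]))[:k]
        scores.append(sum(label for _, label in neighbors) / len(neighbors))
    return scores

# Invented features per action: (num_groups, log_result_rows, is_visualization).
history = [((3, 2.0, 1), 1), ((50, 4.5, 0), 0), ((5, 2.3, 1), 1), ((40, 4.0, 0), 0)]
candidates = [(4, 2.1, 1), (45, 4.2, 0)]
print(knn_interestingness(candidates, history, k=3))  # [0.666..., 0.333...]
```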
Thursday 1:30 PM – 3:00 PM
Tutorial Session 7 Le Taureau: Deconstructing the Serverless Landscape & A Look Forward
Anurag Khandelwal (Yale University); Arun Kejariwal (Facebook Inc.); Karthikeyan Ramasamy (Splunk Inc.)
Akin to the natural evolution of programming from assembly language to high-level languages, serverless computing represents the next frontier in the evolution of cloud computing: bare metal -> virtual machines -> containers -> serverless. The genesis of serverless computing can be traced back to the fundamental need to let a programmer focus solely on writing application code in a high-level language while being isolated from all facets of system management (for example, but not limited to, instance selection, scaling, deployment, logging, monitoring, fault tolerance, and so on). This is particularly critical in light of today’s increasingly tight time-to-market constraints. Currently, serverless computing is supported by leading public cloud vendors through offerings such as AWS Lambda, Google Cloud Functions, and Azure Cloud Functions. While this is an important step in the right direction, there are many challenges going forward: for instance, but not limited to, how to enable support for dynamic optimization, how to extend support for stateful computation, how to efficiently bin-pack applications, and how to support hardware heterogeneity (this will be key especially in light of the emergence of hardware accelerators for deep learning workloads). Inspired by Picasso’s Le Taureau, in this tutorial we deconstruct the evolution of serverless, the overarching intent being to facilitate a better understanding of the serverless landscape. This, we hope, will help push the innovation frontier on both fronts: the paradigm itself and the applications built atop it.
https://doi.org/10.1145/3318464.3383130
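To make the programming model concrete: in the function-as-a-service setting the developer supplies only a handler, and the platform takes care of provisioning, scaling, and the other management facets listed above. The sketch below follows the AWS Lambda Python handler convention (an event dict and a context object); the event fields themselves are hypothetical.

```python
import json

def handler(event, context):
    """Minimal Lambda-style function: the developer writes only this
    handler; the platform provisions, scales, and retires the instances
    that run it. The 'a' and 'b' event fields are invented for the example."""
    a = float(event.get("a", 0))
    b = float(event.get("b", 0))
    return {
        "statusCode": 200,
        "body": json.dumps({"sum": a + b}),
    }

# Local invocation for testing; in production the cloud platform calls handler().
if __name__ == "__main__":
    print(handler({"a": 2, "b": 3}, context=None))
    # {'statusCode': 200, 'body': '{"sum": 5.0}'}
```

The open questions listed in the abstract (state, bin-packing, heterogeneity) arise precisely because everything outside this handler is decided by the platform rather than the programmer.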
Thursday 1:30 PM – 3:00 PM (30-minute break), continues 3:30 PM – 5:00 PM
Tutorial Session 8 SIGMOD 2020 Tutorial on Fairness and Bias in Peer Review and Other Sociotechnical Intelligent Systems
Nihar B. Shah (Carnegie Mellon University); Zachary Lipton (Carnegie Mellon University)
Questions of fairness and bias abound in all socially consequential decisions pertaining to the collection and management of data. Whether designing protocols for peer review of research papers, setting hiring policies, or framing research questions in genetics, any data-management decision with the potential to allocate benefits or confer harms raises concerns about who gains or loses that may fail to surface in naively chosen performance measures. Data science interacts with these questions in two fundamentally different ways: (i) as the technology driving the very systems responsible for certain social impacts, posing new questions about what it means for such systems to accord with ethical norms and the law; and (ii) as a set of powerful tools for analyzing existing data management systems, e.g., for auditing existing systems for various biases. This tutorial will tackle both angles on the interaction between technology and society vis-a-vis concerns over fairness and bias, focusing in particular on the collection and management of data. Our presentation will cover a wide range of disciplinary perspectives, with the first part focusing on the social impacts of technology and on formulations of fairness and bias defined via protected characteristics, and the second part taking a deep dive into peer review and distributed human evaluations to explore other forms of bias, such as those due to subjectivity, miscalibration, and dishonest behavior.
https://doi.org/10.1145/3318464.3383129
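As a small, concrete example of the kind of measurement the first part of the tutorial is concerned with, the sketch below computes positive-decision rates per group defined by a protected characteristic and the gap between them (a demographic-parity-style measure); the data are invented, and a single number like this is of course not a substitute for a full fairness analysis.

```python
from collections import defaultdict

def demographic_parity_gap(decisions):
    """Compute the per-group positive-decision rate and the gap between
    the highest and lowest rates.

    decisions: iterable of (group, decision) pairs, decision in {0, 1}.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in decisions:
        totals[group] += 1
        positives[group] += decision
    rates = {g: positives[g] / totals[g] for g in totals}
    return rates, max(rates.values()) - min(rates.values())

# Invented hiring decisions tagged with a protected attribute.
data = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
        ("B", 0), ("B", 1), ("B", 0), ("B", 0)]
rates, gap = demographic_parity_gap(data)
print(rates, gap)  # {'A': 0.75, 'B': 0.25} 0.5
```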