For meeting colleagues, chatting, and discussions, SIGMOD provides a virtual conference site in Gather.

SIGMOD Demo Session A.1
Tuesday 1:30 PM – 3:00 PM and Wednesday 4:30 PM – 6:00 PM
modde01 RASQL: A Powerful Language and its System for Big Data Applications
Jin Wang (University of California, Los Angeles);
Guorui Xiao (University of California, Los Angeles);
Jiaqi Gu (University of California, Los Angeles);
Jiacheng Wu (Tsinghua University);
Carlo Zaniolo (University of California, Los Angeles)
There is a growing interest in supporting advanced Big Data applications on distributed data processing platforms. Most of these systems support SQL or one of its dialects as the query interface due to its portability and declarative nature. However, the current SQL standard cannot effectively express advanced analytical queries due to its limitations in supporting recursive queries. In this demonstration, we show that this problem can be resolved via a simple SQL extension that delivers greater expressive power by allowing aggregates in recursion. To this end, we propose the Recursive-aggregate-SQL (RASQL) language and its system on top of Apache Spark to express and execute complex queries and declarative algorithms in many applications, such as graph search and machine learning. With a variety of examples, we will (i) show how complicated analytic queries can be expressed with RASQL; (ii) illustrate the formal semantics of the powerful new constructs; and (iii) present a user-friendly interface to interact with the RASQL system and monitor the query results.
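As an illustration of what "aggregates in recursion" buys, the following is a minimal Python sketch of single-source shortest paths, where a min aggregate is applied inside the recursive step. It is not RASQL syntax and not the system's evaluation strategy; the function name, the edge list, and the use of semi-naive iteration are assumptions made for this example, and edge weights are assumed to be non-negative.

```python
import math


def shortest_paths(edges, source):
    """Least fixpoint of: dist(source) = 0; dist(v) = min(dist(u) + w) over edges (u, v, w)."""
    dist = {source: 0}
    delta = {source: 0}  # distances that improved in the previous round
    while delta:
        new_delta = {}
        for u, v, w in edges:
            if u in delta:
                cand = delta[u] + w
                # The min aggregate is applied inside the recursion:
                # only strictly better distances are propagated further.
                if cand < dist.get(v, math.inf) and cand < new_delta.get(v, math.inf):
                    new_delta[v] = cand
        dist.update(new_delta)
        delta = new_delta
    return dist


# Hypothetical weighted graph: (from, to, weight)
edges = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
print(shortest_paths(edges, "a"))
```

Without the aggregate inside the loop, every path length would have to be enumerated first and minimized afterwards, which does not terminate on cyclic graphs.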
modde13 MithraCoverage: A System for Investigating Population Bias for Intersectional Fairness
Zhongjun Jin (University of Michigan – Ann Arbor);
Mengjing Xu (University of Michigan – Ann Arbor);
Chenkai Sun (University of Michigan – Ann Arbor);
Abolfazl Asudeh (University of Illinois at Chicago);
H. V. Jagadish (University of Michigan – Ann Arbor)
Data-driven technologies are only as good as the data they work with. On the other hand, data scientists often have limited control over how the data is collected. Failure to include an adequate number of instances from minority (sub)groups, known as population bias, is a major reason for model unfairness and disparate performance across different groups. We demonstrate MithraCoverage, a system for investigating population bias over the intersection of multiple attributes. We use the concept of coverage for identifying intersectional subgroups with inadequate representation in the dataset. MithraCoverage is a web application with an interactive visual interface that allows data scientists to explore the dataset and identify subgroups with poor coverage.
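The notion of coverage can be sketched in a few lines of Python. This is only an illustration under assumptions: a subgroup is taken to be one value combination over the chosen attributes, and it counts as poorly covered when fewer than a threshold number of rows match it. The function name, attribute names, and data are hypothetical and are not taken from MithraCoverage.

```python
from collections import Counter
from itertools import product


def uncovered_subgroups(rows, attributes, threshold):
    """Return value combinations over `attributes` matched by fewer than `threshold` rows."""
    # Observed domain of each attribute
    domains = [sorted({row[a] for row in rows}) for a in attributes]
    # Number of rows per intersectional subgroup
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    # Enumerate every combination, including those that never occur (count 0)
    return {combo: counts[combo] for combo in product(*domains) if counts[combo] < threshold}


# Hypothetical dataset with two categorical attributes
rows = [
    {"gender": "F", "region": "north"},
    {"gender": "F", "region": "north"},
    {"gender": "M", "region": "north"},
    {"gender": "M", "region": "north"},
    {"gender": "M", "region": "south"},
    {"gender": "M", "region": "south"},
    {"gender": "F", "region": "south"},
]
print(uncovered_subgroups(rows, ["gender", "region"], threshold=2))
```

Note that each attribute on its own looks well represented here; the gap only shows up at the intersection, which is why the enumeration runs over combinations. Exhaustive enumeration grows exponentially with the number of attributes, so this sketch is only practical for a handful of them.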
modde15 Sentinel: Understanding Data Systems
Bradley Glasbergen (University of Waterloo);
Michael Abebe (University of Waterloo);
Khuzaima Daudjee (University of Waterloo);
Daniel Vogel (University of Waterloo);
Jian Zhao (University of Waterloo)
The complexity of modern data systems and applications greatly increases the challenge in understanding system behaviour and diagnosing performance problems. When these problems arise, system administrators are left with the difficult task of remedying them by relying on large debug log files, vast numbers of metrics, and system-specific tooling. We demonstrate the Sentinel system, which enables administrators to analyze systems and applications by building models of system execution and comparing them to derive key differences in behaviour. The resulting analyses are then presented as system reports to administrators and developers in an intuitive fashion. Users of Sentinel can locate, identify and take steps to resolve the reported performance issues. As Sentinel’s models are constructed online by intercepting debug logging library calls, Sentinel’s functionality incurs little overhead and works with all systems that use standard debug logging libraries.
modde26 F-IVM: Learning over Fast-Evolving Relational Data
Milos Nikolic (University of Edinburgh);
Haozhe Zhang (University of Oxford);
Ahmet Kara (University of Oxford);
Dan Olteanu (University of Oxford)
F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
modde29 Big Data Series Analytics Using TARDIS and its Exploitation in Geospatial Applications
Liang Zhang (Worcester Polytechnic Institute);
Noura Alghamdi (Worcester Polytechnic Institute);
Mohamed Y. Eltabakh (Worcester Polytechnic Institute);
Elke A. Rundensteiner (Worcester Polytechnic Institute)
The massive amounts of data series data continuously generated and collected by applications require new indices to speed up data series similarity queries on which various data mining techniques rely. However, the state-of-the-art iSAX-based indexing techniques do not scale well due to the binary fanout that leads to a very deep index tree, and suffer from accuracy degradation due to the character-level cardinality that leads to poor maintenance of the proximity. To address this problem, we recently proposed TARDIS to support indexing and querying billion-scale data series datasets. It introduces new iSAX-T signatures to reduce the cardinality conversion cost and a corresponding sigTree to construct a compact index structure that better preserves similarity. The framework consists of one centralized index and local distributed indices to efficiently re-partition and index dimensional datasets. Besides, effective query strategies based on the sigTree structure are proposed to greatly improve the accuracy. In this demonstration, we present GENET, a new interactive exploration demonstration that lets users perform Big Data Series Approximate Retrieval and Recursive Interactive Clustering in large-scale geospatial datasets using TARDIS index techniques.
modde25 SVQ++: Querying for Object Interactions in Video Streams
Daren Chao (University of Toronto);
Nick Koudas (University of Toronto);
Ioannis Xarchakos (University of Toronto)
Deep neural nets have enabled sophisticated information extraction from images, including video frames. Recently, there has been interest in techniques and algorithms to enable interactive declarative query processing of objects appearing on video frames and their associated interactions on the video feed. SVQ++ is a system for declarative querying on real-time video streams involving objects and their interactions. The system utilizes a sequence of inexpensive and less accurate models (filters), called Progressive Filters (PF), to detect the presence of the query-specified objects on frames, and a filtering approach, called Interaction Sheave (IS), to effectively prune frames that are not likely to contain interactions. We demonstrate that this system can efficiently identify frames in a streaming video in which an object is interacting with another in a specific way, increasing the frame processing rate dramatically and speeding up query processing by at least two orders of magnitude depending on the query.
SIGMOD Demo Session A.2
Tuesday 1:30 PM – 3:00 PM and Wednesday 4:30 PM – 6:00 PM
modde28 Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift
Nemanja Borić (Amazon Web Services);
Hinnerk Gildhoff (Amazon Web Services);
Menelaos Karavelas (Amazon Web Services);
Ippokratis Pandis (Amazon Web Services);
Ioanna Tsalouchidou (Amazon Web Services)
Enterprise companies use spatial data to optimize decisions and gain new insights regarding the locality of their business and services. Industries rely on efficiently combining spatial and business data from different sources, such as data warehouses, geospatial information systems, transactional systems, and data lakes, where spatial data can be found in structured or unstructured form. In this demonstration we present the spatial functionality of Amazon Redshift and its integration with other Amazon services, such as Amazon Aurora PostgreSQL and Amazon S3. We focus on the design and functionality of the feature, including the extensions in Redshift’s state-of-the-art optimizer to push spatial processing close to where the data is stored.
modde30 CDFShop: Exploring and Optimizing Learned Index Structures
Ryan Marcus (Massachusetts Institute of Technology);
Emily Zhang (Massachusetts Institute of Technology);
Tim Kraska (Massachusetts Institute of Technology)
Indexes are a critical component of data management applications. While tree-like structures (e.g., B-Trees) have been employed to great success, recent work suggests that index structures powered by machine learning models (learned index structures) can achieve low lookup times with a reduced memory footprint. This demonstration showcases CDFShop, a tool to explore and optimize recursive model indexes (RMIs), a type of learned index structure. This demonstration allows audience members to (1) gain an intuition about various tuning parameters of RMIs and why learned index structures can greatly accelerate search, and (2) understand how automatic optimization techniques can be used to explore space/time tradeoffs within the space of RMIs.
modde31 TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines
Emily Caveness (Google Inc.);
Paul Suganthan G. C. (Google Inc.);
Zhuo Peng (Google Inc.);
Neoklis Polyzotis (Google Inc.);
Sudip Roy (Google Inc.);
Martin Zinkevich (Google Inc.)
Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding, validating, and monitoring the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen, on par with algorithms and infrastructure which are the typical building blocks of ML pipelines. In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced. This system is deployed in production as an integral part of TFX – an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.
modde33 Demonstration of BitGourmet: Data Analysis via Deterministic Approximation
Saehan Jo (Cornell University);
Immanuel Trummer (Cornell University)
We demonstrate BitGourmet, a novel data analysis system that supports deterministic approximate query processing (DAQ). The system executes aggregation queries and produces deterministic bounds that are guaranteed to contain the true value. The system allows users to set a precision constraint on query results. Given a user-defined target precision, we operate on a carefully selected data subset to satisfy the precision constraint. More precisely, we divide each column vertically, bit-by-bit. Our specialized query processing engine evaluates queries on subsets of these bit vectors. This involves a scenario-specific query optimizer which relies on quality and cost models to decide the optimal bit selection and execution plan. In our demonstration, we show that DAQ realizes an interesting trade-off between result quality and execution time, making data analysis more interactive. We also offer manual control over the query plan, i.e., the bit selection and the execution plan, so that users can gain more insights into our system and DAQ in general.
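The idea of deterministic bounds from a subset of bits can be illustrated with a small Python sketch. It assumes unsigned integer columns of a fixed width and reads only the most significant bits of each value; the function name and parameters are hypothetical, and this is not BitGourmet's query engine or optimizer.

```python
def sum_bounds(values, total_bits, bits_read):
    """Deterministic bounds on SUM(values) using only the top `bits_read` bits of each value.

    Assumes every value is an unsigned integer that fits in `total_bits` bits.
    """
    dropped = total_bits - bits_read
    # Lower bound: treat all unread low-order bits as 0
    lower = sum((v >> dropped) << dropped for v in values)
    # Upper bound: treat all unread low-order bits as 1
    upper = lower + len(values) * ((1 << dropped) - 1)
    return lower, upper


column = [200, 13, 97, 255]  # hypothetical 8-bit column; true sum is 565
for bits in (0, 4, 8):
    print(bits, sum_bounds(column, total_bits=8, bits_read=bits))
```

Reading more bits narrows the interval until it collapses to the exact answer, which is the quality versus execution time trade-off the abstract describes.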
modde35 Physical Visualization Design
Lana Ramjit (University of California, Los Angeles);
Zhaoning Kong (University of California, Los Angeles);
Ravi Netravali (University of California, Los Angeles);
Eugene Wu (Columbia University)
We demonstrate PVD, a system that visualization designers can use to co-design the interface and system architecture of scalable and expressive visualization.
modde36 Demonstration of Chestnut: An In-memory Data Layout Designer for Database Applications
Mingwei Samuel (University of California, Berkeley);
Cong Yan (University of Washington);
Alvin Cheung (University of California, Berkeley)
This demonstration showcases Chestnut, a data layout generator for in-memory object-oriented database applications. Given an application and a memory budget, Chestnut generates a customized in-memory data layout and the corresponding query plans that are specialized for the application queries. Our demo will let users design and improve simple web applications using Chestnut. Users can view the Chestnut-generated data layouts using a custom visualization system, which will allow users to see how the application parameters affect Chestnut’s design. Finally, users will be able to run queries generated by the application via the customized query plans generated by Chestnut or traditional relational query engines, and can compare the results and observe the speedup achieved by the Chestnut-generated query plans.
SIGMOD Demo Session B.1
Wednesday 10:30 AM – 12:00 PM and Thursday 10:30 AM – 12:00 PM
modde04 BOOMER: A Tool for Blending Visual P-Homomorphic Queries on Large Networks
Yinglong Song (Nanyang Technological University & Fudan University);
Huey Eng Chua (Nanyang Technological University);
Sourav S. Bhowmick (Nanyang Technological University);
Byron Choi (Hong Kong Baptist University);
Shuigeng Zhou (Fudan University)
The paradigm of interleaving (i.e., blending) visual subgraph query formulation and processing by exploiting the latency offered by the GUI brings in several potential benefits such as superior system response time (SRT) and opportunities to enhance usability of graph databases. Recent efforts at implementing this paradigm are focused on subgraph isomorphism-based queries, which are often restrictive in many real-world graph applications. In this demonstration, we present a novel system called BOOMER to realize this paradigm on more generic but complex bounded 1-1 p-homomorphic (BPH) queries on large networks. Intuitively, a BPH query maps an edge of the query to bounded paths in the data graph. We demonstrate various innovative features of BOOMER, its flexibility, and its promising performance.
modde05 AURORA: Data-driven Construction of Visual Graph Query Interfaces for Graph Databases
Sourav S. Bhowmick (Nanyang Technological University);
Kai Huang (Nanyang Technological University & Fudan University);
Huey Eng Chua (Nanyang Technological University);
Zifeng Yuan (Nanyang Technological University & Fudan University);
Byron Choi (Hong Kong Baptist University);
Shuigeng Zhou (Fudan University)
Several commercial and academic frameworks for querying a large collection of small- or medium-sized data graphs (e.g., chemical compounds) provide visual graph query interfaces (a.k.a. GUIs) to help non-programmers query these sources. However, construction of these visual interfaces is not data-driven. That is, it does not exploit the underlying data graphs to automatically generate the contents of various panels in a GUI. Such data-driven construction has several benefits such as facilitating efficient subgraph query formulation and portability of the interface across different application domains and sources. In this demonstration, we present a novel data-driven visual subgraph query interface construction engine called AURORA. Specifically, given a graph repository D containing a collection of small- or medium-sized data graphs, it automatically generates the GUI for D by populating various components of the interface. We demonstrate various innovative features of AURORA.
modde09 High Performance Distributed OLAP on Property Graphs with Grasper
Hongzhi Chen (The Chinese University of Hong Kong);
Bowen Wu (The Chinese University of Hong Kong);
Shiyuan Deng (The Chinese University of Hong Kong);
Chenghuan Huang (The Chinese University of Hong Kong);
Changji Li (The Chinese University of Hong Kong);
Yichao Li (The Chinese University of Hong Kong);
James Cheng (The Chinese University of Hong Kong)
Achieving high-performance OLAP over large graphs is a challenging problem and has received great attention recently because of its broad spectrum of applications. Existing systems have various performance bottlenecks due to limitations such as low parallelism and high network overheads. This demo presents Grasper, an RDMA-enabled distributed graph OLAP system, which adopts a series of new system designs to overcome the challenges of OLAP on graphs. The takeaways for demo attendees are: (1) a good understanding of the challenges of processing graph OLAP queries; (2) useful insights about where Grasper’s good performance comes from; (3) inspiration for how to design an efficient graph OLAP system by comparing Grasper with existing systems.
modde19 Interactively Discovering and Ranking Desired Tuples without Writing SQL Queries
Xuedi Qin (Tsinghua University);
Chengliang Chai (Tsinghua University);
Yuyu Luo (Tsinghua University);
Nan Tang (QCRI);
Guoliang Li (Tsinghua University)
The very first step of many data analytics tasks is to find and (possibly) rank desired tuples, typically by writing SQL queries. This is feasible only for data experts who can write SQL queries and know the data very well. Unfortunately, in practice, the queries might be complicated (for example, “find and rank good off-road cars based on a combination of Price, Make, Model, Age, Mileage, and so on” is complicated because it contains many if-then-else, and, or, and not conditions) such that even data experts cannot precisely specify SQL queries; and the data might be unknown, which is common in data discovery, where one tries to discover desired data from a data lake. Naturally, a system that can help users discover and rank desired tuples without writing SQL queries is needed. We propose to demonstrate such a system, namely DExPlorer. To use DExPlorer for data exploration, the user only needs to interactively perform two simple operations over a set of system-provided tuples: (1) annotate which tuples are desired (i.e., true labels) or not (i.e., false labels), and (2) annotate whether a tuple is more preferred than another one (i.e., partial orders or ranked lists). We will show that DExPlorer can find users’ desired tuples and rank them in a few interactions, even for complicated queries.
modde20 Synner: Generating Realistic Synthetic Data
Miro Mannino (New York University Abu Dhabi);
Azza Abouzied (New York University Abu Dhabi)
Synner allows users to generate realistic-looking data. With Synner, users can visually and declaratively specify properties of the dataset they wish to generate. Such properties include the domain and statistical distribution of each field, and relationships between fields. Users can also sketch custom distributions and relationships. Synner provides instant feedback on every user interaction by visualizing a preview of the generated data. It also suggests generation specifications from a few user-provided examples of data to generate, column labels and other user interactions. In this demonstration, we showcase Synner and summarize results from our evaluation of Synner’s effectiveness at generating realistic-looking data.
modde23 STAR: A Distributed Stream Warehouse System for Spatial Data
Zhida Chen (Nanyang Technological University);
Gao Cong (Nanyang Technological University);
Walid G. Aref (Purdue University)
The proliferation of mobile phones and location-based services gives rise to an explosive growth of spatial data. This spatial data contains valuable information, and calls for data stream warehouse systems that can provide real-time analytical results with the latest integrated spatial data. In this demonstration, we present the STAR (Spatial Data Stream Warehouse) system. STAR is a distributed in-memory spatial data stream warehouse system that provides low-latency and up-to-date analytical results over a fast spatial data stream. STAR supports a rich set of aggregate queries for spatial data analytics, e.g., contrasting the frequencies of spatial objects that appear in different spatial regions, or showing the most frequently mentioned topics being tweeted in different cities. STAR processes aggregate queries by maintaining distributed materialized views. Additionally, STAR supports dynamic load adjustment that makes STAR scalable and adaptive. We demonstrate STAR on top of Amazon EC2 clusters using real data sets.
SIGMOD Demo Session B.2
Wednesday 10:30 AM – 12:00 PM and Thursday 10:30 AM – 12:00 PM
modde06 vChain: A Blockchain System Ensuring Query Integrity
Haixin Wang (Hong Kong Baptist University);
Cheng Xu (Hong Kong Baptist University);
Ce Zhang (Hong Kong Baptist University);
Jianliang Xu (Hong Kong Baptist University)
This demonstration presents vChain, a blockchain system that ensures query integrity. With the proliferation of blockchain applications and services, there has been an increasing demand for querying the data stored in a blockchain database. However, existing solutions either risk losing query integrity or require users to maintain a full copy of the blockchain database. In comparison, by employing a novel verifiable query processing framework, vChain enables a lightweight user to authenticate the query results returned from a potentially untrusted service provider. We demonstrate its verifiable query operations, usability, and performance with visualization for better insights. We also showcase how users can detect falsified results in the case that the service provider is compromised.
modde07 AUDITOR: A System Designed for Automatic Discovery of Complex Integrity Constraints in Relational Databases
Wentao Hu (Zhejiang University);
Dongxiang Zhang (Zhejiang University);
Dawei Jiang (Zhejiang University);
Sai Wu (Zhejiang University);
Ke Chen (Zhejiang University);
Kian-Lee Tan (School of Computing, National University of Singapore);
Gang Chen (Zhejiang University)
In this demonstration, we present a new definition of integrity constraint that is more powerful for anomalous data discovery. In our definition, a constraint is defined over both categorical and numerical attributes in relational tables, as well as their derivative attributes, leading to a huge search space. Furthermore, we are the first to take into account attribute value distribution as part of a constraint. Based on the proposed integrity constraint, we build AUDITOR on top of relational tables from the healthcare auditing industry and demonstrate its effectiveness and ease of use for domain experts to discover anomalous data.
modde10 ProcAnalyzer: Effective Code Analyzer for Tuning Imperative Programs in SAP HANA
Kisung Park (SAP Labs Korea);
Taeyoung Jeong (SAP Labs Korea);
Chanho Jeong (SAP Labs Korea);
Jaeha Lee (SAP Labs Korea);
Dong-Hun Lee (SAP Labs Korea);
Young-Koo Lee (Kyung Hee University)
Troubleshooting imperative programs at runtime is very challenging because the final optimized plan is quite different from the original design-time model. In this demonstration, we present ProcAnalyzer, an expressive and intuitive tool for troubleshooting issues related to performance, code quality, and security. We propose an end-to-end graph (E2EGraph) that provides a holistic view of design-time, compile-time, and runtime behavior so that end users and engine developers can easily find the correlations between design time and runtime. ProcAnalyzer provides suggestions and visualization to find problematic statements through the E2EGraph.
modde11 LATTE: Visual Construction of Smart Contracts
Sean Tan (Nanyang Technological University);
Sourav S. Bhowmick (Nanyang Technological University);
Huey Eng Chua (Nanyang Technological University);
Xiaokui Xiao (National University of Singapore)
Smart contracts enable developers to run instructions on blockchains (e.g., Ethereum) and have a broad range of real-world applications. Solidity is the most popular high-level smart contract programming language on Ethereum. Coding in such a language, however, demands that a user be proficient in contract programming and debugging to construct smart contracts correctly. In practice, such an expectation makes it harder for non-programmers to take advantage of smart contracts. In this demonstration, we present a novel visual smart contract construction system on Ethereum called LATTE to make smart contract development accessible to non-programmers. Specifically, it allows a user to construct a contract without writing Solidity code by manipulating visual objects in a direct manipulation-based interface. Furthermore, LATTE interactively guides users and makes them aware of the cost (in units of Gas) of visual actions undertaken by them during contract construction.
modde27 CoMing: A Real-time Co-Movement Mining System for Streaming Trajectories
Ziquan Fang (Zhejiang University);
Yunjun Gao (Zhejiang University);
Lu Pan (Zhejiang University);
Lu Chen (Aalborg University);
Xiaoye Miao (Zhejiang University);
Christian S. Jensen (Aalborg University)
The aim of real-time co-movement pattern mining for streaming trajectories is to discover co-moving objects that satisfy specific spatio-temporal constraints in real time. This functionality serves a range of real-world applications, such as traffic monitoring and management. However, little work targets visualization of and interaction with such co-movement detection on streaming trajectories. To this end, we develop CoMing, a real-time co-movement pattern mining system, to handle streaming trajectories. CoMing leverages ICPE, a real-time distributed co-movement pattern detection framework, and thus inherits its good performance. This demonstration offers hands-on experience with CoMing’s visual and user-friendly interface. Moreover, several applications in the traffic domain, including object monitoring and traffic statistics visualization, are also provided to users.
modde32 Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing
Zuozhi Wang (University of California, Irvine);
Kai Zeng (Alibaba Group);
Botong Huang (Alibaba Group);
Wei Chen (Alibaba Group);
Xiaozong Cui (Alibaba Group);
Bo Wang (Alibaba Group);
Ji Liu (Alibaba Group);
Liya Fan (Alibaba Group);
Dachuan Qu (Alibaba Group);
Zhenyu Hou (Alibaba Group);
Tao Guan (Alibaba Group);
Chen Li (University of California, Irvine);
Jingren Zhou (Alibaba Group)
As the primary approach to deriving decision-support insights, automated recurring routine analytic jobs account for a major part of cluster resource usage in modern enterprise data warehouses. These recurring routine jobs usually have stringent schedules and deadlines determined by external business logic, and thus cause dreadful resource skew and severe resource over-provisioning in the cluster. In this paper, we present Grosbeak, a novel data warehouse that supports resource-aware incremental computing to process recurring routine jobs, smooths the resource skew, and optimizes the resource usage. Unlike batch processing in traditional data warehouses, Grosbeak leverages the fact that data is continuously ingested. It breaks an analysis job into small batches that incrementally process the progressively available data, and schedules these small-batch jobs intelligently when the cluster has free resources. In this demonstration, we showcase Grosbeak using real-world analysis pipelines. Users can interact with the data warehouse by registering recurring queries and observing the incremental scheduling behavior and smoothed resource usage pattern.
SIGMOD Demo Session C.1
Wednesday 1:30 PM – 3:30 PM and Thursday 1:30 PM – 3:00 PM
modde02 PL/SQL Without the PL
Denis Hirn (University of Tübingen);
Torsten Grust (University of Tübingen)
We demonstrate a source-to-source compilation technique that can translate user-defined PL/SQL functions into plain SQL queries. These PL/SQL functions may feature arbitrarily complex control flow, iteration in particular. From this imperative-style input code we derive equivalent recursive common table expressions, ready for execution on any relational SQL:1999 back-end. Principally due to the absence of PL/SQL↔SQL context switches, the output plain SQL code comes with substantial runtime savings. The demonstration embeds the compiler into an interactive setup that admits function editing while live re-compilation and visualization of intermediate code forms happen in the background.
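To make the idea of replacing a loop with a recursive common table expression concrete, here is a minimal hand-written sketch using Python's standard sqlite3 module. It is not output of the demonstrated compiler and does not reflect its intermediate forms; the function names, the CTE name run, and the choice of Euclid's algorithm are assumptions made for this example.

```python
import sqlite3


def gcd_loop(a, b):
    """Imperative version: a WHILE loop, as one might write it in a procedural function."""
    while b != 0:
        a, b = b, a % b
    return a


# Each row of `run` holds the loop variables after one iteration.
# The recursive branch performs one loop step; the loop condition becomes its WHERE clause.
GCD_CTE = """
WITH RECURSIVE run(a, b) AS (
    SELECT ?, ?
    UNION ALL
    SELECT b, a % b FROM run WHERE b <> 0
)
SELECT a FROM run WHERE b = 0
"""


def gcd_sql(a, b):
    """Same computation as a single plain SQL query, with no procedural code on the database side."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(GCD_CTE, (a, b)).fetchone()[0]
    finally:
        conn.close()


print(gcd_loop(48, 18), gcd_sql(48, 18))
```

The whole iteration runs inside one query, so the engine never switches between a procedural interpreter and the SQL executor, which is the source of the savings the abstract points to.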
modde03 Analysis of Database Search Systems with THOR
Theofilos Belmpas (Athena Research Center);
Orest Gkini (Athena Research Center);
Georgia Koutrika (Athena Research Center)
Numerous search systems have been implemented that allow users to pose unstructured queries over databases without the need to use a query language, such as SQL. Unfortunately, the landscape of efforts is fragmented, with no clear view of which system is best and what open challenges we should pursue in our research. To help in this direction, we present THOR, which makes four important contributions: a query benchmark, a framework for comparing different systems, several search system implementations, and a highly interactive tool for comparing different search systems.
modde08 SHARQL: Shape Analysis of Recursive SPARQL Queries
Angela Bonifati (Lyon 1 University);
Wim Martens (University of Bayreuth);
Thomas Timm (University of Bayreuth)
We showcase SHARQL, a system that allows users to navigate SPARQL query logs, can inspect complex queries by visualizing their shape, and can serve as a back-end to flexibly produce statistics about the logs. Even though SPARQL query logs are increasingly available and have become public recently, their navigation and analysis are hampered by the lack of appropriate tools. SPARQL queries are sometimes hard to understand, and their inherent properties, such as their shape, their hypertree properties, and their property paths, are even more difficult to identify and properly render. In SHARQL, we show how the analysis and exploration of several hundred million queries is possible. We offer edge rendering which works with complex hyperedges, regular edges, and property paths of SPARQL queries. The underlying database stores more than one hundred attributes per query and is therefore extremely flexible for exploring the query logs and as a back-end to compute and display analytical properties of the entire logs or parts thereof.
modde12 PROUD: PaRallel OUtlier Detection for Streams
Theodoros Toliopoulos (Aristotle University of Thessaloniki);
Christos Bellas (Aristotle University of Thessaloniki);
Anastasios Gounaris (Aristotle University of Thessaloniki);
Apostolos Papadopoulos (Aristotle University of Thessaloniki)
We introduce PROUD, standing for PaRallel OUtlier Detection for streams, which is an extensible engine for continuous multi-parameter parallel distance-based outlier (or anomaly) detection tailored to big data streams. PROUD is built on top of Flink. It defines a simple API for data ingestion. It supports a variety of parallel techniques, including novel ones, for continuous outlier detection that can be easily configured. In addition, it graphically reports metrics of interest and stores main results into a permanent store to enable future analysis. It can be easily extended to support additional techniques. Finally, it is publicly available as open source.
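For readers unfamiliar with distance-based outliers, the following Python sketch shows the commonly used definition over a count-based sliding window: a point is an outlier if fewer than k other points in the current window lie within distance R of it. This definition, the one-dimensional data, and all names are assumptions for illustration. The sketch is a naive single-process baseline and does not use Flink or any of PROUD's parallel techniques or API.

```python
from collections import deque


def window_outliers(points, radius, k):
    """Points with fewer than k neighbours within `radius` (naive O(n^2) check)."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points) if i != j and abs(p - q) <= radius
        )
        if neighbours < k:
            outliers.append(p)
    return outliers


def stream_outliers(stream, window, radius, k):
    """Report the outliers of every full sliding window, sliding by one point."""
    buf = deque(maxlen=window)  # the oldest point is evicted automatically
    reports = []
    for x in stream:
        buf.append(x)
        if len(buf) == window:
            reports.append(window_outliers(list(buf), radius, k))
    return reports


readings = [1.0, 1.2, 0.9, 5.0, 1.1, 1.3]  # hypothetical 1-D sensor stream
print(stream_outliers(readings, window=4, radius=0.5, k=2))
```

A multi-parameter setting would evaluate many (radius, k) pairs over the same stream; recomputing every window from scratch, as done here, is the cost that dedicated streaming algorithms aim to avoid.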
modde16 BugDoc: A System for Debugging Computational Pipelines
Raoni Lourenço (New York University);
Juliana Freire (New York University);
Dennis Shasha (New York University)
Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We recently proposed a new approach that makes use of provenance to automatically and iteratively infer root causes and derive succinct explanations of failures; this approach was implemented in our prototype, BugDoc. In this demonstration, we will illustrate BugDoc’s capabilities to debug pipelines using few configuration instances.
modde18 ExTuNe: Explaining Tuple Non-conformance
Anna Fariha (University of Massachusetts Amherst);
Ashish Tiwari (Microsoft);
Arjun Radhakrishna (Microsoft);
Sumit Gulwani (Microsoft)
In data-driven systems, we often encounter tuples on which the predictions of a machine-learned model are untrustworthy. A key cause of such untrustworthiness is non-conformance of a new tuple with respect to the training dataset. To check conformance, we introduce a novel concept of data invariant, which captures a set of implicit constraints that all tuples of a dataset satisfy: a test tuple is non-conforming if it violates the data invariants. Data invariants model complex relationships among multiple attributes, but do not provide interpretable explanations of non-conformance. We present ExTuNe, a system for Explaining causes of Tuple Non-conformance. Based on the principles of causality, ExTuNe assigns responsibility to the attributes for causing non-conformance. The key idea is to observe the change in invariant violation under intervention on attribute values. Through a simple interface, ExTuNe produces a ranked list of the test tuples based on their degree of non-conformance and visualizes tuple-level attribute responsibility for non-conformance through heat maps. ExTuNe further visualizes attribute responsibility, aggregated over the test tuples. We demonstrate how ExTuNe can detect and explain tuple non-conformance and assist users in making careful decisions towards achieving trusted machine learning.
SIGMOD Demo Session C.2
Wednesday 1:30 PM – 3:30 PM and Thursday 1:30 PM – 3:00 PM
modde14 MC3: A System for Minimization of Classifier Construction Cost
Shay Gershtein (Tel Aviv University);
Tova Milo (Tel Aviv University);
Gefen Morami (Tel Aviv University);
Slava Novgorodov (eBay Research)
Search mechanisms over massive sets of items are the cornerstone of many modern applications, particularly in e-commerce websites. Consumers express a set of properties in search queries and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether or not an item satisfies the search criteria is sometimes not explicitly recorded in the repository. Instead, it may be considered general knowledge or “hidden” in a picture/description, thereby leading to incomplete search results. To overcome these problems, companies invest in building dedicated classifiers that determine whether an item satisfies the given search criteria. However, building classifiers typically incurs non-trivial costs due to the required volumes of high-quality labeled training data. In this demo, we introduce MC3, a real-time system that helps data analysts decide which classifiers to construct to minimize the costs of answering a set of search queries. MC3 is interactive and facilitates real-time analysis by providing detailed classifier impact information. We demonstrate the effectiveness of MC3 on real-world data and scenarios taken from a large e-commerce system, by interacting with the SIGMOD’20 audience members who act as analysts.
modde17 TQVS: Temporal Queries over Video Streams in Action
Yueting Chen (York University);
Xiaohui Yu (York University);
Nick Koudas (University of Toronto)
We present TQVS, a system capable of efficiently evaluating declarative temporal queries over real-time video streams. Users may issue queries to identify video clips in which the same two cars and the same three persons appear jointly in the frames for, say, 30 seconds. In real-world videos, some of the objects may disappear from frames due to reasons such as occlusion, which introduces challenges to query evaluation. Our system, aiming to address such challenges, consists of two main components: the Object Detection and Tracking (ODT) module and the Query Evaluation module. The ODT module utilizes state-of-the-art Object Detection and Tracking algorithms to produce a list of identified objects for each frame. Based on these results, we maintain select object combinations through the current window during query evaluation. Those object combinations contain sufficient information to evaluate queries correctly. Since the number of possible combinations could be very large, we introduce a novel technique to structure the possible combinations and facilitate query evaluation. We demonstrate that our approach offers significant performance benefits compared to alternative approaches and constitutes a fundamental building block of the TQVS system.
modde21 InCognitoMatch: Cognitive-aware Matching via Crowdsourcing
Roee Shraga (Technion – Israel Institute of Technology); Coral Scharf (Technion – Israel Institute of Technology); Rakefet Ackerman (Technion – Israel Institute of Technology);
Avigdor Gal (Technion – Israel Institute of Technology)
We present InCognitoMatch, the first cognitive-aware crowdsourcing application for matching tasks. InCognitoMatch provides a handy tool to validate, annotate, and correct correspondences using the crowd whilst accounting for human matching biases. In addition, InCognitoMatch enables system administrators to control the context information visible to workers and analyze their performance accordingly. For crowd workers, InCognitoMatch is an easy-to-use application that may be accessed from multiple crowdsourcing platforms. In addition, workers completing a task are offered suggestions for follow-up sessions according to their performance in the current session. For this demo, the audience will be able to experience InCognitoMatch through three use cases, interacting with the system as workers and as administrators.
modde22 CoClean: Collaborative Data Cleaning
Mashaal Musleh (University of Minnesota);
Mourad Ouzzani (QCRI, HBKU); Nan Tang (QCRI, HBKU); AnHai Doan (University of Wisconsin)
High-quality data is crucial for many applications, but real-life data is often dirty. Unfortunately, automated solutions are often not trustworthy and are thus seldom employed in practice. In real-world scenarios, it is often necessary to resort to manual cleaning to obtain pristine data. Existing human-in-the-loop solutions, such as Trifacta and OpenRefine, typically involve a single user. This is often error-prone, limited to a single person's expertise, and cannot scale with the ever-growing volume, variety and veracity of data. We propose a crowd-in-the-loop cleaning system, called CoClean, built on top of the Python Pandas dataframe, a library widely used by data scientists. The core of CoClean is a new Python library called Collaborative dataframe (CDF) that allows one to share data represented as a dataframe with other users. CDF is responsible for synchronizing and aggregating annotations obtained from different users. The attendees will have the opportunity to experience the following features: (1) Data Assignment: Given a dataframe, the owner can assign it (or a subset of it) to different users. (2) Supporting both lay and power users: lay users can use a GUI for direct manual cleaning of the data, while power users can work on the assigned data through a Jupyter Notebook where they can write scripts to do batch cleaning. (3) Combining machines and humans: Possible errors and repairs generated by machine algorithms can be highlighted as annotations, which can make manual cleaning easier for users. (4) Collaboration Modes: CoClean supports two modes: blind-on (no user can see the annotations from others) and blind-off.
modde24 T-REx: Table Repair Explanations
Daniel Deutch (Tel Aviv University);
Nave Frost (Tel Aviv University);
Amir Gilad (Tel Aviv University);
Oren Sheffer (Tel Aviv University)
Data repair is a common and crucial step in many frameworks today, as applications may use data from different sources and of different levels of credibility. Thus, this step has been the focus of many works, proposing diverse approaches. To assist users in understanding the output of such data repair algorithms, we propose T-REx, a system for providing data repair explanations through Shapley values. The system is generic and not specific to a given repair algorithm or approach: it treats the algorithm as a black box. Given a specific table cell selected by the user, T-REx employs Shapley values to explain the significance of each constraint and each table cell in the repair of the cell of interest. T-REx then ranks the constraints and table cells according to their importance in the repair of this cell. This explanation allows users to understand the repair process and to act on this knowledge by modifying the most influential constraints or the original database.
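To make the Shapley-value idea concrete, the sketch below computes exact Shapley values for a small set of constraints by averaging each constraint's marginal contribution over all orderings, and then ranks the constraints by score. It is only an illustration of the general technique: the black-box function cell_is_repaired and the constraint names C1, C2, and C3 are hypothetical stand-ins, not T-REx's implementation or interface.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the players preceding it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def cell_is_repaired(constraints):
    """Hypothetical black-box repair outcome for the cell of interest.

    Assumption for this example: the cell is repaired if C1 is active,
    or if C2 and C3 are both active.
    """
    return 1.0 if "C1" in constraints or {"C2", "C3"} <= constraints else 0.0


scores = shapley_values(["C1", "C2", "C3"], cell_is_repaired)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Here C1 receives a score of 2/3 and C2 and C3 receive 1/6 each, so C1 ranks first. Exact enumeration takes time factorial in the number of players, so it only suits very small inputs; larger inputs usually call for sampling-based approximations.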
modde34 Bring Your Own Data to X-PLAIN
Eliana Pastor (Politecnico di Torino); Elena Baralis (Politecnico di Torino)
Exploring and understanding the motivations behind black-box model predictions is becoming essential in many different applications. X-PLAIN is an interactive tool that allows human-in-the-loop inspection of the reasons behind model predictions. Its support for the local analysis of individual predictions enables users to inspect the local behavior of different classifiers and compare the knowledge that different classifiers exploit for their predictions. The interactive exploration of prediction explanations provides actionable insights for both trusting and validating model predictions and, in case of unexpected behaviors, for debugging and improving the model itself.