Industry

For meeting colleagues, chatting, and holding discussions, SIGMOD provides a virtual conference site in Gather.

Industry 1: Machine Learning and Analytics
Tuesday 1:30 PM – 3:00 PM
modin711 Elastic Machine Learning Algorithms in Amazon SageMaker
Edo Liberty (Amazon AI); Zohar Karnin (Amazon AI); Bing Xiang (Amazon AI); Laurence Rouesnel (Amazon AI); Baris Coskun (Amazon AI); Ramesh Nallapati (Amazon AI); Julio Delgado (Amazon AI); Amir Sadoughi (Amazon AI); Yury Astashonok (Amazon AI); Piali Das (Amazon AI); Can Balioglu (Amazon AI); Saswata Chakravarty (Amazon AI); Madhav Jha (Amazon AI); Philip Gautier (Amazon AI); David Arpin (Amazon AI); Tim Januschowski (Amazon AI); Valentin Flunkert (Amazon AI); Yuyang Wang (Amazon AI); Jan Gasthaus (Amazon AI); Lorenzo Stella (Amazon AI); Syama Rangapuram (Amazon AI); David Salinas (Amazon AI); Sebastian Schelter (Amazon AI); Alex Smola (Amazon AI)
There is a large body of research on scalable machine learning (ML). Nevertheless, training ML models on large, continuously evolving datasets is still a difficult and costly undertaking for many companies and institutions. We discuss such challenges and derive requirements for an industrial-scale ML platform. Next, we describe the computational model behind Amazon SageMaker, which is designed to meet such challenges. SageMaker is an ML platform provided as part of Amazon Web Services (AWS), and supports incremental training, resumable and elastic learning as well as automatic hyperparameter optimization. We detail how to adapt several popular ML algorithms to its computational model. Finally, we present an experimental evaluation on large datasets, comparing SageMaker to several scalable, JVM-based implementations of ML algorithms, which we significantly outperform with regard to computation time and cost. https://doi.org/10.1145/3318464.3386126
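The abstract does not spell out SageMaker’s computational model in detail, but incremental, resumable, and elastic training all follow naturally when learning is expressed as commutative, mergeable state updates that can be checkpointed. The toy sketch below illustrates only that idea; the class and its methods are invented for illustration and are not the SageMaker API.

```python
import pickle

class ElasticMeanLearner:
    """Toy streaming learner whose state (count, sum) can be checkpointed,
    resumed, and merged across a varying number of workers."""

    def __init__(self):
        self.n, self.total = 0, 0.0

    def update(self, batch):
        # Incremental training: fold a new mini-batch into the state.
        self.n += len(batch)
        self.total += sum(batch)

    def merge(self, other):
        # Elasticity: partial states from any number of workers combine
        # associatively and commutatively.
        self.n += other.n
        self.total += other.total

    def checkpoint(self, path):
        # Resumability: serialize the state; training can restart anywhere.
        with open(path, "wb") as f:
            pickle.dump((self.n, self.total), f)

    def restore(self, path):
        with open(path, "rb") as f:
            self.n, self.total = pickle.load(f)

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

# Two "workers" train on disjoint shards, then merge their partial states.
w1, w2 = ElasticMeanLearner(), ElasticMeanLearner()
w1.update([1.0, 2.0]); w2.update([3.0, 5.0])
w1.merge(w2)
print(w1.mean)  # 2.75
```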
modin741 Timon: A Timestamped Event Database for Efficient Telemetry Data Processing and Analytics
Wei Cao (Zhejiang University & Alibaba Group); Yusong Gao (Alibaba Group); Feifei Li (Alibaba Group); Sheng Wang (Alibaba Group); Bingchen Lin (Alibaba Group); Ke Xu (Alibaba Group); Xiaojie Feng (Alibaba Group); Yucong Wang (Alibaba Group); Zhenjun Liu (Alibaba Group); Gejin Zhang (Alibaba Group)
With the increasing demand for real-time system monitoring and tracking in various contexts, the amount of time-stamped event data grows at an astonishing rate. Analytics on time-stamped events must be real time, and the aggregated results need to be accurate even when data arrives out of order. Unfortunately, frequent occurrences of out-of-order data can significantly slow down processing and cause large delays in query response. Timon is a timestamped event database that aims to support aggregations and handle late arrivals both correctly (i.e., upholding exactly-once semantics) and efficiently. Our insight is that a broad range of applications can be implemented with data structures and corresponding operators that satisfy associative and commutative properties. Records arriving after the low watermark are appended to Timon directly, allowing aggregations to be performed lazily. To improve query efficiency, Timon maintains a TS-LSM-Tree, which keeps the most recent data in memory and contains a time-partitioning tree on disk for high-volume data accumulated over a long time span. In addition, Timon supports materialized aggregation views and correlation analysis across multiple streams. Timon has been successfully deployed at Alibaba Cloud and is a critical building block for Alibaba Cloud’s continuous monitoring and anomaly analysis infrastructure. https://doi.org/10.1145/3318464.3386136
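Timon’s central insight, as described above, is that aggregations built from associative and commutative operators can absorb out-of-order records by merging them into existing partial results rather than reprocessing. A minimal sketch of that idea, with invented names and a (count, sum, max) aggregate standing in for Timon’s operators:

```python
from collections import defaultdict

# Partial aggregates per time bucket. Because the combine operator is
# associative and commutative, a record arriving after the low watermark
# is simply merged into its bucket: no reprocessing is needed, and
# exactly-once results are preserved.
def combine(a, b):                 # (count, sum, max) forms a commutative monoid
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

IDENTITY = (0, 0.0, float("-inf"))
buckets = defaultdict(lambda: IDENTITY)

def ingest(minute, value):
    buckets[minute] = combine(buckets[minute], (1, value, value))

ingest(10, 4.0)
ingest(11, 7.0)
ingest(10, 1.0)        # out-of-order arrival for minute 10: just another merge
print(buckets[10])     # (2, 5.0, 4.0)
```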
modin744 Vertica-ML: Distributed Machine Learning in Vertica Database
Arash Fard (Vertica); Anh Le (Vertica); George Larionov (Vertica); Waqas Dhillon (Vertica); Chuck Bear (Vertica)
A growing number of companies rely on machine learning as a key element for gaining a competitive edge from their collected Big Data. An in-database machine learning system can provide many advantages in this scenario, e.g., eliminating the overhead of data transfer, avoiding the maintenance costs of a separate analytical system, and addressing data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, provides machine learning functionality through a SQL API that covers a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects, like tables and views, so they benefit from the same mechanisms for archiving and management. We explain the architecture of the subsystem and present a set of experiments evaluating the performance of the machine learning algorithms implemented on top of it. https://doi.org/10.1145/3318464.3386137
modin750 Database Workload Capacity Planning using Time Series Analysis and Machine Learning
Antony S. Higginson (Oracle Advanced Customer Services); Mihaela Dediu (Oracle Advanced Customer Services); Octavian Arsene (Oracle Advanced Customer Services); Norman W. Paton (University of Manchester); Suzanne M. Embury (University of Manchester)
When procuring or administering any I.T. system or component, it is crucial to understand the computational resources required to run the critical business functions that are governed by Service Level Agreements. Predicting the resources needed for future consumption is like looking into the proverbial crystal ball. In this paper we look at the forecasting techniques in use today and evaluate whether those techniques are applicable to the deeper layers of the technological stack, such as clustered database instances, applications, and the groups of transactions that make up the database workload. Our approach uses supervised machine learning to identify traits that the workloads exhibit, such as recurring patterns, shocks, and trends, and accounts for those traits in the forecast. An experimental evaluation shows that the approach reduces the complexity of performing a forecast and produces accurate predictions for complex workloads. https://doi.org/10.1145/3318464.3386140
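The paper’s concrete models are not reproduced in the abstract; the sketch below shows only the general trend-plus-recurring-pattern decomposition that such workload forecasts build on. All functions and data are invented for illustration, and shocks and level shifts are omitted:

```python
# Toy decomposition-based forecast: estimate a linear trend and a recurring
# per-period pattern from history, then project both forward.
def forecast(history, period, horizon):
    n = len(history)
    # Linear trend via least squares over indices 0..n-1.
    xbar = (n - 1) / 2
    ybar = sum(history) / n
    slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(history)) / \
            sum((i - xbar) ** 2 for i in range(n))
    intercept = ybar - slope * xbar
    detrended = [y - (intercept + slope * i) for i, y in enumerate(history)]
    # Recurring pattern: average residual at each position within the period.
    season, counts = [0.0] * period, [0] * period
    for i, r in enumerate(detrended):
        season[i % period] += r
        counts[i % period] += 1
    season = [s / c for s, c in zip(season, counts)]
    return [intercept + slope * (n + h) + season[(n + h) % period]
            for h in range(horizon)]

# Three "days" of four samples each: a rising trend plus a peak in slot 2.
hist = [10, 12, 20, 11,  12, 14, 22, 13,  14, 16, 24, 15]
print([round(v, 1) for v in forecast(hist, period=4, horizon=4)])
```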
modin868 The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development
Micah J. Smith (Massachusetts Institute of Technology); Carles Sala (Massachusetts Institute of Technology); James Max Kanter (Feature Labs); Kalyan Veeramachaneni (Massachusetts Institute of Technology)
As machine learning is applied more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of “pipeline jungles” – brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML pipelines, abstracting away glue code, data flow, and data storage. We further pair these pipelines with a hierarchy of AutoML strategies – Bayesian optimization and bandit learning. We use these components to create a general-purpose, multi-task, end-to-end AutoML system that provides solutions for a variety of data modalities (image, text, graph, tabular, relational, etc.) and problem types (classification, regression, anomaly detection, graph matching, etc.). We demonstrate 5 real-world use cases and 2 case studies of our approach. Finally, we present an evaluation suite of 456 real-world ML tasks and describe the characteristics of 2.5 million pipelines searched over this task suite. https://doi.org/10.1145/3318464.3386146
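To make the “ML primitives” idea concrete: when every component exposes one uniform contract, a pipeline reduces to a list of steps and the glue code disappears. The sketch below is a hypothetical illustration of that design, not the actual Machine Learning Bazaar API:

```python
# Hypothetical primitive interface: every step exposes the same fit/produce
# contract, so composing a pipeline is just chaining a list of steps.
class Primitive:
    def fit(self, X, y=None):
        return self
    def produce(self, X):
        return X

class Scale(Primitive):                  # toy preprocessing primitive
    def fit(self, X, y=None):
        self.lo, self.hi = min(X), max(X)
        return self
    def produce(self, X):
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in X]

class Threshold(Primitive):              # toy "classifier" primitive
    def fit(self, X, y=None):
        self.t = sum(X) / len(X)
        return self
    def produce(self, X):
        return [int(x > self.t) for x in X]

class Pipeline:
    def __init__(self, steps):
        self.steps = steps
    def fit_produce(self, X, y=None):
        for step in self.steps:          # no per-pair glue code needed
            X = step.fit(X, y).produce(X)
        return X

print(Pipeline([Scale(), Threshold()]).fit_produce([3.0, 9.0, 6.0, 12.0]))
# [0, 1, 0, 1]
```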
Industry 2: Cloud and Distributed Databases
Wednesday 1:30 PM – 3:00 PM
modin718 A Framework for Emulating Database Operations in Cloud Data Warehouses
Mohamed A Soliman (Datometry, Inc.); Lyublena Antova (Datometry, Inc.); Marc Sugiyama (Datometry, Inc.); Michael Duller (Datometry, Inc.); Amirhossein Aleyasen (Datometry, Inc.); Gourab Mitra (Datometry, Inc.); Ehab Abdelhamid (Datometry, Inc.); Mark Morcos (Datometry, Inc.); Michele Gage (Datometry, Inc.); Dmitri Korablev (Datometry, Inc.); Florian M Waas (Datometry, Inc.)
In recent years, interest in cloud-based data warehousing technologies has grown, with many enterprises moving away from on-premise data warehousing solutions. The incentives for adopting cloud data warehousing technologies are many: cost-cutting, on-demand pricing, offloading data centers, unlimited hardware resources, and built-in disaster recovery, to name a few. However, there are inherent differences in the language surfaces and feature sets of on-premise and cloud data warehousing solutions. These range from subtle syntactic and semantic differences, with potentially large impact on result correctness, to entire features that exist in one system but are missing in others. While there have been some efforts to help automate the migration of on-premise applications to new cloud environments, a major challenge that slows down the migration pace is the handling of features not yet supported, or only partially supported, by the cloud technologies. In this paper we build on our earlier work in adaptive data virtualization and present novel techniques that allow applications utilizing sophisticated database features to run within foreign query engines lacking native support for such features. In particular, we introduce a framework to manage discrepancies in metadata across heterogeneous query engines, and various mechanisms to emulate database application code in cloud environments without any need to rewrite or change the application code. https://doi.org/10.1145/3318464.3386128
modin719 Taurus Database: How to be Fast, Available, and Frugal in the Cloud
Alex Depoutovitch (Huawei Research); Chong Chen (Huawei Research); Jin Chen (Huawei); Paul Larson (Huawei Research); Shu Lin (Huawei Research); Jack Ng (Huawei Research); Wenlin Cui (Huawei Research); Qiang Liu (Huawei Research); Wei Huang (Huawei Research); Yong Xiao (Huawei Research); Yongjun He (Huawei Research)
Using cloud Database as a Service (DBaaS) offerings instead of on-premise deployments is increasingly common. Key advantages include improved availability and scalability at a lower cost than on-premise alternatives. In this paper, we describe the design of Taurus, a new multi-tenant cloud database system. Taurus separates the compute and storage layers in a similar manner to Amazon Aurora and Microsoft Socrates and provides similar benefits, such as read replica support, low network utilization, hardware sharing and scalability. However, the Taurus architecture has several unique advantages. Taurus offers novel replication and recovery algorithms providing better availability than existing approaches using the same or fewer replicas. Also, Taurus is highly optimized for performance, using no more than one network hop on critical paths and exclusively using append-only storage, delivering faster writes, reduced device wear, and constant-time snapshots. This paper describes Taurus and provides a detailed description and analysis of the storage node architecture, which has not been previously available from the published literature. https://doi.org/10.1145/3318464.3386129
modin721 Reliability Analytics for Cloud Based Distributed Databases
Mathieu B. Demarne (Microsoft Corporation); Jim Gramling (Microsoft Corporation); Tomer Verona (Microsoft Corporation); Miso Cilimdzic (Microsoft Corporation)
We present RADD, an innovative analytic pipeline used to measure reliability and availability for cloud-based distributed databases by leveraging the vast amount of telemetry present in the cloud. RADD can perform root cause analysis (RCA) to provide a minute-by-minute summary of the availability of a database close to real-time. On top of this data, RADD can raise alerts, analyze the stability of new versions during their deployment, and provide Key Performance Indicators (KPIs) that allow us to understand the stability of our system across all deployed databases. RADD implements an event correlation framework that puts the emphasis on data compliance and uses information entropy to measure causality and reduce noisy signals. It also uses statistical modelling to analyze new versions of the product and detect potential regressions early in our software development lifecycle. We demonstrate the application of RADD on top of Azure Synapse Analytics, where the system has helped us identify top-hitting and new issues and support on-call teams regarding every aspect of database health. https://doi.org/10.1145/3318464.3386130
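One way to read the abstract’s use of information entropy for causality: a candidate signal is ranked by how much it reduces uncertainty about the outage (its mutual information), while noisy, uncorrelated signals score near zero and are pruned. A toy illustration under that reading, with invented data; RADD’s actual formulation may differ:

```python
from math import log2

# Binary per-minute indicators for a candidate signal and an outage.
def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_information(signal, outage):
    # I(signal; outage) = H(outage) - H(outage | signal): how much knowing
    # the signal reduces uncertainty about the outage.
    n = len(signal)
    mi = entropy(sum(outage) / n)
    for s in (0, 1):
        window = [o for sig, o in zip(signal, outage) if sig == s]
        if window:
            mi -= (len(window) / n) * entropy(sum(window) / len(window))
    return mi

outage = [0, 0, 1, 1, 0, 0, 1, 1]
good   = [0, 0, 1, 1, 0, 0, 1, 1]   # co-occurs with the outage
noise  = [1, 0, 1, 0, 1, 0, 1, 0]   # uncorrelated chatter
print(mutual_information(good, outage))   # 1.0 bit: strong candidate cause
print(mutual_information(noise, outage))  # 0.0 bits: pruned as noise
```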
modin730 CockroachDB: The Resilient Geo-Distributed SQL Database
Rebecca Taft (Cockroach Labs); Irfan Sharif (Cockroach Labs); Andrei Matei (Cockroach Labs); Nathan VanBenschoten (Cockroach Labs); Jordan Lewis (Cockroach Labs); Tobias Grieger (Cockroach Labs); Kai Niemi (Cockroach Labs); Andy Woods (Cockroach Labs); Anne Birzin (Cockroach Labs); Raphael Poss (Cockroach Labs); Paul Bardea (Cockroach Labs); Amruta Ranade (Cockroach Labs); Ben Darnell (Cockroach Labs); Bram Gruneir (Cockroach Labs); Justin Jaffray (Cockroach Labs); Lucy Zhang (Cockroach Labs); Peter Mattis (Cockroach Labs)
We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years. https://doi.org/10.1145/3318464.3386134
modin751 Azure SQL Database Always Encrypted
Panagiotis Antonopoulos (Microsoft Azure and Microsoft Research); Arvind Arasu (Microsoft Azure and Microsoft Research); Kunal D. Singh (Microsoft Azure and Microsoft Research); Ken Eguro (Microsoft Azure and Microsoft Research); Nitish Gupta (Microsoft Azure and Microsoft Research); Rajat Jain (Microsoft Azure and Microsoft Research); Raghav Kaushik (Microsoft Azure and Microsoft Research); Hanuma Kodavalla (Microsoft Azure and Microsoft Research); Donald Kossmann (Microsoft Azure and Microsoft Research); Nikolas Ogg (Microsoft Azure and Microsoft Research); Ravi Ramamurthy (Microsoft Azure and Microsoft Research); Jakub Szymaszek (Microsoft Azure and Microsoft Research); Jeffrey Trimmer (Microsoft Azure and Microsoft Research); Kapil Vaswani (Microsoft Azure and Microsoft Research); Ramarathnam Venkatesan (Microsoft Azure and Microsoft Research); Mike Zwilling (Microsoft Azure and Microsoft Research)
This paper presents Always Encrypted, a recently released feature of Microsoft SQL Server that uses column granularity encryption to provide cryptographic data protection guarantees. Always Encrypted can be used to outsource database administration while keeping the data confidential from an administrator, including cloud operators. The first version of Always Encrypted was released in Azure SQL Database and as part of SQL Server 2016, and supported equality operations over deterministically encrypted columns. The second version, released as part of SQL Server 2019, uses an enclave running within a trusted execution environment to provide richer functionality that includes comparison and string pattern matching for an IND-CPA-secure (randomized) encryption scheme. We present the security, functionality, and design of Always Encrypted, and provide a performance evaluation using the TPC-C benchmark. https://doi.org/10.1145/3318464.3386141
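The first version’s support for equality over deterministically encrypted columns rests on a simple property: equal plaintexts produce equal ciphertexts, so the server can compare and index values it cannot read. The toy sketch below demonstrates only that property, using a keyed HMAC tag as a stand-in for the encryption (a tag is not decryptable; the real scheme uses authenticated encryption), and every name here is invented:

```python
import hmac, hashlib

# The column key lives with the trusted client; the server never sees it.
KEY = b"column-encryption-key-held-only-by-the-client"

def det_tag(value: str) -> bytes:
    # Deterministic: the same plaintext always yields the same tag, which
    # is exactly what makes server-side equality (and indexing) possible.
    return hmac.new(KEY, value.encode(), hashlib.sha256).digest()

# "Server-side" storage sees only opaque tags...
rows = [("alice", det_tag("555-1234")), ("bob", det_tag("555-9999"))]

# ...yet an equality predicate still works: the client encrypts the literal
# and the server compares ciphertexts without learning the plaintext.
needle = det_tag("555-1234")
print([name for name, tag in rows if tag == needle])  # ['alice']
```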
Industry 3: Graph Databases and Knowledge Bases
Wednesday 4:30 PM – 6:00 PM
modin727 AliCoCo: Alibaba E-commerce Cognitive Concept Net
Xusheng Luo (Alibaba Group); Luxin Liu (Alibaba Group); Yonghua Yang (Alibaba Group); Le Bo (Alibaba Group); Yuanpeng Cao (Alibaba Group); Jinghang Wu (Alibaba Group); Qiang Li (Alibaba Group); Keping Yang (Alibaba Group); Kenny Q. Zhu (Shanghai Jiao Tong University)
One of the ultimate goals of e-commerce platforms is to satisfy the various shopping needs of their customers. Much effort has been devoted to creating taxonomies and ontologies in e-commerce towards this goal. However, user needs in e-commerce are still not well defined, and none of the existing ontologies has enough depth and breadth for universal understanding of user needs. The resulting semantic gap prevents the shopping experience from becoming more intelligent. In this paper, we propose the construction of a large-scale E-commerce Cognitive Concept Net named “AliCoCo”, which is deployed at Alibaba, the largest Chinese e-commerce platform in the world. We formally define user needs in e-commerce, then conceptualize them as nodes in the net. We present details on how AliCoCo is constructed semi-automatically, along with its successful, ongoing, and potential applications in e-commerce. https://doi.org/10.1145/3318464.3386132
modin731 A1: A Distributed In-Memory Graph Database
Chiranjeeb Buragohain (Oracle & Microsoft); Knut Magne Risvik (Microsoft); Paul Brett (Microsoft); Miguel Castro (Microsoft Research); Wonhee Cho (Microsoft); Joshua Cowhig (Microsoft); Nikolas Gloy (Microsoft); Karthik Kalyanaraman (Microsoft); Richendra Khanna (Oracle & Microsoft); John Pao (Microsoft); Matthew Renzelmann (Microsoft); Alex Shamis (Microsoft Research); Timothy Tan (Amazon); Shuheng Zheng (Microsoft & Amazon)
A1 is an in-memory distributed database used by the Bing search engine to support complex queries over structured data. The key enablers for A1 are the availability of cheap DRAM and high-speed RDMA (Remote Direct Memory Access) networking in commodity hardware. A1 uses FaRM as its underlying storage layer and builds the graph abstraction and query engine on top. The combination of in-memory storage and RDMA access requires rethinking how data is allocated, organized, and queried in a large distributed system. A single A1 cluster can store tens of billions of vertices and edges and support a throughput of 350+ million vertex reads per second with end-to-end query latency in the single-digit milliseconds. In this paper we describe the A1 data model, RDMA-optimized data structures, and query execution. https://doi.org/10.1145/3318464.3386135
modin745 IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2
Yuanyuan Tian (IBM Research); En Liang Xu (IBM Research); Wei Zhao (IBM Research); Mir Hamid Pirahesh (IBM Research); Sui Jun Tong (IBM Research); Wen Sun (IBM Research); Thomas Kolanko (IBM Cloud and Cognitive Software); Md. Shahidul Haque Apu (IBM Cloud and Cognitive Software); Huijuan Peng (IBM Cloud and Cognitive Software)
To meet the challenge of analyzing the rapidly growing graph and network data created by modern applications, a large number of graph databases have emerged, such as Neo4j and JanusGraph. They mainly target low-latency graph queries, such as finding the neighbors of a vertex with certain properties, and retrieving the shortest path between two vertices. Although many of these graph databases handle graph-only queries very well, they fall short for real-life applications involving graph analysis. This is because graph queries are not all that one does in the analytics workload of a real-life application; they are often only one part of an integrated heterogeneous analytics pipeline, which may include SQL, machine learning, graph, and other analytics. This means graph queries need to be synergistic with other analytics. Unfortunately, most existing graph databases are standalone and cannot easily integrate with other analytics systems. In addition, much graph data (data about relationships between objects or people) is already prevalent in existing non-graph databases, although it is not explicitly stored as graphs. None of the existing graph databases can retrofit graph queries onto this existing data without transferring or transforming it. In this paper, we propose an in-DBMS graph query approach, IBM Db2 Graph, to support synergistic and retrofittable graph queries inside the IBM Db2 relational database. It is implemented as a layer inside Db2 and thus can support integrated graph and SQL analytics efficiently. Db2 Graph employs a novel graph overlay approach to expose a graph view of the relational data. This approach flexibly retrofits graph queries onto existing graph data stored in relational tables, without expensive data transfer or transformation. It also enables efficient execution of graph queries with the help of the Db2 relational engine, through sophisticated compile-time and runtime optimization strategies. Our experimental study, as well as our experience with real customers using Db2 Graph, shows that Db2 Graph provides very competitive and sometimes even better performance on graph-only queries compared to existing graph databases. Moreover, it optimizes the overall performance of complex analytics workloads. https://doi.org/10.1145/3318464.3386138
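To make the graph-overlay idea concrete: vertices and edges are views over existing relational rows, so a graph query such as “neighbors of v” is answered by lookups on the base tables without moving or transforming data. A toy sketch under that reading; the tables, columns, and class below are invented and do not reflect Db2 Graph’s actual overlay configuration:

```python
# Existing relational tables, untouched: one vertex table, one edge table.
employees = [  # vertex table: one row per vertex
    {"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}, {"id": 3, "name": "Sam"},
]
reports_to = [  # edge table: foreign-key pairs are the edges
    {"src": 2, "dst": 1}, {"src": 3, "dst": 1},
]

class GraphOverlay:
    """Exposes a graph view over relational rows without copying them."""
    def __init__(self, vertices, edges, vid="id", src="src", dst="dst"):
        self.v = {row[vid]: row for row in vertices}
        self.out = {}
        for e in edges:                 # index edges (an engine would do this lazily)
            self.out.setdefault(e[src], []).append(e[dst])

    def neighbors(self, vid):
        # A graph query is answered by lookups against the base tables.
        return [self.v[d] for d in self.out.get(vid, [])]

g = GraphOverlay(employees, reports_to)
print([v["name"] for v in g.neighbors(2)])   # ['Ada']
```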
modin749 An Ontology-Based Conversation System for Knowledge Bases
Abdul Quamar (IBM Research – Almaden); Chuan Lei (IBM Research – Almaden); Dorian Miller (IBM Watson Health); Fatma Ozcan (IBM Research – Almaden); Jeffrey Kreulen (IBM Watson Health); Robert J. Moore (IBM Research – Almaden); Vasilis Efthymiou (IBM Research – Almaden)
Domain-specific knowledge bases (KB), carefully curated from various data sources, provide an invaluable reference for professionals. Conversation systems make these KBs easily accessible to professionals and are gaining popularity due to recent advances in natural language understanding and AI. Despite the increasing use of various conversation systems in open-domain applications, the requirements of a domain-specific conversation system are quite different and challenging. In this paper, we propose an ontology-based conversation system for domain-specific KBs. In particular, we exploit the domain knowledge inherent in the domain ontology to identify user intents, and the corresponding entities to bootstrap the conversation space. We incorporate the feedback from domain experts to further refine these patterns, and use them to generate training samples for the conversation model, lifting the heavy burden from the conversation designers. We have incorporated our innovations into a conversation agent focused on healthcare as a feature of the IBM Micromedex product. https://doi.org/10.1145/3318464.3386139
modin759 Aggregation Support for Modern Graph Analytics in TigerGraph
Alin Deutsch (University of California, San Diego & TigerGraph); Yu Xu (TigerGraph); Mingxi Wu (TigerGraph); Victor E. Lee (TigerGraph)
We describe how GSQL, TigerGraph’s graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power and the evaluation complexity of the specified aggregation. We detail our design showing how our ideas transcend GSQL and are eminently portable to the upcoming graph query language standards as well as the existing pattern-based declarative query languages. https://doi.org/10.1145/3318464.3386144
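A distinctive part of GSQL’s aggregation design is the accumulator: aggregation state attached to vertices, into which contributions deposited during traversal combine commutatively. The sketch below mimics only that general idea in Python with a SumAccum-like map; it is not GSQL syntax or semantics:

```python
from collections import defaultdict

# A toy graph as weighted edges.
edges = [("a", "b", 3.0), ("a", "c", 1.0), ("b", "c", 2.0)]

# Per-vertex accumulator, in the spirit of a vertex-attached SumAccum:
# each traversed edge deposits its weight into the target's accumulator.
sum_acc = defaultdict(float)

# One traversal "hop". Deposits commute and associate, which is what lets
# a real engine apply them in parallel without coordination.
for src, dst, w in edges:
    sum_acc[dst] += w

print(dict(sum_acc))   # {'b': 3.0, 'c': 3.0}
```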
modin867 GIANT: Scalable Creation of a Web-scale Ontology
Bang Liu (University of Alberta); Weidong Guo (Platform and Content Group, Tencent); Di Niu (University of Alberta); Jinwen Luo (Tencent); Chaoyue Wang (Tencent); Zhen Wen (Tencent); Yu Xu (Tencent)
Understanding what online users may pay attention to on the web is key to content recommendation and search services. These services will benefit from a highly structured, web-scale ontology of entities, concepts, events, topics, and categories. While existing knowledge bases and taxonomies embody a large volume of entities and categories, we argue that they fail to discover properly grained concepts, events, and topics in the language style of online users, nor do they maintain a logically structured ontology among these notions. In this paper, we present GIANT, a mechanism to construct a user-centered, web-scale, structured ontology containing a large number of natural language phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs. Various types of edges are also constructed to maintain a hierarchy in the ontology. We present the detailed techniques used in GIANT, evaluate the proposed models and methods against a variety of baselines, and deploy the resulting Attention Ontology in real-world applications involving over a billion users to observe its effect on content recommendation. The online performance of the ontology built by GIANT shows that it can significantly improve the click-through rate of news feed recommendation. https://doi.org/10.1145/3318464.3386145
Industry 4: Advanced Functionality
Thursday 10:30 AM – 12:00 PM
modin714 Confidentiality Support over Financial Grade Consortium Blockchain
Ying Yan (Ant Financial Services Group); Changzheng Wei (Ant Financial Services Group); Xuepeng Guo (Ant Financial Services Group); Xuming Lu (Ant Financial Services Group); Xiaofu Zheng (Ant Financial Services Group); Qi Liu (Ant Financial Services Group); Chenhui Zhou (Ant Financial Services Group); Xuyang Song (Ant Financial Services Group); Boran Zhao (Ant Financial Services Group); Hui Zhang (Ant Financial Services Group); Guofei Jiang (Ant Financial Services Group)
Confidentiality is an indispensable requirement in financial applications of blockchain technology, and supporting it alongside high performance and friendly programmability is technically challenging. In this paper, we present a system design called CONFIDE that supports on-chain confidentiality by leveraging a Trusted Execution Environment (TEE). CONFIDE’s secure data transmission and data encryption protocols, together with a highly efficient virtual machine running in the TEE, guarantee confidentiality over the life cycle of a transaction from end to end. CONFIDE proposes a secure data model along with an application-driven secure protocol to guarantee data confidentiality and integrity. Its smart contract language extension offers users the flexibility to define complex confidentiality models. CONFIDE is implemented as a plugin module to Antfin Blockchain’s proprietary platform and, with its universal interface design, can be plugged into other blockchain platforms as well. Today, CONFIDE supports millions of commercial transactions daily on consortium blockchains running financial applications including supply chain finance, ABS, commodity provenance, and cold-chain logistics. https://doi.org/10.1145/3318464.3386127
modin724 PASE: PostgreSQL Ultra-High Dimensional Approximate Nearest Neighbor Search Extension
Wen Yang (Ant Financial Services Group); Tao Li (Ant Financial Services Group); Gai Fang (Ant Financial Services Group); Hong Wei (Ant Financial Services Group)
Similarity search is widely used in many fields, particularly in the Alibaba ecosystem. Existing open-source solutions for vector similarity search support only single-vector queries, whereas real-life scenarios generally require processing compound queries. Moreover, existing open-source implementations only provide runtime libraries, which have difficulty meeting the requirements of industrial applications. To address these issues, we designed a novel scheme that extends the index types of PostgreSQL (PG) to support vector similarity search while retaining the high performance and strong reliability of PG. We present two representative nearest neighbor search (NNS) algorithms that achieve high performance and offer advantages such as support for composite queries and seamless integration with existing business data; other NNS algorithms can be easily implemented under the proposed framework. Experiments were conducted on large datasets to illustrate the efficiency of the proposed retrieval mechanism. https://doi.org/10.1145/3318464.3386131
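One classic NNS scheme that a PG index extension of this kind can implement is inverted-file (IVF) search: cluster the vectors once, then at query time probe only the lists of the closest centroids instead of scanning the whole table. The following is a pure-Python toy of that scheme, not PASE’s actual implementation; all names are invented:

```python
import math, random

def dist(a, b):
    return math.dist(a, b)          # Euclidean distance (Python 3.8+)

def build(vectors, centroids):
    # Assign each vector to its nearest centroid's inverted list.
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        lists[nearest].append(vid)
    return lists

def search(q, vectors, centroids, lists, nprobe=1, k=1):
    # Probe only the nprobe closest lists, then rank those candidates:
    # approximate, but scans a small fraction of the data.
    probe = sorted(range(len(centroids)), key=lambda i: dist(q, centroids[i]))[:nprobe]
    cands = [vid for i in probe for vid in lists[i]]
    return sorted(cands, key=lambda vid: dist(q, vectors[vid]))[:k]

random.seed(0)
vecs = [(random.random(), random.random()) for _ in range(1000)]
cents = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
lists = build(vecs, cents)
print(search((0.2, 0.3), vecs, cents, lists, nprobe=2, k=3))
```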
modin729 Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs
Fabio Maschi (ETH Zurich); Muhsen Owaida (ETH Zurich); Gustavo Alonso (ETH Zurich); Matteo Casalino (Amadeus); Anthony Hock-Koon (Amadeus)
Business Rule Management Systems (BRMSs) are widely used in industry for a variety of tasks. Their main advantage is to codify in a succinct and queryable manner vast amounts of constantly evolving logic. In BRMSs, rules are typically captured as facts (tuples) over a collection of criteria, and checking them involves querying the collection of rules to find the best match. In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights. The MCT module is part of the flight search engine, and captures the ever changing constraints at each airport that determine the time to allocate between an arriving and a departing flight for a connection to be feasible. We explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed. A key aspect of the solution is the transformation of a collection of rules into a Non-deterministic Finite-state Automaton (NFA) efficiently implemented on FPGA. Experiments performed on-premises and in the cloud show several orders of magnitude improvement over the existing solution, and the potential to reduce by 40% the number of machines needed for the flight search engine. https://doi.org/10.1145/3318464.3386133
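Reading the abstract, the core trick is that a rule collection over ordered criteria behaves like an automaton: matching walks the criteria once while tracking the set of still-viable rules (the states), and the most specific survivor wins. A toy software rendition with invented criteria and values; the FPGA implementation and Amadeus’s real MCT rules are far richer:

```python
# Each rule is a tuple over ordered criteria, with None as a wildcard,
# mapping to a minimum connection time in minutes. All values invented.
RULES = [
    # (airport, terminal_change, carrier) -> minutes
    (("JFK", None,  None), 90),
    (("JFK", True,  None), 120),
    (("JFK", True,  "XY"), 75),     # carrier XY has a fast transfer desk
]

def min_connection_time(query):
    states = list(RULES)                       # start with all rules viable
    for pos, value in enumerate(query):        # one pass over the criteria
        states = [r for r in states
                  if r[0][pos] is None or r[0][pos] == value]
    if not states:
        raise LookupError("no applicable rule")
    # Best match = most specific survivor (fewest wildcards).
    return min(states, key=lambda r: r[0].count(None))[1]

print(min_connection_time(("JFK", True, "XY")))   # 75
print(min_connection_time(("JFK", False, "ZZ")))  # 90 (only the wildcard rule survives)
```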
modin754 Spur: Mitigating Slow Instances in Large-Scale Streaming Pipelines
Ke Wang (Carnegie Mellon University); Avrilia Floratou (Microsoft); Ashvin Agrawal (Microsoft); Daniel Musgrave (Netflix)
Bing’s monetization pipeline is one of the largest and most critical streaming workloads deployed in Microsoft’s internal data lake. The pipeline runs 24/7 at a scale of 3500 YARN containers and is required to meet a Service Level Objective (SLO) of low tail latency. In this paper, we highlight some of the unique challenges imposed by this large scale of operation: other concurrent workloads sharing the cluster may cause random performance deterioration; unavailability of external dependencies may cause temporary stalls in the pipeline; and resource scarcity in the underlying resource manager may cause arbitrarily long delays or rejection of container allocation requests. Weathering these challenges requires specially tailored dynamic control policies that react to these issues as and when they arise. We focus on the problem of reducing latency in the tail, i.e., the 99th percentile (p99), by detecting and mitigating slow instances through speculative replication. We show that widely used approaches do not satisfactorily solve this issue at our scale. A conservative approach is hesitant to acquire additional resources, reacts too slowly to changes in the environment, and therefore achieves little improvement in p99 latency. On the other hand, an aggressive approach overwhelms the underlying resource manager with unnecessary resource requests and paradoxically worsens the p99 latency. Our proposed approach, Spur, is designed for this challenging environment. It combines aggressive detection of slow instances with smart pruning of false positives to achieve a far better trade-off between these conflicting objectives. Using only 0.5% additional resources (similar to the conservative approach), we demonstrate a 10%–38% improvement in tail latency compared to both the conservative and aggressive approaches. https://doi.org/10.1145/3318464.3386142
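Spur’s stated trade-off, aggressive detection with smart pruning of false positives, can be sketched as: flag any instance well above an expected baseline, but withhold replication when most instances are flagged at once, since that points to a shared cause (e.g. a stalled external dependency) that replicating individual instances cannot fix. All thresholds and numbers below are invented for illustration:

```python
def pick_replication_candidates(latencies, baseline,
                                slow_factor=1.5, max_frac=0.3):
    """latencies: per-instance latency snapshot; baseline: expected latency
    for this stage (e.g. its historical median)."""
    # Aggressive detection: flag anything well above the baseline.
    flagged = {i for i, l in latencies.items() if l > slow_factor * baseline}
    # Pruning: if "everything" is slow, the cause is shared, and speculative
    # replication would only flood the resource manager with requests.
    if len(flagged) > max_frac * len(latencies):
        return set()
    return flagged        # instance-specific stragglers: replicate these

healthy = {"i1": 10, "i2": 11, "i3": 10, "i4": 40, "i5": 9}
stalled = {"i1": 40, "i2": 42, "i3": 39, "i4": 41, "i5": 38}
print(pick_replication_candidates(healthy, baseline=10))  # {'i4'}
print(pick_replication_candidates(stalled, baseline=10))  # set()
```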
modin757 Entity Matching in the Wild: a Consistent and Versatile Framework to Unify Data in Industrial Applications
Yan Yan (Amperity, Inc.); Stephen Meyles (Amperity, Inc.); Aria Haghighi (Amperity, Inc.); Dan Suciu (University of Washington) Entity matching — the task of clustering duplicated database records to underlying entities — has become an increasingly critical component in modern data integration management. Amperity provides a platform for businesses to manage customer data that utilizes a machine-learning approach to entity matching, resolving billions of customer records on a daily basis. We face several challenges in deploying entity matching to industrial applications at scale, and they are less prominent in the literature. These challenges include: (1) Providing not just a single entity clustering, but supporting clusterings at multiple confidence levels to enable downstream applications with varying precision/recall trade-off needs. (2) Many customer record attributes may be systematically missing from different sources of data, creating many pairs of records in a cluster that appear to not match due to incomplete, rather than conflicting information. Allowing these records to connect transitively without introducing conflicts is invaluable to businesses because they can acquire a more comprehensive profile of their customers without incorrect entity merges. (3) How to cluster records over time and assign persistent cluster IDs that can be used for downstream use cases such as A/B tests or predictive model training; this is made more challenging by the fact that we receive new customer data every day and clusters naturally evolving over time still require persistent IDs that refer to the same entity. In this work, we describe Amperity’s entity matching framework, Fusion, and how its design provides solutions to these challenges. In particular, we describe our pairwise matching model based on ordinal regression that permits a well-defined way to produce entity clusterings at different confidence levels, a novel clustering algorithm that separates conflicting record pairs in clusters while allowing for pairs that may appear dissimilar due to missing data, and a persistent ID generation algorithm which balances stability of the identifier with ever-evolving entities. https://doi.org/10.1145/3318464.3386143