Keynotes

To meet with colleagues, chat, and hold discussions, SIGMOD provides a virtual conference site in Gather.

SIGMOD Keynote 1
Tuesday 8:00 AM – 10:00 AM
Systems and ML: When the Sum is Greater than Its Parts
Ion Stoica (UC Berkeley)

SIGMOD Keynote 2: Grand Challenges in Exploring, Understanding, and Searching a Billion Datasets
Wednesday 8:00 AM – 10:00 AM
When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web
Natasha Noy (Google Research)
There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. In this talk, I will discuss our work on Dataset Search, which provides search capabilities over potentially all dataset repositories on the Web. I will talk about the open ecosystem for describing and citing datasets that we hope to encourage and the technical details on how we went about building Dataset Search. Finally, I will highlight research challenges in building a vibrant, heterogeneous, and open ecosystem where data becomes a first-class citizen.
The Challenge of Building Effective, Enterprise-scale Data Lakes
Awez Syed (Databricks)
There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn’t meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don’t scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.