
Title: The Parquet Format and Performance Optimization Opportunities - Boudewijn Braams (Databricks)
Duration: 40:45
Viewed: 0
Published: 21-10-2019
Source: YouTube

The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.

As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into the specifics of the Parquet format: representation on disk, physical data organization (row groups, column chunks and pages) and encoding schemes.

Equipped with this background, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is 'many small files', and will discuss the open-source Delta Lake format in relation to this and to Parquet in general.

This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format to speed up analytical workloads in Spark, using tangible tips and tricks.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
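Two of the techniques named in the abstract, dictionary encoding and min/max skipping, can be illustrated with a toy model. The sketch below is not taken from the talk and is not how Parquet or Spark implement these features: `dictionary_encode`, `RowGroup` and `scan_greater_than` are hypothetical names invented for illustration, and real Parquet reads precomputed statistics from file metadata instead of recomputing them from the values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an index into a list of distinct values."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group holding one integer column."""
    values: List[int]

    @property
    def max_value(self) -> int:
        # Assumption: computed on the fly here; a real file stores this statistic.
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and how many row groups were skipped."""
    matches: List[int] = []
    skipped = 0
    for group in row_groups:
        # If even the largest value fails the predicate, nothing in the group can match.
        if group.max_value <= threshold:
            skipped += 1
            continue
        matches.extend(v for v in group.values if v > threshold)
    return matches, skipped


# Dictionary encoding: repeated strings become small integers.
countries = ["NL", "US", "NL", "NL", "US", "DE"]
dictionary, indices = dictionary_encode(countries)

# Min/max skipping: the same nine values, laid out sorted and unsorted.
sorted_groups = [RowGroup([1, 2, 3]), RowGroup([4, 5, 6]), RowGroup([7, 8, 9])]
unsorted_groups = [RowGroup([1, 5, 9]), RowGroup([2, 6, 7]), RowGroup([3, 4, 8])]

sorted_result = scan_greater_than(sorted_groups, 6)
unsorted_result = scan_greater_than(unsorted_groups, 6)

print(dictionary, indices)
print(sorted_result)
print(unsorted_result)
```

In this toy layout the sorted data lets the scan skip two of three row groups, while the unsorted data skips none, because every unsorted group's value range overlaps the predicate. That is the general reason data ordering matters for statistics-based skipping.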

RELATED VIDEOS
What is Apache Parquet file?
08:02 | 38,565
Discover the Data Lakehouse
24:47 | 5,495
Delta Lake on Databricks Demo
08:59 | 56,620
Making Apache Spark™ Better with Delta Lake
58:10 | 127,998
This INCREDIBLE trick will speed up your data processes.
12:54 | 44,774
Parquet file, Avro file, RC, ORC file formats in Hadoop | Different file formats in Hadoop
08:44 | 46,289
The columnar roadmap: Apache Parquet and Apache Arrow
41:39 | 26,011
Tuning Apache Spark for Large Scale Workloads - Sital Kedia & Gaoxiang Liu
32:41 | 41,101
