Learning PySpark
In this tutorial, you will learn about different techniques for collecting data and how to distinguish between the main approaches to processing it. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We provide examples showing how to read data from files and from HDFS, and how to specify DataFrame schemas either by reflection or programmatically. We describe the concept of lazy execution and outline the transformations and actions specific to RDDs and DataFrames. Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will know how to process data using Spark DataFrames and how to apply data collection techniques in a distributed data processing setting.
Difficulty Level
Intermediate
Completion Time
2h 28m
Language
English
Course content
lessons • 2h 28m total length