Essential PySpark for Scalable Data Analytics
Essential PySpark for Scalable Data Analytics is an introduction for anyone new to the distributed computing model. You'll learn to build end-to-end data processing pipelines, starting with data ingestion, cleansing, and integration, and progressing through data visualization to building and operationalizing predictive models.
Difficulty Level
Intermediate
Completion Time
10h 44m
Language
English
About Book
Who Is This Book For?
This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who want to move from single-machine analytics to distributed, scalable data analytics. Basic to intermediate knowledge of data engineering, data science, and SQL analytics is expected. General proficiency in a programming language, especially Python, and working knowledge of performing data analytics with tools such as pandas and SQL will help you get the most out of this book.
Book Content
14 chapters • 10h 44m total length
Distributed Computing Primer
Data Ingestion
Data Cleansing and Integration
Real-time Data Analytics
Scalable Machine Learning with PySpark
Feature Engineering – Extraction, Transformation, and Selection
Supervised Machine Learning
Unsupervised Machine Learning
Machine Learning Life Cycle Management
Scaling Out Single-Node Machine Learning Using PySpark
Data Visualization with PySpark
Spark SQL Primer
Integrating External Tools with Spark SQL
The Data Lakehouse