Book

Essential PySpark for Scalable Data Analytics

Essential PySpark for Scalable Data Analytics is an introduction for anyone new to the distributed computing model. You'll learn to unlock the analytics world by building end-to-end data processing pipelines, starting with data ingestion, cleansing, and integration, through to data visualization and building and operationalizing predictive models.

Offered by Packt

Difficulty Level

Intermediate

Completion Time

10h44m

Language

English

About Book

Who Is This Book For?

This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who already use data analytics and want to explore distributed and scalable data analytics. Basic to intermediate knowledge of data engineering, data science, and SQL analytics is expected. General proficiency in a programming language, especially Python, and working knowledge of data analytics using frameworks such as pandas and SQL will help you get the most out of this book.

Book content

14 chapters, 10h44m total length

Distributed Computing Primer

Data Ingestion

Data Cleansing and Integration

Real-time Data Analytics

Scalable Machine Learning with PySpark

Feature Engineering – Extraction, Transformation, and Selection

Supervised Machine Learning

Unsupervised Machine Learning

Machine Learning Life Cycle Management

Scaling Out Single-Node Machine Learning Using PySpark

Data Visualization with PySpark

Spark SQL Primer

Integrating External Tools with Spark SQL

The Data Lakehouse
