PySpark — A year-long journey into it! Aimed at beginners.

Harshita Singh
4 min readFeb 14, 2021


During my engineering degree, I always assumed that, like most fresh graduates, I would become a Software Engineer (or SDE), i.e. working on frontend/backend technologies and hustling each day to learn more and more.
And to my surprise, here I am on a new and challenging side of engineering, something I knew very little about until late 2019.
But trust me, I am totally loving it :)

Finally, after hustling for a year to understand the concepts deeply, I thought of writing them down to share my knowledge, especially with beginners.

So this article is for all the beginners who are hustling like me. Let's begin!!

This article will help you understand the following —

  1. How growing data led to the creation of GFS
  2. How Hadoop came into existence
  3. How Spark came into existence
  4. How Spark is taking over Hadoop

Let’s begin!

Back in the late 90s, technology was nowhere near as developed as it is today. Very few people across the globe were aware of the internet (WWW), which meant that data was sparse.

In 2003, Google published the GFS (Google File System) paper, describing a large-scale distributed file system. It was designed to handle Google’s growing data-processing needs (especially those of its search engine). It is based on a master-slave architecture.

Later, Hadoop, a framework for large-scale distributed processing of datasets, came into existence (it became a top-level Apache project in 2008). It was inspired by Google’s GFS and MapReduce papers and follows a similar distributed architecture. It has the following major modules —

  1. Hadoop Common
  2. HDFS — Hadoop Distributed File System
  3. YARN — Yet Another Resource Negotiator
  4. MapReduce (the most important — a toy sketch of the idea follows right below)
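
To get a feel for the MapReduce model, here is a toy, purely illustrative sketch in plain Python (this is not Hadoop code; the data and names are made up): the map phase emits (word, 1) pairs and the reduce phase sums the counts per word.

    # Toy illustration of the MapReduce idea in plain Python (not actual Hadoop code)
    from collections import defaultdict

    lines = ["spark is fast", "hadoop uses mapreduce", "spark uses rdds"]

    # Map phase: emit a (word, 1) pair for every word in every line
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle + Reduce phase: group by key (the word) and sum the counts
    counts = defaultdict(int)
    for word, one in mapped:
        counts[word] += one

    print(dict(counts))  # e.g. {'spark': 2, 'is': 1, 'fast': 1, ...}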

So, how did Spark come into the picture?

Apache Spark — a unified analytics engine for large-scale data processing. It supports high-level APIs in multiple languages, including Python, Scala, and Java.
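
As a minimal sketch (assuming PySpark is installed; the application and column names are just examples), this is roughly what using the Python API looks like:

    # Minimal PySpark sketch: create a session, build a small DataFrame, run a query
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("HelloSpark").getOrCreate()

    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45)],
        ["name", "age"],
    )

    df.filter(df.age > 40).show()  # prints the rows where age > 40

    spark.stop()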

Let’s go!!

Spark’s architecture has two main types of nodes (running on commodity hardware) —

  1. Master Node (Driver) — The heart of the application; it maintains all the essential information throughout the lifecycle of an application. It is mainly responsible for the following activities:

a. Responding to the user’s program.

b. Analyzing, scheduling, and distributing work across the workers/executors.

  2. Slave Node (Worker/Executor) — Workers are responsible for running the tasks assigned to them by the master node and reporting their state back to it. Activities involved:

a. Performing all the data processing.

b. Reading data from and writing data to external sources (HDFS, Blob storage, S3, etc.).

c. Storing the computation results in memory, in a cache, or on disk.
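
As a rough illustration of how these driver and executor resources are requested (the values below are placeholders, not recommendations), Spark exposes them as configuration properties; in a real cluster they are usually passed via spark-submit:

    # Illustrative only: the resource values are placeholders, not recommendations
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ResourceConfigDemo")
        .config("spark.executor.memory", "4g")     # memory per executor (worker side)
        .config("spark.executor.cores", "2")       # CPU cores per executor
        .config("spark.executor.instances", "3")   # number of executors to request
        # driver-side settings (e.g. spark.driver.memory) are typically passed
        # through spark-submit flags rather than set inside the application
        .getOrCreate()
    )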

Reasons for Spark taking over the world of Hadoop

  1. In-memory computation — Spark processes data in memory, i.e. it stores (or caches) intermediate data in RAM instead of writing it to disk, making it significantly faster (often cited as 10x or more) than disk-based MapReduce.
  2. Lazy Evaluation — Spark evaluates data processing lazily: transformations are only executed once an action has been called on the data (see the sketch after this list). Read more about transformations and actions here.
  3. RDDs (Resilient Distributed Datasets) — An immutable, distributed data structure in Spark. It is fault tolerant because it maintains its lineage (the dependency graph between existing and new RDDs), hence the name ‘resilient’.
  4. ML Libraries — Spark also ships with machine learning (MLlib) and streaming libraries.
  5. Real-time Processing — Unlike Hadoop, Spark supports both batch and real-time processing.
  6. Support for Languages — Python, Java, Scala
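
Here is a small PySpark sketch tying lazy evaluation, in-memory caching, and RDDs together (the numbers are just an example): the transformations only record a lineage, and nothing runs until an action such as count() is called.

    # Lazy evaluation sketch: transformations build a lineage, an action triggers it
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(1, 1001))

    # Transformations: nothing is executed yet, Spark only records the lineage
    evens = rdd.filter(lambda x: x % 2 == 0)
    squared = evens.map(lambda x: x * x)
    squared.cache()  # ask Spark to keep this RDD in memory once it is computed

    # Actions: this is the point where the whole chain actually runs
    print(squared.count())  # 500
    print(squared.take(3))  # [4, 16, 36]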

Directed Acyclic Graph — DAG

Coming to the end of this article with one of the important topics — the Directed Acyclic Graph. As mentioned under lazy evaluation —

  1. The DAG consists of the different stages an execution cycle goes through.

  2. Transformations are only executed once an action has been called on the dataset.

  3. As soon as the action is called, Spark creates a DAG and submits it to the DAG Scheduler internally.

  4. The DAG Scheduler then divides the graph into stages, which are further broken down into tasks.

[Image: DAG for a simple count action on the dataset]

The DAG image is taken from the Databricks platform, where a count action was called on the dataset.

Note: A DAG can never have a bi-directional (cyclic) flow, otherwise execution would keep running infinitely.
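
To see this yourself, here is a small sketch (names and numbers are just examples; the exact plan output depends on your Spark version): explain() prints the plan Spark has built, and the count() action makes Spark turn that plan into a DAG of stages and tasks and execute it.

    # Sketch: build a plan with a transformation, inspect it, then trigger it with an action
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DagDemo").getOrCreate()

    df = spark.range(1_000_000)          # DataFrame with a single 'id' column
    filtered = df.filter("id % 2 = 0")   # transformation: nothing runs yet

    filtered.explain()                   # prints the plan Spark will execute
    print(filtered.count())              # action: the DAG is built, split into stages/tasks, and run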

This article is just an overview of Spark, kept short on purpose for beginners. For details, you can refer to the resources linked inside.

Please give a clap if the article helped you in any way. :)

Happy Reading !! :)
