
Transformation and action in pyspark

In this post, let us learn about transformations and actions in PySpark.

Transformation

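A transformation is an operation that builds a new RDD from an existing RDD.

RDDs are immutable, so the existing RDD is left unchanged. Transformations are also lazy: calling one only records the step, and nothing is computed yet.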

Types of transformation

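Narrow transformation :

map, filter, flatMap, sample, union, coalesce (without shuffle), pipe, cartesian

These do not shuffle data between partitions.

Wide transformation :

groupByKey, reduceByKey, aggregateByKey, sortByKey, distinct, intersection, join (when the inputs are not already co-partitioned), repartition

These shuffle data between partitions, because records with the same key may start out in different partitions.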
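The difference between the two kinds can be sketched in plain Python, with partitions modelled as lists. This is only a toy model and not how Spark is implemented; the helper names narrow_map and wide_reduce_by_key and the crc32-based routing are assumptions made for illustration.

```python
from functools import reduce
from zlib import crc32

def narrow_map(parts, fn):
    # Narrow: each output partition is built from exactly one input partition
    return [[fn(rec) for rec in part] for part in parts]

def wide_reduce_by_key(parts, fn, num_out=2):
    # Wide: records are first routed by key (the "shuffle"), so one output
    # partition may receive records from every input partition
    buckets = [{} for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = buckets[crc32(key.encode()) % num_out]
            target.setdefault(key, []).append(value)
    # After routing, all values of a key sit together and can be combined
    return [[(k, reduce(fn, vs)) for k, vs in sorted(b.items())] for b in buckets]

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
summed = wide_reduce_by_key(partitions, lambda x, y: x + y)

print(doubled)
print(summed)
```

In the narrow case the key "a" stays in both partitions. In the wide case both "a" records have to be brought into the same partition before they can be added.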

What is action ?

Applying a transformation does not run anything. Spark only records the step in a DAG (Directed Acyclic Graph), and the graph grows as further transformations are chained.

The recorded operations execute only when an action is called. An action returns a result to the driver or writes data to storage.

Types of action

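reduce, collect, take, count, first, countByKey, foreach, saveAsTextFile, saveAsSequenceFile, saveAsPickleFile, takeOrdered, takeSample

Note: head is a DataFrame method rather than an RDD action, and saveAsObjectFile belongs to the Scala and Java RDD API. The PySpark counterpart is saveAsPickleFile.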

Sample program

The following program keeps only the elements of an RDD that are less than 3.

The filter step is only recorded at first. It executes when the collect function is called.

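from pyspark.sql import SparkSession

# Create (or reuse) a session and get its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# parallelize and filter are lazy: nothing runs yet
rdd1 = sc.parallelize([1, 2, 3, 4])
rdd1_first = rdd1.filter(lambda x: x < 3)

# collect is an action, so the job executes here
rdd1_first.collect()
[1, 2]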

https://beginnersbug.com/rank-and-dense-rank-in-pyspark-dataframe/

https://beginnersbug.com/window-function-in-pyspark-with-example/