greatest() and least() in pyspark

In this post, we will learn how to use the functions greatest() and least() in pyspark.

The functions greatest() and least() both compare several columns row by row: greatest() returns the largest value among them, and least() returns the smallest.
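As a rough mental model, the per-row behaviour can be sketched in plain Python. This is not Spark code, and the helper names row_greatest and row_least are hypothetical; the sketch only mirrors the documented behaviour that both functions skip null values and return null when every input is null.

```python
from typing import Optional, Sequence


def row_greatest(values: Sequence[Optional[int]]) -> Optional[int]:
    """Largest non-null value in one row, or None if every value is null."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def row_least(values: Sequence[Optional[int]]) -> Optional[int]:
    """Smallest non-null value in one row, or None if every value is null."""
    present = [v for v in values if v is not None]
    return min(present) if present else None


# One row with four scores (illustrative values)
row = [480, 380, 280, 520]
print(row_greatest(row), row_least(row))

# Nulls are skipped rather than propagated
print(row_greatest([None, 380, None, 520]))

# Only an all-null row yields null
print(row_least([None, None, None, None]))
```

Spark applies the same idea to every row of the dataframe, with each value taken from a different column.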

Creating dataframe

The sample program below creates a dataframe that is used in the rest of this post.

To understand dataframe creation better, please refer to the earlier post.

Sample program
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

sc = SparkContext()
spark = SparkSession(sc)

# Spark versions before 3.0 sort Row keyword fields alphabetically, which gives the column order shown in the output
df = sc.parallelize([
    Row(name='Gokul', Class=10, level1=480, level2=380, level3=280, level4=520, grade='A'),
    Row(name='Usha', Class=12, level1=670, level2=720, level3=870, level4=920, grade='A'),
    Row(name='Rajesh', Class=12, level1=180, level2=560, level3=660, level4=850, grade='B'),
]).toDF()
print("Printing the dataframe df below")
df.show()  # prints the table shown in the output
Output
Printing the dataframe df below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    A|   670|   720|   870|   920|  Usha|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
greatest() in pyspark

To compare multiple columns row-wise, the greatest() and least() functions can be used.

In the program below, the four columns level1, level2, level3 and level4 are compared to find the largest value in each row.

Sample program
from pyspark.sql.functions import greatest,col
df1=df.withColumn("large",greatest(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df1 below")
df1.show()

The column large is populated with the greatest value among the four levels for each row.

Output
Printing the dataframe df1 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|large|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  520|
|   12|    A|   670|   720|   870|   920|  Usha|  920|
|   12|    B|   180|   560|   660|   850|Rajesh|  850|
+-----+-----+------+------+------+------+------+-----+
least() in pyspark

The least() function returns the smallest value among the four levels for each row.

Sample program
from pyspark.sql.functions import least,col
df2=df.withColumn("Small",least(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df2 below")
df2.show()
Output
Printing the dataframe df2 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|Small|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  280|
|   12|    A|   670|   720|   870|   920|  Usha|  670|
|   12|    B|   180|   560|   660|   850|Rajesh|  180|
+-----+-----+------+------+------+------+------+-----+
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.least
