In this post, we will learn about the greatest() and least() functions in pyspark.
greatest() and least() in pyspark
Both functions compare several columns row by row: greatest() returns the largest value among the given columns, and least() returns the smallest.
Creating dataframe
The sample program below creates a dataframe that is used in the rest of this post.
For more detail on creating dataframes, please refer to the earlier post.
Sample program
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession
# Start a SparkContext and wrap it in a SparkSession so that toDF() is available on RDDs
sc = SparkContext()
spark = SparkSession(sc)
# Build the sample dataframe with three rows
df = sc.parallelize([
    Row(name='Gokul', Class=10, level1=480, level2=380, level3=280, level4=520, grade='A'),
    Row(name='Usha', Class=12, level1=670, level2=720, level3=870, level4=920, grade='A'),
    Row(name='Rajesh', Class=12, level1=180, level2=560, level3=660, level4=850, grade='B'),
]).toDF()
print("Printing the dataframe df below")
df.show()
Output
Printing the dataframe df below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    A|   670|   720|   870|   920|  Usha|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
greatest() in pyspark
To compare multiple columns row-wise, the greatest() and least() functions can be used.
In the program below, the four columns level1, level2, level3 and level4 are compared to find the largest value in each row.
Sample program
from pyspark.sql.functions import greatest,col
df1=df.withColumn("large",greatest(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df1 below")
df1.show()
The column large is populated with the greatest value among the four levels for each row.
Output
Printing the dataframe df1 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|large|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  520|
|   12|    A|   670|   720|   870|   920|  Usha|  920|
|   12|    B|   180|   560|   660|   850|Rajesh|  850|
+-----+-----+------+------+------+------+------+-----+
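To make the row-wise behaviour concrete, here is a minimal plain-Python sketch (not Spark code) using the same sample values. The `rows` list and the variable names are made up for illustration. It contrasts comparing across columns within each row, which is what greatest() does, with comparing down a single column, which is what an aggregate maximum does.

```python
# Sample values copied from the dataframe above: (name, level1, level2, level3, level4)
rows = [
    ("Gokul", 480, 380, 280, 520),
    ("Usha", 670, 720, 870, 920),
    ("Rajesh", 180, 560, 660, 850),
]

# Row-wise: one result per row, analogous to greatest(level1, ..., level4)
row_wise_largest = [max(levels) for _name, *levels in rows]

# Column-wise: one result for the whole column, analogous to an aggregate max on level1
level1_max = max(row[1] for row in rows)

print(row_wise_largest)  # [520, 920, 850]
print(level1_max)        # 670
```

The row-wise list matches the large column in the output above, while the column-wise value is a single number for the entire dataframe.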
least() in pyspark
The least() function returns the smallest value among the four levels for each row.
Sample program
from pyspark.sql.functions import least,col
df2=df.withColumn("Small",least(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df2 below")
df2.show()
Output
Printing the dataframe df2 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|Small|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  280|
|   12|    A|   670|   720|   870|   920|  Usha|  670|
|   12|    B|   180|   560|   660|   850|Rajesh|  180|
+-----+-----+------+------+------+------+------+-----+
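One detail worth knowing: according to the Spark documentation, both functions skip null values and return null only when every input is null. The plain-Python sketch below models that rule, with None standing in for null. The helpers `greatest_skip_nulls` and `least_skip_nulls` are hypothetical names used for illustration only and are not part of pyspark.

```python
def greatest_skip_nulls(*values):
    """Model of greatest(): ignore None, return None only if all inputs are None."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def least_skip_nulls(*values):
    """Model of least(): ignore None, return None only if all inputs are None."""
    present = [v for v in values if v is not None]
    return min(present) if present else None


# A row where one level is missing: the missing value is simply ignored
print(greatest_skip_nulls(480, None, 280, 520))  # 520
print(least_skip_nulls(480, None, 280, 520))     # 280

# A row where every level is missing: the result is None (null)
print(least_skip_nulls(None, None))              # None
```

In practice this means a missing score in one level does not turn the whole result into null; only a row with no scores at all does.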
Reference
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.least