In this post, we will look at how to use the where condition in PySpark, with an example.
Where condition in PySpark
The where condition in PySpark works in a similar manner to the WHERE clause in SQL.
In PySpark, where() is an alias of filter(), so both behave the same way. However, an ordinary equality comparison cannot match null values, because a comparison with null evaluates to null under SQL semantics. For nulls, PySpark provides the isNull() and isNotNull() column methods, which can be used inside where().
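For instance, an equality check against None silently matches nothing, while isNull() matches the null rows (a minimal sketch, assuming a dataframe df with a nullable Marks column; the names are for illustration only):
# Comparing a column with None produces (Marks = NULL), which is never
# true under SQL semantics, so this returns an empty result:
df.where(df["Marks"] == None).show()
# isNull() is the correct way to match null values:
df.where(df["Marks"].isNull()).show()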
Sample program in PySpark
In the below sample program, data1 is a list of dictionaries holding key-value pairs.
Using the createDataFrame method, data1 is converted to the dataframe df1 with rows and columns.
Here, we can use isNull() or isNotNull() to filter the null or non-null values.
from pyspark.sql import SparkSession

# Creating the SparkSession
spark = SparkSession.builder \
    .appName("Filtering Null records") \
    .getOrCreate()
# Creating a list of dictionaries
data1 = [{"Name": "Usha", "Class": 7, "Marks": 250},
         {"Name": "Rajesh", "Class": 5, "Marks": None}]

# Converting the list of dictionaries to a dataframe
df1 = spark.createDataFrame(data1)
df1.show()
# Filtering the null records
df2 = df1.where(df1["Marks"].isNull())
df2.show()

# Filtering the non-null records
df3 = df1.where(df1["Marks"].isNotNull())
df3.show()
Output
The dataframe df1 is created from the list of dictionaries with one null record and one non-null record using the above sample program.
The dataframe df2 contains only the null record, whereas the dataframe df3 contains only the non-null record.
Other than filtering null and non-null values, we can also use where() to filter on any particular value, as shown in the sketch below.
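For instance, a condition on an ordinary column value might look like this (a minimal sketch reusing the df1 created above; df4 is a hypothetical name for illustration):
# Filtering the rows where Class equals 7 (matches only the Usha record)
df4 = df1.where(df1["Class"] == 7)
df4.show()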
Printing dataframe df1
+-----+-----+------+
|Class|Marks| Name|
+-----+-----+------+
| 7| 250| Usha|
| 5| null|Rajesh|
+-----+-----+------+
Printing dataframe df2
+-----+-----+------+
|Class|Marks| Name|
+-----+-----+------+
| 5| null|Rajesh|
+-----+-----+------+
Printing dataframe df3
+-----+-----+----+
|Class|Marks|Name|
+-----+-----+----+
| 7| 250|Usha|
+-----+-----+----+
Reference
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
Related Articles
Please refer to the below link to understand the filter condition in PySpark with an example.
https://beginnersbug.com/how-to-use-filter-condition-in-pyspark/