In this post, we will learn how to use the filter condition in PySpark with an example.
Sample program using filter condition
We will create a dataframe using the following sample program.
Then we filter the dataframe based on the marks column and store the result in another dataframe.
The required classes are imported at the beginning of the code.
import findspark
findspark.init()

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col

# a SparkSession is required so that toDF() can convert the RDD into a dataframe
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# creating a dataframe with three records
df = sc.parallelize([Row(name='Gokul', Class=10, marks=480, grade='A'),
                     Row(name='Usha', Class=12, marks=450, grade='A'),
                     Row(name='Rajesh', Class=12, marks=430, grade='B')]).toDF()
print("Printing df dataframe below")
df.show()

# filtering the records based on the marks column
df1 = df.filter(col("marks") == 480)
print("Printing df1 dataframe below")
df1.show()
Output
The following dataframes are created as the result of the above sample program.
Here the filter condition selects only the records with marks equal to 480 from the dataframe.
Printing df dataframe below
+-----+-----+-----+------+
|Class|grade|marks| name|
+-----+-----+-----+------+
| 10| A| 480| Gokul|
| 12| A| 450| Usha|
| 12| B| 430|Rajesh|
+-----+-----+-----+------+
Printing df1 dataframe below
+-----+-----+-----+-----+
|Class|grade|marks| name|
+-----+-----+-----+-----+
| 10| A| 480|Gokul|
+-----+-----+-----+-----+
Note that we must use the double equals operator (==) inside the filter condition when comparing a column expression, not a single equals sign.
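As a quick sketch, assuming the same df and imports as in the sample program above, the same filter can also be written in a few equivalent ways. When a SQL expression string is passed to filter, a single equals sign is the correct syntax, and where() is simply an alias of filter().
# equivalent ways of writing the same filter (assuming the df created above)
df.filter(df.marks == 480).show()        # column attribute with ==
df.filter("marks = 480").show()          # SQL expression string uses a single =
df.where(col("marks") == 480).show()     # where() is an alias of filter()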
I hope this gives everyone a good idea of how to use the filter condition in PySpark.
Related Articles
https://beginnersbug.com/where-condition-in-pyspark-with-example/