In this post, We will learn about Left-anti and Left-semi join in pyspark dataframe with examples.
Sample program for creating dataframes
Let us start with the creation of two dataframes . After that we will move into the concept of Left-anti and Left-semi join in pyspark dataframe.
# Creating two dictionaries with Employee and Department details
dict=[{"Emp_id" : 123 , "Emp_name" : "Raja" },{"Emp_id" : 234 , "Emp_name" : "Sindu"},{"Emp_id" : 456 , "Emp_name" : "Ravi"}]
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer" } , {"Emp_id" : 456 ,"Dep_name" :"Economy"} , {"Emp_id" : 789 , "Dep_name" : "History"}]
# Creating RDDs from the above dictionaries using parallelize method
rdd=sc.parallelize(dict)
rdd1=sc.parallelize(dict1)
# Converting RDDs to dataframes
df=rdd.toDF()
df1=rdd1.toDF()
print("Printing the first dataframe")
df.show()
print("Printing the second dataframe")
df1.show()
Printing the first dataframe
+------+--------+
|Emp_id|Emp_name|
+------+--------+
| 123| Raja|
| 234| Sindu|
| 456| Ravi|
+------+--------+
Printing the second dataframe
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer| 123|
| Economy| 456|
| History| 789|
+--------+------+
What is Left-anti join ?
In order to return only the records available in the left dataframe . For those does not have the matching records in the right dataframe, We can use this join.
We could even see in the below sample program . Only the columns from the left dataframe will be available in Left-anti and Left-semi . And not all the columns from both the dataframes as in other types of joins.
Sample program – Left-anti join
Emp_id: 234 is only available in the left dataframe and not in the right dataframe.
# Left-anti join between the two dataframes df and df1 based on the column Emp_id
df2=df.join(df1,['Emp_id'], how = 'left_anti')
print("Printing the result of left-anti below")
df2.show()
Printing the result of left-anti below
+------+--------+
|Emp_id|Emp_name|
+------+--------+
| 234| Sindu|
+------+--------+
What is Left-semi join?
The common factors between the two dataframes is listed down in this join.
In the below sample program, two Emp_ids -123,456 are available in both the dataframes and so they picked up here.
Sample program – Left-semi join
# Left-semi join between two dataframes df and df1
df3=df.join(df1,['Emp_id'], how = 'left_semi')
print("Printing the result of left-semi below")
df3.show()
Printing the result of left-semi below
+------+--------+
|Emp_id|Emp_name|
+------+--------+
| 123| Raja|
| 456| Ravi|
+------+--------+
Other types of join are outer join and inner join in pyspark