Categories
pyspark

Left-anti and Left-semi join in pyspark

In this post, We will learn about Left-anti and Left-semi join in pyspark dataframe with examples.

Sample program for creating dataframes

Let us start with the creation of two dataframes . After that we will move into the concept of Left-anti and Left-semi join in pyspark dataframe.

# Creating two dictionaries with Employee and Department details
dict=[{"Emp_id" : 123 , "Emp_name" : "Raja" },{"Emp_id" : 234 , "Emp_name" : "Sindu"},{"Emp_id" : 456 , "Emp_name" : "Ravi"}]
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer" } , {"Emp_id" : 456 ,"Dep_name"  :"Economy"} , {"Emp_id" : 789 , "Dep_name" : "History"}]
# Creating RDDs from the above dictionaries using parallelize method
rdd=sc.parallelize(dict)
rdd1=sc.parallelize(dict1)
# Converting RDDs to dataframes 
df=rdd.toDF()
df1=rdd1.toDF()
print("Printing the first dataframe")
df.show()
print("Printing the second dataframe")
df1.show()
Printing the first dataframe
+------+--------+
|Emp_id|Emp_name|
+------+--------+
|   123|    Raja|
|   234|   Sindu|
|   456|    Ravi|
+------+--------+
Printing the second dataframe
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer|   123|
| Economy|   456|
| History|   789|
+--------+------+
What is Left-anti join ?

In order to return only the records available in the left dataframe . For those does not have the matching records in the right dataframe, We can use this join.

We could even see in the below sample program . Only the columns from the left dataframe will be available in Left-anti and Left-semi . And not all  the columns from both the dataframes as in other types of joins.

Sample program – Left-anti join

Emp_id: 234 is only available in the left dataframe and not in the right dataframe.

# Left-anti join between the two dataframes df and df1 based on the column Emp_id
df2=df.join(df1,['Emp_id'], how = 'left_anti')
print("Printing the result of left-anti below")
df2.show()
Printing the result of left-anti below
+------+--------+
|Emp_id|Emp_name|
+------+--------+
|   234|   Sindu|
+------+--------+
What is Left-semi join?

The common factors between the two dataframes is listed down in this join.

In the below sample program, two Emp_ids -123,456 are available in both the dataframes and so they picked up here.

Sample program – Left-semi join
# Left-semi join between two dataframes df and df1
df3=df.join(df1,['Emp_id'], how = 'left_semi')
print("Printing the result of left-semi below")
df3.show()
Printing the result of left-semi below
+------+--------+
|Emp_id|Emp_name|
+------+--------+
|   123|    Raja|
|   456|    Ravi|
+------+--------+

Other types of join are outer join  and inner join in pyspark 

Reference

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join#pyspark.sql.DataFrame.join