Categories
pyspark

Outer join in pyspark dataframe with example

In this post , we will learn about outer join in pyspark dataframe with example .

If you want to learn Inner join refer below URL

There are other types of joins like inner join , left-anti join and left semi join

What you will learn

At the end of this tutorial, you will learn Outer join in pyspark dataframe with example

Types of outer join

Types of outer join in pyspark dataframe are as follows :

  • Right outer join / Right join 
  • Left outer join / Left join
  • Full outer join /Outer join / Full join 
Sample program for creating two dataframes

We will start with the creation of two dataframes before moving into the topic of outer join in pyspark dataframe .

#Creating dictionaries
dict=[{"Emp_id" : 123 , "Emp_name" : "Raja" },{"Emp_id" : 234 , "Emp_name" : "Sindu"},{"Emp_id" : 456 , "Emp_name" : "Ravi"}]
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer" } , {"Emp_id" : 456 ,"Dep_name"  :"Economy"} , {"Emp_id" : 789 , "Dep_name" : "History"}]
# Creating RDDs from the above dictionaries using parallelize method
rdd=sc.parallelize(dict)
rdd1=sc.parallelize(dict1)
# Converting RDDs to dataframes 
df=rdd.toDF()
df1=rdd1.toDF()
print("Printing the first dataframe")
df.show()
print("Printing the second dataframe")
df1.show()
Printing the first dataframe
+------+--------+
|Emp_id|Emp_name|
+------+--------+
|   123|    Raja|
|   234|   Sindu|
|   456|    Ravi|
+------+--------+
Printing the second dataframe
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer|   123|
| Economy|   456|
| History|   789|
+--------+------+
What is Right outer join ?

The Right outer join helps us to get the entire records from the right dataframe along with the matching records from the left dataframe .

And will be populated with null for the remaining unmatched columns of the left dataframe.

Sample program – Right outer join / Right join

Within the join syntax , the type of join to be performed will be mentioned as right_outer or right .

As Emp_name for Emp_id : 789 is not available in the left dataframe , it is populated with null in the following result .

# Right outer join / Right join 
df2=df.join(df1,['Emp_id'], how = 'right_outer')
print("Printing the result of right outer / right join")
df2.show()
# Printing the result of right outer / right join 
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
|   789|    null| History|
|   123|    Raja|Computer|
|   456|    Ravi| Economy|
+------+--------+--------+
What is Left outer join ?

This join is used to retrieve all the records from the left dataframe with its matching records from right dataframe .

The type of join is mentioned in either way as Left outer join or left join .

Sample program – Left outer join / Left join

In the below example , For the Emp_id : 234 , Dep_name is populated with null as there is no record for this Emp_id in the right dataframe .

# Left outer join / Left join <br />df3=df.join(df1,['Emp_id'], how = 'left_outer')
Print("Printing the result of Left outer join / Left join") 
 df3.show()
Printing the result of Left outer join / Left join
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
|   234|   Sindu|    null|
|   123|    Raja|Computer|
|   456|    Ravi| Economy|
+------+--------+--------+
What is Full outer join ?

Full outer join generate the result with all the records from both the dataframes . Null will populate in the columns for the unmatched records  .

Sample program – Full outer join / Full join / Outer join

All the Emp_ids from both the dataframes combined in this case with null population for unavailable values .

# Full outer join / Full join / Outer join
df4=df.join(df1,['Emp_id'], how = 'Full_outer')
print(Printing the result of Full outer join")
df4.show()
Printing the result of Full outer join
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
|   789|    null| History|
|   234|   Sindu|    null|
|   123|    Raja|Computer|
|   456|    Ravi| Economy|
+------+--------+--------+
Reference

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html?highlight=outer%20join