In this post , we will learn about outer join in pyspark dataframe with example .
If you want to learn Inner join refer below URL
There are other types of joins like inner join , left-anti join and left semi join
What you will learn
At the end of this tutorial, you will learn Outer join in pyspark dataframe with example
Types of outer join
Types of outer join in pyspark dataframe are as follows :
- Right outer join / Right join
- Left outer join / Left join
- Full outer join /Outer join / Full join
Sample program for creating two dataframes
We will start with the creation of two dataframes before moving into the topic of outer join in pyspark dataframe .
#Creating dictionaries
dict=[{"Emp_id" : 123 , "Emp_name" : "Raja" },{"Emp_id" : 234 , "Emp_name" : "Sindu"},{"Emp_id" : 456 , "Emp_name" : "Ravi"}]
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer" } , {"Emp_id" : 456 ,"Dep_name" :"Economy"} , {"Emp_id" : 789 , "Dep_name" : "History"}]
# Creating RDDs from the above dictionaries using parallelize method
rdd=sc.parallelize(dict)
rdd1=sc.parallelize(dict1)
# Converting RDDs to dataframes
df=rdd.toDF()
df1=rdd1.toDF()
print("Printing the first dataframe")
df.show()
print("Printing the second dataframe")
df1.show()
Printing the first dataframe
+------+--------+
|Emp_id|Emp_name|
+------+--------+
| 123| Raja|
| 234| Sindu|
| 456| Ravi|
+------+--------+
Printing the second dataframe
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer| 123|
| Economy| 456|
| History| 789|
+--------+------+
What is Right outer join ?
The Right outer join helps us to get the entire records from the right dataframe along with the matching records from the left dataframe .
And will be populated with null for the remaining unmatched columns of the left dataframe.
Sample program – Right outer join / Right join
Within the join syntax , the type of join to be performed will be mentioned as right_outer or right .
As Emp_name for Emp_id : 789 is not available in the left dataframe , it is populated with null in the following result .
# Right outer join / Right join
df2=df.join(df1,['Emp_id'], how = 'right_outer')
print("Printing the result of right outer / right join")
df2.show()
# Printing the result of right outer / right join
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
| 789| null| History|
| 123| Raja|Computer|
| 456| Ravi| Economy|
+------+--------+--------+
What is Left outer join ?
This join is used to retrieve all the records from the left dataframe with its matching records from right dataframe .
The type of join is mentioned in either way as Left outer join or left join .
Sample program – Left outer join / Left join
In the below example , For the Emp_id : 234 , Dep_name is populated with null as there is no record for this Emp_id in the right dataframe .
# Left outer join / Left join <br />df3=df.join(df1,['Emp_id'], how = 'left_outer')
Print("Printing the result of Left outer join / Left join")
df3.show()
Printing the result of Left outer join / Left join
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
| 234| Sindu| null|
| 123| Raja|Computer|
| 456| Ravi| Economy|
+------+--------+--------+
What is Full outer join ?
Full outer join generate the result with all the records from both the dataframes . Null will populate in the columns for the unmatched records .
Sample program – Full outer join / Full join / Outer join
All the Emp_ids from both the dataframes combined in this case with null population for unavailable values .
# Full outer join / Full join / Outer join
df4=df.join(df1,['Emp_id'], how = 'Full_outer')
print(Printing the result of Full outer join")
df4.show()
Printing the result of Full outer join
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
| 789| null| History|
| 234| Sindu| null|
| 123| Raja|Computer|
| 456| Ravi| Economy|
+------+--------+--------+