In this post, we will learn about the inner join in a pyspark dataframe with an example.
Types of join in pyspark dataframe
Before proceeding with the post, we will get familiar with the types of join available in a pyspark dataframe.
Types of join: inner join, cross join, outer join, full join, full_outer join, left join, left_outer join, right join, right_outer join, left_semi join, and left_anti join
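Each of these join types is selected by passing its name as the how argument of the join method. As a minimal sketch, assuming the df and df1 dataframes that are built later in this post:
# Sketch: the join type name is passed as the `how` argument
# (assumes the dataframes df and df1 created later in this post)
df.join(df1, ['Emp_id'], how='inner')       # only the matching Emp_id rows
df.join(df1, ['Emp_id'], how='left_outer')  # all rows of df, matched where possible
df.join(df1, ['Emp_id'], how='left_anti')   # rows of df with no match in df1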
What is an inner join?
As in SQL, an inner join returns only the matching records between two datasets. To understand it better, we will create two dataframes with the following piece of code.
Sample program for creating two dataframes
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Inner Join").getOrCreate()
sc = spark.sparkContext
# Creating a list of dictionaries with the keys Emp_id and Emp_name
data = [{"Emp_id": 123, "Emp_name": "Raja"}, {"Emp_id": 456, "Emp_name": "Ravi"}]
# Creating an RDD from the above list using the parallelize method
rdd = sc.parallelize(data)
# Converting the RDD to a dataframe
df = rdd.toDF()
print("Printing the first dataframe df")
df.show()
Printing the first dataframe df
+------+--------+
|Emp_id|Emp_name|
+------+--------+
| 123| Raja|
| 456| Ravi|
+------+--------+
# Creating a list of dictionaries with the keys Emp_id and Dep_name
data1 = [{"Emp_id": 123, "Dep_name": "Computer"}, {"Emp_id": 456, "Dep_name": "Economy"}, {"Emp_id": 789, "Dep_name": "History"}]
# Creating an RDD from the above list using the parallelize method
rdd1 = sc.parallelize(data1)
# Converting the RDD to a dataframe
df1 = rdd1.toDF()
print("Printing the second dataframe df1")
df1.show()
Printing the second dataframe df1
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer| 123|
| Economy| 456|
| History| 789|
+--------+------+
How to do an inner join?
The join method takes three parameters:
1) The dataframe to be joined with
2) The column (or columns) to join on
3) The type of join to be performed
By default, an inner join is performed if no value is passed for the third parameter.
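The second parameter can also be given as a column expression instead of a column name. A minimal sketch of that form (note that, unlike the column-name form, it keeps the Emp_id column from both dataframes):
# Sketch: join condition written as a column expression
df.join(df1, df.Emp_id == df1.Emp_id, 'inner')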
First method
Let us see the first method of performing an inner join on a pyspark dataframe with an example.
# Inner joining the two dataframes df and df1 based on the column Emp_id
df2 = df.join(df1, ['Emp_id'], how='inner')
print("Printing the dataframe df2")
df2.show()
Printing the dataframe df2
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
| 123| Raja|Computer|
| 456| Ravi| Economy|
+------+--------+--------+
Second method
# Inner joining the two dataframes df and df1 based on the column Emp_id with the default join type, i.e. inner join
df3 = df.join(df1, ['Emp_id'])
print("Printing the dataframe df3")
df3.show()
Printing the dataframe df3
+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
| 123| Raja|Computer|
| 456| Ravi| Economy|
+------+--------+--------+
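Since the behaviour matches SQL, the same result can also be produced with a plain SQL query. A minimal sketch using temporary views (the view names emp and dep and the dataframe df4 are illustrative, not part of the examples above):
# Registering the dataframes as temporary views so they can be queried with SQL
df.createOrReplaceTempView("emp")
df1.createOrReplaceTempView("dep")
# The same inner join expressed in SQL; Emp_id 789 is dropped as it has no match in emp
df4 = spark.sql("SELECT e.Emp_id, e.Emp_name, d.Dep_name FROM emp e INNER JOIN dep d ON e.Emp_id = d.Emp_id")
df4.show()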