Subtracting dataframes in pyspark

In this post, let us learn how to subtract one dataframe from another in pyspark.

Creating dataframes in pyspark

The program below creates two dataframes, df and df1, which are used in the rest of this post.

Sample program
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession
sc = SparkContext()
spark = SparkSession(sc)
# Creating the first dataframe df
df = sc.parallelize([
    Row(name='Gokul', Class=10, level1=480, level2=380, level3=280, level4=520, grade='A'),
    Row(name='Usha', Class=12, level1=670, level2=720, level3=870, level4=920, grade='A'),
    Row(name='Rajesh', Class=12, level1=180, level2=560, level3=660, level4=850, grade='B'),
]).toDF()
print("Printing the dataframe df below")
df.show()
# Creating the second dataframe df1
df1 = sc.parallelize([
    Row(name='Usha', Class=12, level1=670, level2=720, level3=870, level4=920, grade='A'),
    Row(name='Kumar', Class=9, level1=320, level2=650, level3=760, level4=580, grade='C'),
]).toDF()
print("Printing the dataframe df1 below")
df1.show()
Output
Printing the dataframe df below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    A|   670|   720|   870|   920|  Usha|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
Printing the dataframe df1 below
+-----+-----+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4| name|
+-----+-----+------+------+------+------+-----+
|   12|    A|   670|   720|   870|   920| Usha|
|    9|    C|   320|   650|   760|   580|Kumar|
+-----+-----+------+------+------+------+-----+
Subtracting dataframes

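The subtract method of a dataframe returns the rows of that dataframe that are not present in another dataframe. It is equivalent to EXCEPT DISTINCT in SQL, so duplicate rows are also removed from the result.

In the below program, the second dataframe df1 is subtracted from the first dataframe df. The row for Usha is present in both dataframes, so it does not appear in the result.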

#Subtracting dataframes in pyspark
df2=df.subtract(df1)
print("Printing the dataframe df2 below")
df2.show()
Output
Printing the dataframe df2 below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+

We can also subtract dataframes based on only a few columns, by selecting those columns from both dataframes before calling subtract.

#Subtracting dataframes based on few columns
df3=df.select('Class','grade','level1').subtract(df1.select('Class','grade','level1'))
print("Printing the dataframe df3 below")
df3.show()
Output
Printing the dataframe df3 below
+-----+-----+------+
|Class|grade|level1|
+-----+-----+------+
|   10|    A|   480|
|   12|    B|   180|
+-----+-----+------+
Reference

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract