
Renaming dataframe column in pyspark

In this post, we will learn about renaming a dataframe column in PySpark.

Sample program

withColumn() is used to create a new column in a dataframe, whereas withColumnRenamed() is used to rename an existing column.

Note: these method names are case-sensitive, so the capital letters in withColumn() and withColumnRenamed() must be typed exactly as shown.

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.functions import *

sc = SparkContext.getOrCreate()
# Creating a dataframe with three records
df = sc.parallelize([Row(name='Gokul', Class=10, marks=480, grade='A'),
                     Row(name='Usha', Class=12, marks=450, grade='A'),
                     Row(name='Rajesh', Class=12, marks=430, grade='B')]).toDF()
print("Printing df dataframe")
df.show()
# Creating a new column named Remarks with the constant value 'Good'
df1 = df.withColumn("Remarks", lit('Good'))
print("Printing df1 dataframe")
df1.show()
# Renaming the column Remarks to Feedback
df2 = df1.withColumnRenamed('Remarks', 'Feedback')
print("Printing df2 dataframe")
df2.show()
Output
Printing df dataframe 
+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+

Printing df1 dataframe
+-----+-----+-----+------+-------+
|Class|grade|marks|  name|Remarks|
+-----+-----+-----+------+-------+
|   10|    A|  480| Gokul|   Good|
|   12|    A|  450|  Usha|   Good|
|   12|    B|  430|Rajesh|   Good|
+-----+-----+-----+------+-------+

Printing df2 dataframe
+-----+-----+-----+------+--------+
|Class|grade|marks|  name|Feedback|
+-----+-----+-----+------+--------+
|   10|    A|  480| Gokul|    Good|
|   12|    A|  450|  Usha|    Good|
|   12|    B|  430|Rajesh|    Good|
+-----+-----+-----+------+--------+
printSchema()

The printSchema() function helps us view the schema of a dataframe.

df2.printSchema()
root
 |-- Class: long (nullable = true)
 |-- grade: string (nullable = true)
 |-- marks: long (nullable = true)
 |-- name: string (nullable = true)
 |-- Feedback: string (nullable = false)
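
withColumnRenamed() renames one column at a time, so several columns can be renamed by chaining the calls. A minimal sketch, reusing the df2 dataframe from above (the new names here are just illustrative):

# Chaining withColumnRenamed() to rename two columns;
# renaming a column that does not exist is silently ignored, not an error.
df_renamed = df2.withColumnRenamed('Class', 'standard') \
                .withColumnRenamed('marks', 'total_marks')
df_renamed.printSchema()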
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumnRenamed



When otherwise in pyspark with examples

In this post, we will learn about when().otherwise() in PySpark with examples.

when().otherwise() works as a conditional expression, much like an if-else statement.

The examples below cover single conditions, multiple conditions, and logical operators.
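
The general shape is when(condition, value), optionally chained with more when() calls and closed with otherwise(default). A minimal sketch of the pattern on its own (the expression is a Column and is not tied to a particular dataframe yet):

from pyspark.sql.functions import when, col

# when(<boolean column expression>, <value>).otherwise(<default value>)
first_level = when(col("grade") == 'A', "Good").otherwise("Average")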

Sample program – Single condition check

In the below example, df is a dataframe with three records.

df1 is a new dataframe created from df by adding one more column named First_Level.

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.functions import *

sc = SparkContext.getOrCreate()
# Creating a dataframe with three records
df = sc.parallelize([Row(name='Gokul', Class=10, marks=480, grade='A'),
                     Row(name='Usha', Class=12, marks=450, grade='A'),
                     Row(name='Rajesh', Class=12, marks=430, grade='B')]).toDF()
print("Printing df dataframe below")
df.show()
# Single condition: grade 'A' becomes Good, everything else Average
df1 = df.withColumn("First_Level", when(col("grade") == 'A', "Good").otherwise("Average"))
print("Printing df1 dataframe below")
df1.show()
Output
Printing df dataframe below
+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+

Printing df1 dataframe below
+-----+-----+-----+------+-----------+
|Class|grade|marks|  name|First_Level|
+-----+-----+-----+------+-----------+
|   10|    A|  480| Gokul|       Good|
|   12|    A|  450|  Usha|       Good|
|   12|    B|  430|Rajesh|    Average|
+-----+-----+-----+------+-----------+
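
If otherwise() is left out, rows that match no condition get null in the new column. A minimal sketch, reusing the df dataframe from the program above:

# Without otherwise(), non-matching rows (here Rajesh, grade 'B') get null
df_no_default = df.withColumn("First_Level", when(col("grade") == 'A', "Good"))
df_no_default.show()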
Sample program – Multiple checks

We can check multiple conditions by chaining when() calls, as shown below.

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.functions import *

sc = SparkContext.getOrCreate()
# Creating a dataframe with three records
df = sc.parallelize([Row(name='Gokul', Class=10, marks=480, grade='A'),
                     Row(name='Usha', Class=12, marks=450, grade='A'),
                     Row(name='Rajesh', Class=12, marks=430, grade='B')]).toDF()
print("Printing df dataframe below")
df.show()
# Chaining multiple when() conditions before the final otherwise()
df2 = df.withColumn("Second_Level", when(col("grade") == 'A', 'Excellent')
                                    .when(col("grade") == 'B', 'Good')
                                    .otherwise("Average"))
print("Printing df2 dataframe below")
df2.show()
Output

The column Second_Level is created by the above program.

Printing df dataframe below
+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+

Printing df2 dataframe below
+-----+-----+-----+------+------------+
|Class|grade|marks|  name|Second_Level|
+-----+-----+-----+------+------------+
|   10|    A|  480| Gokul|   Excellent|
|   12|    A|  450|  Usha|   Excellent|
|   12|    B|  430|Rajesh|        Good|
+-----+-----+-----+------+------------+
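
As a side note, the same chained conditions can be written as a SQL-style CASE WHEN inside expr(). A minimal sketch, reusing the df dataframe from the program above:

from pyspark.sql.functions import expr

# Equivalent of the chained when()/otherwise() above, as SQL CASE WHEN
df2_sql = df.withColumn("Second_Level", expr(
    "CASE WHEN grade = 'A' THEN 'Excellent' "
    "WHEN grade = 'B' THEN 'Good' "
    "ELSE 'Average' END"))
df2_sql.show()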
Sample program with logical operators & and |

The logical operators & (AND) and | (OR) can be used inside when() as shown below. Each comparison must be wrapped in its own parentheses, because & and | bind more tightly than the comparison operators in Python.

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.functions import *

sc = SparkContext.getOrCreate()
# Creating a dataframe with four records
df = sc.parallelize([Row(name='Gokul', Class=10, marks=480, grade='A'),
                     Row(name='Usha', Class=12, marks=450, grade='A'),
                     Row(name='Rajesh', Class=12, marks=430, grade='B'),
                     Row(name='Mahi', Class=5, marks=350, grade='C')]).toDF()
print("Printing df dataframe")
df.show()
# Combining comparisons with the logical operators | (OR) and & (AND)
df3 = df.withColumn("Third_Level",
                    when((col("grade") == 'A') | (col("marks") > 450), "Excellent")
                    .when((col("grade") == 'B') | ((col("marks") > 400) & (col("marks") < 450)), "Good")
                    .otherwise("Average"))
print("Printing df3 dataframe")
df3.show()
Output
Printing df dataframe
+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
|    5|    C|  350|  Mahi|
+-----+-----+-----+------+

Printing df3 dataframe
+-----+-----+-----+------+-----------+
|Class|grade|marks|  name|Third_Level|
+-----+-----+-----+------+-----------+
|   10|    A|  480| Gokul|  Excellent|
|   12|    A|  450|  Usha|  Excellent|
|   12|    B|  430|Rajesh|       Good|
|    5|    C|  350|  Mahi|    Average|
+-----+-----+-----+------+-----------+
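
For readability, the combined conditions can also be assigned to named variables before being passed to when(). A minimal sketch, equivalent to the df3 program above (the variable names are just illustrative):

# Each comparison stays in its own parentheses because & and |
# bind more tightly than ==, > and < in Python
cond_excellent = (col("grade") == 'A') | (col("marks") > 450)
cond_good = (col("grade") == 'B') | ((col("marks") > 400) & (col("marks") < 450))
df3_named = df.withColumn("Third_Level", when(cond_excellent, "Excellent")
                                         .when(cond_good, "Good")
                                         .otherwise("Average"))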
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.when
