In this post, we will learn how to rename a DataFrame column in PySpark.
Sample program
withColumn() is used to create a new column in a DataFrame, whereas withColumnRenamed() is used to rename an existing column.
Note: these method names are case-sensitive, so the capital letters in withColumn and withColumnRenamed must be typed exactly as shown.
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import *
sc = SparkContext.getOrCreate()
# A SparkSession must exist for rdd.toDF() to work
spark = SparkSession.builder.getOrCreate()
# creating a dataframe with three records
df = sc.parallelize([
    Row(name='Gokul', Class=10, marks=480, grade='A'),
    Row(name='Usha', Class=12, marks=450, grade='A'),
    Row(name='Rajesh', Class=12, marks=430, grade='B'),
]).toDF()
print("Printing df dataframe ")
df.show()
# Creating new column as Remarks
df1=df.withColumn("Remarks",lit('Good'))
print("Printing df1 dataframe")
df1.show()
#Renaming the column Remarks as Feedback
df2=df1.withColumnRenamed('Remarks','Feedback')
print("Printing df2 dataframe")
df2.show()
Output
Printing df dataframe
+-----+-----+-----+------+
|Class|grade|marks| name|
+-----+-----+-----+------+
| 10| A| 480| Gokul|
| 12| A| 450| Usha|
| 12| B| 430|Rajesh|
+-----+-----+-----+------+
Printing df1 dataframe
+-----+-----+-----+------+-------+
|Class|grade|marks| name|Remarks|
+-----+-----+-----+------+-------+
| 10| A| 480| Gokul| Good|
| 12| A| 450| Usha| Good|
| 12| B| 430|Rajesh| Good|
+-----+-----+-----+------+-------+
Printing df2 dataframe
+-----+-----+-----+------+--------+
|Class|grade|marks| name|Feedback|
+-----+-----+-----+------+--------+
| 10| A| 480| Gokul| Good|
| 12| A| 450| Usha| Good|
| 12| B| 430|Rajesh| Good|
+-----+-----+-----+------+--------+
printSchema()
The printSchema() function lets us view the schema of a DataFrame, showing each column's name, data type, and nullability.
df2.printSchema()
root
|-- Class: long (nullable = true)
|-- grade: string (nullable = true)
|-- marks: long (nullable = true)
|-- name: string (nullable = true)
|-- Feedback: string (nullable = false)
Related Articles
Creating dataframes in pyspark using parallelize