Categories
pyspark

row_number in pyspark dataframe

In this post, we will learn to use row_number in pyspark dataframe with examples.

What is row_number ?

This row_number in pyspark dataframe will assign consecutive numbering over a set of rows.
The window function in pyspark dataframe helps us to achieve it.
To get to know more about window function, Please refer to the below link.

Creating dataframe 

Before moving into the concept, Let us create a dataframe using the below program.

from pyspark.sql import Row
# Creating dictionary with employee and their salary details 
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer"  , "Salary" : 2500 } , {"Emp_id" : 456 ,"Dep_name"  :"Economy" , "Salary" : 4500} , {"Emp_id" : 789 , "Dep_name" : "Economy" , "Salary" : 7200 } , {"Emp_id" : 564 , "Dep_name" : "Computer" , "Salary" : 1400 } , {"Emp_id" : 987 , "Dep_name" : "History" , "Salary" : 3450 }, {"Emp_id" :678 , "Dep_name" :"Economy" ,"Salary": 6700},{"Emp_id" : 943 , "Dep_name" : "Computer" , "Salary" : 3200 }]
# Creating RDD from the dictionary created above
rdd1=sc.parallelize(dict1)
# Converting RDD to dataframe
df1=rdd1.toDF()
print("Printing the dataframe df1")
df1.show()

Thus we created the below dataframe with the salary details of some employees from various departments.

Printing the dataframe df1
+--------+------+------+
|Dep_name|Emp_id|Salary|
+--------+------+------+
|Computer|   123|  2500|
| Economy|   456|  4500|
| Economy|   789|  7200|
|Computer|   564|  1400|
| History|   987|  3450|
| Economy|   678|  6700|
|Computer|   943|  3200|
+--------+------+------+
Sample program – row_number

With the below segment of the code, we can populate the row number based on the Salary for each department separately.

We need to import the following libraries before using the window and row_number in the code.

orderBy clause is used for sorting the values before generating the row number.

from pyspark.sql import Window
from pyspark.sql.functions import row_number
df2=df1.withColumn("row_num",row_number().over(Window.partitionBy("Dep_name").orderBy("Salary")))
print("Printing the dataframe df2")
df2.show()
Printing the dataframe df2
+--------+------+------+-------+
|Dep_name|Emp_id|Salary|row_num|
+--------+------+------+-------+
|Computer|   564|  1400|      1|
|Computer|   123|  2500|      2|
|Computer|   943|  3200|      3|
| History|   987|  3450|      1|
| Economy|   456|  4500|      1|
| Economy|   678|  6700|      2|
| Economy|   789|  7200|      3|
+--------+------+------+-------+
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.row_number

Leave a Reply

Your email address will not be published. Required fields are marked *