Creating dataframes in pyspark using parallelize

In this post, we will learn how to create dataframes in pyspark using the parallelize method.

Dataframes are tabular structures with rows and columns, similar to tables in a relational database.

Libraries required

The following imports need to run before executing the program.

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row

# Reuse an existing SparkContext if one is already running
sc = SparkContext.getOrCreate()
# toDF() is attached to RDDs only once a SparkSession exists,
# so create (or reuse) one before converting an RDD to a dataframe
spark = SparkSession.builder.getOrCreate()

Sample program – Creating dataframes using parallelize

Row() – creates a record (a row) with named fields.

parallelize() – distributes a local Python collection across the cluster as an RDD.

toDF() – converts the parallelized collection (RDD) into a dataframe, as seen below.

show() – displays the dataframe, 20 rows by default; we can display more by passing the number needed, like show(40), as illustrated after the sample program.

df = sc.parallelize([
    Row(name='Gokul', Class=10, marks=480, grade='A'),
    Row(name='Usha', Class=12, marks=450, grade='A'),
    Row(name='Rajesh', Class=12, marks=430, grade='B')
]).toDF()
df.show()
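
As a quick illustration of the row-count argument mentioned above, here is a small sketch using the same df. show() also accepts an optional truncate flag.

# Display only the first 2 rows; truncate=False prints full
# column values instead of cutting them off at 20 characters
df.show(2, truncate=False)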

Output

Our sample program created a dataframe named df, as seen below.

In the output, the first row holds the column names and all the other rows hold the actual data. Note that the columns appear in alphabetical order (Class, grade, marks, name) rather than the order we typed them, because in Spark 2.x, Row sorts its keyword arguments alphabetically.

+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+
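
If you prefer the columns in the order you typed them, one option (a minimal sketch, not from the original program) is to parallelize plain tuples and pass the column names to toDF().

# Build the same dataframe from plain tuples; the column order
# comes from the list passed to toDF() instead of Row's
# alphabetical field ordering
data = [('Gokul', 10, 480, 'A'),
        ('Usha', 12, 450, 'A'),
        ('Rajesh', 12, 430, 'B')]
df2 = sc.parallelize(data).toDF(['name', 'Class', 'marks', 'grade'])
df2.show()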

Reference

https://spark.apache.org/docs/2.1.1/programming-guide.html#parallelized-collections

https://beginnersbug.com/case-when-statement-in-pyspark-with-example/
