In this post, we will learn about creating dataframes in PySpark using the parallelize method.
A dataframe is a tabular structure with rows and columns, similar to a table in a relational database.
Libraries required
The following imports need to be run before executing the program.
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
sc = SparkContext.getOrCreate()
# toDF() on an RDD requires an active SparkSession behind the scenes
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql import Row
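As a quick sanity check (a minimal sketch; the printed version depends on your installation), we can confirm that the context and session are up:
print(sc.version)                  # Spark version, e.g. 2.1.1
print(spark.sparkContext is sc)    # True – the session wraps the same context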
Sample program – Creating dataframes using parallelize
Row() – used for creating a record (one row of the dataframe).
parallelize() – used for distributing a local collection of elements across the cluster as an RDD.
toDF() – used for converting the parallelized collection into a dataframe, as seen below.
show() – displays the dataframe, with a default of 20 rows. We can increase this by passing the number of rows needed, e.g. show(40).
df = sc.parallelize([
    Row(name='Gokul', Class=10, marks=480, grade='A'),
    Row(name='Usha', Class=12, marks=450, grade='A'),
    Row(name='Rajesh', Class=12, marks=430, grade='B')
]).toDF()
df.show()
Output
We created the dataframe named df in our sample program, as seen below.
A dataframe prints its column names in the header row and the actual data in the rows that follow. Note that the columns come out in alphabetical order (Class, grade, marks, name) rather than the order written: in Spark versions before 3.0, Row sorts keyword-argument fields by name.
+-----+-----+-----+------+
|Class|grade|marks| name|
+-----+-----+-----+------+
| 10| A| 480| Gokul|
| 12| A| 450| Usha|
| 12| B| 430|Rajesh|
+-----+-----+-----+------+
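If a fixed column order is needed, toDF() also accepts a list of column names when the RDD holds plain tuples. A minimal sketch under that assumption (df2 is our name, not from the original program), which also demonstrates passing a row count to show():
df2 = sc.parallelize([
    ('Gokul', 10, 480, 'A'),
    ('Usha', 12, 450, 'A'),
    ('Rajesh', 12, 430, 'B')
]).toDF(['name', 'Class', 'marks', 'grade'])
# show(2) prints only the first 2 rows
df2.show(2)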
Reference
https://spark.apache.org/docs/2.1.1/programming-guide.html#parallelized-collections
Related Articles
https://beginnersbug.com/case-when-statement-in-pyspark-with-example/