
Spark SQL operation in PySpark

In this post, let us look at the Spark SQL operation in PySpark with an example.

What is Spark SQL in PySpark?

Spark SQL lets us execute SQL queries against DataFrames. We can register a DataFrame as a temporary table using the createOrReplaceTempView function.

Sample program

In the following sample program, we create an RDD using the parallelize method and then convert it into a DataFrame with toDF.

To better understand the process of creating DataFrames, please refer to the link in the Reference section below.

createOrReplaceTempView registers the DataFrame as a temporary table (here named df_view).

Once the view is registered, we can execute SQL queries against it using spark.sql.

#Libraries required
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark import SparkContext

sc = SparkContext()
spark = SparkSession(sc)

#Creating an RDD and converting it to a dataframe
df = sc.parallelize([
    Row(name='Gokul', Class=10, marks=480, grade='A'),
    Row(name='Usha', Class=12, marks=450, grade='A'),
    Row(name='Rajesh', Class=12, marks=430, grade='B')
]).toDF()

#Registering the dataframe as a temporary table
df.createOrReplaceTempView("df_view")

#Executing SQL queries using the spark SQL operation
spark.sql("select * from df_view").show()
Output

We can also filter the data based on conditions using a where clause.

But below is the entire data of the dataframe, without any filtering or modification.

+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+
Reference

https://stackoverflow.com/questions/32788387/pipelinedrdd-object-has-no-attribute-todf-in-pyspark