
RDD, DataFrame and Dataset in Spark

In this post, let us learn about RDD, DataFrame and Dataset in Spark.

Resilient Distributed Datasets – RDD

RDD is the fundamental data structure in Spark.

  • fault-tolerant
  • in-memory
  • an immutable, distributed collection of elements
  • partitioned across the nodes of the cluster and operated on in parallel
  • a low-level API that offers transformations and actions (see the brief sketch after this list)
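A minimal sketch of transformations versus actions, assuming a spark-shell session where sc (the SparkContext) is already available; the values are purely illustrative:

val nums = sc.parallelize(Seq(1, 2, 3))   // create an RDD from a local collection
val doubled = nums.map(_ * 2)             // transformation: lazy, nothing is computed yet
doubled.collect()                         // action: triggers execution, returns Array(2, 4, 6)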

Two ways to create RDDs:

  1. Parallelizing an existing collection
  2. Referencing a dataset in an external storage system

Parallelizing an existing collection:

val a = Array(1, 2, 3, 4)
val b = sc.parallelize(a)
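The resulting RDD b can then be operated on in parallel; for example (a hypothetical follow-up):

val sum = b.reduce(_ + _)   // action: sums the elements across partitions, returns 10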

Referencing an external storage system:

RDDs can be created by referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method, which takes a URI for the file (either a local path or a hdfs://, s3a://, etc. URI) and reads it as a collection of lines.

val c = sc.textFile("data.txt")

By default, Spark creates one partition for each block of the file (the default block size in HDFS is 128 MB).

We can also ask for a higher number of partitions by passing a larger value as the second argument to the textFile method, but we cannot have fewer partitions than blocks.
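As a minimal sketch (the file name and partition count here are placeholders), the second argument to textFile requests a minimum number of partitions:

val d = sc.textFile("data.txt", 8)   // ask for at least 8 partitions
d.getNumPartitions                   // 8 or more, but never fewer than the number of blocks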

DataFrame

A DataFrame is a Dataset organized into named columns, conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations.

DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
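For example, a minimal sketch assuming a SparkSession named spark (as in spark-shell) and a hypothetical people.json file:

val df = spark.read.json("people.json")   // DataFrame from a structured data file
df.printSchema()                          // inferred column names and types
df.show()

// DataFrame from an existing RDD of tuples, naming the columns with toDF
import spark.implicits._
val fromRdd = sc.parallelize(Seq((1, "spark"), (2, "hadoop"))).toDF("id", "name")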

The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API, users need to use Dataset<Row> to represent a DataFrame.

Dataset

A Dataset is a distributed collection of data and a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
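A minimal sketch, assuming spark-shell (where spark and its implicits are available) and a hypothetical Person case class:

case class Person(name: String, age: Long)

import spark.implicits._
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()   // Dataset[Person] built from JVM objects
val adults = ds.filter(_.age >= 18)                         // typed lambda; the age field is checked at compile time
adults.show()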

The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python’s dynamic nature, many of its benefits are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.

RDDs and Datasets are type safe, which means the compiler knows the columns and their data types, whether Long, String, etc. With a DataFrame, on the other hand, every time you call an action such as collect(), the result comes back as an Array of Row objects rather than as values of the underlying Long or String types.
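To illustrate the difference, reusing the hypothetical Person Dataset from the sketch above:

val names: Array[String] = ds.map(_.name).collect()           // Dataset: the compiler knows name is a String

val peopleDF = ds.toDF()                                      // the same data as a DataFrame (Dataset[Row])
val rows: Array[org.apache.spark.sql.Row] = peopleDF.collect()
val firstName = rows(0).getAs[String]("name")                 // values must be pulled out of each Row at runtime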