Categories
python

difference between list and tuple

In this post, let us learn about the difference between a list and a tuple.

What are Lists?

Lists are compound data types that store multiple independent values.

A list can accommodate different datatypes such as integer, string, and float within square brackets.

In the below example, a list is created with different types of data.

# Creating lists with multiple datatypes 
a = [1, 2, 3, 'a', 'b', 'c', 'apple', 'orange', 10.89]
What are Tuples?

Tuples are also one of the sequence datatypes, like lists.

Tuples can be created by storing the data within round brackets (parentheses) or without any brackets.

#Creating tuple with round bracket
b=(1,2,3,'a','b','c','apple','orange',10.89,(6,7,8))
#Creating tuple without any brackets
c=1,2,3,'a','b','c','apple','orange',10.89,(6,7,8)
What is the difference between list and tuple?

Lists are mutable whereas tuples are immutable.

We can change the lists but not the tuples.

program
# Creating  list a
a=[1, 2, 3, 'a', 'b', 'c', 'apple', 'orange', 10.89]
# Displaying the data
a
# Displaying the datatype
type(a)
Result
[1, 2, 3, 'a', 'b', 'c', 'apple', 'orange', 10.89]
list

We can append to the list as shown below, but we cannot change the tuple in the same way.

This is the key difference between a tuple and a list.

program
# Trying to append data to it
a.append('d')
a
Result
[1, 2, 3, 'a', 'b', 'c', 'apple', 'orange', 10.89, 'd']
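
To see the immutability in action, we can try to change the tuple b created earlier. This is a minimal sketch; the exact error wording may differ slightly between Python versions.

program
# Trying to change an element of the tuple b
b[0] = 100
Result
TypeError: 'tuple' object does not support item assignment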
Categories
hive

How to create external table in hive

In this post, let us learn how to create an external table in Hive.

Types of tables in Hive

The following are the two types of tables in Hive.

  1. Internal table or Managed table
  2. External table

To get familiar with internal tables, please refer to the below link.

https://beginnersbug.com/how-to-create-a-table-in-hive/  

What is an External table?

To create an external table, we need to use the keyword external.

The remaining syntax remains the same for both types of tables.

How to create an external table?

Following is the syntax for creating an external table.

fields terminated by ',' – specifies the delimiter of the file to be loaded into the table.

hive> create external table temp_details (year string,temp int,place string) 
> row format delimited 
> fields terminated by ','; 
OK
Time taken: 0.093 seconds

To get familiar with loading data into the table, please refer to the following link.

https://beginnersbug.com/how-to-load-data-into-a-hive-table/

Categories
hive

How to load data into a hive table

In this post, we will learn how to load data into a Hive table.

Types of tables in Hive

The following are the two types of tables in Hive.

  1. Internal table or Managed table
  2. External table

In this post, let us discuss internal tables and the ways to load them.

What are the ways to load data into a table?

There are three ways to load data into a hive table.

  • Loading data from a file 
  • Inserting values into a table 
  • Loading data from some other table 
1. Loading data from a file

We can refer to the path of the file as shown below. More importantly, we need to specify the details of the file, such as the delimiter, while creating the table itself.

Please refer to the below link to understand it clearly.

https://beginnersbug.com/how-to-create-a-table-in-hive/

hive> load data local inpath '/tmp/temp.csv' into table truck;
Loading data to table vehicle_details.truck
OK
Time taken: 0.846 seconds
hive> select * from truck limit 5;
OK
1901    50      bangalore
1903    48      kolkata
1900    26      pune
1902    12      darjeling
1903    23      delhi
Time taken: 0.954 seconds, Fetched: 5 row(s)
2. Inserting values into a table

We can insert values into the table manually as shown below.

We can even add multiple records into the table in a similar way.

hive> insert into table truck1 values ('2020',65,'Chennai'); 

Let us verify whether the inserted data looks good.

hive> select * from truck1 where year='2020';
OK
2020    65      Chennai
Time taken: 0.161 seconds, Fetched: 1 row(s)
3. Loading data from some other table

We can create one more table with the following command.

hive> create table if not exists truck1
    > (year string , temp int , place string)
    > row format delimited
    > fields terminated by ','
    > ;
OK
Time taken: 0.099 seconds

As shown below, we can select the data from the existing table and load it into the new table.

hive> insert into table truck1 select * from truck where temp < 50;

After loading, we can find the new table populated with the selected data.

hive> select * from truck1 limit 5;
OK
1903    48      kolkata
1900    26      pune
1902    12      darjeling
1903    23      delhi
1902    25      delhi
Time taken: 0.145 seconds, Fetched: 5 row(s)
Categories
hive

How to create a table in hive

In this post, we will learn how to create a table in Hive.

Creating a database in Hive

In Hive, we either have to use an existing database or create a new one before creating a table.

The following show command lists the available databases in Hive.

hive> show databases; 
OK 
default
Time taken: 0.009 seconds, Fetched: 1 row(s)

The create database command helps us to create a new database as shown below.

hive> create database vehicle_details;
OK
Time taken: 0.208 seconds 
hive> show databases;
OK
default
vehicle_details 
Time taken: 0.011 seconds, Fetched: 2 row(s)
Creating a table

The create syntax contains the following options.

if not exists (optional) – if the table already exists, the statement is ignored.

other options – the specifications of the file to be loaded into the table.

tblproperties – the header row of the file can be skipped with this property.

hive> create table if not exists truck 
    > (year string , temp int , place string )
    > row format delimited
    > fields terminated by ','
    > stored as textfile
    > tblproperties ("skip.header.line.count"="1");
OK
Time taken: 0.278 seconds
hive> show tables; 
OK 
deliveries 
matches 
truck 
Time taken: 0.012 seconds, Fetched: 3 row(s)
How the file looks
cat /tmp/temp.csv |head -6
year|temp|place
1900,45,chennai 
1901,50,bangalore 
1903,48,kolkata 
1900,26,pune 
1902,12,darjeling 
Reference

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Creating,Showing,Altering,andDroppingTables

Categories
pyspark

Subtracting dataframes in pyspark

In this post, let us learn about subtracting dataframes in pyspark.

Creating dataframes in pyspark

We can create two dataframes using the below program for further use.

Sample program
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
spark = SparkSession(sc)
from pyspark.sql import Row
# Creating the first dataframe df
df=sc.parallelize([Row(name='Gokul',Class=10,level1=480,level2=380,level3=280,level4=520,grade='A'),Row(name='Usha',Class=12,level1=670,level2=720,level3=870,level4=920,grade='A'),Row(name='Rajesh',Class=12,level1=180,level2=560,level3=660,level4=850,grade='B')]).toDF()
print("Printing the dataframe df below")
df.show()
# Creating the second dataframe df1
df1=sc.parallelize([Row(name='Usha',Class=12,level1=670,level2=720,level3=870,level4=920,grade='A'),Row(name='Kumar',Class=9,level1=320,level2=650,level3=760,level4=580,grade='C')]).toDF()
print("Printing the dataframe df1 below")
df1.show()
Output
Printing the dataframe df below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    A|   670|   720|   870|   920|  Usha|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
Printing the dataframe df1 below
+-----+-----+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4| name|
+-----+-----+------+------+------+------+-----+
|   12|    A|   670|   720|   870|   920| Usha|
|    9|    C|   320|   650|   760|   580|Kumar|
+-----+-----+------+------+------+------+-----+
Subtracting dataframes

The subtract function helps us in subtracting dataframes in pyspark.

In the below program, the second dataframe is subtracted from the first dataframe.

#Subtracting dataframes in pyspark
df2=df.subtract(df1)
print("Printing the dataframe df2 below")
df2.show()
Printing the dataframe df2 below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
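
Note that the order of subtraction matters: subtract returns the rows of the dataframe it is called on that are not present in the other one. As a quick sketch with the same dataframes (the name df4 is just illustrative), reversing the call should keep only the row for Kumar, since Usha appears in both dataframes.

#Subtracting in the reverse order
df4=df1.subtract(df)
print("Printing the dataframe df4 below")
df4.show()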

We can also subtract the dataframes based on a few columns only.

#Subtracting dataframes based on few columns
df3=df.select('Class','grade','level1').subtract(df1.select('Class','grade','level1'))
print("Printing the dataframe df3 below ")
df3.show()
Printing the dataframe df3 below
+-----+-----+------+
|Class|grade|level1|
+-----+-----+------+
|   10|    A|   480|
|   12|    B|   180|
+-----+-----+------+
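
One more point worth noting: subtract behaves like EXCEPT DISTINCT in SQL, so duplicate rows in the first dataframe appear only once in the result. If duplicates need to be preserved, newer Spark versions (2.4 and above) provide exceptAll; a minimal sketch with the same dataframes is shown below.

#Subtracting while keeping duplicate rows (requires Spark 2.4+)
df5=df.exceptAll(df1)
df5.show()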
Reference

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

Categories
pyspark

greatest() and least() in pyspark

In this post, we will learn the functions greatest() and least() in pyspark.

What are greatest() and least()?

Both the functions greatest() and least() help in identifying the greatest and smallest values among a few of the columns, row by row.

Creating dataframe

With the below sample program, a dataframe can be created for use in the later part of the post.

To understand the creation of dataframes better, please refer to the earlier post.

Sample program
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
spark = SparkSession(sc)
from pyspark.sql import Row
df=sc.parallelize([Row(name='Gokul',Class=10,level1=480,level2=380,level3=280,level4=520,grade='A'),Row(name='Usha',Class=12,level1=670,level2=720,level3=870,level4=920,grade='A'),Row(name='Rajesh',Class=12,level1=180,level2=560,level3=660,level4=850,grade='B')]).toDF()
print("Printing the dataframe df below")
Output
Printing the dataframe df below
+-----+-----+------+------+------+------+------+
|Class|grade|level1|level2|level3|level4|  name|
+-----+-----+------+------+------+------+------+
|   10|    A|   480|   380|   280|   520| Gokul|
|   12|    A|   670|   720|   870|   920|  Usha|
|   12|    B|   180|   560|   660|   850|Rajesh|
+-----+-----+------+------+------+------+------+
greatest() in pyspark

In order to compare multiple columns row-wise, the greatest and least functions can be used.

In the below program, the four columns level1, level2, level3, and level4 are compared to find the largest value.

Sample program
from pyspark.sql.functions import greatest,col
df1=df.withColumn("large",greatest(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df1 below")
df1.show()

The column large is populated with the greatest value among the four levels for each row.

Output
Printing the dataframe df1 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|large|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  520|
|   12|    A|   670|   720|   870|   920|  Usha|  920|
|   12|    B|   180|   560|   660|   850|Rajesh|  850|
+-----+-----+------+------+------+------+------+-----+
least() in pyspark

The least function helps us to get the smallest value among the four levels for each row.

Sample program
from pyspark.sql.functions import least,col
df2=df.withColumn("Small",least(col("level1"),col("level2"),col("level3"),col("level4")))
print("Printing the dataframe df2 below")
df2.show()
Output
Printing the dataframe df2 below
+-----+-----+------+------+------+------+------+-----+
|Class|grade|level1|level2|level3|level4|  name|Small|
+-----+-----+------+------+------+------+------+-----+
|   10|    A|   480|   380|   280|   520| Gokul|  280|
|   12|    A|   670|   720|   870|   920|  Usha|  670|
|   12|    B|   180|   560|   660|   850|Rajesh|  180|
+-----+-----+------+------+------+------+------+-----+
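
Both functions can also be combined in a single select. The below lines are a small sketch (the aliases large and small are just illustrative names) that keeps the name column along with the row-wise maximum and minimum of the four levels.

#Using greatest() and least() together
from pyspark.sql.functions import greatest,least,col
df3=df.select(col("name"),greatest(col("level1"),col("level2"),col("level3"),col("level4")).alias("large"),least(col("level1"),col("level2"),col("level3"),col("level4")).alias("small"))
df3.show()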
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.least

Categories
pyspark

spark SQL operation in pyspark

In this post, let us look into the spark SQL operation in pyspark with an example.

What is spark SQL in pyspark?

Spark SQL helps us to execute SQL queries on dataframes. We can register a dataframe as a temporary table using the function createOrReplaceTempView.

Sample program

In the following sample program, we create an RDD using the parallelize method and later convert it into a dataframe.

To understand the process of creating dataframes better, please refer to the below link.

createOrReplaceTempView helps us to register the created dataframe as a temporary table.

We can then execute SQL queries on it with the help of the spark SQL operation in pyspark.

#Libraries required
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
spark = SparkSession(sc)
from pyspark.sql import Row
#creating rdd and converting to dataframe
df=sc.parallelize([Row(name='Gokul',Class=10,marks=480,grade='A'),Row(name='Usha',Class=12,marks=450,grade='A'),Row(name='Rajesh',Class=12,marks=430,grade='B')]).toDF()
#Registering the dataframe as a temporary table
df.createOrReplaceTempView("df_view")
#Executing SQl queries using spark SQl operation
spark.sql("select * from df_view").show()
Output

We can even manipulate the data by filtering based on some conditions using the where clause, as shown in the sketch after the output.

Below is the entire data of the dataframe without any filtering or modification.

+-----+-----+-----+------+
|Class|grade|marks|  name|
+-----+-----+-----+------+
|   10|    A|  480| Gokul|
|   12|    A|  450|  Usha|
|   12|    B|  430|Rajesh|
+-----+-----+-----+------+
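
As a small sketch of such a filter, the query below selects only the students with grade A from the registered view (the column names are taken from the sample data above).

#Filtering the registered view using a where clause
spark.sql("select name,marks from df_view where grade = 'A'").show()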
Reference

https://stackoverflow.com/questions/32788387/pipelinedrdd-object-has-no-attribute-todf-in-pyspark

Categories
mysql

third highest salary for each department in a table using MySQL

In this post, let us learn to get the third highest salary for each department in a table using MySQL.

MySQL table creation

In order to find the third highest salary for each department in a table using MySQL, we will create a table like below.

The following table contains the salary details of a few employees across different departments.

Create Table Query
create table employee (Dep_name varchar(20),Emp_id int,Salary int)
Insert Query
insert into Employee values('Computer',564,1400);
insert into Employee values('Computer',123,2500);
insert into Employee values('Computer',943,3200);
insert into Employee values('History',987,3450);
insert into Employee values('Economy',456,4500);
insert into Employee values('Economy',678,6700);
insert into Employee values('Economy',789,7200);
insert into Employee values('Computer',324,2500);
Table Data
mysql> select * from Employee;
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    564 |   1400 |
| Computer |    123 |   2500 |
| Computer |    943 |   3200 |
| History  |    987 |   3450 |
| Economy  |    456 |   4500 |
| Economy  |    678 |   6700 |
| Economy  |    789 |   7200 |
| Computer |    324 |   2500 |
+----------+--------+--------+
8 rows in set (0.00 sec)
nth Salary calculation for each group

We shall find the nth highest salary with the help of the dense_rank function available in MySQL. Here is its syntax and usage.

Syntax :

DENSE_RANK() OVER (PARTITION BY <columnname> ORDER BY <columnname> desc)

The dense_rank function helps us in ranking the records within each partition. To get to know rank and dense rank well, please refer to the below link.

https://beginnersbug.com/rank-and-dense-rank-in-pyspark-dataframe/

The partition by clause divides the entire data into groups based on the column specified. 

The order by clause sorts, within each group, the column on which the nth highest calculation needs to be performed.

Calculating dense rank

For the employee table, the entire data gets divided based on the Dep_name column and ordered by the salary column.

The dense rank function is then applied to each partition to rank the salaries in descending order.

select Dep_name,Emp_id,Salary,DENSE_RANK() OVER (PARTITION BY Dep_name ORDER BY Salary desc) as denserank from employee;
+----------+--------+--------+-----------+
| Dep_name | Emp_id | Salary | denserank |
+----------+--------+--------+-----------+
| Computer |    943 |   3200 |         1 |
| Computer |    123 |   2500 |         2 |
| Computer |    324 |   2500 |         2 |
| Computer |    564 |   1400 |         3 |
| Economy  |    789 |   7200 |         1 |
| Economy  |    678 |   6700 |         2 |
| Economy  |    456 |   4500 |         3 |
| History  |    987 |   3450 |         1 |
+----------+--------+--------+-----------+
8 rows in set (0.00 sec)
Third highest salary for each department

With the dense rank value calculated for each department, we can filter for dense rank 3 to get the third highest salary.

select a.Dep_name,a.Emp_id,a.Salary from (select Dep_name,Emp_id,Salary,DENSE_RANK() OVER (PARTITION BY Dep_name ORDER BY Salary desc) as denserank from employee) a where a.denserank=3;
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    564 |   1400 |
| Economy  |    456 |   4500 |
+----------+--------+--------+
Reference

https://www.mysqltutorial.org/mysql-window-functions/mysql-dense_rank-function/

Categories
pyspark

rank and dense rank in pyspark dataframe

In this post, let us learn about rank and dense rank in pyspark dataframes using window functions, with examples.

Rank and dense rank

The rank and dense rank in pyspark dataframe help us to rank the records based on a particular column.

This works in a similar manner to the row number function. To understand the row number function better, please refer to the below link.

The row number function assigns distinct consecutive numbers even when a column contains duplicate values, whereas rank and dense rank assign the same rank to duplicate values.

Sample program – creating dataframe

We can create a dataframe containing the salary details of some employees from different departments using the below program.

from pyspark.sql import Row
# Creating dictionary with employee and their salary details 
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer"  , "Salary" : 2500 } , {"Emp_id" : 456 ,"Dep_name"  :"Economy" , "Salary" : 4500} , {"Emp_id" : 789 , "Dep_name" : "Economy" , "Salary" : 7200 } , {"Emp_id" : 564 , "Dep_name" : "Computer" , "Salary" : 1400 } , {"Emp_id" : 987 , "Dep_name" : "History" , "Salary" : 3450 }, {"Emp_id" :678 , "Dep_name" :"Economy" ,"Salary": 4500},{"Emp_id" : 943 , "Dep_name" : "Computer" , "Salary" : 3200 }]
# Creating RDD from the dictionary created above
rdd1=sc.parallelize(dict1)
# Converting RDD to dataframe
df1=rdd1.toDF()
print("Printing the dataframe df1")
df1.show()
Printing the dataframe df1
+--------+------+------+
|Dep_name|Emp_id|Salary|
+--------+------+------+
|Computer|   123|  2500|
| Economy|   456|  4500|
| Economy|   789|  7200|
|Computer|   564|  1400|
| History|   987|  3450|
| Economy|   678|  4500|
|Computer|   943|  3200|
+--------+------+------+
Sample program – rank()

In order to use rank and dense rank in our program, we require the below libraries.

from pyspark.sql import Window
from pyspark.sql.functions import rank,dense_rank

from pyspark.sql import Window
from pyspark.sql.functions import rank
df2=df1.withColumn("rank",rank().over(Window.partitionBy("Dep_name").orderBy("Salary")))
print("Printing the dataframe df2")
df2.show()

In the below output, the Economy department contains two employees with the first rank. This is because the same salary is provided for both employees.

But instead of assigning the second rank to the next salary, the third rank is assigned. This is how the rank function works: it skips ranks after ties.

Printing the dataframe df2
+--------+------+------+----+
|Dep_name|Emp_id|Salary|rank|
+--------+------+------+----+
|Computer|   564|  1400|   1|
|Computer|   123|  2500|   2|
|Computer|   943|  3200|   3|
| History|   987|  3450|   1|
| Economy|   456|  4500|   1|
| Economy|   678|  4500|   1|
| Economy|   789|  7200|   3|
+--------+------+------+----+
Sample program – dense_rank()

In dense rank, the ranking order is not skipped. For the same scenario discussed earlier, the second rank is assigned to the next salary instead of skipping to the third.

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank
df3=df1.withColumn("denserank",dense_rank().over(Window.partitionBy("Dep_name").orderBy("Salary")))
print("Printing the dataframe df3")
df3.show()
Printing the dataframe df3
+--------+------+------+---------+
|Dep_name|Emp_id|Salary|denserank|
+--------+------+------+---------+
|Computer|   564|  1400|        1|
|Computer|   123|  2500|        2|
|Computer|   943|  3200|        3|
| History|   987|  3450|        1|
| Economy|   456|  4500|        1|
| Economy|   678|  4500|        1|
| Economy|   789|  7200|        2|
+--------+------+------+---------+
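
To see all three behaviours side by side, the sketch below (df4 and window_spec are just illustrative names) adds row_number, rank and dense_rank columns over the same window. For the two Economy employees sharing the salary 4500, row_number still assigns distinct numbers, rank repeats 1 and then jumps to 3, and dense_rank repeats 1 and continues with 2.

from pyspark.sql import Window
from pyspark.sql.functions import row_number,rank,dense_rank
# Common window specification for all three functions
window_spec=Window.partitionBy("Dep_name").orderBy("Salary")
df4=df1.withColumn("row_num",row_number().over(window_spec)) \
       .withColumn("rank",rank().over(window_spec)) \
       .withColumn("denserank",dense_rank().over(window_spec))
df4.show()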
Reference

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Column.over

Categories
pyspark

row_number in pyspark dataframe

In this post, we will learn to use row_number in pyspark dataframe with examples.

What is row_number ?

The row_number function in a pyspark dataframe assigns consecutive numbering over a set of rows.
The window function in pyspark dataframes helps us to achieve it.
To get to know more about window functions, please refer to the below link.

Creating dataframe 

Before moving into the concept, let us create a dataframe using the below program.

from pyspark.sql import Row
# Creating dictionary with employee and their salary details 
dict1=[{"Emp_id" : 123 , "Dep_name" : "Computer"  , "Salary" : 2500 } , {"Emp_id" : 456 ,"Dep_name"  :"Economy" , "Salary" : 4500} , {"Emp_id" : 789 , "Dep_name" : "Economy" , "Salary" : 7200 } , {"Emp_id" : 564 , "Dep_name" : "Computer" , "Salary" : 1400 } , {"Emp_id" : 987 , "Dep_name" : "History" , "Salary" : 3450 }, {"Emp_id" :678 , "Dep_name" :"Economy" ,"Salary": 6700},{"Emp_id" : 943 , "Dep_name" : "Computer" , "Salary" : 3200 }]
# Creating RDD from the dictionary created above
rdd1=sc.parallelize(dict1)
# Converting RDD to dataframe
df1=rdd1.toDF()
print("Printing the dataframe df1")
df1.show()

Thus we created the below dataframe with the salary details of some employees from various departments.

Printing the dataframe df1
+--------+------+------+
|Dep_name|Emp_id|Salary|
+--------+------+------+
|Computer|   123|  2500|
| Economy|   456|  4500|
| Economy|   789|  7200|
|Computer|   564|  1400|
| History|   987|  3450|
| Economy|   678|  6700|
|Computer|   943|  3200|
+--------+------+------+
Sample program – row_number

With the below segment of the code, we can populate the row number based on the Salary for each department separately.

We need to import the following libraries before using the window and row_number in the code.

The orderBy clause is used for sorting the values before generating the row number.

from pyspark.sql import Window
from pyspark.sql.functions import row_number
df2=df1.withColumn("row_num",row_number().over(Window.partitionBy("Dep_name").orderBy("Salary")))
print("Printing the dataframe df2")
df2.show()
Printing the dataframe df2
+--------+------+------+-------+
|Dep_name|Emp_id|Salary|row_num|
+--------+------+------+-------+
|Computer|   564|  1400|      1|
|Computer|   123|  2500|      2|
|Computer|   943|  3200|      3|
| History|   987|  3450|      1|
| Economy|   456|  4500|      1|
| Economy|   678|  6700|      2|
| Economy|   789|  7200|      3|
+--------+------+------+-------+
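
If the numbering should instead start from the highest salary within each department, the ordering can be reversed. The below lines are a brief sketch using the desc() method of the column (df3 is just an illustrative name).

from pyspark.sql import Window
from pyspark.sql.functions import row_number,col
# Generating row numbers with salaries sorted in descending order
df3=df1.withColumn("row_num",row_number().over(Window.partitionBy("Dep_name").orderBy(col("Salary").desc())))
df3.show()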
Reference

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.row_number