
drop partition in hive

In this post, let us learn how to drop a partition in Hive, using examples.

Partition in hive

A Hive table can have one or more partition keys, which improve query performance. By including the partition keys in our query, Hive scans only the matching partitions instead of the whole table.
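As a minimal sketch (the table name and columns here are illustrative), a partitioned table can be created and queried like this:

-- The data is stored in one subdirectory per state value
CREATE TABLE employee (id INT, name STRING)
PARTITIONED BY (state STRING);
-- Only the 'Bihar' partition is scanned, not the whole table
SELECT * FROM employee WHERE state = 'Bihar';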

External and internal table in hive

The two types of tables available in Hive are external and internal (managed).

For external tables, dropping a partition removes only the metadata, so the table property needs to be changed before dropping partitions if the underlying data should be deleted as well.

alter table employee set tblproperties ('EXTERNAL'='FALSE');

Drop partition in hive

We can drop the respective partition of the external table using the following commands.

ALTER TABLE employee set tblproperties ('EXTERNAL'='FALSE');
ALTER TABLE employee drop PARTITION (state='Bihar');
-- After dropping the partition, the table properties should be reverted to external
ALTER TABLE employee set tblproperties ('EXTERNAL'='TRUE');

Resetting the location

We can also point a partition to a new location using the following command.

ALTER TABLE employee PARTITION (state = 'Bihar', dep = 'accounts')
SET LOCATION 'hdfs://retail/company/employee/accounts/Bihar';

Dropping a range of partitions in hive

To drop a range of partitions, we can use comparison operators as below.

ALTER TABLE employee DROP PARTITION (entry_date>'2021-03-14',entry_date<'2021-12-16');

Dropping all the partitions

The following command drops all the partitions by matching every non-empty state value.

ALTER TABLE employee DROP IF EXISTS PARTITION (state<>'');
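To verify, the remaining partitions can be listed with:

SHOW PARTITIONS employee;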

Reference

https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-DropPartitions


How to get previous dates using scala

In this post, we will see how to get previous dates using scala.

Scala program – method 1

ZonedDateTime and ZoneId – to get the date and time for the specific zone id we prefer

DateTimeFormatter – to convert the date and time to a specific format

The minusDays function helps us get previous dates in Scala, as below.

import java.time.{ZonedDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object PreviousDate {
  def main(arr: Array[String]): Unit = {
    val previousday = ZonedDateTime.now(ZoneId.of("UTC")).minusDays(1)
    val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm'Z'")
    val result = formatter format previousday
    println(result)
  }
}
2021-10-01T05:21Z

Scala program – method 2

object YesterdayDate {
  def main(arr: Array[String]): Unit = {
    val today = java.time.LocalDate.now
    val yesterday_date= java.time.LocalDate.now.minusDays(1)
    println(today)
    println(yesterday_date)

  }
}
2021-10-02
2021-10-01

Scala program – method 3

import java.time.Instant
import java.time.temporal.ChronoUnit


object YesterdayDate {
  def main(arr: Array[String]): Unit = {
    val now = Instant.now
    val yesterday = now.minus(1, ChronoUnit.DAYS)
    println(now)
    println(yesterday)
  }
}
2021-10-02T06:37:11.695Z
2021-10-01T06:37:11.695Z

https://github.com/ushanan/SparkFirstProject/blob/master/src/main/scala/com/firstscala/spark/YesterdayDate.scala


bar plot in python with example

In this post, let us learn about the bar plot in python with an example.

Bar plot in python

A bar plot helps us represent categorical data using rectangular bars along the x and y axes.

Codelines

Below are the libraries that need to be imported for plotting the graph.

import pandas as pd
from matplotlib import pyplot as plt
a=pd.read_csv(r'D:\data\shows.csv')
a.head()
Result

The sample file containing nationality details looks like the following.

Age	Experience	Rank	Nationality	Go
0	36	10	9	UK	NO
1	42	12	4	USA	NO
2	23	4	6	N	NO
3	52	4	4	USA	NO
4	43	21	8	USA	YES
Codeline

The country-level count can be listed with the below syntax.

a['Nationality'].value_counts()
Result
UK     5
N      4
USA    4
Name: Nationality, dtype: int64
Codeline

We can get only the country names using the keys function.

a['Nationality'].value_counts().keys()
Result
Index(['UK', 'N', 'USA'], dtype='object')
Codeline

The tolist function converts the country names into a Python list.

a['Nationality'].value_counts().keys().tolist()
Result
['UK', 'N', 'USA']
Codeline

The xlabel and ylabel functions label the x and y axes accordingly.

plt.bar(a['Nationality'].value_counts().keys().tolist(),a['Nationality'].value_counts().tolist())
plt.xlabel("country name")
plt.ylabel("count")
plt.title("Nationality detail")
plt.show()
Result

The bar chart renders with the country names on the x axis and their counts on the y axis.
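If the CSV file is not at hand, a minimal self-contained sketch (with made-up data standing in for shows.csv) produces the same kind of chart:

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data in place of shows.csv
a = pd.DataFrame({'Nationality': ['UK', 'UK', 'USA', 'N', 'USA']})
counts = a['Nationality'].value_counts()
plt.bar(counts.keys().tolist(), counts.tolist())
plt.xlabel("country name")
plt.ylabel("count")
plt.title("Nationality detail")
plt.show()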

Referential integrity in SQL

In this post, let us learn about referential integrity in SQL.

What is referential integrity?

Referential integrity in SQL requires that a foreign key value must have a matching primary key value or be null. This constraint is specified between a parent and a child table, and it maintains the correspondence between rows in these tables.

What is Primary key ?

The PRIMARY KEY constraint uniquely identifies each record in a table. Primary keys must contain UNIQUE values and cannot contain NULL values. A table can have only ONE primary key, and it can consist of a single column or multiple columns (fields).

program

CREATE TABLE Emp_detail (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    PRIMARY KEY (ID)
);

What is Foreign key?

A foreign key is used to maintain the relationship between two tables. The primary key of one table acts as a foreign key in the other table.

program

CREATE TABLE Emp_salary (
    Dep_id int,
    ID int NOT NULL,
    Salary int,
    PRIMARY KEY (Dep_id),
    FOREIGN KEY (ID) REFERENCES Emp_detail(ID)
);

In this way, referential integrity keeps the parent table correlated with the child table.

Rules of Referential integrity

We can't add a record to the table that contains the foreign key unless there is a corresponding record in the linked table.
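A sketch of this rule in action with the tables above (the exact error message varies by database):

-- Fails: Emp_detail has no row with ID 999 yet
INSERT INTO Emp_salary VALUES (1, 999, 5000);
-- Works once the parent row exists
INSERT INTO Emp_detail VALUES (999, 'Doe', 'John', 30);
INSERT INTO Emp_salary VALUES (1, 999, 5000);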

Cascading update and cascading delete are related techniques. They ensure that changes made to the parent table are reflected in the child table.
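A sketch of how a cascade is declared (Emp_salary2 is an illustrative name; the ON DELETE and ON UPDATE clauses are standard SQL):

CREATE TABLE Emp_salary2 (
    Dep_id int,
    ID int NOT NULL,
    Salary int,
    PRIMARY KEY (Dep_id),
    FOREIGN KEY (ID) REFERENCES Emp_detail(ID)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);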

Benefits of Referential Integrity

  • Restricts the entry of duplicate data
  • Avoids one table from pointing to a nonexistent field in another table
  • Prevents the deletion of a record that contains a value referred to by a foreign key in another table
  • Prevents the addition of a record to a table that contains a foreign key unless there is a matching primary key in the linked table

Reference

https://towardsdatascience.com/common-issues-founded-with-referential-integrity-fc05e93693a1


Slicing in python

In this post, let us discuss slicing in python with examples.

How does indexing work?

As a reminder, indexing starts from zero.
In the below example, a[0] and a[1] help us get the elements present at index positions zero and one.

Program

print("creating a new list")
a=[9,8,7,6,5,4,3,2,1]
print("printing the elements of the list")
print(a)
print("printing the first element")
print(a[0])
print("printing the second element")
print(a[1])

Output

creating a new list
printing the elements of the list
[9, 8, 7, 6, 5, 4, 3, 2, 1]
printing the first element
9
printing the second element
8

What is slicing?

Slicing in python helps us get a specific range of elements from a collection, based on their positions.

The slicing syntax a[start:stop] returns the elements from index position start up to, but not including, index position stop. For example, a[1:3] lists the elements at positions one and two and neglects the element at index three.
Below is the official document:
https://docs.python.org/2.3/whatsnew/section-slices.html

Slicing in Python

Program

a=[9,8,7,6,5,4,3,2,1]
print("printing the first 3 elements of a list")
print(a[0:3])
print("other way to print first 3 elements of a list")
print(a[:3])

Output

printing the first 3 elements of a list
[9, 8, 7]
other way to print first 3 elements of a list
[9, 8, 7]

Positive and Negative indexing in slicing

There are two types of indexing available:

1) Positive Indexing
2) Negative Indexing

Positive Indexing

Program

a=[9,8,7,6,5,4,3,2,1]
print("printing second and third element in a list")
print(a[1:3])

Output

printing second and third element in a list
[8, 7]

Negative indexing

Program

a=[9,8,7,6,5,4,3,2,1]
print("printing the last element")
print(a[-1])
print("printing last three elements in a list")
print(a[-3:])
print("printing from second element till second last element")
print(a[1:-1])

Output

printing the last element
1
printing last three elements in a list
[3, 2, 1]
printing from second element till second last element
[8, 7, 6, 5, 4, 3]

Interview Q&A

How to reverse the elements present in a list?

Program

a=[9,8,7,6,5,4,3,2,1]
print("Reversing the elements")
print(a[::-1])

Output

Reversing the elements
[1, 2, 3, 4, 5, 6, 7, 8, 9]
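For completeness, a slice also accepts a third step value, a[start:stop:step]; the reverse trick a[::-1] is simply a step of -1 over the whole list. A short sketch:

a=[9,8,7,6,5,4,3,2,1]
print("printing every second element")
print(a[::2])
print("printing every second element starting from index one")
print(a[1::2])

Output

printing every second element
[9, 7, 5, 3, 1]
printing every second element starting from index one
[8, 6, 4, 2]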



exists and not exists in MySQL

In this post, let us learn about the options exists and not exists in MySQL.

To understand this better, we will make use of the following two tables throughout the post.

Both tables have the same columns, with some records in common and some not.

mysql> select * from Employee_details;
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    564 |   1400 |
| Computer |    123 |   2500 |
| Computer |    943 |   3200 |
| History  |    987 |   3450 |
| Economy  |    456 |   4500 |
| Economy  |    678 |   6700 |
| Economy  |    789 |   7200 |
| Computer |    324 |   2500 |
+----------+--------+--------+
8 rows in set (0.00 sec)
mysql> select * from Company;
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    564 |   1400 |
| Computer |    943 |   3200 |
| History  |    987 |   3450 |
| Economy  |    456 |   4500 |
| History  |    987 |   3450 |
+----------+--------+--------+
5 rows in set (0.00 sec)
How to use?

The NOT EXISTS option returns rows from the first table only when no matching row exists in the subquery.

The following records are present in Employee_details but not in Company.

mysql> select * from Employee_details a where not exists (select * from Company b where a.Emp_id = b.Emp_id);
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    123 |   2500 |
| Economy  |    678 |   6700 |
| Economy  |    789 |   7200 |
| Computer |    324 |   2500 |
+----------+--------+--------+
4 rows in set (0.00 sec)

The EXISTS option returns rows from the first table when a matching row exists in the subquery.

The following records are present in both tables, Employee_details and Company.

mysql> select * from Employee_details a where exists (select * from Company b where a.Emp_id = b.Emp_id);
+----------+--------+--------+
| Dep_name | Emp_id | Salary |
+----------+--------+--------+
| Computer |    564 |   1400 |
| Computer |    943 |   3200 |
| History  |    987 |   3450 |
| Economy  |    456 |   4500 |
+----------+--------+--------+
4 rows in set (0.00 sec)
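For comparison, the same filters can be written with IN and NOT IN, which behave the same here because Emp_id is never NULL in these tables:

-- Same result as the EXISTS query above
select * from Employee_details where Emp_id in (select Emp_id from Company);
-- Same result as the NOT EXISTS query, but note that NOT IN
-- returns no rows at all if the subquery yields any NULL
select * from Employee_details where Emp_id not in (select Emp_id from Company);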
Reference

https://dev.mysql.com/doc/refman/8.0/en/exists-and-not-exists-subqueries.html

Related Articles

https://beginnersbug.com/subtracting-dataframes-in-pyspark/


Transformation and action in pyspark

In this post, let us learn about transformation and action in pyspark.

Transformation

A transformation is one of the two kinds of operations available in pyspark.

It helps in creating a new RDD from an existing RDD.

Types of transformation

Narrow transformations (each output partition depends on a single input partition, so no shuffle is needed):

map, filter, flatMap, sample, union, coalesce, pipe

Wide transformations (data is shuffled across partitions):

groupByKey, reduceByKey, aggregateByKey, sortByKey, distinct, intersection, join, repartition, cartesian
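A minimal sketch contrasting the two kinds (the key-value pairs are made up):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# Narrow: map transforms each element on its own, no shuffle
pairs.map(lambda kv: (kv[0], kv[1] * 10)).collect()
# [('a', 10), ('b', 20), ('a', 30)]
# Wide: reduceByKey shuffles so that equal keys meet
pairs.reduceByKey(lambda x, y: x + y).collect()
# [('a', 4), ('b', 2)]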

What is an action?

Applying transformations builds a DAG (Directed Acyclic Graph) of operations, and the DAG grows as further transformations are applied.

But the operations execute only when an action is called; until then, nothing is computed.

Types of action

reduce, collect, take, count, first, countByKey, foreach, takeOrdered, takeSample, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile

Sample program

The following program filters elements based on a condition.

But the steps execute only when the collect action is called.

from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
spark = SparkSession(sc)
rdd1=sc.parallelize([1,2,3,4])
rdd1_first=rdd1.filter(lambda x : x<3)
rdd1_first.collect()
[1, 2]
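A few other actions applied to the same rdd1 (a quick sketch):

rdd1.count()
# 4
rdd1.reduce(lambda x, y: x + y)
# 10
rdd1.take(2)
# [1, 2]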

https://beginnersbug.com/rank-and-dense-rank-in-pyspark-dataframe/

https://beginnersbug.com/window-function-in-pyspark-with-example/


Difference between map and flatmap in pyspark

In this post, let us learn about the difference between map and flatmap in pyspark.

What is the difference between Map and Flatmap?

Map and flatMap are transformation operations available in pyspark.

The map transformation takes one input element from the RDD and produces exactly one output element, so the number of input elements equals the number of output elements.

A flatMap transformation can produce zero or more output elements for each input element, so the counts need not be equal. That is the difference between the two.

The example below makes this clear.

How to create an RDD?

With the below part of the code, an RDD is created using the parallelize method and its contents are viewed.

Let us discuss the topic with this RDD.

# Creating RDD using parallelize method
rdd1=sc.parallelize([1,2,3,4])
rdd1.collect()

The RDD contains the following 4 elements.

[1, 2, 3, 4]
How to apply map transformation?
# Applying map transformation
rdd1_map=rdd1.map(lambda x : x**2)
# Viewing the result
rdd1_map.collect()

In the below result, the output elements are the squares of the input elements, and the counts are equal.

[1, 4, 9, 16]
How to apply flatMap transformation?
# Applying flatmap transformation
rdd1_second=rdd1.flatMap(lambda x : (x**1,x**2))
# Viewing the result
rdd1_second.collect()

In the below result, each input element produced two output elements, so the counts are not equal as they were with the map transformation.

[1, 1, 2, 4, 3, 9, 4, 16]
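A more typical flatMap use is splitting lines into words, where map would return nested lists but flatMap flattens them (the sentences are made up):

rdd2 = sc.parallelize(["hello world", "map and flatmap"])
# map keeps one output per input, so the lists stay nested
rdd2.map(lambda line: line.split(" ")).collect()
# [['hello', 'world'], ['map', 'and', 'flatmap']]
# flatMap flattens the pieces into a single list of words
rdd2.flatMap(lambda line: line.split(" ")).collect()
# ['hello', 'world', 'map', 'and', 'flatmap']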

https://beginnersbug.com/transformation-and-action-in-pyspark/

https://beginnersbug.com/spark-sql-operation-in-pyspark/


Deleting blank lines in a file using UNIX

In this post, let us learn about deleting blank lines in a file using UNIX.

Creating File with blank lines

I have one file with the following content, which includes some blank/empty lines.

cat abc.txt
apple
Mango
Orange

Grapes

Peer

Pineapple

Using the below command, we can get the line count of the file, which includes the blank lines.

wc -l abc.txt
9 abc.txt

This command counts the number of blank lines in a file using the pattern ^$.

^ – beginning of the line

$ – ending of the line

grep -c ^$ abc.txt
3

The sed command deletes the empty lines by matching the same ^$ pattern; the -i option edits the file in place and d deletes each matching line.

sed -i '/^$/d' abc.txt

After the deletion, the File in the UNIX looks like below.

cat abc.txt
apple
Mango
Orange
Grapes
Peer
Pineapple

After deleting the blank lines in the file using UNIX, the line count confirms the removal.

cat abc.txt |wc -l
6
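If editing in place is not desired, the same cleanup can be done with grep or awk, writing to a new file (a small sketch; abc_clean.txt is an illustrative name):

# grep -v keeps only the lines that do not match ^$
grep -v '^$' abc.txt > abc_clean.txt
# awk prints only lines that have at least one field
awk 'NF' abc.txt > abc_clean.txt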


Returning multiple values in python

In this post, let us learn about returning multiple values in python.

Is it possible in python?

Yes, returning multiple values in python is possible. There are several ways to do it.

1. Return as Tuple:

A tuple stores values of various datatypes, but we cannot modify those values once it is created (tuples are immutable).

Please refer to the below link to understand it better.

https://beginnersbug.com/difference-between-list-and-tuple/

A tuple can be created with or without round brackets, as below.

# Sample program to return as Tuple without bracket
def first():
 a="string"
 b=10
 return a,b
a,b=first()
print(a,b)

The above python function returns more than one value.

string 10

The tuple can also be returned within brackets, as follows.

# Sample program to return as Tuple with bracket
def first():
 a="string"
 b=10
 return (a,b)
a,b=first()
print(a,b)
string 10
2. Return as List:

A list is also a collection, created with square brackets.

The list likewise helps us in returning multiple values in python.

It is mutable in nature.

# Sample program to return as List
def second():
 a="apple"
 b=50
 return [a,b]
second_list=second()
print(second_list)

We return multiple values from the python function using the list.

['apple', 50]

We can identify the datatype of the returned value by using the below command.
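A quick sketch of that check, using the built-in type function on the functions defined above:

print(type(second()))
# <class 'list'>
print(type(first()))
# <class 'tuple'>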