I/P structure:

BPID | customerID_Corporate
1    | 12
1    | 23
1    | 34
1    | 54
2    | 45

O/P:

BPID | customerID_Corporate
1    | [12, 23, 34, 54]
2    | [45]

data = [(1, 12), (1, 23), (1, 34), (1, 54), (2, 45)]
columns = ["BPID", "customerID_Corporate"]

from pyspark.sql.functions import collect_list

df = spark.createDataFrame(data, columns)
grouped_df = df.groupBy("BPID").agg(collect_list("customerID_Corporate").alias("customerID_Corporate"))
How to join the d1 and d2 dataframes

joindf = d1.join(d2, d1.id == d2.id, "inner")

How to select required columns from a dataframe

from pyspark.sql.functions import col

d4 = d3.select(col("name"), col("id"), col("sql"))
How to read a parquet file

df3=spark.read.format('parquet').load("/FileStore/tables/storing/employee")
Syntax of broadcast join

from pyspark.sql.functions import broadcast

d3 = d1.join(broadcast(d2), d1.id == d2.id, "left")

WITH RankedEmployees AS (
    SELECT
        e.id AS employee_id,
        e.name AS employee_name,
        e.salary AS employee_salary,
        e.departmentId,
        d.name AS department_name,
        RANK() OVER (PARTITION BY e.departmentId ORDER BY e.salary DESC) AS emp_rank
    FROM
        Employee e
    INNER JOIN
        Department d ON e.departmentId = d.id
)
SELECT
    employee_id,
    employee_name,
    employee_salary,
    departmentId,
    department_name
FROM
    RankedEmployees
WHERE
    emp_rank = 3;

df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://database_url") \
    .option("dbtable", tablename) \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()

Input:

id | category | value
1  | A        | 100
2  | B        | 200
3  | C        | 300
4  | D        | 400
5  | E        | 500

Output:

id | category | value | next_value | prev_value
1  | A        | 100   | 200        | null
2  | B        | 200   | 300        | 100
3  | C        | 300   | 400        | 200
4  | D        | 400   | 500        | 300
5  | E        | 500   | null       | 400
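A minimal PySpark sketch for the next_value/prev_value columns with lead/lag (assuming the id/category/value table above is loaded as df):

from pyspark.sql import Window
from pyspark.sql.functions import lead, lag

w = Window.orderBy("id")   # single global window; fine for a small example

result = (df
    .withColumn("next_value", lead("value").over(w))   # value from the following row
    .withColumn("prev_value", lag("value").over(w)))   # value from the preceding row
result.show()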

first_name | middle_name | last_name | DOB
James      |             | Smith     | 1991-04-01
Michael    | Rose        |           | 2000-05-19
Robert     |             | Williams  | 1978-09-05
Maria      | Anne        | Jones     | 1967-12-01
Jen        | Mary        | Brown     | 1980-02-17

Find the duplicate records
Find the unique records
Delete duplicate records

Id | name
10 | B
10 | B
20 | C
30 | D
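A hedged PySpark sketch for all three, assuming the Id/name table above is loaded as df:

from pyspark.sql.functions import count, col

# duplicate records: (Id, name) pairs that occur more than once
dups = (df.groupBy("Id", "name")
          .agg(count("*").alias("cnt"))
          .filter(col("cnt") > 1))

# unique records: pairs that occur exactly once
uniques = (df.groupBy("Id", "name")
             .agg(count("*").alias("cnt"))
             .filter(col("cnt") == 1)
             .drop("cnt"))

# "delete" duplicates by keeping one row per pair
deduped = df.dropDuplicates(["Id", "name"])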

How to convert rows to columns in SQL and PySpark
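In PySpark the usual rows-to-columns tool is groupBy().pivot(); a minimal sketch that assumes id, subject and marks columns:

# one output column per distinct subject value, holding the summed marks
pivoted = df.groupBy("id").pivot("subject").sum("marks")
pivoted.show()

In SQL the same result is typically written with conditional aggregation (SUM(CASE WHEN ...)) or the database's PIVOT clause.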

Q- count of days present per employee

Empid   Datepresent
1       1|2|3
2       2|3|4|5
3       1|2|3|5|5|7
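A possible PySpark sketch, assuming Datepresent is a '|'-delimited string as shown (array_distinct handles the repeated 5 in the last row):

from pyspark.sql.functions import split, size, array_distinct, col

present_counts = df.withColumn(
    "days_present",
    size(array_distinct(split(col("Datepresent"), r"\|")))   # split on '|', dedupe, count
)
present_counts.select("Empid", "days_present").show()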

Q- cumulative total query

Table: Sales
Columns: SaleID, Product, SellingDate, SalesAmount
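A hedged sketch of a running total per product in PySpark, assuming the Sales table above is loaded as sales_df:

from pyspark.sql import Window
from pyspark.sql.functions import sum as _sum

w = (Window.partitionBy("Product")
            .orderBy("SellingDate")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

cumulative = sales_df.withColumn("running_total", _sum("SalesAmount").over(w))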

1st, 2nd and 3rd highest salary
Average salary in each department
Difference between avg salary in each department
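A hedged PySpark sketch for these, reading the last item as each employee's difference from their department average and assuming an emp dataframe with dept and sal columns:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank, avg, col

# 1st, 2nd and 3rd highest salaries overall
w_all = Window.orderBy(col("sal").desc())
top3 = emp.withColumn("rnk", dense_rank().over(w_all)).filter(col("rnk") <= 3)

# average salary in each department
dept_avg = emp.groupBy("dept").agg(avg("sal").alias("avg_sal"))

# each employee's salary minus their department's average
w_dept = Window.partitionBy("dept")
diff = emp.withColumn("diff_from_avg", col("sal") - avg("sal").over(w_dept))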

Person | Vehicle
A      | Cycle
A      | Bike
A      | Car
B      | Cycle
C      | Bike
C      | Car

Who has only a cycle? (PySpark code)

from pyspark.sql.functions import collect_set, array_contains, col

person_vehicles = df.groupBy("Person").agg(collect_set("Vehicle").alias("Vehicles"))

only_cycle_persons = person_vehicles.filter(
    ~array_contains(col("Vehicles"), "Bike") & ~array_contains(col("Vehicles"), "Car")
)

only_cycle_persons.show()

Table A (ID): 1, 0, 0, 1, 0, NULL, NULL
Table B (ID): 1, 0, 0, 1, 0, NULL, NULL

Count of rows returned by a left join, inner join and full outer join on ID
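A hedged PySpark sketch that builds both tables and counts each join; since NULL never matches NULL in a join condition, the expected counts for this data are inner = 13, left = 15, full outer = 17:

vals = [1, 0, 0, 1, 0, None, None]
a = spark.createDataFrame([(v,) for v in vals], "ID int")
b = spark.createDataFrame([(v,) for v in vals], "ID int")

print(a.join(b, a.ID == b.ID, "inner").count())   # 4 (for the 1s) + 9 (for the 0s) = 13
print(a.join(b, a.ID == b.ID, "left").count())    # 13 + 2 unmatched NULL rows from A = 15
print(a.join(b, a.ID == b.ID, "full").count())    # 13 + 2 from A + 2 from B = 17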

EMP_ID | 1       | 2     | 3
FN     | Johnson | Jane  | John
LN     | Jane    | Bob   | Smith
SAL    | 50000   | 60000 | 55000

Convert columns to rows
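If the data is stored the usual way (one row per employee with FN, LN and SAL columns), PySpark's stack() expression is one way to turn those columns into rows; a hedged sketch:

from pyspark.sql.functions import expr

unpivoted = df.select(
    "EMP_ID",
    # 3 pairs of (label, value); SAL is cast so all values share one string type
    expr("stack(3, 'FN', FN, 'LN', LN, 'SAL', cast(SAL as string)) as (attribute, value)")
)
unpivoted.show()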

Id | gender
10 | male
20 | male
30 | male
40 | female
50 | female

Update male to female and female to male
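A hedged sketch of the swap with when/otherwise in PySpark (in SQL the same idea is an UPDATE with a CASE expression):

from pyspark.sql.functions import when, col

swapped = df.withColumn(
    "gender",
    when(col("gender") == "male", "female")
     .when(col("gender") == "female", "male")
     .otherwise(col("gender"))
)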

Validate the PAN card using regexp
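A possible check, assuming the standard PAN layout of 5 letters, 4 digits and 1 letter, and a pancard column (the column name is an assumption):

from pyspark.sql.functions import col

pan_pattern = "^[A-Z]{5}[0-9]{4}[A-Z]$"
valid_pans = df.filter(col("pancard").rlike(pan_pattern))
invalid_pans = df.filter(~col("pancard").rlike(pan_pattern))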

Client and order tables: find clients that have not been active in the last one year
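One hedged way to express it in Spark SQL, assuming registered clients(client_id) and orders(client_id, order_date) tables:

inactive = spark.sql("""
    SELECT c.client_id
    FROM clients c
    LEFT JOIN orders o
      ON o.client_id = c.client_id
     AND o.order_date >= add_months(current_date(), -12)
    WHERE o.client_id IS NULL          -- no order in the last 12 months
""")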

Find the top 3 best customers

product
productname, prodid, qsold

customer
custid, prodid, custname

Select custname, productname, qsold

df = spark.sql("select b.custname, a.productname, a.qsold from product a inner join customer b on a.prodid = b.prodid")

track_id timestamp
abc 12:09:09 10:30:21
abc 12:09:09 10:30:19
abc 12:09:09 10:30:17
xyx 12:09:09 09:30:21
xyx 12:09:09 09:30:20
abc 12:09:09 08:30:21

select track_id, count(*) from table group by track_id

data = {'a': 1, [1, 2, 3]: 'a', 1: 1, (1, 2, 3): 1}   # raises TypeError: a list is unhashable, so it cannot be a dict key (the tuple key is fine)

data = [10,20,74,30,20,56,78,55,40,40,74,50]

SELECT *
FROM (
    SELECT
        a.*,
        DENSE_RANK() OVER (ORDER BY sal DESC) AS rnk
    FROM
        emp a
) t
WHERE rnk = 3;

d1, d2

d3 = d1.join(d2, d1.id == d2.id, "left")

df = spark.read.csv("path")

from pyspark.sql.functions import upper
df1 = df.select(upper("name").alias("name"))

df_renamed = df.withColumnRenamed("old_name", "new_name")   # column names here are placeholders

df, df1

d3 = df.join(d2, "id", "inner")

df1 - deptid, empid
df2 - empid, city, sal

df3 = df1.join(df2, df1.empid == df2.empid, 'inner')

4th highest salary per department on the joined data:

select * from
(select a.*, dense_rank() over (partition by deptid order by sal desc) as rnk from d3) a
where rnk = 4

df3.createOrReplaceTempView("d3")
d4 = spark.sql("""select * from
(select a.*, dense_rank() over (partition by deptid order by sal desc) as rnk from d3) a
where rnk = 4""")

batsman, runscore, matchnum

a 10 1
b 20 3
c 30 10

select * from
(select batsman, runscore, sum(runscore) as cnt from table group by batsman, runscore) a
where a.cnt = 0 and a.batsman = 'a';

a, b
a coming at 12 am
schedule trigger - dataflow
ADF pipeline (dataflow - sink) - trigger

patient_name, test_result, address, marketing, disease

select * from
    (select a.*, dense_rank() over (order by test_result desc) as rnk
     from (select patient_name, test_result, address, marketing, disease
           from person
           where address like '%hyd%' and disease = 'covid'
             and trunc(test_result) in (1,2,3,4,5,6)) a
    ) b
where rnk in (4,5,6);

Requirement: list all Hyderabad covid patients with test_result 1-5

Marks:

Id subject Marks year
1 M 75 1990
1 M 80 1991
1 M 82 1992
1 M 75 1993
2 M 78 1990
2 M 70 1991
2 M 80 1992
2 M 85 1993
1 P 68 1990
1 P 70 1991
1 P 80 1992
1 P 82 1993

select id, subject, marks - lag(marks) over (partition by id, subject order by year) as marks_diff_year from marks
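The same year-over-year difference in PySpark, as a hedged sketch assuming the Marks table is loaded as marks_df:

from pyspark.sql import Window
from pyspark.sql.functions import lag, col

w = Window.partitionBy("Id", "subject").orderBy("year")
marks_diff = marks_df.withColumn("marks_diff_year", col("Marks") - lag("Marks").over(w))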

Id mobileNo State
1 7889923456 Punjab
2 8376548761 Karnataka
3 7889923456 Andhra Pradesh
4 9390076544 Punjab
5 7889923456 TamilNadu

select * from mobile
where mobileno in (select mobileno from mobile group by mobileno having count(*) >= 2)
  and state = 'Punjab';

select mobileno, count(*) from mobile where state = 'Punjab' group by mobileno having count(*) >= 2;

Suppose you have a database with two tables, "orders" and "customers". The "orders" table contains order information including customer ID, order date, and order amount. The "customers" table contains customer information including customer ID, name, and email.

Write a SQL query to find the top 3 customers with the highest total order amount in the month of January 2022.

SELECT customer_id, name, total_amount
FROM (
    SELECT a.customer_id, a.name, SUM(b.order_amount) AS total_amount,
           DENSE_RANK() OVER (ORDER BY SUM(b.order_amount) DESC) AS rnk
    FROM customers a
    JOIN orders b ON a.customer_id = b.customer_id
    WHERE b.order_date >= '2022-01-01' AND b.order_date < '2022-02-01'
    GROUP BY a.customer_id, a.name
) t
WHERE rnk <= 3;


Suppose you have a large dataset with multiple columns, and you want to perform a group by operation on a specific column, but you also want to calculate the mean of another column for each group. Additionally, you want to filter out any group where the mean is less than a certain value. How would you do this efficiently in PySpark?
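A hedged PySpark sketch, with group_col, value_col and the threshold 100 as stand-in names:

from pyspark.sql.functions import mean, col

result = (df
    .groupBy("group_col")                          # group on the chosen column
    .agg(mean("value_col").alias("mean_value"))    # mean of the other column per group
    .filter(col("mean_value") >= 100))             # drop groups below the threshold

Aggregating before filtering keeps the work to a single shuffle, which is the efficient way to do it.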
