I/P Structure
BPID | customerID_Corporate
1 12
1 23
1 34
1 54
2 45
O/P
BPID customerID_Corporate
1 [12,23,34,54]
2 [45]
from pyspark.sql.functions import collect_list
data = [(1, 12), (1, 23), (1, 34), (1, 54), (2, 45)]
columns = ["BPID", "customerID_Corporate"]
df = spark.createDataFrame(data, columns)
grouped_df = df.groupBy("BPID").agg(collect_list("customerID_Corporate").alias("customerID_Corporate"))
How to join d1,d2 dataframe
joindf = d1.join(d2, d1.id == d2.id, "inner")
How to select required columns in dataframe
d4 = d3.select(col("name"), col("id"), col("sql"))
How to read parquet file
df3=spark.read.format('parquet').load("/FileStore/tables/storing/employee")
Syntax of broadcast join
from pyspark.sql.functions import broadcast
d3 = d1.join(broadcast(d2), d1.id == d2.id, "left")
WITH RankedEmployees AS (
    SELECT
        e.id AS employee_id,
        e.name AS employee_name,
        e.salary AS employee_salary,
        e.departmentId,
        d.name AS department_name,
        RANK() OVER (PARTITION BY e.departmentId ORDER BY e.salary DESC) AS emp_rank
    FROM
        Employee e
    INNER JOIN
        department d ON e.departmentId = d.id
)
SELECT
    employee_id,
    employee_name,
    employee_salary,
    departmentId,
    department_name
FROM
    RankedEmployees
WHERE
    emp_rank = 3;
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://database_url") \
    .option("dbtable", tablename) \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()
| id|category|value|
|  1|       A|  100|
|  2|       B|  200|
|  3|       C|  300|
|  4|       D|  400|
|  5|       E|  500|

| id|category|value|next_value|prev_value|
|  1|       A|  100|       200|      null|
|  2|       B|  200|       300|       100|
|  3|       C|  300|       400|       200|
|  4|       D|  400|       500|       300|
|  5|       E|  500|      null|       400|
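A minimal PySpark sketch that could produce the next_value/prev_value columns above with lead/lag window functions (assuming a df with the id, category, value columns shown):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("id")
df_with_lead_lag = (df
    .withColumn("next_value", F.lead("value").over(w))   # value from the following row
    .withColumn("prev_value", F.lag("value").over(w)))   # value from the preceding row
df_with_lead_lag.show()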
|first_name|middle_name|last_name|DOB       |
|James | |Smith |1991-04-01|
|Michael |Rose | |2000-05-19|
|Robert | |Williams |1978-09-05|
|Maria |Anne |Jones |1967-12-01|
|Jen |Mary |Brown |1980-02-17|
Find the duplicate records
Find the unique records
Delete duplicate records
Id name
10,B
10,B
20,C
30,D
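A rough PySpark sketch for the three tasks above, assuming the sample Id/name data:
from pyspark.sql import functions as F

data = [(10, "B"), (10, "B"), (20, "C"), (30, "D")]
df = spark.createDataFrame(data, ["Id", "name"])

# duplicate records: combinations that occur more than once
duplicates = df.groupBy("Id", "name").count().filter(F.col("count") > 1)

# unique records: combinations that occur exactly once
uniques = df.groupBy("Id", "name").count().filter(F.col("count") == 1)

# "delete" duplicates: keep one row per distinct combination
deduped = df.dropDuplicates(["Id", "name"])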
How to convert row to column in SQL and PySpark
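A hedged PySpark sketch of rows-to-columns using pivot (the sales data and column names here are made up for illustration); in SQL the same result is usually built with CASE WHEN aggregation or a PIVOT clause where supported:
from pyspark.sql import functions as F

data = [("2022", "Q1", 100), ("2022", "Q2", 200), ("2023", "Q1", 150)]
sales = spark.createDataFrame(data, ["year", "quarter", "amount"])

# each distinct quarter value becomes its own column
pivoted = sales.groupBy("year").pivot("quarter").agg(F.sum("amount"))
pivoted.show()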
Q - count of present days per employee
Empid,Datepresent
1 1|2|3
2 2|3|4|5
3 1|2|3|5|5|7
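A minimal sketch, assuming Datepresent is stored as a pipe-delimited string of day numbers:
from pyspark.sql import functions as F

data = [(1, "1|2|3"), (2, "2|3|4|5"), (3, "1|2|3|5|5|7")]
emp = spark.createDataFrame(data, ["Empid", "Datepresent"])

# split on "|" and count the elements; apply array_distinct first if repeated days should count once
present_count = emp.withColumn("days_present", F.size(F.split("Datepresent", "\\|")))
present_count.show()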
Q-cumulative query
Table: Sales
SaleID, Product,SellingDate, SalesAmount
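A sketch of a running total of SalesAmount ordered by SellingDate, assuming a hypothetical sales_df with the columns listed above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("SellingDate").rowsBetween(Window.unboundedPreceding, Window.currentRow)
cumulative = sales_df.withColumn("cumulative_sales", F.sum("SalesAmount").over(w))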
1st 2nd 3rd highest sal
Avg sal in each dept
Difference between avg sal in each dept
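A hedged PySpark sketch for the averages questions above, reading "difference" as each employee's salary minus their department average (emp, dept, and sal are assumed names); the nth-highest-salary SQL appears further down in these notes:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# average salary per department
avg_by_dept = emp.groupBy("dept").agg(F.avg("sal").alias("avg_sal"))

# per-row difference from the department average
w = Window.partitionBy("dept")
diff_from_avg = emp.withColumn("diff_from_dept_avg", F.col("sal") - F.avg("sal").over(w))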
Person Vehicle
A Cycle
A Bike
A Car
B Cycle
C Bike
C Car
Who has only a Cycle (PySpark code)
from pyspark.sql.functions import collect_set, array_contains, col
person_vehicles = df.groupBy("Person").agg(collect_set("Vehicle").alias("Vehicles"))
only_cycle_persons = person_vehicles.filter(~array_contains(col("Vehicles"), "Bike") & ~array_contains(col("Vehicles"), "Car"))
only_cycle_persons.show()
Table A
ID
1
0
0
1
0
NULL
NULL
TableB
ID
1
0
0
1
0
NULL
NULL
Count of rows from left join, inner join and full outer join (A joined with B on ID)
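A quick PySpark sketch to verify those counts; NULL IDs never satisfy the equality condition, so they survive only as unmatched rows in the outer joins:
data = [(1,), (0,), (0,), (1,), (0,), (None,), (None,)]
a = spark.createDataFrame(data, ["ID"])
b = spark.createDataFrame(data, ["ID"])

print(a.join(b, a.ID == b.ID, "inner").count())   # 2*2 + 3*3 = 13
print(a.join(b, a.ID == b.ID, "left").count())    # 13 matched + 2 unmatched NULL rows = 15
print(a.join(b, a.ID == b.ID, "full").count())    # 13 + 2 + 2 = 17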
EMP_ID 1 2 3
FN Johnson Jane John
LN Jane Bob Smith
SAL 50000 60000 55000
Convert column to row
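One way to read "convert column to row" is unpivoting; a hedged sketch using stack, with hypothetical FN/LN/SAL columns per employee based on the sample above:
from pyspark.sql import functions as F

data = [(1, "Johnson", "Jane", 50000), (2, "Jane", "Bob", 60000), (3, "John", "Smith", 55000)]
emp = spark.createDataFrame(data, ["EMP_ID", "FN", "LN", "SAL"])

# stack turns the chosen columns into (attribute, value) rows; SAL is cast so all values share one type
unpivoted = emp.select(
    "EMP_ID",
    F.expr("stack(3, 'FN', FN, 'LN', LN, 'SAL', cast(SAL as string)) as (attribute, value)"))
unpivoted.show()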
Id gender
10, male
20 male
30 male
40 female
50 female
Update male to female and female to male (swap the values)
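A sketch of the swap with when/otherwise (in SQL this is typically a CASE WHEN in an UPDATE), assuming a df with the gender column above:
from pyspark.sql import functions as F

swapped = df.withColumn(
    "gender",
    F.when(F.col("gender") == "male", "female")
     .when(F.col("gender") == "female", "male")
     .otherwise(F.col("gender")))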
Validate the PAN card using regexp
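A hedged sketch, assuming the common PAN format of 5 letters, 4 digits, 1 letter, and a hypothetical pancard column:
from pyspark.sql import functions as F

validated = df.withColumn("is_valid_pan", F.col("pancard").rlike("^[A-Z]{5}[0-9]{4}[A-Z]$"))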
Client and order tables: find clients that have not been active (no orders) in the last one year
Find the top 3 best customers
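A possible PySpark approach for the inactive-clients part, assuming hypothetical clients/orders DataFrames with client_id and order_date columns:
from pyspark.sql import functions as F

recent_orders = orders.filter(F.col("order_date") >= F.date_sub(F.current_date(), 365))
inactive_clients = clients.join(recent_orders, "client_id", "left_anti")   # clients with no order in the last year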
product
productname, prodid, qsold
customer
custid, prodid, custname
Select custname, productname, qsold (joining product and customer)
df = spark.sql("select b.custname, a.productname, a.qsold from product a inner join customer b on a.prodid = b.prodid")
track_id timestamp
abc 12:09:09 10:30:21
abc 12:09:09 10:30:19
abc 12:09:09 10:30:17
xyx 12:09:09 09:30:21
xyx 12:09:09 09:30:20
abc 12:09:09 08:30:21
select track_id, count(*) from table group by track_id
data = {'a': 1, [1, 2, 3]: 'a', 1: 1, (1, 2, 3): 1}   # raises TypeError: a list is unhashable and cannot be a dict key (a tuple can)
data = [10,20,74,30,20,56,78,55,40,40,74,50]
SELECT *
FROM (
    SELECT
        a.*,
        DENSE_RANK() OVER (ORDER BY sal DESC) AS sal_rank
    FROM
        emp a
) t
WHERE sal_rank = 3;
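A PySpark sketch of the same nth-highest-salary idea, assuming an emp DataFrame with a sal column:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.col("sal").desc())
third_highest = emp.withColumn("sal_rank", F.dense_rank().over(w)).filter(F.col("sal_rank") == 3)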
d1, d2
d3 = d1.join(d2, d1.id == d2.id, "left")
from pyspark.sql.functions import upper
df = spark.read.csv("path")
df1 = df.select(upper("name").alias("name_upper"))
df2 = df1.withColumnRenamed("name_upper", "upper_name")   # hypothetical old/new column names
df, df1
d3 = df.join(df1, "id", "inner")
df1 - deptid, empid
df2 - empid, city, sal
df3 = df1.join(df2, df1.empid == df2.empid, 'inner')
df3.createOrReplaceTempView("d3")
select * from
(select a.*, dense_rank() over(partition by deptid order by sal desc) as rnk from d3 a) t
where rnk = 4
df4 = spark.sql("""select * from
(select a.*, dense_rank() over(partition by deptid order by sal desc) as rnk from d3 a) t
where rnk = 4""")
batsman, runscore, matchnum
a 10 1
b 20 3
c 30 10
select * from
(select batsman, runscore, sum(runscore) as cnt from table group by batsman, runscore) a
where a.cnt = 0 and a.batsman = 'a';
a, b
a coming
12 am
schedule trigger - dataflow
ADF pipeline (dataflow - sink) - schedule trigger
patient_name, test_result, address, marketing, disease
select * from
(select a.*, dense_rank() over(order by test_result desc) as rnk from
(select patient_name, test_result, address, marketing, disease from person
where address like '%hyd%' and disease = 'covid' and trunc(test_result) in (1,2,3,4,5,6)) a) t
where rnk in (4,5,6);
Requirement: list all Hyderabad patients with covid and test results 1-5
Marks:
Id subject Marks year
1 M 75 1990
1 M 80 1991
1 M 82 1992
1 M 75 1993
2 M 78 1990
2 M 70 1991
2 M 80 1992
2 M 85 1993
1 P 68 1990
1 P 70 1991
1 P 80 1992
1 P 82 1993
select id, subject, year, marks - lag(marks) over(partition by id, subject order by year) as marks_diff_year from marks
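The same idea in PySpark, assuming a marks_df DataFrame with the Id, subject, Marks, year columns shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Id", "subject").orderBy("year")
marks_diff = marks_df.withColumn("marks_diff_year", F.col("Marks") - F.lag("Marks").over(w))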
Id mobileNo State
1 7889923456 Punjab
2 8376548761 Karnataka
3 7889923456 Andhra Pradesh
4 9390076544 Punjab
5 7889923456 TamilNadu
select * from mobile
where mobileno in (select mobileno from mobile group by mobileno having count(*) >= 2)
and state = 'Punjab';

select mobileno, count(*) from mobile
where state = 'Punjab'
group by mobileno
having count(*) >= 2;
Suppose you have a database with two tables, “orders” and “customers”.
The “orders” table contains order information including
customer ID, order date, and order amount.
The “customers” table contains customer information including
customer ID, name, and email.
Write a SQL query to find the top 3 customers with the highest total order amount in the month of January 2022
select customer_id, name, total_amount from (
    select c.customer_id, c.name, sum(b.order_amount) as total_amount,
           dense_rank() over (order by sum(b.order_amount) desc) as rnk
    from customers c join orders b on c.customer_id = b.customer_id
    where b.order_date >= '2022-01-01' and b.order_date < '2022-02-01'
    group by c.customer_id, c.name
) t
where rnk <= 3;
Suppose you have a large dataset with multiple columns,
and you want to perform a group by operation on a specific column,
but you also want to calculate the mean of another column for each group.
Additionally, you want to filter out any group where the mean is less than a certain value.
How would you do this efficiently in PySpark?
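A minimal PySpark sketch, assuming hypothetical group_col and value_col columns and a threshold of 100; aggregating first and filtering on the aggregate keeps everything in a single shuffle:
from pyspark.sql import functions as F

result = (df.groupBy("group_col")
            .agg(F.mean("value_col").alias("mean_value"))
            .filter(F.col("mean_value") >= 100))
result.show()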