Spark architecture refers to the underlying framework and components that enable Apache Spark, a fast and distributed data processing engine, to execute tasks and computations across clusters of machines. The architecture of Spark is designed to support parallel and distributed processing of large-scale datasets efficiently. Here are the key components of Spark architecture:
Driver Program:
The driver program is the main control process that manages the execution of a Spark application. It runs the user's main function and coordinates the execution of tasks on the cluster.
Cluster Manager:
Spark can run on various cluster managers such as Apache Mesos, Hadoop YARN, or in standalone mode. The cluster manager allocates resources to different Spark applications and monitors their execution.
Executors:
Executors are worker processes responsible for executing tasks as part of a Spark application. They run on the cluster's worker nodes, are launched via the cluster manager, and run tasks in parallel across the cluster.
Tasks:
Tasks are units of work that perform computations on data partitions. Spark breaks down transformations and actions on RDDs (Resilient Distributed Datasets) into tasks that are executed in parallel across executors.
Resilient Distributed Datasets (RDDs):
RDDs are the fundamental data abstraction in Spark, representing distributed collections of objects that are partitioned across the cluster.
RDDs support fault tolerance through lineage, allowing lost data partitions to be reconstructed using transformations on other RDDs.
DataFrame and Dataset API:
Spark provides higher-level abstractions like DataFrames and Datasets for working with structured data.
These APIs offer optimizations for processing structured data and provide a more user-friendly interface compared to RDDs.
Spark Core: The foundational module providing distributed task scheduling, memory management, and basic I/O functionalities.
Spark SQL: Provides support for querying structured data using SQL syntax and integrates seamlessly with DataFrame and Dataset APIs.
Spark Streaming: Enables real-time stream processing of data streams, allowing for the processing of live data.
MLlib (Machine Learning Library): A library for scalable machine learning algorithms and utilities.
GraphX: A library for graph processing and analysis.
Spark RDDs: RDDs are the most basic data abstraction in Spark. They are immutable, distributed collections of data. RDDs offer fine-grained control over data processing, but they can be complex to use.
Spark DataFrames: DataFrames are a higher-level abstraction than RDDs. They organize structured data into named columns, similar to a table in a database. DataFrames are easier to use than RDDs, but they offer less fine-grained control over data processing.
A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Every transformation we apply creates a new RDD.
An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
Transformations are lazy in nature, i.e., they are executed only when we call an action; they are not executed immediately. The two most basic types of transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is needed to calculate the result. Narrow transformations result from operations such as map() and filter().
Narrow Transformations in PySpark:
map(func): Applies the function func to each element of the RDD or DataFrame independently and returns a new RDD or DataFrame with the transformed elements.
flatMap(func): Similar to map(), but func returns an iterator for each input record, and the resulting RDD or DataFrame is formed by flattening these iterators.
filter(func): Selects the elements of the RDD or DataFrame that satisfy the predicate func.
mapPartitions(func): Applies func to each partition of the RDD; func receives an iterator over the elements of the partition and can yield an arbitrary number of elements.
mapPartitionsWithIndex(func): Similar to mapPartitions(), but func also takes the index of the partition as an argument.
flatMapValues(func): Applies func to each value of a key-value pair RDD; func returns an iterator for each value, and the resulting RDD consists of the flattened values paired with their original keys.
mapValues(func): Applies func to each value of a key-value pair RDD while keeping the keys unchanged.
sample(withReplacement, fraction, seed): Samples a fraction of the data with or without replacement, optionally using a specified random seed.
sortBy(func, ascending): Sorts the RDD by the result of applying func to each element; the ascending parameter determines the sort order.
zip(otherDataset): Zips the RDD with another RDD, creating a new RDD of pairs; both RDDs must have the same number of partitions and the same number of elements per partition.
Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so data must be shuffled across the cluster. Wide transformations result from operations such as groupByKey() and reduceByKey().
Wide Transformations in PySpark:
groupBy(): Groups the DataFrame using the specified columns. Data is shuffled across partitions so that rows with the same grouping values come together for aggregation.
groupByKey(): Similar to groupBy() but operates on RDDs. It shuffles the data across partitions so that all values for a given key end up together.
reduceByKey(): Applies a function to the values of each key, reducing the elements with the same key to a single value.
aggregate(): Aggregates the elements of each partition of the RDD using a given associative function and a neutral “zero value” (initial value).
aggregateByKey(): Similar to aggregate() but works on key-value pair RDDs, allowing aggregation of the values for each key.
distinct(): Returns a new DataFrame with distinct rows. It requires shuffling of data across partitions to identify and eliminate duplicates.
join(): Joins two DataFrames or RDDs based on a common key. It involves shuffling data across partitions to bring together matching records from different partitions.
repartition(): Repartitions the DataFrame or RDD by redistributing the data across partitions. It involves shuffling data across the cluster.
Additional Wide Transformations:
sortByKey(): Sorts a key-value RDD by key using range partitioning. It can lead to significant data movement across partitions.
cogroup(): Groups the values from both RDDs sharing the same key, similar to a full outer join. It shuffles data across partitions to group values by key.
1. FailFast Mode:
Used when you want strict data validation.
If any data issues (e.g., malformed records or schema mismatches) are encountered, it raises an exception immediately, preventing further processing of the data.
df_failfast = spark.read.option("mode", "FAILFAST").csv("data_source.csv")
2. DropMalformed Mode:
Used when you want to continue processing despite malformed records.
It skips malformed records, ensuring that only valid data is loaded.
df_dropmalformed = spark.read.option("mode", "DROPMALFORMED").csv("data_source.csv")
3. Permissive Mode:
Used when you want to capture as much data as possible; corrupted values are allowed through without interrupting the load.
It loads all records, setting malformed fields to null instead of failing.
df_permissive = spark.read.option("mode", "PERMISSIVE").csv("data_source.csv")
POINTS TO REMEMBER:
By default the read mode is PERMISSIVE, which allows capturing the entire data.
It also provides a way to surface the corrupted records while reading the data, but we have to define the schema with an additional column "_corrupt_record", which will capture the corrupted records.
To store the bad records at some location, the additional option "badRecordsPath" (a Databricks feature) is used to define the directory location.
Advantages of Parquet file
Will Spark clusters be in running status 24/7 in your project?
Avoiding skewness/uneven data distribution while writing data
Bad records in databricks
can we modify spark dataframe
Data size in blob changes every day; how to improve the performance?
cluster configuration in azure data bricks?
connecting sql in databricks
Databricks cluster creation
Databricks cluster driver and worker node details
Databricks cluster type
databricks connections to adf
Databricks git integration
databricks how you can use excel file
Databricks how you configure cluster
Delta table advantage
did you face any issues with performance in your coding and how did you fix it?
Difference between coalesce & repartition
Difference between persist and cache?
Difference between row and columnar data structures
Difference between spark and Hadoop map reduce?
difference between spark dataframe and pandas dataframe
do you know the difference between repartition and coalesce
do you use pandas for your transformation?
External sources accessed from databricks.
File 1 and File 2 has different schema, how we will merge both in single dataframe
file formats you have worked in azure data bricks?
gather statistics in pyspark
Get the datatypes of dataframe and preserve them into a list
Get the number of rows on each file in a dataframe?
Have you connected sql from databricks. if yes, how do you establish connection?
Have you ever used high concurrency cluster? if yes, why?
have you seen in your application any place where there is a data quality issue (like null values etc.)
How can we read pyspark notebook activity output into another notebook in adf
How can you create secret key & scope in databricks?
How to convert a date format to another format in spark
How do you drop duplicates in Azure databricks pyspark
How do you handle concurrency in your project
how do you mask data in azure
How do you select 30 column names dynamically in pyspark
How do you separate column names in pyspark with commas
How do you handle delta in your project
How to load multiple files into one file or merge them into one file
how to maintain the code while working with others' notebooks
how many nodes you have used in your cluster
How to add a new column in csv and assign a constant value
How to access Azure data lake storage from databricks?
How to access pyspark DF from SQL cell within notebook
How to add a partition in data frame?
how to add global variable in notebook
How to add source filename in one of column in dataframe?
How to call one notebook from another notebook?
How to connect Azure SQL from databricks
How to create a schema and assign it to a csv
How to decide configuration of Spark?
How to fill null values with some values
How to handle bad records in pyspark
How to handle Bad records in spark? Types
how to implement incremental load dynamically in pyspark
How to pass the runtime parameters to notebook
How to read a substring from a list; how to sort a list
how to read and write json
How to read from table all the columns and write to json
How to read the element from dataframe and convert it into variables
How to remove duplicates in spark
How to remove the null values in the dataframe
How to rename a column in csv? What are the different ways
How to send the notebook run status to adf from databricks notebook
how to skip the columns
How to sort a list; difference b/w sort and sorted
How to use our own schema while creating df
How to work on nested json
how to work with excel source in azure data bricks?
How to write data with a specific file name in databricks
how you use secret scope in databricks.
How we can see the notebook system generated Logs in Azure databricks
how you have done the delta load in your project.
How you read json file and convert to parquet
if you want to compare 2 dataframes how do you do that
if you wanted to find a specific pattern of a string and replace that what function do you use.
Interactive Databricks Cluster
Job cluster vs high concurrency cluster
left join of 2 dataframes
Logging in azure databricks notebooks.
Memory bottleneck on Spark executors
MERGE in azure data bricks?
Minimize shuffling of data while joining
Mount path creation
normal databricks vs azure databricks
Number of libraries you work with in databricks
Orchestrating databricks notebooks.
Parameter pass to databricks notebook from ADF
Parameter read in databricks(dbutils)
Performance tuning in Spark
pickling and unpickling
Query plan check in databricks
SCD type 2
Scenario based questions on performance tuning.
secret Scopes and azure keyvault?
spark context and spark config
syntax of broadcast join
Tiers of storage account
Types of dbutils
What all formats you have worked in your project pyspark
What are the activities you have done in Azure DataBricks
what are the file systems databricks support
What are the magic commands in databricks?
What are the optimisation you do your project for pyspark jobs
What are the transformations you have done in the azure data bricks?
what are the types of Joins in pyspark?
What is a Databricks Runtime?
what is accumulator in pyspark and broadcast variable
what is adaptive query execution in pyspark and use of it
What is azure active directory
what are broadcast variables and accumulator variables?
what is catalyst optimizer?
what is coalesce and repartition?
What is DAG?
What is data governance?
what is data governance (unity catalog)
What is data lake architecture and why we uses it?
what is data skewness (how to solve)
what is delta table
What is external table and managed /internal tables?
What is federation in Azure?
what is global temporary view and temporary view in pyspark?
What is inferschema in pyspark?
What is mount point & its implementation?
what is parquet file and features
What is parquet file and features?
What is Runtime in databricks?
what is secret scope in Azure data bricks?
What is spark cluster and how it works? And also why spark?
what is spark version and features
what is the configuration of cluster you used?
what is the delta lake file format?
What is the difference between interactive cluster and job cluster?
What is the meaning of delta in spark?
what is the process once you submit the job in pyspark
what is your cluster configuration?
what languages have you used in databricks transformation
what type of version tool you use in your project for ADB notebook
What was the size of the data and time required to load whole data?
Which partition strategy you used for data lake gen2
which type of cluster you have used in your project?
while and while loop
why we need to convert parquet file into delta file format?
Why you used delta ,What is delta
How to update a file
What is secret scope in data bricks
What is Z order in pyspark and use of it
What is vacuum
How to reduce data redundancy in pyspark
Syntax of join in pyspark
Write a simple transformation in pyspark
What is wide and narrow transformation
About data warehousing concepts? SCD and types
About Synapse Analytics
about your project, architecture and how it is implemented?
About your project? And what are the challenges & issues while working with Azure?
ADF copy activity , how to handle bad rows
ADF copy activity performance check
ADF git integration
ADLS Gen1 vs ADLS Gen2 vs BLOB
ADLS gen1 vs gen2
Also Scd types of data loading need to check…
Any exp on Azure Analysis Service
any experience in CICD?
Any experience in Data Extraction from API If yes what will be data format
Architecture and implementation of previous project
Asked to query on Lag function
Asked to write SQL query to find the credit card is valid or not
Azure analysis services concepts?
Azure app insights integration and logging mechanism
Azure key vault
Azure Services used (ADF, Databricks, SHIR)
Azure Streaming experience?
Blob storage vs Azure Data Lake
By using polybase how do you access the data from data lake storage account?
can we run pipelines through database triggers?
Can we update primary key in synapse sql dw
Can you configure the two different oracle DB instances in linked service
CI CD concepts and what are the steps we need to follow in detail while moving the code
CI/CD pipeline flow?
Clustered index vs Non-Clustered Index
Code/performance fine tuning aspects
Complex join queries
Copy multiple files from blob to blob in Azure Data Factory
Copy Multiple Tables in Bulk using Lkp and Foreach
CTE and derived table
CTE and derived table and its scope
CTE and Temp table
Data consistency in azure (how to compare source and destination)
Data Modelling Steps
Data processing is of 2 types
Data write to DWH from ADLS Delta
Detailed discussion on ADF components
Details of project implemented using ADF.
DEV Ops and implementation?
DevOps build and release pipeline
Difference between sql server and azure synapse
Diff bw SCD 1,2,3
Diff bw Synapse workspace and sql dw
Difference between Azure and Self-hosted integration runtime in ADF?
Difference between Azure Data Warehouse and Azure SQL
Difference between database and data warehouse
Difference between Iot hub and event hub.
Difference between release and build pipeline
Difference between Schedule trigger and Tumbling window trigger
Difference between secret and password in azure key vault.
Difference between sql server and azure synapse
Different activities used in ADF
Different data layers and logic behind it in Data Lake
Error exception handling (Try catch paradigm) using ADF
Error handling in ADF
ETL transformation method IN ADB
Experience in creating CI/CD pipelines in azure DevOps.
Experience in logic apps and azure functions.
Explain the project architecture in detail & its components
Export CSV file from on prem to cloud using ADF
fact and dimension
Facts and dimension and their types
Full load and incremental load
full load or full refresh
Get File Names from Source Folder Dynamically in Azure Data Factory
have you used ODBC connection from ADF?
Have you worked on web Application and what are the tools?
Have you worked or applied any own logic or function, which is not present in Azure?
How can we connect SFTP server
how can we implement SCD 3
How can you deploy your code to another production region using CI/CD?
How can you read excel files in databricks & what library is required for it? And where can you add those libraries?
How do you implement row level security in SQL DB and DW?
how do you implement SCD type 2 in pyspark (ie, how to update and insert data)
How do you pass parameters from data factory to databricks in ADF
How do you pass parameters in ADF pipeline
How do you select only 10 tables from the DB and dynamically copy those tables to the target
How increment build implemented in DevOps.
how to unzip a folder in sftp
How to add custom jar in databricks and install
How to avoid duplicates and null while entering in data warehouse
how to compare source and destination records in azure
How to connect any url through
how to create devops pipeline.
How to create link service for on premises SQL?
How to create pipeline in devops
How to create Self-hosted integration runtime?
How to create service principle
How to deploy ADF using ARM template?
How to deploy DWH
How to do cherry picking in CI/CD pipelines.
How to do Upsert in transformations?
how to do validation in the pipelines?
how to establish connection between sftp to adls 2?
How to extract data from JSON to flatten hierarchy
how to find latest file in folder from sftp
How to find out the duplicate records from table?
how to handle missing records in pipelines?
how to implement Logic apps in adf?
how to implement the scd type 3
How to import all csv from one folder and dump it into SQL
How to improve the copy performance in ADF
how to know the bad data in copy activity and how to set it
how to load directories and subdirectories in destination folder
How to provide access to ADLS (ACL vs RBAC)
How to read 10 file from blob with different schema and dims and write in data warehouse how do you map
How to read 50 csv files from ADL path and write back to ADL path in different format and schema.
How to read the nested json and few scenarios onto it
How Tumbling windows work
how you designed the end to end process
how you have worked on multiline json or nested json files?
How you will rephrase the existing facts and dimensions model (what will be the approach)
I am giving you a scenario: from an FTP location you have all source files, and your source files accumulate on a daily basis. Using ADF you have to pick up only the incremental files; in the ADF pipeline you have to pick and copy only the latest file.
if all transformations were in databricks notebooks, how did you create the entire orchestration and the architecture
If I am passing 4 elements in lookup activity, how to fetch the 3rd element from the array
If I delete any files from ADLS, can I recover it or not? (Is it possible?)
if i want to use the output of the previous activity in the following activity of the pipeline what kind of configuration will you do.
If one pipeline is scheduled at every 5 mins interval, and it goes beyond 5 mins, and another pipeline with the same object got triggered, how will you avoid overlapping of pipelines
If source location was ADLS were you picking the entire data or only delta files. what was the extraction mechanism?
If you are loading data to the databricks in ADF, if the databricks activity failed saying that cluster is terminated. how will you handle it
in which scenario will you use binary copy over csv or delimiited copy
Incremental data copy logic using ADF
Incremental logic behind the ADF
Index for fact and dim table
Is it good to migrate? Points
List of different external sources connected through ADF.
List of libraries used in python programming.
Logic Apps scenario to trigger the email whenever face the issue in Power BI report
Logic Apps scenario use cases in project
Logic Apps step by step procedure
managed instance in azure
MD5 hash key generation (unique key)
Merge statement in SQL
oltp and olap
On-Prem we have SQL DW DB, migrate to Azure Cloud
Partition and bucketing in DWH
Power BI concepts
Power BI concepts; and if a Power BI report has a performance issue, how can we optimize it
PySpark Code review?
Question on ADF – DIU
Rate yourself in python, spark, Azure?
reading 100 csv files in adf
Scenario based questions on Data Modelling.
self hosted integration runtime
Snowflake schema vs star schema
Sql queries , tables and temp , and CTE syntax
the incremental approach
tumbling and event trigger
Two methods difference in python- and syntax
Type of SQL Triggers
Types of indexes
Upsert operation(Update and Insert)
Use of identity(1,1) in Azure sql DB
Validation of data
Variable in DevOps pipeline
Walk through various steps required to create a data ingestion pipeline in ADF.
Walkthrough various steps in creating a release pipeline in DevOps.
what are activities in pipeline?
What are the activities and transformations worked in ADF?
What are the different ways to interact with or get data (files) from ADLS to ADBK?
what are the different ways you have used to upload a file in ADLS… and ways to pull a file from ADLS
What are the distribution methods in synapse analytics and difference between them
what are the hadoop file formats?
What are the incremental approach in DevOps?
what are the integration runtimes you have worked with
What are the parameters required for the linked services for the on premise SQL server and others?
What are the permissions in ADLS gen2 like (ACL, others) (read, write, execute)?
What are the points you will take care on the code review?
What are the source system you have worked?
What are the steps required to migrate?
What are the transformation in adf
what are the types of integration runtimes?
What Azure services are used to migrate and what are the ways to bring the data into Cloud?
What is azure function?
what is data units in ADF and how you set it?
what is data modeling?
What is dataset and how can you pass dynamic parameter to dataset and linked services?
what is DBU?
What is difference between Datawarehouse and Data Lake Gen2
What is logic app and how to integrate in ADF?
What is meant by cherry pick
What is path global filter?
What is polybase and what are the steps to create it? Explain all the scenarios when you use databricks, data warehouse, and data factory
what is the difference between azure blob storage and azure data lake gen2?
What is the difference between single tenant and multi-tenant?
what is the file size, data volume you have worked with
What is the IR? And how can you deal with source system & how to connect it?
what is your source?
what is Linked service? DATASET
What was the deployment procedure for you? Is it in production or in dev stage? How do you do deployment?
when I pull in a file, if I have to pull in the last modified date, what activity will I use.
Why synapse Data Warehouse used in your project (Dedicated SQL pool)
what is fact and dimension
what is factless table
what is annotation in adf pipeline
how to pass the parameters to notebook
wide and narrow
control flows in adf
what is data redundancy
explain plan for dataframe
what is serialization and deserialization in spark
size of broadcast join
what is sort merge join and how to disable it
shuffle partition vs repartition
system-assigned managed identity vs user-assigned managed identity
managed identity vs service principal
cte vs temptable
view vs procedure
serverless cluster
orderBy vs sortBy