Spark architecture refers to the underlying framework and components that enable Apache Spark, a fast and distributed data processing engine, to execute tasks and computations across clusters of machines. The architecture of Spark is designed to support parallel and distributed processing of large-scale datasets efficiently. Here are the key components of Spark architecture:
Driver Program:
The driver program is the main control process that manages the execution of a Spark application. It runs the user's main function and coordinates the execution of tasks on the cluster.
Cluster Manager:
Spark can run on various cluster managers such as Apache Mesos, Hadoop YARN, or in standalone mode. The cluster manager allocates resources to different Spark applications and monitors their execution.
Executors:
Executors are worker processes responsible for executing tasks as part of a Spark application. They run on the cluster's worker nodes, are launched via the cluster manager, and run tasks in parallel across the cluster.
Tasks:
Tasks are units of work that perform computations on data partitions. Spark breaks down transformations and actions on RDDs (Resilient Distributed Datasets) into tasks that are executed in parallel across executors.
Resilient Distributed Datasets (RDDs):
RDDs are the fundamental data abstraction in Spark, representing distributed collections of objects that are partitioned across the cluster.
RDDs support fault tolerance through lineage, allowing lost data partitions to be reconstructed using transformations on other RDDs.
DataFrame and Dataset API:
Spark provides higher-level abstractions like DataFrames and Datasets for working with structured data.
These APIs offer optimizations for processing structured data and provide a more user-friendly interface compared to RDDs.
Spark Core: The foundational module providing distributed task scheduling, memory management, and basic I/O functionalities.
Spark SQL: Provides support for querying structured data using SQL syntax and integrates seamlessly with DataFrame and Dataset APIs.
Spark Streaming: Enables real-time stream processing of data streams, allowing for the processing of live data.
MLlib (Machine Learning Library): A library for scalable machine learning algorithms and utilities.
GraphX: A library for graph processing and analysis.
Spark RDDs: RDDs are the most basic data abstraction in Spark. They are immutable, distributed collections of data. RDDs offer fine-grained control over data processing, but they can be complex to use.
Spark DataFrames: DataFrames are a higher-level abstraction than RDDs. They organize structured data into named columns, similar to a table in a database. DataFrames are easier to use than RDDs, but they offer less fine-grained control over data processing.
A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Every transformation we apply creates a new RDD.
An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
Transformations are lazy in nature, i.e., they are executed only when we call an action; they are not executed immediately. The two most basic types of transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is needed to calculate the result. Narrow transformations result from operations such as map() and filter().
Narrow Transformations in PySpark:
map(func): Applies the function func to each element of the RDD or DataFrame independently and returns a new RDD or DataFrame with the transformed elements.
flatMap(func): Similar to map(), but func returns an iterator for each input record, and the resulting RDD or DataFrame is formed by flattening these iterators.
filter(func): Selects the elements of the RDD or DataFrame that satisfy the predicate func.
mapPartitions(func): Applies func to each partition of the RDD; func receives an iterator over the elements of the partition and can yield an arbitrary number of elements.
mapPartitionsWithIndex(func): Similar to mapPartitions(), but func also takes the index of the partition as an argument.
flatMapValues(func): Applies func to each value of a key-value pair RDD; func returns an iterator for each value, and the resulting RDD consists of the flattened values paired with their original keys.
mapValues(func): Applies func to each value of a key-value pair RDD while keeping the keys unchanged.
sample(withReplacement, fraction, seed): Samples a fraction of the data with or without replacement, optionally using a specified random seed.
sortBy(func, ascending): Sorts the RDD by the result of applying func to each element; the ascending parameter determines the sort order.
zip(otherDataset): Zips the RDD with another RDD, creating a new RDD of pairs; both RDDs must have the same number of partitions and the same number of elements per partition.
Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so data must be shuffled across the cluster. Wide transformations result from operations such as groupByKey() and reduceByKey().
Wide Transformations in PySpark:
groupBy(): Groups the DataFrame using the specified columns. Data is shuffled across partitions so that rows with the same grouping values come together for aggregation.
groupByKey(): Similar to groupBy() but operates on RDDs. It shuffles the data across partitions so that all values for a given key end up together.
reduceByKey(): Applies a function to the values of each key, reducing the elements with the same key to a single value.
aggregate(): Aggregates the elements of each partition of the RDD using a given associative function and a neutral “zero value” (initial value).
aggregateByKey(): Similar to aggregate() but works on key-value pair RDDs, allowing aggregation of the values for each key.
distinct(): Returns a new DataFrame with distinct rows. It requires shuffling of data across partitions to identify and eliminate duplicates.
join(): Joins two DataFrames or RDDs based on a common key. It involves shuffling data across partitions to bring together matching records from different partitions.
repartition(): Repartitions the DataFrame or RDD by redistributing the data across partitions. It involves shuffling data across the cluster.
Additional Wide Transformations:
sortByKey(): Sorts a key-value RDD by key using range partitioning. It can lead to significant data movement across partitions.
cogroup(): Groups the values from both RDDs sharing the same key, similar to a full outer join. It shuffles data across partitions to group values by key.
1. FailFast Mode:
Used when you want strict data validation.
If any data issues (e.g., malformed records or schema mismatches) are encountered, it raises an exception immediately, preventing further processing of the data.
df_failfast = spark.read.option("mode", "FAILFAST").csv("data_source.csv")
2. DropMalformed Mode:
Used when you want to continue processing despite malformed records.
It skips malformed records, ensuring that only valid data is loaded.
df_dropmalformed = spark.read.option("mode", "DROPMALFORMED").csv("data_source.csv")
3. Permissive Mode:
Used when you want to capture as much data as possible; corrupted values are allowed through without interrupting the load.
It loads all records, setting malformed fields to null instead of failing.
df_permissive = spark.read.option("mode", "PERMISSIVE").csv("data_source.csv")
POINTS TO REMEMBER:
By default the read mode is PERMISSIVE, which allows capturing the entire data.
It also provides a way to surface the corrupted records while reading the data, but we have to define the schema with an additional column "_corrupt_record", which will capture the corrupted records.
To store the bad records at some location, the additional option "badRecordsPath" (a Databricks feature) is used to define the directory location.
Advantages of Parquet file
Will Spark clusters be in running status 24/7 in your project?
Avoiding skewness/uneven data distribution while writing data
Bad records in databricks
can we modify spark dataframe
Data size in blob changes every day; how to improve the performance?
cluster configuration in azure data bricks?
connecting sql in databricks
Databricks cluster creation
Databricks cluster driver and worker node details
Databricks cluster type
databricks connections to adf
Databricks git integration
databricks how you can use excel file
Databricks how you configure cluster
Delta table advantage
did you face any issues with performance in your coding and how did you fix it?
Difference between coalesce & repartition
Difference between persist and cache?
Difference between row and columnar data structures
Difference between spark and Hadoop map reduce?
difference between spark dataframe and pandas dataframe
do you know the difference between repartition and coalesce
do you use pandas for your transformation?
External sources accessed from databricks.
File 1 and File 2 has different schema, how we will merge both in single dataframe
file formats you have worked in azure data bricks?
gather statistics in pyspark
Get the datatypes of dataframe and preserve them into a list
Get the number of rows on each file in a dataframe?
Have you connected sql from databricks. if yes, how do you establish connection?
Have you ever used high concurrency cluster? if yes, why?
have you seen in your application any place where there is a data quality issue (like null values etc.)
How can we read pyspark notebook activity output into another notebook in adf
How can you create secret key & scope in databricks?
How to convert a date format to another format in spark
How do you drop duplicates in Azure databricks pyspark
How do you handle concurrency in your project
how do you mask data in azure
How do you select 30 column names dynamically in pyspark
How do you separate column names in pyspark with commas
How do you handle delta in your project
How to load multiple files into one file or merge them into one file
how to maintain the code while working with others' notebooks
how many nodes you have used in your cluster
How to add a new column in csv and assign a constant value
How to access Azure data lake storage from databricks?
How to access pyspark DF from SQL cell within notebook
How to add a partition in data frame?
how to add global variable in notebook
How to add source filename in one of column in dataframe?
How to call one notebook from another notebook?
How to connect Azure SQL from databricks
How to create a schema and assign it to a csv
How to decide configuration of Spark?
How to fill null values with some values
How to handle bad records in pyspark
How to handle Bad records in spark? Types
how to implement incremental load dynamically in pyspark
How to pass the runtime parameters to notebook
How to read a substring from a list; how to sort a list
how to read and write json
How to read from table all the columns and write to json
How to read the element from dataframe and convert it into variables
How to remove duplicates in spark
How to remove the null values in the dataframe
How to rename a column in csv? What are the different ways
How to send the notebook run status to adf from databricks notebook
how to skip the columns
How to sort a list; difference b/w sort and sorted
How to use our own schema while creating df
How to work on nested json
how to work with excel source in azure data bricks?
How to write data with a specific file name in databricks
how you use secret scope in databricks.
How we can see the notebook system generated Logs in Azure databricks
how you have done the delta load in your project.
How you read json file and convert to parquet
if you want to compare 2 dataframes how do you do that
if you wanted to find a specific pattern of a string and replace that what function do you use.
Interactive Databricks Cluster
Job cluster vs high concurrency cluster
left join of 2 dataframes
Logging in azure databricks notebooks.
Memory bottleneck on Spark executors
MERGE in azure data bricks?
Minimize shuffling of data while joining
Mount path creation
normal databricks vs azure databricks
Number of libraries you work with in databricks
Orchestrating databricks notebooks.
Parameter pass to databricks notebook from ADF
Parameter read in databricks(dbutils)
Performance tuning in Spark
pickling and unpickling
Query plan check in databricks
SCD type 2
Scenario based questions on performance tuning.
secret Scopes and azure keyvault?
spark context and spark config
syntax of broadcast join
Tiers of storage account
Types of dbutils
What all formats you have worked in your project pyspark
What are the activities you have done in Azure DataBricks
what are the file systems databricks support
What are the magic commands in databricks?
What are the optimisation you do your project for pyspark jobs
What are the transformations you have done in the azure data bricks?
what are the types of Joins in pyspark?
What is a Databricks Runtime?
what is accumulator in pyspark and broadcast variable
what is adaptive query execution in pyspark and use of it
What is azure active directory
what are broadcast variables and accumulator variables?
what is catalyst optimizer?
what is coalesce and repartition?
What is DAG?
What is data governance?
what is data governance (unity catalog)
What is data lake architecture and why we uses it?
what is data skewness (how to solve)
what is delta table
What is external table and managed /internal tables?
What is federation in Azure?
what is global temporary view and temporary view in pyspark?
What is inferschema in pyspark?
What is mount point & its implementation?
what is parquet file and features
What is parquet file and features?
What is Runtime in databricks?
what is secret scope in Azure data bricks?
What is spark cluster and how it works? And also why spark?
what is spark version and features
what is the configuration of cluster you used?
what is the delta lake file format?
What is the difference between interactive cluster and job cluster?
What is the meaning of delta in spark?
what is the process once you submit the job in pyspark
what is your cluster configuration?
what languages have you used in databricks transformation
what type of version tool you use in your project for ADB notebook
What was the size of the data and time required to load whole data?
Which partition strategy you used for data lake gen2
which type of cluster you have used in your project?
while and while loop
why we need to convert parquet file into delta file format?
Why you used delta ,What is delta
How to update a file
What is secret scope in data bricks
What is Z order in pyspark and use of it
What is vacuum
How to reduce data redundancy in pyspark
Syntax of join in pyspark
Write a simple transformation in pyspark
What is wide and narrow transformation
About data warehousing concepts? SCD and types
About Synapse Analytics
about your project, architecture and how it is implemented?
About your project? And what are the challenges & issues while working with Azure?
ADF copy activity , how to handle bad rows
ADF copy activity performance check
ADF git integration
ADLS Gen1 vs ADLS Gen2 vs BLOB
ADLS gen1 vs gen2
Also Scd types of data loading need to check…
Any exp on Azure Analysis Service
any experience in CICD?
Any experience in Data Extraction from API If yes what will be data format
Architecture and implementation of previous project
Asked to query on Lag function
Asked to write SQL query to find the credit card is valid or not
Azure analysis services concepts?
Azure app insights integration and logging mechanism
Azure key vault
Azure Services used (ADF, Databricks, SHIR)
Azure Streaming experience?
Blob storage vs Azure Data Lake
By using polybase how do you access the data from data lake storage account?
can we run pipelines through database triggers?
Can we update primary key in synapse sql dw
Can you configure the two different oracle DB instances in linked service
CI CD concepts and what are the steps we need to follow in detail while moving the code
CI/CD pipeline flow?
Clustered index vs Non-Clustered Index
Code/performance fine tuning aspects
Complex join queries
Copy multiple files from blob to blob in Azure Data Factory
Copy Multiple Tables in Bulk using Lkp and Foreach
CTE and derived table
CTE and derived table and its scope
CTE and Temp table
Data consistency in azure (how to compare source and destination)
Data Modelling Steps
Data processing is of 2 types
Data write to DWH from ADLS Delta
Detailed discussion on ADF components
Details of project implemented using ADF.
DEV Ops and implementation?
DevOps build and release pipeline
Difference between sql server and azure synapse
Diff bw SCD 1,2,3
Diff bw Synapse workspace and sql dw
Difference between Azure and Self-hosted integration runtime in ADF?
Difference between Azure Data Warehouse and Azure SQL
Difference between database and data warehouse
Difference between Iot hub and event hub.
Difference between release and build pipeline
Difference between Schedule trigger and Tumbling window trigger
Difference between secret and password in azure key vault.
Difference between sql server and azure synapse
Different activities used in ADF
Different data layers and logic behind it in Data Lake
Error exception handling (Try catch paradigm) using ADF
Error handling in ADF
ETL transformation method IN ADB
Experience in creating CI/CD pipelines in azure DevOps.
Experience in logic apps and azure functions.
Explain the project architecture in detail & its components
Export CSV file from on prem to cloud using ADF
fact and dimension
Facts and dimension and their types
Full load and incremental load
full load or full refresh
Get File Names from Source Folder Dynamically in Azure Data Factory
have you used ODBC connection from ADF?
Have you worked on web Application and what are the tools?
Have you worked or applied any own logic or function, which is not present in Azure?
How can we connect SFTP server
how can we implement SCD 3
How can you deploy your code to another production region using CI/CD?
How can you read excel files in databricks & what library is required for it? And where can you add those libraries?
How do you implement row level security in SQL DB and DW?
how do you implement SCD type 2 in pyspark (ie, how to update and insert data)
How do you pass parameters from data factory to databricks in ADF
How do you pass parameters in ADF pipeline
How do you select only 10 tables from the DB and dynamically copy those tables to the target
How increment build implemented in DevOps.
how to unzip a folder in sftp
How to add custom jar in databricks and install
How to avoid duplicates and null while entering in data warehouse
how to compare source and destination records in azure
How to connect any url through
how to create devops pipeline.
How to create link service for on premises SQL?
How to create pipeline in devops
How to create Self-hosted integration runtime?
How to create service principle
How to deploy ADF using ARM template?
How to deploy DWH
How to do cherry picking in CI/CD pipelines.
How to do Upsert in transformations?
how to do validation in the pipelines?
how to establish connection between sftp to adls 2?
How to extract data from JSON to flatten hierarchy
how to find latest file in folder from sftp
How to find out the duplicate records from table?
how to handle missing records in pipelines?
how to implement Logic apps in adf?
how to implement the scd type 3
How to import all csv from one folder and dump it into SQL
How to improve the copy performance in ADF
how to know the bad data in copy activity and how to set it
how to load directories and subdirectories in destination folder
How to provide access to ADLS (ACL vs RBAC)
How to read 10 file from blob with different schema and dims and write in data warehouse how do you map
How to read 50 csv files from ADL path and write back to ADL path in different format and schema.
How to read the nested json and few scenarios onto it
How Tumbling windows work
how you designed the end to end process
how you have worked on multiline json or nested json files?
How you will rephrase the existing facts and dimensions model (what will be the approach)
I am giving you a scenario: from an FTP location you have all source files, and your source files accumulate on a daily basis. Using ADF you have to pick up only the incremental files; in the ADF pipeline you have to pick and copy only the latest file.
if all transformations were in databricks notebooks, how did you create the entire orchestration and the architecture
If I am passing 4 elements in lookup activity, how to fetch the 3rd element from the array
If I delete any files from ADLS, can I recover it or not? (Is it possible?)
if i want to use the output of the previous activity in the following activity of the pipeline what kind of configuration will you do.
If one pipeline is scheduled at every 5 mins interval, and it goes beyond 5 mins, and another pipeline with the same object got triggered, how will you avoid overlapping of pipelines
If source location was ADLS were you picking the entire data or only delta files. what was the extraction mechanism?
If you are loading data to the databricks in ADF, if the databricks activity failed saying that cluster is terminated. how will you handle it
in which scenario will you use binary copy over csv or delimiited copy
Incremental data copy logic using ADF
Incremental logic behind the ADF
Index for fact and dim table
Is it good to migrate? Points
List of different external sources connected through ADF.
List of libraries used in python programming.
Logic Apps scenario to trigger the email whenever face the issue in Power BI report
Logic Apps scenario use cases in project
Logic Apps step by step procedure
managed instance in azure
MD5 hash key generation (unique key)
Merge statement in SQL
oltp and olap
On-Prem we have SQL DW DB, migrate to Azure Cloud
Partition and bucketing in DWH
Power BI concepts
Power BI concepts; and if a Power BI report has a performance issue, how can we optimize it
PySpark Code review?
Question on ADF – DIU
Rate yourself in python, spark, Azure?
reading 100 csv files in adf
Scenario based questions on Data Modelling.
self hosted integration runtime
Snowflake schema vs star schema
Sql queries , tables and temp , and CTE syntax
the incremental approach
tumbling and event trigger
Two methods difference in python- and syntax
Type of SQL Triggers
Types of indexes
Upsert operation(Update and Insert)
Use of identity(1,1) in Azure sql DB
Validation of data
Variable in DevOps pipeline
Walk through various steps required to create a data ingestion pipeline in ADF.
Walkthrough various steps in creating a release pipeline in DevOps.
what are activities in pipeline?
What are the activities and transformations worked in ADF?
What are the different ways to interact with or get data (files) from ADLS to ADBK?
what are the different ways you have used to upload a file in ADLS… and ways to pull a file from ADLS
What are the distribution methods in synapse analytics and difference between them
what are the hadoop file formats?
What are the incremental approach in DevOps?
what are the integration runtimes you have worked with
What are the parameters required for the linked services for the on premise SQL server and others?
What are the permissions in ADLS gen2 like (ACL, others) (read, write, execute)?
What are the points you will take care on the code review?
What are the source system you have worked?
What are the steps required to migrate?
What are the transformation in adf
what are the types of integration runtimes?
What Azure services are used to migrate and what are the ways to bring the data into Cloud?
What is azure function?
what is data units in ADF and how you set it?
what is data modeling?
What is dataset and how can you pass dynamic parameter to dataset and linked services?
what is DBU?
What is difference between Datawarehouse and Data Lake Gen2
What is logic app and how to integrate in ADF?
What is meant by cherry pick
What is path global filter?
What is polybase and what are the steps to create it? Explain all the scenarios when you use databricks, data warehouse, and data factory
what is the difference between azure blob storage and azure data lake gen2?
What is the difference between single tenant and multi-tenant?
what is the file size, data volume you have worked with
What is the IR? And how can you deal with source system & how to connect it?
what is your source?
what is Linked service? DATASET
What was the deployment procedure for you? Is it in production or in dev stage? How do you do deployment?
when I pull in a file, if I have to pull in the last modified date, what activity will I use.
Why synapse Data Warehouse used in your project (Dedicated SQL pool)
what is fact and dimension
what is factless table
what is annotation in adf pipeline
how to pass the parameters to notebook
wide and narrow
control flows in adf
what is data redundancy
explain plan for dataframe
what is serialization and deserialization in spark
size of broadcast join
what is sort merge join and how to disable it
shuffle partition vs repartition
system-assigned managed identity vs user-assigned managed identity
managed identity vs service principal
cte vs temptable
view vs procedure
serverless cluster
orderBy vs sortBy