Spark architecture refers to the underlying framework and components that enable Apache Spark, a fast and distributed data processing engine, to execute tasks and computations across clusters of machines. The architecture of Spark is designed to support parallel and distributed processing of large-scale datasets efficiently. Here are the key components of Spark architecture:
Driver Program:
The driver program is the main control process that manages the execution of Spark applications.
It runs the user’s main function and coordinates the execution of tasks on the cluster.
Cluster Manager:
Spark can run on various cluster managers such as Apache Mesos, Hadoop YARN, or in standalone mode.
The cluster manager allocates resources to different Spark applications and monitors their execution.
Executors:
Executors are worker nodes responsible for executing tasks as part of a Spark application.
They are launched and managed by the cluster manager and run tasks in parallel across the cluster.
Tasks:
Tasks are units of work that perform computations on data partitions.
Spark breaks down transformations and actions on RDDs (Resilient Distributed Datasets) into tasks that are executed in parallel across executors.
Resilient Distributed Datasets (RDDs):
RDDs are the fundamental data abstraction in Spark, representing distributed collections of objects that are partitioned across the cluster.
RDDs support fault tolerance through lineage, allowing lost data partitions to be reconstructed using transformations on other RDDs.
DataFrame and Dataset API:
Spark provides higher-level abstractions like DataFrames and Datasets for working with structured data.
These APIs offer optimizations for processing structured data and provide a more user-friendly interface compared to RDDs.
Spark Modules:
Spark Core: The foundational module providing distributed task scheduling, memory management, and basic I/O functionalities.
Spark SQL: Provides support for querying structured data using SQL syntax and integrates seamlessly with DataFrame and Dataset APIs.
Spark Streaming: Enables real-time stream processing of data streams, allowing for the processing of live data.
MLlib (Machine Learning Library): A library for scalable machine learning algorithms and utilities.
GraphX: A library for graph processing and analysis.
Spark RDDs are the most basic data abstraction in Spark. They are immutable, distributed collections of data. RDDs offer fine-grained control over data processing, but they can be complex to use.
Spark DataFrames are a higher-level abstraction than RDDs. They are structured data organized into named columns, similar to a table in a database. DataFrames are easier to use than RDDs, but they offer less control over data processing.
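A minimal sketch of the two abstractions side by side, assuming an existing SparkSession named spark; the names and values are illustrative only:
# RDD: low-level, fine-grained control over raw objects
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
ages = rdd.map(lambda row: row[1])            # work directly with tuples

# DataFrame: named columns and Catalyst optimizations
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.select("age").show()                       # same logic, declarative API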
A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output; every time we apply a transformation, a new RDD is created.
An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
Transformations are lazy in nature, i.e., they are executed only when we call an action; they are not executed immediately. The two most basic transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
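A small illustration of lazy evaluation, assuming an existing SparkSession named spark; nothing is computed until the action at the end:
nums = spark.sparkContext.parallelize(range(10))
evens = nums.filter(lambda n: n % 2 == 0)   # transformation: only recorded in the lineage
squares = evens.map(lambda n: n * n)        # transformation: still nothing executed
print(squares.collect())                    # action: triggers the actual computation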
Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to calculate the result. Narrow transformations result from functions such as map() and filter(); a short sketch follows the list below.
Narrow Transformations in PySpark:
map(func): Applies the function func to each element of the RDD or DataFrame independently and returns a new RDD or DataFrame with the transformed elements.
flatMap(func): Similar to map(), but the function func returns an iterator for each input record, and the resulting RDD or DataFrame is formed by flattening these iterators.
filter(func): Selects elements from the RDD or DataFrame that satisfy the specified predicate func.
mapPartitions(func): Applies a function func to each partition of the RDD, where func receives an iterator over the elements of the partition and can yield an arbitrary number of elements.
mapPartitionsWithIndex(func): Similar to mapPartitions(), but the function func also takes the index of the partition as an argument.
flatMapValues(func): Applies a function func to each value of a key-value pair RDD, where func returns an iterator for each value, and the resulting RDD consists of the flattened values.
mapValues(func): Applies a function func to each value of a key-value pair RDD, while keeping the keys unchanged.
sample(withReplacement, fraction, seed): Samples a fraction of the data with or without replacement, optionally using a specified random seed.
sortBy(func, ascending): Sorts the RDD or DataFrame by the result of applying the function func to each element. The ascending parameter determines the sort order.
zip(otherDataset): Zips the RDD with another RDD or DataFrame, creating a new RDD or DataFrame of key-value pairs.
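A hedged sketch of a few of these narrow transformations; spark is an assumed existing SparkSession and the data is made up:
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

mapped = pairs.map(lambda kv: (kv[0], kv[1] * 10))        # map(func)
filtered = pairs.filter(lambda kv: kv[1] > 1)             # filter(func)
doubled = pairs.mapValues(lambda v: v * 2)                # mapValues(func), keys unchanged
sampled = pairs.sample(False, 0.5, 42)                    # sample(withReplacement, fraction, seed)
# each of these works inside a partition, so no shuffle is required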
Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations result from functions such as groupByKey() and reduceByKey(); a short sketch follows the list below.
Wide Transformations in PySpark:
groupBy(): Groups the DataFrame using specified columns. It is a narrow transformation at the partition level, followed by a wide transformation where the data is shuffled across partitions to perform the aggregation.
groupByKey(): Similar to groupBy() but operates on RDDs. It groups the data in each partition based on a key and then shuffles the data across partitions.
reduceByKey(): Applies a function to the values of each key, reducing the elements with the same key to a single value.
aggregate(): Aggregates the elements of each partition of the RDD using a given associative function and a neutral "zero value" (initial value).
aggregateByKey(): Similar to aggregate() but works on key-value pair RDDs. It allows aggregation of values for each key.
distinct(): Returns a new DataFrame with distinct rows. It requires shuffling of data across partitions to identify and eliminate duplicates.
join(): Joins two DataFrames or RDDs based on a common key. It involves shuffling data across partitions to bring together matching records from different partitions.
repartition(): Repartitions the DataFrame or RDD by redistributing the data across partitions. It involves shuffling data across the cluster.
Additional Wide Transformations:
sortByKey(): Sorts the elements of each partition based on the keys. It can lead to significant data movement across partitions.
cogroup(): Groups the values from both RDDs sharing the same key, similar to a full outer join. It shuffles data across partitions to group values by key.
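A hedged sketch of a couple of wide transformations; the shuffle happens because matching keys must end up on the same partition (spark is an assumed SparkSession, the data is illustrative):
sales = spark.sparkContext.parallelize([("us", 10), ("in", 5), ("us", 7)])
regions = spark.sparkContext.parallelize([("us", "north america"), ("in", "asia")])

totals = sales.reduceByKey(lambda a, b: a + b)    # shuffle: all values for a key meet on one partition
grouped = sales.groupByKey()                      # shuffle: collects every value per key
joined = sales.join(regions)                      # shuffle: co-locates matching keys from both RDDs
print(totals.collect())                           # action that triggers the shuffles above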
1. FailFast Mode:
It is used when you want strict data validation.
If any data issues (e.g., malformed records or schema mismatch) are encountered, it will raise an exception immediately, preventing further processing of the data.
df_failfast = spark.read.option("mode", "FAILFAST").csv("data_source.csv")
2. DropMalformed Mode:
Used when you want to continue processing despite malformed records.
It skips malformed records, ensuring that only valid data is loaded.
df_dropmalformed = spark.read.option("mode", "DROPMALFORMED").csv("data_source.csv")
3. Permissive Mode:
Used when you want to capture as much data as possible. It allows corrupted values through without interrupting the load.
It loads the data, setting malformed fields to null and capturing problem records where possible.
df_permissive = spark.read.option("mode", "PERMISSIVE").csv("data_source.csv")
POINTS TO REMEMBER:
By default, the read mode is PERMISSIVE, which captures as much of the data as possible.
It also provides an additional feature to surface the corrupted records while reading the data, but we have to define the schema with an additional column "_corrupt_record", which will capture the corrupted records.
To store the bad records at some location, an additional option "badRecordsPath" is used to define the directory location; a sketch of both options follows.
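A hedged sketch of both options; the schema, file name, and badRecordsPath directory below are hypothetical:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),   # holds the raw malformed row in PERMISSIVE mode
])
df_with_corrupt = spark.read.schema(schema).option("mode", "PERMISSIVE").csv("data_source.csv")

# Databricks-specific alternative: route bad rows to a directory instead of a column
df_bad_path = spark.read.option("badRecordsPath", "/tmp/bad_records").csv("data_source.csv")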
Delta table is the default data table format in Azure Databricks and is a feature of the Delta Lake open source data framework.
A Delta table is a managed table that uses the Databricks Delta Optimistic Concurrency Control feature to provide full ACID (atomicity, consistency, isolation, and durability) transactions.
Delta Lake is a transactional storage layer that runs on top of your existing data lake. It uses a columnar storage format and supports ACID transactions.
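A minimal sketch of writing and reading a Delta table, assuming a Databricks or Delta-enabled Spark environment; df is any existing DataFrame and the path is hypothetical:
# write a DataFrame out as a Delta table (ACID, versioned)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/delta/customers")

# read it back, or time-travel to an earlier version
delta_df = spark.read.format("delta").load("/mnt/datalake/delta/customers")
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/datalake/delta/customers")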
Parquet is a columnar storage format that excels in compressing and encoding data efficiently. It’s designed for read-heavy workloads, making it an excellent choice for analytics queries. The columnar structure minimises I/O operations, reducing disk and network traffic, ultimately leading to faster query performance. Parquet’s compatibility with various data processing frameworks makes it a versatile option.
Key Advantages:
1. Excellent compression and encoding, minimizing storage costs.
2. Optimised for analytics queries due to columnar storage.
3. Widely compatible with various processing frameworks.
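For comparison, a hedged Parquet write/read sketch; df is any existing DataFrame and the path is hypothetical:
# columnar Parquet files, compressed with snappy by default
df.write.mode("overwrite").parquet("/mnt/datalake/parquet/customers")
parquet_df = spark.read.parquet("/mnt/datalake/parquet/customers")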
DataFrames are immutable, hence we cannot change anything directly on them; every operation on a DataFrame results in a new Spark DataFrame.
A broadcast join efficiently distributes the smaller DataFrame to all nodes in the cluster for the join operation. It is suitable when one DataFrame is small enough to fit into the memory of each executor and the other DataFrame is significantly larger, and it is ideal for joining small lookup tables with large fact tables.
from pyspark.sql.functions import broadcast
broadcast_join_df = df1.join(broadcast(df2), df1.id == df2.id, "inner")
Both repartition and coalesce are used to modify the number of partitions in an RDD or DataFrame. repartition can increase or decrease the number of partitions, but it shuffles all the data, whereas coalesce can only decrease the number of partitions. Because coalesce avoids a full data shuffle, it is more efficient than repartition when reducing the number of partitions.
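A short hedged example of the two; df is any existing DataFrame:
print(df.rdd.getNumPartitions())     # current partition count

df_more = df.repartition(200)        # full shuffle; partition count can go up or down
df_fewer = df.coalesce(10)           # merges existing partitions; decrease only, no full shuffle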
The difference is that the RDD cache() method saves data to memory by default (MEMORY_ONLY), whereas the persist() method stores it at a user-defined storage level.
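A hedged sketch of the difference, assuming an existing SparkSession named spark:
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()                                   # defaults to StorageLevel.MEMORY_ONLY for RDDs

rdd2 = spark.sparkContext.parallelize(range(1000))
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # explicit, user-chosen storage level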
How to reduce data redundancy in pyspark
Syntax of join in pyspark
Write a simple transformation in pyspark
What is wide and narrow transformation
About data warehousing concepts? SCD and types
About Synapse Analytics
about your project, architecture and how it is implemented?
About your project? And what are the challenges & issues while working with Azure?
ADF copy activity , how to handle bad rows
ADF copy activity performance check
ADF Git integration
ADLS Gen1 vs ADLS Gen2 vs BLOB
ADLS gen1 vs gen2
ADLS Gen2/Gen1
Also SCD types of data loading need to check…
Any exp on Azure Analysis Service
any experience in CICD?
Any experience in Data Extraction from API If yes what will be data format
Architecture and implementation of previous project
Asked to query on Lag function
Asked to write SQL query to find the credit card is valid or not
Azure analysis services concepts?
Azure app insights integration and logging mechanism
Azure key vault
Azure Services used (ADF, Databricks, SHIR)
Azure Streaming experience?
Blob storage vs Azure Data Lake
By using PolyBase, how do you access the data from the data lake storage account?
can we run pipelines through database triggers?
Can we update a primary key in Synapse SQL DW
Can you configure the two different oracle DB instances in linked service
CI CD concepts and what are the steps we need to follow in detail while moving the code
CI/CD pipeline flow?
Clustered index vs Non-Clustered Index
Code/performance fine tuning aspects
Complex join queries
Copy multiple files from blob to blob in Azure Data Factory
Copy Multiple Tables in Bulk using Lkp and Foreach
CTE and derived table
CTE and derived table and its scope
CTE and Temp table
custom transformation
Data consistency in azure (how to compare source and destination)
Data Modelling Steps
Data processing will be 2 types
Data write to DWH from ADLS Delta
Detailed discussion on ADF components
Details of project implemented using ADF.
DEV Ops and implementation?
DevOps build and release pipeline
Difference between SQL Server and Azure Synapse
Diff bw SCD 1,2,3
Diff bw Synapse workspace and sql dw
Difference between Azure and Self-hosted integration runtime in ADF?
Difference between Azure Data Warehouse and Azure SQL
Difference between database and data warehouse
Difference between Iot hub and event hub.
Difference between release and build pipeline
Difference between Schedule trigger and Tumbling window trigger
Difference between secret and password in azure key vault.
Different activities used in ADF
Different data layers and logic behind it in Data Lake
Error exception handling (Try catch paradigm) using ADF
Error handling in ADF
ETL transformation method IN ADB
Experience in creating CI/CD pipelines in azure DevOps.
Experience in logic apps and azure functions.
Explain the project architecture in detail & its components
Export CSV file from on prem to cloud using ADF
fact and dimension
Factless table
Facts and dimension and their types
Full load and incremental load
full load or full refresh
Get File Names from Source Folder Dynamically in Azure Data Factory
have you used ODBC connection from ADF?
Have you worked on web Application and what are the tools?
Have you worked or applied any own logic or function, which is not present in Azure?
How can we connect SFTP server
how can we implement SCD 3
How can you deploy your code to another production region using CI/CD?
How can you read Excel files in Databricks & what library is required for it? And where can you add that library?
How do you implement row level security in SQL DB and DW?
how do you implement SCD type 2 in pyspark (ie, how to update and insert data)
How do you pass parameters from data factory to databricks in ADF
How do you pass parameters in ADF pipeline
How do you select only 10 tables from the DB and dynamically copy those tables to the target
How is incremental build implemented in DevOps?
How to unzip a folder in SFTP
How to add custom jar in databricks and install
How to avoid duplicates and null while entering in data warehouse
how to compare source and destination records in azure
How to connect any url through
how to create devops pipeline.
How to create a linked service for on-premises SQL?
How to create pipeline in devops
How to create Self-hosted integration runtime?
How to create a service principal
How to deploy ADF using ARM template?
How to deploy DWH
How to do cherry picking in CI/CD pipelines.
How to do Upsert in transformations?
how to do validation in the pipelines?
how to establish connection between sftp to adls 2?
How to extract data from JSON to flatten hierarchy
how to find latest file in folder from sftp
How to find out the duplicate records from table?
how to handle missing records in pipelines?
how to implement Logic Apps in ADF?
how to implement the scd type 3
How to import all csv from one folder and dump it into SQL
How to improve the copy performance in ADF
how to know the bad data in copy activity and how to set it
how to load directories and subdirectories into the destination folder
How to provide access to ADLS (ACL vs RBAC)
How to read 10 file from blob with different schema and dims and write in data warehouse how do you map
How to read 50 csv files from ADL path and write back to ADL path in different format and schema.
How to read the nested json and few scenarios onto it
How Tumbling windows work
how you designed the end to end process
how you have worked on multiline json or nested json files?
How you will rephrase the existing facts and dimensions model (what will be the approach)
I am giving you a scenario… from an FTP location where you have all source files. Your source files are accumulated on a daily basis. Using ADF you have to pick up only the incremental files; in the ADF pipeline you have to pick and copy only the latest file
if all transformations were in databricks notebooks, how did you create the entire orchestration and the architecture
If I am passing 4 elements in lookup activity, how to fetch the 3rd element from the array
If I delete any files from ADLS, can I recover it or not? (Is it possible?)
if i want to use the output of the previous activity in the following activity of the pipeline what kind of configuration will you do.
If one pipeline is scheduled at every 5 mins interval, and it goes beyond 5 mins, and another pipeline with the same object got triggered, how will you avoid overlapping of pipelines
If source location was ADLS were you picking the entire data or only delta files. what was the extraction mechanism?
If you are loading data to the databricks in ADF, if the databricks activity failed saying that cluster is terminated. how will you handle it
in which scenario will you use binary copy over csv or delimited copy
incremental
Incremental data copy logic using ADF
Incremental logic behind the ADF
Index for fact and dim table
iot hub
Is it good to migrate? Points
key vault
List of different external sources connected through ADF.
List of libraries used in python programming.
logic apps
Logic Apps scenario to trigger the email whenever face the issue in Power BI report
Logic Apps scenario use cases in project
Logic Apps step by step procedure
managed instance in azure
MD5 hash key generation (unique key)
Merge statement in SQL
Normalization
oltp and olap
On-Prem we have SQL DW DB, migrate to Azure Cloud
Partition and bucketing in DWH
PIPELINE LAK
Power BI concepts
Power BI concepts, and in a Power BI report we are having a performance issue, how can we optimize the same
PySpark Code review?
Question on ADF – DIU
Rate yourself in python, spark, Azure?
RBAC
reading 100 csv files in adf
rest api
Scenario based questions on Data Modelling.
self hosted integration runtime
Snowflake schema vs star schema
Sql queries, tables and temp, and CTE syntax
Surrogate key
Technologies used.
the incremental approach
tumbling and event trigger
Two methods difference in python- and syntax
Type of SQL Triggers
Types of indexes
Upsert operation(Update and Insert)
Use of identity(1,1) in Azure sql DB
Validation of data
Variable in DevOps pipeline
Walk through various steps required to create a data ingestion pipeline in ADF.
Walkthrough various steps in creating a release pipeline in DevOps.
what are activities in pipeline?
What are the activities and transformations worked in ADF?
What are the different ways to interact with or get data (files) from ADLS to ADBK?
what are the different ways you have used to upload a file in ADLS… and ways to pull a file from ADLS
What are the distribution methods in synapse analytics and difference between them
what are the hadoop file formats?
What are the incremental approach in DevOps?
what are the integration runtimes you have worked with
What are the parameters required for the linked services for the on premise SQL server and others?
What are the permissions in ADLS gen2 like (ACL, others) (read, write, execute)?
What are the points you will take care on the code review?
What are the source system you have worked?
What are the steps required to migrate?
What are the transformation in adf
what are the types of integration runtimes?
What Azure services are used to migrate and what are the ways to bring the data into Cloud?
What is azure function?
what is data units in ADF and how you set it?
what is data modeling?
What is dataset and how can you pass dynamic parameter to dataset and linked services?
what is DBU?
What is difference between Datawarehouse and Data Lake Gen2
What is logic app and how to integrate in ADF?
What is meant by cherry pick
What is path global filter?
What is PolyBase and what are the steps to create it in PolyBase? Explain all the scenarios when you use Databricks, Data Warehouse, Data Factory
what is the difference between azure blob storage and azure data lake gen2?
What is the difference between single tenant and multi-tenant?
what is the file size, data volume you have worked with
What is the IR? And how can you deal with source system & how to connect it?
what is your source?
what is Linked service? DATASET
what was the deployment procedure for you. is it in production or in dev stage. how do you do deployment.
when i pull in file, if i have to pull in last modified date, what activity will i use.
Why is Synapse Data Warehouse used in your project (Dedicated SQL pool)
what is fact and dimension
what is factless table
what is annotation in adf pipeline
how to pass the parameters to notebook
wide and narrow
collect_list
diu
control flows in adf
what is data redundancy
explain plan for dataframe
what is serialization and deserialization in spark
size of broadcast join
what is sort merge join and how to disable it
shuffle partition vs repartition
system-managed identity vs user-managed identity
managed identity vs service principal
cte vs temptable
hash join
view vs procedure
serverless cluster
order by vs sortby
difference between cache and broadcast in spark