Spark Interview Questions
1. What is Apache Spark? - ✔✔Apache Spark is an open-source cluster computing framework
for real-time processing. It has a thriving open-source community and is the most active Apache
project at the moment. Spark provides an interface for programming entire clusters with implicit
data parallelism and fault-tolerance.
2. Compare Hadoop and Spark - ✔✔Speed: Spark is up to 100 times faster than Hadoop MapReduce.
Processing: Spark supports both real-time and batch processing, whereas Hadoop supports batch processing only.
Ease of use: Spark is easy to learn because of its high-level modules, whereas Hadoop is tougher to learn.
Recovery: Spark allows recovery of lost partitions, whereas Hadoop relies on being fault-tolerant.
Interactivity: Spark has interactive modes, whereas Hadoop has no interactive mode except Pig and Hive.
3. Explain the key features of Apache Spark. - ✔✔Polyglot (high-level APIs in Java, Scala,
Python, R)
Speed (manages data using partitions that help parallelize distributed data processing with
minimal network traffic)
Multiple Format Support (Parquet, JSON, Hive and Cassandra)
Lazy Evaluation (key to speed; see the sketch after this list)
Real Time Computation (less latency because of its in-memory computation)
Hadoop Integration
Machine Learning
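To illustrate lazy evaluation, here is a minimal Scala sketch, assuming an existing SparkContext named sc (for example, from spark-shell); the numbers are arbitrary:
// Transformations only build up a lineage; no job runs yet.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)   // lazy
val evenSquares = squares.filter(_ % 2 == 0)   // still lazy
// Only the action below triggers actual computation across the cluster.
println(s"Even squares: ${evenSquares.count()}")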
5. What are the benefits of Spark over MapReduce? - ✔✔Due to the availability of in-memory
processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce,
whereas MapReduce relies on persistent storage for all of its data processing tasks.
Unlike Hadoop, Spark provides inbuilt libraries to perform multiple kinds of work on the same
core engine, such as batch processing, streaming, machine learning, and interactive SQL queries,
whereas Hadoop only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset; this is called
iterative computation, and Hadoop implements no such iterative computing.
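As an illustration of caching for iterative use of the same dataset, here is a minimal Scala sketch; the HDFS path is a placeholder and sc is assumed to be an existing SparkContext:
// Keep the filtered dataset in memory so repeated actions reuse it
// instead of re-reading and re-filtering from disk each time.
val events = sc.textFile("hdfs:///data/events.log")   // placeholder path
val errors = events.filter(_.contains("ERROR")).cache()
val totalErrors = errors.count()                              // first pass computes and caches
val loginErrors = errors.filter(_.contains("login")).count()  // reuses cached partitions
println(s"total: $totalErrors, login-related: $loginErrors")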
6. What is YARN? - ✔✔Similar to Hadoop, YARN is one of the key features of the Spark
ecosystem, providing a central resource management platform to deliver scalable operations
across the cluster. YARN is a distributed container manager, whereas Spark is a data processing
tool. Spark can run on YARN in the same way that Hadoop MapReduce can run on YARN.
Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
7. Do you need to install Spark on all nodes of a YARN cluster? - ✔✔No, because Spark runs on
top of YARN and executes independently of where it is installed. Spark has options to use YARN
when dispatching jobs to the cluster, rather than its own built-in cluster manager or Mesos.
Further, there are several configurations for running on YARN, including master, deploy-mode,
driver-memory, executor-memory, executor-cores, and queue.
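These options are normally passed to spark-submit on the command line; the Scala sketch below only illustrates the corresponding configuration properties. The application name and values are made up, and driver-side settings in practice must be supplied at submit time, before the driver JVM starts:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("yarn-example")                 // illustrative name
  .set("spark.master", "yarn")                // --master yarn
  .set("spark.submit.deployMode", "cluster")  // --deploy-mode cluster
  .set("spark.driver.memory", "2g")           // --driver-memory 2g
  .set("spark.executor.memory", "4g")         // --executor-memory 4g
  .set("spark.executor.cores", "2")           // --executor-cores 2
  .set("spark.yarn.queue", "default")         // --queue default
val sc = new SparkContext(conf)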
8. Is there any benefit of learning MapReduce if Spark is better than MapReduce? - ✔✔Yes.
MapReduce is a paradigm used by many big data tools, including Spark. It remains highly
relevant as data grows bigger and bigger. Many tools, like Pig and Hive, convert their queries
into MapReduce phases to optimize them better.
9. Explain the concept of Resilient Distributed Dataset (RDD). - ✔✔RDD stands for Resilient
Distributed Dataset. An RDD is a fault-tolerant collection of operational elements that run in
parallel. The partitioned data in an RDD is immutable and distributed in nature. There are
primarily two types of RDD:
Parallelized Collections: created by parallelizing an existing collection in the driver program so
that its elements are processed in parallel with one another.
Hadoop Datasets: they perform functions on each file record in HDFS or other storage systems.
RDDs are basically parts of data that are stored in memory distributed across many nodes.
RDDs are lazily evaluated in Spark, and this lazy evaluation is part of what contributes to Spark's
speed.
10. How do we create RDDs in Spark? - ✔✔Spark provides two methods to create an RDD:
1. By parallelizing a collection in your driver program, using SparkContext's parallelize method:
val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)
2. By loading an external dataset from external storage like HDFS, HBase, or a shared file
system.
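For example, a minimal Scala sketch of the second method, assuming an existing SparkContext named sc and a placeholder HDFS path:
// Load a text file from external storage into an RDD of lines.
val linesRDD = sc.textFile("hdfs:///user/data/input.txt")   // placeholder path
println(s"Number of lines: ${linesRDD.count()}")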
11. What is Executor Memory in a Spark application? - ✔✔Every Spark application has the same
fixed heap size and fixed number of cores for each Spark executor. The heap size is what is
referred to as the Spark executor memory, which is controlled with the spark.executor.memory
property or the --executor-memory flag. Every Spark application has one executor on each
worker node. The executor memory is essentially a measure of how much memory of the worker
node the application will utilize.
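For instance, the configured value can be inspected from the driver; this sketch assumes an existing SparkContext named sc, and 1g is the default used here when the property has not been set:
// spark.executor.memory is usually set via --executor-memory at submit time.
val executorMem = sc.getConf.get("spark.executor.memory", "1g")
println(s"Executor heap size for this application: $executorMem")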
12. Define Partitions in Apache Spark. - ✔✔As the name suggests, a partition is a smaller and
logical division of data, similar to a 'split' in MapReduce: a logical chunk of a large distributed
data set. Partitioning is the process of deriving logical units of data in order to speed up
processing. Spark manages data using partitions that help parallelize distributed data processing
with minimal network traffic for sending data between executors. By default, Spark tries to read
data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed
partitioned data, it creates partitions to hold the data chunks and to optimize transformation
operations. Everything in Spark is a partitioned RDD.
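A small Scala sketch of working with partitions, assuming an existing SparkContext named sc; the partition counts are arbitrary examples:
val data = sc.parallelize(1 to 100, 4)     // request 4 partitions up front
println(s"Initial partitions: ${data.getNumPartitions}")
val repartitioned = data.repartition(8)    // reshuffle the data into 8 partitions
println(s"After repartition: ${repartitioned.getNumPartitions}")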
13. What operations does RDD support? - ✔✔RDD (Resilient Distributed Dataset) is the main
logical data unit in Spark. An RDD is a distributed collection of objects: each RDD is divided
into multiple partitions, and each of these partitions can reside in memory or be stored on the
disk of different machines in a cluster. RDDs are immutable (read-only) data structures; you
can't change the original RDD, but you can always transform it into a different RDD with all the
changes you want.
RDDs support two types of operations: transformations and actions.
Transformations: - ✔✔Transformations create a new RDD from an existing RDD, such as the
map, reduceByKey and filter operations shown in the sketch below. Transformations are executed
on demand; that is, they are computed lazily.
Actions: - ✔✔Actions return the final results of RDD computations. An action triggers execution
using the lineage graph to load the data into the original RDD, carry out all intermediate
transformations, and return the final results to the driver program or write them out to the file
system.
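A minimal Scala sketch putting transformations and actions together, assuming an existing SparkContext named sc; the word list is made up:
// Transformations: build new RDDs lazily, nothing executes yet.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn", "spark"))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
val frequent = counts.filter { case (_, n) => n > 1 }
// Actions: trigger execution of the lineage graph and return results to the driver.
frequent.collect().foreach { case (word, n) => println(s"$word -> $n") }
println(s"Words appearing more than once: ${frequent.count()}")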