Scala and Spark Interview Questions
Compare Hadoop and Spark - >>>>Speed: Apache Spark can be up to 100x faster than Hadoop MapReduce for in-memory workloads
Processing: Apache Spark supports real-time and batch processing, while Hadoop MapReduce supports only batch
processing
Difficulty: Apache Spark is easier to learn because of its high-level modules; Hadoop MapReduce is tougher to learn
Recovery: Apache Spark allows recovery of lost partitions; Hadoop is also fault tolerant
Interactivity: Apache Spark has interactive modes; Hadoop has no interactive mode except through Pig and
Hive
What is Apache Spark? - >>>>- Apache Spark is an open source cluster computing framework
for large-scale data processing, including real-time (streaming) workloads
- Provides an interface for programming entire clusters with fault tolerance and implicit data
parallelism.
Explain the key features of Apache Spark - >>>>1.) Polyglot
2.) Speed
3.) Multiple Format Support
4.) Lazy Evaluation
5.) Real Time Computation
6.) Hadoop Integration
7.) Machine Learning
What is polyglot? - >>>>- Spark provides high-level APIs in Java, Scala, Python, and R.
- Spark code can be written in any of these languages
- It provides a shell for Scala and Python
How can the Scala shell be accessed? - >>>>./bin/spark-shell from installed directory
How can the Python shell be accessed? - >>>>./bin/pyspark from installed directory
How does Spark achieve its speed? - >>>>Spark can run up to 100x faster than Hadoop MapReduce for large-scale data processing. It achieves this speed through in-memory computation and controlled partitioning.
What is controlled partitioning? - >>>>Managing data using partitions to help parallelize
distributed data processing with minimal network traffic
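A minimal sketch of the idea in plain Python (not Spark's API): records are assigned to partitions by hashing their key, so all records for a given key sit in one partition and a per-key aggregation can run on each partition independently, with no data moving between partitions. The function names `partition_by_key` and `count_per_key`, and the use of `zlib.crc32` as a stable hash, are assumptions made for illustration.

```python
import zlib
from collections import Counter


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def count_per_key(partition):
    """Aggregate one partition locally; no other partition is consulted."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return dict(totals)


records = [("spark", 1), ("hadoop", 1), ("spark", 1), ("yarn", 1), ("hadoop", 1)]
partitions = partition_by_key(records, num_partitions=3)

# Every key lives in exactly one partition, so the per-partition results
# can be merged directly without any further combining step.
merged = {}
for partition in partitions:
    merged.update(count_per_key(partition))
```

In a real cluster each partition would live on a different node, and the point of the scheme is that the aggregation step needs no network traffic.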
What formats does Spark support? - >>>>Spark reads from multiple data sources, including Parquet, JSON, Hive, and Cassandra
How to access structured data using Spark? - >>>>The Data Sources API provides a pluggable
mechanism for accessing structured data using Spark SQL. Data sources can be more than
simple pipes that convert data and pull it into Spark
What is lazy evaluation? - >>>>In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is
needed (non-strict evaluation) and which also avoids repeated evaluations (sharing). The
sharing can reduce the running time of certain functions by an exponential factor over other non-strict evaluation strategies, such as call-by-name.
The benefits of lazy evaluation include:
- The ability to define control flow (structures) as abstractions instead of primitives.
- The ability to define potentially infinite data structures. This allows for more straightforward
implementation of some algorithms.
- Performance increases by avoiding needless calculations and error conditions in evaluating
compound expressions.
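As a small illustration of delayed evaluation and a potentially infinite data structure, here is a sketch using a Python generator. The helper `expensive_square` and the `evaluated` log are hypothetical names used only to show when work actually happens. Note that a generator delays work but does not memoize results, so it demonstrates the "delay" half of call-by-need rather than sharing.

```python
from itertools import count, islice

evaluated = []  # records which inputs were actually computed


def expensive_square(n):
    """Stand-in for a costly computation; logs each call."""
    evaluated.append(n)
    return n * n


# A potentially infinite sequence: defining it computes no squares at all.
squares = (expensive_square(n) for n in count(1))

# Work happens only when values are demanded, and only as much as needed.
first_three = list(islice(squares, 3))
```

After this runs, only the inputs 1, 2, and 3 have been evaluated, even though the sequence is conceptually unbounded.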
How does Spark implement lazy evaluation? - >>>>Spark adds transformations to a DAG of
computation, and the DAG is executed only when an action requires a result to be returned to the driver.
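The record-now, run-later behaviour can be pictured with a toy model in plain Python. This is not how Spark is implemented: the class `LazyDataset` and its methods are hypothetical, and the recorded plan here is a simple linear chain rather than a full DAG.

```python
class LazyDataset:
    """Toy model: transformations are recorded and run only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded plan (a linear chain, not a full DAG)

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, never executed here.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan executed over the source data.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # track when the function really runs
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # still empty: nothing has executed yet
result = plan.collect()            # the action triggers the whole chain
```

Building `plan` calls `double` zero times; the calls happen only inside `collect()`, which mirrors the distinction between transformations and actions.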
How does Spark manage real time computation? - >>>>- Low latency due to in-memory
computation
- Designed for massive scalability: users of the system run production clusters with thousands
of nodes, and it supports several computational models
Does Spark have Hadoop integration? - >>>>Yes, which is a great boon for Big Data engineers
who began their careers with Hadoop. Spark is a potential replacement for Hadoop MapReduce
functions, and it can run on top of an existing Hadoop cluster using YARN for
resource scheduling
Why is Spark faster than Hadoop MapReduce? - >>>>Spark uses in-memory processing where
available, while MapReduce uses persistent storage for all of its data processing tasks
List 4 examples of benefits of Spark over Hadoop MapReduce - >>>>1.) The availability of in-memory
processing makes Spark run 10x to 100x faster than Hadoop MapReduce
2.) Spark has many built-in libraries that make code more readable and maintainable and
allow you to do multiple kinds of tasks, such as batch processing, from the same core.
3.) Spark uses much more in-memory processing and caching, so it relies less on disk than
Hadoop MapReduce
4.) Spark can perform many computations on the same data set (iterative computation) while keeping it in memory,
whereas Hadoop MR has to write to and reread from disk between jobs
What is YARN? - >>>>YARN provides a central resource management platform to deliver
scalable operations across the cluster. YARN is a distributed container manager (similar to
Mesos), whereas Spark is a data processing tool. Both Spark and Hadoop MR can run on YARN.
Do you need to install Spark on all of the nodes of a YARN cluster? - >>>>No. Spark runs on
top of YARN, so it does not need to be installed on every node. Spark has options to use YARN
instead of its built-in standalone cluster manager or Mesos
What are the configurations for Spark to run YARN? - >>>>master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue
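These settings correspond to `spark-submit` command-line options. The sketch below assembles such a command as an argument list in Python; the helper name `build_spark_submit`, the default values, and the application file `my_app.py` are all hypothetical and chosen only for illustration.

```python
import shlex


def build_spark_submit(app_path, *, master="yarn", deploy_mode="cluster",
                       driver_memory="2g", executor_memory="4g",
                       executor_cores=2, queue="default"):
    """Assemble a spark-submit argument list from the YARN-related settings."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--queue", queue,
        app_path,
    ]


# "my_app.py" is a hypothetical application file; the sizes are example values.
command = build_spark_submit("my_app.py", executor_memory="8g")
command_line = shlex.join(command)  # shell-safe string form of the same command
```

Keeping the command as a list makes it easy to hand to `subprocess.run` on a machine where Spark is installed, while `command_line` is convenient for logging.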
Uses of Hadoop MapReduce - >>>>Many tools, including Spark, use the MapReduce paradigm. Tools like Pig and
Hive convert their queries into MapReduce phases to optimize them better
What is a resilient distributed dataset? (RDD) - >>>>An RDD is a fault-tolerant collection of
elements that can be operated on in parallel. The partitioned data in an RDD is immutable and
distributed in nature.
- Partitions of the data are stored in memory, distributed across many nodes
- RDDs are lazily evaluated in Spark, which contributes to Spark's speed
What are the two types of RDD? - >>>>1.) Parallelized Collections
2.) Hadoop Datasets
What is a parallelized collection? - >>>>An RDD created from an existing collection in the driver program, whose elements are distributed so they can be operated on in parallel
What is a Hadoop Dataset? - >>>>An RDD created from files in HDFS or other
storage systems; functions are performed on each file record
How to create RDDs in Spark? - >>>>1.) Parallelizing a collection in your driver program
using SparkContext's parallelize method
- sc.parallelize(DataArray)
2.) Loading an external dataset from storage such as HDFS, for example with sc.textFile
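Parallelizing a collection can be pictured as cutting a driver-side list into slices, each of which becomes one partition that a worker can process independently. The sketch below is plain Python rather than the Spark API; the function `slice_collection` and its even-splitting rule are assumptions made for illustration.

```python
def slice_collection(data, num_slices):
    """Split a local list into num_slices contiguous, near-equal chunks."""
    items = list(data)
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


data_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
slices = slice_collection(data_array, num_slices=3)
# slices == [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
```

Every element ends up in exactly one slice and the original order is preserved, which is what lets the pieces be processed in parallel and recombined.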
What is executor memory? - >>>>The heap size is what is referred to as the Spark executor
memory, which is controlled with the spark.executor.memory property or the --executor-memory
flag.
- Executor memory is basically a measure of how much memory of the worker node the
application will utilize because there is o