Spark Interview Questions
1. What is Apache Spark? - ✔✔Apache Spark is an open-source cluster computing framework
for real-time processing. It has a thriving open-source community and is the most active Apache
project at the moment. Spark provides an interface for programming entire clusters with implicit
data parallelism and fault-tolerance.
2. Compare Hadoop and Spark - ✔✔Speed: Spark is up to 100 times faster than Hadoop MapReduce.
Processing: Spark supports both real-time and batch processing, whereas Hadoop supports batch processing only.
Ease of use: Spark is easy to learn because of its high-level modules, whereas Hadoop is tougher to learn.
Recovery: Spark allows recovery of lost partitions, whereas Hadoop relies on being fault-tolerant.
Interactivity: Spark has interactive modes, whereas Hadoop has no interactive mode except Pig and Hive.
3. Explain the key features of Apache Spark. - ✔✔Polyglot (high-level APIs in Java, Scala,
Python, R)
Speed (manages data using partitions that help parallelize distributed data processing with
minimal network traffic)
Multiple Format Support (Parquet, JSON, Hive and Cassandra)
Lazy Evaluation (key to speed; see the sketch after this list)
Real Time Computation (less latency because of its in-memory computation)
Hadoop Integration
Machine Learning
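To illustrate lazy evaluation, here is a minimal Scala sketch, assuming an existing SparkContext named sc (for example, from spark-shell); the numbers are arbitrary:
// Transformations only build up a lineage; no job runs yet.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)   // lazy
val evenSquares = squares.filter(_ % 2 == 0)   // still lazy
// Only the action below triggers actual computation across the cluster.
println(s"Even squares: ${evenSquares.count()}")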
5. What are the benefits of Spark over MapReduce? - ✔✔Due to the availability of in-memory
processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce,
whereas MapReduce relies on persistent storage for all of its data processing tasks.
Unlike Hadoop, Spark provides inbuilt libraries to perform multiple kinds of work on the same
core engine, such as batch processing, streaming, machine learning, and interactive SQL queries,
whereas Hadoop only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset; this is called
iterative computation, and Hadoop implements no such iterative computing.
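As an illustration of caching for iterative use of the same dataset, here is a minimal Scala sketch; the HDFS path is a placeholder and sc is assumed to be an existing SparkContext:
// Keep the filtered dataset in memory so repeated actions reuse it
// instead of re-reading and re-filtering from disk each time.
val events = sc.textFile("hdfs:///data/events.log")   // placeholder path
val errors = events.filter(_.contains("ERROR")).cache()
val totalErrors = errors.count()                              // first pass computes and caches
val loginErrors = errors.filter(_.contains("login")).count()  // reuses cached partitions
println(s"total: $totalErrors, login-related: $loginErrors")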
6. What is YARN? - ✔✔Similar to Hadoop, YARN is one of the key features of the Spark
ecosystem, providing a central resource management platform to deliver scalable operations
across the cluster. YARN is a distributed container manager, whereas Spark is a data processing
tool. Spark can run on YARN in the same way that Hadoop MapReduce can run on YARN.
Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
7. Do you need to install Spark on all nodes of a YARN cluster? - ✔✔No, because Spark runs on
top of YARN and executes independently of where it is installed. Spark has options to use YARN
when dispatching jobs to the cluster, rather than its own built-in cluster manager or Mesos.
Further, there are several configurations for running on YARN, including master, deploy-mode,
driver-memory, executor-memory, executor-cores, and queue.
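These options are normally passed to spark-submit on the command line; the Scala sketch below only illustrates the corresponding configuration properties. The application name and values are made up, and driver-side settings in practice must be supplied at submit time, before the driver JVM starts:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("yarn-example")                 // illustrative name
  .set("spark.master", "yarn")                // --master yarn
  .set("spark.submit.deployMode", "cluster")  // --deploy-mode cluster
  .set("spark.driver.memory", "2g")           // --driver-memory 2g
  .set("spark.executor.memory", "4g")         // --executor-memory 4g
  .set("spark.executor.cores", "2")           // --executor-cores 2
  .set("spark.yarn.queue", "default")         // --queue default
val sc = new SparkContext(conf)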
8. Is there any benefit of learning MapReduce if Spark is better than MapReduce? - ✔✔Yes.
MapReduce is a paradigm used by many big data tools, including Spark. It remains highly
relevant as data grows bigger and bigger. Many tools, like Pig and Hive, convert their queries
into MapReduce phases to optimize them better.
9. Explain the concept of Resilient Distributed Dataset (RDD). - ✔✔RDD stands for Resilient
Distributed Dataset. An RDD is a fault-tolerant collection of operational elements that run in
parallel. The partitioned data in an RDD is immutable and distributed in nature. There are
primarily two types of RDD:
Parallelized Collections: created by parallelizing an existing collection in the driver program so
that its elements are processed in parallel with one another.
Hadoop Datasets: they perform functions on each file record in HDFS or other storage systems.
RDDs are basically parts of data that are stored in memory distributed across many nodes.
RDDs are lazily evaluated in Spark, and this lazy evaluation is part of what contributes to Spark's
speed.
10. How do we create RDDs in Spark? - ✔✔Spark provides two methods to create an RDD:
1. By parallelizing a collection in your driver program, using SparkContext's parallelize method:
val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)
2. By loading an external dataset from external storage like HDFS, HBase, or a shared file
system.
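For example, a minimal Scala sketch of the second method, assuming an existing SparkContext named sc and a placeholder HDFS path:
// Load a text file from external storage into an RDD of lines.
val linesRDD = sc.textFile("hdfs:///user/data/input.txt")   // placeholder path
println(s"Number of lines: ${linesRDD.count()}")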
11. What is Executor Memory in a Spark application? - ✔✔Every Spark application has the same
fixed heap size and fixed number of cores for each Spark executor. The heap size is what is
referred to as the Spark executor memory, which is controlled with the spark.executor.memory
property or the --executor-memory flag. Every Spark application has one executor on each
worker node. The executor memory is essentially a measure of how much memory of the worker
node the application will utilize.
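For instance, the configured value can be inspected from the driver; this sketch assumes an existing SparkContext named sc, and 1g is the default used here when the property has not been set:
// spark.executor.memory is usually set via --executor-memory at submit time.
val executorMem = sc.getConf.get("spark.executor.memory", "1g")
println(s"Executor heap size for this application: $executorMem")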
12. Define Partitions in Apache Spark. - ✔✔As the name suggests, a partition is a smaller and
logical division of data, similar to a 'split' in MapReduce: a logical chunk of a large distributed
data set. Partitioning is the process of deriving logical units of data in order to speed up
processing. Spark manages data using partitions that help parallelize distributed data processing
with minimal network traffic for sending data between executors. By default, Spark tries to read
data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed
partitioned data, it creates partitions to hold the data chunks and to optimize transformation
operations. Everything in Spark is a partitioned RDD.
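A small Scala sketch of working with partitions, assuming an existing SparkContext named sc; the partition counts are arbitrary examples:
val data = sc.parallelize(1 to 100, 4)     // request 4 partitions up front
println(s"Initial partitions: ${data.getNumPartitions}")
val repartitioned = data.repartition(8)    // reshuffle the data into 8 partitions
println(s"After repartition: ${repartitioned.getNumPartitions}")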
13. What operations does RDD support? - ✔✔RDD (Resilient Distributed Dataset) is the main
logical data unit in Spark. An RDD is a distributed collection of objects: each RDD is divided
into multiple partitions, and each of these partitions can reside in memory or be stored on the
disk of different machines in a cluster. RDDs are immutable (read-only) data structures; you
can't change the original RDD, but you can always transform it into a different RDD with all the
changes you want.
RDDs support two types of operations: transformations and actions.
Transformations: - ✔✔Transformations create a new RDD from an existing RDD, such as the
map, reduceByKey and filter operations shown in the sketch below. Transformations are executed
on demand; that is, they are computed lazily.
Actions: - ✔✔Actions return the final results of RDD computations. An action triggers execution
using the lineage graph to load the data into the original RDD, carry out all intermediate
transformations, and return the final results to the driver program or write them out to the file
system.
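A minimal Scala sketch putting transformations and actions together, assuming an existing SparkContext named sc; the word list is made up:
// Transformations: build new RDDs lazily, nothing executes yet.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn", "spark"))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
val frequent = counts.filter { case (_, n) => n > 1 }
// Actions: trigger execution of the lineage graph and return results to the driver.
frequent.collect().foreach { case (word, n) => println(s"$word -> $n") }
println(s"Words appearing more than once: ${frequent.count()}")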