Scala and Spark Interview Questions

Compare Hadoop and Spark - >>>>
Speed: Apache Spark can be up to 100x faster than Hadoop MapReduce
Processing: Apache Spark supports real-time and batch processing, while Hadoop only has batch processing
Difficulty: Apache Spark is easy to learn because of its high-level modules; Hadoop is tough to learn
Recovery: Apache Spark allows recovery of partitions; Hadoop is fault tolerant
Interactivity: Apache Spark has interactive modes; Hadoop has no interactive mode except through Pig and Hive

What is Apache Spark? - >>>>
- Apache Spark is an open source cluster computing framework for real-time processing
- It provides a framework for programming entire clusters with fault tolerance and implicit data parallelism

Explain the key features of Apache Spark - >>>>
1.) Polyglot
2.) Speed
3.) Multiple Format Support
4.) Lazy Evaluation
5.) Real Time Computation
6.) Hadoop Integration
7.) Machine Learning

What is polyglot? - >>>>
- Spark provides high-level APIs in Java, Scala, Python, and R
- Spark code can be written in any of these languages
- It provides a shell for Scala and Python

How can the Scala shell be accessed? - >>>>./bin/spark-shell from the Spark installation directory

How can the Python shell be accessed? - >>>>./bin/pyspark from the Spark installation directory

How does Spark achieve its speed? - >>>>Spark can run up to 100x faster than Hadoop MapReduce for large-scale data processing. It achieves this speed through controlled partitioning.

What is controlled partitioning? - >>>>Managing data using partitions to help parallelize distributed data processing with minimal network traffic

What formats does Spark support? - >>>>Data sources including Parquet, JSON, Hive, and Cassandra

How to access structured data using Spark? - >>>>The Data Sources API provides a pluggable mechanism for accessing structured data using Spark SQL. Data sources can be more than simple pipes that convert data and pull it into Spark.

What is lazy evaluation? - >>>>In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is needed (non-strict evaluation) and which also avoids repeated evaluations (sharing). The sharing can reduce the running time of certain functions by an exponential factor over other non-strict evaluation strategies, such as call-by-name. The benefits of lazy evaluation include:
- The ability to define control flow (structures) as abstractions instead of primitives
- The ability to define potentially infinite data structures, which allows for more straightforward implementation of some algorithms
- Performance increases from avoiding needless calculations, and from avoiding error conditions in evaluating compound expressions

How does Spark implement lazy evaluation? - >>>>Spark adds transformations to a DAG of computation, and the DAG is executed only when the driver requests data (that is, when an action is called).

How does Spark manage real time computation? - >>>>
- Low latency due to in-memory computation
- Designed for massive scalability: users of the system run production clusters with thousands of nodes, and it supports several computational models

Does Spark have Hadoop integration? - >>>>Yes, which is a great boon for the Big Data engineers who began their careers with Hadoop. Spark is a potential replacement for Hadoop MapReduce functions, and it can also run on top of an existing Hadoop cluster using YARN for resource scheduling.

Why is Spark faster than Hadoop MapReduce? - >>>>Spark processes data in memory where it can, while MapReduce uses persistent storage for all of its data processing tasks.

List 4 examples of benefits of Spark over Hadoop MapReduce - >>>>
1.) The availability of in-memory processing makes Spark run 10x to 100x faster than Hadoop MapReduce
2.) Spark has many built-in libraries that make code more readable and maintainable and let you do multiple kinds of tasks, such as batch processing, from the same core
3.) Spark uses much more in-memory processing and caching, so it relies less on disks than Hadoop MapReduce
4.) Spark can perform many computations on the same data set (iterative computation), which Hadoop MR does not support natively

What is YARN? - >>>>YARN provides a central resource management platform to deliver scalable operations across the cluster. YARN is a distributed container manager (similar to Mesos), whereas Spark is a data processing tool. Both Spark and Hadoop MR can run on YARN.

Do you need to install Spark on all of the nodes of a YARN cluster? - >>>>No. Spark runs on top of YARN, so it does not need to be installed on every node. Spark can use YARN as its cluster manager instead of its built-in standalone manager or Mesos.

What are the configurations for Spark to run YARN? - >>>>master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue

Uses of Hadoop MapReduce - >>>>Spark also uses the MapReduce paradigm. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.

What is a resilient distributed dataset? (RDD) - >>>>An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed in nature.
- Parts of the data are stored in memory, distributed across many nodes
- RDDs are lazily evaluated in Spark, which contributes to Spark's speed

What are the two types of RDD? - >>>>
1.) Parallelized Collections
2.) Hadoop Datasets

What is a parallelized collection? - >>>>An RDD created by distributing an existing collection from the driver program so that its elements can be operated on in parallel

What is a Hadoop Dataset? - >>>>An RDD created from files in HDFS or other storage systems; functions are performed on each file record

How to create RDDs in Spark? - >>>>
1.) Parallelizing a collection in your driver program using SparkContext's parallelize - sc.parallelize(DataArray)
2.) Loading an external dataset from external storage such as HDFS

What is executor memory? - >>>>
- The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag.
- Executor memory is basically a measure of how much memory of the worker node the application will utilize because there is o ...
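
The answers above on creating RDDs, lazy evaluation, and executor memory can be tied together in a short Scala sketch. It is only an illustration under assumptions: the object name `RddSketch`, the helper `sumDoubled`, the app name, the `local[2]` master, and the `1g` memory value are all hypothetical choices, and it needs the Spark core library on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; none of these names come from the original answers.
object RddSketch {

  // Builds an RDD from a driver-side collection and sums the doubled values.
  def sumDoubled(sc: SparkContext, data: Seq[Int]): Int = {
    // Parallelized collection: distribute the driver-side data into 2 partitions.
    val rdd = sc.parallelize(data, numSlices = 2)
    // Transformation: only recorded in the DAG, nothing is computed yet.
    val doubled = rdd.map(_ * 2)
    // Action: the driver asks for a result, so the DAG is executed now.
    // Note: reduce throws on an empty RDD, so `data` is assumed non-empty.
    doubled.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("rdd-sketch")           // assumed app name
      .setMaster("local[2]")              // assumed: local mode with 2 threads
      .set("spark.executor.memory", "1g") // assumed value; matters on a real cluster,
                                          // in local mode everything runs in the driver JVM
    val sc = new SparkContext(conf)
    try println(sumDoubled(sc, Seq(1, 2, 3, 4, 5))) // prints 30
    finally sc.stop()
  }
}
```

On a cluster, the same memory setting would more commonly be passed at submit time with the `--executor-memory` flag mentioned above rather than hard-coded in the application.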

Last updated: 2 years ago
