DATA 603 – Big Data Platforms
Homework #9 – Spark SQL
(1) [10 Points] Explain how SQL is applied to a typical RDD? What components are needed to
perform this task?
Ans)
Spark SQL combines relational processing, like that of a relational database, with Spark's
functional programming API. It supports a variety of data sources and lets SQL queries be
interleaved with code transformations, which makes it a powerful tool and narrows the gap
between an RDD and a relational table. Because the structure of the data is known, it also
gives Spark more room to optimize queries.
Spark SQL applies SQL to RDDs through a special type of RDD called a SchemaRDD, which is
essentially an RDD with a schema. Because it carries a schema, relational queries can be run on
the data alongside the basic RDD functions. A SchemaRDD can be registered as a table so that
SQL queries can be executed on it using Spark SQL. A SchemaRDD is made up of two components:
the object data, which is the data stored in the RDD, and the schema, which describes the data
types of those objects.
Spark SQL supports two methods for converting existing RDDs into SchemaRDDs. The first method
uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when the schema is already
known while the Spark application is being written. For example, in a Python application,
Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the data types. Rows are
constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this
list define the column names of the table, and the types are inferred by looking at the first
row.
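The inference rule described above can be illustrated with a minimal plain-Python sketch. It
does not use Spark: the function name infer_schema and the sample data are hypothetical, and
ordinary dicts stand in for Row objects. It only mimics the idea that the keys become column
names and that the types are taken from the first row.

```python
def infer_schema(rows):
    """Return (column_name, type_name) pairs taken from the first row only."""
    if not rows:
        raise ValueError("cannot infer a schema from an empty dataset")
    first = rows[0]
    # Keys become column names; the Python type of each value becomes the column type.
    return [(name, type(value).__name__) for name, value in first.items()]


# Each row is built from key/value pairs, much like kwargs passed to a Row class.
people = [
    dict(name="Alice", age=34),
    dict(name="Bob", age=19),
]

inferred = infer_schema(people)
print(inferred)  # [('name', 'str'), ('age', 'int')]
```

Because only the first row is inspected, a later row with a different type in the same column
would go unnoticed by this sketch.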
The second method for creating SchemaRDDs is through a programmatic interface that allows
you to construct a schema and then apply it to an existing RDD. While this method is more
verbose, it allows you to construct SchemaRDDs when the columns and their types are not
known until runtime. In Python, a SchemaRDD can be created
programmatically in three steps.
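As an illustration of the general idea only, here is a plain-Python sketch of building a schema
at runtime and applying it to existing records. It does not use Spark, and the names
schema_string, build_schema, apply_schema and the "name:type" format are assumptions made for
this example.

```python
# Column names and types arrive as data at runtime (hypothetical input format).
schema_string = "name:str age:int"

# Map the type names used in the schema string to Python conversion functions.
CASTS = {"str": str, "int": int, "float": float}


def build_schema(spec):
    """Turn 'name:str age:int' into [('name', str), ('age', int)]."""
    schema = []
    for field in spec.split():
        name, type_name = field.split(":")
        schema.append((name, CASTS[type_name]))
    return schema


def apply_schema(records, schema):
    """Attach column names to each tuple and cast values to the declared types."""
    rows = []
    for record in records:
        if len(record) != len(schema):
            raise ValueError("record does not match schema: %r" % (record,))
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


# Existing untyped records, for example parsed from lines of a text file.
raw_lines = ["Alice,34", "Bob,19"]
records = [tuple(line.split(",")) for line in raw_lines]

runtime_schema = build_schema(schema_string)
table = apply_schema(records, runtime_schema)
print(table)  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': 19}]
```

The extra code compared with inference is the price of not needing to know the columns when
the program is written.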