HPC/Big Data Certification Exam 2022 with complete solution
What qualifies as a Big Data Workload? >>>>>· Consists of semi-structured or unstructured data not suitable for relational databases
· Data volume is considered too large for other solutions (Petabyte scale)
· To process the data in a reasonable timeframe, a massively parallel solution is required
What are the most common Big Data workloads? >>>>>· Batch processing
· In-memory processing
· ML (typically GPU based)
What are the challenges customers run into for Big Data workloads on premises? >>>>>· Tracking growth patterns and scaling infrastructure to meet capacity requirements
· Time associated with procuring, deploying and maintaining infrastructure to meet demand
· The cost associated with processing this data using other methods
· The cost associated with Disaster Recovery when dealing with Petabytes of data
· The cost and complexity associated with hardware refresh
Customers running Big Data workloads on OCI can ___? >>>>>· Dynamically scale capacity against demand
· Leverage Object Storage as a cost-effective Data Lake and for Disaster Recovery
· Take advantage of the best price/performance in the cloud
· Use OCI's managed service offerings to easily deploy and run common Big Data frameworks
What is Big Data Appliance (BDA)? >>>>>· Single-tenant, Cloudera-based hardware appliance deployed on-premises
What does Big Data Appliance (BDA) include? >>>>>· Cloudera Enterprise Data Hub (EDH) v5.12
· Big Data Manager
· Big Data SQL
What is Oracle Big Data Service (BDS)? >>>>>· Multitenant, managed Cloudera EDH Hadoop deployment
What does Oracle Big Data Service (BDS) include? >>>>>· Cloudera EDH v5.16.1 or v6.2.0
· Big Data Manager
· Big Data SQL
What is the difference between Big Data Cloud Service (BDCS) and Big Data Service (BDS)? >>>>>· BDCS - Gen1, BDS - Gen2
· BDCS - Cloudera EDH v5.16.x, BDS - Cloudera EDH v5.16.1 or v6.2.0
· BDCS - deprecated, BDS - license included in consumption
What is Oracle Data Flow (ODF)? >>>>>· Provides a serverless framework for running Spark-based workloads
Where do customers put data and application code for Oracle Data Flow (ODF) applications? >>>>>· Object Storage
Oracle Data Flow (ODF) provides support for what types of applications? >>>>>· Java
· Python
· SQL
· Scala
What is Oracle Data Science? >>>>>· Platform for data scientists to create projects which run notebook-based modeling on-demand
What services does Oracle Data Science use? >>>>>· Compute
· Block Storage
What shapes are available for Oracle Data Science Notebook Sessions? >>>>>· VM.Standard.E2.2, VM.Standard.E2.4, VM.Standard.E2.8
· VM.Standard2.1, VM.Standard2.2, VM.Standard2.4, VM.Standard2.6, VM.Standard2.8, VM.Standard2.16, VM.Standard2.24
What is Oracle Streaming Service? >>>>>· Kafka-compatible producer/consumer service that ingests continuous streams of data
Which Hadoop distributions are supported on OCI? >>>>>· Cloudera
· Hortonworks
· MapR
Some self-managed Big Data Products are driven by ___? >>>>>· OCI QuickStart program
· Marketplace
When deploying Hadoop on OCI, what should you consider? >>>>>· Normalize either OCPU or Memory against OCI shapes used as workers to meet workload requirements
· Use HDFS replication factor 3 when using DenseIO NVMe storage to mitigate hardware failure
· After normalizing OCPU or Memory, use heterogeneous storage on DenseIO workers leveraging Block Storage to augment HDFS capacity - allows you to scale HDFS capacity around workload
· Segregate cluster and storage network traffic on BM hosts when using Block Volumes for HDFS by leveraging both physical VNICs - create a storage network on the primary interface and deploy Hadoop on the secondary interface
· Use private IP networks for Hadoop cluster hosts - enable cluster access either through edge node(s) or private connectivity such as VPN or FastConnect
What is Terasort? >>>>>· Popular benchmark that measures the time required to sort 1 TB of randomly distributed data on a given computer system
· Used to measure MapReduce performance of an Apache Hadoop Cluster (all hardware layers - CPU, Memory, Storage, Network I/O)
What are the phases of Terasort? >>>>>· TeraGen
· TeraSort
· TeraValidate
What is TeraGen? >>>>>· Generate random dataset of specified size
What is TeraSort? >>>>>· Map, Shuffle, Reduce the source data into a smaller result set
What is TeraValidate? >>>>>· Read the result set and validate it
What is TeraGen heavily dependent on? >>>>>· Write intensive
What is TeraSort heavily dependent on? >>>>>· Read, process, write, and I/O intensive
What is TeraValidate heavily dependent on? >>>>>· Read intensive
What is the Terasort benchmark? >>>>>· Total time to run all 3 Terasort phases (see the example run below)
Draw a diagram of what TeraSort looks like >>>>>· See the study guide
What is the most important phase of Terasort? >>>>>· TeraSort
What are some steps for sizing when considering deployment on OCI? >>>>>· Always build in redundancy for DenseIO hosts to mitigate data loss - in the case of Hadoop, use an HDFS replication factor of 3 for local NVMe storage
· For low-risk environments, a replication factor of 2 can be used with Block Storage - more cost-effective (see the sizing sketch below)
· Block Storage throughput uses the same bandwidth available to each instance VNIC
What options support best practices for Big Data migration? >>>>>· Object Storage
· Data Transfer Appliance
· FastConnect
What is Data Transfer Appliance? >>>>>· An option for customers whose governance requirements restrict copying sensitive data over the wire, or who have too much data to move in the available time/bandwidth: Oracle delivers an appliance, the customer loads their data onto it, and Oracle then uploads the data to Object Storage