Question 1

A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

A. pyspark.sql.types.DateType

B. datetime

C. pyspark.sql.types.TimestampType

D. Cron syntax

E. There is no way to represent and submit this information programmatically

Answer : D

Cron syntax represents a run schedule as a single string, so a complex schedule can be copied and submitted programmatically. Databricks Jobs use Quartz cron syntax, in which an expression has six required fields (seconds, minutes, hours, day of month, month, day of week) and an optional seventh field for the year. For example, the cron expression 0 0 12 * * ? means run the job at 12:00 PM every day. The data engineer can apply the same expression to other Jobs by creating or updating them through the Databricks REST API, or through the Databricks CLI with a JSON file that contains the cron expression. The other options do not fit: DateType and TimestampType are Spark column data types, datetime is a Python module for date and time values, and none of them describes a recurring schedule. Option E is incorrect because cron syntax does exactly this.

Reference: Schedule a job, Jobs API, Databricks CLI, Cron expressions
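The idea of transferring one schedule to several Jobs can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes the request body shape of the Jobs API update call (a `job_id` plus `new_settings` containing a `schedule` object), and the helper name `build_schedule_update`, the job IDs, and the `UTC` time zone are hypothetical. It only builds the request bodies and does not send anything to a workspace.

```python
import json

# Quartz cron expression from the explanation: 12:00 PM every day.
CRON_EXPRESSION = "0 0 12 * * ?"


def build_schedule_update(job_id: int, cron: str, timezone: str = "UTC") -> dict:
    """Return a request body that changes only the schedule of one job.

    Assumption: the body follows the Jobs API update shape
    (job_id + new_settings.schedule). The time zone default is illustrative.
    """
    return {
        "job_id": job_id,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": cron,
                "timezone_id": timezone,
                "pause_status": "UNPAUSED",
            }
        },
    }


# Hypothetical IDs of the Jobs that should share the same schedule.
target_job_ids = [101, 102, 103]
payloads = [build_schedule_update(job_id, CRON_EXPRESSION) for job_id in target_job_ids]

# Each payload could be saved as a JSON file for the CLI or posted to the REST API.
print(json.dumps(payloads[0], indent=2))
```

Because the schedule is just a string inside a JSON document, reusing it across Jobs is a matter of repeating the same value rather than re-entering the form by hand.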

Question 2

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

B. Records that violate the expectation cause the job to fail.

C. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

Answer : B

When a batch containing data that violates the expectation is processed, the job fails. The expectation clause uses the ON VIOLATION FAIL UPDATE option, which means that if any record in the batch does not meet the expectation, the update stops and nothing from that batch is written to the target dataset. This option is useful for enforcing strict data quality rules and preventing invalid data from entering the target dataset.

Option A is not correct, as the ON VIOLATION FAIL UPDATE option does not drop the records that violate the expectation, but fails the entire update. To drop the violating records and record them as invalid in the event log, the ON VIOLATION DROP ROW option should be used.

Option C is not correct, as the ON VIOLATION FAIL UPDATE option does not drop the violating records, and expectations have no built-in quarantine option. Loading invalid records into a quarantine table requires additional pipeline logic, such as a separate table that selects the rows failing the rule.

Option D is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records to the target dataset. Option D describes the default behavior when the ON VIOLATION clause is omitted: invalid records are kept in the target and the violation is reported in the event log.

Option E is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records to the target dataset, and no expectation option adds a flag field to the target dataset.

Delta Live Tables Expectations

[Databricks Data Engineer Professional Exam Guide]
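The three expectation behaviors discussed above (keep and log by default, DROP ROW, FAIL UPDATE) can be illustrated with a small plain-Python simulation. This is a sketch of the semantics only and does not use the Delta Live Tables API; the function `apply_expectation`, the sample records, and the use of `ValueError` to stand in for a failed update are all assumptions made for illustration.

```python
from datetime import datetime

# Mirrors the constraint: timestamp > '2020-01-01'.
CUTOFF = datetime(2020, 1, 1)


def is_valid(record: dict) -> bool:
    """Return True when the record satisfies the valid_timestamp rule."""
    return record["timestamp"] > CUTOFF


def apply_expectation(batch, on_violation=None):
    """Simulate one update; returns (rows written, number of violations).

    on_violation=None      -> keep every row, only count violations
    on_violation="DROP ROW"    -> write only valid rows
    on_violation="FAIL UPDATE" -> raise, so nothing is written
    """
    violations = [r for r in batch if not is_valid(r)]
    if on_violation == "FAIL UPDATE" and violations:
        raise ValueError(f"{len(violations)} record(s) violated valid_timestamp")
    if on_violation == "DROP ROW":
        return [r for r in batch if is_valid(r)], len(violations)
    return list(batch), len(violations)


# Hypothetical batch with one valid and one invalid record.
batch = [
    {"id": 1, "timestamp": datetime(2021, 6, 1)},
    {"id": 2, "timestamp": datetime(2019, 3, 15)},
]

kept, violation_count = apply_expectation(batch)
dropped, _ = apply_expectation(batch, "DROP ROW")

try:
    apply_expectation(batch, "FAIL UPDATE")
    update_failed = False
except ValueError:
    update_failed = True

print(len(kept), len(dropped), update_failed)
```

With FAIL UPDATE, a single invalid record is enough to stop the whole update, which is why answer B describes the behavior in the question.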

Question 3

Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?

A. Manually programming in an alert system in each cell of the Notebook

B. Setting up an Alert in the Job page

C. Setting up an Alert in the Notebook

D. There is no way to notify the Job owner in the case of Job failure

E. MLflow Model Registry Webhooks

Answer : B

To send the Databricks Job owner an email when the Job fails, the best approach is to set up an Alert in the Job page. There, the Job owner can add an email address and choose the failure event as the trigger for the notification. The other options are not feasible, not reliable, or not relevant. Manually programming an alert system in each cell of the Notebook is tedious and error-prone, and it cannot report failures that happen before the code runs. Setting up an Alert in the Notebook is not the right mechanism, because failure notifications are configured on the Job, not on the Notebook it runs. There is a way to notify the Job owner in the case of Job failure, so option D is incorrect. MLflow Model Registry Webhooks are used for model lifecycle events, not Job events, so option E is not applicable.

Reference:

Add email and system notifications for job events

Alerts

MLflow Model Registry Webhooks
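The same notification can also be expressed in a Job's settings instead of the Job page. The sketch below assumes the Jobs API settings field `email_notifications` with an `on_failure` list of addresses; the helper `with_failure_email`, the job name, and the email address are hypothetical, and nothing is sent to a workspace.

```python
import json


def with_failure_email(job_settings: dict, owner_email: str) -> dict:
    """Return a copy of the job settings that emails owner_email on failure.

    Assumption: settings follow the Jobs API shape, where
    email_notifications.on_failure is a list of recipient addresses.
    """
    updated = dict(job_settings)
    notifications = dict(updated.get("email_notifications", {}))
    recipients = list(notifications.get("on_failure", []))
    if owner_email not in recipients:
        recipients.append(owner_email)
    notifications["on_failure"] = recipients
    updated["email_notifications"] = notifications
    return updated


# Hypothetical job settings and owner address.
settings = {"name": "nightly_load"}
updated_settings = with_failure_email(settings, "owner@example.com")

print(json.dumps(updated_settings, indent=2))
```

The helper copies the settings before changing them, so the original dictionary can still be compared against the updated one before anything is submitted.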

Question 4

Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

A. Parquet files can be partitioned

B. CREATE TABLE AS SELECT statements cannot be used on files

C. Parquet files have a well-defined schema

D. Parquet files have the ability to be optimized

E. Parquet files will become Delta tables

Answer : C

Option C is the correct answer because Parquet files have a well-defined schema that is embedded within the data itself. The column names and data types are read from the files and preserved when an external table is created from them. This matters for CREATE TABLE AS SELECT in particular, because the statement takes its schema from the query result and does not let the schema be declared manually. CSV files, on the other hand, do not have a schema embedded in them: every value is text, so the column types must be specified explicitly or inferred from the data. Inference can produce wrong or inconsistent data types and column names, and it adds processing time and complexity.
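The CSV side of this comparison can be shown with the Python standard library alone. This is a minimal sketch under stated assumptions: the sample data, column names, and the `assumed_schema` mapping are invented for illustration, and the code does not read Parquet, it only shows that CSV values arrive as untyped text that the reader has to convert.

```python
import csv
import io

# Hypothetical CSV content; nothing in the file says which columns are numeric.
raw_csv = "order_id,amount,order_date\n1,19.99,2024-01-05\n2,5.00,2024-01-06\n"

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Every value comes back as a string, whatever it looks like.
value_types = {column: type(value).__name__ for column, value in rows[0].items()}
print(value_types)  # {'order_id': 'str', 'amount': 'str', 'order_date': 'str'}

# The types have to be supplied from outside the file (an assumption by the reader).
assumed_schema = {"order_id": int, "amount": float, "order_date": str}
typed_rows = [
    {column: assumed_schema[column](value) for column, value in row.items()}
    for row in rows
]
print(typed_rows[0])
```

A Parquet file stores the equivalent of `assumed_schema` in its own metadata, so no such external mapping or guesswork is needed.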

Question 5

Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

A. None of these

B. Data lake

C. Data warehouse

D. All of these

E. Data lakehouse

Answer : E

A data lakehouse is an architecture that can be used to simplify and unify siloed data architectures that are specialized for specific use cases. It combines the strengths of data lakes and data warehouses in a single platform: support for diverse data types, open standards, and low-cost storage from the lake side, and high-performance queries, ACID transactions, schema enforcement, and governance from the warehouse side. A data lake or a data warehouse on its own covers only part of this list, which is why options B, C, and D are incorrect. A data lakehouse enables data engineers to build reliable and scalable data pipelines that serve various downstream applications and users, such as data science, machine learning, analytics, and reporting. On Databricks, the lakehouse is built on Delta Lake, a storage layer that brings reliability and performance to data lakes.

Reference: What is a data lakehouse?, Delta Lake, Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Free Practice Questions for Databricks Certified Data Engineer Associate Exam
