Question 1

A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Adf_user_non_pii = df_user.drop('first_name', 'last_name', 'email', 'birthdate')

Bdf_user_non_pii = df_user.drop('first_name', 'last_name', 'email', 'birthdate')

Cdf_user_non_pii = df_user.dropfields('first_name', 'last_name', 'email', 'birthdate')

Ddf_user_non_pii = df_user.dropfields('first_name, last_name, email, birthdate')

Answer : A

To remove specific columns from a PySpark DataFrame, the drop() method is used. This method returns a new DataFrame without the specified columns. The correct syntax for dropping multiple columns is to pass each column name as a separate argument to the drop() method.

Correct Usage:

df_user_non_pii = df_user.drop('first_name', 'last_name', 'email', 'birthdate')

This line of code will return a new DataFrame df_user_non_pii that excludes the specified PII columns.

Explanation of Options:

A . Correct. Uses the drop() method with multiple column names passed as separate arguments, which is the standard and correct usage in PySpark.

B . Although it appears similar to Option A, if the column names are not enclosed in quotes or if there's a syntax error (e.g., missing quotes or incorrect variable names), it would result in an error. However, as written, it's identical to Option A and thus also correct.

C . Incorrect. The dropfields() method is not a method of the DataFrame class in PySpark. It's used with StructType columns to drop fields from nested structures, not top-level DataFrame columns.

D . Incorrect. Passing a single string with comma-separated column names to dropfields() is not valid syntax in PySpark.

PySpark Documentation: DataFrame.drop

Stack Overflow Discussion: How to delete columns in PySpark DataFrame

Question 2

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

ADataFrame.groupBy().agg()

BDataFrame.filter()

CDataFrame.withColumn()

DDataFrame.select()

Answer : A

Operations that trigger data movement across partitions (like groupBy, join, repartition) result in a shuffle and a new stage.

From Spark documentation:

''groupBy and aggregation cause data to be shuffled across partitions to combine rows with the same key.''

Option A (groupBy + agg) causes shuffle.

Options B, C, and D (filter, withColumn, select) transformations that do not require shuffling; they are narrow dependencies.

Final Answer: A

Question 3

39 of 55.

A Spark developer is developing a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

ABroadcast a variable to share the maximum time among workers.

BConfigure the Spark UI to automatically collect maximum times.

CUse an RDD action like reduce() to compute the maximum time.

DUse an accumulator to record the maximum time on the driver.

Answer : C

RDD actions like reduce() aggregate values across all partitions and return the result to the driver.

To compute the maximum processing time, reduce() is ideal because it combines results from all tasks efficiently.

Example:

max_time = rdd_times.reduce(lambda x, y: max(x, y))

This aggregates maximum values from all executors into a single result on the driver.

Why the other options are incorrect:

A: Broadcast variables distribute read-only data; they cannot aggregate results.

B: Spark UI provides visualization, not programmatic collection.

D: Accumulators support additive operations only (e.g., counters, sums), not non-associative ones like max.

Spark RDD API --- reduce() for aggregations.

Databricks Exam Guide (June 2025): Section ''Apache Spark Architecture and Components'' --- actions, accumulators, and broadcast variables.

===========

Question 4

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \

.withColumn("event_month", F.month("event_ts")) \

.bucketBy(42, ["event_year", "event_month"]) \

.saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

AReplace .bucketBy() with .partitionBy('event_year', 'event_month')

BChange the bucket count (42) to a lower number

CAdd .sortBy() after .bucketBy()

DReplace .bucketBy() with .partitionBy('event_year') only

Answer : A

When queries frequently filter on certain columns, partitioning by those columns ensures partition pruning, allowing Spark to scan only relevant directories instead of the entire dataset.

Correct code:

final.write.partitionBy('event_year', 'event_month').parquet('events.liveLatest')

This improves read performance dramatically for filters like:

SELECT * FROM events.liveLatest WHERE event_year = 2024 AND event_month = 5;

bucketBy() helps in clustering and joins, not in partition pruning for file-based tables.

Why the other options are incorrect:

B: Bucket count changes parallelism, not query pruning.

C: sortBy organizes data within files, not across partitions.

D: Partitioning by only one column limits pruning benefits.

Spark SQL DataFrameWriter --- partitionBy() for partitioned tables.

Databricks Exam Guide (June 2025): Section ''Using Spark SQL'' --- partitioning vs. bucketing and query optimization.

===========

Question 5

36 of 55.

What is the main advantage of partitioning the data when persisting tables?

AIt compresses the data to save disk space.

BIt automatically cleans up unused partitions to optimize storage.

CIt ensures that data is loaded into memory all at once for faster query execution.

DIt optimizes by reading only the relevant subset of data from fewer partitions.

Answer : D

Partitioning a dataset divides data into separate directories based on partition column values. When queries filter on partitioned columns, Spark can prune irrelevant partitions --- meaning it only reads files that match the filter criteria.

Advantage:

Reduces I/O and improves performance by scanning only relevant subsets of data.

Example:

/data/sales/year=2023/month=10/...

/data/sales/year=2024/month=01/...

A query filtering WHERE year = 2024 reads only the relevant partition.

Why the other options are incorrect:

A: Compression is independent of partitioning.

B: Spark does not automatically clean partitions unless managed manually.

C: Partitioning does not cause Spark to load entire data into memory.

Databricks Exam Guide (June 2025): Section ''Using Spark SQL'' --- partitioning and pruning for optimized data retrieval.

Spark SQL Documentation --- DataFrameWriter partitionBy() and query optimization.

===========

Free Practice Questions for Databricks Certified Associate Developer for Apache Spark 3.5 Exam

Question 1

Question 2

Question 3

Question 4

Question 5