Question 1

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

A. Spark ML decision trees test every feature variable in the splitting algorithm

B. Spark ML decision trees automatically prune overfit trees

C. Spark ML decision trees test more split candidates in the splitting algorithm

D. Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E. Spark ML decision trees test binned feature values as representative split candidates

Answer: E

Spark ML decision trees test binned feature values as representative split candidates. For each continuous feature, Spark ML uses quantile binning to group values into bins, which reduces the number of potential split points to the bin boundaries. sklearn, by contrast, tests all possible split points directly. Because the two libraries search different sets of candidate thresholds, they can choose different splits and produce different trees even when the data and manually specified hyperparameters are identical.

Reference

Spark MLlib Documentation (Decision Trees and Quantile Binning).
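
The difference in candidate thresholds can be illustrated with a minimal pure-Python sketch. This is not the actual sklearn or Spark ML code: the function names and the max_bins argument are hypothetical, and the quantile rule is a simplified stand-in for Spark's approximate binning. It only shows why a binned search sees far fewer thresholds than an exhaustive one.

```python
def exhaustive_candidates(values):
    """Midpoints between consecutive distinct values (exhaustive search)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_candidates(values, max_bins):
    """Simplified quantile binning: at most max_bins - 1 bin boundaries."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value sitting at each quantile position as a bin boundary.
    boundaries = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(boundaries)


values = list(range(100))  # one illustrative continuous feature

all_thresholds = exhaustive_candidates(values)
binned_thresholds = binned_candidates(values, max_bins=4)

print(len(all_thresholds))   # 99 candidate thresholds
print(binned_thresholds)     # [25, 50, 75]
```

If the best threshold falls between two bin boundaries, the binned search cannot select it, so the two trees can diverge from the first split onward.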

Question 2

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

A. They can turn on Databricks Autologging

B. They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C. They can start each child run inside the parent run's indented code block using mlflow.start_run()

D. They can start each child run with the same experiment ID as the parent run

E. They can specify nested=True when starting the parent run for the tuning process

Answer: B

To organize MLflow runs with one parent run for the tuning process and a child run for each unique combination of hyperparameter values, the data scientist can pass nested=True to mlflow.start_run when starting each child run while the parent run is active. Each child run is then recorded under the parent run, which keeps a clear hierarchy and makes it easier to track and compare hyperparameter combinations within the same tuning process.

Reference

MLflow Documentation (Managing Nested Runs).

Question 3

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

A. Open the MLmodel artifact in the MLflow run page

B. Click the 'Models' link in the row corresponding to the run in the MLflow experiment page

C. Click the 'Source' link in the row corresponding to the run in the MLflow experiment page

D. Click the 'Start Time' link in the row corresponding to the run in the MLflow experiment page

Answer: C

To view the notebook that was run to create an MLflow run, click the 'Source' link in the row corresponding to the run in the MLflow experiment page. The 'Source' link points to the notebook or script that initiated the run, so you can review the code used in the experiment. This is particularly useful for reproducibility and for understanding the context of a run.

Reference

MLflow Documentation (Viewing Run Sources and Notebooks).

Question 4

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.

Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

A. Model tuning

B. Model evaluation

C. Model deployment

D. Exploratory data analysis

Answer: D

AutoML platforms, such as the one available in Databricks Machine Learning, streamline various stages of the machine learning pipeline including feature engineering, model selection, hyperparameter tuning, and model evaluation. However, exploratory data analysis (EDA) is typically performed outside the AutoML process. EDA involves understanding the dataset, visualizing distributions, identifying anomalies, and gaining insights into data before feeding it into a machine learning pipeline. This step is crucial for ensuring that the data is clean and suitable for model training but is generally done manually by the data scientist.

Reference

Databricks documentation on AutoML: https://docs.databricks.com/applications/machine-learning/automl.html

Question 5

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

A. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B. They can check the Databricks Runtime ML box when creating their clusters.

C. They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D. They can set the runtime-version variable in their Spark session to "ml".

Answer: C

The Databricks Runtime for Machine Learning includes pre-installed packages and libraries essential for machine learning and deep learning, including MLflow. To use it, the machine learning engineer can simply select an appropriate Databricks Runtime ML version from the 'Databricks Runtime Version' dropdown menu while creating their cluster. This selection ensures that all necessary machine learning libraries, including MLflow, are pre-installed and ready for use, avoiding the need to manually install them each time.

Reference

Databricks documentation on creating clusters: https://docs.databricks.com/clusters/create.html

Free Practice Questions for Databricks Machine Learning Associate Exam
