Snowflake SnowPro Advanced: Data Scientist Certification

QUESTION NO: 1
A data engineer is tasked with removing duplicates from a table named 'USER_ACTIVITY' in Snowflake, which contains user activity logs. The table has columns: 'USER_ID', 'ACTIVITY_TIMESTAMP', 'ACTIVITY_TYPE', and 'DEVICE_ID'. The data engineer wants to remove duplicate rows, considering only the 'USER_ID', 'ACTIVITY_TYPE', and 'DEVICE_ID' columns. What is the most efficient and correct SQL query to achieve this while retaining only the earliest 'ACTIVITY_TIMESTAMP' for each unique combination of the specified columns?

A. Option A

B. Option C

C. Option E

D. Option D

E. Option B

Correct Answer: E
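
The answer options for this question were not preserved, so the following is only a sketch of one common way to keep the earliest row per key: rank rows with ROW_NUMBER() inside each (USER_ID, ACTIVITY_TYPE, DEVICE_ID) group and keep rank 1. It runs against an in-memory SQLite database as a stand-in for Snowflake, and the sample rows are invented for illustration. In Snowflake the same filter could be expressed more compactly with a QUALIFY clause.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_activity ("
    "user_id INTEGER, activity_timestamp TEXT, activity_type TEXT, device_id TEXT)"
)
# Hypothetical sample rows; two groups contain duplicates.
conn.executemany(
    "INSERT INTO user_activity VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01 10:00:00", "login", "d1"),
        (1, "2024-01-01 09:00:00", "login", "d1"),  # earliest row of its group
        (1, "2024-01-02 08:00:00", "logout", "d1"),
        (2, "2024-01-03 12:00:00", "login", "d2"),  # earliest row of its group
        (2, "2024-01-03 12:30:00", "login", "d2"),
    ],
)

# Number the rows of each group by timestamp, then keep only the first one.
DEDUP_SQL = """
SELECT user_id, activity_timestamp, activity_type, device_id
FROM (
    SELECT ua.*,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, activity_type, device_id
               ORDER BY activity_timestamp
           ) AS rn
    FROM user_activity AS ua
)
WHERE rn = 1
ORDER BY user_id, activity_type
"""
deduplicated = conn.execute(DEDUP_SQL).fetchall()
```

Each key combination appears once in 'deduplicated', carrying its earliest timestamp.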

QUESTION NO: 2
You are building a data science pipeline in Snowflake to perform time series forecasting. You've decided to use a Python UDTF to encapsulate the forecasting logic using a library like 'Prophet'. The UDTF needs to access historical data to train the model and generate forecasts. The data is stored in a Snowflake table named 'SALES_DATA' with columns 'DATE' and 'SALES'. Which of the following approaches is/are most efficient and secure for accessing the 'SALES_DATA' table from within the UDTF during model training?

A. Pass the entire 'SALES_DATA' table as a Pandas DataFrame to the UDTF as an argument. This approach is suitable for smaller datasets. Do not partition the data frame.

B. Create a view on top of 'SALES_DATA' and grant access to the UDTF's owner role to the view. Then, query the view using Snowpark within the UDTF.

C. Use the 'snowflake.connector' to connect to Snowflake using a dedicated service account with read-only access to the 'SALES_DATA' table. Store the service account credentials securely in Snowflake secrets and retrieve them within the UDTF.

D. Bypass Snowflake entirely and load data from an S3 stage into a Pandas DataFrame.

E. Use the Snowpark API within the UDTF to query the 'SALES_DATA' table directly, leveraging the existing Snowflake session context. This requires no additional credentials management.

Correct Answer: B,E

QUESTION NO: 3
You have a dataset in Snowflake containing customer reviews. One of the columns, 'review_text', contains free-text customer feedback. You want to perform sentiment analysis on these reviews and include the sentiment score as a feature in your machine learning model. Furthermore, you wish to categorize the sentiment into 'Positive', 'Negative', and 'Neutral'. Given the need for scalability and efficiency within Snowflake, which methods could be employed?

A. Use a Snowflake procedure that reads all 'review_text' data, transfers data outside of Snowflake to an external server running sentiment analysis software, and then writes results back into a new table.

B. Utilize Snowflake's external functions to call a pre-existing sentiment analysis API (e.g., Google Cloud Natural Language API or AWS Comprehend), passing the review text and storing the returned sentiment score and category. Ensure proper API key management and network configuration.

C. Use a Python UDF (User-Defined Function) with a pre-trained sentiment analysis library (e.g., NLTK or spaCy) to calculate the sentiment score and categorize it. Deploy the UDF in Snowflake and apply it to the 'review_text' column.

D. Create a Snowpark Python DataFrame from the Snowflake table, use a sentiment analysis library within the Snowpark environment, categorize the sentiments, and then save the resulting DataFrame back to Snowflake as a new table.

E. Create a series of Snowflake SQL queries utilizing complex string matching and keyword analysis to determine sentiment based on predefined lexicons. Categories are assigned through CASE statements.

Correct Answer: B,C,D
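
As a rough illustration of the score-then-categorize step described in options B to D, the sketch below shows the kind of handler logic a Python UDF might wrap. The tiny word lists and the 0.25 threshold are hypothetical stand-ins; a real UDF would call a pre-trained library or an external API instead of this toy scorer.

```python
# Hypothetical mini-lexicon standing in for a pre-trained sentiment model.
POSITIVE_WORDS = {"great", "good", "love", "excellent", "fast"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(review_text: str) -> float:
    """Return (positive hits - negative hits) / total hits, in [-1, 1]."""
    words = [w.strip(".,!?;:").lower() for w in review_text.split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    total = positive + negative
    # Text with no lexicon hits is treated as neutral (score 0.0).
    return 0.0 if total == 0 else (positive - negative) / total


def sentiment_category(score: float, threshold: float = 0.25) -> str:
    """Map a numeric score to 'Positive', 'Negative', or 'Neutral'."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"
```

Keeping the score and the category as separate functions lets the numeric score be used directly as a model feature while the category is derived from it.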

QUESTION NO: 4
A financial institution aims to detect fraudulent transactions using a Supervised Learning model deployed in Snowflake. They have a dataset with transaction details, including amount, timestamp, merchant category, and customer ID. The target variable is 'is_fraudulent' (0 or 1). They are considering different Supervised Learning algorithms. Which of the following algorithms would be MOST suitable for this fraud detection task, considering the need for interpretability, scalability, and the potential for imbalanced classes, and what specific strategies can be employed within Snowflake to handle the class imbalance?

A. Decision Tree or Random Forest, combined with techniques like oversampling the minority class (fraudulent transactions) within Snowflake using SQL or UDFs to balance the dataset before training. These models provide reasonable interpretability and can handle non-linear relationships effectively.

B. Naive Bayes, because it requires no hyperparameter tuning and works well on numerical data.

C. Linear Regression, because it's computationally efficient and easy to understand, even though fraud detection is a classification problem.

D. K-Nearest Neighbors (KNN), because it is simple to implement and doesn't require extensive training.

E. Support Vector Machine (SVM) with a radial basis function (RBF) kernel, as it can capture complex non-linear relationships without concern for interpretability.

Correct Answer: A
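
The oversampling mentioned in option A can be sketched in plain Python as random oversampling with replacement. The function name, the row layout, and the sample data are assumptions made for illustration; inside Snowflake the same idea would be expressed in SQL or a UDF, and it would normally be applied to the training split only.

```python
import random


def oversample_minority(rows, label_key="is_fraudulent", seed=42):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    if not minority or len(minority) >= len(majority):
        return list(rows)  # nothing to balance
    # Sample with replacement to fill the gap between the two classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


# Hypothetical data: 8 legitimate transactions and 2 fraudulent ones.
transactions = [{"id": i, "is_fraudulent": 0} for i in range(8)]
transactions += [{"id": 100 + i, "is_fraudulent": 1} for i in range(2)]
balanced = oversample_minority(transactions)
```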

QUESTION NO: 5
You are building an automated model retraining pipeline for a sales forecasting model in Snowflake using Snowflake Tasks and Stored Procedures. After retraining, you want to validate the new model against a champion model already deployed. You need to define a validation strategy using the following models: the champion model deployed as UDF 'FORECAST_UDF', and the contender model deployed as UDF 'FORECAST_UDF_NEW'. Given the following objectives: (1) Minimal impact on production latency, (2) Ability to compare predictions on a large volume of real-time data, (3) A statistically sound comparison metric. Which of the following SQL statements best represents how to efficiently compare the forecasts of the two models on a sample dataset and calculate the Root Mean Squared Error (RMSE) to validate the new model?

A.

B.

C.

D.

E.

Correct Answer: B
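
The SQL options for this question were not preserved, so the sketch below only illustrates the metric itself: RMSE computed for a champion and a contender forecast on the same sample. All numbers are invented, and the promotion rule is an assumption. In Snowflake SQL the same quantity could be written as SQRT(AVG(POWER(actual - predicted, 2))) over a sampled set of rows.

```python
import math


def rmse(actuals, predictions):
    """Root Mean Squared Error between two equally long sequences."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical sample: actual sales and the two models' forecasts.
actual_sales = [100.0, 120.0, 90.0, 110.0]
champion_forecast = [98.0, 125.0, 85.0, 112.0]
contender_forecast = [101.0, 119.0, 92.0, 109.0]

champion_rmse = rmse(actual_sales, champion_forecast)
contender_rmse = rmse(actual_sales, contender_forecast)

# Assumed promotion rule: the contender must have a strictly lower RMSE.
promote_contender = contender_rmse < champion_rmse
```

Scoring both models on the same sampled rows keeps the comparison fair and limits the extra load on production.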

QUESTION NO: 6
You are training a fraud detection model on a dataset containing millions of transactions. To ensure robust generalization, you've decided to implement a train-validation-holdout split using Snowflake's capabilities. Given the following requirements: (1) Temporal Split: The dataset contains a 'transaction_date' column. You want to ensure that the validation and holdout sets contain transactions after the training data. This is crucial because fraud patterns evolve over time. (2) Stratified Sampling (Within Training): The training set should maintain the original proportion of fraudulent vs. non-fraudulent transactions. The label column indicates whether a transaction is fraudulent (1) or not (0). (3) Deterministic Splits: You need a repeatable process to ensure consistency across model iterations. Which of the following SQL code snippets best achieves these requirements, considering performance and best practices within Snowflake?

A. Option A

B. Option C

C. Option E

D. Option D

E. Option B

Correct Answer: C
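
The SQL snippets for this question were not preserved. As a hedged sketch of two of the stated requirements, the code below assigns rows to splits by date cutoffs (temporal split) and uses a hash of a transaction identifier rather than a random number (deterministic sampling). The function names, cutoff dates, and identifiers are hypothetical. In Snowflake, a hash function over a key column could play the role of 'hash_bucket'. Applying the same sampling percentage to every class keeps class proportions approximately unchanged.

```python
import hashlib
from datetime import date


def assign_split(transaction_date: date, train_end: date, validation_end: date) -> str:
    """Temporal split: later dates go to validation, the latest to holdout."""
    if transaction_date <= train_end:
        return "train"
    if transaction_date <= validation_end:
        return "validation"
    return "holdout"


def hash_bucket(transaction_id: str, buckets: int = 100) -> int:
    """Map an identifier to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_training_sample(transaction_id: str, sample_percent: int = 80) -> bool:
    """Deterministically keep roughly sample_percent of the rows."""
    return hash_bucket(transaction_id) < sample_percent
```

Because the bucket depends only on the identifier, rerunning the split yields the same assignment on every model iteration.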

QUESTION NO: 7
You are tasked with building a Python stored procedure in Snowflake to train a Gradient Boosting Machine (GBM) model using XGBoost. The procedure takes a sample of data from a large table, trains the model, and stores the model in a Snowflake stage. During testing, you notice that the procedure sometimes exceeds the memory limits imposed by Snowflake, causing it to fail. Which of the following techniques can you implement within the Python stored procedure to minimize memory consumption during model training?

A. Reduce the sample size of the training data and increase the number of boosting rounds to compensate for the smaller sample. Use the 'predict_proba' method to avoid storing probabilities for all classes.

B. Write the training data to a temporary table in Snowflake, then use Snowflake's external functions to train the XGBoost model on a separate compute cluster outside of Snowflake. Then upload the model to a Snowflake stage.

C. Implement XGBoost's 'early stopping' functionality with a validation set to prevent overfitting. If the stored procedure exceeds the memory limits, the model cannot be saved. Always use a larger virtual warehouse.

D. Convert the Pandas DataFrame used for training to a Dask DataFrame and utilize Dask's distributed processing capabilities to train the XGBoost model in parallel across multiple Snowflake virtual warehouses.

E. Use the 'hist' tree method in XGBoost, enable gradient-based sampling ('goss'), and carefully tune 'max_depth' and related parameters to reduce memory usage during tree construction. Convert all features to numerical if possible.

Correct Answer: E

QUESTION NO: 8
You're building a regression model using Snowpark Python to predict house prices. After initial training, you observe that the model consistently overestimates the prices of high-value houses and underestimates the prices of low-value houses. Given the options below, which optimization metric, along with code snippet to calculate it using Snowpark, would be most effective in addressing this specific issue?

A. Mean Absolute Error (MAE) - as it is sensitive to outliers and will penalize large errors more heavily.

B. Mean Squared Error (MSE) - as it is less sensitive to outliers than RMSE.

C. Root Mean Squared Error (RMSE) - as it gives more weight to larger errors, making it suitable for addressing the underestimation/overestimation problem.

D. Adjusted R-squared - as it penalizes the addition of irrelevant features, improving the model's generalization ability.

E. R-squared - as it measures the proportion of variance explained, directly addressing how well the model fits the data across all price ranges.

Correct Answer: C
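
Option C's statement that RMSE gives more weight to larger errors can be checked with a small worked example. The two error lists below are invented: they share the same mean absolute error, but one concentrates the error in a single prediction. The code snippets that accompanied the original options were not preserved, so this is plain Python rather than Snowpark.

```python
import math


def mae_of_errors(errors):
    """Mean Absolute Error: every unit of error counts equally."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse_of_errors(errors):
    """Root Mean Squared Error: squaring makes large errors dominate."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [10.0, 10.0, 10.0, 10.0]   # error spread evenly
skewed_errors = [0.0, 0.0, 0.0, 40.0]    # same total error, one large miss

mae_even, mae_skewed = mae_of_errors(even_errors), mae_of_errors(skewed_errors)
rmse_even, rmse_skewed = rmse_of_errors(even_errors), rmse_of_errors(skewed_errors)
```

Both lists have an MAE of 10, while the RMSE doubles from 10 to 20 for the list with the single large miss.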

QUESTION NO: 9
A Snowflake table named 'SALES_DATA' contains a 'TRANSACTION_DATE' column stored as VARCHAR. The data in this column is inconsistent; some rows have dates in 'YYYY-MM-DD' format, others in 'MM/DD/YYYY' format, and some contain invalid date strings like 'N/A'. You need to standardize all dates to 'YYYY-MM-DD' format and store them in a new column called 'FORMATTED_DATE' in a new table 'STANDARDIZED_SALES_DATA'. Which of the following approaches, using Snowpark Python and SQL, most effectively handles these inconsistencies and minimizes errors during data transformation? Select all that apply:

A. Using a single 'TO_DATE' function with the format parameter set to 'AUTO', combined with 'TO_VARCHAR' to format the date to 'YYYY-MM-DD'.

B. Using a Snowpark Python UDF to parse each date string individually, handling different formats with conditional logic, and returning a formatted date string. This provides flexibility in handling diverse date formats.

C. Using a series of 'DATE' and 'TO_VARCHAR' SQL functions in Snowpark to attempt converting the date in different formats and then formatting the result to 'YYYY-MM-DD'. Any conversion failing returns NULL.

D. Employing Snowpark's error handling mechanism (e.g., 'try...except' blocks) within a loop to iteratively convert each date string, catching and logging errors, and storing valid dates in a new column.

E. Creating a view on top of 'SALES_DATA' that implements the conversion logic. This avoids creating a new physical table immediately and allows for experimentation with different conversion strategies before materializing the data.

Correct Answer: C,E
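
The "try each format, return NULL on failure" pattern from option C can be sketched in plain Python as follows. The function name and format list are assumptions for illustration. In Snowflake SQL, a comparable pattern would likely chain TRY_TO_DATE calls with explicit formats inside COALESCE and then format the result with TO_VARCHAR.

```python
from datetime import datetime
from typing import Optional

# The two formats named in the question, tried in order.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_date(raw: Optional[str]) -> Optional[str]:
    """Return the date as 'YYYY-MM-DD', or None if no format matches."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # invalid strings such as 'N/A' end up here
```

Returning None instead of raising keeps a single bad value from aborting the whole transformation, and the unparsed rows can be inspected afterwards.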

QUESTION NO: 10
You've deployed a regression model in Snowflake to predict product sales. After a month, you observe that the RMSE on your validation dataset has increased significantly compared to the initial deployment. Analyzing the prediction errors, you notice a pattern: the model consistently underestimates sales for products with a recent surge in social media mentions. Which of the following actions would be MOST effective in addressing this issue and improving the model's RMSE?

A. Retrain the model using only the most recent data (e.g., last week) to adapt to the changing sales patterns.

B. Increase the regularization strength of the model to prevent overfitting to the original training data.

C. Decrease the learning rate of the optimization algorithm during retraining to avoid overshooting the optimal weights.

D. Implement a moving average smoothing technique on the target variable (sales) before retraining the model.

E. Incorporate a feature representing the number of social media mentions for each product into the model and retrain.

Correct Answer: E

QUESTION NO: 11
You are preparing a dataset in Snowflake for a K-means clustering algorithm. The dataset includes features like 'age', 'income' (in USD), and 'number_of_transactions'. 'Income' has significantly larger values than 'age' and 'number_of_transactions'. To ensure that all features contribute equally to the distance calculations in K-means, which of the following scaling approaches should you consider, and why? Select all that apply:

A. Apply RobustScaler to handle outliers and then StandardScaler or MinMaxScaler to further scale the features.

B. Apply PowerTransformer to transform income and StandardScaler to other features to handle skewness.

C. Do not scale the data, as K-means is robust to differences in feature scales.

D. Apply StandardScaler to all three features ('age', 'income', 'number_of_transactions') to center the data around zero and scale it to unit variance.

E. Apply MinMaxScaler to all three features to scale them to a range between 0 and 1.

Correct Answer: A,D,E
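
To make the two scalers in options D and E concrete, here is a minimal pure-Python sketch of the arithmetic they perform on a single feature. The helper names and the income values are hypothetical; the standard-scaling helper divides by the population standard deviation, which is also what scikit-learn's StandardScaler does.

```python
import statistics


def standard_scale(values):
    """Center to mean 0 and scale to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - low) / (high - low) for v in values]


# Hypothetical incomes in USD, far larger in magnitude than ages would be.
income = [30000.0, 50000.0, 70000.0]
income_standardized = standard_scale(income)
income_min_max = min_max_scale(income)
```

After either transformation, 'income' lies on a scale comparable to the other features, so it no longer dominates the Euclidean distances used by K-means.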

QUESTION NO: 12
You are building a machine learning model using Snowpark Python to predict house prices. The dataset contains a feature column named 'location' which contains free-form text descriptions of house locations. You want to leverage a pre-trained Large Language Model (LLM) hosted externally to extract structured location features like city, state, and zip code from the free-form text within Snowpark. You want to minimize the data transferred out of Snowflake. Which approach is most efficient and secure?

A. Create a Snowflake External Function that calls the external LLM API. Pass the 'location' column data to the External Function and retrieve the structured location features. Then apply the External Function directly on the Snowpark DataFrame.

B. Use the Snowflake Connector for Python to directly query the 'location' column and call the external LLM API from the connector. Then write the updated data into a new table.

C. Use to load the 'location' column data into a Pandas DataFrame, call the external LLM API in your Python script to enrich the location data and then use to store the enriched data back into a Snowflake table.

D. Use Snowpark's 'createOrReplaceStage' to create an external stage pointing to the LLM API endpoint. Load the 'location' data into this stage and call the LLM API directly from the Snowflake stage using SQL.

E. Create a Snowpark User-Defined Function (UDF) that calls the external LLM API. Pass the 'location' column data to the UDF and retrieve the structured location features. Then apply the UDF directly on the Snowpark DataFrame.

Correct Answer: A

Snowflake SnowPro Advanced: Data Scientist Certification - DSA-C03 Exam Questions