Associate Data Practitioner
Unlock the power of your data in the cloud! Get hands-on with Google Cloud's core data services like BigQuery and Looker to validate your practical skills in data ingestion, analysis, and management, and earn your Associate Data Practitioner certification!
Practice Test
Fundamental
2.3 Define, train, evaluate, and use ML models
Identify ML use cases for developing models by using BigQuery ML and AutoML
BigQuery ML and AutoML help you build machine learning models without needing to write complex code. BigQuery ML works right in Google BigQuery with SQL, making it easier to start if you know database queries. AutoML, part of Vertex AI, guides you through training more advanced models with little manual tuning. These tools let data teams move quickly from data to insights.
When choosing between the two, consider your project’s data size and complexity. BigQuery ML is great for tasks like linear regression or binary classification on large datasets you already store in BigQuery. AutoML shines for more complex tasks, such as image classification or natural language processing, where the system can automatically find the best model settings. Understanding the right tool can save time and improve your results.
It’s also important to weigh skill levels and speed. Analysts familiar with SQL can jump into BigQuery ML and start modeling in minutes. AutoML offers a more guided experience, which is helpful for users without deep ML expertise. Choosing the right approach means balancing model performance with how fast you need results.
Finally, think about ongoing maintenance and scalability. BigQuery ML models can be retrained by simply running SQL queries on updated data. AutoML models can also retrain automatically when you set up pipelines in Vertex AI Pipelines. Both approaches let you keep models current, but they differ in how much manual work you’ll need over time.
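As a minimal sketch of that retraining pattern in BigQuery ML (dataset, model, table, and column names here are placeholders), re-running a CREATE OR REPLACE MODEL statement over refreshed data is all it takes:

```sql
-- Retraining is just re-running the training statement on updated data.
-- All names below are hypothetical examples.
CREATE OR REPLACE MODEL `my_dataset.sales_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['revenue']
) AS
SELECT *
FROM `my_dataset.sales_training_data`
WHERE ingest_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);  -- train on recent rows
```

A scheduled query can run this statement nightly or weekly, giving you a simple retraining loop without any extra infrastructure.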
Use pretrained Google large language models (LLMs) using remote connection in BigQuery
Pretrained large language models (LLMs), such as Google's Gemini models served through Vertex AI, can handle tasks like text generation, sentiment analysis, and translation. BigQuery lets you call these powerful models without moving data out of your warehouse. You do this by setting up a Cloud resource connection that routes requests from your SQL queries to the Vertex AI service. This approach keeps your data in place and simplifies processing.
To use an LLM in BigQuery, you create a remote model over that connection, pointing it at a Vertex AI endpoint. When you run a query with a function such as ML.GENERATE_TEXT, BigQuery sends the text to the LLM, and the model returns its output right in your query results. The process is seamless, and you can combine LLM outputs with your existing data joins and filters.
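A minimal sketch of wiring this up with BigQuery ML's remote-model syntax follows; the connection name, dataset, table, and endpoint here are placeholder assumptions, not fixed values:

```sql
-- Register a Vertex AI LLM as a remote model over an existing
-- Cloud resource connection (names below are hypothetical).
CREATE OR REPLACE MODEL `my_dataset.llm_model`
  REMOTE WITH CONNECTION `us.my_vertex_connection`
  OPTIONS (endpoint = 'gemini-1.5-flash');

-- Call the model from SQL; each row's text becomes a prompt.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `my_dataset.llm_model`,
  (SELECT review_text AS prompt FROM `my_dataset.customer_reviews`),
  STRUCT(0.2 AS temperature,
         256 AS max_output_tokens,
         TRUE AS flatten_json_output)  -- return plain text, not raw JSON
);
```

Because ML.GENERATE_TEXT is just a table function, its output can be joined, filtered, and aggregated like any other query result.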
Common use cases include generating product descriptions, analyzing customer feedback, or summarizing documents. For example, you might use an LLM to extract key topics from customer reviews stored in BigQuery. By integrating these models directly, you avoid extra data movement and reduce latency in getting your insights.
Keep in mind cost and latency considerations. Each call to a remote LLM incurs API charges, so it’s important to batch requests when possible. Also, network latency can add a small delay to queries. Properly designing your workflows and caching results can help manage these factors effectively.
Plan a standard ML project (e.g., data collection, model training, model evaluation, prediction)
A well-defined plan helps ensure your ML project succeeds. Start with data collection, gathering all relevant information from sources like logs, databases, or external APIs. Next, focus on data cleaning and feature engineering to prepare the dataset for modeling. Clean, well-structured data is critical for accurate predictions.
Here are the typical steps in a standard ML project:
- Data Collection: Identify and gather data from relevant sources.
- Data Preparation: Clean the data, handle missing values, and engineer features.
- Model Training: Choose algorithms and train models.
- Model Evaluation: Use metrics like accuracy or RMSE to judge performance.
- Prediction and Deployment: Serve the model for real-time or batch predictions.
During model training, experiment with different algorithms and hyperparameters. Use BigQuery ML or AutoML to simplify this process. Once trained, move to model evaluation to check performance using test data. Select metrics that match your business goals, such as precision for classification or mean absolute error for regression.
Finally, plan for deployment and monitoring. Decide if you need real-time serving or batch predictions, and set up pipelines to retrain and redeploy models as new data arrives. Regular monitoring helps you catch model drift and keep predictions accurate over time.
Execute SQL to create, train, and evaluate models using BigQuery ML
BigQuery ML lets you build machine learning models directly with SQL. You start by running a CREATE MODEL statement, which specifies the model type and input data. For example, you might write CREATE MODEL my_dataset.my_model OPTIONS(model_type='linear_reg') AS SELECT * FROM my_table. This single statement handles both creation and training.
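Spelled out in full, a training statement might look like the sketch below; the feature and label column names are hypothetical:

```sql
-- Create and train a model in one statement (example names only).
CREATE OR REPLACE MODEL `my_dataset.my_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['label'],        -- column the model learns to predict
  data_split_method = 'AUTO_SPLIT'     -- hold out rows for evaluation automatically
) AS
SELECT feature_1, feature_2, label
FROM `my_dataset.my_table`;
```

Using CREATE OR REPLACE MODEL makes the statement safe to re-run, which is convenient while iterating on feature selection.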
After the model is trained, use ML.EVALUATE to check its performance. This function returns evaluation metrics like mean squared error or log loss. You might run SELECT * FROM ML.EVALUATE(MODEL my_dataset.my_model, (SELECT * FROM my_holdout_table));. Evaluating your model helps you understand how well it predicts unseen data.
BigQuery ML also supports hyperparameter tuning, which you enable through the OPTIONS clause in CREATE MODEL. You can specify which parameters to test and let BigQuery ML find the best combination. Automated tuning can improve model accuracy without manual trial and error.
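One way this looks in practice, using BigQuery ML's HPARAM_RANGE syntax (the model type, label column, and ranges below are illustrative assumptions):

```sql
-- Hyperparameter tuning: BigQuery ML runs multiple trials and keeps the best.
CREATE OR REPLACE MODEL `my_dataset.tuned_classifier`
OPTIONS (
  model_type = 'boosted_tree_classifier',
  input_label_cols = ['churned'],
  num_trials = 10,                        -- how many candidate configurations to try
  max_tree_depth = HPARAM_RANGE(2, 10),   -- search tree depth between 2 and 10
  learn_rate = HPARAM_RANGE(0.01, 0.3)    -- search learning rate in this range
) AS
SELECT * FROM `my_dataset.churn_training_data`;
```

After tuning completes, ML.TRIAL_INFO can be queried to compare how each trial performed.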
Finally, inspect your model’s feature importance and coefficients using ML.FEATURE_INFO or ML.WEIGHTS. These functions reveal which inputs drive predictions the most. Gaining these insights helps you trust and explain your model.
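Both inspection functions are ordinary table functions, so a quick look might be sketched as follows (model name assumed from the earlier examples):

```sql
-- Per-feature statistics captured at training time (min, max, nulls, etc.).
SELECT * FROM ML.FEATURE_INFO(MODEL `my_dataset.my_model`);

-- For linear models, list learned coefficients, largest magnitude first.
SELECT processed_input, weight
FROM ML.WEIGHTS(MODEL `my_dataset.my_model`)
ORDER BY ABS(weight) DESC;
```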
Perform inference using BigQuery ML models
Once your model is trained and evaluated, you can generate predictions using ML.PREDICT. This function takes new input data and returns model predictions in a standard SQL result set. For example, SELECT * FROM ML.PREDICT(MODEL my_dataset.my_model, (SELECT * FROM new_data_table)); outputs predicted values alongside input features. This approach makes integrating predictions easy.
You can run inference in batch mode for large datasets or integrate it into dashboards for real-time insights. BigQuery’s scalability means you can predict on millions of rows without provisioning servers. Batch predictions are cost-effective for nightly scoring, while real-time queries let you embed predictions directly in your applications.
Predictions from BigQuery ML can be joined with other tables, filtered, and aggregated using familiar SQL commands. This flexibility means you can build custom reporting pipelines without moving data out of BigQuery. Combining predictions and business data helps deliver actionable insights to stakeholders fast.
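A short sketch of that pattern, joining fresh predictions back to a business table (the customer table, key, and predicted column name are placeholder assumptions; for classification models the output column is named predicted_ followed by the label column):

```sql
-- Score new rows and enrich the predictions with business attributes.
SELECT
  c.customer_id,
  c.region,
  p.predicted_label                     -- hypothetical: label column was 'label'
FROM ML.PREDICT(
  MODEL `my_dataset.my_model`,
  (SELECT * FROM `my_dataset.new_data_table`)
) AS p
JOIN `my_dataset.customers` AS c
  USING (customer_id)
WHERE p.predicted_label = 1;            -- keep only the positive predictions
```

Because the result is a regular table, it can feed a reporting view or a scheduled export without leaving BigQuery.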
Lastly, monitor inference results to catch issues like data drift or unexpected output. Use scheduled queries or Looker Studio dashboards to track key metrics and trigger alerts if performance degrades. Regular checks keep your predictions accurate as data evolves.
Organize models in Model Registry
A Model Registry helps you manage and track all your ML models in one place. In Google Cloud, this is part of Vertex AI. You register models after training by uploading model artifacts and metadata. This creates a central repository where teams can discover, review, and deploy models easily.
Each registered model can have multiple versions. Versioning lets you compare performance across different training runs and roll back to a previous version if needed. You can store information such as training datasets, evaluation metrics, and pipeline configurations alongside each version. Version control ensures transparency and repeatability.
The Model Registry also supports model approval workflows and access controls. You can set permissions so only certain users can register, approve, or deploy models. This governance enhances security and ensures that only vetted models go into production.
Once models are in the registry, you can deploy them to Vertex AI Endpoints for serving. The registry tracks which endpoint each version is deployed to and logs prediction traffic. Centralizing model management simplifies both operations and audits in a growing ML environment.
Conclusion
This section covered how to identify the right use cases for BigQuery ML and AutoML, and how to leverage pretrained LLMs through remote connections in BigQuery. You learned the steps to plan a standard ML project, from data collection to prediction. We saw how to execute SQL for creating, training, and evaluating models, and how to perform inference using ML.PREDICT. Finally, we discussed organizing your work with a Model Registry in Vertex AI to keep models versioned, governed, and easily deployable. Altogether, these concepts provide a solid foundation for defining, training, evaluating, and using ML models on Google Cloud.
Study Guides for Sub-Sections
BigQuery ML enables machine learning directly inside BigQuery using standard SQL commands. It lets you define, train, and evaluate models without moving data to another to...
BigQuery lets you call remote functions to run pretrained Google large language models (LLMs) directly in your SQL queries. By using a remote connection to Vertex...
Model metadata management in Google Cloud’s Model Registry helps teams track, version, and deploy machine learning models in a consistent way. By storing...
BigQuery ML lets you perform inference by applying models you have already trained to new or existing data. This process generates predictions directly in...
Before starting a machine learning project on GCP, you need to set up a GCP project. You can choose an existing project or create a new one depending on your team’s needs. It is
BigQuery ML and AutoML are Google Cloud services designed to make machine learning more accessible. BigQuery ML allows users to build and run models directly using SQL.