Associate Data Practitioner

Unlock the power of your data in the cloud! Get hands-on with Google Cloud's core data services like BigQuery and Looker to validate your practical skills in data ingestion, analysis, and management, and earn your Associate Data Practitioner certification!

1.1 Prepare and process data

Differentiate between different data manipulation methodologies

Data manipulation methodologies help transform raw data into a format that is ready for analysis. ETL stands for Extract, Transform, Load, where data is first drawn from sources, then cleaned and formatted, and finally stored in a data warehouse. ELT flips the last two steps: data is Extracted, Loaded, and then Transformed in place. A hybrid approach called ETLT combines both methods by doing an initial transformation before loading and a second transformation after.

Extract, Transform, Load (ETL) moves data through a controlled pipeline. First, you extract data from sources like databases or files. Next, you transform that data by cleaning and aggregating it. Finally, you load the processed data into a data store for analysis.
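
As a rough illustration, here is a minimal ETL sketch in Python. It assumes a local orders.csv export and a hypothetical example_project.sales.daily_totals table in BigQuery; the point is that the data is cleaned and aggregated before it ever reaches the warehouse.

```python
# Minimal ETL sketch: extract from a source file, transform locally, then
# load the finished result into BigQuery. File and table names are placeholders.
import pandas as pd
from google.cloud import bigquery

# Extract: read raw data exported from a source system.
raw = pd.read_csv("orders.csv")

# Transform: clean and aggregate before the data reaches the warehouse.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date", "amount"]).copy()
clean["order_day"] = clean["order_date"].dt.date
daily = clean.groupby("order_day", as_index=False)["amount"].sum()

# Load: write the already-transformed table into BigQuery.
client = bigquery.Client()
client.load_table_from_dataframe(
    daily, "example_project.sales.daily_totals"
).result()
```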

Extract, Load, Transform (ELT) loads raw data into a target system, like BigQuery, before transforming it. This lets you leverage the target system’s processing power for large datasets. You only transform what you need, which can save time and resources.
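
For contrast, a minimal ELT sketch loads the raw file into a staging table first and pushes the transformation into BigQuery SQL. The bucket, project, dataset, and table names below are placeholders.

```python
# Minimal ELT sketch: load raw data into BigQuery as-is, then transform it
# in place with SQL, using the warehouse's own processing power.
from google.cloud import bigquery

client = bigquery.Client()

# Load: copy the raw file into a staging table with no transformation yet.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://example-bucket/raw/orders.csv",
    "example_project.staging.orders_raw",
    job_config=load_config,
).result()

# Transform: only the data you actually need, directly inside BigQuery.
client.query(
    """
    CREATE OR REPLACE TABLE `example_project.sales.daily_totals` AS
    SELECT DATE(order_date) AS order_day, SUM(amount) AS total_amount
    FROM `example_project.staging.orders_raw`
    WHERE order_date IS NOT NULL AND amount IS NOT NULL
    GROUP BY order_day
    """
).result()
```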

Extract, Transform, Load, Transform (ETLT) does a quick cleanup before loading and a more thorough transformation afterward. The initial step handles obvious errors or masks sensitive fields, and the final step applies business rules and aggregations inside the warehouse. This can be useful when data must be lightly cleansed before it lands in the target system, while the heavier logic still benefits from the warehouse's processing power.

Choose the appropriate data transfer tool

Moving data into Google Cloud requires picking the right tool for your needs. Two common options are Storage Transfer Service and Transfer Appliance. Each tool fits different scenarios based on data size, network limits, and time constraints. Choosing wisely can save costs and reduce delays.

Storage Transfer Service is a fully managed online tool for copying data between cloud storage systems or from on-premises to Cloud Storage. It supports:

  • Scheduling automatic, recurring transfers
  • Filtering by file name or date
  • High reliability with retry logic

This service works well for ongoing transfers of moderate data volumes over the network.
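
For illustration, the sketch below creates a simple recurring bucket-to-bucket transfer job with the Storage Transfer Service Python client. The project and bucket names are placeholders, and you would adapt the schedule and filters to your own setup.

```python
# Hedged sketch: create a daily Storage Transfer Service job that copies
# objects with a given prefix from one Cloud Storage bucket to another.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

job = client.create_transfer_job(
    {
        "transfer_job": {
            "project_id": "example-project",          # placeholder project
            "description": "Nightly copy of landing bucket",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "schedule": {
                # Recurs daily starting from this date.
                "schedule_start_date": {"year": 2024, "month": 1, "day": 1},
            },
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": "example-landing-bucket"},
                "gcs_data_sink": {"bucket_name": "example-warehouse-bucket"},
                # Filter by object name prefix.
                "object_conditions": {"include_prefixes": ["exports/"]},
            },
        }
    }
)
print("Created transfer job:", job.name)
```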

Transfer Appliance is a secure, high-capacity storage device you fill on-premises and ship to Google. It’s ideal for:

  • Large datasets (many terabytes or petabytes)
  • Environments with limited or expensive bandwidth
  • Tight deadlines where online transfer is too slow

Google uploads the contents to Cloud Storage once the appliance arrives, so your data reaches the cloud quickly without stressing your network.

When deciding, consider:

  • Data size: Online for smaller sets, offline appliance for huge volumes
  • Network speed: High bandwidth favors Storage Transfer Service
  • Cost and time: Appliance can be faster but involves hardware handling

Assess data quality

Data quality measures how fit data is for its intended use. Good quality data leads to accurate insights and decisions, while poor quality can cause errors and misguided conclusions. Assessing quality means checking for issues like missing values, duplicates, or inconsistencies.

Key quality metrics include:

  • Completeness: Are all required fields present?
  • Accuracy: Does the data reflect real-world values?
  • Consistency: Are formats and codes uniform across records?
  • Timeliness: Is the data up to date?

Tools on GCP can help assess these metrics. For instance, you might use BigQuery to run queries that count nulls or mismatches. Dataflow pipelines can flag and route bad records for review. Cloud Data Fusion offers visual tools to profile your data and spot anomalies.
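
For example, a single profiling query can surface several of these metrics at once. The sketch below assumes a hypothetical orders table with customer_id, order_id, amount, and order_date columns; the names and thresholds are purely illustrative.

```python
# Hedged sketch: profile a table with BigQuery to surface basic quality issues.
from google.cloud import bigquery

client = bigquery.Client()

quality_sql = """
SELECT
  COUNT(*)                             AS total_rows,
  COUNTIF(customer_id IS NULL)         AS missing_customer_id,   -- completeness
  COUNTIF(amount < 0)                  AS negative_amounts,      -- accuracy
  COUNT(*) - COUNT(DISTINCT order_id)  AS duplicate_order_ids,   -- consistency
  COUNTIF(order_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
                                       AS stale_rows             -- timeliness
FROM `example_project.sales.orders`
"""

for row in client.query(quality_sql).result():
    for metric, value in row.items():
        print(f"{metric}: {value}")
```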

Evaluating data quality early in your project prevents wasted effort later. By catching issues upfront, you ensure your analysis is based on reliable information. This step is critical for building trust in your insights and decisions.

Conduct data cleaning

Data cleaning is the process of correcting or removing bad data. It ensures your dataset is uniform and error-free before analysis. Cleaning tasks often include dealing with missing values, fixing typos, and removing duplicate records.

Cloud Data Fusion offers a visual interface for building cleaning pipelines. You can drag and drop transformations like:

  • Normalize: Standardize formats (dates, phone numbers)
  • Filter: Remove unwanted rows or columns
  • Aggregate: Summarize or group data

This makes it easy for non-developers to create repeatable workflows.

BigQuery and SQL are powerful for cleaning large datasets with code. You can write queries to:

  • Replace or impute missing values
  • Join tables to fill gaps
  • Delete or mark duplicates

This approach gives you fine-grained control and works well for batch jobs.
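
A typical cleaning query combines several of these patterns. The sketch below, using placeholder table and column names, imputes missing amounts, standardizes a text field, and keeps only the latest copy of each order.

```python
# Hedged sketch: common SQL cleaning patterns run through the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()

cleaning_sql = """
CREATE OR REPLACE TABLE `example_project.sales.orders_clean` AS
SELECT
  order_id,
  customer_id,
  COALESCE(amount, 0) AS amount,        -- impute missing values
  LOWER(TRIM(country)) AS country       -- standardize a free-text field
FROM (
  SELECT
    *,
    -- Rank copies of each order_id so duplicates can be dropped.
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM `example_project.staging.orders_raw`
)
WHERE rn = 1                            -- keep only the most recent copy
"""

client.query(cleaning_sql).result()
```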

Dataflow is a stream and batch processing service that handles cleaning on the fly. It can parse, validate, and transform data as it moves through your pipeline. Use Dataflow when you need real-time data cleaning at scale.
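
As a rough sketch, a Dataflow pipeline written with the Apache Beam Python SDK might parse and validate each record as it flows through, routing bad rows to a separate location for review. The paths and field names here are illustrative, and you would add DataflowRunner options to run it on Dataflow rather than locally.

```python
# Hedged sketch: an Apache Beam pipeline that validates records in flight
# and routes invalid ones to a side output instead of dropping them.
import json
import apache_beam as beam

def parse_and_validate(line):
    """Tag each record as valid or invalid rather than failing the pipeline."""
    try:
        record = json.loads(line)
        if record.get("amount") is None or record["amount"] < 0:
            raise ValueError("missing or negative amount")
        yield beam.pvalue.TaggedOutput("valid", record)
    except Exception:
        yield beam.pvalue.TaggedOutput("invalid", line)

with beam.Pipeline() as p:
    results = (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/orders.json")
        | "Validate" >> beam.FlatMap(parse_and_validate).with_outputs("valid", "invalid")
    )
    results.valid | "WriteClean" >> beam.io.WriteToText("gs://example-bucket/clean/orders")
    results.invalid | "WriteBad" >> beam.io.WriteToText("gs://example-bucket/errors/orders")
```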

Conclusion

In section 1.1, we covered foundational steps for preparing and processing data on GCP. We learned how to differentiate between ETL, ELT, and ETLT methodologies. We explored tools like Storage Transfer Service and Transfer Appliance for efficient data transfer. We also discussed how to assess data quality using metrics such as completeness and accuracy. Finally, we looked at cleaning data with services like Cloud Data Fusion, BigQuery, and Dataflow. Together, these practices ensure your data is ready for reliable analysis and better decision making.
