Google Cloud Professional Data Engineer Learning Path Certification Key Topics

Quick summary of the exam

  • Covers a wide range of Google Cloud data services and what they actually do. It includes Storage and a LOT of Data services
  • Compute and Network are barely covered
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on is key: if you have not worked on GCP before, make sure you do lots of labs, or you will be absolutely clueless for some of the questions and commands
  • Tests are updated for the latest enhancements.
  • The pilot exam does not cover the case studies. But given my Professional Cloud Architect exam experience, make sure you cover the case studies beforehand.
  • Be sure that NO online course or practice test is going to cover everything. I did Coursera and LinuxAcademy, which is really vast, but hands-on or practical knowledge is a MUST.

The list of topics is quite long, but the ones you need to be sure to cover are

  • Identity Services
    • Cloud IAM
      • provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
      • Understand how IAM works and how policies are inherited, esp. down the hierarchy from Organization -> Folder -> Project -> Resources
      • Understand IAM Best practices
      • Make sure you know the BigQuery Access roles
  • Storage Services
    • Understand each storage service option and its use cases.
    • Cloud Storage
      • cost-effective object storage for unstructured data.
      • very important to know the different classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
      • Understand Signed URLs, which give temporary access to users who do not need to be GCP users
      • Understand permissions - IAM vs ACLs (fine grained control)
    • Relational Databases
      • Know Cloud SQL and Cloud Spanner
      • Cloud SQL
        • is a fully-managed service that provides MySQL and PostgreSQL only.
        • Limited to 10TB and is a regional service.
      • Cloud Spanner
        • is a fully managed, mission-critical relational database service.
        • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
        • globally distributed and can scale and handle more than 10TB.
        • not a drop-in replacement for an existing relational database; moving to it would need a migration
      • There are no direct options for Microsoft SQL Server or Oracle yet.
    • NoSQL
      • Know Cloud Datastore and Bigtable
      • Datastore
        • provides a document database for web and mobile applications. Datastore is not for analytics
        • Understand Datastore indexes and how to update indexes for Datastore
      • Bigtable
        • provides a column database suitable for both low-latency single-point lookups and precalculated analytics
        • understand Bigtable is not for long term storage as it is quite expensive
        • know the differences with HBase
        • Know how to measure performance and scale
    • Data Warehousing
      • BigQuery
        • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
        • Remember it is most suitable for historical analysis.
        • know how to control access to tables, columns within tables, and query results (hint - Authorized Views)
        • Be sure to cover the Best Practices including key strategy, cost optimization, partitioning and clustering
  • Data Services
    • Obviously there is lots of Data and Just Data
    • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics, use
    • Cloud Storage
      • as the medium to store data as data lake
      • understand what class is the best suited and which one provides geo-redundancy.
    • Cloud Pub/Sub
      • as the messaging service to capture real time data esp. IoT
      • is designed to provide reliable, many-to-many, asynchronous messaging between applications
      • how it compares to Kafka
    • Cloud Dataflow
      • to process, transform, and transfer data; it is the key service that connects storage and analytics.
      • know how to improve Dataflow pipeline performance
      • Google expects you to know the Apache Beam features as well
    • Cloud BigQuery
      • for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
      • understand how BigQuery Streaming works
      • know BigQuery limitations esp. with updates and inserts
    • Cloud Dataprep
      • to clean and prepare data. It can be used for anomaly detection.
      • does not need any programming language knowledge and can be done through a graphical interface
      • be sure to try it hands-on with a dataset
    • Cloud Dataproc
      • to handle existing Hadoop/Spark jobs
      • you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the Hadoop cluster to use all the cores (hint - Spark executor cores) and handle out-of-memory errors (hint - executor memory)
      • how to install other components (hint - initialization actions)
    • Cloud Datalab
      • is an interactive tool for exploration, transformation, analysis and visualization of your data on Google Cloud Platform
      • based on Jupyter
    • Cloud Composer
      • fully managed workflow orchestration service based on Apache Airflow
      • pipelines are configured as directed acyclic graphs (DAGs)
      • workflows can span on-premises and multiple clouds, or live fully within GCP.
      • provides ability to author, schedule, and monitor your workflows in a unified manner
  • Machine Learning
    • Google definitely expects the Data Engineer to know some of the Data Scientist's stuff
    • Understand the different algorithms
      • Supervised Learning (labelled data)
        • Classification (e.g. Spam or Not)
        • Regression (e.g. Stock or House prices)
      • Unsupervised Learning (unlabelled data)
        • Clustering (e.g. categories)
      • Reinforcement Learning
    • Know Cloud ML with TensorFlow
    • Know all the Cloud AI products which include
      • Cloud Vision
      • Cloud Natural Language
      • Cloud Speech-to-Text
      • Cloud Video Intelligence
    • Cloud AutoML products, which can help you get started without much machine learning experience
  • Monitoring
    • Google Stackdriver provides everything: monitoring, alerting, error reporting, metrics, diagnostics, debugging, and tracing.
      • remember that audits mainly involve checking Stackdriver
  • Security Services
    • Data Loss Prevention API to handle sensitive data esp. redaction of PII data.
    • understand Encryption techniques
  • Other Services
    • Storage Transfer Service allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively. Online data is the key here, as it supports AWS S3, HTTP/HTTPS and other GCS buckets. If the data is on-premises you need to use the gsutil command
    • Transfer Appliance to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform. Check the data size, as this option is always compared against Storage Transfer Service or gsutil commands.
    • BigQuery Data Transfer Service to integrate with third-party services and load data into BigQuery
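
For the transfer options above, the data size matters mostly because it decides how long an online upload would take. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a steady link speed; the function name and the sample sizes and bandwidths are hypothetical and do not come from any Google tool:

```python
def upload_days(size_tb: float, bandwidth_mbps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to upload size_tb terabytes over a network link.

    Assumptions (illustrative only): decimal units (1 TB = 1e12 bytes,
    1 Mbps = 1e6 bits per second) and a constant fraction `utilization`
    of the link that is actually available to the transfer.
    """
    if bandwidth_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("bandwidth must be positive and utilization in (0, 1]")
    total_bits = size_tb * 1e12 * 8
    seconds = total_bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical scenarios: (data size in TB, link speed in Mbps)
    for size_tb, mbps in [(1, 100), (100, 1000), (1000, 1000)]:
        days = upload_days(size_tb, mbps)
        print(f"{size_tb} TB over {mbps} Mbps: about {days:.1f} days")
```

When the estimate runs into weeks or months, that is usually the hint that a question is steering towards Transfer Appliance rather than gsutil or Storage Transfer Service.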

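The Dataproc hints in the list above (executor cores to use all the cores, executor memory for out-of-memory errors) correspond to the Spark properties `spark.executor.cores` and `spark.executor.memory`. A minimal sketch that derives values for one worker and formats them for the `--properties` flag of `gcloud dataproc jobs submit`; the sizing rule, the helper names, and the sample worker shape are assumptions for illustration, not a Google recommendation:

```python
def spark_executor_properties(worker_vcpus: int, worker_memory_gb: int,
                              cores_per_executor: int = 4,
                              reserved_memory_fraction: float = 0.25) -> dict:
    """Hypothetical sizing helper: split one worker into equally sized executors.

    reserved_memory_fraction is an assumed share of RAM left for the OS,
    YARN daemons and executor memory overhead; tune it for a real cluster.
    """
    if worker_vcpus < cores_per_executor:
        raise ValueError("worker has fewer vCPUs than cores_per_executor")
    executors_per_worker = worker_vcpus // cores_per_executor
    usable_gb = worker_memory_gb * (1 - reserved_memory_fraction)
    memory_per_executor_gb = int(usable_gb // executors_per_worker)
    if memory_per_executor_gb < 1:
        raise ValueError("not enough memory per executor")
    return {
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{memory_per_executor_gb}g",
    }


def to_properties_flag(props: dict) -> str:
    """Render the properties as a comma-separated --properties flag value."""
    return "--properties=" + ",".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    # Assumed worker shape: 16 vCPUs and 64 GB of RAM.
    print(to_properties_flag(spark_executor_properties(16, 64)))
```

Raising the cores per executor (or the number of executors) is the lever for idle cores, while raising executor memory is the lever for out-of-memory failures; the two trade off against each other on a fixed worker size.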