Quick summary of the exam
- Covers a wide range of Google Cloud data services and what they actually do. It includes Storage and a LOT of Data services
- Not much on Compute and Network is covered
- Questions sometimes test your logical thinking rather than any specific Google Cloud concept.
- Hands-on experience matters: if you have not worked on GCP before, make sure you do lots of labs, or you will be absolutely clueless for some of the questions and commands
- Tests are updated for the latest enhancements.
- The pilot exam does not cover the case studies. But given my Professional Cloud Architect exam experience, make sure you cover the case studies beforehand.
- Be sure that NO online course or practice test is going to cover everything. I did Coursera and Linux Academy, which are really vast, but hands-on, practical knowledge is a MUST.
The list of topics is quite long, but the ones you need to be sure to cover are
- Identity Services
- Cloud IAM
- provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
- Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
- Understand IAM Best practices
- Make sure you know the BigQuery Access roles
- Storage Services
- Understand each storage service option and its use cases.
- Cloud Storage
- cost-effective object storage for unstructured data.
- very important to know the different classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
- Understand Signed URLs, which give temporary access to users who do not need to be GCP users (see the sketch below)
- Understand permissions - IAM vs ACLs (fine-grained control)
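A minimal sketch of generating a Signed URL with the Python Cloud Storage client, assuming a service-account key is configured; the bucket and object names are hypothetical:

```python
from datetime import timedelta

from google.cloud import storage

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key.
# Bucket and object names are hypothetical.
client = storage.Client()
blob = client.bucket("my-data-lake").blob("reports/2019/sales.csv")

# Anyone holding this URL can read the object for 1 hour,
# without being a GCP user.
url = blob.generate_signed_url(
    version="v4", expiration=timedelta(hours=1), method="GET"
)
print(url)
```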
- Relational Databases
- Know Cloud SQL and Cloud Spanner
- Cloud SQL
- is a fully managed service that provides MySQL and PostgreSQL only.
- is limited to 10TB and is a regional service.
- Cloud Spanner
- is a fully managed, mission-critical relational database service.
- provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
- is globally distributed and can scale to handle more than 10TB.
- is not a direct replacement for Cloud SQL and would need migration (see the sketch below)
- There are no direct options for Microsoft SQL Server or Oracle yet.
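To get a feel for Spanner still being "just SQL" from the client side, here is a minimal read sketch with the Python client; the instance, database, table and column names are hypothetical:

```python
from google.cloud import spanner

# Hypothetical instance and database names.
client = spanner.Client()
database = client.instance("prod-instance").database("orders-db")

# Strongly consistent read via a read-only snapshot.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT OrderId, Amount FROM Orders WHERE CustomerId = @cid",
        params={"cid": "C123"},
        param_types={"cid": spanner.param_types.STRING},
    )
    for row in rows:
        print(row)
```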
- NoSQL
- Know Cloud Datastore and BigTable
- Datastore
- provides a document database for web and mobile applications. Datastore is not for analytics
- Understand Datastore indexes and how to update indexes for Datastore
- Bigtable
- provides a wide-column database suitable for both low-latency single-point lookups and precalculated analytics (see the sketch below)
- understand that Bigtable is not for long-term storage, as it is quite expensive
- know the differences with HBase
- Know how to measure performance and scale
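A minimal sketch of the low-latency single-row lookup pattern with the Python Bigtable client; the project, instance, table, row key and column family are hypothetical:

```python
from google.cloud import bigtable

# Hypothetical project/instance/table identifiers.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("iot-instance").table("sensor-readings")

# Bigtable is keyed by row key; a well-designed key keeps single-point
# lookups fast and avoids hotspotting.
row = table.read_row(b"device#42#20190101")
if row is not None:
    cell = row.cells["measurements"][b"temperature"][0]
    print(cell.value, cell.timestamp)
```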
- Data Warehousing
- BigQuery
- provides a scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad hoc queries.
- Remember it is most suitable for historical analysis.
- know how to control access to tables, columns within tables, and query results (hint: Authorized Views; see the sketch below)
- Be sure to cover the best practices, including key strategy, cost optimization, partitioning, and clustering
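A minimal sketch of setting up an Authorized View with the Python BigQuery client, i.e. letting a view in one dataset query a source dataset that end users cannot read directly; the project, dataset and view names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical datasets: raw data lives in `private_ds`,
# the shared view lives in `shared_ds`.
view = client.get_table("my-project.shared_ds.filtered_orders")
source_dataset = client.get_dataset("my-project.private_ds")

# Authorize the view against the source dataset so users who only
# have access to shared_ds can still query through the view.
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```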
- Data Services
- Obviously there is lots of Data and just Data
- Know the Big Data stack and understand which service fits the different layers: ingest, store, process, analyze, use
- Cloud Storage
- as the medium to store data as a data lake
- understand which class is best suited and which one provides geo-redundancy.
- Cloud Pub/Sub
- as the messaging service to capture real-time data, esp. IoT
- is designed to provide reliable, many-to-many, asynchronous messaging between applications, esp. real-time IoT data capture (see the sketch below)
- know how it compares to Kafka
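A minimal publish sketch with the Python Pub/Sub client; the project, topic and payload are hypothetical, and in a real IoT pipeline the subscriber side would typically be Dataflow:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic.
topic_path = publisher.topic_path("my-project", "sensor-events")

payload = json.dumps({"device_id": "42", "temp_c": 21.5}).encode("utf-8")

# publish() is asynchronous and returns a future; result() blocks
# until the service returns the message ID.
future = publisher.publish(topic_path, payload, origin="field-gateway")
print("Published message", future.result())
```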
- Cloud Dataflow
- to process, transform, and transfer data; it is the key service integrating storage and analytics.
- know how to improve Dataflow performance
- Google expects you to know the Apache Beam features as well
- understand PCollections, Transforms, ParDo and what they do
- understand windowing and triggers
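A minimal Apache Beam (Python SDK) sketch tying PCollections, ParDo and fixed windowing together; the Pub/Sub subscription is hypothetical, and streaming execution needs a runner that supports it, e.g. the DataflowRunner:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


class ParseEvent(beam.DoFn):
    """ParDo: parse raw Pub/Sub bytes into (device_id, temperature)."""

    def process(self, element):
        event = json.loads(element.decode("utf-8"))
        yield event["device_id"], float(event["temp_c"])


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Hypothetical subscription name.
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-sub")
        | "Parse" >> beam.ParDo(ParseEvent())                   # PCollection of (key, value)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
        | "Mean" >> beam.combiners.Mean.PerKey()                # avg per device per window
        | "Print" >> beam.Map(print)
    )
```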
- Cloud BigQuery
- for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
- understand how BigQuery streaming inserts work (see the sketch below)
- know BigQuery limitations esp. with updates and inserts
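A minimal streaming-insert sketch with the Python BigQuery client; the table name is hypothetical, and note the classic caveat that streamed rows land in the streaming buffer first:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; the schema must already exist.
table_id = "my-project.analytics.sensor_readings"

rows = [
    {"device_id": "42", "temp_c": 21.5, "ts": "2019-01-01T00:00:00Z"},
    {"device_id": "43", "temp_c": 19.8, "ts": "2019-01-01T00:00:05Z"},
]

# Streaming inserts (tabledata.insertAll under the hood). Rows are
# queryable almost immediately but sit in the streaming buffer first.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Row-level errors:", errors)
```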
- Cloud Dataprep
- to clean and prepare data. It can be used for anomaly detection.
- does not need any programming language knowledge and can be done through a graphical interface
- be sure to try it hands-on on a dataset
- Cloud Dataproc
- to handle existing Hadoop/Spark jobs
- you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the cluster to use all the cores (hint: spark executor cores) and handle out-of-memory errors (hint: executor memory)
- know how to install other components (hint: initialization actions; see the sketch below)
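A sketch of creating a Dataproc cluster that pins Spark executor cores/memory via cluster properties and runs an initialization action, using the Python client; the project, region, machine types, bucket and script path are hypothetical:

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Cluster properties tune Spark/Hadoop, e.g. executor cores and memory.
        "software_config": {
            "properties": {
                "spark:spark.executor.cores": "4",
                "spark:spark.executor.memory": "10g",
            }
        },
        # Initialization actions install extra components on every node.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-extra.sh"}
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```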
- Cloud Datalab
- is an interactive tool for exploration, transformation, analysis and visualization of your data on Google Cloud Platform
- based on Jupyter
- Cloud Composer
- fully managed workflow orchestration service based on Apache Airflow
- pipelines are configured as directed acyclic graphs (DAGs)
- workflows can live on-premises, in multiple clouds, or fully within GCP.
- provides ability to author, schedule, and monitor your workflows in a unified manner
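A minimal Airflow DAG sketch of the kind Cloud Composer schedules; the tasks are placeholder Bash steps and the import paths follow Airflow 1.x, which Composer was running at the time:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# A DAG is just Python code: tasks plus dependencies, run on a schedule.
with DAG(
    dag_id="daily_sales_pipeline",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull data from source'",  # placeholder step
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'load data into BigQuery'",  # placeholder step
    )

    extract >> load  # dependency: extract runs before load
```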
- Machine Learning
- Google expects a Data Engineer to know some of the data scientist's stuff as well
- Understand the different algorithms
- Supervised Learning (labelled data)
- Classification (for e.g. Spam or Not)
- Regression (for e.g. Stock or House prices)
- Unsupervised Learning (Unlabelled data)
- Clustering (for e.g. categories)
- Reinforcement Learning
- Know Cloud ML with TensorFlow (see the sketch below)
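A tiny TensorFlow/Keras sketch of supervised classification, with random data standing in for a real labelled dataset; this is roughly the kind of trainer code you would package and submit to Cloud ML for training:

```python
import numpy as np
import tensorflow as tf

# Random stand-in for a labelled dataset: 1,000 examples,
# 10 features, binary label (e.g. spam / not spam).
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

# Simple supervised classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(features, labels, epochs=5, batch_size=32)
```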
- Know all the Cloud AI products which include
- Cloud Vision
- Cloud Natural Language
- Cloud Speech-to-Text
- Cloud Video Intelligence
- Cloud AutoML products, which can help you get started without much machine learning experience
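The Cloud AI products are pre-trained APIs you simply call; a minimal Cloud Vision label-detection sketch with the Python client (exact syntax can vary slightly across client library versions, and the GCS image path is hypothetical):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Hypothetical image stored in Cloud Storage.
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://my-bucket/photos/cat.jpg")
)

# Pre-trained model: no training required, just call the API.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```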
- Monitoring
- Google Stackdriver provides everything from monitoring, alerting, error reporting, metrics, diagnostics, debugging, and tracing.
- remember that audits mainly come down to checking Stackdriver (audit logs)
- Security Services
- Data Loss Prevention API to handle sensitive data, esp. redaction of PII data (see the sketch below).
- understand Encryption techniques
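A minimal DLP sketch that de-identifies free text by masking PII such as email addresses and phone numbers, using the Python client; the project ID and chosen info types are illustrative:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project

item = {"value": "Contact jane.doe@example.com or call 555-0100."}

inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
}

# Replace every finding with its info type, e.g. [EMAIL_ADDRESS].
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

response = client.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)
```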
- Other Services
- Storage Transfer Service allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively. Online data is the key here, as it supports AWS S3, HTTP/HTTPS, and other GCS buckets. If the data is on-premises, you need to use the gsutil command
- Transfer Appliance to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform. Check the data size; it is always compared against the Storage Transfer Service or gsutil commands.
- BigQuery Data Transfer Service to integrate with third-party services and load data into BigQuery
Resources
- Courses
- Coursera - Preparing for the Google Cloud Professional Data Engineer Exam, which is a good overview course but not detailed (look for the Audit option, which allows you to take it for free)
- Coursera - Data Engineering on Google Cloud Platform, which is more detailed (look for the Audit option on the entire specialization or individual courses, which allows you to take them for free)
- Linux Academy - Google Cloud Certified - Professional Cloud Architect is quite detailed as well.
- Practice tests
- Braincert Google Cloud Certified - Professional Data Engineer Practice Exams (can be great for preparation)
- Use Google Free Tier and Qwiklabs as much as possible.