



Google Big Data Platform Cloud Dataproc
Google Big Data solutions help transform user and business experiences through data insights, and together they are known as the Integrated Serverless Platform. They are part of the GCP services, fully maintained and managed, and you pay only for the resources you consume. The Google Big Data Platform offers the following services, which integrate to help you create custom solutions: o Apache Hadoop is an open-source framework for distributed data processing, based on the MapReduce programming model. o The MapReduce model consists of a Map function, which runs over a large dataset to generate intermediate results, and a Reduce function, which takes those results as input and produces the final output. o Alongside Apache Hadoop there are related projects such as Apache Pig, Hive, and Spark. o On Google Cloud Platform, Cloud Dataproc can be used to run Hadoop, Spark, Hive, and Pig.
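The Map and Reduce steps above can be sketched with plain Python. This is a toy word count (the canonical MapReduce example) meant to illustrate the model only; it is not Hadoop's actual API, and the function names are our own:

```python
from collections import defaultdict

def map_function(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_function(word, counts):
    # Reduce: combine all intermediate values for one key.
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_function(doc):
            grouped[word].append(count)
    return dict(reduce_function(w, c) for w, c in grouped.items())

result = mapreduce(["big data on GCP", "data insights on GCP"])
print(result)  # e.g. {'big': 1, 'data': 2, 'on': 2, 'gcp': 2, 'insights': 1}
```

In a real Hadoop cluster the "shuffle" step happens across machines between the map and reduce phases; here it is a single in-memory dictionary.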
When you request a Hadoop cluster, it is built on top of Compute Engine VMs in 90 seconds or less. You can scale the cluster up or down based on the processing power you need, and monitor its usage. Running clusters on-premises requires a hardware investment, but running them in Dataproc lets you pay only for the hardware resources used while the cluster exists. Cloud Dataproc is billed per second, and GCP stops billing once the cluster is deleted. You can also use preemptible instances for batch processing to save costs. Once the cluster has consumed the data, Spark and Spark SQL can be used for data mining, and the Apache Spark machine learning libraries can be used to discover patterns through machine learning.
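A cluster request like the one described above can be expressed as a configuration body for the Dataproc API. The sketch below only builds that body as a plain dict and does not call any API; the project name, cluster name, machine types, and instance counts are all hypothetical placeholders:

```python
# Hedged sketch: a Dataproc-style cluster spec with a preemptible
# secondary worker pool for cheaper batch processing.
# All names and sizes below are illustrative placeholders.
def build_cluster_config(project_id, cluster_name):
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type": "n1-standard-4"},
            # Preemptible workers cut the cost of batch workloads.
            "secondary_worker_config": {"num_instances": 4, "is_preemptible": True},
        },
    }

config = build_cluster_config("my-project", "analysis-cluster")
print(config["config"]["secondary_worker_config"]["is_preemptible"])  # True
```

Because billing is per second, deleting the cluster as soon as a batch job finishes is the main cost lever; the preemptible pool above is the second one.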
Cloud Dataproc is a good fit when you know your cluster size in advance. If your cluster size is unpredictable, or your data arrives in real time, Cloud Dataflow is the better choice.
Cloud Dataflow is a managed service that lets you develop and execute a wide range of processing patterns: extract-transform-load (ETL), batch computation, and continuous computation. It is used to build data pipelines for both batch and streaming data. It automates the provisioning of processing resources, freeing you from operational tasks such as performance optimization and resource management. Cloud Dataflow can read data from BigQuery, process it, apply transforms such as map and reduce operations, and write the results to Cloud Storage. Use cases include fraud detection in financial services, IoT analytics, manufacturing, logistics, healthcare, and so on.
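A Dataflow pipeline is a chain of transforms between a source and a sink. The read-transform-write flow above can be mimicked with plain Python function composition; this is a toy illustration of the pipeline shape, not the actual Apache Beam/Dataflow SDK, and the data and function names are invented:

```python
def read_source():
    # Stand-in for reading rows from a source such as BigQuery.
    return ["alice,3", "bob,5", "alice,2"]

def map_transform(rows):
    # Map: parse each row into a (key, value) pair.
    return [(name, int(n)) for name, n in (r.split(",") for r in rows)]

def reduce_transform(pairs):
    # Reduce: sum the values per key.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

def write_sink(totals):
    # Stand-in for writing results to a sink such as Cloud Storage.
    return sorted(totals.items())

output = write_sink(reduce_transform(map_transform(read_source())))
print(output)  # [('alice', 5), ('bob', 5)]
```

In the real service, the same pipeline definition runs unchanged over a bounded (batch) or unbounded (streaming) source, which is what makes Dataflow suitable for both modes.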
Say you have a large dataset and need to perform ad-hoc SQL queries; then BigQuery is the tool to choose. BigQuery is Google's fully managed, low-cost analytics data warehouse with petabyte-scale storage. You can load data into BigQuery from Cloud Storage or Cloud Datastore, or stream it in at up to 100,000 rows per second. You can run super-fast SQL queries, and read and write BigQuery data through Cloud Dataflow, Spark, and Hadoop. You pay only for the queries you run. Once data in BigQuery has not been edited for 90 days, Google automatically decreases its storage price. BigQuery has an availability of 99.99%.
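The 90-day storage price drop can be modelled as a simple rule. The per-GB prices below are illustrative placeholders, not current BigQuery list prices:

```python
ACTIVE_PRICE_PER_GB = 0.020     # hypothetical active-storage price, USD/GB/month
LONG_TERM_PRICE_PER_GB = 0.010  # hypothetical reduced price after 90 days

def monthly_storage_cost(gb, days_since_last_edit):
    # Data untouched for 90+ days falls into the cheaper long-term tier.
    if days_since_last_edit >= 90:
        rate = LONG_TERM_PRICE_PER_GB
    else:
        rate = ACTIVE_PRICE_PER_GB
    return gb * rate

print(monthly_storage_cost(1000, 30))   # recently edited: active rate
print(monthly_storage_cost(1000, 120))  # untouched 90+ days: long-term rate
```

Note that the discount applies automatically; you do not move the data anywhere to get it.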
If you are working with events in real time and need a messaging service, Cloud Pub/Sub can help in the following ways. Cloud Pub/Sub is a simple, reliable, and scalable foundation for stream analytics. With it, you can build independent applications that send and receive messages. Pub/Sub stands for publishers and subscribers: applications publish their messages to Pub/Sub, and the subscribers who have subscribed to those topics receive the messages. Cloud Pub/Sub can also be integrated with Cloud Dataflow. Separately, Cloud Datalab helps you explore data, and it integrates with several GCP services such as BigQuery, Cloud Storage, and Compute Engine.
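The publisher/subscriber pattern can be sketched in-process with plain Python. This is a toy broker meant to show the decoupling Pub/Sub provides, not the Cloud Pub/Sub client library; topic and message names are invented:

```python
class Broker:
    # Toy in-process stand-in for a Pub/Sub service.
    def __init__(self):
        self.subscribers = {}  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Every subscriber on the topic receives the message;
        # the publisher knows nothing about who they are.
        for callback in self.subscribers.get(topic, []):
            callback(message)

broker = Broker()
received = []
broker.subscribe("sensor-events", received.append)
broker.subscribe("sensor-events", lambda m: received.append(m.upper()))
broker.publish("sensor-events", "temperature=21c")
print(received)  # ['temperature=21c', 'TEMPERATURE=21C']
```

The point of the pattern is that publishers and subscribers never reference each other directly, which is what lets the applications on either side scale and fail independently.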
The Cloud Natural Language API provides natural language technologies to developers around the world: it performs syntax analysis, identifies verbs, nouns, adverbs, and adjectives, and can find the relationships between words. The Cloud Translation API converts an arbitrary string into a supported language through a simple interface. The Cloud Video Intelligence API annotates videos in different formats; you can use it to make your video content searchable.
John is performing a few tasks in a GCP environment, but he is stuck at one point and cannot proceed further. Assist John to complete the following tasks: o Log in to GCP with the provided credentials. o Create a Google Cloud virtual network, subnet, firewall rule, and compute instance, with names of your choice. o Create a Google Cloud Storage bucket with a name of your choice and execute the command 'gsutil cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png' to retrieve an image. Copy the image to your newly created bucket. o Create a Google Cloud SQL instance and a SQL user. o Enable the 'Kubernetes Engine API' and the 'Container Registry API'. o Start a Kubernetes Engine managed Kubernetes cluster with a name of your choice and configure it to run 2 nodes. o Launch a single instance of the Nginx container (version 1.10.0) and expose it to the internet using target port 80. Then scale the number of pods to 3 and confirm that the external IP address has not changed. o Deploy the Guestbook application to App Engine. To do so, clone a source code repository with a sample application from 'https://github.com/GoogleCloudPlatform/appengine-guestbook-python'. View your deployed application using the gcloud command. o Create a new dataset called 'logdata' and set the data location to 'US'. Create a new table in the logdata dataset and perform query operations on the data using the 'bq' command. o Finally, delete all the resources created as part of this hands-on exercise.
Learn as if you were to live forever. - Mahatma Gandhi

In this course, we have discussed the following topics, which lay the foundation for Google Cloud Platform: o Google Cloud VPC and Compute Engine o Google Cloud Storage and Bigtable o Cloud SQL, Spanner, and Datastore o Containers, Kubernetes, and Kubernetes Engine o App Engine Standard and Flexible o Cloud Development, Deployment, and Monitoring o Google Cloud Big Data and Machine Learning