Google Cloud Big Data and Machine Learning

Google Big Data Platform

• Google's big data solutions, known collectively as the Integrated Serverless Platform, help transform users and their businesses through data insights.
• Google's big data solutions are part of the GCP services and are fully maintained and managed; you pay only for the resources you consume.
• The Google Big Data Platform offers the following services, which are integrated to help you create custom solutions:
   o Apache Hadoop is an open-source framework for distributed data processing, based on the MapReduce programming model.
   o The MapReduce model consists of a Map function, which runs over a large dataset to generate intermediate results, and a Reduce function, which takes those results as input and produces the final output (see the shell analogy after this list).
   o Alongside Apache Hadoop there are related projects such as Apache Pig, Hive, and Spark.
   o On Google Cloud Platform, Cloud Dataproc can be used to run Hadoop, Spark, Hive, and Pig.
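To make the Map and Reduce roles concrete, here is a minimal word-count analogy built from ordinary shell tools rather than Hadoop itself (input.txt is a hypothetical input file): the first command plays the Map function, the sort groups the intermediate results by key, and the last command plays the Reduce function.

    # "Map": split the input into one word per line (one intermediate record each)
    tr -s '[:space:]' '\n' < input.txt |
    # shuffle: bring identical keys (words) together
    sort |
    # "Reduce": aggregate the grouped records into a final count per word
    uniq -c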

Cloud Dataproc

• When you request a Hadoop cluster, it is built on top of Compute Engine VMs in less than 90 seconds, and you can scale it up or down based on the processing power you need.
• You can monitor your cluster with Stackdriver, GCP's monitoring service.
• Running clusters on-premises requires an up-front hardware investment; running them in Dataproc means you pay only for the hardware resources used while the cluster exists.
• Cloud Dataproc is billed per second, and GCP stops the billing once the cluster is deleted.
• You can also use preemptible instances for batch processing to save costs, as shown in the sketch below.
• Once the cluster holds your data, Spark and Spark SQL can be used for data mining.
• You can also use Apache Spark's machine learning libraries (MLlib) to discover patterns through machine learning.
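As a concrete sketch of these points (cluster name and region are placeholders, and the preemptible-worker flag reflects the gcloud CLI of roughly this course's era, so exact flag names may differ in newer releases):

    # Create a cluster with 2 standard workers plus 2 cheaper preemptible workers
    gcloud dataproc clusters create demo-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --num-preemptible-workers=2

    # Submit the SparkPi example job that ships on the cluster image
    gcloud dataproc jobs submit spark \
        --cluster=demo-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

    # Per-second billing stops as soon as the cluster is deleted
    gcloud dataproc clusters delete demo-cluster --region=us-central1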

Cloud Dataflow

Cloud Dataproc is suitable when you know your cluster size in advance. But if your required cluster size is unpredictable, or your data shows up in real time, then your choice should be Cloud Dataflow.

• Cloud Dataflow is a managed service that lets you develop and execute a wide range of data processing patterns: extract-transform-load (ETL), batch computation, and continuous (streaming) computation.
• Cloud Dataflow is used to build data pipelines for both batch and streaming data; a template-based example follows this list.
• It automates the management of processing resources, freeing you from operational tasks such as performance optimization and resource management.
• Cloud Dataflow can read data from BigQuery, process it, apply transforms such as map and reduce operations, and write the results to Cloud Storage.
• Use cases include fraud detection in financial services, IoT analytics, manufacturing, logistics, healthcare, and so on.
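One quick way to see a Dataflow pipeline run without writing code is to launch one of Google's pre-built templates; a minimal sketch, assuming you replace MY_BUCKET with a bucket you own (the template and sample-input paths are public Google-hosted locations):

    # Run the pre-built WordCount template; Dataflow provisions and
    # scales the workers, then tears them down when the job finishes
    gcloud dataflow jobs run wordcount-demo \
        --gcs-location=gs://dataflow-templates/latest/Word_Count \
        --region=us-central1 \
        --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://MY_BUCKET/results/output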

BigQuery

Say you possess a large dataset and need to perform ad-hoc SQL queries on it; then BigQuery is the right choice.
• BigQuery is Google's fully managed, low-cost analytical data warehouse with petabyte-scale storage.
• You can load data into BigQuery from Cloud Storage or Cloud Datastore, or stream it into BigQuery at up to 100,000 rows per second.
• You can perform super-fast SQL queries and read and write BigQuery data through Cloud Dataflow, Spark, and Hadoop (an example query follows this list).
• You pay only for the queries that you run.
• Once data in BigQuery reaches 90 days of age, Google automatically lowers its storage price.
• BigQuery has an availability of 99.99%.
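For example, an ad-hoc standard SQL query against one of BigQuery's public sample datasets (bigquery-public-data.usa_names) can be run from any project with the bq command-line tool:

    # Top five most common first names in the public dataset;
    # you are billed only for the bytes this query scans
    bq query --use_legacy_sql=false \
        'SELECT name, SUM(number) AS total
         FROM `bigquery-public-data.usa_names.usa_1910_2013`
         GROUP BY name
         ORDER BY total DESC
         LIMIT 5'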

Cloud Pub/Sub and Cloud Datalab

If you are working with events in real time and need a messaging service, Cloud Pub/Sub will help you in the following ways:
• Cloud Pub/Sub is a simple, reliable, and scalable foundation for stream analytics. Using it, you can build independent applications that send and receive messages.
• Pub/Sub is short for Publishers and Subscribers.
• Applications publish their messages to Pub/Sub, and the subscribers that have subscribed to them receive the messages, as in the round trip sketched below.
• Cloud Pub/Sub can also be integrated with Cloud Dataflow.
• Cloud Datalab helps you explore your data, and it integrates with multiple GCP services such as BigQuery, Cloud Storage, and Compute Engine.
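A minimal publish/subscribe round trip with the gcloud CLI (topic and subscription names are arbitrary placeholders):

    # Create a topic and attach a subscription to it
    gcloud pubsub topics create demo-topic
    gcloud pubsub subscriptions create demo-sub --topic=demo-topic

    # Publish a message to the topic...
    gcloud pubsub topics publish demo-topic --message="hello subscribers"

    # ...then pull it from the subscription, acknowledging receipt
    gcloud pubsub subscriptions pull demo-sub --auto-ack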

Machine Learning APIs

• The Cloud Natural Language API offers natural language technologies to developers around the world: it performs syntax analysis, identifies verbs, nouns, adverbs, and adjectives, and can find the relationships between words (see the example below).
• The Cloud Translation API converts an arbitrary string into a supported language through a simple interface.
• The Cloud Video Intelligence API helps annotate videos in a variety of formats. You can use it to make your video content searchable.
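For instance, the Natural Language API's syntax analysis can be called straight from the gcloud CLI once the API is enabled in your project (the sentence is an arbitrary example):

    # Returns each token tagged as noun, verb, adjective, etc.,
    # plus the dependency relations between the words
    gcloud ml language analyze-syntax \
        --content="Google Cloud makes big data analysis fast."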

Hands-on scenario

John is performing a few tasks in a GCP environment, but he is stuck at one point and cannot proceed further. Assist John to complete the following tasks (a command sketch follows the list):
• Log in to GCP with the provided credentials.
• Create a Google Cloud virtual network, subnet, firewall rule, and compute instance, named as you choose.
• Create a Google Cloud Storage bucket with a name of your choice and execute the command 'gsutil cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png' to retrieve an image.
• Copy the image to your newly created bucket.
• Create a Google Cloud SQL instance and a SQL user.
• Enable the 'Kubernetes Engine API' and 'Container Registry API'.
• Start a Kubernetes cluster managed by Kubernetes Engine, with a name of your choice, and configure it to run 2 nodes.
• Launch a single instance of the nginx container (version 1.10.0) and expose it to the internet using target port 80. Then scale the number of pods to 3 and confirm that the external IP address has not changed.
• Now deploy the Guestbook application to App Engine. To do so, clone the sample-application source repository at 'https://github.com/GoogleCloudPlatform/appengine-guestbook-python', then view your deployed application using the gcloud command.
• Create a new dataset called 'logdata' and set the data location to 'US'.
• Create a new table in the logdata dataset and perform query operations on the data using the 'bq' command.
• Finally, delete all the resources created as part of this hands-on exercise.
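A sketch of the commands behind a few of these tasks (names such as my-bucket, webfrontend, and accesslog are hypothetical placeholders; exact flags can vary across gcloud and kubectl versions):

    # Retrieve the sample image, then copy it into your new bucket
    gsutil cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png
    gsutil cp my-excellent-blog.png gs://my-bucket/

    # Two-node Kubernetes Engine cluster, an nginx 1.10.0 deployment,
    # an internet-facing service on port 80, then a scale-out to 3 pods
    gcloud container clusters create webfrontend --num-nodes=2 --zone=us-central1-a
    kubectl create deployment nginx --image=nginx:1.10.0
    kubectl expose deployment nginx --port=80 --type=LoadBalancer
    kubectl scale deployment nginx --replicas=3

    # A US-located BigQuery dataset named logdata, then a query over
    # a (hypothetical) accesslog table created inside it
    bq --location=US mk -d logdata
    bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM logdata.accesslog'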

Conclusion

"Learn as if you were to live forever." - Mahatma Gandhi

In this course, we have discussed the following topics, which lay the foundation for Google Cloud Platform:
• Google Cloud VPC and Compute Engine
• Google Cloud Storage and Bigtable
• Cloud SQL, Spanner, and Datastore
• Containers, Kubernetes, and Kubernetes Engine
• App Engine Standard and Flexible
• Cloud development, deployment, and monitoring
• Google Cloud big data and machine learning