Introduction to Pig and its working (lecture notes, Data Acquisition)

These notes cover what Apache Pig is, how to work with Pig in its different execution modes, how to use the main relational operators in Pig, and how to solve the word-count problem in Pig.



PIG

Introduction to Pig, The Anatomy of Pig, Pig on Hadoop, Pig Philosophy, Use Case for Pig: ETL Processing, Pig Latin Overview, Data Types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands, Relational Operators, Piggy Bank, Word Count Example using Pig, Pig at Yahoo!, Pig versus Hive

Apache Pig is a platform for data analysis. It is an alternative to writing MapReduce programs directly. Pig was developed as a research project at Yahoo.

Key Features:

  1. It provides an engine for executing data flows (i.e., how the data should flow). Pig processes data in parallel on the Hadoop cluster.
  2. It provides a language called “Pig Latin” to express data flows.
  3. Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
  4. It allows users to develop their own functions (User Defined Functions) for reading, processing and writing data.
  5. It supports HDFS commands, UNIX shell commands, Relational operators, mathematical functions and complex data structures.
  6. Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
  7. Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on semantics of the language.
  8. Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.
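As a quick illustration of point 6, a query that in SQL might read SELECT name FROM faculty WHERE sal > 10000 could be written in Pig Latin roughly as below (the file name, field names, and the threshold are illustrative assumptions, not an example from these notes):

-- equivalent of: SELECT name FROM faculty WHERE sal > 10000
fac   = load 'faculty' using PigStorage(',') as (id:int, name:chararray, sal:double);
rich  = filter fac by sal > 10000.0;
names = foreach rich generate name;
dump names;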

Anatomy of PIG.

The main components of Pig are as follows:

  1. Data flow language(Pig Latin)
  2. Interactive shell where we can write Pig Latin statements (Grunt). Eg.
     grunt> A = load 'student' as (rollno, name, gpa);
     grunt> A = filter A by gpa > 4.0;
  3. Pig interpreter and execution engine.
    • Processes and parses Pig Latin.
    • Checks data types.
    • Performs optimization.
    • Creates MapReduce Jobs.
    • Submits job to Hadoop
    • Monitors progress

PIG architecture.

  • The language used to analyze data in Hadoop using Pig is known as Pig Latin.
  • It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
  • To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
  • After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output.
  • Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy. The architecture of Apache Pig is shown below.
  • Parser : Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
  • Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
  • Compiler : The compiler compiles the optimized logical plan into a series of MapReduce jobs.
  • Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
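These stages can be observed from the Grunt shell with the explain operator, which prints the logical, physical, and MapReduce plans that the parser, optimizer, and compiler produce for a relation. A minimal sketch, reusing the faculty data from the later examples:

grunt> fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
grunt> fac1 = filter fac by sal > 10000.0;
grunt> explain fac1;

explain only prints the plans; nothing is executed until a dump or store is issued.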
  1. Load: first load (LOAD) the data you want to manipulate. As in a typical MapReduce job, that data is stored in HDFS. For a Pig program to access the data, you first tell Pig what file or files to use. For that task, use the LOAD command.

The LOAD operator operates on the principle of lazy evaluation, also referred to as call-by-need.

The optional USING clause defines how to map the data structure within the file to the Pig data model; in this case, the PigStorage() load function, which parses delimited text files.

The optional AS clause defines a schema for the data that is being mapped. If you don't use an AS clause, you're basically telling the default load function to expect a plain text file that is tab delimited.

  2. Transform: Run the data through a set of transformations that, without us having to concern ourselves with the details, are translated into a set of Map and Reduce tasks.
  3. Dump: Finally, you dump (DUMP) the results to the screen or store (STORE) the results in a file somewhere, as in the sketch below.
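Putting the three steps together, a minimal load-transform-store pipeline could look like the following sketch (the output path and the salary threshold are illustrative assumptions):

raw  = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int, name:chararray, sal:double, designation:chararray);
high = filter raw by sal >= 15000.0;
store high into '/home/chp/Desktop/high_paid' using PigStorage(',');

Here store writes the result to an output directory of part files instead of printing it on the screen with dump.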

Eg. faculty.txt
1,chp,10000,
2,pnr,20000,
3,kry,10000,

fac.pig
fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
fac1 = filter fac by name=='chp';
dump fac1;

Executing:
pig -x local
grunt> run fac.pig    or    grunt> exec fac.pig

Output: (1,chp,10000.0,)
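Besides run and exec inside Grunt, the same script can also be executed directly from the operating-system shell by passing its name to the pig command:

pig -x local fac.pig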

Datatypes in Pig Latin.

Pig has a very limited set of data types. Pig data types are classified into two types. They are:

  • Primitive
  • Complex

Primitive Data Types: The primitive datatypes are also called simple datatypes. The simple data types that Pig supports are:

  • int: It is a signed 32-bit integer. This is similar to the Integer in Java.
  • long: It is a signed 64-bit integer. This is similar to the Long in Java.
  • float: It is a 32-bit floating point number. This data type is similar to the Float in Java.
  • double: It is a 64-bit floating point number. This data type is similar to the Double in Java.
  • chararray: It is a character array in Unicode UTF-8 format. This corresponds to Java's String object.
  • bytearray: Used to represent bytes. It is the default data type. If you don't specify a data type for a field, then the bytearray datatype is assigned to the field.
  • boolean: Used to represent true/false values.

Complex Types: Pig supports three complex data types (tuple, bag, and map), which are built up from atomic values. They are listed below:

✓ Atom: An atom is any single value, such as a string or a number, e.g. an int, long, float, double, chararray, or bytearray value.
✓ Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type. Think of a tuple as a row in a table.
✓ Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible: each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.
✓ Map: A map is a collection of key-value pairs. The key must be unique and of type chararray; the value can be of any type.
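As a small sketch of how these types appear in a schema (the file, field names, and structure are illustrative assumptions, not an example from these notes):

stud = load 'students.txt' as (name:chararray, courses:{c:(course:chararray, grade:int)}, contact:map[]);
describe stud;

Here courses is a bag of (course, grade) tuples and contact is a map with chararray keys; describe prints the declared schema.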

Change the present working directory to Desktop, enter Pig as shown below, and execute.

faculty.txt
1,chp,10000,
2,pnr,20000,
3,kry,10000,

  1. Write a Pig Latin script to find the details of the employee whose name is chp and run it in local mode.

filter1.pig
fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
fac1 = filter fac by name=='chp';
dump fac1;

Executing:
pig -x local
grunt> run filter1.pig    or    grunt> exec filter1.pig

Output: (1,chp,10000.0,)

Executing Pig in MapReduce mode: To run a Pig Latin script in Hadoop MapReduce mode, write the script and save it on the Hadoop file system. Similarly, place the necessary text files into HDFS, for example as shown below.
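The input file can be copied into HDFS with the usual HDFS shell commands (the local file name and the HDFS path below match the next example):

hadoop fs -mkdir -p /user/chp/data
hadoop fs -put dept.txt /user/chp/data/
hadoop fs -ls /user/chp/data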

  1. Write a Pig Latin script to group records by department and run it in MapReduce mode.

dept.txt
2@cse
3@mca
4@cse

group1.pig
dept1 = load '/user/chp/data/dept.txt' using PigStorage('@') as (id:int,dept:chararray);
dept2 = group dept1 by dept;
dump dept2;

Executing:
pig -x mapreduce
grunt> run group1.pig    or    grunt> exec group1.pig

Output:
(cse,{(2,cse),(4,cse)})
(mca,{(3,mca)})
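A grouped relation such as dept2 is usually followed by a foreach that aggregates each group. For instance, counting the records in each department (an extra step, not part of the original script) could be written as:

grunt> dept_counts = foreach dept2 generate group as dept, COUNT(dept1) as cnt;
grunt> dump dept_counts;

which for the data above would print (cse,2) and (mca,1).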

Pig Latin Relational Operators.

In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig's ability to group, join, combine, split, filter, and sort data. Table 1 gives an overview of the operators associated with each operation.

Table 1: Operators in pig

Pig also provides a few operators that are helpful for debugging and troubleshooting, as shown in Table 2

Table 2: Debugging operators in Pig.

Eg.

  1. Write a Pig Latin script to find the highest salaried employee.

sort1.pig

fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
fac1 = order fac by sal desc;
fac2 = limit fac1 1;
dump fac2;
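Assuming fac and dept1 are the relations loaded in the next section and that the join key is id, the outer joins that produce outputs like the ones below can be sketched as:

grunt> lj = join fac by id left outer, dept1 by id;
grunt> dump lj;
grunt> rj = join fac by id right outer, dept1 by id;
grunt> dump rj;
grunt> fj = join fac by id full outer, dept1 by id;
grunt> dump fj;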

left outer join: (1,chp,10000.0,,,) (2,pnr,20000.0,,2,cse) (3,kry,10000.0,,3,mca)

right outer join: (2,pnr,20000.0,,2,cse) (3,kry,10000.0,,3,mca) (,,,,4,cse)

full outer join: (1,chp,10000.0,,,) (2,pnr,20000.0,,2,cse) (3,kry,10000.0,,3,mca) (,,,,4,cse)

Q) Write the Pig commands in local mode for the following queries using the relations (tables) faculty (fac) and department (dept1).

Entering local mode:
pig -x local

grunt> fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);

grunt> dept1 = load '/home/chp/Desktop/dept.txt' using PigStorage('@') as (id:int,dept:chararray);

grunt> dump fac;

(1,chp,10000.0,) (2,pnr,20000.0,) (3,kry,10000.0,)

grunt> dump dept1;

(2,cse) (3,mca) (4,cse)

  1. Perform cross operation on fac and dept1.

grunt> fac3 = cross fac, dept1;
grunt> dump fac3;

Output:
(1,chp,10000.0,,2,cse)
(1,chp,10000.0,,3,mca)
(1,chp,10000.0,,4,cse)
(2,pnr,20000.0,,2,cse)
(2,pnr,20000.0,,3,mca)
(2,pnr,20000.0,,4,cse)
(3,kry,10000.0,,2,cse)
(3,kry,10000.0,,3,mca)
(3,kry,10000.0,,4,cse)

grunt> fac4 = cross dept1, fac;
grunt> dump fac4;

Output:
(2,cse,1,chp,10000.0,)
(2,cse,2,pnr,20000.0,)
(2,cse,3,kry,10000.0,)
(3,mca,1,chp,10000.0,)
(3,mca,2,pnr,20000.0,)
(3,mca,3,kry,10000.0,)
(4,cse,1,chp,10000.0,)
(4,cse,2,pnr,20000.0,)
(4,cse,3,kry,10000.0,)

  2. Find the lowest salaried employee.

grunt> fac5 = order fac by sal;
grunt> fac6 = limit fac5 1;
grunt> dump fac6;

output: (1,chp,10000.0,)

  3. Find distinct departments in dept1.

grunt> dept2 = foreach dept1 generate dept;
grunt> dept3 = distinct dept2;
grunt> dump dept3;

output: (cse) (mca)

  4. Perform union operation on the fac and dept1 relations.

grunt> fd = union fac, dept1;
grunt> dump fd;

Output: (1,chp,10000.0,) (2,cse) (2,pnr,20000.0,) (3,mca) (3,kry,10000.0,) (4,cse)

(4,{},{(4,it)})

  7. WordCount

hello.txt (input text file)
hello welcome to guntur
hello welcome to vignan
welcome to cse

grunt> text = load '/home/chp/Desktop/hello.txt' as (st:chararray);
grunt> words = foreach text generate FLATTEN(TOKENIZE(st,' ')) as word;
grunt> grouped = group words by word;
grunt> wordcount = foreach grouped generate group, COUNT(words);
grunt> dump wordcount;

(to,3) (cse,1) (hello,2) (vignan,1) (welcome,3) (guntur,1)

Explanation: Convert the sentences into words: the data we have is in sentences, so we have to convert that data into words using the TOKENIZE function.

Output will be like this: {(hello),(welcome),(to),(guntur)} {(hello),(welcome),(to),(vignan)} {(welcome),(to),(cse)}

Using the FLATTEN function, each bag is converted into tuples; that is, the bag of words is expanded into multiple rows, one word per row.

Then the output is like below:

(hello) (welcome) (to) (guntur) (hello) (welcome) (to) (vignan) (welcome) (to) (cse)

We have to count each word occurrence; for that we have to group all the words as below:

grouped = group words by word;

and then generate the word count as below:

wordcount = foreach grouped generate group, COUNT(words);
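The whole flow can also be saved as a single script and its result written to a file rather than to the screen, for example (the output directory is an illustrative assumption):

wordcount.pig
text      = load '/home/chp/Desktop/hello.txt' as (st:chararray);
words     = foreach text generate FLATTEN(TOKENIZE(st,' ')) as word;
grouped   = group words by word;
wordcount = foreach grouped generate group, COUNT(words);
store wordcount into '/home/chp/Desktop/wc_out' using PigStorage(',');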

  8. Describe fac.

grunt> describe fac;
fac: {id: int,name: chararray,sal: double,designation: chararray}
