








These lecture notes introduce Apache Pig: what it is, how to run it in its different execution modes, how to use Pig's relational operators, and how to solve the word-count problem in Pig.
Typology: Lecture notes
Introduction to Pig, The Anatomy of Pig, Pig on Hadoop, Pig Philosophy, Use Case for Pig: ETL Processing, Pig Latin Overview, Data Types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands, Relational Operators, Piggy Bank, Word Count Example using Pig, Pig at Yahoo!, Pig versus Hive
Apache Pig is a platform for analyzing large data sets. It is an alternative to writing MapReduce programs directly. Pig was developed as a research project at Yahoo!.
Key Features:
Anatomy of Pig. The main components of Pig are as follows:
Figure: Pig architecture.
The LOAD operator operates on the principle of lazy evaluation, also referred to as call-by-need.
The optional USING clause defines how to map the data structure within the file to the Pig data model; in this case PigStorage(), which parses delimited text files. The optional AS clause defines a schema for the data that is being mapped. If you don't use an AS clause, you are telling the default load function (PigStorage) to expect a plain text file that is tab-delimited.
Eg. faculty.txt:
1,chp,10000.0,
2,pnr,20000.0,
3,kry,10000.0,

fac.pig:
fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
fac1 = filter fac by name=='chp';
dump fac1;
Executing:
pig -x local
grunt> run fac.pig
(or)
grunt> exec fac.pig
Output:
(1,chp,10000.0,)
Data Types in Pig Latin.
Pig has a fairly small set of data types, classified into two categories: simple (scalar) types and complex types. They are:
✓ Atom: An atom is any single (scalar) value, such as a string or a number. Eg. int, long, float, double, chararray, and bytearray.
✓ Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type. Think of a tuple as a row in a table.
✓ Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible: each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.
✓ Map: A map is a collection of key-value pairs. The key must be a chararray and must be unique; the value can be of any type.
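All four types can appear together in a LOAD schema. A minimal sketch, assuming a hypothetical pipe-delimited file students.txt (the file name and its fields are illustrative, not from these notes):

```pig
-- hypothetical input line: name|(m1,m2)|{(course),(course)}|[phone#12345]
students = load 'students.txt' using PigStorage('|')
           as (name:chararray,                         -- atom
               marks:tuple(m1:int, m2:int),            -- tuple
               courses:bag{c:tuple(course:chararray)}, -- bag of tuples
               contact:map[chararray]);                -- map: chararray keys
describe students;
```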
Executing Pig in MapReduce mode: to run a Pig Latin script in Hadoop MapReduce mode, write the script and save it on the Hadoop file system; similarly, place the necessary input text files into HDFS.
group1.pig:
dept1 = load '/user/chp/data/dept.txt' using PigStorage('@') as (id:int,dept:chararray);
dept2 = group dept1 by dept;
dump dept2;
Executing:
pig -x mapreduce
grunt> run group1.pig
(or)
grunt> exec group1.pig
Output:
(cse,{(2,cse),(4,cse)})
(mca,{(3,mca)})

Pig Latin Relational Operators.
In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data. Table 1 gives an overview of the operators associated with each operation.
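As a quick illustration of both groups of operators, a sketch follows (the file emp.txt and its fields are hypothetical):

```pig
-- access: load the data
emp = load 'emp.txt' using PigStorage(',') as (id:int, name:chararray, sal:double);
-- transform: filter, order, group
rich    = filter emp by sal > 15000.0;   -- keep well-paid employees
bysal   = order emp by sal desc;         -- sort by salary, highest first
grouped = group emp by name;             -- group tuples sharing a name
-- access: write results back out
store bysal into 'emp_sorted' using PigStorage(',');
```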
Table 1: Operators in Pig
Pig also provides a few operators that are helpful for debugging and troubleshooting, as shown in Table 2
Table 2: Debugging operators in Pig.
Eg.
sort1.pig:
fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
fac1 = order fac by sal desc;
dump fac1;
left outer join:
(1,chp,10000.0,,,)
(2,pnr,20000.0,,2,cse)
(3,kry,10000.0,,3,mca)

right outer join:
(2,pnr,20000.0,,2,cse)
(3,kry,10000.0,,3,mca)
(,,,,4,cse)

full outer join:
(1,chp,10000.0,,,)
(2,pnr,20000.0,,2,cse)
(3,kry,10000.0,,3,mca)
(,,,,4,cse)
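The join commands that produce the three outputs above are not shown; a sketch using the fac and dept1 relations defined later in these notes, joining on id:

```pig
lj = join fac by id left outer,  dept1 by id;   -- keeps all fac tuples
rj = join fac by id right outer, dept1 by id;   -- keeps all dept1 tuples
fj = join fac by id full outer,  dept1 by id;   -- keeps tuples from both sides
dump lj;
```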
Q) Write the Pig commands in local mode for the following queries, using the relations (tables) faculty (fac) and department (dept1).
Entering into local mode:
pig -x local
grunt> fac = load '/home/chp/Desktop/faculty.txt' using PigStorage(',') as (id:int,name:chararray,sal:double,designation:chararray);
grunt> dept1 = load '/home/chp/Desktop/dept.txt' using PigStorage('@') as (id:int,dept:chararray);
grunt> dump fac;
(1,chp,10000.0,)
(2,pnr,20000.0,)
(3,kry,10000.0,)
grunt> dump dept1;
(2,cse)
(3,mca)
(4,cse)
grunt> fac3 = cross fac,dept1;
grunt> dump fac3;
output:
(1,chp,10000.0,,2,cse)
(1,chp,10000.0,,3,mca)
(1,chp,10000.0,,4,cse)
(2,pnr,20000.0,,2,cse)
(2,pnr,20000.0,,3,mca)
(2,pnr,20000.0,,4,cse)
(3,kry,10000.0,,2,cse)
(3,kry,10000.0,,3,mca)
(3,kry,10000.0,,4,cse)
grunt> fac4 = cross dept1,fac;
grunt> dump fac4;
output:
(2,cse,1,chp,10000.0,)
(2,cse,2,pnr,20000.0,)
(2,cse,3,kry,10000.0,)
(3,mca,1,chp,10000.0,)
(3,mca,2,pnr,20000.0,)
(3,mca,3,kry,10000.0,)
(4,cse,1,chp,10000.0,)
(4,cse,2,pnr,20000.0,)
(4,cse,3,kry,10000.0,)
output: (1,chp,10000.0,)
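The command producing the single-tuple output above is not shown in the source; one query that yields it, as a sketch:

```pig
f1 = filter fac by id == 1;   -- select only the faculty row with id 1
dump f1;
```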
output: (cse) (mca)
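The output above can be obtained by projecting the dept column and removing duplicates (a sketch; the original query is not shown in the source):

```pig
d  = foreach dept1 generate dept;  -- project the dept column
dd = distinct d;                   -- drop duplicate rows
dump dd;
```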
grunt> fd = union fac,dept1;
grunt> dump fd;
Output:
(1,chp,10000.0,)
(2,cse)
(2,pnr,20000.0,)
(3,mca)
(3,kry,10000.0,)
(4,cse)
hello.txt (input text file):
hello welcome to Guntur
hello welcome to vignan
welcome to cse
grunt> text = load '/home/chp/Desktop/hello.txt' as (st:chararray);
grunt> words = foreach text generate FLATTEN(TOKENIZE(st,' ')) as word;
grunt> grouped = group words by word;
grunt> wordcount = foreach grouped generate group, COUNT(words);
grunt> dump wordcount;
(to,3)
(cse,1)
(hello,2)
(vignan,1)
(welcome,3)
(Guntur,1)
Explanation: Convert the sentences into words: the data we have is in sentences, so we have to convert it into words using the TOKENIZE function.
Output will be like this:
{(hello),(welcome),(to),(Guntur)}
{(hello),(welcome),(to),(vignan)}
{(welcome),(to),(cse)}
Using the FLATTEN operator, each bag is un-nested into individual tuples; that is, the array of strings is converted into multiple rows.
Then the output is as below:
(hello)
(welcome)
(to)
(Guntur)
(hello)
(welcome)
(to)
(vignan)
(welcome)
(to)
(cse)
We have to count each word's occurrences; for that we group all the words:
grouped = GROUP words BY word;
and then generate the word count:
wordcount = FOREACH grouped GENERATE group, COUNT(words);
8. Describe fac
grunt> describe fac;
fac: {id: int,name: chararray,sal: double,designation: chararray}