Module-6 Pig Latin Part-2

Data transformation using the following major operators.

foreach : Suppose you have 60 records in a relation; using 'foreach' you can apply an operation to each record of the relation.

Syntax : 

alias = FOREACH alias GENERATE expression [, expression ...];

alias = FOREACH nested_alias { nested_op; ... GENERATE expression [, expression ...]; };

myData = foreach categories generate *; -- selects all the columns from categories and generates a new relation, myData

alias : Name of the output relation; it is an outer bag.

Example 1 : Projection using foreach (HandsOn)

categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');

myData = foreach categories generate *;

dump myData;

(1,2,Football) (2,2,Soccer) (3,2,Baseball & Softball)

myDataSelected = foreach categories generate $0,$1; --Selecting only first 2 columns

dump myDataSelected;

(1,2) (2,2) (3,2)

Example 2 : Projection using schema (HandsOn)

categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);

selectedCat = foreach categories2 generate subId,catName; --Selecting only two columns, using column name

DUMP selectedCat;

(2,Football) (2,Soccer) (2,Baseball & Softball)

subtract = foreach categories2 generate id-subId; --This is just to show that you can use expressions

dump subtract;

(-1) (0) (1)
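By default, an expression result has no name (Pig refers to it positionally as $0). If later operators need to refer to it by name, you can alias it with AS. A small sketch using the categories2 schema above (the name idDiff is chosen only for illustration):

diffData = foreach categories2 generate id - subId AS idDiff;

DESCRIBE diffData; -- diffData: {idDiff: int}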

So you can refer to columns in a relation by position ($0, $1, ...) or, when a schema is defined, by name.

Example 3 : Another way of selecting columns using two dots ..  (HandsOn) 

selectedCat3 = foreach categories2 generate id..catName; --Select all the columns from id to catName, inclusive

DUMP selectedCat3;

selectedCat4 = foreach categories2 generate subId..; --Select subId and all the columns that come after it

DUMP selectedCat4;

selectedCat5 = foreach categories2 generate ..catName; --Select all the columns up to and including catName

DUMP selectedCat5;

Example 4 : Loading Complex Datatypes (HandsOn): 

Step 1 : Download complex data

Step 2 : Below is the schema for the data.

member = tuple(member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray])
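For reference, a line of the complex data file would look something like the following. This is a hypothetical sample row, not the actual downloaded data: fields are separated by '|', the tuple is in parentheses, the bag in curly braces, and the map in square brackets with '#' between key and value:

1|john@hadoopexam.com|(John,K,Smith)|{(Hadoop),(Spark),(Pig)}|[programming1#Java,programming2#Python]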

Step 3 : Upload data in hdfs first at /user/cloudera/Training/pig/complexData.txt

Step 4 : Now write Pig script using above schema to load the data and create a relation out of this.

hadoopexamMember = LOAD '/user/cloudera/Training/pig/complexData.txt' using PigStorage('|')

   AS (member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray]);

DESCRIBE hadoopexamMember ;

DUMP hadoopexamMember ;

Step 5 : Now find all the courses subscribed to by each member, along with their programming skills. (The idea: based on programming skills, analyze which courses members are interested in, so the results can help recommend the same course list to similar members with the same programming skills.)

primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, course_list, technology#'programming1' ;

DESCRIBE primaryProgramming ;

DUMP primaryProgramming ;

Step 6 : Now see some flatten results.

getFirstCourseSubscribed = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, FLATTEN(course_list), technology#'programming1' ;

DESCRIBE getFirstCourseSubscribed ;

DUMP getFirstCourseSubscribed ;

Step 7 : Get the number of courses subscribed to by each user

memberCourseCount= FOREACH hadoopexamMember GENERATE  member_email, COUNT(course_list);

dump memberCourseCount;

Note : We will see the FLATTEN operator in more detail in coming modules.
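For intuition in the meantime: FLATTEN un-nests the course_list bag, producing one output tuple per course. With a hypothetical member subscribed to two courses, a single input record such as

(1,john@hadoopexam.com,John,{(Hadoop),(Spark)},Java)

would appear in getFirstCourseSubscribed as two rows:

(1,john@hadoopexam.com,John,Hadoop,Java)
(1,john@hadoopexam.com,John,Spark,Java)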

Example 5 : Loading compressed files (HandsOn): 

Step 1 : Download compressed file.

Step 2 : Upload in hdfs using Hue

Step 3 : Load compressed data using Pig

hadoopexamCompress = LOAD '/user/cloudera/Training/pig/complexData.txt.gz' USING PigStorage('|')

   AS (member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray]);

DESCRIBE hadoopexamCompress ;

DUMP hadoopexamCompress ;

Example 6 : Store relation as compressed files (HandsOn): 

STORE hadoopexamMember INTO '/user/cloudera/Training/pig/output/complexData.txt.bz2' using PigStorage('|') ;

Now check in HDFS, using Hue, whether the following directory has been created.

/user/cloudera/Training/pig/output/complexData.txt.bz2

Note : A bag does not guarantee that its tuples will always be in order.

Example 7 : Right way of solving the problem. (HandsOn): 

Step 1 : Download file

Step 2 : Upload in hdfs using Hue UI at /user/cloudera/HEPig/handson11/module11Score.txt

Step 3 : Now write the following Pig script (wrong approach first, to see the error)

memberScore = load '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') as (email:chararray, spend1:int, spend2:int);

DESCRIBE memberScore;

memberScore: {email: chararray,spend1: int,spend2: int}

groupedScore = group memberScore by email; -- produces bag memberScore containing all the records for a given value of email

DESCRIBE groupedScore;

groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}

-- The statement below will throw an error. Because memberScore is a bag, a record of groupedScore can contain multiple tuples in it, and an operation like (bag + bag) is not allowed.

totalScore = foreach groupedScore generate SUM(memberScore.spend1 + memberScore.spend2);

Correct steps 

memberExpense1 = foreach memberScore generate email, spend1 + spend2 as rowLevelExpense; -- compute the row-level sum first, then group

describe memberExpense1 ;

memberExpense1: {email: chararray,rowLevelExpense: int}

groupedData = group memberExpense1 by email;

describe groupedData ;

groupedData: {group: chararray,memberExpense1: {(email: chararray,rowLevelExpense: int)}}

totalExpense = foreach groupedData generate SUM(memberExpense1.rowLevelExpense);

describe totalExpense ;

totalExpense: {long}
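Note that totalExpense above carries only the sums, with no indication of which member each sum belongs to. To keep the grouping key in the output, you can also project group. A sketch using the same relations as above (the aliases email and total are chosen only for readability):

totalExpensePerMember = foreach groupedData generate group AS email, SUM(memberExpense1.rowLevelExpense) AS total;

describe totalExpensePerMember;

totalExpensePerMember: {email: chararray,total: long}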

Example 8 : Nested FOREACH statements to solve the same problem. (HandsOn): 

memberScore = load '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') as (email:chararray, spend1:int, spend2:int);

DESCRIBE memberScore;

memberScore: {email: chararray,spend1: int,spend2: int}

groupedScore = group memberScore by email; -- produces bag memberScore containing all the records for a given value of email

DESCRIBE groupedScore;

groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}

result = foreach groupedScore {
    individualScore = foreach memberScore generate (spend1 + spend2); -- iterates only over the records of the memberScore bag
    generate group, SUM(individualScore);
};

dump result;

describe result;

result: {group: chararray,long}
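A nested block is not limited to an inner foreach: operators such as FILTER, DISTINCT, LIMIT and ORDER can also appear before the final GENERATE. A sketch on the same groupedScore relation (the spend1 threshold of 100 is chosen only for illustration):

bigSpenders = foreach groupedScore {
    highRows = filter memberScore by spend1 > 100;
    generate group, COUNT(highRows) AS highCount;
};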

Pig Filter Function

I have a bunch of strings that have various prefixes, including "unknown:". I'd really like to filter out all the strings starting with "unknown:" in my Pig script, but it doesn't appear to work.

simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown');

I've tried a few other permutations of the regex, but it appears that MATCHES just doesn't work well with NOT. Am I missing something?

It's because the MATCHES operator behaves exactly like Java's String#matches, i.e. it tries to match the entire string and not just part of it (the prefix in your case). Just update your regular expression to match the entire string with your specified prefix, like so:

simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown.*');
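If your Pig version provides the built-in string function STARTSWITH (added in newer releases; check your version's built-in function list), an alternative sketch that avoids the regex entirely is:

simpleFilter = FILTER records BY NOT STARTSWITH(mystr, 'unknown');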

-----------

I have a Pig job wherein I need to filter the data by finding a word in it.

Here is the snippet

A = LOAD '/home/user/filename' USING PigStorage(',');

B = FOREACH A GENERATE $27,$38;

C = FILTER B BY ( $1 ==  '*Word*');

STORE C INTO '/home/user/out1' USING PigStorage();

The error is in the third line, while computing C. I have also tried using

       C = FILTER B BY $1 MATCHES '*WORD*'  

Also

      C = FILTER B BY $1 MATCHES '\\w+WORD\\w+'  

Can you please correct and help me out.

Thanks

MATCHES uses regular expressions. You should do ... MATCHES '.*WORD.*' instead.
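Since MATCHES uses Java regular expression syntax, the inline (?i) flag can also make the match case-insensitive, so that 'Word', 'WORD' and 'word' are all caught. A sketch on the same relation:

C = FILTER B BY $1 MATCHES '(?i).*word.*';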

------------------------------------------

------------------------------------------

Does Pig support an IN clause?

filtered = FILTER bba BY reason not in ('a','b','c','d');

Or should I split it up into multiple ORs?

Thanks!

No, Pig doesn't support an IN clause (note: the IN operator was later added in Pig 0.12). I had a similar situation. As a workaround, you can use the AND operator with FILTER, like:

A= LOAD 'source.txt' AS (user:chararray, age:chararray);

B= FILTER A BY ($1 matches 'tapan') AND ($1 matches 'superman');

However, if the number of values to filter on is huge, you can instead create a relation that contains all these keywords and do a join, keeping the rows wherever an occurrence matches. Hope this helps.

You can get by with AND/OR/NOT.
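For the question above, the NOT-IN filter can be written with chained ANDs on any Pig version; on Pig 0.12 and later the IN operator itself is available. Both sketches assume the bba/reason names from the question:

filtered = FILTER bba BY (reason != 'a') AND (reason != 'b') AND (reason != 'c') AND (reason != 'd');

filtered = FILTER bba BY NOT (reason IN ('a','b','c','d'));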

===================


I have a input file like this:

481295b2-30c7-4191-8c14-4e513c7e7577,1362974399,56973118825,56950298471,true

67912962-dd84-46fa-84ef-a2fba12c2423,1362974399,56950556676,56982431507,false

cc68e779-4798-405b-8596-c34dfb9b66da,1362974399,56999223677,56998032823,true

37a1cc9b-8846-4cba-91dd-19e85edbab00,1362974399,56954667454,56981867544,false

4116c384-3693-4909-a8cc-19090d418aa5,1362974399,56986027804,56978169216,true

I only need the lines in which the last field is "true". So I use the following Pig Latin:

records = LOAD 'test/test.csv' USING PigStorage(',');

A = FILTER records BY $4 'true';

DUMP A;

The problem is the second command, I always get the error:

2013-08-07 16:48:11,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 25>  mismatched input ''true'' expecting SEMI_COLON

Why? I also tried "$4 == 'true'" but it still doesn't work. Could anyone tell me how to do this simple thing?

How about:

A = FILTER records BY $4 == 'true' ;

Also, if you know how many fields the data will have beforehand, you should give it a schema. Something like:

records = LOAD 'test/test.csv' USING PigStorage(',') 

          AS (val1: chararray, val2: int, val3: long, val4: long, bool: chararray);

Or whatever names/types fit your needs.

Further reading: related Stack Overflow questions on FILTER.

http://stackoverflow.com/questions/10314282/filtering-in-pig

http://stackoverflow.com/questions/29064422/pig-filter-out-rows-with-improper-number-of-columns

http://stackoverflow.com/questions/16569912/apache-pig-easier-way-to-filter-by-a-bunch-of-values-from-the-same-field

http://stackoverflow.com/questions/22196120/filtering-inside-a-foreach-block-by-a-condition-value-calculated-inside-the-same

http://stackoverflow.com/questions/31675626/pig-udf-on-filter

http://stackoverflow.com/questions/10349618/too-many-filter-matching-in-pig

http://stackoverflow.com/questions/13165337/filtering-null-values-with-pig

http://stackoverflow.com/questions/18213790/conditional-filter-in-group-by-in-pig

http://stackoverflow.com/questions/9716673/using-filter-after-foreach-in-pig-latin-failed

http://stackoverflow.com/questions/10974621/pig-latin-relaxed-equals-with-null

http://stackoverflow.com/questions/21546320/pig-udf-not-being-able-to-filter-words

http://stackoverflow.com/questions/17764761/comparing-datetime-in-pig

http://stackoverflow.com/questions/24637281/how-to-filter-nan-in-pig