What separators does the Hive ngram UDF use to tokenize?


Keywords: split


Question: 

I am performing some sentiment analysis.

I need to count the vocabulary (distinct words) within a text.

The ngram UDF seems to do a great job at determining the unigrams. I want to know what separators it uses to determine the unigrams/tokens. This matters because I want to mimic the vocabulary count using the split UDF instead. For example, given the following text (a product review):

I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.

The ngram UDF counts 82 unigrams/tokens:

SELECT count(*) FROM
  (SELECT explode(ngrams(sentences(upper(reviewtext)), 1, 9999999))
   FROM amazon.Food_review_part_small
   WHERE asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR') t;
82

However, using the split UDF with space, comma, period, hyphen, and double quotation marks as separators, there are 85 distinct unigrams/tokens:

SELECT count(DISTINCT te)
FROM amazon.Food_review_part_small
LATERAL VIEW explode(split(upper(reviewtext), '[\\s,.-]|\"')) t AS te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR';
85
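
For comparison, the same lateral view can list the tokens instead of counting them; this is just the query above with the aggregate removed, reusing the same table and filters:

-- List the distinct split() tokens so they can be compared against the ngram output.
SELECT DISTINCT te
FROM amazon.Food_review_part_small
LATERAL VIEW explode(split(upper(reviewtext), '[\\s,.-]|\"')) t AS te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR';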

Of course, there is little to no documentation that I can find. Does anyone know what separators the ngram UDF uses to determine unigram tokens?


1 Answer: 

The UDAF ngrams does not split the data; it already expects an array of strings (or an array of arrays of strings) as input. It is the UDF sentences that does the splitting in this case. From the Java source comments:

+ "Unnecessary punctuation, such as periods and commas in English, is automatically stripped."
+ " If specified, 'lang' should be a two-letter ISO-639 language code (such as 'en'), and "
+ "'country' should be a two-letter ISO-3166 code (such as 'us'). Not all country and "
+ "language codes are fully supported, and if an unsupported code is specified, a default "

If you run the following query

select sentences("I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.");

you will get the following result

[["I","was","aboslutely","shocked","to","see","how","much","1","oz","really","was"],["At","I","mistakenly","assumed","it","would","be","a","decent","sized","can"],["As","locally","I","am","able","to","buy","a","medium","sized","tube","of","wasabi","paste","for","around","but","never","used","it","fast","enough","so","it","would","get","old"],["I","figured","a","powder","would","be","better","so","I","can","mix","it","as","I","needed","it"],["When","I","opened","the","box","and","dug","thru","the","packing","and","saw","this","little","little","can","I","started","looking","for","the","hidden","cameras","thought","this","HAD","to","be","a","joke"],["Nope","and","it's","NOT","returnable","either"],["SO","I","HAVE","LEARNED","MY","LESSON"],["Please","just","be","aware","if","you","should","decide","you","want","this","EXPENSIVE","wasabi","powder"]]

As you can see, the UDF sentences also strips out "noise" like "$7.60" and "$3" rather than keeping them as tokens, which helps explain why the split-based count of 85 is higher than the ngram count of 82.
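
So to make the two counts line up, tokenize with sentences rather than split. A sketch against the question's table (the two lateral views flatten the array<array<string>> that sentences returns; the distinct count should match the 82 from ngrams):

-- Flatten sentences() output (an array of sentences, each an array of words)
-- into one token per row, then count the distinct tokens.
SELECT count(DISTINCT token)
FROM amazon.Food_review_part_small
LATERAL VIEW explode(sentences(upper(reviewtext))) s AS sent
LATERAL VIEW explode(sent) w AS token
WHERE asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR';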