How to extract the values in tick separated by comma between the bracket of "PARTITIONED BY" clause


Keywords:shell 


Question: 

I have shell script which extracts the syntax of the create table statement for all tables in a database. I loop one create table statement at a time and create table statement will be available as a variable $DATA within the loop. I need to extract the columns in the create table statements within the partitioned by clause.

For example, $DATA is the variable within the loop

Input of Iteration 1 to the loop:

DATA="CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int, `permi` varchar(100)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')"

Output of iteration 1: dataoutput=depth,permi

Input of Iteration 2 to the loop:

DATA="CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')"

Output of iteration 2: dataoutput=depth

Input of Iteration 3 to the loop:

DATA="CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int, `permi` varchar(100), `www` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')"

Output of iteration 3: dataoutput=depth,permi,www


1 Answer: 

Just try this:

my @bcktik = "";
while(<DATA>)
{
    if($_=~m/PARTITIONED BY\s*\(((?:\(.*\)|[^\(])*)\)/i)
    {
        push(@bcktik, join "\,", ($1=~m/`([^`]*)`/g));
    }
}
print "$_\n" for @bcktik;

__DATA__
CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int, `permi` varchar(100)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')

CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')

CREATE TABLE `xxx`( `path` varchar(200), `fsize` bigint, `usrname` varchar(100)) PARTITIONED BY ( `depth` int, `permi` varchar(100), `www` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'xxx' TBLPROPERTIES ( 'transient_lastDdlTime'='1519784177')