regex - Hive - use regxp to create groups and allocate them into an array


Keywords:arrays 


Question: 

I've a regular expression to parse some raw data in a table, something like: ',?([\w]*|\d*)'.

 select regexp_extract(raw_line,',?([\w]*|\d*){1}',1) as field1
      , regexp_extract(raw_line,',?([\w]*|\d*){2}',1) as field2
      , ...
      , regexp_extract(raw_line,',?([\w]*|\d*){n}',1) as fieldn
 from table

This creates groups to parse CSV fields. The issue is that the table and each rows to parse are long, so this is an expensive operation.

I was wondering if I could use the regular expression (without the group {n} and split it into an array. then I could use the array indexes to get each field. Something like:

 select arr_raw[0] as field1,
          ...,
        arr_raw[n] as fieldn
 from (
     split(
         select regexp_extract(raw_line, ',?([\w]*|\d*)'
     ,  ) as arr_raw -- ??
 )t

I know how many groups will be in the table, but not sure if this is possible or what the correct syntax/approach should be.

Thanks.


1 Answer: 

with raw_sample as (
select 'field1,field2,fiend3,123,456,"http://some.domain/abc/Player.aspx?playerID=111&BrowseIds=2221,423062611,423870887,424044345,...,",THIS_IS_MY,en,20 294 998 1001,end' as raw_line
)

select  fields[0] as field0
       ,fields[1] as field1
       ,fields[2] as field2
       ,fields[3] as field3
       ,fields[4] as field4
       ,fields[5] as field5
       ,fields[6] as field6
       ,fields[7] as field7
       ,fields[8] as field8
       ,fields[9] as field9

from   (select  split(regexp_replace(raw_line,'(".*?"|[^,]*),',concat('$1',unhex(1))),'\\x01')   as fields

        from    raw_sample
        ) t
;