Parsing a long text and fetching desired pattern strings using SPARK SQL
Recently there was a need from an analyst group to extract all desired words/substrings within a long text column using spark SQL.
The input text was basically an email message body text that a seller sends out to his/her customers with many urls like their store webpage, facebook page etc.
Trivial solution that the analyst team member tried was using regex_extract built-in function.
REGEXP_EXTRACT(lower(‘Thanks for your purchase Leave your feedback “ http //feedback.abc.com” . Like us on
facebook “https //www.facebook.com/abcPets". Visit our site here “www.abcpets.com" and and save 7 of your next purchase
use code save7 at checkout. Call us 123 456 7890.’),’((http //|https //|http://|https://|(www.))\w+\.\w+)',0) AS URL_PATTERN_1 ;
But the result was only first matching string as shown below:
— — — — — — — — — -
NOTE: the email msg body text is obfuscated intentionally to not share any confidential data elements.
So, to mitigate this the alternate solution was to either write UDF or leverage some other builtin functions in tandem to get the desired result. So, here is my version of solution without the need to write UDF.
To illustrate this solution, first we need some sample data. So here it is:
/* Creating sample data for testing */
create or replace temporary view v1
select 1 as email_num, ‘Sender1’ as sender, ‘Receiver1’ as receiver, ‘Thanks for your purchase Leave your feedback “ http //feedback.abc.com” . Like us on
facebook https //www.facebook.com/abcPets.Visit our site here www.abcpets.com and and save 7 of your next purchase
use code save7 at checkout. Call us 123 456 7890.’ as email_text
select 2 as email_num, ‘Sender2’ as sender, ‘Receiver2’ as receiver, ‘Thanks for your “ http //feedback.toys.com” purchase Leave your feedback “ http //feedback.toys.com” . Like us on
facebook https //www.facebook.com/toys4Pets.Visit our site here www.123toys222222.com and and save 7 of your next purchase
use code save7 at checkout. Call us 987 654 3210.’ as email_text
/* cache the data into memory */
cache table v1;
/* verify the sample data */
select * from v1;