Recently there was a need from an analyst group to extract all desired words/substrings within a long text column using spark SQL.
The input text was basically an email message body text that a seller sends out to his/her customers with many urls like their store webpage, facebook page etc.
Trivial solution that the analyst team member tried was using regex_extract built-in function.
REGEXP_EXTRACT(lower(‘Thanks for your purchase Leave your feedback “ http //feedback.abc.com” . Like us on
facebook “https //www.facebook.com/abcPets". Visit our site here “www.abcpets.com" and and save 7 of your next purchase
use code save7 at checkout. Call us 123…
In this article I would like to share my understanding and experiences with Spark SQL since we just completed migrating 300+ tables in my domain, from Teradata based ETL processes to Spark SQLs. Backend data storage as HDFS.
Below are the couple of questions out of many. we encountered during this migration:
How can we maintain the distribution of data like we have in Teradata based on Primary Index?
Spark does support this feature using “distribute by” method. As the name reflects it distributes the data based on the columns/(valid expressions) you pass it distributes the data across the partitions…