SAS Data Loader for Hadoop


Hello. I’m Kumar Thangamuthu, architect for data management at SAS. Organizations today are faced with huge volumes of often-dirty data and a gap in the skills needed to access and manage that data. SAS Data Loader for Hadoop helps organizations tackle their Hadoop skills shortage. Business users can access, prepare, and cleanse data through an intuitive user interface that requires no coding. Data scientists and power users can edit and run the code themselves. What’s more, these processes run inside Hadoop for improved performance. Let’s get started with the demo.

Data Loader has a web-based, wizard-driven user interface that improves productivity. Directives are used to execute activities like copying data, blending it, or profiling it.

In our demo, let’s assume that I’m a business analyst at a large bank. I want to understand customer behavior, based on credit-card transactions, to recommend similar or complementary products.
You can access data from SAS data sets, the SAS Viya in-memory server, CSV, and other text files. You can use existing SAS/ACCESS libraries to access cloud sources like Amazon Redshift, or use JDBC to access relational data sources like Oracle or Teradata, to name a few. In this case, we have copied the cardholder and credit-card transaction data into Hadoop.
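To give a sense of what Data Loader handles for you behind the scenes, the sketch below shows roughly how the same connections and copy step could be written in SAS code. It assumes SAS/ACCESS Interface to Hadoop and SAS/ACCESS Interface to JDBC are licensed; the host names, ports, credentials, driver path, and table names are placeholders, not values from the demo.

/* Hive libref via SAS/ACCESS Interface to Hadoop (connection details are placeholders) */
libname hdp hadoop server="hive.example.com" port=10000 schema=default
        user=demo password="demopass";

/* Oracle libref via SAS/ACCESS Interface to JDBC (driver path and URL are placeholders) */
libname ora jdbc driverclass="oracle.jdbc.OracleDriver"
        url="jdbc:oracle:thin:@dbhost.example.com:1521/orcl"
        classpath="/opt/jdbc-drivers" user=demo password="demopass";

/* Copy the source tables into Hadoop for in-cluster processing */
data hdp.cardholders;
   set ora.cardholders;
run;

data hdp.card_transactions;
   set ora.card_transactions;
run;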
First, we want to look at the metadata and column values of the cardholder table. Here we have one column, Member Data, with all the information about a customer, like name, address, email address, phone number, et cetera. We can apply a rich set of data quality functions to extract and cleanse. In this case, we’ll extract the Member Data column into multiple fields. Then we will parse the address column into individual fields such as street name, city, state, postal code, and country.
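The address-parsing portion of that step can be expressed with SAS Data Quality functions. The sketch below is a minimal illustration, assuming a Quality Knowledge Base with the English (United States) locale loaded; the parse-definition and token names are assumptions that vary by QKB release, and the column names are placeholders.

/* Minimal sketch: parse an address column into individual fields.            */
/* Definition and token names are assumptions that depend on the QKB locale.  */
data work.cardholder_parsed;
   set hdp.cardholders;
   length parsed $500 street city state_code postal $100;
   parsed     = dqParse(address, 'Address', 'ENUSA');
   street     = dqParseTokenGet(parsed, 'Street Name',    'Address', 'ENUSA');
   city       = dqParseTokenGet(parsed, 'City',           'Address', 'ENUSA');
   state_code = dqParseTokenGet(parsed, 'State/Province', 'Address', 'ENUSA');
   postal     = dqParseTokenGet(parsed, 'Postal Code',    'Address', 'ENUSA');
run;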
As I build out the transformation of my cardholder table, I can save the directive for reuse later. Note that I can secure and share my directives using the SAS folder system, for improved collaboration and security.
Now the processing is happening inside Hadoop. There is no coding, and no data movement to the client or the application. Here you can see the Cardholder Information column extracted into individual fields.
Next we want to understand the data by profiling it. You can see metrics about each column. If we zoom in on State, we see 61 state codes, so we know that something is amiss.
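As a rough illustration of the same check outside the Data Loader interface, a quick profile of the state column could be run with a couple of lines of SAS; the library, table, and column names below are placeholders.

/* Report the number of distinct state codes and their frequencies (placeholder names) */
proc freq data=work.cardholder_parsed nlevels;
   tables state_code;
run;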
We can apply data quality functions to the data to clean it up. In this case, we’ll standardize the state codes, which will ensure more accurate reports and decisions based on our cardholder data.
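In plain SAS code, a comparable standardization step might look like the sketch below. It assumes the ENUSA QKB locale is loaded; the 'State/Province (Abbreviation)' definition name is an assumption and may differ in your QKB, and the table and column names are placeholders.

/* Minimal sketch: standardize state codes with a SAS Data Quality definition. */
/* The definition name is an assumption that depends on the installed QKB.     */
data work.cardholder_std;
   set work.cardholder_parsed;
   state_code = dqStandardize(state_code, 'State/Province (Abbreviation)', 'ENUSA');
run;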
Next we want to identify any duplicate customer information and choose the golden record for a customer. We can generate match codes based on fuzzy matching of the names. You can choose a sensitivity level for the fuzzy match; we’ll leave the default value of 85. Again, the processing is pushed inside Hadoop.
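A hedged sketch of the equivalent match-code step in SAS code follows. It assumes the ENUSA 'Name' match definition and uses the same sensitivity of 85 mentioned in the demo; the table and column names are placeholders.

/* Generate fuzzy match codes on the customer name at sensitivity 85 */
data work.cardholder_mc;
   set work.cardholder_std;
   length match_code $30;
   match_code = dqMatch(customer_name, 'Name', 85, 'ENUSA');
run;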
Here you can see that Robert Smith, Bob Smith, and Bobby Smith all have the same match code. Now we can cluster records based on the match codes generated and choose a surviving golden record. Here we can see a golden record selected for Smith.
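Data Loader applies rule-based survivorship to choose the golden record; the sketch below is a deliberately simplified stand-in that clusters on the match code and keeps the first record in each cluster. Table names are placeholders.

/* Simplified survivorship: keep one record per match-code cluster */
proc sort data=work.cardholder_mc out=work.cardholder_sorted;
   by match_code;
run;

data work.cardholder_golden;
   set work.cardholder_sorted;
   by match_code;
   if first.match_code;   /* keep the first record in each cluster as the "golden" record */
run;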
Next we’ll blend the cardholder data with our credit-card transaction data. Here you can see the final analytic base table. We can now copy the data to the SAS Viya in-memory server or another analytics platform for further analysis and visualization.
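In rough SAS terms, the blend and the copy to SAS Viya could look like the sketch below: a join to build the analytic base table, then a load into CAS. The join key, column names, caslib, and session name are assumptions, not details from the demo.

/* Blend golden cardholder records with credit-card transactions (placeholder names) */
proc sql;
   create table work.card_abt as
   select c.*, t.transaction_date, t.merchant_category, t.amount
   from work.cardholder_golden as c
        inner join hdp.card_transactions as t
        on c.card_number = t.card_number;
quit;

/* Load the analytic base table into SAS Viya (CAS) for analysis and visualization */
cas mysess;
proc casutil;
   load data=work.card_abt outcaslib="casuser" casout="card_abt" replace;
quit;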
As you have seen from the demo, SAS Data Loader for Hadoop empowers business users to prepare, integrate, and cleanse big data faster and more easily, without writing code. Data scientists can push processing down inside Hadoop for faster performance. IT is freed from the burden of provisioning data. If you’d like to learn more about how SAS can help you make better decisions faster, on data you can trust, visit us at sas.com/dataloader.
