r sql server birds
HOW TO

How to Integrate Python and R with SQL Server 2017

Benefits & examples of R SQL Server 2017.
Author: Ion, Product & Data  
   Last updated: August 11, 2017

The use of SQL Server [2017], R and Python have become ever more popular in recent years – with one significant issue:

Lack of integration.

Whereby two priority languages seem to have acted in silos. However, as we’ll explain below, Microsoft's newly found openess has some fairly sizeable impacts.

With this in mind, we have examined the benefits by creating a model of extreme gradient boosting to demonstrate the SQL and R integration.

Combining Python and R SQL Server

As can be seen from various industry reports, Microsoft has made good progress with its big data offerings in recent years.

These solutions include:

  • Azure SQL Data Warehouse
  • Data Lake
  • HDInsight
  • Machine Learning studio suite

However, despite their ability to stay on top of the relational database game with Microsoft SQL Server, they have lacked the tight integration of more advanced analytical functionality.

Which, for those who rely on this solution, is a big void.

As such, Microsoft have been working hard to enable its data scientists, developers, and database administrators to solve just that.

With all this considered, not only has Microsoft introduced SQL Server R Services with SQL Server 2016, they equally decided to enable the rest of the community to run on-the-spot analytics on cold, relational data by implementing Python in SQL Server 2017.

In doing so, they are enabling a large part of the data science community to use their tool of preference. Especially as the community is split between R and Python as the two main languages of preference.

python r sql server table.jpg

R and Python in SQL Server most certainly is a combination of tools that data scientists have long been waiting for.

Whereby, you can combine the scalable database functionality of Microsoft SQL Server with R and Python’s plethora of advanced analytical capabilities and libraries - all to create systems that can perform demanding and computationally intensive analytical calculations.

Especially the latest addition of Python, which now allows data scientists to use an amazingly broad selection of Machine Learning and statistical third party libraries.

R and SQL Server Working in Tandem: The Benefits

With the combination of those powerful tools in their hands, data scientists can stop worrying about scale, data volume/transfer and focus on the analytical work.

Amongst others -

The benefits of building apps with R on SQL Server are:

  • Flexibility & agility
  • Performance & scale
  • Cost effective

For more information, see these great slides from Microsoft.

sql-1.png

sql-2.png

R and Microsoft SQL Server example: Extreme Gradient Boosting in

In this example, we chose to test the capabilities of SQL Server’s R Services with a well-known German credit scoring dataset that consists of client credit scoring information.

The dataset consists of clients who received a loan, with properties such as income, age, loan, etc., and whether they proved to be good for the credit.

So, first we built a model trained on that data set.

Then we predicted the client’s creditworthiness for the loan they are applying for on a test set. An interesting approach to that end is to use Extreme gradient boosting - the winner of many Kaggle competitions to date.

Doing this in R is an easy, straight-forward approach for someone familiar with Machine Learning:

  1. We just have to take the data
  2. Split it into a training and a testing set
  3. Build an XGB model with the first set
  4. Test it on the second set.

After installing the R Services and the XGB package in our SQL Server (packages have to be installed by an administrator), we were able to transfer the initial R code to SQL.

Quite noteworthy, you execute your R code by passing it as a parameter to the sp_execute_external_script procedure and inserting the result into a new table.

Below you can see the code for our example - note that it is SQL server that executes an R script contained in the string:

execute sp_execute_external_script

  @language
= N'R'

,
@script = N'

# Load library

library(xgboost)

# Split into test and
training set

train <- InputData[1:500,]

test <- InputData[501:1000,]

# Log10 transform vars

train[,2:ncol(train)]
<- log10(train[,2:ncol(train)])

test[,2:ncol(test)]
<- log10(test[,2:ncol(test)])

test_vars<-colnames(train)[2:ncol(train)]

# Convert sets to XGB
matrices

mat.train<-xgb.DMatrix(data=data.matrix(train[test_vars]),label=(train[,c("Creditability")]))

mat.test<-xgb.DMatrix(data=data.matrix(test))

param <- list(objective           = "binary:logistic",

              booster             = "gbtree",

              eta                 = 0.002,

              max_depth           = 20,

              subsample           = 1,

              colsample_bytree    = 1,

              min_child_weight    = 4.285714             

)

# Cross validate

clf <- xgb.cv(params              =
param,

              data                = mat.train,

              nrounds             = 500,

              verbose             = 1,

              maximize            = t,

              eval_metric="auc",

              nfold=3

)

# Get best model

clf_best <- xgboost(params              = param,

                    data                = mat.train,

                    nrounds             = 400,

                    verbose             = 1,

                    maximize            = FALSE,

                    eval_metric="auc")

# Get predictions

test$p <- round(predict(clf_best,mat.test))

OutputData <- as.data.frame(cbind(test$Creditability,test$p))

'

,@input_data_1 = N' SELECT * FROM inputset; '

,@input_data_1_name = N'InputData'

,@output_data_1_name = N'OutputData'

With this in mind, below you can see that out of the 157 clients of the test dataset who proved to not be credit worthy the algorithm correctly predicted 145. As such, our model has an accuracy of 92.4% when predicting non-credit worthy clients and 97.6% overall.

Credibility Rows Correct Incorrect CorrectP IncorrectP
0 157 145 12 92.36 7.64
1 343 343 0 100.00 0.00

 

While we haven’t dug into the validity of this model too much, the real purpose is to show the SQL and R integration, not the modelling.

Python and SQL Server 2017

Although I am not a Python expert, the underlying mechanisms of the Python implementation in SQL Server 2017 are fairly similar to the ones of R in SQL Server 2016.

Whereby, the most conventional application of Python with SQL Server is to execute Python scripts as normal, with SQL Server as a data source. However, Microsoft has also made it possible to embed Python code directly in SQL Server databases by including the code as a T-SQL stored procedure.

Here’s the kicker:

This allows Python code to be deployed in production along with the data it will be processing.

One of the main advantages of the Python integration in SQL Server, as mentioned earlier, could be identified as the elimination of data movement – it’s better to bring your computation to the data than the data to the computation.

Also, easy deployment is definitely one of the perks as you can embed deployed Python models in T-SQL scripts, and use them through simple stored procedure calls.

RevoScalePy and RevoScaleR

The RevoScalePy package is modelled after RevoScaleR from SQL Server R Services, and enables for enterprise-grade performance and scale in all data volume scenarios.

Finally, the rich extensibility offered along with the wide availability of the service (available in all editions of SQL Server 2017) will definitely help take the Python in SQL Server 2017 solution among the most widely recognised industry standards – scaling up data science.

Conclusion

Microsoft have successfully managed to combine a few very important data-related services, in an efficient and convenient solution, whose widespread acceptance and use by the data science community seems to be just a matter of time.

We now only have to wait and see what extra goodies they have to throw in, it can only get better from here on.

We'd love to hear your thoughts on Python & R SQL Server integration. Make sure to comment below.

Want to see more? Check out this guide...

Topics: SQL Server, Cloud

KNOWLEDGE IS POWERGet weekly email insights from RedPixie

Free GuideLearn how to choose the right IT partner with our free guide >
surface banner.jpg