However, despite their ability to stay on top of the relational database game with Microsoft SQL Server, they have lacked the tight integration of more advanced analytical functionality.
Which, for those who rely on this solution, is a big void.
As such, Microsoft have been working hard to enable its data scientists, developers, and database administrators to solve just that.
With all this considered, not only has Microsoft introduced SQL Server R Services with SQL Server 2016, they equally decided to enable the rest of the community to run on-the-spot analytics on cold, relational data by implementing Python in SQL Server 2017.
In doing so, they are enabling a large part of the data science community to use their tool of preference. Especially as the community is split between R and Python as the two main languages of preference.
R and Python in SQL Server most certainly is a combination of tools that data scientists have long been waiting for.
Whereby, you can combine the scalable database functionality of Microsoft SQL Server with R and Python’s plethora of advanced analytical capabilities and libraries - all to create systems that can perform demanding and computationally intensive analytical calculations.
Especially the latest addition of Python, which now allows data scientists to use an amazingly broad selection of Machine Learning and statistical third party libraries.
R and SQL Server Working in Tandem: The Benefits
With the combination of those powerful tools in their hands, data scientists can stop worrying about scale, data volume/transfer and focus on the analytical work.
Amongst others -
The benefits of building apps with R on SQL Server are:
Flexibility & agility
Performance & scale
For more information, see these great slides from Microsoft.
R and Microsoft SQL Server example: Extreme Gradient Boosting in
In this example, we chose to test the capabilities of SQL Server’s R Services with a well-known German credit scoring dataset that consists of client credit scoring information.
The dataset consists of clients who received a loan, with properties such as income, age, loan, etc., and whether they proved to be good for the credit.
So, first we built a model trained on that data set.
Then we predicted the client’s creditworthiness for the loan they are applying for on a test set. An interesting approach to that end is to use Extreme gradient boosting - the winner of many Kaggle competitions to date.
Doing this in R is an easy, straight-forward approach for someone familiar with Machine Learning:
We just have to take the data
Split it into a training and a testing set
Build an XGB model with the first set
Test it on the second set.
After installing the R Services and the XGB package in our SQL Server (packages have to be installed by an administrator), we were able to transfer the initial R code to SQL.
Quite noteworthy, you execute your R code by passing it as a parameter to the sp_execute_external_script procedure and inserting the result into a new table.
Below you can see the code for our example - note that it is SQL server that executes an R script contained in the string:
With this in mind, below you can see that out of the 157 clients of the test dataset who proved to not be credit worthy the algorithm correctly predicted 145. As such, our model has an accuracy of 92.4% when predicting non-credit worthy clients and 97.6% overall.
While we haven’t dug into the validity of this model too much, the real purpose is to show the SQL and R integration, not the modelling.
Python and SQL Server 2017
Although I am not a Python expert, the underlying mechanisms of the Python implementation in SQL Server 2017 are fairly similar to the ones of R in SQL Server 2016.
Whereby, the most conventional application of Python with SQL Server is to execute Python scripts as normal, with SQL Server as a data source. However, Microsoft has also made it possible to embed Python code directly in SQL Server databases by including the code as a T-SQL stored procedure.
Here’s the kicker:
This allows Python code to be deployed in production along with the data it will be processing.
One of the main advantages of the Python integration in SQL Server, as mentioned earlier, could be identified as the elimination of data movement – it’s better to bring your computation to the data than the data to the computation.
Also, easy deployment is definitely one of the perks as you can embed deployed Python models in T-SQL scripts, and use them through simple stored procedure calls.
RevoScalePy and RevoScaleR
The RevoScalePy package is modelled after RevoScaleR from SQL Server R Services, and enables for enterprise-grade performance and scale in all data volume scenarios.
Finally, the rich extensibility offered along with the wide availability of the service (available in all editions of SQL Server 2017) will definitely help take the Python in SQL Server 2017 solution among the most widely recognised industry standards – scaling up data science.
Microsoft have successfully managed to combine a few very important data-related services, in an efficient and convenient solution, whose widespread acceptance and use by the data science community seems to be just a matter of time.
We now only have to wait and see what extra goodies they have to throw in, it can only get better from here on.
We'd love to hear your thoughts on Python & R SQL Server integration. Make sure to comment below.