European SQL 2016 Hosting - HostForLIFE.eu :: Warehousing JSON formatted data in SQL Server 2016

March 8, 2017 10:34 by author

In this article, I continue to review the exciting features available in SQL Server 2016. One such feature is the long awaited T-SQL support for JSON formatted data. In this article we take a look at how JSON support will impact data warehouse solutions.

Background

Since the advent of EXtensible Markup Language (XML) many modern web applications have focused on providing data that is both human-readable and machine-readable. From a relational database perspective, SQL Server kept up with these modern web applications by providing support for XML data in a form of an XML data type and several functions that could be used to parse, query and manipulate XML formatted data.

As a result of being supported in SQL Server, data warehouse solutions based off SQL Server were then able to source XML-based OLTP data into a data mart. To illustrate this point, let’s take a look at the XML representation of our fictitious Fruit Sales data shown in figure below.

To process this data in data warehouse, we would first have to convert it into relational format of rows and columns using T-SQL XML built-in functions such as the nodes() function.

The results of the above script are shown in figure below in a recognisable format for data warehouse.

Soon after XML became a dominant language for data interchange for many modern web applications, JavaScript Object Notation (JSON) was introduced as a lightweight data-interchange format that is more convenient for web applications to process than XML. Likewise most relational database vendors released newer versions of their database systems that included the support for JSON formatted data. Unfortunately, Microsoft SQL Server was not one of those vendors and up until SQL Server 2014, JSON data was not supported. Obviously this lack of support for JSON, created challenges for data warehouse environments that are based off SQL Server.

Although there were workarounds (i.e. using Json.Net) to addressing the lack of JSON support in SQL Server, there was always sense that these workarounds were inadequate, time-wasting, and were forcing data warehouse development teams to pick up a new skill (i.e. learn .Net). Fortunately, the release of SQL Server 2016 has ensured that development teams can throw away their JSON workarounds as JSON is supported in SQL Server 2016.

Parsing JSON Data into Data Warehouse

Similarly to XML support in SQL Server, SQL Server supports of JSON can be classified into two ways:

Converting Relational dataset into JSON format
Converting JSON dataset into relational format

However, for the purposes of this discussion we are focusing primarily on the second part – which is converting a JSON formatted data (retrieved from OLTP sources) into a relational format of rows and columns. To illustrate our discussion points we once again make use of the fictitious fruit sales dataset. This time around the fictitious dataset has been converted into a JSON format as shown below.

ISJSON function

As part of supporting JSON formatted data in other relational databases such as MySQL and PostgreSQL 9.2, there is a separate JSON data type that has been introduced by these vendors. Amongst other things, JSON data type conducts validation checks to ensure that values being stored are indeed of valid JSON format.

Unfortunately, SQL Server 2016 (and ORACLE 12c) do not have a special data type for storing JSON data instead a variable character (varchar/nvarchar) data type is used. Therefore, a recommended practice to dealing with JSON data in SQL Server 2016 is to firstly ensure that you are indeed dealing with a valid JSON data. The simplest way to do so is to use the ISJSON function. This is a built-in T-SQL function that returns 1 for a valid JSON dataset and 0 for invalids.

Image below shows us the implementation of ISJSON function whereby we validate our fictitious sample dataset.

OPENJSON function

Now that we have confirmed that we are working with a valid JSON dataset, the next step is to convert the data into a table format. Again, we have a built-in T-SQL function to do this in a form of OPENJSON. OPENJSON works similar to OPENXML in that it takes in an object and convert its data into rows and columns.

Figure below shows a complete T-SQL script for converting JSON object into rows and columns.

Once we execute the above script, we get relational output shown below.

Now that we have our relational dataset, we can process this data into data warehouse.

JSON_VALUE function

Prior to concluding our discussion of JSON in SQL Server 2016, it is worth mentioning that in addition to OPENJSON, you have other functions such as JSON_VALUE that could be used to query JSON data. However this function returns a scalar value which means that unlike the multiple rows and columns returned using OPENJSON, JSON_VALUE returns a single value as shown below.

If you the JSON object that you are querying doesn’t have multiple elements, than you don’t have to specify the row index (i.e. [0]) as shown below.

Conclusion

The long wait is finally over and with the release of SQL Server 2016, JSON is now supported. Similarly to XML, T-SQL support the conversion of JSON object to relational format as well the conversion of relational tables to a JSON object. This support is implemented via built-in T-SQL functions such as OPENJSON and JSON_VALUE. In spite of all the excitement with the support of JSON is SQL Server 2016, we still don’t have a JSON data type. The ISJSON function can then be used to validate JSON text.

Comments (0) | RSS comment feed

Comment RSS

European SQL 2016 Hosting - HostForLIFE.eu :: MSSQL Server Comes with JSON?

January 17, 2017 10:27 by author

Scott

With over a thousand votes on the Microsoft Connect site, JSON support is the most requested feature for SQL Server 2016. This month, Microsoft announced that the upcoming release of SQL Server 2016 will fulfill that request. Sort of.

Microsoft will certainly be touting this feature as an additional reason to upgrade when the time comes. Natively supporting JSON helps bridge the gap between the desirable aspects of a NoSQL database and a relational database like SQL Server. The trouble is that with this current implementation, you're not really gaining anything over what's currently available. It's nothing like the native XML data type that became available starting with SQL Server 2005.

The announcement starts off with a big caveat: they will be providing native JSON support, but not a native JSON type. In fact, storage of JSON data will happen the same way it happens today, in a NVARCHAR typed column. They list 3 hollow reasons for this:

Backward compatibility
Cross feature compatibility
Non Microsoft controlled JSON parsers on the client (C#)

Backward compatibility is weak, if you're already storing JSON data you wouldn't have a hard time moving into the JSON type. Cross feature compatibility means that they're not interested in implementing JSON in other SQL Server components, so instead everything that already works with NVARCHAR (aka everything) will still work (aka nothing's changed). The client side JSON parser point is an odd one and it leads into my next gripe of this feature implementation.

On the client side, such as in a C# application, it's already common to serialize data back and forth in JSON using a JSON parsing library. The most common, though maybe not the fastest, is the Newtonsoft JSON.Net library which comes packaged with the default templates in ASP.NET projects etc. The fact that there are other options out there doesn't really matter in my opinion. Sure some behave differently, but in the end, a JSON object has a specific syntax and any proper parser should be able to serialize and deserialize the output of any other parser.

What's baffling are the features that are being supported in this native JSON support.

You can now use FOR JSON to export the results of a query as JSON. This means that you can make a normal T-SQL query and ask for the result back as a JSON formatted result. OK, but I could have just serialized the result on the client side in literally one line of code. They use the example of returning the results directly from an OData request through a web service.

You can transform a JSON object to a relational table with OPENJSON. This means that within a T-SQL query, you can provide a JSON object as part of the query and each item in the JSON will be returned as a table row which can be used to query or insert records into a relational table. They use the example of loading a JSON document into rows of a database, but again, that can easily be done on the client site by deserializing the JSON Array first.

Finally they are providing some built in functions for processing JSON data in the database. ISJSON will check if a NVARCHAR column has JSON data and JSON_VALUE which provides some scalar selection ability similar to the dot notation of JavaScript, e.g.

$.property1.property2.array1[5].property3.array2[15].property4

That's basically the least they could do with this implementation. And as for indexing of JSON data, you're left with some basic support on the level of full-text indexing of any other NVARCHAR column.

So in the end, this "native JSON support" is basically nothing more that a few convenience function on top of a normal NVARCHAR column. It takes a familiar eye about 2 minutes to uncover the sad truth about this feature and the response has shown that Microsoft isn't fooling anybody with this JSON implementation. That's not to say there aren't great features coming in SQL Server 2016 (stretch database seems pretty cool), but JSON support isn't one of them.

Tags: europe sql 2016 hosting, cheap sql 2016 hosting europe, best sql 2016 hosting europe, reliable sql 2016 hosting europe, windows sql 2016 hosting europe
Categories:
Actions: E-mail | Kick it! | Permalink | comment

Comments (0) | RSS comment feed

Comment RSS

European SQL 2016 Hosting - HostForLIFE :: Dynamic Data Masking in SQL 2016. Is it Enough?

October 11, 2016 23:43 by author

Scott

Dynamic vs. Static Data Masking

When masking data, organizations prevent unauthorized users from viewing sensitive data and protect information for following regulatory needs. Data masking technology provides data security by replacing sensitive information with a non-sensitive content, but doing so in such a way that the copy of data that looks and acts like the original.

In this article, we talk about the different types of data masking and discuss how organizations can use data masking to protect sensitive data.

Masking data isn’t the same as a firewall

Most organizations have a fair amount of security around their most sensitive data in the production (live) databases. Access to databases is restricted in a variety of ways from authentication to firewalls.

Masking limits the duplication of sensitive data within development and testing environments by distributing substitute data sets for analysis. In other cases, masking will dynamically provide masked content if a user’s request for sensitive information is considered ‘risky’. Masking data is designed to fit within existing data management frameworks and mitigate risks to information without sacrificing its usefulness. Masking platforms tend to guard data, locate data, identify risks and protect as information moves in and out of the applications.

Data masking hides the actual data. There are a variety of different algorithms for masking, depending on the requirements.

Simple masking just turns characters to blank, so, for example, an e-mail address would appear as [email protected]

More complex masking understands values, so, for example, a real name like “David Patrick” would be transformed into a fake name (with the same gender characteristics), like “John Smith”

In some algorithms, values are scrambled, so, for example, a table of health conditions might appear with values of the health conditions, but not associated with the correct person for the particular salary

Most data masking tools will offer a variety of levels of masking that can be enabled in your network. Both static and dynamic data masking use these same masking methodologies.

Static data masking

Static data masking is used by most organizations when they create testing and development environments, and, in fact, is the only possible masking method when using outsourced contractors or developers in a separate location or separate company. In these cases, it’s necessary to duplicate the database. When doing so, it is crucial to use a static data masking tools. These tools ensure that all sensitive data is masked before sending it out of the organization.

Static data masking provides a basic level of data protection by creating an offline or testing database using a standard ETL procedure. This procedure replicates a production database, but substitute’s data that has been masked, in other words, the data fields are changed to data that’s not original or is not readable.

It’s important to be aware that static masking can provide a backdoor, especially when outsourced personnel is used for administration, development, or testing. To mask data, the data is extracted from the database, at least for inspection, to comprehend the data before masking. Theoretically, this could provide a backdoor for data breaches, though it is not one of the common methods of malicious data capture.

Also, it’s clear that the static database always lags behind the actual data. The static database can be updated periodically, for example on a daily or weekly basis. This is not a security risk, but it often has implications for a variety of tests and development issues.

Static data masking allows database administrators, quality assurance, and developers to work on a non-live system so that private data is not exposed.

In many cases, in fact, you’ll want a test database anyhow. You don’t want to be running live experiments on a production database, so for R&D and testing, it makes sense to have a test database. There’s nothing wrong with this scenario.

Is your database protected with static data masking?

The answer should be obvious from the image above. Your actual production database is, in fact, not protected in any way when it comes to concealing sensitive information. Anyone or any system that has access to the production database might also have access to sensitive information. For most organizations, the only protection under this scenario is provided by limiting authorization access to the production database.

Concerns about static data masking

With static data masking, most of the DBAs, programmers, and testers never actually get to touch the production database. All of their work is done on the dummy test database. This provides one level of protection and is necessary for many environments. However, it is not a complete solution because it does not protect authorized users from viewing and extracting unauthorized information. The following concerns should be noted when using static database solutions.

Static solutions actually require extraction of all the data before it is masked, that is, it actually guarantees the data gets out of the database in unmasked form. One of the most disturbing facts about static data masking is the standard ETL (extract, transform, and load) approach. In other words, the database information was extracted as-is from the database, and only afterward transformed. You have to hope or trust that the masking solution successfully deleted the real data, and that the static masking solution is working on a secure platform that was not compromised.

The live database is not protected from those who do have permissions to access the database. There are always some administrators, QA, developers, and others with access to the actual live database. This personnel can access actual data records, which are not masked.

For organizations where a test database is not necessary for other purposes, it is wasteful to have a full test database that is a copy of the full production database, minus identifying information. The cost is in the hardware and maintenance of the second system.

Activities have to be performed twice: once on the test system and then implemented on the live system. There’s no guarantee that it will work on the production system, and then the developers or DBAs who need to debug the system will be either debugging on the testing system, or they will be granted permissions that allow them to see the actual live data.

Dynamic data masking: security for live systems

Dynamic data masking is designed to secure data in real time for live production and non-live systems. Dynamic data masking masks all sensitive data as it is accessed, in real time and the sensitive information never leaves the database. When a DBA or other authorized personal views actual data in the production database, data is masked or garbled, so the real data is never exposed. This way, under no circumstances, is anyone exposed to private data through direct database access.

Using a reverse proxy, the dynamic masking tool investigates each query before it reaches the database server. If the query involves any sensitive data, the data is masked on the database server before it reaches the application or the individual who is requesting the data. This way, the data is fully functional for development or testing purposes but is not displayed to unauthorized users.

Dynamic masking allows all authorized personnel to perform any type of action on the database without seeing real data. Of course, activities that are supposed to show data do show that data, but only to the authorized personnel using the correct access. When using advanced data masking rules, it’s possible to identify whether a particular field should be shown to a particular person, and under what circumstances. For example, someone may be able to access one hospital record at a time but only from a particular terminal or IP address, using a specific application and specific credentials. Accessing that same record using a direct database command would not work or would produce masked data.

Concerns with dynamic data masking

Dynamic data masking requires a reverse proxy, which means adding a component between the data query and response. Different solutions exist, some of which require a separate on-premises server, and others that are software-only based and can be installed on the database server.

Furthermore, when a company uses only dynamic data masking and does not have a production system, there are issues associated with performing functions on the live database.

The following concerns should be noted when using dynamic database masking solutions.

Response time for real-time database requests. In environments where milliseconds are of crucial importance, dynamic masking needs to be carefully tested to ensure that performance meets the organization standards. Even when a particular item of data is not masked, the proxy does inspect the incoming request.
Security of the proxy itself. Any type of software installed on the database server needs to be secure. And once a proxy is present, you have to enforce that the entire connections to the database are now passing through this SQL proxy. Bypassing this proxy in any way will result in access to the sensitive data without masking.
Performing of database development and testing on live systems can cause errors in the production system. In many cases, DBAs perform changes on a limited part of the system before deploying. However, best practices would require a separate database for development and testing.

Static vs. Dynamic Data Masking

The main reason to use data masking is to protect sensitive and confidential information from being breached and protected according to regulatory compliance requirements. At the same time, the data must stay in the same structure, otherwise, the testing will not show accurate results. The data needs to look real and perform exactly as data normally would in the production system. Some companies take real data for non-production environments but sometimes the data may have other uses. For example, in some organizations, when a call center personnel views customer data, the credit card data may be masked on screened.

Generally speaking, most organizations will need some combination of dynamic and static database masking. Even when static data masking is in place, almost any organization with sensitive information in the database should add dynamic data masking to protect live production systems. Organizations with minimal development and testing can rely solely on dynamic data masking, though they may find themselves providing some data with static masking to outside developers or other types of contractors.

Advantages of static data masking

Allows the development and testing without influencing live systems
Best practice for working with contractors and outsourced developers, DBAs, and testing teams
Provides a more in-depth policy of masking capabilities
Allows organizations to share the database with external companies

Advantages of dynamic data masking

The sensitive information never leaves the database!
No changes are required at the application or the database layer
Customized access per IP address, per user, or per application
No duplicate or off-line database required
Activities are performed on real data, saving time and providing real feedback to developers and quality assurance

Comments (0) | RSS comment feed

Comment RSS

European Windows 2012 Hosting BLOG

European SQL 2016 Hosting - HostForLIFE.eu :: Warehousing JSON formatted data in SQL Server 2016

Background

Parsing JSON Data into Data Warehouse

ISJSON function