maandag 11 augustus 2014

Hybrid Query (Combining T-SQL and MDX)

This article describes how to query data from a Analysis Services Multidimensional cube or Tabular model from within T-SQL code. Believe it or not, there are situations where you have to do this!


To get data from a Multidimensional cube or a Tabular model you need to talk "MDX" (or "DAX"). By using OPENQUERY, you are able to query them via a linked server from within your T-SQL code.

Hybrid query

The T-SQL code below shows an example of how you can query a Multidimensional cube by using dynamic T-SQL:
  • To query a Multidimensional cube or Tabular model you need to create a linked server first (in the code below this linked server is called "MULTIDIMENSIONAL").
  • The MDX query returns data where the column names contain square brackets. You can rename those column names by surrounding them with double quotes and giving them an alias ("[Measures].[Turnover]" AS [Turnover]).
  • You can load the data in a temporary table (in this example a table variable is created) so that you can use the results further in your code/process.


zaterdag 22 maart 2014

PASS Nederland (2014-03-18)

This week I attended the PASS evening at the HP office in Amstelveen where they invited Marco Russo to give two sessions about "DAX Patterns" and "DAX from the Field: Real-World Case Studies", which I'd like to share.

DAX Patterns

When building a tabular model people face problems wherefore already a solution exist. Marco Russo and his colleagues created a lot of solutions in tabular and Power Pivot and transformed those into patterns (a general reusable solution to a commonly occurring problem) so they can be easily reused in other projects where the tabular model or Power Pivot is used. These patterns are collected and put on a special website: A couple of examples he showcased were "Cumulative Balance (without using snapshots)", "Custom calendar", and "Actual Vs. Budgets".

PDW (Parallel Data Warehouse)
Between the sessions of Marco Russo a short session has been presented about HP's PDW appliance with SQL Server 2012 and how it is being used at a big customer in Africa. This client has a Data Warehouse of approximately 3 TB, has difficulties with finishing the ETL process on time and also has big performance issues when multiple people were executing ad hoc queries. By using HP's PDW appliance all those issues were solved:
  • The Data Warehouse runs on the appliance.
  • The performance is great (you should experience it to believe).
  • The appliance has been made High Available (they just bought two appliances and put them on different locations). 
People are saying that an appliance is a cube-killer, is that so? Well, not really...
  • Analysis Services works smoothly with SQL Server 2012 PDW and the combination with a Multidimensional cube with ROLAP is perfect as the query response from PDW is superfast.
  • When a lot of ad hoc queries are executed, an appliance is really helpful. When queries become used more widely throughout the company it's recommended to build those in a cube or tabular model.
AX from the Field: Real-World Case Studies

Building a tabular model is easy but when it comes to performance there are a couple of things you have to take into account.
  • Reduce the number of distinct values for columns (less memory needed, increases scan speeds).
  • You can check which part of the tabular model needs attention by using the Excel workbook that is provided on this website:
  • How do you determine the hardware you need? The best road to follow is to first develop your model and filling it incrementally (increments of 10%) with real production data. By doing it this way you can create a good prediction of the memory usage. But the problem is that it normally takes a while to order hardware, so it’s not always handy to wait till a tabular model is developed and filled.
  • When choosing hardware you need to take into account the following:
    - CPU (the fastest you can buy, largest L2 cache, lowest number of cores (less than 8)
    - Memory (the fastest you can buy)
    - Disk / other hardware (don’t waste your money here)
It seems that the latest Xeon processor of iNTEL has a large L2 cache wich is shared by all cores. You can switch-off cores so more cache is available for the other cores.

donderdag 18 juli 2013

Generating and printing SSRS reports automatically via SSIS 2012


If you want to schedule the generation and printing of SSRS reports automatically where each report uses different parameter values and data, you can use the Data Driven Subscription functionality in Reporting Services. But, if you don’t have the Enterprise or Business Intelligence editions of SQL Server 2012, you don’t have this functionality. As an alternative you can use a combination of SSIS and SSRS to implement such a solution, which is described in this article.


The company “ARGP Office” is a distributor of office equipment, which are sold by customers (shops) all over the world. All these customers receive a quarterly letter (and specification) about their turnover (and a bonus). It depends on the customer, and the quarter’s number (Q1, Q2, Q3, or Q4)  if they receive a letter and a specification or only a letter. Also, the contract between the company and the customer defines if they receive a bonus quarterly or yearly or don’t receive a bonus at all.

The letters and specifications are generated and printed at the end of every quarter, which is done by an automatically scheduled ETL process. This process extracts data from different source systems, transforms the data in the right format, and loads it in a reporting database. Configuration settings and the loaded data from the reporting database are used to generate and eventually print the letters and specifications, which are sent to the customers by mail.


In the solution I am going to explain, I will focus on the report generation and printing part of the ETL process. The configuration settings stored in the reporting database tell the ETL process which reports have to be generated (letters and specifications) and for which customers. Using a .NET assembly the letters and specifications are generated as PDF files. This assembly uses the webservice methods of Reporting Services to generate those files from defined reports. When the PDF files have been generated they are printed by using a batch file that calls a Ghostscript.

  • Generating reports
    The actual generation of reports is done by a SSIS package that calls the .NET assembly’s methods from within a Script Task. The SSIS package passes the parameters to those methods so they are able to generate the right letters and specifications by making use of the web services of Reporting Services.

    Adding the .NET assembly to the GAC
    To be able to call the methods of a .NET assembly (ReportGenerator.dll) from a Script Task in SSIS you have to install this assembly in the Global Assembly Cache (GAC) on the production machine. The easiest way to install the assembly in the GAC is by copying it to the directory c:\windows\assembly. When the assembly has been copied you have to reference it within the Script Task by adding a reference (Project > Add Reference…) and importing the namespace in your script (Imports ReportGenerator).

    The following code snippet (VB.NET) shows how the “Generate” method of the Generator object is called with parameters. Important parameters here, are the url of the report to be generated, the parameters that have to be passed to the report, the location where the report has to be saved and the format of the generated file (in this case “PDF”).

    .NET assembly configuration settings
    When a .NET assembly is called by a SSIS packages and uses configuration settings, these configuration settings have to be stored in the configuration file. When running a package in debug mode (BIDS / SSDT) it’s executed by “DtsDebugHost.exe” that has the related configuration file “DtsDebugHost.exe.config” wherein you can place the configuration settings to be used in debug. When running a package in production, it’s executed by “DTExec.exe” that has the related configuration file “DTExec.exe.config” wherein  you can place the configuration settings to be used in production. These files can be found in the directory C:\Program Files\Microsoft SQL Server\90\DTS\Binn.

    Calling webservice methods
    The Generator method of the Generator class calls several methods of the SSRS web services to load a report with the right parameters and render it to a file. To make use of the web services you need to add two web references to your project:
    - http://hostname/reportserver/reportexecution2005.asmx
    - http://hostname/reportserver/reportservice2010.asmx

    The following code snippet (C#.NET) shows how to use the LoadReport() method to load the report, SetExecutionParameters() to set the parameters of the loaded report, and Render() to render the loaded report to a file. For the simplicity of the example, the declaration of variables and error handling is not shown.

  • Printing reports
    The printing of the generated reports is done by a batch script that calls a Ghostscript, which is an open source tool for printing purposes.

    The following snipped of code (Windows command batch file) takes to parameters as input (filename and printer) and calls Ghostscript to print the generated PDF report.

    The references section of this article shows a link to the website where you can download Ghostscript.

With knowledge of SSIS and .NET it’s easy to implement a Data Driven Subscription solution in the  case you don’t have the expensive Enterprise or Business Intelligence editions of SQL Server.



donderdag 22 november 2012

SQL Server 2012 - High Availability - AlwaysOn in Real-life


Nowadays, high availability (HA) and disaster recovery (DR) solutions are built with different techniques like failover clustering, log shipping, mirroring, and replication to achieve maximum application availability and data protection for SQL Server instances and databases within and across data centers. These techniques have their imitations and concerns which are solved by the integrated solution AlwaysOn in SQL Server 2012 (Enterprise Edition). This article describes how AlwaysOn can be used in a real-life scenario wherein an ETL proces can executed without influencing the availability and consistency of data, and performance. The assumption is made that you already have basic knowledge about AlwaysOn, so we won't go into detail too much.


The company "Always Online" has a website that communicates via a webservice, which is used as the interface to the data warehouse database containing huge amounts of data.

The current version of the website has difficulties with handling the increasing number of users and data volumes, which negatively influences the performance tremendously. Every night, the data warehouse is not available for three hours, because during this time window the data in the database is updated by an ETL process. This availability is not acceptable, because also customers  in different time zones access the website.

The database part of the solution has to be rebuilt to maximize the availability and minimize data loss in case of failover. To make this possible the company is going to use AlwaysOn in SQL Server 2012.


To solve the availability issues the following solution has been setup. The webservice connects with a listener that directs the communication to the primary or secondary server node. Each node has a SQL Server instance that contains a data warehouse database in an Availability Group and a snapshot of the data warehouse database that's not part of the Availability Group. The ETL process is running on the primary node and updates the data in the data warehouse database of the primary replica. The updated data of the primary replica is automatically synchronized with the secondary replica. The secondary replica offloads the primary replica and is used for reporting and maintenance tasks like backup.

AlwaysOn Availability Group

An Availability Group consists of availability replicas; one primary replica and up to four active, secondary replicas. All secondary replicas are considered "active" as they can all be used for read-only operations. Each replica can host multiple availability databases, which can be stored on different storage media. Data synchronization between the replica and its secondaries can be done in two modes, namely synchronous and a-synchronous.  With synchronous-commit-mode a transaction is committed on all replicas to guarantee transactional consistency (introduces transaction latency, no data loss, automatic or manual failover, usable for replicas that exist in the same data center). With asynchronous-commit-mode the primary replica is not waiting for a transaction to be committed on the secondary replicas (better performance, more chance of data loss, manual failover, well suited for disaster recovery solutions). Secondary replicas are available for read-only operations, and its connection mode setting tells how a client can connect to this secondary replicata (not, read-intent connections, all connections).


The following flow describes the steps to make it possible to have high availability when the ETL process runs:

  1. The ETL process is not running
    De webservice uses the listener to connect to the data warehouse database on the primary replica. It's completely transparant for the webservice if the primary node failed over to the secondary node. The connectionstring that's used by the webservice must connect to the virtual server name of the listener (don't use the node server names directly).

    The following connection string connects to the data warehouse database "MS_DWH" via the virtual server name "MS_LISTENER" of the listener.

    Data Source=MS_LISTENER;Initial Catalog=MS_DWH;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;
  2. Before the ETL process is started
    Before the ETL Process is started, snapshots of the data warehouse database are created on both replicas and the connection string in the web.config of the webservice is changed to look at the secondary replica on the second node. We are taking a risk here when the secondary replica fails; the choice being made here is alligned with the requirements.

    The following connection string connects to the snapshot of the data warehouse database "MS_SNAP_DWH" directly via the server name "MS_NODE2" of the second node (bypassing the listener).

    Data Source=MS_NODE2;Initial Catalog=MS_SNAP_DWH; Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;
  3. The ETL process is running
    While the data is updated in the data warehouse database on the primary replica, the data is synchronized immediately to the secondary replica. This synchronization has impact on the performance of the ETL process and will require more time to complete.
  4. The ETL process is completed
    When the ETL Process has completed, the connection string in the web.config of the webservice will be changed back to point to the listener and the snapshots will be removed.
  • Suspend / Resume
    In our solution snapshots are used to connect in the case the ETL process is running, because you don't want that users see inconsistent data while the data on the primary replica gets updated. A nicer solution would be to switch the webservice to the secondary replica via the listener (using ApplicationIntent=ReadOnly) before the ETL processs is started and suspending (ALTER DATABASE [MS_DWH] SET HADR SUSPEND) the data synchronization.  When the ETL process is done the webservice can be switched back to the primary replica (don't use ApplicationIntent=ReadOnly) and data movement can be resumed (ALTER DATABASE [MS_DWH] SET HADR RESUME). Unfortunately this solution is not working when data is suspended. Existing connections to one of the replicas will keep working, new connections will fail, showing the following exception message:
    Unable to access the 'MS_DWH' database because no online secondary replicas are enabled for read-only access. Check the availability group configuration to verify that at least one secondary replica is configured for read-only access. Wait for an enabled replica to come online, and retry your read-only operation.
  • Webservice & stateless
    A webservice is stateless, so for every call a new connection is made. Under the hood, an existing physical connection will be reused, but because you don't have control over the application pool recycling and connection pooling you must always assume that every connection is new.
  • ApplicationIntent
    When you connect to the listener without the use of the ApplicationIntent=ReadOnly parameter in the connectionstring, the connection will be directed to the primary replica, otherwise to the secondary replica.

  • .NET Framework version
    The parameter "ApplicationIntent" in a connectionstring is a new feature and supported by the SqlClient class in the .NET Framework 4.0 and higher.

AlwaysOn is a very useful, new feature of SQL Server 2012 to implement high availability and disaster recovery solutions. Very useful in the scenario described above, but it's a pity that it's not possible to connect to a replica in an availability group when data is being suspended. That's why snapshots had to be used, which makes the solution more difficult to implement and maintain.


maandag 29 oktober 2012

Change Data Capture: process a schema change (DML)


CDC (Change Data Capture) captures data changes caused by DML (Data Manipulation Language) statements like inserts, updates, and deletions in so called "change tables". These change tables have exactly the same schema as the source tables, enriched with several meta data columns. This blog article describes a way to handle the situation where the schema of a source table changes without loosing the captured change data in the change tables.


In the table "SoccerStatistics" the wins, draws, and losses per team in the Premier League are maintained on a weekly basis.

The table is enabled for CDC; all inserts, updates, and deletions are stored in the change table "SoccerStatistics_CT". Using SQL Server Integration Services (SSIS), only the changes are loaded into the Data Warehouse.

Halfway the season, a new requirement came up to add the extra column "GoalDifference" (goals for - goals against) without losing historical data.
The following steps have to be followed to add a column to a CDC enabled table:
  1. Backup change data
    It's only possible to process a schema change on a source table when it's disabled for CDC. When a table is disabled for CDC the related change table is deleted, including its data, which is an unwanted situation. That's why you first have to create a temporary table having exactly the same schema as the change table, extended with the extra column "StartLSN". Populate this temporary table with the data of the change table, and fill the column "StartLSN" with the value "start_lsn" that can be found in the table cdc.change_tables.

    The fastest way to create a temporary table with the same schema as the change table is by using the construction "SELECT * INTO", like this is done in the following T-SQL code:

    SELECT * INTO [dbo].[SoccerStatistics_Temp] FROM [cdc].[dbo_SoccerStatistics_CT]
    Getting and updating the "StartLsn"column can be done with the following T-SQL code:
    SELECT @StartLsn = [start_lsn] FROM [cdc].[change_tables] WHERE [capture_instance] = 'dbo_SoccerStatistics';
    UPDATE [dbo].[SoccerStatistics_Temp] SET [StartLSN] = sys.fn_varbintohexstr(@StartLsn)

  2. Disable CDC
    Now the change data has been saved to the temporary table, the table can be disabled for CDC.
  3. Process schema changeRun a T-SQL script to add the column "GoalDifference" to the source table and the temporary table.
  4. Enable CDC
    Enable the source table for CDC; the created change table will also contain the newly added column "GoalDifference".
  5. Restore change data
    Copy the change data, stored in the temporary table, to the change table.
  6. Update cdc.change_tables
    When a table is enabled for CDC, an entry is added to the table cdc.change_tables, which is used by the Table-Valued Functions (TVFs). The column "start_lsn" will point to the first logged DML in the change table. Because we populated the  change table in a different way, the "start_lsn" won't be in sync, and the Table-Valued Functions will return unexpected results or will fail. To solve this, the "start_lsn" has to be updated by the value that can be found in the column "StartLSN" in the temporary table. This will make sure that the Table-Valued Functions will work as expected.

    Updating the "start_lsn" value is shown in the following T-SQL code:
    DECLARE @startLsn AS BINARY(10);
    SELECT TOP 1 @startLsn = [StartLSN] FROM [dbo].[SoccerStatistics_Temp]
    UPDATE [cdc].[change_tables] SET [start_lsn] = @startLsn WHERE [capture_instance] = 'dbo_SoccerStatistics';

  7. Remove temporary table
    The change table has been populated with the historical data, so the temporary table can be removed.
Although some tweaking is needed it's fairly simple to process schema changes on CDC enabled tables without losing your historical data.


dinsdag 10 januari 2012

SQL Server: Change Data Capture


CDC (Change Data Capture) captures data changes caused by DML (Data Manipulation Language) statements like inserts, updates, and deletions and makes it possible to archive data changes without any additional programming. These captured changed are stored in change tables that mirror the column structure of the original, tracked, user-created source tables with additional columns that include extra metadata. With CDC no changes are required to the existing source table schemas. Several database objects are provided (stored procedures, functions, and jobs) to handle the captured changes and integrate with external systems like a DWH (Data Warehouse).


All data changes caused by inserts, updates, and deletions are saved in the transaction log file _log.ldf, which is used as a source for the change data capture process. This change data capture process is scheduled by the SQL Server Agent that executes the capture job cdc._capture. The capture job scans the transaction log every 5 seconds, processing a maximum of 1000 transactions (these numbers are based on default settings, which can be changed for performance optimalisation purposes). Take into account that there’s latency between the time that a change is committed to the source table and the time that the change appears within its associated change table as the capture process extracts data from the transaction log. Archiving changes causes huge amounts of data and therefore a cleanup job cdc._cleanup exists that removes change data from the change tables that is older than 3 days.

The CDC functionality is only supported by the Enterprise and Developer editions of SQL Server 2008 R2.


For using CDC functionality you have to go through the following steps.
  • Enable CDC for database
    Before changes to any individual table within a database can be tracked change data capture must be explicitly enabled for the database, which you can do by executing the following T-SQL script:

    EXEC sys.sp_cdc_enable_db

    After executing this T-SQL script the following database objects are created:

    Object Description
    Schema Cdc
    Tables Several tables are created using the schema “cdc”
    Stored procedures Several stored procedures are created prefixed by “sp_cdc_” and “sp_MScdc_”.

    You can check which databases are enabled for CDC by querying the table “sys.databases” in the master database.

    SELECT [name], database_id, is_cdc_enabled
    FROM sys.databases
    WHERE [name] = 'DatabaseName'

  • Enable CDC for table(s)
    To enable a table for CDC, you have to execute the following T-SQL script (this example enables CDC for the table “schemaName.TableName”):

    EXEC sys.sp_cdc_enable_table @source_schema = N'schemaName', @source_name = N'TableName', @role_name = 'RoleName'

    After executing this T-SQL script the following database objects are created:

    Object Description
    Table [cdc].[dbo_HcmVolume_CT]
    Table-valued functions (TVF) The following table-valued functions are created:
    - [cdc].[fn_cdc_get_all_changes_<schemaName>_<TableName>]
    - [cdc].[fn_cdc_get_net_changes_<schemaName>_<TableName>]
    Stored procedures The following stored procedures are created:
    - [cdc].[sp_batchinsert_<ChangeTableID>]
    - [cdc].[sp_insdel_<ChangeTableID>]
    - [cdc].[sp_upd_<ChangeTableID>]
    Jobs The following jobs are created and started immediately (only happens once when the first table is enabled for CDC):
    - [cdc].[DatabaseName_capture]
    - [cdc].[DatabaseName_HcmVolume]

    For every tracked table a change table is created and two related table-valued functions. These three database objects together are called a capture instance.

    You can check which tables are enabled for CDC and which columns are tracked by querying the tables “cdc.change_tables” and “cdc.captured_columns”.

Change table schema

Default a change table has the same column structure as the original, tracked table, enriched with five additional metadata columns:

Name Description
__$start_lsn Identifies the commit log sequence number (LSN) that was assigned to the change. You can use this value to query the table [cdc].[lsn_time_mapping] to get date/time information of the change. Changes committed in the same transaction share the same commit LSN value.
__$end_lsn NULL (not used)
__$seqval Used for ordering multiple changes in one transaction.
__$operation Contains a value which corresponds to the executed DML operation:
• 1 = Delete
• 2 = Insert
• 3 = Update (before)
• 4 = Update (after)
__$update_mask Contains a bit mask value that tells which field values have been changed (only useful for updates). In the case of inserts or deletions all bits are set, in the case of updates only those bits are set that correspond to changed columns.

For every insert or delete operation a single row is added to the change table. For an update operation two rows are added (one with the value before the update and one with the values after the update).

Querying for changes

As a user you might want to know how a value changes over time and what DML operation has been executed. CDC provides several useful stored procedures and table-valued functions (TVF) to get insight in that. The following workflow explains how this is done.
  1. Determine time windowDetermine the “from” date/time and the “to” date/time.
  2. Map date/time to LSNMap the “from” date/time and “to” date/time to log sequence numbers (LSN).
  3. Get changesGet changes that occurred in the defined time window. You can query for net / all changes:
    Net: the resultset reflects the final contents of records
    All: the resultset contains all changes executed on records

In the following T-SQL code example, data changes are queried in a particular time window (from now till 24 hours back):

DECLARE @lsnFrom BINARY(10);
SET @dateTimeFrom = GETDATE() – 1;
SET @dateTimeTo = GETDATE();
SELECT @lsnFrom = sys.fn_cdc_map_time_to_lsn('smallest greater than', @dateTimeFrom);
SELECT @lsnTo = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @dateTimeTo);
SELECT * FROM cdc.fn_cdc_get_all_changes_<schemaName>_<tableName>(@lsnFrom,@lsnTo,'all');


When implementing CDC take the following into account:
  • CDC does not track who actually executed the change on the data, so you have to add a metadata column with user information to the tracked tables.
  • When using CDC the transaction log will become very big (approximately 250% of the original size), because also the inserted records in the CDC change tables will be written to the transaction log.
  • When enabling a table for CDC, make sure you define a different file group name on a different physical disk.
  • It’s important to only CDC the needed columns: narrow tables perform much better than wide tables.
  • When latency is too high you can change the following parameters of the capture job:
    - Increase maxscans
    - Increase maxtrans
    - Decrease pollinginterval
  • There’s no limitation on the number of tracked columns.



Change data capture is a great out-of-the-box functionality to track changes without using any custom programming and touching the database model (except when you want to have information about who modified the data). With the available stored procedures and functions it's very easy to setup CDC and handle change data.

maandag 3 oktober 2011

Book review: Microsoft SQL Server 2008 R2 Master Data Services

Because of my experience and knowledge about Master Data Services I have been asked to review the book Microsoft SQL Server 2008 R2 Master Data Services. Looking at the authors’ names (Ian who wrote the preface is the founder of Master Data Services, and Jeremy is very active on the Master Data Services community) that worked on this book I got the feeling that this had to be an interesting book to read (even with my background), and while reading this book this was definitely confirmed.

The book guides you, step-by-step, through the world of Master Data Services and can be used as an installation guide with a lot of in-depth, background information. This book differentiates itself with other books about Master Data Services on especially the integration part with SharePoint (workflows) and BizTalk, which is described very thoroughly. Also the description of what’s going on under the hood and the API is very detailed, which proves the knowledge and experience of the authors.

I can really recommend this book to everyone who is working or planning to work with Master Data Services and also this book will still be valid (except some minor points, like the GUI and entity based staging) for the next release of Master Data Services (code name “Denali”).