Tuesday, June 21, 2016

Data Vault – Splitting Satellites based on Rate Of Change

A Data Vault consists of three basic entity types, each with its specific purpose: hubs, links, and satellites. A hub contains the business keys, a link the relations between business keys, and a satellite describes the data (business objects, relationships, or transactions) in those hubs and links. Satellites add context to hubs and links at a given time or over a time period. This context (consisting of descriptive attributes) changes over time and is tracked by the satellite. A satellite is attached to only one hub or link, and its records are identified by the parent's (hub or link) Sequence ID / Hash Key (Data Vault 2.0) and a timestamp (LoadDate).


Splitting Satellites

A satellite is attached to only one hub or link, but a hub or link can have multiple satellites. Satellites can be split based on different aspects:

Source System
  • The data of multiple source systems can be loaded in parallel.
  • The data types used by the different source systems may differ (no casting needed).
  • When adding new source systems, existing satellites don’t have to be changed.
  • History is kept per source system.
Classification or type of data

Based on the type of data, attributes are grouped accordingly (geographic attributes, status attributes, descriptive attributes).

Rate/frequency of Change
  • Based on the frequency of change the attributes are placed in different satellites.
  • Saves data storage.
Splitting satellites is not a rule you must follow, but it is something to consider when designing your Data Vault. Based on the aspects described above, you can split a satellite into multiple satellites. You could even create one satellite per attribute, which makes your model very agile, but then you are modelling closer to Anchor Modeling than to Data Vault. Commonly, between 3 and 5 satellites are used per hub or link.


Handling changes

Satellites store context for hubs and links historically, and the attributes "LoadDate" and "LoadEndDate" play an important role in this. When a new record is added to a satellite for a business key that already exists in the satellite, and one or more of the tracked attributes have changed, the existing satellite record is closed (end-dated) and the new one is added.

In the following example, every day, the source system “Source” provides new data (at 08:00 AM). This data is loaded into the Data Vault (hub, satellite) by a scheduled, daily ETL process (at 10:00 AM).


Day 1 (2016-05-10)
One record is provided by the source system.
  • The record with the business key “A” does not exist in the hub: the business key is added to the hub.
  • No satellite record exists for business key “A”: the satellite record is added to the satellite with a load end date of “9999-12-31”.
 
Day 2 (2016-05-11)
Two records are provided by the source system:
  • The record with the business key “A” already exists in the hub: nothing happens.
  • The record with the business key “B” does not exist in the hub: the business key is added to the hub.
  • The value of the attribute for business key "A" changed from "100" to "110", while the Description did not change: the existing record in the satellite is end-dated (with the load date of the new record) and a new record is added with the new value and a load end date in the future (9999-12-31).
  • No satellite record exists for business key “B”: the satellite record is added to the satellite with a load end date of “9999-12-31”.
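
The end-date/insert logic described above can be sketched in T-SQL. This is a minimal sketch, assuming a staging table [stg].[Source] and a satellite [dv].[HS_Source] with hypothetical columns [HubSequenceId], [LoadDate], [LoadEndDate], [Value], and [Description]:

-- 1. End-date the current satellite record when one of the tracked attributes has changed
--    (NULL handling is left out for brevity).
UPDATE  sat
SET     sat.[LoadEndDate] = stg.[LoadDate]
FROM    [dv].[HS_Source] AS sat
JOIN    [stg].[Source]   AS stg ON stg.[HubSequenceId] = sat.[HubSequenceId]
WHERE   sat.[LoadEndDate] = '9999-12-31'
AND     (stg.[Value] <> sat.[Value] OR stg.[Description] <> sat.[Description]);

-- 2. Insert a new record (for new business keys and for changed attributes)
--    with an open load end date of 9999-12-31.
INSERT INTO [dv].[HS_Source] ([HubSequenceId], [LoadDate], [LoadEndDate], [Value], [Description])
SELECT  stg.[HubSequenceId], stg.[LoadDate], '9999-12-31', stg.[Value], stg.[Description]
FROM    [stg].[Source] AS stg
WHERE   NOT EXISTS (SELECT 1
                    FROM   [dv].[HS_Source] AS sat
                    WHERE  sat.[HubSequenceId] = stg.[HubSequenceId]
                    AND    sat.[LoadEndDate]   = '9999-12-31');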


Rate of Change

Satellites can be split based on different aspects, and one of them is the rate of change. To split a satellite based on rate of change, you should determine the change frequency of all attributes: there are attributes that never change, that change sometimes, and that change very frequently.

When one attribute changes, a new satellite record is added that also contains the attributes that did not change. This wastes storage and impacts the performance of loading satellites, because it causes more I/O.

When satellites are initially designed/modelled, they are split by rate of change based on profiling information of the source data (only possible if representative source data is available). When the satellite is already part of the Data Vault, a monitoring process can validate the change frequency of attributes, which gives you valuable input for improving the satellites.

The rate of change of an attribute gives insight into how many records are eventually needed in a satellite to store all the changes. It can be determined in two ways:
  • Distinct Count
    - Group the attribute by business key.
    - Count the number of distinct attribute values.

    A distinct count of the attribute values gives a "rough" insight into the rate of change, as it counts the number of unique values and not how many times an attribute value has actually changed.
  • Actual Changes
    - Group the attribute by business key.
    - Order the attributes by modified date.
    - Determine for every attribute value if it changes in the next record.

    By ordering the attribute values by modified date, you can determine exactly when and how many times an attribute value changes. This option is the way to go to determine the rate of change exactly.
In the following example, source data is shown with three attributes, each with a particular change frequency. It also shows the results of the two ways to determine the rate of change.


Source’s Attributes
  • ValueA
    The value of the attribute "ValueA" has a low frequency of change. On the 4th day it changes from "A1" to "A2" and on the 5th day it changes back to "A1".
  • ValueB
    The value of the attribute "ValueB" has a high frequency of change, as it changes every day.
  • Description
    The value of the attribute “Description” never changes.
Measures
  • NumberOfRecords
    The number of records related to a business key.
  • ChangeRate
    A percentage between 0% and 100% that states how many times an attribute value changes for a business key. With n records, an attribute value can change a maximum of n-1 times. In this example there are 5 records and the value of "ValueB" changes 4 times, which is the maximum number of possible changes, resulting in a ChangeRate of 100%.
  • RecordsInSatellite
    The number of satellite records that are needed for the attribute.
Distinct Count Query
The query to determine the distinct count of the column “ColumnName” looks like this:
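
A minimal T-SQL sketch of this approach, assuming a source table [SourceTable] with a business key column [BusinessKey] and the attribute column [ColumnName]:

-- Distinct count per business key: a rough indication of the rate of change.
SELECT   [BusinessKey],
         COUNT(*)                     AS [NumberOfRecords],
         COUNT(DISTINCT [ColumnName]) AS [DistinctValues]
FROM     [SourceTable]
GROUP BY [BusinessKey]
HAVING   COUNT(*) > 1;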


The clause "HAVING COUNT(*) > 1" is added because it is useless to determine the rate of change of one record.

Actual Changes Query
The query to determine the actual changes of the column “ColumnName” looks like this:
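
A minimal T-SQL sketch of this approach, again with the hypothetical table [SourceTable], the business key column [BusinessKey], the attribute column [ColumnName], and a modified date column [ModifiedDate]:

-- Actual changes per business key, determined with the LEAD window function (SQL Server 2012+).
SELECT   [BusinessKey],
         COUNT(*)              AS [NumberOfRecords],
         SUM([HasChanged]) - 1 AS [NumberOfChanges] -- the last record in the window always counts as a change
FROM    (SELECT [BusinessKey],
                CASE WHEN LEAD([ColumnName]) OVER (PARTITION BY [BusinessKey]
                                                   ORDER BY [ModifiedDate]) = [ColumnName]
                     THEN 0 ELSE 1 END AS [HasChanged]
         FROM   [SourceTable]) AS changes
GROUP BY [BusinessKey]
HAVING   COUNT(*) > 1;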





Using windowing, the values of the attribute "ColumnName" are ordered by modified date, and the "LEAD" function checks whether the next value has changed. The number of changes is summed in the outer query. Because the last record in the window is always seen as a change (HasChanged = "1"), the sum is decreased by "1". The clause "HAVING COUNT(*) > 1" is added because it is useless to determine the rate of change of one record.

The Distinct Count approach results in a lower ChangeRate and RecordsInSatellite for the attribute "ValueA", as it counts the number of unique values without caring about the actual changes. That is why the Actual Changes approach gives a more accurate insight into the rate of change (it also takes into account the change from "A2" back to "A1").


Determine optimal split

To determine the optimal split of satellites based on rate of change, the following information is needed:
  • Group the data by business key(s) and determine how many times the values of the different columns change.
  • Determine the data types of every column.
  • Determine the data size of every column (based on data type or actual data).
The determination of the optimal split is explained in more detail with an example. The example consists of a table with information about the source table and a table containing the results, and it shows how the data is divided over three different satellites: one with no changes (HS_RateOfChangeNo), one with a low rate of change (HS_RateOfChangeLow), and one with a high rate of change (HS_RateOfChangeHigh).



Information
Contains information about the source table that’s investigated.
  • The source table consists of 10 columns (for this example the business key(s) and meta data columns have been hidden): ColumnA – ColumnJ.
  • The data types and average size of the values per column have been determined. The determination of the average size is only interesting for the columns with a variable length (for example “nvarchar”).
  • Grouped by business key(s), it has been determined how many times the values of the columns change (#Records). The columns are ordered by this number of records.
Results
Contains the results of the calculations.
  • The "Higher" column is calculated cumulatively: first, one column is moved to the satellite with the high rate of change, then the second column, then the next column, etc.
  • The "Lower" column is calculated the other way around: first, all columns are moved to the satellite with the high rate of change, then the first column is moved to the low rate of change satellite, then the second column, then the next, etc.
  • The values of "Higher" and "Lower" are summed and visualized with bars. The optimal split is the point where the "Sum" has the lowest value.
When you create one satellite, all attributes are stored with the frequency of the most frequently changing attribute (in this case "ColumnA"):



The following image shows how the three different satellites look when they are split optimally and less optimally (with related measures like the number of records and the needed storage).


Automation / generation

Determining the optimal split of satellites can easily be automated; the following components are needed:
  • Meta data
    Configuration settings that tell the calculator script for which tables, and how, to determine the optimal split of satellites (a minimal sketch of such a meta data table follows after this list).
  • Calculation results
    The calculation results should be stored in tables (historically). Part of those results are the scripts to create the split satellites (and to move data from existing satellites).
  • Calculator
    A script that uses the meta data to determine the optimal split of satellites and store the results in the dedicated calculation results tables.
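
As an illustration, the meta data component could be a simple configuration table. This is a minimal sketch with hypothetical names and columns:

-- Hypothetical configuration table that tells the calculator which tables to investigate.
CREATE TABLE [dbo].[SatelliteSplitConfiguration]
(
    [ConfigurationId]    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [SourceSchema]       SYSNAME       NOT NULL,            -- schema of the table to investigate
    [SourceTable]        SYSNAME       NOT NULL,            -- table to investigate
    [BusinessKeyColumns] NVARCHAR(512) NOT NULL,            -- business key column(s) to group by
    [ExcludedColumns]    NVARCHAR(512) NULL,                -- meta data / audit columns to ignore
    [IsActive]           BIT           NOT NULL DEFAULT (1) -- include in the next calculation run
);
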
Thoughts

  • The storage and I/O can be reduced by using a column store index on the satellites. Is splitting a satellite still useful when using a column store index?
  • Think of clear naming conventions for the split satellites, for example by using suffixes like "RCN" (Rate of Change No), "RCL" (Rate of Change Low), and "RCH" (Rate of Change High).
  • The meta data and calculation results can be stored in a Data Vault model (Metric Vault).
  • The splitting of satellites based on Rate of Change can be easily automated.
References

Monday, July 20, 2015

Data Warehouse Automation: Selecting the right tool

Introduction

The (Enterprise) Data Warehouse environments we use nowadays have become very big, complex, and mature. Many companies are experiencing that it takes a lot of time to add new requirements (time to value). By using a Data Warehouse Automation (DWA) tool, companies are able to automate the building of a Data Warehouse and reduce the costs: doing more in less time. To find out if this is possible, you should do a Proof of Concept (PoC) with the different Data Warehouse Automation tools that are available on the market to find out which best suits your needs. This article describes the steps you should take to select the right tool for your company.

Data Warehouse Automation tool

A Data Warehouse Automation tool is an integrated development environment (IDE) that automates the development of an (Enterprise) Data Warehouse using standards and best practices (patterns). It automates (repetitive) development tasks and simplifies the management and maintenance of a Data Warehouse. By using a Data Warehouse Automation tool you can drive the costs down (less effort, less resources), speed up development, and respond faster to new or changing requirements (time to value) in an agile way.

Requirements

Before starting the selection of a Data Warehouse Automation tool, you need to take some time to create a list of requirements that should be satisfied. Think of requirements in different categories (Development, Support, Documentation, Integration, Vendor, General), and make sure you can score them (MoSCoW) so you can easily compare tools.

Here is a small selection of possible requirements:
  • Using and forcing of custom naming conventions.
  • Version control (check-in, shelve, compare, merge, policies, comments/annotations).
  • Real-time monitoring of processes.
  • Query logs (history, trends, start/stop date, duration, job status, task/step status, record counts (select, update, insert, delete, error), (system/business) errors/exceptions).
  • Easy to learn (a developer is quickly up-and-running).
  • The User Interface is intuitive.
  • Ability to automate deployments and roll-backs.
  • Supports multiple ETL integration methods (full load, incremental load (Id, Date, Checksum, Hashes, Change Data Capture (CDC)), combined methods / multiple fields, custom parameters).
  • Supports multiple dimension types (Type 1 to Type N).
  • Supports multiple fact types (Accumulating, Snapshot, and Transactional).
  • Supports multiple sources and targets.
  • Ability to import meta data from other tools (PowerDesigner).
  • There is lineage from front to back on object and data level.
  • Ability to generate AS Multidimensional cubes and AS Tabular models.
     
Proof of Concept tool selection

There are several Data Warehouse Automation tools available on the market, like Kalido, Quipu (open source), Attunity Compose (formerly known as BI-Ready), TimeXtender, and WhereScape 3D/RED. It's very time-consuming and costly to do a Proof of Concept with all of these tools, so you should define a short list of criteria a Data Warehouse Automation tool has to satisfy, and eventually select one or two tools that you want to PoC.

Here is a selection of those criteria:
  • Data-driven and agile development 
  • Multi-user 
  • Multi-environment (DTAP)
  • Versioning
  • Generation of documentation
  • Comparison of objects (between versions and environments)
  • Maturity
  • Clear roadmap
  • Licensing
  • Customer references
  • Support for particular sources and targets

Proof of Concept

The goal of the Proof of Concept is to check which Data Warehouse Automation tool fits your organization best by going through a list of requirements. At a high level, the tool must be able to build a Data Warehouse with simple or complex ETL much faster than is done nowadays with the existing ETL tools (like SSIS or Informatica). If you have the possibility, you could take existing projects/solutions and build them again with a Data Warehouse Automation tool, so that you can easily compare the old and new way of building. To get a good feeling for the user experience and the way of working with a tool, I would advise having your own colleagues control the tool.

Fitting in current architecture

Most companies that start using a Data Warehouse Automation tool already have an existing BI environment where the new tool has to be fitted in. There are a couple of important points to consider here:
  • Does the tool require a separate server with particular hardware and software requirements?
  • Does the tool integrate with the currently used sources and targets, schedulers, ETL processes, reports for monitoring / maintenance purposes?
  • Is the tool able to reuse databases and tables (dimensions, facts) or does it have its own repository?
Moving forward

When you have chosen your Data Warehouse Automation tool, it's not just a matter of buying the tool and using it.

There are a couple things you need to take into account:
  • Are you going to migrate all existing projects or are you only automating new projects?
    - Budget is needed to migrate existing projects.
    - Existing projects and new projects will co-exist for a while, which introduces extra complexity (integration, maintenance).
  • What kind of developer profiles are needed for working with the tool?
    - Less monkey work is needed.
    - More focus on modeling and business logic.
  • How does the way of working change (change management) and how do we handle that?
    - Clearly describe and present a plan to the developers, so it's clear what they can expect and how this influences their work (way of working).
    - There will always be resistance when introducing new tools (the developers are familiar with the currently used tools and have to step out of their comfort zone), which you need to be aware of. You won't be able to eliminate resistance, but you should minimize it.
    - You need to create commitment and train the developers.

Conclusion

Using a Data Warehouse Automation tool is not just buying a tool and generating your Data Warehouse; you need to take a lot of things into account: does it satisfy your requirements, does it fit in the current architecture, what kind of people are needed, does it fit your budget, how does the way of working change, how is it supported in case of issues, etc. When selecting a tool, always make sure that you have requirements and figure out which tool satisfies them best by doing a Proof of Concept.

Tuesday, May 19, 2015

Data Quality Services (DQS) - SSIS Cleansing Task – Reference to DQS Knowledge Base identifier

You can use the cleansing functionality of Data Quality Services (DQS) in SSIS by using the "DQS Cleansing" task. This task uses the selected DQS Knowledge Base to cleanse data. Under the hood, the task points to the numerical identifier of the selected DQS Knowledge Base. In a DTAP (Development, Test, Acceptance, and Production) environment this identifier can differ per environment, and when a DQS Knowledge Base is redeployed / recreated it gets a new identifier. When this identifier is not the same as the hard-coded one in the SSIS packages, the package will fail. This article gives you more background information about this issue and how to work around it.

Storage of DQS Knowledge Base

When you create a new DQS Knowledge Base or import an existing one, a new numerical identifier will be created specifically for this knowledge base.
This identifier is stored in the table [DQS_MAIN].[dbo].[A_KNOWLEDGEBASE], where project-related records are also stored. The following table shows a couple of important columns of this table:



You can execute the following T-SQL query to get all above mentioned values with their descriptions:

SELECT  groupDQS.[ID] AS [groupId],
        groupDQS.[NAME] AS [GroupName],
        memberDQS.[KEY] AS [MemberId],
        memberDQS.[VALUE] AS [MemberValue]
FROM           [DQS_MAIN].[dbo].[A_CODE_MEMBER] AS memberDQS
RIGHT OUTER JOIN [DQS_MAIN].[dbo].[A_CODE_GROUP] AS groupDQS ON groupDQS.[ID] = memberDQS.[GROUP_ID]
WHERE   groupDQS.[NAME] = ''


When you create a DQS Knowledge Base, a new numerical identifier will be created, which is also used to create a knowledge base specific schema (KnowledgeBaseXXXXXXX) that is used to group several knowledge base specific tables (B_ASSOCIATION_RULE, B_ASSOCIATION_RULE_CONSTRAINT, B_ASSOCIATION_RULE_CONSTRAINT_DOMAIN, B_COMPOSITE_DOMAIN_VALUE_RELATION, B_DATA_SERVICE_FIELD, B_DATA_SERVICE_RELATION, B_DATA_SERVICE_SCHEMA, B_DATA_SERVICE_SCHEMA_PROPERTY, B_DOMAIN_TERM_BASED_RELATION, B_DYNAMIC_OBJECTS, B_INDEX_LEXICON, B_INDEX_LEXICON_EXTENSION, B_INDEX_LEXICON_EXTENSION_RELATION, B_MATCHING_RULE, B_MATCHING_RULE_DOMAIN_ELEMENT, B_PATTERN, B_REFERENCE_DATA_MAPPING, B_REFERENCE_DATA_MAPPING_CHILD_DOMAINS, B_REFERENCE_DATA_MAPPING_PARAMETERS, B_RULE).

Import DQS Knowledge Base

A DQS Knowledge Base can be imported throughout the DTAP environment via the Data Quality Client or via the command line.

Export / import functionality in the Data Quality Client

An existing DQS Knowledge Base can be exported to a file (*.dqs) by selecting “Export Knowledge Base”. 


A new DQS Knowledge Base can be imported from a file (*.dqs) via the "New Knowledge Base" button on the main screen. You can't overwrite an existing knowledge base.

Export / import functionality via command line

The following command exports all DQS databases (it's not possible to export one specific DQS database) from the "MDM" instance to the file "MyKnowledgeBases.dqsb":
DQSInstaller -exportkbs c:\MyKnowledgeBases.dqsb -instance MDM
The following command imports all DQS databases (it's not possible to import one specific DQS database) from the file "MyKnowledgeBases.dqsb" into the "MDM" instance:
DQSInstaller -importkbs c:\MyKnowledgeBases.dqsb -instance MDM

SSIS Cleansing Task

The SSIS Cleansing Task references a DQS Knowledge Base by its identifier. The following image shows a piece of the XML code behind the SSIS package. It shows that the DQS Cleansing task refers to the hard-coded value "2000000" (strangely, the property is named "Knowledgebasename" while it contains an identifier value).




You can't deploy a working SSIS package throughout the DTAP environment, because Development may use a different identifier for the DQS Knowledge Base than Acceptance, which makes automated deployments impossible.

Work-around

Several work-arounds are mentioned on the internet, but none of them makes it possible to deploy your SSIS package (without any adjustments) throughout your DTAP environment. Follow these steps to make this possible:

  • Reset the auto-increment column [DQS_MAIN].[dbo].[A_KNOWLEDGEBASE].[ID] by executing the following T-SQL script:
    DBCC CHECKIDENT ('[DQS_MAIN].[dbo].[A_KNOWLEDGEBASE]', RESEED, 1999999);
    After executing this script, a newly created DQS Knowledge Base will get the identifier 2000000.
  • Create a new DQS Knowledge Base by importing one via the export/import functionality in the Data Quality client or via the command line. If you need to overwrite an existing one, you first need to delete that one via the Data Quality client tool.
  • Deploy your SSIS package which points to the DQS Knowledge Base with the identifier 2000000.
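
To check which identifier an (imported) knowledge base actually received, you can query the knowledge base table directly. This is a minimal sketch; the [NAME] column is an assumption, so verify the actual column names in your DQS_MAIN database:

-- Check the identifier of the knowledge bases ([NAME] is assumed).
SELECT   [ID], [NAME]
FROM     [DQS_MAIN].[dbo].[A_KNOWLEDGEBASE]
ORDER BY [ID];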
Conclusion

This issue with the hard-coded identifier is really a bug in the SSIS Cleansing task and should be solved by Microsoft. If the identifier were replaced by the DQS Knowledge Base name, no work-arounds would be needed and it would be possible to deploy your SSIS packages throughout the DTAP environment without any adjustments.

References

Querying Active Directory Using T-SQL

When you want to use Active Directory information, like users and user groups, in your BI environment to implement row-level security (based on department), you need to query Active Directory. Querying Active Directory can easily be done using T-SQL, which is explained in this article.

Linked Server

To query Active Directory from T-SQL, you need to set up a linked server that uses the ADsDSOObject (Microsoft OLE DB Provider for Microsoft Directory Services) provider. The following T-SQL code creates the linked server "ADSI" (Provider = OLE DB Provider for Microsoft Directory Services, Data Source = machielsesoftware.local:389, Product = Active Directory Services). Port 389 is used for LDAP traffic.

EXEC master.dbo.sp_addlinkedserver @server = N'ADSI', @srvproduct=N'Active Directory Services', @provider=N'ADsDSOObject', @datasrc=N'machielsesoftware.local:389'


T-SQL query

The following T-SQL query creates a dynamic query that gets the following user information:
  • UserId
  • Username
  • Department
  • Country
  • Notes
  • LastLogon

-- Variables
DECLARE @finalQuery AS VARCHAR(2048);
DECLARE @queryActiveDirectory AS VARCHAR(1024);
DECLARE @activeDirectory AS VARCHAR(64);

SET @activeDirectory = 'LDAP://machielsesoftware.local:389';

-- Active Directory query
SET @queryActiveDirectory =
   '''SELECT sAMAccountName, name, department, co, notes, lastlogon ' +
   '  FROM   ''''' + @activeDirectory + ''''' ' +
   '  WHERE  objectCategory=''''person'''' AND ' +
   '         objectClass = ''''user''''''';

-- Final query
SET @finalQuery =
  'SELECT sAMAccountName AS [UserId], ' +
  '       name AS [Username], ' +
  '       department AS [Department], ' +
  '       co AS [Country], ' +
  '       notes AS [Notes], ' +
  '       CASE WHEN ISNULL(CAST(lastlogon AS BIGINT), 0) > 0 ' +
  '            THEN DATEADD(MINUTE, (ISNULL(CAST(lastlogon AS BIGINT), 0) / 600000000) -
                    157258080 + DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), 0) ' +
  '            ELSE 0 END AS [LastLogon] ' +
  'FROM OPENQUERY( ADSI, ' + @queryActiveDirectory + ') ' +
  'ORDER BY [UserId];';

EXEC (@finalQuery);

Converting Active Directory Date/Time values

Date/time values (and also other attributes) from Active Directory are stored as large numbers (data type = Integer8, 8 bytes / 64 bits). To convert a date/time value from Active Directory to a readable date/time value, you need to know how it is stored so you can convert it.
Here are some facts and calculations:
  • The date/time value in Active Directory represents the number of 100-nanosecond intervals that have elapsed since 1601-01-01.
  • 1 nanosecond = 10^-9 (1 / 1,000,000,000) second.
  • 100 nanoseconds = 10^-7 (1 / 10,000,000) second.
  • 1 second = 10^7 x 100 nanoseconds.
  • 1 minute = 60 x 10^7 = 600,000,000 intervals of 100 nanoseconds.
  • The minimum date value in SQL Server equals 1900-01-01.
  • The difference in minutes between 1601-01-01 and 1900-01-01 = 157,258,080 minutes.
  • You need to correct the date/time value for your local time zone by determining the difference between Greenwich Mean Time (GMT, independent of time zones) and your local time zone. The GMT date/time can be fetched by calling the function GETUTCDATE().

Conversion calculation:
  • MinutesAfterMinimumSqlDate = (DateTimeActiveDirectory / 600,000,000) - 157,258,080
  • MinutesBetweenGmtAndLocal = DATEDIFF(MINUTE, GETUTCDATE(), GETDATE())
  • Result = 1900-01-01 00:00 + (MinutesAfterMinimumSqlDate + MinutesBetweenGmtAndLocal)
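
As a standalone illustration of this calculation (a sketch with a hypothetical raw lastlogon value in a variable):

-- Convert a raw Active Directory lastlogon value (100-nanosecond intervals since 1601-01-01)
-- to a local date/time; @lastlogon holds a hypothetical example value.
DECLARE @lastlogon BIGINT = 131085601200000000;

SELECT DATEADD(MINUTE,
               (@lastlogon / 600000000) - 157258080
               + DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()),
               0) AS [LastLogonLocal];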

Restrictions

The provider that’s used to query Active Directory has a couple of restrictions, which you need to take into account:
  • A query via the provider returns a maximum of 1000 objects.
  • It’s not possible to get the contents of the attribute “Description” as it is a multi-value field. 

Conclusion

It's pretty easy to query Active Directory using T-SQL, but the provider has some restrictions which you need to take into account. If you want to query Active Directory without the restrictions mentioned above, then you should use the classes in the System.DirectoryServices namespace of the .NET Framework, for example in an SSIS package or a SQL Server CLR User-Defined Function (UDF).

References

Monday, August 11, 2014

Hybrid Query (Combining T-SQL and MDX)

This article describes how to query data from an Analysis Services Multidimensional cube or Tabular model from within T-SQL code. Believe it or not, there are situations where you have to do this!

Context

To get data from a Multidimensional cube or a Tabular model you need to talk "MDX" (or "DAX"). By using OPENQUERY, you are able to query them via a linked server from within your T-SQL code.



Hybrid query

The T-SQL code below shows an example of how you can query a Multidimensional cube by using dynamic T-SQL:
  • To query a Multidimensional cube or Tabular model you need to create a linked server first (in the code below this linked server is called "MULTIDIMENSIONAL").
  • The MDX query returns data where the column names contain square brackets. You can rename those column names by surrounding them with double quotes and giving them an alias ("[Measures].[Turnover]" AS [Turnover]).
  • You can load the data into a temporary table (in this example a table variable is created) so that you can use the results further in your code/process (see the sketch after this list).
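
A simplified (non-dynamic) sketch of such a hybrid query, assuming the linked server "MULTIDIMENSIONAL", a cube named [SalesCube], and a [Product].[Category] hierarchy (the cube and hierarchy names are hypothetical):

-- Query the cube via the linked server and store the renamed columns in a table variable.
DECLARE @results TABLE ([Category] NVARCHAR(255), [Turnover] NVARCHAR(255));

INSERT INTO @results ([Category], [Turnover])
SELECT  CAST("[Product].[Category].[Category].[MEMBER_CAPTION]" AS NVARCHAR(255)) AS [Category],
        CAST("[Measures].[Turnover]" AS NVARCHAR(255))                            AS [Turnover] -- cast further to a numeric type if needed
FROM    OPENQUERY(MULTIDIMENSIONAL,
        'SELECT [Measures].[Turnover] ON COLUMNS,
                [Product].[Category].[Category].MEMBERS ON ROWS
         FROM   [SalesCube]');

SELECT * FROM @results;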


References


Saturday, March 22, 2014

PASS Nederland (2014-03-18)

This week I attended the PASS evening at the HP office in Amstelveen, where Marco Russo was invited to give two sessions, "DAX Patterns" and "DAX from the Field: Real-World Case Studies", which I'd like to share.

DAX Patterns

When building a tabular model, people face problems for which a solution already exists. Marco Russo and his colleagues created a lot of solutions in Tabular and Power Pivot and transformed those into patterns (a general reusable solution to a commonly occurring problem) so they can easily be reused in other projects where the tabular model or Power Pivot is used. These patterns are collected on a special website: http://www.daxpatterns.com/. A couple of examples he showcased were "Cumulative Balance (without using snapshots)", "Custom calendar", and "Actual Vs. Budgets".

PDW (Parallel Data Warehouse)
 
Between the sessions of Marco Russo, a short session was presented about HP's PDW appliance with SQL Server 2012 and how it is being used at a big customer in Africa. This client has a Data Warehouse of approximately 3 TB, had difficulties finishing the ETL process on time, and also had big performance issues when multiple people were executing ad hoc queries. By using HP's PDW appliance all those issues were solved:
  • The Data Warehouse runs on the appliance.
  • The performance is great (you have to experience it to believe it).
  • The appliance has been made highly available (they simply bought two appliances and put them in different locations).
People are saying that an appliance is a cube killer. Is that so? Well, not really...
  • Analysis Services works smoothly with SQL Server 2012 PDW and the combination with a Multidimensional cube with ROLAP is perfect as the query response from PDW is superfast.
  • When a lot of ad hoc queries are executed, an appliance is really helpful. When queries become more widely used throughout the company, it's recommended to build them into a cube or tabular model.
DAX from the Field: Real-World Case Studies

Building a tabular model is easy but when it comes to performance there are a couple of things you have to take into account.
  • Reduce the number of distinct values for columns (less memory needed, increases scan speeds).
  • You can check which part of the tabular model needs attention by using the Excel workbook that is provided on this website: http://www.powerpivotblog.nl/what-is-using-all-that-memory-on-my-analysis-server-instance/
  • How do you determine the hardware you need? The best road to follow is to first develop your model and fill it incrementally (in increments of 10%) with real production data. That way you can make a good prediction of the memory usage. The problem is that it normally takes a while to order hardware, so it's not always practical to wait until a tabular model is developed and filled.
  • When choosing hardware you need to take into account the following:
    - CPU (the fastest you can buy, largest L2 cache, lowest number of cores (less than 8)).
    - Memory (the fastest you can buy).
    - Disk / other hardware (don't waste your money here).
It seems that Intel's latest Xeon processor has a large L2 cache which is shared by all cores. You can switch off cores so that more cache is available for the remaining cores.

Thursday, July 18, 2013

Generating and printing SSRS reports automatically via SSIS 2012

Introduction

If you want to schedule the generation and printing of SSRS reports automatically, where each report uses different parameter values and data, you can use the Data Driven Subscription functionality in Reporting Services. But if you don't have the Enterprise or Business Intelligence edition of SQL Server 2012, you don't have this functionality. As an alternative, you can use a combination of SSIS and SSRS to implement such a solution, which is described in this article.

Scenario

The company "ARGP Office" is a distributor of office equipment, which is sold by its customers (shops) all over the world. All these customers receive a quarterly letter (and specification) about their turnover (and a bonus). Whether they receive a letter and a specification or only a letter depends on the customer and the quarter (Q1, Q2, Q3, or Q4). Also, the contract between the company and the customer defines whether they receive a bonus quarterly, yearly, or not at all.

The letters and specifications are generated and printed at the end of every quarter, which is done by an automatically scheduled ETL process. This process extracts data from different source systems, transforms the data in the right format, and loads it in a reporting database. Configuration settings and the loaded data from the reporting database are used to generate and eventually print the letters and specifications, which are sent to the customers by mail.


Solution


In the solution I am going to explain, I will focus on the report generation and printing part of the ETL process. The configuration settings stored in the reporting database tell the ETL process which reports (letters and specifications) have to be generated and for which customers. Using a .NET assembly, the letters and specifications are generated as PDF files. This assembly uses the web service methods of Reporting Services to generate those files from the defined reports. When the PDF files have been generated, they are printed by using a batch file that calls Ghostscript.

 
  • Generating reports
    The actual generation of reports is done by an SSIS package that calls the .NET assembly's methods from within a Script Task. The SSIS package passes the parameters to those methods so they can generate the right letters and specifications by making use of the web services of Reporting Services.

    Adding the .NET assembly to the GAC
    To be able to call the methods of a .NET assembly (ReportGenerator.dll) from a Script Task in SSIS you have to install this assembly in the Global Assembly Cache (GAC) on the production machine. The easiest way to install the assembly in the GAC is by copying it to the directory c:\windows\assembly. When the assembly has been copied you have to reference it within the Script Task by adding a reference (Project > Add Reference…) and importing the namespace in your script (Imports ReportGenerator).

    The following code snippet (VB.NET) shows how the “Generate” method of the Generator object is called with parameters. Important parameters here, are the url of the report to be generated, the parameters that have to be passed to the report, the location where the report has to be saved and the format of the generated file (in this case “PDF”).



    .NET assembly configuration settings
    When a .NET assembly is called by an SSIS package and uses configuration settings, these configuration settings have to be stored in the configuration file. When running a package in debug mode (BIDS / SSDT), it is executed by "DtsDebugHost.exe", which has the related configuration file "DtsDebugHost.exe.config" in which you can place the configuration settings to be used in debug. When running a package in production, it is executed by "DTExec.exe", which has the related configuration file "DTExec.exe.config" in which you can place the configuration settings to be used in production. These files can be found in the directory C:\Program Files\Microsoft SQL Server\90\DTS\Binn.

    Calling webservice methods
    The Generate method of the Generator class calls several methods of the SSRS web services to load a report with the right parameters and render it to a file. To make use of the web services you need to add two web references to your project:
    - http://hostname/reportserver/reportexecution2005.asmx
    - http://hostname/reportserver/reportservice2010.asmx

    The following code snippet (C#.NET) shows how to use the LoadReport() method to load the report, SetExecutionParameters() to set the parameters of the loaded report, and Render() to render the loaded report to a file. For the simplicity of the example, the declaration of variables and error handling is not shown.

  • Printing reports
    The printing of the generated reports is done by a batch script that calls Ghostscript, which is an open source tool that can be used for printing purposes.

    The following snippet of code (Windows command batch file) takes two parameters as input (filename and printer) and calls Ghostscript to print the generated PDF report.



    The references section of this article shows a link to the website where you can download Ghostscript.
Conclusion

With knowledge of SSIS and .NET, it's easy to implement a Data Driven Subscription solution in case you don't have the expensive Enterprise or Business Intelligence editions of SQL Server.

References