SSIS on a cluster

I figured this was worth a quick post: while many of SQL Server’s aspects are cluster aware, SSIS is not. Why? My guess is that there is not a pressing business need for this to be a cluster aware service – while many companies have vital ETL processes to help run company business, the need for failover in this scenario is not as pressing as in transactional systems.

With that said, I had a customer that was testing the failover of one of their main SQL clusters and suddenly noticed that a more than several jobs failed to run.  In helping on the issues, I noticed that they had SSIS steps in the job pointing to the cluster name for pulling out the packages form MSDB, and that, more importantly, someone had forgotten to install the SSIS service on the failover node.

Someone had configured SSIS to point to the cluster on the primary node, but must have forgotten to install the service and do the same config change on the failover node – easy enough fix; but also a good education opportunity for the client as they were not familiar with SSIS at all (transitioning to a new environment for them this past year, so most of this is new for them).

SSIS is not recommend to be configured as a cluster resource, but that does not mean it cannot be used on a SQL Server that is clustered – you just have to be aware of what you are doing and the consequences.  In my current scenario, we have SSIS configured to work on the cluster environment, but not as a cluster resource (the second link below will show you how).

Here are the references:

Just be sure to weigh the pros and cons as even the simple config change has its own consequences, in particular the section in the first link on disadvantages of SSIS as a cluster resource.

Tags: ,
Categories: SSIS

October 15, 2009 15:25 by Sid
E-mail | Permalink | Comments (5) | Comment RSSRSS comment feed

TPC Benchmark Status – April 2009

For anyone unfamiliar with the TPC (Transaction Processing Performance Council), they are an independent non-profit with the goal of creating benchmarks for database systems that are verifiable and reportable to the industry.

It is worth taking a look at their site and some of the research; if nothing else, you may get some ideas on reference hardware for your next implementation or use their posted data sets for benchmarking your own solution's performance (Microsoft did this when declaring themselves the holder of the ETL world record).

In this latest announcement, the TPC reiterated that they are developing an ETL standard; when this is finally published, it will be interesting to see how all of the individual platform vendors respond, most notably for me being the battle between SSIS and Informatica – a done deal in my mind. Informatica has a great toolset, but extensibility, usability, cost and speed in development and training all suffer greatly in my mind when compared to SSIS.

Other benchmarks: http://www.tpc.org/information/benchmarks.asp

Tags: , , , , ,
Categories: SSIS

April 21, 2009 06:11 by Sid
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

CodePlex project for Data Profiling

I made a post over a week ago talking about extending the Data Profiling task and in there I just cover the some of the basics on my methods for extending it and what goals I had in mind.

It actually sparked a few e-mails and so, due to interest, I am going to graduate my sloppy proof-of-concept code into a workable project. To put further rigor and transparency, I am going to put this project on CodePlex so others can contribute and enjoy.

Anyone interested in helping in the development, feel free to shoot me an e-mail. Per CodePlex rules, I've got 30 days to post code and publish to the project, so the clock is ticking!

My goal is that this will be a simple app where you plug in a connection string to a SQL Server and your output is a document with the profile results.

For further reading on this topic, John Welch, a Microsoft MVP, was kind enough to leave a comment about my prior post as he had the same issue with the Data Profiling task as I had. For anyone wanting to look at his work around to make the profiling task more flexible, please reference his post , "Using the Data Profiling Task to Profile All the Tables in a Database", here.

Tags: , , ,
Categories: SSIS

April 10, 2009 14:12 by Sid
E-mail | Permalink | Comments (10) | Comment RSSRSS comment feed

Extending the Data Profiling Task

 

SQL Server 2008 RTM has been out a while now and yet, I am just now making my way to using the Data Profiling Task, new to SSIS in this release. Time and client demands are always a factor governing when and what I get to look at, however I had read up on the features and was hoping that what I had read was not all the new task could do.

The new task is a good start, but unfortunately, it is a little more limited than I would have hoped. Fortunately though, with the SSIS APIs open to us, we can use the class object as we please.

So first -- what did I find limiting about the Data Profiling Task?

It appears that its intended design was to serve as a checkpoint during loads/transforms and that conditional logic can be applied based upon results; this is incredibly useful. My major concern is with activities that happen before the design phase – and quite directly, before any development begins at all.

Reference this great little tip from Kimball here on data profiling and its necessity according to his framework – profiling is tantamount to understanding the system that you are getting ready to work with in the course of a project.

To get to the heart of what I wanted to see the Data Profiling task do: I want to use it to generate an aggregate view of a given system so that I can view the results and learn more about the data within the database. Many of us have been given a database, custom built or extended far from the original software vendor's spec, with little documentation.

Even for those who are not preparing to move this data for warehousing or reporting, anything that provides a reference to what the data is, how it is stored, how unique, etc. is a boon for developers.

The Data Profiling task can deliver this data -- but only one table at a time?

There are no expression settings or variables that you can use to set which table and where the output goes. For two of the profiling tasks, I can see this as a bit of a problem – both the Candidate Key and Value Inclusion profile do not lend themselves to being easily generated dynamically via iterating over all the tables and views in a database.

But all the others are useful in this aggregated manner, so why, if I am looking at a client database that has 96 tables, must I build 96 tasks via Visual Studio and still not end up with an aggregate report (just 96 xml files)? I would like to be able to run this on the entire database, and start looking at the hotspots or data profile results that are the fat tails in the distribution chart.

My solution is to build the SSIS package programmatically, run it and then read and aggregate the results. I still need to clean up the solution a bit and finish some unit tests, so I will post it soon for everyone to download and enjoy.

My envisioned endgame for this project is for the output to be configurable for user desired thresholds and for that output to render in HTML or Word – still a little ways away from that, but I have the bulk of the SSIS work done. If anyone is interested in contributing, let me know and I will see about getting this on CodePlex.

Notes for anyone who wants to work with this project or attempt this on their own:

  • This effort makes use of the Microsoft.DataDebugger.DataProfiling class, which is not necessarily a supported API. An example of using this class is on the SSIS Team Blog here.
  • DLL references necessary for this project come from multiple places. For example, the Microsoft.SqlServer.DataProfilingTask is in ~\Program Files\Microsoft SQL Server\100\DTS\Tasks and Microsoft.SqlServer.DataProfiler is in ~\Program Files\Microsoft SQL Server\100\DTS\Binn
  • It is a good idea to read up on building a package programmatically on MSDN as there are class objects, such as TaskHost, which are not immediately intuitive as they are not seen when using the designer
Tags: , , , ,
Categories: .NET | SSIS

March 31, 2009 09:19 by Sid
E-mail | Permalink | Comments (4) | Comment RSSRSS comment feed

Dynamically remapping source to destination in SSIS

This all started off in trying to help out a fellow SSIS enthusiast, and now I have made a complete project as a sample….. catch the MSDN forum thread here.

There are many reasons why you would want to remap source columns to destination at runtime – I would not necessarily recommend this type of solution for some of your larger and more impactful data imports, but there are many cases where being able to modify the mapping metadata at runtime can be incredibly useful.

I have posted the project example in the files directory on my blog, you can download the Visual Studio 2008 project here – keep in mind that I did this in VS 2008 Team System Architecture Edition, so if you don't have this version, then you might have an issue with the unit test project attached.

This should not pose a problem for many though, as the only reason I threw the unit test in there was to provide a simple and lazy way to call into the class and perform the remapping.

This example demonstrates the following:

  • Load an existing SSIS package and indentify the source and destination components
  • Instantiate the source and destination components to read their column metadata
  • Remap a source column to a destination column
  • Save the modified SSIS package

When you open the SSISExamples solution, build it, and then either copy the DynamicRemap.dtsx file from the bin directory to your c:\ root or modify the PackageToLoad string in the CreatePackage class.

Once you call into the CreateAndRunPackage method, you will end up with a copy of the package in your root named PackageModified.dtsx.

Please feel free to e-mail me with any questions. I'll try to make a few improvements to the example when I have free time over the coming holiday, but for now, enjoy!

Tags: ,
Categories: .NET | SSIS

November 14, 2008 11:21 by Sid
E-mail | Permalink | Comments (1) | Comment RSSRSS comment feed

SSIS Caching and Performance

I've been trying to be a good citizen and post more in the SQL Server section of the MSDN forums, and so I came across a question in the SQL Server > SSIS section about start-up performance that made me realize not a lot of people are aware of how a package initially executes.

Since Kirk Haselden was my original source for some of this information so I will point to his post, Caching in The SSIS Service, for anyone to reference.

Tags: ,
Categories: SSIS

November 7, 2008 11:16 by Sid
E-mail | Permalink | Comments (9) | Comment RSSRSS comment feed

SSIS Adapter for SharePoint

 

Kevin Idzi of Microsoft was kind enough to leave a comment on my Extracting SharePoint Data using SSIS post informing me of Integration Services enhancements posted on CodePlex.

I had been delaying posting as I wanted to have a full sample showing it in use, but between work and buying a new house recently, I have not had the time to put together my virtual machine images for a complete demo.

The community sample has the following components available to download (in separate install packages):

  • XML Destination Pipeline component
  • Regular Expression Flat File for regular expression based flat file parsing
  • Delimited Flat File Parser, for files with rows that are missing column fields
  • Package Generation Sample
  • Hello World Sample for demonstrating using the UI API
  • SharePoint List Source and Destination – this component was written by Kevin
  • WMI Source Component

All of these components are available for download here.

The base requirements for installing these components are .NET 3.5 and SQL 2008 Integration Services.

Once you run the MSI, you will be able to select the two new adapters from the "Choose Toolbox Items" dialogue:

One of the nice features that Kevin included in this adapter is the ability to execute CAML queries against the SharePoint List:

 

 

If you do not know how to write CAML or are not good at it like I am, then use the U2U CAML Query Builder written by Patrick Tisseghem; I've got a quick tutorial here: CAML Query Builder.

If you have any feedback about using this SharePoint List Adapter with SSIS, please be sure to leave comments or suggestions on the CodePlex site for Kevin and his team.

Tags: ,
Categories: SharePoint | SSIS

September 19, 2008 13:07 by Sid
E-mail | Permalink | Comments (8) | Comment RSSRSS comment feed

Business Intelligence Metadata Whitepaper

It is hard enough to manage your data repository, modify it to meet new business needs, and add to that: keeping documentation and metadata up to date and relevant for your consumers.

Several friends of mine who are Informatica fans let me know on a repeated basis that the data manager in the Informatica Suite helps them out here; while functional; I was never fully impressed with the feature set. I showed my colleagues the SSIS API and they were blown away with how much untapped information sat inside SSIS.

The next question – why hasn't Microsoft put any of the metadata features directly into the SSIS tool?

While I cannot answer that, I do know that Microsoft has a very active user community and has supported this exact type of development. It is, in fact, nothing new and you can get the SQL Server 2005 Metadata white paper from Microsoft off of the download center, available from December 19, 2005 forward.

The trick though is the samples – for some reason, they seem to have an issue in keeping the sample projects online, so I have made it available from my site as well so insure that it is available to the general public. (See below)

What the MSI contains:

  • DependencyAnalyzer.exe – Tool that evaluates and loads into a database the lineage on SSIS packages, Analysis Services and SQL Server. All the source code for this program is provided.
  • DependencyViewer.exe – A tool that lets you graphically see the dependencies and lineage of objects in the lineage repository. Source code is provided for this program.
  • Data Source View – A DSV that connects to the lineage repository (SSIS META database) that can be used by Reporting Services.
  • Lineage Repository – A database called SSIS_META that can be used to house metadata from nearly any system.
  • Reports – Some standard reports for impact analysis studies. You will find two key reports out of the box with several sub-reports.
  • Report Model – A report model that you can use with Report Builder to allow end-users to create ad-hoc reports.
  • Integration Services Samples – A few sample packages to start auditing and viewing lineage on.

For anyone who reads from the Kimball line, the ideas presented in this paper look familiar, particularly if you have read the Microsoft Data Warehouse Toolkit book. In chapter 13 of this book, they discuss a metadata plan for a warehouse -- the website has extra materials: sample scripts that make use of the extended properties and reporting services reports to pull out metadata repository information.

I encourage reading both the whitepaper and the book, particularly the Kimball book as the Whitepaper assumes that you know the importance of metadata in your warehouse. The Kimball Microsoft Warehouse book is great in that it takes the concepts that they have developed over the years and applies it directly to the Microsoft stack.

SQL 2005 Metadata Samples (.msi, 311 kb)

Feel free to contact me for any questions. Thanks!

Tags: ,
Categories: Business Intelligence | SSIS

June 23, 2008 15:10 by Sid
E-mail | Permalink | Comments (7) | Comment RSSRSS comment feed

ETL, Auditing and Row Counts...

Yes, it has been too long since I last posted – My wife and I are in the middle of selling our house, so it has been a busy month doing little projects here and there. For anyone doing ETL projects or working on a BI project and in the ETL phase, auditing is a major part of your work. In this instance, I am specifically looking at physical statistics on a table in my database. Here is a simple schema for a table to track these statistics:

CREATE TABLE [dbo].[TrackTableProcessLog]

(       [TableProcessKey] [int] IDENTITY(1,1) NOT NULL,       [ProcessLogKey] [int] NOT NULL,       [Object] [varchar](250) NOT NULL,       [ObjectDatabase] [varchar](75) NOT NULL,       [CountExtractRows] [int] NULL,       [CountInsertRows] [int] NULL,       [CountUpdatedRows] [int] NULL,       [CountDeletedRows] [int] NULL,       [CountErrorRows] [int] NULL,       [CountInitialRows] [bigint] NULL,       [CountFinalRows] [bigint] NULL,       [RecordInsertDate] [datetime] NOT NULL CONSTRAINT [DF_TableProcessLog_RecordInsertDate]  DEFAULT (getdate()), CONSTRAINT [PK_TrackTableProcessLog] PRIMARY KEY CLUSTERED (       [TableProcessKey] ASC)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON)

As you can see, I am tracking counts on several different items for my table objects in my database. Most of the counts are derived directly from the ETL process, inside my SSIS package, however, two of the counts are not derived from the ETL action itself: CountInitialRows and CountFinalRows. Both of these are counts of the rows in my table object before and after the ETL operation – I need this to be exact as this table is used for auditing purposes and all the numbers need to line up properly (or not, so that I can be alerted of an issue with the ETL process).

Thankfully, with SQL 2005, there are several ways to readily grab these figures, even on very large tables, without resorting to “select count(*) from …”. I will illustrate the several methods available, and you can pick which one you like best. Just as a note, each method returns the same result – I have run large data operations against some target tables and test that each method returns the same answer, so thus far, any or a combination thereof should return the same result.

First, let’s look at the system stored procedure: sp_spaceused. In executing this stored procedure with a table object as the parameter, your output will be: name, rows, reserved, data, index_size and unused. In this case, I just need rows, so while the other information is nice, it is unnecessary. The procedure also performs some other steps, so if I am shooting for efficiency in my data movement processes, then I want to eliminate any extraneous querying.

If you break open sp_spaceused, you will find that it uses the dynamic management view sys.dm_db_partition_stats to deliver the row metric. If you want to use this dynamic management view, you will need to be sensitive to the fact that a single table object can live on multiple partitions – and your indexes can live on other partitions as well. To query this table, you will need to sum the row_count column for a given object_id and where the index_id value is less than 2; this is necessary to exclude counts from any non-clustered indexes related to a table object.

Row count data is also available in sys.sysindexes – to get this data you will filter on the id column = object_id of your target object, however this table is not necessarily purposed for this information, so while possible, I would recommend using one of the other methods.

The last way to conveniently access row count information is through sys.partitions table. The filter for this data is the same as the dynamic management view; you will need to sum across all rows for a specific object_id and include only rows where the index_id is less than 2.So all in all, here are several methods to use to retrieve row count information for your table object. Each method provides sub-second response times for my ETL operations, allowing me to conveniently gather statistics about my tables as I populate my warehouse.Please feel free to post or e-mail with any questions!

References:

Sys.partitions: http://msdn2.microsoft.com/en-us/library/ms175012.aspx

Sys.dm_dm_partition_stats: http://msdn2.microsoft.com/en-us/library/ms187737.aspx

Table and index organization: http://msdn2.microsoft.com/en-us/library/ms189051.aspx  

Tags: ,
Categories: Business Intelligence | SSIS

February 29, 2008 17:58 by Sid
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

Variable locking in SSIS

Recently a package of mine with a good running history suddenly started throwing errors. This package had been scheduled and running just fine for over a month – no changes to the code had been made. The error is consistently thus:

The script threw an exception: A deadlock was detected while trying to lock variables MyVariables for read access and variables MyVariables for read/write access. A lock cannot be acquired after 16 attempts. The locks timed out.Task MyTask failed.

Luckily, I am not the only one who has run into this issue, and it has now inspired a new practice for me in designing Script Tasks inside my SSIS packages. Previously in one of my posts, I had shown the following screenshot and described the standard method for using read only and read/write variables in a script task:

For many situations, this will probably work out just fine. However, my variable locking issue has evolved from my use of a Foreach container that holds my Script Task. There are other examples in my package store not involving a Foreach that also throw this error, however all of my packages that consistently throw this error all have in common the use of a Script Task inside a repeated process – for some reason, SSIS is not properly releasing the variables between runs of the step.

Daniel Read has written a detailed method on how to work around this; you may find his post here: http://www.developerdotstar.com/community/node/512/

Jamie Thomson also has a post over the same issue – he actually has some environmental information on what updates/patches could have been the culprit, but it is not entirely conclusive. Here is his post: http://blogs.conchango.com/jamiethomson/archive/2007/08/28/Beware-of-variable-usage-in-script-tasks.aspx
Tags:
Categories: SSIS

January 23, 2008 13:01 by Sid
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

Sponsors

Author

Tags

Recent Posts