SQL Server – Large data flow with SSIS

Oguzhan Yigit
Published by Oguzhan
Category : SQL Server / SSIS
12/01/2018

SSIS design consideration

This post is about SSIS consideration while designing a data migration control flow between databases. The main goal is to optimize our whole process. Above the business process we handling, we have to deal with different technical sides :

  • flow results : how to figure out what has been successfully done
  • flow execution time: although SSIS is designed for, large data flow execution time has to be in consideration

In this post, we will focus on execution time because i faced some scenario where i had to deal with big size of data which were resulting in too much execution time. Now let imagine a flow which loads 50 millions of sql rows from one database and has to store them into another : the process has to insert or update data in the target database. Depending on how the flow is designed, execution time can be completely different (from days to a few hours).

Inserting data through the SSIS package’s data flow

For our first case, let suppose we only have to insert data from our source database inside the target database. That is “slighlty” the easiest part of our flow as SSIS provides us ready to use tools. To achieve that, first, add a data flow task which contains an OLE DB Source. Configure your SSIS connection and choose a “table or view” access mode. From there, i always specify only the columns which i need in my process. Use the column tab of your source to do aswell :

SSIS OLE DB Source column selection

Now your source component is fully configured. Then add a simple OLE DB Destination component to define your target process. At this step, you should consider following options :

SSIS OLE DB Source configuration

 

  • Data access mode : always use fast load if you are able to as it will manage all your queries with a bulk transaction
  • Check constraints : if your target table has foreign keys and your sure that you won’t get constraints exception, disable the option as it will save you processing time
  • Maximum insert commit size : there is no known 100% rule with this parameter as it mostly depend from your data and your systems. However, this parameter will tell SSIS to limit bach size while committing data. Decreasing that limit is supposed to have a performance loose because it will end with more commits. However, depending on your target system, it will also reduce stress on transaction logs either temp db and can result with improvements.

These are the minimum properties available in SSIS you should worry about while managing a simple insert process.

Of course, you should mapping your source and target columns to insert data wherever you want.

Updating data through SSIS package’s data flow

Even if SSIS does awesome job while sending data by inserting them, it’s not clearly designed to make a bunch of update requests. So, if you have a lot of data to update, you should avoid using SQL Command inside a OLE DB Destination target component. It will clearly take too much time to process and it’s today seen as a bad design behavior. Using SSIS, it’s admitted the best design for these kind of processes is to use stagging table. Shortly :

  • send your data to a stagging table in the target database
  • update rows by using SQL Engine :
    • by using a SQL job or
    • by launching a Stored procedure from your SSIS package

Once data are in the stagging table of the target SQL Server database, using a stored procedure will give you the way to do a bulk commit for all your updates. There is no chance you can easily achieve that goal directly inside your SSIS package.

What about the flow execution results

You probably thinking that bulk inserts either updates are making it hard to monitor which data has been inserted or updated. Indeed it does. It’s an all or nothing approach. You can consider to configure a limit to your batch queries when you are manipulating data (for instance define a lower maximum commit insert size on fast load inserts)  but it all depends on your needs.

Optimization considerations while designing SSIS data flows

  • When loading data from a Source, only select the columns you need
  • Avoid defining multiple definition in a derived column shape. If you need many derived column, prefer cascading them one by one
  • When you can, always choose to sort data form SQL Engine against SSIS Sort Transformation
  • Avoid making updates with a SSIS OLE DB Command unless you have a few rows to update. You should always prefer using stagging tables
  • Avoid Cursors and fetching rows from a Stored Procedure : within you database Server, it might be easier to write some SQL statements which loads and process rows, however, it will probably ends with performance issues