首页 > 解决方案 > 如何在 SSIS 中使用缓存连接管理器执行范围查找

问题描述

有没有办法使用 SSIS 中的缓存连接管理器执行日期范围查找?或者类似的东西非常高效。

我的情况如下。我在一个有日期的表中有一行,我们称之为 BusinessDate。我需要对表执行查找以查看 businessDate 是否介于维度的 StartDate 和 EndDate 之间。

问题是,我正在读取的表有数百万条记录,而我的维度(查找表)有几千条记录,而且需要很长时间。

请帮忙...

标签: ssislookup

解决方案


Nope, the Lookup with a cache connection manager is a strict equals. You might be able to finagle it with a lookup against an OLE DB source with a Partial/None cache model and custom queries.

So, what can you do?

You can modify the way you populate your Lookup Cache. Assuming your data looks something like

MyKey|StartDate|EndDate|MyVal
1    |2021-01-01|2021-02-01|A
1    |2021-02-01|9999-12-31|B

Instead of just loading as is, explode out your dimension.

MyKey|TheDate|MyVal
1    |2021-01-01|A
1    |2021-01-02|A
1    |2021-01-03|A
1    |2021-01-04|A
1    |2021-01-05|A
...
1    |2021-02-01|B
1    |2021-02-02|B
...

You might not want to build your lookup all the way to year 9999 but know your data and say go 5 years in the future as well as pick up the end date.

Now your lookup usage is a supported case - strict equals.

Otherwise, the pattern of a merge join is how people handle range joins in a data flow. Going to reproduce Matt Masson's article from the msdn blogs because it's dead

Lookup Pattern: Range Lookups

Performing range lookups (i.e. to find a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially for historical loads and late arriving fact situations, where you're using type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.

This blog post outlines three separate approaches for doing range lookups in SSIS:

  • Using the Lookup Transform
  • Merge Join + Conditional Split
  • Script Component

All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.

One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).

Lookup Transform

The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).

We can use the following query to have the lookup transform perform our range lookup:

select [ProductKey], [ProductAlternateKey], 
     [StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and   [StartDate] <= ?
and (
    [EndDate] is null or 
    [EndDate] > ?
)

On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.

This approach is effective and easy to setup, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.

Merge Join and Conditional Split

This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.

enter image description here

For example, a row coming in from our source would contain an OrderDate and ProductNumber, like this:

table1

From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.

table2

We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):

StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))

enter image description here

This approach is a little more complicated, but performs a lot better than using the Lookup Transform.

Script component

Not reproduced here

Conclusion

Not reproduced here


推荐阅读