Wednesday, March 21, 2012

How can I append children to parents in a SSIS data flow task?

I need to extract data to send to an external agency in their supplied format. The data is normalised in our system in a one to many relationship. The external agency needs it denormalised.

In our system, the parent p has p_id, p_attribute_1, p_attribute_2, p_attribute_3 and the child has c_id, c_attribute_a, c_attribute_b, c_parent_id_fk

The external agency can only use a delimited file looking like

p_id, p_attribute_1, p_attribute_2, p_attribute_3, c1_attribute_a, c1_attribute_b, c2_attribute_a, c2_attribute_b, ...., cn_attribute_a, cn_attribute_b

where n is the number of children a parent may have. Each parent can have 0 or more children - typically between 1 and 20.

How can I achieve this using SSIS? In the past I have used custom built VB apps with the ADO SHAPE command but this is not ideal as I have to rebuild each time to alter the selection criteria and and VB is not a good SQL tool.

You can use the pivot transform to move rows to columns, but it doesn't handle a variable number of columns very well. I'd try creating a script component that outputs the set of children for a parent as one long string. Then you can append it to the parent value.|||

jwelch wrote:

You can use the pivot transform to move rows to columns, but it doesn't handle a variable number of columns very well. I'd try creating a script component that outputs the set of children for a parent as one long string. Then you can append it to the parent value.

I agree with John.

The way I would approach this is to join the parent and child records together in a SQL statement so you have a row for each parent/child combination. Use this statement in the Source of a Data Flow and run the rows into an asynchronous script transformation component. You'll write code in the script to denormalize the incoming rows into a single output row for each parent. SSIS won't adapt to having an arbitrary number of child columns and I don't think there is a need for it. The output of the script should be a single column (string if it can fit, otherwise text) that is already delimited and can go directly into a flat file destination.

So the script is the only tricky part. It will look at the p_id on each row it receives to determine if this is the same parent as the last input row. If it isn't, then it will write the output row it was working on for the previous p_id to the output and start a new one by appending the p_id and parent and child attributes to a string. If it is the same p_id, it will continue to append the child attributes until it sees the next p_id or the last row.

Something like this oughta do it:

Code Snippet

Public Class ScriptMain
Inherits UserComponent

Dim last_pid As Integer = -1
Dim current_row As StringBuilder = Nothing

Public Overrides Sub FinishOutputs()
' the input is finished, flush the current row
WriteRow()
End Sub

Public Sub WriteRow()
If Not current_row Is Nothing Then
With Output0Buffer
.AddRow()
.OutputLine = current_row.ToString()
' if we use a dt_text column
'.OutputLine.AddBlobData(ASCIIEncoding.ASCII.GetBytes(current_row.ToString()))
End With
End If

End Sub

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
'
' Add your code here
'
If Row.pid <> last_pid Then

' this is a new p_id starting, write the previous one to the output
WriteRow()

' remember this pid for future rows
last_pid = Row.pid

' create new buffer
current_row = New StringBuilder()
' append parent attributes
With current_row
.Append(Row.pid.ToString())
.Append(",")
.Append(Row.pattribute1)
.Append(",")
.Append(Row.pattribute2)
.Append(",")
.Append(Row.pattribute3)
End With
End If

' append child attributes
With current_row
.Append(",")
.Append(Row.cattributea)
.Append(",")
.Append(Row.cattributeb)
End With
End Sub

End Class


Note: The JOIN should group all of the parents together, but just in case you should probably throw an ORDER BY p_id in for that SQL statement.

|||

Phew. I went out 2 hours ago with the express purpose of coming back and answering this but John and Jay have beaten me to it Smile

I was going to suggest exactly what Jay has said. Its kind of hobson's choice on this one - the metadata is unknown at design-time therefore you HAVE to have a single column with all the values concatenated. The fact that you are outputting to a file means that it really doesn't matter anyway.

I would add that this would be a great candidate for a custom component if you can be bothered to build it.

-Jamie

No comments:

Post a Comment