Rhino ETL

Simple Extract, Transform and Load library for .NET

Rhino ETL is an extract, transform and load utility that enables you to move data from many different sources, transform them however you like and then load it into a different destination source. If you have worked with ETLs in the past you may be familiar with technologies such as: DTS(Data Transformation Services) or SSIS(SQL Server Integration Services) in the Microsoft world.

 

Resources

Hello World Rhino ETL

Patrik Lindström wrote an Hello World application. Check his blog post, and complete source files at GitHub for more details on how to quickly get started.

Paul Barriere

Ayende Rahien

Coding Instinct (Torkel Ödegaard)

Learn

Rhino Etl processes can be created by either writing C# classes or through a DSL. The basic idea behind each process is a Pipeline. Each process has input rows and output rows. I will demonstrate using the DSL. Let’s assume for a moment that you have 5 records in a database as depicted in the picture above. You would like to get those records out of the database, format them and finally into a pipe delimited text file.

PROCESS 1: EXTRACT DATA FROM THE DATABASE

The input command below will execute the SQL statement and return all rows into the pipeline.

input "SourceDatabase", Command = """
SELECT
    FirstName, LastName, LoyaltyNumber
FROM LoyaltyTable
"""

It’s important to note here that each step in the pipeline does not process or gather all of the rows before it passes it to the next step in the pipeline. As each row is emitted from one process and it is sent to the next. This can dramatically reduce the memory footprint required to run the complete process. The next step is to transform the rows from the first process.

PROCESS 2: TRANSFORM THE DATA

This operation uses simple boo syntax (python-like) to manipulate each row. You can modify existing columns or create new ones just by setting their value.

operation transform_names:
    for row in rows:
        row.Name = row.FirstName + " " + row.LastName
        row.Loyalty = row.LoyaltyNumber.Substring(0,3) + "-" + row.LoyaltyNumber.Substring(3,row.LoyaltyNumber.Length-3)
        yield row

This operation will take each row and create a new field called Name with the First and Last names combined together. Along with parsing out the Loyalty the way we’d like it.

PROCESS 3: EXPORT THE DATA TO A TEXT FILE

In order to export the data we must define what format should be exported. For text files Rhino Etl comes with the FileHelpers library that allows you to define the format by simply creating a class. Since we’re demonstrating this using the DSL here is how the class would look for our export file.

import FileHelpers
[DelimitedRecord("|")]
public class LoyaltyRow:
  public Name as string
  public Loyalty as string

Then we need a simple operation to write that file to disk.

operation export_file:
    engine = Rhino.Etl.Core.Files.FluentFile.For[of LoyaltyRow]()
    file = engine.To("""DatabaseExport.txt""")
    for row in rows:
        record = row.ToObject[of LoyaltyRow]()
        file.Write(record)
        yield row

    file.Dispose()

COMPLETE FILE

Those are the major pieces of an etl script. Here is what the complete file looks like.

import FileHelpers
[DelimitedRecord("|")]
public class LoyaltyRow:
public Name as string
public Loyalty as string
operation transform_names:
    for row in rows:
        row.Name = row.FirstName + " " + row.LastName
        row.Loyalty = row.LoyaltyNumber.Substring(0,3) + "-" + row.LoyaltyNumber.Substring(3,row.LoyaltyNumber.Length-3)
        yield row
operation export_file:
    engine = Rhino.Etl.Core.Files.FluentFile.For[of LoyaltyRow]()
    file = engine.To("""DatabaseExport.txt""")
    for row in rows:
        record = row.ToObject[of LoyaltyRow]()
        file.Write(record)
        yield row
    file.Dispose()
    process Test:
    input "SourceDatabase", Command = """
        SELECT
        FirstName, LastName, LoyaltyNumber
        FROM LoyaltyTable
    """

transform_names()
export_file()

CONFIGURATION

Since we are referencing a database connection string named “SourceDatabase” it must pull that information from a configuration file. Here is a very simple one. You need to specify the name, connectionString along with the providerName to indicate which ADO.NET provider to use.

<?xml version="1.0" encoding="utf-8" ?> 
<configuration> 
    <connectionStrings> 
        <add name="SourceDatabase" 
            connectionString="Data Source=.\sqlexpress;Initial Catalog=rhino-etl-examples;Integrated Security=SSPI;Timeout=300;" 
            providerName="System.Data.SqlClient.SqlConnection, System.Data, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"/> 
    </connectionStrings> 
</configuration>

EXECUTION

Once you have your two files (Example.boo for the ETL and Example.config for the configuration) you can execute it on the command line using Rhino.Etl.Cmd.

Rhino.Etl.Cmd -f:Example.boo -c:Example.config -v

Dsl – Chaining Input From Three Files

This is an example of importing 3 text files into a Microsoft Sql Server. All three are the exact same format. Two files were created to perform the operation. The Configuration.config for the connection string and Changed.boo which contains the operation.

Configuration.config

<?xml version="1.0" encoding="utf-8" ?> 
<configuration> 
  <connectionStrings> 
    <add name="ConnectionName" 
       connectionString="Data Source=YourServer;Initial Catalog=YourDatabase;Integrated Security=SSPI;Timeout=300;" 
       providerName="System.Data.SqlClient.SqlConnection, System.Data, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"/> 
  </connectionStrings> 
</configuration>

Changed.boo

import FileHelpers
[FixedLengthRecord] 
class NcoaRow: 
  [FieldFixedLength(4)] public Trne as string 
  [FieldFixedLength(11)] public CustomerId as string 
  [FieldFixedLength(25)] public CustomerName as string 
  [FieldFixedLength(25)] public CustomerAddress1 as string 
  [FieldFixedLength(25)] public CustomerAddress2 as string 
  [FieldFixedLength(20)] public CustomerCity as string 
  [FieldFixedLength(2)] public CustomerState as string 
  [FieldFixedLength(9)] public CustomerZip as string 
  [FieldFixedLength(10)] public CustomerPhone as string 
  [FieldFixedLength(5)] public SignUpStore as string 
  [FieldFixedLength(1)] public TypeCode as string 
  [FieldFixedLength(20)] public FirstName as string 
  [FieldFixedLength(20)] public LastName as string 
  [FieldFixedLength(50)] public EmailAddress as string 
  [FieldFixedLength(1)] public Gender as string 
  [FieldFixedLength(1)] public HomeOwnerStatus as string 
  [FieldFixedLength(10)] public DateOfBirth as string 
  [FieldFixedLength(10)] public DateUpdated as string            
  [FieldFixedLength(1)] public Language as string 
  [FieldFixedLength(10)] public DateOther as string
operation import_undeliv: 
  for row in rows: 
    yield row 
  file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourUndelivPath""") 
  for row in file: 
    yield Row.FromObject(row)
operation import_changed: 
  for row in rows: 
    yield row 
  file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourChangedPath""") 
  for row in file: 
    yield Row.FromObject(row)
operation import_unchanged: 
  for row in rows: 
    yield row 
  file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourUnchangedPath""") 
  for row in file: 
    yield Row.FromObject(row)
process ChangedBulkImport: 
  import_undeliv() 
  import_changed() 
  import_unchanged() 
  sqlBulkInsert "ConnectionName", "TableName", TableLock = true, KeepIdentity = true, BatchSize = 10000, UseTransaction = false : 
    map "Trne", "Trne", string 
    map "CustomerId", "CustomerId", string 
    map "CustomerName", "CustomerName", string 
    map "CustomerAddress1", "CustomerAddress1", string 
    map "CustomerAddress2", "CustomerAddress2", string 
    map "CustomerCity", "CustomerCity", string 
    map "CustomerState", "CustomerState", string 
    map "CustomerZip", "CustomerZip", string 
    map "CustomerPhone", "CustomerPhone", string 
    map "SignUpStore", "SignUpStore", string 
    map "TypeCode", "TypeCode", string 
    map "FirstName", "FirstName", string 
    map "LastName", "LastName", string 
    map "EmailAddress", "EmailAddress", string 
    map "Gender", "Gender", string 
    map "HomeOwnerStatus", "HomeOwnerStatus", string 
    map "DateOfBirth", "DateOfBirth", string 
    map "DateUpdated", "DateUpdated", string 
    map "Language", "Language", string 
    map "DateOther", "DateOther", string

Executed using

Rhino.Etl.Cmd -c:Configuration.config -f:Changed.boo