IT:AD:Azure:Resources:Data Factory
Summary
Cloud-based data integration service for creating data-driven workflows that orchestrate and automate data movement and transformation, over secure channels, between cloud and on-prem data stores.
Manages Pipelines that typically perform ETL (Connect & Collect, Transform & Enrich, Publish, Monitor).
Features:
- Visual Drag/Drop to develop Pipelines
- Certified data movement: HIPAA/HITECH, ISO/IEC 27001, ISO/IEC 27018, and CSA STAR
Notes
- A Subscription may have n Data Factories
- A Data Factory is composed of 4 key components (…)
- A Data Factory manages Pipelines of sequential or parallel Activities (eg: Copy)
- A Linked Service represents a data store (connection info for a source/target database or other service); a Dataset describes the source and shape of the data within that linked service.
- Eg:
- An Azure Storage linked service specifies the connection string to the data source
- An Azure Blob dataset specifies the container and folder within the remote linked service
- Pipeline runs (as opposed to Pipeline definitions) are kicked off by Triggers.
- Pipelines contain:
- Triggers: can pass Parameters when starting a Run
- Parameters: key/value properties
- Activities
* Integration Runtime (IR) can be an Azure IR, to communicate with Azure services, or a Self-Hosted Integration Runtime (SHIR), for accessing on-prem services.
- A SHIR can be associated with up to 4 on-prem Nodes to spread the load and the risk of failure. This is done by repeating the install and key registration on each device (up to 4 different devices).
- A cert appears to be required (see here).
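The Linked Service / Dataset pairing from the notes above can be sketched as the JSON definitions a Data Factory holds. This is a minimal sketch, not a definitive template: the names `ExampleStorageLinkedService`, `ExampleBlobDataset` and `examplecontainer/input` are hypothetical, the connection string is a placeholder, and the property layout assumes the V2 JSON schema.

```python
import json

# Hypothetical linked service: holds the connection info for the data store.
# The connection string is a placeholder, not a real secret.
linked_service = {
    "name": "ExampleStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>",
            }
        },
    },
}

# Hypothetical dataset: names the container/folder inside that data store
# and points back at the linked service by name.
dataset = {
    "name": "ExampleBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"folderPath": "examplecontainer/input"},
    },
}


def dataset_uses(dataset, linked_service):
    """Return True if the dataset references the given linked service by name."""
    ref = dataset["properties"]["linkedServiceName"]["referenceName"]
    return ref == linked_service["name"]


# Serialise to the JSON text that would be deployed.
dataset_json = json.dumps(dataset, indent=2)
```

The point of the split is that the connection details live in one place (the linked service), while any number of datasets can reference it by name.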
Terms
- Pipeline: performs a Task, composed of individual Activities, as a Unit of Work
- Activities: individual operations (eg: Copy) within a Pipeline's Tasks.
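To make the Pipeline / Activity / Parameter terms concrete, here is a rough sketch of a pipeline definition with one Copy activity and one parameter, plus a small local helper showing how values passed by a Trigger could override declared defaults. The pipeline, dataset and parameter names are hypothetical, and `resolve_run_parameters` is only an illustration written for these notes; it is not how the service itself resolves parameters.

```python
# Hypothetical pipeline: one parameter and one Copy activity,
# laid out as assumed for the V2 JSON schema.
pipeline = {
    "name": "ExampleCopyPipeline",
    "properties": {
        "parameters": {
            "inputFolder": {"type": "String", "defaultValue": "examplecontainer/input"},
        },
        "activities": [
            {
                "name": "CopyBlobToBlob",
                "type": "Copy",
                "inputs": [{"referenceName": "ExampleInputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "ExampleOutputDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "BlobSink"},
                },
            }
        ],
    },
}


def resolve_run_parameters(pipeline, supplied):
    """Merge trigger-supplied values over the pipeline's declared defaults.

    Raises KeyError if a supplied parameter is not declared on the pipeline.
    """
    declared = pipeline["properties"].get("parameters", {})
    unknown = set(supplied) - set(declared)
    if unknown:
        raise KeyError(f"undeclared parameters: {sorted(unknown)}")
    resolved = {name: spec.get("defaultValue") for name, spec in declared.items()}
    resolved.update(supplied)
    return resolved
```

For example, `resolve_run_parameters(pipeline, {"inputFolder": "examplecontainer/other"})` returns the overriding value, while an empty dict falls back to the declared default.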
To Install SHIR
As per MSDN:
One cannot yet use the DF Portal or Azure Portal to register an IR, so you use PowerShell:
Login-AzureRmAccount
# change to right subscription
Set-AzureRmContext -Subscription <subscription-name>

$projectRef = "BASE"
$resourceGroupName = "NZ-MOE-BASE-COMMON"
$projectResourceNameTemplate = "nzmoecommon"
$dataFactoryResourceName = "$($projectResourceNameTemplate)"
$dataFactorySHIRName = "$($projectResourceNameTemplate)"
$shirDescription = "$($projectRef) Self-Hosted Integration Runtime"

# use arm instead?
# Set-AzureRmDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryResourceName

# -WhatIf only previews the change; remove it to actually create the IR
Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName:$resourceGroupName -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName -Type:SelfHosted -Description:"$shirDescription" -WhatIf
This registers an Integration Runtime in the Data Factory, generating two keys. The two keys can be collected via the DF Portal, or via PowerShell:
Get-AzureRmDataFactoryV2IntegrationRuntimeKey -ResourceGroupName:$resourceGroupName -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName | ConvertTo-Json
Then download the runtime:
* https://www.microsoft.com/download/details.aspx?id=39717
* Install
- At this point, do not ask for
* Note the System Tray icon
* On the “Register Integration Runtime (self-hosted)” page, provide the key
* On the device, ensure “Microsoft Integration Runtime Configuration Manager → Settings → Remote access to intranet” is enabled.
* Ensure the System Tray icon is green and you are good to go.
* Ensure corporate and device firewall rules allow outbound traffic to:
- *.servicebus.windows.net on 443, 80
- *.core.windows.net on 443
- *.frontend.clouddatahub.net on 443
You might have a proxy to deal with too.
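A quick way to sanity-check the firewall rules above from the SHIR device is to attempt a plain outbound TCP connection to each host/port. The sketch below assumes Python is available on the device; `can_connect` and `check_endpoints` are hypothetical helper names. Because the rules are wildcards, you have to supply concrete host names yourself (e.g. your own storage account's endpoint). It only tests direct connections, so it will not tell you anything useful if traffic has to go through a proxy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_endpoints(endpoints, timeout=5.0):
    """Map each (host, port) pair to whether a connection could be opened."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}
```

Usage would look like `check_endpoints([("myaccount.blob.core.windows.net", 443)])`, where `myaccount` is a placeholder for a real storage account name; any pair that maps to `False` points at a firewall, DNS or proxy issue worth investigating.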