
IT:AD:Azure:Resources:Data Factory

Summary

Azure Data Factory is a cloud-based data integration service for creating data-driven workflows that orchestrate, over secure channels, the movement and transformation of data between cloud and on-premises systems.

It manages Pipelines that typically implement ETL (Connect & Collect; Transform & Enrich; Publish; Monitor).

Features:

  • A Subscription may have n Data Factories
  • A Data Factory is composed of 4 key components (…)
  • A Data Factory manages Pipelines of sequential or parallel Activities (eg: Copy)
  • A Linked Service represents a data store (the source/target connection info for a database or other service); a DataSet describes the location and shape of the data within that remote linked service.
    • Eg:
      • Azure Storage-linked service specifies the connection string to the data source
      • Azure Blob dataset specifies the container and folder within the remote linked service
  • Pipeline Runs (as opposed to Pipeline definitions) are kicked off by Triggers.
  • Pipelines contain:
    • Triggers: can pass Parameters when starting a Run
    • Parameters: key/value properties
    • Activities

  • An Integration Runtime (IR) can be an Azure IR, for communicating with Azure services, or a Self-Hosted Integration Runtime (SHIR), for accessing on-prem services.

  • An SHIR can be associated with up to 4 on-prem Nodes to spread the load and the risk of failure; this is done by repeating the install and key association on up to 4 different devices.
    • A certificate appears to be required (see here).
  • Pipeline: performs a Task, as a Unit of Work, composed of individual Activities
  • Activities: individual operations (eg: Copy) within a Pipeline's Task
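As a sketch of how these components hang together, the same AzureRm cmdlet family used below can register component definitions from JSON files and kick off a Run. This assumes an authenticated AzureRm session; the component names and definition-file paths are hypothetical, not from this page:

```powershell
# Hypothetical sketch: register a Data Factory's components from JSON definition files
$rg = "NZ-MOE-BASE-COMMON"
$df = "nzmoecommon"

# Linked Service: connection info to the remote store (eg: a storage connection string)
Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $df `
    -Name "StorageLinkedService" -DefinitionFile ".\StorageLinkedService.json"

# DataSet: the container/folder and shape of the data within that linked service
Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $rg -DataFactoryName $df `
    -Name "BlobDataset" -DefinitionFile ".\BlobDataset.json"

# Pipeline: the Unit of Work, composed of Activities (eg: Copy)
Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
    -Name "CopyPipeline" -DefinitionFile ".\CopyPipeline.json"

# Kick off a Run on demand (what a Trigger would otherwise do), then check it
$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
    -PipelineName "CopyPipeline"
Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $df -PipelineRunId $runId
```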

As per MSDN:

One cannot yet use the DF Portal or the Azure Portal to register an IR, so you use PowerShell:

Login-AzureRmAccount

# change to the right subscription
Set-AzureRmContext -Subscription <subscription-name>

$projectRef = "BASE"
$resourceGroupName = "NZ-MOE-BASE-COMMON"
$projectResourceNameTemplate = "nzmoecommon"
$dataFactoryResourceName = "$($projectResourceNameTemplate)"
$dataFactorySHIRName = "$($projectResourceNameTemplate)"

$shirDescription = "$($projectRef) Self Hosted Integration Runtime"

# use ARM instead?
# Set-AzureRmDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryResourceName

# -WhatIf previews the change; remove it to actually register the IR
Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName:$resourceGroupName -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName -Type:SelfHosted -Description:"$shirDescription" -WhatIf
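To confirm the registration took (once -WhatIf is removed), the runtime and its connectivity state can be queried; a sketch, reusing the variables above:

```powershell
# Sketch: query the IR's definition and node/connectivity status
Get-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName:$resourceGroupName `
    -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName -Status
```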

This registers an Integration Runtime in the Data Factory, generating two authentication keys. The two keys can be collected via the DF Portal, or via PowerShell:

Get-AzureRmDataFactoryV2IntegrationRuntimeKey -ResourceGroupName:$resourceGroupName -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName | ConvertTo-Json
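The cmdlet returns an object with two key properties, so for scripting (eg: feeding an unattended node registration) you can capture one directly rather than round-tripping through JSON; a sketch:

```powershell
# Sketch: capture just the primary auth key
$irKeys = Get-AzureRmDataFactoryV2IntegrationRuntimeKey -ResourceGroupName:$resourceGroupName `
    -DataFactoryName:$dataFactoryResourceName -Name:$dataFactorySHIRName
$authKey = $irKeys.AuthKey1   # AuthKey2 is the secondary, kept for key rotation
```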

Then download and install the runtime:

  • https://www.microsoft.com/download/details.aspx?id=39717
  • Install

  • At this point, do not ask for

  • Note the System Tray icon
  • On the “Register Integration Runtime (self-hosted)“ page, provide the key
  • On the device, ensure “Microsoft Integration Runtime Configuration Manager → Settings → Remote access to intranet” is enabled
  • Ensure the System Tray icon is green and you are good to go
  • Ensure corporate and device firewall rules allow outbound traffic to:
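As an alternative to pasting the key into the Configuration Manager UI, the runtime ships a dmgcmd.exe that can register the node from an elevated prompt; a sketch, where the install path's version segment and the remote-access port are assumptions that vary by install:

```powershell
# Sketch: unattended node registration (run elevated on the SHIR node)
$dmgcmd = "C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe"
& $dmgcmd -RegisterNewNode "<authKey>"   # paste AuthKey1 from the DF Portal/PowerShell
& $dmgcmd -EnableRemoteAccess 8060       # equivalent of Settings → Remote access to intranet
```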

    • *.servicebus.windows.net on 443, 80
    • *.core.windows.net on 443
    • *.frontend.clouddatahub.net on 443

You might have a proxy to deal with too.
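From the node, outbound reachability to those endpoints can be spot-checked with Test-NetConnection; the concrete host names below are hypothetical instances of the wildcards, so substitute your own region/account hosts:

```powershell
# Sketch: spot-check outbound 443 from the SHIR node
foreach ($endpoint in @(
        "myfactory.servicebus.windows.net",      # hypothetical service bus namespace
        "mystorageacct.blob.core.windows.net",   # hypothetical storage account
        "australiaeast.frontend.clouddatahub.net")) {
    Test-NetConnection -ComputerName $endpoint -Port 443 |
        Select-Object ComputerName, TcpTestSucceeded
}
```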

Architecture diagram (summary): within Azure, the Data Factory (Pipeline) sits alongside a System on App Services, an Operational DB on Azure SQL, and a Staging Container on Azure Storage; on the Corporate LAN, a Self-Hosted Integration Runtime manages, reads, and writes Staging DB1 and Staging DB2 (Sql Server), SSIS, and the on-prem Operational Db over TCP 1433. Staged data is read/written to Azure Storage over HTTPS, and the Data Factory controls the SHIR over confidential channels (HTTPS). Data Factory certifications: HIPAA/HITECH, ISO/IEC 27001, ISO/IEC 27018, CSA STAR.

  • /home/skysigal/public_html/data/pages/it/ad/azure/resources/microsoft/data_factory/home.txt
  • Last modified: 2023/11/04 02:45
  • by 127.0.0.1