I have multiple csv files in my Azure Blob Storage which I want to append into one csv file, also stored in Azure Blob Storage, using an Azure Data Factory pipeline. The problem is that the source files are not identical, and not all of the source columns are present in the sink file (and vice versa). I just want to map the columns I need from the source files to the columns in the sink file, but the Copy activity in Data Factory is not allowing me to do so.



1 Answer

As @LeonYue said, this is not supported in Azure Data Factory at the moment. However, in my experience, as a workaround you can write a Python script using pandas to do it, and run it as a WebJob of Azure App Service or on an Azure VM for faster transfer between Azure Storage and other Azure services.

The steps of the workaround solution are as follows.

  1. Assuming these csv files are all in one container of Azure Blob Storage, list them in the container via list_blob_names and generate their urls with a SAS token for the pandas read_csv function, as in the code below.

    from azure.storage.blob.baseblobservice import BaseBlobService
    from azure.storage.blob import ContainerPermissions
    from datetime import datetime, timedelta
    
    account_name = '<your account name>'
    account_key = '<your account key>'
    container_name = '<your container name>'
    
    service = BaseBlobService(account_name=account_name, account_key=account_key)
    # Generate a read-only SAS token for the container, valid for one hour
    token = service.generate_container_shared_access_signature(container_name, permission=ContainerPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
    
    # List all blob names in the container and build SAS-signed urls for pandas
    blob_names = service.list_blob_names(container_name)
    blob_urls_with_token = (f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}" for blob_name in blob_names)
    
    #print(list(blob_urls_with_token))
    
  2. Read each csv file directly with the read_csv function to get a pandas dataframe, and collect the dataframes in a list.

    import pandas as pd
    
    # Read each csv blob into a dataframe and keep them all for later merging
    dfs = [pd.read_csv(blob_url_with_token) for blob_url_with_token in blob_urls_with_token]
  3. Use pandas to rename and select the columns you need from these dataframes, concatenate them into a single dataframe, and then write it back to Azure Blob Storage as one csv file using the Azure Storage SDK for Python, as in the sketch after this list.
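A minimal sketch of step 3, assuming the sink file uses the hypothetical columns `id`, `name` and `amount`, that some source files name them differently (e.g. `customer_id`), and that the same v2 azure-storage-blob SDK as above is available; adjust the column mapping and the container/blob names for your own data.

    import pandas as pd
    from azure.storage.blob import BlockBlobService
    
    # Hypothetical mapping from source column names to sink column names
    column_rename = {'customer_id': 'id', 'customer_name': 'name'}
    # Hypothetical list of columns expected in the sink csv
    sink_columns = ['id', 'name', 'amount']
    
    frames = []
    for df in dfs:
        df = df.rename(columns=column_rename)
        # Keep only the sink columns; columns missing from a source file become NaN
        frames.append(df.reindex(columns=sink_columns))
    
    merged = pd.concat(frames, ignore_index=True)
    
    # Upload the merged dataframe to Blob Storage as a single csv file
    block_service = BlockBlobService(account_name=account_name, account_key=account_key)
    block_service.create_blob_from_text('<your sink container name>', 'merged.csv', merged.to_csv(index=False))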

Hope it helps.

