Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a pipeline that moves approximately 1 TB of data, all of it CSV files. The pipeline handles hundreds of files with different names. Each filename has a date component, and the data is automatically partitioned by that date. My question is how to use the CDK to automatically create subfolders based on the name of each file. In other words, the data comes in under a broad category, but our data scientists need it split at one more level of detail.


1 Answer

It appears that your requirement is to move incoming objects into folders based on information in their filename (Key).

This could be done by configuring an event notification on the Amazon S3 bucket that invokes an AWS Lambda function whenever a new object is created.

Here is some code from "Moving file based on filename with Amazon S3":

import urllib.parse  # import the submodule explicitly; a bare "import urllib" is not guaranteed to load it

import boto3

# Create the client once so it is reused across warm invocations
s3_client = boto3.client('s3')

def lambda_handler(event, context):

    # Get the bucket and object key from the Event (keys arrive URL-encoded)
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Only copy objects that were uploaded to the bucket root (to avoid an infinite loop)
    if '/' not in key:

        # Determine destination directory based on Key
        directory = key  # Placeholder: your logic goes here to extract the directory name

        # Copy object
        s3_client.copy_object(
            Bucket=bucket,
            Key=f"{directory}/{key}",
            CopySource={'Bucket': bucket, 'Key': key}
        )

        # Delete source object
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )

You would need to replace the directory = key placeholder with logic that derives the name of the destination directory from the key of the new object.
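As a rough illustration of that directory logic, here is a minimal sketch. The question does not say how the files are actually named, so the pattern `<category>_<YYYY-MM-DD>.csv`, the function name `directory_for_key`, and the `unsorted` fallback folder are all assumptions made for the example; adjust them to your real naming scheme.

```python
import re

# Assumed (hypothetical) naming convention: <category>_<YYYY-MM-DD>.csv
_NAME_PATTERN = re.compile(r"^(?P<category>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.csv$")


def directory_for_key(key: str) -> str:
    """Return the destination directory for an object key.

    Only the final path component is inspected, so the function works for
    keys at the bucket root as well as keys under a prefix.
    """
    filename = key.rsplit("/", 1)[-1]
    match = _NAME_PATTERN.match(filename)
    if match is None:
        # Hypothetical fallback for names that do not follow the convention
        return "unsorted"
    return match.group("category")
```

With this in place, the placeholder line in the handler could become directory = directory_for_key(key). For instance, directory_for_key("sales_north_2021-01-15.csv") returns "sales_north", and a name that does not match the pattern goes to "unsorted".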

The code also assumes that new objects arrive at the top level (root) of the bucket and are then moved into sub-directories. If new objects instead arrive under a given path (e.g. incoming/), configure the S3 trigger to fire only for that path and remove the if '/' not in key check.
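Since the question asks about the CDK, the trigger itself could be wired up along these lines. This is only a sketch using AWS CDK v2 for Python: the stack and construct names, the `lambda` asset directory, the `index.lambda_handler` entry point, the runtime version, and the `incoming/` prefix are assumptions for illustration, not details from the question.

```python
from aws_cdk import Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_s3_notifications as s3n
from constructs import Construct


class MoveFilesStack(Stack):
    """Hypothetical stack: a bucket plus a Lambda that re-files new objects."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "DataBucket")

        # Assumes the handler code lives in ./lambda/index.py
        move_fn = _lambda.Function(
            self,
            "MoveFileFunction",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="index.lambda_handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # The function copies and deletes objects, so grant both explicitly
        bucket.grant_read_write(move_fn)
        bucket.grant_delete(move_fn)

        # Invoke the function only for objects created under incoming/,
        # so the moved copies do not trigger it again
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(move_fn),
            s3.NotificationKeyFilter(prefix="incoming/"),
        )
```

Restricting the notification with a prefix filter plays the same role as the root-only check in the handler: objects written to the destination folders fall outside incoming/ and therefore do not re-trigger the function.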

