A few months ago I took over an ETL project that uses AWS Glue. Currently, each job run queries all records from DocDB and then filters them to avoid reprocessing records unnecessarily. This is inefficient: querying and filtering ALL records on every job run is expensive and does not scale.
The question is: how can we customize the DocDB query from a Glue job? From reviewing the docs, it doesn't seem that glueContext.getSourceWithFormat has an option to pass a DocDB query.
If Glue does not provide this option, I'm thinking of having the job trigger an AWS Lambda function that queries for the records and stores them as JSON in an S3 bucket until Glue processes them.
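
For reference, here is a minimal sketch in Python of the Lambda-side logic I have in mind. It is only an illustration under several assumptions: the documents carry a timestamp field (hypothetically named updatedAt here), the timestamp of the last successful run is stored somewhere and passed in as last_run, collection is a pymongo-style collection (DocDB is MongoDB-compatible), and s3_client is a boto3 S3 client. The function names are made up for this example.

```python
import json
from datetime import datetime


def build_incremental_filter(last_run: datetime, field: str = "updatedAt") -> dict:
    """Build a MongoDB-style filter for documents changed after last_run.

    The field name "updatedAt" is an assumption, not something from our schema.
    """
    return {field: {"$gt": last_run}}


def export_new_records(collection, s3_client, bucket: str, key: str,
                       last_run: datetime) -> int:
    """Query only records newer than last_run and write them to S3 as JSON Lines.

    collection is expected to expose find(filter) like a pymongo collection;
    s3_client is expected to expose put_object(Bucket=, Key=, Body=) like boto3.
    Returns the number of records written (0 means nothing was uploaded).
    """
    query = build_incremental_filter(last_run)
    # default=str turns values json cannot encode (ObjectId, datetime) into strings
    lines = [json.dumps(doc, default=str) for doc in collection.find(query)]
    if not lines:
        return 0
    body = "\n".join(lines).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(lines)
```

The Glue job would then read the JSON Lines object from S3 instead of reading the whole collection. Passing the collection and the S3 client in as arguments keeps the function easy to test without a live cluster.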