bash - Splitting out a large file


asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I would like to process a 200 GB file with lines like the following:

...
{"captureTime": "1534303617.738","ua": "..."}
...

The objective is to split this file into multiple files, one per hour.

Here is my basic script:

#!/bin/sh

echo "Splitting files"

echo "Total lines"
sed -n '$=' $1

echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

while read p; do
  date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
  echo $p >> split.$date
done <$1

Some facts:

80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.

Could you help me to optimize this bash script?

Thank you


1 Answer

深蓝 · Answer 1 · 2022-01-31T07:05:54+0000

This awk solution might come to your rescue:

awk -F'"' '{file="split." strftime("%Y%m%d%H",$4); print >> file; close(file) }' "$1"

It essentially replaces your while-loop.
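
Closing and reopening the output file for every one of 80 000 000 lines costs time. If the input is roughly in chronological order (an assumption, not something stated in the question), one possible refinement is to close a file only when the hour changes. A minimal sketch, with a made-up three-line `sample.json` standing in for the real file and `TZ=UTC` pinning the hour boundaries for reproducibility:

```shell
#!/bin/sh
# Hypothetical sample input; the real 200 GB file would be used instead.
cat > sample.json <<'EOF'
{"captureTime": "1534303617.738","ua": "a"}
{"captureTime": "1534303700.001","ua": "b"}
{"captureTime": "1534307300.500","ua": "c"}
EOF

# Start clean, because the awk program below appends.
rm -f split.2018081503 split.2018081504

# Keep the current output file open; close it only when the hour changes.
# Needs an awk that provides strftime (GNU awk, for example).
TZ=UTC awk -F'"' '
{
  file = "split." strftime("%Y%m%d%H", $4)
  if (file != prev) {            # a different hour than the previous line
    if (prev != "") close(prev)  # release the previous output file
    prev = file
  }
  print >> file                  # ">>" appends, so revisiting an hour is safe
}' sample.json
```

Out-of-order lines still land in the right file, because a reopened file is appended to; they only cost an extra close and reopen. Drop `TZ=UTC` to group by local time, as the original `date` calls do.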

Furthermore, you can replace the complete script with:

# Start AWK file
BEGIN { FS = "\"" }
(NR==1){tmin=tmax=$4}
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
  print "Total lines processed: ", NR
  print "First date: "strftime("%Y%m%d%H",tmin)
  print "Last date:  "strftime("%Y%m%d%H",tmax)
}

Which you then can run as:

awk -f awk_file.awk input_file

Note: strftime is not part of POSIX awk, so this needs GNU awk (or another awk that provides strftime).
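
If no awk with strftime is available, one possible workaround is to compute the hour bucket arithmetically and call `date` only once per distinct hour, caching the label. A minimal sketch, assuming GNU `date` (the same `-d '@...'` form the question already uses) and UTC hour boundaries; `sample.json` and its contents are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample input; the real file would be used instead.
cat > sample.json <<'EOF'
{"captureTime": "1534303617.738","ua": "a"}
{"captureTime": "1534303700.001","ua": "b"}
{"captureTime": "1534307300.500","ua": "c"}
EOF

# Start clean, because the awk program below appends.
rm -f split.2018081503 split.2018081504

# No strftime here: each distinct hour triggers a single call to date,
# and the resulting label is cached in the array "label".
awk -F'"' '
{
  hour = int($4 / 3600) * 3600           # start of the hour, in epoch seconds
  if (!(hour in label)) {
    cmd = "date -u -d @" hour " +%Y%m%d%H"
    cmd | getline label[hour]            # read the formatted hour from date
    close(cmd)
  }
  file = "split." label[hour]
  print >> file
  close(file)
}' sample.json
```

The number of `date` processes then equals the number of distinct hours in the file, not the number of lines, which is what made the original while-loop slow.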
