Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have tried numerous scripts and approaches to clean a large text file before importing into MS Access.

The text file is 500k+ lines. Some lines contain 'carriage returns' or 'line breaks'. These are displayed as square symbols in Notepad. (Interestingly in Windows XP they are squares, but in Windows 2003 they don't appear in Notepad but do break text onto the next line/row.

Each field should have no occurrences of these. Therefore I need a way of removing all of these from the file.

Example of text file contents:

 FIELD_NAME1|FIELD_NAME2                         |FIELD_NAME3
 John       |He likes food                       |1002
 Jake       |He eats food                        |1004
 Jake       |He eats food and [][] likes swimming|1003

1) One solution was to read through the file and repair rows. However difficulty in getting this to work. Typically you only realise the row is erroneous based on errors in following rows.

2) Another is to split the text file into smaller bits. Then use find and replace. Once cleansed - stick back together into MS Access.

Is there a simple way to do this?

This task only has to be run a couple of times so automation is not crucial.

Analysis output added by dmuk and then editted by Tony Dallimore

See my (Tony Dallimore) answer for an explanation of this analysis output. I had not expected such long string of control characters (caused by, for example, 44 blank lines) to be found. I have wrapped these long strings in column 1 to improve readability.

String ?       | ? ? ? File ? ?| ? ? ? Line ? ?| ? ? ? File ? ?| ? ? ? Line
?13 10 ?       | ? ? ? 1 ? ? ? | ? ? ? 1 ? ? ? | ? ? ? 376 ? ? | ? ? ? 626
?9 ? ? ?       | ? ? ? 1 ? ? ? | ? ? ? 2299 ? ?| ? ? ? 375 ? ? | ? ? ? 3524
?9 9 ? ?       | ? ? ? 3 ? ? ? | ? ? ? 6106 ? ?| ? ? ? 67 ? ? ?| ? ? ? 6111
?9 9 9 9 ? ?   | ? ? ? 6 ? ? ? | ? ? ? 1916 ? ?| ? ? ? 53 ? ? ?|       1492
?9 9 9 ?       | ? ? ? 6 ? ? ? | ? ? ? 1917 ? ?| ? ? ? 53 ? ? ?| ? ? ? 1493
?9 9 9 9 9? ? ?| ? ? ? 42 ? ? ?| ? ? ? 1266 ? ?| ? ? ? 42 ? ? ?|       1266
?10 ? ?        | ? ? ? 69 ? ? ?| ? ? ? 1524 ? ?| ? ? ? 240 ? ? | ? ? ? 4885
?10 10 ?       | ? ? ? 69 ? ? ?| ? ? ? 3577 ? ?| ? ? ? 222 ? ? | ? ? ? 4651
?13 10 13 10 ? | ? ? ? 71 ? ? ?| ? ? ? 3697 ? ?| ? ? ? 374 ? ? |       3258
?13 10 10 ? ? ?| ? ? ? 80 ? ? ?| ? ? ? 5440 ? ?| ? ? ? 240 ? ? |       4166
?13 10 13 10 13|       81      |       2657    |       290     |       2094
 10 13 10 ? ? ?|               |               |               |
?13 10 13 10 13|       81      |       2662    |       215     |       1802
 10            |               |               |               |
 13 10 13 10 10| ? ? ? 86 ? ? ?| ? ? ? 2082 ? ?| ? ? ? 86 ? ? ?|       6914
 10 10 10      |       88      |       1314    |       221     |       4754
?9 10 ?        | ? ? ? 94 ? ? ?| ? ? ? 246 ? ? | ? ? ? 94 ? ? ?| ? ? ? 246
?13 10 13 10 13|       126     |       1699    |       126     |       1699
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
?13 10 13 10 13|       143     |       2078    |       143     |       2078
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10   |               |               |               |
?10 10 10 10? ?| ? ? ? 182 ? ? | ? ? ? 1846 ? ?| ? ? ? 188 ? ? |        2663
 10 10 10 10 10|       195     |       3320    |       195     |        3320
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10? ?|               |               |               |
?13 10 13 10 13|       198     |       4223    |       198     |       4223
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 ?    | ? ? ? 198     | ? ? ? 4223 ? ?| ? ? ? 198 ? ? | ? ? ? 4223
?10 10 10 10 10|       213     |       5449    |       213     |       5449
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10         ? ?|               |               |               |
?13 10 13 10 13|       278     |       788     |       278     |       788
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 ? |               |               |               |
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
112 views
Welcome To Ask or Share your Answers For Others

1 Answer

Introduction

It appeared at first that the problem was extra carriage returns. The first solution (which has been deleted) searched for lone CRs and removed them. This had no beneficial effect so it became clear that the problem was not extra carriage returns. I provided the analysis code below so we could properly assess the true sitiation. The output from this analysis routine was added to the original question. Review of this output revealed the true problems were:

  • A large number of blank lines.
  • extra line feeds.

A revised solution based on these findings is below the analysis code.

Analysis

You need to include the code below in a module. The routine requires a worksheet named "DiagInfo".

The code loops reading blocks of approximately 1 Mb from the input file. It splits each block into lines with any control character serving as a line terminator. It creates one output file per block.

Near the top of the routine, you will find:

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

The input file is: FileInNameRoot & ".txt".

The output files are named: FileOutNameRoot & " 001.txt", FileOutNameRoot & " 002.txt", FileOutNameRoot & " 003.txt", etc.

You can change the block size from 1 Mb if you wish. The routine is very slighly faster with a block size of 1,000,000 but you get ten times more output files. I find 1 Mb gives me files that can be accessed easily with NotePad.

The output looks like:

000001 FIELD_NAME1|FIELD_NAME2|FIELD_NAME3  13 10
000002 John|He likes food|1002  13 10
000003 Jake|He eats food|1004  13 10
000004 Jake|He eats food and  13
000005 likes swimming|1003 13 10
000006 John|He likes food|1002  13 10
000007 Jake|He eats food|1004  13 10
000008 Jake|He eats food and  20 27 0 4

The first seven characters are the line number followed by space. A line is ended by any control character. The display characters from the input file are output unchanged. Each control character is output as space followed by its code value. Most lines are terminated by 13 10 (CR LF), but line 4 is terminated by 13 (CR) and line 8 is terminated by 20 27 0 4 (DC4 ESC NUL EOT).

The worksheet "DiagInfo" looks like:

               First          Last  
String      File   Line    File   Line
 13 10        1       1     66    5786
 13           1       4     66    5666
 20 27 0 4    1       8     66    5670

Column A contains every different string of control characters found by the routine. Columns B and C contain the file and line number of the first occurrence. Columns D and E contain the file and line number of the last occurrence.

The routine uses worksheet "DiagInfo" as a crude progress indicator with the last line showing the current output file number and the last line number that was a multiple of 100. With my 63Mb test file, the routine took 2 minutes.

This will tell us what we are dealing with and allow us to plan accordingly.

Option Explicit 
Sub AnalyseFileAndSplitIntoBlocks()

  Dim Block As String
  Dim BlockLen As Long
  Dim CtrlChr As Long
  Dim CtrlChrStg As String
  Dim FileIn As Object
  Dim FileInNameRoot As String
  Dim FileOut As Object
  Dim FileOutNameRoot As String
  Dim Found As Boolean
  Dim FSO As Object
  Dim LineOut As String
  Dim NumFileOut As Long
  Dim NumLine As Long
  Dim PathCrnt As String
  Dim PosCrnt As Long
  Dim PosStart As Long
  Dim RowDiagCrnt As Long
  Dim RowDiagNext As Long
  Dim StartTime As Single
  Dim TrailingFromLastBlock As String

  StartTime = Timer

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

  With Worksheets("DiagInfo")
    .Activate
    .Cells.EntireRow.Delete
    .Range("B1:C1").Merge
    With .Range("B1")
      .Value = "First"
      .HorizontalAlignment = xlCenter
    End With
    .Range("D1:E1").Merge
    With .Range("D1")
      .Value = "Last"
      .HorizontalAlignment = xlCenter
    End With
    .Range("A2").Value = "String"
    .Range("B2").Value = "File"
    .Range("C2").Value = "Line"
    .Range("D2").Value = "File"
    .Range("E2").Value = "Line"
    .Range("B2:E2").HorizontalAlignment = xlRight
    .Range("A1:E2").Font.Bold = True
    RowDiagNext = 3
    .Cells(RowDiagNext, 1).Select
  End With
  ActiveWindow.FreezePanes = False
  ActiveWindow.FreezePanes = True

  PathCrnt = ActiveWorkbook.Path
  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 1000000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "" & FileInNameRoot & ".txt", 1, 0)
  '  1 = Read.  0 = ASCII file

  NumFileOut = 0
  TrailingFromLastBlock = ""

  Do While FileIn.AtEndOfStream <> True
    Block = TrailingFromLastBlock & FileIn.read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        Block = Block & FileIn.read(1)
      Else
        Exit Do
      End If
    Loop

    With Worksheets("DiagInfo")
      NumFileOut = NumFileOut + 1
      .Cells(RowDiagNext, 2).Value = NumFileOut
      NumLine = 1
      .Cells(RowDiagNext, 3).Value = NumLine
    End With

    Set FileOut = FSO.CreateTextFile(PathCrnt & "" & FileOutNameRoot & " " & _
                            Right("000" & NumFileOut, 3) & ".txt", True, False)
    ' True = Can overwrite.  False = ASCII

    PosStart = 1        ' Start of first line
    PosCrnt = 1
    Do While PosCrnt <= Len(Block)
      If Mid(Block, PosCrnt, 1) < " " Or _
         Mid(Block, PosCrnt, 1) = Chr(127) Then
        ' Have found a control character.
        LineOut = Mid(Block, PosStart, PosCrnt - PosStart)
        ' Build display string of control character and
        ' any subsequent control characters.
        CtrlChrStg = ""
        Do While True
          CtrlChrStg = CtrlChrStg & " " & Asc(Mid(Block, PosCrnt, 1))
          PosCrnt = PosCrnt + 1
          If PosCrnt > Len(Block) Then
            ' This block finished
            Exit Do
          End If
          If Mid(Block, PosCrnt, 1) < " " Or _
             Mid(Block, PosCrnt, 1) = Chr(127) Then
            ' Another control character
          Else
            ' First display character of next line
            Exit Do
          End If
        Loop
        ' Search for control character string in worksheet DiagInfo
        With Worksheets("DiagInfo")
          Found = False
          For RowDiagCrnt = 3 To RowDiagNext - 1
            If .Cells(RowDiagCrnt, 1).Value = CtrlChrStg Then
              Found = True
              Exit For
            End If
          Next
          If Not Found Then
            ' Previously unknown string of control characters
            RowDiagCrnt = RowDiagNext
            RowDiagNext = RowDiagNext + 1
            .Cells(RowDiagNext, 1).Select
            .Cells(RowDiagCrnt, 1).Value = "'" & CtrlChrStg
            ' First occurrence
            .Cells(RowDiagCrnt, 2).Value = NumFileOut
            .Cells(RowDiagCrnt, 3).Value = NumLine
          End If
          ' Last occurrence
          .Cells(RowDiagCrnt, 4).Value = NumFileOut
          .Cells(RowDiagCrnt, 5).Value = NumLine
        End With
        FileOut.writeline Right("00000" & NumLine, 6) & " " & _
                                                     LineOut & CtrlChrStg
        PosStart = PosCrnt          ' Start of current line
        NumLine = NumLine + 1
        If NumLine Mod 100 = 0 Then
          With Worksheets("DiagInfo")
           .Cells(RowDiagNext, 2).Value = NumFileOut
           .Cells(RowDiagNext, 3).Value = NumLine
          End With
        End If
      Else
        PosCrnt = PosCrnt + 1
      End If
    Loop
    FileOut.Close
    ' Save trailing characters for next line
    TrailingFromLastBlock = Mid(Block, PosStart, Len(Block) - PosStart + 1)
  Loop

  FileIn.Close

  With Worksheets("DiagInfo")
    .Cells(RowDiagNext, 2).Value = ""
    .Cells(RowDiagNext, 3).Value = ""
    .Cells(3, 1).Select
    .Cells.Columns.AutoFit
  End With

  Debug.Print Timer - StartTime

End Sub

Revised solution

Review of the analysis output revealed the true problems were:

  • A large number of blank lines.
  • extra line feeds.

There were also tabs within the text but the questioner decided these were not a problem and were to be retained. The questioner wanted the blank lines removed and the line feeds replaced by spaces.

The routine below reads the input files in blocks of 100,000 bytes. There are significant overheads associated with updating long strings. Limited experimentation suggests that 100,000 is an acceptable compromise. If the last character of a block is a control character then the routine loops adding another character to the block until the last character is not a control character. This ensure that no sequence of control characters is split across two blocks. The routine first loops replacing CR LF CR LF by CR LF until there are no blank lines. The routine then looks for LFs not preceeded by CRs. Any that are found are replaced by spaces. On a 63 Mb file with a large number of blank lines and extra LFs, the routine took 22 seconds to complete its task.

The only statements requiring change are at the top of the routine.

Option Explicit
Sub RemoveUnwantedCtrlChars()

  Dim Block As String
  Dim BlockLen As Long
  Dim FileIn As Object
  Dim FileInName As String
  Dim FileOut As Object
  Dim FileOutName As String
  Dim FSO As Object
  Dim PathCrnt As String
  Dim PosCRLF As Long
  Dim PosLF As Long
  Dim PosLastCRLF As Long
  Dim PosLastLF As Long
  Dim StartTime As Single

  StartTime = Timer

  ' ## This assumes the input file is in the same folder
  ' ## as the workbook containing this macro.
  PathCrnt = ActiveWorkbook.Path

  ' ###### Replace names as required.
  FileInName = "TestSplitLine In.txt"
  FileOutName = "TestSplitLine Out.txt"

  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 100000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "" & FileInName, 1, 0)
  '  1 = Read.  0 = ASCII file

  Set FileOut = FSO.CreateTextFile(PathCrnt & "" & FileOutName, True, False)
  ' True = Can overwrite.  False = ASCII

  Do While FileIn.AtEndOfStream <> True
    Block = FileIn.Read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        ' character
        Block = Block & FileIn.Read(1)
      Else
        Exit Do
      End If
    Loop
    ' Remove all blank lines
    Do While InStr(1, Block, vbCr & vbLf & vbCr & vbLf) <> 0
      Block = Replace(Block, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
    Loop
    ' Find all lone LFs and replace by " "
    PosLF = 1
    PosCRLF

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...