Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I use Python 3.

Okay, I have a file that looks like this:

id:1
1
34
22
52
id:2
1
23
22
31
id:3
2
12
3
31
id:4
1
21
22
11

How can I find and delete only this part of the file?

id:2
1
23
22
31

I have been trying a lot to do this but can't get it to work.


1 Answer

Is the id what decides which sequence gets deleted, or is it the list of values?

You can build a dictionary where the id number is the key (converted to int so that it sorts numerically later) and the value is the list of the following lines as strings. Then delete the item with key 2, traverse the remaining items sorted by key, and write out each id:key line followed by the formatted list of strings.
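A minimal sketch of the dictionary idea, working on an in-memory list of lines rather than a file. The names parse_sections and format_sections are made up for this illustration, and the sample data is a shortened version of the file from the question.

```python
import re

def parse_sections(lines):
    """Group lines into {id_number: [value lines]}.

    Lines that appear before the first id-line are ignored.
    """
    rex_id = re.compile(r'^id:(\d+)\s*$')
    sections = {}
    current = None
    for line in lines:
        m = rex_id.match(line)
        if m:
            current = int(m.group(1))      # int key, so sorting is numeric
            sections[current] = []
        elif current is not None:
            sections[current].append(line.rstrip('\n'))
    return sections

def format_sections(sections):
    """Return the lines of all sections, ordered by id."""
    out = []
    for key in sorted(sections):
        out.append('id:%d' % key)
        out.extend(sections[key])
    return out

sample = ['id:1', '1', '34', 'id:2', '1', '23', 'id:3', '2', '12']
sections = parse_sections(sample)
sections.pop(2, None)                      # drop section 2 if it exists
result = format_sections(sections)
```

Using pop with a default avoids a KeyError when the id is not present in the file.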

Or you can build a list of lists, which preserves the order. If the original ids are to be preserved (i.e. not renumbered), you can also store the id:n line in the inner list.
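The list-of-lists variant could be sketched like this, assuming each inner list starts with its own id:n line so the original numbering survives. The function name split_sections is hypothetical.

```python
def split_sections(lines):
    """Split lines into a list of sections; each section is a list
    whose first element is the id-line itself."""
    sections = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('id:'):
            sections.append([line])        # start a new section
        elif sections:
            sections[-1].append(line)      # value line of the current section
        # lines before the first id-line are ignored
    return sections

sample = ['id:1', '1', '34', 'id:2', '1', '23', 'id:3', '2', '12']
kept = [sec for sec in split_sections(sample) if sec[0] != 'id:2']
flat = [line for sec in kept for line in sec]
```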

Both work for a reasonably sized file. If the file is huge, you should copy the source to the destination and skip the unwanted sequence on the fly. That last approach is also fairly easy for a small file.

[added after the clarification]

I recommend learning the following approach, which is useful in many such cases. It uses a so-called finite automaton that implements actions bound to transitions from one state to another (see Mealy machine).

The text line is the input element here. The nodes that represent the context status are numbered. (In my experience it is not worth giving them names -- keep them as plain numbers.) Only two states are used here, so the status could easily be replaced by a boolean variable. However, once the case becomes more complicated, that leads to introducing another boolean variable, and the code becomes more error-prone.
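For comparison, this is roughly what the two-state case looks like with a single boolean flag. It is only a sketch: filter_lines is a hypothetical helper that works on any iterable of lines instead of on file names.

```python
import re

def filter_lines(del_set, lines):
    """Yield the lines, skipping every section whose id is in del_set."""
    rex_id = re.compile(r'^id:(\d+)\s*$')
    skipping = True            # lines before the first id-line are dropped
    for line in lines:
        m = rex_id.match(line)
        if m:
            # An id-line decides whether the section that follows is skipped.
            skipping = int(m.group(1)) in del_set
        if not skipping:
            yield line

sample = ['id:1\n', '1\n', 'id:2\n', '23\n', 'id:3\n', '12\n']
result = list(filter_lines({2}, sample))
```

This is shorter, but every additional kind of context would need another flag, which is where the numbered states start to pay off.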

The code may look complicated at first, but it is fairly easy to understand once you realize that you can think about each if status == number branch separately. The status is the context mentioned above: it captures the result of the previous processing. Do not try to optimize; leave the code that way. It can then be decoded by a human later, and you can draw a picture similar to the Mealy machine example. If you do, it becomes much more understandable.

The wanted functionality is a bit generalized -- a set of ignored sections can be passed as the first argument:

import re

def filterSections(del_set, fname_in, fname_out):
    '''Filtering out the del_set sections from fname_in. Result in fname_out.'''

    # The regular expression was chosen for detecting and parsing the id-line.
    # It can be done differently, but I consider it just fine and efficient.
    rex_id = re.compile(r'^id:(\d+)\s*$')

    # Let's open the input and output file. The files will be closed
    # automatically.
    with open(fname_in) as fin, open(fname_out, 'w') as fout:
        status = 1                 # initial status -- expecting the id line
        for line in fin:
            m = rex_id.match(line) # get the match object if it is the id-line

            if status == 1:      # skipping the non-id lines
                if m:              # you can also write "if m is not None:"
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # or pass (to stay in this status)
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # to copy the following non-id lines
                #else ignore this line (no code needed to ignore it :)

            elif status == 2:      # copy the non-id lines
                if m:                         # the id-line found
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # skip the following non-id lines
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # to copy the following non-id lines
                else:
                    fout.write(line)          # copy this non-id line


if __name__ == '__main__':
    filterSections({1, 3}, 'data.txt', 'output.txt')
    # or you can write the older set([1, 3]) for the first argument;
    # pass {2} to drop only the id:2 section from the question.

Here the output id-lines were given their original numbers. If you want to renumber the sections, it can be done via a simple modification. Try the code and ask for details.
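One possible shape of the renumbering modification, sketched on an in-memory list of lines so it is easy to try out. The name filter_and_renumber and the counter new_id are assumptions for this example, not part of the code above.

```python
import re

def filter_and_renumber(del_set, lines):
    """Drop the sections listed in del_set and number the kept ones 1, 2, 3..."""
    rex_id = re.compile(r'^id:(\d+)\s*$')
    out = []
    new_id = 0                 # counter of the sections kept so far
    status = 1                 # 1 = skipping non-id lines, 2 = copying them
    for line in lines:
        m = rex_id.match(line)
        if m:
            if int(m.group(1)) in del_set:
                status = 1                      # skip this whole section
            else:
                new_id += 1
                out.append('id:%d\n' % new_id)  # write the new number
                status = 2
        elif status == 2:
            out.append(line)                    # copy a non-id line
    return out

sample = ['id:1\n', '1\n', 'id:2\n', '23\n', 'id:3\n', '12\n']
renumbered = filter_and_renumber({2}, sample)
```

The only real change is that the kept id-line is rebuilt from the counter instead of being copied.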

Beware, finite automata have limited power. They cannot be used to parse the usual programming languages, as they are not able to capture nested paired structures (like parentheses).
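To illustrate the limit: checking nested parentheses needs a counter (or a stack) that can grow without bound, which a fixed set of numbered states cannot represent. A small sketch, with is_balanced as a made-up name:

```python
def is_balanced(text):
    """Return True if every '(' in text has a matching ')'."""
    depth = 0                  # unbounded counter, beyond a finite automaton
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' without a preceding '('
                return False
    return depth == 0
```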

P.S. 7000 lines is actually a tiny file from a computer's perspective ;)
