Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a text file with the following content:

Lemonade juice whiskey beer soda vodka

In Python, reading from that same .txt file, I would like to output word pairs in the following order:

  • juice-lemonade
  • whiskey-juice
  • beer-whiskey
  • soda-beer
  • vodka-soda

I managed to output something like that by using a list instead of opening a file in Python, but for a much larger .txt file that is not a practical solution. As a bonus task, I would also like to output the probability of each of those pairs. Any hint would be highly appreciated.

Question from: https://stackoverflow.com/questions/65928201/counting-word-pairs-from-a-text-file-python


1 Answer

To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.

A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.

A second generator can then combine two consecutive words at a time and yield pairs.

from typing import Iterator

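def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    """Yield lowercased words one at a time, reading one character at a time."""
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                # End of file: emit the last word if the file does not end in whitespace.
                if word:
                    yield word.lower()
                return
            if character.isspace():
                # Consecutive whitespace yields '' here; pair_generator skips empty words.
                yield word.lower()
                word = ''
            else:
                word += character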


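def pair_generator(text_file: str) -> Iterator[str]:
    """Yield 'previous-current' for each pair of adjacent words in the file."""
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        # Skip empty words and wait until a previous word is available.
        if previous_word and word:
            yield f'{previous_word}-{word}'
        # Keep the last non-empty word so blank lines do not break the chain.
        previous_word = word or previous_word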


for pair in pair_generator('filename.txt'):
    print(pair)

Assuming filename.txt contains:

Lemonade juice whiskey beer soda vodka

cola tequila lemonade juice

You should see something like:

lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
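
If the lines in your file are of moderate length, the line-by-line approach mentioned at the start of this answer may be simpler. Here is a minimal sketch, assuming words are separated by whitespace. The names words_by_line and pairs_by_line are illustrative and are not part of the snippet above.

```python
from typing import Iterator


def words_by_line(text_file: str) -> Iterator[str]:
    """Yield lowercased words, holding only one line in memory at a time."""
    with open(text_file) as text:
        for line in text:
            # str.split() with no arguments splits on any run of whitespace
            # and never returns empty strings, so blank lines yield nothing.
            for word in line.split():
                yield word.lower()


def pairs_by_line(text_file: str) -> Iterator[str]:
    """Yield 'previous-current' for each pair of adjacent words."""
    previous_word = ''
    for word in words_by_line(text_file):
        if previous_word:
            yield f'{previous_word}-{word}'
        previous_word = word
```

For the sample filename.txt above, iterating over pairs_by_line('filename.txt') should give the same nine pairs.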

Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
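
For the bonus task, one possible reading of "probability" is the relative frequency of each pair among all pairs in the file. That reading is an assumption, since the question does not define it. Here is a rough sketch using collections.Counter. The helper name pair_probabilities is made up for illustration, and the example list simply repeats the pairs from the output shown earlier.

```python
from collections import Counter
from typing import Dict, Iterable


def pair_probabilities(pairs: Iterable[str]) -> Dict[str, float]:
    """Map each pair to its share of all pairs seen (relative frequency)."""
    counts = Counter(pairs)  # consumes the iterable once; one entry per distinct pair
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}


# Illustrative input: the nine pairs from the example output earlier in this answer.
example_pairs = [
    'lemonade-juice', 'juice-whiskey', 'whiskey-beer', 'beer-soda',
    'soda-vodka', 'vodka-cola', 'cola-tequila', 'tequila-lemonade',
    'lemonade-juice',
]
for pair, probability in pair_probabilities(example_pairs).items():
    print(f'{pair}: {probability:.3f}')
```

Here lemonade-juice occurs twice out of nine pairs, so its relative frequency is 2/9. Every other pair gets 1/9. Because the function accepts any iterable, you can pass pair_generator('filename.txt') to it directly. Only the counts are kept in memory, so memory use grows with the number of distinct pairs, not with the file size.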

