To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.
A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.
You can have another generator that combines 2 words at a time and yields pairs.
from typing import Iterator
def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
word = ''
with open(text_file) as text:
while True:
character = text.read(1)
if not character:
return
if character.isspace():
yield word.lower()
word = ''
else:
word += character
def pair_generator(text_file: str) -> Iterator[str]:
previous_word = ''
for word in memory_efficient_word_generator(text_file):
if previous_word and word:
yield f'{previous_word}-{word}'
previous_word = word or previous_word
for pair in pair_generator('filename.txt'):
print(pair)
Assuming filename.txt
contains:
Lemonade juice whiskey beer soda vodka
cola tequila lemonade juice
You should see something like:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…