python - Counting line frequencies and producing output files

asked Jan 31, 2022 in Technique by 深蓝 (71.8m points)

With a textfile like this:

a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f

How can one read it and produce two output text files: one keeping only the lines that represent the most frequently occurring couple for each letter, and one keeping all the couples that include any of the top 25% most frequently occurring letters?

Sorry for not sharing any code. Been trying lots of stuff with list comprehensions, counts, and pandas, but not fluent enough.

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:21:34+0000

Here is an answer that does not use frozenset.

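import numpy as np
import pandas as pd

df = pd.read_csv('text.csv', header=None, names=['A', 'B'], sep=';')

# Sort the two letters within each row so that a;b and b;a become the same couple
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1), columns=['A', 'B'])
df_count = df1.groupby(['A', 'B']).size().reset_index(name='Count').sort_values('Count', ascending=False)

# Stack the counts twice, once keyed by each letter of the couple
df_all = pd.concat([df_count.assign(letter=df_count['A']),
                    df_count.assign(letter=df_count['B'])]).sort_values(['letter', 'Count'], ascending=[True, False])

# The first row of each letter group is that letter's most frequent couple
df_first = df_all.groupby('letter').first().reset_index()

# Top 25% of couples by count
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]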

------------older answer --------

Since order does not matter (a;b and b;a are the same couple), you can use a frozenset of each row as the key and count the keys with value_counts

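import pandas as pd

# Read the couples into two columns
df = pd.read_csv('text.csv', header=None, names=['A', 'B'], sep=';')
# A frozenset ignores order, so a;b and b;a map to the same key
s = df.apply(frozenset, axis=1)
# value_counts returns the couples sorted by count, most frequent first
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']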

Which will give you this

   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1

To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.

df_a = df_count.copy()
df_b = df_count.copy()

df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])

df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])

Since this is sorted by letter and then by count (descending), the first row of each group is the most frequent couple for that letter.

df_first = df_all.groupby('letter').first()
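
The same "first couple per letter" idea can be cross-checked without pandas. The following is only a minimal sketch using collections.Counter from the standard library, not the method used in this answer. The sample lines are copied from the question, and the names couples and best are made up for the illustration.

```python
from collections import Counter

# Sample lines copied from the question
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# Normalise each line so that a;b and b;a count as the same couple
couples = Counter(tuple(sorted(line.split(";"))) for line in lines)

# For each letter, keep the first couple seen while walking the counts
# from most frequent to least frequent
best = {}
for couple, count in couples.most_common():
    for letter in couple:
        # setdefault only stores a value the first time a letter is seen,
        # so the most frequent couple for that letter wins
        best.setdefault(letter, (couple, count))
```

When several couples tie for a letter, this sketch keeps whichever one most_common() returns first.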

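To get the top 25% of couples by count, take the first quarter of df_count, which value_counts has already sorted in descending order: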

top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
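
The question can also be read as asking for the top 25% of letters rather than the top 25% of couples. A minimal sketch of that reading follows. It assumes that "top 25%" means the first quarter of the distinct letters, rounded up, and that ties are broken in whatever order value_counts returns them. The names letter_counts, top_letters and df_top_letters are introduced here for illustration, and the sample data is inlined so the sketch runs without text.csv.

```python
import io
import math

import pandas as pd

# Sample data from the question, inlined instead of reading text.csv
raw = "a;b\nb;a\nc;d\nd;c\ne;a\nf;g\nh;b\nb;f\nb;f\nc;g\na;b\nd;f\n"
df = pd.read_csv(io.StringIO(raw), header=None, names=["A", "B"], sep=";")

# Count how often each letter appears in either column
letter_counts = pd.concat([df["A"], df["B"]]).value_counts()

# Assumption: the top 25% is the first quarter of the distinct letters, rounded up
n_top = math.ceil(len(letter_counts) / 4)
top_letters = set(letter_counts.index[:n_top])

# Keep every original line in which at least one letter is a top letter
df_top_letters = df[df["A"].isin(top_letters) | df["B"].isin(top_letters)]
```

With the sample data there are eight distinct letters, so two are kept. The letter b is the most frequent; a and f tie for second place, so which of them is kept depends on the tie order.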

Then use .to_csv to write each dataframe to a file.
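
A minimal sketch of the writing step, assuming the result frames have the letter columns A and B (as in the version without frozenset) and that the output should use the same ';' format as the input. The two small frames are stand-ins for df_first and df_top_25, and the file names and the temporary directory are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-ins for the df_first and df_top_25 results
df_first = pd.DataFrame({"letter": ["a", "b"], "A": ["a", "a"],
                         "B": ["b", "b"], "Count": [3, 3]})
df_top_25 = pd.DataFrame({"A": ["a", "b"], "B": ["b", "f"], "Count": [3, 2]})

# Hypothetical output location and file names
out_dir = tempfile.mkdtemp()
first_path = os.path.join(out_dir, "most_common_per_letter.txt")
top_path = os.path.join(out_dir, "top_25_percent.txt")

# Write only the two letter columns, without header or index, separated by ';'
df_first[["A", "B"]].to_csv(first_path, sep=";", header=False, index=False)
df_top_25[["A", "B"]].to_csv(top_path, sep=";", header=False, index=False)
```

Drop the column selection and the header=False argument if the counts should be written as well.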
