Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

i have two dataframes (in pandas)

df1:

logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10

df2:

id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B

i want new dataframe like new_df:

logged_at, item, value, id
2021-01-03 20:01:23, A, 4, 2
2021-01-03 20:01:24, A, 5, 2
2021-01-03 20:01:25, B, 4, 3
2021-01-03 20:01:26, B, 7, 3
2021-01-03 20:01:27, A, 10, 2

What I want is to attach the ID of df2 to the column of df1.

The condition is that the logged_at time of df1 exists between the start_time and the end_time of df2.

The number of data in df1 exceeds 900,000 and the number of data in df2 exceeds 100,000.

It takes too long to attach each row of df1.

Is there an efficient way?

question from:https://stackoverflow.com/questions/65625736/how-to-efficiently-search-between-two-dataframes-in-python-pandas

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
198 views
Welcome To Ask or Share your Answers For Others

1 Answer

A simple merge does what you want with your sample data.

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)

new_df = df1.merge(df2.loc[:,["id","item"]], on="item")

output

           logged_at item  value  id
 2021-01-03 20:01:23    A      4   2
 2021-01-03 20:01:24    A      5   2
 2021-01-03 20:01:27    A     10   2
 2021-01-03 20:01:25    B      4   3
 2021-01-03 20:01:26    B      7   3

pandasql

Does what you specified, however your sample data in df2 looks wrong as it gives two rows for each row in df1

from pandasql import sqldf
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

pysqldf = lambda q: sqldf(q, globals())
pysqldf("""
select df1.*, df2.*
from df1 
left join df2 on df1.logged_at >= df2.start_time and df1.logged_at <= df2.end_time""")

pandasql output

                 logged_at item  value  id                  start_time                    end_time item
 2021-01-03 20:01:23.000000    A      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:23.000000    A      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:24.000000    A      5   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:24.000000    A      5   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:25.000000    B      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:25.000000    B      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:26.000000    B      7   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:26.000000    B      7   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:27.000000    A     10   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:27.000000    A     10   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...