Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Hi I have data as follows in a spreadsheet

|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|----------|-------|----------|-------|----------|-------|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |

I want to import this (csv or xlsx) and arrive at a data frame that has only a single date index and columns of aaa-val, bbb-val and ccc-val e.g.

|          |aaa-val|bbb-val|ccc-val|
|----------|-------|-------|-------|
|26-03-2008|       |       |+1.401 |
|27-03-2008|       |       |+1.376 |
|28-03-2008|       |       |-0.534 |
|31-03-2008|       |-23.869|+0.688 |
|01-04-2008|       |-19.813|+3.336 |
|02-04-2008|       |-18.929|+3.413 |
|03-04-2008|       |-19.917|+4.177 |
|04-04-2008|       |-20.125|+4.279 |
|07-04-2008|       |-21.321|+2.445 |
|08-04-2008|-20.943|-22.517|+1.146 |
|09-04-2008|-20.943|-24.857|+0.917 |
|10-04-2008|-18.868|-24.701|+2.420 |
|11-04-2008|-19.057|-24.857|+1.860 |
|14-04-2008|-20.000|-26.053|+3.515 |
|15-04-2008|-18.868|-24.805|       |
|16-04-2008|-16.226|-23.557|       |
|17-04-2008|-14.340|-23.765|       |
|18-04-2008|-12.830|       |       |
|21-04-2008|-15.472|       |       |
|22-04-2008|-16.793|       |       |
|23-04-2008|-13.019|       |       |
|24-04-2008|-12.453|       |       |
|25-04-2008|-12.642|       |       | 

Is there a smart way to do this other than loading into a temp frame and then looping through date/value column pairs?

thanks


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
277 views
Welcome To Ask or Share your Answers For Others

1 Answer

I just found this article while looking up something else, and I believe it could help you:

https://pbpython.com/pandas-excel-range.html

Basically, you could read the file for specific column ranges (using the lambda method if you want to use column names) for each of the time/data ranges. I would then rename the date field to the same name or set the date field as the index. Then to multiple full outer joins to combine all of the data.

EDIT - a simple concat would not work as I originally wrote. I would advise full outer joins on the date column.

[from the link]

Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.

cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']
df = pd.read_excel(src_file,
                   header=1,
                   usecols=lambda x: x.lower() in cols_to_use)

EDIT SHOWING THE DIFFERENCE BETWEEN CONCAT AND MERGE:

import pandas as pd
import numpy as np
from common import  show_table

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
print(df1)
#    a  b
# 0  1  1
# 1  2  2

df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
print(df2)
#    a  c
# 0  1  1
# 1  3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#    a    b    c
# 0  1  1.0  NaN
# 1  2  2.0  NaN
# 0  1  NaN  1.0
# 1  3  NaN  3.0


# good
df4 = pd.merge(df1, df2, how='outer', on='a')
print(df4)
#    a    b    c
# 0  1  1.0  1.0
# 1  2  2.0  NaN
# 2  3  NaN  3.0

EDIT FOR INDEX VALIDATION - Concat on index does not do a full outer join

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
df1 = df1.set_index('a')
print(df1)
#    b
# a   
# 1  1
# 2  2
df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
df2 = df2.set_index('a')
print(df2)
#    c
# a   
# 1  1
# 3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#      b    c
# a          
# 1  1.0  NaN
# 2  2.0  NaN
# 1  NaN  1.0
# 3  NaN  3.0

# good
df4 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(df4)
#      b    c
# a          
# 1  1.0  1.0
# 2  2.0  NaN
# 3  NaN  3.0

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...