I have tried to puzzle out an answer to this question for many months while learning pandas.
I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative.
My question is this:

What are some best-practice workflows for accomplishing the following:

- Loading flat files into a permanent, on-disk database structure
- Querying that database to retrieve data to feed into a pandas data structure
- Updating the database after manipulating pieces in pandas
Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".
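To make the first two bullets concrete, here is a minimal sketch of the kind of thing I'm imagining, using pandas.read_csv with chunksize and HDFStore. The file name and the column names (line_of_business, var1, var2) are made up, and I don't know whether this is the idiomatic way to do it:

```python
import pandas as pd

# Hypothetical file and column names, just to show the shape of the workflow.
store = pd.HDFStore('consumer_data.h5')

# Load a flat file that is too big for memory, chunk by chunk.
nrows = 0
for chunk in pd.read_csv('big_flat_file.csv', chunksize=50000):
    # Keep a global row id so derived columns written later can line up.
    chunk.index = pd.RangeIndex(nrows, nrows + len(chunk))
    nrows += len(chunk)
    # 'append' writes in table format, which supports on-disk querying;
    # data_columns lets me filter on that field later with a 'where' clause.
    store.append('df', chunk, data_columns=['line_of_business'])

# Pull only the columns I need into memory for analysis.
subset = store.select('df', columns=['var1', 'var2'])

store.close()
```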
Edit -- an example of how I would like this to work:
- Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
- In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
- I would create new columns by performing various operations on the selected columns.
- I would then have to append these new columns into the database structure.
I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.
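If appending a column to an existing table in place really isn't possible, one workaround I can imagine is writing the derived columns to a separate table in the same HDFStore and joining them back on the index when needed. Again, the store, table, and column names here are hypothetical, and I'm not sure this is the right approach:

```python
import pandas as pd

store = pd.HDFStore('consumer_data.h5')

# Read all rows but only the columns needed for the derivation.
df = store.select('df', columns=['var1', 'var2'])

# Create the new column in memory (a made-up derivation).
df['newvar'] = df['var1'] * df['var2']

# Rather than altering the existing table's columns in place, store the
# derived column(s) in a separate table that shares the same row index...
store.append('derived', df[['newvar']])

# ...and recombine later by joining on the index.
combined = store.select('df', columns=['var1']).join(store.select('derived'))

store.close()
```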
Edit -- Responding to Jeff's questions specifically:
- I am building consumer credit risk models.
The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
- Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset. (A pandas sketch of this kind of step is at the end of this post.)
- Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
- A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
- It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations. (A sketch of this kind of row-subset query is at the end of this post.)
- The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
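To make the conditional-logic step above concrete, here is how I imagine expressing it in pandas once the needed columns are in memory. The column names are the made-up ones from my example, and I'm not sure numpy.select is the right tool:

```python
import numpy as np
import pandas as pd

# Toy data using the made-up columns from my example above.
df = pd.DataFrame({'var1': [1, 3, 2], 'var2': [4, 0, 4]})

# Equivalent of: if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'
conditions = [df['var1'] > 2, df['var2'] == 4]
df['newvar'] = np.select(conditions, ['A', 'B'], default='')
```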
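And here is the kind of row-subset reporting query I mentioned, assuming the store was built with line_of_business declared as a data column (the var1 and default_flag columns are also made up):

```python
import pandas as pd

store = pd.HDFStore('consumer_data.h5')

# Pull only retail records and only the columns I want to report on.
retail = store.select(
    'df',
    where="line_of_business == 'Retail'",
    columns=['line_of_business', 'var1', 'default_flag'],
)

# Simple frequency and crosstab for exploration within that line of business.
freq = retail['var1'].value_counts()
xtab = pd.crosstab(retail['var1'], retail['default_flag'])

store.close()
```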
Asked by Zelazny7, translated from Stack Overflow.