The problem is from a recommendations project.
The data has ~300K users and ~200K items.
The user-item ratings matrix would be sparse and huge, far larger than what fits in RAM as a dense array.
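To put rough numbers on "huge", here is a back-of-the-envelope sketch comparing dense storage with a compressed sparse row (CSR) layout. The count of observed ratings (`assumed_nnz`) is a made-up figure for illustration only; the real value depends on the data.

```python
n_users, n_items = 300_000, 200_000

# Dense float32 matrix: 4 bytes for every (user, item) cell.
dense_bytes = n_users * n_items * 4

# CSR layout: one float32 value and one int32 column index per stored rating,
# plus one int32 row pointer per user (and one extra at the end).
assumed_nnz = 30_000_000  # hypothetical number of observed ratings
csr_bytes = assumed_nnz * (4 + 4) + (n_users + 1) * 4

print(f"dense: {dense_bytes / 1e9:.0f} GB")  # dense: 240 GB
print(f"CSR:   {csr_bytes / 1e9:.2f} GB")    # CSR:   0.24 GB
```

Under that assumption the sparse form is about a thousand times smaller, so the matrix itself may fit in memory as long as it is never densified.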
I first want to get latent representations of the users with PCA, and then run similarity analyses on those latent vectors using something like approximate nearest neighbors.
How can I approach this problem?
I have the option of using PySpark and/or sklearn.
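For the sklearn route, one possible shape of the pipeline is sketched below on toy data. It makes two substitutions that should be read as assumptions rather than as the intended method: `TruncatedSVD` stands in for PCA because it works on a sparse matrix without centering it (centering would densify the data), and an exact brute-force `NearestNeighbors` search stands in for an approximate nearest-neighbor index. The matrix sizes and the random ratings are invented for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real ratings: random (user, item, rating) triples.
rng = np.random.default_rng(0)
n_users, n_items, n_ratings = 2_000, 1_000, 40_000  # hypothetical sizes
rows = rng.integers(0, n_users, size=n_ratings)
cols = rng.integers(0, n_items, size=n_ratings)
vals = rng.integers(1, 6, size=n_ratings).astype(np.float32)

# Store only the observed ratings; duplicate (user, item) pairs are summed.
ratings = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

# Low-rank user representations computed directly from the sparse matrix.
svd = TruncatedSVD(n_components=32, random_state=0)
user_vecs = svd.fit_transform(ratings)  # shape: (n_users, 32)

# Exact cosine neighbors; an ANN library could replace this step at scale.
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute")
nn.fit(user_vecs)
dist, idx = nn.kneighbors(user_vecs[:5])  # neighbors of the first 5 users
```

The resulting `user_vecs` array is dense but small (users times components), which is what makes the later neighbor search tractable.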
Asked by candide; translated from SO.