I am having an issue with running out of memory when I use the following configuration (config.yaml):

trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 10

I was following Google's "criteo_tft" tutorial: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/criteo_tft/config-large.yaml

That page says they were able to train on 1 TB of data, so I was impressed enough to give it a try.

My dataset is categorical, so one-hot encoding produces a fairly large matrix (a 2D numpy array of size 520000 x 4000). I can train on this dataset on a local machine with 32 GB of memory, but I cannot do the same in the cloud.
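
For scale, my own rough estimate (not from the tutorial) is that a single dense copy of that matrix is already around 15 GiB in float64, and training usually holds more than one copy in memory at once:

# Back-of-the-envelope memory estimate for the one-hot-encoded matrix.
# Assumes a dense float64 numpy array; float32 would halve this, and a
# scipy.sparse matrix would be far smaller for mostly-zero one-hot data.
rows, cols = 520_000, 4_000
bytes_per_value = 8  # float64

gib = rows * cols * bytes_per_value / 2**30
print(f"one dense copy: ~{gib:.1f} GiB")  # ~15.5 GiB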

Here are my errors:

ERROR   2017-12-18 12:57:37 +1100   worker-replica-1        Using TensorFlow 
backend.

ERROR   2017-12-18 12:57:37 +1100   worker-replica-4        Using TensorFlow                     
backend.

INFO    2017-12-18 12:57:37 +1100   worker-replica-0        Running command: 
python -m trainer.task --train-file gs://my_bucket/my_training_file.csv --
job-dir gs://my_bucket/my_bucket_20171218_125645

ERROR   2017-12-18 12:57:38 +1100   worker-replica-2        Using TensorFlow 
backend.

ERROR   2017-12-18 12:57:40 +1100   worker-replica-0        Using TensorFlow 
backend.

ERROR   2017-12-18 12:57:53 +1100   worker-replica-3        Command 
'['python', '-m', u'trainer.task', u'--train-file', 
u'gs://my_bucket/my_training_file.csv', '--job-dir', 
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9

INFO    2017-12-18 12:57:53 +1100   worker-replica-3        Module 
completed; cleaning up.

INFO    2017-12-18 12:57:53 +1100   worker-replica-3        Clean up 
finished.

ERROR   2017-12-18 12:57:56 +1100   worker-replica-4        Command 
'['python', '-m', u'trainer.task', u'--train-file', 
u'gs://my_bucket/my_training_file.csv', '--job-dir', 
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9

INFO    2017-12-18 12:57:56 +1100   worker-replica-4        Module 
completed; cleaning up.

INFO    2017-12-18 12:57:56 +1100   worker-replica-4        Clean up 
finished.

ERROR   2017-12-18 12:57:58 +1100   worker-replica-2        Command 
'['python', '-m', u'trainer.task', u'--train-file', 
u'gs://my_bucket/my_training_file.csv', '--job-dir', 
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9

INFO    2017-12-18 12:57:58 +1100   worker-replica-2        Module 
completed; cleaning up.

INFO    2017-12-18 12:57:58 +1100   worker-replica-2        Clean up 
finished.

ERROR   2017-12-18 12:57:59 +1100   worker-replica-1        Command 
'['python', '-m', u'trainer.task', u'--train-file', 
u'gs://my_bucket/my_training_file.csv', '--job-dir', 
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9

INFO    2017-12-18 12:57:59 +1100   worker-replica-1        Module 
completed; cleaning up.

INFO    2017-12-18 12:57:59 +1100   worker-replica-1        Clean up finished.

ERROR   2017-12-18 12:58:01 +1100   worker-replica-0        Command 
'['python', '-m', u'trainer.task', u'--train-file', 
u'gs://my_bucket/my_training_file.csv', '--job-dir', 
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit   status -9

INFO    2017-12-18 12:58:01 +1100   worker-replica-0        Module 
completed; cleaning up.

INFO    2017-12-18 12:58:01 +1100   worker-replica-0        Clean up finished.

ERROR   2017-12-18 12:58:43 +1100   service     The replica worker 0 ran 
out-of-memory and exited with a non-zero status of 247. The replica worker 1 
ran out-of-memory and exited with a non-zero status of 247. The replica 
worker 2 ran out-of-memory and exited with a non-zero status of 247. The 
replica worker 3 ran out-of-memory and exited with a non-zero status of 247. 
The replica worker 4 ran out-of-memory and exited with a non-zero status of 
247. To find out more about why your job exited please check the logs:  
https://console.cloud.google.com/logs/viewer?project=a_project_id........(link to my cloud log)

INFO    2017-12-18 12:58:44 +1100   ps-replica-0        Signal 15 (SIGTERM) 
was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-1        Signal 15 (SIGTERM) 
was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-0        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-0        Clean up finished.

INFO    2017-12-18 12:58:44 +1100   ps-replica-1        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-1        Clean up finished.

INFO    2017-12-18 12:58:44 +1100   ps-replica-2        Signal 15 
(SIGTERM) was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-2        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-2        Clean up finished.

INFO    2017-12-18 12:58:44 +1100   ps-replica-3        Signal 15 (SIGTERM) 
was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-5        Signal 15 (SIGTERM) 
was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-3        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-3        Clean up finished.

INFO    2017-12-18 12:58:44 +1100   ps-replica-5        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-5        Clean up finished.

INFO    2017-12-18 12:58:44 +1100   ps-replica-4        Signal 15 (SIGTERM) 
was caught. Terminated by service. This is normal behavior.

INFO    2017-12-18 12:58:44 +1100   ps-replica-4        Module completed; 
cleaning up.

INFO    2017-12-18 12:58:44 +1100   ps-replica-4        Clean up finished.

INFO    2017-12-18 12:59:28 +1100   service     Finished tearing down 
TensorFlow.

INFO    2017-12-18 13:00:17 +1100   service     Job failed.

Please don't worry about the "Using TensorFlow backend." errors; I get them even when a training job succeeds on other, smaller datasets.

Can anyone explain what is causing the out-of-memory failures (exit status 247) and how I should write my config.yaml file to avoid them, so I can train my data in the cloud?
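
For reference, one variant I am considering (just a sketch on my part; I am assuming complex_model_l is a valid, larger-memory machine type for the CUSTOM scale tier and that my quota allows it):

trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: complex_model_l      # assumed to offer more memory per worker than complex_model_m
  parameterServerType: large_model
  workerCount: 5                   # fewer, larger workers
  parameterServerCount: 5

I am not sure whether bigger workers alone are enough, or whether the trainer also needs to load the matrix more lazily (for example in batches, or as a sparse array).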
