How can I get a single random row from a PySpark DataFrame? The only method I see is sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows gives back a random number of rows: sometimes I get no row at all.
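To illustrate the problem, here is a minimal simulation in plain Python (no Spark required), assuming sample() makes an independent Bernoulli draw per row. With fraction = 1/n, the sample comes back empty roughly (1 - 1/n)^n, about 37% of the time:

```python
import random

random.seed(0)
n = 100           # pretend the DataFrame has 100 rows
fraction = 1 / n  # the fraction I would pass to sample()
trials = 10_000

# Count how often per-row Bernoulli sampling with p = 1/n returns nothing.
empty = sum(
    1
    for _ in range(trials)
    if not any(random.random() < fraction for _ in range(n))
)

print(empty / trials)  # roughly (1 - 1/n)**n, i.e. about 0.37
```

So even with the "correct" fraction, about a third of my calls would return an empty DataFrame.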
On an RDD there is a method takeSample() that takes the number of elements you want the sample to contain as a parameter. I understand that this might be slow, since every partition has to be counted, but is there a way to get something like this on a DataFrame?