
3 Spark Sampling Examples

Yasset Perez-Riverol edited this page Oct 27, 2015 · 1 revision

Spark's sampling functions let you draw random samples from an RDD, or take just a few of its elements. Spark has two sampling operations: the transformation sample and the action takeSample. With the transformation, we tell Spark to apply successive transformations to a sample of a given RDD. With the action, we retrieve a sample into local memory, where any other standard library can use it.

  1. The sample transformation takes up to three parameters (see SparkSampling):

1.1. The first is whether the sampling is done with replacement or not.

1.2. The second is the expected sample size, given as a fraction of the RDD size (0.1 keeps roughly 10% of the elements).

1.3. Finally, we can optionally provide a random seed, which makes the sample reproducible.

   
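   // Assumes an existing JavaSparkContext `sc` and a java.io.File `outputFile` with the input data.
   JavaRDD<String> rawData = sc.textFile(outputFile.getAbsolutePath());
   // Sample without replacement, keeping roughly 10% of the lines, with a fixed seed.
   JavaRDD<String> sampledData = rawData.sample(false, 0.1, 1234);
   
   // count() is an action, so the sampling only runs at this point.
   long sampleDataSize = sampledData.count();
   long rawDataSize = rawData.count();
   System.out.println("Raw data size: " + rawDataSize + ", after the sampling: " + sampleDataSize);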
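
The difference between a fraction-based sample and a fixed-size sample can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Spark's implementation: the class `SamplingSketch` and the methods `sampleByFraction` and `takeExact` are hypothetical names introduced here. It assumes that sampling without replacement keeps each element independently with probability equal to the fraction, which is why the count printed above is only close to 10% of the raw size, and that a fixed seed reproduces the same sample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java illustration only; this is not how Spark is implemented.
final class SamplingSketch {

    // Mimics the behaviour of sample(false, fraction, seed): each element is
    // kept independently with probability `fraction`, so the size varies.
    static <T> List<T> sampleByFraction(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < fraction) {
                kept.add(item);
            }
        }
        return kept;
    }

    // Mimics the contract of takeSample(false, num, seed): exactly `num`
    // distinct positions (or everything, if fewer exist) as a local list.
    static <T> List<T> takeExact(List<T> data, int num, long seed) {
        List<T> copy = new ArrayList<>(data);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(num, copy.size())));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            data.add(i);
        }
        List<Integer> first = sampleByFraction(data, 0.1, 1234L);
        List<Integer> second = sampleByFraction(data, 0.1, 1234L);
        // The size is close to 1000, but it is not guaranteed to be exactly 1000.
        System.out.println("Fraction-based sample size: " + first.size());
        // The same seed yields the same sample.
        System.out.println("Same seed, same sample: " + first.equals(second));
        // The fixed-size variant always returns the requested number of elements.
        System.out.println("Fixed-size sample size: " + takeExact(data, 10, 1234L).size());
    }
}
```

In Spark itself, the corresponding calls on the RDD above would be `rawData.sample(false, 0.1, 1234)` for the transformation and `rawData.takeSample(false, 10, 1234)` for the action, which returns a `java.util.List<String>` held in the driver's memory.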
 
