3 Spark Sampling Examples
Yasset Perez-Riverol edited this page Oct 27, 2015
Spark's sampling functions let us draw random samples from an RDD, either as a fraction of the data or as a small, fixed number of elements. Spark has two sampling operations: the transformation sample and the action takeSample. With the transformation, we tell Spark to apply successive transformations to a sample of a given RDD. With the action, we retrieve a sample into local memory, where any other standard library can use it.
- The sample transformation takes up to three parameters (SparkSampling):
1.1. The first is whether the sampling is done with replacement or not.
1.2. The second is the sample size as a fraction of the RDD.
1.3. Finally, we can optionally provide a random seed.
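// sc is assumed to be an existing JavaSparkContext and outputFile the input text file.
JavaRDD<String> rawData = sc.textFile(outputFile.getAbsolutePath());
// Sample without replacement, a fraction of 0.1 (about 10% of the lines), seed 1234.
JavaRDD<String> sampledData = rawData.sample(false, 0.1, 1234);
long sampleDataSize = sampledData.count();
long rawDataSize = rawData.count();
System.out.println(rawDataSize + " and after the sampling: " + sampleDataSize);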
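
The fraction passed to sample is an expected size, not an exact one, so the count printed above will usually be close to, but not exactly, 10% of the raw data. The plain-Java sketch below illustrates why, under the assumption that sampling without replacement keeps each element independently with probability equal to the fraction. It does not use Spark and is not Spark's implementation; the class SamplingSketch and the helpers sampleWithoutReplacement and takeFixed are hypothetical names made up for this illustration. takeFixed mimics the idea behind takeSample, which returns a fixed number of elements as a local list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java illustration only; this is NOT Spark's implementation.
final class SamplingSketch {

    // Analogue of sample(false, fraction, seed): each element is kept
    // independently with probability `fraction`, so the result size is
    // only approximately fraction * data.size().
    static <T> List<T> sampleWithoutReplacement(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < fraction) {
                out.add(item);
            }
        }
        return out;
    }

    // Analogue of takeSample(false, num, seed): returns exactly `num`
    // elements (or all of them if the data is smaller) as a local list.
    static <T> List<T> takeFixed(List<T> data, int num, long seed) {
        List<T> copy = new ArrayList<>(data);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(num, copy.size())));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            data.add(i);
        }
        List<Integer> first = sampleWithoutReplacement(data, 0.1, 1234L);
        List<Integer> second = sampleWithoutReplacement(data, 0.1, 1234L);
        System.out.println(data.size() + " and after the sampling: " + first.size());
        System.out.println("same seed gives same sample: " + first.equals(second));
        System.out.println("fixed-size sample: " + takeFixed(data, 5, 1234L).size());
    }
}
```

Running the sketch twice with the same seed yields the same sample, which is the reason for passing a seed when results need to be reproducible.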