Spark Basic Statistics - Using Scala


Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
  
val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache()
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val summary = Statistics.colStats(parsedData)
println(summary.count) // total number of rows
println(summary.min) // column-wise minimum
println(summary.max) // column-wise maximum
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
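For the three-row test file above, this should print the following (note that colStats reports the unbiased sample variance; for the first column that is ((1-37)^2 + (10-37)^2 + (100-37)^2) / 2 = 2997):
3
[1.0,2.0,3.0]
[100.0,200.0,300.0]
[37.0,74.0,111.0]
[2997.0,11988.0,26973.0]
[3.0,3.0,3.0]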


Stratified sampling

The stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin for each observation to decide whether it will be sampled, so it requires only one pass over the data and yields an expected (approximate) sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but it provides the exact sample size with 99.99% confidence.


Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._

val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt")
val parsedData = data.map { line =>
  val sp = line.split(' ')
  (sp(0), sp(1).toInt) // key-value pairs: (group, count)
}.cache()

parsedData.foreach(println)
// desired sampling fraction for each key (stratum)
val fractions = Map("man" -> 0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3)

val approxSample = parsedData.sampleByKey(withReplacement = false, fractions).collect()
val exactSample = parsedData.sampleByKeyExact(withReplacement = false, fractions).collect()
println(approxSample.mkString(" "))
println(exactSample.mkString(" "))
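Both methods also accept an explicit seed as a third argument, which makes the sampling reproducible across runs; a minimal sketch (the seed value 11L is arbitrary):

// fixing the seed makes the pseudo-random sampling reproducible
val seededSample = parsedData.sampleByKey(false, fractions, 11L).collect()
println(seededSample.mkString(" "))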

Random data generation
import org.apache.spark.mllib.random.RandomRDDs._

// Generate 100 i.i.d. samples from the standard normal distribution N(0, 1), in 2 partitions.
val u = normalRDD(sc, 100L, 2)
// Apply a transform to get a random double RDD following N(1, 4).
val v = u.map(x => 1.0 + 2.0 * x)
println(u.collect().mkString(" "))
println(v.collect().mkString(" "))
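To sanity-check the transform, you can call stats() on the RDD[Double], which returns a StatCounter; the mean and variance should be close to 1 and 4, up to sampling noise:

// sanity check: mean should be close to 1.0, variance close to 4.0
val vStats = v.stats()
println(s"mean=${vStats.mean}, variance=${vStats.variance}")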

// Generate 100 i.i.d. samples from a Poisson distribution with mean 10.
val u = poissonRDD(sc, 10, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()

// Generate 100 i.i.d. samples from the uniform distribution U(0, 1).
val u = uniformRDD(sc, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()

Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5) // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))
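histogram can also take explicit bucket boundaries instead of a bucket count; a minimal sketch (the boundaries here are arbitrary):

// two custom buckets, [1, 50) and [50, 100]; the last bucket is closed on the right
ints.histogram(Array(1.0, 50.0, 100.0))
// should return Array(49, 51)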


