Spark Basic Statistics - Using Scala


Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
  
val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache()
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val summary = Statistics.colStats(parsedData)
println(summary.count) // total number of rows
println(summary.min) // column-wise minimum
println(summary.max) // column-wise maximum
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
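For the three-row test file above, this should print the following (note that colStats reports the unbiased sample variance; for the first column that is ((1-37)^2 + (10-37)^2 + (100-37)^2) / 2 = 2997):
3
[1.0,2.0,3.0]
[100.0,200.0,300.0]
[37.0,74.0,111.0]
[2997.0,11988.0,26973.0]
[3.0,3.0,3.0]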


Stratified sampling

The stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin for each observation to decide whether it will be sampled, so it requires only one pass over the data and yields an expected (approximate) sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but it provides the exact sample size with 99.99% confidence.


Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._

val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt")
val parsedData = data.map { line =>
  val sp = line.split(' ')
  (sp(0), sp(1).toInt) // key-value pairs: (group, count)
}.cache()

parsedData.foreach(println)
// desired sampling fraction for each key (stratum)
val fractions = Map("man" -> 0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3)

val approxSample = parsedData.sampleByKey(withReplacement = false, fractions).collect()
val exactSample = parsedData.sampleByKeyExact(withReplacement = false, fractions).collect()
println(approxSample.mkString(" "))
println(exactSample.mkString(" "))
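Both methods also accept an explicit seed as a third argument, which makes the sampling reproducible across runs; a minimal sketch (the seed value 11L is arbitrary):

// fixing the seed makes the pseudo-random sampling reproducible
val seededSample = parsedData.sampleByKey(false, fractions, 11L).collect()
println(seededSample.mkString(" "))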

Random data generation
import org.apache.spark.mllib.random.RandomRDDs._

// Generate 100 i.i.d. samples from the standard normal distribution N(0, 1), in 2 partitions.
val u = normalRDD(sc, 100L, 2)
// Apply a transform to get a random double RDD following N(1, 4).
val v = u.map(x => 1.0 + 2.0 * x)
println(u.collect().mkString(" "))
println(v.collect().mkString(" "))
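To sanity-check the transform, you can call stats() on the RDD[Double], which returns a StatCounter; the mean and variance should be close to 1 and 4, up to sampling noise:

// sanity check: mean should be close to 1.0, variance close to 4.0
val vStats = v.stats()
println(s"mean=${vStats.mean}, variance=${vStats.variance}")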

// Generate 100 i.i.d. samples from a Poisson distribution with mean 10.
val u = poissonRDD(sc, 10, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()

// Generate 100 i.i.d. samples from the uniform distribution U(0, 1).
val u = uniformRDD(sc, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()

Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5) // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))
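histogram can also take explicit bucket boundaries instead of a bucket count; a minimal sketch (the boundaries here are arbitrary):

// two custom buckets, [1, 50) and [50, 100]; the last bucket is closed on the right
ints.histogram(Array(1.0, 50.0, 100.0))
// should return Array(49, 51)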


