import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values of its nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
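To make the sparse representation concrete, here is a minimal plain-Java sketch (no Spark dependency) that expands a (size, indices, values) triple into a dense array. The class name SparseToDense and its method are hypothetical and only illustrate the idea; this is not how MLlib implements its vectors.

```java
import java.util.Arrays;

// Illustration only: SparseToDense is a hypothetical helper, not part of MLlib.
final class SparseToDense {
    // Expands (size, indices, values) into a dense array; positions not listed stay 0.0.
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Same arguments as the Vectors.sparse call above.
        double[] dense = toDense(3, new int[] {0, 2}, new double[] {1.0, 3.0});
        System.out.println(Arrays.toString(dense)); // prints [1.0, 0.0, 3.0]
    }
}
```

The sparse form stores only two indices and two values here, which is why it pays off when most entries are zero.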
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
// Create a labeled point with a negative label and a sparse feature vector.
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
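The default format used by LIBSVM and LIBLINEAR (an SVM library and a linear classification library developed by Professor Chih-Jen Lin in Taiwan) is a text format in which each line represents one labeled sparse feature vector, for example:
label index1:value1 index2:value2 ...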
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.api.java.JavaRDD;

// jsc is an existing JavaSparkContext.
JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
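To show what one line of the LIBSVM text format carries, here is a minimal plain-Java sketch that parses a single "label index1:value1 index2:value2 ..." line. The class LibSvmLine is hypothetical and is not MLlib code. It assumes the file's indices are one-based and shifts them to zero-based, which is the behavior the Spark documentation describes for loadLibSVMFile; error handling for malformed lines is left out.

```java
import java.util.Arrays;

// Illustration only: LibSvmLine is a hypothetical helper, not part of MLlib.
final class LibSvmLine {
    final double label;
    final int[] indices;   // zero-based after parsing
    final double[] values;

    LibSvmLine(double label, int[] indices, double[] values) {
        this.label = label;
        this.indices = indices;
        this.values = values;
    }

    // Parses one line of the form "label index1:value1 index2:value2 ...".
    static LibSvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        for (int k = 1; k < tokens.length; k++) {
            String[] pair = tokens[k].split(":");
            indices[k - 1] = Integer.parseInt(pair[0]) - 1; // assumed one-based in the file
            values[k - 1] = Double.parseDouble(pair[1]);
        }
        return new LibSvmLine(label, indices, values);
    }

    public static void main(String[] args) {
        LibSvmLine p = parse("1 1:1.0 3:3.0");
        System.out.println(p.label + " " + Arrays.toString(p.indices) + " " + Arrays.toString(p.values));
        // prints 1.0 [0, 2] [1.0, 3.0]
    }
}
```

The parsed result corresponds to a labeled point with label 1.0 and the sparse feature vector (1.0, 0.0, 3.0), provided the vector size is known to be 3.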
[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
, and the matrix size is (3, 2).

import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)); values are given in column-major order.
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) from column pointers, row indices, and values.
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
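The two storage layouts used above can be checked by hand. The following plain-Java sketch (hypothetical class LocalMatrixLayout, no Spark dependency) looks up an entry in a column-major array and expands the compressed sparse column (CSC) arrays into a full matrix. It only illustrates the layouts and makes no claim about MLlib's internals.

```java
import java.util.Arrays;

// Illustration only: LocalMatrixLayout is a hypothetical helper, not part of MLlib.
final class LocalMatrixLayout {
    // Dense, column-major: entry (i, j) sits at values[i + j * numRows].
    static double denseAt(int numRows, double[] values, int i, int j) {
        return values[i + j * numRows];
    }

    // CSC: column j owns positions colPtrs[j] .. colPtrs[j + 1] - 1 of rowIndices and values.
    static double[][] cscToDense(int numRows, int numCols,
                                 int[] colPtrs, int[] rowIndices, double[] values) {
        double[][] dense = new double[numRows][numCols];
        for (int j = 0; j < numCols; j++) {
            for (int k = colPtrs[j]; k < colPtrs[j + 1]; k++) {
                dense[rowIndices[k]][j] = values[k];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        double[] dm = {1.0, 3.0, 5.0, 2.0, 4.0, 6.0};
        System.out.println(denseAt(3, dm, 2, 1)); // prints 6.0

        double[][] sm = cscToDense(3, 2,
            new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
        System.out.println(Arrays.deepToString(sm));
        // prints [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
    }
}
```

Reading the column pointers {0, 1, 3}: column 0 holds one stored value (9 at row 0) and column 1 holds two (6 at row 2, 8 at row 1), which matches the matrix in the comment of the Matrices.sparse call.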
Distributed matrix:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.QRDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// QR decomposition
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Drop its row indices.
RowMatrix rowMat = mat.toRowMatrix();
A CoordinateMatrix is also a distributed matrix backed by an RDD. Each RDD element is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are very large and the matrix is very sparse.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
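As a rough picture of what turning coordinate entries into sparse rows involves, the plain-Java sketch below groups (i, j, value) triples by row index on a single machine. The class CoordinateEntries is hypothetical; it is a conceptual illustration only, not the distributed implementation behind toIndexedRowMatrix.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: CoordinateEntries is a hypothetical helper, not part of MLlib.
final class CoordinateEntries {
    // Groups (i, j, value) triples by row index i; each row maps column j to its value.
    static Map<Long, Map<Long, Double>> groupByRow(long[] is, long[] js, double[] vs) {
        Map<Long, Map<Long, Double>> rows = new TreeMap<>();
        for (int k = 0; k < vs.length; k++) {
            rows.computeIfAbsent(is[k], r -> new TreeMap<>()).put(js[k], vs[k]);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Three entries: (0, 1, 4.0), (2, 0, 5.0), (2, 3, 6.0).
        Map<Long, Map<Long, Double>> rows = groupByRow(
            new long[] {0, 2, 2}, new long[] {1, 0, 3}, new double[] {4.0, 5.0, 6.0});
        System.out.println(rows); // prints {0={1=4.0}, 2={0=5.0, 3=6.0}}
    }
}
```

Row 1 has no entries and therefore does not appear at all, which is the reason the coordinate format suits very sparse matrices.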
rowsPerBlock x colsPerBlock. BlockMatrix supports operations such as add and multiply with another BlockMatrix. BlockMatrix also has a helper method, validate, which checks whether the BlockMatrix is set up properly.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.BlockMatrix;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix coordMat = new CoordinateMatrix(entries.rdd());
// Transform the CoordinateMatrix to a BlockMatrix.
BlockMatrix matA = coordMat.toBlockMatrix().cache();

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate();

// Calculate A^T A.
BlockMatrix ata = matA.transpose().multiply(matA);
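To illustrate how a block size of rowsPerBlock x colsPerBlock partitions a matrix, the plain-Java sketch below computes which block a global entry (i, j) falls into and its position inside that block. The class BlockIndex is hypothetical and not part of MLlib. The 1024 x 1024 block size used in the example is the default that the Spark documentation gives for toBlockMatrix() without arguments; treat it as an assumption here.

```java
import java.util.Arrays;

// Illustration only: BlockIndex is a hypothetical helper, not part of MLlib.
final class BlockIndex {
    // Returns {blockRow, blockCol, rowInBlock, colInBlock} for the global entry (i, j).
    static int[] locate(long i, long j, int rowsPerBlock, int colsPerBlock) {
        return new int[] {
            (int) (i / rowsPerBlock), (int) (j / colsPerBlock),
            (int) (i % rowsPerBlock), (int) (j % colsPerBlock)
        };
    }

    public static void main(String[] args) {
        // Entry (2500, 10) with 1024 x 1024 blocks (assumed default block size).
        int[] pos = locate(2500L, 10L, 1024, 1024);
        System.out.println(Arrays.toString(pos)); // prints [2, 0, 452, 10]
    }
}
```

Entry (2500, 10) lands in block (2, 0) at local position (452, 10), since 2500 = 2 * 1024 + 452.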
Original article: http://www.cnblogs.com/yuguoshuo/p/6265704.html