org.apache.spark.ml.recommendation
Computes the feature probabilities of sparse feature vector columns, using mapPartitions for efficiency and treeReduce to avoid OOM errors on the driver.
The dataframe to aggregate; it must have sparse vector columns starting at column 0
The number of sparse vector columns
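As an illustration of the mapPartitions/treeReduce pattern described above, here is a minimal pure-Python sketch (the library itself is Scala/Spark). It assumes that the "feature probability" of a feature is the fraction of instances in which it is active; the function names and shapes are illustrative, not the actual implementation.

```python
from functools import reduce

def partition_counts(partition, num_features):
    # Per-partition aggregation (the mapPartitions step): count per feature
    # in how many instances it is active, plus the number of instances.
    counts = [0] * num_features
    total = 0
    for active_indices in partition:  # indices of a sparse feature vector
        for i in active_indices:
            counts[i] += 1
        total += 1
    return counts, total

def merge(a, b):
    # Combine two partial aggregates (the treeReduce step); pairwise merging
    # keeps every intermediate result small instead of collecting all
    # partial aggregates on the driver at once.
    (ca, na), (cb, nb) = a, b
    return [x + y for x, y in zip(ca, cb)], na + nb

# Two "partitions" of instances with 4 features in total.
partitions = [[[0, 2], [1, 2]], [[2, 3]]]
counts, total = reduce(merge, [partition_counts(p, 4) for p in partitions])
probs = [c / total for c in counts]  # feature probabilities
```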
Computes the weighted feature probabilities of sparse feature vector columns, using mapPartitions for efficiency and treeReduce to avoid OOM errors on the driver.
The dataframe to aggregate; it must have a double weight column at column 0 and a sparse vector column at column 1
The per-worker mini-batch size. Default: 256
The regularization rate for the latent factor weights. Default: 0.001f
The name of the integer array column containing the itemCol ids of the items to filter from the recommendations. If empty, recommendations are not filtered. Usually the arrays contain the ids of the items of the user.
Default: ""
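The filtering this parameter enables can be sketched in illustrative Python (the function name and data shapes are assumptions, not the library's API):

```python
def filter_recommendations(ranked_items, filter_items, k):
    # Drop the user's filter items from the ranked candidates, keep the top k.
    exclude = set(filter_items)
    return [item for item in ranked_items if item not in exclude][:k]

# Typical use: exclude the items the user already interacted with.
top = filter_recommendations([5, 3, 8, 1, 9], filter_items=[3, 9], k=3)
```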
Fits a GlintFMPairModel on the data set
The data set containing columns (userCol: Int, itemCol: Int, itemFeaturesCol: SparseVector, userctxFeaturesCol: SparseVector) and, if acceptance sampling is used, also samplingCol.
The name of the item id column, holding integers from 0 to the number of items in the training dataset. Default: "itemid"
The name of the item feature column of sparse vectors. Default: "itemfeatures"
The regularization rate for the linear weights. Default: 0.01f
Whether the meta data of the data frame to fit should be loaded from HDFS. This allows skipping the meta data computation stages when fitting on the same data frame with different parameters. Meta data for "crossbatch" and "uniform" sampling is compatible with each other, but "exp" requires its own meta data.
Default: false
The HDFS path to load meta data for the fit data frame from or to save the fitted meta data to. Default: ""
The number of latent factor dimensions (k). Default: 150
The number of parameter servers. Default: 3
The parameter server configuration. Allows for detailed configuration of the parameter servers with the default configuration as fallback. Default: ConfigFactory.empty()
The master host of the running parameter servers. If this is not set, a standalone parameter server cluster is started in this Spark application. Default: ""
The rho value to use for the "exp" sampler. Has to be between 0.0 and 1.0. Default: 1.0
The sampler to use.
"uniform" means sampling negative items uniformly, as originally proposed for BPR.
"exp" means sampling negative items with probability proportional to their exponential popularity distribution, as proposed in LambdaFM.
"crossbatch" means sampling negative items uniformly, but sharing them across the mini-batch with the crossbatch-BPR loss, as proposed in my master's thesis.
Default: "uniform"
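The difference between the samplers can be sketched in illustrative Python. The "exp" weighting below follows the rank-based exponential popularity distribution proposed in LambdaFM, where smaller rho concentrates sampling on popular items; the exact form used by this implementation is an assumption here:

```python
import math, random

def sample_uniform(num_items, rng=random):
    # "uniform": every item is equally likely to be drawn as a negative item.
    return rng.randrange(num_items)

def exp_popularity_probs(num_items, rho):
    # "exp": weight items by an exponential function of their popularity rank
    # (rank 0 = most popular), as in LambdaFM. Smaller rho concentrates the
    # sampling probability mass on the most popular items.
    weights = [math.exp(-(rank + 1) / (num_items * rho))
               for rank in range(num_items)]
    total = sum(weights)
    return [w / total for w in weights]

probs = exp_popularity_probs(num_items=1000, rho=0.3)
# popular items (low ranks) are sampled more often than unpopular ones
```

"crossbatch" draws its negatives uniformly in the same way, but reuses each drawn item as a shared negative for every instance in the mini-batch.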
The name of the column of integers to use for acceptance sampling. If empty, all items are accepted as negative items; otherwise an item is only accepted if there is no interaction between the user and the sampling column value of the item. Usually the sampling column is the same as itemCol, but it may also be another column with an n-to-1 relation from item column value to sampling column value.
Consider the example of playlists with "pid" as user column and tracks with "traid" as item column. Another column "artid" holds the artist of the track. With "traid" as sampling column, only tracks which are not in the playlist are accepted as negative items. With "artid" as sampling column, only tracks whose artists are not in the playlist are accepted as negative items.
Default: ""
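In illustrative Python, the acceptance test for the playlist example above might look like this (names and data shapes are hypothetical, not the library's internals):

```python
def accept_negative(user, item, item_to_sampling_value, user_sampling_values):
    # Accept a sampled negative item only if the user has no interaction
    # with the item's sampling column value.
    return item_to_sampling_value[item] not in user_sampling_values[user]

# Playlist example: items are tracks ("traid"), sampling column is the artist ("artid").
track_artist = {0: "a1", 1: "a1", 2: "a2"}   # traid -> artid
playlist_artists = {"p0": {"a1"}}            # pid -> artids occurring in the playlist
ok_track_1 = accept_negative("p0", 1, track_artist, playlist_artists)  # artist in playlist
ok_track_2 = accept_negative("p0", 2, track_artist, playlist_artists)  # artist not in playlist
```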
Whether the meta data of the fitted data frame should be saved to HDFS. Default: false
The depth to use for treeReduce when computing the meta data. To avoid OOM errors this has to be set sufficiently large, but lower depths might lead to faster runtimes.
The name of the user id column of integers. Default: "userid"
The name of the user and context feature column of sparse vectors. Default: "userctxfeatures"
Distributed pairwise factorization machine / LightFM.
Pairwise factorization machines are trained on implicit-feedback training instances to rank all items that appear in observed user-item training instances above all other items for the user, using Bayesian personalized ranking (BPR).
This is an implementation using Glint parameter servers with custom methods for network-efficient training. A Spark application running the parameter servers has to be started beforehand and the host of the parameter server master passed as a parameter to this implementation; otherwise a standalone parameter server cluster is started within this Spark application.
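To make the BPR objective concrete, here is a minimal SGD step in illustrative Python for a plain matrix-factorization scoring function; the actual model scores pairs with factorization-machine features and applies its updates on the parameter servers, so this is a sketch of the loss, not of the implementation:

```python
import math

def bpr_update(user_vec, pos_vec, neg_vec, lr=0.05, reg=0.001):
    # One SGD step of BPR: push the score of the observed (positive) item
    # above the score of the sampled negative item for this user.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    x_uij = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    g = 1.0 / (1.0 + math.exp(x_uij))  # sigmoid(-x_uij), the gradient scale
    for f in range(len(user_vec)):
        u, p, n = user_vec[f], pos_vec[f], neg_vec[f]
        user_vec[f] += lr * (g * (p - n) - reg * u)
        pos_vec[f] += lr * (g * u - reg * p)
        neg_vec[f] += lr * (-g * u - reg * n)
    return x_uij  # score difference before the update

u, p, n = [0.3, -0.1], [0.0, 0.2], [0.2, 0.0]
for _ in range(100):
    last = bpr_update(u, p, n)
```

Repeated updates increase the score difference between the positive and the negative item, which is exactly the ranking property described above.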