Running spark-shell with MLlib, error: object jblas is not a member of package org

In spark-shell, when I execute import org.jblas.DoubleMatrix, it throws "error: object jblas is not a member of package org" on RHEL. Actually, I googled "jblas" and installed "gfortran" from https://gcc.gnu.org/wiki/GFortranBinaries#MacOS on my Mac Pro. My Spark version is spark-1.4.0-bin-hadoop2.6.tar or spark-1.5.1-bin-hadoop2.6.tar, downloaded directly from the official website, that is to say I didn't build from the source code. This step may be optional; it will install the jblas jar in your local repository. git clone https://github.com/mikiob...
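
A minimal sketch of the usual fix, assuming the class simply isn't on the shell's classpath: pull the jblas artifact in when launching spark-shell. The Maven coordinates and version below (org.jblas:jblas:1.2.4) are an assumption, not something stated in the question.

// Launch the shell with the jblas dependency resolved automatically
// (coordinates/version are an assumption):
//   ./bin/spark-shell --packages org.jblas:jblas:1.2.4
// or, if the jar is already on disk:
//   ./bin/spark-shell --jars /path/to/jblas-1.2.4.jar

// With the jar on the classpath, the import resolves and jblas is usable:
import org.jblas.DoubleMatrix

val m = new DoubleMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
println(m.mmul(m))   // simple sanity check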

./spark-shell doesn't start correctly (spark-1.6.1-bin-hadoop2.6 version)

I installed this Spark version: spark-1.6.1-bin-hadoop2.6.tgz. Now when I start Spark with the ./spark-shell command I'm getting these errors (it shows a lot of error lines so I just put some that seem important): Cleanup action completed 16/03/27 00:19:35 ERROR Schema: Failed initialising database. Failed to create database 'metastore_db', see the next exception for details. org.datanucleus.exceptions.NucleusDataStoreException: Failed to create database 'metastore_db', see the next exception for details. at org.datanucleus.store.rdbms.Connectio...

Spark on EC2 - S3 endpoint for Scala not resolving

Hi, I had been able to successfully set up a Spark cluster on AWS EC2 for two months, but recently I started getting the following error in the creation script. It's basically failing to set up the Scala packages and not resolving the source S3 endpoint: --2017-02-28 17:51:30-- (try: 6) http://s3.amazonaws.com/spark-related-packages/scala-2.10.3.tgz Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.0.83|:80... failed: Connection timed out. Retrying. This is my source Spark version on GitHub: https://github.com/amplab/spark-ec2/archive/branch-2.0.zip And the above...

How to Get Set of Partition Column(s) Values with Spark DataFrameWriter.partitionBy

I'd like to use Spark DataFrameWriter.partitionBy() to write to AWS S3. It, of course, writes a separate directory branch for each unique combination of partition column values. Is there any way to get from Spark which partition column value combinations existed in the DataFrame, i.e. were written, without querying the "filesystem" (the AWS S3 object store)? If you want to partition by, say, a and b, you can just query your dataframe with df.select($"a", $"b").distinct.show(); this gives you the folders that will be created.
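
A minimal sketch of that suggestion, assuming Spark 2.x and illustrative column names a and b (the data, bucket and paths below are hypothetical, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data; the real DataFrame would come from elsewhere.
val df = Seq((1, "x", 10.0), (1, "y", 20.0), (2, "x", 30.0)).toDF("a", "b", "value")

// The distinct partition-column combinations, i.e. the directory branches partitionBy will create:
val partitions = df.select($"a", $"b").distinct().collect()
partitions.foreach(println)   // [1,x], [1,y], [2,x]

// Writing with the same partition columns produces one branch per combination:
df.write.partitionBy("a", "b").parquet("s3a://my-bucket/output/")   // bucket is hypothetical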

Websphere MQ as a data source for Apache Spark Streaming

I was digging into the possibilities for Websphere MQ as a data source for spark-streaming because it is needed in one of our use cases. I got to know that MQTT is the protocol that supports communication with MQ data structures, but since I am a newbie to Spark Streaming I need some working examples for the same. Has anyone tried to connect MQ with Spark Streaming? Please advise on the best way of doing so. So, I am posting here the working code for CustomMQReceiver, which connects to the Websphere MQ and reads data: public class CustomMQReciever extends Receiver<String...
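
The poster's Java receiver is truncated above; purely as an illustration of the general shape of a Spark Streaming custom receiver, here is a Scala sketch. The queue manager name, queue name and the actual IBM MQ client calls are left as placeholders and are assumptions, not the working code referred to in the question.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of a custom receiver; the MQ-specific connection code is intentionally omitted.
class CustomMQReceiver(qMgrName: String, queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Spark calls onStart() on an executor; do the blocking reads on a separate thread.
    new Thread("Websphere MQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Close the MQ connection and queue here (omitted in this sketch).
  }

  private def receive(): Unit = {
    try {
      // 1. Connect to the queue manager and open the queue (IBM MQ client calls go here).
      // 2. Loop: read a message, convert it to a String, and hand it to Spark via store().
      while (!isStopped()) {
        val message: String = ???   // placeholder for the actual MQ read
        store(message)
      }
    } catch {
      case t: Throwable => restart("Error receiving from MQ", t)
    }
  }
}

// Usage, given an existing StreamingContext ssc:
//   val lines = ssc.receiverStream(new CustomMQReceiver("QMGR", "QUEUE1"))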

What are workers, executors, cores in Spark Standalone cluster?

I read Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the parallelism. Is a worker a JVM process or not? I ran bin\start-slave.sh and found that it spawned a worker, which is actually a JVM. As per the above link, an executor is a process launched for an application on a worker node that runs tasks. An executor is also a JVM. These are my questions: Executors are per application. Then what is the role of a worker? Does it co-ordinate with the executor and communicate the result back to the driver? Or does the driver di...
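
Not from the question itself, but as a small illustration of where those knobs live in a standalone cluster: the properties below (master URL and numbers are arbitrary assumptions) control how many cores and how much memory each application's executors get on the workers.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("resource-demo")
  .setMaster("spark://master-host:7077")   // standalone master (hypothetical host)
  .set("spark.executor.memory", "2g")      // memory per executor, per application
  .set("spark.executor.cores", "2")        // cores per executor
  .set("spark.cores.max", "8")             // total cores this application may take from the cluster

val sc = new SparkContext(conf)
// With these settings the standalone master can launch up to 8 / 2 = 4 executors
// for this application, spread across the worker JVMs; the workers only launch and
// monitor those executor processes.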

How to prepare data in LibSVM format from a DataFrame?

I want to produce the libsvm format, so I transformed my dataframe into the desired format, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The desired libsvm type is user item:rating. If you know what to do in the current situation, please advise:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.grou...
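
As a rough sketch only (not the poster's code), one way to turn the (user, product, rate) triples above into one "user item:rating item:rating ..." line per user, assuming product IDs can serve as feature indices and each user/product pair appears once; the output path is hypothetical:

// Continuing from ratings: an RDD of (user, product, rate) triples.
val libsvmLines = ratings
  .map { case (user, product, rate) => (user, (product, rate)) }
  .groupByKey()
  .map { case (user, productRates) =>
    // libsvm expects feature indices in ascending order
    val features = productRates.toSeq.sortBy(_._1)
      .map { case (product, rate) => s"$product:$rate" }
      .mkString(" ")
    s"$user $features"
  }

libsvmLines.saveAsTextFile("/user/ubuntu/kang/0829/ratings_libsvm")   // hypothetical output path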

sparkR on YARN cluster

I can see at url http://ec2-54-186-47-36.us-west-2.compute.amazonaws.com:8080/ that I have two worker nodes and one master node, it is showing spark cluster. by running command jps on my 2 worker node and 1 master I can see that all services are up.following script I am using to initialise SPARKR session. if (nchar(Sys.getenv("SPARK_HOME")) < 1) { Sys.setenv(SPARK_HOME = "/home/ubuntu/spark") }but whenever I tried to use Rstudio to initialize session then it fail and shows following ERROR, please advice me , I can not use real benefit of cluster. sparkR.sess...

Shouldn't an SVM binary classifier learn the threshold from the training set?

I'm very confused about SVM classifiers and I'm sorry if I sound stupid. I'm using the Spark library for Java, http://spark.apache.org/docs/latest/mllib-linear-methods.html, the first example from the Linear Support Vector Machines paragraph. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the predictions on the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. It gives negative only on 0 or negative values. I read that the standard threshold is "positive" if the prediction is a posit...
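
A small illustration of where the threshold lives in MLlib's SVM (not from the question; the file path is hypothetical): clearing the threshold makes predict() return the raw margin instead of 0/1, which shows where the decision boundary actually falls for the data above, and setThreshold() lets you move the cut-off.

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Load the libsvm-format training data shown above (path is hypothetical).
val training = MLUtils.loadLibSVMFile(sc, "data/svm_threshold_example.txt")

val model = SVMWithSGD.train(training, 100)   // 100 iterations, default parameters

// With the default threshold of 0.0, predict() returns a 0.0/1.0 class label:
Seq(8.0, 2.0, 1.0).foreach { x =>
  println(s"x = $x -> class ${model.predict(Vectors.dense(x))}")
}

// Clear the threshold to see the raw scores (signed distance from the separating hyperplane):
model.clearThreshold()
Seq(8.0, 2.0, 1.0).foreach { x =>
  println(s"x = $x -> raw score ${model.predict(Vectors.dense(x))}")
}

// A different cut-off can then be set explicitly, e.g. model.setThreshold(5.0)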

How to deal with concatenated Avro files?

I'm storing data generated from my web application in Apache Avro format. The data is encoded and sent to an Amazon Kinesis Firehose that buffers and writes the data to Amazon S3 every 300 seconds or so. Since I have multiple web servers, this results in multiple blobs of Avro files being sent to Kinesis, which concatenates them and periodically writes them to S3. When I grab the file from S3, I can't use the normal Avro tools to decode it, since it's actually multiple files in one. I could add a delimiter I suppose, but that seems risky in the event that the data being ...
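
One possible workaround (an assumption on my part, not something from the question): every Avro object container file starts with the 4-byte magic "Obj" followed by byte 1, so a concatenated blob can be split on that marker and each slice decoded on its own. This is heuristic, since the marker could in principle also occur inside a data block. A sketch using the Avro Java library from Scala:

import java.io.ByteArrayInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.mutable.ArrayBuffer

val magic: Array[Byte] = Array('O'.toByte, 'b'.toByte, 'j'.toByte, 1.toByte)

// Start offsets of every embedded Avro file header in the concatenated blob.
def fileOffsets(bytes: Array[Byte]): List[Int] = {
  val offsets = ArrayBuffer[Int]()
  var i = 0
  while (i <= bytes.length - magic.length) {
    if (bytes.slice(i, i + magic.length).sameElements(magic)) offsets += i
    i += 1
  }
  offsets.toList
}

// Decode each slice as a standalone Avro object container file.
def readConcatenatedAvro(bytes: Array[Byte]): List[GenericRecord] = {
  val offsets = fileOffsets(bytes)
  val bounds = offsets.zip(offsets.drop(1) :+ bytes.length)
  bounds.flatMap { case (start, end) =>
    val stream = new DataFileStream(
      new ByteArrayInputStream(bytes, start, end - start),
      new GenericDatumReader[GenericRecord]())
    val records = ArrayBuffer[GenericRecord]()
    while (stream.hasNext) records += stream.next()
    stream.close()
    records.toList
  }
}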

Spark: How to convert a DataFrame to LibSVM and perform logistic regression

I'm using this code to get data from Hive into Spark:
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val MyTab = hc.sql("select * from svm_file")
and I get a DataFrame:
scala> MyTab.show()
+--------------------+
|                line|
+--------------------+
|0 2072:1 8594:1 7...|
|0 8609:3 101617:1...|
|            0 7745:2|
|0 6696:2 9568:21 ...|
|0 200076:1 200065...|
|0 400026:20 6936:...|
|0 7793:2 9221:7 1...|
|0 4831:1 400026:1...|
|0 400011:1 400026...|
|0 200072:1 6936:1...|
|0 200065:29 4831:...|
|1 400026:20 3632:...|
|0 400026:19 6936:...|
|0 190004:1 9041:2...|
|0 190005:1 1...
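
A sketch of one way forward (an assumption, not the poster's solution): parse each "label index:value ..." string into a LabeledPoint and run MLlib's logistic regression on the resulting RDD. Taking the number of features as the largest index seen in the table is also an assumption about the data.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// MyTab has a single string column "line" holding libsvm-format text.
val lines = MyTab.rdd.map(_.getString(0).trim).filter(_.nonEmpty)

// Number of features = largest 1-based index that appears anywhere in the table.
val numFeatures = lines
  .flatMap(_.split("\\s+").drop(1).map(_.split(":")(0).toInt))
  .max()

val parsed = lines.map { line =>
  val parts = line.split("\\s+")
  val label = parts.head.toDouble
  val pairs = parts.tail.map { t =>
    val Array(i, v) = t.split(":")
    (i.toInt - 1, v.toDouble)   // libsvm indices are 1-based, MLlib vectors are 0-based
  }.sortBy(_._1)                // sparse vectors expect ascending indices
  val (indices, values) = pairs.unzip
  LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(parsed)
// Alternatively, save the lines as text and read them back with MLUtils.loadLibSVMFile.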

How to use distributed Spark and Play Framework?

How do you use Play Framework and a Spark cluster in development? I can run any Spark app with the master set to local[*]. But if I set it to run on the cluster, I get this: play.api.Application$$anon$1: Execution exception[[SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 192.168.1.239): java.lang.ClassNotFoundException: controllers.Application$$anonfun$test$1$$anonfun$2 at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424)...
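
A common cause of that ClassNotFoundException is that the Play application's classes never reach the executors. As a hedged sketch (the jar name, path and master URL are assumptions), packaging the app and pointing the SparkConf at the jar ships it to the cluster along with the anonymous closure classes:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("play-spark-app")
  .setMaster("spark://192.168.1.239:7077")   // cluster master (illustrative)
  // Ship the packaged application, including controllers.Application and its closures,
  // so executors can deserialize the tasks:
  .setJars(Seq("target/scala-2.11/play-spark-app-assembly-1.0.jar"))   // hypothetical jar

val sc = new SparkContext(conf)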

Convert csv with categorical data to libsvm

I am using Spark MLlib to build machine learning models. I need to give libsvm-format files as input if there are categorical variables in the data. I tried converting the csv file to libsvm using: 1. Convert.c, as suggested on the libsvm site, and 2. Csvtolibsvm.py from the phraug GitHub repo. But neither of these scripts seems to convert categorical data. I also installed Weka and tried saving to libsvm format, but couldn't find that option in the Weka explorer. Please suggest any other way of converting csv with categorical data to libsvm format, or let me know if I am missing anything here....
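
Not from the question: one way to handle the categorical columns inside Spark itself, instead of an external script, is to index them and emit libsvm directly. A sketch assuming a toy CSV with a label column, one categorical column and one numeric column (the layout, column meanings and paths are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Toy input lines of the form "label,color,weight", e.g. "1,red,3.5".
val raw = sc.textFile("data/input.csv").map(_.split(","))

// Index the categorical column so each category gets a one-hot slot.
val colorIndex = raw.map(_(1)).distinct().zipWithIndex().collectAsMap()
val numColors = colorIndex.size

val labeled = raw.map { fields =>
  val label = fields(0).toDouble
  val colorPos = colorIndex(fields(1)).toInt   // which one-hot slot to switch on
  val weight = fields(2).toDouble
  // Features: numColors one-hot slots for the category, then the numeric column.
  val features = Vectors.sparse(numColors + 1, Array(colorPos, numColors), Array(1.0, weight))
  LabeledPoint(label, features)
}

// Writes part files in libsvm format.
MLUtils.saveAsLibSVMFile(labeled, "data/output_libsvm")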

How to convert a LibSVM file with multiple classes into an RDD[LabeledPoint]

I am using the following method from the org.apache.spark.mllib.util.MLUtils package, which loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions: def loadLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] My problem is with loading data with multi-class labels. When using this method on multiclass labeled data... it gets converted to binary labeled data. Is there a way to load multiclass data in LibSVM format into an RDD[LabeledPoint]? There is one more method in the ...
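
A small check (illustrative; the path is hypothetical) of what loadLibSVMFile actually produced, by looking at the distinct labels in the resulting RDD[LabeledPoint]; for a genuinely multiclass file this should list every class value present:

import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/multiclass_libsvm.txt")   // hypothetical path

// Labels as loaded into the RDD.
val labels = data.map(_.label).distinct().collect().sorted
println(labels.mkString(", "))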

Zeppelin login issues with Docker

I've downloaded many zeppelin/spark images and with all of them I have trouble logging in to the notebooks. This is the shiro.ini file inside the container:
...
admin = password1
user1 = password2
user2 = password3
# Sample LDAP configuration, for user Authentication, currently tested for single Realm
[main]
#ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
#ldapRealm.userDnTemplate = cn={0},cn=engg,ou=testdomain,dc=testdomain,dc=com
#ldapRealm.contextFactory.url = ldap://ldaphost:389
#ldapRealm.contextFactory.authenticationMechanism = SIMPLE
[urls]
# anon means the access is ...

Kotlin and Spark - SAM issues

Maybe I'm doing something that is not quite supported, but I really want to use Kotlin as I learn Apache Spark with this book. Here is the Scala code sample I'm trying to run. The flatMap() accepts a FlatMapFunction SAM type:
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
Here is my attempt to do this in Kotlin. But it is having a compilation issue on the fourth line:
val conf = SparkConf().setMaster("local").setAppName("Line Counter")
val sc = SparkContext(...
