SpatialHadoop: no scaling with multiple computing nodes

I am using SpatialHadoop to store and index a dataset with 87 million points, and I then apply various range queries. I tested on 3 different cluster configurations: 1, 2 and 4 nodes. Unfortunately, I don't see a runtime decrease as the node count grows. Any ideas why there is no horizontal-scaling effect? How big is your file in megabytes? While it has 87 million points, it may still be small enough that Hadoop creates only one or two splits out of it. If this is the case, you can try reducing the block size in your HDFS configuration so that the file will be split ...
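A minimal sketch of the block-size tweak suggested above, assuming the input is a single file on HDFS (the 32 MB value and the paths are illustrative):

    # Re-upload the points file with a smaller block size so it yields more
    # input splits, and therefore more parallel map tasks for the range queries.
    hdfs dfs -D dfs.blocksize=33554432 -put points.csv /data/points.csv   # 32 MB blocks
    hdfs fsck /data/points.csv -files -blocks | head                      # confirm the block count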

Hadoop installation on EC2

I am trying to install a multi-node cluster on EC2 by following https://dzone.com/articles/how-set-multi-node-hadoop. Everything seemed to work: I have a namenode and a datanode, and the following processes are running: namenode: 1389 NameNode, 1687 JobTracker, 1590 SecondaryNameNode; datanode: 1415 TaskTracker, 1286 DataNode. I could check the namenode status from "ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:50070/dfshealth.jsp" and the JobTracker status from "ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:50030/jobtracker.jsp". The problems came when I tried to check the TaskTracker Stat...
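The checks described above can be scripted roughly as follows; the hostnames are the placeholders from the question, and the TaskTracker web port (50060 on Hadoop 1.x) is the usual suspect if its status page is unreachable from outside EC2:

    jps    # on the namenode: expect NameNode, JobTracker, SecondaryNameNode
    jps    # on the datanode: expect DataNode, TaskTracker
    curl http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:50070/dfshealth.jsp    # NameNode UI
    curl http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:50030/jobtracker.jsp   # JobTracker UI
    # The TaskTracker UI listens on port 50060 of the worker instance; if it is
    # unreachable, check that the EC2 security group opens that port.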

Apache Hive: how to identify which column is the partition

I have a set of log files and created a Hive table; now I want to partition the table based on a column. What I don't understand, and have not seen examples of, is how to specify the column/field for the partition. For example, here is a line from the log: 2012-04-11 16:49:10,629 ~ [http-7001-11] ~ DE1F6F6667913022AE2620D1228817D6 ~ END ~ /admin/bp/setup/newedit/ok ~ pt ~ 219 ~ and the table structure is: CREATE TABLE log (starttime STRING, thread STRING, session STRING, method STRING, targeturl STRING, registry string, ipaddress STRING, details STRING) ROW FORMAT DELIMITED FIELDS T...
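A hedged sketch of how the partition column is declared: it goes in PARTITIONED BY rather than in the regular column list, and its value is supplied when the data is loaded (the partition name logdate and the input path are assumptions):

    hive -e "
    CREATE TABLE log_partitioned (
      starttime STRING, thread STRING, session STRING, method STRING,
      targeturl STRING, registry STRING, ipaddress STRING, details STRING)
    PARTITIONED BY (logdate STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~';

    LOAD DATA INPATH '/logs/2012-04-11.log'
    INTO TABLE log_partitioned PARTITION (logdate='2012-04-11');
    "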

Why is the Hadoop name node connecting to the weird [aca8ca1d.ipt.aol.com] hostname?

I am using a Mac system and starting Hadoop with the command start-dfs.sh, and my hostname is "ctpllt072.local" as returned by the "hostname" command. But I am getting a weird hostname and message when starting the name node: Starting namenodes on [aca8ca1d.ipt.aol.com] aca8ca1d.ipt.aol.com: ssh: connect to host aca8ca1d.ipt.aol.com port 22: Operation timed out. There is nothing in my system that specifies [aca8ca1d.ipt.aol.com], neither in /etc/hosts nor in any property file. Here are my hdfs, yarn and core-site XML files: core-site.xml <?xml version="1.0...
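For what it's worth, "aca8ca1d" looks like a hex-encoded IPv4 address (ac.a8.ca.1d = 172.168.202.29), which suggests the namenode address is being derived from an IP/DNS lookup rather than from the local hostname. A hedged way to check what the start script will resolve (the localhost:9000 value is only a suggestion):

    # start-dfs.sh takes the namenode list from the configuration; check what it sees:
    hdfs getconf -namenodes               # should print the local hostname, not aca8ca1d...
    hdfs getconf -confKey fs.defaultFS    # the value resolved from core-site.xml
    # If fs.defaultFS does not point at this machine, set it there, e.g.
    #   <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>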

Cluster Configuration - Worker Nodes

I am a beginner in cluster configuration. I know our cluster has these types of worker nodes: 16 x 4 TB disks, 128 GB RAM, 2 x 8-core CPUs; and 12 x 1.2 TB disks, 256 GB RAM, 2 x 10-core CPUs. I am confused about the configuration. What does 2 x 8 cores mean? Does it mean 2 processors with 8 physical cores each? So if my processors are hyper-threaded, will I have 2 x 8 x 2 = 32 virtual cores? And does 12 x 1.2 TB mean 12 disks with 1.2 TB each? Usually 2 x 8-core CPUs means that you have 2 physical chips on your motherboard, each having 8 cores. If you enable hyper-threading, you then have 32 virtual ...
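A quick way to confirm this on a Linux worker node; the answer's 2 x 8 x 2 = 32 arithmetic maps directly onto the lscpu fields:

    lscpu | grep -E 'Socket|Core|Thread'
    #   Socket(s):           2    <- physical chips
    #   Core(s) per socket:  8    <- physical cores per chip
    #   Thread(s) per core:  2    <- hyper-threading enabled
    # vCPUs = 2 sockets x 8 cores/socket x 2 threads/core = 32
    lsblk -d -o NAME,SIZE    # lists the physical disks, e.g. 12 devices of 1.2 TB each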

Hive INSERT OVERWRITE into a dynamic-partition external table from a raw external table fails with a NullPointerException

I have a raw external table with four columns. Table 1: create external table external_partitioned_rawtable (age_bucket String, country_destination String, gender string, population_in_thousandsyear int) row format delimited fields terminated by '\t' lines terminated by '\n' location '/user/HadoopUser/hive'. I want an external table partitioned by country_destination and gender. Table 2: create external table external_partitioned (age_bucket String, population_in_thousandsyear int) partitioned by (country_destination String, gender String) row format delimited fiel...
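A hedged sketch of the dynamic-partition insert this question is aiming for; dynamic partitioning must be enabled first, and the partition columns have to come last in the SELECT list (table and column names follow the excerpt above):

    hive -e "
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE external_partitioned
    PARTITION (country_destination, gender)
    SELECT age_bucket, population_in_thousandsyear, country_destination, gender
    FROM external_partitioned_rawtable;
    "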

Creating an Avro table with buckets in Hive

I created an Avro table with buckets but I get the following error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Bucket columns uniqueid is not part of the table columns ([]. The DDL is: CREATE TABLE s.TEST_OD_V (UniqueId int, dtCd string, SysSK int, Ind string) PARTITIONED BY (vcd STRING) CLUSTERED BY (UniqueId) INTO 500 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.ha...
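One workaround sometimes suggested, offered here only as a hedged sketch: on Hive 0.14+ the shorthand STORED AS AVRO derives the Avro schema from the inline column list, so the bucket column is visible to the DDL check instead of the SerDe reporting an empty column list:

    hive -e "
    CREATE TABLE s.TEST_OD_V (
      UniqueId int, dtCd string, SysSK int, Ind string)
    PARTITIONED BY (vcd STRING)
    CLUSTERED BY (UniqueId) INTO 500 BUCKETS
    STORED AS AVRO;
    "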

Phoenix CSV bulk load fails with large data sets

I'm trying to load a dataset (280 GB) using the Phoenix CSV bulk load tool on an HDInsight HBase cluster. The job fails with the following error: 18/02/23 06:09:10 INFO mapreduce.Job: Task Id : attempt_1519326441231_0004_m_000067_0, Status : FAILED Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143. Container exited with a non-zero exit code 143. Here's my cluster configuration: region nodes: 8 cores, 56 GB RAM, 1.5 TB HDD; master nodes: 4 cores, 28 GB RAM, 1.5 TB HDD. I tried increasing the value of yarn.nodemanager.resource.memory-...
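A hedged sketch of bumping the per-mapper heap, which is usually the limit behind "Error: Java heap space" rather than the NodeManager total; the jar path is illustrative for an HDP-based HDInsight image, and the memory values are starting points to tune:

    HADOOP_CLASSPATH=$(hbase classpath) hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      -Dmapreduce.map.memory.mb=8192 \
      -Dmapreduce.map.java.opts=-Xmx6553m \
      --table MY_TABLE \
      --input /data/input.csv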

AWS EMR Hive partitioning does not recognize any partitions

I am trying to process some log files in a bucket on Amazon S3. I create the table: CREATE EXTERNAL TABLE apiReleaseData2 (messageId string, hostName string, timestamp string, macAddress string DISTINCT, apiKey string, userAccountId string, userAccountEmail string, numFiles string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ('paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/'; Then I run the following HiveQL statem...
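Since the S3 prefix uses plain date paths (2013/12/31) rather than key=value directories, a hedged alternative is to declare a partition column and attach each prefix explicitly; the partition column dt is an assumption, everything else follows the excerpt (minus the stray DISTINCT):

    hive -e "
    CREATE EXTERNAL TABLE apiReleaseData2 (
      messageId string, hostName string, timestamp string, macAddress string,
      apiKey string, userAccountId string, userAccountEmail string, numFiles string)
    PARTITIONED BY (dt STRING)
    ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ('paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
    LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';

    ALTER TABLE apiReleaseData2 ADD PARTITION (dt='2013-12-31')
    LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
    "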

Assistance creating a Hive partitioned table from an HDFS location

I sure hope someone can help me out with creating external Hive partitioned tables that automatically pick up data from comma-delimited files residing in an HDFS directory. My understanding, or lack thereof, is that when you define a CREATE EXTERNAL TABLE with PARTITIONED BY and provide it with a LOCATION, it should recursively scan/read each and every sub-directory and load data into the newly created partitioned external table. The following should provide some additional insight into my troubles… Sample HDFS directory structure: /data/output/dt=2014-01-01 ...
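A hedged sketch, assuming the directories keep the dt=YYYY-MM-DD layout shown above: Hive will not scan the sub-directories by itself, but MSCK REPAIR TABLE registers every key=value directory under the table LOCATION as a partition (the column names below are placeholders):

    hive -e "
    CREATE EXTERNAL TABLE daily_output (col1 STRING, col2 STRING)
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/output';

    MSCK REPAIR TABLE daily_output;
    SHOW PARTITIONS daily_output;
    "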

Hive "add partition" concurrency

We have an external Hive table that is used for processing raw log file data. The files are hourly, and are partitioned by date and source host name. At the moment we are importing files using simple Python scripts that are triggered a few times per hour. The script creates sub-folders on HDFS as needed, copies new files from the temporary local storage and adds any new partitions to Hive. Today, new partitions are created using "ALTER TABLE ... ADD PARTITION ...". However, if another Hive query is running on the table it will be locked, which means that the add partition co...
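A hedged sketch of the hourly import step, with IF NOT EXISTS making the partition statement safe to re-run if it has to wait on (or loses) a table lock; the table name, paths and partition keys are illustrative:

    hdfs dfs -mkdir -p /logs/raw/dt=2014-01-01/host=web01
    hdfs dfs -put /tmp/staging/web01-2014010112.log /logs/raw/dt=2014-01-01/host=web01/
    hive -e "ALTER TABLE raw_logs ADD IF NOT EXISTS PARTITION (dt='2014-01-01', host='web01')
             LOCATION '/logs/raw/dt=2014-01-01/host=web01';"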

Hive: does Hive support partitioning and bucketing when using external tables?

When the PARTITIONED BY or CLUSTERED BY keywords are used while creating Hive tables, Hive creates separate directories or files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that the data files backing external tables are not managed by Hive. So does Hive create additional files corresponding to each partition or bucket and move the corresponding data into them? Edit - adding details. A few extracts from "Hadoop: The Definitive Guide", "Chapter 17: Hive": CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country ST...
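A hedged sketch of the external variant of the book's example: PARTITIONED BY works for external tables too, but Hive only records the partition-to-directory mapping in the metastore and leaves the files where they are (the locations below are illustrative):

    hive -e "
    CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING)
    PARTITIONED BY (dt STRING, country STRING)
    LOCATION '/user/hive/external/logs';

    ALTER TABLE logs ADD PARTITION (dt='2014-01-01', country='GB')
    LOCATION '/user/hive/external/logs/dt=2014-01-01/country=GB';
    "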

On-demand user clusters for Apache Zeppelin + Spark?

We use Cloudera to deploy a Zeppelin-Spark-YARN-HDFS cluster. Right now there is only one instance of Zeppelin and Spark, and the execution of any Spark notebook affects every user. For instance, if we stop the Spark context in one user's notebook, it affects all other users' notebooks. I've seen that there's an option in Zeppelin to isolate interpreters, but is there a way to provide each user with their own 'cluster' on demand? Maybe using Docker and building an image with Zeppelin and Spark for each user, and limiting their resources to the ones provided by the user cluster...
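A rough sketch of the per-user Docker idea from the question, one Zeppelin container per user with capped resources; the image tag, port mapping and limits are all assumptions, not a tested setup:

    docker run -d --name zeppelin-alice \
      --cpus 4 --memory 8g \
      -p 8081:8080 \
      apache/zeppelin:0.8.0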

Bootstrap Failure when trying to install Spark on EMR

I am using this link to install a Spark cluster on EMR (Elastic MapReduce on Amazon): https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923. To create the Spark cluster I run the following command, but the cluster runs into a bootstrap failure every single time. I am not able to resolve this issue, and it would be great if anyone could help me here. aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=MYKEY --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark...
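A hedged alternative to the bootstrap-action route: on release-label EMR (4.x and later) Spark is a built-in application, so it can be requested directly and the s3://support.elasticmapreduce bootstrap action is no longer needed (release label and key name are placeholders):

    aws emr create-cluster \
      --name SparkCluster \
      --release-label emr-5.12.0 \
      --applications Name=Spark Name=Hive \
      --instance-type m3.xlarge --instance-count 3 \
      --ec2-attributes KeyName=MYKEY \
      --use-default-roles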

Using spark-submit externally from EMR cluster master

We have a Hadoop cluster running in AWS Elastic MapReduce (EMR) with Spark 1.6.1. There is no problem logging into the cluster master and submitting Spark jobs, but we'd like to be able to submit them from another, independent EC2 instance. The other 'external' EC2 instance has security groups set up to allow all TCP traffic to and from the EMR master and slave instances. It has a binary installation of Spark downloaded directly from Apache's site. Having copied the /etc/hadoop/conf folder from the master to this instance and set $HADOOP_CONF_DIR accordingly, when attempting to s...
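A hedged sketch of the submission being attempted, assuming the copied configuration directory is readable on the external instance; the paths, class and jar names are placeholders:

    export HADOOP_CONF_DIR=/opt/emr-conf/hadoop    # copy of the master's /etc/hadoop/conf
    export YARN_CONF_DIR=$HADOOP_CONF_DIR
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyJob \
      my-job-assembly.jar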

Does an EMR master node know its cluster id?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need some sort of agent running on each master node. Each of those agents will have to identify itself in its messages so that the recipient knows which cluster a message is about. Does the master node know its id (j-*****)? If not, is there some other piece of identifying information that could allow the message recipient to infer this id? I've taken a look through the config files in /home/hadoop/conf, and I haven't...
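One commonly cited source, offered as a hedged sketch: EMR nodes keep cluster metadata on local disk, and /mnt/var/lib/info/job-flow.json carries the j-***** id (treat the path and the parsing below as assumptions to verify on your AMI):

    CLUSTER_ID=$(grep -oE 'j-[A-Z0-9]+' /mnt/var/lib/info/job-flow.json | head -1)
    echo "Running on EMR cluster $CLUSTER_ID"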
