How to run a Java program (methods for running Java programs)


This is a programming Q&A about Java, followed by 468 programmers, covering how Java programs run, Java programs, methods for running Java programs, and related topics, compiled and supplemented by editor 弘理群. Let's look at the answers.

码农之家
Featured answer 2: How to run a Java program's JAR from a shell script

Answered 2 hours 25 minutes ago

When deploying a project on Ubuntu, we often start the program through a short shell script, or even invoke the Java program on a schedule through a crontab job. But I hit a strange problem. Say I wrote the following shell script:

#!/bin/sh
export mypath=/root/project/wishnomal

java -Xmx3000m -Xms3000m -server -d64 -Dfile.encoding=UTF-8 -Dfetch.threads=300 -classpath $mypath/:$mypath/wish2-assembly-1.0.0.jar newstandard.CrawlerNewStandard $*

echo "END"

Running this script manually from the command line, the Java program starts fine, but under a crontab scheduled job it seems to have no effect at all.

Possible causes I analyzed:

1) Does the current user lack execute permission on the shell script? Checking with ls -lrt /apps/service/mtk/checking/run.sh shows the script is executable: -rwxr-xr-x, so permission is not the problem.

2) Since the script runs fine on its own, could the scheduling itself be the problem? I wrote a trivial shell script that just prints something, and it ran fine under cron too. So the problem must still be in this script.

After searching online, it occurred to me that the cause might be the environment variables the script relies on: crontab runs the script with a minimal environment rather than the current user's login environment, so variables like JAVA_HOME and PATH set in the profile are never loaded. I checked the environment with cat /etc/profile and then modified the script as follows:


#!/bin/sh
export mypath=/root/project/wishnomal
export JAVA_HOME=/root/lib/jdk1.7.0_72
PATH=$PATH:$JAVA_HOME/bin

java -Xmx3000m -Xms3000m -server -d64 -Dfile.encoding=UTF-8 -Dfetch.threads=300 -classpath $mypath/:$mypath/wish2-assembly-1.0.0.jar newstandard.CrawlerNewStandard $*

echo "END"

export explicitly marks a shell variable to be exported into the environment inherited by child processes, such as the java command the script launches.

With that, the crontab scheduled task works normally.
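For reference, the crontab entry that triggers such a script might look like the sketch below (the schedule and log path are assumptions, not from the original post; the point is that cron itself provides only a minimal environment, which is why the script must set its own):

# m h dom mon dow  command -- cron runs this with a near-empty environment
0 2 * * * /root/project/wishnomal/run.sh >> /tmp/run.cron.log 2>&1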

Another reference example of the same fix:

#!/bin/sh 
# ----------------------------------------------------------------------------- 
# Start script for the CMGP BOSSCONTROL  
# 
# $Id: run_bosscontrol.sh,v 1.0 2007/11/06 Exp $ 
# ----------------------------------------------------------------------------- 
# Set the character set 
LANG=zh_CN.GBK; export LANG 
RUN_HOME=. 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/checking.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/ojdbc14.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/commons-dbutils-1.1.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/log4j-1.2.14.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/dom4j-1.6.jar 
 
export CLASSPATH 
 
java com.**.checking.Checking_Start >> log.out &  

Again, this script worked when run manually from the command line but did nothing under crontab, which was quite frustrating. The same checks applied: permissions were fine (-rwxr-xr-x), and a trivial test script ran under cron without trouble, so the problem had to be in this script. As before, the cause was the environment: crontab does not load the login environment, so after checking cat /etc/profile the script was modified as follows:

#!/bin/sh 
# ----------------------------------------------------------------------------- 
# Start script for the CMGP BOSSCONTROL  
# 
# $Id: run_bosscontrol.sh,v 1.0 2007/11/06 Exp $ 
# ----------------------------------------------------------------------------- 
export PATH=/apps/usr/java/jdk1.5/bin:$PATH 
export JAVA_HOME=/apps/usr/java/jdk1.5 
export JRE_HOME=/apps/usr/java/jdk1.5/jre 
export CLASSPATH=/apps/usr/java/jdk1.5/lib:/apps/usr/java/jdk1.5/jre/lib:$CLASSPATH 
RUN_HOME=/apps/service/checking 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/checking.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/ojdbc14.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/commons-dbutils-1.1.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/log4j-1.2.14.jar 
CLASSPATH=$CLASSPATH:$RUN_HOME/lib/dom4j-1.6.jar 
 
export CLASSPATH 
 
java com.**.checking.Checking_Start >> log.out &  

Again, export is what explicitly passes these variables into the environment of the child processes, so the java process launched by cron can see them.
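To see concretely what export changes, here is a tiny sketch (not from the original answer): only exported variables reach child processes, such as the java command that cron ultimately launches:

FOO=1                                 # shell-local: child processes cannot see it
export BAR=2                          # exported: inherited by child processes
sh -c 'echo "FOO=[$FOO] BAR=[$BAR]"'  # prints: FOO=[] BAR=[2]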

A JAR exported this way through Eclipse's export tool carries no Class-Path or Main-Class entries in its MANIFEST.MF. If we package with the build tool Ant instead, we can set Class-Path in the build.xml used for packaging,

putting the third-party JARs into MANIFEST.MF and specifying the program's main class.

Add the following to build.xml:

<!-- create a property containing all .jar files, prefixed with lib/, and separated by a space --> 
<pathconvert property="libs.project" pathsep=" "> 
  <mapper> 
   <chainedmapper> 
    <!-- remove absolute path --> 
    <flattenmapper /> 
    <!-- add lib/ prefix --> 
    <globmapper from="*" to="lib/*" /> 
   </chainedmapper> 
  </mapper> 
   <path> 
   <!-- lib.home contains all jar files, in several subdirectories --> 
   <fileset dir="${lib.dir}"> 
   <include name="**/*.jar" /> 
   </fileset> 
   </path> 
 </pathconvert> 

Then, when the manifest is created, add:

<!-- this adds the third-party JARs to the Class-Path -->  
<attribute name="Class-Path" value="${libs.project}" /> 
<!-- the main class to run --> 
<attribute name="Main-Class" value="com.**.checking.Checking_Start" /> 

Running ant then produces a JAR whose MANIFEST.MF contains:

Manifest-Version: 1.0 
Ant-Version: Apache Ant 1.7.0 
Created-By: 1.5.0_09-b01 (Sun Microsystems Inc.) 
Implementation-Title: fee task 
Implementation-Version: 1.0 
Implementation-Vendor: Aspire 
Main-Class: com.aspire.cmgp.flowcontrol.server.FlowControlServer 
Class-Path: lib/cmgp-util-1.0.1.jar lib/commons-codec-1.3.jar lib/comm 
 ons-collections.jar lib/commons-dbcp-1.2.1.jar lib/commons-httpclient 
 .jar lib/commons-logging.jar lib/commons-pool-1.2.jar lib/dom4j.jar l 
 ib/log4j.jar lib/ojdbc14.jar 

This way the shell script no longer needs to list the JARs the program depends on, so the annoying environment-variable problem disappears entirely. It is also the more standard way to do things.

Then the shell can simply run the JAR directly. Note that JVM options must come before -jar, otherwise java passes them to the program as arguments: java -Xmx1024m -Xms1024m -Xmn512m -jar main-program.jar
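Putting the pieces together, a cron-friendly wrapper using this manifest approach might look like the following sketch (the JDK path, working directory, JAR name, and heap sizes are borrowed from examples elsewhere on this page and are assumptions for your own setup):

#!/bin/sh
# cron supplies almost no environment, so set JAVA_HOME and PATH explicitly
export JAVA_HOME=/usr/java/jdk1.6.0_18
PATH=$JAVA_HOME/bin:$PATH; export PATH
cd /apps/service/checking || exit 1
# JVM options must come before -jar; MANIFEST.MF supplies Class-Path and Main-Class
java -Xmx1024m -Xms1024m -Xmn512m -jar checking.jar >> log.out 2>&1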

Alternatively, append the following right after #!/bin/bash, so the script loads the login environment itself:

source /etc/profile
source ~/.bash_profile

Test it out. Here is another variant that builds the CLASSPATH from a lib directory:

#! /bin/sh
export JAVA_HOME=/usr/java/jdk1.6.0_18
export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
for i in lib/*.jar; do
    CLASSPATH=$i:${CLASSPATH}
done
export CLASSPATH=.:${CLASSPATH}

java -cp ${CLASSPATH} <package-of-main-class>.<class-containing-main>

That's all for this answer. Hopefully it helps with your studies, and please keep supporting 码农之家.

码农之家
Featured answer 3: Writing and running a Spark application in Java, explained in detail

Answered 16 hours 10 minutes ago

Let's start with a simple requirement:

We want to analyze a website's access log, count how many times users from each distinct IP visited, and then use Geo information to work out the country/region distribution of the visitors. Here are some sample lines from my own site's log:

121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.198.92 - - [21/Feb/2014:00:00:11 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html/ HTTP/1.1" 301 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.241.229 - - [21/Feb/2014:00:00:13 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.241.229 - - [21/Feb/2014:00:00:15 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

Implementing the Spark application in Java

The statistics program we implement covers the following functional points:

Read the log data file from HDFS

Extract the first field of each line (the IP address)

Count the occurrences of each IP address

Sort the IP addresses by occurrence count, in descending order

For each IP address, call the GeoIP library to find the country it belongs to

Print the results, one line per entry, in the format: [country code] IP-address count

Below is the Java implementation of the statistics application:

package org.shirdrn.spark.job;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.shirdrn.spark.job.maxmind.Country;
import org.shirdrn.spark.job.maxmind.LookupService;
import scala.Serializable;
import scala.Tuple2;
public class IPAddressStats implements Serializable {
  private static final long serialVersionUID = 8533489548835413763L;
  private static final Log LOG = LogFactory.getLog(IPAddressStats.class);
  private static final Pattern SPACE = Pattern.compile(" ");
  private transient LookupService lookupService;
  private transient final String geoIPFile;
  public IPAddressStats(String geoIPFile) {
   this.geoIPFile = geoIPFile;
   try {
    // lookupService: get country code from an IP address
    File file = new File(this.geoIPFile);
    LOG.info("GeoIP file: " + file.getAbsolutePath());
    lookupService = new LookupService(file, LookupService.GEOIP_MEMORY_CACHE);
   } catch (IOException e) {
    throw new RuntimeException(e);
   }
  }
  @SuppressWarnings("serial")
  public void stat(String[] args) {
   JavaSparkContext ctx = new JavaSparkContext(args[0], "IPAddressStats",
     System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(IPAddressStats.class));
   JavaRDD<String> lines = ctx.textFile(args[1], 1);
   // split each line and extract the IP address field
   JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String s) {
     // 121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
     // ip address
     return Arrays.asList(SPACE.split(s)[0]);
    }
   });
   // map
   JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
     return new Tuple2<String, Integer>(s, 1);
    }
   });
   // reduce
   JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
     return i1 + i2;
    }
   });
   List<Tuple2<String, Integer>> output = counts.collect();
   // sort statistics result by value
   Collections.sort(output, new Comparator<Tuple2<String, Integer>>() {
    @Override
    public int compare(Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) {
     if(t1._2 < t2._2) {
       return 1;
     } else if(t1._2 > t2._2) {
       return -1;
     }
     return 0;
    }
   });
   writeTo(args, output);
  }
  private void writeTo(String[] args, List<Tuple2<String, Integer>> output) {
   for (Tuple2<?, ?> tuple : output) {
    Country country = lookupService.getCountry((String) tuple._1);
    LOG.info("[" + country.getCode() + "] " + tuple._1 + "\t" + tuple._2);
   }
  }
  public static void main(String[] args) {
   // ./bin/run-my-java-example org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
   if (args.length < 3) {
    System.err.println("Usage: IPAddressStats <master> <inFile> <GeoIPFile>");
    System.err.println(" Example: org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat");
    System.exit(1);
   }
   String geoIPFile = args[2];
   IPAddressStats stats = new IPAddressStats(geoIPFile);
   stats.stat(args);
   System.exit(0);
  }
}

For the concrete logic, see the comments in the code. We use Maven to manage and build the Java program; first, here are the dependencies declared in my pom configuration:

<dependencies>
   <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.0-incubating</version>
   </dependency>
   <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.16</version>
   </dependency>
   <dependency>
    <groupId>dnsjava</groupId>
    <artifactId>dnsjava</artifactId>
    <version>2.1.1</version>
   </dependency>
   <dependency>
    <groupId>commons-net</groupId>
    <artifactId>commons-net</artifactId>
    <version>3.1</version>
   </dependency>
   <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
   </dependency>
  </dependencies>

One thing to note: when the program runs on a Spark cluster, the Job we write must be serializable. Fields that don't need to be serialized, or can't be, should simply be marked transient. Above, the lookupService field does not implement a serialization interface, so transient excludes it from serialization; without it, you may hit an error like the following:

14/03/10 22:34:06 INFO scheduler.DAGScheduler: Failed to run collect at IPAddressStats.java:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.shirdrn.spark.job.IPAddressStats
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:741)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:740)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:740)
  at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Running the Java program on the Spark cluster

Here, after implementing the code above, I used Maven's maven-assembly-plugin to package the program, configured as follows:

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
   <archive>
    <manifest>
     <mainClass>org.shirdrn.spark.job.UserAgentStats</mainClass>
    </manifest>
   </archive>
   <descriptorRefs>
    <descriptorRef>jar-with-dependencies</descriptorRef>
   </descriptorRefs>
   <excludes>
    <exclude>*.properties</exclude>
    <exclude>*.xml</exclude>
   </excludes>
  </configuration>
  <executions>
   <execution>
    <id>make-assembly</id>
    <phase>package</phase>
    <goals>
     <goal>single</goal>
    </goals>
   </execution>
  </executions>
</plugin>
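With this plugin configured, producing the runnable fat JAR is a single Maven invocation (a sketch; the artifact name shown is the jar-with-dependencies file that appears in the run logs below):

mvn clean package
# produces target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar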

This packs all the dependency libraries into the program's JAR. Finally, copy the JAR to a Linux machine (not necessarily the Spark cluster's Master node); just make sure Spark's environment variables are configured correctly on that node. After unpacking the Spark distribution you will find the script bin/run-example; we can modify it directly to point at our own Java program's JAR (change the EXAMPLES_DIR variable and the parts that locate the JAR file) and then use the script to run the program. Its content is shown below:

cygwin=false
case "`uname`" in
 CYGWIN*) cygwin=true;;
esac
SCALA_VERSION=2.10
# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`/..; pwd)"
# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"
# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
 . $FWDIR/conf/spark-env.sh
fi
if [ -z "$1" ]; then
 echo "Usage: run-example <example-class> [<args>]" >&2
 exit 1
fi
# Figure out the JAR file that our examples were packaged into. This includes a bit of a hack
# to avoid the -sources and -doc packages that are built by publish-local.
EXAMPLES_DIR="$FWDIR"/java-examples
SPARK_EXAMPLES_JAR=""
if [ -e "$EXAMPLES_DIR"/*.jar ]; then
 export SPARK_EXAMPLES_JAR=`ls "$EXAMPLES_DIR"/*.jar`
fi
if [[ -z $SPARK_EXAMPLES_JAR ]]; then
 echo "Failed to find Spark examples assembly in $FWDIR/examples/target" >&2
 echo "You need to build Spark with sbt/sbt assembly before running this program" >&2
 exit 1
fi
# Since the examples JAR ideally shouldn't include spark-core (that dependency should be
# "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
CLASSPATH="$SPARK_EXAMPLES_JAR:$CLASSPATH"
if $cygwin; then
 CLASSPATH=`cygpath -wp $CLASSPATH`
 export SPARK_EXAMPLES_JAR=`cygpath -w $SPARK_EXAMPLES_JAR`
fi
# Find java binary
if [ -n "${JAVA_HOME}" ]; then
 RUNNER="${JAVA_HOME}/bin/java"
else
 if [ `command -v java` ]; then
 RUNNER="java"
 else
 echo "JAVA_HOME is not set" >&2
 exit 1
 fi
fi
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$SPARK_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
# Load extra JAVA_OPTS from conf/java-opts, if it exists
if [ -e "$FWDIR/conf/java-opts" ] ; then
 JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`"
fi
export JAVA_OPTS
if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
 echo -n "Spark Command: "
 echo "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@"
 echo "========================================"
 echo
fi
exec "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@"

To run the Java program we developed on Spark, execute the following commands:

cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1
./bin/run-my-java-example org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat

My program class org.shirdrn.spark.job.IPAddressStats takes 3 arguments:

The Spark master URL: for example, mine is spark://m1:7077

The input file path: business-specific; here the file hdfs://m1:9000/user/shirdrn/wwwlog20140222.log is read from HDFS

The GeoIP database file: business-specific; an external file used to determine which country an IP address belongs to

If the program is error-free and runs normally, the console prints the run log; sample output is shown below:

14/03/10 22:17:24 INFO job.IPAddressStats: GeoIP file: /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/03/10 22:17:25 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/03/10 22:17:25 INFO Remoting: Starting remoting
14/03/10 22:17:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@m1:57379]
14/03/10 22:17:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@m1:57379]
14/03/10 22:17:25 INFO spark.SparkEnv: Registering BlockManagerMaster
14/03/10 22:17:25 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140310221725-c1cb
14/03/10 22:17:25 INFO storage.MemoryStore: MemoryStore started with capacity 143.8 MB.
14/03/10 22:17:25 INFO network.ConnectionManager: Bound socket to port 45189 with id = ConnectionManagerId(m1,45189)
14/03/10 22:17:25 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/03/10 22:17:25 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager m1:45189 with 143.8 MB RAM
14/03/10 22:17:25 INFO storage.BlockManagerMaster: Registered BlockManager
14/03/10 22:17:25 INFO spark.HttpServer: Starting HTTP Server
14/03/10 22:17:25 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/03/10 22:17:25 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:49186
14/03/10 22:17:25 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.95.3.56:49186
14/03/10 22:17:25 INFO spark.SparkEnv: Registering MapOutputTracker
14/03/10 22:17:25 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-56c3e30d-a01b-4752-83d1-af1609ab2370
14/03/10 22:17:25 INFO spark.HttpServer: Starting HTTP Server
14/03/10 22:17:25 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/03/10 22:17:25 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52073
14/03/10 22:17:26 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
14/03/10 22:17:26 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/03/10 22:17:26 INFO ui.SparkUI: Started Spark Web UI at http://m1:4040
14/03/10 22:17:26 INFO spark.SparkContext: Added JAR /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar at http://10.95.3.56:52073/jars/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1394515046396
14/03/10 22:17:26 INFO client.AppClient$ClientActor: Connecting to master spark://m1:7077...
14/03/10 22:17:26 INFO storage.MemoryStore: ensureFreeSpace(60341) called with curMem=0, maxMem=150837657
14/03/10 22:17:26 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 58.9 KB, free 143.8 MB)
14/03/10 22:17:26 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140310221726-0000
14/03/10 22:17:27 INFO client.AppClient$ClientActor: Executor added: app-20140310221726-0000/0 on worker-20140310221648-s1-52544 (s1:52544) with 1 cores
14/03/10 22:17:27 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140310221726-0000/0 on hostPort s1:52544 with 1 cores, 512.0 MB RAM
14/03/10 22:17:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/10 22:17:27 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/10 22:17:27 INFO client.AppClient$ClientActor: Executor updated: app-20140310221726-0000/0 is now RUNNING
14/03/10 22:17:27 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/10 22:17:27 INFO spark.SparkContext: Starting job: collect at IPAddressStats.java:77
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at IPAddressStats.java:70)
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Got job 0 (collect at IPAddressStats.java:77) with 1 output partitions (allowLocal=false)
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Final stage: Stage 0 (collect at IPAddressStats.java:77)
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at IPAddressStats.java:70), which has no missing parents
14/03/10 22:17:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at IPAddressStats.java:70)
14/03/10 22:17:27 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/03/10 22:17:28 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@s1:59233/user/Executor#-671170811] with ID 0
14/03/10 22:17:28 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: s1 (PROCESS_LOCAL)
14/03/10 22:17:28 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2396 bytes in 5 ms
14/03/10 22:17:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager s1:47282 with 297.0 MB RAM
14/03/10 22:17:32 INFO scheduler.TaskSetManager: Finished TID 0 in 3376 ms on s1 (progress: 0/1)
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 0)
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Stage 1 (reduceByKey at IPAddressStats.java:70) finished in 4.420 s
14/03/10 22:17:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
14/03/10 22:17:32 INFO scheduler.DAGScheduler: running: Set()
14/03/10 22:17:32 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
14/03/10 22:17:32 INFO scheduler.DAGScheduler: failed: Set()
14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 1.0 from pool
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[6] at reduceByKey at IPAddressStats.java:70), which is now runnable
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[6] at reduceByKey at IPAddressStats.java:70)
14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/03/10 22:17:32 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 1 on executor 0: s1 (PROCESS_LOCAL)
14/03/10 22:17:32 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 2255 bytes in 1 ms
14/03/10 22:17:32 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@s1:33534
14/03/10 22:17:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 120 bytes
14/03/10 22:17:32 INFO scheduler.TaskSetManager: Finished TID 1 in 282 ms on s1 (progress: 0/1)
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
14/03/10 22:17:32 INFO scheduler.DAGScheduler: Stage 0 (collect at IPAddressStats.java:77) finished in 0.314 s
14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
14/03/10 22:17:32 INFO spark.SparkContext: Job finished: collect at IPAddressStats.java:77, took 4.870958309 s
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 58.246.49.218  312
14/03/10 22:17:32 INFO job.IPAddressStats: [KR] 1.234.83.77  300
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.11.16  212
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 110.85.72.254  207
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 27.150.229.134  185
14/03/10 22:17:32 INFO job.IPAddressStats: [HK] 180.178.52.181  181
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.37.210.212  180
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 222.77.226.83  176
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.11.205  169
14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.9.19  165
...

We can also check the status of the running Application through the web console: the Master node's port 8080 (e.g. http://m1:8080/) shows the status of the cluster's applications.

One more note: if you develop Spark applications in Java with Eclipse in a Unix environment, you can also connect to the Spark cluster directly from Eclipse and submit the application there for the cluster to process.

Summary

That's all for this detailed walkthrough of writing and running a Spark application in Java; hopefully it helps. If you have questions, leave a comment at any time and we will reply promptly.



More answers

Answered 39 hours 38 minutes ago

Does running a Java program depend on the operating system?

In the architecture of the Java platform, the Java Virtual Machine (JVM) sits at the core; it is the key to making programs independent of the underlying operating system and hardware. Below it lies the porting interface, made up of two parts, an adapter and a Java OS layer, where the platform-dependent part is called the adapter; the JVM is implemented on a concrete platform and operating system through this porting interface. Above the JVM sit Java's base class libraries, extension libraries, and their APIs. Applications and applets written against the Java API can run on any Java platform without caring about the underlying platform, precisely because the JVM separates the program from the operating system; that separation is what gives Java its platform independence.

1. In short, Java programs are cross-platform: they depend on no particular operating system, only on the JVM. Whether the OS is Windows, Linux, or anything else, as long as a matching JVM is installed, the same program can run.
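As a minimal illustration of that platform independence (a generic sketch, not from the answer above; Hello.java is a hypothetical example file):

# Compile once, on any operating system: javac emits platform-neutral bytecode
javac Hello.java        # produces Hello.class
# Run wherever a JVM is installed: the JVM maps the bytecode to native instructions
java Hello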
