From 1379fcdbced6666a81946baf48bfcccd7c7fbc5c Mon Sep 17 00:00:00 2001
From: Kenny Ballou
Date: Sun, 28 Jan 2018 12:34:08 -0700
Subject: spark streaming post conversion

---
 posts/spark.org | 783 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 783 insertions(+)
 create mode 100644 posts/spark.org

diff --git a/posts/spark.org b/posts/spark.org
new file mode 100644
index 0000000..7ed0685
--- /dev/null
+++ b/posts/spark.org
@@ -0,0 +1,783 @@
#+TITLE: Real-Time Streaming with Apache Spark Streaming
#+DESCRIPTION: Overview of Apache Spark and a sample Twitter Sentiment Analysis
#+TAGS: Apache Spark
#+TAGS: Apache Kafka
#+TAGS: Apache
#+TAGS: Java
#+TAGS: Sentiment Analysis
#+TAGS: Real-time Streaming
#+TAGS: ZData Inc.
#+DATE: 2014-08-18
#+SLUG: real-time-streaming-apache-spark-streaming
#+LINK: storm-and-kafka /blog/2014/07/real-time-streaming-storm-and-kafka/
#+LINK: kafka https://kafka.apache.org/
#+LINK: spark https://spark.apache.org/
#+LINK: spark-sql https://spark.apache.org/sql/
#+LINK: spark-streaming https://spark.apache.org/streaming/
#+LINK: spark-mllib https://spark.apache.org/mllib/
#+LINK: spark-graphx https://spark.apache.org/graphx/
#+LINK: lazy-evaluation http://en.wikipedia.org/wiki/Lazy_evaluation
#+LINK: nsdi-12 https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
#+LINK: spark-programming-guide http://spark.apache.org/docs/latest/programming-guide.html
#+LINK: data-parallelism http://en.wikipedia.org/wiki/Data_parallelism
#+LINK: spark-faq http://spark.apache.org/faq.html
#+LINK: S3 http://aws.amazon.com/s3/
#+LINK: NFS http://en.wikipedia.org/wiki/Network_File_System
#+LINK: HDFS http://en.wikipedia.org/wiki/Apache_Hadoop
#+LINK: spark-standalone http://spark.apache.org/docs/latest/spark-standalone.html
#+LINK: YARN http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
#+LINK: spark-running-yarn http://spark.apache.org/docs/latest/running-on-yarn.html
#+LINK: Mesos http://mesos.apache.org
#+LINK: spark-running-mesos http://spark.apache.org/docs/latest/running-on-mesos.html
#+LINK: EC2 http://aws.amazon.com/ec2/
#+LINK: spark-running-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html
#+LINK: Scala http://www.scala-lang.org/
#+LINK: Java https://en.wikipedia.org/wiki/Java_%28programming_language%29
#+LINK: Python https://www.python.org/
#+LINK: monad-programming http://en.wikipedia.org/wiki/Monad_(functional_programming)
#+LINK: maven-dependency-hell http://cupofjava.de/blog/2013/02/01/fight-dependency-hell-in-maven/
#+LINK: spark-939 https://issues.apache.org/jira/browse/SPARK-939
#+LINK: storm-sample-project https://github.com/zdata-inc/StormSampleProject
#+LINK: kafka-producer https://github.com/zdata-inc/SimpleKafkaProducer
#+LINK: storm-kafka-streaming https://kennyballou.com/blog/2014/07/real-time-streaming-storm-and-kafka
#+LINK: spark-sample-project https://github.com/zdata-inc/SparkSampleProject
#+LINK: jackson-databind https://github.com/FasterXML/jackson-databind
#+LINK: hotcloud12 https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf
#+LINK: spark-sql-future http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
#+LINK: bag-of-words http://en.wikipedia.org/wiki/Bag-of-words_model
#+LINK: spark-kafka-extern-lib https://github.com/apache/spark/tree/master/external/kafka
#+LINK: docker http://www.docker.io/
#+LINK: state-spark-2014 http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/

#+BEGIN_PREVIEW
This is the second post in a series on real-time systems tangential to the
Hadoop ecosystem. [[storm-and-kafka][Last time]], we talked about
[[kafka][Apache Kafka]] and Apache Storm for use in a real-time processing
engine. Today, we will be exploring Apache Spark (Streaming) as part of a
real-time processing engine.
#+END_PREVIEW

** About Spark

[[spark][Apache Spark]] is a general-purpose, large-scale processing engine,
recently fully inducted as a top-level Apache project and currently under very
active development. As of this writing, Spark is at version 1.0.2, and 1.1
will be released some time soon.

Spark is intended to be a drop-in replacement for Hadoop MapReduce that
provides the benefit of improved performance. Combine Spark with its related
projects and libraries -- [[spark-sql][Spark SQL (formerly Shark)]],
[[spark-streaming][Spark Streaming]], [[spark-mllib][Spark MLlib]],
[[spark-graphx][GraphX]], among others -- and a very capable and promising
processing stack emerges. Spark is capable of reading from HBase, Hive,
Cassandra, and any HDFS data source, and many external libraries enable
consuming data from many more sources -- for example, hooking Apache Kafka
into Spark Streaming is trivial. Further, the Spark Streaming project provides
the ability to continuously compute transformations on data.

*** Resilient Distributed Datasets

Apache Spark's primitive type is the Resilient Distributed Dataset (RDD). All
transformations in Spark -- ~map~, ~join~, ~reduce~, etc. -- revolve around
this type. RDD's can be created in one of three ways: /parallelizing/
(distributing a local dataset); reading a stable, external data source, such
as an HDFS file; or transforming an existing RDD.

In Java, parallelizing may look like:

#+BEGIN_SRC java
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> distData = sc.parallelize(data);
#+END_SRC

Where ~sc~ is the Spark context.

Similarly, reading a file from HDFS may look like:

#+BEGIN_SRC java
    JavaRDD<String> distFile = sc.textFile("hdfs:///data.txt");
#+END_SRC

The resiliency of RDD's comes from their [[lazy-evaluation][lazy]]
materialization and the information required to enable this lazy nature.
RDD's are not always fully materialized, but they /do/ contain enough
information (their lineage) to be (re)created from a stable source
[[nsdi-12][Zaharia et al.]].

RDD's are distributed among the participating machines, and RDD
transformations are coarse-grained -- the same transformation is applied to
/every/ element in an RDD. The number of partitions in an RDD is generally
determined by the locality of the stable source; however, the user may control
this number via repartitioning.
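For example, the partition count can be supplied when parallelizing a
collection, or changed later with ~repartition~ -- a small sketch, with
arbitrary partition counts:

#+BEGIN_SRC java
    // Distribute the local collection across eight partitions.
    JavaRDD<Integer> distData = sc.parallelize(data, 8);

    // Later, reshuffle the same data into four partitions.
    JavaRDD<Integer> repartitioned = distData.repartition(4);
#+END_SRC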
Another important property to mention: RDD's are immutable. This immutability
can be illustrated with [[spark-programming-guide][Spark's]] Word Count
example:

#+BEGIN_SRC java
    JavaRDD<String> file = sc.textFile("hdfs:///data.txt");
    JavaRDD<String> words = file.flatMap(
        new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        }
    );
    JavaPairRDD<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        }
    );
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        }
    );
    counts.saveAsTextFile("hdfs:///data_counted.txt");
#+END_SRC

This is the canonical word count example, but here is a brief explanation:
load a file into an RDD, split the words into a new RDD, map the words into
pairs where each word is given a count (one), then reduce the counts by key --
in this case, the word itself. Notice that each operation -- ~flatMap~,
~mapToPair~, ~reduceByKey~ -- creates a /new/ RDD.

To bring all these properties together, Resilient Distributed Datasets are
read-only, lazy, distributed sets of elements that can have a chain of
transformations applied to them. They facilitate resiliency by storing lineage
graphs of the transformations (to be) applied, and they
[[data-parallelism][parallelize]] the computations by partitioning the data
among the participating machines.

*** Discretized Streams

Moving to Spark Streaming, the primitive is still the RDD. However, there is
another type for encapsulating a continuous stream of data: Discretized
Streams, or DStreams. DStreams are defined as sequences of RDD's. A DStream is
created from an input source, such as Apache Kafka, or from the transformation
of another DStream.

It turns out that programming against DStreams is /very/ similar to
programming against RDD's. The same word count code can be slightly modified
to create a streaming word counter:

#+BEGIN_SRC java
    JavaReceiverInputDStream<String> lines =
        ssc.socketTextStream("localhost", 9999);
    JavaDStream<String> words = lines.flatMap(
        new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        }
    );
    JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        }
    );
    JavaPairDStream<String, Integer> counts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        }
    );
    counts.print();
#+END_SRC

Notice that the only real change from the first example is the return types.
In the streaming context, transformations work on streams of RDD's; Spark
handles applying the functions (which operate on the data within the RDD's) to
the RDD's in the current batch of the DStream.

Though programming against DStreams is similar, there are some differences as
well. Chiefly, DStreams also have /stateful/ transformations. These include
sharing state between batches (intervals) and modifying the current frame when
aggregating over a sliding window (see the sketch following the quote below).

#+BEGIN_QUOTE
  The key idea is to treat streaming as a series of short batch jobs, and
  bring down the latency of these jobs as much as possible. This brings many
  of the benefits of batch processing models to stream processing, including
  clear consistency semantics and a new parallel recovery technique...
  [[[https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf][Zaharia
  et al.]]]
#+END_QUOTE
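To make the windowed, stateful operations mentioned above concrete, here is a
small, hypothetical extension of the streaming word count that maintains
counts over a sliding window using ~reduceByKeyAndWindow~ (the window and
slide durations are arbitrary):

#+BEGIN_SRC java
    // Count words seen in the last 30 seconds, recomputed every 10 seconds.
    JavaPairDStream<String, Integer> windowedCounts =
        pairs.reduceByKeyAndWindow(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer a, Integer b) { return a + b; }
            },
            new Duration(30000),
            new Duration(10000));
    windowedCounts.print();
#+END_SRC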
*** Hadoop Requirements

Technically speaking, Apache Spark does [[spark-faq][/not/]] require Hadoop to
be fully functional. In a cluster setting, however, some means of sharing
files between tasks is needed; this could be accomplished through [[S3][S3]],
[[NFS][NFS]], or, more typically, [[HDFS][HDFS]].

*** Running Spark Applications

Apache Spark applications can run in [[spark-standalone][standalone mode]] or
be managed by [[YARN][YARN]] ([[spark-running-yarn][Running Spark on YARN]]),
[[Mesos][Mesos]] ([[spark-running-mesos][Running Spark on Mesos]]), and even
[[EC2][EC2]] ([[spark-running-ec2][Running Spark on EC2]]). Furthermore, when
running under YARN or Mesos, Spark does not need to be installed to work. That
is, Spark code can execute on YARN and Mesos clusters without changes to the
cluster.

*** Language Support

Currently, Apache Spark supports the [[Scala][Scala]], [[Java][Java]], and
[[Python][Python]] programming languages, though this post will only discuss
examples in [[Java][Java]].

*** Initial Thoughts

Getting away from the idea of directed acyclic graphs (DAG's) may be both a
bit of a leap and a benefit. Although it is perfectly acceptable to define
Spark's transformations altogether as a DAG, this can feel awkward when
developing Spark applications; describing the transformations as
[[monad-programming][monadic]] feels much more natural. Of course, a monadic
structure fits the DAG analogy quite well, especially when considered
alongside physical analogies such as assembly lines.

Java's, and consequently Spark's, type strictness was an initial hurdle to get
accustomed to. But overall, this is good: it means the compiler will catch a
lot of issues with transformations early.

Depending on Scala's ~Tuple[\d]~ classes feels second-class, but this is only
a minor tedium. It's too bad current versions of Java don't have good classes
for this common structure.

YARN and Mesos integration is a very nice benefit, as it allows full-stack
analytics without oversubscribing clusters. Furthermore, it makes it possible
to add to existing infrastructure without overloading the developers and the
system administrators with /yet another/ computational suite and/or resource
manager.

On the negative side of things, dependency hell can creep into Spark projects.
Your project and Spark (and possibly Spark's dependencies) may depend on a
common artifact. If the versions don't [[maven-dependency-hell][converge]],
many subtle problems can emerge. There is an [[spark-939][experimental
configuration option]] to help alleviate this problem; however, for me, it
caused more problems than it solved.

** Test Project: Twitter Stream Sentiment Analysis

To really test Spark (Streaming), a Twitter Sentiment Analysis project was
developed. It is almost a direct port of the [[storm-sample-project][Storm
code]]. Although there is an external library for hooking Spark directly into
Twitter, Kafka is used so a more precise comparison of Spark and Storm can be
made.

When the processing is finished, the data are written to HDFS and posted to a
simple NodeJS application.
*** Setup
    :PROPERTIES:
    :CUSTOM_ID: setup
    :END:

The setup is the same as
[[https://kennyballou.com/blog/2014/07/real-time-streaming-storm-and-kafka][last
time]]: a 5 node Vagrant virtual cluster, with each node running 64-bit CentOS
6.5 and given 1 core and 1024MB of RAM. Every node runs HDFS (datanode), a
YARN worker node (nodemanager), ZooKeeper, and Kafka. The first node, ~node0~,
is the namenode and resource manager. ~node0~ also runs a
[[http://www.docker.io/][Docker]] container with a NodeJS application for
reporting purposes.

*** Application Overview

This project follows a process structure very similar to the Storm Topology
from last time.

[[file:/media/SentimentAnalysisTopology.png]]

However, each node in the above graph is actually a transformation on the
current DStream and not an individual process (or group of processes).

This test project uses the same [[kafka-producer][simple Kafka producer]]
developed last time; this Kafka producer is how data are ingested by the
system.

**** Kafka Receiver Stream
     :PROPERTIES:
     :CUSTOM_ID: kafka-receiver-stream
     :END:

The data processed are received from a Kafka stream, implemented via the
[[https://github.com/apache/spark/tree/master/external/kafka][external
Kafka]] library. This process simply creates a connection to the Kafka
broker(s) and consumes messages from the given set of topics.

***** Stripping Kafka Message IDs
      :PROPERTIES:
      :CUSTOM_ID: stripping-kafka-message-ids
      :END:

It turns out the messages from Kafka are returned as tuples -- more
specifically, pairs -- of the message ID and the message content. Before
continuing, the message ID is stripped and the Twitter JSON data is passed
down the pipeline.

**** Twitter Data JSON Parsing
     :PROPERTIES:
     :CUSTOM_ID: twitter-data-json-parsing
     :END:

As was the case last time, the important parts (tweet ID, tweet text, and
language code) need to be extracted from the JSON. Furthermore, this project
only parses English tweets; non-English tweets are filtered out at this stage.

**** Filtering and Stemming
     :PROPERTIES:
     :CUSTOM_ID: filtering-and-stemming
     :END:

Many tweets contain messy or otherwise unnecessary characters and punctuation
that can be safely ignored. Moreover, there may also be many common words that
cannot be reliably scored either positively or negatively. At this stage,
these symbols and /stop words/ are filtered out.

**** Classifiers
     :PROPERTIES:
     :CUSTOM_ID: classifiers
     :END:

The positive classifier and the negative classifier are implemented as
separate ~map~ transformations. Both follow the
[[http://en.wikipedia.org/wiki/Bag-of-words_model][Bag-of-words]] model.

**** Joining and Scoring
     :PROPERTIES:
     :CUSTOM_ID: joining-and-scoring
     :END:

Because the classifiers are computed separately, a join situation is
contrived: the next step is to join the classifier scores together and
actually declare a winner. It turns out this is quite trivial to do in Spark.

**** Reporting: HDFS and HTTP POST
     :PROPERTIES:
     :CUSTOM_ID: reporting-hdfs-and-http-post
     :END:

Finally, once the tweets are joined and scored, the scores need to be
reported. This is accomplished by writing the final tuples to HDFS and posting
a JSON object of each tuple to a simple NodeJS application.

This process turned out not to be as awkward as it was with Storm.
The ~foreachRDD~ function of DStreams is a natural way to perform
side-effecting operations that don't otherwise transform the data.

*** Implementing the Kafka Producer

See the [[storm-kafka-streaming][post]] from last time for the details of the
Kafka producer; it has not changed.

*** Implementing the Spark Streaming Application

Diving into the code, here are some of the primary aspects of this project.
The full source of this test application can be found on
[[spark-sample-project][Github]].

**** Creating the Spark Context and Wiring the Transformation Chain

The Spark context, the data source, and the transformations need to be
defined, and then the context needs to be started. This is all accomplished
with the following code:

#+BEGIN_SRC java
    SparkConf conf = new SparkConf()
        .setAppName("Twitter Sentiment Analysis");

    if (args.length > 0)
        conf.setMaster(args[0]);
    else
        conf.setMaster("local[2]");

    JavaStreamingContext ssc = new JavaStreamingContext(
        conf,
        new Duration(2000));

    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put(KAFKA_TOPIC, KAFKA_PARALLELIZATION);

    JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(
            ssc,
            Properties.getString("rts.spark.zkhosts"),
            "twitter.sentimentanalysis.kafka",
            topicMap);

    JavaDStream<String> json = messages.map(
        new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> message) {
                return message._2();
            }
        }
    );

    JavaPairDStream<Long, String> tweets = json.mapToPair(
        new TwitterFilterFunction());

    JavaPairDStream<Long, String> filtered = tweets.filter(
        new Function<Tuple2<Long, String>, Boolean>() {
            public Boolean call(Tuple2<Long, String> tweet) {
                return tweet != null;
            }
        }
    );

    JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
        new TextFilterFunction());

    tweetsFiltered = tweetsFiltered.map(
        new StemmingFunction());

    JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
        tweetsFiltered.mapToPair(new PositiveScoreFunction());

    JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
        tweetsFiltered.mapToPair(new NegativeScoreFunction());

    JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
        positiveTweets.join(negativeTweets);

    JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
        joined.map(new Function<Tuple2<Tuple2<Long, String>,
                                       Tuple2<Float, Float>>,
                                Tuple4<Long, String, Float, Float>>() {
            public Tuple4<Long, String, Float, Float> call(
                Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
            {
                return new Tuple4<Long, String, Float, Float>(
                    tweet._1()._1(),
                    tweet._1()._2(),
                    tweet._2()._1(),
                    tweet._2()._2());
            }
        });

    JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
        scoredTweets.map(new ScoreTweetsFunction());

    result.foreachRDD(new FileWriter());
    result.foreachRDD(new HTTPNotifierFunction());

    ssc.start();
    ssc.awaitTermination();
#+END_SRC

Some of the more trivial transformations are defined in-line; the others are
defined in their respective files.

**** Twitter Data Filter / Parser

Parsing the Twitter JSON data is one of the first transformations and is
accomplished with the help of the [[jackson-databind][Jackson Databind]]
library.

#+BEGIN_SRC java
    JsonNode root = mapper.readValue(tweet, JsonNode.class);
    long id;
    String text;
    if (root.get("lang") != null &&
        "en".equals(root.get("lang").textValue()))
    {
        if (root.get("id") != null && root.get("text") != null)
        {
            id = root.get("id").longValue();
            text = root.get("text").textValue();
            return new Tuple2<Long, String>(id, text);
        }
        return null;
    }
    return null;
#+END_SRC

The ~mapper~ (~ObjectMapper~) object is defined at the class level so it is
not recreated /for each/ RDD in the DStream -- a minor optimization.
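For reference, a sketch of how ~TwitterFilterFunction~ might be laid out --
assuming, as the wiring above suggests, that it implements ~PairFunction~,
with the ~mapper~ held in a class-level field; imports are elided as in the
other listings, and the exact layout in the sample project may differ:

#+BEGIN_SRC java
    public class TwitterFilterFunction
            implements PairFunction<String, Long, String> {

        // Shared across calls so it is not recreated for each record.
        private static final ObjectMapper mapper = new ObjectMapper();

        public Tuple2<Long, String> call(String tweet) {
            try {
                JsonNode root = mapper.readValue(tweet, JsonNode.class);
                if (root.get("lang") != null
                        && "en".equals(root.get("lang").textValue())
                        && root.get("id") != null
                        && root.get("text") != null) {
                    return new Tuple2<Long, String>(
                        root.get("id").longValue(),
                        root.get("text").textValue());
                }
            } catch (IOException ex) {
                // Malformed JSON falls through and is filtered out later.
            }
            return null;
        }
    }
#+END_SRC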
You may recall, this is essentially the same code as
[[storm-kafka-streaming][last time]]; the only real difference is that the
tuple is returned instead of being emitted. Because certain situations (e.g.,
a non-English or malformed tweet) return null, the nulls will need to be
filtered out. Thankfully, Spark provides a simple way to accomplish this:

#+BEGIN_SRC java
    JavaPairDStream<Long, String> filtered = tweets.filter(
        new Function<Tuple2<Long, String>, Boolean>() {
            public Boolean call(Tuple2<Long, String> tweet) {
                return tweet != null;
            }
        }
    );
#+END_SRC

**** Text Filtering

As mentioned before, punctuation and other symbols are simply discarded, as
they provide little to no benefit to the classifiers:

#+BEGIN_SRC java
    String text = tweet._2();
    text = text.replaceAll("[^a-zA-Z\\s]", "").trim().toLowerCase();
    return new Tuple2<Long, String>(tweet._1(), text);
#+END_SRC

Similarly, common words should be discarded as well:

#+BEGIN_SRC java
    String text = tweet._2();
    List<String> stopWords = StopWords.getWords();
    for (String word : stopWords)
    {
        text = text.replaceAll("\\b" + word + "\\b", "");
    }
    return new Tuple2<Long, String>(tweet._1(), text);
#+END_SRC

**** Positive and Negative Scoring

Each classifier is defined in its own class, and both are /very/ similar in
definition.

The positive classifier is primarily defined by:

#+BEGIN_SRC java
    String text = tweet._2();
    Set<String> posWords = PositiveWords.getWords();
    String[] words = text.split(" ");
    int numWords = words.length;
    int numPosWords = 0;
    for (String word : words)
    {
        if (posWords.contains(word))
            numPosWords++;
    }
    return new Tuple2<Tuple2<Long, String>, Float>(
        new Tuple2<Long, String>(tweet._1(), tweet._2()),
        (float) numPosWords / numWords
    );
#+END_SRC

And the negative classifier:

#+BEGIN_SRC java
    String text = tweet._2();
    Set<String> negWords = NegativeWords.getWords();
    String[] words = text.split(" ");
    int numWords = words.length;
    int numNegWords = 0;
    for (String word : words)
    {
        if (negWords.contains(word))
            numNegWords++;
    }
    return new Tuple2<Tuple2<Long, String>, Float>(
        new Tuple2<Long, String>(tweet._1(), tweet._2()),
        (float) numNegWords / numWords
    );
#+END_SRC

Because both implement a ~PairFunction~, a join situation is contrived.
However, this could /easily/ be defined differently such that one classifier
is computed and then the other, without ever needing to join the two together.

**** Joining

It turns out that joining in Spark is very easy to accomplish -- so easy, in
fact, that it can be handled with virtually /no/ code:

#+BEGIN_SRC java
    JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
        positiveTweets.join(negativeTweets);
#+END_SRC

But because working with a tuple of nested tuples is unwieldy, transform it
into a 4-element tuple:

#+BEGIN_SRC java
    public Tuple4<Long, String, Float, Float> call(
        Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
    {
        return new Tuple4<Long, String, Float, Float>(
            tweet._1()._1(),
            tweet._1()._2(),
            tweet._2()._1(),
            tweet._2()._2());
    }
#+END_SRC

**** Scoring: Declaring the Winning Class

Declaring the winning class is a matter of a simple map, comparing the two
class scores and taking the greater:

#+BEGIN_SRC java
    String score;
    if (tweet._3() >= tweet._4())
        score = "positive";
    else
        score = "negative";
    return new Tuple5<Long, String, Float, Float, String>(
        tweet._1(),
        tweet._2(),
        tweet._3(),
        tweet._4(),
        score);
#+END_SRC

This declarer is optimistic about the neutral case -- a tie is scored as
positive -- but is otherwise very straightforward.
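If a distinct neutral class were desired instead, the tie could be broken
explicitly. A minimal, hypothetical variant of the same ~map~ body (not part
of the sample project):

#+BEGIN_SRC java
    String score;
    if (tweet._3() > tweet._4())
        score = "positive";
    else if (tweet._3() < tweet._4())
        score = "negative";
    else
        // Covers ties, including tweets where neither classifier matched.
        score = "neutral";
    return new Tuple5<Long, String, Float, Float, String>(
        tweet._1(),
        tweet._2(),
        tweet._3(),
        tweet._4(),
        score);
#+END_SRC

The NodeJS application receiving the scores would, of course, need to expect
the third label.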
**** Reporting the Results

Finally, the pipeline completes by writing the results to HDFS:

#+BEGIN_SRC java
    if (rdd.count() <= 0) return null;
    String path = Properties.getString("rts.spark.hdfs_output_file") +
        "_" +
        time.milliseconds();
    rdd.saveAsTextFile(path);
#+END_SRC

And by sending a POST request to a NodeJS application:

#+BEGIN_SRC java
    rdd.foreach(new SendPostFunction());
#+END_SRC

Where ~SendPostFunction~ is primarily given by:

#+BEGIN_SRC java
    String webserver = Properties.getString("rts.spark.webserv");
    HttpClient client = new DefaultHttpClient();
    HttpPost post = new HttpPost(webserver);
    String content = String.format(
        "{\"id\": \"%d\", " +
        "\"text\": \"%s\", " +
        "\"pos\": \"%f\", " +
        "\"neg\": \"%f\", " +
        "\"score\": \"%s\" }",
        tweet._1(),
        tweet._2(),
        tweet._3(),
        tweet._4(),
        tweet._5());

    try
    {
        post.setEntity(new StringEntity(content));
        HttpResponse response = client.execute(post);
        org.apache.http.util.EntityUtils.consume(response.getEntity());
    }
    catch (Exception ex)
    {
        Logger LOG = Logger.getLogger(this.getClass());
        LOG.error("exception thrown while attempting to post", ex);
        LOG.trace(null, ex);
    }
#+END_SRC

Each file written to HDFS /will/ have data in it, but the data written will be
small. A better batching procedure should be implemented so the files written
match the HDFS block size.

Similarly, a POST request is opened /for each/ scored tweet. This can be
expensive for both the Spark Streaming batch timings and the web server
receiving the requests; batching here could similarly improve the overall
performance of the system.

That said, writing these side-effects this way fits very naturally into the
Spark programming style.

** Summary

Apache Spark, in combination with Apache Kafka, has some amazing potential --
and not only in the streaming context, but also as a drop-in replacement for
traditional Hadoop MapReduce. This combination makes it a very good candidate
for a role in an analytics engine.

Stay tuned, as the next post will be a more in-depth comparison between Apache
Spark and Apache Storm.
** Related Links / References

- [[spark][Apache Spark]]

- [[state-spark-2014][State of Apache Spark 2014]]

- [[storm-sample-project][Storm Sample Project]]

- [[spark-939][SPARK-939]]

- [[kafka][Apache Kafka]]

- [[storm-kafka-streaming][Real-Time Streaming with Apache Storm and Apache
  Kafka]]

- [[docker][Docker IO Project Page]]

- [[S3][Amazon S3]]

- [[NFS][Network File System (NFS)]]

- [[YARN][Hadoop YARN]]

- [[Mesos][Apache Mesos]]

- [[http://spark.apache.org/docs/latest/streaming-programming-guide.html][Spark
  Streaming Programming Guide]]

- [[monad-programming][Monad]]

- [[spark-sql][Spark SQL]]

- [[spark-streaming][Spark Streaming]]

- [[spark-mllib][MLlib]]

- [[spark-graphx][GraphX]]

- [[spark-standalone][Spark Standalone Mode]]

- [[spark-running-yarn][Running on YARN]]

- [[spark-running-mesos][Running on Mesos]]

- [[maven-dependency-hell][Fight Dependency Hell in Maven]]

- [[kafka-producer][Simple Kafka Producer]]

- [[spark-kafka-extern-lib][Spark: External Kafka Library]]

- [[spark-sample-project][Spark Sample Project]]

- [[bag-of-words][Wikipedia: Bag-of-words]]

- [[jackson-databind][Jackson Databind Project]]

- [[spark-programming-guide][Spark Programming Guide]]

- [[EC2][Amazon EC2]]

- [[spark-running-ec2][Running Spark on EC2]]

- [[spark-faq][Spark FAQ]]

- [[spark-sql-future][Future of Shark]]

- [[nsdi-12][Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
  In-Memory Cluster Computing (PDF)]]

- [[hotcloud12][Discretized Streams: An Efficient and Fault-Tolerant Model for
  Stream Processing on Large Clusters (PDF)]]

- [[lazy-evaluation][Wikipedia: Lazy evaluation]]

- [[data-parallelism][Wikipedia: Data Parallelism]]
--
cgit v1.2.1