From 89cbde1e8c4be21d77fb9db2c670a7d324f9f521 Mon Sep 17 00:00:00 2001 From: Kenny Ballou Date: Mon, 29 Jan 2018 22:58:38 -0700 Subject: storm-vs-spark post conversion --- content/blog/Storm-vs-Spark.markdown | 461 ----------------------------------- posts/storm-vs-spark.org | 428 ++++++++++++++++++++++++++++++++ 2 files changed, 428 insertions(+), 461 deletions(-) delete mode 100644 content/blog/Storm-vs-Spark.markdown create mode 100644 posts/storm-vs-spark.org diff --git a/content/blog/Storm-vs-Spark.markdown b/content/blog/Storm-vs-Spark.markdown deleted file mode 100644 index e5f7776..0000000 --- a/content/blog/Storm-vs-Spark.markdown +++ /dev/null @@ -1,461 +0,0 @@ ---- -title: "Apache Storm and Apache Spark Streaming" -description: "Comparison of Apache Storm and Apache Spark Streaming" -tags: - - "Apache Storm" - - "Apache Spark" - - "Apache" - - "Real-time Streaming" - - "ZData Inc." -date: "2014-09-08" -categories: - - "Apache" - - "Development" - - "Real-time Systems" -slug: "apache-storm-and-apache-spark" ---- - -This is the last post in the series on real-time systems. In the [first -post][3] we discussed [Apache Storm][1] and [Apache Kafka][5]. In the [second -post][4] we discussed [Apache Spark (Streaming)][3]. In both posts we examined -a small Twitter Sentiment Analysis program. Today, we will be reviewing both -systems: how they compare and how they contrast. - -The intention is not to cast judgment over one project or the other, but rather -to exposit the differences and similarities. Any judgments made, subtle or not, -are mistakes in exposition and/ or organization and are not actual endorsements -of either project. - -## Apache Storm ## - -"Storm is a distributed real-time computation system" [[1][1]]. Apache Storm is -a [task parallel][7] continuous computational engine. It defines its workflows -in Directed Acyclic Graphs (DAG's) called "topologies". These topologies run -until shutdown by the user or encountering an unrecoverable failure. - -Storm does not natively run on top of typical Hadoop clusters, it uses -[Apache ZooKeeper][8] and its own master/ minion worker processes to -coordinate topologies, master and worker state, and the message guarantee -semantics. That said, both [Yahoo!][9] and [Hortonworks][10] are working on -providing libraries for running Storm topologies on top of Hadoop 2.x YARN -clusters. Furthermore, Storm can run on top of the [Mesos][11] scheduler as -well, [natively][12] and with help from the [Marathon][13] framework. - -Regardless though, Storm can certainly still consume files from HDFS and/ or -write files to HDFS. - -## Apache Spark (Streaming) ## - -"Apache Spark is a fast and general purpose engine for large-scale data -processing" [[2][2]]. [Apache Spark][2] is a [data parallel][8] general purpose -batch processing engine. Workflows are defined in a similar and reminiscent -style of MapReduce, however, is much more capable than traditional Hadoop -MapReduce. Apache Spark has its Streaming API project that allows for -continuous processing via short interval batches. Similar to Storm, Spark -Streaming jobs run until shutdown by the user or encounter an unrecoverable -failure. - -Apache Spark does not itself require Hadoop to operate. However, its data -parallel paradigm requires a shared filesystem for optimal use of stable data. -The stable source can range from [S3][14], [NFS][15], or, more typically, -[HDFS][16]. - -Executing Spark applications does not _require_ Hadoop YARN. 
Spark has its own -standalone master/ server processes. However, it is common to run Spark -applications using YARN containers. Furthermore, Spark can also run on Mesos -clusters. - -## Development ## - -As of this writing, Apache Spark is a full, top level Apache project. Whereas -Apache Storm is currently undergoing incubation. Moreover, the latest stable -version of Apache Storm is `0.9.2` and the latest stable version of Apache -Spark is `1.0.2` (with `1.1.0` to be released in the coming weeks). Of course, -as the Apache Incubation reminder states, this does not strictly reflect -stability or completeness of either project. It is, however, a reflection to -the state of the communities. Apache Spark operations and its process are -endorsed by the [Apache Software Foundation][27]. Apache Storm is working on -stabilizing its community and development process. - -Spark's `1.x` version does state that the API has stabilized and will not be -doing major changes undermining backward compatibility. Implicitly, Storm has -no guaranteed stability in its API, however, it is [running in production for -many different companies][34]. - -### Implementation Language ### - -Both Apache Spark and Apache Storm are implemented in JVM based languages: -[Scala][19] and [Clojure][20], respectively. - -Scala is a functional meets object-oriented language. In other words, the -language carries ideas from both the functional world and the object-oriented -world. This yields an interesting mix of code reusability, extensibility, and -higher-order functions. - -Clojure is a dialect of [Lisp][21] targeting the JVM providing the Lisp -philosophy: code-as-data and providing the rich macro system typical of Lisp -languages. Clojure is predominately functional in nature, however, if state or -side-effects are required, they are facilitated with a transactional memory -model, aiding in making multi-threaded based applications consistent and safe. - -#### Message Passing Layer #### - -Until version `0.9.x`, Storm was using the Java library [JZMQ][22] for -[ZeroMQ][23] messages. However, Storm has since moved the default messaging -layer to [Netty][24] with efforts from [Yahoo!][25]. Although Netty is now -being used by default, users can still use ZeroMQ, if desired, since the -migration to Netty was intended to also make the message layer pluggable. - -Spark, on the other hand, uses a combination of [Netty][24] and [Akka][26] for -distributing messages throughout the executors. - -### Commit Velocity ### - -As a reminder, these data are included not to cast judgment on one project or -the other, but rather to exposit the fluidness of each project. The continuum -of the dynamics of both projects can be used as an argument for or against, -depending on application requirements. If rigid stability is a strong -requirement, arguing for a slower commit velocity may be appropriate. - -Source of the following statistics were taken from the graphs at -[GitHub](https://github.com/) and computed from [this script][38]. - -#### Spark Commit Velocity #### - -Examining the graphs from -[GitHub](https://github.com/apache/spark/graphs/commit-activity), over the last -month (as of this writing), there have been over 330 commits. The previous -month had about 340. - -#### Storm Commit Velocity #### - -Again examining the commit graphs from -[GitHub](https://github.com/apache/incubator-storm/graphs/commit-activity), -over the last month (as of this writing), there have been over 70 commits. The -month prior had over 130. 
- -### Issue Velocity ### - -Sourcing the summary charts from JIRA, we can see that clearly Spark has a huge -volume of issues reported and closed in the last 30 days. Storm, roughly, an -order of magnitude less. - -Spark Open and Closed JIRA Issues (last 30 days): - -{{< figure src="/media/spark_issues_chart.png" - link="https://issues.apache.org/jira/browse/SPARK/" - alt="Apache Spark JIRA issues" >}} - -Storm Open and Closed JIRA Issues (last 30 days): - -{{< figure src="/media/storm_issues_chart.png" - link="https://issues.apache.org/jira/browse/STORM/" - alt="Apache Storm JIRA issues" >}} - -### Contributor/ Community Size ### - -#### Storm Contributor Size #### - -Sourcing the reports from -[GitHub](https://github.com/apache/incubator-storm/graphs/contributors), Storm -has over a 100 contributors. This number, though, is just the unique number of -people who have committed at least one patch. - -Over the last 60 days, Storm has seen 34 unique contributors and 16 over the -last 30. - -#### Spark Contributor Size #### - -Similarly sourcing the reports from [GitHub](https://github.com/apache/spark), -Spark has roughly 280 contributors. A similar note as before must be made about -this number: this is the number of at least one patch contributors to the -project. - -Apache Spark has had over 140 contributors over the last 60 days and 94 over -the last 30 days. - -## Development Friendliness ## - -### Developing for Storm ### - -* Describing the process structure with DAG's feels natural to the - [processing model][7]. Each node in the graph will transform the data in a - certain way, and the process continues, possibly disjointly. - -* Storm tuples, the data passed between nodes in the DAG, have a very natural - interface. However, this comes at a cost to compile-time type safety. - -### Developing for Spark ### - -* Spark's monadic expression of transformations over the data similarly feels - natural in this [processing model][6]; this falls in line with the idea - that RDD's are lazy and maintain transformation lineages, rather than - actuallized results. - -* Spark's use of Scala Tuples can feel awkward in Java, and this awkwardness - is only exacerbated with the nesting of generic types. However, this - awkwardness does come with the benefit of compile-time type checks. - - - Furthermore, until Java 1.8, anonymous functions are inherently - awkward. - - - This is probably a non-issue if using Scala. - -## Installation / Administration ## - -Installation of both Apache Spark and Apache Storm are relatively straight -forward. Spark may be simpler in some regards, however, since it technically -does not _need_ to be installed to function on YARN or Mesos clusters. The -Spark application will just require the Spark assembly be present in the -`CLASSPATH`. - -Storm, on the other hand, requires ZooKeeper to be properly installed and -running on top of the regular Storm binaries that must be installed. -Furthermore, like ZooKeeper, Storm should run under [supervision][35]; -installation of a supervisor service, e.g., [supervisord][28], is recommended. - -With respect to installation, supporting projects like Apache Kafka are out of -scope and have no impact on the installation of either Storm or Spark. - -## Processing Models ## - -Comparing Apache Storm and Apache Spark's Streaming, turns out to be a bit -challenging. 
One is a true stream processing engine that can do micro-batching, -the other is a batch processing engine which micro-batches, but cannot perform -streaming in the strictest sense. Furthermore, the comparison between streaming -and batching isn't exactly a subtle difference, these are fundamentally -different computing ideas. - -### Batch Processing ### - -[Batch processing][31] is the familiar concept of processing data en masse. The -batch size could be small or very large. This is the processing model of the -core Spark library. - -Batch processing excels at processing _large_ amounts of stable, existing data. -However, it generally incurs a high-latency and is completely unsuitable for -incoming data. - -### Event-Stream Processing ### - -[Stream processing][32] is a _one-at-a-time_ processing model; a datum is -processed as it arrives. The core Storm library follows this processing model. - -Stream processing excels at computing transformations as data are ingested with -sub-second latencies. However, with stream processing, it is incredibly -difficult to process stable data efficiently. - -### Micro-Batching ### - -Micro-batching is a special case of batch processing where the batch size is -orders smaller. Spark Streaming operates in this manner as does the Storm -[Trident API][33]. - -Micro-batching seems to be a nice mix between batching and streaming. However, -micro-batching incurs a cost of latency. If sub-second latency is paramount, -micro-batching will typically not suffice. On the other hand, micro-batching -trivially gives stateful computation, making [windowing][37] an easy task. - -## Fault-Tolerance / Message Guarantees ## - -As a result of each project's fundamentally different processing models, the -fault-tolerance and message guarantees are handled differently. - -### Delivery Semantics ### - -Before diving into each project's fault-tolerance and message guarantees, here -are the common delivery semantics: - -* At most once: messages may be lost but never redelivered. - -* At least once: messages will never be lost but may be redelivered. - -* Exactly once: messages are never lost and never redelivered, perfect - message delivery. - -### Apache Storm ### - -To provide fault-tolerant messaging, Storm has to keep track of each and every -record. By default, this is done with at least once delivery semantics. -Storm can be configured to provide at most once and exactly once. The delivery -semantics offered by Storm can incur latency costs; if data loss in the stream -is acceptable, at most once delivery will improve performance. - -### Apache Spark Streaming ### - -The resiliency built into Spark RDD's and the micro-batching yields a trivial -mechanism for providing fault-tolerance and message delivery guarantees. That -is, since Spark Streaming is just small-scale batching, exactly once delivery -is a trivial result for each batch; this is the _only_ delivery semantic -available to Spark. However some failure scenarios of Spark Streaming degrade -to at least once delivery. - -## Applicability ## - -### Apache Storm ### - -Some areas where Storm excels include: near real-time analytics, natural -language processing, data normalization and [ETL][36] transformations. It also -stands apart from traditional MapReduce and other course-grained technologies -yielding fine-grained transformations allowing very flexible processing -topologies. - -### Apache Spark Streaming ### - -Spark has an excellent model for performing iterative machine learning and -interactive analytics. 
But Spark also excels in some similar areas of Storm -including near real-time analytics, ingestion. - -## Final Thoughts ## - -Generally, the requirements will dictate the choice. However, here are some -major points to consider when choosing the right tool: - -* Latency: Is the performance of the streaming application paramount? Storm - can give sub-second latency much more easily and with less restrictions - than Spark Streaming. - -* Development Cost: Is it desired to have similar code bases for batch - processing _and_ stream processing? With Spark, batching and streaming are - _very_ similar. Storm, however, departs dramatically from the MapReduce - paradigm. - -* Message Delivery Guarantees: Is there high importance on processing _every_ - single record, or is some nominal amount of data loss acceptable? - Disregarding everything else, Spark trivially yields perfect, exactly once - message delivery. Storm can provide all three delivery semantics, but - getting perfect exactly once message delivery requires more effort to - properyly achieve. - -* Process Fault Tolerance: Is high-availability of primary concern? Both - systems actually handle fault-tolerance of this kind really well and in - relatively similar ways. - - - Production Storm clusters will run Storm processes under - [supervision][35]; if a process fails, the supervisor process will - restart it automatically. State management is handled through - ZooKeeper. Processes restarting will reread the state from ZooKeeper on - an attempt to rejoin the cluster. - - - Spark handles restarting workers via the resource manager: YARN, Mesos, - or its standalone manager. Spark's standalone resource manager handles - master node failure with standby-masters and ZooKeeper. Or, this can be - handled more primatively with just local filesystem state - checkpointing, not typically recommended for production environments. - -Both Apache Spark Streaming and Apache Storm are great solutions that solve the -streaming ingestion and transformation problem. Either system can be a great -choice for part of an analytics stack. Choosing the right one is simply a -matter of answering the above questions. - -## References ## - -[spark_jira_issues]: https://kennyballou.com/media/spark_issues_chart.png - -[storm_jira_issues]: https://kennyballou.com/media/storm_issues_chart.png - -[1]: http://storm.incubator.apache.org/documentation/Home.html - -* [Apache Storm Home Page][1] - -[2]: http://spark.apache.org - -* [Apache Spark][2] - -[3]: http://www.zdatainc.com/2014/07/real-time-streaming-apache-storm-apache-kafka/ - -* [Real Time Streaming with Apache Storm and Apache Kafka][3] - -[4]: http://www.zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/ - -* [Real Time Streaming with Apache Spark (Streaming)][4] - -[5]: http://kafka.apache.org/ - -* [Apache Kafka][5] - -[6]: http://en.wikipedia.org/wiki/Data_parallelism - -* [Wikipedia: Data Parallelism][6] - -[7]: http://en.wikipedia.org/wiki/Task_parallelism - -* [Wikipedia: Task Parallelism][7] - -[8]: http://zookeeper.apache.org - -* [Apache ZooKeeper][8] - -[9]: https://github.com/yahoo/storm-yarn - -* [Yahoo! 
Storm-YARN][9] - -[10]: http://hortonworks.com/kb/storm-on-yarn-install-on-hdp2-beta-cluster/ - -* [Hortonworks: Storm on YARN][10] - -[11]: http://mesos.apache.org - -* [Apache Mesos][11] - -[12]: https://mesosphere.io/learn/run-storm-on-mesos/ - -* [Run Storm on Mesos][12] - -[13]: https://github.com/mesosphere/marathon - -* [Marathon][13] - -[14]: http://aws.amazon.com/s3/ - -[15]: http://en.wikipedia.org/wiki/Network_File_System - -[16]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html - -[17]: https://issues.apache.org/jira/browse/STORM/ - -[18]: https://issues.apache.org/jira/browse/SPARK/ - -[19]: http://www.scala-lang.org/ - -[20]: http://clojure.org/ - -[21]: http://en.wikipedia.org/wiki/Lisp_(programming_language) - -[22]: https://github.com/zeromq/jzmq - -[23]: http://zeromq.org/ - -[24]: http://netty.io/ - -[25]: http://yahooeng.tumblr.com/post/64758709722/making-storm-fly-with-netty - -[26]: http://akka.io - -[27]: http://www.apache.org/ - -[28]: http://supervisord.org - -[29]: http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html - -* [Storm vs Spark Streaming: Side by Side][29] - -[30]: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming - -* [Storm vs Spark Streaming (Slideshare)][30] - -[31]: http://en.wikipedia.org/wiki/Batch_processing - -[32]: http://en.wikipedia.org/wiki/Event_stream_processing - -[33]: https://storm.incubator.apache.org/documentation/Trident-API-Overview.html - -[34]: http://storm.incubator.apache.org/documentation/Powered-By.html - -[35]: http://en.wikipedia.org/wiki/Process_supervision - -[36]: http://en.wikipedia.org/wiki/Extract,_transform,_load - -[37]: http://en.wikipedia.org/wiki/Window_function_(SQL)#Window_function - -[38]: https://gist.github.com/kennyballou/c6ff37e5eef6710794a6 diff --git a/posts/storm-vs-spark.org b/posts/storm-vs-spark.org new file mode 100644 index 0000000..5a34f37 --- /dev/null +++ b/posts/storm-vs-spark.org @@ -0,0 +1,428 @@ +#+TITLE: Apache Storm and Apache Spark Streaming +#+DESCRIPTION: Comparison of Apache Storm and Apache Spark Streaming +#+TAGS: Apache Storm +#+TAGS: Apache Spark +#+TAGS: Apache +#+TAGS: Real-time Streaming +#+TAGS: zData Inc. 
+#+DATE: 2014-09-08
+#+SLUG: apache-storm-and-apache-spark
+#+LINK: storm https://storm.apache.org/
+#+LINK: spark https://spark.apache.org/
+#+LINK: storm-post https://kennyballou.com/blog/2014/07/real-time-streaming-storm-and-kafka
+#+LINK: spark-post https://kennyballou.com/blog/2014/08/real-time-streaming-apache-spark-streaming
+#+LINK: kafka https://kafka.apache.org/
+#+LINK: wiki-data-parallelism http://en.wikipedia.org/wiki/Data_parallelism
+#+LINK: wiki-task-parallelism http://en.wikipedia.org/wiki/Task_parallelism
+#+LINK: zookeeper https://zookeeper.apache.org/
+#+LINK: storm-yarn https://github.com/yahoo/storm-yarn
+#+LINK: horton-storm-yarn http://hortonworks.com/kb/storm-on-yarn-install-on-hdp2-beta-cluster/
+#+LINK: mesos https://mesos.apache.org
+#+LINK: mesos-run-storm https://mesosphere.io/learn/run-storm-on-mesos/
+#+LINK: marathon https://github.com/mesosphere/marathon
+#+LINK: aws-s3 http://aws.amazon.com/s3/
+#+LINK: wiki-nfs http://en.wikipedia.org/wiki/Network_File_System
+#+LINK: hdfs-user-guide http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
+#+LINK: storm-jira-issues https://issues.apache.org/jira/browse/STORM/
+#+LINK: spark-jira-issues https://issues.apache.org/jira/browse/SPARK/
+#+LINK: scala http://www.scala-lang.org/
+#+LINK: clojure http://clojure.org/
+#+LINK: wiki-lisp http://en.wikipedia.org/wiki/Lisp_(programming_language)
+#+LINK: jzmq https://github.com/zeromq/jzmq
+#+LINK: zeromq http://zeromq.org/
+#+LINK: netty http://netty.io/
+#+LINK: yahoo-storm-netty http://yahooeng.tumblr.com/post/64758709722/making-storm-fly-with-netty
+#+LINK: akka http://akka.io
+#+LINK: apache http://www.apache.org/
+#+LINK: supervisord http://supervisord.org
+#+LINK: xinhstechblog-storm-spark http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
+#+LINK: ptgoetz-storm-spark http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
+#+LINK: wiki-batch-processing http://en.wikipedia.org/wiki/Batch_processing
+#+LINK: wiki-event-processing http://en.wikipedia.org/wiki/Event_stream_processing
+#+LINK: storm-trident-overview https://storm.incubator.apache.org/documentation/Trident-API-Overview.html
+#+LINK: storm-powered-by http://storm.incubator.apache.org/documentation/Powered-By.html
+#+LINK: wiki-process-supervision http://en.wikipedia.org/wiki/Process_supervision
+#+LINK: wiki-etl http://en.wikipedia.org/wiki/Extract,_transform,_load
+#+LINK: wiki-sql-window-function http://en.wikipedia.org/wiki/Window_function_(SQL)#Window_function
+#+LINK: git-stat-gist https://gist.github.com/kennyballou/c6ff37e5eef6710794a6
+#+LINK: github https://github.com/
+#+LINK: spark-commit-activity https://github.com/apache/spark/graphs/commit-activity
+#+LINK: storm-commit-activity https://github.com/apache/incubator-storm/graphs/commit-activity
+#+LINK: storm-github-contributors https://github.com/apache/incubator-storm/graphs/contributors
+#+LINK: spark-github https://github.com/apache/spark
+
+#+BEGIN_PREVIEW
+This is the last post in the series on real-time systems. In the
+[[storm-post][first post]] we discussed [[storm][Apache Storm]] and
+[[kafka][Apache Kafka]]. In the [[spark-post][second post]] we discussed
+[[spark][Apache Spark (Streaming)]]. In both posts we examined a small Twitter
+Sentiment Analysis program. Today, we will be reviewing both systems: how they
+compare and how they contrast.
+#+END_PREVIEW
+
+The intention is not to cast judgment over one project or the other, but
+rather to exposit the differences and similarities. Any judgments made, subtle
+or not, are mistakes in exposition and/or organization and are not actual
+endorsements of either project.
+
+** Apache Storm
+
+"Storm is a distributed real-time computation system" [[storm][Storm]]. Apache
+Storm is a [[wiki-task-parallelism][task parallel]] continuous computational
+engine. It defines its workflows in Directed Acyclic Graphs (DAG's) called
+"topologies". These topologies run until shut down by the user or until they
+encounter an unrecoverable failure.
+
+Storm does not natively run on top of typical Hadoop clusters; it uses
+[[zookeeper][Apache ZooKeeper]] and its own master/minion worker processes to
+coordinate topologies, master and worker state, and the message guarantee
+semantics. That said, both [[storm-yarn][Yahoo!]] and
+[[horton-storm-yarn][Hortonworks]] are working on providing libraries for
+running Storm topologies on top of Hadoop 2.x YARN clusters. Furthermore,
+Storm can run on top of the [[mesos][Mesos]] scheduler as well,
+[[mesos-run-storm][natively]] and with help from the [[marathon][Marathon]]
+framework.
+
+Regardless, Storm can certainly still consume files from HDFS and/or
+write files to HDFS.
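+
+For a concrete sense of how such a topology is expressed, here is a minimal
+sketch against the =0.9.x= Java API; =TweetSpout= and =SentimentBolt= are
+hypothetical stand-ins for the spout and bolt developed in the earlier posts.
+
+#+BEGIN_SRC java
+import backtype.storm.Config;
+import backtype.storm.StormSubmitter;
+import backtype.storm.topology.TopologyBuilder;
+
+public class SentimentTopology {
+    public static void main(String[] args) throws Exception {
+        TopologyBuilder builder = new TopologyBuilder();
+        // Nodes of the DAG: a spout feeding a single downstream bolt.
+        builder.setSpout("tweets", new TweetSpout(), 1);
+        builder.setBolt("sentiment", new SentimentBolt(), 4)
+                .shuffleGrouping("tweets");
+        // The topology runs until killed or until it fails unrecoverably.
+        StormSubmitter.submitTopology("sentiment-topology", new Config(),
+                builder.createTopology());
+    }
+}
+#+END_SRC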
+
+** Apache Spark (Streaming)
+
+"Apache Spark is a fast and general purpose engine for large-scale data
+processing" [[spark][Spark]]. [[spark][Apache Spark]] is a
+[[wiki-data-parallelism][data parallel]] general purpose batch processing
+engine. Workflows are defined in a style reminiscent of MapReduce; Spark,
+however, is much more capable than traditional Hadoop MapReduce. Apache Spark
+has its Streaming API project that allows for continuous processing via
+short-interval batches. Similar to Storm, Spark Streaming jobs run until shut
+down by the user or until they encounter an unrecoverable failure.
+
+Apache Spark does not itself require Hadoop to operate. However, its data
+parallel paradigm requires a shared filesystem for optimal use of stable data.
+The stable source can be [[aws-s3][S3]], [[wiki-nfs][NFS]], or, more
+typically, [[hdfs-user-guide][HDFS]].
+
+Executing Spark applications does not /require/ Hadoop YARN. Spark has its own
+standalone master/server processes. However, it is common to run Spark
+applications using YARN containers. Furthermore, Spark can also run on Mesos
+clusters.
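+
+As a sketch of what this looks like in practice (assuming the Spark =1.x= Java
+API), a streaming job is an ordinary driver program whose transformations are
+re-run over every short batch interval:
+
+#+BEGIN_SRC java
+import org.apache.spark.SparkConf;
+import org.apache.spark.streaming.Duration;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+
+public class StreamingSketch {
+    public static void main(String[] args) throws Exception {
+        SparkConf conf = new SparkConf().setAppName("streaming-sketch");
+        // Batches are cut every two seconds; this interval bounds latency.
+        JavaStreamingContext jssc =
+                new JavaStreamingContext(conf, new Duration(2000));
+        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
+        lines.print();  // re-evaluated once per micro-batch
+        jssc.start();
+        jssc.awaitTermination();  // runs until shutdown or failure
+    }
+}
+#+END_SRC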
+
+** Development
+
+As of this writing, Apache Spark is a full, top-level Apache project, whereas
+Apache Storm is currently undergoing incubation. Moreover, the latest stable
+version of Apache Storm is =0.9.2= and the latest stable version of Apache
+Spark is =1.0.2= (with =1.1.0= to be released in the coming weeks). Of course,
+as the Apache Incubation reminder states, this does not strictly reflect the
+stability or completeness of either project. It is, however, a reflection of
+the state of the communities. Apache Spark's operations and processes are
+endorsed by the [[apache][Apache Software Foundation]]. Apache Storm is
+working on stabilizing its community and development process.
+
+Spark's =1.x= version does state that the API has stabilized and will not
+undergo major changes that undermine backward compatibility. Storm,
+implicitly, makes no such stability guarantee for its API; however, it is
+[[storm-powered-by][running in production at many different companies]].
+
+*** Implementation Language
+
+Both [[spark][Apache Spark]] and [[storm][Apache Storm]] are implemented in
+JVM-based languages: [[scala][Scala]] and [[clojure][Clojure]], respectively.
+
+Scala is a functional-meets-object-oriented language. In other words, the
+language carries ideas from both the functional world and the object-oriented
+world. This yields an interesting mix of code reusability, extensibility, and
+higher-order functions.
+
+Clojure is a dialect of [[wiki-lisp][Lisp]] targeting the JVM, providing the
+Lisp philosophy of code-as-data and the rich macro system typical of Lisp
+languages. Clojure is predominantly functional in nature; however, if state or
+side-effects are required, they are facilitated with a transactional memory
+model, aiding in making multi-threaded applications consistent and safe.
+
+**** Message Passing Layer
+
+Until version =0.9.x=, Storm was using the Java library [[jzmq][JZMQ]] for
+[[zeromq][ZeroMQ]] messages. However, Storm has since moved the default
+messaging layer to [[netty][Netty]] with efforts from
+[[yahoo-storm-netty][Yahoo!]]. Although Netty is now being used by default,
+users can still use ZeroMQ, if desired, since the migration to Netty was
+intended to also make the message layer pluggable.
+
+Spark, on the other hand, uses a combination of [[netty][Netty]] and
+[[akka][Akka]] for distributing messages throughout the executors.
+
+*** Commit Velocity
+
+As a reminder, these data are included not to cast judgment on one project or
+the other, but rather to exposit the fluidness of each project. The continuum
+of the dynamics of both projects can be used as an argument for or against,
+depending on application requirements. If rigid stability is a strong
+requirement, arguing for a slower commit velocity may be appropriate.
+
+The following statistics were sourced from the graphs at [[github][GitHub]]
+and computed with [[git-stat-gist][this script]].
+
+**** Spark Commit Velocity
+
+Examining the graphs from [[spark-commit-activity][GitHub]], over the last
+month (as of this writing), there have been over 330 commits. The previous
+month had about 340.
+
+**** Storm Commit Velocity
+
+Again examining the commit graphs from [[storm-commit-activity][GitHub]], over
+the last month (as of this writing), there have been over 70 commits. The
+month prior had over 130.
+
+*** Issue Velocity
+
+Sourcing the summary charts from JIRA, we can see that Spark clearly has a
+huge volume of issues reported and closed in the last 30 days; Storm has
+roughly an order of magnitude fewer.
+
+Spark Open and Closed JIRA Issues (last 30 days):
+
+#+ATTR_HTML: :align center
+#+BEGIN_figure
+#+NAME: fig: spark-issues-chart
+[[file:/media/spark_issues_chart.png]]
+#+END_figure
+
+Storm Open and Closed JIRA Issues (last 30 days):
+
+#+ATTR_HTML: :align center
+#+BEGIN_figure
+#+NAME: fig: storm-issues-chart
+[[file:/media/storm_issues_chart.png]]
+#+END_figure
+
+*** Contributor/Community Size
+
+**** Storm Contributor Size
+
+Sourcing the reports from [[storm-github-contributors][GitHub]], Storm has
+over 100 contributors. This number, though, is just the number of unique
+people who have committed at least one patch.
+
+Over the last 60 days, Storm has seen 34 unique contributors and 16 over the
+last 30.
+
+**** Spark Contributor Size
+
+Similarly sourcing the reports from [[spark-github][GitHub]], Spark has
+roughly 280 contributors. The same caveat applies to this number: it counts
+contributors with at least one patch in the project.
+
+Apache Spark has had over 140 contributors over the last 60 days and 94 over
+the last 30 days.
+
+** Development Friendliness
+
+*** Developing for Storm
+
+- Describing the process structure with DAG's feels natural to the
+  [[wiki-task-parallelism][processing model]]. Each node in the graph will
+  transform the data in a certain way, and the process continues, possibly
+  disjointly.
+
+- Storm tuples, the data passed between nodes in the DAG, have a very natural
+  interface. However, this comes at a cost to compile-time type safety.
+
+*** Developing for Spark
+
+- Spark's monadic expression of transformations over the data similarly feels
+  natural in this [[wiki-data-parallelism][processing model]]; this falls in
+  line with the idea that RDD's are lazy and maintain transformation lineages,
+  rather than actualized results.
+
+- Spark's use of Scala Tuples can feel awkward in Java, and this awkwardness
+  is only exacerbated by the nesting of generic types (see the sketch
+  following this list). However, this awkwardness does come with the benefit
+  of compile-time type checks.
+
+  - Furthermore, until Java 1.8, anonymous functions are inherently awkward.
+
+  - This is probably a non-issue if using Scala.
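+
+The following sketch (assuming the Spark =1.x= Java API and a hypothetical
+word stream) shows the awkwardness in question: each step is an anonymous
+class, and the pair type is Scala's =Tuple2= surfacing into Java. The upside
+is that a mistyped field is a compile error rather than a runtime failure.
+
+#+BEGIN_SRC java
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+import scala.Tuple2;
+
+public class WordCounts {
+    static JavaPairDStream<String, Integer> count(JavaDStream<String> words) {
+        // Pre-Java-1.8: no lambdas, so each function is an anonymous class.
+        return words.mapToPair(new PairFunction<String, String, Integer>() {
+            public Tuple2<String, Integer> call(String word) {
+                return new Tuple2<String, Integer>(word, 1);
+            }
+        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
+            public Integer call(Integer a, Integer b) {
+                return a + b;
+            }
+        });
+    }
+}
+#+END_SRC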
+
+** Installation / Administration
+
+Installation of both Apache Spark and Apache Storm is relatively
+straightforward. Spark may be simpler in some regards, however, since it
+technically does not /need/ to be installed to function on YARN or Mesos
+clusters. The Spark application will just require the Spark assembly be
+present in the =CLASSPATH=.
+
+Storm, on the other hand, requires a properly installed and running ZooKeeper
+ensemble in addition to the Storm binaries themselves. Furthermore, like
+ZooKeeper, Storm should run under [[wiki-process-supervision][supervision]];
+installation of a supervisor service, e.g., [[supervisord][supervisord]], is
+recommended.
+
+With respect to installation, supporting projects like Apache Kafka are out of
+scope and have no impact on the installation of either Storm or Spark.
+
+** Processing Models
+
+Comparing Apache Storm and Apache Spark's Streaming turns out to be a bit
+challenging. One is a true stream processing engine that can do
+micro-batching, while the other is a batch processing engine which
+micro-batches but cannot perform streaming in the strictest sense.
+Furthermore, the difference between streaming and batching is not subtle;
+these are fundamentally different computing ideas.
+
+*** Batch Processing
+
+[[wiki-batch-processing][Batch processing]] is the familiar concept of
+processing data en masse. The batch size could be small or very large. This
+is the processing model of the core Spark library.
+
+Batch processing excels at processing /large/ amounts of stable, existing
+data. However, it generally incurs high latency and is completely unsuitable
+for incoming data.
+
+*** Event-Stream Processing
+
+[[wiki-event-processing][Stream processing]] is a /one-at-a-time/ processing
+model; a datum is processed as it arrives. The core Storm library follows this
+processing model.
+
+Stream processing excels at computing transformations as data are ingested,
+with sub-second latencies. However, with stream processing, it is incredibly
+difficult to process stable data efficiently.
+
+*** Micro-Batching
+
+Micro-batching is a special case of batch processing where the batch size is
+orders of magnitude smaller. Spark Streaming operates in this manner, as does
+the Storm [[storm-trident-overview][Trident API]].
+
+Micro-batching seems to be a nice mix between batching and streaming. However,
+micro-batching incurs a latency cost. If sub-second latency is paramount,
+micro-batching will typically not suffice. On the other hand, micro-batching
+trivially gives stateful computation, making
+[[wiki-sql-window-function][windowing]] an easy task, as the sketch below
+shows.
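+
+As an illustration (assuming the Spark =1.x= Java API and the pair stream
+from the earlier sketch), a sliding window is little more than a restatement
+of the batch boundaries:
+
+#+BEGIN_SRC java
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.streaming.Duration;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+
+public class WindowedCounts {
+    static JavaPairDStream<String, Integer> windowed(
+            JavaPairDStream<String, Integer> counts) {
+        // A 60-second window recomputed every 10 seconds; both durations
+        // must be multiples of the underlying batch interval.
+        return counts.reduceByKeyAndWindow(
+                new Function2<Integer, Integer, Integer>() {
+                    public Integer call(Integer a, Integer b) {
+                        return a + b;
+                    }
+                },
+                new Duration(60000),
+                new Duration(10000));
+    }
+}
+#+END_SRC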
+
+** Fault-Tolerance / Message Guarantees
+
+As a result of each project's fundamentally different processing models, the
+fault-tolerance and message guarantees are handled differently.
+
+*** Delivery Semantics
+
+Before diving into each project's fault-tolerance and message guarantees, here
+are the common delivery semantics:
+
+- At most once: messages may be lost but never redelivered.
+
+- At least once: messages will never be lost but may be redelivered.
+
+- Exactly once: messages are never lost and never redelivered; perfect message
+  delivery.
+
+*** Apache Storm
+
+To provide fault-tolerant messaging, Storm has to keep track of each and every
+record. By default, this is done with at least once delivery semantics. Storm
+can be configured to provide at most once and exactly once delivery. The
+delivery semantics offered by Storm can incur latency costs; if data loss in
+the stream is acceptable, at most once delivery will improve performance.
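+
+The per-record tracking surfaces directly in the bolt API: emitted tuples are
+anchored to their inputs, and every input is explicitly acknowledged or
+failed. A minimal sketch against the =0.9.x= API (the transformation and
+field name are hypothetical):
+
+#+BEGIN_SRC java
+import backtype.storm.task.OutputCollector;
+import backtype.storm.task.TopologyContext;
+import backtype.storm.topology.OutputFieldsDeclarer;
+import backtype.storm.topology.base.BaseRichBolt;
+import backtype.storm.tuple.Fields;
+import backtype.storm.tuple.Tuple;
+import backtype.storm.tuple.Values;
+
+import java.util.Map;
+
+public class NormalizeBolt extends BaseRichBolt {
+    private OutputCollector collector;
+
+    public void prepare(Map conf, TopologyContext ctx, OutputCollector out) {
+        this.collector = out;
+    }
+
+    public void execute(Tuple input) {
+        try {
+            // Anchoring the emit to the input ties the two tuples together,
+            // so a downstream failure replays the whole tuple tree.
+            collector.emit(input,
+                    new Values(input.getString(0).toLowerCase()));
+            collector.ack(input);   // mark the input fully processed
+        } catch (Exception e) {
+            collector.fail(input);  // trigger replay from the spout
+        }
+    }
+
+    public void declareOutputFields(OutputFieldsDeclarer declarer) {
+        declarer.declare(new Fields("normalized"));
+    }
+}
+#+END_SRC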
+
+*** Apache Spark Streaming
+
+The resiliency built into Spark RDD's and the micro-batching yields a trivial
+mechanism for providing fault-tolerance and message delivery guarantees. That
+is, since Spark Streaming is just small-scale batching, exactly once delivery
+is a trivial result for each batch; this is the /only/ delivery semantic
+available to Spark. However, some failure scenarios of Spark Streaming degrade
+to at least once delivery.
+
+** Applicability
+
+*** Apache Storm
+
+Some areas where Storm excels include: near real-time analytics, natural
+language processing, data normalization, and [[wiki-etl][ETL]]
+transformations. It also stands apart from traditional MapReduce and other
+coarse-grained technologies by yielding fine-grained transformations, allowing
+very flexible processing topologies.
+
+*** Apache Spark Streaming
+
+Spark has an excellent model for performing iterative machine learning and
+interactive analytics. But Spark also excels in some of the same areas as
+Storm, including near real-time analytics and ingestion.
+
+** Final Thoughts
+
+Generally, the requirements will dictate the choice. However, here are some
+major points to consider when choosing the right tool:
+
+- Latency: Is the performance of the streaming application paramount? Storm
+  can give sub-second latency much more easily and with fewer restrictions
+  than Spark Streaming.
+
+- Development Cost: Is it desired to have similar code bases for batch
+  processing /and/ stream processing? With Spark, batching and streaming are
+  /very/ similar. Storm, however, departs dramatically from the MapReduce
+  paradigm.
+
+- Message Delivery Guarantees: Is there high importance on processing /every/
+  single record, or is some nominal amount of data loss acceptable?
+  Disregarding everything else, Spark trivially yields perfect, exactly once
+  message delivery. Storm can provide all three delivery semantics, but
+  getting perfect exactly once message delivery requires more effort to
+  properly achieve.
+
+- Process Fault Tolerance: Is high availability of primary concern? Both
+  systems actually handle fault-tolerance of this kind really well and in
+  relatively similar ways.
+
+  - Production Storm clusters will run Storm processes under
+    [[wiki-process-supervision][supervision]]; if a process fails, the
+    supervisor process will restart it automatically. State management is
+    handled through ZooKeeper. Restarting processes will reread the state
+    from ZooKeeper in an attempt to rejoin the cluster.
+
+  - Spark handles restarting workers via the resource manager: YARN, Mesos,
+    or its standalone manager. Spark's standalone resource manager handles
+    master node failure with standby masters and ZooKeeper. Alternatively,
+    this can be handled more primitively with just local filesystem state
+    checkpointing, which is not typically recommended for production
+    environments.
+
+Both Apache Spark Streaming and Apache Storm are great solutions that solve
+the streaming ingestion and transformation problem. Either system can be a
+great choice for part of an analytics stack. Choosing the right one is simply
+a matter of answering the above questions.
+
+** References
+
+- [[storm][Apache Storm Home Page]]
+
+- [[spark][Apache Spark]]
+
+- [[storm-post][Real Time Streaming with Apache Storm and Apache Kafka]]
+
+- [[spark-post][Real Time Streaming with Apache Spark (Streaming)]]
+
+- [[kafka][Apache Kafka]]
+
+- [[wiki-data-parallelism][Wikipedia: Data Parallelism]]
+
+- [[wiki-task-parallelism][Wikipedia: Task Parallelism]]
+
+- [[zookeeper][Apache ZooKeeper]]
+
+- [[storm-yarn][Yahoo! Storm-YARN]]
+
+- [[horton-storm-yarn][Hortonworks: Storm on YARN]]
+
+- [[mesos][Apache Mesos]]
+
+- [[mesos-run-storm][Run Storm on Mesos]]
+
+- [[marathon][Marathon]]
+
+- [[xinhstechblog-storm-spark][Storm vs Spark Streaming: Side by Side]]
+
+- [[ptgoetz-storm-spark][Storm vs Spark Streaming (Slideshare)]]
--
cgit v1.2.1