author     kballou <kballou@devnulllabs.io>  2014-10-11 01:20:54 -0600
committer  kballou <kballou@devnulllabs.io>  2017-09-02 19:31:44 -0600
commit     b0ccdf8eb059f54a0e8cfad51466f12b8cd76451 (patch)
tree       51664a702772ca797c2e5e94dac3bc479105a611
Wobsite and Blag of KennyBallou
-rw-r--r--  LICENSE                                       438
-rw-r--r--  README.md                                      13
-rw-r--r--  config.yaml                                    14
-rw-r--r--  content/about-me.markdown                      20
-rw-r--r--  content/blog/Spark.markdown                   739
-rw-r--r--  content/blog/Storm-vs-Spark.markdown          457
-rw-r--r--  content/blog/Storm.markdown                   594
-rw-r--r--  layouts/404.html                                8
-rw-r--r--  layouts/_default/list.html                      0
-rw-r--r--  layouts/_default/single.html                   17
-rw-r--r--  layouts/blog/li.html                            0
-rw-r--r--  layouts/blog/single.html                       27
-rw-r--r--  layouts/blog/summary.html                      20
-rw-r--r--  layouts/index.html                             20
-rw-r--r--  layouts/partials/author.html                   21
-rw-r--r--  layouts/partials/footer.html                   13
-rw-r--r--  layouts/partials/head_includes.html             4
-rw-r--r--  layouts/partials/header.html                   20
-rw-r--r--  layouts/partials/meta.html                      2
-rw-r--r--  layouts/partials/subheader.html                 5
-rw-r--r--  layouts/rss.xml                                21
-rw-r--r--  layouts/single.html                             5
-rw-r--r--  layouts/sitemap.xml                            16
-rw-r--r--  static/css/site.css                           131
-rw-r--r--  static/favicon.ico                            bin 0 -> 1150 bytes
-rw-r--r--  static/media/SentimentAnalysisTopology.png    bin 0 -> 12248 bytes
-rw-r--r--  static/media/spark_issues_chart.png           bin 0 -> 10475 bytes
-rw-r--r--  static/media/storm_issues_chart.png           bin 0 -> 9893 bytes
28 files changed, 2605 insertions, 0 deletions
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..718c647
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,438 @@
+Attribution-NonCommercial-ShareAlike 4.0 International
+
+=======================================================================
+
+Creative Commons Corporation ("Creative Commons") is not a law firm and
+does not provide legal services or legal advice. Distribution of
+Creative Commons public licenses does not create a lawyer-client or
+other relationship. Creative Commons makes its licenses and related
+information available on an "as-is" basis. Creative Commons gives no
+warranties regarding its licenses, any material licensed under their
+terms and conditions, or any related information. Creative Commons
+disclaims all liability for damages resulting from their use to the
+fullest extent possible.
+
+Using Creative Commons Public Licenses
+
+Creative Commons public licenses provide a standard set of terms and
+conditions that creators and other rights holders may use to share
+original works of authorship and other material subject to copyright
+and certain other rights specified in the public license below. The
+following considerations are for informational purposes only, are not
+exhaustive, and do not form part of our licenses.
+
+ Considerations for licensors: Our public licenses are
+ intended for use by those authorized to give the public
+ permission to use material in ways otherwise restricted by
+ copyright and certain other rights. Our licenses are
+ irrevocable. Licensors should read and understand the terms
+ and conditions of the license they choose before applying it.
+ Licensors should also secure all rights necessary before
+ applying our licenses so that the public can reuse the
+ material as expected. Licensors should clearly mark any
+ material not subject to the license. This includes other CC-
+ licensed material, or material used under an exception or
+ limitation to copyright. More considerations for licensors:
+ wiki.creativecommons.org/Considerations_for_licensors
+
+ Considerations for the public: By using one of our public
+ licenses, a licensor grants the public permission to use the
+ licensed material under specified terms and conditions. If
+ the licensor's permission is not necessary for any reason--for
+ example, because of any applicable exception or limitation to
+ copyright--then that use is not regulated by the license. Our
+ licenses grant only permissions under copyright and certain
+ other rights that a licensor has authority to grant. Use of
+ the licensed material may still be restricted for other
+ reasons, including because others have copyright or other
+ rights in the material. A licensor may make special requests,
+ such as asking that all changes be marked or described.
+ Although not required by our licenses, you are encouraged to
+ respect those requests where reasonable. More_considerations
+ for the public:
+ wiki.creativecommons.org/Considerations_for_licensees
+
+=======================================================================
+
+Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
+Public License
+
+By exercising the Licensed Rights (defined below), You accept and agree
+to be bound by the terms and conditions of this Creative Commons
+Attribution-NonCommercial-ShareAlike 4.0 International Public License
+("Public License"). To the extent this Public License may be
+interpreted as a contract, You are granted the Licensed Rights in
+consideration of Your acceptance of these terms and conditions, and the
+Licensor grants You such rights in consideration of benefits the
+Licensor receives from making the Licensed Material available under
+these terms and conditions.
+
+
+Section 1 -- Definitions.
+
+ a. Adapted Material means material subject to Copyright and Similar
+ Rights that is derived from or based upon the Licensed Material
+ and in which the Licensed Material is translated, altered,
+ arranged, transformed, or otherwise modified in a manner requiring
+ permission under the Copyright and Similar Rights held by the
+ Licensor. For purposes of this Public License, where the Licensed
+ Material is a musical work, performance, or sound recording,
+ Adapted Material is always produced where the Licensed Material is
+ synched in timed relation with a moving image.
+
+ b. Adapter's License means the license You apply to Your Copyright
+ and Similar Rights in Your contributions to Adapted Material in
+ accordance with the terms and conditions of this Public License.
+
+ c. BY-NC-SA Compatible License means a license listed at
+ creativecommons.org/compatiblelicenses, approved by Creative
+ Commons as essentially the equivalent of this Public License.
+
+ d. Copyright and Similar Rights means copyright and/or similar rights
+ closely related to copyright including, without limitation,
+ performance, broadcast, sound recording, and Sui Generis Database
+ Rights, without regard to how the rights are labeled or
+ categorized. For purposes of this Public License, the rights
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
+ Rights.
+
+ e. Effective Technological Measures means those measures that, in the
+ absence of proper authority, may not be circumvented under laws
+ fulfilling obligations under Article 11 of the WIPO Copyright
+ Treaty adopted on December 20, 1996, and/or similar international
+ agreements.
+
+ f. Exceptions and Limitations means fair use, fair dealing, and/or
+ any other exception or limitation to Copyright and Similar Rights
+ that applies to Your use of the Licensed Material.
+
+ g. License Elements means the license attributes listed in the name
+ of a Creative Commons Public License. The License Elements of this
+ Public License are Attribution, NonCommercial, and ShareAlike.
+
+ h. Licensed Material means the artistic or literary work, database,
+ or other material to which the Licensor applied this Public
+ License.
+
+ i. Licensed Rights means the rights granted to You subject to the
+ terms and conditions of this Public License, which are limited to
+ all Copyright and Similar Rights that apply to Your use of the
+ Licensed Material and that the Licensor has authority to license.
+
+ j. Licensor means the individual(s) or entity(ies) granting rights
+ under this Public License.
+
+ k. NonCommercial means not primarily intended for or directed towards
+ commercial advantage or monetary compensation. For purposes of
+ this Public License, the exchange of the Licensed Material for
+ other material subject to Copyright and Similar Rights by digital
+ file-sharing or similar means is NonCommercial provided there is
+ no payment of monetary compensation in connection with the
+ exchange.
+
+ l. Share means to provide material to the public by any means or
+ process that requires permission under the Licensed Rights, such
+ as reproduction, public display, public performance, distribution,
+ dissemination, communication, or importation, and to make material
+ available to the public including in ways that members of the
+ public may access the material from a place and at a time
+ individually chosen by them.
+
+ m. Sui Generis Database Rights means rights other than copyright
+ resulting from Directive 96/9/EC of the European Parliament and of
+ the Council of 11 March 1996 on the legal protection of databases,
+ as amended and/or succeeded, as well as other essentially
+ equivalent rights anywhere in the world.
+
+ n. You means the individual or entity exercising the Licensed Rights
+ under this Public License. Your has a corresponding meaning.
+
+
+Section 2 -- Scope.
+
+ a. License grant.
+
+ 1. Subject to the terms and conditions of this Public License,
+ the Licensor hereby grants You a worldwide, royalty-free,
+ non-sublicensable, non-exclusive, irrevocable license to
+ exercise the Licensed Rights in the Licensed Material to:
+
+ a. reproduce and Share the Licensed Material, in whole or
+ in part, for NonCommercial purposes only; and
+
+ b. produce, reproduce, and Share Adapted Material for
+ NonCommercial purposes only.
+
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
+ Exceptions and Limitations apply to Your use, this Public
+ License does not apply, and You do not need to comply with
+ its terms and conditions.
+
+ 3. Term. The term of this Public License is specified in Section
+ 6(a).
+
+ 4. Media and formats; technical modifications allowed. The
+ Licensor authorizes You to exercise the Licensed Rights in
+ all media and formats whether now known or hereafter created,
+ and to make technical modifications necessary to do so. The
+ Licensor waives and/or agrees not to assert any right or
+ authority to forbid You from making technical modifications
+ necessary to exercise the Licensed Rights, including
+ technical modifications necessary to circumvent Effective
+ Technological Measures. For purposes of this Public License,
+ simply making modifications authorized by this Section 2(a)
+ (4) never produces Adapted Material.
+
+ 5. Downstream recipients.
+
+ a. Offer from the Licensor -- Licensed Material. Every
+ recipient of the Licensed Material automatically
+ receives an offer from the Licensor to exercise the
+ Licensed Rights under the terms and conditions of this
+ Public License.
+
+ b. Additional offer from the Licensor -- Adapted Material.
+ Every recipient of Adapted Material from You
+ automatically receives an offer from the Licensor to
+ exercise the Licensed Rights in the Adapted Material
+ under the conditions of the Adapter's License You apply.
+
+ c. No downstream restrictions. You may not offer or impose
+ any additional or different terms or conditions on, or
+ apply any Effective Technological Measures to, the
+ Licensed Material if doing so restricts exercise of the
+ Licensed Rights by any recipient of the Licensed
+ Material.
+
+ 6. No endorsement. Nothing in this Public License constitutes or
+ may be construed as permission to assert or imply that You
+ are, or that Your use of the Licensed Material is, connected
+ with, or sponsored, endorsed, or granted official status by,
+ the Licensor or others designated to receive attribution as
+ provided in Section 3(a)(1)(A)(i).
+
+ b. Other rights.
+
+ 1. Moral rights, such as the right of integrity, are not
+ licensed under this Public License, nor are publicity,
+ privacy, and/or other similar personality rights; however, to
+ the extent possible, the Licensor waives and/or agrees not to
+ assert any such rights held by the Licensor to the limited
+ extent necessary to allow You to exercise the Licensed
+ Rights, but not otherwise.
+
+ 2. Patent and trademark rights are not licensed under this
+ Public License.
+
+ 3. To the extent possible, the Licensor waives any right to
+ collect royalties from You for the exercise of the Licensed
+ Rights, whether directly or through a collecting society
+ under any voluntary or waivable statutory or compulsory
+ licensing scheme. In all other cases the Licensor expressly
+ reserves any right to collect such royalties, including when
+ the Licensed Material is used other than for NonCommercial
+ purposes.
+
+
+Section 3 -- License Conditions.
+
+Your exercise of the Licensed Rights is expressly made subject to the
+following conditions.
+
+ a. Attribution.
+
+ 1. If You Share the Licensed Material (including in modified
+ form), You must:
+
+ a. retain the following if it is supplied by the Licensor
+ with the Licensed Material:
+
+ i. identification of the creator(s) of the Licensed
+ Material and any others designated to receive
+ attribution, in any reasonable manner requested by
+ the Licensor (including by pseudonym if
+ designated);
+
+ ii. a copyright notice;
+
+ iii. a notice that refers to this Public License;
+
+ iv. a notice that refers to the disclaimer of
+ warranties;
+
+ v. a URI or hyperlink to the Licensed Material to the
+ extent reasonably practicable;
+
+ b. indicate if You modified the Licensed Material and
+ retain an indication of any previous modifications; and
+
+ c. indicate the Licensed Material is licensed under this
+ Public License, and include the text of, or the URI or
+ hyperlink to, this Public License.
+
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
+ reasonable manner based on the medium, means, and context in
+ which You Share the Licensed Material. For example, it may be
+ reasonable to satisfy the conditions by providing a URI or
+ hyperlink to a resource that includes the required
+ information.
+
+ 3. If requested by the Licensor, You must remove any of the
+ information required by Section 3(a)(1)(A) to the extent
+ reasonably practicable.
+
+ b. ShareAlike.
+
+ In addition to the conditions in Section 3(a), if You Share
+ Adapted Material You produce, the following conditions also apply.
+
+ 1. The Adapter's License You apply must be a Creative Commons
+ license with the same License Elements, this version or
+ later, or a BY-NC-SA Compatible License.
+
+ 2. You must include the text of, or the URI or hyperlink to, the
+ Adapter's License You apply. You may satisfy this condition
+ in any reasonable manner based on the medium, means, and
+ context in which You Share Adapted Material.
+
+ 3. You may not offer or impose any additional or different terms
+ or conditions on, or apply any Effective Technological
+ Measures to, Adapted Material that restrict exercise of the
+ rights granted under the Adapter's License You apply.
+
+
+Section 4 -- Sui Generis Database Rights.
+
+Where the Licensed Rights include Sui Generis Database Rights that
+apply to Your use of the Licensed Material:
+
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
+ to extract, reuse, reproduce, and Share all or a substantial
+ portion of the contents of the database for NonCommercial purposes
+ only;
+
+ b. if You include all or a substantial portion of the database
+ contents in a database in which You have Sui Generis Database
+ Rights, then the database in which You have Sui Generis Database
+ Rights (but not its individual contents) is Adapted Material,
+ including for purposes of Section 3(b); and
+
+ c. You must comply with the conditions in Section 3(a) if You Share
+ all or a substantial portion of the contents of the database.
+
+For the avoidance of doubt, this Section 4 supplements and does not
+replace Your obligations under this Public License where the Licensed
+Rights include other Copyright and Similar Rights.
+
+
+Section 5 -- Disclaimer of Warranties and Limitation of Liability.
+
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
+
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
+
+ c. The disclaimer of warranties and limitation of liability provided
+ above shall be interpreted in a manner that, to the extent
+ possible, most closely approximates an absolute disclaimer and
+ waiver of all liability.
+
+
+Section 6 -- Term and Termination.
+
+ a. This Public License applies for the term of the Copyright and
+ Similar Rights licensed here. However, if You fail to comply with
+ this Public License, then Your rights under this Public License
+ terminate automatically.
+
+ b. Where Your right to use the Licensed Material has terminated under
+ Section 6(a), it reinstates:
+
+ 1. automatically as of the date the violation is cured, provided
+ it is cured within 30 days of Your discovery of the
+ violation; or
+
+ 2. upon express reinstatement by the Licensor.
+
+ For the avoidance of doubt, this Section 6(b) does not affect any
+ right the Licensor may have to seek remedies for Your violations
+ of this Public License.
+
+ c. For the avoidance of doubt, the Licensor may also offer the
+ Licensed Material under separate terms or conditions or stop
+ distributing the Licensed Material at any time; however, doing so
+ will not terminate this Public License.
+
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
+ License.
+
+
+Section 7 -- Other Terms and Conditions.
+
+ a. The Licensor shall not be bound by any additional or different
+ terms or conditions communicated by You unless expressly agreed.
+
+ b. Any arrangements, understandings, or agreements regarding the
+ Licensed Material not stated herein are separate from and
+ independent of the terms and conditions of this Public License.
+
+
+Section 8 -- Interpretation.
+
+ a. For the avoidance of doubt, this Public License does not, and
+ shall not be interpreted to, reduce, limit, restrict, or impose
+ conditions on any use of the Licensed Material that could lawfully
+ be made without permission under this Public License.
+
+ b. To the extent possible, if any provision of this Public License is
+ deemed unenforceable, it shall be automatically reformed to the
+ minimum extent necessary to make it enforceable. If the provision
+ cannot be reformed, it shall be severed from this Public License
+ without affecting the enforceability of the remaining terms and
+ conditions.
+
+ c. No term or condition of this Public License will be waived and no
+ failure to comply consented to unless expressly agreed to by the
+ Licensor.
+
+ d. Nothing in this Public License constitutes or may be interpreted
+ as a limitation upon, or waiver of, any privileges and immunities
+ that apply to the Licensor or You, including from the legal
+ processes of any jurisdiction or authority.
+
+=======================================================================
+
+Creative Commons is not a party to its public
+licenses. Notwithstanding, Creative Commons may elect to apply one of
+its public licenses to material it publishes and in those instances
+will be considered the “Licensor.” The text of the Creative Commons
+public licenses is dedicated to the public domain under the CC0 Public
+Domain Dedication. Except for the limited purpose of indicating that
+material is shared under a Creative Commons public license or as
+otherwise permitted by the Creative Commons policies published at
+creativecommons.org/policies, Creative Commons does not authorize the
+use of the trademark "Creative Commons" or any other trademark or logo
+of Creative Commons without its prior written consent including,
+without limitation, in connection with any unauthorized modifications
+to any of its public licenses or any other arrangements,
+understandings, or agreements concerning use of licensed material. For
+the avoidance of doubt, this paragraph does not form part of the
+public licenses.
+
+Creative Commons may be contacted at creativecommons.org.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e3c1e5a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,13 @@
+# blog.kennyballou.com #
+
+Blog content of kennyballou.com powered by [Hugo][1].
+
+## License ##
+
+The content of this repository and blog is released under the terms and
+conditions of the Creative Commons [CC-BY-NC-SA-4.0][2] license. For more
+details, read the [license online][2] or see the attached LICENSE file.
+
+[1]: http://gohugo.io
+
+[2]: https://creativecommons.org/licenses/by-nc-sa/4.0/
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 0000000..8c717cf
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,14 @@
+---
+baseurl: "https://kennyballou.com"
+MetaDataFormat: "yaml"
+languageCode: "en-us"
+title: "~kballou"
+indexes:
+ tag: "tags"
+ topic: "topics"
+permalinks:
+ blog: /blog/:year/:month/:slug
+copyright: "Creative Commons Attribution"
+author:
+ name: "Kenny Ballou"
+...
diff --git a/content/about-me.markdown b/content/about-me.markdown
new file mode 100644
index 0000000..b0fcb87
--- /dev/null
+++ b/content/about-me.markdown
@@ -0,0 +1,20 @@
+---
+title: "about me"
+keywords: []
+tags: []
+pubdate: "2014-10-10"
+date: "2014-10-10"
+topics: []
+slug: about-me
+---
+
+# About Me #
+
+I'm a born hacker and life-long learner. I enjoy thinking through problems,
+hacking together systems, and learning anything and everything.
+
+Sometimes I `:() { :|:& };:`.
+
+Other times I `import this`.
+
+The one I choose today: `fork: retry: Resource temporarily unavailable`.
diff --git a/content/blog/Spark.markdown b/content/blog/Spark.markdown
new file mode 100644
index 0000000..1b2699e
--- /dev/null
+++ b/content/blog/Spark.markdown
@@ -0,0 +1,739 @@
+---
+title: "Real-Time Streaming with Apache Spark Streaming"
+description: "Overview of Apache Spark and a sample Twitter Sentiment Analysis"
+tags:
+ - "Apache Spark"
+ - "Apache Kafka"
+ - "Apache"
+ - "Java"
+ - "Sentiment Analysis"
+ - "Real-time Streaming"
+ - "ZData Inc."
+date: "2014-08-18"
+categories:
+ - "Apache"
+ - "Development"
+ - "Real-time Systems"
+slug: "real-time-streaming-apache-spark-streaming"
+---
+
+This is the second post in a series on real-time systems tangential to the
+Hadoop ecosystem. [Last time][6], we talked about [Apache Kafka][5] and Apache
+Storm for use in a real-time processing engine. Today, we will be exploring
+Apache Spark (Streaming) as part of a real-time processing engine.
+
+## About Spark ##
+
+[Apache Spark][1] is a general purpose, large-scale processing engine, recently
+fully inducted as an Apache project and is currently under very active
+development. As of this writing, Spark is at version 1.0.2, and 1.1 will be
+released sometime soon.
+
+Spark is intended to be a drop-in replacement for Hadoop MapReduce, providing
+the benefit of improved performance. Combine Spark with its related projects
+and libraries -- [Spark SQL (formerly Shark)][14], [Spark Streaming][15],
+[Spark MLlib][16], [GraphX][17], among others -- and a very capable and
+promising processing stack emerges. Spark is capable of reading from HBase,
+Hive, Cassandra, and any HDFS data source. Not to mention the many external
+libraries that enable consuming data from many more sources, e.g., hooking
+Apache Kafka into Spark Streaming is trivial. Further, the Spark Streaming
+project provides the ability to continuously compute transformations on data.
+
+### Resilient Distributed Datasets ###
+
+Apache Spark's primitive type is the Resilient Distributed Dataset (RDD). All
+transformations, `map`, `join`, `reduce`, etc., in Spark revolve around this
+type. RDD's can be created in one of three ways: _parallelizing_ (distributing
+a local dataset); reading a stable, external data source, such as an HDFS file;
+or transformations on existing RDD's.
+
+In Java, parallelizing may look like:
+
+ List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
+ JavaRDD<Integer> distData = sc.parallelize(data);
+
+Where `sc` defines the Spark context.
+
+Similarly, reading a file from HDFS may look like:
+
+ JavaRDD<String> distFile = sc.textFile("hdfs:///data.txt");
+
+The resiliency of RDD's comes from their [lazy][34] materialization and the
+information required to enable this lazy nature. RDD's are not always fully
+materialized, but they _do_ contain enough information (their lineage) to be
+(re)created from a stable source [[Zaharia et al.][32]].
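+
+As a quick illustration of this laziness, consider the following hedged
+sketch (`sc` is the Spark context as before): transformations only record
+lineage; nothing is computed until an action runs.
+
+    JavaRDD<String> lines = sc.textFile("hdfs:///data.txt"); // nothing read yet
+    JavaRDD<Integer> lengths = lines.map(
+        new Function<String, Integer>() {
+            public Integer call(String line) { return line.length(); }
+        }
+    ); // still lazy: only the lineage graph grows
+    long total = lengths.count(); // an action finally triggers computation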
+
+RDD's are distributed among the participating machines, and RDD transformations
+are coarse-grained -- the same transformation will be applied to _every_
+element in an RDD. The number of partitions in an RDD is generally defined by
+the locality of the stable source; however, the user may control this number
+via repartitioning.
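+
+For example, a minimal sketch (the partition counts here are arbitrary):
+
+    JavaRDD<String> distFile = sc.textFile("hdfs:///data.txt");
+    // Spread the data across 16 partitions, shuffling as necessary.
+    JavaRDD<String> wider = distFile.repartition(16);
+    // coalesce() can shrink the partition count while avoiding a full shuffle.
+    JavaRDD<String> narrower = wider.coalesce(4);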
+
+Another important property to mention: RDD's are actually immutable. This
+immutability can be illustrated with [Spark's][27] Word Count example:
+
+ JavaRDD<String> file = sc.textFile("hdfs:///data.txt");
+ JavaRDD<String> words = file.flatMap(
+ new FlatMapFunction<String, String>() {
+ public Iterable<String> call(String line) {
+ return Arrays.asList(line.split(" "));
+ }
+ }
+ );
+    JavaPairRDD<String, Integer> pairs = words.mapToPair(
+ new PairFunction<String, String, Integer>() {
+ public Tuple2<String, Integer> call(String word) {
+ return new Tuple2<String, Integer>(word, 1);
+ }
+ }
+ );
+ JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
+        new Function2<Integer, Integer, Integer>() {
+ public Integer call(Integer a, Integer b) { return a + b; }
+ }
+ );
+ counts.saveAsTextFile("hdfs:///data_counted.txt");
+
+This is the canonical word count example, but here is a brief explanation: load
+a file into an RDD, split the words into a new RDD, map the words into pairs
+where each word is given a count (one), then reduce the counts of each word by
+a key, in this case the word itself. Notice that each operation -- `flatMap`,
+`mapToPair`, `reduceByKey` -- creates a _new_ RDD.
+
+To bring all these properties together, Resilient Distributed Datasets are
+read-only, lazy distributed sets of elements that can have a chain of
+transformations applied to them. They facilitate resiliency by storing lineage
+graphs of the transformations (to be) applied and they [parallelize][35] the
+computations by partitioning the data among the participating machines.
+
+### Discretized Streams ###
+
+Moving to Spark Streaming, the primitive is still the RDD. However, there is
+another type for encapsulating a continuous stream of data: Discretized Streams
+or DStreams. DStreams are defined as sequences of RDD's. A DStream is created
+from an input source, such as Apache Kafka, or from the transformation of
+another DStream.
+
+It turns out, programming against DStreams is _very_ similar to programming
+against RDD's. The same word count code can be slightly modified to create a
+streaming word counter:
+
+ JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
+ JavaDStream<String> words = lines.flatMap(
+ new FlatMapFunction<String, String>() {
+ public Iterable<String> call(String line) {
+ return Arrays.asList(line.split(" "));
+ }
+ }
+ );
+    JavaPairDStream<String, Integer> pairs = words.mapToPair(
+ new PairFunction<String, String, Integer>() {
+ public Tuple2<String, Integer> call(String word) {
+ return new Tuple2<String, Integer>(word, 1);
+ }
+ }
+ );
+ JavaPairDStream<String, Integer> counts = pairs.reduceByKey(
+        new Function2<Integer, Integer, Integer>() {
+ public Integer call(Integer a, Integer b) { return a + b; }
+ }
+ );
+ counts.print();
+
+Notice that the only real change from the first example's code is the return
+types. In the streaming context, transformations work on streams of RDD's;
+Spark handles applying the functions (which work against the data in the
+RDD's) to the RDD's in the current batch/ DStream.
+
+Though programming against DStreams is similar, there are indeed some
+differences as well. Chiefly, DStreams also have _stateful_ transformations.
+These include sharing state between batches/ intervals and modifying the
+current frame when aggregating over a sliding window.
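+
+A hedged sketch of one such transformation, reusing the `pairs` stream from
+the streaming word count above (the window and slide durations are
+arbitrary):
+
+    JavaPairDStream<String, Integer> windowedCounts =
+        pairs.reduceByKeyAndWindow(
+            new Function2<Integer, Integer, Integer>() {
+                public Integer call(Integer a, Integer b) { return a + b; }
+            },
+            new Duration(30000),   // window length: the last 30 seconds
+            new Duration(10000));  // slide interval: recompute every 10 seconds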
+
+>The key idea is to treat streaming as a series of short batch jobs, and bring
+>down the latency of these jobs as much as possible. This brings many of the
+>benefits of batch processing models to stream processing, including clear
+>consistency semantics and a new parallel recovery technique...
+[[Zaharia et al.][33]]
+
+### Hadoop Requirements ###
+
+Technically speaking, Apache Spark does [_not_][30] require Hadoop to be fully
+functional. In a cluster setting, however, a means of sharing files between
+tasks will need to be facilitated. This could be accomplished through [S3][8],
+[NFS][9], or, more typically, HDFS.
+
+### Running Spark Applications ###
+
+Apache Spark applications can run in [standalone mode][18] or be managed by
+[YARN][10] ([Running Spark on YARN][19]), [Mesos][11] ([Running Spark on
+Mesos][20]), and even [EC2][28] ([Running Spark on EC2][29]). Furthermore, if
+running under YARN or Mesos, Spark does not need to be installed to work. That
+is, Spark code can execute on YARN and Mesos clusters without change to the
+cluster.
+
+### Language Support ###
+
+Currently, Apache Spark supports the Scala, Java, and Python programming
+languages. This post, though, will only discuss examples in Java.
+
+### Initial Thoughts ###
+
+Getting away from the idea of directed acyclic graphs (DAG's) may be both a
+bit of a leap and a benefit. Although it is perfectly acceptable to
+define Spark's transformations altogether as a DAG, this can feel awkward when
+developing Spark applications. Describing the transformations as [Monadic][13]
+feels much more natural. Of course, a monad structure fits the DAG analogy
+quite well, especially when considered in some of the physical analogies such
+as assembly lines.
+
+Java's, and consequently Spark's, type strictness was an initial hurdle to get
+accustomed to. But overall, this is good. It means the compiler will catch a
+lot of issues with transformations early.
+
+Depending on Scala's `Tuple[\d]` classes feels second-class, but this is only a
+minor tedium. It's too bad current versions of Java don't have good classes for
+this common structure.
+
+YARN and Mesos integration is a very nice benefit as it allows full stack
+analytics to not oversubscribe clusters. Furthermore, it gives the ability to
+add to existing infrastructure without overloading the developers and the
+system administrators with _yet another_ computational suite and/ or resource
+manager.
+
+On the negative side of things, dependency hell can creep into Spark projects.
+Your project and Spark (and possibly Spark's dependencies) may depend on a
+common artifact. If the versions don't [converge][21], many subtle problems can
+emerge. There is an [experimental configuration option][4] to help alleviate
+this problem; however, for me, it caused more problems than it solved.
+
+## Test Project: Twitter Stream Sentiment Analysis ##
+
+To really test Spark (Streaming), a Twitter Sentiment Analysis project was
+developed. It's almost a direct port of the [Storm code][3]. Though there is an
+external library for hooking Spark directly into Twitter, Kafka is used so a
+more precise comparison of Spark and Storm can be made.
+
+When the processing is finished, the data are written to HDFS and posted to a
+simple NodeJS application.
+
+### Setup ###
+
+The setup is the same as [last time][6]: a 5-node Vagrant virtual cluster,
+each node running 64-bit CentOS 6.5 with 1 core and 1024MB of RAM. Every
+node is running HDFS (datanode), YARN worker nodes (nodemanager), ZooKeeper,
+and Kafka. The first node, `node0`, is the namenode and resource manager.
+`node0` is also running a [Docker][7] container with a NodeJS application for
+reporting purposes.
+
+### Application Overview ###
+
+This project follows a very similar process structure to the Storm Topology
+from last time.
+
+![Sentiment Analysis Topology][satimg]
+
+However, each node in the above graph is actually a transformation on the
+current DStream and not an individual process (or group of processes).
+
+This test project uses the same [simple Kafka producer][22] developed last
+time. This Kafka producer will be how data are ingested by the system.
+
+[A lot of this overview will be a rehashing of last time.]
+
+#### Kafka Receiver Stream ####
+
+The data are received from a Kafka stream, implemented via the [external
+Kafka][23] library. This process simply creates a connection to the Kafka
+broker(s), consuming messages from the given set of topics.
+
+##### Stripping Kafka Message IDs #####
+
+It turns out the messages from Kafka are returned as tuples, more specifically
+pairs, of the message ID and the message content. Before continuing, the
+message ID is stripped and the Twitter JSON data is passed down the pipeline.
+
+#### Twitter Data JSON Parsing ####
+
+As was the case last time, the important parts (tweet ID, tweet text, and
+language code) need to be extracted from the JSON. Furthermore, this project
+only parses English tweets. Non-English tweets are filtered out at this stage.
+
+#### Filtering and Stemming ####
+
+Many tweets contain messy or otherwise unnecessary characters and punctuation
+that can be safely ignored. Moreover, there may also be many common words that
+cannot be reliably scored either positively or negatively. At this stage, these
+symbols and _stop words_ should be filtered.
+
+#### Classifiers ####
+
+Both the Positive classifier and the Negative classifier are in separate `map`
+transformations. The implementation of both follows the [Bag-of-words][25]
+model.
+
+#### Joining and Scoring ####
+
+Because the classifiers are done separately and a join is contrived, the next
+step is to join the classifier scores together and actually declare a winner.
+It turns out this is quite trivial to do in Spark.
+
+#### Reporting: HDFS and HTTP POST ####
+
+Finally, once the tweets are joined and scored, the scores need to be reported.
+This is accomplished by writing the final tuples to HDFS and posting a JSON
+object of the tuple to a simple NodeJS application.
+
+This process turned out not to be as awkward as was the case with Storm. The
+`foreachRDD` function of DStreams is a natural way to do side-effect inducing
+operations that don't necessarily transform the data.
+
+### Implementing the Kafka Producer ###
+
+See the [post][6] from last time for the details of the Kafka producer; this
+has not changed.
+
+### Implementing the Spark Streaming Application ###
+
+Diving into the code, here are some of the primary aspects of this project. The
+full source of this test application can be found on [GitHub][24].
+
+#### Creating Spark Context, Wiring Transformation Chain ####
+
+The Spark context, the data source, and the transformations need to be defined.
+Finally, the context needs to be started. This is all accomplished with the
+following code:
+
+ SparkConf conf = new SparkConf()
+ .setAppName("Twitter Sentiment Analysis");
+
+ if (args.length > 0)
+ conf.setMaster(args[0]);
+ else
+ conf.setMaster("local[2]");
+
+ JavaStreamingContext ssc = new JavaStreamingContext(
+ conf,
+ new Duration(2000));
+
+ Map<String, Integer> topicMap = new HashMap<String, Integer>();
+ topicMap.put(KAFKA_TOPIC, KAFKA_PARALLELIZATION);
+
+ JavaPairReceiverInputDStream<String, String> messages =
+ KafkaUtils.createStream(
+ ssc,
+ Properties.getString("rts.spark.zkhosts"),
+ "twitter.sentimentanalysis.kafka",
+ topicMap);
+
+ JavaDStream<String> json = messages.map(
+ new Function<Tuple2<String, String>, String>() {
+ public String call(Tuple2<String, String> message) {
+ return message._2();
+ }
+ }
+ );
+
+ JavaPairDStream<Long, String> tweets = json.mapToPair(
+ new TwitterFilterFunction());
+
+ JavaPairDStream<Long, String> filtered = tweets.filter(
+ new Function<Tuple2<Long, String>, Boolean>() {
+ public Boolean call(Tuple2<Long, String> tweet) {
+ return tweet != null;
+ }
+ }
+ );
+
+ JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
+ new TextFilterFunction());
+
+ tweetsFiltered = tweetsFiltered.map(
+ new StemmingFunction());
+
+ JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
+ tweetsFiltered.mapToPair(new PositiveScoreFunction());
+
+ JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
+ tweetsFiltered.mapToPair(new NegativeScoreFunction());
+
+ JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
+ positiveTweets.join(negativeTweets);
+
+ JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
+ joined.map(new Function<Tuple2<Tuple2<Long, String>,
+ Tuple2<Float, Float>>,
+ Tuple4<Long, String, Float, Float>>() {
+ public Tuple4<Long, String, Float, Float> call(
+ Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
+ {
+ return new Tuple4<Long, String, Float, Float>(
+ tweet._1()._1(),
+ tweet._1()._2(),
+ tweet._2()._1(),
+ tweet._2()._2());
+ }
+ });
+
+ JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
+ scoredTweets.map(new ScoreTweetsFunction());
+
+ result.foreachRDD(new FileWriter());
+ result.foreachRDD(new HTTPNotifierFunction());
+
+ ssc.start();
+ ssc.awaitTermination();
+
+Some of the more trivial transforms are defined in-line. The others are defined
+in their respective files.
+
+#### Twitter Data Filter / Parser ####
+
+Parsing Twitter JSON data is one of the first transformations and is
+accomplished with the help of the [JacksonXML Databind][26] library:
+
+ JsonNode root = mapper.readValue(tweet, JsonNode.class);
+ long id;
+ String text;
+ if (root.get("lang") != null &&
+ "en".equals(root.get("lang").textValue()))
+ {
+ if (root.get("id") != null && root.get("text") != null)
+ {
+ id = root.get("id").longValue();
+ text = root.get("text").textValue();
+ return new Tuple2<Long, String>(id, text);
+ }
+ return null;
+ }
+ return null;
+
+The `mapper` (`ObjectMapper`) object is defined at the class level so it is not
+recreated _for each_ RDD in the DStream, a minor optimization.
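+
+For illustration, a hedged sketch of what such a class might look like (the
+actual `TwitterFilterFunction` lives in the sample project; imports elided):
+
+    public class TwitterFilterFunction
+            implements PairFunction<String, Long, String> {
+        // Created once with the class, not for each RDD in the DStream.
+        private static final ObjectMapper mapper = new ObjectMapper();
+
+        public Tuple2<Long, String> call(String tweet) throws Exception {
+            JsonNode root = mapper.readValue(tweet, JsonNode.class);
+            if (root.get("lang") != null
+                    && "en".equals(root.get("lang").textValue())
+                    && root.get("id") != null
+                    && root.get("text") != null)
+            {
+                return new Tuple2<Long, String>(
+                    root.get("id").longValue(),
+                    root.get("text").textValue());
+            }
+            return null;
+        }
+    }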
+
+You may recall, this is essentially the same code as [last time][6]. The only
+real difference is that the tuple is returned instead of being emitted.
+Because certain situations (e.g., non-English tweet, malformed tweet) return
+null, the nulls will need to be filtered out. Thankfully, Spark provides a
+simple way to accomplish this:
+
+ JavaPairDStream<Long, String> filtered = tweets.filter(
+ new Function<Tuple2<Long, String>, Boolean>() {
+ public Boolean call(Tuple2<Long, String> tweet) {
+ return tweet != null;
+ }
+ }
+ );
+
+#### Text Filtering ####
+
+As mentioned before, punctuation and other symbols are simply discarded as they
+provide little to no benefit to the classifiers:
+
+ String text = tweet._2();
+ text = text.replaceAll("[^a-zA-Z\\s]", "").trim().toLowerCase();
+ return new Tuple2<Long, String>(tweet._1(), text);
+
+Similarly, common words should be discarded as well:
+
+ String text = tweet._2();
+ List<String> stopWords = StopWords.getWords();
+ for (String word : stopWords)
+ {
+ text = text.replaceAll("\\b" + word + "\\b", "");
+ }
+ return new Tuple2<Long, String>(tweet._1(), text);
+
+#### Positive and Negative Scoring ####
+
+Each classifier is defined in its own class. Both classifiers are _very_
+similar in definition.
+
+The positive classifier is primarily defined by:
+
+ String text = tweet._2();
+ Set<String> posWords = PositiveWords.getWords();
+ String[] words = text.split(" ");
+ int numWords = words.length;
+ int numPosWords = 0;
+ for (String word : words)
+ {
+ if (posWords.contains(word))
+ numPosWords++;
+ }
+ return new Tuple2<Tuple2<Long, String>, Float>(
+ new Tuple2<Long, String>(tweet._1(), tweet._2()),
+ (float) numPosWords / numWords
+ );
+
+And the negative classifier:
+
+ String text = tweet._2();
+ Set<String> negWords = NegativeWords.getWords();
+ String[] words = text.split(" ");
+ int numWords = words.length;
+    int numNegWords = 0;
+    for (String word : words)
+    {
+        if (negWords.contains(word))
+            numNegWords++;
+    }
+    return new Tuple2<Tuple2<Long, String>, Float>(
+        new Tuple2<Long, String>(tweet._1(), tweet._2()),
+        (float) numNegWords / numWords
+ );
+
+Because both implement a `PairFunction`, a join situation is contrived.
+However, this could _easily_ be defined differently such that one classifier is
+computed, then the next, without ever needing to join the two together.
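+
+A hedged sketch of that alternative (`positiveScore` and `negativeScore` are
+hypothetical helpers wrapping the bag-of-words logic shown above):
+
+    JavaDStream<Tuple3<Long, String, Float>> withPositive = tweetsFiltered.map(
+        new Function<Tuple2<Long, String>, Tuple3<Long, String, Float>>() {
+            public Tuple3<Long, String, Float> call(Tuple2<Long, String> t) {
+                return new Tuple3<Long, String, Float>(
+                    t._1(), t._2(), positiveScore(t._2()));
+            }
+        });
+
+    JavaDStream<Tuple4<Long, String, Float, Float>> scored = withPositive.map(
+        new Function<Tuple3<Long, String, Float>,
+                     Tuple4<Long, String, Float, Float>>() {
+            public Tuple4<Long, String, Float, Float> call(
+                Tuple3<Long, String, Float> t)
+            {
+                return new Tuple4<Long, String, Float, Float>(
+                    t._1(), t._2(), t._3(), negativeScore(t._2()));
+            }
+        });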
+
+#### Joining ####
+
+It turns out, joining in Spark is very easy to accomplish. So easy, in fact,
+that it can be handled with virtually _no_ code:
+
+ JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
+ positiveTweets.join(negativeTweets);
+
+But because working with a tuple of nested tuples seems unwieldy, transform it
+into a 4-element tuple:
+
+ public Tuple4<Long, String, Float, Float> call(
+ Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
+ {
+ return new Tuple4<Long, String, Float, Float>(
+ tweet._1()._1(),
+ tweet._1()._2(),
+ tweet._2()._1(),
+ tweet._2()._2());
+ }
+
+#### Scoring: Declaring Winning Class ####
+
+Declaring the winning class is a matter of a simple map: compare each class's
+score and take the greater:
+
+ String score;
+ if (tweet._3() >= tweet._4())
+ score = "positive";
+ else
+ score = "negative";
+ return new Tuple5<Long, String, Float, Float, String>(
+ tweet._1(),
+ tweet._2(),
+ tweet._3(),
+ tweet._4(),
+ score);
+
+This declarer is optimistic in the neutral case (ties score as positive) but
+is otherwise very straightforward.
+
+#### Reporting the Results ####
+
+Finally, the pipeline completes by writing the results to HDFS:
+
+ if (rdd.count() <= 0) return null;
+ String path = Properties.getString("rts.spark.hdfs_output_file") +
+ "_" +
+ time.milliseconds();
+ rdd.saveAsTextFile(path);
+
+And sending a POST request to a NodeJS application:
+
+ rdd.foreach(new SendPostFunction());
+
+Where `SendPostFunction` is primarily given by:
+
+ String webserver = Properties.getString("rts.spark.webserv");
+ HttpClient client = new DefaultHttpClient();
+ HttpPost post = new HttpPost(webserver);
+ String content = String.format(
+ "{\"id\": \"%d\", " +
+ "\"text\": \"%s\", " +
+ "\"pos\": \"%f\", " +
+ "\"neg\": \"%f\", " +
+ "\"score\": \"%s\" }",
+ tweet._1(),
+ tweet._2(),
+ tweet._3(),
+ tweet._4(),
+ tweet._5());
+
+ try
+ {
+ post.setEntity(new StringEntity(content));
+ HttpResponse response = client.execute(post);
+ org.apache.http.util.EntityUtils.consume(response.getEntity());
+ }
+ catch (Exception ex)
+ {
+ Logger LOG = Logger.getLogger(this.getClass());
+ LOG.error("exception thrown while attempting to post", ex);
+ LOG.trace(null, ex);
+ }
+
+Each file written to HDFS _will_ have data in it, but the data written will be
+small. A better batching procedure should be implemented so the files written
+match the HDFS block size.
+
+Similarly, a POST request is opened _for each_ scored tweet. This can be
+expensive on both the Spark Streaming batch timings and the web server
+receiving the requests. Batching here could similarly improve overall
+performance of the system.
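+
+A hedged sketch of one mitigation for the POST requests: send one request per
+partition instead of per tweet, reusing a single HTTP client for the whole
+partition (the batched JSON array is illustrative):
+
+    rdd.foreachPartition(
+        new VoidFunction<Iterator<Tuple5<Long, String, Float, Float, String>>>() {
+            public void call(
+                Iterator<Tuple5<Long, String, Float, Float, String>> tweets)
+            {
+                // One client, and one POST, per partition.
+                HttpClient client = new DefaultHttpClient();
+                HttpPost post = new HttpPost(
+                    Properties.getString("rts.spark.webserv"));
+                StringBuilder body = new StringBuilder("[");
+                while (tweets.hasNext()) {
+                    Tuple5<Long, String, Float, Float, String> t = tweets.next();
+                    if (body.length() > 1) body.append(",");
+                    body.append(String.format(
+                        "{\"id\": \"%d\", \"score\": \"%s\"}",
+                        t._1(), t._5()));
+                }
+                body.append("]");
+                try
+                {
+                    post.setEntity(new StringEntity(body.toString()));
+                    HttpResponse response = client.execute(post);
+                    org.apache.http.util.EntityUtils.consume(response.getEntity());
+                }
+                catch (Exception ex)
+                {
+                    Logger.getLogger(getClass())
+                        .error("exception thrown while attempting to post", ex);
+                }
+            }
+        });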
+
+That said, writing these side-effects this way fits very naturally into the
+Spark programming style.
+
+## Summary ##
+
+Apache Spark, in combination with Apache Kafka, has some amazing potential. And
+not only in the Streaming context, but as a drop-in replacement for
+traditional Hadoop MapReduce. This combination makes it a very good candidate
+for a part in an analytics engine.
+
+Stay tuned, as the next post will be a more in-depth comparison between Apache
+Spark and Apache Storm.
+
+## Related Links / References ##
+
+[satimg]: /media/SentimentAnalysisTopology.png
+
+[1]: http://spark.apache.org/
+
+* [Apache Spark][1]
+
+[2]: http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/
+
+* [State of Apache Spark 2014][2]
+
+[3]: https://github.com/zdata-inc/StormSampleProject
+
+* [Storm Sample Project][3]
+
+[4]: https://issues.apache.org/jira/browse/SPARK-939
+
+* [SPARK-939][4]
+
+[5]: http://kafka.apache.org
+
+* [Apache Kafka][5]
+
+[6]: http://www.zdatainc.com/2014/07/real-time-streaming-apache-storm-apache-kafka/
+
+* [Real-Time Streaming with Apache Storm and Apache Kafka][6]
+
+[7]: http://www.docker.io/
+
+* [Docker IO Project Page][7]
+
+[8]: http://aws.amazon.com/s3/
+
+* [Amazon S3][8]
+
+[9]: http://en.wikipedia.org/wiki/Network_File_System
+
+* [Network File System (NFS)][9]
+
+[10]: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
+
+* [Hadoop YARN][10]
+
+[11]: http://mesos.apache.org
+
+* [Apache Mesos][11]
+
+[12]: http://spark.apache.org/docs/latest/streaming-programming-guide.html
+
+* [Spark Streaming Programming Guide][12]
+
+[13]: http://en.wikipedia.org/wiki/Monad_(functional_programming)
+
+* [Monad][13]
+
+[14]: http://spark.apache.org/sql/
+
+* [Spark SQL][14]
+
+[15]: http://spark.apache.org/streaming/
+
+* [Spark Streaming][15]
+
+[16]: http://spark.apache.org/mllib/
+
+* [MLlib][16]
+
+[17]: http://spark.apache.org/graphx/
+
+* [GraphX][17]
+
+[18]: http://spark.apache.org/docs/latest/spark-standalone.html
+
+* [Spark Standalone Mode][18]
+
+[19]: http://spark.apache.org/docs/latest/running-on-yarn.html
+
+* [Running on YARN][19]
+
+[20]: http://spark.apache.org/docs/latest/running-on-mesos.html
+
+* [Running on Mesos][20]
+
+[21]: http://cupofjava.de/blog/2013/02/01/fight-dependency-hell-in-maven/
+
+* [Fight Dependency Hell in Maven][21]
+
+[22]: https://github.com/zdata-inc/SimpleKafkaProducer
+
+* [Simple Kafka Producer][22]
+
+[23]: https://github.com/apache/spark/tree/master/external/kafka
+
+* [Spark: External Kafka Library][23]
+
+[24]: https://github.com/zdata-inc/SparkSampleProject
+
+* [Spark Sample Project][24]
+
+[25]: http://en.wikipedia.org/wiki/Bag-of-words_model
+
+* [Wikipedia: Bag-of-words][25]
+
+[26]: https://github.com/FasterXML/jackson-databind
+
+* [Jackson XML Databind Project][26]
+
+[27]: http://spark.apache.org/docs/latest/programming-guide.html
+
+* [Spark Programming Guide][27]
+
+[28]: http://aws.amazon.com/ec2/
+
+* [Amazon EC2][28]
+
+[29]: http://spark.apache.org/docs/latest/ec2-scripts.html
+
+* [Running Spark on EC2][29]
+
+[30]: http://spark.apache.org/faq.html
+
+* [Spark FAQ][30]
+
+[31]: http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
+
+* [Future of Shark][31]
+
+[32]: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
+
+* [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF)][32]
+
+[33]: https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf
+
+* [Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (PDF)][33]
+
+[34]: http://en.wikipedia.org/wiki/Lazy_evaluation
+
+* [Wikipedia: Lazy evaluation][34]
+
+[35]: http://en.wikipedia.org/wiki/Data_parallelism
+
+* [Wikipedia: Data Parallelism][35]
diff --git a/content/blog/Storm-vs-Spark.markdown b/content/blog/Storm-vs-Spark.markdown
new file mode 100644
index 0000000..887b60c
--- /dev/null
+++ b/content/blog/Storm-vs-Spark.markdown
@@ -0,0 +1,457 @@
+---
+title: "Apache Storm and Apache Spark Streaming"
+description: "Comparison of Apache Storm and Apache Spark Streaming"
+tags:
+ - "Apache Storm"
+ - "Apache Spark"
+ - "Apache"
+ - "Real-time Streaming"
+ - "ZData Inc."
+date: "2014-09-08"
+categories:
+ - "Apache"
+ - "Development"
+ - "Real-time Systems"
+slug: "apache-storm-and-apache-spark"
+---
+
+This is the last post in the series on real-time systems. In the [first
+post][3] we discussed [Apache Storm][1] and [Apache Kafka][5]. In the [second
+post][4] we discussed [Apache Spark (Streaming)][2]. In both posts we examined
+a small Twitter Sentiment Analysis program. Today, we will be reviewing both
+systems: how they compare and how they contrast.
+
+The intention is not to cast judgment over one project or the other, but rather
+to exposit the differences and similarities. Any judgments made, subtle or not,
+are mistakes in exposition and/ or organization and are not actual endorsements
+of either project.
+
+## Apache Storm ##
+
+"Storm is a distributed real-time computation system" [[1][1]]. Apache Storm is
+a [task parallel][7] continuous computational engine. It defines its workflows
+in Directed Acyclic Graphs (DAG's) called "topologies". These topologies run
+until shut down by the user or until they encounter an unrecoverable failure.
+
+Storm does not natively run on top of typical Hadoop clusters; it uses
+[Apache ZooKeeper][8] and its own master/ minion worker processes to
+coordinate topologies, master and worker state, and the message guarantee
+semantics. That said, both [Yahoo!][9] and [Hortonworks][10] are working on
+providing libraries for running Storm topologies on top of Hadoop 2.x YARN
+clusters. Furthermore, Storm can run on top of the [Mesos][11] scheduler as
+well, [natively][12] and with help from the [Marathon][13] framework.
+
+Regardless though, Storm can certainly still consume files from HDFS and/ or
+write files to HDFS.
+
+## Apache Spark (Streaming) ##
+
+"Apache Spark is a fast and general purpose engine for large-scale data
+processing" [[2][2]]. [Apache Spark][2] is a [data parallel][6], general
+purpose batch processing engine. Workflows are defined in a style similar to
+and reminiscent of MapReduce; however, Spark is much more capable than
+traditional Hadoop MapReduce. Apache Spark has its Streaming API project,
+which allows for continuous processing via short interval batches. Similar to
+Storm, Spark Streaming jobs run until shut down by the user or until they
+encounter an unrecoverable failure.
+
+Apache Spark does not itself require Hadoop to operate. However, its data
+parallel paradigm requires a shared filesystem for optimal use of stable data.
+The stable source can be [S3][14], [NFS][15], or, more typically,
+[HDFS][16].
+
+Executing Spark applications does not _require_ Hadoop YARN. Spark has its own
+standalone master/ worker processes. However, it is common to run Spark
+applications using YARN containers. Furthermore, Spark can also run on Mesos
+clusters.
+
+## Development ##
+
+As of this writing, Apache Spark is a full, top level Apache project, whereas
+Apache Storm is currently undergoing incubation. Moreover, the latest stable
+version of Apache Storm is `0.9.2` and the latest stable version of Apache
+Spark is `1.0.2` (with `1.1.0` to be released in the coming weeks). Of course,
+as the Apache Incubation reminder states, this does not strictly reflect the
+stability or completeness of either project. It is, however, a reflection of
+the state of the communities. Apache Spark operations and its process are
+endorsed by the [Apache Software Foundation][27]. Apache Storm is working on
+stabilizing its community and development process.
+
+Spark's `1.x` version does state that the API has stabilized and will not be
+making major changes that undermine backward compatibility. Implicitly, Storm
+has no guaranteed stability in its API; however, it is [running in production
+at many different companies][34].
+
+### Implementation Language ###
+
+Both Apache Spark and Apache Storm are implemented in JVM based languages:
+[Scala][19] and [Clojure][20], respectively.
+
+Scala is a functional meets object-oriented language. In other words, the
+language carries ideas from both the functional world and the object-oriented
+world. This yields an interesting mix of code reusability, extensibility, and
+higher-order functions.
+
+Clojure is a dialect of [Lisp][21] targeting the JVM, carrying the Lisp
+philosophy of code-as-data and providing the rich macro system typical of
+Lisp languages. Clojure is predominately functional in nature; however, when
+state or side-effects are required, they are facilitated with a transactional
+memory model, aiding in making multi-threaded applications consistent and
+safe.
+
+#### Message Passing Layer ####
+
+Until version `0.9.x`, Storm was using the Java library [JZMQ][22] for
+[ZeroMQ][23] messages. However, Storm has since moved the default messaging
+layer to [Netty][24] with efforts from [Yahoo!][25]. Although Netty is now
+being used by default, users can still use ZeroMQ, if desired, since the
+migration to Netty was intended to also make the message layer pluggable.
+
+Spark, on the other hand, uses a combination of [Netty][24] and [Akka][26] for
+distributing messages throughout the executors.
+
+### Commit Velocity ###
+
+As a reminder, these data are included not to cast judgment on one project or
+the other, but rather to exposit the fluidness of each project. The continuum
+of the dynamics of both projects can be used as an argument for or against,
+depending on application requirements. If rigid stability is a strong
+requirement, arguing for a slower commit velocity may be appropriate.
+
+The following statistics were taken from the graphs at
+[GitHub](https://github.com/) and computed with [this script][38].
+
+#### Spark Commit Velocity ####
+
+Examining the graphs from
+[GitHub](https://github.com/apache/spark/graphs/commit-activity), over the last
+month (as of this writing), there have been over 330 commits. The previous
+month had about 340.
+
+#### Storm Commit Velocity ####
+
+Again examining the commit graphs from
+[GitHub](https://github.com/apache/incubator-storm/graphs/commit-activity),
+over the last month (as of this writing), there have been over 70 commits. The
+month prior had over 130.
+
+### Issue Velocity ###
+
+Sourcing the summary charts from JIRA, we can see that Spark clearly has a
+huge volume of issues reported and closed in the last 30 days. Storm has
+roughly an order of magnitude fewer.
+
+Spark Open and Closed JIRA Issues (last 30 days):
+
+[![Spark JIRA Issues][spark_jira_issues]][18]
+
+Storm Open and Closed JIRA Issues (last 30 days):
+
+[![Storm JIRA Issues][storm_jira_issues]][17]
+
+### Contributor/ Community Size ###
+
+#### Storm Contributor Size ####
+
+Sourcing the reports from
+[GitHub](https://github.com/apache/incubator-storm/graphs/contributors), Storm
+has over 100 contributors. This number, though, is just the number of unique
+people who have committed at least one patch.
+
+Over the last 60 days, Storm has seen 34 unique contributors and 16 over the
+last 30.
+
+#### Spark Contributor Size ####
+
+Similarly sourcing the reports from [GitHub](https://github.com/apache/spark),
+Spark has roughly 280 contributors. The same note as before must be made about
+this number: it is the number of contributors with at least one patch
+committed to the project.
+
+Apache Spark has had over 140 contributors over the last 60 days and 94 over
+the last 30 days.
+
+## Development Friendliness ##
+
+### Developing for Storm ###
+
+* Describing the process structure with DAG's feels natural to the
+ [processing model][7]. Each node in the graph will transform the data in a
+ certain way, and the process continues, possibly disjointly.
+
+* Storm tuples, the data passed between nodes in the DAG, have a very natural
+  interface. However, this comes at the cost of compile-time type safety.
+
+### Developing for Spark ###
+
+* Spark's monadic expression of transformations over the data similarly feels
+ natural in this [processing model][6]; this falls in line with the idea
+ that RDD's are lazy and maintain transformation lineages, rather than
+  materialized results.
+
+* Spark's use of Scala Tuples can feel awkward in Java, and this awkwardness
+ is only exacerbated with the nesting of generic types. However, this
+  awkwardness does come with the benefit of compile-time type checks (see
+  the sketch following this list).
+
+    - Furthermore, prior to Java 1.8, anonymous functions are inherently
+      awkward and verbose.
+
+ - This is probably a non-issue if using Scala.
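+
+To make the tuple awkwardness concrete, here is a minimal word-count sketch
+against the Spark 1.x Java API; `lines` is an assumed `JavaRDD<String>`, and
+imports from `java.util`, `org.apache.spark.api.java`,
+`org.apache.spark.api.java.function`, and `scala.Tuple2` are elided:
+
+    // Counting words with anonymous classes; note how quickly the
+    // generics nest once Scala tuples enter the picture.
+    JavaPairRDD<String, Integer> counts = lines
+        .flatMap(new FlatMapFunction<String, String>()
+        {
+            public Iterable<String> call(String line)
+            {
+                return Arrays.asList(line.split(" "));
+            }
+        })
+        .mapToPair(new PairFunction<String, String, Integer>()
+        {
+            public Tuple2<String, Integer> call(String word)
+            {
+                return new Tuple2<String, Integer>(word, 1);
+            }
+        })
+        .reduceByKey(new Function2<Integer, Integer, Integer>()
+        {
+            public Integer call(Integer a, Integer b)
+            {
+                return a + b;
+            }
+        });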
+
+## Installation / Administration ##
+
+Installation of both Apache Spark and Apache Storm is relatively
+straightforward. Spark may be simpler in some regards, however, since it
+does not technically _need_ to be installed to function on YARN or Mesos
+clusters.
+Spark application will just require the Spark assembly be present in the
+`CLASSPATH`.
+
+Storm, on the other hand, requires a properly installed and running ZooKeeper
+ensemble in addition to the Storm binaries themselves.
+Furthermore, like ZooKeeper, Storm should run under [supervision][35];
+installation of a supervisor service, e.g., [supervisord][28], is recommended.
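+
+For instance, a minimal [supervisord][28] program entry for the Storm
+supervisor daemon might look like the following sketch (paths and user are
+illustrative):
+
+    [program:storm-supervisor]
+    command=/opt/storm/bin/storm supervisor
+    autorestart=true
+    user=storm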
+
+With respect to installation, supporting projects like Apache Kafka are out of
+scope and have no impact on the installation of either Storm or Spark.
+
+## Processing Models ##
+
+Comparing Apache Storm and Apache Spark's Streaming turns out to be a bit
+challenging. One is a true stream processing engine that can do
+micro-batching, the other is a batch processing engine which micro-batches
+but cannot perform streaming in the strictest sense. Furthermore, the
+difference between streaming and batching isn't exactly subtle; these are
+fundamentally different computing models.
+
+### Batch Processing ###
+
+[Batch processing][31] is the familiar concept of processing data en masse. The
+batch size could be small or very large. This is the processing model of the
+core Spark library.
+
+Batch processing excels at processing _large_ amounts of stable, existing
+data. However, it generally incurs high latency and is unsuitable for acting
+on data as it arrives.
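+
+As a concrete sketch, a minimal core-Spark batch job in the Java API (Spark
+1.x; imports from `org.apache.spark` and `org.apache.spark.api.java` assumed,
+input path illustrative):
+
+    SparkConf conf = new SparkConf().setAppName("batch-sketch");
+    JavaSparkContext sc = new JavaSparkContext(conf);
+    // The entire (possibly very large) dataset is processed en masse when
+    // an action such as count() runs; transformations stay lazy until then.
+    JavaRDD<String> records = sc.textFile("hdfs:///data/records");
+    long total = records.count();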
+
+### Event-Stream Processing ###
+
+[Stream processing][32] is a _one-at-a-time_ processing model; a datum is
+processed as it arrives. The core Storm library follows this processing model.
+
+Stream processing excels at computing transformations as data are ingested with
+sub-second latencies. However, with stream processing, it is incredibly
+difficult to process stable data efficiently.
+
+### Micro-Batching ###
+
+Micro-batching is a special case of batch processing where the batch size is
+orders of magnitude smaller. Spark Streaming operates in this manner, as does
+the Storm [Trident API][33].
+
+Micro-batching seems to be a nice mix between batching and streaming. However,
+micro-batching incurs a cost of latency. If sub-second latency is paramount,
+micro-batching will typically not suffice. On the other hand, micro-batching
+trivially gives stateful computation, making [windowing][37] an easy task.
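+
+With Spark Streaming, for example, the batch interval handed to the streaming
+context fixes the micro-batch size, and therefore the latency floor. A sketch
+against the Spark 1.x Java API (imports from `org.apache.spark` and
+`org.apache.spark.streaming` assumed):
+
+    SparkConf conf = new SparkConf().setAppName("micro-batch-sketch");
+    // Every record waits for, at most, its one-second batch window to
+    // close before it is processed.
+    JavaStreamingContext ssc =
+        new JavaStreamingContext(conf, new Duration(1000));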
+
+## Fault-Tolerance / Message Guarantees ##
+
+As a result of each project's fundamentally different processing models, the
+fault-tolerance and message guarantees are handled differently.
+
+### Delivery Semantics ###
+
+Before diving into each project's fault-tolerance and message guarantees, here
+are the common delivery semantics:
+
+* At most once: messages may be lost but never redelivered.
+
+* At least once: messages will never be lost but may be redelivered.
+
+* Exactly once: messages are never lost and never redelivered, perfect
+ message delivery.
+
+### Apache Storm ###
+
+To provide fault-tolerant messaging, Storm has to keep track of each and
+every record. By default, this is done with at least once delivery semantics.
+Storm can also be configured to provide at most once and exactly once
+semantics. The delivery semantics offered by Storm can incur latency costs;
+if data loss in the stream is acceptable, at most once delivery will improve
+performance.
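+
+At least once processing hinges on anchoring and acking: a bolt anchors each
+emitted tuple to its input and acks the input once handled, so a downstream
+failure triggers a replay from the spout. A minimal sketch of a bolt's
+`execute`, with `collector` being the `OutputCollector` captured in
+`prepare()` (the `BasicBolt` variants anchor and ack automatically):
+
+    public void execute(Tuple input)
+    {
+        String text = input.getString(0);
+        // Anchoring ties the new tuple to `input` in Storm's tracking
+        // tree; a failure downstream replays the original message.
+        collector.emit(input, new Values(text.toUpperCase()));
+        // Acking marks `input` as fully processed at this bolt.
+        collector.ack(input);
+    }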
+
+### Apache Spark Streaming ###
+
+The resiliency built into Spark RDD's and the micro-batching yields a trivial
+mechanism for providing fault-tolerance and message delivery guarantees. That
+is, since Spark Streaming is just small-scale batching, exactly once delivery
+is a trivial result for each batch; this is the _only_ delivery semantic
+available to Spark. However, some failure scenarios of Spark Streaming can
+degrade to at least once delivery.
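+
+Recovering from driver failure, for example, relies on checkpointing, which
+is a one-liner on the streaming context (the HDFS path is illustrative):
+
+    // Persist periodic state and metadata to reliable storage so a
+    // restarted driver can recover the stream's lineage.
+    ssc.checkpoint("hdfs:///spark/checkpoints");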
+
+## Applicability ##
+
+### Apache Storm ###
+
+Some areas where Storm excels include: near real-time analytics, natural
+language processing, data normalization, and [ETL][36] transformations. It
+also stands apart from traditional MapReduce and other coarse-grained
+technologies by offering fine-grained transformations, allowing for very
+flexible processing topologies.
+
+### Apache Spark Streaming ###
+
+Spark has an excellent model for performing iterative machine learning and
+interactive analytics. But Spark also excels in some of the same areas as
+Storm, including near real-time analytics and ingestion.
+
+## Final Thoughts ##
+
+Generally, the requirements will dictate the choice. However, here are some
+major points to consider when choosing the right tool:
+
+* Latency: Is the performance of the streaming application paramount? Storm
+  can give sub-second latency much more easily and with fewer restrictions
+  than Spark Streaming.
+
+* Development Cost: Is it desired to have similar code bases for batch
+ processing _and_ stream processing? With Spark, batching and streaming are
+ _very_ similar. Storm, however, departs dramatically from the MapReduce
+ paradigm.
+
+* Message Delivery Guarantees: Is there high importance on processing _every_
+  single record, or is some nominal amount of data loss acceptable?
+  Disregarding everything else, Spark trivially yields perfect, exactly once
+  message delivery. Storm can provide all three delivery semantics, but
+  exactly once delivery requires more effort to properly achieve.
+
+* Process Fault Tolerance: Is high-availability of primary concern? Both
+ systems actually handle fault-tolerance of this kind really well and in
+ relatively similar ways.
+
+    - Production Storm clusters will run Storm processes under
+      [supervision][35]; if a process fails, the supervisor process will
+      restart it automatically. State management is handled through
+      ZooKeeper, and restarting processes reread their state from ZooKeeper
+      when attempting to rejoin the cluster.
+
+    - Spark handles restarting workers via the resource manager: YARN, Mesos,
+      or its standalone manager. Spark's standalone resource manager handles
+      master node failure with standby masters and ZooKeeper. Alternatively,
+      this can be handled more primitively with local filesystem state
+      checkpointing, which is not typically recommended for production
+      environments.
+
+Both Apache Spark Streaming and Apache Storm are great solutions that solve the
+streaming ingestion and transformation problem. Either system can be a great
+choice for part of an analytics stack. Choosing the right one is simply a
+matter of answering the above questions.
+
+## References ##
+
+[spark_jira_issues]: https://localhost/media/spark_issues_chart.png
+
+[storm_jira_issues]: https://localhost/media/storm_issues_chart.png
+
+[1]: http://storm.incubator.apache.org/documentation/Home.html
+
+* [Apache Storm Home Page][1]
+
+[2]: http://spark.apache.org
+
+* [Apache Spark][2]
+
+[3]: http://www.zdatainc.com/2014/07/real-time-streaming-apache-storm-apache-kafka/
+
+* [Real Time Streaming with Apache Storm and Apache Kafka][3]
+
+[4]: http://www.zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/
+
+* [Real Time Streaming with Apache Spark (Streaming)][4]
+
+[5]: http://kafka.apache.org/
+
+* [Apache Kafka][5]
+
+[6]: http://en.wikipedia.org/wiki/Data_parallelism
+
+* [Wikipedia: Data Parallelism][6]
+
+[7]: http://en.wikipedia.org/wiki/Task_parallelism
+
+* [Wikipedia: Task Parallelism][7]
+
+[8]: http://zookeeper.apache.org
+
+* [Apache ZooKeeper][8]
+
+[9]: https://github.com/yahoo/storm-yarn
+
+* [Yahoo! Storm-YARN][9]
+
+[10]: http://hortonworks.com/kb/storm-on-yarn-install-on-hdp2-beta-cluster/
+
+* [Hortonworks: Storm on YARN][10]
+
+[11]: http://mesos.apache.org
+
+* [Apache Mesos][11]
+
+[12]: https://mesosphere.io/learn/run-storm-on-mesos/
+
+* [Run Storm on Mesos][12]
+
+[13]: https://github.com/mesosphere/marathon
+
+* [Marathon][13]
+
+[14]: http://aws.amazon.com/s3/
+
+[15]: http://en.wikipedia.org/wiki/Network_File_System
+
+[16]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
+
+[17]: https://issues.apache.org/jira/browse/STORM/
+
+[18]: https://issues.apache.org/jira/browse/SPARK/
+
+[19]: http://www.scala-lang.org/
+
+[20]: http://clojure.org/
+
+[21]: http://en.wikipedia.org/wiki/Lisp_(programming_language)
+
+[22]: https://github.com/zeromq/jzmq
+
+[23]: http://zeromq.org/
+
+[24]: http://netty.io/
+
+[25]: http://yahooeng.tumblr.com/post/64758709722/making-storm-fly-with-netty
+
+[26]: http://akka.io
+
+[27]: http://www.apache.org/
+
+[28]: http://supervisord.org
+
+[29]: http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
+
+* [Storm vs Spark Streaming: Side by Side][29]
+
+[30]: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
+
+* [Storm vs Spark Streaming (Slideshare)][30]
+
+[31]: http://en.wikipedia.org/wiki/Batch_processing
+
+[32]: http://en.wikipedia.org/wiki/Event_stream_processing
+
+[33]: https://storm.incubator.apache.org/documentation/Trident-API-Overview.html
+
+[34]: http://storm.incubator.apache.org/documentation/Powered-By.html
+
+[35]: http://en.wikipedia.org/wiki/Process_supervision
+
+[36]: http://en.wikipedia.org/wiki/Extract,_transform,_load
+
+[37]: http://en.wikipedia.org/wiki/Window_function_(SQL)#Window_function
+
+[38]: https://gist.github.com/kennyballou/c6ff37e5eef6710794a6
diff --git a/content/blog/Storm.markdown b/content/blog/Storm.markdown
new file mode 100644
index 0000000..8ff3707
--- /dev/null
+++ b/content/blog/Storm.markdown
@@ -0,0 +1,594 @@
+---
+title: "Real-Time Streaming with Apache Storm and Apache Kafka"
+description: "Overview of Apache Storm and sample Twitter Sentiment Analysis"
+tags:
+ - "Apache Storm"
+ - "Apache Kafka"
+ - "Apache"
+ - "Java"
+ - "Sentiment Analysis"
+ - "Real-time Streaming"
+ - "ZData Inc."
+date: "2014-07-16"
+categories:
+ - "Apache"
+ - "Development"
+ - "Real-time Systems"
+slug: "real-time-streaming-storm-and-kafka"
+---
+
+The following post is one in a series on real-time systems tangential to the
+Hadoop ecosystem. This first post explores Apache Storm and Apache Kafka as
+parts of a real-time processing engine. These two systems work together very
+well and make for an easy development experience while still being very
+performant.
+
+## About Kafka ##
+
+[Apache Kafka][3] is a message queue rethought as a distributed commit log. It
+follows the publish-subscribe messaging style, with speed and durability built
+in.
+
+Kafka uses Zookeeper to share and save state between brokers. Each broker
+maintains a set of partitions: primary and/ or secondary for each topic. A set
+of Kafka brokers working together will maintain a set of topics. Each topic has
+its partitions distributed over the participating Kafka brokers and, as of
+Kafka version 0.8, the replication factor determines, intuitively, the number
+of times a partition is duplicated for fault tolerance.
+
+While many brokered message queue systems have the broker maintain the state of
+its consumers, Kafka does not. This frees up resources for the broker to ingest
+data faster. For more information about Kafka's performance see [Benchmarking
+Kafka][4].
+
+### Initial Thoughts ###
+
+Kafka is a very promising project, with astounding throughput and one of
+the easiest pieces of software I have had the joy of installing and
+configuring. Although Kafka is not at the production 1.0 stable release yet,
+it's well on its way.
+
+## About Storm ##
+
+[Apache Storm][1], currently in incubation, is a real-time computational engine
+made available under the free and open-source Apache version 2.0 license. It
+runs continuously, consuming data from the configured sources (Spouts) and
+passes the data down the processing pipeline (Bolts). Combined, Spouts and
+Bolts make a Topology. A topology can be written in any language, including
+any JVM-based language, Python, Ruby, Perl, or, with some work, even C
+[[2][2]].
+
+### Why Storm ###
+
+Quoting from the project site:
+
+> Storm has many use cases: realtime analytics, online machine learning,
+> continuous computation, distributed RPC, ETL, and more. Storm is fast: a
+> benchmark clocked it at over a million tuples processed per second per node.
+> It is scalable, fault-tolerant, guarantees your data will be processed, and
+> is easy to set up and operate. [[1][1]]
+
+### Integration ###
+
+Storm can integrate with any queuing and any database system. In fact, there
+are already quite a few existing projects for integrating Storm with other
+systems, like Kestrel or Kafka [[5][5]].
+
+### Initial Thoughts ###
+
+I found Storm's verbiage around the computational pipeline to fit my mental
+model very well; thinking about streaming computational processes as directed
+acyclic graphs makes a lot of intuitive sense. That said, although I haven't
+been developing against Storm for very long, I do find some integration tasks
+to be slightly awkward. For example, writing an HDFS file writer bolt
+requires some special considerations given bolt life cycles and HDFS writing
+patterns. This is really only a minor blemish, however, as it only means the
+developers of Storm topologies have to understand the API more intimately;
+there are already common patterns emerging that should be adaptable to about
+any situation [[16][16]].
+
+## Test Project: Twitter Stream Sentiment Analysis ##
+
+To really give Storm a try, something a little more involved than just a simple
+word counter is needed. Therefore, I have put together a Twitter Sentiment
+Analysis topology. Though this is a good representative example of a more
+complicated topology, the method used for actually scoring the Twitter data is
+overly simple.
+
+### Setup ###
+
+The setup used for this demo is a 5 node Vagrant virtual cluster. Each node
+runs 64-bit CentOS 6.5 and is given 1 core and 1024MB of RAM. Every node is
+running HDFS (datanode), Zookeeper, and Kafka. The first node, `node0`, is the
+namenode, and Nimbus -- Storm's master daemon. `node0` is also running a
+[Docker][7] container with a NodeJS application, part of the reporting
+process. The remaining nodes, `node[1-4]`, are Storm worker nodes. Storm,
+Kafka, and Zookeeper are all run under supervision via [Supervisord][6], so
+high availability is baked into this virtual cluster.
+
+### Overview ###
+
+![Sentiment Analysis Topology][satimg]
+
+I wrote a simple Kafka producer that reads files off disk and sends them to the
+Kafka cluster. This is how we feed the whole system and is used in lieu of
+opening a stream to Twitter.
+
+#### Spout ####
+
+The orange node from the picture is our [`KafkaSpout`][8] that will be
+consuming incoming messages from the Kafka brokers.
+
+#### Twitter Data JSON Parsing ####
+
+The first bolt, `2` in the image, attempts to parse the Twitter JSON data and
+emits `tweet_id` and `tweet_text`. This implementation only processes English
+tweets. If the topology were more ambitious, it might pass the language
+code down and create different scoring bolts for each language.
+
+#### Filtering and Stemming ####
+
+This next bolt, `3`, performs first-round data sanitization. That is, it
+removes all non-alpha characters.
+
+The next round of data cleansing, `4`, removes common words to reduce the
+noise given to the classifiers. These common words are usually referred to
+as _stop words_. [_Stemming_][15], reducing words to a common root form,
+could also be performed here, or in another, later bolt.
+
+`4`, when finished, will then broadcast the data to both of the classifiers.
+
+#### Classifiers ####
+
+Each classifier is defined by its own bolt. One classifier for the positive
+class, `5`, and another for the negative class, `6`.
+
+The implementation of the classifiers follows the [Bag-of-words][12] model.
+
+#### Join and Scoring ####
+
+Next, bolt `7` joins the scores from the two previous classifiers. The
+implementation of this bolt isn't perfect. For example, message processing is
+not guaranteed: if one side of the join fails, neither side is notified.
+
+To finish up the scoring, bolt `8` compares the scores from the classifiers and
+declares the class accordingly.
+
+#### Reporting: HDFS and HTTP POST ####
+
+Finally, the scoring bolt broadcasts the results to an HDFS file writer
+bolt, `9`, and a NodeJS notifier bolt, `10`. The HDFS bolt fills a list until
+it has 1000 records and then spools to disk. Of course, like the join bolt,
+this could be better implemented to fail input tuples if the bolt fails to
+write to disk, or to write to disk after a certain period of time. The NodeJS
+notifier bolt, on the other hand, sends an HTTP POST each time a record is
+received. This could be batched as well, but again, this is for demonstration
+purposes only.
+
+### Implementing the Kafka Producer ###
+
+Here are the main portions of the code for the Kafka producer used to feed
+the cluster.
+
+In the main class, we set up the data pipes and threads:
+
+ LOGGER.debug("Setting up streams");
+ PipedInputStream send = new PipedInputStream(BUFFER_LEN);
+ PipedOutputStream input = new PipedOutputStream(send);
+
+ LOGGER.debug("Setting up connections");
+ LOGGER.debug("Setting up file reader");
+ BufferedFileReader reader = new BufferedFileReader(filename, input);
+ LOGGER.debug("Setting up kafka producer");
+ KafkaProducer kafkaProducer = new KafkaProducer(topic, send);
+
+ LOGGER.debug("Spinning up threads");
+ Thread source = new Thread(reader);
+ Thread kafka = new Thread(kafkaProducer);
+
+ source.start();
+ kafka.start();
+
+ LOGGER.debug("Joining");
+ kafka.join();
+
+The `BufferedFileReader`, in its own thread, reads the data off disk:
+
+ rd = new BufferedReader(new FileReader(this.fileToRead));
+ wd = new BufferedWriter(new OutputStreamWriter(this.outputStream, ENC));
+ int b = -1;
+ while ((b = rd.read()) != -1)
+ {
+ wd.write(b);
+ }
+
+Finally, the `KafkaProducer` sends asynchronous messages to the Kafka Cluster:
+
+ rd = new BufferedReader(new InputStreamReader(this.inputStream, ENC));
+ String line = null;
+ producer = new Producer<Integer, String>(conf);
+ while ((line = rd.readLine()) != null)
+ {
+ producer.send(new KeyedMessage<Integer, String>(this.topic, line));
+ }
+
+Doing these operations on separate threads keeps disk reads from blocking the
+Kafka producer and vice-versa, with overall throughput tunable via the size
+of the buffer.
+
+### Implementing the Storm Topology ###
+
+#### Topology Definition ####
+
+Moving on to Storm, here we define the topology and how each component talks
+to the others:
+
+ TopologyBuilder topology = new TopologyBuilder();
+
+ topology.setSpout("kafka_spout", new KafkaSpout(kafkaConf), 4);
+
+ topology.setBolt("twitter_filter", new TwitterFilterBolt(), 4)
+ .shuffleGrouping("kafka_spout");
+
+ topology.setBolt("text_filter", new TextFilterBolt(), 4)
+ .shuffleGrouping("twitter_filter");
+
+ topology.setBolt("stemming", new StemmingBolt(), 4)
+ .shuffleGrouping("text_filter");
+
+ topology.setBolt("positive", new PositiveSentimentBolt(), 4)
+ .shuffleGrouping("stemming");
+ topology.setBolt("negative", new NegativeSentimentBolt(), 4)
+ .shuffleGrouping("stemming");
+
+ topology.setBolt("join", new JoinSentimentsBolt(), 4)
+ .fieldsGrouping("positive", new Fields("tweet_id"))
+ .fieldsGrouping("negative", new Fields("tweet_id"));
+
+ topology.setBolt("score", new SentimentScoringBolt(), 4)
+ .shuffleGrouping("join");
+
+ topology.setBolt("hdfs", new HDFSBolt(), 4)
+ .shuffleGrouping("score");
+ topology.setBolt("nodejs", new NodeNotifierBolt(), 4)
+ .shuffleGrouping("score");
+
+Notably, the data is shuffled to each bolt except when joining, as it's
+very important that the same tweets are given to the same instance of the
+joining bolt. To read more about stream groupings, visit the [Storm
+documentation][17].
+
+#### Twitter Data Filter / Parser ####
+
+Parsing the Twitter JSON data is one of the first things that needs to be
+done. This is accomplished with the [Jackson Databind][11] library.
+
+ JsonNode root = mapper.readValue(json, JsonNode.class);
+ long id;
+ String text;
+ if (root.get("lang") != null &&
+ "en".equals(root.get("lang").textValue()))
+ {
+ if (root.get("id") != null && root.get("text") != null)
+ {
+ id = root.get("id").longValue();
+ text = root.get("text").textValue();
+ collector.emit(new Values(id, text));
+ }
+ else
+ LOGGER.debug("tweet id and/ or text was null");
+ }
+ else
+ LOGGER.debug("Ignoring non-english tweet");
+
+As mentioned before, `tweet_id` and `tweet_text` are the only values being
+emitted.
+
+This is just using the basic `ObjectMapper` class from the Jackson Databind
+library to map the simple JSON object provided by the Twitter Streaming API to
+a `JsonNode`. The language code is first tested to be English, as the topology
+does not support non-English tweets. The new tuple is emitted down the Storm
+pipeline after ensuring the existence of the desired data, namely, `tweet_id`,
+and `tweet_text`.
+
+#### Text Filtering ####
+
+As previously mentioned, punctuation and other symbols are removed because they
+are just noise to the classifiers:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String text = input.getString(input.fieldIndex("tweet_text"));
+ text = text.replaceAll("[^a-zA-Z\\s]", "").trim().toLowerCase();
+ collector.emit(new Values(id, text));
+
+_Very_ common words are also removed because they are similarly noisy to the
+classifiers:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String text = input.getString(input.fieldIndex("tweet_text"));
+ List<String> stopWords = StopWords.getWords();
+ for (String word : stopWords)
+ {
+        // Match whole words only; replacing bare substrings would also
+        // strip stop words embedded inside longer words.
+        text = text.replaceAll("\\b" + word + "\\b", "");
+ }
+ collector.emit(new Values(id, text));
+
+Here the `StopWords` class is a singleton holding the list of words we
+wish to omit.
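+
+`StopWords` itself is not shown in this post; a hypothetical sketch of such a
+singleton (word list abbreviated, `java.util` imports assumed; the real class
+presumably loads its list from a resource file) might look like:
+
+    public final class StopWords
+    {
+        private static volatile List<String> words;
+
+        private StopWords() {}
+
+        public static List<String> getWords()
+        {
+            if (words == null)
+            {
+                synchronized (StopWords.class)
+                {
+                    if (words == null)
+                        words = Arrays.asList("the", "and", "is", "of", "a");
+                }
+            }
+            return words;
+        }
+    }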
+
+#### Positive and Negative Scoring ####
+
+Since the approach for scoring is based on a very limited [Bag-of-words][12]
+model, each class is put into its own bolt; this also creates the need for a
+join later.
+
+Positive Scoring:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String text = input.getString(input.fieldIndex("tweet_text"));
+ Set<String> posWords = PositiveWords.getWords();
+ String[] words = text.split(" ");
+ int numWords = words.length;
+ int numPosWords = 0;
+ for (String word : words)
+ {
+ if (posWords.contains(word))
+ numPosWords++;
+ }
+ collector.emit(new Values(id, (float) numPosWords / numWords, text));
+
+Negative Scoring:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String text = input.getString(input.fieldIndex("tweet_text"));
+ Set<String> negWords = NegativeWords.getWords();
+ String[] words = text.split(" ");
+ int numWords = words.length;
+ int numNegWords = 0;
+ for (String word : words)
+ {
+ if (negWords.contains(word))
+ numNegWords++;
+ }
+ collector.emit(new Values(id, (float)numNegWords / numWords, text));
+
+Similar to `StopWords`, `PositiveWords` and `NegativeWords` are both singletons
+maintaining lists of positive and negative words, respectively.
+
+#### Joining Scores ####
+
+Joining in Storm isn't the most straightforward process to implement,
+although the process may be made simpler with the use of the [Trident
+API][13]. However, to get a feel for what to do without Trident, and as an
+academic exercise, the join is not implemented with the Trident API. On a
+related note, this join isn't perfect! It ignores a couple of issues: namely,
+it does not correctly fail a tuple on a one-sided join (when tweets are
+received twice from the same scoring bolt), and it doesn't time out tweets
+that are left in the queue for too long. With this in mind, here is our join:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String text = input.getString(input.fieldIndex("tweet_text"));
+ if (input.contains("pos_score"))
+ {
+ Float pos = input.getFloat(input.fieldIndex("pos_score"));
+ if (this.tweets.containsKey(id))
+ {
+ Triple<String, Float, String> triple = this.tweets.get(id);
+ if ("neg".equals(triple.getCar()))
+ emit(collector, id, triple.getCaar(), pos, triple.getCdr());
+ else
+ {
+ LOGGER.warn("one sided join attempted");
+ this.tweets.remove(id);
+ }
+ }
+ else
+ this.tweets.put(
+ id,
+ new Triple<String, Float, String>("pos", pos, text));
+ }
+ else if (input.contains("neg_score"))
+ {
+ Float neg = input.getFloat(input.fieldIndex("neg_score"));
+ if (this.tweets.containsKey(id))
+ {
+ Triple<String, Float, String> triple = this.tweets.get(id);
+ if ("pos".equals(triple.getCar()))
+ emit(collector, id, triple.getCaar(), neg, triple.getCdr());
+ else
+ {
+ LOGGER.warn("one sided join attempted");
+ this.tweets.remove(id);
+ }
+ }
+ else
+ this.tweets.put(
+ id,
+ new Triple<String, Float, String>("neg", neg, text));
+ }
+
+Where `emit` is defined simply by:
+
+ private void emit(
+ BasicOutputCollector collector,
+ Long id,
+ String text,
+ float pos,
+ float neg)
+ {
+ collector.emit(new Values(id, pos, neg, text));
+ this.tweets.remove(id);
+ }
+
+#### Deciding the Winning Class ####
+
+To ensure the [Single responsibility principle][14] is not violated, joining
+and scoring are split into separate bolts, though the class of the tweet could
+certainly be decided at the time of joining.
+
+ Long id = tuple.getLong(tuple.fieldIndex("tweet_id"));
+ String text = tuple.getString(tuple.fieldIndex("tweet_text"));
+ Float pos = tuple.getFloat(tuple.fieldIndex("pos_score"));
+ Float neg = tuple.getFloat(tuple.fieldIndex("neg_score"));
+ String score = pos > neg ? "positive" : "negative";
+ collector.emit(new Values(id, text, pos, neg, score));
+
+This decider will prefer negative scores, so if there is a tie, it's
+automatically handed to the negative class.
+
+#### Report the Results ####
+
+Finally, there are two bolts that will yield results: one that writes
+data to HDFS, and another to send the data to a web server.
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String tweet = input.getString(input.fieldIndex("tweet_text"));
+ Float pos = input.getFloat(input.fieldIndex("pos_score"));
+ Float neg = input.getFloat(input.fieldIndex("neg_score"));
+ String score = input.getString(input.fieldIndex("score"));
+ String tweet_score =
+ String.format("%s,%s,%f,%f,%s\n", id, tweet, pos, neg, score);
+ this.tweet_scores.add(tweet_score);
+ if (this.tweet_scores.size() >= 1000)
+ {
+ writeToHDFS();
+ this.tweet_scores = new ArrayList<String>(1000);
+ }
+
+Where `writeToHDFS` is primarily given by:
+
+ Configuration conf = new Configuration();
+ conf.addResource(new Path("/opt/hadoop/etc/hadoop/core-site.xml"));
+ conf.addResource(new Path("/opt/hadoop/etc/hadoop/hdfs-site.xml"));
+ hdfs = FileSystem.get(conf);
+ file = new Path(
+ Properties.getString("rts.storm.hdfs_output_file") + this.id);
+ if (hdfs.exists(file))
+ os = hdfs.append(file);
+ else
+ os = hdfs.create(file);
+ wd = new BufferedWriter(new OutputStreamWriter(os, "UTF-8"));
+ for (String tweet_score : tweet_scores)
+ {
+ wd.write(tweet_score);
+ }
+
+And our `HTTP POST` to a web server:
+
+ Long id = input.getLong(input.fieldIndex("tweet_id"));
+ String tweet = input.getString(input.fieldIndex("tweet_text"));
+ Float pos = input.getFloat(input.fieldIndex("pos_score"));
+ Float neg = input.getFloat(input.fieldIndex("neg_score"));
+ String score = input.getString(input.fieldIndex("score"));
+ HttpPost post = new HttpPost(this.webserver);
+ String content = String.format(
+ "{\"id\": \"%d\", " +
+ "\"text\": \"%s\", " +
+ "\"pos\": \"%f\", " +
+ "\"neg\": \"%f\", " +
+ "\"score\": \"%s\" }",
+ id, tweet, pos, neg, score);
+
+ try
+ {
+ post.setEntity(new StringEntity(content));
+ HttpResponse response = client.execute(post);
+ org.apache.http.util.EntityUtils.consume(response.getEntity());
+ }
+ catch (Exception ex)
+ {
+ LOGGER.error("exception thrown while attempting post", ex);
+ LOGGER.trace(null, ex);
+ reconnect();
+ }
+
+There are some faults to point out with both of these bolts. Namely, the HDFS
+bolt tries to batch the writes into groups of 1000 tweets, but does not keep
+track of the tuples, nor does it time out tuples or flush them at some
+interval. That is, if a write fails or if the queue sits idle for too long,
+the topology is not notified and can't resend the tuples. Similarly, the
+`HTTP POST` bolt does not batch and sends each POST synchronously, causing
+the bolt to block for each message. This can be alleviated with more
+instances of this bolt and more web servers to handle the increase in posts,
+and/ or a better batching process.
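+
+One common remedy for the flushing problem is Storm's tick tuples: the bolt
+asks Storm for a system-generated tuple every _n_ seconds and flushes its
+buffer whenever one arrives. A sketch against the `0.9.x` `backtype.storm`
+API:
+
+    // Ask Storm to deliver a tick tuple to this bolt every 10 seconds.
+    @Override
+    public Map<String, Object> getComponentConfiguration()
+    {
+        Map<String, Object> conf = new HashMap<String, Object>();
+        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
+        return conf;
+    }
+
+    // In execute(), flush the buffer when the system tick stream fires.
+    private static boolean isTickTuple(Tuple tuple)
+    {
+        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
+            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
+    }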
+
+## Summary ##
+
+The full source of this test application can be found on [GitHub][9].
+
+Apache Storm and Apache Kafka both have great potential in the real-time
+streaming market and have so far proven themselves to be very capable systems
+for performing real-time analytics.
+
+Stay tuned, as the next post in this series will be an overview of streaming
+with Apache Spark.
+
+## Related Links / References ##
+
+[satimg]: https://localhost/media/SentimentAnalysisTopology.png
+
+[1]: http://storm.incubator.apache.org/
+
+* [Apache Storm Project Page][1]
+
+[2]: http://storm.incubator.apache.org/about/multi-language.html
+
+* [Storm Multi-Language Documentation][2]
+
+[3]: http://kafka.apache.org/
+
+* [Apache Kafka Project Page][3]
+
+[4]: http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
+
+* [LinkedIn Kafka Benchmarking: 2 million writes per second][4]
+
+[5]: http://storm.incubator.apache.org/about/integrates.html
+
+* [Storm Integration Documentation][5]
+
+[6]: http://supervisord.org/
+
+* [Supervisord Project Page][6]
+
+[7]: http://www.docker.io/
+
+* [Docker IO Project Page][7]
+
+[8]: https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
+
+* [Storm-Kafka Source][8]
+
+[9]: https://github.com/zdata-inc/StormSampleProject
+
+* [Full Source of Test Project][9]
+
+[10]: https://wiki.apache.org/incubator/StormProposal
+
+* [Apache Storm Incubation Proposal][10]
+
+[11]: https://github.com/FasterXML/jackson-databind
+
+* [Jackson Databind Project Page][11]
+
+[12]: http://en.wikipedia.org/wiki/Bag-of-words_model
+
+* [Wikipedia: Bag of words][12]
+
+[13]: http://storm.incubator.apache.org/documentation/Trident-API-Overview.html
+
+* [Storm Trident API Overview][13]
+
+[14]: http://en.wikipedia.org/wiki/Single_responsibility_principle
+
+* [Wikipedia: Single responsibility principle][14]
+
+[15]: http://en.wikipedia.org/wiki/Stemming
+
+* [Wikipedia: Stemming][15]
+
+[16]: http://storm.incubator.apache.org/documentation/Common-patterns.html
+
+* [Storm Documentation: Common Patterns][16]
+
+[17]: http://storm.incubator.apache.org/documentation/Concepts.html#stream-groupings
+
+* [Stream Groupings][17]
diff --git a/layouts/404.html b/layouts/404.html
new file mode 100644
index 0000000..0c15e5a
--- /dev/null
+++ b/layouts/404.html
@@ -0,0 +1,8 @@
+{{ partial "header.html" . }}
+<body>
+<h1>You Seem Lost...</h1>
+<p>The page you were looking for seems to have left the server or never
+existed. If you think this may be an error, sorry. The web server thinks you
+have an error.</p>
+</body>
+{{ partial "footer.html" . }}
diff --git a/layouts/_default/list.html b/layouts/_default/list.html
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/layouts/_default/list.html
diff --git a/layouts/_default/single.html b/layouts/_default/single.html
new file mode 100644
index 0000000..00aaabc
--- /dev/null
+++ b/layouts/_default/single.html
@@ -0,0 +1,17 @@
+{{ partial "header.html" . }}
+<body>
+{{ partial "subheader.html" . }}
+<section id="main">
+ <div>
+ <div class="colleft">
+ <article id="content">
+ {{ .Content}}
+ </article>
+ </div>
+ <div class="colright">
+ {{ partial "author.html" . }}
+ </div>
+ </div>
+</section>
+</body>
+{{ partial "footer.html" . }}
diff --git a/layouts/blog/li.html b/layouts/blog/li.html
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/layouts/blog/li.html
diff --git a/layouts/blog/single.html b/layouts/blog/single.html
new file mode 100644
index 0000000..1086aa8
--- /dev/null
+++ b/layouts/blog/single.html
@@ -0,0 +1,27 @@
+{{ partial "header.html" . }}
+<body>
+{{ partial "subheader.html" . }}
+<section id="main">
+ <div>
+ <div class="colleft">
+ <article id="content">
+ <h1 id="title"> {{ .Title }}</h1>
+ <div class="post-meta">
+ <ul class="tags">
+ <li><i class="fa fa-tags"></i></li>
+ {{ range .Params.tags }}
+ <li>{{ . }}</li>
+ {{ end }}
+ </ul>
+ <h4 id="date"> {{ .Date.Format "Mon Jan 2, 2006" }}</h4>
+ </div>
+ {{ .Content}}
+ </article>
+ </div>
+ <div class="colright">
+ {{ partial "author.html" . }}
+ </div>
+ </div>
+</section>
+</body>
+{{ partial "footer.html" . }}
diff --git a/layouts/blog/summary.html b/layouts/blog/summary.html
new file mode 100644
index 0000000..6fe1d59
--- /dev/null
+++ b/layouts/blog/summary.html
@@ -0,0 +1,20 @@
+<article class="post">
+ <header>
+ <h2><a href="{{ .Permalink }}">{{ .Title }}</a></h2>
+ <div class="post-meta">{{ .Date.Format "Mon Jan 2, 2006" }}</div>
+ </header>
+
+ <blockquote>
+ {{ .Summary }}
+ </blockquote>
+ <ul class="tags">
+ <li><i class="fa fa-tags"></i></li>
+ {{ range .Params.tags }}
+ <li>{{ . }}</li>
+ {{ end }}
+ </ul>
+
+ <footer>
+ <a href="{{ .Permalink }}"><nobr>Read More</nobr></a>
+ </footer>
+</article>
diff --git a/layouts/index.html b/layouts/index.html
new file mode 100644
index 0000000..528a97c
--- /dev/null
+++ b/layouts/index.html
@@ -0,0 +1,20 @@
+{{ partial "header.html" . }}
+<body>
+{{ partial "subheader.html" . }}
+<section id="main">
+ <div>
+ <div class="colleft">
+ {{ range first 10 .Data.Pages }}
+ {{ if eq .Section "blog" }}
+ {{ .Render "summary" }}
+ {{ end }}
+ {{ end }}
+ </div>
+ <div class="colright">
+ {{ partial "author.html" . }}
+ </div>
+    </div>
+</section>
+
+
+{{ partial "footer.html" . }}
diff --git a/layouts/partials/author.html b/layouts/partials/author.html
new file mode 100644
index 0000000..a1c7939
--- /dev/null
+++ b/layouts/partials/author.html
@@ -0,0 +1,21 @@
+<section>
+ <div id="author">
+ <a class="fade" href="about-me/"><h3 class="fade">Kenny Ballou</h3></a>
+ <h4><code>:(){ :|:&amp; };:</code></h4>
+ <a href="https://github.com/kennyballou">
+ <i class="fade fa fa-github-square fa-2x"></i>
+ </a>
+ <a href="https://bitbucket.org/kballou">
+ <i class="fade fa fa-bitbucket-square fa-2x"></i>
+ </a>
+ <a href="http://www.linkedin.com/pub/kenny-ballou/43/1a9/90">
+ <i class="fade fa fa-linkedin-square fa-2x"></i>
+ </a>
+ <a href="https://plus.google.com/+KennyBallou">
+ <i class="fade fa fa-google-plus-square fa-2x"></i>
+ </a>
+ <a href="https://twitter.com/kennyballou">
+ <i class="fade fa fa-twitter-square fa-2x"></i>
+ </a>
+ </div>
+</section>
diff --git a/layouts/partials/footer.html b/layouts/partials/footer.html
new file mode 100644
index 0000000..5fb79e9
--- /dev/null
+++ b/layouts/partials/footer.html
@@ -0,0 +1,13 @@
+<footer>
+ <div id="footer">
+ <div class="colleft">
+ <p>&copy; 2014 Kenny Ballou. <a
+ href="http://creativecommons.org/licenses/by/3.0/">Some rights
+ reserved.</a></p>
+ </div>
+ <div class="colright">
+ <p>Powered by <a href="http://gohugo.io/">Hugo.</a></p>
+ </div>
+ </div>
+</footer>
+</html>
diff --git a/layouts/partials/head_includes.html b/layouts/partials/head_includes.html
new file mode 100644
index 0000000..2df69ce
--- /dev/null
+++ b/layouts/partials/head_includes.html
@@ -0,0 +1,4 @@
+<link rel="sytlesheet" href="/css/site.css" />
+<link rel="stylesheet"
+ href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css">
+<script src="{{ .Site.BaseUrl }}/js/site.js"></script>
diff --git a/layouts/partials/header.html b/layouts/partials/header.html
new file mode 100644
index 0000000..2dc4d46
--- /dev/null
+++ b/layouts/partials/header.html
@@ -0,0 +1,20 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+
+ {{ partial "meta.html" . }}
+
+ <base href="{{ .Site.BaseUrl }}" />
+ <title>{{ .Site.Title }}</title>
+
+ <link rel="canonical" href="{{ .Permalink }}" />
+ {{ if .RSSlink }}
+ <link href="{{ .RSSlink }}" rel="alternate" type="application/rss+xml"
+ title="{{ .Title }}" />
+ {{ end }}
+
+ {{ partial "head_includes.html" . }}
+ <link rel="stylesheet" href="{{ .Site.BaseUrl }}/css/site.css" />
+ <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
+</head>
diff --git a/layouts/partials/meta.html b/layouts/partials/meta.html
new file mode 100644
index 0000000..ec09d41
--- /dev/null
+++ b/layouts/partials/meta.html
@@ -0,0 +1,2 @@
+<meta name="author" content="Kenny Ballou">
+<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
diff --git a/layouts/partials/subheader.html b/layouts/partials/subheader.html
new file mode 100644
index 0000000..3a4aaae
--- /dev/null
+++ b/layouts/partials/subheader.html
@@ -0,0 +1,5 @@
+<div id="header">
+ <header>
+ <a class="fade" href="{{ .Site.BaseUrl }}">~kballou</a>
+ </header>
+</div>
diff --git a/layouts/rss.xml b/layouts/rss.xml
new file mode 100644
index 0000000..5950804
--- /dev/null
+++ b/layouts/rss.xml
@@ -0,0 +1,21 @@
+<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
+ <channel>
+ <title>{{ .Title }} on {{ .Site.Title }} </title>
+ <generator uri="https://gohugo.io">Hugo</generator>
+ <link>{{ .Permalink }}</link>
+ {{ with .Site.LanguageCode }}<language>{{ . }}</language>{{ end }}
+ {{ with .Site.Author.name }}<author>{{ . }}</author>{{ end }}
+ {{ with .Site.Copyright }}<copyright>{{ . }}</copyright>{{ end }}
+ <updated>{{ .Date.Format "Mon, 02 Jan 2006 15:04:05 MST" }}</updated>
+ {{ range first 15 .Data.Pages }}
+ <item>
+ <title>{{ .Title }}</title>
+ <link>{{ .Permalink }}</link>
+ <pubDate>{{ .Date.Format "Mon, 02 Jan 2006 15:04:05 MST" }} </pubDate>
+ {{ with .Site.Author.name }}<author>{{ . }}</author>{{ end }}
+ <guid>{{ .Permalink }}</guid>
+ <description>{{ .Content | html }}</description>
+ </item>
+ {{ end }}
+ </channel>
+</rss>
diff --git a/layouts/single.html b/layouts/single.html
new file mode 100644
index 0000000..5c0336c
--- /dev/null
+++ b/layouts/single.html
@@ -0,0 +1,5 @@
+{{ partial "header.html" . }}
+<body>
+ {{ .Content}}
+</body>
+{{ partial "header.html" . }}
diff --git a/layouts/sitemap.xml b/layouts/sitemap.xml
new file mode 100644
index 0000000..e99a1bc
--- /dev/null
+++ b/layouts/sitemap.xml
@@ -0,0 +1,16 @@
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+ {{ range .Data.Pages }}
+ <url>
+ <loc>{{ .Permalink }}</loc>
+ <lastmod>
+      {{ safeHtml ( .Date.Format "2006-01-02T15:04:05-07:00" ) }}
+ </lastmod>
+ {{ with .Sitemap.ChangeFreq }}
+ <changefreq>{{ . }}</changefreq>
+ {{ end }}
+ {{ if ge .Sitemap.Priority 0.0 }}
+ <priority>{{ .Sitemap.Priority }}</priority>
+ {{ end }}
+ </url>
+ {{ end }}
+</urlset>
diff --git a/static/css/site.css b/static/css/site.css
new file mode 100644
index 0000000..bbd6065
--- /dev/null
+++ b/static/css/site.css
@@ -0,0 +1,131 @@
+body {
+ box-sizing: border-box;
+ font: 13px Helvetica, Arial;
+ background-color: #282828;
+ color: #F8F8F2;
+}
+h1 {
+ color: #F92672;
+}
+h2, h4 {
+ color: #66D9EF;
+}
+h3, h5 {
+ color: #A6E22E;
+}
+a {
+ color: #AE81FF;
+ text-decoration: none;
+}
+a:hover {
+ text-decoration: underline;
+}
+blockquote {
+ padding: 1em 1em;
+ border-left: 5px solid #333333;
+ margin: 0 0 1em;
+}
+.post-meta {
+ font-style: italic;
+}
+#author {
+ padding: 5em;
+}
+#content {
+ margin: 2em 2em 5em 3em;
+ width: 100%;
+}
+#header {
+ background-color: #282828;
+ display: block;
+ height: auto;
+ overflow: visible;
+ padding-right: 2em;
+ padding-left: 0.7em;
+ border-bottom: 1px solid #292929;
+}
+#footer {
+ background-color: #282828;
+ height: 2.5em;
+ padding: 1em 0;
+ overflow: hidden;
+ border-top: 1px solid #292929;
+ position: fixed;
+ bottom: 0;
+ right: 0;
+ left: 0;
+ z-index: 1030;
+ margin-bottom: 0;
+ margin-left: 2em;
+    border-width: 1px 0 0;
+}
+#header header a {
+ color: #F92672;
+ font-size: 210%;
+}
+#header header a:hover {
+ text-decoration: none;
+}
+div#author a:hover {
+ text-decoration: none;
+}
+.fade {
+ opacity: 0.5;
+ transition: opacity .25s ease-in-out;
+    -moz-transition: opacity .25s ease-in-out;
+ -webkit-transition: opacity .25s ease-in-out;
+}
+.fade:hover {
+ opacity: 1;
+}
+pre {
+ padding: 9.5px;
+ margin: 0 0 10px;
+ word-break: break-all;
+ word-wrap: break-word;
+ color: #2b2b2b;
+ background-color: #F5F5F5;
+ border: 1px solid #CCCCCC;
+ border-radius: 4px;
+}
+code {
+ padding: 0.25em;
+ color: #F8F8F2;
+ font-family: DejaVu Sans Mono, Consolas;
+ white-space: pre-wrap;
+ border: 0;
+ border-radius: 4px;
+}
+pre code {
+ display: block;
+ background-color: #303030;
+}
+.post {
+ width: 100%;
+ margin-left: 3em;
+ margin-right: 3em;
+ margin-bottom: 3em;
+}
+.tags {
+ display: inline-block;
+ list-style: none;
+ padding-left: 0;
+ margin: 0 0 0 0.2em;
+}
+.tags li {
+ display: inline-block;
+ padding-left: 0.3em;
+}
+.tags li:nth-child(even) {
+ color: #66D9EF;
+}
+.colleft {
+ float: left;
+ width: 70%;
+ position: relative;
+}
+.colright {
+ float: right;
+ width: 20%;
+ position: relative;
+}
diff --git a/static/favicon.ico b/static/favicon.ico
new file mode 100644
index 0000000..bd80b91
--- /dev/null
+++ b/static/favicon.ico
Binary files differ
diff --git a/static/media/SentimentAnalysisTopology.png b/static/media/SentimentAnalysisTopology.png
new file mode 100644
index 0000000..44d8ede
--- /dev/null
+++ b/static/media/SentimentAnalysisTopology.png
Binary files differ
diff --git a/static/media/spark_issues_chart.png b/static/media/spark_issues_chart.png
new file mode 100644
index 0000000..1932741
--- /dev/null
+++ b/static/media/spark_issues_chart.png
Binary files differ
diff --git a/static/media/storm_issues_chart.png b/static/media/storm_issues_chart.png
new file mode 100644
index 0000000..78bc99a
--- /dev/null
+++ b/static/media/storm_issues_chart.png
Binary files differ