Data routing, transformation, and system mediation in Big Data & IoT scenarios with Apache NiFi

A few months ago I published a series of posts explaining how to capture WIFI traffic and process it in near real time using WSO2 BAM, CEP Siddhi, Apache Cassandra, Apache Thrift, Kismet running on a Raspberry Pi, and Docker.

01-wifi-traffic-capture-wso2-bam

Now, after several Big Data and security projects, I can bring some fresh air to the previous solution and improve its technological approach.

Using Elasticsearch, Logstash and Kibana

The first approach I considered was the ELK stack (Elasticsearch, Logstash and Kibana), which is the natural path to follow.

02-wifi-traffic-capture-elasticsearch-logstash-kibana

 

But there are still some issues to face:

  • Resilience.
    • Several times Logstash stopped because it was processing a malformed incoming message.
  • Portability.
    • Logstash runs on Java and Ruby, and has to be compiled and tuned for ARM architectures (Raspberry Pi). Yes, there are instructions for doing that, but I don’t want to spend time on it; I would rather focus on data analysis.
  • Large-scale deployment.
    • I would like to avoid deploying Logstash on each Raspberry Pi just to transform the captured 802.11 (WIFI) traffic into JSON and send it to Elasticsearch. The other approach I want to avoid is deploying Logstash with the UDP/TCP Input Plugin on the Elasticsearch side, because both choices require parsing, transforming and filtering the captured traffic with GROK and Elasticsearch Index Templates for each Logstash instance deployed. What if I have 100 or more Raspberry Pis distributed across different locations?
  • Security.
    • I’m using Kismet installed on each Raspberry Pi to capture 802.11 traffic. By default Kismet sends that traffic over UDP, which is faster but not secure. The big problem with Logstash listening for UDP traffic on a port is that it is susceptible to DoS attacks and the traffic can be spoofed. I would have to replace plain UDP with a secured transport, for example TLS over UDP (DTLS).
  • Monitoring/Tracking.
    • How do I monitor whether Kismet is running on the Raspberry Pi? How do I know whether the Raspberry Pi is healthy?
  • Remote administration.
    • I definitely can’t do that across a massively distributed fleet of Raspberry Pis.
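
To make the resilience and transformation points concrete, here is a minimal Python sketch of the behaviour I am after: each captured record is turned into JSON, and a malformed record is set aside instead of stopping the whole pipeline. The pipe-separated record layout (timestamp|mac|ssid|signal_dbm) and the function names are assumptions made for illustration only; this is not the real Kismet output format.

```python
import json

# Hypothetical record layout, for illustration only (not Kismet's real format):
#   timestamp|mac|ssid|signal_dbm
FIELDS = ("timestamp", "mac", "ssid", "signal_dbm")


def to_json(line):
    """Convert one pipe-separated record into a JSON string.

    Raises ValueError when the record is malformed.
    """
    parts = line.strip().split("|")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    record["signal_dbm"] = int(record["signal_dbm"])  # ValueError if not a number
    return json.dumps(record, sort_keys=True)


def process(lines):
    """Split the input into (converted, rejected) without ever aborting."""
    converted, rejected = [], []
    for line in lines:
        try:
            converted.append(to_json(line))
        except ValueError:
            rejected.append(line)  # keep the bad record for later inspection
    return converted, rejected


if __name__ == "__main__":
    sample = [
        "2016-08-01T10:00:00Z|00:11:22:33:44:55|HomeNet|-48",
        "this is not a valid record",
    ]
    ok, bad = process(sample)
    print(ok)
    print(bad)
```

The point of the design is that a bad record ends up in its own list, where it can be inspected later, while the good records keep flowing.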

So, what can I do?

Apache NiFi to the rescue!

03-apache-nifi-logo

I have been involved in several integration projects where I frequently used WSO2 ESB.

WSO2 ESB is a lightweight, high-performance Enterprise Service Bus (ESB) based on Apache Synapse. Powered by a fast, asynchronous mediation engine, it provides support for XML, SOAP and REST. It supports HTTP/S, Mail (POP3, IMAP, SMTP), JMS, TCP, UDP, VFS, SMS, XMPP and FIX through “mediators”.

Another popular open source choice is Apache Camel. We could also consider ETL tools such as Pentaho Data Integration (a.k.a. Pentaho Kettle), but all of them are too heavy to run on a Raspberry Pi. Then I found Apache NiFi.

Taken from the Apache NiFi webpage:

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Some of the high-level capabilities and objectives of Apache NiFi include:

Web-based user interface.
– Seamless experience between design, control, feedback, and monitoring
Highly configurable.
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flow can be modified at runtime
– Back pressure
Data Provenance.
– Track dataflow from beginning to end
Designed for extension
– Build your own processors and more
– Enables rapid development and effective testing
Secure.
– SSL, SSH, HTTPS, encrypted content, etc…
– Multi-tenant authorization and internal authorization/policy management.
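
Of the capabilities listed, back pressure is the one that matters most on small devices. As a rough illustration of the idea only (this is plain Python with a bounded queue, not NiFi code or its API, and the class and method names are hypothetical), a producer can be told to hold back once the queue between two steps reaches a threshold:

```python
from collections import deque


class BoundedConnection:
    """Toy model of a queue between two processing steps.

    Once `threshold` items are waiting, offer() refuses new items so the
    upstream step has to slow down instead of exhausting memory.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.items = deque()

    def offer(self, item):
        """Return True if the item was queued, False if back pressure applies."""
        if len(self.items) >= self.threshold:
            return False
        self.items.append(item)
        return True

    def poll(self):
        """Remove and return the oldest item, or None when the queue is empty."""
        return self.items.popleft() if self.items else None


if __name__ == "__main__":
    conn = BoundedConnection(threshold=2)
    print([conn.offer(n) for n in range(3)])  # the third offer is refused
    conn.poll()                               # downstream consumes one item
    print(conn.offer(99))                     # there is room again
```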

What do you think about that? Do you think Apache NiFi can help me? Yes, it can. The new approach would be as follows:

04-wifi-traffic-capture-apache-nifi-minifi

The approach above covers basically all the gaps explained earlier. On the Raspberry Pi side we could use Apache MiNiFi, a subproject of NiFi suited to constrained resources. Its specific goals comprise:

  • small and lightweight footprint
  • central management of agents
  • generation of data provenance

On the other hand, the approach below is also a valid alternative. Even as a PoC that demonstrates the ease and power of using Apache NiFi, this approach is enough.

05-wifi-traffic-capture-apache-nifi

In the next post I will share technical details and code to implement the above approach. Meanwhile I share four great resources:

Conclusions

  • Apache NiFi, acting as a system mediator that routes, transforms and streams data, moves big chunks of data, and pulls, pushes and puts from/to different data sources, is the perfect companion for Big Data projects.
  • Apache NiFi speaks different languages through Processors. I can easily replace Logstash and all its Input and Output Plugins. I can connect Apache NiFi to Elasticsearch (Put/Fetch Elasticsearch), Apache Hadoop (PutHDFS, FetchHDFS), Twitter, Kafka, etc.
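
As an illustration of what the Elasticsearch end of such a flow has to produce, here is a minimal Python sketch that builds a request body in the newline-delimited JSON format of the Elasticsearch Bulk API. The index name wifi-traffic and the sample fields are assumptions made for the example; in NiFi this work would be done by a processor rather than by hand-written code, and older Elasticsearch versions may also expect a type in the action line.

```python
import json


def bulk_body(index, docs):
    """Build an Elasticsearch Bulk API body (newline-delimited JSON).

    Each document is preceded by an action line, and the body ends
    with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical index name and fields, for illustration only.
    docs = [{"mac": "00:11:22:33:44:55", "ssid": "HomeNet", "signal_dbm": -48}]
    print(bulk_body("wifi-traffic", docs), end="")
```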

 

@Chilcano

Posted in BAM, Big Data, DevOps, IoT
3 comments on “Data routing, transformation, and system mediation in Big Data & IoT scenarios with Apache NiFi”
  1. […] Data routing, transformation, and system mediation in Big Data & IoT scenarios with Apache NiFi by Roger Carhuatocto […]

  2. […] you have not read my previous post about Apache NiFi, well I can say that is a Data Mediator Engine and ETL with steroids suitable for BigData Projects […]
