Apache NiFi - GROUP 1


language: EN

So for those that are on your desktop, can you bring it up? You should see a blank canvas,
you know, a blank desktop like mine. You'll have Microsoft Edge, Docker Desktop,
uploads, you know, those types of things.
But one of the things I want you to check is if you can open your File Explorer.
When you go to Downloads,
there should be
quite a few downloads: MiNiFi, NiFi, the NiFi Registry, the NiFi Toolkit, those types of things.
If you do not have that information, let me know.
That way I can go in and replicate the information to your desktop so that you have it.
Aaron, anything?
Sean,
you look good too.
Amanda, you should have it as well. I remember replicating your desktop. Perfect, perfect, perfect.
Pedro, are you able to see the files, Pedro?
If you look in your File Explorer and then go to Downloads, you should see
a list of files.
No worries. Again, I'm here on my ranch in Central Texas, so
Starlink, you know, if it decides I get data today, I get it.
Yep, I see your screen.
Perfect, perfect, perfect. So the downloads all worked and you can go there.
So what we'll do is go through some of the PowerPoint presentation a little bit more.
We'll take a quick break; I like to do like a 10 to 15 minute break,
you know, depending on how many questions we get and those types of things.
It's my understanding
everyone's in Arizona, so my lunch time is usually in about an hour,
but for Arizona time we will try to go to lunch about 11:30 your time.
I like to do about 45 minutes for lunch, but if you need an hour we can do that as well.
You know, if there's not a lot of questions or a lot of interaction, we usually end a little early;
you know, there's some time built in for those interactions.
So we'll do that. So everybody's logged into their desktop, and you're able to see the files that we're going to work with.
When we get to this part, we're going to actually do an install of NiFi on the Windows operating system.
If you were installing this in Linux, it would actually be a little bit easier, but there is a ton of documentation.
As you can imagine for a government product, the documentation for NiFi is very extensive.
Everything I'm teaching today, everything I'm going over, is in the NiFi docs.
You can actually follow along if you go to
nifi.apache.org. You will see,
you know, tons and tons of documentation: you know, what is NiFi,
what are the core concepts, the architecture, some of these things that we're going over.
You know, even the architecture, for instance, where you have the OS, the host, which in our case is
Windows.
For those that are technical, it's a Jetty web server serving this UI up.
Then we have the flow controller,
processors,
you know, the FlowFile repository that we talked about, the content repository, and the provenance repository.
They're on local storage. Now, when it says local storage, that doesn't necessarily mean
it's being stored to a local disk.
That local storage could be a NAS or some other type of network-attached storage system, or,
you know, those types of things.
So if you want to kind of follow along
with some of the documentation, it's all there if you go to nifi.apache.org.
You'll see everything.
The documentation is very good. We're working off of the NiFi version 1 documentation,
just because version 2 just came out. We'll touch on some of that.
What I like to go off of is the admin guide or the user guide.
Those are the two guides that I work with. When we go into some of the processors,
we'll actually go in and talk about some of that.
But if you look at the admin guide, you know, for an open source product,
the documentation for NiFi is amazing.
Usually we don't see this type of documentation.
But as you can imagine, being a government product that was released to the open source world,
we had to do a lot of documentation before that was released.
The documentation is also built into NiFi,
even down to the processors. When you're developing a processor, for those that have developed a processor before,
you did have a place where you could include a description as well as other documentation.
So, you know, we'll go through that, but I want to make sure
that your desktops are running, and if you're in your browser, you can pull up the NiFi user guide and admin guide
and follow along.
Okay
You know, a FlowFile is an
abstraction that represents a single piece of information or a data object
within a data flow.
So take, in this case, and I'm using this example because I just implemented a
huge prototype for this:
you know, log messages coming in. You may have a single message, and when it comes into NiFi,
you know, it treats that as a FlowFile.
So that FlowFile
is that message. It can be in any format, it can come in over any kind of protocol, those types of things.
You know, so when NiFi receives that,
it generates that as a FlowFile, and then, you know,
within that FlowFile there are two major components: the metadata and the data payload.
The metadata is the attributes, and we'll go into more of that, where we're able to
take that data and make it into an attribute.
It has a lot of attributes.
So as soon as NiFi touches this data, it gets assigned attributes, like,
you know, a date-time group of when it was noticed, what the source was,
you know, those types of things. When we're using a processor that goes and grabs data from
an HTTP website, for instance, it will record what URL it
grabbed the data from, and those types of things. All of that is metadata. Now,
the metadata is separate from the actual
data payload.
But, you know, we're able to work with that metadata, because you may want to route
your data file based upon source, for instance.
And then when the data is coming in, if you see source X,
you may want to send it one way; if you see source Y, you may want to send it another way. And all of that would
be in the metadata,
you know, as it's receiving the data file.
We can also take a look at the data file itself, for instance,
but, you know, in a lot of cases we like to use those attributes, and we'll go into that when we're building it.
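To make the FlowFile's two-part structure concrete, here is a minimal Python sketch: a dictionary of attribute key/value pairs riding alongside an opaque content payload, with a route chosen purely from one attribute. This is a conceptual model only, not NiFi's actual implementation; the attribute names, source values, and route names are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Conceptual model of a FlowFile: metadata (attributes) plus an opaque content payload."""
    content: bytes                                   # the data itself, in any format
    attributes: dict = field(default_factory=dict)   # key/value metadata

def route_on_attribute(flowfile, attribute, routes, default="unmatched"):
    """Pick a destination name from one attribute value, loosely
    mimicking the idea behind NiFi's RouteOnAttribute processor."""
    return routes.get(flowfile.attributes.get(attribute), default)

# A message arrives; metadata is attached on ingest.
msg = FlowFile(
    content=b'{"reading": 42}',
    attributes={"source": "sensor-x", "filename": "reading-001.json"},
)

destination = route_on_attribute(
    msg, "source", {"sensor-x": "path_a", "sensor-y": "path_b"}
)
print(destination)  # -> path_a
```

Note that the routing decision never inspects `content`; that separation of metadata from payload is the point being illustrated.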
I'm very interactive; I like to do a lot of hands-on work, and so
we're going to start building some data flows, and as we go through those we'll
basically repeat what we have here on the slide.
So attributes are key-value pairs that store metadata about the data. That includes basic information: file name, size,
timestamp,
and any additional metadata added by processors.
You know, again, say you're using
GetHTTP, or GetFTP, which would get a file from an FTP server.
It will put in
metadata such as the server name,
IP address, you know, some of those types of things that it can capture.
Content: you know, the content of a FlowFile is the actual data
carried by the file. So, you know, depending on the application, it can be text, it can be binary,
any other format. We have used it before to
detect
heart murmurs and stuff like that, heartbeat data. So we would actually bring in
audio
recordings of,
you know, of your heart, filter and sort those, and use, you know,
some additional processors to
extrapolate that data, look for heart murmurs, those types of things.
You know, like I said, I've seen almost every type of data go through NiFi.
So, yeah, the content is what is processed or transformed by the processors.
There are processors to handle attributes, but most of the processors work on the content.
So it actually works on
that package of data.
Life cycle of a FlowFile: you know, FlowFiles are created by source processors as data is ingested into NiFi.
They are processed, and potentially split, merged, or transformed, as they move through the flow.
FlowFiles are finally exported out of NiFi by a destination processor.
You know, so the life of a FlowFile, as you can imagine: it's being ingested into the system,
it's going through, you know, different operations, and, you know, at the end it's going to its destination.
So the final step is to push that FlowFile out to its final destination,
record that in the data governance, and then it drops the FlowFile.
Why FlowFiles are important: you know, understanding the structure and lifecycle of FlowFiles is crucial because they are the backbone
of the data flows in NiFi.
So, you know, one of the things I like to do
is talk about some of the efficiency of a data flow.
So efficient management of FlowFiles ensures that data is processed reliably and efficiently,
maintaining data integrity and traceability.
One of the things I will do at the end of this class is take back any questions that I can't
answer immediately. I'll be able to answer some questions right away,
but sometimes I need to run down those questions.
You know, so if I pause for a second when you're asking questions, it's so I can write it down.
You know, at the end of the class
I like to send out this presentation as well as the Q&A portion.
So any questions, I can run them down, get them answered, and get them incorporated into this presentation.
You know, so the class is over on Wednesday,
just so you'll have it for reference, and, you know, some of that training material that we can leave behind.
I
think I've kind of nailed, you know, some of the key concepts of NiFi in depth,
but just in case: you know, processors. They're the primary component within NiFi. We'll talk about that a lot.
There's different types of processors, you know,
processors tailored for different tasks.
Just so you know, and because I'm still part of that community,
I know what's coming up,
you know, some of the nuances there. When you download NiFi now, you still get,
you know, I think it's like 300 processors out of the box.
You know, one of the biggest complaints is, you know, "I just don't really need all these processors" or "I need my own
processor." And the download is one and a half gigs
just for NiFi, and most of that space is actually the processors.
So, you know, one thing to keep in mind is, as NiFi continues to release updates,
in the updates they are going to
not bundle as many processors, and you can go to
some different sources, you know, online and others, to pull those down.
They will still be built and ready to go,
and there will be some, you know, source code that you will need to compile and build and deploy.
But for today, for instance, we have all the processors we will need within NiFi.
Custom processors: we've already talked about this too. In
my
experience, right, usually what comes out of the box will work about 95% of the time.
I do run into cases where we will need a custom processor.
You know, I can think of a couple from this past implementation I
did, where we needed some specialized connectors for some of the
tools, for instance, as well as log systems,
things like Graylog and other things out there.
So being able to interface with different applications, you know, that's usually when we build a new processor.
We will build a new processor
depending on, you know... There are some models that you can run in flight:
you know, you can do image classification,
image recognition models, you know, things like that, as the data is coming through.
Depending on the output of that model, you may, you know, filter, or change direction, or send it to a different data flow.
You know, so there's a lot of capabilities,
you know, for custom processors and those types of things.
Connections are links that route FlowFiles between processors. We'll go into that and talk about back pressure,
you know, some of those things. They not only transfer data
but also control the data flow management, such as prioritization, back pressure, and load balancing. You know,
there are a few different policies within NiFi.
You know, you can do a FIFO method, first in, first out.
You can do, you know, some very advanced
routing with the rules engine, for instance, you know.
We'll go into some of the back pressure, what it does, as well as some of the load balancing.
And then to finish this off: you know, enhancing data flow with connections.
Connections can be configured with specific settings to manage how data moves through the system. You may,
you know, you may have a use case where you need data to arrive
at a processor before another packet of data arrives; you can set that up. You know, you may,
you know, you may have a data flow that you want to take priority in its
processing, where, you know, you've got other data flows that are kind of a lower priority.
You know, you could set those types of things.
You know, there's a lot of capabilities here, a lot of customization that's part of NiFi.
Again, you know, that's the power of it.
But when we get down to some of the design principles and how to do things,
you know, we'll see this even in this class, on some of the tasks where we will have to build a flow and how
they will be different.
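As a rough sketch of the back-pressure idea described above, where an upstream processor stops producing once a connection's queue hits a configured object-count threshold, consider this toy bounded queue. It illustrates the concept only; NiFi's real connections also support a data-size threshold, swappable prioritizers, and load balancing, none of which this sketch reproduces.

```python
from collections import deque

class Connection:
    """Toy connection queue with an object-count back-pressure
    threshold, illustrating (not reproducing) NiFi's behavior."""
    def __init__(self, backpressure_object_threshold=3):
        self.queue = deque()
        self.threshold = backpressure_object_threshold

    def backpressure_applied(self):
        # Upstream should pause once the queue reaches the threshold.
        return len(self.queue) >= self.threshold

    def offer(self, flowfile):
        if self.backpressure_applied():
            return False              # producer must wait and retry
        self.queue.append(flowfile)   # FIFO: first in, first out
        return True

    def poll(self):
        return self.queue.popleft() if self.queue else None

conn = Connection(backpressure_object_threshold=2)
print(conn.offer("ff-1"), conn.offer("ff-2"), conn.offer("ff-3"))  # True True False
conn.poll()                   # downstream consumes one FlowFile...
print(conn.offer("ff-3"))     # ...so the producer can enqueue again: True
```

The key design point is that back pressure propagates naturally: a slow consumer fills its inbound queue, which pauses its producer, which in turn fills the queue behind it, and so on up the flow.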
We're getting close
to getting done with the presentation, then going on break, and when we get back from break
we'll work on getting NiFi up and running.
But templates:
you know, templates are
parts of a data flow that can be saved and reused.
You see this quite a bit, you know.
Most
organizations have moved, you know, away from templates and to version control, as you can imagine,
just because, you know, you can integrate this into your CI/CD process.
You know, templates...
You can't work with templates like you can with,
you know, a flow backed up into Git, or GitLab or GitHub,
you know, or the NiFi Registry, which we'll also go over,
and those types of things. But, you know, you can create a template.
I like creating templates sometimes because, you know,
I don't have to worry about the GitLab and GitHub connection, those types of things.
I can go to the canvas, I can build my flow,
I can save it as a template and send it to my colleague, for instance.
My colleague can quickly import that template, that flow will be up and running on their canvas, and they can go from there.
So,
so templates are pretty important.
But, you know, here lately it's more and more about version control. So:
Templates encapsulate a set of processors, connections, and controller services for a specific task or workflow.
You know, templates simplify the deployment of common patterns and promote best practices by allowing users to deploy tested flows quickly.
You know, tested flows quickly: so if you develop a flow, you can save it as a template,
export that as an XML file, send that to your colleague, and they should be able to
quickly, you know, get that flow up and running and go from there. So, you know, that's templates.
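Sharing an exported template can also be scripted against NiFi's REST API. The sketch below only builds the upload endpoint URL as documented for NiFi 1.x (the template XML is POSTed as a multipart form field named `template`); the base URL and process-group ID here are placeholder values, and actually posting of course requires a running NiFi instance.

```python
def template_upload_url(base_url, process_group_id):
    """Build the NiFi 1.x REST endpoint for uploading an exported
    template XML into a process group.
    base_url is something like "http://localhost:8080" (placeholder)."""
    base = base_url.rstrip("/")  # tolerate a trailing slash
    return f"{base}/nifi-api/process-groups/{process_group_id}/templates/upload"

# Hypothetical values, for illustration only:
print(template_upload_url("http://localhost:8080/", "root"))
# -> http://localhost:8080/nifi-api/process-groups/root/templates/upload
```

From there, any HTTP client can do the multipart POST; the point is simply that "save as template, send the XML to a colleague" is automatable, which is also why most teams eventually graduate to Registry-based version control.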
NiFi does integrate with NiFi Registry, which we will go over,
which supports versioning of data flows.
Version control is crucial for managing changes to data flows over time,
allowing users to track modifications, revert to previous versions,
and ensure that deployments across different environments are consistent.
That's the key one that we will be working off of. I know we have,
let's see, we have a few folks that I've written down that
would be interested in that:
some sysadmins and those folks.
So the main thing here is: we're going to touch on templates, we're probably going to save a template, but version control
will be our main
avenue of saving flows and those types of things.
And we'll go into using
NiFi Registry for version control, you know.
NiFi Registry allows for the storing, retrieval, and managing of versioned flows.
When we go to the NiFi desktop,
and after we get Registry up and running, you're going to be able to save your flows, check them in, check them out,
and those types of things.
And then we will talk about how you can version control those from Registry into your own
GitLab environment.
I don't know if someone wants to let me know what environment you use; I can focus on that.
But, you know, we can work with a lot of different version control systems.
Okay, so
Let me see if there's anything else in the chat.
Okay
so
What I like to do is pause here before we go for a quick break.
But,
you know, what challenges
do you anticipate in implementing or migrating NiFi
into your current workflow? You know, I'd like to hear from the group on some of the challenges you may have,
and like I said, that helps me in tailoring the conversation, as well as,
you know,
what we train on. So, you know,
what are some of your challenges in implementing or migrating to NiFi in your current process, right?
And feel free, someone, to just start talking.
Fear of the unknown.
So it's just, it's still single-user mode. It seems like whatever option we select,
deploying it from a container, it's
single-user mode. So I'm wondering if...
It is not, but we will touch on that. I can understand that pain point as well.
Okay
Okay
Okay, so,
no, and all those are
good things, and I really like the "fear of the unknown,"
you know,
just because once you see how easy it is to
operate and to start off, you know,
I think it's pretty quick to get up and running.
Then it becomes pretty daunting because of all the capabilities and the options you may have, and it gets a little overwhelming.
But those are some things I will definitely touch on: the multi-tenancy.
Not necessarily in this class, but what I will do is take that back,
and I'm gonna work that in for, like, tomorrow or Wednesday,
to definitely go over some of that and what that would look like. We do have Docker Desktop on
all of our VMs, and so, you know, we can touch on that and see how that works.
And then definitely we can hit some security aspects all day long.
Okay
And I ask that because, you know, I'm trying to get a better understanding of, you know, some of the data governance
requirements you may have, you know, some of the thoughts. You know, there's,
you know, there's big data governance packages that are out there.
You know, do you have those types of requirements?
You know, those types of things, because it helps me kind of tailor this to what you expect.
Does anybody want to speak on their data governance practices and how NiFi fits in, you know,
to get, you know, some additional information on top of that?
One of the ideas I had in my head was, like, you know, getting event logs or whatever into
NiFi, and I know from, like, the central log server SRG, they want to make sure that the data hasn't been modified.
So I think if we went down a road like that, data governance could help in that aspect.
But for, like, the test community, that doesn't have to be answered now.
That's a good point, that's a good point.
So that chain of custody, right, you know, that...
You can see if that data was manipulated, with the security aspect behind it.
That's why telecoms are using this, you know, because of some of those capabilities.
It can digest all that, you know, fun stuff, and associate it to that particular message.
But it's not
easy to do with, like, our syslog and stuff like that. So this could help us, you know...
Maybe this is something that we could look into at some point, that could help us close that
gap.
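The tamper-evidence idea raised here, hashing each log message and chaining the digests so any later modification is detectable, can be sketched in a few lines. This is a generic integrity-chain illustration, not NiFi's provenance implementation; NiFi records its own provenance events with content digests, which is what gives you the chain of custody discussed above.

```python
import hashlib

def chain_digests(messages, seed=b"genesis"):
    """Build a hash chain: each digest covers the message AND the
    previous digest, so altering any earlier message changes every
    digest after it (a tamper-evident chain of custody)."""
    digests = []
    prev = hashlib.sha256(seed).hexdigest()
    for msg in messages:
        prev = hashlib.sha256(prev.encode() + msg.encode()).hexdigest()
        digests.append(prev)
    return digests

logs = ["login ok", "config changed", "logout"]
original = chain_digests(logs)
tampered = chain_digests(["login ok", "config CHANGED", "logout"])

print(original[0] == tampered[0])   # True:  first record untouched
print(original[1] == tampered[1])   # False: modification detected here...
print(original[2] == tampered[2])   # False: ...and in everything after it
```

Comparing the two chains pinpoints where the tampering started, which is exactly the property a central log server wants when verifying that forwarded data was not modified in transit.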
Okay, anybody else with some of your data governance?
Are there any specific processes in your operation that could immediately benefit from NiFi's capabilities? And I know that's kind of broad,
but, you know, we'd like to hear from the audience:
where do you see NiFi fitting in, and how can it help you,
you know, do that data orchestration with all it offers?
But I think Amanda's taking a break.
Wow
And that actually touches on some of the things that
kicked this off, where,
you know, we see scripts, just a single Python script running, just to do something, right? And,
you know, it seems kind of small, and you're putting this big project in front of it.
But, you know, really understanding, you know, those data sources, getting those data sources in... Did you say Access, like Microsoft Access?
Yes.
Next you're telling me you use Excel.
So, you know,
you know, keeping the
record of all of that, you know, is definitely needed. I think it will help you with, you know, some of your compliance issues.
It'll help automate that.
You know, there's a lot of rules and a lot of triggers and things like that you can build in,
you know, and so...
Perfect. Okay
You know, what else? What other immediate benefits do you guys hope to get from NiFi?
We're working on a more
long-term process, so it's not really immediate, but it's our only use case right now. It's
essentially a real-time data streaming pipeline from one of our test sites, to do some
verification on data. So running it through some machine learning models to identify, like, bad sensor data,
and doing some data verification while it's coming through the pipeline,
we can sort of facilitate an automated QA on the data.
Oh, really nice.
Okay
Um
I've heard potentially you also, and this is maybe related to the real-time pipeline,
but, you know, you're trying to get data from a TAK, or get data to a TAK, smartly filter those out,
you know, those types of things. You know, some of the questions
previously were like, you know, how do you get MiNiFi to pull that data in, send it to your TAK,
and your TAK can filter it, what that kind of architecture looks like. So
I took note of that previously, but I think that's still valid.
Yes or no? Yes.
Oh, beautiful. So those are where you plan to use MiNiFi.
What is it? Is it like an edge device running Linux?
Is it like a Windows laptop?
You know, can you go into details? What, you know, what is that?
But there could be future use cases on some more restricted instrumentation, or, you know, possibly a microcontroller
reference.
All those things, okay.
Alright and then
How might you use NiFi's scalability and flexibility to improve the data handling and processing in future projects?
So, you know, I ask this question because I'm trying to... I think it was
Sean,
Amanda,
and Aaron, you know, the sysadmins,
looking at, you know, deploying this in a multi-tenancy,
scalable fashion, and so that's why I'm asking: you know, how do you plan to use NiFi for this?
And that'll kind of help me tailor the conversation when we go into some of the scalability, some of the flexibility.
But I can say that we're designing it in this way to account for... there's a lot of data, and also at WPG we're expecting
a
lot of people, when they see the platform, to want to use it.
So we're trying to design it upfront to be able to be scalable, but I would say our use case
doesn't really need that. Okay.
Okay, so you have the scalability issue:
several different
locations that will be
creating a lot of data during the day. At least for this
initial project it's just going to be, sort of, you know, one site, and then it's going to expand out to multiple sites, for
sort of a single mission area, and then it might move out to more sites.
So scalability initially isn't going to be extremely important, but as it goes on,
there's probably going to be quite a few workflows.
So one of the use cases that I think we might have in the future is:
we have a data lake that's going to be in the cloud, but there's a lot of talk from our data officer about having
an on-prem data lake,
and getting that test data to both places. Mm-hmm.
And what storage, like what database and storage solution, are you looking at for your, you know, your...?
The beauty of NiFi is it does have an S3 processor; you know, it has an Azure Blob Storage processor,
you know, and those types of things.
You said you're using MinIO?
Oh.
So there's a processor for both of them. So, perfect. The MinIO one doesn't...
I don't think it comes out of the box yet, but it is available as a processor on GitHub.
You know, so, perfect. No, I like that. I've actually seen that quite a bit
lately with MinIO: you know, folks coming out of the cloud and still, you know,
kind of keeping it local, for, you know, security reasons, compliance reasons, and just,
you know, overall process.
Yep, exactly
Exactly
No, and we will. I'll make sure to touch on some of those things,
you know, as we go through. We'll start building FlowFiles and, you know, those types of stuff.
We could potentially even,
on the third day,
do a flow where we pick data up and put some in MinIO.
Hey,
That being said
let's take our first break. I need to get water since I'm talking a lot.
I want to make sure to keep my voice throughout the day. Let's take a 15-minute,
you know, rest-my-mouth break: rest, take a break, get some water. We'll meet back here at 11:50,
and then, let me do my time, I think it's 9:50 your time, and
we'll go through
installing NiFi on Windows and start working on building our first flow.
So we'll see everybody back here in about 15 minutes,
and if you need anything, just put it in the chat. I'll be running back and forth, getting water and the restroom.
All right, see you in 15.
My desk is
And then we are going to get started on installing NiFi.
Pedro, I don't know if you're back or not, but I really like to
wait till you're here, so you hear about our processes.
So, I don't know if anybody's here, but being a former soldier, being within the Army itself so many years, now I
completely understand the nuances.
While we wait, we'll give it a couple more minutes.
Usually, you know,
you wouldn't install it this way, but I felt it was pretty critical for us to actually do an install within Windows,
just so everyone has that experience.
If you're going to be working within NiFi, even in a local environment...
Who knows, you may want to spin up your own instance on your own laptop,
get it working, get your flow built, you know, test some things out, save it as a template,
and then, you know, export that to your dev environment, your test environment, you know, those types of things.
When we are, and I'll go over this, you know, in detail, but when we're installing NiFi, there are some key things to
take a look at, because there are some specific directories being created,
and there's a reasoning behind that. There are some specific directories that you will need to understand
and learn about as well. So that's one of the reasons I like
to really go in depth.
And I'm taking a risk here, because I don't have it installed; I'm gonna walk... you know, we're gonna all do it together.
I do have Java on everyone's machine.
So we'll go through some of the basics. But we'll give it just another minute and we'll get started. If you're back,
can you just let me know, like,
how long you all think you need for lunch? Like I said, around 45 minutes is
kind of what I like to go off of, but I can do an hour as well, no problem.
Okay, okay, we'll do 45, and
that will give you the capability to eat and then also,
you know, play around with whatever we've already built and done, because you're gonna have...
You're gonna have this desktop environment throughout the training,
and you do have the capability to download any information that you have there. You can,
you know, upload the presentation as well, so you can have it, you know, on the desktop environment.
So there's a lot of capabilities. But we will go ahead and get started. Let me exit all this.
Okay, so if everyone can
go ahead and start working off your desktop. I'm sharing my screen,
but, you know, if you can, let's go ahead and get logged into the desktop environment.
Pulling everyone up... Looks like everyone is good to go.

on 2024-05-06

Visit the Apache Nifi - GROUP 1 course recordings page