Apache NiFi - GROUP 1


language: EN

So for those that are on your desktop, can you bring it up? You should see a blank canvas,
you know, a blank desktop like mine. You'll have Microsoft Edge, Docker Desktop,
uploads, you know, those types of things.
But one of the things I want you to check is if you can open your File Explorer.
When you go to Downloads,
there should be
quite a few downloads: MiNiFi, NiFi, the NiFi Registry, the NiFi Toolkit, those types of things.
If you do not have that information, let me know.
That way I can go in and replicate the information to your desktop so that you have it.
Aaron, anything?
Sean,
you look good too.
Amanda, you should have it as well. I remember replicating your desktop. Perfect, perfect, perfect.
Pedro, are you able to see the files, Pedro?
If you look in your File Explorer and then go to Downloads, you should see
a list of files.
No worries. Again, I'm here on my ranch in Central Texas, so
Starlink, you know, if it decides I get data today, I get it.
Yep, I see your screen.
Perfect, perfect, perfect. So the downloads all worked and you can go there.
So what we'll do is go through some of the PowerPoint presentation a little bit more.
We'll take a quick break; I like to do like a 10 to 15 minute break,
you know, depending on how many questions we get and those types of things.
It's my understanding
everyone's in Arizona, so my lunch time is usually in about an hour,
but for Arizona time we will try to go to lunch about 11:30 your time.
I like to do about 45 minutes for lunch, but if you need an hour we can do that as well.
You know, if there's not a lot of questions or a lot of interaction, we usually end a little early;
you know, there's some time built in for those interactions.
So we'll do that. So everybody's logged into their desktop, and you're able to see the files that we're going to work with.
When we get to this part, we're going to actually do an install of NiFi on the Windows operating system.
If you were installing this in Linux, it would actually be a little bit easier, but there is a ton of documentation.
As you can imagine for a government product, the documentation for NiFi is very extensive.
Everything I'm teaching today, everything I'm going over, is in the NiFi docs.
You can actually follow along if you go to
nifi.apache.org. You will see,
you know, tons and tons of documentation: you know, what is NiFi,
what are the core concepts, the architecture, some of these things that we're going over.
You know, even the architecture, for instance, where you have the OS, the host, which in our case is
Windows.
For those that are technical, it's a Jetty web server serving this UI up.
Then we have the flow controller,
processors,
you know, the FlowFile repository that we talked about, the content repository, and the provenance repository.
They're on local storage. Now, when it says local storage, that doesn't necessarily mean
it's being stored to a local disk.
That local storage could be a NAS or some other type of network-attached storage system, or,
you know, those types of things.
So if you want to kind of follow along
with some of the documentation, it's all there if you go to nifi.apache.org.
You'll see everything.
The documentation is very good. We're working off of the NiFi version 1 documentation,
just because version 2 just came out. We'll touch on some of that.
What I like to go off of is the admin guide or the user guide.
Those are the two guides that I work with. When we go into some of the processors,
we'll actually go in and talk about some of that.
But if you look at the admin guide, you know, for an open source product,
the documentation for NiFi is amazing.
Usually we don't see this type of documentation.
But as you can imagine, being a government product that was released to the open source world,
we had to do a lot of documentation before that was released.
The documentation is also built into NiFi,
even down to the processors. When you're developing a processor, for those that have developed a processor before,
you did have a place where you could include a description as well as other documentation.
So, you know, we'll go through that, but I want to make sure
that your desktops are running, and if you're in your browser, you can pull up the NiFi user guide and admin guide
and follow along.
Okay
You know, a FlowFile is an
abstraction that represents a single piece of information or a data object
within a data flow.
So take, in this case, and I'm using this example because I just implemented a
huge prototype for this:
you know, log messages coming in. You may have a single message, and when it comes into NiFi,
you know, it treats that as a FlowFile.
So that FlowFile
is that message. It can be in any format, it can come in over any kind of protocol, those types of things.
You know, so when NiFi receives that,
it generates that as a FlowFile, and then, you know,
within that FlowFile there are two major components: the metadata and the data payload.
The metadata is the attributes, and we'll go into more of that, where we're able to
take that data and make it into an attribute.
It has a lot of attributes.
So as soon as NiFi touches this data, it gets assigned attributes, like,
you know, a date-time group of when it was noticed, what the source was,
you know, those types of things. When we're using a processor that goes and grabs data from
an HTTP website, for instance, it will record what URL it
grabbed the data from, and those types of things. All of that is metadata. Now,
the metadata is separate from the actual
data payload.
But, you know, we're able to work with that metadata, because you may want to route
your data file based upon source, for instance.
And then when the data is coming in, if you see source X,
you may want to send it one way; if you see source Y, you may want to send it another way. And all of that would
be in the metadata,
you know, as it's receiving the data file.
We can also take a look at the data file itself, for instance,
but, you know, in a lot of cases we like to use those attributes, and we'll go into that when we're building it.
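To make the FlowFile's two-part structure concrete, here is a minimal Python sketch: a dictionary of attribute key/value pairs riding alongside an opaque content payload, with a route chosen purely from one attribute. This is a conceptual model only, not NiFi's actual implementation; the attribute names, source values, and route names are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Conceptual model of a FlowFile: metadata (attributes) plus an opaque content payload."""
    content: bytes                                   # the data itself, in any format
    attributes: dict = field(default_factory=dict)   # key/value metadata

def route_on_attribute(flowfile, attribute, routes, default="unmatched"):
    """Pick a destination name from one attribute value, loosely
    mimicking the idea behind NiFi's RouteOnAttribute processor."""
    return routes.get(flowfile.attributes.get(attribute), default)

# A message arrives; metadata is attached on ingest.
msg = FlowFile(
    content=b'{"reading": 42}',
    attributes={"source": "sensor-x", "filename": "reading-001.json"},
)

destination = route_on_attribute(
    msg, "source", {"sensor-x": "path_a", "sensor-y": "path_b"}
)
print(destination)  # -> path_a
```

Note that the routing decision never inspects `content`; that separation of metadata from payload is the point being illustrated.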
I'm very interactive; I like to do a lot of hands-on work, and so
we're going to start building some data flows, and as we go through those we'll
basically repeat what we have here on the slide.
So attributes are key-value pairs that store metadata about the data. That includes basic information: file name, size,
timestamp,
and any additional metadata added by processors.
You know, again, say you're using
GetHTTP, or GetFTP, which would get a file from an FTP server.
It will put in
metadata such as the server name,
IP address, you know, some of those types of things that it can capture.
Content: you know, the content of a FlowFile is the actual data
carried by the file. So, you know, depending on the application, it can be text, it can be binary,
any other format. We have used it before to
detect
heart murmurs and stuff like that, heartbeat data. So we would actually bring in
audio
recordings of,
you know, of your heart, filter and sort those, and use, you know,
some additional processors to
extrapolate that data, look for heart murmurs, those types of things.
You know, like I said, I've seen almost every type of data go through NiFi.
So, yeah, the content is what is processed or transformed by the processors.
There are processors to handle attributes, but most of the processors work on the content.
So it actually works on
that package of data.
Life cycle of a FlowFile: you know, FlowFiles are created by source processors as data is ingested into NiFi.
They are processed, and potentially split, merged, or transformed, as they move through the flow.
FlowFiles are finally exported out of NiFi by a destination processor.
You know, so the life of a FlowFile, as you can imagine: it's being ingested into the system,
it's going through, you know, different operations, and, you know, at the end it's going to its destination.
So the final step is to push that FlowFile out to its final destination,
record that in the data governance, and then it drops the FlowFile.
Why FlowFiles are important: you know, understanding the structure and lifecycle of FlowFiles is crucial because they are the backbone
of the data flows in NiFi.
So, you know, one of the things I like to do
is talk about some of the efficiency of a data flow.
So efficient management of FlowFiles ensures that data is processed reliably and efficiently,
maintaining data integrity and traceability.
One of the things I will do at the end of this class is take back any questions that I can't
answer immediately. I'll be able to answer some questions right away,
but sometimes I need to run down those questions.
You know, so if I pause for a second when you're asking questions, it's so I can write it down.
You know, at the end of the class
I like to send out this presentation as well as the Q&A portion.
So any questions, I can run them down, get them answered, and get them incorporated into this presentation.
You know, so the class is over on Wednesday,
just so you'll have it for reference, and, you know, some of that training material that we can leave behind.
I
think I've kind of nailed, you know, some of the key concepts of NiFi in depth,
but just in case: you know, processors. They're the primary component within NiFi. We'll talk about that a lot.
There's different types of processors, you know,
processors tailored for different tasks.
Just so you know, and because I'm still part of that community,
I know what's coming up,
you know, some of the nuances there. When you download NiFi now, you still get,
you know, I think it's like 300 processors out of the box.
You know, one of the biggest complaints is, you know, "I just don't really need all these processors" or "I need my own
processor." And the download is one and a half gigs
just for NiFi, and most of that space is actually the processors.
So, you know, one thing to keep in mind is, as NiFi continues to release updates,
in the updates they are going to
not bundle as many processors, and you can go to
some different sources, you know, online and others, to pull those down.
They will still be built and ready to go,
and there will be some, you know, source code that you will need to compile and build and deploy.
But for today, for instance, we have all the processors we will need within NiFi.
Custom processors: we've already talked about this too. In
my
experience, right, usually what comes out of the box will work about 95% of the time.
I do run into cases where we will need a custom processor.
You know, I can think of a couple from this past implementation I
did, where we needed some specialized connectors for some of the
tools, for instance, as well as log systems,
things like Graylog and other things out there.
So being able to interface with different applications, you know, that's usually when we build a new processor.
We will build a new processor
depending on, you know... There are some models that you can run in flight:
you know, you can do image classification,
image recognition models, you know, things like that, as the data is coming through.
Depending on the output of that model, you may, you know, filter, or change direction, or send it to a different data flow.
You know, so there's a lot of capabilities,
you know, for custom processors and those types of things.
Connections are links that route FlowFiles between processors. We'll go into that and talk about back pressure,
you know, some of those things. They not only transfer data
but also control the data flow management, such as prioritization, back pressure, and load balancing. You know,
there are a few different policies within NiFi.
You know, you can do a FIFO method, first in, first out.
You can do, you know, some very advanced
routing with the rules engine, for instance, you know.
We'll go into some of the back pressure, what it does, as well as some of the load balancing.
And then to finish this off: you know, enhancing data flow with connections.
Connections can be configured with specific settings to manage how data moves through the system. You may,
you know, you may have a use case where you need data to arrive
at a processor before another packet of data arrives; you can set that up. You know, you may,
you know, you may have a data flow that you want to take priority in its
processing, where, you know, you've got other data flows that are kind of a lower priority.
You know, you could set those types of things.
You know, there's a lot of capabilities here, a lot of customization that's part of NiFi.
Again, you know, that's the power of it.
But when we get down to some of the design principles and how to do things,
you know, we'll see this even in this class, on some of the tasks where we will have to build a flow and how
they will be different.
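As a rough sketch of the back-pressure idea described above, where an upstream processor stops producing once a connection's queue hits a configured object-count threshold, consider this toy bounded queue. It illustrates the concept only; NiFi's real connections also support a data-size threshold, swappable prioritizers, and load balancing, none of which this sketch reproduces.

```python
from collections import deque

class Connection:
    """Toy connection queue with an object-count back-pressure
    threshold, illustrating (not reproducing) NiFi's behavior."""
    def __init__(self, backpressure_object_threshold=3):
        self.queue = deque()
        self.threshold = backpressure_object_threshold

    def backpressure_applied(self):
        # Upstream should pause once the queue reaches the threshold.
        return len(self.queue) >= self.threshold

    def offer(self, flowfile):
        if self.backpressure_applied():
            return False              # producer must wait and retry
        self.queue.append(flowfile)   # FIFO: first in, first out
        return True

    def poll(self):
        return self.queue.popleft() if self.queue else None

conn = Connection(backpressure_object_threshold=2)
print(conn.offer("ff-1"), conn.offer("ff-2"), conn.offer("ff-3"))  # True True False
conn.poll()                   # downstream consumes one FlowFile...
print(conn.offer("ff-3"))     # ...so the producer can enqueue again: True
```

The key design point is that back pressure propagates naturally: a slow consumer fills its inbound queue, which pauses its producer, which in turn fills the queue behind it, and so on up the flow.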
We're getting close
to getting done with the presentation, then going on break, and when we get back from break
we'll work on getting NiFi up and running.
But templates:
you know, templates are
parts of a data flow that can be saved and reused.
You see this quite a bit, you know.
Most
organizations have moved, you know, away from templates and to version control, as you can imagine,
just because, you know, you can integrate this into your CI/CD process.
You know, templates...
You can't work with templates like you can with,
you know, a flow backed up into Git, or GitLab or GitHub,
you know, or the NiFi Registry, which we'll also go over,
and those types of things. But, you know, you can create a template.
I like creating templates sometimes because, you know,
I don't have to worry about the GitLab and GitHub connection, those types of things.
I can go to the canvas, I can build my flow,
I can save it as a template and send it to my colleague, for instance.
My colleague can quickly import that template, that flow will be up and running on their canvas, and they can go from there.
So,
so templates are pretty important.
But, you know, here lately it's more and more about version control. So:
Templates encapsulate a set of processors, connections, and controller services for a specific task or workflow.
You know, templates simplify the deployment of common patterns and promote best practices by allowing users to deploy tested flows quickly.
You know, tested flows quickly: so if you develop a flow, you can save it as a template,
export that as an XML file, send that to your colleague, and they should be able to
quickly, you know, get that flow up and running and go from there. So, you know, that's templates.
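Sharing an exported template can also be scripted against NiFi's REST API. The sketch below only builds the upload endpoint URL as documented for NiFi 1.x (the template XML is POSTed as a multipart form field named `template`); the base URL and process-group ID here are placeholder values, and actually posting of course requires a running NiFi instance.

```python
def template_upload_url(base_url, process_group_id):
    """Build the NiFi 1.x REST endpoint for uploading an exported
    template XML into a process group.
    base_url is something like "http://localhost:8080" (placeholder)."""
    base = base_url.rstrip("/")  # tolerate a trailing slash
    return f"{base}/nifi-api/process-groups/{process_group_id}/templates/upload"

# Hypothetical values, for illustration only:
print(template_upload_url("http://localhost:8080/", "root"))
# -> http://localhost:8080/nifi-api/process-groups/root/templates/upload
```

From there, any HTTP client can do the multipart POST; the point is simply that "save as template, send the XML to a colleague" is automatable, which is also why most teams eventually graduate to Registry-based version control.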
NiFi does integrate with NiFi Registry, which we will go over,
which supports versioning of data flows.
Version control is crucial for managing changes to data flows over time,
allowing users to track modifications, revert to previous versions,
and ensure that deployments across different environments are consistent.
That's the key one that we will be working off of. I know we have,
let's see, we have a few folks that I've written down that
would be interested in that:
some sysadmins and those folks.
So the main thing here is: we're going to touch on templates, we're probably going to save a template, but version control
will be our main
avenue of saving flows and those types of things.
And we'll go into using
NiFi Registry for version control, you know.
NiFi Registry allows for the storing, retrieval, and managing of versioned flows.
When we go to the NiFi desktop,
and after we get Registry up and running, you're going to be able to save your flows, check them in, check them out,
and those types of things.
And then we will talk about how you can version control those from Registry into your own
GitLab environment.
I don't know if someone wants to let me know what environment you use; I can focus on that.
But, you know, we can work with a lot of different version control systems.
Okay, so
Let me see if there's anything else in the chat.
Okay
so
What I like to do is pause here before we go for a quick break.
But,
you know, what challenges
do you anticipate in implementing or migrating NiFi
into your current workflow? You know, I'd like to hear from the group on some of the challenges you may have,
and like I said, that helps me in tailoring the conversation, as well as,
you know,
what we train on. So, you know,
what are some of your challenges in implementing or migrating to NiFi in your current process, right?
And feel free, someone, to just start talking.
Fear of the unknown.
So it's just, it's still single-user mode. It seems like whatever option we select,
deploying it from a container, it's
single-user mode. So I'm wondering if...
It is not, but we will touch on that. I can understand that pain point as well.
Okay
Okay
Okay, so,
no, and all those are
good things, and I really like the "fear of the unknown,"
you know,
just because once you see how easy it is to
operate and to start off, you know,
I think it's pretty quick to get up and running.
Then it becomes pretty daunting because of all the capabilities and the options you may have, and it gets a little overwhelming.
But those are some things I will definitely touch on: the multi-tenancy.
Not necessarily in this class, but what I will do is take that back,
and I'm gonna work that in for, like, tomorrow or Wednesday,
to definitely go over some of that and what that would look like. We do have Docker Desktop on
all of our VMs, and so, you know, we can touch on that and see how that works.
And then definitely we can hit some security aspects all day long.
Okay
And I ask that because, you know, I'm trying to get a better understanding of, you know, some of the data governance
requirements you may have, you know, some of the thoughts. You know, there's,
you know, there's big data governance packages that are out there.
You know, do you have those types of requirements?
You know, those types of things, because it helps me kind of tailor this to what you expect.
Does anybody want to speak on their data governance practices and how NiFi fits in, you know,
to get, you know, some additional information on top of that?
One of the ideas I had in my head was, like, you know, getting event logs or whatever into
NiFi, and I know from, like, the central log server SRG, they want to make sure that the data hasn't been modified.
So I think if we went down a road like that, data governance could help in that aspect.
But for, like, the test community, that doesn't have to be answered now.
That's a good point, that's a good point.
So that chain of custody, right, you know, that...
You can see if that data was manipulated, with the security aspect behind it.
That's why telecoms are using this, you know, because of some of those capabilities.
It can digest all that, you know, fun stuff, and associate it to that particular message.
But it's not
easy to do with, like, our syslog and stuff like that. So this could help us, you know...
Maybe this is something that we could look into at some point, that could help us close that
gap.
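The tamper-evidence idea raised here, hashing each log message and chaining the digests so any later modification is detectable, can be sketched in a few lines. This is a generic integrity-chain illustration, not NiFi's provenance implementation; NiFi records its own provenance events with content digests, which is what gives you the chain of custody discussed above.

```python
import hashlib

def chain_digests(messages, seed=b"genesis"):
    """Build a hash chain: each digest covers the message AND the
    previous digest, so altering any earlier message changes every
    digest after it (a tamper-evident chain of custody)."""
    digests = []
    prev = hashlib.sha256(seed).hexdigest()
    for msg in messages:
        prev = hashlib.sha256(prev.encode() + msg.encode()).hexdigest()
        digests.append(prev)
    return digests

logs = ["login ok", "config changed", "logout"]
original = chain_digests(logs)
tampered = chain_digests(["login ok", "config CHANGED", "logout"])

print(original[0] == tampered[0])   # True:  first record untouched
print(original[1] == tampered[1])   # False: modification detected here...
print(original[2] == tampered[2])   # False: ...and in everything after it
```

Comparing the two chains pinpoints where the tampering started, which is exactly the property a central log server wants when verifying that forwarded data was not modified in transit.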
Okay, anybody else with some of your data governance?
Are there any specific processes in your operation that could immediately benefit from NiFi's capabilities? And I know that's kind of broad,
but, you know, we'd like to hear from the audience:
where do you see NiFi fitting in, and how can it help you,
you know, do that data orchestration with all it offers?
But I think Amanda's taking a break.
Wow
And that actually touches on some of the things that
kicked this off, where,
you know, we see scripts, just a single Python script running, just to do something, right? And,
you know, it seems kind of small, and you're putting this big project in front of it.
But, you know, really understanding, you know, those data sources, getting those data sources in... Did you say Access, like Microsoft Access?
Yes.
Next you're telling me you use Excel.
So, you know,
you know, keeping the
record of all of that, you know, is definitely needed. I think it will help you with, you know, some of your compliance issues.
It'll help automate that.
You know, there's a lot of rules and a lot of triggers and things like that you can build in,
you know, and so...
Perfect. Okay
You know, what else? What other immediate benefits do you guys hope to get from NiFi?
We're working on a more
long-term process, so it's not really immediate, but it's our only use case right now. It's
essentially a real-time data streaming pipeline from one of our test sites, to do some
verification on data. So running it through some machine learning models to identify, like, bad sensor data,
and doing some data verification while it's coming through the pipeline,
we can sort of facilitate an automated QA on the data.
Oh, really nice.
Okay
Um
I've heard potentially you also, and this is maybe related to the real-time pipeline,
but, you know, you're trying to get data from a TAK, or get data to a TAK, smartly filter those out,
you know, those types of things. You know, some of the questions
previously were like, you know, how do you get MiNiFi to pull that data in, send it to your TAK,
and your TAK can filter it, what that kind of architecture looks like. So
I took note of that previously, but I think that's still valid.
Yes or no? Yes.
Oh, beautiful. So those are where you plan to use MiNiFi.
What is it? Is it like an edge device running Linux?
Is it like a Windows laptop?
You know, can you go into details? What, you know, what is that?
But there could be future use cases on some more restricted instrumentation, or, you know, possibly a microcontroller
reference.
All those things, okay.
Alright and then
How might you use NiFi's scalability and flexibility to improve the data handling and processing in future projects?
So, you know, I ask this question because I'm trying to... I think it was
Sean,
Amanda,
and Aaron, you know, the sysadmins,
looking at, you know, deploying this in a multi-tenancy,
scalable fashion, and so that's why I'm asking: you know, how do you plan to use NiFi for this?
And that'll kind of help me tailor the conversation when we go into some of the scalability, some of the flexibility.
But I can say that we're designing it in this way to account for... there's a lot of data, and also at WPG we're expecting
a
lot of people, when they see the platform, to want to use it.
So we're trying to design it upfront to be able to be scalable, but I would say our use case
doesn't really need that. Okay.
Okay, so you have the scalability issue:
several different
locations that will be
creating a lot of data during the day. At least for this
initial project it's just going to be, sort of, you know, one site, and then it's going to expand out to multiple sites, for
sort of a single mission area, and then it might move out to more sites.
So scalability initially isn't going to be extremely important, but as it goes on,
there's probably going to be quite a few workflows.
So one of the use cases that I think we might have in the future is:
we have a data lake that's going to be in the cloud, but there's a lot of talk from our data officer about having
an on-prem data lake,
and getting that test data to both places. Mm-hmm.
And what storage, like what database and storage solution, are you looking at for your, you know, your...?
The beauty of NiFi is it does have an S3 processor; you know, it has an Azure Blob Storage processor,
you know, and those types of things.
You said you're using MinIO?
Oh.
So there's a processor for both of them. So, perfect. The MinIO one doesn't...
I don't think it comes out of the box yet, but it is available as a processor on GitHub.
You know, so, perfect. No, I like that. I've actually seen that quite a bit
lately with MinIO: you know, folks coming out of the cloud and still, you know,
kind of keeping it local, for, you know, security reasons, compliance reasons, and just,
you know, overall process.
Yep, exactly
Exactly
No, and we will. I'll make sure to touch on some of those things,
you know, as we go through. We'll start building FlowFiles and, you know, those types of stuff.
We could potentially even,
on the third day,
do a flow where we pick data up and put some in MinIO.
Hey,
That being said
let's take our first break. I need to get water since I'm talking a lot.
I want to make sure to keep my voice throughout the day. Let's take a 15-minute,
you know, rest-my-mouth break: rest, take a break, get some water. We'll meet back here at 11:50,
and then, let me do my time, I think it's 9:50 your time, and
we'll go through
installing NiFi on Windows and start working on building our first flow.
So we'll see everybody back here in about 15 minutes,
and if you need anything, just put it in the chat. I'll be running back and forth, getting water and the restroom.
All right, see you in 15.
My desk is
And then we are going to get started on installing NiFi.
Pedro, I don't know if you're back or not, but I really like to
wait till you're here, so you hear about our processes.
So, I don't know if anybody's here, but being a former soldier, being within the Army itself so many years, now I
completely understand the nuances.
While we wait, we'll give it a couple more minutes.
Usually, you know,
you wouldn't install it this way, but I felt it was pretty critical for us to actually do an install within Windows,
just so everyone has that experience.
If you're going to be working within NiFi, even in a local environment...
Who knows, you may want to spin up your own instance on your own laptop,
get it working, get your flow built, you know, test some things out, save it as a template,
and then, you know, export that to your dev environment, your test environment, you know, those types of things.
When we are, and I'll go over this, you know, in detail, but when we're installing NiFi, there are some key things to
take a look at, because there are some specific directories being created,
and there's a reasoning behind that. There are some specific directories that you will need to understand
and learn about as well. So that's one of the reasons I like
to really go in depth.
And I'm taking a risk here, because I don't have it installed; I'm gonna walk... you know, we're gonna all do it together.
I do have Java on everyone's machine.
So we'll go through some of the basics. But we'll give it just another minute and we'll get started. If you're back,
can you just let me know, like,
how long you all think you need for lunch? Like I said, around 45 minutes is
kind of what I like to go off of, but I can do an hour as well, no problem.
Okay, okay, we'll do 45, and
that will give you the capability to eat and then also,
you know, play around with whatever we've already built and done, because you're gonna have...
You're gonna have this desktop environment throughout the training,
and you do have the capability to download any information that you have there. You can,
you know, upload the presentation as well, so you can have it, you know, on the desktop environment.
So there's a lot of capabilities. But we will go ahead and get started. Let me exit all this.
Okay, so if everyone can
go ahead and start working off your desktop. I'm sharing my screen,
but, you know, if you can, let's go ahead and get logged into the desktop environment.
Pulling everyone up... Looks like everyone is good to go.

on 2024-05-06

Visit the Apache Nifi - GROUP 1 course recordings page