Apache NiFi GROUP 2


language: EN

Perfect. Perfect. Thomas, did you make it back yet? Did he get called out to another
call or something? Sorry, were you looking for me? Tom? Hey, Tom. Yeah, if you can, can
you go ahead and start your desktop, get logged in? I think it should be good to go. Sure.
All right. Looks like everyone is coming up. Peter, just so you know, the uploads folder is what I
uploaded back. I actually need to upload a newer presentation. So the PDF you see there has a couple
of things. Depending on the system you're using and proxies and things like that, the last training
class, you know, a couple of folks got proxied to death. They had to have some fixes applied before
things were working. But once you get logged in, you're going to have your virtual desktop. And
so, you know, the latency sometimes can be an issue; you may point and click at something and it
takes a moment to respond. The Apache NiFi site is the website to go to. You know, the documentation is very, very extensive. As you can imagine,
you know, being a government system, it requires lots of documentation. You can download it from
here, those types of things. So if you're at home and want to play around with NiFi, have at it.
But, you know, you can download that. You can download the source. You know, if you want the
source files and to compile NiFi yourself, you can have the source. You know, it is an open source
application. So you're able to download the source files. If you're a software engineer, you're able
to go in and make changes. You know, if you're running scans or vulnerabilities, you know, have
at it. There's a lot of capabilities, you know, and the administrator guide actually kind of goes
into how to build NiFi from source and those types of things. But for this class, we are
going to work off of the NiFi standard 1.26 binary. This is already pre-built. It's ready to go. We'll
look at the documentation, those types of things. So the main parts of the documentation that we are
going to go over is, you know, part of the admin guide and the user guide. But like I said, this is
extremely well documented. For an open source application, it's actually one of the best
ones out there. I like to just work off of the, you know, the official documentation. But yeah,
you know, I can go right here and as soon as the internet wants to respond, I will have the
documentation on GetFile. There we go. You know, so for instance, the GetFile processor, you know,
creates flow files from files in a directory. It will ignore files it doesn't have at least
read permission to. And then, you know, each processor has properties. Some are required and
some are optional. And then we also have some that, you know, that we can add. There's a
relationship to the processor. There's other attributes, things like that, that we can take a
look at. But for this, for this part of it, you know, just remember that the documentation is there.
So everyone has NiFi downloaded. And I'm going to kind of walk you through. I was debating on
whether to include this as part of the class, but I felt like I think we can all accomplish this
pretty easily. So actually what we're all going to do is kind of install our NiFi and walk through
what some of that means. If you don't understand it or you have any questions or some additional
details you might need, again, feel free to interrupt me. You know, this time between now
and then.
Oh, there you go.
Give it another second. It takes, like I said, it takes a minute to, you know, extract all of that
zip file. That zip file is a 1.2, 1.5 gig file. And, you know, when it extracts, it's going to be
even bigger. That's probably been the biggest complaint that I know of, that the community,
not the NiFi community, but the user community in general, complains about: just the massive
size to download this. But if any of you have played Xbox, you know, some of these games can
be 50-60 gigs now. So, yeah.
All right. It looks like most everyone's got that extracted. Give it just another
second for Alderius to finish up and Peter to finish it up. All right. Let's go back to
my Peter here. So, it looks like it's finished. Perfect. So, you should have a new folder in
Windows, and the only folder in that folder is nifi-1.26.0. So, if you can, double click and go
into that, and then you should see a bin folder, a conf folder, docs, extensions, lib. These are not
all of the folders that NiFi creates. This is just the initial downloaded install. When we get NiFi
up and running and started, it's going to create some additional folders, like our content
repository, and that all lives locally. Now, depending on your strategy of deploying this and scaling this and
some of the other things, you may want to have some of these content repositories, some of these
other repositories on different network-attached storage. I know that for the content
and flowfile and provenance repositories, those are usually stored on high speed drives just because there's a lot
of reading and writing back and forth. And then you'll have some of the other repositories that
really don't get used as much. They're still needed, so you may break this up a little bit
and put some of these folders on some high speed drives, some of the other folders on some normal
drives for cost savings and performance gains. But that all depends on your deployment strategy.
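For the sysadmins following along, those repository locations are just properties in conf/nifi.properties. A rough look at what the defaults tend to be in a 1.26-style install (your copy may differ; paths are relative to the NiFi home directory):

    $ grep -E 'repository.directory|nifi.database.directory' conf/nifi.properties
    nifi.database.directory=./database_repository
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository

Pointing the read/write-heavy ones at fast disks and the rest at cheaper storage is exactly the kind of split being described here.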
Whenever we have time, and we're going to have plenty of time for this, but if you want to go
very technical into details, and I'll be happy to give you my opinion on that, I can get very
technical. I still write software. I still write software for NiFi even. But I kind of like the
training part as well for this. So anyways, so when we extracted, we've got the bin folder,
the conf folder, docs, extensions, and lib. Docs is docs. And as I mentioned earlier, everything
that you can find on the website, you're going to get in the docs folder as well. And NiFi utilizes
that docs folder to provide you information. The bin folder, that's your binary, right? This is
where you would execute the start of NiFi and those types of things. And we'll go into more of that
once we start. But the bin folder for NiFi contains both Windows batch files, BAT files,
as well as Linux shell scripts. So if you're running this on Linux, you have a way to start
NiFi, and if you're running this on Windows, you have a way to start NiFi. That's how you would
start NiFi, as well as some of those binaries for things like changing a username or password; you can utilize those.
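As a quick sketch of what that looks like from a terminal, assuming the standard nifi-1.26.0 layout (the username and password in the last line are just example values):

    $ cd nifi-1.26.0
    $ ./bin/nifi.sh start     # Linux/macOS: start NiFi in the background
    $ ./bin/nifi.sh status    # check whether it is up yet
    $ ./bin/nifi.sh stop      # shut it back down
    $ ./bin/nifi.sh set-single-user-credentials admin SomeLongPassword123   # reset the login later if needed

On Windows, the same bin folder has run-nifi.bat to start it from a command prompt.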
The conf directory, which we are going to go into, is where all the configuration for NiFi exists.
So all your properties: what IP address is NiFi running on, what port number is NiFi running on, those types of things.
So there is a lot of configuration. A lot of this is the security. So plugging in that security
infrastructure and those types of things, you would do the configuration here. So what I am
going to do though, and this is totally up to you if you want to, but if you go into the conf directory,
you should see nifi.properties. I am going to open that and go over some of the key points of
the properties just so for those that are sysadmins and others, you'll have this information. I know
for some of you who may not be that technical, this may be a little overwhelming. Again,
this is just for information. You are more than welcome to follow along, but there are some key
points that I feel like everyone needs to see as part of the property file. So anyway, so this is
your core property section. Again, a lot of this is documented. A lot of this relates back to the
website even. So what is the main flow configuration file? Where is that located? Of
course it is going to be in your conf directory. Where is the JSON file? It is also there.
You have archive enabled. Those types of things. So that is some of your core properties.
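Roughly, that core section looks like this in a stock install (values shown are the usual shipped defaults; yours may differ):

    $ grep 'nifi.flow.configuration' conf/nifi.properties | head -4
    nifi.flow.configuration.file=./conf/flow.xml.gz
    nifi.flow.configuration.json.file=./conf/flow.json.gz
    nifi.flow.configuration.archive.enabled=true
    nifi.flow.configuration.archive.dir=./conf/archive/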
Some of the other ones are your authorizers configuration file. This is where, as was mentioned
earlier, some of the other organizations are trying to work on getting NiFi not only installed
and up and running, but also the multi-tenancy, the multi-users. I think it was Brett who is working
on some of that. So Brett would go in here, he would configure these properties, he would take a
look at the authorizers.xml file, and he would start building in the configuration he would need
for security and user permissions and identity management and all that fun stuff. But that is
where you would find that.
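For reference, the pointers Brett would be starting from are in nifi.properties as well; a quick look at the two relevant lines (these are the stock values):

    $ grep -E 'nifi.authorizer.configuration.file|nifi.login.identity.provider.configuration.file' conf/nifi.properties
    nifi.authorizer.configuration.file=./conf/authorizers.xml
    nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml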
But there is one key property that you can just tell has come from the government, and that is nifi.ui.banner.text.
Now this property lives there for a couple of different reasons. But this banner,
as you can imagine, you could put unclassified. You could put secret. You could put top secret.
You could put you know, you could do whatever classification header you would need. So what that
does is it provides the government an easy way to put the classification of the system
on a banner. So when you pull up this nifi instance, you immediately see the classification of the
system. Also, because that is such a government-driven property, the way that commercial companies,
and others in the government as well, use it is this: this may be our dev instance of NiFi, this may be our test
instance, it may be prod. And so I know a lot of companies that use this banner as a description.
So you can quickly go to the UI and you will immediately see I am working on the test system
and so for me, I am actually going to put something in and say this is a test system.
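On my copy, that one-line change looks something like this (the value is whatever banner text you want; mine is just an example):

    $ grep 'nifi.ui.banner.text' conf/nifi.properties
    nifi.ui.banner.text=THIS IS A TEST SYSTEM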
I can put in whatever. Okay. And you know, again, you don't necessarily need to do this.
If you are following along, feel free to put in whatever you would like. This is your own personal
nifi instance, and you can go from there. Or you can just leave it blank. Some of the other
properties that you would need to potentially look at, if you are a sysadmin and such, will be
immediately available as soon as we start NiFi. And then, you know, like I said,
there is an extension directory. It is empty. This is where if we had a special
processor and you had a CI/CD process set up where, you know, a developer could create a
processor, check it in, build it, test it for vulnerabilities, and go through that
whole CI/CD and DevSecOps policy process that you have set up,
ultimately it will spit out a NAR file, and that NAR file could be automatically installed into
the extensions directory and you would have immediate access to that processor. Not only would
you have immediate access, but if the permissions and the policy was there, everyone would have
access to that same processor. So, you know, as part of some of the usability points I was
making earlier where you are able to reuse these components. So if I build a connector to say,
you know, let's go, it is already built, but let's talk SQL Server. If I build a connector
for SQL Server and test it out, and it has gone through the processes that, you know, you may
have set up and things like that, it gets deployed to that extensions directory. Well, now everyone
can use that connector. So as a different organization, I don't have to go and build a
new connector. I can just reuse one that was already built, but I may be connecting to a
different instance. I may be connecting to a, you know, different user names, passwords,
authentication methods in SQL Server. You know, it may be the same SQL Server is just pulling
from a different database or a different table. You know, those types of things. So, you know,
that extensions directory, you know, is pretty important here. That is how we hot load
processors. So that means we do not need to stop data from flowing. We do not need to turn data
flows off. We don't need to restart the application. It can run. Data can continuously flow through
the system. And now I have a newer capability, so I can connect to new data sources. So that's the
purpose of the extensions and lib directories.
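A rough sketch of that hot-load step; the NAR name and install path below are made up for illustration, but the autoload property is the real one that drives it:

    $ cp my-custom-processors-1.0.0.nar /opt/nifi-1.26.0/extensions/    # hypothetical NAR, copied in while NiFi is running
    $ grep 'nifi.nar.library.autoload.directory' conf/nifi.properties
    nifi.nar.library.autoload.directory=./extensions

NiFi watches that directory and picks up the new processors without a restart.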
Again, all of that is referenced in nifi.properties. You can change it. I've seen, you know, some
folks change it to a different lib directory depending on their policies, things like that. But,
you know, as a sys admin, this is the section to do that. And then, of course, you know, you need state. If you're
working in a clustered environment, you would need a zookeeper, which is another software
application that is open source. If you've been around any kind of distributed system,
clustered system, you've heard of zookeeper. Zookeeper is widely used, you know, across the
board with government and commercial alike. So, you know, here's where you would manage some of
that state management. We have a database directory, so that's our database repository.
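For the clustered case, the state management and ZooKeeper wiring also starts in nifi.properties; roughly, the standalone defaults look like this:

    $ grep 'nifi.state.management' conf/nifi.properties
    nifi.state.management.configuration.file=./conf/state-management.xml
    nifi.state.management.provider.local=local-provider
    nifi.state.management.provider.cluster=zk-provider
    nifi.state.management.embedded.zookeeper.start=false
    nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties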
Again, we have multiple repositories here, and where you store those is configurable. So, depending
on whether it's a repository that needs a lot of reads and writes, you may store it on a
faster drive and put the others on slower drives, instead of having to choose one or the other. Because this is so highly configurable,
it's there so you can do those types of things, you know, reducing your cloud costs, your server
resources, your own on-prem resources. I know you guys do a lot of stuff on prem. So, you know,
it may help reduce some of those resources. And that's the database settings. And there's a
flow file repository, the content repository we talked about. And, you know, one of the things
I like to point out here is that content repository is keeping basically your flow file. So if you
told it to ingest a CSV, that content repository is keeping a copy of that CSV, you know, for the
time being. That's because if NiFi were to crash and shut down, when you restarted it,
the processor that was processing that flow file is going to go back to the content repository
and say, give me back that file, I need to finish processing it. And say we've got three or four
processors chained together: we get a file and we send it to the next step and the next step.
Well, if something crashes before it gets to that next step where it completes, when NiFi comes
back up, it will reprocess that content, that file, based upon whatever processor was working on it, because
of that content repository. And, you know, just so you have a little bit of underneath the hood
workings of this, when a flow file or a piece of data, you know, goes to a processor, it will not
release that flow file and that data until the next processor has it. And so what that does is
it guarantees that a copy of that data is on the next processor doing that function
and that that next processor got a thumbs up from the previous processor that it was complete.
So that way, you know, if something crashes and things like that, you don't lose data. Now,
if it is in the middle of processing data and it crashes, it's going to try to reprocess that data.
So, you know, just keep that in mind where you may get some initial results from NiFi,
but you need some additional processing to happen. So you may get duplication of data because,
you know, it produced 25% of the output before it crashed, and when it comes back up,
it's going to try to redo that. And so, you know, you may get some duplication of data.
Now, with that being said, we do have ways of dealing with that as well. There's actually a
dedupe processor. I don't know if it's in this latest version, but I do know it's there. Because
duplicate data is a pretty big issue in my experience in all the years I've been with
the government. So, yeah, that is our content. Then you have all the provenance events; they have
their own repository. When we start NiFi, that new folder is going to be created, and that's where
the provenance events will go. So you can specify how much to keep. You know, Richard, I'm thinking
about you here, where you may have an overarching data governance plan and strategy. And so, you
know, your NiFi retains the last 14 days, and during those 14 days,
you're offloading all of that provenance information into a larger data governance system.
Informatica has one. You know, there's a couple of open source versions, things like
Apache Ranger, Apache Knox, and a few of those tools that
kind of work well with NiFi. So, you know, you may have a corporate-wide or unit-wide governance
policy. So that's where this would get configured. Right now it's configured to keep all the
provenance events for 30 days with a max storage size of 10 gig. So keep that in mind when you're
building and designing your system: if you have a ton of data coming through and those events are
being offloaded anyway, you don't need to keep 30 days' worth of data. You may only need to keep it
for a week or a day. You know, I've seen this configured
where it keeps it only for a couple of hours. Because as those events happen, all of the data
governance events is being offloaded to the, you know, the corporate-wide data governance system.
And so, you know, this is highly configurable, you know, for you sysadmins out there as you start
working through getting it installed and things like that. You know, pay attention to some of
these properties because, you know, this one, for instance, it'll take 30 days or 10 gig to fill up.
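Those two limits live right in the provenance section of nifi.properties; roughly, they look like this (matching the 30 days / 10 gig on my screen; adjust them to fit your own governance plan):

    $ grep 'nifi.provenance.repository.max.storage' conf/nifi.properties
    nifi.provenance.repository.max.storage.time=30 days
    nifi.provenance.repository.max.storage.size=10 GB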
And so, you know, you may want to adjust those settings. Again, you know, you see a lot of times
applications have settings that you can just like, you know, go to a menu and select the setting and
change it and those types of things. We have some of that in NiFi, but this is part of some of those
core settings. There is no UI for this. You know, that was, you know, one of the things that we went
over in the last training class was you're going to have to go in and edit these files. You're going
to have to put in, you know, different properties based upon your organization. I wish there was an
easier way to do this. I find that this way is not too bad. I find that setting up security and those
types of things, now that's the more difficult part. And the problem with that is, you know,
if you do run into an issue, you're asking the community, you're asking Google, you're, you know,
or you're emailing me and saying, hey, Josh, how do I do this? You'll get my contact information at
the end of the, you know, at the end of every day, I think it is. But I will be happy to answer any
quick questions after this training. Do remember, though, that I'm delivering training.
So I mentioned that with NiFi, up until recently, you could download it, you can
install it and have it up and running in a few minutes, but everybody in the world could access
it if it was on a public IP or something. So what they did is they went through and said, okay,
we are now going to secure every install. We're going to generate a username and password
that is unique to every install. So to find that information, you actually have to go into the
nifi-app.log file in the logs folder and look for 'username'. You're going to see, in that log file,
a generated username and a generated password. That is going to be our username and password
to log in. Yours is going to be different; it's a unique UUID that gets generated.
And so, you know, your username and password are going to be different.
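If you want to pull them out of the log in one pass, a quick sketch (the bracketed values are placeholders; yours will be different random strings):

    $ grep Generated logs/nifi-app.log
    ... Generated Username [0f3a9c2e-xxxx-xxxx-xxxx-xxxxxxxxxxxx]
    ... Generated Password [xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]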
I'm going through this right now, but we, as an exercise, you know, I'm going to have you all,
you know, basically do the same. What I like to do, because there's no way I can remember
that much information, is I like to copy it and I will actually put it in a new document.
Because that log file is going to go away, you know, as we process data, it rolls over to a new
log file. You know, there's a lot of information in that file. So I like to just pull out that
username and password, that initial username and password and have it readily available. So what I
did is I just created a new text file. I copied and pasted the username and password. And then
I'm going to save it as text and I'll just throw it in my downloads and leave it there.
Perfect. So now, you know, I've downloaded NiFi, I've extracted NiFi,
double clicked on run-nifi, it went through, it created everything it needed to
get up and running. And then, you know, it's up and running now. So it's just waiting on me
to log in. So what I like to do then is I'll bring up my browser. And, you know, I like to go,
if you remember, the IP address was 127.0.0.1, which is localhost, and we were on port 8443.
So HTTPS, because it's secure, and colon 8443. Now, you'll learn that you need to give it a minute.
Let me show you what happens. Let's just go to this one.
And like I said, the initial running of NiFi can take a few minutes. So if you
are following along and you're trying to do this and you're getting page not found, just give it a
little time. It also helps that I put in the right port:
8443. But again, you can put in the correct IP address, the correct port, and it still might not
load. On the last class, I noticed that even though NiFi would report it was running, it still
took three or four minutes to fully initialize. Again, we're working in a high latency virtual
desktop environment.
And so your own environment may be much better or different to allow that to run.
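For reference, the address we're browsing to comes straight from two properties in conf/nifi.properties; the stock secured defaults look roughly like this:

    $ grep -E 'nifi.web.https.host|nifi.web.https.port' conf/nifi.properties
    nifi.web.https.host=127.0.0.1
    nifi.web.https.port=8443

So the UI ends up at https://127.0.0.1:8443/nifi once NiFi has finished initializing.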
So anyway, so I'm at 127.0.0.1. It's going to come back and tell me my connection is not private. It's a
self-signed certificate, right? All this was set up just to add that username, password, security
layer. So what I like to do is I'll go advanced and I'll go ahead and proceed. And then I didn't
specify slash nifi, but it caught it. It's automatically going to redirect me. And now
I will be at the login canvas. So it's asking for a username and password.
I have it right here, luckily.
That's why I said, you know, copy and paste it when we get to that part. When we go through this
more hands-on, make sure you copy and paste it into something a little bit easier.
That log is going to go away. So tomorrow when we log in, if you did not copy and paste it somewhere,
you're going to have to find that old log and we're going to have to get it.
Enter the username, then the password, and log in.
Perfect. We are now back at the application. So this is the NiFi application. It is web-based.
There's a lot of buttons and a lot of things that we're going to go over every one of those.
But again, it's a web-based application. There's some server technologies under the hood that's
running this, Jetty and some other things. But it's all browser-based, mostly to work with the data
flows. But again, there's no point and click, you know, properties manager. So you've got to,
you know, hand edit that. You know, a lot of applications, you know, you're going to have to
edit the properties. But once you get it up and running, you shouldn't need to go back to the
log directory or in those other properties unless you like have a warning or an error that you need
to look at in the log directory. But if you're running this as a standalone, in your spare time
on your laptop, you know, even at work, you know, you probably don't need to go back and take a look
at those. But make sure you keep that username and password. So we're logged in. I can actually now
start building my data flows. But what I'm going to do is actually go back in my
presentation where we talked about some of the core components of NiFi. So we talked about
processors, connections, flow files, flow controller, all of these things that we talked about.
And let's take a look at them. Let's, you know, see more about what they are in NiFi.
So what I like to do is this: this is your canvas. This is a blank canvas. So you don't have any
processors running. You don't have, you know, any of the process groups. You don't have any data
flows or anything else. You know, you don't have any of that. So, you know, it's a blank canvas.
So this section up here, you can see the NiFi logo. You know, I want to point out there's my
banner that this is a test system. So I can put in capital letters, unclassified even, right? Or I
could put dev or test. And that property, when NiFi starts, it's going to read it and display it in the banner.
So anyway, so, you know, this is the main canvas.
The UI has multiple tools to create and manage, you know, your first data flow. So what this is,
is the component toolbar. So if you see, you know, you should see processor, you see input port,
output port, process group. If you just hover over them, remote process group, funnels,
templates, and labels. So, you know, with the last group, I did not mention the funnel on purpose,
and they were able to actually work it into their data flow. It was pretty understandable; they
just referenced the documentation. But anyways, this is your component toolbar. Now, right below
your component toolbar is the status bar. So, you know,
how many bytes are going in and out of the system, right? How many, how many processes are started?
How many are stopped? How many are disabled? You know, how many have a warning? You know,
all of these things. Now, the canvas itself only updates automatically every five minutes.
But you'll hear me say this a few times when we're building a hands-on data flow: go ahead and
refresh your canvas. When I say refresh your canvas, it doesn't mean, you know, go up here and
refresh from the browser. It means anywhere on this canvas, without clicking on any component,
you can right click and hit refresh, and it will refresh the stats. But anyway,
so that is your status bar. This is our operate palette, and we'll go more into that later.
A lot of times we have, you know, closed systems, we have one way transfers and things like that.
You have systems that don't ever touch the internet and they're on, you know, their own closed
network. So, you know, you may not have internet access, but you can still look up and understand,
you know, the properties and things like that. You know, here it is without ever having to go to
the internet. And then of course you have an About. So this
is version 1.26.0. It was built on May 3rd. It was tagged as release candidate one. And, you know,
the branch and everything else, you know, you can actually pull a lot of information. So,
you know, again, if you go to GitHub, for instance, let me search. Yep. Here is the GitHub repo
where all the source code is located. And so, you know, this is the main branch,
but, you know, you can go through and see all the different branches, release candidates,
those types of things. Here is the source code to all of that. Not only can you download it from
that link earlier, you can do a git clone if you are familiar with GitHub and Git. You can clone
this and build it yourself as well. You know, so, you know, just keep that in mind. Again,
it's very open. It's very well supported. There's a lot of documentation for it and things like that.
So, that is the help section. So, that is an overview of the Canvas and all of the components
on the Canvas. And so, before I start diving into, you know, some of the finer workings of NiFi,
I want to pause there. Is there any questions I can answer up until this point?
Well, hopefully, I'm teaching so well that it's very clear and understandable. I always worry
about my southern accent, you know, playing a part in this. But again, if you have a question,
feel free to interrupt me. Or if you need me to translate something or speak proper English,
Feel free to yell at me.
I got one quick question.
Yeah, go ahead, Tom.
I understand that when you run the command, I do see that,
and I've run this in a container before. So I've seen, during the execution,
you'll see the password in there. But I'm sorry, I missed the part where, let's just say,
you don't see it there. How do you get to it? Is it written in a log you can go and retrieve?
No, that's a great question. So, when we start NiFi, it's automatically going to create this
log called nifi-app.log, and that's where 99% of NiFi activity is written to
this log. And so, yes, on that first install, you're going to see, you know, generated username,
generated password, and it's only going to be in the logs. The problem with it, though...
No, I would just say, yep, I see it there.
Okay. You know, the problem with that is, is, you know, we're going to start doing some hands-on
exercises and some work here. And so, that log is going to roll over. And so, it's going to rename
this old log, give it, I think, a date at the end of the log, and it's going to start a fresh one.
And so, if you do not capture your username and password pretty quickly, it's going to be
in another log, and it could be in, you know, a log that was generated days ago, if you didn't,
you know, set everything up, right? Or, you know, you may go in and put a data flow in and run it.
It's generating all these log messages. And now, your username and password is sitting in a five-day
log file. So, that is where you initially get your username and password, but do know that it can go
away, especially if we're doing a lot of operations very quickly. Great question. And then, we are
going to, you know, go through installing and getting it up and running and all that fun stuff.
If you didn't follow along, you know, I like to kind of go ahead and show you what we're doing.
That's where we're going to get our username and password. You can find that log in the logs directory.
And it's nifi-app. Let's see if I can show it. Yeah, I haven't generated enough data for it to
roll over, but, you know, tomorrow, I bet there's going to be a nifi-app.log with
5-21-2024 on it, something like that. Okay, any other questions?
All right. So, let's, what I'm going to do is kind of go through more in-depth of the components
and, you know, go through some of those things. We will then take a break and go to lunch and come
back from lunch and, you know, get everyone else up and running and, you know, get your own version
of NiFi going so we can start building some data flows. You know, so that being said, you know, on
the components toolbar, the first thing I have is processors. So, I actually just click that
and hold it and drag it down. And here are all of my processors. So, you know, with this version
of NiFi, this install, I have 359 processors available. So, you know, I have processors to
handle Amazon, Azure, AWS, you know, tags, JSON, CSV, you know, all kinds of things. So, what you're
seeing here is just like a word cloud, you know, for all the processors and those types of things.
So, you also have, you know, a list of all your processors and the, you know, the description.
Just because it was asked the last time, you will see the shield, the little red and white shield
beside the processor. It's specifically called out because you can now create a policy and security
within NiFi that will allow you to lock down certain processors. You know, so for this one,
this processor references remote resources, so it falls within that 'reference remote resources' restricted group.
And so, because of that, you may set a policy that says, you know, my data engineers cannot,
you know, see these processors in this group because, you know, they're not needed
and, you know, for security reasons, you know, we're just not going to allow that.
Or you may have it where, you know, I have database admins that need access to this group that
contains the database connection details and those types of things to set it up. And but another group
doesn't have access to it, doesn't need it. They can just reference, you know, that processor
from a controlling service, right, a controller service. You know, so that is a reason for that
little shield. But anyway, so all of these are processors, 359 processors. And the thing I like to
use is this little tag cloud here. I want to see all processors with 'get' in the description,
right? And there should be my GetFile right here. So, you know, that's how you would select the
processor. So, what I like to do, though, is I already know the name of the processor. So,
I see it. It's highlighted. I say add. Boom. New processor on my canvas. So, this processor
is just the GetFile processor. It's got a single function: to pick files up
and retrieve them from the file system. You know, it's not trying to extract things. It's not,
you know, doing any kind of ETL. It's not a model or anything else. This processor is doing one
function and one function only, and it does it very well. And that's GetFile. Also, within a
processor, you can see again that little shield that belongs to a group that, you know, you can
imagine you may have a convert text processor, right? You know, from a security aspect, that's
a very low risk, you know, just because you're converting data that you already pulled in and
you're converting it to other formats and sending it out. But, you know, you're not, you know, this
convert text processor, for instance, doesn't have connection details. It can't get
a file. It can't put a file. It can't connect to a database or anything else like that. So, because
this one can actually get data, you know, there is a security group for it. You may want to, you know,
depending on your security policies, you may want to lock this down where, you know, folks can't do
a GetFile or a PutFile. You know, they can build in the logic of the data flows and everything else.
And they may get their data from another processor. And that way, you know, you don't run
the risk of, you know, someone doing a GetFile they shouldn't. We actually had this happen on the
last class with a couple of people where, during the exercise, we put in a GetFile. They specified
the same directory NiFi was installed in as the directory to get from. They told it to not keep
the source file, and they also told it to ingest everything. So what they did is they built a flow
that did a self-destruction. When they ran that flow, it went and grabbed everything in the
directory and then crashed, because, you know, it just couldn't work; it had consumed itself.
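To make that story concrete, these are the GetFile properties that were involved; the values shown are only illustrative:

    Input Directory        : /data/inbound     (do NOT point this at the NiFi install directory)
    File Filter            : [^\.].*           (the default pattern; narrow it if you only want certain files)
    Keep Source File       : false             (false deletes the file after pickup, which is the self-destruct part)
    Recurse Subdirectories : true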
And so, you know, there are some security thoughts that go into this as you're planning
this deployment out. But anyway, so that is our GetFile processor. You know, you can take a look
at it. It's going to give you some real quick information. How many bytes came in? How many
bytes read and write? How many bytes went out? And how many tasks and the time it took to execute
those tasks? All of this, again, is in the last five minutes. But if you hit refresh on the canvas,
so I click off of that processor and hit refresh, if data was flowing through, that would be updated.
And so, you know, that's how you would get a quick refresh of what's going on with that processor.
Now, every processor, you should be able to click on it. It will do a little black box around it to
highlight it and then right click on it and you have options. So the option that we will use most
is probably configure. So we can actually configure the processor. You know, there is a disable if
you want to disable it. You want to view the data for this specific processor. You can replay the
last event through the processor as well. You can view the status, the usage, the connections.
You can center it in view. You can change the color of that processor. So, you know, we're going to
get into, you know, some of this. But just for FYI, you know, the hands-on exercise, one of the
things I look for is some of these, like, you know, coloring, you know, labels, naming conventions,
you know, some of these types of things that are very nontechnical. But, you know, I look for those
just because of usability, ease of use and those types of things. So anyway, so that's my GetFile.
I have my configure, disable, provenance. I can group them. I can create a template. I can select
multiple processors and create a template. I can copy it and paste it. I can delete it.
But for this scenario, I want to say configure. So this is how I configure that specific processor.
You know, it has a name, GetFile. Now, you know, I don't like that GetFile name because, you know,
it doesn't tell me a whole lot. If I had a data engineer looking at my flow, you know, I want them
to be able to look at my flow, quickly understand what's going on and how this maps together. And
that way they can accomplish the task that they need to do. So what I like to do is I go into my
name, you know, during the configuration, and I'll say 'Get file from file system.'
So there we go. That is an easier, more human readable description of what this
processor is going to do. Also, you know, if there is a penalty or error or something else like
that, it will penalize the flow file, and the penalty duration is basically how long you want that
penalized. Right now it's set to the default of 30 seconds. After 30 seconds,
it's going to retry and reprocess that flow file. But, you know, 30 second penalty.
The bulletin level, you know, what kind of logging do we want from this processor? You know,
we may, you know, the bulletin level is set to warn. But if you want to log everything,
you may put it at debug. Most times you keep it at warn or error. And so what that means is,
if this processor has a warning or error, it is going to push that to the nifi-app.log,
you know, in that area. So, you know, if you're building a flow for your first time,
you may put debug, and that is going to log everything. You usually do not need that much
detail, but, you know, it's there in case you need it. We're going to break for lunch around one, in about 15, 20 minutes.
Okay. And then you have yield duration, just how long that that's going to yield
before it's scheduled to do it again. You know, so one second is pretty standard. Again,
you may change these settings when you start building your own data flows, more, you know,
real world. But most of the time, these properties all stay the same, except for the name,
you know, the name part of this. Scheduling, there's a couple of scheduling strategies.
There's timer driven and cron driven. You can set this; most all processors default
to timer driven, so it's running constantly. So you can actually set
a run schedule that says, hey, I want to run this processor every one second. I want to run
this processor every 10 minutes or 10 hours. You know, so what it will do is that scheduling
strategy is going to dictate, you know, the running of this processor. You may have a cron
schedule where this processor runs only between 10 PM and 11 PM with a run schedule of every one
minute. And so, you know, it's going to run 60 times during that hour.
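As a hypothetical example of that cron case, on the processor's Scheduling tab it might look like this (NiFi uses Quartz-style cron expressions):

    Scheduling Strategy : CRON driven
    Run Schedule        : 0 * 22 * * ?    (second 0 of every minute of hour 22, i.e. once a minute from 10:00 to 10:59 PM)
    Concurrent Tasks    : 1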
Then concurrent tasks is how many tasks run in parallel. So this processor is doing a GetFile, and it's running one task
to get that file. Now, one of the things that I had the class do last time is actually use GetFile
to pick up that 1.2, 1.5 gig zip file that NiFi came with and decompress it. And so we had
a few folks where the file got duplicated or they picked up everything. And so, you know,
what happened was it kind of slowed the system down. It was taking a while to pick things up
and send them off, because it was processing large amounts of data. If they wanted to make that
quicker, the run schedule was already running full speed, but they could have put the concurrent
tasks at five, giving five concurrent tasks to execute this. Properties. So this is the big one.
This is the configuration... Hopefully everyone had a great lunch, and we're all coming back.

