Apache NiFi - GROUP 1
For those of you on your desktop: if you can, bring up your desktop. You should see a blank canvas, a blank desktop like mine, with Microsoft Edge, Docker Desktop, Uploads, those types of things. One thing I want you to check: open your File Explorer and go to Downloads. There should be quite a few downloads there: MiNiFi, NiFi, the NiFi Registry, the NiFi Toolkit, those types of things. If you do not have those files, let me know and I can replicate them to your desktop. Cody, you have it. Aaron, Sean, you look good. Amanda, you should have it as well; I remember replicating your desktop. Pedro, are you able to see the files? If you look in your File Explorer and go to Downloads you should see a list of files. No worries. Again, I'm here on my ranch in central Texas, so Starlink decides whether I get data today.

Yep, I see your screen. Perfect. So the downloads all worked and you can get to them. What we'll do is go through the PowerPoint presentation a little more, then take a quick break; I like to do a 10-to-15-minute break, depending on how many questions we get. It's my understanding everyone is in Arizona. My lunch time is usually in about an hour, but for Arizona time we will try to go to lunch about 11:30 your time. I like to do about 45 minutes for lunch, but if you need an hour we can do that as well. If there aren't a lot of questions or a lot of interaction we usually end a little early; there is time built in for those interactions.

So everyone is logged into their desktop and able to see the files we're going to work with. When we get to this part, we're going to actually do an install of NiFi on the Windows operating system. If you were installing this on Linux it would be a little easier, but there is a ton of documentation. As you can imagine for a government product, the documentation for NiFi is very extensive; everything I'm teaching today is in the NiFi docs. You can follow along: if you go to nifi.apache.org you will see tons of documentation, what NiFi is, the core concepts, the architecture, some of the things we're going over. Even the architecture, for instance: you have the OS and the host, which in our case is Windows; for those that are technical, it's a Jetty web server serving up this UI; then you have the flow controller and the processors; and the FlowFile repository, content repository, and provenance repository we talked about sit on local storage. Now, "local storage" doesn't necessarily mean it's being stored to a local disk; it could be a NAS or some other type of network-attached storage. So if you want to follow along with the documentation, it's all there at nifi.apache.org. We're working off of the NiFi version 1 documentation, just because version 2 only recently came out.
We'll touch on some of that. What I like to work from is the Admin Guide or the User Guide; those are the two guides I work with, and when we get into some of the processors we'll actually go in and talk about them. For an open-source product, the documentation for NiFi is amazing. Usually we don't see this level of documentation, but as you can imagine, being a government product that was released to the open-source world, a lot of documentation had to be written before that release. The documentation is also built into NiFi itself, even down to the processors: when you're developing a processor, for those that have developed one before, you have a place to include a description as well as other documentation. We'll go through that, but I want to make sure your desktops are running and that, in your browser, you can pull up the NiFi User Guide and Admin Guide and follow along.

Okay. A FlowFile is the abstraction that represents a single piece of information, a data object, within a data flow. Take this case, and I'm using this example because I just implemented a large prototype for it: you have messages coming in. You may have a single message, and when it comes into NiFi, NiFi treats it as a FlowFile. That FlowFile is the message; it can be in any format and arrive over any kind of protocol. When NiFi receives it, it generates a FlowFile, and that FlowFile consists of two major components: the metadata and the data payload. The metadata is the attributes, and we'll go into more of that, how we're able to work with the data through its attributes. As soon as NiFi touches the data it gets assigned attributes: a date-time group of when it was noticed, what the source was, those types of things. When we use a processor that grabs data from an HTTP site, for instance, it will record what URL it grabbed the data from, the location, and so on. All of that is metadata. The metadata is separate from the actual data payload, but we're able to work with that metadata because you may want to route your data based on it, the source, for instance. As data comes in, if you see source X you may want to send it one way; if you see source Y you may want to send it another way; all of that decision can be made from the metadata. We can also look at the data itself, of course, but in a lot of cases we like to use those attributes, and we'll go into that when we're building flows. I'm very interactive and I like to do a lot of hands-on work, so when we start building data flows we'll basically repeat what we have here on the slide.

So: attributes are key-value pairs that store metadata about the data. They include basic information such as file name, size, and timestamps, plus any additional metadata added by processors.
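To make that two-part structure concrete, here is a minimal sketch of what one FlowFile looks like once GetFile or a similar source processor has picked it up. The uuid, filename, and path attributes are standard core attributes; the specific values and the extra entries are made up for illustration.

```
FlowFile
├── Attributes (metadata, key/value pairs)
│     uuid     = 4d3b7a2e-...            (assigned by NiFi on creation)
│     filename = sensor-2024-05-06.zip   (illustrative value)
│     path     = ./
│     ... plus anything a processor adds (source URL, timestamps, owner, etc.)
└── Content (the payload itself: text, JSON, a zip archive, audio, anything)
```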
Again, say you're using GetHTTP or GetFTP, which would get a file from an FTP server: it will add metadata such as the server name and IP address, some of the things it can capture. Content is the content of the FlowFile, the actual data carried by the file. Depending on the application it can be text, binary, or any other format. We have used it before to detect heart murmurs and work with heartbeat data: we would bring in audio recordings of the heart, filter and sort those, and use some additional processors to extrapolate that data and look for heart murmurs. Like I said, I've seen almost every type of data go through NiFi. So the content is what is processed or transformed by the processors. There are processors that handle attributes, but most of the processors work on the content; they actually work on that package of data.

Life cycle of a FlowFile: FlowFiles are created by source processors when data is ingested into NiFi. They are processed, and potentially split, merged, or transformed, as they move through the flow. FlowFiles are finally exported out of NiFi by a destination processor. So the life of a FlowFile, as you can imagine, is being ingested into the system, going through different operations, and at the end going to its destination. The final step is to push that FlowFile out to its final destination, record that in data provenance, and then drop the FlowFile.

Why FlowFiles are important: understanding the structure and lifecycle of FlowFiles is crucial because they are the backbone of the data flows in NiFi. One of the things I like to do is talk about the efficiency of a data flow: efficient management of FlowFiles ensures that data is processed reliably and efficiently, maintaining data integrity and traceability.

One thing I do at the end of this class is take back any questions I can't answer immediately. So if I pause for a second when you're asking questions, it's so I can write them down. At the end of the class I like to send out this presentation as well as the Q&A portion, so any questions I run down get answered and incorporated into the presentation. The class is over on Wednesday, so you'll have it for reference along with some of the training material we can leave behind.

I think I've covered some of the key concepts of NiFi in depth, but just in case: processors are the primary component within NiFi.
We'll talk about that a lot. There are different types of processors, tailored for different tasks. Just so you know, because I'm still part of that community and I know what's coming up and some of the nuances: when you download NiFi now, you still get roughly 300 processors out of the box. One of the biggest complaints is "I don't really need all these processors" or "I need my own processor," and the download is around one and a half gigabytes just for NiFi, and most of that space is actually the processors. So one thing to keep in mind is that as NiFi continues to release updates, they are going to bundle fewer processors, and you can go to other sources to pull the rest down. They will still be built and ready to go, and there will be some where you'll need to compile, build, and deploy the source code yourself. But for today we have all the processors we will need.

Custom processors we've already talked about. In my experience, what comes out of the box will work about 95% of the time. I do run into cases where we need a custom processor; I can think of a couple from this past implementation where we needed specialized connectors for some tools, as well as log systems, things like Graylog and others. Being able to interface with different applications is usually when we build a new processor. Depending on your needs, there are also models you can run in flight, image classification, image recognition models, things like that, as the data is coming through; depending on the output of that model you may filter, change direction, or send it to a different data flow. So there's a lot of capability in custom processors.

Connections are the links that route FlowFiles between processors; we'll go into that and talk about back pressure. They not only transfer data but also control data flow management, such as prioritization, back pressure, and load balancing. There are a few different policies within NiFi: you can do a FIFO method, first in first out, or some very advanced routing with a rules engine, for instance. We'll go into back pressure and what it does, as well as some of the load balancing. And to finish this off: enhancing data flow with connections. Connections can be configured with specific settings to manage how data moves through the system.
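As a rough sketch of what those per-connection settings look like when you open a connection's configuration dialog: the thresholds shown are the usual defaults in recent 1.x releases, and the prioritizer and load-balance names are the stock options, but treat the exact values and layout as illustrative.

```
Connection configuration (set per queue, in the UI):
  Back Pressure Object Threshold    : 10000     (default)
  Back Pressure Data Size Threshold : 1 GB      (default)
  Prioritizers (drag into "Selected"):
    FirstInFirstOutPrioritizer       - oldest arrival on this queue leaves first
    OldestFlowFileFirstPrioritizer   - oldest FlowFile in the dataflow first
    NewestFlowFileFirstPrioritizer   - newest FlowFile first
    PriorityAttributePrioritizer     - order by a "priority" attribute you set
  Load Balance Strategy : Do not load balance | Round robin |
                          Partition by attribute | Single node
```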
You may have a use case where you need one piece of data to arrive at a processor before another packet of data arrives; you can set that up. You may have a data flow whose data you want to take priority in processing, while other data flows are a lower priority; you can set that up as well. There's a lot of capability and a lot of customization here, and again, that's the power of NiFi. When we get down to the design principles and how to do things, we'll see this even in this class, in the tasks where we each build a flow and how they will differ.

We're getting close to being done with the presentation; we'll go on break, and when we get back from break we'll work on getting NiFi up and running. But first, templates. Templates are data flows that can be saved and reused. You see this quite a bit, although most organizations have moved away from templates and toward version control, as you can imagine, because you can integrate that into your CI/CD process. You can't work with templates the way you can with a flow backed up into Git, GitLab, or GitHub, or the NiFi Registry, which we will also go over. But you can create a template, and I like creating templates sometimes because I don't have to worry about the GitLab or GitHub connection: I can go to the canvas, build my flow, save it as a template, and send it to my colleague. My colleague can quickly import that template, the flow will be up and running on their canvas, and they can go from there. So templates are still useful, but lately it's more and more about version control.

Templates encapsulate a set of processors, connections, and controller services for a specific task or workflow. They simplify the deployment of common patterns and promote best practices by allowing users to deploy tested flows quickly. If you develop a flow, you can save it as a template, export it as an XML file, send it to your colleague, and they should be able to get that flow up and running quickly. So that's templates.

NiFi does integrate with NiFi Registry, which we will go over, and which supports versioning of data flows. Version control is crucial for managing changes to data flows over time, allowing users to track modifications, revert to previous versions, and ensure that deployments across different environments are consistent. Those are the key pieces we will be working off of. I know we have a few folks, I've written them down, that this will be interesting to: some sysadmins and those folks. So the main thing here is that we'll touch on templates, and we'll probably save a template, but version control will be our main avenue for saving flows.
We'll go into using NiFi Registry for version control. NiFi Registry allows for storing, retrieving, and managing versioned flows. When we get to the NiFi desktop and we have Registry up and running, you're going to be able to save your flows and check them in and check them out, and then we'll talk about how you can version-control those from Registry into your own GitLab environment. If someone wants to let me know what your environment looks like, I can focus on that, but we can work with a lot of different version control systems.

Okay, let me check the chat. What I'd like to do is pause here before we go for a quick break. What challenges do you anticipate in implementing NiFi or migrating it into your current workflow? I'd like to hear from the group on some of the challenges you may have; like I said, that helps me tailor the conversation and what we train on. Feel free for someone to just start talking.

"The unknown. And it still seems to be single-user mode no matter what option we select when deploying it from a container, so I'm wondering about that."

It is, but we will touch on that, and I can understand that pain point as well. Those are all good things, and I really like "the fear of the unknown," because once you see how easy it is to operate and to get started, and I think it's pretty quick to get up and running, it then becomes almost dangerous because of all the capabilities and options you have; it can get a little overwhelming. Those are things I will definitely touch on, including multi-tenancy. It's not necessarily in this class, but what I will do is take that back and work it in for tomorrow or Wednesday, to go over what that would look like. We do have Docker Desktop on all of our VMs, so we can touch on that and see how it works. And then we can definitely hit security aspects all day long.

I ask because I'm trying to get a better understanding of the data governance requirements you may have. There are big data governance packages out there; do you have those types of requirements? It helps me tailor this to what you can expect. Does anybody want to speak on their data governance practices and how NiFi fits into them?

"One of the ideas I had in my head was getting event logs over to NiFi. For the central log server SRG, they want to make sure the data has not been modified, so if we went down a road like that, data governance could help in that aspect. But for the test community, it doesn't have to be answered by that."
That's a good point. That chain of custody: being able to see whether the data was manipulated, with the security aspect behind it. That's why telecoms are using this, because of some of those capabilities: digesting all of that and associating it to a particular message.

"It's not easy to do with our syslog and things like that, so maybe this is something we could look into at some point that could help us close that gap."

Okay. Anybody else on data governance? Are there any specific processes in your operation that could immediately benefit from NiFi's capabilities? I know that's broad, but I'd like to hear from the audience: where do you see NiFi fitting in, and how can it help you do that data orchestration? I think Amanda's taking a break.

That actually touches on something to kick this off: we often see a single Python script running just to do one thing, and it seems small, and you're putting this big project in front of it. But really understanding those data sources and getting those data sources in matters. Did you say Access, like Microsoft Access? Yes? Next year I'm telling you to use Excel. So keeping a record of all of that is definitely needed; I think it will help you with some of your compliance issues, it will help automate that, and there are a lot of rules and triggers you can build in. Perfect.

What other immediate benefits do you hope to get from NiFi?

"We're working on a longer-term process, not really immediate, but it's our only use case right now: essentially a real-time data streaming pipeline from one of our test sites to do verification on data, running it through machine learning models to identify things like bad sensor data and doing data verification while it's coming through the pipeline, so we can facilitate an automated QA on the data."

Oh, really nice. Okay. I've also heard, and this may be related to the real-time pipeline, that you're trying to get data from a TAK, or get data to a TAK, and smartly filter it. One of the earlier questions was how you get MiNiFi to pull that data in and send it to your TAK, and how the TAK can filter it, what that architecture looks like. I took note of that previously, but I think it's still valid. Yes? Beautiful. So where do you plan to run MiNiFi? Is it an edge device running Linux, a Windows laptop? Can you go into details?

"There could be future use cases on some more restricted instrumentation, possibly a microcontroller."

Okay, all right. And then: how might you use NiFi's scalability and flexibility to improve data handling and processing in future projects?
I ask this question because I think it was Sean, Amanda, and Aaron, the sysadmins, looking at deploying this in a multi-tenant, scalable fashion. So how do you plan to use NiFi for this? That will help me tailor the conversation when we get into scalability and flexibility.

"I can say that we're designing it this way to account for the fact that there's a lot of data, and also at WPG we're expecting a lot of people, once they see the platform, to want to use it. So we're trying to design it up front to be scalable, but I would say our use case doesn't really need that yet."

Okay. And you have the scalability question of several different locations that will be creating a lot of data during the day.

"At least for this initial project it's just going to be one site, and then it's going to expand out to multiple sites for a single mission area, and then it might move out to more sites. So scalability initially isn't going to be extremely important, but as it goes on there are probably going to be quite a few workflows. One use case I think we might have in the future: we have a data lake that's going to be in the cloud, but there's a lot of talk from our chief data officer about having an on-prem data lake and getting the test data to both places."

And what storage, what database and storage solution, are you looking at? The beauty of NiFi is that it has an S3 processor, it has an Azure Blob Storage processor, those types of things. You said you're using MinIO? So there are processors for both of them; perfect. The MinIO one doesn't come out of the box yet, I don't think, but it is available as a processor on GitHub. I've actually seen MinIO quite a bit lately: folks coming out of the cloud and keeping things local for security reasons, compliance reasons, and just overall process. Exactly. I'll make sure to touch on some of those things as we go through and start building flows; we could potentially even, on the third day, do a flow where we pick data up and put some of it into MinIO.
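A side note on the MinIO point: because MinIO speaks the S3 API, a common approach is simply pointing the stock AWS S3 processors at the MinIO endpoint rather than waiting on a dedicated processor. The sketch below uses PutS3Object property names roughly as they appear in recent 1.x builds; the endpoint, bucket, and key values are made up for illustration, and exact property labels can shift between versions.

```
PutS3Object (stock AWS processor) aimed at a MinIO endpoint -- illustrative values:
  Bucket                : test-data
  Object Key            : ${filename}
  Endpoint Override URL : https://minio.internal.example:9000
  Region                : us-east-1      (MinIO largely ignores this, but the property must be set)
  Credentials           : Access Key / Secret Key, or an AWS credentials controller service
```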
That being said, let's take our first break. I need to get water since I'm talking a lot and I want to keep my voice throughout the day. Let's take a 15-minute restroom-and-water break and meet back here at 11:50 my time, which I think is 9:50 your time, and then we'll go through installing NiFi on Windows and start working on building our first flow. So we'll see everybody back here in about 15 minutes, and if you need anything just put it in the chat; I'll be running back and forth getting water. All right, see you in 15.

...And then we are going to get started on installing NiFi. I don't know if everyone is back yet. Being a former soldier, having been in the Army for so many years, I completely understand the nuances, so we'll give it a couple more minutes. Usually you wouldn't install the software by hand like this, but I felt it was pretty critical for us to actually do an install within Windows, just so everyone has that experience. If you're going to be working with NiFi, even in a local environment, who knows, you may want to spin up your own instance on your own laptop, get it working, get your flow built, test some things out, save it as a template, and then export that to your dev environment or your test environment. When we're installing NiFi, there are some key things to look at, because there are specific directories being created, and there's a reason behind that; there are specific directories you will need to understand and learn about as well. That's one of the reasons I like to go in depth here, and I'm taking a risk because I don't have it installed yet; we're going to do it all together. I do have Java on everyone's machine, so we'll go through the basics. We'll give it just another minute and get started.

If you're back, can you let me know how long you all think you need for lunch? Like I said, around 45 minutes is what I like to go off of, but I can do an hour as well, no problem. Okay, we'll do 45. That will give you time to eat and also to play around with whatever we've already built, because you're going to have this desktop environment throughout the training. You have the capability to download any information you have there, and you can upload the presentation as well so you have it on the desktop environment. So there's a lot of capability, but we will go ahead and get started. Let me exit all this.

Okay, so if everyone can go ahead and start working off your desktop. I'm sharing my screen, but let's go ahead and get logged into the desktop environment. Let me pull everyone up... looks like everyone is good to go.
The address is https://127.0.0.1:8443/nifi. Okay, thank you. Randy, you can click Advanced and click Proceed. I want to bring Pedro's screen up so I can catch up on where he is. Awesome, everybody looks good except for the separate screens. "Yeah, we use Menlo Security, so it's got that isolation thing." Yes, no worries; we'll get that sorted in just a minute.

While we get Pedro up and running: what you're looking at is the login screen. Depending on your identity provider and some of your security and core settings, this may be single sign-on; it's configurable to use CAC cards and certificates as well. There is a ton of capability here. Again, I believe you all picked the right product for what you're trying to do, just because of the rich military history and the familiarity this product has with integrating into DoD components.

One of the things we're going to accomplish before lunch is just getting NiFi installed and up and running. I mentioned some of the history of NiFi and why they put single-user credentials in place as a default, so we're going to go over how to create your username and password for this single instance. We all know why: a lot of folks would install NiFi and leave it open to the world, and it's a very easy Google search to find those instances, so the project started securing it pretty quickly.

It also looks like when I created everybody's desktop environment it retained the browser tabs that were open to NiFi's documentation. Those are great links to keep, but keep in mind that if you install this locally you will have that documentation as well. In the past I've seen people trying to save it as a PDF or copy it; I don't think there's any need. There may be certain sections you want to highlight and take note of, but for the most part this documentation is widely available and you shouldn't need to download it.

Give me one minute, because I don't want to go any further or skip this step; it's a critical one. Pedro, just let me know when you're ready and I can give you the address again. While we wait for Pedro, if you want to play around: you can go into your logs directory and open nifi-app.log. It will tell you how long NiFi took to load, where the web server is listening (because it's binding to localhost, it's going to be either localhost or 127.0.0.1), some of the services and controllers it started, and the generated username and generated password. We can change that username and password, and I will show you how, but that is the current username and password for this system. So while we wait on him, go into that directory, open nifi-app.log (it should open in Notepad), scroll about 85% of the way down, and we can pull our username and password. Luckily, once we log in we should not need to log in again until you shut the machine down. And actually, that's a good point: take your generated username and password and copy them somewhere.
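For reference, here is roughly what that looks like on a default single-user install. The URL matches what we just typed, and the credentials themselves are only printed into logs/nifi-app.log on the first start. The command shown for replacing them is the one in the Admin Guide; it is the Linux/macOS shell-script form, so treat the exact Windows invocation as something to confirm for your version.

```
# Default local UI address (single-user mode binds to localhost):
https://127.0.0.1:8443/nifi

# The generated username/password are written once into logs/nifi-app.log
# on first startup. To replace them with credentials you choose
# (run from the NiFi home directory; Windows script naming may differ):
bin/nifi.sh set-single-user-credentials <username> <password>
```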
Go into your directory and just create a text file. And yes, you would never do this on a full-blown system, right? You would never save a username and password from a log. The reason the credentials end up there is that the log directory is created after NiFi is installed, and the permissions are set when it's created.

You clicked on the bat file? Okay, perfect, and it's running. Go back to your browser. Actually, do you mind closing that Edge browser and opening Google Chrome? Chrome should be available on your desktop. There you go, pull up Chrome... perfect. Don't click that; go into bin and run the file. It should start rather quickly. Perfect, now let's go back to your browser and refresh.

All of our machines are exactly the same, so it's strange that yours is different. Just close that tab, open a new tab, and let's type it out together: HTTPS, colon, slash, slash... You didn't change anything in your properties, so let's try another quick fix: bring up that command prompt again, Ctrl+C so we can terminate it. This is the first time I've ever seen this, where one desktop behaves differently when they were all made from the same image. So terminate it, go back to your file folder, exit that one, go back to your Downloads, and then right-click on the nifi-1.25.0-bin folder and say Delete. No, you want to go back to Downloads, right-click, and delete that folder... not that one... the top one, yeah, right there. Delete that. We're going to have you re-extract it, which puts you back at the base default.

A good point just came up: it's recycling about 8,000 items over two gigs. As a sysadmin, keep that in mind; NiFi is not lightweight the way MiNiFi is. So go to that NiFi bin archive, right-click, and say Extract All. Perfect, say Extract, and show the extracted files.

Yes, there is a pop-up banner you can configure; I'll show you, because I think I'm the only one who changed it, but if you changed it that's fine too. I put "unclassified" on mine, but I could put my name on it if I wanted to. So go into that folder. We know this is a fresh install because we don't have any of the other directories that get created on startup. Go to bin and run NiFi. It's going to take a minute, just because it's the first startup... actually, that was really quick. If you go back a directory, you can now see you have logs, run, work, some of these other directories. So try refreshing your browser and see if it pulls it up.

Hey Maria, are you still on the call? "Yes, I'm here." Maria is the expert at these, although I created all of these desktops myself. Pedro, I'm going to continue until we go to lunch; you can follow along on my screen, and while we're at lunch I'm going to reset your machine to exactly like mine and see if that fixes it. This is really weird because everybody's was created from the same image. During lunch, probably quicker than trying to diagnose it, I'll just reset it and it'll look exactly like mine and everyone else's. Okay, that's good to know though; this is a first, and I didn't do anything. Perfect, perfect. Well, it started working, so perfect.
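To recap the clean-install sequence we just walked through, here is a rough sketch of the Windows steps and the runtime directories that appear after the first start. The version number in the folder name is just the one on these desktops, and the exact directory list can vary with how the repositories are configured in nifi.properties.

```
rem From the folder extracted out of the downloaded zip (version illustrative):
cd nifi-1.25.0\bin
run-nifi.bat

rem After the first startup, the parent folder gains runtime directories such as:
rem   logs\    run\    work\    state\
rem   content_repository\    flowfile_repository\
rem   provenance_repository\    database_repository\
```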
Pedro, if you can, just go into your logs directory, open nifi-app.log, and go about 80% of the way down. If you want, copy that whole credentials block into a new text document, or just copy and paste it somewhere, and you should be good to go as long as you don't close your browser; even then, I think it will hold for a while.

I'll give it another minute; everyone should be looking at the top-level NiFi canvas, and mine and yours should look essentially the same. So what you're looking at here is the NiFi canvas: you have all your components, and a lot of the services and things like that, right here. The first thing we're looking at is the user interface. We went over some of this: NiFi offers a web-based user interface, a seamless experience for designing, monitoring, and controlling data flows, and it allows you to visually manage data flows in real time.

Within this interface, the top-left box, your first component, is the processor. We went over processors: a processor does a single task, for example picking up data. If you click Processor and drag it down, there are 359 processors that we can search. You all use Azure, so you may need a lot of the Azure processors: you may need to consume an Azure event, you may need to delete an Azure blob, those types of things. So there's a lot of capability there, and you can search; there is a word cloud of tags as well. If I filter, I can see all the processors that are tagged AWS, for instance. Now, when you're developing custom processors, you have the capability to put in your version number, your tags, your description, as well as a full description of exactly what that processor does; keep that in mind when you're building custom processors. So right now, these are the 359 NiFi processors we have, and we will work off of them when building our sample data flows.

Now you know where a processor is. Before we configure it, you have options: when you right-click, you can configure the processor, disable it, view the data provenance for that specific processor, and replay the last event for it, all part of the data provenance that tracks data from its origin to its destination. You can view the status history, view the usage (documentation) of the processor, and view the connections downstream. Then you can do things like centering it in your view, changing the color, grouping, creating a template from the processor, copying it, and deleting it. There are a lot of options here, and we will go through most of them over the next three days.

With that being said, right-click and say Configure, and we will go into what a processor is. I can see we've got most folks there; again, just go to the processor, right-click on it, and click Configure.
So we've dragged our processor icon onto the canvas and selected our processor. I know there's a lot of options, but we picked the GetFile processor, and this processor pulls data from the local disk into the instance. The instance is running here on the VM, and we could use this processor, for instance, to get all the zip files in Downloads; I'm just coming up with an example, but it's getting files from Downloads. If I were to take a look at this processor, I would instantly understand that this GetFile is getting files from the Downloads directory, and that it's enabled.

Every time you create a new processor it gets a new UUID, so this processor's ID is this UUID. You don't have to memorize it or mess with it at all, but when you start looking through some of the data provenance and the lineage, you're going to see it. There's also the bulletin level for logging, to let me know if there are any problems with that processor. So that is the Settings tab of a processor. The next tab is the Scheduling of a processor; there are a lot of different settings there.

Properties, if you click on the Properties tab, is probably the biggest configuration you will do on a processor. If a property is bolded, it's required: you can see the Input Directory is bolded and the File Filter is bolded, but the Path Filter is not. If you remember, the processor itself complained that the Input Directory was not set and that it didn't have any connection to send the data to. The reason it wasn't set is that, out of the box, the Input Directory is not filled in. Here is where you would put it: on Linux you'd put the Linux file path, and on Windows we'll put the Windows file path; when we're building a flow we'll add more detail to this. That's the directory it's going to go get the files from.

Some of the other things: you can filter the files as they're picked up. The File Filter takes a regex pattern, for those that are familiar with regex. The Path Filter is not required, but if you want to set a regex pattern to filter the path, have at it; the capability is there. The Batch Size is how many files you want to process at a time; the default is 10, which is a reasonable configuration for a single instance. We'll get to Expression Language support and sensitive properties later; basically, a sensitive property is a username, a password, a hash, a property that needs to be protected.

Then there's Keep Source File. This is one where I wish the default were different, and I have made the recommendation numerous times to the Apache NiFi community: it defaults to false, so whatever file it picks up, it does not keep the source; it deletes the source file. I like to change that to true, and when we go through and build a flow, we will keep it as true. NiFi is not going to pick that file up again unless the file has changed, unless its hash has changed.
There is a state kept within NiFi; it knows whether it has picked a file up or not. But I like to set Keep Source File to true when I'm building a flow from scratch, just because I don't want it to accidentally delete all of my files, especially if I have it picking files up and sending them somewhere else, until I can test that out.

Recurse Subdirectories: in the Downloads folder, for instance, we have the NiFi folder, and inside that NiFi folder is another folder, so if we kept Recurse Subdirectories on, it would go through and pick up everything in the Downloads folder, not just the parent directory. Polling Interval, just like it says, indicates how long to wait before performing a directory listing: zero seconds means it polls constantly, and usually that's what I see, but if you know you only get a file into a directory once an hour, you may want to change that so it isn't polling constantly. You can ignore hidden files and folders. The Minimum File Age and Maximum File Age are not bolded, so they're not required, and the same goes for the minimum and maximum file size. For this one you don't need to worry about filling everything in, but I'm going to go ahead and put my file path in there, because I need it so I can show you what the next processor looks like.

Okay, so I have my main property set. Now when I go back to my yellow yield sign, it tells me the only thing left to resolve is the relationship "success". So when it comes to the properties of configuring this processor, I've answered the call. The Relationships tab: a data flow has to terminate; a relationship has to either be routed somewhere, retried, or automatically terminated. What that means is: what do you want this processor to do after it has handed the file off? In older versions of NiFi we would have to route every processor to a terminating processor just to get it to stop. Luckily they fixed that, so it's less cumbersome, and you can tell it to automatically terminate or retry, because it may not be able to push that FlowFile to the next processor. We'll go into more of that in a second.

And then comments: I highly encourage commenting, especially in a production environment. It gives everyone the ability to see why a processor was added and who added it, just some general comments. It's something I would add in a production environment. I do see a lot of cases where there are no comments because it's documented elsewhere, but the comments section is there to provide space for additional documentation.

All right, so that is GetFile. We're not going to go through every single processor like this, that would turn into a two-month training, but that is an overview of your typical processor and some of the settings and properties you'll see. With that being said, I'm going to pause and see if anybody has their hand up: any questions on a processor, any questions on the canvas? We're going to go into more details of the canvas, but I just want to pause there for a second. Any questions?
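Pulling the GetFile walkthrough together, here is a rough summary of the properties we just looked at and where each setting landed for our test. The directory path is only an example, and the defaults shown are the out-of-the-box values as I understand them for the 1.x line, so double-check them against your own install.

```
GetFile -> Properties (bolded = required; path is illustrative):
  Input Directory        : C:\Users\student\Downloads\test
  File Filter            : [^\.].*     (regex; default matches non-hidden names)
  Path Filter            : (unset)     (optional regex applied to subdirectories)
  Batch Size             : 10
  Keep Source File       : true        (default is false -- false deletes the file on pickup)
  Recurse Subdirectories : true
  Polling Interval       : 0 sec       (poll constantly; raise for slow-arriving data)
  Ignore Hidden Files    : true
  Minimum/Maximum File Age, Minimum/Maximum File Size : left at defaults
```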
Perfect. Well, we all understand processors now; good deal. If you have any questions, seriously, let me know. It's almost 11:30 your time, so we'll go through a little more and then take a lunch break.

So now we're starting to build our flow. Our first step is that we need to get data, and getting files from Downloads gets us that data. What am I going to do with the files in Downloads? There are zip files, all kinds of things. Say that folder was just full of zip files; what I would probably do is detect the type of each file I pick up. It's 11:30, so let's go about 15 more minutes and then break for lunch, if that works for everyone.

Anyway: I'm getting a file from Downloads, picking it up, and I've now made a connection to IdentifyMimeType. My thought process is: I've got this folder with thousands of zip files (a directory can't realistically hold millions) that I need to handle. So I'm going to pick those up, identify which type of file they are, decompress them, and from there I may do some sorting and filtering. I've got my connection made; the processor that's getting the files is stopped; and the data goes to IdentifyMimeType next. If I look at its relationships, it tells me it needs somewhere to go: that relationship "success" needs a destination. In this use case, where I'm worried about zip files and decompressing them, I want UnpackContent, so I'm going to add that one, and for simplicity I'll put it here.

IdentifyMimeType doesn't have any required properties; it works from the file itself. I can put in some additional settings, change the name, adjust the scheduling things we went over. On the Relationships tab I could automatically terminate on success, but I don't want to do that: if it successfully identifies the MIME type, I don't want to just kill the FlowFile and be done, I want to go on and decompress it. Retry gets set up quite a bit on bigger, complicated data flows, because you may not have the downstream service available yet, you may have latency issues, or the flow may need additional information; in those cases you want it to just retry. In this scenario I'm not going to have it retry, and I do not want it to terminate when it identifies the MIME type: I want it to pass the MIME type and the file to my next processor. So for this one, to meet the requirements, I just need that relationship "success" connected, and once it's checked and added, this should turn red (the red square just means it's valid and stopped). Perfect.
So my goal is picking files up, identifying what type of compression was used, and sending that to UnpackContent. If we look at this one, relationship "success" is currently invalid: we need a success, we need a relationship "failure" in case it cannot unpack the content, and we have a relationship "original" for the original zip file: what do we want to do with it after we decompress it? Those are the things we need to think about.

Let's look at the relationships and properties of this processor. For the relationships, we have three. The first two processors only had success (IdentifyMimeType succeeds even when it can't identify the type, and we could tell it to retry), but for this one we need to either terminate on failure, terminate the original, or send the success somewhere. The success of this one is the unpacked FlowFiles: if I have a zip file with a thousand files in it, it's going to unpack that and put a thousand FlowFiles on that relationship, and the processor needs somewhere to put those thousand files it just unzipped, because you don't want a lot of files stuck in a queue in your data flow.

For UnpackContent's properties, the Packaging Format is set to use the mime.type attribute, which is great. So it's going to unpack depending on the type of file it is, and in this scenario I think most of these are zip files, so it's going to say, "this is a zip file, and because it's a zip file we know how to decompress it." I'll hit Apply.

And now, I usually don't try to run this live, but I want to put one of these through. I don't want to pick up all the files in Downloads, so I'm going to create a new folder called "test", and I want something small, so I'll copy MiNiFi, which is fairly small, into test. Then I'll change my GetFile Input Directory to that test folder, and I want to keep the source file; I don't want to delete it. All right, you can see we have our GetFile; what's it complaining about? Let me make sure the directory is right. It will check your directory structure to make sure everything is correct. Okay, perfect.

So now we've built this flow: it gets files from the test folder, identifies the MIME type, and unpacks the content, and I've got these chained together with a couple of connections. I'm still working on the flow, but I want to test it and see whether it works. One of the nice things is that you can start the processor, but if I were to start it, it's going to pick up everything in that folder, and I just want to test. So one of the cool capabilities is that, once it's in a stopped state, you can tell it to Run Once. I want to do that: run it one time and see if it picks up the one file in the test directory, the minifi-1.25-bin zip file. Remember, the UI refreshes every few seconds, and you can also right-click on the canvas and say Refresh at any time, but the automatic refresh already happened, and it picked the file up: there was one file in there, and it got picked up.
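Here is a compact sketch of the little test flow as it stands at this point, with how each relationship is handled. It's just a summary of what we built on the canvas, not a prescribed pattern; the directory path is the test folder from a moment ago.

```
GetFile (Input Directory = ...\Downloads\test, Keep Source File = true)
    | success
    v
IdentifyMimeType        (no required properties; adds mime.type / mime.extension)
    | success
    v
UnpackContent (Packaging Format = use mime.type attribute)
    | success  -> one FlowFile per unpacked entry; route to the next processor
    | failure  -> couldn't unpack; route somewhere or auto-terminate
    | original -> the archive itself; auto-terminate if you don't need it downstream
```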
It's 243.52 MB. I can even list the queue, the queue of files on that connection. Here's the unique ID of the file, which again ties back to data provenance; here's the file size; whether it was penalized; I can view the content, I can download the content if I have permissions, and I can look at the provenance. Notice this screen wasn't there before we ran the flow. Now that we have the List Queue option, I can go in and see: was that the file I was expecting to be picked up? Is it the file size I was expecting?

But not only that. We talked about attributes and how a FlowFile is basically the data you're bringing in. I haven't even looked at the file, I haven't unzipped it, I haven't done anything with it except pick it up, and I already have a lot of metadata associated with it. Because of that, I can put in other processors that sort or filter based on an attribute; I don't even have to decompress this yet, and I can already work with the data. I could send all data that has an access time of X somewhere else; if the data comes from one folder, send it somewhere else. You will see setups where a lot of different processes are putting data into different directories, and your job is to get all of that data, so depending on the file path you may filter or sort on that. You may also filter or sort on the content of the file: once we unzip it, we may read the contents and send it somewhere else. But there is a lot of power in just looking at the metadata that's automatically created when NiFi ingests the data.

You can pull up the details, and you can view the file; the catch is that NiFi does not have a viewer for zip. If it were a JSON document, for instance, you could look at it, view the hex or the formatted version, but remember, it is a zip file, so there's not a lot to view. I can download this file if permissions are set up appropriately, and we all have root access on this instance, so you can download the file and view it while it's sitting on that connection: did it come out of that processor exactly like you think it should? That's where you would use these tools. Was it penalized? Any other details? All of these details and attributes are available as metadata, and we'll go more into that.

Cody, go ahead; I can see when someone raises their hand. "Yeah, where is the GetFile processor pulling from? The specific files we have up right now?" So what I did is tell the GetFile processor to run one time, and luckily there's only one file in that folder, so it picked that file up and immediately sent it to the next processor. GetFile is a simple function: one task, and that task is to get a file and send it off. What happened is that the next processor, IdentifyMimeType, is stopped, so data queues up on the connection between GetFile and IdentifyMimeType, and because of that I can look at my queue and see all of the data queued up.
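Since the point here is that you can act on the metadata before ever opening the content, a small hedged example: these are the kinds of NiFi Expression Language predicates you might drop into a RouteOnAttribute processor. The route names on the left are made up; filename and absolute.path are attributes GetFile writes, and mime.type only exists once IdentifyMimeType has run.

```
RouteOnAttribute -- each user-defined property becomes a routing relationship:
  zips_only      : ${filename:endsWith('.zip')}
  from_downloads : ${absolute.path:contains('Downloads')}
  zip_mime       : ${mime.type:equals('application/zip')}
```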
I can look at the attributes and the details there, and I can download and view it. If there had been a hundred files sitting in that queue, I could go through them the same way: look at them, look at the attributes, download them. Did that answer your question? Okay, no worries.

Now that we have NiFi up and running and we're beginning to build our first flow (this flow we're building here isn't really a hands-on exercise; it's mainly for going over what a processor is and some of the connections and properties), I think we're at a good point to pause and go have lunch. It's almost two o'clock my time, so what if we are back at 12:35 your time, 2:35 my time? Hopefully that will be enough time. Does that sound good to everybody? It also gives us a minute to ask any questions before you go to lunch. Okay, all right, let's go ahead and break for lunch. I will see everyone back at 2:35 my time, 12:35 your time, and we will continue from there. Let me update the slide, and I will see everyone in 45 minutes. Have a great lunch. I'm going to be eating lunch at my desk, so I may or may not see a question, but put it in the chat; I'm around.

...We'll give it a couple more minutes and then get started; hopefully everyone had a good lunch. I'm going to grab my drink and I'll be right back. All right, everyone should hopefully be back; if you're not, let me know. Okay, let's get started and pick up where we left off.

The goal today is to get through the basic NiFi functionality: how you operate the canvas, those types of things. If you left the screen where it was when you went to lunch, you should be able to come back straight to the canvas. You don't necessarily need to create this flow and copy me as I go; if you do, great, you can learn as we go, but the main goal here is just to familiarize everyone with the NiFi canvas, the basic functionality, and where things are located. We will continue with this for the rest of the day, and then tomorrow we'll start building our own flows. I've got some exercises for us, some more advanced topics, and then hopefully after that we get into scalability and the different things within NiFi to accomplish it, the bigger deployment-type capabilities. We'll have a Docker Compose that we can put together and deploy locally on our machines as well. Like I said, this is just basic functionality; if you haven't created a flow, if you didn't do what I was doing, that's okay. You're going to have a lot of time to build your own flows, and that's going to be most of the hands-on work: getting an understanding of those data flows.

Any questions on what we've gone over, or anything else related to NiFi that I can answer?
i like to take them back i like to include them in the presentation so that way at the end of the class i can send out the presentation um and it will include those questions because sometimes you want to reference those and and you know take a look at what the answer was so you know any any questions so far uh anything i can answer i'm gonna pause and if i don't hear any we'll just keep okay amanda perfect was there somebody else awesome awesome all right well no other questions we'll just continue uh again interrupt me if you have a question i do have the teams up uh so i can on another monitor so i can see if someone's raising their hand but because i'm looking at the nafi canvas i may miss it uh you know so you know just keep that in mind all right excuse me so um i think where we left off was uh we had looked at the connection uh we we picked the file up using the github file processor we listed the queue um and and and got hands-on visualization that we did have that file um you know now that we've been gone a little bit that file has been sitting in queue for 58 minutes um and so you know it does keep track of those things um you know there's that type of stuff uh the get file connection is going to identify mine type it's just an example so we can start chaining some of these processors together so i can show some of the components of that uh and those types things so with that being said um let's go back to looking at a processor so for this scenario i just i decided to run it one time if i were to click run again it should notice that the file that was picked up has already been picked up uh well it actually cleared my state so i guess because we were gone too long uh but what it did is that pick that file up again um you know it does maintain a state uh the state is configured you know after so many minutes it will clear that state so if it notices it picks the file up it should not pick it up again unless you wait you know a certain amount of time so it's picked the file up uh it's going to the next processor uh and from and from there it just continuously goes through the data flow uh so on this connection though you know i've tested out my get file processor i've tested that the connection to the next processor works uh so what i like to do in these scenarios is i like to just empty the queue now here's here's where you know the you know you got to make sure you have that configuration to keep your source file uh you know i'm seeing previously like if it's not kept and it gets to this connection and i say empty the queue it will delete that file and once that's deleted right i can't get it back i mean it's probably still in the content repository but it's not easily accessible uh so but for this scenario i know that i have the original file i know because we run it twice now um so i'm going to go into the connection and i can you know configure that connection i can list the queue what we've done i can look at the status history uh you know i can see how many bytes came in in the last five minutes how many bytes went out you know we should have zero bytes because i have not turned on that next processor so if i actually say run once now i could have bytes out should be there i'd say look at that but it should show like the flow file went out and then i'll go back to that one and see um but you know i ran this next processor one time so it's going to grab one flow file and adjust it from that connection and process it um so so it ran that one time it adjusted it it processed 
and now it has been sent on toward the UnpackContent processor. If I look at the attributes now, I should have MIME type attributes: I now have mime.extension and mime.type, and it recognized this was an application/zip file. So now I have that additional attribute, where instead of just decompressing this zip file I could have a processor set up that says route all zip files: read the attribute and route all zip files to the decompress processor. We'll go into more of that in the hands-on, and there's a quick sketch of that attribute-based routing idea just below. But I now have that capability, and I still haven't unzipped or really messed with that zip file; I'm still working off of the attributes. So IdentifyMimeType identified it accurately, sent it through, and applied the attribute of application/zip.
Also, if those attributes are left basically untouched, they will follow that flow file; it just keeps adding to those attributes. So the path, the file creation time, the last access time, the owner, the file name — all of that was added by the previous step, and I'm going to have access to those attributes on the following steps. There are processors to manipulate attributes, but unless I do something with an attribute, it will continue flowing with the file.
So I've run that one time, and now I've got one file in the queue, and it can't go to UnpackContent because UnpackContent is not ready yet: I haven't set a success relationship, I haven't set a failure relationship. So it will continue sitting in the queue. If I run this again, I should have two files in the queue. There we go, two files in the queue. It's going to continue queuing up until I resolve the next step or until I tell it to terminate, and then it's done with those files. So that is a connection in NiFi.
For this instance, though, I'm going to go ahead and empty the queue. We'll be using that file and some of these files again, but for now it's deleted. As a tip, NiFi will apply backpressure and stop filling a queue when it reaches a 10,000 flow file threshold. These things are configurable, but that's the standard most people work off of; I don't see a lot of changes to that configuration. So if we had 10,000 files in this queue, it would queue them up, but it would also start stopping the processors behind it; it won't let IdentifyMimeType process new files because there's nowhere to put the file once it's complete. It has insight into what it's done and which files still need to be processed. We see it quite a bit where folks queue up a whole bunch of files, they're not getting cleared, they keep queuing up, and the flow starts backing up. If you have a processor that takes a long time, that's something to take into consideration, because you can back your whole data flow up just because files aren't being processed quickly enough. That would be my concern if I were unzipping two- or three-hundred-megabyte zip files and trying to do 10,000 of them; the system is going to slow to a crawl, especially since we've only got this configured for something like two or four gigs of RAM. I'm going to show you where to check all of these settings.
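Before we move on, here's a rough illustration of that attribute-based routing idea from a moment ago, not something we've built on the canvas yet: a RouteOnAttribute processor lets you add your own dynamic property whose value is a NiFi Expression Language check against the mime.type attribute, and flow files that match get routed to a relationship named after that property. The property name "zips" below is just my example.

    Processor: RouteOnAttribute
      Routing Strategy : Route to Property name
      zips (dynamic)   : ${mime.type:equals('application/zip')}    # matching files go to the "zips" relationship
      # everything else goes to the built-in "unmatched" relationship

You would then draw the "zips" relationship to UnpackContent and send "unmatched" wherever else it needs to go.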
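And on that 10,000 file tip: in the NiFi 1.x builds I've worked with, the default backpressure thresholds that new connections inherit live in nifi.properties, so that's one place to check them. The property names below are from the 1.x admin guide; verify them against your install.

    # conf/nifi.properties -- defaults applied to newly created connections
    nifi.queue.backpressure.count=10000
    nifi.queue.backpressure.size=1 GB

You can also override both numbers per connection from the connection's settings dialog, which is usually how people tune a single hot queue.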
Just as a tip and trick: if your queue is constantly getting backed up, you may want to look at distributing that processing, speeding the next processor up, or something similar, just so you can get your queue count lower. But the queues are built in; they're there for that data delivery guarantee. The source could have been a sensor we're receiving data from; the source could be MiNiFi on an edge device, a Raspberry Pi picking up data and sending it to this instance, and we could have tens of thousands of those Pis communicating that data. So it's something to take into consideration, some of the performance bottlenecks, and we'll go through some of those as we go. I like to share real-world scenarios from my experience, and feel free to ask questions on that as well.
So anyway, we went through the file connection, and there is a lot of information there. We can go back to the source; if this connection came from another processor group, or from outside of this process group (and we'll get into process groups), then just from the connection you can go to the source or go to the destination. There is a lot of customization capability within NiFi, and it gets used quite a bit. If you've done PowerPoint, you can stack objects in front of and behind each other; you get a lot of that here too, where you can bring something to the front. We can also create a template from this: I can hold shift and select everything — it wants to refresh, the latency on the VM sometimes, there we go — then right-click and say create template. It looks like this connection, this connection, and these two processors got highlighted in the selection; I can create a template and deploy it over and over again. So I'm going to go ahead and create a template, give it a name, and say create. It was created successfully, perfect, and we'll go into using templates later.
There's a lot of visual customization on the NiFi canvas: you can bend these connection lines depending on how you arrange your processors, you can label things, and we can put color boxes and similar behind them — I'll show you some of that with labels. So it's not only building a flow; you can start grouping and aligning everything. Don't get discouraged if you have things offset like this; it will snap back to the grid. Just hold your shift button, highlight it, and use align, and there you go.
Also on a connection, you can go to the source, bring it to the front, empty it, or delete the connection. For this connection to be deleted, the queue needs to be emptied and the processor needs to be stopped or still in a not-yet-configured state. So if this processor were running, I could no longer delete it; I'd need to stop the processor before I can delete that connection. That connection, again, is just the GetFile success relationship; it actually picked the file up.
So that's the connection. Some of the other items on the NiFi component toolbar: we started with the processor over on the far left; on the far right we have the label. This is where we can start labeling some of these processors. What we might want to do, for instance, is apply a label to this section that says this is the detect-and-unzip section. As part of this data flow I can then quickly look and have a visual cue that this is the detect-and-unzip section; if I were to sort and filter, another area might be a different section, those types of things. A label on the NiFi canvas is very simple; it's just a graphical aid to help explain different processors and so on. I'm going to go ahead and delete that; all I have to do is highlight it and hit delete.
We're also going to talk about input and output ports, but not right this minute. What I do want to bring down is a process group. I've brought the process group down, and I can enter a name or select a file to upload. For this instance I'm going to enter a name of "pick up zip, detect type, and unzip," and it's in the name: what this allows us to do is build our data flow and then take that group of processors, that full data flow, and put it within a group. That helps us distinguish it from other flows. You may have prioritization on some of the process groups, you may have different processing scenarios going on, but this is a clean way, instead of 15 or 20 different flows running on the main canvas, to break things up and say this flow is for picking up the zip files, detecting the type, and unzipping. We may have another process group that picks up data from an API and does something totally different, but at the end it's merging that data, looking it up, or whatever. That's the purpose of a process group.
So for this I'm going to select all of my processors and connections, if the latency will allow; let's try that again, I'll just manually select them — one thing I notice about these remote desktops is the latency sometimes — there we go. All I'm doing here is holding shift to select all of these, and I'm going to drag and drop them straight into that process group. Now from the main canvas I have my process group to pick up zips, detect type, and unzip. From here I can bring down another process group and name it, say, "receive or pick up from an API," add that process group, and then go in and start building my data flow based on what I need to do to pick up from an API.
Process groups allow us to build these flows and put them into a group to organize them, so that when you go to your main NiFi canvas it's not hundreds of processors all over the screen. It helps provide organization for different flows, different capabilities, those types of things. One of the design principles behind NiFi is that you want to keep your main canvas, the home canvas, as clean as possible, because this single NiFi instance could have two or three hundred processors chained together, and if you tried to pull all of that onto the same canvas it would use a lot of resources in the browser and it would get cluttered and hard to find what you're looking for. So I like to use process groups to organize data flows. From there you can actually take the output of a process group and feed it into another process group, and manage that connection just like we've been seeing with processor-to-processor connections. Once you have a process group, you just double-click to open it up, and there is your data flow.
A process group also shows a lot of status: you can leave the group, you can see how much data is in the queues, how much has gone in and out, the reading and writing of bytes in the last five minutes. It takes all of the processors within the group and accumulates how much is in the queues, so if you have three or four connections and you've got 7,000 files sitting in one queue and 3,000 sitting in another, it's going to tell you that you have 10,000 files total. I'm going to go ahead and delete it; like I said, we're going over some of the main aspects of NiFi, not really building a flow yet.
Earlier I highlighted these and saved them as a template. You can access templates from the canvas, right beside the label icon. If I drag the template icon down to the canvas, it's going to ask me which template I want; I saved that earlier as a test template, and there it goes. So that is the functionality of a template and how it's used. You can export a template and you can import it; there's a lot of capability there. I'm going to go ahead and delete this again; all I did was hold the shift key, draw a box around it, and use the physical delete key. You can also right-click on the selection and tell it to delete. I try to use keyboard shortcuts as much as possible to save time.
Okay, so, the canvas: you have an input port and an output port. We will go into more detail on those, because you may have an input port going into a process group and then an output port coming out of that process group and into another process group. What that does is allow you to chain those groups together with connections and apply some intelligence behind it. You can name the port; right now there's nothing else to configure it with, but it is a port, so you can use it for receiving data on a specific port, for instance, if you have that type of need, or for processor-to-process-group connections. There's a lot of capability with input and output ports; we'll go into more of those as we're building flows, and it will become more understandable when we get through that.
Process groups we went over: chaining processors together, creating a data flow, and putting it into a process group. A remote process group would point at a remote NiFi instance; you would configure it to basically send data into a remote process group, and we can pull that remote process group onto your canvas. There's a lot of capability there, and we will also go into this when we get into the more advanced data-flow building, late on day two or early on day three.
All right, so those are some of the main functions of the NiFi canvas. Where you will do a lot of your operations is right here: adding a processor, adding a connection, building your data flow, putting it in a process group, labeling it, creating a template from it, uploading and downloading templates, or pushing a template onto the canvas. That's the main functionality of the buttons you see in your top left.
Now, on the canvas you will usually see a Navigate and an Operate palette as well. The Navigate palette just lets you adjust your view: if you had hundreds of processors, with some way over on the right off the canvas and a bunch over on the left off the canvas, you can adjust your view of the canvas. I like to keep it centered, but because this canvas is pretty much infinite, I could, just as an example, push this all the way over here and it's no longer in view; the Navigate palette lets me move around so I can see that data flow, that process group, those types of things. You can use it to fit things within the current view, and you can zoom, but Navigate is primarily used just to move around the canvas and go look at the different process groups or data flows you've built.
Then we have Operate. For the Operate palette, I'm going to go ahead and bring down that template again. For this instance I have just that one single processor selected. Instead of right-clicking and choosing start, run once, disable, those types of things, I can operate on that processor or on process groups from this view: I can tell it to start, I can tell it to stop — it was running, now it's stopped, and it had already picked up two files. So I can start and stop, enable and disable that processor, or configure it, or apply configuration for the canvas itself: the process group name is the NiFi Flow, those types of things, just being able to configure the canvas and the view. If I highlight multiple processors, I can still start and stop them, delete them, and so on. So those two palettes are primarily used to navigate around the canvas and to operate on the canvas, on processors, and on process groups.
So the toolbar up top is for bringing down processors, templates, process groups, those types of things; the Navigate palette is for moving around; and the Operate palette is for operating on a processor or process group as well as defining some of the main canvas settings. Over here in the top right we have the hamburger menu. This is where we're going to start diving into some of the provenance, the lineage, those types of things. From the hamburger menu you should be presented with quite a few options, and there's a lot of information in there. We're going to start with the summary.
Basically, this is a summary of everything running on the main NiFi canvas, in this main instance. "Getting files from downloads" is the name of the processor, and it's a GetFile. I can list all of my processors here; it lists the process group, though it doesn't tell me much there since this is the root group, the run status, and, for the last five minutes, the size in, the reading and writing, the output, and how many tasks it executed. So there's a lot of good information here when you start building your data flows and diagnosing issues, or just to get a summary of what's happening. You can also click to go directly to the processor; there it is, highlighted for you. So as you go through your summary, if you notice an issue or a processor you want to work with, you can go directly to it. You can also look at the status history, just like we were able to do from the processor itself by right-clicking and pulling up the status; this screen gives us a centralized view and centralized management of those views. Instead of having to go to the canvas to find the processor and view its status, we can look at our list of processors, look at the name, and pull the status, how many bytes were read, all of the details about that processor's activity, directly from the summary's processors page.
Same thing with input ports and output ports. If we had an input port — and we will later on as we're building our flow — it will have a name, it will tell us the run status, and it will tell us the output, because data needs to come in and it needs to go out. The output port view tells us the same thing: how much data went through that output port on its way to its final destination, some key summary characteristics. Same thing with remote process groups: it gives you the names and tells you how many files were sent or received. We will probably have a remote process group, especially because we're going to dive into MiNiFi and Registry tomorrow, so expect to see some of these populated, and this is how you would access that information.
For connections, right now we only have that one connection, the GetFile to IdentifyMimeType, which you see right here. You can pull details about it, any settings, those types of things, but you can see the GetFile connection currently has two files in the queue at 487 megs; that's about 48 percent of the queue size we have configured by default. Again, we can go back into the nifi.properties file, configure queue sizes, and make those changes, but in this case I think one gig is about the queue size it's set at, and we're using about 48 percent of that. It's "getting files from downloads" going out, and the destination is IdentifyMimeType, and you can again go to the destination, go to the connection, or view its status history.
Same type of thing for process groups: right now the main process group is the NiFi Flow on the main canvas, and it will let you know how much data has been transferred in and out.
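As a side note, everything on that summary screen is also exposed through the NiFi REST API, so you can pull the same rolled-up numbers from a script. Here's a minimal sketch in Python, assuming an unsecured local instance reachable at http://localhost:8080/nifi-api (a secured install would need authentication and HTTPS), using the process-group status endpoint from the 1.x REST API docs; treat the exact response keys as approximate and check them against your version.

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"  # assumption: local, unsecured instance

    # Status of the root process group, rolled up the same way the Summary screen shows it
    resp = requests.get(f"{NIFI_API}/flow/process-groups/root/status")
    resp.raise_for_status()
    snapshot = resp.json().get("processGroupStatus", {}).get("aggregateSnapshot", {})

    # Print a few of the aggregate figures (missing keys simply print None)
    for key in ("queuedCount", "queuedSize", "bytesRead", "bytesWritten", "activeThreadCount"):
        print(key, snapshot.get(key))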
It shows how much was received, how much was sent, how many active threads are currently running, and the total task duration in the last five minutes. One of the things to pay attention to here is the active threads, because this will also help you diagnose when you have a processor that's taking quite a bit of time to process data and your queue is backing up. You may see only a few active threads and think, wait a minute, it should be processing a lot faster; let me go to my processor and increase how many concurrent tasks can happen, increasing your processing speed.
So that is the summary. It can open into its own page as well; you can pop it out. A lot of folks who use this just go to the summary real quick for a snapshot and then close it out, but you can keep it up in a new window so you can quickly go to that tab and see what's running, how it's running, all the characteristics of the NiFi canvas itself.
The next one we have is counters. I don't think we're going to have any counters right now; if we had counters in our data flow doing counts and things like that, this is where they would show up. We will put some in for the flow we build, and when we do, we'll see this become populated, and now we know how to get back here to check those values. Counters actually aren't used that much in most operations I've seen, just because you get a lot of that information from the summary and elsewhere.
Bulletin board: we can set our bulletins to warn and other levels, and those messages show up here, so this gives us a centralized place to come in and view them, and to turn on some detailed logging if need be. There's a lot of capability there, but I don't see bulletins used all that much. Basically, you can set a component's bulletin level to debug, info, warning, or error, and the bulletin shows up; all the bulletins from all the components can be viewed and filtered on the bulletin board page. If you were to turn on debug here — and usually with debug and info we get the most data from the log — you'll see those log messages appear on the bulletin board page, and you'll be able to filter them. Right now I have it set to warn, and we haven't gotten any warnings, so there shouldn't be anything on the bulletin board page; if there had been one, we would have seen it there. When we're building our flows later you may see some errors, and we can change the bulletin level down to debug so you can at least see the message.
All right, the next major section is data provenance, and this really gets into the lineage and related things. This is my chain of custody: I can see that at this date and time, this provenance event was captured; here is the flow file UUID; I received the flow file; here was the file size; the component that touched it was "getting files from downloads"; and the component type was a GetFile
processor. It shows the name of the file, the details and attributes for the file, and the content, if it can be viewed. A lot of times we're picking up log files, text documents, those types of things, so you can actually go in and view what that file looked like. So this is the provenance event that happened. We cannot replay it from this provenance event, because the source flow file with that ID no longer exists; I did delete those. You can download it and view it as well. A subset of this same information is available when you right-click on a processor and view its status and data provenance, but this data provenance page shows everything happening across the system.
I can also go to the lineage here; it computes the lineage, and I can replay it and see what's next. Unfortunately I don't have a fully built-out flow yet, but when I do, you can go back and replay and see: I received this data at this time and this is what it looked like; I passed it to another processor and here's what the data looked like going in and coming out; here's when that processor touched it; here's what the original looked like. All of this is the chain of custody that is needed for many reasons: for data provenance, for the data delivery guarantee that's part of NiFi, for some of those security capabilities. You can download the lineage and it saves as an SVG, an image of it receiving the file, the processors that touched it, those types of things. When we build out a flow later we'll be able to download a full view of that lineage from this screen, and then of course go back to the event list. I do see data provenance used quite a bit. It has its own window capability too, so if you want to continuously watch your provenance events — if I had data flowing through this system we would see this updated constantly — you can keep it open and check on it.
We will really talk about controller services later today and tomorrow, but basically this is our controller settings. As an example, there is a database connection pool controller service in NiFi. What that does is let you build the connection once: say you have to build a flow that connects to Postgres and puts data into Postgres. You can build a Postgres controller service with the username, password, and connection details, and that allows processors to reuse the component. You may have five or six processors writing to Postgres, and they're all using the same controller service to authenticate and to know which Postgres to go to. That enables a lot of things: you can have a security policy where only certain people can modify the controller service, and another policy that says a data-flow builder can build a data flow and use the Postgres controller service, but they can't modify it, and potentially they can't even see the username and password; they just build their flow and connect to the Postgres SQL service.
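Just to make that concrete, the stock controller service for this in NiFi is the DBCPConnectionPool. Here's a rough sketch of what the sysadmin would fill in for a Postgres pool; the JDBC URL, service name, and credentials are placeholders I made up, not anything from our lab environment.

    Controller Service: DBCPConnectionPool   (e.g. named "Postgres - Warehouse")
      Database Connection URL    : jdbc:postgresql://db.example.internal:5432/warehouse
      Database Driver Class Name : org.postgresql.Driver
      Database User              : nifi_writer
      Password                   : ********   (stored encrypted; flow builders never see it)

Processors like PutDatabaseRecord or ExecuteSQL then just reference that service by name in their properties, which is exactly the reuse I'm describing here.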
We will dive a lot into controller services; they're a very important aspect of NiFi, but this is where you can go and start seeing those settings. You can manage controller services and even add them. There's a credential-style controller service, for example, so that you don't have to share usernames and passwords: you put your credentials in once, and again, someone can build a data flow that uses that controller service without ever knowing what the username and password were. You can set up a distributed caching service; there's an HBase client service to manage connections to HBase in big data environments; there are record sink services; there are even syslog readers and those types of things. So there's a lot of controller service capability. That way you don't have to give out credentials: everyone can use the same controller service, a sysadmin configures it once, and it's a reusable component.
We have somewhere around 123 controller services in out-of-the-box NiFi. Some of these are record readers, record writers, and lookup services, where you can define schemas for how data should look and map, so you may map your attributes to a schema; there are services for that, and you can go from there. So that is controller services.
Reporting tasks: you can add a new reporting task. A common one is a Prometheus reporting task. What that allows you to do is report on all the NiFi metrics, so everything going on in NiFi gets exposed as Prometheus metrics, and you may have a centralized Prometheus server receiving all of these. You configure it once, and if others want to use Prometheus they can just reference it. We'll also look at how processors handle this, because some processors mandate that you have a controller service, while some will let you do something like a simple database connection where you can choose between creating your own connection and referencing a controller service to handle the connection pooling. A lot of the reporting tasks are things like Ganglia reporters, memory monitoring, and disk usage; this is all about the performance of the system, so you can add those as needed.
Registry clients: we will set up Registry just like we installed NiFi on Windows. Our Registry, once we install it, is going to be sitting at localhost, and we can add the name — this is our local registry — and then go in and configure the properties, the URL of that registry. We won't install Registry right now; we will go through it later, but this is where you would put in your registry for version control, that checking in and checking out of data flows.
And then parameter providers: again, you may have some secrets management going on, you're going to need to keep those safe, and you probably do not want to share a lot of those secrets. You can define providers here, like a database parameter provider, configure the service once, and again, rinse and reuse it. So that is the controller settings menu; we will utilize most of these when we're building out our flows.
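For reference, when we do add that registry client tomorrow, it's really just a name and a URL. Assuming you run NiFi Registry locally on its default port (18080 in the 1.x releases I've used; double-check yours), the entry looks roughly like this:

    Controller Settings -> Registry Clients -> +
      Name : Local Registry
      URL  : http://localhost:18080

Once that client exists, right-clicking a process group gives you the version-control options (start version control, commit local changes, change version) against the buckets in that registry.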
From the hamburger menu, though, that's where you can select those and start adding them in.
Parameter contexts: you can set a parameter name and a parameter value, and then reference that in your data flow — I'll put a tiny example of the reference syntax at the end of this part. You may have a parameter like "night or day" with a value of yes or no, and you can reference it using the expression language in your data flow. When you set a parameter, everyone and every service with permission can access it. We see this quite a bit where different systems may get different data, so instead of having to type out the system and the URL everywhere, you can just use a parameter and reference its value in the data flow. It saves a lot of time, it's a reusable component, and it provides ease of use for handling your different parameters. There are some inheritance capabilities as well, and some settings. We will define a parameter on our main NiFi canvas later on in our data-flow design, but that's where you would add it.
The flow configuration history is basically "what have I been doing": deleting flow files, deleting processors; it's a history of the interactions with NiFi, the date and time they happened, the component name — so "getting files from downloads": I started it, I stopped it — and who the user was. This is how you audit things on the system, and it can feed into other systems as well. It's good to have, and not only can you go through and see all the changes you made on a data flow, but if you're on a system where multiple people are logged in and working on data flows, it allows you to go back and see who did this, who changed that, those types of things.
The node status history shows things like our free heap space, how much is used, the processor load average — a lot of system metrics, because we can send a lot of data through this and we can bog the system down; there's a lot going on behind the scenes. The status history keeps track of that: how much data is going in and out, the content repository free space, the flow file repository used and how much is free. The flow file repository that gets created has a finite amount of available space; it uses that space, writes to it, and then purges. So you're going to be able to see, okay, my data flow is running and I'm pushing a lot of data through; looking at the status history I may want to increase my provenance storage, or I may want to throttle some processing because I only have 20 gigs of hard drive space allocated to this and I know it's going to get used up. So when you're building data flows, designing the system and the architecture, and looking at the metrics and performance, that node status history is going to be very important.
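Here's that parameter reference syntax I mentioned. In NiFi 1.x, parameters are referenced with the #{...} form (distinct from the ${...} expression language used for attributes), so a processor property can point at the context instead of a hard-coded value. The context name, parameter name, URL, and the InvokeHTTP example below are all made up for illustration.

    Parameter Context "Team Defaults"
      source.api.url = https://api.example.internal/v1/events   (placeholder value)

    InvokeHTTP processor property
      Remote URL : #{source.api.url}

Change the value in the parameter context and every flow that references #{source.api.url} picks it up, which is the reuse being described here.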
Templates are exactly what they sound like. With NiFi templates in general, once we save them we can upload them, download them, and delete them. In this case you can go straight here and download the template; it downloads as an XML document, and if you open it up it has your connections, the destination, the source, which processors, even the position on the canvas where they sit and which bundle they come from — a lot of information, all stored as XML. Like I said, I can export a template, email it to myself if I want to put it on another system, and then import it there. Instead of another system plugging into my version control, this is a quick way to get my data flow from one system to another.
One thing to keep in mind with templates is that any processor referenced in the template must be installed in the NiFi instance, or the template will not lay itself out on the canvas. You can upload the template, but until you have the processor needed to fulfill it, it will not appear on the canvas. I see a lot of folks build a template that references a custom processor, then change that custom processor to a newer version, and now the template breaks, because it looks for that specific processor and that specific version. So if you're working with templates, make sure the processors they reference are in your NiFi instance.
You can also upload and download templates here, and create one; we've already created one. If I wanted to, I could create another template, save it in my templates, and then download it. I did download that one test template, so I should be able to select template, go to my downloads, there's the test template, and upload it. It's probably going to complain that the name already exists... okay, now I have a different file name for that template... oh, I see what the problem is: it's looking at an internal name inside the template. So the template has not only the file name, but inside the XML itself is the template name. Let me see here... back to this... now it lets me — success. So just keep that in mind when you're working with templates: you want to choose unique names. When you export, NiFi names the XML file whatever the template name was, and if you need to re-upload a template and that name already exists, you can go to the hamburger menu, go to your templates, delete the template you want to replace, then re-upload — just make sure you change the name. If you're trying to reuse the same template on a system, you actually have to go into the XML, right at the beginning, and give it a new name.
Like I said, a lot of people use templates just to share work with colleagues, passing around a data flow. The better way of handling data flows and their portability is through version control: that way one software engineer can develop a data flow and save it into version control, and then another developer can access that bucket and pull that exact data flow onto their own canvas.
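If you do end up hand-editing an exported template to rename it, the element you're after is near the top of the XML. I'm sketching the shape from memory here, so treat the exact layout as approximate; the important part is that the <name> element inside <template> is the internal name NiFi checks on upload, not the file name.

    <template encoding-version="1.2">
      <description>Demo of GetFile to IdentifyMimeType</description>
      <groupId>...</groupId>
      <name>test-template-v2</name>   <!-- change this, not just the file name -->
      <snippet>
        <!-- processors, connections, positions, and bundle info live in here -->
      </snippet>
    </template>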
When we set up Registry we'll go into that; we're going to set up and install NiFi Registry. I have my own personal GitHub instance, but just so you know, NiFi Registry can be backed by version control, and we'll go into more detail on that.
You also have a search box, and it's a free-form search: if there's a connection, a processor, or a process group that matches, it lists them out. So I do have "getting files from downloads" and also the GetFile success connection, and it tells me, here are all the processors that match that description, here are all the connections, and if there were a process group that matched, it would let me know. This is really meant for the NiFi instance that has 30 or 40 process groups just on the main screen; you're diving into those process groups, you've got 100 different processors in one and 30 or 40 in another, and it can get a little out of hand, so the free-form search box lets you search for what you're looking for. It's another good reason to apply a good naming convention to your processors, a more human-readable description, so that when you search it's easy to find. For instance, my GetFile says "get file from downloads"; if I had a whole bunch deployed, I could just start typing "get file from downloads" and it would find that exact processor, and I can go to it and operate on it. So you'll use that search box once you really start filling up the canvas; if labels and process group naming can help you find something, the search box definitely can too. There's a small scripted version of that search just below, for anyone who prefers the API.
Okay, some of the other main components of the user interface: all of these buttons are your component toolbar, so if you ever hear of the component toolbar in NiFi, that's these components: a processor, an input port, those types of things. The status of NiFi is shown in your status bar: we have two files in the queue at 487 megs, we've got one stopped processor, we've got one invalid, the last refresh time, whether there were any issues or warnings, some of the data transfer numbers, and whether we had a disabled processor. This status bar gives you a quick snapshot of what is currently happening in NiFi. We also have the Navigate palette, which we went over, and the Operate palette; it's all part of the main NiFi canvas and gives you that bird's-eye view, and you can see on the canvas that we have our processor and so on.
Down at the bottom we have breadcrumbs. If I take a process group and put this flow into it — come on, just bear with me as I work through some of these latency issues, there we go, perfect — you can see that we put our flow in the process group, and at the bottom is a breadcrumb trail, so we are now in that testing process group. If we had another process group underneath, we could keep going down, and the bottom is a quick, easy way to get back out. So those are the breadcrumbs; that accounts for a lot of the canvas you see right now as a whole. And of course you have a log out, but don't worry about that.
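Here's that scripted search I mentioned. The same free-form search the toolbar box does is available over the REST API; this is a minimal sketch in Python, again assuming an unsecured local instance at http://localhost:8080/nifi-api (the endpoint path is from the 1.x REST API docs; a secured cluster needs a token and HTTPS, and the exact response keys may differ by version).

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"  # assumption: local, unsecured instance

    # Free-form search across processors, connections, process groups, etc.
    resp = requests.get(f"{NIFI_API}/flow/search-results", params={"q": "get file from downloads"})
    resp.raise_for_status()
    results = resp.json().get("searchResultsDTO", {})

    # Print the matching processors by id and name
    for proc in results.get("processorResults", []):
        print(proc.get("id"), proc.get("name"))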
There is a lot of policy and multi-tenant authorization as part of this. We really won't go deep into it, but there are security policies that allow users just to view the UI, and then there are policies that allow users to access the controller services or reporting tasks; there are a lot of fine-grained, detailed security policies. So when you work on deploying this, you may have your standard data-flow developer, and that's all they work on, and you may have a sysadmin or system engineer who puts in the controller services for the data engineers to use, those types of things. If you refer back to the admin guide and user guide, there is a ton of information there.
Okay, perfect, let's see what else we've got — all the inputs, the search. Are there any questions on the canvas itself, anything I can answer about what a button does, how you access something, any of those types of questions? "I think what's going to really help is actually getting in there and playing with it." Oh, absolutely, and that's why I'm just going over it right now; tomorrow morning, and maybe even this afternoon, we are going to start building NiFi flows. We will go through and build a flow, and if someone at the end of the day can let me know how you plan to receive some of your data, we can tailor those flows to resemble what you're working on. You're going to learn a lot just by doing it, and the beauty is: don't be afraid to break NiFi. You're not going to do anything we can't recover from, especially in this scenario, so when it comes time to build flows, have at it.
All right, let me look through my notes. With a processor, some of the other tips and tricks I like to go over: each component displayed on the canvas also shows the name of the processor, the group, the bundle, the description, those types of things. When we're looking at the processor list we can group or filter; right now the only group we have is org.apache.nifi, and that's because all of these are supported Apache NiFi processors. You can go to GitHub and find thousands of different processors; all the ones located here are supported by the Apache Foundation, the Apache NiFi project, so keep that in mind as you work through this. You may want to create a different source for your processors, or a different version. We have AttributesToCSV, for instance; we could have two or three different versions of that, and depending on the version, you can reference those in your data flow, so that's something else to consider. If there were a different version available for a processor, we could right-click on it and change that version. So there are a lot of processor capabilities: you may upgrade your processor to accommodate a different data type coming in, and you don't necessarily need that upgrade on another data flow, so in real time you can install that processor version and upgrade your data flow as it's happening. If this was running, I could tell it to upgrade the processor, and it would upgrade that processor and then continue running.
When it comes to the software-engineering side of the house for NiFi, the main goal is to keep these processors continuously running, let them do their thing, and not stop a data flow. We can also fork off data: for this one we're getting files from downloads and sending them to IdentifyMimeType; I could easily bring down another processor and do a PutFile — let me empty this queue — and what I'm doing is, while this data could be flowing through the system, I am branching off, so every time it runs it's sending to IdentifyMimeType, but I can also send it to a PutFile, and that is exactly the same file. So from one processor I can branch off into many processors. One of the ways I've recently used this: alerts coming out of an AI system would be sent to Kafka; they needed to go to a dashboard for users to interact with, but that same data also needed to be sent to a database for longer-term storage so we could crunch over those results with some additional AI models. So you're executing models in flight as the data is coming in, presenting results on the UI, as well as doing some of the longer-term model execution. Keep that in mind when you're building your flow: you may have a need to pick up a single source and send it in multiple different directions.
All right, we went through the settings, we went through scheduling, we blew through some of the training on execution and run schedule; I'm looking at my notes, and we've gone through a lot really quickly. Let me pause there. We went through the parameter context; we've been through most of this. Are there any questions on the main NiFi canvas, some of the components, the toolbars, how you access them, how you work with them, any of those types of questions? Okay, thank you.
Well, we have an hour and seven minutes left. I say we take our final break here in a minute, and then we come back and wrap it up. We are going to go into some expression language; I know that not everyone, but about half the team, are developers, so do your best when we get to that. We're going to go over the expression language used within NiFi to do things, and we're going to set up some flows, things like that. But let's take our last break; let's return at 4:05 my time, 2:05 your time — an 11-minute break — and then we'll go through, wrap some things up, and get ready for tomorrow. Tomorrow is going to be very heavy and extensive, so keep that in mind. We'll take a few minutes and be back in 11 minutes.
I know we're still on break, but I see some of you working, so the last task of the day is going to be putting that GetFile and IdentifyMimeType into practice. Some of you have already had a head start, it looks like, as well as labels — Brett, you're excelling, it looks like. So when everybody gets back from break, we can go ahead and build our first data flow and try to get these files extracted. All right, we've got one more minute, and then we will start with our hands-on exercise.
If you were looking at the screen, I did have the Apache NiFi website pulled up. I know there was a Log4j question, and to give a bit of detail on what happened: NiFi actually swapped to Logback, the successor to Log4j — the person who was originally developing Log4j spun that project off — so we shouldn't have a Log4j issue. With that said, I'm going to double-check, give the answer, and write it up into the presentation. But also, just so you have it, there is a security section on the site, and I do know that government deployments refer to it as well; like I said, there are a lot of people who use this. It lists the latest published vulnerabilities that have been found, as well as the mitigations. For instance, versions up through 1.23 had a DOM-based cross-site scripting vulnerability in the JoltTransformJSON processor; that's been fixed, and some of these versions, including the one we're using, came out because of vulnerabilities. So it's always good to check when you're deploying this. I know some organizations like to use the latest and greatest, and some like to stay a version behind, but if you look at the downloads you'll see that the M2 release has release notes, and the release notes for 1.25 will include things like which vulnerabilities were disclosed and what 1.25 is building on. So when it comes to these security things, it's good to always check, and since it's an open-source product you have the ability to go do that yourself. I think 1.23.2 corrected a few bugs; sometimes it's a minor release for a bug fix, sometimes it's a security-focused release — 1.23, for instance, because I know 1.22 had some issues. Just keep that in mind as you're going through here. I will write up the Log4j answer in detail, but when I looked at it, the logging has been changed to Logback, which mitigates that vulnerability and hopefully the related file-name finding as well.
All right, everyone should be back, so for the final task of the day let's go through the data flow that I built. We want to get a file, and I will walk through it in just a minute, but I want to see how far people can get without seeing my screen, so if you have any questions let me know. I did leave some stuff out on purpose, just so we can see how far we can get. If you can, let's pick up a file from the file system: use any of the zip files you have within the downloads directory, or create a new directory and copy a zip file into it, but your GetFile should point at that path. So let's do that: get a file, then identify the MIME type, create that data flow, look at the attributes, and see if it accurately discovered the MIME type, and so forth. Spend a few minutes building your flow; for some of you this might be your first time building a flow, so let me know if you have any questions. I have everyone's screens pulled up, so if you run into any issues, let me know.
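For anyone sketching this out before I show mine, the exercise really only needs a couple of GetFile properties set. A rough configuration — the directory is whatever folder you copied your zips into, and the regex shown is just one way to limit it to zips, my example rather than something from the slides:

    Processor: GetFile  (renamed to something like "Look for zips")
      Input Directory  : C:\Users\<you>\Downloads\zips
      File Filter      : (?i).*\.zip        # Java regex; case-insensitive match on .zip
      Keep Source File : true               # so emptying a queue doesn't destroy your only copy

Then draw the success relationship to an IdentifyMimeType processor and use run once on GetFile to check the attributes before turning anything else on.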
A little tip: when you're in your folder — let me pull mine back up here in File Explorer — you can click at the top, in the address bar of File Explorer; for this one, for instance, I was using this test directory, and you can just click on the address bar and copy it. I'll start typing the name to narrow it down... come on... there we go. So as you're building this out, use some shortcuts if you can; you can directly copy and paste the Windows file path. If we were working in Linux it would be the same idea: you'd just do a pwd to get the directory you're in, copy and paste that directory in, and it will pick it up.
All right, Kody is another high achiever — look at this, the flow's looking good. Any questions, Kody? "I don't think so, thanks." So while you work through this, I brought Kody's screen up; it's a pretty extensive data flow already. One of the things you'll notice with his is that red connection: that's because he has over a gig of data queued up, so the connection changes color as you approach the threshold that's been set. Great example.
By now you should have your GetFile that's going to pick your files up. If you decided to use a directory that has different types of files, you may have gone in and put a filter in place; again, the file filter regex is based on Java, so just remember Java regex is close to some of the others, but it can be a little different. "dir" is what you're looking for, Ben — there you go; I can tell Ben is a Linux person. Yeah, me too. You've got to remember Windows sometimes cares about capitalization and sometimes doesn't, whereas Linux really does. Ben, just right-click and hit refresh on the canvas itself; it's set up on a 30-second timer, so sometimes you don't instantly see it. Click refresh and see if it retrieved any files. Oh, you even put a route on after — nice.
Tyler, I think you completed it. Did it work as you expected? You picked your files up, you identified the MIME type, you unpacked it, it looks like, and now you're about to put it somewhere — it seems like it worked correctly. Oh, perfect. Is this your first time building a flow, or do you already have experience? You've already done it — okay, perfect; we'll try to find a couple of you who have not built a flow before. Looking good, Pedro.
Some of the things I'm looking for in this flow: did you rename your processor? A lot of folks skip that step or forget about it. I see you have "look for zips," so that tells me you renamed the processor to something we can understand. You can see the connection is starting to turn red because you're quickly approaching your one-gig queue threshold, but that will clear once you start your other processor. And remember, when you're building your flow, the run once option is really nice; it was introduced fairly recently, in some of the later versions of NiFi, so you don't have to turn everything on just to test it — you can just run it one time.
"What was the difference between run once and start?" Great question. If you click run once, it pushes one flow file through instead of everything that's queued. I've got Pedro's screen pulled up here, for instance: he clicked start on the "look for zips," and what that did was take all of the files in that connection and push them through the processor. If you click run once, it only pushes one file through, and then you can inspect the output to see if it's what you're looking for.
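By the way, starting and stopping a processor is also just an API call, which is handy once you start scripting tests like this. A minimal sketch in Python against an unsecured local instance; the run-status endpoint and the RUNNING/STOPPED states are from the 1.x REST API, and I believe newer 1.x builds also accept a run-once style state there, but verify that against your version's docs before relying on it.

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"   # assumption: local, unsecured instance
    PROCESSOR_ID = "your-processor-uuid-here"     # placeholder: copy it from the processor's settings

    # The current revision (version number) must be echoed back on every state change
    entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()

    resp = requests.put(
        f"{NIFI_API}/processors/{PROCESSOR_ID}/run-status",
        json={"revision": entity["revision"], "state": "RUNNING"},  # or "STOPPED"
    )
    resp.raise_for_status()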
What's the difference between Run Once and Start? Great question. If you click Run Once, it pushes one flow file through. I've got Pedro's screen pulled up here: he clicked Start on "look for zips," and that took all of the files in that connection, in that queue, and pushed them through the processor. If you click Run Once, it only pushes one file through, and then you can inspect the output to see if that's what you're looking for; if it is, you can just hit Start and let it run. Thank you. Yep, no worries. For this instance, on UnpackContent I would just say Run Once to see if it actually unpacked, and put a stop on your PutFile real quick before you click that; put a stop on the PutFile that takes the output of UnpackContent. The reason is that you want all the processors in your process group, the whole flow, stopped, because you want to test it along the way. If you left that one running and hit Start, the data would go right through that processor and you wouldn't have the chance to make changes, so it'll save you a little time in the long run. So there again, click UnpackContent, Run Once, and then refresh your screen; you can refresh from the canvas by right-clicking somewhere on it and choosing Refresh. Now you can go to that test queue, you see I have 106, and right-click, List Queue. Did everything unpack as you were expecting? Perfect, it looks good to me. Tomorrow we're going to work with some text data because we're going to do some advanced ETL-type capabilities, but today I just want to get a file picked up, unpacked, and put somewhere. So you can start it, or run it once and see if all of it writes; if you right-click on the canvas and say Refresh, that data should be all the way through. Yeah, it's extremely quick. Perfect, and then if you want you can now start the whole process: start your UnpackContent, yep, start them all. I see you used the Operate panel, which is fine, but you can also right-click on the processor itself and say Start or Stop, all that fun stuff. There you go, but leave it running. I know you're on your canvas; right-click and hit Refresh. Very first one, perfect. All right, so you've encountered an error, and you remember when we were going over processors I didn't have an error to show, but you have an error on your PutFile; you can tell by the red box in the top right. If you hover over that red box it should tell us what the error is. This is a great example: the file name already exists. It's just letting you know, so stop that processor, right-click, and go to Configure. What we want to do is auto-terminate those: go to Relationships and have failure automatically terminate. Perfect. It's going to keep reporting that; files that cannot be written for some reason are auto-terminated. So hit Apply and refresh your screen to see if it goes through. Oh, your PutFile stopped; go ahead and start it again so we can clear this queue out. Okay, there we go, refresh again; we're looking a little better on clearing your queue. No, we keep picking up that same file; can you hit Stop on your "files from download folder" processor so we can let the rest of the data flow run? All of these are great examples of the power here, where you can manipulate data flows in flight. Go back to your canvas and hit Refresh; we're clearing your queue. On your PutFile, let's stop that one.
And go to Properties: for the Conflict Resolution Strategy, let's set that from fail to replace. Perfect, now apply, start that back up, and you can go ahead and start your PutFile again. All right, let's see if your queue clears now. There you go. You want to stop your GetFile, because it's just continuously picking up that same file; stop the GetFile and see if that clears your queue, then refresh. Yep, it refreshes every 30 seconds or so, but we can change that. It's still writing data, though; you can see the PutFile has now put 6.2 gig out the door. Let's make this a little easier, you know, only 6.6 gig. This is a perfect example: the queue starts filling up because the UnpackContent can't run fast enough. There are a couple of things we can do; we can stop it and tell it to run more tasks, so it would unpack two files at a time instead of one. So stop UnpackContent, there you go, then go to the configuration, go to Scheduling, sorry, and for Concurrent Tasks let's do two, and apply that. All right, now start it; it should run a little quicker. We've got the first queue already cleared; the second queue is unpacking, and it takes a minute. This first example is why I like doing this: you can quickly back the system up with all of the unzipping. Those are some pretty good-sized zip files, so it needs time to unzip them and then push every single one of those files out.
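To summarize the two knobs just turned (both are standard settings in NiFi 1.x; the specific values are simply what we used in class):

    PutFile        -> Properties -> Conflict Resolution Strategy = replace   (the default is fail, which is what raised the "file already exists" error)
    UnpackContent  -> Scheduling -> Concurrent Tasks = 2                     (the default is 1; more tasks lets it unpack more than one flow file at a time)

Each concurrent task consumes a thread from the shared timer-driven thread pool, so it is usually better to raise it gradually than to jump straight to a large number.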
All right, let me pull Amanda up. Amanda, oh wow. Can you refresh that tab? Okay, you definitely blew this up. No, great job, and I actually mean that: Pedro and I were just going through his flow and he was on the verge of blowing things up, and it makes the point I was getting at, where if we get too much data in this queue and we have a processor unpacking large files, it's really going to start bogging the system down. We're giving only one task to do this, so as you can imagine, bringing in gigs of data and unzipping it one task at a time, it's still kind of blowing through it, but yeah. So Amanda, for a quick fix, can you bring up your command prompt at the bottom? There you go; do a Ctrl+C and terminate it, perfect, and you'll have to answer Y to terminate the batch job. I gave everyone eight gigs of RAM and a few CPUs, just FYI. Okay, go back to your folder, Amanda, into your NiFi folder in Downloads; there you go, go into the nifi folder, go into your bin folder, and run NiFi. You really blew it up. Amanda, let's run NiFi again. It just goes away, doesn't it? Well, it shouldn't be running in Task Manager. Oh, when you ran it, it doesn't show you the process? No, it will, but it's not running right now. Let's do this. "Could not find" something; that's really weird. Oh, when you told it what to pick up, did you pick up the NiFi folder as well? I believe so. So you probably picked up the NiFi folder and it removed itself. So let's do this: exit out of Task Manager, go back to your folder, and again, some of this is by design. Go to your Downloads. Yeah, you picked it up and unzipped it, and I think you did not keep the source file in the properties. So Amanda, go ahead and delete those two directories. Well, you picked up the original zip file too, so we actually need to download it again. You picked up the original zip file and you also deleted yourself. And I'm glad you did, because this is what I was saying earlier: this can go like a worm; it can go through and wipe everything out if you miss a setting, because it wants to pick everything up. You probably did not have Keep Source File set, so it went through, picked its own self up, processed itself, and then crashed because it couldn't find itself. No, that's perfect. You probably just had the log file open, Amanda. All right, try again. Perfect, go up to your top-left tab, there you go, go to Download, scroll down, perfect, and you want to download the 1.25 standard binary, not the source. That one, yes, and then click the HTTP link to download. Sometimes these VMs like to take a little time. Okay, there you go, click that one. Perfect, Amanda, it's downloading; it should be pretty quick. You can then right-click and extract the files, run your NiFi, and get your username and password, and if one of your colleagues is nice they could create a template for you.
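The setting that bit Amanda here is worth spelling out. The property names below are real GetFile properties; the directory is just a placeholder:

    GetFile -> Properties
        Input Directory         = C:\Users\<you>\Downloads
        Keep Source File        = true     (the default is false, meaning GetFile deletes whatever it picks up)
        Recurse Subdirectories  = true     (the default, which is how it can reach into the extracted NiFi folder)
        File Filter             = .*\.zip  (scope the filter down so it cannot grab the NiFi install itself)

With Keep Source File left at false and a broad filter, GetFile will happily ingest and remove anything under the directory, including, as we just saw, the running NiFi installation.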
All right, who else do we have? Erin looks pretty good. Oh nice, you're getting a file, identifying, unpacking; you're sending the original in one direction, the success of the unpacking in another, and the failure in another. Perfect. All right, Sean, any issues? Yeah, it can get a little tricky, so make sure you keep that source file. Again, with this first exercise I expect to break things; I expect to fill it up. That's the reason we're using these zip files: I know they will blow the system up, and that gives me training opportunities. Tomorrow we'll do some more advanced stuff, but as practice, yeah. Oh nice, Brett, you've got labels applied and you renamed your processors; that looks really nice. Do you want to rename your connection too? All right, looking good. Tyler, let's see how you're doing; it's all green. Were you able to run your flow, and did it pick up a zip file and unzip it? Did your flow delete itself? All right, we'll come back to Tyler. Amanda, you've got yours going, you're installing it, and you know where to find your username and password. Did something happen with Teams? Everything went down. Yeah, I think everybody got kicked out of the Teams call; I just got kicked out too. I lost everybody, it seems. That's a Microsoft issue. Okay, can you guys hear me? I can; this is Brett. Okay, thanks. Okay, Tyler, I called on you right before we all went down; are you still there? Okay, perfect. How did it go? It worked correctly. Any issues picking the data up, unzipping it, and putting it somewhere? No, it worked well. Okay, wait a minute, you already have some of the regex in there; have you used this before? Yeah. Perfect. We'll get into the expression language tomorrow, but you already have a good head start; you can take that path, put it as a variable on the main canvas, and then reference the path from there on. Perfect. Okay, Amanda, you're still getting up to speed and reloading. Pedro, did your queue clear? Did you make it back into the call? Yeah, we got back on, I just forgot to unmute. Okay, perfect. Any issues with your flow? No, I was trying to find out how to delete files. How do you delete files? I was playing around with it. Oh okay, you're just playing around with picking them up and deleting them. Your flow looks good; I see a yellow one behind that, I think it may be a List processor, but that's not part of your main flow, right? Okay, perfect. Elisa, how did you do? I did see you have a file picked up in success; were you able to run it through the IdentifyMimeType? It wasn't running through. Okay; if you want, you can stop picking up files, so leave your first processor stopped and then start your second, third, and fourth ones to see if it will go through and clear the files. The example we're using is going to throw errors, and like I said, that's by design as well. Randy, oh, I like this: Randy put another PutFile in place so he'll have a copy of the file in another directory. Randy, how did your flow go, did it work? Yeah, it worked, but when I started all the processors it was generating a backlog, so I stopped everything and then used Run Once to get it to go all the way through. Okay, perfect, and again, this first example is kind of designed to break; if you try to unzip all of those files it's going to back up the queue. Your queue threshold is about a gig, so if you do just four or five of those you're going to be over it. Okay, Ben, it looks like you're getting the file and it's getting built up; I like that you have a RouteOnAttribute. Yeah, you're not putting in the correct expression, so let's look at the properties. Oh, you're familiar with the expression language as well? I'd have to be; I used Gemini for that. Oh nice. Let me see here, I'm looking to see what's wrong with your route. You've got the dollar sign, and I can tell you've been messing with it from your history. So, mime.type contains: is that a colon in between mime and type? I think so. Yeah, I think it's supposed to be a period. "No expression found," because I'm done, and that's how I do things; that's what you get for trying to be fancy. Is that a curly bracket? Okay, so let's do that: replace the bracket in front of mime, let's put a curly brace there. Oh, curly brace, yeah. There you go, and whoa, you see how it actually turned blue? That's correct. But application/zip: that should be correct, but it's not. mime.type contains, your curly braces are where they need to be and your parentheses are where they need to be.
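For reference, the RouteOnAttribute expression being debugged here normally ends up in this shape (this is standard NiFi Expression Language syntax; "zips" is just whatever you name the routing property):

    zips  =  ${mime.type:equals('application/zip')}

or, a little looser:

    zips  =  ${mime.type:contains('zip')}

The dollar sign and curly braces delimit the expression, the period in mime.type is part of the attribute name that IdentifyMimeType added, and the colon introduces the function call (equals, contains, and so on). Flow files that match get routed to a "zips" relationship; everything else follows the unmatched relationship.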
Say OK and see what it yells at you about. "Unexpected token at line one." Okay, bring that back up; you only have one line, right? So it doesn't like the parentheses. I'm pulling up the NiFi Expression Language guide, just FYI, to check whether he's formatting this correctly. We haven't gone into this yet; some folks like to be overachievers, which is a great thing, but there is an Expression Language guide that will walk you through a lot of these functions. I'm going to try equals on that file name. You're working with an attribute, so it's only going to see the attribute value. I wonder if you could do filename contains, but that would look at the file name, not the mime.type attribute, right? So as Ben tries to figure that one out: if you can't figure it out, Ben, just paste what you have into chat and I'll take a look, but I think you're very, very close. And just for consideration, we haven't gone into this yet, but a lot of the online regex builders do not play well with NiFi's regex. There are regex builders out there that can help you build your patterns, and you can copy and paste, but NiFi doesn't always like to work with those, so keep that in mind. Okay, while we wait on Ben, we have a couple of minutes left. Tomorrow we are going to continue building flows. If you can, log out of your machine, because I'm going to push some additional data to it: some zip files with CSVs. We're going to deal with extracting columns from a CSV or Excel file, manipulating the data, picking those files up, and potentially putting them into a data store of some sort. If you don't log out, that's okay, because I can stop your machine later; in the next couple of hours I will stop your machine and push the updates anyway, just FYI. All right, are there any questions on what we went over today, besides the regex pattern? I think everyone took the exercise and nailed it, and some of you went above and beyond. Great job. Brett, I really liked how you put the labels in place, and I think it was Cody who started naming everything; you caught that during the training. Or was it somebody else? One or a couple of you went through and named the processors and those types of things to be more human-readable. So feel free to play around with your flow; like I said, in a couple of hours I'm going to push some updates. I'm going to include some CSV files we're going to pick up, and we may even reach out and hit a web service and pull some data in. There is some faker-type data that I like to generate and pull in; there are web services where you hit an API and it returns a hundred different names, fake names, fake birthdays, and all these other things. Tomorrow is really going to be about going through the data.
The data flow we created today is really easy: it picks a file up, identifies the MIME type, unpacks it, and puts it out. We didn't really need to work with the data we picked up; we're just working with the attributes to identify the MIME type and route it over. Tomorrow we're going to do a lot more in-depth work: looking at the files, sorting, doing some ETL-type steps, and things like that. We may even build our own customized weather service to send us data about the weather based on location; that's another great example I have. If there are any particular technologies you all would like to take advantage of, let me know; I'd be happy to do a hands-on flow where we're working with a database, an API, you name it. After we get through building a couple of flows, we're going to export one of them out, get it running in MiNiFi, and have MiNiFi run that flow. We will use MiNiFi locally, since we won't have an edge device to work off of, but we can do everything within the same machine. Then we'll also get Registry up and running and you'll be able to check your flow in. So with that being said, any final questions? We've got a couple of minutes left. Yeah, I got it. Oh yeah, what was it? I think it didn't like contains, but I switched it over to just use equals, with the colon, and it started to work. Oh perfect; yeah, the regex side of NiFi is something I've been complaining about forever, but you got it to work, awesome. Well, I think everyone got their flow to work; everything looks good to me. Feel free to continue playing around in this environment and building your flow. I'm going to shut all of this down in a couple of hours, upload some CSVs and zip files and such for us to work off of, and then tomorrow we are going to go into Registry, some additional flows, and we're going to start MiNiFi. If there are no additional questions, have a great evening; thanks for bearing with us on all the technical issues, I wasn't expecting Teams to kick us out as well, that's kind of weird. If you need anything, just let me know, and I will see everyone here at eight o'clock your time in the morning, 10 a.m. Central. You have my contact information, so if you have any questions, feel free to reach out. Sounds good, thank you very much. All right, see you guys, thank you.
on 2024-05-06
language: EN
WEBVTT We'll be providing content. So, yes. Okay. So, what I'll do is I'll create a status review of the meetings throughout the schedule. Kind of put those out to the team. We can make those as long as possible to make sure that as the content we develop, if any questions can get back to Mike with those questions. But I assume we don't have those questions anymore because nobody put them together in question. Well, I would get those questions asked rather quickly. And then it sounded like Smiley, I think his name, had some other information potentially. So, I noticed Mike just sent out some calendar invites. So, maybe we can actually get some additional information, but definitely get those questions out ASAP and get them answered. I think you're giving us two weeks. I think it's two weeks on this. So, and half that time is Red Team and others. Oh, absolutely. If I, my way of traditionally approaching these that works, it sounds any different than we don't have documents, right? That's not the case. We have a ton of documentation on this. That's why I was saying, Logan, I'm looking for the email now to send to you, you know, to, because I found some of the documentation just a couple weeks ago. Perfect, perfect. And I've got some more. So, those could be helpful. But the sooner you can get ahead of it and, you know, and get something out fast and say, okay, this is what I got. When do we want to make it go? Well, how do you want me to help? So, okay, well, I feel like this, you know, I do feel like we're running to talk to you about some, you know, and let it get released. Some, some of that type of stuff that we normally don't really, I mean, we talk about, but we don't just because of the way the proposals asking for that agility layer, for instance. You know, I don't feel like we don't have a lot of that type of content. And so, you know, you need to come up with some of that. So you don't think that. Okay. Whatever. Okay. I mean, whatever, Rhonda, like, okay, so I can, like, I don't like that response. Because, like, I'm just asking you a question. I think at some point, Josh, needs to lean on my engineers and not have to answer every single question. It's a high level question. But, but, okay. I don't, I get it. Okay. Let me see. Okay. I think it's suffice, right? I don't understand what was wrong with my question. I don't understand. I just don't understand it. Josh, you're absolutely right. Again, right? I'm never trying to put you down or put you on the spot. I'm always just trying to help you. Right? And you as my manager, that's one of my jobs, is to make sure I'm helping you. And that was all that was. But, all right, I need you to get a cigarette. Real quickly, Dylan reached out to me because he noticed I was in the document. And I didn't. And so then he started complaining about it. And then again, I'm hearing him helping you because he doesn't want to put all your fluff that he calls it in the document. And I'm like, well, Dylan, I don't have, for instance, a lot of these things set up. So we're going to need to know the prerequisites and things like that. So that's all I wanted to talk to you about with Dylan is, is he like, I didn't, I didn't reach out to him. He reached out to me because he can see I'm in his document. Right? Again, like hearing him trying to defend and help you and deliver your message to the team that, hey, look, this is why she's asking for this. This is why she's wanting this. And here is why I'm telling you we're going to need to do this. 
But again, I need you to go get a smoke or coffee. If you need anything else, just send me a message. Good morning. For those that are already on the call, we'll give it another couple of minutes and get started. Good morning. It's me, Ben, Aaron, and this room here. Awesome. Hello. Good morning. Really? Let me look. Same one. Yeah, so this is the same one; you can look in here. I think I see this link here. I'm shocked, it's really weird; it's asking for a password, because mine never does. But I have the teacher one; I don't know if that's any different. Yes, the participants were given a different set of passwords. Okay, log-in credentials, yes; that's why I always encourage everyone to make sure to write them down. We'll give Aaron a couple of minutes, and then we will get started. This morning we are going to be doing a little bit of a chat.
on 2024-05-06
language: EN
WEBVTT Okay. Well, according to my screen, everyone was able to log into their environment. Anyone having any issues? Perfect. All right, so if there are no issues, let's just get started and dive right in. If everyone could bring up their NiFi: what we are going to do is practice converting a CSV over to JSON, and we're going to use some controller services. I have the flow built. I'll give everyone just a minute to refresh their browser and get ready to go. Okay, so you should be able to bring your folder up and go into the NiFi folder, nifi-1.25.0; in there is a logs folder, and then there's the nifi-app log. That's the login. There you go, perfect, there's your password. Looks like you're getting logged in, Alyssa. Okay, so real quickly, if you can, create a new processor group. If you need help on that, I'm sharing my screen; this is the next flow we will build. So let me go ahead and bring down a processor group, and we are going to name this Get
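For anyone hunting for those generated credentials again: on a fresh single-user install, NiFi writes them into logs/nifi-app.log the first time it starts. One quick way to pull them back out on Windows (the exact log wording varies a little between versions, so treat the search string and the path as starting points, not gospel):

    cd C:\Users\<you>\Downloads\nifi-1.25.0
    findstr /C:"Generated" logs\nifi-app.log

That should surface the lines containing the generated username and password; you can also just open nifi-app.log in Notepad and search for "Generated".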
on 2024-05-06
language: EN
WEBVTT Looking good, Cody. Does anyone have any issues so far, or are we good to go? Pedro, were you able to copy your data flow into a new group? Perfect. Then go into the sample data and you should see an inventory folder and inventory.csv. My etherpad link isn't working; I just get a default message. It's Cody. Hey, Cody, I should have your voice memorized by now. That is strange; that's very strange, because I'm looking at it. Can you just chat it to me in the VM? Yeah, I can do that, the Dropbox link. Can I send it directly to you in Teams chat? The VM would be better if you can, because I have Teams on my government machine and I'm using the VM on my personal computer. Okay, let me put it in Teams, but I might be able to put it directly into the VM; let me see if I can take control. All right, interactive, copy link, there you go, and I'll download it for you too. Thank you. No, it worked; I was surprised it worked, because I have to go back, tell it I want to interact with your machine, and log in to it. Okay, so everyone should have that zip file; Cody, I've left your machine. If you can, extract it; it might be easier to put it on your desktop, but put it in a location you know about, because we're going to need to use our GetFile processor. Hopefully everyone has that; let me know if you run into any issues and we can stop and get it squared away. With that being said, inside that zip file there's a sample data folder, then an inventory folder and inventory.csv. So on your NiFi canvas, let's create a new process group, because we already have the data flow we worked on yesterday; we want to bring a new process group down onto the canvas so we can build this data flow inside it. If you look at my screen, I have a process group called "CSV to JSON demo data flow." You can name it however you want, but definitely something you can work off of. Our first step is to get the file from the directory, so you most likely want to start with a GetFile. There are a couple of different ways you could do it; you could list the directory, filter on it, and then turn around and get the file, but I find just leading off with a GetFile is the easiest method. The file we are looking for is inventory.csv. You can design this how you want; for me, I put it in my uploads directory, because the desktop environment we're working in has an uploads directory on the desktop that we can put files into, and for the file filter I put inventory.csv instead of picking up everything. Once we have that, we need to set the schema metadata on the flow file's attributes so that we can later understand which schema to use to process the data. So we want to create an UpdateAttribute, because once we get that CSV file, we want to tag the metadata with a schema name that will be used to read and to write the data. I know I have mine pulled up, but I'd love to see your own thought process as you put this in. So next we set the schema metadata.
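As a rough outline, this is the shape of the flow the exercise builds toward (the processor names are real NiFi processors; the directories and the "inventory" schema name are just the values used in this class):

    GetFile              Input Directory = <folder containing inventory.csv>,  File Filter = inventory.csv,  Keep Source File = true
      -> UpdateAttribute   adds   schema.name = inventory
      -> ConvertRecord     Record Reader = CSVReader,  Record Writer = JSONRecordSetWriter
      -> UpdateAttribute   sets   filename = ${filename}.json
      -> PutFile           Directory = <output folder>

    failure relationships from ConvertRecord and PutFile -> LogMessage (auto-terminated on success)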
On the UpdateAttribute, if you didn't know, you can just add a property; a couple of you were working with an UpdateAttribute yesterday. Go to the properties and add a property: you give it a property name, so we could do schema.type, for instance, and it needs a value, and I'll just put something in. I'm not going to use that attribute, but that's how you would add attributes to a flow. So we are getting the file and ingesting it; we are not even looking at the CSV yet, just the metadata. What we want to do is add an attribute that says schema.name is inventory. Once we have that, we need to convert the CSV to JSON, and the processor I like to use for that is the ConvertRecord processor. In its properties you're going to have a Record Reader and a Record Writer; these are controller services that we're going to set up. The Record Reader is a CSVReader, so you should be able to choose, say, an AvroReader or a CSVReader, and on the Record Writer we want the JSONRecordSetWriter. Let me know if you have any hiccups there. Once you have that CSVReader set on the Record Reader, you can actually go to the service; if the service isn't there, let me know, but once you put it in and hit apply it should take, and if it doesn't we can just create a new one. You can see where I have a CSVReader that I went in and created. This is the tricky part of this flow: we want to set up the first controller service, the CSV record reader, then a JSONRecordSetWriter, and we're also going to set up a couple of schema controller services. Okay, perfect, so you should have CSVReader on the Record Reader and JSONRecordSetWriter on the Record Writer. Just as a tip, change Include Zero Record FlowFiles to false. Once you save and say okay, you can go back into that configuration and go straight to the controller services; you should have the controller services listed, and you want to add a CSVReader, a JSONRecordSetWriter, an AvroReader, and an AvroSchemaRegistry. I'm pulling your screen up because it looks like you're in the middle of creating one. I'm trying to figure out what goes in those properties. Yeah, if you can, hit cancel. Okay, so we are at "get CSV file"; you named it and you've got it, but what we want to do next is an UpdateAttribute, because we want to tell NiFi which schema to use. So hit cancel and bring down a new processor, and the processor is an UpdateAttribute; you can just start typing "update attribute." Perfect, and then drag your success relationship down to it. Awesome, perfect. You can delete that other processor that's just dangling. Now go into your UpdateAttribute configuration. Okay, perfect, and on that you want to add a new value: that plus sign on the top right in your processor configuration, click on it, and the property name is schema.name. Say okay, and you want to put inventory as the value. Perfect, and say okay.
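Spelled out, the two processors configured in this step look roughly like this (the property names are the real ones in NiFi 1.25; "inventory" is just the schema name chosen for this class):

    UpdateAttribute ("set schema metadata")
        schema.name = inventory          (a dynamic property; UpdateAttribute copies it onto every flow file)

    ConvertRecord ("convert CSV to JSON")
        Record Reader                   = CSVReader             (controller service, created below)
        Record Writer                   = JSONRecordSetWriter   (controller service, created below)
        Include Zero Record FlowFiles   = false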
Your Cache Value Lookup Cache Size should be 100, Store State should be "Do not store state," and the other two fields are not required, so perfect, say apply. What we're doing here, and this is a great example, is bringing that CSV file in; we're not actually reading the CSV yet, so when the file comes in you'll be able to look at the attributes and see that you've now assigned a schema name to it. Now that you have the UpdateAttribute, we need a ConvertRecord processor, because we're going to work on converting the CSV to JSON. Just type "convert record." Perfect, drag the success relationship to it. An UpdateAttribute and a GetFile, and some of these others, are only going to have success as the next termination, right, because they just apply that attribute to whatever file comes in, so there really is no failure. Okay, now that you've got your ConvertRecord, we are reading in a CSV file and we are going to write out JSON. So on your Record Reader, instead of "no value," click the dropdown and then "create new service." No, cancel; that's the reader, not the writer. There you go, reader, select your dropdown. Why is that not coming up? Create new service; oh, in your dropdown, instead of an AvroReader you want a CSVReader. Perfect, say create. So we are reading in a CSV; for the Record Writer we are outputting JSON, so click new service, do your dropdown, create new service, and we're doing the JSONRecordSetWriter. Yep, say create, you got it. And for Include Zero Record FlowFiles, just say false, and say okay. What does "zero record flow files" mean? Good question; hover over that question mark right there. When converting an incoming flow file, sometimes the conversion results in no data, so if there's zero data it needs to know how to handle that. If the result is no data, where do you want to send it? If it's a zero-byte flow file, for instance, you probably don't want it to continue down the path; you'd want to send it to a different relationship, because there's some reason it didn't get converted. But leave it as false; we're not getting too advanced today. Okay, say apply. Awesome. Now go back into your ConvertRecord, and on your CSVReader there is a little arrow on the right side; click that. What that does is take you to your controller services. Once you're there, we want to look at our CSVReader; click on the gear icon, that is how we get to the properties of a controller service, and go to properties. Okay, the Schema Access Strategy: it defaults to Infer Schema, but we want to use the Schema Name Property, so click that and choose Use Schema Name Property. Remember, we set that schema name in the UpdateAttribute. Okay, perfect, so say okay. Next is the Schema Registry, so on the Schema Registry, use the dropdown.
Create a new service, and we want to use an AvroSchemaRegistry; say create, and click create. Awesome. Okay, so the Schema Name property: we already set it, but double-check here that it matches the attribute you applied in the UpdateAttribute. schema.name is what we put in the UpdateAttribute, right, Pedro? Yep, schema.name, that's the same property we used in the UpdateAttribute. Awesome. So say okay there, then scroll down and let's see if there are any other settings. We're going to use the Apache Commons CSV format, nothing custom; the value separator is a comma and a newline ends a record. I think we're good there, say apply. Okay, so it's still in an invalid state, because we specified an Avro schema registry that isn't set up yet. So go into the gear icon again, and you see where it says AvroSchemaRegistry with an arrow; that's the schema registry it's using. Click the gear icon on it, set Validate Field Names to true, and here is where we supply our schema for when it reads the CSV file. Instead of you writing a brand-new schema, let me paste one in. Are you able to get to the etherpad, Pedro? Actually, here, let me put it in the Teams chat as well to help everyone else, so you don't have to write your own. Perfect, I'll put it in chat, and you want to copy that. What we're doing is providing a schema so it can read that CSV in; the controller service looks at this schema, and that's what it applies to the CSV file in order to write it out as JSON. So in your controller service, hit the plus to add a property. No worries, this is a very difficult hands-on exercise, so I completely understand all the going back and forth and copying. The schema describes what the data should look like when it's written out to JSON. Perfect. Pedro, I didn't catch your background; real quickly, what is it you do? Oh, then this is right up your alley. I have everyone's background, because we have some sysadmins and we have some developers. So you want to create the property; did you get that little JSON block I sent you? I can paste it in, I don't mind pasting it at all. If you can, though, go back to your controller services, hit the plus, and the property name is inventory; say okay. I'm going to paste the value in; it won't let me, so just put the number one in there and I will come back in and erase it, and say okay. Okay, I'm going to come into your instance where I can modify it. Does the property name have to match the schema name? It does, it does, good question. Okay, that didn't work; perfect, there we go. I'm going to exit back out of here, Pedro, and go back to view-only, because I don't want to mess with what you've got going. So the property "inventory" needs to match, and we've got it. And if you look at the value, so Pedro, click on that inventory property and look at the value, you can see we have put our schema into this controller service. When it reads that CSV it's going to pull all of that in and create a JSON document based on this schema and the CSV data. So go ahead and say okay and say apply.
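The actual schema was shared over chat and isn't reproduced in the transcript, so here is an illustrative stand-in to show the shape of what gets pasted into that "inventory" property on the AvroSchemaRegistry. The field names below are made up; the real ones have to match the column headers in inventory.csv:

    {
      "type": "record",
      "name": "inventory",
      "fields": [
        { "name": "item_id",    "type": "int" },
        { "name": "item_name",  "type": "string" },
        { "name": "quantity",   "type": "int" },
        { "name": "unit_price", "type": "double" }
      ]
    }

The dynamic property name on the registry ("inventory") is what has to line up with the schema.name attribute set earlier; the record name inside the schema matters less for this flow.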
I've got a question: do we have to manually create that schema when we're bringing the file in? That is a great question. Yes, you will have to manually create your schema. Now, the beauty is that NiFi expects an Avro schema, so if you already have an Avro schema you're already ahead of the curve. But yes, you will have to create a schema when you want to do things like we're doing here, reading in one format and converting it to another. There are capabilities out there for auto-inference, but NiFi would not be able to understand how you want that data written back out. We have the CSVReader, and we're using the Apache Commons library to parse the CSV, but without that schema we wouldn't know how to map the columns to the JSON fields that we need. So yeah, unfortunately you're going to have to do that. Okay, and then let's go to the AvroReader on the first row, and go to the gear. Perfect; it should automatically be set up to use the embedded Avro schema with a cache size of 1000. There are capabilities here to use external Avro schemas; there are all kinds of capabilities within NiFi depending on your setup, but here we just want the internal Avro schema we created, so go ahead and say okay. Okay, now that we have that, I do see an invalid mark on your CSVReader; hover over the little yield sign: it's disabled. The only issue I'm seeing so far is that the AvroSchemaRegistry is disabled. So let's start with the AvroReader and enable that service; use the little lightning bolt, Enable, and use the dropdown box: there's "service only" and "service and referencing components," right there. This option gives you that capability: you can enable just the service, or you can enable the service and have any processors and other services that reference it be enabled as well. So say Enable at the bottom right; it's enabling this controller service and enabling references. Say close. All right, now we want to go to our AvroSchemaRegistry and enable that one, same thing; you can see how it's referenced by the CSVReader. Say enable, say okay. What that message is telling you is that it couldn't enable the actual processor, but it did enable the referencing controller service and your AvroSchemaRegistry. Go ahead and say close; we'll look at why the processor is not valid in a second. It's probably because you need to enable your JSONRecordSetWriter, so same thing: say okay, say close. Awesome. So now we have gone through and created our controller services: the AvroReader, the AvroSchemaRegistry, the CSVReader, and the JSONRecordSetWriter, and they're all enabled; the state is enabled. So hit X, and let's go into our processor again and look at the yield sign on the ConvertRecord. Oh, we need to finish our flow, right; we can't turn this on because the ConvertRecord has nowhere to go. So after we have picked the file up and converted it to JSON, we want to update the filename attribute so we can write it back to disk, because right now it's creating a whole new flow file, a whole new document. So go ahead and bring down an UpdateAttribute again. Perfect.
And so on the UpdateAttribute, let's go ahead and configure it. We're going to give it a file name, so hit the plus to add a new property, and the property name is filename. Say okay. Here we're using the NiFi expression language, so you want a dollar sign and an open curly bracket, and it automatically puts the closing one in for you, and then you just put filename inside. Awesome. Then after the closing curly bracket, put .json. There you go, say apply. For this one we do have the capability for failure, unlike the UpdateAttribute and the GetFile. On success from the ConvertRecord we want to go to that UpdateAttribute, so drag down to it and pick success. The ConvertRecord also needs to know what happens on failure, and what I like to do, and it's up to you all how you handle these things, is log everything. So on a failure, let's log that error message; bring down a new processor. Oh, I like how you were going to put the failure back on itself. You want a LogMessage processor. Perfect. What I like to do, and you can see it here, is put my log messages out to the right of the flow; if I'm working straight down, I know that's my success path, and my log messages sit out to the right, whatever makes sense to you. You can reuse this LogMessage processor over and over, and we're going to do that here as well. So for failure, send it to the LogMessage; say add. Awesome. The LogMessage now needs to be configured, right, so go into it, go to Relationships, and auto-terminate on success. There you go, hit apply. What's happening is that the message will now be sent to the logs, so if you were tailing the log, tail being a Linux command, and an error came through, you would see that error message come through; you can also pull the history to see what the error was. So our ConvertRecord has now gone from a yield sign to stopped. You can leave the LogMessage out there, that's fine. Now we need to continue our flow: we have the UpdateAttribute where we added a .json file name, and now we want to do a PutFile, because we now have a new JSON document, we have it named, we have our schema set, all of these things. So yep, UpdateAttribute on success, and we want to put the file. On this one, if you want, you can write that file right back to the CSV directory, and I say that because mine is only picking up CSVs, so any JSON written back to that directory will not be read back in when it comes time to pick files up again. So figure out where you want to put the output. And do remember, this is for everyone, to keep the source file on your "get CSV" if possible, just because your flow may not be correct the first time; once it runs through, you'll want to run it through again. Does that look right? It does, it does. Go ahead and say apply. Then, from your PutFile, let's drag another arrow to your LogMessage, and we want to do a failure on that one.
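Written out, the rename step just described is a single dynamic property on that second UpdateAttribute (filename is a standard attribute that already exists on every flow file):

    UpdateAttribute ("set JSON filename")
        filename = ${filename}.json

The expression reads the existing filename attribute and appends .json, so inventory.csv becomes inventory.csv.json. If you wanted a clean inventory.json instead, something like ${filename:substringBeforeLast('.csv')}.json would be one way to do it; that variant is my own illustration, not something set up in class.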
So you have, I think, auto-terminate enabled for the PutFile on failure, but in case the hard disk fills up, or if this were a network share and it became unavailable, there are a ton of different reasons why we might not be able to write that file, and you want to log that message. So go ahead and say add. Perfect. What we've done is create this flow so that any errors we have get pushed to the LogMessage, and if we added other pieces and there was an error, we could reuse that same processor over and over again. I've seen folks set up a whole process group just for handling errors, with advanced filtering and things like that, and then have ten or fifteen other data flows utilizing that error flow. It's just something to keep in mind as you're developing and designing your data flows. Okay, let's see if we can run this; I don't see any errors yet. Pedro, if you can, just run through it one time. Actually, let's make sure we configure it first and that we're keeping our source file, because NiFi likes to default that to false. Okay, you've got true, perfect. All right, say cancel; that was just to double-check. Say Run Once, then refresh. Awesome. Let's take a look at your queue to make sure that is the file you're expecting; here we are testing out our flow and making sure it looks good. So say List Queue, and do you remember how to view it? No? Go to the record, perfect, view content. Yesterday when I did a view content I couldn't, because it was a zip file and there is no viewer built into NiFi for zip files. Luckily this is a CSV; NiFi understands CSV and has a viewer built in. The file name is inventory.csv, and there are eleven lines: ten rows plus a header, and a bad data row just to throw us off, too. So exit out of that view, just close the tab. Awesome. If you want, look at the attributes: scroll all the way to your left to the little icon, there you go, and Attributes. If you remember from yesterday, this is how we were picking up data and looking at the attributes. If you scroll down, do you see anything in there about schemas? No. Perfect, because we have not sent it through that UpdateAttribute yet. So say okay and exit out of that. Sorry guys, my network dropped, I was out for like ten minutes. Oh wow. I'm using Pedro as an example to walk through this flow, but we're still building it, so if you have any questions just let me know, Brett. Okay, thank you. Okay, so we have now got the file; let's do the UpdateAttribute, so just run that once, click Run Once. There you go, and it should show up in success. Awesome. Let's take a look at that queue, look at the attributes, and see what has changed. Scroll down. Perfect. You remember in the UpdateAttribute we added a new property, schema.name, with the value inventory, right? So now it should be coming together: we created that controller service and told it to use the schema.name property, because you may have hundreds of different schemas, and the name here is inventory, which is the name we gave it in our controller service. So this looks good so far.
So we'll say okay and exit out of that. Perfect. Now here's the heavy lifting, the ConvertRecord; let's run it once to see if it actually works. Oh, we have a failure. What is our failure? "Cannot parse incoming data: error while getting next record." Oh, that bad row we have in there. Well, it should parse each row, create a JSON document out of each one, and ignore that row. Let me take a look at this real quick, because mine is up and running. Let's go into your ConvertRecord and look at the properties. You have CSVReader, you have JSONRecordSetWriter, you have false. Click on your CSVReader; no, click on the arrow to take you to the service. Okay, that looks good; let's view the configuration to make sure everything is configured properly. There you go: we used Use Schema Name Property, the schema registry is the AvroSchemaRegistry, we do have the schema name. Okay, that looks good. Let's look at our schema registry, so click on the arrow for that; go back up, scroll up, the little arrow there, perfect. Okay, in the schema registry we have inventory, and we put the schema in there. Click on it and make sure it looks good. Okay, that looks good. Again, the AvroReader: click the gear icon for that one. Use Embedded, you have that correct, and you have a thousand; okay, now say okay. Then on the JSONRecordSetWriter, let's see what that configuration is; one second, I'm pulling mine up to make sure it matches. Oh, we did not configure this one. On the Schema Write Strategy, we want to change that to Set 'schema.name' Attribute. Yeah, close that, and you see up top it says disable and configure; what it's doing is disabling the services and the processors associated with it. All right, so the Schema Write Strategy: we want Set 'schema.name' Attribute; I thought we did this. Say okay. Awesome. Then the Schema Access Strategy: we want Use Schema Name Property again. Awesome, say okay. The schema name itself is already applied through the registry, so click Schema Registry, and I bet you know this one: it's the AvroSchemaRegistry we've been using. Perfect, say okay. Then Pretty Print JSON: let's just make it pretty, so say true. And it should be Never Suppress. Okay, say apply, and now we want to enable that, enable everything. Awesome. That is actually a very quick test: if your processors are completely configured, it's a good check to see whether everything is set up appropriately, because it will enable the processor; if it refuses to enable, it's usually because you're missing a termination or you've got a misconfiguration. So say close, and then exit out of that. All right, actually, let's stop the ConvertRecord, because it started when you enabled that service, and let's see if we can run this one more time. So Run Once at the very beginning, at the very top. Yes sir. I know I'm working with Pedro here, but while he does the Run Once, does anyone else have any questions? This is a very difficult flow, and that's why we're doing it. Perfect, and hopefully we didn't lose Brett. No worries; like I said, this is one of the more advanced data flows we want to do, and it starts getting you into controller services and those types of things.
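For reference, the writer settings that just got corrected, as they appear on a NiFi 1.25 JSONRecordSetWriter (the registry and schema names are the ones from this class):

    JSONRecordSetWriter
        Schema Write Strategy   = Set 'schema.name' Attribute
        Schema Access Strategy  = Use 'Schema Name' Property
        Schema Registry         = AvroSchemaRegistry
        Pretty Print JSON       = true
        Suppress Null Values    = Never Suppress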
So we have a lot of time allocated for this. Okay, possibly; I'll take a look at yours in just a second. Pedro, go ahead and do the Run Once, and let's see if we can get that to success, and then I will pull Brett up. Okay, let's see if our ConvertRecord will actually work this time. Run once. It did not. What is our error? "Failed to process." Why are you... one second, I'm looking at mine, because we are working off of the exact same data: bad data row, same header. One second, Pedro, I'm looking at mine, which worked. Oh, I bet I know what we did not do. Can we go into the ConvertRecord, and then for the CSVReader go to that service and click the gear icon; no, cancel that, there you go, use the arrow to go to the CSVReader and use the little gear icon. Scroll down. Allow Duplicate Header Names is true. Oh, Treat First Line as Header: that needs to be true, and remember we have to disable and configure to change it. Awesome. Okay, there we go, apply. It couldn't parse because it didn't understand what that header row was. So go ahead and enable it, and hopefully the ConvertRecord will turn valid. Perfect. All right, let's do a run once all the way through and see if it works this time. Refresh. Ah, success, great job. Okay, so if you can, let's go through and clean this process flow up: let's get the naming in, let's get some labels, those types of things. Here is an example of what my flow looks like, so go ahead and update yours; I need to delete mine because I was using it as the example. If you have any other questions, Pedro, just let me know. Brett, let's go back to you. Did you test yours out, Brett? Yeah, it doesn't look like this first one has any in or out. Okay, I'm guessing it's just not picking up the file. Perfect, let's show the configuration. If you don't mind, let's stop all your processors for right now, just because we need to stop them to configure them anyway. Beautiful, and a great way to use the Operate palette on the canvas. So go to that Input Directory and look at that value; go back to your NiFi canvas and, under that value, copy that location: just highlight it and say copy, or use Ctrl+C. Okay, that works. All right, then hit cancel; but don't... yeah, there you go. Then go back to your file browser, and in your address bar, paste that. It didn't take; do it again, Ctrl+A, Ctrl+C; I always hit Ctrl+C ten times because it doesn't take the first time. Yeah, the desktop environment will sometimes do that. There you go, hit enter. Here, I'll go up a directory so it's obvious. Oh, it's there, the CSV file is there. Okay, say cancel. Oh, your file filter: let's just stick with inventory.csv for now; the regex pattern may be off, so change it to inventory.csv. Run once and hit refresh, just right-click on your canvas. That works; well, I witnessed you set that path correctly. Right-click on that "get inventory file." Oh, it went, I see it, success. Okay, perfect, and then if you want, just do a run once and let's see if it goes all the way through; you have to do each individual one. Wouldn't it be cool if you could just hit Run Once on the whole process group? Okay, I'll submit that to them and see if they can get it into the next version.
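The reader settings that finally made the conversion work, as they look on a NiFi 1.25 CSVReader (again, "inventory" is just this class's schema name):

    CSVReader
        Schema Access Strategy       = Use 'Schema Name' Property
        Schema Registry              = AvroSchemaRegistry
        Schema Name                  = ${schema.name}     (the default, which is why the UpdateAttribute upstream matters)
        CSV Parser                   = Apache Commons CSV
        Treat First Line as Header   = true               (the setting Pedro was missing; the default is false)

Controller services have to be disabled before you can edit them, which is what the "disable and configure" prompt in the UI handles for you.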
Like I said, I know all those guys, so I submit feedback all the time. All right, here's where we usually see our highest failure counts: on this ConvertRecord. So go ahead and run that once. Oh, whoa, whoa. If it fails, what I recommend is logging your failures. You see that red box in the top right? What happened is the failure was auto-terminated, and that's it; it won't go into any logs or anything. So what I like to do is add a LogMessage. This is what I was saying earlier: I like to have a LogMessage on my data flow, and anywhere a failure can terminate, I always drag my failures to the LogMessage so I can also see the file. You may have a file come in that isn't recognized and goes to failure; then you can view the queue and see why. Maybe it picked up an inventory CSV that had bad values. If you auto-terminate, it takes away some of your options, and that's why I always recommend a LogMessage.

Okay, so you can go ahead and auto-terminate on that LogMessage: configure it, then Relationships, auto-terminate. Perfect, say apply. Awesome. Your LogMessage is fine just set out to the side; you don't want to drag it to anything. You can drag a failure back to the processor itself, and if you have a failure it will reprocess that data. We see that when you have a processor that can take a little time. Say you have something resource-heavy, like unzipping a large file, where you know you're mostly getting a bunch of small files behind it but every once in a while a large one comes through; you can loop the failure back onto the processor so the flow file goes back into the queue and gets retried.

Let's go to our convert CSV to JSON ConvertRecord and configure that. Awesome, go to Properties. Okay, we're reading with a CSV reader, so let's go to that controller service; hit the arrow on the right. Let's hit the gear and see what your configuration looks like. Go to Properties, scroll all the way down... okay, you do have Treat First Line as Header. Let's go all the way up to the top. The Schema Access Strategy is on Infer Schema, and I don't think that's correct; we want "Use 'Schema Name' Property". Remember, we set the schema name in that attribute, so now we're telling NiFi not to infer the schema, we're specifying by name which schema to use. You got that. And then Schema Registry, there's no value there; we want to set that to the AvroSchemaRegistry. Awesome, say OK, then click apply.

Then let's go to the AvroSchemaRegistry we just set up and click the gear icon. You've got the model in there; that's beautiful. All right, say OK. I think maybe the only issue was that you just didn't reference the schema. Now let's look at your Avro reader; it should just be Use Embedded Avro Schema. Yep. Perfect. Okay. And real quickly, let's look at the JSON record set writer. You should have "Set 'schema.name' Attribute". No value.
Use 'Schema Name' Property. And what is the schema name property? We set that in that UpdateAttribute. There you go. And you've got the AvroSchemaRegistry. Say OK. I think yours is going to work now, so just try a run once and let's see how far we get.

While he's working on that, let's see how everyone else is doing. Cody is cleaning his up. Tyler is cleaning his up. Looking good. Avro too. Okay, so you were able to get the file. Let's run it once to set the update attribute. Okay, run once, hit refresh; right-click and refresh on your canvas. There you go, success. So you're updating the attribute and then putting the file back to disk. If you can, let's go ahead and clean this up: names and labels, those types of things. Make sure the file is being written out. I would add another failure route on the PutFile, just because you want to log that message if it has an issue writing the file. As I mentioned earlier, when you're putting a file that could be many things; the disk could fill up or something like that. And if you have some sort of logging mechanism set up where you're pushing all of this to Prometheus and Grafana, you can take a look at it there.

Okay, Brett is squared away. So it got output as a CSV? Oh, I thought we changed that. That might be another one you missed. Do you have an UpdateAttribute after your ConvertRecord? Yeah. This is, I think, where it cut out for you. No worries, we've got you; I'm pulling yours back up. Okay, so we have the UpdateAttribute where, coming out of the conversion, you added a new attribute for the file name, and the value should be the file name with .json on the end, like you have. Perfect. Okay, say apply.

So is there a different attribute for the file extension? Because I didn't set that, and it automatically just gave it CSV. Yeah, because it's using the filename attribute that was originally on the flow file; PutFile needs a file name to write, so it wrote it out with that. If you want, do a run once but turn off that UpdateAttribute where we set the file name, and we'll show you where it happens. Stop that UpdateAttribute and run it one more time through. Oh, you have an error on your PutFile. Yeah, it's the same file name, so go ahead; you've got that stopped. Perfect. Empty that queue. Yes sir. And let's run it once all the way up to the UpdateAttribute. Let it convert, and after it converts let's look at the attributes. It already went through set name. Your other file that was in the queue has it too, but that's okay; you should have two in the queue after set name. Okay, success.

Right before there, let's look at the attributes of that flow file. Perfect. Do you remember where it's at? So you see the file name. It's going to write that out until we update that attribute, and what we want to do is update it to inventory.json. PutFile is going to use that attribute to write the data back out unless you update it. Say OK and exit out of that. Now let's look at your UpdateAttribute properties. You see we kept the file name and we added .json, right? Okay, then let that run real quickly. That's going to come out as inventory.csv.json.
Correct. Yeah, we didn't strip the .csv, so you should now have an inventory.csv.json. Now, you can update that attribute with some regex and change the file name altogether. If you want, you can go back into your UpdateAttribute and apply some regex to change that file name to whatever you want; right now it's just using the original file name. For the sake of this demo, could you just hard-code it to inventory.json? You could, or if I had the regex handy you could use that. I do not have it handy, but yeah, just set it to inventory.json since that works for you. Oh, you already have a file in your queue, so go ahead and run once, unless you just did. I did just do it. What's our error? It's going to complain that the file name already exists. But that's actually a good question. What I like to do is update the file name and give it something like a UUID and a new file extension. In the UpdateAttribute you can set the file name with an expression: filename equals a dollar sign, curly bracket, your expression, close the curly bracket. Hang on, let me double-check my notes, because if you just use the filename it's going to pull in that .csv, and we need to get rid of the .csv. I'm looking at the regex patterns on the website. I mean, if it's typical regex, it would be something like beginning of line, any number of characters, then a literal dot. Right, and you can also do something like a date: use now(), then the format in parentheses, close your curly brackets, then .json. What that's going to do is give you a timestamp for the name, a timestamp dot json. That's a good approach; run that once and see if it works. It looked okay, but we'll see.
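For reference, here are a couple of expression language options for that filename attribute in the UpdateAttribute. These are sketches and you would want to test them, but substringBeforeLast, replace, now/format, and UUID are all standard NiFi Expression Language functions:

    ${filename:substringBeforeLast('.csv')}.json     strips the .csv and keeps the base name
    ${filename:replace('.csv', '.json')}             swaps the extension in place
    ${now():format('yyyyMMdd-HHmmss')}.json          timestamp name, avoids the duplicate-file error
    ${UUID()}.json                                   unique name every time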
How do I get rid of these errors here? The little red box in the top right goes away after about five minutes. It's annoying; you can be testing and running through these things quickly and the bulletin is still sitting there. But I think you're off to a good start, and I'll look at some patterns. One more quick question: can I clear out all the queues at once? Go back; use your breadcrumb and go back to the NiFi Flow main canvas. There you go. Right-click on your process group and say stop, then right-click again and empty all queues. Oh, okay, cool. Perfect.

Okay, I know we're still working on this, but are there any additional questions? I've got a quick question. Say it does error out; what happens to the log? Does it become a file, or does it just log somewhere? It does, and that's the reason I recommend you use the LogMessage. What it's going to do is push that to the NiFi log; for instance, the same log you pulled up to get your username and password. So it will log that message to the NiFi logs: if you go into the logs directory, it writes to the nifi-app log. No, that's a great question. And again, it's one of those design patterns: you may not want to log everything, but I log every failure, and I even use the LogMessage for success sometimes, because I want to see certain aspects of that flow file. Any other questions? And Brett, if you can... perfect, perfect.

I hard-coded my UpdateAttribute to inventory.json, but it seems like it's still wanting to output as a CSV. Is it outputting the name as inventory.csv, or the contents as CSV? Okay, let me take a look. I'm sorry, I didn't have your screen pulled up; who was speaking? Cody. My goodness, Cody, I should know your name by now. Say cancel. All right, so you are driving. Oh, I do have it on the interactive session, so hang on one second. There you go. Perfect.

All right, so you have your convert to JSON. What step is failing? Oh, you're getting failures like crazy; that was failing because it's the same file name. Okay, and then update file name, success, put, and then failure where... So, the file that's being written: can we look at it? Yeah, it's this guy right here. Can you open it up? Let's make sure it's a JSON document. Awesome, it is. Go ahead and close that out, and then your update file name: stop it and let's look at that property. Yeah, you're just putting in the new name directly, but we actually need to put it in the format NiFi's expression language expects. You want to use the dollar sign and curly brackets; ${filename} is the expression NiFi understands, then close the curly bracket and add .json. Yeah, I had that originally and I was still getting a CSV. It should come out as inventory.csv.json. I'm going to come up with a regex pattern to change that properly, because I think Brett also has the same question. But can you run that so we can see, and then we'll take a quick break? On your PutFile, okay, there, just run once. Okay, it's still writing a CSV. All right.
Okay, perfect. So if you can, stop all your processors; I just showed Brett a little trick on how to quickly stop them all. Go right back to your main canvas using your breadcrumb trail at the bottom. There you go. Right-click on that process group. All right, success. Let's take a look at that update file name, and at the queue right before it; the queue all the way to your left, and let's look at the attributes. I can already see filename there. Okay, filename is inventory.csv. Perfect, say OK and exit out of that. Now run the update file name one time and refresh. We're not going to put the file yet; we're just going to run it once. Yep, hit refresh. Success. Let's look at that queue; list the queue. And why is there a... okay, hit that. Filename. Say OK, exit out of that. And you're using an UpdateAttribute, right? Say configure. Oh, you capitalized the file name attribute. It's lower case, so delete that altogether and add a new one, filename in lower case, because that's the property NiFi uses.

I'll give another minute for folks to return. But, you know, great job to everyone getting this flow working. That was the reason I chose this flow to go through this morning: it really shows off some of the controller service capabilities. It's more than your standard NiFi flow of just picking a file up, unzipping it, and putting it somewhere else. We really started getting into controller services, as well as schemas, record readers, and record writers, and you'll see that a lot of your data flows will utilize these services. The beauty of that controller service is that we can now reuse those components. We can reuse those controller services if need be; we can reuse that schema. So if we had CSV coming in from somewhere else, we could reuse a lot of the same data flow, and that makes things a lot easier for us. There are even more advanced capabilities under the hood when you start diving into this. The ability to have...
...You've added color to your processors like I did. Good job; I didn't explain that, so good job finding that out. You're bending your arrows, you're beautifying this. I think it looks great, and I really like the color scheme. You can go through and add labels like I did, and if you wanted to, you could group these under a label. But overall this looks great; if you have any questions, let me know.

Amanda also added some colors to hers. The only thing I'd suggest, Amanda, is the left-to-right layout. You have the GetFile for the CSV going down to the update schema. You may just want to put these processors in order from left to right, and then go downward to your log messages; keep your failures and your successes flowing left to right, and use any downward flow for your log messages. But it looks like you've got it; you understand. That's the only tip I would give. And finally, Alyssa: perfect. You've got naming; if you want, you can name your connections, apply some labels, change the color of your processors, those types of things. But yours looks good as well. Hopefully you didn't have any issues; if you do, let me know. But no, great job.

Okay, so any final questions about your data flow? Again, we used this to pick up a CSV file. We created a controller service, and then we used that service as a CSV reader to read the data in. We created an Avro schema that we just copied and pasted in, but if you were creating a schema, you could paste your own in. Then we used that schema to run the CSV file through and create a JSON document, pushed that back to the file system, and renamed it to .json, even though it still had the .csv in the name. During lunch I'm going to work on mine and come up with a regex pattern that really cleans that name up, and I'll paste how to do that in chat for those that want to update their file name. So, any final questions before we move on to NiFi Registry? Go ahead.

So on the GetFile, I noticed there's a little symbol on it, like a shield, half red, half white. Did you cover what that meant yet? Let me see here. Right here, is that the one you're talking about? Yes, there's that little shield; I'm just curious what it means. So that red and white shield in the UI indicates a restricted component. These processors can be used to run unsanitized code or to read and write data on the host, so it's a quick visual cue that it's risky. Risky, bingo, that's the word I'm looking for. If you notice, it's on the GetFile and the PutFile, because the risk level is a little higher there. I figured that's what it was. So if we were going to run, say, some Python code, would it also have that symbol? It could; processors that are able to run unchecked code may have that as well. Most likely it does. Actually, let's just look at it and see.
We can drag a processor down to run Python code. Come on, latency. Yeah, you see the shield is right beside the scripted ones: ScriptedValidateRecord, InvokeScriptedProcessor, ExecuteScript, those types of things. I knew it was some sort of security thing. Yeah, it's the riskier stuff. A lot of these are where NiFi interacts with something outside of itself. If it stays inside NiFi, if we're creating attributes or working with data, those types of things, or if there's a secure connection, then we won't have that shield. That's a great question; I've actually never been asked why there's a shield. But yeah, it's a restricted component: a processor that can run unsanitized code or change or read data on the host system. Great question, though. Okay, any other questions about our flows?

I had a question. So I know you mentioned that we have to create our own schema for the files. Yep. Well, what if, for example, in our office we have a bunch of CSV files that don't follow a very straightforward comma-separated format? How can we come up with a custom schema? So what I like to do in those cases is read the CSV in, and you can actually do a lookup on that CSV as well. You can use a regex pattern; in the past, what I've done is use a regex pattern to extract the data out of the CSV files, and I'll take that data and push it out as attributes. For JSON, for instance, I could read that JSON with EvaluateJsonPath, change the destination to flow file attribute, and now I can read that JSON in, take the fields, and go through and start defining properties with regex patterns or other expressions to pull the data out and create attributes with it. And once it's an attribute, I can do a lookup if I need to, match it against some other data if I need to, and then turn around and write it out as proper JSON or proper CSV. So there are a couple of different ways to handle a mix of data. But yeah, if it's a lot of different formats, you're looking at doing regex patterns to extract that data, or using the CSV reader in that controller service.

Remember in the CSV reader, and this is a really good question, so let's go into it: we had things like the record separator, the value separator, those types of things, right? So you could actually read the contents of a CSV file, determine the separator or something else that distinguishes it as different, and route it to its own CSV reader, where some of these settings, like the quote character, may be different. Instead of a comma it may be a tab, a TSV, so you may have a tab in here.
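As a sketch of what that might look like, here is a second CSVReader service dedicated to tab-delimited files. The property labels are from memory, so double-check them against your version before relying on this:

    CSVReader (named something like TSVReader)
      Treat First Line as Header : true
      Value Separator            : \t    (a tab instead of a comma)
      Quote Character            : "
      Schema Access Strategy     : Use 'Schema Name' Property

A RouteOnContent or RouteOnAttribute in front of the ConvertRecord processors can then decide which ConvertRecord, and therefore which reader, each file goes to.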
So you may have a few different CSV reader controller services, and you read the contents of that message or piece of data and filter and route it to the correct CSV reader. Does that help answer your question? If you have a better example of the data itself, I can help you quickly, even just sketch out a flow. Yeah, that definitely helps a lot. I can show you what I have, but I don't think it's pretty. No, no.

Another quick thing, since I'm a programmer: is there any way we can embed code in a data flow, or no? So, do you like Python? I mean, I know everyone works with Python. I haven't worked with it in a long time, but I'm trying to figure it out. You could do an ExecuteScript, where you send that data to it and execute a script on it. They're starting to move away from some of the scripting languages that were used, just because of security risks, but you can do Clojure, for instance, or Groovy in this processor, and you can still do Python; they're just letting you know they're phasing it out. Are they getting rid of Python? No, not at all; they're only getting rid of it through that particular processor. In the newest version, NiFi 2.0, you can actually create Python processors. Remember, yesterday we went over what a processor is: basically it's packaged as a Java NAR. So instead of building a NAR with your own custom Java logic, you can create a Python processor that accepts the incoming data, and in Python you might have a script that parses it and pushes it out as an attribute. You do have that capability. You also have InvokeScriptedProcessor and ScriptedFilterRecord, even ScriptedPartitionRecord; there are a few different ways you can run your own code to do these types of things.
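To give a flavor of what an ExecuteScript body looks like, here is a minimal sketch using the Jython engine. The session, REL_SUCCESS, and REL_FAILURE variables are handed to the script by the processor; the attribute name record.source is just a made-up example:

    # ExecuteScript, Script Engine set to python (Jython): minimal sketch
    flowFile = session.get()
    if flowFile is not None:
        # read an existing attribute, then tag the flow file with a new one
        name = flowFile.getAttribute('filename')
        flowFile = session.putAttribute(flowFile, 'record.source', 'manual-download')
        session.transfer(flowFile, REL_SUCCESS)

Anything much heavier than that, and on NiFi 2.0 you would probably write it as a proper Python processor instead.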
Awesome. And if you get a free second, just send me a couple of those CSV examples in chat, and later tonight I can whip up a quick example. I'm not going to build the flow front to back, but I can show you how I would do it using good design principles. Yeah, we mostly deal with financial data, and it comes from a website where we have to manually download the files, so it comes with a lot of extra tags and stuff we have to strip out. Yeah. If you noticed in the downloads folder, one of my other hands-on exercises is actually pulling from a website. How do you download it; you manually download it, you said? Is it like a zip file every day? It's a combination. It's a long process, but it comes in different CSV files, or some of them have to be converted into Excel files. Oh, okay. I was thinking if there were an API you could hit that gives you the file, you could automate it. Certainly, and we've been asking for that for many years. Yeah, so it sounds like the source of the data isn't as automated as you would hope, but if they do get to that point, just remember we have HTTP processors as well: you can handle HTTP requests, you can invoke HTTP, you can listen or post. There's a lot of capability there. So just FYI, if your source gets to the point where you can automatically pull, you may set up a data flow to pull once a day, take their output, filter, sort, and parse it, put it back together, and you even have the capability to export as Excel.

How about SQL, MSSQL? We work with SQL a lot. Yeah, if you work with MSSQL, for instance, and I know you have Access and things like that, you have PutSQL, QueryDatabaseTable, QueryDatabaseTableRecord, QueryRecord, ListDatabaseTables; there are tons of SQL capabilities. Does that help, Pedro? Yes, sir, I appreciate it. Okay, perfect. All right, any other final questions before we move on?

Awesome. So, next: we've now built a couple of data flows, and we do not want to take a chance on turning a flow on, ingesting our own NiFi instance, and breaking it; maybe it crashes or something like that. So let's start talking about NiFi Registry. If you can, go back to your downloads folder; we are going to install NiFi Registry. To do that, go to the nifi-registry bin zip, which is right here. Go ahead and extract it, but don't run it yet, if you don't mind. Extract it into its folder, just like you did with NiFi. I'll give you a minute on that.

Okay, so you should have that extracted. Go into the nifi-registry folder. I have already started mine, but it's the same principle as NiFi: you'll have some folders already there, and when we start it, it creates the rest of the folders it needs. The main folder we're concerned about with Registry is the conf folder, so go into your conf folder and look at nifi-registry.properties. Registry is pretty easy compared to NiFi; there are a lot fewer properties, thankfully, and it's a little easier to configure, but we'll go through some of the more advanced configuration options.

Your web properties: you should be able to see those. We're basically listening on 0.0.0.0, so it's going to bind to all hosts, and the port is 18080. You can change that here, but I would leave it the same, because mine is set up for that and we're going to work off of it. There's also how many threads the Jetty server should have; for more advanced sysadmin-type tuning you may want to adjust those, and there are a lot of tuning and performance considerations if you're building this out in a more scalable, production way. Security properties: here's where you'd set your keystore and truststore, and you can reference the authorizers.xml you may have created with NiFi. For this scenario we do not have security enabled for the Registry; it's pretty much wide open, and our NiFi is just the single-user sign-on as well. There's a lot there you can configure.
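For reference, the web properties we are leaving at their defaults look roughly like this in conf/nifi-registry.properties. These values are from my instance, so check yours:

    nifi.registry.web.http.host=0.0.0.0
    nifi.registry.web.http.port=18080
    # the https properties stay blank since we are not securing this one
    nifi.registry.web.https.host=
    nifi.registry.web.https.port=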
And then the providers: this is where you'll go in. With Registry, Registry is backed by your version control. What do you all use for version control? GitLab, GitHub; Azure has a Git service, I think. Anyone able to tell me? Azure DevOps, okay. Let me take note of that. You can use Azure DevOps; I haven't configured NiFi Registry to use it, but I'm taking a note because I want to make sure I send you the right documentation for your real-world environment. Okay, perfect. Under the hood it uses Git, but we need to specify some things in that configuration for Azure, and you can use GitLab and GitHub as well. The interesting thing about Azure, and this isn't really well known yet, is that Microsoft is incorporating more and more into its Azure stack, and I don't have 100% confirmation, but Microsoft has been reaching out to some of us because they plan to make this available as a service as part of Azure. So you will potentially, in the future, be able to configure this a little more easily, because it will be officially supported in the Azure stack, if that makes sense.

So anyway, the providers file. In case it's relevant to your notes, we have an on-prem Azure DevOps, not the cloud version; there are differences. No, thank you; what I like to do is gather as much information as I can, because after this class I want to send all these notes out, along with some more advanced capabilities, and give you a little handbook to work off of. If I know the environment you're working in, I can tailor it so that when you get the email, you can go from there. So: Azure DevOps, on-prem.

You'll go into your providers file in your conf directory to do that. If you look, you'll see a providers.xml. In this one, for instance, everything is commented out, but you would just use the Git provider: the access user, the password, the repository to clone, those types of things. Like I said, I do not have a GitHub set up for this, but I will give you directions on setting up GitHub, GitLab, or Azure DevOps for it, which will help when you configure this in the future. What we care about right now is NiFi to Registry, but like I said, I'll give you instructions for Registry to your Git version control. You define all of that in the providers file, just like the properties say.
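When you do wire it to Git, the uncommented version of that provider ends up looking something like this. I'm writing this from memory, so treat the class name and property labels as a sketch and verify them against the Registry admin guide:

    <flowPersistenceProvider>
        <class>org.apache.nifi.registry.provider.flow.git.GitFlowPersistenceProvider</class>
        <property name="Flow Storage Directory">./flow_storage</property>
        <property name="Remote To Push">origin</property>
        <property name="Remote Access User">your-git-user</property>
        <property name="Remote Access Password">your-token-goes-here</property>
    </flowPersistenceProvider>

Registry keeps a local clone in the Flow Storage Directory and pushes each committed flow version to the remote you name.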
There are also database properties; under the hood, Registry, like NiFi, likes to use H2, and that's constantly being updated. There's an extensions directory as well, and for AWS there's some special configuration. Identity mapping, some additional security, Kerberos properties, those types of things. So when you start putting this into production, you're going to look at your conf directory first, get that filled out, and then go to your providers and get that checked. What I will do is try to get you a good example of the providers file so you can use your Azure DevOps, but this is where you define it.

So with that being said... Can I ask, what do we get out of connecting it to DevOps? Because I get the value of a registry; if we just don't connect it to DevOps, what would be the difference? So, the way this works is NiFi communicates with Registry to store and version all of its data flows. But Registry by itself is not backed by Git or a true version control system, so you would plug your Registry into GitHub, for instance. When those flows are committed, Registry takes them, creates the history, and pushes them to your Git repo. That way you may decide, as part of your CI/CD process, to take a flow and push it out; say you had an Ansible playbook set up to deploy NiFi and you need to feed it the flow it's going to use, you can pull that from your Git repo, like GitHub. Okay, so could you do that without connecting the Registry to Git? Yes, because we are going to save the versions into Registry, without Registry going out to GitHub. That final step, in your environment, would be NiFi to Registry, and then Registry to your Azure DevOps. Think of Registry as your translation layer to get your data flows into version control, as well as a UI to manage them.

Okay, because I was speaking with James; he's the one who has been messing with NiFi, and he's not in this training, unfortunately. He developed a flow on a sub-production instance, even though we don't have a real production instance, exported it to an XML or something, had to convert it, and then imported it into another instance. So this would replace that tedious process. Yeah. And I'll show you: when we set up Registry, the beauty is that Registry segregates everything into buckets. You can go in with permissions and say user X only has access to these buckets and user Y only has access to those. But if you have X and Y working on the same thing, they can both commit to Registry, and user Y can check it out, pull the latest version, and continue working on the flow, committing back and forth. Registry then takes that flow and pushes it to a version control system like GitHub that keeps track of it; you can branch and all those things. And then you can use that flow as part of a CI/CD process to push it out: here's the version you need to push out for dev or prod or whatever. Oh, thank you. Yep, great questions, and this is the perfect time to ask them.

Okay, so we now have all of our files extracted, and we've looked at our conf directory. There's a database directory, an H2 database keeping track of things. There are docs: every NiFi component, except for some of the MiNiFi stuff we'll get into later, has docs built in, just in case you're on a network without internet access. There's a lib directory, just like NiFi's. There's an extensions directory; just like NiFi, you can build a processor, drop it into extensions, and hot-load it, and there are some extensions you can build for Registry. I don't know that I've ever seen one built, to be honest, but the capability is there. And of course the logs directory, so you can see what's going on.

So for this exercise, let's go into the bin directory, and we want to run NiFi Registry. That should bring up a new window that is our NiFi Registry. Let me check your screen. And then once it starts, if you can, open your browser.
Do you remember, we didn't change our port; it was 18080. So when it's up and running, bring up a new tab and go to 127.0.0.1:18080. Sorry, Joshua, I'm stuck for a bit; I'm late on the last step. Oh, no worries. So in your NiFi Registry folder, where you extracted that zip file, in the bin folder is the run-nifi-registry.bat file. You want to run that, and it's going to bring up a new command-line window. So you should have two: one for NiFi, one for NiFi Registry. And then once Registry starts, and you've got to give it a minute, you should be able to get to it in the browser.

Hey, Josh, it's Ben. Mine blew up. Let me look at yours. Ben, you blew up? It's all right. The 1.25 should be... oh, you're re-downloading. So make sure you're downloading Registry; scroll back up and click on Registry, and you can use the 1.26 binaries. I know, I know; if I could change it, trust me, I would. All right, I got it, I'm good. Okay, so Ben's downloading the Registry now. It should also be in your downloads folder; if you don't have it, you can go to the NiFi website and download it. I would say this is the more technically advanced class; if it wasn't, I may have already had this installed and running for everyone. But because we're the advanced class, we're going to go through some of the technical aspects of setting it up. So yeah, if you don't already have it downloaded, you can download it yourself. Again, all of this is free and open source.

What was the URL for that? Yeah, if you want to download it yourself: Pedro, I'll pull your screen up. Go to nifi.apache.org, click Download, and click on Registry. Scroll back up; there are tabs for NiFi, MiNiFi, and Registry, well, four tabs, with FDS, the Flow Design System. Click on Registry, go up top, there you go, perfect. Scroll down a little and you see the 1.25 Registry binaries, not the source, unless you want to build it. There you go. Click the HTTP site; if it doesn't respond in a minute, click the backup site, but it should download the files. And what I was saying is that the NiFi Registry zip should already be in your downloads folder; when I created the VMs, I put it there so you wouldn't have to download it, but you may have picked all your zips up and destroyed them earlier. Yeah, I think we all messed up the zips on the first one. There we go; I had to re-download. No worries. Like I said, we use that zip flow as an example of what bad things can happen as well, so it makes for a great story. So yeah, just download it; once you've got that zip file, extract it, go into the bin folder, and run Registry. I'll give everyone a couple of minutes.

Right, so I think I have that; so then what happens, run Registry? So, Pedro, it's running. I think so, yeah. Let me look at your screen. Yep, it's running.
So now, in your address bar, type 127.0.0.1, colon 18080, and hit enter. Oh, on the end of that, put /nifi-registry; do a forward slash nifi-registry. Yep, it should look like that. You don't need HTTPS or anything else. Okay, perfect, Pedro, you're in, nice and clean. Okay, anyone else having an issue? Yeah, I am: Sean. Hey, Sean, let's take a look. Mine's not working. No worries, let's see what we've got. It's running. I think I'm doing something wrong here. Yeah, I think you have HTTPS; it's just HTTP, so take the S out. There you go, hit enter. And you're good.

So, NiFi out of the box comes with a self-signed certificate and a generated username and password. The reason is we had tons of people downloading it, putting it in AWS, and leaving it wide open to the world, and I could create a data flow to download whatever kind of malicious payload I want and go from there. Registry, because it isn't actually executing code or anything, they decided to just leave open by default. But you do have the capability to configure security in the conf files; you want to modify your authorizers, those types of things. But you're where you need to be. So how's everybody else doing?
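So, to recap the install for anyone catching up, assuming you kept the defaults like we did, the steps were roughly:

    1. Extract the nifi-registry bin zip from your downloads folder.
    2. Optionally review conf\nifi-registry.properties; we left everything at the defaults.
    3. Run bin\run-nifi-registry.bat (on Linux it would be bin/nifi-registry.sh start).
    4. Give it a minute, then browse to http://127.0.0.1:18080/nifi-registry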
Keep an eye out for new versions of these, and make sure your versioning matches. So if you're using NiFi 1.25, you most likely want to use Registry 1.25 as well. A lot of these are backwards compatible; I've seen older Registries still running after NiFi itself was upgraded, because the callbacks are still there and were never deprecated, but just try to stay on the same version if you can. Registry is a sub-project: a complementary application that provides a central location for storage and management of shared resources across one or more NiFi instances, or MiNiFi, which we'll talk about tomorrow. The first implementation of Registry supports versioned flows. So now we need to tell our NiFi where our NiFi Registry is located.
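In NiFi 1.x that is done from the hamburger menu in the top right: Controller Settings, then the Registry Clients tab, then the plus sign to add a client. Roughly:

    Name : Local Registry
    URL  : http://localhost:18080

Once that client is saved, right-clicking a process group gives you the Version menu with Start version control, which is what we use to push flows into the Registry.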
You can view them all by instance; I don't know if the IP address will resolve, but I'm curious to see. Anyway, you should now have two flows within your Registry. Everyone does? Okay, perfect; if you do not, let me know.

So let's go back to NiFi and into that CSV to JSON data flow, and I'm going to make some changes to it. I'll add a label; I'm just making a change to my data flow. And if you go back, and you can use your breadcrumb trail or just right-click and say leave group, when you go back, that green checkmark is now gone. What that's letting you know is that you have local changes that need to be committed. So you can right-click again, go to version, and commit local changes. For the version comment I'll put "I added a label," and once I make that comment it should show a checkmark and I should be good to go. I'm not looking at everyone's screen right now, so if anyone's having an issue or isn't this far along, feel free to speak up.

Okay, so now that I've committed my changes and they're in the Registry: if I had a GitHub, GitLab, or Azure DevOps repo set up, the Registry would automatically write these data flows to that version control system. Now, with MiNiFi, you need to develop your flows in NiFi and save them before MiNiFi can run them, because MiNiFi does not have a UI to develop flows. From tomorrow morning until lunch tomorrow we're working on MiNiFi; I've set aside half a day just for that, because there's a lot going on once we introduce MiNiFi. But that could be a reason you want to save your flow: you build it in NiFi, you save it, it gets backed up to your Git repo, and you may have a CI/CD process that pushes the correct flow out to your Raspberry Pi edge device running MiNiFi, because you don't have NiFi or Registry installed on that Pi. So MiNiFi is probably one of the key users of that version control system. And in practice, in talking with folks, with MiNiFi you definitely mess with the flow locally, make a lot of changes, and once you get to where you want to be, then start committing. I can't think of any real business use case to turn version control off, except for testing and training and those types of things.

You can also revert. So what I did is make another change, and the red arrow lets me know there is a newer version of this flow out there; I was reverting my changes because I want to get rid of that label. Or maybe I wanted to get rid of that label and add another one; by the way, you can Ctrl+C, Ctrl+V on labels, so maybe I want two labels now. So I created my new label, I have changes that need to be handled, and now I can revert my local changes, I can show the local changes, or I can just commit a newer version. And now we are back to a green checkmark.

Okay, so with that being said, that is how we commit our changes and how we work with our versioned data flow. So what I want to do next is:
I am going to stop my process group, and I would delete it, though I'd have to go in and delete the controller services associated with it and some of those other things, so for now just stop your process group. We have a version of that data flow saved, so we can actually drag a new process group onto the canvas, and now we have the option to import from Registry. Previously we did not have this; we could only give the process group a name, or upload a flow definition like we did earlier with that JSON file. But now you can import from Registry, and I have two different flows, you may have more, saved in my Registry. Refresh; awesome. So in that same bucket I now have the newly named data flow and also the CSV to JSON flow, which has three different versions. I can refresh these, I can delete the bucket altogether, I can export a version, I can import a new version, or I can just delete the flow altogether. And you can go back into your settings and delete your buckets and things like that as well. So Registry has a ton of capability for managing your data flows. And if you were to run that import-zip flow and it blows up, it ingests your parent NiFi directory and NiFi no longer works, we can now delete that whole instance, unzip a fresh one, start NiFi, point it at this Registry, and we have a version of the flows. So there's a lot of capability here. Again, we're not going to go into configuring GitHub, GitLab, and Azure, but I will send out some tips and tricks, instructions on how I've done this, and some reference material, because you may need to reference a specific class or property in the configuration file.

So I'm going to pause there. I think that just about covers the Registry. The user guide and the system administrator guide: I will send those links out as well; they go into a bit more detail about managing users and things like that. If you remember, when I was talking about NiFi security, it's very deep and fine-grained: you can set users so they're only allowed to view a data flow, or start a data flow, or only import, and so on. And if somebody is high-speed enough, and I don't know if this will work in this VM, you should be able to access my...

This may be out of scope for this training, I'm just curious how other people do this: if the flows are being backed up in source control, and we're going to be running these as containers, do we need to back up anything other than the flow? Because I know the data is queued, and there are errors and stuff; do people typically back up anything beyond the flow when they're running NiFi in production? They do not, and that's the beauty of this. Any other questions, or even tips or tricks? I like to run these classes very casually, as a conversation, so please pick my brain. Like I said, I've been through numerous huge NiFi deployments, and I'm one of the committers to this.
And you can actually look at Apache's GitHub: Apache links their GitHub repository, so you can see NiFi was changed an hour ago, for instance. If you look there, you have all of the source code and all of the processors. There's a whole developer guide on how to build a processor from scratch and things like that; it has a lot of information. You can go directly to Git and download it, and once you've got it downloaded there are some things you can do and change if you want to build it yourself, building a custom distribution. It uses Maven, because Java is under the hood: there's the mvn clean install, profiles to include all of the NARs and all the bundles, and you may want to only build certain bundles. There's actually another hidden option, the rules engine flag, that they just don't list there, which you can also include. The developer documentation on building it is really well done.

That leads me to another thing I see quite a bit and don't usually go over, because it's all in the install guide: when you are installing this in a production-type environment, you want to make sure you have the system limits set, the number of files a process can have open, those types of things. NiFi is really heavy on having a ton of files open, and the instructions will give you the recommended settings; I'm looking for them right now. No, no, I think I understand, thanks. Okay, and like I said, I can type up some more tips and tricks on developing processors, but good question. Anybody else have any questions before we take a break, an early dinner for me, lunch for you all?

Okay, so let's take a lunch break. It is 2:02 my time, so let's just call it two o'clock. Let's take about a 43-minute lunch and try to be back by 12:45 your time, 2:45 Central. Then we'll go into another data flow, build that out, get some additional hands-on, and call it a day. Have a great lunch; I'll see you all in 43 minutes. If you need anything, I'll be around.

And hopefully everyone has made it back. Even given 43 minutes for lunch, I barely had enough time to finish, but we've got a good next scenario. Okay, I have one minute after, 2:46 my time, so let's get started. Our final activity for today is a scenario. I put some tips, tricks, and pointers in the scenario document. I don't usually like to give a test in any of these classes; what I'd like to see is some hands-on experience, applying some of the things we went over. For this scenario, if you still have the Dropbox link or the Etherpad pulled up, you can download the scenario information; let me know if you cannot. For this scenario, you are a local government agency responsible for monitoring environmental conditions across various locations in the region. Your task is to aggregate, transform, and analyze weather data collected from multiple local weather stations to provide daily summaries and alerts. The scenario is in the Dropbox link, and it's also in the Etherpad right here.
I can go onto everyone's desktop and download it for you if you need help, but if you can, try to grab this link. Nope; oh, it's logging me in, so copy link address. Okay, so go back to the Etherpad; I will also send this link in the Teams chat. There should be a zip folder, the NiFi scenario. It has the scenario as well as three data files; we are pulling from three different weather stations, each giving a report every hour. There is a data description in the scenario describing the fields, as well as, like I mentioned, some tips and tricks. If you grab that link, you should be able to download the scenario. It should be pretty easy, and then you just want to go to your downloads; I'm going to copy it to my desktop, and go ahead and extract the zip file. Again, I've included the PDF of the scenario. Here we go: your objective is to use NiFi to automate the collection, transformation, and reporting of weather data from local stations. Good deal. So I'm here for questions; feel free to ask anything, any tips or tricks or anything you'd like to see. There should be information within that PDF to help you out. The data structure is pretty self-explanatory, but if you need a breakdown of it, I can send something over as well. While everyone works on that, I'm going to go on mute, but I'll pop into everyone's desktop, see how things are going, and provide commentary as needed.

We have about two hours left for today, and this may take a little bit of time, so we'll spend the next hour on this flow development. Again, ask any questions along the way. We'll take a quick bio break, our final break of the day, then come back, answer questions, and review your flows. Again, there are many ways to skin a cat with NiFi, so I'm really looking at the thought process behind how you develop your data flow and the story you have for it.

I have a question about step three. So if you do not have a script, or you may not know scripting, feel free to use another processor to extract that and deal with that type of data. Look through the list of processors you have; you may be able to, for example, extract it with a CSV reader and a JSON reader and then put those back together. For those that know how to execute a script, that's an option. So do the best you can on this one. You can use custom scripting; there's also JoltTransform and ExecuteScript with Python, although I don't think we have Python installed here, so you may not be able to use a Python script. You might have to use a processor to extract the data, and once you have it extracted and saved as attributes, you can use those attributes to build your final output. If you get hung up on that step or some of the others, just ask me and we'll find a processor that does the function you're looking for. This scenario can definitely be accomplished utilizing the processors NiFi comes with out of the box; you won't need custom processors. Also remember the documentation has a list of processors and their functions, so if you're trying to find a processor to simplify what you're trying to do,
or to do that specific function, feel free to reference the documentation; the list of processors and what each one does is there. I assume some of you will probably use the ExtractText processor as well, so that's there, along with some of those other processors.

Okay, and it seems like it would be a lot easier if we just did the convert from CSV to JSON like we did in the last lab, because I'm getting capture groups with regex; so is this kind of however we want to approach it? Yes, it's however you want to approach it. Like I said, you can learn from the previous example, and that's the reason we have both JSON and CSV: if you want to ingest the CSV, convert it to JSON, and then look at the JSON values together, I'm fine with that as well. Again, this is just to put the skills we've learned over the last day and a half into practice. Do the best you can; there are many ways to build a flow, and as this scenario will prove, I bet when we pull these up we'll have some folks use one set of processors and others use another set, and it will all accomplish the same goal of getting that alert out. So yes, use any processors you want, use the processors you've learned about, those types of things. There's a reason I didn't make it all CSV, so you can't just reuse the last flow, but reuse what you need to. If you had Python set up properly, you could easily do this in a script if you knew Python; so use the processors that are available and do the best you can. I put some tips in there, but if you want to use another processor, have at it.

Just to see if I was on the right track with this: I was using an ExtractText and I created some properties based off of capture groups; I just called it columns, so I have columns one through... Is that on the right track with ExtractText? It is. One of the things that puts you on the right track is extracting these values as attributes, because once you have them all as attributes you can manipulate them all day long, and you can pass those attributes to different processors, those types of things. But you are definitely on the right path. Okay, cool, thank you. No worries, good questions.
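For those going the ExtractText route, here is a sketch of what those dynamic properties might look like. The column names and the exact regex are hypothetical; adjust them to the real station file:

    ExtractText
      station.id  : ^([^,]+),
      temperature : ^(?:[^,]*,){3}([^,]+)

If I remember right, each property's capture group lands in an attribute with that name (plus numbered variants like station.id.1), and from there RouteOnAttribute or UpdateAttribute can take over.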
"Hey Josh, I got a question." Yes sir. "I have a feeling you went over this in the part I missed this morning, but I'm trying to use the ConvertRecord processor and I'm getting an error that it was validated against a different... and I'm not sure what the issue is." Okay, let me pull yours up real quick. And while I'm off mute: we still need to take our last break after lunch, but what I'm thinking is, after I answer this question I'll run to the restroom and come right back, so just take a break in place and continue working on your data flow. Okay, "record reader... validated against... this controller service is disabled," right? "Right, and I'm not sure what that means; I missed all of this morning." No worries. Go into that processor, and you see you have the CSV Reader and the JSON RecordSetWriter; to the right, see that arrow? There you go. Those are disabled, so you will need to enable both services with the lightning bolt; you can do just the one service for now, but it won't be valid until you enable both of them. I think you might have missed this during the morning session as well: if you're trying to use an Avro schema or something similar, you will need to put those controller services in there. If you have that zip file I sent, the one I had everyone download earlier, it actually has the working flow in it. Well, it's working now, but go back to that flow; yeah, that's kind of what you've been doing, there you go. The reason the CSV-to-JSON ConvertRecord is disabled in that one is only because you imported it but nobody ever enabled the controller services. So go into that one, go to the CSV Reader and hit enable; you'll see you need to enable those, and then your previous flow will work and that should take away the warning. Yep, you're good. The other two are disabled as well, so go back: the easiest way is to go into the processor, go over to your controller services, and the bottom two are disabled as well, so enable them. There we go, perfect, so that original data flow should work for you now. Actually, on the convert CSV to JSON, just tell it to run once: right-click, go to Run Once, and see if it gets a success. "I don't think I did any of the steps." No, you already have a file in the queue. Hit refresh on your canvas... you do have success. Perfect, so you took that original CSV and made it a JSON file. And you can use this morning's activity as a reference; I don't want to see an exact copy of what we did this morning, but you can reference it. "Well, I missed the whole morning." No worries, no worries. Any other questions? "No, I think I'm good for now." Okay.

I am actually going to run to the restroom; I will be right back. If I miss anybody while I'm away, just leave me a message in the chat, and I'll be back in a couple of minutes. ... I'm back, if anyone has any questions. Let's spend about 20 more minutes on this, hopefully make some progress, and then start going over some of the data flows. If you get done with your data flow, let me know and we can start reviewing it. It looks like some of you are getting very close to completing this task. If you get hung up or you're just taking a while, don't feel bad; we have plenty of time, and you can work on this later after class or tomorrow morning and finish it on your own time. I will be sending out the scenario so you can practice it on your own time as well. We'll give it another 15 minutes and then start going through some of these data flows. Okay, everyone's very close, so we may just finish up working on this today, and then I can go through the flows later, see how you did, and we can talk about them in the morning as well.

So we have a few minutes left; let me touch on some of these. The first one I have pulled up is Cody. How are things going? Anything I can do? "Yeah, I tried to reuse the one we did, that previous flow, and I was trying to change some of the schema, but it was giving me an error on some of the formatting for that."
Okay. "I can run it and show you." So you're getting your weather data; are you just getting the CSV files? "Yeah, I'm just grabbing the CSVs." Okay, perfect, awesome. And then you were setting the schema name? "Yes, do I need to?" Yes, just set it to weather. "Okay, the last one we did was inventory." That works out great. "And then here, this is where it errors out." Okay, let's stop it. "Yeah, I get an error on temperature, here." Okay, I noticed something already: weather is capitalized over on your other processor, and here it's lower case. Also, instead of wrapping, you can expand that box; go down just a little bit, below the down arrow on the box itself, to your right, right above... right there. There you go, that would have been helpful. Okay, so type record, name weather, fields... "This one I had as an integer, but I changed it to a string and it passed, so I'm not sure." No worries, you can have them all as string if you want. Okay, so I'm looking at the station ID, date, hour, temperature, humidity, wind speed, and precipitation. Say OK there; that actually looks good. "Well, I did get an error on the caps mismatch." Say apply there, hit X to close that window out, and let's look at your set-schema step, the one before; that's where I saw the capital. Stop and configure there, apply, perfect. Okay, you need to enable your services again, and the other ones are invalid, so let's look at why; they may be invalid because of that service. There we go, all is enabled. Still throwing an error: "error while getting next record" for the temperature. Let's go back into your convert CSV to JSON, go to your CSV Reader, pull up that controller service, and look at its settings. Scroll down; you want Treat First Line as Header set to true, because otherwise it doesn't know the first line is a header and it tries to treat it as data. So set Treat First Line as Header, apply, exit out of that, and enable it. What's our latest error? "Error while getting next record for input stream: 14.8." That 14.8 is the temperature. So it pulled in the station ID, the date and hour, and then it was having a problem with the temperature. "Just do a string, maybe?" Yeah, change it to a string; not to get too over-complicated here, put them all to string, make it a little easier. So when you go back to your canvas, that processor is going to be started, because you started both the processor and the controller services. It's fine, but go ahead and stop it there. "And then I haven't gotten much further; I just laid everything out, I haven't really built out the configurations for anything else." Okay, no worries. Like I said, I suspect some people will get done with this and some people may not, but I'm glad you laid it out. What we'll do first thing in the morning is go through the thought process on some of the ones that didn't get finished, and we can talk through those real quickly. Again, I'm not really looking for the flow to be completed; this is interactive, just let me know what you're thinking and how you would do this, just to make sure we have a good grasp on the components we've covered. But besides that, it's looking good so far.
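As an aside, written out as an Avro schema, that all-strings weather record might look roughly like this; the exact field names are an assumption based on the scenario's data description, so match them to your actual headers:

    {
      "type": "record",
      "name": "weather",
      "fields": [
        { "name": "station_id",    "type": "string" },
        { "name": "date",          "type": "string" },
        { "name": "hour",          "type": "string" },
        { "name": "temperature",   "type": "string" },
        { "name": "humidity",      "type": "string" },
        { "name": "wind_speed",    "type": "string" },
        { "name": "precipitation", "type": "string" }
      ]
    }

Reading everything as strings keeps the CSV Reader forgiving; you can always convert to numbers later with expression language when you do the comparisons.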
If I was doing this real quickly, I would have used some of the previous flow as well: I would have saved the converted data as JSON, brought all three JSON documents in, done an EvaluateJsonPath to extract all the data from the JSON, and then moved to the next processor to combine the data or make the calculations. So my flow would probably have been six or seven processors if I was doing it the simple way. But I think you've got a grasp on this; feel free to finish up, and like I said, we'll go through it in the morning.
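To make that "six or seven processors" idea concrete, a minimal version of the flow being described might be laid out something like this; the processor choices are one option among many, not the required answer, and the alert threshold shown is assumed:

    GetFile           - pick up the CSV and JSON weather files
    ConvertRecord     - CSVReader + JsonRecordSetWriter (reuse the weather schema)
    SplitJson         - one record per FlowFile, if the converted file holds an array
    EvaluateJsonPath  - Destination: flowfile-attribute; one property per field
    RouteOnAttribute  - e.g. ${temperature:toNumber():gt(90)} to branch alerts
    AttributesToJSON  - write the extracted values back out as the report/alert body
    PutFile           - drop the final JSON report to disk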
...it will pull all that data up as an attribute and you're good to go. So if I was designing this flow and trying to do it extremely quickly, I would have a GetFile grabbing the two CSVs, I would reuse the previous step where I converted those to JSON, then take all three of those JSONs, send them to an EvaluateJsonPath processor, extract the values, and once I have them all as attributes it's easy to manipulate. But I think you're on the right path. So if you see that EvaluateJsonPath, and you send it a JSON, a little tip on this one: go to the properties of EvaluateJsonPath, and for the Destination you want flowfile-attribute, not content; then say OK. Then you can just hit plus and start adding in the paths to extract the data. So for instance, depending on the JSON, right... can you open up one of your JSON documents right here, side by side? Awesome. So for station ID, I would name this one something like station ID. Go back and configure your processor; you can do station ID, I would do it all lower case, and then say OK. Then what you're looking for here is the path in the JSON: you want to put dollar, dot, and then look at your JSON's station ID. "Like, grab this whole thing?" No, just copy that part and paste it in. What it's doing, and you may have to reformat your JSON for the rest of this, but what this would do is extract station ID and create a key/value pair as an attribute. So if you say OK here and say apply, when that JSON goes through the processor, it's going to extract the data from station ID and save it as an attribute. You can keep adding these properties and values and extract all of those fields as key/value pairs. Then you can manipulate them, do whatever you need, because they are now attributes, and then use AttributesToJSON to save the file as a JSON document with the temperature alert or whatever is in the scenario. And you just leave the path as it is. "So basically grab all of the values from my JSON file and then store them as something, right? These are like variables, kind of?" Correct, store them as attributes. Then we'll compile all of those into one, and you can do whatever you want with it. "Yeah, exactly. Okay." So the key here: you're looking at EvaluateJsonPath, those types of processors, and you're looking at AttributesToJSON to write the final JSON document. You're getting very, very close. "Awesome. I would definitely have to play with it a little more, but I think I see where you're getting at." Okay, we're a couple of minutes over, but just play with it and let me know how it goes. Those are definitely some of the things you might want to take a look at. And then we'll talk about it in the morning.
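For reference, the EvaluateJsonPath setup described above would look roughly like this; the field names are assumptions from the scenario, and the JSONPath expressions assume the fields sit at the top level of each document:

    EvaluateJsonPath
      Destination     : flowfile-attribute
      station_id      : $.station_id
      temperature     : $.temperature
      humidity        : $.humidity
      wind_speed      : $.wind_speed

    AttributesToJSON (downstream)
      Attributes List : station_id,temperature,humidity,wind_speed
      Destination     : flowfile-content

If your JSON nests the readings under another key, the paths become things like $.readings.temperature instead.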
All right, we'll just go around the room. Sean, how are you doing? "Mostly OK. I was still doing some catch-up from this morning, but I've got it converting all the CSVs into JSON and putting them into a separate directory, and I was just trying to figure out how to merge them right now." Oh, very nice. You can pick them back up. Instead of writing the JSON out, you might want to just send it to the next processor, but if you want to write it to disk and then turn around and pick those back up, then look at what Pedro and I were just going over, the EvaluateJsonPath processor. It's fine to have it that way. So I think you're very close: you've got the CSVs and the JSON all aligned, and it's now just the final calculations. Any questions I can answer? "No, I don't think so." OK.

How about Aaron? Aaron, how are you doing? Are you in the room with everyone else? "Yeah." I am very surprised the room isn't done; I figured you guys would be sharing information constantly. "We are. As far as we've gotten is making sure everything is in JSON format; some of us are messing with merging right now." OK. So again, you may want to look at the EvaluateJsonPath processor. Now that you have everything in JSON, you can write it to disk if that's easier, pick it back up, and look at taking those JSON values and pushing them up as attributes. It's funny, I was thinking: there are three, four, five people in the same room, it would be so easy to just copy off of each other and be done. "Yeah, we're trying." Nobody's going to get dinged for helping your buddy, right? OK, so who else is in the room? "We have Brett in the room, I think." OK.

Let's go to Alyssa. Alyssa, how are things with you? OK, don't worry. If you want to beautify the JSON, remember: if you bring down a processor and start typing JSON, there are numerous JSON processors, and some of those are used to beautify JSON. Scroll down just a little bit: there's FlattenJson, which helps; EvaluateJsonPath is one you will probably use; keep scrolling; there's JoltTransformJSON. I'm surprised no one has looked at that processor yet, because you can actually write a Jolt transformation and probably accomplish this task in four or five processors. So definitely take a look at JoltTransformJSON and at EvaluateJsonPath. I think you're getting close, right? I was thinking somebody in that room would write a Jolt transform and have all of this working in two or three processors, but no. Any questions I can answer right now? I think you're on a good path, but I just want to check in. "Yeah, I think we all just need to play around with it a bit more. We didn't find all the processors." There are over 300 processors, so when you go into your processors, search for JSON processors, search for CSV processors, look for the items you're trying to do, and you'll start narrowing down your list. Even someone like myself, who is extremely experienced at building data flows and NiFi processors, when I have a problem I'll go to my processors, do a keyword like JSON, and see what's out there, what's available, what some of the capabilities are. I keep the NiFi documentation up.
And that way, it has the documentation as well. One tidbit of information on the evaluate JSON path. You're only able to extract one field if you're trying to do a flow file content. So make sure you have flow file attribute so you can extract all the fields from the JSON document. And this is for whoever is going to use that processor. But I think you're getting close. I would definitely look at the evaluate JSON path. And they don't transform. Depending on your skill level and your skill set, you might be able to solve some of this pretty quickly. Any questions that I can answer for you, Lissa? No, I would just need more time with her. Okay. Like I said, take all the time. Like training stops at five, but I don't stop. Like, I mean, I keep going, whatever it is to help support you all. So just keep going. No worries. No worries. So for you, you want to get the weather data. If you, again, if you want to solve this very pretty quickly, reuse the CSV processors that we did for the previous exercise. And then you can actually write that JSON back to disk if you want. And then have a new processor that grabs all your JSON. So it would include that station, that third station ID. And then you can bring all your JSON in together. You can evaluate the JSON path, extract the data, and then you can manipulate it using attributes. So just play around with it. If you don't get complete, no big deal, we can work on it. If you need to reference the previous exercise, you can upload that. It's in that zip file. I just didn't advertise it at first because I wanted to walk us through the flow itself. But you're able to bring down a process group, import that flow, and it should work for you. So if you need to reference the previous flow, and even if you didn't get it finished, you can import it and it's complete. And that should give you some helpful tips on processing that CSV that you get. Any questions, Amanda, I can answer right now? Oh, no. I'm good. OK. Well, try your best. And like I said, any questions, just feel free to reach out. And is there anybody else in the room with you? I am, Ben. I hope you like to scroll. Who was that, Cody? Ben. Ben. Oh, yeah. I thought you said Ben should be in there. Do I like to scroll, oh, Lord. What's SpiderWeb? Oh, my God. Did you read Charlotte's Web before this? Well, welcome to our brain race. It's literally, OK, what's the next thing I have to do? All right, let me do that. I have no clue where to start. Is this like a maze? Top left. Top left. So you're getting the source file. Oh, you identified the mom type. That's it. Oh, well, I don't see it. Yeah, no. It's been a while. All right, so if it is a JSON document, you're sending it to write JSON to work. If it is a CSV, you're updating the attribute, you're converting the record, and you're putting the file to JSON just like we were doing on the first, the previous exercise. And then you're pushing everything to a funnel and logging. OK, so what's on the right? OK, so it gets work files. Oh, OK, and then you merge JSON data, rename the file, and you put the file again. And that's as far as I got. OK, perfect. Well, the nice thing is I think you're very close on normalizing all of your data. OK, perfect. So look at picking that back up. Might as well put a few more Git files on there. You don't have enough. Go ahead and Git file. And take a look at the evaluate JSON or the flatten JSON processor, the evaluate JSON processor, the Jolt transformation processor. 
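If anyone does want to try the Jolt route, a minimal shift spec for this weather data might look something like the following; the field names are assumptions from the scenario, and you would paste it into JoltTransformJSON's Jolt Specification property with the transformation DSL set to Chain:

    [
      {
        "operation": "shift",
        "spec": {
          "station_id":  "station_id",
          "date":        "report.date",
          "hour":        "report.hour",
          "temperature": "report.temperature",
          "wind_speed":  "report.wind_speed"
        }
      }
    ]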
I bet if you were to Google, I want to Jolt transform a JSON document in Nafa, you will see tons of great examples. OK, so I think you're very, very close. Do you have any questions I can answer right now? And if not, we will go over this in the morning. No, I'm good. OK, thanks, Ben. That is ace by the web. Tyler, how are things going with you? Oh, wow. I'm pretty good. Wow. Just in just the files, I check the MIME type. I update the schema attribute, route on attributes, and it's already in JSON. It just outputs it. Otherwise, it will keep working the same speeds to JSON. Wow, that's nice. Nice and clean. Then it queries the records to do the aggregations and pushes out the aggregations here. But I haven't branched that part out yet. And then this query record does the warning for when speed. I haven't added any other warnings yet. And then I have it splitting the records and then extracting out the date as an attribute. And then it merges all of them from the same date to aggregate them for the reporting. And then I haven't fletched that out yet. Oh, good Lord. You're almost done. You nailed it. And I was very surprised to see query record, split record. I really like that approach because that approach is a data engineering thought process where you can then take this and build some of the reusable components and process data like this in the future. I think you're very, very close to being complete. Any questions that you may have that I can answer? Any major roadblocks in your way that I can answer? But overall, great job. I think you're very, very, very close. So great job. If you need anything, let me know. But wow, nicely done. Thank you. All right. Brett. I think Brett's done. No. Can you hear me? I can. So I started one way, then I switched, and I started using the evaluate JSON. OK. So I have for the native JSON file in there, I'm able to grab the file and get all the attributes for all the elements in there. So then now I'm just, because I split, I had two different paths for CSV JSON. CSV path was the old one that we did. So now I'm trying to, after I converted to CSV here, I'm trying to go to the success to do the same thing. I haven't tried that yet. OK. It may require a different evaluate JSON processor, just to, depending on if the JSON is exactly the same or not. But you're extremely close, and good job using the evaluate JSON processor, and pulling all of that data out as an attribute. You have it all as an attribute now. So there's a lot of capabilities you have now with that data. Any questions I can answer? This is, you and Tyler are just really rocking it. Any questions I can answer? No. I think it's making sense. OK. Perfect. And I think you're very, very close. I've been struggling with the canvas and the latency that we have, though. You and me both. If it was my way, we would have a different environment. But I am as well. I just got a warning that I'm 300 milliseconds. And I'm like, come on now. But no, I get it. I get it. OK. Well, if you have any questions, let me know. Great job. We'll go over this in the morning, but I think you're extremely close. All right. I think we went over Alessa and Amanda. Randy. How are things going, Randy? Well, that doesn't sound good. Are you there, Randy? All right. Well, it sounds like Randy's allergies are really getting to him. It is that time of season. So we'll come back. Randy, if you have any questions or anything, let me know. But I'll check in on your flow when I can see it and see how far you've gotten. OK. 
Let's see. Who did I miss? Erin, I think I got you. But did I miss you? You looked at Erin's earlier. That's what I thought. It's confusing me with somebody in the room. And then so did I miss anyone? And if I did, let me know. If not, does anyone have any questions I can answer right now? OK. So it's about 30 minutes after or 22 minutes after I am going to call it. It's almost 530 my time. We will start off tomorrow morning going through these flows. We have enough time in the morning to spend 30 minutes to an hour going through these or updating them or finishing them up. So do your best you can. If you do not get complete, what I'm looking for is just an explanation, a story on how you're planning to design your data flow and some of the things there. Little tips. I would try to get it organized and get some of the labels and the naming and that type of stuff that we went over on the canvas as well. Beautify the flow. Try not to read Charlotte's Web tonight then. And have a great evening. If you have any additional questions, speak now. If not, I'm going to go ahead and leave the meeting. But I will also check in on everyone's desktop later tonight because I have Minify and NaPhi processor development to go over tomorrow. And I need to make sure you all have the files for that. Awesome. Well, thank you all very, very much. Continue working as you want. And we will pick up tomorrow morning. I will see you all then. Have a good evening.
"I didn't get far." No worries. Can you talk me through where you got and what you're thinking? How were you going to accomplish this? "Yeah, so I just started off with converting the CSVs to JSON, and regardless of failure or success, since you would get a failure if it was already JSON, I had them go through another conversion that just made the JSON files pretty, and then basically merged those into one. And then the part with ExecuteScript was where I... I don't really have experience doing Python, so I wasn't sure how to go about that." Okay, no worries. Like I said, there are a couple of different ways you can handle this; you could have used ExecuteScript, for example. I'd be careful about picking up files and moving files through a processor when that may not be needed, because to get the best performance out of the system you want each processor to work on data it knows about, or do the single task it's meant to do. So if you have files you know may fail, send those a different direction and have another process handle them. But overall, great job. Again, we have a mixed audience with different skill sets, so I didn't expect everyone to get finished; I didn't actually expect anyone to get as far as some folks did. Being able to explain how you're building out your flow is what's critical. Just keep in mind: name your processors something a little more readable when you can, and add some labels, those types of things. If you were to set this up as a very extensive data flow, it just helps visually to recognize where things are. We'll go into some of the other visual aspects of NiFi when we get to some of the security settings. But no, great job, thanks for walking me through it, and any questions you have so far on any of the NiFi components, the canvas, any of that, let me know.

All right, who's next? This is really nice. Good morning, Tyler; do you want to walk me through your flow, what you're thinking, how far you got, any questions? "Yeah, I just ingest the data. I was trying to work with this QueryRecord; this one does the aggregation, and I'm having some issues with the date column right now. And then this QueryRecord is pulling out... right now I just have a wind speed warning, and I didn't really flesh out these two paths, but these would be going to a merge. I guess this path would go to an aggregation report, and on this path I'm taking and splitting all of those records and then binning them by their day." Nice, nice. So another instance where we're using a QueryRecord, and you're using EvaluateJson; there are tons of different processors. I really like your layout, it's nice and clean, you can follow the path and the life of a packet of data. So overall, great job. Any questions I can answer? Any issues or anything else?
Yeah, they can be. We've seen that earlier, and yesterday as well; they can be a little finicky. Do the best you can, and if you have any other questions, let me know. If you want, we can also take a break and see if there's a way we can quickly fix that date.

Good morning. "Okay, so we didn't talk much yesterday. I'm converting the CSV to JSON with two different GetFiles, whether it's CSV or JSON, picking it up, but then starting into the flow at a different point. My question is: is there a processor that allows you to launch an external application? Like, if I already have a Java application, a jar file, sitting on the server, can I launch that from here, pass it an input file, and have that do the work?" You can. Say you have a jar file that has all the logic built in; you can pass the flow file to it and get the results back. You can execute a process, and that could be something like a shell script; we use this for executing Linux commands, those types of things, with arguments. The caveat is that when you take a piece of data outside of NiFi, any processing that happens there, you're going to lose that data governance part. We will see in the lineage that you sent it to the process, and we will see in the lineage when we get the data back, but any processing that happens outside of that we will not be able to capture, which can be a big gap in provenance. Now, with that being said, sometimes we have these external applications that do things like this very well, and we use them. There are ways, if you don't want to rewrite the whole thing... well, it's already Java, so I bet converting it to a processor would be pretty easy. But there are also ways to keep the NiFi ecosystem alive from your separate process: being able to save those attributes, and there are callbacks to NiFi you can use to say, here are the attributes associated with this flow file, those types of things. But yes, you do have a processor to execute a process, and if that's "execute java dash jar" plus your parameters, you can have that. Like I said, I would love to be able to go through each individual processor; a different way of handling this is you can even execute a script and do some of that, but in your case you would probably just use an ExecuteProcess to execute that Java. Just make sure the external application gets shipped along with NiFi, because when you deploy this and cluster it, you'll need access to that binary across all the machines. But great question, and that's why we go through some of these, because nobody had even asked that question yet. Any other questions or issues, let me know.
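A rough sketch of that ExecuteProcess idea follows; the jar path and arguments are made up for illustration. Also note that ExecuteProcess runs on a schedule rather than on incoming flow files, so if you need to pipe a flow file's content through the command, ExecuteStreamCommand is the closer fit:

    ExecuteProcess
      Command           : java
      Command Arguments : -jar /opt/apps/weather-report.jar --input /data/incoming    (hypothetical)

Whatever the process writes to stdout becomes the content of the FlowFiles that ExecuteProcess emits, so the external application's output drops straight back into the flow.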
Great hearing from you today. Good morning; this is looking a little better, a little less spider-webby. Okay, can you walk me through your flow, how far you got, and what issues you have? "That's the merge of the files, but also the backing up of the original. The other stuff, like making an attribute, all those things... my brain doesn't work in that space. Yeah, it's a task that's alien to me." No worries, but you did something that I would highly recommend: when you were designing your data flow and laying these things out, one of the things I noticed is you would put your original file back, or take a copy of the file and save it somewhere else, those types of things. When you're building your data flow, add in some of those precautions, those safety mechanisms, to ensure that if you are writing values to a database, for instance, you may write them to a file first just so you can see: does this look exactly like what I wanted to go in? The beauty of a processor is that you can branch off hundreds of success relationships, right? So if it's a success, you can send that same success to another file as well. Building in some of those safety mechanisms when you're building your flows really helps, and then when you're done, you've got it tested and you're ready to ship it out, go in and get rid of the redundancy, get rid of some of those safety mechanisms, so the flow can perform as well as possible. And I think it's funny how it's being used; that kind of tells me who's used this before and who's experimenting, things like that. So no, great job on putting that in. The only other thing I would do is, again, back to the labeling and beautification, make it easier to read, but usually that's at the end, when you're ready to start shipping that data flow out. No, this looks great. I get what you're trying to get at and understand where you're going with your flow. If you don't have the skill set to write code, for instance, that's fine, so long as you get close. Thank you.

Let's look at Amanda. Amanda is not here today? Oh, yeah, okay. Okay, did you get much further? "I was just basically looking at the processors and what they did."
I mean I did make a little improvement like All my JSON now looks the same because why I added a flattened Can you open up your Okay, no worries You you did the one thing though that you know I was I mentioned is is you change the destination from flow file content to flow file attribute So that's what you needed to do the reason being is you can this processor and Some of these things even being a committer, you know some of these things, you know Confuse me why we're doing it this way, but you know, if you had a flow file content It's only gonna let you extract, you know one element out of that JSON document But if you do a flow file attribute you can go through the the whole JSON tree and start extracting You know every every value out of there and then having that as an attribute So I'm glad you changed that You know, I think I think you know If you would have had chance to go further your you were really close because you know Once you start looking at the evaluate JSON and you can do some of the same things with CSVs You know, I think you're merging and other things later would have been a lot easier so Any questions concerns That can help me with media Okay, is there anyone else in the room Look at Brett Okay, Brett how far looks nice looks real nice So I switched I didn't get much further than I did yesterday we were this but I Switched like halfway through the way I was doing it. I was trying to use I think it was split JSON. Mm-hmm I didn't I Wasn't getting I was able to get break things up into different files because I thought that was the way to do it But then I switched to this value So Then I was able to get The attributes Extracted before yeah extracted into the thing. Oh, I think I showed this yesterday Oh, yeah, so like our humidity precipitation, mm-hmm the station and all that stuff and then The plan was to just feed this Converted CSV converted into JSON into that same thing. You mentioned yesterday that I might have to do a separate evaluate And I think I did because the second one I got Didn't parse correctly. Mm-hmm So I'll probably have to do just a separate evaluate JSON for that CSV to get that to work Absolutely, or you may want to Just you know how you're parsing your JSON parse your CSV and have is as an attribute and then you know with both of those as an attribute You can put processors on down the line that would you know, right? a single JSON document and All of it would be the same So there's a couple different ways. I like the path of your own I always like using, you know, you know record readers and record writers just because They're reusable components. They You know, you can add some logic and schema and some intelligence behind it But I think you would have got, you know pretty close If you had the time Issues or concerns or any questions you had about the overall scenario or flow? Perfect perfect perfect All right Pedro, let's look at your okay How we doing So my purpose to like I put filter on CSV files so I can make it into JSON Well, I think I got that working with the schema Yeah, yeah, I got those guys going oh nice nice and then Then I guess after that I was my friends I was okay Let's just go in and do the JSONs and then merge them and then do whatever you have to do in there So that's kind of where I left off yesterday. Okay. Yeah, if you were able to have time You you have 10,000 files in your queue And you notice, you know with those 10,000 files in the queue is it's basically halted, right? 
The reason it's halted is that the error LogMessage processor is stopped; if you were to start it, it would clear the queue for the ExtractText, and then the process-JSON-files step could send its queue on to the ExtractText. So I'm glad it's there, just so we can point it out; it's a learning opportunity. But yeah, that would have helped clear your queue. I think you're on a good path. Just keep in mind, you want to reduce the number of processors you use in the data flow, so if possible pick all the files up, do the filtering and sorting as soon as possible, then start sending each type off to its own process, its own flow, and you can merge those at the end as well. Just tips and tricks to keep in mind, but it looks great, and I like the labeling and the things you've got accomplished.

All right, walk me through your data flow. "Since we talked about it yesterday, I definitely learned a few lessons. It's picking up the CSVs, updating the schema, converting to JSON, and then writing it out again. And this one is just picking up the files that are already JSON, which is one part I would do differently if I was starting over from scratch. Then I was just messing with this MergeContent a little bit while you were going through with the other people this morning: it picks the files that were written back up, merges them into a single merged JSON, and then I was going to do the SQL statistics on it." Nice. I like how some folks were extracting using an Avro schema for the CSV, those types of things, and then the different approaches such as using SQL; that was really nice. You have a MergeContent and a MergeRecord there; all of the standard processors have documentation alongside, which is pretty helpful. You see where it's bolded? If it's bolded, it's a required field. But yeah: it merges a group of flow files together based on a user-defined strategy. I think you would have gotten all of those JSONs merged, you could have extracted a few things from them and set up an alert, extracted the wind speed or something, and you would have been finished. So great job. Any questions? "I think, like you mentioned earlier, it's good practice to have all the safety steps in there." It is, it is, and a lot of people say, well, I don't want to add those processors because I have to go back and delete them, or I may leave them in as this gets deployed, those types of things. It's always good to have those safety steps in place. Even to this day, I'll create a flow and get ahead of myself, like, oh no, I forgot to turn on Keep Source File, and now the file is missing from the source because it's been moved off to a folder, and I can't believe it.
So, you know, I even I get into those situations sometimes so, you know No, it looks like you got really far You know One of the things that that I didn't see enough I've seen in the past is you know, this is across the border You may create a processor group that Handles the picking up and filtering of files You may have another process group within that, you know that that parent level process group that you know handles the You know your ETL steps for your CSV and then you may have another process group that Each of those functions can run independently of each other So, you know, keep in mind if you have I know you're you're accessing a website It's cumbersome. It's not automated you're downloading it But if you were getting a feed of data just written to a disk for instance with you know, different data types formats those types things you don't want an error or Something else, you know potentially blocking the whole flow. So, you know, keep in mind you can bust this up into you know Smaller functions so that way, you know, you may you may be processing seeing JSON and maybe processing CSV CSV could act up but you know, Jason will continue to process So just keep that in mind when you're designing your data flows. You can bust this up put it into You know different processor groups link those together with your input and output ports And you go from there. So But great job Everyone, you know, you got a lot further everyone got further than I was expecting I knew It would it would throw a few curveballs because you know, we were having to do some ETL steps and then now learning mechanism I knew would kind of trip folks up You know, just keep in mind that you can always go back to the documentation you can you know the The description of the documentation and not five, you know should include all of this as well But you know everything's on the website and then you know, there's a ton a processor for You know some of the some of these things and then speaking of documentation I found You know, I had mentioned that Azure was supporting nafI more and more and So last night I was looking this I was going over what I was going to you know show today and ran across a New the new Microsoft Azure So, you know Microsoft is starting to really lean into nafI they you know, I can't confirm nor deny but it will become a potential service within and so, you know, they do have a lot of Documentation on this I stole you know this graphic for the slides. I'm up to present But there's a lot of stuff that Microsoft's even realists, you know putting out so I'm gonna include a link to this and I'm gonna include other links You know just so you can take this back and and have that documentation You know, one of the biggest things I try to you know, let the class know is I'm gonna give you as much information As I can this is a quick three-day training. We're not on a server We're not you know in a multi-tenancy environment those types of things So, you know, we've got to do the best we can with what tools we have But I definitely want to get these links out to everyone but yeah, you know in case you didn't know there is now, you know some additional information Specifically on the edge. All right. So that being said any other questions about the nafI Candace Registry those types of things before we go into scalability multi-tenancy And you know those types of topics. I'll take some
But the major difference between MiNiFi Java and MiNiFi C++ is that MiNiFi Java is basically the NiFi engine stripped of the UI and a couple of other components. What that really gives you is the capability to run with a lower overhead: it's quicker, it processes data well, it's all around a lighter product. Unfortunately, they only just added some Kafka support, I think; Kafka is used quite a bit in my circles, and it's pretty critical here. I don't want you to run it yet, I don't want you to kick anything off; let's go through the directory structure first. When you're installing and starting MiNiFi you have several options. So let's look in the bin directory. In the bin directory we have applications like the minifi executable and bat files, those types of things. Windows support is becoming more and more popular with MiNiFi; I've seen MiNiFi used on government systems to act as the agent for log aggregation, WMI collection, some of the cyber security aspects, so if you see it in the wild, you will potentially see it there. In here you have a bat file to delete the service, and you can install a service. I think I heard it mentioned, either originally or in the last couple of days, that you're looking at potentially something like a Raspberry Pi, that type of device. I know 7th Group, for instance, out of Tampa, is using Java MiNiFi on Windows laptops deployed a little outside the TOC and rerouting data back to the TOC to a main MiNiFi instance, those types of things. So depending on your use case, you may want to use the Windows version. I'd recommend the Linux version if you can, just because it's easier to deal with and widely supported, and you can use the C++ version if you don't need custom processors, those types of things. So within MiNiFi, some of the directory structure: we've got our bin. Inside bin we can do a dump of MiNiFi to see all the processes and flows running, that type of stuff. We can do a delete: you can delete the service after you've installed it, so if you have a service installed and want to remove it, you can do that. We're not going to install it as a service for this hands-on; it's more about getting MiNiFi up and running, those types of things. There's install-service, and you can define your environment variables, those types of things. We can actually look at this; one second, we'll get a better editor than Notepad. Okay, so with MiNiFi you can edit the bat files and manipulate those if you're working in a Windows environment: you can set your environment, you can get a status, just like with the main nifi.bat file that ran NiFi, which we only touched on for a few minutes; you can change the username, you can change a few things there. MiNiFi also has a toolkit, so just like working with NiFi, you can use a toolkit to work with MiNiFi as well. Okay, perfect.
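As a rough guide to what's in that bin directory, the typical commands look something like this; script names can vary a little between releases, so check your own download:

    # Linux / macOS
    ./bin/minifi.sh start      # start the agent in the background
    ./bin/minifi.sh status     # is it running?
    ./bin/minifi.sh dump       # thread/flow dump for troubleshooting
    ./bin/minifi.sh stop

    # Windows (the bat files mentioned above)
    bin\run-minifi.bat         # run in the foreground
    bin\install-service.bat    # register MiNiFi as a Windows service
    bin\delete-service.bat     # remove the service again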
And if you look in the actual bat files, you can actually, you know, change where the pins are running, where your log directory is, those types of things. So for you sysadmins out there, you know, you can set your home environment, you know, your Java home. You can change all these and really get it customized to your location or your environment. The thing with Minify again, there's no UI, there's no point and click, there's no ease of use when it comes to Minify. So, you know, if you want to install this and get it up and running, please expect to hand jam a bunch of things in and, you know, deal with the hardest part with Minify is actually getting it connected to a secure NaFi. So, you know, getting the key store, trust or those types of things set up. But if you're an admin, you know, definitely take a look at in running this on Windows, definitely take a look at the bat files for this. The conf directory is really key with Minify, you know, just like, you know, we could change, you know, NaFi and do all kinds of things. You know, one of the things that we can do with Minify is it has a configuration. And so within this configuration, you know, it's you have your flow controller name, any kind of comments, core properties, you know, file repository and how long it should be kept, you know, some of the swap characteristics. So as you go through and start fine tuning some of these Minify flows, you know, take that into consideration. You know, it's going to maintain a local repository, but, you know, even the default, you know, is 10 meg, for instance, and the max flow files is 100. You know, so you may, depending on how much data you are ingesting into NaFi, you may have a need to, you know, you may have a need to to expand upon that. If you do, you know, just take into consideration your content repository, your your provenance repository and those types of things. Your security, you know, some of your security properties will go in here, you know, your key store, your key store type, those those types of things. You know, again, if you go through and configure even an Invoke HTTP, you're going to need a key store and a trust store. A lot of times I see people just use the built-in key store and trust store that Java ships with. So, you know, I don't if you change the path, the password, which a lot of people forget about, you know, you should be decently secure to do that. But, you know, just kind of keep that in mind. Processor, processor groups. Here's where you're going to feed those unique IDs, you know, for those any kind of funnels and those types of things. There's a lot of configuration. One of the nice things, though, that they have released is the C2 aspect of Minifop. And so, let me see if I can pull up an architecture drawing of that because I think it's pretty relevant here. All right. So, you know, what I'm saying is Minify now has a C2 aspect that you can download. So, it's command and control. You know, Minify runs a single data flow very well. And, you know, you may have a need for Minify to communicate back to NaPy, you know, be able to get its new data flows and also, you know, get new configuration, those types of things. You know, so if you have a C2 instance set up, that C2 instance is going to communicate back to NaPy. It can, you know, pull from the registry and that's the reason that we went over registry. There's a lot of important aspects of that when you're talking Minify and those services. 
So it can actually pull that flow, push it out, update those flows, and manage them. Now, there's not a good UI to do this, so you're looking at some scripting, or potentially developing your own custom application, those types of things. Just keep that in mind when you're using MiNiFi: you may grow a little frustrated, because the capabilities are there but the configuration and ease of use are not. The nice thing, though, is that once you get it configured and up and running, you can make it part of your CI/CD process, and if you're using Ansible or similar, you can deploy MiNiFi with ease. When it comes to the architecture, one of the things we usually see is two or three hundred MiNiFi instances reporting into one node, reporting into a C2, for instance, with the C2 talking back to NiFi. You do not have to have C2; you can go directly from MiNiFi to NiFi. It depends on how you would use C2: if you are changing data flows frequently, or if you're trying to partition or assign a node. Yeah, no worries, no worries. Okay, perfect. And then also, when we create a flow to put onto MiNiFi, we are going to export it out of NiFi and put it into MiNiFi, and I'll walk you through that. The configuration for the deployment is done in the bootstrap.conf, which you define here in the bootstrap configuration file, and it primarily revolves around the config change ingestors. The configuration of bootstrap is done using the nifi.minifi.notifier.ingestors keys, which you will see a lot in this section. So for your sysadmins, you can start seeing the file change notifier configuration I talked about, the REST change notifier configuration, HTTP, those types of things; this is where you would configure that. Since this is the Java version, you can also configure the JVM memory settings. It is very low out of the box, only 256 meg of memory, but you've got to take into consideration that this is usually used on an edge device, those types of things. The C++ version is more efficient. If you're using a Raspberry Pi and Raspbian OS, the support for MiNiFi C++ is actually really, really good; MiNiFi supports some of the Raspberry Pi capabilities. Let me pull that up for you, because I know it was mentioned; let's see if it's on this page. Yeah: if you're using the C++ version, there are some built-in Raspberry Pi capabilities. OpenCV is there, a USB camera... which one is it? Sensors is the one I'm looking for. Okay, there are sensors, so if you enable the sensors on the C++ version of MiNiFi, it can read from the Raspberry Pi sensors. A lot of the C++ versions of the processors are built for the Raspberry Pi; they'll run on others, but they're specifically targeted at that. You can use GPS: if you have a GPS module or a GPS hat on your Raspberry Pi, it can pull that information, those types of things. And you can capture PCAP; I know of a lot of folks using this to capture PCAP data, even on the edge, to filter and sort and analyze that network traffic.
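To pull those conf-directory pieces together, here is roughly what the relevant sections look like in the Java agent; treat the exact keys and defaults as version-dependent and double-check them against the MiNiFi documentation for your build, and note the paths shown are hypothetical:

    config.yml (the flow and repositories)
      Flow Controller:
        name: MiNiFi Flow
      Content Repository:
        content claim max appendable size: 10 MB
        content claim max flow files: 100
      Security Properties:
        keystore: /opt/minifi/conf/keystore.jks
        keystore type: JKS
        truststore: /opt/minifi/conf/truststore.jks

    bootstrap.conf (JVM settings and config-change ingestors)
      java.arg.2=-Xms256m
      java.arg.3=-Xmx256m
      nifi.minifi.notifier.ingestors=org.apache.nifi.minifi.bootstrap.configuration.ingestors.FileChangeIngestor
      nifi.minifi.notifier.ingestors.file.config.path=./conf/config.yml
      nifi.minifi.notifier.ingestors.file.polling.period.seconds=30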
So, you know, just keep that in mind, you know, when you're choosing your version that it's there, it's available in those types of things. Again, we're looking at the Java version. So you can, you have remote debugging. There is some, you know, for cluster mode to work properly, you know, some of the things that you need to do. You know, if you have some older, you know, system settings and things like that. This is, it's set to headless mode by default, you know, in those types of things. I was talking about some of the C2 capabilities. If you have a C2 server, you can go in and configure it. Again, if you are using a lot of edge devices where you're running Minify, I would highly recommend a C2 instance. Just because once you get your Minify running with the security, you know, you can feed that C2 the data flow that you want to run. And it will automatically download that data flow, run it, and you know, and it'll just look for a new one periodically. So you can set the heartbeat and all the information about C2 here. Again, I'm thinking about one of my senior engineers I was working with just a couple weeks ago. He was having a problem with C2 because there's not a lot of support for it online. You know, so a lot of the settings and configuration that you look at is, you know, trial and error. You know, you're just going to have to play around with it. But, you know, it does work and you can get it up and running. Everything that we use to communicate to the C2, to Minify, to wherever, it can be secured. You know, it's an SSL connection if need be and things like that. Now, when it comes to some of the fine grained security details of Minify, you're not going to see the same types of policies that NaFi has. You know, just because it is a, you know, basically an agent of NaFi, its primary purpose is to either, you know, get file, collect data, read a sensor. You know, you name it, do some quick operations. You may want to execute a, you know, a model, an AIML type model and those types of things. I do know, like, seventh group uses it to capture images and run, you know, image classification models on the edge. And instead of sending up a, you know, two, three, four, JPEG, they can just send the analytics from the, you know, the output of the analytics. You know, so that's something to keep in mind as well. Alright, so that's bootstrap. That's the big. Extensions is just like the extensions directory on NaFi. I actually did not go into that, but, you know, we talked about it, I think, on the first day. Your extensions is just, you know, just like you would in NaFi. This is where you're going to load, you know, your custom processor or something else like that. You don't want to restart your Minify agent. You just want to get it loaded and deployed. So, you know, use the extensions directory when you can. And of course, there's the lib directory. If you see it, you know, we still have NARS, you know, listed here, just like we would in NaFi. Not as many, you know, you would see within the NaFi, but, you know, you can, you know, depending on how you want to handle this, you can copy a NARS from... I want to go in and show you. So you're able to copy the, like, one of the NARS from NaFi. Let's see, let's... There's already Kafka, there's Solar. Like, we'll take the Solar one, for instance. You know, if I had a Minify flow that was, you know, picking data up, sending that to Solar. Solar is a search industry based off of Lucene, another Apache level project. 
I would include the Solar NAR as part of my, you know, deployment. You know, so keep that in mind, you know, when you have your CSC process set up, if you do have custom processors, and it sounds like you may or you will, you know, you can have those built, and then when you build and deploy your Minify, you can use that processor on your Minify, you know, install. If you are, if you have that processor part of your flow file, and, you know, we have to design our flow file, you know, the data flow itself, we have to design those within NaFi. And so once you, you know, get your data flow designed in NaFi, and you save, and we're about to do that right now. And once you save that flow, if that processor does not exist in Minify, it will reject that flow and not run. So, you know, just keep that in mind, you know, if you develop custom processors, or if you're using processors that are not part of the Minify distro. Again, you know, we know, you know, the C++ version, you have a very detailed list with Minify. And then also the, let me show you this, again, it's a very extensive list. So if you, you know, these processors are not packaged with Minify, but it's able to use the following processors out of the box. You know, so execute SQL, we were using that, we were using execute process, we, you know, we are using, you know, some of these other processors in our data files, data flows that we built. So just keep that in mind. This list is available, but it's not readily available. And so what I'm going to do is put that as part of a presentation I send out afterwards, you know, just as a tip and trick that, you know, kind of reference this list based upon, you know, what your data flow may be. I think, you know, we would use a put file and a get file, you know, in our data flow, usually you would, like, you know, in a Minify instance, this is pretty new where they've, you know, kind of separated this out. There's a lot of, you know, there's a lot of thought in the future of Minify where you can just use NaPy and just deploy it as NaPy because it's actually the same engine underneath. So do expect some changes to this coming up soon. And that's because, you know, I just had the insider knowledge that these things are coming. So, but I'll send this list out. I'll send out, you know, a few things about it, but keep it in mind as well. Okay. So Minify is able to use the standard SSL contact service out of the box as well. So keep that in mind. If you want to create a data flow with a processor not shipped with Minify, you know, the way you do that is set up your data flow, copy the NAR into the lib directory and then restart your Minify instance. And so, you know, just if you're at this, you know, you would have to, I think some of these NARS are packaged and bundled together. So if we look at NaPy, there is a NaPy. I'll have to go back and look and see where it's at. But there's a NaPy bundle, NAR, and that is a bundle of some of these standard processors. And so, you know, with that being said, you may, you know, there is a JD bundle, but that's not it. So you may have to include the bundle of processors, even though you may not use the others. One of the ways around that is, is again, go back to the source code, get the specific processor. We looked at the Git file source code and go from there. You know, once you get the Git file, you can compile it. You can use it in your main NaPy instance and then turn around and ship that with Minify. 
As a matter of fact, as long as it's the same processor, what I would do is build a GetFile processor that's specifically used for MiNiFi. Within the MiNiFi directory, you have the actual MiNiFi application.

Okay, so we are going to receive data over a remote site-to-site connection; this is what we get into with MiNiFi. MiNiFi will output to a port, and we are going to receive it. So drag an input port down, set it to receive remote (site-to-site) connections, and name it; I'm just naming mine "from MiNiFi." Go into the connection settings when you create it, and now you have a site-to-site connection coming in from MiNiFi. What's going to happen, and I doubt we get time to get all of this fully up and running, is that MiNiFi needs a place to send its data; it needs a connection. So we are going to listen on this input port for all those site-to-site connections from MiNiFi, and when data comes in, we're going to log the attributes. We don't have to log the attributes; we could take whatever data MiNiFi sends us and send it to another data flow or another process group, for instance. There's a lot of capability here in what you're able to do and configure. So create your input port, add a LogAttribute, or, if you want, just a LogMessage, add it to your canvas, and connect the success relationship. I'm going to auto-terminate my LogMessage so that it clears. Okay, so everyone should have something similar. Perfect. Good, Pedro. Brett, you might have missed it: you want an input port dragged down, and you want it set to allow remote (site-to-site) connections. There you go. Perfect. I'll give you another minute to create this. But again, this flow is to receive data from MiNiFi; this is not a flow that MiNiFi is going to run. This is our flow to catch whatever MiNiFi sends out through its site-to-site capabilities and receive that file. I have a LogAttribute because MiNiFi is going to send me attributes and data, so just have the LogAttribute, have a message, and it should be a valid flow. Okay, good deal. Some of you are getting fancy very quickly.

Okay, so that's our receiver; that's receiving the data coming in from MiNiFi. Now it's time for us to build what we plan to deploy to MiNiFi. To do that, I'm going to go back and create a new process group, and I'm going to call it my MiNiFi deployed flow; you can name it whatever you want. Then add a GenerateFlowFile. GenerateFlowFile is used quite a bit, and Pedro almost mentioned this earlier with his question about the web services: GenerateFlowFile is sometimes used to kick off a flow. If you have an InvokeHTTP, GenerateFlowFile is really nice because you can set up a timer and some other rules in it, generate a message, and that kicks off the InvokeHTTP and drives the rest of the flow. But for MiNiFi, let's just use the GenerateFlowFile processor and configure it. By default it generates a zero-byte file; I'm going to set the data format to text and have it send "Hello, World." Apply.
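Just for reference, on a MiNiFi Java agent that's driven by a config.yml (the version we're using here takes the flow definition JSON we export later), that GenerateFlowFile step ends up looking roughly like the entry below. This is a hand-written sketch, not toolkit output, so treat the id and the exact property names as placeholders.

    # Rough sketch of the processor as a MiNiFi config.yml entry
    Processors:
    - id: 00000000-0000-0000-0000-000000000001   # placeholder id
      name: GenerateFlowFile
      class: org.apache.nifi.processors.standard.GenerateFlowFile
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 10 sec
      Properties:
        File Size: 0B
        Data Format: Text
        Custom Text: Hello, World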
Now, again, remember, this is the flow that will be deployed to MiNiFi. So don't get it confused with making it valid within NiFi; it needs to be valid for the MiNiFi install. I know that can get a little confusing, but keep it in mind when you're developing your flow: depending on file locations and those types of things, you may use a GetFile or something similar. Drag your success relationship over and go from there. Actually, no, I apologize; delete that connection. We want to add a remote process group instead, so that when MiNiFi sends data to NiFi, it's sending it to that remote process group. You want to put in the instance that you're running right now.

When setting up these connections, if you set the transport protocol to RAW, you're going to need to work off of a dedicated port. And if you look right here, I get this question quite a bit: in the NiFi configuration, nifi.properties, you have, I think, port 10000 set as the remote input socket port, along with the remote input host. So when you are configuring your flow, you can choose RAW or HTTP. If you use HTTP, which is the one I recommend, it's going to communicate with your NiFi over the URL that you're logged into; it creates that secure connection back to your NiFi instance, and that's how NiFi is going to listen and MiNiFi is going to push its data. Now, if you are processing lots of data on the edge, you may not want all of that data going to the main NiFi HTTP port. The reason is that the Jetty server running underneath, serving up the web page, the API, and everything else, can get overloaded, and that can cause your browser to become unresponsive. You might turn a flow on across, say, 100 edge nodes, and once you turn it on, it starts bogging down your system and you may not be able to get into your UI; I've seen that happen. The best way to mitigate that is just managing your resources. But if you want to send it to a RAW NiFi port, that is okay as well; the beauty of that is you can configure that port, do some load balancing, put Nginx in front of it and do some really advanced load balancing, those types of things. I recommend HTTP, though; it's just a little bit easier to deal with. It operates on the same port and IP that NiFi is already running on, so you don't have to open another hole in your firewall to allow access. You can put proxy information in, those types of things, but that's usually what I like to do.
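Here's roughly where those settings live in conf/nifi.properties; the values below are only illustrative, so check your own install for the host, the port, and whether the instance is secured.

    # Site-to-site settings in conf/nifi.properties (illustrative values)
    nifi.remote.input.host=nifi.example.com
    nifi.remote.input.secure=true
    # RAW transport: a dedicated socket port you would have to open in the firewall
    nifi.remote.input.socket.port=10000
    # HTTP transport: rides on the same port and URL as the NiFi UI and API
    nifi.remote.input.http.enabled=true
    nifi.remote.input.http.transaction.ttl=30 sec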
Then take your GenerateFlowFile and connect it to your port. It says it does not have any inputs; let me see what's going on with my other one. Did you all receive the same error, that it didn't have an input port? Yes, I got it. Okay, thank you. Oh, I think it's because it's trying to connect, but I haven't set up the truststore, the keystore, and all of that on MiNiFi, which is a little outside the scope of configuring NiFi.

So anyway, what you would do is set up your flow and get that configured, and you're going to have it go out to that remote process group. Once that remote process group is there, you want to make sure you include your URL, the HTTP one. And do remember, you've got a UUID here, so you may have to define that in your CI/CD process, some of those types of things; just keep that in mind. Let's see here. Actually, it should still connect; let me see if I can solve this. It did have an input; we have two of them now, so it should work. "Doesn't match any of the channels." I think it won't connect due to SSL. So anyway, the issue I'm having here looks like SSL; it's not allowing it to connect to itself because it can't match the names. But when setting this up, you're going to create your flow here, and once you have your flow defined, we are going to export it out and run it in MiNiFi. Let me see if I can... this worked when I was testing it yesterday. One second. What am I going to put in here? No, that's not going to do it, because there's no definition of where it needs to go. I'll have to take a look and see why the configuration is blocking me when it was just working last night.

But basically, this is where you're going to build your MiNiFi flow; this is where you do your operations. Right now we're just doing a GenerateFlowFile, generating a hello world, and sending it to the remote process group, so you would configure that here, and we will transition. So I'll diagnose this and see what's going on, but I think you get the principle behind it. If we were deploying this to MiNiFi, we would do a GenerateFlowFile, and we could chain together other processors; say we were doing an InvokeHTTP because we're going to grab a local file that's served by a web server. You can build out your flow as much or as little as you want, but at the end of that flow, you're going to send it to a remote process group, and that remote process group gets defined here. You want to make sure your properties are set correctly. Lower latency... anyone else's latency getting bad on this? Oh, there we go. Just to reiterate: when you're configuring your remote process group, you're going to need to put in the NiFi instance you're going to communicate with, and you're going to need to set up matching certificates. The certificates that you use for NiFi have to be installed for your MiNiFi install as well; for this to work properly, it needs to make that connection. Again, that's where you go into the MiNiFi keystore and truststore and set those appropriately.
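On the MiNiFi side, for agents configured through a config.yml, the keystore, the truststore, and the pointer back to NiFi live in the Security Properties and Remote Process Groups sections. This is a rough, hand-written sketch; the paths, passwords, and host are placeholders, and the port id has to match the input port we created in NiFi.

    Security Properties:
      keystore: ./conf/keystore.p12
      keystore type: PKCS12
      keystore password: changeit          # placeholder
      key password: changeit               # placeholder
      truststore: ./conf/truststore.p12
      truststore type: PKCS12
      truststore password: changeit        # placeholder
      ssl protocol: TLS

    Remote Process Groups:
    - name: Back to NiFi
      url: https://nifi.example.com:8443/nifi
      transport protocol: HTTP
      Input Ports:
      - id: 00000000-0000-0000-0000-0000000000ab   # placeholder; use the real port UUID
        name: from MiNiFi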
I don't know why it isn't connecting; I had this working yesterday when I double-checked it, but we get the idea. So once you create your MiNiFi flow, you can leave the group. What we need to do now is actually get that flow: we've created our data flow, and now we need to export it out. If you right-click on the process group and choose to download the flow definition with external services, you should be presented with a JSON document, and I'm bringing it up so you can see what it looks like. Okay, again, right, it's messy. Once you download your flow definition with external services, this is the flow that you are going to use to feed your MiNiFi instance as its initial flow. If you have previously set up a C2 instance, you would instead go into MiNiFi, configure it in the properties to communicate with the C2 instance, and then you can tell MiNiFi what initial data flow to get, because C2 is going to serve that up. But for this scenario we're using NiFi to create our flow, like you would do no matter what; we didn't save that flow anywhere else, so for this way of running MiNiFi we export the flow as JSON. Make sure you don't export it as an XML template or any of those other formats; it is expecting a JSON document.
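Just for orientation, the downloaded flow definition JSON has roughly the top-level shape below. I've trimmed it down to empty sections, and the exact keys can vary by NiFi version, so treat this as a sketch of the structure, not a template.

    {
      "flowContents": {
        "name": "MiNiFi deployed flow",
        "processors": [],
        "connections": [],
        "remoteProcessGroups": []
      },
      "externalControllerServices": {},
      "parameterContexts": {}
    }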
Okay, hopefully everyone's getting back from our last break. We'll spend a few more minutes going over some tips and tricks, as well as some custom processor development. I do have a question first. Ben, if you've made it back, or anybody from the audience: would you have felt better working on your own laptop versus this virtual environment? And do you all have the capability to work in your own environment? Sorry, I had stepped out of the room for a second; were you asking if we could run MiNiFi on our actual laptops? Yeah. If I were to give this class without the virtual desktop environment we're working off of... we do run into a few technical issues with it, and the latency is a pain in the butt.

But let's get into custom NAR development. I realize, even on this call, we have one, two, three, four, five devs. How many concurrent tasks, right? As you can imagine, we have one concurrent task per processor by default, but with that being said, you may want to increase that. We are using the bare minimum that we would need for this flow. Again, I've seen numerous times where those needed to be increased; if you're constantly setting up and tearing down a large number of sockets in a short amount of time, you'll need to increase those, and I've seen a lot of flows do that. But the sysadmin never came in, and they just didn't know about the best practices. This always gets touched on in every class I've ever seen, so I wanted to make sure I pointed it out. Great question. I'm really glad, because I kind of highlighted that, but I didn't get to use it. That's a good question, though, because I can show you exactly how it helps. I can use the GeoMesa one, I bet. Yep, releases... there's a NAR. Let's see here: GeoMesa datastore service, file system, Kafka... oh, InfluxDB. Perfect. Great, great question. So I have that NAR, right? And say I've developed my own custom processor: I have used Maven to build it and all those fun things. Done. Perfect. So you built your NAR and got it running, or you downloaded a NAR; just be careful what you download, right?

So what I like to do, and I talk about this: here's the InfluxDB NAR. If you remember, I can stop NiFi, put it in the lib directory, and start NiFi again, and it will show up in my processors. The better thing to do is the extensions directory. Even if you have a custom processor, I still recommend putting it into the extensions directory; I would leave the lib directory to your core supported NiFi NARs and libraries. So if you need custom processors, you can have them go through a CI/CD process, build them, and send them to the extensions directory. The extensions directory was specifically built to allow for custom processors and hot deployment. So if I go back to my NiFi instance, I haven't had to stop NiFi, I haven't had to do anything, and now I should have InfluxDB... there it is. You can see I have the InfluxDB processors: GetInfluxDatabaseRecord, PutInfluxDatabase, those types of things. Because I just took that NAR, downloaded it, and put it into the extensions directory, it hot-deployed the NAR and now I have full access to it. Is that your question, Pedro, or was it something else? Yep, yep, very nice, that's what I was wondering. Perfect, I got one right. No, great question, and I've mentioned the extensions directory a few times now, so I'm glad you asked how you actually do it. You just take the NAR, put it into the extensions directory, hit refresh in NiFi, and it should show up; it might take a minute. If you look at the logs, and usually when I'm sysadminning a cluster I have PuTTY pulled up and about a thousand logs, you can see that NAR getting deployed. Sometimes it just takes a minute. Well, I'm getting all kinds of... oh, my warnings are because of that connection. But anyway, if you're reading your logs, you should be able to see that the NAR was deployed. There we go: NAR autoloader. It autoloaded the NAR and deployed it; it's good to go. Sometimes it's not easily apparent, so as a sysadmin you may want to tail this log when you're deploying into the extensions directory and just kind of watch things. The log file will have additional detail, and NiFi may not even report an issue with that processor until it's being used. So you want to check your logs when you're loading custom NARs, just to make sure they deployed correctly. But great question, and I'm glad we got to do a hands-on, real-world "here it is."

Any other questions? Absolutely. So, that is not a policy in NiFi, and as a sysadmin you need to be able to lock these directories down. The only way someone should be able to bring in a custom processor is if they have access to the lib directory or the extensions directory. You can lock that directory down to only root, for instance; depending on your security mechanism, there are some other capabilities there. But in NiFi itself, you notice that we never went into NiFi when we were loading a custom processor; we just dropped it into the extensions directory and NiFi loaded it. So your way of securing that is to secure the extensions directory, which is outside the NiFi UI. Great question, though.
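Since custom processor development has come up a few times now, here's the kind of skeleton those NARs are wrapping. This is a generic example against NiFi's public processor API, not the GeoMesa or InfluxDB code, and the package, class, and attribute names are made up for illustration.

    package com.example.nifi.processors;

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.annotation.documentation.CapabilityDescription;
    import org.apache.nifi.annotation.documentation.Tags;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    @Tags({"example", "training"})
    @CapabilityDescription("Stamps each FlowFile with an attribute. This text becomes part of NiFi's built-in documentation for the processor.")
    public class ExampleStampProcessor extends AbstractProcessor {

        public static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("FlowFiles that were stamped successfully")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            // Pull one FlowFile off the incoming queue; there may be nothing to do this cycle.
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Add a piece of metadata (an attribute) and route the FlowFile to success.
            flowFile = session.putAttribute(flowFile, "stamped.by", "ExampleStampProcessor");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Build that into a jar, wrap it in a NAR module like the pom sketch earlier, and you can drop it into the extensions directory the same way we just did with the InfluxDB NAR.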
All right, other questions before we wrap this up? Well, a few housekeeping things. You either have been sent, or will be sent, a survey. I noticed I did say "um" a few times, and I'm trying not to do that, so hopefully it didn't bother anyone; I did get marked on that before. Please complete your survey as soon as you can; my pay is tied to that survey. Really just the timeliness of it; I get paid either way, but the quicker those surveys are completed, the quicker I get paid, so if I get five, six, seven of these surveys complete, I'm good to go. I think I have everyone's email address that signed up. What I will be working on tonight and tomorrow is getting all your questions fully answered; I'm even going to include a tip on how to rename the files, because I know we ran into that. So I'm going to compile all of that and email a PDF that has this information, plus some tips and tricks, the questions you all asked, and anything else I can answer. If you have any future questions, let me know. Technically, after this training is done, I'm done, but I never like to treat people like that. If you have a question, especially on NiFi, I really enjoy it because it's one of my quote-unquote babies, so feel free to send it over. I'm also going to send some YouTube links; there are some really cool design patterns that we kind of went over, but we just don't have the time to fully go through every design pattern. My friend Mark Payne put these together on YouTube, so I'm going to send them over to you. And if you have any other additional questions or anything else, shoot me a note. If it's something short and quick, I'll be happy to help; if it's more long-term support or something like that, feel free to let me know. Like I said, I contract to this training company through my data engineering firm, so I'd be happy to help out wherever I can. And if there are no additional questions, I'll give you back a few minutes of your time. Have a good rest of the day. Thanks, everyone. Have a good one.