Apache NiFi GROUP 2
language: EN
WEBVTT Perfect. Perfect. Thomas, did you make it back yet? Did he get called out to another call or something? Sorry, were you looking for me? Tom? Hey, Tom. Yeah, if you can, go ahead and start your desktop and get logged in. I think it should be good to go. Sure. All right. Looks like everyone is coming up. Peter, just so you know, the uploads folder is what I uploaded back; I actually need to upload a newer presentation, so the PDF you see there has a couple of things. Depending on the system you're using and proxies and things like that — in the last training class a couple of folks got proxied to death and had to have some fixes applied before things were working. But once you get logged in, you're going to have your virtual desktop, and the latency can sometimes be an issue; you may point and click at something and have to wait a second. nifi.apache.org is the website to go to. The documentation is very, very extensive. As you can imagine, being a government system, it requires lots of documentation. You can download NiFi from here, those types of things. So if you're at home and want to play around with NiFi, have at it. You can download the binary, and you can download the source. If you want the source files and to compile NiFi yourself, you can have the source; it is an open source application, so you're able to download the source files. If you're a software engineer, you're able to go in and make changes. If you're running scans or vulnerability checks, have at it. There's a lot of capability there, and the administrator guide actually goes into how to build NiFi from source and those types of things. But for this class, we are going to work off of the standard NiFi 1.26 binary. It's already pre-built, it's ready to go, and we'll look at the documentation as we go. The main parts of the documentation that we're going to go over are part of the admin guide and the user guide. Like I said, this is extremely well documented — for an open source application, it's actually one of the best. I like to just work off of the official documentation. So I can go right here and, as soon as the internet wants to respond, I'll have the documentation on GetFile. There we go. So, for instance, the GetFile processor creates FlowFiles from files in a directory. It will ignore files it doesn't have at least read permission to. Each processor has properties — some are required and some are optional — and there are also some we can add. There are relationships on the processor, other attributes, things like that, that we can take a look at. For now, just remember that the documentation is there. So everyone has NiFi downloaded, and I'm going to walk you through it. I was debating whether to include this as part of the class, but I felt like we can all accomplish it pretty easily. So what we're all going to do is install our NiFi and walk through what some of that means. If you don't understand it or you have any questions or need some additional details, again, feel free to interrupt me any time between now and then. Oh, there you go. Give it another second. Like I said, it takes a minute to extract all of that zip file. That zip file is somewhere around a 1.2 to 1.5 gig file.
And when it extracts, it's going to be even bigger. That's probably the biggest complaint that I know of — not from the NiFi community, but from the user community in general — just the massive size of this download. But if any of you have played Xbox, you know some of those games can be 50-60 gig now. So, yeah. All right. It looks like most everyone's got that extracted. Give it just another second for O'Darius to finish up and Peter to finish up. All right, let's go back to my screen here. So, it looks like it's finished. Perfect. So you should have a new folder in Windows, and the only folder inside it is nifi-1.26.0. If you double-click and go into that, you should see a bin folder, a conf folder, docs, extensions, lib. These are not all of the folders that NiFi creates; this is just the initial downloaded install. When we get NiFi up and running and started, it's going to create some additional folders, like the content repository, and all of that lives locally. Now, depending on your strategy for deploying and scaling this, you may want to have some of these content repositories, some of these other repositories, on different network-attached storage. I know that the content, flowfile, and provenance repositories are usually stored on high speed drives just because there's a lot of reading and writing back and forth, and then you'll have some of the other repositories that really don't get used as much. They're still needed, so you may break this up a little bit and put some of these folders on high speed drives and some of the other folders on normal drives, for cost savings and performance gains. But that all depends on your deployment strategy. Whenever we have time — and we're going to have plenty of time for this — if you want to go very technical into the details, I'll be happy to give you my opinion. I can get very technical. I still write software; I still write software for NiFi, even. But I kind of like the training part as well. So anyway, when we extracted, we've got the bin folder, the conf folder, docs, extensions, and lib. Docs is docs — as I mentioned earlier, everything that you can find on the website, you're going to get in the docs folder as well, and NiFi utilizes that docs folder to provide you information. The bin folder, that's your binaries, right? This is where you would execute the start of NiFi and those types of things, and we'll go into more of that once we start. The bin folder for NiFi contains both Windows batch files, BAT files, as well as Linux shell scripts. So if you're running this on Linux, you have a way to start NiFi; if you're running this on Windows, you have a way to start NiFi. That's how you would start NiFi, as well as some of those utilities for things like changing a username or password. The conf directory, which we are going to go into, is where all the configuration for NiFi exists. So all your properties — what IP address is NiFi running on, what port number is NiFi running on, those types of things. There is a lot of configuration, and a lot of it is security: plugging in that security infrastructure and those types of things, you would do that configuration here.
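For reference, the start commands the bin folder provides look roughly like this — a minimal sketch; script names can differ slightly between NiFi versions:

    # Linux / macOS
    ./bin/nifi.sh start        # start NiFi in the background
    ./bin/nifi.sh status       # check whether it is running
    ./bin/nifi.sh stop         # shut it down

    # Windows
    bin\run-nifi.bat           # run NiFi in the foreground (Ctrl+C to stop)

    # reset the single-user login if you ever lose the generated one
    ./bin/nifi.sh set-single-user-credentials <username> <password>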
So what I am going to do, and this is totally up to you if you want to follow, is open nifi.properties — you should see nifi.properties in the conf directory. I am going to open that and go over some of the key points of the properties, just so those of you who are sysadmins and others have this information. I know for some of you who may not be that technical, this may be a little overwhelming. Again, this is just for information. You are more than welcome to follow along, but there are some key points that I feel everyone needs to see in the property file. So anyway, this is your core properties section. Again, a lot of this is documented, and a lot of it relates back to the website. So: what is the main flow configuration file, and where is it located? Of course it is going to be in your conf directory. Where is the JSON file? It is also there. You have archiving enabled, those types of things. So that is some of your core properties. Some of the other ones: your authorizers configuration file. This is where — as was mentioned earlier, some of you are trying to work on getting NiFi not only installed and up and running, but with multi-tenancy and multiple users. I think it was Brett who is working on some of that. So Brett would go in here, configure these properties, take a look at the authorizers.xml file, and start building in the configuration he would need for security and user permissions and identity management and all that fun stuff. That is where you would find it. But there is one key property that you can just tell has come from the government, and that is nifi.ui.banner.text. This property is there for a couple of different reasons. This banner, as you can imagine — you could put unclassified, you could put secret, you could put top secret, whatever classification header you would need. What it does is provide the government an easy way to put the classification of the system on a banner, so when you pull up this NiFi instance, you immediately see the classification of the system. Also, because it is such a government-flavored property, the way commercial companies use it — and others in the government as well — is: this may be our dev instance of NiFi, this may be our test instance, it may be prod. I know a lot of companies that use this banner as a description, so you can quickly go to the UI and immediately see, I am working on the test system. So for me, I am actually going to put something in and say this is a test system. I can put in whatever. And again, you don't necessarily need to do this. If you are following along, feel free to put in whatever you would like — this is your own personal NiFi instance — or you can just leave it blank. Some of the other properties that you would potentially need to look at, if you are a sysadmin and such, will be immediately relevant as soon as we start NiFi. And then, like I said, there is an extensions directory, and it is empty.
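As a rough sketch, the kinds of entries being scrolled past in conf/nifi.properties look like this — values here are defaults or placeholders, not something you have to copy:

    # core properties
    nifi.flow.configuration.file=./conf/flow.xml.gz
    # a flow.json.gz is kept alongside it in recent 1.x releases
    nifi.flow.configuration.archive.enabled=true

    # authorizers / login identity providers
    nifi.authorizer.configuration.file=./conf/authorizers.xml
    nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml

    # classification / environment banner shown at the top of the UI
    nifi.ui.banner.text=THIS IS A TEST SYSTEM

    # where the web UI listens
    nifi.web.https.host=127.0.0.1
    nifi.web.https.port=8443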
This is where, if we had a special processor and you had a CI/CD process set up — where a developer creates a processor, checks it in, builds it, tests it for vulnerabilities, and goes through the whole CI/CD and DevSecOps policy you have in place — ultimately it will spit out a NAR file, and that NAR file can be automatically installed into the extensions directory, and you would have immediate access to that processor. Not only would you have immediate access, but if the permissions and the policy were there, everyone would have access to that same processor. That goes back to some of the usability points I was making earlier, where you are able to reuse these components. So say I build a connector to — it's already built, but let's talk SQL Server. If I build a connector for SQL Server, test it out, and it has gone through the processes you may have set up, it gets deployed to that extensions directory. Well, now everyone can use that connector. As a different organization, I don't have to go and build a new connector. I can just reuse one that was already built, but I may be connecting to a different instance, different usernames, passwords, authentication methods in SQL Server. It may be the same SQL Server, just pulling from a different database or a different table. Those types of things. So that extensions directory is pretty important here. That is how we hot-load processors. It means we do not need to stop data from flowing, we do not need to turn data flows off, we don't need to restart the application. It can run, data can continuously flow through the system, and now I have a new capability so I can connect to new data sources. That's the purpose of the extensions and lib directories. Again, all of that is referenced in nifi.properties. You can change it — I've seen some folks change it to a different lib directory depending on their policies, things like that. As a sysadmin, this is the section to do that. And then, of course, you need state. You need state management. If you're working in a clustered environment, you would need ZooKeeper, which is another open source software application. If you've been around any kind of distributed or clustered system, you've heard of ZooKeeper. ZooKeeper is widely used across the board, government and commercial alike. So here is where you would manage some of that: state management. We have a database directory, so that's our database repository. Again, we have multiple repositories here, and where you store them is configurable. So, depending on whether a repository needs a lot of reads and writes, you may store it on a faster or a slower drive — you don't have to choose one or the other. Because this is so highly configurable, you can do those types of things: reducing your cloud costs, your server resources, your on-prem resources. I know you guys do a lot of stuff on-prem, so it may help reduce some of those resources. And that's the database settings. Then there's the flowfile repository and the content repository we talked about. One of the things I like to point out here is that the content repository is keeping, basically, your flow file content.
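The repository and extension paths being described map to properties along these lines; paths shown are the stock defaults, and you would point them at different mounts if your deployment strategy calls for it:

    nifi.nar.library.directory=./lib
    nifi.nar.library.autoload.directory=./extensions   # NARs dropped here are hot-loaded

    nifi.database.directory=./database_repository
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository

    # clustering / state management
    nifi.state.management.configuration.file=./conf/state-management.xml
    nifi.zookeeper.connect.string=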
So if you told it to ingest a CSV, that content repository is keeping a copy of that CSV for the time being. That's because, if NiFi were to crash and shut down, when you restarted it, the processor that was processing that flow file is going to go back to the content repository and say, give me back that file, I need to finish processing it. Say we've got three or four processors chained together: we get a file and we send it to the next step and the next step. Well, if something crashes before it gets to the next step where it completes, when NiFi comes back up, it will reprocess that content — that file — based on whatever processor was working on it, because of that content repository. And just so you have a little bit of the under-the-hood workings of this: when a flow file, a piece of data, goes to a processor, it will not release that flow file and that data until the next processor has it. What that does is guarantee that a copy of the data is on the next processor doing its function, and that the next processor got a thumbs up from the previous processor that it was complete. That way, if something crashes, you don't lose data. Now, if it is in the middle of processing data and it crashes, it's going to try to reprocess that data. So keep that in mind: you may get some initial results from NiFi before additional processing completes, and you may get duplication of data — it produced 25% of the output before it crashed, and when it comes back up, it's going to try to redo that, so you may get some duplicated data. With that being said, we do have ways of dealing with that as well. There's actually a dedupe processor; I don't know if it's in this latest version, but I do know it exists, because duplicate data is a pretty big issue in my experience in all the years I've been with the government. So, yeah, that is our content repository. Then you have all the provenance events; they have their own repository. When we start NiFi, that new folder is going to be created, and that's where the provenance events will go. So you can specify how much to keep — Richard, I'm thinking about you here, where you may have an overarching data governance plan and strategy. You might configure NiFi to retain, say, the last 14 days, and during those 14 days you're offloading all of that provenance information into a larger data governance system. Informatica has one, and there are a couple of open source options — Apache Ranger, Apache Knox, and a few of those tools work well with NiFi. So you may have a corporate-wide or unit-wide governance policy, and this is where it would get configured. Right now it's configured to keep all the provenance events for 30 days with a max storage size of 10 gig. So keep that in mind when you're building and designing your system: if you have a ton of data coming through and those events are being offloaded into another governance system, you don't need to keep 30 days' worth of data — you may only need to keep it for a week or a day.
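The 30-day / 10-gig retention mentioned here corresponds to properties like these; a site that offloads provenance to a corporate governance system might dial them way down (example values only):

    nifi.provenance.repository.max.storage.time=30 days
    nifi.provenance.repository.max.storage.size=10 GB

    # e.g. a site that ships events elsewhere might keep only a few hours
    # nifi.provenance.repository.max.storage.time=2 hours
    # nifi.provenance.repository.max.storage.size=1 GB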
I've seen this configured where it keeps provenance only for a couple of hours, because as those events happen, everything is being offloaded to the corporate-wide data governance system. So this is highly configurable. For you sysadmins out there, as you start working through getting it installed, pay attention to some of these properties, because this one, for instance, will take 30 days or 10 gig to fill up, and you may want to adjust those settings. Again, a lot of times applications have settings where you can just go to a menu, select the setting, and change it. We have some of that in NiFi, but these are part of the core settings — there is no UI for this. That was one of the things we went over in the last training class: you're going to have to go in and edit these files and put in different properties based on your organization. I wish there was an easier way to do this, but I find this way is not too bad. Setting up security and those types of things, now that's the more difficult part. And the thing with that is, if you do run into an issue, you're asking the community, you're asking Google, or you're emailing me and saying, hey, Josh, how do I do this? You'll get my contact information at the end of the day, I think it is, and I will be happy to answer any quick questions after this training. So, I mentioned that with NiFi, up until recently, you could download it, install it, and have it up and running in a few minutes — but everybody in the world could access it if it was on a public IP or something. So what they did is they went through and said, okay, we are now going to secure every install. We're going to generate a username and password that is unique to every install. To find that information, you actually have to go into the nifi-app.log file in the logs folder and look for the username. You're going to see, in that log file, a generated username and a generated password. That is going to be our username and password to log in. Yours is going to be different — it's a unique, randomly generated ID, so your username and password will be different from mine. I'm going through this right now, but as an exercise I'm going to have you all basically do the same. What I like to do, because there's no way I can remember that much information, is copy it and put it in a new document. Because that log file is going to go away: as we process data, it rolls over to a new log file, and there's a lot of information in that file. So I like to pull out that initial username and password and have it readily available. So what I did is I created a new text file, copied and pasted the username and password, and I'm going to save it as text and throw it in my downloads, give it a name there. Perfect. So now I've downloaded NiFi, I've extracted NiFi, double-clicked run-nifi, it went through and created everything it needed to get up and running, and it's up and running now.
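If you want to pull the generated login straight out of the log rather than scrolling, something along these lines usually works — the exact log wording can vary by version:

    # Linux / macOS
    grep -i "generated" logs/nifi-app.log

    # Windows
    findstr /i "generated" logs\nifi-app.log

    # typical lines you are looking for:
    #   Generated Username [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx]
    #   Generated Password [xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]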
So it's just waiting on me to log in. What I like to do then is bring up my browser. If you remember, the IP address was 127.0.0.1, which is localhost, and we were on port 8443. So HTTPS, because it's secure, and colon 8443. Let me show you what happens — let's just go to this one. Like I said, the initial startup of NiFi can take a few minutes. So if you are following along and you're trying this and you're getting page not found, give it time — and it also helps that I put in the right port, 8443. But again, you can put in the correct IP address and the correct port and it still might not load. In the last class, I noticed it took three or four minutes before it was fully up and running — even though NiFi would report that it was running, it still took three or four minutes to initialize. And again, we're working in a high latency virtual desktop environment, so your own environment may be better or different. So anyway, I'm at 127.0.0.1. It's going to come back and tell me my connection is not private. It's a self-signed certificate, right? All of this was set up just to add that username, password, security layer. So what I like to do is go to advanced and proceed. And then I didn't specify slash nifi, but it caught it — it automatically redirects me. And now I'm at the login screen. It's asking for a username and password. I have it right here, luckily — that's why I said copy and paste it when we get to that part. When we go through this more hands-on, make sure you copy and paste it into something a little bit easier, because that log is going to go away. Tomorrow when we log in, if you did not copy and paste it somewhere, you're going to have to find that old log to get it. Enter the username, the password, log in. Perfect. We are now in the application. So this is the NiFi application. It is web-based. There are a lot of buttons and a lot of things here, and we're going to go over every one of them. But again, it's a web-based application. There are some server technologies under the hood running this — Jetty and some other things — but it's all browser-based, mostly so you can work with the data flows. But again, there's no point-and-click properties manager, so you've got to hand-edit those properties. A lot of applications make you edit properties files. Once you get it up and running, though, you shouldn't need to go back to the logs directory or those other properties unless you have a warning or an error that you need to look at. If you're running this as a standalone instance, in your spare time on your laptop or even at work, you probably don't need to go back and look at those. But make sure you keep that username and password. So we're logged in. I could actually start building my data flows now, but what I'm going to do is go back to my presentation where we talked about some of the core components of NiFi. We talked about processors, connections, flow files, the flow controller, all of those things. Let's take a look at them and see more about what they are in NiFi. So, this is your canvas — a blank canvas. You don't have any processors running, you don't have any process groups.
You don't have any data flows or anything else. It's a blank canvas. In this section up here, you can see the NiFi logo, and I want to point out there's my banner saying this is a test system. I could put that in capital letters, UNCLASSIFIED even, right? Or I could put dev or test. When NiFi starts, it reads that property. So anyway, this is the main canvas. The UI has multiple tools to create and manage your first data flow. This is the component toolbar. You should see processor, input port, output port, process group — if you just hover over them — remote process group, funnel, template, and label. With the last group, I did not mention the funnel on purpose, and they were still able to work it into their data flow; it was pretty understandable, they just referenced the documentation. But anyway, this is your component toolbar. Right below the component toolbar is the status bar: how many bytes are going in and out of the system, how many processors are started, how many are stopped, how many are disabled, how many have a warning — all of those things. Now, the canvas itself only updates automatically every five minutes, but at any time — and you'll hear me say this a few times when we're building a hands-on data flow — go ahead and refresh your canvas. When I say refresh your canvas, that doesn't mean go up here and refresh from the browser. Anywhere on the canvas, without clicking on any component, you can right-click and hit refresh, and it will refresh the stats. So that is your status bar. This is our operate palette, and we'll go more into that. A lot of times we have closed systems, one-way transfers and things like that — systems that don't ever touch the internet and are on their own closed network. So you may not be able to go look up the properties and things like that online, but the help and documentation are all bundled here without ever having to go to the internet. And then of course you have the About dialog. This is version 1.26.0; it was built on May 3rd, it was tagged as release candidate one, and you can see the branch and everything else — you can actually pull a lot of information from it. Again, if you go to GitHub, for instance — let me search — yep, the GitHub repo where all the source code is located is here. This is the main branch, but you can go through and see all the different branches, release candidates, those types of things. Here is the source code to all of it. Not only can you download it from that link earlier, you can do a git clone if you are familiar with GitHub and Git, clone this, and build it yourself as well. So keep that in mind. Again, it's very open, it's very well supported, and there's a lot of documentation for it. So that is the help section. That is an overview of the canvas and all of the components on the canvas. Before I start diving into some of the finer workings of NiFi, I want to pause there. Are there any questions I can answer up to this point?
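If you do want to build from source instead of using the pre-built binary, the usual path is roughly this — it is a Maven build and takes a while; the admin and developer docs have the exact flags:

    git clone https://github.com/apache/nifi.git
    cd nifi
    mvn clean install -DskipTests    # the convenience binary ends up under nifi-assembly/target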
Well, hopefully I'm teaching so well that it's very clear and understandable. I always worry about my southern accent playing a part in this. But again, if you have a question, feel free to interrupt me. Or if you need me to translate something or speak proper English, feel free to yell at me. I've got one quick question. Yeah, go ahead, Tom. My understanding is that when you run the command — I do see that, and I've run this in a container before — you'll see the password in there during the execution. But I'm sorry, I missed the part where, let's say you don't see it there. How do you get to it? Is it written in a log you can go and retrieve? No, that's a great question. So, when we start NiFi, it automatically creates this log called nifi-app.log, and that's where 99% of NiFi activity is written. And so, yes, on that first install you're going to see a generated username and generated password, and it's only going to be in the logs. The problem with it — do you see it? Yep, I see it there. Okay. The problem with that is, we're going to start doing some hands-on exercises and some work here, and that log is going to roll over. It's going to rename the old log — give it, I think, a date at the end of the log name — and start a fresh one. So if you do not capture your username and password pretty quickly, it's going to be in another log, and it could be in a log that was generated days ago if you didn't set everything up right away. Or you may go in, put a data flow in, and run it; it generates all these log messages, and now your username and password is sitting in a five-day-old log file. So that is where you initially get your username and password, but do know that it can go away, especially if we're doing a lot of operations very quickly. Great question. And we are going to go through installing and getting it up and running and all that fun stuff — if you didn't follow along, I like to go ahead and show you what we're doing. That's where we're going to get our username and password: you can find that log in the logs directory, and it's nifi-app.log. Let's see if I can find it. Yeah, I haven't generated enough data for it to roll over yet, but tomorrow I bet there's going to be a nifi-app.log with 2024-05-21 on the end, something like that. Okay, any other questions? All right. So what I'm going to do is go through the components more in depth. We will then take a break and go to lunch, come back from lunch, get everyone else up and running with your own instance of NiFi, and start building some data flows. That being said, on the component toolbar, the first thing I have is processors. I just click that, hold it, and drag it down, and here are all of my processors. With this version of NiFi, this install, I have 359 processors available. I have processors to handle Amazon, Azure, AWS, tags, JSON, CSV, all kinds of things. What you're seeing here is like a word cloud of tags for all the processors. You also have a list of all your processors and their descriptions.
Just because it was asked last time: you will see the shield, the little red and white shield, beside some processors. It's specifically called out because you can create a policy and security within NiFi that will allow you to lock down certain processors. For this one, it's a processor that references remote resources, so it falls within that restricted category. Because of that, you may set a policy that says my data engineers cannot see the processors in this group, because they're not needed and, for security reasons, we're just not going to allow it. Or you may have database admins that need access to a group containing the database connection details, while another group doesn't have access to it and doesn't need it — they can just reference that from a controller service. So that is the reason for the little shield. But anyway, all of these are processors — 359 processors. What I like to use is the little tag cloud here, and I want to see all processors with "get" in the description, right? And there should be my GetFile right here. So that's one way to select a processor. What I like to do, though, is this: I already know the name of the processor, I see it, it's highlighted, I say add. Boom — new processor on my canvas. This processor is just the GetFile processor. It has a single function: to pick files up and retrieve them from the file system. It's not trying to extract things, it's not doing any kind of ETL, it's not a model or anything else. This processor does one function and one function only, and it does it very well, and that's getting a file. Also, within a processor, you can see again that little shield indicating it belongs to a restricted group. You can imagine you may have a convert-text processor, right? From a security aspect, that's very low risk, because you're just converting data that you already pulled in to other formats and sending it on. That convert-text processor doesn't have connection details, it can't get a file, it can't put a file, it can't connect to a database or anything else like that. But because this one can actually get data, there is a security group for it. Depending on your security policies, you may want to lock this down so that folks can't do a GetFile or a PutFile. They can still build the logic of the data flows and everything else, and they may get their data from another processor, and that way you reduce the risk of someone doing a GetFile they shouldn't. We actually had this happen in the last class with a couple of people: during the exercise they put a GetFile, specified the same directory NiFi was installed in, told it not to keep the source file, and told it to ingest everything. So what they did is build a flow that self-destructed. When they ran that flow, it went and grabbed everything in the directory and then crashed, because it just couldn't work — it consumed itself.
So there are some security considerations that go into this as you plan your deployment out. But anyway, that is our GetFile processor. You can take a look at it and it gives you some quick information: how many bytes came in, how many bytes were read and written, how many bytes went out, how many tasks ran and the time it took to execute them. All of this, again, is for the last five minutes. If you hit refresh on the canvas — I click off of the processor and hit refresh — and data was flowing through, that would be updated. So that's how you get a quick refresh of what's going on with the processor. Now, on every processor, you should be able to click on it — it puts a little black box around it to highlight it — and then right-click on it and you have options. The option we will use most is probably configure, so we can actually configure the processor. There is a disable option if you want to disable it. You can view the data provenance for this specific processor. You can replay the last event through the processor as well. You can view the status, the usage, the connections. You can center it in view. You can change the color of the processor. We're going to get into some of this, but just FYI: in the hands-on exercise, some of the things I look for are coloring, labels, naming conventions — things that are very non-technical, but I look for them because of usability and ease of use. So anyway, that's my GetFile. I have configure, disable, provenance. I can group processors, I can create a template, I can select multiple processors and create a template, I can copy it and paste it, I can delete it. But for this scenario, I want to say configure. So this is how I configure this specific processor. It has a name, GetFile. Now, I don't like that GetFile name, because it doesn't tell me a whole lot. If I had a data engineer looking at my flow, I want them to be able to look at it, quickly understand what's going on and how it maps together, so they can accomplish the task they need to do. So what I like to do is go into the name during configuration and say "Get file from system". There we go — that's an easier, more human readable description of what this processor is going to do. Also, if there is a penalty or error or something like that, it will penalize the flow file, and the penalty duration is how long you want it penalized. Right now everything is defaulted to 30 seconds; after 30 seconds, it's going to retry and reprocess that flow file. So, a 30 second penalty. The bulletin level: what kind of logging do we want from this processor? The bulletin level is set to warn, but if you want to log everything, you may put it at debug. Most times you keep it at warn or error, and what that means is, if this processor has a warning or error, it will push that to the nifi-app.log. If you're building a flow for the first time, you may put debug, and that will log everything — you usually don't need that much detail, but it's there in case you need it. We'll take a break in about 15, 20 minutes. Okay.
And then you have yield duration — how long the processor is going to yield before it's scheduled to run again. One second is pretty standard. Again, you may change these settings when you start building your own, more real-world data flows, but most of the time these properties all stay the same except for the name. Scheduling: there are a couple of scheduling strategies — timer driven and cron driven. All processors default to a timer, so it's running constantly. You can set a run schedule that says, hey, I want to run this processor every one second, or every 10 minutes, or every 10 hours. That scheduling strategy dictates the running of the processor. You may have a cron where the processor runs only between 10 PM and 11 PM with a run schedule of every one minute, and so it's going to run 60 times during that hour. Then the concurrent tasks setting is how many tasks run at once. This processor is doing a get file, and it's running one task to get files. Now, one of the things I had the class do last time was actually have GetFile pick up that 1.2, 1.5 gig zip file that NiFi came with and decompress it. We had a few folks where the file got duplicated or they picked up everything, and what happened was it slowed the system down — it was taking a while to pick things up and send them off, because it was processing large amounts of data. If they wanted to make that quicker — the run schedule was already running full speed — they could have put the concurrent tasks at five, giving five concurrent tasks to execute this processor. Properties, then — that's the big one, that's the main configuration, and we'll pick it up after the break. Hopefully everyone had a great lunch and we're all coming back.
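As a concrete example of the cron-driven case described above: NiFi takes Quartz-style cron expressions, so "every minute between 10 PM and 11 PM" would look something like this (expression shown is illustrative):

    # Scheduling Strategy: CRON driven
    # Run Schedule fields: sec min hour day-of-month month day-of-week
    0 * 22 * * ?        # second 0 of every minute of the 22:00 hour -> 60 runs

    # Timer driven equivalents are just an interval: 1 sec, 10 min, 10 hours, ...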
on 2024-05-20
language: EN
WEBVTT I think most of you have already answered my first question. If you followed along, do you have NiFi installed? It looks like everyone does except for O'Darius. Are you looking for your username and password? Yeah, I had a hard time finding it.
on 2024-05-20
language: EN
WEBVTT I kind of followed along. Everyone is up and running on the NiFi canvas, which is really great. Again, I was thinking about having it already installed, but I figured the install — the downloading of the zip, extracting the zip file, those types of things — was worth walking through, so it was good to see. I see some of you have already started to create processors and chain them together. That's amazing. We're already a little ahead of the curve, so we might get out of here a little early today, but definitely looking good. Well, O'Darius, if you can give me a thumbs up — just give me a shout when you're ready. Are you able to type yet? Right now, one of my coworkers is helping me troubleshoot. Okay, perfect, perfect. I'll give you just another minute. While we wait on that, did anyone have any issues installing NiFi? Downloading — I know the zip was already downloaded, but you can go to the website and click download — did anybody have any issues getting everything up and running, going into the logs, finding your username and password? Again, I'd recommend you copy that username and password and put it in a separate document, just so you have it for later today and tomorrow and the next day. Yeah, I had an issue with a JAVA_HOME environment variable, but I was able to power through that. Yeah, you're going to get that in the logs, just because we're using Java downloaded from Oracle's website, and it gets installed as a JRE, not even a JDK, so it doesn't have JAVA_HOME set properly, NIFI_HOME set properly, and all of those. It will complain about it, but still run. Okay, perfect. Yeah, I just saw the error there. Yeah, the Java home environment variable is not defined correctly, so instead the path will be used to find the Java executable. If you are deploying this and building it into a real-world environment, you would have a NIFI_HOME, a JAVA_HOME, those types of things. One thing to also keep in mind: NiFi supports Java 8 and Java 11, so if you have a different version, like 13 or 9, it's not supported. But again, all of these instructions are in the administrator guide on the website. If you look at the administrator guide, it goes into how to start and stop, how to build, some of the port configuration we talked about, ZooKeeper information, and also some configuration best practices. I usually include this link when I send things out, because I've seen these best practices not followed. NiFi likes to have a ton of open files and a ton of TCP sockets open, and if you are running on a Linux machine and this is not configured properly, it will start spitting out errors and some of your data flows will start having issues. So, just some of those best practices. It also recommends anti-virus exclusions on these folders, just because they're constantly being written to — anti-virus constantly scanning them would take a ton of resources, and NiFi isolates this pretty well, so the risk of not scanning those directories is very low. Now, there are processors — I know a couple of anti-virus processors — where, when files are ingested, they go through a ClamAV or similar check for viruses. But yeah, all of that is in your administrator guide. If you are a sysadmin and you've set up Java homes and stuff like that previously, you'll recognize it as well. I am looking at mine — let me pull it up real quick.
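For the JAVA_HOME warning and the open-files best practice, a rough sketch of what a sysadmin might set on a Linux box — the paths are examples, and the exact limits in your version of the admin guide should win:

    # environment (example paths)
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
    export NIFI_HOME=/opt/nifi/nifi-1.26.0
    export PATH="$JAVA_HOME/bin:$PATH"

    # best practice from the admin guide: raise open-file / process limits,
    # e.g. in /etc/security/limits.conf
    #   nifi  hard  nofile  50000
    #   nifi  soft  nofile  50000
    #   nifi  hard  nproc   10000
    #   nifi  soft  nproc   10000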
Actually, I can log into yours interactively as well and see if I can type. My typing works. Yes, I sent a message to O'Darius to close out the machine and just try connecting with a different browser. Yeah, I'm thinking the same. Glad you're on the call for tech support, tech assistance. Yes. We'll give it another minute and see if he's able to get started. Give it one more minute, O'Darius, and if you need help, just keep trying to get connected. If it doesn't work, I think it was suggested to shut down the machine, close the browser, try a different browser, connect back in, and start it up. He dropped off the call, so he might be restarting his computer or something. Okay. Tom, are you there helping him? No, I'm teleworking. Oh, it looks like he's working now. Good deal. Now he'll join the call. All right, so I think he's good to go; hopefully he's coming back on Teams. Either way, we will get started. Okay, so it looks like everyone else got their NiFi install up and running and working. Tonight — or this afternoon — when we are done, to exit NiFi, all we're going to do is bring this window up and hit Control-C. We're not doing it right now, but we'll just hit Control-C to exit the process, and NiFi will go away. That said, you don't have to — you can leave it running. These virtual machines are available to you for the next few days, as much as you need them. So if you don't want to shut it down and you just want to pause where you're at, you can close your browser, come back tomorrow, connect, and pick up right where you left off. So that being said, I am going to go ahead and finish building out my data flow, and you are more than welcome to follow along. It's going to be a lot of hands-on from this point forward, so feel free to build as much as you can. I have a couple of scenarios for us to work through, probably starting either later today or tomorrow morning. They will take a few hours to build, but you are going to be able to build a data flow that does ETL steps, controller steps, those types of things, from scratch. So that's always a good thing. But let's get started and pick up where we were. So what I did is I created a GetFile processor. My goal is: I want to take this weather data zip file — or I can take the other data — and I want to unzip it. I've already gone through and extracted it manually, looked at the CSV, the JSON, stuff like that. But in this scenario, I am delivered a zip file to the file system every day. So I need to — I just heard a Teams notification. Oh, there it is, good deal. Okay. So in this scenario, I want to pick up some data from the file system, and I want to identify the type of data it is, because depending on that, the way I decompress the data might change. With that being said, I'm going to create a new folder, and all I'm going to do is call it weather data. Then I want to take the weather data zip file and copy it — move it — into the weather data folder that I created.
I'm doing this because we could actually pick it up from the original directory, but then we'd have to start doing some file filter expressions, some regex and things like that. So to keep this easy, I want to work off of this new directory in my GetFile, and I am going to pick up this zip file. So I'm going to go back to my GetFile, configure it, and configure it to pick up everything in this weather data folder. We already went into the configuration of GetFile. There is a file filter, a regular expression. So if you are familiar with regex, especially Java regex, you can put in patterns to only pick up certain extensions, for instance. We could do dot-star to match everything, or I can do dot-star dot zip. Or if I know the name of the file coming in every day, I can put weatherdata.zip, if that was the name of it, and that's the only file it will pick up. Right now I have it as a wildcard, where it will pick up any data that comes into this folder called weather data. And right now I just have my one zip file there. What I want to do, because I'm testing this, is keep that source file. Later on, after I test the flow and things are working, I may have it not keep the file, because either I'm pushing this file to a database, or a bigger file system, or whatever. If you have that type of requirement, you may not want to keep the source file, just to save on disk space and those types of things. But for testing purposes, and because it's my initial flow, I'm going to go ahead and keep it. I'm really not messing with anything else. I have my input directory and my file filter. I'm not really worried about my batch size, the maximum number of files to pull on each iteration; right now I have it set to 10 and I've got one file, so it really doesn't matter. But you can start to see some of the power and capability in a single processor. I can specify these things, I can build these things relatively easily, and again, I'm not opening anything code related — no Visual Studio Code, or Eclipse, or any other IDE. I'm just using the NiFi canvas to build out my data flow. So: I want to ignore hidden files and folders, how often do I want to poll the folder, any kind of minimum age or size — but honestly, I think I've got it the way I want it. Let me double check here. Yep, there's my weather data directory, file filter, batch size. Okay, everything that I want, and keep source file applied. So my goal with this is to pick up compressed files. Now, that could be zip files, that could be gzip, that could be tar — there are numerous different compression technologies. In our day to day life we primarily use zip, but you may have others, like was mentioned: tar, gzip, RAR, there are so many out there. In this scenario, I've only got zips, and that's all I'm picking up, but this folder could have 100 different compressed files in all different compression technologies, different file names, different extensions, things like that. So what I want to do is pick that up, and because it could be something other than a zip file, I want to send it to another processor that identifies the MIME type. And again, what I'm going to do is go in here and just make this a lot easier to understand, so that if my colleagues are looking at this, they can easily reference it and understand what's going on.
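A couple of File Filter patterns of the kind described here — this is Java regex, so the dot needs escaping if you only want zips:

    # GetFile -> Properties -> File Filter
    [^\.].*              # the default: anything that does not start with a dot
    .*\.zip              # only files ending in .zip
    weatherdata\.zip     # exactly one known file name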
Thomas, I think you've got your hand up. Yes — what relationship did you use on the GetFile processor? The relationship — retry or terminate? Yeah, good question. So let's go back into the configure and go to relationships. This processor only has success. It can only pick up files it can read, and if it can't read them, it doesn't know they're there. Now, we will get into additional relationships with other processors, where we send failures and things like that, but for this one, you only have success — all files are routed to success. You don't need to do anything for terminate or retry. So that's not a requirement? Yeah, you don't need to. The reason it's here: in previous NiFi — the NiFi that we built originally — you had to terminate or retry every single connection. So on a GetFile, I would have had to push it to another processor that terminated the flow file; there used to be a processor called terminate flow file. What they did in some of the later versions over the last couple of years is add auto-termination, just because it makes sense and it means fewer processors on the canvas. We'll get into more about relationships, because we're about to actually use them. When you drag your arrow over to your next processor, it's only going to give you one option. So, one of the things I like to do is log everything. I am going to drag this over, and the only relationship I have is success — add, done. And then my log message, in this scenario, I'm going to have it automatically terminate everything, so if there's any issue, it's automatically terminated right there. The LogMessage processor is another one of those that only has success: it's either going to log or it's not, and if it is not logging the information, you have a problem outside of NiFi, because the only job of this processor is to log whatever you send it. So what I'm going to do, because you asked that question, is add a log message, and that log message is going to log all of my successes. And you'll realize, if you're working off your local machine, it's so much nicer not dealing with any of this latency. So I'm going to call this my success block, and when I have another relationship — it's either going to be success or failure — I'm going to log it. So in the nifi-app.log, I will see a success message for this file. And let me see — I want to change the color. Let's give it a green, green of some sort. All right. I wanted to ask because, for the IdentifyMimeType and the log message, I get: relationship success invalid, because relationship success is not connected to any component and it's not auto-terminated. Yeah, it doesn't look like yours says that. Okay, just a second and then we'll take a look. All right, I'll walk through this, and Tom, I'll take a look at yours as well. So I have my GetFile, and again, the only relationship on that one is success; there is no failure. I'm sending the success to IdentifyMimeType, because in this scenario I want to understand what type of compression is being used so I know how to decompress it. I also want to send all of my successful messages to a LogMessage processor, and that LogMessage processor, all it's doing is logging all the messages.
So I'm going to name it "logging all success messages". All right. So now my GetFile has picked the file up and is sending it to logging all success messages as well as IdentifyMimeType. I have now created a connection to the next processor. We can actually click on this connection, right-click, configure, and go to details. You can name the connection as well, to make it more readable — you can call it connection, whatever. There are also prioritizers. So I can send files to this connection and, instead of a FIFO, first in first out method, maybe I want to process by a priority attribute, or I want to do all the older files first before any of the newer files, or I really just don't care and the first one that comes in is the first one that goes out. So there are some configurations there. Most of the time I see people just leaving it as is, but it is there. Okay. So now I've picked the file up and I've built a connection to my next processor on success, and that processor is IdentifyMimeType. That one, when I go to configure it and look at relationships, also only has success, and we'll get into why that is. If it identifies the MIME type, it will send the file where I want it to be sent. But if it doesn't, it still ran — it just didn't identify it. The processor itself still worked, still did what it's supposed to do; it just couldn't identify an unknown file, and I'll show you how we deal with that. So I've got my IdentifyMimeType, and the file looks like it's going to be a zip. I'm expecting zip files, tar files, gzip, RAR files, so I'm going to need to decompress this file. Okay — you see we have a failure and a success now in our termination options. But let's go to the properties first. So instead of doing a compress, I want to decompress, and I'm going to tell it to use the mime.type attribute. So when the file goes through, IdentifyMimeType is going to identify the MIME type; in this case, it's going to look at that file and say, hey, this is a zip file, and set that attribute — remember, it's a key-value pair. So the decompress will use the mime.type attribute, and it's going to say zip. And then if I want, I can update the file name, deal with removing the extension, stuff like that, but for this scenario I'm setting update file name to false. All right. So I need to drag my connection — IdentifyMimeType only has success, so there's only a success on the connection in this scenario. So, name it success. And the success coming from IdentifyMimeType goes to the CompressContent processor. Just for ease of use, I also want to log all successes. And what I'm also going to do — Control-C, Control-V — is copy and paste this LogMessage processor, and Control-C, Control-V on my label. I want to name this one failure, and I want to change the color of this one to red. All right. Then I'm going to rename this processor so it makes sense, because it's no longer logging all successes — it's now logging all failures. And you'll see why I put this in, too, towards the end of this flow. Following design principles and knowing what I know about building flows, I like to have a few log message processors on my canvas while I'm building my flows. The reason being — I usually don't even turn them on — is that this is a catch-all method of seeing what is successful, what is a failure, and what those look like.
And this exercise is to show you some of those additional menus and capabilities and stuff like that. One second. Okay, sorry about that, my roofer was calling me. So what I like to do is leave those log messages in there, and then as I'm building this flow, I have somewhere to send success and somewhere to send failure, not only to the next processor but to a log message itself, just so I can see what is going on. So anyway, IdentifyMimeType only has success, and we brought it down to CompressContent. What we've done is pick the zip file up, identify that it's a .zip, and then send it to this processor to extract that data. So we went in and configured this: we told it that we want to decompress, and we can specify the compression format. We can put gzip or zip, or there's Snappy; there are so many different compression formats. So I can specify that it's a zip, but who knows, right? Maybe I'm looking at a folder that is getting data from all over the place in different formats and different compressions. And instead of building a flow for each format, I can just send it to identify the file type first, and depending on that file type, the next processor will decompress it. So I've got it set to decompress, using the file type. Let me get a good name for this; let's call it decompress content, and we'll work on the relationship next. So now I'm sending it to my decompress processor, and I'm expecting a bunch of flow files from that one zip file. What I want to handle is if it fails to decompress. IdentifyMimeType is just going to identify the file type; if it's unknown, that's still a valid result for IdentifyMimeType, so it doesn't have failures. But this one does: if the file is corrupt, or it doesn't know how to handle it, it needs to know where to go. So we're going to send all of our failures to our general log message here. All right, now that we've got that, where are we going to send the success? I'm expecting, after this processor, to get the contents of that zip file. What am I going to do with that? Let's do a PutFile. Now that I've unzipped it and I've got my files, I'm just going to put them back on the file system. So here we can do a PutFile, connect it, and if it successfully extracts, we want to use the PutFile. And because I want to log all of my successes and failures, I want to add success to the log as well. Now, PutFile is invalid, right? I've got the yellow exclamation mark; it's not in a stopped state and ready to go. That's because I need to define my relationships, as well as the directory the file is going into. If I look at my properties, I need a directory to put this file. So what I want to do here is go to my Downloads, and I'm going to create another folder called "unzipped weather data." I always go to the address bar, because Windows likes to use its fancy names like "This PC > Downloads"; if you click on the address bar, it gives you the full path. Then I'm going to paste that path back into the property, and everything should go into unzipped weather data. Now, I didn't put it back in my original weather data folder, right? Because if I did, it would take everything it decompressed and send it right back through the flow, writing the flow file and sending it right back through.
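For readers who want to see the identify-then-decompress idea outside of NiFi, here is a rough standard-library sketch of what the IdentifyMimeType plus decompress step above amounts to: look at the detected type and pick the matching extraction method. The paths and the simple magic-byte detection are assumptions for illustration; NiFi does this per flow file with the mime.type attribute and Tika-based detection.

```python
# Rough sketch of "identify the type, then decompress accordingly" using the stdlib.
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path

def detect_mime(path: Path) -> str:
    with path.open("rb") as fh:
        magic = fh.read(4)
    if magic.startswith(b"PK"):
        return "application/zip"
    if magic.startswith(b"\x1f\x8b"):
        return "application/gzip"
    return "application/octet-stream"   # "unknown" still succeeds, it just isn't identified

def decompress(path: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    mime = detect_mime(path)
    if mime == "application/zip":
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)
    elif mime == "application/gzip":
        with gzip.open(path, "rb") as src, (out_dir / path.stem).open("wb") as dst:
            shutil.copyfileobj(src, dst)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            tf.extractall(out_dir)
    else:
        raise ValueError(f"no handler for {mime}")   # the "failure" relationship, in effect

decompress(Path("weatherdata.zip"), Path("unzipped weather data"))
```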
So I definitely like to put it in a different folder, because I don't want to reprocess those files over and over and create an infinite loop. It's going into a different directory because I don't want this process group to pick those files up again. So anyway, I have the PutFile, and I've entered the directory. Now I need to handle termination, and I can go in here, configure this processor, and say auto-terminate on failure or retry on failure; I can tell it to auto-terminate on success or on retry. Meaning, if it writes the file and it was successful, it's going to automatically terminate and be done; I don't need to know, because the file landed on the file system. But this is another one of those examples where I kind of want to see success instead of just trusting it's going to write to the file system. And if there's a failure, I want to see what that failure is. The nice thing here as well, if I'm not mistaken, is that I should be able to auto-terminate failure too. No, I cannot, because another relationship exists. So you can't auto-terminate if you already have a connection built for that relationship. But you can have multiple failure connections, multiple success connections. Like the GetFile: I had a success here and a success here. IdentifyMimeType, I have two successes. Same throughout the flow. What this does is give me more insight into what's going on as I'm building this flow, so if I need to make changes, or if there are errors or something like that, I can make those changes before the flow actually gets deployed. So now I have the GetFile, IdentifyMimeType, decompress, and PutFile. Everything is in a stopped state, there are no yields, and right now there are no errors, so to me this data flow looks good. Now, NiFi is very, very quick. If we were to put a thousand zip files in that folder, it would pick them up before we had time to stop it. Luckily we only have one zip file and we kept the source. When I'm building a data flow, I like to think through it logically a little bit. I sit down and say: okay, I'm going to get a file; I need to decompress that file; and for me to do that, I'm going to have to identify the MIME type. I'll build those things in, and when I'm done with this file, I may send it to another processor that extracts the data. But for this use case, in this scenario, we're just going to write it back to the file system. So what I like to do is think through it and start building out my flow. I have my success messages, I have my failure messages, and I think I am pretty good to just let this go. But instead of turning everything on, what I like to do is click on each individual processor, now that I've got the flow built, and say run once. What that's going to do is go pick up that file, and if my next processor is stopped, the file just queues up on the connection. So you can see I ran it one time, and I got one file in the connection, 2.19 kilobytes, waiting to go to the next processor. Now that I've got that file, I can actually list the queue, because all this data is automatically being queued up. So let me move this over. With the flow file queued up, I can now start looking at its attributes, and I can start looking at data governance, because provenance has happened now that we have a file in NiFi.
So with that queued file, I will go and list the queue. It's going to tell me that I have this file name, weatherdata.zip, in the queue: here's the file size, here's how long it's been queued, the lineage duration, and whether it was penalized. I can actually download the content, which is the zip file; I can view the content; or I can look at the provenance for it. If I try to view the content, NiFi is going to complain that there is no viewer registered for this content type, since the content type is application/zip. This is what I was saying: if it's a JSON document, which we'll get into, there's a viewer built in, so I can actually look at the JSON document. But for zip, there is no viewer; if there was, I would be able to look at that zip file. So let me close this. Also, if I go right here at the beginning, I can view the details of this file. This is where I can start looking at my attributes. So we have our attributes and we have our flow files; we went through some of this terminology a little earlier. From the details: it gets assigned a unique ID. Everything that comes through the system gets a UUID. The file name it read when it picked the file up is now an attribute. The reason you want that as an attribute is you may already have rules set up that if it ends in .zip, it goes to processor X, and if it ends in .json, it goes to processor Y. So you can actually read the attributes and route on them; RouteOnAttribute is a processor in our list, and you should see it there. Then there's the file size, the queue position (which there wasn't one), how long it's been queued, the duration of the lineage, and some other identifiers, like offset if you're looking at some of the Kafka-type stuff. Here again, I can download the content, which would be a zip file; if I click that, I'm about to get a zip file downloaded. Then if I hit view, and if I have a viewer for this content type, I can view it, but I don't have a viewer for zip. The main thing I care about is these attributes. So GetFile is giving me these attributes. GetFile said: here's the absolute path, here is the file creation time, here is the last access time, the last modified time. You may do some sorting and filtering of data based upon time. There's the owner of the file, as described in the Windows file system; the file name that it grabbed; and the path from where it picked it up. That's not the full absolute path, just the relative path, so if I had a recursive directory, it might be a "weather data two" folder and so on. And of course the UUID that was assigned to it. Okay, so that is our connection. We have a file in the queue, we ran this one time, and now we're trying to send it to IdentifyMimeType, which identifies the file. So let's run this just one time. I do a manual click to refresh. Like I said, the UI is set up for a five-minute refresh, but if you want to do it quicker, it's easier to right-click and refresh. You could also refresh the browser, but right-click and refresh works. But anyway, IdentifyMimeType: I got a success. There's only success. Even if it doesn't identify the type, it still attempted to, and the relationships of this processor are success or nothing.
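To make the attribute discussion above a bit more concrete, here is a hypothetical attribute map like the one inspected when listing the queue after GetFile. The attribute names mirror the ones the class looked at (uuid, filename, path, timestamps, owner); every value below is invented for illustration.

```python
# Hypothetical flow file attribute map after GetFile; all values are made up.
flowfile = {
    "uuid": "3f1c2a9e-0000-0000-0000-aaaaaaaaaaaa",       # assigned to every flow file
    "filename": "weatherdata.zip",
    "path": "./",                                          # relative path it was picked up from
    "absolute.path": "C:/Users/student/Downloads/weather data/",
    "file.owner": "student",
    "file.creationTime": "2024-05-19T13:58:40-0500",
    "file.lastAccessTime": "2024-05-19T14:01:05-0500",
    "file.lastModifiedTime": "2024-05-19T14:02:11-0500",
}

# "You may do some sorting and filtering based upon time": with the attributes in hand,
# keeping only files modified after some cutoff is just a comparison on the attribute.
cutoff = "2024-05-19T14:00:00-0500"
recent = [ff for ff in [flowfile] if ff["file.lastModifiedTime"] > cutoff]
print([ff["filename"] for ff in recent])
```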
And so that should also tell you that if it doesn't identify the type, if it's unknown, it's still going to go to success; it just goes to success as unknown. But let's look at the queue. We can actually list this queue, and if we look at the attributes, we should have a new one: mime.extension. So it did identify the MIME type: it identified the extension as .zip, with a mime.type of application/zip. If that was a different type of file, like tar or gzip, it would have recognized it as application/gzip or application/tar. In this scenario it is a zip file, so it identified it as application/zip, and now we have that attribute. So from here, right, I could say, well, I don't really want to decompress all my tar files, I want to send them somewhere else. So I can actually do a RouteOnAttribute and say something to the effect of: send tar this way and zip to the decompress. But in this scenario, I'm just going to send everything to success, and I'm going to let the decompress content processor determine what type it is. So the MIME type was detected, and in my decompress content processor I have the property set to use the mime.type attribute; you saw the mime.type attribute. So it's going to use that attribute to determine the decompression method. This decompress processor is a Java application, right? The source code of it knows how to decompress or compress all these formats. Built into this processor, it already knows that if it's a zip, this is the method it's going to use to unzip or to zip. So for this scenario, we're going to run it one time. It should be successful because it is a proper zip file. I bet everyone on this call has downloaded a zip file before that got corrupted, and you couldn't extract it. If the processor could not extract this file, it would have gone to failure, and then I would ask, okay, why am I getting failures? This run-once way of doing things gives me the capability to run it one time and see how things are going, and I like to do that a few times before I fully turn things on. So anyway, now it is a success. We're going to list the queue. I don't think there are any new attributes; yeah, the decompress doesn't add attributes. All right. We're going to run PutFile one time. Let's see here. I'm looking at why it didn't actually do the decompress. And I actually want to test this another way too; I'm going to put this file back into its original spot for now. So, decompress content: we should have gotten three files out of that, because that zip file has three. So I'm adding in a little bit more logic here to see where it's failing. Automatically terminate, automatically terminate, apply. So it's good to go, because I don't need to worry about that. That should clear my queues. Give me just a second. For some reason the file is being penalized, and this is why we build in some of these checks and balances to begin with. Empty my queue. Still there. The file is not corrupted. It's really weird; it was penalized, but there was no reason to penalize it that I can see. I may have clicked something wrong. Let me see here. Did I use the wrong processor? Let's see here. I'm just looking at why this is not working. Even when I have a problem, I do the same thing. This is really weird.
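Before picking the troubleshooting back up, a short aside on the "send tar one way, zip to the decompress" idea mentioned above. The sketch below only illustrates routing on the mime.type attribute that IdentifyMimeType adds; the attribute values are the common ones for zip and tar, and the routing function itself is an invention, not NiFi code.

```python
# Sketch of routing on the mime.type attribute, in the spirit of RouteOnAttribute.
def route_by_mime(attrs: dict) -> str:
    mime = attrs.get("mime.type", "application/octet-stream")
    if mime == "application/zip":
        return "to_decompress"
    if mime in ("application/tar", "application/x-tar"):
        return "to_tar_handling"
    return "unmatched"   # unknown types still arrived here via "success" upstream

print(route_by_mime({"mime.type": "application/zip", "mime.extension": ".zip"}))
print(route_by_mime({"mime.type": "application/x-tar", "mime.extension": ".tar"}))
```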
Is this supposed to decompress? And... this is one of those things where... Is it possible to test as you go? For example, those first couple of processors you put together, could you have tested that some way right off the bat and then moved on? You know what I mean? Like kind of test as you go, at the processor level. Yeah, no, that is exactly why you see me just saying run once. So I run it one time, and it's weird that this literally was not working when it had just been working, and then what I like to do is look at my queue. Let me go and list my queue. I know that I'm expecting one zip file, so I look at the queue and that's what I have, one zip file. And then I'll run... maybe just rebuilding this relationship. No, but what I was thinking was, doing it that way as opposed to putting all the processors on the canvas at one time and connecting them and doing all that work. I get that you can do the run once at each processor, but I guess it's six of one, half a dozen of the other. It depends on how you work, I suppose. Exactly. So if you remember when I opened up, there are many ways to skin a cat with NiFi. That's some of the power of NiFi, but that's also some of the negative, because you don't have strict boundaries. There we go. Success. So yeah, if your question is, do you have to complete the flow to test, you don't. I can have just this and this and a log message, right, and I can run this flow. It just will not continue through the next processor until you have everything resolved, the relationships and things like that. So that's why I run this once. I'm going to run IdentifyMimeType once. So I have it successfully sent to the decompress content, yep, and I'm going to take the original and put it back in here, even. The latency, like I said, when you're doing this on your own laptop and not a VM, it's a much better user experience. Okay, but I'm going to drag this here and put the original where I want the unpacking to go, I guess, and then I'll go through and clean this up. So, I'm going to run the decompress content once, and I'm just testing it out because I'm trying to figure out why. One came in, three should have come out. Why is it not? There's the decompressed, okay. Ah, there we go. For some reason this processor is not doing its job, but it's okay, we can fix this. We'll start deleting the connections. And this is a good example of: if a processor is not a good fit, it's not that there's a bug or something, there's something going on with my configuration. "Queue not empty," oh. But if a processor is not a good fit and those connections have data in the queue, you've got to clear the queue before you can replace the processor. So what we did is we swapped it to UnpackContent instead of decompress. So you may want to update your flow. I'll go back and check why the decompress was not working, because it didn't decompress anything. In this scenario, though, we've got our file coming in, we've identified the MIME type, and we're now using the UnpackContent processor, which unpacks the contents of a flow file.
And this one I like as well, because it gives me three different relationships, more than just yes or no. It gives me a failure relationship, an original relationship, and success. So in this scenario, I have unpacked it. I'm sending all of my success to log messages, for instance, but I also need to put it in this folder. So what I want to do is take the original, and we'll put it there. Notice that that file is still in the queue and I'm now changing the direction of where it's going. And now I want to take my success and put it here. Okay, perfect. So now what I've done is fix my error and use the correct processor. I've picked the file up, I've identified the MIME type, and unpacked the content. I'm sending any failures, so if it's a corrupted zip or similar, it's going to log all failure messages. I'm sending the original back to the place I picked it up, just for testing purposes. Well, actually I'm sending it to unzipped weather data, which I need to change. Yeah, let me change it and put it back exactly where I have it. It's going to fail because of the name, since I kept the original, but it doesn't matter. Okay, we'll go ahead and empty my queues. These issues that I'm running into are issues you're going to deal with, and this is how you overcome them. So I think I've got my flow kind of figured out. Let me empty my queues, and I'm going to run it one more time. And empty. Okay. So I see nothing in my queues, all of my connections are good, and I'm at a stopped state. If there had been an error, and I'm trying to generate an error, we'd get a little red box in the top right of the processor, and it will give you a brief description of the error. That's another reason I recommend logging as much as possible, so you can get a more detailed message. I'm trying to cause an error, and we will during our building of flows; as soon as I can cause one and we see it, I'll let you know. But anyway, I've built my flow. I want to get the file, identify the MIME type, unpack it, log any failures, take all the successes and put them in the unzipped weather data folder, and then take the original and put it right back where it came from. And so this should work now. So I'm going to tell it to run once. Refresh, it's there. Run once again. And Tom, I think you were kind of asking this. Let me do this: delete. So this processor is not complete anymore because I deleted that connection, but I can still run the previous ones. Well, that also goes back to what you were saying: you need to make sure you resolve all those little warning indicators on your processors, right? So now it's just going to stop, and when I resolve the relationship success, it's going to turn red and wait for me to either start it or run it once. So let's just run one time, and I should get five files: the original and four in the unzipped. Okay, so we're working great now. And then from there, I'm going to put, well, I actually just hit start instead of run once. It's going to put all four files where they go, and it's also just logging everything. So let me go ahead and turn on my logging. Good, so now let's go check our folders. Good, there we go. And it also threw the original back in there. So now I have my flow, and my flow is built.
I've tested it, and it looks like it's working. I'm being penalized; I don't know if it's system resources or what. Maybe too many successes. I'll have to take a look at why I'm being penalized. But I've got my original, my success, my failures. I have an operational flow now. What I want to do now, well, if that's success, I already know I can take away that one. So this is where I like to go back and delete, delete, the put file success. All right. And I'll show you: you just double-click where you want. This was a question I had in the last class, and it's actually not in the documentation, but if you double-click right where you want on a connection, you can bend the line to make it easier to read. So you can mess with those. Okay, so I think I am good. I'm going to empty my queue. All right, so I've got the file, it's coming in, it's unpacking, any failures go there, any failures go there. Success goes here, and the original. Oh, I don't need that success because it just put the file. Okay, testing my flow out: I'm going to auto-terminate success. Any failures, I've got going to my log. Any unpacking failures, I've got going to my log. And then I'm going to send the original back to its original place, yeah, weather data. You know what, I want to make a copy of it for demonstration purposes. So I'm actually going to create weather data one, where it's going to put the original. It should come back and complain. Oh, I don't have any more successes, so. Where's my weather data? I'm going to copy. Paste. All right, my goal is to be able to pick up the weather data and push it back. Perfect, and let me sort this folder out. So I want to pick up from weather data. Once I have that, I am going to grab it here. I like to go through and double-check: it's going to pick up from weather data, it's going to unpack it, it's going to put all the extracted files in unzipped weather data, and I have that created. I can go ahead and Ctrl+A and delete all of these. My folders are empty, everything's good to go. And then I'm going to take the original and put it in a separate folder, so I just have a copy of the original, and that way I can tell the GetFile not to keep the source. Apply. Okay, so technically this one, and I'm not using it anymore as you can see, so I can actually delete it. All right, my flow is pretty simple now. I've taken out some logging, and I have run through this once or twice to make sure everything works. So what I like to do is work backwards in starting this. If all of these were in a process group, and we're about to do that too, I could start and stop the whole process group from my Operate palette. All right, we can start UnpackContent. We go there. IdentifyMimeType. Start. And here's why I use the run once. So let me open my folders again. Unzipped, one of them. All right, it is empty. As soon as I turn this on, I should have files here. There they are. It's that quick. And that's what I was saying earlier: if I had 1,000 files, it would process extremely quickly. All right, so I built my flow. My flow was successful. It did exactly what I wanted it to do. So with that being said, any questions on walking through our original flow? We did have a couple of errors. We had to work through those. We had to touch up a few things. It looks like a spider web, so it's definitely not how I like it to look. So what I like to do as an engineer is go through and start cleaning this up.
And I'll start putting things where they make logical sense to me, and I'll start cleaning all this up. And you can see, when you're dealing with these types of operations, having connections everywhere, let me redo that line, there you go, having connections everywhere can look like a spider web. And so what I like to do, as my flow is developing and as I'm building this out, is go through and start cleaning it up. I start applying labels, I apply colors, I will go through and rename these to make sense. So with that being said, any questions on this flow and how we built it? Any of the visualization? We still haven't gone into the provenance, and we're going to work on that after the break, but any questions so far? I missed the part about the group. Also, I'm not sure what the parts, like small boxes, are. Oh, okay. And I'll look at your screen in a minute and make sure everything looks good. But the connections are the relationships coming out of a processor. So for instance, GetFile only has success, so I send the success to IdentifyMimeType. IdentifyMimeType only has success, so I will send it to the UnpackContent. But UnpackContent has three relationships: one is your original file that it got, the success, which is the unzipped files, and failure, in case it can't unzip the file or extract it from a tar or whatever. So those are the three relationships, three connections, I need to build. I could go to UnpackContent, configure it, and say auto-terminate on failure, for instance. But what I like to do for now is log that message. So actually, while it's running, I could say stop and configure, and set auto-terminate failure, and it would auto-terminate that failure and be complete. But I like to send it, I especially like to send my failures, to a log message. That way I know what's going on. Later you can take this out if you don't need it, and maybe you just turn your bulletin level down to info or warn and try to catch that message, so that it's in the log. I personally like to make sure that that failure is logged. But that's how your connections are made. So when I come out of this UnpackContent, I needed three different relationships, so I sent my original back to the original location. Problems? Well, I sent it to a different folder. The original being the zip, right? The original being the zip, yes. Hey guys. Success was the extracted files, and failure was if it's a corrupt zip. And I remember now, I looked at my PutFile. So I was getting files from weather data. Weather data is empty, right? I was putting the original in weather data one. So if I hit copy, go back into the first folder, and say paste. Oh, luckily I have this stopped, or it would have already run through. Yeah. Oh, actually it did create an error. Good. The file name already exists where it's trying to put the files, and so it threw an error, which I'm glad about, because I was trying to figure out how to quickly generate an error for us to see. So again, this is your top right corner, and you'll see that little red box. It's penalizing it because a file with the same name already exists. But yes, so the original I was putting in weather data one. I'm picking up from weather data. The PutFile for the extract is going into the unzipped weather data folder. And we're good to go. You do see now that I have a failure, because the same name exists and my naming strategy has it throw an error.
I can say ignore and it would just keep writing. Okay. I'll pause here. We actually don't have much time, but I definitely need something to drink. Any questions on the initial flow that we worked together to build? And I'm going to change this to replace. Is there an easier way to replace a processor, like instead of deleting all your relationships and stuff, just change it? I guess I'm not sure, but... Great question, Travis. Swap it out. Yeah, that'd be great. Swap it out or something, yeah. Yeah, it would be amazing to keep those connections, right? So for you to replace a processor, you have to delete the connections before you can delete the processor. So unfortunately, there is not, but here's what I will do. I got three or four great suggestions from the last class, and I've already submitted them to be built into a future version of NiFi. So I am making a note to make replacing a processor easier. That would be extremely cool. It would be. I'm trying to figure out how. Yeah, doing that just now was pretty rough. It was, wasn't it? Imagine your flow is 100 processors deep. Right? I see, no thank you. Exactly. That's why I'm not a developer. And that's why I also recommend we build and test as we go through it, because you do not want to get 100 processors deep and have to pull like 20 of them out and replace them with another 10 or 15. It's just chaotic. This is the beauty of workflow-based programming, though, right? We don't have to code it to do this, but there are nuances with it. But great suggestion, and I will actually take that back to the NiFi committee to see if we can get that worked into some future versions. I'm trying to think through how you could do that, but we'll figure it out. Any other questions on our first flow? All right. Well, we're getting close to the end of the day. I need to go get something to drink and use the restroom real quickly. So instead of a 15-minute break, can we do 10, and then we'll come back, kind of wrap up for the day, and then I will see everyone tomorrow morning. So I have 4:02; let's come back at, which is 2:02 y'all's time, so let's do 2:17. No, no, no, we're going to take 10 minutes: 2:12. Okay. I will see you all in 10 minutes. Perfect. And then let's wrap the day up and get some questions. If you all went through your flows, I'm going to bring them up and just kind of walk through them real quickly. Tomorrow we're going to be doing real flows with real controller services and those types of things. But I'll be back in just a minute. Awesome, thank you. Any suggestions you all have, though, like "I wish this was easier, that was easier," please let me know. The last class had a few really good ones that I wasn't even thinking about. While I wait, I'm going to pull up everyone's screen and, ah, some high achievers. I was told this was a non-technical class; you guys are rocking it. Give everybody a couple more minutes to get back, and then we will go through our first flow. All right. Sorry, I was just reading the decompress and compress content processor documentation and seeing what changed, because it should have decompressed; it says the MIME type is automatically detected and the data is decompressed if necessary. But there's no relationship for that. Huh, I wasn't aware of that. Okay. So we have a few minutes left of the day.
We might be able to get out of here a few minutes early, but what I like to do is go around the room and see how your first data flow went, any major issues, except for following my incorrect steps, which kind of worked out, because then we got to see how you replace these things, and we also generated an error, which I was hoping to do, just so I could show what those errors look like and where they come from. So we will start with Tom. How did your first flow go, Tom? I thought you were great at going through the steps, and this was my first one. I've never even seen or touched, well, I've seen it, but I haven't done anything with it. This is my first hands-on experience with it, doing something with it. Yeah, I think you explained things very well. It made it pretty easy. I still have some work to do on mine, but I'll clean mine up and probably make mine more like yours, and we'll mess around with it after we're done for the day or whatever, but yeah. No, that is what I want to hear. I love hearing that you've never touched this before. So in your opinion, right, picking data up, decompressing it, putting it in another spot: in my opinion, for someone who has never touched it before, I think this would be a lot easier than trying to write a script or something else. But what is your overall take? What did you like? What did you dislike? How did it work out for you? So I liked it a lot, except for the thing that Travis brought up. I did the same thing as you: I replaced the decompress with the UnpackContent processor, and you had to delete those relationships first and all that. It wasn't too bad, but I could see where that would be tedious if you had to do that for a lot more than just one. But I'm semi-familiar with workflow-type technologies. I did SharePoint a lot, which had a workflow-type thing tied behind it, and even the stuff we use with Power Apps has a flow component to it. So yeah, like you mentioned, the flow way of doing things isn't anything new, but this is definitely pretty cool. I could see a lot of value in this. Awesome, awesome. Wait till tomorrow, when we are taking the CSVs and JSON that you extracted, and we're actually extracting CSV data and making it a JSON document. I don't give as much guidance on that flow, because I kind of want it open to interpretation; there are a couple of different ways to do it. But no, overall, I think your flow looks great. When you go through and start cleaning up, renaming the processors, you may even put something like "put file into this directory" as a name. You could see already with mine how 10 or 15 processors start to stack up and it starts running together. So if you name them and do that beautification, it just makes it a lot easier. But I like hearing from people who have never touched NiFi, so great job. I think your flow looks great, I can't wait to see it cleaned up, and then tomorrow you'll get into a deeper dive of this. We'll look at provenance, some lineage, and you're going to be a data engineer in the next couple of days. How awesome. I mean, I was worried that this was going to be overkill with the slides and not enough hands-on, but this has been, to my surprise, a lot better than what I was expecting. Yeah, I don't know how else to do it. Like, I could go through the slides and swap back and forth between NiFi and that.
So what I like to do, and also, I was in the Army, I briefed generals and stuff all the time, so I understand PowerPoint and death by PowerPoint. What I like to do is just run through the terminology and then kind of repeat it as we go along. So no, perfect. And I promise we only have a little bit more PowerPoint to do, but most of the rest of the time we are going to be working, because that's the best way to learn it. All right, great. I love it. Thank you so much. Yep, yep. All right, Darius, how are you doing? I had to restart my computer earlier, so I missed some of it. Okay. I pretty much just tried to mimic, I guess, what you were doing. I didn't get to all of it, though. Oh, no worries. So what you want to do is GetFile. Do you have the zip file and everything you need? If you go to that folder, let's look at that. Go to Downloads, awesome. Yeah, you missed the part where we copied it over. So actually, go to your desktop. Yeah, it's in the uploads. Go to your uploads, right-click and say copy on the weather data, then go back to your Downloads and paste it. You can right-click and paste; you'll have to do it more towards the bottom. There you go. And you can create any kind of folders you want, but extract the weather data: just say extract all. There you go. And yeah, that'll work. So it's going to pop open a new window, and if you click up in the address bar, instead of saying "This PC > Downloads," it's going to give you the real path. Now copy that path, then go to your GetFile, right-click and go to properties, and paste that into the input directory, on the value of the input directory. First line, there you go. Paste that in. Okay, and apply. All right, you see how the processor went from yellow to red? It was yellow, like your LogMessage. So if you can, just keep trying to work through this flow. I'll be happy to come back after I go through the others and work with you. You want to identify that MIME type, you want to unpack the content, and then you probably want to put the file. You identify the MIME type, and where do you have the original? Or are you auto-terminating original on the UnpackContent? Let's look at your relationships, because you have a relationship or something off, because it's still yellow. Yeah, I think I need to put it in the right... All right, hit cancel. Go to your UnpackContent processor, not the connection. There you go. And let's look at the relationships on that. We need the original, and you don't have it auto-terminated. You can actually just say cancel, and you've got to add a PutFile for your success. Right-click on your PutFile, say copy, then right-click beside it and tell it to paste. There you go. And then on your UnpackContent, drag another connection to that, and pick the original relationship. Say original, say add, and now it's turned red and it's ready to go. So the only thing you have left to do is figure out where you want to put the original file, because you decided on the GetFile not to leave it. You're picking it up, so it's not staying where it is. So you want to copy it to a new folder, and also create a folder to put the extracted weather data in, and you want to put that in your PutFile. If you right-click on your PutFile, you'll see the directory again.
And so, you want to go back to that Downloads folder, create a new folder called original file, put that directory in there, and create a new folder called unzipped weather data and put that directory in the other PutFile, and then your flow should be good. And then for your LogMessage, you might want to just auto-terminate success. Okay, okay, gotcha. But you got the concept, I think. Try to plug that in; I'll come back to you at the end and see how much progress you made. The only other thing is you might want to think about naming the processors with real names that we can all understand, beautifying and stuff like that, but overall I think you've got it. So I'll come back to you and see how you're doing. All right, thanks. Yep, yep, all right. Ekta, how are you? Hi, doing good. How did your flow go? I think it's going fine. Everything is working. The only thing I want to see is, if I make it fail once the file's unzipped, where do I see the log messages for that? Oh yeah, we had that, if you remember. So it's set to failure, and you can actually have that PutFile send failure messages to your LogMessage, but also in that top right corner of the processor, remember mine had the little red box. Oh, maybe that's why it's hiding. Well, you don't have it yet because you haven't had a failure in the actual processor. But I really like what you're doing here, where you're sending the failure to the LogMessage. So yeah, if you do have a failure because of the naming strategy, you can do ignore, you can do replace, but because you have failure, it should generate a little red box in the top right corner of the processor itself, letting you know that that file name already exists. And to overcome that, you could rename the files in flight, so that when they go to the PutFile it's actually a different file name. You could ignore the error, you can log it, depending on your scenario and how you want to do things. Okay, so whatever icon or whatever message shows up will tell you what actually happened, because I haven't seen that. Yeah, let's run it. I'm just trying it right now. Let's run it, let's do it. We've got time, let's see it. Not to put you on the spot. So you want to run the IdentifyMimeType once? Oh, okay. Refresh. It's kind of slow. Yeah, this virtual machine can be; you'll see me trying to adjust and click and everything else. Okay, so it's a success. So let's unpack the content. So you have original, okay, run once, and now go to the canvas and say refresh. Sorry, refreshed. So the original went to your PutFile on the left. The unzipped contents went to the bottom PutFile, and you should be able to say run once. Oh, you see the error? The top right, there you go. Penalized, because the same file name exists. Is this log saved somewhere else, like in some logging files? Yeah, it's in your nifi-app.log, the one you opened up to get your username and password. It's logging the messages there as they happen. And it's good: if you're familiar with Linux, you can tail the log, and you can also tail the log in Windows. But go back to your directory, that's your bin directory, so that's where your executables are. Go to logs. There you go. Open up logs, and then go ahead and open up that nifi-app log. Yeah, no worries. Go ahead and open it up. That one, you can just open with Notepad++. Scroll all the way down. No, don't highlight it all.
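As an aside while the class scrolls through the log by hand: a tiny script can pull out just the WARN and ERROR lines from nifi-app.log, which is easier than reading the whole file. The install path below is an assumption; point it at wherever your NiFi extraction keeps its logs directory.

```python
# Print only WARN/ERROR lines from the NiFi application log.
from pathlib import Path

log_path = Path("nifi-1.26.0/logs/nifi-app.log")   # hypothetical install location; adjust as needed

for line in log_path.read_text(errors="replace").splitlines():
    if " WARN " in line or " ERROR " in line:
        print(line)
```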
That's a pretty good-sized log already. You should see, scroll up a little bit, your WARN right there. See, you have a warn. Yep. StandardPutFile with an ID, penalizing the flow file. Scroll all the way to the right. There you go. Yeah, and it gives you a lot of information, a lot more information than just that little red box. But because you are sending that failure to the LogMessage is the reason you have it; if you did not, it would just show a failure on the screen and it would not go to the log. All right. So was this your first time working with NiFi, or have you already played with it before? No, this is my first time too. Oh, nice, nice. And what were your thoughts? I think this could be really powerful, because you can do things so granularly and modify and kind of control where data can go and how you log it. Yeah. Yeah, I'm looking forward to it. Okay. More of this. Yeah, we're going to get into some difficult ones tomorrow, but perfect. All right, Richard, how'd it go? Yeah, as you can tell, I was doing a lot of experimenting while you were going. I started off just with some files, primarily images. For the most part, I did try to work with the compress content, and I was trying to change some settings there. My first time; definitely aware of having to do the process of really moving a file, so for the most part, I was kind of experimenting to see what the settings of each of these do, like if I changed it from zip, what would it do differently, I guess. Mm-hmm. So I guess it was more tinkering than anything. The only thing I did miss, and I know someone asked it, but I missed it, unfortunately: I didn't see how you can draw in the background, I guess. Okay, so up on your toolbar, the last item should be a label. Yep, click and drag it down. And then double-click it, and you can change the value. You can actually name it, you can put a name, a font size, all those things. But I am glad, and this is also why I like the class to run wild, because you brought up a point that I can show right now. So NiFi is configured to handle a max queue size of 10,000 flow files or one gig per connection. Now, that is configurable in the properties. So you have 10,000 files in your queue on that PutFile, right? What's going to happen is the queue before the previous processor is going to start filling up as well, and it's going to continue trying to process data as much as it can. So when that connection from UnpackContent to PutFile gets over one gig or 10,000 files, it's going to back up, and what happens is it'll start backing up the whole system. So you need to clear your queue, and then the rest of them will start clearing themselves. Okay. But any questions, issues, or anything, and what did you think? No, definitely some value here. I was just kind of experimenting to start on some use cases on my end. Actually, I started off with some images first. Nice. And then, yeah, then I started following the content more closely, and then I started going down this road. But even then I was experimenting quite a bit. So, just more understanding of these settings and trying to understand the behavior as well. But I definitely love it. I could see a lot of use cases, and I'm looking forward to learning some more. All right.
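Picking up the back pressure point from above: the sketch below is an illustrative model, not NiFi code, of a connection that stops accepting new flow files once it holds 10,000 of them or one gigabyte of content. Those are the defaults discussed in class, and they are configurable per connection; the class and file sizes here are invented.

```python
# Illustrative model of connection back pressure: upstream stops once thresholds are hit.
MAX_COUNT = 10_000
MAX_BYTES = 1 * 1024 ** 3   # 1 GB

class Connection:
    def __init__(self):
        self.queue = []
        self.total_bytes = 0

    def has_back_pressure(self) -> bool:
        return len(self.queue) >= MAX_COUNT or self.total_bytes >= MAX_BYTES

    def offer(self, flowfile: dict) -> bool:
        """Upstream processors stop being scheduled while back pressure is applied."""
        if self.has_back_pressure():
            return False
        self.queue.append(flowfile)
        self.total_bytes += flowfile["size"]
        return True

conn = Connection()
for i in range(10_001):
    if not conn.offer({"filename": f"file_{i}.csv", "size": 2_048}):
        print(f"back pressure kicked in at {len(conn.queue)} queued flow files")
        break
```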
Good deal. Leroy, you have a failure, but it looks like everything is running. Are you there, Leroy? I will come back to Leroy, but it looks like he's got it going. Peter. Leroy has mic issues. All right, perfect, we can come back. Peter, how'd it go? I followed along with yours for the most part. Mine at the bottom, the PutFile that's supposed to be putting the unzipped files into the folder, has a triangle on it. When I right-click that triangle, it says invalid. "Valid up here," I don't know what that means. So hit cancel, and then just hover over the little yellow yield sign beside the PutFile. So you do not have a success relationship or a failure relationship handled on your PutFile. So either way: if it fails, you may want to send that to a log message or auto-terminate, and you may want to just go ahead and say auto-terminate success. So right-click on PutFile and say configure, and then you can say auto-terminate on success. And then failure you can terminate as well, or you can send it to your LogMessage. Okay. So if you want, click on the PutFile and drag it to your LogMessage. All right. And say failure, and say add. And your PutFile has now turned red and it's ready to go. Okay. Perfect. And then it's the same thing for the one over here? Yeah. Hover over it, let's take a look. Yep, it's your relationships. That's usually what it is, or you're missing something that is required. But besides that, what did you think? Yeah, it seems pretty straightforward. It's definitely easy to follow along with your class. I've had issues with the virtual environment, but no issues with NiFi. Yeah. The virtual environment crashed on me at one point; I had to restart it, and it's just been running really slowly. Yeah. That's the biggest thing I hear, the latency sometimes, because you'll point and click and it never responds, and then you get it clicked and it responds after a second and you're already working on something else. So I totally get it. Some of the same issues have crept up before. So if Maria's on the call, hopefully she's taking notes. But I totally get it. Was this your first time with NiFi? Yeah. Oh, awesome. Awesome. Well, it looks like you ran through it well. You can continue working on this. As tips, I would just clean things up, make sure you name your processors something easily understandable, and you know how to apply labels and things like that, and get this ready for production. So any questions I can answer for you? No, not right now. Thank you. Okay. All right, Travis, with the great suggestion of replacing processors. Yeah, that was probably my only major hang-up. Sorry, I'm looking at Splunk-related things. There are Splunk processors. There are a lot of good Splunk processors. Yeah, I'm the Splunk guy from YPG. Oh, nice. Nice. But your test worked well. Okay. I got it to work with no issue. Well, I had to do that fix first, because I was not replacing files, so that was easy to fix. Okay. This is my first time using NiFi. I have watched my lead use it, so I have a little bit of starting knowledge with it. But yeah, it's great. Okay, I love it. I can't wait to use it for other things. Perfect. Perfect. Yeah, if you get home tonight and come back tomorrow and say, hey Josh, did you know there's a processor for this, let me know. Now, I like hearing that. Oh, I'm the Splunk guy, because we have Splunk processors. There are processors for all kinds of things.
So I'd love to hear about some of the tech you're trying to connect to. Is there, I want to say third party, but like processors that you can load in, though, that might not come with the install? Like a library somewhere? You could load up some pre-built data flows, like templates. I know that InfluxDB has a template that they ship with their processors as well, that you can load up that would help you get data into Influx. But I don't think that, you know, I've never seen anything like, if you're asking for a pre-built flow or you just need to manage the in and out, I haven't seen anything like that, but I'll definitely take a look around and see. Yeah, I'll take a look around. I was just asking. Are you seeking out stuff here? No. All right. Good deal. All right. And then that was Travis. We are going back to... oh, there it is. Were you having a microphone issue? Oh, no, I think that was someone else. It wasn't you. Leroy, right? Here he comes. Can you guys hear me? I can. Yep. I remember now, you had the mic issue. How did it go? It went well. I was able to work through the example. I appreciated having to work through the errors, so that was good value. Awesome. I had tinkered with a NiFi flow before, but this was the first time going from scratch. Okay, good. I guess I'll be looking at the custom processors; we can kind of implement some machine learning models, something crazy. Oh, that'd be awesome. All right. Just so you know, there are TensorFlow processors. I have a friend who develops a lot of custom processors and puts them out on GitHub, so let me write that down. I'll throw you some pointers. It's not necessarily part of the official NiFi package and training, but I definitely want to send you some stuff that would put you in the right direction. Yeah, I figured something exists already where you can just kind of plug in your pipeline. Oh, yeah. Yeah, there's quite a bit. And there are ways to run Python in NiFi that we are going to get to tomorrow. If you look at your processors, there is an ExecuteScript and an ExecuteProcess. So, just a tidbit, and this goes for everyone tomorrow during some of the hands-on scenarios: I'm not going to tell you every processor you need to use. The goal is to look through the processors and determine what best suits you. So if you get a free minute later today or tomorrow, it might be beneficial just to skim the processor list, look at the groups of them, and see how they can help you. But any issues, Leroy, or any other questions I can answer? I think that's it for now. Okay, perfect. All right, well, I think we had a great day. I think overall everyone got it installed; I don't think there was a lot of difficulty getting it installed, and I don't think there were many issues there. We got it up and running, and we got flows. Tomorrow we're going to get the registry running, we're going to check our flows in and build some more, so a lot of cool things there. If you have any questions or anything else, let me know. Feel free to continue working on this, but I think you all are off to a good head start, and if you need anything, just let me know.
But I'm going to go ahead and end the training for today and run to Costco real quick. So if you need anything, let me know, but we will see each other again tomorrow morning. That's good. Thanks, Joshua. Yeah. Hey, thanks, guys.
on 2024-05-20
language: EN
WEBVTT You may have to log in to NiFi again this morning. I know I've had to, so hopefully you copied your username and password somewhere easy. If not, we can reset it. If you shut everything down, you may also have to go into the bin directory and run NiFi again to start it back up. Hopefully you just left it running and only need to re-log in. It looks like the only two we are missing are Ekta and Richard. Yeah, I'm on it. I was having some Menlo problems on my side, so it looks like I should be up here. Yep. I see. All right. Perfect. Now we just need Ekta. Looks like everybody's almost there. Give it another couple of minutes so everybody can get up and running, and hopefully Ekta can join us again this morning. Where can I find the link to the IP address for NiFi? It's your localhost, so it's 127.0.0.1 colon 8443. Here, I'll bring my browser up and share it so you can see the address. We're running all of these locally on the machines, so we should all have the same address. You have to use HTTPS because it's the secure SSL port, so it's https colon slash slash 127.0.0.1 colon 8443 slash nifi. OK, I'll do that. I'll also put it in chat. And if you want, you can bookmark it once you get to the login page or the main canvas, so you'll have it for tomorrow. Derrick, I don't see yours running, so go into bin, there you go, and run NiFi. Perfect. It'll be running in just a minute; it takes a minute for everything to initialize. As I mentioned yesterday, that lib directory is full of processors, so it loads all of those and unpacks all the content, and NiFi itself can take a couple of minutes just to get started. Even though it will tell you that it's running, it still takes a few minutes to initialize. Then just make sure you have HTTPS in front of the 127.0.0.1. Hey, good morning, Ekta. I noticed you just joined us and you were logging in. Perfect, give Ekta just a minute to get logged in. Derrick and Peter, it looks like you guys are here, but I don't know if your NiFi is running; I don't see the command window. Peter, you'll have to go back into your folder and start it. It looks like NiFi is not running, so go into the bin directory and double click on run NiFi. Right there next to the last one at the bottom. Run that, give it just a minute or two, and you should be able to log in. Ekta's up. I'm not sure if mine is working; I'm pulling it up right now. HTTPS colon, that's 127.0.0.1. Let me try something real quick. OK, I can bookmark this. It works. I can't hit Control-D to bookmark it. Yours is up and running, Peter. Yeah, no worries. Peter, it looks like yours is coming up. I remember yesterday you said the page sometimes just gets stuck for a couple of minutes. Yeah, it takes time to initialize, HTTPS colon and so on. Let me give it just another second or two and it might work. OK, it's refusing on both browsers. Illegal character. I see errors for some reason. Oh, there we go. All right, Peter, yours is coming up. Yeah, it looks good. Thank you. All right, go ahead and get logged in.
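For reference, the address everyone is typing is the local HTTPS port, and the start-up script lives in the bin directory of the extracted binary. A minimal sketch, assuming the standard NiFi 1.x Windows distribution layout (the exact folder name depends on the version you extracted; the script name is the one shipped in that binary):

```
https://127.0.0.1:8443/nifi        <-- local NiFi UI (HTTPS is required on port 8443)

cd <nifi-home>\bin                 REM the folder you extracted the binary into
run-nifi.bat                       REM Windows start script; nifi.sh start on Linux
```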
So I think yesterday everyone did an amazing job of getting your first data flow built. Today is a lot more hands-on. We are going to dive a little deeper now that we know how to pick a file up. Yesterday we did a basic operation of unzipping a zip file and writing it back to the file system; today's goal is to pick a file up and work with the data inside it. So today we're going to learn about controller services, and in the process of learning about controller services, we are going to take a CSV file and convert it to JSON. Now, this goes a little deeper into NiFi, so feel free to ask questions. I'll give you a hint: after we get through this, the hands-on is going to be a scenario where you pick up some JSON and CSV files that all need to end up in the same format, and we're going to look for some patterns in the data so we can send alerts. I've got a whole write-up on the scenario to show you. Even in the last class it was a little tricky sometimes, so we're going to have plenty of time to work through it, and I'm going to sit here, watch, and provide answers, because I know you'll have questions. Well, hopefully you'll have questions; if not, you can just breeze right through it. So yesterday, and I'm going to pull up mine, we made our whole flow to get files from the file system, unzip them, and put them back. We also looked at cleaning the flow up, and there's still some cleaning I need to do as well: renaming a processor, adding color to a processor, adding background labels, making it easier to navigate, those types of things. The only thing I changed since then is that I put all of this into a new process group. How I did that, and I'll show you, is I clicked on the process group icon, dragged it down, and named it; First Sample Flow is what I named mine. Once you have that process group on your canvas, let me do it so I can show you, what I like to do is zoom out a little bit with my navigation panel. So I have my process group here. There we go. Then you can hold the Shift key, draw a box around your whole flow, and drag and drop the selection right into your process group. So if you can, on your canvas, bring down a new process group and drag and drop your whole flow into that process group, like I just did. And the reason we do this is for many, many reasons. If you can imagine 300 processors doing all these operations on this canvas, it would get very cluttered very quickly; it would be very hard to navigate and understand data flows. Also, for security reasons, the way NiFi likes to handle multi-tenancy is with process groups. On the main canvas, if you had, say, 10 organizations all using the same NiFi, you could create a process group for each organization and lock that process group down to just them. That way, from the main canvas, you only see 10 process groups, not 300 data flows. So if you can, bring one down and copy your flow in. Sorry, I missed how you did that. Can you go back over that? Yeah, exactly. So what I did is I brought down a process group and named it, whatever you want to name it, my first data flow. No, but how did you get the whole flow into it?
How I got the whole flow into the process group? I kind of dropped it on the canvas with my flow, but it doesn't... No worries, no worries. And in fairness, the latency sometimes will mess with you, but what I like to do is zoom out a little bit on my navigation, and then I hold the Shift key, and while I'm holding Shift, I drag and create a box. You see the box being created? And let go. It will highlight the whole data flow, the connections, the labels, all of that. Then you can drag it to your process group; the process group should highlight blue, and you drop it in. Then you should have only a process group on your canvas, and when you double click your process group, you go right in and see your flow. Now, again, the virtual desktop environment sometimes makes it a little difficult because of the latency, and it doesn't want to select all, so it may take a couple of attempts. Yeah, so I tried it a little differently. I selected the entire section, and when you right click, it also lets you create a process group. Yeah, yeah. I like to drag the group down first, but you got the shortcut, and that's another way of doing it. I had that same problem where my browser was just not responding to that for whatever reason, so I noticed that, and it seemed like it worked. Yeah, but I also see you're already creating breadcrumbs, and I'm not doing that at all. Okay, no worries, I can take a look at it. Quick question: once you click inside the flow, how do you get back out of it? Once you're in the process group. Yeah, if you look at the bottom left corner, there's a NiFi breadcrumb. Breadcrumb, yep. So if you remember, I touched on it briefly, but I can go right back to my parent group; when I go into the process group, it creates a breadcrumb, and you can click it to go right back. Also, you can right click and say leave group, and that takes you out as well. Tom, let's take a look at yours. It's all jacked up. No worries, we can get it fixed. I don't know why my stuff's all the way to the right now; I don't know what I did. Okay, well, watch my screen, and I'll take over and get you squared away. No, I completely get it. I lost my label, I don't know what happened. All right, let's see what you got. So, remember you have the navigation panel here. You can actually drag that around and kind of center the view. Oh, okay, that's cool. Did you already bring your process group down? Nope. Okay, so. I keep deleting it, because I'm not doing it right, I don't think. Okay, no worries. So, if you hold Shift, and I'm holding the Shift key, I'm going to start up here and drag this box all the way over your flow. If you already brought down a process group, you can just drop it in there, but since you've got it all selected, you can also right click, say group, name your group, say add, and there you go. So now on the root canvas, you have your first flow group, and you just double click to go into it. Okay. And if you click on the canvas, you can hold and drag everything around. You can drag it around? Yep. Oh, okay. You can also select everything and say align. Beautiful. It's not the prettiest, but you can get it aligned. Actually, that messed it up even more; that aligning doesn't work right. Let me fix this for you again. Undo, undo.
Yeah, unfortunately, there's no Control Z. I wish. Let's see, let me get this. Okay, so logically, we want this here. We would want our get file from folder first, then unpack files. Right here is the first one. Yeah, that's some of the nuance in the movement, because of the latency. No, I'm going to clean it up real quickly. The VM with its latency sometimes just doesn't want to work right. I was going to ask how you create, like, a new blank canvas to start a new thing without getting rid of what you just built. I guess you saved it as a template, but this is one way of doing that, right? Yeah, and we're going to get to that. It's me, man, it's not going to be easy. No, it's fine. Like with some software products, this virtual desktop works great, but when you're real-time clicking and dragging, it can be a little rough. Okay, so we can take this guy and hopefully move him over here. And just like the way we write, left to right, unless you speak Arabic and go the opposite direction, that's usually how flows get designed: either left to right or straight up and down. That way you can branch off your connections, your log messages, those types of things. So let me see, that put did not want to go that way. There we go, that's a lot better. Okay, that's a little cleaner, I think. And then while I'm in yours: this is your main canvas. You can use the breadcrumb trail to go into the process group, and once you're in the process group, you can right click and say leave group if you need, or just click the breadcrumb trail. There are many ways to get out of a process group. The reason we like to do this is that we have this flow, we're about to build another flow, and you can imagine your canvas would otherwise be flows everywhere. That's also why we have input and output ports to manage connections, those types of things. So for the next exercise, you would probably just bring down a new process group and create a new one, and that way your main canvas stays cleaner. And when you set this up in a real-world environment, where you have multi-tenancy and some other things, you want to be able to, oh, I've got mine open like 20 times. All right, you want to be able to lock this down where this may be organization A, and this is organization B. In your policies for multi-tenancy and security, you can have that process group belong to another organization, and then under that process group is basically that organization's main canvas. It would be blank; mine's not, because I put a process group within a process group. When they log in to NiFi, that's the only process group they have access to on the canvas, and they can go into their process group and build whatever data flows they need. Another advantage of that, and I see this quite a bit, is where organization A has a responsibility to import data, ETL it, get it into the right format and things like that, but they may also have a requirement to share with organization B.
And so, within NiFi, organization A can have their own section and organization B can have their own section, and you can put an output port on the data you need to share with organization B so it goes to their process group, and organization B has an input port that receives that data. That way organization B doesn't really see, if you have it locked down, how you made the sausage; they're just getting sausage delivered to their process group. The sausage making stays hidden within that group, whatever logic you put in. I know some organizations have models and things like that they don't want folks to mess with, so they'll run everything within that locked-down group, and then the output of it goes elsewhere. So just keep that in mind; there are many ways to do this. Again, I think in the last class Brett and a couple of others chatted about how to set up some multi-tenancy, and it's my understanding some of the folks on that call are also working to help set this up. There's no point in everybody running their own instance unless you have that need, or you're just developing and learning. But when it comes to test and prod, having that multi-tenant NiFi available for everyone to use just sounds more reasonable. So potentially that's what your environment may look like in the future, if that's the way those things get set up. But anyway, I need to move this to the parent group, and then I can get rid of this. All right, so if you go back to your main canvas, you should have one process group. When we build additional data flows today and tomorrow, if you can, bring down a new process group and work within that group. That way you don't have three, four, five data flows that we're building sitting on the canvas, cluttered up to the point where you can't find certain processors or anything else. Now, sometimes flows can get long. Within a process group you may have a couple of processors, an input port receiving data, those types of things, and then you may have 10 or 15 process groups inside that original process group doing all the data movement, logic, and ETL, and even embedded in those you'll have more process groups. I've seen it six, seven, eight levels deep. That's the reason we have a breadcrumb trail, so you can go back out very quickly to the process group you need. And if you get lost in a flow, you do have a search bar. Let's see if I can do it. This one, for instance, is named connection to MIME type, I think it was. Get file, yeah: the connection from get file to identify MIME type. So that's also why you want to label these with good, readable names, because then you can easily search and find whatever you need. It's all up here on the toolbar. And now that we have processors on our NiFi instance, you can see the status bar: I have 10 stopped processors, I have two that need attention, and we haven't put any data through.
We don't have any other connections or disabled services yet, so those aren't showing up, but everything's on the status bar, and you can search and those types of things. That should make it a little easier and cleaner as we start building more flows. So for this morning, we are going to learn about controller services. I think I mentioned controller services briefly in the slides, but it doesn't matter. Controller services: I mentioned something like a database connection, where you establish that connection once, the username and password, the IP address, the port. So if you're going to MariaDB or MySQL, the port that database runs on is usually 3306; it has an IP address, it has a username and password, all of those things. That's sensitive information you may not want to share across the board with everybody that's using your NiFi if you're the one managing the system. So as a sysadmin or data flow developer, you can create a controller service where you put in the database connection information, and then when people need that database connection, they just use the one that's already set up and running. It saves time for the other data engineers, because they can just reference that controller service, and you don't have to give out the username, the password, the IP address, or the port. So it's really nice to have some of those shared services, and that's exactly what controller services are: shared services that can be used by processors, reporting tasks, and other controller services. But, like the nuance we ran into yesterday when removing a processor and putting another one in, in order to modify a controller service, all the referencing components must be stopped. There are ways to stop all referencing processors together, but if you modify that controller service, all the processors referencing it will also need to be stopped. That sounds like a lot of work, but luckily, once you establish your database controller service, unless the IP address, the port, or the username and password change, and even credential rotation you can automate, you should be able to install that controller service once, everyone can take advantage of it, and updates shouldn't be needed often. In the real world we see controller services running for years without any interaction, just because the database they reference is always there. Within the data flow, a controller service is scoped to where it's created, and it can be created within any process group. You can create a controller service from within a processor, you can create it within a process group, and you can also go to the hamburger menu and see controller services there. We don't have any installed yet, but if you have one at the main canvas level, it will show up there.
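As a concrete illustration of the shared database connection idea, here is a minimal sketch of what such a controller service might look like. The property names are from NiFi's standard DBCPConnectionPool service; the host, database name, and account are hypothetical values, not anything from the training environment:

```
DBCPConnectionPool  (shared controller service; flow builders reference it by name only)
  Database Connection URL:      jdbc:mariadb://10.0.0.5:3306/inventory   # hypothetical host and database
  Database Driver Class Name:   org.mariadb.jdbc.Driver
  Database User:                nifi_svc                                 # example service account
  Password:                     ********                                 # sensitive property, hidden from flow users
```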
Later, when we install the registry, we're actually going to install the registry service once and everyone gets to use it. So that's one way of connecting to it. What I'm going to do is go into my sample flow and walk you through how we do this. You're more than welcome to follow along. This is a little more advanced because we're converting CSV to JSON, setting the name, and writing the JSON out. We're going to use controller services for that, and I'll show you how that's done. The first processor, of course, gets a file from a directory, and this one should not be configured yet, nope. The file that we're looking for is inventory.csv. It is a CSV file; let me pull it up and show you what it looks like. The goal is to take this CSV file, which is just store, item, and quantity, and make it a JSON document. That way, all the data we receive as CSV we can convert to JSON and do further operations on. That's the file we are going to work with. You all should have access to this file; if you don't, I can put it on your desktop, but for now, just walk through it with me. So I have this inventory file. Again, we're using a GetFile processor, so I'm just going to copy the path and put it in. The file filter is a little different: you can filter on the file name, so I put in inventory.csv. There could be other CSV files, zip files, JSON documents in that folder, it doesn't matter; it's only going to pick up inventory.csv. Then I have Keep Source File set to true and recurse subdirectories, though there are no other subdirectories here. So I have my GetFile configured where I need it to be, and that's the first step: read the CSV file and get it into a flow file so I can start operating on it. The controller services we are using are an Avro schema registry, a JSON writer service, and a CSV reader service, and we're going to go into those. To convert CSV to JSON, there are a couple of ways to do it within NiFi; this is the most optimal way. The hands-on exercise is going to be a little different. It's very similar to this, but for the exercise you can use a controller service or not; there are processors that can do the scenario that's next. For this flow, though, we need to set schema metadata. That means as soon as we get this data, we bring it in and set an attribute, so we use the UpdateAttribute processor. When we get this data, it's going to show up just like it did yesterday: here's the file name, here's the file size, those types of things. What we're doing, though, is adding to that flow file's metadata an attribute called schema.name with the value inventory. That is going to tell our controller service which schema to use to convert the CSV to JSON. I could have another schema called conversion or whatever, and as I bring data in, I could filter and sort it and, depending on the type of data, assign whatever schema attribute I needed to match up. But for this one, the property is schema.name and the value is inventory. Now, schema.name is important here, because NiFi is going to look for that attribute, and we'll see that as we start setting up our controller services.
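Roughly, the first two processors described above end up configured like this. The property names are the standard GetFile and UpdateAttribute properties; the input directory is a placeholder, since the actual path on the training desktops isn't spelled out here:

```
GetFile  ("Get inventory.csv from folder")
  Input Directory:         C:\path\to\data        <-- placeholder; the folder that holds inventory.csv
  File Filter:             inventory.csv          <-- only this file gets picked up
  Keep Source File:        true
  Recurse Subdirectories:  true

UpdateAttribute  ("Set schema name")
  schema.name = inventory                         <-- new flow-file attribute; content is unchanged
```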
So NiFi is going to look for this attribute and ask, what schema name do you want me to apply? And we're going to apply the inventory schema. The next processor in line is a ConvertRecord. Again, if we were converting CSV to JSON, I could use other processors: there's UpdateRecord, ValidateRecord, SplitRecord, there's RouteText, I can use a regular expression to pull that CSV apart, those types of things. This one here writes attributes to CSV, so I could actually extract the CSV values into attributes and write everything back out as a JSON document. There are a few different ways to get to JSON. So imagine your data flow: you're importing this CSV, and there are many ways to extract it; once you have it extracted, you can use a processor to write it back as JSON. This is one of those nuances I bring up pretty constantly: there are many ways to skin a cat. The most optimal way of doing this is the method I'm using right now, with controller services. I feel like it's a lot easier as well, because I don't have to write regular expressions to extract text, those types of things. So I'm using the ConvertRecord processor, and we'll go in and configure it. As you can see here, ConvertRecord converts records from one data format to another, and it uses a record reader and a record writer controller service. Like I mentioned, we are going to use a CSV reader service as the record reader and a JSON writer service as the record writer. You can always pull up the documentation and look at the allowed values for the record reader: there's a CSV reader, there's JSON, there's Excel, XML, a Windows event log reader. A lot of times this is used for cybersecurity use cases where you're pulling in syslog and event logs; as you can see, there's a syslog reader with multiple different formats. But for us, we are going to use the CSV reader. For the writer, we have a JSON record set writer. We could read something in as CSV or JSON and convert it to CSV if we wanted to, but for this case we are going to bring the CSV in and convert it. So for the ConvertRecord, we need a record reader; I've already selected the CSV reader, because I know a CSV file is coming in. The record writer is the JSON record set writer, which I've already selected, because that's the value I need to write it out. One thing you'll notice with any processor that works with controller services is a little arrow on the far right that jumps to that controller service. If you were to drag and drop a ConvertRecord, or any kind of record processor, let me use a different one, you can select a new service. I've already got a JSON record set writer, but I can create a new service, and here are the services I have available. Any time you use a controller service, you're going to have that arrow, so you can go in and configure the service when things are not working right. On properties that don't reference a service, we don't get the arrow. So what I'm going to do, because I'm taking that CSV in, is read it as CSV and tell it to write it out as JSON.
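In other words, the ConvertRecord processor itself only needs two properties, each pointing at a controller service (the service names here match the ones configured in the next steps):

```
ConvertRecord  ("Convert CSV to JSON")
  Record Reader:  CSVReader              <-- controller service configured next
  Record Writer:  JSONRecordSetWriter    <-- controller service configured after the reader
```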
So I want to go to my CSV record reader service, and now I'm bringing up my controller services. I have four listed: an Avro reader, an Avro schema registry for this flow, a CSV reader, and a JSON record set writer. Let's go to the CSV reader first, click the little gear icon, and configure it. The first thing you notice is the schema name property option. If you remember, in the previous processor we set the attribute schema.name to inventory, because NiFi is going to look for schema.name. So for this, the schema access strategy is to use the schema name property that we set. The schema registry is an Avro schema registry; within NiFi, record handling usually works with Avro formats. If you're not familiar with Avro, it is a serialization format for record data. It's used quite extensively throughout the community, and the schema itself is modeled in JSON. That way we have a schema that says, okay, I'm going to extract this CSV, I'm going to extract every column, but I need to know where to put the values that go into this data. And Avro is what we're using here. I do realize this is a bit technical, so just let me know if you have any questions. For this CSV reader controller service, we are going to use the schema.name property. We have a schema registry, and again, because we are now referencing another controller service, you should see the arrow that goes to it. The schema name field is where you reference the schema name using the NiFi Expression Language. If you have any questions on the NiFi Expression Language, there's a whole guide, and there's a lot to it. If you're familiar with any kind of regex or those types of things, this should look familiar; if you're not, we'll work through it. NiFi has its own expression language, and it's built on Java, which of course has its own regular expression engine. This is how you reference the file name, for instance: I can call it in a property and it will return the file name attribute. For this use case, though, we told it to use the schema name property, and the name of the schema comes from schema.name, which will match that UpdateAttribute. If you noticed, the UpdateAttribute had schema.name and then inventory, so this tells the controller service which property to look for and which schema to use. That's where we get the schema.name. Some of the other required properties: the CSV parser. We're using the Apache Commons CSV parser; I think there's also a Jackson CSV parser. I like Apache Commons just because I know it works. CSV format: if you have a custom format, you can set that; you can use Excel format, MySQL format, all kinds of formats it can read from. Here we're going to pick custom, because the value separator is a comma. It's a regular CSV file, so it's just value, comma, value, comma, value. We set the value separator to a comma, and the record separator to \n, which is just a new line. And we want to treat the first line as a header. Actually, that's one of the bigger issues I've seen when people are learning this: folks always forget to treat the first line as the header, and it throws the data off, because you end up with the header as an actual data value in the JSON, or the formatting is off, those types of things.
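Pulled together, the CSVReader controller service described here looks roughly like the sketch below. These are the standard CSVReader property names in NiFi 1.x; exact labels can vary slightly between versions:

```
CSVReader  (controller service)
  Schema Access Strategy:       Use 'Schema Name' Property
  Schema Registry:              AvroSchemaRegistry        <-- the registry service set up next
  Schema Name:                  ${schema.name}            <-- Expression Language; reads the attribute set by UpdateAttribute
  CSV Parser:                   Apache Commons CSV
  CSV Format:                   Custom Format
  Value Separator:              ,
  Record Separator:             \n
  Treat First Line as Header:   true
```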
So treat the first line as a header if you have one. A lot of these other properties are already filled in: the quote character, the escape character, those types of things. I don't think we need to get into all of those. The ones we care about are that it's comma separated, the record separator is the new line \n, we've got our Apache Commons CSV parser, and we're telling it to use the schema.name property that we already set in that previous processor. So I can apply that, and now I've got that configured. Next I want to go to the Avro schema registry. If you notice, I had the property value for schema.name set to inventory, so when the data comes in, that controller service knows to use schema.name to find the model to convert this, and the name of that model is inventory. I can actually have multiple different Avro schemas here; if I were bringing in CSV data from multiple different sources, I could set it up so it all converts to the same format, and I can split up how it gets converted and recognized across schemas. I could set the attribute to store or price or whatever for each kind of inventory. Let me expand this. A lot of times when you're working in NiFi, you have a small property box, and you can just drag these boxes bigger to make them easily readable. So here I have a basic schema. Again, I want to take all of this data, read it in, and put it out as a JSON document. So I have store, item, and quantity. If you notice, the type is record, the name is inventory, and here are the fields that will go into that JSON document: store is the first one, item is the second, and quantity is the third, and it matches my data. I'll make sure you have access to all of this when you're working on your scenario, as you may want to use it. I understand this may also be the first time you've ever seen an Avro schema, but we'll work through it. My schema is very simple, just three fields. So I built my schema and put it into the Avro schema registry. Okay, apply. Now I've got this schema registry, and I can add different schemas, I can do all kinds of things. The nice thing is that I can now just reference that schema name in all of my data flows, and I only have to configure the registry one time, and the schema one time. It makes reuse of those controller services a lot easier. So that was my CSV reader: it's going to read the CSV and use this schema so we can convert it to JSON and write a JSON document out. Now, my second piece is writing it out as a JSON record. The write strategy for this is to use the schema, because it's already filled out: it already knows to extract the first column, put it into the JSON document, extract the second column, put it in, and so forth. We give it the schema registry we're using, the same one we used for the CSV reader, and the name of the schema. Then there's a pretty print JSON option so it looks nicer, and a few others like compression and suppress null values, those types of things. So this is the controller service that writes the JSON document, and what we're going to do is apply that.
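The schema the trainer describes (a record named inventory with store, item, and quantity fields) would look something like the JSON below; the field types are an assumption, since they aren't read out in the session, and nullable strings are a forgiving choice given the malformed row mentioned later. In the AvroSchemaRegistry it is added as a property whose name is inventory and whose value is this JSON. The JSONRecordSetWriter settings mirror what is described above, with the remaining properties left at their defaults:

```
{
  "type": "record",
  "name": "inventory",
  "fields": [
    { "name": "store",    "type": ["null", "string"] },
    { "name": "item",     "type": ["null", "string"] },
    { "name": "quantity", "type": ["null", "string"] }
  ]
}

JSONRecordSetWriter  (controller service)
  Schema Access Strategy:  Use 'Schema Name' Property
  Schema Registry:         AvroSchemaRegistry
  Schema Name:             ${schema.name}
  Pretty Print JSON:       true
```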
And so now you've got these controller services configured. We have our schema registry right here that we've already worked off of. We also have the Avro reader; when you use the Avro schema registry, it will automatically add an Avro reader, because it needs to read that schema in Avro format to be able to write the JSON document. To get this working, though, some of the controller services are still disabled, so the other services are not going to work. You can see that; it's just like working with a processor. You can hover over the yellow yield icon and it will tell you why it's not working properly. For this one, it's because the schema registry is invalid, because that controller service is disabled. Same here. So after you're done configuring your controller services, you need to enable them. What I like to do is click the little lightning bolt, and it starts to enable that service. I can select service only, or service and referencing components. Because the processor references this service, I could enable the service and it would enable that processor as well. I'm going to do service only, the reason being I want to check my data flow and make sure everything looks good before I turn everything on. So I've got the green check box; I enabled that service, it's up and running, and you see the state is enabled. Same thing here: I'm going to enable just the service. It shows me the referencing services, the CSV record reader and the JSON record writer, and you can actually see the processors too: convert CSV to JSON, convert CSV to JSON. It's that same processor, but you can see the services and the processors, and that way you can decide whether you want to enable just the controller service or everything that goes with it. We close that. Now our CSV record reader and writer are no longer in an invalid state, just disabled, so everything is good to go; we just need to enable them. With those first two services enabled, we can enable the other two, and now everything is enabled. We can exit the controller services, and our convert CSV to JSON processor is stopped; it's no longer yellow, it's got all the services configured, and so forth. So now we're getting the file, picking up inventory, and we're updating the attribute to set a property named schema.name to inventory, so that when it gets to this processor, the attribute is already set. That processor, based on what we told it in the controller services, is going to look for the schema.name property and then look at the value to see which schema to use. So anyway, we have CSV to JSON working now. This is a new document, so we are going to need to name it, and all we're doing is another UpdateAttribute. This one is very easy: as I mentioned, the filename attribute is right here in the NiFi Expression Language, and this is how we reference the file name. In this scenario we're just saying, okay, take the file name and add .json to the end. So it's going to look at the filename attribute and say, okay, I've got the file name, all I need to do is name it with that name plus .json. Apply that. And then, of course, write the file back to a directory. So for this, I want to go back, and I want to say inventory, JSON. Okay.
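The last two processors in the flow are another UpdateAttribute for the file name and a PutFile to land the result on disk. The Expression Language is exactly the pattern described above; the output directory is a placeholder, since the real path isn't given here:

```
UpdateAttribute  ("Set JSON file name")
  filename = ${filename}.json          <-- appends .json to the existing filename attribute

PutFile  ("Write JSON file to directory")
  Directory: C:\path\to\output         <-- placeholder destination folder
```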
So if we've done this right, we will be able to pick a CSV file up, set the schema, convert it to JSON, set the name of the JSON document, and then write it to a file. Let's run it one time and see how it goes. All right, we have our CSV in our queue. We can look at the attributes, and you see the filename attribute with the file name inventory. But what we don't see is the schema.name attribute, and the reason is that it hasn't gone through that processor yet. This one has a content viewer; as I mentioned earlier, a zip file like we were working with yesterday will not have a viewer, but a CSV file, JSON, XML, text-based data, you're going to have a viewer. Let's exit that; that's what the data looks like. I can also look at the provenance, which we'll get to real soon. So I've run that one time. I'm going to run this next processor once. The only thing that should change is that I should have exactly the same data, plus a new attribute called schema.name. Right here: now I have a new attribute called schema.name with the value of inventory. So now it's ready to go into the actual ConvertRecord, and again, it's going to read the flow file as CSV and write the flow file out as JSON. Because it's reading as CSV, we have the record services; you can go back to the controller services and see the CSV reader and the JSON record writer that we went through and configured. So we should be good to go, and I'll run it once. All right, success. List the queue: still the same attributes, the file name is inventory.csv. You do notice that it detected a MIME type; a lot of processors will automatically try to detect MIME types, and this one is application/json now. Same type of details, with the file name and modified date and those types of things. We do have a different file size, because we went from CSV to JSON, and now we can actually view the JSON document. It took all of those CSV records and wrote them out as JSON, and if you remember from our Avro schema, we had store, item, and quantity, so it just followed that pattern and started writing out the JSON. The problem with this, though, is that the file name is probably still inventory.csv. Yep. So if we were to try to write this right now, it would be a JSON document, but it would be written with the file name inventory.csv. So we do another UpdateAttribute where we tell it, take the file name, which is this expression, and save this file with that file name plus .json. Run once, look at our queue, and our file name is now inventory.csv.json. If you were really fancy, you could go in and strip the .csv off and put .json, but to keep this as simple and straightforward as possible, I'm just adding a .json extension. The last step is to write this JSON file to a directory, so that should be here, and there it is. So now I've taken this inventory from CSV to JSON, even the bad data row that has nulls. You notice that the bad data row did not have any commas; it didn't conform, so when it was written out, null values were just applied to the other fields. If you had a process where you were checking for null values, you could throw that record out, or send it to another processor for ETL steps or whatever.
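To make the before-and-after concrete: the rows below are hypothetical (the session only confirms the three column names and that one malformed row exists), but they show the shape of the conversion, and the last line is one way the "fancy" rename could be written with Expression Language if you wanted to strip the .csv instead of just appending .json:

```
inventory.csv (illustrative rows)
  store,item,quantity
  101,widget,25
  baddata

flow-file content after ConvertRecord
  [ { "store" : "101",     "item" : "widget", "quantity" : "25" },
    { "store" : "baddata", "item" : null,     "quantity" : null } ]

optional rename instead of ${filename}.json
  filename = ${filename:substringBeforeLast('.csv')}.json
```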
But we've taken our CSV, and we've made it a JSON document. So I'm gonna pause there because that is a lot to ingest in the last hour, a little over an hour. What questions, you know, do you all have? So are we gonna go through a process of doing the controllers ourselves, right? Because it, you know, obviously for us to follow along, we would have to set those up. And I saw how you configured them, but I'm not entirely sure how you added them. So the scenario is you're going to add these and start building them in, and I'll help you along the way. It's gonna be, it's very hands-on for the next part to do this. And for the scenario though, you know, if you want to use the record service, you can. The scenario on purpose is set up to be able to be, you know, you can use multiple different processors to do this. What I'm looking for in the scenario is that thought process of here's what my data flow should look like if, you know, you're going to probably have technical questions, you're gonna have technical issues. But what I'm looking for in the next scenario is, you know, I've kind of thought this whole data flow through, and I've kind of got it built out because, you know, you can build the whole data flow without turning it on. You know, you may have missing relationships or something, but you can still build that whole data flow. And what we'll do is kind of walk through it, and then where you need technical help, I'm gonna help you, you know, with whatever way you're designing your flow. Because you may say, I don't want to do with record service, I want to extract text, for instance, right? And I want to use an extract text processor. So I can manually extract the text, those types of things. You may want to use a record writer, record setter, like I do. There's a few different ways. And so what I'm looking for in the scenario is just thinking through how I want to accomplish what I'm trying to do. And then, you know, if you come back to me during the scenario and say, hey, I want to use this record writer, I want to use this record setter, I need help configuring it or help with the schema, right? We can do that. Or I want to use extract JSON, and how do I do that? You know, there's a couple of different ways because the last class, for instance, you know, they spent quite a bit of time on that scenario, and there was five or six different ways people were doing it. Now, not everyone, I think only one person actually completed this scenario, but they got all of their processors down on the canvas. They got most of them configured. They applied the labels. They, you know, applied the naming convention and those types of things. And, you know, and then what I did is go through and help them finish out the building of that. So hopefully, you know, that will help, you know, in the scenario. But if you, again, if you run into any struggles or anything else, like I'm right here during the scenario as well, and we'll just talk through it. Does that make sense? I think, Richard, you asked that question? Yep. Okay. Yep, okay. Okay, perfect, perfect, perfect. So, so again, I know that's a lot to ingest and I, you know, we can build data flows and some basic data flows, but I really wanted us to try to get through some of the, you know, more advanced ways of doing this. And, you know, it's a little, you know, it's a lot to learn real quickly, but, you know, that's why we have a few hours now to work through a scenario. And, you know, help along the way. 
So I think I went over what a controller service is about. I know there are still some points to learn and things like that, but we can get through it. That controller service, again, was just a record reader and a record writer: it was reading in CSV and writing out JSON. If I wanted to, I could bring in more CSV files and set an attribute based on the source of the file, so that everything I get from direction A gets this schema and everything from direction B gets another schema; they're all CSVs, and that's how you can handle it. That's the beauty of controller services: I've now built them and configured them, and everyone can use that service. If that were a database connection again, everyone would be able to use that service. We're also going to install NiFi Registry, which handles version control of our data flows; it has its own service, and once we set it up, we all get to use it. So I wanted to make sure I went over services, because I think they are very valuable and a very important part of NiFi. With that being said, are there any general questions about services I can answer right now? Then we can go a little bit into provenance, take a break, and come back and work on scenarios. Any controller service questions? What are controller services for, like, specifically? Yeah. So a controller service is a shared service in NiFi. I could have a database controller service, and if that database controller service is already installed, running, and good to go, I can be a separate data engineer who comes in and needs a database connection, because I need to put data into a SQL server. Instead of me filling in the connection details, where the database is, the username, the password, the tables, all of those things, if you have a controller service set up already, multiple different users can use that same service in their data flows. So I could use that database connection service and write data to the database, but as a sysadmin, you never had to give me the username, the password, those types of things. Plus, you're able to control who gets to use that connection, and see who did use it through all the provenance information. Did that help answer it, or do you still have additional questions about it? Yeah, that made it more clear. Yeah, and that's why I led off with this this morning: I know this is probably one of the trickiest, harder-to-learn aspects of NiFi, so I wanted to give us plenty of time to go through it. Any other questions? This is going to be kind of a, not dumb, but maybe more of an aesthetics question: how did you get those relationships out to the side like that, where the arrow goes? Yeah, you know what I mean? When I try to do it, I can't; I'd like to know how you did that. Ah, no, no. So if you remember yesterday, I said you can click the connection. Let me do this: let me take a processor and walk you through it. Let me just do a get on something. I'm asking because I like the way that looks, having them spread out. So I'm going to grab the Solr one.
So I'm going to put two processors together right quick. You see this? If I double click the line, I get a point, like that. So that's exactly what I was trying to do; I could not get that to work. Click right off of the box, and let me do it again. So if you drag that down, and on these machines it can be a little laggy, then right above that box I just double click, I get my point, and then I can adjust it. Okay, that's what I was trying to do. Okay, thank you. Yeah, it makes the presentation so much cleaner, those types of things. Yeah, I like it. Thank you. Great question, actually; I get that question a lot. There's really not a lot of documentation on some of the beautification of flows and that kind of clicking and arranging. They're always introducing more of these minor features that don't really get publicized, so a lot of it is just Googling around or having the experience of working with it, and once you start getting it, then yeah. But you can see I've got adjusted lines here that look different, I've got them color coded just because it's easier to read, and I have labels on every one to give that visual explanation. When you are building this out in your environment, you may have a policy that says here are some basic design principles you need to follow. When I led software engineering teams previously, we had to comment our code, and it's the same thing here: you may have a policy that says all these data flows need to be labeled, they need to make sense, those types of things, and be cleaned up, because you can have a spider web of processors. And today we will have a spider web of processors; when we're working through this scenario, it's going to be processors all over the place. So just do the best you can, and then come back behind and clean up. That's usually what I like to do. All right, Andy. I got it, but that was tricky, man. It's easier if you're running this on your local machine, because with the latency you can click and then it won't drag, or you drag too far, or it never drags at all. I've already gotten a pop-up about latency once today. Okay, but great question. Any other questions? Okay. We have a few minutes before break, so before we go into the scenario, now that we've sent data through, let's look at our data provenance. If you remember, from the hamburger menu you can pull up your data provenance events. You see the component name, which is the actual processor; again, another reason to name your components something easy to read, so you can sort and filter all of these provenance events a lot more easily. You can also search for events, those types of things. So let's look at the write JSON file to directory event. When you click this, it's going to compute the lineage, and you can then replay the data actually going through that data flow. So I received it from this provenance event.
I received it from the get CSV file from directory processor, and I can look at the attributes and the content as they were when I received it. The next step in the provenance was a download event: the content was downloaded from the processor after it received it, and here's what the content and attributes looked like after that event. Then an attribute was modified: if you remember, the next step was to set the schema name, so here, during that whole data flow, is the update attribute event. I can look at the attributes and see the schema name set to inventory, and I can look at the content; the content should still be the same, because it's still a CSV file, but if anything had changed, I could replay it and see it. Then the content was modified using the convert record, the convert CSV to JSON processor. I can see that it came in as CSV and comes out as a JSON document. I can look at the content now: here's the input claim and here's the output claim, so the input was CSV and the output was JSON. That one processor took the document from CSV to JSON, and I was able to replay it and see exactly what changed within that single processor. Then the next processor received that JSON document; it's a download event because it downloaded it from that processor. I think of a download event as the connection: the connection was the receive from processor to processor, so you get a download event. This one should be the set JSON file name step: in that flow, we set the file name of the JSON document, and if we look at the attributes, we now have inventory.csv.json. Then, after it had that JSON, it received it and wrote the JSON file to the directory. Here's where it put it, as well as the attributes and the content: the input of 745 bytes, the output of 745 bytes, the same identifier and offset, that type of thing. Then it dropped the flow file, so it was done. The whole event duration was 0.006 seconds, and you can see the final content, replay it, those types of things. One of the other nice things is that you can actually download the lineage if you want. I've honestly never seen a lot of use for this, but a while back I was helping the Centers for Medicare and Medicaid Services, and I worked in the fraud division of CMS, where we would need to turn over a chain of custody for data we had received to the FBI and others to prosecute false claims and those types of things. In doing that, we would have to download some of this lineage information. Now, when you click here to download lineage, it's basically just giving you an image of what happened. I don't really find that useful, but you can take all of these events and send them out of NiFi, for instance to a corporate governance system for long-term storage, or extract them out of the provenance events. That's how we would usually handle those. They do have the image here; I just don't think it's that useful, but you have the capability. You can also go straight to an event and pull everything that happened, those types of things.
I've never had a use for the image, but you may, and it's there in case you need it. The beauty of this is that we went through the data provenance from start to finish for that data flow. We can see exactly when we received it, exactly when it was converted, and the before and after of that conversion. We can see it move along from processor to processor until it's out of NiFi. The nice thing is that this is built in, and you can go back and replay these events. We use it a lot for diagnosing data flow issues, because you may have a valid data flow that is picking data up and putting it somewhere, yet you still run into an issue with a processor. I've seen a processor handle characters incorrectly without ever throwing an error, and when you look at the data provenance events, you can replay them and say, hey, wait a minute, that processor is malfunctioning; it's not reporting any errors, but we're seeing weird things in the data. So there are some use cases for that as well: provenance events for investigations, those types of things, and just providing that chain of custody for the data. We went over provenance a little bit yesterday, and we're going to touch on it as we go along, but your flow from yesterday should have generated provenance events as well, so if you want to pull up your flow and run it, you can see provenance events too. That's where you access it. And feel free to explore the menu; if you break something, the beauty is that I'm here to fix it. Look at your flow configuration history. Look at the node status history: here's how much free heap space there is, and again, we're on a Windows box running Java, so it's going to be all over the place; the flow file repository free space, and so on. Feel free to go through and look at all of these, because I know we have some sysadmins on the call. Ron, Tom, you may be interested in some of these metrics. There's also a reporting task for Prometheus, so you can have all of these metrics going out to Prometheus and see the same things on a Grafana dashboard. You can send the provenance events off, and you can send the status and all of those metrics off as well. So it's here if you need it. If you're processing a large number of files, you will definitely look at this, because some processors consume a lot of resources. We were working with one of the bigger resource hogs of the whole system yesterday: the unpacking and packing of zip files can be very heavy. I've seen folks trying to unzip and zip five to ten gig files, so it's trying to throw all of that data into memory, unzip it, take five gig that turned into twenty, and their system crashes. There are smarter ways to do that, and you just have to work through it. But as a sysadmin, I find the status history has a lot of good information.
And then we went through some of the others already, but feel free to click around and just explore the UI. What we will do is take our first break of the day, and then when we come back, we're gonna start working on our scenario. Now, I do expect this scenario to take a while, and you're gonna think we went from an easy data flow straight into the deep end, but I'm here to help. I'm here to walk through it with you and talk through it. I can see everyone's screen, so just bear with us and we'll get through some of these other scenarios I have planned. Then we'll probably go into some registry later today, and tomorrow we'll wrap up, have a little test, mostly like an open book Q&A, clean up our other data flows, and go from there. Again, when you're building out the scenario, I'm looking mainly at how you want to do this. I don't necessarily wanna see a fully functioning data flow; I wanna see the thought process of here's how I plan to accomplish the task, here are the processors I would use, the connections, and things like that. So with that said, unless there's a question, I am going to get something to drink and we'll go on our first break. It is 11:09. Okay, let's do that. All right, so let's take a quick 15 minute break. I will see everyone back here in 15 minutes, and we will start going through some scenarios. Give everybody a few minutes to get back. I was checking to make sure you all have the scenario, and it looks like it's installed. I'm gonna go ahead and start the scenario first; we'll get started up here in a minute. So the data flow I was using for the controller services, I have a template of that you can use for reference. I didn't upload it yet because I kinda wanted to walk through my flow first, but for the scenario we can upload it for some assistance once everybody gets back. Thomas, it looks like you've got one of your lines figured out. Travis got it all figured out, I think. Yeah, see, I'm looking at Travis, I'm like, frickin' dude, man. Still, it's not perfect. One of my lines is a little crooked, because getting those points straight is not very easy either, but I don't know how Travis did it. I haven't gotten the controller aspect of it yet. No, yours looks great, I'm jelly. Yeah, the controller aspect, like I said, that's why I wanted to lead off with that this morning. This will potentially take the entire day, up until after lunch at least, and it kinda throws us into the deep end, but once we get this figured out, you have NiFi 90% figured out. There are some other ways of doing things. I know we had some questions about Python, stuff like that, and I'm probably gonna go into that tomorrow, but once we get controller services squared away, you should be set for building your own NiFi flows. All right, let me see, I was about to pull up a link. Hey, Maria, are you still there? No, she dropped off. Hang on, there's a scratch pad that we use. Oh, here we go. Yeah, okay, once you get the hang of moving this around with those different, I don't know what to call them, waypoints or whatever, when you double click on the line, it's a little easier once you get the hang of it. Oh, that's an easy one. Okay, so in Teams, I'm posting a link. Hopefully you're able to click on it.
It's a Dropbox link, and it has the scenarios already uploaded to your uploads folder, but it also has the flow that I just worked on. I saved that so you can use it as a reference. So you should be able to bring that Dropbox link up and download two zip files: NiFi scenario and NiFi example main. And if you notice at the bottom of your screen,
All right, looks like, well, Darius, you've got yours, maybe. Tom, it looks like you're working on it. It's zipped inside of a zip, so just keep following that folder down and you'll find the sample data flows. You're looking for CSV to JSON in the import. So you will have the same exact flow that I have. Notice it was not a template: it's not an XML document, it was a JSON document. This is a quirk I've already filed an enhancement request with NiFi about a while back. You can import a JSON data flow into a process group, but if you create a template, it's an XML document. And then we're going to get into registry and version control, and that's a JSON document. So I feel like there should be one common way to do this, and we probably should get rid of templates as they are configured today and go with JSON documents that we can easily share as well. It's just one of those nuances within NiFi. So your data flow should be on your desktop. You're going to have to fix your get file and your write JSON, of course. Everything else should work, except the convert record. If you go to configure on that, you should be able to go to your CSV reader controller service, enable it, enable the other services as well as the JSON services, and then you should be able to run this data flow one time. It's only got one CSV file. So see if you can get that to work, and then we will move into the scenario, where you would potentially use some of these same components in your own data flow. We'll give that a few minutes. I'm gonna look at everybody's screen. If you have any issues importing it, let me know, but it looks like everybody got it. I'm curious to see if everyone's able to go to the convert record, follow the arrow, and enable the services. You can use enable service and referencing components if you want; it may error out because you have multiple services that need to be enabled, but we'll see. Well, Darius, you're already in the controller services. Great, get those enabled. I've already configured them for you, so you don't really need to configure anything, you just need to enable the services. See if you can get that, and if you can, see if you can run that CSV document through one time and output a JSON document. Again, you're gonna have to modify your get file and your put file, but it should be rather quick. And again, this is the Windows file system, so you can make directories wherever you see fit. Also, if you are in the Avro schema controller service, you can expand that and look at the actual schema that I used. You may wanna model your schema for the scenario off of that one, just as a hint. Right. Did anyone have an issue? I know some of you are still working through it, but did anyone actually get a CSV to flow through and output a JSON yet? I did. Oh, awesome. Perfect. Did it help to see a little bit more what a controller service does? Yeah, I definitely, I feel very novice about it, if I'm being honest. No, and again, the leap from creating a data flow to using controller services is a very far leap. I totally get it. I have tried and tried to figure out an easier way to bridge that gap.
And so what I like to do is just run through my flow and then have you do the exact same flow, enable the services, and get the look and feel of things. The scenario is very closely related to this flow. Once you get done with this, the rest is all easy, unless you're wanting to write Python or write your own processor or something like that where you've got to actually do some coding. Quick question: how do I tell if it's going through the flow or not? Yeah, so do a run once instead of just turning it on; I like to run it one time. I'm looking at your screen. You see, wow, you have 4,538 files in your queue. You see that? I don't think I meant to do that. Oh, that's okay. I would stop the get file though, because you're about to blow it up. So right click and say stop. There you go, refresh. Just on your main canvas, come out right here beside it. There you go, right there. Right click and go refresh. All right, so you only loaded 8,000 files, so not bad. 10,000 and it would have turned red. If you notice, it turned yellow. This is a great teaching opportunity. I think yesterday we had someone load, it was Richard, Richard was experimenting or something and he had 10,000 files in the queue. The way NiFi is configured out of the box, once you queue up 10,000 files it will start backing everything up until those 10,000 files are processed. But it's good, we've got this. So you see the next one is the set schema metadata. That's an update attribute processor, and all we're doing is adding an attribute called schema.name with the value of inventory. We're adding that attribute because the controller service needs to know it; see schema.name at the bottom, inventory. So say apply. Then right click on set schema metadata and say run once, and refresh again somewhere off to the side. And you have that one file now in the success queue after the set schema metadata, see that? Perfect. Yeah, I see. Okay, so let's go to convert record to JSON. Let's configure it first. You see it's coming in as a CSV, right? It's gonna go out as JSON. So we're using the CSV reader. Go ahead and click the arrow and it's gonna highlight that record service. So we are using the CSV record reader service. If you want, you can click the gear and look at the properties. Basically, all we're changing from the default is using the schema name property, because there are different ways we can take advantage of the schema access strategy, and this one is the simplest. So we're just updating an attribute, putting the inventory name in that attribute, and that's what the schema name is. So we're telling the schema access strategy to use the schema name property, and the schema name property is schema.name. This is how we reference that property in NiFi: dollar sign, open curly brace, schema.name, close curly brace. So say okay on that one. And while we're here, let's look at the other components so we can enable them. Look at your JSON record set writer, and notice the schema write strategy is the set schema.name attribute. So the CSV reader was using that schema, and we're gonna use the same schema to write it; schema.name just tells it which schema to reference. So say okay. Okay, let's look at the Avro schema registry, the second one. Click the gear.
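To put the pieces of that walkthrough in one place, the configuration being described fits together roughly like this. Treat it as a sketch: the property labels are from memory and can differ slightly between NiFi versions, so match them against what your own processor and services actually show.

  UpdateAttribute (set schema metadata)
    schema.name = inventory

  CSVReader controller service
    Schema Access Strategy = Use 'Schema Name' Property
    Schema Registry        = AvroSchemaRegistry
    Schema Name            = ${schema.name}
    Treat First Line as Header = true

  JsonRecordSetWriter controller service
    Schema Write Strategy = Set 'schema.name' Attribute
    Schema Registry       = AvroSchemaRegistry
    Schema Name           = ${schema.name}

  ConvertRecord (convert CSV to JSON)
    Record Reader = CSVReader
    Record Writer = JsonRecordSetWriter

The key point is that the reader and the writer both resolve their schema through the same schema.name attribute that the update attribute processor set one step earlier.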
Okay, so the only thing this is, is a registry of our schemas. We set the schema.name to inventory so that when the CSV reader reads that CSV file, it knows what schema to use, and when we write it as JSON, the writer knows what schema to use. So here is where you would put that schema in. If you click the value, there you go, and you see on the right you can drag and expand that box, right above okay, right there. There you go, drag it. It's a very, very simple schema: the type is record, the name is inventory, and it has three fields, store, item, and quantity. So when it's writing that JSON document, it's gonna take all of those records, the store, item, and quantity values, put them into that JSON document, and format it properly. So just say okay, say okay again. And then let's look at the Avro reader. We have this just because an Avro schema registry is a very easy way to do this in NiFi using controller services. If you click schema access strategy, you see we're using the embedded Avro schema, and let's see what other options we've got: we can use the schema text property, we can use the schema name property, we can use HWX schema reference attributes, or a Confluent schema registry. If you're familiar with Kafka, Confluent has a schema registry. We are just using the embedded Avro schema, so you can click off of that and just say okay. If you were doing this where you had hundreds or thousands of flows, and you had your own Avro schema department that maintained those schemas in a more corporate-wide fashion, you could use the Confluent schema registry option. We're just using the bare minimum here. So you've got everything set, everything is enabled. Hit X, and just say run once on the convert CSV to JSON. Okay, and if you refresh, there should be one on the JSON file name. Perfect. Also, you noticed how when it came in it was 225 bytes, and when it came out it was 745 bytes, right? JSON documents are much bigger because they carry a lot more structure. CSV is just values with a comma or tab separator or whatever, while a JSON document has a lot more depth, so of course the file size is going to be bigger. So the next one is set the JSON file name. If you remember, the attribute going in is the file name of that file, which is inventory.csv. We want to update that file name to inventory.csv.json. So run that once; that's where we set it in that update attribute when you went to configure. Okay, then refresh again and you are successful. So now you have a 745 byte file that should be named inventory.csv.json, ready to be written to the file system. So go ahead and run write JSON file to directory, and refresh. I don't see a failure. So let's look at the configuration for that write JSON file to directory. It put it in sample output data. Cancel. I don't know where that is. Create a folder where you would want to put this. Right click, please sir. Okay, right. So go to your downloads on the left. There you go. Go to your downloads. Create a new folder called data output or whatever you want to name it. If you go to the far right, you can right click and just say new folder right there. Okay, right click where? Yeah, right there. Say new folder.
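For reference, an Avro schema matching that description would look roughly like the following. The store, item, and quantity field names come straight from the walkthrough; typing all three as strings follows the keep-everything-a-string advice given later in the session, so the types are an assumption rather than a requirement.

  {
    "type": "record",
    "name": "inventory",
    "fields": [
      { "name": "store", "type": "string" },
      { "name": "item", "type": "string" },
      { "name": "quantity", "type": "string" }
    ]
  }

Note there is no comma after the last field; that detail comes up again later in the day.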
Nope, go down below. You see that SD delete? Go right below that. Don't click on that, but go right below it. Right click, and new folder. You can just put something like output. Now go into that folder. Perfect, go into that. Now click your address bar and it will resolve that to a full path. Control A, control C. Then go back to write JSON file to directory, configure it, open that first property up, there you go, paste it in. You can do control A, control V. There you go, say okay. Conflict resolution is replace, perfect. Hit apply. Okay, so let's clear your queue. Let's go ahead and right click on write JSON file to directory and say start, and let's start your set JSON file name. All right, and let's go to convert CSV. You notice how I'm working backwards. The reason I'm working backwards through the flow is that all the queues are clear except the one at the beginning. And so, yeah, now hit refresh. Let's see how quick you processed 8,000 files. Yeah, see how quick that worked? You picked up 8,000 CSV files. If you right click again, it'll probably be finished. Refresh. Yep, and you've got 4,000 files queued for delivery to the file system. Right click and say refresh. 1,000 left for writing. Oh, and you've still got one queued up above. So you processed 8,000 files in less than two seconds, I think it was, and on an old Windows machine. Right click again and refresh it. Also notice how I did not have him turn on the log error message. The reason being, I want anything that goes to failure to queue up so I can see why it went to failure. I don't want to log that error message just yet, because if I turn on log message, it's going to send that file to the log message processor, write the error, auto terminate, and that file is gone. I want it to queue up so I can see what's going on in case of a failure. So right click again, and that one single file hopefully is done. Also notice, oh, there it is, you can go to your top right by the search. You see the little red box now? It's cleaning up, and it takes a minute to clean that up. The instance we're running on is using very conservative settings: we're not using a lot of RAM, we're not using a lot of CPU, and the content repository and those types of things are not configured for the best performance right now. So it's going to take a minute for NiFi to clear things up and write it out. You can see it's clearing itself. If you refresh again, it may, oh, can you stop your get file from directory, because you're just processing it over and over and over again. Yeah, you've outputted almost 11,000 files already, in less than one minute. Already over 11,000, okay. Let that clear, but that's your queue and you should be good to go. Since we're here, go ahead and go out to your main NiFi canvas. Go back up a level. Perfect. And you can see now, just at a quick glimpse, that you have this process group. If you refresh, you can see there are 20 files in the queue. It read 11.27 meg and wrote 8.84 in the last five minutes. There are no more files in the queue. You have two processors stopped and four that are running. We know the two processors that are stopped are the log message and the get file that you also stopped. So it looks like your queue is clear. If you want, you can actually right click on that processor group and stop all processors. So if you click stop, it's gonna stop all your processors.
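The write step being configured there is a PutFile processor, and a minimal configuration along the lines of what's described looks roughly like this. The directory path below is a hypothetical example standing in for whatever folder you created.

  PutFile (write JSON file to directory)
    Directory = C:\Users\student\Downloads\output    (hypothetical; paste your own full path here)
    Conflict Resolution Strategy = replace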
If you click start, it's gonna turn all of them on again, and you'll be processing another 10,000 files every minute, just because it's picking that same file up over and over. Now right click on that processor group again; let's look at this for a minute. You can actually configure that processor group, go ahead. It's gonna take you to the controller services being used. Go to the general tab, there you go. You can rename this processor group, you can add parameters to it, there are all kinds of settings. The default back pressure object threshold is 10,000 files. We saw that yesterday with Richard, where he had 10,000 files in his queue. If he wanted to increase that to 20,000, right here is where you can do that. Or, I think he had 1.1 gig and it was like 400 files, so you can manipulate those values here. Just as a side note, if you are backing up with 10,000 files, try to distribute that processing a little bit, because you don't want that single processor doing so much work and utilizing all the resources. But go ahead and exit out of that, do an X on that, and then right click again on that processor group. And my apologies, O'Darius, I'm just using you as a training example. You can enter the group, you can start and stop processors, you can enable or disable the group, and you can enable or disable all the controller services associated with that group. You can also go down to download flow definition and say with external services, and it's gonna save it as a JSON document. You remember when you imported a JSON document into your processor group to create this? Now you can export that out and send it to your colleague, those types of things. That's what I was saying earlier: we have templates, but we also have this way of doing things as well. So right click on your processor group again, and you can center it in the view, you can group it, you can copy that group, you can empty all the queues. If you have everything stopped, this is a good way to quickly stop everything. If your queues are full, something's just running and filling the queues up and you wanna diagnose it but don't have time to look at it right yet, you can stop everything, clear the queues, and when you have time, go back and look at that processor. You can delete it. You can create a template from that processor group. You can view any input ports, output port connections, and the status history. We looked at the status a minute ago in the hamburger menu; here you can look at that specific processor group. So keep in mind that a processor group is more than just a collection of flows; there's a lot of capability in a processor group. Okay, so Darius, I think yours is good to go. Can you open your folder to see if you actually outputted any data? There you go. So you ingested a CSV file and you wrote a JSON document. Perfect. Anyone else have any issues with this, or creating the flow, or any questions on this flow? Oh, no problem. And again, my style of training is probably a little different than most, where I like to be very interactive. I like a lot of questions, I like conversation, I like just walking through things. So if you have a question, let me know. So yours looks good. Travis, I think you got yours working, it looks like. I'm all good. Good, good.
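Those queue limits the group settings expose are the per-connection back pressure thresholds. To the best of my recollection the out-of-the-box values look like this, with the 20,000 figure purely a hypothetical override for a heavier flow:

  Connection settings (right click a connection, Configure, Settings tab)
    Back Pressure Object Threshold    = 10000    (default; this is the limit that turned the queue yellow)
    Back Pressure Data Size Threshold = 1 GB     (default)
    Hypothetical override for a heavier flow: Back Pressure Object Threshold = 20000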
Yeah, the last class we had a couple of folks who tried to set this up and missed one crucial step, and that was in the CSV reader, where it asks whether to treat the first line as a header, true or false. A couple of folks in the class said false, so it was throwing errors and we had to go through and diagnose it. So just keep that in mind. But it looks like everyone got it running, and I think everyone got a JSON document out, so good deal. All right. Now we are going to move over to the actual hands-on scenario. Please use this flow as a reference. Copy, paste, do what you need to do; you can steal from it. The best developers in the world steal their code from the internet, and that's what they work off of. So do what you must. But let me bring up the scenario. The scenario is already in your uploads folder if you wanna follow along, but I'm gonna present it as well. There we go. So you should have a folder that says NiFi scenario in your uploads. The scenario is: you are a data analyst at a local government agency responsible for monitoring environmental conditions across various locations in the region. Your task is to aggregate, transform, and analyze weather data collected from multiple local weather stations to provide daily summaries and alerts. Now, I know I have alerts in here, and I know you may send this to a system that provides an alert and a dashboard, you may send it as an email, whatever. So when it comes to the actual alert, when you're building your flow, I know you're not gonna be able to send an alert to my cell phone, for instance. However, what I'd like to see is how you plan to send this alert: how you generated the alert, and how that worked into your data flow. So, you're using NiFi to automate the collection, transformation, and reporting of the weather data. The source weather data files are generated daily by local weather stations and stored in a directory on a local server. Each file contains hourly data, including temperature, humidity, wind, and precipitation. Here, I will open one of them. You should have two CSV files and a JSON file. So here's the station ID, ST001. The date, which is 5-6; if you want, you can go in and update the date. That was from the last training class, the date they took this, and so the date stayed. Hours, zero to 23, temperature, humidity, wind speed, precipitation. They're just random values. So that is the data. Your task is to set up a NiFi flow to monitor the directory and ingest new files as they appear using the get file processor. I provide suggestions of processors to use; you can use extract text, you can use convert record, whatever you feel most comfortable with. Again, NiFi can be used many ways to build a data flow, and the beauty of it is there's a lot of flexibility and a lot of freedom. That's also the downfall: you can skin a cat many, many ways with NiFi. So I'm providing just suggestions. However, we do want to go look at all the processors; I showed you how you can search processors and filter on the word cloud, those types of things. So what we wanna do is start setting up our data flow to ingest this and transform it into JSON, which allows for an easier way to analyze.
And then from there we can extract the text or whatever, convert record. Then we want to do data enrichment, so enrich the data with, say, a timestamp or precipitation data, those types of things. The data enrichment step is kind of optional, but then comes the data aggregation: aggregate hourly data into a daily summary for each parameter. So station one, station two, station three, here's an aggregate of the hourly data for each weather parameter. Again, you can use merge content to combine all the records; these are hints. And I think it was Ben, I'm assuming you all know each other, but Ben in the last class did a badass, sorry for my language, a really amazing SQL query, I think it was, where he was combining these records. I don't think he got complete with all of it, but I really liked the thought process he went into. So there are different ways. Alert generation: generate an alert if certain conditions are met, right? So if you see a high temperature for that station or that day or that hour, you may generate an alert. What I'm looking for here is, here is the alert: station 002 at 2300 hours had an extremely high temperature, some sort of anomaly, some sort of story. I know you're not gonna be able to send me an email or a text or any of those things, but you should be able to get that alert out, and we can at least view it as an attribute or view it in the queue or whatever. Some additional considerations: controller services, key processors to focus on. The goal here is to give you as many hints as possible on ways to do this; again, there are many ways. You noticed on the last processor I was converting CSV to JSON. If you look at your weather data, you will have two CSV files and a JSON file to work with, right, because not all sensors are the same. So the goal is to pick all these files up, get them into a common format that we can work with, and then extract and compute. That being said, any questions? Perfect. Clear as mud. So how are we gonna do this? I would go to your main NiFi canvas, create a new processor group called Scenario 1, and let's get started. I am sitting here. I'm going to try not to talk as much. I'm here for Q&A, I'm here for, hey, here's what I'm thinking, what do you think? My goal, like I said, is after lunch, because this is gonna take us a few hours. After lunch, after we get through building this out, I am going to go around the room to each person, and I'm looking for that story. You're a data analyst; I'm just looking for that story of, here's how I did this. So if there are no questions, let's get started. As you see, I have everybody's screen pulled up, and I'll pop into your screen to check on you and provide commentary along the way. I'm also looking at beautification. I'm looking at all the items and all the little points that we've touched on for the last day and a half now, so keep that in mind. Last class we had some amazing looking flows. One wasn't necessarily fully operational, but it was still, quote unquote, passing, because it was a very good story of how it was done. It was laid out very beautifully, it had everything labeled, and those colors.
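One hedged way to express that alert condition, once the hourly values have been pulled up as flow file attributes later in the flow, is a RouteOnAttribute rule written in the NiFi expression language. The attribute name and the threshold below are purely hypothetical illustrations, not values from the scenario:

  RouteOnAttribute (alert generation)
    Routing Strategy = Route to Property name
    high_temp_alert  = ${temperature:toDecimal():gt(100)}    (hypothetical attribute name and threshold)

Anything matching the rule routes out of the high_temp_alert relationship, where you could attach whatever notification step your story calls for, or simply let it queue up so the alert can be inspected.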
I think someone actually broke it up into multiple process groups, using an input port and an output port. I know we haven't gone over that, but it's there and the documentation's there. We had a couple of people use a funnel, which I on purpose didn't explain, and I was really surprised to see them using some of the tips and techniques that I didn't necessarily go over. Use the flow that we just went over as an example and you should be good to go. Just checking in to see if anybody has any questions; if you do, let me know. We'll work on this until we go to lunch, and when we get back, we'll finish working on it. Leroy, yours is looking really nice. Well, Darius, yours is looking nice as well. It looks like Peter's just about got it figured out. Tom, I like how you've got all your processors out there, and now you're probably working to build those connections. Yeah, but I have no clue, honestly. Okay, well, do you want to speak to... I put the freaking processors out there based on your documentation, and then I was going to try to mess with it from there. Well, let's kind of walk through it, right? You've got to get your file from somewhere. There are two CSV files and a JSON file, I think, right? So we would probably want to make all of those JSON. You could use that previous example data flow: get the file, update attribute. You may want to filter your files to only CSVs, so in your get file you would go into the file filter and add .csv on the end. I think it's dot star as the default. So to pick up only CSV, do .csv. That makes sense. Yep, and there you go. And then walk through the previous example where you used a controller service, and set that controller service up. And let's get these... Yeah, that's what I, I don't remember how to do the controller services. Okay. I think we quickly went over that. Like, I thought you could configure it within the processor, the controller service. Yeah, well, let's go back to the flow that I sent over from when I went through the controller service. So here we get the CSV file. We set the schema name, because we're going to need to build a schema. And then we use the convert record processor to convert it from CSV to JSON. So right there, we set schema.name with a value of inventory. So that way... Oh, right. Yep, and so go ahead and cancel on that one. And then we use the convert record to ingest that CSV file. Go ahead and say configure. When you use that convert record, you're going to say you want to use a CSV reader, so hit that little arrow and it's going to take you to the controller service. So, service, okay, I got you now. Yeah, so follow along. The reason we went through this one is because it's going to help you immediately get things started; it's a little cheat. But yeah, in your flow, just kind of work through it. You want to ignore the JSON document you have already, because... Okay. Because you already have a JSON document. So what we're going to try to do is, you may want to open that JSON document to make a schema that resembles it. And if you look at your Avro schema, you're going to see we put a sample schema there when we did our flow.
You can actually copy, paste, add a couple of values, and you should be good to go to have all of that as JSON. Once you get it all as JSON, you can then extract those JSON elements, and then you're mostly done with the whole flow. And where was that at, that file that had all the curly braces, the configuration, where was that at? That is in your Avro schema registry. You clicked off of it. Go back, configure, and then just go to the controller service. There you go. Click it. The second one is the Avro schema registry. Click your gear, and there's your schema, inventory, property, there you go. Oh, gotcha, all right. Yeah, so if you use this flow, you should be able to get to where all your values are JSON, and then we can do an evaluate JSON path processor and do some good math. I know we're coming up on lunch here in a few minutes. The goal, and this is for everybody, is to get this started. Feel free to work on it through lunch. I'm gonna go grab a sandwich and come right back. But let's just keep working on it and take lunch as you need. You should be able to get through it; just think about it where you're trying to aggregate these values, so you want everything in a common format. If you can get those CSVs converted to JSON and outputting all JSON, then we can look at an evaluate JSON processor. If you go into the processor list, you can click on JSON, and you'll have processors there to work with JSON. Work on that, Tom, and then once we get to the JSON part, let me know and we'll walk through some more. Okay. And again, for everybody else, the scenario just mentions processors; you can use all 300 plus processors we have. But if you get hung up or you're just stuck and you can't go any further, please speak up and let's get you past whatever hurdle you're at. And if you could get a second, I have a question on my workflow. Let me look at it. All right, I'm bringing yours up right now. What's the name? Okay. So, and I can walk through it too. So on this one, I'm getting at least one file so I can look at it right in here. I just have to get the workflow, but I came in here, and I am setting my schema name to weather, right? Weather data. Yep. And then, pardon the latency here, I just kind of have to click around a little bit. In here, this is where I am getting my error, but maybe you can help me figure this one out. So what I did is I set my schema name to weather. As far as my reader, well, I'll start here with the schema registry portion. Right, so I did configure my schema. Okay. Let's look at your schema real quick. Let me just scale this out. Another little tip here: store everything as a string. It makes it a lot easier. Okay. So station ID. I tried to play with the types, and it didn't like that. Yeah. I at least wanted to bring in the data, right? Yeah. And I think it gets to a point where it reads it. And then in here, this one's pretty straightforward unless I'm overlooking something. No, the only thing that that controller service does is enable the Avro reader, and it allows you to have that Avro schema. Yeah, and then let me look at my reader. I do have the schema name property. Yep. And I made sure, I did notice the first header is, sorry.
Treat first line as header, true, perfect. Because if not, it's gonna error. Yep. Okay. And then come in here, same, right? So using the same registry, the same schema, I guess. And then the schema name, which is whatever, right? Yep. So unless I'm overlooking anything here... So what I did was just kind of step through it, step by step, and I'll get to here and it'll crash. I think I'm already... Well, yeah, you've already got an error. Let's look at your error. Before you do that, let's just read your error, in the top right. Yeah, and that's where I forgot how to get to. I was navigating in here and I was looking at the... Well, you have the top right of your convert CSV to JSON. You see the little red box. This one, or this one right here? Nope, nope, on the convert CSV to JSON. Look at that processor, top right corner. Yep, yep. There we go. Failed to process. There we go. That's the error I wanna see. Failed to parse the incoming record: error on input string 11.17, which is probably a value in that data, right? Let's go back to your schema itself. Yeah, actually I put it here. Oh, beautiful, beautiful. So I was looking at the data, and what I ended up doing is, I was thinking wind speed might be numeric, because I was trying data types. I tried to use a double, I think, and it didn't like that for some reason. It does not. So the last class ran into this, and my recommendation was to keep it all strings for now. Okay, yeah, and I think that's what it is. Actually, I just noticed wind speed is not an integer. No, so just keep it all as strings and you should be good. And yes, one of your fellow classmates in the last class did the exact same thing. But yeah, keep it a string, see if that'll work. You may have to mess around with it, but that should unblock you. And when you're done with that, you're going to have three JSON documents, right? Okay, perfect. So let's make it a goal to have three JSON documents shortly after lunch. Now it's 11:36, y'all's time. Again, this is a very working lunch for me, so what I'm going to do is pause. I'll continue answering questions right now, but when we're done with questions, I am going to go grab the sandwich my wife made for me and eat it at my desk, and that way I can answer any other questions and things like that. So I'm going to put up the slide that says we're going on lunch, be back in 45 minutes, but feel free to work through this. Hopefully that will fix your problem, Richard. And I like the path you're on, right? You're using that previous processor, you want to get everything together. So if you can, just say okay. And while I've got your screen pulled up, let me take a look at what you're doing past that. Go ahead and say okay and go back to your main canvas, right quick. Sorry about that, I thought you were going to drive. Oh no, no, no, I try not to take them. So, the write JSON to directory. So basically it's my last flow and you're just updating it. Okay, perfect. So that I understood each of those processes, right? No, yeah, we're reiterating what we already learned, so that's perfect. And then I'll get creative again. Good, good. Okay, well hopefully that will square you away. If not, just let me know. I don't see hand raising sometimes on Teams because I'm actually watching you all.
So again, for anybody else, just feel free to blurt out, hey, Josh, I have a question. But yeah, you should be good to go, Richard. How's everybody else looking? Tommy, a little less frustrated now, hopefully. Yeah, I am gonna step away for a little bit, but I'll come back to it here in about 20 minutes. Okay, all right. Peter, it looks like yours is good. How are you doing? I'm okay. Okay. Is there a way to copy that entire process group to this new process group, or do you gotta copy each processor at a time? No, you can actually hold shift and, you're coming in really slow, but I heard that you're good, I think. You're just working through it. Okay, perfect. Yeah, I just wanna copy the process group. So you can hold shift and copy a whole flow that we worked on previously, put it in a new processor group, and then start adding to it, right? Also, remember when I had you create a new processor group and you uploaded the JSON document I sent over? So you could start with a new process group, start with the previous flow, and start mocking it up. That's what I needed to do, okay. All right, there we go. So take a break for a minute, clear your mind. I always find that if I go get something to drink and come back, it helps me think about it while I'm not looking at a screen. All right, so does anybody else have any quick questions before we break for lunch? Again, I'm around during lunch, just grabbing food and snacks and drinks, so I will be at my desk, and if anybody has any questions, feel free to let me know. It is 11:40, y'all's time. I also forget how you import. No worries, no worries, I got you, I got you. So I think it'd be 12:25, 45 minutes. I'll show you in just a second. All right, so okay, Thomas, here's how you do it. You brought down, did you bring down a new process group already? Yeah, scenario one. Right click. It's empty now. Okay, right click, say delete. Do it again, but this time don't hit okay. Bring down a new process group. There you go, bring it down. And then to the right, you see the little upload. Okay, my bad. And then CSV to JSON demo data flow. Say open, add, and there you go. So you've got a head start already. I remember now, okay, thank you. You're very welcome. Well, I'm gonna go grab my food. If anybody has any questions or anything, I had this issue the last class, I can't type in chat sometimes, but I can see chat, so raise your hand or just blurt out my name or something, and we'll get you squared away. Right. I set the time, but like I said, I'll be back at my desk here in a few minutes if anybody needs anything. I'm back at my desk. Okay. This keyboard has all the keys to the mouse, it has all the keys to the keyboard, but, how? I mean, I'm not going to get it. So, hopefully everyone's making their way back from lunch. I'm in here if you have any questions. I might miss a little bit of the afternoon session. Okay, Richard had to drop. But if we can, let's just continue working on your flow.
Give us a little bit more time to check in. We should at least have our CSVs all as JSON by now. If you used that convert record, you may have to modify that schema, like Tom and I were going through, as well as Richard, before the lunch break. You're more than welcome to share notes or things within chat, I don't mind at all. So if someone has a good schema they want to copy and paste into chat, have at it; we can all share if that is your approach. But if you have any questions, just speak up and I'll answer. I'm going to start going through and seeing how we're doing. I do have a question on mine. Okay. I have it all set up here, but I have an error on the convert CSV to JSON. All right, let's look at it. It says that the record reader and record writer are invalid because the controller service is not enabled. And I did right click over here and click enable all controller services, and that didn't seem to do anything. Okay, let's go to your convert CSV to JSON and look at the properties. Go ahead and say configure, and let's look at your CSV reader. So click the arrow. Go ahead and, there you go. So they are invalid. When you hover over the exclamation, you can see you have a problem with your schema. How about the other one? They're probably invalid because of the schema. Is it popping up? There we go. Okay, so let's look at that first one, the schema registry demo CSV. Let's go to the configuration of that. And let's look at, well, you're using inventory. You're still using the inventory name? Oh, okay, yeah, I changed the schema over here. So I have to change the name on the left side as well? No, are you using the schema name inventory for this example? No, I changed it to weather. Okay, yeah, so there's no reference to that. So actually, if you want, copy your schema: just go into the box and hit control A and control C, and then say okay. And say disable and configure at the top right. Perfect. And go ahead and do plus, so we can add a new property. All right, and what did you name the schema? Weather. Weather, perfect. Say weather, and then say okay, and then paste your value, control V, or right click and say paste. That's not working for me. Control V, is it working? Hit cancel. Okay, let me try to copy it again. Yeah, there you go. And then go back to your value, click it. And it seems to let you. I keep having to double click on a lot of stuff. So the copy and pasting might not work the first time either. Okay, so yeah, we are dealt the cards we're dealt with latency and stuff like this. Okay, so I would delete inventory because you're not using it for this flow; there's a trash can right there, perfect. Say apply. And let's see. Invalid. It's still invalid because the schema is not a valid Avro schema. Can you paste your schema into the Teams chat, and I can take a look at it? I was having the same kind of problem. Is there anything wrong with leaving it as inventory, with the processor above that, before that? You can leave it as inventory, but the schema will change because the data has changed, right? Okay. But if you wanna name it inventory and add to it, I would still change the name and create a new schema if that's the process you wanna take. So let me look at this schema. Give me just a second, Peter. Let me see if I can. Okay. Okay, so type, name is weather. String, string, string.
We're missing something here. Give me just a second. Name, inventory, and the list. You have a comma after precipitation. So go back to your model, your schema. Okay. Configure it and look at that: do you have a comma after precipitation? Precipitation should be your last one. Yes, yes, I do. Yeah, so commas don't belong after the last field. There you go. Say okay. Okay. Say apply, there you go. And enable it, you can just say enable. All right, close that. Enable your CSV reader and your JSON record set writer. We've cleared that error, so you should be good to go. But yeah, in an Avro schema you don't put a comma on the last line there. Okay, that makes sense. Thank you. Yep, yep. All right, any other questions? Tom, you doing good? Yeah, that's what I was trying to do, the same thing, changing the schema to weather instead of inventory. But yeah, I think that helped me. I did the same thing, I still needed it. I mean, inventory was working, but I'd rather do it right. Yeah, yeah. I mean, it's there to copy off of, but try also to make it as unique as possible. But the beauty is, this is only gonna get you to JSON. We still need to extract the JSON and stuff like that. Gotcha, thank you. No worries. So, as we work through this scenario, instead of writing that file to the disk, you may want to then send it on to extract the JSON. The way I would handle this is, I would read that JSON back in, extract all the fields, save them as attributes, and then I can write that back out however I want. So there is a NiFi extract, let me get the exact name of the processor: it's evaluate JSON path. What that will allow you to do is bring that JSON in and extract the values you need. If you use the evaluate JSON path processor, make sure the destination is set to flow file attribute, and if you do that, you should be able to extract every value from that JSON document and have it as an attribute. And again, you have the help documentation there, and feel free to ask me as we go through and build this. Hey Josh, can you repeat that again? I think you said use something to extract all. Yeah, so if it was me, after everything's converted to JSON, I would send all the JSON documents to the evaluate JSON path, and in the property configuration for that, you wanna make sure your destination is changed from content to attribute, because that way you can extract all the values. If you do content, it's only gonna extract one, but if you set it to attribute, you can quickly plug in and extract all the values in the JSON tree. You may have two different evaluate JSON processors, because you're bringing in a CSV, converting it to JSON, and then you've got your original JSON. So you may have two evaluate JSON processors, and if you don't know how, just let me know, but in the evaluate JSON, here is, let me see if I can open this image. Oh, for instance, in the evaluate JSON, it wouldn't be $.id; it's gonna be $. and then the field, for this data, let me look. Inventory.
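Stepping back to the schema that was just fixed: a weather schema that follows the keep-everything-a-string advice and avoids the trailing-comma mistake would look roughly like this. The field names are an assumption based on the columns read out of the sample CSV, so match them to your own data.

  {
    "type": "record",
    "name": "weather",
    "fields": [
      { "name": "station_id", "type": "string" },
      { "name": "date", "type": "string" },
      { "name": "hour", "type": "string" },
      { "name": "temperature", "type": "string" },
      { "name": "humidity", "type": "string" },
      { "name": "wind_speed", "type": "string" },
      { "name": "precipitation", "type": "string" }
    ]
  }

Again, no comma after the last field, which is exactly what the validation error above was complaining about.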
So if I was doing this, for this one I would do evaluate JSON path and start adding properties. The first property I would add is station ID, and I would say $.stationID, and that should extract the ST003. And then this one I would have date, and for that, $.date, because it's looking at the JSON tree. You could actually go further: if this JSON document had embedded child nodes in the hierarchy of the tree, you could drill down even more. But because everything's top level, you can just create a new property; there'll be a plus right here. Here, let me just cheat and show you. So I would send all of my JSON to this processor. Well, I would send the CSV I converted to this one; if the JSON is different, I would send it to another one. Instead of flow file content, I'd do flow file attribute, and I would do, let's see, station ID, and then $.stationID, I think that's the name of the field, because it's gonna look in your JSON and look at that field. Station ID, date, hour, okay. So I would do the station ID, so it would extract the station ID. I would do $.date, hour, and so forth and so on. What that does is, the JSON coming in is read by this processor, and then it extracts those fields out of the JSON and saves them as attributes. So now I no longer need to worry about the file content, I just need to worry about the attributes, and I can do a lot with attributes. Let's see, what would I do next? Filter the attributes, flatten them. Maybe a filter. Let's see how I would do this. Might be able to do this with an update attribute. Might even be able to use a script. See, if I did update attribute... So everything coming out of the evaluate JSON path would be an attribute. Then if I did update attribute, I could, let me rethink. I'm thinking through this: after I have taken all the values and pushed them up as attributes, I now have a list of attributes per JSON document, and I just need to manipulate those attributes. Let me think, let me think on how I would do this the way we're doing it. Let me look up the attributes here. Let me see here. If you could take all the values in, where were you at when you were putting in those properties? For the JSONs actually. Yeah. Those. Yeah. Yeah, okay. Yeah. So, if you're sticking to the previous example and not coming up with your own way, by following that example I would skip setting the JSON file name and I would not write the file to the directory; I would actually take those out. I would go to evaluate JSON path, add that, and in the evaluate JSON path I would extract the values. Then from there, and that's what I'm working on now as the next step, I would write a regular expression to calculate those, but I'm trying to think of another way besides a regular expression. Okay. Because I just think it'd be a little easier; writing a regular expression might be the best way, and that's why I keep looking at the math functions in the expression language. Oh, gotcha. We usually use regex101 when we have to do regular expression type coding work. But regular expressions still blow my mind, so it's still hard. Oh yeah, especially when it gets complicated. It is.
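Pulled together, the evaluate JSON path setup being demonstrated looks roughly like this. The JSON path names are assumptions; they have to match whatever field names your own schema produced.

  EvaluateJsonPath (extract weather values)
    Destination   = flowfile-attribute
    station_id    = $.station_id        (assumed field name)
    date          = $.date
    hour          = $.hour
    temperature   = $.temperature
    humidity      = $.humidity
    wind_speed    = $.wind_speed
    precipitation = $.precipitation

Each user-added property becomes an attribute on the flow file, named after the property and holding whatever the JSON path matched.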
But the properties, so you have to add the properties manually, because I don't see, like, example properties. Yeah, you're gonna have zero properties, right? Okay, yeah. Right, so you're gonna need to add them, and this is an exercise of, here's the name of the property and here's the value. And the dollar dot station ID, because all these values are top level, it's gonna go look at station ID and extract that. That's what it should do. Let me double check to make sure. Okay. Thank you. Now let me add mine and test out my theory as well. All right, let me run this once. That's the only thing I was gonna ask you: is it normal for the set schema to only have one in the queue after you run it? Because when you do the get CSV, you get two in the queue, and then you set the schema and there's only one. At least for me there is. Well, you're picking up two CSVs, I think it was, right? Yeah, let me, I have to go back and look at it. Yeah, so you should get two JSON documents. Oh, okay. I'll take a look at that too in just a second. All right, so I've got mine; now I wanna see if my theory here works. Okay, I've matched. I'm gonna list my queue, and now my attributes. I should have... date, hour, empty, empty string. All right, let me find out what I'm doing wrong. Oh, well, of course it's not even extracting, because I am not picking those files up. Let me go through and modify my flow to get it to work. I have more data. Oh, perfect, perfect. So I'm gonna quickly walk through, and it sounds like most of you are building your flow. Yeah, I'm a keep it simple guy, so the simpler the better. That's how I would approach anything like this; I 100% would not try to get fancy right off the bat. I would be like, minimum requirements, okay, here you go, meet the minimum, and then go from there. Well, and I was actually expecting some of that, but it seems like everybody just took the previous example and worked from that, because you could actually have done this without any controller service, and it would have been a little bit easier. Oh yeah, but it would have been a lot more processors though. It would have been more processors, but less logic. Okay, oh, that's how I started building it out then. You saw all the processors I had on my canvas to start with; I had so many, because I was taking your document literally. I was like, okay, then do this, now extract this, you know what I mean? Each little piece, I was creating a processor for it. No, it was purely an example to get us started on the scenario, but you could have gone any direction. Okay, so I need to go... Then I would have had a canvas full of processors. It would have looked ugly probably too. Yes, but then you wouldn't have had to deal with Avro schemas, controller services, all of those. But either way, we'll get it going. And like I said, this is the hardest lift; tomorrow will be a lot easier for the most part. All right, disabled. Looks right. Okay. I am going to copy these. It doesn't matter. Looks good. Okay, so. Oh. Getting it, get the new one there. I want the data, bring up the CSVs, converting it. Hey Josh, I'm trying to do this, it's probably maybe not the right approach, but could you take a look at mine? I've got two get file from directory processors, one for getting the CSVs and one for getting the JSON, but the one for getting JSON files is not working.
Okay, let's look at this. Yeah, let's go to the... So when I run it, it doesn't really run. Okay, go to configure. Instead of dot JSON, put dot star; just change JSON to star. And that's everything. Yeah, see if that grabs files first, because we want to test our file pattern first. That's what I was just having an issue with. Go ahead and run once, and then refresh. Four, so yes. So that's what I was actually just setting up, and here's how I'm going to solve it; I was literally doing the same thing. So what I'm planning to do to solve this problem is, I'm gonna get everything, and I'm gonna apply a model to it no matter what. And then, I wasn't planning to build this flow with you all, so this is gonna help you, but I won't give it all away. I am going to get the file and set the, no, I'm actually not even gonna do that. I am going to get the file and I am going to route on attribute. Let me check to make sure. Route, route, route, where you at? Route on attribute. All right. I was trying to do that too, like send CSV to one branch and send JSON to the other. You got it, you got it. But it wasn't picking up JSON. Oh, no worries. Well, we're gonna pick everything up on this one. So I'm gonna change this name to get all files from a directory. Okay, and I wanna do route on attribute, and I am going to set the routing strategy to route to property name. Say okay. Then I'm going to add a property named CSV, say okay, and then I'm going to type dollar, curly brace, filename, ends with, CSV, and close that. Okay, so file name ends with CSV is gonna go one way. Then I'm gonna add another one that says JSON, saying if the file name ends in JSON. Okay, I'm gonna apply. Then I'm gonna take this, and I should have a CSV relationship, and then a JSON relationship. Since it's a JSON document, I'll send it here; I'll send JSON there. I don't need this success. You guys have put me on the spot, I've gotta build on demand. And then for unmatched, I'm just gonna log the message on unmatched. And okay. What am I missing here? My regular expression is not right somewhere. Oh, it's contains. Contains is valid. Apply. Oh, no, no, no. Okay, so let me empty my queue. All right, it's getting weather data, it's getting everything from weather data. Run this once. I should have a few files in my queue. We have two CSVs and the JSON. There we go. So it picked everything up. All right, it sent CSV where it needed to go, it sent JSON where it needed to go, and the last CSV went where it needed to go. Okay, so did you see how I accomplished it? Yeah, I did. I saw that, but it's still not picking up my files, I have no idea why. Let's look. Oh, wait, I had to change it. Oh, there you go. Yeah, instead of doing a file filter there, let's just route it; you can actually do a route with the file name ends-with as well. But yeah, this is one way to do it, right? So you can route on attribute. Okay, good. And then on your route on attribute, go ahead and configure it. Can I send unmatched to log errors? Yes, ma'am, that's perfect. Oh, and I think you're, yeah, go ahead and run once, and then run once again, because you've got three or four files, so you wanna make sure they separated. Okay, run it again because they're CSV. Oh, you have to run it twice to get both the CSVs to show up.
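As a sketch, the route on attribute setup built there comes out roughly like this. Whether you use endsWith or contains is a judgment call; the expression in the walkthrough only validated once it was switched to contains, so treat the exact functions as an assumption.

  RouteOnAttribute (split CSV and JSON)
    Routing Strategy = Route to Property name
    CSV  = ${filename:endsWith('csv')}      (or ${filename:contains('csv')})
    JSON = ${filename:endsWith('json')}
    unmatched relationship -> log message

Each property becomes a named relationship, so CSV files leave through the CSV relationship toward the convert record path and JSON files leave through the JSON relationship.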
You do because- Otherwise, it'll just say queued once. Okay, I see what you're saying. Yeah, because you run the get files and they picked everything up. But then you've got four files. Okay, looks like it worked for you. You go JSON one way, CSV another. Perfect. Thank you. You're welcome. Okay, I might as well finish building up my flow. Oh, my God. CSVs. Weather. Okay, CSVs. Is this okay? So if you're watching what I'm doing, I've got my flow set up where it's picking everything up. It's routing JSON one way, CSV another way. The CSV is getting routed to my convert record from the previous flow and then going back to my evaluate JSON path. The problem now that I'm running into is it extracted the values as an attribute, but it's only extracting one value from the JSON document. So I actually now need to split the JSON records. And that way I get all the attributes for each hour of the day. So if you're following along, that's where I'm at. And I'm about to split these records. So what I like to do is... Wow, I'm gonna split the JSON. God, this is gonna be a spider web. That needle went. Okay, if you were following along, here's how I did it. You know, I started bringing all the files up. I route on attribute the different file names. So if it's a JSON, it automatically goes to the split JSON processor. It's CSV, it goes to set the schema metadata. It converts CSV into JSON. And then from there, it will set the file name, which I don't really need to do, but I just left it in there. So it sets the file name and then pushes it to the split JSON as well. And then what I do is I split those records all up. So from one JSON document, I should, you know, I got 24 records, which 24 hours of the day. And if I look, it should just be one record. So from there, I will, I'm gonna evaluate the JSON path. Let me run this. And now everything should be an attribute. Yep. All right. So I was able to pick all the files up and then we're not a JSON. I made them JSON, took all the JSON, extracted each individual record, and then pushed the single record to the evaluate JSON path. I extracted all the values. And now I have all the values as an attribute. From there, then I need to put in some logic to, you know, what is an alert and then send the alert. So I'm almost done. Hopefully you all, you were able to follow along or do it your own way and go from there. But we'll give it a few more minutes. Then we'll take a quick break, our last final break of the day. We'll come back and try to finish up what we can. What we may do tomorrow is just kind of walk through, everybody walk through their flow and what they were thinking in the morning. And then it gives you extra time after class to touch up anything you wanna touch up. But if you have any questions, you know, just speak up. I'm here. Now I'm gonna figure out how I'm gonna design the rest of this flow. And I'm already 11 processors into this. Anyone stuck, have any questions? Where I'm at on my flow is I am working on calculating the values, utilizing processors that we have here. But if anyone else is stuck or having questions, please let me know. If not, we'll take our final break of the day. We'll just take a few minutes and just take a quick bio break and grab something to drink and then try to finish up as much as we can today so we can start off tomorrow going through these flows and finish up with some registry and other topics. So does anybody have any questions right now? I did, I was pretty lost and I was just learning where to start at. 
I don't really think I can get the final reading right. Okay, let's take a look at it. Again, if anybody gets stopped, a hurdle, blocked, anything, please speak up immediately because I wanna get you over that hurdle so we can get these complete. But let's look at yours. Okay, so on this one, you can do a file filter. Did you look at the regex expression? Or the unknown? For the file extension, right? So if you wanna do, go ahead and get, go back to your other one because I figured this out earlier. No, go back to where you just were. All right, and so you wanna do CSV. So modify that, take out CSV, believe the rest, but take out CSV. And do .-backslash.csv and say okay. Say apply, see if that works. Run once. I think I put it in the right folder. Yeah, that's the same. I don't think you need that backslash before .csv, do you? Oh, I see, take that backslash out. You may not need that, take it out. I'm just matching the other. Go ahead and say okay and see if that works. Run that once. I think you may take one you thought of. I think it might be star.csv after the bracket, I think. Yeah, go back one, there you go, refresh. Yeah, hang on. Let me look at mine. Well, let me work on my local version here and fix it. What would happen? Can you go back to your rejects there? Rejects there. And let's do take, open that and do a asterisk. Oh, you want me to start? Yeah, start fresh, start fresh. So do an asterisk, frontslash, the other one. Backslash, I mean. .csv, all right, see if that'll. Trying to remember the apply, the file filter. Go ahead and try running that once. No, it's already telling you, it's already telling you it's wrong. Oh, it's a chat, what it worked for me. Oh, you rock. Oh, there you go. Well, that's what I had a minute ago was the slash dot. He had a asterisk instead of the up carry. Oh, oh, I see, I see, I see. I didn't see that part. Okay, that worked, awesome. And then when Ekta and I was doing over it, here's how I approached it. So, come on. So when Ekta asked the question, what I did is I was getting all files and sending everything, and then I did a route on attribute. And I said, you know, the file name contains CSV or the file name contains JSON. And then if it's JSON, it goes here, if it's CSV, it goes here. But you should be good to go. Yeah, I like the way you're doing it now. I prefer to do that way, but I got lost in the sauce too when I tried to do it on my own. So I was kind of just, I don't want to say giving up, but. Well, let's not give up. Let's see where you're at. No, I mean, I just think, you know, like, some of these flows require extensive knowledge and your experience with the platform. If you're not there, you're just gonna struggle, honestly. Yeah, well, this scenario, you're gonna struggle. And that's why I keep asking, like, please, if you get stuck, ask me a question, because I want to help you through. Because once you get through this, creating other flows would be a lot easier. So you are routing on attribute, right? Yeah, I like that approach. After you were talking with active, I'm like, I. Yeah, it's a better approach, because if you use a file, that way you pick everything up to begin with, right? And you don't have to have another Git file just to pick up CSV, another Git file for JSON. You just pick everything up at once and start sorting and filtering. Okay, so you've got the route. I actually started, yeah, that's how I started. And then I was kinda backtracking. Was trying to keep it simple using what we had already done. 
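The back and forth above is about GetFile's File Filter property, which is a regular expression matched against the whole filename, so a bare csv matches nothing and the dot needs escaping if you only want .csv files. Assuming the goal is any name ending in .csv, a quick way to sanity-check candidate patterns (Python regex here, which behaves close enough to the Java regex GetFile uses; the default-like pattern is from memory):

    import re

    # Candidate File Filter patterns from the exchange above.
    patterns = {
        "any .csv":    r".*\.csv",   # any characters, then a literal ".csv"
        "default-ish": r"[^\.].*",   # roughly GetFile's default: anything not starting with a dot
        "just 'csv'":  r"csv",       # matched against the whole name, so this matches nothing
    }

    names = ["station_001.csv", ".hidden.csv", "weather.json"]
    for label, pattern in patterns.items():
        hits = [n for n in names if re.fullmatch(pattern, n)]
        print(f"{label:12} -> {hits}")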
And then you maybe changed my mind again when the actor was in your work. And I was like, no worries. No worries, okay, so you got routing attributes. So are your CSVs and Jsons going where they need to go? Okay, and then. I can clear the queue if you want. And then your CSVs are basically gonna follow the same example as you uploaded in the previous, right? So. If we're gonna go here and then down to here. Yep. And you're gonna. Still converting everything to JSON, right? We're still converting CVS files to JSON files, right? Correct. So. Format, whatever. Yeah, so you can get to a common format, right? So. And that's, you know. And so you are routing an attribute, the JSON. You're sending it to the JSON. You're sending CSV. You're gonna update the attribute. You're gonna assign it a schema. Did you use Peter's schema? Well, yeah, for the most part, yeah. Let me look at it real quick. Yeah, well, you'll have to go into the convert record. Oh, yeah, this one, yeah, sorry. No worries. And so let's look at your CSV reader and writer. It's enabled, but go ahead and go to that one there. And you have the weather. Is that the same as Peter's? Basically. Okay. Okay. Okay, say okay. That's a little bit off center, but I actually still work. All right, and you did not put the final. Comma. No, you're fine. That's it. Okay. Say okay. Okay. And then exit out of that. It looks like all that's enabled and running. So where are you putting your success once you. Because you. Well, I had something here and then I got rid of it. So I need to still link this to something. Yeah, so you wanna probably. You can do a split before you evaluate your JSON. If you won't. So that's what I did is. Is. I went to the split JSON. And I just split on every record. So it's dollar dot star. So that splits on every record. And so, yep. And then split JSON, configure it. And then the JSON expression was dollar dot star. After. Okay, that's. That's okay. Astrid. Ready to go. Okay. Say apply. All right. And then you've got your splits. Hover over your split JSON. What are we missing? We have success. We have split. Failure. I gotta send to the log. There's logs right there in the middle. Just send it to it right quick. There you go. And then you can actually just right click on log error message. Copy and then paste right beside it. And that's where you wanna send your original. For now. So just right click, say paste. Right there. There you go. And then drag and drop. So you get your original. You walking me through it. Makes it seem like it was a lot easier. Well. I get it though. Again, right? That's why I keep warning. This one's a hard one. But we've got to learn controller services. We've gotta learn some of the concepts. And this brings in multiple concepts. It brings in controller services. It brings in regex expressions. It brings route on attribute and attribute manipulation. This scenario is set up to test all the parts. But again, some of these parts, you're just being exposed to. And so that's where you, hey Josh, just a second. Where do I do this? And that's what we're doing now. So say apply. But I think also, but I also think knowing what you put in for some of these values, like it's also, you know what I mean? It's like, it might be in the documentation. I didn't really, the documentation's kind of not overly, you know, doesn't really go deep dive. It's kind of just like an overview. Okay, here are the properties. You know what I mean? So you gotta know what to put in a lot of these values. 
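On the SplitJson configuration mentioned above: a JsonPath Expression of $.* takes one incoming JSON document and emits one flow file per element it finds at that path, so a 24-hour weather file becomes 24 single-record flow files. A rough equivalent in plain Python, with a tiny assumed document:

    import json

    # One incoming document holding several hourly records (assumed shape).
    document = json.loads(
        '[{"hour": 0, "temperature": 58.1}, {"hour": 1, "temperature": 57.4},'
        ' {"hour": 2, "temperature": 56.9}]'
    )

    # SplitJson with JsonPath Expression "$.*": emit one flow file per element.
    splits = [json.dumps(rec) for rec in document]

    print(len(splits), "splits")   # 3 here; 24 for a full day of hourly records
    for s in splits:
        print(s)                   # each split is a standalone single-record JSON document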
And if you don't know, you don't know. So you're just like, then you have to Google or chatbot it, whatever, you know. Yeah. And that's what I, that's why I usually just have like the red, I think I closed the window, but I had the regex expression, expression language guide. I always keep that up as well. And then that way, if I look at like file name or something else like that, it'll let me know. So anyways, you are good there. Let's look at your others. Okay, so now you're evaluating JSON path. So let's configure it. That looks correct. No, the destination, you want flow file attribute. Because if you do flow file content, it will only extract date. Oh, okay. Yep. And that's what I mentioned earlier, that that's a common hit. Oh, that's what I missed. Okay, because we added these, so okay, that's insane. Yep, so apply. Now you need to finish, you know, if it's unmatched, I'd send it over there to that original. If it's an error, I'd send it over to your left and, you know, go from there. I'd get rid of that right JSON file to directory and get rid of that failure point, you know, those types of things. And that's where I'm at right now is I am thinking of how I want to easily, now I can use regex and pull all these attributes in and I can actually do a divide, a math function, but the math function is actually for the more advanced NIFI users. So I don't want to do that. So I'm trying to think of an easy way to do it. Where are you sending your original to for the evaluate? You won't. Today you're going to the log. Okay. So after the split JSON, I'm sending the splits to evaluate and I'm sending the original to a log error message. Okay, gotcha. And then the evaluate JSON path is going to log and then, okay, then you're going, match is going to that attribute, pull up the window, whatever that is. Yeah, so I'm working. So that's a placeholder so I can test to make sure that what I'm expecting comes out. And now, as you know, the beauty of this is like, if I want to do, you know, some other thing, right? Can I do something else for match? Like what would make more sense for match instead of the rolling window? That's what I'm trying to think right now is, how do I want to do this utilizing the processors I have that doesn't require regex for me to write Python? Because, you know, I realize that we have a mix of technical talent on the call and there's no ding against anybody, right? There's technical folks and there's managers, there's different types. And so what I'm trying to do is, what is the simplistic version I can make? And so that's where I'm at right now. So after I get to helping you all, I'm gonna think through how can I generate an alert based upon this data and what alert would I generate and how would that look? Okay. A lot of the work I do is to keep the lights on type of work, you know what I mean? So, but we're trying to get more agile and DevOps-y, you know, that kind of thing. But here's the beauty, right? This is designed to, you know, start you off nice and easy and then make it really difficult. And I think I've accomplished that, but, you know, where, but again, I'm very, very open. So, you know, stop me say, hey, I am stuck here. If you are working on a processor and it's taking you longer than, you know, a few minutes, you may just wanna say, hey, Josh, right? I'm trying to do this. This is what I'm thinking. Because again, what I'm looking for is not a complete flow. What I am looking for is that story. Here's what I'm thinking about this data flow. 
Here's what I'm going to do, right? And you're building a full-fledged data flow in less than a day. So, you know, I'm not expecting a complete flow, but what I do wanna hear is like, here's my story. Here's what I plan to do. And here's how far I got. Because that lets me know that you're going down the right path or not. And, you know, going from there. And I like seeing how some folks are filtering right when they pick the file up. Some are filtering as soon as they got the file. You know, there's many ways to, again, there's two different ways to pick files up. So, you know, I like to just see how some people, you know, think about the problem. And that's the biggest thing with workflow-based programming, right? Is just how you're gonna think through that. But I think, okay, so you should be good to go. You should have attributes now. So if you look, right here. So I have, for instance, all of my attributes. So everything is stored in memory. And I don't need, like, so I no longer need the CSV files or the JSON files. Here's our one, humidity 88. Let me pick another random. Here's our four, humidity 71, right? I no longer need all of these files. I could write every one of these back as a CSV real easily. And then, you know, do some quick Excel math and call it a day, right? So, yeah, okay. So I think, are you in a better state now? Yeah, I think so. I mean, I'm still gonna struggle with this the way I imagine, but that's, yeah, no. No, and please stop me. I mean, it- No, it's okay. I think it helps just talking it through and listening to you explain it better or more, whatever. Yeah, and again, if I need to explain more, I do not mind at all. I have six kids, right? I'm used to explaining. All right, cool, thanks Josh. Yep, yep. Any other questions? Hey, Josh, I'm having trouble with the split Json as well. I was trying to use the merge content and I tried the merge record, but those weren't working out for me. So I saw that you were showing someone else to use the split Json, so I tried to copy that and then I wasn't sure what- Okay. This error is facing path expression. Is it valid memes? Let's look at the property. So you don't have any, so you see that Json path expression is bolded? Yes. So it's a required field. But all you need to put in is dollar dot star, because it's gonna split on every record. So dollar dot star, just like I have mine. Dollar dot star, okay. And so that's going to split every record. So each record is gonna have temperature, humidity, wind speed, precipitation, hour, date, and station ID, right? So it's going to split those records. So what's it complaining about right now? Okay, now it's just complaining that it's not connected to anything. Oh, there you go. Well, start. Wire it up. It's taking the, it's reading the JSON file over here, and that's feeding it over. The CSV files are up here, going down, converting, and making their way down over here. Okay. So those are all working. So this split JSON can take that from both of those directions, the JSON files directly, and the converted CSV files? Correct. Okay, and then this next, put this off over to the side for now. This evaluate JSON path is the next one. Correct. And all these are top level records. So it's dollar dot the ID, right? So dollar dot station ID, dollar dot date, dollar dot hour, and that's going to return hour. That's going to return temperature, those types of things. Okay. Yeah, let me show you, Oman. See. 
So all I did is I went in, yeah, dollar dot date, because that's going to tell, all that's doing is telling this processor to look for dollar dot date, which is a top level record. If the date was embedded under station ID, you would have dollar dot station ID dot date. And then it would pull that, depending on the JSON tree. But everything is top level. This is a simple JSON. So you don't have to worry about arrays and embedded anything, those types of things. Okay. And so once you have that, then you're going to have everything as an attribute. So everything will be in memory. It's no longer a flow file. So you can actually dump all your flow files, and everything's going to be in memory as an attribute. After the evaluate JSON path. Where's your original relationship going, Josh, from that evaluate JSON path? That going to log? Yes, it is going right here to the log error message. I think. Okay. Hang on, let me check my lines. I got a spider web going here. I think unmatched is going to the log. Oh, failure goes there. Matched goes to where I need to be. Unmatched goes to log. And so the original is actually split. So I'm sending my split JSON to the log error message, and then auto terminating. Because I don't care about the original anymore, right? Like I've already got the original. I've already either converted it or extracted all the values. I no longer, in this scenario, I no longer care about the original document. I now have those attributes that I can do whatever I want with now. Okay. So it's still a flow file going through, but it's attributes. And once you have attributes in memory, then manipulating those attributes is relatively easy. Because now, and that's what I'm looking at for mine. We'll just use another random one here. I have a date attribute, an hour attribute, humidity attribute, precipitation, the value, station ID. So I could turn around and take all of this data, send it through, write it as a CSV file even, and then export it as Excel, and then have some Excel templates. So when I open it up, it automatically computes even. So that's one, but I'm trying to think of a way to do this very simply through a processor, instead of having to write code or anything else. Because normally I would just write some Python or something like that to handle it. But we should all be at least to hear very soon to where you have data that's all in a common format that you can access. We should be at that point, hopefully by now. Peter, you ready? You get to go, Peter. I just said my session expired, so I had to re-log in. I just got back in right now. Oh, okay. Yeah, so after that split, you wanna evaluate JSON path, and you can send both failure and original, make a shortcut, right? Do failure and original, send it to the error message, send your splits to the JSON, right? You know, make it easy for yourself. All right. I'll be right back. I mean, we're already over the break. We went through break. So give me just a minute. I need to go grab something to drink. All this talking, my throat starts getting rough. I'll be right back in five minutes. Did you split JSON? Split each entry, has a different file for you guys? I think that was hosted, yeah. Because after going through split JSON, I had 55 queued, and that does not sound right. That might be. It's like multiple CSV files. Or it might be each file is a single attribute. He had a ton too, and he had like 24 of them. It's each record. It's split into multiple lines, so that's what you're seeing. Okay. Okay. Who's stuck? 
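To make the top-level versus nested point concrete: with flat records, $.date is a single key lookup, and only a nested layout would call for something like $.stationID.date. A small illustration with both shapes assumed for contrast:

    import json

    flat = json.loads('{"stationID": "ST003", "date": "2024-05-06", "hour": 4}')
    nested = json.loads('{"stationID": {"id": "ST003", "date": "2024-05-06"}, "hour": 4}')

    # "$.date" on the flat record is a single top-level lookup.
    print(flat["date"])                 # 2024-05-06

    # If date lived under stationID, the path would be "$.stationID.date" instead.
    print(nested["stationID"]["date"])  # 2024-05-06

    # "$.date" against the nested shape finds nothing at the top level, which is
    # the empty-attribute symptom that came up earlier in the class.
    print(nested.get("date"))           # None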
I think I configured my services wrong. Okay. Do I still need the service if I use the route on attribute? Let's take a look, see what you got. So you are getting the weather data. You're doing a route on attribute. You're sending JSON one way, CSV another. You're sending the CSV to the schema name, because you're going to convert it to JSON. Success. And so you're going to convert that record to JSON. What is the error you're getting with the convert record? I think it's one of my services. Yeah, your services are not set. So if you want to take all this in and send it as, make it a JSON, yeah, let's look at your services again. Do I just click on it? Yeah, it don't matter. You can go. And you have a fifth. Okay, so go to the left and hover over the little yield sign. What is, did you copy the schema that Peter put out in chat? I believe so. Did you remove the comma that Peter had an issue with? And I think Tom as well. Let's go to the Avro schema registry. Let's look at your weather schema. Let's go on. Yep, right there at the end. That one right there. And you might want to take out those extra spaces. So do a delete or backspace. Get rid of that comma. Nope, get rid of that, there you go. You can expand that window so we can see more. There you go. And then scroll up and get rid of the extra lines on every line. So let's do a delete here. Yep, perfect. Delete, there you go. Keep going down, there you go. Say okay. Do that three more times. And apply. No, it needs a comma after the others. Enable that, all the way to the right is the lightning bolt to enable. Say that's fine, service only. Close it, it looks good. All right, so let's hover over our CSV reader. Let's go to the gear on it. Use schema name property. Schema.name. Oh, on the schema registry, click that and say Avro Schema Registry. Say okay. Say apply. All right, enable it. Okay. JSON record set writer. Let's do the same. You probably, you know. Yep, there you go. Say okay and apply. Enable. Say enable, there you go. Close, go back to your processor and what's the issue now? No, no, no, we already looked at that one. Say cancel. Go to the yield. The relationship failure is not connected to anything. So you need to find another, you need to, where do you plan to send all failures, right? I'd probably just send it to that log error message on your left and send your failure there. Add, there you go. Refresh and success. You're going with the extract text? I wasn't really sure. I gotta read the whole topic again. Most everyone is going with the, going into a split JSON to bust that JSON file up into individual records. And so you can, even though it's in the queue, you should be able to drag and drop it over there. Use the front of the arrow. There you go. Take the blue little box, drag it over to split, there you go. You'll have to do the rest of the terminations, like here's where you send the splits, here's where you, no, from the split JSON. There you go. So send that where you, what are you gonna do with the, what are you gonna do with it when you, what's the thought here? What does the split JSON do again? So you're splitting the JSON records into a single record. So if you look at the queue, look at your queue, look at that one file, right click and say, cancel. Right click, list queue, and go to the little eye over there. Actually, you can go over to the right eye, yeah, that one. Click that to view the content. And pretty print's not on, but you can already tell that you've got, also the header is not set properly.
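The fix being walked through above is the trailing comma and the stray blank lines in the schema text pasted into the AvroSchemaRegistry; the schema body is JSON, so a comma after the last field makes it invalid. The schema below is only a guess at what the class weather schema looks like, based on the fields discussed; the point is simply that it parses cleanly once the trailing comma is gone:

    import json

    # A guess at the "weather" Avro schema pasted into the AvroSchemaRegistry;
    # field names are taken from the spoken description only.
    weather_schema = """
    {
      "type": "record",
      "name": "weather",
      "fields": [
        {"name": "stationID",     "type": "string"},
        {"name": "date",          "type": "string"},
        {"name": "hour",          "type": "int"},
        {"name": "temperature",   "type": "double"},
        {"name": "humidity",      "type": "double"},
        {"name": "windSpeed",     "type": "double"},
        {"name": "precipitation", "type": "double"}
      ]
    }
    """

    # A trailing comma after the last field (the mistake several people hit) makes
    # this invalid JSON, so the registry rejects the schema; without it, it parses.
    fields = [f["name"] for f in json.loads(weather_schema)["fields"]]
    print(fields)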
Click instead of original, say formatted, click the view as original, see if it says formatted. There you go. So it looks like the first record is just the station ID, which means, which that tells me that in your schema, you did not set the treat first line as the header file. So we need to fix that. But you now have what? Showing one, two, three, four, five, six, seven, eight records showing right now. If you scroll down, you're gonna have probably 24 records. So what the split JSON does is it's gonna split all of these records into its individual file. So you'll have just the station ID, the date, the hour, the temperature, humidity, wind speed, and precipitation as one file, not 24 records as one file. Okay, so each one of these will be one file? Correct. Correct. Okay. Great. And then like I said, you wanna go back, exit out of that, that works now. You wanna go back to your controller service and make sure that you're treating the header as true, telling it it does have a header, because it looks like you're ingesting the header as a record as well. So if you go to CSV reader, you can go to the proper gear icon on it. Scroll down. First line is header. You've got it disabled and configured. Perfect. Good job. Okay, so that should take care of the header information being a record. And then now you're going to, you can split the JSON. So if you were to clear your queue and run that again, it should show up properly. The split JSON, you're gonna need, go back into your split JSON. Since we already have it open, you have no value set for JSON path expression. So it is required. So you would just wanna do $.Asterisk. There you go, say okay. Apply. There you go. And now you'll need to finish with its connections. You don't have a relationship, sorry. You don't have a relationship for some of these others that you'll need to put in. From there, if you're following exactly what I'm doing, you will send it to a evaluate JSON path processor. And then from there, you can fill in. So here, I'll pull mine up again. If you can, look at mine. I send it to it in evaluate JSON processor. Make sure you have full file attribute. And then I just extract every value out of that single record. What happens is, and it's just $.date, $.hour. There's no nested or embedded records. And so if I look at my records, here is the actual record. But because I did the evaluate JSON, I now have all of these as attribute values. So the wind speed, temperature, station ID, humidity, the date. So this is the 13th hour of 2024, 506. And then now that I have that, you can take attributes and save them all if you want and combine them all into one document. I'm working on how I would do this in using the processors that are available. But go ahead and if you can, I mean, you're probably gonna get stuck on that point, but if you can, just go and start getting your flow cleaned up, named properly, labeled properly, those types of things. Make sure your relationships are there to do the evaluate JSON. And let's at least get to where we have all the same data in all the same format, because we're gonna need to have the data in the same format. We're gonna need to have it all together before we can make any kind of calculation or look for any alert or anything else. Okay, got it. Perfect, anybody else? Well, who else? If you wouldn't mind taking a look at my queue, I don't see the attributes like you do. I think maybe I'm missing a processor along the way. Okay, no worries. I know we're running short on time though. 
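The symptom diagnosed here, a record that contains only the station ID header text, comes from the CSVReader not treating the first line as a header, so the header row gets converted as if it were data. In plain Python terms it is the difference between csv.reader and csv.DictReader over an assumed file:

    import csv, io

    data = io.StringIO(
        "stationID,date,hour,temperature\n"
        "ST001,2024-05-06,0,58.1\n"
        "ST001,2024-05-06,1,57.4\n"
    )

    # Header NOT treated as a header: the first "record" is just the column names,
    # which is the station-ID-only record seen above.
    rows = list(csv.reader(data))
    print(rows[0])       # ['stationID', 'date', 'hour', 'temperature']

    # Treat First Line as Header = true: column names label each record instead.
    data.seek(0)
    records = list(csv.DictReader(data))
    print(records[0])    # {'stationID': 'ST001', 'date': '2024-05-06', 'hour': '0', ...}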
No, no, you're fine. The second day, we usually go over on the second day. And I need to go grocery shopping and that's it. So I'm good. I mean, after my split, I have both JSON and CSV, so I think- Well, that's fine. Oh, yeah, okay. Because it's just a file size. Let's look at your attributes. So scroll down. Okay, I don't see, hit okay. And that's what I was looking for. I didn't see what you see. Yeah, no worries. Hit X on that. X it, let me look at your flow. Okay, we are splitting the JSON. It's split. Did you do the- Oh, the split is only going to give you a record. So it's going to still give you a JSON document, but it's just one record per document. So 54 sounds about right. Then you're going to do the evaluate JSON path. Let's look at your configuration on that. I think the last time we looked at it, you had it, yeah, you have $.date. Okay, so say cancel or apply. And run it once. When I ran it, it went to, you had a failure, I believe. Let's look at that failure real quick. Go to that failure, right click on that failure and list the queue, just like you would any other queue. Oh, gotcha. Oh, each one of these has its own queue. Yeah, they're all separated by their own queue. Yeah, go ahead and click that. You are just pulling ST003. Huh, why in the hell would you be doing that? Let's go ahead and hit X. And let's actually clear that because it might be a header if you, we'll see in a minute. Just go ahead and empty the queue. Do another one once. All right, what's the error on the box? On the processor, on the right of the processor, you've got now a red box. Oh, yeah. Flow file did not have a valid JSON content. Okay, so we're converting it to JSON. Empty the split queue. Right above it. Yeah, empty that queue. Maybe I didn't run once in enough times. Let's look at your split JSON. What do you have? Figure. $$star, okay, apply. And then your convert record is going to success. Can you run once to get us to split JSON on this? Well, you'll need a file to provide that one, so. You'll have to get CSV from directory. Now refresh that. And this, it should run. So just say route on attribute. Say yes, go ahead. Run once. Actually, you're gonna need to run it three times because you have three files, two CSVs in the JSON. So we know that that's working. So run it one more time. And it should be a second CSV. Perfect. All right. So look at your split JSON. Because you already have a JSON document. You're ready to go over there. Okay. Let's run that once. Did I clear this? Yeah, go ahead and clear it. You don't need to, but yeah, let's do it. As a matter of fact, let's not run that split JSON once. Just turn it on and start it. Refresh. All right, and now stop it. Let's look at your split, the split queue. The, blow it. There you go. You have seven files now. All right, yeah, let's view that. Say okay. Click the eyeball. I wanna view the content. Is it still, okay. Something's going on here. All right, exit out of that. Dollar dot star. Empty stream. And it's just getting the JSON. Run that get CSV from directory again. Until you, just run it. You'll have to run it once to get CSV. Run once. And then the route on attribute, you're gonna have to run it twice to get the JSON. We know the first one's a CSV. The second one is. Oh, you can't do. Click on the processor. Refresh first. No, don't click on the red app. Don't, there you go. Oh, okay. And then refresh. We should have a JSON document now. Let's look at that JSON document before it goes in. All right. 
I wanna double check to make sure. Good idea. Hit your eyeball. All right. Okay, so that is a, I see 24 JSON records. Okay. That's two blocks. Let's exit out of that. Exit out of this. All right, the split JSON. Open up and configure it. Okay. JSON path. Dollar dot star. Empty 20. Okay. Hit cancel. Why is that me? You see the, take your split JSON and that processor. Can you kind of drag it up a little bit and align it to the right of route attribute? There you go. All right, all right. So there is where you're, let's empty the queue of the, between route attribute and split JSON, let's empty the queue of that JSON, that one document you have. Yeah, go ahead and empty it. Perfect. And let's say, okay. And then right click and say delete on that connection. That split JSON, that one, yes, delete it. All right, on your route on attribute, drag it over to split JSON. And make sure you have just JSON selected. Say add. Okay. So you're getting the files. You're routing on attribute. What's that route? Can we go look at your route on attribute? Hopefully it matches the one I have. File name, contain, seems to be. Okay, say cancel. Empty the queue from split JSON to evaluate JSON path. You have a split and original queue. Empty both of those. Okay. Yep, empty your original. Okay. You get CSV from, get CSV to route, empty that queue as well. This is really weird because it does match mine and yours is, there's something, there's something missing. So let's run, get CSVs from directory. You're just picking everything up, right? Okay, run that one. Refresh. Three files, perfect. That's what you're supposed to have. All right, run that three times and we're gonna get two CSVs in the JSON. First one CSV. Yep, right again, the second one's JSON. There you go. And run it one more time just so we clear that queue. There we go. And that should go to CSV, perfect. All right, so we have a JSON document. Let's look at that JSON document in the queue to double verify that it's actually in this queue. All right, go to your eyeball. Your eyeball. I don't see any issue with this data. All right, exit out of that. Let's hit configure before we run it once. And that's a dollar dot star, dollar period, right? All right, let me look at mine. Double, triple, sure. Okay, okay, it matches mine to a T. Say okay, say apply. And then run split JSON. Refresh. No, I can already tell you it's not right. It's one to go into the log and then seven or. Yeah, the splits. Should be going to the evaluate JSON path they are. But the splits are much bigger than 29 bytes. Okay, our problem is coming out of that split JSON. So let's empty the split queue. It's empty. Okay, all right, hang on one just one second here. Let me see. Values going there. Success from convert CSV is going in. Let's bring down another split JSON processor and put it right beside the split. There you go, right there, it's fine. All right, and then let's take our connections and the route and attribute where you have JSON. Go ahead and move that over to your new processor. Perfect. And let's take the convert CSV to JSON line and move it over as well. That one, there we go. Click that, bring it over there. Perfect. And empty your original queue. Right beside that split JSON is that original connection. There you go, empty it. And then delete it. All right, and then just delete. And let's delete. Let's delete that split. No, not that one, the connection. The connection you have right there, right beside your mouse, right where you had it. 
Correct, yep, yep, delete it. Okay. And okay, let's delete the split JSON to your log error message. You got it, delete it. Okay, go ahead and delete that split JSON that you have up top. Get rid of that. Let's look at your configuration of this one. All right, let's set our value. What was it, dollar, dot, star, asterisk, yeah, there we go. Okay, apply. All right, and let's drag down our first relationship. And let's go to evaluate JSON path for the split. And then let's bring down the original end failure to the log error. You can do original end failure. Say add. Yeah, and if you see your connection, it says now failure, original. I like that, okay, that's very cool. You don't have to do three different ones. And I went through and deleted all my labels just because I need to clean mine up as well. So, all right, let's run it. And let's see, so get CSVs, we're gonna run once. As a matter of fact, just turn on route on attribute, start up radio. And then run once on your get CSVs. All right, and so now your count should be eight and one. Seven and one, sorry, math is off today. Okay, so before we say split JSON, let's take a look at that JSON document one more time. Because there is something funky going on here. Let's pick you, you're up. Okay, that looks beautiful. It does say it's an application, it read the content type. So a lot of these processors have a built-in mine type. Go ahead and exit, yeah, and then go ahead and close this. All right. Run once, actually just turn it on. Just say start, refresh. This is weird because it's exactly like mine. Exactly, exactly like mine. It is dark. Tom, give me just a second, I'm gonna do this. I'm wanting to, I'm gonna take over. Okay, that's cool, because I gotta run to the bathroom real quick. I'll be right back. Do your thing. So I'm having the same exact issue as Tom. And I've been following you and it's not correcting. Are you, oh my Lord. All right, let me, like it's really weird because mine runs, so let me double check mine. All right, I picked it up. All right, let me see here, refresh. All right. Wait a minute, mine's erroring out now. Even mine is now erroring out and it was working just fine. All right. Well, luckily Tom, I don't think it's you. I think it's me. Oh yeah, why'd you say that? Because I'm running mine as well. And now I'm getting the same results where previously I was getting split document. So let me, I'm gonna work on Tom's and get this figured out. Ekta, you wanna take a quick break? You know, have at it. I'm gonna take a quick break too and I'll be right back. Yeah, yeah, even mine, I'm getting wrong again and mine was working just fine. Yep, weird. This is weird, something. All right, me. That worked. And the splits are there. Should be an individual record, yep. Okay, okay. So that worked. And it did it's splits. No, no, no, the CSV, when a CSV is converted to JSON and then the JSON is sent to the split JSON, it works just fine. But when it's just a raw JSON document being picked up, it's not. And so why it, and yet that... So I'm gonna take a look at this view. It's really weird, the time is now yesterday. All right, me. What is the difference in this data now? Let's see what... That's wrong. No, I think we're... It's definitely different. What is that? Good Lord, the spiral web of flows. That's the other challenging thing with this, man. Where you start. Having a lot on my canvas. Yeah, and you know, I am very forthcoming when it comes to this. 
And you know, I try to explain like, you know, this, it can quickly turn into a spiral web that goes everywhere. And... I can see where it'd be easy to lose track of where you're at and what's going where. Yeah, and like, I try to kind of clean up along the way. Let me delete this. And we, you know, we have you know, we are putting in a lot of extra logging and stuff for now. And we're doing it, you know, we're doing best practices, right? Best practice is, you know, add these log messages wherever you can and, you know, go from there. But, you know, if we, once we get ready to clean this up, then, you know, then we would you know, take out some of these, clean it up. But if I don't clean it up now a little bit, I am not gonna be able to see. Okay. So I have what's coming out of the CSV to JSON, going to this. We know it works, and we know it works well. Then I have this split JSON. With another type of expression. That's what I was gonna say. I am not setting the JSON file in our name. That's the one thing I'm not doing that you are doing. Yeah, I don't even need that step, to be honest. I just, you know, I'm just, it was part of the old file, the flow. And so, since we're all working off of the old flow, I decided to just leave it in there. I get it. But if I were to sit down and do a full, why are you not, did you run once? Turn that off. Run once. Should be two and one. One. All right, that ran, that ran, that ran. Perfect two. All right. And I know that's working. What is going on here? All right, I think I found the problem. One second, I'm gonna fix this. So when we are taking our CSV and we are making a JSON document, we are doing it absolutely perfect. And so the JSON that's coming out is formatted perfect. It's formatted beautifully. So this JSON, the weather, I don't know what happened with that, it's still a valid JSON document. It's just not formatted about a JSON document because it's just a bunch of JSON objects, not a JSON array. But give me just a second and I will fix that JSON. Because I was like, wait a minute, what the, like what is going on? But yeah, I have a fix. I have a fix. Wait, we can do this. Okay, so there is the split JSON. So, okay. So besides, okay, so if you are around on attribute and if you want to quickly get around this hurdle, if you're doing around on attribute, take your JSON and just move it to a log message for now like the you as well. And then that way it will continue doing your splits on the CSVs and you can go from there. I'm gonna, in just a second, I'm gonna provide you with an updated JSON document that is better formatted. So give me just a minute. But Thomas, you can as well if you want, you can just take this and we will drag it to here. If the latency will let us, we will empty the queue. All right. Okay, so your flow should work now and you shouldn't have any issues. You're not going to handle the original JSON document, but about to replace it with a new document. And I'll just replace it on your desktop. Let me see where, I'll see where. Server administration. Yeah, I created a folder. Oh, nice, nice, nice. Instead of working out as a downloads or upload, whatever it's, server administration, actually what we're used to in our environment. So it's like, let me just mimic. Yeah, yeah, just mimic what you, because whatever flows you set up for yourself, right, it's gonna be very closely related. 
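What broke here, as the instructor works out, is that the sample weather JSON is a series of JSON objects written back to back rather than one JSON array, so it is not a single valid document for SplitJson to walk with $.*. Splitting by lines first (the SplitText idea mentioned shortly after this) yields one valid object per flow file. A sketch of both behaviors, with a tiny assumed stand-in for the file:

    import json

    # Back-to-back JSON objects, one per line - not a single valid JSON document
    # (a tiny assumed stand-in for the original weather JSON file).
    raw = '{"stationID": "ST003", "hour": 0}\n{"stationID": "ST003", "hour": 1}\n'

    try:
        json.loads(raw)   # what SplitJson effectively needs before it can walk "$.*"
    except json.JSONDecodeError as err:
        print("not one valid document:", err.msg)   # "Extra data"

    # Split by lines first (the SplitText approach): each line parses on its own.
    records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    print(len(records), "records:", records)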
And coming up with the scenario, I tried to model what you all would run into based upon, because I know I've been to YPG, I'm Army, I've worked for the Army. So I know a little bit about what y'all's mission is and what y'all do. So I tried to model the scenario off of that plus of other information I learned about you all. And that's why I came up with the data analyst. But the funny thing is, is actually this worked just fine. This is the same data on the last class, but the last class didn't have any kind of help starting off, they didn't have the previous example like we did. So they had to come up with their own methods. And so they handled it differently and that's why they were able to handle the, I think they were just splitting it by row. Actually, I wonder if you could do that. But anyways, work on that. I will provide a new version of that data file. And I think actually it was splitting, we could actually, let me look at one thing real quick. Oh, you're slow. Very slow. Yes, I don't, I contract to the training company. I don't work for the training company. So like it's not my choice on some things and definitely not my choice here. So I think I can do a split. Oh, I could, I could do a line split. I could probably, I'm just thinking of, I'm gonna put a new JSON document on you all's desktop. But if you wanted to work with that original, you could send it to a split text processor first to split by lines. And then you would have a valid JSON document because it's gonna be a single record. But let me, we're actually way over on time as well. So let's pause today. Tomorrow we will very quickly pick back up. We will finish this. Again, I'm not looking for a complete data flow. What I'm looking for is, here's my thought process. Here's what this flow might look like. Tom, you kind of had it right earlier where you laid out all your processors. So what I was looking for is more of linking those together, getting as far as you can, which you have. And then what you can't get accomplished, we'll get the processors and say, well, I don't know how to do it, but I would bring all the attributes in and do a math function or write a little script or something. And then I'd send it out as email. And if that would have completed this scenario. So again, I know this is going to be a struggle on this scenario. I promise you we're gonna have one more quick and easier one that you will go away with the victory. But the reason for this one is it really dives into multiple different processors, different controllers, different situations. It involves multiple different types of files, multiple files, those things that exercise all of Nafa. So that's the reason we do this scenario, but we'll get it knocked out. So with that being said, unless somebody, I'm gonna upload a new, what I'll do is I'm gonna hijack everyone's browser and I'm gonna put a new JSON document in your folder that you pick up with. So I think everyone's using the git file. I'll go through and see where you're getting that file. I'll replace the JSON with another JSON document. And then you should be able to run your file all the way to splitting successfully. And then tomorrow we will, I quickly, in the morning before break, tomorrow morning, we will finish this, not necessarily finish the data flow, but finish the thought. We'll go around the room. 
What I'm listening for tomorrow is going to be, here is my data flow, here's what I got accomplished, here's what isn't accomplished, here's my thought process, here's what I was thinking that I could do, I cleaned it up, the beautification, things like that. Because that tells me you got the main concept of Nafa as well as some of the main components. You may not know all of the components, but at least you know where to go to get information, you know where to go to get a processor, those types of things. So that's what we will work on in the morning. I will handle the JSON. Anybody else have any questions? Okay, so that will conclude for today. Feel free to continue working on this. And then I will come in and touch up your flow for just this component. And then that way you can do whatever you need. All right. Sounds great, thanks Josh. Yep, yep. All right, if that's it, anybody have any pressing questions? If not, I am gonna go grocery shopping. All right, have fun. All right, thanks everyone. And if you're playing in your machine, I'm gonna take control in just a minute. All right, thank you. Yep, thanks guys. Thanks.
on 2024-05-20
WEBVTT I know it was difficult. You know, we, it's kind of difficult going from, you know, your basic processors that you, you know, these processors are used, you know, Git file, those types of processors are used quite a bit. But you know, when you cross over to controller services and setting up data flows using controller services and those types of things, it can be difficult. And even for me, when, you know, as someone who's been using this for a very long time, I have to go back and reference documents and Google sometimes and those types of things, especially when it comes to some of the regex patterns, you know, that can throw you off. We had a couple of those instances yesterday. Yeah, there you go. Perfect. Just say advanced. I think it's loading. Yep, I think so too. I do notice latency feels better right now, definitely. It's not perfect, but it's definitely better. Oh, good, good. I was, you know, I was explaining to Maria that NaFi requires, like, you know, point, click, drag, drop, and, you know, it has to be responsive or you'll drag a box and it'll go all over the place or a connection or something else like that. So they worked on that overnight as well and to improve it, I hope. Okay, yeah, that was definitely, definitely annoying yesterday. It was, trust me. You kind of click on something, nothing happens, wait a few seconds, click again. Yeah, and when I talk to Ben on the last class or even I talked to the whole class, I ask, you know, this next class coming up, would they be able to work off of their personal laptops? But, you know, there was too many. Well, some have personal or some don't. And then you got NipperNet and NipperNet and it's, you know, craziness. All the proxies and policies. So it was just easier. Okay, well, we'll stick with this environment. But, you know, again, if you can, you know, download, I would be, I'm going to send, I've got all of you all's email address. I'm going to send all this information out. Feel free to, you know, on your personal time and personal laptop, download NaFi and get it running. Kind of go through the scenarios again. You know, try to get them completed. You can actually, I think you can, well, you have ways of uploading and downloading on this virtual machine. So you could potentially do that. I know the virtual machine stay around for like a day or two after the class. So if you want to log back in and reference anything you can. So I know that's available during lunch today. You all will get a survey about the class. I just ask that if we can, you know, complete the survey pretty quickly. No matter what the results of the survey is, you know, we happen to see. But but but yeah, if you can just get the survey done completely pretty quickly. So we can just wrap this up. But if you can't go ahead and get logged in. I see Tom, you're good. There is Peter. Richard. I did. And are we replacing weather data? Yes, you can. You can. Yeah, just delete that one, replace it with the new one and you should be good to go. I'd be, you know, let's see how that works out. But but for today, let's just, you know, for this morning, let's just try to get those values extracted as an attribute, because I feel like once you have them as an attribute, you know, you have full control of that data. You've got, you know, the CSVs are now JSON. The JSON is an attribute. You take the CSVs to the JSON to an attribute. Yesterday, we removed my arrow, my arrow from route on attribute to log message. Did I move that? I forget where it was. 
I think it was. It looks right. As if I don't know if I need to do anything or not. I don't think I do. I don't think you do. Let me take a look. So you're getting the CSVs. Routing on the attribute. Actually, then you would need to send the JSON to the split. OK, that's what I thought. OK, thank you. And then let's see if that new JSON file will go. And then once you do the split, you can do the evaluate JSON. And if we can make it to that, then I'm calling success. OK, very good. Thank you. You're welcome. Hey, good morning. Sorry I'm late. I was on the VPN and it won't let me connect.
on 2024-05-20
So, great assessment, but yeah, I think we've laid it out pretty well. And then also going to deploy, you know, going for seven plus. So no, I'm good with that. It's awesome, the comments as well. Some of the go-team review dots and stuff. Tom, can you pull your screen up here? Okay, so let's, can you just clear your queues? Well, it looked like it did. Oh, I see a failure, go ahead and just clear all your queues. Yeah, it does, it splits it, but it also fails. I don't know, I don't understand why. It means there's three, there's three files, all three fail, but then it also splits all three. Makes no sense to me. But can we look at that, your queued record, before you clear it? Well, we know that CSV is failing. Can literally convert this to JSON though. We look at that CSV, one of those. They're not. Probably be right here, this one. No, the failure. Well, actually, are those failures or the original? Because it may be the originals and the split worked. Can you look at the CSV file real quickly? It may have actually worked, and what you're seeing is the original. That, that's JSON now. Because we put failure and original on the same, you know, to log, to the same log message. Yeah, so actually I think it did work and you should be good to go, now you just need to do your evaluate JSON path. Yeah, you know, or not, to like, to the, no, no, no, you should have attributes there, because you, well, you only did the split, so you haven't done the evaluate JSON path. Yep. So you send it to that first, and then I bet you have attributes. Okay. Okay. Great. Thank you. Hey, I got a question on mine. What? When I ran it through and I clicked on the get JSON and get CSV processors and had them run once, I think it ran through successfully, but when I just have them start and just keep running automatically, like thousands of files really quickly, and I don't really know where it's getting those files, because they're all referencing this data directory, and I only put three files in there. Yeah, what it is doing is you're keeping the source, and it's just sending the same ones through. Okay, yeah, so. Yeah, if you look, let's, let's, yeah, so it's just running the same files through over and over and over. Let's stop the get file processor and take a look at it real quick. Okay. Yeah, I was clearing up the queues now because they got backed up pretty bad. Oh, yeah. Yeah, but you're testing the limits of the system. Yeah. But it is. You can actually go out of this group. So go back to your main canvas using the breadcrumb, or just right click and say leave group. Click on that group. Right click and say stop. And empty all queues. Okay, and, and that's a good easy way to empty them all, and then I bet if you look at your get file, well, you're not keeping the source file. What folder, uploads, their old data directory, can you open up that? Actually you can't, cancel, because you have an error. What's the, the top right, you got a little red box? No. Yeah, what does it say? Unable to write flow file content, content repository, file size constraints. Yeah, I think that's, we are filling up our cache. I'm not too concerned about that error. Hit run once on your get file. So if, yeah, I mean, if you're having to move the files back in and it's picking them up, I mean, you shouldn't have thousands going through, right? Is there anything in the processed or original files? Yes. Are you pulling from out here, or the get file? Look at your get file again. Okay. Yeah, and your recursive is true, are you putting files back into that directory? Not in the same, correct, sub directories.
Yes. Yeah. Yeah, so what it's doing is picking all your sub directories up. Oh, okay, it's drilling down. Yeah, everything, exactly. So it's reprocessing everything. Okay, so you might want to move it just to a different directory or just tell it no on that. Okay, yep. Thank you. Not a problem. Sorry, my internet is freaking out on me, but I'm here. Like minimum engineering experience before, because, you know, I'm thinking about data engineers, software engineers. Yep, exactly. Yep, our, that's called out in, you know, the RP. Yeah, you hear me? Oh, I thought I'd disconnected. Yeah, like that, we call that out. Yeah, I just want to bounce, I can do this. I just want to take a look at it and make sure the terminology fits with the previous, you know, everything, you know, all terminology has to be the same. Fingers crossed. Hey Joshua, how do I, I think I got everything working on here, it's going through the whole flow. But how do I, what's the best way to, like, view that data, view the records and attributes that I have from these files? Let's see here. Yeah, everything going through, where are you putting it? I don't know, I just have it connected to a, a merge content. Oh nice. So merge content. Perfect. So if you want to, actually, if you want to stop merge content, run your data through, you can look, you know, just to make sure everything looks good, and then feel free to write it out to a different directory and you can then actually view the data. Okay. Yeah, you either, and this is everyone's call, but you can either write it out to a CSV. I know we've started with CSV. The point is, you know, learning the processors, but like you can do attributes to CSV, you can create a JSON from attributes, so you just save everything as, you know, one JSON document or, you know, each individual document. You know, you would use that for, you know, loading into a database or something. If you wanted to save everything as a CSV, then have at it, you know, you could use that to, you know, do Excel or something like that. You know, so I'll leave it, you know, up to you. But yeah, you can stop it and look in the queue, or you can, you know, I'm gonna do that, I would just write everything out as a CSV or a JSON document and make sure it looks good and call it a day. And, and definitely, let's see. Well, you've got it, you got yours looking pretty good too, you got it cleaned up. Okay, yep. Yeah, the process is pretty straightforward over here on the left, and then everything's just kind of branching off to the error message. Perfect. You might want to think about how you would handle better error handling, you know, breaking that up a little bit. If you can see my screen, I kind of ran into the same problem, and what I did is, you know, to clean this up, I have error handling on the left that's taking care of half of the data flow, and then I have error handling on the right, the same here, on the other half of the data flow. Okay, just to make it look more presentable. Okay, great question. Data, I didn't name them properly, but that's fine, I think I can fix that. It looks like they were only exporting the last record of data. Starting, it says hour 23. Let's take a look. I was actually just doing the following, when it does write it, to save it as a date. But okay, let's look, so you have just that one record, right, is what you're saying? Yes. Did you run once or did it run all the way through? It ran all the way through. I wonder if, can we look at your queues there, can you look at your put file? That's the last one, right? Yes. All right.
Let's look at that; properties, please. I bet it's because it's the same file name, so it's overwriting itself over and over and over again. Okay, so it's splitting the records out separately and then saving each record, and it's saving each new record over the last one. Yeah. So, can you see my screen? I was literally just working on that issue. If you're able to see my screen: after I put all of my attributes into a JSON document, I set the file name. What I do is rename the file to a date, down to, you know, the milliseconds, and that's how I format it. So you may want to do that. I'll be happy to share this if you want to try to use it; I'll put it in. I can't put it in chat, because chat doesn't work. You could also just leave it, and it will just capture the latest record, but I think it's because the file name is just overwriting itself.

Actually, let's test this. We can test this. Okay, let's run it, except for putting the files to disk. So just run it like crazy, except for writing the files to your directory. Yep, and let's let it queue up so we can see what is happening. Okay. Seems like the latency is still getting you a little bit. Yeah, so when you do your survey, don't give me bad marks for that; that's not under my control. So you copied them in... if you hadn't noticed, NiFi can move data pretty quickly. I've seen folks build a flow and then turn it on and... So you've got 72 files ready to go; let's list your queue. You've got a lot of the same file name on that one, so it should write more than one. Scroll all the way down a bit. Three files out. Yeah, three files out, even though you've got multiple records for each file. So you want to do an UpdateAttribute and do your own file naming. Okay, and if you want, you can use a UUID as well; you have it generating a UUID. Hang on. That way, when it writes it out, it just writes it as a UUID. Yeah, let me, I'll send you the expression when you work on that. I'll send you the expression to write it as a UUID if you want, like I said. Let's see if Teams chat will work today... it won't, because I'm not a member of the chat today. Well, I'll do this, Peter, and I can do this for everybody: I'll just throw a text document in your uploads folder, and you can use it to do your file naming. Okay, sounds good.
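For reference, the file-naming fix described above looks roughly like this. It is a sketch of an UpdateAttribute processor placed just before PutFile, using NiFi Expression Language; the date pattern and the .json extension are only examples.

    UpdateAttribute
      filename = ${now():format('yyyy-MM-dd-HH-mm-ss-SSS')}.json

    or, to name each record by its flow file UUID instead:
      filename = ${uuid}.json

PutFile names the output file from the filename attribute, so giving every record a unique filename (a timestamp down to the millisecond, or the UUID) keeps each record from overwriting the previous one.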
on 2024-05-20
language: EN
WEBVTT Hey, good morning, how's it going? It's going okay. I wish I was a lot further along than still trying to figure this out. Basically, I had the same thing from yesterday, except I didn't remove some of the extra processors. Okay, and then you wrote all your files to disk, and I think that looks good. The one thing I would take a look at as well, just like with everyone, is that once you start adding multiple different processors, it starts getting out of hand: you've got lines going everywhere, those types of things. From there, I guess it sends it to ExecuteScript. I think there might have been another processor which would have been easier than this one. Oh, nice, look at you. It's not complete, but it was just kind of an idea. I think one of the requirements in the scenario was that they wanted you to categorize weather conditions, so I was just searching based on precipitation or temperature. From the documentation, this was how you get the flow file. I'm not sure if this line would work, but if it did, it would just load the flow file as JSON, you search through the keys for precipitation or temperature, and you get those values. I don't think so. This is the right processor to write it to disk, right? To disk, yeah, PutFile. You've got your GetFile, and you've got your PutFile. Then you could have taken this data and pushed it to a database; you've got those processors. This is a real-world scenario where you're receiving different data from different sources, you're bringing it together, getting it into a common format, breaking it down per record instead of, like, 24 records per document, and updating the database. You've got the right processors for it, and you've got the right kind of flow going as well. Great job. Yep, yep.

Good morning, Leroy. How are you doing? Good morning, I'm doing all right. All right, let's go into yours: walk me through your processors, your thought process, what you were thinking, and let's look at this. Okay. Well, my first attempt at this was to lay out the steps as they're presented in the PDF for the exercise. I maybe spent too much time making it look nice, because I didn't get very far in actually implementing the links between them. But the general idea was: get the CSV, extract the text. I honestly didn't make it past this step. I was trying to extract the text using regular expressions and immediately apply it to the attributes. Sorry, I jumbled up there, because as soon as we got to the next step, I was able to just use a ReplaceText to basically convert it over to a JSON-type format. I like where you're going with this. Instead of sending it through the convert-record processors and everything else, you're just quickly ingesting it, extracting the text, and doing it that way. There's a lot going on in those first few core processors, right? And then setting the file name. So I like the approach you took on the other one, where you're just bringing the data in, doing a regular expression, and extracting what you need.

That's a sensitive value. Apply. Now I have a parameter that I can reference in all of my processors. Instead of Tom needing to know the password, you can just say: reference this password. And because I'm already using... let me delete this one, because I'm already using this. Anyways, when you set this up, when you do an input port, it's going to ask you where it came from, and so you just drag and drop: it came from that previous group. So now I have an input port.
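Going back to the ReplaceText approach Leroy described a moment ago, a conversion like that could be configured roughly as follows. This is a sketch only, assuming a simple three-column CSV line of station, date, and temperature; the regular expression and the JSON field names would need to match the actual data.

    ReplaceText
      Evaluation Mode      = Line-by-Line
      Replacement Strategy = Regex Replace
      Search Value         = ^([^,]*),([^,]*),([^,]*)$
      Replacement Value    = {"station":"$1","date":"$2","temperature":"$3"}

Each CSV line gets matched by the regular expression and rewritten in place as a small JSON object, which is the quick ingest-and-convert idea, without going through the record-based processors.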
And what I'm able to do is start separating this. This goes back to what I was saying earlier, where you may have Tom responsible for all data movement that gets written to a database. And so Tom says: well, if you want to write to a database, here are the specs that you need, just send me the specs, and my process group handles everything else. My process group will take your specs and write it to the database; you do not have to worry about that task. So breaking this up into different process groups and having input and output ports really helps in divvying up not only the workload. I've seen process groups that everyone uses because they do some sort of enrichment of the data, and I've had other data engineers use those process groups with no clue what was inside them. They just knew they could put CSV in and the output would be this beautiful JSON that answers all of these questions; somebody else built the logic. So in that sense, you can think of a process group as a processor: you've built this whole data flow, and now you've got basically a processor that you can reuse over and over. So there's a lot of power there, a lot of capability.

And then, when I got done with putting the file out as each individual record, if I had had time, I would have probably sent it to, like, a SQL processor. I like Python and stuff, so I would probably send it to a SQL processor and do some SQL on the files coming in. That way I can compute averages and things like that, and then if I got a value above a threshold, I would package that up and send it as an email alert. But that's my flow and some of the thinking that went into mine. And then I tried to work on some of the beautification. So, any questions on what we've covered in the last two and a half days? Okay. Well, I didn't know you guys came in at, like, 6 a.m. I would have tried to get us out of here earlier for lunch. So let's go to lunch. We will come back at 12:50 my time, 11:50 y'all's time, and then we will get started with Registry.

Hey, Josh, I have a question online. Yeah, Peter, go ahead. Okay, what about the CSV file? So it's a single... okay, so it looks like that's per record. So it is just comma-separated values, but it doesn't have any header. There are no headers there, so you don't have the attribute name and then the value; you know, you'd have something like temperature, or the day in the correct date format. No, that's the date from the attribute. So it does look like a CSV to me. Okay. Yeah, a name for each of the categories, so that's something you can add. You could add the... let's look at how you would add that header back. Let me look at mine. I like your date format, by the way; it's pretty cool once you start messing with some of the regexes. I remember where I got that from, somewhere further back. All right, let's bring up the... you're doing an AttributesToCSV, right? Yes. Let's do this quick. I think it's the Include Schema property: if true, the schema (the attribute names) will also be converted to CSV, so that way you have a header. It's going to give you the attribute names; in your case that would be, like, temperature, station ID, those types of things. So if you set that to true, it should give you all the attributes. You know, it's not going to say "attributes"; it will say temperature, and then the value in the next row.
So that way you always have a header with it. But it does look like regular CSV to me. If you want, you can copy it and send it to yourself, and just see if Excel opens it with no problem. There's a way to upload files through the "drop files here" at the bottom of your screen, but I don't think this platform has an easy way to download. You could use something like, not Pastebin, one of the things we use for this class sometimes: Etherpad. I teach this class as well as the DLD architecture framework class, just because I've had to implement that so many times now. But you can actually go to Etherpad; you can use this company, Noble Frog. If you go to etherpad.noblefrog.com, you can just create one and say okay. It's going to create a new pad for you, and then you can just use that same address on your local machine and copy your information over. But that looks like valid CSV. It looks like you just need to include the attribute names and you'll have a header, and you're off to the races. Yeah. Okay.

Well, hopefully everyone had a good lunch. I wish I'd realized you all came in at 6 a.m., because you probably take your lunch closer to 11. I know yesterday and the day before we went a little bit past that. It would have made me happy, because I usually eat lunch at about 11:30 my time, which is 10:30 y'all's time. I didn't realize y'all start so early, and so do I. So, good deal. Anyways, feel free to follow along with me. I could kill you via PowerPoint, but again, I feel like hands-on is best, and it really teaches you how to do these things. So on this next topic, we're going to talk about NiFi Registry and what it does, and then we are actually going to configure our NiFi instance to use Registry. We're going to check in our code, our data flows, and check those out, before we roll into multi-tenancy, which is a PowerPoint, you know, unfortunately.

Hey, the way that I like to do some of these subprojects is, I know I haven't really told you what NiFi Registry is or what it does or those things, but what I like to do is just go right into it, because you already have data flows that we can check in. I think the hands-on approach is way better than PowerPoint. It should be extracted by now, and if you get caught up on any steps, if you miss a step, if I go too fast, if you get carried away for a second and miss a step, interrupt me, and let's get you on the same page. Some of the key things are, you know, your security settings and those types of things. The configuration file for NiFi Registry is much shorter, right? It is a subproject of NiFi; there's not a lot going on there, and it's not a heavy lift. It's a much smaller package to download and extract. I was doing it in a new tab on my own. So, run that run-nifi-registry. You might get a warning; I, for instance, have to say allow access, because this is a network application. So, Registry should run.
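For reference, the header fix discussed before lunch looks roughly like this. It is a sketch of the AttributesToCSV properties, assuming attribute names like station, date, and temperature from the exercise data.

    AttributesToCSV
      Attribute List          = station,date,temperature
      Destination             = flowfile-content
      Include Core Attributes = false
      Include Schema          = true

With Include Schema set to true, the attribute names are written out along with the values, so the resulting CSV carries a header line that Excel or a downstream consumer can use.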
on 2024-05-20
language: EN
WEBVTT Let's do this: bring up a new tab. If you can see my screen, I also have my NiFi tab up, but I have a new tab, and the URL you go to is HTTP, not HTTPS, because we're working on an unsecured port. So all you do is go to http://, then the same IP address, 127.0.0.1, then a colon, and we're going to use a backup HTTP port; we're using 18080. So that's 127.0.0.1:18080. Then you need to add a forward slash and type nifi-registry, because if you don't, it's just going to take you to a page not found. Did that work for you? Thanks. Yeah, you got it. All right. No worries.

Leroy, and... oh, I don't even see Richard; he might be caught in a... oh, there we go. Let's see, what URL are you going to? Leroy, give me just a second, and we'll take a look at yours: 127.0.0.1:18080/nifi-registry. You should have two command-line boxes if you go down to your task bar. Actually, there you go. Here, I'll do this. Richard, all right, I'll just take a look. Do we still need to keep the other one? Yes, you do. Well, luckily, it looks like you only closed NiFi, or rather Richard only didn't start it. So, yeah, you need to keep your run-nifi going. And so, oh, okay, I think Richard missed this step. Yeah, leave the other one running; we're just going to run two of them. I think Richard got pulled onto a call or something, so I'm going to just extract his and run it for him real quick. Having issues trying to rerun the original NiFi? Okay, just two seconds and I will take a look at yours as well. I'm going to start Richard's, and that way his will work and he can log in as well.

So if you look at my screen, Peter, you should have two command-line boxes open. Okay. And what I did is: Richard here had his NiFi already running, so I just minimized it, went and extracted the NiFi Registry zip file, went into the bin directory, and double-clicked run-nifi-registry. It's going to open a new command-line box, and when you see the NiFi Registry name and the version, then it should be up and running. Yep, I see that one. Okay, perfect.

All right. So, Peter, yeah, it looks like you're in. I know Richard is going to be good. All right, looks like everyone made it. So, this is Registry. Like I said, it is a subproject of NiFi. The same folks that maintain NiFi, you know, there are committers that maintain Registry as well. Registry is a subproject, and like I said earlier, it's a complementary application to NiFi; it's like that translation layer. What it does is provide a central location for the storage and management of shared resources across one or many NiFi instances. Remember, NiFi is scalable to hundreds, thousands of nodes. You may have seen multiple scenarios, but usually what I've seen is that you may have a dev system, a test, and a prod, and depending on the system, it reaches out to NiFi Registry, grabs the flows it needs that have been checked in from dev, and runs them. You may have one registry for all three networks, or you may have a registry for each on its own and use the backing of GitHub or GitLab to replicate that code to the other systems and levels, those types of things. But Registry is there.
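For reference, the port in that URL comes from the Registry configuration. A quick sketch of the relevant lines in conf/nifi-registry.properties, assuming the out-of-the-box unsecured defaults:

    # conf/nifi-registry.properties
    nifi.registry.web.http.host=
    nifi.registry.web.http.port=18080

With those defaults, the UI is reachable at http://127.0.0.1:18080/nifi-registry, which is why the /nifi-registry path has to be typed at the end.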
It's there to keep track of all of our data flows, for the most part, for storing and managing versioned flows. It's integrated with NiFi completely. We are going to set that up, because right now, when you work within NiFi, there's no hint of Registry; but there are a couple of things we're going to do, and that's going to flip all kinds of switches underneath. So, anyway, that is NiFi Registry.

What I like to do is this: there's a little wrench icon at the top right. If you can click on that, that's our settings, and the settings here, again, are very basic; it's a translation layer and a place to store stuff. So if you go to the settings, you can actually click New Bucket. Click New Bucket, name your bucket (I'm going to name mine "new bucket"), give it a description, and say Create. After that, you should see at the bottom right a success message that the bucket was created. Once we have this bucket, we can delete it, we can manage it. So if your bucket is created, you can click the little pencil on the right to manage that bucket. When you go into that bucket, you'll have your name and description. Again, you can share it: there are permission settings where you can actually make it publicly available; bundle settings, like allow bundle overwrite, which allows released bundles in the bucket to be overwritten; and then the big one is the policies. Once this is installed and up and running on y'all's system, you're going to have policies in place. You're going to have usernames, passwords, identity management, certificates, CAC cards, you name it. And so there will be policies set up that say, you know, Tom can log in, but he can only see this bucket. That is Registry; there's not a whole lot to it. Anyone have any questions about Registry in itself? All right, I didn't think there would be many, so that's a good thing.

What we're going to do now is go into some PowerPoint. I'm going to try to get through a couple of slides; we're going to talk about some multi-tenancy and things like that. There's not a whole lot of NiFi interaction right this minute, so if you can just bring up my shared screen, we'll go through some of these PowerPoint slides as quickly as possible. We'll take a break here soon, our last break of the class, and then get back and, you know, if you want to work on another scenario, the survey, any administrative stuff, things like that. Excuse me. All right, where are my slides?

All right, I think where I left off mid-Monday was actually installing NiFi. So while we get everybody back: for Registry, for instance, everything we went over is in the documentation. We went over downloading it, installing it, and starting it. You can install it as a service; we talked a little bit about that with NiFi. Creating a bucket, connecting it, and all those things. Again, I don't put the resources in the presentation I send out to everyone, but if you have any additional questions about Registry, about NiFi, about MiNiFi, those parts, I'll put them in the presentation. One of the other little points I want to chat about is MiNiFi, only because the last class was really interested in it, and we actually did some MiNiFi things. But, you know, I mentioned MiNiFi: it is a subproject under NiFi.
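Coming back to the couple of things that hook NiFi up to Registry, the steps look roughly like this. This is a sketch for an unsecured local setup like the one in class; the menu labels are from NiFi 1.x and may differ slightly between versions.

    1. In NiFi: global menu (top right) -> Controller Settings -> Registry Clients -> add a client
         Name = Local Registry
         URL  = http://127.0.0.1:18080
    2. On the canvas: right-click a process group -> Version -> Start version control
         Pick the registry client, pick the bucket created above, give the flow a name, and save.

Once a process group is under version control, it shows a version badge, and later changes can be committed to or reverted against that bucket in Registry.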
on 2024-05-20