Practical Hadoop
language: TL
WEBVTT The core-site file is where you specify the location of the name node, and that location is what the HDFS and MapReduce components use to find it. As you can see here, by default it is empty, so we need to provide the value. Now, the next step after I update the core-site and the hdfs-site is the mapred-site. The mapred-site is the file used to hold the variable for the location of the job tracker. So this time we will modify the mapred-site. The purpose of the mapred-site is that this is the configuration where we configure the job tracker location; it gives us the power to configure the job tracker, just like the core-site holds the location of the name node. Because only the MapReduce component needs to know this location, mapred-site.xml is the file that we need to edit. The network addresses for the name node and the job tracker specify the ports to which the actual system requests should be directed. These are not user-facing locations, so we don't need to bother pointing a web browser at them; the web interfaces are something we will look at in our next activity. Okay, so we need to copy this job tracker property and paste it inside our mapred-site. Remember, this is the third file that we edit: the first one is the core-site, the second one is the hdfs-site, and the third one is the mapred-site. Again, the hdfs-site is used to handle the data node, the core-site is used to handle the name node, and the mapred-site is used to handle the job tracker node. This is where we configure things in such a way that Hadoop can locate each specific component of your Hadoop architecture. Now, in our case here, the next process is that we will create a directory. What's the purpose? Why do we need to create the directory? We need to create a hadoop directory inside /var/lib, writable with mode 777 (read, write, execute), so data can be written to that directory; it backs the temporary directory property, the data that Hadoop needs to store. After we create the directory, we make it globally readable and writable: sudo chmod 777 /var/lib/hadoop. That way we avoid a permission error later on when Hadoop tries to write files to that directory, specifically the temporary data files. Once we create the directory, we can add a new property to the core-site that tells Hadoop to use this directory whenever it needs a temporary directory to write to. To do that, we copy the property and update the core-site, adding the new property that tells Hadoop to use the directory we created. So what's the next step after this?
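A minimal sketch of the three configuration files and the writable temporary directory described above, assuming Hadoop is installed under $HADOOP_HOME with its configs in etc/hadoop/ (conf/ on older releases). The port 49000 matches the session; the job tracker port 49001 and the replication factor of 1 are illustrative assumptions.

    # core-site.xml: name node location plus the temporary directory property
    cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:49000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/lib/hadoop</value>
      </property>
    </configuration>
    EOF

    # hdfs-site.xml: data node settings (single node, so replication 1 is assumed here)
    cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    EOF

    # mapred-site.xml: job tracker location (port 49001 is a placeholder)
    cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:49001</value>
      </property>
    </configuration>
    EOF

    # writable temporary directory so Hadoop can store its temporary data
    sudo mkdir -p /var/lib/hadoop
    sudo chmod 777 /var/lib/hadoop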
So the next step after this is that we need to format our name node. Why? Because before we can start Hadoop in either pseudo-distributed or fully distributed mode for the first time, regardless of which mode it is, we need to format that storage to become an HDFS file system. You can imagine it like installing an operating system on a hard disk: the first time you install the operating system on your hard disk, you need to format the disk as FAT32, exFAT, NTFS, and so on. It is the same idea here. To be able to use a certain directory for our HDFS, we need to format it. The command to format the name node can be executed multiple times, but in doing so, any existing file system will be destroyed, because of course it gets formatted. It should only be executed while the Hadoop cluster is shut down, and there are cases where you will want to do it, such as when you want to irrevocably delete every piece of data in your Hadoop file system. It does not take much longer on larger setups. When you see the message that the location has been formatted, it means the Hadoop distributed file system has been created successfully. That's the most important part. If you cannot see that response at your end, there is a problem with your configuration and you need to look again at the notes I gave you in the notepad. You'll notice that the notepad I gave you is a special set of notes of mine; if you follow it exactly, verbatim, you can set up your Hadoop successfully. You can save that notepad at your end, or I can email it to you, whichever you prefer, because these are my personal notes for setting up Hadoop. Aside from my PowerPoint, which explains everything one by one, this is the clean version: if you execute it line by line as per the instructions, you can set up your Hadoop server. Got it? Now, going back to our discussion: if you successfully installed your Hadoop, the next step is that we need to check whether your Hadoop is really working. How do we check? First, of course, we need to start Hadoop. Unlike the local mode of Hadoop, where all components run only for the lifetime of the submitted job, with pseudo-distributed or fully distributed mode the cluster components persist. This setup is actually pseudo-distributed. Why pseudo-distributed? Because the setup I made here contains only one node: the machine itself is the node, the server, the master, and the slave all at once. So it is a pseudo-distributed configuration. Unlike the job-based or local mode of Hadoop where we executed the jar file before, which we call local mode, the one we are setting up right now is a pseudo-distributed setup, and its cluster components exist as long as the processes are running. So before we use HDFS or MapReduce, we need to start up all of the needed components. To start our Hadoop server, all we have to do is run the start command; once you see "starting datanodes" and "starting namenode", it means you now have a name node and a data node.
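A short sketch of the format-and-start step just described, assuming the Hadoop bin/sbin directories are on the PATH:

    # format the name node (destroys any existing HDFS data), then bring HDFS up
    hdfs namenode -format
    start-dfs.sh        # expect "starting namenode" and "starting datanodes" messages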
Now, the start-dfs command, as the name suggests, starts all of these available components. The name node manages the file system and the single data node holds the data: the name node manages the system, the data node holds the data. After we start these components, we can use the JDK's jps utility to see whether there are Java processes currently running in our system. With jps, the Java process status tool, we can check which processes are currently running. By typing jps, we can see here that we have the DataNode, the SecondaryNameNode, and the NameNode. To be exact, we have a data node, a secondary name node, and a name node, which is exactly what we expect, because those three are the components we need to transact with Hadoop. Now, before we proceed to use the file system, let me verify that you're already there. Okay, very good. Did you format the name node before you executed start-dfs? Correct. Okay, so start-dfs: starting name nodes, a warning about a permanently added host, then it proceeds with the execution. The next thing we need to do to make sure everything was configured properly is to list the Hadoop file system directory. To do that, all we have to do is type hdfs dfs -ls /, meaning we want to list all the files in our Hadoop distributed file system, specifically all the files in the root directory. That's how we interpret this command. Now, is this command also available on EMR, the Amazon Hadoop service? Yes, all of these commands are valid there. Even if you use Spark, or Azure, or Google's big data services, these commands are still valid, because the HDFS commands are a common language for the Hadoop ecosystem. Meaning all of the commands here are also valid on any other Hadoop system. Now, after we verify that we can use the Hadoop file system, the next step is to try YARN. YARN means Yet Another Resource Negotiator. YARN is a resource management framework that allows multiple data processing engines to run on a shared Hadoop cluster; it gives you the power to manage multiple resources under one roof. That's why we call it YARN, Yet Another Resource Negotiator. To run YARN, we need to execute start-yarn.sh. As you can see: starting resource manager, starting node managers, like that. Remember, before this we had those three components already running; now that we started YARN, we expect another two processes running in our system. To check whether all of those are running, we can use jps again. Hmm, the YARN processes are not showing in my jps output; I only see the data node, the secondary name node, and the name node, so YARN has its own separate execution layer. Now, let's say we decide to temporarily stop all of the services that we started. What's the command?
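The verification steps above, gathered as commands; a sketch assuming the same single-node setup:

    jps                 # expect NameNode, DataNode, SecondaryNameNode
    hdfs dfs -ls /      # list the root of the Hadoop file system
    start-yarn.sh       # starts the ResourceManager and NodeManagers
    jps                 # check the running Java processes again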
So when do we stop the services in our Hadoop? You can use the stop command if you want to reformat your HDFS instance, your HDFS data file system; you first need to stop the running processes to avoid premature replication. To stop, all we have to do is run stop-all. As you can see there, it stopped the name node, the data node, YARN, the secondary name node, and so on. Now, what if, after we reformat, we decide to start the services again? What's the command? The command, of course, is start-all.sh. Running this command will start all of your Hadoop services plus YARN. Once you have started all the services, you can check with jps again, and you should see the name node, the secondary name node, and the data node. That's good. Now, if there is any issue, such as the data node not being found, or an error when you run jps, then you can reformat: execute the HDFS name node format again to fix that issue, if there is one. Okay, now let's check our configuration again and verify whether your Hadoop is working. The first step is to try to create a new directory, a user directory. We can type hadoop fs -mkdir. As you notice, there is always a dash before the file manipulation or file system command. So: hadoop fs -mkdir /user, creating the user directory from the root. If everything is okay, then no error is raised by that execution. Let's say I add a further directory inside user: inside /user I add a new directory, hadoop. If I execute that, there's no error. Now let's say I create my own user directory and name it, for example, vm1. Then I can see there is a vm1. Furthermore, I can check whether the files and folders exist by using ls: hadoop fs -ls /user, for example. I'm expecting /user/hadoop and /user/vm1, meaning that the directories you defined after configuring your Hadoop were successfully created, and that is an indication that your configuration is fully working. If you cannot execute up to this point, that's okay. Meaning... "failed on connection exception: java.net.ConnectException". Okay, so that means you have a connection refused error; for more details, see the link shown in the error. Let's check your setup: you need to recheck your mapred-site.xml. Did you update your mapred-site? Because that error, I think... excuse me. Ah, yes, this one. Can you see my shared screen? Okay, can you try to open the file that you have? Okay. Now, can you try to open your core-site? Yes, the core-site. Okay, you have the core-site with port 49000. Okay. Can you try to open your hdfs-site? Your hdfs-site. Okay, that's good.
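The stop/start cycle and the directory check described above, sketched as commands (vm1 is just the example name used in the session):

    stop-all.sh                     # stop HDFS and YARN
    start-all.sh                    # start everything again, including YARN
    hadoop fs -mkdir /user          # note the dash before every file-system command
    hadoop fs -mkdir /user/hadoop
    hadoop fs -mkdir /user/vm1
    hadoop fs -ls /user             # should list /user/hadoop and /user/vm1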
Okay, now, can you exit there? Just want to make sure: did you execute this part of the commands, the sudo mkdir for /var/lib? If not, try to execute this one. Try it again: sudo mkdir, then try to change the mode. Okay, that's fine. Now, make sure to change the mode to 777. Okay. Now, one more thing that I want to test: this one, the sudo rm -r. That one should only be executed if you had an error, such as jps not finding the data node or anything like that. The configuration in the core-site, you mean? Okay, try to open your core-site; let me verify it. I think I saw it, and I think that's good. One thing that maybe is not working... okay, try to open it. Yeah, I think it's there. Correct, that's 49000; the default file system name with 49000. Yes, and for the mapred, that's correct. Now, I suspect there is an issue with SSH. Okay. Exit; type exit. Okay, now try to run the ip command to check your address. I don't know; I'm talking to Lavinia. Okay, now try to type ssh 10.0.3.16. Okay, it connects, so it is something else. Yeah, you need to exit. Okay, so it's a different issue so far. Let's try this: yes, try to stop everything first, stop-all. Okay, follow my lead. First, stop-all. Next, execute the sudo rm -r /var/lib/hadoop; let's clean your configuration. So sudo rm -r, and after that, let's execute this one. If you can see my screen, the one that I highlighted, you need to execute that. Okay, execute that one. Then execute this one. Yes. Oh, I see, there's an issue: it says cannot create. There: "cannot create directory /var/lib/hadoop". That's the issue there. To be able to fix that, you need to execute this one first, the sudo rm. Okay, execute the sudo rm first. Yes. No, /var/lib/hadoop. "Cannot remove: no such file or directory." Okay, can you try again? Ah, okay. Now execute this one. Yeah. Okay, now execute this part. Now execute the format again. Okay. Now, after you format, let's try to start the DFS. Let's start one component at a time so that we can track down the error first. So start the DFS first; if we can see all of the DFS processes, that means we are on the right track. Okay: data node, name node, secondary name node. Okay, type jps. Okay. Next, execute the
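The clean-and-rebuild recovery sequence used in this troubleshooting, collected in one place; a sketch that follows the session's /var/lib/hadoop convention:

    stop-all.sh
    sudo rm -r /var/lib/hadoop      # wipe the old temporary data
    sudo mkdir /var/lib/hadoop
    sudo chmod 777 /var/lib/hadoop
    hdfs namenode -format
    start-dfs.sh
    jps                             # NameNode, DataNode, SecondaryNameNode should appear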
on 2024-09-18
language: EN
WEBVTT So while that is uploading, I have already created the user and the home directory. Here I already created the home and the user. The next step here is to generate the key. Now, the next step is to create the link for Java. So that's done. Now, remember we already created the three certificates, meaning the three certificates are now ready to be deployed on our machines. To be able to deploy a certificate, we first go to our VM2. This is our VM2; we go inside and modify the certificate file. Here we need to update the certificate in such a way that it is available through the authorized_keys file. So we run sudo nano, then we paste the certificate: we copy this one and paste it here. That's it. Then we save and exit. And let's check: I see we have three certificates, for VM2, VM3, and VM4. We'll do the same work for the other certificates, because these will be the keys that let the machines communicate with each other. Yes, we do this on all three machines. The reason we do that is because each machine needs to communicate with the others, right? Via SSH. To be able to communicate with each other, each of the machines holds a certificate. Correct? Without that certificate, of course, they cannot talk to each other. You cannot upload? Ah, okay, the reason for that is that you are logged in as hadoopadmin, correct? Okay, so each of the machines now has a key. To test whether the machines can communicate with each other, we SSH from VM2 to VM3 and VM4, and vice versa. So: ssh vm3, type yes. Okay, that means I am now connected to VM3. I will try to connect to... ah, sorry, not VM1, not VM2: VM4. What did I connect to last time? So I started with VM3; I am connected to VM3. Yes, and now I will connect to VM4. Yes. So as you can see, I can connect to the different machines, right? So here, I will update my hosts file. Now, what I just need to enable in my hosts file: actually I will only add, only update, a few lines here. Okay, and that's it. That's the final touch, and we can now go to our master. We only run start-dfs on the master; we do not run those HDFS commands on the slaves. All of the commands must be run on the master only. Okay, so cross your fingers; hopefully there are no issues. Let's try. Ah, okay. Yeah, of course, we need to use sudo. Not move, I think. You are moving the directory, right? Because you uploaded the directory, I remember. You want to move that inside the local directory, correct? So we can directly use the move command; instead of creating a directory, you don't need to, you can copy it directly. Ah, okay, so now it's moving. The hdfs-site, wait, the hdfs-site: you only need to update it. You can check my screen. You need to update only this one: where it says slave1, you need to change that into the name of your node.
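A sketch of the key generation and distribution being done here, assuming the hadoopadmin user and the vm2/vm3/vm4 hostnames from the session:

    # on each machine, generate a key pair for the hadoopadmin user
    ssh-keygen -t rsa -P ""

    # add the public keys to ~/.ssh/authorized_keys on every machine
    # (here by appending; in the session the keys are pasted in with nano)
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    # test password-less access between the machines
    ssh vm3
    exit
    ssh vm4
    exit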
So for example, VM4. But this is only applicable on the slaves; you don't need to apply this configuration on the master. The core-site still points to the master for all the instances. The only file that we edit in VM3 and VM4 is hdfs-site.xml; the rest of the files inside the Hadoop configuration remain as is. Okay? They remain as is. I also need to change the IP so that it points to VM4; thank you for reminding me, I might have forgotten.
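A sketch of the per-slave edit described above: replace the slave1 placeholder in hdfs-site.xml with the node's own name. The placeholder name comes from the session's template; adjust the path to your install.

    # run on VM4 only; VM3 uses its own name instead
    sed -i 's/slave1/vm4/g' $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    # core-site.xml is left untouched: it still points every instance at the master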
on 2024-09-18
language: TL
WEBVTT So first, we need to make sure that Sqoop is installed. To check whether Sqoop is installed, we check the available containers that we can enter. Based on the services we created in our docker-compose file, if we check the compose file and scroll down, okay, here we have this hadoop-sqoop container. Now we need to get inside that hadoop-sqoop container: docker exec -it hadoop-sqoop bash. Yeah, let me check mine... oh. So today is not my lucky day: my Sqoop container is not working, while on your end it is working. Okay, what happened to my Sqoop? Let me check docker ps; we can check whether the container is running by typing docker ps. Based on my docker ps, yes, my Sqoop container is not running. So maybe there's a problem with my Sqoop image. If this is the case, as I mentioned before, it is really easy for us to trash the Docker setup and rebuild it with Docker Compose: cd into the hadoop directory, then docker compose down. Okay, let me check the error: a Java exception, connection refused. So: docker compose build sqoop. Let me check my Sqoop... ah, okay, docker compose build sqoop, then docker compose up. Let's check the build. Okay, I think it's running now. Okay, so let's have a quick break; see you after 15 minutes. Ta-da! So, meaning, we will execute the Sqoop import using the JDBC driver. Let's import the data from the database on the server ending in .0.2. Let's check the server: docker inspect mysql-server. So it ends in .0.2, correct: 172.18.0.2. So this means we will import the data from the database server 172.18.0.2, using the username root, reading the table emp, asking for the password, from the userdb database, with the MySQL JDBC driver. We need to copy this command and execute it inside the Sqoop container. Ah, yeah, let's try to copy. I'll just close and reopen the terminal; sometimes what I do is close this and reconnect or reopen, and then it works again. The driver file is there in the code, so it should download. It's actually there; I checked the code. So we need to rebuild, because it is downloaded during the build. There is supposed to be a sqoop folder here, and inside the sqoop folder there is an installation file for Sqoop. Let's try to rebuild it again; if not, then we need to download the Sqoop file manually. That's what sometimes really happens. Just to make sure that everything is clean, I usually do this: docker compose build sqoop, then docker compose up. There you go. If that happens, I'm not sure why it did not download; normally this is automated, normally it will download this file. That file normally resides here, because that's in our instructions. For some reason it didn't get the file, but technically that's what the project does: it gets the file, untars it, moves all the contents inside the sqoop directory, then deletes the tar.gz file. Since it did not execute all of that, I'm going to move it here manually. Now, if we check back in our code here, there's a jar file; that's supposed to be the second operation.
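The container checks, the rebuild, and the Sqoop import described above, gathered as a sketch. The container names (hadoop-sqoop, mysql-server), the userdb/emp names, and the 172.18.0.2 address follow the session; the compose directory and the -m 1 mapper count are added assumptions.

    docker ps                                    # is the hadoop-sqoop container running?
    cd ~/hadoop                                  # folder holding docker-compose.yml (adjust)
    docker compose down
    docker compose build sqoop
    docker compose up -d

    docker inspect mysql-server | grep IPAddress # e.g. 172.18.0.2

    # inside the Sqoop container: import the emp table over JDBC
    docker exec -it hadoop-sqoop bash
    sqoop import \
      --connect jdbc:mysql://172.18.0.2/userdb \
      --driver com.mysql.jdbc.Driver \
      --username root -P \
      --table emp -m 1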
There's a jar file here that should have been downloaded; you can paste it there. And there is another jar file. These two jar files are supposed to be moved inside the lib directory. So we go here, go inside the lib directory, and move the files. Okay. Now we can try again: docker exec -it hadoop-sqoop bash. Now we can check sqoop-version. Based on the configuration, our Sqoop home is in /opt/sqoop. Then... I think that's it. So we can now check sqoop-version. Okay, we can actually copy that command and reuse it. Did you put the files inside now? Okay. So try to execute... sorry, sorry, sqoop-version, not hadoop-version. Okay. Since we have the same problem: if you have that problem, you can execute the command that I put in the chat. Let's try again to run sqoop-version. Okay, there are some warnings because ZooKeeper is not installed. Actually, ZooKeeper is a different service, so it should be in a different container. HCatalog is also a different service and should be in a different container, as well as Accumulo and HBase. Those are actually separate services: HBase, HCatalog, Accumulo, ZooKeeper, and likewise Sqoop and Hive are separate containers. Technically we did not install or configure ZooKeeper, and we don't have Accumulo, so we normally receive those warnings. HBase we haven't installed either, so we don't have a container for it. If you want a full-fledged Hadoop, then you need some of those. ZooKeeper is simply a component that tracks the members of the network; it checks the availability of the nodes. We call that auto-discovery: ZooKeeper is the auto-discovery component. HBase is like a database on top of Hadoop, a NoSQL database, if I describe it right. Those are different components that you would install using different containers. Now, since in our case we will only use Sqoop and Hadoop, this instance is already good. Now, let's try to send a query through Sqoop. Here, let's paste the code from before; remember, we typed this code here before. You need to go inside MySQL first. You will notice that... okay, let me show you. At the top here, you see this one? docker exec hadoop-sqoop, right? Meaning this window is for Sqoop, and this one, yes, this one is for the Hadoop master. To be able to use MySQL, you need to open another terminal. So you open the terminal, then you go inside MySQL: docker exec, then -it, and after the -it we have the mysql-server container, right? Then we go inside bash. Once you are in bash, you can now type mysql. The MySQL user will be root, and your password will be the password that you put in the docker-compose file, which is the root password. You can review that inside docker-compose.yaml. Once you are in MySQL, you can execute that MySQL command. Okay, I think it's ready. We can have lunch for now, so see you after lunch. Bye bye. Ta-da!
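A short sketch of the two terminals described above, one for the Sqoop container and one for the MySQL container; the root password is whatever was set in docker-compose.yaml:

    # terminal 1: verify Sqoop inside its container
    docker exec -it hadoop-sqoop bash
    sqoop-version        # warnings about HBase/HCatalog/Accumulo/ZooKeeper are expected

    # terminal 2: go inside the MySQL container and log in
    docker exec -it mysql-server bash
    mysql -u root -p     # password comes from docker-compose.yaml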
on 2024-09-18
language: TL
WEBVTT The list-tables command; and now this list-databases command is like the SHOW DATABASES command in our SQL as well, so you can use sqoop list-databases, or sqoop list-tables. It is totally the same idea as SHOW DATABASES; for example, this list-databases output.
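A sketch of the Sqoop list commands just mentioned, reusing the MySQL connection details from earlier in the session:

    # equivalent of SHOW DATABASES
    sqoop list-databases \
      --connect jdbc:mysql://172.18.0.2/ \
      --username root -P

    # equivalent of SHOW TABLES for the userdb database
    sqoop list-tables \
      --connect jdbc:mysql://172.18.0.2/userdb \
      --username root -P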
Okay, so the next one is Apache Flume. Apache Flume is a tool for aggregating and transporting large amounts of streaming data; primarily it is used for streaming records. It is used to copy streaming data, from log data or from any other web servers, directly into our HDFS. So what happens is this: let's say there is an event log data generator; the events from the generator pass through Flume, and Flume processes that data and pushes it to the centralized storage, which is the Hadoop file system, the HDFS. The application of this, of course, is when you are dealing with data logging or analytics that require high throughput; in that case it would be good for you to start with Flume and explore it. So its applications are data analytics, data streaming, or data logging where you want to store the data in the HDFS server before applying data mining or data analysis. That, so far, is the purpose of the Flume component in Hadoop. One of the advantages of Flume is that it can import very large volumes of data, and Flume scales horizontally, meaning it can scale out almost without limit depending on your configuration. Most data, of course, can be analyzed from processed files or log files, but since some data comes from streaming sources, you need to process that data before you can apply analysis. The problem with HDFS is that HDFS cannot handle real-time ingestion; that's why we need Flume to process the streaming data. Again, HDFS cannot handle streaming data by itself, and that's why there is Flume.

So the architecture of Flume is like this: let's say we have streaming data coming from data generators. Flume creates an agent, and that agent works like a traffic enforcer: it decides, based on the data it sees, where that data goes and what will happen to it before it is passed to the collector. Once it is in the collector, the data can be considered stable, and that stable data is pushed to the centralized store, which is our HDFS. In between, Flume works like a middleman handling the events. Technically Flume is an event manager; it handles events based on their headers. An agent is an independent daemon in Flume. The source in our diagram is a component of an agent which receives data from the data generators and then transfers it to a channel in the form of Flume events; that's why it's an event manager, because Flume has a small event manager that handles those signals, that data. The channel is a transient store which receives the events from the source and buffers them until they can be consumed by our sink. So the channel acts like a bridge between the sources and the sink, while the sink stores the data in a centralized store like HDFS, or whatever data file system you want to store it in. That's how it works.

So the source there could be, for example, the Twitter source, the Thrift source, or the Avro source; our channel could be the JDBC channel, the file channel, or the memory channel; and our sink is the HDFS sink. A Flume agent can have multiple sources, sinks, and channels, because each of those represents a stage of the flow. As you can see here, Flume could also have another agent, because agents work like traffic enforcers: they hold the data, align the data, and once everything is ready they pass it to the collector, and the data collector passes it to the storage for final storage. That's why Flume can handle streaming data: it has a lot of agents to handle the data that comes in and goes out. Now, for installation, I think I did not include Flume in the setup, so I will just go through the overview of the configuration. For the configuration, we describe the source, the sink, and the channel; those are the three important ones. Then we bind the source and the sink to the channel. Those three need to be configured and filled in, in such a way that you can handle the flow in your Flume. For example, in our case here we have different types of sources, as I mentioned: Avro, Thrift, Exec, JMS, Twitter, NetCat, Kafka, syslog, and HTTP sources, plus legacy sources and custom sources; it can even support things like video sources. For the channels, Flume uses different types of channels for the processing; all of this is handled by the agent, except the sinks. For the channels we have the memory channel, the JDBC channel, Kafka, the file channel, the spillable memory channel, and so on. For the sinks, we have different types of output sinks that we can use, such as the HDFS, Hive, Logger, Avro, IRC, and HBase sinks. Those are the final outputs of the delivery, in such a way that HDFS can consume the final output. In our case here, we have the Twitter source and the memory channel, so the Twitter agent uses the memory channel, and the final sink that handles the data is HDFS. These are the parts that really matter when you are dealing with the configuration of your Flume. Now, once Flume has been configured, you need to provide the values for the source: it could be a consumer key, a consumer secret, an access token. In our Twitter agent example, those actually come from Twitter, because long ago Twitter supported this kind of streaming access; I'm not sure if they still support these features, because last year, or this year, they made massive changes after it became X.com, so this is just an example. As I mentioned, the source can be anything: it can be your web server, your logs, or whatever. Now, after you define your keys, your values, and everything, you can define your sink. For this sink, since we are using HDFS, we set TwitterAgent.sinks.HDFS.type to hdfs, and of course the hdfs path, which is the location in HDFS where we are going to store the data.
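A sketch of a Twitter-style agent definition matching the source, sink, and channel just described. The keys are placeholders, and the source class and HDFS path are assumptions; the channel binding shown here is explained right below.

    cat > twitter.conf <<'EOF'
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    TwitterAgent.sources.Twitter.type              = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.consumerKey       = YOUR_CONSUMER_KEY
    TwitterAgent.sources.Twitter.consumerSecret    = YOUR_CONSUMER_SECRET
    TwitterAgent.sources.Twitter.accessToken       = YOUR_ACCESS_TOKEN
    TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
    TwitterAgent.sources.Twitter.channels          = MemChannel

    TwitterAgent.channels.MemChannel.type = memory

    TwitterAgent.sinks.HDFS.type      = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:49000/user/flume/twitter
    TwitterAgent.sinks.HDFS.channel   = MemChannel
    EOF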
You can imagine how this works with Twitter: before, whenever someone posted something, it generated logs, logs from your Twitter account, and you can capture those logs and push them to your HDFS. So if your Twitter account has a lot of updates, with logs generated second by second, you can capture those and store them in your HDFS for analysis. Now, after you define the HDFS sink, you can define your channel. This channel is used to transfer the data between the sources and the sinks, as per the structure of our Flume. In our case, for example, we are using the memory channel. All of this configuration is based on the Flume architecture: we bind the HDFS sink to the channel and the Twitter source to the memory channel, because each of these sources and sinks is connected through a channel. After we finish, we put all of that inside our configuration file. It can be whatever configuration file you like, but you need to point the Flume agent at the location of that configuration so it can interpret it. Once the configuration is finished, for example in our case here, to make this simple, my source is a netcat source. Remember, netcat works as a generator as well; the netcat client can be considered the generator here. So in our case, after I run the command and everything, the agent starts, and after it starts I can use my netcat client to generate content. This is my content, for example, that I generate here: I connect with telnet to the configured port and type some lines. Since that side is my generator and this side is my Flume server, each time there is generated content from the other endpoint, that content can be captured by your sink and stored in your HDFS server. That's why, as I mentioned, the source can be anything; it's not only for text, it can be used for your CCTV or whatever, depending on your implementation. In general, Flume can be used for almost anything that is streaming data, almost all streaming cases. Now, here of course is the configuration: netcat source, HDFS sink, memory channel. For the netcat source, that is your bind address and that is your netcat port. Then this is your HDFS sink, which pushes the data inside the flume directory in HDFS; all of the data will be pushed inside that directory. The write format here is text format, so if you are using binary data such as video or images, then you need to change the write format to accommodate the specific data. Batch size is really important, because streaming batch sizes can range from small to large; an acceptable batch size is around 4096, roughly 4 megabytes per transfer. The batch size of 1000 here is small; technically I made it small because I am only transferring text, which is about 1 kilobyte. But if you are transferring images, videos, or very large files, then you need to increase it to around 4096, which is about 4 megabytes. In our channel here we also set the memory channel's transaction capacity to 100, so only 100 events can be accommodated at a time, meaning that based on our configuration only 100 events per agent can be in flight at any given moment.
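The simpler NetCat example as a runnable sketch. The port 56565, the HDFS path, and the flume-ng invocation are assumptions, kept consistent with the values discussed above (batch size 1000, Text write format, transaction capacity 100).

    cat > netcat.conf <<'EOF'
    NetcatAgent.sources  = Netcat
    NetcatAgent.channels = MemChannel
    NetcatAgent.sinks    = HDFS

    NetcatAgent.sources.Netcat.type     = netcat
    NetcatAgent.sources.Netcat.bind     = localhost
    NetcatAgent.sources.Netcat.port     = 56565
    NetcatAgent.sources.Netcat.channels = MemChannel

    NetcatAgent.channels.MemChannel.type                = memory
    NetcatAgent.channels.MemChannel.capacity            = 1000
    NetcatAgent.channels.MemChannel.transactionCapacity = 100

    NetcatAgent.sinks.HDFS.type             = hdfs
    NetcatAgent.sinks.HDFS.hdfs.path        = hdfs://localhost:49000/user/flume
    NetcatAgent.sinks.HDFS.hdfs.writeFormat = Text
    NetcatAgent.sinks.HDFS.hdfs.batchSize   = 1000
    NetcatAgent.sinks.HDFS.channel          = MemChannel
    EOF

    # start the agent, then feed it lines from another terminal
    flume-ng agent --conf ./conf --conf-file netcat.conf --name NetcatAgent -Dflume.root.logger=INFO,console
    telnet localhost 56565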
The rest of the stream will be put on hold, meaning there is a time lag, until it can be processed. So again, this configuration is really small; in a real-world application you need to increase it so you can handle the lag at scale. Now, how do you compute the transaction capacity? To compute the transaction capacity, you multiply the available memory by the available nodes, assuming the nodes are dedicated to Flume, and then take off 30 percent, multiplying by 0.70, so that only 70 percent of it is used by your Flume nodes and 30 percent is reserved for the system itself. If you have multiple services or different components to consider, then you need to factor those into the computation when assigning capacity. Again, capacity assignment is really important; otherwise your node will be dead in no time. That's the important part. Another thing here is the memory channel: the capacity is assigned on the memory channel. Once everything is set up correctly and you run it, this is how it actually looks; this is a running application, I think from a screenshot. If you run the telnet client, you get this written data, and so on and so forth. Then, if you check your flume data directory, because I put the output inside the flume directory, you will see that all of your streamed data is actually written right there. This is written data, distributed across your nodes, which means the data is highly available and usable by the downstream endpoints as well.
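A worked instance of the sizing rule above; the figures are hypothetical (8 GB of memory per node and 3 dedicated Flume nodes):

    # usable channel capacity = memory per node x nodes x 0.70 (30% reserved for the system)
    echo $(( 8 * 3 * 70 / 100 ))   # -> 16 GB usable by the Flume channels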
Now, aside from Flume, we have Hive. Hive is a data warehouse infrastructure. Hive was initially developed by Facebook and later turned over to Apache, where it was developed further, for obvious reasons, under the Apache Hive brand. Today it is still called Apache Hive, but I think the client changed to Beeline, so the terminal you see today is Beeline: before, the terminal you could see was the Hive CLI; today I think it's Beeline. That's the change there; actually a lot of changes have happened over the years. Hive is not a relational database and it is not designed for online transaction processing; it is not a language for real-time queries and row-level updates. Technically, it stores the schema in a database and the processed data in HDFS, so you can think of it as a schema database plus HDFS. It provides an SQL-type language, like any other SQL; the formal language is called HiveQL.

So how does it work? This is the architecture of Hive. That's why, when you install Hive, you will notice in my compose file that I create two instances; if you open my compose file, there are two instances, one for the metastore and one for the Hive server. In this architecture, the user interface is where Hive, as data warehouse infrastructure software, creates the interaction between the user and HDFS. The user interfaces that Hive supports are the web UI, the Hive command line, and HDInsight. The default is the Hive command line; today we call it the Beeline command line. Next is the metastore. The metastore is where Hive chooses a database server to store the schema, or metadata, of the tables and databases: the columns in a table, their data types, and the mapping to HDFS. On the right side there is the HiveQL process engine. The HiveQL process engine is similar to SQL: it queries based on the schema information in the metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program, we write a HiveQL query for the MapReduce job and have it processed; rather than writing MapReduce by hand, we use HiveQL to express it as a query. Then there is the execution engine: the conjunction of the HiveQL process engine and MapReduce is the Hive execution engine, which processes the query and generates the same results as MapReduce would. You will notice that if you use a MapReduce job, the logic is almost the same as what the HiveQL process engine produces, and the output of the reducer is the same as the output of the execution engine. Finally, HDFS or HBase is our base data storage, where we store our data. That is the architecture of Hive.

So how does the flow work? Technically we have Hadoop on the right side and Hive on the left side. First, the Hive interface executes the query and sends it through a database driver such as JDBC or ODBC. Then it gets the plan: the driver asks the compiler to check the syntax and the query plan, the requirements of the query. Then it gets the metadata, sends the plan, executes the plan, and finally executes the job in Hadoop. Once the result has been produced, Hadoop fetches the result, the result goes back to the execution engine, and of course the driver sends the result up to the Hive interfaces. That's how it works. Now, for the installation... ah, okay, so this is the architecture so far. In general, to make it clear: this interface is the one you work with, using the Hive command line; it passes the query down, and in fact the way the reducer works is the same as with Hive, because the logic is the same, the flat file gets converted through a query-type process. Anyway, all the details are in the slides. Actually, are you still using Hive, or Beeline already?
Beeline. So technically Hive and Beeline are exactly the same; I don't see any difference, actually. They are exactly the same, and that is why, when you work with the different user layers, you will be comfortable switching between them: Hive, Beeline, and so on are effectively the same. Creating tables, even creating columns with types like decimals, unions, and floats, is exactly the same; the syntax for creating them is the same. So all of this is just the same; the concepts are the same. When you create a database and then browse, it simply appears as a folder in the Hadoop warehouse directory under your current user, whether you are using Hive or Beeline, and of course you can drop it and so on; it's the same. You can also create temporary tables, nothing special there. You can use LOAD DATA, which lets you upload data to the server; the sample data is uploaded to the warehouse directory and then added to our database. All of this is generic across the interfaces, whether you're using Beeline or Hive; exactly the same. The syntax for queries, if you're familiar with SQL, is of course the same, so if you're good with SQL, then technically all of this is just the same.
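A sketch of those basic statements run through Beeline over JDBC; HiveServer2 is assumed on localhost:10000, and the database, table, and input path are illustrative:

    cat > hive_demo.sql <<'EOF'
    CREATE DATABASE IF NOT EXISTS userdb;
    CREATE TABLE IF NOT EXISTS userdb.employee (id INT, name STRING, salary FLOAT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE userdb.employee;
    SELECT * FROM userdb.employee LIMIT 10;
    EOF
    beeline -u jdbc:hive2://localhost:10000 -n root -f hive_demo.sql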
So I think the only thing that I would suggest for you guys to explore is this: if you want to go further with Hadoop, you can explore the other components, and to be able to explore the other components of Hadoop, since today we are using containerized applications, I would suggest that you explore further how to use containers, because containerization lets you integrate different types of components into one big, full-fledged Hadoop application. Technically, Hadoop is just a file system; what makes Hadoop powerful is its components. If you can integrate those components into Hadoop, then you can visualize and further explore the power of Hadoop and further optimize how you use it. I guess that's my last word; if you have any questions, you can ask.

Yeah, the PowerPoint is actually your step-by-step instruction. What we did today is a slightly more advanced Hadoop: how to configure your server automatically. In the PowerPoint, that is how you configure your Hadoop manually, which takes more time. So there are two ways to create your Hadoop: number one is manually, and number two is using Docker, a containerized application. Technically, the steps for how to create the MySQL server and everything else are already in the docker-compose.yml; that's actually the flow. If you open docker-compose.yml, okay, let me show you: let's say this is our file, right? This is our Docker setup. Everything here is pretty much self-explanatory; it is the process for creating each service. The only thing that I did not put there is how to run it, and how to run it I put inside the Dockerfile for Sqoop. You can see there at the top how to build it: docker compose build, and docker compose up -d. So the first step is that you need to put the three files into one folder; after you put them there, you execute step one, the build, and then step two, the up, and after you execute that, boom, it works like magic: every server is ready to use. Now, once everything is ready, you use your Docker commands; remember what I typed, such as docker exec. To connect to the Hadoop master, you simply run docker exec -it hadoop-master bash, and to connect to the MySQL server, you simply type docker exec -it mysql-server bash, something like that. This command simply takes you inside the container. Once you are inside the container, the rest of the commands, the ones in the PowerPoint that I released, from start to finish, are the same official steps. It just so happens that we took a shortcut: instead of building manually, we built automatically. That's why my final suggestion is this: if you want to focus on your programming and your development in Hadoop, you can use this docker-compose to expand your knowledge, because each of these services represents a component, and the more components you explore and inject into my docker-compose, the better the knowledge you will build for yourself. So as much as possible, I want you to improve this docker-compose that I gave you. It is almost complete already, I can say, but you can further improve it, for example by implementing auto-scaling of Hadoop, something like that; this one here is fixed scaling, but you can improve it into auto-scaling with just a few little tweaks. Okay, any more questions? Anyway, normally this topic is a bit advanced; that's why I introduced it to you. I was not supposed to touch this part, but since our server on the last day was so annoying, I had no choice but to introduce you to the more advanced approach. Okay, any more questions? Is everything good so far? I think you have every file ready, right? Just to confirm before we end: you have this, right? Okay. So, thank you very much and have a great day, everyone. That's it for today; take care. Bye bye.
on 2024-09-18