Practical Hadoop
language: TL
WEBVTT The core-site file is where you specify the location of the name node, and that location is what the HDFS and MapReduce components use to find it. As you can see here, by default it is empty, so we need to provide the value. Now, the next step after I update the core-site and the hdfs-site is the mapred-site. The mapred-site is the file used to hold the variable for the location of the job tracker. So this time we will modify the mapred-site. The purpose of the mapred-site is that this is the configuration where we configure the job tracker location; it gives us the power to configure the job tracker, just like the core-site holds the location of the name node. Because only the MapReduce component needs to know this location, mapred-site.xml is the file that we need to edit. The network addresses for the name node and the job tracker specify the ports to which the actual system requests should be directed. These are not user-facing locations, so we don't need to bother pointing a web browser at them; the web interfaces are something we will look at in our next activity. Okay, so we need to copy this job tracker property and paste it inside our mapred-site. Remember, this is the third file that we edit: the first one is the core-site, the second one is the hdfs-site, and the third one is the mapred-site. Again, the hdfs-site is used to handle the data node, the core-site is used to handle the name node, and the mapred-site is used to handle the job tracker node. This is where we configure things in such a way that Hadoop can locate each specific component of your Hadoop architecture. Now, in our case here, the next process is that we will create a directory. What's the purpose? Why do we need to create the directory? We need to create a hadoop directory inside /var/lib, writable with mode 777 (read, write, execute), so data can be written to that directory; it backs the temporary directory property, the data that Hadoop needs to store. After we create the directory, we make it globally readable and writable: sudo chmod 777 /var/lib/hadoop. That way we avoid a permission error later on when Hadoop tries to write files to that directory, specifically the temporary data files. Once we create the directory, we can add a new property to the core-site that tells Hadoop to use this directory whenever it needs a temporary directory to write to. To do that, we copy the property and update the core-site, adding the new property that tells Hadoop to use the directory we created. So what's the next step after this?
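A minimal sketch of the three configuration files and the writable temporary directory described above, assuming Hadoop is installed under $HADOOP_HOME with its configs in etc/hadoop/ (conf/ on older releases). The port 49000 matches the session; the job tracker port 49001 and the replication factor of 1 are illustrative assumptions.

    # core-site.xml: name node location plus the temporary directory property
    cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:49000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/lib/hadoop</value>
      </property>
    </configuration>
    EOF

    # hdfs-site.xml: data node settings (single node, so replication 1 is assumed here)
    cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    EOF

    # mapred-site.xml: job tracker location (port 49001 is a placeholder)
    cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:49001</value>
      </property>
    </configuration>
    EOF

    # writable temporary directory so Hadoop can store its temporary data
    sudo mkdir -p /var/lib/hadoop
    sudo chmod 777 /var/lib/hadoop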
So the next step after this is that we need to format our name node. Why? Because before we can start Hadoop in either pseudo-distributed or fully distributed mode for the first time, regardless of which mode it is, we need to format that storage to become an HDFS file system. You can imagine it like installing an operating system on a hard disk: the first time you install the operating system on your hard disk, you need to format the disk as FAT32, exFAT, NTFS, and so on. It is the same idea here. To be able to use a certain directory for our HDFS, we need to format it. The command to format the name node can be executed multiple times, but in doing so, any existing file system will be destroyed, because of course it gets formatted. It should only be executed while the Hadoop cluster is shut down, and there are cases where you will want to do it, such as when you want to irrevocably delete every piece of data in your Hadoop file system. It does not take much longer on larger setups. When you see the message that the location has been formatted, it means the Hadoop distributed file system has been created successfully. That's the most important part. If you cannot see that response at your end, there is a problem with your configuration and you need to look again at the notes I gave you in the notepad. You'll notice that the notepad I gave you is a special set of notes of mine; if you follow it exactly, verbatim, you can set up your Hadoop successfully. You can save that notepad at your end, or I can email it to you, whichever you prefer, because these are my personal notes for setting up Hadoop. Aside from my PowerPoint, which explains everything one by one, this is the clean version: if you execute it line by line as per the instructions, you can set up your Hadoop server. Got it? Now, going back to our discussion: if you successfully installed your Hadoop, the next step is that we need to check whether your Hadoop is really working. How do we check? First, of course, we need to start Hadoop. Unlike the local mode of Hadoop, where all components run only for the lifetime of the submitted job, with pseudo-distributed or fully distributed mode the cluster components persist. This setup is actually pseudo-distributed. Why pseudo-distributed? Because the setup I made here contains only one node: the machine itself is the node, the server, the master, and the slave all at once. So it is a pseudo-distributed configuration. Unlike the job-based or local mode of Hadoop where we executed the jar file before, which we call local mode, the one we are setting up right now is a pseudo-distributed setup, and its cluster components exist as long as the processes are running. So before we use HDFS or MapReduce, we need to start up all of the needed components. To start our Hadoop server, all we have to do is run the start command; once you see "starting datanodes" and "starting namenode", it means you now have a name node and a data node.
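A short sketch of the format-and-start step just described, assuming the Hadoop bin/sbin directories are on the PATH:

    # format the name node (destroys any existing HDFS data), then bring HDFS up
    hdfs namenode -format
    start-dfs.sh        # expect "starting namenode" and "starting datanodes" messages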
Now, the start-dfs command, as the name suggests, starts all of these available components. The name node manages the file system and the single data node holds the data: the name node manages the system, the data node holds the data. After we start these components, we can use the JDK's jps utility to see whether there are Java processes currently running in our system. With jps, the Java process status tool, we can check which processes are currently running. By typing jps, we can see here that we have the DataNode, the SecondaryNameNode, and the NameNode. To be exact, we have a data node, a secondary name node, and a name node, which is exactly what we expect, because those three are the components we need to transact with Hadoop. Now, before we proceed to use the file system, let me verify that you're already there. Okay, very good. Did you format the name node before you executed start-dfs? Correct. Okay, so start-dfs: starting name nodes, a warning about a permanently added host, then it proceeds with the execution. The next thing we need to do to make sure everything was configured properly is to list the Hadoop file system directory. To do that, all we have to do is type hdfs dfs -ls /, meaning we want to list all the files in our Hadoop distributed file system, specifically all the files in the root directory. That's how we interpret this command. Now, is this command also available on EMR, the Amazon Hadoop service? Yes, all of these commands are valid there. Even if you use Spark, or Azure, or Google's big data services, these commands are still valid, because the HDFS commands are a common language for the Hadoop ecosystem. Meaning all of the commands here are also valid on any other Hadoop system. Now, after we verify that we can use the Hadoop file system, the next step is to try YARN. YARN means Yet Another Resource Negotiator. YARN is a resource management framework that allows multiple data processing engines to run on a shared Hadoop cluster; it gives you the power to manage multiple resources under one roof. That's why we call it YARN, Yet Another Resource Negotiator. To run YARN, we need to execute start-yarn.sh. As you can see: starting resource manager, starting node managers, like that. Remember, before this we had those three components already running; now that we started YARN, we expect another two processes running in our system. To check whether all of those are running, we can use jps again. Hmm, the YARN processes are not showing in my jps output; I only see the data node, the secondary name node, and the name node, so YARN has its own separate execution layer. Now, let's say we decide to temporarily stop all of the services that we started. What's the command?
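The verification steps above, gathered as commands; a sketch assuming the same single-node setup:

    jps                 # expect NameNode, DataNode, SecondaryNameNode
    hdfs dfs -ls /      # list the root of the Hadoop file system
    start-yarn.sh       # starts the ResourceManager and NodeManagers
    jps                 # check the running Java processes again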
So when do we stop the services in our Hadoop? You can use the stop command if you want to reformat your HDFS instance, your HDFS data file system; you first need to stop the running processes to avoid premature replication. To stop, all we have to do is run stop-all. As you can see there, it stopped the name node, the data node, YARN, the secondary name node, and so on. Now, what if, after we reformat, we decide to start the services again? What's the command? The command, of course, is start-all.sh. Running this command will start all of your Hadoop services plus YARN. Once you have started all the services, you can check with jps again, and you should see the name node, the secondary name node, and the data node. That's good. Now, if there is any issue, such as the data node not being found, or an error when you run jps, then you can reformat: execute the HDFS name node format again to fix that issue, if there is one. Okay, now let's check our configuration again and verify whether your Hadoop is working. The first step is to try to create a new directory, a user directory. We can type hadoop fs -mkdir. As you notice, there is always a dash before the file manipulation or file system command. So: hadoop fs -mkdir /user, creating the user directory from the root. If everything is okay, then no error is raised by that execution. Let's say I add a further directory inside user: inside /user I add a new directory, hadoop. If I execute that, there's no error. Now let's say I create my own user directory and name it, for example, vm1. Then I can see there is a vm1. Furthermore, I can check whether the files and folders exist by using ls: hadoop fs -ls /user, for example. I'm expecting /user/hadoop and /user/vm1, meaning that the directories you defined after configuring your Hadoop were successfully created, and that is an indication that your configuration is fully working. If you cannot execute up to this point, that's okay. Meaning... "failed on connection exception: java.net.ConnectException". Okay, so that means you have a connection refused error; for more details, see the link shown in the error. Let's check your setup: you need to recheck your mapred-site.xml. Did you update your mapred-site? Because that error, I think... excuse me. Ah, yes, this one. Can you see my shared screen? Okay, can you try to open the file that you have? Okay. Now, can you try to open your core-site? Yes, the core-site. Okay, you have the core-site with port 49000. Okay. Can you try to open your hdfs-site? Your hdfs-site. Okay, that's good.
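The stop/start cycle and the directory check described above, sketched as commands (vm1 is just the example name used in the session):

    stop-all.sh                     # stop HDFS and YARN
    start-all.sh                    # start everything again, including YARN
    hadoop fs -mkdir /user          # note the dash before every file-system command
    hadoop fs -mkdir /user/hadoop
    hadoop fs -mkdir /user/vm1
    hadoop fs -ls /user             # should list /user/hadoop and /user/vm1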
Okay, now, can you exit there? Just want to make sure: did you execute this part of the commands, the sudo mkdir for /var/lib? If not, try to execute this one. Try it again: sudo mkdir, then try to change the mode. Okay, that's fine. Now, make sure to change the mode to 777. Okay. Now, one more thing that I want to test: this one, the sudo rm -r. That one should only be executed if you had an error, such as jps not finding the data node or anything like that. The configuration in the core-site, you mean? Okay, try to open your core-site; let me verify it. I think I saw it, and I think that's good. One thing that maybe is not working... okay, try to open it. Yeah, I think it's there. Correct, that's 49000; the default file system name with 49000. Yes, and for the mapred, that's correct. Now, I suspect there is an issue with SSH. Okay. Exit; type exit. Okay, now try to run the ip command to check your address. I don't know; I'm talking to Lavinia. Okay, now try to type ssh 10.0.3.16. Okay, it connects, so it is something else. Yeah, you need to exit. Okay, so it's a different issue so far. Let's try this: yes, try to stop everything first, stop-all. Okay, follow my lead. First, stop-all. Next, execute the sudo rm -r /var/lib/hadoop; let's clean your configuration. So sudo rm -r, and after that, let's execute this one. If you can see my screen, the one that I highlighted, you need to execute that. Okay, execute that one. Then execute this one. Yes. Oh, I see, there's an issue: it says cannot create. There: "cannot create directory /var/lib/hadoop". That's the issue there. To be able to fix that, you need to execute this one first, the sudo rm. Okay, execute the sudo rm first. Yes. No, /var/lib/hadoop. "Cannot remove: no such file or directory." Okay, can you try again? Ah, okay. Now execute this one. Yeah. Okay, now execute this part. Now execute the format again. Okay. Now, after you format, let's try to start the DFS. Let's start one component at a time so that we can track down the error first. So start the DFS first; if we can see all of the DFS processes, that means we are on the right track. Okay: data node, name node, secondary name node. Okay, type jps. Okay. Next, execute the
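The clean-and-rebuild recovery sequence used in this troubleshooting, collected in one place; a sketch that follows the session's /var/lib/hadoop convention:

    stop-all.sh
    sudo rm -r /var/lib/hadoop      # wipe the old temporary data
    sudo mkdir /var/lib/hadoop
    sudo chmod 777 /var/lib/hadoop
    hdfs namenode -format
    start-dfs.sh
    jps                             # NameNode, DataNode, SecondaryNameNode should appear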
on 2024-09-18
language: EN
WEBVTT So while that is uploading, I have already created the user and the home directory. Here I already created the home and the user. The next step here is to generate the key. Now, the next step is to create the link for Java. So that's done. Now, remember we already created the three certificates, meaning the three certificates are now ready to be deployed on our machines. To be able to deploy a certificate, we first go to our VM2. This is our VM2; we go inside and modify the certificate file. Here we need to update the certificate in such a way that it is available through the authorized_keys file. So we run sudo nano, then we paste the certificate: we copy this one and paste it here. That's it. Then we save and exit. And let's check: I see we have three certificates, for VM2, VM3, and VM4. We'll do the same work for the other certificates, because these will be the keys that let the machines communicate with each other. Yes, we do this on all three machines. The reason we do that is because each machine needs to communicate with the others, right? Via SSH. To be able to communicate with each other, each of the machines holds a certificate. Correct? Without that certificate, of course, they cannot talk to each other. You cannot upload? Ah, okay, the reason for that is that you are logged in as hadoopadmin, correct? Okay, so each of the machines now has a key. To test whether the machines can communicate with each other, we SSH from VM2 to VM3 and VM4, and vice versa. So: ssh vm3, type yes. Okay, that means I am now connected to VM3. I will try to connect to... ah, sorry, not VM1, not VM2: VM4. What did I connect to last time? So I started with VM3; I am connected to VM3. Yes, and now I will connect to VM4. Yes. So as you can see, I can connect to the different machines, right? So here, I will update my hosts file. Now, what I just need to enable in my hosts file: actually I will only add, only update, a few lines here. Okay, and that's it. That's the final touch, and we can now go to our master. We only run start-dfs on the master; we do not run those HDFS commands on the slaves. All of the commands must be run on the master only. Okay, so cross your fingers; hopefully there are no issues. Let's try. Ah, okay. Yeah, of course, we need to use sudo. Not move, I think. You are moving the directory, right? Because you uploaded the directory, I remember. You want to move that inside the local directory, correct? So we can directly use the move command; instead of creating a directory, you don't need to, you can copy it directly. Ah, okay, so now it's moving. The hdfs-site, wait, the hdfs-site: you only need to update it. You can check my screen. You need to update only this one: where it says slave1, you need to change that into the name of your node.
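A sketch of the key generation and distribution being done here, assuming the hadoopadmin user and the vm2/vm3/vm4 hostnames from the session:

    # on each machine, generate a key pair for the hadoopadmin user
    ssh-keygen -t rsa -P ""

    # add the public keys to ~/.ssh/authorized_keys on every machine
    # (here by appending; in the session the keys are pasted in with nano)
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    # test password-less access between the machines
    ssh vm3
    exit
    ssh vm4
    exit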
So for example, VM4. But this is only applicable on the slaves; you don't need to apply this configuration on the master. The core-site still points to the master for all the instances. The only file that we edit in VM3 and VM4 is hdfs-site.xml; the rest of the files inside the Hadoop configuration remain as is. Okay? They remain as is. I also need to change the IP so that it points to VM4; thank you for reminding me, I might have forgotten.
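A sketch of the per-slave edit described above: replace the slave1 placeholder in hdfs-site.xml with the node's own name. The placeholder name comes from the session's template; adjust the path to your install.

    # run on VM4 only; VM3 uses its own name instead
    sed -i 's/slave1/vm4/g' $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    # core-site.xml is left untouched: it still points every instance at the master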
on 2024-09-18
language: TL
WEBVTT So first, we need to make sure that Sqoop is installed. To check whether Sqoop is installed, we check the available containers that we can enter. Based on the services we created in our docker-compose file, if we check the compose file and scroll down, okay, here we have this hadoop-sqoop container. Now we need to get inside that hadoop-sqoop container: docker exec -it hadoop-sqoop bash. Yeah, let me check mine... oh. So today is not my lucky day: my Sqoop container is not working, while on your end it is working. Okay, what happened to my Sqoop? Let me check docker ps; we can check whether the container is running by typing docker ps. Based on my docker ps, yes, my Sqoop container is not running. So maybe there's a problem with my Sqoop image. If this is the case, as I mentioned before, it is really easy for us to trash the Docker setup and rebuild it with Docker Compose: cd into the hadoop directory, then docker compose down. Okay, let me check the error: a Java exception, connection refused. So: docker compose build sqoop. Let me check my Sqoop... ah, okay, docker compose build sqoop, then docker compose up. Let's check the build. Okay, I think it's running now. Okay, so let's have a quick break; see you after 15 minutes. Ta-da! So, meaning, we will execute the Sqoop import using the JDBC driver. Let's import the data from the database on the server ending in .0.2. Let's check the server: docker inspect mysql-server. So it ends in .0.2, correct: 172.18.0.2. So this means we will import the data from the database server 172.18.0.2, using the username root, reading the table emp, asking for the password, from the userdb database, with the MySQL JDBC driver. We need to copy this command and execute it inside the Sqoop container. Ah, yeah, let's try to copy. I'll just close and reopen the terminal; sometimes what I do is close this and reconnect or reopen, and then it works again. The driver file is there in the code, so it should download. It's actually there; I checked the code. So we need to rebuild, because it is downloaded during the build. There is supposed to be a sqoop folder here, and inside the sqoop folder there is an installation file for Sqoop. Let's try to rebuild it again; if not, then we need to download the Sqoop file manually. That's what sometimes really happens. Just to make sure that everything is clean, I usually do this: docker compose build sqoop, then docker compose up. There you go. If that happens, I'm not sure why it did not download; normally this is automated, normally it will download this file. That file normally resides here, because that's in our instructions. For some reason it didn't get the file, but technically that's what the project does: it gets the file, untars it, moves all the contents inside the sqoop directory, then deletes the tar.gz file. Since it did not execute all of that, I'm going to move it here manually. Now, if we check back in our code here, there's a jar file; that's supposed to be the second operation.
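The container checks, the rebuild, and the Sqoop import described above, gathered as a sketch. The container names (hadoop-sqoop, mysql-server), the userdb/emp names, and the 172.18.0.2 address follow the session; the compose directory and the -m 1 mapper count are added assumptions.

    docker ps                                    # is the hadoop-sqoop container running?
    cd ~/hadoop                                  # folder holding docker-compose.yml (adjust)
    docker compose down
    docker compose build sqoop
    docker compose up -d

    docker inspect mysql-server | grep IPAddress # e.g. 172.18.0.2

    # inside the Sqoop container: import the emp table over JDBC
    docker exec -it hadoop-sqoop bash
    sqoop import \
      --connect jdbc:mysql://172.18.0.2/userdb \
      --driver com.mysql.jdbc.Driver \
      --username root -P \
      --table emp -m 1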
There's a jar file here that should have been downloaded; you can paste it there. And there is another jar file. These two jar files are supposed to be moved inside the lib directory. So we go here, go inside the lib directory, and move the files. Okay. Now we can try again: docker exec -it hadoop-sqoop bash. Now we can check sqoop-version. Based on the configuration, our Sqoop home is in /opt/sqoop. Then... I think that's it. So we can now check sqoop-version. Okay, we can actually copy that command and reuse it. Did you put the files inside now? Okay. So try to execute... sorry, sorry, sqoop-version, not hadoop-version. Okay. Since we have the same problem: if you have that problem, you can execute the command that I put in the chat. Let's try again to run sqoop-version. Okay, there are some warnings because ZooKeeper is not installed. Actually, ZooKeeper is a different service, so it should be in a different container. HCatalog is also a different service and should be in a different container, as well as Accumulo and HBase. Those are actually separate services: HBase, HCatalog, Accumulo, ZooKeeper, and likewise Sqoop and Hive are separate containers. Technically we did not install or configure ZooKeeper, and we don't have Accumulo, so we normally receive those warnings. HBase we haven't installed either, so we don't have a container for it. If you want a full-fledged Hadoop, then you need some of those. ZooKeeper is simply a component that tracks the members of the network; it checks the availability of the nodes. We call that auto-discovery: ZooKeeper is the auto-discovery component. HBase is like a database on top of Hadoop, a NoSQL database, if I describe it right. Those are different components that you would install using different containers. Now, since in our case we will only use Sqoop and Hadoop, this instance is already good. Now, let's try to send a query through Sqoop. Here, let's paste the code from before; remember, we typed this code here before. You need to go inside MySQL first. You will notice that... okay, let me show you. At the top here, you see this one? docker exec hadoop-sqoop, right? Meaning this window is for Sqoop, and this one, yes, this one is for the Hadoop master. To be able to use MySQL, you need to open another terminal. So you open the terminal, then you go inside MySQL: docker exec, then -it, and after the -it we have the mysql-server container, right? Then we go inside bash. Once you are in bash, you can now type mysql. The MySQL user will be root, and your password will be the password that you put in the docker-compose file, which is the root password. You can review that inside docker-compose.yaml. Once you are in MySQL, you can execute that MySQL command. Okay, I think it's ready. We can have lunch for now, so see you after lunch. Bye bye. Ta-da!
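A short sketch of the two terminals described above, one for the Sqoop container and one for the MySQL container; the root password is whatever was set in docker-compose.yaml:

    # terminal 1: verify Sqoop inside its container
    docker exec -it hadoop-sqoop bash
    sqoop-version        # warnings about HBase/HCatalog/Accumulo/ZooKeeper are expected

    # terminal 2: go inside the MySQL container and log in
    docker exec -it mysql-server bash
    mysql -u root -p     # password comes from docker-compose.yaml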
on 2024-09-18
language: TL
WEBVTT The list-tables command; and now this list-databases command is like the SHOW DATABASES command in our SQL as well, so you can use sqoop list-databases, or sqoop list-tables. It is totally the same idea as SHOW DATABASES; for example, this list-databases output.
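A sketch of the Sqoop list commands just mentioned, reusing the MySQL connection details from earlier in the session:

    # equivalent of SHOW DATABASES
    sqoop list-databases \
      --connect jdbc:mysql://172.18.0.2/ \
      --username root -P

    # equivalent of SHOW TABLES for the userdb database
    sqoop list-tables \
      --connect jdbc:mysql://172.18.0.2/userdb \
      --username root -P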
Okay, so the next one is Apache Flume. Apache Flume is a tool for aggregating and transporting large amounts of streaming data; primarily it is used for streaming records. It is used to copy streaming data, from log data or from any other web servers, directly into our HDFS. So what happens is this: let's say there is an event log data generator; the events from the generator pass through Flume, and Flume processes that data and pushes it to the centralized storage, which is the Hadoop file system, the HDFS. The application of this, of course, is when you are dealing with data logging or analytics that require high throughput; in that case it would be good for you to start with Flume and explore it. So its applications are data analytics, data streaming, or data logging where you want to store the data in the HDFS server before applying data mining or data analysis. That, so far, is the purpose of the Flume component in Hadoop. One of the advantages of Flume is that it can import very large volumes of data, and Flume scales horizontally, meaning it can scale out almost without limit depending on your configuration. Most data, of course, can be analyzed from processed files or log files, but since some data comes from streaming sources, you need to process that data before you can apply analysis. The problem with HDFS is that HDFS cannot handle real-time ingestion; that's why we need Flume to process the streaming data. Again, HDFS cannot handle streaming data by itself, and that's why there is Flume.

So the architecture of Flume is like this: let's say we have streaming data coming from data generators. Flume creates an agent, and that agent works like a traffic enforcer: it decides, based on the data it sees, where that data goes and what will happen to it before it is passed to the collector. Once it is in the collector, the data can be considered stable, and that stable data is pushed to the centralized store, which is our HDFS. In between, Flume works like a middleman handling the events. Technically Flume is an event manager; it handles events based on their headers. An agent is an independent daemon in Flume. The source in our diagram is a component of an agent which receives data from the data generators and then transfers it to a channel in the form of Flume events; that's why it's an event manager, because Flume has a small event manager that handles those signals, that data. The channel is a transient store which receives the events from the source and buffers them until they can be consumed by our sink. So the channel acts like a bridge between the sources and the sink, while the sink stores the data in a centralized store like HDFS, or whatever data file system you want to store it in. That's how it works.

So the source there could be, for example, the Twitter source, the Thrift source, or the Avro source; our channel could be the JDBC channel, the file channel, or the memory channel; and our sink is the HDFS sink. A Flume agent can have multiple sources, sinks, and channels, because each of those represents a stage of the flow. As you can see here, Flume could also have another agent, because agents work like traffic enforcers: they hold the data, align the data, and once everything is ready they pass it to the collector, and the data collector passes it to the storage for final storage. That's why Flume can handle streaming data: it has a lot of agents to handle the data that comes in and goes out. Now, for installation, I think I did not include Flume in the setup, so I will just go through the overview of the configuration. For the configuration, we describe the source, the sink, and the channel; those are the three important ones. Then we bind the source and the sink to the channel. Those three need to be configured and filled in, in such a way that you can handle the flow in your Flume. For example, in our case here we have different types of sources, as I mentioned: Avro, Thrift, Exec, JMS, Twitter, NetCat, Kafka, syslog, and HTTP sources, plus legacy sources and custom sources; it can even support things like video sources. For the channels, Flume uses different types of channels for the processing; all of this is handled by the agent, except the sinks. For the channels we have the memory channel, the JDBC channel, Kafka, the file channel, the spillable memory channel, and so on. For the sinks, we have different types of output sinks that we can use, such as the HDFS, Hive, Logger, Avro, IRC, and HBase sinks. Those are the final outputs of the delivery, in such a way that HDFS can consume the final output. In our case here, we have the Twitter source and the memory channel, so the Twitter agent uses the memory channel, and the final sink that handles the data is HDFS. These are the parts that really matter when you are dealing with the configuration of your Flume. Now, once Flume has been configured, you need to provide the values for the source: it could be a consumer key, a consumer secret, an access token. In our Twitter agent example, those actually come from Twitter, because long ago Twitter supported this kind of streaming access; I'm not sure if they still support these features, because last year, or this year, they made massive changes after it became X.com, so this is just an example. As I mentioned, the source can be anything: it can be your web server, your logs, or whatever. Now, after you define your keys, your values, and everything, you can define your sink. For this sink, since we are using HDFS, we set TwitterAgent.sinks.HDFS.type to hdfs, and of course the hdfs path, which is the location in HDFS where we are going to store the data.
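A sketch of a Twitter-style agent definition matching the source, sink, and channel just described. The keys are placeholders, and the source class and HDFS path are assumptions; the channel binding shown here is explained right below.

    cat > twitter.conf <<'EOF'
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    TwitterAgent.sources.Twitter.type              = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.consumerKey       = YOUR_CONSUMER_KEY
    TwitterAgent.sources.Twitter.consumerSecret    = YOUR_CONSUMER_SECRET
    TwitterAgent.sources.Twitter.accessToken       = YOUR_ACCESS_TOKEN
    TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
    TwitterAgent.sources.Twitter.channels          = MemChannel

    TwitterAgent.channels.MemChannel.type = memory

    TwitterAgent.sinks.HDFS.type      = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:49000/user/flume/twitter
    TwitterAgent.sinks.HDFS.channel   = MemChannel
    EOF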
You can imagine how this works with Twitter: before, whenever someone posted something, it generated logs, logs from your Twitter account, and you can capture those logs and push them to your HDFS. So if your Twitter account has a lot of updates, with logs generated second by second, you can capture those and store them in your HDFS for analysis. Now, after you define the HDFS sink, you can define your channel. This channel is used to transfer the data between the sources and the sinks, as per the structure of our Flume. In our case, for example, we are using the memory channel. All of this configuration is based on the Flume architecture: we bind the HDFS sink to the channel and the Twitter source to the memory channel, because each of these sources and sinks is connected through a channel. After we finish, we put all of that inside our configuration file. It can be whatever configuration file you like, but you need to point the Flume agent at the location of that configuration so it can interpret it. Once the configuration is finished, for example in our case here, to make this simple, my source is a netcat source. Remember, netcat works as a generator as well; the netcat client can be considered the generator here. So in our case, after I run the command and everything, the agent starts, and after it starts I can use my netcat client to generate content. This is my content, for example, that I generate here: I connect with telnet to the configured port and type some lines. Since that side is my generator and this side is my Flume server, each time there is generated content from the other endpoint, that content can be captured by your sink and stored in your HDFS server. That's why, as I mentioned, the source can be anything; it's not only for text, it can be used for your CCTV or whatever, depending on your implementation. In general, Flume can be used for almost anything that is streaming data, almost all streaming cases. Now, here of course is the configuration: netcat source, HDFS sink, memory channel. For the netcat source, that is your bind address and that is your netcat port. Then this is your HDFS sink, which pushes the data inside the flume directory in HDFS; all of the data will be pushed inside that directory. The write format here is text format, so if you are using binary data such as video or images, then you need to change the write format to accommodate the specific data. Batch size is really important, because streaming batch sizes can range from small to large; an acceptable batch size is around 4096, roughly 4 megabytes per transfer. The batch size of 1000 here is small; technically I made it small because I am only transferring text, which is about 1 kilobyte. But if you are transferring images, videos, or very large files, then you need to increase it to around 4096, which is about 4 megabytes. In our channel here we also set the memory channel's transaction capacity to 100, so only 100 events can be accommodated at a time, meaning that based on our configuration only 100 events per agent can be in flight at any given moment.
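The simpler NetCat example as a runnable sketch. The port 56565, the HDFS path, and the flume-ng invocation are assumptions, kept consistent with the values discussed above (batch size 1000, Text write format, transaction capacity 100).

    cat > netcat.conf <<'EOF'
    NetcatAgent.sources  = Netcat
    NetcatAgent.channels = MemChannel
    NetcatAgent.sinks    = HDFS

    NetcatAgent.sources.Netcat.type     = netcat
    NetcatAgent.sources.Netcat.bind     = localhost
    NetcatAgent.sources.Netcat.port     = 56565
    NetcatAgent.sources.Netcat.channels = MemChannel

    NetcatAgent.channels.MemChannel.type                = memory
    NetcatAgent.channels.MemChannel.capacity            = 1000
    NetcatAgent.channels.MemChannel.transactionCapacity = 100

    NetcatAgent.sinks.HDFS.type             = hdfs
    NetcatAgent.sinks.HDFS.hdfs.path        = hdfs://localhost:49000/user/flume
    NetcatAgent.sinks.HDFS.hdfs.writeFormat = Text
    NetcatAgent.sinks.HDFS.hdfs.batchSize   = 1000
    NetcatAgent.sinks.HDFS.channel          = MemChannel
    EOF

    # start the agent, then feed it lines from another terminal
    flume-ng agent --conf ./conf --conf-file netcat.conf --name NetcatAgent -Dflume.root.logger=INFO,console
    telnet localhost 56565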
The rest of the stream will be put on hold, meaning there is a time lag, until it can be processed. So again, this configuration is really small; in a real-world application you need to increase it so you can handle the lag at scale. Now, how do you compute the transaction capacity? To compute the transaction capacity, you multiply the available memory by the available nodes, assuming the nodes are dedicated to Flume, and then take off 30 percent, multiplying by 0.70, so that only 70 percent of it is used by your Flume nodes and 30 percent is reserved for the system itself. If you have multiple services or different components to consider, then you need to factor those into the computation when assigning capacity. Again, capacity assignment is really important; otherwise your node will be dead in no time. That's the important part. Another thing here is the memory channel: the capacity is assigned on the memory channel. Once everything is set up correctly and you run it, this is how it actually looks; this is a running application, I think from a screenshot. If you run the telnet client, you get this written data, and so on and so forth. Then, if you check your flume data directory, because I put the output inside the flume directory, you will see that all of your streamed data is actually written right there. This is written data, distributed across your nodes, which means the data is highly available and usable by the downstream endpoints as well.
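A worked instance of the sizing rule above; the figures are hypothetical (8 GB of memory per node and 3 dedicated Flume nodes):

    # usable channel capacity = memory per node x nodes x 0.70 (30% reserved for the system)
    echo $(( 8 * 3 * 70 / 100 ))   # -> 16 GB usable by the Flume channels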
Now, aside from Flume, we have Hive. Hive is a data warehouse infrastructure. Hive was initially developed by Facebook and later turned over to Apache, where it was developed further, for obvious reasons, under the Apache Hive brand. Today it is still called Apache Hive, but I think the client changed to Beeline, so the terminal you see today is Beeline: before, the terminal you could see was the Hive CLI; today I think it's Beeline. That's the change there; actually a lot of changes have happened over the years. Hive is not a relational database and it is not designed for online transaction processing; it is not a language for real-time queries and row-level updates. Technically, it stores the schema in a database and the processed data in HDFS, so you can think of it as a schema database plus HDFS. It provides an SQL-type language, like any other SQL; the formal language is called HiveQL.

So how does it work? This is the architecture of Hive. That's why, when you install Hive, you will notice in my compose file that I create two instances; if you open my compose file, there are two instances, one for the metastore and one for the Hive server. In this architecture, the user interface is where Hive, as data warehouse infrastructure software, creates the interaction between the user and HDFS. The user interfaces that Hive supports are the web UI, the Hive command line, and HDInsight. The default is the Hive command line; today we call it the Beeline command line. Next is the metastore. The metastore is where Hive chooses a database server to store the schema, or metadata, of the tables and databases: the columns in a table, their data types, and the mapping to HDFS. On the right side there is the HiveQL process engine. The HiveQL process engine is similar to SQL: it queries based on the schema information in the metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program, we write a HiveQL query for the MapReduce job and have it processed; rather than writing MapReduce by hand, we use HiveQL to express it as a query. Then there is the execution engine: the conjunction of the HiveQL process engine and MapReduce is the Hive execution engine, which processes the query and generates the same results as MapReduce would. You will notice that if you use a MapReduce job, the logic is almost the same as what the HiveQL process engine produces, and the output of the reducer is the same as the output of the execution engine. Finally, HDFS or HBase is our base data storage, where we store our data. That is the architecture of Hive.

So how does the flow work? Technically we have Hadoop on the right side and Hive on the left side. First, the Hive interface executes the query and sends it through a database driver such as JDBC or ODBC. Then it gets the plan: the driver asks the compiler to check the syntax and the query plan, the requirements of the query. Then it gets the metadata, sends the plan, executes the plan, and finally executes the job in Hadoop. Once the result has been produced, Hadoop fetches the result, the result goes back to the execution engine, and of course the driver sends the result up to the Hive interfaces. That's how it works. Now, for the installation... ah, okay, so this is the architecture so far. In general, to make it clear: this interface is the one you work with, using the Hive command line; it passes the query down, and in fact the way the reducer works is the same as with Hive, because the logic is the same, the flat file gets converted through a query-type process. Anyway, all the details are in the slides. Actually, are you still using Hive, or Beeline already?
Beeline. So technically Hive and Beeline are exactly the same; I don't see any difference, actually. They are exactly the same, and that is why, when you work with the different user layers, you will be comfortable switching between them: Hive, Beeline, and so on are effectively the same. Creating tables, even creating columns with types like decimals, unions, and floats, is exactly the same; the syntax for creating them is the same. So all of this is just the same; the concepts are the same. When you create a database and then browse, it simply appears as a folder in the Hadoop warehouse directory under your current user, whether you are using Hive or Beeline, and of course you can drop it and so on; it's the same. You can also create temporary tables, nothing special there. You can use LOAD DATA, which lets you upload data to the server; the sample data is uploaded to the warehouse directory and then added to our database. All of this is generic across the interfaces, whether you're using Beeline or Hive; exactly the same. The syntax for queries, if you're familiar with SQL, is of course the same, so if you're good with SQL, then technically all of this is just the same.
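A sketch of those basic statements run through Beeline over JDBC; HiveServer2 is assumed on localhost:10000, and the database, table, and input path are illustrative:

    cat > hive_demo.sql <<'EOF'
    CREATE DATABASE IF NOT EXISTS userdb;
    CREATE TABLE IF NOT EXISTS userdb.employee (id INT, name STRING, salary FLOAT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE userdb.employee;
    SELECT * FROM userdb.employee LIMIT 10;
    EOF
    beeline -u jdbc:hive2://localhost:10000 -n root -f hive_demo.sql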
So I think the only thing that I would suggest for you guys to explore is this: if you want to go further with Hadoop, you can explore the other components, and to be able to explore the other components of Hadoop, since today we are using containerized applications, I would suggest that you explore further how to use containers, because containerization lets you integrate different types of components into one big, full-fledged Hadoop application. Technically, Hadoop is just a file system; what makes Hadoop powerful is its components. If you can integrate those components into Hadoop, then you can visualize and further explore the power of Hadoop and further optimize how you use it. I guess that's my last word; if you have any questions, you can ask.

Yeah, the PowerPoint is actually your step-by-step instruction. What we did today is a slightly more advanced Hadoop: how to configure your server automatically. In the PowerPoint, that is how you configure your Hadoop manually, which takes more time. So there are two ways to create your Hadoop: number one is manually, and number two is using Docker, a containerized application. Technically, the steps for how to create the MySQL server and everything else are already in the docker-compose.yml; that's actually the flow. If you open docker-compose.yml, okay, let me show you: let's say this is our file, right? This is our Docker setup. Everything here is pretty much self-explanatory; it is the process for creating each service. The only thing that I did not put there is how to run it, and how to run it I put inside the Dockerfile for Sqoop. You can see there at the top how to build it: docker compose build, and docker compose up -d. So the first step is that you need to put the three files into one folder; after you put them there, you execute step one, the build, and then step two, the up, and after you execute that, boom, it works like magic: every server is ready to use. Now, once everything is ready, you use your Docker commands; remember what I typed, such as docker exec. To connect to the Hadoop master, you simply run docker exec -it hadoop-master bash, and to connect to the MySQL server, you simply type docker exec -it mysql-server bash, something like that. This command simply takes you inside the container. Once you are inside the container, the rest of the commands, the ones in the PowerPoint that I released, from start to finish, are the same official steps. It just so happens that we took a shortcut: instead of building manually, we built automatically. That's why my final suggestion is this: if you want to focus on your programming and your development in Hadoop, you can use this docker-compose to expand your knowledge, because each of these services represents a component, and the more components you explore and inject into my docker-compose, the better the knowledge you will build for yourself. So as much as possible, I want you to improve this docker-compose that I gave you. It is almost complete already, I can say, but you can further improve it, for example by implementing auto-scaling of Hadoop, something like that; this one here is fixed scaling, but you can improve it into auto-scaling with just a few little tweaks. Okay, any more questions? Anyway, normally this topic is a bit advanced; that's why I introduced it to you. I was not supposed to touch this part, but since our server on the last day was so annoying, I had no choice but to introduce you to the more advanced approach. Okay, any more questions? Is everything good so far? I think you have every file ready, right? Just to confirm before we end: you have this, right? Okay. So, thank you very much and have a great day, everyone. That's it for today; take care. Bye bye.
on 2024-09-18