Hi, welcome. Today I'm going to go over how we built the Nike agent and how to scrape data on nike.com.
First, we're going to load up nike.com as our starting URL.
You can press the navigating web browser option first to navigate manually to some of the shoes that we want.
Let's go to Men's Lifestyle.
Under this category you can see that we have 234 products currently.
If I scroll down the page, you can see that more products get loaded as I scroll, so let's check out our activity module to see what is going on behind the scenes.
Activate the activity module in the bottom right corner, and we can try to see if any data is being loaded asynchronously. I'm going to scroll down.
You can see that there's data being returned. We're just going to click on that to see if it's the data that we actually want.
And yep, it's the call that returns the products: items, title, Nike Air Force, price, employee price.
Okay, so this is the call that we want to make in order to return all of the products. We can click on the URL to see what URL and headers we need to return this data.
Under the web request, for CSV and simple inputs, we can press the test button to see if this call actually works, and it returns the data that we actually need.
Then right here we can just copy this call to our clipboard, exit out of this, create a new Navigate URL command, and paste it in.
Let's edit. Under Action, Discover Action, for the browser choose JSON Parser. Press save and navigate.
Here we can scroll down until we find a list of products. It seems to be under items, so we can left click to select one of the items, hold down Shift, and left click again, and you can see that the selection count is now 60.
So we can click once again to capture the entire list or table, and you can see that Content Grabber has automatically generated all of the fields within this category: wall content card, raw price, local price.
Some of these don't seem to be necessary, but for this demonstration we'll just leave everything in there for now.
We want to navigate directly to the product itself, because this URL only navigates to the listings page.
So you can scroll down to PDP URL, which we already captured.
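To make the shape of that listing call concrete, here is a minimal Python sketch, outside of Content Grabber, of pulling product rows out of a response like this one. The sample JSON and the key spellings (items, title, localPrice, pdpUrl) are assumptions based on the field names mentioned above, not the real response.

```python
import json

# Hypothetical sample of the listing response; key spellings are assumptions.
SAMPLE_LISTING = """
{
  "items": [
    {"title": "Nike Air Force 1", "localPrice": "$90",
     "pdpUrl": "https://example.com/pdp/air-force-1"},
    {"title": "Nike Air Max", "localPrice": "$120",
     "pdpUrl": "https://example.com/pdp/air-max"}
  ]
}
"""


def extract_products(listing_json: str) -> list[dict]:
    """Return one row per entry under "items", keeping only the fields we need."""
    data = json.loads(listing_json)
    rows = []
    for item in data.get("items", []):
        rows.append({
            "title": item.get("title"),
            "price": item.get("localPrice"),
            "pdp_url": item.get("pdpUrl"),
        })
    return rows


products = extract_products(SAMPLE_LISTING)
for row in products:
    print(row["title"], row["pdp_url"])
```

Each row keeps the product page link, which is the value the next navigation step needs.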
You can see that this is the link to the product page. So we can scroll down, press add command, and add a Navigate URL command. Unselect use default input, for the data provider select captured data, and under data columns select PDP URL.
Press save, then edit again. Under Action, Discover Action, for the browser we're going to open up a new browser, but this time a dynamic one, just to see what happens initially.
As you can see, the product page gets loaded along with the available sizes, and along with some other categories that we might need, such as color and style. So we'll capture those on this page right now.
Exit out.
Left click once, left click twice, capture text, and now we can rename this to color. It's showing white.
We can also highlight that and press generate transformation, and you can see that Content Grabber automatically generates a regular expression in order to parse out this information.
Also click on style: left click again, capture text. We rename this to style and perform the same action.
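As a rough illustration of what a generated transformation like that does, here is a small Python sketch that strips a leading label from captured text with a regular expression. The sample strings and the pattern are assumptions for illustration; the expression Content Grabber actually generates is not shown in the video.

```python
import re

# Hypothetical strings, standing in for the text captured from the product page.
COLOR_TEXT = "Shown: White/White"
STYLE_TEXT = "Style: 123456-001"

# Capture everything after the first "Label:" prefix, trimmed of whitespace.
LABEL_PATTERN = re.compile(r"^\s*[^:]+:\s*(.*\S)\s*$")


def strip_label(text: str) -> str:
    """Return the value after a "Label:" prefix, or the trimmed text if there is none."""
    match = LABEL_PATTERN.match(text)
    return match.group(1) if match else text.strip()


color = strip_label(COLOR_TEXT)
style = strip_label(STYLE_TEXT)
print(color, style)
```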
Generate transformation, press save. We also want to capture the available sizes as well, but since this is using a dynamic browser, it's not going to be efficient, so we will try to optimize this agent.
Move back to the second tab, the Navigate URL command. Under Action we can rename this to navigate to product and, for the browser, use the HTML Parser instead. Press save and navigate.
Here, using the HTML Parser, you can see that the sizes aren't actually available on this page. But we're going to create a new capture command: left click the add command button, add a web content command, and change the XPath to HTML.
Press save, press edit, and configure it to extract HTML. Now I'm going to press transformation script to see the entire HTML that's being loaded onto the page.
It's quite a lot to go through, but there is one script that actually runs to give us the availability of the sizes.
The name of the script is window.INITIAL_REDUX_STATE, so we can just change our XPath to point directly to it: the HTML element that contains window.INITIAL_REDUX_STATE. Press apply.
You can see that there's one selection, so this script actually exists. Press save, edit, configure, and then under transformation script you can see that the script is loading all of the availability of the shoe sizes.
It will be easier to visualize this in a JSON parser, so I'm going to close this out and add a navigate link.
Okay, actually we're going to use the Navigate URL command instead, so delete that real quick.
Add a Navigate URL command, and for the data provider select captured data and web content. I should probably rename that. Press save.
For now, rename this web content to script. Right click and choose don't export data, because we don't actually want this in our export data. Then we can rename the navigate command to navigate to sizes.
Press edit. In the transformation script here, you can see that the data is actually returned in the form of JSON, so we just need to extract this data correctly and then parse it as JSON.
In our regular expression we can return what follows window.INITIAL_REDUX_STATE=.
Test transformation, and you can see that we have the opening bracket, but the data here also has a closing bracket followed by the closing script tag, which we also want to remove. So just type that in explicitly.
Press test transformation, and you can see that the data is now returned fully in JSON. Instead of just returning this data, we're going to add "json:--" to it.
Test transformation, press save. Now under Action, unselect Discover Action, set the browser to new, and load this up in a JSON parser.
Press save, now execute.
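The same idea can be sketched in Python: find the script assignment, drop the prefix and the trailing closing script tag, and parse what is left as JSON. The page source below is a made-up, heavily shortened stand-in, and the pattern is an assumption, not the exact expression typed in the video.

```python
import json
import re

# Hypothetical, heavily shortened page source; the real script is much longer.
PAGE_HTML = """
<html><body>
<script>window.INITIAL_REDUX_STATE={"products": {"123456-001": {"skus": []}}};</script>
</body></html>
"""

# Keep only the object between the assignment and the closing script tag.
STATE_PATTERN = re.compile(
    r"window\.INITIAL_REDUX_STATE\s*=\s*(\{.*?\})\s*;?\s*</script>",
    re.DOTALL,
)


def extract_state(html: str) -> dict:
    """Return the embedded state object, or raise if the script is missing."""
    match = STATE_PATTERN.search(html)
    if match is None:
        raise ValueError("state script not found")
    return json.loads(match.group(1))


state = extract_state(PAGE_HTML)
print(list(state["products"]))
```

This simple pattern would break if the JSON itself contained a closing script tag inside a string, so treat it as a sketch rather than a robust parser.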
Now in this data we can see all of the sizes that are available. You can scroll down until we find the data that we're looking for: products, product ID.
So this 315123-001 seems to be the style of the shoe, and different shoes would have different styles as well. For this shoe, 315123-001, I'm just going to check out the shoe, and you see that it's the black one.
So I'm just going to create a list: right click, add command, web element list. Here I'm going to navigate directly to the products, because it seems like the first product has 315123-001 as a node, and the second product has a different style number as its node, which corresponds to the white shoe.
I'm just going to create an XPath that captures the white shoe, because that was what the previous page was capturing. So for this web element list, let's press edit.
Let's try to get an XPath. It was products. Let me scroll back up again, since this next node changes dynamically. Yes, products.
Now we need to get the local name of the next node, and then we need to say that it is equal to the find data of the data that we got previously. In navigate to product, that was our style. Press edit, configure: it was this number. Style is the name of the node that we captured previously, so: find data, style.
Then slash skus. Press apply.
I think this is not showing any selection count right now because it's evaluated dynamically, when the agent is running, but we'll leave this in for now.
In the bottom left corner there's a selection count, and that's usually how many nodes are currently being selected. Right now you can see that the selection count is zero, but I think that's because this find data style actually runs dynamically: it needs to capture style from the parent command. That's why it's not showing any selection count right now, because it's not evaluated at design time, but when we run it, it should work.
So I'm just going to rename this list to list of SKUs.
And we're just going to capture the SKU ID from here: change the XPath directly to SKU ID and rename the command to SKU ID.
And probably the sizes as well, so Nike size. We can capture that as well: press add, rename it to size, change the XPath to Nike size, and press apply.
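In plain Python terms, the list command above amounts to looking up the product node whose name equals the style captured earlier and then walking its SKUs. The sketch below assumes a simplified state object; the key names (products, skus, skuId, nikeSize) follow what is read out in the video, but their exact spelling and nesting are assumptions, and the style codes are placeholders.

```python
# Hypothetical, shortened state object; key names and style codes are assumptions.
STATE = {
    "products": {
        "123456-001": {"skus": [{"skuId": "1001", "nikeSize": "8"}]},
        "123456-111": {
            "skus": [
                {"skuId": "2001", "nikeSize": "8"},
                {"skuId": "2002", "nikeSize": "8.5"},
            ]
        },
    }
}


def list_skus(state: dict, style: str) -> list[dict]:
    """Return the SKU id and size for the product node named after `style`."""
    product = state.get("products", {}).get(style)
    if product is None:
        return []  # like a selection count of zero: nothing matches this style
    return [
        {"sku_id": sku.get("skuId"), "size": sku.get("nikeSize")}
        for sku in product.get("skus", [])
    ]


white_skus = list_skus(STATE, "123456-111")
print(white_skus)
```

Because the style is only known at run time, the lookup cannot be checked until a real value is supplied, which matches the zero selection count seen at design time.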
This will give us the SKUs of all of the shoes and the sizes, but it will not give us the availability yet, because that is actually at the bottom. So I'm going to continue scrolling down.
In this section you can see that there's available SKUs. These are SKUs that are actually available, that is, in stock. So if the SKU that we captured before is not in this list, then it's not available.
As a default value, I'm going to set this to not in stock.
Press save. I'm going to add a web capture command, also named availability.
Go to the XPath. For this XPath we want to find out whether the SKU actually exists in this available SKUs list or not. We need to go back to our root, because this is within a different parent node than the list of SKUs. So we'll go to ancestor, back to the default root node.
From root we can start off from available SKUs, slash SKU ID, that's what we want, equals what we found before: find data, without quotes, of the SKU ID that we captured before.
Press save.
Then in our transformation script we want to return the fact that it is in stock. So if the SKU exists, it's going to execute this transformation script and return in stock; otherwise it's going to be left blank. If it's left blank, it's going to execute the next availability command, which will override the empty value with not in stock.
And that's how we will get the availability for the shoes. All right, I will save this agent.
Press save.
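The two availability commands can be read as a lookup with a fallback. Here is a minimal Python sketch of that logic, assuming the available SKUs come as a list of objects with a skuId key; the data and the key names are hypothetical.

```python
DEFAULT_AVAILABILITY = "Not in stock"

# Hypothetical data: all SKUs of one shoe, and the subset that is available.
SKUS = [
    {"skuId": "2001", "nikeSize": "8"},
    {"skuId": "2002", "nikeSize": "8.5"},
    {"skuId": "2003", "nikeSize": "9"},
]
AVAILABLE_SKUS = [{"skuId": "2001"}, {"skuId": "2003"}]


def add_availability(skus: list[dict], available_skus: list[dict]) -> list[dict]:
    """Mark each SKU as in stock if its id appears in the available list."""
    available_ids = {entry["skuId"] for entry in available_skus}
    rows = []
    for sku in skus:
        # First command: returns "In stock" on a match, otherwise stays blank.
        availability = "In stock" if sku["skuId"] in available_ids else ""
        # Second command: fills in the default when the first one was blank.
        if not availability:
            availability = DEFAULT_AVAILABILITY
        rows.append({**sku, "availability": availability})
    return rows


rows = add_availability(SKUS, AVAILABLE_SKUS)
for row in rows:
    print(row["nikeSize"], row["availability"])
```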
Let's navigate back to see if I'm missing anything. Right now it only does one category, the one that I navigated to directly before, so let's make sure that it goes through all the categories.
I'm going to press edit for my Navigate URL command and rename it to navigate to listings. Press edit. Let's not use the default input; just press transformation script to see what we can change in the call to navigate to the other categories as well.
You can see in the URL that the gridwall path is men's lifestyle shoes, followed by this number, followed by pn, the page number. That's probably the second page we're on, so I'm going to change that to one for the first page, and then we can set up pagination as well. Then there's the referer, which is the previous page that the request came from. I have to change that as well.
Okay, so it seems like what we need is to capture the gridwall path from the parent page itself. So let's just go back to nike.com.
From here we want the XPath for all the categories, like men's (running, training and gym, basketball, Jordan), women's, and kids: these three categories. So let's press tools, browser tools. From here we can try to get the XPath of the navigation to these links.
Navigation menu: click on our first navigation item, and you can see that this nav corresponds to the entire list. The li elements also correspond to New Releases and Customize, and we don't want those.
Okay, so let's come back to our agent and add a web element list. Call it list of navigations, maybe; it doesn't really matter what you name it.
Edit the XPath. We want to start with the XPath of the first link, so the div and then the li, selected by its class attribute. Press apply, and you can see it selects all five; in the bottom left corner it says selection count five.
But we don't actually want all five of them, we only want the middle three. So we can just specify that the position is not equal to one, position being a function, and position is not equal to five.
Now you can see that the selection count is three, and those two have been unselected. So let's go further down.
This div has the ID navigation menu zero, and another one here has the ID navigation menu one. So this div ID actually changes for each navigation menu, and I'm just going to select it with contains, at ID equals navigation menu.
It seems to be followed by another div, and this one here leads to the new men's items. So it's two divs in. Press apply.
Oh, it's invalid. Oops, I think I mistyped something.
That's not supposed to be an equal sign inside contains.
Now I can see these boxes around all of the ones that are being selected, so let's now go on to the second div.
In our main category we don't actually want two of the categories either; we want clothing, and accessories and equipment. The shop by collection one is basically a subset of all of these, so I think it's going to give duplicates, and the same goes for the first one. So we're just going to exclude those two.
Come back to the navigations: position not equal to one and position not equal to four. That's going to eliminate those two.
Now all we want is this link, pretty much, the one that has the new items and this code at the end. So let's go a, at class equals nav menu item.
Press apply, and you can see there's 91 selections at the bottom, which is all the categories.
But there's actually some categories that we don't want as well, like, under men's, all shoes and the sneaker launch calendar, because that page seems to be a different kind of page, with items that aren't actually available yet. So we're going to exclude those two as well.
Press edit: not contains all, and not contains launch.
Okay, press save, and you can see our selection count is now down to 74. That's exactly what we need. So now we can add a web capture command.
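The filtering built up in that XPath (skip certain positions, then drop links containing certain words) can be sketched in plain Python. The markup, class name, and URLs below are made up for illustration, and applying the word filter to the link URL is an assumption; the video doesn't show which attribute the contains test is applied to.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the navigation markup.
NAV_HTML = """
<ul>
  <li><a class="nav-menu-item" href="/pw/new-releases/aaa">New Releases</a></li>
  <li>
    <a class="nav-menu-item" href="/pw/mens-lifestyle-shoes/bbb">Lifestyle</a>
    <a class="nav-menu-item" href="/pw/mens-all-shoes/ccc">All Shoes</a>
    <a class="nav-menu-item" href="/launch/calendar">Launch Calendar</a>
  </li>
  <li><a class="nav-menu-item" href="/pw/womens-running-shoes/ddd">Running</a></li>
  <li><a class="nav-menu-item" href="/pw/kids-shoes/eee">Kids</a></li>
  <li><a class="nav-menu-item" href="/customize">Customize</a></li>
</ul>
"""


def category_links(nav_html, skip_positions=(1, 5), excluded=("all", "launch")):
    """Collect category links, skipping some list items and some URLs."""
    root = ET.fromstring(nav_html)
    links = []
    # enumerate from 1 so positions match XPath's 1-based position().
    for position, item in enumerate(root.findall("li"), start=1):
        if position in skip_positions:
            continue
        for anchor in item.iter("a"):
            href = anchor.get("href", "")
            if anchor.get("class") != "nav-menu-item":
                continue
            # Plain substring test, like XPath contains().
            if any(word in href for word in excluded):
                continue
            links.append(href)
    return links


links = category_links(NAV_HTML)
print(links)
```

A plain substring test on "all" would also match a word such as "basketball", so it is worth checking the resulting count against what you expect, as is done above with the selection count.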
Come over to the new command to configure it: instead of extracting text, we will extract the URL. Press save, rename this to navigation URL, and choose don't export data; we don't actually want that in the end.
Now we can edit, and under transformation script, we don't want anything in the URL up to pw, so we match up to pw and return the rest.
This is what we want, so press save. Now, for navigate to listings, move it
within the list of navigations command. Press edit, and for the data provider select captured data, navigation URL.
That's what we wanted from before, but I'm just going to copy the call we had here. Copy this, come over, edit the command, and paste that in here.
You can see we need men's lifestyle shoes and this number after the gridwall path. So delete the stuff behind the gridwall path and type in dollar sign one, which replaces it with the current selection. For the referer as well, we can eliminate that and put in dollar sign one. Now we can return this.
Let's test the transformation and press save. Save, and now navigate to listings.
Now this navigation will actually navigate through all of the categories listed above, and our list will capture everything that is required on the listing page. Then it will navigate to the products.
Well, this listing is actually only the first page as well; we still need the pagination. So let's come back to navigate to listings, transformation, and make it so that it navigates to the first page first: under pn, put pn one. Test the transformation, save.
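Here is a rough Python sketch of what the two transformations do together: take the part of the navigation URL after pw and substitute it into the listing call, starting at page one. The endpoint is a placeholder, and the parameter spellings (gridwallPath, pn) and the sample URL are assumptions based on what is read out above.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint; the real call copied from the activity module is not shown here.
LISTING_ENDPOINT = "https://example.com/listing"


def gridwall_path(navigation_url: str):
    """Return the part of the URL after /pw/, or None if there is no such part."""
    match = re.search(r"/pw/(.+?)/?$", navigation_url)
    return match.group(1) if match else None


def listing_url(navigation_url: str, page: int = 1) -> str:
    """Build the listing call for one category and page number."""
    path = gridwall_path(navigation_url)
    if path is None:
        raise ValueError("no /pw/ segment in " + navigation_url)
    # safe="/" keeps the slash inside the path readable in the query string.
    query = urlencode({"gridwallPath": path, "pn": page}, safe="/")
    return LISTING_ENDPOINT + "?" + query


# Hypothetical category link, as captured by the navigation URL command.
first_page = listing_url("https://store.nike.com/pw/mens-lifestyle-shoes/abc123")
print(first_page)
```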
So now this will navigate to the first page, and we want to set up pagination for the second page. Nike is pretty nice: they already give us the link for the next page here, directly in their next page data service field. So you can just click on that.
Press add, add a pagination command, and then we can set the XPath directly to this next page data service. Click on that, press save.
Under action URL, you can see that the text here is now the second page, but we still need the information in front, which is store.nike.com; the text itself starts at html-services.
So press transformation script, and we just need to return https://store.nike.com in front of it. Press test transformation, and this should lead to the second page. Press save, save, execute.
Now the third page.
Now we need to see if our navigation actually ends or not. This is men's lifestyle shoes.
You can see that after one of the pages it just stops, so there's nothing on that page, and the pagination will end by itself. That's great, that's exactly what we need.
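The pagination behaviour can be sketched as a loop that follows the next page link until a page comes back empty. The pages below are faked in memory so the example runs without a network call; the key name nextPageDataService and the html-services paths are assumptions reconstructed from what is said above.

```python
BASE_URL = "https://store.nike.com"

# Hypothetical stand-in for the site: three listing pages, the last one empty.
FAKE_PAGES = {
    "/html-services/listing?pn=1": {
        "items": [{"title": "Shoe A"}, {"title": "Shoe B"}],
        "nextPageDataService": "/html-services/listing?pn=2",
    },
    "/html-services/listing?pn=2": {
        "items": [{"title": "Shoe C"}],
        "nextPageDataService": "/html-services/listing?pn=3",
    },
    "/html-services/listing?pn=3": {"items": [], "nextPageDataService": ""},
}


def fake_fetch(url: str) -> dict:
    """Pretend to request a URL; a real agent would make an HTTP call here."""
    return FAKE_PAGES[url[len(BASE_URL):]]


def crawl_listing(first_path: str, fetch=fake_fetch, max_pages: int = 50) -> list:
    """Follow next page links, prefixing the site address, until a page is empty."""
    titles = []
    path = first_path
    for _ in range(max_pages):  # hard cap so a bad link cannot loop forever
        page = fetch(BASE_URL + path)
        items = page.get("items", [])
        if not items:
            break
        titles.extend(item["title"] for item in items)
        path = page.get("nextPageDataService")
        if not path:
            break
    return titles


all_titles = crawl_listing("/html-services/listing?pn=1")
print(all_titles)
```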
And navigate to product will go to the sizes. All right, so it seems like this agent is ready for debugging. Save the agent, press debug, and start.
I'm going to slow down the debugging process. Up here there's a debug speed, which controls the speed of the debugger.
Sales channel 2: that's not always going to be available for all the products, so we're going to select optional command. Same for sales channel 4: optional command.
Error processing script: text was longer than 4000 characters. Right, so we're going to press stop debugging.
We can see that the script here is over 4000 characters, so I'm going to press edit on the script command, go to properties, scroll down, and under data type, instead of short text, convert it to long text to handle over 4000 characters. Save, press debug, and start again.
I think the availability command is selecting something that does not exist. Optional command, because the availability might not always be available.
Nike ID: select optional command.
It seems like for some products none of these actually exist, and for these you can press optional command for all of them.
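One way to picture an optional command is as a field that is allowed to be missing, while a required field still raises an error. This is only an analogy in Python, with hypothetical field names, not how Content Grabber implements it.

```python
class MissingFieldError(Exception):
    """Raised when a required field is absent from an item."""


def capture_row(item: dict, required: list, optional: list) -> dict:
    """Copy fields from an item; optional ones default to an empty string."""
    row = {}
    for name in required:
        if name not in item:
            raise MissingFieldError(name)
        row[name] = item[name]
    for name in optional:
        row[name] = item.get(name, "")  # missing optional field: leave blank
    return row


# Hypothetical items: the second one has no sales channel or Nike ID data.
full_item = {"title": "Shoe A", "salesChannel2": "web", "nikeId": "yes"}
sparse_item = {"title": "Shoe B"}

full_row = capture_row(full_item, ["title"], ["salesChannel2", "nikeId"])
sparse_row = capture_row(sparse_item, ["title"], ["salesChannel2", "nikeId"])
print(full_row, sparse_row)
```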
I think this is happening because some of the shoes directly on the site are not actually for sale yet, so it's just an ad there. That's why none of these actually exist, and we can't navigate to that product if it doesn't exist. So I'm just going to ignore that error for now.
I can show you directly on the Nike page. I'm going to pause debugging and go directly to the Nike page, under Men's Lifestyle.
You can see that it's this ad over here that's causing none of these fields to be found. It gathered the data for these three shoes, but this one is not actually a shoe, just an ad. That's why it's causing those problems there.
But now we can view our export data, and we'll see that everything has been extracted, including the availability of the shoes: not in stock, in stock.
A lot of these fields are actually not required, so we could just delete a lot of them as well. And that is how we built this Nike agent initially.