How to Mix and Match Protocols to Help Optimize Server-Side Resources


Sorry, I seem to be having trouble with my sound... there we go. Great.

We've got a bunch of people on the webinar today; thank you for joining us. If you have been following our webinar series this quarter, we have been covering the broader topics related to Sequentum's best-in-class web data extraction platform. Today we're going to dive into one of our key differentiators, something that people don't always notice right away about our tech. It's a subtler point, but it makes a huge difference in your overall operation.

And that is the fact that with Sequentum's CG Enterprise software you can mix and match your protocols, meaning for one request you can launch a full browser, and for the next request you can launch a lower-level parser, which consumes far fewer resources on your servers. This is a really critical differentiator of our software.

We built our software with a custom version of Chrome inside our tool, rather than putting our tool inside a browser. It's a key architectural choice that we made, and a differentiator for us. The problem that people have when they're running large-scale scraping operations is often that their server infrastructure requires a lot of memory, a lot of CPU, and a lot of disk space, and those servers are expensive. So we wanted to architect our solution to be as efficient as possible when it comes to those server resources, while maintaining the incredible ease of use that our software has for people running large-scale operations.

Now, when you're dealing with a website, let's say a retail site: there's a lot of scraping going on in the retail sector, and you're pretty much not going to compete online unless you have real-time competitive price scanners running. So retail sites have become savvy about this and they've added blocking, which often means you need a full browser; you need to launch a full browser at some point at the beginning of your scrape. Think about what it takes to launch a full browser: you've got something that's orchestrating that browser, you're launching the browser itself, you're monitoring whether it's still running, and there are all kinds of attributes of that browser that you need to set, like cookies and headers. There's session and state, and there's a client-side JavaScript engine. There's a lot of complexity in launching a full browser and pulling down web pages inside it. Rich content sites will have JavaScript code that has to get downloaded, compiled, instantiated, and rendered, so there's a lot going on in pulling that page down. We have our own version of Chrome that we've put inside our software to manage this process, and it works seamlessly.
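To make the resource difference concrete, here's a minimal open-source sketch of the two extremes: the same page fetched once with a full headless Chrome and once with a single HTTP request. This is an illustration using Selenium and requests with a placeholder URL, not the CG Enterprise API itself:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

URL = "https://www.example.com/products"  # placeholder URL

# Full browser: spawns a Chrome process, runs the page's JavaScript,
# manages cookies, session state, and rendering -- heavy on CPU and memory.
opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
driver.get(URL)
rendered_html = driver.page_source  # the fully rendered DOM
driver.quit()

# Lower-level request: one HTTP round trip, no JavaScript engine,
# a small fraction of the server resources.
raw_html = requests.get(URL, timeout=30).text
```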

Now, to put this in context: once you launch that full browser, if you're doing this using open-source technologies, for example trying to launch a full browser using Selenium, you're going to have to stand up a Selenium Grid, you're going to have to define a lot of information about your browser state, you're going to send all of that information over a network, and then you're going to launch this browser through Selenium Grid. You're going to have all kinds of resources monitoring the state of that browser in case it crashes. Browsers are notorious for leaking memory, web pages are notorious for leaking memory, and there are problems with browsers crashing, so you have to have monitors in place looking for all of this. It's just an incredible amount of memory and state that you have to keep track of on your servers.
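For context, here is a bare-bones sketch of that Selenium Grid route. The hub URL and page URL are placeholders, and a real deployment adds the grid itself, retry logic, and crash monitors on top of this:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")

# Every session is a network round trip to the grid hub, which in turn has to
# start, track, and eventually reap a real Chrome process on one of its nodes.
driver = webdriver.Remote(
    command_executor="http://selenium-hub.internal:4444/wd/hub",  # placeholder hub
    options=opts,
)
try:
    driver.get("https://www.example.com")  # placeholder page
    html = driver.page_source
finally:
    # If this is skipped (or the node dies first), the grid is left holding an
    # orphaned browser -- which is why all that monitoring machinery exists.
    driver.quit()
```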

Now, once you've launched that Selenium browser, if you're using open source it's just really hard to then take all of that state and pass it to some lower-level protocol. It's hard to do correctly, and it's hard to do nimbly. There are a lot of different libraries you have to orchestrate to do it, there's a lot of overhead and a lot of coding, and it's going to be a senior engineer putting all of this together for you. You're not going to be able to run very many of these on your servers, so it's going to be a very expensive operation to staff and to operate.
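Here's a rough sketch of the kind of state handoff we're talking about, done by hand with open-source pieces: pulling cookies out of a live Selenium session and replaying them through a lightweight requests.Session. The URLs are placeholders, and real sites usually need matching headers and more than just cookies, which is exactly why doing this correctly is hard:

```python
import requests
from selenium import webdriver

# Step 1: one full-browser load to establish cookies and session state.
driver = webdriver.Chrome()
driver.get("https://www.example.com/landing")  # placeholder

# Step 2: copy that state into a cheap HTTP session.
session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c["name"], c["value"], domain=c.get("domain"))
session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")
driver.quit()  # browser torn down; everything after this is plain HTTP

# Step 3: all further requests run at parser cost, not browser cost.
resp = session.get("https://www.example.com/data.json")  # placeholder
```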

So with Content Grabber, what we do is allow you to load a page in a full browser, which is extremely simple to configure, and then your next request can easily be in a JSON or HTML parser. You immediately tear down that browser, and it's completely seamless to the engineer who's writing and maintaining these agents. They don't have to worry about any orchestration overhead; all the monitoring of that browser is done automatically in the software. If it does crash, which happens — you know just from manually browsing the web that sometimes your browser tab crashes on a particular site — the software notices, restarts it, and picks up where it left off. All of this is automated for

you. So while this seems like a very detailed topic and a lower-level point to make about our software, it's really a key differentiator. It's going to save endless labor, time, and effort, it's going to allow you to be much more efficient with your server resources, and it's going to help get you that high-quality data on the schedule your business analysts need. With that, I'm going to hand it over to our engineer Zijang, who's going to demonstrate how to mix and match your protocols when writing an agent. Thanks.

All right, Zijang, take it away.

Hi guys. So over here I have a Nike agent open that mixes and matches protocols in order to optimize the agent itself.

For the Nike agent, it first loads up the entire first page inside a dynamic browser. If you ever want to change which type of protocol to use, just navigate directly to your navigation link; under Action there's a Browser tab you can click on, where you can specify which browser type you want the agent to use. For the first page we're going to use a dynamic browser, because Nike sometimes blocks requests and requires you to load a full browser instead of just an HTML parser.

Afterwards we have this navigation list of all products, where we grab each navigation URL, which we then use for a JSON call in order to retrieve the data as JSON. If we press the Execute button, you can see that the information here is loaded as JSON. You could instead leave the browser setting under Action at its default; by default it falls back to the parent browser's type, so if the parent browser is a dynamic one, then by default this request will also use a dynamic browser. If I execute it that way directly, you can see that the layout will be different.

It might be harder for you to write a script to extract information from that layout directly, so the JSON parser makes it easier to point and click on the elements you want to extract. From here I can just point and click on this pagination element up at the top if I wanted to. From the data that's being returned, you can see there is a list of products, along with some additional fields such as price, sale price, colors, and so forth.
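Outside the point-and-click GUI, the equivalent step looks something like the sketch below: once the browser-loaded first page has yielded the list URL, each further page is just a JSON request. The endpoint and field names here are invented for illustration; the actual API the site serves will differ:

```python
import requests

# Captured from the first (dynamic-browser) page of the agent.
list_url = "https://api.example.com/products?page=1"  # hypothetical endpoint

data = requests.get(list_url, timeout=30).json()
for product in data.get("products", []):  # hypothetical field names throughout
    print(product.get("title"),
          product.get("price"),
          product.get("salePrice"),
          product.get("colors"))
```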

From here I also want to navigate directly to the product's page in order to get some additional information that wasn't returned from this page, like available sizes. I can navigate directly to the product details page just using the PDP URL that's returned here.

Now, for this navigate-to-product-details-page step, I wouldn't have to load everything up in a dynamic browser. Under Action > Browser I'm actually loading this up in an HTML parser, because all the information is present on that page in plain HTML. If I navigate directly to the product's page, you can see that it loads up, and there are some additional fields here that we're collecting, such as item group ID, description, image
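As a sketch of that same product-detail step with open-source tools — assuming, as in the demo, that the fields are present in the raw HTML without any JavaScript — BeautifulSoup stands in here for the built-in HTML parser, and the URL and selectors are placeholders:

```python
import requests
from bs4 import BeautifulSoup

pdp_url = "https://www.example.com/p/some-shoe"  # placeholder PDP URL from the JSON step
soup = BeautifulSoup(requests.get(pdp_url, timeout=30).text, "html.parser")

# Plain HTML parsing: no browser, no JavaScript engine.
description = soup.select_one("meta[name=description]")
image_link = soup.select_one("meta[property='og:image']")
print(description["content"] if description else None)
print(image_link["content"] if image_link else None)
```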

link, and condition. On this page itself there's a script that runs and returns all of the available sizes. If I come directly to this page — I'll copy the URL and open the page in Chrome — you can see on the side that there are additional images and sizes available, like select sizes and so forth. If I load this up directly in an HTML parser, that information does not actually get loaded in; there's a script that runs on their backend that loads it up. So I can just execute that script directly: from here I go under Action > Browser and load the script directly into an HTML parser, or just press Execute here. Now you can see there's a list of sizes under "available SKUs" — let me find that real quick — available SKUs. These are the SKUs that are available on this page. The SKU itself, I think, we actually extracted directly from this navigate-to-product-details page.

Afterwards we can basically compare each SKU to the ones we extracted on the previous page to see which ones are available and which are not. Back in the details page: sometimes this page will not load successfully inside an HTML parser and you'll get an "access denied". In this case it actually loaded successfully, but for the cases where the HTML parser does not work, you can add an If condition in order to reload the page inside a dynamic browser, so that it loads successfully despite the blocking.
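The fallback pattern described here looks roughly like this outside the GUI: try the cheap HTML fetch first, and only re-load the same URL in a real browser when a blocking marker appears. The marker string and URL are placeholders:

```python
import requests
from selenium import webdriver

def fetch(url: str) -> str:
    html = requests.get(url, timeout=30).text
    if "Access Denied" in html:      # stand-in for the agent's XPath check
        driver = webdriver.Chrome()  # fall back to the dynamic browser
        try:
            driver.get(url)
            html = driver.page_source
        finally:
            driver.quit()
    return html

page = fetch("https://www.example.com/p/some-shoe")  # placeholder
```

The full browser is only paid for on the small fraction of requests that actually get blocked.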

So here I can just run the agent real quick. OK, this agent actually takes a while to run, so I'm going to let it run in the background and show you a second agent that I built as well: the Victoria's Secret agent.

In this Victoria's Secret agent, I'm loading up the first page inside a dynamic browser, and then afterwards I can select some cookies that I'm receiving on the site directly, which I will need for my requests. I'm also collecting this collection ID that's available inside the dynamic browser; it's the collection ID of this clearance section. I will also need it for the URL command that I'm going to navigate to next, which is also a JSON parser. If I go to Action > Browser, you can see it's a JSON browser, and under Common you can see that I'm using a Victoria's Secret API directly in order to load up this next page. So if I press Execute, you can see that with both the cookie and the collection ID that I collected on the first page inside a dynamic browser, I'm able to successfully use the API to return JSON data.
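The same flow, sketched with open-source pieces under the same assumptions: one dynamic-browser load to harvest the cookies and the collection ID, then a direct JSON API call that reuses both. The cookie handling is real Selenium/requests usage, but the endpoint, parameter name, and collection ID are invented placeholders:

```python
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.example.com/clearance")  # placeholder landing page
cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
collection_id = "CLEARANCE-123"  # in the agent, captured from the rendered page
driver.quit()

resp = requests.get(
    "https://api.example.com/v1/collections",  # placeholder API endpoint
    params={"collectionId": collection_id},
    cookies=cookies,
    timeout=30,
)
for item in resp.json().get("items", []):      # hypothetical field names
    print(item.get("name"), item.get("price"), item.get("salePrice"))
```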

From here it just collects all the data that was available on the previous page. So instead of scraping that page directly, scrolling and loading more elements from the bottom, I can just load up everything at once and then collect the ID, name, rating, price, and sale price. I'll run this agent real quick as well; this one runs faster than the Nike one since there are fewer products.

The agent has completed, so I can check out the data, and you can see that it has successfully extracted all of the clearance items from the Victoria's Secret section, along with price, sale price, rating, names, and so forth. This is more optimal than loading up the dynamic browser and scrolling through all the products to capture all the sale prices from there.

Back to the Nike one — it seems like it's still running a bit; it has to get all the data. OK, it's taking a while for this agent to run, so in the meantime I'll go back to the Victoria's Secret one and show you how we actually got this API request.

When we load up this first page inside a dynamic browser, we can click on the Activity module at the bottom right in order to see all of the calls being made to load the page dynamically. In here there's one call that basically returns all the data via JSON. To look through all of these, it's one of the API ones; you can always press the Test button to see the test results that appear.


So I think it's this one... actually, I can just take a look at the request and see which one it is. It's this "stacks v6 brands" one — since it's this one, we press Test, and I can see that this is all the information being returned. From here we can just copy this request directly to our clipboard; the request comes with a bunch of headers as well, so we copy all of it to the clipboard.

Then, if I want to add it as a new command, I can just come to the agent, add a command, add a Navigate URL command, click on the Action tab, click Browser, select JSON parser, then head back to Common and paste in the request I just copied from my clipboard. Then I can press Execute, and from here this is all the data that we extracted previously. That's how we go about finding API calls directly on certain sites after we load the first page in a dynamic browser.
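In open-source terms, replaying a request captured from the browser's network traffic amounts to the sketch below: paste in the URL and the headers you copied, then issue the same call with a plain HTTP client. Every value shown is a placeholder standing in for whatever you copied from the activity log:

```python
import requests

captured_url = "https://api.example.com/stacks/v6/brands"  # placeholder, from the activity log
captured_headers = {
    "accept": "application/json",
    "user-agent": "Mozilla/5.0 ...",  # copied from the original browser request
    "x-api-key": "<copied-value>",    # plus whatever custom headers the site sent
}

resp = requests.get(captured_url, headers=captured_headers, timeout=30)
print(resp.status_code)
print(resp.json())
```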

It seems like this Nike agent has extracted some data, so I'm just going to stop it real quick. Why is it taking so long... the agent is exporting data right now. The agent is now complete, so I can view the data it actually extracted. You can see that it's gone through the first page, where it got the item group ID, ID, title, and descriptions; this was all found under the API call here, directly under the Navigate JSON command. Then, when it navigates to the products page, it gets some additional information such as link, brand, GTIN, MPN, and color. And from this last request, the one we got from the script directly — under this Load JSON Script command — it got condition... I mean, not condition, availability of all of the colors and sizes of all the shoes and other products on Nike.

And that's how we would go about mixing and matching protocols in order to optimize resources on our servers. Does anyone have any questions, or anything else you'd like us to cover today?

I see we have a bunch of folks on the webinar today. I don't see any questions. All right, well, if we don't have any questions then we will — looks like maybe Chris has a question — oh, Mary, Mary does, OK. Let's see.

"Can I see that If Access Denied command?" Yes — for the If Access Denied command, all we're doing is checking whether this particular XPath, under "access denied," exists or not. If it exists, then we're reloading the page inside a dynamic browser: under Action > Browser you can see that it reloads the same URL inside a dynamic browser.

Thanks, Zijang. "Can you also show how you reload the page?" For reloading the page, all it's doing is capturing the current URL — the HTML document's URL, which is the URL we're currently on — and then reloading it in a dynamic browser. So now if I press Execute... oh, it seems it's not actually happening here in debug mode, since I'm not getting the access-denied response. But if it does get access denied — that is, if this XPath exists — then it would execute this command and reload the page in a dynamic browser.

All right, great. Are there any other questions based on what you've seen today? Chris, it looked like you might have a question. There is a way to type in your question, or I could try taking you off mute. OK, Chris, you're off mute — did you have a question? Oh, no, you muted yourself. OK, so you don't have a question. Excellent.

Well, I think that's it for today. Thank you so much for joining. Please send any further questions you have to support@contentforever.com and we'll be happy to answer them for you.

Thanks again; we look forward to our webinar next week, which is all about website blocking and how our software gets you past those blocks. Thanks again, and talk to you soon. Bye-bye.