Hi everyone, and welcome to the Sequentum webinar on key evaluation criteria for building a large web scraping operation. Can everyone hear me? I just want to make sure we're set up properly. It looks like we've got a fair number of people on the line.
Let's see, it looks like we have some questions... okay, yes, you can hear me. Fantastic, thank you so much. All right then, let's get started. First of all, I just want to thank you for getting ahead of the curve, staying ahead of the curve, joining our webinar today, and educating yourselves on how best to do this. Obviously the landscape is changing: decisions are becoming data-driven, the incorporation of data-driven decision making is becoming a mandate across enterprises, and finding the smartest, most efficient, most reliable way to incorporate data feeds from the public internet is also becoming a mandate. Sequentum has been in this business for 10 years, and we believe our practice is best in class.
We're excited to share this webinar with you today. At a high level, the key evaluation criteria for looking at web scraping software, we believe, are the following. Number one, it has to be easy to use. When you're pulling in many different data feeds and handling the myriad problems that are going to come up, your software needs to make your life easier, and it needs to be easy to use. Number two,
there has to be as much automation as possible around the typical functions and methods that you're going to need as part of your everyday workflow. Productivity enablers are a key evaluation criterion, and you should make sure that whatever software you're choosing has many of them, that they work well, and that they make your life easier and your work more efficient. The third is a laser focus on
data quality. You need to make sure that the data you're pulling across is of the highest quality, and if there's any degradation to that quality, that it's easy for you to configure rules and handling for any of the cases that may come up. Since the quality of any data-driven decision making process, whether it's AI or ML, depends on the quality of the underlying data, this is really a mission-critical evaluation criterion for your software. The fourth is
blocking. There are hundreds of millions of dollars being invested in blocking software and services. You need to make sure that your web data collection software is viewed as close to human-like as possible and is not blocked, and in the event that it is blocked, there need to be simple, tried-and-true ways to reconfigure your agents automatically so they're no longer blocked. So that's number four. Number five is that you want to centralize everything you possibly can. Since there are so many different pieces to a large web scraping operation, you want to centralize as much as possible to support your operations managers and also to support any compliance oversight. And that leads into number
six, which is that the tech stack you put in place should support any compliance operating guidelines that your legal group wants to put in place to govern the web scraping operation. It should make the operation transparent, it should make adherence to compliance operating guidelines easily checked, and there should be key features and key configurations in the software that allow you to conform to those guidelines.
And last but not least is rich data interoperability. You need to be able to put this data collection operation in the middle of your data-driven decision making process. Obviously each company is at a different stage of maturity and is choosing different ways to implement data-driven decision making, so this software and tech stack needs to interoperate with whatever you put in place and whatever your team might transition to in the future. Not only do you need to be able to support many different data sources, you also need to be able to export to many different formats and distribute to many different targets. Maybe it's S3 this year, and maybe in the future there's some API you want to distribute your data collection to. Those distribution targets should be flexible and fungible. And of course you should have full API integration into every aspect of your web scraping operation, so the data collection toolset can really be embedded into the overall enterprise-wide system that you're setting up.
So with that, I'm going to go ahead and start with an overview of the Sequentum tool set and how it delivers on these key evaluation criteria. At a very basic level, we've got three components. One is the desktop, where you point and click to author and maintain your agents; it automates 95% of what you need to do in order to create and maintain these agents, and I will demo that for you shortly. The servers are just your workhorses: the servers essentially run all of your data collection and deliver that content, in whatever format you've specified, to your targets. And then what sits in the middle, tying it all together, is the Agent Control Center. At its most basic foundation, the Agent Control Center is a version control repository: it keeps track of your agent and all of the associated dependent artifacts, like input data.
David, you're asking why the screen isn't changing. I'm just going through the diagram now, and I'll go through a live demo in a moment. So the Agent Control Center has the version control repository that keeps track of all the agent dependencies, for example input data, reference data, any third-party libraries you're using, your schema, and any credentials you have, whether to third-party S3 buckets or databases. And then on top of that we build a whole deployment mechanism, server management mechanism, and proxy management mechanism.
Okay, so with that I'd like to show you a quick demo of our desktop. Now, for the folks on this call: some of you are new to Content Grabber, and some of you are very familiar with Content Grabber and have been customers for the full decade that we've been in existence. I'm going to give a quick overview of these features and show you how the desktop automates 95% of what you need to do.
As an example, I'm going to bring up a simple web page. It's a typical web scraping scenario where I've done a search based on a given category and a list page is coming up. As you see as I'm mousing around, there is a browser inside the tool; the tool is not inside a browser. This gives us tremendous capabilities to detect and handle errors as they come up. It's also a tool that's context aware. For example, when I click on the first item in the list, hold down the shift key, and click on the next item in the list, it automatically knows that I'm creating a list container, and it's actually generating the XPath for me automatically. So these typical things that you need to do as a web scraping engineer or operations manager, your most junior staff can do without assistance. Instead of having a team of 15 highly paid Python programmers or full-stack data scientists who are collecting all this information and doing your data-driven decision making, you can have one senior web scraping engineer, and reporting to that engineer you can have 40 more junior staff doing 95% of the work. That senior engineer is really only focused on the interesting problems that match their skill level; the more basic things can be done by lower-level staff who are more attuned to handling a lot of very detail-oriented work, things that programmers are not really interested in. The mind-numbingly repetitive stuff is automated for the most part,
and then a lot of the checking and whatnot is done by lower-level staff.

So here I'm just adding a simple pagination container so the agent can go through each one of the items in the list and each page in the list. As I click on the detail page, it's keeping track; you'll see the notes here in the browser. It's keeping track of the workflow automatically, and it's generating a schema on the back end automatically. These capture commands that you see under the agent explorer here are actually going to be the column headers of your export.
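To make that concrete outside the tool, here is a minimal Python sketch of the kind of list-container XPath that a point-and-click selection generates, applied with lxml to a made-up listing snippet. The markup and field names are illustrative, not Sequentum output.

```python
# Minimal sketch (not Sequentum output): the kind of list-container XPath that
# a point-and-click selection generates, applied with lxml to made-up markup.
from lxml import html

PAGE = """
<div class="results">
  <div class="item"><a href="/item/1">First listing</a><span class="price">$10</span></div>
  <div class="item"><a href="/item/2">Second listing</a><span class="price">$12</span></div>
</div>
"""

tree = html.fromstring(PAGE)

# One XPath that matches every row in the list (the "list container").
for row in tree.xpath("//div[@class='results']/div[@class='item']"):
    # Each capture command becomes a column: title, detail link, price.
    title = row.xpath("./a/text()")[0]
    link = row.xpath("./a/@href")[0]
    price = row.xpath("./span[@class='price']/text()")[0]
    print(title, link, price)
```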
Another thing that we have automated for you: if you want to extract any portion of any string that you're pulling from an element on the page, you can easily generate the regular expression automatically by simply highlighting the text and collecting it. Again, none of these things are rocket science, but automating them all in a tried-and-true way is saving you time.
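For instance, pulling just the price out of a longer captured string comes down to a small regular expression like the one below; this is a hand-written Python illustration of what the highlight-and-collect step produces for you, with a made-up sample string.

```python
import re

# Hand-written illustration of extracting a portion of a captured string
# (the sample string and pattern are made up, not tool-generated output).
raw = "Price: $1,299.00 (incl. tax)"
match = re.search(r"\$([\d,]+\.\d{2})", raw)
price = match.group(1).replace(",", "") if match else None
print(price)  # -> 1299.00
```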
So your junior staff can focus on doing the 95% of the work that needs to happen. They can do it with accuracy, they can do it quickly, and they can also do it in an auditable way, so when something does go wrong you can go back and figure out what it is.
Okay. Another thing that is very typical in scraping operations is that there are all kinds of edge cases. These are third-party websites, they don't want you to pull their data, and they try to make it difficult. Sometimes it's really hard to tell why 15% of the time your data isn't coming across properly. Well, you can visually debug: because we have the browser inside the tool, we can display exactly the flow of that data collection operation, and it makes it so much easier for your teams to debug. So when you're looking at web scraping tools, you really want to make sure that you have incredible ease of use. You want point-and-click functionality, you want all of these productivity enablers in there, you want it to be easy to debug. And then of course the next step, going a level deeper, is that you want
to make sure that you have data quality. So what we've done is build fine-grained data validation at the level of every field of every row of data collected. You can go in and set strict type expectations for the data that's coming across, and you can specify things like whether or not you allow this field to be null. Then you can take it a step further and define things like, let's say you're pulling used cars and you want to collect the dealers but you don't want to collect the individual sellers, because you consider that to be PII. You can write a regular expression to say: if the seller field matches this list of dealers, great, otherwise mask this content.

Similarly with date-times. For those of you who are already doing large-scale web scraping, you know that proxies sometimes move: one week the IP is in the US, the next week it's in Japan. When that happens, the website will often present information in a different way. There may be a different currency symbol, there may be a different date-time format, and you really don't want that to blow up your data-driven decision making. So for things like date-times, you want to specify exactly what that format is, and you can do that and validate in real time. Similarly with value ranges: you don't want a negative price, you want that put into an error file. And any time you're pulling JSON you want to validate it, because you don't want to blindly trust these third-party websites. You can specify things like time zone, so if you're pulling airline airfare information, for example, you're dealing with a lot of different time zones depending on the markets of your origin and your destination; you want to be able to specify exactly what that time zone is so that you're comparing apples to apples when you're doing, for example, competitive pricing analysis. And then we allow you to specify whether to run this validation at runtime, which is the point at which you're extracting the content from the web page, or at export, or both.
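To make those rules concrete, here is a small stand-alone Python sketch of field-level validation of the kind just described: a nullability check, a dealer whitelist regex that masks individual sellers as PII, an exact date-time format, and a non-negative price range. The field names, whitelist, and sample row are all made up for illustration; this is not the tool's configuration syntax.

```python
import re
from datetime import datetime

# Illustrative field-level validation (not the tool's configuration syntax):
# nullability, a dealer whitelist that masks individual sellers, an exact
# date-time format, and a non-negative price range.
DEALER_RE = re.compile(r"^(Acme Motors|Downtown Autos)$")   # hypothetical whitelist
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"                           # expected format

def validate_row(row):
    errors = []

    # Nullability: this field may never be null.
    if row.get("title") is None:
        errors.append("title must not be null")

    # PII rule: keep known dealers, mask individual sellers.
    if not DEALER_RE.match(row.get("seller", "")):
        row["seller"] = "***MASKED***"

    # Exact date-time format, so a proxy landing in another locale fails
    # loudly instead of silently corrupting downstream analysis.
    try:
        datetime.strptime(row["listed_at"], DATE_FORMAT)
    except (KeyError, ValueError):
        errors.append("listed_at does not match expected format")

    # Value range: a negative price goes to the error file.
    if not isinstance(row.get("price"), (int, float)) or row["price"] < 0:
        errors.append("price must be a non-negative number")

    return row, errors

row, errs = validate_row(
    {"title": "Used sedan", "seller": "J. Smith",
     "listed_at": "2023-10-01 12:00:00", "price": 4500}
)
print(row["seller"], errs)   # -> ***MASKED*** []
```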
Here you can also specify what your keys are. Let's say you want to do deduplication, another extremely typical action that your web scraping engineers are going to have to implement. You want this built into the tool, and you want it to be easily configured so you can specify what your keys are. For Amazon, for example, maybe the keys are not just the job ID but also the location. You want to be able to specify your keys for deduplication, and you want it to work every time; you don't want to reinvent the wheel for every agent that you're writing.
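As a rough stand-alone illustration (again, not the tool's built-in configuration), deduplication on a composite key comes down to something like this; the job-ID-plus-location key follows the example above.

```python
# Rough illustration of deduplication on a composite key (job ID + location),
# not the tool's built-in configuration.
rows = [
    {"job_id": "123", "location": "NYC", "title": "Engineer"},
    {"job_id": "123", "location": "NYC", "title": "Engineer"},     # exact duplicate: dropped
    {"job_id": "123", "location": "Austin", "title": "Engineer"},  # same ID, new location: kept
]

seen, deduped = set(), []
for row in rows:
    key = (row["job_id"], row["location"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

print(len(deduped))  # -> 2
```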
And of course you don't want to be hemmed in. You want all these productivity enablers, but you don't want to be restricted in any way. If you ever need to do something custom, you want to be able to write freeform custom logic at any point, and that's what's in this tool. You can use Python 3 and all the Python 3 libraries, you can use C#, normal programming languages that you're familiar with, JavaScript, regular expressions, etc.
I see I have a question: can this software deal with downloading Excel files through JavaScript links? Yes, absolutely, we can download files, and because it is Windows-based software we have integrated the IFilter packs that are supported by SharePoint, so we can deal with almost any file type. Microsoft Office files, OpenOffice files, PDFs, any sort of archive file: every single type of file that's supported by the IFilter packs can be read natively. And you're asking specifically about Excel files: yes.
Okay, so this is the data validation, and it's happening in real time. Let's say a website puts up window dressing for the Halloween holiday that's coming up, or they're doing A/B testing or A/B/C testing. You want to have this data validation running in real time, and you want error handling logic that runs in real time too. You've got all of these capabilities built in; you can do things like branch logic, if/else, and so on, to catch and handle your errors depending on what the website is doing.

The other thing that we have that
ties in with the data validation is success criteria, which is at the level of every run or the parent job. You can specify, say, I want 95% of the same number of actions that I had in the last run. This is possible because it's a database-backed tool and we track KPIs as we're running our jobs day over day or hour over hour, so you can easily make sure that you're getting the same amount of data each time, and if not, it'll raise an alarm or an alert; it'll notify you that something is amiss. And you can really set thresholds. Let's say you're getting data from a government site, which typically has a lot of errors; maybe you know that you get a minimum of 300 errors every time you work with that site, and you really don't want to be notified unless you have 325 or more errors. You can specify that in your success criteria. And there are two layers to the success criteria: you have the success criteria for the individual runs, and then you can create a job, which has a bucket of runs, and you can have success criteria at that higher level.
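In plain Python terms, the two checks described above, a percentage-of-last-run threshold and an error ceiling, might look roughly like this. The threshold values follow the examples in the talk; the function itself is illustrative, not the product's API.

```python
# Illustrative success-criteria check (not the product's API): compare this
# run's action count against the previous run, and the error count against a
# site-specific ceiling.
def check_success(actions, prev_actions, errors, min_ratio=0.95, max_errors=325):
    alerts = []
    # "I want at least 95% of the action count I had in the last run."
    if prev_actions and actions < min_ratio * prev_actions:
        alerts.append(f"only {actions}/{prev_actions} actions ({actions / prev_actions:.0%})")
    # "This site always has ~300 errors; only alert at 325 or more."
    if errors >= max_errors:
        alerts.append(f"{errors} errors reached threshold {max_errors}")
    return alerts

print(check_success(actions=900, prev_actions=1000, errors=310))
# -> ['only 900/1000 actions (90%)']
```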
So we are laser focused in our tech stack on making sure that you get the right data quality, that you get the right data counts, that you're setting thresholds, and that we're notifying you, sending alerts and alarms, any time there is an issue. You can integrate the notifications into any API-enabled ticketing system; our Agent Control Center also comes with a ticketing system built in that allows you to do this very simply. And of course you can set email notifications as well, and at any point you can write custom code to do any custom validation you want. I see I have some questions coming up.
Does data validation stop or pause the agent? No, it's running in real time, in parallel, as the data is coming across. There's validation that happens at runtime, in real time, and then there's validation that happens at export, at the end of the run, when you're exporting all of the data from the internal database to the export target.

I have another question here: what if the data doesn't match? If the success criteria isn't met, that will trigger an alert, an alarm, and a notification to whoever you have configured to receive those notifications.
If you wanted, you could put some branch logic in to say, in this case, if I'm getting more than 300 errors on this government website, or if the error is a 503 Service Unavailable, then pause collection and resume in an hour. You could do something like that where it doesn't actually notify anyone, it just handles the problem, because who knows, maybe a government website goes down frequently enough that you would just come back in an hour and try again.
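Sketched as generic Python, with a plain HTTP fetch standing in for the agent's request step rather than the tool's branch-logic syntax, that pause-and-resume handling looks roughly like this:

```python
import time
import requests  # a plain HTTP fetch stands in for the agent's request step

# Illustrative pause-and-resume handling (not the tool's branch-logic syntax):
# on a 503 from a flaky site, sleep an hour and retry instead of alerting.
def fetch_with_pause(url, max_attempts=3, pause_seconds=3600):
    for attempt in range(1, max_attempts + 1):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 503 and attempt < max_attempts:
            time.sleep(pause_seconds)   # site is down; come back in an hour
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError(f"{url} still unavailable after {max_attempts} attempts")
```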
Okay, so that's data quality. Now, blocking. With blocking there are basically two types. You've got the type of blocking where they're applying rate limits to your proxy; this is the old style of blocking. To counter this, we have built centralized proxy management components into the tech stack. We have a provider pool component that basically allows you to set up any type of proxy provider, whether it's a super proxy, backconnect ports, or a list of IPs; whatever type of proxy provider integration is required, it's supported. And from there you can create proxy pools. Let's say you're doing Australian retail and you've got a bunch of different providers for Australian retail, because for critical data collection we never trust any one provider. You set up your proxy pools and you configure those in your agent. And so when a provider has an outage, or a particular subnet has gone down, or some small subset of a provider's pool that they've extended to you isn't working properly, you can easily fix that in one place and all your agents are still running. So this is a big productivity enabler and a key evaluation criterion: you want to make sure that you are not tied to only one provider, and you want to make sure that you have absolute control over who your providers are, what your pools are, and what types of IPs you're using, because this is an area that's changing rapidly on the web and you want to make sure that your teams have full control.
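Conceptually, a centralized provider pool is just one place that hands out proxies from whichever providers are currently healthy, so disabling a broken provider fixes every agent at once. A simplified sketch, with made-up provider names and addresses:

```python
import random

# Simplified sketch of a centralized proxy pool (provider names and addresses
# are made up): agents ask the pool for a proxy, and disabling a provider with
# an outage fixes every agent that uses the pool.
PROVIDERS = {
    "provider_a": ["http://10.0.0.1:8000", "http://10.0.0.2:8000"],
    "provider_b": ["http://10.0.1.1:8000"],
}
DISABLED = {"provider_b"}   # e.g. this provider has an outage today

def next_proxy():
    healthy = [proxy for name, proxies in PROVIDERS.items()
               if name not in DISABLED for proxy in proxies]
    return random.choice(healthy)

print(next_proxy())
```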
So that's one way of blocking. Another way of blocking is to look at your device fingerprint. The way they do this is they take objects in the browser, typically, plus information about the environment your scraper is running in, flatten all of those variables into a single hash, which is the fingerprint for your device, and then track the rate of requests from that device. So if you have a server that's doing your scraping and you're trying to pull data from a site that has implemented blocking based on rate limits tied to your device fingerprint, your server is going to be dead in the water pretty quickly, unless you're using our software. Because we have a custom browser, a custom version of Chromium inside our tool, we can randomize all the objects that are used to create that device fingerprint. Typically what you'll see is a site will require that you run your scraper in a full browser, which is very expensive but very easy to do in our software. So we'll make one request in that browser and we'll randomize all the types of things they typically look at. For example, we'll make sure to load a full browser, and then we'll do things like clear the storage, make sure the connection is alive, rotate the proxy, put random delays in there if we want, randomize the browser size, which is another thing they typically look at, rotate the user agent, and rotate the web browser profile, typically along with rotating the proxy address. And then we can do other things at the level of the web browser. For example, sometimes they'll block if you don't have canvas reading turned on; it's okay if you don't know what these things are, we have troubleshooting steps in our 600-page manual that explain all of this. You can rotate your canvas string, and so on. There are all these things you can do to basically randomize what your device fingerprint looks like.
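Stripped of the browser internals, the request-level side of this idea looks roughly like the sketch below, using the generic requests library rather than the custom Chromium build: randomize the user agent, the language header, and the timing of each request. Viewport and canvas randomization need a real browser and are outside this sketch.

```python
import random
import time
import requests  # generic illustration; the product does this inside its custom Chromium build

# Randomize attributes that blockers typically hash into a device fingerprint:
# user agent, accepted language, and request timing.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]

def fetch(url):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(LANGUAGES),
    }
    time.sleep(random.uniform(1.0, 4.0))   # random delay between requests
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text
```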
So for all of those hundreds of millions of dollars going into blocking software and services, Sequentum is not blocked. Sometimes we come across interesting cases that are new to us, but for the most part we're not blocked, and that's a key evaluation criterion that you should consider when you're looking at web scraping software.

Now, another point that I want to mention: going back to our diagram here, I'm going to walk you through the Agent Control Center,
which is accessible to your agent developer and also, via the browser, to your ops managers, dev managers, or compliance managers. Your developer can basically check in an agent; it's very easy to just go to version control and check in your agent, and there are some simple views here that the developer can get. What you're getting is a full version of every agent and all of its dependent artifacts. There's no more losing track of exactly what the input file was, no more losing track of exactly what the credentials were at that point, or where it was writing to, or what API it was using. Everything is stored in one place, and it's stored together in a single version, so you can keep track. As a manager you can also look and see, oh yes, this is the version that had the handling added for the "access denied" error, so this is the one that I actually want to deploy, and then you can go and actually deploy that version to your production cluster. Or you can say, you know what, I want to add a deployment just to my QA server because I want to verify that this is actually working. And then when something goes wrong, you can easily open up the audit log and see any changes that have happened to schedules, deployments, and so on.
This is a very rich approach, and it's a critical feature that you're going to want in your web scraping software. When you're adding scores or hundreds or thousands of agents to a large-scale web scraping operation, like many of our customers do, you really need automation at this level. You need everything pulled together for you, and you need a very clean way to know what's running where: check run history, check job history. Here, the success criteria in the job is set to mark the run as a failure, but the developer can actually go in, if they've been assigned a bug for example, load the run history, and see what issues have come up, how this has been running, what they should expect, and so on. They can see the success criteria and they can see what proxy pool is in use. Everything that you need is basically in one place.
So now I'm going to show you the browser view. When you log in to your agent control repository, you've basically got the same view, but this is for a user who isn't necessarily writing and maintaining agents; they just want to see what's going on. They can go and see, okay, this is my Nike agent, these are my fields, here are some sample values, here are all my data validation rules, here's what I'm doing and where I'm distributing to, and this is how my server is configured. All of those variables that need to be set for a particular server can be set in one place; again, you don't have to open every agent any time a server variable changes, which is really important. And your credentials to any trusted source are encrypted: you basically create your connections in here, they're stored as an encrypted file in the agent repository so the agents have access to them, but your most junior staff do not have access to all the data in those trusted resources.

Now, what this also enables you to do as an organization is to really implement
compliance guidelines into your entire operation. Web scraping is an unregulated field, and there's been a fair bit of litigation in the space trying to define what's a good actor and what's a bad actor. We work with a lot of very sensitive institutions, large hedge funds and banks that have deep pockets and are governed by the SEC, and they're concerned about potentially having any liability brought to them in this space. As CEO of Sequentum, I'm actually working with a financial information standards committee, a non-profit, to define standards for web scraping operations. But basically what we do in our software is make sure that every single configuration that has to do with compliance is explicit, and that it's fully auditable. So for example, if
you're selling real estate data, and there's a real estate site that is quite certain you're reselling data you pulled from their site, and they come to you and say, I think you owe me X amount of dollars because every night at midnight I have to spin up 80 extra servers for bots to pull down data, well, you can go into your software and show that actually, according to your guidelines, you're only pulling less than X percent of average daily values. Here are all the configurations, here are the rate limits that you have set on each of these data collection agents, and you can go through and show the run history and the job history and show exactly what you pulled and when. You can basically nip it in the bud right there, so you mitigate the risk that you're going to get blamed for the scraping activities of bad actors that may at some point look like they're coming from you.
Not only that, because you can pull down each of these agents at any point, you can see what version was running, and you can actually go and get that particular agent and make sure it's configured the way you expected it to be. So for things like captcha: captcha handling is an explicit configuration. If you want to add captcha handling, you can do that. SEC-governed institutions don't tend to like to do that, but retail companies, for example, routinely do. These are different approaches from different compliance groups at different companies, but because the software is so advanced and we have all these productivity enablers and tracking of every explicit configuration, you can track whether you're doing captchas or not very easily. Similarly, you can specify whether or not you want to follow robots.txt. You can either always follow it, so that if the robots.txt changes from one run to the next the agent will automatically stop collecting, or you can choose to warn only, or to ignore it. All of these things are explicit configurations and fit nicely with your compliance requirements.
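As a generic illustration of the "always follow" behavior, using only Python's standard library rather than the tool's configuration, a run can re-read robots.txt and refuse to collect if the rules no longer allow the path:

```python
from urllib import robotparser

# Generic "always follow robots.txt" check (standard library, not the tool's
# configuration): re-read the rules before a run and stop if disallowed.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if not rp.can_fetch("MyCollector/1.0", "https://example.com/listings"):
    raise SystemExit("robots.txt disallows this path; stopping collection")
```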
I have a couple of questions here. What about third-party proxy providers, can you connect them? Yes, you can basically set up any proxy provider that you want. Right here we have a simple list, but you can set up any type of proxy provider, whether it's a super proxy, a list of IPs, backconnect ports, or an API; whatever it is, you can configure it in your providers, and from there you can create pools that you then set up in your agent. Basically any provider can be supported here.
There's a question about CG Professional and what I'm showing. We have legacy versions of our software, Visual Web Ripper, CG Professional, and CG Premium, and those do not include all the enterprise features. CG Enterprise is the third generation of our product, and the enterprise large-scale web scraping features are only available in CG Enterprise. We're happy to walk you through the process to upgrade.

The last point I wanted to mention, and then I want to go to questions, is that you
want to have rich data interoperability. You want to make sure that this data collection process works with your data sources and with your data targets. For example, even in just collecting data, you want to make sure that you can collect data from any source. One of the incredibly rich features in this software is that we can mix and match requests. If you are getting rate limited by a device fingerprint and you need to load up a full browser page, well, for those of you who have done this before, that's an incredibly expensive thing to do on your server: it takes a lot of memory, it takes a lot of CPU power, and it costs a lot to run full browser scrapes. So maybe you just want to do one request in the full browser, and after that you can go and do these lower-level, simpler requests, what we call parsers, that will basically parse the information. And it's incredibly easy to find back-end APIs; they power probably 40 percent of all sites, and we have purpose-built browser tooling that will help you find the back-end APIs that expose all the data you're trying to pull. It's incredibly easy to see: in this example I'm showing a
JSON back-end API, and then I'm switching to the web request editor to look at the request that actually pulled up that data. I'm just modifying one of the parameters, which I'm able to do visually, it's incredibly easy for me to do, and I'm going to format it for input. You could format it to put into C# or Python code or into a regular expression; for this I'm basically just going to create it for input. This is a quick example. I'm going to get rid of all of these full browser requests, because I don't actually need to do this scrape in a full browser, and I'm going to basically pull up the content in the JSON parser. Here, just like I did in the browser, I can point and click and choose my elements and pull down all the data that I want. I'm clicking on one job node, holding down the shift key, and clicking on another job node; it's creating my list component. And then, again, that context awareness: I click on this jobs node and it gives me options for all the types of things I would typically want to do here. I'm going to capture all the web elements, and so on.
So now, really in a couple of moments, I have found that this data source has a back-end API and I'm able to pull all of the data very simply. You really want a tool that allows rich data interoperability with your sources and your targets. For this one I'm going to go ahead and run it.

Okay, it's making one request. It's gotten a couple of errors; these are just data validation errors because it's defaulting to short text for the description, so I'm not going to worry too much about that. But basically, at this point, now that I've collected this data,
first of all I may want to do change tracking. I may want to specify: okay, if I don't see this job listed for two days, show it as deleted, so that I'm tracking which jobs have been filled, presumably. So I can set up change tracking.
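Conceptually, that two-day rule is just a comparison of when each key was last seen against today's date; a small stand-alone sketch, with illustrative field names and dates:

```python
from datetime import date, timedelta

# Stand-alone sketch of change tracking (field names and dates illustrative):
# mark a job as deleted if it has not been seen for more than two days.
last_seen = {"job-1": date(2023, 10, 1), "job-2": date(2023, 9, 28)}
today = date(2023, 10, 1)

for job_id, seen_on in last_seen.items():
    status = "deleted" if today - seen_on > timedelta(days=2) else "active"
    print(job_id, status)   # -> job-1 active, job-2 deleted
```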
I can also set up export targets to export to basically any format. If your operation is like ours or like some of our customers', you have different units that are all pulling the same data, and they want their content in Excel, CSV, JSON, Parquet; they may want it in lots of different formats, and you can export to as many formats as you want at the same time. Then you can deliver those to any delivery target: you can use SFTP, you can email them, you can send them to S3 or Azure, and of course you can write a script, again using normal programming languages, Python 3 or C#, and you can have a script library, reusing the same logic over and over.
So you can export to any format, you can distribute to any target, and last but not least you've got full API integration at every point. Anything that you need to do, any time you want to kick off a job ad hoc or pull down data at a specific time to integrate with your operation, you can embed this large-scale operation, while keeping full transparency of everything that's going on, into your wider data engineering pipeline. And we've got a full manual, 600 pages, that details all the types of things that you can do; it's an incredibly rich tool set built over 10 years. This is just an example of the agent command capabilities: you can basically do all of these different types of things, pretty much everything you would ever want to be able to do.
I guess the last point on data interoperability is that you can integrate with external libraries. If you wanted to integrate OCR, or any custom PDF reader or image reader, you can do that. You can read data from files, you can read data from databases, whether it's an OLE DB or a NoSQL DB, and of course you can interact with any API. So if you want to enrich your data on the fly with, say, IBM Watson sentiment analysis or entity extraction, that's easily done. So a key evaluation criterion is that your tool set allows for rich data interoperability, both at the level of the sources and at the level of the targets.

Just to summarize: you want to make sure
that your tool is incredibly easy to use, so you don't have to pay top dollar for each of your resources and you're not boring your highly skilled engineers with mind-numbingly repetitive work that can be automated and managed by much lower-level staff. You want all of those automation and productivity enablers, all those key functions and methods that you need to use over and over again, like change tracking or just taking the new data from a list. You want to have monitoring, controls, alerts, and alarms over data quality. You want to make sure that you're not blocked. You want centralization of all the aspects of your operation, separating out all the components that could possibly change and affect your agents, so that anything that's going wrong you can fix in one place and have all your agents automatically working again. And then of course you want that centralized transparency so that you can set up a compliance governance program and make sure that you're checking all the boxes from a risk mitigation standpoint. And of course you want that data interoperability.

So let's see, I have one more question here: is it possible to connect inputs from a database and write
outputs to a database? Yes. MySQL? Yes, we support MySQL. You can read from and write to an OLE DB source, whether it's MySQL, SQL Server, MariaDB, or Postgres, at any point in your workflow. It's easily done with custom code, and of course we have a lot of built-in capabilities depending on how you configure your internal database.
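For example, the custom-code route for reading inputs from and writing outputs to MySQL could look roughly like this, using SQLAlchemy and pandas. The connection string, table names, and columns are placeholders, and this is not the built-in connector.

```python
import pandas as pd
from sqlalchemy import create_engine

# Rough custom-code example (not the built-in connector): read input URLs from
# MySQL and write collected rows back. Connection string and tables are placeholders.
engine = create_engine("mysql+pymysql://user:password@dbhost/scraping")

inputs = pd.read_sql("SELECT url FROM input_urls WHERE active = 1", engine)

collected = pd.DataFrame([{"url": u, "status": "pending"} for u in inputs["url"]])
collected.to_sql("collected_rows", engine, if_exists="append", index=False)
```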
So you can basically configure your internal database to be whatever OLE DB-type database you have on site; in your case I guess it's MySQL, so you can just configure it to use MySQL. And because it's a database-backed tool, you can pick up wherever you left off. For example, once you configure your clusters and servers, if one of your servers drops dead for some reason, you can easily just move that work to another server and it will pick up where it left off, because it's a database-backed tool.

Okay, so someone has asked, where can I
download a demo? We don't make our software publicly available, just because it is so rich and robust that we like to hold your hand a little bit while you're demoing the software. We're happy to provide a free trial, allow you to use the desktop, and give you a demo environment of our ACC that you can upload your content to. We're happy to do that; just contact us after the webinar and we will set you up with a trial license. Okay, any other questions?
Let's see, I don't see any other questions. I'm going to wait a moment.
Looks like we may have a couple more. Ballpark pricing? Okay, so here we go. We're on an annual subscription basis, and you've got a product development team that's dedicated to you. Our desktop is $5,000 annual subscription, and that includes support. So basically, we're going to get you unstuck if you have a problem. Maybe you don't know how to use something and you're having trouble figuring out where to turn in our manual; you send your agent in to us and we can usually get you unstuck in a matter of 10 minutes. There's email support, and we typically respond in a maximum of 24 hours, often a lot quicker than that. And then if there's a particular agent that you're blocked on, or you really just need us to write something for you because you're pressed for time, we can do that ad hoc for $150 an hour; we'll write agents for you.
The server comes bundled right now with the Agent Control Center. The pricing is not based on a per-core basis or any of those measures; it is $10,000 annual subscription per server. So if you want to set it up initially on a small form factor, and as your operation grows you deactivate it and reactivate it on bigger hardware, that's fine, it's still the same price. From our point of view, it's very difficult to estimate exactly how many resources a team's agents are going to use in the future, so we've come up with this pricing model for the server to make it incredibly easy for teams to expand and grow within their budget. We're very focused on keeping your budget lean: you may have noticed that there's no per-page-load pricing, no per-agent pricing, and no per-core pricing. That makes it extremely friendly to teams that are really just getting started and trying to set up large-scale web scraping operations from scratch, so you basically know exactly how much you're going to spend. As you start adding thousands of agents, you're going to buy more servers; we know that because one server is going to be optimized for full browser workloads and another is going to be optimized for parser workloads, and so on. So we're confident that you'll grow with us, and that you'll build businesses on top of our software that are very, very profitable. We have many customers that have over a billion dollars in annual revenue. So we're very confident that our pricing is competitive and that it's really working for our customers.
Yes, the desktop pricing is $5,000 annual subscription per seat, so that's per user. If you wanted to set it up on a virtual desktop and have one person in one time zone use it for one shift and another person in another time zone use it for another shift, that's fine; we're not limiting you there. In fact, in the Agent Control Center there's no limit on the number of users: you can configure your organization and your users, there's no limit, and there's no license attached to those users. Again, we're trying to make it as easy as possible to get things going and get off the ground.

Another question: what kind of businesses or activities do your clients use this for, what can we do with it in general? Okay, so this is a very common question that we get.
It's: what are the use cases that people are using web data collection for? And this is an incredibly vast field. For example, maybe you're a university professor and you're trying to assemble a macroeconomic index or indicator that shows the overall health of the economy. You might look at rental prices, you might look at sale prices. One common example that's been written about in the press is that in summertime they look at RV rentals and used RV sales: how fast are they moving, are the prices getting discounted, which would represent a certain weakness in the market. These are all macroeconomic indicators. If you're an investor and you're running a hedge fund, that's a large portion of our business; probably 70 percent of our bespoke services customers are hedge funds right now.
The business of investment is getting automated. They will look at anything that a company has put online that exposes its internal operations. If it's an airline, they'll look at how many markets it's servicing, whether it has dropped any markets, whether the flights are late and how often, which airports it's servicing, and how much traffic there is in general in those airports. This is all information you can find online. They'll look at prices, they'll look at seat map availability per class; these are all facts you can find online by looking at various websites, and they'll be able to construct the balance sheet of those airlines before the earnings are announced. They'll automatically take in these feeds, apply algorithms to them, and basically create what they call signals. A positive signal means invest, or hold; a negative signal means get out of here, stop investing in this, it's going down, or short it. So that's a finance use case. In real estate,
they're marrying government data with real estate sale or rental listings that are on typical websites, and they're able to come up with an entire information database on all the closings and all the real estate activity in a particular district, city, region, or market. Real estate is a very big area. In retail, of course, they're doing competitive pricing: online retail and e-commerce sites are constantly checking each other's prices. And now there's another layer, which is that the product manufacturers and brands are checking the retail sites, like Walmart or Amazon, to see where their product sits in the order of products returned first. So that's a whole other level. Then there are social media use cases:
maybe you're pulling sentiment around a brand. For example, with a new product like meatless meat, fake meat, whatever it's called, the meat alternatives that are coming out, a good place to look to assess how things are going is social media, Yelp, and these types of sites that have reviews. You pull all the review content down, run it through sentiment analysis, keyword extraction, and other text analytics, and you get a pretty clear picture pretty quickly about how that new product is doing and how individuals and consumers are responding to it. So those are a number of different use case examples. Jobs are also a really typical one; they show you what areas a company is expanding into, and again you can use third-party APIs to do text analytics to enrich the raw data that you're pulling with our software.

So, one more question I have: what defines
a server? Right now, the way the server is defined, a server is basically the software that runs on a particular machine and does all of the data collection work that's defined as part of an agent. And right now the server license is a bundle of the server and the Agent Control Center, so those two things come together.
All right, let's see, do we have any other questions? It looks like we're almost done, we have about three minutes left. Oh, we've got one more question here.
How is CG handling the new Google photo-based captcha? Oh, I'm so glad you asked this. We have layers and layers of ways of dealing with captchas. Basically, whenever you encounter a captcha, you can either reissue the request with a clean slate: clear your storage, clear your headers, reset with a new IP, and so on, and get around the captcha that way. Or you can go ahead and automate the captcha using the built-in function that we have. You can do it with a custom script (there are ways to use AI to get past these captchas), or you can use third-party services; Death By Captcha and 2Captcha are two very common ones. Basically, you have a key to their service, and this enables you to very quickly pass that captcha on to their human beings, who click through for you and then pass the success key back to your scraping session, and off you go.
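In generic Python terms, the "reissue with a clean session and a new IP" path looks roughly like the sketch below; the captcha-detection test and the proxy list are placeholders, and a call to a solving service such as the ones mentioned would slot in where the retries run out.

```python
import random
import requests

# Generic sketch of "reissue the request with cleared state and a new IP" when a
# captcha page appears. The detection test and proxy list are placeholders; a
# third-party solving service would slot in where the retries run out.
PROXIES = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]   # placeholder pool

def fetch_around_captcha(url, max_attempts=3):
    for _ in range(max_attempts):
        session = requests.Session()                  # fresh cookies / storage
        proxy = random.choice(PROXIES)                # new IP for this attempt
        resp = session.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
        if "captcha" not in resp.text.lower():        # placeholder detection
            return resp.text
    raise RuntimeError("still challenged after retries; hand off to a solving service")
```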
And all of this is documented in our manual, which I should mention is quite voluminous. It's got information on basic things, techniques, limitations, and so on, and then it's got detailed information about pretty much everything that you would need to know: how to collect data, how to work with data, any anonymity requirements you have, compliance, etc. There is full documentation of how to deal with captchas in our manual.

Any other questions?
Okay, well, we're just coming up on the hour. I want to thank you all for attending and congratulate you for investing time and effort into making sure that you have the best solution possible for your companies and your businesses. If you have any follow-up questions, please go ahead and email us at sales@sequentum.com; we'd be happy to answer any follow-on questions you have. We're also happy to set up trials and get you set up to really get hands-on with this software. I hope this has been an educational session, and I look forward to seeing some of you on some of the future webinars that we are doing. We have a full schedule that we've sent out; we're doing one webinar a week on specific topics, and this one has been on key evaluation criteria. We look forward to staying in touch and helping you achieve your large-scale web scraping operation goals.