
Web Scraping Orchestration

Goal

Create an index of Fandom.com and TVtropes.org then connect them.

  • TODO
    • Tutorials/Docker Postgres with Backup and Restore
    • Docker Traffic Through VPN
    • Merge SQLite databases
    • Minio Setup Tutorial
    • S3 Backup and Restore
    • Nats Tutorial
    • Scrape some fandom sites and store them as sqlite databases
      • Share code with other people
    • Update docker networking tutorial
    • Scope out 2.0 Scraper
      • Use NATS pub sub + job queue
      • Use a networked database, so no sqlite
      • Store the images of pages in object storage
      • Have other scraping engines such as selenium and puppeteer
      • Use VPN and other proxies
      • Upgrade proxies to be changeable via pubsub
      • Customised Scraper
    • Develop UI for contextual annotation
      • What is this contextual annotation and how does it relate to question engine, that needs to be written out

Logs

  • 2023-04-09
    • We need to store the links
    • Well we don't know if they are in the database or not
    • There are a fuck ton of links
    • We do not care about optimization until later
    • Alright but can't we use the full_url as the primary key
    • Ohhhhhhh that needs to be written down somewhere
      • Well we can do that, what would we need to change
      • We would need to change the insert and stuff
      • This is just far more efficient
      • Can we do joins on strings
      • Why not
      • It is easier than doing a select check for everything
      • YES FAR EASIER
      • Alright let's do it
    • Done, it's now doing the recursive scraping
    • Wonderful what's next
    • Next is to do the network stuff
    • Oh yes, wireguard or openvpn
    • It would be cool to do wireguard
    • Let's find a tutorial
    • Wait this is a separate tutorial
    • Yes, yes it is
    • So IDK if I am scraping correctly; I am extracting all parts of the URL, and I think it is compartmentalized well enough (see the sketch at the end of this entry)
    • I should be reading right now
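    • A minimal sketch of what this entry settles on, assuming SQLAlchemy and the standard-library urllib.parse; the table and column names here are illustrative, not the project's actual ones:

      ```python
      from urllib.parse import urlsplit

      from sqlalchemy import Column, String, create_engine
      from sqlalchemy.orm import Session, declarative_base

      Base = declarative_base()


      class Url(Base):
          """A URL keyed by its full string, so no integer-id lookup is needed before inserting or joining."""

          __tablename__ = "urls_t"

          full_url = Column(String, primary_key=True)  # string primary key; joins on strings are fine
          scheme = Column(String)
          domain = Column(String)
          path = Column(String)
          query = Column(String)
          fragment = Column(String)


      def url_row(full_url: str) -> Url:
          """Extract all parts of the URL, mirroring the compartmentalized scraping step above."""
          parts = urlsplit(full_url)
          return Url(
              full_url=full_url,
              scheme=parts.scheme,
              domain=parts.netloc,
              path=parts.path,
              query=parts.query,
              fragment=parts.fragment,
          )


      if __name__ == "__main__":
          engine = create_engine("sqlite:///scraper.db")  # hypothetical database file
          Base.metadata.create_all(engine)
          with Session(engine) as session:
              # merge() on the string key makes "is it already in the database?" a non-question:
              # re-inserting an already-seen URL is just an update, not a SELECT-then-INSERT dance.
              session.merge(url_row("https://gameofthrones.fandom.com/wiki/Jon_Snow"))  # example page
              session.commit()
      ```

      The string primary key trades a little storage for never having to look up an integer id before inserting a link, which is the trade the dialogue above lands on.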
  • 2023-04-06
    • Well I just lost all those logs RIP
    • Gotta remember to git stash
    • So where were we
    • We have the scrape URL file finished
    • Yes what comes after that?
    • Now we need to actually scrape it, then we need to have a session where we delete the scraped, add the log
    • Next we need to scrape, upload contents, then delete scrape, add log
    • That seems doable
    • So how are we supposed to do the jobs?
    • I mean extract the URLs
    • Is this going to be a separate job
    • Yes it is obviously a separate job
    • So how does the system know when to do it?
    • There are the job logs, and we have the scraping queue
    • Are we renaming it job queue?
    • Ya, then we can add another type of job
    • This is where callback functions come in, doesn't it
    • YES
    • We can just store it in the HTML contents
    • No we store it in the logs, cause the contents can be object storage
    • Do we want to practice JSON in sqlalchemy
    • No we should do that in the tutorial
    • Alright so how do we do this
    • We append the logs with a links_extracted boolean
    • Sure why the fuck not
    • And we do string
    • Do we want to jump to jobs server now?
    • No we want to know how to do this single threaded
    • Then we can do jobs and stuff
    • So the next thing to do would be what exactly?
    • Alright so what would the next thing be?
    • We read through the URLs and decide which ones need to be scraped
    • Are we going to have to process the domains as well
    • Yes, yes we will
    • Then what do we do?
    • We read all the URLs and paths that have not been scraped and put them in the queue after every URL we scrape, until we have a lot, like 100
    • Can we do that right now
    • Probably
    • So what was the plan?
    • Subdomain parsing
    • Wait, do we really need that? We can just do the right SQL queries, like DISTINCT with a %.fandom.com filter. Ya, that is over-engineering; trust the library developers
    • So then we just need a function that can feed new URLs into the queue and scrape again until a specific number of logs exist (see the loop sketch at the end of this entry)
    • We still need to check http error and log it
    • Alright, then that's basically it, right
    • Ya then we can start doing jobs and start doing the cool shit with Selenium
    • We also need to link what links to what, and to do that we need to know what links are actually useful
    • Oh ya I forgot about that
    • So links?
    • What links matter
    • The ones with the
    • Game of Thrones was too big a wiki to start with, found 9000 pages
    • We need to track the links between stuff, that is very important, that is the whole point of this, we gotta find that graph sqlite example (a recursive-CTE example is at the end of this entry)
    • Wait after we have the data what do we do with it?
    • Paul you should have thought this part out
    • How do we want to look in at this data
    • Well we want to see the stories summarized
    • We can also fetch the actual reference texts
    • WE WANT TO LABEL THE EDGES
    • We want to generate meme vectors
    • We want to connect this to TV Tropes
    • Oh ya, how do we do that
    • Alright so next steps are?
      • Track the links between pages, use a graph
      • Find a small simple example rather than GameOfThrones
      • Write the postgres in docker tutorial
        • Maybe do MariaDB and MySQL at the same time
      • Scrape some fandom sites and store them as sqlite databases
        • Share code with other people
      • Update docker networking tutorial
      • Scope out 2.0 Scraper
        • Use NATS pub sub message queue
        • Use a networked database
        • Store the images of pages in object storage
        • Have other scraping engines
        • Use VPN and other proxies
      • Develop UI for contextual annotation
        • What is this contextual annotation and how does it relate to question engine, that needs to be written out
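    • A rough single-threaded version of the scrape-and-extract loop this entry talks through, using requests, BeautifulSoup, and plain sqlite3; the job_queue/job_log/edges table names, the links_extracted flag layout, and the batch size of 100 are illustrative guesses rather than the project's real schema:

      ```python
      import sqlite3
      from urllib.parse import urljoin, urlsplit

      import requests
      from bs4 import BeautifulSoup

      BATCH_SIZE = 100  # stop once this many job logs exist

      SCHEMA = """
      CREATE TABLE IF NOT EXISTS job_queue (full_url TEXT PRIMARY KEY);
      CREATE TABLE IF NOT EXISTS job_log (full_url TEXT PRIMARY KEY, status INTEGER, links_extracted INTEGER);
      CREATE TABLE IF NOT EXISTS edges (src_url TEXT, dst_url TEXT);
      """


      def scrape_batch(db: sqlite3.Connection) -> None:
          # Pull un-scraped fandom URLs; DISTINCT plus a LIKE filter does the domain
          # narrowing, so no hand-rolled subdomain parsing is needed.
          rows = db.execute(
              """SELECT DISTINCT full_url FROM job_queue
                 WHERE full_url LIKE '%.fandom.com%'
                   AND full_url NOT IN (SELECT full_url FROM job_log)"""
          ).fetchall()

          for (url,) in rows:
              resp = requests.get(url, timeout=30)
              # Always log the HTTP status, error or not.
              db.execute(
                  "INSERT INTO job_log (full_url, status, links_extracted) VALUES (?, ?, 0)",
                  (url, resp.status_code),
              )
              if resp.status_code != 200:
                  db.commit()
                  continue

              # Extract the links, record the edges, and feed new URLs back into the queue.
              soup = BeautifulSoup(resp.text, "html.parser")
              for a in soup.find_all("a", href=True):
                  target = urljoin(url, a["href"])
                  if urlsplit(target).netloc.endswith(".fandom.com"):
                      db.execute("INSERT INTO edges (src_url, dst_url) VALUES (?, ?)", (url, target))
                      db.execute("INSERT OR IGNORE INTO job_queue (full_url) VALUES (?)", (target,))
              db.execute("UPDATE job_log SET links_extracted = 1 WHERE full_url = ?", (url,))
              db.commit()

              # Stop once enough pages have been logged for this run.
              (count,) = db.execute("SELECT COUNT(*) FROM job_log").fetchone()
              if count >= BATCH_SIZE:
                  break


      if __name__ == "__main__":
          conn = sqlite3.connect("scraper.db")
          conn.executescript(SCHEMA)
          # Placeholder start page; the log suggests picking a wiki smaller than Game of Thrones.
          conn.execute(
              "INSERT OR IGNORE INTO job_queue (full_url) VALUES (?)",
              ("https://some-small-wiki.fandom.com/wiki/Main_Page",),
          )
          conn.commit()
          scrape_batch(conn)
      ```

      Calling scrape_batch repeatedly gives the recursive behaviour from the 2023-04-09 entry; a single pass per call keeps the single-threaded version easy to reason about before any job server exists.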
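    • And on "track the links between pages, use a graph": SQLite can already walk a plain edges(src_url, dst_url) table with a recursive CTE, so no separate graph store is needed yet. A hedged example, assuming the same illustrative table names as the loop sketch in this entry:

      ```python
      import sqlite3

      # Find every page reachable within two hops of a start page, with its depth.
      # Assumes an edges(src_url, dst_url) table populated by the scraper.
      REACHABLE = """
      WITH RECURSIVE reachable(url, depth) AS (
          SELECT ?, 0
          UNION
          SELECT edges.dst_url, reachable.depth + 1
          FROM edges
          JOIN reachable ON edges.src_url = reachable.url
          WHERE reachable.depth < 2
      )
      SELECT url, MIN(depth) FROM reachable GROUP BY url ORDER BY MIN(depth);
      """

      if __name__ == "__main__":
          conn = sqlite3.connect("scraper.db")
          start = "https://some-small-wiki.fandom.com/wiki/Main_Page"  # placeholder start page
          for url, depth in conn.execute(REACHABLE, (start,)):
              print(depth, url)
      ```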
  • 2023-04-01

    • Alright so what do we do next?
    • Well why are we doing this project?
    • What is the utility of Fandom.com when I do not even read it?
    • Well we want to find themes and connect them; the same way we connect these themes, we can connect people to work on projects
    • Dude that is based
    • Glad no one reads my shit, talking to myself is cringe
    • I need to talk to myself, no one else wants to listen
    • Well you can make them listen, you would be surprised
    • Thanks alter ego
    • Alright so what schema do we want?
    • We doing those Diagrams?
    • Entity Relationship Diagram
    • Well let's explain the different parts of this project first shall we?
      • We need the database
      • We need the scraper
      • We need jobs based scraper
    • Alright that sounds pretty simple
    • Yes it does
    • Design your schema
    • Alright so what is this Schema supposed to contain?
      • Web pages to scrape
      • The contents of web pages
      • Errors from web pages scraped
      • URLs extracted from web pages
      • Edges, what pages link to each other
    • So this is 5 tables?
    • Sure why not
    • Let's look at how we did it last time? (an ORM version of this schema is sketched at the end of this entry)

      ```sql
      CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T(
          SCRAPED_URL_ID INTEGER PRIMARY KEY,
          URL_ID INTEGER,
          DATE_SCRAPED DATE,
          HTML TEXT
      );

      CREATE TABLE IF NOT EXISTS URLS_T(
          URL_ID INTEGER PRIMARY KEY,
          FULL_URL TEXT,
          SCHEMA TEXT,
          DOMAIN TEXT,
          PATH TEXT,
          PARAMS TEXT,
          QUERY TEXT,
          FRAGMENT TEXT
      );
      ```
    • Okay so that looks okay
    • How do we want to do this? Node or Python
    • Well we should probably use Selenium
    • Ya
    • Cool Selenium can use extensions too
    • We can also get a Mentor for this kind of stuff down the line
    • Okay so we using SQLAlchemy then
    • Yes
    • We doing ORM
    • Sure why not
    • Why don't we just write the ORM code first
    • Alright sure
    • Where are we writing this code, GitHub, GitLab, Keybase, Codeberg
    • Gitlab
    • So do I want to run this on SQLite or Postgres?
    • Does it matter? You need a job system before you need to run postgres
    • Alright
    • So how about we write a SQLAlchemy Tutorial, then we can use it in the Web Scraping Tutorial
    • Ya so what goes in the SQLAlchemy Tutorial
    • Alright so this is another project now right?
    • Yes
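    • One hedged way the five tables listed above could look as SQLAlchemy ORM models, modelled on the old SCRAPED_URLS_T/URLS_T layout; the class and column names are placeholders, and the 2023-04-09 entry later swaps the integer ids for full_url as the primary key:

      ```python
      from sqlalchemy import Column, Date, ForeignKey, Integer, String, Text
      from sqlalchemy.orm import declarative_base

      Base = declarative_base()


      class Url(Base):
          """Web pages to scrape."""

          __tablename__ = "urls_t"
          url_id = Column(Integer, primary_key=True)
          full_url = Column(String)
          scheme = Column(String)
          domain = Column(String)
          path = Column(String)
          params = Column(String)
          query = Column(String)
          fragment = Column(String)


      class ScrapedPage(Base):
          """The contents of web pages."""

          __tablename__ = "scraped_urls_t"
          scraped_url_id = Column(Integer, primary_key=True)
          url_id = Column(Integer, ForeignKey("urls_t.url_id"))
          date_scraped = Column(Date)
          html = Column(Text)


      class ScrapeError(Base):
          """Errors from web pages scraped."""

          __tablename__ = "scrape_errors_t"
          error_id = Column(Integer, primary_key=True)
          url_id = Column(Integer, ForeignKey("urls_t.url_id"))
          status_code = Column(Integer)
          message = Column(Text)


      class ExtractedUrl(Base):
          """URLs extracted from web pages, before they are queued."""

          __tablename__ = "extracted_urls_t"
          extracted_id = Column(Integer, primary_key=True)
          source_url_id = Column(Integer, ForeignKey("urls_t.url_id"))
          target_url = Column(String)


      class Edge(Base):
          """Edges: which pages link to each other."""

          __tablename__ = "edges_t"
          src_url_id = Column(Integer, ForeignKey("urls_t.url_id"), primary_key=True)
          dst_url_id = Column(Integer, ForeignKey("urls_t.url_id"), primary_key=True)
      ```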
  • 2023-03-30

    • So how do I scrape everything?
      • Well I can develop a basic web scraper, check for 200 return, and log errors (a minimal sketch is at the end of this entry)
      • Didn't we do this before?
      • Yes, but when? There was an error table that was very important.
      • Keybase Binding?
      • No
      • ENS Indexing, that's what it was
      • So we're just indexing all the raw HTML to a database
      • Ya why not, otherwise things get complicated
      • This might be a LOT of data
      • Who cares, we need to scope this correctly: scrape by single site, list all sites
      • Can't I just go buy indexes like this off the internet?
      • Probably but good luck
      • I bet people usually just hire someone to do it for them
      • Alright so how do we do this?
      • Design Schema, Scrape, Scale, ETL
      • Alright it might actually be that simple
      • We are going to have to spend a lot of time designing this crap.
      • This seems doable
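      • A first pass at "basic web scraper, check for 200 return, and log errors", assuming requests and the standard logging module; writing errors to the error table is left out since the schema only gets designed in the later entries:

        ```python
        import logging
        from typing import Optional

        import requests

        logging.basicConfig(level=logging.INFO)
        log = logging.getLogger("scraper")


        def fetch(url: str) -> Optional[str]:
            """Fetch a page, returning its HTML on a 200 and logging anything else."""
            try:
                resp = requests.get(url, timeout=30)
            except requests.RequestException as exc:
                log.error("request failed for %s: %s", url, exc)
                return None
            if resp.status_code != 200:
                log.error("non-200 response for %s: %s", url, resp.status_code)
                return None
            return resp.text


        if __name__ == "__main__":
            html = fetch("https://gameofthrones.fandom.com/wiki/Jon_Snow")  # example page
            if html:
                log.info("fetched %d characters of HTML", len(html))
        ```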