Web Scraping Orchestration
Goal
Create an index of Fandom.com and TVtropes.org, then connect them.

* TODO
  * Tutorials/Docker Postgres with Backup and Restore
  * Docker Traffic Through VPN
  * Merge SQLite databases
  * Minio Setup Tutorial
  * S3 Backup and Restore
  * NATS Tutorial
  * Scrape some fandom sites and store them as SQLite databases
  * Share code with other people
  * Update docker networking tutorial
  * Scope out 2.0 Scraper
    * Use NATS pub/sub + job queue
    * Use a networked database, so no SQLite
    * Store the images of pages in object storage
    * Have other scraping engines such as Selenium and Puppeteer
    * Use VPN and other proxies
    * Upgrade proxies to be changeable via pub/sub
    * Customised Scraper
  * Develop UI for contextual annotation
    * What is this contextual annotation and how does it relate to the question engine? That needs to be written out.
Logs
- 2023-04-09
- We need to store the links
- Well we don't know if they are in the database or not
- There are a fuck ton of links
- We do not care about optimization until later
- Alright but can't we use the full_url as the primary key
- Ohhhhhhh that needs to be written down somewhere
- Well we can do that, what would we need to change
- We would need to change the insert and stuff
- This is just far more efficient
- Can we do joins on strings
- Why not
- It is easier than doing a select check for everything
- YES FAR EASIER
- Alright let's do it
- Done, it's now doing the recursive scraping
- Wonderful what's next
- Next is to do the network stuff
- Oh yes, wireguard or openvpn
- It would be cool to do wireguard
- Let's find a tutorial
- Wait this is a separate tutorial
- Yes, yes it is
- So IDK if I am scraping correctly, I am extracting all parts of the URL, I think it is compartmentalized well enough (see the sketch at the end of this entry)
- I should be reading right now
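A minimal sketch of the approach above (full_url as the string primary key, the URL split into its parts), assuming plain sqlite3 and urllib.parse; the table and column names here are illustrative guesses, not the project's actual schema:

```python
# Sketch only: a urls table keyed on the full URL, decomposed with urlparse.
# Table and column names are assumptions, not the real schema.
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect("scraper.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        full_url TEXT PRIMARY KEY,  -- string primary key, so joins happen on full_url
        scheme   TEXT,
        domain   TEXT,
        path     TEXT,
        params   TEXT,
        query    TEXT,
        fragment TEXT
    )
""")

def insert_url(full_url: str) -> None:
    """Split a URL into its parts and insert it, ignoring duplicates."""
    p = urlparse(full_url)
    conn.execute(
        "INSERT OR IGNORE INTO urls VALUES (?, ?, ?, ?, ?, ?, ?)",
        (full_url, p.scheme, p.netloc, p.path, p.params, p.query, p.fragment),
    )
    conn.commit()

insert_url("https://gameofthrones.fandom.com/wiki/Jon_Snow")

# Joining on the string key works fine, e.g. finding URLs with no scrape log yet
# (scrape_logs is hypothetical here):
#   SELECT u.full_url FROM urls u
#   LEFT JOIN scrape_logs l ON l.full_url = u.full_url
#   WHERE l.full_url IS NULL
```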
- 2023-04-06
- Well I just lost all those logs RIP
- Gotta remember to git stash
- So where were we
- We have the scrape URL file finished
- Yes what comes after that?
- Now we need to actually scrape it, then we need to have a session where we delete the scraped, add the log
- Next we need to scrape, upload contents, then delete scrape, add log
- That seems doable
- So how are we supposed to do the jobs?
- I mean extract the URL's
- Is this going to be a separate job
- Yes it is obviously a separate job
- So how does the system know when to do it?
- Well, we have the job logs and we have the scraping queue
- Are we renaming it job queue?
- Ya, then we can add another type of job
- This is where callback functions come in, doesn't it
- YES
- We can just store it in the HTML contents
- No we store it in the logs, cause the contents can be object storage
- Do we want to practice JSON in sqlalchemy
- No we should do that in the tutorial
- Alright so how do we do this
- We append the logs with a links_extracted boolean
- Sure why the fuck not
- And we do string
- Do we want to jump to jobs server now?
- No we want to know how to do this single threaded
- Then we can do jobs and stuff
- So the next thing to do would be what exactly?
- Alright so what would the next thing be?
- We read through the URLs and decide which ones need to be scraped
- Are we going to have to process the domains as well
- Yes, yes we will
- Then what do we do?
- We read all the URLs and paths that have not been scraped and put them in the queue after every URL we scrape, until we have a batch of around 100
- Can we do that right now
- Probably
- So what was the plan?
- Subdomain parsing
- Wait, do we really need that? We can just do the right SQL queries, like a DISTINCT with a LIKE '%.fandom.com'. Ya, that is over-engineering, trust the library developers
- So then we just need a function that can feed new URLs into the queue and scrape again until a specific number of logs exist (sketched below)
- We still need to check for HTTP errors and log them
- Alright, then that's basically it, right
- Ya then we can start doing jobs and start doing the cool shit with Selenium
- We also need to link what links to what, and to do that we need to know what links are actually useful
- Oh ya I forgot about that
- So links?
- What links matter
- The ones with the
- Game of Thrones was too big a wiki to start with, found 9000 pages
- We need to track the links between stuff, that is very important, that is the whole point of this, we gotta find that graph sqlite example
- Wait after we have the data what do we do with it?
- Paul you should have thought this part out
- How do we want to look in at this data
- Well we want to see the stories summarized
- We can also fetch the actual reference texts
- WE WANT TO LABEL THE EDGES
- We want to generate meme vectors
- We want to connect this to TV Tropes
- Oh ya, how do we do that
- Alright so next steps are?
- Track the links between pages, use a graph
- Find a small simple example rather than GameOfThrones
- Write the postgres in docker tutorial
- Maybe do MariaDB and MySQL at the same time
- Scrape some fandom sites and store them as sqlite databases
- Share code with other people
- Update docker networking tutorial
- Scope out 2.0 Scraper
- Use NATS pub sub message queue
- Use a networked database
- Store the images of pages in object storage
- Have other scraping engines
- Use VPN and other proxies
- Develop UI for contextual annotation
- What is this contextual annotation and how does it relate to question engine, that needs to be written out
-
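A single-threaded sketch of that loop, reusing the hypothetical urls table from the earlier sketch; the scrape_logs and page_contents tables, the links_extracted flag, and the use of requests/BeautifulSoup are all assumptions for illustration, not the project's actual code:

```python
# Single-threaded scrape loop: pull unscraped fandom.com URLs, fetch, log the
# HTTP status, store the HTML, feed extracted links back into the queue,
# and stop once roughly MAX_LOGS pages have been logged.
import sqlite3
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

MAX_LOGS = 100
conn = sqlite3.connect("scraper.db")

def pending_urls(limit=10):
    """URLs on a fandom.com subdomain that have no scrape log yet."""
    rows = conn.execute(
        """SELECT DISTINCT u.full_url FROM urls u
           LEFT JOIN scrape_logs l ON l.full_url = u.full_url
           WHERE u.domain LIKE '%.fandom.com' AND l.full_url IS NULL
           LIMIT ?""", (limit,))
    return [r[0] for r in rows]

def scrape_one(url):
    resp = requests.get(url, timeout=30)
    # Log every attempt with its HTTP status so errors stay visible later.
    conn.execute(
        "INSERT INTO scrape_logs (full_url, status, links_extracted) VALUES (?, ?, 0)",
        (url, resp.status_code))
    if resp.ok:
        conn.execute("INSERT INTO page_contents (full_url, html) VALUES (?, ?)",
                     (url, resp.text))
        # Feed newly discovered links back into the urls queue.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            conn.execute("INSERT OR IGNORE INTO urls (full_url, domain) VALUES (?, ?)",
                         (link, urlparse(link).netloc))
        conn.execute("UPDATE scrape_logs SET links_extracted = 1 WHERE full_url = ?",
                     (url,))
    conn.commit()

while conn.execute("SELECT COUNT(*) FROM scrape_logs").fetchone()[0] < MAX_LOGS:
    batch = pending_urls()
    if not batch:
        break
    for url in batch:
        scrape_one(url)
```

The DISTINCT plus LIKE '%.fandom.com' filter is what stands in for the subdomain-parsing idea above.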
- Alright so what do we do next?
- Well why are we doing this project?
- What is the utility of Fandom.com when I do not even read it?
- Well we want to find themes and connect them, the same way we connect these themes we can connect people to work on projects
- Dude that is based
- Glad no one reads my shit, talking to myself is cringe
- I need to talk to myself, no one else wants to listen
- Well you can make them listen, you would be surprised
- Thanks alter ego
- Alright so what schema do we want?
- We doing those Diagrams?
- Entity Relationship Diagram
- Well let's explain the different parts of this project first shall we?
- We need the database
- We need the scraper
- We need jobs based scraper
- Alright that sounds pretty simple
- Yes it does
- Design your schema
- Alright so what is this Schema supposed to contain?
- Web pages to scrape
- The contents of web pages
- Errors from web pages scraped
- URLs extracted from web pages
- Edges, what pages link to each other
- So this is 5 tables?
- Sure why not
- Let's look at how we did it last time? (Old two-table schema below; a rough ORM sketch of the new five tables follows it)
```sql
CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T(
    SCRAPED_URL_ID INTEGER PRIMARY KEY,
    URL_ID INTEGER,
    DATE_SCRAPED DATE,
    HTML TEXT
);

CREATE TABLE IF NOT EXISTS URLS_T(
    URL_ID INTEGER PRIMARY KEY,
    FULL_URL TEXT,
    SCHEMA TEXT,
    DOMAIN TEXT,
    PATH TEXT,
    PARAMS TEXT,
    QUERY TEXT,
    FRAGMENT TEXT
);
```
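For comparison, a rough SQLAlchemy 2.0-style ORM sketch of the five tables listed above (pages to scrape, page contents, scrape errors, extracted URLs, edges); every class and column name here is a guess, not the project's real models:

```python
# Rough SQLAlchemy 2.0-style ORM sketch of the five tables discussed above.
# Class names, column names and types are assumptions for illustration only.
from datetime import datetime
from typing import Optional
from sqlalchemy import ForeignKey, Text, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Url(Base):  # web pages to scrape
    __tablename__ = "urls"
    full_url: Mapped[str] = mapped_column(Text, primary_key=True)
    domain: Mapped[Optional[str]]
    path: Mapped[Optional[str]]

class PageContent(Base):  # the contents of web pages
    __tablename__ = "page_contents"
    full_url: Mapped[str] = mapped_column(ForeignKey("urls.full_url"), primary_key=True)
    html: Mapped[str] = mapped_column(Text)
    date_scraped: Mapped[datetime]

class ScrapeError(Base):  # errors from web pages scraped
    __tablename__ = "scrape_errors"
    id: Mapped[int] = mapped_column(primary_key=True)
    full_url: Mapped[str] = mapped_column(ForeignKey("urls.full_url"))
    status_code: Mapped[Optional[int]]
    message: Mapped[Optional[str]]

class ExtractedUrl(Base):  # URLs extracted from web pages
    __tablename__ = "extracted_urls"
    id: Mapped[int] = mapped_column(primary_key=True)
    source_url: Mapped[str] = mapped_column(ForeignKey("urls.full_url"))
    target_url: Mapped[str] = mapped_column(Text)

class Edge(Base):  # which pages link to each other
    __tablename__ = "edges"
    source_url: Mapped[str] = mapped_column(ForeignKey("urls.full_url"), primary_key=True)
    target_url: Mapped[str] = mapped_column(ForeignKey("urls.full_url"), primary_key=True)
    label: Mapped[Optional[str]]  # "we want to label the edges"

engine = create_engine("sqlite:///scraper.db")
Base.metadata.create_all(engine)
```

Whether edges stays a separate table or ends up derived from extracted_urls once both ends exist in urls is left open in the log above.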
- Okay so that looks okay
- How do we want to do this? Node or Python
- Well we should probably use Selenium
- Ya
- Cool Selenium can use extensions too
- We can also get a Mentor for this kind of stuff down the line
- Okay so we using SQLAlchemy then
- Yes
- We doing ORM
- Sure why not
- Why don't we just write the ORM code first
- Alright sure
- Where are we writing this code, GitHub, GitLab, Keybase, Codeberg
- Gitlab
- So do I want to run this on SQLite or Postgres?
- Does it matter? You need a job system before you need to run postgres
- Alright
- So how about we write a SQLAlchemy Tutorial, then we can use it in the Web Scraping Tutorial
- Ya so what goes in the SQLAlchemy Tutorial
- Alright so this is another project now right?
- Yes
-
- So how do I scrape everything?
- Well I can develop a basic web scraper, check for a 200 return, and log errors (a minimal sketch is at the end of this section)
- Didn't we do this before?
- Yes, but when? There was an error table that was very important
- Keybase Binding?
- No
- ENS Indexing, that's what it was
- So we just indexing all the raw HTML to a database
- Ya why not, otherwise things get complicated
- This might be a LOT of data
- Who cares, we need to scope this correctly, scrape by single site, list all sites
- Can't I just go buy indexes like this off the internet?
- Probably but good luck
- I bet people usually just hire someone to do it for them
- Alright so how do we do this?
- Design Schema, Scrape, Scale, ETL
- Alright it might actually be that simple
- We are going to have to spend a lot of time designing this crap.
- This seems doable
- So how do I scrape everything?
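For the "basic web scraper, check for a 200, log errors" version mentioned above, a minimal sketch could be as small as the one below; fetch_page and the scrape_errors table are illustrative assumptions carried over from the sketches earlier in the log:

```python
# Minimal first pass: fetch a page, check for a 200, otherwise log the error.
# fetch_page and the scrape_errors table are illustrative assumptions.
import sqlite3
from typing import Optional

import requests

conn = sqlite3.connect("scraper.db")

def fetch_page(url: str) -> Optional[str]:
    """Return the page HTML on a 200 response, or None after logging the error."""
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException as exc:
        conn.execute("INSERT INTO scrape_errors (full_url, message) VALUES (?, ?)",
                     (url, str(exc)))
        conn.commit()
        return None
    if resp.status_code != 200:
        conn.execute(
            "INSERT INTO scrape_errors (full_url, status_code, message) VALUES (?, ?, ?)",
            (url, resp.status_code, resp.reason))
        conn.commit()
        return None
    return resp.text
```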