This tutorial outlines my process for selecting footprints and for filtering your lists once you have finished scraping. The filtering is important when you are scraping for a specific platform and don’t want a load of “junk” included in your list. It doesn’t take long to do and will save you loads of time when it comes to running your list.
For this example we are going to be scraping for Aska BBS, which can be posted to with Xrumer.
IDENTIFYING GOOD FOOTPRINTS
This is one of the most important parts of any scrape: finding good footprints that will yield the best results. In this example we are not going for a quick, low-quality scrape; we want to get every potential target from Google using as many footprints as we can find. We don’t care if this takes a whole week to run: the more links the better!
So first off we want to find an example of a site using the Aska BBS CMS. Let’s start by Googling “buy Viagra” “aska BBS”
We used the keyword “buy Viagra” as we know that pharma guys are heavily into spamming so we should find some good targets.
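As a rough illustration of the keyword-plus-footprint search above, here is a minimal Python sketch that pairs each keyword with each footprint to produce query strings. The function name `build_queries` and the placeholder keywords are my own assumptions for the example, not part of any scraping tool.

```python
from itertools import product


def build_queries(keywords, footprints):
    """Pair every keyword (quoted as an exact phrase) with every footprint."""
    return [f'"{kw}" {fp}' for kw, fp in product(keywords, footprints)]


# Placeholder keywords (hypothetical); swap in your own niche terms.
keywords = ["keyword one", "keyword two"]
# The footprint phrase from the search above, already quoted.
footprints = ['"aska BBS"']

for query in build_queries(keywords, footprints):
    print(query)
```

Each printed line is one query, in the same shape as the search we ran above.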
We will be using 3 types of footprints to search for websites in Google:
· inurl: footprints
· intitle: footprints
· Onpage footprints
Of the above, by far the most important for most platforms (and my personal favorite) is inurl:, as it yields the most accurate results, but it does tend to get your proxies banned faster.
So let’s start with locating inurl: footprints. These are footprints found within the actual URL structure of the sites we wish to scrape. You can see from our search above that the first obvious one is:
If we search that term in Google we see that there are a few hundred thousand results, more than enough to get some good scraping done with. What we will do now is scan these search results and our initial results using the “buy Viagra” keyword to detect spammed pages and see if we can find some more similar footprints for Aska.
Here’s what I came up with:
For this platform the footprints are very similar, which is fine, as together they will still yield more results than a single, more generic footprint (inurl:aska.cgi) used alone. With other platforms, though, you will find lots of varying footprints, for example the registration page, the member profile pages, the post pages, the help page and so on. In this example we don’t have that option, as it’s a simple BBS without many added features.
Now, let’s talk about our inurl footprints. There’s something wrong with them…
“What’s wrong?” I hear you ask. Google doesn’t like non-alphanumeric characters!
To get around this issue we can sometimes put the footprint into speech marks (quotation marks). This doesn’t work with all characters but seems to work for the majority.
inurl:aska.cgi?mode= yields 62,200 results in Google.
inurl:"aska.cgi?mode=" yields 678,000 results in Google.
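To apply that quoting trick across a whole list, something like the following Python sketch could help. It wraps the value of an inurl: footprint in quotation marks whenever it contains characters other than letters, digits, dots, hyphens or underscores. That "safe" character set is my own assumption, and `quote_inurl` is a hypothetical helper name, so test the results in Google as described above.

```python
import re


def quote_inurl(footprint: str) -> str:
    """Quote the value of an inurl: footprint if it has special characters."""
    prefix, sep, value = footprint.partition(":")
    if not sep or prefix != "inurl":
        return footprint  # not an inurl: footprint, leave untouched
    value = value.strip('"')  # avoid double-quoting an already quoted value
    # Assumption: anything outside this set counts as a "special" character.
    if re.search(r"[^A-Za-z0-9._-]", value):
        return f'inurl:"{value}"'
    return f"inurl:{value}"


print(quote_inurl("inurl:aska.cgi?mode="))  # inurl:"aska.cgi?mode="
print(quote_inurl("inurl:aska.cgi"))        # inurl:aska.cgi
```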
You’ll need to experiment with this a little, as there’s no set method that I use, but get creative and see what you can squeeze out of your footprints. Sometimes it even works best to split up your footprints like this:
But not in this case!
Also try navigating some of the sites you have scraped to see if there is a common file structure, for example whether they all have the same contact/faq/support page, or whether some of their other pages have something in their URLs that can be used to identify them when scraping.
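If you already have a batch of scraped URLs, a quick way to spot a common file structure is to count which path segments turn up most often. This is a minimal sketch using only the Python standard library. The sample URLs (and the `mode` values in them) are made up for illustration, and `common_path_parts` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def common_path_parts(urls, top=5):
    """Return the path segments that appear in the most URLs."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        # Use a set so each segment is counted once per URL.
        counts.update({part for part in path.split("/") if part})
    return counts.most_common(top)


# Made-up sample URLs standing in for a real scraped list.
sample = [
    "http://example.com/bbs/aska.cgi?mode=find",
    "http://example.org/cgi-bin/aska.cgi",
    "http://example.net/board/aska.cgi?mode=past",
]
print(common_path_parts(sample, top=1))  # [('aska.cgi', 3)]
```

Segments that show up across most of your list are good candidates to try as extra inurl: footprints.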
Intitle: footprints are worth adding to get a few extra links; you are basically searching for a common footprint in the page titles. With Aska BBS this is easy. Remember our initial search?
There’s a pretty obvious intitle footprint above :)
Simple as that!
Onpage footprints are common words and phrases you can find on the pages you wish to scrape. These are not always obvious at first, but spend some time examining your targets and you’ll come up with a few gems.
Let’s take a look at a couple of ASKA pages:
From the above examples some immediate onpage footprints are apparent:
· "- Aska BBS -"
· "- ASKA BBS -" "Modified by isso"
· "- ASKA BBS -" "powered by TOK2"
I’m sure there are tonnes more to find and there’s even the opportunity to use foreign footprints depending on what you’re using to scrape with but we won’t go into that now.
So… here’s our final footprint list:
"- Aska BBS -"
"- ASKA BBS -" "Modified by isso"
"- ASKA BBS -" "powered by TOK2"
More than enough to get us going! If you find any more after you have started scraping you can always add them mid-scrape.