This tutorial outlines my process for selecting footprints and for filtering your lists once you have finished scraping. The filtering step is important when you are scraping for a specific platform and don’t want a load of “junk” included in your list. It doesn’t take long to do and will save you loads of time when it comes to running your list.
For this example we are going to be scraping for Aska BBS which can be posted with Xrumer.
IDENTIFYING GOOD FOOTPRINTS
This is one of the most important parts of any scrape: finding good footprints which will yield the best results. In this example we are not going for a quick, low quality scrape; we want to get every potential target from Google using as many footprints as we can find. We don’t care if this takes a whole week to run: the more links the better!
So first off we want to find an example of a site using the Aska BBS CMS. Let’s start by Googling “buy Viagra” “aska BBS”
We used the keyword “buy Viagra” as we know that pharma guys are heavily into spamming so we should find some good targets.
We will be using 3 types of footprints to search for websites in Google:
· inurl: footprints
· intitle: footprints
· Onpage footprints
Of the above, by far the most important for most platforms (and my personal favorite) is inurl:, as it yields the most accurate results, but it does tend to get your proxies banned faster.
So let’s start with locating inurl: footprints. These are footprints which are found within the actual URL structure of all of the sites we wish to scrape. You can see from our scrape above that the first obvious one is:
If we search that term in Google we see that there are a few hundred thousand results, more than enough to get some good scraping done with. What we will do now is scan these search results and our initial results using the “buy Viagra” keyword to detect spammed pages and see if we can find some more similar footprints for Aska.
Here’s what I came up with:
For this platform the footprints are very similar which is fine, as they will still yield more results than using 1 more generic footprint alone (inurl:aska.cgi). But, with other platforms you will find lots of varying footprints including, for example, the registration page, the member profile pages, the post pages, the help page etc. In this example we don’t have that option as it’s a simple BBS without many added features.
Now, let’s talk about our inurl footprints. There’s something wrong with them…
“What’s wrong?” I hear you ask. Google doesn’t like non-alphanumeric characters!
To get around this issue we can sometimes put the footprint in speech marks (double quotes). This doesn’t work with all characters but seems to work for the majority.
inurl:aska.cgi?mode= yields 62,200 results in Google.
inurl:"aska.cgi?mode=" yields 678,000 results in Google.
You’ll need to experiment with this a little as there’s no set method that I use but get creative and see what you can squeeze out of footprints. Sometimes it even works best to split up your footprints like this:
But not in this case!
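The quoting trick above can be sketched as a small helper. This is only an illustration in Python: the function name quote_inurl is made up for this example, and the rule it applies (quote anything that is not purely letters and digits) is my assumption about a sensible default, not a rule Google documents. As noted above, quoting does not help with every character, so still compare the result counts by hand.

```python
import re


def quote_inurl(value: str) -> str:
    """Build an inurl: footprint, wrapping the value in double quotes
    when it contains anything other than ASCII letters and digits.

    Hypothetical helper for illustration only.
    """
    if re.fullmatch(r"[A-Za-z0-9]+", value):
        # Purely alphanumeric: no quoting needed.
        return f"inurl:{value}"
    # Contains characters such as '.', '?' or '=': quote the whole value.
    return f'inurl:"{value}"'


print(quote_inurl("aska"))            # inurl:aska
print(quote_inurl("aska.cgi?mode="))  # inurl:"aska.cgi?mode="
```

A value that itself contains a double quote would need extra handling, which this sketch does not attempt.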
Also try navigating some of the sites you have scraped to see if there is a common file structure. For example, they might all have the same contact/faq/support page, or some of their other pages might have something in their URLs that can be used to identify them when scraping.
Intitle footprints are worth adding to get a few extra links; you are basically searching for a common footprint in the page titles. With Aska BBS this is easy. Remember our initial search?
There’s a pretty obvious intitle footprint above :)
Simple as that!
Onpage footprints are common words and phrases you can find on the pages you wish to scrape. These are not always obvious at first, but spend some time examining your targets and you’ll come up with a few gems.
Let’s take a look at a couple of ASKA pages:
From the above examples some immediate onpage footprints are apparent:
· "- Aska BBS -"
· "- ASKA BBS -" "Modified by isso"
· "- ASKA BBS -" "powered by TOK2"
I’m sure there are tonnes more to find and there’s even the opportunity to use foreign footprints depending on what you’re using to scrape with but we won’t go into that now.
So… here’s our final footprint list:
"- Aska BBS -"
"- ASKA BBS -" "Modified by isso"
"- ASKA BBS -" "powered by TOK2"
More than enough to get us going! If you find any more after you have started scraping you can always add them mid-scrape.
Once you have finished scraping, depending on how strict your sieve filter settings were in Hrefer, it is usually a good idea to further filter your list using the LDA (Link Database Analysis) feature in Xrumer. Often it’s better to use relaxed sieve filter settings, especially for platforms which don’t have a definitive URL structure, so this will help you clean up your list if that is the case.
We will use code taken from the page source of our target platform to filter our list in Xrumer so that it contains ONLY PAGES WHICH ARE RUNNING OUR PLATFORM’S CMS. This will get rid of all the garbage collected by Hrefer during the scraping and allow us to post to our new lists faster and with fewer resources.
To view the source code of a site you can right click and click “view page source” in Google Chrome. This is usually similar for other browsers.
Let’s examine the source code of an ASKA BBS page:
Our goal now is to identify elements in this source code which are common amongst ASKA BBS sites. I like to open about 10 tabs of source code from different ASKA sites on different domains and make sure I can find something which is present on all of them.
You’ll need to experiment with this to find something which works. You don’t want something too generic (e.g. <p><span style=) or you will get false positives and Xrumer will locate sites which aren’t your platform. You need to find code which is present on ASKA sites but NOT present on any other platforms.
After searching the above I have found 2 patterns in the source code which I believe could only be present on ASKA sites: <!--ASKA BBS
So, now I need to check another few ASKA sites to see if they contain this footprint in the source. I now check the other 9 sites I have open and guess what? They all have both those incode footprints present on them!
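The cross-checking described above (present on every ASKA sample, absent from other platforms) can be sketched in a few lines of Python. Everything here is illustrative: the function names are made up, the sample page sources are tiny hand-written strings rather than real pages, and the marker spelling "<!--ASKA BBS" is my assumed reading of the comment found in the source.

```python
def markers_in_all(candidates, pages):
    """Keep only the candidate markers that occur in every sample page."""
    return [m for m in candidates if all(m in page for page in pages)]


def markers_in_none(candidates, other_pages):
    """Keep only the markers that never occur on pages from other platforms."""
    return [m for m in candidates if not any(m in page for page in other_pages)]


# Hypothetical page sources saved from the browser's "view page source".
aska_samples = [
    "<html><!--ASKA BBS v1--><p><span style='x'>hi</span></p></html>",
    "<html><p><span style='y'>yo</span></p><!--ASKA BBS--></html>",
]
other_samples = [
    "<html><p><span style='z'>unrelated forum</span></p></html>",
]

candidates = ["<!--ASKA BBS", "<p><span style="]

present = markers_in_all(candidates, aska_samples)  # both markers survive
usable = markers_in_none(present, other_samples)    # the generic one is dropped

print(usable)  # ['<!--ASKA BBS']
```

The generic <p><span style= marker passes the first check but fails the second, which is exactly the false-positive risk mentioned earlier.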
I can now choose either or both of these footprints to use in the Xrumer LDA tool but before I do that, let’s examine the 2nd incode footprint I found.
I found it in the top navigation section of the site with some other links also present:
These links are present on all ASKA BBS sites so let’s add them to our inurl: footprints for scraping!
These are not necessary for our LDA filtering but I thought it was worth adding to show how there is no set process for finding footprints. You may notice them whilst doing something entirely different like checking the source code! Keep building up your footprint list and you’ll end up with an arsenal of great footprints to scrape some huge lists!!
Now let’s set up Xrumer to filter our scraped URLs!!
We’re going to open Xrumer and click “Tools” -> “Link Database Analysis”
The LDA tool should open up:
We’re now going to set up the LDA tool to filter our link list for definite ASKA BBS pages:
1. Select your scraped link list here
2. This is where your filtered list will be saved
3. Input your incode footprint. I have chosen to just use 1 this time but you can use as many as you like.
4. Choose the number of threads to use. I like to use 100-200 depending on how much else is going on on my server.
5. This will show you the number of links found with your incode footprint. So far we’ve got 36% of the links it has checked.
NB: I just did a quick scrape of 5k URLs for demonstration purposes. Realistically you would have a much bigger list to run through this tool for a big scrape.
OK, so our filtering has finished and here are the results:
10% of the scraped links were actually ASKA BBS. We can now run our new lists and save 90% of our resources which would have otherwise been used on false positives!
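To make the arithmetic concrete, here is a rough Python sketch of the idea behind the filter: keep only the entries whose page source contains the incode footprint, then work out what share of the list survived. This is not how Xrumer implements LDA; the function names, the example.com-style URLs and the page sources are all invented for illustration, and the page sources are assumed to be already available as strings.

```python
def filter_by_marker(sources, marker):
    """Return the URLs whose page source contains the marker.

    `sources` maps URL -> page source text (hypothetical, pre-collected data).
    """
    return [url for url, html in sources.items() if marker in html]


def hit_rate(kept, total):
    """Percentage of the list that matched; 0.0 for an empty list."""
    return 100.0 * kept / total if total else 0.0


# Toy data: one matching page and one non-matching page.
sources = {
    "http://site-a.example/bbs/": "<html><!--ASKA BBS--></html>",
    "http://site-b.example/forum/": "<html><p>some other platform</p></html>",
}

kept = filter_by_marker(sources, "<!--ASKA BBS")
print(kept)                           # ['http://site-a.example/bbs/']
print(hit_rate(len(kept), len(sources)))  # 50.0

# The numbers from the demonstration run: 5,000 URLs with a 10% match rate
# leaves 500 URLs to run.
print(hit_rate(500, 5000))            # 10.0
```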
LDA is a great tool to utilise and can be used for tonnes of other great stuff including checking for “200 OK” in the headers of sites to make sure they’re up and running.
Looking to expand your Xrumer reach? Adding an external captcha solving service is a great way to get onto sites not hit by out-of-the-box users.
If you don't already have DeathByCaptcha you can sign up
2: Replace the contents of anticaptcha.ini with the following:
3: Save, close, and restart Xrumer
4: In Xrumer go to Options -> Speed <----> Success Rate and set the following options: