Let's start right away with the main script code:
#!/usr/bin/perl
# which-forum.pl script
# (c) 2010 Alexandr A Alexeev, http://site/
use strict;
# the commented-out patterns are for strictness:
# if the task is to collect engine statistics, leave them as they are
# if you are building a list of forums, uncomment them
my $data;
$data .= $_ while (<>);
# check how many pages say "Powered by phpBB" but carry no link in the footer

You will find this and the other scripts mentioned in the post in the archive. The which-forum.pl script examines a page's HTML for signatures of forum engines. We used a similar technique when detecting WordPress and Joomla, but there are a couple of differences. First, the script does not fetch the page itself: it reads it from stdin or from a file passed as an argument. This lets you download a page once, for example with wget, and then run it through several analyzers if you have more than one. Second, in this script the presence of a signature is a 100% sign of the engine. Last time, a signature merely added weight to the corresponding engine and the engine with the greatest weight "won"; I decided that here such an approach would only complicate the code unnecessarily.

To test the script, I did a little research: I compiled a list of several thousand forums and ran each of them through the script, thereby measuring how often the program gives an answer and how popular the various engines are. To get the list of forums I used my Google parser, sending the search engine queries like site:forum.*.ru and so on. You will find the complete query generator in the file gen-forumsearch-urls.pl. Besides the .ru zone, .su, .ua, .kz and .by were used as well. Last time it was hard to conduct such a study, since WordPress and Joomla sites carry no such signatures in their URLs, and catalogs like cmsmagazine.ru/catalogue/ do not give a sufficient sample size (what are 600 Drupal sites?).

I must admit, the results of the experiment disappointed me. Of the 12,590 sites studied, the engine was successfully identified on only 7,083, that is, in just 56% of cases. Maybe I failed to account for some engine? Is it really true that half the forums run Bitrix? Or should I have spent more time searching for signatures?
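The download-once workflow can be sketched in shell. This is only an illustration: the page content and the grep stand-in are made up; in real use the file would come from wget and then be fed to which-forum.pl and any other analyzers you have.

```shell
# Fetch the page once (in real use: wget -q -O page.html "http://example.com/forum/");
# here the file is written locally so the sketch runs offline, with a made-up footer.
cat > page.html <<'HTML'
Powered by <a href="http://www.phpbb.com/">phpBB</a> &copy; phpBB Group
HTML
# which-forum.pl accepts the saved file as an argument or on stdin, so any number
# of analyzers can reuse the same download; a grep stands in for one of them here:
grep -c 'phpbb\.com' page.html
```

Each additional analyzer is just another command reading page.html; the page itself is fetched exactly once.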
In general, additional research is required here. Among the 56% of successfully identified engines, the most popular, as expected, were IPB (31%), phpBB (26.6%) and vBulletin (26.5%). They are followed, with a large gap, by SMF (5.8%) and DLE Forum (5.3%). My favorite PunBB came only sixth (1.64%). I would not put much faith in these numbers (they imply that every third forum on the RuNet runs IPB), but certain conclusions can, of course, be drawn. For example, if you intend to build a site on a forum engine and plan to modify the forum (say, pay users $0.01 per message with automatic withdrawal of funds once a week), you should choose one of the three most popular engines: the more popular the engine, the better your chances of finding a programmer who knows it well. If no significant changes to the engine are expected, it may make sense to choose a less popular one, such as SMF or PunBB; this will reduce both the number of attacks on your forum and the amount of spam automatically posted to it.

Scripts for finding and identifying forums have many practical applications as well. The first that came to my mind was to sort the identified forums by TIC and post, on the first hundred, messages with links to one of my sites. However, hundreds of dofollow forum links did not affect the TIC in any way (two updates have passed), so it is better not to waste time here unless the click-throughs themselves interest you. Clearly, that is far from the only use of the scripts; I think you can easily figure out the rest yourself.

I had not planned to take part in the competition organized by Botmaster Labs. I have no time, a video is required for the contest (as a newfangled trend), although in my opinion everything can be explained more easily with good screenshots, and I do not really feel like filming anything.
There are very few profitable niches left, dumb spam no longer rules at all, you have to think now, and nobody will hand you working schemes, unless you wrap the outdated ones in a pretty package and powder them a little. :) But that is not about us. In general, these three "don'ts", I think, became the main barriers to participation for most potential entrants. It is like the car-repair triangle of cheap, high-quality and fast: the shop can satisfy only two conditions at a time, so sit and choose what matters more to you. :) The same with the competition: I have the time and I know how to make a video, but there is no topic; or I know how to make a video and have a topic, but no time at all; or I have free time and a small topic, but filming scares me. And it is already good if two of the three conditions are met.

Well, enough lyricism; back to the point. Since I had not planned to enter, my participation in the competition came down to voting, and I had even chosen which article I would vote for. Whatever you say, Doz knows the software very well and uses it very intelligently. But today I learned that an intrigue has appeared in the competition: it turns out I will not be able to vote; only newcomers who purchased the software in 2011 can, since the competition is designed for them. I was a little surprised, but the owner's word is law: the competition is an advertising campaign, and Alexander knows best how to run it. So I decided to post an article anyway; it is somewhat easier to write when it is clear for whom, and writing for the whole collective farm at once is, in fact, impossible.

In XRumer 7.07 the program has been taught several new engines: "Powered by php-Fusion", forumi.biz, forumb.biz, 1forum.biz, 7forum.biz, etc., phpBB-fr.com, the Solaris phpBB theme. And the process of learning new ones is continuous.
"Powered by SMF 1.1.2"
"Powered by SMF 1.1.3"
"Powered by SMF 1.1 RC2"
"Powered by SMF 1.1.4"
"Powered by SMF 1.1.8"
"Powered by SMF 1.1.7"
"2006-2008, Simple Machines LLC"

And that is not all. While collecting engine versions, on some SMF forums we find the line "2001-2006, Lewis Media" in the footer. We check this query: it also fully satisfies us. We find a similar one: "2001-2005, Lewis Media". Looking through more footers, we find another query: "SMFone design by A.M.A, ported to SMF 1.1". We check it: great. And so on. Half an hour of work and you have a fine database of queries for the engine, and Google will ban you for these queries far less often than for queries with operators. At the same time, your database will be much cleaner than with queries like "index.php?topic=", because for those Google returns not only the forums we need but also a lot of junk resources where someone managed to leave a link to a forum topic. You might object: what is wrong with that? Others left a link, so we can too. But links can be left not only by XRumer but also by other programs, including highly specialized software tailored to commenting on one particular resource, and such links could also have been left by hand. Again, I repeat: what matters to us is not the quantity of garbage but the quality, so we will collect the database with the right queries. The advantage of this method is that you will hardly need to configure the Sieve filter at all.
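The harvesting step above can be sketched as a single grep over a saved footer. The file name and footer line are made up, and the pattern covers only the kinds of version strings and copyright lines quoted above:

```shell
# A made-up SMF footer saved to disk:
cat > footer.html <<'HTML'
Powered by SMF 1.1.8 | SMF (c) 2001-2006, Lewis Media
HTML
# Pull out candidate signatures: version strings and copyright lines.
grep -o -E 'Powered by SMF [0-9.]+|20[0-9]{2}-20[0-9]{2}, (Simple Machines LLC|Lewis Media)' footer.html
```

Every line this prints is a ready-made candidate query; each one still gets checked in Google by hand, as described above.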
伟哥 - Viagra
吉他 - guitar
其他 - other
保险公司 - insurance company

Put these replacement codes in the Words file:

%E4%BC%9F%E5%93%A5
%E5%90%89%E4%BB%96
%E5%85%B6%E4%BB%96
%E4%BF%9D%E9%99%A9%E5%85%AC%E5%8F%B8

If you are promoting an insurance website, then a link placed in your profile on a thematic (!) forum, even a Chinese one, found by the query "SMF forum" 保险公司, will do very well.
print "phpbb\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?phpbb\.com\/?"[^>]*>phpBB/i or
     # $data =~ /viewforum\.php\?[^"]*f=\d+/i or
     $data =~ /phpBB\-SEO/i);

print "ipb\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?invision(?:board|power)\.com\/?[^"]*"[^>]*>[^<]*IP\.Board/i or
     $data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?invisionboard\.com\/?"[^>]*>Invision Power Board/i or
     $data =~ /index\.php\?[^"]*showforum=\d+/i);

print "vbulletin\n"
  if($data =~ /Powered by:?[^<]+vBulletin[^<]+(?:Version)?/i);

print "smf\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?simplemachines\.org\/?"[^>]*>Powered by SMF/i or
     $data =~ /index\.php\?[^"]*board=\d+\.0/i);

print "punbb\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:(?:www\.)?punbb\.org|punbb\.informer\.com)\/?"[^>]*>PunBB/i); # or
  # $data =~ /viewforum\.php\?[^"]*id=\d+/i);

print "fluxbb\n"
  # if($data =~ /viewtopic\.php\?id=\d+/i or
  if($data =~ /<a[^>]+href="http:\/\/(?:www\.)?fluxbb\.org\/?"[^>]*>FluxBB/i);

print "exbb\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?exbb\.org\/?"[^>]*>ExBB/i); # or
  # $data =~ /forums\.php\?[^"]*forum=\d+/i);

print "yabb\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?yabbforum\.com\/?"[^>]*>YaBB/i or
     $data =~ /YaBB\.pl\?[^"]*num=\d+/i);

print "dleforum\n"
  if($data =~ /\(Powered By DLE Forum\)<\/title>/i or
     $data =~ /<a[^>]+href="[^"]+(?:http:\/\/(?:www\.)?dle\-files\.ru|act=copyright)[^"]*">DLE Forum<\/a>/i);

print "ikonboard\n"
  if($data =~ /<a[^>]+href="[^"]*http:\/\/(?:www\.)?ikonboard\.com\/?[^"]*"[^>]*>Ikonboard/i);

print "flashbb\n"
  # if($data =~ /forums\.php\?fid=\d+/i or
  # $data =~ /topic\.php\?fid=\d+/i or
  if($data =~ /<a[^>]+href="http:\/\/(?:www\.)?flashbb\.net\/?"[^>]*>FlashBB/i);

print "stokesit\n"
  # if($data =~ /forum\.php\?f=\d+/i or
  if($data =~ /<a[^>]+href="http:\/\/(?:www\.)?stokesit\.com\.au\/?"[^>]*>[^\/]*Stokes IT/i);

print "podium\n"
  # if($data =~ /topic\.php\?t=\d+/i or
  if($data =~ /<a[^>]+href="?http:\/\/(?:www\.)?sopebox\.com\/?"?[^>]*>Podium/i);

print "usebb\n"
  # if($data =~ /forum\.php\?id=\d+/i or
  if($data =~ /<a[^>]+href="http:\/\/(?:www\.)?usebb\.net\/?"[^>]*>UseBB/i);

print "wrforum\n"
  # if($data =~ /index\.php\?fid=\d+/i or
  if($data =~ /<a[^>]+href="http:\/\/(?:www\.)?wr\-script\.ru\/?"[^>]*>WR\-Forum/i);

print "yetanotherforumnet\n"
  if($data =~ /Yet Another Forum\.net/i or
     $data =~ /default\.aspx\?g=posts&t=\d+/i);
site:talk.*.ru
site:board.*.ru
site:smf.*.ru
site:phpbb.*.ru
....
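The generator in gen-forumsearch-urls.pl presumably just enumerates subdomain prefixes against the listed zones. A minimal shell sketch of that idea, with the prefix list assumed from the sample queries above:

```shell
# Emit every prefix/zone combination as a site: query
# (prefixes taken from the samples above; zones from the article).
gen_queries() {
  for p in forum talk board smf phpbb; do
    for z in ru su ua kz by; do
      echo "site:$p.*.$z"
    done
  done
}
gen_queries
```

Five prefixes times five zones gives 25 queries, starting with site:forum.*.ru.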
The long introduction is over, now to the point.
What does a beginner need after buying such a super-combine as the XRumer + Hrefer complex? Right: to learn to work with it, and to discard the illusion that money can be made by simply spamming ready-made lists. If you think otherwise, better donate your money to charity right away. You need to learn to use the tools of the complex, preferably tuning them for yourself. The era of "take more, throw further" is gone; quantity is giving way to quality. So we will assemble a database for ourselves, and if you do not learn to do this, you will miss the train. Naturally, Hrefer will help us with this.

If you plan to promote your resources in Google, then we also need to search for donor sites through Google; I think that is understandable and logical. But Google, like the Mistress of the Copper Mountain, does not hand its riches to everyone: you need the right approach. Let me say right away: do not hope to collect anything useful from the signatures you find in public. They are public precisely because they are worthless, and I will not develop that topic further. Better to show how to assemble a base correctly so that you see results; you can work out the rest yourself, the main thing is to grasp the principle. We need to build queries around the characteristics of the specific engines we want, not around the traits of forums in general. That is the main beginner's mistake: not concentrating on one specific thing, but trying to cover everything at once. And one more thing: if you want to parse a more or less decent database, stop using operators in queries. No "inurl:", "site:", "intitle:" and so on; Google bans such seekers instantly. Therefore, we carefully study the engines XRumer currently works with.
In general, we need to prepare correct queries for Hrefer to parse. Let us take one forum engine as an example, SMF, and start disassembling it into spare parts for parsing. Our beloved Google will help us with this. Enter the query SMF Forums into Google: there is a lot of garbage in the results, so rewind to page 13 or so and pick any link. I came across this one: http://www.volcanohost.com/forum/index.php?topic=11.0. Open it and study it. We need to find something characteristic on the page that can be applied to the search for other pages on this engine. In the footer we notice the inscription Powered by SMF 1.1.14; we quote it and enter it into Google, which tells us it knows about 59 million results for this query. We quickly look through the links and add a couple more words to the keyword, for example "Powered by SMF 1.1.14" poplar or "Powered by SMF 1.1.14" viagra. We make sure the query is great: the results are only forums, with almost no garbage.
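The signature step can be done mechanically: extract the exact version string from the footer, then wrap it in quotation marks before feeding it to the parser. A small sketch, with the footer line taken from the example above:

```shell
footer='Powered by SMF 1.1.14 | SMF (c) 2006, Simple Machines LLC'
# Cut out the exact version signature...
sig=$(printf '%s\n' "$footer" | grep -o 'Powered by SMF [0-9.]*')
# ...and print it quoted, since quoted queries keep the results clean:
printf '"%s"\n' "$sig"
```

This prints "Powered by SMF 1.1.14" with the quotation marks already in place, ready for the query list.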
Besides, as I said above, we are interested in quality, not quantity. Onward. From the same forum we take another phrase from the footer, quote it as well and feed it to Google. In response it reveals that it knows more than 13 million results. Again we quickly look through the results, add extra words and check the results with them. We make sure this query is excellent too, with almost no garbage. So we already have two iron queries. I suggest leaving the first forum alone for now and continuing to collect queries from other forums; conveniently, we already have Google open on the query 2006-2008, Simple Machines LLC. From the results we take, for example, these forums: http://www.snowlinks.ru/forum/index.php?topic=1062.0 and http://litputnik.ru/forum/index.php?action=printpage;topic=380.0, and from their footers we take the following queries: "Powered by SMF 1.1.7" and "Powered by SMF 1.1.10" (I always recommend entering queries for Hrefer in quotation marks, because we need quality first of all). I think it is clear what we are doing; in the end we will have a database of queries for finding forums on the SMF engine (chosen here as an example; the same goes for other engines).
It will look something like this:
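A minimal sketch of producing such a query base: loop over the collected version strings (the versions below are the ones quoted from footers earlier in the article) and print each one already wrapped in quotation marks.

```shell
print_base() {
  # version signatures gathered from footers earlier in the article
  for v in "1.1.2" "1.1.3" "1.1 RC2" "1.1.4" "1.1.7" "1.1.8" "1.1.10" "1.1.14"; do
    printf '"Powered by SMF %s"\n' "$v"
  done
  # copyright-line queries go into the same base
  printf '"%s"\n' "2006-2008, Simple Machines LLC"
}
print_base
```

The output is one quoted query per line, which is exactly the format a query list for the parser takes.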
I think that learning to use XRumer correctly at the initial stage is very important, because once you learn this, you can always find a use for it, no matter how the situation changes. Protections keep getting more complex, and if the protection on some types of engines has been strengthened so that XRumer cannot cope with it at the moment, there is no point spending resources on collecting those links and then working through them with XRumer; it is better to concentrate on what gives results. Conversely, if the Botmaster Labs team has taught XRumer something new, you can quickly dissect the new patient and prepare a base for XRumer while the patient is still warm. Time is money: a resource may no longer be relevant by the time you buy a base collected by someone else. Besides, collecting bases for yourself correctly significantly expands the "white" uses of XRumer. And that is exactly where everything is heading, whether we like it or not; the whitening (or at least graying) process is ongoing, and black lists are becoming a thing of the past in every possible way.
All other technical aspects of working with Hrefer can be found in its help, and there is no point dwelling on them here; all the goals, points and seconds are tuned experimentally for each machine individually.
As a bonus, I will post here a template for parsing the Chinese search engine Baidu; the other day I was asked about it, so I knocked it together offhand, excuse the pun. :)
Hostname=http://www.baidu.com
Query=s?wd=
LinksMask=
TotalPages=100
NextPage=
NextPage2=
CaptchaURL=
CaptchaImage=
CaptchaField=
I ran a test parse on them: there was no ban, Hrefer collected resources quickly, and the parsing queries were all similar to Google's, but the results contained many Chinese resources with high PR, including many places where no European had ever set foot. It is better to parse with Chinese queries; Google Translate will help here: type in a list of keywords in Russian and translate it into Chinese. True, words cannot be added to Hrefer's Words file in Chinese; they need to be recoded.
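The recoding is simply percent-encoding of the keyword's UTF-8 bytes. A minimal shell sketch using od and awk (the function name is mine):

```shell
# Percent-encode the UTF-8 bytes of a keyword for the Words file.
encode() {
  printf '%s' "$1" | od -An -tx1 | tr ' ' '\n' | grep . \
    | awk '{printf "%%%s", toupper($0)}'
  printf '\n'
}
encode "伟哥"      # -> %E4%BC%9F%E5%93%A5
encode "保险公司"  # -> %E4%BF%9D%E9%99%A9%E5%85%AC%E5%8F%B8
```

The outputs match the replacement codes listed for these words earlier in the article, so the same function can prepare any translated keyword list for the Words file.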
Instead of the Chinese words themselves, put their percent-encoded replacement codes (listed earlier) into the file.
In conclusion, I would like to say that I never understood people who complain that Hrefer parses poorly; in response I always want to say that you simply do not know how to cook it. No parser collects results better than Hrefer; the queries just have to be correct. Hrefer is a car: good, solid, German-made, but a person drives it, and everything depends on how well it is driven. You cannot make a car turn right and left at the same time.
A separate topic is cleaning databases; I covered it three years ago for a previous competition. For the most part everything there is still relevant, but nowadays you can skip the check for "200 OK". I never really liked that process: its errors were large and a lot of useful material was filtered out. Now this can be done almost automatically while XRumer runs, although the process is not a complete analogue of the "200 OK" check. Anyway, to the point: not long ago a wonderful feature appeared in XRumer, grabbing information from resources while a project is running. It works like this: you enter a template that is processed during the run, and the information collected by the template is written to the xgrabbed.txt file in the Logs folder. You can use this function for anything; the room for imagination is huge. I use it once a week to remove "expired" links from my working database. It is no secret that forums die out every day, and the "Autograbbing" tool will help us clear such resources from our database.
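One way to apply the grab log is to subtract it from the working base with plain grep. A sketch: xgrabbed.txt is the log file named in the text, while base.txt and both URLs are made up for the example.

```shell
# A grabbed "dead" link and a two-line working base (both made up):
printf '%s\n' 'http://deadforum.example/index.php' > xgrabbed.txt
printf '%s\n' 'http://deadforum.example/index.php' \
              'http://aliveforum.example/forum/' > base.txt
# Drop every base line that appears verbatim in the grab log
# (-F fixed strings, -x whole-line match, -v invert):
grep -v -F -x -f xgrabbed.txt base.txt > base.cleaned.txt
cat base.cleaned.txt
```

Only the live forum survives into base.cleaned.txt; run weekly, this keeps the working base free of dead domains.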
After all, you must admit: when we open, for example, http://www.laptopace.com/index.php, we often see that the domain is already up for sale, for example at GoDaddy, and there is no forum there anymore. So, to throw this slag out of the base, we will grab. :) Open the source code of such a page and find the characteristic entry in it.

Now all the "dead men" parked at GoDaddy will be known to us by name.

Here is a small selection for the Autograbbing tool, in case you want to clear your database of various "expired" domains: