Understanding Boundaries and Playing Safely
Are you a web spammer? No, I'm serious. If there is one area that many search optimizers and marketers do not always understand, it is penalties and filters from search engines. You will often run into this confusion in SEO circles. We need look no further than duplicate content: while this is (usually) a filter, there is no shortage of people calling it the "duplicate content penalty".
So I thought it would be a good idea to look at the many forms of web spam from a search engineer's perspective. This is not about teaching you how to be a better spammer. In fact, quite the opposite, since I'm not a fan of this nonsense. Of course, I have a few friends who play in the black hat world, but they know very well that I am not a fan of it and don't believe in polluting the Internet in general.
We hope this journey helps you avoid tactics or groups of actions that could put your client or your own websites at risk.
Defining Web Spam
What is Web Spam? In research for this post, this struck me as the best, or at least the most concise definition I have come across:
any deliberate human action that is intended to cause an unreasonably favorable relevance or importance to a web page, given its true value (from Web Spam Taxonomy, Stanford).
Hmmmm. Or is it? If that were the case, we would ALL be spammers, since what we do as search engine optimizers is repeatedly try to stack the deck. Oh well. What's even more interesting is that the Stanford paper says:
An important voice in the field of web spam is that of search engine optimizers (SEOs), such as SEO Inc. (www.seoinc.com) or Bruce Clay (www.bruceclay.com).
Ouch. Not nice at all. And how about this:
Most SEOs claim that spam only increases relevance for queries that are not related to the topic(s) of the page. At the same time, many SEO professionals recommend and practice techniques that influence importance scores to achieve what they call "ethical" positioning or web page optimization. Please note that according to our definition, all types of actions aimed at increasing rankings without increasing the true page value are considered spam.
Heck, this reminds us that SEOs may not be criminals, but to search engineers they are definitely the enemy. Let's dial it back a little and say that spam is blatant manipulation that adds no value and aims to exploit the blind spots of the search algorithm, okay? Lol, let's leave it at that. And never forget: they don't like us (search engine optimizers).
Types of Web Spam
Basically, there are two types of spam: boosting and hiding.
Boosting Techniques
This is when someone takes an action designed to (artificially?) increase or enhance the perceived value of a page.
Term spam: This covers attempts to manipulate elements such as the page TITLE (title spam), Meta Description, or Meta Keywords (meta spam). As most of us know, two out of three of these have been abused to the point that most modern search engines don't use them as signals at all.
URL spam: This is another area they've been known to pay attention to. Yes, as odd as it sounds, since some search engines attach some importance to the terms in URLs, stuffing them can be viewed as manipulation.
Link spam: Another well-known technique, which also includes anchor text spam. Search engines take into account not only the volume of links but also the anchor text, as this is one of the most important ranking signals. This category also includes cases where spammers place links on other pages to increase the value of their target pages (forums, comments, guestbooks, etc.), and obviously the more nefarious methods such as hacking.
Hiding Techniques
This set of techniques covers the usually subtle methods used to conceal attempts to boost a page's ranking. Or, more accurately, techniques for hiding the boosting. This is certainly more problematic, and search engines tend to view these as more insidious than the boosting techniques themselves.
Content hiding : These are techniques in which terms and links are hidden when the page is displayed by the browser. The most common approaches are to use color schemes that render the corresponding elements virtually invisible.
Cloaking: We all know this one, right? This is when someone identifies a search engine crawler and shows the spider a different version of the page than the average user would get. This is supposed to reduce the number of reports from users or competitors who might otherwise see the spam page.
Redirection: The page is automatically redirected by the browser, so that the page is indexed by the engine but the user never sees it. It essentially acts as a proxy / doorway page that games the engine and misleads users.
Anti-spam approaches
Content spam
Language: In some tests, search engineers looked at the actual languages of the pages to see what they could find. Notably, French pages were the most likely to be spam, followed by German and English. I found this result interesting.
Domain: I'm sure it's no surprise that .BIZ domains have been found to have a much higher spam rate than any others. These were followed by the .US and .COM domains. But .BIZ was well ahead of the rest, so stay away from them, okay?
Number of words per page: Another commonly used approach. They found that pages with a lot of text more often contained spam. The curve did drop off past 1500 words or so; the 750-1500 range seemed to be the spammers' sweet spot.
Keywords in the page TITLE: This is another area they look at, as testing has shown that spam pages tend to use many more keywords (KW) in the TITLE element than non-spam pages.
Amount of anchor text: Another interesting approach is to look at the ratio of anchor text to ordinary text on a page. This can be done at the page or site level. Websites with a high percentage of anchor text (relative to ordinary text) are more likely to be spam.
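To make the anchor text ratio concrete, here is a minimal sketch using only Python's standard library. The class and function names are my own for illustration, and this is a simplification rather than how any search engine actually measures it: it counts every text node as visible, including script and style content.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts words inside <a> elements versus all words on the page."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # greater than 0 while inside an <a> element
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Simplification: every text node counts as ordinary page text.
        count = len(data.split())
        self.total_words += count
        if self.anchor_depth > 0:
            self.anchor_words += count


def anchor_text_fraction(html: str) -> float:
    """Fraction of the page's words that sit inside links (0.0 to 1.0)."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words
```

A page that is mostly links scores close to 1.0. Where the cut-off for "suspicious" sits is something an engine would have to learn from known-good and known-spam pages.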
Percentage of visible content: This refers to attempts to use hidden text, not to be confused with code-to-text ratio. They look at the percentage of text that doesn't actually appear on the page.
Compressibility: As a mechanism for combating keyword stuffing, search engines may also look at compression ratios, which expose repetitive or spun content. Search engines often compress pages anyway to save on indexing and processing. Heavily repeated content compresses unusually well, so spam pages tend to have a high compression ratio (uncompressed size divided by compressed size).
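The compression ratio is easy to try for yourself. The sketch below uses zlib from the Python standard library as a stand-in compressor. That choice is an assumption on my part, since the source does not say which compressor the engines use, and the function name is hypothetical.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by compressed size, in bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# A keyword-stuffed page repeats itself, so it compresses far better
# than ordinary, varied text of similar length.
stuffed = "cheap widgets buy cheap widgets now " * 200
varied = " ".join(str(i * i) for i in range(500))

print(compression_ratio(stuffed) > compression_ratio(varied))
```

The absolute numbers depend on the compressor and the page length, so the useful comparison is against a corpus of known-good pages, not a fixed value.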
Globally popular words: Another good way to find keyword stuffing is to compare the words on the page with existing query data and known documents. In effect, if someone is stuffing these terms, they will appear less naturally than they do in user queries and known-good pages.
Query spam: With the rise of query analysis, click-through data, and personalization, spammers may run queries for their targeted terms and click on their own results. When you look at the query patterns in combination with other signals, this tactic becomes statistically obvious.
Host-level spam: This looks at the other sites and domains at the server and/or registrar level. As with TrustRank, spammers are often found in the same neighborhood as other spammers.
Phrase-based: With this approach, a probabilistic learning model trained on sample documents looks for textual anomalies in the form of related phrases. Think of it as keyword-stuffing detection on steroids. Searching for statistical anomalies often reveals spam documents.
Link spam
TrustRank: This method goes by several names; TrustRank is the Yahoo variation. The concept is based on having "good neighbors". Research shows that good sites link to good sites, and vice versa. You are known by the company you keep.
Link stuffing: This is more of an on-site approach, where the spammer creates tons of low-value pages and points all of their links (even within the site) at the target page. Spam sites tend to have a higher ratio of these unnatural patterns (compared with a training set of known-good pages).
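The "good neighbors" idea behind TrustRank can be sketched as a PageRank-style computation in which the random jump only ever lands on a hand-picked set of trusted seed pages, so trust flows outward along links from those seeds. This is a simplified illustration, not the published algorithm in full: the function name, the damping value of 0.85 and the toy graph are all assumptions of mine.

```python
def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of pages assumed to be trustworthy (hand-picked).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # All of the "teleport" probability goes to the trusted seeds.
    seed_weight = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_weight)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_weight[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # dangling page: its trust is simply not passed on
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Toy graph: a trusted seed and the page it links to, plus an isolated
# pair of spam pages that only link to each other.
graph = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trust_scores(graph, seeds={"seed"})
```

In this toy graph the two spam pages end up with a trust of zero, because no trusted page links into their little cluster, which is exactly why spammers work so hard to get links from good neighborhoods.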
Nepotistic links: This is where we have everything from paid links to traded (reciprocal) links. While this may be a gray area for SEOs, search engines will most certainly treat any form of link manipulation, reciprocal links included, as open to scrutiny.
Topological spam (link farms): Although we have our own jargon for it, search engines will analyze the percentage of a site's links in the graph compared with known "good" sites. Usually, those looking to manipulate the engines will have a higher percentage of links from these neighborhoods.
Temporal anomalies: Another area in which spam sites tend to differ from other pages in the corpus is historical data. The index will show the average rates of link gain and link loss for "normal" sites. Temporal data can be used to help detect spam sites engaging in unnatural link building habits.
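As a rough illustration of the temporal idea, the sketch below compares a site's busiest month of link gain against a baseline taken from "normal" sites, using a plain z-score. The function name, the sample numbers and the choice of a z-score are my own assumptions; a real engine would work with far richer historical data.

```python
from statistics import mean, stdev


def link_growth_zscore(site_monthly_new_links, baseline_monthly_new_links):
    """How many standard deviations the site's peak month sits above the baseline mean."""
    mu = mean(baseline_monthly_new_links)
    sigma = stdev(baseline_monthly_new_links)
    if sigma == 0:
        return 0.0
    return (max(site_monthly_new_links) - mu) / sigma


# Hypothetical numbers: new links per month for typical sites in a niche.
baseline = [10, 12, 9, 11, 13, 10, 12]

steady_site = [10, 11, 12]
bursty_site = [11, 12, 400, 10]   # sudden spike, e.g. a purchased link blast

print(link_growth_zscore(steady_site, baseline))
print(link_growth_zscore(bursty_site, baseline))
```

The steady site stays under one standard deviation from the norm, while the spike stands out by a wide margin.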
Lessons for SEO Professionals
So what's the point? This journey was interesting to me on several levels. Let's see:
Signal Ranking : If we reverse engineer them, we can begin to really see which signals are important and which ones they want to protect. Understanding what they are protecting tells us what they think is important. Right?
Signal funnel: Given the amount of effort put into link spam, we know that modern link-centric search engines are interested in less diverse approaches to ranking. That is, if you NEED links to rank, they know where to look for spammers. The future of search depends heavily on how web spam is dealt with. Watch and learn.
You are the bad guys: As stated, we are not on most search engineers' Christmas card lists. Know this and understand this. They tolerate us, even the most well-meaning white hats among us.
Dampening is more common: Another thing I've learned is that more often than not, especially with borderline link spam, the link juice gets cut off rather than the site being de-indexed. Is this a penalty or a filter? Does it matter?
Authority / Trust: It pays to watch where we play. Building authority and associating with other trusted entities is very important.
As always, it never hurts to better understand the search engines if you're going to optimize for them. Heck, maybe if we, as a group, begin to better understand search engineers and their problems, they might someday speak well of us. No, it's just a silly dream.
Combinations Create Spam Signals
It is always important to remember that in most cases no single signal or approach is treated as conclusive. Search engines generally combine a variety of methods to find spam. For those of us who play nice, this means there is less chance of false positives.
To get your clients or yourself into hot water, you generally have to trip more than one wire. That said, most people in the search community are not big fans of SEO, and there are those who believe that even minor "manipulations" should be punishable. As far as I know, we don't have to worry too much about a lynching just yet. Ultimately there are levels and thresholds, and as long as you avoid tripping too many wires, you should be fine.
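The "levels and thresholds" idea can be sketched as a simple rule: flag a page only when several independent signals cross their limits at once. Everything here is hypothetical, from the signal names to the threshold values, which I picked purely for illustration. Real engines presumably feed such signals into trained classifiers instead of fixed cut-offs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSignals:
    compression_ratio: float      # uncompressed size / compressed size
    anchor_text_fraction: float   # share of words inside links
    title_keyword_count: int      # keywords found in the TITLE element
    link_growth_zscore: float     # deviation from normal link gain


# Hypothetical thresholds, chosen only to illustrate the idea.
THRESHOLDS = {
    "compression_ratio": 4.0,
    "anchor_text_fraction": 0.5,
    "title_keyword_count": 8,
    "link_growth_zscore": 3.0,
}


def tripped_signals(signals: PageSignals) -> List[str]:
    """Names of the signals that exceed their threshold."""
    return [
        name for name, limit in THRESHOLDS.items()
        if getattr(signals, name) > limit
    ]


def looks_like_spam(signals: PageSignals, min_tripped: int = 2) -> bool:
    """Flag a page only when at least min_tripped signals fire together."""
    return len(tripped_signals(signals)) >= min_tripped


# One odd signal alone (a keyword-heavy title) is not enough to flag the page.
borderline = PageSignals(2.1, 0.2, 10, 0.5)
# Several wires tripped at once is a different story.
blatant = PageSignals(9.0, 0.8, 12, 0.4)
```

Requiring more than one tripped wire is what keeps the false-positive rate down for sites that are merely a little aggressive in one area.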
One thing is for sure: you will become a much better SEO specialist if you gain more knowledge in the field of information retrieval. This post touches on some general aspects, but there is a TON more for those interested.
Hope you enjoyed your trip ... play it safe!
Patents, Articles and Videos
Before I leave, here is a load of research and reading material for those who want to know more. My goal is always to motivate people to learn more. No single blog entry can do justice to any IR (information retrieval) topic. Below are some of the items I looked at when putting this together.
Web Spam Research Articles
Spam Double-Funnel: Connecting Web Spammers with Advertisers - Search Ranger System
Detecting Spam Web Pages through Content Analysis - Microsoft
Improving Web Spam Classification Using Ranking Features - (AIRWeb 2007)
Finding Adversarial Information on the Internet - (AIRWeb 2007)
Web Spam Detection with Decision Trees - Indian Institute of Information Technology
Web spam detection: link and content - based techniques - Yahoo
Identifying web spam via content and hyperlinks - Yahoo
TrustRank concepts
Combating Web Spam with TrustRank - Stanford, 2004
Propagating Trust and Distrust to Demote Web Spam - Lehigh University
Recognizing Nepotistic Links on the Web - B. Davison
Detecting Nepotistic Links by Language Model Disagreement
Link Spam Alliances - Stanford
Know Your Neighbors: Web Spam Detection Using Web Topology - Yahoo
Detecting Overly Reciprocal Links Between Web Objects - Yahoo (Patent)
Link spam
A Small Example of Link Based Learning for Web Spam Detection - Chinese Academy of Sciences
Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings - B. Wu, B. D. Davison
Link spam detection using temporal information - Microsoft
Extracting Link Spam Using Biased Random Walks from Spam Seed Sets - B. Wu, K. Chellapilla
Link analysis to detect web spam - Yahoo Research
Link Spam Detection Based on Mass Estimation - Stanford
Link based feature and web spam detection - Yahoo
Implicit / explicit signals
Identifying web spam with user behavior analysis - AIRweb
Behavioral Web Spam Detection - WWW
Web Spam Detection via Commercial Intent Analysis - András Benczúr, István Bíró, Károly Csalogány
Analyzing query log to detect spam - Yahoo
Cloaking
Cloaking and Redirection: A Preliminary Study - Lehigh University
Detecting Semantic Cloaking on the Web - Lehigh University
Social spam
Antisocial tagger - detects spam on social bookmarking systems - AirWeb
An empirical study of a sample of active learning to detect splogs - AIRweb
Detection of video spammers in social networks - Polytechnic University
Social Spam Detection - Indiana University
Language / semantics related
Identifying web spam using language model analysis - AIRweb
Detecting Spam Web Pages through Content Analysis - Microsoft
Exploring Linguistic Features for Web Spam Detection: A Case Study - Various Authors
Video
Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam measures, local search, peer-to-peer search, and search on blogs and online communities.
Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection - Yahoo! Research
Web Spam Challenge 2007 Track II - Secure Computing Corporation Research
Web Spam Detection - Sapienza University of Rome
WITCH: A New Approach to Web Spam Detection - Google Tech Talks
Patents
Trust Signals
Yahoo - identifying spam hosts using layered graphical learning
Yahoo - Spam Host Detection Based on Prediction Levels
Query spam
Classifying web spam pages using query-specific data - Microsoft
Link spam
Detect web spam when website links change - Microsoft
Method for detecting link spam in databases with hyperlinks - Google
Identifying overly reciprocal links between web objects - Yahoo
Link-Based Spam Detection - Yahoo
Spam cloaking and redirection
Cloaking detection using popularity and market value - Microsoft
System and method for identifying hidden web servers - Najork, Marc A .; January 4, 2002 (now at Microsoft)
Search Ranger and double-sequence model for search spam analysis and browser protection (cloaking) - Microsoft
Detecting and characterizing network proxies - Yahoo
Other
Spam document detection when searching for information based on phrases - Google
Detecting media spam using speech conversion -
Domain-based spam resistance rating - Microsoft
Content Assessment - Microsoft
So, if this is not everything you ever wanted to know about web spam, then I don't know what is!