Tuesday, March 24, 2009

Creating XML Site Map

Apart from the HTML site map on your web site, which helps visitors and search engine robots navigate the site, you can create an XML Sitemap. XML Sitemaps are meant specifically for search engine robots and can be submitted to each search engine. A Sitemap lists all the links on your website that you would like search engine robots to visit. It helps especially with dynamic pages, which search engine robots would otherwise have no knowledge of.

Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">  <!-- Required -->
  <url>  <!-- Required -->
    <loc>http://www.example.com/index.html</loc>  <!-- Required -->
    <lastmod>2005-01-01</lastmod>  <!-- Optional; W3C Datetime standard -->
    <changefreq>monthly</changefreq>  <!-- Optional; one of always, hourly, daily, weekly, monthly, yearly, never -->
    <priority>0.8</priority>  <!-- Optional; default = 0.5 -->
  </url>
</urlset>


All data in a Sitemap must be entity-escaped and UTF-8 encoded. A Sitemap can hold at most 50,000 URLs and must be no larger than 10MB. Sample Sitemap.xml.
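
As a rough illustration of the format above, here is a minimal Python sketch that builds a Sitemap with the standard library. The function name build_sitemap and the dictionary keys are my own choices for the example, not part of the Sitemap standard; ElementTree takes care of the entity escaping and the UTF-8 encoding.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-Sitemap URL limit quoted in the text


def build_sitemap(entries):
    """Return Sitemap XML as UTF-8 bytes.

    `entries` is a list of dicts with a required 'loc' key and optional
    'lastmod', 'changefreq' and 'priority' keys (hypothetical input shape).
    """
    if len(entries) > MAX_URLS:
        raise ValueError("a single Sitemap may list at most 50,000 URLs")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree entity-escapes &, < and > in the text for us.
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/index.html",
     "lastmod": "2005-01-01", "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/products.jsp?id=1&category=7"},
])
```

Note how the ampersand in the second URL ends up as &amp;amp; in the output, which is the entity escaping the standard asks for.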

Location

The Sitemap.xml file is usually located in the top-level directory of your website (http://www.yourwebsite.com/Sitemap.xml). This is not a requirement but is highly recommended, because the location of a Sitemap.xml determines which URLs it can contain. If the Sitemap.xml is located at http://www.yourwebsite.com/product/Sitemap.xml, it can only contain URLs for pages under http://www.yourwebsite.com/product/, which also means all the URLs in a Sitemap.xml must be for the same host. You can also specify the path to your Sitemap.xml in robots.txt.
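
The location rule can be expressed as a small check. The sketch below is only my reading of the rule described above (same scheme and host, and the page path must sit under the directory that holds the Sitemap); the function name url_allowed_in_sitemap is made up for the example.

```python
from urllib.parse import urlsplit


def url_allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url may be listed in the Sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # All URLs must be on the same host (and scheme) as the Sitemap itself.
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False
    # Directory that contains the Sitemap, e.g. "/product/" or "/".
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    return (page.path or "/").startswith(base_dir)


print(url_allowed_in_sitemap(
    "http://www.yourwebsite.com/product/Sitemap.xml",
    "http://www.yourwebsite.com/product/item.html"))
print(url_allowed_in_sitemap(
    "http://www.yourwebsite.com/product/Sitemap.xml",
    "http://www.yourwebsite.com/about.html"))
```

The first call reports that the URL is allowed, the second that it is not, because /about.html lies outside /product/.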

Sitemapindex

A Sitemapindex groups multiple Sitemap files together, with one Sitemap element entry for each Sitemap file location on your website. There is an upper limit of 1,000 Sitemaps per website. A Sitemapindex can only group Sitemaps of the same website and, as with a Sitemap, all data in a Sitemapindex must be entity-escaped and UTF-8 encoded.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">  <!-- Required -->
  <sitemap>  <!-- Required -->
    <loc>http://www.yourwebsite.com/sitemap1.xml</loc>  <!-- Required -->
    <lastmod>2009-01-31</lastmod>  <!-- Optional; W3C Datetime standard -->
  </sitemap>
</sitemapindex>
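
When a site has more URLs than one Sitemap can hold, the list has to be split and the pieces tied together with a Sitemapindex. The following Python sketch shows one way to do that; chunk_urls and build_sitemap_index are hypothetical helper names, and the sitemapN.xml file names are just an assumption for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, size=50000):
    """Split a list of URLs into chunks that each fit into one Sitemap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locs, lastmod=None):
    """Return Sitemapindex XML (UTF-8 bytes) pointing at the given Sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc  # required
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod  # optional
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


all_urls = [f"http://www.yourwebsite.com/page{n}.html" for n in range(120001)]
chunks = chunk_urls(all_urls)
locs = [f"http://www.yourwebsite.com/sitemap{n}.xml"
        for n in range(1, len(chunks) + 1)]
index_xml = build_sitemap_index(locs, lastmod="2009-01-31")
```

Here 120,001 URLs end up in three Sitemap files, and the index lists sitemap1.xml to sitemap3.xml.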


Submitting Sitemap

1. Through robots.txt

Specify the location of your Sitemap.xml in robots.txt

Sitemap: http://yourwebsite.com/sitemap.xml

2. Through Search Engine Submission Interface

Most search engines provide an interface for submitting a Sitemap; some also provide tools to generate one for your website.

Google Sitemap Submission interface

Google Sitemap Generator

Yahoo Sitemap Submission interface

3. Via PING URL

<SearchEngineURL>/ping?sitemap=http%3A%2F%2Fwww.yourwebsite.com%2Fsitemap.xml

Google ping URL - http://www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fwww.yourwebsite.com%2Fsitemap.xml

Ask ping URL - http://submissions.ask.com/ping?sitemap=http%3A%2F%2Fwww.yourwebsite.com%2Fsitemap.xml


Yahoo ping URL - http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http%3A%2F%2Fwww.yourwebsite.com%2Fsitemap.xml

Here the SearchEngineURL is the URL of the search engine you would like to submit the Sitemap to. Once you receive an HTTP 200 response, you know that the search engine received your Sitemap (although it does not guarantee that your Sitemap is valid). The ping request can be issued with wget, curl or any other mechanism.
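
The Sitemap location in a ping URL has to be URL-encoded, which is easy to get wrong by hand. Below is a small Python sketch that only builds the ping URL following the generic pattern above; it does not send the request. The helper name build_ping_url is my own.

```python
from urllib.parse import quote


def build_ping_url(search_engine_url, sitemap_url):
    """Build <SearchEngineURL>/ping?sitemap=<URL-encoded Sitemap location>."""
    # safe="" makes quote() encode ":" and "/" as well.
    return f"{search_engine_url}/ping?sitemap={quote(sitemap_url, safe='')}"


ping_url = build_ping_url("http://submissions.ask.com",
                          "http://www.yourwebsite.com/sitemap.xml")
print(ping_url)
```

The result matches the Ask ping URL shown above and can be passed to wget, curl or any HTTP client.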


Other Formats of Sitemap

Although the other formats carry limited information about your website, sometimes they can come in handy for the Sitemap submission.

RSS/Atom Feed – RSS and Atom feeds can also be submitted as Sitemaps. The <link> in the feed is interpreted as the URL of the page, and the <pubDate> (RSS) or <modified> (Atom 0.3; <updated> in Atom 1.0) field is interpreted as the last-modified info by search engine robots.

Text File – A simple text file listing the URLs of your web pages, one per line, can be submitted as a Sitemap.

The text file must be UTF-8 encoded and must not contain any comment lines. A text file can list up to 50,000 URLs and should be no larger than 10MB. A longer list can be split into several text files (each with fewer than 50,000 URLs), and each file can be submitted separately. The text file must be in the highest-level directory of your website.
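
The rules for a text Sitemap are simple enough to check before submitting. The Python sketch below is one possible check, assuming that "10MB" means 10 * 1024 * 1024 bytes and that every line must be a full http or https URL; check_text_sitemap is a name invented for this example.

```python
from urllib.parse import urlsplit

MAX_URLS = 50000
MAX_BYTES = 10 * 1024 * 1024  # assumption: 10MB counted in binary megabytes


def check_text_sitemap(data):
    """Return a list of problems found in a text Sitemap given as bytes."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("file is larger than 10MB")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    lines = text.splitlines()
    if len(lines) > MAX_URLS:
        problems.append("file lists more than 50,000 URLs")
    for number, line in enumerate(lines, start=1):
        parts = urlsplit(line.strip())
        # Comment lines, blank lines and relative paths all fail this test.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"line {number} is not a full URL: {line!r}")
    return problems


good = b"http://www.yourwebsite.com/a.html\nhttp://www.yourwebsite.com/b.html\n"
bad = b"# my pages\nhttp://www.yourwebsite.com/a.html\n"
print(check_text_sitemap(good))
print(check_text_sitemap(bad))
```

The first file passes with an empty problem list; the second is flagged because of the comment on line 1.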

There are more Sitemap formats accepted by search engines for different kinds of content, such as video Sitemaps, mobile Sitemaps, news Sitemaps, code search Sitemaps, etc. Not all search engines support them. If you are interested in these Sitemap formats, please refer to Google Webmaster Help.


Compressing Sitemap

You can compress the Sitemap XML or text file (with gzip, for example) and provide the link to the compressed file in your links or submissions; the Sitemap standard accepts this.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
          <loc>http://www.yourwebsite.com/sitemap1.xml.gz</loc>
          <lastmod>2009-01-31</lastmod>
      </sitemap>
</sitemapindex>
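
Producing the .gz file referenced above takes only a few lines with Python's gzip module. This is a minimal in-memory sketch; the sample Sitemap content and the helper name compress_sitemap are assumptions for the example, and writing the bytes to sitemap1.xml.gz on disk is left out.

```python
import gzip


def compress_sitemap(sitemap_bytes):
    """Return the gzip-compressed form of a Sitemap (XML or text) given as bytes."""
    return gzip.compress(sitemap_bytes)


sitemap_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b'  <url><loc>http://www.yourwebsite.com/index.html</loc></url>\n'
    b'</urlset>\n'
)
compressed = compress_sitemap(sitemap_bytes)
# Decompressing gives back exactly the original Sitemap.
restored = gzip.decompress(compressed)
```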

Sitemap Validation

Schema for validating sitemaps can be downloaded from:

Sitemap Schema:
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd

To reference the XSD, the XML header changes to:

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
...
</url>
</urlset>

Sitemap index Schema: http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd

To reference the XSD, the XML header changes to:

<?xml version='1.0' encoding='UTF-8'?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
...
</sitemap>
</sitemapindex>
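
Full validation against the XSD needs a schema-aware XML library. As a lighter first pass, the Python sketch below checks only a few of the rules covered in this post using the standard library; it is not a substitute for schema validation, and check_sitemap is a name made up for the example.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly",
                     "monthly", "yearly", "never"}


def check_sitemap(xml_bytes):
    """Return a list of problems found by a few basic (non-XSD) checks."""
    root = ET.fromstring(xml_bytes)
    if root.tag != NS + "urlset":
        return ["root element is not <urlset> in the Sitemap namespace"]
    problems = []
    for position, url in enumerate(root.findall(NS + "url"), start=1):
        loc = url.findtext(NS + "loc")
        if not loc or not loc.strip():
            problems.append(f"<url> #{position} has no <loc>")
        changefreq = url.findtext(NS + "changefreq")
        if changefreq is not None and changefreq not in CHANGEFREQ_VALUES:
            problems.append(f"<url> #{position} has an unknown <changefreq>")
        priority = url.findtext(NS + "priority")
        if priority is not None:
            try:
                in_range = 0.0 <= float(priority) <= 1.0
            except ValueError:
                in_range = False
            if not in_range:
                problems.append(f"<url> #{position} has a <priority> outside 0.0-1.0")
    return problems


sample = (
    b"<?xml version='1.0' encoding='UTF-8'?>"
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://www.example.com/index.html</loc>"
    b"<changefreq>monthly</changefreq><priority>0.8</priority></url>"
    b"</urlset>"
)
print(check_sitemap(sample))
```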

For specific questions related to generating or writing Sitemap for your website, please reach me at bhawnablog@gmail.com

Wednesday, March 11, 2009

Keeping search robots away!

At times, you may not want some of your web pages to be visible in the search engine results page (SERP). The page may be under construction, or it may be a semi-private page that you would like to share with a smaller community, or there may be some other reason. Here I discuss ways, supported by the Robots Exclusion Protocol (REP), to keep search engine robots or crawlers from visiting a page or a link on a page.

To block search engine robots from indexing a particular page, use the robots meta tag with the content value noindex.

<meta name="robots" content="noindex" />

Alternatively, if you'd like the web page to be indexed but want to ask the search robot not to follow any of the links on the page, use the nofollow content value.

<meta name="robots" content="nofollow" />

More content values for the robots meta tag:

| Content Value | Description | Supported By |
|---|---|---|
| noindex | Do not index the web page | Google, Yahoo, Ask, MSN Live |
| index | Index the web page | |
| nofollow | Do not follow/visit any link on the web page | Google, Yahoo, Ask, MSN Live |
| follow | Follow all the links on the web page | |
| noarchive | Do not cache the web page | Google, Yahoo, Ask, MSN Live |
| nosnippet | Do not auto-generate the description based on page content | Google |
| noodp | Do not overwrite the description or title tag content with data from the Open Directory Project [home page only] | Google, Yahoo, MSN Live |
| noydir | Do not overwrite the description or title tag content with data from the Yahoo Directory | Yahoo |

You can also have combinations of content values (of course the combinations should make sense).

<meta name="robots" content="noindex, follow" />
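
To see which robots directives a page actually carries, the meta tag can be read back with Python's built-in HTML parser. This is a small sketch, not how any search engine does it; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Content values are comma separated, e.g. "noindex, follow".
        for value in content.split(","):
            if value.strip():
                self.directives.add(value.strip().lower())


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
directives = robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
```

For the combination shown above, may_index is False and may_follow is True.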

If you need to keep search robots from visiting multiple web pages of your website, you can use a robots.txt file, which is placed in the top-level directory of your web site.

Here is the syntax of the robots.txt file:

User-Agent: *
Disallow: /

In the above syntax, User-Agent identifies the search robot and * refers to all search robots. You can also specify the search robot name here to address a particular search engine robot. Refer to the User-Agent of major search engines.

To restrict a certain directory of your website:

User-Agent: *
Disallow: /Songs

To restrict a particular robot from visiting a directory of your website:

User-Agent: Googlebot
Disallow: /Songs

If you are addressing multiple search robots in your robots.txt, make sure that the directives for a specific User-Agent are specified before the directives for all robots (*).

#Disallow Google bot from visiting any web page/content under /Songs/private
User-Agent: Googlebot
Disallow: /Songs/private

#Disallow all other bots from visiting any web page/ content under /Songs
User-Agent: *
Disallow: /Songs
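
You can try out rules like these before publishing them with Python's urllib.robotparser module. The sketch below feeds it the two groups from the example, using the plain robot name Googlebot without a version number (my assumption, since this parser matches on the name only), and asks which URLs each robot may fetch. SomeOtherBot is a made-up robot name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Disallow Google bot from visiting anything under /Songs/private
User-agent: Googlebot
Disallow: /Songs/private

# Disallow all other bots from visiting anything under /Songs
User-agent: *
Disallow: /Songs
"""


def load_rules(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
site = "http://yourwebsite.com"

# Googlebot follows only its own group, so /Songs itself stays open to it.
googlebot_private = rules.can_fetch("Googlebot", site + "/Songs/private/demo.mp3")
googlebot_songs = rules.can_fetch("Googlebot", site + "/Songs/list.html")
# Any other robot falls back to the * group.
other_songs = rules.can_fetch("SomeOtherBot", site + "/Songs/list.html")
other_home = rules.can_fetch("SomeOtherBot", site + "/index.html")
```

Keep in mind that real search robots each use their own parser, so results for Allow rules and wildcards can differ from what this module reports.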

More robots.txt directives:

| Directive | Description | Supported By |
|---|---|---|
| Disallow | Do not visit the specified web page or directory | All search robots |
| Allow | Allow visiting a particular web page or directory. E.g., Disallow: /Songs followed by Allow: /Songs/Favs restricts the search robot from visiting everything under Songs except the Favs subfolder | Google, Yahoo, Ask, MSN Live |
| Sitemap | Location of your Sitemap, e.g. Sitemap: http://yourwebsite.com/sitemap_location.xml. The location of a Sitemap index file can also be included here | Google, Yahoo, Ask, MSN Live |
| Wildcards (* and $) | * matches any sequence of characters, e.g. Disallow: /Songs/*personal*. $ anchors the pattern to the end of the URL, e.g. Disallow: /Songs/*.mp3$. Learn more on pattern matching | Google, Yahoo, MSN Live |
| Crawl-Delay | Specifies the minimum delay between two successive requests made by the search robot | Ask |



robots.txt quick Tips


Q. To allow all search engine spiders to index all the files of your website

Your robots.txt file should look like this:
User-agent: *
Disallow:

Q. To disallow all spiders from indexing any file

Your robots.txt file should look like this:
User-agent: *
Disallow: /

Note: The slash '/' here is your root directory; by adding it to your Disallow statement you are restricting spiders from indexing all the files of your website.

For specific questions related to writing robots.txt for your website, please reach me at bhawnablog@gmail.com

Sunday, March 8, 2009

Organic SEO Best Practices Checklist

The DOs

Description

TIPs

Title Tag

<title>keyword in the Title</title>

  • < 60-65 characters including spaces.
  • The first 3 words, in any combination, should form a keyword phrase.

Image Tag

<img src="" alt="keyword in the alternate text"/>

  • Alt is another opportunity to add keywords
  • Add only image related text in Alt
  • Create short and meaningful alt text

Anchors

<a href="link to a related website">keyword</a>

  • Text based links
  • As long as it is an important keyword text and the link is relevant, anchor it
  • Avoid broken links
  • Use anchor text for linking to relevant dynamic content (crawlers will favor you)

Meta Tags

<meta name="keywords" content="related keyword list"/>

<meta name="description" content="short description of your website with few keywords"/>

  • <200 characters (description)
  • Do not repeat exact title in description
  • Avoid keyword repetition
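
The length tips for the title and description are easy to check automatically. The Python sketch below reads both from a page and compares them against the limits from this checklist (65 characters for the title, 200 for the description); the class name HeadParser and the function check_head are invented for this example.

```python
from html.parser import HTMLParser

TITLE_LIMIT = 65         # upper end of the "< 60-65 characters" tip
DESCRIPTION_LIMIT = 200  # from the "<200 characters" tip


class HeadParser(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_head(html):
    """Return a list of checklist problems for the title and description."""
    parser = HeadParser()
    parser.feed(html)
    problems = []
    if len(parser.title) >= TITLE_LIMIT:
        problems.append("title is too long")
    if len(parser.description) >= DESCRIPTION_LIMIT:
        problems.append("description is too long")
    if parser.title and parser.title == parser.description:
        problems.append("description repeats the exact title")
    return problems


page = ('<head><title>Kids Toys</title>'
        '<meta name="description" content="Wooden toys for kids aged 4 to 5"/></head>')
print(check_head(page))
```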

Header Tags

<h1></h1>,

<h2></h2>,

<h3></h3>,

<h4></h4>

  • Most important <h1>------> less important <h4>
  • <h1> occurrence – 2 at most

URL

Parameter: http://www.yoursite.com/products.jsp?id=12356&category=7&type=42&size=6&batch=65

Depth: http://www.yoursite.com/products/category/batch/season/item

  • Max parameters: 2


  • Max depth: 4
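
Both limits can be measured directly from a URL. In the Python sketch below I assume that "parameters" means query string parameters and that "depth" means the number of path segments after the host name; url_shape is a name made up for the example.

```python
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMETERS = 2
MAX_DEPTH = 4


def url_shape(url):
    """Return (number of query parameters, path depth) for a URL."""
    parts = urlsplit(url)
    parameters = len(parse_qsl(parts.query, keep_blank_values=True))
    depth = len([segment for segment in parts.path.split("/") if segment])
    return parameters, depth


def within_limits(url):
    parameters, depth = url_shape(url)
    return parameters <= MAX_PARAMETERS and depth <= MAX_DEPTH


parameter_url = ("http://www.yoursite.com/products.jsp"
                 "?id=12356&category=7&type=42&size=6&batch=65")
depth_url = "http://www.yoursite.com/products/category/batch/season/item"
print(url_shape(parameter_url), within_limits(parameter_url))
print(url_shape(depth_url), within_limits(depth_url))
```

Under these assumptions the first example URL has 5 parameters and the second has a depth of 5, so both exceed the limits in the checklist.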

Inbound Links

  • The more strong inbound links, the better
  • Request links, write articles with link to your site, PR, social network community, paid links

Visible Body Text

<body>body text</body>

  • Use <strong></strong> for relevant keywords
  • Use <em></em> for relevant keywords

Sitemap

XML based document at the root of your website.

http://www.yourwebsite.com/sitemap.xml

Learn more about writing Sitemap

  • < 50,000 URLs and 10MB size per site map
  • <1,000 site maps, per website, if multiple site maps used
  • Submit Sitemap to the search engines

Navigation

  • Text based navigation on the left

Domain Name

  • If new website, try using keyword in domain name to specifically define the content of your website

Bread-crumb trail

On your web page, provides navigation depth info.

Home > kids > Toys > 4T – 5T

  • Use your website name instead of HOME

FAQs

FAQs of popular searches related to your product

  • 4-15 questions
  • 200-800 words for each FAQ.

Popular Search List

Maintain a list of popular/most frequently searched items/keywords related to your product/website on every web page.

  • The search list should be relevant and should link to pages in your website.

Robots.txt

Learn more about robots.txt

  • Suggests to crawlers which pages to crawl and which to avoid
  • Helps in logging search engine visits

Company Address

Company Name, Street Address, City, State, Zip, Country, Phone#

  • Provides visibility in location search

Contact Statement

If you need more info, please contact ……

  • Instead,

    If you need more info about <yourproductname>, please contact …..


The DON'Ts

The Fix

Duplicate URLs

  • Use canonical link tag in the <head> section of all the duplicate pages to point to the original web page content:

    <link rel="canonical" href="http://yourwebsite.com/product.html"/>

Broken Links

  • Fix them!
  • Add nofollow keyword to the link suggesting the search crawler to not visit the link

Cookies

  • Unrestrict cookies
  • Set some default content when cookie is unavailable.

Session Ids

  • Generate a guest user and allow it to view the unrestricted content

Frames

  • Provide alternative to framed web site using the <noframes> tag.
  • The <noframes> tag content should be exactly the same as frames site content
  • Add link to HOME, with attribute TARGET="_top"

302 redirect

  • Avoid, if possible
  • Use robots.txt to avoid crawling this link, if possible
  • Add > 15-sec delay before redirecting