Sharepoint Forum


Cannot index non-portal site

  Asked By: Ernest    Date: Apr 12    Category: Sharepoint    Views: 1358

I am trying to add a non-portal site to be indexed by the search
engine. It is our current intranet site, which contains a lot of
information.

I set it up as a new content source, specified the URL, told it to
index the whole enchilada, and started the full update.

The index runs for about 2 seconds, stops, and nothing gets
indexed. I thought this was supposed to be pretty simple. I swear
Microsoft should hire me to figure these things out so they can
make things that make more sense.



8 Answers Found

Answer #1    Answered By: Darrell Peters     Answered On: Apr 12

We just ran into this very issue on our intranet site. There was a robots.txt
file that tightly controlled which processes could crawl the site. We needed to
add a rule to allow the SharePoint crawler.

You need to enter the full string as the user-agent:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
and then the appropriate Allow directives.

We had been using the short "MS Search 4.0 Robot" user-agent string, which does
not work (Microsoft will be adding a KB article about this).

Answer #2    Answered By: Lester Casey     Answered On: Apr 12

I wouldn't have thought about the robots.txt file, but
that makes total sense. You wouldn't happen to have one handy that is
already done, would you?

Answer #3    Answered By: Rudy Francis     Answered On: Apr 12

Yes, we use one, but to disallow all robots except for a select few.
That's why we needed a specific rule to allow the SPS crawler.

Otherwise, using User-Agent: * with Allow: / should do the trick.

Here's what we have:

User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
Allow: /wwm/
Allow: /americas/
Allow: /consulting/
Disallow: /bin/

User-agent: *
Disallow: /

(there are a few rules for some other robots)

Our site admins want to disallow all robots by default and add explicit entries
for trusted bots.
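If you want to sanity-check your robots.txt before kicking off another full update, a quick script can help. This is only a rough sketch, not a real robots.txt matcher: it finds the group whose User-agent line exactly matches the crawler's full string and lists that group's directives (the paths are the ones from the example above).

```python
# Minimal sanity check for a robots.txt allow-group (a sketch, not a
# standards-compliant matcher): find the group for an exact User-agent
# line and report its Allow/Disallow rules.

ROBOTS_TXT = """\
User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
Allow: /wwm/
Allow: /americas/
Allow: /consulting/
Disallow: /bin/

User-agent: *
Disallow: /
"""

def rules_for_agent(robots_txt, agent):
    """Return (directive, path) pairs from the group whose User-agent
    line exactly equals `agent`."""
    rules, in_group = [], False
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("user-agent:"):
            in_group = line.split(":", 1)[1].strip() == agent
        elif in_group and ":" in line:
            directive, path = line.split(":", 1)
            rules.append((directive.strip().lower(), path.strip()))
    return rules

crawler_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)"
allowed = [p for d, p in rules_for_agent(ROBOTS_TXT, crawler_ua) if d == "allow"]
print(allowed)  # → ['/wwm/', '/americas/', '/consulting/']
```

Note the exact-match comparison on the User-agent line: that mirrors the point in Answer #1 that the short "MS Search 4.0 Robot" token was not enough and the full string had to be used.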

Answer #4    Answered By: James Miller     Answered On: Apr 12

I created the robots.txt file, but when I go to start the full
update, it still only lasts about 1 second.

Did you guys adjust the registry as outlined in this doc?


Answer #5    Answered By: Collin Griffith     Answered On: Apr 12

No, we didn't need to do that (we just updated our robots.txt file). I did add a
Site Hit Frequency Rule in SPS Central Administration to limit requests to
1 page at a time (our intranet servers were getting hit hard by the crawler).

That was the only thing we needed to do. What errors do you see
when you look at the gatherer log?

Answer #6    Answered By: Scott Nelson     Answered On: Apr 12

I think I found the problem in the log files:

The address was excluded because its file extension is restricted in
the file type rules.

All of the pages on the site I am trying to index end in .php. It
appears the indexer is excluding all of these pages (which in turn is
the entire site).

How do I turn this off and where?

Answer #7    Answered By: Blake Marshall     Answered On: Apr 12

Under Configure Search and Indexing, go to Include file types and add php.
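The file-type rule is effectively an extension allow-list: any URL whose extension isn't on the list gets skipped, which on an all-.php site excludes everything. A minimal sketch of that behavior (the `included_types` set and URLs here are illustrative, not the actual SPS defaults):

```python
import os
from urllib.parse import urlparse

# Illustrative allow-list standing in for the SPS "Include file types"
# list; note "php" is missing, so every .php page is skipped.
included_types = {"htm", "html", "asp", "aspx", "doc", "xls"}

def is_indexable(url, included=included_types):
    """Return True if the URL's file extension is on the allow-list."""
    ext = os.path.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return ext in included

urls = ["http://intranet/index.php", "http://intranet/news/home.htm"]
print([u for u in urls if is_indexable(u)])
# → ['http://intranet/news/home.htm']   (.php filtered out)
print([u for u in urls if is_indexable(u, included_types | {"php"})])
# → both URLs, once "php" is added to the list
```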

Answer #8    Answered By: Dwayne Jensen     Answered On: Apr 12

Also, it seems SharePoint caches the robots.txt. Log onto your index server and
restart the search service, "Microsoft SharePointPS Search". Also, bring up IE
on the server, clear the temporary files, and under Settings set "Check for
newer versions of stored pages" to "Every visit to the page".
