Provide the user with a record of the crawl results. The record should allow a 'validity check' of those results, i.e., whether the crawler has captured all interlinkings of the nodes on the map, by checking the interlinkings between sites. This way the qualitative people can see whether the crawler crawled what it should have crawled, and can investigate the effect of ceilings.
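A minimal sketch of such a validity check, under stated assumptions: the crawl record can be reduced to (source URL, target URL) pairs, the map nodes are identified by hostname, and the researcher supplies the site-to-site links they expect to see. The function names `host`, `interlinks` and `missing_interlinks` are hypothetical and not part of the crawler.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    name = (urlsplit(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def interlinks(link_pairs, map_hosts):
    """Collect the site-to-site links between hosts that are on the map.

    link_pairs: iterable of (source_url, target_url) tuples (assumed format).
    map_hosts:  set of hostnames that appear as nodes on the map.
    """
    found = set()
    for source_url, target_url in link_pairs:
        src, dst = host(source_url), host(target_url)
        # Ignore links within one site and links to hosts off the map.
        if src != dst and src in map_hosts and dst in map_hosts:
            found.add((src, dst))
    return found


def missing_interlinks(expected, found):
    """Return the expected site-to-site links the crawl did not record."""
    return sorted(set(expected) - found)
```

Links reported as missing are candidates for manual inspection, for instance to see whether a ceiling stopped the crawler before it reached the linking page.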
Extra:
- parse all RSS feeds out of the crawl records
- parse all APPLIED robots.txt rules out of the crawl records, e.g. ROBOTS: successfully got http://www.....com/robots.txt, ROBOTS: rule "user-agent: *" applied
- parse all skipped non-HTML/PHP pages out of the crawl records (e.g. pdf, doc, xls)
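The three extras above could be sketched as a single pass over each record line. This is an illustration only: it assumes the crawl record is line-oriented text, that applied rules appear in the form shown in the ROBOTS example, that skipped pages can be recognised by file extension, and that feeds can be guessed from 'rss', 'feed' or 'atom' in the URL path. The feed heuristic in particular is a guess, and `classify_line` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Any http(s) URL in a line; stops at whitespace, quotes, brackets and commas.
URL_RE = re.compile(r"https?://[^\s\"'<>,]+")
# Matches the applied-rule wording from the ROBOTS example above.
RULE_RE = re.compile(r'ROBOTS: rule "([^"]*)" applied')
SKIPPED_EXTENSIONS = (".pdf", ".doc", ".xls")
FEED_HINTS = ("rss", "feed", "atom")  # heuristic, not a reliable feed test


def classify_line(line):
    """Return (robots_rules, feed_urls, skipped_urls) found in one record line."""
    rules = RULE_RE.findall(line)
    feeds, skipped = [], []
    for url in URL_RE.findall(line):
        path = urlsplit(url).path.lower()
        if path.endswith(SKIPPED_EXTENSIONS):
            skipped.append(url)
        elif any(hint in path for hint in FEED_HINTS):
            feeds.append(url)
    return rules, feeds, skipped
```

A real implementation would need the actual record format; if the crawler logs the content type of each page, that would be a sounder basis for spotting feeds and skipped documents than the URL path.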
Problem:
- we are talking about a million URLs per crawler, or even per iteration
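One way to cope with that volume, sketched here under the assumption that the crawl record is a line-oriented text file, is to stream it line by line and keep only aggregated results in memory. The names `iter_urls` and `count_urls_per_host`, and the file name in the usage comment, are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>,]+")


def iter_urls(lines):
    """Lazily yield every URL found in an iterable of record lines."""
    for line in lines:
        yield from URL_RE.findall(line)


def count_urls_per_host(lines):
    """Count URLs per hostname without holding the whole record in memory."""
    counts = Counter()
    for url in iter_urls(lines):
        counts[(urlsplit(url).hostname or "").lower()] += 1
    return counts


# Usage with a hypothetical file name; a file object yields one line at a time:
# with open("crawl_record.txt", encoding="utf-8", errors="replace") as record:
#     per_host = count_urls_per_host(record)
```

Memory use then grows with the number of distinct hosts rather than with the number of URLs.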
Topic revision: r3 - 14 Apr 2008 - 13:31:00 - ErikBorra