news.google scraper (BETA)

The script at tools.issuecrawler.net/beta/googleNews queries and scrapes http://news.googlenews.com and returns the results on screen and as a tab-separated text file. This text file can then be used for further analysis, e.g. by importing it into Excel for analysis with reseau-lu.

Input

Description of the input form:
  • 'Number of results' specifies the maximum number of results you wish to retrieve. Google returns at most 100 results per query, so if you specify a value greater than 100, multiple queries are made until the maximum number of results is reached or until Google returns no more results.
  • All text inputs can have multiple queries, separated by a comma (,). For each query a new google search is done.
  • All select forms can have multiple selected values. For each selected value in 'Google Version' and in 'Language' a new Google query is done. To select multiple values, hold down Ctrl on Windows and Linux, or the Apple key on Mac, while clicking on the values.
  • There are five fields which can have multiple queries: 'Search for', 'Return only articles from the news source named', 'Return only articles from news sources located in', 'Google Version' and 'Language'. If multiple queries are specified, every combination of those queries will be executed. E.g. if 'Search for' contains 'bush, kerry' and in 'Google Version' usa and uk are selected, there will be four Google queries: 'bush in google version usa', 'bush in google version uk', 'kerry in google version usa' and 'kerry in google version uk'.
  • The script waits 3 seconds between consecutive queries, so elaborate queries will take a long time.
  • In the 'Search for' input you can enter boolean queries and you can group terms with quotes ("). See this page for a description of how to formulate queries correctly. Note that our script uses commas (,) to separate queries.
  • Google News normally doesn't offer the ability to restrict search results by language. I found that restricting by language nevertheless gives some interesting results sometimes. Note that not all languages in our script seem to be supported by Google.
  • Our script works with filter=0, which means all results are queried, as if you had run the query with similar results included.
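
The way combinations and the 100-result limit multiply can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: the function names are hypothetical, only two of the five fields are shown, and it assumes that the 3-second pause also applies between the successive pages of one combination.

```python
import math
from itertools import product
from typing import List, Tuple

DELAY_SECONDS = 3   # pause between queries, as described above
PAGE_SIZE = 100     # maximum number of results per Google query, as described above


def split_queries(field: str) -> List[str]:
    """Split a comma-separated text input into individual queries."""
    return [part.strip() for part in field.split(",") if part.strip()]


def expand(search_for: str, versions: List[str]) -> List[Tuple[str, str]]:
    """Return every (query, Google version) combination, in input order."""
    return list(product(split_queries(search_for), versions))


def max_requests(combinations: int, number_of_results: int) -> int:
    """Upper bound on requests: one per page of 100 results, per combination."""
    return combinations * math.ceil(number_of_results / PAGE_SIZE)


combos = expand("bush, kerry", ["usa", "uk"])
print(combos)

requests = max_requests(len(combos), 250)
# Assumption: one pause between each pair of consecutive requests.
print(requests, "requests at most,", (requests - 1) * DELAY_SECONDS, "seconds of waiting")
```

With the example from the list above ('bush, kerry' in the usa and uk versions) and 250 requested results, this gives four combinations and at most twelve requests. The other three fields would simply be further arguments to `product`.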

Output

  • In 'What to output' you can specify which fields are displayed as output. Take care in selecting what you want as output. If you search on images, also include the result number, as images are referenced by result number. If you search in different languages, select languages. Etc.
  • Next to 'What to output' you can specify whether you want output to the screen, to a file (as a tab-separated list), or both.
  • In 'What to output' you will find a special selectable value 'same images'. When selected, all the thumbnails found for your query are compared using a normal unix diff function. Two images only count as the same if the files are exactly identical. You may therefore see pictures which appear to be the same but are not reported as such; this means that the files differ slightly. Next to each image you will find the result number of the article it belongs to.
    There will be an extra column in the result called 'same images' which gives you a list of result numbers with the same thumbnail.
  • All Google results which have a date of the form 'x hours ago' or 'x minutes ago' (in any language) are converted to a date by taking the current time of the server (UTC+2) minus the time given by Google. All dates are translated into the form day/month/year.
  • Every US state is mapped to USA in the output.
  • If you selected thumbnail or 'same images' in 'What to output', the thumbnails are stored for future reference.
  • All output is returned as UTF-8.
  • Only results that have been stored to a file can be retrieved through the 'Previous results' link. For previous results which have 'same images' as output, the diff is calculated again to give you a nice overview.
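
Two of the output rules above, the conversion of relative dates and the 'same images' column, can be illustrated with a short Python sketch. It is not the script's own code and rests on several assumptions: the pattern only covers English phrases (the script handles other languages), dates are written zero-padded, a byte-for-byte comparison stands in for the unix diff, and all names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

SERVER_TZ = timezone(timedelta(hours=2))  # the server clock is UTC+2, as stated above

# English-only pattern, for illustration.
RELATIVE = re.compile(r"(\d+)\s+(hour|minute)s?\s+ago", re.IGNORECASE)


def to_day_month_year(text: str, now: datetime) -> Optional[str]:
    """Convert 'x hours ago' or 'x minutes ago' to day/month/year, else None."""
    match = RELATIVE.search(text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2).lower()
    delta = timedelta(hours=amount) if unit == "hour" else timedelta(minutes=amount)
    return (now - delta).strftime("%d/%m/%Y")


def same_images(thumbnails: Dict[int, bytes]) -> Dict[int, List[int]]:
    """Map each result number to the other results whose thumbnail file is identical."""
    groups = defaultdict(list)
    for number, data in thumbnails.items():
        groups[data].append(number)  # identical bytes end up in the same group
    return {n: [m for m in groups[data] if m != n] for n, data in thumbnails.items()}


now = datetime(2006, 5, 16, 1, 30, tzinfo=SERVER_TZ)
print(to_day_month_year("3 hours ago", now))               # crosses midnight
print(same_images({1: b"abc", 2: b"abd", 3: b"abc"}))      # 1 and 3 are identical
```

In the example, an article dated '3 hours ago' at 01:30 server time falls on the previous day, and thumbnail 2, which differs from the others by a single byte, is not listed as the same image.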

We recommend Firefox as your browser, but any browser should do as the HTML is W3C compliant.

If something doesn't work as expected please send an email and specify the exact time and date as well as your timezone.

Topic revision: r4 - 16 May 2006 - 21:06:32 - ErikBorra
 