Crawling the NY Times API for Relevant Articles

Ash
May 10, 2021

This is one of my first web crawlers: it queries the NY Times API for articles on AWS security and exports them to an Excel file for me to review.

To access the NY Times API, you first need to create an account on the NYT developer portal.

https://developer.nytimes.com/

Second, you need to register an app on the NYT portal, which gives you an API key and a secret. You will only need the API key for this program.

Then you need to enable the APIs you want to use (for this project, the Article Search API) on this app.

Don’t forget to click Save to apply these changes.

Now you can start in your Python IDE. I am using Spyder, but you can use any other. You will need the following libraries.

import os
import requests
from pprint import pprint
import pandas as pd

You can save your API key as an environment variable or assign it directly to a variable.

apikey = os.getenv('NYTIMESAPIKEY')
# or hardcode it directly (less secure):
apikey = 'XCCM...'

You can use the API's query syntax to search for your articles. More here:

https://developer.nytimes.com/docs/articlesearch-product/1/overview

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=election&api-key=yourkey
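
A minimal request against that endpoint, using the apikey variable from above, might look like this (printing the meta node to confirm the call worked):

base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {"q": "election", "api-key": apikey}
r = requests.get(base_url, params=params)
r.raise_for_status()
pprint(r.json()['response']['meta'])  # total hits, offset and response time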

The NYT API provides numerous filters to filter on, listed under Filter Query Fields. It also lists the allowed values for the following filters:

  1. Type of Material Values
  2. Section Name Values
  3. News Desk Values

These filters follow the filter_name:("value") format, and you can use AND and OR to combine multiple filters in your query.
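
For example, combining two clauses on the article body with a news desk filter (the same clauses used in the query below) might look like this:

fq = 'body:("security") AND body:("breach") AND news_desk:("Technology")'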

I was interested in AWS-related articles that had 'security' and 'breach' in the body of the article, with glocations of 'USA'.

query = "AWS"
begin_date = "20200701" # YYYYMMDD
filter_query = "\"body:(\"security\") AND body: (\"aws\") AND body:(\"breach\") AND news_desk:(\"Technology\") AND glocations:(\"USA\")\""
page = "0" # <0-100>
sort = "relevance" # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
f"q={query}" \
f"&api-key={apikey}" \
f"&begin_date={begin_date}" \
f"&fq={filter_query}" \
f"&page={page}" \
f"&sort={sort}" \
f"&fl={response_field}"

Also, it is important to note (from the NYT docs):

“Pagination

The Article Search API returns a max of 10 results at a time. The meta node in the response contains the total number of matches (“hits”) and the current offset. Use the page query parameter to paginate thru results (page=0 for results 1–10, page=1 for 11–20, …). You can paginate thru up to 100 pages (1,000 results). If you get too many results try filtering by date range.”

print(response['meta'])
{'hits': 15, 'offset': 0, 'time': 12}

My query had 15 hits, but only 10 show up in the first response. To see the second or any subsequent page, I need to include the page number in my query. As you might notice, pages are zero-indexed, so the page parameter for the second page is actually 1.

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=AWS&api-key=APIKEY&begin_date=20200701&fq="body:("security") AND body: ("aws") AND body:("breach") AND news_desk:("Technology") AND glocations:("USA")"&page=1&sort=relevance
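
If you want to collect all pages automatically, a rough sketch like the one below works. It assumes query_url has been built without the &page= parameter (unlike the snippets in this post) and pauses between requests to stay within the API's rate limits.

import math
import time

all_docs = []
page = 0
hits = None

while hits is None or page < min(math.ceil(hits / 10), 100):  # API caps at 100 pages
    r = requests.get(f"{query_url}&page={page}")  # query_url built without &page=
    response = r.json()['response']
    hits = response['meta']['hits']  # total matches; 10 docs are returned per page
    all_docs.extend(response['docs'])
    page += 1
    time.sleep(6)  # simple throttle to respect the rate limit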

You also don’t need all the fields from the API for your analysis, so you can filter the fields in the response as well. I was interested in the following fields, so I used response_field = 'abstract,web_url,snippet,source,headline,keywords,pub_date,section_name'.

More on the fields you can filter on is in the Article Search API documentation linked above.

section = "technology"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?api-key={apikey}"
r = requests.get(query_url)query = "AWS"
begin_date = "20200701" # YYYYMMDD
filter_query = "\"body:(\"security\") AND body: (\"aws\") AND body:(\"breach\") AND news_desk:(\"Technology\") AND glocations:(\"USA\")\"" # http://www.lucenetutorial.com/lucene-query-syntax.html
page = "1" # <0-100>
sort = "relevance" # newest, oldest
response_field = 'abstract,web_url,snippet,source,headline,keywords,pub_date,section_name'
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
f"q={query}" \
f"&api-key={apikey}" \
f"&begin_date={begin_date}" \
f"&fq={filter_query}" \
f"&page={page}" \
f"&sort={sort}"\
f"&fl={response_field}"
print(query_url)
r = requests.get(query_url)
pprint(r.json())

Now that I have all the data, I just need to export it to an Excel file.

data = r.json()
response = data['response']
docs = response['docs']
docs = pd.DataFrame(docs)
docs.to_excel('data3.xlsx', header=None, index=None, encoding='utf-8')
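
One optional tweak: headline (and keywords) come back from the Article Search API as nested objects rather than plain strings. If you want a clean column in the spreadsheet, you can flatten headline first; a small sketch, assuming it is the usual dict with a 'main' entry:

# keep only the main title from the nested headline object
docs['headline'] = docs['headline'].apply(
    lambda h: h.get('main') if isinstance(h, dict) else h
)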

Other challenges I had with this project were:

  1. Understanding the structure of the response. Thankfully, there are numerous resources online, and the NYT API page is also very helpful.
  2. Playing around with the extract in pandas. I keep forgetting basic pandas commands and rely too much on Google. I need to work on getting the basics right; it's a useless time drain that I should avoid.
  3. The file export was mangling UTF-8 characters. df.to_csv and other options had some data corruption for UTF-8 characters; I didn't dig too deep and ended up using df.to_excel, which had correct output (see the note after this list).
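
If you do want a CSV instead, one thing that often fixes garbled characters when the file is opened in Excel is writing a UTF-8 byte-order mark; a quick sketch I have not verified against this dataset:

# 'utf-8-sig' prepends a BOM so Excel decodes the CSV as UTF-8
docs.to_csv('data3.csv', index=False, encoding='utf-8-sig')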
