Google Patents - Scraping patent publication numbers with Python and BigQuery

Problem description

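I need to get a large number of publication numbers from Google Patents. Examples of the identifiers I need: US7863316B2, KR102121633B1. I first tried scraping the data with classic Python tools such as BeautifulSoup, but that approach does not work on Google Patents. I then turned to Google Cloud BigQuery and got some results, but before I had a good grasp of the platform I hit this error: Quota exceeded: Your project exceeded quota for free query bytes scanned. The code I used to fetch the data: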


```python
from google.cloud import bigquery

# Assumes default credentials and a default project are configured.
client = bigquery.Client()
search_term = "epilepsy"


def create_query(search_term):
  # Sample about 1000 publications whose top terms include search_term
  # and whose grant_date is below 20000101.
  q = rf'''
  WITH 
  pubs as (
    SELECT DISTINCT 
      pub.publication_number
    FROM `patents-public-data.patents.publications` pub
      INNER JOIN `patents-public-data.google_patents_research.publications` gpr ON
        pub.publication_number = gpr.publication_number
    WHERE 
      "{search_term}" IN UNNEST(gpr.top_terms)
      AND pub.grant_date < 20000101
  )

  SELECT
    -- embedding_v1 is selected because embedding_dict below reads it
    publication_number, url, embedding_v1
  FROM 
    `patents-public-data.google_patents_research.publications`
  WHERE
    publication_number in (SELECT publication_number from pubs)
    AND RAND() <= 1000/(SELECT COUNT(*) FROM pubs)
  '''

  return q

df = client.query(create_query(search_term)).to_dataframe()

if len(df) == 0:
  raise ValueError('No results for your search term. Retry with another term.')
else:
  print('Search complete for search term: "{}". {} random assets selected.'
        .format(search_term, len(df)))

# Map each publication number to its embedding vector.
embedding_dict = dict(zip(df.publication_number.tolist(), 
                          df.embedding_v1.tolist()))

df.head()
```

Are there other ways to get the information I need?

Tags: python, google-bigquery, google-patent-search

Solution
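
One possible way to stay within the free quota is to check how many bytes a query would scan before running it, and to cap what a real run is allowed to bill. The sketch below is not from the original post. It assumes the `google-cloud-bigquery` package is installed and that a configured `bigquery.Client` is passed in; the helper names `human_bytes`, `estimate_bytes`, and `run_capped` are hypothetical. It relies on the `dry_run` and `maximum_bytes_billed` options of `QueryJobConfig`.

```python
def human_bytes(n: int) -> str:
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def estimate_bytes(client, sql: str) -> int:
    """Dry-run the query: BigQuery validates it and reports the bytes it
    would process, without executing it."""
    # Lazy import so the pure helper above works without the package.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def run_capped(client, sql: str, max_bytes: int):
    """Run the query, but let BigQuery reject it if it would bill more
    than max_bytes (an assumed budget chosen by the caller)."""
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return client.query(sql, job_config=config).to_dataframe()
```

With `client` and `create_query` from the question, `human_bytes(estimate_bytes(client, create_query("epilepsy")))` would report the expected scan size before any quota is spent. Because BigQuery bills by the columns a query reads, selecting only the columns you need is usually what lowers that number; adding `LIMIT` generally does not.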

