How to copy all the text of a web page from a website link using Python

Question

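I need to access specific regions of several DNA sequences, and there are too many of them to copy by hand. However, I noticed a URL pattern like this: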

https://www.ncbi.nlm.nih.gov/nuccore/AF193276.1?report=fasta&log$=seqview&format=text&from=1311&to=4322

With this link I can access the DNA sequence of HIV (ID: AF193276.1) from position 1311 to position 4322, as shown below.

>AF193276.1:1311-4322 HIV-1 CRF03_AB isolate KAL153 from Russia, complete genome
TTTTTTAGGGAGAATTTGGCCTTCCAGCAAAGGGAGGCCAGGAAATTTTCCTCAGAGCAGACCAGAGCCA
TCAGCCCCACCAGCAGAAAACTTTGGGATGGGGGAAGAGATAACCCCCTCCCTGAAACAGGAACAGAAGG
ACAGGGAACAGCATCCTCCTTCAATTTCCCTCAAATCACTCTTTGGCGACGACCCCTTGTCACAGTAAGA
ATAGGAGGACAGCTAAAAGAAGCTCTATTAGATACAGGAGCAGATGATACAGTATTAGAAGACATAAATT
TGCCAGGAAAATGGAAACCAAAAATGATAGGGGGGATTGGAGGTTTTATCAAGGTAAGACAGTATGATCA
GATACTTATAGAAATTTGTGGAAAAAAGGCTATAGGTACGGTATTAGTAGGACCTACCCCTGTCAACATA
ATTGGAAGAAATATGTTGACTCAGCTTGGTTGTACTTTAAATTTTCCAATAAGTCCTATTGAAACTGTAC
CAGTAACATTAAAGCCAGGAATGGATGGCCCAAAGGTTAAACAATGGCCATTAACAGAAGAGAAAATAAA
AGCATTAACAGACATTTGTAAGGAGATGGAAAAGGAAGGAAAAATTTCAAAAATTGGGCCTGAAAATCCA
TACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGGTTTCAGAG
AACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACACCCTGCAGGGTTAAAAAA
GAAAAAATCTGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATCAAGACTTCAGA
AAGTATACTGCATTTACCATACCTAGTACAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGC
TTCCACAGGGATGGAAAGGATCACCAGCAATTTTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAG
AAAACAAAATCCAGAGATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAGATA
GGGCAGCATAGAACAGAAATAGAGGAACTGAGAGAACATCTGCTGAGGTGGGGATTTACCACACCAGACA
AAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACTGTACA
GCCTATAGTGTTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGCTAGTGGGAAAATTGAAT
TGGGCAAGTCAGATTTATGCAGGGATTAAAGTAAGGCAATTATGTAAACTCCTTAGGGGAGCCAAAGCAC
TAACAGAAGTAATACCACTAACAGCAGAAGCAGAGCTAGAACTGGCAGAAAACAGGGAGATTCTAAAAGA
ACCAGTACATGGAGTGTATTATGACCCATCAAAAGACTTAGTAGCAGAAATACAGAAGCAGGGACAAGGC
CAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTGAAAACAGGAAAATATGCAAGACTGAGGG
GTGCCCACACTAATGACGTAAAACAGTTAACAGAGGCAGTGCAAAAAATAGCCACTGAAAGCATAGTAAT
ATGGGGAAAGACTCCTAAATTTAAACTACCCATACAAAAAGAAACATGGGAAACATGGTGGACAGAGTAT
TGGCAAGCCACCTGGATTCCTGAGTGGGAATTTGTCAATACCCCTCCCTTAGTAAAATTATGGTACCAGT
TAGAGAAAGAACCCATAGTAGGAGCAGAAACTTTCTATGTAGATGGAGCAGCTAATAGGGAGACTAAATC
AGGAAAAGCAGGATATGTTACTGACAGAGGAAGACAAAAGGTTGTCTCCCTAACTGACACAACAAATCAG
AAGACTGAGTTACAAGCAATTCATCTAGCTTTGCAGGATTCGGGATTAGAAGTAAACATAGTAACAGACT
CACAATATGCATTAGGAATCATTCAAGCACAACCAGATAAGAGTGAATCAGAGTTAGTCAGTCAAATAAT
AGAGCAGTTGATAAAAAAGGAAAAGGTCTACCTGGCATGGGTACCAGCACACAAAGGAATTGGAGGAAAT
GAACAAGTTGATAAATTAGTCAGTGCTGGAATCAGGGAAGTACTATTTTTAGATGGAATAGATAAGGCAC
AAGAAGAACATGAGAAATATCACGGTAATTGGAGAGCAATGGCTAGTGATTTTAACCTGCCACCTGTGGT
AGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAATTAAAAGGAGAAGCCATGCACGGACAAGTAGAC
TGTAGTCCAGGAATATGGCAACTAGATTGTACACATTTAGAAGGAAAAATTATCCTAGTAGCAGTTCATG
TAGCCAGTGGATATATAGAAGCAGAAGTTATTCCAGCAGAAACAGGACAGGAAACAGCATACTTTGTCTT
AAAATTAGCAGGAAGATGGCCAGTAAAAATAATACATACAGACAATGGCAGCAATTTCACCAGTACTGCG
GTTAAGGCTGCCTGTTGGTGGGCAGGGATCAAGCAGGAATTTGGCATTCCCTACAATCCCCAAAGTCAAG
GAGTAGTAGAATCTATGAATAAACAATTAAAGCAAACTATAGGACAGGTAAGAGATCAAGCTGAACATCT
TAAGACAGCAGTACAAATGGCAGTATTCATCCACAATTTTAAAAGAAAAGGGGGGATTGGGGGGTACAGT
GCAGGGGAAAGAATAATAGACATAATAGCAACAGACATACAAACTAAAGAATTACAAAAACAAATTATAA
AAATTCAAAATTTTCGGGTTTATTACAGAGACAGCAGAGATCCAATTTGGAAAGGACCAGCAAAACTACT
CTGGAAAGGTGAAGGGGCAGTGGTAATACAGGACAATAACGATATAAAAGTAGTACCAAGAAGAAAAGCA
AAGATCATTAGGGATTATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATT
AG

I need all the information on this web page, but for several strains.

If I change the strain ID and the DNA positions in the URL, I think I can copy all of these DNA sequences.
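The URL pattern described above can be sketched as a small helper. This is only an illustration: the function name `build_url` and the `regions` list are made up for the example, and the two entries are the accession IDs and ranges that appear in this question.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/nuccore/"


def build_url(accession, start, stop):
    """Return the FASTA-view URL for one accession and position range."""
    query = urlencode(
        {
            "report": "fasta",
            "log$": "seqview",
            "format": "text",
            "from": start,
            "to": stop,
        },
        safe="$",  # keep "log$" literal, as in the original link
    )
    return f"{BASE_URL}{quote(accession)}?{query}"


# Hypothetical input: one (accession, start, stop) tuple per region.
regions = [
    ("AF193276.1", 1311, 4322),
    ("AF061641.1", 192, 1684),
]

urls = [build_url(*region) for region in regions]
for url in urls:
    print(url)
```

Building the query string with `urlencode` rather than string concatenation avoids stray spaces or unescaped characters in the link.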

I tried to copy the text from the web page and save it to a txt or csv file:

import requests
url = 'https://www.ncbi.nlm.nih.gov/nuccore/AF061641.1?report=fasta&log$=seqview&format=text&from=192&to=1684'
data = requests.get(url)
with open('file.txt','w') as out_f:
   out_f.write(str(data.text.encode('utf-8')))

But this is what I got:

b'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n    <head xmlns:xi="http://www.w3.org/2001/XInclude"><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n    <!-- meta -->\n    <meta name="robots" content="index,nofollow,noarchive" />\n<meta name="ncbi_app" content="entrez" /><meta name="ncbi_db" content="nuccore" /><meta name="ncbi_report" content="fasta" /><meta name="ncbi_format" content="text" /><meta name="ncbi_pagesize" content="20" /><meta name="ncbi_sortorder" content="default" /><meta name="ncbi_pageno" content="1" /><meta name="ncbi_resultcount" content="1" /><meta name="ncbi_op" content="retrieve" /><meta name="ncbi_pdid" content="fasta" /><meta name="ncbi_sessionid" content="CE8C1A31EBE79081_2023SID" /><meta name="ncbi_uidlist" content="3403216" /><meta name="ncbi_filter" content="all" /><meta name="ncbi_stat" content="false" /><meta name="ncbi_hitstat" content="false" />\n\n    \n    <!-- title -->\n    <title>HIV-1 isolate HH8793 clone 12.1 from Finland, complete genome - Nucleotide - NCBI</title>\n    \n    <!-- Common JS and CSS -->\n    \n\t\t<script type="text/javascript">\n\t\t    var ncbi_startTime = new Date();\n\t\t</script>\n\t\t<style>.async-hide { opacity: 0 !important} </style><script type="text/javascript" src="/core/assets/kis/dist/kis_ga_nuc_protein.js"></script><script type="text/javascript" src="https://static.pubmed.gov/core/jig/1.14.8/js/jig.min.js"></script>\n\t\t\t\t\n\t\t\t<script type="text/javascript" src="/core/ajax_loader/2.1/js/loadingbar.js"></script> \n        \t<script type="text/javascript" src="/core/ajax_loader/2.1/js/contentLoader.js"></script>\n        \t<link type="text/css" rel="stylesheet" href="/core/ajax_loader/2.1/css/loadingbar.css" />\n\t\t\t\n\t\t\t  \n    \n    <link xmlns="http://www.w3.org/1999/xhtml" type="text/css" 
rel="stylesheet" href="//static.pubmed.gov/portal/portal3rc.fcgi/4187342/css/3881636/3579733.css" xml:base="http://127.0.0.1/sites/static/header_footer/" />    \n<link rel="shortcut icon" href="//www.ncbi.nlm.nih.gov/favicon.ico" /><meta name="ncbi_phid" content="CE8C1A31EBE6A9A10000000007E703EB.m_11" /><script type="text/javascript"><!--\nvar ScriptPath = \'/portal/\';\nvar objHierarchy = {"name":"EntrezSystem2","type":"Layout","realname":"EntrezSystem2",\n"children":[{"name":"EntrezSystem2.PEntrez","type":"Cluster","realname":"EntrezSystem2.PEntrez",\n"children":[{"name":"EntrezSystem2.PEntrez.DbConnector","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.DbConnector","shortname":"DbConnector"},\n{"name":"EntrezSystem2.PEntrez.ParamContainer","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.ParamContainer","shortname":"ParamContainer"},\n{"name":"EntrezSystem2.PEntrez.MyNcbi","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.MyNcbi","shortname":"MyNcbi"},\n{"name":"EntrezSystem2.PEntrez.UserPreferenceUrlParamContainer","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.UserPreferenceUrlParamContainer","shortname":"UserPreferenceUrlParamContainer"},\n{"name":"EntrezSystem2.PEntrez.GridProperty","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.GridProperty","shortname":"GridProperty"},\n{"name":"EntrezSystem2.PEntrez.Nuccore","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NoPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NoPortlet","shortname":"NoPortlet"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_PageController","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_PageController","shortname":"Sequence_PageController"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_SearchBar","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_SearchBar","sh
ortname":"Entrez_SearchBar"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_BotRequest","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_BotRequest","shortname":"Entrez_BotRequest"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_LimitsTab","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_LimitsTab","shortname":"Sequence_LimitsTab"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.blankToolPanel","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.blankToolPanel","shortname":"blankToolPanel"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_ResultsController","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_ResultsController","shortname":"Sequence_ResultsController"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Filters","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Filters","shortname":"Entrez_Filters"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Pager","shortname":"Entrez_Pager"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_DisplayBar","shortname":"Sequence_DisplayBar"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.HelpFormAttributes","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.HelpFormAttributes","shortname":"HelpFormAttributes"
},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Collections","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Collections","shortname":"Entrez_Collections"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SpellCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.SpellCheck","shortname":"SpellCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SearchEngineReferralCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.SearchEngineReferralCheck","shortname":"SearchEngineReferralCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.KnowledgePanel","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.KnowledgePanel","shortname":"KnowledgePanel"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.HistoryDisplay","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.HistoryDisplay","shortname":"HistoryDisplay"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Discovery_SearchDetails","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Discovery_SearchDetails","shortname":"Discovery_SearchDetails"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.KISSensor","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.KISSensor","shortname":"KISSensor"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.MultiSensorPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.MultiSensorPortlet","shortname":"MultiSensorPortlet"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.WrongDbSensor","type":"Portlet","realn
ame":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.WrongDbSensor","shortname":"WrongDbSensor"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DiscoveryExptChooser","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_DiscoveryExptChooser","shortname":"Sequence_DiscoveryExptChooser"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerTitle","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerTitle","shortname":"Sequence_ViewerTitle"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport","shortname":"Sequence_ViewerReport"}]},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.EmptyPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.EmptyPortlet","shortname":"EmptyPortlet"}]},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_Facets","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence_Facets","shortname":"Sequence_Facets"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_Clipboard","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_Clipboard","shortname":"Entrez_Clipboard"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_StaticParts","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_StaticParts","shortname":"Sequence_StaticParts"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_Messages","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.E
ntrez_Messages","shortname":"Entrez_Messages"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NcbiJSCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NcbiJSCheck","shortname":"NcbiJSCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.Footer_ExtraData","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.Footer_ExtraData","shortname":"Footer_ExtraData"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIBreadcrumbs","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIBreadcrumbs","shortname":"NCBIBreadcrumbs"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIHelpDesk","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIHelpDesk","shortname":"NCBIHelpDesk"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIApplog_NoScript_Ping","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIApplog_NoScript_Ping","shortname":"NCBIApplog_NoScript_Ping"}]}]}]}]}]};\n--></script>\n<meta name=\'referrer\' content=\'origin-when-cross-origin\'/><link type="text/css" rel="stylesheet" href="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/css/3808861/3917732/3974050/3751656/3395415/4091728/3257261.css" /><link type="text/css" rel="stylesheet" 
href="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/css/3501913.css" media="print" /><script type="text/javascript">\n\nvar ObjectLinks=[{i:0, ename: "p$ExL", esid:"*", sname: "p$ExL", ssid:"*", dname:"p$el", dsid:"0",m:"CopyValue",p:[],f: function(src, dst) {fn_CopyValue(src, dst);}}]\n\n\nvar ActiveNames = {"p$ExL":1, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ExpandGaps":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.InUse":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ItemCount":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.db":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.display_type":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.fasta_text_params":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.maxdownloadsize":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.report":0};\n</script></head>\n    <body>\n        <form enctype="application/x-www-form-urlencoded" name="EntrezForm" method="post" onsubmit="return false;" action="/nuccore" id="EntrezForm">\n            <div id="maincontent" class="container">\n                <div>\n  <div id="viewercontent1" class="seq gbff" val="3403216" SequenceSize="14415" VirtualSequence=""></div>\n  <div class="hidden">\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.db" sid="1" type="hidden" value="nuccore" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.report" sid="1" type="hidden" value="fasta_text" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.maxdownloadsize" sid="1" 
type="hidden" value="1000000" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.display_type" sid="1" type="hidden" value="single" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ItemCount" sid="1" type="hidden" value="1" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.InUse" sid="1" type="hidden" value="" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ExpandGaps" sid="1" type="hidden" value="" />\n    <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.fasta_text_params" sid="1" type="hidden" value="&amp;from=192&amp;to=1684" />\n  </div>\n</div>\n\n            </div>\n        <input type="hidden" name="p$a" id="p$a" /><input type="hidden" name="p$l" id="p$l" value="EntrezSystem2" /><input type="hidden" name="p$st" id="p$st" value="nuccore" /><input name="SessionId" id="SessionId" value="CE8C1A31EBE79081_2023SID" disabled="disabled" type="hidden" /><input name="Snapshot" id="Snapshot" value="/projects/Sequences/SeqDbRelease@1.124" disabled="disabled" type="hidden" /></form>\n    \n\n<!-- CE8C1A31EBE79081_2023SID /projects/Sequences/SeqDbRelease@1.124 portal105 v4.1.r585844 Mon, May 06 2019 02:53:16 -->\n\n\n<script type=\'text/javascript\' src=\'/portal/js/portal.js\'></script><script type="text/javascript" src="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/js/4184195/3217400/4176568/4177091.js" snapshot="nuccore"></script></body>\n</html>'

I now have all the strain IDs and position ranges, so I need to copy these DNA sequences for analysis.

Thank you in advance.

Tags: python, python-requests

Answer


The error you ran into can be corrected quickly.

What you are doing is basically called "web scraping", and to get the output you want you need another package, such as Selenium or BeautifulSoup.

I personally prefer BeautifulSoup over Selenium, but that is just my opinion. Here is a BeautifulSoup implementation.

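from bs4 import BeautifulSoup
import requests

url = 'https://www.ncbi.nlm.nih.gov/nuccore/AF061641.1?report=fasta&log$=seqview&format=text&from=192&to=1684'
# Download the page; data.content holds the raw response bytes.
data = requests.get(url)
# Parse the HTML with Python's built-in "html.parser" backend.
content = BeautifulSoup(data.content, "html.parser")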

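This is roughly what part of the finished code would look like. The `content` variable now holds the full HTML structure of the web page. All that is left is to find the required information in that HTML structure and extract it.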
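Before writing the extraction step, it may help to check what the static HTML actually contains. In the response pasted in the question, the element with id `viewercontent1` appears to have no text inside it, which suggests the sequence is filled in later by the page's scripts and not delivered in the HTML itself. The sketch below checks this on a shortened copy of that element. It uses the standard library's `html.parser` so that it runs without extra packages, which is a substitution made for this example; with BeautifulSoup the equivalent lookup would be `content.find("div", id="viewercontent1")`. The class name `DivFinder` is hypothetical.

```python
from html.parser import HTMLParser


class DivFinder(HTMLParser):
    """Collect the attributes and inner text of the <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.attrs = None  # attributes of the matching <div>, once seen
        self.text = ""     # text found inside the matching <div>
        self._depth = 0    # greater than 0 while inside the matching <div>

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1  # nested <div> inside the match
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.attrs = dict(attrs)
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


# Shortened copy of the relevant part of the response shown in the question.
sample_html = (
    '<div id="maincontent" class="container"><div>'
    '<div id="viewercontent1" class="seq gbff" val="3403216" '
    'SequenceSize="14415" VirtualSequence=""></div>'
    '</div></div>'
)

finder = DivFinder("viewercontent1")
finder.feed(sample_html)

# html.parser lower-cases attribute names, hence "sequencesize".
print("val:", finder.attrs["val"])
print("sequence size:", finder.attrs["sequencesize"])
print("inner text is empty:", finder.text.strip() == "")
```

If the inner text is empty for the real page as well, parsing the HTML alone will not yield the sequence, and a different data source or a browser-driving tool such as Selenium would be needed.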

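Only a little work remains to finish the code. I can't complete it for you because I don't have all the required parameters, but you should be able to finish the rest after reading about web scraping for about half an hour.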
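As an alternative to scraping the page, and not the approach described in this answer, NCBI also offers the E-utilities `efetch` endpoint, which returns plain FASTA text for an accession and a position range. The sketch below is a minimal illustration under that assumption. The helper names `efetch_url`, `fetch_fasta`, and `save_regions` are made up for the example, and it uses the standard library's `urllib` so that it runs without extra packages; `requests.get` with a `params` dictionary would work the same way. The parameter names should be checked against the current E-utilities documentation before relying on them.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, start, stop):
    """Build an efetch URL for one region of a nucleotide record."""
    params = {
        "db": "nuccore",       # same database as in the question's URL
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,    # first base of the region (1-based)
        "seq_stop": stop,      # last base of the region (inclusive)
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, start, stop):
    """Download one region as FASTA text (performs a network request)."""
    with urlopen(efetch_url(accession, start, stop), timeout=30) as response:
        return response.read().decode("utf-8")


def save_regions(regions, path):
    """Write every (accession, start, stop) region to a single FASTA file."""
    with open(path, "w", encoding="utf-8") as out_f:
        for accession, start, stop in regions:
            out_f.write(fetch_fasta(accession, start, stop))
            # Pause between requests; NCBI asks for at most
            # 3 requests per second when no API key is used.
            time.sleep(0.4)
```

Calling `save_regions([("AF061641.1", 192, 1684), ("AF193276.1", 1311, 4322)], "sequences.fasta")` would then collect both regions from the question into one file, assuming the endpoint responds as expected.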

Hope this helps! :)

