Streamlining the insertion of relational data into a database with Flask-SQLAlchemy

Problem description

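I am currently using Flask and want to upload a large number of articles (50 million in total), each written by a varying number of users (roughly 100 million users in total). To do this, I created an Article model, a User model, and a Contribution model that links users to articles in a many-to-many relationship.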

My current upload process looks like this:

import gzip

article_dictionary = create_article_dictionary()
user_dictionary = create_user_dictionary()
count = 0  # number of contributions added so far

for xml_file in files:
    with gzip.GzipFile(xml_file,'rb') as gfile:
        root = open_as_xml_tree(gfile)  
        for article in root:            
            # parse XML file using custom XML_Article class
            article_object = XML_Article(xml_article=article)
            # create Article object
            new_article = Article(  title = article_object.article_title,
                                    pubdate = article_object.publication_date,
                                    article_id = article_object.article_id
                                )
            if not article_dictionary.get(new_article.article_id):
                # update the dict keeping track of which articles are indexed
                article_dictionary[new_article.article_id]=1
                db.session.add(new_article)
                db.session.flush()

                # add each author in the article
                for author in article_object.authors:
                    new_user = User(firstname=author.firstname,midname=author.midname,lastname=author.lastname)
                    # if the user does exist, don't add them, just note their contribution
                    if user_dictionary.get(new_user.firstname + new_user.midname + new_user.lastname):
                        new_contribution = Contributions(   user_id = user_dictionary[new_user.firstname + new_user.midname + new_user.lastname],
                                                            article_id = new_article.article_id,
                                                            # other information about the contribution is here
                                                        )
                    # if they don't, add them
                    else:
                        db.session.add(new_user)
                        db.session.flush()
                        new_contribution = Contributions(   user_id = new_user.user_id,
                                                            article_id = new_article.article_id,
                                                            # other information about the contribution is here
                                                        )                       
                        user_dictionary[new_user.firstname + new_user.midname + new_user.lastname]=new_user.user_id
                    db.session.add(new_contribution)
                    db.session.flush()
                    count+=1
        db.session.commit()

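However, this is very slow, far too slow to handle 50 million entries. I suspect part of the slowdown comes from having to add and flush repeatedly, and I would like to know whether there is a faster way to do this in bulk.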

Tags: python, database, sqlite, sqlalchemy, flask-sqlalchemy

Solution
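One possible approach, offered as a sketch rather than a definitive answer, is to stop flushing after every object and instead collect plain dictionaries per table, then send each batch to the database as a single multi-row insert followed by one commit. The example below uses plain SQLAlchemy (which Flask-SQLAlchemy wraps) with an in-memory SQLite database so that it runs on its own. The column types, the table names, the `chunks` and `bulk_load` helpers, and the shape of the parsed input are all assumptions made for illustration. It also assumes the loader is the only writer and the tables start empty, so user IDs can be assigned in Python instead of being read back after a flush.

```python
from itertools import islice

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


# Hypothetical models; only the column names come from the question's code.
class Article(Base):
    __tablename__ = "articles"
    article_id = Column(Integer, primary_key=True)
    title = Column(String)


class User(Base):
    __tablename__ = "users"
    user_id = Column(Integer, primary_key=True)
    firstname = Column(String)
    midname = Column(String)
    lastname = Column(String)


class Contributions(Base):
    __tablename__ = "contributions"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.user_id"))
    article_id = Column(Integer, ForeignKey("articles.article_id"))


def chunks(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_load(session, parsed_articles, batch_size=10_000):
    """Insert articles, users, and contributions batch by batch.

    `parsed_articles` yields (article_id, title, [(first, mid, last), ...]).
    Assumes empty tables and a single writer, so IDs are assigned here.
    """
    seen_articles = set()
    user_ids = {}  # (first, mid, last) -> user_id
    next_user_id = 1
    for batch in chunks(parsed_articles, batch_size):
        articles, users, contributions = [], [], []
        for article_id, title, authors in batch:
            if article_id in seen_articles:
                continue  # already indexed, as in the original dictionary check
            seen_articles.add(article_id)
            articles.append({"article_id": article_id, "title": title})
            for name in authors:
                if name not in user_ids:
                    user_ids[name] = next_user_id
                    users.append({"user_id": next_user_id, "firstname": name[0],
                                  "midname": name[1], "lastname": name[2]})
                    next_user_id += 1
                contributions.append({"user_id": user_ids[name],
                                      "article_id": article_id})
        # One multi-row insert per table, then a single commit per batch.
        for model, rows in ((Article, articles), (User, users),
                            (Contributions, contributions)):
            if rows:
                session.execute(model.__table__.insert(), rows)
        session.commit()
    return len(seen_articles), len(user_ids)


engine = create_engine("sqlite://")  # in-memory database for the demo
Base.metadata.create_all(engine)
sample = [
    (1, "First", [("Ada", "", "Lovelace"), ("Alan", "M", "Turing")]),
    (2, "Second", [("Ada", "", "Lovelace")]),
    (1, "First (duplicate)", []),
]
with Session(engine) as session:
    n_articles, n_users = bulk_load(session, sample, batch_size=2)
    n_contributions = session.query(Contributions).count()
```

Keying users by a tuple of name parts rather than a concatenated string avoids collisions between different splits of the same characters. In a Flask-SQLAlchemy application the same calls would presumably go through `db.session`; the batch size is a tuning knob, and the value shown is only a placeholder.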

