The DataFrame output inside a for loop is correct, but after writing it to a CSV file in a folder and reading it back, strange new elements appear

Problem description

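Hi everyone, I am preprocessing YouTube data and ran into a problem when writing the CSV output. I checked that the processed DataFrame is correct inside the for loop. However, something goes wrong once it is written to the CSV file.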

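import numpy as np
import pandas as pd

# country_list is defined elsewhere in the asker's notebook (not shown in the post)
for country in country_list: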
    # loading data
    df = pd.read_csv(f"../input/youtube-new/{country}videos.csv")
    js = pd.read_json(f"../input/youtube-new/{country}_category_id.json")
    
    # preprocessing
    df = df.drop_duplicates() # deal with duplicates
    # 1. filter video days below 2
    video_count = df["video_id"].value_counts()
    dict_video = dict(video_count[df["video_id"].value_counts() > 1])
    df = df[df["video_id"].isin(list(dict_video))]
    
    # 2. count video days
    continue_days = df["video_id"].apply(lambda x : dict_video[x])
    df.insert(2, column = "continue_days", value = continue_days)
    
    # 3. count tag words
    tags_count = df["tags"].apply(lambda x : len(x.split("|")))
    df.insert(loc = 7, column = "tags_count", value = tags_count)
    
    # 4. deal with true and false value
    df["comments_disabled"] = df["comments_disabled"].apply(lambda x : 0 if x == False else 1)
    df["ratings_disabled"] = df["ratings_disabled"].apply(lambda x : 0 if x == False else 1)
    df["video_error_or_removed"] = df["video_error_or_removed"].apply(lambda x : 0 if x == False else 1)
    
    # 5. count the words of description
    df['description'] = df['description'].fillna(0) # deal with null value
    description_count = df["description"].apply(lambda x : 0 if x == 0 else len(x.split(" ")))
    df.insert(loc = 17, column = "description_count", value = description_count)
    
    # 6. deal with trending date
    df["trending_date"] = df["trending_date"].apply(lambda x : x.split("."))
    df["trending_date"] = df["trending_date"].apply(lambda x : "20" + x[0] + "/" + x[-1] + "/" + x[1])
    
    # 7. create new category based on id 
    dict_category = {}
    for i in range(len(js)):
        label = int(js["items"].loc[i]["id"])
        category = js["items"].loc[i]["snippet"]["title"]
        dict_category[label] = category
    category = df.replace({"category_id" : dict_category})
    df.insert(loc = 6, column = "category", value = category["category_id"])
    
    # 8. sort values 
    df = df.sort_values(by = ["video_id", "trending_date"], ascending = True)
    
    # 9. dropna
    na_index = df.index[np.where(df["trending_date"].isna())[0]]  # convert positions to index labels
    df = df.drop(na_index, axis = 0)  # drop returns a new frame, so keep the result
    print(df.info())
    # create csv
    df.to_csv(f"./outputs/{country}/{country}.csv", header = True, index = False)
    print(f"Done {country}")
    
    # create csv with different category
    category = df["category"].unique()
    print(category)
    for cat in category:
        df_target = df[df["category"] == cat]
        df_target = df_target.sort_values(by = ["video_id", "trending_date"], ascending = True)
        df_target.to_csv(f"./outputs/{country}/{country}_{cat}.csv", header = True, index = False)
    print(f"Done {country}_category")
    

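Below is the output from inside the for loop. We can see that every column has the same number of entries. In the second image, however, some extra video_id values have been added. I don't know how to deal with this, and I am sure the path is correct. Does anyone know what is happening? Thanks a lot.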

[Image: enter image description here]

[Image: enter image description here]
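
The comparison described above can also be made in code rather than by eye. The following is a minimal sketch, assuming the file is read back with `pd.read_csv`. The helper name `roundtrip_report` and the sample frame are hypothetical and not part of the original post. It writes a frame to CSV, reads it back, and reports whether the row count changed and which `video_id` values appear only after reloading.

```python
import io

import pandas as pd


def roundtrip_report(df, id_col="video_id"):
    """Write df to CSV in memory, read it back, and compare with the original."""
    buffer = io.StringIO()
    df.to_csv(buffer, header=True, index=False)
    buffer.seek(0)
    reloaded = pd.read_csv(buffer)

    # Compare ids as strings so that dtype differences do not matter.
    original_ids = set(df[id_col].astype(str))
    reloaded_ids = set(reloaded[id_col].astype(str))
    return {
        "rows_before": len(df),
        "rows_after": len(reloaded),
        "new_ids": sorted(reloaded_ids - original_ids),
    }


# Hypothetical sample data; the real frames come from the loop in the question.
sample = pd.DataFrame(
    {
        "video_id": ["a1", "a1", "b2"],
        "description": ["plain text", "more text", "no line breaks here"],
    }
)
report = roundtrip_report(sample)
print(report)
```

If the two row counts differ, or `new_ids` is not empty, the difference comes from the contents of the written file and not from the preprocessing steps in the loop.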

Tags: python, pandas, dataframe

Solution
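
One possible cause, which cannot be confirmed from the post alone, is that free-text columns such as `description` or `title` contain embedded line breaks (`\r` or `\n`). Some readers and spreadsheet tools then treat the continuation of a field as a new row, so fragments of text show up in the first column, `video_id`. A minimal sketch under that assumption replaces line breaks in every text cell before calling `to_csv`. The helper name `strip_line_breaks` and the sample data are hypothetical.

```python
import io

import pandas as pd


def strip_line_breaks(df):
    """Return a copy of df in which line breaks inside text cells become spaces."""
    cleaned = df.copy()
    for col in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            continue  # numeric and boolean columns cannot contain line breaks
        # Only touch real strings; values such as 0 or NaN are left unchanged.
        cleaned[col] = cleaned[col].map(
            lambda v: " ".join(v.splitlines()) if isinstance(v, str) else v
        )
    return cleaned


# Hypothetical sample with a carriage return and newline inside one description.
sample = pd.DataFrame(
    {
        "video_id": ["a1", "b2"],
        "description": ["first line\r\nsecond line", "single line"],
        "views": [10, 20],
    }
)
cleaned = strip_line_breaks(sample)

# Round trip through CSV to check that the row count is preserved.
buffer = io.StringIO()
cleaned.to_csv(buffer, header=True, index=False)
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(len(cleaned), len(reloaded))
```

In the loop from the question, this would be called just before each `to_csv`, for example `df = strip_line_breaks(df)`. If the line breaks need to be preserved, passing `quoting=csv.QUOTE_ALL` to `to_csv` is an alternative worth testing, since quoted fields may contain line breaks.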

