Creating a dataset with a predefined set of values

Problem description

I am trying to create a dataset with the following fields:

Flight_ID, Iternary_ID, Flight_Date, Booking_Date

Here is my code:

import random
import pandas as pd
import numpy as np
from faker import Faker
import datetime
#import datetime
fake = Faker()
#import itertools
#from math import floor
random.seed(999)
## Create random iternary
no_of_days = 30
no_of_flights_per_day = 5
total_obs = no_of_days*no_of_flights_per_day
itr_a= np.random.randint(1,3,size = int(0.7*total_obs))
itr_b=np.random.randint(3,6,size=total_obs - len(itr_a))
itr = [1,2,3,4,5]*no_of_days
#itr = itr_a.tolist() + itr_b.tolist()
#for x in range(10):
#random.shuffle(itr)

## Generate flight ids
flight_ids = np.random.randint(10000,99999,size= total_obs).tolist() 
flight_ids = ["FT-" + str(x) for x in flight_ids ]

## Generating date for  flights
base = datetime.date(2019, 9, 1)
date_list = [base + datetime.timedelta(days=x) for x in range(30)]
date_list = date_list*5
#random.shuffle(date_list)
#####  Creating booking dataset
v = pd.DataFrame( list(zip(itr,flight_ids,date_list)),columns =['Flight_Id', 'Iternary_Code', 'date_list'])

flight_id_superlist = []
flight_final_date_superlist= []
flight_booking_date_superlist = []
flight_iternary_code_superlist=[]
for i in range(len(v['Flight_Id'])):
    no_days_booked = random.randint(4,30)
    flight_date = v['date_list'][i]
    booking_start_date = flight_date- datetime.timedelta(days=30)
    book_dates = [fake.date_between(start_date=booking_start_date, end_date=flight_date) for v in range(no_days_booked)]
    #book_dates.sort()

    flight_id_superlist = flight_id_superlist + ([v['Iternary_Code'][i]]*no_days_booked)
    flight_iternary_code_superlist = flight_iternary_code_superlist + ([v['Flight_Id'][i]]*no_days_booked)
    flight_final_date_superlist = flight_final_date_superlist + ([flight_date]*no_days_booked)
    flight_booking_date_superlist = flight_booking_date_superlist + book_dates

Flight_df = pd.DataFrame(list(zip(flight_id_superlist,flight_iternary_code_superlist,
                                  flight_final_date_superlist,flight_booking_date_superlist
                                  )), 
               columns =['Flight_Id', 'Iternary_Code','Flight_Date','Booking_Date'
                         ])  
Flight_df.to_csv('./Flight_data.csv',index=False) 

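The expected output is a (Flight_ID, Iternary_ID, Flight_Date) entry for each booking date.

The problem I am facing is that multiple Flight_IDs are assigned to a given Flight_Date and Iternary_ID (screenshot attached). The DataFrame v stores the unique (Flight_ID, Iternary_ID, Flight_Date) combinations, and these should stay the same even though Booking_Date varies. I cannot figure out why this happens. Any help would be appreciated.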


Tags: python, pandas, numpy

Solution


It looks like you made a simple mistake in the final for loop:

flight_id_superlist = flight_id_superlist + ([v['Iternary_Code'][i]]*no_days_booked)
flight_iternary_code_superlist = flight_iternary_code_superlist + ([v['Flight_Id'][i]]*no_days_booked)

As you can see, you are putting the itinerary code into the flight ID superlist and the flight ID into the itinerary code superlist :). Swapping them may help:

flight_id_superlist = flight_id_superlist + ([v['Flight_Id'][i]]*no_days_booked)
flight_iternary_code_superlist = flight_iternary_code_superlist + ([v['Iternary_Code'][i]]*no_days_booked)
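
One way to make this kind of mix-up harder is to build each output row as a single record instead of growing four parallel lists. The following is only a minimal sketch under some assumptions: `flights` is a hypothetical two-flight stand-in for the frame `v`, and `random.randint` replaces `fake.date_between` so the example needs nothing beyond the standard library. Note also that the question builds `v` with `zip(itr, flight_ids, date_list)` but names the columns `['Flight_Id', 'Iternary_Code', 'date_list']`, so the labels there appear to be swapped as well; the sketch assumes every field is labeled by what it actually holds.

```python
import datetime
import random

random.seed(999)  # seeds only the stdlib generator used below

# Hypothetical stand-in for the frame `v`: one tuple per flight,
# always in the order (flight id, itinerary code, flight date).
flights = [
    ("FT-10001", 1, datetime.date(2019, 9, 1)),
    ("FT-10002", 2, datetime.date(2019, 9, 2)),
]

rows = []
for flight_id, iternary_code, flight_date in flights:
    no_days_booked = random.randint(4, 30)
    for _ in range(no_days_booked):
        # Assumption: a booking happens 0 to 30 days before the flight date.
        days_before = random.randint(0, 30)
        booking_date = flight_date - datetime.timedelta(days=days_before)
        # All four fields of a row are written together, keyed by name,
        # so a value cannot end up under the wrong column.
        rows.append({
            "Flight_Id": flight_id,
            "Iternary_Code": iternary_code,
            "Flight_Date": flight_date,
            "Booking_Date": booking_date,
        })
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which takes the column names from the keys.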

As for the rest, the code is too sparsely documented for me to tell whether the logic does what you intended.
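
One way to check the remaining logic against the stated expectation is to count how many distinct flight IDs share each (Flight_Date, Iternary_Code) pair. The sketch below rebuilds only the pairing from the question (`itr = [1,2,3,4,5]*no_of_days` and `date_list*5`); the sequential flight IDs are an assumption made to keep the example deterministic, whereas the question draws them at random. With this pairing, each date always meets the same itinerary code, so every pair ends up with five flight IDs before the booking loop even runs. Whether that is intended is for the asker to decide.

```python
import datetime
from collections import defaultdict

no_of_days = 30
no_of_flights_per_day = 5
total_obs = no_of_days * no_of_flights_per_day

# Same pairing as in the question.
itr = [1, 2, 3, 4, 5] * no_of_days
base = datetime.date(2019, 9, 1)
date_list = [base + datetime.timedelta(days=x) for x in range(no_of_days)]
date_list = date_list * no_of_flights_per_day

# Assumption: sequential IDs instead of the question's random ones.
flight_ids = ["FT-" + str(10000 + i) for i in range(total_obs)]

# Collect the set of flight IDs seen for each (date, itinerary) pair.
ids_per_pair = defaultdict(set)
for code, flight_id, flight_date in zip(itr, flight_ids, date_list):
    ids_per_pair[(flight_date, code)].add(flight_id)

# Pairs that map to more than one flight ID.
ambiguous = {pair: ids for pair, ids in ids_per_pair.items() if len(ids) > 1}
print(len(ids_per_pair), "pairs,", len(ambiguous), "with more than one flight ID")
```

On a pandas frame with these column names, the same count can be obtained with `df.groupby(['Flight_Date', 'Iternary_Code'])['Flight_Id'].nunique()`.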

