python - Pandas: Read random sample of data using read_json
Problem description
I would like to read a random sample of rows from a large .bz2 file, similar to how you can read a random sample of a CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.csv"
skip = sorted(random.sample(range(1, n + 1), n - s)) #range replaces Python 2's xrange; row 0 (the header) is never skipped
df = pandas.read_csv(filename, skiprows=skip)
I've figured out how to read the file in chunks, but this isn't random:
import os, json
import pandas as pd
import numpy as np
import glob
import random
pd.set_option('display.max_columns', None)
temp = pd.DataFrame()
path_to_json = '/content/drive/My Drive/Loghost/'
json_pattern = os.path.join(path_to_json,'*.bz2')
file_list = glob.glob(json_pattern)
for file in file_list:
    chunks = pd.read_json(file, lines=True, chunksize=3000000)
    i = 0
    chunk_list = []
    for chunk in chunks:
        i += 1
        user = chunk[random.sample(chunk.UserName)] # I want to take a random sample of 100 users
        chunk_list.append(user)
        print("Progress:", i)
        del chunk
    df = pd.concat(chunk_list, sort=True)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    temp = pd.concat([temp, df], sort=True)
The commented line above is where I attempt to randomise the rows by selecting a random sample of users, but it doesn't seem to work. Any ideas?
Solution
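One possible approach, offered as a minimal sketch rather than a definitive answer: the line in the question fails because random.sample needs both a sequence and a sample size, and it returns a list of values rather than a row mask. The sketch below draws up to 100 distinct user names per chunk and keeps every row belonging to them. The function name sample_users_from_chunks and its parameters n_users, user_col and seed are hypothetical; it assumes each chunk is a DataFrame with a UserName column, as in the question.

```python
import random

import pandas as pd


def sample_users_from_chunks(chunks, n_users=100, user_col="UserName", seed=None):
    """Keep the rows of up to `n_users` randomly chosen users from each chunk.

    `chunks` is any iterable of DataFrames, for example the reader returned
    by pd.read_json(..., lines=True, chunksize=...).
    """
    rng = random.Random(seed)  # seed=None gives a different draw on every run
    sampled = []
    for i, chunk in enumerate(chunks, start=1):
        # Distinct, non-missing user names that occur in this chunk.
        users = list(chunk[user_col].dropna().unique())
        # random.sample needs a sequence and a count; cap the count at
        # the number of users actually available.
        chosen = rng.sample(users, min(n_users, len(users)))
        # isin() builds the boolean row mask that the original line lacked.
        sampled.append(chunk[chunk[user_col].isin(chosen)])
        print("Progress:", i)
    if not sampled:
        return pd.DataFrame()
    return pd.concat(sampled, ignore_index=True, sort=True)
```

With the file list from the question, this could be called once per file as sample_users_from_chunks(pd.read_json(file, lines=True, chunksize=3000000)). Because users are drawn independently in every chunk, the result is not a uniform sample over the whole file. If a random sample of rows is enough, calling chunk.sample(n=...) inside the loop is a simpler option.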