python - How to improve the performance of traversing a large dataset
问题描述
I want to improve the logic for my task that is defined as following, the tasks is implemented in Python 3 with Django framwork:
The source data is onboard to our system, the Job entity defines what involves to fully process a source, Job_instance defines the instance of Job. The requirement is as following:
- Given the source name, find the latest onboard date that has all jobs of the source being processed.
- Also output the overall status of job being processed
The entity relationship and table structure:
- "Job" - this is an entity defines what involves to process a source ingested from external party. a source is divided into parts that can be processed into multiple jobs. table structure:
Field name | description | type |
---|---|---|
id | id of job | int |
name | name of job | string |
source | name of source | string |
- "Job_instance" - this is instance of the job that has been executed in our system.
Field name | description | type |
---|---|---|
id | id of job instance | int |
job_id | id of job | int |
period | date string the source has been onboard to our system | string |
updated_timestamp | last execution time and date for the job | timestamp |
status | the status of the instance after run has been proceeded, can be one of any, e.g: success, failed, finished, created etc | string |
The overall status of jobs processed shows as "Success" when all the job_instance of each job running successfully, otherwise it shows as "Failure"
My current logic:
Get list of jobs from source name and cached job_ids with boolean as a map.
job_ids = Job.objects.filter(Q(name__icontains=source) ).values_list('id', flat=True) """ Combine all Job Ids into one restriction """ """ Constructing where clause to match all job ids """ ids_filter = Q() for job_id in job_ids: ids_filter = ids_filter | Q(job__id=job_id)
Get list of job instances from job ids and sort it by period and update_timestamp in descending order( this is done from database)
completed_status_filter = Q() """ Only check job instances that have been completed """ for status in [status_dict.get('Failed'), status_dict.get('Finished')]: completed_status_filter = completed_status_filter | Q( status__exact=status) """ Find Finished or Failed job instances for job ids """ """ and sort by period_code and updated timestamp by decending order """ job_instances = JobInstance.objects \ .filter(ids_filter, completed_status_filter) \ .values('id', 'job__id', 'period_code', 'update_timestamp', 'status') \ .exclude(period_code='Ad Hoc') \ .order_by('-period_code', '-update_timestamp')
Created a period to list of job_instances map.
def get_period_code_instances_map(self, job_instances): period_code_instance_map = {} for instance in job_instances.iterator(): instances =period_code_instance_map.get(instance['period_code']) if not instances: existing_instances = list() period_code_instance_map[instance['period_code']] \ = existing_instances existing_instances.append(instance) else: instances.append(instance) return period_code_instance_map
Traversing the map as following:
def get_last_period_code_at_curated( self, period_code_instances_map, job_ids, status_dict): target_period = None job_ids_verification_map = dict((key, False) for key in job_ids) """ Traverse instances ran on the same period_code """ for period_code, instances in period_code_instances_map.items(): for instance in instances: job_id = instance['job__id'] """ Set to True if job has not been visited""" if not job_ids_verification_map.get(job_id): job_ids_verification_map[job_id] = True """ If job id has not been visited after traversing """ """ reset job_id verifcation map """ if any(not checked for checked in job_ids_verification_map.values()): job_ids_verification_map = dict( (key, False) for key in job_ids) else: target_period = period_code break result = JobInstanceLatestPeriodCodeForSource() if not target_period: return result """ Once period_code is found, get the timestamp from 1st instance """ """ Evaluate status by check all statuses in instances """ instances = period_code_instances_map[target_period] result.period_code = target_period result.update_timestamp = instances[0]['update_timestamp'] failed_status = status_dict.get('Failed') if any(instance['status'] == failed_status for instance in instances): result.status = 'Failed' else: result.status = 'Finished' return result
a. Iterate all job instance in the period to find if all jobs have been run. Here i used the job_id and boolean map to track which job has been visited for the period.
b. If all jobs have been run in the period, then the period is found. otherwise, keep traversing the map.
When period is found, iterate the job instance list from the map again to find the overall status.
Limitation:
- Since the list of job_instances can be quite big( > 1 million entries), I have used iterator to create the map when going through data from database. But any optimization suggestion is welcome.
- Is there any more efficient way on time/space to find the period code. I found i have to use nested for loop to traversing, which gives me O(n2).
Thanks in advance
解决方案
找到了解决方案,由于处理结果集可能需要一些时间,所以我要求上游用户每 5 秒轮询一次,直到计算出结果。
推荐阅读
- prolog - 如果已经存在 [a,b],如何从列表中删除 [b,a]
- unix - 本地转发错误的 zsh 命令
- c# - Windows PPKG 配置文件的签名和加密格式/规范
- javascript - 使用超时并基于 JSON url 构建 html 元素
- c# - 在 Autofac 中自动处理子生命周期范围
- reactjs - 自动化 ReactJS 应用程序中的超时问题
- ubuntu - xspec:加载共享库时出错:libXSPlot.so:无法打开共享对象文件:没有这样的目录
- python - Pandas - 从样本数据的随机选择中生成新值
- c++ - 如何抑制未使用的(void **arg)参数?
- javascript - `xhr.getAllHeaders()` 在浏览器扩展上下文中缺少一些标题?