How to remove duplicate documents and fields with null values

Problem description

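My csv file contains duplicate rows, some of which have empty values. I want to drop the empty fields in these records and fill them in from the other records for the same person. I managed to remove the fields with empty values, but I could not merge the duplicate records. Please help me!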

My csv file =>
name,surname,age,email,phone
Busra,Duygu,99,,05555555555
Busra,Duygu,,busraduygu@gmail.com,
Busra,Duygu,99,,
Busra,Duygu,,,

In other words, the same person appears several times in my csv file, and some of those records have empty values. The output I want: Büşra Duygu,99,busraduygu@gmail.com,05555555555
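To make the desired result concrete, here is a minimal sketch in plain Ruby, outside Logstash, of the merge described above: rows that share the same name and surname are collapsed into one record, and the first non-empty value seen for each column wins. The helper name `merge_duplicates` and the choice of name plus surname as the key are assumptions for illustration, and the parsing assumes simple CSV with no quoted fields.

```ruby
# Sample rows copied from the question.
csv_text = <<~DATA
  name,surname,age,email,phone
  Busra,Duygu,99,,05555555555
  Busra,Duygu,,busraduygu@gmail.com,
  Busra,Duygu,99,,
  Busra,Duygu,,,
DATA

# Merge rows that share the same key columns; the first non-empty value wins.
# Assumes simple CSV with no quoted fields or embedded commas.
def merge_duplicates(csv_text, key_columns)
  header, *lines = csv_text.lines.map(&:chomp).reject(&:empty?)
  columns = header.split(',')
  merged = {}
  lines.each do |line|
    # The -1 limit keeps trailing empty fields such as "Busra,Duygu,,,".
    row = columns.zip(line.split(',', -1)).to_h
    key = row.values_at(*key_columns)
    record = (merged[key] ||= columns.to_h { |c| [c, nil] })
    row.each do |column, value|
      record[column] ||= value unless value.nil? || value.empty?
    end
  end
  merged.values
end

people = merge_duplicates(csv_text, %w[name surname])
people.each { |person| puts person.values.join(',') }
```

With the sample rows, the four partial records should collapse into a single line carrying the age, email, and phone.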

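To do this, I first loaded the csv file into the null_problem index. Then I created an index named null_problem_fingerprint to consolidate the duplicate documents with the fingerprint method, but I did not succeed.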

null_problem index =>

input{
    file { 
      path => ".../null_problem.csv"
      start_position => "beginning"
      sincedb_path => "NUL" 
    }
}
filter{
    csv{        
        autodetect_column_names => "true"
        separator => ","
        skip_header => "true"
        columns => ["name","surname","age","email","phone"]
    }
    mutate { 
        remove_field =>["path", "host", "message", "@version", "@timestamp", "trade_date"]
    }
    ruby {
        code => "
            def walk_hash(parent, path, hash)
                path << parent if parent
                hash.each do |key, value|
                    walk_hash(key, path, value) if value.is_a?(Hash)
                    @paths << (path + [key]).map { |p| '[' + p + ']' }.join('')
                end
                path.pop
            end
            @paths = []
            walk_hash(nil, [], event.to_hash)
            @paths.each do |path|
                value = event.get(path)
                event.remove(path) if value.nil? || (value.respond_to?(:empty?) && value.empty?)
            end
            "
    }
}
output{
    elasticsearch { 
        hosts => "http://localhost:9200"
        index => "null_problem"
        document_type => "_doc"
    }
    stdout {}
}
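The ruby filter in this pipeline walks the event and removes every field whose value is nil or empty. The same idea can be sketched in plain Ruby on a nested Hash, which makes it easier to try outside Logstash. `prune_empty` is a hypothetical helper name and is not part of Logstash.

```ruby
# Recursively drop keys whose value is nil or empty (hypothetical helper).
def prune_empty(hash)
  hash.each_with_object({}) do |(key, value), kept|
    # Clean nested hashes first so a hash left with no keys is dropped too.
    value = prune_empty(value) if value.is_a?(Hash)
    next if value.nil? || (value.respond_to?(:empty?) && value.empty?)
    kept[key] = value
  end
end

row = { 'name' => 'Busra', 'surname' => 'Duygu', 'age' => '99', 'email' => '', 'phone' => nil }
p prune_empty(row)
```

For the sample row, only the name, surname, and age keys should remain.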

null_problem_fingerprint index =>

input {
  elasticsearch {
    hosts => "localhost"
    index => "null_problem"
    query => '{ "sort": [ "_doc" ] }'
  }
}
filter{  
    fingerprint {
    method => "SHA1"
    source => ["name","surname","age","email","phone"]
    target => "[@metadata][generated_id]"
    concatenate_sources => "true"   
  }
  mutate { 
        remove_field =>["path", "host", "message", "@version", "@timestamp", "trade_date"]
  }
}
output {
    stdout { codec => dots }
    elasticsearch {
        index => "null_problem_fingerprint"
        document_id => "%{[@metadata][generated_id]}"
        doc_as_upsert => "true"
        action => "update"
    }
}

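I removed the fields with empty values using the code block in the ruby filter, but after fingerprinting I still cannot get the desired output. Please help me!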
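One likely reason the fingerprint step does not merge anything: the fingerprint is computed over all five fields, so each partial row gets a different ID and the upsert never targets the same document. If the ID were derived only from the columns that identify a person (here assumed to be name and surname, e.g. `source => ["name","surname"]` in the fingerprint filter), every partial row would map to one ID, and an Elasticsearch partial update with `doc_as_upsert` would merge the non-empty fields into a single document. The sketch below illustrates the difference in plain Ruby. `document_id` is a hypothetical helper, and its `|`-joined input is a simplification rather than the exact string Logstash hashes.

```ruby
require 'digest'

# Hypothetical helper: SHA1 over the chosen fields of a row, used as a document ID.
def document_id(row, fields)
  Digest::SHA1.hexdigest(fields.map { |f| row[f].to_s }.join('|'))
end

# The four rows from the question after empty fields have been removed.
rows = [
  { 'name' => 'Busra', 'surname' => 'Duygu', 'age' => '99', 'phone' => '05555555555' },
  { 'name' => 'Busra', 'surname' => 'Duygu', 'email' => 'busraduygu@gmail.com' },
  { 'name' => 'Busra', 'surname' => 'Duygu', 'age' => '99' },
  { 'name' => 'Busra', 'surname' => 'Duygu' }
]

all_fields = %w[name surname age email phone]
key_fields = %w[name surname] # assumption: these two columns identify a person

ids_all = rows.map { |r| document_id(r, all_fields) }.uniq
ids_key = rows.map { |r| document_id(r, key_fields) }.uniq

puts "distinct IDs from all five fields: #{ids_all.size}"
puts "distinct IDs from name and surname: #{ids_key.size}"
```

Hashing all five fields yields one ID per row, while hashing only the key fields yields a single shared ID, which is what the update action needs in order to accumulate fields on one document.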

Tags: elasticsearch, amazon-elastic-beanstalk, logstash, kibana, elastic-stack

Solution

