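java - How can I make this code thread-safe?
Problem description
This code is part of a method. It uses two for loops to iterate over two lists. I would like to know whether multithreading can speed up the work done in these two loops. My concern is how to make it thread-safe.
Edited: more complete code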
static class Similarity {
    double similarity;
    String seedWord;
    String candidateWord;

    public Similarity(double similarity, String seedWord, String candidateWord) {
        this.similarity = similarity;
        this.seedWord = seedWord;
        this.candidateWord = candidateWord;
    }

    public double getSimilarity() {
        return similarity;
    }

    public String getSeedWord() {
        return seedWord;
    }

    public String getCandidateWord() {
        return candidateWord;
    }
}
static class SimilarityTask implements Callable<Similarity> {
    Word2Vec vectors;
    String seedWord;
    String candidateWord;
    Collection<String> label1;
    Collection<String> label2;

    public SimilarityTask(Word2Vec vectors, String seedWord, String candidateWord, Collection<String> label1, Collection<String> label2) {
        this.vectors = vectors;
        this.seedWord = seedWord;
        this.candidateWord = candidateWord;
        this.label1 = label1;
        this.label2 = label2;
    }

    @Override
    public Similarity call() {
        double similarity = cosineSimForSentence(vectors, label1, label2);
        return new Similarity(similarity, seedWord, candidateWord);
    }
}
Now, is this compute method thread-safe? Three variables are involved:
1) vectors;
2) tokenizerFactory;
3) similarities;
public static void compute() throws Exception {
    File modelFile = new File("sim.bin");
    Word2Vec vectors = WordVectorSerializer.readWord2VecModel(modelFile);
    TokenizerFactory tokenizerFactory = new TokenizerFactory();
    List<String> seedList = loadSeeds();
    List<String> candidateList = loadCandidates();
    log.info("Computing similarity: ");
    ExecutorService POOL = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<Similarity>> tasks = new ArrayList<>();
    int totalCount = 0;
    // Submit one task per (seed, candidate) pair.
    for (String seed : seedList) {
        Collection<String> label1 = getTokens(seed.trim(), tokenizerFactory);
        if (label1.isEmpty()) {
            continue;
        }
        for (String candidate : candidateList) {
            Collection<String> label2 = getTokens(candidate.trim(), tokenizerFactory);
            if (label2.isEmpty()) {
                continue;
            }
            Callable<Similarity> callable = new SimilarityTask(vectors, seed, candidate, label1, label2);
            tasks.add(POOL.submit(callable));
            log.info("TotalCount:" + (++totalCount));
        }
    }
    // Collect the results on the calling thread only.
    Map<String, Set<String>> similarities = new HashMap<>();
    int validCount = 0;
    for (Future<Similarity> task : tasks) {
        Similarity simi = task.get();
        Double similarity = simi.getSimilarity();
        String seedWord = simi.getSeedWord();
        String candidateWord = simi.getCandidateWord();
        Set<String> similarityWords = similarities.get(seedWord);
        if (similarity >= 0.85) {
            if (similarityWords == null) {
                similarityWords = new HashSet<>();
            }
            similarityWords.add(candidateWord);
            log.info(seedWord + " " + similarity + " " + candidateWord);
            log.info("ValidCount: " + (++validCount));
        }
        if (similarityWords != null) {
            similarities.put(seedWord, similarityWords);
        }
    }
    // Let the worker threads exit once all tasks are done.
    POOL.shutdown();
}
Added another related method, which is used by call():
public static double cosineSimForSentence(Word2Vec vectors, Collection<String> label1, Collection<String> label2) {
    try {
        return Transforms.cosineSim(vectors.getWordVectorsMean(label1), vectors.getWordVectorsMean(label2));
    } catch (Exception e) {
        // Treated as an out-of-vocabulary case: log it and report zero similarity.
        log.warn("OOV: " + label1.toString() + " " + label2.toString());
        //e.getMessage();
        //e.printStackTrace();
        return 0.0;
    }
}
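Solution
(Answer updated for the changed question.)
In general, you should profile code before trying to optimize it, especially if it is quite complex.
For threading, you need to identify which mutable state is shared between threads. Ideally, eliminate as much of that sharing as possible before resorting to locks and concurrent data structures. Mutable state that is confined to a single thread is not a problem in itself. Immutable objects are great.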
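I am assuming that nothing passed to your tasks gets modified. It is hard to tell. Putting final on the fields is a good idea. Collections can be placed in unmodifiable wrappers, but that does not stop them from being modified through other references, and the unmodifiability does not show up in the static type.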
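To make the point about final fields and unmodifiable wrappers concrete, here is a minimal sketch. The class names ImmutableInputsDemo and TaskInput are hypothetical and not part of the question's code; the sketch assumes Java 10 or later for List.copyOf. It contrasts an unmodifiable view, which still reflects changes made through the original reference, with a defensive immutable copy.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of the original question.
final class ImmutableInputsDemo {

    // Task input with final fields and an immutable copy of the tokens.
    static final class TaskInput {
        private final String seedWord;
        private final List<String> tokens;

        TaskInput(String seedWord, Collection<String> tokens) {
            this.seedWord = seedWord;
            // List.copyOf copies the elements, so later changes to the
            // caller's collection are not visible through this object.
            this.tokens = List.copyOf(tokens);
        }

        String getSeedWord() { return seedWord; }
        List<String> getTokens() { return tokens; }
    }

    // An unmodifiable wrapper is only a view: it still sees changes
    // made through the original reference.
    static int sizeSeenThroughWrapper() {
        List<String> backing = new ArrayList<>(List.of("a", "b"));
        Collection<String> view = Collections.unmodifiableCollection(backing);
        backing.add("c");
        return view.size();
    }

    // A defensive copy is unaffected by later changes to the original list.
    static int sizeSeenThroughCopy() {
        List<String> backing = new ArrayList<>(List.of("a", "b"));
        TaskInput input = new TaskInput("seed", backing);
        backing.add("c");
        return input.getTokens().size();
    }

    public static void main(String[] args) {
        System.out.println(sizeSeenThroughWrapper()); // prints 3
        System.out.println(sizeSeenThroughCopy());    // prints 2
    }
}
```

Note that the wrapper's declared type is still Collection<String>, so nothing in the signature tells a reader that it cannot be modified.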
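Assuming you do not split up the inner loop, the only shared mutable state appears to be similarities and the values it contains.
You may or may not find that you still end up doing too much work serially, and need to change similarities to be concurrent: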
ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
The get and put operations on similarities will then need to be thread-safe. I suggest always creating the Set.
Set<String> similarityWords = similarities.getOrDefault(seed, new HashSet<>());
or, to also store a newly created set in the map in one atomic step:
Set<String> similarityWords = similarities.computeIfAbsent(seed, key -> new HashSet<>());
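The two variants above behave differently in one respect that is easy to miss: getOrDefault only returns the fallback value, while computeIfAbsent also stores it in the map. The following sketch illustrates this; the class name GetOrCreateDemo and the keys used are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo class; not part of the original answer.
final class GetOrCreateDemo {

    // getOrDefault hands back the new set but never puts it into the map,
    // so a separate put would be needed (and the pair is not atomic).
    static boolean storedAfterGetOrDefault() {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        Set<String> words = similarities.getOrDefault("seed", new HashSet<>());
        words.add("candidate");
        return similarities.containsKey("seed");
    }

    // computeIfAbsent creates the set and stores it in a single atomic step
    // on a ConcurrentHashMap.
    static boolean storedAfterComputeIfAbsent() {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        Set<String> words = similarities.computeIfAbsent("seed", key -> new HashSet<>());
        words.add("candidate");
        return similarities.containsKey("seed");
    }

    public static void main(String[] args) {
        System.out.println(storedAfterGetOrDefault());    // prints false
        System.out.println(storedAfterComputeIfAbsent()); // prints true
    }
}
```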
You could use a thread-safe Set (for example, one created with Collections.synchronizedSet), but I suggest holding the relevant lock for the whole of the inner loop.
synchronized (similarityWords) {
    ...
}
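As a rough sketch of how this could fit together, the example below runs one task per seed and holds the set's lock for the whole inner loop over the candidates. Everything here is illustrative: LockedSetDemo, fakeSimilarity, the pool size of 4, and the first-letter scoring rule are assumptions standing in for the real Word2Vec computation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not part of the original answer.
final class LockedSetDemo {

    // Stand-in for the real similarity: high when the first letters match.
    // Assumes non-empty strings.
    static double fakeSimilarity(String seed, String candidate) {
        return seed.charAt(0) == candidate.charAt(0) ? 0.9 : 0.1;
    }

    static ConcurrentMap<String, Set<String>> compute(List<String> seeds, List<String> candidates) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (String seed : seeds) {
                // One task per seed; the inner loop runs inside the task.
                futures.add(pool.submit(() -> {
                    Set<String> words = similarities.computeIfAbsent(seed, key -> new HashSet<>());
                    // Hold the lock on this seed's set for the whole inner loop.
                    synchronized (words) {
                        for (String candidate : candidates) {
                            if (fakeSimilarity(seed, candidate) >= 0.85) {
                                words.add(candidate);
                            }
                        }
                    }
                }));
            }
            // Waiting on each Future makes the tasks' writes visible here.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a similarity task failed", e);
        } finally {
            pool.shutdown();
        }
        return similarities;
    }

    public static void main(String[] args) {
        System.out.println(compute(List.of("apple", "banana"),
                List.of("avocado", "blueberry", "apricot")));
    }
}
```

Holding the lock across the loop means two tasks for the same seed run one after the other, while tasks for different seeds do not block each other.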
If you want to create similarityWords lazily, that would be "more fun".
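One possible way to approach lazy creation, offered only as a sketch and not as what the answer had in mind: create the set the first time a pair actually passes the threshold, so seeds without matches never get an entry. The sketch swaps HashSet for ConcurrentHashMap.newKeySet() so that no external lock is needed; the names LazySetDemo and record and the threshold parameter are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo class; not part of the original answer.
final class LazySetDemo {

    // Records a candidate for a seed only when the similarity passes the
    // threshold; the per-seed set is created on first use.
    static void record(ConcurrentMap<String, Set<String>> similarities,
                       String seed, String candidate,
                       double similarity, double threshold) {
        if (similarity < threshold) {
            return; // nothing to store, so no set is created
        }
        // newKeySet() returns a concurrent set, so add() is safe without a lock.
        similarities
                .computeIfAbsent(seed, key -> ConcurrentHashMap.newKeySet())
                .add(candidate);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Set<String>> similarities = new ConcurrentHashMap<>();
        record(similarities, "apple", "apricot", 0.9, 0.85);
        record(similarities, "banana", "cherry", 0.1, 0.85);
        System.out.println(similarities); // prints {apple=[apricot]}
    }
}
```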