c# - How to improve algorithm for transforming dot separated strings to data objects?
问题描述
I'm currently working on an algorithm to transform dot separated strings (Tag1.Tag2.Foo) into data objects that will be used in an tree view component. For each part of the string will be an object created, if there is none. Additional the objects will be "linked together" with a parent id so that in the end the data will be shown as tree.
raw data:
"Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3"
transforms into:
[
{
"Id": "ac96e451-b4e4-47f7-afd2-0673b28128d8",
"ParentId": "00000000-0000-0000-0000-000000000000",
"Identifier": "Tag1",
"Text": "Tag1"
},
{
"Id": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
"ParentId": "ac96e451-b4e4-47f7-afd2-0673b28128d8",
"Identifier": "Tag1.Tag2",
"Text": "Tag2"
},
{
"Id": "089403ee-193a-4992-96b7-d69ef9fff957",
"ParentId": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
"Identifier": "Tag1.Tag2.Foo",
"Text": "Foo"
},
{
"Id": "4d806c52-153e-4752-bb9d-f2d187fb11e8",
"ParentId": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
"Identifier": "Tag1.Tag2.Bar",
"Text": "Bar"
},
{
"Id": "c3e1da58-075a-4c8a-9381-316a7b609b87",
"ParentId": "00000000-0000-0000-0000-000000000000",
"Identifier": "Tag2",
"Text": "Tag2"
},
{
"Id": "359c40db-dc12-44e1-b055-d198b568070d",
"ParentId": "c3e1da58-075a-4c8a-9381-316a7b609b87",
"Identifier": "Tag2.Tag3",
"Text": "Tag3"
},
{
"Id": "75418314-bc2c-489f-9319-d57375d4cfec",
"ParentId": "359c40db-dc12-44e1-b055-d198b568070d",
"Identifier": "Tag2.Tag3.FooBar",
"Text": "FooBar"
},
{
"Id": "fd64fe19-7dae-4965-9688-720e7fc210b7",
"ParentId": "00000000-0000-0000-0000-000000000000",
"Identifier": "Tag3",
"Text": "Tag3"
}
]
look in the tree view:
-
Tag1
-
Tag2
- Foo
- Bar
-
Tag2
- Tag2
- Tag3
- FooBar
- Tag3
- Tag3
My current code works, but with large data it gets slow. 12000 strings was computed in 20-30 seconds. I think there is room for improvement. I took a look at parallel computing, but ran into race condition problems and problems with the ever changing output collection. Maybe someone could point out where I can improve my code or point in another direction of how to handle such a problem.
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
namespace TreeBuilder
{
class Program
{
static void Main(string[] args)
{
var tags = new List<string> { "Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3" };
var treeItems = new List<TreeItem>();
foreach (var tag in tags)
{
CreateTreeItems(tag, treeItems);
}
Console.WriteLine(JsonConvert.SerializeObject(treeItems, Formatting.Indented));
}
static void CreateTreeItems(string rawItem, List<TreeItem> treeItems)
{
var rawItems = rawItem.Split('.');
for (int i = 0; i < rawItems.Length; i++)
{
var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());
// create item if doesnt exist already
if (!treeItems.Any(ti => ti.Identifier == currentIdentifier))
{
var item = new TreeItem
{
Id = Guid.NewGuid(),
ParentId = Guid.Empty,
Text = rawItems[i],
Identifier = currentIdentifier
};
treeItems.Add(item);
// add as child, if necessary
var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);
if (!string.IsNullOrEmpty(parentIdentifier))
{
item.ParentId = treeItems.FirstOrDefault(ti => ti.Identifier == parentIdentifier).Id;
}
}
}
}
}
class TreeItem
{
public Guid Id { get; set; }
public Guid ParentId { get; set; }
public string Identifier { get; set; }
public string Text { get; set; }
}
}
Update
Added the changes mentioned in kalexis solution. I also removed the List because it is not necessary anymore. If a list is needed you can use .toList() on the dictionary values.
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
namespace TreeBuilder
{
class Program
{
static void Main(string[] args)
{
var tags = new List<string> { "Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3" };
var treeItems = new Dictionary<string, TreeItem>();
foreach (var tag in tags)
{
CreateTreeItems(tag, treeItems);
}
Console.WriteLine(JsonConvert.SerializeObject(treeItems, Formatting.Indented));
}
static void CreateTreeItems(string rawItem, List<TreeItem> treeItems)
{
var rawItems = rawItem.Split('.');
for (int i = 0; i < rawItems.Length; i++)
{
var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());
// create item if doesnt exist already
if (!treeItems.ContainsKey(currentIdentifier))
{
var item = new TreeItem
{
Id = Guid.NewGuid(),
ParentId = Guid.Empty,
Text = rawItems[i],
Identifier = currentIdentifier
};
treeItems[currentIdentifier] = item;
// add as child, if necessary
var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);
if (!string.IsNullOrEmpty(parentIdentifier)
&& treeItems.TryGetValue(parentIdentifier, out var parent))
{
item.ParentId = parent.Id;
}
}
}
}
}
class TreeItem
{
public Guid Id { get; set; }
public Guid ParentId { get; set; }
public string Identifier { get; set; }
public string Text { get; set; }
}
}
解决方案
I believe these are the cultprit:
treeItems.Any(ti => ti.Identifier == currentIdentifier)
treeItems.FirstOrDefault(ti => ti.Identifier == parentIdentifier)
These iterate the whole list, and the bigger it gets, the more time it will take. And that's for each tag! Instead, try passing a List<T>
and a Dictionary<string, TreeItem>
, populate both, and use Dictionary<string, TreeItem>.ContainsKey()
to check whether the identifier is already present, and TryGetValue()
to retrieve a parent. This way these checks are not O(n)
, but O(1)
instead. Should work much better on big data sets. Please check and tell me how it went :-)
Like that:
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
namespace TreeBuilder
{
class Program
{
static void Main(string[] args)
{
var tags = new List<string> {"Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3"};
var itemList = new List<TreeItem>();
var itemDict = new Dictionary<string, TreeItem>();
foreach (var tag in tags)
{
CreateTreeItems(tag, itemList, itemDict);
}
Console.WriteLine(JsonConvert.SerializeObject(itemList, Formatting.Indented));
}
static void CreateTreeItems(string rawItem, List<TreeItem> itemList, Dictionary<string, TreeItem> itemDict)
{
var rawItems = rawItem.Split('.');
for (var i = 0; i < rawItems.Length; i++)
{
var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());
// create item if doesnt exist already
if (!itemDict.ContainsKey(currentIdentifier))
{
var item = new TreeItem
{
Id = Guid.NewGuid(),
ParentId = Guid.Empty,
Text = rawItems[i],
Identifier = currentIdentifier
};
itemList.Add(item);
itemDict[currentIdentifier] = item;
// add as child, if necessary
var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);
if (!string.IsNullOrEmpty(parentIdentifier)
&& itemDict.TryGetValue(parentIdentifier, out var parent))
{
item.ParentId = parent.Id;
}
}
}
}
}
class TreeItem
{
public Guid Id { get; set; }
public Guid ParentId { get; set; }
public string Identifier { get; set; }
public string Text { get; set; }
}
}
I've checked the provided code for performance with autogenerated data (30'000 tags total), and my results are as follows:
Original: 11908.4479ms
With Dictionaries: 32.9763ms
Tell me whether you're experiencing the same results :-)
推荐阅读
- mongodb - 如何根据满足的条件有不同的更新值
- c# - 使用包含逗号的正则表达式从字符串末尾删除一系列字符
- c# - 为预先检查为 null 并被锁定的对象(即套接字)引发 NullReferenceException
- javascript - 使用 lodash 对具有多个条件和未知键的对象进行排序
- python - 在music21中,我怎样才能让它不忽略速度标记?
- java - 一个事件可以由 2 个轴突处理程序处理吗
- objective-c - 将 NSData 转换为浮点数组 Objective C
- r - 编写for循环以读取excel表
- javascript - 为什么箭头函数 ()=>{} 语法无效?
- python - 登录几秒钟后,Chrome 中的 Selenium Python 出现“会话已过期”错误