首页 > 解决方案 > How to improve algorithm for transforming dot separated strings to data objects?

问题描述

I'm currently working on an algorithm to transform dot separated strings (Tag1.Tag2.Foo) into data objects that will be used in an tree view component. For each part of the string will be an object created, if there is none. Additional the objects will be "linked together" with a parent id so that in the end the data will be shown as tree.

raw data:

"Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3"

transforms into:

[
  {
    "Id": "ac96e451-b4e4-47f7-afd2-0673b28128d8",
    "ParentId": "00000000-0000-0000-0000-000000000000",
    "Identifier": "Tag1",
    "Text": "Tag1"
  },
  {
    "Id": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
    "ParentId": "ac96e451-b4e4-47f7-afd2-0673b28128d8",
    "Identifier": "Tag1.Tag2",
    "Text": "Tag2"
  },
  {
    "Id": "089403ee-193a-4992-96b7-d69ef9fff957",
    "ParentId": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
    "Identifier": "Tag1.Tag2.Foo",
    "Text": "Foo"
  },
  {
    "Id": "4d806c52-153e-4752-bb9d-f2d187fb11e8",
    "ParentId": "e9a86afe-7af5-41e8-a791-11f50a6d472b",
    "Identifier": "Tag1.Tag2.Bar",
    "Text": "Bar"
  },
  {
    "Id": "c3e1da58-075a-4c8a-9381-316a7b609b87",
    "ParentId": "00000000-0000-0000-0000-000000000000",
    "Identifier": "Tag2",
    "Text": "Tag2"
  },
  {
    "Id": "359c40db-dc12-44e1-b055-d198b568070d",
    "ParentId": "c3e1da58-075a-4c8a-9381-316a7b609b87",
    "Identifier": "Tag2.Tag3",
    "Text": "Tag3"
  },
  {
    "Id": "75418314-bc2c-489f-9319-d57375d4cfec",
    "ParentId": "359c40db-dc12-44e1-b055-d198b568070d",
    "Identifier": "Tag2.Tag3.FooBar",
    "Text": "FooBar"
  },
  {
    "Id": "fd64fe19-7dae-4965-9688-720e7fc210b7",
    "ParentId": "00000000-0000-0000-0000-000000000000",
    "Identifier": "Tag3",
    "Text": "Tag3"
  }
]

look in the tree view:

My current code works, but with large data it gets slow. 12000 strings was computed in 20-30 seconds. I think there is room for improvement. I took a look at parallel computing, but ran into race condition problems and problems with the ever changing output collection. Maybe someone could point out where I can improve my code or point in another direction of how to handle such a problem.

using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
namespace TreeBuilder
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new List<string> { "Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3" };

            var treeItems = new List<TreeItem>();

            foreach (var tag in tags)
            {
                CreateTreeItems(tag, treeItems);
            }

            Console.WriteLine(JsonConvert.SerializeObject(treeItems, Formatting.Indented));
        }

        static void CreateTreeItems(string rawItem, List<TreeItem> treeItems)
        {
            var rawItems = rawItem.Split('.');

            for (int i = 0; i < rawItems.Length; i++)
            {
                var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());

                // create item if doesnt exist already
                if (!treeItems.Any(ti => ti.Identifier == currentIdentifier))
                {
                    var item = new TreeItem
                    {
                        Id = Guid.NewGuid(),
                        ParentId = Guid.Empty,
                        Text = rawItems[i],
                        Identifier = currentIdentifier
                    };

                    treeItems.Add(item);

                    // add as child, if necessary
                    var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);

                    if (!string.IsNullOrEmpty(parentIdentifier))
                    {
                        item.ParentId = treeItems.FirstOrDefault(ti => ti.Identifier == parentIdentifier).Id;
                    }
                }
            }
        }
    }

    class TreeItem
    {
        public Guid Id { get; set; }
        public Guid ParentId { get; set; }
        public string Identifier { get; set; }
        public string Text { get; set; }
    }
}

Update

Added the changes mentioned in kalexis solution. I also removed the List because it is not necessary anymore. If a list is needed you can use .toList() on the dictionary values.

using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
namespace TreeBuilder
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new List<string> { "Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3" };

            var treeItems = new Dictionary<string, TreeItem>();

            foreach (var tag in tags)
            {
                CreateTreeItems(tag, treeItems);
            }

            Console.WriteLine(JsonConvert.SerializeObject(treeItems, Formatting.Indented));
        }

        static void CreateTreeItems(string rawItem, List<TreeItem> treeItems)
        {
            var rawItems = rawItem.Split('.');

            for (int i = 0; i < rawItems.Length; i++)
            {
                var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());

                // create item if doesnt exist already
                if (!treeItems.ContainsKey(currentIdentifier))
                {
                    var item = new TreeItem
                    {
                        Id = Guid.NewGuid(),
                        ParentId = Guid.Empty,
                        Text = rawItems[i],
                        Identifier = currentIdentifier
                    };

                   treeItems[currentIdentifier] = item;

                    // add as child, if necessary
                    var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);

                    if (!string.IsNullOrEmpty(parentIdentifier) 
                        && treeItems.TryGetValue(parentIdentifier, out var parent))
                    {
                        item.ParentId = parent.Id;
                    }
                }
            }
        }
    }

    class TreeItem
    {
        public Guid Id { get; set; }
        public Guid ParentId { get; set; }
        public string Identifier { get; set; }
        public string Text { get; set; }
    }
}

标签: c#algorithm

解决方案


I believe these are the cultprit:

  • treeItems.Any(ti => ti.Identifier == currentIdentifier)
  • treeItems.FirstOrDefault(ti => ti.Identifier == parentIdentifier)

These iterate the whole list, and the bigger it gets, the more time it will take. And that's for each tag! Instead, try passing a List<T> and a Dictionary<string, TreeItem>, populate both, and use Dictionary<string, TreeItem>.ContainsKey() to check whether the identifier is already present, and TryGetValue() to retrieve a parent. This way these checks are not O(n), but O(1) instead. Should work much better on big data sets. Please check and tell me how it went :-)

Like that:

using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;

namespace TreeBuilder
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new List<string> {"Tag1.Tag2.Foo", "Tag1.Tag2.Bar", "Tag2.Tag3.FooBar", "Tag3"};

            var itemList = new List<TreeItem>();
            var itemDict = new Dictionary<string, TreeItem>();

            foreach (var tag in tags)
            {
                CreateTreeItems(tag, itemList, itemDict);
            }

            Console.WriteLine(JsonConvert.SerializeObject(itemList, Formatting.Indented));
        }

        static void CreateTreeItems(string rawItem, List<TreeItem> itemList, Dictionary<string, TreeItem> itemDict)
        {
            var rawItems = rawItem.Split('.');

            for (var i = 0; i < rawItems.Length; i++)
            {
                var currentIdentifier = string.Join('.', rawItems.Take(i + 1).ToArray());

                // create item if doesnt exist already
                if (!itemDict.ContainsKey(currentIdentifier))
                {
                    var item = new TreeItem
                    {
                        Id = Guid.NewGuid(),
                        ParentId = Guid.Empty,
                        Text = rawItems[i],
                        Identifier = currentIdentifier
                    };

                    itemList.Add(item);
                    itemDict[currentIdentifier] = item;

                    // add as child, if necessary
                    var parentIdentifier = string.Join('.', currentIdentifier.Split('.')[0..(currentIdentifier.Split('.').Length - 1)]);

                    if (!string.IsNullOrEmpty(parentIdentifier) 
                        && itemDict.TryGetValue(parentIdentifier, out var parent))
                    {
                        item.ParentId = parent.Id;
                    }
                }
            }
        }
    }

    class TreeItem
    {
        public Guid Id { get; set; }
        public Guid ParentId { get; set; }
        public string Identifier { get; set; }
        public string Text { get; set; }
    }
}

I've checked the provided code for performance with autogenerated data (30'000 tags total), and my results are as follows:

Original: 11908.4479ms
With Dictionaries: 32.9763ms

Tell me whether you're experiencing the same results :-)


推荐阅读