Converting a large txt file into any structured format

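Problem description

I have a large txt file with space-separated "columns", and I want to convert it to JSON, xlsx, csv or something similar so that I can work with the data programmatically.

The file is large, so I won't post the whole thing. Here is a snippet as an example: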

ID number Name                              TitlFed  Grade GamesBorn Flag
10207538  A E M, Doshtagir                      BAN  1864    0        i
10206612  A K M, Sourab                         BAN  1714    0        i
 5045886  A K, Kalshyan                         IND  1958    0  1964  
 8605360  A La, Teng Hua                        CHN  1915    0  1993  wi
 5031605  A, Akshaya                            IND  2016   29  1994  w
 5080444  A, Sohita                             IND  1447    0  1995  wi
 5706068  A. Nashir, Mohd Khairul Nazrin        MAS  1878    0        i
10201971  A.f.m., Mahfuzul Haque                BAN  1690    0        
10202650  A.k. Azad, Akand                      BAN  1692    0        i
10210997  A.K.M. Mehfuz                         BAN  2015    0        
24663832  Aab, Manfred                          GER  1808    0  1963  
 1701991  Aaberg, Anton                         SWE  2374    4  1972  
 1513966  Aabid, Ryaad                          NOR  1642    0  1958  
 1407589  Aabling-Thomsen, Jakob            f   DEN  2331   18  1985  
12524670  Aadeli, Arvin                         IRI  2015    0        
 5072662  Aadhityaa, M                          IND  1898   10  1999  
25034677  Aadish S                              IND  1528    5  1999  
 5086183  Aaditt, M K                           IND  1610    0  1996  i
 5027942  Aaditya, Jagadeesh                    IND  1814   16  1998  
25011952  Aadityan G                            IND  1621    7  2001  
 5063485  Aadityan, N.                          IND  1758    8  1996  
 1427024  Aagaard, Gert                         DEN  2030    7  1966  
 1401815  Aagaard, Jacob                    g   DEN  2506    9  1973  
 1411802  Aagaard, Kasper                       DEN  1913    0  1992  i
 1017942  Aagaard, Michael                      NED  2075    0  1960  
 1406248  Aage, Bjarke                          DEN  2068    0  1978  i
 1506064  Aagedal, Geir Ole                     NOR  1833    7  1957  
25021044  Aagney L., Narasimhan                 IND  1285    6  2000  
10205640  Aahelee, Sarker                       BAN  1577    0        w
25014510  Aakanksha Hagawane                    IND  1622    0  2000  w
25030388  Aakash Jain                           IND  1577    7  1998  
35004336  Aakash S B                            IND  1235   10  1998  
 5093295  Aakasha                               IND  1620    3  2000  w
  504599  Aakio, Seppo                          FIN  2078    0  1954  
 1402315  Aalbaek, Kurt Frede Nissen            DEN  1440    0  1944  
 1024388  Aalbers, Klaas                        NED  1891    0  1955  i
 2252465  Aalbersberg Kroon, Pedro              ESP  1878    0  1933  
 2218682  Aalders, Hendricus                    ESP  2021    0  1930  i
 1033948  Aalders, Peter                        NED  1903    0  1964  
  501956  Aaltio, Erkki                         FIN  2118    0  1935  
 1504452  Aandal, Kristian                      NOR  2012    0  1985  i

I program in JavaScript, so ideally I would like to convert it to JSON, with each player/ID in its own object, like this:

var AllPlayers = [{
    "2434324243": {
        "name": "some guy",
        "title": "f",
        "fed": "USA",
        "grade": "1999",
        "games": "3",
        "born": "1990"
    },
    "8787878887": {
        "name": "anyone",
        "title": "",
        "fed": "BER",
        "grade": "2222",
        "games": "6",
        "born": "1970"
    }
}];

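I tried reading the txt file with the fs module in Node. I counted the length of each line (71 characters) and tried to push each 71-character chunk into an array. However, the whitespace seems to be dropped when the file is read, and since each person's information has a variable length, this approach does not work.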

var fs = require('fs');
var allPlayers = [];
var thisPlayer = '';
// 1st row length = 74
// other rows = 71
// 14895 rows
fs.readFile('jul12frl.txt', 'utf8', function(err, contents) {
    for (let x = 74; x < 14895; x++) {
        thisPlayer += contents[x];
        if (thisPlayer.length == 71) {
            allPlayers.push(thisPlayer);
            thisPlayer = '';
        }
    }
});

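I also tried Excel's built-in wizard to convert the txt file to Excel format, but it does not pick up all the required columns: it merges the name/title/fed/grade columns into one giant column.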

Tags: javascript, node.js, text

Solution


const data = `10207538  A E M, Doshtagir                      BAN  1864    0        i
10206612  A K M, Sourab                         BAN  1714    0        i
 5045886  A K, Kalshyan                         IND  1958    0  1964  
 8605360  A La, Teng Hua                        CHN  1915    0  1993  wi
 5031605  A, Akshaya                            IND  2016   29  1994  w
 5080444  A, Sohita                             IND  1447    0  1995  wi
 5706068  A. Nashir, Mohd Khairul Nazrin        MAS  1878    0        i
10201971  A.f.m., Mahfuzul Haque                BAN  1690    0        
10202650  A.k. Azad, Akand                      BAN  1692    0        i
10210997  A.K.M. Mehfuz                         BAN  2015    0        
24663832  Aab, Manfred                          GER  1808    0  1963  
 1701991  Aaberg, Anton                         SWE  2374    4  1972  
 1513966  Aabid, Ryaad                          NOR  1642    0  1958  
 1407589  Aabling-Thomsen, Jakob            f   DEN  2331   18  1985  
12524670  Aadeli, Arvin                         IRI  2015    0        
 5072662  Aadhityaa, M                          IND  1898   10  1999  
25034677  Aadish S                              IND  1528    5  1999  
 5086183  Aaditt, M K                           IND  1610    0  1996  i
 5027942  Aaditya, Jagadeesh                    IND  1814   16  1998  
25011952  Aadityan G                            IND  1621    7  2001  
 5063485  Aadityan, N.                          IND  1758    8  1996  
 1427024  Aagaard, Gert                         DEN  2030    7  1966  
 1401815  Aagaard, Jacob                    g   DEN  2506    9  1973  
 1411802  Aagaard, Kasper                       DEN  1913    0  1992  i
 1017942  Aagaard, Michael                      NED  2075    0  1960  
 1406248  Aage, Bjarke                          DEN  2068    0  1978  i
 1506064  Aagedal, Geir Ole                     NOR  1833    7  1957  
25021044  Aagney L., Narasimhan                 IND  1285    6  2000  
10205640  Aahelee, Sarker                       BAN  1577    0        w
25014510  Aakanksha Hagawane                    IND  1622    0  2000  w
25030388  Aakash Jain                           IND  1577    7  1998  
35004336  Aakash S B                            IND  1235   10  1998  
 5093295  Aakasha                               IND  1620    3  2000  w
  504599  Aakio, Seppo                          FIN  2078    0  1954  
 1402315  Aalbaek, Kurt Frede Nissen            DEN  1440    0  1944  
 1024388  Aalbers, Klaas                        NED  1891    0  1955  i
 2252465  Aalbersberg Kroon, Pedro              ESP  1878    0  1933  
 2218682  Aalders, Hendricus                    ESP  2021    0  1930  i
 1033948  Aalders, Peter                        NED  1903    0  1964  
  501956  Aaltio, Erkki                         FIN  2118    0  1935  
 1504452  Aandal, Kristian                      NOR  2012    0  1985  i`;



Assuming the rows and columns all have the same widths as in the example you provided, you can parse the data like this:

// Split original string into rows as an array of strings
const rows = data.split("\n"); // could be replaced with contents read from file

function parseRow(row) {
  // Extract each value from the row using the start and end index of its column
  const id = row.slice(0, 10).trim();
  const name = row.slice(10, 44).trim();
  const title = row.slice(44, 48).trim();
  const country = row.slice(48, 53).trim();
  const grade = row.slice(53, 60).trim();
  const games = row.slice(60, 64).trim();
  const born = row.slice(64, 70).trim();
  const flag = row.slice(70, 72).trim();

  return {
    id,
    name,
    title,
    country,
    // Parse numbers
    grade: grade && parseInt(grade, 10),
    games: games && parseInt(games, 10),
    born : born && parseInt(born, 10),
    flag
  }
}

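// Build an object keyed by player ID
const parsed = rows.reduce((acc, row) => {
  const player = parseRow(row); // renamed so it no longer shadows the outer `parsed`
  acc[player.id] = player;
  return acc;
}, {});

console.log(parsed);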

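This is a rough solution, but it seems to solve your problem, and it does work on the sample data you provided. If the full dataset is laid out differently from your sample, you may need to update the start and end indices of the individual columns.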
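To run the same parsing against the file on disk rather than a pasted string, something like the sketch below may help. It assumes that the first line of the file is the header row, that the column offsets match the ones used above, and that the file fits comfortably in memory (the question mentions 14895 rows of about 71 characters, so roughly 1 MB). The names COLUMNS, NUMERIC and parseFile, and the output file name players.json, are made up for illustration.

```javascript
const fs = require('fs');

// [field, start, end] offsets copied from the answer above; adjust them if
// the full file is laid out differently from the sample.
const COLUMNS = [
  ['id', 0, 10],
  ['name', 10, 44],
  ['title', 44, 48],
  ['country', 48, 53],
  ['grade', 53, 60],
  ['games', 60, 64],
  ['born', 64, 70],
  ['flag', 70, 72],
];
const NUMERIC = new Set(['grade', 'games', 'born']);

function parseRow(row) {
  const record = {};
  for (const [field, start, end] of COLUMNS) {
    const text = row.slice(start, end).trim();
    // Empty numeric columns (e.g. a missing birth year) stay as ''
    record[field] = NUMERIC.has(field) && text !== '' ? parseInt(text, 10) : text;
  }
  return record;
}

// Hypothetical helper: assumes line 1 is the header and skips blank lines
function parseFile(filePath) {
  const lines = fs.readFileSync(filePath, 'utf8').split(/\r?\n/);
  const players = {};
  for (const line of lines.slice(1)) {
    if (line.trim() === '') continue;
    const player = parseRow(line);
    players[player.id] = player;
  }
  return players;
}

// Usage (the output file name is a placeholder):
// const players = parseFile('jul12frl.txt');
// fs.writeFileSync('players.json', JSON.stringify(players, null, 2));
```

Splitting on /\r?\n/ handles both Unix and Windows line endings. Note that because the IDs are integer-like keys, the resulting object iterates in ascending ID order rather than file order; if file order matters, collecting the records in an array is probably the safer choice.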

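However, in the sample data you provided, the columns are separated only by spaces. If the actual dataset were tab-separated, the solution would be simpler:
const [id, name, title, country, grade, games, born, flag] = row.split('\t');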
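For the tab-separated case, a minimal sketch might look like the following. It assumes every row carries the eight fields in the same order as the sample, with empty fields still delimited by tabs. TAB_FIELDS and parseTabRow are hypothetical names, and the example row is a tab-separated rendering of one line from the sample, not real input.

```javascript
// Assumed field order, mirroring the columns of the sample data
const TAB_FIELDS = ['id', 'name', 'title', 'country', 'grade', 'games', 'born', 'flag'];

function parseTabRow(row) {
  const values = row.split('\t');
  const record = {};
  TAB_FIELDS.forEach((field, i) => {
    // Missing trailing fields fall back to an empty string
    record[field] = (values[i] || '').trim();
  });
  return record;
}

// One sample row, rewritten with tabs between the fields
const tabExample = parseTabRow('1401815\tAagaard, Jacob\tg\tDEN\t2506\t9\t1973\t');
console.log(tabExample);
```

All values are kept as strings here; the numeric columns could be converted with parseInt(value, 10) in the same way as in the fixed-width version.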

