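Sep is a modern, high-performance CSV parsing library for .NET, designed for machine learning and similar workloads. It may be one of the fastest .NET CSV parsers in the world, and it is zero-allocation, cross-platform, trimmable, and AOT-compatible. This article walks through Sep's main features and usage, with several practical examples.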
Uses Span<T> and other modern .NET features to minimize memory allocation.
Supports only .NET 7.0 and later.
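To make the allocation point concrete, here is a minimal sketch in plain .NET (no Sep involved) of why span-based parsing avoids allocations: each field is parsed from a slice of the original line instead of from a new substring. The helper name `SumFields` is made up for illustration and is not part of Sep.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: sums the integer fields of one delimited line
// without creating any intermediate strings.
static int SumFields(ReadOnlySpan<char> line, char separator)
{
    var sum = 0;
    while (true)
    {
        var end = line.IndexOf(separator);
        // Slicing returns a view over the same memory, not a new string
        var field = end < 0 ? line : line[..end];
        sum += int.Parse(field, NumberStyles.Integer, CultureInfo.InvariantCulture);
        if (end < 0) { return sum; }
        line = line[(end + 1)..];
    }
}

Console.WriteLine(SumFields("1;20;300", ';')); // prints 321
```

A library such as Sep applies the same idea at scale, which is presumably why its columns are exposed as spans first and as strings only on request.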
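The following example reads CSV from an in-memory string:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    var text = @"A;B;C;D;E;F
Sep;🚀;1;1.2;0.1;0.5
CSV;✅;2;2.2;0.2;1.5";
    using var reader = Sep.Reader().FromText(text);
    var header = reader.Header;
    // Column index looked up by name; row[idx] can be used instead of row["B"]
    var idx = header.IndexOf("B");
    foreach (var row in reader)
    {
        Console.WriteLine(row["C"].ToString() + " " + row["D"].ToString());
    }
    Console.ReadKey();
}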
Writing a CSV file goes through Sep.Writer():
C#
// Requires: using nietras.SeparatedValues;
internal class Program
{
    static void Main(string[] args)
    {
        using (var writer = Sep.Writer().ToFile("output.csv"))
        {
            foreach (var data in GetData())
            {
                // Disposing the row at the end of each iteration writes it out
                using var row = writer.NewRow();
                row["Name"].Set(data.Name);
                row["Age"].Format(data.Age);
                row["Score"].Format(data.Score);
            }
            writer.Flush();
        }
        Console.ReadKey();
    }

    // Generates the sample data
    private static IEnumerable<Person> GetData()
    {
        return new List<Person>
        {
            new Person { Name = "张三", Age = 25, Score = 85.5 },
            new Person { Name = "李四", Age = 30, Score = 92.0 },
            new Person { Name = "王五", Age = 22, Score = 78.5 }
        };
    }

    // Data model
    public class Person
    {
        public string Name { get; set; }
        public int Age { get; set; }
        public double Score { get; set; }
    }
}
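When experimenting, it can be handy to write to memory rather than to a file. The sketch below assumes the `ToText()` target and `writer.ToString()` behave as shown in Sep's README. The sample tuples are made up for illustration.

```csharp
using System;
using nietras.SeparatedValues;

// Made-up sample data
var people = new[]
{
    (Name: "Ann", Age: 25, Score: 85.5),
    (Name: "Bob", Age: 30, Score: 92.0),
};

using var writer = Sep.Writer().ToText();
foreach (var p in people)
{
    // The row is committed when it is disposed at the end of the iteration
    using var row = writer.NewRow();
    row["Name"].Set(p.Name);
    row["Age"].Format(p.Age);
    row["Score"].Format(p.Score);
}

// All rows are disposed here, so the text is complete
var csv = writer.ToString();
Console.WriteLine(csv);
```

The header row is derived from the column names used on the rows, so no separate header call is needed.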
Sep provides efficient multi-threaded processing:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    using var reader = Sep.Reader().FromFile("data.csv");
    // ParallelEnumerate already runs ProcessData in parallel,
    // so an extra AsParallel()/AsOrdered() is not needed.
    var results = reader.ParallelEnumerate(new SepReader.RowFunc<House>(ProcessData))
        //.Select(x => x.Price)
        .ToList();
    foreach (var house in results)
    {
        Console.WriteLine($"Size: {house.Size}, Rooms: {house.Rooms}, Price: {house.Price}");
    }
    Console.ReadKey();
}

// Converts one row into a House
private static House ProcessData(SepReader.Row row)
{
    // Parse<T>() works on the underlying span and avoids the intermediate string
    var house = new House
    {
        Size = row["Size"].Parse<double>(),
        Rooms = row["Rooms"].Parse<int>(),
        Price = row["Price"].Parse<double>()
    };
    return house;
}

// Data model
public class House
{
    public double Size { get; set; }
    public int Rooms { get; set; }
    public double Price { get; set; }
}
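For a quick experiment without a data file, the same pattern can be run against an in-memory string. This is a minimal sketch with made-up sample values; it assumes the lambda overload of `ParallelEnumerate` binds as it does in the example above.

```csharp
using System;
using System.Linq;
using nietras.SeparatedValues;

// Made-up sample data
var text = "Size;Rooms;Price\n50.5;2;100000\n80;3;150000\n120;4;230000";

using var reader = Sep.Reader().FromText(text);
// The lambda may run on worker threads; it returns a plain value
// and does not keep a reference to the row.
var prices = reader
    .ParallelEnumerate(row => row["Price"].Parse<double>())
    .ToList();

Console.WriteLine(prices.Sum()); // prints 480000
```

Parallel enumeration only pays off when the per-row work is substantial or the file is large. For three rows the threading overhead dominates.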
Sep uses a semicolon (;) as its default separator, but you can easily specify a different one:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    using var reader = Sep.New(',').Reader().FromFile("data.csv");
    foreach (var row in reader)
    {
        Console.WriteLine(row["Size"].ToString() + " " + row["Rooms"].ToString() + " " + row["Price"].ToString());
    }
    Console.ReadKey();
}
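As far as I understand from Sep's README, a reader created with `Sep.Reader()` and no explicit separator tries to detect the separator from the header row. The sketch below relies on that assumption and on `reader.Spec.Sep.Separator` exposing the result. The sample text is made up.

```csharp
using System;
using nietras.SeparatedValues;

// Made-up comma-separated sample
var text = "Size,Rooms,Price\n50.5,2,100000";

// No separator given: the reader is assumed to infer it from the header row
using var reader = Sep.Reader().FromText(text);
var detected = reader.Spec.Sep.Separator;
Console.WriteLine($"Detected separator: {detected}");

var rooms = 0;
foreach (var row in reader)
{
    rooms = row["Rooms"].Parse<int>();
}
Console.WriteLine(rooms);
```

Passing the separator explicitly with `Sep.New(',')` remains the safer choice when the format is known, since detection can be fooled by unusual headers.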
Sep can parse column values directly into a specified type:
C#
// Requires: using nietras.SeparatedValues; using System.Text.Json;
// House is the data model class defined earlier.
static void Main(string[] args)
{
    using var reader = Sep.New(',').Reader().FromFile("data.csv");
    foreach (var row in reader)
    {
        var house = new House
        {
            Size = row["Size"].Parse<double>(),
            Rooms = row["Rooms"].Parse<int>(),
            Price = row["Price"].Parse<double>()
        };
        Console.WriteLine(JsonSerializer.Serialize(house));
    }
    Console.ReadKey();
}
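`Parse<T>()` throws when a value cannot be converted. For untrusted data, a non-throwing variant is often preferable. The sketch below assumes the column type exposes `TryParse<T>(out T value)`, as listed in Sep's public API. The sample text is made up.

```csharp
using System;
using nietras.SeparatedValues;

// Made-up sample with one invalid Age value
var text = "Name;Age\nAnn;25\nBob;n/a";

using var reader = Sep.Reader().FromText(text);
var invalid = 0;
foreach (var row in reader)
{
    var name = row["Name"].ToString();
    if (row["Age"].TryParse<int>(out var age))
    {
        Console.WriteLine($"{name}: {age}");
    }
    else
    {
        // "n/a" is not an int, so the row is counted instead of throwing
        invalid++;
    }
}
Console.WriteLine($"Invalid rows: {invalid}");
```

Counting or logging bad rows this way keeps one malformed value from aborting a long import.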
You can work with several columns of a row at once:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    using var reader = Sep.New(',').Reader().FromFile("test_data.csv");
    foreach (var row in reader)
    {
        var (name, age, score) = (row["Name"].ToString(), row["Age"].Parse<int>(), row["Score"].Parse<double>());
        Console.WriteLine($"{name}: {age} years old, score: {score}");
    }
    Console.ReadKey();
}
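When several columns share a type, Sep's README also shows indexing a row with an array of column names and parsing them in one call. The following sketch follows that pattern, assuming `row[colNames].Parse<double>()` returns a span of values. The sample data is adapted from the first example.

```csharp
using System;
using nietras.SeparatedValues;

var text = "A;B;C;D;E;F\nSep;x;1;1.2;0.1;0.5\nCSV;y;2;2.2;0.2;1.5";

using var reader = Sep.Reader().FromText(text);
var colNames = new[] { "E", "F" };
var rowCount = 0;
foreach (var row in reader)
{
    var label = row["A"].ToString();
    // Parses columns E and F together; the result is assumed to be a Span<double>
    var values = row[colNames].Parse<double>();
    var sum = 0.0;
    foreach (var v in values) { sum += v; }
    Console.WriteLine($"{label}: {values.Length} values, sum {sum}");
    rowCount++;
}
```

This is convenient for feature vectors in machine learning data, where many adjacent columns hold numbers of the same type.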
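Use ParallelEnumerate for multi-threaded processing.
Use SepToString.PoolPerCol() for column-level string pooling.
Look up column indices once, outside the loop, and access columns by index:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    using var reader = Sep.New(',').Reader().FromFile("test_data.csv");
    // Resolve the column positions once instead of on every row
    var nameIdx = reader.Header.IndexOf("Name");
    var ageIdx = reader.Header.IndexOf("Age");
    foreach (var row in reader)
    {
        var name = row[nameIdx].ToString();
        var age = row[ageIdx].Parse<int>();
    }
    Console.ReadKey();
}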
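The column-level string pooling mentioned above is enabled through the reader options. The sketch below assumes the option is named `CreateToString`, as in Sep's README, and uses made-up data with a repeated value. With pooling, repeated values are expected to come back as the same string instance, which the `ReferenceEquals` check makes visible.

```csharp
using System;
using nietras.SeparatedValues;

// Made-up sample with a repeated City value
var text = "City;Value\nOslo;1\nOslo;2\nBergen;3";

using var reader = Sep
    .Reader(o => o with { CreateToString = SepToString.PoolPerCol() })
    .FromText(text);

string? previous = null;
foreach (var row in reader)
{
    var city = row["City"].ToString();
    // True means no new string was allocated for this value
    Console.WriteLine($"{city} reused: {ReferenceEquals(city, previous)}");
    previous = city;
}
```

Pooling helps most for columns with few distinct values, such as categories or country codes. For columns of unique values it only adds lookup cost.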
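Computing the average of a column:
C#
// Requires: using nietras.SeparatedValues;
using var reader = Sep.Reader().FromFile("scores.csv");
var sum = 0.0;
var count = 0;
foreach (var row in reader)
{
    sum += row["Score"].Parse<double>();
    count++;
}
// Avoid NaN when the file has no data rows
var average = count > 0 ? sum / count : 0.0;
Console.WriteLine($"Average score: {average}");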
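Filtering rows and writing the matches to a new file with an extra column:
C#
// Requires: using nietras.SeparatedValues;
using var reader = Sep.Reader().FromFile("employees.csv");
// Reuse the reader's spec so the output uses the same separator as the input
using var writer = reader.Spec.Writer().ToFile("high_salary_employees.csv");
foreach (var readRow in reader)
{
    var salary = readRow["Salary"].Parse<decimal>();
    if (salary > 100000)
    {
        // Copy the existing columns, then append the computed Bonus column
        using var writeRow = writer.NewRow(readRow);
        writeRow["Bonus"].Format(salary * 0.1m);
    }
}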
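Combining Sep with asynchronous processing of each parsed record:
C#
// Requires: using nietras.SeparatedValues;
// RowData, ProcessedData, ProcessDataAsync and SaveResultsAsync are
// application-specific and are not shown here.
async Task ProcessLargeFile(string filePath)
{
    using var reader = Sep.Reader().FromFile(filePath);
    var results = new List<ProcessedData>();
    // Enumerate(...) returns a regular IEnumerable<RowData>, so use foreach
    // (not await foreach) and await inside the loop body.
    foreach (var data in reader.Enumerate(row => ParseRow(row)))
    {
        var processedData = await ProcessDataAsync(data);
        results.Add(processedData);
    }
    await SaveResultsAsync(results);
}

static RowData ParseRow(SepReader.Row row)
{
    return new RowData
    {
        Id = row["Id"].Parse<int>(),
        Name = row["Name"].ToString(),
        Value = row["Value"].Parse<double>()
    };
}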
Skipping empty lines:
C#
// Requires: using nietras.SeparatedValues;
static void Main(string[] args)
{
    var text = @"A;B;C
1;2;3

4;5;6";
    // DisableColCountCheck lets the blank line through even though
    // it does not have three columns
    using var reader = Sep.Reader(o => o with { DisableColCountCheck = true }).FromText(text);
    var actual = new List<int>();
    foreach (var row in reader)
    {
        // Skip empty lines
        if (row.Span.Length == 0) { continue; }
        actual.Add(row["A"].Parse<int>());
    }
    foreach (var i in actual)
    {
        Console.WriteLine(i);
    }
    Console.ReadKey();
}
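Using the reader from an async method by moving row access into a local function:
C#
// Requires: using nietras.SeparatedValues; using System.Linq;
static async Task Main(string[] args)
{
    var text = @"C
1
2
3";
    using var reader = Sep.Reader().FromText(text);
    var squaredSum = await Task.Run(() => Enumerate(reader).Sum(x => x * x));
    Console.WriteLine(squaredSum);

    // The row is only touched inside this synchronous iterator
    static IEnumerable<int> Enumerate(SepReader reader)
    {
        foreach (var r in reader) { yield return r["C"].Parse<int>(); }
    }
}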
Sep supports unescaping quotes, which is controlled by the Unescape option in SepReaderOptions:
C#
// Requires: using nietras.SeparatedValues;
internal class Program
{
    static void Main(string[] args)
    {
        // Raw second line: "John Doe";"Software ""Engineer"""
        // (every quote is doubled again inside the verbatim string)
        var text = @"Name;Description
""John Doe"";""Software """"Engineer""""""";
        using var reader = Sep.Reader(o => o with { Unescape = true }).FromText(text);
        foreach (var row in reader)
        {
            var name = row["Name"].ToString(); // Result: John Doe
            var description = row["Description"].ToString(); // Result: Software "Engineer"
            Console.WriteLine($"Name: {name}, Description: {description}");
        }
        Console.ReadKey();
    }
}
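The escaping convention itself is easy to get wrong, so here is a small plain-C# sketch of the usual CSV rule: a quoted field is wrapped in quotes, and each quote inside it is doubled. The helper `UnescapeField` is hypothetical and only illustrates the rule. It is not Sep's implementation, and Sep's handling of malformed input may differ.

```csharp
using System;
using System.Text;

// Hypothetical helper: strips the enclosing quotes and collapses "" into ".
static string UnescapeField(string field)
{
    // Fields that are not wrapped in quotes are returned unchanged
    if (field.Length < 2 || field[0] != '"' || field[^1] != '"') { return field; }
    var inner = field.AsSpan(1, field.Length - 2);
    var sb = new StringBuilder(inner.Length);
    for (var i = 0; i < inner.Length; i++)
    {
        sb.Append(inner[i]);
        // Skip the second quote of an escaped pair
        if (inner[i] == '"' && i + 1 < inner.Length && inner[i + 1] == '"') { i++; }
    }
    return sb.ToString();
}

Console.WriteLine(UnescapeField("\"John Doe\""));                  // John Doe
Console.WriteLine(UnescapeField("\"Software \"\"Engineer\"\"\"")); // Software "Engineer"
```

Inside a C# verbatim string every quote has to be doubled once more, which is why quoted CSV samples in source code end up with long runs of quote characters.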
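According to the provided benchmark results, Sep performs very well across a range of scenarios:
With ParallelEnumerate, Sep is 23x faster than Sylvan and 35x faster than CsvHelper.
These gains come mainly from Sep's highly optimized string pooling, its optimized hashing of ReadOnlySpan<char>, and the integrated fast floating-point parser csFastFloat.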
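Sep gives .NET developers a fast, flexible, and easy-to-use solution for working with CSV. By building on modern .NET features and careful optimization, it keeps its API simple while delivering excellent performance. Whether you are handling a small dataset or large-scale machine learning data, Sep can cover a wide range of CSV parsing needs.
I hope the examples and explanations in this article help you understand and apply Sep. As the project continues to evolve, Sep has a good chance of becoming one of the go-to CSV tools in the .NET ecosystem.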
Author: 技术老小子
Copyright notice: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!