NumPy 用 loadtxt savetxt ，HDF5 用 File ，JSON 是 load 和 dump 。

csv

注意事项：一般的.csv文件的第一行为行头

使用numpy

numpy中集成了处理csv文件的函数

sq = np.arange(100).reshape((10,10))

数据保存

savetxt(fname, data, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)

np.savetxt("s100.txt",sq,fmt="%d") # 使用“%d”增加可读性
np.savetxt("s100_18e.txt",sq) # 默认为"%.18e"18位科学计数法

数据读入

使用np.loadtxt将文本文件直接读取为numpy.ndarray

np.loadtxt("s100.txt", dtype=int) # dtype=int 保证以整数解读文件，否则数字读入时会被暗中转化成浮点型。
# 一些其他类型：int16 int32 int64 float16 float32 float64 float128 
print(type(np.loadtxt("s100.txt", dtype=int) )) # =>  <class 'numpy.ndarray'>

使用`csv`库

import csv

文件读入

path = Path("s100_1.txt")
lines = path.read_text().splitlines() # lines的类型为list
reader = csv.reader(lines) # 迭代器类型
header = next(reader)

注意，csv.reader默认以','作为分隔符
自定义分隔符：使用 delimiter=';'参数

path = Path("s100_1.txt")
lines = path.read_text().splitlines() 
reader = csv.reader(lines, delimiter=';') # 自定义分隔符
header = next(reader)

HDF5

HDF5 具有数据的原始（raw）表示，即 HDF 中保存的是与内存同样标准的整数、浮点数，不会有类似 CSV 的精度损失。HDF5 的数据类型自我描述，在读入内存时不需要额外的信息源，因为 HDF5 文件中包含了数据类型和长度等辅助信息。

一个h5文件的例子

s100.h5: Hierarchical Data Format (version 5) data
HDF5 "s100.h5" {
GROUP "/" {
DATASET "s100" {
    DATATYPE  H5T_STD_I64LE
    DATASPACE  SIMPLE { ( 10, 10 ) / ( 10, 10 ) }
}
}
}

GROUP可以嵌套，类似于文件夹
DATASET是多维数组

命令行中查看h5文件

(bash)h5dump：Displays HDF5 file contents.
- -A Print the header and value of attributes; data of datasets is not displayed.
在vscode中可以下载相关拓展
使用vitables软件

文件写入(基于numpy)

h5py.File()产生的对象不是字典，但是模拟了字典的接口
- 也有.keys(),.values()方法
h5文件的dataset，也就是“values”具有类numpy数组属性

基本操作

# 写入一个h5文件
s100 = np.arange(100).reshape((10,10))
with h5py.File("s100.h5","w") as f:
    f["s100"] = s100

类似于字典，"GROUP路径"（类似于文件夹目录）作为key，dataset（多维数组）作为value

进阶操作

with h5py.File('data.h5', 'w') as f:  # 'w'模式覆盖写入
    # 创建数据集并写入数组
    dset = f.create_dataset("matrix", data=np.random.rand(100, 100))
    
    # 添加元数据属性
    dset.attrs["description"] = "100x100 random matrix"
    
    # 创建组并在组内添加数据集
    grp = f.create_group("experiment")
    grp.create_dataset("temperatures", data=[25.3, 26.1, 24.8])

数据格式转换

使用numpy的.astype(np.int8)转换精度，实现低精度储存，节省内存资源

with h5py.File("s100-int8.h5", "w") as opt:
    opt["s100-int8"] = s100.astype(np.int8)

读取h5

with h5py.File("s100.h5", 'r') as f:
    h5_s100 = f["s100"][...]
print(h5_s100) # =>一个二维数组
print(h5_s100.dtype) # =>int64

[...]或者[()] ，代表把所有数据读进内存
把 s100 取出时，HDF5 自我描述可自动把 NumPy 的类型设置好

数据集移动

用复制和删除组合实现。我们把 /home/s100 移动到 /s100

with h5py.File("hzg.h5", "a") as ipt:
    ipt["s100"] = ipt["home/s100"]
    del ipt["home/s100"]

访问属性

使用读出的数据集的.attrs[]方法

attr_value = dataset.attrs['attribute_name']
 print(attr_value)
 # 
遍历所有属性
for key, value in dataset.attrs.items():
 print(f"{key}: {value}")

遍历 HDF5 文件结构

def print_hdf5_structure(name, obj):
    indent = '  ' * name.count('/')
    if isinstance(obj, h5py.Group):
        print(f"{indent}Group: {name}")
    elif isinstance(obj, h5py.Dataset):
        print(f"{indent}Dataset: {name}, shape: {obj.shape}, dtype: {obj.dtype}")

file_path = 'path/to/your/file.h5'
with h5py.File(file_path, 'r') as h5_file:
h5_file.visititems(print_hdf5_structure)

JSON

当数据没有整齐形态，可能伴随有分支、嵌套等时，使用JSON更方便
JSON 借鉴了 Python 字典和列表的语法，与 Python 交互极其方便。
但是 JSON 面向的纯文本数据，与 CSV 类似，对数字的表现力弱。
JSON就是python的字典不断嵌套

JSON读取

open()+ json.load()

import json

with open("BBH_events_v3.json", "r") as ipt:
    events = json.load(ipt)

events此时就是一个python字典对象

JSON写入

with open("BBH_events_rewrite.json", 'w') as opt:
    json.dump(events, opt)

Structured Array数组中的符合数据结构

见python目录下相关文件
复合数组可以直接保存为HDF5的表格，而且可以在vitables或其他数据处理软件中很好地转化为表格

with h5py.File("people.h5", "w") as opt:
    opt['record'] = r

DataFrame表格

alt text

复合数组是逐行保存地，DataFrame是逐列保存的
- 逐行保存有利于逐步积累数据，HDF5是逐行的
- 逐列保存有利于处理数据，parquet是逐列的

CSV表格

读入CSV：
```
data = np.read_csv("文件名", skiprows=4)
```
- 可选参数skiprows：跳过指定的行数
可以把复合数组写到CSV中，形成一个表格。

df_r.to_csv("people.csv")

index=False表示不保存行的标号

读入使用pd.read_csv()

Apache Arrow与Parquet

读写parquet

mydataframe.to_parquet("people.pq",index=False)
pq = pd.read_parquet("people.pq")

pandas读写hdf5

写入hdf5
```
ds.to_hdf("osci.h5", ds_name)
```
- ds_name为数据集的名字

标准的hdf5输出

with h5py.File("osci.h5", "w") as opt:
    # 先转化为复合数组，再写入文件兼容性更好
    opt[ds_name] = ds.to_records(index=False)
    # 写入数据标签
    for k, v in meta.items():
    opt[ds_name].attrs[k] = v

《实验物理的大数据方法》