If you are using Go ad PostgreSQL, and need to performa a bulk import a CSV,
it’s most likely you will find the COPY
protocol is the feature that suits you better.
In that direction, you will find examples using pgx CopyFrom that relies on the native protocol, and it’s fairly easy to use.
However, depending how it’s used you can have an exponencial increase of memory consumption of your application making it unreliable and more expensive to run.
TL;DR;: Don’t load the file in memory and use
pgx.CopyFromRows
. Instead, use theio.Reader
of the file and implement a custompgx.CopyFromSource
.
Checkout the repository with examples and details presented here.
Context
Most of the examples out there, are either using CopyFromRows or CopyFromSlice. However, the big problem is these two options require you to have the entire content in memory to use. This is not a big deal when dealing with small files, there won’t be concurrent usage, or you have infinite memory 😅.
How big is the problem?
Comparing the memory consumption for the two distinct approaches importing a file with ~16MB (1M rows).
Approach/metric | TotalAlloc | Sys | ||
---|---|---|---|---|
Stream file | 61 MiB | 12 Mib | ||
Read entire file | 84 MiB | 58 Mib | ||
+37.70% | +346.15% |
Yes, you read it right! using CopyFromRows
obtained +346.15% of memory from the OS! 58 MiB instead of 12 MiB.
You can read more about the meaning of each metric on https://golang.org/pkg/runtime/#MemStats and find the source code and details of the comparisson on this repository..
What’s the most efficient way for using the COPY protocol with pgx?
Instead of reading the entire in memory, the idea is to stream each line of the file directly to PostgreSQL. In this way, we only need to keep the current line in-memory as opposed to the entire file.
How can we do it?
The CopyFrom method receives an implementation of the interface CopyFromSource.
Let’s implement this interface using a CSV file with the followng three columns: first_name, last_name, and city.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
func newPeopleCopyFromSource(csvStream io.Reader) *peopleCopyFromSource {
csvReader := csv.NewReader(csvStream)
csvReader.ReuseRecord = true // reuse slice to return the record line by line
csvReader.FieldsPerRecord = 3
return &peopleCopyFromSource{
reader: csvReader,
isBOF: true, // first line is header
record: make([]interface{}, len(peopleColumns)),
}
}
type peopleCopyFromSource struct {
reader *csv.Reader
err error
currentCsvRow []string
record []interface{}
isEOF bool
isBOF bool
}
func (pfs *peopleCopyFromSource) Values() ([]any, error) {
if pfs.isEOF {
return nil, nil
}
if pfs.err != nil {
return nil, pfs.err
}
// the order of the elements of the record array, must match with
// the order of the columns in passed into the copy method
pfs.record[0] = pfs.currentCsvRow[0]
pfs.record[1] = pfs.currentCsvRow[1]
pfs.record[2] = pfs.currentCsvRow[2]
return pfs.record, nil
}
func (pfs *peopleCopyFromSource) Next() bool {
pfs.currentCsvRow, pfs.err = pfs.reader.Read()
if pfs.err != nil {
// when get to the end of the file return false and clean the error.
// If it's io.EOF we can't return an error
if errors.Is(pfs.err, io.EOF) {
pfs.isEOF = true
pfs.err = nil
}
return false
}
if pfs.isBOF {
pfs.isBOF = false
return pfs.Next()
}
return true
}
func (pfs *peopleCopyFromSource) Err() error {
return pfs.err
}
You can now use this implementation in the CopyFrom
method. e.g.
1
_, err := pgxConn.CopyFrom(ctx, pgx.Identifier{"people"}, peopleColumns, newPeopleCopyFromSource(csvStream))
Conclusion
Using the CopyFrom
with CopyFromRows
or CopyFrom
will significantly increase the memory comsultion of your application. The high memory usage can bring several problems like OOM errors, increase of costs, unavailability, etc.
By using a custom implementation of CopyFromSource will make your application much more efficient, reliable, and cheaper to ru.
You can find the entire source code of the examples above on this repository. There you will also find more deatils about the comparisson and the not-so-great implementation.