Deleting documents in CouchDB for real

Posted on Feb 28, 2023

Recently, while syncing a CouchDB database to a new cluster, we noticed a massive number of deleted documents being synced. This was completely unnecessary for our migration and slowed things down drastically.

I started looking into the proper way to clean up, purge, and compact a CouchDB database and was surprised that it was actually quite hard (and at first illogical) to really remove old, deleted documents.

Revisions and deleted documents

Whenever a document gets updated (even if unchanged), a new revision gets created. When updating a doc, the caller must specify the revision it believes to be the current one:

$ curl -X PUT -H "Content-Type: application/json" $URL/learnings/doc1 -d '{}'
{"ok":true,"id":"doc1","rev":"1-967a00dff5e02add41819138abb3284d"}
$ curl -X PUT -H "Content-Type: application/json" $URL/learnings/doc1 -d '{"_rev": "1-967a00dff5e02add41819138abb3284d", "a":1}'
{"ok":true,"id":"doc1","rev":"2-4f54ab3740f3104eec1cf2ec2b0327ed"}
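As the responses show, a revision ID is a generation number, a dash, and a hash. A minimal Python sketch for taking such an ID apart follows; the names `Revision`, `parse_rev` and `is_direct_successor` are mine and not part of any CouchDB client.

```python
from typing import NamedTuple


class Revision(NamedTuple):
    generation: int  # position in the revision chain (1, 2, 3, ...)
    digest: str      # hash part after the first dash


def parse_rev(rev: str) -> Revision:
    """Split a revision ID such as '2-4f54...' into its two parts."""
    generation, sep, digest = rev.partition("-")
    if not sep or not generation.isdigit() or not digest:
        raise ValueError(f"not a revision ID: {rev!r}")
    return Revision(int(generation), digest)


def is_direct_successor(old_rev: str, new_rev: str) -> bool:
    """True when new_rev is exactly one generation after old_rev."""
    return parse_rev(new_rev).generation == parse_rev(old_rev).generation + 1


print(parse_rev("2-4f54ab3740f3104eec1cf2ec2b0327ed").generation)
```

Comparing generations only tells you how far apart two revisions are in the chain; it says nothing about conflicts, which is why CouchDB insists on the full ID.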

Deleting a document can be done in two ways: using the DELETE method, or by simply setting _deleted to true in a regular update:

$ curl -X PUT -H "Content-Type: application/json" $URL/learnings/doc1 -d '{"_rev": "2-4f54ab3740f3104eec1cf2ec2b0327ed", "_deleted": true}'
{"ok":true,"id":"doc1","rev":"3-14d49873da88656c1027ec23cc432ee1"}
$ curl $URL/learnings/doc1
{"error":"not_found","reason":"deleted"}
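The transcript above uses the _deleted variant. For the DELETE method the current revision goes into a rev query parameter instead. Here is a rough Python sketch that only builds such a request; `build_delete_request` is a hypothetical helper and the localhost URL is an assumption standing in for $URL.

```python
import urllib.parse
import urllib.request


def build_delete_request(base_url: str, db: str, doc_id: str, rev: str) -> urllib.request.Request:
    """Prepare DELETE /{db}/{doc_id}?rev=... without sending it."""
    path = f"{urllib.parse.quote(db, safe='')}/{urllib.parse.quote(doc_id, safe='')}"
    query = urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(f"{base_url.rstrip('/')}/{path}?{query}", method="DELETE")


# Assumed local server; the revision is the one returned by the second PUT.
req = build_delete_request("http://localhost:5984", "learnings", "doc1",
                           "2-4f54ab3740f3104eec1cf2ec2b0327ed")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it.
```

Either way the result is the same: a new revision that marks the document as deleted.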

When deleting, you can choose to preserve the body or to clear it entirely. Either way, the deletion generates a new revision, so the document still exists. And since that deleted revision (the tombstone) is the latest one, it will not be cleaned up by _compact.

We can verify this!

$ curl -H "Content-Type: application/json" "$URL/learnings/doc1?revs=true&open_revs=all" 
--75d3fcc52318112f03135564aeaa66c7
Content-Type: application/json

{"_id":"doc1","_rev":"3-14d49873da88656c1027ec23cc432ee1","_deleted":true,"_revisions":{"start":3,"ids":["14d49873da88656c1027ec23cc432ee1","4f54ab3740f3104eec1cf2ec2b0327ed","967a00dff5e02add41819138abb3284d"]}}
--75d3fcc52318112f03135564aeaa66c7--

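(The response is multipart/mixed rather than a plain JSON body because the request does not ask for JSON; sending an Accept: application/json header should return a JSON array instead.) It does show all individually known revisions: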

"_revisions":{
  "start":3,
  "ids":[
    "14d49873da88656c1027ec23cc432ee1",
    "4f54ab3740f3104eec1cf2ec2b0327ed",
    "967a00dff5e02add41819138abb3284d"
  ]
}
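The _revisions field is a compact form: start is the generation of the newest entry and ids lists the hashes from newest to oldest. A small Python sketch that expands it back into full revision IDs (the function name `full_revision_ids` is my own):

```python
import json


def full_revision_ids(doc: dict) -> list[str]:
    """Turn the compact _revisions field into full 'N-hash' IDs, newest first."""
    revisions = doc["_revisions"]
    start = revisions["start"]
    # Each step down the list is one generation older.
    return [f"{start - offset}-{digest}" for offset, digest in enumerate(revisions["ids"])]


# The JSON part of the response shown above.
doc = json.loads(
    '{"_id":"doc1","_rev":"3-14d49873da88656c1027ec23cc432ee1","_deleted":true,'
    '"_revisions":{"start":3,"ids":["14d49873da88656c1027ec23cc432ee1",'
    '"4f54ab3740f3104eec1cf2ec2b0327ed","967a00dff5e02add41819138abb3284d"]}}'
)
for rev in full_revision_ids(doc):
    print(rev)
```

This prints the three revision IDs from the earlier PUT responses, starting with the deletion.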

It shows the latest revision, which is actually the deleted document, but we can also retrieve the document as it was before the deletion:

$ curl -s -H "Content-Type: application/json" "$URL/learnings/doc1?rev=3-14d49873da88656c1027ec23cc432ee1" | jq .
{
  "_id": "doc1",
  "_rev": "3-14d49873da88656c1027ec23cc432ee1",
  "_deleted": true
}
$ curl -s -H "Content-Type: application/json" "$URL/learnings/doc1?rev=2-4f54ab3740f3104eec1cf2ec2b0327ed" | jq .
{
  "_id": "doc1",
  "_rev": "2-4f54ab3740f3104eec1cf2ec2b0327ed",
  "a": 1
}

Compaction removes the bodies of older revisions, but the revision IDs themselves stay in the document's history. They are only dropped once the history grows beyond _revs_limit, which currently defaults to 1000:

$ curl -s -H "Content-Type: application/json" $URL/learnings/_revs_limit 
1000

Running compaction now will remove the body of the older revision, but CouchDB will still remember the revision ID itself:

$ curl -s -X POST -H "Content-Type: application/json" $URL/learnings/_compact
{"ok":true}

(give it some time)
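Instead of guessing how long to wait, you can poll the database info (GET $URL/learnings), which includes a compact_running flag. A minimal Python sketch, assuming a caller-supplied `fetch_db_info` function (hypothetical) that returns that JSON as a dict:

```python
import time
from typing import Callable


def wait_for_compaction(fetch_db_info: Callable[[], dict],
                        poll_seconds: float = 1.0,
                        timeout_seconds: float = 300.0) -> bool:
    """Poll database info until compact_running is false.

    fetch_db_info is a hypothetical callable returning the parsed JSON of
    GET $URL/learnings. Returns False if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if not fetch_db_info().get("compact_running", False):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Passing the fetch function in keeps the loop independent of whichever HTTP client you use, and makes it easy to try out with a fake.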

$ curl -s -H "Content-Type: application/json" "$URL/learnings/doc1?rev=2-4f54ab3740f3104eec1cf2ec2b0327ed" | jq .
{
  "error": "not_found",
  "reason": "missing"
}

$ curl -H "Content-Type: application/json" "$URL/learnings/doc1?revs=true&open_revs=all" 
--cdbab51fa25e0ef489bceb87773ab58f
Content-Type: application/json

{"_id":"doc1","_rev":"3-14d49873da88656c1027ec23cc432ee1","_deleted":true,"_revisions":{"start":3,"ids":["14d49873da88656c1027ec23cc432ee1","4f54ab3740f3104eec1cf2ec2b0327ed","967a00dff5e02add41819138abb3284d"]}}
--cdbab51fa25e0ef489bceb87773ab58f--

The body of revision 2 is gone, but CouchDB still knows the revision IDs, and still knows the document was once there and has been deleted. Only when we hit the _revs_limit will it start to forget about older revs:

$ curl -X PUT -d "1" -H "Content-Type: application/json" "$URL/learnings/_revs_limit" 
{"ok":true}
$ curl -s -X POST -H "Content-Type: application/json" $URL/learnings/_compact
{"ok":true}
$ curl -H "Content-Type: application/json" "$URL/learnings/doc1?revs=true&open_revs=all" 
--4cf9cd18472d65d05da0433afc903d8b
Content-Type: application/json

{"_id":"doc1","_rev":"3-14d49873da88656c1027ec23cc432ee1","_deleted":true,"_revisions":{"start":3,"ids":["14d49873da88656c1027ec23cc432ee1"]}}
--4cf9cd18472d65d05da0433afc903d8b--
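What the lowered _revs_limit did to the history can be pictured with a toy model in Python. This only illustrates the observable effect on the ids list; it is not how CouchDB implements it, and `stem_history` is a made-up name.

```python
def stem_history(rev_ids: list[str], revs_limit: int) -> list[str]:
    """Keep only the newest revs_limit entries of a newest-first history."""
    if revs_limit < 1:
        raise ValueError("revs_limit must be at least 1")
    return rev_ids[:revs_limit]


history = [
    "14d49873da88656c1027ec23cc432ee1",  # generation 3, the deletion
    "4f54ab3740f3104eec1cf2ec2b0327ed",  # generation 2
    "967a00dff5e02add41819138abb3284d",  # generation 1
]
print(stem_history(history, 1000))  # default limit: nothing is forgotten
print(stem_history(history, 1))     # limit 1: only the newest entry survives
```

Note that the newest entry is never dropped, which is exactly why the tombstone survives.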

So now it did “forget” the older revisions, but the deleted doc is still there:

$ curl -s -H "Content-Type: application/json" "$URL/learnings/doc1?rev=3-14d49873da88656c1027ec23cc432ee1" | jq .
{
  "_id": "doc1",
  "_rev": "3-14d49873da88656c1027ec23cc432ee1",
  "_deleted": true
}

The only proper way to truly get rid of the document is by explicitly _purge-ing it. This means specifying the document ID and the revision(s) to purge:

$ curl -s -d '{"doc1": ["3-14d49873da88656c1027ec23cc432ee1"]}' -X POST -H "Content-Type: application/json" $URL/learnings/_purge
{"purge_seq":null,"purged":{"doc1":["3-14d49873da88656c1027ec23cc432ee1"]}}

$ curl -s -H "Content-Type: application/json" "$URL/learnings/doc1?rev=3-14d49873da88656c1027ec23cc432ee1" | jq .
{
  "error": "not_found",
  "reason": "missing"
}
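The purge body is just a JSON object mapping each document ID to the list of revisions to purge. The same call could be prepared in Python roughly like this; `build_purge_request` is a hypothetical helper and the localhost URL is an assumption standing in for $URL.

```python
import json
import urllib.request


def build_purge_request(db_url: str, doc_revs: dict[str, list[str]]) -> urllib.request.Request:
    """Prepare POST {db_url}/_purge with a body mapping doc IDs to revisions."""
    body = json.dumps(doc_revs).encode("utf-8")
    return urllib.request.Request(
        f"{db_url.rstrip('/')}/_purge",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server; doc ID and revision are the ones purged above.
req = build_purge_request(
    "http://localhost:5984/learnings",
    {"doc1": ["3-14d49873da88656c1027ec23cc432ee1"]},
)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
# urllib.request.urlopen(req) would send it; a purge cannot be undone.
```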

But we shouldn’t!

There’s a reason CouchDB keeps the deleted docs and revisions around: replication! Replication means CouchDB also needs to synchronize deletes, which may even cause conflicts, for example when one node deletes a document that another node updates.

This is why, in general, you shouldn’t bother too much with deleting documents. Compaction is usually sufficient: it will get rid of the full document bodies and just keep track of the revision history.

But what if you really want to or need to?

If you’re sure all your nodes/clusters are in sync, you can forcefully remove documents. There are scripts that can do this for you.
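To give an idea of what such a script does, here is a sketch of its core in Python. It takes an already-parsed _changes response, picks out the deleted entries, and splits them into batches that could each be sent as a _purge body. The function names and the batch size are my own assumptions; fetching _changes and posting to _purge are left out. Requesting _changes with style=all_docs should list every leaf revision rather than only the winning one, which matters if there are conflicts.

```python
from typing import Iterator


def collect_tombstones(changes: dict) -> dict[str, list[str]]:
    """Map each deleted doc ID in a _changes response to its listed revisions."""
    tombstones: dict[str, list[str]] = {}
    for row in changes.get("results", []):
        if row.get("deleted"):
            tombstones[row["id"]] = [change["rev"] for change in row["changes"]]
    return tombstones


def batches(doc_revs: dict[str, list[str]], size: int) -> Iterator[dict[str, list[str]]]:
    """Split the mapping into chunks of at most `size` documents per purge call."""
    items = list(doc_revs.items())
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])


# Hand-written sample in the shape of a _changes response (values are made up).
sample = {
    "results": [
        {"seq": "1-x", "id": "doc1", "changes": [{"rev": "3-14d4"}], "deleted": True},
        {"seq": "2-x", "id": "doc2", "changes": [{"rev": "1-abcd"}]},
    ],
    "last_seq": "2-x",
}
for body in batches(collect_tombstones(sample), size=50):
    print(body)
```

Keeping each request small is a precaution on my part; check your server's purge limits before picking a batch size.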

Alternatively, you can organize your databases in such a way that you can delete entire databases to get rid of all your (deleted) documents. And if there’s data to preserve, you can choose to sync to a new database, but with a filter that filters out deleted documents.

This filter can also be used to migrate to a new cluster, skipping deleted documents in the process.
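Such a filter lives in a design document on the source database and is referenced from the replication document. A sketch of both documents as Python dicts follows; the design document name `migration`, the filter name `skip_deleted` and the example URLs are assumptions for illustration only.

```python
import json

# JavaScript filter run by CouchDB for every change; returning false skips it.
SKIP_DELETED_FILTER = "function(doc, req) { return !doc._deleted; }"


def filter_design_doc(ddoc_name: str = "migration", filter_name: str = "skip_deleted") -> dict:
    """Design document holding the filter; store it in the source database."""
    return {"_id": f"_design/{ddoc_name}", "filters": {filter_name: SKIP_DELETED_FILTER}}


def replication_doc(source: str, target: str,
                    ddoc_name: str = "migration", filter_name: str = "skip_deleted") -> dict:
    """Replication document for the _replicator database (or POST /_replicate)."""
    return {
        "source": source,
        "target": target,
        "create_target": True,
        "filter": f"{ddoc_name}/{filter_name}",
    }


print(json.dumps(filter_design_doc(), indent=2))
print(json.dumps(replication_doc("http://old:5984/learnings", "http://new:5984/learnings"), indent=2))
```

Keep in mind that a replication filtered like this never passes deletions on, so it suits a one-off copy into an empty target better than an ongoing sync.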

Summarized

Explicitly removing deleted documents (through _purge) is discouraged. Compacting using _compact is fine, but CouchDB should be able to keep track of revisions and deleted documents to function properly. It’s also fine to leave the _revs_limit at its (seemingly high) default value.

If you really want to clean up, consider syncing to a new (remote) database with a filter that skips the deleted docs, or use a script that explicitly purges, if you’re sure it’s safe!