#1753545757


[ dev ]

I recently implemented ThumbHash in my project. I’ve been searching for a solution like this for a while. Initially, I looked into the more popular BlurHash, but I found the implementation too complicated (base83 encoding, etc.), and I wasn’t a fan of rendering via a canvas element. I also had my doubts about performance when loading hundreds of these.

Then I looked into my own implementation based on bitmaps (.bmp) with a static header hardcoded into the frontend for extra compactness. This would be in the same ballpark in terms of bytes over the wire, and it would have the least overhead on the frontend: just concatenate the header with the payload and render it natively in HTML as a base64 data URI.
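
Roughly the idea, sketched in Go (the 8x8 size and the helper names are just for illustration; the real concat would happen in the frontend):

package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// With a fixed 8x8 24-bit thumbnail, the 54-byte BMP header is identical
// for every image, so only the pixel payload has to go over the wire.
const w, h = 8, 8

func bmpHeader() []byte {
	pixels := w * h * 3 // 24 bpp; an 8-pixel row is already 4-byte aligned
	hdr := make([]byte, 54)
	copy(hdr, "BM")
	binary.LittleEndian.PutUint32(hdr[2:], uint32(54+pixels)) // total file size
	binary.LittleEndian.PutUint32(hdr[10:], 54)               // offset to pixel data
	binary.LittleEndian.PutUint32(hdr[14:], 40)               // BITMAPINFOHEADER size
	binary.LittleEndian.PutUint32(hdr[18:], w)                // width
	binary.LittleEndian.PutUint32(hdr[22:], h)                // height (positive = bottom-up)
	binary.LittleEndian.PutUint16(hdr[26:], 1)                // color planes
	binary.LittleEndian.PutUint16(hdr[28:], 24)               // bits per pixel
	return hdr
}

// dataURI prepends the static header and base64-encodes the result into
// something an <img> tag renders natively, no canvas needed.
func dataURI(payload []byte) string {
	return "data:image/bmp;base64," +
		base64.StdEncoding.EncodeToString(append(bmpHeader(), payload...))
}

func main() {
	payload := make([]byte, w*h*3) // bottom-up BGR rows from the thumbnailer
	fmt.Printf("<img src=%q>\n", dataURI(payload))
}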

ThumbHash works in a similar way, using base64 PNG images that are generated on the frontend. While this introduces a bit more overhead compared to my bitmap solution, the result is almost half the size of my bitmaps and offers more flexibility, since it doesn’t require a hardcoded header.

I highly recommend checking out the ThumbHash site if you’re interested.

#1752783987


[ dev ]

I made an unholy one-liner to scrape, download, and upload some data for my current project. It works fine on Linux, but not on my Windows machine :\. It would be better in every way as a bash script, or even a Python script (cross-platform). But what’s the fun in that? This is way cooler. I will probably still rewrite something similar in Python so I can also run it on my Windows machine, but it is good to keep my shell skills sharp.

seq 1 4 | xargs -I% sh -c 'echo "Fetching page %" >&2; curl -s "https://www.last.fm/charts/weekly?page=%"' \
        | pup 'td.weeklychart-name a.weeklychart-cover-link attr{href}' \
        | sed 's|^|https://www.last.fm|' \
        | xargs -I {} sh -c 'echo "Processing artist URL: $1" >&2; curl -s "$1" | pup "h1.header-new-title, div.header-new-background-image"' _ {} \
        | pup 'json{}' \
        | jq '[ .. | objects | select(.tag=="h1" or .itemprop=="image") ] as $list | [ range(0; $list|length; 2) | { name: $list[.].text, img:  $list[. + 1].content } ]' \
        | tee artists.json \
        | jq -r '.[].img' \
        | xargs -n 1 sh -c 'echo "Downloading image: $1" >&2; wget -q "$1"' _ && \
jq -j '.[] | .img |= (split("/")[-1]) | @json, "\u0000"' artists.json \
        | xargs -0 -I{} sh -c '
            img=$(printf "%s" "$1" | jq -r .img)
            name=$(printf "%s" "$1" | jq -r .name)
            echo "Adding $name"
            curl -s -X POST -H "Authorization: Bearer $TOKEN" -F "img=@$img" -F "json={\"name\":\"$name\"}" http://localhost:3000/api/artists/add' _ {}

#1749396915


[ dev ]

I tried Ent again… I know I said that I was done with ORMs, but people were telling me that it shouldn’t be that bad, so I gave it another chance. Again with Ent, because I feel like it’s the only ORM that has a chance of working with everything I want to use it for. So far it is holding up; it hasn’t been a blocker for this project yet. Performance is fine, of course there is overhead, but nothing major (unlike SQLAlchemy, which halves your performance…). It can now handle many-to-many relations with custom join tables without a problem.

I was also able to change some codegen that I didn’t like (omitempty on boolean fields) using a codegen hook, and add a runtime hook for some custom row-level validation (see codeblock). The function signature of the hooks takes some getting used to, but as long as it works, it’s fine by me. I also like the bulk insert API; it is really flexible. So overall I am happily using it, but I am staying watchful for any possible limitations.

func (Image) Hooks() []ent.Hook {
	return []ent.Hook{
		hook.On(
			func(next ent.Mutator) ent.Mutator {
				return hook.ImageFunc(
					func(ctx context.Context, m *gen.ImageMutation) (
						ent.Value, error) {
						// An image must belong to exactly one of the
						// two: a release or an artist (XOR).
						_, hasRelease := m.ReleaseID()
						_, hasArtist := m.ArtistID()
						if hasRelease == hasArtist {
							if hasRelease {
								return nil, fmt.Errorf("image needs a release or an artist, but both were provided")
							}
							return nil, fmt.Errorf("image needs a release or an artist, but neither was provided")
						}
						return next.Mutate(ctx, m)
					})
			},
			ent.OpCreate|ent.OpUpdate|ent.OpUpdateOne,
		),
	}
}
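
For reference, roughly what the codegen hook looks like; this goes in an entc.go, and I am reconstructing the tag rewrite from memory, so treat it as a sketch rather than the exact code:

//go:build ignore

package main

import (
	"log"
	"strings"

	"entgo.io/ent/entc"
	"entgo.io/ent/entc/gen"
	"entgo.io/ent/schema/field"
)

// dropBoolOmitempty rewrites the generated struct tags so boolean
// fields keep showing up in JSON output when they are false.
func dropBoolOmitempty(next gen.Generator) gen.Generator {
	return gen.GenerateFunc(func(g *gen.Graph) error {
		for _, node := range g.Nodes {
			for _, f := range node.Fields {
				if f.Type.Type == field.TypeBool {
					f.StructTag = strings.ReplaceAll(f.StructTag, ",omitempty", "")
				}
			}
		}
		return next.Generate(g)
	})
}

func main() {
	if err := entc.Generate("./schema", &gen.Config{
		Hooks: []gen.Hook{dropBoolOmitempty},
	}); err != nil {
		log.Fatal(err)
	}
}

And the bulk insert API boils down to building a slice of creates and saving them in one shot (SetPath is a made-up field for the example):

// build one create per row, then save them in a single batch
builders := make([]*gen.ImageCreate, len(paths))
for i, p := range paths {
	builders[i] = client.Image.Create().SetPath(p)
}
images, err := client.Image.CreateBulk(builders...).Save(ctx)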

#1747598386


[ dev ]

I have spent far too much time designing a custom ID type for my current project. I wanted to use it as the primary key in a SQLite database, which imposes some constraints. Specifically, the ID must fit within 63 bits, since SQLite only supports signed integers and I want to avoid negative values. (Technically, negative IDs would work, but they’re not ideal.)

You might be thinking, “Why not just use a BLOB as the primary key? That gives you much more flexibility.” And that’s a fair point, but I am intentionally avoiding that because of how SQLite handles its hidden rowid. When you use an integer as the primary key, SQLite internally aliases it to the rowid, which makes operations significantly faster. Using a BLOB would remove that performance advantage and make the database larger.

So the next step is choosing the bit layout. The first bit is unused to prevent negative values. Then I went with 43 bits for a Unix millisecond timestamp. That gives a range of about 278 years, which should be plenty; using the default Unix epoch, it will work until the year 2248. It will outlive me, so that is more than enough.

The remaining 20 bits are random, which gives 1_048_576 possible values. I am using random values because I don’t want to keep track of state (as with an autoincrement), and my current system can handle collisions. It is still possible to swap approaches down the road while keeping the already generated IDs. 1_048_576 sounds like a lot, but by the birthday approximation there is already a 1% chance of a collision after generating just 146 IDs. Then again, those IDs would need to be generated within the same millisecond, and I am not expecting that much volume.

Bit  | 63 (MSB) | 62 ... 20 | 19 ... 0 |        some        1JTWRZPBJ4DSE
-----|----------|-----------|----------|        example     1JTWRZSPA1NBS
Use  | Unused   | Timestamp | Random   |        IDs         1JTWRZTQSE9G6
Size | 1 bit    | 43 bits   | 20 bits  |        ->          1JTWRZVY5R7RT

The reason for using a timestamp in the leading bits is to minimize B-tree rebalances. As time advances, the generated IDs grow in sequence, allowing the B-tree to insert new entries without reorganizing older pages. By contrast, a completely random primary key (like a UUID v4) forces the B-tree to rebalance frequently, which can significantly degrade database performance.

Finally, the string representation: I chose Crockford’s Base32 (without the check digit). Just 13 characters to represent an int64. To me it’s practically perfect from a technical standpoint, and I like how the IDs look. I know aesthetics shouldn’t matter, but this is my project, so I set the rules, and I want things to look cool and have some aesthetic appeal. Looks way better than those stupid UUIDs.
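
Putting it all together, a minimal sketch of the generator and encoder (assuming crypto/rand for the random bits, and ignoring collision handling):

package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"time"
)

// Crockford's Base32 alphabet (no I, L, O, U).
const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// NewID packs a 43-bit Unix-millisecond timestamp and 20 random bits
// into the low 63 bits of an int64; the sign bit stays 0.
func NewID() int64 {
	ts := uint64(time.Now().UnixMilli()) & (1<<43 - 1)
	var buf [4]byte
	if _, err := rand.Read(buf[:]); err != nil {
		panic(err)
	}
	random := uint64(binary.BigEndian.Uint32(buf[:])) & (1<<20 - 1)
	return int64(ts<<20 | random)
}

// Encode renders an ID as 13 Crockford Base32 characters
// (13 * 5 = 65 bits, so the first character only carries 4 bits).
func Encode(id int64) string {
	u := uint64(id)
	var out [13]byte
	for i := 12; i >= 0; i-- {
		out[i] = alphabet[u&31]
		u >>= 5
	}
	return string(out[:])
}

func main() {
	id := NewID()
	fmt.Println(id, Encode(id))
}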

One final note: please store your IDs in a binary representation (BLOB or integer). It hurts me every time I see an ID stored as its string representation. It is way slower and wastes storage. It mainly happens with UUIDs; most people don’t realize a UUID actually is a binary ID and not a string. Even the spec states it, but I guess people just don’t read it.

#1740225625


[ dev ]

Turns out that HTTP/0.9 is a thing. No headers, no status codes, and GET is the only method. It is so simple you can easily make a request by hand. Firefox apparently still supports it.

echo -e "GET /\r" | nc pine32.be 80

#1738273490


[ dev ]

Still alive, just busy. First post of the year… yay I guess.

I have been working on my fork of Navidrome that will add some audio and music analysis with the help of Essentia; the MVP is almost done. Because fuck Spotify, I am done with their bullshit. But I still want my cool data, so I am making it myself, all open source of course.

Navidrome is written in Golang, but most analysis libraries are written in C/C++ or Python, so gRPC was the solution, because I am not writing a wrapper. It was my first time working with gRPC, and it’s really nice once you get it all set up. But the setup can be a pain. Like, WTF are .pyi files? I had never seen them before. They are interface files so you have your types available (this is because Python Protobuf does some weird runtime C thing that processes the proto files, or something). Apart from the setup, it has been nice to use. The end-to-end types are just a huge plus: you just know that they will work.

Python multiprocessing has also been a journey. I tested all types of pools, but they always just deadlocked or didn’t run. In the end I made my own worker pool, with each worker in its own process, sharing a multiprocessing-safe queue for input and one for output. Sort of like what I would do in Golang, with channels instead of queues. And this dead simple approach worked on the first try, of course, after everything else I tried. But I am just happy that it works now. It still feels weird to see Python use almost 100% of your CPU on all cores.
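
For reference, the Golang shape I was mimicking, sketched with a stand-in analyze function:

package main

import (
	"fmt"
	"sync"
)

// analyze stands in for the real per-track analysis work.
func analyze(path string) string {
	return "features for " + path
}

// startPool fans jobs out to n workers and closes results when all jobs
// are done: the same shape as the Python pool (one process per worker,
// shared input/output queues), except with goroutines and channels.
func startPool(n int, jobs <-chan string, results chan<- string) {
	var wg sync.WaitGroup
	for range n {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- analyze(j)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()
}

func main() {
	jobs := make(chan string)
	results := make(chan string)
	startPool(4, jobs, results)
	go func() {
		for _, t := range []string{"a.flac", "b.flac"} {
			jobs <- t
		}
		close(jobs)
	}()
	for r := range results {
		fmt.Println(r)
	}
}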

One year anniversary coming up for MB.

#1731188466


[ dev | meta_raid ]

I am making a scraper for Spotify metadata. My testing numbers indicate that I could scrape 100% of Spotify in less than a week; something feels wrong.

INFO Stats per minute id=0 request=204 tracks=2451
INFO Stats per minute id=2 request=193 tracks=2086
INFO Stats per minute id=1 request=212 tracks=2392

#1730630725


[ dev | golang | meta_raid ]

Golang’s new iterators came in handy for request pagination. I know the code is not optimal, but it is very readable, and it is just for a proof of concept. I am trying to get Spotify metadata in bulk. Hopefully I won’t get IP banned, fingers crossed.

for chunk := range slices.Chunk(allSimpleTracks, 100) {
	ids := make([]spotify.ID, len(chunk))
	for i, a := range chunk {
		ids[i] = a.ID
	}

	// Audio features allow 100 IDs per request.
	f, err := client.GetAudioFeatures(ctx, ids...)
	if err != nil {
		return nil, err
	}

	// Full tracks only allow 50 per request, so chunk again.
	fullTracks := make([]*spotify.FullTrack, 0, len(ids))
	for subChunk := range slices.Chunk(ids, 50) {
		full, err := client.GetTracks(ctx, subChunk, spotify.Limit(50))
		if err != nil {
			return nil, err
		}
		fullTracks = append(fullTracks, full...)
	}
	for i := range len(ids) {
		allTracks = append(allTracks, &FullerTrack{
			Track:    fullTracks[i],
			Features: f[i],
		})
	}
}

#1728318672


[ dev | corap ]

Apart from a few minor glitches (which have been fixed), the scraper and scheduler run fine. The frontend is also coming along nicely; it needs a few more pages and then the CSS. I also need to figure out how to load a dynamic number of columns from a materialized view. It shouldn’t be too hard, but I want to make it fault tolerant (see the sketch below). I don’t know what I want to do regarding design, but I know somebody who might want to help me, fingers crossed.
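
My current idea for the dynamic columns, assuming pgx v5 and the device_analysis_summary view; just a sketch, not battle-tested:

// assumes github.com/jackc/pgx/v5
//
// loadSummary reads every row of the view into generic maps, so the
// template layer can render whatever columns happen to exist.
func loadSummary(ctx context.Context, conn *pgx.Conn) ([]map[string]any, error) {
	rows, err := conn.Query(ctx, "SELECT * FROM device_analysis_summary")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// The column set depends on the analyses, so read the names from
	// the row descriptions instead of hardcoding them.
	cols := rows.FieldDescriptions()
	var out []map[string]any
	for rows.Next() {
		vals, err := rows.Values()
		if err != nil {
			return nil, err
		}
		row := make(map[string]any, len(cols))
		for i, fd := range cols {
			row[fd.Name] = vals[i] // devices without an analysis give nil
		}
		out = append(out, row)
	}
	return out, rows.Err()
}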

#1727195369


[ dev | corap ]

The Python (scraper) rewrite is done. Almost no dependencies now. Reduced the Docker image from 1.2 GB to less than 100 MB. Feels a lot better to update and modify, too. Now time for the frontend webserver.

beautifulsoup4==4.12.3
requests==2.32.3
python-dotenv==1.0.1
psycopg==3.2.2
psycopg-binary==3.2.2

#1726927465


[ dev | corap | database ]

The amount of cursed SQL that I am writing just to keep it all in pure SQL… It would be way faster to just build the query in Python. Anyway, the Corap rewrite is coming along nicely.

-- crosstab() requires the tablefunc extension.
DO $$
DECLARE
    cols text;
    query text;
BEGIN
    SELECT string_agg(quote_ident(name) || ' text', ', ')
    INTO cols
    FROM (
        SELECT name
        FROM (SELECT DISTINCT name, priority FROM device_analyses) AS o
        ORDER BY priority DESC
    ) AS o;

    BEGIN
        EXECUTE 'DROP MATERIALIZED VIEW IF EXISTS device_analysis_summary';
        query := format('
            CREATE MATERIALIZED VIEW device_analysis_summary AS
            SELECT *
            FROM crosstab(
                ''SELECT d.deveui, da.name, da.value
                FROM devices d
                LEFT JOIN device_analyses da ON d.deveui = da.device_id
                ORDER BY d.deveui, da.name'',
                ''SELECT name
                FROM (SELECT DISTINCT name, priority FROM device_analyses) AS o
                ORDER BY priority DESC''
            ) AS ct(deveui text, %s);
        ', cols);
        EXECUTE query;
    EXCEPTION
        -- The exception handler already rolls the inner block back; an
        -- explicit ROLLBACK is not allowed here.
        WHEN OTHERS THEN
            RAISE NOTICE 'Error creating materialized view: %', SQLERRM;
            RETURN;
    END;
END $$;

#1726662585


[ dev | corap ]

Time to finally rewrite Corap, starting with the scheduler. The current Docker image is more than 1 GB; going to remove a lot of dependencies. Also going to rewrite the frontend, I learned a lot about Golang sins from starting that project.

#1720948971


[ dev | rant ]

I am officially done with ORMs. My latest experiment was Ent, a codegen-based ORM for Golang. It works fine, I like the API, and then you want to do something slightly complex and it just doesn’t work. I wanted a many-to-many with extra data in the join table; so far so good, this did work. Until I wanted to make it non-unique. I needed this because, in my current project, I wanted to add one track multiple times to a playlist. But this was not allowed; the codegen would not build. Other people have the same issue, but no solution is known. So my solution is to rewrite my code again, this time with pgx. I have also tried and used sqlc in some projects, but it won’t scale for my current project. I do like it a lot for smaller projects though; this blog uses it, for example.

I have tried a lot of ORMs over the years, but I am finally done, not a chance. They are great until they are not, and then they are just a pain.