
Retrochallenge 2018/09


This time I am finally sitting down and starting to write the operating system for the Cat-644 (external link to hackaday.io). The hardware is a 20 MHz 8-bit homebrew computer built around an ATmega644 microcontroller, and features VGA video output. Retrochallenge main page





Cat-644 Hardware


The Cat-644 is a computer I've been developing around an ATmega644 microcontroller.





Wait, this computer project is only 5 years old, and it's made with a microcontroller variant released in the 21st century. Does it not qualify?


Pre-competition Warm-up

Before the competition starts, I need to verify that the hardware and its software 'demo' programs still work, and that my development environment in AVR Studio is in a good state. I may find and work out bugs in the 16-bit emulated CPU or the hardware drivers. I won't start the new cat-os github repo until September 1st, nor will I attempt to wrap any of the hardware drivers into an OS-like framework until September 1st. This is just ensuring I have all my hardware and existing software base ready.

Scope for Competition

My goal is to come up with a simple OS that can (in this order):

  • Run an interpreted program that has been copied from flash to SRAM.
  • Read and write "files" from the SD card. (Even if it's just a 'raw' sector hex editor to start with.)
  • Read a program from the SD card and run it. At this point, I would call it a complete 'computer', since it can theoretically be extended and bootstrapped by itself by sitting in front of it and writing hex code by hand. The 'minimal viable product' of a computer. Heck, writing hex thru a keyboard is still easier than COSMAC or Altair binary switches. Though, I would really just Hyperterminal 'PASTE' it thru the serial port!

It looks like a lot, but all the hardware drivers are already written. They just need to be organized into a framework. For example, reading the keyboard, reading from a file, or reading the serial port should all use the same interface. Right now they are readkey(), serin() and ps2key(), but it should be getchar(int devicenum);
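A minimal sketch of that idea, using a hypothetical dispatch table (getchar_dev, the device numbers, and the drivers' exact signatures are my assumptions, not the real code):

	/* Hypothetical dispatch over the existing drivers.
	   readkey() and serin() are assumed to return the next byte, or -1. */
	extern int readkey(void);        /* existing keyboard driver */
	extern int serin(void);          /* existing serial driver */

	typedef int (*getc_fn)(void);

	enum { DEV_KEYBOARD, DEV_SERIAL, NUM_DEVICES };

	static const getc_fn dev_getc[NUM_DEVICES] = { readkey, serin };

	int getchar_dev(int devicenum)   /* the getchar(int devicenum) idea */
	{
	    return dev_getc[devicenum]();
	}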



    This is the first powerup in probably about a year. The VGA output demo still runs. This is where I'll leave it until competition starts.



    The "scanlines" effect is not a goofy attempt at mimicing a bad CRT monitor. The default video mode of the Cat644 is 256x240. This uses the 640x480 VGA timings, which means every line is doubled. Since this computer generates the screen using the CPU, over 90% of the CPU is taken up by this task. This is "slow" mode. If only every other line is drawn, you get the scanline effect, no loss in screen resolution, and 50% of the CPU is recovered. This is "fast" mode.




    KittyOS

    September 3, 2018


    I am basing this on some previous work from the first time I was writing software for this computer. I started making a framework called catos, but it didn't really work out; I never stuck with the project. This time around, I am starting over, copying and pasting some of it into a new project called KittyOS.


    So far, this is what is working: Read/write the serial port thru a chardevice structure.


    Unix is based around the idea that everything is a file. Devices are just 'magic files' where reading and writing causes I/O to happen. I am flipping this around: in KittyOS, everything is a device. Part of the reason is that I may want to use KittyOS in other AVR projects. There may not be a filesystem available, and it doesn't make sense to emulate one just for the sake of being like Unix. Yes, there are many embedded Linux devices with no physical filesystem that run out of initrd or initramfs, but that is only because in Unix you need a filesystem, because everything is a file. The AVR has 4k of memory and 64k of flash, and I don't want to fill that up with a fake filesystem. In KittyOS, all you need are devices to read and write. There are a few different classes of devices:

    Basic device: A basic device has one function: ioctl. Ioctl passes 2 integers to a device: a command number and a value. This might mean different things to different devices. For a serial port, it might set the baud rate. For a video card, it might change the video mode. Maybe it initializes an SPI bus, or sets the mode on a temperature sensor from degrees F to degrees C, or maybe it changes the color of an RGB LED. It's just a basic command channel.
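    As a sketch (all names here are hypothetical, not the real KittyOS definitions), a basic device could be little more than a struct holding one function pointer:

    	/* Hypothetical basic device: just a command channel. */
    	struct device {
    	    int (*ioctl)(struct device *dev, int cmd, int value);
    	};

    	/* e.g. serial.ioctl(&serial, CMD_SET_BAUD, 9600);  (made-up command) */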

    Char device: A character device supports everything a basic device does, but adds serialized reading and writing. These might be 'files', serial ports, sockets, keyboards, mice, anything.
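    Continuing the sketch, a chardevice might embed the basic device and add the two streaming calls (again, assumed names and signatures):

    	/* Hypothetical chardevice: a basic device plus serialized I/O. */
    	struct chardevice {
    	    struct device dev;                    /* still supports ioctl */
    	    int  (*read)(struct chardevice *cd);  /* next byte, or -1 if none */
    	    void (*write)(struct chardevice *cd, unsigned char c);
    	};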



    Block device: A device that reads or writes data in chunks, such as a disk. This is a little different from a Linux block device. Linux block devices are still 'files' and can be read or written a single byte at a time; the OS uses a block caching layer to actually read and write whole blocks. That is not supported in KittyOS. Reading or writing a block device is done, with random access, one whole block at a time. This maps cleanly to the sdcard read and write sector commands, and probably most disk devices.
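    In the same hypothetical style, a block device would expose whole-block transfers only, which lines up with the SD card sector commands:

    	/* Hypothetical blockdevice: random access, one whole block at a time. */
    	struct blockdevice {
    	    struct device dev;
    	    unsigned int blocksize;               /* e.g. 512 for an SD sector */
    	    int (*read_block) (struct blockdevice *bd, unsigned long blocknum, void *buf);
    	    int (*write_block)(struct blockdevice *bd, unsigned long blocknum, const void *buf);
    	};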

    File device: A file on a disk (in a filesystem) is just a chardevice that supports two additional functions: seek and pos. These reposition a read/write stream, or tell you where it is. The act of opening a file on a disk creates a file chardevice for the user to interact with. Instead of treating a virtual serial port as a file, KittyOS treats a file as a virtual serial port.
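    Sketching that as well: a file device is just the chardevice above plus the two positioning calls (hypothetical signatures):

    	/* Hypothetical filedevice: a chardevice that can also seek and tell. */
    	struct filedevice {
    	    struct chardevice cd;                 /* read/write/ioctl as usual */
    	    void          (*seek)(struct filedevice *fd, unsigned long offset);
    	    unsigned long (*pos) (struct filedevice *fd);
    	};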

    Selector device: A device that finds, lists, organizes or creates other devices. It supports the call 'subdev', which, given a name and some flags, returns another device to use. It will also support another call, 'list', which returns a list of available subdevice names. Opening a file on disk creates a chardevice that represents the contents of that file. On a disk, a selector device might list other selector devices. (Hint: it's a directory.)
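    A possible shape for those two calls, with guessed signatures (the real interface may differ, especially how 'list' returns names on a 4k-RAM machine):

    	/* Hypothetical selector device: hands out other devices by name. */
    	struct selectordevice {
    	    struct device dev;
    	    struct device *(*subdev)(struct selectordevice *sd,
    	                             const char *name, int flags);
    	    /* enumerate one name per call into a caller buffer; -1 when done */
    	    int (*list)(struct selectordevice *sd, int index,
    	                char *name, int maxlen);
    	};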

    The SDCard filesystem will be organized like this:

    The above seems like a lot of layers, but I still think the ATmega will mostly be held up waiting for data on the bus. I do have some sdcard routines that directly read/write SPI, but I want to try making them go thru a chardevice wrapper, the same way the serial routines currently do. Why? This is closer to how a modern OS might be structured. It's also very powerful to abstract things like this. If reading/writing SPI is done thru the chardevice, then what keeps me from simulating SPI devices in software, or debugging what is being sent to an SPI device by sending it over the serial port to my PC? Suppose my SPI bitrate is 2 MHz. Then on a 20 MHz ATmega, there are 10 instructions per bit, so 80 per byte. 80 instructions is enough to traverse the chardevice struct pointer to the function that puts the next byte on the SPI bus.


    But, SPI isn't a regular chardevice, is it? SPI doesn't read and write separately like a serial port, but exchanges bytes: you always read and write at the same time. In the past I did some experiments with treating the SPI as a serial port with only a read and a write function. This was easy by keeping a 1-byte buffer:
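    The original snippet isn't reproduced here, but from the description, the buffering probably looked something like this sketch (spi_exchange stands in for the real SPI transfer routine):

    	/* Hypothetical SPI-as-serial wrapper using a 1-byte buffer.
    	   spi_exchange() is assumed to clock one byte out and one byte in. */
    	extern unsigned char spi_exchange(unsigned char out);

    	static unsigned char spi_buf;
    	static unsigned char spi_buf_full;

    	void spi_write(unsigned char c)
    	{
    	    spi_buf = spi_exchange(c);    /* remember the byte that came back */
    	    spi_buf_full = 1;
    	}

    	unsigned char spi_read(void)
    	{
    	    if (spi_buf_full) {           /* write-then-read: return the exchange */
    	        spi_buf_full = 0;
    	        return spi_buf;
    	    }
    	    return spi_exchange(0xFF);    /* read-only: clock out a "don't care" */
    	}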


    This fits all the use cases of SPI I've come across so far. Either you write and then immediately read, to do a byte exchange. Or you write several bytes in a row, ignoring the return values. Or you read several bytes in a row, clocking out "don't care" values. This also maps well to how the AVR hardware works.



    VGA Rewrite

    September 6, 2018


    I spent the last two evenings rewriting the VGA driver. The VGA signal is bitbanged by the AVR, and all the details are here in this post from 4(!) years ago.


    I am rewriting the driver because the old one stopped working. In the time since I last worked on the cat644, I switched from Atmel Studio 6 to Atmel Studio 7. My carefully ordered C code doesn't have the right timings anymore; C was never meant for cycle counting anyway. The new driver is in assembly. The old driver had the disadvantage of not being full width: I never managed to cram all 256 pixels into a scanline; the routine always needed to abort early. A complete rewrite of the video interrupt was long overdue.


    The original C code supported 2 video modes: slow and fast. Slow mode draws every scanline, line-doubling the framebuffer. Fast mode draws every other scanline, giving more CPU back to the user program. I'm doing the same thing here, except the video mode is changed via a function pointer. The user can select which pointer is used for the video driver.
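    A sketch of the mechanism in C, with hypothetical names (in the real driver the interrupt follows the pointer in assembly):

    	/* Hypothetical function-pointer video mode selection. */
    	typedef void (*scanline_fn)(void);

    	extern void scanline_slow(void);  /* draws every line (line-doubled) */
    	extern void scanline_fast(void);  /* draws every other line */

    	static volatile scanline_fn vga_mode = scanline_slow;

    	void vga_set_fast(void) { vga_mode = scanline_fast; }
    	void vga_set_slow(void) { vga_mode = scanline_slow; }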


    Currently there is no support for the keyboard. The first cat1 driver for the keyboard was interrupt driven. But that interrupt conflicted with the video interrupt, causing timing shifts and an unstable image whenever a key was pressed. The solution I ended up using was having the video interrupt poll the keyboard port every scanline; if there was a change on the ps2clk pin, the input bit would be processed at the end of the scanline. This processing took so long that there was no time to restore CPU state and return to the user program, so instead the interrupt handled the keyboard, waited until the next scanline was due, and then drew it immediately. I can probably use that same approach here. However, that bakes keyboard code into the video driver. I am considering adding a 'timer' interface instead, where multiple timer functions can be called in different priority orders. When the interrupt fires, the first thing to run is the horizontal sync code. Then, if it's a vertical-blanking line or a 'skippy' blank scanline, it runs the keyboard function and returns to the user program. If it is a full scanline, it draws the scanline, then calls the keyboard function instead of returning to the user program.
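    As a very rough sketch of that timer idea (nothing here is implemented; every name is made up):

    	/* Hypothetical per-scanline hook table, run in priority order
    	   after hsync; slot 0 might be the keyboard poll. */
    	typedef void (*timer_fn)(void);

    	#define MAX_TIMERS 4
    	static timer_fn timers[MAX_TIMERS];

    	static void run_timers(void)
    	{
    	    unsigned char i;
    	    for (i = 0; i < MAX_TIMERS; i++)
    	        if (timers[i])
    	            timers[i]();
    	}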



    Cycle Scavenging

    September 11, 2018

    10:15 pm


    To output each pixel in the VGA routine takes two instructions:
    	inc zh                     //increment address
    	out io(RAMADDR_PORT), zh   //output new address
    
    This is repeated many times as an unrolled loop:
    	//define some macros
    
    	.macro PX
    		inc zh
    		out io(RAMADDR_PORT), zh
    	.endm
    
    	.macro PX4
    		PX
    		PX
    		PX
    		PX
    	.endm
    
    	.macro PX16
    		PX4
    		PX4
    		PX4
    		PX4
    	.endm
    
    
    	//a bit later, when the time is right
    
    	PX16
    	PX16
    	PX16
    	...
    	PX16  //256 pixels
    
    

    The AVR has the feature that writing to the INPUT register (PINx) of a port XORs the written value into the output register, flipping those bits. If I already have the right value (the number 1) loaded, I can output 2 pixels using only 3 instructions:

    
    .macro PX2noppy_odd_even
    	nop                       //do nothing!
    	out io(RAMADDR_PIN), zl   //flip the last bit: the 0 on the address becomes a 1
    	subi zh, -2               //count forward 2: skip over the odd address
    	out io(RAMADDR_PORT) ,zh  //output the next even address
    .endm
    
    

    I replaced 16 pixels of the display with the 'noppy' version as a test. Everything works the same. This creates the same display, but every 4th instruction is a nop. What is the advantage? These nops can be replaced with any other single-cycle instruction. So, one quarter of the roughly 512 instructions used to push the 256-pixel display (128 instructions) is potentially reclaimed. With caveats: I am limited to single-cycle instructions (any 2-cycle one will break the timing), and there can be no jumps. That limits me to basic arithmetic and bit operations: essentially branchless combinational logic. Not sure what I can put there yet.

    On the original Hackaday.io project page for this, I had implemented a higher-than-otherwise-possible horizontal resolution by making use of one of the AVR's PWM timers. PORTB is used for addressing memory. PORTB also has PWM pins, so I configured a PWM pin to toggle every cycle. With the code timed correctly, even though I could only increment and output once every 2 clock cycles, the PWM broke each 2-cycle-long address into 2 addresses. That's similar to what I tried above by flipping one bit without incrementing. On the page I just linked to, I had realized even back then that if I used the PWM trick on a low-resolution video mode, I would only need to increment and write once every 4 clock cycles, leaving 2-cycle gaps available. The idea back then was to implement an interpreter within those cycles. Now that idea seems crazy, but I think if I can just reclaim enough CPU, I should be able to process the PS2 keyboard within the nops recaptured from video. There are two problems with this: 1) The way the PWM toggles bits complicates horizontal scrolling, since it's not as simple as 'add one'; the bit being toggled is actually bit number 3, which is very annoying. To make it work before, I needed to write to memory in a strange swizzled order. And 2) if I were to implement hi-res video, the ps2 handling code will still have to go somewhere.

    To get on with this project, I think I am just going to have the PS2 keyboard blank out a scanline for handling incoming bits. I can clean it up later.



    VGA Bug

    September 11, 2018


    I messed up writing the VGA routines the other day. I was wondering why the screen was vertically offset a little bit, despite the vertical blanking being in the right spot. Turns out, it was not. I messed up my row counter: I never reset it! It would just overflow, so I was counting out 512 scanlines instead of 525. To save CPU time, I attempted to use two byte-sized variables: one is the row to draw, and the other tracks odd vs even. This came out to fewer clocks wasted loading, incrementing, and testing 16-bit values. If I didn't do this optimization, I would have to notch out pixels to make time.

    The real fix would be to use 16-bit math, but the longer loads, stores, and compares were blowing the length of the interrupt routine. I need the active-video scanline case to be fast.

    The trick was to keep the existing active-video parts all the same. The interrupt routine is already abbreviated during the vertical blanking interval; it runs just enough to get the hsync/vsync timing right and then exits to the user application. So, in the vertical blanking interval, the row counter is hacked up to make the frame take 525 lines. The video driver still alternates odd/even for rows 0 to 254 (scanlines 0 to 509). For the next lines, the counters are locked to 'even' and 255. A separate counter tracks how many scanlines have passed beyond the counter limit, and when it equals 15, it resets all counters to zero (510 from 2x255, then +15 = 525 lines). This works fine! This additional counter is stored in the leftover bits of the odd/even flag. (Since it was a boolean, 7 other bits were wasted!)
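    A sketch of that packing in C (the real code is assembly, and these names are made up):

    	#include <stdint.h>

    	/* Hypothetical layout: bit 0 holds the odd/even flag, the upper
    	   bits count the extra scanlines past row 255 (0..15). */
    	static uint8_t vstate;                 /* [extra:7 | odd:1] */

    	#define IS_ODD()       (vstate & 0x01)
    	#define EXTRA_LINES()  (vstate >> 1)
    	#define BUMP_EXTRA()   (vstate += 2)   /* +1 in the upper bits */
    	#define RESET_STATE()  (vstate = 0)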


    Reducing display width?

    Originally I rewrote the VGA routine to get better control over timing. Part of the goal was to optimize it so I can get the full 256-pixel width of the display. However, now there's extra stuff I haven't crammed in yet, like ps2 keyboard polling. If there is a change on the PS2CLK input, I suppose a scanline can be skipped, or notched out, without too much ill effect. But another thing I wanted to do with this is support good horizontal scrolling. The VGA driver already supports scrolling in both directions. However, vertical scrolling has something horizontal scrolling doesn't: a hidden area. There are 256 pages in a bank of memory, and vertical scrolling just changes which page is the first scanline; only 240 lines are visible. This gives 16 lines that can be written to in software BEFORE scrolling, and the user will not see the drawing. To do the same thing horizontally, there needs to be a strip of hidden pixels. If the width of the display is shrunk from 256 to 248 pixels, this gives an 8-pixel-wide hidden area, and also returns 16 additional clock cycles to the video interrupt. If the text font used is 8 pixels wide, this gives a strange 31-character-wide display. If the font is switched to a 6-pixel-wide font, the display will be 41 characters wide.


    Pin Change UnInterrupt

    September 14, 2018

    When I first wrote the cat1 test program, I interfaced to the keyboard with an interrupt. The AVR has a pin change interrupt. When a key is pressed, the keyboard has a valid bit ready on the falling edge of its clock, so the pin change interrupt made an efficient way to read the keyboard. There was a lot of fussing around to get it to work with the video interrupt. Ultimately, I ended up dropping the interrupt idea, and instead polled the keyboard port in software on every scanline.

    My Hackaday.io post on my previous troubles with the Keyboard interrupt

    Today, I was looking for a better way. I googled 'AVR pin change flag without interrupts,' and came across a few claims like this one saying the AVR will still set the pin change interrupt FLAG even with that particular interrupt disabled. I tried it in Atmel Studio's simulator, and it seems to be correct.
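    A sketch of the trick in C, assuming the PS/2 clock sits on a pin in the PCINT7:0 group (the exact pin is my guess; the registers are the standard ATmega644 ones):

    	#include <avr/io.h>

    	/* Arm pin-change detection but leave the interrupt itself disabled. */
    	void ps2_flag_init(void)
    	{
    	    PCMSK0 |= (1 << PCINT7);   /* watch the PS/2 clock pin (assumed) */
    	    PCICR  &= ~(1 << PCIE0);   /* interrupt stays off; flag still sets */
    	}

    	/* Poll once per scanline: 1 if the pin changed since the last check. */
    	uint8_t ps2_changed(void)
    	{
    	    if (PCIFR & (1 << PCIF0)) {
    	        PCIFR = (1 << PCIF0);  /* writing a 1 clears the flag */
    	        return 1;
    	    }
    	    return 0;
    	}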

    I still need to check the flag (PCIFR) every scanline, but this is much less work than I had before. I had added code to read the port and compare it to the last read (which means storing the last state in a variable), and the clock cycles were adding up; it was threatening some pixels in the display, and I would have needed to blank out a few pixels to make space. With the flag, I don't think I have to. (I will probably end up blanking a few pixels anyway: both to create an invisible draw space for horizontal smooth scrolling, and to give pixels to the sound output engine.)

    The plan now is to check the flag at the beginning of the scanline... if it needs to be serviced, the scanline will not be drawn. If I get basic keyboard working along with this video driver, I will move on to the next thing: probably the SD card.


    Keyboard Driver 'Done'

    September 16, 2018

    I think I have the assembly code ready for polling the keyboard during the video interrupt. I need to solder up a new PS/2 connector and try it out. Hopefully tomorrow. (I stole the PS/2 breakout header I previously used in this project for something else last retrochallenge...) It's 'code complete', just untested. :-)


    Keyboard and VGA working together AND a mystery!

    September 19, 2018

    I had VGA and the keyboard working together a few days ago. I took the 'cheat' route and just had a pin change event blank out an entire scanline. At a 30 kHz horizontal scan rate, if the keyboard clock is at most 15 kHz with a 50% duty cycle, the routine should be able to sample the bits just fine by looking once per scanline. And it worked perfectly.


    I tried to handle the keypress without blanking out the screen. I did this in the cat1 test program when it was all in C:

    It did not work in this case. Then I remembered that the sampling has to happen at the same time every scanline. If sometimes I sample at the beginning (for a blank line) and sometimes at the end (for an active line), the important bit transition may be missed. Basically, if the ps2 clock has a 50% duty cycle and I don't sample evenly, something will be missed. So instead I changed it to sample at the beginning of the scanline, at hsync, at exactly the same clock every scanline. This also didn't work! So I went back to the simple version that blanks out a scanline (thanks, git!). Then I ran my sample-and-save code immediately after my load-and-act code, and it still didn't work. Why the hell not? Well, I have two versions of the sampling code here, both almost identical, and in the AVR simulator they take the exact same time. One works, one does not! I even padded the working one with a NOP to make the io timings the same!


    WORKING:
    in zl, io(PCIFR)		//read pin change flags
    andi zl, 1<<PCIF0		//isolate the flag we are looking for
    nop				//skip a step to line up with not-working version
    
    out io(PCIFR), zl 		//write the flag back.  If it was 0, this does nothing.  If it was 1, writing a 1 here clears the flag
    in zh, io(KEY_PIN)		//read the ps2 port
    andi zh, KEY_DATA | KEY_CLK	//isolate the clock and data bits  (bits 6 and 7)
    or zh, zl			//add in the changed flag (bit 0)
    out io(GPIOR0), zh		//store in GPIOR0, which is later read to get the captured ps/2 port snapshot, and the 'changed' flag
    
    NOT WORKING:
    in zl, io(PCIFR)		//read the pin change flags
    sbi io(PCIFR),PCIF0		//read the flags, set bit 0, and write back
    				//technically this might clear other flags, but no other flags in this register are being used, and they should be zero
    				//also, writing a 1 to an already-clear flag shouldn't do anything.  It doesn't in the Atmel simulator!
    andi zl, 1<<PCIF0		//isolate the pin change flag
    in zh, io(KEY_PIN)		//read ps2 port
    andi zh, KEY_DATA | KEY_CLK	//isolate the clock and data bits
    or zh, zl			//add in the changed flag
    out io(GPIOR0), zh		//store in GPIOR0
    

    Now, if you look at the not-working version, there is a slight race condition. It is possible that after reading the pin change flag, the change occurs right before the next instruction, SBI, which then clears a pin change that I never captured. But this race condition is only vulnerable for 2 instruction cycles (0.1 us out of every scanline), and if every keypress is magically hitting that window, despite me plugging the keyboard in at a random time, that is very, very unlucky. Or the pin change circuit of the AVR has a large delay between a pin change and the flag appearing. Or writing a 1 to a non-set pin change flag causes some errant behavior, maybe suppressing the next up-and-coming pin change? The simulator doesn't show any odd behavior here.

    I plan to apply this fix to the complicated version, and then move on to the SD card.