.. _using_baseband: .. include:: ../tutorials/glossary_substitutions.rst ************** Using Baseband ************** For most file formats, one can simply import baseband and use `baseband.open` to access the file. This gives one a filehandle from which one can read decoded samples:: >>> import baseband >>> from baseband.data import SAMPLE_DADA >>> fh = baseband.open(SAMPLE_DADA) >>> fh.read(3) array([[ -38.-38.j, -38.-38.j], [ -38.-38.j, -40. +0.j], [-105.+60.j, 85.-15.j]], dtype=complex64) >>> fh.close() For other file formats, a bit more information is needed. Below, we cover the basics of :ref:`inspecting files `, :ref:`reading ` from and :ref:`writing ` to files, :ref:`converting ` from one format to another, and :ref:`diagnosing problems `. We assume that Baseband as well as numpy_ and the `astropy.units `_ module have been imported:: >>> import baseband >>> import numpy as np >>> import astropy.units as u .. _using_baseband_inspecting: Inspecting Files ================ Baseband allows you to quickly determine basic properties of a file, including what format it is, using the `baseband.file_info` function. For instance, it shows that the sample VDIF file that comes with Baseband is very short (sample files can all be found in the `baseband.data` module):: >>> import baseband.data >>> baseband.file_info(baseband.data.SAMPLE_VDIF) VDIFStream information: start_time = 2014-06-16T05:56:07.000000000 stop_time = 2014-06-16T05:56:07.001250000 sample_rate = 32.0 MHz shape = (40000, 8) format = vdif bps = 2 complex_data = False verify = fix readable = True checks: decodable: True continuous: no obvious gaps VDIFFile information: edv = 3 number_of_frames = 16 thread_ids = [0, 1, 2, 3, 4, 5, 6, 7] number_of_framesets = 2 frame_rate = 1600.0 Hz samples_per_frame = 20000 sample_shape = (8, 1) The same function will also tell you when more information is needed. For instance, for Mark 5B files one needs the number of channels used, as well as (roughly) when the data were taken:: >>> baseband.file_info(baseband.data.SAMPLE_MARK5B) Mark5BFile information: format = mark5b number_of_frames = 4 frame_rate = 6400.0 Hz bps = 2 complex_data = False readable = False missing: nchan: needed to determine sample shape, frame rate, decode data. kday, ref_time: needed to infer full times. >>> from astropy.time import Time >>> baseband.file_info(baseband.data.SAMPLE_MARK5B, nchan=8, ref_time=Time('2014-01-01')) Mark5BStream information: start_time = 2014-06-13T05:30:01.000000000 stop_time = 2014-06-13T05:30:01.000625000 sample_rate = 32.0 MHz shape = (20000, 8) format = mark5b bps = 2 complex_data = False verify = fix readable = True checks: decodable: True continuous: no obvious gaps Mark5BFile information: number_of_frames = 4 frame_rate = 6400.0 Hz samples_per_frame = 5000 sample_shape = (8,) The information is gleaned from ``info`` properties on the various file and stream readers (see below). .. note:: The one format for which ``file_info`` works a bit differently is `GSB `_, as this format requires separate time-stamp and raw data files. Only the timestamp file can be inspected usefully. .. _using_baseband_reading: Reading Files ============= Opening Files ------------- As shown at the very start, files can be opened with the general `baseband.open` function. This will try to determine the file type using `~baseband.file_info`, load the corresponding baseband module, and then open the file using that module's master input/output function. Generally, if one knows the file type, one might as well work with the corresponding module directly. For instance, to explicitly use the DADA reader to open the sample DADA file included in Baseband, one can use the DADA module's `~baseband.dada.open` function:: >>> from baseband import dada >>> from baseband.data import SAMPLE_DADA >>> fh = dada.open(SAMPLE_DADA, 'rs') >>> fh.read(3) array([[ -38.-38.j, -38.-38.j], [ -38.-38.j, -40. +0.j], [-105.+60.j, 85.-15.j]], dtype=complex64) >>> fh.close() In general, file I/O and data manipulation use the same syntax across all file formats. When opening Mark 4 and Mark 5B files, however, some additional arguments may need to be passed (as was the case above for inspecting a Mark 5B file, and indeed this is a good way to find out what is needed). Notes on such features and quirks of individual formats can be found in the API entries of their ``open`` functions, and within the :ref:`Specific file format ` documentation. For the rest of this section, we will stick to VDIF files. Decoding Data and the Sample File Pointer ----------------------------------------- By giving the openers a ``'rs'`` flag, which is the default, we open files in "stream reader" mode, where a file is accessed as if it were a stream of samples. For VDIF, `~baseband.vdif.open` will then return an instance of `~baseband.vdif.base.VDIFStreamReader`, which wraps a raw data file with methods to decode the binary |data frames| and seek to and read data |samples|. To decode the first 12 samples into a `~numpy.ndarray`, we would use the `~baseband.vdif.base.VDIFStreamReader.read` method:: >>> from baseband import vdif >>> from baseband.data import SAMPLE_VDIF >>> fh = vdif.open(SAMPLE_VDIF, 'rs') >>> d = fh.read(12) >>> type(d) <... 'numpy.ndarray'> >>> d.shape (12, 8) >>> d[:, 0].astype(int) # First thread. array([-1, -1, 3, -1, 1, -1, 3, -1, 1, 3, -1, 1]) As discussed in detail in the :ref:`VDIF section `, VDIF files are sequences of data frames, each of which is comprised of a :term:`header` (which holds information like the time at which the data was taken) and a :term:`payload`, or block of data. Multiple concurrent time streams can be stored within a single frame; each of these is called a ":term:`channel`". Moreover, groups of channels can be stored over multiple frames, each of which is called a ":term:`thread`". Our sample file is an "8-thread, single-channel file" (8 concurrent time streams with 1 stream per frame), and in the example above, ``fh.read`` decoded the first 12 samples from all 8 threads, mapping thread number to the second axis of the decoded data array. Reading files with multiple threads and channels will produce 3-dimensional arrays. ``fh`` includes ``shape``, ``size`` and ``ndim``, which give the shape, total number of elements and dimensionality of the file's entire dataset if it was decoded into an array. The number of |complete samples| - the set of samples from all available threads and channels for one point in time - in the file is given by the first element in ``shape``:: >>> fh.shape # Shape of all data from the file in decoded array form. (40000, 8) >>> fh.shape[0] # Number of complete samples. 40000 >>> fh.size 320000 >>> fh.ndim 2 The shape of a single complete sample, including names indicating the meaning of shape dimensions, is retrievable using:: >>> fh.sample_shape SampleShape(nthread=8) By default, dimensions of length unity are |squeezed|, or removed from the sample shape. To retain them, we can pass ``squeeze=False`` to `~baseband.vdif.open`:: >>> fhns = vdif.open(SAMPLE_VDIF, 'rs', squeeze=False) >>> fhns.sample_shape # Sample shape now keeps channel dimension. SampleShape(nthread=8, nchan=1) >>> fhns.ndim # fh.shape and fh.ndim also change with squeezing. 3 >>> d2 = fhns.read(12) >>> d2.shape # Decoded data has channel dimension. (12, 8, 1) >>> fhns.close() Basic information about the file is obtained by either by ``fh.info`` or simply ``fh`` itself:: >>> fh.info VDIFStream information: start_time = 2014-06-16T05:56:07.000000000 stop_time = 2014-06-16T05:56:07.001250000 sample_rate = 32.0 MHz shape = (40000, 8) format = vdif bps = 2 complex_data = False verify = fix readable = True checks: decodable: True continuous: no obvious gaps VDIFFile information: edv = 3 number_of_frames = 16 thread_ids = [0, 1, 2, 3, 4, 5, 6, 7] number_of_framesets = 2 frame_rate = 1600.0 Hz samples_per_frame = 20000 sample_shape = (8, 1) >>> fh Not coincidentally, the first is identical to what we :ref:`found above ` using `~baseband.file_info`. The filehandle itself also shows the ``offset``, the current location of the sample file pointer. Above, it is at ``12`` since we have read in 12 (complete) samples. If we called ``fh.read (12)`` again we would get the next 12 samples. If we instead called ``fh.read()``, it would read from the pointer's *current* position to the end of the file. If we wanted all the data in one array, we would move the file pointer back to the start of file, using ``fh.seek``, before reading:: >>> fh.seek(0) # Seek to sample 0. Seek returns its offset in counts. 0 >>> d_complete = fh.read() >>> d_complete.shape (40000, 8) We can also move the pointer with respect to the end of file by passing ``2`` as a second argument:: >>> fh.seek(-100, 2) # Second arg is 0 (start of file) by default. 39900 >>> d_end = fh.read(100) >>> np.array_equal(d_complete[-100:], d_end) True ``-100`` means 100 samples before the end of file, so ``d_end`` is equal to the last 100 entries of ``d_complete``. Baseband only keeps the most recently accessed data frame in memory, making it possible to analyze (normally large) files through selective decoding using ``seek`` and ``read``. .. note:: As with file pointers in general, ``fh.seek`` will not return an error if one seeks beyond the end of file. Attempting to read beyond the end of file, however, will result in an ``EOFError``. To determine where the pointer is located, we use ``fh.tell()``:: >>> fh.tell() 40000 >>> fh.close() Caution should be used when decoding large blocks of data using ``fh.read``. For typical files, the resulting arrays are far too large to hold in memory. Seeking and Telling in Time With the Sample Pointer --------------------------------------------------- We can use ``seek`` and ``tell`` with units of time rather than samples. To do this with ``tell``, we can pass an appropriate `astropy.units.Unit` object to its optional ``unit`` parameter:: >>> fh = vdif.open(SAMPLE_VDIF, 'rs') >>> fh.seek(40000) 40000 >>> fh.tell(unit=u.ms) Passing the string ``'time'`` reports the pointer's location in absolute time:: >>> fh.tell(unit='time')