Release Notes

2.2.0 -> 2.3.0 2025-May-05

Maintenance release.

Changed the development environment to VS Code as IDE
Changed to Poetry as the dependency management tool
The user is now informed if a newer version is available on pypi.org

2.1.0 -> 2.2.0 2025-March-07

New
- CIKXXFilter was introduced for RawDataBag and JoinedDataBag
- ciks_filter parameter was added to the load methods of RawDataBag and JoinedDataBag
- The notebook 09_00_segments_basics gives an idea of how you can work with the information in the segment column.
- The concat processes ConcatByChangedTimestampProcess and ConcatByNewSubfoldersProcess now have a switch to choose whether in_memory or file_based concatenation should be used
- ConcatByChangedTimestampProcess and ConcatByNewSubfoldersProcess now also support the concatenation of StandardizedBag
- StandardizeProcess now also works with multiple subfolders where each contains BS, CF, and IS folders
- A new example of a memory optimized pipeline was introduced: secfsdstools.x_examples.automation.memory_optimized_automation.define_extra_processes. Have a look at the description of this pipeline in 08_02_automation_a_memory_optimized_example_2.2.0
Changes
- The is_xxx_bag_path methods in the module secfsdstools.d_container.databagmodel have been moved into the RawDataBag and JoinedDataBag classes, respectively, in the same module. The StandardizedBag now also has an is_xxx_bag_path method.
Other
- GitHub sponsoring account was activated: https://github.com/sponsors/HansjoergW
- GitHub Discussions was activated: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions

2.0.0 -> 2.1.0 2025-February-18

The main goal of this release was to improve the memory footprint when working with the framework. This mainly includes support for Predicate Pushdown in the load methods, as well as the ability to concatenate bags directly on the file system, which significantly improves the memory footprint during concatenation steps when using “automation”.

Check out the notebook: bulk_data_processing_memory_efficiency

New
- Predicate Pushdown in load methods of RawDataBag and JoinedDataBag
  Directly apply filters for adshs, statements, forms, and tags during loading of the data
- concat_filebased concatenates RawDataBag and JoinedDataBag folders without loading them into memory
- ConcatByChangedTimestampProcess and ConcatByNewSubfoldersProcess use concat_filebased
- save for RawDataBag, JoinedDataBag, and StandardizedBag now creates the target folder if it does not exist.
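
The idea behind Predicate Pushdown can be illustrated with a minimal, library-independent sketch. The function load_rows and the sample data are hypothetical and are not part of the secfsdstools API. They only show the difference between filtering while reading and filtering after everything has been loaded. Real Parquet readers go further and can skip whole blocks of a file, which this simplified version does not attempt.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical sample data, loosely modelled on a "num"-style table.
SAMPLE_CSV = """adsh,tag,value
0001-23-000001,Assets,100
0001-23-000001,Liabilities,60
0001-23-000002,Assets,250
"""


def load_rows(source: str, tags: Optional[Iterable[str]] = None) -> Iterator[Dict[str, str]]:
    """Yield rows from CSV text, applying the tag filter while reading.

    Rows that do not match are never collected, so the caller only
    holds the data it actually asked for.
    """
    wanted = set(tags) if tags is not None else None
    for row in csv.DictReader(io.StringIO(source)):
        if wanted is None or row["tag"] in wanted:
            yield row


# Filter applied during loading: only matching rows are materialized.
assets_only = list(load_rows(SAMPLE_CSV, tags=["Assets"]))

# Without a filter, every row is loaded and would have to be filtered later.
all_rows = list(load_rows(SAMPLE_CSV))
```

In this sketch, assets_only never contains the Liabilities row, whereas the unfiltered call holds all rows in memory first.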

1.8.2 -> 2.0.0 2025-February-11

Introducing the new version of the datasets that includes the “segments” column in the num tables. The main purpose of this version is to ensure that the new “segments” column does not interfere with existing logic.

The following changed:

Checks on startup whether only data from the new datasets is present. If not, the data has to be reloaded
New NoSegmentInfo filter for raw and joined bags: removes datapoints with non-empty segment info
StandardPresenter has a new show_segments flag. If True, datapoints with segments information are displayed as well
Notebook 03_explore_with_interactive_notebook has new option show_segments for displaying the details of a report
Support for Daily-Datasets has been removed

1.8.1 -> 1.8.2 2025-January-20

Ensures data is read only from the archived version of the datasets without the segments column in num.

1.8.0 -> 1.8.1 2025-January-12

Fix problem with circular import when using the new FilterProcess module in secfsdstools.g_pipeline

1.7.0 -> 1.8.0 2025-January-10

Fix in OfficialTagsOnlyJoinedFilter: the filter logic was inverted (it selected the unofficial tags instead of the official ones)
Major changes
- The check for updates is now always executed, regardless of which feature of the framework is being used. Previously, this only happened if a collector was used.
Minor changes
- The interface of the classmethod Updater.get_instance was changed and now takes a Configuration instance instead of the individual attributes of the object.
- The concat methods of the JoinedDataBag and RawDataBag classes have a new parameter “drop_duplicates_sub_df”, which drops duplicated entries from the sub_df dataframe. The default is False. This is used when concatenating data from the same reports. E.g., if you have a bag with all the balance sheet data and another bag with all the income statement data, but from the same reports, you should set that parameter to True; otherwise you would have duplicated entries in the sub_df.
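
Why duplicates appear in sub_df can be seen in a small, library-independent sketch. The function concat_sub and the sample rows are hypothetical and only mimic the effect of the drop_duplicates_sub_df flag. They are not the implementation used by RawDataBag or JoinedDataBag.

```python
from typing import Dict, List

Row = Dict[str, object]


def concat_sub(parts: List[List[Row]], drop_duplicates: bool = False) -> List[Row]:
    """Concatenate lists of sub rows, optionally dropping identical rows."""
    combined = [row for part in parts for row in part]
    if not drop_duplicates:
        return combined

    seen = set()
    unique = []
    for row in combined:
        # Two rows count as duplicates if all their fields are equal.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sub entries: both bags were built from the same two reports,
# so their sub information is identical.
balance_sheet_sub = [{"adsh": "r1", "cik": 1}, {"adsh": "r2", "cik": 2}]
income_statement_sub = [{"adsh": "r1", "cik": 1}, {"adsh": "r2", "cik": 2}]

with_duplicates = concat_sub([balance_sheet_sub, income_statement_sub])
without_duplicates = concat_sub(
    [balance_sheet_sub, income_statement_sub], drop_duplicates=True
)
```

Here, with_duplicates contains every report twice, while without_duplicates contains each report once.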
New
- Introducing the automation pipeline framework in package secfsdstools.c_automation. This framework can be used standalone, or it can be used to implement additional steps that can be added to the update process.
  - Check out the documentation for the package secfsdstools.c_automation
  - Check out the example implementations of pipeline steps that can be used directly in your own pipelines: secfsdstools.g_pipeline
  - Check out the example implementation of how you can add additional processing steps to the default update process: see package secfsdstools.x_examples.automation and notebook 08_automation_basics.
- Two hook functions that can be implemented and configured to run after the default update process.
  - One hook function to provide additional processing steps that are implemented with the automation pipeline framework described above
  - One hook function that is called at the end of the update process and where you can freely implement any logic you want
  - Both hook functions are configured in the configuration file. Have a look at the notebook 08_00_automation_basics.

1.6.2 -> 1.7.0 2024-December-22

Fix for new path to zip files on SEC.gov
- The SEC changed the location of the zip files; this version fixes the path to them

1.6.1 -> 1.6.2 2024-September-15

Major changes
- Compatibility for Python 3.7 is no longer checked
- Compatibility for Python 3.11 was added
Minor changes
- secfsdstools.__version__ now returns the version of the library
- IncomeStatementStandardizer
  - Calculation for OutstandingShares and EarningsPerShare was simplified and improved
  - Validation rule for EarningsPerShare was added
  - Please have a look at the comments in 07_02_IS_standardizer
- Ability to customize the standardizer was improved
  - The columns that are merged from sub_df into the final results can be extended
  - Additional tags that should appear in the final result can be defined
  - All constructor parameters of the Standardizer base class can be overwritten via the constructor of the three standardizer classes
  - New notebook that shows the different possibilities for customization: 07_04_customize_standardizer

1.6 -> 1.6.1 2024-August-20

Minor improvements
- filed column added to result of present method of standardizer
- StandardizedBag now has a concat() method to concat multiple instances into one
- Standardizer checks that the data contains just one currency
- IncomeStatementStandardizer now also returns OutstandingShares and EarningsPerShare tags
- 03_explore_with_interactive_notebook.ipynb includes use of the CashFlowStandardizer
- improvements in the README.md -> thanks to Hamid Ebadi
Documentation
- Medium article Understanding the SEC Financial Statement Data Sets

1.6.0 2024-July-12

New
- Introducing Cash Flow Standardizer
  The Cash Flow Standardizer makes the cash flow statements easily comparable.
  07_03_CF_standardizer
Improvements
- Small improvements in the Standardizer framework and rules

1.5.0 2024-May-18

New
- Introducing Income Statement Standardizer
  The Income Statement Standardizer makes the income statements easily comparable.
  07_02_IS_standardizer
Improvements
- Small improvements in the Standardizer framework and rules

1.4.2 2024-Mar-29

Fix
- The StandardStatementPresenter didn’t consider qtrs when displaying the data. This was a problem for the Income Statement and the Cash Flow.
Improvements
- Several improvements in the Standardizer in preparation for implementing the Income Statement and Cash Flow Standardizers.

1.4.0 2024-Feb-02

New
- Introducing the Standardizer Framework and the Balance Sheet Standardizer as a first implementation.
  The Balance Sheet Standardizer makes the balance sheets easily comparable.
  Check out the following notebooks:
  07_00_standardizer_basics
  07_01_BS_standardizer
Improvements
- Efficiency improvements for MultiReportCollector: Every zip file is opened just once if there are multiple reports to load from the same zip file.
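
The "open every zip file just once" idea can be sketched independently of the library: group the requested reports by their zip file first, then read all members of a zip in a single pass. All names below (ARCHIVES, build_zip, read_reports) and the in-memory archives are hypothetical and only illustrate the grouping. They do not reflect how MultiReportCollector is implemented.

```python
import io
import zipfile
from collections import defaultdict
from typing import Dict, List, Tuple


def build_zip(files: Dict[str, str]) -> bytes:
    """Create a small in-memory zip archive for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buffer.getvalue()


# Hypothetical archives standing in for the quarterly zip files.
ARCHIVES = {
    "2023q1.zip": build_zip({"r1.txt": "report 1", "r2.txt": "report 2"}),
    "2023q2.zip": build_zip({"r3.txt": "report 3"}),
}


def read_reports(requests: List[Tuple[str, str]]) -> Tuple[Dict[str, str], int]:
    """Read (zip_name, member) requests, opening each zip only once.

    Returns the member contents and the number of zip files opened.
    """
    members_by_zip = defaultdict(list)
    for zip_name, member in requests:
        members_by_zip[zip_name].append(member)

    contents: Dict[str, str] = {}
    opened = 0
    for zip_name, members in members_by_zip.items():
        with zipfile.ZipFile(io.BytesIO(ARCHIVES[zip_name])) as zf:
            opened += 1
            for member in members:
                contents[member] = zf.read(member).decode("utf-8")
    return contents, opened


contents, opened = read_reports(
    [("2023q1.zip", "r1.txt"), ("2023q2.zip", "r3.txt"), ("2023q1.zip", "r2.txt")]
)
```

Three reports are requested, but only two archives are opened because two of the reports live in the same zip file.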

1.3.0 2023-Dec-28

New
- Notebook 06_bulk_data_processing_deep_dive
  This first version shows how datasets can be created with data from all available zip files. It shows a faster parallel approach, which uses more memory and CPU resources, and a slower serial approach, which uses significantly fewer resources.
- Package u_usecases introduced.
  This package is a place to provide concrete examples showing what you can do with the secfsdstools library. As a first use case, the logic shown and explained in the 06_bulk_data_processing_deep_dive notebook is provided within the module bulk_loading.

1.2.0 2023-Dec-02

API Changes
- MainCoregFilter was renamed to MainCoregRawFilter
- OfficialTagsOnlyFilter was renamed to OfficialTagsOnlyRawFilter
New
- secfsdstools.e_filter.rawfiltering.USDOnlyRawFilter is new and removes non-USD datapoints
- All filters have been implemented for the JoinedDataBag as well: secfsdstools.e_filter.joinedfiltering
- Notebook 05_filter_deep_dive

1.1.0 2023-Oct-28

API Changes
- Zipcollector now has a factory method that can load multiple zip files as one
- Zipcollector now has a factory method that can load all zip files at once
- Zipcollector factory methods have a new filter parameter “post_load_filter”
New
- Filter for official tags only -> company specific tags are removed
- RawDataBag and JoinedDataBag now have a copy_bag method
- Notebook 04_collector_deep_dive

1.0.1 2023-Oct-16

README.md adapted
Added information about using the library on Windows because the multiprocessing package is used
https://docs.python.org/3.10/library/multiprocessing.html#the-process-class
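
The Windows note presumably refers to the general rule from the Python documentation: code that starts worker processes must be protected by an `if __name__ == "__main__":` guard, because Windows starts child processes by importing the main module again. The following is a generic sketch with hypothetical functions (square, main), not code taken from the library.

```python
import multiprocessing
from typing import List


def square(x: int) -> int:
    """Work function; defined at module level so child processes can import it."""
    return x * x


def main() -> List[int]:
    # The pool is created inside a function that is only called from the
    # guarded block below, never as a side effect of importing this module.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Without this guard, each child process on Windows would re-run the
    # module's top-level code and try to start its own pool.
    print(main())
```

The same pattern applies to any script that calls code which uses multiprocessing internally: put the calls inside the guarded block.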

1.0.0 2023-Sep-28

ApiChanges:

The API has changed completely and should be more structured now.
Please check out the README.md and the 01_quickstart notebook for details

0.5.0 2023-Jun-02

use parquet as storing format instead of zipfiles with csv files -> 5-10x faster access to data
auto discover of new zip files on sec.gov
launch first time download of zip files without calling the update method

ApiChanges:

package secfsdstools.d_index was renamed to secfsdstools.c_index

0.4.0 2023-Mar-25

new MultiReportReader - reads reports from different zipfiles at once
new CompanyCollector - reads reports for one company from different zipfiles at once
new merge_pre_and_num() method which only merges the pre and num data but does not pivot it
new notebook that shows how the data can be analyzed with an interactive jupyter notebook

BugFixes:

coreg was not considered correctly when merging the data

0.3.0 2023-Feb-04

integration of https://rapidapi.com/hansjoerg.wingeier/api/daily-sec-financial-statement-dataset. Daily updates instead of quarterly updates.

0.2.1 2023-Jan-21

class ZipReportReader: helps to read data from a whole zip file; has the same interface as report reader
class IndexSearch: helps with searching the index_report table
added a getting started notebook: https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb
ensured that it also runs with Python 3.7
improvements in the API documentation

0.2.0 2023-Jan-14

first simple API documentation on GitHub Pages https://hansjoergw.github.io/sec-fincancial-statement-data-set/secfsdstools/
renaming of internal package structure

0.1.3 2023-Jan-08

dependencies added into pyproject.toml

0.1.1 2023-Jan-07

first version