Hi guys!
I've made a utility that should significantly reduce the time spent on initial ROM analysis, namely table markup. It can detect the table type: boost, ignition timing, tip-in and others. Of course, it must first be trained, so you need to prepare several definitions for it to learn from. For best results, these definitions should come from the same CPU architecture, the same car model and close model years.
You can get ScoobyTables here:
https://github.com/aalesv/ScoobyTables
You can get pre-trained model(s) here:
https://github.com/aalesv/ScoobyTables-pretrained
ScoobyTables
This software is designed for automatic Subaru Denso ROM table markup using machine learning. Currently, only the k-nearest neighbors algorithm is supported because it has distance metrics that can effectively eliminate false positives.
What ROMs are supported
For now, only Subaru Denso ROMs (and maybe other ROMs that have the same table format) are supported. This includes ROMs for the SH7055, SH7058, and SH72531 CPUs.
How to use, briefly
ScoobyTables works with defs in ScoobyRom XML or CSV format. You can get ScoobyRom here. This is a fork of ScoobyRom that has support for CSV export and some other improvements.
How it works and why exactly this way and not otherwise
ScoobyTables uses the k-nearest neighbors algorithm. This means that every table is represented as a point in n-dimensional space. Then the distance from each new, unknown table to every known table is calculated. Based on these distances, a decision is made about class membership.
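To make the idea concrete, here is a minimal pure-Python sketch of k-nearest-neighbors classification. It is not the ScoobyTables code: the function name `knn_predict`, the three-number feature vectors and the use of plain Euclidean distance are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(known, query, k=2):
    """Classify `query` by its k nearest known tables.

    `known` is a non-empty list of (feature_vector, label) pairs.
    Returns (label, distance to the closest neighbor with that label).
    Hypothetical helper, for illustration only.
    """
    # Distance from the unknown table to every known table, closest first.
    ranked = sorted((math.dist(vec, query), label) for vec, label in known)
    nearest = ranked[:k]
    # Majority vote among the k nearest; a tie goes to the closest neighbor.
    votes = Counter(label for _, label in nearest)
    best_count = max(votes.values())
    for dist, label in nearest:
        if votes[label] == best_count:
            return label, dist

# Made-up feature vectors: (relative position, X axis length, Y axis length).
known = [
    ((0.10, 16.0, 16.0), "Base_Timing"),
    ((0.12, 16.0, 16.0), "Base_Timing"),
    ((0.80, 8.0, 1.0), "Boost_Target"),
]
label, dist = knn_predict(known, (0.11, 16.0, 16.0))
```

The returned distance is what makes it possible to reject a prediction when even the closest known table is far away.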
The idea is that table placement does not change significantly from ROM to ROM. Of course, I'm talking about sufficiently similar ROMs: same car model, same CPU model, close years. The best illustration of this is the Subaru Forester Gen 4 with the SH72531 CPU. All the tables in those ROMs are located very similarly relative to each other. That's why this approach works.
The table contents themselves do not matter, only the data that describes the table:
- Relative placement of the table structure in the ROM
- Length of table axes
- Data type of X and Y axes - RPM, temperature, engine load, etc.
- Data type of table data itself
- Table multiplier and offset, which are needed to convert from integer to float
The relative placement of a table structure in the ROM is easy to calculate if its address is known, but only ScoobyRom stores addresses. The length of the table axes is defined in most defs except ScoobyRom. The data type of the X or Y axis can be approximated by the average of the axis values, which can be calculated. Fortunately, ScoobyRom stores the min and max values in the defs' comments. The data type of the table data itself is defined in every tool's defs. Table multiplier and offset are not stored directly in any definition.
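As a rough sketch of how these descriptors could be turned into a point for KNN, the snippet below packs them into a flat list of numbers. The `TableInfo` class, its field names, the numeric `data_type` code and the use of the min/max midpoint as the axis "average" are all assumptions for illustration, not the real ScoobyTables data model.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    """Descriptors of one table (hypothetical layout, for illustration)."""
    rel_pos: float     # relative placement in the ROM, however it is computed
    x_len: int         # number of X axis cells
    y_len: int         # number of Y axis cells (assumed 1 for a 2D table)
    x_min: float       # axis min/max, e.g. as stored in ScoobyRom comments
    x_max: float
    y_min: float
    y_max: float
    data_type: int     # numeric code of the table data type (assumed encoding)
    multiplier: float  # integer-to-float conversion factor
    offset: float      # integer-to-float conversion offset

def to_point(t: TableInfo) -> list:
    """Flatten the descriptors into one point in n-dimensional space."""
    return [
        t.rel_pos,
        float(t.x_len),
        float(t.y_len),
        (t.x_min + t.x_max) / 2,  # midpoint as a cheap stand-in for the average
        (t.y_min + t.y_max) / 2,
        float(t.data_type),
        t.multiplier,
        t.offset,
    ]

example = TableInfo(rel_pos=0.25, x_len=16, y_len=16,
                    x_min=800.0, x_max=6400.0, y_min=0.2, y_max=2.6,
                    data_type=8, multiplier=0.5, offset=-20.0)
point = to_point(example)
```

Note that the table contents never enter the vector, only the data that describes the table.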
So no existing def format contains all the information. ScoobyRom XML defs are more or less suitable, but some info needs to be extracted from the binary ROM file. Alternatively, I could modify ScoobyRom to export all the data I need in some format, for example CSV, which can be easily imported by pandas. Well, I did both.
That's why, if you want to use the XML format, you need both the .bin and .xml files. But you can use the modified ScoobyRom to export to CSV, and then you don't need a .bin file. By the way, I highly recommend using the modified ScoobyRom because it saves all annotated tables and all selected tables, whereas the original ScoobyRom saves only annotated tables. And to calculate relative table position, we need to have all the tables in the def.
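The following sketch shows one possible reading of "relative table position" and why missing tables hurt it. It assumes the position is a table's rank among all table addresses, scaled to 0..1. That definition and the function name `relative_positions` are guesses for illustration; ScoobyTables may compute it differently.

```python
def relative_positions(addresses):
    """Map each table address to its rank among all tables, scaled to 0..1.

    Assumed definition of "relative position", for illustration only.
    """
    ordered = sorted(addresses)
    last = max(len(ordered) - 1, 1)  # avoid division by zero for one table
    return {addr: i / last for i, addr in enumerate(ordered)}

# Def with all five tables selected.
full = relative_positions([0x1000, 0x1040, 0x1080, 0x10C0, 0x1100])
# Def where the first two tables were never saved.
partial = relative_positions([0x1080, 0x10C0, 0x1100])
```

With all tables present, the table at 0x1080 sits in the middle (0.5). With the first two tables missing, the same table lands at 0.0, so it would no longer line up with its counterpart in another ROM.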
And now a couple of words about results. KNN predicts very well, but it makes many wrong predictions on incomplete data, and hardly anyone has defined all the tables in a ROM. To filter out these wrong predictions, distance thresholds are used: one for 2D tables (because they are more similar to each other) and another for 3D tables. See the CLI help.
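A minimal sketch of such filtering is shown below. The default values 0.5 and 5 are taken from the CLI help; the tuple layout, the function name `filter_predictions` and the choice to keep a prediction whose distance equals the threshold are assumptions.

```python
THRESHOLD_2D = 0.5  # default of --knn-min-2d-reliable-metric
THRESHOLD_3D = 5.0  # default of --knn-min-3d-reliable-metric

def filter_predictions(predictions,
                       threshold_2d=THRESHOLD_2D,
                       threshold_3d=THRESHOLD_3D):
    """Drop predictions whose nearest-neighbor distance exceeds the threshold.

    `predictions` is a list of (name, is_3d, distance) tuples
    (hypothetical layout, for illustration only).
    """
    kept = []
    for name, is_3d, distance in predictions:
        # 2D and 3D tables get separate limits.
        limit = threshold_3d if is_3d else threshold_2d
        if distance <= limit:
            kept.append((name, is_3d, distance))
    return kept

predictions = [
    ("Base_Timing", True, 1.2),   # 3D, close enough
    ("Boost_Target", True, 7.5),  # 3D, too far
    ("Idle_Speed", False, 0.3),   # 2D, close enough
    ("Fuel_Cut", False, 0.9),     # 2D, too far
]
reliable = filter_predictions(predictions)
```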
How to train
First, you need to prepare the data. It is crucial that the same table has the same name in all ROMs. Letter case does not matter, and a table name may end with '_1', '_2' etc. or '_A', '_B' etc.; any such ending is stripped. For example, the names 'Base_Timing_1' and 'Base_Timing_A' are good, while 'BaseTimingA' and 'Base_Timing1' are not. This greatly affects the accuracy of the prediction.
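The naming rule can be sketched with a small regular expression. This is an illustration, not the actual ScoobyTables code: the helper name `normalize_table_name` and the exact pattern (an underscore followed by digits or by a single letter at the end of the name) are assumptions.

```python
import re

# Trailing "_1", "_2", ... or "_A", "_B", ... (matched after lower-casing).
_SUFFIX = re.compile(r"_(?:\d+|[a-z])$")

def normalize_table_name(name: str) -> str:
    """Lower-case the name and strip one trailing numeric or letter suffix."""
    return _SUFFIX.sub("", name.lower())

good = {normalize_table_name(n) for n in ("Base_Timing_1", "Base_Timing_A")}
bad = {normalize_table_name(n) for n in ("BaseTimingA", "Base_Timing1")}
```

The two good names collapse into one class, while the bad names keep their endings and become two extra classes that do not match anything else.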
I assume that you use the modified ScoobyRom 0.8.5 or later.
Command line parameters
One of the arguments --train or --predict is required. Run with --help to see help.
If you specify --predict, the file format is guessed from the extension. Only ScoobyRom XML and CSV formats are supported.
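Guessing the format from the extension could look roughly like this. The function name `guess_format` and the case-insensitive comparison are assumptions; only the two supported formats come from the text above.

```python
from pathlib import Path

SUPPORTED = {".xml": "xml", ".csv": "csv"}

def guess_format(filename: str) -> str:
    """Return 'xml' or 'csv' based on the file extension (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    try:
        return SUPPORTED[ext]
    except KeyError:
        raise ValueError(f"unsupported input format: {ext or filename!r}") from None

fmt_xml = guess_format("forester.XML")
fmt_csv = guess_format("defs/forester.csv")
```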
You can set a distance threshold above which a classification is ignored: --knn-min-2d-reliable-metric for 2D tables and --knn-min-3d-reliable-metric for 3D tables.
Full CLI help:
Code:
usage: ScoobyTables.py [-h] (--train <dirname> | --predict <filename>) [-i {xml,csv}] [--model-dump-file <filename>]
                       [-v] [--neighbors <number of neighbors>] [--test-accuracy]
                       [--test-sample-size <float number from 0.0 to 1.0>] [--random-state <int number>] [--dry-run]
                       [--knn-min-2d-reliable-metric <float number>] [--knn-min-3d-reliable-metric <float number>]
                       [--pre-xml-filename <filename>] [--dump-txt] [--pre-txt-filename <filename>] [--version]

options:
  -h, --help            show this help message and exit
  --train <dirname>     Train, test and dump model. Specify --test-accuracy to test accuracy. (default: None)
  --predict <filename>  Predict. Get data from <filename>. XML and CVS formats are supported and autodetected based on
                        the extension. (default: None)
  -i {xml,csv}, --input-format {xml,csv}
                        Input data format. (default: xml)
  --model-dump-file <filename>
                        Model dump file name. (default: scoobytables.dmp)
  -v, --verbose         Be verbose. (default: False)
  --neighbors <number of neighbors>
                        Number of neighbors for KNN model. (default: 2)
  --test-accuracy       Test model accuracy during model training. (default: False)
  --test-sample-size <float number from 0.0 to 1.0>
                        Test sample size. (default: 0.1)
  --random-state <int number>
                        Test sample random state pseudo-random number. If not set, test sample is purely random.
                        (default: None)
  --dry-run             Do not save anything to files. (default: False)
  --knn-min-2d-reliable-metric <float number>
                        Minimum reliable metric for 2D tables. (default: 0.5)
  --knn-min-3d-reliable-metric <float number>
                        Minimum reliable metric for 3D tables. (default: 5)
  --pre-xml-filename <filename>
                        Predicted XML definitions file name. (default: output.xml)
  --dump-txt            Write predicted data to text file. (default: False)
  --pre-txt-filename <filename>
                        Predicted dataframe text file name. (default: output.txt)
  --version             Print version number.
There are also two additional scripts:
beautify_xml.py makes the predicted ScoobyRom XML file more readable:
Code:
usage: beautify_xml.py [-h] [-i <filename>] [-o <filename>] [-F <symbol>] [--numeric-suffix] [--version]

Add suffixes "_A", "_B" etc or "_1", "_2" etc to same table names.

options:
  -h, --help            show this help message and exit
  -i <filename>, --input <filename>
                        Input filename (default: None)
  -o <filename>, --output <filename>
                        Output filename. stdout if not specified (default: None)
  -F <symbol>, --word-separator <symbol>
                        Word separator (default: _)
  --numeric-suffix      Suffix is numeric (default: False)
  --version             Print version number.
For better results please clean and rename tables first!
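What the suffixing step does can be sketched as follows. This is not the beautify_xml.py source: the function name `add_suffixes` and the decision to leave unique names untouched are assumptions, and the letter variant in this sketch only handles up to 26 tables with the same name.

```python
import string
from collections import Counter

def add_suffixes(names, separator="_", numeric=False):
    """Append "_A", "_B", ... (or "_1", "_2", ...) to repeated table names.

    Hypothetical helper, for illustration only.
    """
    totals = Counter(names)  # how many times each name occurs overall
    seen = Counter()         # how many times each name has been emitted so far
    result = []
    for name in names:
        if totals[name] == 1:
            result.append(name)  # unique names are left as they are (assumption)
            continue
        index = seen[name]
        seen[name] += 1
        # Letter suffixes run out after "Z" in this sketch.
        suffix = str(index + 1) if numeric else string.ascii_uppercase[index]
        result.append(f"{name}{separator}{suffix}")
    return result

names = ["Base_Timing", "Boost_Target", "Base_Timing"]
lettered = add_suffixes(names)
numbered = add_suffixes(names, numeric=True)
```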
def_to_sr.py is an IDA 6.8 (Python 2.7) script that converts RomRaider definitions to ScoobyRom definitions. Run it inside IDA.
Happy hacking!